
Added gc_stats basic functionality #803

Closed

wants to merge 11 commits into from

Conversation

etcimon
Contributor

@etcimon etcimon commented May 23, 2014

I've added some stats; they show some important metrics on how efficiently the GC frees data compared to how much data was in use at the time. I believe this is where most of the optimization opportunities are.

From my tests at https://github.com/etcimon/druntime.gctest I've observed that the GC frees on average 3% of used memory per collection cycle. I've suggested sampling as a solution to reduce the number of inefficient collections.

@etcimon
Contributor Author

etcimon commented May 23, 2014

I was using delete to free the memory; this is the real measure for the current version of druntime.gctest.


@@ -101,7 +101,8 @@ private
extern (C) void* gc_realloc( void* p, size_t sz, uint ba = 0 ) pure nothrow;
extern (C) size_t gc_extend( void* p, size_t mx, size_t sz ) pure nothrow;
extern (C) size_t gc_reserve( size_t sz ) nothrow;
extern (C) void gc_free( void* p ) pure nothrow;
extern (C) void gc_free( void* p ) pure nothrow;
extern (C) GCStats gc_stats( ) pure nothrow;
Member

No tabs please; you might want to use our editorconfig file.

Contributor Author

Yes, I changed that part in MonoDevelop, and it doesn't have a plugin for editorconfig files :/

@etcimon
Contributor Author

etcimon commented May 24, 2014

It seems like the % freed per collection cycle drops off a cliff when size > 4096; I even get 0.00015% per collection using 10KB, 20KB, and 30KB memory segments in my tests. However, I get >40% efficiency with small allocations. The numbers look really bad when working with streams.

@etcimon
Contributor Author

etcimon commented May 24, 2014

By increasing the page size to 64KB, I get much better numbers: 83.351%! That's what I'd call acceptable for working with streams.

@@ -123,6 +124,12 @@ private
extern (C) void gc_runFinalizers( in void[] segment );
}

struct GCStats
Member

Documentation needed here.

@Safety0ff
Contributor

core.memory is supposed to be a low-level interface implemented by a garbage collector.
I am unsure about adding a gc_stats method that returns a "set in stone" struct (which is why previously you had to expose it manually, e.g.: http://dpaste.dzfl.pl/4ce04246a148).
Unless the API is suitable for any GC implementation, I think we should tread carefully.

@MartinNowak
Member

True that; maybe we shouldn't add it to core.memory yet.

@etcimon
Contributor Author

etcimon commented May 25, 2014

How about a signature like the following?

long[] stats(long cmd);

I'm assuming anything stats returns would be a number.

It could be used like this:

auto stats = GC.stats(STATS_FREED | STATS_USED | STATS_COLLECTIONS);

where stats[0] is totFreed, stats[1] is totUsed, and stats[2] is collections.

This could prove to be more flexible in the long run.
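As a rough sketch (the flag names and values here are just placeholders, not part of druntime):

enum : long
{
    STATS_FREED       = 1 << 0,
    STATS_USED        = 1 << 1,
    STATS_COLLECTIONS = 1 << 2,
}

// One result slot per flag set in cmd, in a fixed order (ascending flag value).
long[] stats(long cmd);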

@dislogical

The problem with that solution is that OR-ing the flags loses the order they were written in, so if I were to call it like this:

auto stats = GC.stats( STATS_USED | STATS_COLLECTIONS | STATS_FREED );

I would still get them in the order totFreed, totUsed, and collections.

This seems unintuitive.

@etcimon
Contributor Author

etcimon commented May 25, 2014

Well then, maybe it's better off having it like:

long[] stats(long[] cmd)

I'd like it if we could have a long[] stats(long[] cmd, long[] args, string[] argsStr) for eventually sending in some arguments to get more specific statistics (like turning on logs for a certain file name, or logging the longest periods of time between collections, etc.).

@rainers
Member

rainers commented May 26, 2014

I'd prefer passing a pointer/reference to the GCStat struct as a parameter. The first field indicates which statistics are requested as a combination of flags, and the GC fills the fields it has valid data for and returns their respective flags.
At the same time it can enable the code inside the GC to collect this data.
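Something along these lines (just a sketch; the flag and field names are only illustrative):

enum GCStatsFlags : uint
{
    freed       = 1 << 0,
    used        = 1 << 1,
    collections = 1 << 2,
}

struct GCStats
{
    uint requested;      // in: which statistics the caller wants
    size_t freed;        // out: bytes freed so far
    size_t used;         // out: bytes currently in use
    size_t collections;  // out: number of collection cycles
}

// The GC fills the fields it has valid data for and returns their flags.
extern (C) uint gc_stats(ref GCStats stats) nothrow;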

stats.pageblocks++;
else if (bin < B_PAGE)
bsize += PAGESIZE;
}
Member

Please don't nuke the existing functionality; better to just add to it.

Member

The problem is that the existing stats are GC-internal, not targeted at a general API.
They still seem to be useful, so maybe you can keep them in a debug(STATS) block.

@Safety0ff
Contributor

I'd prefer passing a pointer/reference to the GCStat struct as a parameter. The first field indicates which statistics are requested as a combination of flags, and the GC fills the fields it has valid data for and returns their respective flags.

I was thinking of something even more conservative such as returning string[].

@rainers
Member

rainers commented May 26, 2014

I was thinking of something even more conservative such as returning string[].

That would need an allocation, and I think it should not interfere with actual GC operations.

@Safety0ff
Contributor

Hmm, OK, your solution of passing in a pointer to the struct is workable with respect to adding members to the stats struct (though over time the struct can become larger).
The only question is how to signal an unsupported statistic request?

@etcimon
Contributor Author

etcimon commented May 26, 2014

The only question is how to signal an unsupported statistic request?

There will have to be a GCStats.version field at the very start of the struct, to check whether druntime has the same version as the client application. Am I to understand that this struct is not going to be in core.memory?
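As a sketch of the version-field idea (note that version is a keyword in D, so the field would need another name):

// Sketch only: a version field at the start so client code can detect a
// mismatched druntime.
struct GCStats
{
    uint statsVersion = 1;  // bumped whenever fields are added or changed
    size_t freed;
    size_t used;
    size_t collections;
}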

@Safety0ff
Contributor

Am I to understand that this struct is not going to be in core.memory?

It would be in core.memory.

@etcimon
Contributor Author

etcimon commented May 27, 2014

OK, I spent some time revising the code to calculate used memory, and it reveals that, when using the GC with allocation sizes within the bucket ranges of 0-4096 bytes, collections are ~22%-47% efficient. When moving outside of those ranges, at 5000 bytes, they're 0.00161% efficient; at 64KB, they're 0.000257% efficient.

size_t freed;
size_t used;
size_t collections;
}
Member

Please get rid of those tabs.

@MartinNowak
Member

Let's focus on getting a working gcStats function in.
So we need to find a reasonable set of parameters that any GC implementation can cheaply calculate or estimate.
Let's look at the stats of some other GC implementations:
- Haskell GHC-Stats
- Java GarbageCollectorMXBean
- Ruby GC.Stats
- Python gc.get_stats

So maybe a good starting point would be the following:
- number of collections
- bytes used (approx.)
- total bytes allocated (approx., total bytes freed = total bytes allocated - bytes used)
- total time spent in stop-the-world marking
- total time spent in sweeping & finalizing

You can use TickDuration for a cheap clock.
If any GC can't support one of those stats, it might simply return 0 or size_t.max for that field.
I like the difference idea of @rainers.
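As a rough sketch of a struct covering that starting set (field names are only suggestions; a GC that can't provide a value would leave the field at 0 or size_t.max):

import core.time : TickDuration;

struct GCStats
{
    size_t numCollections;  // number of collections run so far
    size_t usedSize;        // bytes currently in use (approx.)
    size_t allocatedSize;   // total bytes allocated over the program's lifetime (approx.)
    TickDuration markTime;  // total time spent in stop-the-world marking
    TickDuration sweepTime; // total time spent sweeping & finalizing
}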

@etcimon
Contributor Author

etcimon commented May 27, 2014

It definitely looks like the Haskell GHC stats are the only ones that aren't completely useless.

@etcimon
Contributor Author

etcimon commented May 27, 2014

total bytes allocated (approx., total bytes freed = total bytes allocated - bytes used)

The sums increment either per collection cycle or per malloc call. The total bytes allocated would be updated per malloc call, and the bytes used would be updated per collection cycle, so the math doesn't add up exactly this way.

@etcimon
Contributor Author

etcimon commented May 28, 2014

Hmm, OK, your solution of passing in a pointer to the struct is workable with respect to adding members to the stats struct (though over time the struct can become larger).
The only question is how to signal an unsupported statistic request?

I'm not sure why it would be necessary to do this rather than return the struct on the stack.

@etcimon
Contributor Author

etcimon commented Jun 1, 2014

The current implementation immediately "crawls" into the heap, so it might see an inconsistent view of the memory which can cause collection of live objects.

That shouldn't be a problem if you give every newly suspended thread a zeroed GCBits mark and scan object. It would also allow concurrent scanning of the stacks and heap, and then you OR all the GCBits objects together to get a global view of the marked objects. There may be additional computation, but the overall speed would be improved by the use of a thread pool.

@etcimon
Contributor Author

etcimon commented Jun 1, 2014

This would actually yield much more performance for the GC, considering most of the data allocated on the heap by the threads should actually be thread-local, so most of the marking in each thread should be new.

@rainers
Member

rainers commented Jun 1, 2014

There is no guarantee or information about whether an allocated object is shared or thread-local. You'd need stricter type rules and write/cast barriers again to generate that information.

@etcimon
Contributor Author

etcimon commented Jun 1, 2014

There is no guarantee or information about whether an allocated object is shared or thread-local. You'd need stricter type rules and write/cast barriers again to generate that information.

If the marks are not cached from one thread to another, wouldn't it be irrelevant whether the object is shared or thread local?

@rainers
Member

rainers commented Jun 1, 2014

If the marks are not cached from one thread to another, wouldn't it be irrelevant whether the object is shared or thread local?

Could you elaborate how you guarantee that an object is not (indirectly) referenced by another thread's stack after scanning the references of the "current" stack? Also, data shared between threads can be mutated during scanning, so the collector might not see references to live objects.

There might be some optimizations possible, but I suspect going for full concurrent collection is the better way.

@etcimon
Contributor Author

etcimon commented Jun 1, 2014

Could you elaborate how you guarantee that an object is not (indirectly) referenced by another thread's stack after scanning the references of the "current" stack? Also, data shared between threads can be mutated during scanning, so the collector might not see references to live objects.

The objects that are shared will get scanned again, because the mark GCBits from scanning the other stacks aren't used, so there is no skipping in mark.testSet during scanning in a concurrent mark() for a later-suspended thread. So you end up scanning shared objects multiple times.

Sure, the other threads may change the data while the GC is scanning the object, but that's not a big deal for pointers: the way RAM is multiplexed internally works on 8-byte increments per channel, so the modification will certainly be done atomically, because that's just how RAM works. So you don't get corruption.

But anyway, if another thread actually wrote to this shared object, it's no big deal, because you will also scan it later when that other thread is suspended, and that's when you'll see the new reference. The worst case is that you get 2 different pointers in the same memory region, which causes a leak, not corruption.

@etcimon
Contributor Author

etcimon commented Jun 1, 2014

I'm not sure if I explained it correctly, but sequentially suspending the threads would be done by suspending them one by one and keeping them suspended until scanning is done, so they can be resumed all at once.

@rainers
Member

rainers commented Jun 1, 2014

I'm not sure if I explained it correctly, but sequentially suspending the threads would be done by suspending them one by one and keeping them suspended until scanning is done, so they can be resumed all at once.

Ah, I didn't realize that threads were supposed to be kept suspended until all threads are scanned.

Still, assume two threads 1 and 2 sharing an object A, that points to B, C and D. While thread 1 is suspended, the GC scans A, and then B and C. Before it gets to read the reference D, thread 2 (which is still running) exchanges references B and D, then zeroes its reference to A. The GC sees another reference to B instead of D. When the GC scans thread 2, there are no more references to A, thus D will be collected even though the first reference in A points to it.

@etcimon
Contributor Author

etcimon commented Jun 1, 2014

Well I didn't think of that scenario. Sounds like a solution for that would be too complex to look into for now. Would it be advantageous to scan everything concurrently using a clean mark GCBits in every GC mark() thread in your opinion?

@rainers
Member

rainers commented Jun 1, 2014

Would it be advantageous to scan everything concurrently using a clean mark GCBits in every GC mark() thread in your opinion?

I'm not sure whether you mean "concurrent collection" (the application continues to run during collection) or "parallel collection" (multiple threads mark the GC heap while the application is stopped). I assume the latter: yes, that'd be nice to have. To avoid (atomic) locks when modifying GCBits from multiple threads, it would be better to have copies. But scanning the same object again because one thread doesn't know another thread has already done that is also bad. A compromise might be to read the bits of all scanner threads, but only write to one.

@etcimon
Contributor Author

etcimon commented Jun 1, 2014

I think a manager thread with a message queue could do the writing and expose the object for reading. It is parallel, yes; with your background thread for freeing it gets interesting. Would you have a scenario as well that goes against resuming the thread right after it's marked? It seems like the GC lock takes care of preventing any new allocations, so that wouldn't be an issue.

@etcimon
Contributor Author

etcimon commented Jun 1, 2014

The idea of the manager thread is to send info (set it and forget it) without any locking and let the manager handle the writing alone. That way you trade off the total number of locks against the recency of information.

@etcimon
Contributor Author

etcimon commented Jun 1, 2014

I think I could write a parallel marking proof of concept, with an option through environment variables to specify whether the application is mostly TLS or shared data, which will use either a shared mark GCBits or a thread-local one. Another experiment would be to resume each thread right after marking it. I'll take care of it after stats and sampling are done.

@rainers
Member

rainers commented Jun 2, 2014

Would you have a scenario as well that goes against resuming the thread right after it's marked?

Interesting idea. I think it can work if there are no cross-references between threads regarding stack and TLS. If there are such references, they would have to be treated as GC objects and be scanned in the mark phase of the referencing thread.

@rainers
Member

rainers commented Jun 2, 2014

The idea of the manager thread is to send info (set it and forget it) without any locking and let the manager handle the writing alone. That way you trade off the total number of locks against the recency of information.

I guess the messaging overhead would be too big for setting GCBits; this happens too often. I just measured a LOCK CMPXCHG operation on my i7, and it seems to usually take 130+ cycles (but sometimes only 32). That's bad, but probably still better than sending a message to another thread. So I'd go for atomic operations on the bits.
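For illustration, an atomic test-and-set on a single mark bit could look roughly like this (just a sketch, assuming the mark bitmap is stored as shared size_t words; names are made up):

import core.atomic : atomicLoad, cas;

// Returns true if the bit was already set, i.e. some other thread marked it first.
bool testAndSetBit(shared(size_t)[] bits, size_t bitIndex) nothrow
{
    immutable wordIndex = bitIndex / (size_t.sizeof * 8);
    immutable mask      = size_t(1) << (bitIndex % (size_t.sizeof * 8));

    for (;;)
    {
        immutable old = atomicLoad(bits[wordIndex]);
        if (old & mask)
            return true;                        // already marked
        if (cas(&bits[wordIndex], old, old | mask))
            return false;                       // we set it first
        // CAS failed: another thread changed the word, retry.
    }
}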

@etcimon
Contributor Author

etcimon commented Jun 2, 2014

So I'd go for atomic operations on the bits.

I think I'll just keep the GCBits thread-local and merge them at the end of the operation; it's faster if you assume there aren't many shared objects. Most D applications will use TLS anyway, since that's the default.
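Roughly, the merge step I have in mind (just a sketch; names are made up): each scanner thread fills its own mark bitmap, and the per-thread bitmaps are OR'ed into the global one once scanning is done.

void mergeMarkBits(size_t[] global, const(size_t)[][] perThread)
{
    foreach (local; perThread)
    {
        assert(local.length == global.length);
        foreach (i, word; local)
            global[i] |= word;  // a bit marked by any thread stays marked
    }
}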

Etienne Cimon added 10 commits June 4, 2014 08:31
Typo

More typos
removed win64.mak

re-added
commit 4ede047
Author: Etienne Cimon <etcimon@gmail.com>
Date:   Tue May 27 13:39:52 2014 -0400

    Added documentation and GCStats version

commit c24a060
Author: Etienne Cimon <etcimon@gmail.com>
Date:   Tue May 27 13:33:45 2014 -0400

    Calculate freed size of big allocations

commit af31878
Author: Etienne Cimon <etcimon@gmail.com>
Date:   Tue May 27 13:23:46 2014 -0400

    Wasn't returning the proper used size
Added more statistics again

Added missing fields

fixed errors

Added try/catch to make currSystemTick nothrow

Fixed free size calcul

Fixed max bytes used and freed
@etcimon
Contributor Author

etcimon commented Jun 4, 2014

Does TickDuration cause this issue?

OPTLINK (R) for Win32 Release 8.00.15
Copyright (C) Digital Mars 1989-2013 All rights reserved.
http://www.digitalmars.com/ctg/optlink.html
unittest.obj(unittest)
Error 42: Symbol Undefined _D4core4stdc8inttypes12__ModuleInfoZ
unittest.obj(unittest)
Error 42: Symbol Undefined _D4core3sys5posix3sys3uio12__ModuleInfoZ
unittest.obj(unittest)
Error 42: Symbol Undefined _D4core3sys5posix3sys5types12__ModuleInfoZ
unittest.obj(unittest)
Error 42: Symbol Undefined _D4core3sys5posix3sys4time12__ModuleInfoZ
unittest.obj(unittest)
Error 42: Symbol Undefined _D4core3sys5posix4time12__ModuleInfoZ
unittest.obj(unittest)
Error 42: Symbol Undefined _D4core3sys5posix6config12__ModuleInfoZ

@etcimon
Contributor Author

etcimon commented Jun 4, 2014

I'll get back to this later

@etcimon etcimon closed this Jun 4, 2014
@MartinNowak
Member

Sorry for letting you wait with this; I'm trying to address all the GC pull requests in the following order.
