Added gc_stats basic functionality #803
Conversation
@@ -101,7 +101,8 @@ private
extern (C) void* gc_realloc( void* p, size_t sz, uint ba = 0 ) pure nothrow;
extern (C) size_t gc_extend( void* p, size_t mx, size_t sz ) pure nothrow;
extern (C) size_t gc_reserve( size_t sz ) nothrow;
extern (C) void gc_free( void* p ) pure nothrow;
extern (C) GCStats gc_stats( ) pure nothrow;
No tabs please, you might want to use our editorconfig file.
Yes, I changed that part in Monodevelop and it doesn't have a plugin for editorconfig files :/
It seems like the % freed per collection cycle drops off a cliff when size > 4096; I even get 0.00015% per collection using 10 KB, 20 KB, and 30 KB memory segments in my tests. However, I get >40% efficiency with small allocations. Those look like really bad numbers when working with streams.
By increasing the page size to 64 KB, I get much better numbers: 83.351%! That's what I'd call acceptable for working with streams.
@@ -123,6 +124,12 @@ private
extern (C) void gc_runFinalizers( in void[] segment );
}
struct GCStats |
Documentation needed here.
True that, maybe we shouldn't add it to
How about a signature like the following?
I'm assuming anything stats returns would be a number. It could be used like this:
where stats[0] is totFreed, stats[1] is totUsed, and stats[2] is collections. This could prove to be more flexible in the long run.
The problem with that solution is that there's no way to know what order the values are returned in, so if I were to call it like this: auto stats = GC.stats( STATS_USED | STATS_COLLECTIONS | STATS_FREED ); I would still get them in the order the implementation defines. This seems unintuitive.
Well then, maybe it's better off having it like:
I'd like it if we could have a
I'd prefer passing a pointer/reference to the GCStats struct as a parameter. The first field indicates which statistics are requested as a combination of flags; the GC fills the fields it has valid data for and returns their respective flags.
stats.pageblocks++;
else if (bin < B_PAGE)
bsize += PAGESIZE;
}
Please don't nuke the existing functionality, better to just add to it.
The problem is that the existing stats are GC internal, not targeted at a general API.
They still seem to be useful, so maybe you can keep them in a debug(STATS) block.
I was thinking of something even more conservative, such as returning
That would need an allocation, and I think it should not interfere with actual GC operations.
Hmm, ok, your solution of passing in a pointer to the struct is workable with respect to adding members to the stats struct (though over time the struct can become larger).
There will have to be a
It would be in
Ok, I spent some time revising the code to calculate used memory, and it reveals that, when using the GC with memory sizes within the bucket ranges of 0-4096 bytes, collections are ~22%-47% efficient; outside of those ranges, at 5000 bytes, they're 0.00161% efficient, and at 64 KB, 0.000257% efficient.
size_t freed;
size_t used;
size_t collections;
}
Please get rid of those tabs.
Let's focus on getting a working gcStats function in. So maybe a good starting point would be the following: You can use TickDuration for a cheap clock.
It definitely looks like the Haskell GHC stats are the only ones that aren't completely useless.
The sums either increment per collection cycle or per malloc call. The total bytes allocated would be per malloc call, and the bytes used would be per collection cycle, so the math doesn't add up exactly this way.
I'm not sure why it would be necessary to do this rather than return a struct on the stack.
That shouldn't be a problem if you give every newly suspended thread a zeroed GCBits mark-and-scan object. It would also allow concurrent scanning of the stacks and heap; you then OR all the GCBits objects together to get a global view of the marked objects. There may be additional computation, but the overall speed would be improved by the use of a thread pool.
This would actually harvest much more performance for the GC, considering most of the data allocated on the heap by the threads should actually be thread-local, so most of the marking in each thread would be new.
There is no guarantee or information whether an allocated object is shared or thread-local. You'd need stricter type rules and write/cast barriers again to generate that information.
If the marks are not cached from one thread to another, wouldn't it be irrelevant whether the object is shared or thread-local?
Could you elaborate on how you guarantee that an object is not (indirectly) referenced by another thread's stack after scanning the references of the "current" stack? Also, data shared between threads can be mutated during scanning, so the collector might not see references to live objects. There might be some optimizations possible, but I suspect going for full concurrent collection is the better way.
The objects that are shared will get scanned again, because the mark GCBits from scanning the other stacks aren't used, so there is no skipping in mark.testSet during scanning in a concurrent mark() for a later-suspended thread. You end up scanning shared objects multiple times. Sure, the other threads may change the data while the GC is running on the object, but that's never a big deal: if they're pointers, the way RAM is multiplexed internally works in 8-byte increments per channel, so the modification is effectively atomic and you don't get corruption. And anyway, if another thread actually wrote to the shared object, you will also scan it later when that other thread is suspended, and that's when you'll see the new reference. The worst case is that you get two different pointers in the same memory region, which causes a leak, not a corruption.
I'm not sure if I explained it correctly, but sequentially suspending the threads means suspending them one by one and keeping them suspended until scanning is done, so they can all be resumed at once.
Ah, I didn't realize the threads were supposed to be kept suspended until all threads are scanned. Still, assume two threads 1 and 2 sharing an object A that points to B, C, and D. While thread 1 is suspended, the GC scans A, then B and C. Before it gets to read the reference to D, thread 2 (which is still running) exchanges references B and D, then zeroes its reference to A. The GC sees another reference to B instead of D. When the GC scans thread 2, there are no more references to A, thus D will be collected even though the first reference in A points to it.
Well, I didn't think of that scenario. Sounds like a solution for that would be too complex to look into for now. In your opinion, would it be advantageous to scan everything concurrently using a clean mark GCBits in every GC mark() thread?
I'm not sure whether you mean "concurrent collection" (the application continues to run during collection) or "parallel collection" (multiple threads mark the GC heap while the application is stopped). I assume the latter: yes, that'd be nice to have. To avoid (atomic) locks when modifying GCBits from multiple threads, it would be better to have copies. But scanning the same object again because one thread doesn't know another thread has already done so is also bad. A compromise might be to read the bits of all scanner threads, but only write to one.
I think a manager thread with a message queue could do the writing and expose the object for reading. It is parallel, yes; with your background thread for freeing, it gets interesting. Do you also have a scenario that goes against resuming each thread right after it's marked? It seems like the GC lock prevents any new allocations, so that wouldn't be an issue.
The idea of the manager thread is to send info (set it and forget it) without any locking and let the manager handle the writing alone. That way you trade the total number of locks against recency of information.
I think I could write a parallel-marking proof of concept and have it as an option through environment variables, to specify whether the application is mostly TLS or shared data, which would use either a shared mark GCBits or a thread-local one. Another experiment would resume each thread after marking. I'll take care of it after stats and sampling are done.
Interesting idea. I think it can work if there are no cross-references between threads regarding stack and TLS. In that case, those would have to be treated as GC objects and scanned in the mark phase of the referencing thread.
I guess the messaging overhead would be too big for setting GCBits; this happens too often. I just measured a LOCK CMPXCHG operation on my i7 and it usually seems to take 130+ cycles (but sometimes only 32). That's bad, but probably still better than sending a message to another thread. So I'd go for atomic operations on the bits.
I think I'll just keep the GCBits thread-local and merge them at the end of the operation; it's faster if you assume there aren't many shared objects. Most D applications use TLS anyway, since that's the default.
Typo More typos
removed win64.mak re-added
commit 4ede047
Author: Etienne Cimon <etcimon@gmail.com>
Date:   Tue May 27 13:39:52 2014 -0400

    Added documentation and GCStats version

commit c24a060
Author: Etienne Cimon <etcimon@gmail.com>
Date:   Tue May 27 13:33:45 2014 -0400

    Calculate freed size of big allocations

commit af31878
Author: Etienne Cimon <etcimon@gmail.com>
Date:   Tue May 27 13:23:46 2014 -0400

    Wasn't returning the proper used size
Added more statistics again. Added missing fields, fixed errors. Added try/catch to make currSystemTick nothrow. Fixed free size calculation. Fixed max bytes used and freed.
Does the TickDuration cause this issue? OPTLINK (R) for Win32 Release 8.00.15
I'll get back to this later.
Sorry to keep you waiting on this; I'm trying to address all the GC pulls in the following order.
I've added some stats; they show some important metrics on how efficient the GC is at freeing data compared to how much data was in use at the time. I believe this is where most of the optimization opportunities are.
From my tests at https://github.com/etcimon/druntime.gctest, I've observed that the GC frees on average 3% of used memory per collection cycle. I've suggested sampling as a solution to reduce the number of inefficient collections.
