
Added gc_stats basic functionality #803

Closed

wants to merge 11 commits into from

Conversation

etcimon
Contributor

@etcimon etcimon commented May 23, 2014

I've added some stats; they show some important metrics on how efficiently the GC frees data compared to how much data was in use at the time. I believe this is where most of the optimization opportunities are.

From my tests at https://github.com/etcimon/druntime.gctest I've observed that the GC frees on average 3% of used memory per collection cycle. I've suggested sampling as a solution to reduce the number of inefficient collections.

@etcimon
Contributor Author

etcimon commented May 23, 2014

I was using delete to free the memory; this is the real measure for the current version of druntime.gctest.


@@ -101,7 +101,8 @@ private
extern (C) void* gc_realloc( void* p, size_t sz, uint ba = 0 ) pure nothrow;
extern (C) size_t gc_extend( void* p, size_t mx, size_t sz ) pure nothrow;
extern (C) size_t gc_reserve( size_t sz ) nothrow;
extern (C) void gc_free( void* p ) pure nothrow;
extern (C) void gc_free( void* p ) pure nothrow;
extern (C) GCStats gc_stats( ) pure nothrow;
Member

No tabs please; you might want to use our editorconfig file.

Contributor Author

Yes, I changed that part in MonoDevelop, and it doesn't have a plugin for editorconfig files :/

@etcimon
Contributor Author

etcimon commented May 24, 2014

It seems like the % freed per collection cycle drops off a cliff when size > 4096; I even get 0.00015% per collection using 10KB, 20KB, and 30KB memory segments in my tests. However, I get >40% efficiency with small allocations. The numbers look really bad when working with streams.

@etcimon
Contributor Author

etcimon commented May 24, 2014

By increasing the page size to 64KB, I get much better numbers: 83.351%! That's what I'd call acceptable for working with streams.

@@ -123,6 +124,12 @@ private
extern (C) void gc_runFinalizers( in void[] segment );
}

struct GCStats
Member

Documentation needed here.

@Safety0ff
Contributor

core.memory is supposed to be a low-level interface implemented by a garbage collector.
I am unsure about adding a gc_stats method that returns a "set in stone" struct (which is why previously you had to expose it manually, e.g.: http://dpaste.dzfl.pl/4ce04246a148).
Unless the API is suitable for any GC implementation, I think we should tread carefully.

@MartinNowak
Member

True that; maybe we shouldn't add it to core.memory yet.

@etcimon
Contributor Author

etcimon commented May 25, 2014

How about a signature like the following?

long[] stats(long cmd);

I'm assuming anything stats returns would be a number.

It could be used like this:

auto stats = GC.stats(STATS_FREED | STATS_USED | STATS_COLLECTIONS);

where stats[0] is totFreed, stats[1] is totUsed, and stats[2] is collections.

This could prove to be more flexible in the long run.
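As a rough sketch (the flag names and values here are just placeholders, not part of druntime):

enum : long
{
    STATS_FREED       = 1 << 0,
    STATS_USED        = 1 << 1,
    STATS_COLLECTIONS = 1 << 2,
}

// One result slot per flag set in cmd, in a fixed order (ascending flag value).
long[] stats(long cmd);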

@dislogical

The problem with that solution is that OR-ing the flags loses the order they were written in, so if I were to call it like this:

auto stats = GC.stats( STATS_USED | STATS_COLLECTIONS | STATS_FREED );

I would still get them in the order totFreed, totUsed, and collections.

This seems unintuitive.

@etcimon
Contributor Author

etcimon commented May 25, 2014

Well then, maybe it's better off having it like:

long[] stats(long[] cmd)

I'd like it if we could have a long[] stats(long[] cmd, long[] args, string[] argsStr) for eventually sending in some arguments to get more specific statistics (like turning on logs for a certain file name, or logging the longest periods of time between collections, etc.).

@rainers
Member

rainers commented May 26, 2014

I'd prefer passing a pointer/reference to the GCStat struct as a parameter. The first field indicates which statistics are requested as a combination of flags, and the GC fills the fields it has valid data for and returns their respective flags.
At the same time it can enable the code inside the GC to collect this data.
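Something along these lines (just a sketch; the flag and field names are only illustrative):

enum GCStatsFlags : uint
{
    freed       = 1 << 0,
    used        = 1 << 1,
    collections = 1 << 2,
}

struct GCStats
{
    uint requested;      // in: which statistics the caller wants
    size_t freed;        // out: bytes freed so far
    size_t used;         // out: bytes currently in use
    size_t collections;  // out: number of collection cycles
}

// The GC fills the fields it has valid data for and returns their flags.
extern (C) uint gc_stats(ref GCStats stats) nothrow;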

stats.pageblocks++;
else if (bin < B_PAGE)
bsize += PAGESIZE;
}
Member

Please don't nuke the existing functionality; better to just add to it.

Member

The problem is that the existing stats are GC-internal, not targeted at a general API.
They still seem to be useful, so maybe you can keep them in a debug(STATS) block.

@Safety0ff
Contributor

I'd prefer passing a pointer/reference to the GCStat struct as a parameter. The first field indicates which statistics are requested as a combination of flags, and the GC fills the fields it has valid data for and returns their respective flags.

I was thinking of something even more conservative such as returning string[].

@rainers
Member

rainers commented May 26, 2014

I was thinking of something even more conservative such as returning string[].

That would need an allocation, and I think it should not interfere with actual GC operations.

@Safety0ff
Contributor

Hmm, OK, your solution of passing in a pointer to the struct is workable with respect to adding members to the stats struct (though over time the struct can become larger).
The only question is how to signal an unsupported statistic request?

@etcimon
Contributor Author

etcimon commented May 26, 2014

The only question is how to signal an unsupported statistic request?

There will have to be a GCStats.version field at the very start of the struct, to check whether druntime has the same version as the client application. Am I to understand that this struct is not going to be in core.memory?
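As a sketch of the version-field idea (note that version is a keyword in D, so the field would need another name):

// Sketch only: a version field at the start so client code can detect a
// mismatched druntime.
struct GCStats
{
    uint statsVersion = 1;  // bumped whenever fields are added or changed
    size_t freed;
    size_t used;
    size_t collections;
}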

@Safety0ff
Contributor

Am I to understand that this struct is not going to be in core.memory?

It would be in core.memory.

@etcimon
Contributor Author

etcimon commented May 27, 2014

OK, I spent some time revising the code to calculate used memory, and it reveals that, when using the GC with allocation sizes within the bucket ranges of 0-4096 bytes, collections are ~22%-47% efficient. When moving outside of those ranges, at 5000 bytes, they're 0.00161% efficient; at 64KB, they're 0.000257% efficient.

size_t freed;
size_t used;
size_t collections;
}
Member

Please get rid of those tabs.

@MartinNowak
Member

Let's focus on getting a working gcStats function in.
So we need to find a reasonable set of parameters that any GC implementation can cheaply calculate or estimate.
Let's look at the stats of some other GC implementations:
- Haskell GHC-Stats
- Java GarbageCollectorMXBean
- Ruby GC.Stats
- Python gc.get_stats

So maybe a good starting point would be the following:
- number of collections
- bytes used (approx.)
- total bytes allocated (approx., total bytes freed = total bytes allocated - bytes used)
- total time spent in stop-the-world marking
- total time spent in sweeping & finalizing

You can use TickDuration for a cheap clock.
If any GC can't support one of those stats, it might simply return 0 or size_t.max for that field.
I like the difference idea of @rainers.
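As a rough sketch of a struct covering that starting set (field names are only suggestions; a GC that can't provide a value would leave the field at 0 or size_t.max):

import core.time : TickDuration;

struct GCStats
{
    size_t numCollections;  // number of collections run so far
    size_t usedSize;        // bytes currently in use (approx.)
    size_t allocatedSize;   // total bytes allocated over the program's lifetime (approx.)
    TickDuration markTime;  // total time spent in stop-the-world marking
    TickDuration sweepTime; // total time spent sweeping & finalizing
}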

@etcimon
Contributor Author

etcimon commented May 27, 2014

It definitely looks like the Haskell GHC stats are the only ones that aren't completely useless.

@etcimon
Contributor Author

etcimon commented May 27, 2014

total bytes allocated (approx., total bytes freed = total bytes allocated - bytes used)

The sums increment either per collection cycle or per malloc call. The total bytes allocated would be updated per malloc call, and the bytes used would be updated per collection cycle, so the math doesn't add up exactly this way.

@etcimon
Contributor Author

etcimon commented May 28, 2014

Hmm, OK, your solution of passing in a pointer to the struct is workable with respect to adding members to the stats struct (though over time the struct can become larger).
The only question is how to signal an unsupported statistic request?

I'm not sure why it would be necessary to do this rather than return the struct on the stack.

@etcimon
Contributor Author

etcimon commented Jun 1, 2014

The current implementation immediately "crawls" into the heap, so it might see an inconsistent view of the memory which can cause collection of live objects.

That shouldn't be a problem if you give every newly suspended thread a zeroed GCBits mark and scan object. It would also allow concurrent scanning of the stacks and heap, and then you OR all the GCBits objects together to get a global view of the marked objects. There may be additional computation, but the overall speed would be improved by the use of a thread pool.

@etcimon
Contributor Author

etcimon commented Jun 1, 2014

This would actually yield much more performance for the GC, considering most of the data allocated on the heap by the threads should actually be thread-local, so most of the marking in each thread should be new.

@rainers
Member

rainers commented Jun 1, 2014

There is no guarantee or information about whether an allocated object is shared or thread-local. You'd need stricter type rules and write/cast barriers again to generate that information.

@etcimon
Contributor Author

etcimon commented Jun 1, 2014

There is no guarantee or information about whether an allocated object is shared or thread-local. You'd need stricter type rules and write/cast barriers again to generate that information.

If the marks are not cached from one thread to another, wouldn't it be irrelevant whether the object is shared or thread local?

@rainers
Member

rainers commented Jun 1, 2014

If the marks are not cached from one thread to another, wouldn't it be irrelevant whether the object is shared or thread local?

Could you elaborate how you guarantee that an object is not (indirectly) referenced by another thread's stack after scanning the references of the "current" stack? Also, data shared between threads can be mutated during scanning, so the collector might not see references to live objects.

There might be some optimizations possible, but I suspect going for full concurrent collection is the better way.

@etcimon
Contributor Author

etcimon commented Jun 1, 2014

Could you elaborate how you guarantee that an object is not (indirectly) referenced by another thread's stack after scanning the references of the "current" stack? Also, data shared between threads can be mutated during scanning, so the collector might not see references to live objects.

The objects that are shared will get scanned again, because the mark GCBits from scanning the other stacks aren't used, so there is no skipping in mark.testSet during scanning in a concurrent mark() for a later-suspended thread. So you end up scanning shared objects multiple times.

Sure, the other threads may change the data while the GC is scanning the object, but that's not a big deal for pointers: the way RAM is multiplexed internally works on 8-byte increments per channel, so the modification will certainly be done atomically, because that's just how RAM works. So you don't get corruption.

But anyway, if another thread actually wrote to this shared object, it's no big deal, because you will also scan it later when that other thread is suspended, and that's when you'll see the new reference. The worst case is that you get 2 different pointers in the same memory region, which causes a leak, not corruption.

@etcimon
Contributor Author

etcimon commented Jun 1, 2014

I'm not sure if I explained it correctly, but sequentially suspending the threads would be done by suspending them one by one and keeping them suspended until scanning is done, so they can be resumed all at once.

@rainers
Member

rainers commented Jun 1, 2014

I'm not sure if I explained it correctly, but sequentially suspending the threads would be done by suspending them one by one and keeping them suspended until scanning is done, so they can be resumed all at once.

Ah, I didn't realize that threads were supposed to be kept suspended until all threads are scanned.

Still, assume two threads 1 and 2 sharing an object A, that points to B, C and D. While thread 1 is suspended, the GC scans A, and then B and C. Before it gets to read the reference D, thread 2 (which is still running) exchanges references B and D, then zeroes its reference to A. The GC sees another reference to B instead of D. When the GC scans thread 2, there are no more references to A, thus D will be collected even though the first reference in A points to it.

@etcimon
Contributor Author

etcimon commented Jun 1, 2014

Well I didn't think of that scenario. Sounds like a solution for that would be too complex to look into for now. Would it be advantageous to scan everything concurrently using a clean mark GCBits in every GC mark() thread in your opinion?

@rainers
Member

rainers commented Jun 1, 2014

Would it be advantageous to scan everything concurrently using a clean mark GCBits in every GC mark() thread in your opinion?

I'm not sure whether you mean "concurrent collection" (the application continues to run during collection) or "parallel collection" (multiple threads mark the GC heap while the application is stopped). I assume the latter: yes, that'd be nice to have. To avoid (atomic) locks when modifying GCBits from multiple threads, it would be better to have copies. But scanning the same object again because one thread doesn't know another thread has already done that is also bad. A compromise might be to read the bits of all scanner threads, but only write to one.

@etcimon
Contributor Author

etcimon commented Jun 1, 2014

I think a manager thread with a message queue could do the writing and expose the object for reading. It is parallel, yes; with your background thread for freeing it gets interesting. Would you have a scenario as well that goes against resuming the thread right after it's marked? It seems like the GC lock takes care of preventing any new allocations, so that wouldn't be an issue.

@etcimon
Contributor Author

etcimon commented Jun 1, 2014

The idea of the manager thread is to send info (set it and forget it) without any locking and let the manager handle the writing alone. That way you trade off the total number of locks against the recency of information.

@etcimon
Contributor Author

etcimon commented Jun 1, 2014

I think I could write a parallel marking proof of concept, with an option through environment variables to specify whether the application is mostly TLS or shared data, which will use either a shared mark GCBits or a thread-local one. Another experiment would be to resume each thread right after marking it. I'll take care of it after stats and sampling are done.

@rainers
Member

rainers commented Jun 2, 2014

Would you have a scenario as well that goes against resuming the thread right after it's marked?

Interesting idea. I think it can work if there are no cross-references between threads regarding stack and TLS. If there are such references, they would have to be treated as GC objects and be scanned in the mark phase of the referencing thread.

@rainers
Member

rainers commented Jun 2, 2014

The idea of the manager thread is to send info (set it and forget it) without any locking and let the manager handle the writing alone. That way you trade off the total number of locks against the recency of information.

I guess the messaging overhead would be too big for setting GCBits; this happens too often. I just measured a LOCK CMPXCHG operation on my i7, and it seems to usually take 130+ cycles (but sometimes only 32). That's bad, but probably still better than sending a message to another thread. So I'd go for atomic operations on the bits.
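For illustration, an atomic test-and-set on a single mark bit could look roughly like this (just a sketch, assuming the mark bitmap is stored as shared size_t words; names are made up):

import core.atomic : atomicLoad, cas;

// Returns true if the bit was already set, i.e. some other thread marked it first.
bool testAndSetBit(shared(size_t)[] bits, size_t bitIndex) nothrow
{
    immutable wordIndex = bitIndex / (size_t.sizeof * 8);
    immutable mask      = size_t(1) << (bitIndex % (size_t.sizeof * 8));

    for (;;)
    {
        immutable old = atomicLoad(bits[wordIndex]);
        if (old & mask)
            return true;                        // already marked
        if (cas(&bits[wordIndex], old, old | mask))
            return false;                       // we set it first
        // CAS failed: another thread changed the word, retry.
    }
}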

@etcimon
Contributor Author

etcimon commented Jun 2, 2014

So I'd go for atomic operations on the bits.

I think I'll just keep the GCBits thread-local and merge them at the end of the operation; it's faster if you assume there aren't many shared objects. Most D applications will use TLS anyway, since that's the default.
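Roughly, the merge step I have in mind (just a sketch; names are made up): each scanner thread fills its own mark bitmap, and the per-thread bitmaps are OR'ed into the global one once scanning is done.

void mergeMarkBits(size_t[] global, const(size_t)[][] perThread)
{
    foreach (local; perThread)
    {
        assert(local.length == global.length);
        foreach (i, word; local)
            global[i] |= word;  // a bit marked by any thread stays marked
    }
}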

Etienne Cimon added 10 commits June 4, 2014 08:31
Typo

More typos
removed win64.mak

re-added
commit 4ede047
Author: Etienne Cimon <etcimon@gmail.com>
Date:   Tue May 27 13:39:52 2014 -0400

    Added documentation and GCStats version

commit c24a060
Author: Etienne Cimon <etcimon@gmail.com>
Date:   Tue May 27 13:33:45 2014 -0400

    Calculate freed size of big allocations

commit af31878
Author: Etienne Cimon <etcimon@gmail.com>
Date:   Tue May 27 13:23:46 2014 -0400

    Wasn't returning the proper used size
Added more statistics again

Added missing fields

fixed errors

Added try/catch to make currSystemTick nothrow

Fixed free size calcul

Fixed max bytes used and freed
@etcimon
Contributor Author

etcimon commented Jun 4, 2014

Does TickDuration cause this issue?

OPTLINK (R) for Win32 Release 8.00.15
Copyright (C) Digital Mars 1989-2013 All rights reserved.
http://www.digitalmars.com/ctg/optlink.html
unittest.obj(unittest)
Error 42: Symbol Undefined _D4core4stdc8inttypes12__ModuleInfoZ
unittest.obj(unittest)
Error 42: Symbol Undefined _D4core3sys5posix3sys3uio12__ModuleInfoZ
unittest.obj(unittest)
Error 42: Symbol Undefined _D4core3sys5posix3sys5types12__ModuleInfoZ
unittest.obj(unittest)
Error 42: Symbol Undefined _D4core3sys5posix3sys4time12__ModuleInfoZ
unittest.obj(unittest)
Error 42: Symbol Undefined _D4core3sys5posix4time12__ModuleInfoZ
unittest.obj(unittest)
Error 42: Symbol Undefined _D4core3sys5posix6config12__ModuleInfoZ

@etcimon
Contributor Author

etcimon commented Jun 4, 2014

I'll get back to this later

@etcimon etcimon closed this Jun 4, 2014
@MartinNowak
Member

Sorry for letting you wait with this; I'm trying to address all the GC pull requests in the following order.
