New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Incremental GC #5227

Closed
wants to merge 17 commits into
base: master
from

Conversation

@carnaval
Contributor

carnaval commented Dec 23, 2013

This is work I started when hitting long GC pause in a soft realtime application. No matter how low you can bring the allocation count down (and the language admittedly makes it easy), there is still a point where you must wait for tens of ms which is a pain in, e.g., render code at 60 fps.

To be clear this is only incremental marking, but incremental sweeping should be much easier.
This patch :

  • Adds a new gc bit on each type field of each julia object which records if it is scheduled for marking (the so called grey bit)
  • Adds a write barrier to all pointer stores to gc-managed slots (in C runtime and codegen)
  • Make the GC able to run partial marking phase
  • Adds a verifier which can run before each sweep to verify we did not miss a write.

Some points :

  • The write barrier is explicit in C code. Most of it looks like :
a->b = ...;
gc_wb(a, a->b);

This is ugly and could be resolved with a macro (something like gc_store(a, b, ...);) but I was not sold on changing such pervasive syntax.

  • The current heuristics are quite bad : the collect interval is simply divided by 10 and partial markings works with an increasing timeout (1ms then 1.3*previous_timeout). I expect some low hanging fruits for efficiency here, if only because I chose those constants arbitrarily ;)
  • Performances : nothing is free so there is a perf hit. I believe it could be brought down as I was mainly focused on getting it correct until recently. My measurements are on running the full test suite. It takes about 10% more time to run but the pause times (for the mark phase) goes down from : max 40ms, 95% sub 30ms to max 5ms, 95% sub 2ms.
  • Even if low-pause is not a focus point for the julia community ATM, the write barrier can be used to make the GC generational. In fact, you could turn this PR into a poor man's generational GC simply by not reseting the marks at the end of a sweep. I did not experiment in this direction yet, but doing it properly would require the GC to support moving, which is its own can of worms.

To enable, build with -DGC_INC. GC_VERIFY enables the verifier and GC_TIME prints pause timings.

Of course, feedback is much appreciated. Cheers !

@JeffBezanson

This comment has been minimized.

Member

JeffBezanson commented Dec 23, 2013

Wow, this is really impressive. It would be great to have the option to build with a low-pause GC for users who might want that. Having multiple GCs available is really cool.

I'm planning to work on a new default GC as well. It also requires a write barrier, so that work will not be wasted.

What I want to try is taking advantage of the fact that garbage is very likely to have been referenced only from the stack (i.e. temporary values). So I'd set a bit on heap-referenced values, then during a "minor" GC I'd only mark objects currently on the stack, then do a full sweep. That skips tracing all heap objects, which takes most of the time in my experiments. It seems like you'd be an ideal person to evaluate this idea.

@StefanKarpinski

This comment has been minimized.

Member

StefanKarpinski commented Dec 23, 2013

Very cool work. I worry about having multiple GCs because each one might not get enough testing to be really reliable. It would be ideal, imo, to have a single standard GC that can meet everyone's needs, perhaps with a little assist from the high-level coding style (e.g. if you have real-time requirements, always pre-allocate temporary arrays). On the other hand, GC might be an area where it's not actually possible to satisfy everyone with one system.

@JeffBezanson

This comment has been minimized.

Member

JeffBezanson commented Dec 23, 2013

Whether you care about pause times is a pretty hard dividing line in the GC design space. You can't address this by changing how user code is written, because you also call libraries, and our compiler also allocates objects if it wants to. Even if our GC is so good it takes 1ms 99.9% of the time, that still isn't low-latency. The 0.1% ruins it. Limited pause time is a harsh requirement.

@carnaval Is there an argument that this eventually collects all garbage? I believe it might but it's not obvious. For example, say you constantly add new nodes to a large tree, and on rare occasion unreference a big chunk of it. Will the GC always use up its allotted time dealing with the new updates?

@StefanKarpinski

This comment has been minimized.

Member

StefanKarpinski commented Dec 23, 2013

Whether you care about pause times is a pretty hard dividing line in the GC design space.

Yes, that means requirements on the "one GC" would be pretty severe – high enough throughput for general usage and guaranteed low-latency for real-time requirements. Not even sure if that's doable and the alternative would be having a small set (maybe only two) of GCs that can be swapped out.

@carnaval

This comment has been minimized.

Contributor

carnaval commented Dec 23, 2013

It should collect all garbage eventually because I increase the timeout if the mark queue is not shrinking, until a cutoff value where I do a full collection.
As I said, those heuristics are pretty bad IMO and this would be the next thing I'm looking into.

I'd be happy to do some tests on different ways to do quick collections after this. A first step would be having some basic infrastructure to measure dfferences in gc pause/throughput. ATM I have some basic scripts that need some cleanup but I believe it would be beneficial to automate it to spot regressions (maybe an entry on codespeed ?)

About having multiple gc, in this PR I duplicated the codepath because it allowed me to trust the verification (which is inline to the marking code, meh) but it could be merged quite easily and I don't think there should be any penalty in saying that a full collection is an incremental one with infinte timeout.
As you said, this would not ensure proper testing of the incremental code path, but I believe that simply making the CI server run the testsuite with both modes would be a good first step.

Btw, congrats on the codebase it's really clean and quite nice to dive into.

@JeffBezanson

This comment has been minimized.

Member

JeffBezanson commented Dec 23, 2013

Thanks.

It would be nice if we could somehow avoid calling clock_now() for every object. Surely that has some overhead.

I assume all the verification stuff is just for debugging? It doesn't all seem to get disabled by GC_VERIFY.

@carnaval

This comment has been minimized.

Contributor

carnaval commented Dec 23, 2013

About clock_now, I had a counter at some point to only check once every N objects but didn't see measurable improvements. However a lot of things changed since then so it may very well be worth it now. I will have a look.

You're right, seems like I forgot to disable the verification macros when GC_VERIFY is not defined.

@JeffBezanson

This comment has been minimized.

Member

JeffBezanson commented Dec 24, 2013

A couple formatting points to fix:

  • space between if and ( (if (x) ...)
  • newline before opening { in function definitions
  • f(void) for 0-argument function declarations
  • newline between } and else instead of } else {
  • if one part of an if..else chain uses { } then all should
@StefanKarpinski

This comment has been minimized.

Member

StefanKarpinski commented Dec 24, 2013

f(void) for 0-argument function declarations

This one isn't actually a style point, but required for correctly declaring a zero-argument C function.

@ViralBShah

This comment has been minimized.

Member

ViralBShah commented Dec 24, 2013

This is really impressive, and would enable lots of new applications.

@carnaval

This comment has been minimized.

Contributor

carnaval commented Dec 26, 2013

Let me know if I missed anything.

@ssfrr

This comment has been minimized.

Contributor

ssfrr commented Jan 1, 2014

I'm really excited about bringing the max latency down to enable realtime audio applications, so bringing down the GC delay would be a huge step. Thanks for the work on this!

-s

@JeffBezanson

This comment has been minimized.

Member

JeffBezanson commented Jan 1, 2014

Yes, I plan to merge this.

@carnaval

This comment has been minimized.

Contributor

carnaval commented Jan 21, 2014

I finally got around spending a little more time on this.
Couple questions :

  • should I make it a runtime flag ? It will have never-triggering barrier in the runtime and no barriers in generated code. I will try and measure if this has a noticeable impact.
  • are CI build times important ? (i.e. is anyone waiting on them ?). If not, is it deemed reasonable to run the testsuite twice, with full collections and partials ?
  • in codegen.cpp I'm wondering if there is a better way, given a varinfo_t, to know if it is stack allocated than to check the IR definition (see is_stack)

@ssfrr I have no experience in realtime audio, do you happen to know the order of magnitude of tolerable pause times ?

@JeffBezanson About the escape bit, if I understood you correctly, it would not be too hard to implement on top of what I have now. We would still have to sweep the whole heap though. The only way around this is to implement moving and LLVM does not seem really suited for this. In fact, I only see one way to avoid moving an object while in the middle of some computation using its address. It involves reloading every root after every potentially-allocating operation, this does not strikes me as cheap and would probably confuses LLVM opt passes.

@ssfrr

This comment has been minimized.

Contributor

ssfrr commented Jan 22, 2014

Realtime audio deadlines are around 10-20ms max if you're processing live sound (~5ms is not uncommon). Typically there's an audio callback that gets called with every new frame of incoming audio, and within that context the application processes the sound and returns a new frame to be output. So given a 20ms deadline, ideally as much of that time as possible can be used for processing (if the system is blocked for 10ms of the 20ms, that only allows 50% CPU utilization).

This is the main reason that audio processing software is almost always written in C or C++, with all the memory allocation happening up front and blocking operations like disk access in a separate lower-priority thread.

I think that this is a niche enough use case that I wouldn't expect the GC design to have huge compromises to meet it. It would be nice if it were at least possible to write performant audio processing code in Julia though, even if it required a bit of care on the developer's part. I've got a few audio related projects and have been toying with the idea of using julia to implement the engine.

@vtjnash

This comment has been minimized.

Member

vtjnash commented Jan 29, 2014

What's the status of this? I would like to see it get merged before I start working on #2818, since I expect they will have a fairly high degree of overlap

vtjnash added a commit that referenced this pull request Jan 29, 2014

Add code formatting guidelines
Taken from Jeff's comments on a recent pull request: #5227 (comment)
@carnaval

This comment has been minimized.

Contributor

carnaval commented Jan 29, 2014

I'm cautiously confident that the incremental marking is correct (I've been using it exclusively for some time now).
I added escape tracking and page skipping to allow for quick collections. It appears to give a nice speedup while retaining low pause times.
The only remaining long pauses are sweep phases after a long collection.
On a silly example which allocates a lot of temporaries :

silly(N) = (x = 0; for i=1:N x += {1}[1] end; x)
silly(10); @time silly(40000000)

master : time: 6.727313459 seconds (3199998572 bytes allocated)
with GC_TRACK_ESC : 2.445960981 seconds (3199998572 bytes allocated)

I intend to run a more serious benchmark later today (morning here) comparing master and all combinations of (inc_mark, track_esc) on runtime, pause times, max heap size.
Depending on the measurements, it may make sense to merge it enabled by default - and it would help battle-test the code ;-).
The code is a bit messy for now, I'll clean it up and explain a bit what's happening.
Cheers.

@carnaval

This comment has been minimized.

Contributor

carnaval commented Jan 29, 2014

@JeffBezanson By the way, is there any reason for the non power of 2 gc page size on 64bit cpu ? Git blame gives a very old commit.
To quickly check if an object is pooled and on what page I had to use 2^n aligned pages.

@timholy

This comment has been minimized.

Member

timholy commented Jan 29, 2014

@carnaval, this is really impressive; faster throughput with short pauses sounds almost too good to be true. (I oh-so-vaguely understand generational gc and why it's not impossible, but still...)

Is the Travis error on the Clang build anything to be worried about? Naively it does look like a potential memory corruption problem.

@JeffBezanson

This comment has been minimized.

Member

JeffBezanson commented Jan 29, 2014

This is awesome. We should plan to merge it at the start of the 0.4 cycle
to shake out any bugs.
The reason it's not too good to be true is that the existing gc was far
from optimal :-)
On Jan 29, 2014 7:03 AM, "Tim Holy" notifications@github.com wrote:

@carnaval https://github.com/carnaval, this is really impressive;
faster throughput with short pauses sounds almost too good to be true. (I
oh-so-vaguely understand generational gc and why it's not impossible, but
still...)

Is the Travis error on the Clang build anything to be worried about?
Naively it does look like a potential memory corruption problem.

Reply to this email directly or view it on GitHubhttps://github.com//pull/5227#issuecomment-33578689
.

@timholy

This comment has been minimized.

Member

timholy commented Jan 29, 2014

Had one tried to make it optimal from the outset, Julia would still be stuck with about 4 users (the perfect being the enemy of the good and all that). But it's really great to see this getting attention, it seems very timely given the improvements that are being made in other areas.

@ViralBShah

This comment has been minimized.

Member

ViralBShah commented Jan 30, 2014

How about merging it now but only enabled with a switch?

@carnaval carnaval referenced this pull request Mar 9, 2014

Merged

add missing root #6085

@stevengj stevengj referenced this pull request Apr 21, 2014

Closed

types as C-structs #2818

5 of 5 tasks complete
@ssfrr

This comment has been minimized.

Contributor

ssfrr commented May 24, 2014

In a little non-rigorous testing just now this is performing way better than the standard GC with AudioIO.

I made a sloppy audio node that does a buffer allocate/free on every block of audio rather than iterating. With the normal GC I hear frequent periodic dropouts when I have more then 4-5 of the nodes running at once. With the incremental GC I can get 200 of them running with occasional drop-outs, and 100 was running with no dropouts (again, in my very quick testing)

Granted I'm operating at pretty large buffer sizes right now, but this definitely seems like a huge win.

There was one compile-time issue, I had to change the MAP_ANONYMOUS to MAP_ANON in gc.c. It looks like a linux/OSX compatibility issue (this is an OSX box), perhaps just a

#ifndef MAP_ANONYMOUS
    #define MAP_ANONYMOUS MAP_ANON
#endif
@StefanKarpinski

This comment has been minimized.

Member

StefanKarpinski commented May 24, 2014

So what's in the way of merging this?

@johnmyleswhite

This comment has been minimized.

Member

johnmyleswhite commented Oct 3, 2014

Given the high variability, would be really nice to see histogram/KDE comparing the old and new GC's.

@IainNZ

This comment has been minimized.

Member

IainNZ commented Oct 3, 2014

A fun benchmark would be to take everything on julia-users/Stackoverflow that was slow due to slicing, and throw it at this - some will probably really give the GC a good workout.

@carnaval

This comment has been minimized.

Contributor

carnaval commented Oct 5, 2014

I'm not sure we can expect a speedup on this benchmark, since the generational hypothesis doesn't model well the behavior of the program. If, for example, you throw away those strings quickly instead of storing them in a huge array you should get an improvement over current results.
In fact, apart from the few temporaries generated by the comprehension, the GC cannot free any memory until the whole array becomes garbage. The optimal behavior would then be something like gc_disable(); build_huge_array(); gc_enable();, which is what the current gc tends to by growing the collection interval geometrically.
I agree that in those cases the heuristic should be smarter and stop trying to do young collections as often (which it does, but not aggressively enough).

This effect is made much worse by the fact that, even if the string array gets promoted to old gen, it must be remarked fully every time because we are stuffing it with young objects. The way around this is to use a card table alongside (or as a replacement of) the remembered set to allow partial invalidation of large arrays. This will make the write barrier more complicated because given a pointer it's not straightforward (i.e. O(1) for a very small value of 1 :-) ) to know how it is stored in memory.

That being said I'm having a look right now to see if there are ways to help this specific case.
I also agree that the perf test suite should be extended, as that's mostly what I'm basing myself upon to check for regressions since I don't have that much of a julia corpus to test things on.

@JeffBezanson

This comment has been minimized.

Member

JeffBezanson commented Oct 5, 2014

Good point --- the benchmark should be repeated with the new objects being
garbage.

@jakebolewski

This comment has been minimized.

Member

jakebolewski commented Oct 6, 2014

That makes sense, thanks for the detailed explanation. You are right that the new GC is much better under the conditions you outlined (although both tests represent extremes of "normal" allocation behvaior).

Julia/julia-dev-33 [master] » ./julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+944 (2014-10-04 23:57 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit c5f136a (1 day old master)
|__/                   |  x86_64-apple-darwin13.4.0

julia> using DataFrames

julia> function test(N)
       for _ in 1:N
           reverse(utf16("this is a test"))
       end
       end
test (generic function with 1 method)

julia> @time test(5_000_000)
elapsed time: 1.410968703 seconds (800865016 bytes allocated, 35.24% gc time)

julia> le = @time open(deserialize, "/Users/jacobbolewski/Julia/benchmarks/labevents.jls");
elapsed time: 9.167030852 seconds (1326598072 bytes allocated, 5.80% gc time)

julia> @time test(5_000_000)
elapsed time: 2.39665995 seconds (800000080 bytes allocated, 58.28% gc time)

julia> @time test(5_000_000)
elapsed time: 6.629903708 seconds (800000080 bytes allocated, 87.44% gc time)

julia> @time test(5_000_000)
elapsed time: 6.533877549 seconds (800000080 bytes allocated, 87.42% gc time)

julia> @time test(5_000_000)
elapsed time: 6.931448025 seconds (800000080 bytes allocated, 88.15% gc time)

julia> @time test(5_000_000)
elapsed time: 6.193674193 seconds (800000080 bytes allocated, 86.91% gc time)

This PR

Julia/julia-dev-33 [newgc] » ./julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+525 (2014-09-28 22:51 UTC)
 _/ |\__'_|_|_|\__'_|  |  newgc/8c54091 (fork: 13 commits, 27 days)
|__/                   |  x86_64-apple-darwin13.4.0

julia> using DataFrames

julia> function test(N)
       for _ in 1:N
           reverse(utf16("this is a test"))
       end
       end
test (generic function with 1 method)

julia> @time test(5_000_000)
elapsed time: 1.03702919 seconds (763 MB allocated, 9.41% gc time)

julia> le = @time open(deserialize, "/Users/jacobbolewski/Julia/benchmarks/labevents.jls");
elapsed time: 9.971573486 seconds (1265 MB allocated, 6.24% gc time)

julia> @time test(5_000_000)
elapsed time: 2.389991338 seconds (762 MB allocated, 53.33% gc time)

julia> @time test(5_000_000)
elapsed time: 1.740331145 seconds (762 MB allocated, 43.69% gc time)

julia> @time test(5_000_000)
elapsed time: 1.314867743 seconds (762 MB allocated, 29.68% gc time)

julia> @time test(5_000_000)
elapsed time: 1.404829552 seconds (762 MB allocated, 32.67% gc time)

julia> @time test(5_000_000)
elapsed time: 1.397260517 seconds (762 MB allocated, 33.38% gc time
@JeffBezanson

This comment has been minimized.

Member

JeffBezanson commented Oct 6, 2014

Excellent!

@JeffBezanson

This comment has been minimized.

Member

JeffBezanson commented Oct 14, 2014

@carnaval I just added you to the contributors team. You can move this branch to the main repository, under a name like ob/gengc, and reopen this PR from there. That will allow all of us to easily contribute tweaks leading up to merging this.

@carnaval

This comment has been minimized.

Contributor

carnaval commented Oct 16, 2014

Thanks ! I'll try and do this without breaking everything. We all know what comes with great power ... :-)

@ihnorton

This comment has been minimized.

Member

ihnorton commented Oct 16, 2014

Just don't push to master by accident, and especially don't force push to master.

carnaval added some commits Jan 21, 2014

Add incremental GC & write barrier.
This also includes various allocation changes which should improve
performances.
There is also a start of generational behavior for <2k objects.
This broke the heuristics in the process, still pretty much a WIP.
All wb are now backward for quick collections, big objects can be tra…
…nsients,

bugfixes...
Still in a pretty broken state (at the very least incremental codepath isn't
working)
@trrousse :-)
Fix a bug where a soon-to-be-promoted object would escape the write b…
…arrier. A bit more cleanup too. Also some missing write barrier in new code.
Prevent module level assignments from clobbering up the remembered se…
…t. A bit more tweaking of the collection heuristics. We are now faster/less memory hungry on almost every benchmark of the micro, kernel & shootout suite.
Avoid a silly 1-cycle latency to old object remarking after a full co…
…llection. Slight cleanup. Address some of Jeff's comments.
}
static inline int gc_setmark(void *o, int sz, int mark_mode)

This comment has been minimized.

@vtjnash

vtjnash Mar 11, 2015

Member

in some cases, I think this is being called with a sz value that doesn't include the jl_value_t *type tag (e.g. when it comes from jl_datatype_size)

// if parent is marked and buf is not
if (__unlikely((*((uintptr_t*)parent) & 1) == 1))
// (*((uintptr_t*)bufptr) & 3) != 1))
gc_setmark_buf(bufptr, *(uintptr_t*)parent & 3);

This comment has been minimized.

@vtjnash

vtjnash Mar 30, 2015

Member

this will mark bufptr (for the next gc pass), but if parent was old, wouldn't this violate the state transitions setup (for the next+1 gc pass) since you have parent(old) pointing to bufptr(new)? (per usage of this function in

gc_wb_buf(m, b);
)

This comment has been minimized.

@carnaval

carnaval Apr 1, 2015

Contributor

I think it's ok here since we copy the low bits of parent, so if the parent is old the buffer will be marked as old also.

This comment has been minimized.

@vtjnash

vtjnash Apr 1, 2015

Member

I thought the GC bits here didn't include the age bits?

This comment has been minimized.

@carnaval

carnaval Apr 1, 2015

Contributor

It does not include what is referred in the code as the age bit which is where we remember if the object already survived one collection. Once an object turns old for real (i.e., after 2 survivals) we mark it in its low bits (marked as 01 instead of 11). The transition happens at this point and we check for child objects/buffers also there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment