
Generational behavior for the garbage collector #8699

Merged: 2 commits into master, Jan 24, 2015

Conversation

@carnaval
Contributor

Same as #5227 but living in this repo & squashed.

@catawbasam
Contributor

Looking forward to this!

heaps_lb[heap_i] = i;
if (heaps_ub[heap_i] < i)
    heaps_ub[heap_i] = i;
int j = (ffs(heap->freemap[i]) - 1);
Contributor

Where is ffs defined? I'm getting an implicit-declaration compiler warning on Windows.

Contributor Author

ffs stands for find first set (closely related to count trailing zeros: ffs(x) == ctz(x) + 1 for nonzero x). I thought this was provided by the libc everywhere.
We may need a few defines in some headers if the name of the function is not the same (or provide it ourselves).

Contributor

It looks like we might be able to use __builtin_ffs for MinGW (see https://bugs.freedesktop.org/show_bug.cgi?id=30277), but that won't help for MSVC. We could maybe grab musl's implementation (http://git.musl-libc.org/cgit/musl/tree/src/misc/ffs.c); I'm still trying to find atomic.h and the definition of a_ctz_l though.

Edit: ok, found atomic.h; musl uses arch-dependent assembly there, so never mind. Maybe something with __lzcnt for MSVC would work: http://msdn.microsoft.com/en-us/library/bb384809(v=vs.120).aspx

Sponsor Member

There's an implementation of ntz (aka ctz) in libsupport:

static int ntz(uint32_t x)

Wikipedia gives a list of transformations for easily converting between these bit operations:
http://en.wikipedia.org/wiki/Find_first_set#Properties_and_relations
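
For reference, the transformation needed here is a simple off-by-one. A sketch at the Julia level (this ffs is a stand-in definition to show the relation, not the C function under discussion; trailing_zeros plays the role of ntz):

# ffs(x) is the 1-based index of the least significant set bit, 0 when x == 0
ffs(x::Integer) = x == 0 ? 0 : trailing_zeros(x) + 1

ffs(0b1000)  # == 4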

Contributor

Great, yeah. Let's just not leave in anything that assumes POSIX.

Contributor Author

Well, I completely forgot about this. I just added something that looks right, but I don't have MinGW/MSVC, so I haven't even compiled the code...

@andreasnoack
Member

In #9270 @JeffBezanson proposed the following example as a benchmark for this PR. The LU factorization of a matrix of BigFloats allocates and discards a lot of small BigFloats, so considerable time is spent in GC. With latest master I get

julia> @time lufact!(A);
elapsed time: 0.403522395 seconds (67595000 bytes allocated)

julia> @time lufact!(A);
elapsed time: 0.707745403 seconds (64033072 bytes allocated, 67.91% gc time)

julia> @time lufact!(A);
elapsed time: 0.40898759 seconds (64033072 bytes allocated, 46.39% gc time)

julia> @time lufact!(A);
elapsed time: 0.611059059 seconds (64033072 bytes allocated, 63.41% gc time)

julia> @time lufact!(A);
elapsed time: 0.396888493 seconds (64033072 bytes allocated, 45.54% gc time)

julia> @time lufact!(A);
elapsed time: 0.430425332 seconds (64033072 bytes allocated, 47.28% gc time)

julia> @time lufact!(A);
elapsed time: 0.60929912 seconds (64033072 bytes allocated, 63.39% gc time)

julia> @time lufact!(A);
elapsed time: 0.412151807 seconds (64033072 bytes allocated, 44.69% gc time)

julia> @time lufact!(A);
elapsed time: 0.632113227 seconds (64033072 bytes allocated, 63.05% gc time)

julia> @time lufact!(A);
elapsed time: 0.41026657 seconds (64033072 bytes allocated, 46.10% gc time)

and with this branch the numbers are

julia> A = big(randn(100, 100));

julia> @time lufact!(A);
elapsed time: 0.578027366 seconds (64 MB allocated, 43.96% gc time in 2 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.367875097 seconds (61 MB allocated, 37.41% gc time in 1 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.412482129 seconds (61 MB allocated, 41.65% gc time in 1 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.589456554 seconds (61 MB allocated, 60.84% gc time in 2 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.411744976 seconds (61 MB allocated, 42.76% gc time in 1 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.560099494 seconds (61 MB allocated, 58.66% gc time in 2 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.419516315 seconds (61 MB allocated, 44.61% gc time in 1 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.405336 seconds (61 MB allocated, 42.50% gc time in 1 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.573393226 seconds (61 MB allocated, 59.15% gc time in 2 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.412730634 seconds (61 MB allocated, 41.91% gc time in 1 pauses with 0 full sweep)

The numbers look very similar.

Another thing: from the discussion in another thread some time ago, I got the impression that decimal MBs were the standard, but it appears that this branch uses binary MBs.
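
The byte counts bear that out; a quick check, assuming the branch divides by 1024^2:

64033072 / 1024^2  # ≈ 61.07, matching the "61 MB" printed by this branch
64033072 / 1000^2  # ≈ 64.03, what decimal MB would print

so the branch is indeed reporting binary MBs.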

@vchuravy
Sponsor Member

@andreasnoack The mean and variance are interesting (computed from the numbers you supplied):

master:

  • mean: 0.5022458996
  • variance: 0.01485280295568762

generational:

  • mean: 0.47306617910000004
  • variance: 0.007977234256117448
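
Those statistics come straight from the timings above:

# timings copied from the runs above; var is Base's sample variance
master = [0.403522395, 0.707745403, 0.40898759, 0.611059059, 0.396888493,
          0.430425332, 0.60929912, 0.412151807, 0.632113227, 0.41026657]
gen    = [0.578027366, 0.367875097, 0.412482129, 0.589456554, 0.411744976,
          0.560099494, 0.419516315, 0.405336, 0.573393226, 0.412730634]

mean(master), var(master)  # ≈ (0.50225, 0.01485)
mean(gen), var(gen)        # ≈ (0.47307, 0.00798)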

Of course, given the sample size this is not conclusive, but the tendency seems to be slightly faster and more consistent performance. If the consistency holds with a larger sample size, it would be a very nice thing for real-time audio and video software.

@carnaval
Contributor Author

Hey, thanks for testing this branch; it badly needs it. I'm pretty sure there are performance and correctness regressions lurking.
In that case, 80% of the time is spent in finalizers, which is reported as GC time. Building with GC_TIME reveals this (post_mark is where finalizers are checked and C ones are run). GC_FINAL_STATS should also show this, but it doesn't as of now (small time-counting bug; I'll push a fix).
I'm not sure we can do much about this. Maybe optimize the way finalizers are stored/checked, but in that case I think the bottleneck is the actual finalizing code freeing the memory (no measurements were done to support this claim).

@carnaval
Contributor Author

A quick run through perf seems to show that finalizer management overhead (mostly the hashtable lookup for registration) is actually not negligible; however, most of the time is still in mpfr_clear & malloc/free.
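
To make the registration cost concrete at the Julia level (Handle and free_handle are hypothetical stand-ins, just to show where the per-object work happens):

type Handle                  # hypothetical wrapper around a raw C allocation
    ptr::Ptr{Void}
end
free_handle(h::Handle) = ccall(:free, Void, (Ptr{Void},), h.ptr)

h = Handle(ccall(:malloc, Ptr{Void}, (Csize_t,), 64))
finalizer(h, free_handle)    # each registration currently costs a hashtable insertion

BigFloat does the moral equivalent of this on every construction, which is why the lookup shows up in perf.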

@timholy
Sponsor Member

timholy commented Dec 11, 2014

To me it seems that BigInt computations are the poster child for reusing memory rather than allocating & freeing: if you're doing computations in a loop, you ideally want pre-allocated temp variables in which to store your intermediate results, rather than allocating and freeing on each add and multiply.

I can't find it right now, but I swear I remember a recent conversation on one of our mailing lists in which Python handily beats Julia for BigInt computations, by such a large factor that they must be doing something different.

@tkelman
Contributor

tkelman commented Dec 11, 2014

@JeffBezanson
Sponsor Member

We haven't really tried to optimize BigInts yet. The first step is to be able to stack-allocate BigInts that fit in 1 word. The new Bytes type Stefan is developing for strings could be useful here. This isn't easy but should be doable.

@andreasnoack
Member

@carnaval Thanks for the feedback and sorry for going off topic in your PR.

Just to get a sense of the price we are paying for reallocation of BigInts right now, I tried adding mutating arithmetic for BigInts and running the benchmark in @tkelman's link. This relates to #249, #1115, #3022, #3424.

The code is in this gist. Note that we would probably be able to get (at least a good part of) the speedup without changing the code if we allowed += and *= to update in place. I'm wondering if we are giving too much preference to machine numbers in the decision that a += b means a = a + b.
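
Roughly, the mutating primitive looks like this (a minimal sketch in the spirit of the gist; the name add! and this wrapper are illustrative, not the gist's actual code):

# In-place BigInt addition: __gmpz_add is GMP's mpz_add, which writes
# a + b into r without allocating a new mpz
function add!(r::BigInt, a::BigInt, b::BigInt)
    ccall((:__gmpz_add, :libgmp), Void, (Ptr{BigInt}, Ptr{BigInt}, Ptr{BigInt}), &r, &a, &b)
    return r
end

A mutating a += b in the hot loop then becomes add!(a, a, b); GMP explicitly allows the output to alias an input, so this is safe at the mpz level.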

julia> @time GMP2.pidigits(10000); # Immutable arithmetic
elapsed time: 4.470367684 seconds (8498375224 bytes allocated, 64.35% gc time)

julia> @time GMP2.pidigits2(10000); # Mutable arithmetic
elapsed time: 0.913993363 seconds (262817612 bytes allocated, 13.60% gc time)

In contrast, the GHC numbers on my machine are

Andreass-MacBook-Pro:Haskell_GHC andreasnoack$ ./bin 10000 +RTS -sstderr > output.txt
   8,450,072,008 bytes allocated in the heap
       5,839,000 bytes copied during GC
         318,288 bytes maximum residency (116 sample(s))
         117,808 bytes maximum slop
               4 MB total memory in use (1 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     15427 colls,     0 par    0.07s    0.08s     0.0000s    0.0001s
  Gen  1       116 colls,     0 par    0.01s    0.01s     0.0001s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    2.98s  (  3.01s elapsed)
  GC      time    0.08s  (  0.09s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    3.06s  (  3.10s elapsed)

  %GC     time       2.5%  (2.9% elapsed)

  Alloc rate    2,831,170,217 bytes per MUT second

  Productivity  97.5% of total user, 96.4% of total elapsed

but this is not with the LLVM backend. Notice that the allocation is similar to the non-mutating Julia version.

@carnaval
Contributor Author

While using explicit mutating arithmetic is surely unbeatable performance-wise, I think that using it for ?= operators would lead to annoying aliasing bugs in generic code, and you would have to write different versions for big/small numbers anyway. I'm all for exposing those primitives as separate functions.
Am I correctly reading the ~3s timing for the Haskell program? We should be able to get closer by optimizing finalizers & allocation for GMP.
I'm not sure why finalizers are using a hashtable anyway, since we go through the whole list at every collection (and I'd bet that jl_finalize is less performance-critical than bignum allocation).
About allocation, we could use our pools to store small bignums, but it would generate a lot of garbage on realloc calls, so I'm not sure it is worth it.
I don't think there is much more we can do without static lifetime analysis, but I'd love to be wrong.

@carnaval
Contributor Author

Ha, @timholy is right: we could have bignums register as dead in their finalizers by putting themselves in a recycling pool. I believe this can be done purely in Julia. There is still the problem of growing/shrinking this pool, but if it is the right size it would remove a lot of malloc/free calls.
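
Something like this (purely illustrative; the names, the eager fallback, and the explicit retire! standing in for the finalizer hook are all made up, and the sizing policy is the open question):

const bigfloat_pool = BigFloat[]

# Instead of handing its storage back to free(), a dying value parks
# itself in the pool; this is what the finalizer would do.
retire!(x::BigFloat) = (push!(bigfloat_pool, x); nothing)

# Allocation tries the pool first and only allocates when it is empty.
new_bigfloat() = isempty(bigfloat_pool) ? BigFloat(0) : pop!(bigfloat_pool)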

@andreasnoack
Member

Yes. On my MacBook the Haskell version takes 3s. It is quite a bit slower than expected from the blog post.

using [explicit mutating arithmetic] for ?= operators would lead to annoying aliasing bugs in generic code

You might be right, but could you give an example? While modifying the benchmark code I almost convinced myself that it wouldn't be a problem.

@carnaval
Contributor Author

a = 0; b = a; a += 1

(a == b) gives different results with mutating vs. non-mutating arithmetic. I agree that it may not show up in idiomatic code, but I'd say that's even worse, since the bugs will be harder to find.

@JeffBezanson
Sponsor Member

Mutable numbers are semantically unacceptable and simply aren't going to happen. We clearly have a significant list of performance improvements to try here.

@vtjnash
Sponsor Member

vtjnash commented Dec 11, 2014

Another example:

A += A + B

Any expression where the input and output are allowed to alias could be invalid or have subtle bugs. I suspect that even

A *= A'

could be rather problematic (assuming A is square).

@JeffBezanson
Sponsor Member

The issue is that you want to be able to use the same code for Int and BigInt, and mutable BigInts would break that. If += mutated some types and not others, we'd basically have to tell people never to use it, to make their code "BigInt safe" :shudder:.
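
Concretely (an illustrative sketch, not code from this PR): this generic accumulation is correct for Int and BigInt alike today, but with a mutating += every element of out would silently end up aliasing the same accumulator, so the result would be wrong only for BigInt:

function running_sums(v)
    s = zero(eltype(v))
    out = similar(v, 0)
    for x in v
        s += x        # mutating += would update s in place...
        push!(out, s) # ...making every entry of out the same object
    end
    return out
end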

@StefanKarpinski
Sponsor Member

At this point Julia is in the shockingly rare position that generic code really is generic – you can write something generically and actually expect it to work for all kinds of types. Making += mutating and/or having semantically mutable number types would ruin this. I think the ability to write really generic code is more fragile than people may realize.

@carnaval
Contributor Author

@andreasnoack I've pushed a change for finalizers. With it and the few changes that I hadn't pushed yet, the cholfact! bench looks around 2x faster with 2x less memory usage compared to master. About the same for the (non-mutating) pidigits(10000). Can you confirm?

@carnaval
Contributor Author

By the way, this should break finalizer ordering; do we make any guarantees about it?

@timholy
Sponsor Member

timholy commented Dec 11, 2014

@carnaval, my guess is the finalizer runs too late: where this would make the biggest difference is in a loop inside a function, and of course the finalizer won't run frequently inside the loop. I fear this may need static analysis.

@andreasnoack
Member

Thanks for the examples. I can see the problem. I was thinking that it wouldn't break genericness to update a in a += b, because the expression tells us that our existing a is free to use for the result. However, the examples taught me that this is only true if a isn't aliased with another variable.

It appears that if *= were allowed to mutate a BigInt, it would be necessary to disallow aliasing in order to retain the number feel of BigInts. Aliasing doesn't cause trouble now because we have made sure that all BigInt functions allocate a new output variable. Is there a way of enforcing that two BigInt variables are never aliased?

@JeffBezanson I'm only trying to understand the reasoning and make the costs visible here. I appreciate the explanations, and you are probably completely right, but "semantically unacceptable" carries almost no meaning in my CS-untrained head. In contrast, examples are really good for my understanding.

@StefanKarpinski I don't want to break genericness here, not even for performance purposes. I want to understand why the "proposal" would break genericness. As explained above, the mutating behavior was only meant for the *= and += type operations, which I thought wouldn't cause trouble.

@vtjnash I don't see how A += A+B is a problem. A+B has to be stored in a temporary variable, say C = A+B (only += should be mutating), and then afterwards A is updated with the value of A+C. Am I wrong? Regarding A *= A', was the prime intentional? With no prime or lazy transpose, the expression should throw an error. I guess that would be possible with is(A, A).
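
For the direct case, is gives exactly that check (an illustrative sketch; it cannot detect indirect sharing, which is the hard part):

assert_unaliased(a, b) = is(a, b) ? error("arguments alias the same object") : nothing

x = BigInt(1); y = x
assert_unaliased(x, BigInt(1))  # passes: equal values, distinct objects
assert_unaliased(x, y)          # throws: x and y are the same object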

carnaval added a commit that referenced this pull request Jan 24, 2015
Generational behavior for the garbage collector
@carnaval merged commit 0bfe05d into master Jan 24, 2015
@tknopp
Contributor

tknopp commented Jan 24, 2015

Awesome. Thanks for this contribution.

@timholy
Sponsor Member

timholy commented Jan 24, 2015

Hooray! Quite a long saga, but I am really looking forward to this.

@StefanKarpinski
Sponsor Member

slow clap orson welles


@ivarne
Sponsor Member

ivarne commented Jan 24, 2015

The angry mob is hopefully all on release-0.3.

@@ -217,7 +219,9 @@ static jl_value_t *eval(jl_value_t *e, jl_value_t **locals, size_t nl)
     size_t i;
     for (i=0; i < nl; i++) {
         if (locals[i*2] == sym) {
-            return (locals[i*2+1] = eval(args[1], locals, nl));
+            locals[i*2+1] = eval(args[1], locals, nl);
+            gc_wb(jl_current_module, locals[i*2+1]); // not sure about jl_current_module
Sponsor Member

I'm not sure either. This is a stack variable slot; locals is a JL_GC_PUSHARGS alloca'd location.

Sponsor Member

Since locals is a stack location, this gc_wb seems unnecessary? (Or perhaps it should target jl_current_task?)

@blakejohnson
Contributor

@carnaval I'm seeing some major performance improvement with the new GC. The performance test numbers in our quantum system simulator (https://github.com/BBN-Q/QSimulator.jl) all dropped by nearly a factor of 2. Nice work!

@timholy
Sponsor Member

timholy commented Jan 26, 2015

While I'm still adjusting to this change, it is already subtly changing my Julia programming style: there's a whole "mid-layer" of problems where I find I'm noticeably less worried about allocating memory than I used to be.

I'd call that a pretty big impact.

@IainNZ
Member

IainNZ commented Jan 26, 2015

I'd love to build a more comprehensive performance test bed with a few algorithms implemented in a few different styles, some much more garbage-generating than others. The idea is to test how performance varies for code that isn't written optimally (like the kinds of code you see on StackOverflow).

@johnmyleswhite
Member

+1 to @IainNZ's idea

@ViralBShah
Member

There are improvements in the perf benchmarks of vectorized code. While expected, it is nice to actually see it realized. The stockcorr benchmark is now equally fast in the vectorized and devectorized cases.

@@ -384,11 +391,13 @@ static jl_value_t *eval(jl_value_t *e, jl_value_t **locals, size_t nl)
     // temporarily assign so binding is available for field types
     check_can_assign_type(b);
     b->value = (jl_value_t*)dt;
+    gc_wb_binding(b,dt);
Sponsor Member

Why do you use gc_wb_binding(((void**)b)-1, dt); in the other two usages of gc_wb_binding, but not here? @carnaval

Contributor Author

My mistake. Thanks!

@JeffBezanson mentioned this pull request Feb 12, 2015
-        rval = boxed(emit_expr(r, ctx, true),ctx,rt);
+        rval = boxed(emit_expr(r, ctx, true), ctx, rt);
+        if (!is_stack(bp)) {
+            Value* box = builder.CreateGEP(bp, ConstantInt::get(T_size, -1));
Sponsor Member

I think this should be calling gc_queue_binding on -1 - offsetof(jl_binding_t, value)/sizeof(jl_value_t*), no?

Edit: eep, this could be a jl_binding_t, or a closure location (Box)

Contributor Author

It's probably time for me to understand a bit more about the different ways we store variables.
The closure-location case is when a captured variable is assigned in the child scope. At that point, if I understand correctly, we store every such variable in a separate box? For those cases the code here seems correct, then?
I don't see how a jl_binding_t could get here, since I thought they were only used for globals, which are handled by the runtime in jl_checked_assign.
The rest is stored in the local GC frame on the stack and doesn't need a write barrier.
Is there a case I'm not considering?
Thanks for taking the time to go through this; the codegen can be a bit opaque to me at times :-)
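
For reference, the captured-and-assigned case looks like this at the Julia level (a minimal illustration; the Box is introduced by codegen and is invisible in the source):

function counter()
    x = 0                    # captured below and assigned in the child scope,
    return () -> (x += 1; x) # so x lives in a heap-allocated box shared by both scopes
end

c = counter()
c(); c()  # 1, then 2: each assignment stores into the box, hence the write barrier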

Sponsor Member

Oops, no, you are right. I forgot there is a branch in emit_assignment on whether this is a jl_binding_t or a Box.

/me goes looking elsewhere for the source of his bug

@tkelman
Contributor

tkelman commented Feb 15, 2015

This is really odd: bisect points to this merge as breaking the Linux-to-Windows cross-compile. During bootstrap of the system image, we get a segfault while compiling inference. This could easily be a bug in Wine. https://gist.github.com/c361c8157820e4e8734c

@vtjnash
Sponsor Member

vtjnash commented Feb 15, 2015

Heh, I had just come to the exact same conclusion a few minutes ago.

@ihnorton
Member

This really deserves an entry in NEWS.

A blog post with some performance examples would be neat too; it could be written by anyone who sees a big boost from this in an interesting use case. (@ssfrr?)

@ViralBShah
Member

I think the perf benchmark on European option pricing saw a major improvement, and it is a good one to talk about. I have generally seen many vectorized codes speed up. I was just sitting with someone working on a wave-equation solver that was largely vectorized code; it was slower than Octave on 0.3 but became faster than Octave on 0.4.

@ViralBShah
Member

On a slightly unrelated note, it would be great to rebase the threading branch on top of master, where the work is largely about making the new GC thread-safe.

Before the generational GC, the runtime was largely thread-safe, and we probably want to merge the threading branch into master (disabled by default) so that it is easier to maintain and can receive more contributions.

Cc: @kpamnany @ArchRobison, who have worked on the threading branch.

#endif

#ifdef __cplusplus
extern "C" {
#endif

typedef struct _gcpage_t {
char data[GC_PAGE_SZ];
#pragma pack(push, 1)
Sponsor Member

Why do you forcibly un-align all of these data structures (including overriding the attempted __attribute__((aligned (64))) below)? Although aligning to 64 bytes was probably a bit overkill too.
