Generational behavior for the garbage collector #8699

Merged
merged 2 commits from the ob/gengc branch on Jan 24, 2015

Conversation

@carnaval
Contributor

carnaval commented Oct 16, 2014

Same as #5227 but living in this repo & squashed.

@catawbasam


Contributor

catawbasam commented Oct 16, 2014

Looking forward to this!

src/gc.c Outdated
heaps_lb[heap_i] = i;
if (heaps_ub[heap_i] < i)
    heaps_ub[heap_i] = i;
int j = (ffs(heap->freemap[i]) - 1);


@tkelman

tkelman Oct 17, 2014

Contributor

Where is ffs defined? I'm getting an implicit declaration compiler warning on windows.


@carnaval

carnaval Oct 18, 2014

Contributor

ffs stands for find first set (essentially count trailing zeros, plus one). I thought this was provided by the libc everywhere.
We may need a few defines in some headers if the name of the function is not the same (or provide it ourselves).


@tkelman

tkelman Oct 19, 2014

Contributor

It looks like we might be able to use __builtin_ffs for MinGW (see https://bugs.freedesktop.org/show_bug.cgi?id=30277), but that won't help for MSVC. We could maybe grab musl's implementation http://git.musl-libc.org/cgit/musl/tree/src/misc/ffs.c - I'm still trying to find atomic.h and the definition of a_ctz_l though.

Edit: ok found atomic.h, looks like it's using arch-dependent assembly in musl, so nevermind. Maybe something with __lzcnt for MSVC would work http://msdn.microsoft.com/en-us/library/bb384809(v=vs.120).aspx


@vtjnash

vtjnash Nov 22, 2014

Member

there's an implementation of ntz (aka ctz) in libsupport:

static int ntz(uint32_t x)

Wikipedia gives a list of identities for easily converting from one of these operations to another:
http://en.wikipedia.org/wiki/Find_first_set#Properties_and_relations
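
For illustration, the identity in question written as a quick Julia sketch (the portable fix itself would be C code built on that ntz):

# find-first-set in terms of count-trailing-zeros (ntz/ctz):
# by convention ffs(0) == 0; otherwise ffs(x) == ntz(x) + 1
my_ffs(x::Integer) = x == 0 ? 0 : trailing_zeros(x) + 1

my_ffs(0b01000)    # 4: the lowest set bit is bit 3 (counting from 0)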


@tkelman

tkelman Nov 22, 2014

Contributor

great, yeah, let's just not leave in anything that assumes posix


@carnaval

carnaval Nov 22, 2014

Contributor

Well I completely forgot about this. I just added something which looks right but I don't have mingw/msvc so I didn't even compile the code...

@andreasnoack


Member

andreasnoack commented Dec 10, 2014

In #9270 @JeffBezanson proposed the following example as a benchmark for this PR. The LU factorization of a matrix of BigFloats allocates and discards a lot of small BigFloats, so considerable time is spent in GC. With latest master I get

julia> @time lufact!(A);
elapsed time: 0.403522395 seconds (67595000 bytes allocated)

julia> @time lufact!(A);
elapsed time: 0.707745403 seconds (64033072 bytes allocated, 67.91% gc time)

julia> @time lufact!(A);
elapsed time: 0.40898759 seconds (64033072 bytes allocated, 46.39% gc time)

julia> @time lufact!(A);
elapsed time: 0.611059059 seconds (64033072 bytes allocated, 63.41% gc time)

julia> @time lufact!(A);
elapsed time: 0.396888493 seconds (64033072 bytes allocated, 45.54% gc time)

julia> @time lufact!(A);
elapsed time: 0.430425332 seconds (64033072 bytes allocated, 47.28% gc time)

julia> @time lufact!(A);
elapsed time: 0.60929912 seconds (64033072 bytes allocated, 63.39% gc time)

julia> @time lufact!(A);
elapsed time: 0.412151807 seconds (64033072 bytes allocated, 44.69% gc time)

julia> @time lufact!(A);
elapsed time: 0.632113227 seconds (64033072 bytes allocated, 63.05% gc time)

julia> @time lufact!(A);
elapsed time: 0.41026657 seconds (64033072 bytes allocated, 46.10% gc time)

and with this branch the numbers are

julia> A = big(randn(100, 100));

julia> @time lufact!(A);
elapsed time: 0.578027366 seconds (64 MB allocated, 43.96% gc time in 2 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.367875097 seconds (61 MB allocated, 37.41% gc time in 1 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.412482129 seconds (61 MB allocated, 41.65% gc time in 1 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.589456554 seconds (61 MB allocated, 60.84% gc time in 2 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.411744976 seconds (61 MB allocated, 42.76% gc time in 1 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.560099494 seconds (61 MB allocated, 58.66% gc time in 2 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.419516315 seconds (61 MB allocated, 44.61% gc time in 1 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.405336 seconds (61 MB allocated, 42.50% gc time in 1 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.573393226 seconds (61 MB allocated, 59.15% gc time in 2 pauses with 0 full sweep)

julia> @time lufact!(A);
elapsed time: 0.412730634 seconds (61 MB allocated, 41.91% gc time in 1 pauses with 0 full sweep)

The numbers look very similar.

Another thing: from the discussion in another thread some time ago, I got the impression that decimal MBs were the standard, but it appears that this branch uses binary MBs.

@vchuravy


Member

vchuravy commented Dec 10, 2014

@andreasnoack The mean and variance of those timings are interesting (computed from the numbers you supplied):

master:

  • mean: 0.5022458996
  • variance: 0.01485280295568762

generational:

  • mean: 0.47306617910000004
  • variance: 0.007977234256117448

Of course, due to the small sample size this is not conclusive, but the tendency is toward slightly faster and more consistent performance. If the consistency holds up with a larger sample size, it would be a very nice thing for real-time audio and video software.
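
For reference, those summary numbers can be reproduced directly from the master timings listed above:

times = [0.403522395, 0.707745403, 0.40898759, 0.611059059, 0.396888493,
         0.430425332, 0.60929912, 0.412151807, 0.632113227, 0.41026657]
mean(times), var(times)    # ≈ (0.5022, 0.0149)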

@carnaval


Contributor

carnaval commented Dec 10, 2014

Hey, thanks for testing this branch, it badly needs it. I'm pretty sure there are performance and correctness regressions lurking.
In that case, 80% of the time is spent in finalizers, which is reported as gc time. Using GC_TIME reveals this (post_mark is where finalizers are checked and the C ones are run). GC_FINAL_STATS should also show this but it doesn't as of now (small time-counting bug, I'll push a fix).
I'm not sure we can do much about this. Maybe optimize the way finalizers are stored/checked, but in that case I think the bottleneck is in the actual finalizing code freeing the memory (no measurements were done to back up this claim).

@carnaval


Contributor

carnaval commented Dec 10, 2014

A quick run through perf suggests that finalizer management overhead (mostly the hashtable lookup for registration) is actually not negligible; however, most of the time is still in mpfr_clear & malloc/free.

@timholy


Member

timholy commented Dec 11, 2014

To me it seems that BigInt computations are the poster child for reusing memory, not allocating & freeing---if you're doing computations in a loop, you ideally want to have pre-allocated temp variables in which you store your intermediate results, rather than allocating and freeing on each add and multiply.

I can't find it right now, but I swear I remember a recent conversation on one of our mailing lists in which python handily beats Julia for BigInt computations, by such a large factor that they must be doing something different.

@tkelman

This comment has been minimized.

@JeffBezanson


Member

JeffBezanson commented Dec 11, 2014

We haven't really tried to optimize BigInts yet. The first step is to be able to stack-allocate BigInts that fit in 1 word. The new Bytes type Stefan is developing for strings could be useful here. This isn't easy but should be doable.

@andreasnoack


Member

andreasnoack commented Dec 11, 2014

@carnaval Thanks for the feedback and sorry for going off topic in your PR.

Just to get a sense of the price we are paying for reallocation of BigInts right now, I tried adding mutating arithmetic for BigInts and ran the benchmark from @tkelman's link. This relates to #249, #1115, #3022, #3424.

The code is in this gist. Note that we would probably be able to get (at least a good part of) the speedup without changing the code if we allowed += and *= to update in place. I'm wondering if we are giving too much preference to machine numbers in the decision that a += b means a = a + b.
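
Roughly speaking, the mutating operations are thin wrappers around GMP's in-place entry points, along these lines (an illustrative sketch, not the gist's exact code; add! is a made-up name):

# in-place r = a + b, reusing r's existing mpz storage
function add!(r::BigInt, a::BigInt, b::BigInt)
    ccall((:__gmpz_add, :libgmp), Void,
          (Ptr{BigInt}, Ptr{BigInt}, Ptr{BigInt}), &r, &a, &b)
    return r
end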

julia> @time GMP2.pidigits(10000); # Immutable arithmetic
elapsed time: 4.470367684 seconds (8498375224 bytes allocated, 64.35% gc time)

julia> @time GMP2.pidigits2(10000); # Mutable arithmetic
elapsed time: 0.913993363 seconds (262817612 bytes allocated, 13.60% gc time)

In contrast, the GHC numbers on my machine are

Andreass-MacBook-Pro:Haskell_GHC andreasnoack$ ./bin 10000 +RTS -sstderr > output.txt
   8,450,072,008 bytes allocated in the heap
       5,839,000 bytes copied during GC
         318,288 bytes maximum residency (116 sample(s))
         117,808 bytes maximum slop
               4 MB total memory in use (1 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     15427 colls,     0 par    0.07s    0.08s     0.0000s    0.0001s
  Gen  1       116 colls,     0 par    0.01s    0.01s     0.0001s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    2.98s  (  3.01s elapsed)
  GC      time    0.08s  (  0.09s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    3.06s  (  3.10s elapsed)

  %GC     time       2.5%  (2.9% elapsed)

  Alloc rate    2,831,170,217 bytes per MUT second

  Productivity  97.5% of total user, 96.4% of total elapsed

but this is not with the LLVM backend. Notice that the allocation is similar to the non-mutating Julia version.

@carnaval


Contributor

carnaval commented Dec 11, 2014

While using explicit mutating arithmetic is surely unbeatable performance-wise, I think that using it for ?= operators would lead to annoying aliasing bugs in generic code, and you would have to write different versions for big/small numbers anyway. I'm all for exposing those primitives as separate functions.
Am I reading the ~3s timing for the Haskell program correctly? We should be able to get closer by optimizing finalizers & allocation for gmp.
I'm not sure why finalizers are using a hashtable anyway, since we go through the whole list at every collection (and I'd bet that jl_finalize is less performance-critical than bignum allocation).
About allocation, we could use our pools to store small bignums, but it would generate a lot of garbage on realloc calls, so I'm not sure it is worth it.
I don't think there is much more we can do without static lifetime analysis, but I'd love to be wrong.

@carnaval


Contributor

carnaval commented Dec 11, 2014

Ha, @timholy is right, we could have bignums register as dead in their finalizers by putting themselves in a recycling pool. I believe this can be done purely in julia. There is still the problem of growing/shrinking this pool but if it is the right size it would remove a lot of malloc/free calls.
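
Something in this spirit (purely a sketch; the names are made up and a real version would need to bound the pool's size):

# hypothetical recycling pool: a finalizer parks dead bignums for reuse
# instead of letting their storage be freed and reallocated
const bignum_pool = BigInt[]
recycle(x::BigInt) = push!(bignum_pool, x)

function pooled_bignum()
    x = isempty(bignum_pool) ? BigInt(0) : pop!(bignum_pool)
    finalizer(x, recycle)    # when x becomes unreachable, park it in the pool
    return x
end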

@andreasnoack


Member

andreasnoack commented Dec 11, 2014

Yes. On my MacBook the Haskell version takes 3s. It is quite a bit slower than expected from the blog post.

using [explicit mutating arithmetic] for ?= operators would lead to annoying aliasing bugs in generic code

You might be right, but could you give an example? While modifying the benchmark code I almost convinced myself that it wouldn't be a problem.

@carnaval


Contributor

carnaval commented Dec 11, 2014

a = 0; b = a; a += 1
(a == b) gives different results in mutating/non-mutating arithmetic.
I agree that it may not show up in idiomatic code, but I'd say it's even worse since bugs will be harder to find.
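
In runnable form, with today's non-mutating semantics:

a = big(0)
b = a        # b is bound to the very same BigInt object as a
a += 1       # today this rebinds a to a brand-new BigInt, so b still holds 0
a == b       # false; with a mutating += both would now be 1 and this would be true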

@JeffBezanson


Member

JeffBezanson commented Dec 11, 2014

Mutable numbers are semantically unacceptable and simply aren't going to happen. We clearly have a significant list of performance improvements to try here.

@vtjnash


Member

vtjnash commented Dec 11, 2014

Another example:

A += A + B

Any expression where the input and output are allowed to alias could be invalid or have subtle bugs.

I suspect that even

A *= A'

could be rather problematic (assuming A is square).

@JeffBezanson


Member

JeffBezanson commented Dec 11, 2014

The issue is that you want to be able to use the same code for Int and BigInt, and mutable BigInts would break that. If += mutated some types and not others, we'd basically have to tell people to never use it, to make their code "BigInt safe" :shudder:.

@StefanKarpinski


Member

StefanKarpinski commented Dec 11, 2014

At this point Julia is in the shockingly rare position that generic code really is generic – you can write something generically and actually expect it to work for all kinds of types. Making += mutating and/or having semantically mutable number types would ruin this. I think the ability to write really generic code is more fragile than people may realize.

@carnaval


Contributor

carnaval commented Dec 11, 2014

@andreasnoack I've pushed a change for finalizers. With it and a few changes that I hadn't pushed yet, the cholfact! bench looks around 2x faster with about half the memory usage compared to master. About the same for the (non-mutating) pidigits(10000). Can you confirm?

@carnaval


Contributor

carnaval commented Dec 11, 2014

By the way, this should break finalizer ordering; do we make any guarantees about it?

@timholy


Member

timholy commented Dec 11, 2014

@carnaval, my guess is the finalizer is too late---where this would make the biggest difference is in a loop inside a function, and of course the finalizer won't run frequently inside the loop. I fear this may need static analysis.

@andreasnoack


Member

andreasnoack commented Dec 11, 2014

Thanks for the examples. I can see the problem. I was thinking that it wouldn't break genericness to update a in a+=b because the expression tells us that our existing a is free to use for the result. However, the examples taught me that this is only true if a isn't aliased with another variable.

It appears that if we allowed *= to mutate a BigInt, it would be necessary to disallow aliasing in order to retain the number feel of BigInts. Aliasing doesn't cause trouble now because we have made sure that all BigInt functions allocate a new output variable. Is there a way of enforcing that two BigInt variables are never aliased?

@JeffBezanson I'm only trying to understand the reasoning and make the costs visible here. I appreciate the explanations and you are probably completely right, but "semantically unacceptable" carries almost no meaning in my CS-untrained head. In contrast, examples are really good for my understanding.

@StefanKarpinski I don't want to break genericness here. Not even for performance purposes. I want to understand why the "proposal" would break genericness. As explained above, the mutating behavior was only meant for the *= and += type operations which I thought wouldn't cause trouble.

@vtjnash I don't see how A += A + B is a problem. A + B has to be stored in a temporary variable, say C = A + B (only += should be mutating), and then afterwards A is updated with the value of A + C. Am I wrong? Regarding A *= A', was the prime intentional there? With no prime, or with a lazy transpose, the expression should throw an error. I guess that would be possible with is(A, A).

@andreasnoack


Member

andreasnoack commented Dec 11, 2014

@carnaval Cool! I can confirm the speedups. Cholesky and LU are twice as fast, and pidigits takes 1.65 seconds instead of 4.47.

@andreasnoack


Member

andreasnoack commented Dec 11, 2014

My office mate @jakebolewski pointed out that we should try the benchmarks with some deadweight on the GC. So I ran pidigits again on latest master and this branch, but after allocating

julia> s = [randstring(10) for i = 1:10_000_000];

and the results are

julia> @time GMP2.pidigits(10000); #master
elapsed time: 51.787389151 seconds (8498375224 bytes allocated, 93.30% gc time)

julia> @time GMP2.pidigits(10000); # Oscar branch
elapsed time: 7.850218366 seconds (8104 MB allocated, 88.08% gc time in 1477 pauses with 11 full sweep)

which seems quite impressive.

@JeffBezanson


Member

JeffBezanson commented Dec 11, 2014

Excellent speedup! @carnaval, is this consistent with your earlier observation that "most of the time is still in mpfr_clear & malloc/free"?

With this approach, we can't really have the finalize function; it will be too much of a performance pitfall to be useful. A possible way to get around that is to implement #1037: have finalize-able types implement a method of finalize, and have finalizer just add an object to the GC's list, and have the GC call the finalize generic function. The disadvantages seem to be (1) you can't have more than one finalizer per object, and (2) it has to be safe to call finalize on an object twice. Those conditions sound fairly reasonable.
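
In other words, something along these lines (a sketch of the proposed protocol; none of it exists yet and the names are illustrative):

# a type opts in by defining a finalize method; finalizer() would then just
# record the object, and the GC would call finalize(obj) on each dead entry
type TempFile
    path::UTF8String
end

finalize(f::TempFile) = isfile(f.path) && rm(f.path)    # must be safe to call twice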

@andreasnoack


Member

andreasnoack commented Dec 11, 2014

@jakebolewski suggested that it might be a better benchmark to allocate an array of BigInts. First of all I should note that the allocation of the array

julia> s = [big(rand(1:10)) for i = 1:10_000_000];

is way faster on the Oscar branch. The pidigits now runs in

julia> @time GMP2.pidigits(10000); # master
elapsed time: 116.508572264 seconds (8502048512 bytes allocated, 98.45% gc time)

julia> @time GMP2.pidigits(10000); # Oscar branch
elapsed time: 55.443133549 seconds (8108 MB allocated, 97.48% gc time in 1359 pauses with 10 full sweep)

and just for completeness

julia> @time GMP2.pidigits2(10000); # Semantically unacceptable solution
elapsed time: 2.686130463 seconds (249 MB allocated, 72.39% gc time in 45 pauses with 1 full sweep)

@carnaval


Contributor

carnaval commented Dec 11, 2014

Glad to see this working :-)
In fact, as you may have guessed, the GC is still the bottleneck in this case. Most of the time is now in the sweep phase, waiting on memory to load the page header only to find that the page is old and can be skipped. To make this faster, the relevant fields will have to be packed so they are spatially coherent.
It will probably be enough (for now) to simply store the page metadata in an SoA fashion (since they are already in groups of 32), as I believe we take a cache miss on each load as it is now. I will have a look (but most likely not this evening).
For finalizers, I agree with the #1037 solution. The limitations seem in line with how finalizers are used in typical code.

@tknopp


Contributor

tknopp commented Jan 24, 2015

Awesome. Thanks for this contribution.

@timholy


Member

timholy commented Jan 24, 2015

Hooray! Quite a long saga, but I am really looking forward to this.

@StefanKarpinski


Member

StefanKarpinski commented Jan 24, 2015

slow clap orson welles

@staticfloat

This comment has been minimized.

Member

staticfloat commented Jan 24, 2015

@ivarne


Member

ivarne commented Jan 24, 2015

The angry mob is hopefully all on release-0.3.

@@ -217,7 +219,9 @@ static jl_value_t *eval(jl_value_t *e, jl_value_t **locals, size_t nl)
     size_t i;
     for (i=0; i < nl; i++) {
         if (locals[i*2] == sym) {
-            return (locals[i*2+1] = eval(args[1], locals, nl));
+            locals[i*2+1] = eval(args[1], locals, nl);
+            gc_wb(jl_current_module, locals[i*2+1]); // not sure about jl_current_module


@vtjnash

vtjnash Jan 25, 2015

Member

i'm not sure either. this is a stack variable slot. locals is a JL_GC_PUSHARGS alloca'd location.


@vtjnash

vtjnash Feb 15, 2015

Member

since locals is a stack location, this gc_wb seems unnecessary? (or perhaps, should target jl_current_task?)

@blakejohnson


Contributor

blakejohnson commented Jan 26, 2015

@carnaval I'm seeing some major performance improvement with the new GC. The performance test numbers in our quantum system simulator (https://github.com/BBN-Q/QSimulator.jl) all dropped by nearly a factor of 2. Nice work!

@timholy


Member

timholy commented Jan 26, 2015

While I'm still adjusting to this change, this is already subtly changing my julia programming style: there's a whole "mid-layer" of problems where I'm finding that I'm noticeably less worried about allocating memory than I used to be.

I'd call that a pretty big impact.

@IainNZ


Member

IainNZ commented Jan 26, 2015

I'd love to make a more comprehensive performance test bed that has a few algorithms implemented in a few different styles - some much more garbage-generating than others. The idea is to test how performance varies for code that isn't written optimally (like the kinds of code you see on StackOverflow)

@johnmyleswhite


Member

johnmyleswhite commented Jan 26, 2015

+1 to @IainNZ's idea

@ViralBShah


Member

ViralBShah commented Jan 26, 2015

There are improvements in the perf benchmarks for vectorized code. While expected, it is nice to actually see it realized. The stockcorr benchmark is now equally fast in both the vectorized and devectorized cases.

@@ -384,11 +391,13 @@ static jl_value_t *eval(jl_value_t *e, jl_value_t **locals, size_t nl)
         // temporarily assign so binding is available for field types
         check_can_assign_type(b);
         b->value = (jl_value_t*)dt;
+        gc_wb_binding(b,dt);


@vtjnash

vtjnash Feb 11, 2015

Member

why do you use gc_wb_binding(((void**)b)-1, dt); in the other two usages of gc_wb_binding, but not here? @carnaval


@carnaval

carnaval Feb 11, 2015

Contributor

My mistake. Thanks !

@JeffBezanson JeffBezanson referenced this pull request Feb 12, 2015

Closed

gc enhancements #261

-        rval = boxed(emit_expr(r, ctx, true),ctx,rt);
+        rval = boxed(emit_expr(r, ctx, true), ctx, rt);
         if (!is_stack(bp)) {
             Value* box = builder.CreateGEP(bp, ConstantInt::get(T_size, -1));


@vtjnash

vtjnash Feb 12, 2015

Member

i think this should be calling gc_queue_binding on -1-offsetof(jl_binding_t,value)/sizeof(jl_value_t*)), no?

edit: eep, this could be a jl_binding_t, or a closure location (Box)


@carnaval

carnaval Feb 12, 2015

Contributor

It's probably time for me to understand a bit more about the different ways we store variables.
The case of a closure location is when a captured variable is assigned in the child scope. At this point, if I understand correctly, we store every such variable in a separate box? For those cases the code here seems correct then?
I don't see how a jl_binding_t could get here, since I thought they were only used for globals, which are handled by the runtime in jl_checked_assign.
The rest is stored on the local gc frame on the stack and doesn't need a write barrier.
Is there a case I'm not considering?
Thanks for taking the time to go through this since the codegen can be a bit opaque to me at times :-)


@vtjnash

vtjnash Feb 12, 2015

Member

oops, no, you are right. i forgot there is a branch in emit_assignment on whether this is a jl_binding_t or a Box

/me goes looking elsewhere for the source of his bug

// avoid counting remembered objects & bindings twice in perm_scanned_bytes
for(int i = 0; i < last_remset->len; i++) {
    uintptr_t item = (uintptr_t)last_remset->items[i];
    void* ptr = (void*)(item & ~(uintptr_t)1);


@vtjnash

vtjnash Feb 13, 2015

Member

how would the low bit of the item pointer have gotten set?


@carnaval

carnaval Feb 13, 2015

Contributor

I think it can't anymore. I used to use the low bit to denote objects that must be kept in the remset until the next long collection, but it was only used when the bindings didn't have their own remset. I'm pretty sure it's safe to remove this. I can't build right now to confirm.

@tkelman


Contributor

tkelman commented Feb 15, 2015

This is really odd, bisect points to this merge as breaking the Linux-to-Windows cross-compile. During bootstrap of the system image, we get a segfault while compiling inference. This could easily be a bug in Wine. https://gist.github.com/c361c8157820e4e8734c

@vtjnash


Member

vtjnash commented Feb 15, 2015

heh, i had just come to the exact same conclusion a few minutes ago

@ihnorton


Member

ihnorton commented Feb 18, 2015

This really deserves an entry in NEWS.

A blog post with some performance examples would be neat too -- could be written by anyone who sees a big boost from this in an interesting use case. (@ssfrr?)

@ViralBShah


Member

ViralBShah commented Feb 21, 2015

I think the perf benchmark on European option pricing saw a major improvement, and it is a good one to talk about. I have generally seen many vectorized codes speed up. I was just sitting with someone working on a wave-equation solver that was largely vectorized code; it was slower than Octave on 0.3 but became faster than Octave with 0.4.

@ViralBShah


Member

ViralBShah commented Feb 21, 2015

On a slightly unrelated note, it would be great to rebase the threading branch on top of master - where the work is largely to make the new GC thread safe.

Before the generational GC, the runtime was largely thread safe - and we probably want to merge it into master (disabled by default), so that it is easier to maintain and can receive more contributions.

Cc: @kpamnany @ArchRobison who have worked on the threading branch.

#endif
#ifdef __cplusplus
extern "C" {
#endif
typedef struct _gcpage_t {
char data[GC_PAGE_SZ];
#pragma pack(push, 1)


@vtjnash

vtjnash Mar 14, 2015

Member

why do you forcibly un-align all of these data structures? (including overriding the attempted __attribute__((aligned (64))) below; although aligning to 64 bytes was probably a bit overkill too)

// layout for small (<2k) objects
#define GC_PAGE_LG2 14 // log2(size of a page)
#define GC_PAGE_SZ (1 << GC_PAGE_LG2) // 16k


@vtjnash

vtjnash Apr 3, 2015

Member

does any part of this care about the system page size?

e.g. the POWER8 servers hosted by IBM are configured for 16k pages:

/proc/config.gz
CONFIG_PPC_64K_PAGES=y

(gdb) p sysconf(_SC_PAGESIZE)
$12 = 65536


@vtjnash

vtjnash Apr 3, 2015

Member

ah shoot, there actually is a really subtle one: madvise(p, GC_PAGE_SZ, MADV_DONTNEED); is assuming that p does not overlap a page boundary. but on this particular system, it will.

in particular, since pages is not the first item in the region_t, the madvise accidentally frees an extra system page in some cases.

fixed in 2711529


@carnaval

carnaval Apr 3, 2015

Contributor

this change is fine, it really should have been this way since the beginning.
Everything should work as long as gc_pg_sz = 0 mod sys_pg_sz.
We should probably start to consider huge page systems (several megs). This requires changes in the region allocator but the rest of the code should be oblivious to it.


@carnaval

carnaval Apr 3, 2015

Contributor

ah, is this 16k a typo? because if those really are 64k pages I don't see how the current code can work, since it assumes that it can free gc pages individually, which it can't.
The required change should not be too hard; we just have to check that the whole 64k chunk is free before the madvise.


@vtjnash

vtjnash Apr 3, 2015

Member

no, i just hacked around it by only calling madvise on an integral number of pages (for this platform, that would be zero)


@carnaval

carnaval Apr 3, 2015

Contributor

that certainly works for now. we'll just have to change it before merging.

@tkelman tkelman deleted the ob/gengc branch Apr 19, 2015

@yuyichao


Member

yuyichao commented on src/codegen.cpp in 7c8acce May 27, 2015

@carnaval
What do you mean by the comment here?

I did the following change on the current master

diff --git a/src/cgutils.cpp b/src/cgutils.cpp
index a2c931e..d46b612 100644
--- a/src/cgutils.cpp
+++ b/src/cgutils.cpp
@@ -1778,9 +1778,15 @@ static void emit_write_barrier(jl_codectx_t* ctx, Value *parent, Value *ptr)
     Value* parent_type = builder.CreateLoad(parenttag);
     Value* parent_mark_bits = builder.CreateAnd(parent_type, 1);

-    // the branch hint does not seem to make it to the generated code
-    //builder.CreateCall(expect_func, {parent_marked, ConstantInt::get(T_int1, 0)});
     Value* parent_marked = builder.CreateICmpEQ(parent_mark_bits, ConstantInt::get(T_size, 1));
+    // the branch hint does not seem to make it to the generated code
+#ifdef LLVM37
+    parent_marked = builder.CreateCall(expect_func, {parent_marked,
+                ConstantInt::get(T_int1, 0)});
+#else
+    parent_marked = builder.CreateCall2(expect_func, parent_marked,
+                                        ConstantInt::get(T_int1, 0));
+#endif

     BasicBlock* cont = BasicBlock::Create(getGlobalContext(), "cont");
     BasicBlock* barrier_may_trigger = BasicBlock::Create(getGlobalContext(), "wb_may_trigger", ctx->f);

and this is what I've got

julia> type A
       a
       b
       end

julia> @code_llvm A(1, 2)

define %jl_value_t* @julia_call_21016(%jl_value_t*, i64, i64) {
top:
  %3 = alloca [3 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [3 x %jl_value_t*]* %3, i64 0, i64 0
  %4 = getelementptr [3 x %jl_value_t*]* %3, i64 0, i64 2
  %5 = bitcast [3 x %jl_value_t*]* %3 to i64*
  store i64 2, i64* %5, align 8
  %6 = getelementptr [3 x %jl_value_t*]* %3, i64 0, i64 1
  %7 = bitcast %jl_value_t** %6 to %jl_value_t***
  %8 = load %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t** %8, %jl_value_t*** %7, align 8
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t* null, %jl_value_t** %4, align 8
  %9 = call %jl_value_t* @jl_box_int64(i64 signext %1)
  store %jl_value_t* %9, %jl_value_t** %4, align 8
  %10 = call %jl_value_t* @alloc_2w()
  %11 = getelementptr inbounds %jl_value_t* %10, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 140728721870992 to %jl_value_t*), %jl_value_t** %11, align 8
  %12 = getelementptr inbounds %jl_value_t* %10, i64 0, i32 0
  store %jl_value_t* %9, %jl_value_t** %12, align 8
  %13 = getelementptr inbounds %jl_value_t* %10, i64 1, i32 0
  store %jl_value_t* null, %jl_value_t** %13, align 8
  store %jl_value_t* %10, %jl_value_t** %4, align 8
  %14 = call %jl_value_t* @jl_box_int64(i64 signext %2)
  store %jl_value_t* %14, %jl_value_t** %13, align 8
  %15 = icmp eq %jl_value_t* %14, null
  br i1 %15, label %cont1, label %wb_not_null

wb_not_null:                                      ; preds = %top
  %16 = bitcast %jl_value_t** %11 to i64*
  %17 = load i64* %16, align 8
  %18 = and i64 %17, 1
  %19 = icmp ne i64 %18, 0
  %20 = call i1 @llvm.expect.i1(i1 %19, i1 false)
  br i1 %20, label %wb_may_trigger, label %cont1

wb_may_trigger:                                   ; preds = %wb_not_null
  %21 = getelementptr inbounds %jl_value_t* %14, i64 -1, i32 0
  %22 = bitcast %jl_value_t** %21 to i64*
  %23 = load i64* %22, align 8
  %24 = and i64 %23, 1
  %25 = icmp eq i64 %24, 0
  br i1 %25, label %wb_trigger, label %cont1

wb_trigger:                                       ; preds = %wb_may_trigger
  call void @gc_queue_root(%jl_value_t* %10)
  br label %cont1

cont1:                                            ; preds = %wb_trigger, %wb_may_trigger, %wb_not_null, %top
  %26 = load %jl_value_t*** %7, align 8
  store %jl_value_t** %26, %jl_value_t*** @jl_pgcstack, align 8
  ret %jl_value_t* %10
}

Notice the @llvm.expect.i1 call whose result is assigned to %20.

Edit: In this case the generated assembly is the same, but IMHO the expect call does make it into the generated code.


Member

yuyichao replied May 27, 2015

P.S. I'm looking at this because I'm thinking of using this for the fast path in the generic jlcall wrapper in #11439 .


Contributor

carnaval replied May 27, 2015

This was a long time ago; maybe LLVM improved, or maybe I did it wrong back then. Anyway, it would be cool to have this. Even if it's not too critical, since the check is quite a cheap operation, LLVM should use the hint to move the "slow" path out of the way (out of the icache). I'm not sure it would make any difference though.


Member

yuyichao replied May 27, 2015

@carnaval

At least in this case it doesn't make a difference at all (identical assembly) and I'm mostly interested in whether something not so obvious is not working.
I'll try to use it in my PR then and I guess it might be useful to export this to julia code as well.

Are you going to commit this? (IMHO, it's too small for a PR....)

@tkelman


Contributor

tkelman commented on test/perf/perfutil.jl in 7c8acce Aug 2, 2015

this apparently broke make -C test/perf sort and no one has noticed yet
