Postrelease opts #969

Merged
merged 68 commits into master on Sep 24, 2018

Conversation

@bdw (Contributor) commented Sep 24, 2018

Optimizations and stuff

jnthn and others added 30 commits September 17, 2018 17:32
For a wval that was already accessed, stash it away in a spesh slot
and access it that way. This is rather faster, avoiding the various
indirections in wval lookup.
When we hit the write barrier, annotate on the object that was being
referenced that it was pointed to by a gen2 object. Then, if it is
alive at the next GC, immediately promote it to gen2 rather than it
having another stay in the nursery.

This makes a "populate a million element array" benchmark run in 90%
of the time that it used to take, and will help other programs that
build up long-lived data structures. CORE.setting compilation is one
such program, and seems to also benefit from this. Meanwhile, other
benchmarks seem to not be negatively impacted by this.
Otherwise, now that guards have a read and a write register, we can
end up with information that is insufficiently accurate to keep things
alive as needed for deopt.
Otherwise, we might end up scribbling over a register that holds a
useful value on the other side of deopt, which can happen under our
more aggressive `set` elimination optimizations, but potentially in
others too.
In preparation for looking at eliminating unrequired guards. This is
a common case where we will often be able to do guard elimination.
We can eliminate them when we can prove that the guard will always be
true.
* Remove seemingly non-effective box and dead code unbox opts in the
  first optimize pass
* In the second optimize pass, introduce code to delete dead box
  instructions that are rendered no longer required due to box/unbox
  pair elimination
* Also do a little bit of naming tweakery

This gets `my $x = "fooooo"; for ^10_000_000 { my int $i = $x.chars }`
to spesh into something that deletes the boxing of the `chars`, which
makes it run in around 90% of the time it used to.

This also prepares the way for further work on optimizing `box` ops.
This covers boxing of both Num and Str objects in Perl 6, lowering
them into a fastcreate and a p6o_bind_[ns], which avoids a C call and
the indirections and checks that the normal code has to do.
This clears up things like hllboxtype instructions left behind by
lowered or deleted boxing instructions.
Do it immediately in the spesh function of the REPRs
VMArray, MVMHash, and P6opaque. This throws out a
bunch of guards that were being left in the code in
a test I did.

Also, there's no need to get_and_use_facts when we're
changing the operation that writes that register.
This is aimed at helping NQP code rather than Perl 6 code, but does
the same trick that we've done for the Perl 6 ones. Alas, the int one
loses the use of the int box cache, so needs a bit more work yet.
For JIT, we now emit assembly that does the comparisons and then does
a fetch from the integer cache if applicable. Thus the only real call
is to obtain memory from the GC if we need to box, the value then
being poked into the object at the appropriate location.

Also stub in as-yet unimplemented and unused spesh ops that we'll use
for optimizing boxing to a P6bigint embedded in a P6opaque.
Almost works, but the range check for whether to store it as a small
int always comes out false, so it always takes the bigint path.
Spotted by MasterDuke++.
Sometimes a decont really is an unbox in disguise, so handle those too.
So far, implemented it for P6opaque box around a native num and native
str.
Nothing else depends on it, and there's a potential risk of it not
leaving the dominance tree in good shape.
This aids the performance of NQP code, which uses these rather than
P6opaque-based box types. Also includes a fix for the interpretation
of sp_get_s, which I can only figure has never been exercised before,
as it was utterly broken.
Now, if it's a small bigint (fits within 32 bits), we just pull it
right out without having to make a function call. We only make a call
in the case there's a real mp_int that we need to range check and take
the value from.
We know that the body of the object will always be the direct one, not
a replacement version of it, so we can simply bind directly into that
without paying for the cost of the indirection.
The object was only just created. It must be in the nursery, since
that's a rule for using `fastcreate`. Thus it cannot be a gen2
object pointing to a nursery object, which is when the barrier would
trigger.
So that if we do nested inlines, they don't contribute to the size of
the overall thing. This seems to be for the most part a win: various
Perl 6 microbenchmarks run faster, spectest seems unimpacted, and there
is a slight impact on spectest compilation time, but nothing enormous.
jnthn and others added 26 commits September 17, 2018 17:40
This gets it to use the write barrier variant that sets the "gen2
referenced" flag on an object to encourage its earlier promotion.
It was repeated in a number of ops, and will be used in a few more
soon.
Like sp_fastbox_bi, sp_add_I, and friends, this
calculates a fixed offset into a P6opaque's body
where the bigintbody can be found and emits a
spesh op that directly accesses the memory to
do the extremely cheap nonzero check on the
contents for both smallbigint and real bigints.
Thus regaining the performance in some programs that depend heavily
on that.
Previously, we only had them on the facts of the PHI node itself.
However, those facts could then be copied and relied on elsewhere. With
this change, we now keep a list of the log guards that we depend on for
each fact, and at a PHI node we copy all the input lists into a target
list. We then, when we depend on or copy facts, just assign the input
log guards list and count (which is safe since the list is immutable,
and we use the region allocator, so there are no memory management
concerns).
In the case where the result was going into the same register that an
input argument came from, we stored the result prematurely, and then
read a NULLed-out bigint struct as one of the arguments. This only
took place in the fallback path, and when the earlier requirement was
met. Fixes various crashing spectests.
Rather than just the first one of each kind that we encounter. With
more aggressive optimizations, we could end up with multiple of them
being shuffled onto the same instruction.
We could potentially process the start after the end in this case, and
so leave inconsistent data in the active handlers data array.
It makes various invocations, and we don't end up with the calling
point properly tracked as a deopt point, in turn breaking things like
dynamic variable resolutions very close to the point that did the
loadbytecode.
* Try going for an outer frame if the immediately available one is
  inlined (this will work together with a change in Rakudo to mark
  frames that use `EVAL` as no-inline, and mean that `try EVAL $x`
  will work out fine)
* If that doesn't work, don't SEGV, but throw
This reverts commit d84778a, which
caused some breakage. Also, it might well have been better placed (or
at least just as well placed) beforehand anyway.
Or at least, change it so it works. My gut feeling is that the way
it was before should produce better code (doesn't need to save a
value during a call), and that the change I've done should have had
no semantic effect.
No need to add on to a pointer only to subtract from it again.
Eliminates some guards that we got back after various boxing and bigint
lowering optimizations. This means they were a slightly bigger win than
initially thought, because they accidentally resulted in an extra guard
in various cases.
If the inliner doesn't set it, then we can assume it's not set yet.
Note that this is a Rakudo use-case assumption, rather than covering
every possible use of the ops. This perhaps means we want to adopt a
less general mechanism in the future.
* When it's hllboolfor, be sure to delete the usage of the language
  name register, otherwise we fail to delete it (caught thanks to the
  DU chain checker)
* Also for hllboolfor don't blindly assume that the language name
  operand is a constant
* Avoid fetching obj_facts twice
* Use a better name for the facts on the target result
Parallel GC relies on us only updating the object flags from the thread
that owns the object (or the nominated thread to GC that thread's
objects). The updates caused by the write barrier touching the object
that was referenced caused races that could lead to incorrect GC
behavior, which typically showed up as an invalid SC (which was really
because the forwarder lives in the same slot that the SC used to on a
moved object).
Generate new SSA versions for each return point, and then merge them at
a PHI placed after the inlinee. This potentially offers better facts for
optimization in such cases, since before we propagated no facts out of
the routine if there were multiple object returns, whereas now we let
PHI fact merging identify commonalities. Further, we now propagate facts
from non-object return values too.
@bdw requested a review from niner September 24, 2018 10:10
@bdw (Contributor, author) left a comment:
I'm basically happy, and more importantly, spectest is.
Left some minor notes. But I'm going to hit the merge button anyway, and make sure this gets some wider coverage.

@@ -5668,7 +5690,7 @@ void MVM_interp_run(MVMThreadContext *tc, void (*initial_invoke)(MVMThreadContex
 cur_op += 6;
 goto NEXT;
 OP(sp_get_s):
-GET_REG(cur_op, 0).s = ((MVMString *)((char *)GET_REG(cur_op, 2).o + GET_UI16(cur_op, 4)));
+GET_REG(cur_op, 0).s = *((MVMString **)((char *)GET_REG(cur_op, 2).o + GET_UI16(cur_op, 4)));
@bdw (author) commented:
This is now inconsistent with sp_get_o. I think it is correct, and I also think sp_get_o isn't actually used, but it is still odd.

SeenBox *sb = MVM_malloc(sizeof(SeenBox));
sb->bb = bb;
sb->ins = ins;
MVM_VECTOR_PUSH(pips->seen_box_ins, sb);
@bdw (author) commented:
Why use a pointer to a malloc'd structure here? If you'd use spesh_alloc, you wouldn't have to free it, and if you'd use the regular vector form, you'd be able to assign in place.

@bdw merged commit 8aaf696 into master Sep 24, 2018
4 participants