New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Postrelease opts #969
Postrelease opts #969
Conversation
For a wval that was already accessed, stash it away in a spesh slot and access it that way. This is rather faster, avoiding the various indirections in wval lookup.
When we hit the write barrier, annotate on the object that was being referenced that it was pointed to by a gen2 object. Then, if it is alive at the next GC, immediately promote it to gen2 rather than it having another stay in the nursery. This makes a "populate a million element array" benchmark run in 90% of the time that it used to take, and will help other programs that build up long-lived data structures. CORE.setting compilation is one such program, and seems to also benefit from this. Meanwhile, other benchmarks seem to not be negatively impacted by this.
Otherwise, now that guards have a read and a write register, we can end up with the information being insufficiently accurate in order to keep things alive as needed for deopt.
Otherwise, we might end up scribbling over a register that holds a useful value the other side of deopt, which can happen under our move aggressive `set` elimination optimizations, but potentially in others too.
In preparation for looking at eliminating unrequired guards. This is a common case where we will often be able to do guard elimination.
We can eliminate them when we can prove that the guard will always be true.
* Remove seemingly non-effective box and dead code unbox opts in the first optimize pass * In the second optimize pass, introduce code to delete dead box instructions that are rendered no longer required due to box/unbox pair elimination * Also do a little bit of naming tweakery This gets `my $x = "fooooo"; for ^10_000_000 { my int $i = $x.chars }` to spesh into something that deletes the boxing of the `chars`, which makes it run in around 90% of the time it used to. This also prepares the way for further work on optimizing `box` ops.
This covers boxing of both Num and Str objects in Perl 6, lowering them into a fastcreate and a p6o_bind_[ns], which avoids a C call and the indirections and checks that the normal code has to do.
This clears up things like hllboxtype instructions left behind by lowered or deleted boxing instructions.
do it immediately in the spesh function of the reprs VMArray, MVMHash, and P6opaque. This throws out a bunch of guards that were being left in the code in a test i did.
also, no need to get_and_use_facts when we're changing the operation that writes that register.
This is aimed at helping NQP code rather than Perl 6 code, but does the same trick that we've done for the Perl 6 ones. Alas, the int one loses the use of the int box cache, so needs a bit more work yet.
For JIT, we now emit assembly that does the comparisons and then does a fetch from the integer cache if applicable. Thus the only real call is to obtain memory from the GC if we need to box, the value then being poked into the object at the appropriate location. Also stub in as-yet unimplemented and unused spesh ops that we'll use for optimizing boxing to a P6bigint embedded in a P6opaque.
Almost works, but the range check for whether to store it as a small int always comes out false, so it always takes the bigint path.
Spotted by MasterDuke++.
Sometimes a decont really is an unbox in disguise, so handle those too. So far, implemented it for P6opaque box around a native num and native str.
Nothing else depends on it, and there's a potential risk of it not leaving the dominance tree in good shape.
This aids the performance of NQP code, which uses these rather than P6opaque-based box types. Also includes a fix for the interpretation of sp_get_s, which I can only figure has never been exercised before as it was utterly broke.
Now, if it's a small bigint (fits within 32 bits), we just pull it right out without having to make a function call. We only make a call in the case there's a real mp_int that we need to range check and take the value from.
We know that the body of the object will always be the direct one, not a replacement version of it, so we can simply bind directly into that without paying for the cost of the indirection.
The object was only just created. It must be in the nursery, since that's a rule for using `fastcreate`. Thus it can not be a gen2 object pointing to a nursery object, which is when the barrier would trigger.
So that if we do nested inlines, they don't contribute to the size of the overall thing. This seems to be on the most part a win: various Perl 6 microbenchmarks run faster, spectest seems unimpacted, there is a slight impact on spectest compilation time but nothing enormous.
This gets it to use the write barrier variant that sets the "gen2 referenced" flag on an object to encourage its earlier promotion.
It was repeated in a number of ops, and will be used in a few more soon.
Like sp_fastbox_bi, sp_add_I, and friends, this calculates a fixed offset into a P6opaque's body where the bigintbody can be found and emits a spesh op that directly accesses the memory to do the extremely cheap nonzero check on the contents for both smallbigint and real bigints.
Thus regaining the performance in some programs the depend heavily on that.
Previously, we only had them on the facts of the PHI node itself. However, those facts could then be copied and relied on elsewhere. With this change, we now keep a list of the log guards that we depend on for each fact, and at a PHI node we copy all the input lists into a target list. We then, when we depend on or copy facts, just assign the input log guards list and count (which is safe since it's immutable, and we use the region allocator so there's no memory management concerns).
In the case where the result was going into the same register that an input argument came from, we stored the result prematurely, and then read a NULLed out bigint struct as one of the arguments. This only took place in the fallback path and when the earlier requirement was meant. Fixes various crashing spectests.
Rather than just the first one of each kind that we encounter. With more aggressive optimizations, we could end up with multiple of them being shuffled onto the same instruction.
We could potentially process the start after the end in this case, and so leave inconsistent data in the active handlers data array.
It makes various invocations, and we don't end up with the calling point properly tracked as a deopt point, in turn breaking things like dynamic variable resolutions very close to the point that did the loadbytecode.
* Try going for an outer frame if the immediately available one is inlined (this will work together with a change in Rakudo to mark frames that use `EVAL` as no-inline, and mean that `try EVAL $x` will work out fine) * If that doesn't work, don't SEGV, but throw
This reverts commit d84778a, which caused some breakage. Also, it might well have been better placed (or at least just as well placed) beforehand anyway.
Or at least, change it so it works. My gut feeling is that the way it was before should produce better code (doesn't need to save a value during a call), and that the change I've done should have had no semantic effect.
No need to add on to a pointer only to subtract from it again.
Eliminates some guards that we got back after various boxing and bigint lowering optimizations. This means they were a slightly bigger win than initially thought, because they accidentally resulted in an extra guard in various cases.
If the inliner doesn't set it, then we can assume it's not yet. Note that this is a Rakudo use case assumption, rather than covering every possible use of the ops. This perhaps means we want to adapt a a less general mechanism in the future.
* When it's hllboolfor, be sure to delete the usage of the language name register, otherwise we fail to delete it (caught thanks to the DU chain checker) * Also for hllboolfor don't blindly assume that the language name operand is a constant * Avoid fetching obj_facts twice * Use a better name for the facts on the target result
Parallel GC relies on us only updating the object flags from the thread that owns the object (or the nominated thread to GC that thread's objects). The updates caused by the write barrier touching the object that was referenced caused races that could lead to incorrect GC behavior, which typically showed up as an invalid SC (which was really because the forwarder lives in the same slot that the SC used to on a moved object).
Generate new SSA versions for each return point, and then merge then at a PHI placed after the inlinee. This potentially offers better facts for optimization in such cases, since before we propagated no facts out of the routine if there was multiple object returns, whereas now we let PHI fact merging identify commonalities. Further, we now propagate facts from non-object return values too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm basically happy, and more importantly, spectest is.
Left some minor notes. But I'm going to hit the merge button anyway, and make sure this gets some wider coverage.
@@ -5668,7 +5690,7 @@ void MVM_interp_run(MVMThreadContext *tc, void (*initial_invoke)(MVMThreadContex | |||
cur_op += 6; | |||
goto NEXT; | |||
OP(sp_get_s): | |||
GET_REG(cur_op, 0).s = ((MVMString *)((char *)GET_REG(cur_op, 2).o + GET_UI16(cur_op, 4))); | |||
GET_REG(cur_op, 0).s = *((MVMString **)((char *)GET_REG(cur_op, 2).o + GET_UI16(cur_op, 4))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now inconsistent with sp_get_o. I think it is correct, and I also think sp_get_o isn't actually used, but it is still odd.
SeenBox *sb = MVM_malloc(sizeof(SeenBox)); | ||
sb->bb = bb; | ||
sb->ins = ins; | ||
MVM_VECTOR_PUSH(pips->seen_box_ins, sb); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use a pointer to a malloc'd structure here? If you'd use spesh_alloc, you wouldn't have to free it, and if you'd use the regular vector form, you'd be able to assign in place.
Optimizations and stuff