Postrelease opts #969

Merged
merged 68 commits into master on Sep 24, 2018

Conversation

@bdw (Contributor) commented Sep 24, 2018

Optimizations and stuff

jnthn and others added 30 commits September 17, 2018 17:32
For a wval that was already accessed, stash it away in a spesh slot
and access it that way. This is rather faster, avoiding the various
indirections in wval lookup.
When we hit the write barrier, annotate on the object that was being
referenced that it was pointed to by a gen2 object. Then, if it is
alive at the next GC, immediately promote it to gen2 rather than it
having another stay in the nursery.

This makes a "populate a million element array" benchmark run in 90%
of the time that it used to take, and will help other programs that
build up long-lived data structures. CORE.setting compilation is one
such program, and seems to also benefit from this. Meanwhile, other
benchmarks seem to not be negatively impacted by this.
Otherwise, now that guards have a read and a write register, we can
end up with information that is insufficiently accurate to keep things
alive as needed for deopt.
Otherwise, we might end up scribbling over a register that holds a
useful value on the other side of deopt, which can happen under our
more aggressive `set` elimination optimizations, but potentially in
others too.
In preparation for looking at eliminating unrequired guards. This is
a common case where we will often be able to do guard elimination.
We can eliminate them when we can prove that the guard will always be
true.
* Remove seemingly non-effective box and dead code unbox opts in the
  first optimize pass
* In the second optimize pass, introduce code to delete dead box
  instructions that are rendered no longer required due to box/unbox
  pair elimination
* Also do a little bit of naming tweakery

This gets `my $x = "fooooo"; for ^10_000_000 { my int $i = $x.chars }`
to spesh into something that deletes the boxing of the `chars`, which
makes it run in around 90% of the time it used to.

This also prepares the way for further work on optimizing `box` ops.
This covers boxing of both Num and Str objects in Perl 6, lowering
them into a fastcreate and a p6o_bind_[ns], which avoids a C call and
the indirections and checks that the normal code has to do.
This clears up things like hllboxtype instructions left behind by
lowered or deleted boxing instructions.
Do it immediately in the spesh function of the REPRs
VMArray, MVMHash, and P6opaque. This throws out a
bunch of guards that were being left in the code in
a test I did.

Also, there's no need to get_and_use_facts when we're
changing the operation that writes that register.
This is aimed at helping NQP code rather than Perl 6 code, but does
the same trick that we've done for the Perl 6 ones. Alas, the int one
loses the use of the int box cache, so needs a bit more work yet.
For JIT, we now emit assembly that does the comparisons and then does
a fetch from the integer cache if applicable. Thus the only real call
is to obtain memory from the GC if we need to box, the value then
being poked into the object at the appropriate location.

Also stub in as-yet unimplemented and unused spesh ops that we'll use
for optimizing boxing to a P6bigint embedded in a P6opaque.
Almost works, but the range check for whether to store it as a small
int always comes out false, so it always takes the bigint path.
Spotted by MasterDuke++.
Sometimes a decont really is an unbox in disguise, so handle those too.
So far, implemented it for P6opaque box around a native num and native
str.
Nothing else depends on it, and there's a potential risk of it not
leaving the dominance tree in good shape.
This aids the performance of NQP code, which uses these rather than
P6opaque-based box types. Also includes a fix for the interpretation
of sp_get_s, which I can only figure has never been exercised before,
as it was utterly broken.
Now, if it's a small bigint (fits within 32 bits), we just pull it
right out without having to make a function call. We only make a call
in the case there's a real mp_int that we need to range check and take
the value from.
We know that the body of the object will always be the direct one, not
a replacement version of it, so we can simply bind directly into that
without paying for the cost of the indirection.
The object was only just created. It must be in the nursery, since
that's a rule for using `fastcreate`. Thus it cannot be a gen2
object pointing to a nursery object, which is when the barrier would
trigger.
So that if we do nested inlines, they don't contribute to the size of
the overall thing. This seems to be for the most part a win: various
Perl 6 microbenchmarks run faster, spectest seems unimpacted, and there
is a slight impact on spectest compilation time, but nothing enormous.
jnthn and others added 26 commits September 17, 2018 17:40
This gets it to use the write barrier variant that sets the "gen2
referenced" flag on an object to encourage its earlier promotion.
It was repeated in a number of ops, and will be used in a few more
soon.
Like sp_fastbox_bi, sp_add_I, and friends, this
calculates a fixed offset into a P6opaque's body
where the bigintbody can be found and emits a
spesh op that directly accesses the memory to
do the extremely cheap nonzero check on the
contents for both smallbigint and real bigints.
Thus regaining the performance in some programs that depend heavily
on that.
Previously, we only had them on the facts of the PHI node itself.
However, those facts could then be copied and relied on elsewhere. With
this change, we now keep a list of the log guards that we depend on for
each fact, and at a PHI node we copy all the input lists into a target
list. We then, when we depend on or copy facts, just assign the input
log guards list and count (which is safe since the list is immutable,
and we use the region allocator, so there are no memory management
concerns).
In the case where the result was going into the same register that an
input argument came from, we stored the result prematurely, and then
read a NULLed-out bigint struct as one of the arguments. This only
took place in the fallback path, and when the earlier requirement was
met. Fixes various crashing spectests.
Rather than just the first one of each kind that we encounter. With
more aggressive optimizations, we could end up with multiple of them
being shuffled onto the same instruction.
We could potentially process the start after the end in this case, and
so leave inconsistent data in the active handlers data array.
It makes various invocations, and we don't end up with the calling
point properly tracked as a deopt point, in turn breaking things like
dynamic variable resolutions very close to the point that did the
loadbytecode.
* Try going for an outer frame if the immediately available one is
  inlined (this will work together with a change in Rakudo to mark
  frames that use `EVAL` as no-inline, and mean that `try EVAL $x`
  will work out fine)
* If that doesn't work, don't SEGV, but throw
This reverts commit d84778a, which
caused some breakage. Also, it might well have been better placed (or
at least just as well placed) beforehand anyway.
Or at least, change it so it works. My gut feeling is that the way
it was before should produce better code (doesn't need to save a
value during a call), and that the change I've done should have had
no semantic effect.
No need to add on to a pointer only to subtract from it again.
Eliminates some guards that we got back after various boxing and bigint
lowering optimizations. This means they were a slightly bigger win than
initially thought, because they accidentally resulted in an extra guard
in various cases.
If the inliner doesn't set it, then we can assume it's not set yet.
Note that this is a Rakudo use-case assumption, rather than covering
every possible use of the ops. This perhaps means we want to adopt a
less general mechanism in the future.
* When it's hllboolfor, be sure to delete the usage of the language
  name register, otherwise we fail to delete it (caught thanks to the
  DU chain checker)
* Also for hllboolfor don't blindly assume that the language name
  operand is a constant
* Avoid fetching obj_facts twice
* Use a better name for the facts on the target result
Parallel GC relies on us only updating the object flags from the thread
that owns the object (or the nominated thread to GC that thread's
objects). The updates caused by the write barrier touching the object
that was referenced caused races that could lead to incorrect GC
behavior, which typically showed up as an invalid SC (which was really
because the forwarder lives in the same slot that the SC used to on a
moved object).
Generate new SSA versions for each return point, and then merge them at
a PHI placed after the inlinee. This potentially offers better facts for
optimization in such cases, since before we propagated no facts out of
the routine if there were multiple object returns, whereas now we let
PHI fact merging identify commonalities. Further, we now propagate facts
from non-object return values too.
@bdw requested a review from niner September 24, 2018 10:10
@bdw (Contributor, author) left a comment:
I'm basically happy, and more importantly, spectest is.
Left some minor notes. But I'm going to hit the merge button anyway, and make sure this gets some wider coverage.

@@ -5668,7 +5690,7 @@ void MVM_interp_run(MVMThreadContext *tc, void (*initial_invoke)(MVMThreadContex
 cur_op += 6;
 goto NEXT;
 OP(sp_get_s):
-GET_REG(cur_op, 0).s = ((MVMString *)((char *)GET_REG(cur_op, 2).o + GET_UI16(cur_op, 4)));
+GET_REG(cur_op, 0).s = *((MVMString **)((char *)GET_REG(cur_op, 2).o + GET_UI16(cur_op, 4)));
@bdw (author) commented:
This is now inconsistent with sp_get_o. I think it is correct, and I also think sp_get_o isn't actually used, but it is still odd.

SeenBox *sb = MVM_malloc(sizeof(SeenBox));
sb->bb = bb;
sb->ins = ins;
MVM_VECTOR_PUSH(pips->seen_box_ins, sb);
@bdw (author) commented:
Why use a pointer to a malloc'd structure here? If you'd use spesh_alloc, you wouldn't have to free it, and if you'd use the regular vector form, you'd be able to assign in place.

@bdw merged commit 8aaf696 into master Sep 24, 2018
4 participants