-
Notifications
You must be signed in to change notification settings - Fork 755
X86: Check multiple itable entries after lastITable cache test #21939
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
X86: Check multiple itable entries after lastITable cache test #21939
Conversation
If the lastITable cache does not match the declaring interface class, check a few entries on the iTable. The max number of entries to check is capped at 4. Running Dacapo pmd shows checking 4 entries render the best result. This enhancement can be disabled by the env variable: TR_DisableITableIterationsAfterLastITableCacheCheck The number of iterations can be changed by setting: TR_NumITableIterationsAfterLastITableCacheCheck=n Signed-off-by: Annabelle Huo <Annabelle.Huo@ibm.com>
@0xdaryl May I ask you to review this change? Thank you! Here is an example of the instructions generated: log_sample_instructions.txt |
fyi @zl-wang @knn-k @r30shah that this needs to be added on to the Power, AArch64, and IBM Z codegen backlogs. It may not be a bad idea to make sure all the heuristics are made consistent across the different codegens for interface dispatch when you do the work, i.e. static PICs, dynamic PICs, last itable cache and this multiple itable cache are all employed in more or less the same way across platforms. @ehsankianifar is already looking at doing something for IBM Z, I believe. |
@@ -2628,6 +2628,107 @@ void J9::X86::PrivateLinkage::buildVPIC(TR::X86CallSite &site, TR::LabelSymbol * | |||
cg()->reserveNTrampolines(VPicParameters.defaultNumberOfSlots); | |||
} | |||
|
|||
/** | |||
* \brief | |||
* Generate a sequence of instructions that compares the `declaringClass` with x number of entries on the iTable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use iterations
rather than x
so the comment matches the code.
* Generate a sequence of instructions that compares the `declaringClass` with x number of entries on the iTable. | ||
* | ||
* \parm iterations | ||
* The number of entries on the iTable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on -> in
TR::LabelSymbol *gotoLastITableDispatchLabel, | ||
TR_J9VMBase *fej9, | ||
TR::Compilation *comp, | ||
TR::CodeGenerator *cg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to pass all three of these as part of the API. comp
and fej9
can be obtained from the CodeGenerator
object.
* \parm declaringClass | ||
* The class that is to be compared. | ||
* | ||
* \parm use32BitInterfacePointers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call this use32BitInterfaceClassPointers
for naming consistency with other parms (and clarity).
iterations = (maxInterfaces > MAX_ITABLE_ITERATIONS) ? MAX_ITABLE_ITERATIONS : maxInterfaces; | ||
|
||
if (trace) | ||
traceMsg(comp(), "%s: declaringClass 0x%p numImplementers %d maxInterfaces %d iterations %d\n", __FUNCTION__, declaringClass, numImplementers, maxInterfaces, iterations); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove 0x
in front of %p
.
{ | ||
TR_ASSERT_FATAL(comp->target().is64Bit(), "Only 64-bit path should reach here."); | ||
auto cds = cg->findOrCreate8ByteConstant(callNode, (intptr_t)declaringClass); | ||
TR::MemoryReference *interfaceClassAddr = generateX86MemoryReference(cds, cg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On 64-bit, it is more efficient to use an immediate load for this address. Something like this:
generateRegImm64Instruction(TR::InstOpCode::MOV8RegImm64, callNode, interfaceClassReg, declaringClass, cg, reloKind);
In this case, reloKind
should be TR_ClassPointer
. Whenever you embed a class constant in the code cache you need to make sure there is a relocation created for it.
generateMemImmInstruction(TR::InstOpCode::CMP4MemImm4, | ||
callNode, | ||
generateX86MemoryReference(scratchReg, fej9->getOffsetOfInterfaceClassFromITableField(), cg), | ||
(int32_t)(intptr_t)declaringClass, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll need a relocation for the class pointer immediate if you embed it in the code cache.
TR::CodeGenerator *cg) | ||
{ | ||
for (uint32_t i = 0; i<iterations; i++) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the number of iterations is > 1 (and especially if it ever reaches 4), I'm worried about the code footprint cost.
Are you unrolling to this extent because the performance improvement over a "rolled" loop is significant? If the perf improvement is marginal, I think you should go for a much denser implementation:
MOV iterReg, iterations
MOV icReg, declaringClass
MOV scratchReg, [vft+itable]
loop:
TEST scratchReg, scratchReg
JZ lookupDispatchSnippet
CMP [scratchReg + iclassFromItable], icReg
JZ gotoLastITableDispatchLabel
MOV scratchReg, [scratchReg + next]
DEC iterReg
JNZ loop
JMP lookupDispatchSnippet
I may not have this completely right, but I think you get the idea.
If you know you are only going to iterate once, then you can avoid the loop control.
|
||
generateLabelInstruction(TR::InstOpCode::label, callNode, gotoLastITableDispatchLabel, cg()); | ||
if (comp()->target().is32Bit()) | ||
generatePaddingInstruction(3, callNode, cg()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this padding for?
// mov rdi, [vft + lastITableOffset] | | ; cached ITable | ||
// cmp [rdi + interfaceClassOffset], interfaceClass | | ; check if it's our interface class | ||
// jne Outlined Label iterateITableLabel ------------|---+ | ; iterate a few entries on iTable | ||
// jne lookupDispatchSnippet ------------------------|---|----|-----> ; if not, jump to the slow path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two jne
s back-to-back is an indication that something is wrong with this sequence. Probably should be jmp
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does seem weird but there might be a good reason for it. Seems like there is an explanation for it. I'm not quite satisfied so I will have to do some archaeology.
static char *numITableIterationsAfterLastITableCacheCheck = feGetEnv("TR_NumITableIterationsAfterLastITableCacheCheck"); | ||
static const int32_t numITableIterationsAfterLastITableCacheCheckValue = numITableIterationsAfterLastITableCacheCheck ? atoi(numITableIterationsAfterLastITableCacheCheck) : MAX_ITABLE_ITERATIONS; | ||
|
||
iterations = numITableIterationsAfterLastITableCacheCheck ? numITableIterationsAfterLastITableCacheCheckValue : iterations; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seem the value of TR_NumITableIterationsAfterLastITableCacheCheck
overrides the calculated iterations
if it was provided. So should we skip calculating iterations
in that case?
Also is it valid if the value of TR_NumITableIterationsAfterLastITableCacheCheck
is larger than calculated iterations
? It feels like we would check for implementations that we know don't exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The case when someone sets the env var is just when someone is tuning this area of code (not in production most likely) for experimenting. In such a case, the slight cost of having calculated iterations
is not too much of an issue (compile time is usually taken mostly in the optimizer typically).
maxInterfaces = (numInterfaces > maxInterfaces) ? numInterfaces : maxInterfaces; | ||
} | ||
|
||
iterations = (maxInterfaces > MAX_ITABLE_ITERATIONS) ? MAX_ITABLE_ITERATIONS : maxInterfaces; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we calculate the number of implementations to be 1, doesn't it mean that the value of iTable is equal to lastITable? and if this assumption is true, should we bother inserting iTable check instructions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@a7ehuo would have the definitive answer here, but my understanding is the answer is no.
iterations
controls the maximum number of itables that will be walked at runtime. It is a performance knob that is heuristically set. For an interface I
, it is set based on the maximum number of (all) interfaces implemented by the set of all classes that implement I
. Just because iterations
might be 1 does not mean this is the only interface implemented by this class at runtime. It just means it's only going to look one level deep.
Just wanted to ask what the plan for this PR is. In @a7ehuo absence, should we deliver some part of it ? |
@BradleyWood, as @a7ehuo will be away for a bit, may I ask you to pick up her changes from this pull request, and address the comments that are still outstanding from the review? |
If the lastITable cache does not match the declaring interface class, check a few entries on the iTable. The max number of entries to check is capped at 4. Running Dacapo pmd shows checking 4 entries render the best result.
This enhancement can be disabled by the env variable:
TR_DisableITableIterationsAfterLastITableCacheCheck
The number of iterations can be changed by setting:
TR_NumITableIterationsAfterLastITableCacheCheck=n