Skip to content

X86: Check multiple itable entries after lastITable cache test #21939

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

a7ehuo
Copy link
Contributor

@a7ehuo a7ehuo commented May 23, 2025

If the lastITable cache does not match the declaring interface class, check a few entries on the iTable. The max number of entries to check is capped at 4. Running Dacapo pmd shows checking 4 entries render the best result.

This enhancement can be disabled by the env variable: TR_DisableITableIterationsAfterLastITableCacheCheck

The number of iterations can be changed by setting: TR_NumITableIterationsAfterLastITableCacheCheck=n

If the lastITable cache does not match the declaring interface
class, check a few entries on the iTable. The max number of entries
to check is capped at 4. Running Dacapo pmd shows checking 4 entries
render the best result.

This enhancement can be disabled by the env variable:
TR_DisableITableIterationsAfterLastITableCacheCheck

The number of iterations can be changed by setting:
TR_NumITableIterationsAfterLastITableCacheCheck=n

Signed-off-by: Annabelle Huo <Annabelle.Huo@ibm.com>
@a7ehuo
Copy link
Contributor Author

a7ehuo commented May 23, 2025

@0xdaryl May I ask you to review this change? Thank you!

Here is an example of the instructions generated: log_sample_instructions.txt

@vijaysun-omr @hzongaro fyi

@vijaysun-omr
Copy link
Contributor

fyi @zl-wang @knn-k @r30shah that this needs to be added on to the Power, AArch64, and IBM Z codegen backlogs. It may not be a bad idea to make sure all the heuristics are made consistent across the different codegens for interface dispatch when you do the work, i.e. static PICs, dynamic PICs, last itable cache and this multiple itable cache are all employed in more or less the same way across platforms.

@ehsankianifar is already looking at doing something for IBM Z, I believe.

@@ -2628,6 +2628,107 @@ void J9::X86::PrivateLinkage::buildVPIC(TR::X86CallSite &site, TR::LabelSymbol *
cg()->reserveNTrampolines(VPicParameters.defaultNumberOfSlots);
}

/**
* \brief
* Generate a sequence of instructions that compares the `declaringClass` with x number of entries on the iTable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use iterations rather than x so the comment matches the code.

* Generate a sequence of instructions that compares the `declaringClass` with x number of entries on the iTable.
*
* \parm iterations
* The number of entries on the iTable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on -> in

TR::LabelSymbol *gotoLastITableDispatchLabel,
TR_J9VMBase *fej9,
TR::Compilation *comp,
TR::CodeGenerator *cg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to pass all three of these as part of the API. comp and fej9 can be obtained from the CodeGenerator object.

* \parm declaringClass
* The class that is to be compared.
*
* \parm use32BitInterfacePointers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call this use32BitInterfaceClassPointers for naming consistency with other parms (and clarity).

iterations = (maxInterfaces > MAX_ITABLE_ITERATIONS) ? MAX_ITABLE_ITERATIONS : maxInterfaces;

if (trace)
traceMsg(comp(), "%s: declaringClass 0x%p numImplementers %d maxInterfaces %d iterations %d\n", __FUNCTION__, declaringClass, numImplementers, maxInterfaces, iterations);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove 0x in front of %p.

{
TR_ASSERT_FATAL(comp->target().is64Bit(), "Only 64-bit path should reach here.");
auto cds = cg->findOrCreate8ByteConstant(callNode, (intptr_t)declaringClass);
TR::MemoryReference *interfaceClassAddr = generateX86MemoryReference(cds, cg);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On 64-bit, it is more efficient to use an immediate load for this address. Something like this:

generateRegImm64Instruction(TR::InstOpCode::MOV8RegImm64, callNode, interfaceClassReg, declaringClass, cg, reloKind);

In this case, reloKind should be TR_ClassPointer. Whenever you embed a class constant in the code cache you need to make sure there is a relocation created for it.

generateMemImmInstruction(TR::InstOpCode::CMP4MemImm4,
callNode,
generateX86MemoryReference(scratchReg, fej9->getOffsetOfInterfaceClassFromITableField(), cg),
(int32_t)(intptr_t)declaringClass,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need a relocation for the class pointer immediate if you embed it in the code cache.

TR::CodeGenerator *cg)
{
for (uint32_t i = 0; i<iterations; i++)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the number of iterations is > 1 (and especially if it ever reaches 4), I'm worried about the code footprint cost.

Are you unrolling to this extent because the performance improvement over a "rolled" loop is significant? If the perf improvement is marginal, I think you should go for a much denser implementation:

   MOV iterReg, iterations
   MOV icReg, declaringClass
   MOV scratchReg, [vft+itable]
loop:
   TEST scratchReg, scratchReg
   JZ lookupDispatchSnippet
   CMP [scratchReg + iclassFromItable], icReg
   JZ gotoLastITableDispatchLabel
   MOV scratchReg, [scratchReg + next]
   DEC iterReg
   JNZ loop
   JMP lookupDispatchSnippet

I may not have this completely right, but I think you get the idea.

If you know you are only going to iterate once, then you can avoid the loop control.


generateLabelInstruction(TR::InstOpCode::label, callNode, gotoLastITableDispatchLabel, cg());
if (comp()->target().is32Bit())
generatePaddingInstruction(3, callNode, cg());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this padding for?

// mov rdi, [vft + lastITableOffset] | | ; cached ITable
// cmp [rdi + interfaceClassOffset], interfaceClass | | ; check if it's our interface class
// jne Outlined Label iterateITableLabel ------------|---+ | ; iterate a few entries on iTable
// jne lookupDispatchSnippet ------------------------|---|----|-----> ; if not, jump to the slow path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two jnes back-to-back is an indication that something is wrong with this sequence. Probably should be jmp ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does seem weird but there might be a good reason for it. Seems like there is an explanation for it. I'm not quite satisfied so I will have to do some archaeology.

static char *numITableIterationsAfterLastITableCacheCheck = feGetEnv("TR_NumITableIterationsAfterLastITableCacheCheck");
static const int32_t numITableIterationsAfterLastITableCacheCheckValue = numITableIterationsAfterLastITableCacheCheck ? atoi(numITableIterationsAfterLastITableCacheCheck) : MAX_ITABLE_ITERATIONS;

iterations = numITableIterationsAfterLastITableCacheCheck ? numITableIterationsAfterLastITableCacheCheckValue : iterations;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seem the value of TR_NumITableIterationsAfterLastITableCacheCheck overrides the calculated iterations if it was provided. So should we skip calculating iterations in that case?
Also is it valid if the value of TR_NumITableIterationsAfterLastITableCacheCheck is larger than calculated iterations? It feels like we would check for implementations that we know don't exist.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case when someone sets the env var is just when someone is tuning this area of code (not in production most likely) for experimenting. In such a case, the slight cost of having calculated iterations is not too much of an issue (compile time is usually taken mostly in the optimizer typically).

maxInterfaces = (numInterfaces > maxInterfaces) ? numInterfaces : maxInterfaces;
}

iterations = (maxInterfaces > MAX_ITABLE_ITERATIONS) ? MAX_ITABLE_ITERATIONS : maxInterfaces;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we calculate the number of implementations to be 1, doesn't it mean that the value of iTable is equal to lastITable? and if this assumption is true, should we bother inserting iTable check instructions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@a7ehuo would have the definitive answer here, but my understanding is the answer is no.

iterations controls the maximum number of itables that will be walked at runtime. It is a performance knob that is heuristically set. For an interface I, it is set based on the maximum number of (all) interfaces implemented by the set of all classes that implement I. Just because iterations might be 1 does not mean this is the only interface implemented by this class at runtime. It just means it's only going to look one level deep.

@vijaysun-omr
Copy link
Contributor

Just wanted to ask what the plan for this PR is. In @a7ehuo absence, should we deliver some part of it ?

@hzongaro
Copy link
Member

@BradleyWood, as @a7ehuo will be away for a bit, may I ask you to pick up her changes from this pull request, and address the comments that are still outstanding from the review?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants