X86: Check multiple itable entries after lastITable cache test #21939

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Open

a7ehuo wants to merge 1 commit into eclipse-openj9:master from a7ehuo:invokeInterface-inline-iTable-after-lastITable-8

Contributor

a7ehuo commented May 23, 2025

If the lastITable cache does not match the declaring interface class, check a few entries on the iTable. The max number of entries to check is capped at 4. Running Dacapo pmd shows checking 4 entries render the best result.

This enhancement can be disabled by the env variable: TR_DisableITableIterationsAfterLastITableCacheCheck

The number of iterations can be changed by setting: TR_NumITableIterationsAfterLastITableCacheCheck=n


          X86: Check multiple itable entries after lastITable cache test

41706b5

If the lastITable cache does not match the declaring interface
class, check a few entries on the iTable. The max number of entries
to check is capped at 4. Running Dacapo pmd shows checking 4 entries
render the best result.

This enhancement can be disabled by the env variable:
TR_DisableITableIterationsAfterLastITableCacheCheck

The number of iterations can be changed by setting:
TR_NumITableIterationsAfterLastITableCacheCheck=n

Signed-off-by: Annabelle Huo <Annabelle.Huo@ibm.com>

a7ehuo added comp:jit arch:x86 labels

Contributor Author

a7ehuo commented May 23, 2025 •

edited

Loading

@0xdaryl May I ask you to review this change? Thank you!

Here is an example of the instructions generated: log_sample_instructions.txt

@vijaysun-omr @hzongaro fyi

Contributor

vijaysun-omr commented May 23, 2025

fyi @zl-wang @knn-k @r30shah that this needs to be added on to the Power, AArch64, and IBM Z codegen backlogs. It may not be a bad idea to make sure all the heuristics are made consistent across the different codegens for interface dispatch when you do the work, i.e. static PICs, dynamic PICs, last itable cache and this multiple itable cache are all employed in more or less the same way across platforms.

@ehsankianifar is already looking at doing something for IBM Z, I believe.

knn-k mentioned this pull request

AArch64: Check multiple itable entries after lastITable cache test #21948

Open

ehsankianifar mentioned this pull request

Implement LastITable for Z platform #21969

Open

0xdaryl self-assigned this

0xdaryl requested changes

View reviewed changes

runtime/compiler/x/codegen/X86PrivateLinkage.cpp

                  cg()->reserveNTrampolines(VPicParameters.defaultNumberOfSlots);
                  }
+              /**
+               * \brief
+               *    Generate a sequence of instructions that compares the `declaringClass` with x number of entries on the iTable.

Contributor

0xdaryl May 30, 2025

Use iterations rather than x so the comment matches the code.

runtime/compiler/x/codegen/X86PrivateLinkage.cpp

+               *    Generate a sequence of instructions that compares the `declaringClass` with x number of entries on the iTable.
+               *
+               * \parm iterations
+               *    The number of entries on the iTable.

Contributor

0xdaryl May 30, 2025

on -> in

runtime/compiler/x/codegen/X86PrivateLinkage.cpp

+                                                     TR::LabelSymbol *gotoLastITableDispatchLabel,
+                                                     TR_J9VMBase *fej9,
+                                                     TR::Compilation *comp,
+                                                     TR::CodeGenerator *cg)

Contributor

0xdaryl May 30, 2025

You don't need to pass all three of these as part of the API. comp and fej9 can be obtained from the CodeGenerator object.

runtime/compiler/x/codegen/X86PrivateLinkage.cpp

+               * \parm declaringClass
+               *    The class that is to be compared.
+               *
+               * \parm use32BitInterfacePointers

Contributor

0xdaryl May 30, 2025

Call this use32BitInterfaceClassPointers for naming consistency with other parms (and clarity).

runtime/compiler/x/codegen/X86PrivateLinkage.cpp

+                          iterations = (maxInterfaces > MAX_ITABLE_ITERATIONS) ? MAX_ITABLE_ITERATIONS : maxInterfaces;
+                          if (trace)
+                             traceMsg(comp(), "%s: declaringClass 0x%p numImplementers %d maxInterfaces %d iterations %d\n", __FUNCTION__, declaringClass, numImplementers, maxInterfaces, iterations);

Contributor

0xdaryl May 30, 2025

Remove 0x in front of %p.

runtime/compiler/x/codegen/X86PrivateLinkage.cpp

+                       {
+                       TR_ASSERT_FATAL(comp->target().is64Bit(), "Only 64-bit path should reach here.");
+                       auto cds = cg->findOrCreate8ByteConstant(callNode, (intptr_t)declaringClass);
+                       TR::MemoryReference *interfaceClassAddr = generateX86MemoryReference(cds, cg);

Contributor

0xdaryl May 30, 2025

On 64-bit, it is more efficient to use an immediate load for this address. Something like this:

generateRegImm64Instruction(TR::InstOpCode::MOV8RegImm64, callNode, interfaceClassReg, declaringClass, cg, reloKind);

In this case, reloKind should be TR_ClassPointer. Whenever you embed a class constant in the code cache you need to make sure there is a relocation created for it.

runtime/compiler/x/codegen/X86PrivateLinkage.cpp

+                       generateMemImmInstruction(TR::InstOpCode::CMP4MemImm4,
+                                                callNode,
+                                                generateX86MemoryReference(scratchReg, fej9->getOffsetOfInterfaceClassFromITableField(), cg),
+                                                (int32_t)(intptr_t)declaringClass,

Contributor

0xdaryl May 30, 2025

You'll need a relocation for the class pointer immediate if you embed it in the code cache.

runtime/compiler/x/codegen/X86PrivateLinkage.cpp

+                                                     TR::CodeGenerator *cg)
+                 {
+                 for (uint32_t i = 0; i<iterations; i++)
+                    {

Contributor

0xdaryl May 30, 2025

If the number of iterations is > 1 (and especially if it ever reaches 4), I'm worried about the code footprint cost.

Are you unrolling to this extent because the performance improvement over a "rolled" loop is significant? If the perf improvement is marginal, I think you should go for a much denser implementation:

   MOV iterReg, iterations
   MOV icReg, declaringClass
   MOV scratchReg, [vft+itable]
loop:
   TEST scratchReg, scratchReg
   JZ lookupDispatchSnippet
   CMP [scratchReg + iclassFromItable], icReg
   JZ gotoLastITableDispatchLabel
   MOV scratchReg, [scratchReg + next]
   DEC iterReg
   JNZ loop
   JMP lookupDispatchSnippet

I may not have this completely right, but I think you get the idea.

If you know you are only going to iterate once, then you can avoid the loop control.

runtime/compiler/x/codegen/X86PrivateLinkage.cpp

+                    generateLabelInstruction(TR::InstOpCode::label, callNode, gotoLastITableDispatchLabel, cg());
+                    if (comp()->target().is32Bit())
+                       generatePaddingInstruction(3, callNode, cg());

Contributor

0xdaryl May 30, 2025

What is this padding for?

0xdaryl reviewed

View reviewed changes

runtime/compiler/x/codegen/X86PrivateLinkage.cpp

+                 //          mov rdi, [vft + lastITableOffset]                  |        |       ; cached ITable
+                 //          cmp [rdi + interfaceClassOffset], interfaceClass   |        |       ; check if it's our interface class
+                 //          jne Outlined Label iterateITableLabel  ------------|---+    |       ; iterate a few entries on iTable
+                 //          jne lookupDispatchSnippet  ------------------------|---|----|-----> ; if not, jump to the slow path

Contributor

0xdaryl May 30, 2025

Two jnes back-to-back is an indication that something is wrong with this sequence. Probably should be jmp ?

Member

BradleyWood Jun 23, 2025

It does seem weird but there might be a good reason for it. Seems like there is an explanation for it. I'm not quite satisfied so I will have to do some archaeology.

ehsankianifar reviewed

View reviewed changes

runtime/compiler/x/codegen/X86PrivateLinkage.cpp

+                    static char *numITableIterationsAfterLastITableCacheCheck = feGetEnv("TR_NumITableIterationsAfterLastITableCacheCheck");
+                    static const int32_t numITableIterationsAfterLastITableCacheCheckValue = numITableIterationsAfterLastITableCacheCheck ? atoi(numITableIterationsAfterLastITableCacheCheck) : MAX_ITABLE_ITERATIONS;
+                    iterations = numITableIterationsAfterLastITableCacheCheck ? numITableIterationsAfterLastITableCacheCheckValue : iterations;

Contributor

ehsankianifar Jun 2, 2025

It seem the value of TR_NumITableIterationsAfterLastITableCacheCheck overrides the calculated iterations if it was provided. So should we skip calculating iterations in that case?
Also is it valid if the value of TR_NumITableIterationsAfterLastITableCacheCheck is larger than calculated iterations? It feels like we would check for implementations that we know don't exist.

Contributor

vijaysun-omr Jun 2, 2025

The case when someone sets the env var is just when someone is tuning this area of code (not in production most likely) for experimenting. In such a case, the slight cost of having calculated iterations is not too much of an issue (compile time is usually taken mostly in the optimizer typically).

ehsankianifar reviewed

View reviewed changes

runtime/compiler/x/codegen/X86PrivateLinkage.cpp

+                             maxInterfaces = (numInterfaces > maxInterfaces) ? numInterfaces : maxInterfaces;
+                             }
+                          iterations = (maxInterfaces > MAX_ITABLE_ITERATIONS) ? MAX_ITABLE_ITERATIONS : maxInterfaces;

Contributor

ehsankianifar Jun 2, 2025

If we calculate the number of implementations to be 1, doesn't it mean that the value of iTable is equal to lastITable? and if this assumption is true, should we bother inserting iTable check instructions?

Contributor

0xdaryl Jun 2, 2025

@a7ehuo would have the definitive answer here, but my understanding is the answer is no.

iterations controls the maximum number of itables that will be walked at runtime. It is a performance knob that is heuristically set. For an interface I, it is set based on the maximum number of (all) interfaces implemented by the set of all classes that implement I. Just because iterations might be 1 does not mean this is the only interface implemented by this class at runtime. It just means it's only going to look one level deep.

Contributor

vijaysun-omr commented Jun 16, 2025

Just wanted to ask what the plan for this PR is. In @a7ehuo absence, should we deliver some part of it ?

Member

hzongaro commented Jun 16, 2025

@BradleyWood, as @a7ehuo will be away for a bit, may I ask you to pick up her changes from this pull request, and address the comments that are still outstanding from the review?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arch:x86 comp:jit