
Handling descriptor offsets as relocations #474

Closed
s-perron opened this issue Feb 19, 2020 · 29 comments

@s-perron
Contributor

When constructing relocatable shader elf, there is some data that cannot be known at compile time. One example is the offset needed to do a descriptor load. We want to be able to use a relocation to represent this offset, which can then be filled in during the linking phase that will create the pipeline elf.

If we want to replace the constant offset with a relocation, we need to figure out 3 different ways of representing the value for the different phases of the compilation.

We already have a version of the code that will generate relocations and replace them. This code is in #457 and GPUOpen-Drivers/llvm-project#1. This version of the code is wrong in many ways and needs to be improved.

Relocation in elf

This is the code produced by the initial version of llpc/llvm that generates relocations.

s_movk_i32 s0, 0x0                  // 000000000000: B0000000 
s_getpc_b64 s[6:7]                  // 000000000004: BE861C00 
0000000000000004:  R_AMDGPU_ABS32       doff_0_0
s_add_u32 s0, s3, s0                // 000000000008: 80000003 
s_mov_b32 s6, s4                    // 00000000000C: BE860004 
s_addc_u32 s1, s7, 0                // 000000000010: 82018007 
v_add_u32_e32 v0, s5, v0            // 000000000014: 68000005 
s_load_dwordx4 s[0:3], s[0:1], 0x0  // 000000000018: C00A0000 00000000 

This is wrong in a few different ways. First, the relocation offset is wrong: the relocation should point to the location in the s_movk_i32 instruction that contains the constant value. Second, the descriptor offset can only be 16 bits because that is all the s_movk_i32 instruction can hold. Third, the type of relocation is wrong. R_AMDGPU_ABS32 is meant to be used as the absolute address of the symbol to which the relocation applies. We need to create a new relocation.

The code we want generated would be something more like:

s_mov_b32 s2, 0x0                                     // 000000000010: BE8200FF 00000000 
000000000014: R_AMDGPU_VAL32           doff_0_0
s_load_dwordx4 s[0:3], s[0:1], s2               // 000000000018: C0080000 00000002 

We still want to create the dummy symbol doff_0_0 (the descriptor offset for set 0 and binding 0). This symbol is intended to hold the constant offset of the resource, and the new relocation says we should replace the location with that constant 32-bit value.

For performance, I do not know what is best. I would like to force the s_mov_b32 to always immediately precede the s_load_dwordx4. Then we could implement an optimization in the linking phase: if the offset is less than 2^20, make the s_mov_b32 a nop instruction and rewrite the load so it uses an immediate offset instead of the register.
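The link-time choice described above can be sketched as a small decision helper (hypothetical names; the 2^20 limit is taken straight from this paragraph, not verified against the ISA docs):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the proposed link-time optimization: if the resolved descriptor
// offset fits in the assumed 20-bit SMEM immediate, the s_mov_b32 can be
// turned into an s_nop and the load rewritten to use an immediate offset;
// otherwise the 32-bit literal of the s_mov_b32 is patched in place.
enum class OffsetStrategy { PatchMovLiteral, NopMovAndUseImmediate };

OffsetStrategy chooseOffsetStrategy(uint32_t resolvedOffset) {
  constexpr uint32_t kSmemImmLimit = 1u << 20; // assumption from the text
  return resolvedOffset < kSmemImmLimit ? OffsetStrategy::NopMovAndUseImmediate
                                        : OffsetStrategy::PatchMovLiteral;
}
```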

Representation in the compilation

The offset is added to the code during "Patch LLVM for descriptor load operations". This happens earlier in the compilation. We need a way to track that the offset of a particular descriptor is to be used. As we go through the different phases of the compilation, that representation needs to change.

Offset representation in llvm-ir

The translation from SPIR-V to LLVM IR will generate a call to @llpc.descriptor.load.buffer. This gets expanded in "Patch LLVM for descriptor load operations", and part of the expansion is calculating the offset of the descriptor using the descriptor offset field of the pipeline info. See https://github.com/GPUOpen-Drivers/llpc/blob/dev/llpc/patch/llpcPatchDescriptorLoad.cpp#L316 for where this is calculated.

In our initial implementation, we chose to replace the offset with a builtin function @llvm.amdgcn.desc.offset that represents the offset.

%20 = call i32 @llvm.amdgcn.desc.offset(i32 0, i32 0)
%21 = mul i32 0, 16
%22 = add i32 %21, %20
%23 = zext i32 %22 to i64
%24 = getelementptr [4294967295 x i8], [4294967295 x i8] addrspace(4)* %13, i64 0, i64 %23

This worked in the sense that it correctly marked that the offset of a particular descriptor is needed. However, the pattern recognition when lowering to MIR does not work well.

Offset representation in MIR

The next phase is MIR. The main problem is that the ISEL does not recognize the code generated above as having an offset. So it does operations on the base register instead of creating an "S_LOAD_DWORDX4_SGPR" where the offset is in a register. I'm guessing we need to fix up the pattern matching.

The bigger problem is how to represent the offset of a particular descriptor. I wanted to create a dummy instruction that has two parameters, the set and the binding. However, I am not sure how to properly generate that instruction: %16:sreg_32 = S_DESC_OFFSET 0, 0.

Currently, in AMDGPU DAG->DAG Pattern Instruction Selection, I identify the builtin above and replace it with the new instruction. This is a sample of the code that would currently be generated:

%16:sreg_32 = S_DESC_OFFSET 0, 0
%17:sreg_32 = S_MOV_B32 0                                                                                                                                                                                                                                                             
%18:sreg_64 = REG_SEQUENCE killed %16:sreg_32, %subreg.sub0, killed %17:sreg_32, %subreg.sub1                                                                                                                                                                                         
%19:sreg_64 = S_ADD_U64_PSEUDO killed %13:sreg_64, killed %18:sreg_64, implicit-def dead $scc
%20:sgpr_128 = S_LOAD_DWORDX4_IMM killed %19:sreg_64, 0, 0, 0 :: (invariant load 16 from %ir.21, addrspace 4)               

We need to figure out how to define an instruction that takes two immediate operands in MIR, but outputs an "S_MOV_B32" when doing machine code lowering. I can see how to do this, but I do not want to put the effort into doing it if the design is simply wrong.

An alternative would be to create S_LOAD_DWORDX4_RELOC. We would do pattern recognition so that the offset becomes a relocation. We would have to be careful how we handle large offsets and offsets into descriptor arrays.

Machine code lowering

In AMDGPUMCInstLower::lower, the S_DESC_OFFSET instruction is lowered to an s_movk_i32 instruction with a fixup for the immediate value. As has already been mentioned, this is a problem because it is only a 16-bit value. This is also where we choose the relocation type. I was unsure of what needs to be done if I want to create a new relocation type, so I reused an existing one. Anything that can point us to how to create a new relocation type will be helpful.

Then SIMCCodeEmitter::getMachineOpValue is used to actually output the relocation. The offset is currently wrong, so the fixup does not actually land on the immediate value. The size of the fixup is also wrong: it outputs a 32-bit fixup when only 16 bits can be written. All of this can easily be fixed once we have the instruction correct.

I should be able to output an s_mov_b32 instruction instead, and do a 32-bit fixup that would cover the entire offset.

@nhaehnle
Member

We want to be able to use a relocation to represent this offset, which can then be filled in during the linking phase that will create the pipeline elf.

I still maintain that in the long run, it probably makes more sense to do this in PAL. At a high level, it would work by passing to PAL:

  • the VS ELF
  • the PS ELF
  • some sideband data for the missing metadata as well as symbol definitions for the relocations required by the other ELFs -- could be a third ELF, could be something different (something different could be convenient for the goal of avoiding excessive re-uploads of the same shader)

But I'm just reiterating this to make sure it isn't forgotten. Cross-project coordination of such projects can be tricky and if we can't get PAL moving quickly enough then doing it in LLPC initially is acceptable to me. Though if you have reasons to believe that this is somehow inherently not a good idea, I would be interested in hearing them.

The code we want generated would be something more like:

s_mov_b32 s2, 0x0                                     // 000000000010: BE8200FF 00000000 
000000000014: R_AMDGPU_VAL32           doff_0_0
s_load_dwordx4 s[0:3], s[0:1], s2               // 000000000018: C0080000 00000002 

I agree that this, together with the "dummy symbol", is the right way to go. I'm not sure that you absolutely need a new relocation type, though. If we define the value of the dummy symbol to be the offset into the descriptor table, then R_AMDGPU_ABS32 is in some sense really the correct relocation to use.

I would guesstimate the performance benefit of s_nop'ing out the s_mov as so small that it's probably just not worth it, but let's keep it in mind. That certainly would require a new relocation type.

In our initial implementation, we chose to replace the offset with a builtin function @llvm.amdgcn.desc.offset that represents the offset.

I'm tempted to suggest simply using an LLVM global variable instead. However, that would result in a weird ptrtoint cast, so it's probably not such a great idea after all.

However, what I'd really implore you to do is to design this in a more open-ended way. It seems preferable to me to have an anyint @llvm.amdgcn.reloc.constant(metadata) intrinsic, where we define the metadata as {i32, string}, with the i32 being a "kind" discriminator, and the string being the name of the generated ELF symbol.
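A hypothetical shape for such a call in LLVM IR (the intrinsic name comes from the proposal above; the exact metadata encoding and the wrapper function are illustrative assumptions, not a final design):

```llvm
; Illustrative only: the metadata node carries the i32 "kind" discriminator
; and the name of the generated ELF symbol.
define i32 @sample() {
  %offset = call i32 @llvm.amdgcn.reloc.constant(metadata !0)
  ret i32 %offset
}

declare i32 @llvm.amdgcn.reloc.constant(metadata)

!0 = !{i32 0, !"doff_0_0"}
```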

We need to figure out how to define an instruction that takes two immediate operands in MIR, but outputs an "S_MOV_B32" when doing machine code lowering.

With the above, it'd probably be possible to just select the S_MOV_B32 with a TargetGlobalAddress operand. All the rest will likely work out, because this unknown operand shouldn't fold into the S_LOAD_DWORDX4. To give you an idea, the steps of compiling test/CodeGen/AMDGPU/lds-reloc.ll contain stuff like:

    t8: i32 = add GlobalAddress:i32<[0 x i32] addrspace(3)* @lds.external> 0, t7

...

Legalizing: t5: i32 = GlobalAddress<[0 x i32] addrspace(3)* @lds.external> 0
Trying custom legalization
Creating new node: t24: i32 = LDS TargetGlobalAddress:i32<[0 x i32] addrspace(3)* @lds.external> 0 [TF=8]
Successfully custom legalized node
 ... replacing: t5: i32 = GlobalAddress<[0 x i32] addrspace(3)* @lds.external> 0
     with:      t24: i32 = LDS TargetGlobalAddress:i32<[0 x i32] addrspace(3)* @lds.external> 0 [TF=8]
(yes, SelectionDAG can be a bit stupid sometimes)

...

ISEL: Starting selection on root node: t24: i32 = LDS TargetGlobalAddress:i32<[0 x i32] addrspace(3)* @lds.external> 0 [TF=8]
ISEL: Starting pattern match
  Initial Opcode index to 391415
  Skipped scope entry (due to false predicate) at index 391423, continuing at 391432
  Morphed node: t24: i64 = S_MOV_B32 TargetGlobalAddress:i32<[0 x i32] addrspace(3)* @lds.external> 0 [TF=8]
ISEL: Match complete!

... and finally in MIR:

  %3:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @lds.external

You'll still have to do some annoying plumbing in the MC layer. I can never remember how that works either, but what I remember from when I did the LDS relocations is that I just figured stuff out as I went and it was mostly okay. Don't be too surprised if you see code that seems like it doesn't make sense -- it's quite likely that your impression is correct, and feel free to propose cleanups if you run into something like that ;)

I was unsure of what needs to be done if I want to create a new relocation type, so I reused an existing one. Anything that can point us to how to create a new relocation type will be helpful.

Again, I'm not convinced that it's actually needed. But in any case, since relocation types are target-specific, we're the owners of our destiny here. The basic process is to propose it as a patch on LLVM Phabricator and make sure that people like Tony Tye and myself have had a chance to comment on it. Documentation of relocation types goes into AMDGPUUsage.rst.

Back to the question of relocating directly into the S_LOAD_DWORDX4. I haven't thought about this in much detail, but you can probably at least determine an upper bound on the offset statically, right? If this turns out to be possible, I would suggest adding the known bound as an i64 field to the metadata in LLVM IR, somehow(?) making sure it survives sufficiently long through SelectionDAG, and upgrading the SMEM selection logic to fold the operand if the metadata says it's guaranteed below 2^20. Then define a new relocation type for that. This actually seems surprisingly feasible, compared to the alternative of NOPing out the S_MOV and all that. And it would likely address virtually all of the realistic scenarios. You'll probably still want to plumb the simpler version through first :)

@s-perron
Contributor Author

I still maintain that in the long run, it probably makes more sense to do this in PAL.

I do not have a problem with that; we will try to code the application of the relocation in a way that can be easily factored out and moved to PAL when needed.

I'm not sure that you absolutely need a new relocation type though.

If the reviewers are okay with reusing the same relocation type, I am very happy to do so. My gut feeling was that it would not have been acceptable.

It seems preferable to me to have an anyint @llvm.amdgcn.reloc.constant(metadata) intrinsic, where we define the metadata as {i32, string}, with the i32 being a "kind" discriminator, and the string being the name of the generated ELF symbol.

That is a great idea. We will try to implement it that way.

The rest of what you say seems to make sense. Thanks for pointing out another example that we can use. We will start rewriting the initial implementation.

@s-perron s-perron self-assigned this Feb 21, 2020
@csyonghe

It seems preferable to me to have an anyint @llvm.amdgcn.reloc.constant(metadata) intrinsic, where we define the metadata as {i32, string}, with the i32 being a "kind" discriminator, and the string being the name of the generated ELF symbol.

@nhaehnle What do you think is the right representation for this metadata? Do you refer to an llvm::MDNode attached to the amdgcn.reloc.constant instruction? Or do you mean a new type of operand? If it is an llvm metadata node, I am not sure how to plumb that info down to an MCInst since that happens inside ISel. If we represent it as a normal operand, I am not quite sure how to represent a string typed / custom typed operand. Is there an existing example somewhere in the codebase that you can point me to?

@csyonghe

I figured that I can define the intrinsic as

def int_amdgcn_reloc_constant : Intrinsic<
  [llvm_i32_ty], [llvm_metadata_ty],
  [IntrNoMem, IntrSpeculatable]
>;

Then, in llpc, create a MetadataAsValue operand, which wraps an MDNode.

Next, in SOPInstructions.td, use the following code to select the intrinsic into S_RELOC_CONSTANT machine instruction:

let SubtargetPredicate = isGFX6GFX7GFX8GFX9 in
def S_RELOC_CONSTANT : SOPK_Pseudo <
  "s_movk_fix_i32",
  (outs SReg_32:$sdst),
  (ins unknown:$src),
  "$sdst, $src",
  [(set SReg_32:$sdst, (int_amdgcn_reloc_constant MetadataVT:$src))]>;

Then, in the AMDGPUMCInstLower::lower function in AMDGPUMCInstLower.cpp, we can extract the metadata from the operand and lower this S_RELOC_CONSTANT into an S_MOV_B32:

} else if (Opcode == AMDGPU::S_RELOC_CONSTANT) {
    auto varName = llvm::cast<llvm::MDString>(MI->getOperand(1).getMetadata()->getOperand(0))->getString();
    OutMI.setOpcode(TII->pseudoToMCOpcode(AMDGPU::S_MOV_B32));

What I haven't figured out is how to define a global value named varName, how to create a TargetGlobalAddress operand that references the global value, and how to create a relocation entry for the global value. @nhaehnle @s-perron Am I on the right track? Do you have any ideas on how to proceed from here?

@nhaehnle
Member

Maybe a custom lowering for the intrinsic would help? Ideally, the custom lowering could generate the new GlobalVariable at the time of lowering (if it does not already exist) or in some other way ensure the symbol will exist, and then generate an S_MOV_B32 with a TargetGlobalAddress operand directly. This is mostly a gut feeling, admittedly.

@s-perron
Contributor Author

s-perron commented Feb 24, 2020

The builtin definition and generation look right.

I agree with Nicolai that we should try to avoid defining a new MIR instruction, so the code related to S_RELOC_CONSTANT should be removed. The custom lowering would not be done in the *.td files. I think the right place to do this is in AMDGPUISelLowering.cpp or SIISelLowering.cpp; I don't know how they relate to each other. See the AMDGPUTargetLowering::Lower* functions as examples.

For your lowering function, you will want to

  1. Create the new symbol. I'm not sure how to do that in that part of the code.
  2. Create a TargetGlobalAddress node. See SITargetLowering::LowerGlobalAddress. That might give an idea how to handle that.
  3. Create an S_MOV_B32 instruction with the TargetGlobalAddress as the parameter.

@csyonghe

csyonghe commented Feb 25, 2020

I found that I can do the custom lowering in SITargetLowering::LowerINTRINSIC_WO_CHAIN.
Here's the code:

  case Intrinsic::amdgcn_reloc_constant: {
    Module *M = const_cast<Module*>(MF.getFunction().getParent());
    const MDNode* metadata = cast<MDNodeSDNode>(Op.getOperand(1))->getMD();
    auto symbolName = cast<MDString>(metadata->getOperand(0))->getString();
    auto relocSymbol = cast<GlobalVariable>(M->getOrInsertGlobal(symbolName, Type::getInt32Ty(M->getContext())));
    relocSymbol->setLinkage(llvm::GlobalValue::LinkageTypes::InternalLinkage);
    SDValue GA = DAG.getTargetGlobalAddress(relocSymbol, DL, MVT::i32, 0,
                                            SIInstrInfo::MO_ABS32_LO);
    return {DAG.getMachineNode(AMDGPU::S_MOV_B32, DL, MVT::i32, GA), 0};
  }

Here I create a new global variable and generate a S_MOV_B32 from the global address of that variable. However, I am not sure how this global variable will be handled by downstream logic. I tried to run amdllpc with a simple vertex-fragment shader. In the resulting elf, I can see the global variable in the symbol table, but the symbol has size 0 and points to section 0, and section 0's size is also 0. The elf also contains a relocation entry that points to this symbol, with offset = 8 and size = 0, which doesn't look right either.

My understanding of the intent is that the global variable will occupy 4 bytes of space in the resulting elf, and the operand of S_MOV_B32 will point to that address in the elf file. When we link multiple elfs from different shader stages together, we need to modify all the S_MOV_B32 operands to point to the newly linked variable, and then modify the value of the variable itself to the right offset. Is that right?

If this understanding is correct, then we must face the challenge of identifying all the TargetGlobalAddresses in the resulting ELF. Will llvm emit one relocation for each TargetGlobalAddress operand? If not, this can be very tricky to do. Given that the symbol occupies 0 bytes in the elf, is there an assumption that all global variables are stored in some special space?

@s-perron
Contributor Author

s-perron commented Feb 25, 2020

relocSymbol->setLinkage(llvm::GlobalValue::LinkageTypes::InternalLinkage);

I don't think you want internal linkage. We are pretending that this symbol is defined in another module (or in the pipeline create info). Internal means it is supposed to be defined in the current module.

The elf also contains a relocation entry that points to this symbol, with offset = 8 and size = 0, which doesn't look right either.

Something sounds odd about that. We should look at it together to see if we can make sense of it.

When we link multiple elfs from different shader stages together, we need to modify all the S_MOV_B32 operands to point to the newly linked variable, and then modify the value of the variable itself to the right offset.

That is not correct. The plan is to modify the S_MOV_B32 operands to contain the offset itself, and then the symbol is discarded. This is why we did not load the variable and then use the loaded value in the S_MOV_B32 instruction.

Will llvm emit one relocation for each TargetGlobalAddress operand?

Yes, llvm should emit a relocation for every S_MOV_B32 instruction that needs to be updated.

@csyonghe

Here's the prelink elf:

vs.elf: file format ELF64-amdgpu

Disassembly of section .text:
0000000000000000 _amdgpu_vs_main:
        s_getpc_b64 s[6:7]                                         // 000000000000: BE861C00 
        s_add_u32 s0, s3, s0                                       // 000000000004: 80000003 
        v_cndmask_b32_e32 v0, s0, v0, vcc                          // 000000000008: 00000000 
                0000000000000008:  R_AMDGPU_ABS32       doff_0_0
        s_mov_b32 s6, s4                                           // 00000000000C: BE860004 
        s_addc_u32 s1, s7, 0                                       // 000000000010: 82018007 
        v_add_co_u32_e32 v0, vcc, s5, v0                           // 000000000014: 32000005 
        s_load_dwordx4 s[0:3], s[0:1], 0x0                         // 000000000018: C00A0000 00000000 
        s_load_dwordx4 s[4:7], s[6:7], 0x10                        // 000000000020: C00A0103 00000010 
        s_waitcnt vmcnt(15) lgkmcnt(0)                             // 000000000028: BF8C007F 
        s_buffer_load_dwordx4 s[0:3], s[0:3], 0x80                 // 00000000002C: C02A0000 00000080 
        tbuffer_load_format_xyz v[0:3], v0, s[4:7],  dfmt:14, nfmt:7, 0 idxen// 000000000034: EBF12000 80010000 
        s_waitcnt vmcnt(15) lgkmcnt(0)                             // 00000000003C: BF8C007F 
        v_mov_b32_e32 v3, s0                                       // 000000000040: 7E060200 
        v_mov_b32_e32 v4, s1                                       // 000000000044: 7E080201 
        v_mov_b32_e32 v5, s2                                       // 000000000048: 7E0A0202 
        v_mov_b32_e32 v6, s3                                       // 00000000004C: 7E0C0203 
        exp pos0 v3, v4, v5, v6 done                               // 000000000050: C40008CF 06050403 
        s_waitcnt vmcnt(0)                                         // 000000000058: BF8C0F70 
        exp param0 v0, v1, v2, off                                 // 00000000005C: C4000207 00020100 
        s_endpgm                                                   // 000000000064: BF810000 
SYMBOL TABLE:
0000000000000000 g     F .text  00000068 _amdgpu_vs_main
0000000000000000  w      *UND*  00000000 doff_0_0

@csyonghe

@s-perron I noticed that the compiler did not correctly select the variant of the s_load_dwordx4 instruction that directly uses an offset register.

After digging into the pattern matching code in AMDGPUISelDAGToDAG, I found that the reason is that AMDGPUDAGToDAGISel::SelectSMRDOffset does not properly implement the case where Offset is not a constant. My fix is to check whether the offset DAG node is a scalar 32-bit int, or a 64-bit int that is the result of a ZERO_EXTEND. If so, return the 32-bit value as the resulting Offset. This allowed the compiler to select the correct variant.

There is one remaining issue: the SSRC0 field of the emitted S_MOV_B32 instruction is 0, which means fetching from SGPR0. Instead, we want it to be 255, which means the constant literal following that instruction. Currently I handle this in llpc's relocation logic (https://github.com/GPUOpen-Drivers/llpc/pull/457/files#diff-b9f030405a2a66df2f48475e88ad4f6dR1130), but this doesn't seem to be the right place. My current relocation implementation assumes that the instruction preceding the relocation address must be an S_MOV_B32 so it can apply the patch. But since there is no way I can guarantee that the S_MOV_B32 instruction does not get constant-propagated into its users, this may not work if that happens.
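The SSRC0 patch-up described above amounts to a single bit operation (a sketch with a hypothetical helper name; the field layout is restated from the encodings quoted in this thread, where 0xBE8200FF is s_mov_b32 with a trailing 32-bit literal):

```cpp
#include <cassert>
#include <cstdint>

// Sketch: in the SOP1 encoding of s_mov_b32, bits [7:0] hold SSRC0. The
// value 255 selects the 32-bit literal constant that follows the instruction
// dword, so a relocatable s_mov_b32 needs its SSRC0 field forced to 255.
uint32_t forceLiteralSsrc0(uint32_t sop1Dword) {
  return (sop1Dword & ~0xFFu) | 0xFFu;
}
```

For example, the dword 0xBE820000 (s_mov_b32 s2 reading an SGPR) becomes 0xBE8200FF, the literal-operand form seen in the disassembly listings in this thread.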

@csyonghe

@s-perron I have updated the two PRs in amdgpu and llpc. Could you take a look?

@csyonghe

Pre-link elf:

Disassembly of section .text:
0000000000000000 _amdgpu_vs_main:
        s_getpc_b64 s[6:7]                                         // 000000000000: BE861C00 
        s_mov_b64 s[0:1], s[6:7]                                   // 000000000004: BE800106 
        s_mov_b32 s0, s3                                           // 000000000008: BE800003 
        s_mov_b32 s6, s4                                           // 00000000000C: BE860004 
        s_mov_b32 s2, s0                                           // 000000000010: BE820000 
        v_cndmask_b32_e32 v0, s0, v0, vcc                          // 000000000014: 00000000 
                0000000000000014:  R_AMDGPU_ABS32       doff_0_0
        v_add_co_u32_e32 v0, vcc, s5, v0                           // 000000000018: 32000005 
        s_load_dwordx4 s[0:3], s[0:1], s2                          // 00000000001C: C0080000 00000002 
        s_load_dwordx4 s[4:7], s[6:7], 0x10                        // 000000000024: C00A0103 00000010 
        s_waitcnt vmcnt(15) lgkmcnt(0)                             // 00000000002C: BF8C007F 
        tbuffer_load_format_xyz v[0:3], v0, s[4:7],  dfmt:14, nfmt:7, 0 idxen// 000000000030: EBF12000 80010000 
        s_movk_i32 s4, 0x80                                        // 000000000038: B0040080 
        s_buffer_load_dwordx4 s[0:3], s[0:3], s4                   // 00000000003C: C0280000 00000004 
        s_waitcnt vmcnt(15) lgkmcnt(0)                             // 000000000044: BF8C007F 
        v_mov_b32_e32 v3, s0                                       // 000000000048: 7E060200 
        v_mov_b32_e32 v4, s1                                       // 00000000004C: 7E080201 
        v_mov_b32_e32 v5, s2                                       // 000000000050: 7E0A0202 
        v_mov_b32_e32 v6, s3                                       // 000000000054: 7E0C0203 
        exp pos0 v3, v4, v5, v6 done                               // 000000000058: C40008CF 06050403 
        s_waitcnt vmcnt(0)                                         // 000000000060: BF8C0F70 
        exp param0 v0, v1, v2, off                                 // 000000000064: C4000207 00020100 
        s_endpgm                                                   // 00000000006C: BF810000 

Post-link elf:

0000000000000000 _amdgpu_vs_main:
        s_getpc_b64 s[6:7]                                         // 000000000000: BE861C00 
        s_mov_b64 s[0:1], s[6:7]                                   // 000000000004: BE800106 
        s_mov_b32 s0, s3                                           // 000000000008: BE800003 
        s_mov_b32 s6, s4                                           // 00000000000C: BE860004 
        s_mov_b32 s2, 0                                            // 000000000010: BE8200FF 00000000 
        v_add_co_u32_e32 v0, vcc, s5, v0                           // 000000000018: 32000005 
        s_load_dwordx4 s[0:3], s[0:1], s2                          // 00000000001C: C0080000 00000002 
        s_load_dwordx4 s[4:7], s[6:7], 0x10                        // 000000000024: C00A0103 00000010 
        s_waitcnt vmcnt(15) lgkmcnt(0)                             // 00000000002C: BF8C007F 
        tbuffer_load_format_xyz v[0:3], v0, s[4:7],  dfmt:14, nfmt:7, 0 idxen// 000000000030: EBF12000 80010000 
        s_movk_i32 s4, 0x80                                        // 000000000038: B0040080 
        s_buffer_load_dwordx4 s[0:3], s[0:3], s4                   // 00000000003C: C0280000 00000004 
        s_waitcnt vmcnt(15) lgkmcnt(0)                             // 000000000044: BF8C007F 
        v_mov_b32_e32 v3, s0                                       // 000000000048: 7E060200 
        v_mov_b32_e32 v4, s1                                       // 00000000004C: 7E080201 
        v_mov_b32_e32 v5, s2                                       // 000000000050: 7E0A0202 
        v_mov_b32_e32 v6, s3                                       // 000000000054: 7E0C0203 
        exp pos0 v3, v4, v5, v6 done                               // 000000000058: C40008CF 06050403 
        s_waitcnt vmcnt(0)                                         // 000000000060: BF8C0F70 
        exp param0 v0, v1, v2, off                                 // 000000000064: C4000207 00020100 
        s_endpgm                                                   // 00000000006C: BF810000 

@trenouf
Member

trenouf commented Feb 28, 2020

Widening out to more of the problem space:

How does the pre-link code know what sgpr the set's descriptor table pointer is in?

In the current pipeline compilation model, the lookup of the descriptor in PatchDescriptorLoad that gives the offset in the descriptor table (the thing you're turning into a reloc above) also returns the offset of the descriptor table pointer in the overall top-level resource data.

A limited number of those descriptor table pointers (set pointers) can be passed in sgprs on wave dispatch. Earlier on in the compilation, based on already-gathered flags saying which ones are used, PatchEntryPointMutate decides which ones are going to be passed in sgprs. Any other set needs to have its pointer loaded from the overall top-level resource data, the "spill table", whose address is itself passed in an sgpr if needed (decided by PatchEntryPointMutate).

What to do with a descriptor table pointer that is passed in sgpr? I guess all the logic that handles that could be modified to refer to it by set number instead of top-level resource data offset, and the set number would end up in the place of the top-level resource data offset in the PAL metadata that says what is loaded into what sgpr. It would then be the job of the link phase to convert that PAL metadata entry from set number into the offset that PAL is expecting.

What if the set number is pathologically large (e.g. 0x12345678)? That would confuse everyone, because the PAL metadata reserves values over 0x10000000 (I think) to mean something other than "value from this offset in the spill table". Perhaps such a shader would fail separate compilation, and could only be pipeline compiled.

What to do with the set pointers that need a load from the "spill table"? (Because there were too many set pointers to fit into the sgprs available at wave dispatch.) The offset for that could be another reloc. Or we could decide that a shader that needs one of those is not separately compilable, and could only be pipeline compiled.

The sets-in-sgprs-at-wave-dispatch limit is essentially 13 on gfx9 or 29 on gfx10 in a VS (PAL takes two for itself and the vertex buffer table takes one), minus one if you need one for the spill table, minus another one if you have streamout.
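The budget arithmetic in the paragraph above, restated as a sketch (the constants come straight from the text and are not independently verified against hardware documentation):

```cpp
#include <cassert>

// Sketch: user SGPRs available for set pointers in a VS at wave dispatch,
// per the counts above (13 on gfx9, 29 on gfx10, after PAL's two and the
// vertex buffer table's one), minus one if the spill table needs an sgpr,
// minus another one for streamout.
int setPointerSgprBudget(bool isGfx10, bool needsSpillTable, bool hasStreamout) {
  int budget = isGfx10 ? 29 : 13;
  if (needsSpillTable)
    budget -= 1;
  if (hasStreamout)
    budget -= 1;
  return budget;
}
```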

What about compute shaders? The complication here is that the assignment of set pointers into wave dispatch sgprs is more strictly limited: the N sgprs have to be the first N entries in the top-level resource table, in the right order. I guess we could still follow the scheme above, where an sgpr is allocated to a set pointer without knowing where that set pointer is going to be in the top-level resource table, but then the link step would need to glue on some prologue code to swap sgprs around and maybe load some of them.

@trenouf

trenouf commented Mar 1, 2020

What if the set number is pathologically large (e.g. 0x12345678)?

I withdraw that comment. :-) I think there is a small limit on descriptor set number.

@s-perron

s-perron commented Mar 2, 2020

@trenouf I only have a vague understanding of what you are referring to. We are not overly familiar with PAL and LLPC, so I can understand that there could be a problem. However, I don't know how to create a test case out of what you said.

Can you provide an example shader or pipeline that would show the problem? That way we could run it through our code, figure out how to handle it, and turn it into a shaderdb test.

@csyonghe

csyonghe commented Mar 2, 2020

I wonder if @trenouf was referring to a shader that uses different descriptor sets, something like this:

#version 450
#extension GL_ARB_separate_shader_objects : enable

layout(binding = 0) uniform UniformBufferObject {
    mat4 model;
    mat4 view;
    vec4 proj;
} ubo;

layout(set=1, binding=0) uniform UBO1 {
    vec3 v1;
} ubo1;

layout(set=10, binding=0) uniform UBO2 {
    vec3 v2;
} ubo2;

layout(location = 0) in vec2 inPosition;
layout(location = 1) in vec3 inColor;

layout(location = 0) out vec3 fragColor;

void main() {
    gl_Position = ubo.proj;
    fragColor = inColor + ubo1.v1 + ubo2.v2;
}

@csyonghe

csyonghe commented Mar 3, 2020

After testing PR #457 with our internal renderdoc traces, I captured a bunch of failure cases. These failed cases share a common problem: the symbol table dumped by llvm-objdump is not showing an entry for the vertex shader, even though such an entry is there in the elf file.

It seems that the first entry in the symtab section must be a symbol defined in a valid section, and that first symbol is somehow not recognized by objdump. The fix I implemented is simply to insert a dummy symbol (see https://github.com/GPUOpen-Drivers/llpc/pull/457/files#diff-b9f030405a2a66df2f48475e88ad4f6dR1222). Although this fixed all the issues and allowed all the tests to pass, I cannot explain why such behavior exists. Is this a known convention of the ELF format that is documented somewhere?

@trenouf

trenouf commented Mar 3, 2020

You mean symbol with index 0? Yes, I think that is a dummy one, because symbol index 0 is used in several places to mean "no symbol". I could be wrong though.

@Flakebi

Flakebi commented Mar 3, 2020

@trenouf is right, according to the elf64 spec: https://uclibc.org/docs/elf-64-gen.pdf

The first symbol table entry is reserved and must be all zeroes. The symbolic
constant STN_UNDEF is used to refer to this entry.
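For illustration, a minimal C++ sketch of that convention (a hand-rolled struct and builder, not LLPC's actual ELF writer):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Per the ELF-64 spec quoted above, entry 0 of .symtab is the reserved
// STN_UNDEF entry and must be all zeroes; real symbols start at index 1.
// A writer that omits the null entry produces a symbol table that tools
// such as llvm-objdump may mis-read.
struct Elf64Sym {
  uint32_t st_name = 0;  // index into .strtab
  uint8_t st_info = 0;   // type and binding
  uint8_t st_other = 0;  // visibility
  uint16_t st_shndx = 0; // section index
  uint64_t st_value = 0;
  uint64_t st_size = 0;
};

std::vector<Elf64Sym> buildSymtab(const std::vector<Elf64Sym> &realSyms) {
  std::vector<Elf64Sym> symtab(1); // index 0: reserved all-zero entry
  symtab.insert(symtab.end(), realSyms.begin(), realSyms.end());
  return symtab;
}
```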

@csyonghe

csyonghe commented Mar 4, 2020

Regarding the top level descriptor table address issue, I think a reasonable plan is to just spill everything, passing only a top-level pointer to the spill table in an sgpr. This will greatly simplify the compiler and PAL implementation, and this has worked well for PS4. I think it is worthwhile to give it a try since it is easy to implement, and see how this affects performance. @trenouf What are your thoughts?

@csyonghe

csyonghe commented Mar 4, 2020

If we go with the path where all descriptors are stored in memory, we can simplify this even further by deriving the offset of top level descriptor tables directly from the "binding" number provided in the spirv. This would allow us to eliminate the relocation step altogether.

@trenouf

trenouf commented Mar 4, 2020

You mean directly from the "descriptor set" number?

That scheme of leaving the top-level table in memory would work, but I think it is not too difficult to take the next step of allowing some entries to be in sgprs as now.

@csyonghe

csyonghe commented Mar 4, 2020

Yeah, directly from the "descriptor set" number, as this will eliminate the relocation step to obtain the descriptor table pointer.
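If the top-level table layout were fixed by set number like this, the lookup would become pure arithmetic; a sketch under the assumption of 64-bit set pointers (the names and `kPointerSize` are illustrative):

```cpp
#include <cstdint>

// Sketch of the proposed scheme: each set's descriptor table pointer
// lives at a fixed slot in the top-level (spill) table, so its byte
// offset follows directly from the set number and no relocation is
// needed. kPointerSize = 8 assumes 64-bit pointers.
constexpr uint32_t kPointerSize = 8;

uint32_t setPointerByteOffset(uint32_t setNumber) {
  return setNumber * kPointerSize;
}
```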

Apart from this, I am still trying to understand the concepts that appear in PatchDescriptorLoad:

  1. What is a BufferCompactDesc and when are they used?
  2. What is a dynamic descriptor?
  3. When is DescriptorLoadAddress used?

@trenouf

trenouf commented Mar 5, 2020

  1. It's a 64-bit descriptor that gets expanded to 128-bit before use. But I'm not sure what circumstance they're used in. Maybe Qun and Rex know. Let's hope they are unusual, because the presence of one breaks this whole idea and we'll have to fall back to pipeline compilation.
  2. A dynamic descriptor is an array of descriptors that typically have a dynamic (variable) index.
  3. Looks like DescriptorLoadAddress is never used. I think it's a hangover from an earlier way of doing descriptors in llpc. Its definition, and the code that handles it in llpcPatchDescriptorLoad, should be removed.

@linqun

linqun commented Mar 6, 2020

  • BufferCompactDesc is used for dynamic descriptors, including both UBOs and SSBOs. The major purpose is to reduce the amount of user data, since we only have 16/32 physical user data registers. It is enabled only if robust access isn't enabled.
  • In Vulkan, a dynamic descriptor means the descriptor can be modified dynamically. In the driver, we allocate dynamic descriptors at the root level, i.e. in user data registers.

@nhaehnle

nhaehnle commented Mar 9, 2020

One other thing to watch out for is how descriptor sets interact with push constants. I don't know the details of how (whether) they're interleaved.

Still, in graphics shaders I was under the impression that userdata SGPRs are always "compacted", so most or even all cases should be okay?

Edit: That is to say: the compiler can know for certain whether a descriptor table offset will go into the spill table or not. If it doesn't go into the spill table, the compiler can also know the SGPR for certain. Only when the offset ends up being spilled, the compiler does not know where in the spill table the offset is spilled to. But that's the same problem as with the offsets inside of descriptor sets themselves, so the same mechanism should apply.

@nhaehnle

One more related thing to watch out for is the issue that is starting to be addressed by #488 (and see #497). No ISA changes required for this, it's all about the metadata; specifically, what the offset of each descriptor set pointer is in the userdata table that xgl maintains via ICmdBuffer::CmdSetUserData.

@trenouf

trenouf commented Mar 13, 2020

Hi @linqun

In Vulkan, a dynamic descriptor means the descriptor can be modified dynamically. In the driver, we allocate dynamic descriptors at the root level, i.e. in user data registers.

This kind of opens another can of worms...

It looks like one descriptor set can contain a mix of "dynamic" descriptors (VK_DESCRIPTOR_TYPE_{UNIFORM|STORAGE}_BUFFER_DYNAMIC) and normal descriptors, and the driver puts the dynamic ones in the root level and the other ones in a descriptor table for that set (where the root level contains a pointer to the descriptor table for the set).

How can this work with shader compilation? The compiler will not know whether a descriptor is dynamic just from looking at the spir-v.

Should we instead put even the dynamic descriptors into the set's descriptor table? Will that cause problems with the dynamic update mechanism?

@trenouf

trenouf commented Mar 16, 2020

Just realized that, if we have an array of image or sampler descriptors, the array stride needs to be a reloc. We don't know at shader compile time whether it is an array of separate image/sampler descriptors (stride 32 or 16 bytes), or an array of combined image+sampler descriptors (stride 48 bytes).
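A sketch of what the indexed address computation would look like with the stride left as a link-time constant (the function and parameter names are illustrative, not LLPC code):

```cpp
#include <cstdint>

// The stride of a descriptor array element is 16, 32, or 48 bytes
// depending on whether it holds separate sampler/image descriptors or
// combined image+sampler descriptors. That is unknown at shader compile
// time, so the compiled code would carry the stride as a relocated
// constant that the linker patches once the pipeline layout is known.
uint64_t descriptorElementAddr(uint64_t arrayBase, uint32_t index,
                               uint32_t strideReloc /*patched at link*/) {
  return arrayBase + uint64_t(index) * strideReloc;
}
```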
