-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handling descriptor offsets as relocations #474
Comments
I still maintain that in the long run, it probably makes more sense to do this in PAL. At a high level, it would work by passing to PAL:
But I'm just reiterating this to make sure it isn't forgotten. Cross-project coordination of such projects can be tricky and if we can't get PAL moving quickly enough then doing it in LLPC initially is acceptable to me. Though if you have reasons to believe that this is somehow inherently not a good idea, I would be interested in hearing them.
I agree that this and using the "dummy symbol", is the right way to go. I'm not sure that you absolutely need a new relocation type though. If we define the value of the dummy symbol to be the offset into the descriptor table, then R_AMDGPU_ABS32 is in some sense really the correct relocation to use. I would guesstimate the performance benefit of s_nop'ing out the s_mov as so small that it's probably just not worth it, but let's keep it in mind. That certainly would require a new relocation type.
I'm tempted to suggest simply using an LLVM global variable instead. However, that would result in a weird ptrtoint cast, so it's probably not such a great idea after all. However, what I'd really implore you to do is to design this in a more open-ended way. It seems preferable to me to have an
With the above, it'd probably be possible to just select the S_MOV_B32 with a TargetGlobalAddress operand. All the rest will likely work out, because this unknown operand shouldn't fold into the S_LOAD_DWORDX4. To give you an idea, the steps of compiling test/CodeGen/AMDGPU/lds-reloc.ll contain stuff like:
You'll still have to do some annoying plumbing in the MC layer. I can never remember how that works either, but what I remember from when I did the LDS relocations is that I just figured stuff out as I went and it was mostly okay. Don't be too surprised if you see code that seems like it doesn't make sense -- it's quite likely that your impression is correct, and feel free to propose cleanups if you run into something like that ;)
Again, I'm not convinced that it's actually needed. But in any case, since relocation types are target-specific, we're the owners of our destiny here. The basic process is to propose it as a patch on LLVM Phabricator and make sure that people like Tony Tye and myself have had a chance to comment on it. Documentation of relocation types goes into AMDGPUUsage.rst. Back to the question of relocating directly into the S_LOAD_DWORDX4. I haven't thought about this in much detail, but you can probably at least determine an upper bound on the offset statically, right? If this turns out to be possible, I would suggest adding the known bounds as i64 field to the metadata in LLVM IR, somehow(?) making sure it survives sufficiently long through SelectionDAG, and upgrading the SMEM selection logic to fold the operand if the metadata says it's guaranteed below 2^20. Then define a new relocation type for that. This actually seems surprisingly feasible, compared to the alternative of NOPing out the S_MOV and all that. And it would likely address virtually all of the realistic scenarios. You'll probably still want to plumb the simpler version through first :) |
I do not have a problem with that, we will try to code the application of the relocation in a way that can be easily factored out, and moved to PAL when needed.
If the reviewer are okay with reusing the same relocation type, I am very happy to do so. My gut feeling is that it would not have been acceptable.
That is a great idea. We will try to implement it that way. The rest of what you says seems to make sense. Thanks for pointing out another example that we can use. We will start rewriting the initial implementation. |
@nhaehnle What do you think is the right representation for this |
I figured that I can use define the instrinsic as
Then in llpc, create a Next, in
Then in
What I haven't figure out, is how to define a global value named |
Maybe a custom lowering for the intrinsic would help? Ideally, the custom lowering could generate the new GlobalVariable at the time of lowering (if it does not already exist) or in some other way ensure the symbol will exist, and then generate an S_MOV_B32 with a TargetGlobalAddress operand directly. This is mostly a gut feeling, admittedly. |
The builtin definition and generation look right. I agree with Nicolai, that we should try to avoid defining a new MIR instruction. So code related to S_RELOC_CONSTANT should be removed. The custom lower would not be done in the For your lowering function, you will want to
|
I find that I can do the custom lowering in
Here I create a new global variable and generate a My understanding of the intend is that the global variable will occupy 4 bytes of space in the resulting elf, and the operand of S_MOV_B32 will point to that address in the elf file. When we linking multiple elfs from different shader stages together, we need to modify all the S_MOV_B32 operands to point to the newly linked variable, and then modify the value of the variable itself to the right offset. Is that right? If this understanding is correct, then we must face the challenge of identifying all the TargetGlobalAddresses in the resulting ELF. Will llvm emit one relocation for each TargetGlobalAddress operand? If not this can be very tricky to do. Given that the symbol occupies 0 bytes in the elf, is there an assumption that all global variables are stored at some special space? |
I don't think you want internal linkage. We are pretending like this symbol is defined in another module (or in the pipeline create info). Internal means it is suppose to be defined in the current module.
Something sounds odd about that. We should look at it together to see if we can make sense of it.
That is not correct. The plan is to modify the S_MOV_B32 operands to contain the offset itself, and then the symbol is discarded. This is why we did not load the variable and then use to value being loaded in the S_MOV_B32 instruction.
Yes, llvm should emit a relocation for every S_MOV_32 instruction that needs to be updated. |
Here's the prelink elf:
|
@s-perron Noticed that the compiler did not correctly select the variant of After digging into the pattern matching code in There is one remaining issue: the SSRC0 field of the emitted S_MOV_B32 instruction is 0, which means fetching from SGPR0. Instead, we want it to be 255, which means the constant literal following that instruction. Currently I handle in llpc's relocation logic (https://github.com/GPUOpen-Drivers/llpc/pull/457/files#diff-b9f030405a2a66df2f48475e88ad4f6dR1130), but this doesn't seem to be the right place. My current relocation implementation assumes that the instruction preceeding the relocation address must be S_MOV_B32 so it can apply the patch up. But since there is no way I can guarantee the S_MOV_B32 instruction does not get constant propagated into the users, this may not work If that happens. |
@s-perron I have updated the two PRs in amdgpu and llpc. Could you take a look? |
Pre-link elf:
Post-link elf:
|
Widening out to more of the problem space: How does the pre-link code know what sgpr the set's descriptor table pointer is in? In the current pipeline compilation model, the lookup of the descriptor in PatchDescriptorLoad that gives the offset in the descriptor table (the thing you're turning into a reloc above) also returns the offset of the descriptor table pointer in the overall top-level resource data. A limited number of those descriptor table pointers (set pointers) can be passed in sgprs on wave dispatch. Earlier on in the compilation, based on already-gathered flags saying which ones are used, PatchEntryPointMutate decides which ones are going to be passed in sgprs. Any other set needs to have its pointer loaded from the overall top-level resource data, the "spill table", whose address is itself passed in an sgpr if needed (decided by PatchEntryPointMutate). What to do with a descriptor table pointer that is passed in sgpr? I guess all the logic that handles that could be modified to refer to it by set number instead of top-level resource data offset, and the set number would end up in the place of the top-level resource data offset in the PAL metadata that says what is loaded into what sgpr. It would then be the job of the link phase to convert that PAL metadata entry from set number into the offset that PAL is expecting. What if the set number is pathologically large (e.g. 0x12345678)? That would confuse everyone, because the PAL metadata reserves values over 0x10000000 (I think) to mean something other than "value from this offset in the spill table". Perhaps such a shader would fail separate compilation, and could only be pipeline compiled. What to do with the set pointers that need a load from the "spill table"? (Because there were too many set pointers to fit into the sgprs available at wave dispatch.) The offset for that could be another reloc. Or we could decide that a shader that needs one of those is not separately compilable, and could only be pipeline compiled. The sets-in-sgprs-at-wave-dispatch limit is essentially 13 on gfx9 or 29 on gfx10 in a VS (PAL takes two for itself and the vertex buffer table takes one), minus one if you need one for the spill table, minus another one if you have streamout. What about compute shaders? The complication here is that the assignment of set pointers into wave dispatch sgprs is more strictly limited: the N sgprs have to be the first N entries in the top-level resource table, in the right order. I guess we could still follow the scheme above, where an sgpr is allocated to a set pointer without knowing where that set pointer is going to be in the top-level resource table, but then the link step would need to glue on some prologue code to swap sgprs around and maybe load some of them. |
I withdraw that comment. :-) I think there is a small limit on descriptor set number. |
@trenouf I only have a vague understand of what you are referring to. We are not overly familiar with PAL and LLPC, so I can understand that there could be a problem. However, I don't know how to create a test case out of what you said. Can you provide an example shader or pipeline that would show the problem? That way we could run it through our code, figure out how to handle it, and turn it into a shaderdb test. |
I wonder if @trenouf was referring to a shader that uses different descriptor sets, something like this:
|
After testing PR #457 with our internal renderdoc traces, I captured a bunch of failure cases. These failed cases share a common problem: the symbol table dumped by llvm-objdump is not showing an entry for the vertex shader, even though such an entry is there in the elf file. It seems that the first entry in the symtab section must be a symbol defined in a valid section, and that first symbol is somehow not recognized by objdump. The fix I implemented is simply to insert a dummy symbol (see https://github.com/GPUOpen-Drivers/llpc/pull/457/files#diff-b9f030405a2a66df2f48475e88ad4f6dR1222). Although this fixes all issues and allowed all the tests to pass, I cannot explain why such behavior exists. Is this a known convention of the ELF format that is documented somewhere? |
You mean symbol with index 0? Yes, I think that is a dummy one, because symbol index 0 is used in several places to mean "no symbol". I could be wrong though. |
@trenouf is right, according to the elf64 spec: https://uclibc.org/docs/elf-64-gen.pdf
|
Regarding the top level descriptor table address issue, I think a reasonable plan is to just spill everything, and passing only a top-level pointer to the spill table in a sgpr. This will greatly simply compiler and PAL implementation, and this has worked well for PS4. I think it is worthwhile to give it a try since it is easy to implement, and see how this affects performance. @trenouf What's your thoughts? |
If we go with the path that all descriptors are stored in memory, we can even further simply this by deriving the offset of top level descriptor tables directly from the "binding" number provided in the spirv. This would allow us to eliminate the relocation step altogether. |
You mean directly from the "descriptor set" number? That scheme of leaving the top-level table in memory would work, but I think it is not too difficult to take the next step of allowing some entries to be in sgprs as now. |
Yeah, directly from the "descriptor set" number, as this will eliminate the relocation step to obtain the descriptor table pointer. Apart from this, I am still trying to understand the concepts that appear in PatchDescriptorLoad:
|
|
|
One other thing to watch out for is how descriptor sets interact with push constants. I don't know the details of how (whether) they're interleaved. Still, in graphics shaders I was under the impression that userdata SGPRs are always "compacted", so most or even all cases should be okay? Edit: That is to say: the compiler can know for certain whether a descriptor table offset will go into the spill table or not. If it doesn't go into the spill table, the compiler can also know the SGPR for certain. Only when the offset ends up being spilled, the compiler does not know where in the spill table the offset is spilled to. But that's the same problem as with the offsets inside of descriptor sets themselves, so the same mechanism should apply. |
One more related thing to watch out for is the issue that is starting to be addressed by #488 (and see #497). No ISA changes required for this, it's all about the metadata; specifically, what the offset of each descriptor set pointer is in the userdata table that xgl maintains via ICmdBuffer::CmdSetUserData. |
Hi @linqun
This kind of opens another can of worms... It looks like one descriptor set can contain a mix of "dynamic" descriptors (VK_DESCRIPTOR_TYPE_{UNIFORM|STORAGE}_BUFFER_DYNAMIC) and normal descriptors, and the driver puts the dynamic ones in the root level and the other ones in a descriptor table for that set (where the root level contains a pointer to the descriptor table for the set). How can this work with shader compilation? The compiler will not know whether a descriptor is dynamic just from looking at the spir-v. Should we instead put even the dynamic descriptors into the set's descriptor table? Will that cause problems with the dynamic update mechanism? |
Just realized that, if we have an array of image or sampler descriptors, the array stride needs to be a reloc. We don't know at shader compile time whether it is an array of separate image/sampler descriptors (stride 32 or 16 bytes), or an array of combined image+sampler descriptors (stride 48 bytes). |
When constructing relocatable shader elf, there is some data that cannot be known at compile time. One example is the offset needed to do a descriptor load. We want to be able to use a relocation to represent this offset, which can then be filled in during the linking phase that will create the pipeline elf.
If we want to replace the constant offset with a relocation, we need to figure out 3 different ways of representing the value for the different phases of the compilation.
We already have a version of the code that will generate relocation and replace them. This code is in #457 and GPUOpen-Drivers/llvm-project#1. This version of the code is wrong in many ways, and needs to be improved.
Relocation in elf
This is the code that is generated by the initial version of llpc/llvm that generates relocations.
This is wrong in a few different ways. First, the relocation offset is wrong. The relocation should point to the location in the
s_movk_i32
instruction that contains the constant value. Second, the descriptor offset can only be 16-bit because that is all thes_movk_i32
instruction can hold. Third, the type of relocation is wrong. TheR_AMDGPU_ABS32
is meant to be used as the absolute address of the symbol to which the relocation applies. We need to create a new relocation.The code we want generated would be something more like:
We still want to create dummy symbol
doff_0_0
(the descriptor offset for set 0 and binding 0). This symbol is intended to hold the constant offset of the resource, and the new relocation says we should replace the location by that constant 32-bit value.For performance, I do not know what is best. I would like the force the
s_mov_b32
to always immediately precede the loads_load_dwordx4
. Then we could implement an optimization in the linking phase: if the offset is less than 2^20, then make thes_mov_b32
anop
instruction and rewrite the load so it uses an immediate offset instead of the register.Representation in the compilation
The offset is added to the code during "Patch LLVM for descriptor load operations". This happens earlier in the compilation. We need a way to track that the offset of a particular descriptor is to be used. As we go through the different phases of the compilation, that representation needs to change.
Offset representation in llvm-ir
The translation from Spir-V to llvm-ir, will generate a call to
@llpc.descriptor.load.buffer
. This gets expanded in "Patch LLVM for descriptor load operations", and part of the expansion is calculating the offset of the descriptor using the descriptor offset field of the pipeline info. See https://github.com/GPUOpen-Drivers/llpc/blob/dev/llpc/patch/llpcPatchDescriptorLoad.cpp#L316 for where this is calculated.The our initial implementation, we chose to replace the offset with a builtin function
@llvm.amdgcn.desc.offset
that represents the offset.This worked in the sense that it correctly marked that the offset of a particular descriptor is needed. However, the pattern recognition when lowering to MIR does not work well.
Offset representation in MIR
The next phase is MIR. The main problem is that the ISEL does not recognize the code generated above as having an offset. So it does operations on the base register instead of creating an "S_LOAD_DWORDX4_SGPR" where the offset is in a register. I'm guessing we need to fix up the pattern matching.
The bigger problem is how to represent the offset of a particular descriptor. I wanted to create a dummy instruction that has two parameter, the set and binding. However, I am not sure how to properly generate that instruction:
%16:sreg_32 = S_DESC_OFFSET 0, 0
.Currently in
AMDGPU DAG->DAG Pattern Instruction Selection
, I identified the builtin above and replace it with the new instruction. This is a sample of code that would currently be generated:We need to figure out how to define an instruction that takes two immediate operands in MIR, but outputs an "S_MOV_B32 " when doing machine code lowering. I can see how to do this, but I do not want to put the effort into doing it if the design is simply wrong.
An alternative would be to create
S_LOAD_DWORDX4_RELOC
. We would do a pattern recognition to that the offset is a reloction. We would have to be careful how we handle large offset and offsets into descriptor arrays.Machine code lowering
In
AMDGPUMCInstLower::lower
, theS_DESC_OFFSET
instruction is lowered to ans_movk_i32
instruction with a fixup for the immediate value. As has already been mentioned, this is a problem because it is only at 16-bit value. This is also where we choose the relocation type. I was unsure of what needs to be done if I want to create a new relocation type, so I reused an existing one. Anything that can point us to how to create a new relocation type will be helpful.Then
SIMCCodeEmitter::getMachineOpValue
is used to actually output the relocation. The offset is currently wrong, so it is not actually on the immediate value. The size of the fix up is also wrong. It outputs a 32-bit fixup, when there is only 16-bits that can be written. All of this can be easily fixed up once we have the instruction correct.I should be able to output an
s_mov_b32
instruction instead, and do a 32-bit fixup that would cover the entire offset.The text was updated successfully, but these errors were encountered: