LGC: shader compilation proposal #507

Closed

trenouf opened this issue Mar 17, 2020 · 5 comments

trenouf (Member) commented Mar 17, 2020

LGC: shader compilation proposal

There are several different efforts to move away from whole-pipeline
compilation in LLPC, or that will affect LLPC in the future. This proposal is
to unify them in new LGC (LLPC middle-end) functionality.

  • There is a "partial pipeline compilation" scheme in LLPC that kind of hacks
    into LGC's otherwise whole-pipeline compilation, and does ELF linking in the
    front-end using ad-hoc ELF reading and writing code, rather than LLVM code.

  • Steven et al have started work on their scheme to be able to compile separate
    shaders (VS, FS, CS) offline to pre-populate a shader cache, with some
    pipeline state missing, and some pipeline state guessed with multiple
    combinations per shader. This builds on the front-end linking functionality
    above. See the GitHub issues
    Cache creator tool,
    Relocatable elf vertex input handling,
    Handling descriptor offsets as relocations.

  • There are AMD-internal discussions about shader compilation.

This proposal is to unify these different efforts behind new LGC (LLPC
middle-end) functionality. The link stage in particular requires knowledge
that belongs in the middle-end, such as the workings of PAL metadata and of
ELF reading and writing, and it needs to be shared by multiple potential LLPC
front-ends.

Background

Existing whole pipeline compilation

Whole-pipeline compilation in LLPC works like this (a rough code sketch
follows the list):

  1. For each shader, run the front-end shader compilation: SPIR-V reader and
    various "lowering" passes use Builder to construct the IR for a shader. This
    phase does not use pipeline state.
  2. LGC (the middle-end) is given the pipeline state, and it links the shader IR
    modules into a pipeline IR module.
  3. LGC runs its middle-end passes and optimizations, then passes the resulting
    pipeline IR module to the AMDGPU back-end for pipeline ELF generation.
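For concreteness, here is a minimal sketch of that flow in code. This is not
the real LGC API; the state type and the two helper functions are placeholders
for illustration only.

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"

struct PipelineState; // placeholder for the full pipeline state
llvm::Module *lgcLinkShaders(llvm::ArrayRef<llvm::Module *> shaders,
                             const PipelineState &state);       // placeholder
void lgcGeneratePipelineElf(llvm::Module *pipelineModule,
                            llvm::raw_pwrite_stream &elfOut);   // placeholder

void compileWholePipeline(llvm::ArrayRef<llvm::Module *> shaderModules,
                          const PipelineState &pipelineState,
                          llvm::raw_pwrite_stream &elfOut) {
  // Step 1 happened earlier: the SPIR-V reader and lowering passes built each
  // module in shaderModules via Builder, without using pipeline state.

  // Step 2: hand LGC the pipeline state and link the shader IR modules into a
  // single pipeline IR module.
  llvm::Module *pipelineModule = lgcLinkShaders(shaderModules, pipelineState);

  // Step 3: run middle-end passes and optimizations, then the AMDGPU
  // back-end, producing a whole-pipeline ELF.
  lgcGeneratePipelineElf(pipelineModule, elfOut);
}
```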

Existing(ish) shader and partial pipeline caching

Existing partial pipeline compilation

There are some changes on top of this to handle a "partial pipeline
compilation" mode. Part way through step 2, LGC calls a callback provided by
the front-end with a hash of each shader and the pipeline state and
input/output info pertaining to it. The callback in the front-end can ask to
omit a shader stage, if it finds it already has a cached ELF containing that
shader. Then, the front-end has a post-compilation ELF linking step to use the
part of that cached ELF for the omitted shader. This only works for VS-FS, and
has some other provisos, because of the way that it plucks the part of the
pipeline it needs out of a whole pipeline ELF.

This scheme has some disadvantages, especially the way that it allows the
middle-end to think that it is compiling a whole pipeline, and then
post-processes the ELF to extract the part it needs. A more holistic approach
would be for the middle-end to know that it is not compiling a whole pipeline,
and for the link stage to live in the middle-end, where knowledge of (for
example) PAL metadata should be confined.

Steven et al's shader caching

Steven's scheme is to compile shaders offline to pre-populate a shader cache.
This would involve compiling a shader with most of the pipeline state missing
(principally resource descriptor layout, vertex buffer info and color export
info), and with some "bounded" items in the pipeline state set to a guessed
value. The resulting compiled shader ELF would be cached keyed on the input
SPIR-V and (I assume) the "bounded" parts of the pipeline state that were set.

The proposal

This proposal outlines a shader compilation scheme using relocs, prologs and
epilogs, and a pipeline linking stage, all handled in LGC (the LLPC
middle-end).

Shader compilation vs pipeline compilation

This proposal does not cover how and when a driver decides to do shader
compilation. Of the two compilation modes:

  • shader compilation and caching with pipeline linking for minimized compile time;
  • full pipeline compilation for optimal code;

there is scope for API and/or driver changes to use shader compilation first,
then kick off a background thread to do the optimized compilation and swap the
result in at the next opportunity.

Early vs late shader caching

We can divide existing and proposed shader caching schemes into two types:

  • Early shader caching caches the shader keyed on just its input language
    (SPIR-V for Vulkan), possibly combined with some of the pipeline state.
    Steven's scheme is an example.

  • Late shader caching caches the shader after some part of the compilation has
    taken place, and keys it on the state of the compilation at that point. The
    existing partial pipeline compilation scheme is an example.

I propose to focus here on early shader caching, which has the following pros and cons:

  • Pro: Minimize compilation time for cache-hit case
  • Pro: Fits in with Steven's scheme
  • Con: Only limited VS-FS optimization possible (although even late shader
    caching still has some limits on this, unless you make it so late you have
    done a large chunk of the compilation).

Nicolai also suggests taking the existing partial pipeline compilation scheme,
a late shader caching scheme, and tidying up its interface and implementation
(see
Inter-shader data cache tracking).
One problem is that we pretty much have to choose one or the other; within
one application run, you can't use both at the same time, because a shader
would get cached both early and late, and the next time the same shader is
seen, the early cache check would always succeed.

The choice partly depends on how you view the existing partial pipeline
compilation scheme: was a late shader caching scheme chosen for the possibility
of VS-FS optimizations, or was it chosen because that meant that it could be
implemented without implementing the relocs and prologs and epilogs in this
proposal? I suspect the latter, and I reckon we're better off with an early
shader caching scheme for the two pros I list above.

What shaders are cached

This proposal makes no attempt to cache VS, TCS, TES, GS shaders that make up
part of a geometry or tessellation vertex-processing stage. The FS in such a
pipeline can still be cached though. So the shader types that can be cached
are:

  • CS
  • VS, as long as it is standalone (if a VS accidentally gets compiled and
    cached but turns out not to be standalone, it just gets ignored; hopefully
    you can tell that a VS is unlikely to be standalone before reaching that
    point)
  • FS

In addition, we can compile the whole vertex-processing stage (VS-GS,
VS-TCS-TES, or VS-TCS-TES-GS) without the FS, or with an already-compiled FS.

Failure of shader compilation or pipeline linking

There needs to be scope for shader compilation or pipeline linking to fail, in
which case the front-end needs to do full pipeline compilation instead:

  • Shader compilation can fail if the compiler can tell in advance that the
    shader does something that will not work in the shader compilation model, for
    example a VS that is obviously not a standalone VS.

  • Pipeline linking can fail because the pipeline uses something that is not
    possible to implement in this model, for example:

    • converting sampler
    • specialization constant that shader compilation did not render as a reloc
      (used in a type) and whose value does not match the shader default
    • descriptor set split between its own table and the top-level table

    Pipeline linking also needs to fail if the pipeline uses something that
    has not yet been implemented in this model.

This kind of failure is different to normal compilation failure, in that it
needs to exit cleanly and clean up, because the driver or front-end is going to
retry as a full pipeline compilation. If any such condition is detected in an
LLVM pass flow, we need to come up with a clean exit mechanism, such as
deleting all the code in the module and detecting that at the end.
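A minimal sketch of one such clean-exit mechanism, assuming failure is
signalled by gutting the module and setting a module flag that is checked once
the pass flow finishes; the flag name and helper names are made up.

```cpp
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"

// Hypothetical helper: a pass that detects an unsupported construct calls
// this instead of asserting, so the compile can fail cleanly.
void markShaderCompileFailed(llvm::Module &module) {
  // Delete all code so no later pass trips over the unsupported construct.
  for (llvm::Function &func : module)
    func.deleteBody();
  // Record the failure; "lgc.compile.failed" is a made-up flag name.
  module.addModuleFlag(llvm::Module::Error, "lgc.compile.failed", 1);
}

// Checked at the end of the pass flow: if set, the driver or front-end
// retries the whole thing as a full pipeline compilation.
bool shaderCompileFailed(const llvm::Module &module) {
  return module.getModuleFlag("lgc.compile.failed") != nullptr;
}
```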

Prologs and epilogs

Compiling shaders with some or all pipeline state missing and without the other
shader to refer to means that the pipeline linker needs to generate prologs and
epilogs.

CS prolog

If the compilation of a CS without resource descriptor layout puts its user
data sgprs in the wrong order for the layout in the pipeline state, then the
linker needs to generate a CS prolog that loads and/or swaps around user data
sgprs. The linker picks up the descriptor set to sgpr mapping that the CS
compilation used from the user data registers in the PAL metadata.
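As an illustration, here is a sketch of how the linker might plan those sgpr
moves, assuming it can extract a "descriptor set to user data sgpr" map both
from the shader's PAL metadata (what the compile assumed) and from the
pipeline state (what PAL will actually load); all types and names here are
hypothetical.

```cpp
#include <cstdint>
#include <map>
#include <vector>

struct SgprMove {
  unsigned srcSgpr; // where the value actually arrives at wave dispatch
  unsigned dstSgpr; // where the compiled CS expects it
};

std::vector<SgprMove>
planCsPrologMoves(const std::map<unsigned, unsigned> &setToSgprAssumed,
                  const std::map<unsigned, unsigned> &setToSgprActual) {
  std::vector<SgprMove> moves;
  for (const auto &[descSet, assumedSgpr] : setToSgprAssumed) {
    auto it = setToSgprActual.find(descSet);
    if (it != setToSgprActual.end() && it->second != assumedSgpr) {
      // The value arrives in a different sgpr than the shader expects; the
      // prolog must move it. Real code must also order the moves to avoid
      // clobbering, and load any spilled sets from the spill table.
      moves.push_back({it->second, assumedSgpr});
    }
  }
  return moves;
}
```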

VS prolog

If vertex buffer information is unavailable at VS compile time, then the linker
needs to generate a VS prolog (a "fetch shader") that loads vertex buffer
values required by the VS. The VS expects the values to be passed in vgprs, and
the linker picks up details of which vertex buffer locations and in what format
from extra pre-link metadata attached to the VS ELF.

VS epilog

If the VS (or whole vertex-processing stage) is compiled without information on
how the FS packs its parameter inputs, then the VS compilation does not know
how to export parameters, and the linker needs to generate a VS epilog. The VS
(or last vertex-processing-stage shader) exits with the parameter values in
vgprs, and the VS epilog takes those and exports them. The linker picks up
information on what parameter locations are in which vgprs and in what format
from extra pre-link metadata attached to the VS ELF, and information on how
parameter locations are packed and arranged from extra pre-link metadata
attached to the FS ELF.

No FS prolog

No FS prolog is ever needed. FS compilation decides how to pack and arrange its
input parameters.

FS epilog

If the FS is compiled without color export pipeline state, then it does not
know how to do its exports, and the linker needs to generate an FS epilog. The
FS exits with its color export values in vgprs (and the exec mask set to the
surviving pixels after kills/demotes), and the FS epilog takes those and
exports them. The linker picks up information on what color exports are in
which vgprs and in what format from extra pre-link metadata attached to the FS
ELF.

Prolog/epilog compilation notes

A prolog has the same input registers as the shader it will be attached to,
minus the vgprs that are generated by the prolog for passing to the shader
proper. That is, the shader's SPI register settings that determine what
registers are set up at wave dispatch apply to the prolog.

For a VS prolog where the VS is part of a merged shader (including the NGG
case), the code to set exec needs to be in the prolog.

The exact same set of registers is also output from the prolog, plus the
vgprs that the prolog generates.

A prolog/epilog is generated as an IR module then compiled. The compiled ELF is
cached with the hash of the inputs to the prolog/epilog IR generator being the
key.

In the context of a prolog being generated as IR then compiled:

  • Input args represent the input registers, with sgprs marked as "inreg", same
    as the IR for a shader.
  • IR can only have a single return value, which here is a struct containing the
    preserved input sgprs and vgprs, plus the vgprs generated by the prolog for
    passing to the shader. By including sgprs as ints and vgprs as floats in the
    return value struct, the back-end calling convention ensures that they are
    allocated to sgprs and vgprs appropriately.
  • We can assume that compiling a prolog will never need scratch, so with that
    single "shader prolog/epilog" calling convention, we don't need to worry that
    it doesn't know how to find the scratch descriptor (which is different
    between compute, single shader and merged shader including NGG).
  • Compiling the prolog with that "shader prolog/epilog" calling convention
    leaves its sgpr and vgpr usage in some well-known place, e.g. the
    SPI_SHADER_RSRC1_VS register in PAL metadata. The linker needs to take the
    maximum usage of that and the shader proper.

An epilog's input registers are the same as the shader's output registers,
which is the vgprs containing the values to export. (This may need to change to
also have some sgprs passed for VS epilog parameter export on gfx11, if
parameter exports are going to be replaced by normal off-chip memory writes.)
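To make that calling convention concrete, here is a minimal sketch of building
such a prolog's IR signature with the LLVM C++ API, under the assumptions
above (sgprs as i32, vgprs as float, sgpr args marked "inreg"); the register
counts and the function name are placeholders.

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"

llvm::Function *createPrologFunc(llvm::Module &module, unsigned numInSgprs,
                                 unsigned numInVgprs, unsigned numGenVgprs) {
  llvm::LLVMContext &context = module.getContext();
  llvm::Type *i32Ty = llvm::Type::getInt32Ty(context);
  llvm::Type *floatTy = llvm::Type::getFloatTy(context);

  // Single struct return: preserved sgprs as i32 (allocated to sgprs by the
  // calling convention), preserved and generated vgprs as float (allocated
  // to vgprs).
  llvm::SmallVector<llvm::Type *, 32> retElems;
  retElems.append(numInSgprs, i32Ty);
  retElems.append(numInVgprs + numGenVgprs, floatTy);
  llvm::Type *retTy = llvm::StructType::get(context, retElems);

  // Args mirror the registers set up at wave dispatch for the shader proper.
  llvm::SmallVector<llvm::Type *, 32> argTys;
  argTys.append(numInSgprs, i32Ty);
  argTys.append(numInVgprs, floatTy);

  auto *funcTy = llvm::FunctionType::get(retTy, argTys, /*isVarArg=*/false);
  auto *func = llvm::Function::Create(funcTy, llvm::Function::ExternalLinkage,
                                      "prolog", &module);
  // Mark the sgpr args "inreg", the same convention as shader IR.
  unsigned argIdx = 0;
  for (llvm::Argument &arg : func->args())
    if (argIdx++ < numInSgprs)
      arg.addAttr(llvm::Attribute::InReg);
  return func;
}
```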

Prolog/epilog generation even in pipeline compilation

In a case where a particular prolog or epilog is not needed (e.g. the VS prolog
when vertex buffer information is available at VS compilation time), I propose
that LGC internally uses the same scheme of setting up a shader as if it is
going to use the prolog/epilog (including setting up the metadata for the
linker), and then uses the same code to generate the IR for the prolog/epilog
as would otherwise be used at link time. Then it would merge the prolog/epilog
into the shader at the IR stage, allowing optimizations from there.

The advantage of that is that there is less divergence in LGC code between
the shader and pipeline compilation cases.

A change this causes is that the vertex buffer loads are all at the start of
the VS, even in a pipeline compilation. I'm not sure whether that is good, bad
or neutral for performance. (Ignoring the NGG culling issue for now.)

NGG culling

An early version of this feature should probably just ignore this case, because
it is quite complex.

With NGG culling, it is advantageous to delay vertex buffer loads that are only
used for parameter calculations until after the culling. Thus, for an NGG VS,
there should be two VS prologs (fetch shaders). The VS compilation needs to
generate the post-culling part as a separate shader, such that the second fetch
shader can be glued in between them. At that point (the exit of the first
shader), sgprs and vgprs need to be as at wave dispatch, except that the vgprs
(vertex index etc) have been copied through LDS to account for the vertices
being compacted. Also exec needs to reflect the compacted vertices.

Jumping between prolog, shader and epilog

I'm not sure how possible this is, or if there is a better idea, but:

We want the generated code to reflect that it is going to jump to the next part
of the shader. So, when generating the prolog, or when generating the shader
proper when there will be an epilog, we want to have an s_branch with a reloc,
rather than an s_endpgm. Perhaps we could tell the backend that by defining a
new function attribute giving the symbol name to s_branch to when generating
what would otherwise be an s_endpgm.

Linking a prolog, shader and epilog would then just work with the s_branch.
Linking could optimize that by ensuring the chunks of code are glued together
in the right order, and removing a final s_branch. Alignment is a
consideration:

  • The start of the glued-together shader must be 256-byte aligned.
  • The main part of the shader should start cache-line-aligned, so anything the
    compiler has done to align loop starts etc remains valid.
  • Padding could be done by adding s_nops, except that any final s_waitcnt
    should be moved to after the s_nops as an optimization (a sketch of the
    padding computation follows this list).
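A tiny sketch of that padding computation; the 64-byte instruction cache line
is an assumption.

```cpp
#include <cstdint>
#include "llvm/Support/MathExtras.h"

// Where the shader proper starts when glued after a prolog: pad the prolog up
// to the next cache line so whatever the compiler did to align loop starts
// remains valid. The gap is filled with s_nop encodings, with any final
// s_waitcnt moved after them.
uint64_t computeShaderProperOffset(uint64_t prologSizeInBytes) {
  const uint64_t cacheLineSize = 64; // assumed instruction cache line size
  return llvm::alignTo(prologSizeInBytes, cacheLineSize);
}
// The glued-together blob as a whole is then placed at a 256-byte-aligned
// address, per the first bullet above.
```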

The LGC interface

I propose that we extend LGC (LLPC middle-end) to handle the various requirements.

Currently LGC has an interface that says:

  • Here are the IR modules for the shaders and the pipeline state; link into a
    pipeline IR module.
  • Go and run middle-end and back-end passes to generate a pipeline ELF.

That interface needs to be extended to allow compilation of a shader with
missing or incomplete pipeline state, and to allow linking of
previously-compiled shader ELFs and pipeline state.

We would probably want to implement compilation of a geometry and/or
tessellation pipeline by providing LGC with IR modules for the non-FS shaders,
a previously-compiled shader ELF for the FS, and the pipeline state. That
allows the other shaders to be compiled knowing which attribute exports will
be unused by the FS, so they can be removed.
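Here is a hypothetical sketch of what the extended interface could look like;
none of these entry points exist today, and every name and signature is
invented for illustration.

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/MemoryBuffer.h"
#include "llvm/Support/raw_ostream.h"

struct PartialPipelineState; // placeholder: possibly-incomplete state

// Invented interface for illustration only; not the actual LGC API.
class LgcCompiler {
public:
  virtual ~LgcCompiler() = default;

  // Existing behavior: full pipeline state in, whole-pipeline ELF out.
  virtual bool compilePipeline(llvm::ArrayRef<llvm::Module *> shaders,
                               const PartialPipelineState &fullState,
                               llvm::raw_pwrite_stream &elfOut) = 0;

  // New: compile one shader (CS, FS, or standalone VS), or the
  // vertex-processing half-pipeline, with missing/partial state, producing an
  // unlinked ELF carrying pre-link metadata. Returns false if the shader
  // cannot work in this model, so the caller falls back to a full compile.
  virtual bool compileShader(llvm::Module *shader,
                             const PartialPipelineState &partialState,
                             llvm::raw_pwrite_stream &unlinkedElfOut) = 0;

  // New: generate prologs/epilogs, resolve relocs, merge PAL metadata, and
  // glue unlinked ELFs into a pipeline ELF satisfying the PAL pipeline ELF
  // spec. Returns false if the pipeline needs something this model cannot do.
  virtual bool linkPipeline(llvm::ArrayRef<llvm::MemoryBufferRef> unlinkedElfs,
                            const PartialPipelineState &fullState,
                            llvm::raw_pwrite_stream &elfOut) = 0;
};
```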

Compilation modes

The compilation modes LGC would support (in probable order of implementation priority) are:

  1. Pipeline compilation, as now. Must be provided with full pipeline state.
    Generates a pipeline ELF satisfying the PAL pipeline ELF spec.
  2. Compilation of a single shader with missing or partial pipeline state. The
    shader must be CS, FS, or VS in a non-tessellation non-geometry pipeline.
    For VS or FS, this may or may not be provided with the other shader already
    compiled, which would provide parameter information. Generates an ELF that
    needs to be pipeline linked. Then there is a link stage in LGC that takes
    such ELFs and generates a pipeline ELF satisfying the PAL pipeline ELF spec.
  3. Compilation of the vertex-processing part of a geometry or tessellation
    pipeline, with full pipeline state. This may or may not be provided with the
    already-compiled FS ELF, which would supply parameter layout information.
    Generates an ELF that needs to be pipeline linked.

Note that the above modes do not include any case where a shader is compiled
separately, and then in the link stage needs to be combined with another shader
to create a merged shader or an NGG prim shader.

Tuning options

As proposed by Rob, tuning options should always be made available at shader
compilation time. This probably means that all tuning has to be done per
shader, not per pipeline. Most tuning options are per-shader anyway, except
the NGG ones, which obviously apply only to the VS in a VS-FS pipeline.

Use of the LGC interface by the front-end

VS-FS parameter optimization

As pointed out by Nicolai, the use of early shader caching limits the parameter
optimizations that can be done between VS and FS, and how that is limited
depends on whether you compile the VS first or the FS first. I consider that it
is worth taking this hit because of the saving in compile time in the cache-hit
case.

FS first

In this scheme, at VS compilation time, we know exactly how parameters are
packed by the FS, so we can generate the parameter exports and we do not need a
VS epilog. We can also see where the FS does not use a parameter at all, and
DCE it and its calculation in the VS. However we cannot do constant parameter
propagation into the FS.

VS first

In this scheme, VS compilation does not know how parameters will be laid out by
the FS, so we need a VS epilog. This does allow constant parameter propagation
into the FS, because the VS's parameter metadata can include an indication that
a parameter is a constant so is not being returned in a vgpr at all. FS
compilation will see this metadata, and propagate the constant into the FS,
saving an export/import. (Note that LLPC doesn't do this at all currently.)
However, the dead-parameter optimization (removing a parameter not used by
the FS) is limited to the VS epilog spotting that it does not need to export
it. The calculation of the dead parameter, and any vertex buffer load needed
only for it, does not get DCEd.

Other VS-FS parameter optimizations we miss out on

Here are some examples of potential optimizations Nicolai mentioned that we
miss out on by using early shader caching:

  • A transform that lifts certain instructions, such as "multiply parameter by a
    constant" to the vertex shader.
  • A transform that lifts uniformity backwards, e.g. if there is information
    (such as an annotation) in the fragment shader that proves that a parameter
    must be uniform, that information could be back-propagated into the vertex
    shader.
  • A transform that propagates range / scalar evolution information ("this
    parameter is always an integer between 0 and 10")

All these are possible when doing a full pipeline compile.

LLPC front-end changes

The LLPC interface would need to change so that a partial pipeline state (and
tuning options) is provided to the shader compile function. That function would
then check the shader cache, and, if a compile is needed, do front-end
compilation then call the LGC interface with the partial pipeline state.

The pipeline compile function would check the cache for its shaders or partial
pipeline. The difficulty here is that it does not know how much of the pipeline
state was known at shader compile time, so there may need to be some mechanism for
multiple shader ELFs to be stored for a particular shader in the cache, with a
way of finding one whose known pipeline state at the time is compatible.
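A sketch of one possible shape for that mechanism: the cache maps a shader
hash to several entries, each tagged with the pipeline state assumed when it
was compiled. The state type and the compatibility test are hypothetical.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct BoundedState; // placeholder: the state known/assumed at compile time
bool isCompatible(const BoundedState &assumed,
                  const BoundedState &actual); // placeholder predicate

struct CachedShaderElf {
  const BoundedState *assumedState; // state assumed at shader compile time
  std::vector<uint8_t> elf;         // the unlinked shader ELF
};

using ShaderHash = uint64_t; // e.g. hash of input SPIR-V plus tuning options
using ShaderCache =
    std::unordered_map<ShaderHash, std::vector<CachedShaderElf>>;

// Find a cached ELF whose compile-time assumptions fit this pipeline, or
// nullptr if the shader (or a compatible variant of it) must be compiled.
const CachedShaderElf *findCompatibleElf(const ShaderCache &cache,
                                         ShaderHash hash,
                                         const BoundedState &pipelineState) {
  auto it = cache.find(hash);
  if (it == cache.end())
    return nullptr;
  for (const CachedShaderElf &entry : it->second)
    if (isCompatible(*entry.assumedState, pipelineState))
      return &entry;
  return nullptr;
}
```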

amdllpc

Steven proposes using a modified amdllpc as his offline shader compile tool.
Thus, that will be calling the LLPC shader compile function with an incomplete
pipeline state containing values for the "bounded" items.

The proposed un-pipeline-linked ELF module

Such an ELF is the result of anything other than full pipeline compilation. It
contains various things to represent the parts of the pipeline state or
inter-shader-stage linking information that were unavailable when it was
compiled.

Representation of metadata needed for linking

Some of the items below list metadata that needs to be left in the unlinked ELF
for the link stage to read. I propose that we will define a new section in the
PAL metadata msgpack tree to put these in. The link stage will remove that
metadata.

Representation of final PAL metadata

Some parts of the PAL metadata can be directly generated in a shader compile
before linking. Hopefully all the link stage needs to do is merge the two
msgpack trees, ORing together any register that appears in both. That handles
the case that the same register has a part used by VS and a part used by FS.

Resource descriptor layout

If resource descriptor layout was unavailable at shader compile time, then the
load of a descriptor from its descriptor table has a reloc on its offset where
the symbol name gives the descriptor set and binding. Such relocs are resolved
at link time, when the resource descriptor layout pipeline state is available.
This work is already underway by Steven from Gibraltar.

In addition, an array of image or sampler descriptors needs a reloc for the
array stride. The stride differs depending on whether it is actually an array
of combined image+samplers, and you can't tell that at shader compile time.
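The sketch below shows how such a reloc might be emitted from IR, assuming
LLVM's AMDGPU llvm.amdgcn.reloc.constant intrinsic; the
"doff_&lt;set&gt;_&lt;binding&gt;" symbol naming is a made-up example, not
necessarily the scheme Steven's work uses.

```cpp
#include <string>
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/IntrinsicsAMDGPU.h"
#include "llvm/IR/Metadata.h"
#include "llvm/IR/Module.h"

// Emit a 32-bit value carrying a relocation whose symbol encodes the
// descriptor set and binding; the linker resolves it once the resource
// descriptor layout is known. Symbol naming is illustrative only.
llvm::Value *emitDescriptorOffsetReloc(llvm::IRBuilder<> &builder,
                                       unsigned descSet, unsigned binding) {
  llvm::LLVMContext &context = builder.getContext();
  llvm::Module *module = builder.GetInsertBlock()->getModule();
  std::string symbol =
      "doff_" + std::to_string(descSet) + "_" + std::to_string(binding);
  llvm::Metadata *symbolMd = llvm::MDString::get(context, symbol);
  llvm::MDNode *symbolNode = llvm::MDNode::get(context, symbolMd);
  llvm::Function *reloc = llvm::Intrinsic::getDeclaration(
      module, llvm::Intrinsic::amdgcn_reloc_constant);
  return builder.CreateCall(
      reloc, {llvm::MetadataAsValue::get(context, symbolNode)});
}
```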

For a descriptor set pointer that can fit into a user data sgpr, the PAL
metadata register for that user data sgpr contains the descriptor set number.
The link stage updates that to give the spill table offset. Work on this
mechanism is underway by David Zhou in AMD
(although in the context of the front-end ELF linking mechanism). There needs
to be some way of telling whether the PAL metadata register represents a
fully-linked spill table offset, or an unlinked descriptor set number. I
believe David's work already does that.

For a descriptor set pointer that cannot fit into a user data sgpr, it is
loaded from the spill table with a reloc on the offset whose symbol gives the
descriptor set. That reloc is resolved at link time.

We will have to ban the driver from putting any descriptors into the top level
of the descriptor layout:

  • Currently, if a descriptor set contains both dynamic and non-dynamic
    descriptors, the driver puts the dynamic ones in the top level. This proposal
    would not be able to find them.

  • Banning that also avoids the use of compact descriptors, which we also cannot
    cope with in this proposal.

A compute shader's user data has a restriction on which spill table entries can
be put into user data sgprs, and in what order. For that reason, the link stage
may need to prepend code to load and/or swap around sgprs for descriptor set
pointers.

Vertex inputs

If vertex input information is unavailable at VS compile time, then vertex
inputs are passed into the vertex shader in vgprs, with metadata saying which
inputs they are and what type. The link stage then constructs a "fetch shader",
and glues it on to the front of the shader.

The fetch shader has an ABI where the vertex shader's input registers are also
the fetch shader's inputs and outputs, except that the vertex input values are
obviously not part of the fetch shader's inputs.

Color exports

If color export information is unavailable at FS compile time, then color
exports are passed out of the fragment shader in vgprs, with metadata saying
which exports they are and what type. The link stage then constructs an FS
epilog, and glues it on to the back of the shader. The shader exits with exec
set to pixels that are not killed/demoted.

The following pipeline state items also affect color export code, so the
absence of any of them also forces the use of an FS epilog:

  • alphaToCoverageEnable
  • dualSourceBlendEnable

Parameter exports and attribute inputs

In a shader compile, parameter exports are passed out of the last
vertex-processing-stage shader in vgprs, with metadata saying which parameters
they are. In an unlinked fragment shader, attributes are packed, and there is
metadata saying how that is done. The link stage then ties the two up, and
adds an epilog to the last vertex-processing-stage shader.

enableMultiView

enableMultiView has several impacts:

  • What gl_Layer and gl_ViewIndex actually are
  • Whether and what to export as pos1

It looks like the best way of handling this if enableMultiView is unavailable
at VS compile time is to compile the two alternatives for each thing inside an
if..else..endif with a reloc as the condition.

perSampleShading

If the perSampleShading item is unavailable at FS compile time, and the FS uses
gl_SampleMask or gl_PointCoord, then the compiler needs to generate code for
both alternatives inside an if..else..endif where the condition is a reloc.
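A minimal sketch of generating that if..else..endif, assuming something like
the reloc emitter sketched earlier supplies the condition value (the linker
would resolve its symbol, e.g. a hypothetical "perSampleShading", to 0 or 1).

```cpp
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/IRBuilder.h"

// Build: if (reloc != 0) { enabled variant } else { disabled variant }.
// The reloc is patched at link time, so at run time the branch always goes
// the same way; the untaken arm is just dead code in the final ISA.
void emitRelocGuardedAlternatives(llvm::IRBuilder<> &builder,
                                  llvm::Value *relocValue) {
  llvm::LLVMContext &context = builder.getContext();
  llvm::Function *func = builder.GetInsertBlock()->getParent();
  auto *thenBlock = llvm::BasicBlock::Create(context, "then", func);
  auto *elseBlock = llvm::BasicBlock::Create(context, "else", func);
  auto *endBlock = llvm::BasicBlock::Create(context, "endif", func);

  llvm::Value *cond = builder.CreateICmpNE(relocValue, builder.getInt32(0));
  builder.CreateCondBr(cond, thenBlock, elseBlock);

  builder.SetInsertPoint(thenBlock);
  // ... emit the code for the "item enabled" alternative here ...
  builder.CreateBr(endBlock);

  builder.SetInsertPoint(elseBlock);
  // ... emit the code for the "item disabled" alternative here ...
  builder.CreateBr(endBlock);

  builder.SetInsertPoint(endBlock); // later code is emitted after the endif
}
```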

PAL metadata items

Certain pipeline state items do not affect compilation except for being copied straight into PAL metadata registers:

  • depthClipEnable
  • rasterizerDiscardEnable
  • topology
  • userClipPlaneMask

In a shader compile with a link stage, it is the link stage that copies these items into PAL metadata.

Relocatable items

As pointed out by Steven's document
pipeline state - Sheet1 (1).pdf,
the following items are relocatable. That is, if the item is unavailable in
pipeline state at shader compile time, a simple 32-bit constant load with a
reloc will work, so it can be resolved at link time:

  • deviceIndex
  • numSamples
  • samplePatternIdx

We should probably add the shadow descriptor table high 32 bits to this too.

Specialization constants

Steven's document claims that SPIR-V specialization constants can be handled by
relocs. That is only partly true:

  • Where a specialization constant is used somewhere a reloc can be used (an
    operand to an instruction in function code), then the SPIR-V reader could
    call a new Builder function "get reloc value". The name of the symbol
    referenced by the reloc is private to the SPIR-V LLPC front-end, and is not
    understood by LGC.

  • Where a specialization constant is used somewhere a reloc cannot be used
    (e.g. the size of an array type), then the SPIR-V reader uses the default
    value for that constant, and it somehow needs to record what value it used so
    the linker can later check that the specialization constants supplied with
    the pipeline do not clash with that. If they do clash, then the link fails
    and the front-end needs to start again compiling that shader.

  • At the link stage, the front-end needs to supply a list of symbol,value
    pairs to the linker to satisfy the relocs (a bare-bones sketch follows
    this list). I'm not sure whether it is worth encapsulating that in an ELF.
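For illustration, that symbol/value list could be as simple as the following;
the naming scheme stays private to the front-end, and whether to wrap this in
an ELF is left open, as noted above.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical symbol -> value table the front-end hands to the linker for
// specialization-constant relocs.
using RelocValues = std::map<std::string, uint32_t>;

// Resolve one reloc site; an unresolved symbol means the link fails and the
// front-end starts again, compiling the shader as part of a full pipeline.
bool resolveSpecConstantReloc(const RelocValues &values,
                              const std::string &symbol, uint32_t &outValue) {
  auto it = values.find(symbol);
  if (it == values.end())
    return false;
  outValue = it->second;
  return true;
}
```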

Bounded items that we need to make relocatable

These are pipeline state items that Steven's document lists as "bounded", that
is, there is a limited range of values that each one can take. Gibraltar's
proposal to handle this in their offline shader cache populating scheme is to
compile a shader multiple times with these items set to the most popular
values, in the hope of covering most cases that the shader is used in a
pipeline.

The implication of this is that the shader cache needs to be able to keep
multiple ELFs for the same shader, with different assumptions about these
pipeline state items. When a pipeline compile looks for a cached shader, there
needs to be some mechanism where it can find the one with a compatible state
for these items.

However, for the purposes of app runtime shader compilation, we need to find
some way of making these fixuppable by the link stage. In some cases, that
might involve generating code that can handle all possibilities, and then
having a branch with a reloc to select the required alternative.

  • perSampleShading

NGG control items

These items are supplied to the compiler through pipeline state to save needing
to load them at runtime from the primitive shader table. If they are
unavailable at shader compile time, then the compiler is forced to load from
the primitive shader table.

  • cullMode
  • depthBiasEnable
  • frontFace

These items are similar, except certain settings also need to force NGG
pass-through mode. Therefore, if the items are unavailable at shader compile
time, we need to force NGG pass-through mode.

  • polygonMode, except that setting polygonMode to line or point forces NGG pass-through mode

Items only needed for tessellation or geometry

These pipeline state items are only used for tessellation or geometry. Because
this proposal insists that a vertex-processing half-pipeline with tessellation
or geometry has to be compiled with full pipeline state, these items do not
need to be handled by a reloc:

  • patchControlPoints
  • switchWinding

The link stage

The link stage needs to:

  • generate CS or VS prolog;
  • generate VS epilog;
  • generate FS epilog;
  • merge and patch up PAL metadata;
  • glue prolog and epilogs on to the corresponding shader;
  • apply relocs;
  • assemble the pipeline ELF.

A prolog is generated to end with an s_branch with a reloc to branch to the
shader proper (the CS or VS).

Where an FS needs an epilog (color export information was unavailable at shader
compile time), it is generated with an s_branch with a reloc instead of an
s_endpgm, to branch to its epilog code.

In both cases, we can optimize by gluing sections in the right order, and
applying the optimization that a chunk of code that ends with an s_branch can
have the s_branch removed and turned into a fallthrough. There may need to be
special handling for a prolog to ensure that the CS or VS remains
instruction-cache-line-aligned, such as inserting s_nop padding before the
fetch shader.

Prologs will be generated as IR then compiled. They will be cached so that will
not happen very often.

s-perron (Contributor) commented:

This looks good. Thanks.

kuhar (Contributor) commented Mar 18, 2020

As a bystander, I really appreciate your summary, Tim.
It's great that you gave this area more structure and provided a high-level overview of the design space -- usually a few folks would just come in with some corner case in the design that they were aware of, and it was very difficult for me to connect the dots when that happened.
Many things are much clearer to me now, although I still don't understand the details.

trenouf (Member, Author) commented Apr 6, 2020

I have opened #545 LGC shader compilation interface proposal, to detail how the front-end would call LGC (the middle-end) to do shader compilation and linking.

trenouf (Member, Author) commented May 26, 2020

Now that I have pushed #720 fetch shader for review, here are some ideas on how to go about implementing the color export shader:

  1. Analogous to "New vertex fetch pass" in fetch shader #720, handle color
    exports in a similar way: use a new lgc.output.export.color call (instead
    of lgc.output.export.generic) for writing to a color export in
    InOutBuilder, and add a new pass, in the existing FragColorExport.cpp,
    that runs before PatchEntryPointMutate to lower the
    lgc.output.export.color calls to export intrinsics.
  2. In that new pass, spot that it is an unlinked compile and no color export
    info was provided. In that case, write the info from the color export
    calls to metadata, mutate the shader to return a struct containing the
    export values, and hook up the return value elements to the inputs of the
    color export calls (sketched below). The FS is then an "exportless" FS.
  3. In the linker, spot that it is an "exportless" FS (perhaps by the
    presence of the metadata), and create a color export shader (a new
    subclass of GlueShader), analogous to a fetch shader. It is actually quite
    a bit simpler than a fetch shader, because it does not need to ask
    PalMetadata how many sgprs and vgprs it has on entry, or where any entry
    register is.
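For what it's worth, a minimal sketch of the core of step 2, assuming the
lgc.output.export.color call carries the export value as its second operand
and that helpers exist to record the metadata and rebuild the function with a
struct return; all of those details are assumptions.

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Instructions.h"

void recordColorExportMetadata(llvm::CallInst *exportCall);   // hypothetical
void mutateToStructReturn(llvm::Function *func,
                          llvm::ArrayRef<llvm::Type *> retElems,
                          llvm::ArrayRef<llvm::Value *> retValues); // hypothetical

// Turn an FS with lgc.output.export.color calls into an "exportless" FS that
// returns its export values for a later color export shader to consume.
void makeExportlessFs(llvm::Function *fsFunc,
                      llvm::ArrayRef<llvm::CallInst *> colorExportCalls) {
  llvm::SmallVector<llvm::Type *, 8> retElems;
  llvm::SmallVector<llvm::Value *, 8> retValues;
  for (llvm::CallInst *call : colorExportCalls) {
    recordColorExportMetadata(call); // target location and type -> metadata
    llvm::Value *exportValue = call->getArgOperand(1); // operand index assumed
    retValues.push_back(exportValue);
    retElems.push_back(exportValue->getType());
    call->eraseFromParent(); // the value itself is defined elsewhere
  }
  // Rebuild fsFunc with a struct-of-retElems return type whose return
  // instruction returns the captured values.
  mutateToStructReturn(fsFunc, retElems, retValues);
}
```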

s-perron (Contributor) commented:

That is exactly what I was thinking. Thanks.
