GPU driven rendering #33

Try · 2022-05-08T18:18:09Z

This ticket tracks ideas and known solutions for GPU-driven rendering.

Vulkan-Extensions

Known production solutions

Current idea

Use VK_NV_mesh_shader as a starting point, and build an emulation layer to enable mesh shaders on a wider range of hardware.

Try · 2022-06-29T20:37:45Z

Now that mesh shading is released for OpenGothic, I can start thinking about next steps.

With VK_NV_mesh_shader everything fits well with the engine; mesh shaders just need to be emulated on other platforms.

Idea for emulation workflow:

Split the mesh shader into 2 compute shaders + 1 vertex shader
Shaders: counting pass + workload shader + vertex passthrough
Extra data:
counting_buffer[], indirect_buffer[draw_count], var_buffer[]
var_buffer - a buffer with the varyings output by the .mesh shader

Spirv patching notes:

OpDecorate %1234 BuiltIn PrimitiveCountNV    <-- should be noped/removed
%gl_PrimitiveCountNV = OpVariable %_ptr_Output_uint Output  <-- should be mutated to shared-variable
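The two patching notes above can be sketched as a pass over the raw SPIR-V word stream. This is only a minimal illustration, not the engine's actual translator: `patchPrimitiveCount` is a hypothetical helper, the numeric constants are taken from the SPIR-V specification and the NV mesh shader extension, and the sketch assumes a well-formed module where decorations precede variable declarations. A real pass would also have to retarget the variable's pointer type from Output to Workgroup, which is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Enumerant values from the SPIR-V specification.
constexpr uint32_t OpNop                   = 0;
constexpr uint32_t OpVariable              = 59;
constexpr uint32_t OpDecorate              = 71;
constexpr uint32_t DecorationBuiltIn       = 11;
constexpr uint32_t BuiltInPrimitiveCountNV = 5275;
constexpr uint32_t StorageClassOutput      = 3;
constexpr uint32_t StorageClassWorkgroup   = 4;

// Hypothetical helper: patches the module in place and returns the id of the
// variable that was decorated with PrimitiveCountNV (0 if there is none).
uint32_t patchPrimitiveCount(std::vector<uint32_t>& code) {
  const size_t header = 5; // magic, version, generator, bound, schema
  uint32_t     target = 0;
  for(size_t i=header; i<code.size(); ) {
    const uint32_t len = code[i] >> 16;     // word count of the instruction
    const uint32_t op  = code[i] & 0xFFFFu; // opcode
    if(len==0 || i+len>code.size())
      break; // malformed stream, stop patching
    if(op==OpDecorate && len==4 &&
       code[i+2]==DecorationBuiltIn && code[i+3]==BuiltInPrimitiveCountNV) {
      target = code[i+1];
      // Overwrite the decoration with one-word OpNop instructions.
      for(uint32_t r=0; r<len; ++r)
        code[i+r] = (1u<<16) | OpNop;
      }
    if(op==OpVariable && len>=4 && target!=0 &&
       code[i+2]==target && code[i+3]==StorageClassOutput) {
      // Output variable becomes a shared (Workgroup) variable.
      code[i+3] = StorageClassWorkgroup;
      }
    i += len;
    }
  return target;
  }
```

Replacing the decoration with `OpNop` words keeps all instruction offsets stable, which is convenient when several patches are applied to the same module.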

Counting shader

// upfront. Using set=1 is ideal, since engine doesn't work with multiple descriptor sets
layout(set = 1, binding = 0) buffer EngineInternal
{
    uint countersCount;
    uint counters[];
} engine;
---
// tail of the main function
  if(_gl_PrimitiveCountNV!=0) {
    uint pos = atomicAdd(engine.countersCount, 1);
    engine.counters[pos] = _gl_PrimitiveCountNV;
    }

Once the counters are written, an internal shader has to build the multi-draw-indirect buffer from the prefix-summed counts.

// recap note about indirect commands
struct VkDrawIndexedIndirectCommand {
   uint32_t    indexCount;
   uint32_t    instanceCount;
   uint32_t    firstIndex; // prefix sum
   int32_t     vertexOffset; // can be abused to offset into var_buffer
   uint32_t    firstInstance; // caps: should be zero
   };
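As a rough CPU-side model of that internal pass, the sketch below turns per-workgroup primitive counts into indirect commands with an exclusive prefix sum. `DrawIndexedIndirect` and `buildIndirect` are hypothetical names; the struct only mirrors the field layout of `VkDrawIndexedIndirectCommand`, and the code assumes triangle output (3 indices per primitive).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Same field order as VkDrawIndexedIndirectCommand (hypothetical mirror type).
struct DrawIndexedIndirect {
  uint32_t indexCount;
  uint32_t instanceCount;
  uint32_t firstIndex;
  int32_t  vertexOffset;
  uint32_t firstInstance;
  };

// Builds one command per counter; firstIndex is the exclusive prefix sum
// of the index counts of all preceding commands.
std::vector<DrawIndexedIndirect> buildIndirect(const std::vector<uint32_t>& primitiveCounts) {
  std::vector<DrawIndexedIndirect> cmds;
  cmds.reserve(primitiveCounts.size());
  uint32_t firstIndex = 0;
  for(uint32_t prims : primitiveCounts) {
    DrawIndexedIndirect c = {};
    c.indexCount    = prims*3; // assumption: triangles only
    c.instanceCount = 1;
    c.firstIndex    = firstIndex;
    c.vertexOffset  = 0;
    c.firstInstance = 0;       // should be zero, see the note above
    cmds.push_back(c);
    firstIndex += c.indexCount;
    }
  return cmds;
  }
```

On the GPU the same result would come from a parallel scan; the serial loop is only meant to show which value ends up in which field.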

Final draw

Each vkCmdDrawMeshTasksNV gets replaced by vkCmdDrawIndexedIndirect, which consumes var_buffer and passes it to the fragment shader.

Multiple renderpasses

A VkEvent should be fine for now to synchronize with the execution of the previous set of compute shaders.

Split command-buffers

Generating extra compute shaders requires a way to insert vkCmdDispatch commands before the start of the render pass.
This can be done by deferred command recording, or by splitting one engine-level command buffer into multiple Vulkan command buffers.
Cons:

if deferred: validation is delayed as well, which makes debugging problematic
multiple Vulkan command buffers: a full-screen quad pass will produce a command buffer with a single draw call
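To make the deferred-recording option more concrete, here is a toy model in which commands are plain strings instead of Vulkan calls. `DeferredEncoder` and its methods are hypothetical and are not the Tempest API; the point is only that dispatches recorded between draws get hoisted in front of the render pass when the recording is flushed.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical deferred encoder: nothing is emitted until flush().
class DeferredEncoder {
  public:
    // Compute work generated for mesh emulation.
    void dispatch(std::string name) { prologue.push_back(std::move(name)); }
    // Regular rendering commands.
    void draw(std::string name)     { body.push_back(std::move(name)); }

    // Emits all dispatches first, then the render pass with its draws.
    std::vector<std::string> flush() {
      std::vector<std::string> out;
      out.insert(out.end(), prologue.begin(), prologue.end());
      out.push_back("beginRenderPass");
      out.insert(out.end(), body.begin(), body.end());
      out.push_back("endRenderPass");
      prologue.clear();
      body.clear();
      return out;
      }

  private:
    std::vector<std::string> prologue; // hoisted compute commands
    std::vector<std::string> body;     // commands inside the render pass
  };
```

The downside listed above is visible even in this model: an invalid command is only noticed at `flush()`, far away from the call that recorded it.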

Issues

Rasterization order - not considered; the z-buffer is more than enough to achieve correct 3D rendering
Mesh shader side effects - not possible due to the counting pass
Per-primitive data - not for now
All buffers have to be preallocated with a finite size. Unfortunately, we can run out of buffer memory, and there are no lazily allocated buffers in Vulkan
No task shader support for now - OpenGothic doesn't need it
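One possible way to live with fixed-size buffers is a bump allocator that detects overflow instead of writing out of bounds. The sketch below is a CPU model under that assumption: `ScratchBuffer` is a hypothetical type, and `std::atomic` stands in for a GPU-side `atomicAdd` on a counter stored in the buffer.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical fixed-capacity scratch buffer with a bump counter.
struct ScratchBuffer {
  static constexpr uint32_t invalid = 0xFFFFFFFFu;

  explicit ScratchBuffer(uint32_t capacity) : data(capacity) {}

  // Reserves 'count' elements; returns the offset, or 'invalid' on overflow.
  // The counter keeps growing on failure, so it can be read back later
  // to learn how much memory the frame actually needed.
  uint32_t alloc(uint32_t count) {
    const uint32_t at  = grow.fetch_add(count);
    const uint32_t cap = uint32_t(data.size());
    if(at>cap || count>cap-at)
      return invalid;
    return at;
    }

  std::atomic<uint32_t> grow{0};
  std::vector<uint32_t> data;
  };
```

A failed allocation means the corresponding geometry is dropped for that frame; the recorded counter value could then be used to size the buffer for the next one.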

Try · 2022-07-07T22:19:54Z

Some experiments:

Added libspiv - internal utility library for spir-v tooling
First attempts to convert .mesh to .comp

; SPIR-V
; Version: 1.0
; Generator: Khronos Glslang Reference Front End; 10
; Bound: 82
; Schema: 0
               OpCapability Shader
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint GLCompute %main "main"
               OpExecutionMode %main LocalSize 1 1 1
               OpSource GLSL 450
               OpSourceExtension "GL_NV_mesh_shader"
               OpName %main "main"
               OpName %g1_MeshPerVertexNV "g1_MeshPerVertexNV"
               OpMemberName %g1_MeshPerVertexNV 0 "g1_Position"
               OpMemberName %g1_MeshPerVertexNV 1 "g1_PointSize"
               OpMemberName %g1_MeshPerVertexNV 2 "g1_ClipDistance"
               OpMemberName %g1_MeshPerVertexNV 3 "g1_CullDistance"
               OpMemberName %g1_MeshPerVertexNV 4 "g1_PositionPerViewNV"
               OpMemberName %g1_MeshPerVertexNV 5 "gl_ClipDistancePerViewNV"
               OpMemberName %g1_MeshPerVertexNV 6 "gl_CullDistancePerViewNV"
               OpName %g1_MeshVerticesNV "g1_MeshVerticesNV"
               OpName %Vbo "Vbo"
               OpMemberName %Vbo 0 "vertices"
               OpName %_ ""
               OpName %PerVertexData "PerVertexData"
               OpMemberName %PerVertexData 0 "color"
               OpName %v_out "v_out"
               OpName %g1_PrimitiveIndicesNV "g1_PrimitiveIndicesNV"
               OpName %g1_PrimitiveCountNV "g1_PrimitiveCountNV"
               OpName %VkDrawIndexedIndirectCommand "VkDrawIndexedIndirectCommand"
               OpMemberName %VkDrawIndexedIndirectCommand 0 "indexCount"
               OpMemberName %VkDrawIndexedIndirectCommand 1 "instanceCount"
               OpMemberName %VkDrawIndexedIndirectCommand 2 "firstIndex"
               OpMemberName %VkDrawIndexedIndirectCommand 3 "vertexOffset"
               OpMemberName %VkDrawIndexedIndirectCommand 4 "firstInstance"
               OpDecorate %_runtimearr_v2float ArrayStride 8
               OpMemberDecorate %Vbo 0 NonWritable
               OpMemberDecorate %Vbo 0 Offset 0
               OpDecorate %Vbo BufferBlock
               OpDecorate %_ DescriptorSet 0
               OpDecorate %_ Binding 0
               OpDecorate %v_out Location 0
               OpDecorate %gl_WorkGroupSize BuiltIn WorkgroupSize
               OpDecorate %VkDrawIndexedIndirectCommand BufferBlock
               OpDecorate %80 DescriptorSet 1
               OpDecorate %80 Binding 0
               OpMemberDecorate %VkDrawIndexedIndirectCommand 0 Offset 0
               OpMemberDecorate %VkDrawIndexedIndirectCommand 1 Offset 4
               OpMemberDecorate %VkDrawIndexedIndirectCommand 2 Offset 8
               OpMemberDecorate %VkDrawIndexedIndirectCommand 3 Offset 12
               OpMemberDecorate %VkDrawIndexedIndirectCommand 4 Offset 16
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
      %float = OpTypeFloat 32
    %v4float = OpTypeVector %float 4
       %uint = OpTypeInt 32 0
     %uint_1 = OpConstant %uint 1
%_arr_float_uint_1 = OpTypeArray %float %uint_1
     %uint_4 = OpConstant %uint 4
%_arr_v4float_uint_4 = OpTypeArray %v4float %uint_4
%_arr__arr_float_uint_1_uint_4 = OpTypeArray %_arr_float_uint_1 %uint_4
%g1_MeshPerVertexNV = OpTypeStruct %v4float %float %_arr_float_uint_1 %_arr_float_uint_1 %_arr_v4float_uint_4 %_arr__arr_float_uint_1_uint_4 %_arr__arr_float_uint_1_uint_4
     %uint_3 = OpConstant %uint 3
%_arr_g1_MeshPerVertexNV_uint_3 = OpTypeArray %g1_MeshPerVertexNV %uint_3
%_ptr_Workgroup__arr_g1_MeshPerVertexNV_uint_3 = OpTypePointer Workgroup %_arr_g1_MeshPerVertexNV_uint_3
%g1_MeshVerticesNV = OpVariable %_ptr_Workgroup__arr_g1_MeshPerVertexNV_uint_3 Workgroup
        %int = OpTypeInt 32 1
      %int_0 = OpConstant %int 0
    %v2float = OpTypeVector %float 2
%_runtimearr_v2float = OpTypeRuntimeArray %v2float
        %Vbo = OpTypeStruct %_runtimearr_v2float
%_ptr_Uniform_Vbo = OpTypePointer Uniform %Vbo
          %_ = OpVariable %_ptr_Uniform_Vbo Uniform
%_ptr_Uniform_v2float = OpTypePointer Uniform %v2float
    %float_0 = OpConstant %float 0
    %float_1 = OpConstant %float 1
%_ptr_Workgroup_v4float = OpTypePointer Workgroup %v4float
      %int_1 = OpConstant %int 1
      %int_2 = OpConstant %int 2
%PerVertexData = OpTypeStruct %v4float
%_arr_PerVertexData_uint_3 = OpTypeArray %PerVertexData %uint_3
%_ptr_Workgroup__arr_PerVertexData_uint_3 = OpTypePointer Workgroup %_arr_PerVertexData_uint_3
      %v_out = OpVariable %_ptr_Workgroup__arr_PerVertexData_uint_3 Workgroup
         %54 = OpConstantComposite %v4float %float_1 %float_0 %float_0 %float_1
         %56 = OpConstantComposite %v4float %float_0 %float_1 %float_0 %float_1
         %58 = OpConstantComposite %v4float %float_0 %float_0 %float_1 %float_1
%_arr_uint_uint_3 = OpTypeArray %uint %uint_3
%_ptr_Workgroup__arr_uint_uint_3 = OpTypePointer Workgroup %_arr_uint_uint_3
%g1_PrimitiveIndicesNV = OpVariable %_ptr_Workgroup__arr_uint_uint_3 Workgroup
     %uint_0 = OpConstant %uint 0
%_ptr_Workgroup_uint = OpTypePointer Workgroup %uint
     %uint_2 = OpConstant %uint 2
%g1_PrimitiveCountNV = OpVariable %_ptr_Workgroup_uint Workgroup
     %v3uint = OpTypeVector %uint 3
%gl_WorkGroupSize = OpConstantComposite %v3uint %uint_1 %uint_1 %uint_1
    %v3float = OpTypeVector %float 3
%_arr_v3float_uint_3 = OpTypeArray %v3float %uint_3
         %74 = OpConstantComposite %v3float %float_1 %float_0 %float_0
         %75 = OpConstantComposite %v3float %float_0 %float_1 %float_0
         %76 = OpConstantComposite %v3float %float_0 %float_0 %float_1
         %77 = OpConstantComposite %_arr_v3float_uint_3 %74 %75 %76
%VkDrawIndexedIndirectCommand = OpTypeStruct %uint %uint %uint %int %uint
%_ptr_Uniform_VkDrawIndexedIndirectCommand = OpTypePointer Uniform %VkDrawIndexedIndirectCommand
         %80 = OpVariable %_ptr_Uniform_VkDrawIndexedIndirectCommand Uniform
       %main = OpFunction %void None %3
          %5 = OpLabel
         %27 = OpAccessChain %_ptr_Uniform_v2float %_ %int_0 %int_0
         %28 = OpLoad %v2float %27
         %31 = OpCompositeExtract %float %28 0
         %32 = OpCompositeExtract %float %28 1
         %33 = OpCompositeConstruct %v4float %31 %32 %float_0 %float_1
         %35 = OpAccessChain %_ptr_Workgroup_v4float %g1_MeshVerticesNV %int_0 %int_0
               OpStore %35 %33
         %37 = OpAccessChain %_ptr_Uniform_v2float %_ %int_0 %int_1
         %38 = OpLoad %v2float %37
         %39 = OpCompositeExtract %float %38 0
         %40 = OpCompositeExtract %float %38 1
         %41 = OpCompositeConstruct %v4float %39 %40 %float_0 %float_1
         %42 = OpAccessChain %_ptr_Workgroup_v4float %g1_MeshVerticesNV %int_1 %int_0
               OpStore %42 %41
         %44 = OpAccessChain %_ptr_Uniform_v2float %_ %int_0 %int_2
         %45 = OpLoad %v2float %44
         %46 = OpCompositeExtract %float %45 0
         %47 = OpCompositeExtract %float %45 1
         %48 = OpCompositeConstruct %v4float %46 %47 %float_0 %float_1
         %49 = OpAccessChain %_ptr_Workgroup_v4float %g1_MeshVerticesNV %int_2 %int_0
               OpStore %49 %48
         %55 = OpAccessChain %_ptr_Workgroup_v4float %v_out %int_0 %int_0
               OpStore %55 %54
         %57 = OpAccessChain %_ptr_Workgroup_v4float %v_out %int_1 %int_0
               OpStore %57 %56
         %59 = OpAccessChain %_ptr_Workgroup_v4float %v_out %int_2 %int_0
               OpStore %59 %58
         %65 = OpAccessChain %_ptr_Workgroup_uint %g1_PrimitiveIndicesNV %int_0
               OpStore %65 %uint_0
         %66 = OpAccessChain %_ptr_Workgroup_uint %g1_PrimitiveIndicesNV %int_1
               OpStore %66 %uint_1
         %68 = OpAccessChain %_ptr_Workgroup_uint %g1_PrimitiveIndicesNV %int_2
               OpStore %68 %uint_2
               OpStore %g1_PrimitiveCountNV %uint_1
               OpReturn
               OpFunctionEnd

In here:

Mesh-related built-ins promoted to shared memory
Entry point adjusted to have no out variables for SPIR-V < 1.4
Entry point changed to GLCompute
Extra SSBO binding at (set=1, binding=0) introduced (to be used as the count buffer)
gl_* prefix changed to g1_ to make spirv-cross happy

Try · 2022-07-12T20:05:48Z

Strategy update for the compute-driven workflow:

single execution of .mesh.comp - this will simplify code-gen and the C++ workflow
index sorting/packing with internal shaders
manual vertex pull in the generated .vert

Extra descriptor set:

struct IndirectCmd { // 32 bytes
  uint    indexCount;
  uint    instanceCount;
  uint    firstIndex;    // prefix sum
  int     vertexOffset;  // can be abused to offset into var_buffer
  uint    firstInstance; // caps: should be zero

  uint    self;  // sequential id of dispatchMesh class, in render-pass
  uint    padd0;
  uint    padd1;
  }; // 32 bytes

layout(set = 1, binding = 0, std430) buffer EngineInternal0 {
  IndirectCmd cmd[];
  } indirect; // indirect buffer, mostly set by CPU, except for indexCount, firstIndex

layout(set = 1, binding = 1, std430) buffer EngineInternal1 {
  uint    grow;
  uint    ibo[];
  } ind;

layout(set = 1, binding = 2, std430) buffer EngineInternal2 {
  uint    grow;
  uint    vbo[];
  } var;

layout(set = 1, binding = 3, std430) buffer EngineInternal3 {
  uint    grow; // and dispatchX
  uint    dispatchY; // =1
  uint    dispatchZ; // =1
  uint    desc[];
  } mesh;

layout(set = 1, binding = 4, std430) buffer EngineInternal4 {
  uint    ibo[];
  } indFlat;
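On the C++ side, the `IndirectCmd` layout can be mirrored and checked at compile time. This is a sketch under the assumption that the host struct is meant to match the std430 layout above byte for byte; `cmdOffset` is a hypothetical helper. Since the first five fields follow the order of `VkDrawIndexedIndirectCommand`, each element could be consumed by an indexed indirect draw with a stride of 32 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side mirror of the GLSL IndirectCmd struct.
struct IndirectCmd {
  uint32_t indexCount;
  uint32_t instanceCount;
  uint32_t firstIndex;
  int32_t  vertexOffset;
  uint32_t firstInstance;

  uint32_t self;  // sequential id of the dispatchMesh call in the render pass
  uint32_t padd0;
  uint32_t padd1;
  };

static_assert(sizeof(IndirectCmd)==32,          "must match the std430 array stride");
static_assert(offsetof(IndirectCmd,self)==20,   "engine data starts after the draw command");

// Byte offset of the i-th command inside the indirect buffer.
constexpr uint64_t cmdOffset(uint32_t i) {
  return uint64_t(i)*sizeof(IndirectCmd);
  }
```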

Workflow by example:

      enc.setFramebuffer({{fbo,Vec4(0,0,1,1),Tempest::Preserve}});
      enc.setUniforms(pso,ubo);
      enc.dispatchMesh(0,3);
      enc.dispatchMesh(3,2);

Will be translated as:

      enc.setUniforms(pso_compute_ms,ubo);
      // vkCmdBindDescriptorSets(internalSet, dynOffset = 0);
      enc.dispatch(3, 1,1);
      // vkCmdBindDescriptorSets(internalSet, dynOffset = commandId);
      // TODO: pass base taskID somehow
      enc.dispatch(2, 1,1);
     ....
      VkBufferMemoryBarrier(comp -> comp, indirect.ind);
      // after all 'dispatchMesh' are done
      // the prefix sum pass actually does 2 jobs:
      // indirect.cmd[i].firstIndex = prefixSum(indexCount);
      // indirect.cmd[i].indexCount = 0; <-- will be re-accumulated in compactage pass
      enc.setUniforms(psoSum,uboSum);
      enc.dispatch(1,1,1); // 1 group with 256 threads
      // should be dispatch-indirect
      VkBufferMemoryBarrier(comp -> comp, all helper buffers, except var);
      enc.setUniforms(psoCompactage,uboCompactage);
      enc.dispatchIndirect(mesh.grow,1,1);
      VkBufferMemoryBarrier(comp -> vert);

      // main rendering, as drawIndirect
      enc.setFramebuffer({{fbo,Vec4(0,0,1,1),Tempest::Preserve}});
      enc.setUniforms(pso,ubo);
      enc.drawIndirect(indirect.cmd[0]);
      enc.drawIndirect(indirect.cmd[1]);
      // vert -> comp barrier at end of render-pass
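The interplay between the prefix sum pass and the compactage pass can be modelled on the CPU as below. This is a sketch under assumptions: `CmdRange` keeps only the two fields the passes touch, and `MeshletDesc` is a guessed layout for the entries of `mesh.desc[]` (owner command, source offset, index count), which the thread does not spell out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct CmdRange    { uint32_t indexCount, firstIndex; };   // subset of IndirectCmd
struct MeshletDesc { uint32_t self, indPtr, indSize; };    // hypothetical desc[] entry

// Job 1: firstIndex = exclusive prefix sum of indexCount.
// Job 2: indexCount is reset, to be re-accumulated during compaction.
void prefixSumPass(std::vector<CmdRange>& cmd) {
  uint32_t sum = 0;
  for(CmdRange& c : cmd) {
    c.firstIndex = sum;
    sum         += c.indexCount;
    c.indexCount = 0;
    }
  }

// Copies each meshlet's indices into the contiguous range of its draw call.
void compactagePass(std::vector<CmdRange>& cmd, const std::vector<MeshletDesc>& desc,
                    const std::vector<uint32_t>& ibo, std::vector<uint32_t>& indFlat) {
  for(const MeshletDesc& d : desc) {
    CmdRange&      c   = cmd[d.self];
    const uint32_t dst = c.firstIndex + c.indexCount; // atomicAdd on the GPU
    c.indexCount += d.indSize;
    assert(dst+d.indSize<=indFlat.size());
    for(uint32_t i=0; i<d.indSize; ++i)
      indFlat[dst+i] = ibo[d.indPtr+i];
    }
  }
```

After both passes every command owns a dense `[firstIndex, firstIndex+indexCount)` slice of the flat index buffer, regardless of the order in which meshlets wrote their output.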

Try · 2022-08-04T20:02:20Z

Current implementation:

Each dispatch-mesh call works as pair of compute shader + draw-indirect
Compute shader as well as vertex passthru shaders are generated from single mesh shader: cc326ee
Once all compute-passes related to draw-calls are finished, output should be sorted (only in prototype, not in engine) and forwarded to vkCmdDrawIndexedIndirect

TODO:

Add VMeshShaderEmulated as special case in related pieces in engine
Take care of pipeline-memory allocation and scheduling in general

Try · 2022-08-06T18:05:54Z

First proof-of-concept triangle:

TODO: need some way to pass firstTask and selfId to the compute shader

Try · 2022-08-06T19:21:34Z

Current idea for passing firstTask and selfId:

Use Y/Z inputs of vkCmdDispatchBase.
Use case: vkCmdDispatchBase(impl, firstTask, self, 0, taskCount, 1,1). This will break some builtin variables.

// workgroup dimensions
in uvec3 gl_NumWorkGroups; // not sure how this interacts with vkCmdDispatchBase
const uvec3 gl_WorkGroupSize;  // unaffected

// workgroup and invocation IDs
in uvec3 gl_WorkGroupID;  // Y is polluted
in uvec3 gl_LocalInvocationID; // unaffected

// derived variables
in uvec3 gl_GlobalInvocationID; // polluted, since it is byproduct of gl_WorkGroupID
in uint gl_LocalInvocationIndex; // unaffected
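To see which IDs the shader would observe, the dispatch can be modelled on the CPU. The sketch relies on the Vulkan rule that vkCmdDispatchBase runs workgroups whose IDs lie in `[baseGroup, baseGroup + groupCount)` per dimension; `workGroupIds` and `decode` are hypothetical helpers, and the question about gl_NumWorkGroups is not covered by this model. Note that a non-zero base group also requires the compute pipeline to be created with VK_PIPELINE_CREATE_DISPATCH_BASE.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct UVec3    { uint32_t x, y, z; };
struct MeshTask { uint32_t taskId, selfId; };

// Enumerates the gl_WorkGroupID values produced by a dispatch with a base.
std::vector<UVec3> workGroupIds(UVec3 base, UVec3 count) {
  std::vector<UVec3> ids;
  for(uint32_t z=0; z<count.z; ++z)
    for(uint32_t y=0; y<count.y; ++y)
      for(uint32_t x=0; x<count.x; ++x)
        ids.push_back(UVec3{base.x+x, base.y+y, base.z+z});
  return ids;
  }

// X already includes firstTask, Y carries the id of the dispatchMesh call.
MeshTask decode(UVec3 workGroupId) {
  return MeshTask{workGroupId.x, workGroupId.y};
  }
```

For the earlier example, `enc.dispatchMesh(3,2)` as the second call in a pass would map to base `(3,1,0)` and count `(2,1,1)`, giving task ids 3 and 4 with selfId 1.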

Try · 2022-08-06T21:57:41Z

Almost there:

Normals are bugged out because the translator can't handle arrayed varyings

Try · 2022-08-07T10:24:08Z

Running stable on OpenGothic:

Try · 2022-08-08T16:14:17Z

New idea on how to avoid scratch buffer traffic problems (and make the solution more Intel-friendly):
Decouple .mesh into separate index and vertex shaders. This can be done in most cases, if the vertex computation is a uniform-function.

By uniform-function I mean:
A function that uses only constants, locals, uniforms, read-only SSBOs and push constants in various combinations, and has no side effects.
It is similar to a pure function, but less restricted. This allows moving most of the computation to the vertex shader.

The only problem is gl_WorkGroupID.x, which is used all over the place

Try · 2023-06-18T22:39:17Z

GL_EXT_spirv_intrinsics is out. Surprisingly, it allows bypassing some of the compiler
https://shader-playground.timjones.io/626ea18db0663c9ef7d1257940b7a195

Try · 2024-09-15T23:01:49Z

Closing: indirect is mostly implemented in the engine (except built-ins; ignore them for now)

Try added a commit that referenced this issue Jul 10, 2022

emulatable mesh shaders initial: count pass

53cd9e5

Try added a commit that referenced this issue Jul 12, 2022

Prototype compute pass as small unit-test

fcb30ac

Try added a commit that referenced this issue Jul 18, 2022

mesh prototype in progress

783c650

Try added a commit that referenced this issue Jul 30, 2022

meshlets proptorype in progress

6de802e

Try added a commit that referenced this issue Aug 4, 2022

mesh emulation in progress

056a69b

Try added a commit that referenced this issue Aug 5, 2022

meshlets pipeline memory

6338f7d

Try added a commit that referenced this issue Aug 5, 2022

mesh emulator in progress

1f229c8

Try added a commit that referenced this issue Aug 6, 2022

draw-indirect initial

338c779

Try added a commit that referenced this issue Aug 6, 2022

meshlet emulation: fist tringle

50d3549

Try added a commit that referenced this issue Aug 6, 2022

fix multiple validation layer issues

278c622

Try added a commit that referenced this issue Aug 6, 2022

fix some meshlets bugs

f864f30

Try added a commit that referenced this issue Aug 6, 2022

arrayed varyings support

f8f7806

Try closed this as completed in b7d1cc7 Aug 7, 2022

Try added a commit that referenced this issue Aug 7, 2022

tune meshlet pipeline memory

6844536

Try reopened this Aug 7, 2022

Try mentioned this issue Aug 7, 2022

Mesh shader emulation over draw-indirect #38

Open

8 tasks

Try mentioned this issue Aug 8, 2022

Low FPS issue Try/OpenGothic#188

Closed

Try closed this as completed Sep 15, 2024
