GPU driven rendering #33

Closed
Try opened this issue May 8, 2022 · 11 comments

Comments

Owner Author

Try commented Jun 29, 2022

Now that mesh shading has been released for OpenGothic, it is time to think about the next steps.

VK_NV_mesh_shader fits the engine well; mesh shaders just need to be emulated on other platforms.

Idea for the emulation workflow:

  • Split the mesh shader into 2 compute shaders + 1 vertex shader
  • Shaders: counting pass + workload shader + vertex passthrough
  • Extra data:
    counting_buffer[], indirect_buffer[draw_count], var_buffer[]
    var_buffer is the buffer holding the varyings output by the .mesh shader

SPIR-V patching notes:

OpDecorate %1234 BuiltIn PrimitiveCountNV    <-- should be replaced with OpNop or removed
%gl_PrimitiveCountNV = OpVariable %_ptr_Output_uint Output  <-- should be changed to a shared (Workgroup) variable

Counting shader

// upfront. Using set=1 is ideal, since engine doesn't work with multiple descriptor sets
layout(set = 1, binding = 0) buffer EngineInternal
{
    uint countersCount;
    uint counters[];
} engine;
---
// tail of the main function
  if(_gl_PrimitiveCountNV!=0) {
    uint pos = atomicAdd(engine.countersCount, 1);
    engine.counters[pos] = _gl_PrimitiveCountNV;
    }

Once the counters are done, an internal shader has to build the multi-draw-indirect buffer from the prefix-summed counts.

// recap note about indirect commands
struct VkDrawIndexedIndirectCommand {
   uint32_t    indexCount;
   uint32_t    instanceCount;
   uint32_t    firstIndex; // prefix sum
   int32_t     vertexOffset; // can be abused to offset into var_buffer
   uint32_t    firstInstance; // caps: should be zero
   };
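
A minimal CPU-side sketch of that prefix-sum step, in C++ for readability. It assumes one primitive count per draw and triangles only; the `DrawIndexedIndirectCommand` mirror struct and `buildIndirect` are hypothetical names, not engine code, and the real pass would run as a compute shader.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Mirrors VkDrawIndexedIndirectCommand field for field (no Vulkan headers needed).
struct DrawIndexedIndirectCommand {
  uint32_t indexCount;
  uint32_t instanceCount;
  uint32_t firstIndex;
  int32_t  vertexOffset;
  uint32_t firstInstance;
};

// primitiveCounts[i] is the triangle count recorded for draw i (assumed input).
std::vector<DrawIndexedIndirectCommand>
buildIndirect(const std::vector<uint32_t>& primitiveCounts) {
  std::vector<DrawIndexedIndirectCommand> cmds;
  cmds.reserve(primitiveCounts.size());
  uint32_t running = 0; // exclusive prefix sum, measured in indices
  for(uint32_t prims : primitiveCounts) {
    assert(prims <= (UINT32_MAX - running) / 3); // guard against 32-bit overflow
    const uint32_t indexCount = prims * 3;       // 3 indices per triangle
    cmds.push_back({indexCount, 1, running, 0, 0});
    running += indexCount;
  }
  return cmds;
}
```

A draw with zero primitives still gets a command, with `indexCount == 0`, so the command index keeps matching the draw index.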

Final draw

Each vkCmdDrawMeshTasksNV call gets replaced by vkCmdDrawIndexedIndirect, which consumes var_buffer and passes its contents on to the fragment shader.

Multiple renderpasses

For now, a VkEvent should be enough to synchronize with the previous set of compute shaders.

Split command-buffers

Generating extra compute shaders requires a way to insert vkCmdDispatch commands at the beginning of a render pass.
This can be done by deferred command recording, or by splitting one engine-level command buffer into multiple Vulkan command buffers.
Cons:

  • if deferred: validation is delayed as well, which makes debugging harder
  • multiple Vulkan command buffers: a full-screen quad pass will produce a command buffer with a single draw call
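
The deferred-recording option could look roughly like the sketch below. `DeferredEncoder` and its string-based command list are purely illustrative stand-ins (assumptions, not the Tempest API); the point is only that compute work recorded in the middle of a pass gets hoisted in front of it when the pass is closed.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical engine-level encoder: commands are recorded, not executed.
struct DeferredEncoder {
  std::vector<std::string> prologue; // compute work hoisted before the render pass
  std::vector<std::string> pass;     // commands that stay inside the render pass
  bool inPass = false;

  void beginPass() {
    assert(!inPass);
    inPass = true;
    pass.push_back("beginRenderPass");
  }
  void draw(const std::string& what) {
    assert(inPass);
    pass.push_back("draw " + what);
  }
  // A mesh dispatch becomes a hoisted compute dispatch plus an indirect draw.
  void dispatchMesh(const std::string& what) {
    assert(inPass);
    prologue.push_back("dispatch " + what);
    pass.push_back("drawIndirect " + what);
  }
  // Returns the final command order and resets the encoder.
  std::vector<std::string> endPass() {
    assert(inPass);
    pass.push_back("endRenderPass");
    std::vector<std::string> out = std::move(prologue);
    out.insert(out.end(), pass.begin(), pass.end());
    prologue.clear();
    pass.clear();
    inPass = false;
    return out;
  }
};
```

Because nothing reaches Vulkan until `endPass()`, validation errors surface late, which is exactly the debugging cost listed above.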

Issues

  • Rasterization order - not considered; the z-buffer is more than enough for correct 3D rendering
  • Mesh shader side effects - not possible because of the counting pass
  • Per-primitive data - not for now
  • All buffers have to be preallocated with a finite size. Unfortunately, we can run out of buffer memory, and there are no lazily allocated buffers in Vulkan
  • No task shader support for now - OpenGothic doesn't need it
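
One way to live with the fixed-size buffers is a bump allocator that reports overflow instead of writing out of bounds. The sketch below is a CPU model under that assumption; `BumpAllocator`, `kFail` and `overflowed` are hypothetical names, and `fetch_add` stands in for GLSL `atomicAdd`.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical CPU model of a GPU-side bump allocator over a fixed-size buffer.
struct BumpAllocator {
  static constexpr uint32_t kFail = 0xFFFFFFFFu;

  explicit BumpAllocator(uint32_t cap) : capacity(cap) { assert(cap > 0); }

  // Reserves `count` elements; returns the start offset, or kFail when full.
  uint32_t alloc(uint32_t count) {
    const uint32_t at = grow.fetch_add(count); // same role as atomicAdd in GLSL
    if(at > capacity || count > capacity - at)
      return kFail; // the caller must skip its output
    return at;
  }

  // True if any request did not fit, so the work can be retried with more memory.
  bool overflowed() const { return grow.load() > capacity; }

  std::atomic<uint32_t> grow{0};
  uint32_t              capacity;
};
```

The counter keeps growing past `capacity` on failure, so after the frame it also tells how much memory would have been needed. Wrap-around of the 32-bit counter is not handled here.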

Owner Author

Try commented Jul 7, 2022

Some experiments:

  1. Added libspirv - an internal utility library for SPIR-V tooling
  2. First attempts at converting .mesh to .comp:
; SPIR-V
; Version: 1.0
; Generator: Khronos Glslang Reference Front End; 10
; Bound: 82
; Schema: 0
               OpCapability Shader
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint GLCompute %main "main"
               OpExecutionMode %main LocalSize 1 1 1
               OpSource GLSL 450
               OpSourceExtension "GL_NV_mesh_shader"
               OpName %main "main"
               OpName %g1_MeshPerVertexNV "g1_MeshPerVertexNV"
               OpMemberName %g1_MeshPerVertexNV 0 "g1_Position"
               OpMemberName %g1_MeshPerVertexNV 1 "g1_PointSize"
               OpMemberName %g1_MeshPerVertexNV 2 "g1_ClipDistance"
               OpMemberName %g1_MeshPerVertexNV 3 "g1_CullDistance"
               OpMemberName %g1_MeshPerVertexNV 4 "g1_PositionPerViewNV"
               OpMemberName %g1_MeshPerVertexNV 5 "gl_ClipDistancePerViewNV"
               OpMemberName %g1_MeshPerVertexNV 6 "gl_CullDistancePerViewNV"
               OpName %g1_MeshVerticesNV "g1_MeshVerticesNV"
               OpName %Vbo "Vbo"
               OpMemberName %Vbo 0 "vertices"
               OpName %_ ""
               OpName %PerVertexData "PerVertexData"
               OpMemberName %PerVertexData 0 "color"
               OpName %v_out "v_out"
               OpName %g1_PrimitiveIndicesNV "g1_PrimitiveIndicesNV"
               OpName %g1_PrimitiveCountNV "g1_PrimitiveCountNV"
               OpName %VkDrawIndexedIndirectCommand "VkDrawIndexedIndirectCommand"
               OpMemberName %VkDrawIndexedIndirectCommand 0 "indexCount"
               OpMemberName %VkDrawIndexedIndirectCommand 1 "instanceCount"
               OpMemberName %VkDrawIndexedIndirectCommand 2 "firstIndex"
               OpMemberName %VkDrawIndexedIndirectCommand 3 "vertexOffset"
               OpMemberName %VkDrawIndexedIndirectCommand 4 "firstInstance"
               OpDecorate %_runtimearr_v2float ArrayStride 8
               OpMemberDecorate %Vbo 0 NonWritable
               OpMemberDecorate %Vbo 0 Offset 0
               OpDecorate %Vbo BufferBlock
               OpDecorate %_ DescriptorSet 0
               OpDecorate %_ Binding 0
               OpDecorate %v_out Location 0
               OpDecorate %gl_WorkGroupSize BuiltIn WorkgroupSize
               OpDecorate %VkDrawIndexedIndirectCommand BufferBlock
               OpDecorate %80 DescriptorSet 1
               OpDecorate %80 Binding 0
               OpMemberDecorate %VkDrawIndexedIndirectCommand 0 Offset 0
               OpMemberDecorate %VkDrawIndexedIndirectCommand 1 Offset 4
               OpMemberDecorate %VkDrawIndexedIndirectCommand 2 Offset 8
               OpMemberDecorate %VkDrawIndexedIndirectCommand 3 Offset 12
               OpMemberDecorate %VkDrawIndexedIndirectCommand 4 Offset 16
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
      %float = OpTypeFloat 32
    %v4float = OpTypeVector %float 4
       %uint = OpTypeInt 32 0
     %uint_1 = OpConstant %uint 1
%_arr_float_uint_1 = OpTypeArray %float %uint_1
     %uint_4 = OpConstant %uint 4
%_arr_v4float_uint_4 = OpTypeArray %v4float %uint_4
%_arr__arr_float_uint_1_uint_4 = OpTypeArray %_arr_float_uint_1 %uint_4
%g1_MeshPerVertexNV = OpTypeStruct %v4float %float %_arr_float_uint_1 %_arr_float_uint_1 %_arr_v4float_uint_4 %_arr__arr_float_uint_1_uint_4 %_arr__arr_float_uint_1_uint_4
     %uint_3 = OpConstant %uint 3
%_arr_g1_MeshPerVertexNV_uint_3 = OpTypeArray %g1_MeshPerVertexNV %uint_3
%_ptr_Workgroup__arr_g1_MeshPerVertexNV_uint_3 = OpTypePointer Workgroup %_arr_g1_MeshPerVertexNV_uint_3
%g1_MeshVerticesNV = OpVariable %_ptr_Workgroup__arr_g1_MeshPerVertexNV_uint_3 Workgroup
        %int = OpTypeInt 32 1
      %int_0 = OpConstant %int 0
    %v2float = OpTypeVector %float 2
%_runtimearr_v2float = OpTypeRuntimeArray %v2float
        %Vbo = OpTypeStruct %_runtimearr_v2float
%_ptr_Uniform_Vbo = OpTypePointer Uniform %Vbo
          %_ = OpVariable %_ptr_Uniform_Vbo Uniform
%_ptr_Uniform_v2float = OpTypePointer Uniform %v2float
    %float_0 = OpConstant %float 0
    %float_1 = OpConstant %float 1
%_ptr_Workgroup_v4float = OpTypePointer Workgroup %v4float
      %int_1 = OpConstant %int 1
      %int_2 = OpConstant %int 2
%PerVertexData = OpTypeStruct %v4float
%_arr_PerVertexData_uint_3 = OpTypeArray %PerVertexData %uint_3
%_ptr_Workgroup__arr_PerVertexData_uint_3 = OpTypePointer Workgroup %_arr_PerVertexData_uint_3
      %v_out = OpVariable %_ptr_Workgroup__arr_PerVertexData_uint_3 Workgroup
         %54 = OpConstantComposite %v4float %float_1 %float_0 %float_0 %float_1
         %56 = OpConstantComposite %v4float %float_0 %float_1 %float_0 %float_1
         %58 = OpConstantComposite %v4float %float_0 %float_0 %float_1 %float_1
%_arr_uint_uint_3 = OpTypeArray %uint %uint_3
%_ptr_Workgroup__arr_uint_uint_3 = OpTypePointer Workgroup %_arr_uint_uint_3
%g1_PrimitiveIndicesNV = OpVariable %_ptr_Workgroup__arr_uint_uint_3 Workgroup
     %uint_0 = OpConstant %uint 0
%_ptr_Workgroup_uint = OpTypePointer Workgroup %uint
     %uint_2 = OpConstant %uint 2
%g1_PrimitiveCountNV = OpVariable %_ptr_Workgroup_uint Workgroup
     %v3uint = OpTypeVector %uint 3
%gl_WorkGroupSize = OpConstantComposite %v3uint %uint_1 %uint_1 %uint_1
    %v3float = OpTypeVector %float 3
%_arr_v3float_uint_3 = OpTypeArray %v3float %uint_3
         %74 = OpConstantComposite %v3float %float_1 %float_0 %float_0
         %75 = OpConstantComposite %v3float %float_0 %float_1 %float_0
         %76 = OpConstantComposite %v3float %float_0 %float_0 %float_1
         %77 = OpConstantComposite %_arr_v3float_uint_3 %74 %75 %76
%VkDrawIndexedIndirectCommand = OpTypeStruct %uint %uint %uint %int %uint
%_ptr_Uniform_VkDrawIndexedIndirectCommand = OpTypePointer Uniform %VkDrawIndexedIndirectCommand
         %80 = OpVariable %_ptr_Uniform_VkDrawIndexedIndirectCommand Uniform
       %main = OpFunction %void None %3
          %5 = OpLabel
         %27 = OpAccessChain %_ptr_Uniform_v2float %_ %int_0 %int_0
         %28 = OpLoad %v2float %27
         %31 = OpCompositeExtract %float %28 0
         %32 = OpCompositeExtract %float %28 1
         %33 = OpCompositeConstruct %v4float %31 %32 %float_0 %float_1
         %35 = OpAccessChain %_ptr_Workgroup_v4float %g1_MeshVerticesNV %int_0 %int_0
               OpStore %35 %33
         %37 = OpAccessChain %_ptr_Uniform_v2float %_ %int_0 %int_1
         %38 = OpLoad %v2float %37
         %39 = OpCompositeExtract %float %38 0
         %40 = OpCompositeExtract %float %38 1
         %41 = OpCompositeConstruct %v4float %39 %40 %float_0 %float_1
         %42 = OpAccessChain %_ptr_Workgroup_v4float %g1_MeshVerticesNV %int_1 %int_0
               OpStore %42 %41
         %44 = OpAccessChain %_ptr_Uniform_v2float %_ %int_0 %int_2
         %45 = OpLoad %v2float %44
         %46 = OpCompositeExtract %float %45 0
         %47 = OpCompositeExtract %float %45 1
         %48 = OpCompositeConstruct %v4float %46 %47 %float_0 %float_1
         %49 = OpAccessChain %_ptr_Workgroup_v4float %g1_MeshVerticesNV %int_2 %int_0
               OpStore %49 %48
         %55 = OpAccessChain %_ptr_Workgroup_v4float %v_out %int_0 %int_0
               OpStore %55 %54
         %57 = OpAccessChain %_ptr_Workgroup_v4float %v_out %int_1 %int_0
               OpStore %57 %56
         %59 = OpAccessChain %_ptr_Workgroup_v4float %v_out %int_2 %int_0
               OpStore %59 %58
         %65 = OpAccessChain %_ptr_Workgroup_uint %g1_PrimitiveIndicesNV %int_0
               OpStore %65 %uint_0
         %66 = OpAccessChain %_ptr_Workgroup_uint %g1_PrimitiveIndicesNV %int_1
               OpStore %66 %uint_1
         %68 = OpAccessChain %_ptr_Workgroup_uint %g1_PrimitiveIndicesNV %int_2
               OpStore %68 %uint_2
               OpStore %g1_PrimitiveCountNV %uint_1
               OpReturn
               OpFunctionEnd

In this listing:

  • Mesh-related built-ins are promoted to shared memory
  • The entry point is adjusted to have no out variables, for SPIR-V < 1.4
  • The entry point is changed to GLCompute
  • An extra SSBO binding at (set=1, binding=0) is introduced (to be used as the count buffer)
  • The gl_* prefix is changed to g1_ to make spirv-cross happy
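
Two of these patches can be sketched directly on the SPIR-V word stream. The code below is an illustration only, not the actual converter: `patchMeshToCompute` is a hypothetical name, the numeric constants are taken from the SPIR-V specification and the NV mesh shader extension, and the storage-class change, the interface-list cleanup and the renaming are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Enumerant values from the SPIR-V specification / SPV_NV_mesh_shader.
constexpr uint32_t kSpirvMagic        = 0x07230203u;
constexpr uint32_t kOpEntryPoint      = 15;
constexpr uint32_t kOpDecorate        = 71;
constexpr uint32_t kDecorationBuiltIn = 11;
constexpr uint32_t kModelGLCompute    = 5;
constexpr uint32_t kModelMeshNV       = 5268;
constexpr uint32_t kPrimitiveCountNV  = 5275;

// Returns a copy of `code` in which MeshNV entry points become GLCompute and
// the "BuiltIn PrimitiveCountNV" decoration is dropped.
std::vector<uint32_t> patchMeshToCompute(const std::vector<uint32_t>& code) {
  assert(code.size() >= 5 && code[0] == kSpirvMagic);
  std::vector<uint32_t> out(code.begin(), code.begin() + 5); // header: 5 words
  for(size_t i = 5; i < code.size();) {
    const uint32_t op  = code[i] & 0xFFFFu; // low 16 bits: opcode
    const uint32_t len = code[i] >> 16;     // high 16 bits: word count
    assert(len > 0 && i + len <= code.size());
    const bool dropped = op == kOpDecorate && len == 4 &&
                         code[i + 2] == kDecorationBuiltIn &&
                         code[i + 3] == kPrimitiveCountNV;
    if(!dropped) {
      const size_t at    = out.size();
      const auto   first = code.begin() + static_cast<std::ptrdiff_t>(i);
      out.insert(out.end(), first, first + static_cast<std::ptrdiff_t>(len));
      // Operand 1 of OpEntryPoint is the execution model.
      if(op == kOpEntryPoint && len >= 2 && out[at + 1] == kModelMeshNV)
        out[at + 1] = kModelGLCompute;
    }
    i += len;
  }
  return out;
}
```

Removing the instruction, rather than overwriting it with OpNop words, keeps the module layout simple; either choice matches the patching notes earlier in the thread.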

Try added a commit that referenced this issue Jul 10, 2022
Try added a commit that referenced this issue Jul 12, 2022
Owner Author

Try commented Jul 12, 2022

Strategy update for the compute-driven workflow:

  • single execution of .mesh.comp - this simplifies code generation and the C++ workflow
  • index sorting/packing with internal shaders
  • manual vertex pulling in the generated .vert

Extra descriptor set:

struct IndirectCmd { // 32 bytes
  uint    indexCount;
  uint    instanceCount;
  uint    firstIndex;    // prefix sum
  int     vertexOffset;  // can be abused to offset into var_buffer
  uint    firstInstance; // caps: should be zero

  uint    self;  // sequential id of dispatchMesh class, in render-pass
  uint    padd0;
  uint    padd1;
  }; // 32 bytes

layout(set = 1, binding = 0, std430) buffer EngineInternal0 {
  IndirectCmd cmd[];
  } indirect; // indirect buffer, mostly set by CPU, except for indexCount, firstIndex

layout(set = 1, binding = 1, std430) buffer EngineInternal1 {
  uint    grow;
  uint    ibo[];
  } ind;

layout(set = 1, binding = 2, std430) buffer EngineInternal2 {
  uint    grow;
  uint    vbo[];
  } var;

layout(set = 1, binding = 3, std430) buffer EngineInternal3 {
  uint    grow; // and dispatchX
  uint    dispatchY; // =1
  uint    dispatchZ; // =1
  uint    desc[];
  } mesh;

layout(set = 1, binding = 4, std430) buffer EngineInternal4 {
  uint    ibo[];
  } indFlat;
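
On the C++ side, the 32-byte layout can be pinned down with a mirror struct and compile-time checks. This is a sketch under the assumption that the host struct is meant to match the std430 layout above byte for byte; `indirectOffset` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side mirror of the GLSL IndirectCmd struct.
struct IndirectCmd {
  uint32_t indexCount;
  uint32_t instanceCount;
  uint32_t firstIndex;
  int32_t  vertexOffset;
  uint32_t firstInstance;
  uint32_t self;  // sequential id of the dispatchMesh call
  uint32_t padd0;
  uint32_t padd1;
};
static_assert(sizeof(IndirectCmd) == 32, "must match the std430 array stride");
static_assert(offsetof(IndirectCmd, firstInstance) == 16,
              "first 20 bytes must match VkDrawIndexedIndirectCommand");
static_assert(offsetof(IndirectCmd, self) == 20, "extra fields follow the Vulkan part");

// Byte offset of command `i` inside the indirect buffer.
inline uint64_t indirectOffset(uint32_t i, uint32_t drawCount) {
  assert(i < drawCount);
  return uint64_t(i) * sizeof(IndirectCmd);
}
```

Since the first 20 bytes match VkDrawIndexedIndirectCommand, the buffer can be passed to vkCmdDrawIndexedIndirect with a stride of 32; the extra fields are simply skipped by the draw.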

Workflow by example:

      enc.setFramebuffer({{fbo,Vec4(0,0,1,1),Tempest::Preserve}});
      enc.setUniforms(pso,ubo);
      enc.dispatchMesh(0,3);
      enc.dispatchMesh(3,2);

Will be translated as:

      enc.setUniforms(pso_compute_ms,ubo);
      // vkCmdBindDescriptorSets(internalSet, dynOffset = 0);
      enc.dispatch(3, 1,1);
      // vkCmdBindDescriptorSets(internalSet, dynOffset = commandId);
      // TODO: pass base taskID somehow
      enc.dispatch(2, 1,1);
     ....
      VkBufferMemoryBarrier(comp -> comp, indirect.ind);
      // after all 'dispatchMesh' are done
      // the prefix sum pass actually does 2 jobs:
      // indirect.ind[i] firstIndex = prefixSum(indexCount);
      // indirect.ind[i] indexCount = 0; <-- will be re-accumulated in the compaction pass (psoCompactage)
      enc.setUniforms(psoSum,uboSum);
      enc.dispatch(1,1,1); // 1 group with 256 threads
      // should be dispatch-indirect
      VkBufferMemoryBarrier(comp -> comp, all helper buffers, except var);
      enc.setUniforms(psoCompactage,uboCompactage);
      enc.dispatchIndirect(mesh.grow,1,1);
      VkBufferMemoryBarrier(comp -> vert);

      // main rendering, as drawIndirect
      enc.setFramebuffer({{fbo,Vec4(0,0,1,1),Tempest::Preserve}});
      enc.setUniforms(pso,ubo);
      enc.drawIndirect(indirect.cmd[0]);
      enc.drawIndirect(indirect.cmd[1]);
      // vert -> comp barrier at end of render-pass
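
The two helper passes in the middle of this sequence can be modelled on the CPU as follows. This is a sequential sketch with hypothetical types (`Cmd`, `Meshlet`); on the GPU the `indexCount` update in the second pass would be an atomicAdd, and the meshlet records would come from the `mesh.desc[]` buffer, whose exact encoding is not specified here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Cmd     { uint32_t indexCount; uint32_t firstIndex; };
struct Meshlet { uint32_t cmdId; std::vector<uint32_t> indices; };

// Pass 1 (prefix sum): firstIndex = sum of earlier counts; indexCount is reset.
// Returns the total number of indices.
uint32_t prefixSumPass(std::vector<Cmd>& cmd) {
  uint32_t sum = 0;
  for(Cmd& c : cmd) {
    c.firstIndex = sum;
    sum += c.indexCount;
    c.indexCount = 0; // re-accumulated by compactPass
  }
  return sum;
}

// Pass 2 (compaction): each meshlet claims a slice of its command's range
// and copies its indices into the flat index buffer.
std::vector<uint32_t> compactPass(std::vector<Cmd>& cmd,
                                  const std::vector<Meshlet>& meshlets,
                                  uint32_t total) {
  std::vector<uint32_t> flat(total);
  for(const Meshlet& m : meshlets) {
    assert(m.cmdId < cmd.size());
    Cmd& c = cmd[m.cmdId];
    uint32_t dst = c.firstIndex + c.indexCount; // atomicAdd on the GPU
    c.indexCount += uint32_t(m.indices.size());
    assert(dst + m.indices.size() <= flat.size());
    for(uint32_t i : m.indices)
      flat[dst++] = i;
  }
  return flat;
}
```

Meshlets may arrive in any order across commands; after compaction each command owns one contiguous index range, which is what the indirect draw needs.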

Try added a commit that referenced this issue Jul 18, 2022
Try added a commit that referenced this issue Jul 30, 2022
Owner Author

Try commented Aug 4, 2022

Current implementation:
[image]

  1. Each dispatch-mesh call works as a pair: compute shader + draw-indirect
  2. The compute shader and the vertex passthrough shader are both generated from a single mesh shader: cc326ee
  3. Once all compute passes related to the draw calls are finished, the output should be sorted (only in the prototype, not in the engine) and forwarded to vkCmdDrawIndexedIndirect

TODO:

  1. Add VMeshShaderEmulated as special case in related pieces in engine
  2. Take care of pipeline-memory allocation and scheduling in general

Try added a commit that referenced this issue Aug 4, 2022
Try added a commit that referenced this issue Aug 5, 2022
Try added a commit that referenced this issue Aug 5, 2022
Try added a commit that referenced this issue Aug 6, 2022
Try added a commit that referenced this issue Aug 6, 2022
Owner Author

Try commented Aug 6, 2022

First proof-of-concept triangle:
[image]

TODO: find a way to pass firstTask and selfId to the compute shader

Owner Author

Try commented Aug 6, 2022

Current idea for passing firstTask and selfId:

Use the Y/Z inputs of vkCmdDispatchBase.
Use case: vkCmdDispatchBase(impl, firstTask, self, 0, taskCount, 1,1). This will break some built-in variables:

// workgroup dimensions
in uvec3 gl_NumWorkGroups; // not sure how this interacts with vkCmdDispatchBase
const uvec3 gl_WorkGroupSize;  // unaffected

// workgroup and invocation IDs
in uvec3 gl_WorkGroupID;  // Y is polluted
in uvec3 gl_LocalInvocationID; // unaffected

// derived variables
in uvec3 gl_GlobalInvocationID; // polluted, since it is byproduct of gl_WorkGroupID
in uint gl_LocalInvocationIndex; // unaffected
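
To make the effect on gl_WorkGroupID concrete, here is a small CPU model of the proposal. It relies on the Vulkan rule that vkCmdDispatchBase produces workgroup IDs in [base, base + count) in each dimension; `workGroupIds`, `MeshArgs` and `decode` are hypothetical names for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using uvec3 = std::array<uint32_t, 3>;

// Enumerates the gl_WorkGroupID values of a dispatch with a base offset:
// IDs range over [base, base + count) in each dimension.
std::vector<uvec3> workGroupIds(const uvec3& base, const uvec3& count) {
  std::vector<uvec3> ids;
  for(uint32_t z = 0; z < count[2]; ++z)
    for(uint32_t y = 0; y < count[1]; ++y)
      for(uint32_t x = 0; x < count[0]; ++x) {
        uvec3 id = {base[0] + x, base[1] + y, base[2] + z};
        ids.push_back(id);
      }
  return ids;
}

struct MeshArgs { uint32_t taskId; uint32_t selfId; };

// Decodes the proposal: X already includes firstTask, Y carries selfId.
MeshArgs decode(const uvec3& workGroupId) {
  MeshArgs a = {workGroupId[0], workGroupId[1]};
  return a;
}
```

For the earlier example `enc.dispatchMesh(3,2)` recorded as the second call (self = 1), the base is (3, 1, 0) and the count is (2, 1, 1), so the two workgroups see task ids 3 and 4 with selfId 1. Any shader code that reads gl_WorkGroupID.y expecting 0 would have to be patched.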

Try added a commit that referenced this issue Aug 6, 2022
Try added a commit that referenced this issue Aug 6, 2022
Owner Author

Try commented Aug 6, 2022

Almost there:
[image]

Normals are broken, because the translator can't handle arrayed varyings.

Try added a commit that referenced this issue Aug 6, 2022
@Try Try closed this as completed in b7d1cc7 Aug 7, 2022
Try added a commit that referenced this issue Aug 7, 2022
@Try Try reopened this Aug 7, 2022
Owner Author

Try commented Aug 7, 2022

Running stable on OpenGothic:
[image]

Owner Author

Try commented Aug 8, 2022

New idea for avoiding scratch-buffer traffic problems (and making the solution more Intel-friendly):
Decouple .mesh into separate index and vertex shaders. In most cases this can be done, provided the vertex computation is a uniform-function.

What I mean by uniform-function:
A function that uses only constants, locals, uniforms, read-only SSBOs and push constants, in any combination, and has no side effects.
It is similar to a pure function, but less restricted. This allows moving most of the computation to the vertex shader.

The only problem is gl_WorkGroupID.x, which is used all over the place.

Owner Author

Try commented Jun 18, 2023

GL_EXT_spirv_intrinsics is out. Surprisingly, it allows bypassing some of the compiler:
https://shader-playground.timjones.io/626ea18db0663c9ef7d1257940b7a195

Owner Author

Try commented Sep 15, 2024

Closing: indirect is mostly implemented in the engine (except for built-ins; ignoring them for now)

@Try Try closed this as completed Sep 15, 2024