Global scan #665

Open · wants to merge 54 commits into master

Conversation

kpentaris (Contributor)

Description

Initial implementation of the Global Scan in HLSL

Testing

Not yet finished

TODO list:

Still need to test the HLSL code, as well as the port of the example code to the new API of the vulkan_1_3 branch.

@devshgraphicsprogrammingjenkins (Contributor)

[CI]: Can one of the admins verify this patch?

Comment on lines 12 to 14
#ifndef NBL_BUILTIN_MAX_SCAN_LEVELS
#define NBL_BUILTIN_MAX_SCAN_LEVELS 7
#endif


do as NBL_CONSTEXPR instead of define
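A minimal sketch of what that could look like, assuming NBL_CONSTEXPR maps to constexpr in C++ and a static const in HLSL (the constant name is illustrative):

// hedged sketch: replaces the include-guarded #define above
NBL_CONSTEXPR uint32_t MaxScanLevels = 7;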


also this better be formulated slightly differently, in terms of reduction passes, not reduction + downsweep

Comment on lines 25 to 26
uint32_t topLevel;
uint32_t temporaryStorageOffset[NBL_BUILTIN_MAX_SCAN_LEVELS/2];


use uint16_t for this
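Presumably something along these lines (sketch only, keeping the array length macro from the diff):

uint16_t topLevel;
uint16_t temporaryStorageOffset[NBL_BUILTIN_MAX_SCAN_LEVELS/2];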

uint32_t topLevel;
uint32_t temporaryStorageOffset[NBL_BUILTIN_MAX_SCAN_LEVELS/2];
};

Parameters_t getParameters();


no need for such a forward decl anymore

Comment on lines 31 to 38
struct DefaultSchedulerParameters_t
{
uint32_t finishedFlagOffset[NBL_BUILTIN_MAX_SCAN_LEVELS-1];
uint32_t cumulativeWorkgroupCount[NBL_BUILTIN_MAX_SCAN_LEVELS];

};

DefaultSchedulerParameters_t getSchedulerParameters();


split out schedulers into separate headers

Comment on lines 40 to 54
template<typename Storage_t>
void getData(
inout Storage_t data,
in uint levelInvocationIndex,
in uint localWorkgroupIndex,
in uint treeLevel,
in uint pseudoLevel
NBL_REF_ARG(Storage_t) data,
NBL_CONST_REF_ARG(uint32_t) levelInvocationIndex,
NBL_CONST_REF_ARG(uint32_t) localWorkgroupIndex,
NBL_CONST_REF_ARG(uint32_t) treeLevel,
NBL_CONST_REF_ARG(uint32_t) pseudoLevel
);
}
}
}
#define _NBL_HLSL_SCAN_GET_PADDED_DATA_DECLARED_
#endif

#ifndef _NBL_HLSL_SCAN_SET_DATA_DECLARED_
namespace nbl
{
namespace hlsl
{
namespace scan
{
template<typename Storage_t>
void setData(
in Storage_t data,
in uint levelInvocationIndex,


no more forward declarations, just let it be an accessor.

Also the level const-ref-args can be rolled up into a single struct SDataIndex
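A hedged sketch of that roll-up; the member names are lifted from the current parameter list, the struct layout itself is an assumption:

struct SDataIndex
{
    uint32_t levelInvocationIndex;
    uint32_t localWorkgroupIndex;
    uint32_t treeLevel;
    uint32_t pseudoLevel;
};

A user-supplied accessor could then expose something like void get(NBL_CONST_REF_ARG(SDataIndex) ix, NBL_REF_ARG(Storage_t) data) in place of the forward-declared getData.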

Comment on lines 0 to 45
// Copyright (C) 2023 - DevSH Graphics Programming Sp. z O.O.
// This file is part of the "Nabla Engine".
// For conditions of distribution and use, see copyright notice in nabla.h

#ifndef _NBL_HLSL_SCAN_DESCRIPTORS_INCLUDED_
#define _NBL_HLSL_SCAN_DESCRIPTORS_INCLUDED_

#include "nbl/builtin/hlsl/scan/declarations.hlsl"
#include "nbl/builtin/hlsl/workgroup/basic.hlsl"

// coherent -> globallycoherent

namespace nbl
{
namespace hlsl
{
namespace scan
{

template<uint32_t scratchElementCount=scratchSz> // (REVIEW): This should be externally defined. Maybe change the scratch buffer to RWByteAddressBuffer? Annoying to manage though...
struct Scratch
{
uint32_t workgroupsStarted;
uint32_t data[scratchElementCount];
};

[[vk::binding(0,0)]] RWStructuredBuffer<uint32_t /*Storage_t*/> scanBuffer; // (REVIEW): Make the type externalizable. Decide how (#define?)
[[vk::binding(1,0)]] globallycoherent RWStructuredBuffer<Scratch> scanScratchBuf; // (REVIEW): Check if globallycoherent can be used with Vulkan Mem Model

template<typename Storage_t, bool isExclusive=false>
void getData(
NBL_REF_ARG(Storage_t) data,
NBL_CONST_REF_ARG(uint32_t) levelInvocationIndex,
NBL_CONST_REF_ARG(uint32_t) localWorkgroupIndex,
NBL_CONST_REF_ARG(uint32_t) treeLevel,
NBL_CONST_REF_ARG(uint32_t) pseudoLevel
)
{
const Parameters_t params = getParameters(); // defined differently for direct and indirect shaders

uint32_t offset = levelInvocationIndex;
const bool notFirstOrLastLevel = bool(pseudoLevel);
if (notFirstOrLastLevel)
offset += params.temporaryStorageOffset[pseudoLevel-1u];

Member


remove this whole file, it should be userspace

Comment on lines 13 to 41
// TODO: Can we make it a static variable?
groupshared uint32_t wgScratch[SharedScratchSz];

#include "nbl/builtin/hlsl/workgroup/arithmetic.hlsl"

template<uint16_t offset>
struct WGScratchProxy
{
uint32_t get(const uint32_t ix)
{
return wgScratch[ix+offset];
}
void set(const uint32_t ix, const uint32_t value)
{
wgScratch[ix+offset] = value;
}

uint32_t atomicAdd(uint32_t ix, uint32_t val)
{
return glsl::atomicAdd(wgScratch[ix + offset], val);
}

void workgroupExecutionAndMemoryBarrier()
{
nbl::hlsl::glsl::barrier();
//nbl::hlsl::glsl::memoryBarrierShared(); implied by the above
}
};
static WGScratchProxy<0> accessor;


scratches are userspace


use accessors

Comment on lines 55 to 64
/**
* Required because we rely on SubgroupContiguousIndex instead of
* gl_LocalInvocationIndex, which means that to match the global index
* we can't use gl_GlobalInvocationID and must instead derive an index
* from SubgroupContiguousIndex.
*/
uint32_t globalIndex()
{
return nbl::hlsl::glsl::gl_WorkGroupID().x*WORKGROUP_SIZE+nbl::hlsl::workgroup::SubgroupContiguousIndex();
}


we have this in a header already

Comment on lines +46 to +53
struct ScanPushConstants
{
nbl::hlsl::scan::Parameters_t scanParams;
nbl::hlsl::scan::DefaultSchedulerParameters_t schedulerParams;
};

[[vk::push_constant]]
ScanPushConstants spc;


everything affecting the pipeline layout should be userspace

Comment on lines -40 to +38
#ifndef _NBL_HLSL_MAIN_DEFINED_
[numthreads(_NBL_HLSL_WORKGROUP_SIZE_, 1, 1)]
void CSMain()
[numthreads(WORKGROUP_SIZE,1,1)]
void main()
{
if (bool(nbl::hlsl::scan::getIndirectElementCount()))
nbl::hlsl::scan::main();
if(bool(nbl::hlsl::scan::getIndirectElementCount())) {
// TODO call main from virtual_workgroup.hlsl
}


all this should be userspace, also there will be very little difference between direct and indirect

namespace nbl
{
namespace hlsl
{
namespace scan
{
template<class Binop, class Storage_t>
void virtualWorkgroup(in uint treeLevel, in uint localWorkgroupIndex)
template<class Binop, typename Storage_t, bool isExclusive, uint16_t ItemCount, class Accessor, class device_capabilities=void>


why not template on the workgroup scan instead?

Within you can alias and extract:

  • binop
  • storage_t
  • exclusive or not
  • item count
  • smem accessor
  • device traits necessary
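A hedged sketch of the shape being suggested, with all names illustrative and the aliases assumed to exist on the workgroup scan type:

template<class WorkgroupScan>
struct VirtualWorkgroup // illustrative name
{
    using binop_t = typename WorkgroupScan::binop_t;
    using storage_t = typename WorkgroupScan::storage_t;
    using accessor_t = typename WorkgroupScan::accessor_t;
    using device_capabilities = typename WorkgroupScan::device_capabilities;
    NBL_CONSTEXPR_STATIC_INLINE bool IsExclusive = WorkgroupScan::IsExclusive;
    NBL_CONSTEXPR_STATIC_INLINE uint16_t ItemCount = WorkgroupScan::ItemCount;
};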

Comment on lines 22 to 23
const uint32_t levelInvocationIndex = localWorkgroupIndex * glsl::gl_WorkGroupSize().x + SubgroupContiguousIndex();
const bool lastInvocationInGroup = SubgroupContiguousIndex() == (gl_WorkGroupSize().x - 1);


shouldn't ItemCount be used instead of glsl::gl_WorkGroupSize().x ?


most definitely
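i.e. presumably:

const uint32_t levelInvocationIndex = localWorkgroupIndex * ItemCount + SubgroupContiguousIndex();
const bool lastInvocationInGroup = SubgroupContiguousIndex() == (ItemCount - 1);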

default:
break;
#if NBL_BUILTIN_MAX_SCAN_LEVELS>7
const uint32_t workgroupSizeLog2 = firstbithigh(glsl::gl_WorkGroupSize().x);


modify it to not use Vulkan/SPIR-V environment builtins, that way we can use it from C++ too!

(So take workgroup size - or more accurately item per workgroup count - from the outside)
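For example (a sketch; the struct and parameter names are illustrative):

// items-per-workgroup arrives from the outside as a compile-time parameter
// instead of being read from the SPIR-V builtin, so the header can also
// compile as C++
template<uint16_t ItemsPerWorkgroupLog2>
struct ScanScheduler
{
    NBL_CONSTEXPR_STATIC_INLINE uint32_t ItemsPerWorkgroup = uint32_t(1u) << ItemsPerWorkgroupLog2;
};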

#endif
// REVIEW: Putting topLevel second allows better alignment for packing of constant variables, assuming lastElement has length 4. (https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl-packing-rules)
struct Parameters_t {
uint32_t lastElement[NBL_BUILTIN_MAX_SCAN_LEVELS/2+1];


add documentation about what lastElement is used for

Comment on lines +64 to +85
_schedulerParams.finishedFlagOffset[1] = 1u;

_scanParams.temporaryStorageOffset[0] = 2u;
break;
case 2u:
_schedulerParams.cumulativeWorkgroupCount[1] = _schedulerParams.cumulativeWorkgroupCount[0]+WorkgroupCount(1);
_schedulerParams.cumulativeWorkgroupCount[2] = _schedulerParams.cumulativeWorkgroupCount[1]+1u;
_schedulerParams.cumulativeWorkgroupCount[3] = _schedulerParams.cumulativeWorkgroupCount[2]+WorkgroupCount(1);
_schedulerParams.cumulativeWorkgroupCount[4] = _schedulerParams.cumulativeWorkgroupCount[3]+WorkgroupCount(0);
// climb up
_schedulerParams.finishedFlagOffset[1] = WorkgroupCount(1);
_schedulerParams.finishedFlagOffset[2] = _schedulerParams.finishedFlagOffset[1]+1u;
// climb down
_schedulerParams.finishedFlagOffset[3] = _schedulerParams.finishedFlagOffset[1]+2u;

_scanParams.temporaryStorageOffset[0] = _schedulerParams.finishedFlagOffset[3]+WorkgroupCount(1);
_scanParams.temporaryStorageOffset[1] = _scanParams.temporaryStorageOffset[0]+WorkgroupCount(0);
break;
case 3u:
_schedulerParams.cumulativeWorkgroupCount[1] = _schedulerParams.cumulativeWorkgroupCount[0]+WorkgroupCount(1);
_schedulerParams.cumulativeWorkgroupCount[2] = _schedulerParams.cumulativeWorkgroupCount[1]+WorkgroupCount(2);
_schedulerParams.cumulativeWorkgroupCount[3] = _schedulerParams.cumulativeWorkgroupCount[2]+1u;
Member


because due to forward progress guarantees you can only do upsweep or downsweep in a single dispatch, you don't need the "climb down"

you probably don't even need the cumulative workgroup counts, as the way you'd find out if your current workgroup truly is the last one, is by checking the return value of the finished flag atomic counter.

e.g. markFinished

// each virtual workgroup counts itself as done; atomicAdd returns the old value
uint32_t howManyDone = flagsAccessor.atomicAdd(finishedOffset[level]+virtualWorkgroupIndexInThisLevel, 1u);
//uint32_t currentLastWorkgroup = currentLastElement>>ItemsPerWorkgroupLog2;
uint32_t nextLevelLastWorkgroup = currentLastWorkgroup>>ItemsPerWorkgroupLog2;
uint32_t virtualWorkgroupIndexInNextLevel = virtualWorkgroupIndexInThisLevel>>ItemsPerWorkgroupLog2;
// NOT: either last workgroup, or in the last bundle of workgroups and last workgroup dynamically there
if (howManyDone!=(ItemsPerWorkgroup-1) && (virtualWorkgroupIndexInNextLevel!=nextLevelLastWorkgroup || (virtualWorkgroupIndexInThisLevel&(ItemsPerWorkgroup-1))!=howManyDone))
    return;
// last one takes over
currentLastWorkgroup = nextLevelLastWorkgroup;


btw I'd try to rewrite this as a loop instead of a weird switch that was only written like this to avoid recursion in GLSL
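A hedged sketch of the loop form, assuming only the reduction (upsweep) levels survive as suggested in the earlier comment:

// each level just accumulates the previous level's workgroup count
_schedulerParams.cumulativeWorkgroupCount[0] = WorkgroupCount(0);
for (uint32_t i = 1u; i <= topLevel; i++)
    _schedulerParams.cumulativeWorkgroupCount[i] = _schedulerParams.cumulativeWorkgroupCount[i-1u] + WorkgroupCount(i);
// finishedFlagOffset and temporaryStorageOffset can be accumulated in the same loop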

Comment on lines +110 to +117
/**
* treeLevel - the current level in the Blelloch Scan
* localWorkgroupIndex - the workgroup index the current invocation is a part of in the specific virtual dispatch.
* For example, if we have dispatched 10 workgroups and the virtual workgroup number is 35, then the localWorkgroupIndex should be 5.
*/
template<class Accessor>
bool getWork(NBL_CONST_REF_ARG(DefaultSchedulerParameters_t) params, NBL_CONST_REF_ARG(uint32_t) topLevel, NBL_REF_ARG(uint32_t) treeLevel, NBL_REF_ARG(uint32_t) localWorkgroupIndex, NBL_REF_ARG(Accessor) sharedScratch)
{


you can't do any flagging via shared memory because you want inter-workgroup communication, it must be a BDA address!


even better, a bda::__ptr<uint32_t>, because then you'll have atomicAdd methods
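Roughly like this hypothetical sketch; the construction and method spellings here are assumptions, only the existence of an atomicAdd on the pointed-to value is taken from the comment above:

// hypothetical: the flag pool address arrives via push constants
bda::__ptr<uint32_t> flag = bda::__ptr<uint32_t>::create(pc.finishedFlagAddress + flagIx * 4ull); // 4 = sizeof(uint32_t)
const uint32_t howManyDone = flag.deref().atomicAdd(1u);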

Comment on lines +129 to +164
sharedScratch.workgroupExecutionAndMemoryBarrier();
const uint32_t globalWorkgroupIndex = sharedScratch.get(0u);

treeLevel = sharedScratch.get(1u);
if(treeLevel>lastLevel)
return true;

localWorkgroupIndex = globalWorkgroupIndex;
const bool dependentLevel = treeLevel != 0u;
if(dependentLevel)
{
const uint32_t prevLevel = treeLevel - 1u;
localWorkgroupIndex -= params.cumulativeWorkgroupCount[prevLevel];
if(workgroup::SubgroupContiguousIndex() == 0u)
{
uint32_t dependentsCount = 1u;
if(treeLevel <= topLevel)
{
dependentsCount = glsl::gl_WorkGroupSize().x;
const bool lastWorkgroup = (globalWorkgroupIndex+1u)==params.cumulativeWorkgroupCount[treeLevel];
if (lastWorkgroup)
{
const Parameters_t scanParams = getParameters();
dependentsCount = scanParams.lastElement[treeLevel]+1u;
if (treeLevel<topLevel)
{
dependentsCount -= scanParams.lastElement[treeLevel+1u] * glsl::gl_WorkGroupSize().x;
}
}
}
uint32_t dependentsFinishedFlagOffset = localWorkgroupIndex;
if (treeLevel>topLevel) // !(prevLevel<topLevel) TODO: merge with `else` above?
dependentsFinishedFlagOffset /= glsl::gl_WorkGroupSize().x;
dependentsFinishedFlagOffset += params.finishedFlagOffset[prevLevel];
while (scanScratchBuf[0].data[dependentsFinishedFlagOffset]!=dependentsCount)
glsl::memoryBarrierBuffer();


I expect a rewrite of this into "last one out closes the door"
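For reference, the pattern being requested, sketched with the names from the snippet above: every dependent bumps the flag, and only the workgroup that observes the final count carries on, so nobody spin-waits.

// atomicAdd returns the pre-increment value
const uint32_t howManyDone = glsl::atomicAdd(scanScratchBuf[0].data[dependentsFinishedFlagOffset], 1u);
if (howManyDone != dependentsCount - 1u)
    return; // not the last one out: retire instead of busy-waiting
// the last workgroup "closes the door" and takes over the dependent work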

Comment on lines +178 to +187
if (treeLevel<topLevel)
{
finishedFlagOffset += localWorkgroupIndex/glsl::gl_WorkGroupSize().x;
glsl::atomicAdd(scanScratchBuf[0].data[finishedFlagOffset], 1u);
}
else if (treeLevel!=(topLevel<<1u))
{
finishedFlagOffset += localWorkgroupIndex;
scanScratchBuf[0].data[finishedFlagOffset] = 1u;
}


add comments about what is happening in the if statements, and why
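A hedged reading of what those comments might say, inferred from the surrounding code rather than confirmed by the author:

if (treeLevel<topLevel)
{
    // upsweep: a whole workgroup's worth of siblings reduce into one parent,
    // so the parent's finished flag is a counter each sibling increments
    finishedFlagOffset += localWorkgroupIndex/glsl::gl_WorkGroupSize().x;
    glsl::atomicAdd(scanScratchBuf[0].data[finishedFlagOffset], 1u);
}
else if (treeLevel!=(topLevel<<1u))
{
    // top level and downsweep (except the very last level): a single workgroup
    // produces each dependency, so a plain store of 1 marks it done
    finishedFlagOffset += localWorkgroupIndex;
    scanScratchBuf[0].data[finishedFlagOffset] = 1u;
}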

Comment on lines 32 to 33
template<class Accessor>
bool getWork(NBL_REF_ARG(Accessor) accessor)


What's the accessor for? It needs a better name.

Also, why don't you store the accessor past the create() as a member?

Comment on lines +91 to +98
// could do scanScratchBuf[0u].workgroupsStarted[SubgroupContiguousIndex()] = 0u but don't know how many invocations are live during this call
if(workgroup::SubgroupContiguousIndex() == 0u)
{
for(uint32_t i = 0; i < params.topLevel; i++)
{
scanScratchBuf[0u].workgroupsStarted[i] = 0u;
}
}


don't you know the workgroup size?
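i.e. with WORKGROUP_SIZE known, the reset could be strided over the invocations instead of serialized on invocation 0 (a sketch):

for (uint32_t i = workgroup::SubgroupContiguousIndex(); i < params.topLevel; i += WORKGROUP_SIZE)
    scanScratchBuf[0u].workgroupsStarted[i] = 0u;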

kpentaris (Contributor, Author)

My bad, this is actually not used anymore. I think it was being used inside an inRange check at some point? Not sure. However, we can't reset the workgroupsStarted buffer within the shader at the end of the Reduce step (which was the initial goal), because it's possible that there are still workgroups "alive" that won't do any work but will still increment the workgroupsStarted buffer; sometimes the reset happens before those WGs exit, and we end up with some 1 values instead of all 0s.
