
Conversation

devshgraphicsprogramming (Member) commented on Nov 21, 2024

Description

Conversion of ICPU BLAS and TLAS to IGPU, including building them.

We may need IGPUTLAS to keep a list of the IGPUBLASes it references, for sanity/lifetime coupling, but only if update/rebuild is not allowed (needs a separate issue, since I have no clue how that's going to be structured yet).
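
A rough sketch of the kind of coupling meant here (hypothetical member; std::shared_ptr standing in for core::smart_refctd_ptr):

#include <memory>
#include <vector>

struct IGPUBottomLevelAccelerationStructure {};

struct IGPUTopLevelAccelerationStructure
{
    // keeps every referenced BLAS alive for as long as the TLAS lives;
    // only sound if an update/rebuild can't swap instances for other BLASes
    std::vector<std::shared_ptr<const IGPUBottomLevelAccelerationStructure>> m_trackedBLASes;
};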

Testing

  • Ray Query Example

TODO list:

  • BLAS and TLAS Hashing
  • AS memory allocation
  • Scratch Suballocation
  • BLAS build
  • BLAS Compaction
  • TLAS build
  • TLAS Compaction

devsh added 2 commits November 20, 2024 16:44
Note that the pointer/build param encoding stuff shouldn't be on the CPU side, but don't touch anything yet.

Also fix a typo, change the SRange to a std::span, and add a default SPIR-V optimizer to the asset converter if none is provided.
devsh added 5 commits November 21, 2024 11:17
…their storage.

Change more stuff to span in `ICPUBottomLevelAccelerationStructure`

Use a semantically better typedef/alias in `ILogicalDevice::createBottomLevelAccelerationStructure`
Comment on lines +1103 to +1104
// finally the contents
//TODO: hasher << lookup.asset->getContentHash();
Member Author:
note to self: need to make ICPUBottomLevelAccelerationStructure an IPreHashed
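
Roughly what that would enable for the TODO above (stand-in types only, not the real IPreHashed interface):

#include <cstdint>
#include <cstdio>

// stand-in for a 32-byte content digest (Nabla's is a blake3 hash)
struct ContentHash { uint8_t bytes[32] = {}; };

// stand-in for the IPreHashed mixin: caches a hash of the asset's contents
class IPreHashed
{
    public:
        virtual ~IPreHashed() = default;
        const ContentHash& getContentHash() const { return m_hash; }
        void setContentHash(const ContentHash& hash) { m_hash = hash; }
        virtual ContentHash computeContentHash() const = 0;
    protected:
        ContentHash m_hash = {};
};

// once the CPU BLAS derives from the mixin, the converter can do
// `hasher << lookup.asset->getContentHash()` like for other pre-hashed assets
class CPUBottomLevelAS final : public IPreHashed
{
    public:
        ContentHash computeContentHash() const override
        {
            ContentHash hash;
            hash.bytes[0] = 1; // real code would feed geometry buffers & build flags to the hasher
            return hash;
        }
};

int main()
{
    CPUBottomLevelAS blas;
    blas.setContentHash(blas.computeContentHash());
    printf("%u\n",blas.getContentHash().bytes[0]);
}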

Comment on lines +2392 to +2512
using mem_prop_f = IDeviceMemoryAllocation::E_MEMORY_PROPERTY_FLAGS;
const auto deviceBuildMemoryTypes = device->getPhysicalDevice()->getMemoryTypeBitsFromMemoryTypeFlags(mem_prop_f::EMPF_DEVICE_LOCAL_BIT);
const auto hostBuildMemoryTypes = device->getPhysicalDevice()->getMemoryTypeBitsFromMemoryTypeFlags(mem_prop_f::EMPF_DEVICE_LOCAL_BIT|mem_prop_f::EMPF_HOST_WRITABLE_BIT|mem_prop_f::EMPF_HOST_CACHED_BIT);

constexpr bool IsTLAS = std::is_same_v<AssetType,ICPUTopLevelAccelerationStructure>;
accelerationStructureParams[IsTLAS].resize(gpuObjects.size());
for (auto& entry : conversionRequests)
for (auto i=0ull; i<entry.second.copyCount; i++)
{
    const auto* as = entry.second.canonicalAsset;
    const auto& patch = dfsCache.nodes[entry.second.patchIndex.value].patch;
    const auto outIx = i+entry.second.firstCopyIx;
    const auto uniqueCopyGroupID = gpuObjUniqueCopyGroupIDs[outIx];
    const bool motionBlur = as->usesMotion();
    // we will need to temporarily store the build input buffers somewhere
    size_t inputSize = 0;
    ILogicalDevice::AccelerationStructureBuildSizes sizes = {};
    {
        const auto buildFlags = patch.getBuildFlags(as);
        if constexpr (IsTLAS)
        {
            AssetVisitor<GetDependantVisit<ICPUTopLevelAccelerationStructure>> visitor = {
                {visitBase},
                {asset,uniqueCopyGroupID},
                patch
            };
            if (!visitor())
                continue;
            const auto instanceCount = as->getInstances().size();
            sizes = device->getAccelerationStructureBuildSizes(patch.hostBuild,buildFlags,motionBlur,instanceCount);
            inputSize = (motionBlur ? sizeof(IGPUTopLevelAccelerationStructure::DevicePolymorphicInstance):sizeof(IGPUTopLevelAccelerationStructure::DeviceStaticInstance))*instanceCount;
        }
        else
        {
            const uint32_t* pMaxPrimitiveCounts = as->getGeometryPrimitiveCounts().data();
            // the code here is not pretty, DRY-ing it is for later
            if (buildFlags.hasFlags(ICPUBottomLevelAccelerationStructure::BUILD_FLAGS::GEOMETRY_TYPE_IS_AABB_BIT))
            {
                const auto geoms = as->getAABBGeometries();
                if (patch.hostBuild)
                {
                    // host builds read straight from the CPU buffers, no staging needed
                    const std::span<const IGPUBottomLevelAccelerationStructure::AABBs<const ICPUBuffer>> cpuGeoms = {
                        reinterpret_cast<const IGPUBottomLevelAccelerationStructure::AABBs<const ICPUBuffer>*>(geoms.data()),geoms.size()
                    };
                    sizes = device->getAccelerationStructureBuildSizes(buildFlags,motionBlur,cpuGeoms,pMaxPrimitiveCounts);
                }
                else
                {
                    const std::span<const IGPUBottomLevelAccelerationStructure::AABBs<const IGPUBuffer>> gpuGeoms = {
                        reinterpret_cast<const IGPUBottomLevelAccelerationStructure::AABBs<const IGPUBuffer>*>(geoms.data()),geoms.size()
                    };
                    sizes = device->getAccelerationStructureBuildSizes(buildFlags,motionBlur,gpuGeoms,pMaxPrimitiveCounts);
                    // TODO: check if the strides need to be aligned to 4 bytes for AABBs
                    for (const auto& geom : geoms)
                    if (const auto aabbCount=*(pMaxPrimitiveCounts++); aabbCount)
                        inputSize = core::roundUp(inputSize,sizeof(float))+aabbCount*geom.stride;
                }
            }
            else
            {
                core::map<uint32_t,size_t> allocationsPerStride;
                const auto geoms = as->getTriangleGeometries();
                if (patch.hostBuild)
                {
                    const std::span<const IGPUBottomLevelAccelerationStructure::Triangles<const ICPUBuffer>> cpuGeoms = {
                        reinterpret_cast<const IGPUBottomLevelAccelerationStructure::Triangles<const ICPUBuffer>*>(geoms.data()),geoms.size()
                    };
                    sizes = device->getAccelerationStructureBuildSizes(buildFlags,motionBlur,cpuGeoms,pMaxPrimitiveCounts);
                }
                else
                {
                    const std::span<const IGPUBottomLevelAccelerationStructure::Triangles<const IGPUBuffer>> gpuGeoms = {
                        reinterpret_cast<const IGPUBottomLevelAccelerationStructure::Triangles<const IGPUBuffer>*>(geoms.data()),geoms.size()
                    };
                    sizes = device->getAccelerationStructureBuildSizes(buildFlags,motionBlur,gpuGeoms,pMaxPrimitiveCounts);
                    // TODO: check if the strides need to be aligned to 4 bytes
                    for (const auto& geom : geoms)
                    if (const auto triCount=*(pMaxPrimitiveCounts++); triCount)
                    {
                        switch (geom.indexType)
                        {
                            case E_INDEX_TYPE::EIT_16BIT:
                                allocationsPerStride[sizeof(uint16_t)] += triCount*3;
                                break;
                            case E_INDEX_TYPE::EIT_32BIT:
                                allocationsPerStride[sizeof(uint32_t)] += triCount*3;
                                break;
                            default:
                                break;
                        }
                        // a second vertex buffer (motion blur keyframe) doubles the vertex data
                        const size_t vertexCount = (geom.vertexData[1] ? 2ull:1ull)*(geom.maxVertex+1);
                        allocationsPerStride[geom.vertexStride] += vertexCount;
                    }
                }
                for (const auto& entry : allocationsPerStride)
                    inputSize = core::roundUp<size_t>(inputSize,entry.first)+entry.first*entry.second;
            }
        }
    }
    if (!sizes)
        continue;
    // this is where it gets a bit weird, we need to create a buffer to back the acceleration structure
    IGPUBuffer::SCreationParams params = {};
    constexpr size_t MinASBufferAlignment = 256u;
    params.size = core::roundUp(sizes.accelerationStructureSize,MinASBufferAlignment);
    params.usage = IGPUBuffer::E_USAGE_FLAGS::EUF_ACCELERATION_STRUCTURE_STORAGE_BIT|IGPUBuffer::E_USAGE_FLAGS::EUF_SHADER_DEVICE_ADDRESS_BIT;
    // concurrent ownership if any
    const auto queueFamilies = inputs.getSharedOwnershipQueueFamilies(uniqueCopyGroupID,as,patch);
    params.queueFamilyIndexCount = queueFamilies.size();
    params.queueFamilyIndices = queueFamilies.data();
    // we need to save the buffer in a side-channel for later
    auto& out = accelerationStructureParams[IsTLAS][baseOffset+outIx];
    out = {
        .storage = device->createBuffer(std::move(params)),
        .scratchSize = sizes.buildScratchSize,
        .motionBlur = motionBlur,
        .compactAfterBuild = patch.compactAfterBuild,
        .inputSize = inputSize
    };
Member Author:
this needs some love from me
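
For the record, the input staging math above just packs per-stride runs back-to-back, aligning each run to its stride so every element stays naturally aligned. A standalone toy version of the same scheme (plain C++, illustrative numbers):

#include <cstdint>
#include <cstdio>
#include <map>

// same as core::roundUp: next multiple of `alignment`
static size_t roundUp(const size_t value, const size_t alignment)
{
    return (value+alignment-1)/alignment*alignment;
}

int main()
{
    // stride -> element count, mirrors `allocationsPerStride`
    std::map<uint32_t,size_t> allocationsPerStride = {
        {2,100*3}, // 100 triangles' worth of 16-bit indices
        {12,64}    // 64 vertices with a 12-byte stride
    };
    size_t inputSize = 0;
    for (const auto& entry : allocationsPerStride)
        inputSize = roundUp(inputSize,entry.first)+entry.first*entry.second;
    printf("packed build-input size: %zu bytes\n",inputSize);
}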

Comment on lines +2953 to +2957
// This gets deferred till AFTER the Buffer Memory Allocations and Binding for Acceleration Structures
if constexpr (!std::is_same_v<AssetType,ICPUBottomLevelAccelerationStructure> && !std::is_same_v<AssetType,ICPUTopLevelAccelerationStructure>)
dfsCache.for_each([&](const instance_t<AssetType>& instance, typename dfs_cache<AssetType>::created_t& created)->void
    {
        auto& stagingCache = std::get<SReserveResult::staging_cache_t<AssetType>>(retval.m_stagingCaches);
Member Author:
need to pack up the lambda and defer it
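
Something along these lines (illustrative, not the converter's actual plumbing): capture the write-out in a std::function and replay it after the AS memory is bound.

#include <functional>
#include <utility>
#include <vector>

// queue of work replayed once acceleration structure memory is allocated & bound
struct DeferredWrites
{
    std::vector<std::function<void()>> jobs;

    template<typename F>
    void runOrDefer(const bool defer, F&& job)
    {
        if (defer) // acceleration structures: pack the lambda up for later
            jobs.emplace_back(std::forward<F>(job));
        else // everything else: run inline as before
            job();
    }

    void flush() // call after the memory binding step
    {
        for (auto& job : jobs)
            job();
        jobs.clear();
    }
};

int main()
{
    DeferredWrites deferred;
    int stagingCacheWrites = 0;
    deferred.runOrDefer(true,[&]{stagingCacheWrites++;});
    deferred.runOrDefer(false,[&]{stagingCacheWrites++;});
    deferred.flush(); // anything the deferred lambdas capture must still be alive here!
    return stagingCacheWrites==2 ? 0:1;
}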

Comment on lines +3251 to +3304
// Deal with Deferred Creation of Acceleration structures
{
    for (auto asLevel=0; asLevel<2; asLevel++)
    {
        // each of these stages must have a barrier inbetween
        size_t scratchSizeFullParallelBuild = 0;
        size_t scratchSizeFullParallelCompact = 0;
        // we collect the stats AFTER making sure that the BLAS / TLAS can actually be created
        for (const auto& deferredParams : accelerationStructureParams[asLevel])
        {
            // buffer failed to create/allocate
            if (!deferredParams.storage.get())
                continue;
            IGPUAccelerationStructure::SCreationParams baseParams;
            {
                auto* buf = deferredParams.storage.get();
                const auto bufSz = buf->getSize();
                using create_f = IGPUAccelerationStructure::SCreationParams::FLAGS;
                baseParams = {
                    .bufferRange = {.offset=0,.size=bufSz,.buffer=smart_refctd_ptr<IGPUBuffer>(buf)},
                    .flags = deferredParams.motionBlur ? create_f::MOTION_BIT:create_f::NONE
                };
            }
            // index 1 of `accelerationStructureParams` holds the TLAS requests
            smart_refctd_ptr<IGPUAccelerationStructure> as;
            if (asLevel)
                as = device->createTopLevelAccelerationStructure({baseParams,deferredParams.maxInstanceCount});
            else
                as = device->createBottomLevelAccelerationStructure({baseParams});
            // note that in order to compact an AS you need to allocate a buffer range whose size is known only after the build
            const auto buildSize = deferredParams.inputSize+deferredParams.scratchSize;
            // sizes for building 1-by-1 vs parallel
            retval.m_minASBuildScratchSize = core::max(buildSize,retval.m_minASBuildScratchSize);
            scratchSizeFullParallelBuild += buildSize;
            if (deferredParams.compactAfterBuild)
                scratchSizeFullParallelCompact += deferredParams.scratchSize;
            // triangles, AABBs or Instance Transforms will need to be supplied from VRAM
            // TODO: also mark somehow that we'll need a BUILD INPUT READ ONLY BUFFER WITH XFER usage
            if (deferredParams.inputSize)
                retval.m_queueFlags |= IQueue::FAMILY_FLAGS::TRANSFER_BIT;
        }
        // worst case scratch: every build (or compaction) of this level in flight at once
        retval.m_maxASBuildScratchSize = core::max(core::max(scratchSizeFullParallelBuild,scratchSizeFullParallelCompact),retval.m_maxASBuildScratchSize);
    }
    // any AS to build at all means we need a compute queue (the per-level maxima were already folded in above)
    if (retval.m_minASBuildScratchSize)
        retval.m_queueFlags |= IQueue::FAMILY_FLAGS::COMPUTE_BIT;
}
Member Author:
needs some love from me
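
For context, the reason the compaction buffer can't be allocated up front, in raw Vulkan terms (plain Vulkan sketch, KHR entry points assumed loaded; Nabla wraps all of this):

#include <vulkan/vulkan.h>

// 1) after the build completes (and a barrier), ask the driver for the compacted size
void recordCompactedSizeQuery(VkCommandBuffer cmdbuf, VkQueryPool queryPool, VkAccelerationStructureKHR built)
{
    vkCmdWriteAccelerationStructuresPropertiesKHR(cmdbuf,1,&built,
        VK_QUERY_TYPE_ACCELERATION_STRUCTURE_COMPACTED_SIZE_KHR,queryPool,0);
}

// 2) once the host has read the query (vkGetQueryPoolResults) and allocated
//    `storage` of exactly `compactedSize`, create the destination AS and copy into it
VkAccelerationStructureKHR compact(VkDevice device, VkCommandBuffer cmdbuf,
    VkAccelerationStructureKHR built, VkBuffer storage, VkDeviceSize compactedSize)
{
    const VkAccelerationStructureCreateInfoKHR createInfo = {
        .sType = VK_STRUCTURE_TYPE_ACCELERATION_STRUCTURE_CREATE_INFO_KHR,
        .buffer = storage,
        .offset = 0,
        .size = compactedSize,
        .type = VK_ACCELERATION_STRUCTURE_TYPE_BOTTOM_LEVEL_KHR
    };
    VkAccelerationStructureKHR compacted;
    vkCreateAccelerationStructureKHR(device,&createInfo,nullptr,&compacted);
    // 3) the copy with COMPACT mode produces the slimmed-down AS
    const VkCopyAccelerationStructureInfoKHR copyInfo = {
        .sType = VK_STRUCTURE_TYPE_COPY_ACCELERATION_STRUCTURE_INFO_KHR,
        .src = built,
        .dst = compacted,
        .mode = VK_COPY_ACCELERATION_STRUCTURE_MODE_COMPACT_KHR
    };
    vkCmdCopyAccelerationStructureKHR(cmdbuf,&copyInfo);
    return compacted;
}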

devshgraphicsprogramming merged commit b9be039 into master on Apr 18, 2025
1 check failed
devshgraphicsprogramming deleted the AS_conv branch on April 18, 2025 16:02
Comment on lines +75 to +82
#ifndef _NBL_DEBUG
if (!params.optimizer)
{
    using pass_e = asset::ISPIRVOptimizer::E_OPTIMIZER_PASS;
    // shall we do others?
    params.optimizer = core::make_smart_refctd_ptr<asset::ISPIRVOptimizer>({pass_e::EOP_STRIP_DEBUG_INFO});
}
#endif
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevyuu we should move this to your ISPIRVDebloater (or "trimmer" as I'd like to call it) and make it an option, so we don't run the SPIR-V optimizer multiple times for no reason
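
Roughly the shape I mean (hypothetical names, not the actual ISPIRVDebloater API): the trimmer assembles one pass list, optionally including the debug-info strip, so the module goes through the optimizer once.

#include <cstdio>
#include <vector>

// stand-in for asset::ISPIRVOptimizer::E_OPTIMIZER_PASS
enum class Pass { TrimCapabilities, StripDebugInfo };

// hypothetical option: fold the strip into the trimmer's own pass list
// instead of a separate optimizer run after debloating
std::vector<Pass> buildTrimPasses(const bool stripDebugInfo)
{
    std::vector<Pass> passes = {Pass::TrimCapabilities};
    if (stripDebugInfo) // e.g. enabled for release builds only
        passes.push_back(Pass::StripDebugInfo);
    return passes;
}

int main()
{
    const auto passes = buildTrimPasses(true);
    printf("single optimizer invocation with %zu passes\n",passes.size());
}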
