Feature Request: Support UMA archs (e.g. iOS) #238

darksylinc · 2021-11-02T14:48:17Z

Note: This is brainstorming

We currently have:

BT_IMMUTABLE must be uploaded on creation
BT_DEFAULT must use upload() / staging buffers
BT_DYNAMIC_* must use map()

iOS shared works like default in terms of speed, but wants access using map.

Intended use case

Immediate read access to vertex data (and similar) that is immutable or nearly immutable (GPU side) to perform ray cast computations and other physics stuff
Avoid relying on a shadow copy which provides the intended speed benefit because device is memory constraint

Possible solutions:

Offer a new BT_TYPE
Add a boolean flag to indicate BT_DEFAULT can be mapped
Add a function that returns a ptr to the raw memory, e.g. BufferPacked::getRawGpuPtr()

Issue: Synchronization

All writes from CPU are race conditions if the GPU is using it. That's why BT_DYNAMIC 3x the memory and why BT_DEFAULT uses StagingBuffers to upload. Easiest solution is to disallow writes from CPU i.e. getRawGpuPtr returns a const pointer. If the user const casts the pointer and writes to it, it's his responsability.

All writes from GPU are race conditions if the CPU wants to read that data. Writes from GPU:

UAV / Compute: suggestion, disallow this feature on UAV buffers
BufferPacked::copyTo
Uploads via StagingBuffer

Possible solution:

Add BufferPacked::getRawGpuPtr(); returns nullptr if it's not supported or there is a data hazard
Add BufferPacked::isRawGpuAvailable() (name pending). If it returns true; it is safe to call getRawGpuPtr and it will return a valid ptr.
Add BufferPacked::rawGpuBufferAvailableFrame() (name pending)
Add a new BT_TYPE which uses a derived class e.g. MetalSharedBufferInterface. When a GPU write operation is called (e.g. user calls BufferPacked::copyTo); MetalSharedBufferInterface has an extra uint32 to store the current frame i.e. it stores VaoManager::getFrameCount.
- The reason for a new BT_TYPE and MetalSharedBufferInterface is to avoid the extra uint32 on buffers that don't need it. But since it's only 4 bytes, I'm not feeling too strong about this
The implementation would look like this:

bool BufferPacked::isRawGpuAvailable() const
{
   return vaoManager->isFrameFinished( this->rawGpuBufferAvailableFrame() ) ;
}

VaoManager::waitForSpecificFrameToFinish can be used to wait for the buffer.

If the frame counter overflows, a longer-than-usual wait will happen (should we increase the counter to 64 bits?).

The feature could be supported on BT_IMMUTABLE: Immutable buffers could add a boolean instead of a frame idx. Once the first upload has finished GPU side; it is forever accessible and rawGpuBufferAvailableFrame could always return VaoManager::getFrameCount - VaoManager::getDynamicBufferMultiplier.
- Alternatively instead of a boolean; the buffer could be placed in a list. Once the buffer is out of that list, it means the upload operation is done
- Or alternatively, it behaves exactly the same as described above (i.e. keep the frame count it was uploaded. If getFrameCount overflows, a small hitch will happen)

Issue: Validation layer

Expect Metal to complain a bit that certain operations can or cannot be done when the buffer type is MTLStorageModeShared, these are often minor fixes

What if user holds on to the returned pointer getRawGpuPtr?

User should call getRawGpuPtr to prevent data races. Otherwise all bets are off. It's not our responsibility anymore.

If the user calls VaoManager::cleanupEmptyPools, the pointer may no longer be valid.

Textures

Textures can't be accessed directly. They don't have linear tiling access. Supporting linear tiling is a mess and we recommend to use TexBufferPacked instead

Metal & Vulkan

Metal & Vulkan can benefit from the same features. As long as the GPU exposes a HOST_VISIBLE_BIT|HOST_COHERENT_BIT|DEVICE_LOCAL_BIT (HOST_CACHED_BIT too?) then the feature can be implemented.

It can benefit Desktop GPUs too. With PCIE resizable bar / SAM, GPU drivers are exposing HOST_VISIBLE_BIT|HOST_COHERENT_BIT|DEVICE_LOCAL_BIT (not cached) memory which lives on the GPU and can be read from CPU (it gets pulled from the PCIE bus).

Though given that it is not cached, users should read from each byte only once (or use memcpy to make a local copy). It should be treated the same way as write combined memory.

When Resizable Bar is not available we should not use that memory because there's only 256 MB, and the driver needs some of that too; and it's too precious.

The text was updated successfully, but these errors were encountered:

darksylinc added enhancement New feature or request Metal Issue affects Metal specifically. Don't use this tag if all RenderSystems are affected Vulkan labels Nov 2, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature Request: Support UMA archs (e.g. iOS) #238

Feature Request: Support UMA archs (e.g. iOS) #238

darksylinc commented Nov 2, 2021 •

edited

Feature Request: Support UMA archs (e.g. iOS) #238

Feature Request: Support UMA archs (e.g. iOS) #238

Comments

darksylinc commented Nov 2, 2021 • edited

We currently have:

Intended use case

Possible solutions:

Issue: Synchronization

Possible solution:

Issue: Validation layer

What if user holds on to the returned pointer getRawGpuPtr?

Textures

Metal & Vulkan

darksylinc commented Nov 2, 2021 •

edited