🚀 The feature, motivation and pitch
GPU Direct Storage (GDS) Feature Documentation
Overview
This feature introduces GPU Direct Storage (GDS) support to enable direct data transfer between GPU memory and storage, bypassing CPU memory as an intermediate buffer. This significantly reduces memory bandwidth bottlenecks and improves KV cache loading/offloading performance.
Motivation
Currently, the KV cache transfer follows this path:
Device Memory <-> Host Memory <-> Storage
With GDS support, we can achieve:
Device Memory <-> Storage (direct)
This eliminates the need for:
- Host memory staging buffers
- CPU-GPU memory copies (H2D/D2H)
- Additional memory bandwidth consumption
- Two-stage transfer operations
Technical Design
Basic GDS call flow
1. `cuFileDriverOpen()`: initialize the GDS runtime. Must be called first; it is safe to call repeatedly.
2. `open(path, O_DIRECT | …)`: open the target file with POSIX, making sure it goes through the Direct-I/O path.
3. `cuFileHandleRegister(&fh, &desc)`: wrap the regular fd into a cuFile handle and register it with GDS. The handle should be cached to improve performance.
4. `cuFileBufRegister(d_buf, size, 0)`: register a GPU buffer with GDS for zero-copy. We will not use this API: if `cuFileBufRegister` has not been previously called on the buffer pointer, `cuFileRead`/`cuFileWrite` will use internal registered buffers when required (`per_buffer_cache_size_kb` = 1MB, `max_device_cache_size_kb` = 128MB).
5. `cuFileRead(fh, d_buf, size, offset, 0)` or `cuFileWrite(fh, d_buf, size, offset, 0)`: transfer data between the file and GPU memory with no CPU copies.
6. `cuFileBufDeregister(d_buf)`: unregister the buffer (skipped when step 4 is skipped).
7. `cuFileHandleDeregister(fh)`: release the file-handle resource.
8. `close(fd)`: close the underlying file descriptor.
9. `cuFileDriverClose()`: shut down GDS and release global resources.
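As a concrete reference, here is a minimal end-to-end sketch of this call flow, assuming a CUDA toolchain with libcufile available; the file path and transfer size are illustrative only:
```cpp
// Minimal sketch of the basic GDS call flow (steps numbered as above).
// Error handling is abbreviated; kPath and kSize are illustrative.
#define _GNU_SOURCE // for O_DIRECT (g++ usually defines this already)
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    if (cuFileDriverOpen().err != CU_FILE_SUCCESS) { return 1; } // 1. init GDS

    const char* kPath = "/mnt/nfs/kvcache.bin";
    int fd = open(kPath, O_RDONLY | O_DIRECT);                   // 2. Direct-I/O open
    if (fd < 0) { cuFileDriverClose(); return 1; }

    CUfileDescr_t desc{};
    desc.handle.fd = fd;
    desc.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t fh;
    if (cuFileHandleRegister(&fh, &desc).err != CU_FILE_SUCCESS) { return 1; } // 3.

    const size_t kSize = 1 << 20; // 1MB, a multiple of 4KB
    void* d_buf = nullptr;
    cudaMalloc(&d_buf, kSize);
    // 4. cuFileBufRegister is skipped on purpose: cuFile falls back to its
    //    internal registered buffers when the pointer was never registered.

    ssize_t n = cuFileRead(fh, d_buf, kSize, /*file_offset=*/0,
                           /*bufPtr_offset=*/0);                 // 5. storage -> GPU
    if (n < 0) { fprintf(stderr, "cuFileRead failed: %zd\n", n); }

    cuFileHandleDeregister(fh);                                  // 7. release handle
    close(fd);                                                   // 8. close fd
    cudaFree(d_buf);
    cuFileDriverClose();                                         // 9. shut down GDS
    return 0;
}
```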
Architecture Overview
The design follows a clean separation of concerns with polymorphic queue types:
- Configuration Layer: `NFSStore::Config` adds a `transferUseDirect` flag
- Queue Abstraction: a new `ITsfTaskQueue` interface for polymorphic queue types
- Implementation Layer: `TsfTaskQueue` for the traditional path (Device <-> Host <-> Storage) and `DirectTsfTaskQueue` for the GDS path (Device <-> Storage)
- Device Layer: the CUDA device implements `S2D`/`D2S` methods using the cuFile APIs
Key Components
1. Device Interface Extension
```cpp
// ucm/store/device/idevice.h
class IDevice {
virtual Status Setup(bool transferUseDirect) = 0;
// GDS Sync
virtual Status S2D(const std::string& path, size_t offset, size_t length,
std::byte* devicePtr) {
return Status::Unsupported();
}
virtual Status D2S(const std::string& path, size_t offset, size_t length,
                   const std::byte* devicePtr) {
return Status::Unsupported();
}
// GDS Async
virtual Status S2DAsync(const std::string& path, size_t offset, size_t length,
std::byte* devicePtr, std::function<void(bool)> callback) {
return Status::Unsupported();
}
virtual Status D2SAsync(const std::string& path, size_t offset, size_t length,
const std::byte* devicePtr, std::function<void(bool)> callback) {
return Status::Unsupported();
}
};
// ucm/store/device/ibuffered_device.h
class IBufferedDevice : public IDevice {
    Status Setup(bool transferUseDirect) override
    {
        if (transferUseDirect) {
            // Direct (GDS) path: host staging buffers are not needed.
            return Status::OK();
        }
        // Traditional path: fall through to host staging buffer setup.
        ...
    }
};
```
2. CUDA Implementation
```cpp
// ucm/store/device/cuda/cuda_device.cc
class CudaDevice : public IBufferedDevice {
// cuFileDriverOpen()
void InitGdsOnce();
// Sync impl
Status S2D(...) override; // Sync Storage to Device
Status D2S(...) override; // Sync Device to Storage
};
```
3. Queue Polymorphism
```cpp
// ucm/store/nfsstore/cc/domain/tsf_task/itsf_task_queue.h
class ITsfTaskQueue {
virtual Status Setup(...) = 0;
virtual void Push(std::list<TsfTask>& tasks) = 0;
};
class TsfTaskQueue : public ITsfTaskQueue { /* Traditional path */ };
class DirectTsfTaskQueue : public ITsfTaskQueue { /* GDS path */ };
```
4. Factory Pattern in Manager
```cpp
// TsfTaskManager creates the appropriate queue type based on configuration
Status TsfTaskManager::Setup(..., const bool transferUseDirect) {
for (size_t i = 0; i < streamNumber; ++i) {
std::unique_ptr<ITsfTaskQueue> queue;
if (transferUseDirect) {
queue = std::make_unique<DirectTsfTaskQueue>();
} else {
queue = std::make_unique<TsfTaskQueue>();
}
// Setup and store in unified _queues vector
}
}
```
Implementation Details
GDS Requirements
- CUDA 11.4+ with GDS support
- Compatible filesystem (ext4, xfs with O_DIRECT support)
- libcufile library
- 4KB alignment for offset, length, and buffer addresses
- Learn more at https://docs.nvidia.com/gpudirect-storage/getting-started/index.html
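The first two requirements can be probed at runtime before committing to the direct path. A sketch, where `ProbeGds` is an illustrative helper and not part of the codebase:
```cpp
// Probe the cuFile driver and log its properties; returns false when GDS
// is unusable and the traditional buffered path should be kept.
#include <cufile.h>
#include <cstdio>

bool ProbeGds()
{
    if (cuFileDriverOpen().err != CU_FILE_SUCCESS) {
        fprintf(stderr, "cuFile driver unavailable\n");
        return false;
    }
    CUfileDrvProps_t props{};
    if (cuFileDriverGetProperties(&props).err != CU_FILE_SUCCESS) {
        return false;
    }
    printf("nvfs %u.%u, max_direct_io_size=%zu\n",
           props.nvfs.major_version, props.nvfs.minor_version,
           (size_t)props.nvfs.max_direct_io_size);
    return true;
}
```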
DirectTsfTaskQueue
```cpp
class DirectTsfTaskQueue : public ITsfTaskQueue {
public:
    Status Setup(...) override;
    void Push(...) override;

private:
    // Worker thread processing load/dump tasks
    void DirectOper(TsfTask& task);
    // Calls CudaDevice::S2DAsync(...)
    Status S2D(const TsfTask& task);
    // Calls CudaDevice::D2SAsync(...)
    Status D2S(const TsfTask& task);
    void Done(const TsfTask& task, bool success);
    ...
};
```
FileHandleCache
FileHandleCache is a lightweight class for caching and reusing cuFileHandle objects, designed for use with NVIDIA GPUDirect Storage (GDS).
Its main goals are:
- Avoid redundant `cuFileHandleRegister()` calls to reduce overhead
- Automatically manage reference counts (`refCount`)
- Evict handles with `refCount == 0` when the cache exceeds its capacity
Core API
```cpp
CUfileHandle_t acquire(const std::string& path);
void release(const std::string& path);
void cleanup();
```
acquire(path)
- If the handle already exists in the cache, increments its reference count and returns it.
- Otherwise:
  - Opens the file using `open()` with the `O_DIRECT` flag.
  - Registers it with `cuFileHandleRegister()`.
  - Inserts it into the cache with `refCount = 1`.
  - Calls `cleanupIfNeeded()` to evict unused handles if the cache exceeds `maxSize`.
release(path)
- Decrements the reference count of the specified file handle.
- When the reference count reaches zero, the handle becomes eligible for cleanup.
cleanup()
- Manually removes and deregisters all handles with `refCount == 0`.
- Typically invoked before program shutdown or when releasing GPU resources.
Usage
- Use `auto fileHandleCache = Singleton<FileHandleCache>::Instance();` to obtain the shared instance.
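Putting the pieces together, a minimal sketch of such a cache could look as follows; the `Entry` layout, locking, and eviction order are illustrative, and error handling for `open()`/`cuFileHandleRegister()` failures is omitted for brevity:
```cpp
// Minimal FileHandleCache sketch with the semantics described above.
#include <cufile.h>
#include <fcntl.h>
#include <unistd.h>
#include <iterator>
#include <mutex>
#include <string>
#include <unordered_map>

class FileHandleCache {
public:
    explicit FileHandleCache(size_t maxSize = 128) : maxSize_(maxSize) {}

    CUfileHandle_t acquire(const std::string& path)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(path);
        if (it != cache_.end()) { // cache hit: reuse the registered handle
            ++it->second.refCount;
            return it->second.handle;
        }
        Entry entry{};
        entry.fd = open(path.c_str(), O_RDWR | O_DIRECT); // Direct-I/O open
        CUfileDescr_t desc{};
        desc.handle.fd = entry.fd;
        desc.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
        cuFileHandleRegister(&entry.handle, &desc); // register with GDS
        entry.refCount = 1;
        cache_.emplace(path, entry);
        cleanupIfNeeded(); // evict idle handles if over capacity
        return entry.handle;
    }

    void release(const std::string& path)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(path);
        if (it != cache_.end() && it->second.refCount > 0) {
            --it->second.refCount; // refCount == 0 makes it eligible for cleanup
        }
    }

    void cleanup() // drop all idle handles, e.g. before shutdown
    {
        std::lock_guard<std::mutex> lock(mutex_);
        for (auto it = cache_.begin(); it != cache_.end();) {
            it = (it->second.refCount == 0) ? evict(it) : std::next(it);
        }
    }

private:
    struct Entry {
        CUfileHandle_t handle{};
        int fd{-1};
        int refCount{0};
    };
    using Map = std::unordered_map<std::string, Entry>;

    Map::iterator evict(Map::iterator it)
    {
        cuFileHandleDeregister(it->second.handle); // deregister, then close fd
        close(it->second.fd);
        return cache_.erase(it);
    }

    void cleanupIfNeeded()
    {
        for (auto it = cache_.begin();
             cache_.size() > maxSize_ && it != cache_.end();) {
            it = (it->second.refCount == 0) ? evict(it) : std::next(it);
        }
    }

    std::mutex mutex_;
    Map cache_;
    size_t maxSize_;
};
```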
CudaDevice
Setup
```cpp
Status Setup(bool transferUseDirect) override
{
    ...
    if (transferUseDirect) {
        InitGdsOnce();
    }
    ...
}
...
Status CudaDevice::InitGdsOnce()
{
    auto ret = cuFileDriverOpen();
    if (ret.err == CU_FILE_SUCCESS) {
        UC_INFO("GDS driver initialized successfully");
        return Status::OK();
    }
    UC_WARN("Failed to initialize GDS driver, ret={}", ret.err);
    // Non-fatal: callers may keep the traditional buffered path.
    return Status::Unsupported();
}
```
We do not call `CUfileError_t cuFileDriverClose();` explicitly; driver shutdown happens implicitly upon process exit.
Sync Implementation Details (implemented first)
```cpp
ssize_t cuFileRead(CUfileHandle_t fh, void *bufPtr_base, size_t size, off_t file_offset, off_t bufPtr_offset);
```
- Key parameters:
  - `CUfileHandle_t fh`: registered file handle
  - `void *bufPtr_base`: base pointer of the device memory buffer
  - `size_t size`: number of bytes to read
  - `off_t file_offset`: offset in the file to read from
  - `off_t bufPtr_offset`: offset into the device buffer; set to 0 to read into the start of the buffer
- Return:
  - Number of bytes read on success
  - -1 on IO error, with errno set to indicate filesystem errors
  - All other errors return a negative integer value of the `CUfileOpError` enum
- Resource cleanup (deregister and close fd)
- Learn more at https://docs.nvidia.com/gpudirect-storage/api-reference-guide/index.html
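A sketch of how the synchronous storage-to-device path could use this API, assuming the FileHandleCache above and that `Singleton<T>::Instance()` returns a reference; `Status::Error()` is an assumed error factory:
```cpp
// Sketch of the sync storage -> device transfer (CudaDevice::S2D).
#include <cufile.h>
#include <cstddef>
#include <string>

Status CudaDevice::S2D(const std::string& path, size_t offset, size_t length,
                       std::byte* devicePtr)
{
    auto& cache = Singleton<FileHandleCache>::Instance();
    CUfileHandle_t fh = cache.acquire(path);
    ssize_t n = cuFileRead(fh, devicePtr, length,
                           static_cast<off_t>(offset), /*bufPtr_offset=*/0);
    cache.release(path);
    if (n < 0 || static_cast<size_t>(n) != length) {
        UC_WARN("cuFileRead failed, path={}, ret={}", path, n);
        return Status::Error(); // assumed error factory; adjust to the real API
    }
    return Status::OK();
}
```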
Async Implementation Details
```cpp
CUfileError_t cuFileReadAsync(CUfileHandle_t fh, void *bufPtr_base, size_t *size_p, off_t *file_offset_p, off_t *bufPtr_offset_p, ssize_t *bytes_read_p, CUstream stream);
```
- Key parameters:
  - `CUfileHandle_t fh`: registered file handle
  - `void *bufPtr_base`: registered device memory pointer
  - `size_t *size_p`: pointer to the transfer size (evaluated at execution time)
  - `off_t *file_offset_p`: pointer to the file offset (evaluated at execution time)
  - `off_t *bufPtr_offset_p`: pointer to the offset into the device buffer
  - `ssize_t *bytes_read_p`: host-allocated memory that receives the returned byte count
  - `CUstream stream`: CUDA stream for operation ordering
- Return:
  - `CU_FILE_SUCCESS` represents successful submission; the actual byte count lands in `*bytes_read_p` once the stream work completes
- Use `cudaStreamAddCallback` for monitoring async completion and task completion notification
- Automatic resource cleanup (file handles, buffer registration, streams) in callbacks
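A sketch of the async path following those rules, assuming a per-queue `stream_` member and the FileHandleCache above; `AsyncCtx` and `OnS2DComplete` are illustrative names, and note that the stream-based cuFile async APIs require a newer CUDA toolkit than the 11.4 baseline:
```cpp
// Sketch of the async storage -> device transfer. AsyncCtx keeps the
// pointer-to-size/offset arguments alive until the stream callback fires.
#include <cufile.h>
#include <cuda_runtime.h>
#include <functional>
#include <string>

struct AsyncCtx {
    std::string path;
    size_t size;
    off_t fileOffset;
    off_t bufOffset;
    ssize_t bytesRead;
    std::function<void(bool)> callback;
};

static void CUDART_CB OnS2DComplete(cudaStream_t, cudaError_t status, void* userData)
{
    auto* ctx = static_cast<AsyncCtx*>(userData);
    bool ok = (status == cudaSuccess) &&
              (ctx->bytesRead == static_cast<ssize_t>(ctx->size));
    Singleton<FileHandleCache>::Instance().release(ctx->path);
    ctx->callback(ok); // notify task completion
    delete ctx;
}

Status CudaDevice::S2DAsync(const std::string& path, size_t offset, size_t length,
                            std::byte* devicePtr, std::function<void(bool)> callback)
{
    auto* ctx = new AsyncCtx{path, length, static_cast<off_t>(offset),
                             /*bufOffset=*/0, /*bytesRead=*/0, std::move(callback)};
    CUfileHandle_t fh = Singleton<FileHandleCache>::Instance().acquire(path);
    // The pointer arguments are dereferenced when the read executes on the
    // stream, so they must outlive this call; ctx guarantees that.
    auto err = cuFileReadAsync(fh, devicePtr, &ctx->size, &ctx->fileOffset,
                               &ctx->bufOffset, &ctx->bytesRead, stream_);
    if (err.err != CU_FILE_SUCCESS) {
        Singleton<FileHandleCache>::Instance().release(path);
        delete ctx;
        return Status::Error(); // assumed error factory; adjust to the real API
    }
    cudaStreamAddCallback(stream_, OnS2DComplete, ctx, 0);
    return Status::OK();
}
```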
Alignment Requirements
GDS requires strict 4KB alignment for optimal performance:
- File offset: Must be multiple of 4KB
- Transfer length: Must be multiple of 4KB
- Device pointer: Must be 4KB aligned in device memory
When alignment requirements are not met, the API still works correctly for unaligned offsets and any data size, although the performance might not match that of aligned reads.
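A trivial helper sketch for validating these constraints before routing a request to the GDS queue; `kGdsAlignment` and `IsGdsAligned` are illustrative names:
```cpp
#include <cstddef>
#include <cstdint>
#include <sys/types.h>

// 4KB alignment check: aligned requests take the GDS path, misaligned
// ones fall back to the buffered (host-staged) path.
constexpr size_t kGdsAlignment = 4096;

inline bool IsGdsAligned(off_t fileOffset, size_t length, const void* devicePtr)
{
    return static_cast<size_t>(fileOffset) % kGdsAlignment == 0 &&
           length % kGdsAlignment == 0 &&
           reinterpret_cast<uintptr_t>(devicePtr) % kGdsAlignment == 0;
}
```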
Error Handling & Fallback
- The cuFile API automatically probes for GDS support at runtime; if the capability is missing, it silently falls back to compatibility mode.
- Alignment validation with automatic fallback for misaligned requests
- Check cufile.log for troubleshooting
- Even though the cuFile API performs automatic capability detection, registration can still fail; when it does, the job is aborted immediately rather than attempting any further fallback.
Build Configuration
CMake Configuration
```cmake
# ucm/store/device/cuda/CMakeLists.txt
...
set_target_properties(Cuda::cudart PROPERTIES
    INTERFACE_INCLUDE_DIRECTORIES "${CUDA_ROOT}/include"
    IMPORTED_LOCATION "${CUDA_ROOT}/lib64/libcudart.so"
)
# A single imported target cannot carry two IMPORTED_LOCATIONs, so libcufile
# gets its own imported target (assumed naming):
add_library(Cuda::cufile SHARED IMPORTED)
set_target_properties(Cuda::cufile PROPERTIES
    INTERFACE_INCLUDE_DIRECTORIES "${CUDA_ROOT}/include"
    IMPORTED_LOCATION "${CUDA_ROOT}/lib64/libcufile.so"
)
```
Summary
This GDS feature provides a high-performance alternative to traditional CPU-mediated storage transfers, with automatic fallback mechanisms and comprehensive error handling. The polymorphic design ensures a clean separation of concerns while maintaining full backward compatibility.
Alternatives
No response
Additional context
No response