🚀 The feature, motivation and pitch
GPU Direct Storage (GDS) Feature Documentation
Overview
This feature introduces GPU Direct Storage (GDS) support to enable direct data transfer between GPU memory and storage, bypassing CPU memory as an intermediate buffer. This significantly reduces memory bandwidth bottlenecks and improves KV cache loading/offloading performance.
Motivation
Currently, the KV cache transfer follows this path:
Device Memory <-> Host Memory <-> Storage
With GDS support, we can achieve:
Device Memory <-> Storage (direct)
This eliminates the need for:
- Host memory staging buffers
- CPU-GPU memory copies (H2D/D2H)
- Additional memory bandwidth consumption
- Two-stage transfer operations
Technical Design
Basic GDS call flow
1. `cuFileDriverOpen()`: initialize the GDS runtime. Must be called first; it is safe to call repeatedly.
2. `open(path, O_DIRECT | …)`: open the target file with POSIX, making sure it goes through the Direct-I/O path.
3. `cuFileHandleRegister(&fh, &desc)`: wrap the regular fd into a cuFile handle and register it with GDS. The handle should be cached to improve performance.
4. `cuFileBufRegister(d_buf, size, 0)`: register a GPU buffer with GDS for zero-copy. We will not use this API: if `cuFileBufRegister` has not been previously called on the buffer pointer, `cuFileRead`/`cuFileWrite` will use internal registered buffers when required (`per_buffer_cache_size_kb` = 1MB, `max_device_cache_size_kb` = 128MB).
5. `cuFileRead(fh, d_buf, size, offset, 0)` or `cuFileWrite(fh, d_buf, size, offset, 0)`: transfer data between the file and GPU memory with no CPU copies.
6. `cuFileBufDeregister(d_buf)`: unregister the buffer (skipped when step 4 is skipped).
7. `cuFileHandleDeregister(fh)`: release the file-handle resource.
8. `close(fd)`: close the underlying file descriptor.
9. `cuFileDriverClose()`: shut down GDS and release global resources.
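As a concrete reference, here is a minimal end-to-end sketch of this call flow, assuming a CUDA toolchain with libcufile available; the file path and transfer size are illustrative only:
```cpp
// Minimal sketch of the basic GDS call flow (steps numbered as above).
// Error handling is abbreviated; kPath and kSize are illustrative.
#define _GNU_SOURCE // for O_DIRECT (g++ usually defines this already)
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    if (cuFileDriverOpen().err != CU_FILE_SUCCESS) { return 1; } // 1. init GDS

    const char* kPath = "/mnt/nfs/kvcache.bin";
    int fd = open(kPath, O_RDONLY | O_DIRECT);                   // 2. Direct-I/O open
    if (fd < 0) { cuFileDriverClose(); return 1; }

    CUfileDescr_t desc{};
    desc.handle.fd = fd;
    desc.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t fh;
    if (cuFileHandleRegister(&fh, &desc).err != CU_FILE_SUCCESS) { return 1; } // 3.

    const size_t kSize = 1 << 20; // 1MB, a multiple of 4KB
    void* d_buf = nullptr;
    cudaMalloc(&d_buf, kSize);
    // 4. cuFileBufRegister is skipped on purpose: cuFile falls back to its
    //    internal registered buffers when the pointer was never registered.

    ssize_t n = cuFileRead(fh, d_buf, kSize, /*file_offset=*/0,
                           /*bufPtr_offset=*/0);                 // 5. storage -> GPU
    if (n < 0) { fprintf(stderr, "cuFileRead failed: %zd\n", n); }

    cuFileHandleDeregister(fh);                                  // 7. release handle
    close(fd);                                                   // 8. close fd
    cudaFree(d_buf);
    cuFileDriverClose();                                         // 9. shut down GDS
    return 0;
}
```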
Architecture Overview
The design follows a clean separation of concerns with polymorphic queue types:
- Configuration Layer: `NFSStore::Config` adds a `transferUseDirect` flag
- Queue Abstraction: a new `ITsfTaskQueue` interface for polymorphic queue types
- Implementation Layer: `TsfTaskQueue` for the traditional path (Device <-> Host <-> Storage) and `DirectTsfTaskQueue` for the GDS path (Device <-> Storage)
- Device Layer: the CUDA device implements `S2D`/`D2S` methods using the cuFile APIs
Key Components
1. Device Interface Extension
```cpp
// ucm/store/device/idevice.h
class IDevice {
virtual Status Setup(bool transferUseDirect) = 0;
// GDS Sync
virtual Status S2D(const std::string& path, size_t offset, size_t length,
std::byte* devicePtr) {
return Status::Unsupported();
}
virtual Status D2S(const std::string& path, size_t offset, size_t length,
                   const std::byte* devicePtr) {
return Status::Unsupported();
}
// GDS Async
virtual Status S2DAsync(const std::string& path, size_t offset, size_t length,
std::byte* devicePtr, std::function<void(bool)> callback) {
return Status::Unsupported();
}
virtual Status D2SAsync(const std::string& path, size_t offset, size_t length,
const std::byte* devicePtr, std::function<void(bool)> callback) {
return Status::Unsupported();
}
};
// ucm/store/device/ibuffered_device.h
class IBufferedDevice : public IDevice {
    Status Setup(bool transferUseDirect) override
    {
        if (transferUseDirect) {
            // Direct (GDS) path: host staging buffers are not needed.
            return Status::OK();
        }
        // Traditional path: fall through to host staging buffer setup.
        ...
    }
};
```
2. CUDA Implementation
```cpp
// ucm/store/device/cuda/cuda_device.cc
class CudaDevice : public IBufferedDevice {
// cuFileDriverOpen()
void InitGdsOnce();
// Sync impl
Status S2D(...) override; // Sync Storage to Device
Status D2S(...) override; // Sync Device to Storage
};
```
3. Queue Polymorphism
```cpp
// ucm/store/nfsstore/cc/domain/tsf_task/itsf_task_queue.h
class ITsfTaskQueue {
virtual Status Setup(...) = 0;
virtual void Push(std::list<TsfTask>& tasks) = 0;
};
class TsfTaskQueue : public ITsfTaskQueue { /* Traditional path */ };
class DirectTsfTaskQueue : public ITsfTaskQueue { /* GDS path */ };
```
4. Factory Pattern in Manager
```cpp
// TsfTaskManager creates the appropriate queue type based on configuration
Status TsfTaskManager::Setup(..., const bool transferUseDirect) {
for (size_t i = 0; i < streamNumber; ++i) {
std::unique_ptr<ITsfTaskQueue> queue;
if (transferUseDirect) {
queue = std::make_unique<DirectTsfTaskQueue>();
} else {
queue = std::make_unique<TsfTaskQueue>();
}
// Setup and store in unified _queues vector
}
}
```
Implementation Details
GDS Requirements
- CUDA 11.4+ with GDS support
- Compatible filesystem (ext4, xfs with O_DIRECT support)
- libcufile library
- 4KB alignment for offset, length, and buffer addresses
- Learn more at https://docs.nvidia.com/gpudirect-storage/getting-started/index.html
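The first two requirements can be probed at runtime before committing to the direct path. A sketch, where `ProbeGds` is an illustrative helper and not part of the codebase:
```cpp
// Probe the cuFile driver and log its properties; returns false when GDS
// is unusable and the traditional buffered path should be kept.
#include <cufile.h>
#include <cstdio>

bool ProbeGds()
{
    if (cuFileDriverOpen().err != CU_FILE_SUCCESS) {
        fprintf(stderr, "cuFile driver unavailable\n");
        return false;
    }
    CUfileDrvProps_t props{};
    if (cuFileDriverGetProperties(&props).err != CU_FILE_SUCCESS) {
        return false;
    }
    printf("nvfs %u.%u, max_direct_io_size=%zu\n",
           props.nvfs.major_version, props.nvfs.minor_version,
           (size_t)props.nvfs.max_direct_io_size);
    return true;
}
```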
DirectTsfTaskQueue
```cpp
class DirectTsfTaskQueue : public ITsfTaskQueue {
public:
    Status Setup(...) override;
    void Push(...) override;

private:
    // Worker thread processing load/dump tasks
    void DirectOper(TsfTask& task);
    // Calls CudaDevice::S2DAsync(...)
    Status S2D(const TsfTask& task);
    // Calls CudaDevice::D2SAsync(...)
    Status D2S(const TsfTask& task);
    void Done(const TsfTask& task, bool success);
    ...
};
```
FileHandleCache
FileHandleCache is a lightweight class for caching and reusing cuFileHandle objects, designed for use with NVIDIA GPUDirect Storage (GDS).
Its main goals are:
- Avoid redundant `cuFileHandleRegister()` calls to reduce overhead
- Automatically manage reference counts (`refCount`)
- Evict handles with `refCount == 0` when the cache exceeds its capacity
Core API
```cpp
CUfileHandle_t acquire(const std::string& path);
void release(const std::string& path);
void cleanup();
```
acquire(path)
- If the handle already exists in the cache, increments its reference count and returns it.
- Otherwise:
  - Opens the file using `open()` with the `O_DIRECT` flag.
  - Registers it with `cuFileHandleRegister()`.
  - Inserts it into the cache with `refCount = 1`.
  - Calls `cleanupIfNeeded()` to evict unused handles if the cache exceeds `maxSize`.
release(path)
- Decrements the reference count of the specified file handle.
- When the reference count reaches zero, the handle becomes eligible for cleanup.
cleanup()
- Manually removes and deregisters all handles with `refCount == 0`.
- Typically invoked before program shutdown or when releasing GPU resources.
Usage
- Use `auto fileHandleCache = Singleton<FileHandleCache>::Instance();` to obtain the shared instance.
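Putting the pieces together, a minimal sketch of such a cache could look as follows; the `Entry` layout, locking, and eviction order are illustrative, and error handling for `open()`/`cuFileHandleRegister()` failures is omitted for brevity:
```cpp
// Minimal FileHandleCache sketch with the semantics described above.
#include <cufile.h>
#include <fcntl.h>
#include <unistd.h>
#include <iterator>
#include <mutex>
#include <string>
#include <unordered_map>

class FileHandleCache {
public:
    explicit FileHandleCache(size_t maxSize = 128) : maxSize_(maxSize) {}

    CUfileHandle_t acquire(const std::string& path)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(path);
        if (it != cache_.end()) { // cache hit: reuse the registered handle
            ++it->second.refCount;
            return it->second.handle;
        }
        Entry entry{};
        entry.fd = open(path.c_str(), O_RDWR | O_DIRECT); // Direct-I/O open
        CUfileDescr_t desc{};
        desc.handle.fd = entry.fd;
        desc.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
        cuFileHandleRegister(&entry.handle, &desc); // register with GDS
        entry.refCount = 1;
        cache_.emplace(path, entry);
        cleanupIfNeeded(); // evict idle handles if over capacity
        return entry.handle;
    }

    void release(const std::string& path)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(path);
        if (it != cache_.end() && it->second.refCount > 0) {
            --it->second.refCount; // refCount == 0 makes it eligible for cleanup
        }
    }

    void cleanup() // drop all idle handles, e.g. before shutdown
    {
        std::lock_guard<std::mutex> lock(mutex_);
        for (auto it = cache_.begin(); it != cache_.end();) {
            it = (it->second.refCount == 0) ? evict(it) : std::next(it);
        }
    }

private:
    struct Entry {
        CUfileHandle_t handle{};
        int fd{-1};
        int refCount{0};
    };
    using Map = std::unordered_map<std::string, Entry>;

    Map::iterator evict(Map::iterator it)
    {
        cuFileHandleDeregister(it->second.handle); // deregister, then close fd
        close(it->second.fd);
        return cache_.erase(it);
    }

    void cleanupIfNeeded()
    {
        for (auto it = cache_.begin();
             cache_.size() > maxSize_ && it != cache_.end();) {
            it = (it->second.refCount == 0) ? evict(it) : std::next(it);
        }
    }

    std::mutex mutex_;
    Map cache_;
    size_t maxSize_;
};
```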
CudaDevice
Setup
```cpp
Status Setup(bool transferUseDirect) override
{
    ...
    if (transferUseDirect) {
        InitGdsOnce();
    }
    ...
}
...
Status CudaDevice::InitGdsOnce()
{
    auto ret = cuFileDriverOpen();
    if (ret.err == CU_FILE_SUCCESS) {
        UC_INFO("GDS driver initialized successfully");
        return Status::OK();
    }
    UC_WARN("Failed to initialize GDS driver, ret={}", ret.err);
    // Non-fatal: callers may keep the traditional buffered path.
    return Status::Unsupported();
}
```
We do not call `CUfileError_t cuFileDriverClose();` explicitly; driver shutdown happens implicitly upon process exit.
Sync Implementation Details (implemented first)
```cpp
ssize_t cuFileRead(CUfileHandle_t fh, void *bufPtr_base, size_t size, off_t file_offset, off_t bufPtr_offset);
```
- Key parameters:
  - `CUfileHandle_t fh`: registered file handle
  - `void *bufPtr_base`: base pointer of the device memory buffer
  - `size_t size`: number of bytes to read
  - `off_t file_offset`: offset in the file to read from
  - `off_t bufPtr_offset`: offset into the device buffer; set to 0 to read into the start of the buffer
- Return:
  - Number of bytes read on success
  - -1 on IO error, with errno set to indicate filesystem errors
  - All other errors return a negative integer value of the `CUfileOpError` enum
- Resource cleanup (deregister and close fd)
- Learn more at https://docs.nvidia.com/gpudirect-storage/api-reference-guide/index.html
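A sketch of how the synchronous storage-to-device path could use this API, assuming the FileHandleCache above and that `Singleton<T>::Instance()` returns a reference; `Status::Error()` is an assumed error factory:
```cpp
// Sketch of the sync storage -> device transfer (CudaDevice::S2D).
#include <cufile.h>
#include <cstddef>
#include <string>

Status CudaDevice::S2D(const std::string& path, size_t offset, size_t length,
                       std::byte* devicePtr)
{
    auto& cache = Singleton<FileHandleCache>::Instance();
    CUfileHandle_t fh = cache.acquire(path);
    ssize_t n = cuFileRead(fh, devicePtr, length,
                           static_cast<off_t>(offset), /*bufPtr_offset=*/0);
    cache.release(path);
    if (n < 0 || static_cast<size_t>(n) != length) {
        UC_WARN("cuFileRead failed, path={}, ret={}", path, n);
        return Status::Error(); // assumed error factory; adjust to the real API
    }
    return Status::OK();
}
```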
Async Implementation Details
```cpp
CUfileError_t cuFileReadAsync(CUfileHandle_t fh, void *bufPtr_base, size_t *size_p, off_t *file_offset_p, off_t *bufPtr_offset_p, ssize_t *bytes_read_p, CUstream stream);
```
- Key parameters:
  - `CUfileHandle_t fh`: registered file handle
  - `void *bufPtr_base`: registered device memory pointer
  - `size_t *size_p`: pointer to the transfer size (evaluated at execution time)
  - `off_t *file_offset_p`: pointer to the file offset (evaluated at execution time)
  - `off_t *bufPtr_offset_p`: pointer to the offset into the device buffer
  - `ssize_t *bytes_read_p`: host-allocated memory that receives the returned byte count
  - `CUstream stream`: CUDA stream for operation ordering
- Return:
  - `CU_FILE_SUCCESS` represents successful submission; the actual byte count lands in `*bytes_read_p` once the stream work completes
- Use `cudaStreamAddCallback` for monitoring async completion and task completion notification
- Automatic resource cleanup (file handles, buffer registration, streams) in callbacks
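A sketch of the async path following those rules, assuming a per-queue `stream_` member and the FileHandleCache above; `AsyncCtx` and `OnS2DComplete` are illustrative names, and note that the stream-based cuFile async APIs require a newer CUDA toolkit than the 11.4 baseline:
```cpp
// Sketch of the async storage -> device transfer. AsyncCtx keeps the
// pointer-to-size/offset arguments alive until the stream callback fires.
#include <cufile.h>
#include <cuda_runtime.h>
#include <functional>
#include <string>

struct AsyncCtx {
    std::string path;
    size_t size;
    off_t fileOffset;
    off_t bufOffset;
    ssize_t bytesRead;
    std::function<void(bool)> callback;
};

static void CUDART_CB OnS2DComplete(cudaStream_t, cudaError_t status, void* userData)
{
    auto* ctx = static_cast<AsyncCtx*>(userData);
    bool ok = (status == cudaSuccess) &&
              (ctx->bytesRead == static_cast<ssize_t>(ctx->size));
    Singleton<FileHandleCache>::Instance().release(ctx->path);
    ctx->callback(ok); // notify task completion
    delete ctx;
}

Status CudaDevice::S2DAsync(const std::string& path, size_t offset, size_t length,
                            std::byte* devicePtr, std::function<void(bool)> callback)
{
    auto* ctx = new AsyncCtx{path, length, static_cast<off_t>(offset),
                             /*bufOffset=*/0, /*bytesRead=*/0, std::move(callback)};
    CUfileHandle_t fh = Singleton<FileHandleCache>::Instance().acquire(path);
    // The pointer arguments are dereferenced when the read executes on the
    // stream, so they must outlive this call; ctx guarantees that.
    auto err = cuFileReadAsync(fh, devicePtr, &ctx->size, &ctx->fileOffset,
                               &ctx->bufOffset, &ctx->bytesRead, stream_);
    if (err.err != CU_FILE_SUCCESS) {
        Singleton<FileHandleCache>::Instance().release(path);
        delete ctx;
        return Status::Error(); // assumed error factory; adjust to the real API
    }
    cudaStreamAddCallback(stream_, OnS2DComplete, ctx, 0);
    return Status::OK();
}
```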
Alignment Requirements
GDS requires strict 4KB alignment for optimal performance:
- File offset: Must be multiple of 4KB
- Transfer length: Must be multiple of 4KB
- Device pointer: Must be 4KB aligned in device memory
When alignment requirements are not met, the API still works correctly for unaligned offsets and any data size, although the performance might not match that of aligned reads.
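A trivial helper sketch for validating these constraints before routing a request to the GDS queue; `kGdsAlignment` and `IsGdsAligned` are illustrative names:
```cpp
#include <cstddef>
#include <cstdint>
#include <sys/types.h>

// 4KB alignment check: aligned requests take the GDS path, misaligned
// ones fall back to the buffered (host-staged) path.
constexpr size_t kGdsAlignment = 4096;

inline bool IsGdsAligned(off_t fileOffset, size_t length, const void* devicePtr)
{
    return static_cast<size_t>(fileOffset) % kGdsAlignment == 0 &&
           length % kGdsAlignment == 0 &&
           reinterpret_cast<uintptr_t>(devicePtr) % kGdsAlignment == 0;
}
```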
Error Handling & Fallback
- The cuFile API automatically probes for GDS support at runtime; if the capability is missing, it silently falls back to compatibility mode.
- Alignment validation with automatic fallback for misaligned requests
- Check cufile.log for troubleshooting
- Even though the cuFile API performs automatic capability detection, registration can still fail; when it does, the job is aborted immediately rather than attempting any further fallback.
Build Configuration
CMake Configuration
```cmake
# ucm/store/device/cuda/CMakeLists.txt
...
set_target_properties(Cuda::cudart PROPERTIES
    INTERFACE_INCLUDE_DIRECTORIES "${CUDA_ROOT}/include"
    IMPORTED_LOCATION "${CUDA_ROOT}/lib64/libcudart.so"
)
# A single imported target cannot carry two IMPORTED_LOCATIONs, so libcufile
# gets its own imported target (assumed naming):
add_library(Cuda::cufile SHARED IMPORTED)
set_target_properties(Cuda::cufile PROPERTIES
    INTERFACE_INCLUDE_DIRECTORIES "${CUDA_ROOT}/include"
    IMPORTED_LOCATION "${CUDA_ROOT}/lib64/libcufile.so"
)
```
Summary
This GDS feature provides a high-performance alternative to traditional CPU-mediated storage transfers, with automatic fallback mechanisms and comprehensive error handling. The polymorphic design ensures a clean separation of concerns while maintaining full backward compatibility.
Alternatives
No response
Additional context
No response