

**Building Custom AI Infrastructure with NVLink Fusion**

# Rising Complexity of AI Models: From CPUs to the Age of Reasoning

Rack-scale computing: the foundation for tomorrow's AI infrastructure



#### Early AI Era

≤100M parameters

Follows Moore's Law: model size doubling every 20 months

Inference runs on CPUs



#### GPU AI Era

~100M to ~1B parameters
Doubling every 6 months
Inference runs on 1 GPU



#### Multi-GPU AI Era

~1B to multi-trillion parameters
Commercial AI Driven
Doubling every 10 months
Inference runs on up to 8 GPUs



#### Age of AI Reasoning at Scale

Drastic Increase in Compute for Reasoning
Expansion of Distributed Parallelism Techniques
Large Scale Mixture of Experts
Inference runs on up to 72 GPUs
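
The doubling periods quoted for the first three eras are easier to compare as yearly growth factors, computed as `2 ** (12 / T)` for a doubling period of `T` months. The sketch below is illustrative arithmetic only; the function and dictionary names are not from the presentation.

```python
# Convert a doubling period (in months) into a growth factor per year.
# The doubling periods are the figures quoted on the slide; all names
# here are illustrative.

def annual_growth(doubling_months: float) -> float:
    """Growth factor over 12 months for a quantity that doubles every `doubling_months`."""
    return 2.0 ** (12.0 / doubling_months)


ERAS = {
    "Early AI": 20,
    "GPU AI": 6,
    "Multi-GPU AI": 10,
}

if __name__ == "__main__":
    for era, months in ERAS.items():
        print(f"{era}: x{annual_growth(months):.2f} per year")
```

A 6-month doubling period corresponds to 4x growth per year, against roughly 1.5x per year for a 20-month period.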





# Requirements of Scale-Up Fabric for Rack Scale Computing

Scale-up fabric is a compute fabric



Very large models

Tensor parallelism

Expert parallelism

Large contexts

Domain size

Peer-compute memory access

High bandwidth, low latency

Collective offloads

Flexible workload shapes



### NVIDIA GB300 NVL72

#### Scaling GPU domains with NVIDIA NVLink





- GPU bandwidth: 1.8 TB/s
- Domain size: 72 GPUs
- All-to-all: 130 TB/s
- All-reduce: 260 TB/s
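
The all-to-all figure is consistent with summing the per-GPU bandwidth across the domain. The sketch below checks that arithmetic using only the numbers quoted above; the constant and function names are illustrative. The slide does not say how the 260 TB/s all-reduce figure is derived, so the code only reports its ratio to the aggregate.

```python
# Sanity-check the NVL72 headline numbers. Integer GB/s keeps the
# multiplication exact.
GPU_BW_GBPS = 1_800        # 1.8 TB/s per GPU (from the slide)
DOMAIN_SIZE = 72           # GPUs in one NVLink domain (from the slide)
ALL_REDUCE_TBPS = 260      # quoted all-reduce figure (from the slide)


def aggregate_bw_tbps(per_gpu_gbps: int, gpus: int) -> float:
    """Sum of per-GPU NVLink bandwidth across the domain, in TB/s."""
    return per_gpu_gbps * gpus / 1_000


if __name__ == "__main__":
    total = aggregate_bw_tbps(GPU_BW_GBPS, DOMAIN_SIZE)
    print(f"aggregate GPU bandwidth: {total} TB/s")
    print(f"all-reduce / aggregate:  {ALL_REDUCE_TBPS / total:.2f}")
```

The aggregate comes to 129.6 TB/s, which rounds to the quoted 130 TB/s.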



NVIDIA NVLink Fabric and rack architecture for custom compute

- Single, scalable AI factory architecture
- Brings together entire rack architecture
- Proven scale-up and scale-out roadmap and ecosystem



Custom Combinations with Open Standards Integration







### **NVIDIA NVL72 Rack Architecture**

**OCP MGX Rack** (diagram labels): 10x compute trays, 9x **NVLink Switch** trays, 8x compute trays

#### Single 72-GPU L1 Domain

Fully Copper Domain

#### 9x Switch Trays

2x Switch ASICs per tray
7.2 TB/s per switch

#### **18x Compute Trays**

4x GPUs/Tray

1.8 TB/s per GPU
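
The tray counts and per-device bandwidths above can be cross-checked against each other. The sketch below is a small bookkeeping model, assuming nothing beyond the numbers on this slide; the class and property names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nvl72Rack:
    """Tray counts and bandwidths as quoted on the slide (bandwidth in GB/s)."""
    compute_trays: int = 18
    gpus_per_tray: int = 4
    switch_trays: int = 9
    switches_per_tray: int = 2
    gpu_bw_gbps: int = 1_800      # 1.8 TB/s per GPU
    switch_bw_gbps: int = 7_200   # 7.2 TB/s per switch

    @property
    def gpus(self) -> int:
        return self.compute_trays * self.gpus_per_tray

    @property
    def switch_asics(self) -> int:
        return self.switch_trays * self.switches_per_tray

    @property
    def gpu_bw_total_gbps(self) -> int:
        return self.gpus * self.gpu_bw_gbps

    @property
    def switch_bw_total_gbps(self) -> int:
        return self.switch_asics * self.switch_bw_gbps


if __name__ == "__main__":
    rack = Nvl72Rack()
    print(rack.gpus, "GPUs,", rack.switch_asics, "switch ASICs")
    print(rack.gpu_bw_total_gbps, "GB/s GPU side,",
          rack.switch_bw_total_gbps, "GB/s switch side")
```

With these figures the rack holds 72 GPUs and 18 switch ASICs, and the total GPU bandwidth equals the total switch bandwidth (129,600 GB/s each).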





## **NVLink Fusion System Integration**

Complete Leverage from NVL72 Rack Architecture

- NVLink Fusion CPUs and XPUs integrate into the NVIDIA rack architecture in the same way as NVIDIA Grace/Vera CPUs and NVIDIA GPUs
- NVLink C2C integration offered for optimal CPU connectivity
- Complete leverage from NVL72 Architecture
  - System, rack architecture
  - Software tools and libraries
  - Supply chain





#### **Custom CPU integration**

- CPU-GPU High Bandwidth Interface
  - More than 100 lanes of PCIe Gen6
- Unified Memory Architecture
  - GPU and CPU can access all memory
- Heterogeneous Memory Model Support
  - CPU follows CPU memory model
  - GPU follows GPU memory model
  - Full support for CPU- or GPU-defined atomics
- GPU Memory expansion
  - Native use of Host Memory from GPU
  - Supports all native operations to host memory
- Protocol support
  - Symmetric functionality for Host and Accelerator
  - Compatible with industry standard IP interfaces
- NVLink IP Integration
  - Soft IP + Optimized IO Implementation





#### **Custom XPU integration**

- NVLink Integration to XPU
  - Chiplet-based Integration
  - CHI-like protocol, UCIe PHY layer
- Zero-friction access to peer-XPU memory
  - Load/Store architecture
  - Access through DMA engines
- Ultra-High Bandwidth Architecture
  - Slim Network Layer
  - Low latency, area and power overhead
- In-Network Computing
  - Multicast, programmable reductions
- Fabric Management
  - NVLink configuration, decoding, mapping and routing
- NVLink C2C IP Integration
  - Optimal CPU:XPU interface option
- Custom CPU/NIC Support





## **NVLink Integration with XPU Compute**

- NVLink chiplet encapsulates NVLink functionality
  - Exposes a standard interface to the XPU Compute
  - NVLink specifics contained within the chiplet
  - Link and Physical Layer for D2D compatible with standard UCIe specifications
- Protocol is CHI-like, optimized for NVLink
  - Packetization leverages CHI C2C on UCIe
  - Support for scale-up operations
    - Peer-XPU Memory reads/writes
    - Collective operations for reductions
    - Atomic operations
- XPU has flexible choices on integration into XPU fabric
  - Peer XPU memory can be exposed only to DMA engines, or more deeply integrated into the PEs
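
The three classes of scale-up operation listed above (peer memory reads/writes, atomics, and reductions) can be pictured with a toy in-process model. Everything in the sketch below is hypothetical: `ToyXpu`, `ToyFabric`, and their methods are illustration-only names, and the code says nothing about the actual NVLink protocol, packet formats, or chiplet interface.

```python
# Toy model only: plain Python lists stand in for XPU memories and a
# dictionary lookup stands in for fabric routing.

class ToyXpu:
    """One XPU with a small local memory addressed by word index."""

    def __init__(self, xpu_id: int, words: int):
        self.xpu_id = xpu_id
        self.mem = [0] * words


class ToyFabric:
    """Routes operations to the memory of the target XPU."""

    def __init__(self, xpus):
        self.xpus = {x.xpu_id: x for x in xpus}

    def write(self, dst: int, addr: int, value: int) -> None:
        """Peer memory write."""
        self.xpus[dst].mem[addr] = value

    def read(self, src: int, addr: int) -> int:
        """Peer memory read."""
        return self.xpus[src].mem[addr]

    def atomic_add(self, dst: int, addr: int, value: int) -> int:
        """Fetch-and-add on a peer's memory; returns the previous value."""
        old = self.xpus[dst].mem[addr]
        self.xpus[dst].mem[addr] = old + value
        return old

    def reduce_sum(self, addr: int, members) -> int:
        """Sum the word at `addr` across the given group of XPUs."""
        return sum(self.xpus[m].mem[addr] for m in members)


if __name__ == "__main__":
    fabric = ToyFabric([ToyXpu(i, words=4) for i in range(3)])
    for i in range(3):
        fabric.write(dst=i, addr=0, value=10 * (i + 1))
    print(fabric.read(src=2, addr=0))
    print(fabric.atomic_add(dst=0, addr=0, value=5))
    print(fabric.reduce_sum(addr=0, members=[0, 1, 2]))
```

In this picture, exposing peer memory only to DMA engines would mean that only a copy engine calls `read` and `write`, while deeper integration would let compute elements issue them directly.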







### **NVLink Switch**

- NVLink5 Switch
  - Single monolithic die for minimal latency
  - 7.2 TB/s full all-to-all bidirectional BW over 72 ports
- SHARP™ In-Network Compute
  - 3.6 TFLOPS of compute
  - Unicast, multicast writes, and multicast reads with data reduction
  - Multiple operand types, from 32 bits down to 8 bits
  - Multiple reduction groups in parallel
- Fabric Partitioning Support
- NVLink5 Switch Tray
  - 2x NVLink5 Switch chips
  - 14.4 TB/s total bandwidth
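
The switch figures above imply 100 GB/s per port (7.2 TB/s over 72 ports), and the tray total is two switches' worth. The sketch below records that arithmetic and adds a toy model of in-network reduction with several independent groups. It is an illustration under stated assumptions, not the SHARP programming interface: `switch_reduce` and `reduce_groups` are hypothetical names, and a Python sum stands in for the switch's reduction hardware.

```python
# Figures from the slide, in GB/s.
SWITCH_BW_GBPS = 7_200
SWITCH_PORTS = 72
PER_PORT_GBPS = SWITCH_BW_GBPS // SWITCH_PORTS   # 100 GB/s per port
TRAY_BW_GBPS = 2 * SWITCH_BW_GBPS                # 14,400 GB/s = 14.4 TB/s


def switch_reduce(contributions):
    """Sum equal-length buffers element-wise and return one copy per sender.

    contributions: dict mapping a GPU id to its list of numbers.
    """
    buffers = list(contributions.values())
    if not buffers or any(len(b) != len(buffers[0]) for b in buffers):
        raise ValueError("need at least one buffer, all of equal length")
    reduced = [sum(column) for column in zip(*buffers)]
    # "Multicast": every member of the group receives the same reduced buffer.
    return {gpu: list(reduced) for gpu in contributions}


def reduce_groups(groups):
    """Reduce several independent groups; each group is handled on its own."""
    return {name: switch_reduce(members) for name, members in groups.items()}


if __name__ == "__main__":
    groups = {
        "group_a": {0: [1, 2], 1: [10, 20]},
        "group_b": {2: [5, 5], 3: [1, 1], 4: [4, 4]},
    }
    print(PER_PORT_GBPS, TRAY_BW_GBPS)
    print(reduce_groups(groups))
```

In this model each GPU sends its buffer once and receives the result once. A ring all-reduce over N endpoints, by comparison, has each endpoint send 2(N-1)/N times the buffer size.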



### **NVLink Fusion Software**

NVIDIA's best-in-class communication libraries, telemetry, and debug tools

#### NCCL

- Best-in-class algorithms for low latency and high bandwidth
- Topology graph search for finding optimal data paths
- Integration with PyTorch, vLLM, SGLang, and many other frameworks
- Ten years of performance tuning and production operation
- Fabric and Memory Management
  - Address space management APIs
  - Extendable for XPU-specific memory semantics
  - Routing and forwarding setup
- User Mode Tools
  - NVLink telemetry and debug
- Diagnostics
  - Manufacturing diagnostic capabilities for compute tray and rack
  - Fabric Testing at rack scale
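
To make the "algorithms" bullet for NCCL more concrete, the sketch below simulates a textbook ring all-reduce (reduce-scatter followed by all-gather) in plain Python. It is not NCCL code and does not call the NCCL API; `ring_all_reduce` is a hypothetical helper, and Python lists stand in for device buffers.

```python
def ring_all_reduce(buffers):
    """Sum-all-reduce over a logical ring of ranks.

    buffers: one equal-length list per rank; the length must be a
    multiple of the number of ranks.
    Returns (result_buffers, elements_sent_per_rank).
    """
    n = len(buffers)
    if n == 0:
        raise ValueError("need at least one rank")
    size = len(buffers[0])
    if any(len(b) != size for b in buffers) or size % n:
        raise ValueError("buffers must be equal length, divisible by rank count")
    chunk = size // n
    data = [list(b) for b in buffers]
    sent = 0

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    def exchange(step, first_chunk_offset, combine):
        # Every rank sends one chunk to its right-hand neighbour. Messages
        # are collected first so all sends see the same pre-step state.
        msgs = []
        for r in range(n):
            c = (r + first_chunk_offset - step) % n
            msgs.append(((r + 1) % n, c, [data[r][i] for i in span(c)]))
        for dst, c, payload in msgs:
            for i, v in zip(span(c), payload):
                data[dst][i] = combine(data[dst][i], v)

    # Reduce-scatter: after n-1 steps, rank r holds the full sum of chunk r+1.
    for step in range(n - 1):
        exchange(step, 0, lambda mine, theirs: mine + theirs)
        sent += chunk
    # All-gather: circulate the finished chunks until every rank has all of them.
    for step in range(n - 1):
        exchange(step, 1, lambda mine, theirs: theirs)
        sent += chunk
    return data, sent


if __name__ == "__main__":
    result, sent = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
    print(result, sent)
```

Each rank sends 2(N-1) chunks of size S/N, which is the per-rank traffic that topology-aware algorithm selection and in-network reduction aim to reduce.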





# **NVLink Fusion Ecosystem**

#### **Custom Silicon Partners**









**CPU Partners** 



Qualcomm

**Technology Partners** 

Cadence, Synopsys

#### **System Partners**

















PEGATRON














Rich ecosystem of partners

























# **NVLink Fusion System Roadmap**

One-Year Rhythm | Full-System | One Architecture





