# Lab 1.1.2: Memory Architecture Lab

**Module:** 1.1 - DGX Spark Platform Mastery  
**Time:** 1.5 hours  
**Difficulty:** ⭐⭐ (Intermediate)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand how unified memory works on DGX Spark
- [ ] Allocate and monitor GPU tensors of various sizes
- [ ] Identify the memory limits and behavior patterns
- [ ] Learn to clear buffer cache for optimal memory availability

---

## 📚 Prerequisites

- Completed: Lab 1.1.1 (System Exploration)
- Running inside the NGC PyTorch container

---

## 🌍 Real-World Context

Running a 70-billion-parameter language model takes roughly 35-140GB of memory for the weights alone, depending on quantization. Most discrete GPUs offer 24-80GB. The DGX Spark's 128GB of unified memory changes the picture because:

1. **No memory copy overhead** - Data doesn't need to transfer between CPU and GPU
2. **Larger models fit** - Run models that wouldn't fit on traditional GPUs
3. **Simpler programming** - No need to manage separate memory pools

In this lab, you'll see this in action by allocating tensors of various sizes and watching how memory behaves.

---

## 🧒 ELI5: Unified Memory

> **Imagine you have two desks in your room...**
>
> One desk (CPU) is where you read books and make plans. The other desk (GPU) is where you do thousands of math problems super fast. Normally, if you need to use something from one desk on the other, you have to walk over and carry it - which takes time!
>
> **Unified memory is like pushing both desks together into one GIANT desk.** Now everything is right there - no more walking back and forth! Both "workers" (CPU and GPU) can reach everything instantly.
>
> That's why DGX Spark can load huge AI brains (models) that wouldn't fit on a normal GPU desk - because it has one enormous shared desk (128GB of memory)!
>
> **In AI terms:** The CPU and GPU share the same physical memory pool, eliminating the PCIe transfer bottleneck that limits traditional systems.

---

## ⚠️ Important: Run This Notebook in NGC Container

This notebook **must** be run inside an NGC PyTorch container. If you're not already in one, start it with:

```bash
docker run --gpus all -it --rm \
    -v $HOME/workspace:/workspace \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    -p 8888:8888 \
    --ipc=host \
    nvcr.io/nvidia/pytorch:25.11-py3 \
    jupyter lab --ip=0.0.0.0 --allow-root --no-browser
```

> **Note:** The `-p 8888:8888` flag is only needed when running Jupyter Lab inside the container.
> For interactive bash sessions, you can omit the port mapping.

---

## Part 1: Verify PyTorch CUDA Setup

### Concept Explanation

First, let's confirm we have GPU access through PyTorch. This is the most critical check for any deep learning work.

In [1]:
import torch
import gc
import time

# Check CUDA availability
print("=" * 60)
print("PyTorch CUDA Configuration")
print("=" * 60)
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"cuDNN version: {torch.backends.cudnn.version()}")
    print(f"Device count: {torch.cuda.device_count()}")
    print(f"Current device: {torch.cuda.current_device()}")
    print(f"Device name: {torch.cuda.get_device_name(0)}")
    
    # Memory info
    props = torch.cuda.get_device_properties(0)
    print(f"\nGPU Memory: {props.total_memory / 1e9:.1f} GB")
    print(f"Compute capability: {props.major}.{props.minor}")
else:
    print("\n❌ CUDA is not available!")
    print("Make sure you're running inside an NGC container with --gpus all")

PyTorch CUDA Configuration
PyTorch version: 2.9.0+cu130
CUDA available: True
CUDA version: 13.0
cuDNN version: 91300
Device count: 1
Current device: 0
Device name: NVIDIA GB10

GPU Memory: 128.5 GB
Compute capability: 12.1


    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)
    


In [2]:
# Quick test: Create a tensor on GPU
if torch.cuda.is_available():
    x = torch.randn(1000, 1000, device='cuda')
    print(f"✅ Successfully created tensor on GPU")
    print(f"   Shape: {x.shape}")
    print(f"   Device: {x.device}")
    print(f"   Memory used: {x.element_size() * x.nelement() / 1e6:.2f} MB")
    del x
    torch.cuda.empty_cache()

✅ Successfully created tensor on GPU
   Shape: torch.Size([1000, 1000])
   Device: cuda:0
   Memory used: 4.00 MB


---

## Part 2: Memory Monitoring Utilities

### Concept Explanation

PyTorch provides detailed memory tracking. Let's create helper functions to monitor memory usage throughout our experiments.

In [3]:
def get_memory_stats():
    """
    Get current GPU memory statistics.
    
    Returns:
        dict: Memory statistics in GB
    """
    if not torch.cuda.is_available():
        return None
    
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    max_allocated = torch.cuda.max_memory_allocated() / 1e9
    max_reserved = torch.cuda.max_memory_reserved() / 1e9
    total = torch.cuda.get_device_properties(0).total_memory / 1e9
    
    return {
        'allocated_gb': allocated,
        'reserved_gb': reserved,
        'max_allocated_gb': max_allocated,
        'max_reserved_gb': max_reserved,
        'total_gb': total,
        'free_gb': total - reserved
    }

def print_memory_stats(label="Current"):
    """
    Print formatted memory statistics.
    """
    stats = get_memory_stats()
    if stats:
        print(f"\n📊 {label} Memory Status:")
        print(f"   Allocated: {stats['allocated_gb']:.2f} GB")
        print(f"   Reserved:  {stats['reserved_gb']:.2f} GB")
        print(f"   Free:      {stats['free_gb']:.2f} GB")
        print(f"   Total:     {stats['total_gb']:.1f} GB")
        
        # Visual bar
        used_pct = stats['reserved_gb'] / stats['total_gb'] * 100
        bar_len = 40
        filled = int(bar_len * used_pct / 100)
        bar = '█' * filled + '░' * (bar_len - filled)
        print(f"   [{bar}] {used_pct:.1f}%")

def clear_memory():
    """
    Clear GPU memory cache.
    """
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
    print("🧹 Memory cleared")

# Test our functions
print_memory_stats("Initial")


📊 Initial Memory Status:
   Allocated: 0.00 GB
   Reserved:  0.00 GB
   Free:      128.52 GB
   Total:     128.5 GB
   [░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 0.0%


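### 🔍 Understanding Memory Types

- **Allocated**: Memory actually used by tensors
- **Reserved**: Memory held by PyTorch's caching allocator (allocated memory plus cached free blocks)
- **Free**: Computed here as total minus reserved; because the pool is shared with the CPU, the OS and buffer cache can reduce what is actually available
- **Total**: Total memory visible to the GPU (128GB unified memory on DGX Spark)

PyTorch can reserve more memory than tensors currently use so that future allocations reuse cached blocks instead of requesting new memory.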

---

## Part 3: Small Tensor Allocations (1-10 GB)

### Concept Explanation

Let's start with allocations that would fit on any modern GPU. We'll create tensors of increasing size and observe memory behavior.

In [4]:
# Clear any existing allocations
clear_memory()

def allocate_tensor_gb(size_gb: float, dtype=torch.float32):
    """
    Allocate a tensor of approximately the specified size in GB.
    
    Args:
        size_gb: Desired size in gigabytes
        dtype: Data type (default: float32 = 4 bytes per element)
    
    Returns:
        tuple: (tensor, actual_size_gb, time_taken)
    """
    bytes_per_element = torch.tensor([], dtype=dtype).element_size()
    num_elements = int(size_gb * 1e9 / bytes_per_element)
    
    # Use a 1D tensor for simplicity
    start_time = time.time()
    tensor = torch.empty(num_elements, dtype=dtype, device='cuda')
    torch.cuda.synchronize()  # Wait for allocation to complete
    elapsed = time.time() - start_time
    
    actual_size = tensor.element_size() * tensor.nelement() / 1e9
    
    return tensor, actual_size, elapsed

# Test with 1GB
print("Testing 1GB allocation...")
tensor_1gb, size, alloc_time = allocate_tensor_gb(1.0)
print(f"✅ Allocated {size:.2f} GB in {alloc_time*1000:.1f} ms")
print_memory_stats("After 1GB")

# Clean up
del tensor_1gb

🧹 Memory cleared
Testing 1GB allocation...
✅ Allocated 1.00 GB in 35.8 ms

📊 After 1GB Memory Status:
   Allocated: 1.00 GB
   Reserved:  1.00 GB
   Free:      127.52 GB
   Total:     128.5 GB
   [░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 0.8%


In [5]:
# Now let's test a range of sizes
clear_memory()

test_sizes = [1, 2, 4, 8, 10]  # GB
results = []

print("\n" + "=" * 60)
print("Small Tensor Allocation Test (1-10 GB)")
print("=" * 60)
print(f"{'Size (GB)':<12} {'Alloc Time (ms)':<18} {'Memory Used (GB)':<18}")
print("-" * 48)

for size_gb in test_sizes:
    clear_memory()
    
    try:
        tensor, actual_size, alloc_time = allocate_tensor_gb(size_gb)
        stats = get_memory_stats()
        
        results.append({
            'requested_gb': size_gb,
            'actual_gb': actual_size,
            'alloc_ms': alloc_time * 1000,
            'reserved_gb': stats['reserved_gb']
        })
        
        print(f"{size_gb:<12} {alloc_time*1000:<18.2f} {stats['reserved_gb']:<18.2f}")
        
        del tensor
        
    except RuntimeError as e:
        print(f"{size_gb:<12} FAILED: {str(e)[:40]}")

clear_memory()
print("\n✅ All small allocations successful!")

🧹 Memory cleared

Small Tensor Allocation Test (1-10 GB)
Size (GB)    Alloc Time (ms)    Memory Used (GB)
------------------------------------------------
🧹 Memory cleared
1            35.67              1.00
🧹 Memory cleared
2            78.09              2.00
🧹 Memory cleared
4            148.63             4.00
🧹 Memory cleared
8            291.55             8.00
🧹 Memory cleared
10           344.19             10.00
🧹 Memory cleared

✅ All small allocations successful!


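### 🔍 What Just Happened?

Notice that:
1. **Allocation is fast** - `torch.empty` only reserves memory; no data is transferred or initialized
2. **Reserved ≈ Allocated** - For a single large tensor the two match; PyTorch's caching shows up once many tensors are created and freed
3. **Linear scaling** - Both memory usage and allocation time scale linearly with tensor size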
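
As a rough check on the scaling, the timings in the table above can be reduced to a per-gigabyte cost. The sketch below hard-codes the five measurements from this recorded run (your numbers will differ) and needs no GPU; `recorded`, `ms_per_gb`, and `spread` are names introduced here for illustration.

```python
# Timings copied from the recorded run above: size in GB -> allocation time in ms
recorded = {1: 35.67, 2: 78.09, 4: 148.63, 8: 291.55, 10: 344.19}

# Per-GB allocation cost for each measurement
ms_per_gb = {size: ms / size for size, ms in recorded.items()}

for size, rate in ms_per_gb.items():
    print(f"{size:>3} GB: {rate:5.1f} ms/GB")

# Ratio between the slowest and fastest per-GB cost; close to 1 means linear
spread = max(ms_per_gb.values()) / min(ms_per_gb.values())
print(f"Slowest/fastest ratio: {spread:.2f}")
```

In this run the cost stays between roughly 34 and 39 ms per GB, which is what linear scaling looks like in practice.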

---

## Part 4: Medium Tensor Allocations (20-50 GB)

### Concept Explanation

Now we approach and then pass the limit of consumer GPUs; an RTX 4090, for example, has 24GB. Let's allocate 20, 30, 40, and 50 GB tensors.

In [6]:
clear_memory()

medium_sizes = [20, 30, 40, 50]  # GB

print("\n" + "=" * 60)
print("Medium Tensor Allocation Test (20-50 GB)")
print("=" * 60)
print("⚠️  These sizes would FAIL on most consumer GPUs!")
print(f"{'Size (GB)':<12} {'Alloc Time (ms)':<18} {'Memory Used (GB)':<18}")
print("-" * 48)

for size_gb in medium_sizes:
    clear_memory()
    
    try:
        tensor, actual_size, alloc_time = allocate_tensor_gb(size_gb)
        stats = get_memory_stats()
        
        print(f"{size_gb:<12} {alloc_time*1000:<18.2f} {stats['reserved_gb']:<18.2f}")
        
        del tensor
        
    except RuntimeError as e:
        print(f"{size_gb:<12} ❌ FAILED: Out of memory")
        print(f"   Hint: Clear buffer cache and try again")

clear_memory()
print("\n🎉 DGX Spark handled these like a champ!")

🧹 Memory cleared

Medium Tensor Allocation Test (20-50 GB)
⚠️  These sizes would FAIL on most consumer GPUs!
Size (GB)    Alloc Time (ms)    Memory Used (GB)
------------------------------------------------
🧹 Memory cleared
20           672.04             20.00
🧹 Memory cleared
30           978.98             30.00
🧹 Memory cleared
40           1287.50            40.00
🧹 Memory cleared
50           1544.87            50.00
🧹 Memory cleared

🎉 DGX Spark handled these like a champ!


### ✋ Try It Yourself #1

What's the largest single tensor you can allocate? Try sizes between 60-100 GB.

<details>
<summary>💡 Hint</summary>
You should be able to allocate up to ~100GB, but the exact limit depends on:
- Current buffer cache usage
- Other running processes
- PyTorch's memory overhead
</details>

In [None]:
# YOUR CODE HERE: Try allocating larger tensors
# What's the maximum size you can allocate?

clear_memory()

# Try these sizes:
# large_sizes = [60, 70, 80, 90, 100]


---

## Part 5: Large Tensor Allocations (60-80 GB)

### Concept Explanation

This is where the DGX Spark truly shines. A single 70 GB tensor is the size of a 70B-parameter model's weights at 8-bit precision!

In [7]:
clear_memory()

# Before large allocations, let's check system memory
import subprocess
from typing import Dict, Optional

def get_system_memory() -> Optional[Dict[str, int]]:
    """
    Get system memory info.
    
    Returns:
        Dictionary with memory info (total, used, free, available in GB),
        or None if unable to retrieve.
    """
    try:
        result = subprocess.run(['free', '-g'], capture_output=True, text=True, timeout=10)
        lines = result.stdout.strip().split('\n')
        if len(lines) >= 2:
            parts = lines[1].split()
            return {
                'total': int(parts[1]),
                'used': int(parts[2]),
                'free': int(parts[3]),
                'available': int(parts[6]) if len(parts) > 6 else int(parts[3])
            }
    except subprocess.TimeoutExpired:
        print("⚠️ Command timed out")
    except Exception as e:
        print(f"⚠️ Error getting system memory: {e}")
    return None

sys_mem = get_system_memory()
if sys_mem:
    print("System Memory Status:")
    print(f"  Total:     {sys_mem['total']} GB")
    print(f"  Available: {sys_mem['available']} GB")
    print(f"  Used:      {sys_mem['used']} GB")
else:
    print("⚠️ Could not retrieve system memory info")

🧹 Memory cleared
System Memory Status:
  Total:     119 GB
  Available: 106 GB
  Used:      12 GB
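
PyTorch reported 128.5 GB in Part 1, while `free -g` shows a total of 119 GB. A likely explanation, assuming both tools describe the same physical pool, is units: the helpers in this lab divide by 1e9 (decimal gigabytes), whereas `free -g` counts in GiB (2^30 bytes) and rounds down. The sketch below does the conversion; `gb_to_gib` is a helper introduced here, not part of the lab's code.

```python
GIB = 2**30  # bytes in one gibibyte, the unit used by `free -g`

def gb_to_gib(decimal_gb: float) -> float:
    """Convert decimal gigabytes (1e9 bytes) to gibibytes (2**30 bytes)."""
    return decimal_gb * 1e9 / GIB

pytorch_total_gb = 128.5  # value printed by the Part 1 cell
total_gib = gb_to_gib(pytorch_total_gb)
print(f"{pytorch_total_gb} GB = {total_gib:.1f} GiB (shown as {int(total_gib)} by free -g)")
```

Keep the unit in mind whenever you compare PyTorch's numbers with operating-system tools.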


In [8]:
# Large allocations
clear_memory()

large_sizes = [60, 70, 80]  # GB - Conservative to ensure success

print("\n" + "=" * 60)
print("Large Tensor Allocation Test (60-80 GB)")
print("=" * 60)
print("🚀 This is IMPOSSIBLE on most GPUs!")
print(f"{'Size (GB)':<12} {'Status':<15} {'Time (s)':<12} {'Memory (GB)':<15}")
print("-" * 54)

for size_gb in large_sizes:
    clear_memory()
    
    try:
        tensor, actual_size, alloc_time = allocate_tensor_gb(size_gb)
        stats = get_memory_stats()
        
        print(f"{size_gb:<12} {'✅ SUCCESS':<15} {alloc_time:<12.3f} {stats['reserved_gb']:<15.2f}")
        
        del tensor
        
    except RuntimeError as e:
        print(f"{size_gb:<12} {'❌ FAILED':<15} {'N/A':<12} {'N/A':<15}")

clear_memory()

🧹 Memory cleared

Large Tensor Allocation Test (60-80 GB)
🚀 This is IMPOSSIBLE on most GPUs!
Size (GB)    Status          Time (s)     Memory (GB)
------------------------------------------------------
🧹 Memory cleared
60           ✅ SUCCESS       3.108        60.00
🧹 Memory cleared
70           ✅ SUCCESS       2.626        70.00
🧹 Memory cleared
80           ✅ SUCCESS       2.778        80.00
🧹 Memory cleared


### 🔍 What This Means for AI Models

**DGX Spark v2.0 Model Capacity Matrix:**

| Scenario | Maximum Model Size | Memory Usage | Notes |
|----------|-------------------|--------------|-------|
| Full Fine-Tuning (FP16) | 12-16B | ~100-128GB | With gradient checkpointing |
| QLoRA Fine-Tuning | 100-120B | ~50-70GB | 4-bit quantized + adapters |
| FP16 Inference | 50-55B | ~110-120GB | Including KV cache headroom |
| FP8 Inference | 90-100B | ~90-100GB | Native Blackwell support |
| NVFP4 Inference | ~200B | ~100GB | Blackwell exclusive format |

**Standard Memory Requirements:**

| Model Size | FP16 | INT8 | INT4/NVFP4 |
|------------|------|------|------------|
| 7B params  | ~14 GB | ~7 GB | ~3.5 GB |
| 13B params | ~26 GB | ~13 GB | ~6.5 GB |
| 70B params | ~140 GB | ~70 GB | ~35 GB |

With 128GB of unified memory, **a 70B model fits comfortably at INT4/NVFP4 (~35 GB)** and still fits at INT8 (~70 GB); only FP16 (~140 GB) exceeds the available memory.

> **Note:** NVFP4 is Blackwell's native 4-bit floating-point format, which generally preserves accuracy better than standard INT4 quantization.
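
The "Standard Memory Requirements" table follows from one multiplication: parameters times bits per parameter. The sketch below reproduces it and checks each entry against a nominal 128 GB budget. It counts weights only (no KV cache, activations, or framework overhead), and `weights_gb` and `TOTAL_GB` are names introduced here for illustration.

```python
def weights_gb(params: float, bits_per_param: float) -> float:
    """Memory for model weights only, in decimal GB (1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9

TOTAL_GB = 128  # nominal unified memory on DGX Spark

for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
        need = weights_gb(params, bits)
        verdict = "fits" if need < TOTAL_GB else "too large"
        print(f"{name:>3} {label:<5} {need:6.1f} GB  {verdict}")
```

Treat "fits" as a necessary condition only: real workloads also need room for the KV cache, activations, and the operating system.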

---

## Part 6: Buffer Cache Impact

### Concept Explanation

Linux aggressively caches disk data in RAM to speed up file access. On unified memory systems, this cache competes with GPU allocations.

> **ELI5:** Imagine your shared desk has papers scattered around from previous work. Before doing a big new project, you need to clean up those old papers to make room!

In [9]:
# Check current buffer cache usage
def check_buffer_cache() -> int:
    """
    Check Linux buffer cache usage.
    
    Returns:
        Buffer cache size in GB, or 0 if unable to determine.
    """
    try:
        result = subprocess.run(
            ['free', '-g'], 
            capture_output=True, 
            text=True,
            timeout=10
        )
        lines = result.stdout.strip().split('\n')
        if len(lines) >= 2:
            parts = lines[1].split()
            # buff/cache is typically column 6
            if len(parts) >= 6:
                return int(parts[5])  # buff/cache column
    except subprocess.TimeoutExpired:
        print("⚠️ Command timed out")
    except Exception as e:
        print(f"⚠️ Error checking buffer cache: {e}")
    return 0

cache_size = check_buffer_cache()
print(f"Current buffer cache: {cache_size} GB")

if cache_size > 10:
    print(f"\n⚠️  Buffer cache is using {cache_size} GB!")
    print("   This may limit GPU memory for large models.")
    print("   Clear with: sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'")
else:
    print("\n✅ Buffer cache is minimal - good for large allocations!")

Current buffer cache: 19 GB

⚠️  Buffer cache is using 19 GB!
   This may limit GPU memory for large models.
   Clear with: sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
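
`free -g` rounds to whole gigabytes, which hides small changes. If you want finer resolution, one option is to read `/proc/meminfo` directly. The following is a minimal sketch, assuming a Linux system that exposes the usual `MemTotal`, `MemAvailable`, `Buffers`, and `Cached` fields; `parse_meminfo` and `cache_summary_gib` are helper names introduced here.

```python
from pathlib import Path

def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value in kiB}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info

def cache_summary_gib(info: dict) -> dict:
    """Return the cache-related fields in GiB (1 GiB = 1024**2 kiB)."""
    kib_per_gib = 1024 ** 2
    keys = ("MemTotal", "MemAvailable", "Buffers", "Cached")
    return {k: info[k] / kib_per_gib for k in keys if k in info}

meminfo_path = Path("/proc/meminfo")
if meminfo_path.exists():  # only present on Linux
    info = parse_meminfo(meminfo_path.read_text())
    for field, gib in cache_summary_gib(info).items():
        print(f"{field:<13} {gib:7.2f} GiB")
```

Separating the parser from the file read keeps the function easy to test with a sample string.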


In [10]:
# Function to clear buffer cache (requires sudo)
def clear_buffer_cache() -> bool:
    """
    Clear Linux buffer cache.
    
    NOTE: Requires sudo access.
    
    Returns:
        True if successful, False otherwise.
    """
    before = check_buffer_cache()
    
    try:
        # This command requires sudo
        result = subprocess.run(
            ["sudo", "sh", "-c", "sync; echo 3 > /proc/sys/vm/drop_caches"],
            capture_output=True,
            text=True,
            timeout=30
        )
        
        if result.returncode == 0:
            after = check_buffer_cache()
            print(f"✅ Buffer cache cleared: {before} GB → {after} GB")
            print(f"   Freed approximately {before - after} GB")
            return True
        else:
            print("❌ Failed to clear cache (sudo required)")
            print("   Run manually: sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'")
            return False
    except subprocess.TimeoutExpired:
        print("❌ Command timed out")
        return False
    except Exception as e:
        print(f"❌ Error clearing cache: {e}")
        return False

# Uncomment to clear cache:
# clear_buffer_cache()

---

## Part 7: Multiple Tensors and Memory Fragmentation

### Concept Explanation

In real AI workloads, you don't have one giant tensor - you have many tensors for model weights, activations, gradients, optimizer states, etc. Let's simulate this.

In [11]:
clear_memory()

# Simulate a typical model memory layout
print("\n" + "=" * 60)
print("Simulating Real Model Memory Usage")
print("=" * 60)

# Model components (simulated 13B parameter model)
components = {
    'embeddings': 2.0,      # GB - embedding layers
    'attention': 8.0,       # GB - attention weights
    'feedforward': 10.0,    # GB - FFN layers
    'layer_norms': 0.5,     # GB - normalization
    'output_head': 1.0,     # GB - output projection
    'activations': 4.0,     # GB - intermediate activations
}

total_expected = sum(components.values())
print(f"\nExpected total: {total_expected:.1f} GB")
print("-" * 40)

tensors = {}
for name, size_gb in components.items():
    tensor, actual, alloc_time = allocate_tensor_gb(size_gb)
    tensors[name] = tensor
    print(f"✅ {name:15} : {actual:.2f} GB allocated")

print_memory_stats("After all components")

# Verify total
stats = get_memory_stats()
print(f"\n📊 Summary:")
print(f"   Expected: {total_expected:.1f} GB")
print(f"   Allocated: {stats['allocated_gb']:.2f} GB")
print(f"   Reserved: {stats['reserved_gb']:.2f} GB")
print(f"   Overhead: {(stats['reserved_gb'] - total_expected):.2f} GB")

🧹 Memory cleared

Simulating Real Model Memory Usage

Expected total: 25.5 GB
----------------------------------------
✅ embeddings      : 2.00 GB allocated
✅ attention       : 8.00 GB allocated
✅ feedforward     : 10.00 GB allocated
✅ layer_norms     : 0.50 GB allocated
✅ output_head     : 1.00 GB allocated
✅ activations     : 4.00 GB allocated

📊 After all components Memory Status:
   Allocated: 25.50 GB
   Reserved:  25.51 GB
   Free:      103.02 GB
   Total:     128.5 GB
   [███████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 19.8%

📊 Summary:
   Expected: 25.5 GB
   Allocated: 25.50 GB
   Reserved: 25.51 GB
   Overhead: 0.01 GB
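
The 0.01 GB of overhead is consistent with block-size rounding in PyTorch's caching allocator. The sketch below assumes that requests of this size are rounded up to a multiple of 2 MiB. That granularity is an implementation detail, treated here as an assumption rather than a documented guarantee, and `round_up` is a helper introduced for illustration.

```python
ROUND_BYTES = 2 * 1024**2  # ASSUMED rounding granularity for large blocks (2 MiB)

def round_up(nbytes: int, multiple: int = ROUND_BYTES) -> int:
    """Round nbytes up to the next multiple of `multiple`."""
    return -(-nbytes // multiple) * multiple

# Requested sizes from the simulated model above, in decimal GB
components_gb = [2.0, 8.0, 10.0, 0.5, 1.0, 4.0]

requested = sum(int(gb * 1e9) for gb in components_gb)
reserved = sum(round_up(int(gb * 1e9)) for gb in components_gb)

print(f"Requested:            {requested / 1e9:.2f} GB")
print(f"Reserved (estimated): {reserved / 1e9:.2f} GB")
```

Under that assumption the estimate comes to 25.51 GB, which matches the Reserved figure in the output above.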


In [12]:
# Clean up the simulated model
for name in list(tensors.keys()):
    del tensors[name]

clear_memory()
print_memory_stats("After cleanup")

🧹 Memory cleared

📊 After cleanup Memory Status:
   Allocated: 4.00 GB
   Reserved:  4.00 GB
   Free:      124.52 GB
   Total:     128.5 GB
   [█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 3.1%

> **Note:** 4 GB is still allocated because the loop variable `tensor` from the previous cell still references the last tensor created (`activations`). Run `del tensor` followed by `clear_memory()` to release it.


---

## Part 8: Different Data Types

### Concept Explanation

Different data types use different amounts of memory per element:

| Type | Bytes | Use Case |
|------|-------|----------|
| float32 | 4 | Training (traditional) |
| float16 | 2 | Inference, mixed precision |
| bfloat16 | 2 | Training (modern, recommended for Blackwell) |
| int8 | 1 | Quantized inference |
| int4/NVFP4 | 0.5 | Aggressive quantization (Blackwell native) |

**DGX Spark's Blackwell GPU has native support for bfloat16, FP8, and NVFP4!**

> **Tip:** Use `bfloat16` as the default dtype for training and inference on DGX Spark.
> NVFP4 (Blackwell's native 4-bit format) generally preserves accuracy better than standard INT4.

In [13]:
clear_memory()

# Compare memory usage for different dtypes
target_elements = 1_000_000_000  # 1 billion elements

dtypes = [
    ('float32', torch.float32),
    ('float16', torch.float16),
    ('bfloat16', torch.bfloat16),
    ('int8', torch.int8),
]

print("\n" + "=" * 60)
print(f"Memory Usage for {target_elements/1e9:.0f}B Elements")
print("=" * 60)
print(f"{'Data Type':<12} {'Bytes/Elem':<12} {'Memory (GB)':<15} {'Time (ms)':<12}")
print("-" * 51)

for name, dtype in dtypes:
    clear_memory()
    
    start = time.time()
    tensor = torch.empty(target_elements, dtype=dtype, device='cuda')
    torch.cuda.synchronize()
    elapsed = (time.time() - start) * 1000
    
    size_gb = tensor.element_size() * tensor.nelement() / 1e9
    bytes_per = tensor.element_size()
    
    print(f"{name:<12} {bytes_per:<12} {size_gb:<15.2f} {elapsed:<12.1f}")
    
    del tensor

clear_memory()
print("\n💡 Tip: Use bfloat16 as default on DGX Spark (native Blackwell support)")

🧹 Memory cleared

Memory Usage for 1B Elements
Data Type    Bytes/Elem   Memory (GB)     Time (ms)
---------------------------------------------------
🧹 Memory cleared
float32      4            4.00            144.9
🧹 Memory cleared
float16      2            2.00            71.6
🧹 Memory cleared
bfloat16     2            2.00            71.6
🧹 Memory cleared
int8         1            1.00            35.6
🧹 Memory cleared

💡 Tip: Use bfloat16 as default on DGX Spark (native Blackwell support)


### ✋ Try It Yourself #2

Calculate how much memory a 70B parameter model needs in different precisions.

<details>
<summary>💡 Hint</summary>

- 70B parameters = 70,000,000,000 elements
- float32: 70B × 4 bytes = 280 GB (doesn't fit!)
- float16: 70B × 2 bytes = 140 GB (still exceeds 128 GB, doesn't fit)
- int4: 70B × 0.5 bytes = 35 GB (fits easily!)

</details>

In [None]:
# YOUR CODE HERE: Calculate memory requirements for 70B model

params_70b = 70_000_000_000  # 70 billion parameters

# Calculate memory for each precision:
# memory_fp32 = ...
# memory_fp16 = ...
# memory_int4 = ...


---

## ⚠️ Common Mistakes

### Mistake 1: Not clearing memory between experiments

```python
# ❌ Wrong way - Memory accumulates
tensor1 = torch.randn(10000, 10000, device='cuda')
tensor2 = torch.randn(10000, 10000, device='cuda')  # Uses MORE memory

# ✅ Right way - Clean up first
tensor1 = torch.randn(10000, 10000, device='cuda')
del tensor1
torch.cuda.empty_cache()
tensor2 = torch.randn(10000, 10000, device='cuda')  # Reuses memory
```

### Mistake 2: Forgetting about buffer cache for large models

```python
# ❌ Wrong way - May fail with OOM on 70B model
model = load_model("70b-model")

# ✅ Right way - Clear cache first
!sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
model = load_model("70b-model")
```

### Mistake 3: Using float32 when you don't need it

```python
# ❌ Wrong way - Wastes memory
model = model.to(device='cuda', dtype=torch.float32)

# ✅ Right way - Use bfloat16 on Blackwell
model = model.to(device='cuda', dtype=torch.bfloat16)
```

---

## 🎉 Checkpoint

You've learned:
- ✅ How unified memory enables allocations impossible on most other GPUs
- ✅ How to monitor GPU memory with PyTorch
- ✅ The impact of buffer cache on available memory
- ✅ How different data types affect memory usage
- ✅ Why DGX Spark can run quantized 70B parameter models

---

## 🚀 Challenge (Optional)

Create a "memory stress test" that:
1. Allocates tensors in a loop until allocation fails
2. Records the maximum memory achieved
3. Cleans up gracefully

<details>
<summary>💡 Solution Hint</summary>

```python
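def memory_stress_test(increment_gb=5):
    tensors = []
    total_gb = 0
    
    try:
        while True:
            t, size, _ = allocate_tensor_gb(increment_gb)
            tensors.append(t)
            total_gb += size
            print(f"Allocated: {total_gb:.1f} GB")
    except RuntimeError:
        # CUDA out-of-memory errors are raised as a RuntimeError subclass
        print(f"Max allocation: {total_gb:.1f} GB")
    finally:
        # `del t` in a loop would only unbind the loop variable;
        # drop every reference so the memory can actually be freed
        t = None
        tensors.clear()
        clear_memory()
    return total_gb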
```
</details>

In [None]:
# YOUR CHALLENGE CODE HERE


---

## 📖 Further Reading

- [NVIDIA Unified Memory Architecture](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/)
- [PyTorch CUDA Memory Management](https://pytorch.org/docs/stable/notes/cuda.html)
- [Blackwell Architecture Whitepaper](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/)

---

## 🧹 Cleanup

In [14]:
# Final cleanup
clear_memory()
print_memory_stats("Final")

print("\n" + "=" * 60)
print("🎉 Great job completing Lab 1.1.2: Memory Architecture Lab!")
print("=" * 60)
print("\nNext up: Lab 1.1.3 - NGC Container Setup")
print("You'll learn to configure Docker containers for AI development.")

🧹 Memory cleared

📊 Final Memory Status:
   Allocated: 0.00 GB
   Reserved:  0.00 GB
   Free:      128.52 GB
   Total:     128.5 GB
   [░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 0.0%

🎉 Great job completing Lab 1.1.2: Memory Architecture Lab!

Next up: Lab 1.1.3 - NGC Container Setup
You'll learn to configure Docker containers for AI development.
