# Lab 1.1.3: NGC Container Setup

**Module:** 1.1 - DGX Spark Platform Mastery  
**Time:** 1.5 hours  
**Difficulty:** ⭐⭐ (Intermediate)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand why NGC containers are essential for DGX Spark
- [ ] Pull and configure the PyTorch NGC container
- [ ] Create a docker-compose.yml for easy development
- [ ] Verify GPU access inside containers

---

## 📚 Prerequisites

- Completed: Lab 1.1.1 (System Exploration)
- Docker installed and running
- Internet access for container pulls

---

## 🌍 Real-World Context

Here's a frustrating scenario every AI developer has faced:

1. Install PyTorch with `pip install torch`
2. Run `torch.cuda.is_available()` → returns `False`
3. Spend hours debugging driver issues, CUDA versions, wheel compatibility...

On DGX Spark, this is even worse because **standard x86 pip wheels don't work at all**: they're built for x86_64, and DGX Spark has an ARM64 CPU.

**NGC containers solve this completely.** They're pre-built, tested, and optimized by NVIDIA. Just pull and run!

---

## 🧒 ELI5: What are NGC Containers?

> **Imagine you want to bake a really complicated cake...**
>
> You could buy all the ingredients separately, find the right recipe, measure everything perfectly, and hope it works. Or... you could get a "cake kit" where everything is pre-measured and packaged together!
>
> **NGC containers are like cake kits for AI.** NVIDIA has already:
> - Compiled PyTorch correctly for your hardware
> - Set up CUDA and cuDNN perfectly
> - Tested everything to make sure it works
> - Optimized it for maximum speed
>
> You just "open the box" (pull the container) and start cooking (coding)!
>
> **In AI terms:** NGC containers are Docker images pre-configured with AI frameworks, drivers, and optimizations for NVIDIA hardware.

---

## Part 1: Why NGC Containers are Required

### The ARM64 + CUDA Challenge

DGX Spark uses:
- **ARM64 CPU** (not x86_64 like most computers)
- **CUDA 13+** (cutting edge)
- **Blackwell GPU** (brand new architecture)

Standard PyPI wheels are compiled for `x86_64 + CUDA 11/12`. They simply **cannot run** on DGX Spark.
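
To make the mismatch concrete, here is a minimal sketch of how a wheel's target platform can be read from its filename. It assumes the standard wheel naming layout, where the last dash-separated field is the platform tag. The filenames are hypothetical examples and `wheel_fits_machine` is an illustrative helper, not pip's actual resolver.

```python
def wheel_fits_machine(wheel_filename: str, machine: str) -> bool:
    """Return True if the wheel's platform tag targets `machine` (or any platform)."""
    # Wheel names end with ...-{python tag}-{abi tag}-{platform tag}.whl
    stem = wheel_filename.removesuffix(".whl")
    platform_tag = stem.split("-")[-1]
    # A platform field may bundle several tags separated by dots.
    tags = platform_tag.split(".")
    return any(tag == "any" or tag.endswith("_" + machine) for tag in tags)


# Hypothetical filenames, for illustration only.
x86_wheel = "example_pkg-1.0-cp312-cp312-manylinux_2_28_x86_64.whl"
arm_wheel = "example_pkg-1.0-cp312-cp312-manylinux_2_28_aarch64.whl"

print(wheel_fits_machine(x86_wheel, "aarch64"))  # False
print(wheel_fits_machine(arm_wheel, "aarch64"))  # True
```

A matching platform tag is only the first hurdle: the wheel must also be built against a CUDA version and GPU architecture that the machine supports, which is what the NGC images take care of.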

Let's demonstrate this problem:

In [1]:
# Check our architecture
import platform

arch = platform.machine()
system = platform.system()

print(f"Architecture: {arch}")
print(f"System: {system}")

if arch == 'aarch64':
    print("\n⚠️  You're on ARM64 - standard pip PyTorch won't work!")
    print("   You MUST use NGC containers for GPU support.")
else:
    print("\n✅ You're on x86_64 - but NGC containers are still recommended.")

Architecture: aarch64
System: Linux

⚠️  You're on ARM64 - standard pip PyTorch won't work!
   You MUST use NGC containers for GPU support.


In [2]:
# Check Docker status
!docker --version

Docker version 28.5.1, build e180ab8


In [3]:
# Check NVIDIA container toolkit
!docker info 2>/dev/null | grep -i nvidia
print("\n---")
!nvidia-container-cli --version 2>/dev/null || echo "nvidia-container-cli not directly accessible"

  cdi: nvidia.com/gpu=0
  cdi: nvidia.com/gpu=GPU-3d29cd16-ca97-2c38-7d79-d462cfa45fed
  cdi: nvidia.com/gpu=all
 Runtimes: runc io.containerd.runc.v2 nvidia
 Kernel Version: 6.14.0-1015-nvidia

---
cli-version: 1.18.1
lib-version: 1.18.1
build date: 2025-11-24T14:47+00:00
build revision: 889a3bb5408c195ed7897ba2cb8341c7d249672f
build compiler: aarch64-linux-gnu-gcc-7 7.5.0
build platform: aarch64


---

## Part 2: Understanding NGC Container Catalog

### Available Containers

The NGC Catalog (https://catalog.ngc.nvidia.com/) offers many pre-built containers:

| Container | Use Case | Image |
|-----------|----------|-------|
| PyTorch | Deep learning | `nvcr.io/nvidia/pytorch:25.11-py3` |
| TensorFlow | Deep learning | `nvcr.io/nvidia/tensorflow:25.11-tf2-py3` |
| Triton | Inference server | `nvcr.io/nvidia/tritonserver:25.11-py3` |
| NeMo | NLP/ASR | `nvcr.io/nvidia/nemo:25.11` |
| RAPIDS | Data science | `nvcr.io/nvidia/rapidsai/base:25.11-cuda13.0-py3.11` |

### Version Naming

NGC uses `YY.MM` versioning:
- `25.11` = November 2025 release
- `25.06` = June 2025 release

**Always use the latest compatible version for best performance!**
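
The `YY.MM` scheme also makes tags easy to compare in code. The sketch below is a minimal example, assuming every tag starts with `YY.MM` as described above; `parse_ngc_tag` and `newest_tag` are hypothetical helper names, not part of any NGC tooling.

```python
import re
from typing import Iterable, Tuple


def parse_ngc_tag(tag: str) -> Tuple[int, int]:
    """Extract (year, month) from an NGC tag such as '25.11-py3'."""
    match = re.match(r"^(\d{2})\.(\d{2})", tag)
    if match is None:
        raise ValueError(f"Tag does not start with YY.MM: {tag!r}")
    year = 2000 + int(match.group(1))
    month = int(match.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"Invalid month in tag: {tag!r}")
    return year, month


def newest_tag(tags: Iterable[str]) -> str:
    """Return the tag with the most recent (year, month) release."""
    return max(tags, key=parse_ngc_tag)


print(parse_ngc_tag("25.11-py3"))                            # (2025, 11)
print(newest_tag(["25.06-py3", "25.11-py3", "24.12-py3"]))   # 25.11-py3
```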

In [4]:
# List currently downloaded NGC containers
print("Currently downloaded NVIDIA containers:")
print("=" * 50)
!docker images | grep -E "nvcr.io|REPOSITORY" | head -20

Currently downloaded NVIDIA containers:
REPOSITORY                                TAG                        IMAGE ID       CREATED         SIZE
nvcr.io/nvidia/cuda                       13.0.1-devel-ubuntu24.04   d1f3dc428c53   3 months ago    6.59GB


---

## Part 3: Pulling the PyTorch Container

### Concept Explanation

The PyTorch NGC container is the most commonly used for AI development. It includes:
- PyTorch (latest version, compiled for your GPU)
- CUDA and cuDNN
- TensorRT for inference optimization
- Common libraries (numpy, pandas, etc.)
- Jupyter Lab

**Note:** Pulling this image downloads roughly 20 GB or more the first time!

In [5]:
# =============================================================================
# NGC CONTAINER VERSION CONFIGURATION
# =============================================================================
# Update this variable when newer NGC containers become available.
# Check https://catalog.ngc.nvidia.com for the latest versions.
# Version format: YY.MM (e.g., 25.11 = November 2025 release)
# =============================================================================

PYTORCH_IMAGE = "nvcr.io/nvidia/pytorch:25.11-py3"

print(f"Target container: {PYTORCH_IMAGE}")
print("\nFirst pull can take 10-30 minutes depending on connection speed!")
print("\nTo check for newer versions:")
print("  Visit: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch")

Target container: nvcr.io/nvidia/pytorch:25.11-py3

First pull can take 10-30 minutes depending on connection speed!

To check for newer versions:
  Visit: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch


In [6]:
# Check if already downloaded
import subprocess

result = subprocess.run(
    ["docker", "images", "-q", PYTORCH_IMAGE],
    capture_output=True,
    text=True
)

if result.stdout.strip():
    print(f"✅ Container already downloaded!")
    print(f"   Image ID: {result.stdout.strip()}")
else:
    print(f"Container not found locally.")
    print(f"Run this command to pull it:")
    print(f"\n    docker pull {PYTORCH_IMAGE}")

Container not found locally.
Run this command to pull it:

    docker pull nvcr.io/nvidia/pytorch:25.11-py3


### ✋ Try It Yourself #1

Pull the PyTorch container. Run this in a terminal (not this notebook) to see progress:

```bash
docker pull nvcr.io/nvidia/pytorch:25.11-py3
```

Or run the cell below:

In [7]:
# Pull the image (this may take a while!)
!docker pull {PYTORCH_IMAGE}

25.11-py3: Pulling from nvidia/pytorch

5db46e38: Pulling fs layer
b700ef54: Pulling fs layer
3c1ba463: Pulling fs layer
72e6e08d: Pulling fs layer
63993349: Pulling fs layer
f314d125: Pulling fs layer
ddcc87c2: Pulling fs layer
67646c1c: Pulling fs layer
138847d4: Pulling fs layer
51a7be9a: Pulling fs layer
0625c4f2: Pulling fs layer
a2cea4f8: Pulling fs layer
d52a000b: Pulling fs layer
0efc374f: Pulling fs layer
04cbfd01: Pulling fs layer
9b7f7ff1: Pulling fs layer
da052f10: Pulling fs layer
69c44cc4: Pulling fs layer
761d0a1e: Pulling fs layer
bb3e15dc: Pulling fs layer
a87108e5: Pulling fs layer
31d931ca: Pulling fs layer
9db912fd: Pulling fs layer
2f1e370b: Pulling fs layer
272720e7: Pulling fs layer
26379d7c: Pulling fs layer
cb5580f9: Pulling fs layer
779dcb1a: Pulling fs layer
e5459342: Pulling fs layer
45a29a11: Pulling fs layer


---

## Part 4: Running Containers with GPU Access

### Key Flags Explained

```bash
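# Flags used below:
#   --gpus all    Enable GPU access
#   -it           Interactive terminal
#   --rm          Remove container on exit
#   -v            Mount your code
#   --ipc=host    Shared memory for PyTorch DataLoader
# (A comment after a trailing backslash breaks the line continuation,
#  so the comments are listed here instead of inline.)
docker run \
    --gpus all \
    -it \
    --rm \
    -v "$HOME/workspace:/workspace" \
    --ipc=host \
    nvcr.io/nvidia/pytorch:25.11-py3 \
    bash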
```

Let's break down each flag:

| Flag | Purpose | When Required |
|------|----------|---------------|
| `--gpus all` | Makes GPU visible inside container | Always |
| `-it` | Interactive mode with terminal | For interactive sessions |
| `--rm` | Auto-cleanup when container exits | Recommended always |
| `-v` | Mount host directory into container | To persist your work |
| `--ipc=host` | Needed for PyTorch multi-worker data loading | Always for PyTorch |
| `-p 8888:8888` | Port mapping for Jupyter Lab | Only when running Jupyter |

> **Note:** The `-p 8888:8888` flag is only needed when running Jupyter Lab. For interactive bash sessions, port mapping is not required.
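
When you launch containers from Python, it can help to assemble the flags from the table as an argument list rather than one long string. The following is a minimal sketch under that assumption; `build_docker_run` is a hypothetical helper and the `/home/user/workspace` path is a placeholder.

```python
import shlex
from typing import Dict, List, Optional


def build_docker_run(
    image: str,
    command: List[str],
    mounts: Optional[Dict[str, str]] = None,
    ports: Optional[Dict[int, int]] = None,
    interactive: bool = True,
) -> List[str]:
    """Assemble a `docker run` argument list using the flags from the table."""
    argv = ["docker", "run", "--gpus", "all", "--rm", "--ipc=host"]
    if interactive:
        argv.append("-it")  # interactive terminal, omit for batch jobs
    for host_dir, container_dir in (mounts or {}).items():
        argv += ["-v", f"{host_dir}:{container_dir}"]
    for host_port, container_port in (ports or {}).items():
        argv += ["-p", f"{host_port}:{container_port}"]
    argv.append(image)
    argv += command
    return argv


argv = build_docker_run(
    "nvcr.io/nvidia/pytorch:25.11-py3",
    ["bash"],
    mounts={"/home/user/workspace": "/workspace"},  # placeholder host path
)
print(shlex.join(argv))
```

Passing a list like `argv` to `subprocess.run` avoids shell-quoting problems, and `shlex.join` gives a copy-pasteable command for the terminal.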

In [8]:
# Generate the docker run command
import os

home = os.environ.get('HOME', '/home/user')

run_command = f"""docker run --gpus all -it --rm \\
    -v {home}/workspace:/workspace \\
    -v {home}/.cache/huggingface:/root/.cache/huggingface \\
    --ipc=host \\
    {PYTORCH_IMAGE} \\
    bash"""

print("Run this command to start the container:")
print("=" * 50)
print(run_command)

Run this command to start the container:
docker run --gpus all -it --rm \
    -v /home/trosfy/workspace:/workspace \
    -v /home/trosfy/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    nvcr.io/nvidia/pytorch:25.11-py3 \
    bash


In [9]:
# Test GPU access in container (quick test)
test_command = f'''docker run --gpus all --rm {PYTORCH_IMAGE} \
    python -c "import torch; print(f'CUDA available: {{torch.cuda.is_available()}}'); print(f'Device: {{torch.cuda.get_device_name(0) if torch.cuda.is_available() else None}}')"
'''

print("Testing GPU access in container...")
print("-" * 50)
!{test_command}

Testing GPU access in container...
--------------------------------------------------

== PyTorch ==

NVIDIA Release 25.11 (build 231036168)
PyTorch Version 2.10.0a0+b558c98
Container image Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2024 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files in

---

## Part 5: Creating docker-compose.yml

### Concept Explanation

Docker Compose makes it easy to run containers with complex configurations. Instead of typing long `docker run` commands, you define everything in a YAML file.

> **Important: File Locations**
>
> The following cells will create configuration files in your **current working directory**.
> For best organization, ensure you're running this notebook from your project root directory
> (e.g., `$HOME/workspace/dgx-spark-project/`).
>
> Files created:
> - `docker-compose.yml` - Docker Compose configuration
> - `start_pytorch.sh` - Shell script launcher
> - `verify_gpu.py` - GPU verification script

In [10]:
# Generate docker-compose.yml content
import os

home = os.environ.get('HOME', '/home/user')

docker_compose_content = f"""# DGX Spark AI Development Environment
# Generated for NGC PyTorch container

services:
  pytorch:
    image: {PYTORCH_IMAGE}
    container_name: dgx-spark-pytorch
    
    # GPU access
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    
    # Volume mounts
    volumes:
      - {home}/workspace:/workspace
      - {home}/.cache/huggingface:/root/.cache/huggingface
      - {home}/.cache/torch:/root/.cache/torch
    
    # Networking
    ports:
      - "8888:8888"   # Jupyter Lab
      - "6006:6006"   # TensorBoard
    
    # Required for PyTorch DataLoader
    ipc: host
    
    # Keep container running
    stdin_open: true
    tty: true
    
    # Working directory
    working_dir: /workspace
    
    # Default command (can override)
    command: jupyter lab --ip=0.0.0.0 --port=8888 --allow-root --no-browser
"""

print("docker-compose.yml content:")
print("=" * 50)
print(docker_compose_content)

docker-compose.yml content:
# DGX Spark AI Development Environment
# Generated for NGC PyTorch container

services:
  pytorch:
    image: nvcr.io/nvidia/pytorch:25.11-py3
    container_name: dgx-spark-pytorch

    # GPU access
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

    # Volume mounts
    volumes:
      - /home/trosfy/workspace:/workspace
      - /home/trosfy/.cache/huggingface:/root/.cache/huggingface
      - /home/trosfy/.cache/torch:/root/.cache/torch

    # Networking
    ports:
      - "8888:8888"   # Jupyter Lab
      - "6006:6006"   # TensorBoard

    # Required for PyTorch DataLoader
    ipc: host

    # Keep container running
    stdin_open: true
    tty: true

    # Working directory
    working_dir: /workspace

    # Default command (can override)
    command: jupyter lab --ip=0.0.0.0 --port=8888 --allow-root --no-browser



In [11]:
# Save docker-compose.yml
import os

# Save to current directory
compose_path = "docker-compose.yml"

with open(compose_path, 'w') as f:
    f.write(docker_compose_content)

print(f"✅ Saved to: {os.path.abspath(compose_path)}")
print("\nUsage:")
print("  docker compose up -d      # Start in background")
print("  docker compose logs -f    # View logs")
print("  docker compose down       # Stop and remove")
print("  docker compose exec pytorch bash  # Get shell")

✅ Saved to: /home/trosfy/projects/dgx-spark-ai-curriculum/domain-1-platform-foundations/module-1.1-dgx-spark-platform/labs/docker-compose.yml

Usage:
  docker compose up -d      # Start in background
  docker compose logs -f    # View logs
  docker compose down       # Stop and remove
  docker compose exec pytorch bash  # Get shell


### Alternative: Shell Script Launcher

Some developers prefer a simple shell script:

In [12]:
# Generate shell script launcher
shell_script = f"""#!/bin/bash
# DGX Spark PyTorch Development Environment
# Usage: ./start_pytorch.sh [command]
#   ./start_pytorch.sh           # Start Jupyter Lab
#   ./start_pytorch.sh bash      # Get shell
#   ./start_pytorch.sh python    # Python REPL

IMAGE="{PYTORCH_IMAGE}"
CONTAINER_NAME="dgx-spark-pytorch"

# Default command is Jupyter Lab
CMD="${{@:-jupyter lab --ip=0.0.0.0 --allow-root --no-browser}}"

# Stop existing container if running
docker stop $CONTAINER_NAME 2>/dev/null
docker rm $CONTAINER_NAME 2>/dev/null

echo "Starting DGX Spark PyTorch environment..."
echo "Image: $IMAGE"
echo "Command: $CMD"
echo ""

docker run --gpus all -it --rm \\
    --name $CONTAINER_NAME \\
    -v $HOME/workspace:/workspace \\
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \\
    -v $HOME/.cache/torch:/root/.cache/torch \\
    -p 8888:8888 \\
    -p 6006:6006 \\
    --ipc=host \\
    -w /workspace \\
    $IMAGE \\
    $CMD
"""

# Save script
script_path = "start_pytorch.sh"
with open(script_path, 'w') as f:
    f.write(shell_script)

# Make executable
os.chmod(script_path, 0o755)

print(f"✅ Saved to: {os.path.abspath(script_path)}")
print("\nUsage:")
print("  ./start_pytorch.sh           # Start Jupyter Lab")
print("  ./start_pytorch.sh bash      # Get shell")
print("  ./start_pytorch.sh python    # Python REPL")

✅ Saved to: /home/trosfy/projects/dgx-spark-ai-curriculum/domain-1-platform-foundations/module-1.1-dgx-spark-platform/labs/start_pytorch.sh

Usage:
  ./start_pytorch.sh           # Start Jupyter Lab
  ./start_pytorch.sh bash      # Get shell
  ./start_pytorch.sh python    # Python REPL


---

## Part 6: Verifying GPU Access

### Concept Explanation

Once your container is running, you need to verify that:
1. PyTorch can see the GPU
2. Tensor operations work on GPU
3. Memory is accessible

In [13]:
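# Create a verification script.
# A raw string keeps the "\n" escapes literal in the generated file (otherwise
# they become real newlines inside string literals and the script fails to parse),
# and the shebang must be the very first line of that file.
verification_script = r'''#!/usr/bin/env python3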
"""
DGX Spark GPU Verification Script
Run this inside the NGC container to verify GPU access.
"""

import sys

def check_torch():
    """Check PyTorch GPU access."""
    print("=" * 60)
    print("PyTorch GPU Verification")
    print("=" * 60)
    
    try:
        import torch
        print(f"PyTorch version: {torch.__version__}")
        print(f"CUDA available: {torch.cuda.is_available()}")
        
        if torch.cuda.is_available():
            print(f"CUDA version: {torch.version.cuda}")
            print(f"Device count: {torch.cuda.device_count()}")
            print(f"Current device: {torch.cuda.current_device()}")
            print(f"Device name: {torch.cuda.get_device_name(0)}")
            
            # Memory info
            props = torch.cuda.get_device_properties(0)
            print(f"Total memory: {props.total_memory / 1e9:.1f} GB")
            
            # Test tensor operation
            print("\nTesting tensor operations...")
            x = torch.randn(1000, 1000, device="cuda")
            y = torch.randn(1000, 1000, device="cuda")
            z = torch.matmul(x, y)
            print(f"✅ Matrix multiplication successful!")
            print(f"   Result shape: {z.shape}")
            print(f"   Memory used: {torch.cuda.memory_allocated() / 1e6:.1f} MB")
            
            return True
        else:
            print("❌ CUDA not available!")
            return False
            
    except ImportError:
        print("❌ PyTorch not installed!")
        return False

def check_cudnn():
    """Check cuDNN status."""
    print("\n" + "-" * 40)
    print("cuDNN Status")
    print("-" * 40)
    
    try:
        import torch
        print(f"cuDNN available: {torch.backends.cudnn.is_available()}")
        print(f"cuDNN enabled: {torch.backends.cudnn.enabled}")
        print(f"cuDNN version: {torch.backends.cudnn.version()}")
        return True
    except Exception as e:
        print(f"❌ Error: {e}")
        return False

def check_bfloat16():
    """Check bfloat16 support (important for Blackwell)."""
    print("\n" + "-" * 40)
    print("BFloat16 Support (Blackwell Optimized)")
    print("-" * 40)
    
    try:
        import torch
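        if not torch.cuda.is_available():
            # Return an explicit bool: a bare fall-through would return None
            # and break the `all_passed &= ...` bookkeeping in __main__.
            print("❌ CUDA not available - cannot test BFloat16")
            return False
        x = torch.randn(100, 100, dtype=torch.bfloat16, device="cuda")
        y = torch.randn(100, 100, dtype=torch.bfloat16, device="cuda")
        z = torch.matmul(x, y)
        print(f"✅ BFloat16 operations work!")
        return True
    except Exception as e:
        print(f"❌ BFloat16 error: {e}")
        return False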

if __name__ == "__main__":
    all_passed = True
    all_passed &= check_torch()
    all_passed &= check_cudnn()
    all_passed &= check_bfloat16()
    
    print("\n" + "=" * 60)
    if all_passed:
        print("✅ All checks passed! Your DGX Spark is ready for AI!")
    else:
        print("❌ Some checks failed. Please review the output above.")
    print("=" * 60)
    
    sys.exit(0 if all_passed else 1)
'''

# Save verification script
verify_path = "verify_gpu.py"
with open(verify_path, 'w') as f:
    f.write(verification_script)

print(f"✅ Saved verification script to: {verify_path}")
print("\nRun inside container:")
print("  python verify_gpu.py")

✅ Saved verification script to: verify_gpu.py

Run inside container:
  python verify_gpu.py


---

## Part 7: Common Container Configurations

### Configuration Templates

In [14]:
# Print common container configurations

configs = {
    "Development (Interactive)": f"""
docker run --gpus all -it --rm \\
    -v $HOME/workspace:/workspace \\
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \\
    --ipc=host \\
    {PYTORCH_IMAGE} bash
""",

    "Jupyter Lab": f"""
docker run --gpus all -it --rm \\
    -v $HOME/workspace:/workspace \\
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \\
    -p 8888:8888 \\
    --ipc=host \\
    {PYTORCH_IMAGE} \\
    jupyter lab --ip=0.0.0.0 --allow-root --no-browser
""",

    "Training Script": f"""
docker run --gpus all --rm \\
    -v $HOME/workspace:/workspace \\
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \\
    --ipc=host \\
    {PYTORCH_IMAGE} \\
    python /workspace/train.py
""",

    "With TensorBoard": f"""
docker run --gpus all -it --rm \\
    -v $HOME/workspace:/workspace \\
    -p 8888:8888 -p 6006:6006 \\
    --ipc=host \\
    {PYTORCH_IMAGE} bash -c "
        tensorboard --logdir=/workspace/logs --bind_all &
        jupyter lab --ip=0.0.0.0 --allow-root --no-browser
    "
"""
}

print("Common Container Configurations")
print("=" * 60)

for name, cmd in configs.items():
    print(f"\n### {name}")
    print("-" * 40)
    print(cmd)

Common Container Configurations

### Development (Interactive)
----------------------------------------

docker run --gpus all -it --rm \
    -v $HOME/workspace:/workspace \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    nvcr.io/nvidia/pytorch:25.11-py3 bash


### Jupyter Lab
----------------------------------------

docker run --gpus all -it --rm \
    -v $HOME/workspace:/workspace \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    -p 8888:8888 \
    --ipc=host \
    nvcr.io/nvidia/pytorch:25.11-py3 \
    jupyter lab --ip=0.0.0.0 --allow-root --no-browser


### Training Script
----------------------------------------

docker run --gpus all --rm \
    -v $HOME/workspace:/workspace \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    nvcr.io/nvidia/pytorch:25.11-py3 \
    python /workspace/train.py


### With TensorBoard
----------------------------------------

docker run --gpus all -it --rm \
    -v $HOME/work

---

## Part 8: Persisting Container Changes

### Concept Explanation

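By default, containers are ephemeral: anything written inside the container's own filesystem is lost when the container is removed (which happens on exit when you use `--rm`). Here's how to persist data: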

1. **Volume mounts** (recommended): Mount directories from host
2. **Docker volumes**: Named volumes managed by Docker
3. **Custom image**: Build your own image with modifications
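
A quick way to reason about option 1 is to ask whether a given path inside the container sits under a mounted directory. The sketch below illustrates that check, assuming absolute, normalized container paths; `survives_container_exit` is a hypothetical helper and the mount list mirrors the mounts used in this lab.

```python
from pathlib import PurePosixPath
from typing import Iterable


def survives_container_exit(container_path: str, mount_targets: Iterable[str]) -> bool:
    """Return True if `container_path` is a mount target or lies beneath one."""
    path = PurePosixPath(container_path)
    for target in mount_targets:
        target_path = PurePosixPath(target)
        # Compare whole path components so /workspace2 does not match /workspace.
        if path == target_path or target_path in path.parents:
            return True
    return False


mounts = ["/workspace", "/root/.cache/huggingface"]

print(survives_container_exit("/workspace/train.py", mounts))   # True
print(survives_container_exit("/opt/notes.txt", mounts))        # False
print(survives_container_exit("/workspace2/data.csv", mounts))  # False
```

Files for which this returns `False` live only in the container's writable layer, so save anything you care about under a mounted path.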

In [15]:
# Show recommended volume mounts
print("Recommended Volume Mounts for AI Development")
print("=" * 60)
print("""
Mount                                  | Purpose
---------------------------------------|------------------------
$HOME/workspace:/workspace             | Your code and projects
$HOME/.cache/huggingface:/root/.cache/ | HuggingFace models
  huggingface                          |   (saves re-downloads)
$HOME/.cache/torch:/root/.cache/torch  | PyTorch model cache
$HOME/data:/data                       | Large datasets
$HOME/models:/models                   | Saved model checkpoints
""")

Recommended Volume Mounts for AI Development

Mount                                  | Purpose
---------------------------------------|------------------------
$HOME/workspace:/workspace             | Your code and projects
$HOME/.cache/huggingface:/root/.cache/ | HuggingFace models
  huggingface                          |   (saves re-downloads)
$HOME/.cache/torch:/root/.cache/torch  | PyTorch model cache
$HOME/data:/data                       | Large datasets
$HOME/models:/models                   | Saved model checkpoints



In [16]:
# Create recommended directories
import os

directories = [
    os.path.expanduser("~/workspace"),
    os.path.expanduser("~/.cache/huggingface"),
    os.path.expanduser("~/.cache/torch"),
    os.path.expanduser("~/data"),
    os.path.expanduser("~/models"),
]

print("Creating recommended directories...")
for d in directories:
    os.makedirs(d, exist_ok=True)
    print(f"  ✅ {d}")

print("\nDone! These directories will persist your work.")

Creating recommended directories...
  ✅ /home/trosfy/workspace
  ✅ /home/trosfy/.cache/huggingface
  ✅ /home/trosfy/.cache/torch
  ✅ /home/trosfy/data
  ✅ /home/trosfy/models

Done! These directories will persist your work.


---

## ⚠️ Common Mistakes

### Mistake 1: Forgetting `--gpus all`

```bash
# ❌ Wrong - No GPU access
docker run -it nvcr.io/nvidia/pytorch:25.11-py3 bash

# ✅ Right - GPU enabled
docker run --gpus all -it nvcr.io/nvidia/pytorch:25.11-py3 bash
```

### Mistake 2: Missing `--ipc=host`

```bash
# ❌ Wrong - DataLoader workers may crash
docker run --gpus all -it nvcr.io/nvidia/pytorch:25.11-py3 bash

# ✅ Right - Shared memory enabled
docker run --gpus all --ipc=host -it nvcr.io/nvidia/pytorch:25.11-py3 bash
```
**Why:** PyTorch DataLoader uses shared memory for inter-process communication.
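
To see how much shared memory a container actually has, you can inspect `/dev/shm` from inside it. This is a minimal sketch, assuming a Linux container where shared memory is mounted at `/dev/shm`; `shared_memory_bytes` is a hypothetical helper. Without `--ipc=host` or `--shm-size`, Docker gives a container a 64 MB `/dev/shm` by default, which multi-worker data loading can exhaust.

```python
import os
import shutil
from typing import Optional


def shared_memory_bytes(path: str = "/dev/shm") -> Optional[int]:
    """Return the total size of the shared-memory mount, or None if absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total


total = shared_memory_bytes()
if total is None:
    print("/dev/shm not found on this system")
else:
    print(f"/dev/shm size: {total / 2**20:.0f} MiB")
```

Run it once with and once without `--ipc=host` to compare the reported sizes.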

### Mistake 3: Not mounting cache directories

```bash
# ❌ Wrong - Re-downloads models every time
docker run --gpus all -it nvcr.io/nvidia/pytorch:25.11-py3 bash

# ✅ Right - Persist model cache
docker run --gpus all -it \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    nvcr.io/nvidia/pytorch:25.11-py3 bash
```

### Mistake 4: Not exposing Jupyter to the host

```bash
# ❌ Wrong - Can't access from host
docker run --gpus all -it IMAGE jupyter lab --allow-root

# ✅ Right - Bind to all interfaces and expose port
docker run --gpus all -it -p 8888:8888 IMAGE \
    jupyter lab --ip=0.0.0.0 --allow-root --no-browser
```
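
If Jupyter still seems unreachable, a plain TCP check from the host can tell you whether anything is listening on the published port. The sketch below uses only the standard library; `port_is_open` is a hypothetical helper, and `127.0.0.1:8888` assumes the `-p 8888:8888` mapping shown above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if port_is_open("127.0.0.1", 8888):
    print("Something is listening on port 8888")
else:
    print("Nothing reachable on port 8888 - check -p 8888:8888 and --ip=0.0.0.0")
```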

---

## 🎉 Checkpoint

You've learned:
- ✅ Why NGC containers are required for DGX Spark
- ✅ How to pull the PyTorch NGC container
- ✅ How to run containers with GPU access
- ✅ How to create docker-compose.yml for easy development
- ✅ How to verify GPU access inside containers

---

## 🚀 Challenge (Optional)

Create a custom Dockerfile that extends the NGC PyTorch image with:
1. Your favorite Python packages (e.g., transformers, datasets)
2. Custom Jupyter configuration
3. A startup script that shows GPU status

<details>
<summary>💡 Solution Hint</summary>

```dockerfile
FROM nvcr.io/nvidia/pytorch:25.11-py3

# Install additional packages
RUN pip install transformers datasets accelerate

# Jupyter config
RUN mkdir -p /root/.jupyter
RUN echo "c.NotebookApp.token = ''" >> /root/.jupyter/jupyter_notebook_config.py

# Startup script
COPY startup.sh /startup.sh
ENTRYPOINT ["/startup.sh"]
```
</details>

In [None]:
# YOUR CHALLENGE CODE HERE


---

## 📖 Further Reading

- [NGC Container Catalog](https://catalog.ngc.nvidia.com/)
- [Docker GPU Documentation](https://docs.docker.com/config/containers/resource_constraints/#gpu)
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/overview.html)

---

## 🧹 Cleanup

In [17]:
# Cleanup
import gc
gc.collect()

print("Files created in this notebook:")
print("-" * 40)
print("  - docker-compose.yml")
print("  - start_pytorch.sh")
print("  - verify_gpu.py")
print("\nThese files are useful - keep them!")

print("\n" + "=" * 60)
print("🎉 Great job completing Lab 1.1.3: NGC Container Setup!")
print("=" * 60)
print("\nNext up: Lab 1.1.4 - Compatibility Matrix")
print("You'll research which AI tools work on DGX Spark.")

Files created in this notebook:
----------------------------------------
  - docker-compose.yml
  - start_pytorch.sh
  - verify_gpu.py

These files are useful - keep them!

🎉 Great job completing Lab 1.1.3: NGC Container Setup!

Next up: Lab 1.1.4 - Compatibility Matrix
You'll research which AI tools work on DGX Spark.
