# Lab 1.1.1: System Exploration - SOLUTIONS

This notebook contains solutions to the exercises in the System Exploration notebook.

---

## Try It Yourself #1 Solution

**Task:** Run the `nvidia-smi topo` command. What interconnect does the DGX Spark use?

In [None]:
# Solution: GPU topology information
!nvidia-smi topo --matrix

**Expected Output:**
```
        GPU0    CPU Affinity    NUMA Affinity
GPU0     X      0-19            N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect
  NODE = Connection traversing PCIe as well as the NUMA node interconnect
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge
  NV#  = Connection traversing NVLink
```

**Answer:** The DGX Spark connects its Grace CPU and Blackwell GPU with NVLink-C2C (Chip-to-Chip) rather than PCIe. Because the system has a single GPU, the matrix contains only the `X` (self) entry, and none of the PCIe-based codes (`SYS`, `NODE`, `PHB`) appear.
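
When reading a larger matrix, it can help to translate each cell code back into its legend entry. The following is a minimal sketch of that lookup, using only the legend text shown above; the `describe_link` helper and the `LEGEND` table are illustrative names, not part of `nvidia-smi` or the lab scripts.

```python
import re

# Descriptions copied from the legend in the expected output above.
LEGEND = {
    "X": "Self",
    "SYS": "Connection traversing PCIe as well as the SMP interconnect",
    "NODE": "Connection traversing PCIe as well as the NUMA node interconnect",
    "PHB": "Connection traversing PCIe as well as a PCIe Host Bridge",
}

def describe_link(code):
    """Return the legend description for one matrix cell such as 'X' or 'NV4'."""
    # The legend writes NVLink entries as 'NV#', where # is a number.
    if re.fullmatch(r"NV\d+", code):
        return "Connection traversing NVLink"
    return LEGEND.get(code, "Unknown code")

for cell in ["X", "PHB", "NV4"]:
    print(f"{cell:>4} -> {describe_link(cell)}")
```

On a single-GPU system the only cell to look up is `X`, which is why the matrix alone does not name the CPU-GPU link.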

---

## Try It Yourself #2 Solution

**Task:** Find out the cache sizes on your CPU.

In [None]:
# Solution: Filter for cache information
!lscpu | grep -i cache

**Expected Output:**
```
L1d cache:                       1 MiB (20 instances)
L1i cache:                       1 MiB (20 instances)
L2 cache:                        20 MiB (20 instances)
L3 cache:                        24 MiB (1 instance)
```

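**Explanation:** `lscpu` prints the combined size of all instances at each cache level, so divide by the instance count to get the per-core figure.
- L1d (data) and L1i (instruction) caches: 1 MiB total each, spread over 20 per-core instances (about 51 KiB per core)
- L2 cache: 20 MiB total, 1 MiB per core
- L3 cache: 24 MiB in a single instance shared across all cores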
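
The per-instance sizes can be worked out by dividing each reported total by its instance count. Here is a minimal sketch of that arithmetic, assuming the `lscpu` output format shown above; the `per_instance_kib` helper and the hard-coded `SAMPLE` text are illustrative and not part of the lab.

```python
import re

# Sample text copied from the expected output above.
SAMPLE = """\
L1d cache:                       1 MiB (20 instances)
L1i cache:                       1 MiB (20 instances)
L2 cache:                        20 MiB (20 instances)
L3 cache:                        24 MiB (1 instance)
"""

UNIT_KIB = {"KiB": 1, "MiB": 1024, "GiB": 1024 * 1024}
LINE_RE = re.compile(
    r"^(L\d\w*) cache:\s+([\d.]+) (KiB|MiB|GiB) \((\d+) instances?\)"
)

def per_instance_kib(lscpu_text):
    """Map each cache name to its size per instance, in KiB."""
    sizes = {}
    for line in lscpu_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            name, total, unit, count = match.groups()
            # Total size in KiB divided by the number of instances.
            sizes[name] = float(total) * UNIT_KIB[unit] / int(count)
    return sizes

for name, kib in per_instance_kib(SAMPLE).items():
    print(f"{name}: {kib:g} KiB per instance")
```

To try it on a live system, pass the captured output of `lscpu` to `per_instance_kib` instead of `SAMPLE`; lines that do not match the assumed format are skipped.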

---

## Challenge Solution

**Task:** Create a monitoring script that runs every 5 seconds and displays GPU memory, temperature, and system memory.

In [None]:
# Solution 1: Using watch command (run in terminal)
print("Run this in terminal:")
print("")
print('watch -n 5 "nvidia-smi --query-gpu=memory.used,memory.free,temperature.gpu --format=csv && echo \'---\' && free -h | head -2"')

In [None]:
# Solution 2: Python version with continuous updates
import subprocess
import time

from IPython.display import clear_output

def monitor_once():
    """Display system status once."""
    # Clear the previous output so each refresh replaces the last (Jupyter only)
    clear_output(wait=True)
    
    print("=" * 50)
    print(f"System Monitor - {time.strftime('%H:%M:%S')}")
    print("=" * 50)
    
    # GPU info - with timeout to prevent hanging
    try:
        gpu_info = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used,memory.free,temperature.gpu,power.draw", 
             "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30
        )
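        if gpu_info.returncode == 0:
            parts = gpu_info.stdout.strip().split(", ")
            if len(parts) >= 4:
                print(f"\nGPU Memory Used:  {parts[0]}")
                print(f"GPU Memory Free:  {parts[1]}")
                print(f"GPU Temperature:  {parts[2]}C")
                print(f"GPU Power:        {parts[3]}")
            else:
                print(f"\nGPU info: Unexpected output: {gpu_info.stdout.strip()!r}")
        else:
            print(f"\nGPU info: nvidia-smi exited with code {gpu_info.returncode}")
    except subprocess.TimeoutExpired:
        print("\nGPU info: Command timed out")
    except FileNotFoundError:
        print("\nGPU info: nvidia-smi not found")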
    
    # System memory - with timeout to prevent hanging
    try:
        mem_info = subprocess.run(["free", "-h"], capture_output=True, text=True, timeout=30)
        if mem_info.returncode == 0:
            lines = mem_info.stdout.strip().split("\n")
            print("\nSystem Memory:")
            for line in lines[:2]:
                print(f"  {line}")
    except subprocess.TimeoutExpired:
        print("\nMemory info: Command timed out")
    
    print("\n" + "-" * 50)
    print("Press Ctrl+C to stop (or interrupt kernel)")

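def continuous_monitor(interval=5, max_iterations=10):
    """Refresh up to max_iterations times, pausing interval seconds between refreshes."""
    try:
        for i in range(max_iterations):
            monitor_once()
            if i < max_iterations - 1:
                time.sleep(interval)  # No need to wait after the final refresh
    except KeyboardInterrupt:
        # Ctrl+C or a kernel interrupt ends monitoring without a traceback
        print("\nMonitoring stopped by user.")
        return
    print("\nMonitoring complete!")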

# Run once to show output
monitor_once()

# Uncomment to run continuously:
# continuous_monitor(interval=5, max_iterations=10)

In [None]:
# Solution 3: Using the memory_monitor script from scripts/
print("You can also use the provided memory_monitor.py script:")
print("")
print("python ../scripts/memory_monitor.py --interval 5")
print("python ../scripts/memory_monitor.py --interval 5 --processes  # Include GPU processes")
print("python ../scripts/memory_monitor.py --interval 5 --log memory.csv  # Log to file")

---

## Key Takeaways

1. **nvidia-smi** is your primary tool for GPU monitoring
2. **lscpu** shows CPU architecture - ARM64 on DGX Spark
3. **free** shows unified memory shared between CPU and GPU
4. The Grace-Blackwell architecture uses NVLink-C2C for fast interconnect
5. Because memory is unified, the Linux buffer cache competes with GPU allocations - clear it before loading large models
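
To see how much memory the buffer cache currently holds, one option is to read `/proc/meminfo` directly. The sketch below assumes a Linux system; the `parse_meminfo` and `cache_summary` helpers are illustrative names and are not part of the lab's scripts.

```python
from pathlib import Path

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (such as HugePages_Total)
    are plain counts with no unit.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields

def cache_summary(fields):
    """Return (Buffers + Cached, MemAvailable), both converted to GiB."""
    kib_per_gib = 1024 ** 2
    cached = fields.get("Buffers", 0) + fields.get("Cached", 0)
    available = fields.get("MemAvailable", 0)
    return cached / kib_per_gib, available / kib_per_gib

meminfo = Path("/proc/meminfo")
if meminfo.exists():  # Present on Linux only
    cached_gib, available_gib = cache_summary(parse_meminfo(meminfo.read_text()))
    print(f"Buffers + page cache: {cached_gib:.1f} GiB")
    print(f"MemAvailable:         {available_gib:.1f} GiB")
```

A large "Buffers + page cache" figure before loading a model suggests the cache is worth clearing first.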

---

## Cleanup

In [None]:
# Cleanup resources
import gc
gc.collect()
print("Cleanup complete!")