# Lab 1.1.1: System Exploration

**Module:** 1.1 - DGX Spark Platform Mastery  
**Time:** 1 hour  
**Difficulty:** ⭐ (Beginner)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand the DGX Spark hardware architecture
- [ ] Use system commands to explore your hardware
- [ ] Document your system's capabilities
- [ ] Know the key specifications that matter for AI workloads

---

## 📚 Prerequisites

- Basic command line knowledge
- Access to a DGX Spark system

---

## 🌍 Real-World Context

Before you can train AI models or run inference, you need to understand your hardware. Just like a chef needs to know their kitchen before cooking a feast, an AI engineer needs to know their compute environment.

The DGX Spark is NVIDIA's first desktop AI supercomputer, designed specifically for AI developers. Understanding its unique architecture will help you:
- Choose the right model sizes for your hardware
- Optimize memory usage for maximum performance
- Troubleshoot issues when things don't work as expected

---

## 🧒 ELI5: What is the DGX Spark?

> **Imagine you have a super-powered gaming computer...**
>
> But instead of being good at video games, it's amazing at doing math - billions of calculations per second! The DGX Spark has a special brain (the GPU) that can do many math problems at the same time, like having 6,144 calculators working together.
>
> The really cool part? It has a HUGE memory (128GB) that both the regular brain (CPU) and the math brain (GPU) can share. It's like having one giant desk where everyone can work together, instead of passing papers back and forth between desks.
>
> **In AI terms:** This unified memory architecture lets you load AI models that would not fit in a typical discrete GPU's memory (for example, a quantized 70-billion-parameter LLM), as long as they fit within the shared 128GB.

---

## Part 1: GPU Information with nvidia-smi

### Concept Explanation

The `nvidia-smi` (NVIDIA System Management Interface) command is your window into the GPU. It shows:
- GPU model and architecture
- Memory usage (total, used, free)
- Running processes using the GPU
- Temperature and power consumption

### Code Implementation

In [1]:
# Let's start by checking our GPU!
# This is the most important command for any AI developer

!nvidia-smi

Thu Jan  1 21:49:15 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   39C    P8              3W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+----------------------------------------------

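### 🔍 What Just Happened?

You should see output showing:
- **GPU Name:** NVIDIA GB10 (the GPU inside the GB10 Superchip)
- **Memory:** `Not Supported` rather than a number. This is expected on a unified-memory system: the GPU has no dedicated VRAM pool to report because it shares the 128GB with the CPU.
- **CUDA Version:** 13.0 or higher
- **Driver Version:** 580.95.05 in the run above; yours may be newer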

Let's get a more detailed view:

In [2]:
# Get detailed GPU information in a query format
!nvidia-smi --query-gpu=name,memory.total,memory.free,memory.used,temperature.gpu,power.draw --format=csv

name, memory.total [MiB], memory.free [MiB], memory.used [MiB], temperature.gpu, power.draw [W]
NVIDIA GB10, [N/A], [N/A], [N/A], 39, 3.74 W
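
If you want to use these values from Python rather than read them by eye, the CSV output is easy to parse. The following is a minimal sketch, not an official tool: `parse_gpu_csv` is a hypothetical helper name, and the sample text is copied from the output above. It treats `[N/A]` fields as `None`, which is how the memory columns come back on this system.

```python
import csv
import io


def parse_gpu_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv` output into a list of dicts.

    Fields reported as [N/A] (as memory.* is in the run above) become None.
    """
    reader = csv.reader(io.StringIO(text.strip()), skipinitialspace=True)
    rows = [row for row in reader if row]
    header, records = rows[0], rows[1:]
    parsed = []
    for row in records:
        record = {}
        for key, value in zip(header, row):
            value = value.strip()
            record[key.strip()] = None if value == "[N/A]" else value
        parsed.append(record)
    return parsed


# Sample copied from the cell output above
sample = """name, memory.total [MiB], memory.free [MiB], memory.used [MiB], temperature.gpu, power.draw [W]
NVIDIA GB10, [N/A], [N/A], [N/A], 39, 3.74 W"""

gpu = parse_gpu_csv(sample)[0]
print(gpu["name"], gpu["memory.total [MiB]"], gpu["temperature.gpu"])
```

On a live system you could pass the real command output to the same function, for example via `subprocess.run(..., capture_output=True, text=True).stdout`.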


In [3]:
# Let's also check CUDA compute capability
!nvidia-smi --query-gpu=compute_cap --format=csv,noheader

12.1


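### ✋ Try It Yourself #1

Run the following cell to get GPU topology information. Which connection types does the matrix report between the GPU and the network interfaces?

<details>
<summary>💡 Hint</summary>
Read each matrix entry against the legend printed below it. In the sample output the entries are `NODE` and `PIX` (PCIe paths), not `NV#`. The NVLink-C2C link between CPU and GPU is internal to the GB10 Superchip and does not show up as an `NV#` entry in this matrix.
</details>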

In [7]:
# YOUR CODE HERE: Run nvidia-smi topo command
# Hint: nvidia-smi topo --matrix

!nvidia-smi topo --matrix

	GPU0	NIC0	NIC1	NIC2	NIC3	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NODE	NODE	NODE	NODE	0-19	0		N/A
NIC0	NODE	 X 	PIX	NODE	NODE				
NIC1	NODE	PIX	 X 	NODE	NODE				
NIC2	NODE	NODE	NODE	 X 	PIX				
NIC3	NODE	NODE	NODE	PIX	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: rocep1s0f0
  NIC1: rocep1s0f1
  NIC2: roceP2p1s0f0
  NIC3: roceP2p1s0f1



---

## Part 2: CPU Information with lscpu

### Concept Explanation

The DGX Spark uses NVIDIA's Grace CPU - an ARM-based processor designed specifically for AI workloads. This is different from typical x86 processors (Intel/AMD) found in most computers.

**Why ARM matters:**
- More power-efficient
- Designed to work seamlessly with the Blackwell GPU
- Connected via NVLink-C2C for maximum bandwidth

### Code Implementation

In [8]:
# Check CPU information
!lscpu

Architecture:                aarch64
  CPU op-mode(s):            64-bit
  Byte Order:                Little Endian
CPU(s):                      20
  On-line CPU(s) list:       0-19
Vendor ID:                   ARM
  Model name:                Cortex-X925
    Model:                   1
    Thread(s) per core:      1
    Core(s) per socket:      10
    Socket(s):               1
    Stepping:                r0p1
    CPU(s) scaling MHz:      101%
    CPU max MHz:             4004.0000
    CPU min MHz:             1378.0000
    BogoMIPS:                2000.00
    Flags:                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics 
                             fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop 
                             sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat 
                             ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmu
                             ll svebitperm svesha3 svesm4 flagm2 frint svei8mm s
                             ve

### 🔍 What Just Happened?

Key things to note from the output:
- **Architecture:** aarch64 (ARM 64-bit)
- **CPU(s):** 20 cores
- **Model name:** Cortex-X925 (the full output also lists Cortex-A725); both are Arm v9.2 cores
- **Core types:** Mix of 10 performance (Cortex-X925) and 10 efficiency (Cortex-A725) cores

In [9]:
# Get a summary of CPU info in a cleaner format
!lscpu | grep -E "Architecture|CPU\(s\)|Model name|Thread|Core|Socket|MHz"

Architecture:                            aarch64
CPU(s):                                  20
On-line CPU(s) list:                     0-19
Model name:                              Cortex-X925
Thread(s) per core:                      1
Core(s) per socket:                      10
Socket(s):                               1
CPU(s) scaling MHz:                      94%
CPU max MHz:                             4004.0000
CPU min MHz:                             1378.0000
Model name:                              Cortex-A725
Thread(s) per core:                      1
Core(s) per socket:                      10
Socket(s):                               1
CPU(s) scaling MHz:                      99%
CPU max MHz:                             2860.0000
CPU min MHz:                             338.0000
NUMA node0 CPU(s):                       0-19
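
Because `lscpu` repeats the `Model name` block once per core type, a small parser makes the two clusters easier to compare. This is a sketch under assumptions: `parse_cpu_clusters` is a hypothetical helper, and the sample text is a trimmed copy of the output above.

```python
def parse_cpu_clusters(lscpu_text):
    """Group lscpu lines into one dict per 'Model name' entry."""
    wanted = ("Core(s) per socket", "Socket(s)", "CPU max MHz")
    clusters = []
    for line in lscpu_text.splitlines():
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "Model name":
            clusters.append({"model": value})  # a new cluster starts here
        elif clusters and key in wanted:
            clusters[-1][key] = value  # attach to the most recent cluster
    return clusters


# Trimmed copy of the filtered lscpu output above
sample = """\
Architecture:                            aarch64
CPU(s):                                  20
Model name:                              Cortex-X925
Core(s) per socket:                      10
Socket(s):                               1
CPU max MHz:                             4004.0000
Model name:                              Cortex-A725
Core(s) per socket:                      10
Socket(s):                               1
CPU max MHz:                             2860.0000
NUMA node0 CPU(s):                       0-19
"""

clusters = parse_cpu_clusters(sample)
for c in clusters:
    cores = int(c["Core(s) per socket"]) * int(c["Socket(s)"])
    ghz = float(c["CPU max MHz"]) / 1000
    print(f'{c["model"]}: {cores} cores, up to {ghz:.2f} GHz')
```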


### ✋ Try It Yourself #2

Find out the cache sizes on your CPU. Larger caches mean fewer trips to main memory, which helps CPU-side work such as data loading and preprocessing.

<details>
<summary>💡 Hint</summary>
Use `lscpu | grep -i cache` to filter for cache information
</details>

In [10]:
# YOUR CODE HERE: Find the cache sizes
!lscpu | grep -E "cache"

L1d cache:                               1.3 MiB (20 instances)
L1i cache:                               1.3 MiB (20 instances)
L2 cache:                                25 MiB (20 instances)
L3 cache:                                24 MiB (2 instances)
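
`lscpu` reports each cache level as a total across all instances, so dividing by the instance count gives a rough per-instance size. The sketch below assumes the line format shown above; `per_instance_cache` is a hypothetical helper. Treat the results as approximate, since `lscpu` rounds the totals and the two core types may not have identical cache sizes.

```python
import re

CACHE_LINE = re.compile(
    r"^(L\d[di]?) cache:\s+([\d.]+)\s+(KiB|MiB)\s+\((\d+) instances?\)"
)


def per_instance_cache(lscpu_cache_text):
    """Return {cache_name: (total_mib, instances, mib_per_instance)}."""
    result = {}
    for line in lscpu_cache_text.splitlines():
        match = CACHE_LINE.match(line.strip())
        if not match:
            continue
        name, size, unit, instances = match.groups()
        total_mib = float(size) / 1024 if unit == "KiB" else float(size)
        count = int(instances)
        result[name] = (total_mib, count, total_mib / count)
    return result


# Copied from the cell output above
sample = """\
L1d cache:                               1.3 MiB (20 instances)
L1i cache:                               1.3 MiB (20 instances)
L2 cache:                                25 MiB (20 instances)
L3 cache:                                24 MiB (2 instances)
"""

caches = per_instance_cache(sample)
for name, (total, count, each) in caches.items():
    print(f"{name}: {total} MiB total / {count} = {each:.3f} MiB each")
```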


---

## Part 3: Memory Information with free

### Concept Explanation

The DGX Spark's **unified memory** is its superpower. Unlike traditional systems where CPU and GPU have separate memory pools, here they share 128GB of LPDDR5X memory.

**Why this matters:**
- No need to copy data between CPU and GPU
- Can load larger models than typical GPU memory allows
- Simpler programming model

### Code Implementation

In [11]:
# Check memory information
!free -h

               total        used        free      shared  buff/cache   available
Mem:           119Gi        12Gi        66Gi       3.6Mi        42Gi       107Gi
Swap:             0B          0B          0B


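### 🔍 What Just Happened?

You should see:
- **total:** 119Gi in the run above. This is the unified memory pool reported in binary units (GiB), which works out to roughly 128GB in decimal units.
- **used:** Memory currently in use by system and applications
- **free:** Memory that is completely unused right now
- **buff/cache:** Memory used for disk caching (can be reclaimed)
- **available:** An estimate of how much memory new workloads can claim, counting reclaimable cache

⚠️ **Important:** The "buff/cache" memory can compete with GPU allocations! We'll learn how to clear it before loading large models.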

In [12]:
# Get more detailed memory info
!cat /proc/meminfo | head -20

MemTotal:       125511968 kB
MemFree:        69472388 kB
MemAvailable:   112483768 kB
Buffers:          583712 kB
Cached:         41396448 kB
SwapCached:            0 kB
Active:          7702780 kB
Inactive:       43433692 kB
Active(anon):    6795344 kB
Inactive(anon):  2372016 kB
Active(file):     907436 kB
Inactive(file): 41061676 kB
Unevictable:       26436 kB
Mlocked:           26436 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:               408 kB
Writeback:             0 kB
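
The gap between "119Gi" and "128GB" is only a unit difference, and the arithmetic is easy to check. In `/proc/meminfo` the `kB` suffix means units of 1024 bytes. The sketch below parses the first few lines of the output above (`parse_meminfo` is a hypothetical helper) and converts `MemTotal` both ways.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value in units of 1024 bytes}."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


KIB = 1024  # /proc/meminfo labels this unit "kB"

# First lines of the cell output above
sample = """MemTotal:       125511968 kB
MemFree:        69472388 kB
MemAvailable:   112483768 kB
Buffers:          583712 kB
Cached:         41396448 kB"""

mem = parse_meminfo(sample)
total_bytes = mem["MemTotal"] * KIB
print(f"MemTotal: {total_bytes / 2**30:.1f} GiB = {total_bytes / 1e9:.1f} GB")

# Buffers + Cached approximates the page cache that the kernel can reclaim
page_cache_gib = (mem["Buffers"] + mem["Cached"]) * KIB / 2**30
print(f"Buffers + Cached: {page_cache_gib:.1f} GiB")
```

To run it against a live system, read the text with `open("/proc/meminfo").read()` instead of using the sample string.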


In [13]:
# Check memory module size, speed, and type (needs root; sudo may prompt for a password, which a notebook cell cannot answer)
!sudo dmidecode -t memory 2>/dev/null | grep -E "Size|Speed|Type" | head -10 || echo "Note: dmidecode requires root access"

[sudo] password for trosfy: 


---

## Part 4: Storage Information with df

### Concept Explanation

AI models and datasets can be HUGE. A single 70B parameter model can take 40-140GB depending on quantization. Knowing your storage is essential for planning.
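
The size range comes from simple arithmetic: the weights take roughly (number of parameters) × (bits per parameter) / 8 bytes. Here is a minimal sketch of that estimate; `model_size_gb` is a hypothetical helper, and it counts weights only. Real model files add some overhead for metadata and for layers kept at higher precision, which is why 4-bit files tend to land closer to 40GB than 35GB.

```python
def model_size_gb(params_billion, bits_per_param):
    """Estimate weight storage in decimal GB: parameters x bits per parameter / 8."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


for label, bits in [("FP16/BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"70B @ {label}: ~{model_size_gb(70, bits):.0f} GB")
```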

### Code Implementation

In [14]:
# Check disk space
!df -h

Filesystem      Size  Used Avail Use% Mounted on
tmpfs            12G  3.1M   12G   1% /run
efivarfs        256K   20K  237K   8% /sys/firmware/efi/efivars
/dev/nvme0n1p2  3.7T  753G  2.8T  22% /
tmpfs            60G   96K   60G   1% /dev/shm
tmpfs           5.0M  8.0K  5.0M   1% /run/lock
/dev/nvme0n1p1  298M  7.4M  291M   3% /boot/efi
tmpfs            12G  156K   12G   1% /run/user/1001


In [15]:
# Check just the main partitions (filter out snap and temporary filesystems)
!df -h | grep -E "^/dev|Filesystem"

Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2  3.7T  753G  2.8T  22% /
/dev/nvme0n1p1  298M  7.4M  291M   3% /boot/efi


In [16]:
# Check where models are typically stored
!du -sh ~/.cache/huggingface 2>/dev/null || echo "Hugging Face cache not found yet"
!du -sh ~/.ollama 2>/dev/null || echo "Ollama models not found yet"

185G	/home/trosfy/.cache/huggingface
Ollama models not found yet
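
Before downloading a large model, it can help to check free space from Python as well. The sketch below uses the standard library's `shutil.disk_usage`; the helper names and the 50GB default reserve are illustrative assumptions, not recommendations from NVIDIA.

```python
import shutil


def free_space_gb(path="/"):
    """Return free space on the filesystem containing `path`, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9


def has_room_for(download_gb, path="/", reserve_gb=50):
    """True if `download_gb` fits while keeping `reserve_gb` free (assumed margin)."""
    return free_space_gb(path) - reserve_gb >= download_gb


print(f"Free on /: {free_space_gb('/'):.0f} GB")
print("Room for a 140 GB model:", has_room_for(140))
```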


---

## Part 5: Software Environment

### Concept Explanation

The DGX Spark runs DGX OS (based on Ubuntu 24.04 LTS). It comes with many AI tools pre-installed, but there's a critical thing to remember:

⚠️ **Standard pip-installed PyTorch does NOT work** on DGX Spark because it's ARM64 + CUDA. You must use NGC containers!
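
Whichever environment you end up in, you can check what it actually provides instead of guessing. The sketch below is one hedged way to do that: `describe_pytorch` is a hypothetical helper that reports whether PyTorch is importable and whether it can see a CUDA device. It returns a plain dictionary, so it also runs safely where PyTorch is not installed.

```python
import importlib.util
import platform


def describe_pytorch():
    """Report whether PyTorch is importable and whether it sees a CUDA device."""
    report = {
        "machine": platform.machine(),  # aarch64 on DGX Spark
        "torch_installed": False,
        "torch_version": None,
        "cuda_available": False,
    }
    if importlib.util.find_spec("torch") is None:
        return report
    try:
        import torch
    except ImportError:
        return report  # found on disk but not importable
    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_available"] = bool(torch.cuda.is_available())
    return report


pytorch_report = describe_pytorch()
print(pytorch_report)
```

Run it once on the host and once inside the NGC container to compare the two environments.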

### Code Implementation

In [17]:
# Check OS version
!cat /etc/os-release

PRETTY_NAME="Ubuntu 24.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.3 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo


In [18]:
# Check kernel version
!uname -a

Linux dgx-spark 6.14.0-1015-nvidia #15-Ubuntu SMP PREEMPT_DYNAMIC Tue Nov 25 18:02:16 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux


In [19]:
# Check CUDA version
!nvcc --version 2>/dev/null || echo "CUDA compiler not directly accessible - use NGC containers"

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:57:39_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0


In [20]:
# Check Docker installation
!docker --version

Docker version 28.5.1, build e180ab8


In [21]:
# Check if NVIDIA container runtime is installed
!docker info 2>/dev/null | grep -i nvidia || echo "Check docker info for NVIDIA runtime"

  cdi: nvidia.com/gpu=0
  cdi: nvidia.com/gpu=GPU-3d29cd16-ca97-2c38-7d79-d462cfa45fed
  cdi: nvidia.com/gpu=all
 Runtimes: runc io.containerd.runc.v2 nvidia
 Kernel Version: 6.14.0-1015-nvidia


In [22]:
# Check if Ollama is installed
!ollama --version 2>/dev/null || echo "Ollama not found"

ollama version is 0.13.4


---

## Part 6: Create Your System Specification Document

Now let's compile all this information into a reusable Python script.

In [23]:
import subprocess
import platform
from datetime import datetime

def run_command(cmd):
    """Run a shell command and return its output."""
    try:
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=30)
        return result.stdout.strip()
    except Exception as e:
        return f"Error: {e}"

def get_system_info():
    """Gather comprehensive system information for DGX Spark."""
    info = {
        "timestamp": datetime.now().isoformat(),
        "hostname": platform.node(),
        "os": {},
        "cpu": {},
        "memory": {},
        "gpu": {},
        "storage": {},
        "software": {}
    }
    
    # OS Information
    info["os"]["system"] = platform.system()
    info["os"]["release"] = platform.release()
    # Note: PRETTY_NAME is stored with double quotes, so the quotes appear in the report
    info["os"]["version"] = run_command("cat /etc/os-release | grep PRETTY_NAME | cut -d'=' -f2")
    info["os"]["architecture"] = platform.machine()
    
    # CPU Information
    # lscpu prints one 'Model name' line per core type, so this value can span several lines
    info["cpu"]["model"] = run_command("lscpu | grep 'Model name' | cut -d':' -f2").strip()
    info["cpu"]["cores"] = run_command("nproc")
    info["cpu"]["architecture"] = run_command("lscpu | grep 'Architecture' | cut -d':' -f2").strip()
    
    # Memory Information
    mem_total = run_command("free -g | grep Mem | awk '{print $2}'")
    mem_available = run_command("free -g | grep Mem | awk '{print $7}'")
    info["memory"]["total_gb"] = mem_total
    info["memory"]["available_gb"] = mem_available
    
    # GPU Information
    info["gpu"]["name"] = run_command("nvidia-smi --query-gpu=name --format=csv,noheader")
    info["gpu"]["memory_total"] = run_command("nvidia-smi --query-gpu=memory.total --format=csv,noheader")
    info["gpu"]["driver_version"] = run_command("nvidia-smi --query-gpu=driver_version --format=csv,noheader")
    info["gpu"]["cuda_version"] = run_command("nvidia-smi | grep 'CUDA Version' | awk '{print $9}'")
    
    # Storage Information
    info["storage"]["root_total"] = run_command("df -h / | tail -1 | awk '{print $2}'")
    info["storage"]["root_available"] = run_command("df -h / | tail -1 | awk '{print $4}'")
    
    # Software Versions
    info["software"]["docker"] = run_command("docker --version 2>/dev/null | cut -d' ' -f3 | tr -d ','")
    info["software"]["ollama"] = run_command("ollama --version 2>/dev/null")
    info["software"]["python"] = platform.python_version()
    
    return info

# Get and display system information
system_info = get_system_info()

print("=" * 60)
print("DGX SPARK SYSTEM SPECIFICATION")
print("=" * 60)
print(f"\nGenerated: {system_info['timestamp']}")
print(f"Hostname: {system_info['hostname']}")
print("\n" + "-" * 40)
print("OPERATING SYSTEM")
print("-" * 40)
print(f"  OS: {system_info['os']['version']}")
print(f"  Kernel: {system_info['os']['release']}")
print(f"  Architecture: {system_info['os']['architecture']}")
print("\n" + "-" * 40)
print("CPU")
print("-" * 40)
print(f"  Model: {system_info['cpu']['model']}")
print(f"  Cores: {system_info['cpu']['cores']}")
print(f"  Architecture: {system_info['cpu']['architecture']}")
print("\n" + "-" * 40)
print("MEMORY")
print("-" * 40)
print(f"  Total: {system_info['memory']['total_gb']} GB")
print(f"  Available: {system_info['memory']['available_gb']} GB")
print("\n" + "-" * 40)
print("GPU")
print("-" * 40)
print(f"  Model: {system_info['gpu']['name']}")
print(f"  Memory: {system_info['gpu']['memory_total']}")
print(f"  Driver: {system_info['gpu']['driver_version']}")
print(f"  CUDA: {system_info['gpu']['cuda_version']}")
print("\n" + "-" * 40)
print("STORAGE")
print("-" * 40)
print(f"  Root Partition: {system_info['storage']['root_total']} total, {system_info['storage']['root_available']} available")
print("\n" + "-" * 40)
print("SOFTWARE")
print("-" * 40)
print(f"  Docker: {system_info['software']['docker']}")
print(f"  Ollama: {system_info['software']['ollama']}")
print(f"  Python: {system_info['software']['python']}")
print("\n" + "=" * 60)

DGX SPARK SYSTEM SPECIFICATION

Generated: 2026-01-02T07:44:26.707442
Hostname: dgx-spark

----------------------------------------
OPERATING SYSTEM
----------------------------------------
  OS: "Ubuntu 24.04.3 LTS"
  Kernel: 6.14.0-1015-nvidia
  Architecture: aarch64

----------------------------------------
CPU
----------------------------------------
  Model: Cortex-X925
                              Cortex-A725
  Cores: 20
  Architecture: aarch64

----------------------------------------
MEMORY
----------------------------------------
  Total: 119 GB
  Available: 107 GB

----------------------------------------
GPU
----------------------------------------
  Model: NVIDIA GB10
  Memory: [N/A]
  Driver: 580.95.05
  CUDA: 13.0

----------------------------------------
STORAGE
----------------------------------------
  Root Partition: 3.7T total, 2.8T available

----------------------------------------
SOFTWARE
----------------------------------------
  Docker: 28.5.1
  Ollama: ollama
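
The report above has two rough edges: the OS name keeps its double quotes, and the CPU model spans two lines because the system has two core types. A minimal sketch of helpers that would tidy both values follows. `unquote` and `join_models` are hypothetical names, and wiring them into `get_system_info()` is left as an exercise.

```python
def unquote(value):
    """Strip whitespace and the surrounding double quotes used in /etc/os-release."""
    return value.strip().strip('"')


def join_models(lscpu_models):
    """Collapse a multi-line 'Model name' value into a single 'A + B' string."""
    names = []
    for line in lscpu_models.splitlines():
        name = line.strip()
        if name and name not in names:  # keep order, drop duplicates
            names.append(name)
    return " + ".join(names)


# Inputs copied from the report above
print(unquote('"Ubuntu 24.04.3 LTS"'))
print(join_models("Cortex-X925\n                              Cortex-A725"))
```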

---

## ⚠️ Common Mistakes

### Mistake 1: Trying to pip install PyTorch

```bash
# ❌ Wrong way - This will NOT work on DGX Spark!
pip install torch torchvision

# ✅ Right way - Use NGC containers
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:25.11-py3 python -c "import torch; print(torch.cuda.is_available())"
```
**Why:** Standard PyTorch wheels are built for x86 architecture. DGX Spark uses ARM64, so you need specially compiled versions from NGC.

### Mistake 2: Not clearing buffer cache before loading large models

```bash
# ❌ Wrong way - Loading a large model while the page cache is full
ollama run qwen3:32b  # Might fail with OOM!

# ✅ Right way - Clear cache first
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
ollama run qwen3:32b  # More memory is free for the model
```
**Why:** Linux aggressively caches disk data in RAM. This memory competes with GPU allocations on unified memory systems.

### Mistake 3: Forgetting --gpus all flag

```bash
# ❌ Wrong way - No GPU access
docker run -it nvcr.io/nvidia/pytorch:25.11-py3 bash

# ✅ Right way - GPU enabled
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:25.11-py3 bash
```
**Why:** Docker doesn't expose GPUs to containers by default. The `--gpus all` flag makes every GPU visible inside the container through NVIDIA's container tooling.

---

## 🎉 Checkpoint

You've learned:
- ✅ How to use `nvidia-smi` to check GPU status
- ✅ How to use `lscpu` to understand the ARM64 CPU
- ✅ How to use `free` to monitor unified memory
- ✅ How to use `df` to check storage capacity
- ✅ The importance of NGC containers for PyTorch

---

## 🚀 Challenge (Optional)

Create a monitoring script that runs every 5 seconds and displays:
1. GPU memory usage
2. GPU temperature
3. System memory usage

Use `watch` command or write a Python loop with `time.sleep()`.

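<details>
<summary>💡 Solution Hint</summary>

```bash
watch -n 5 "nvidia-smi --query-gpu=memory.used,memory.free,temperature.gpu --format=csv && echo '---' && free -h | head -2"
```

On DGX Spark the `memory.*` fields may print `[N/A]`, as they did in Part 1, so rely on the `free -h` lines for memory usage.
</details>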
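
If you prefer the Python route, the loop could look something like the sketch below. It is one possible approach rather than a reference solution. The function names are hypothetical, memory usage is taken from `/proc/meminfo` (because the GPU memory fields reported `[N/A]` in Part 1), and the demo call runs a single iteration so the cell returns immediately.

```python
import os
import shutil
import subprocess
import time


def gpu_status():
    """Return nvidia-smi's temperature and power reading, or a short note if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not found"
    cmd = ["nvidia-smi", "--query-gpu=temperature.gpu,power.draw",
           "--format=csv,noheader"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (subprocess.TimeoutExpired, OSError) as exc:
        return f"nvidia-smi failed: {exc}"
    return result.stdout.strip() or result.stderr.strip()


def memory_usage_gib(meminfo_text):
    """Return (used_gib, total_gib), where used = MemTotal - MemAvailable."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])  # units of 1024 bytes
    total = fields["MemTotal"] / 2**20
    used = (fields["MemTotal"] - fields["MemAvailable"]) / 2**20
    return used, total


def monitor(iterations=3, interval=5.0):
    """Print one status line per iteration, sleeping `interval` seconds in between."""
    for i in range(iterations):
        with open("/proc/meminfo") as handle:
            used, total = memory_usage_gib(handle.read())
        print(f"GPU: {gpu_status()} | RAM: {used:.1f}/{total:.1f} GiB used")
        if i < iterations - 1:
            time.sleep(interval)


# Single quick iteration as a demo; use monitor(iterations=12, interval=5) for a minute of data
if os.path.exists("/proc/meminfo"):
    monitor(iterations=1, interval=0)
```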

In [None]:
# YOUR CHALLENGE CODE HERE


---

## 📖 Further Reading

- [DGX Spark User Guide](https://docs.nvidia.com/dgx/dgx-spark/)
- [NVIDIA Grace CPU Architecture](https://www.nvidia.com/en-us/data-center/grace-cpu/)
- [Blackwell GPU Architecture](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/)
- [NGC Container Catalog](https://catalog.ngc.nvidia.com/)

---

## 🧹 Cleanup

No cleanup needed for this notebook - we only ran read-only commands!

In [24]:
# Cleanup cell - good practice even for read-only notebooks
import gc
gc.collect()
print("✅ Cleanup complete!")

✅ Cleanup complete!


In [25]:
print("Great job completing Lab 1.1.1: System Exploration!")
print("\nNext up: Lab 1.1.2 - Memory Architecture Lab")
print("You'll learn how unified memory really works by allocating tensors of various sizes.")

Great job completing Lab 1.1.1: System Exploration!

Next up: Lab 1.1.2 - Memory Architecture Lab
You'll learn how unified memory really works by allocating tensors of various sizes.
