# Live Demo: How Do You Run a Model That Doesn't Fit on a GPU?

**Model**: Qwen2.5 72B Instruct (72 billion parameters)  
**Hardware**: 2x NVIDIA A100 80GB (SXM)  

We have a problem: in FP16 this model's weights take about 144 GB, but each GPU has only 80 GB of memory.  
Let's figure out how to make it work -- and compare different approaches.

---

## Step 0: Setup

Install libraries, log in to HuggingFace, and download the model.

### 0.0 Clean GPU slate

Run this FIRST every time you start (or restart) the notebook.  
Kills any leftover processes from previous runs that might be hogging GPU memory.

In [1]:
import subprocess, os, signal

def nuke_gpu_processes():
    """Kill ALL processes using GPUs (except this Jupyter kernel)."""
    my_pid = os.getpid()
    killed = []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
            capture_output=True, text=True
        )
        for line in result.stdout.strip().split("\n"):
            line = line.strip()
            if line and line.isdigit():
                pid = int(line)
                if pid != my_pid:
                    try:
                        os.kill(pid, signal.SIGKILL)
                        killed.append(pid)
                    except (ProcessLookupError, PermissionError):
                        # Already exited, or owned by another user: skip it
                        pass
    except Exception as e:
        print(f"Warning: {e}")

    # Also clear PyTorch cache if available
    try:
        import torch
        import gc
        gc.collect()
        torch.cuda.empty_cache()
    except ImportError:
        pass

    if killed:
        print(f"Killed {len(killed)} leftover GPU processes: {killed}")
    else:
        print("No leftover GPU processes found.")

    # Show clean state
    os.system("nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv")

nuke_gpu_processes()

No leftover GPU processes found.
index, memory.used [MiB], memory.total [MiB]
0, 1 MiB, 81920 MiB
1, 1 MiB, 81920 MiB


### 0.1 Install dependencies

In [2]:
!pip install transformers accelerate bitsandbytes huggingface_hub sentencepiece protobuf -q
!pip install vllm -q

[notice] A new release of pip is available: 24.2 -> 26.0.1
[notice] To update, run: python -m pip install --upgrade pip


### 0.2 Log in to HuggingFace

HuggingFace is the largest hub for open-source AI models.  
To download models, you need a free account and an access token.

**How to get your token:**
1. Go to https://huggingface.co and create a free account
2. Go to https://huggingface.co/settings/tokens
3. Click **New token** -- name it anything -- select **Read** access -- click **Create**
4. Copy the token (starts with `hf_`) and paste it below

In [None]:
from huggingface_hub import login

# Paste your HuggingFace token here:
HF_TOKEN = "YOUR_HUGGINGFACE_TOKEN"

login(token=HF_TOKEN)
print("Logged in to HuggingFace!")

Logged in to HuggingFace!
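
Pasting a token straight into a cell makes it easy to leak when the notebook is shared. Here is a minimal sketch of one alternative, assuming the token has been exported as an environment variable named `HF_TOKEN`. The helper `resolve_hf_token` is hypothetical and not part of `huggingface_hub`.

```python
import os
from getpass import getpass


def resolve_hf_token(env=None, prompt_fn=getpass):
    """Return a token from the environment, or prompt for it without echoing.

    Hypothetical helper for this notebook; not a huggingface_hub function.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        token = prompt_fn("HuggingFace token: ").strip()
    if not token.startswith("hf_"):
        raise ValueError("Expected a token starting with 'hf_'")
    return token


# Usage in the notebook (uncomment to run):
# from huggingface_hub import login
# login(token=resolve_hf_token())
```

The token never appears in the saved notebook, and a typo is caught before any network call.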


### 0.3 Download the model from HuggingFace

We are downloading **Qwen2.5 72B Instruct** -- a 72 billion parameter model from Alibaba.  
The files total ~136GB. This will take a few minutes.

In [4]:
from huggingface_hub import snapshot_download

MODEL_NAME = "Qwen/Qwen2.5-72B-Instruct"

print(f"Downloading: {MODEL_NAME}")
print(f"72 billion parameters = ~136GB of files")
print()

model_path = snapshot_download(MODEL_NAME, cache_dir="/workspace/models")

print(f"\nDownload complete!")
print(f"Saved to: {model_path}")

Downloading: Qwen/Qwen2.5-72B-Instruct
72 billion parameters = ~136GB of files



Fetching 47 files:   0%|          | 0/47 [00:00<?, ?it/s]


Download complete!
Saved to: /workspace/models/models--Qwen--Qwen2.5-72B-Instruct/snapshots/495f39366efef23836d0cfae4fbe635880d2be31


---

## Part 1: The Problem -- This Model Doesn't Fit

Let's check our hardware and do the math.

### 1.1 Check our GPUs

In [5]:
import torch

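def gpu_status():
    """Print GPU memory allocated by PyTorch in this process."""
    for i in range(torch.cuda.device_count()):
        # Decimal GB (1e9 bytes): an 80 GiB card shows up as ~85 GB here
        total = torch.cuda.get_device_properties(i).total_memory / 1e9
        # Counts only tensors allocated by this process, not other processes
        used = torch.cuda.memory_allocated(i) / 1e9
        print(f"  GPU {i} ({torch.cuda.get_device_name(i)}): {used:.1f} GB used / {total:.0f} GB total")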

print(f"GPUs available: {torch.cuda.device_count()}")
print()
gpu_status()
print()
print("Both GPUs are empty. Let's try to load the model.")

GPUs available: 2

  GPU 0 (NVIDIA A100-SXM4-80GB): 0.0 GB used / 85 GB total
  GPU 1 (NVIDIA A100-SXM4-80GB): 0.0 GB used / 85 GB total

Both GPUs are empty. Let's try to load the model.


### 1.2 Do the math

The model has **72 billion parameters**.  
Each parameter is stored as a 16-bit float (FP16) = **2 bytes**.  

```
72,000,000,000 params x 2 bytes = 144,000,000,000 bytes = 144 GB
```
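
The same arithmetic can be sketched as a small helper, so it is easy to try other precisions. This counts the weights only; activations and the KV cache need extra room on top. The function name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 72e9          # parameter count used in this notebook
GPU_MEMORY_GB = 80     # one A100 80GB

for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    need = weight_memory_gb(PARAMS, bits)
    verdict = "fits" if need < GPU_MEMORY_GB else "does NOT fit"
    print(f"{label:>5}: {need:6.1f} GB -> {verdict} on one {GPU_MEMORY_GB} GB GPU")
```

Note that "fits" here is optimistic: 8-bit weights at 72 GB would leave very little headroom for anything else.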

One A100 GPU has **80 GB** of VRAM.  

**144 GB > 80 GB. The model does not fit on one GPU.**

But let's try anyway...

### 1.3 Try to load on 1 GPU -- watch it crash

In [None]:
from transformers import AutoModelForCausalLM
import gc

print("Attempting to load 144GB model onto 1 GPU (80GB)...")
print("This WILL fail.")
print()

try:
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map={"":  "cuda:0"},  # Force everything onto GPU 0
    )
except Exception as e:
    # CRITICAL: Clean up the partial allocation from the failed load!
    # Without this, device_map="auto" will see less free VRAM and
    # offload layers to CPU, making inference ~20x slower.
    try:
        del model
    except NameError:
        pass
    gc.collect()
    torch.cuda.empty_cache()

    print(f"CRASHED: {type(e).__name__}")
    print(f"\n{e}")
    print("\n" + "="*60)
    print("The model is 144GB. The GPU has 80GB. It doesn't fit.")
    print("We need a solution.")
    print()
    gpu_status()  # Verify GPUs are clean after cleanup

### 1.4 So what do we do?

We have 3 main strategies:

1. **Pipeline Parallelism** -- Split the model's layers across multiple GPUs. GPU 0 gets the first half of layers, GPU 1 gets the second half. Data flows through like an assembly line.

2. **Quantization** -- Use smaller numbers (4-bit instead of 16-bit) so the model shrinks to fit on 1 GPU.

3. **Tensor Parallelism** -- Split individual weight matrices across GPUs. Each GPU computes a partial result, then they combine. This is what production systems use.

**Our plan:**
- First, we'll try Pipeline Parallelism and Quantization using **HuggingFace** (easy to understand)
- Then, we'll learn how Tensor Parallelism works under the hood
- Finally, we'll use **vLLM** to do a **fair speed comparison** between Tensor and Pipeline Parallelism (same engine, different strategies)

Let's start!

---

## Part 2: Pipeline Parallelism with HuggingFace (2 GPUs)

**Idea**: Don't split any individual matrix. Instead, put some layers on GPU 0 and other layers on GPU 1.  
Data flows from GPU 0 to GPU 1, like an assembly line.

```
Input --> [GPU 0: Layers 0-39] --> transfer --> [GPU 1: Layers 40-79] --> Output
```

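HuggingFace does this automatically with `device_map="auto"`. Note that this is the naive form of pipeline parallelism: a single request passes through the GPUs one after another, so only one GPU is computing at any moment.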
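
To make the assembly-line idea concrete, here is a toy sketch in plain Python. It is not how HuggingFace decides placement (that is based on available memory), and the "layers" are just functions that add a number. The helpers `split_layers` and `forward` are invented for this illustration.

```python
def split_layers(num_layers, num_devices):
    """Assign contiguous blocks of layers to devices, as evenly as possible."""
    base, extra = divmod(num_layers, num_devices)
    assignment, start = {}, 0
    for dev in range(num_devices):
        count = base + (1 if dev < extra else 0)
        assignment[dev] = range(start, start + count)
        start += count
    return assignment


def forward(x, layers, assignment):
    """Run layers in order, recording which device is busy at each step."""
    trace = []
    for dev, layer_ids in assignment.items():
        for i in layer_ids:
            x = layers[i](x)
            trace.append(dev)
    return x, trace


layers = [lambda v, k=k: v + k for k in range(80)]   # toy "layers": add k
assignment = split_layers(80, 2)
out, trace = forward(0, layers, assignment)
switches = sum(a != b for a, b in zip(trace, trace[1:]))

print({dev: (r.start, r.stop - 1) for dev, r in assignment.items()})
print("device hand-offs during one forward pass:", switches)
```

The trace shows the key property: the work moves from device 0 to device 1 exactly once, and at every step only one device is doing anything.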

### 2.1 Load with pipeline parallelism

In [6]:
%%time
from transformers import AutoModelForCausalLM, AutoTokenizer

print("Loading 72B model with device_map='auto' (pipeline parallelism)...")
print("This splits LAYERS across GPUs.")
print()

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

print("Model loaded successfully!")

`torch_dtype` is deprecated! Use `dtype` instead!


Loading 72B model with device_map='auto' (pipeline parallelism)...
This splits LAYERS across GPUs.



Loading checkpoint shards:   0%|          | 0/37 [00:00<?, ?it/s]

Model loaded successfully!
CPU times: user 7min 24s, sys: 6min 36s, total: 14min
Wall time: 48.7 s


### 2.2 Proof: both GPUs are loaded

In [7]:
!nvidia-smi

Tue Feb 24 10:35:40 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:BD:00.0 Off |                    0 |
| N/A   28C    P0             82W /  500W |   68087MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00

### 2.3 Which layers are on which GPU?

This is pipeline parallelism: roughly the first half of the layers on GPU 0, the rest on GPU 1.

In [8]:
# Show the layer-to-GPU mapping
gpu0_layers = []
gpu1_layers = []

for name, param in model.named_parameters():
    if "layers." in name:
        layer_num = int(name.split("layers.")[1].split(".")[0])
        device = str(param.device)
        if "cuda:0" in device:
            if layer_num not in gpu0_layers:
                gpu0_layers.append(layer_num)
        elif "cuda:1" in device:
            if layer_num not in gpu1_layers:
                gpu1_layers.append(layer_num)

gpu0_layers.sort()
gpu1_layers.sort()

print("PIPELINE PARALLELISM: Layer-to-GPU Mapping")
print("=" * 50)
print(f"GPU 0: Layers {gpu0_layers[0]}-{gpu0_layers[-1]} ({len(gpu0_layers)} layers)")
print(f"GPU 1: Layers {gpu1_layers[0]}-{gpu1_layers[-1]} ({len(gpu1_layers)} layers)")
print()
print("Data flows: Input -> GPU 0 -> GPU 1 -> Output")
print("This is an assembly line: each GPU handles a different stage.")

PIPELINE PARALLELISM: Layer-to-GPU Mapping
GPU 0: Layers 0-38 (39 layers)
GPU 1: Layers 39-79 (41 layers)

Data flows: Input -> GPU 0 -> GPU 1 -> Output
This is an assembly line: each GPU handles a different stage.
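
A shorter way to get the same picture is to read the placement map directly instead of walking every parameter. The sketch below assumes the loaded model exposes `model.hf_device_map` (a dict from module name to device, set when a `device_map` is used). So that it runs on its own, it uses a stand-in dict shaped like the split reported above. `summarize_device_map` is a hypothetical helper.

```python
from collections import defaultdict


def summarize_device_map(device_map):
    """Group decoder-layer indices by device from a {module_name: device} dict."""
    layers_by_device = defaultdict(set)
    for module_name, device in device_map.items():
        parts = module_name.split(".")
        if "layers" in parts:
            pos = parts.index("layers") + 1
            if pos < len(parts) and parts[pos].isdigit():
                layers_by_device[device].add(int(parts[pos]))
    return {
        device: (min(ids), max(ids), len(ids))
        for device, ids in layers_by_device.items()
    }


# Stand-in for model.hf_device_map, shaped like the split reported above.
example_map = {"model.embed_tokens": 0, "model.norm": 1, "lm_head": 1}
example_map.update({f"model.layers.{i}": 0 if i < 39 else 1 for i in range(80)})

for device, (first, last, count) in sorted(summarize_device_map(example_map).items()):
    print(f"device {device}: layers {first}-{last} ({count} layers)")
```

In the notebook, you would pass `model.hf_device_map` in place of `example_map`.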


### 2.4 Generate text and measure speed

In [9]:
import time
from transformers import TextStreamer

# A reasoning question with a clear, concise answer
prompt = "A farmer has 3 fields. Field A produces twice as much wheat as Field B. Field C produces 50kg more than Field A. Together all three fields produce 550kg. How much does each field produce?"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
num_input_tokens = inputs["input_ids"].shape[1]

print(f"Prompt: '{prompt}'")
print(f"\nGenerating with Pipeline Parallelism...\n")

# Custom streamer that also tracks time-to-first-token
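class TimedStreamer(TextStreamer):
    def __init__(self, tokenizer, **kwargs):
        super().__init__(tokenizer, **kwargs)
        self.first_token_time = None
        self._put_calls = 0

    def put(self, value):
        # generate() sends the prompt ids in the first put() call, so the
        # first generated token arrives with the second call.
        self._put_calls += 1
        if self._put_calls == 2 and self.first_token_time is None:
            self.first_token_time = time.time()
        super().put(value)

    def on_finalized_text(self, text, stream_end=False):
        print(text, end="", flush=True)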

streamer = TimedStreamer(tokenizer, skip_special_tokens=True)

start = time.time()
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        temperature=0.7,
        do_sample=True,
        streamer=streamer,
    )
pp_total_time = time.time() - start

pp_tokens = outputs.shape[1] - num_input_tokens
pp_ttft = streamer.first_token_time - start if streamer.first_token_time else pp_total_time
pp_speed = pp_tokens / pp_total_time

print(f"\n\n{'='*60}")
print(f"PIPELINE PARALLELISM RESULTS (HuggingFace, 2 GPUs):")
print(f"  Total time:                 {pp_total_time:.1f}s")
print(f"  Tokens generated:           {pp_tokens}")
print(f"  Speed:                      {pp_speed:.1f} tokens/sec")
print(f"\nCorrect answer: Field B = 100kg, Field A = 200kg, Field C = 250kg")

Prompt: 'A farmer has 3 fields. Field A produces twice as much wheat as Field B. Field C produces 50kg more than Field A. Together all three fields produce 550kg. How much does each field produce?'

Generating with Pipeline Parallelism...

A farmer has 3 fields. Field A produces twice as much wheat as Field B. Field C produces 50kg more than Field A. Together all three fields produce 550kg. How much does each field produce? Let's denote the amount of wheat produced by Field B as \( x \) kg.

Given:
- Field A produces twice as much wheat as Field B, so Field A produces \( 2x \) kg.
- Field C produces 50 kg more than Field A, so Field C produces \( 2x + 50 \) kg.
- The total production from all three fields is 550 kg.

We can set up the following equation to represent the total production:

\[
x + 2x + (2x + 50) = 550
\]

Simplify the equation:

\[
x + 2x + 2x + 50 = 550
\]

Combine like terms:

\[
5x + 50 = 550
\]

Subtract 50 from both sides:

\[
5x = 500
\]

Divide both sides by 5:

\

---

## Part 3: Quantization with HuggingFace (1 GPU)

What if instead of splitting the model across GPUs, we **shrink** it?

Each weight is stored as a 16-bit float (FP16 = 2 bytes).  
What if we used 4-bit values instead (0.5 bytes each)? The code below uses the `nf4` format from bitsandbytes; this notebook labels it INT4 for short.

```
FP16: 72B x 2 bytes   = 144 GB --> needs 2 GPUs
INT4: 72B x 0.5 bytes = 36 GB  --> fits on 1 GPU!
```

The tradeoff: less precision means slightly lower quality.  
Like compressing a photo from RAW to JPEG -- smaller file, almost the same picture.
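
To see where the precision loss comes from, here is a minimal sketch of 4-bit quantization on one block of weights. It uses a simple uniform scheme (scale by the block's largest absolute value, round to an integer in [-7, 7]), chosen for clarity. The level placement and kernels in bitsandbytes differ, so treat this as an illustration of the idea, not the library's method.

```python
import random


def quantize_block(values, bits=4):
    """Symmetric absmax quantization of one block to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                         # 7 for 4-bit
    scale = (max(abs(v) for v in values) / qmax) or 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]   # one block of 64 weights

codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print("distinct levels used:", len(set(codes)))
print(f"max abs error: {max_err:.5f} (half a quantization step is {scale / 2:.5f})")
```

Each weight is now one of at most 15 levels plus a shared scale per block, and the rounding error is bounded by half a step. That bounded error is the "slightly lower quality" in the tradeoff above.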

### 3.1 First, unload the FP16 model

We need to free GPU memory before loading the quantized version.

In [10]:
import gc

del model
gc.collect()
torch.cuda.empty_cache()

print("FP16 model unloaded. GPU memory freed.")
gpu_status()

FP16 model unloaded. GPU memory freed.
  GPU 0 (NVIDIA A100-SXM4-80GB): 0.0 GB used / 85 GB total
  GPU 1 (NVIDIA A100-SXM4-80GB): 2.5 GB used / 85 GB total


### 3.2 Load the same model in 4-bit

In [11]:
%%time
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

print("Loading the SAME 72B model in 4-bit quantization...")
print("72B x 0.5 bytes = ~36GB. This should fit on a SINGLE GPU.")
print()

# Note: device_map="auto" may spread layers over every visible GPU.
# To pin the whole model to GPU 0, pass device_map={"": "cuda:0"} instead.
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quant_config,
    device_map="auto",
)

print("\n4-bit model loaded!")

Loading the SAME 72B model in 4-bit quantization...
72B x 0.5 bytes = ~36GB. This should fit on a SINGLE GPU.



Loading checkpoint shards:   0%|          | 0/37 [00:00<?, ?it/s]


4-bit model loaded!
CPU times: user 7min 28s, sys: 6min 55s, total: 14min 24s
Wall time: 53.1 s


### 3.3 Proof: it fits on 1 GPU!

In [12]:
!nvidia-smi

Tue Feb 24 10:37:21 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:BD:00.0 Off |                    0 |
| N/A   28C    P0             82W /  500W |   29377MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00

Compare to Part 2, where we needed **both** GPUs and GPU 0 alone held 68,087 MiB.  
Now the same 72B model is small enough for **one** GPU thanks to quantization.

### 3.4 Generate text -- does quality hold up?

Same prompt, same token limit. Let's see if the quantized model can still reason correctly.

In [13]:
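# Re-define TimedStreamer (in case of kernel restart)
import time
from transformers import TextStreamer

class TimedStreamer(TextStreamer):
    def __init__(self, tokenizer, **kwargs):
        super().__init__(tokenizer, **kwargs)
        self.first_token_time = None
        self._put_calls = 0

    def put(self, value):
        # generate() sends the prompt ids in the first put() call, so the
        # first generated token arrives with the second call.
        self._put_calls += 1
        if self._put_calls == 2 and self.first_token_time is None:
            self.first_token_time = time.time()
        super().put(value)

    def on_finalized_text(self, text, stream_end=False):
        print(text, end="", flush=True)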

# Same prompt as Pipeline Parallelism
prompt = "A farmer has 3 fields. Field A produces twice as much wheat as Field B. Field C produces 50kg more than Field A. Together all three fields produce 550kg. How much does each field produce?"

inputs = tokenizer(prompt, return_tensors="pt").to(model_4bit.device)
num_input_tokens = inputs["input_ids"].shape[1]

print(f"Prompt: '{prompt}'")
print(f"\nGenerating with 4-bit Quantization (1 GPU)...\n")

streamer = TimedStreamer(tokenizer, skip_special_tokens=True)

start = time.time()
with torch.no_grad():
    outputs = model_4bit.generate(
        **inputs,
        max_new_tokens=2048,
        temperature=0.7,
        do_sample=True,
        streamer=streamer,
    )
q_total_time = time.time() - start

q_tokens = outputs.shape[1] - num_input_tokens
q_speed = q_tokens / q_total_time

print(f"\n\n{'='*60}")
print(f"QUANTIZED INT4 RESULTS (HuggingFace, 1 GPU):")
print(f"  Total time:                 {q_total_time:.1f}s")
print(f"  Tokens generated:           {q_tokens}")
print(f"  Speed:                      {q_speed:.1f} tokens/sec")
print(f"\nCorrect answer: Field B = 100kg, Field A = 200kg, Field C = 250kg")
print(f"Compare the reasoning quality to the FP16 version above.")
print()
print(f"HuggingFace COMPARISON:")
print(f"  Pipeline Parallel (2 GPUs, FP16):  {pp_speed:>6.1f} tok/s  ({pp_tokens} tokens in {pp_total_time:.1f}s)")
print(f"  Quantized INT4    (1 GPU,  INT4):  {q_speed:>6.1f} tok/s  ({q_tokens} tokens in {q_total_time:.1f}s)")

Prompt: 'A farmer has 3 fields. Field A produces twice as much wheat as Field B. Field C produces 50kg more than Field A. Together all three fields produce 550kg. How much does each field produce?'

Generating with 4-bit Quantization (1 GPU)...

A farmer has 3 fields. Field A produces twice as much wheat as Field B. Field C produces 50kg more than Field A. Together all three fields produce 550kg. How much does each field produce? Let's denote the amount of wheat produced by Field B as \( x \) kg.

According to the problem:
- Field A produces twice as much wheat as Field B, so Field A produces \( 2x \) kg.
- Field C produces 50 kg more than Field A, so Field C produces \( 2x + 50 \) kg.

The total production from all three fields is 550 kg. Therefore, we can set up the following equation:

\[
x + 2x + (2x + 50) = 550
\]

Simplify the equation:

\[
x + 2x + 2x + 50 = 550
\]

Combine like terms:

\[
5x + 50 = 550
\]

Subtract 50 from both sides:

\[
5x = 500
\]

Divide both sides by 5:

\

### 3.5 Save HuggingFace results and restart kernel

We need to fully release GPU memory before loading vLLM.  
The safest way: **save our results, restart the kernel**, then continue.

In [14]:
import json

# Save all HuggingFace results so we can compare later
hf_results = {
    "pp_speed": pp_speed,
    "pp_ttft": pp_ttft,
    "pp_total_time": pp_total_time,
    "pp_tokens": pp_tokens,
    "q_speed": q_speed,
    "q_total_time": q_total_time,
    "q_tokens": q_tokens,
    "model_path": model_path,
}
with open("/workspace/hf_results.json", "w") as f:
    json.dump(hf_results, f)

print("HuggingFace results saved to /workspace/hf_results.json")
print(f"  Pipeline Parallel: {pp_speed:.1f} tok/s | {pp_tokens} tokens in {pp_total_time:.1f}s")
print(f"  Quantized INT4:    {q_speed:.1f} tok/s | {q_tokens} tokens in {q_total_time:.1f}s")
print()
print("=" * 60)
print("  NOW: Restart the kernel (Kernel -> Restart)")
print("  Then continue from Part 4 below")
print("=" * 60)

HuggingFace results saved to /workspace/hf_results.json
  Pipeline Parallel: 8.7 tok/s | 409 tokens in 47.0s
  Quantized INT4:    12.3 tok/s | 408 tokens in 33.2s

  NOW: Restart the kernel (Kernel -> Restart)
  Then continue from Part 4 below


---

## Part 4: Understanding Tensor Parallelism (Concept)

Pipeline parallelism splits **layers** across GPUs. But there's another approach:  
**Tensor parallelism splits individual weight matrices.**

Instead of giving each GPU complete layers, we give each GPU **half of every matrix**.  
Each GPU computes a partial result, then they combine.

```
PIPELINE PARALLELISM                    TENSOR PARALLELISM
(each GPU has COMPLETE layers)          (each GPU has HALF of EVERY layer)

┌─────────────┐                         ┌──────────┬──────────┐
│   GPU 0     │                         │  GPU 0   │  GPU 1   │
│             │                         │          │          │
│  Layer 0    │                         │ Layer 0  │ Layer 0  │
│  Layer 1    │                         │ (left)   │ (right)  │
│  ...        │                         │          │          │
│  Layer 39   │                         │ Layer 1  │ Layer 1  │
├─────────────┤  data transfer          │ (left)   │ (right)  │
│   GPU 1     │  ↓ between GPUs         │          │          │
│             │                         │ ...      │ ...      │
│  Layer 40   │                         │          │          │
│  Layer 41   │                         │ Layer 79 │ Layer 79 │
│  ...        │                         │ (left)   │ (right)  │
│  Layer 79   │                         │          │          │
└─────────────┘                         └──────────┴──────────┘
Sequential flow                         Parallel computation
1 big transfer between GPUs             Many small transfers (AllReduce)
```

**Why is TP faster?** When a single request is being served, pipeline parallelism leaves **pipeline bubbles**: GPU 1 sits idle while GPU 0 processes its layers, then vice versa. Tensor parallelism keeps **both GPUs busy simultaneously** on every layer.

The tradeoff: TP requires **fast GPU-to-GPU communication** (NVLink at 600+ GB/s). PP only needs occasional transfers. This is why:
- **TP is used within a server** (GPUs connected by NVLink)
- **PP is used across servers** (connected by slower InfiniBand)

We'll see the speed difference in Part 6 using vLLM.
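
Before moving on, here is a toy sketch of what "half of every matrix" means, using plain Python lists in place of GPU tensors. It shows the two standard ways to shard a matrix multiply: by columns (each shard produces part of the output, then the parts are concatenated) and by rows (each shard produces a partial sum, then the sums are added, which is the AllReduce step). The helper names are made up for this example.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]


x = [[1.0, 2.0, 3.0, 4.0]]                      # one token, hidden size 4
w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1], [1, 2, 1, 0]]

# Column parallel: each "GPU" holds half the columns and produces half the outputs.
col_parts = [matmul(x, w_shard) for w_shard in split_columns(w, 2)]
y_col = [col_parts[0][0] + col_parts[1][0]]     # concatenate the two halves

# Row parallel: each "GPU" holds half the rows and sees half the input features;
# the partial results are summed, which is what AllReduce does.
x_shards = split_columns(x, 2)
row_parts = [matmul(xs, ws) for xs, ws in zip(x_shards, split_rows(w, 2))]
y_row = [[a + b for a, b in zip(row_parts[0][0], row_parts[1][0])]]

y_full = matmul(x, w)
print(y_full, y_col, y_row)
```

All three results match, which is the point: the math is unchanged, only where it runs and how the pieces are recombined.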

---

## Part 5: Data Parallelism (Training Concept)

We've seen two ways to split a model across GPUs:  
- **Pipeline Parallelism**: split layers  
- **Tensor Parallelism**: split matrices  

But there's a third type: **Data Parallelism**.

Data parallelism is primarily a **training** technique:  
- Make **complete copies** of the model on separate GPUs  
- Split the **training batch** across copies -- each GPU processes different data  
- After each step, **synchronize gradients** across all copies (AllReduce)  
- Every copy stays in sync and learns from all the data  

```
TRAINING with Data Parallelism:

  Training Batch (1024 samples)
       |
       ├──→ GPU 0: Model copy 1 (256 samples) ──→ gradients ──┐
       ├──→ GPU 1: Model copy 2 (256 samples) ──→ gradients ──┤ AllReduce
       ├──→ GPU 2: Model copy 3 (256 samples) ──→ gradients ──┤ (average)
       └──→ GPU 3: Model copy 4 (256 samples) ──→ gradients ──┘
                                                                │
                                                   All GPUs update weights
                                                   with averaged gradients
```

Data parallelism doesn't help fit a bigger model -- each GPU needs the full model.  
It helps **train faster** by processing more data in parallel.

For **inference**, a similar idea is used (multiple model replicas behind a load balancer),  
but it's simpler -- no gradient sync needed, each replica is fully independent.  

We can't demo this on 2 GPUs (the model already takes both GPUs),  
but this is how large-scale training works in practice.
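
The gradient-sync step is easy to simulate without any GPUs. In this minimal sketch, four "replicas" of a one-parameter model `y = w * x` each see a different shard of the data, compute a local gradient, and then average. `all_reduce_mean` is a stand-in written for this example, not a real collective-communication call.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def all_reduce_mean(values):
    """Stand-in for AllReduce: every replica ends up with the same average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)


data = [(float(x), 3.0 * x) for x in range(1, 9)]      # true weight is 3.0
shards = [data[i::4] for i in range(4)]                # 4 "GPUs", 2 samples each
replicas = [0.0] * 4                                   # identical starting weights

for step in range(50):
    local_grads = [gradient(w, shard) for w, shard in zip(replicas, shards)]
    synced = all_reduce_mean(local_grads)
    replicas = [w - 0.01 * g for w, g in zip(replicas, synced)]

print(replicas)
```

Because the shards are equal in size, the averaged gradient equals the full-batch gradient. Every replica therefore takes the same step and stays identical.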

### 5.1 Real-world training at scale: 3D Parallelism

How would you **train** a 744B parameter model (like GLM-5)?  
You need all 3 types of parallelism working together:

```
Model size: 744B params x 2 bytes = 1,488 GB (~1.5 TB)
One H100 GPU: 80 GB

Step 1 - Tensor Parallelism (within each server):
  8 GPUs per server, connected by NVLink (900 GB/s)
  Each GPU holds 1/8th of every weight matrix
  TP = 8

Step 2 - Pipeline Parallelism (across servers):
  4 servers, connected by InfiniBand (50 GB/s)
  Each server handles ~20 layers
  PP = 4

Step 3 - Data Parallelism (for throughput):
  33 complete copies of the 4-server setup
  Each copy trains on different data, then syncs gradients
  DP = 33

Total: TP(8) x PP(4) x DP(33) = 1,056 H100 GPUs
Cost: ~$30M in GPU hardware alone
```

This is called **3D parallelism** -- combining all three types.  
Frameworks like Megatron-LM and DeepSpeed make this possible.
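
The numbers in the plan above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch covering the weights only. Training also keeps gradients and optimizer state in memory, so the real per-GPU footprint is larger. `plan_3d` is a hypothetical helper.

```python
def plan_3d(params, bytes_per_param, tp, pp, dp, gpu_memory_gb):
    """Back-of-the-envelope numbers for a TP x PP x DP layout (weights only)."""
    model_gb = params * bytes_per_param / 1e9
    shard_gb = model_gb / (tp * pp)          # each model replica spans tp * pp GPUs
    return {
        "model_gb": model_gb,
        "gpus_per_replica": tp * pp,
        "weights_per_gpu_gb": shard_gb,
        "total_gpus": tp * pp * dp,
        "weights_fit": shard_gb < gpu_memory_gb,
    }


plan = plan_3d(params=744e9, bytes_per_param=2, tp=8, pp=4, dp=33, gpu_memory_gb=80)
for key, value in plan.items():
    print(f"{key:>20}: {value}")
```

With TP = 8 and PP = 4, each GPU holds 1/32 of the weights, and the DP factor only multiplies the GPU count.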

Now let's see a **fair speed comparison** using vLLM: Tensor Parallel vs Pipeline Parallel, same engine.

---

## Part 6: Fair Speed Comparison with vLLM

Earlier we used HuggingFace to demonstrate Pipeline Parallelism and Quantization.  
But HuggingFace's `model.generate()` is simple and unoptimized.

**vLLM** is a production inference engine with PagedAttention, fused kernels, and other optimizations.  
It supports both **Tensor Parallelism** and **Pipeline Parallelism** -- so we can compare them fairly.

Same engine, same model, different parallelism strategy: an **apples-to-apples comparison.**

> **If you restarted the kernel**, start running from here.

### 6.1 Load HuggingFace results + verify GPUs are clean

In [1]:
import torch, time, json

# Reload HuggingFace results from disk
with open("/workspace/hf_results.json", "r") as f:
    hf = json.load(f)

pp_speed = hf["pp_speed"]
pp_total_time = hf["pp_total_time"]
pp_tokens = hf["pp_tokens"]
q_speed = hf["q_speed"]
q_total_time = hf["q_total_time"]
q_tokens = hf["q_tokens"]
model_path = hf["model_path"]

print(f"Loaded HuggingFace results:")
print(f"  Pipeline Parallel (HF): {pp_speed:.1f} tok/s | {pp_tokens} tokens in {pp_total_time:.1f}s")
print(f"  Quantized INT4    (HF): {q_speed:.1f} tok/s | {q_tokens} tokens in {q_total_time:.1f}s")
print()

# Verify GPUs are clean
def gpu_status():
    for i in range(torch.cuda.device_count()):
        total = torch.cuda.get_device_properties(i).total_memory / 1e9
        used = torch.cuda.memory_allocated(i) / 1e9
        print(f"  GPU {i} ({torch.cuda.get_device_name(i)}): {used:.1f} GB used / {total:.0f} GB total")

gpu_status()
print("\nGPUs are clean. Ready for vLLM.")

prompt = "A farmer has 3 fields. Field A produces twice as much wheat as Field B. Field C produces 50kg more than Field A. Together all three fields produce 550kg. How much does each field produce?"

Loaded HuggingFace results:
  Pipeline Parallel (HF): 8.7 tok/s | 409 tokens in 47.0s
  Quantized INT4    (HF): 12.3 tok/s | 408 tokens in 33.2s

  GPU 0 (NVIDIA A100-SXM4-80GB): 0.0 GB used / 85 GB total
  GPU 1 (NVIDIA A100-SXM4-80GB): 0.0 GB used / 85 GB total

GPUs are clean. Ready for vLLM.


### 6.2 vLLM Tensor Parallel (2 GPUs)

**Tensor Parallelism**: split every weight matrix across GPUs. Both GPUs compute in parallel on every layer.

In [2]:
%%time
from vllm import LLM, SamplingParams

print("Loading 72B model with vLLM Tensor Parallelism...")
print("tensor_parallel_size=2")
print("This splits MATRICES across GPUs -- both GPUs work on every layer.")
print()

llm_tp = LLM(
    model=model_path,
    tensor_parallel_size=2,          # shard every weight matrix across 2 GPUs
    dtype="float16",
    gpu_memory_utilization=0.92,     # fraction of each GPU's memory vLLM may use
    max_model_len=2048,              # cap context length to keep the KV cache small
    enforce_eager=True,              # skip CUDA graph capture (saves memory, costs some speed)
    disable_custom_all_reduce=True,  # use NCCL all-reduce instead of vLLM's custom kernel
)

print("\nvLLM loaded with Tensor Parallelism!")

Loading 72B model with vLLM Tensor Parallelism...
tensor_parallel_size=2
This splits MATRICES across GPUs -- both GPUs work on every layer.

INFO 02-24 10:38:46 [utils.py:261] non-default args: {'dtype': 'float16', 'max_model_len': 2048, 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.92, 'disable_log_stats': True, 'enforce_eager': True, 'disable_custom_all_reduce': True, 'model': '/workspace/models/models--Qwen--Qwen2.5-72B-Instruct/snapshots/495f39366efef23836d0cfae4fbe635880d2be31'}
INFO 02-24 10:38:54 [model.py:541] Resolved architecture: Qwen2ForCausalLM
INFO 02-24 10:38:54 [model.py:1561] Using max model len 2048
INFO 02-24 10:38:54 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 02-24 10:38:54 [vllm.py:624] Asynchronous scheduling is enabled.
INFO 02-24 10:38:54 [vllm.py:762] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=3339) INFO 02-24 10:39:01 [core.py:96] Initializing a V1 LLM engine (v0.15.1) with config:

Loading safetensors checkpoint shards:   0% Completed | 0/37 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   3% Completed | 1/37 [00:01<00:54,  1.51s/it]
Loading safetensors checkpoint shards:   5% Completed | 2/37 [00:03<00:57,  1.65s/it]
Loading safetensors checkpoint shards:   8% Completed | 3/37 [00:04<00:56,  1.67s/it]
Loading safetensors checkpoint shards:  11% Completed | 4/37 [00:06<00:57,  1.73s/it]
Loading safetensors checkpoint shards:  14% Completed | 5/37 [00:08<00:55,  1.74s/it]
Loading safetensors checkpoint shards:  16% Completed | 6/37 [00:10<00:54,  1.76s/it]
Loading safetensors checkpoint shards:  19% Completed | 7/37 [00:12<00:52,  1.74s/it]
Loading safetensors checkpoint shards:  22% Completed | 8/37 [00:13<00:51,  1.77s/it]
Loading safetensors checkpoint shards:  24% Completed | 9/37 [00:15<00:49,  1.76s/it]
Loading safetensors checkpoint shards:  27% Completed | 10/37 [00:17<00:47,  1.75s/it]
Loading safetensors checkpoint shards:  30% Completed | 11/37

(Worker_TP0 pid=3537) INFO 02-24 10:40:32 [default_loader.py:291] Loading weights took 65.44 seconds
(Worker_TP0 pid=3537) INFO 02-24 10:40:33 [gpu_model_runner.py:4130] Model loading took 67.8 GiB memory and 82.232013 seconds
(Worker_TP0 pid=3537) INFO 02-24 10:40:37 [gpu_worker.py:356] Available KV cache memory: 3.1 GiB
(EngineCore_DP0 pid=3339) INFO 02-24 10:40:37 [kv_cache_utils.py:1307] GPU KV cache size: 20,320 tokens
(EngineCore_DP0 pid=3339) INFO 02-24 10:40:37 [kv_cache_utils.py:1312] Maximum concurrency for 2,048 tokens per request: 9.92x
(EngineCore_DP0 pid=3339) INFO 02-24 10:40:39 [core.py:272] init engine (profile, create kv cache, warmup model) took 6.33 seconds
(EngineCore_DP0 pid=3339) INFO 02-24 10:40:41 [vllm.py:624] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=3339) INFO 02-24 10:40:41 [vllm.py:762] Cudagraph is disabled under eager mode
INFO 02-24 10:4
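
The KV cache lines in the log are worth a quick sanity check. The sketch below estimates how many tokens fit in the reported 3.1 GiB per GPU. The layer count (80) matches the layer mapping seen earlier. The number of KV heads (8) and the head dimension (128) are assumptions about this model's architecture and should be checked against its `config.json`.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes, tp_size):
    """Bytes of KV cache that one token occupies on each GPU."""
    per_layer = 2 * num_kv_heads * head_dim * dtype_bytes   # 2 = key + value
    return num_layers * per_layer // tp_size                # KV heads are split across TP ranks


# Assumed architecture values (verify in the model's config.json).
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128,
                               dtype_bytes=2, tp_size=2)

kv_cache_bytes = 3.1 * 1024**3      # "Available KV cache memory: 3.1 GiB" (rounded in the log)
tokens = int(kv_cache_bytes // per_token)

print(f"{per_token / 1024:.0f} KiB per token per GPU")
print(f"~{tokens:,} tokens of KV cache")
print(f"~{tokens / 2048:.2f} concurrent requests at 2,048 tokens each")
```

Under these assumptions the estimate lands within a few tokens of the 20,320 tokens and 9.92x concurrency that vLLM reports. The small gap is consistent with the log rounding 3.1 GiB. It also shows how little room is left once 67.8 GiB of weights sit on each 80 GB GPU.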

In [3]:
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=2048,
)

print(f"Generating with vLLM Tensor Parallelism...\n")

start = time.time()
outputs = llm_tp.generate([prompt], sampling_params)
vllm_tp_total_time = time.time() - start

generated = outputs[0].outputs[0].text
vllm_tp_tokens = len(outputs[0].outputs[0].token_ids)
vllm_tp_speed = vllm_tp_tokens / vllm_tp_total_time

print(prompt + generated)
print(f"\n{'='*60}")
print(f"vLLM TENSOR PARALLEL RESULTS (2 GPUs):")
print(f"  Total time:                 {vllm_tp_total_time:.1f}s")
print(f"  Tokens generated:           {vllm_tp_tokens}")
print(f"  Speed:                      {vllm_tp_speed:.1f} tokens/sec")
print(f"\nCorrect answer: Field B = 100kg, Field A = 200kg, Field C = 250kg")

Generating with vLLM Tensor Parallelism...



Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

A farmer has 3 fields. Field A produces twice as much wheat as Field B. Field C produces 50kg more than Field A. Together all three fields produce 550kg. How much does each field produce? Let's denote the amount of wheat produced by Field B as \( x \) kg.

According to the problem:
- Field A produces twice as much wheat as Field B, so Field A produces \( 2x \) kg.
- Field C produces 50 kg more than Field A, so Field C produces \( 2x + 50 \) kg.
- The total production from all three fields is 550 kg.

We can set up the following equation to represent the total production:

\[
x + 2x + (2x + 50) = 550
\]

Simplify the equation:

\[
x + 2x + 2x + 50 = 550
\]

Combine like terms:

\[
5x + 50 = 550
\]

Subtract 50 from both sides:

\[
5x = 500
\]

Divide both sides by 5:

\[
x = 100
\]

Now we can find the production of each field:
- Field B produces \( x = 100 \) kg.
- Field A produces \( 2x = 2 \times 100 = 200 \) kg.
- Field C produces \( 2x + 50 = 200 + 50 = 250 \) kg.

So, the producti

### 6.3 Save TP results and restart for Pipeline Parallel

We need to fully unload vLLM before loading with a different parallelism strategy.  
**Restart the kernel**, then continue with 6.4.

In [4]:
import json

# Save vLLM TP results
vllm_tp_results = {
    "vllm_tp_speed": vllm_tp_speed,
    "vllm_tp_total_time": vllm_tp_total_time,
    "vllm_tp_tokens": vllm_tp_tokens,
}
with open("/workspace/vllm_tp_results.json", "w") as f:
    json.dump(vllm_tp_results, f)

print("vLLM Tensor Parallel results saved to /workspace/vllm_tp_results.json")
print(f"  Speed: {vllm_tp_speed:.1f} tok/s | Total: {vllm_tp_total_time:.1f}s")
print()
print("=" * 60)
print("  NOW: Restart the kernel (Kernel -> Restart)")
print("  Then continue from 6.4 below (vLLM Pipeline Parallel)")
print("=" * 60)

vLLM Tensor Parallel results saved to /workspace/vllm_tp_results.json
  Speed: 18.7 tok/s | Total: 22.0s

  NOW: Restart the kernel (Kernel -> Restart)
  Then continue from 6.4 below (vLLM Pipeline Parallel)


### 6.4 vLLM Pipeline Parallel (2 GPUs)

Now let's run vLLM with **Pipeline Parallelism** -- layers split across GPUs.  
**Same engine, same model, different parallelism strategy.** Fair comparison.

> **If you restarted the kernel**, start running from here.
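
As a rough picture of what `pipeline_parallel_size=2` means, the sketch below assigns contiguous blocks of layers to stages and walks one request through them token by token. It is a toy model, not vLLM's scheduler: the 80-layer count is an assumption about Qwen2.5-72B, and `partition_layers` and `stage_timeline` are hypothetical helper names.

```python
def partition_layers(num_layers, num_stages):
    """Split layer indices into contiguous, near-equal stages."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages

def stage_timeline(num_stages, num_tokens):
    """For one request decoded token by token, list which stage is busy at each step."""
    return [stage for _ in range(num_tokens) for stage in range(num_stages)]

stages = partition_layers(num_layers=80, num_stages=2)   # 80 layers is an assumption
timeline = stage_timeline(num_stages=2, num_tokens=3)

for gpu, layers in enumerate(stages):
    print(f"GPU {gpu}: layers {layers.start}-{layers.stop - 1}")

# With a single request, one stage works while the other waits.
busy_fraction = 1 / len(stages)
print(f"Stage order for 3 tokens: {timeline}")
print(f"Each GPU is busy at most {busy_fraction:.0%} of the steps")
```

With several requests in flight, the stages can overlap work on different requests, so this idle time matters most for single-request latency, which is what this demo measures.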

In [1]:
import torch, time, json
from vllm import LLM, SamplingParams

# Reload all previous results
with open("/workspace/hf_results.json", "r") as f:
    hf = json.load(f)
with open("/workspace/vllm_tp_results.json", "r") as f:
    vtp = json.load(f)

pp_speed = hf["pp_speed"]
pp_total_time = hf["pp_total_time"]
pp_tokens = hf["pp_tokens"]
q_speed = hf["q_speed"]
q_total_time = hf["q_total_time"]
q_tokens = hf["q_tokens"]
model_path = hf["model_path"]
vllm_tp_speed = vtp["vllm_tp_speed"]
vllm_tp_total_time = vtp["vllm_tp_total_time"]
vllm_tp_tokens = vtp["vllm_tp_tokens"]

print(f"Loaded previous results:")
print(f"  HF Pipeline:   {pp_speed:.1f} tok/s | {pp_tokens} tokens")
print(f"  HF Quantized:  {q_speed:.1f} tok/s | {q_tokens} tokens")
print(f"  vLLM Tensor:   {vllm_tp_speed:.1f} tok/s | {vllm_tp_tokens} tokens")
print()

prompt = "A farmer has 3 fields. Field A produces twice as much wheat as Field B. Field C produces 50kg more than Field A. Together all three fields produce 550kg. How much does each field produce?"

print("Loading 70B model with vLLM Pipeline Parallelism...")
print("pipeline_parallel_size=2, tensor_parallel_size=1")
print("This splits LAYERS across GPUs (like HuggingFace, but with vLLM's engine).")
print()

llm_pp = LLM(
    model=model_path,
    pipeline_parallel_size=2,
    tensor_parallel_size=1,
    dtype="float16",
    max_model_len=2048,
    enforce_eager=True,
)

print("\nvLLM loaded with Pipeline Parallelism!")

Loaded previous results:
  HF Pipeline:   8.7 tok/s | 409 tokens
  HF Quantized:  12.3 tok/s | 408 tokens
  vLLM Tensor:   18.7 tok/s | 411 tokens

Loading 70B model with vLLM Pipeline Parallelism...
pipeline_parallel_size=2, tensor_parallel_size=1
This splits LAYERS across GPUs (like HuggingFace, but with vLLM's engine).

INFO 02-24 10:44:32 [utils.py:261] non-default args: {'dtype': 'float16', 'max_model_len': 2048, 'pipeline_parallel_size': 2, 'disable_log_stats': True, 'enforce_eager': True, 'model': '/workspace/models/models--Qwen--Qwen2.5-72B-Instruct/snapshots/495f39366efef23836d0cfae4fbe635880d2be31'}
INFO 02-24 10:44:32 [model.py:541] Resolved architecture: Qwen2ForCausalLM
INFO 02-24 10:44:32 [model.py:1561] Using max model len 2048
INFO 02-24 10:44:33 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 02-24 10:44:33 [vllm.py:624] Asynchronous scheduling is enabled.
INFO 02-24 10:44:33 [vllm.py:762] Cudagraph is disabled under eager mode
[0;

Loading safetensors checkpoint shards:   0% Completed | 0/37 [00:00<?, ?it/s]


(EngineCore_DP0 pid=4110) (Worker_PP0 pid=4124) INFO 02-24 10:45:45 [default_loader.py:291] Loading weights took 65.31 seconds
(EngineCore_DP0 pid=4110) (Worker_PP0 pid=4124) INFO 02-24 10:45:46 [gpu_model_runner.py:4130] Model loading took 67.72 GiB memory and 66.732011 seconds
(EngineCore_DP0 pid=4110) (Worker_PP0 pid=4124) INFO 02-24 10:45:53 [gpu_worker.py:356] Available KV cache memory: 1.73 GiB
(EngineCore_DP0 pid=4110) INFO 02-24 10:45:53 [kv_cache_utils.py:1307] GPU KV cache size: 10,496 tokens
(EngineCore_DP0 pid=4110) INFO 02-24 10:45:53 [kv_cache_utils.py:1312] Maximum concurrency for 2,048 tokens per request: 5.12x
(EngineCore_DP0 pid=4110) INFO 02-24 10:45:53 [core.py:272] init engine (profile, create kv cache, warmup model) took 3.30 seconds
(EngineCore_DP0 pid=4110) INFO 02-24 10:45:55 [vllm.py:762] Cudagraph is disabled under eager mode
INFO 

(EngineCore_DP0 pid=4110) (Worker_PP0 pid=4124)   object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)


In [2]:
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=2048,
)

print(f"Generating with vLLM Pipeline Parallelism...\n")

start = time.time()
outputs = llm_pp.generate([prompt], sampling_params)
vllm_pp_total_time = time.time() - start

generated = outputs[0].outputs[0].text
vllm_pp_tokens = len(outputs[0].outputs[0].token_ids)
vllm_pp_speed = vllm_pp_tokens / vllm_pp_total_time

print(prompt + generated)
print(f"\n{'='*60}")
print(f"vLLM PIPELINE PARALLEL RESULTS (2 GPUs):")
print(f"  Total time:                 {vllm_pp_total_time:.1f}s")
print(f"  Tokens generated:           {vllm_pp_tokens}")
print(f"  Speed:                      {vllm_pp_speed:.1f} tokens/sec")
print(f"\nCorrect answer: Field B = 100kg, Field A = 200kg, Field C = 250kg")
print()
print(f"FAIR COMPARISON (same engine -- vLLM):")
print(f"  vLLM Tensor Parallel:   {vllm_tp_speed:>6.1f} tok/s  ({vllm_tp_tokens} tokens in {vllm_tp_total_time:.1f}s)")
print(f"  vLLM Pipeline Parallel: {vllm_pp_speed:>6.1f} tok/s  ({vllm_pp_tokens} tokens in {vllm_pp_total_time:.1f}s)")
print(f"  TP speedup over PP:     {vllm_tp_speed/vllm_pp_speed:.2f}x")
print()
print("Same model, same engine, same prompt -- the ONLY difference is the parallelism strategy.")

Generating with vLLM Pipeline Parallelism...






A farmer has 3 fields. Field A produces twice as much wheat as Field B. Field C produces 50kg more than Field A. Together all three fields produce 550kg. How much does each field produce? Let's denote the amount of wheat produced by Field B as \( x \) kg.

According to the problem:
- Field A produces twice as much wheat as Field B, so Field A produces \( 2x \) kg.
- Field C produces 50 kg more than Field A, so Field C produces \( 2x + 50 \) kg.
- The total production from all three fields is 550 kg.

We can set up the following equation to represent the total production:

\[
x + 2x + (2x + 50) = 550
\]

Simplify the equation:

\[
x + 2x + 2x + 50 = 550
\]

Combine like terms:

\[
5x + 50 = 550
\]

Subtract 50 from both sides:

\[
5x = 500
\]

Divide both sides by 5:

\[
x = 100
\]

Now we can find the production of each field:
- Field B produces \( x = 100 \) kg.
- Field A produces \( 2x = 2 \times 100 = 200 \) kg.
- Field C produces \( 2x + 50 = 200 + 50 = 250 \) kg.

So, the producti

---

## Part 7: The Complete Picture

Let's see all our results side by side.

In [3]:
print("COMPLETE COMPARISON: Four ways to run a 70B model")
print("=" * 80)
print(f"{'':30s} {'HF Pipeline':>12s} {'HF INT4':>12s} {'vLLM TP':>12s} {'vLLM PP':>12s}")
print("-" * 80)
print(f"{'GPUs used':30s} {'2':>12s} {'1':>12s} {'2':>12s} {'2':>12s}")
print(f"{'Parallelism':30s} {'Pipeline':>12s} {'None':>12s} {'Tensor':>12s} {'Pipeline':>12s}")
print(f"{'Framework':30s} {'HuggingFace':>12s} {'HuggingFace':>12s} {'vLLM':>12s} {'vLLM':>12s}")
print(f"{'Precision':30s} {'FP16':>12s} {'INT4':>12s} {'FP16':>12s} {'FP16':>12s}")
print(f"{'Speed (tok/s)':30s} {pp_speed:>12.1f} {q_speed:>12.1f} {vllm_tp_speed:>12.1f} {vllm_pp_speed:>12.1f}")
print(f"{'Tokens generated':30s} {pp_tokens:>12d} {q_tokens:>12d} {vllm_tp_tokens:>12d} {vllm_pp_tokens:>12d}")
print(f"{'Total time':30s} {pp_total_time:>11.1f}s {q_total_time:>11.1f}s {vllm_tp_total_time:>11.1f}s {vllm_pp_total_time:>11.1f}s")
print(f"{'Quality':30s} {'Full':>12s} {'Slight loss':>12s} {'Full':>12s} {'Full':>12s}")
print(f"{'Memory per GPU':30s} {'~70 GB':>12s} {'~35 GB':>12s} {'~66 GB':>12s} {'~70 GB':>12s}")
print("=" * 80)
print()
print("FAIR COMPARISON (same engine, same model, same prompt):")
print(f"  vLLM Tensor Parallel vs vLLM Pipeline Parallel: {vllm_tp_speed/vllm_pp_speed:.2f}x speedup")
print()
print("KEY TAKEAWAYS:")
print("  1. In this single-request test, Tensor Parallelism was faster than Pipeline Parallelism (both GPUs work on every layer)")
print("  2. The vLLM TP vs PP comparison is FAIR (same engine), so the gap reflects the parallelism strategy")
print("  3. HF vs vLLM comparison shows engine optimization matters too (PagedAttention, fused kernels)")
print("  4. Quantization trades quality for efficiency -- fits 70B on 1 GPU")
print("  5. Real systems combine all 3: TP within server, PP across servers, DP for throughput")
print()
print("3D PARALLELISM in production:")
print("  TP(8 GPUs/server) x PP(4 servers) x DP(33 copies) = 1,056 GPUs")
print("  Large models are commonly trained and served with combinations like this.")

COMPLETE COMPARISON: Four ways to run a 70B model
                                HF Pipeline      HF INT4      vLLM TP      vLLM PP
--------------------------------------------------------------------------------
GPUs used                                 2            1            2            2
Parallelism                        Pipeline         None       Tensor     Pipeline
Framework                       HuggingFace  HuggingFace         vLLM         vLLM
Precision                              FP16         INT4         FP16         FP16
Speed (tok/s)                           8.7         12.3         18.7         10.7
Tokens generated                        409          408          411          411
Total time                            47.0s        33.2s        22.0s        38.5s
Quality                                Full  Slight loss         Full         Full
Memory per GPU                       ~70 GB       ~35 GB       ~66 GB       ~70 GB

FAIR COMPARISON (same engine, same mod
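
The TP-over-PP ratio can be recomputed from the rounded values in the table above. The dictionary below simply restates those printed numbers, so small differences from a result computed on the unrounded timings are possible.

```python
# Values copied from the printed table above (rounded to one decimal place).
results = {
    "HF Pipeline": {"tokens": 409, "seconds": 47.0},
    "HF INT4":     {"tokens": 408, "seconds": 33.2},
    "vLLM TP":     {"tokens": 411, "seconds": 22.0},
    "vLLM PP":     {"tokens": 411, "seconds": 38.5},
}

# Speed is simply generated tokens divided by wall-clock time.
speeds = {name: r["tokens"] / r["seconds"] for name, r in results.items()}
for name, speed in speeds.items():
    print(f"{name:12s} {speed:5.1f} tok/s")

tp_over_pp = speeds["vLLM TP"] / speeds["vLLM PP"]
print(f"vLLM TP vs vLLM PP: about {tp_over_pp:.2f}x")
```

Because both vLLM runs generated 411 tokens, the ratio reduces to 38.5s / 22.0s, or about 1.75x.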

---

## Summary

We started with a problem: **a 144GB model that doesn't fit on an 80GB GPU.**
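
The 144GB figure follows from the parameter count and the bytes stored per parameter. A minimal sketch of that arithmetic, counting weights only (KV cache, activations, and quantization metadata are ignored) and using decimal gigabytes:

```python
PARAMS = 72e9          # 72 billion parameters
GPU_MEMORY_GB = 80     # one A100 80GB

def weights_gb(params, bits_per_param):
    """Weight storage in decimal GB for a given precision."""
    return params * bits_per_param / 8 / 1e9

fp16_gb = weights_gb(PARAMS, 16)
int4_gb = weights_gb(PARAMS, 4)

print(f"FP16 weights:           {fp16_gb:.0f} GB (fits on one GPU: {fp16_gb <= GPU_MEMORY_GB})")
print(f"FP16 split over 2 GPUs: {fp16_gb / 2:.0f} GB each")
print(f"INT4 weights:           {int4_gb:.0f} GB (fits on one GPU: {int4_gb <= GPU_MEMORY_GB})")
```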

We solved it three ways:

1. **Pipeline Parallelism** -- Split layers across GPUs. Simple but has idle time (pipeline bubbles).
2. **Quantization** -- Shrink the numbers from 16-bit to 4-bit. Fits on 1 GPU, slight quality loss.
3. **Tensor Parallelism** -- Split matrices across GPUs. Faster, but needs fast interconnect (NVLink).
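
To make "split matrices across GPUs" concrete, the sketch below splits a weight matrix by columns between two pretend devices, multiplies each half separately, and concatenates the results. It uses plain Python lists instead of real GPUs, and the helper names are made up for illustration. Real implementations also use row splits followed by an all-reduce.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, parts):
    """Split matrix w into `parts` column blocks of equal width."""
    width = len(w[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]

x = [1.0, 2.0, 3.0]
w = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 3.0],
    [2.0, 1.0, 0.0, 1.0],
]

full = matmul(x, w)                       # what a single device would compute

shards = split_columns(w, parts=2)        # each "GPU" holds half the columns
partials = [matmul(x, shard) for shard in shards]
gathered = [value for part in partials for value in part]   # concatenate outputs

print(full)
print(gathered)
```

Both devices work on the same layer at the same time, which is why tensor parallelism avoids the idle time of a pipeline, at the cost of exchanging partial results on every layer.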

Then we did a **fair comparison** using vLLM (same engine, different strategies) and found that **Tensor Parallelism was faster than Pipeline Parallelism** in this single-request test (18.7 vs 10.7 tok/s).

And we learned about **Data Parallelism** -- replicating the model to train on more data in parallel, with gradient synchronization (AllReduce).
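
The gradient synchronization step can be pictured as every replica ending up with the same averaged gradient. A toy sketch with two replicas, where `all_reduce_mean` is a stand-in written for this example rather than a call from a real communication library:

```python
def all_reduce_mean(per_replica_grads):
    """Average gradients element-wise and hand every replica the same result."""
    n = len(per_replica_grads)
    mean = [sum(values) / n for values in zip(*per_replica_grads)]
    return [list(mean) for _ in range(n)]

# Each replica computed gradients on a different slice of the batch.
grads = [
    [0.2, -0.4, 1.0],   # replica 0
    [0.6,  0.0, 3.0],   # replica 1
]
synced = all_reduce_mean(grads)

print(synced[0])
print(synced[0] == synced[1])   # every replica now holds identical gradients
```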

Large-scale production systems typically combine these techniques, an approach often called **3D Parallelism**:
- **Tensor Parallel** within a server (NVLink)
- **Pipeline Parallel** across servers (InfiniBand)
- **Data Parallel** for throughput (gradient sync)
- **Quantization** to reduce cost
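
As an illustration of how the three dimensions multiply, the sketch below maps a flat GPU index to (data, pipeline, tensor) coordinates for the 8 x 4 x 33 example printed in Part 7. The ordering, with tensor-parallel ranks adjacent so they share a server, is an assumption chosen for this sketch, not a description of any particular production system.

```python
TP, PP, DP = 8, 4, 33   # sizes taken from the Part 7 example

def coords(rank, tp=TP, pp=PP, dp=DP):
    """Map a flat GPU rank to (dp_index, pp_index, tp_index)."""
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank out of range")
    tp_index = rank % tp               # position inside one server
    pp_index = (rank // tp) % pp       # which pipeline stage (server) in the replica
    dp_index = rank // (tp * pp)       # which model replica
    return dp_index, pp_index, tp_index

total_gpus = TP * PP * DP
print(f"Total GPUs: {total_gpus}")
print(f"rank 0    -> {coords(0)}")
print(f"rank 8    -> {coords(8)}")
print(f"rank 1055 -> {coords(1055)}")
```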

---

*Hardware: 2x NVIDIA A100 SXM4 80GB on RunPod*  
*Model: Qwen2.5 72B Instruct (72B parameters)*