# Perplexity Measurement

**Version:** 1.0 | **Build:** 2026-01-10

Measure the perplexity of QAT checkpoints and compare it with the baseline Qwen model.

**Perplexity** = exp(mean per-token cross-entropy loss) on next-token prediction.
- Lower is better
- WikiText-2 baselines: GPT-2 ~22, good LLMs ~5-10
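
As a minimal sketch of the definition above, perplexity can be computed from per-token probabilities by averaging the negative log-likelihood and exponentiating. The function name `perplexity_from_probs` and the toy probabilities are illustrative assumptions, not part of `scripts/measure_perplexity.py`.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity = exp(mean negative log-likelihood) over the predicted tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Toy example (hypothetical values): a model that assigns probability 0.25 to
# every correct next token is as uncertain as a uniform choice among 4 tokens.
uniform_ppl = perplexity_from_probs([0.25] * 8)
print(f"{uniform_ppl:.2f}")  # 4.00
```

A perplexity of 4 therefore reads as "on average, as uncertain as picking among 4 equally likely tokens".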

## Setup (Colab)

Run the setup cells below to clone the repository and mount Google Drive.

In [None]:
#@title Clone repository (run once)
import os

REPO_URL = "https://github.com/anemll/qwen3_apple_style_2bit_qat_lora.git"  #@param {type:"string"}
REPO_DIR = "qwen3_apple_style_2bit_qat_lora"

if not os.path.exists(REPO_DIR):
    !git clone {REPO_URL}
    print(f"✓ Cloned to {REPO_DIR}")
else:
    print(f"✓ Repository already exists at {REPO_DIR}")

# Change to repo directory
os.chdir(REPO_DIR)
print(f"Working directory: {os.getcwd()}")

In [None]:
#@title Mount Google Drive (for checkpoints)
MOUNT_DRIVE = True  #@param {type:"boolean"}

if MOUNT_DRIVE:
    from google.colab import drive
    drive.mount('/content/drive')
    print("✓ Google Drive mounted at /content/drive")
    print("  Use paths like: /content/drive/MyDrive/qat_checkpoints/model.pt")

In [None]:
#@title Config
# Checkpoint path (local or Google Drive)
# Examples:
#   "runs/SR-011/checkpoint.pt"  (local)
#   "/content/drive/MyDrive/qat_checkpoints/model.pt"  (Google Drive)
CHECKPOINT = "runs/SR-011_q4_a4_r32_mlp_autosnap/v2_q4a4_r32_fp32_20260110_133950.pt"  #@param {type:"string"}
MODEL_NAME = "Qwen/Qwen3-0.6B"  #@param {type:"string"}
LORA_R = 0  #@param {type:"integer"}

# Evaluation settings
MAX_LENGTH = 1024  #@param {type:"integer"}
STRIDE = 512  #@param {type:"integer"}
VERBOSE = True  #@param {type:"boolean"}
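
`MAX_LENGTH` and `STRIDE` are the usual knobs of strided sliding-window evaluation: each window holds up to `MAX_LENGTH` tokens, windows advance by `STRIDE`, and only tokens not scored by a previous window contribute to the loss. The sketch below illustrates that windowing under the assumption that `scripts/measure_perplexity.py` follows this common scheme; the helper `sliding_windows` is hypothetical and the script may differ in detail.

```python
def sliding_windows(num_tokens, max_length, stride):
    """Yield (begin, end, num_scored) for strided evaluation windows.

    num_scored is how many trailing tokens of the window are newly scored;
    earlier tokens in the window serve only as context.
    """
    prev_end = 0
    for begin in range(0, num_tokens, stride):
        end = min(begin + max_length, num_tokens)
        yield begin, end, end - prev_end
        prev_end = end
        if end == num_tokens:
            break

# Hypothetical 2500-token text with the defaults from the Config cell.
windows = list(sliding_windows(num_tokens=2500, max_length=1024, stride=512))
for begin, end, scored in windows:
    print(begin, end, scored)
```

With `STRIDE` equal to `MAX_LENGTH` the windows do not overlap; a smaller stride gives each scored token more context at the cost of more forward passes.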

In [None]:
#@title Install dependencies (run once)
!pip install -q datasets transformers torch

In [None]:
#@title Device setup
import os
import torch

# Auto-detect device
if torch.cuda.is_available():
    DEVICE = 'cuda'
    DTYPE = torch.bfloat16
elif torch.backends.mps.is_available():
    DEVICE = 'mps'
    DTYPE = torch.float32
else:
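    try:
        # Imported only as a probe: a successful import is treated as a TPU runtime
        import torch_xla.core.xla_model as xm  # noqa: F401
        DEVICE = 'tpu'
        DTYPE = torch.bfloat16
    except ImportError:
        # No torch_xla installed, so fall back to CPU
        DEVICE = 'cpu'
        DTYPE = torch.float32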

print(f"Device: {DEVICE}")
print(f"Dtype: {DTYPE}")

## 1. Measure Baseline Model Perplexity

Measure the original Qwen model (no QAT) to establish a baseline.

In [None]:
#@title Measure baseline perplexity
verbose_flag = "--verbose" if VERBOSE else ""

!python scripts/measure_perplexity.py --baseline \
    --model {MODEL_NAME} \
    --max-length {MAX_LENGTH} \
    --stride {STRIDE} \
    --device {DEVICE} \
    {verbose_flag}

## 2. Measure QAT Checkpoint Perplexity

Measure the QAT checkpoint set in `CHECKPOINT`. The `--lora-r` flag is passed only when `LORA_R` is greater than 0.

In [None]:
#@title Measure checkpoint perplexity
lora_flag = f"--lora-r {LORA_R}" if LORA_R > 0 else ""
verbose_flag = "--verbose" if VERBOSE else ""

!python scripts/measure_perplexity.py "{CHECKPOINT}" \
    --model {MODEL_NAME} \
    --max-length {MAX_LENGTH} \
    --stride {STRIDE} \
    --device {DEVICE} \
    {lora_flag} \
    {verbose_flag}

## 3. Compare Multiple Checkpoints (Optional)

List the checkpoints stored next to `CHECKPOINT`, then measure several of them to compare perplexity across training steps.

In [None]:
#@title List available checkpoints
import os
from pathlib import Path

checkpoint_dir = Path(CHECKPOINT).parent
if checkpoint_dir.exists():
    checkpoints = sorted(checkpoint_dir.glob("*.pt"))
    print(f"Found {len(checkpoints)} checkpoints in {checkpoint_dir}:")
    for ckpt in checkpoints[-10:]:  # Show last 10
        size_mb = ckpt.stat().st_size / 1024 / 1024
        print(f"  {ckpt.name:<50} {size_mb:.1f} MB")
else:
    print(f"Directory not found: {checkpoint_dir}")
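
`sorted()` on paths orders names lexicographically, so a name such as `checkpoint_step10000.pt` would sort before `checkpoint_step2000.pt`. If checkpoint names embed a step number, a numeric sort key keeps them in training order. The sketch below assumes names shaped like the hypothetical `checkpoint_step1000.pt`; the regex and the helper `step_of` are illustrative and would need adjusting to the actual naming scheme.

```python
import re
from pathlib import Path

def step_of(path):
    """Return the integer after 'step' in the file name, or -1 if absent."""
    match = re.search(r"step(\d+)", Path(path).name)
    return int(match.group(1)) if match else -1

# Hypothetical file names, used only to show the ordering.
names = ["checkpoint_step10000.pt", "checkpoint_step2000.pt", "checkpoint_step500.pt"]
ordered = sorted(names, key=step_of)
print(ordered)
```

Timestamped names such as the one in the Config cell already sort chronologically, so the plain `sorted()` call is enough for them.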

In [None]:
#@title Batch measure multiple checkpoints
CHECKPOINTS_TO_MEASURE = [
    # Add checkpoint paths here
    # "runs/SR-011/checkpoint_step1000.pt",
    # "runs/SR-011/checkpoint_step2000.pt",
]

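# Pass the same LoRA rank as the single-checkpoint cell
lora_flag = f"--lora-r {LORA_R}" if LORA_R > 0 else ""

if not CHECKPOINTS_TO_MEASURE:
    print("No checkpoints listed; add paths to CHECKPOINTS_TO_MEASURE above.")

for ckpt in CHECKPOINTS_TO_MEASURE:
    print(f"\n{'='*60}")
    print(f"Measuring: {ckpt}")
    print(f"{'='*60}")
    !python scripts/measure_perplexity.py "{ckpt}" \
        --model {MODEL_NAME} \
        --max-length {MAX_LENGTH} \
        --stride {STRIDE} \
        --device {DEVICE} \
        {lora_flag}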
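
Once baseline and checkpoint perplexities are known, the relative increase over the baseline is a convenient summary of the quantization cost. The sketch below assumes the values are copied by hand from the script output; the numbers, the file names, and the helper `relative_increase` are placeholders for illustration, not measured results.

```python
def relative_increase(baseline_ppl, ckpt_ppl):
    """Percent increase in perplexity relative to the baseline."""
    if baseline_ppl <= 0:
        raise ValueError("baseline perplexity must be positive")
    return 100.0 * (ckpt_ppl - baseline_ppl) / baseline_ppl

# Placeholder values only; replace with perplexities reported by the script.
baseline_ppl = 20.0
measured = {"ckpt_a.pt": 22.0, "ckpt_b.pt": 21.0}

# Print checkpoints from best (lowest perplexity) to worst.
for name, ppl in sorted(measured.items(), key=lambda kv: kv[1]):
    delta = relative_increase(baseline_ppl, ppl)
    print(f"{name:<20} ppl={ppl:6.2f}  {delta:+.1f}%")
```

Comparisons are only meaningful when the baseline and the checkpoints were measured with the same dataset, `MAX_LENGTH`, and `STRIDE`.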

## 4. Using Cache Data (Alternative)

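If the WikiText-2 download fails, measure on an existing KD cache instead. Perplexity depends on the evaluation data, so cache-based numbers are not comparable with WikiText-2 results; compare them only with other runs on the same cache.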

In [None]:
#@title Measure with KD cache
CACHE_DIR = "caches/alpaca_chat_think_both_L128_K128"  #@param {type:"string"}
NUM_SAMPLES = 100  #@param {type:"integer"}

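# Pass the same LoRA rank as the single-checkpoint cell
lora_flag = f"--lora-r {LORA_R}" if LORA_R > 0 else ""

!python scripts/measure_perplexity.py "{CHECKPOINT}" \
    --cache-dir "{CACHE_DIR}" \
    --num-samples {NUM_SAMPLES} \
    --model {MODEL_NAME} \
    --max-length {MAX_LENGTH} \
    --stride {STRIDE} \
    --device {DEVICE} \
    {lora_flag}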