# 3D UMAP Embedding: Euclidean Metric

Compute a 3D UMAP embedding of the sampled token vectors from their pairwise Euclidean (L2) distances.

**Method:**
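- Load the token sample from the causal distance file (so the same `token_indices` are reused)
- Load the model and take the `lm_head` rows γᵢ for the sampled tokens
- Compute Euclidean pairwise distances: d(i,j) = ||γᵢ - γⱼ||₂
- Run UMAP to reduce to 3D
- Save the embedding and the per-token norms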

**Outputs:**
- `data/vectors/umap_embedding_32k_3d_euclidean.npy` - 3D coordinates
- `data/vectors/euclidean_norms_32k.npy` - Token norms for coloring

**Expected runtime:** ~8 minutes (6-7 min for distances, ~1.5 min for UMAP)

## Configuration

In [1]:
# Input files
DISTANCES_FILE = '../data/vectors/distances_causal_32000.pt'  # Just for token_indices
MODEL_NAME = 'Qwen/Qwen3-4B-Instruct-2507'

# UMAP parameters
N_NEIGHBORS = 15
MIN_DIST = 0.1
RANDOM_SEED = 42

# Output files
OUTPUT_EMBEDDING = '../data/vectors/umap_embedding_32k_3d_euclidean.npy'
OUTPUT_NORMS = '../data/vectors/euclidean_norms_32k.npy'

print(f"Configuration:")
print(f"  Model: {MODEL_NAME}")
print(f"  Distance metric: Euclidean (L2)")
print(f"  Output embedding: {OUTPUT_EMBEDDING}")
print(f"  Output norms: {OUTPUT_NORMS}")

Configuration:
  Model: Qwen/Qwen3-4B-Instruct-2507
  Distance metric: Euclidean (L2)
  Output embedding: ../data/vectors/umap_embedding_32k_3d_euclidean.npy
  Output norms: ../data/vectors/euclidean_norms_32k.npy


## Setup

In [2]:
import torch
import numpy as np
from umap import UMAP
from transformers import AutoModelForCausalLM

print("✓ Imports complete")

✓ Imports complete


## Load Token Indices

In [3]:
print(f"Loading token indices from {DISTANCES_FILE}...")
data = torch.load(DISTANCES_FILE, weights_only=False)
token_indices = data['token_indices']
N = data['N']

print(f"✓ Loaded token indices")
print(f"  N tokens: {N:,}")

Loading token indices from ../data/vectors/distances_causal_32000.pt...
✓ Loaded token indices
  N tokens: 32,000


## Load Model and Extract Embeddings

In [4]:
print(f"\nLoading model {MODEL_NAME}...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,  # Use fp16 for speed and memory
    device_map='mps',  # Use MPS (Apple GPU)
)

gamma = model.lm_head.weight.data.clone()  # [vocab_size, 2560]
sampled_gamma = gamma[token_indices]  # [32000, 2560]

print(f"✓ Extracted token embeddings")
print(f"  Shape: {sampled_gamma.shape}")
print(f"  Dtype: {sampled_gamma.dtype}")
print(f"  Device: {sampled_gamma.device}")

del model
torch.mps.empty_cache()  # Clear MPS cache
print(f"✓ Freed model memory")


Loading model Qwen/Qwen3-4B-Instruct-2507...


`torch_dtype` is deprecated! Use `dtype` instead!



✓ Extracted token embeddings
  Shape: torch.Size([32000, 2560])
  Dtype: torch.float16
  Device: mps:0
✓ Freed model memory


## Compute Euclidean Distance Matrix

Using standard L2 norm: d(i,j) = ||γᵢ - γⱼ||₂

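Double-batching (over both the query and target dimensions) keeps memory usage safe: each fp16 `[200, 200, 2560]` temporary is ~205 MB, so `diff` and `diff ** 2` together peak at ~410 MB per iteration.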
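The memory figure can be checked with quick arithmetic. The sketch below is illustrative only: `broadcast_temp_bytes` is a hypothetical helper, and the "two temporaries alive at once" estimate assumes that `diff` is still referenced while `diff ** 2` is materialized, which is how the loop in the next cell is written.

```python
def broadcast_temp_bytes(batch_i: int, batch_j: int, dim: int, bytes_per_element: int) -> int:
    """Bytes needed for one [batch_i, batch_j, dim] temporary tensor."""
    return batch_i * batch_j * dim * bytes_per_element


# fp16 uses 2 bytes per element.
one_temp = broadcast_temp_bytes(200, 200, 2560, 2)

# Assumption: diff and diff ** 2 exist at the same time inside the loop body.
peak_two_temps = 2 * one_temp

# The full N x N fp16 output matrix for N = 32,000.
full_matrix = 32_000 * 32_000 * 2

print(f"one temporary:   {one_temp / 1e6:.1f} MB")
print(f"diff + diff**2:  {peak_two_temps / 1e6:.1f} MB")
print(f"distance matrix: {full_matrix / 1e9:.2f} GB")
```

The same helper can be used to see how the peak scales if `batch_i` or `batch_j` is changed, since the temporary grows with their product.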

In [5]:
print("\nComputing Euclidean pairwise distances...")
print("  Using GPU (MPS) with fp16 and double-batching for memory safety!\n")

# Double-batching: batch over both i and j dimensions
batch_i = 200  # Query batch
batch_j = 200  # Target batch
distances = torch.zeros(N, N, dtype=torch.float16, device='mps')

n_batches = int(np.ceil(N / batch_i))
for i in range(0, N, batch_i):
    i_end = min(i + batch_i, N)
    tokens_i = sampled_gamma[i:i_end]  # [batch_i, 2560]
    
    for j in range(0, N, batch_j):
        j_end = min(j + batch_j, N)
        tokens_j = sampled_gamma[j:j_end]  # [batch_j, 2560]
        
        # Broadcast: [batch_i, 1, 2560] - [1, batch_j, 2560]
        # Memory: 200 × 200 × 2560 × 2 bytes ≈ 205 MB per temporary;
        # diff and diff ** 2 together ≈ 410 MB (safe!)
        diff = tokens_i[:, None, :] - tokens_j[None, :, :]
        distances[i:i_end, j:j_end] = torch.sqrt((diff ** 2).sum(dim=2))
    
    if (i // batch_i + 1) % 10 == 0 or i_end == N:
        print(f"  Progress: {i_end:,} / {N:,} ({100*i_end/N:.1f}%)")

# Move to CPU and convert to numpy
print("\n  Moving to CPU...")
distances_np = distances.cpu().numpy()
del distances
torch.mps.empty_cache()

print(f"\n✓ Computed {N}×{N} Euclidean distance matrix")
print(f"  Memory: {distances_np.nbytes / 1e9:.2f} GB")
print(f"  Distance range: [{distances_np[distances_np > 0].min():.2f}, {distances_np.max():.2f}]")
print(f"  Mean distance: {distances_np[distances_np > 0].mean():.2f}")


Computing Euclidean pairwise distances...
  Using GPU (MPS) with fp16 and double-batching for memory safety!

  Progress: 2,000 / 32,000 (6.2%)
  Progress: 4,000 / 32,000 (12.5%)
  Progress: 6,000 / 32,000 (18.8%)
  Progress: 8,000 / 32,000 (25.0%)
  Progress: 10,000 / 32,000 (31.2%)
  Progress: 12,000 / 32,000 (37.5%)
  Progress: 14,000 / 32,000 (43.8%)
  Progress: 16,000 / 32,000 (50.0%)
  Progress: 18,000 / 32,000 (56.2%)
  Progress: 20,000 / 32,000 (62.5%)
  Progress: 22,000 / 32,000 (68.8%)
  Progress: 24,000 / 32,000 (75.0%)
  Progress: 26,000 / 32,000 (81.2%)
  Progress: 28,000 / 32,000 (87.5%)
  Progress: 30,000 / 32,000 (93.8%)
  Progress: 32,000 / 32,000 (100.0%)

  Moving to CPU...

✓ Computed 32000×32000 Euclidean distance matrix
  Memory: 2.05 GB
  Distance range: [0.01, 2.07]
  Mean distance: 1.49
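The double-batched loop can be sanity-checked on a small random matrix before committing to the full 32,000-token run. This is a minimal NumPy sketch of the same batching pattern, not the notebook's MPS code: `pairwise_l2_batched` is a hypothetical name, the toy data is random, and float32 is used instead of fp16.

```python
import numpy as np


def pairwise_l2_batched(x: np.ndarray, batch_i: int, batch_j: int) -> np.ndarray:
    """Pairwise Euclidean distances, computed one [batch_i, batch_j] tile at a time."""
    n = x.shape[0]
    out = np.zeros((n, n), dtype=x.dtype)
    for i in range(0, n, batch_i):
        i_end = min(i + batch_i, n)
        for j in range(0, n, batch_j):
            j_end = min(j + batch_j, n)
            # Broadcast: [bi, 1, d] - [1, bj, d] -> [bi, bj, d]
            diff = x[i:i_end, None, :] - x[None, j:j_end, :]
            out[i:i_end, j:j_end] = np.sqrt((diff ** 2).sum(axis=2))
    return out


rng = np.random.default_rng(0)
x = rng.normal(size=(50, 8)).astype(np.float32)

# Batch sizes deliberately do not divide N, to exercise the ragged last tiles.
batched = pairwise_l2_batched(x, batch_i=16, batch_j=7)

# Reference: the same computation without batching (fine at this toy size).
direct = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)

print("max abs difference:", float(np.abs(batched - direct).max()))
```

Choosing batch sizes that do not divide the number of rows is a cheap way to confirm that the `min(i + batch_i, n)` bounds handle the final partial tile.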


## Compute Euclidean Norms

Per-token L2 norms, saved so that points can be colored by their distance from the origin.

In [6]:
print("\nComputing Euclidean norms (distance from origin)...")
euclidean_norms = torch.norm(sampled_gamma, dim=1).cpu().numpy()

print(f"✓ Computed Euclidean norms")
print(f"  Min: {euclidean_norms.min():.2f}")
print(f"  Max: {euclidean_norms.max():.2f}")
print(f"  Mean: {euclidean_norms.mean():.2f}")
print(f"  Median: {np.median(euclidean_norms):.2f}")
print(f"  CV: {euclidean_norms.std() / euclidean_norms.mean() * 100:.1f}%")


Computing Euclidean norms (distance from origin)...
✓ Computed Euclidean norms
  Min: 0.36
  Max: 1.54
  Mean: 1.09
  Median: 1.11
  CV: 15.6%
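The CV reported here is the coefficient of variation, i.e. the standard deviation of the norms divided by their mean. A small sketch on toy vectors makes the statistics concrete; `norm_summary` is a hypothetical helper, and the upcast to float32 is a choice made for this sketch rather than something the notebook does.

```python
import numpy as np


def norm_summary(vectors: np.ndarray) -> dict:
    """Row-wise L2 norms plus the summary statistics printed in the cell above."""
    # Upcast so the sum of squares is accumulated in float32 (sketch-only choice).
    norms = np.linalg.norm(vectors.astype(np.float32), axis=1)
    return {
        "min": float(norms.min()),
        "max": float(norms.max()),
        "mean": float(norms.mean()),
        "median": float(np.median(norms)),
        # Coefficient of variation in percent; np.std defaults to ddof=0.
        "cv_percent": float(norms.std() / norms.mean() * 100),
    }


# Toy rows with norms 5, 1 and 10.
toy = np.array([[3.0, 4.0], [0.0, 1.0], [6.0, 8.0]], dtype=np.float16)
summary = norm_summary(toy)
print(summary)
```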


## Run 3D UMAP

In [7]:
print("\nRunning UMAP for 3D embedding (Euclidean distances)...")
print(f"  This should take ~1.5 minutes for {N:,} points...\n")

reducer = UMAP(
    n_components=3,
    metric='precomputed',
    n_neighbors=N_NEIGHBORS,
    min_dist=MIN_DIST,
    random_state=RANDOM_SEED,
    verbose=True
)

embedding_3d = reducer.fit_transform(distances_np)

print(f"\n✓ 3D UMAP complete!")
print(f"  Embedding shape: {embedding_3d.shape}")


Running UMAP for 3D embedding (Euclidean distances)...
  This should take ~1.5 minutes for 32,000 points...



  warn("using precomputed metric; inverse_transform will be unavailable")


UMAP(metric='precomputed', n_components=3, n_jobs=1, random_state=42, verbose=True)
Thu Oct 30 13:46:51 2025 Construct fuzzy simplicial set
Thu Oct 30 13:46:51 2025 Finding Nearest Neighbors
Thu Oct 30 13:47:38 2025 Finished Nearest Neighbor Search
Thu Oct 30 13:47:39 2025 Construct embedding



	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
Thu Oct 30 13:47:51 2025 Finished embedding

✓ 3D UMAP complete!
  Embedding shape: (32000, 3)
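With `metric='precomputed'`, UMAP interprets its input as a square matrix of pairwise distances, so it can be worth confirming basic properties of `distances_np` first. The following is a minimal sketch of such checks; `check_precomputed` and its tolerance are hypothetical, and on a full 32,000 × 32,000 matrix the symmetry test allocates large temporaries, so it may be preferable to run it on a sub-block.

```python
import numpy as np


def check_precomputed(d: np.ndarray, atol: float = 1e-3) -> None:
    """Raise ValueError if d does not look like a valid pairwise distance matrix."""
    if d.ndim != 2 or d.shape[0] != d.shape[1]:
        raise ValueError(f"expected a square matrix, got shape {d.shape}")
    if not np.isfinite(d).all():
        raise ValueError("matrix contains NaN or inf (possible fp16 overflow)")
    if (d < 0).any():
        raise ValueError("distances must be non-negative")
    if not np.allclose(np.diag(d), 0.0, atol=atol):
        raise ValueError("diagonal should be zero")
    if not np.allclose(d, d.T, atol=atol):
        raise ValueError("matrix should be symmetric")


# Toy example: three collinear 2D points.
pts = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
good = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
check_precomputed(good)  # passes silently
print("distance matrix looks valid")
```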


## Save Outputs

In [8]:
print("\nSaving outputs...")

np.save(OUTPUT_EMBEDDING, embedding_3d)
print(f"✓ Saved embedding to {OUTPUT_EMBEDDING}")
print(f"  Shape: {embedding_3d.shape}")
print(f"  Size: {embedding_3d.nbytes / 1e6:.2f} MB")

np.save(OUTPUT_NORMS, euclidean_norms)
print(f"✓ Saved norms to {OUTPUT_NORMS}")
print(f"  Shape: {euclidean_norms.shape}")
print(f"  Size: {euclidean_norms.nbytes / 1e6:.2f} MB")


Saving outputs...
✓ Saved embedding to ../data/vectors/umap_embedding_32k_3d_euclidean.npy
  Shape: (32000, 3)
  Size: 0.38 MB
✓ Saved norms to ../data/vectors/euclidean_norms_32k.npy
  Shape: (32000,)
  Size: 0.06 MB
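The reported sizes are consistent with a float32 embedding (32,000 × 3 × 4 bytes) and float16 norms (32,000 × 2 bytes). The sketch below reproduces that arithmetic and round-trips arrays through `np.save` and `np.load`; the arrays are placeholders and the temporary file names are made up for illustration, not the notebook's output paths.

```python
import os
import tempfile

import numpy as np

# Placeholder arrays with the shapes printed above; dtypes are inferred from the sizes.
embedding = np.zeros((32_000, 3), dtype=np.float32)
norms = np.ones(32_000, dtype=np.float16)

with tempfile.TemporaryDirectory() as tmp:
    emb_path = os.path.join(tmp, "embedding.npy")  # hypothetical file name
    norms_path = os.path.join(tmp, "norms.npy")    # hypothetical file name
    np.save(emb_path, embedding)
    np.save(norms_path, norms)
    # np.load restores shape and dtype from the .npy header.
    emb_back = np.load(emb_path)
    norms_back = np.load(norms_path)

print(f"embedding: {emb_back.shape} {emb_back.dtype} {emb_back.nbytes / 1e6:.2f} MB")
print(f"norms:     {norms_back.shape} {norms_back.dtype} {norms_back.nbytes / 1e6:.2f} MB")
```

Because the norms are stored as float16, downstream code that needs more precision can upcast after loading, for example with `norms_back.astype(np.float32)`.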


## Summary

Created a 3D UMAP embedding of 32,000 sampled tokens from their pairwise Euclidean distances.

**Outputs:**
- `umap_embedding_32k_3d_euclidean.npy` - 3D UMAP coordinates
- `euclidean_norms_32k.npy` - Token norms for visualization

**Next:** Use notebook 07.42 to generate visualizations (HTML + GIF).