# 3D UMAP Embedding: Causal Metric

Compute a 3D UMAP embedding of 32,000 sampled tokens from precomputed causal-metric distances.

**Method:**
- Load precomputed causal distances, stored as an upper triangle (from notebook 07.1)
- Reconstruct symmetric matrix from upper triangle
- Run UMAP to reduce to 3D
- Compute causal norms for coloring
- Save embedding and metadata

**Inputs:**
- `data/vectors/distances_causal_32000.pt` - Precomputed causal distances
- `data/vectors/causal_metric_tensor_qwen3_4b.pt` - Metric tensor M

**Outputs:**
- `data/vectors/umap_embedding_32k_3d_causal.npy` - 3D coordinates
- `data/vectors/causal_norms_32k.npy` - Token norms for coloring

**Expected runtime:** ~1 to 2 minutes for the UMAP step (distances are already computed); loading the model to compute norms takes additional time

## Configuration

In [1]:
# Input files
DISTANCES_FILE = '../data/vectors/distances_causal_32000.pt'
METRIC_FILE = '../data/vectors/causal_metric_tensor_qwen3_4b.pt'
MODEL_NAME = 'Qwen/Qwen3-4B-Instruct-2507'

# UMAP parameters
N_NEIGHBORS = 15
MIN_DIST = 0.1
RANDOM_SEED = 42

# Output files
OUTPUT_EMBEDDING = '../data/vectors/umap_embedding_32k_3d_causal.npy'
OUTPUT_NORMS = '../data/vectors/causal_norms_32k.npy'

print(f"Configuration:")
print(f"  Model: {MODEL_NAME}")
print(f"  Distance metric: Causal (M = Cov(γ)⁻¹)")
print(f"  Output embedding: {OUTPUT_EMBEDDING}")
print(f"  Output norms: {OUTPUT_NORMS}")

Configuration:
  Model: Qwen/Qwen3-4B-Instruct-2507
  Distance metric: Causal (M = Cov(γ)⁻¹)
  Output embedding: ../data/vectors/umap_embedding_32k_3d_causal.npy
  Output norms: ../data/vectors/causal_norms_32k.npy
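
Since both inputs are produced by earlier notebooks, it can help to fail early if either is missing. The following is a minimal sketch; `missing_inputs` is a hypothetical helper, not part of this notebook.

```python
from pathlib import Path

def missing_inputs(paths):
    """Return the subset of `paths` that do not exist as files on disk."""
    return [p for p in paths if not Path(p).is_file()]

# Hypothetical usage with the configuration values defined above:
# missing = missing_inputs([DISTANCES_FILE, METRIC_FILE])
# if missing:
#     raise FileNotFoundError(f"Missing input files: {missing}")
```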


## Setup

In [2]:
import torch
import numpy as np
from umap import UMAP
from transformers import AutoModelForCausalLM

print("✓ Imports complete")

✓ Imports complete


## Load Distance Matrix

In [3]:
print(f"Loading distance matrix from {DISTANCES_FILE}...")
data = torch.load(DISTANCES_FILE, weights_only=False)

triu_values = data['triu_values']
token_indices = data['token_indices']
N = data['N']

print(f"✓ Loaded compressed data")
print(f"  N tokens: {N:,}")

Loading distance matrix from ../data/vectors/distances_causal_32000.pt...
✓ Loaded compressed data
  N tokens: 32,000
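
The compressed file holds only the strict upper triangle, so the number of stored values and the size of the reconstructed matrix can be estimated before allocating anything. The sketch below is plain arithmetic; the bytes-per-element values are assumptions, since the stored dtype is not stated here. The 2.05 GB reported after reconstruction is consistent with 2 bytes per element (for example float16).

```python
def triu_length(n):
    """Number of entries strictly above the diagonal of an n x n matrix."""
    return n * (n - 1) // 2

def full_matrix_gb(n, bytes_per_element):
    """Size in GB (1e9 bytes) of a dense n x n matrix."""
    return n * n * bytes_per_element / 1e9

n_tokens = 32_000
print(f"{triu_length(n_tokens):,}")          # 511,984,000
print(f"{full_matrix_gb(n_tokens, 2):.2f}")  # 2.05 (2-byte elements)
print(f"{full_matrix_gb(n_tokens, 4):.2f}")  # 4.10 (4-byte elements)
```

Assuming `triu_values` is a 1-D tensor, a check such as `assert triu_values.numel() == triu_length(N)` would catch a truncated or mismatched file.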


## Reconstruct Distance Matrix

In [4]:
print("\nReconstructing symmetric distance matrix...")

distances = torch.zeros(N, N, dtype=triu_values.dtype)
triu_indices = torch.triu_indices(N, N, offset=1)
distances[triu_indices[0], triu_indices[1]] = triu_values
distances = distances + distances.T

distances_np = distances.numpy()
del distances

print(f"✓ Reconstructed {N}×{N} matrix")
print(f"  Memory: {distances_np.nbytes / 1e9:.2f} GB")


Reconstructing symmetric distance matrix...
✓ Reconstructed 32000×32000 matrix
  Memory: 2.05 GB
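
To make the reconstruction step concrete, here is a toy version in NumPy on a 3×3 matrix. It is a sketch rather than the notebook's code: `reconstruct_symmetric` is a hypothetical helper, and it assumes the values are stored in row-major order, which is the order both `np.triu_indices` and `torch.triu_indices` produce.

```python
import numpy as np

def reconstruct_symmetric(triu_values, n):
    """Rebuild a symmetric matrix with zero diagonal from its strict upper triangle."""
    full = np.zeros((n, n), dtype=triu_values.dtype)
    rows, cols = np.triu_indices(n, k=1)  # row-major order: (0,1), (0,2), (1,2), ...
    full[rows, cols] = triu_values
    full[cols, rows] = triu_values        # mirror in place instead of full + full.T
    return full

toy_values = np.array([1.0, 2.0, 3.0], dtype=np.float32)  # d(0,1), d(0,2), d(1,2)
toy_full = reconstruct_symmetric(toy_values, 3)
print(toy_full)
```

Mirroring by index avoids the temporary created by `full + full.T`, at the cost of a second indexed write.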


## Run 3D UMAP

In [5]:
print("\nRunning UMAP for 3D embedding (causal distances)...")
print(f"  This should take ~1.5 minutes for {N:,} points...\n")

reducer = UMAP(
    n_components=3,
    metric='precomputed',
    n_neighbors=N_NEIGHBORS,
    min_dist=MIN_DIST,
    random_state=RANDOM_SEED,
    verbose=True
)

embedding_3d = reducer.fit_transform(distances_np)

print(f"\n✓ 3D UMAP complete!")
print(f"  Embedding shape: {embedding_3d.shape}")


Running UMAP for 3D embedding (causal distances)...
  This should take ~1.5 minutes for 32,000 points...



  warn("using precomputed metric; inverse_transform will be unavailable")


UMAP(metric='precomputed', n_components=3, n_jobs=1, random_state=42, verbose=True)
Thu Oct 30 14:00:06 2025 Construct fuzzy simplicial set
Thu Oct 30 14:00:07 2025 Finding Nearest Neighbors
Thu Oct 30 14:00:51 2025 Finished Nearest Neighbor Search
Thu Oct 30 14:00:52 2025 Construct embedding


	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
Thu Oct 30 14:01:13 2025 Finished embedding

✓ 3D UMAP complete!
  Embedding shape: (32000, 3)
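
With `metric='precomputed'`, UMAP treats its input as a square matrix of pairwise distances, so malformed input may not raise an obvious error. The checks below are a minimal sketch of what one might run before `fit_transform`; `check_precomputed` is a hypothetical helper and the tolerance is an arbitrary assumption.

```python
import numpy as np

def check_precomputed(d, atol=1e-6):
    """Return a list of problems found in a candidate pairwise-distance matrix."""
    if d.ndim != 2 or d.shape[0] != d.shape[1]:
        return ["matrix is not square"]
    problems = []
    if not np.isfinite(d).all():
        problems.append("contains NaN or inf")
    if (d < 0).any():
        problems.append("contains negative distances")
    if not np.allclose(np.diag(d), 0, atol=atol):
        problems.append("diagonal is not zero")
    if not np.allclose(d, d.T, atol=atol):
        problems.append("not symmetric")
    return problems

good = np.array([[0.0, 1.0], [1.0, 0.0]])
bad = np.array([[0.0, 1.0], [2.0, 0.0]])
print(check_precomputed(good))  # []
print(check_precomputed(bad))   # ['not symmetric']
```

On the full 32,000 × 32,000 matrix these checks allocate temporaries of comparable size, so running them on a random subsample of rows and columns may be preferable.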


## Compute Causal Norms

Each token's causal norm, `sqrt(γᵀ M γ)`, is its distance from the origin under the metric tensor M. It is used to color points in the visualization.

In [6]:
print("\nLoading metric tensor and computing causal norms...")

# Load metric tensor
metric_data = torch.load(METRIC_FILE, weights_only=False)
M = metric_data['M'].to('cpu')

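# Load model and take gamma from the lm_head (unembedding) weights.
# Note: recent transformers releases deprecate `torch_dtype` in favor of `dtype`,
# which is why this call emits a deprecation warning.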
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float32,
    device_map='cpu',
)
gamma = model.lm_head.weight.data.clone()
sampled_gamma = gamma[token_indices]
del model, gamma

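# Compute causal norms sqrt(γᵀ M γ) in batches to limit peak memory
batch_size = 1000
causal_norms = []
for i in range(0, len(sampled_gamma), batch_size):
    batch = sampled_gamma[i:i+batch_size]
    M_batch = torch.matmul(batch, M)               # each row is γᵀ M
    norms_squared = (batch * M_batch).sum(dim=1)   # row-wise γᵀ M γ
    # Clamp guards against tiny negative values from floating-point error
    causal_norms.append(torch.sqrt(torch.clamp(norms_squared, min=0)))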

causal_norms = torch.cat(causal_norms).numpy()

print(f"✓ Computed causal norms")
print(f"  Range: [{causal_norms.min():.2f}, {causal_norms.max():.2f}] logometers")
print(f"  Mean: {causal_norms.mean():.2f} logometers")
print(f"  CV: {causal_norms.std() / causal_norms.mean() * 100:.1f}%")


Loading metric tensor and computing causal norms...


`torch_dtype` is deprecated! Use `dtype` instead!


✓ Computed causal norms
  Range: [21.45, 75.08] logometers
  Mean: 54.12 logometers
  CV: 14.7%
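
The batched computation above evaluates the quadratic form `sqrt(γᵀ M γ)` row by row. The NumPy sketch below restates it with `einsum` and checks one property that must hold: with M equal to the identity, the causal norm reduces to the Euclidean norm. The random data and the function signature are illustrative assumptions, not the notebook's values.

```python
import numpy as np

def causal_norms(gamma, M, batch_size=1000):
    """Batched sqrt(g^T M g) for each row g of gamma."""
    out = []
    for start in range(0, len(gamma), batch_size):
        batch = gamma[start:start + batch_size]
        squared = np.einsum("ij,jk,ik->i", batch, M, batch)
        out.append(np.sqrt(np.clip(squared, 0, None)))  # clip rounding noise below 0
    return np.concatenate(out)

rng = np.random.default_rng(0)
g = rng.normal(size=(5, 4))                # toy stand-in for sampled_gamma
norms_identity = causal_norms(g, np.eye(4), batch_size=2)

# With M = I the causal norm equals the ordinary Euclidean norm.
assert np.allclose(norms_identity, np.linalg.norm(g, axis=1))

# Coefficient of variation, as printed by the notebook (std / mean, in percent).
cv_percent = norms_identity.std() / norms_identity.mean() * 100
```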


## Save Outputs

In [7]:
print("\nSaving outputs...")

np.save(OUTPUT_EMBEDDING, embedding_3d)
print(f"✓ Saved embedding to {OUTPUT_EMBEDDING}")
print(f"  Shape: {embedding_3d.shape}")
print(f"  Size: {embedding_3d.nbytes / 1e6:.2f} MB")

np.save(OUTPUT_NORMS, causal_norms)
print(f"✓ Saved norms to {OUTPUT_NORMS}")
print(f"  Shape: {causal_norms.shape}")
print(f"  Size: {causal_norms.nbytes / 1e6:.2f} MB")


Saving outputs...
✓ Saved embedding to ../data/vectors/umap_embedding_32k_3d_causal.npy
  Shape: (32000, 3)
  Size: 0.38 MB
✓ Saved norms to ../data/vectors/causal_norms_32k.npy
  Shape: (32000,)
  Size: 0.13 MB
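
The reported sizes are consistent with 4-byte elements: 32,000 × 3 × 4 bytes is 0.384 MB and 32,000 × 4 bytes is 0.128 MB. The sketch below reproduces that arithmetic and round-trips a file through `np.save` and `np.load`; the zero arrays and the float32 dtype are stand-in assumptions, and the temporary directory replaces the real output paths.

```python
import os
import tempfile
import numpy as np

n_tokens = 32_000
embedding = np.zeros((n_tokens, 3), dtype=np.float32)  # stand-in for embedding_3d
norms = np.zeros(n_tokens, dtype=np.float32)           # stand-in for causal_norms

print(f"{embedding.nbytes / 1e6:.2f} MB")  # 0.38 MB
print(f"{norms.nbytes / 1e6:.2f} MB")      # 0.13 MB

# Round-trip through disk to confirm shape and dtype survive.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embedding.npy")
    np.save(path, embedding)
    reloaded = np.load(path)

assert reloaded.shape == (n_tokens, 3)
assert reloaded.dtype == np.float32
```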


## Summary

Created a 3D UMAP embedding of 32,000 tokens from precomputed causal-metric distances, along with per-token causal norms for coloring.

**Outputs:**
- `umap_embedding_32k_3d_causal.npy` - 3D UMAP coordinates
- `causal_norms_32k.npy` - Token norms for visualization

**Next:** Use notebook 07.44 to generate visualizations (HTML + GIF).