# Reconstruct Full Distance Matrix

**Goal:** Load the distance matrix stored as a compressed upper triangle and reconstruct the full symmetric matrix for topological data analysis (TDA).

**Why a separate notebook:**
- Matrix reconstruction is memory-intensive and takes a few minutes
- Once saved, we can reuse it for multiple TDA experiments without recomputing
- One notebook = one step

**Inputs:**
- `data/vectors/distances_causal_32000.pt` - Upper triangle (about 1.02 GB of values in memory)

**Outputs:**
- `data/vectors/distances_causal_32000_full.npy` - Full symmetric matrix (float32, 4.10 GB)

**Expected runtime:** ~2-3 minutes
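
The sizes involved follow from the token count alone, so they can be estimated before anything is loaded. The sketch below is a minimal illustration, not part of the pipeline: `triangle_sizes` is a hypothetical helper, and the 2-byte default for the stored upper triangle is an assumption inferred from the memory the load step reports (1.02 GB for 511,984,000 values).

```python
def triangle_sizes(n, triu_bytes=2, full_bytes=4):
    """Estimate element counts and sizes for an n x n distance matrix.

    triu_bytes=2 assumes float16 storage for the upper triangle (an assumption);
    full_bytes=4 matches the float32 matrix built in this notebook.
    """
    n_pairs = n * (n - 1) // 2  # strict upper triangle, diagonal excluded
    return {
        "n_pairs": n_pairs,
        "triu_gb": n_pairs * triu_bytes / 1e9,
        "full_gb": n * n * full_bytes / 1e9,
    }


sizes = triangle_sizes(32_000)
print(sizes)
```

Calling the helper with a larger `n` before a run gives a quick check of whether the full matrix will fit in RAM.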

## Configuration

In [4]:
# Input file (upper triangle)
INPUT_DISTANCES = '../data/vectors/distances_causal_32000.pt'

# Output file (full matrix)
OUTPUT_DISTANCES = '../data/vectors/distances_causal_32000_full.npy'

print(f"Configuration:")
print(f"  Input: {INPUT_DISTANCES}")
print(f"  Output: {OUTPUT_DISTANCES}")

Configuration:
  Input: ../data/vectors/distances_causal_32000.pt
  Output: ../data/vectors/distances_causal_32000_full.npy


## Setup

In [5]:
import torch
import numpy as np
from tqdm.auto import tqdm

print("✓ Imports complete")

✓ Imports complete


## Load Upper Triangle

In [6]:
print(f"Loading upper triangle from {INPUT_DISTANCES}...")
# weights_only=False unpickles arbitrary Python objects, so only load files you trust
data = torch.load(INPUT_DISTANCES, weights_only=False)

triu_values = data['triu_values']
token_indices = data['token_indices']
N = data['N']
metadata = data['metadata']

print(f"\n✓ Loaded upper triangle")
print(f"  N tokens: {N:,}")
print(f"  Upper triangle values: {len(triu_values):,}")
print(f"  Memory: {triu_values.element_size() * triu_values.nelement() / 1e9:.2f} GB")
print(f"  Model: {metadata['model']}")

print(f"\nDistance statistics (from metadata):")
for k, v in metadata['distance_stats'].items():
    print(f"  {k.capitalize()}: {v:.2f} logometers")

Loading upper triangle from ../data/vectors/distances_causal_32000.pt...

✓ Loaded upper triangle
  N tokens: 32,000
  Upper triangle values: 511,984,000
  Memory: 1.02 GB
  Model: Qwen/Qwen3-4B-Instruct-2507

Distance statistics (from metadata):
  Min: 0.00 logometers
  Max: 103.19 logometers
  Mean: 71.00 logometers
  Median: 72.25 logometers
  Std: 8.03 logometers


## Reconstruct Full Symmetric Matrix

In [7]:
print("\nReconstructing full symmetric distance matrix...")
print(f"  This will create a {N}×{N} matrix ({N*N:,} elements)")
print(f"  Expected memory: {N * N * 4 / 1e9:.2f} GB (float32)\n")

# Initialize empty matrix
distances = np.zeros((N, N), dtype=np.float32)

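# Fill upper triangle (assumes values are stored in row-major order, as np.triu_indices yields them)
print("Filling upper triangle...")
triu_indices = np.triu_indices(N, k=1)
distances[triu_indices[0], triu_indices[1]] = triu_values.numpy()  # cast to float32 on assignment
del triu_indices  # free the two large index arrays

# Mirror the upper triangle into the lower one. The lower triangle and diagonal
# are still zero, so adding the transpose is equivalent to copying.
print("Making symmetric...")
distances = distances + distances.T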

print(f"\n✓ Matrix reconstructed!")
print(f"  Shape: {distances.shape}")
print(f"  Memory: {distances.nbytes / 1e9:.2f} GB")
print(f"  Dtype: {distances.dtype}")


Reconstructing full symmetric distance matrix...
  This will create a 32000×32000 matrix (1,024,000,000 elements)
  Expected memory: 4.10 GB (float32)

Filling upper triangle...
Making symmetric...

✓ Matrix reconstructed!
  Shape: (32000, 32000)
  Memory: 4.10 GB
  Dtype: float32
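
The reconstruction above is simple but has a high peak memory cost. On a typical 64-bit platform, `np.triu_indices(N, k=1)` materializes two int64 index arrays of N(N-1)/2 entries each (about 8.2 GB for N = 32,000), and `distances + distances.T` allocates a second N×N array. A row-by-row fill avoids both. The sketch below is one possible alternative, not the method this notebook ran; `reconstruct_rowwise` is a hypothetical name, and it assumes the values are stored in row-major order, the same order `np.triu_indices` produces.

```python
import numpy as np


def reconstruct_rowwise(triu_values, n, dtype=np.float32):
    """Rebuild a symmetric matrix from its row-major strict upper triangle."""
    full = np.zeros((n, n), dtype=dtype)
    start = 0
    for i in range(n - 1):
        row_len = n - i - 1                    # entries right of the diagonal in row i
        row = triu_values[start:start + row_len]
        full[i, i + 1:] = row                  # upper triangle
        full[i + 1:, i] = row                  # mirrored lower triangle
        start += row_len
    return full


# Small self-check against the index-based approach
n = 5
rng = np.random.default_rng(0)
vals = rng.random(n * (n - 1) // 2).astype(np.float16)

reference = np.zeros((n, n), dtype=np.float32)
reference[np.triu_indices(n, k=1)] = vals
reference = reference + reference.T

result = reconstruct_rowwise(vals, n)
print(np.array_equal(result, reference))
```

With the real data this would be called as `reconstruct_rowwise(triu_values.numpy(), N)`. The Python loop is slower per element than a single fancy-indexed assignment, so this trades some runtime for a lower memory peak.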


## Validation Checks

In [8]:
print("\nValidation checks:")

# Check 1: Diagonal should be zero
diag = np.diag(distances)
print(f"  Diagonal (should be ~0):")
print(f"    Mean: {diag.mean():.6f}")
print(f"    Max: {diag.max():.6f}")

# Check 2: Matrix should be symmetric
asymmetry = np.abs(distances - distances.T).max()
print(f"  Symmetry check (should be ~0): {asymmetry:.6f}")

# Check 3: All distances should be non-negative
min_dist = distances.min()
print(f"  Minimum distance (should be ≥0): {min_dist:.6f}")

# Check 4: Stats should match metadata
mask = ~np.eye(N, dtype=bool)  # Exclude diagonal
off_diag = distances[mask]

print(f"\nDistance statistics (excluding diagonal):")
print(f"  Min: {off_diag.min():.2f} (expected: {metadata['distance_stats']['min']:.2f})")
print(f"  Max: {off_diag.max():.2f} (expected: {metadata['distance_stats']['max']:.2f})")
print(f"  Mean: {off_diag.mean():.2f} (expected: {metadata['distance_stats']['mean']:.2f})")

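stats_ok = np.allclose([off_diag.min(), off_diag.max(), off_diag.mean()],
                       [metadata['distance_stats']['min'],
                        metadata['distance_stats']['max'],
                        metadata['distance_stats']['mean']],
                       rtol=0.01)
# Checks 1-3 must hold too before reporting overall success
structure_ok = np.allclose(diag, 0) and np.isclose(asymmetry, 0) and min_dist >= 0

if stats_ok and structure_ok:
    print("\n✅ All validation checks passed!")
elif structure_ok:
    print("\n⚠️  Warning: Statistics don't match metadata (check for errors)")
else:
    print("\n⚠️  Warning: Diagonal, symmetry, or non-negativity check failed")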


Validation checks:
  Diagonal (should be ~0):
    Mean: 0.000000
    Max: 0.000000
  Symmetry check (should be ~0): 0.000000
  Minimum distance (should be ≥0): 0.000000

Distance statistics (excluding diagonal):
  Min: 0.00 (expected: 0.00)
  Max: 103.19 (expected: 103.19)
  Mean: 70.99 (expected: 71.00)

✅ All validation checks passed!
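
The checks above allocate several large temporaries: `distances - distances.T` and `np.abs(...)` each create an N×N array, the boolean mask is another N×N array, and `distances[mask]` copies nearly the whole matrix. If memory is tight, the same quantities can be computed over blocks of rows. The sketch below is a minimal illustration under that assumption; `validate_blockwise` and its `block` parameter are hypothetical names, and the mean is accumulated in float64 to limit rounding error.

```python
import numpy as np


def validate_blockwise(d, block=1024):
    """Check symmetry, diagonal, minimum and off-diagonal mean in row blocks."""
    n = d.shape[0]
    max_asym = 0.0
    min_val = float("inf")
    total = 0.0  # running sum of all entries, accumulated in float64
    for start in range(0, n, block):
        stop = min(start + block, n)
        rows = d[start:stop]             # view of a block of rows
        cols_t = d[:, start:stop].T      # the matching columns, as a transposed view
        max_asym = max(max_asym, float(np.abs(rows - cols_t).max()))
        min_val = min(min_val, float(rows.min()))
        total += float(rows.sum(dtype=np.float64))
    diag = np.diagonal(d)
    off_diag_mean = (total - float(diag.sum(dtype=np.float64))) / (n * (n - 1))
    return {
        "max_abs_diagonal": float(np.abs(diag).max()),
        "max_asymmetry": max_asym,
        "min": min_val,                  # over the whole matrix, diagonal included
        "off_diagonal_mean": off_diag_mean,
    }


# Tiny demo on a synthetic symmetric matrix
rng = np.random.default_rng(0)
upper = np.triu(rng.random((7, 7)).astype(np.float32), k=1)
demo = upper + upper.T
report = validate_blockwise(demo, block=3)
print(report)
```

Each iteration only needs temporaries of size `block` × N, so the block size sets the extra memory used.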


## Save Full Matrix

In [9]:
import os

print(f"\nSaving full matrix to {OUTPUT_DISTANCES}...")
np.save(OUTPUT_DISTANCES, distances)

file_size = os.path.getsize(OUTPUT_DISTANCES) / 1e9
print("✓ Saved!")
print(f"  File size: {file_size:.2f} GB")


Saving full matrix to ../data/vectors/distances_causal_32000_full.npy...
✓ Saved!
  File size: 4.10 GB
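
Because the matrix is saved as a plain `.npy` file, a later step that only needs part of it can memory-map the file instead of reading all of it into RAM. The sketch below shows the idea on a small stand-in file; the temporary path and the 4×4 array are hypothetical and used only for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical small stand-in for the saved distance matrix
demo_path = os.path.join(tempfile.mkdtemp(), "demo_full.npy")
demo = np.arange(16, dtype=np.float32).reshape(4, 4)
np.save(demo_path, demo)

# mmap_mode="r" maps the file read-only; data is read from disk on access
mapped = np.load(demo_path, mmap_mode="r")

# np.array copies just this slice into an ordinary in-memory array
block = np.array(mapped[:2, :2])
print(type(mapped).__name__, mapped.shape, block.shape)
```

Whether a given downstream library accepts a memory-mapped array directly depends on that library, so a full `np.load` may still be needed for some analyses.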


## Summary

✓ Reconstructed full symmetric distance matrix from upper triangle

✓ Validated: symmetric, zero diagonal, correct statistics

✓ Saved as `distances_causal_32000_full.npy` (float32, 4.10 GB)

**Next step:** Load this matrix in 07.52 to run persistent homology analysis with ripser.