# 05.2a: Token Inspector

**Goal:** Deep dive on a single token by ID.

Given a token ID (found by hovering over a point on the sky map), display:
- Decoded string representation
- Sky position (longitude, latitude, CDF-flattened latitude)
- Distance from the global centroid
- 30 nearest neighbors in γ' (Euclidean distance)
- 30 nearest neighbors by angular distance
- Average distance to the k nearest neighbors (a local density metric)

This is the **target identification** step in the observatory workflow.

## Parameters

In [8]:
TENSOR_DIR = "../data/tensors"
MODEL_NAME = "Qwen/Qwen3-4B-Instruct-2507"

# Token to inspect
TARGET_TOKEN_ID = 119347

# Which spherical coordinate system to use
SPHERICAL_COORDS_FILE = "spherical_coords_pc1_pc2_pc3.safetensors"

# Neighborhood size
K_NEIGHBORS = 30

## Imports

In [9]:
import torch
import numpy as np
import pandas as pd
from safetensors.torch import load_file
from pathlib import Path
from transformers import AutoTokenizer

print("Imports loaded successfully.")

Imports loaded successfully.


## Step 1: Load Data

In [10]:
# Load γ' (centered embeddings)
gamma_prime_path = Path(TENSOR_DIR) / "gamma_centered_qwen3_4b_instruct_2507.safetensors"
gamma_prime = load_file(gamma_prime_path)['gamma_centered']

N, d = gamma_prime.shape

print(f"Loaded γ' (centered):")
print(f"  Tokens: {N:,}")
print(f"  Dimensions: {d:,}")
print()

# Load spherical coordinates
coords_path = Path(TENSOR_DIR) / SPHERICAL_COORDS_FILE
coords = load_file(coords_path)

r = coords['r']
phi_deg = coords['phi_deg']
theta_deg = coords['theta_deg']
theta_flat = coords['theta_flat']

print(f"Loaded spherical coordinates from: {SPHERICAL_COORDS_FILE}")
print()

# Load tokenizer
print(f"Loading tokenizer: {MODEL_NAME}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print("Tokenizer loaded.")

Loaded γ' (centered):
  Tokens: 151,936
  Dimensions: 2,560

Loaded spherical coordinates from: spherical_coords_pc1_pc2_pc3.safetensors

Loading tokenizer: Qwen/Qwen3-4B-Instruct-2507...
Tokenizer loaded.


## Step 2: Basic Token Information

In [11]:
if TARGET_TOKEN_ID < 0 or TARGET_TOKEN_ID >= N:
    raise ValueError(f"Token ID {TARGET_TOKEN_ID} out of range [0, {N-1}]")

# Decode token
token_str = tokenizer.decode([TARGET_TOKEN_ID])

# Get position
token_r = r[TARGET_TOKEN_ID].item()
token_phi = phi_deg[TARGET_TOKEN_ID].item()
token_theta = theta_deg[TARGET_TOKEN_ID].item()
token_theta_flat = theta_flat[TARGET_TOKEN_ID].item()

print("="*60)
print(f"TOKEN {TARGET_TOKEN_ID}")
print("="*60)
print()
print(f"String: {repr(token_str)}")
print()
print(f"Sky coordinates (from {SPHERICAL_COORDS_FILE}):")
print(f"  Longitude: {token_phi:.2f}°")
print(f"  Latitude: {token_theta:.4f}° (original)")
print(f"  Latitude: {token_theta_flat:.2f}° (CDF-flattened)")
print()
print(f"Distance from global centroid: {token_r:.6f} gamma units")

TOKEN 119347

String: '珊�'

Sky coordinates (from spherical_coords_pc1_pc2_pc3.safetensors):
  Longitude: 164.50°
  Latitude: 23.3882° (original)
  Latitude: 9.02° (CDF-flattened)

Distance from global centroid: 0.122323 gamma units
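
The trailing `�` (U+FFFD) in the decoded string is what a UTF-8 decoder emits for bytes that do not form a complete character. A plausible explanation, assuming the tokenizer is byte-level BPE so that a token can end partway through a multi-byte character, is sketched below. The byte string is a hypothetical illustration, not the actual bytes of token 119347.

```python
# Hypothetical bytes: the 3-byte UTF-8 encoding of '珊' followed by the
# lead byte of another 3-byte character (an incomplete sequence).
partial = "珊".encode("utf-8") + b"\xe7"

# errors="replace" substitutes U+FFFD for the undecodable tail
# instead of raising UnicodeDecodeError.
decoded = partial.decode("utf-8", errors="replace")
print(repr(decoded))
print(len(partial), "bytes ->", len(decoded), "characters")
```

If this is what happens here, the token is only meaningful in combination with the token that supplies the remaining bytes.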


## Step 3: Nearest Neighbors in γ' (Euclidean Distance)

In [12]:
print("\n" + "="*60)
print(f"NEAREST NEIGHBORS IN γ' SPACE (Euclidean Distance)")
print("="*60)
print()

# Get target token's embedding
target_embedding = gamma_prime[TARGET_TOKEN_ID]

# Compute Euclidean distances to all tokens
distances_euclidean = torch.norm(gamma_prime - target_embedding, dim=1)

# Find k+1 nearest (including self)
nearest_euclidean_dists, nearest_euclidean_indices = torch.topk(
    distances_euclidean, 
    k=K_NEIGHBORS + 1, 
    largest=False
)

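# Build results, skipping rank 0 on the assumption that it is the target itself.
# Caveat: when several tokens tie at distance 0, torch.topk may return the
# target at a later rank, so it can still appear in this list.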
euclidean_results = []
for i, (idx, dist) in enumerate(zip(nearest_euclidean_indices[1:], nearest_euclidean_dists[1:]), 1):
    token_id = idx.item()
    distance = dist.item()
    token = tokenizer.decode([token_id])
    
    euclidean_results.append({
        'Rank': i,
        'Token_ID': token_id,
        'Distance': f"{distance:.6f}",
        'Token': repr(token)
    })

euclidean_df = pd.DataFrame(euclidean_results)
print(euclidean_df.to_string(index=False))

# Average distance to k-nearest
avg_euclidean_dist = nearest_euclidean_dists[1:].mean().item()
print(f"\nAverage distance to {K_NEIGHBORS} nearest: {avg_euclidean_dist:.6f} gamma units")


NEAREST NEIGHBORS IN γ' SPACE (Euclidean Distance)

 Rank  Token_ID Distance  Token
    1       181 0.000000    '�'
    2    124403 0.000000  'นัก'
    3    119347 0.000000   '珊�'
    4       185 0.000000    '�'
    5     77150 0.000000   '１０'
    6       184 0.000000    '�'
    7    124602 0.000000 'แล้ว'
    8       178 0.000000    '�'
    9    124294 0.000000   'วั'
   10    124316 0.000000  'คุณ'
   11    124031 0.000000   'มี'
   12    124501 0.000000 'ขึ้น'
   13    124121 0.000000   'วิ'
   14    124559 0.000000   'พิ'
   15       182 0.000000    '�'
   16    124361 0.000000   'ณ์'
   17       183 0.000000    '�'
   18    124742 0.000000  'าร์'
   19       177 0.000000    '�'
   20    124030 0.000000  'ว่า'
   21       179 0.000000    '�'
   22    124673 0.000000   'รี'
   23       186 0.000000    '�'
   24    124652 0.000000  'ค้า'
   25       187 0.000000    '�'
   26    124770 0.000000 'ทั้ง'
   27    124573 0.000000   'ค์'
   28    124502 0.000000   'ก่'
   29    124377 0.0
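
In the output above, the target (119347) appears at rank 3 of its own neighbor list. Many tokens tie at distance 0, so slicing off the first result does not reliably remove the target. A minimal sketch of an index-based exclusion follows, written with NumPy on toy data. `nearest_excluding_self` is a hypothetical helper, and the same masking idea should carry over to the `torch.topk` call.

```python
import numpy as np

def nearest_excluding_self(distances, target_id, k):
    """Return (indices, distances) of the k smallest entries, never target_id."""
    d = np.asarray(distances, dtype=np.float64).copy()
    d[target_id] = np.inf                     # mask the target by index, not by rank
    order = np.argsort(d, kind="stable")[:k]  # stable sort keeps tie order deterministic
    return order, d[order]

# Toy distances from token 2 to five tokens; tokens 0, 1 and 2 tie at zero.
toy_distances = [0.0, 0.0, 0.0, 0.5, 0.2]
idx, dist = nearest_excluding_self(toy_distances, target_id=2, k=3)
print(idx.tolist(), dist.tolist())
```

Masking by index keeps the result correct no matter how ties are ordered.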

## Step 4: Nearest Neighbors in Angular Distance

In [13]:
print("\n" + "="*60)
print(f"NEAREST NEIGHBORS IN ANGULAR DISTANCE")
print("="*60)
print()

# Normalize embeddings for cosine similarity
gamma_prime_normalized = gamma_prime / gamma_prime.norm(dim=1, keepdim=True)
target_normalized = target_embedding / target_embedding.norm()

# Compute cosine similarities
cosine_sims = gamma_prime_normalized @ target_normalized

# Convert to angular distances
angular_distances_rad = torch.acos(torch.clamp(cosine_sims, -1, 1))
angular_distances_deg = torch.rad2deg(angular_distances_rad)

# Find k+1 nearest (including self)
nearest_angular_dists, nearest_angular_indices = torch.topk(
    angular_distances_deg,
    k=K_NEIGHBORS + 1,
    largest=False
)

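# Build results, skipping rank 0 on the assumption that it is the target itself.
# Caveat: when several tokens tie at 0°, torch.topk may return the target
# at a later rank, so it can still appear in this list.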
angular_results = []
for i, (idx, dist) in enumerate(zip(nearest_angular_indices[1:], nearest_angular_dists[1:]), 1):
    token_id = idx.item()
    distance = dist.item()
    token = tokenizer.decode([token_id])
    
    angular_results.append({
        'Rank': i,
        'Token_ID': token_id,
        'Distance': f"{distance:.4f}°",
        'Token': repr(token)
    })

angular_df = pd.DataFrame(angular_results)
print(angular_df.to_string(index=False))

# Average angular distance to k-nearest
avg_angular_dist = nearest_angular_dists[1:].mean().item()
print(f"\nAverage angular distance to {K_NEIGHBORS} nearest: {avg_angular_dist:.4f}°")


NEAREST NEIGHBORS IN ANGULAR DISTANCE

 Rank  Token_ID Distance   Token
    1    124060  0.0000°    'ริ'
    2    124033  0.0000°   'ไม่'
    3    124027  0.0000°  'เป็น'
    4    123806  0.0000°     '�'
    5    124105  0.0000°   'ให้'
    6    124084  0.0000°   'รับ'
    7    124311  0.0000°    'ดี'
    8    123939  0.0000°    'ร์'
    9    124479  0.0000°   'ตัว'
   10    124212  0.0000°  'ื่อง'
   11    124055  0.0000°   'ได้'
   12    124383  0.0000°  'ต้อง'
   13    123870  0.0000°     '�'
   14    124440  0.0000°   'ี่ย'
   15    119346  0.0000°    '珊�'
   16    124439  0.0000°   'รู้'
   17    124083  0.0000°   'ั้ง'
   18    124484  0.0000°    'นั'
   19     80091  0.0000°    '２０'
   20    124139  0.0000°   'นี้'
   21    124132  0.0000°    'สิ'
   22    124482  0.0000°   'ใช้'
   23    124258  0.0000°   'ผู้'
   24    124254  0.0000°   'กัน'
   25    124238  0.0000°    'สั'
   26    124496  0.0000°    'ทุ'
   27    124458  0.0000°    'ตุ'
   28    124418  0.0000°   'สุด'
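
`acos` loses resolution near a cosine of 1: cosines within rounding error of 1 all map to an angle of exactly 0, so very small angles cannot be told apart from exact duplicates. One alternative, sketched here in NumPy, uses the chord-length identity θ = 2·asin(‖u − v‖ / 2) for unit vectors, which stays accurate for small angles. `angular_distance_deg` is a hypothetical helper. Whether the zeros above come from identical embeddings or from rounding is not something this sketch can settle.

```python
import numpy as np

def angular_distance_deg(vectors, target):
    """Angle in degrees between each row of `vectors` and `target`.

    Uses theta = 2 * asin(chord / 2) on unit vectors.
    Rows with zero norm would produce NaN; they are not handled here.
    """
    v = np.asarray(vectors, dtype=np.float64)
    t = np.asarray(target, dtype=np.float64)
    v_unit = v / np.linalg.norm(v, axis=1, keepdims=True)
    t_unit = t / np.linalg.norm(t)
    chord = np.linalg.norm(v_unit - t_unit, axis=1)
    return np.degrees(2.0 * np.arcsin(np.clip(chord / 2.0, 0.0, 1.0)))

toy = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1e-9]])
ref = np.array([1.0, 0.0])

chord_angles = angular_distance_deg(toy, ref)

# The acos route, for comparison: the last row collapses to exactly 0.
unit = toy / np.linalg.norm(toy, axis=1, keepdims=True)
acos_angles = np.degrees(np.arccos(np.clip(unit @ ref, -1.0, 1.0)))

print(chord_angles)
print(acos_angles)
```

The two methods agree at larger angles. Only the last toy row, about 1e-9 radians away from the reference, differs: the chord form keeps it nonzero.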
   

## Step 5: Density Analysis

In [14]:
print("\n" + "="*60)
print("DENSITY ANALYSIS")
print("="*60)
print()

print(f"Average Euclidean distance to {K_NEIGHBORS} nearest neighbors:")
print(f"  {avg_euclidean_dist:.6f} gamma units")
print()
print(f"Average angular distance to {K_NEIGHBORS} nearest neighbors:")
print(f"  {avg_angular_dist:.4f}°")
print()
print("Interpretation:")
print("  Smaller values = denser neighborhood")
print("  Larger values = sparser neighborhood")
print()
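# These are means of the target's distance to every token (self included),
# not all-pairs statistics over the vocabulary.
print("Compare to global statistics:")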
print(f"  Global mean Euclidean distance: {distances_euclidean.mean().item():.6f} gamma units")
print(f"  Global mean angular distance: {angular_distances_deg.mean().item():.4f}°")


DENSITY ANALYSIS

Average Euclidean distance to 30 nearest neighbors:
  0.000000 gamma units

Average angular distance to 30 nearest neighbors:
  0.0000°

Interpretation:
  Smaller values = denser neighborhood
  Larger values = sparser neighborhood

Compare to global statistics:
  Global mean Euclidean distance: 1.050221 gamma units
  Global mean angular distance: 88.3374°
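
The global means above average the distance from the target to every token. They describe how far the target sits from the vocabulary as a whole, not how dense other neighborhoods are. One possible baseline, sketched below with NumPy, is to compute the same k-nearest mean for a random sample of tokens and see where the target falls. `knn_mean_distance` and `density_percentile` are hypothetical helpers, and the toy array stands in for `gamma_prime`.

```python
import numpy as np

def knn_mean_distance(embeddings, idx, k):
    """Mean Euclidean distance from row `idx` to its k nearest other rows."""
    d = np.linalg.norm(embeddings - embeddings[idx], axis=1)
    d[idx] = np.inf  # exclude the token itself by index, not by rank
    return float(np.partition(d, k - 1)[:k].mean())

def density_percentile(embeddings, target_id, k, sample_size, seed=0):
    """Percent of sampled tokens whose k-nearest mean is <= the target's."""
    rng = np.random.default_rng(seed)
    sample = rng.choice(len(embeddings), size=sample_size, replace=False)
    baseline = np.array([knn_mean_distance(embeddings, i, k) for i in sample])
    target = knn_mean_distance(embeddings, target_id, k)
    return 100.0 * float((baseline <= target).mean())

# Toy stand-in for gamma_prime: four points on a line.
toy = np.array([[0.0], [1.0], [2.0], [10.0]])
print(knn_mean_distance(toy, 0, k=2))
print(density_percentile(toy, 3, k=2, sample_size=4))
```

A low percentile would indicate an unusually dense neighborhood. With the notebook's tensors, something like `gamma_prime.float().numpy()` would presumably be the input, and the sample size would trade accuracy against runtime.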


## Summary

This notebook inspects one token in detail: its decoded string, sky position, nearest neighbors, and local density.

**Workflow:**
1. Find an interesting token ID on the 04.1c sky map
2. Set `TARGET_TOKEN_ID` and run this notebook
3. Examine the neighborhoods and density
4. Look for patterns among the nearby tokens

**Neighborhoods:**
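- **Euclidean neighbors**: Tokens whose centered embeddings lie close together in the full 2,560-dimensional space, so both direction and norm must match (possibly because they received similar training updates)
- **Angular neighbors**: Tokens whose embeddings point in similar directions regardless of norm (often read as semantic or syntactic similarity)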
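
The two neighbor definitions can disagree. A small worked example with made-up 2D vectors (not taken from the model) shows a vector that is the nearest angular neighbor but the farther Euclidean one. `euclid_and_angle` is a hypothetical helper.

```python
import numpy as np

def euclid_and_angle(v, target):
    """Return (Euclidean distance, angle in degrees) between v and target."""
    v = np.asarray(v, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    euclid = float(np.linalg.norm(v - target))
    cos = float(v @ target / (np.linalg.norm(v) * np.linalg.norm(target)))
    angle = float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    return euclid, angle

target = [1.0, 0.0]
same_direction_longer = [3.0, 0.0]  # parallel to target, three times the norm
slightly_rotated = [1.0, 0.1]       # close in position, a few degrees off

print(euclid_and_angle(same_direction_longer, target))
print(euclid_and_angle(slightly_rotated, target))
```

The parallel vector is 2.0 units away at 0°, while the rotated one is 0.1 units away at roughly 5.7°, so each list ranks them in the opposite order.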

**Density:**
- Low average distance = dense cluster
- High average distance = isolated token in sparse region
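
An average distance of exactly zero, as reported for token 119347, suggests that the neighbors may share the target's embedding outright rather than merely sit close to it. One way to check, sketched in NumPy on toy data, is to list every row that equals the target's row element for element. `exact_duplicates_of` is a hypothetical helper.

```python
import numpy as np

def exact_duplicates_of(embeddings, target_id):
    """Indices of all rows identical to row `target_id` (the target included)."""
    same = np.all(embeddings == embeddings[target_id], axis=1)
    return np.flatnonzero(same)

# Toy stand-in for gamma_prime: rows 0, 1 and 3 are identical.
toy = np.array([[0.0, 1.0], [0.0, 1.0], [2.0, 3.0], [0.0, 1.0]])
dupes = exact_duplicates_of(toy, target_id=0)
print(dupes.tolist(), "->", len(dupes) - 1, "other tokens share this embedding")
```

A large duplicate count would mean the "dense cluster" is a single point shared by many tokens.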