
# Word2Vec: Vanilla vs RSR (Human Behavioral Similarity)

## Experiment Pipeline

**Goal**: Compare two Word2Vec models on downstream category prediction:
1. **Vanilla**: Trained ONLY on Wikipedia corpus
2. **RSR**: Trained on Wikipedia + a human similarity matrix derived from 4.7M triplet judgments

### Pipeline Steps:
1. Load Wikipedia corpus
2. Load human behavioral similarity matrix  
3. Load THINGS concepts & category labels
4. Train Word2Vec **VANILLA** (Wikipedia only)
5. Train Word2Vec **RSR** (Wikipedia + human similarity)
6. Compare both on THINGS category prediction task

### RSR Approach (brain_chapter method)

The RSR model uses **Representational Similarity Regularization** adapted from the brain_chapter implementation:

- **Loss Function**: Soft Spearman correlation (rank-based, scale-invariant)
  - `L_rsr = 1 - soft_spearman(model_similarity, target_similarity)`
  
- **Loss Combination**: Weighted balance
  - `L_total = (1 - α) × L_w2v + α × L_rsr`
  - Where `α = REG_STRENGTH` (default: 0.1)

- **Efficiency**: Random sampling of concept pairs per batch (default: 5000 pairs)

This approach aligns model representations with human behavioral similarity using differentiable ranking, which is more robust than MSE to scale differences between similarity matrices.
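
As a quick check of the weighted combination, the sketch below recomputes the combined loss from the per-epoch averages that the RSR run logs in Step 8. The helper names `rsr_loss` and `combine_losses` are illustrative and do not appear in the notebook.

```python
def rsr_loss(soft_spearman_value: float) -> float:
    """L_rsr = 1 - soft_spearman: zero when the two rankings agree perfectly."""
    return 1.0 - soft_spearman_value


def combine_losses(l_w2v: float, l_rsr: float, alpha: float = 0.1) -> float:
    """L_total = (1 - alpha) * L_w2v + alpha * L_rsr."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * l_w2v + alpha * l_rsr


# Epoch 1 averages from the RSR training log: W2V 0.4482, soft Spearman 0.9677
l_rsr = rsr_loss(0.9677)
l_total = combine_losses(0.4482, l_rsr, alpha=0.1)
print(f"L_rsr = {l_rsr:.4f}, L_total = {l_total:.4f}")
```

The result agrees with the logged `Combined` value of 0.4066. With α = 0.1 the RSR term contributes only about 0.003 to the total here, so the combined loss is dominated by the skip-gram term.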



## Step 1: Imports & Configuration

In [1]:
import os
import pickle
import numpy as np
import pandas as pd
from tqdm import tqdm
from collections import Counter

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import scipy.io as sio

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Note: torchsort is the brain_chapter's choice, but it requires C++ compilation.
# We use a pure PyTorch soft ranking implementation instead (see soft_rank below).

# Configuration
BASE_DATA_DIR = "data"
THINGS_DIR = "things_similarity"

# Corpus path (text file, one sentence per line)
CORPUS_PATH = os.path.join(BASE_DATA_DIR, "AllCombined.txt")

# THINGS paths
THINGS_WORDS_PATH = os.path.join(THINGS_DIR, "variables", "unique_id.txt")
BEHAVIORAL_SIM_PATH = os.path.join(THINGS_DIR, "data", "spose_similarity.mat")
# category_mat_manual.mat holds the manual category labels (a 1854 x 27 matrix, see the Step 4 output)
CATEGORY_DATA_PATH = os.path.join(THINGS_DIR, "data", "category_mat_manual.mat")

# Device
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {DEVICE}")

Using device: cuda


## Step 2: Load Wikipedia Corpus

In [2]:
# Load Wikipedia corpus from text file
print("="*70)
print("STEP 2: Loading Wikipedia Corpus")
print("="*70)

import re
from pathlib import Path

def simple_tokenize(text: str):
    """Basic tokenizer: lowercase, keep only alphabetic characters and spaces, split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    return text.split()

def load_corpus(corpus_path: Path):
    """Load corpus as list of token lists (one per line)."""
    sentences = []
    with corpus_path.open("r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            tokens = simple_tokenize(line)
            if tokens:
                sentences.append(tokens)
    return sentences

corpus_path = Path(CORPUS_PATH)
sentences = load_corpus(corpus_path)

print(f"  Source: Wikipedia corpus (AllCombined.txt)")
print(f"  Sentences: {len(sentences):,}")
if sentences:
    print(f"  Sample: {sentences[0][:10]}...")

STEP 2: Loading Wikipedia Corpus
  Source: Wikipedia corpus (AllCombined.txt)
  Sentences: 965,517
  Sample: ['april']...
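
To make the tokenizer's behavior concrete, here is a small standalone sketch. It restates `simple_tokenize` (the character class is written as `[^a-z\s]`, which is equivalent once the text is lowercased) and runs it on a made-up sentence; the example strings are assumptions, not corpus lines.

```python
import re


def simple_tokenize(text: str):
    """Lowercase, replace every non-letter with a space, split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()


tokens = simple_tokenize("The air-conditioner (model X1) cost $499.")
print(tokens)

# Digits are dropped, so a digit-suffixed identifier loses its suffix
print(simple_tokenize("bat1"))
```

Hyphenated compounds are split into separate tokens and digits vanish. One consequence is that no vocabulary entry can contain a digit, which is consistent with THINGS identifiers such as `bat1` showing up as unmatched in Step 5.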


## Step 3: Load Human Behavioral Similarity Matrix

In [3]:
print("="*70)
print("STEP 3: Loading Human Behavioral Similarity Matrix")
print("="*70)

# Load full behavioral similarity matrix (1854 x 1854 THINGS concepts)
behav_data = sio.loadmat(BEHAVIORAL_SIM_PATH)
behav_sim_full = behav_data['spose_sim']

print(f"  Source: 4.7 million human triplet judgments")
print(f"  Matrix shape: {behav_sim_full.shape}")
print(f"  Similarity range: [{behav_sim_full.min():.3f}, {behav_sim_full.max():.3f}]")
print(f"  Mean similarity: {behav_sim_full.mean():.3f}")

STEP 3: Loading Human Behavioral Similarity Matrix
  Source: 4.7 million human triplet judgments
  Matrix shape: (1854, 1854)
  Similarity range: [0.052, 1.000]
  Mean similarity: 0.334
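
Before using the matrix as a training target, it can help to verify its basic structure. The sketch below is a hypothetical helper (`check_similarity_matrix` is not part of the notebook) run on a toy matrix. Whether `spose_sim` is exactly symmetric with a unit diagonal is not shown in the output above, although the maximum of 1.000 is consistent with it.

```python
import numpy as np


def check_similarity_matrix(sim: np.ndarray, atol: float = 1e-6) -> dict:
    """Basic structural checks for a pairwise similarity matrix."""
    square = sim.ndim == 2 and sim.shape[0] == sim.shape[1]
    if not square:
        return {"square": False}
    off_diag = sim[~np.eye(sim.shape[0], dtype=bool)]
    return {
        "square": True,
        "symmetric": bool(np.allclose(sim, sim.T, atol=atol)),
        "unit_diagonal": bool(np.allclose(np.diag(sim), 1.0, atol=atol)),
        "off_diag_in_unit_interval": bool(off_diag.min() >= 0.0 and off_diag.max() <= 1.0),
    }


# Toy symmetric matrix standing in for behav_sim_full
rng = np.random.default_rng(0)
a = rng.random((5, 5))
toy_sim = (a + a.T) / 2
np.fill_diagonal(toy_sim, 1.0)

report = check_similarity_matrix(toy_sim)
print(report)
```

Note that the mean printed in Step 3 is taken over the whole matrix, so it includes the diagonal; for 1854 concepts the effect on the mean is small.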


## Step 4: Load THINGS Concepts & Category Labels

In [4]:
print("="*70)
print("STEP 4: Loading THINGS Concepts & Category Labels")
print("="*70)

# Load THINGS concepts (one word per line)
def load_things_words(words_path):
    """Load THINGS word list (one word per line)."""
    words = []
    with open(words_path, "r", encoding="utf-8") as f:
        for line in f:
            word = line.strip()
            if word:
                words.append(word)
    return words

concepts = load_things_words(THINGS_WORDS_PATH)
print(f"  THINGS concepts: {len(concepts)}")

# Load Category27 labels from .mat file (downstream task)
category_data = sio.loadmat(CATEGORY_DATA_PATH)

# Debug: show all keys and their shapes
print(f"  Category .mat file keys:")
for key in category_data.keys():
    if not key.startswith('__'):
        val = category_data[key]
        if isinstance(val, np.ndarray):
            print(f"    {key}: shape={val.shape}, dtype={val.dtype}")
        else:
            print(f"    {key}: type={type(val)}")

# Try to find the category matrix (should be 1854 x 27 or similar)
Y_all = None
for key in ['typicality', 'category', 'categories', 'Y', 'labels']:
    if key in category_data:
        arr = category_data[key]
        if isinstance(arr, np.ndarray) and len(arr.shape) == 2:
            # Accept any 2D array with one large dimension (> 100), taken to be the concept axis
            if arr.shape[0] > 100 or arr.shape[1] > 100:
                Y_all = arr.astype(np.float32)
                print(f"  Using '{key}' as category labels")
                break

# If not found, try to find the largest 2D array
if Y_all is None:
    for key in category_data.keys():
        if not key.startswith('__'):
            arr = category_data[key]
            if isinstance(arr, np.ndarray) and len(arr.shape) == 2:
                if arr.shape[0] > 100 or arr.shape[1] > 100:
                    Y_all = arr.astype(np.float32)
                    print(f"  Using '{key}' as category labels (fallback)")
                    break

if Y_all is None:
    print("  WARNING: Could not find valid category matrix!")
    print("  Creating dummy category labels for debugging...")
    # Create dummy labels (all zeros) so the pipeline can at least run
    Y_all = np.zeros((len(concepts), 27), dtype=np.float32)

# Transpose if needed (should be concepts x categories)
if Y_all.shape[0] == 27 and Y_all.shape[1] > 100:
    print(f"  Transposing from {Y_all.shape} to {Y_all.T.shape}")
    Y_all = Y_all.T

print(f"  Category labels: {Y_all.shape}")

# For property ratings, we'll skip for now since the file structure may differ
# Create a dummy prop_df for compatibility
prop_df = pd.DataFrame(index=concepts)
print(f"  Note: Property ratings not loaded (using concept list only)")

STEP 4: Loading THINGS Concepts & Category Labels
  THINGS concepts: 1854
  Category .mat file keys:
    category_mat_manual: shape=(1854, 27), dtype=uint8
  Using 'category_mat_manual' as category labels (fallback)
  Category labels: (1854, 27)
  Note: Property ratings not loaded (using concept list only)
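
Because each of the 27 columns is later treated as its own binary task, a quick summary of the label matrix is useful. The sketch below uses a toy matrix; `category_summary` is a hypothetical helper, and the notebook output does not say how many THINGS concepts have zero or multiple categories.

```python
import numpy as np


def category_summary(Y: np.ndarray) -> dict:
    """Summarise a binary (concepts x categories) label matrix."""
    positives = Y.sum(axis=0)      # members per category
    per_concept = Y.sum(axis=1)    # categories per concept
    return {
        "min_positives": int(positives.min()),
        "max_positives": int(positives.max()),
        "unlabelled_concepts": int((per_concept == 0).sum()),
        "multi_label_concepts": int((per_concept > 1).sum()),
    }


# Toy stand-in for Y_all: 4 concepts, 3 categories
toy_Y = np.array(
    [[1, 0, 0],
     [1, 1, 0],
     [0, 0, 0],
     [0, 1, 0]],
    dtype=np.float32,
)
summary = category_summary(toy_Y)
print(summary)
```

Categories with very few positives matter downstream: a stratified 80/20 split needs at least two members per class, and F1 on a handful of test positives is noisy.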


## Step 5: Build Vocabulary & Align Data

In [5]:
print("="*70)
print("STEP 5: Building Vocabulary & Aligning Data")
print("="*70)

# Build vocabulary from corpus
MIN_COUNT = 5
print(f"\nBuilding vocabulary (min_count={MIN_COUNT})...")

word_counts = Counter()
for sent in tqdm(sentences, desc="Counting words"):
    word_counts.update(sent)

vocab = sorted([w for w, c in word_counts.items() if c >= MIN_COUNT])
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for i, w in enumerate(vocab)}
vocab_size = len(vocab)

# Compute unigram distribution for negative sampling
word_freqs = np.array([word_counts[w] for w in vocab], dtype=np.float32)
word_freqs = word_freqs ** 0.75
word_probs = word_freqs / word_freqs.sum()

print(f"  Vocabulary size: {vocab_size:,}")
print(f"  Total tokens: {sum(word_counts.values()):,}")

# Helper function to find word in vocabulary
def get_word_idx(word, word2idx):
    """
    Try multiple strategies to match THINGS concept to vocabulary word.
    THINGS concepts often have underscores (e.g., 'air_conditioner').
    Vocabulary has lowercase single words.
    """
    word_lower = word.lower()
    
    # Strategy 1: Exact match (lowercase)
    if word_lower in word2idx:
        return word2idx[word_lower]
    
    # Strategy 2: Replace underscores with nothing (compound word)
    no_underscore = word_lower.replace("_", "")
    if no_underscore in word2idx:
        return word2idx[no_underscore]
    
    # Strategy 3: Take the first word of compound (e.g., "air" from "air_conditioner")
    parts = word_lower.split("_")
    if parts[0] in word2idx:
        return word2idx[parts[0]]
    
    # Strategy 4: Take the last word of compound (e.g., "conditioner" from "air_conditioner")
    if len(parts) > 1 and parts[-1] in word2idx:
        return word2idx[parts[-1]]
    
    # Strategy 5: Try each part of the compound
    for part in parts:
        if part in word2idx:
            return word2idx[part]
    
    return None

# Align THINGS concepts with vocabulary
print(f"\nAligning THINGS concepts with vocabulary...")
valid_concepts = []
valid_word_indices = []
Y_rows = []
valid_things_indices = []
unmatched_concepts = []

for idx, concept in enumerate(concepts):
    # Try to find the concept in vocabulary (handle various formats)
    word_idx = get_word_idx(concept, word2idx)
    if word_idx is None:
        unmatched_concepts.append(concept)
        continue
    # Make sure we have category data for this concept
    if idx >= Y_all.shape[0]:
        continue
    valid_word_indices.append(word_idx)
    Y_rows.append(Y_all[idx])
    valid_concepts.append(concept)
    valid_things_indices.append(idx)

# Debug: show matching statistics
print(f"  Matched: {len(valid_concepts)} / {len(concepts)} concepts")
print(f"  Unmatched: {len(unmatched_concepts)} concepts")
if unmatched_concepts[:10]:
    print(f"  Sample unmatched: {unmatched_concepts[:10]}")
if valid_concepts[:10]:
    print(f"  Sample matched: {valid_concepts[:10]}")

# Safety check
if len(valid_concepts) == 0:
    raise ValueError("No THINGS concepts could be matched to vocabulary! Check word matching logic.")

valid_word_indices = np.array(valid_word_indices)
Y = np.stack(Y_rows, axis=0).astype(np.float32)

# Extract aligned similarity matrix
behav_sim_subset = behav_sim_full[np.ix_(valid_things_indices, valid_things_indices)]
behav_sim_target = torch.tensor(behav_sim_subset, dtype=torch.float32, device=DEVICE)

print(f"\n{'='*70}")
print(f"ALIGNED DATASET:")
print(f"{'='*70}")
print(f"  Valid THINGS concepts: {len(valid_concepts)}")
print(f"  Category labels Y: {Y.shape}")
print(f"  Similarity matrix: {behav_sim_subset.shape}")
print(f"  Similarity matrix has {(behav_sim_subset > 0).sum()} non-zero entries")

STEP 5: Building Vocabulary & Aligning Data

Building vocabulary (min_count=5)...


Counting words: 100%|██████████| 965517/965517 [00:02<00:00, 468630.11it/s]


  Vocabulary size: 112,969
  Total tokens: 29,083,496

Aligning THINGS concepts with vocabulary...
  Matched: 1611 / 1854 concepts
  Unmatched: 243 concepts
  Sample unmatched: ['airboat', 'anklet', 'applesauce', 'ashtray', 'awning', 'backscratcher', 'bandanna', 'barrette', 'bassinet', 'bat1']
  Sample matched: ['aardvark', 'abacus', 'accordion', 'acorn', 'air_conditioner', 'air_mattress', 'air_pump', 'airbag', 'aircraft_carrier', 'airplane']

ALIGNED DATASET:
  Valid THINGS concepts: 1611
  Category labels Y: (1611, 27)
  Similarity matrix: (1611, 1611)
  Similarity matrix has 2595321 non-zero entries
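
The fallback strategies in `get_word_idx` can map several THINGS concepts to the same vocabulary row. For example, `car_door` falls back to `car` if neither `car_door` nor `cardoor` is in the vocabulary, in which case both concepts would share one embedding. The sketch below shows one way to list such collisions; `find_index_collisions` is a hypothetical helper and the indices are made up.

```python
from collections import defaultdict


def find_index_collisions(concepts, word_indices):
    """Group concepts that were mapped to the same vocabulary index."""
    groups = defaultdict(list)
    for concept, idx in zip(concepts, word_indices):
        groups[int(idx)].append(concept)
    # Keep only indices shared by more than one concept
    return {idx: names for idx, names in groups.items() if len(names) > 1}


# Toy stand-ins for valid_concepts / valid_word_indices
toy_concepts = ["apple", "apple_tree", "car", "car_door", "car_seat", "zebra"]
toy_indices = [10, 10, 42, 42, 42, 99]

collisions = find_index_collisions(toy_concepts, toy_indices)
for idx, names in collisions.items():
    print(idx, names)
```

In the notebook this would be called as `find_index_collisions(valid_concepts, valid_word_indices)`. Colliding concepts have identical embeddings but different rows in the similarity matrix and label matrix, so it may be worth deciding whether to keep them before training and evaluation.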


## Step 6: Define Model & Training Functions

In [6]:
class SkipGramWord2Vec(nn.Module):
    """PyTorch Skip-gram Word2Vec with negative sampling."""
    
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.target_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embedding_dim)
        
        init_range = 0.5 / embedding_dim
        self.target_embeddings.weight.data.uniform_(-init_range, init_range)
        self.context_embeddings.weight.data.uniform_(-init_range, init_range)
    
    def forward(self, targets, contexts):
        t_emb = self.target_embeddings(targets)
        c_emb = self.context_embeddings(contexts)
        return torch.sum(t_emb * c_emb, dim=1)

# ============================================================================
# Sample pairs randomly as we go
# ============================================================================

def preprocess_sentences(sentences, word2idx):
    """Convert sentences to index arrays (do once, reuse)."""
    indexed = []
    for sent in tqdm(sentences, desc="Indexing sentences"):
        indices = [word2idx[w] for w in sent if w in word2idx]
        if len(indices) >= 2:
            indexed.append(np.array(indices, dtype=np.int32))
    return indexed

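def sample_batch(indexed_sentences, batch_size, window_size, vocab_size, neg_probs_np):
    """Sample a batch of (target, context) index pairs on the fly.

    vocab_size and neg_probs_np are accepted but unused here; negatives are
    drawn separately in the training loop with torch.multinomial.
    """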
    targets = []
    contexts = []
    
    # Sample random sentences and extract pairs
    sent_indices = np.random.randint(0, len(indexed_sentences), batch_size * 2)
    
    for sent_idx in sent_indices:
        sent = indexed_sentences[sent_idx]
        if len(sent) < 2:
            continue
        
        # Random position in sentence
        pos = np.random.randint(0, len(sent))
        target = sent[pos]
        
        # Random context within window
        start = max(0, pos - window_size)
        end = min(len(sent), pos + window_size + 1)
        context_positions = [j for j in range(start, end) if j != pos]
        
        if context_positions:
            ctx_pos = context_positions[np.random.randint(0, len(context_positions))]
            targets.append(target)
            contexts.append(sent[ctx_pos])
        
        if len(targets) >= batch_size:
            break
    
    return np.array(targets[:batch_size]), np.array(contexts[:batch_size])

# ============================================================================
# HYPERPARAMETERS
# ============================================================================
EMBEDDING_DIM = 300
WINDOW_SIZE = 5
NEG_SAMPLES = 5

# Batch size: Typical values are 32-256 for Word2Vec. Larger batches:
# - Pros: Better GPU utilisation, more stable gradients, fewer kernel launches
# - Cons: Less frequent updates, may need more epochs, less stochasticity
# 128 is a good middle ground - standard in practice and still fast
BATCH_SIZE = 128

# Number of batches per epoch. With batch_size=128, this gives ~1.28M samples/epoch
# Adjust this to control training time vs coverage (more batches = more samples seen)
BATCHES_PER_EPOCH = 10000

W2V_EPOCHS = 5              
W2V_LR = 0.001

# RSR Configuration (brain_chapter approach)
# REG_STRENGTH controls the balance between primary task and RSR:
#   - 0.0 = No RSR (pure Word2Vec)
#   - 0.1 = Light regularization (recommended starting point)
#   - 0.5 = Equal weight
#   - 0.9 = Heavy RSR regularization
# Loss = (1 - REG_STRENGTH) * L_w2v + REG_STRENGTH * L_rsr
REG_STRENGTH = 0.1

# Apply RSR every N batches (1 = every batch like brain_chapter, higher = less frequent)
RSR_EVERY_N_BATCHES = 1

# ============================================================================
# Soft Spearman Correlation (from brain_chapter)
# Pure PyTorch implementation (no C++ compilation needed)
# ============================================================================

# Number of concept pairs to sample for RSR computation
# (The aligned 1611-concept matrix has ~1.3M unique pairs; sampling keeps it tractable)
RSR_SAMPLE_SIZE = 5000

def soft_rank(x, regularization_strength=1.0):
    """
    Differentiable soft ranking using pairwise comparisons with sigmoid.
    
    For each element, counts how many elements are smaller (soft comparison).
    This gives an approximation to the rank that is differentiable.
    
    Args:
        x: 1D tensor of values to rank
        regularization_strength: Controls sharpness (higher = sharper ranks)
    
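    Returns:
        Soft ranks (differentiable). The i == j term contributes sigmoid(0) = 0.5,
        so values sit 0.5 above a 0-indexed rank; the offset cancels when the
        ranks are mean-centered in soft_spearman.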
    """
    n = x.shape[0]
    
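    # For large inputs the (n, n) pairwise matrix gets too expensive to build
    if n > 10000:
        # Fall back to hard ranks from a sort. NOTE: hard ranks carry no gradient
        # with respect to x, so this branch cannot drive the RSR loss. It is never
        # reached with the default RSR_SAMPLE_SIZE of 5000.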
        _, indices = torch.sort(x)
        ranks = torch.zeros_like(x)
        ranks[indices] = torch.arange(1, n + 1, dtype=x.dtype, device=x.device)
        return ranks
    
    # Pairwise differences: x[i] - x[j] for all i, j
    x_expanded = x.unsqueeze(1)  # (n, 1)
    diffs = x_expanded - x.unsqueeze(0)  # (n, n) where [i,j] = x[i] - x[j]
    
    # Soft comparison: sigmoid of scaled differences
    # For each row i, sum of sigmoid(x[i] - x[j]) gives soft count of elements < x[i]
    soft_comparisons = torch.sigmoid(regularization_strength * diffs)
    
    # Sum across columns gives rank
    ranks = soft_comparisons.sum(dim=1)
    
    return ranks


def soft_spearman(pred, target, regularization_strength=1.0):
    """
    Differentiable Spearman correlation using soft ranking.
    
    This is the brain_chapter approach: instead of MSE on similarity matrices,
    we compute rank-based correlation which is more robust to scale differences.
    
    Args:
        pred: Model similarity values (sampled pairs)
        target: Target similarity values (sampled pairs)  
        regularization_strength: Controls sharpness of soft ranking (higher = sharper)
    
    Returns:
        Spearman correlation (higher = more similar, range approximately [-1, 1])
    """
    # Soft rank (differentiable approximation to ranking)
    pred_ranked = soft_rank(pred, regularization_strength)
    target_ranked = soft_rank(target, regularization_strength)
    
    # Normalize to zero mean, unit norm (standard correlation formula)
    pred_ranked = pred_ranked - pred_ranked.mean()
    pred_ranked = pred_ranked / (pred_ranked.norm() + 1e-8)
    
    target_ranked = target_ranked - target_ranked.mean()
    target_ranked = target_ranked / (target_ranked.norm() + 1e-8)
    
    # Correlation = dot product of normalized vectors
    return (pred_ranked * target_ranked).sum()

print(f"{'='*70}")
print("MODEL CONFIGURATION:")
print(f"{'='*70}")
print(f"  Embedding dim: {EMBEDDING_DIM}")
print(f"  Window size: {WINDOW_SIZE}")
print(f"  Negative samples: {NEG_SAMPLES}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Batches per epoch: {BATCHES_PER_EPOCH}")
print(f"  Samples per epoch: ~{BATCH_SIZE * BATCHES_PER_EPOCH:,}")
print(f"  Epochs: {W2V_EPOCHS}")
print(f"  Learning rate: {W2V_LR}")
print(f"{'='*70}")
print(f"\nRSR CONFIGURATION (brain_chapter approach):")
print(f"{'='*70}")
print(f"  Regularization strength: {REG_STRENGTH}")
print(f"  Loss formula: (1-{REG_STRENGTH}) * L_w2v + {REG_STRENGTH} * L_rsr")
print(f"  RSR loss type: 1 - soft_spearman (rank-based)")
print(f"  RSR frequency: every {RSR_EVERY_N_BATCHES} batch(es)")
print(f"{'='*70}")

MODEL CONFIGURATION:
  Embedding dim: 300
  Window size: 5
  Negative samples: 5
  Batch size: 128
  Batches per epoch: 10000
  Samples per epoch: ~1,280,000
  Epochs: 5
  Learning rate: 0.001

RSR CONFIGURATION (brain_chapter approach):
  Regularization strength: 0.1
  Loss formula: (1-0.1) * L_w2v + 0.1 * L_rsr
  RSR loss type: 1 - soft_spearman (rank-based)
  RSR frequency: every 1 batch(es)
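
One property of `soft_rank` is worth checking numerically. With `regularization_strength=1.0` and inputs confined to a narrow range (cosines in [-1, 1], targets in [0, 1]), the sigmoid works in its near-linear regime, so the soft ranks are close to an affine function of the raw values. The sketch below is a NumPy re-statement of the same formula, not the notebook's code, run on skewed toy data.

```python
import numpy as np


def soft_rank_np(x: np.ndarray, strength: float = 1.0) -> np.ndarray:
    """NumPy version of soft_rank: row sums of sigmoid(strength * (x_i - x_j))."""
    diffs = x[:, None] - x[None, :]
    sigmoid = 0.5 * (1.0 + np.tanh(0.5 * strength * diffs))  # overflow-safe sigmoid
    return sigmoid.sum(axis=1)


def hard_rank(x: np.ndarray) -> np.ndarray:
    """Ordinary 1-indexed ranks (no ties expected for continuous data)."""
    return np.argsort(np.argsort(x)) + 1.0


def pearson(a: np.ndarray, b: np.ndarray) -> float:
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
x = rng.random(500) ** 3                     # skewed values in [0, 1]

soft = soft_rank_np(x, strength=1.0)         # notebook default
sharp = soft_rank_np(x, strength=1000.0)     # much sharper comparisons

print(f"soft ranks vs raw values : {pearson(soft, x):.4f}")
print(f"soft ranks vs hard ranks : {pearson(soft, hard_rank(x)):.4f}")
print(f"sharp ranks vs hard ranks: {pearson(sharp, hard_rank(x)):.4f}")
```

At the default strength the soft ranks track the raw values more closely than the true ranks, so the resulting "soft Spearman" behaves much like a Pearson correlation on these inputs. A larger `regularization_strength` is one way to move closer to a true rank correlation, with the trade-off that saturated sigmoids pass smaller gradients for well-separated pairs.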


## Step 7: Train VANILLA Word2Vec (Wikipedia Only)

In [7]:
print("="*70)
print("STEP 7: Training VANILLA Word2Vec (Wikipedia Only)")
print("="*70)
print("Using on-the-fly sampling")
print("="*70 + "\n")

# Preprocess sentences ONCE (convert to indices)
print("Preprocessing sentences (one-time)...")
indexed_sentences = preprocess_sentences(sentences, word2idx)
print(f"  Indexed {len(indexed_sentences):,} sentences")

# Negative sampling distribution (numpy for fast sampling)
neg_probs_np = word_probs
neg_probs_torch = torch.tensor(word_probs, device=DEVICE)
valid_idx_tensor = torch.LongTensor(valid_word_indices).to(DEVICE)

# Pre-allocate tensors for speed
pos_labels = torch.ones(BATCH_SIZE, device=DEVICE)
neg_labels = torch.zeros(BATCH_SIZE * NEG_SAMPLES, device=DEVICE)

# Create VANILLA model
vanilla_model = SkipGramWord2Vec(vocab_size, EMBEDDING_DIM).to(DEVICE)
vanilla_optimizer = optim.Adam(vanilla_model.parameters(), lr=W2V_LR)
loss_fn = nn.BCEWithLogitsLoss()

print(f"\nTraining... (~{BATCHES_PER_EPOCH * W2V_EPOCHS:,} iterations total)")
# Rough heuristic only: the logged run below averaged ~100 it/s, i.e. about 8 minutes in total.
print(f"Expected time: ~{BATCHES_PER_EPOCH * W2V_EPOCHS // 500} minutes\n")

# Training loop
for epoch in range(W2V_EPOCHS):
    total_loss = 0
    
    pbar = tqdm(range(BATCHES_PER_EPOCH), desc=f"Vanilla Epoch {epoch+1}/{W2V_EPOCHS}")
    for batch_idx in pbar:
        # Sample batch on-the-fly (FAST!)
        targets_np, contexts_np = sample_batch(
            indexed_sentences, BATCH_SIZE, WINDOW_SIZE, vocab_size, neg_probs_np
        )
        
        # To GPU
        targets = torch.LongTensor(targets_np).to(DEVICE)
        contexts = torch.LongTensor(contexts_np).to(DEVICE)
        
        # Positive scores
        pos_scores = vanilla_model(targets, contexts)
        
        # Negative samples
        neg_contexts = torch.multinomial(neg_probs_torch, len(targets) * NEG_SAMPLES, replacement=True)
        neg_targets = targets.repeat_interleave(NEG_SAMPLES)
        neg_scores = vanilla_model(neg_targets, neg_contexts)
        
        # Loss
        all_scores = torch.cat([pos_scores, neg_scores])
        all_labels = torch.cat([pos_labels[:len(targets)], neg_labels[:len(targets)*NEG_SAMPLES]])
        loss = loss_fn(all_scores, all_labels)
        
        vanilla_optimizer.zero_grad()
        loss.backward()
        vanilla_optimizer.step()
        
        total_loss += loss.item()
        if batch_idx % 100 == 0:
            pbar.set_postfix({'loss': f'{loss.item():.4f}'})
    
    print(f"Epoch {epoch+1} | Avg Loss: {total_loss/BATCHES_PER_EPOCH:.4f}")

# Extract vanilla embeddings
X_vanilla = vanilla_model.target_embeddings(valid_idx_tensor).detach().cpu().numpy()
print(f"\n Vanilla training complete! Embeddings: {X_vanilla.shape}")

STEP 7: Training VANILLA Word2Vec (Wikipedia Only)
Using on-the-fly sampling

Preprocessing sentences (one-time)...


Indexing sentences: 100%|██████████| 965517/965517 [00:02<00:00, 329115.18it/s]


  Indexed 881,112 sentences

Training... (~50,000 iterations total)
Expected time: ~100 minutes



Vanilla Epoch 1/5: 100%|██████████| 10000/10000 [01:37<00:00, 102.32it/s, loss=0.4165]


Epoch 1 | Avg Loss: 0.4458


Vanilla Epoch 2/5: 100%|██████████| 10000/10000 [01:37<00:00, 103.04it/s, loss=0.3788]


Epoch 2 | Avg Loss: 0.3859


Vanilla Epoch 3/5: 100%|██████████| 10000/10000 [01:37<00:00, 103.00it/s, loss=0.3892]


Epoch 3 | Avg Loss: 0.3705


Vanilla Epoch 4/5: 100%|██████████| 10000/10000 [01:37<00:00, 102.17it/s, loss=0.3609]


Epoch 4 | Avg Loss: 0.3602


Vanilla Epoch 5/5: 100%|██████████| 10000/10000 [01:38<00:00, 101.74it/s, loss=0.3417]

Epoch 5 | Avg Loss: 0.3533

 Vanilla training complete! Embeddings: (1611, 300)
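
The training loop concatenates positive and negative scores and applies `nn.BCEWithLogitsLoss` with its default mean reduction. The NumPy sketch below (a re-statement, not the notebook's code) checks that this equals the classic per-pair negative-sampling objective divided by K + 1, where K corresponds to `NEG_SAMPLES`. Random scores stand in for model outputs.

```python
import numpy as np


def bce_with_logits(scores: np.ndarray, labels: np.ndarray) -> float:
    """Mean binary cross-entropy on raw scores (numerically stable form)."""
    per_item = (np.maximum(scores, 0) - scores * labels
                + np.log1p(np.exp(-np.abs(scores))))
    return float(per_item.mean())


def log_sigmoid(z: np.ndarray) -> np.ndarray:
    """log(sigmoid(z)) computed as -softplus(-z)."""
    return -(np.maximum(-z, 0) + np.log1p(np.exp(-np.abs(z))))


B, K = 128, 5                                # batch size, negatives per pair
rng = np.random.default_rng(0)
pos_scores = rng.normal(size=B)              # one score per (target, context) pair
neg_scores = rng.normal(size=(B, K))         # K negative scores per pair

# Notebook-style loss: a single mean over all B * (K + 1) scores
scores = np.concatenate([pos_scores, neg_scores.ravel()])
labels = np.concatenate([np.ones(B), np.zeros(B * K)])
notebook_loss = bce_with_logits(scores, labels)

# Classic objective: -log s(pos) - sum_k log s(-neg_k), averaged over pairs
sgns_loss = float(np.mean(-log_sigmoid(pos_scores)
                          - log_sigmoid(-neg_scores).sum(axis=1)))

print(f"notebook loss * (K + 1) = {notebook_loss * (K + 1):.6f}")
print(f"classic SGNS loss       = {sgns_loss:.6f}")
```

The constant factor does not change the optimum of the vanilla model, but it does affect the relative weight of the skip-gram term once it is mixed with `L_rsr` through `REG_STRENGTH`.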





## Step 8: Train RSR Word2Vec (Wikipedia + Human Similarity)

In [8]:
print("="*70)
print("STEP 8: Training RSR Word2Vec (Wikipedia + Human Similarity)")
print("="*70)
print("Using brain_chapter approach: soft Spearman + weighted loss combination")
print("="*70 + "\n")

# Create RSR model (fresh initialisation with random weights)
rsr_model = SkipGramWord2Vec(vocab_size, EMBEDDING_DIM).to(DEVICE)
rsr_optimizer = optim.Adam(rsr_model.parameters(), lr=W2V_LR)

# Pre-compute all valid upper triangle indices for sampling
# The aligned matrix has n*(n-1)/2 unique pairs (1,296,855 for the 1611 matched concepts)
n_concepts = behav_sim_target.shape[0]
all_triu_indices = torch.triu_indices(n_concepts, n_concepts, offset=1, device=DEVICE)
total_pairs = all_triu_indices.shape[1]

print(f"Training with RSR every {RSR_EVERY_N_BATCHES} batch(es)...")
print(f"Loss formula: (1 - {REG_STRENGTH}) * L_w2v + {REG_STRENGTH} * L_rsr")
print(f"RSR loss: 1 - soft_spearman(model_sim, target_sim)")
print(f"Sampling {RSR_SAMPLE_SIZE} concept pairs per RSR step (from {total_pairs:,} total)")
# Rough heuristic only: the logged run below averaged ~69 it/s, i.e. about 12 minutes in total.
print(f"Expected time: ~{BATCHES_PER_EPOCH * W2V_EPOCHS // 300} minutes\n")

# Training loop
for epoch in range(W2V_EPOCHS):
    total_w2v_loss, total_rsr_loss, total_combined_loss = 0, 0, 0
    total_spearman = 0
    rsr_count = 0
    
    pbar = tqdm(range(BATCHES_PER_EPOCH), desc=f"RSR Epoch {epoch+1}/{W2V_EPOCHS}")
    for batch_idx in pbar:
        # Sample batch on-the-fly
        targets_np, contexts_np = sample_batch(
            indexed_sentences, BATCH_SIZE, WINDOW_SIZE, vocab_size, neg_probs_np
        )
        
        targets = torch.LongTensor(targets_np).to(DEVICE)
        contexts = torch.LongTensor(contexts_np).to(DEVICE)
        
        # Skip-gram loss
        pos_scores = rsr_model(targets, contexts)
        neg_contexts = torch.multinomial(neg_probs_torch, len(targets) * NEG_SAMPLES, replacement=True)
        neg_targets = targets.repeat_interleave(NEG_SAMPLES)
        neg_scores = rsr_model(neg_targets, neg_contexts)
        
        all_scores = torch.cat([pos_scores, neg_scores])
        all_labels = torch.cat([pos_labels[:len(targets)], neg_labels[:len(targets)*NEG_SAMPLES]])
        L_w2v = loss_fn(all_scores, all_labels)
        
        # RSR loss using soft Spearman (brain_chapter approach)
        L_rsr = torch.tensor(0.0, device=DEVICE)
        spearman_corr = torch.tensor(0.0, device=DEVICE)
        
        if batch_idx % RSR_EVERY_N_BATCHES == 0:
            # Sample random concept pairs for this RSR step
            sample_indices = torch.randperm(total_pairs, device=DEVICE)[:RSR_SAMPLE_SIZE]
            sampled_i = all_triu_indices[0, sample_indices]
            sampled_j = all_triu_indices[1, sample_indices]
            
            # Get target similarities for sampled pairs
            target_sim_sample = behav_sim_target[sampled_i, sampled_j]
            
            # Get THINGS concept embeddings and normalize
            things_emb = rsr_model.target_embeddings(valid_idx_tensor)
            things_emb_norm = F.normalize(things_emb, p=2, dim=1)
            
            # Compute model similarities for sampled pairs
            # sim(i, j) = dot(emb[i], emb[j]) for normalized embeddings
            model_sim_sample = (things_emb_norm[sampled_i] * things_emb_norm[sampled_j]).sum(dim=1)
            
            # Compute soft Spearman correlation (brain_chapter approach)
            spearman_corr = soft_spearman(model_sim_sample, target_sim_sample)
            
            # RSR loss: we want to maximize correlation, so minimize (1 - correlation)
            L_rsr = 1.0 - spearman_corr
            
            rsr_count += 1
            total_spearman += spearman_corr.item()
        
        # Combined loss: brain_chapter weighted balance approach
        # L_total = (1 - reg_strength) * L_w2v + reg_strength * L_rsr
        L_total = (1.0 - REG_STRENGTH) * L_w2v + REG_STRENGTH * L_rsr
        
        rsr_optimizer.zero_grad()
        L_total.backward()
        rsr_optimizer.step()
        
        total_w2v_loss += L_w2v.item()
        total_rsr_loss += L_rsr.item()
        total_combined_loss += L_total.item()
        
        if batch_idx % 100 == 0:
            pbar.set_postfix({
                'w2v': f'{L_w2v.item():.4f}', 
                'rsr': f'{L_rsr.item():.4f}',
                'ρ': f'{spearman_corr.item():.3f}'
            })
    
    avg_w2v = total_w2v_loss / BATCHES_PER_EPOCH
    avg_rsr = total_rsr_loss / max(1, rsr_count)
    avg_combined = total_combined_loss / BATCHES_PER_EPOCH
    avg_spearman = total_spearman / max(1, rsr_count)
    print(f"Epoch {epoch+1} | W2V: {avg_w2v:.4f} | RSR: {avg_rsr:.4f} | Combined: {avg_combined:.4f} | Spearman ρ: {avg_spearman:.4f}")

# Extract RSR embeddings
X_rsr = rsr_model.target_embeddings(valid_idx_tensor).detach().cpu().numpy()
print(f"\n✓ RSR training complete! Embeddings: {X_rsr.shape}")
print(f"  Final Spearman correlation with human similarity: {avg_spearman:.4f}")


STEP 8: Training RSR Word2Vec (Wikipedia + Human Similarity)
Using brain_chapter approach: soft Spearman + weighted loss combination

Training with RSR every 1 batch(es)...
Loss formula: (1 - 0.1) * L_w2v + 0.1 * L_rsr
RSR loss: 1 - soft_spearman(model_sim, target_sim)
Sampling 5000 concept pairs per RSR step (from 1,296,855 total)
Expected time: ~166 minutes



RSR Epoch 1/5: 100%|██████████| 10000/10000 [02:23<00:00, 69.75it/s, w2v=0.4338, rsr=0.0270, ρ=0.973]


Epoch 1 | W2V: 0.4482 | RSR: 0.0323 | Combined: 0.4066 | Spearman ρ: 0.9677


RSR Epoch 2/5: 100%|██████████| 10000/10000 [02:24<00:00, 69.20it/s, w2v=0.3847, rsr=0.0282, ρ=0.972]


Epoch 2 | W2V: 0.3858 | RSR: 0.0276 | Combined: 0.3500 | Spearman ρ: 0.9724


RSR Epoch 3/5: 100%|██████████| 10000/10000 [02:24<00:00, 69.12it/s, w2v=0.3855, rsr=0.0244, ρ=0.976]


Epoch 3 | W2V: 0.3697 | RSR: 0.0270 | Combined: 0.3354 | Spearman ρ: 0.9730


RSR Epoch 4/5: 100%|██████████| 10000/10000 [02:24<00:00, 69.01it/s, w2v=0.3653, rsr=0.0264, ρ=0.974]


Epoch 4 | W2V: 0.3600 | RSR: 0.0268 | Combined: 0.3266 | Spearman ρ: 0.9732


RSR Epoch 5/5: 100%|██████████| 10000/10000 [02:25<00:00, 68.65it/s, w2v=0.3380, rsr=0.0284, ρ=0.972]

Epoch 5 | W2V: 0.3529 | RSR: 0.0267 | Combined: 0.3203 | Spearman ρ: 0.9733

✓ RSR training complete! Embeddings: (1611, 300)
  Final Spearman correlation with human similarity: 0.9733
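
The ρ reported above is the soft correlation averaged over the 5000-pair samples drawn during the last epoch. For an evaluation-time figure, a hard Spearman correlation over every unique pair can be computed after training. The sketch below is one way to do that with `scipy.stats.spearmanr`; `exact_rsa_spearman` is a hypothetical helper and the data is a toy stand-in.

```python
import numpy as np
from scipy.stats import spearmanr


def exact_rsa_spearman(X: np.ndarray, target_sim: np.ndarray) -> float:
    """Hard Spearman between model cosine similarities and a target matrix,
    computed over every unique pair (upper triangle, diagonal excluded)."""
    X = np.asarray(X, dtype=np.float64)
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    model_sim = Xn @ Xn.T
    iu = np.triu_indices(X.shape[0], k=1)
    rho, _ = spearmanr(model_sim[iu], target_sim[iu])
    return float(rho)


# Toy check: a target built from the same embeddings should correlate perfectly
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(20, 8))
toy_norm = toy_X / np.linalg.norm(toy_X, axis=1, keepdims=True)
toy_target = toy_norm @ toy_norm.T

rho_self = exact_rsa_spearman(toy_X, toy_target)
rho_flipped = exact_rsa_spearman(toy_X, -toy_target)
print(f"self: {rho_self:.3f}, flipped: {rho_flipped:.3f}")
```

In the notebook this would be called as `exact_rsa_spearman(X_rsr, behav_sim_subset)` and, for comparison, `exact_rsa_spearman(X_vanilla, behav_sim_subset)`.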





## Step 9: Compare on Downstream Task (THINGS Category Prediction)

In [9]:
print("="*70)
print("STEP 9: Comparing Models on THINGS Category Prediction")
print("="*70)
print("Task: Predict 27 binary category labels from embeddings")
print("Method: Logistic regression with 80/20 train/test split")
print("="*70 + "\n")

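def evaluate_embeddings(X, Y, C=1.0, test_size=0.2, random_state=42):
    """Evaluate embeddings on category prediction.

    Returns (mean, std) of F1 across categories for a single stratified split;
    the std reflects spread between categories, not between runs.
    """
    # NOTE: the scaler is fit on all rows before the split, so test-set
    # statistics influence the scaling.
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)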
    
    f1_scores = []
    for f in range(Y.shape[1]):
        y = Y[:, f]
        if np.all(y == y[0]):
            continue
        X_train, X_test, y_train, y_test = train_test_split(
            X_scaled, y, test_size=test_size, random_state=random_state, stratify=y
        )
        clf = LogisticRegression(C=C, max_iter=1000)
        clf.fit(X_train, y_train)
        f1_scores.append(f1_score(y_test, clf.predict(X_test)))
    
    return float(np.mean(f1_scores)), float(np.std(f1_scores))

# Evaluate both models
vanilla_f1, vanilla_std = evaluate_embeddings(X_vanilla, Y)
rsr_f1, rsr_std = evaluate_embeddings(X_rsr, Y)

# Results
print("="*70)
print("RESULTS: THINGS Category Prediction (F1 Score)")
print("="*70)
print(f"  VANILLA (Wikipedia only):     F1 = {vanilla_f1:.3f} ± {vanilla_std:.3f}")
print(f"  RSR (Wikipedia + Human Sim):  F1 = {rsr_f1:.3f} ± {rsr_std:.3f}")
print("="*70)

delta = rsr_f1 - vanilla_f1
print(f"\n{'='*70}")
if delta > 0.01:
    print(f"✓ RSR IMPROVED performance by {delta:.3f} F1 points!")
    print(f"  Human similarity judgments help category prediction!")
elif delta > 0:
    print(f"~ Slight improvement: Δ = {delta:.3f} F1")
else:
    print(f"✗ No improvement: Δ = {delta:.3f} F1")
print("="*70)

STEP 9: Comparing Models on THINGS Category Prediction
Task: Predict 27 binary category labels from embeddings
Method: Logistic regression with 80/20 train/test split

RESULTS: THINGS Category Prediction (F1 Score)
  VANILLA (Wikipedia only):     F1 = 0.083 ± 0.111
  RSR (Wikipedia + Human Sim):  F1 = 0.577 ± 0.241

✓ RSR IMPROVED performance by 0.494 F1 points!
  Human similarity judgments help category prediction!
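
The comparison above rests on a single 80/20 split. A cross-validated variant, with the scaler fitted inside each training fold, gives a more stable estimate. The sketch below is one possible version under those assumptions; `cv_category_f1` is a hypothetical helper and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cv_category_f1(X, Y, n_splits=5, C=1.0, seed=42):
    """Mean and std (across categories) of cross-validated F1."""
    per_category = []
    for k in range(Y.shape[1]):
        y = Y[:, k].astype(int)
        # Skip categories too small to appear in every fold
        if min(np.bincount(y, minlength=2)) < n_splits:
            continue
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        fold_scores = []
        for train_idx, test_idx in skf.split(X, y):
            # Scaler and classifier are fitted on the training fold only
            clf = make_pipeline(StandardScaler(),
                                LogisticRegression(C=C, max_iter=1000))
            clf.fit(X[train_idx], y[train_idx])
            pred = clf.predict(X[test_idx])
            fold_scores.append(f1_score(y[test_idx], pred, zero_division=0))
        per_category.append(np.mean(fold_scores))
    if not per_category:
        raise ValueError("No category has enough members for cross-validation")
    return float(np.mean(per_category)), float(np.std(per_category))


# Synthetic stand-in: two learnable categories and one empty category
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 10))
toy_Y = np.stack([toy_X[:, 0] > 0, toy_X[:, 1] > 0.5, np.zeros(200, dtype=bool)], axis=1)

mean_f1, std_f1 = cv_category_f1(toy_X, toy_Y.astype(np.float32))
print(f"F1 = {mean_f1:.3f} ± {std_f1:.3f}")
```

One design point to keep in mind when reading the RSR score: the regularizer was applied to all 1611 matched concepts, so the concepts in any test fold have already been shaped by the human similarity data. Holding out a set of concepts from the RSR term would be a stricter test.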


## Bonus: Nearest Neighbor Analysis

In [10]:
from numpy.linalg import norm

def cosine_sim(a, b):
    return float(np.dot(a, b) / (norm(a) * norm(b) + 1e-8))

def nearest_neighbors(word, embeddings_dict, k=5):
    if word not in embeddings_dict:
        return []
    vec = embeddings_dict[word]
    sims = [(w, cosine_sim(vec, v)) for w, v in embeddings_dict.items() if w != word]
    return sorted(sims, key=lambda x: x[1], reverse=True)[:k]

# Create word -> embedding dicts
vanilla_dict = {w: X_vanilla[i] for i, w in enumerate(valid_concepts)}
rsr_dict = {w: X_rsr[i] for i, w in enumerate(valid_concepts)}

# Compare nearest neighbors
test_words = ["cat", "dog", "car", "hammer", "apple"]

print("\n" + "="*70)
print("NEAREST NEIGHBOR COMPARISON")
print("="*70 + "\n")

for word in test_words:
    if word not in vanilla_dict:
        continue
    print(f"=== '{word}' ===")
    print("VANILLA:", [w for w, _ in nearest_neighbors(word, vanilla_dict, 5)])
    print("RSR:    ", [w for w, _ in nearest_neighbors(word, rsr_dict, 5)])
    print()


NEAREST NEIGHBOR COMPARISON

=== 'cat' ===
VANILLA: ['pelican', 'zebra', 'chili', 'tree', 'tree_trunk']
RSR:     ['chihuahua', 'gorilla', 'meerkat', 'puppy', 'rat']

=== 'dog' ===
VANILLA: ['weasel', 'lamp', 'cookie', 'cookie_sheet', 'pelican']
RSR:     ['kitten', 'poodle', 'meerkat', 'sheep', 'monkey']

=== 'car' ===
VANILLA: ['car_door', 'car_seat', 'airplane', 'turbine', 'bomb']
RSR:     ['car_door', 'car_seat', 'bus', 'limousine', 'hearse']

=== 'hammer' ===
VANILLA: ['ladle', 'hummingbird', 'raft', 'timer', 'bike']
RSR:     ['screwdriver', 'chisel', 'trowel', 'ratchet', 'pliers']

=== 'apple' ===
VANILLA: ['apple_tree', 'alligator', 'slide', 'jam', 'antenna']
RSR:     ['apple_tree', 'pumpkin', 'peach', 'mulberry', 'pineapple']
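
For larger word lists, the neighbor search can be done with one matrix product instead of a Python loop over pairs. The sketch below is a vectorised alternative under that assumption; `top_k_neighbors` is a hypothetical helper and the vectors are toy values.

```python
import numpy as np


def top_k_neighbors(query: str, words: list, X: np.ndarray, k: int = 5) -> list:
    """Return the k (word, cosine) pairs most similar to `query`."""
    q = words.index(query)
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)
    sims = Xn @ Xn[q]                 # cosine of every row with the query row
    sims[q] = -np.inf                 # exclude the query itself
    order = np.argsort(-sims)[:k]
    return [(words[i], float(sims[i])) for i in order]


toy_words = ["cat", "kitten", "truck"]
toy_X = np.array([[1.0, 0.0],
                  [0.9, 0.1],
                  [0.0, 1.0]])

neighbors = top_k_neighbors("cat", toy_words, toy_X, k=2)
print([w for w, _ in neighbors])
```

In the notebook the call would look like `top_k_neighbors("cat", valid_concepts, X_rsr)`. Printing the cosine values as well makes it easy to spot neighbors with similarity very close to 1, which can indicate concepts that share a vocabulary entry.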



## Save Models

In [11]:
# Save both models for later use
import os
os.makedirs("results", exist_ok=True)

torch.save({
    'model_state_dict': vanilla_model.state_dict(),
    'vocab_size': vocab_size,
    'embedding_dim': EMBEDDING_DIM,
    'word2idx': word2idx,
    'idx2word': idx2word,
}, "results/vanilla_word2vec.pt")

torch.save({
    'model_state_dict': rsr_model.state_dict(),
    'vocab_size': vocab_size,
    'embedding_dim': EMBEDDING_DIM,
    'word2idx': word2idx,
    'idx2word': idx2word,
}, "results/rsr_word2vec.pt")

print("✓ Models saved to results/")

✓ Models saved to results/
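
To reuse a saved checkpoint, the model has to be rebuilt with the stored sizes before loading the weights. The sketch below shows a round trip with a tiny stand-in model in a temporary directory; `load_word2vec` is a hypothetical helper, and the class repeats only the layer layout of the notebook's `SkipGramWord2Vec`.

```python
import os
import tempfile

import torch
import torch.nn as nn


class SkipGramWord2Vec(nn.Module):
    """Same two-table layout as the notebook's model (custom init omitted)."""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.target_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embedding_dim)


def load_word2vec(path, device="cpu"):
    """Rebuild a model from a checkpoint dict with the keys used when saving."""
    ckpt = torch.load(path, map_location=device)
    model = SkipGramWord2Vec(ckpt["vocab_size"], ckpt["embedding_dim"]).to(device)
    model.load_state_dict(ckpt["model_state_dict"])
    model.eval()
    return model, ckpt["word2idx"]


# Round trip with a tiny stand-in model and vocabulary
toy_vocab = {"cat": 0, "dog": 1, "car": 2}
toy_model = SkipGramWord2Vec(len(toy_vocab), 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_word2vec.pt")
    torch.save({
        "model_state_dict": toy_model.state_dict(),
        "vocab_size": len(toy_vocab),
        "embedding_dim": 4,
        "word2idx": toy_vocab,
        "idx2word": {i: w for w, i in toy_vocab.items()},
    }, path)
    loaded_model, loaded_vocab = load_word2vec(path)

# Look up one word vector from the restored model
cat_vec = loaded_model.target_embeddings.weight[loaded_vocab["cat"]].detach()
print(cat_vec.shape)
```

For the real checkpoints the call would be `load_word2vec("results/rsr_word2vec.pt")`. With a vocabulary of about 113K words the stored `word2idx` and `idx2word` dictionaries add noticeably to file size, so saving only one of them is an option.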
