# Language Embeddings Model - Training and Evaluation Demo

This notebook demonstrates:
1. Data preparation and tokenization
2. Model initialization
3. Training with contrastive learning
4. Evaluation on similarity and retrieval tasks
5. Visualization of embeddings

The architecture is modular and ready for MoE (Mixture of Experts) integration in future iterations.

In [None]:
import sys
sys.path.append('..')

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings('ignore')

# Import our modules
from src.models import EmbeddingModel
from src.data import SimpleTokenizer, PairDataset
from src.training import MultipleNegativesRankingLoss, EmbeddingTrainer
from src.evaluation import compute_similarity, evaluate_retrieval, compute_embedding_statistics

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## 1. Prepare Data

We'll create a synthetic dataset of similar sentence pairs for demonstration.
In practice, you would use real datasets like:
- SNLI/MNLI for natural language inference
- STS benchmark for semantic similarity
- MS MARCO or Natural Questions for retrieval

In [None]:
# Create synthetic training data with paraphrases
train_pairs = [
    # Technology
    ("The computer is very fast", "This machine has high performance"),
    ("I love programming in Python", "Python is my favorite programming language"),
    ("Machine learning is fascinating", "I find artificial intelligence very interesting"),
    ("The algorithm runs efficiently", "This computational method is very fast"),
    ("Neural networks are powerful", "Deep learning models show great capability"),
    
    # Nature
    ("The weather is sunny today", "It's a beautiful day with clear skies"),
    ("Trees provide oxygen", "Plants produce air for us to breathe"),
    ("The ocean is vast", "The sea is enormous"),
    ("Mountains are tall", "The peaks reach high into the sky"),
    ("Flowers bloom in spring", "Blossoms appear during springtime"),
    
    # Daily life
    ("I enjoy reading books", "Books are something I love to read"),
    ("Coffee tastes good", "I like the flavor of coffee"),
    ("Exercise is healthy", "Physical activity is good for you"),
    ("Music makes me happy", "I feel joy when listening to music"),
    ("Sleep is important", "Getting rest is essential"),
    
    # Food
    ("Pizza is delicious", "I think pizza tastes great"),
    ("Fresh fruit is nutritious", "Eating fruit is healthy"),
    ("The cake was sweet", "This dessert had a sugary taste"),
    ("Water is essential", "We need water to survive"),
    ("Vegetables are healthy", "Eating veggies is good for you"),
]

# Expand dataset by adding reverse pairs
train_pairs_expanded = train_pairs + [(b, a) for a, b in train_pairs]

# Add more variations
additional_pairs = [
    ("Dogs are loyal animals", "Canines show great faithfulness"),
    ("Cats are independent", "Felines prefer autonomy"),
    ("The city is busy", "Urban areas have lots of activity"),
    ("The library is quiet", "The book room is silent"),
    ("The car moves fast", "The vehicle has high speed"),
    ("Learning is fun", "I enjoy gaining knowledge"),
    ("The sunset is beautiful", "Dusk looks stunning"),
    ("Stars shine at night", "Celestial bodies glow after dark"),
]

train_pairs_expanded.extend(additional_pairs)
train_pairs_expanded.extend([(b, a) for a, b in additional_pairs])

print(f"Number of training pairs: {len(train_pairs_expanded)}")
print(f"\nExample pairs:")
for i in range(3):
    print(f"{i+1}. '{train_pairs_expanded[i][0]}'")
    print(f"   '{train_pairs_expanded[i][1]}'\n")

## 2. Build Vocabulary and Tokenizer

We'll use a simple word-based tokenizer. In production, you'd use:
- BPE (Byte-Pair Encoding)
- WordPiece
- SentencePiece
- Or use a pre-trained tokenizer from HuggingFace
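
To make the word-based approach concrete, here is a minimal sketch of what such a tokenizer might do. `ToyWordTokenizer` is a hypothetical stand-in written for illustration; the real `SimpleTokenizer` in `src.data` may use different special tokens, ID assignments, or return types.

```python
from collections import Counter


class ToyWordTokenizer:
    """Hypothetical word-level tokenizer; not the SimpleTokenizer from src.data."""

    def __init__(self, vocab_size=5000, max_length=8):
        self.vocab_size = vocab_size
        self.max_length = max_length
        # Assumption: IDs 0 and 1 are reserved for padding and unknown words.
        self.token2id = {"[PAD]": 0, "[UNK]": 1}

    def fit(self, sentences):
        counts = Counter(w for s in sentences for w in s.lower().split())
        # Keep the most frequent words that still fit in the vocabulary.
        for word, _ in counts.most_common(self.vocab_size - len(self.token2id)):
            self.token2id[word] = len(self.token2id)

    def encode(self, sentence):
        # Unseen words fall back to the [UNK] id (1).
        ids = [self.token2id.get(w, 1) for w in sentence.lower().split()]
        ids = ids[: self.max_length]
        pad = self.max_length - len(ids)
        return {
            "input_ids": ids + [0] * pad,
            "attention_mask": [1] * len(ids) + [0] * pad,
        }


toy = ToyWordTokenizer(max_length=6)
toy.fit(["The ocean is vast", "The sea is enormous"])
toy_encoded = toy.encode("The ocean is deep")
```

Here "deep" was never seen during `fit`, so it maps to the unknown ID, and the last two positions are padding with a zero attention mask. Subword schemes such as BPE avoid most unknown tokens by splitting rare words into smaller known pieces.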

In [None]:
# Extract all unique sentences for vocabulary building
all_sentences = []
for pair in train_pairs_expanded:
    all_sentences.extend(pair)

# Initialize and fit tokenizer
tokenizer = SimpleTokenizer(vocab_size=5000, max_length=128)
tokenizer.fit(all_sentences)

print(f"Vocabulary size: {len(tokenizer)}")
print(f"\nSample tokens:")
sample_tokens = list(tokenizer.token2id.keys())[:20]
print(sample_tokens)

In [None]:
# Test tokenization
test_sentence = "The computer is very fast"
encoded = tokenizer.encode(test_sentence)
print(f"Original: {test_sentence}")
print(f"Encoded IDs: {encoded['input_ids'][0][:15]}...")  # First 15 tokens
print(f"Attention mask: {encoded['attention_mask'][0][:15]}...")
print(f"Decoded: {tokenizer.decode(encoded['input_ids'][0].tolist())}")

## 3. Create Dataset and DataLoader

In [None]:
# Create dataset
train_dataset = PairDataset(train_pairs_expanded, tokenizer)

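# Split into train and validation.
# Note: every pair also appears reversed, so a pair and its mirror can land on
# opposite sides of this random split, which makes validation loss optimistic.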
train_size = int(0.8 * len(train_dataset))
val_size = len(train_dataset) - train_size

train_dataset, val_dataset = torch.utils.data.random_split(
    train_dataset, [train_size, val_size]
)

print(f"Train size: {len(train_dataset)}")
print(f"Validation size: {len(val_dataset)}")

# Create data loaders
batch_size = 8
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

# Check a batch
sample_batch = next(iter(train_loader))
print(f"\nBatch keys: {sample_batch.keys()}")
print(f"Input IDs shape: {sample_batch['input_ids_1'].shape}")
print(f"Attention mask shape: {sample_batch['attention_mask_1'].shape}")

## 4. Initialize Model

Our model architecture:
- Token + Positional Embeddings
- Transformer Encoder (4 layers, 4 heads in this demo configuration)
- Pooling Layer (mean pooling)
- L2 Normalization

**Modular design** allows easy replacement of components with MoE variants later.
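
The last two stages of the pipeline, mean pooling and L2 normalization, can be sketched in a few lines. This is a plain-Python illustration with lists standing in for tensors; the function names are hypothetical and the actual `Pooler` in `src/models/pooling.py` may be implemented differently.

```python
import math


def mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    # Guard against an all-padding sequence.
    return [t / max(count, 1) for t in total]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


hidden = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]  # last token is padding
mask = [1, 1, 0]
pooled = mean_pool(hidden, mask)       # [3.0, 4.0]
sentence_vec = l2_normalize(pooled)    # [0.6, 0.8]
```

The padded token is excluded from the average, which is why the attention mask has to reach the pooling layer and not only the encoder.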

In [None]:
# Model hyperparameters
model_config = {
    "vocab_size": len(tokenizer),
    "hidden_dim": 128,        # Embedding dimension
    "num_layers": 4,          # Number of transformer layers
    "num_heads": 4,           # Number of attention heads
    "ff_dim": 512,            # Feed-forward dimension (can be replaced with MoE)
    "max_seq_len": 128,
    "dropout": 0.1,
    "pooling_mode": "mean",   # Can also try "max", "cls", or "mean_max"
    "pad_token_id": tokenizer.pad_token_id,
    "normalize_embeddings": True
}

# Initialize model
model = EmbeddingModel(**model_config)
model = model.to(device)

# Print model info
num_params = sum(p.numel() for p in model.parameters())
num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Model initialized with {num_params:,} parameters")
print(f"Trainable parameters: {num_trainable:,}")
print(f"\nModel architecture:")
print(model)

## 5. Test Forward Pass

In [None]:
# Test with a batch
model.eval()
with torch.no_grad():
    sample_batch_device = {k: v.to(device) for k, v in sample_batch.items()}
    output = model(
        sample_batch_device['input_ids_1'],
        sample_batch_device['attention_mask_1']
    )
    
print(f"Output embeddings shape: {output['embeddings'].shape}")
print(f"Hidden states shape: {output['hidden_states'].shape}")
print(f"\nFirst embedding (first 10 dims): {output['embeddings'][0][:10]}")
print(f"Embedding L2 norm: {torch.norm(output['embeddings'][0]).item():.4f}")

## 6. Setup Training

We'll use **Multiple Negatives Ranking Loss** (InfoNCE):
- Efficient contrastive learning
- Uses in-batch negatives
- Same family of loss as used in CLIP, SimCLR, and sentence-transformers
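
As a rough sketch of the idea, the loss treats each anchor's own positive as the correct class and every other positive in the batch as a negative, then applies cross-entropy over the scaled similarities. The function below is a plain-Python illustration under the assumption that all vectors are already L2-normalized; it is not the `MultipleNegativesRankingLoss` class from `src.training`, which may differ (for example by being symmetric).

```python
import math


def multiple_negatives_ranking_loss(anchors, positives, temperature=0.05):
    """Mean cross-entropy over rows of the anchor-positive similarity matrix.

    Assumes every vector is L2-normalized, so a dot product is a cosine similarity.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [
            sum(a * p for a, p in zip(anchor, pos)) / temperature
            for pos in positives
        ]
        # Log-sum-exp with the max subtracted for numerical stability.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        # The matching positive sits on the diagonal (index i).
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = multiple_negatives_ranking_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = multiple_negatives_ranking_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

With correctly aligned pairs the loss is close to zero; with the positives swapped it is close to 20, which is the similarity gap of 1 divided by the temperature of 0.05. A lower temperature sharpens this penalty.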

In [None]:
# Loss function
loss_fn = MultipleNegativesRankingLoss(temperature=0.05)

# Optimizer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=0.01
)

# Learning rate scheduler
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=20,
    eta_min=1e-6
)

# Initialize trainer
trainer = EmbeddingTrainer(
    model=model,
    loss_fn=loss_fn,
    optimizer=optimizer,
    device=device,
    scheduler=scheduler
)

print("Training setup complete!")
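
For intuition about the schedule configured above, cosine annealing has a simple closed form. The sketch below reproduces it with the same `lr`, `eta_min`, and `T_max` values, assuming the scheduler is stepped once per epoch (that depends on how `EmbeddingTrainer` calls it, which is not shown here).

```python
import math


def cosine_annealing_lr(epoch, base_lr=1e-3, eta_min=1e-6, t_max=20):
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at epoch t_max."""
    cosine = (1 + math.cos(math.pi * epoch / t_max)) / 2
    return eta_min + (base_lr - eta_min) * cosine


schedule = [cosine_annealing_lr(e) for e in range(21)]
```

The rate starts at `1e-3`, passes roughly the midpoint of the range at epoch 10, and reaches `1e-6` at epoch 20, which is why `T_max` is set equal to the number of training epochs.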

## 7. Train the Model

In [None]:
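# Make sure the checkpoint directory exists before the trainer saves to it
import os
os.makedirs('../models', exist_ok=True)

# Train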
history = trainer.train(
    train_loader=train_loader,
    val_loader=val_loader,
    num_epochs=20,
    eval_every=1,
    save_best=True,
    save_path="../models/best_model.pt"
)

## 8. Visualize Training Progress

In [None]:
# Plot training curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curves
axes[0].plot(history['train_loss'], label='Train Loss', marker='o')
if history['val_loss']:
    axes[0].plot(history['val_loss'], label='Val Loss', marker='s')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training and Validation Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Learning rate
axes[1].plot(history['learning_rate'], marker='o', color='green')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Learning Rate')
axes[1].set_title('Learning Rate Schedule')
axes[1].set_yscale('log')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 9. Evaluate Model Performance

### 9.1 Semantic Similarity Evaluation

In [None]:
# Create test pairs (similar and dissimilar)
test_pairs = [
    # Similar pairs (should have high similarity)
    ("The dog is running", "A canine is jogging", True),
    ("I like ice cream", "Ice cream is something I enjoy", True),
    ("The book is interesting", "This novel is fascinating", True),
    ("Programming is fun", "I enjoy coding", True),
    ("The sky is blue", "Blue is the color of the sky", True),
    
    # Dissimilar pairs (should have low similarity)
    ("The dog is running", "I like ice cream", False),
    ("Programming is fun", "The sky is blue", False),
    ("The book is interesting", "The dog is running", False),
    ("I like ice cream", "Programming is fun", False),
    ("The sky is blue", "The book is interesting", False),
]

# Compute similarities
model.eval()
similarities = []
labels = []

with torch.no_grad():
    for sent1, sent2, is_similar in test_pairs:
        # Encode
        enc1 = tokenizer.encode(sent1, return_tensors="pt")
        enc2 = tokenizer.encode(sent2, return_tensors="pt")
        
        # Get embeddings
        emb1 = model(enc1['input_ids'].to(device), enc1['attention_mask'].to(device))['embeddings']
        emb2 = model(enc2['input_ids'].to(device), enc2['attention_mask'].to(device))['embeddings']
        
        # Cosine similarity
        sim = torch.nn.functional.cosine_similarity(emb1, emb2).item()
        similarities.append(sim)
        labels.append(is_similar)
        
        print(f"Similarity: {sim:.3f} | Similar: {is_similar}")
        print(f"  S1: {sent1}")
        print(f"  S2: {sent2}\n")

# Compute average similarity for similar vs dissimilar pairs
similar_sims = [s for s, l in zip(similarities, labels) if l]
dissimilar_sims = [s for s, l in zip(similarities, labels) if not l]

print(f"\nAverage similarity for similar pairs: {np.mean(similar_sims):.3f} ± {np.std(similar_sims):.3f}")
print(f"Average similarity for dissimilar pairs: {np.mean(dissimilar_sims):.3f} ± {np.std(dissimilar_sims):.3f}")
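
Comparing the two averages is a coarse summary. One possible complement is a pairwise ranking score (equivalent to ROC AUC): the fraction of (similar, dissimilar) score pairs in which the similar pair scores higher. The helper below is a hypothetical sketch, and the scores passed to it are made-up numbers for illustration, not results from this model.

```python
def pairwise_auc(similar_scores, dissimilar_scores):
    """Fraction of (similar, dissimilar) score pairs ranked correctly; ties count half."""
    wins = 0.0
    for pos in similar_scores:
        for neg in dissimilar_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(similar_scores) * len(dissimilar_scores))


# Made-up scores for illustration only.
example_auc = pairwise_auc([0.9, 0.7, 0.4], [0.5, 0.2])
```

In the notebook this could be called as `pairwise_auc(similar_sims, dissimilar_sims)`. A value of 1.0 means every similar pair outscored every dissimilar pair, while 0.5 is what random scores would give.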

### 9.2 Visualize Similarity Distribution

In [None]:
# Plot similarity distributions
plt.figure(figsize=(10, 5))

plt.hist(similar_sims, alpha=0.7, label='Similar pairs', bins=10, color='green')
plt.hist(dissimilar_sims, alpha=0.7, label='Dissimilar pairs', bins=10, color='red')

plt.xlabel('Cosine Similarity')
plt.ylabel('Frequency')
plt.title('Similarity Distribution for Similar vs Dissimilar Pairs')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## 10. Visualize Embeddings with t-SNE

In [None]:
# Encode a sample of training sentences, four per topic
sample_sentences = [
    # Technology cluster
    "The computer is very fast",
    "I love programming in Python",
    "Machine learning is fascinating",
    "Neural networks are powerful",
    
    # Nature cluster
    "The weather is sunny today",
    "Trees provide oxygen",
    "The ocean is vast",
    "Mountains are tall",
    
    # Food cluster
    "Pizza is delicious",
    "Fresh fruit is nutritious",
    "Water is essential",
    "Vegetables are healthy",
]

# Categories for coloring
categories = ['Tech']*4 + ['Nature']*4 + ['Food']*4

# Get embeddings
embeddings_list = []
model.eval()

with torch.no_grad():
    for sent in sample_sentences:
        enc = tokenizer.encode(sent, return_tensors="pt")
        emb = model(enc['input_ids'].to(device), enc['attention_mask'].to(device))['embeddings']
        embeddings_list.append(emb.cpu().numpy())

embeddings_array = np.vstack(embeddings_list)

print(f"Embeddings shape: {embeddings_array.shape}")

In [None]:
# Apply t-SNE for visualization
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
embeddings_2d = tsne.fit_transform(embeddings_array)

# Plot
plt.figure(figsize=(12, 8))

colors = {'Tech': 'blue', 'Nature': 'green', 'Food': 'orange'}
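for i, (sent, cat) in enumerate(zip(sample_sentences, categories)):
    # Label each category only once so the legend has no duplicates
    seen = plt.gca().get_legend_handles_labels()[1]
    plt.scatter(embeddings_2d[i, 0], embeddings_2d[i, 1],
                c=colors[cat], s=100, alpha=0.6,
                label=cat if cat not in seen else "")
    # Append an ellipsis only when the sentence is actually truncated
    short = sent if len(sent) <= 30 else sent[:30] + '...'
    plt.annotate(short, (embeddings_2d[i, 0], embeddings_2d[i, 1]),
                 fontsize=8, alpha=0.7)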

plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.title('t-SNE Visualization of Sentence Embeddings')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 11. Embedding Statistics and Analysis

In [None]:
# Compute embedding statistics
stats = compute_embedding_statistics(embeddings_array)

print("Embedding Statistics:")
print("=" * 40)
for key, value in stats.items():
    print(f"{key:20s}: {value:.4f}")

## 12. Similarity Heatmap

In [None]:
# Compute similarity matrix
from sklearn.metrics.pairwise import cosine_similarity

sim_matrix = cosine_similarity(embeddings_array)

# Plot heatmap (cosine similarity ranges from -1 to 1, so center the colormap at 0)
tick_labels = [s if len(s) <= 25 else s[:25] + '...' for s in sample_sentences]
plt.figure(figsize=(14, 12))
sns.heatmap(sim_matrix,
            xticklabels=tick_labels,
            yticklabels=tick_labels,
            annot=True, fmt='.2f', cmap='RdYlGn', center=0,
            vmin=-1, vmax=1)
plt.title('Cosine Similarity Heatmap')
plt.tight_layout()
plt.show()

## 13. Save Model and Tokenizer

In [None]:
import os
import pickle

# Create models directory if it doesn't exist
os.makedirs('../models', exist_ok=True)

# Save model
torch.save({
    'model_state_dict': model.state_dict(),
    'model_config': model_config,
    'training_history': history
}, '../models/embedding_model_final.pt')

# Save tokenizer
with open('../models/tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)

print("Model and tokenizer saved successfully!")
print(f"  - Model: ../models/embedding_model_final.pt")
print(f"  - Tokenizer: ../models/tokenizer.pkl")

## 14. Inference Example - Use the Trained Model

In [None]:
def get_embedding(text, model, tokenizer, device):
    """Get embedding for a single text"""
    model.eval()
    with torch.no_grad():
        enc = tokenizer.encode(text, return_tensors="pt")
        emb = model(
            enc['input_ids'].to(device),
            enc['attention_mask'].to(device)
        )['embeddings']
        return emb.cpu().numpy()

def find_most_similar(query, candidates, model, tokenizer, device, top_k=3):
    """Find most similar sentences to query"""
    # Get query embedding
    query_emb = get_embedding(query, model, tokenizer, device)
    
    # Get candidate embeddings
    candidate_embs = np.vstack([
        get_embedding(c, model, tokenizer, device) for c in candidates
    ])
    
    # Compute similarities
    similarities = cosine_similarity(query_emb, candidate_embs)[0]
    
    # Get top-k
    top_indices = np.argsort(similarities)[::-1][:top_k]
    
    results = []
    for idx in top_indices:
        results.append({
            'text': candidates[idx],
            'similarity': similarities[idx]
        })
    
    return results

In [None]:
# Test similarity search
query = "I enjoy building software applications"

candidates = [
    "The computer is very fast",
    "I love programming in Python",
    "The weather is sunny today",
    "Machine learning is fascinating",
    "Pizza is delicious",
    "Trees provide oxygen",
    "Neural networks are powerful",
    "The ocean is vast",
]

results = find_most_similar(query, candidates, model, tokenizer, device, top_k=5)

print(f"Query: '{query}'\n")
print("Most similar sentences:")
print("=" * 60)
for i, result in enumerate(results, 1):
    print(f"{i}. [{result['similarity']:.3f}] {result['text']}")
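
A ranked list like the one printed above can be scored once each query has a known relevant candidate. The sketch below shows two common retrieval metrics, Recall@k and mean reciprocal rank (MRR). The helper names and the rankings are hypothetical; the `evaluate_retrieval` function imported from `src.evaluation` may compute these or other metrics with a different interface.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant item appears in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids, relevant_id):
    """1 / rank of the relevant item (ranks start at 1); 0.0 if it is missing."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item == relevant_id:
            return 1.0 / rank
    return 0.0


# Hypothetical rankings: (ranked candidate ids, id of the relevant candidate).
runs = [(["c2", "c1", "c3"], "c1"), (["c3", "c2", "c1"], "c3")]
mean_recall_at_1 = sum(recall_at_k(r, rel, 1) for r, rel in runs) / len(runs)
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

In this toy case the first query finds its relevant item at rank 2 and the second at rank 1, giving Recall@1 of 0.5 and MRR of 0.75.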

## 15. Model Architecture Summary and Next Steps

### Current Architecture:
- ✅ Token + Positional Embeddings
- ✅ Multi-layer Transformer Encoder
- ✅ Mean Pooling
- ✅ L2 Normalization
- ✅ Contrastive Learning (Multiple Negatives Ranking Loss)

### Modular Components Ready for MoE:
1. **Feed-Forward Layers** in `src/models/encoder.py:FeedForward`
   - Can be replaced with MoE layers
   - Each expert specializes in different linguistic patterns

2. **Pooling Layer** in `src/models/pooling.py:Pooler`
   - Can add expert-based pooling strategies
   - Different experts for different sentence types

3. **Expert Module Placeholder** in `src/experts/`
   - Ready for MoE implementation
   - Will include gating networks and expert selection

### Next Steps for MoE Integration:
1. Implement expert networks (specialized FFNs)
2. Design gating mechanism (Top-K routing)
3. Add load balancing loss
4. Replace FFN layers with MoE layers
5. Fine-tune on diverse domains to learn expert specialization
6. Analyze which experts activate for different inputs
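
To give a feel for steps 1, 2, and 4, here is a minimal plain-Python sketch of Top-K routing for a single token. Everything in it is hypothetical: the function names, the choice to apply softmax over only the selected logits (some designs apply softmax first and renormalize), and the toy "experts" that simply scale their input. It omits the load balancing loss from step 3.

```python
import math


def top_k_gate(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over just those logits."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    top = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - top) for i in chosen]
    total = sum(exps)
    return {i: e / total for i, e in zip(chosen, exps)}


def moe_output(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs for one token vector x."""
    gates = top_k_gate(router_logits, k)
    out = [0.0] * len(x)
    for idx, weight in gates.items():
        expert_out = experts[idx](x)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Three toy experts that just scale the input; real experts would be FFNs.
experts = [lambda v: [1.0 * a for a in v],
           lambda v: [2.0 * a for a in v],
           lambda v: [3.0 * a for a in v]]
gates = top_k_gate([0.0, 1.0, 1.0], k=2)
mixed = moe_output([1.0, 1.0], experts, [0.0, 1.0, 1.0], k=2)
```

Experts 1 and 2 tie for the highest router score, so each receives a gate weight of 0.5 and expert 0 is skipped entirely. Skipping unselected experts is what keeps the compute cost per token roughly constant as more experts are added.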

### Performance Notes:
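- The model is trained on only 56 synthetic pairs (44 after the validation split), so the similarity results here are illustrative, not a benchmark
- The contrastive objective pulls paraphrases together; the t-SNE plot and heatmap above show how well topic clusters emerge in a given run
- MoE layers are a possible route to more capacity and specialization once real training data is used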

## Summary

This notebook demonstrated:
1. ✅ Building a vocabulary and tokenizer
2. ✅ Creating a modular embedding model architecture
3. ✅ Training with contrastive learning
4. ✅ Evaluating semantic similarity
5. ✅ Visualizing embeddings with t-SNE
6. ✅ Performing similarity search

The architecture is **fully modular** and ready for MoE integration in the next phase!