# ⚠️ Memory Optimization for Apple Silicon (MPS) Users

If you encounter `MPS backend out of memory` errors, this notebook includes several automatic optimizations:

## Automatic Optimizations Applied:
- **Reduced Batch Size**: Automatically reduced from 8 to 4 for MPS
- **Gradient Checkpointing**: Enabled to reduce memory usage during backpropagation
- **Memory Cleanup**: Regular `torch.mps.empty_cache()` calls between operations
- **Gradient Accumulation**: Maintains effective batch size through accumulation
- **Disabled Features**: Pin memory and multiprocessing disabled for MPS stability

## Manual Environment Settings (Optional):
If you still encounter memory issues, run these commands in terminal before training:

```bash
# Disable MPS memory limit (may cause system instability)
export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0

# Further reduce batch size
export PHYSBERT_BATCH_SIZE=2

# Reduce training examples for testing
export PHYSBERT_TEST_SIZE=5

# Skip data augmentation to reduce memory usage
export PHYSBERT_SKIP_AUGMENTATION=true
```

## Memory-Friendly Training Settings:
- **Small Dataset**: Use `PHYSBERT_TEST_SIZE=10` for initial testing
- **Shorter Sequences**: Model automatically uses 512 max length
- **Single Process**: No multiprocessing to avoid memory conflicts
- **Evaluation Batching**: Reduced batch sizes during model comparison

---

# PhysBERT Transformers-Based Fine-tuning Pipeline for Physics Domain Specialization

## Overview
This notebook fine-tunes the **PhysBERT** model (`thellert/physbert_cased`) on physics-specific triplet data using the **transformers** library directly (NOT sentence-transformers). The goal is to create specialized embeddings that better understand physics concepts and relationships through contrastive learning.

## Architecture Strategy
- **Base Model**: PhysBERT (`thellert/physbert_cased`) - BERT model pre-trained on physics literature
- **Fine-tuning Method**: Native transformers with custom triplet loss and training loop
- **Training Data**: Query-Positive-Negative triplets from physics textbook content
- **Loss Function**: Triplet loss with margin-based contrastive learning
- **Output**: Physics-specialized embedding model for domain-specific tasks

## Key Differences from Sentence-Transformers Approach

### Transformers-Based Approach (This Notebook):
- **Direct Model Control**: Full access to model architecture and training loop
- **Custom Loss Implementation**: Hand-crafted triplet loss with gradient computation
- **Flexible Pooling**: Custom mean pooling implementation for embeddings
- **Training Loop**: Manual forward/backward pass with optimizer control
- **Memory Efficiency**: Better control over batch processing and gradient accumulation

### Sentence-Transformers Approach (Original):
- **High-level API**: Built-in triplet training with automatic handling
- **Less Control**: Limited customization of training process
- **Simplified Setup**: Easy configuration but less flexibility

## Files and Dependencies

### Input Files:
- `data/physics_triplets_st.json` - Triplet training data in format: `{"texts": [query, positive, negative]}`
- `config/physbert_config.yaml` - Training configuration and hyperparameters
- `main.tex` - Original LaTeX file with physics content (source for triplets)

### Key Dependencies:
```python
transformers>=4.21.0  # For PhysBERT model and tokenizer
torch>=1.12.0         # PyTorch for model training
datasets>=2.0.0       # Data handling and processing
accelerate>=0.20.0    # Training acceleration
scikit-learn>=1.1.0   # For evaluation metrics
pyyaml>=6.0          # Configuration management
```

### Output Files:
- `models/physbert-transformers-finetuned/` - Fine-tuned model directory
- `logs/physbert_transformers_training.log` - Training logs and metrics
- `models/training_metadata_transformers.yaml` - Training configuration and results

## Training Process Flow

1. **Data Loading**: Load physics triplets from JSON with proper validation
2. **Model Setup**: Load PhysBERT using transformers AutoModel/AutoTokenizer
3. **Custom Embedding Layer**: Implement mean pooling for generating embeddings
4. **Triplet Loss Implementation**: Custom triplet loss function with margin
5. **Training Loop**: Manual forward/backward pass with optimizer control
6. **Data Augmentation**: Generate additional triplets via negative shuffling
7. **Evaluation**: Compare base vs fine-tuned model on physics queries
8. **Model Saving**: Export fine-tuned model for downstream physics tasks

## Expected Improvements

### Base PhysBERT Performance:
- General physics understanding from pre-training
- Good but not optimized for specific textbook concepts

### Fine-tuned Model Performance:
- **Higher query-positive similarity** for physics concepts
- **Better separation** between relevant and irrelevant physics content  
- **Domain-specific optimization** for textbook chapter material
- **Improved retrieval** for physics Q&A and concept matching
- **Lower-level control** over embedding generation process

## Usage Example

```python
# Load fine-tuned model
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("models/physbert-transformers-finetuned")
model = AutoModel.from_pretrained("models/physbert-transformers-finetuned")

# Generate embeddings for physics queries
def get_embeddings(texts):
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state.mean(dim=1)  # Mean pooling
    return embeddings

query = "What describes energy loss of charged particles in matter?"
embeddings = get_embeddings([query])
```

## Configuration

Training parameters are managed through `config/physbert_config.yaml` following RAGAS framework patterns with environment variable overrides:

- `PHYSBERT_BATCH_SIZE`: Override training batch size
- `PHYSBERT_EPOCHS`: Override number of training epochs  
- `PHYSBERT_OUTPUT_PATH`: Override model save directory
- `PHYSBERT_TEST_SIZE`: Limit training data for testing
- `PHYSBERT_LEARNING_RATE`: Override learning rate

---

In [1]:
import json
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW  # Fixed: Use torch.optim.AdamW instead of transformers.AdamW
from transformers import AutoTokenizer, AutoModel, get_linear_schedule_with_warmup
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from typing import List, Dict, Tuple
import yaml
from datetime import datetime
import random
from tqdm import tqdm
import logging
import gc  # For garbage collection

# Setup logging with handler management to prevent duplicates
def setup_logging():
    logger = logging.getLogger(__name__)
    
    # Clear existing handlers to prevent duplicates
    if logger.handlers:
        logger.handlers.clear()
    
    # Only add handler if none exists
    if not logger.handlers:
        handler = logging.StreamHandler()
        formatter = logging.Formatter('%(levelname)s: %(message)s')
        handler.setFormatter(formatter)
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    
    return logger

logger = setup_logging()

def check_mps_memory():
    """
    Check MPS memory usage and provide recommendations
    """
    if torch.backends.mps.is_available():
        try:
            # Get memory info (this may not work on all macOS versions)
            print("🍎 MPS Memory Status:")
            print(f"   - MPS Available: {torch.backends.mps.is_available()}")
            print(f"   - MPS Built: {torch.backends.mps.is_built()}")
            
            # Force garbage collection
            gc.collect()
            torch.mps.empty_cache()
            print("Memory cache cleared")
        except Exception as e:
            print(f"Warning: Could not get MPS memory info: {e}")
    else:
        print("MPS not available on this system")

def optimize_for_memory():
    """
    Apply memory optimization settings for training (removed watermark ratio)
    """
    print("🔧 Applied memory optimizations:")
    
    if torch.backends.mps.is_available():
        torch.mps.empty_cache()
        print("- MPS cache cleared")
    
    gc.collect()
    print("- Garbage collection completed")

def load_triplets_from_json(json_path: str = "data/physics_triplets_st.json") -> List[Dict]:
    """
    Load physics triplets from JSON for transformers-based training
    
    Args:
        json_path: Path to the physics triplets JSON file
        
    Returns:
        List of dictionaries with triplet data
    """
    try:
        config_path = os.getenv('TRIPLET_DATA_PATH', json_path)
        
        with open(config_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
        
        triplets = []
        for item in data:
            if 'texts' in item and len(item['texts']) == 3:
                triplet = {
                    'anchor': item['texts'][0],    # query
                    'positive': item['texts'][1],  # relevant text
                    'negative': item['texts'][2],  # irrelevant text
                    'id': item.get('id', f'triplet_{len(triplets)}')
                }
                triplets.append(triplet)
        
        print(f"Loaded {len(triplets)} triplet examples for training")
        return triplets
        
    except FileNotFoundError:
        print(f"Error: Physics triplet file not found: {json_path}")
        return []
    except Exception as e:
        print(f"Error loading triplets: {e}")
        return []



In [2]:
def augment_physics_triplets(triplets: List[Dict], 
                           negative_shuffle_factor: int = 2,
                           hard_negative_factor: int = 1) -> List[Dict]:
    """
    Augment physics triplets using negative shuffling and hard negatives
    
    Args:
        triplets: Original triplet dictionaries
        negative_shuffle_factor: Number of shuffled negatives per original
        hard_negative_factor: Number of hard negatives per original
        
    Returns:
        Augmented list of triplet dictionaries
    """
    augmented = triplets.copy()  # Keep originals
    
    # Extract all negatives and positives
    all_negatives = [t['negative'] for t in triplets]
    all_positives = [t['positive'] for t in triplets]
    
    for i, triplet in enumerate(triplets):
        anchor = triplet['anchor']
        positive = triplet['positive']
        
        # Negative shuffling: use other negatives
        other_negatives = [neg for j, neg in enumerate(all_negatives) if j != i]
        for shuffle_idx in range(min(negative_shuffle_factor, len(other_negatives))):
            shuffled_neg = random.choice(other_negatives)
            other_negatives.remove(shuffled_neg)
            
            augmented.append({
                'anchor': anchor,
                'positive': positive,
                'negative': shuffled_neg,
                'id': f"{triplet['id']}_shuffle_{shuffle_idx}"
            })
        
        # Hard negatives: use other positives as challenging negatives
        other_positives = [pos for j, pos in enumerate(all_positives) if j != i]
        for hard_idx in range(min(hard_negative_factor, len(other_positives))):
            hard_neg = random.choice(other_positives)
            other_positives.remove(hard_neg)
            
            augmented.append({
                'anchor': anchor,
                'positive': positive, 
                'negative': hard_neg,
                'id': f"{triplet['id']}_hard_{hard_idx}"
            })
    
    print(f"Data augmentation: {len(triplets)} -> {len(augmented)} examples")
    return augmented

In [3]:
class TripletDataset(Dataset):
    """
    PyTorch Dataset for triplet training with transformers
    """
    
    def __init__(self, triplets: List[Dict], tokenizer, max_length: int = 512):
        self.triplets = triplets
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.triplets)
    
    def __getitem__(self, idx):
        triplet = self.triplets[idx]
        
        # Tokenize all three texts
        anchor_inputs = self.tokenizer(
            triplet['anchor'],
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        positive_inputs = self.tokenizer(
            triplet['positive'],
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        negative_inputs = self.tokenizer(
            triplet['negative'],
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        return {
            'anchor_input_ids': anchor_inputs['input_ids'].squeeze(),
            'anchor_attention_mask': anchor_inputs['attention_mask'].squeeze(),
            'positive_input_ids': positive_inputs['input_ids'].squeeze(),
            'positive_attention_mask': positive_inputs['attention_mask'].squeeze(),
            'negative_input_ids': negative_inputs['input_ids'].squeeze(),
            'negative_attention_mask': negative_inputs['attention_mask'].squeeze(),
            'triplet_id': triplet['id']
        }

In [4]:
class PhysBERTEmbeddingModel(nn.Module):
    """
    PhysBERT model wrapper for generating embeddings with custom pooling
    """
    
    def __init__(self, model_name: str = "thellert/physbert_cased"):
        super(PhysBERTEmbeddingModel, self).__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.pooling_strategy = 'mean'  # Can be 'mean', 'cls', 'max'
        
    def forward(self, input_ids, attention_mask):
        """
        Forward pass to generate embeddings
        
        Args:
            input_ids: Token IDs
            attention_mask: Attention mask
            
        Returns:
            Pooled embeddings
        """
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        
        if self.pooling_strategy == 'mean':
            # Mean pooling over sequence length
            token_embeddings = outputs.last_hidden_state
            attention_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
            sum_embeddings = torch.sum(token_embeddings * attention_mask_expanded, 1)
            sum_mask = torch.clamp(attention_mask_expanded.sum(1), min=1e-9)
            embeddings = sum_embeddings / sum_mask
            
        elif self.pooling_strategy == 'cls':
            # Use [CLS] token embedding
            embeddings = outputs.last_hidden_state[:, 0, :]
            
        elif self.pooling_strategy == 'max':
            # Max pooling over sequence length
            token_embeddings = outputs.last_hidden_state
            attention_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
            token_embeddings[attention_mask_expanded == 0] = -1e9  # Set padding tokens to very negative values
            embeddings = torch.max(token_embeddings, 1)[0]
            
        return embeddings
    
    def encode(self, texts: List[str], tokenizer, device, batch_size: int = 32):
        """
        Encode texts to embeddings (for evaluation) with memory optimization
        
        Args:
            texts: List of texts to encode
            tokenizer: Tokenizer instance
            device: Device to run on
            batch_size: Batch size for encoding
            
        Returns:
            Numpy array of embeddings
        """
        self.eval()
        embeddings = []
        
        # Memory optimization: Reduce batch size for MPS
        if device.type == 'mps':
            batch_size = min(batch_size, 16)  # Smaller batches for MPS
        
        with torch.no_grad():
            for i in range(0, len(texts), batch_size):
                batch_texts = texts[i:i+batch_size]
                
                inputs = tokenizer(
                    batch_texts,
                    max_length=512,
                    padding=True,
                    truncation=True,
                    return_tensors='pt'
                ).to(device, non_blocking=False)  # Disable non_blocking for MPS
                
                batch_embeddings = self.forward(
                    inputs['input_ids'],
                    inputs['attention_mask']
                )
                
                embeddings.append(batch_embeddings.cpu().numpy())
                
                # Memory cleanup for MPS after each batch
                if device.type == 'mps':
                    torch.mps.empty_cache()
                elif device.type == 'cuda':
                    torch.cuda.empty_cache()
                
                # Clear input tensors
                del inputs, batch_embeddings
        
        return np.vstack(embeddings)

In [5]:
class TripletLoss(nn.Module):
    """
    Custom triplet loss implementation for contrastive learning
    """
    
    def __init__(self, margin: float = 0.5, distance_metric: str = 'cosine'):
        super(TripletLoss, self).__init__()
        self.margin = margin
        self.distance_metric = distance_metric.lower()
        
    def forward(self, anchor_embeddings, positive_embeddings, negative_embeddings):
        """
        Compute triplet loss
        
        Args:
            anchor_embeddings: Anchor embeddings
            positive_embeddings: Positive embeddings
            negative_embeddings: Negative embeddings
            
        Returns:
            Triplet loss value
        """
        if self.distance_metric == 'cosine':
            # Use cosine similarity (higher is better)
            pos_similarity = F.cosine_similarity(anchor_embeddings, positive_embeddings, dim=1)
            neg_similarity = F.cosine_similarity(anchor_embeddings, negative_embeddings, dim=1)
            
            # For cosine similarity, we want: pos_sim > neg_sim + margin
            # So loss = max(0, margin - (pos_sim - neg_sim))
            loss = F.relu(self.margin - (pos_similarity - neg_similarity))
            
        elif self.distance_metric == 'euclidean':
            # Use Euclidean distance (lower is better)
            pos_distance = F.pairwise_distance(anchor_embeddings, positive_embeddings, p=2)
            neg_distance = F.pairwise_distance(anchor_embeddings, negative_embeddings, p=2)
            
            # For euclidean distance, we want: pos_dist + margin < neg_dist
            # So loss = max(0, pos_dist - neg_dist + margin)
            loss = F.relu(pos_distance - neg_distance + self.margin)
            
        else:
            raise ValueError(f"Unsupported distance metric: {self.distance_metric}")
        
        return loss.mean()
    
    def get_similarities(self, anchor_embeddings, positive_embeddings, negative_embeddings):
        """
        Get similarity scores for evaluation
        
        Returns:
            Dictionary with positive and negative similarities
        """
        with torch.no_grad():
            if self.distance_metric == 'cosine':
                pos_sim = F.cosine_similarity(anchor_embeddings, positive_embeddings, dim=1)
                neg_sim = F.cosine_similarity(anchor_embeddings, negative_embeddings, dim=1)
                return {
                    'positive_similarity': pos_sim.mean().item(),
                    'negative_similarity': neg_sim.mean().item(),
                    'margin': (pos_sim - neg_sim).mean().item()
                }
            else:
                pos_dist = F.pairwise_distance(anchor_embeddings, positive_embeddings, p=2)
                neg_dist = F.pairwise_distance(anchor_embeddings, negative_embeddings, p=2)
                return {
                    'positive_distance': pos_dist.mean().item(),
                    'negative_distance': neg_dist.mean().item(),
                    'margin': (neg_dist - pos_dist).mean().item()
                }

In [6]:
def _load_training_config(config_path: str) -> dict:
    """
    Load training configuration with fallback defaults for transformers training
    """
    default_config = {
        'model': {'base_model': 'thellert/physbert_cased'},
        'data': {'path': 'data/physics_triplets_st.json'},
        'augmentation': {
            'enabled': True,
            'negative_shuffle_factor': 2, 
            'hard_negative_factor': 1
        },
        'training': {
            'batch_size': 8,
            'epochs': 3,
            'learning_rate': 2e-5,
            'warmup_ratio': 0.1,
            'weight_decay': 0.01,
            'max_length': 512,
            'gradient_accumulation_steps': 1,
            'output_path': 'models/physbert-transformers-finetuned',
            'save_steps': 100,
            'eval_steps': 50,
            'triplet_loss': {
                'distance_metric': 'cosine', 
                'margin': 0.5
            }
        },
        'evaluation': {
            'enabled': True,
            'test_ratio': 0.2,
            'comparison': {
                'enabled': True,
                'sample_queries': 5
            }
        },
        'logging': {
            'level': 'INFO',
            'save_logs': True,
            'log_file': 'logs/physbert_transformers_training.log'
        }
    }
    
    try:
        if os.path.exists(config_path):
            with open(config_path, 'r') as f:
                user_config = yaml.safe_load(f)
            # Deep merge configs with user overrides
            def deep_merge(default, user):
                for key, value in user.items():
                    if key in default and isinstance(default[key], dict) and isinstance(value, dict):
                        deep_merge(default[key], value)
                    else:
                        default[key] = value
                return default
            
            return deep_merge(default_config, user_config)
        else:
            print(f"⚠️  Config file not found: {config_path}, using defaults")
            return default_config
    except Exception as e:
        print(f"⚠️  Error loading config: {e}, using defaults")
        return default_config

In [7]:
def train_physbert_transformers(config_path: str = "config/physbert_config.yaml"):
    """
    Main training function for PhysBERT fine-tuning using transformers
    
    Args:
        config_path: Path to training configuration file
        
    Returns:
        Trained model and tokenizer
    """
    # Configuration-driven setup following RAGAS patterns
    config = _load_training_config(config_path)
    
    print("PhysBERT Transformers Fine-tuning Pipeline")
    print("=" * 50)
    
    # Set device with Apple Silicon MPS support and memory optimization
    if torch.backends.mps.is_available():
        device = torch.device('mps')
        print(f"Using Apple Silicon MPS: {device}")
        torch.mps.empty_cache()
    elif torch.cuda.is_available():
        device = torch.device('cuda')
        print(f"Using CUDA GPU: {device}")
        torch.cuda.empty_cache()
    else:
        device = torch.device('cpu')
        print(f"Using CPU: {device}")
    
    try:
        # Load base model and tokenizer
        base_model_name = config.get('model', {}).get('base_model', 'thellert/physbert_cased')
        print(f"Loading base model: {base_model_name}")
        
        tokenizer = AutoTokenizer.from_pretrained(base_model_name)
        model = PhysBERTEmbeddingModel(base_model_name).to(device)
        
        # Enable gradient checkpointing for memory efficiency
        if hasattr(model.bert, 'gradient_checkpointing_enable'):
            model.bert.gradient_checkpointing_enable()
            print("Enabled gradient checkpointing")
        
        print("Model and tokenizer loaded successfully")
        
        # Load training data
        data_path = os.getenv('TRIPLET_DATA_PATH', config.get('data', {}).get('path', 'data/physics_triplets_st.json'))
        triplets = load_triplets_from_json(data_path)
        
        if not triplets:
            raise ValueError(f"No training examples loaded from {data_path}")
        
        # Limit examples for demo/test runs
        test_limit = int(os.getenv('PHYSBERT_TEST_SIZE', len(triplets)))
        triplets = triplets[:test_limit]
        print(f"Using {len(triplets)} examples for training")
        
        # Data augmentation with config-driven parameters
        aug_config = config.get('augmentation', {})
        if aug_config.get('enabled', True) and os.getenv('PHYSBERT_SKIP_AUGMENTATION') != 'true':
            augmented_triplets = augment_physics_triplets(
                triplets,
                negative_shuffle_factor=aug_config.get('negative_shuffle_factor', 2),
                hard_negative_factor=aug_config.get('hard_negative_factor', 1)
            )
        else:
            augmented_triplets = triplets
            print("Data augmentation disabled")
        
        # Create dataset and dataloader with memory-optimized settings
        train_config = config.get('training', {})
        max_length = train_config.get('max_length', 512)
        
        # Memory optimization: Reduce batch size for MPS
        original_batch_size = int(os.getenv('PHYSBERT_BATCH_SIZE', train_config.get('batch_size', 8)))
        if device.type == 'mps':
            batch_size = min(original_batch_size, 4)
            print(f"MPS optimization: batch size reduced from {original_batch_size} to {batch_size}")
        else:
            batch_size = original_batch_size
        
        dataset = TripletDataset(augmented_triplets, tokenizer, max_length=max_length)
        dataloader = DataLoader(
            dataset, 
            batch_size=batch_size, 
            shuffle=True,
            pin_memory=False if device.type == 'mps' else True,  # Disable pin_memory for MPS
            num_workers=0  # Disable multiprocessing for MPS stability
        )
        
        # Setup loss function
        triplet_config = train_config.get('triplet_loss', {})
        criterion = TripletLoss(
            margin=triplet_config.get('margin', 0.5),
            distance_metric=triplet_config.get('distance_metric', 'cosine')
        )
        
        # Setup optimizer and scheduler
        learning_rate = float(os.getenv('PHYSBERT_LEARNING_RATE', train_config.get('learning_rate', 2e-5)))
        weight_decay = train_config.get('weight_decay', 0.01)
        epochs = int(os.getenv('PHYSBERT_EPOCHS', train_config.get('epochs', 3)))
        
        optimizer = AdamW(
            model.parameters(),
            lr=learning_rate,
            weight_decay=weight_decay
        )
        
        total_steps = len(dataloader) * epochs
        warmup_steps = int(total_steps * train_config.get('warmup_ratio', 0.1))
        
        scheduler = get_linear_schedule_with_warmup(
            optimizer,
            num_warmup_steps=warmup_steps,
            num_training_steps=total_steps
        )
        
        # Memory optimization: Enable gradient accumulation if batch size was reduced
        gradient_accumulation_steps = max(1, original_batch_size // batch_size)
        
        print(f"🚀 Training Configuration:")
        print(f"   - Examples: {len(augmented_triplets)}")
        print(f"   - Batch size: {batch_size} (gradient accumulation: {gradient_accumulation_steps})")
        print(f"   - Epochs: {epochs}")
        print(f"   - Learning rate: {learning_rate}")
        print(f"   - Total steps: {total_steps}")
        print(f"   - Warmup steps: {warmup_steps}")
        print(f"   - Max length: {max_length}")
        print(f"   - Device: {device}")
        
        # Training loop with memory optimization
        model.train()
        total_loss = 0
        step_count = 0
        accumulated_loss = 0
        
        print("\nStarting training...")
        
        for epoch in range(epochs):
            print(f"\nEpoch {epoch + 1}/{epochs}")
            epoch_loss = 0
            
            progress_bar = tqdm(dataloader, desc=f"Epoch {epoch + 1}")
            
            for batch_idx, batch in enumerate(progress_bar):
                # Move batch to device
                anchor_input_ids = batch['anchor_input_ids'].to(device, non_blocking=False)
                anchor_attention_mask = batch['anchor_attention_mask'].to(device, non_blocking=False)
                positive_input_ids = batch['positive_input_ids'].to(device, non_blocking=False)
                positive_attention_mask = batch['positive_attention_mask'].to(device, non_blocking=False)
                negative_input_ids = batch['negative_input_ids'].to(device, non_blocking=False)
                negative_attention_mask = batch['negative_attention_mask'].to(device, non_blocking=False)
                
                # Forward pass
                anchor_embeddings = model(anchor_input_ids, anchor_attention_mask)
                positive_embeddings = model(positive_input_ids, positive_attention_mask)
                negative_embeddings = model(negative_input_ids, negative_attention_mask)
                
                # Compute loss
                loss = criterion(anchor_embeddings, positive_embeddings, negative_embeddings)
                
                # Scale loss for gradient accumulation
                loss = loss / gradient_accumulation_steps
                accumulated_loss += loss.item()
                
                # Backward pass
                loss.backward()
                
                # Only step optimizer after accumulating gradients
                if (batch_idx + 1) % gradient_accumulation_steps == 0 or (batch_idx + 1) == len(dataloader):
                    optimizer.step()
                    scheduler.step()
                    optimizer.zero_grad()
                    
                    # Update metrics
                    total_loss += accumulated_loss
                    epoch_loss += accumulated_loss
                    step_count += 1
                    accumulated_loss = 0
                    
                    # Memory cleanup for MPS
                    if device.type == 'mps' and step_count % 10 == 0:
                        torch.mps.empty_cache()
                
                # Update progress bar
                progress_bar.set_postfix({
                    'loss': f'{loss.item() * gradient_accumulation_steps:.4f}',
                    'avg_loss': f'{epoch_loss / max(1, (batch_idx + 1) // gradient_accumulation_steps):.4f}',
                    'lr': f'{scheduler.get_last_lr()[0]:.2e}',
                    'device': str(device),
                    'mem_opt': 'ON' if device.type == 'mps' else 'OFF'
                })
                
                # Evaluation step
                if step_count % train_config.get('eval_steps', 50) == 0:
                    with torch.no_grad():  # Reduce memory usage during evaluation
                        similarities = criterion.get_similarities(
                            anchor_embeddings, positive_embeddings, negative_embeddings
                        )
                        print(f"\nStep {step_count} similarities: {similarities}")
            
            # Memory cleanup after epoch
            if device.type == 'mps':
                torch.mps.empty_cache()
            elif device.type == 'cuda':
                torch.cuda.empty_cache()
            
            avg_epoch_loss = epoch_loss / max(1, len(dataloader) // gradient_accumulation_steps)
            print(f"Epoch {epoch + 1} average loss: {avg_epoch_loss:.4f}")
        
        avg_total_loss = total_loss / step_count
        print(f"\nTraining completed. Average loss: {avg_total_loss:.4f}")
        
        # Save the fine-tuned model
        output_path = os.getenv('PHYSBERT_OUTPUT_PATH', train_config.get('output_path', 'models/physbert-transformers-finetuned'))
        os.makedirs(output_path, exist_ok=True)
        
        # Save model and tokenizer
        model.bert.save_pretrained(output_path)
        tokenizer.save_pretrained(output_path)
        
        # Save model wrapper state
        torch.save({
            'model_state_dict': model.state_dict(),
            'pooling_strategy': model.pooling_strategy,
            'training_config': config,
            'final_loss': avg_total_loss
        }, os.path.join(output_path, 'model_wrapper.pt'))
        
        print(f"Model saved to: {output_path}")
        
        # Save training metadata
        _save_training_metadata_transformers(config, len(augmented_triplets), output_path, avg_total_loss)
        
        # Final memory cleanup
        if device.type == 'mps':
            torch.mps.empty_cache()
        elif device.type == 'cuda':
            torch.cuda.empty_cache()
        
        return model, tokenizer, config
        
    except Exception as e:
        print(f"Training failed: {str(e)}")
        # Emergency memory cleanup
        if device.type == 'mps':
            torch.mps.empty_cache()
        elif device.type == 'cuda':
            torch.cuda.empty_cache()
        raise

def _save_training_metadata_transformers(config: dict, num_examples: int, output_path: str, final_loss: float):
    """
    Save training metadata following RAGAS results management patterns
    """
    try:
        # Get device info with MPS support
        if torch.backends.mps.is_available():
            device_info = f"mps ({torch.backends.mps.is_built()})"
        elif torch.cuda.is_available():
            device_info = f"cuda ({torch.cuda.get_device_name()})"
        else:
            device_info = "cpu"
            
        metadata = {
            'timestamp': datetime.now().isoformat(),
            'config': config,
            'training_examples': num_examples,
            'model_path': output_path,
            'framework': 'physbert_transformers_finetuning',
            'version': '1.0',
            'final_loss': final_loss,
            'device': device_info,
            'device_type': 'apple_silicon_mps' if torch.backends.mps.is_available() else 'other'
        }
        
        metadata_path = os.path.join(output_path, 'training_metadata_transformers.yaml')
        
        with open(metadata_path, 'w') as f:
            yaml.dump(metadata, f, default_flow_style=False)
        
        print(f"Training metadata saved to {metadata_path}")
        
    except Exception as e:
        print(f"Warning: Failed to save metadata: {e}")

In [8]:
def load_finetuned_model(model_path: str, device=None):
    """
    Load a fine-tuned PhysBERT model
    
    Args:
        model_path: Path to the fine-tuned model directory
        device: Device to load model on
        
    Returns:
        Loaded model, tokenizer, and config
    """
    if device is None:
        # Auto-select device with Apple Silicon MPS support
        if torch.backends.mps.is_available():
            device = torch.device('mps')
        elif torch.cuda.is_available():
            device = torch.device('cuda')
        else:
            device = torch.device('cpu')
    
    try:
        # Load tokenizer
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        
        # Load model wrapper state
        wrapper_path = os.path.join(model_path, 'model_wrapper.pt')
        if os.path.exists(wrapper_path):
            checkpoint = torch.load(wrapper_path, map_location=device)
            
            # Create model instance
            model = PhysBERTEmbeddingModel()
            model.bert = AutoModel.from_pretrained(model_path)
            model.pooling_strategy = checkpoint.get('pooling_strategy', 'mean')
            
            # Load state dict
            model.load_state_dict(checkpoint['model_state_dict'])
            model.to(device)
            
            config = checkpoint.get('training_config', {})
            
            print(f"Loaded fine-tuned model from {model_path} on {device}")
            return model, tokenizer, config
        else:
            print(f"Model wrapper not found, loading as base model")
            model = PhysBERTEmbeddingModel()
            model.bert = AutoModel.from_pretrained(model_path)
            model.to(device)
            print(f"Loaded base model on {device}")
            return model, tokenizer, {}
            
    except Exception as e:
        print(f"Error loading model: {e}")
        raise

In [9]:
def compare_models_performance_transformers(base_model_name: str, 
                                          finetuned_model_path: str, 
                                          test_triplets: List[Dict], 
                                          num_samples: int = 3,
                                          save_results: bool = True,
                                          results_file: str = "model_comparison_results.json"):
    """
    Compare base model vs fine-tuned model performance using transformers with memory optimization
    
    Args:
        base_model_name: Name/path of base model
        finetuned_model_path: Path to fine-tuned model
        test_triplets: List of triplet dictionaries for testing
        num_samples: Number of examples to test
        save_results: Whether to save detailed results to JSON
        results_file: Filename for saving JSON results
    """
    # Set device with Apple Silicon MPS support
    if torch.backends.mps.is_available():
        device = torch.device('mps')
        print(f"Using Apple Silicon MPS for comparison")
        torch.mps.empty_cache()
    elif torch.cuda.is_available():
        device = torch.device('cuda')
        print(f"Using CUDA GPU for comparison")
        torch.cuda.empty_cache()
    else:
        device = torch.device('cpu')
        print(f"Using CPU for comparison")
    
    print("\nModel Performance Comparison")
    print("=" * 40)
    
    try:
        # Load base model
        print(f"Loading base model: {base_model_name}")
        base_tokenizer = AutoTokenizer.from_pretrained(base_model_name)
        base_model = PhysBERTEmbeddingModel(base_model_name).to(device)
        base_model.eval()
        
        # Load fine-tuned model
        print(f"Loading fine-tuned model: {finetuned_model_path}")
        finetuned_model, finetuned_tokenizer, _ = load_finetuned_model(finetuned_model_path, device)
        finetuned_model.eval()
        
        # Memory optimization: Only limit if really necessary for MPS
        original_num_samples = num_samples
        if device.type == 'mps' and num_samples > 10:
            num_samples = min(num_samples, 10)  # More reasonable limit for MPS
            print(f"MPS optimization: limited to {num_samples} samples (from {original_num_samples})")
        
        # Select sample queries for comparison
        sample_triplets = test_triplets[:num_samples]
        
        print(f"\nTesting {len(sample_triplets)} triplets on {device}...")
        
        base_improvements = 0
        total_tests = 0
        detailed_results = []  # Store detailed results for JSON
        
        with torch.no_grad():  # Disable gradients for entire comparison
            for i, triplet in enumerate(sample_triplets, 1):
                query = triplet['anchor']
                positive = triplet['positive']
                negative = triplet['negative']
                
                print(f"\nTest Case {i}:")
                print(f"Query: {query[:80]}...")
                print(f"Positive: {positive[:60]}...")
                print(f"Negative: {negative[:60]}...")
                
                # Memory-efficient encoding with smaller batch size
                encode_batch_size = 1 if device.type == 'mps' else 4
                
                # Base model similarities
                base_query_emb = base_model.encode([query], base_tokenizer, device, batch_size=encode_batch_size)
                base_pos_emb = base_model.encode([positive], base_tokenizer, device, batch_size=encode_batch_size)
                base_neg_emb = base_model.encode([negative], base_tokenizer, device, batch_size=encode_batch_size)
                
                base_pos_sim = cosine_similarity(base_query_emb, base_pos_emb)[0][0]
                base_neg_sim = cosine_similarity(base_query_emb, base_neg_emb)[0][0]
                base_margin = base_pos_sim - base_neg_sim
                
                # Clear base model embeddings to free memory
                del base_query_emb, base_pos_emb, base_neg_emb
                
                # Memory cleanup between models
                if device.type == 'mps':
                    torch.mps.empty_cache()
                elif device.type == 'cuda':
                    torch.cuda.empty_cache()
                
                # Fine-tuned model similarities  
                ft_query_emb = finetuned_model.encode([query], finetuned_tokenizer, device, batch_size=encode_batch_size)
                ft_pos_emb = finetuned_model.encode([positive], finetuned_tokenizer, device, batch_size=encode_batch_size)
                ft_neg_emb = finetuned_model.encode([negative], finetuned_tokenizer, device, batch_size=encode_batch_size)
                
                ft_pos_sim = cosine_similarity(ft_query_emb, ft_pos_emb)[0][0]
                ft_neg_sim = cosine_similarity(ft_query_emb, ft_neg_emb)[0][0]
                ft_margin = ft_pos_sim - ft_neg_sim
                
                # Clear fine-tuned model embeddings
                del ft_query_emb, ft_pos_emb, ft_neg_emb
                
                print(f"\nResults:")
                print(f"Base Model     - Pos: {base_pos_sim:.3f}, Neg: {base_neg_sim:.3f}, Margin: {base_margin:.3f}")
                print(f"Fine-tuned     - Pos: {ft_pos_sim:.3f}, Neg: {ft_neg_sim:.3f}, Margin: {ft_margin:.3f}")
                
                # Determine improvement
                improvement = ft_margin - base_margin
                is_improvement = improvement > 0
                
                if is_improvement:
                    print(f"Improvement: +{improvement:.3f} (Better separation)")
                    base_improvements += 1
                else:
                    print(f"Regression: {improvement:.3f} (Worse separation)")
                
                # Store detailed results for JSON
                test_result = {
                    "test_case": i,
                    "triplet_id": triplet.get('id', f'test_{i}'),
                    "query": query,
                    "positive": positive,
                    "negative": negative,
                    "base_model": {
                        "positive_similarity": float(base_pos_sim),
                        "negative_similarity": float(base_neg_sim),
                        "margin": float(base_margin)
                    },
                    "finetuned_model": {
                        "positive_similarity": float(ft_pos_sim),
                        "negative_similarity": float(ft_neg_sim),
                        "margin": float(ft_margin)
                    },
                    "comparison": {
                        "improvement": float(improvement),
                        "is_improvement": is_improvement,
                        "improvement_percentage": float((improvement / abs(base_margin)) * 100) if base_margin != 0 else 0.0
                    }
                }
                detailed_results.append(test_result)
                
                total_tests += 1
                
                # Memory cleanup after each test
                if device.type == 'mps':
                    torch.mps.empty_cache()
                elif device.type == 'cuda':
                    torch.cuda.empty_cache()
                
                print("-" * 40)
        
        # Clear models from memory
        del base_model, finetuned_model
        if device.type == 'mps':
            torch.mps.empty_cache()
        elif device.type == 'cuda':
            torch.cuda.empty_cache()
        
        # Summary
        improvement_rate = (base_improvements / total_tests) * 100
        print(f"\nSummary:")
        print(f"  Tests with improvement: {base_improvements}/{total_tests} ({improvement_rate:.1f}%)")
        print(f"  Tests with regression: {total_tests - base_improvements}/{total_tests} ({100 - improvement_rate:.1f}%)")
        print(f"  Device used: {device}")
        print(f"  Memory optimization: {'ON' if device.type == 'mps' else 'OFF'}")
        
        # Prepare complete results dictionary
        results_dict = {
            'total_tests': total_tests,
            'improvements': base_improvements,
            'improvement_rate': improvement_rate,
            'device': str(device),
            'memory_optimized': device.type == 'mps'
        }
        
        # Save detailed results to JSON if requested
        if save_results:
            complete_results = {
                "metadata": {
                    "timestamp": datetime.now().isoformat(),
                    "base_model": base_model_name,
                    "finetuned_model": finetuned_model_path,
                    "device": str(device),
                    "memory_optimized": device.type == 'mps',
                    "total_tests": total_tests,
                    "improvements": base_improvements,
                    "improvement_rate": improvement_rate
                },
                "test_results": detailed_results,
                "summary": {
                    "average_base_positive_similarity": sum([r["base_model"]["positive_similarity"] for r in detailed_results]) / len(detailed_results),
                    "average_base_negative_similarity": sum([r["base_model"]["negative_similarity"] for r in detailed_results]) / len(detailed_results),
                    "average_base_margin": sum([r["base_model"]["margin"] for r in detailed_results]) / len(detailed_results),
                    "average_finetuned_positive_similarity": sum([r["finetuned_model"]["positive_similarity"] for r in detailed_results]) / len(detailed_results),
                    "average_finetuned_negative_similarity": sum([r["finetuned_model"]["negative_similarity"] for r in detailed_results]) / len(detailed_results),
                    "average_finetuned_margin": sum([r["finetuned_model"]["margin"] for r in detailed_results]) / len(detailed_results),
                    "average_improvement": sum([r["comparison"]["improvement"] for r in detailed_results]) / len(detailed_results)
                }
            }
            
            # Save to JSON file
            try:
                with open(results_file, 'w', encoding='utf-8') as f:
                    json.dump(complete_results, f, indent=2, ensure_ascii=False)
                print(f"\nDetailed results saved to: {results_file}")
            except Exception as e:
                print(f"Warning: Failed to save results to JSON: {e}")
        
        return results_dict
        
    except Exception as e:
        print(f"Comparison failed: {e}")
        # Emergency memory cleanup
        if device.type == 'mps':
            torch.mps.empty_cache()
        elif device.type == 'cuda':
            torch.cuda.empty_cache()
        raise

In [10]:
def run_complete_transformers_pipeline():
    """
    Complete pipeline: Train transformers model and compare performance
    Following RAGAS framework patterns with memory optimization and proper train/test split
    """
    print("PhysBERT Transformers Fine-tuning Pipeline")
    print("=" * 50)
    
    # Memory optimization setup
    print("\nMemory Optimization Setup")
    optimize_for_memory()
    check_mps_memory()
    
    try:
        # Step 1: Load and prepare training data with proper train/test split
        print("\nStep 1: Loading and Splitting Training Data")
        config_path = 'config/physbert_config.yaml'
        all_triplets = load_triplets_from_json('data/physics_triplets_st.json')
        
        if not all_triplets:
            print("No training data found. Please check data path.")
            return
        
        # # Apply demo size limit if specified
        # demo_size = int(os.getenv('PHYSBERT_TEST_SIZE', len(all_triplets)))
        # if demo_size < len(all_triplets):
        #     all_triplets = all_triplets[:demo_size]
        #     print(f"Limited to {len(all_triplets)} examples for demonstration")
        
        # Create proper train/test split: Hold out 5 random triplets for testing
        test_size = min(5, len(all_triplets) // 4)  # Use 5 or 25% of data, whichever is smaller
        
        # Randomly shuffle and split the data
        random.seed(42)  # Set seed for reproducible results
        shuffled_triplets = all_triplets.copy()
        random.shuffle(shuffled_triplets)
        
        # Split into test and train sets
        test_triplets = shuffled_triplets[:test_size]
        train_triplets = shuffled_triplets[test_size:]
        
        print(f"Data split:")
        print(f"  - Training examples: {len(train_triplets)}")
        print(f"  - Test examples: {len(test_triplets)} (held out for evaluation)")
        print(f"  - Total examples: {len(all_triplets)}")
        
        # Display test set examples for transparency
        print(f"\nTest Set (Held Out for Evaluation):")
        for i, triplet in enumerate(test_triplets, 1):
            print(f"  Test {i}: Query: '{triplet['anchor'][:50]}...'")
        
        # Memory check before training
        if torch.backends.mps.is_available():
            torch.mps.empty_cache()
        
        # Step 2: Train the fine-tuned model on training set only
        print("\nStep 2: Fine-tuning PhysBERT Model (Training Set Only)")
        
        # Temporarily modify the data loading function to use our train split
        original_triplets_path = 'data/physics_triplets_st.json'
        train_triplets_path = 'data/physics_triplets_train_split.json'
        
        # Save training split to temporary file
        os.makedirs('data', exist_ok=True)
        train_data = []
        for triplet in train_triplets:
            train_data.append({
                'texts': [triplet['anchor'], triplet['positive'], triplet['negative']],
                'id': triplet['id']
            })
        
        with open(train_triplets_path, 'w', encoding='utf-8') as f:
            json.dump(train_data, f, ensure_ascii=False, indent=2)
        
        # Set environment variable to use training split
        os.environ['TRIPLET_DATA_PATH'] = train_triplets_path
        
        try:
            trained_model, tokenizer, config = train_physbert_transformers(config_path)
        finally:
            # Restore original data path and cleanup temp file
            if 'TRIPLET_DATA_PATH' in os.environ:
                del os.environ['TRIPLET_DATA_PATH']
            if os.path.exists(train_triplets_path):
                os.remove(train_triplets_path)
        
        if not trained_model:
            print("Training failed")
            return
        
        # Memory cleanup between training and evaluation
        if torch.backends.mps.is_available():
            torch.mps.empty_cache()
        
        # Step 3: Compare model performances on held-out test set
        print("\nStep 3: Comparing Model Performance on Held-Out Test Set")
        print(f"Testing on {len(test_triplets)} previously unseen examples...")
        
        base_model_name = config.get('model', {}).get('base_model', 'thellert/physbert_cased')
        finetuned_model_path = config.get('training', {}).get('output_path', 'models/physbert-transformers-finetuned')
        
        # Test on the held-out test set (not used during training)
        comparison_results = compare_models_performance_transformers(
            base_model_name, 
            finetuned_model_path, 
            test_triplets,  # Use held-out test set
            num_samples=len(test_triplets)  # Test all held-out examples
        )
        
        # Final memory cleanup
        if torch.backends.mps.is_available():
            torch.mps.empty_cache()
        
        print("\nPipeline completed successfully!")
        print(f"Fine-tuned model saved to: {finetuned_model_path}")
        print(f"\nEvaluation Results (Held-Out Test Set):")
        print(f"  - Test examples: {comparison_results['total_tests']}")
        print(f"  - Improvements: {comparison_results['improvements']}")
        print(f"  - Improvement rate: {comparison_results['improvement_rate']:.1f}%")
        print(f"  - Training examples: {len(train_triplets)}")
        
        if comparison_results.get('memory_optimized'):
            print("  - Memory optimizations applied for Apple Silicon")
        
        # Enhanced results with train/test split info
        results = {
            'trained_model': trained_model,
            'tokenizer': tokenizer,
            'config': config,
            'training_examples': len(train_triplets),
            'test_examples': len(test_triplets),
            'total_examples': len(all_triplets),
            'test_triplets': test_triplets,  # Include test set for further analysis
            'comparison_results': comparison_results,
            'memory_optimized': comparison_results.get('memory_optimized', False),
            'train_test_split': {
                'train_size': len(train_triplets),
                'test_size': len(test_triplets),
                'split_ratio': f"{len(train_triplets)}:{len(test_triplets)}"
            }
        }
        
        return results
        
    except Exception as e:
        print(f"Pipeline failed: {e}")
        # Emergency memory cleanup
        if torch.backends.mps.is_available():
            torch.mps.empty_cache()
        raise

# Execute the complete pipeline
if __name__ == "__main__":
    results = run_complete_transformers_pipeline()

PhysBERT Transformers Fine-tuning Pipeline

Memory Optimization Setup
🔧 Applied memory optimizations:
- MPS cache cleared
- Garbage collection completed
🍎 MPS Memory Status:
   - MPS Available: True
   - MPS Built: True
Memory cache cleared

Step 1: Loading and Splitting Training Data
Loaded 53 triplet examples for training
Data split:
  - Training examples: 48
  - Test examples: 5 (held out for evaluation)
  - Total examples: 53

Test Set (Held Out for Evaluation):
  Test 1: Query: 'What is a ``minimum-ionizing particle" (mip)?...'
  Test 2: Query: 'What is the photon mass attenuation length?...'
  Test 3: Query: 'What is the formula for plasma energy, _{p}?...'
  Test 4: Query: 'How can you calculate the radiation length for a c...'
  Test 5: Query: 'What is the formula for the collision stopping pow...'

Step 2: Fine-tuning PhysBERT Model (Training Set Only)
PhysBERT Transformers Fine-tuning Pipeline
Using Apple Silicon MPS: mps
Loading base model: thellert/physbert_cased
Enabled gr

  return forward_call(*args, **kwargs)
Epoch 1:   2%|▏         | 1/48 [00:03<02:23,  3.04s/it, loss=0.3667, avg_loss=0.0000, lr=0.00e+00, device=mps, mem_opt=ON]


Step 0 similarities: {'positive_similarity': 0.21358193457126617, 'negative_similarity': 0.08025713264942169, 'margin': 0.13332481682300568}


Epoch 1: 100%|██████████| 48/48 [01:53<00:00,  2.37s/it, loss=0.1247, avg_loss=0.2346, lr=1.85e-05, device=mps, mem_opt=ON]


Epoch 1 average loss: 0.2346

Epoch 2/3


Epoch 2: 100%|██████████| 48/48 [01:50<00:00,  2.30s/it, loss=0.0000, avg_loss=0.0455, lr=1.48e-05, device=mps, mem_opt=ON]


Epoch 2 average loss: 0.0455

Epoch 3/3


Epoch 3:   8%|▊         | 4/48 [00:09<01:41,  2.31s/it, loss=0.0000, avg_loss=0.0000, lr=1.45e-05, device=mps, mem_opt=ON]


Step 50 similarities: {'positive_similarity': 0.7070565223693848, 'negative_similarity': -0.15934453904628754, 'margin': 0.8664010763168335}


Epoch 3:  10%|█         | 5/48 [00:11<01:38,  2.30s/it, loss=0.0528, avg_loss=0.0000, lr=1.45e-05, device=mps, mem_opt=ON]


Step 50 similarities: {'positive_similarity': 0.6735079884529114, 'negative_similarity': 0.043615713715553284, 'margin': 0.6298922300338745}


Epoch 3: 100%|██████████| 48/48 [01:49<00:00,  2.29s/it, loss=0.0000, avg_loss=0.0148, lr=1.11e-05, device=mps, mem_opt=ON]


Epoch 3 average loss: 0.0148

Training completed. Average loss: 0.0983
Model saved to: models/physbert-physics-finetuned
Training metadata saved to models/physbert-physics-finetuned/training_metadata_transformers.yaml

Step 3: Comparing Model Performance on Held-Out Test Set
Testing on 5 previously unseen examples...
Using Apple Silicon MPS for comparison

Model Performance Comparison
Loading base model: thellert/physbert_cased
Loading fine-tuned model: models/physbert-physics-finetuned
Loaded fine-tuned model from models/physbert-physics-finetuned on mps

Testing 5 triplets on mps...

Test Case 1:
Query: What is a ``minimum-ionizing particle" (mip)?...
Positive: The stopping power functions are characterized by broad mini...
Negative: For protons of less than several hundred eV, non-ionizing nu...


  return forward_call(*args, **kwargs)
  return forward_call(*args, **kwargs)



Results:
Base Model     - Pos: 0.262, Neg: 0.242, Margin: 0.020
Fine-tuned     - Pos: 0.449, Neg: 0.178, Margin: 0.270
Improvement: +0.250 (Better separation)
----------------------------------------

Test Case 2:
Query: What is the photon mass attenuation length?...
Positive: The photon mass attenuation length (or mean free path) is =1...
Negative: The characteristic amount of matter traversed for [bremsstra...


  return forward_call(*args, **kwargs)
  return forward_call(*args, **kwargs)



Results:
Base Model     - Pos: 0.336, Neg: 0.191, Margin: 0.145
Fine-tuned     - Pos: 0.796, Neg: 0.487, Margin: 0.309
Improvement: +0.164 (Better separation)
----------------------------------------

Test Case 3:
Query: What is the formula for plasma energy, _{p}?...
Positive: The plasma energy is _{p} = 28.816~eV. _{p}: The plasma ener...
Negative: The determination of the mean excitation energy is the princ...


  return forward_call(*args, **kwargs)
  return forward_call(*args, **kwargs)



Results:
Base Model     - Pos: 0.203, Neg: 0.103, Margin: 0.101
Fine-tuned     - Pos: 0.664, Neg: 0.126, Margin: 0.538
Improvement: +0.438 (Better separation)
----------------------------------------

Test Case 4:
Query: How can you calculate the radiation length for a chemical compound?...
Positive: The radiation length in a mixture or compound may be approxi...
Negative: A mixture or compound can be thought of as made up of thin l...


  return forward_call(*args, **kwargs)
  return forward_call(*args, **kwargs)



Results:
Base Model     - Pos: 0.153, Neg: 0.123, Margin: 0.030
Fine-tuned     - Pos: 0.681, Neg: 0.289, Margin: 0.392
Improvement: +0.362 (Better separation)
----------------------------------------

Test Case 5:
Query: What is the formula for the collision stopping power for positrons?...
Positive: Electron-positron scattering is described by the Bhabha cros...
Negative: For electrons, large energy transfers to atomic electrons ar...

Results:
Base Model     - Pos: 0.388, Neg: 0.310, Margin: 0.078
Fine-tuned     - Pos: 0.552, Neg: 0.331, Margin: 0.222
Improvement: +0.143 (Better separation)
----------------------------------------

Summary:
  Tests with improvement: 5/5 (100.0%)
  Tests with regression: 0/5 (0.0%)
  Device used: mps
  Memory optimization: ON

Pipeline completed successfully!
Fine-tuned model saved to: models/physbert-physics-finetuned

Evaluation Results (Held-Out Test Set):
  - Test examples: 5
  - Improvements: 5
  - Improvement rate: 100.0%
  - Training examples:

  return forward_call(*args, **kwargs)


In [11]:
def analyze_test_results(results_dict):
    """
    Provide detailed analysis of the test results from the pipeline
    
    Args:
        results_dict: Results dictionary returned from run_complete_transformers_pipeline()
    """
    if not results_dict or 'comparison_results' not in results_dict:
        print("No valid results to analyze")
        return
    
    print("\nDetailed Test Results Analysis")
    print("=" * 50)
    
    # Extract key metrics
    comparison = results_dict['comparison_results']
    train_test_info = results_dict.get('train_test_split', {})
    
    print(f"Dataset Split Information:")
    print(f"  - Total examples loaded: {results_dict.get('total_examples', 'N/A')}")
    print(f"  - Training examples: {results_dict.get('training_examples', 'N/A')}")
    print(f"  - Test examples: {results_dict.get('test_examples', 'N/A')}")
    print(f"  - Split ratio: {train_test_info.get('split_ratio', 'N/A')}")
    
    print(f"\nModel Evaluation on Held-Out Test Set:")
    print(f"  - Test cases evaluated: {comparison.get('total_tests', 0)}")
    print(f"  - Cases with improvement: {comparison.get('improvements', 0)}")
    print(f"  - Cases with regression: {comparison.get('total_tests', 0) - comparison.get('improvements', 0)}")
    print(f"  - Overall improvement rate: {comparison.get('improvement_rate', 0):.1f}%")
    
    # Performance interpretation
    improvement_rate = comparison.get('improvement_rate', 0)
    if improvement_rate >= 80:
        performance_level = "Excellent"
        interpretation = "The fine-tuning significantly improved the model's ability to distinguish relevant physics content."
    elif improvement_rate >= 60:
        performance_level = "Good"
        interpretation = "The fine-tuning showed substantial improvements in most test cases."
    elif improvement_rate >= 40:
        performance_level = "Moderate"
        interpretation = "The fine-tuning showed some improvements but may need more training data or epochs."
    elif improvement_rate >= 20:
        performance_level = "Limited"
        interpretation = "The fine-tuning showed minimal improvements. Consider adjusting hyperparameters."
    else:
        performance_level = "Poor"
        interpretation = "The fine-tuning did not improve performance. Review training data and approach."
    
    print(f"\nPerformance Assessment:")
    print(f"  - Performance Level: {performance_level}")
    print(f"  - Interpretation: {interpretation}")
    
    # Technical details
    print(f"\nTechnical Details:")
    print(f"  - Device used: {comparison.get('device', 'N/A')}")
    print(f"  - Memory optimization: {'Enabled' if comparison.get('memory_optimized') else 'Disabled'}")
    
    # Test set transparency
    if 'test_triplets' in results_dict and results_dict['test_triplets']:
        print(f"\nTest Set Examples (for transparency):")
        for i, triplet in enumerate(results_dict['test_triplets'], 1):
            print(f"  Test {i}:")
            print(f"    Query: '{triplet['anchor'][:60]}...'")
            print(f"    Positive: '{triplet['positive'][:60]}...'")
            print(f"    Negative: '{triplet['negative'][:60]}...'")
    
    # Recommendations
    print(f"\nRecommendations:")
    if improvement_rate >= 60:
        print("  ✅ Model is ready for physics domain tasks")
        print("  ✅ Consider testing on more diverse physics content")
    elif improvement_rate >= 40:
        print("  ⚠️  Consider increasing training epochs or data augmentation")
        print("  ⚠️  Review learning rate and batch size settings")
    else:
        print("  ❌ Review training data quality and relevance")
        print("  ❌ Consider different model architecture or loss function")
        print("  ❌ Increase training data size or diversity")
    
    return {
        'performance_level': performance_level,
        'improvement_rate': improvement_rate,
        'interpretation': interpretation,
        'total_tests': comparison.get('total_tests', 0),
        'improvements': comparison.get('improvements', 0)
    }


def run_evaluation_only(model_path: str = "/Users/sandeshs/Documents/Projects/LLMTest/tests/ouragboros/models/physbert-physics-finetuned",
                       test_data_path: str = "data/physics_triplets_st.json",
                       num_test_samples: int = 5,
                       save_json: bool = True,
                       results_filename: str = None):
    """
    Run evaluation only (without training) on a random test set
    Useful for testing an already trained model
    
    Args:
        model_path: Path to the fine-tuned model
        test_data_path: Path to the test data
        num_test_samples: Number of random samples to test
        save_json: Whether to save detailed results as JSON
        results_filename: Custom filename for results (default: auto-generated)
        
    Returns:
        Comparison results dictionary
    """
    print("Model Evaluation Pipeline (No Training)")
    print("=" * 50)
    
    try:
        # Load all available triplets
        all_triplets = load_triplets_from_json(test_data_path)
        
        if not all_triplets:
            print("No test data found.")
            return None
        
        # Create random test set
        random.seed(42)  # Reproducible results
        test_triplets = random.sample(all_triplets, min(num_test_samples, len(all_triplets)))
        
        print(f"Selected {len(test_triplets)} random test examples")
        
        # Generate results filename if not provided
        if results_filename is None:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            model_name = os.path.basename(model_path).replace("-", "_")
            results_filename = f"physbert_evaluation_{model_name}_{timestamp}.json"
        
        # Run comparison with JSON saving
        base_model_name = "thellert/physbert_cased"
        comparison_results = compare_models_performance_transformers(
            base_model_name,
            model_path,
            test_triplets,
            num_samples=len(test_triplets),
            save_results=save_json,
            results_file=results_filename
        )
        
        # Analyze results
        eval_results = {
            'comparison_results': comparison_results,
            'test_examples': len(test_triplets),
            'test_triplets': test_triplets,
            'total_examples': len(all_triplets),
            'results_file': results_filename if save_json else None
        }
        
        analysis = analyze_test_results(eval_results)
        
        if save_json:
            print(f"\n📊 Detailed test results with individual scores saved to: {results_filename}")
            print("   The JSON file contains:")
            print("   - Individual test case scores for both models")
            print("   - Similarity scores (positive, negative, margin)")
            print("   - Improvement analysis for each test")
            print("   - Summary statistics and averages")
        
        return eval_results
        
    except Exception as e:
        print(f"Evaluation failed: {e}")
        raise


def load_and_analyze_json_results(json_file: str):
    """
    Load and analyze previously saved JSON results
    
    Args:
        json_file: Path to the JSON results file
        
    Returns:
        Loaded results dictionary
    """
    try:
        with open(json_file, 'r', encoding='utf-8') as f:
            results = json.load(f)
        
        print(f"📈 Analysis of results from: {json_file}")
        print("=" * 50)
        
        metadata = results.get('metadata', {})
        test_results = results.get('test_results', [])
        summary = results.get('summary', {})
        
        print(f"Evaluation Metadata:")
        print(f"  - Timestamp: {metadata.get('timestamp', 'N/A')}")
        print(f"  - Base Model: {metadata.get('base_model', 'N/A')}")
        print(f"  - Fine-tuned Model: {os.path.basename(metadata.get('finetuned_model', 'N/A'))}")
        print(f"  - Device: {metadata.get('device', 'N/A')}")
        print(f"  - Total Tests: {metadata.get('total_tests', 0)}")
        print(f"  - Improvement Rate: {metadata.get('improvement_rate', 0):.1f}%")
        
        print(f"\nPer-Test Case Analysis:")
        for i, test in enumerate(test_results, 1):
            base = test['base_model']
            ft = test['finetuned_model']
            comp = test['comparison']
            
            print(f"  Test {i} (ID: {test.get('triplet_id', 'N/A')}):")
            print(f"    Query: '{test['query'][:50]}...'")
            print(f"    Base Model    - Pos: {base['positive_similarity']:.3f}, Neg: {base['negative_similarity']:.3f}, Margin: {base['margin']:.3f}")
            print(f"    Fine-tuned    - Pos: {ft['positive_similarity']:.3f}, Neg: {ft['negative_similarity']:.3f}, Margin: {ft['margin']:.3f}")
            print(f"    Improvement   - {comp['improvement']:+.3f} ({'✅' if comp['is_improvement'] else '❌'})")
            print()
        
        print(f"Summary Averages:")
        print(f"  - Base Model Average Margin: {summary.get('average_base_margin', 0):.3f}")
        print(f"  - Fine-tuned Average Margin: {summary.get('average_finetuned_margin', 0):.3f}")
        print(f"  - Average Improvement: {summary.get('average_improvement', 0):.3f}")
        
        return results
        
    except FileNotFoundError:
        print(f"❌ Results file not found: {json_file}")
        return None
    except Exception as e:
        print(f"❌ Error loading results: {e}")
        return None

In [12]:
run_evaluation_only()

Model Evaluation Pipeline (No Training)
Loaded 53 triplet examples for training
Selected 5 random test examples
Using Apple Silicon MPS for comparison

Model Performance Comparison
Loading base model: thellert/physbert_cased
Loading fine-tuned model: /Users/sandeshs/Documents/Projects/LLMTest/tests/ouragboros/models/physbert-physics-finetuned
Loaded fine-tuned model from /Users/sandeshs/Documents/Projects/LLMTest/tests/ouragboros/models/physbert-physics-finetuned on mps

Testing 5 triplets on mps...

Test Case 1:
Query: How does the minimum ionization value change with the atomic number Z of the abs...
Positive: Except in hydrogen, particles with the same velocity have si...
Negative: Figure 34.14 shows that the electron critical energy for the...


  return forward_call(*args, **kwargs)
  return forward_call(*args, **kwargs)



Results:
Base Model     - Pos: 0.179, Neg: 0.257, Margin: -0.078
Fine-tuned     - Pos: 0.675, Neg: -0.068, Margin: 0.742
Improvement: +0.820 (Better separation)
----------------------------------------

Test Case 2:
Query: How is the longitudinal profile of energy deposition in an electromagnetic casca...
Positive: The mean longitudinal profile of the energy deposition in an...
Negative: The transverse development of electromagnetic showers in dif...


  return forward_call(*args, **kwargs)
  return forward_call(*args, **kwargs)
  return forward_call(*args, **kwargs)



Results:
Base Model     - Pos: 0.407, Neg: 0.275, Margin: 0.132
Fine-tuned     - Pos: 0.850, Neg: 0.018, Margin: 0.832
Improvement: +0.700 (Better separation)
----------------------------------------

Test Case 3:
Query: What is the formula for the most probable energy loss in a moderately thick dete...
Positive: For detectors of moderate thickness, the energy loss probabi...
Negative: The mean rate of energy loss by moderately relativistic char...


  return forward_call(*args, **kwargs)
  return forward_call(*args, **kwargs)



Results:
Base Model     - Pos: 0.549, Neg: 0.189, Margin: 0.360
Fine-tuned     - Pos: 0.846, Neg: -0.183, Margin: 1.028
Improvement: +0.668 (Better separation)
----------------------------------------

Test Case 4:
Query: What does the density effect correction, (), account for?...
Positive: As the particle energy increases, its electric field flatten...
Negative: The Bloch correction z^{2}L_{2} is a low-energy correction t...


  return forward_call(*args, **kwargs)
  return forward_call(*args, **kwargs)



Results:
Base Model     - Pos: -0.010, Neg: -0.010, Margin: -0.001
Fine-tuned     - Pos: 0.592, Neg: -0.140, Margin: 0.732
Improvement: +0.733 (Better separation)
----------------------------------------

Test Case 5:
Query: What is ``stopping power"?...
Positive: The mean rate of energy loss by moderately relativistic char...
Negative: Eq. (34.5) may be integrated to find the total (or partial) ...

Results:
Base Model     - Pos: 0.225, Neg: 0.154, Margin: 0.071
Fine-tuned     - Pos: 0.716, Neg: -0.088, Margin: 0.804
Improvement: +0.733 (Better separation)
----------------------------------------

Summary:
  Tests with improvement: 5/5 (100.0%)
  Tests with regression: 0/5 (0.0%)
  Device used: mps
  Memory optimization: ON

Detailed Test Results Analysis
Dataset Split Information:
  - Total examples loaded: 53
  - Training examples: N/A
  - Test examples: 5
  - Split ratio: N/A

Model Evaluation on Held-Out Test Set:
  - Test cases evaluated: 5
  - Cases with improvement: 5
  - Case

  return forward_call(*args, **kwargs)


{'comparison_results': {'total_tests': 5,
  'improvements': 5,
  'improvement_rate': 100.0,
  'device': 'mps',
  'memory_optimized': True},
 'test_examples': 5,
 'test_triplets': [{'anchor': 'How does the minimum ionization value change with the atomic number Z of the absorber?',
   'positive': 'Except in hydrogen, particles with the same velocity have similar rates of energy loss in different materials, although there is a slow decrease in the rate of energy loss with increasing Z. Figure 34.3 confirms this trend.',
   'negative': 'Figure 34.14 shows that the electron critical energy for the chemical elements decreases significantly with Z.',
   'id': 'physics_triplet_40'},
  {'anchor': 'How is the longitudinal profile of energy deposition in an electromagnetic cascade modeled?',
   'positive': 'The mean longitudinal profile of the energy deposition in an electromagnetic cascade is reasonably well described by a gamma distribution: {dt}=E_{0}be^{-bt}}{(a)}. }: The energy deposited per