# OLMoE Production Evaluation with Real Data

## High-Quality Evaluation Framework

**Features:**
- ✅ Evaluates on **real datasets** (WikiText-2, LAMBADA)
- ✅ Computes **proper metrics** (Perplexity, Token Accuracy, Loss)
- ✅ Tests **multiple expert configs** (8, 16, 32, 64)
- ✅ **Production-quality** code with proper error handling
- ✅ **Publication-ready** visualizations
- ✅ Exports results to CSV/JSON/PDF

---

## 📦 Installation

In [None]:
!pip install -q torch transformers datasets accelerate sentencepiece matplotlib seaborn pandas numpy tqdm

## 🔧 Configuration

In [None]:
# Configuration - ADJUST THESE BASED ON YOUR NEEDS
CONFIG = {
    'model_name': 'allenai/OLMoE-1B-7B-0924',
    'expert_configs': [8, 16, 32, 64],  # Expert counts to test
    'datasets': ['wikitext', 'lambada'],  # Datasets to evaluate
    'max_samples': 500,  # Number of samples per dataset (500-1000 recommended)
    'max_length': 512,   # Maximum sequence length
    'output_dir': './olmoe_results',
    'seed': 42,
}

print("Configuration:")
for k, v in CONFIG.items():
    print(f"  {k}: {v}")

## 📥 Load Production Code

Download `olmoe_evaluation.py` (or place a local copy next to this notebook), then import the evaluator classes from it.

In [None]:
# Download the production evaluation script
!wget -q https://raw.githubusercontent.com/AliABULIEL/MOE-with-feature-selection/claude/olmoe-inference-experts-01XjzqPSCkvPdXxPi6iS3C5C/olmoe_evaluation.py -O olmoe_evaluation.py

# If the file is already present locally, skip the download above; the import works either way
from olmoe_evaluation import OLMoEEvaluator, EvaluationConfig

## 🚀 Run Evaluation

This will:
1. Load the OLMoE model
2. Load the WikiText-2 and LAMBADA datasets
3. Run inference with 8, 16, 32, and 64 experts per token
4. Compute perplexity, accuracy, and speed metrics
5. Generate visualizations
6. Save results to files
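
How `olmoe_evaluation.py` computes the metrics in step 4 is not shown in this notebook. As a rough sketch under the conventional definitions (perplexity as the exponential of the mean per-token negative log-likelihood, token accuracy as the fraction of positions where the predicted token matches the target), the arithmetic looks like this. The function names and toy inputs are illustrative assumptions, not taken from the script.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def token_accuracy(predicted_ids, target_ids):
    """Fraction of positions where the predicted token id equals the target."""
    if len(predicted_ids) != len(target_ids) or not target_ids:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == t for p, t in zip(predicted_ids, target_ids))
    return matches / len(target_ids)


# Toy inputs, made up for illustration (not model output).
nlls = [math.log(4.0)] * 3  # every target token was assigned probability 1/4
print(f"perplexity: {perplexity(nlls):.2f}")
print(f"token accuracy: {token_accuracy([5, 7, 9, 2], [5, 7, 1, 2]):.2f}")
```

A model that assigns probability 1/4 to every target token has a perplexity of about 4, which is why perplexity is often read as an effective number of equally likely choices per token.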

In [None]:
# Create configuration
config = EvaluationConfig(
    model_name=CONFIG['model_name'],
    expert_configs=CONFIG['expert_configs'],
    datasets=CONFIG['datasets'],
    max_samples=CONFIG['max_samples'],
    max_length=CONFIG['max_length'],
    output_dir=CONFIG['output_dir'],
    seed=CONFIG['seed'],
)

# Create evaluator
print("Initializing evaluator...")
evaluator = OLMoEEvaluator(config)

In [None]:
# Run full evaluation
print("Starting evaluation... This will take 15-30 minutes depending on GPU.")
print("Progress will be shown below.\n")

results_df = evaluator.evaluate_all_configurations()

print("\n✓ Evaluation complete!")

## 📊 View Results

In [None]:
# Display results table
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.float_format', '{:.4f}'.format)

print("="*100)
print("EVALUATION RESULTS")
print("="*100)
display(results_df)
print("="*100)

In [None]:
# Summary statistics
print("\nSUMMARY STATISTICS BY EXPERT COUNT")
print("="*80)
summary = results_df.groupby('num_experts').agg({
    'perplexity': ['mean', 'std'],
    'token_accuracy': ['mean', 'std'],
    'tokens_per_second': ['mean', 'std']
})
display(summary)
print("="*80)

## 📈 Generate Visualizations

In [None]:
# Generate all visualizations
evaluator.visualize_results(results_df)

## 📝 Generate Report

In [None]:
# Generate markdown report
evaluator.generate_report(results_df)

# Display the report
from IPython.display import Markdown
with open(f"{CONFIG['output_dir']}/EVALUATION_REPORT.md", 'r') as f:
    report_content = f.read()

display(Markdown(report_content))

## 📂 Download Results

All results are saved in the output directory:

In [None]:
import os

print(f"Results saved in: {CONFIG['output_dir']}\n")
print("Files created:")
for file in os.listdir(CONFIG['output_dir']):
    filepath = os.path.join(CONFIG['output_dir'], file)
    size = os.path.getsize(filepath) / 1024  # KB
    print(f"  - {file} ({size:.2f} KB)")

# Zip results for download
!zip -r olmoe_results.zip {CONFIG['output_dir']}
print("\n✓ Results zipped to olmoe_results.zip")
print("  You can download this file from Colab's file browser.")

---

## 🔬 Advanced: Custom Analysis

Analyze specific aspects of the results:

In [None]:
# Perplexity improvement analysis
print("PERPLEXITY IMPROVEMENT OVER BASELINE (8 experts)\n")

for dataset in results_df['dataset'].unique():
    print(f"\n{dataset.upper()}:")
    data = results_df[results_df['dataset'] == dataset]

    # Baseline row: the 8-expert run for this dataset (looked up once, outside the loop)
    baseline = data[data['num_experts'] == 8].iloc[0]
    baseline_ppl = baseline['perplexity']
    baseline_speed = baseline['tokens_per_second']

    print(f"{'Experts':<10} {'Perplexity':<12} {'Improvement':<12} {'Speedup'}")
    print("-" * 50)

    for _, row in data.iterrows():
        # Positive improvement means lower perplexity than the baseline
        improvement = (baseline_ppl - row['perplexity']) / baseline_ppl * 100
        # Throughput relative to the baseline; values below 1.00x are slower
        speedup = row['tokens_per_second'] / baseline_speed

        print(f"{row['num_experts']:<10} {row['perplexity']:<12.2f} "
              f"{improvement:>6.2f}%      {speedup:>5.2f}x")

In [None]:
# Quality-Speed Pareto frontier
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))

for dataset in results_df['dataset'].unique():
    data = results_df[results_df['dataset'] == dataset]
    ax.scatter(
        data['tokens_per_second'],
        data['perplexity'],
        s=200,
        alpha=0.6,
        label=dataset
    )
    
    # Annotate with expert count
    for _, row in data.iterrows():
        ax.annotate(
            f"{row['num_experts']}",
            (row['tokens_per_second'], row['perplexity']),
            fontsize=12,
            fontweight='bold',
            ha='center',
            va='center'
        )

ax.set_xlabel('Throughput (tokens/sec) ↑', fontsize=14, fontweight='bold')
ax.set_ylabel('Perplexity ↓', fontsize=14, fontweight='bold')
ax.set_title('Pareto Frontier: Quality vs Speed', fontsize=16, fontweight='bold')
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)

# Add arrows showing direction of improvement
ax.annotate('Better\nQuality', xy=(0.05, 0.95), xycoords='axes fraction',
            fontsize=11, ha='left', va='top', color='green', fontweight='bold',
            bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.3))
ax.annotate('Better\nSpeed', xy=(0.95, 0.05), xycoords='axes fraction',
            fontsize=11, ha='right', va='bottom', color='blue', fontweight='bold',
            bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.3))

plt.tight_layout()
plt.show()

---

## ✅ Summary

### What We Evaluated

- **Model**: OLMoE-1B-7B-0924
- **Datasets**: WikiText-2, LAMBADA (real evaluation benchmarks)
- **Expert Configurations**: 8, 16, 32, 64 experts per token
- **Metrics**: Perplexity, Token Accuracy, Cross-Entropy Loss, Inference Speed
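
"Experts per token" is the number of experts the router activates for each token. The toy sketch below is not OLMoE's router; it only illustrates top-k selection over softmax scores, using made-up logits and hypothetical helper names (`softmax`, `route_top_k`), to show what changes as k grows from 8 to 64.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(logits, k):
    """Return (expert_index, router_probability) pairs for the k highest-scoring experts."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Deterministic, made-up router logits for 64 experts (illustration only).
logits = [0.1 * ((i * 37) % 64) for i in range(64)]

for k in (8, 16, 32, 64):
    chosen = route_top_k(logits, k)
    mass = sum(weight for _, weight in chosen)
    print(f"k={k:>2}: router probability mass covered = {mass:.3f}")
```

With k equal to the total number of experts, every expert is selected and the covered probability mass is 1, so routing is effectively dense.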

### Key Findings

Run the cells above to see:
- ✅ How perplexity changes with more experts
- ✅ Speed vs quality trade-offs
- ✅ Optimal configuration for your use case
- ✅ Mean and standard deviation of each metric across datasets
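
One way to turn the speed versus quality trade-off into a concrete choice is to keep only the configurations that no other configuration beats on both perplexity and throughput. The sketch below assumes rows with the `num_experts`, `perplexity`, and `tokens_per_second` columns used in the cells above. `pareto_front` is a hypothetical helper, and the numbers are placeholders for illustration, not measured results.

```python
def pareto_front(rows):
    """Keep rows that no other row dominates.

    A row is dominated when another row has perplexity that is no higher and
    throughput that is no lower, and is strictly better on at least one of them.
    """
    front = []
    for r in rows:
        dominated = any(
            o["perplexity"] <= r["perplexity"]
            and o["tokens_per_second"] >= r["tokens_per_second"]
            and (
                o["perplexity"] < r["perplexity"]
                or o["tokens_per_second"] > r["tokens_per_second"]
            )
            for o in rows
        )
        if not dominated:
            front.append(r)
    # Fastest configuration first
    return sorted(front, key=lambda r: r["tokens_per_second"], reverse=True)


# Placeholder rows for illustration only (NOT evaluation output).
rows = [
    {"num_experts": 8, "perplexity": 10.0, "tokens_per_second": 100.0},
    {"num_experts": 16, "perplexity": 9.0, "tokens_per_second": 60.0},
    {"num_experts": 32, "perplexity": 9.5, "tokens_per_second": 30.0},
    {"num_experts": 64, "perplexity": 8.5, "tokens_per_second": 15.0},
]
print([r["num_experts"] for r in pareto_front(rows)])  # -> [8, 16, 64]
```

To apply this to the real results, pass one dataset at a time, for example `pareto_front(results_df[results_df['dataset'] == name].to_dict('records'))`, where `name` is a value from the `dataset` column.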

### Files Generated

- `evaluation_results.csv` - Raw data
- `evaluation_results.json` - Results in JSON format
- `evaluation_results.png` - Main visualization
- `evaluation_results.pdf` - Publication-ready figures
- `EVALUATION_REPORT.md` - Detailed report

---

**Production-Quality Code by Senior ML Researcher & Software Engineer**