# EGen-Core — Model Evaluation: ELO Tournament

**The Athena Project (2025–2026)** — Developed by [ErebusTN](https://github.com/ErebusTN)

This notebook implements an ELO rating tournament for comparing language models
loaded through EGen-Core. It uses Vicuna-style evaluation prompts to
benchmark models head-to-head.

## Overview

1. **Define evaluation prompts** (Vicuna-style)
2. **Load models** via `AutoModel.from_pretrained()`
3. **Run pairwise comparisons** — each model generates a response for the same prompt
4. **Compute ELO ratings** using a round-robin tournament
5. **Display results** in a ranked leaderboard

In [None]:
!pip install -q EGen-Core pandas tabulate

In [None]:
import torch
import pandas as pd
import random
import math
import sys

print(f"Python:  {sys.version}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA:    {torch.cuda.is_available()}")

## 1. Evaluation Prompts

Vicuna-style evaluation prompts covering reasoning, knowledge, coding, and creativity.

In [None]:
EVAL_PROMPTS = [
    # Knowledge
    "What are the main differences between nuclear fission and nuclear fusion?",
    "Explain the concept of supply and demand in economics.",
    "What is the greenhouse effect and how does it relate to climate change?",
    
    # Reasoning
    "If a train travels 120 km in 2 hours, what is its average speed? Show your reasoning.",
    "A farmer has 17 sheep. All but 9 die. How many sheep are left? Explain step by step.",
    
    # Coding
    "Write a Python function to check if a string is a palindrome.",
    "Explain the difference between a stack and a queue with examples.",
    
    # Creativity
    "Write a short poem about artificial intelligence.",
    "Describe a futuristic city in 2100 in three sentences.",
    
    # Instruction Following
    "List 5 tips for effective time management, numbered 1-5.",
]

print(f"Loaded {len(EVAL_PROMPTS)} evaluation prompts.")

## 2. ELO Rating System
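
The `ELORatingSystem` class in this section follows the standard Elo update. For models $A$ and $B$ with current ratings $R_A$ and $R_B$, the expected score of $A$ is

$$
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad E_B = 1 - E_A .
$$

After a match with observed score $S_A \in \{0, 0.5, 1\}$ (so $S_B = 1 - S_A$), the ratings become

$$
R_A' = R_A + K\,(S_A - E_A), \qquad R_B' = R_B + K\,(S_B - E_B),
$$

where $K$ is the `k_factor` (32 by default). Because $S_A + S_B = 1$ and $E_A + E_B = 1$, the two adjustments cancel and the sum of the ratings stays constant. With equal ratings, $E_A = 0.5$, so a win moves the winner up by $K/2 = 16$ points and the loser down by the same amount.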

In [None]:
class ELORatingSystem:
    """ELO rating system for model comparison."""
    
    def __init__(self, k_factor=32, initial_rating=1500):
        self.k_factor = k_factor
        self.initial_rating = initial_rating
        self.ratings = {}
        self.match_history = []
    
    def add_model(self, model_name):
        """Register a model with initial ELO rating."""
        if model_name not in self.ratings:
            self.ratings[model_name] = self.initial_rating
    
    def expected_score(self, rating_a, rating_b):
        """Calculate expected score for player A."""
        return 1.0 / (1.0 + math.pow(10, (rating_b - rating_a) / 400.0))
    
    def update_ratings(self, model_a, model_b, score_a):
        """Update ELO ratings after a match. score_a: 1=A wins, 0=B wins, 0.5=draw."""
        ra = self.ratings[model_a]
        rb = self.ratings[model_b]
        
        ea = self.expected_score(ra, rb)
        eb = self.expected_score(rb, ra)
        
        self.ratings[model_a] = ra + self.k_factor * (score_a - ea)
        self.ratings[model_b] = rb + self.k_factor * ((1 - score_a) - eb)
        
        self.match_history.append({
            'model_a': model_a, 'model_b': model_b,
            'score_a': score_a, 'new_rating_a': self.ratings[model_a],
            'new_rating_b': self.ratings[model_b]
        })
    
    def get_leaderboard(self):
        """Return sorted leaderboard as a DataFrame."""
        data = sorted(self.ratings.items(), key=lambda x: x[1], reverse=True)
        return pd.DataFrame(data, columns=['Model', 'ELO Rating'])

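# Smoke-test the ELO system: with equal starting ratings and K=32,
# a win moves the winner up 16 points and the loser down 16.
elo = ELORatingSystem()
elo.add_model('Model_A')
elo.add_model('Model_B')
elo.update_ratings('Model_A', 'Model_B', 1.0)  # A wins
print(f"After A beats B: {elo.ratings}")
assert math.isclose(elo.ratings['Model_A'], 1516.0)
assert math.isclose(elo.ratings['Model_B'], 1484.0)
print("PASS: ELO rating system functional.")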
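
The smoke test above starts both models at the same rating. When ratings differ, the size of the adjustment depends on how surprising the result is. The following is a minimal standalone sketch of the same update rule; `elo_update` and the ratings 1700 and 1500 are illustrative choices, not part of the notebook's class.

```python
import math

def expected_score(rating_a, rating_b):
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + math.pow(10, (rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k_factor=32):
    """Return the new (rating_a, rating_b) after one match."""
    ea = expected_score(rating_a, rating_b)
    delta = k_factor * (score_a - ea)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta

# Hypothetical ratings chosen only for illustration.
strong, weak = 1700.0, 1500.0

fav_a, fav_b = elo_update(strong, weak, 1.0)      # favourite wins
upset_a, upset_b = elo_update(strong, weak, 0.0)  # underdog wins

print(f"Expected score of favourite: {expected_score(strong, weak):.3f}")
print(f"Favourite wins: {fav_a:.1f} vs {fav_b:.1f}")
print(f"Underdog wins:  {upset_a:.1f} vs {upset_b:.1f}")
```

With these numbers the favourite gains about 7.7 points for an expected win, while an upset transfers about 24.3 points to the underdog. This is also why the final leaderboard can depend on the order in which matches are processed.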

## 3. Model Loading and Generation

In [None]:
from egen_core import AutoModel

def load_model(model_id, **kwargs):
    """Load a model through EGen-Core's AutoModel."""
    print(f"Loading {model_id}...")
    model = AutoModel.from_pretrained(model_id, **kwargs)
    print("  Loaded successfully.")
    return model

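def generate_response(model, prompt, max_new_tokens=100, max_length=256):
    """Generate a response from a loaded EGen-Core model.

    Returns only the newly generated text, without the echoed prompt.
    """
    tokens = model.tokenizer(
        [prompt],
        return_tensors="pt",
        return_attention_mask=False,
        truncation=True,
        max_length=max_length,
        padding=False
    )
    input_ids = tokens['input_ids'].cuda()

    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        use_cache=True,
        return_dict_in_generate=True
    )

    # Assumes a decoder-only model whose output sequence is prompt + continuation:
    # drop the prompt tokens so the scorer compares responses only.
    new_tokens = output.sequences[0][input_ids.shape[1]:]
    return model.tokenizer.decode(new_tokens, skip_special_tokens=True)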

print("Model loading and generation functions defined.")

## 4. Tournament Runner

Configure the models to evaluate below. Each model will be compared against every
other model on each prompt.
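
With n models and p prompts, this pairing scheme produces p · n(n − 1)/2 matches, so the cost grows quadratically with the number of models. A minimal sketch that enumerates the pairings with `itertools.combinations`, using placeholder model names that are not real repos:

```python
from itertools import combinations

# Placeholder names for illustration only; these are not real model repos.
model_ids = ["model-a", "model-b", "model-c", "model-d"]
num_prompts = 10  # matches the size of EVAL_PROMPTS in this notebook

# Each unordered pair of models meets once per prompt.
pairings = list(combinations(model_ids, 2))
matches = len(pairings) * num_prompts

for a, b in pairings:
    print(f"{a} vs {b}")
print(f"{len(pairings)} pairings x {num_prompts} prompts = {matches} matches")
```

Four models give 6 pairings, or 60 matches across the 10 prompts.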

In [None]:
# Configure models for tournament
# Set to actual HuggingFace model repo IDs to run a real tournament
MODEL_IDS = [
    # "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    # "mistralai/Mistral-7B-Instruct-v0.1",
]

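def score_response(response_a, response_b):
    """Length-based scoring heuristic, for demo purposes only.

    Returns 1.0 if A is more than 20% longer than B, 0.0 if B is more
    than 20% longer than A, and 0.5 otherwise. Length is not a measure
    of quality; for real evaluations use a GPT-4 judge or human graders.
    """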
    len_a = len(response_a.strip())
    len_b = len(response_b.strip())
    if len_a > len_b * 1.2:
        return 1.0  # A wins
    elif len_b > len_a * 1.2:
        return 0.0  # B wins
    else:
        return 0.5  # Draw

if len(MODEL_IDS) >= 2 and torch.cuda.is_available():
    elo = ELORatingSystem()
    models = {}
    
    for model_id in MODEL_IDS:
        elo.add_model(model_id)
        models[model_id] = load_model(model_id)
    
    # Round-robin tournament: every pair of models meets once per prompt
    for prompt_idx, prompt in enumerate(EVAL_PROMPTS):
        print(f"\n--- Prompt {prompt_idx + 1}/{len(EVAL_PROMPTS)} ---")
        print(f"  {prompt[:80]}{'...' if len(prompt) > 80 else ''}")

        # Generate once per model and reuse the response across all pairings
        responses = {m: generate_response(models[m], prompt) for m in MODEL_IDS}

        for i in range(len(MODEL_IDS)):
            for j in range(i + 1, len(MODEL_IDS)):
                a, b = MODEL_IDS[i], MODEL_IDS[j]
                score = score_response(responses[a], responses[b])
                elo.update_ratings(a, b, score)
    
    print("\n" + "="*60)
    print("FINAL LEADERBOARD")
    print("="*60)
    print(elo.get_leaderboard().to_string(index=False))
else:
    print("SKIP: Add at least 2 model IDs to MODEL_IDS and ensure CUDA is available.")
    print("\nDemo leaderboard (simulated):")
    elo = ELORatingSystem()
    demo_models = ['Model_Alpha', 'Model_Beta', 'Model_Gamma']
    for m in demo_models:
        elo.add_model(m)
    # Simulate matches
    elo.update_ratings('Model_Alpha', 'Model_Beta', 1.0)
    elo.update_ratings('Model_Alpha', 'Model_Gamma', 1.0)
    elo.update_ratings('Model_Beta', 'Model_Gamma', 0.5)
    print(elo.get_leaderboard().to_string(index=False))

---

**ELO Tournament framework ready.** ✅ Set `MODEL_IDS` to real Hugging Face
model repo IDs to run a full evaluation. For production use, replace `score_response()` with
a GPT-4 judge or human grading.
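
One way to swap out `score_response()` is to keep its `(response_a, response_b) -> score` contract and delegate the verdict to a pluggable judge. The sketch below is an assumption, not part of EGen-Core: `make_scorer`, `judge`, and `toy_judge` are hypothetical names, and `judge` stands for any callable (an LLM call, a lookup of human annotations) that returns `"A"`, `"B"`, or `"tie"`. Each pair is judged twice with the positions swapped, since a judge can favour whichever response it sees first.

```python
def make_scorer(judge):
    """Wrap a judge callable into a score_response-style function.

    `judge(first, second)` is a hypothetical callable that returns
    "A" if the first response is better, "B" if the second is, else "tie".
    """
    verdict_to_score = {"A": 1.0, "B": 0.0, "tie": 0.5}

    def score_with_judge(response_a, response_b):
        # Judge both orderings to reduce position bias, then average.
        forward = verdict_to_score[judge(response_a, response_b)]
        # In the swapped call, a verdict of "A" means response_b won.
        backward = 1.0 - verdict_to_score[judge(response_b, response_a)]
        return (forward + backward) / 2.0

    return score_with_judge


# Stand-in judge for demonstration: prefers the response with more distinct words.
def toy_judge(first, second):
    n_first, n_second = len(set(first.split())), len(set(second.split()))
    if n_first == n_second:
        return "tie"
    return "A" if n_first > n_second else "B"


score_with_judge = make_scorer(toy_judge)
print(score_with_judge("a stack is last in, first out", "a stack"))  # A is preferred
print(score_with_judge("same words", "words same"))                  # draw
```

If the judge contradicts itself across the two orderings, the averaged score lands at 0.5, or at 0.25 or 0.75 when one ordering yields a tie. `update_ratings` accepts any score between 0 and 1, so these fractional results need no special handling.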