# Model Scoring Template

Compare and evaluate different **models** using the winning prompt from prompt evaluation.

## What This Template Does

1. **Fixed Prompt** - Uses Prompt B (Structured) - the winner from prompt evaluation
2. **Model Comparison** - Tests multiple models on the same prompt and test texts
3. **Output Quality Metrics** - Measures how well each model produces Easy Language output
4. **A/B Testing** - Compares two models head-to-head to pick the winner

## Key Difference from Prompt Evaluation

| Template | Fixed Variable | Changeable Variable |
|----------|----------------|---------------------|
| Prompt Evaluation | Model | Prompts |
| **Model Scoring** | **Prompt** | **Models** |

---
# 1. Setup

In [None]:
import os
import re
import time
from dotenv import load_dotenv, find_dotenv
from groq import Groq
from IPython.display import display, HTML, Markdown
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load environment
found_path = find_dotenv(usecwd=True)
if found_path:
    load_dotenv(found_path, override=True)
    print(f"✅ Loaded .env from: {found_path}")

# Initialize Groq
groq_client = None
if os.getenv("GROQ_API_KEY"):
    groq_client = Groq(api_key=os.getenv("GROQ_API_KEY"))
    print("✅ Groq Client Ready")
else:
    print("⚠️ GROQ_API_KEY not found")

---
# 2. Configuration

### Models to Compare

In [None]:
# =============================================
# MODELS TO COMPARE
# =============================================

MODELS = {
    "Model A (8B Fast)": "llama-3.1-8b-instant",
    "Model B (70B Versatile)": "llama-3.3-70b-versatile",
}

# Additional models you can add (confirm that Groq still serves the ID first):
# "Model C (Gemma)": "gemma2-9b-it",
# "Model D (Mixtral)": "mixtral-8x7b-32768",

print(f"🤖 Loaded {len(MODELS)} models to compare:")
for name, model_id in MODELS.items():
    print(f"   • {name}: {model_id}")

### Fixed Prompt (Winner: Prompt B - Structured)

This prompt won the prompt evaluation and will be used for all model comparisons.

In [None]:
# =============================================
# FIXED PROMPT (Prompt B - Structured)
# Winner from prompt evaluation
# =============================================

SYSTEM_PROMPT = """# Identity

You are an expert in plain language writing.
You specialise in rewriting text to be accessible 
to people with learning disabilities or low literacy.

# Core Task 

* Rewrite the input text to be extremely simple and easy to understand.
* Keep the same meaning as the original text.

# Constraints

* Do NOT include any introductory or concluding text (e.g., "Here is the simplified text").
* Output ONLY the simplified text.
* Never output any XML/HTML tags or attributes (no <...>, no id=...).

# Structure & Formatting Rules

* Use clear structure.
* Use bullet points for steps, lists, or multiple items. Otherwise prefer short sentences.
* Add blank lines between every paragraph.

# Plain Language Rules
# Sentence & Length Rules

* Use very short sentences (maximum 10 words per sentence).
* Break up long sentences.
* Keep subjects and verbs close together.

# Vocabulary & Wording Rules

* Use simple, familiar words. Avoid technical, foreign, or formal terms.
* Explain any uncommon or necessary technical words or abbreviations in parentheses the first time they appear.
* Explain complex ideas or uncommon nouns in parentheses.
* Use positive wording. Avoid negations and never use double negatives.
* Replace abstract nouns with concrete, active verbs.

# Tone & Audience Rules

* Prefer active voice. Avoid passive voice whenever possible.
* Address the reader directly using "you".
* Use a friendly, neutral tone.
* Avoid bureaucratic, legalistic, or commanding language.

# Consistency Rules

* Remove filler words and unnecessary details. Keep only essential information.
* Use the same words consistently. Do not switch terms for the same thing.

# Examples
# The following are example pairs.
# Learn the style and constraints from them.
# Do NOT copy the XML tags into your output.

<examples>

  <example id="1">
    <original_text>
    Upon arrival at the facility, visitors are required to sign in at the front desk and present valid photo identification.
    </original_text>

    <simplified_text>
    When you arrive:

    * Go to the front desk.
    * Sign in with your name.
    * Show your photo ID.
    </simplified_text>
  </example>

  <example id="2">
    <original_text>
    The medication should be administered twice daily with food to minimize potential gastrointestinal discomfort.
    </original_text>

    <simplified_text>
    Take this medicine two times every day.

    Eat food when you take it. This helps your stomach feel better.
    </simplified_text>
  </example>

</examples>

Rewrite this text in simple language:"""

print("📝 Fixed prompt loaded: Prompt B (Structured)")
print(f"   Length: {len(SYSTEM_PROMPT)} characters")

### Test Texts

In [None]:
# Test texts to evaluate models on
TEST_TEXTS = {
    "Legal": """The obligations contained herein shall remain in full force and effect indefinitely, 
notwithstanding the termination of this Agreement, until such time as the Confidential Information 
no longer qualifies as confidential under applicable law.""",

    "Medical": """Patients should take the prescribed medication twice daily with food to minimize 
gastrointestinal discomfort. If adverse reactions occur, discontinue use immediately and consult 
your healthcare provider.""",

    "Bureaucratic": """For the application of housing benefit, a fully completed application form 
must be submitted to the responsible authority. The required proof of income and the rent 
certificate must be attached. Processing time is usually six to eight weeks."""
}

print(f"📄 Loaded {len(TEST_TEXTS)} test texts")

### Output Quality Metrics

In [None]:
# =============================================
# OUTPUT QUALITY METRICS
# =============================================

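OUTPUT_METRICS = {
    "short_sentences": {
        "name": "Short Sentences",
        "description": "Maximum 10 words per sentence",
        "weight": 2,
        # Longest sentence in words; a newline also ends a sentence, so headings
        # and bullet items are not merged into the sentence that follows them.
        "check": lambda text: max((len(s.split()) for s in re.split(r'[.!?\n]+', text) if s.strip()), default=0)
    },
    "uses_bullets": {
        "name": "Uses Bullet Points",
        "description": "Uses bullet points or numbered lists",
        "weight": 1,
        # Anchored to the line start so a mid-sentence hyphen is not counted as a bullet.
        "check": lambda text: bool(re.search(r'^[ \t]*(?:[•\-\*]|\d+\.)\s', text, re.MULTILINE))
    },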
    "has_paragraphs": {
        "name": "Clear Paragraphs",
        "description": "Has blank lines between sections",
        "weight": 1,
        "check": lambda text: '\n\n' in text
    },
    "no_intro_text": {
        "name": "No Intro/Outro Text",
        "description": "Starts directly without meta-text",
        "weight": 1,
        "check": lambda text: not bool(re.match(r'^(Here\'s|Here is|This is|The following|Sure|Certainly)', text.strip(), re.IGNORECASE))
    },
    "no_xml_tags": {
        "name": "No XML/HTML Tags",
        "description": "No markup in output",
        "weight": 1,
        "check": lambda text: not bool(re.search(r'<[^>]+>', text))
    }
}

print(f"📊 Output metrics: {len(OUTPUT_METRICS)}")
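
The checks above are plain regular-expression heuristics, so it helps to see what they return on a small sample. The following is a minimal standalone sketch, not the notebook's own code: the helper names `longest_sentence_words`, `uses_bullets`, and `has_markup` are hypothetical, and the patterns are illustrative versions of the sentence-length, bullet, and markup checks.

```python
import re


def longest_sentence_words(text: str) -> int:
    """Word count of the longest sentence; a newline also ends a sentence."""
    parts = re.split(r"[.!?\n]+", text)
    return max((len(p.split()) for p in parts if p.strip()), default=0)


def uses_bullets(text: str) -> bool:
    """True if some line starts with a bullet marker or a numbered-list marker."""
    return bool(re.search(r"^[ \t]*(?:[•\-*]|\d+\.)\s", text, re.MULTILINE))


def has_markup(text: str) -> bool:
    """True if the text contains something that looks like an XML/HTML tag."""
    return bool(re.search(r"<[^>]+>", text))


sample = "When you arrive:\n\n* Go to the front desk.\n* Show your photo ID."

print(longest_sentence_words(sample))        # bullet markers count as a word
print(uses_bullets(sample))
print(uses_bullets("A well - known rule."))  # a mid-sentence hyphen is not a bullet
print(has_markup(sample))
```

Note that a bullet marker such as `*` is counted as a word by this kind of split, so list items are scored one word longer than they read.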

---
# 3. Helper Functions

In [None]:
def get_completion(text: str, model_id: str) -> str:
    """Call a specific model with the fixed prompt."""
    if not groq_client:
        return "[No API Client]"
    
    try:
        response = groq_client.chat.completions.create(
            model=model_id,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": text}
            ],
            temperature=0.1
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"[Error: {e}]"


def evaluate_output_quality(output: str) -> dict:
    """Evaluate the model output."""
    results = {}
    total_score = 0
    max_score = 0
    
    for metric_id, metric in OUTPUT_METRICS.items():
        check_result = metric["check"](output)
        weight = metric["weight"]
        max_score += weight
        
        if metric_id == "short_sentences":
            passed = check_result <= 10
            results[metric_id] = {"pass": passed, "value": check_result, "weight": weight}
        else:
            passed = bool(check_result)
            results[metric_id] = {"pass": passed, "weight": weight}
        
        if passed:
            total_score += weight
    
    results["_score"] = total_score / max_score if max_score > 0 else 0
    return results


def tfidf_similarity(text1: str, text2: str) -> float:
    """Calculate TF-IDF similarity between texts."""
    try:
        vectorizer = TfidfVectorizer(lowercase=True)
        matrix = vectorizer.fit_transform([text1, text2])
        return round(float(cosine_similarity(matrix[0:1], matrix[1:2])[0][0]), 3)
    except ValueError:  # e.g. empty vocabulary when a text has no usable tokens
        return 0.0
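
Because the similarity score is based on shared words, a faithful simplification that swaps "medication" for "medicine" can still score low. The sketch below illustrates this with a simpler stand-in: cosine similarity over raw word counts using only the standard library. It is not the TF-IDF weighting used by `tfidf_similarity`, and the name `bow_cosine` is hypothetical.

```python
import math
import re
from collections import Counter


def bow_cosine(text1: str, text2: str) -> float:
    """Cosine similarity over raw word counts (no IDF weighting)."""
    a = Counter(re.findall(r"\w+", text1.lower()))
    b = Counter(re.findall(r"\w+", text2.lower()))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


original = "Take the medication twice daily with food."
simplified = "Take this medicine two times every day. Eat food when you take it."

# The meaning is preserved, but only "take" and "food" are shared.
print(round(bow_cosine(original, simplified), 3))
# Copying the source verbatim gives the maximum score.
print(round(bow_cosine(original, original), 3))
```

A low similarity value is therefore not proof that meaning was lost, which is one reason to read the full outputs as well as the scores.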

In [None]:
def display_model_comparison(results: dict, test_name: str):
    """Display side-by-side comparison of models with checkboxes."""
    
    html = f"""<div style='background:#1a1a2e; padding:15px; border-radius:8px; margin:10px 0;'>
    <h3 style='color:#eee; margin:0 0 15px 0;'>📄 Test: {test_name}</h3>
    <div style='display:flex; gap:10px; flex-wrap:wrap;'>"""
    
    for model_name, data in results.items():
        output_eval = data["output_eval"]
        output_score = output_eval["_score"]
        similarity = data["similarity"]
        
        # Count passed metrics
        output_passed = sum(1 for k, v in output_eval.items() if k != "_score" and v.get("pass"))
        output_total = len([k for k in output_eval if k != "_score"])
        
        score_color = "#4ade80" if output_score >= 0.8 else "#fbbf24" if output_score >= 0.6 else "#f87171"
        sim_color = "#4ade80" if similarity >= 0.4 else "#fbbf24" if similarity >= 0.2 else "#f87171"
        
        html += f"""
        <div style='flex:1; min-width:300px; background:#0f3460; padding:12px; border-radius:6px;'>
            <div style='display:flex; justify-content:space-between; align-items:center; margin-bottom:10px;'>
                <strong style='color:#e0e0e0; font-size:14px;'>{model_name}</strong>
                <span style='background:{score_color}; color:#000; padding:2px 8px; border-radius:10px; font-size:11px;'>
                    {output_score:.0%}
                </span>
            </div>
            
            <div style='background:#1a1a2e; padding:8px; border-radius:4px; margin-bottom:10px; max-height:180px; overflow-y:auto;'>
                <pre style='color:#ddd; font-size:11px; white-space:pre-wrap; margin:0;'>{data["output"][:500]}{'...' if len(data["output"]) > 500 else ''}</pre>
            </div>
            
            <div style='display:flex; justify-content:space-between; align-items:center; margin-bottom:8px;'>
                <span style='color:#888; font-size:10px;'>TF-IDF Similarity:</span>
                <span style='color:{sim_color}; font-size:12px; font-weight:bold;'>{similarity:.1%}</span>
            </div>
            
            <div style='color:#888; font-size:10px; margin-bottom:5px; text-transform:uppercase;'>Quality Metrics ({output_passed}/{output_total})</div>
            <div style='font-size:11px;'>"""
        
        # Output metric checkboxes
        for metric_id, result in output_eval.items():
            if metric_id == "_score":
                continue
            icon = "✅" if result["pass"] else "❌"
            metric_name = OUTPUT_METRICS[metric_id]["name"]
            value_str = f" ({result.get('value', '')}w)" if metric_id == "short_sentences" else ""
            html += f"<div style='color:#aaa;'>{icon} {metric_name}{value_str}</div>"
        
        html += """</div>
        </div>"""
    
    html += "</div></div>"
    display(HTML(html))

In [None]:
def display_model_scorecard(all_results: dict):
    """Display overall scorecard for all models."""
    
    # Aggregate scores
    model_scores = {name: {"output": [], "similarity": []} for name in MODELS.keys()}
    
    for test_name, models in all_results.items():
        for model_name, data in models.items():
            model_scores[model_name]["output"].append(data["output_eval"]["_score"])
            model_scores[model_name]["similarity"].append(data["similarity"])
    
    html = """<div style='background:#1a1a2e; padding:20px; border-radius:8px; margin:20px 0;'>
    <h2 style='color:#eee; margin:0 0 15px 0;'>🏆 Model Scorecard</h2>
    <table style='width:100%; border-collapse:collapse;'>
        <tr style='background:#0f3460;'>
            <th style='color:#eee; padding:12px; text-align:left;'>Model</th>
            <th style='color:#eee; padding:12px; text-align:center;'>Model ID</th>
            <th style='color:#eee; padding:12px; text-align:center;'>Avg Quality</th>
            <th style='color:#eee; padding:12px; text-align:center;'>Avg Similarity</th>
            <th style='color:#eee; padding:12px; text-align:center;'>Combined</th>
            <th style='color:#eee; padding:12px; text-align:center;'>Rank</th>
        </tr>"""
    
    # Calculate combined scores and rank
    rankings = []
    for model_name in MODELS.keys():
        output_avg = sum(model_scores[model_name]["output"]) / len(model_scores[model_name]["output"])
        sim_avg = sum(model_scores[model_name]["similarity"]) / len(model_scores[model_name]["similarity"])
        combined = (output_avg * 0.7) + (sim_avg * 0.3)
        rankings.append((model_name, MODELS[model_name], output_avg, sim_avg, combined))
    
    rankings.sort(key=lambda x: x[4], reverse=True)
    
    for rank, (model_name, model_id, output_avg, sim_avg, combined) in enumerate(rankings, 1):
        score_color = "#4ade80" if combined >= 0.7 else "#fbbf24" if combined >= 0.5 else "#f87171"
        rank_icon = "🥇" if rank == 1 else "🥈" if rank == 2 else "🥉" if rank == 3 else f"#{rank}"
        
        html += f"""
        <tr style='border-bottom:1px solid #333;'>
            <td style='color:#ddd; padding:12px;'><strong>{model_name}</strong></td>
            <td style='color:#888; padding:12px; text-align:center; font-family:monospace; font-size:11px;'>{model_id}</td>
            <td style='color:#aaa; padding:12px; text-align:center;'>{output_avg:.0%}</td>
            <td style='color:#aaa; padding:12px; text-align:center;'>{sim_avg:.1%}</td>
            <td style='padding:12px; text-align:center;'>
                <span style='background:{score_color}; color:#000; padding:4px 12px; border-radius:12px;'>{combined:.0%}</span>
            </td>
            <td style='color:#eee; padding:12px; text-align:center; font-size:18px;'>{rank_icon}</td>
        </tr>"""
    
    html += "</table></div>"
    display(HTML(html))
    
    return rankings

---
# 4. Run Model Evaluation

Compare all models on each test text using the fixed prompt.

In [None]:
print(f"🔄 Evaluating {len(MODELS)} models on {len(TEST_TEXTS)} test texts...")
print(f"📝 Using fixed prompt: Prompt B (Structured)\n")

all_results = {}

for test_name, test_text in TEST_TEXTS.items():
    all_results[test_name] = {}
    
    for model_name, model_id in MODELS.items():
        print(f"  → {test_name} | {model_name}...", end=" ")
        
        # Get model output
        output = get_completion(test_text, model_id)
        
        # Evaluate output quality
        output_eval = evaluate_output_quality(output)
        
        # Calculate meaning preservation
        similarity = tfidf_similarity(test_text, output)
        
        all_results[test_name][model_name] = {
            "model_id": model_id,
            "output": output,
            "output_eval": output_eval,
            "similarity": similarity
        }
        
        print(f"✓ ({output_eval['_score']:.0%})")
        time.sleep(0.5)
    
    # Display comparison for this test
    display_model_comparison(all_results[test_name], test_name)

print("\n✅ Evaluation complete!")

---
# 5. Overall Scorecard

See which model performs best overall.

In [None]:
rankings = display_model_scorecard(all_results)

# Winner announcement
winner = rankings[0]
print(f"\n🏆 Winner: {winner[0]} ({winner[1]})")
print(f"   Quality: {winner[2]:.0%} | Similarity: {winner[3]:.1%} | Combined: {winner[4]:.0%}")

---
# 6. Detailed Metrics Breakdown

See exactly which metrics each model passes or fails.

In [None]:
def display_detailed_metrics(all_results: dict):
    """Show detailed metric breakdown per model."""
    
    # Aggregate metrics across all tests
    output_metrics_agg = {name: {m: 0 for m in OUTPUT_METRICS} for name in MODELS}
    num_tests = len(TEST_TEXTS)
    
    for test_name, models in all_results.items():
        for model_name, data in models.items():
            for m in OUTPUT_METRICS:
                if data["output_eval"].get(m, {}).get("pass"):
                    output_metrics_agg[model_name][m] += 1
    
    # Display output metrics
    html = """<div style='background:#1a1a2e; padding:20px; border-radius:8px; margin:20px 0;'>
    <h3 style='color:#eee; margin:0 0 15px 0;'>📊 Output Quality Metrics (per test)</h3>
    <table style='width:100%; border-collapse:collapse; font-size:13px;'>
        <tr style='background:#0f3460;'>
            <th style='color:#eee; padding:8px; text-align:left;'>Metric</th>"""
    
    for name in MODELS:
        short_name = name.split("(")[0].strip()
        html += f"<th style='color:#eee; padding:8px; text-align:center;'>{short_name}</th>"
    html += "</tr>"
    
    for metric_id, metric in OUTPUT_METRICS.items():
        html += f"<tr style='border-bottom:1px solid #333;'><td style='color:#aaa; padding:8px;'>{metric['name']}</td>"
        for name in MODELS:
            count = output_metrics_agg[name][metric_id]
            pct = count / num_tests
            color = "#4ade80" if pct >= 0.8 else "#fbbf24" if pct >= 0.5 else "#f87171"
            icon = "✅" if pct == 1 else "⚠️" if pct > 0 else "❌"
            html += f"<td style='text-align:center; padding:8px;'><span style='color:{color};'>{icon} {count}/{num_tests}</span></td>"
        html += "</tr>"
    
    html += "</table></div>"
    display(HTML(html))

display_detailed_metrics(all_results)

---
# 7. A/B Test: Compare Two Models

Direct head-to-head comparison of two specific models.

In [None]:
# Show available models
print("Available models:")
for i, name in enumerate(MODELS.keys()):
    print(f"  {i+1}. {name}")

# Select two models to compare
model_names = list(MODELS.keys())
MODEL_A = model_names[0] if len(model_names) > 0 else None
MODEL_B = model_names[1] if len(model_names) > 1 else None

if MODEL_A and MODEL_B:
    # Count wins
    a_wins = 0
    b_wins = 0
    ties = 0
    
    print(f"\n⚔️ A/B Test: {MODEL_A} vs {MODEL_B}\n")
    
    for test_name in TEST_TEXTS:
        score_a = all_results[test_name][MODEL_A]["output_eval"]["_score"]
        score_b = all_results[test_name][MODEL_B]["output_eval"]["_score"]
        sim_a = all_results[test_name][MODEL_A]["similarity"]
        sim_b = all_results[test_name][MODEL_B]["similarity"]
        
        # Combined score for comparison
        combined_a = (score_a * 0.7) + (sim_a * 0.3)
        combined_b = (score_b * 0.7) + (sim_b * 0.3)
        
        if combined_a > combined_b:
            a_wins += 1
            winner = f"🅰️ {MODEL_A}"
        elif combined_b > combined_a:
            b_wins += 1
            winner = f"🅱️ {MODEL_B}"
        else:
            ties += 1
            winner = "🤝 Tie"
        
        print(f"  {test_name}: {winner}")
        print(f"      A: {score_a:.0%} quality, {sim_a:.1%} sim → {combined_a:.0%}")
        print(f"      B: {score_b:.0%} quality, {sim_b:.1%} sim → {combined_b:.0%}")
    
    print(f"\n📊 Results: {MODEL_A} wins {a_wins}, {MODEL_B} wins {b_wins}, Ties: {ties}")
    overall_winner = MODEL_A if a_wins > b_wins else MODEL_B if b_wins > a_wins else "Tie"
    print(f"🏆 Overall Winner: {overall_winner}")
else:
    print("⚠️ Need at least 2 models for A/B testing")

---
# 8. Side-by-Side Output Comparison

View the full outputs from all models for manual inspection.

In [None]:
def display_full_outputs(test_name: str):
    """Display full outputs from all models for a given test."""
    
    html = f"""<div style='background:#1a1a2e; padding:20px; border-radius:8px; margin:20px 0;'>
    <h3 style='color:#eee; margin:0 0 10px 0;'>📝 Full Outputs: {test_name}</h3>
    <div style='background:#0f3460; padding:10px; border-radius:4px; margin-bottom:15px;'>
        <div style='color:#888; font-size:10px; text-transform:uppercase;'>Original Text</div>
        <pre style='color:#ddd; font-size:12px; white-space:pre-wrap; margin:5px 0 0 0;'>{TEST_TEXTS[test_name]}</pre>
    </div>
    <div style='display:flex; gap:15px; flex-wrap:wrap;'>"""
    
    for model_name, data in all_results[test_name].items():
        score = data["output_eval"]["_score"]
        sim = data["similarity"]
        score_color = "#4ade80" if score >= 0.8 else "#fbbf24" if score >= 0.6 else "#f87171"
        
        html += f"""
        <div style='flex:1; min-width:300px; background:#0f3460; padding:12px; border-radius:6px;'>
            <div style='display:flex; justify-content:space-between; margin-bottom:8px;'>
                <strong style='color:#e0e0e0;'>{model_name}</strong>
                <span style='background:{score_color}; color:#000; padding:2px 8px; border-radius:10px; font-size:11px;'>{score:.0%}</span>
            </div>
            <div style='color:#888; font-size:10px; margin-bottom:8px;'>Similarity: {sim:.1%}</div>
            <pre style='color:#ddd; font-size:12px; white-space:pre-wrap; background:#1a1a2e; padding:10px; border-radius:4px; margin:0;'>{data["output"]}</pre>
        </div>"""
    
    html += "</div></div>"
    display(HTML(html))

# Display full outputs for each test
for test_name in TEST_TEXTS:
    display_full_outputs(test_name)

---
# Notes for Future Use

## Adding New Models

1. Add your model to the `MODELS` dictionary in Section 2:

```python
MODELS = {
    "Model A (8B Fast)": "llama-3.1-8b-instant",
    "Model B (70B Versatile)": "llama-3.3-70b-versatile",
    "Model C (Gemma)": "gemma2-9b-it",  # Add new model
}
```

2. Re-run all cells
3. Check the scorecard to see how it compares

## Available Groq Models

| Model ID | Description | Speed |
|----------|-------------|-------|
| `llama-3.1-8b-instant` | Fast, efficient | ⚡ Fast |
| `llama-3.3-70b-versatile` | Larger, more capable | 🐢 Slower |
| `gemma2-9b-it` | Google's efficient model | ⚡ Fast |
| `mixtral-8x7b-32768` | Mistral's MoE model | 🐢 Slower |

## Scoring Breakdown

**Combined Score** = (Output Quality × 0.7) + (TF-IDF Similarity × 0.3)

| Component | Weight | What It Measures |
|-----------|--------|------------------|
| Output Quality | 70% | Rule adherence (sentences, bullets, formatting) |
| TF-IDF Similarity | 30% | Lexical overlap with the source text (a rough proxy for meaning preservation) |
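
As a worked example of the weighting, the sketch below computes the combined score for two made-up results. The function name `combined_score` and the input numbers are hypothetical; only the 0.7 and 0.3 weights come from the notebook.

```python
QUALITY_WEIGHT = 0.7      # weight of the rule-adherence score
SIMILARITY_WEIGHT = 0.3   # weight of the TF-IDF similarity


def combined_score(quality: float, similarity: float) -> float:
    """Weighted sum of output quality and similarity, both in the range 0..1."""
    return quality * QUALITY_WEIGHT + similarity * SIMILARITY_WEIGHT


# Hypothetical results: A follows more rules, B stays closer to the source wording.
score_a = combined_score(quality=5 / 6, similarity=0.20)
score_b = combined_score(quality=4 / 6, similarity=0.45)

print(f"A: {score_a:.3f}")
print(f"B: {score_b:.3f}")
print("A ranks higher" if score_a > score_b else "B ranks higher")
```

With these weights, one extra passed quality rule can outweigh a sizeable gap in similarity, so the ranking mostly reflects rule adherence.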

## Output Quality Metrics

| Metric | Description | Weight |
|--------|-------------|--------|
| Short Sentences | Max 10 words per sentence | 2x |
| Uses Bullets | Has bullet points or numbered lists | 1x |
| Clear Paragraphs | Has blank lines between sections | 1x |
| No Intro Text | Starts directly without "Here is..." | 1x |
| No XML Tags | No markup in output | 1x |
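
The quality score is the sum of the weights of the passed metrics divided by the total weight (6 with the table above). The sketch below reproduces that arithmetic in isolation; `WEIGHTS` and `quality_score` are illustrative names, and the pass/fail inputs are invented.

```python
# Metric weights as listed in the table above.
WEIGHTS = {
    "short_sentences": 2,
    "uses_bullets": 1,
    "has_paragraphs": 1,
    "no_intro_text": 1,
    "no_xml_tags": 1,
}


def quality_score(passed: dict) -> float:
    """Share of the total weight earned by the metrics that passed."""
    total = sum(WEIGHTS.values())
    earned = sum(w for name, w in WEIGHTS.items() if passed.get(name, False))
    return earned / total


all_pass = {name: True for name in WEIGHTS}
long_sentences = {**all_pass, "short_sentences": False}  # fails the 2x metric
no_bullets = {**all_pass, "uses_bullets": False}         # fails a 1x metric

print(f"{quality_score(all_pass):.0%}")
print(f"{quality_score(long_sentences):.0%}")
print(f"{quality_score(no_bullets):.0%}")
```

Failing the sentence-length rule costs twice as much as failing any other single rule, which matches its 2x weight.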

# Model Scoring Template

Compare and evaluate different **models** using the winning prompt from prompt evaluation.

## What This Template Does

1. **Fixed Prompt** - Uses Prompt B (Structured) - the winner from prompt evaluation
2. **Model Comparison** - Tests multiple models on the same prompt and test texts
3. **Output Quality Metrics** - Measures how well each model produces Easy Language output
4. **A/B Testing** - Directly compare two models to pick the winner

## Key Difference from Prompt Evaluation

| Template | Fixed Variable | Changeable Variable |
|----------|----------------|---------------------|
| Prompt Evaluation | Model | Prompts |
| **Model Scoring** | **Prompt** | **Models** |

---
# 1. Setup

In [None]:
import os
import re
import time
from dotenv import load_dotenv, find_dotenv
from groq import Groq
from IPython.display import display, HTML, Markdown
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load environment
found_path = find_dotenv(usecwd=True)
if found_path:
    load_dotenv(found_path, override=True)
    print(f"‚úÖ Loaded .env from: {found_path}")

# Initialize Groq
groq_client = None
if os.getenv("GROQ_API_KEY"):
    groq_client = Groq(api_key=os.getenv("GROQ_API_KEY"))
    print("‚úÖ Groq Client Ready")
else:
    print("‚ö†Ô∏è GROQ_API_KEY not found")

---
# 2. Configuration

### Models to Compare

In [None]:
# =============================================
# MODELS TO COMPARE
# =============================================

MODELS = {
    "Model A (8B Fast)": "llama-3.1-8b-instant",
    "Model B (70B Versatile)": "llama-3.3-70b-versatile",
}

# Additional models you can add:
# "Model C (Gemma)": "gemma2-9b-it",
# "Model D (Mixtral)": "mixtral-8x7b-32768",

print(f"ü§ñ Loaded {len(MODELS)} models to compare:")
for name, model_id in MODELS.items():
    print(f"   ‚Ä¢ {name}: {model_id}")

### Fixed Prompt (Winner: Prompt B - Structured)

This prompt won the prompt evaluation and will be used for all model comparisons.

In [None]:
# =============================================
# FIXED PROMPT (Prompt B - Structured)
# Winner from prompt evaluation
# =============================================

SYSTEM_PROMPT = """# Identity

You are an expert in plain language writing.
You specialise in rewriting text to be accessible 
to people with learning disabilities or low literacy.

# Core Task 

* Rewrite the input text to be extremely simple and easy to understand.
* Keep the same meaning as the original text.

# Constraints

* Do NOT include any introductory or concluding text (e.g., "Here is the simplified text").
* Output ONLY the simplified text.
* Never output any XML/HTML tags or attributes (no <...>, no id=...).

# Structure & Formatting Rules

* Use clear structure.
* Use bullet points for steps, lists, or multiple items. Otherwise prefer short sentences.
* Add blank lines between every paragraph.

# Plain Language Rules
# Sentence & Length Rules

* Use very short sentences (maximum 10 words per sentence).
* Break up long sentences.
* Keep subjects and verbs close together.

# Vocabulary & Wording Rules

* Use simple, familiar words. Avoid technical, foreign, or formal terms.
* Explain any uncommon or necessary technical words or abbreviations in parentheses the first time they appear.
* Explain complex ideas or uncommon nouns in parentheses.
* Use positive wording. Avoid negations and never use double negatives.
* Replace abstract nouns with concrete, active verbs.

# Tone & Audience Rules

* Prefer active voice. Avoid passive voice whenever possible.
* Address the reader directly using "you".
* Use a friendly, neutral tone.
* Avoid bureaucratic, legalistic, or commanding language.

# Consistency Rules

* Remove filler words and unnecessary details. Keep only essential information.
* Use the same words consistently. Do not switch terms for the same thing.

# Examples
# The following are example pairs.
# Learn the style and constraints from them.
# Do NOT copy the XML tags into your output.

<examples>

  <example id="1">
    <original_text>
    Upon arrival at the facility, visitors are required to sign in at the front desk and present valid photo identification.
    </original_text>

    <simplified_text>
    When you arrive:

    * Go to the front desk.
    * Sign in with your name.
    * Show your photo ID.
    </simplified_text>
  </example>

  <example id="2">
    <original_text>
    The medication should be administered twice daily with food to minimize potential gastrointestinal discomfort.
    </original_text>

    <simplified_text>
    Take this medicine two times every day.

    Eat food when you take it. This helps your stomach feel better.
    </simplified_text>
  </example>

</examples>

Rewrite this text in simple language:"""

print("üìù Fixed prompt loaded: Prompt B (Structured)")
print(f"   Length: {len(SYSTEM_PROMPT)} characters")

### Test Texts

In [None]:
# Test texts to evaluate models on
TEST_TEXTS = {
    "Legal": """The obligations contained herein shall remain in full force and effect indefinitely, 
notwithstanding the termination of this Agreement, until such time as the Confidential Information 
no longer qualifies as confidential under applicable law.""",

    "Medical": """Patients should take the prescribed medication twice daily with food to minimize 
gastrointestinal discomfort. If adverse reactions occur, discontinue use immediately and consult 
your healthcare provider.""",

    "Bureaucratic": """For the application of housing benefit, a fully completed application form 
must be submitted to the responsible authority. The required proof of income and the rent 
certificate must be attached. Processing time is usually six to eight weeks."""
}

print(f"üìÑ Loaded {len(TEST_TEXTS)} test texts")

### Output Quality Metrics

In [None]:
# =============================================
# OUTPUT QUALITY METRICS
# =============================================

OUTPUT_METRICS = {
    "short_sentences": {
        "name": "Short Sentences",
        "description": "Maximum 10 words per sentence",
        "weight": 2,
        "check": lambda text: max([len(s.split()) for s in re.split(r'[.!?]', text) if s.strip()], default=0)
    },
    "uses_bullets": {
        "name": "Uses Bullet Points",
        "description": "Uses bullet points or numbered lists",
        "weight": 1,
        "check": lambda text: bool(re.search(r'[‚Ä¢\-\*]\s|^\d+\.\s', text, re.MULTILINE))
    },
    "has_paragraphs": {
        "name": "Clear Paragraphs",
        "description": "Has blank lines between sections",
        "weight": 1,
        "check": lambda text: '\n\n' in text
    },
    "no_intro_text": {
        "name": "No Intro/Outro Text",
        "description": "Starts directly without meta-text",
        "weight": 1,
        "check": lambda text: not bool(re.match(r'^(Here\'s|Here is|This is|The following|Sure|Certainly)', text.strip(), re.IGNORECASE))
    },
    "no_xml_tags": {
        "name": "No XML/HTML Tags",
        "description": "No markup in output",
        "weight": 1,
        "check": lambda text: not bool(re.search(r'<[^>]+>', text))
    }
}

print(f"üìä Output metrics: {len(OUTPUT_METRICS)}")

---
# 3. Helper Functions

In [None]:
def display_model_comparison(results: dict, test_name: str):
    """Display side-by-side comparison of models with checkboxes."""
    
    html = f"""<div style='background:#1a1a2e; padding:15px; border-radius:8px; margin:10px 0;'>
    <h3 style='color:#eee; margin:0 0 15px 0;'>üìÑ Test: {test_name}</h3>
    <div style='display:flex; gap:10px; flex-wrap:wrap;'>"""
    
    for model_name, data in results.items():
        output_eval = data["output_eval"]
        output_score = output_eval["_score"]
        similarity = data["similarity"]
        
        # Count passed metrics
        output_passed = sum(1 for k, v in output_eval.items() if k != "_score" and v.get("pass"))
        output_total = len([k for k in output_eval if k != "_score"])
        
        score_color = "#4ade80" if output_score >= 0.8 else "#fbbf24" if output_score >= 0.6 else "#f87171"
        sim_color = "#4ade80" if similarity >= 0.4 else "#fbbf24" if similarity >= 0.2 else "#f87171"
        
        html += f"""
        <div style='flex:1; min-width:300px; background:#0f3460; padding:12px; border-radius:6px;'>
            <div style='display:flex; justify-content:space-between; align-items:center; margin-bottom:10px;'>
                <strong style='color:#e0e0e0; font-size:14px;'>{model_name}</strong>
                <span style='background:{score_color}; color:#000; padding:2px 8px; border-radius:10px; font-size:11px;'>
                    {output_score:.0%}
                </span>
            </div>
            
            <div style='background:#1a1a2e; padding:8px; border-radius:4px; margin-bottom:10px; max-height:180px; overflow-y:auto;'>
                <pre style='color:#ddd; font-size:11px; white-space:pre-wrap; margin:0;'>{data["output"][:500].replace("&", "&amp;").replace("<", "&lt;")}{'...' if len(data["output"]) > 500 else ''}</pre>
            </div>
            
            <div style='display:flex; justify-content:space-between; align-items:center; margin-bottom:8px;'>
                <span style='color:#888; font-size:10px;'>TF-IDF Similarity:</span>
                <span style='color:{sim_color}; font-size:12px; font-weight:bold;'>{similarity:.1%}</span>
            </div>
            
            <div style='color:#888; font-size:10px; margin-bottom:5px; text-transform:uppercase;'>Quality Metrics ({output_passed}/{output_total})</div>
            <div style='font-size:11px;'>"""
        
        # Output metric checkboxes
        for metric_id, result in output_eval.items():
            if metric_id == "_score":
                continue
            icon = "✅" if result["pass"] else "❌"
            metric_name = OUTPUT_METRICS[metric_id]["name"]
            value_str = f" ({result.get('value', '')}w)" if metric_id == "short_sentences" else ""
            html += f"<div style='color:#aaa;'>{icon} {metric_name}{value_str}</div>"
        
        html += """</div>
        </div>"""
    
    html += "</div></div>"
    display(HTML(html))

In [None]:
def display_model_scorecard(all_results: dict):
    """Display overall scorecard for all models."""
    
    # Aggregate scores
    model_scores = {name: {"output": [], "similarity": []} for name in MODELS.keys()}
    
    for test_name, models in all_results.items():
        for model_name, data in models.items():
            model_scores[model_name]["output"].append(data["output_eval"]["_score"])
            model_scores[model_name]["similarity"].append(data["similarity"])
    
    html = """<div style='background:#1a1a2e; padding:20px; border-radius:8px; margin:20px 0;'>
    <h2 style='color:#eee; margin:0 0 15px 0;'>🏆 Model Scorecard</h2>
    <table style='width:100%; border-collapse:collapse;'>
        <tr style='background:#0f3460;'>
            <th style='color:#eee; padding:12px; text-align:left;'>Model</th>
            <th style='color:#eee; padding:12px; text-align:center;'>Model ID</th>
            <th style='color:#eee; padding:12px; text-align:center;'>Avg Quality</th>
            <th style='color:#eee; padding:12px; text-align:center;'>Avg Similarity</th>
            <th style='color:#eee; padding:12px; text-align:center;'>Combined</th>
            <th style='color:#eee; padding:12px; text-align:center;'>Rank</th>
        </tr>"""
    
    # Calculate combined scores and rank
    rankings = []
    for model_name in MODELS.keys():
        output_avg = sum(model_scores[model_name]["output"]) / len(model_scores[model_name]["output"])
        sim_avg = sum(model_scores[model_name]["similarity"]) / len(model_scores[model_name]["similarity"])
        combined = (output_avg * 0.7) + (sim_avg * 0.3)
        rankings.append((model_name, MODELS[model_name], output_avg, sim_avg, combined))
    
    rankings.sort(key=lambda x: x[4], reverse=True)
    
    for rank, (model_name, model_id, output_avg, sim_avg, combined) in enumerate(rankings, 1):
        score_color = "#4ade80" if combined >= 0.7 else "#fbbf24" if combined >= 0.5 else "#f87171"
        rank_icon = "🥇" if rank == 1 else "🥈" if rank == 2 else "🥉" if rank == 3 else f"#{rank}"
        
        html += f"""
        <tr style='border-bottom:1px solid #333;'>
            <td style='color:#ddd; padding:12px;'><strong>{model_name}</strong></td>
            <td style='color:#888; padding:12px; text-align:center; font-family:monospace; font-size:11px;'>{model_id}</td>
            <td style='color:#aaa; padding:12px; text-align:center;'>{output_avg:.0%}</td>
            <td style='color:#aaa; padding:12px; text-align:center;'>{sim_avg:.1%}</td>
            <td style='padding:12px; text-align:center;'>
                <span style='background:{score_color}; color:#000; padding:4px 12px; border-radius:12px;'>{combined:.0%}</span>
            </td>
            <td style='color:#eee; padding:12px; text-align:center; font-size:18px;'>{rank_icon}</td>
        </tr>"""
    
    html += "</table></div>"
    display(HTML(html))
    
    return rankings

In [None]:
def get_completion(text: str, model_id: str) -> str:
    """Call a specific model with the fixed prompt."""
    if not groq_client:
        return "[No API Client]"
    
    try:
        response = groq_client.chat.completions.create(
            model=model_id,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": text}
            ],
            temperature=0.1
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"[Error: {e}]"


def evaluate_output_quality(output: str) -> dict:
    """Evaluate the model output."""
    results = {}
    total_score = 0
    max_score = 0
    
    for metric_id, metric in OUTPUT_METRICS.items():
        check_result = metric["check"](output)
        weight = metric["weight"]
        max_score += weight
        
        if metric_id == "short_sentences":
            passed = check_result <= 10
            results[metric_id] = {"pass": passed, "value": check_result, "weight": weight}
        else:
            passed = bool(check_result)
            results[metric_id] = {"pass": passed, "weight": weight}
        
        if passed:
            total_score += weight
    
    results["_score"] = total_score / max_score if max_score > 0 else 0
    return results


def tfidf_similarity(text1: str, text2: str) -> float:
    """Calculate TF-IDF similarity between texts."""
    try:
        vectorizer = TfidfVectorizer(lowercase=True)
        matrix = vectorizer.fit_transform([text1, text2])
        return round(float(cosine_similarity(matrix[0:1], matrix[1:2])[0][0]), 3)
    except ValueError:  # raised when neither text yields any tokens (empty vocabulary)
        return 0.0
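
`tfidf_similarity` measures word overlap between the source and the output, not meaning. The standard-library sketch below shows roughly what is being computed. It is not scikit-learn's implementation: the tokenizer, the smoothed IDF formula, and the names `tokenize` and `tfidf_cosine_sketch` are assumptions chosen to resemble common defaults, and the example sentences are made up.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and keep words of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_cosine_sketch(doc_a: str, doc_b: str) -> float:
    """Cosine similarity of hand-rolled TF-IDF vectors for two documents."""
    counts = [Counter(tokenize(doc_a)), Counter(tokenize(doc_b))]
    vocab = sorted(set(counts[0]) | set(counts[1]))
    n_docs = len(counts)

    def vector(doc_counts):
        weights = []
        for term in vocab:
            doc_freq = sum(1 for c in counts if term in c)
            # Smoothed IDF: terms found in both documents weigh less.
            idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
            weights.append(doc_counts[term] * idf)
        return weights

    vec_a, vec_b = vector(counts[0]), vector(counts[1])
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm = math.sqrt(sum(x * x for x in vec_a)) * math.sqrt(sum(y * y for y in vec_b))
    return dot / norm if norm else 0.0


source = "The applicant must submit the form before the deadline."
rewrite = "You must send the form on time."

same = tfidf_cosine_sketch(source, source)            # identical texts
partial = tfidf_cosine_sketch(source, rewrite)        # some shared words
disjoint = tfidf_cosine_sketch("red apple", "blue sky")  # no shared words
print(round(same, 3), round(partial, 3), round(disjoint, 3))
```

A good simplification may replace many of the source words, so a low similarity value is a prompt to read the output rather than proof that meaning was lost.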

In [None]:
print(f"🔄 Evaluating {len(MODELS)} models on {len(TEST_TEXTS)} test texts...")
print("📝 Using fixed prompt: Prompt B (Structured)\n")

all_results = {}

for test_name, test_text in TEST_TEXTS.items():
    all_results[test_name] = {}
    
    for model_name, model_id in MODELS.items():
        print(f"  → {test_name} | {model_name}...", end=" ")
        
        # Get model output
        output = get_completion(test_text, model_id)
        
        # Evaluate output quality
        output_eval = evaluate_output_quality(output)
        
        # TF-IDF overlap with the source text, a rough lexical proxy for meaning preservation
        similarity = tfidf_similarity(test_text, output)
        
        all_results[test_name][model_name] = {
            "model_id": model_id,
            "output": output,
            "output_eval": output_eval,
            "similarity": similarity
        }
        
        print(f"✓ ({output_eval['_score']:.0%})")
        time.sleep(0.5)
    
    # Display comparison for this test
    display_model_comparison(all_results[test_name], test_name)

print("\n✅ Evaluation complete!")

rankings = display_model_scorecard(all_results)

# Winner announcement
winner = rankings[0]
print(f"\n🏆 Winner: {winner[0]} ({winner[1]})")
print(f"   Quality: {winner[2]:.0%} | Similarity: {winner[3]:.1%} | Combined: {winner[4]:.0%}")
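
The scorecard ranks models on `0.7 * quality + 0.3 * similarity`. A worked sketch with made-up averages shows that the model with the higher quality score does not always rank first. The numbers and the names `demo_averages`, `combined_score`, and `demo_ranking` are hypothetical.

```python
# Hypothetical per-model averages: (quality, similarity), both in [0, 1].
demo_averages = {
    "Model A": (0.90, 0.20),
    "Model B": (0.80, 0.50),
}


def combined_score(quality: float, similarity: float,
                   quality_weight: float = 0.7) -> float:
    """Blend the two averages; similarity gets the remaining weight."""
    return quality * quality_weight + similarity * (1 - quality_weight)


# Sort by combined score, highest first.
demo_ranking = sorted(
    ((name, combined_score(q, s)) for name, (q, s) in demo_averages.items()),
    key=lambda item: item[1],
    reverse=True,
)

for rank, (name, score) in enumerate(demo_ranking, 1):
    print(f"{rank}. {name}: {score:.2f}")
# 1. Model B: 0.71
# 2. Model A: 0.69
```

Here Model B overtakes Model A because its similarity lead (0.30 × 0.3 = 0.09) outweighs its quality deficit (0.10 × 0.7 = 0.07).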

---
# 6. Detailed Metrics Breakdown

See exactly which metrics each model passes or fails.

In [None]:
def display_detailed_metrics(all_results: dict):
    """Show detailed metric breakdown per model."""
    
    # Aggregate metrics across all tests
    output_metrics_agg = {name: {m: 0 for m in OUTPUT_METRICS} for name in MODELS}
    num_tests = len(TEST_TEXTS)
    
    for test_name, models in all_results.items():
        for model_name, data in models.items():
            for m in OUTPUT_METRICS:
                if data["output_eval"].get(m, {}).get("pass"):
                    output_metrics_agg[model_name][m] += 1
    
    # Display output metrics
    html = """<div style='background:#1a1a2e; padding:20px; border-radius:8px; margin:20px 0;'>
    <h3 style='color:#eee; margin:0 0 15px 0;'>📊 Output Quality Metrics (per test)</h3>
    <table style='width:100%; border-collapse:collapse; font-size:13px;'>
        <tr style='background:#0f3460;'>
            <th style='color:#eee; padding:8px; text-align:left;'>Metric</th>"""
    
    for name in MODELS:
        short_name = name.split("(")[0].strip()
        html += f"<th style='color:#eee; padding:8px; text-align:center;'>{short_name}</th>"
    html += "</tr>"
    
    for metric_id, metric in OUTPUT_METRICS.items():
        html += f"<tr style='border-bottom:1px solid #333;'><td style='color:#aaa; padding:8px;'>{metric['name']}</td>"
        for name in MODELS:
            count = output_metrics_agg[name][metric_id]
            pct = count / num_tests
            color = "#4ade80" if pct >= 0.8 else "#fbbf24" if pct >= 0.5 else "#f87171"
            icon = "✅" if pct == 1 else "⚠️" if pct > 0 else "❌"
            html += f"<td style='text-align:center; padding:8px;'><span style='color:{color};'>{icon} {count}/{num_tests}</span></td>"
        html += "</tr>"
    
    html += "</table></div>"
    display(HTML(html))

display_detailed_metrics(all_results)

---
# 7. A/B Test: Compare Two Models

Direct head-to-head comparison of two specific models.

In [None]:
# Show available models
print("Available models:")
for i, name in enumerate(MODELS.keys()):
    print(f"  {i+1}. {name}")

# Select two models to compare
model_names = list(MODELS.keys())
MODEL_A = model_names[0] if len(model_names) > 0 else None
MODEL_B = model_names[1] if len(model_names) > 1 else None

if MODEL_A and MODEL_B:
    # Count wins
    a_wins = 0
    b_wins = 0
    ties = 0
    
    print(f"\n⚔️ A/B Test: {MODEL_A} vs {MODEL_B}\n")
    
    for test_name in TEST_TEXTS:
        score_a = all_results[test_name][MODEL_A]["output_eval"]["_score"]
        score_b = all_results[test_name][MODEL_B]["output_eval"]["_score"]
        sim_a = all_results[test_name][MODEL_A]["similarity"]
        sim_b = all_results[test_name][MODEL_B]["similarity"]
        
        # Combined score for comparison
        combined_a = (score_a * 0.7) + (sim_a * 0.3)
        combined_b = (score_b * 0.7) + (sim_b * 0.3)
        
        if combined_a > combined_b:
            a_wins += 1
            winner = f"🅰️ {MODEL_A}"
        elif combined_b > combined_a:
            b_wins += 1
            winner = f"🅱️ {MODEL_B}"
        else:
            ties += 1
            winner = "🤝 Tie"
        
        print(f"  {test_name}: {winner}")
        print(f"      A: {score_a:.0%} quality, {sim_a:.1%} sim → {combined_a:.0%}")
        print(f"      B: {score_b:.0%} quality, {sim_b:.1%} sim → {combined_b:.0%}")
    
    print(f"\n📊 Results: {MODEL_A} wins {a_wins}, {MODEL_B} wins {b_wins}, Ties: {ties}")
    overall_winner = MODEL_A if a_wins > b_wins else MODEL_B if b_wins > a_wins else "Tie"
    print(f"🏆 Overall Winner: {overall_winner}")
else:
    print("⚠️ Need at least 2 models for A/B testing")
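
The A/B cell compares combined scores with `>`, so a tie is recorded only when the two floats are exactly equal. If near-equal scores should count as ties, one option is a small tolerance. A minimal sketch, assuming a hypothetical helper `pick_winner` and tolerance `tol` that are not part of the notebook:

```python
import math


def pick_winner(combined_a: float, combined_b: float, tol: float = 1e-9) -> str:
    """Return "A", "B", or "tie"; scores within `tol` of each other tie."""
    if math.isclose(combined_a, combined_b, abs_tol=tol):
        return "tie"
    return "A" if combined_a > combined_b else "B"


# 0.1 + 0.2 is stored as 0.30000000000000004, so a plain `>` would
# hand the win to A even though the two scores are equal on paper.
print(pick_winner(0.1 + 0.2, 0.3))   # tie
print(pick_winner(0.69, 0.71))       # B
```

A larger `tol` (for example 0.01) would treat one-percentage-point differences as ties, which may be a better fit when only a handful of test texts are scored.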
