# Easy Language Evaluation Template

A simple template to compare how well different models follow Easy Language rules.

## What This Template Does

1. **Side-by-Side Comparison** - See outputs from multiple models next to each other
2. **Rule Adherence Check** - Measure how well each model follows specific rules
3. **Bilingual Support** - Test both English and German outputs

## How to Use

1. Run the setup cells
2. Add your test texts (or use the defaults)
3. Run the comparison
4. Review the results table

---
# 1. Setup

In [1]:
# Test sklearn installation
import sys
print("Python:", sys.executable)

try:
    import sklearn
    print(f"✅ sklearn version: {sklearn.__version__}")
except ImportError as e:
    print(f"❌ sklearn not found: {e}")
    print("Installing...")
    import subprocess
    subprocess.check_call([sys.executable, "-m", "pip", "install", "scikit-learn"])
    print("Restart kernel and run again")

Python: /Users/simonvoegely/Desktop/easysprache/klartext/.venv/bin/python
✅ sklearn version: 1.8.0


In [2]:
import html
import os
import re
import time
from dotenv import load_dotenv, find_dotenv
from groq import Groq
from IPython.display import display, HTML, Markdown
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load environment
found_path = find_dotenv(usecwd=True)
if found_path:
    load_dotenv(found_path, override=True)
    print(f"✅ Loaded .env from: {found_path}")

# Initialize Groq
groq_client = None
if os.getenv("GROQ_API_KEY"):
    groq_client = Groq(api_key=os.getenv("GROQ_API_KEY"))
    print("✅ Groq Client Ready")
else:
    print("⚠️ GROQ_API_KEY not found")

# TF-IDF Similarity Function
def tfidf_similarity(text1: str, text2: str) -> float:
    """Calculate TF-IDF cosine similarity between two texts."""
    try:
        vectorizer = TfidfVectorizer(lowercase=True, stop_words=None)
        tfidf_matrix = vectorizer.fit_transform([text1, text2])
        similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
        return round(float(similarity), 3)
    except ValueError:
        # Raised when the texts yield an empty vocabulary (e.g. both are empty)
        return 0.0

print("✅ TF-IDF Similarity Ready")

✅ Loaded .env from: /Users/simonvoegely/Desktop/easysprache/klartext/.env
✅ Groq Client Ready
✅ TF-IDF Similarity Ready
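
To give a feel for what `tfidf_similarity` measures, here is a minimal standard-library sketch. It is intended to approximate scikit-learn's default weighting (smoothed IDF, L2-normalised vectors, tokens of two or more characters); that correspondence is an assumption, not something verified against the library. The helper names `tokenize` and `tfidf_cosine` and the example sentences are hypothetical. It illustrates why a faithful simplification can still score low: the score only rewards shared words.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase, then keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_cosine(text1: str, text2: str) -> float:
    """Cosine similarity of two texts under a simple smoothed TF-IDF weighting."""
    docs = [Counter(tokenize(text1)), Counter(tokenize(text2))]
    vocab = sorted(set(docs[0]) | set(docs[1]))
    if not vocab:
        return 0.0
    n_docs = len(docs)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {
        term: math.log((1 + n_docs) / (1 + sum(term in d for d in docs))) + 1
        for term in vocab
    }
    vectors = [[d[term] * idf[term] for term in vocab] for d in docs]
    dot = sum(a * b for a, b in zip(*vectors))
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    if 0.0 in norms:
        return 0.0
    return dot / (norms[0] * norms[1])


original = "Discontinue use immediately and consult your healthcare provider."
paraphrase = "Stop taking it right away and ask your doctor."

sim_paraphrase = tfidf_cosine(original, paraphrase)
sim_verbatim = tfidf_cosine(original, original)
print(f"paraphrase: {sim_paraphrase:.3f}")
print(f"verbatim:   {sim_verbatim:.3f}")
```

The paraphrase keeps the meaning but shares only two words with the original, so its score is far lower than that of the verbatim copy. Treat a low score as a prompt for manual review rather than proof that meaning was lost.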


---
# 2. Configuration

### Models to Compare

In [3]:
# Models to compare (add or remove as needed)
MODELS = [
    "llama-3.1-8b-instant",      # Fast, smaller model
    "llama-3.3-70b-versatile",   # Larger, more capable
]

# Short names for display
MODEL_NAMES = {
    "llama-3.1-8b-instant": "Llama 8B",
    "llama-3.3-70b-versatile": "Llama 70B",
}

### Test Texts (English + German) 

In [4]:
# English Test Cases. Public Sector Information
TEST_TEXTS_EN = {
    "Safety Advice": """Make sure that the area is a safe place, especially if you plan on walking home at night. 
It's always a good idea to practice the buddy system. Have a friend meet up and walk with you. 
Research the bus, train or streetcar route you plan to take. Check the schedule for both outgoing and return travel. 
Some public transportation ceases to run late at night. Make sure you don't get stuck without a way home.""",

    "Legal Text": """The obligations contained herein shall remain in full force and effect indefinitely, 
notwithstanding the termination of this Agreement, until such time as the Confidential Information 
no longer qualifies as confidential under applicable law. 
The Receiving Party agrees to return all physical copies of the Confidential Information upon request.""",

    "Medical Info": """Patients should take the prescribed medication twice daily with food to minimize 
gastrointestinal discomfort. If adverse reactions occur, discontinue use immediately and consult 
your healthcare provider. Do not exceed the recommended dosage without medical supervision."""
}

# German test cases: public-authority information
TEST_TEXTS_DE = {
    "Behördliche Info": """Für die Beantragung des Wohngeldes ist ein vollständig ausgefüllter Antrag 
beim zuständigen Amt einzureichen. Die erforderlichen Nachweise über das Einkommen sowie die 
Mietbescheinigung sind beizufügen. Die Bearbeitungsdauer beträgt in der Regel sechs bis acht Wochen.""",

    "Medizinische Info": """Das Medikament ist zweimal täglich mit ausreichend Flüssigkeit einzunehmen. 
Bei Auftreten von Nebenwirkungen ist die Einnahme sofort zu unterbrechen und ein Arzt zu konsultieren. 
Eine Überschreitung der empfohlenen Tagesdosis ist zu vermeiden.""",

    "Rechtlicher Hinweis": """Die in diesem Dokument enthaltenen Verpflichtungen bleiben auch nach 
Beendigung des Vertragsverhältnisses in vollem Umfang bestehen, bis die vertraulichen Informationen 
nach geltendem Recht nicht mehr als vertraulich eingestuft werden."""
}

### Easy Language Rules & Prompts

In [5]:
# The rules we want models to follow (used for evaluation)
EASY_LANGUAGE_RULES = {
    "short_sentences": {
        "name": "Short Sentences",
        "description": "Maximum 10 words per sentence",
        "check": lambda text: max([len(s.split()) for s in re.split(r'[.!?]', text) if s.strip()], default=0)
    },
    "uses_bullets": {
        "name": "Uses Bullet Points",
        "description": "Uses bullet points or numbered lists for steps, lists or multiple items",
        "check": lambda text: bool(re.search(r'[•\-\*]\s|^\d+\.\s', text, re.MULTILINE))
    },
    "has_paragraphs": {
        "name": "Clear Paragraphs",
        "description": "Has blank lines between sections",
        "check": lambda text: '\n\n' in text or '\n \n' in text
    },
    "no_intro_text": {
        "name": "No Intro/Outro Text",
        "description": "No introductory or concluding text like 'Here is the simplified text'",
        "check": lambda text: not bool(re.match(r"^(Here's|Here is|This is|The following|Hier ist|In summary|To summarize)", text.strip(), re.IGNORECASE))
    },
    "no_xml_tags": {
        "name": "No XML/HTML Tags",
        "description": "Never output any XML/HTML tags or attributes (no <...>, no id=...)",
        "check": lambda text: not bool(re.search(r'<[^>]+>|id\s*=', text))
    },
    "keep_meaning": {
        "name": "Keep Meaning",
        "description": "Do not drop meaning - rewrite sentence by sentence, do not condense or join",
        "check": lambda text: True  # Placeholder: evaluate_rules() scores this rule with TF-IDF similarity
    },
    "active_voice": {
        "name": "Active Voice",
        "description": "Uses active voice (approximation: fewer than 5 of ' is ', ' are ', ' was ', ' were '; English only)",
        "check": lambda text: text.lower().count(' is ') + text.lower().count(' are ') + text.lower().count(' was ') + text.lower().count(' were ') < 5
    }
}

# System Prompts
PROMPT_EN = """You are an expert in plain language writing.

Rules:
- Very short sentences (maximum 10 words)
- Use only simple, everyday words
- Explain uncommon words in parentheses
- Add blank lines between paragraphs
- Use bullet points for lists or steps
- Do NOT add introductory text like "Here is the simplified text"
- Output ONLY the simplified text

Rewrite this text in simple language:"""

PROMPT_DE = """Du bist ein Experte für Einfache Sprache.

Regeln:
- Sehr kurze Sätze (maximal 10 Wörter)
- Nur einfache, alltägliche Wörter verwenden
- Ungewöhnliche Wörter in Klammern erklären
- Leerzeilen zwischen Absätzen einfügen
- Aufzählungen für Listen oder Schritte verwenden
- KEINE Einleitung wie "Hier ist der vereinfachte Text"
- Gib NUR den vereinfachten Text aus

Schreibe diesen Text in einfacher Sprache:"""
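
One property of the `short_sentences` check is worth seeing in isolation: it splits only on `.`, `!` and `?`, so a bulleted list whose items lack end punctuation is counted as a single long sentence. The sketch below reproduces that check and compares it with a hypothetical variant, `max_sentence_words_by_line`, that also treats a line break as a sentence end. The variant and the sample text are assumptions for illustration, not part of the template.

```python
import re


def max_sentence_words(text: str) -> int:
    """Longest sentence in words, splitting only on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]", text) if s.strip()]
    return max((len(s.split()) for s in sentences), default=0)


def max_sentence_words_by_line(text: str) -> int:
    """Hypothetical variant: a line break also ends a sentence."""
    sentences = [s for s in re.split(r"[.!?\n]", text) if s.strip()]
    return max((len(s.split()) for s in sentences), default=0)


sample = "You need:\n- a full form\n- proof of income\n- proof of rent"

print(max_sentence_words(sample))          # whole list counted as one sentence
print(max_sentence_words_by_line(sample))  # each bullet counted separately
```

Both versions count the bullet marker itself as a word, so reported lengths for bulleted output run slightly high either way.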

---
# 3. Helper Functions

In [6]:
def simplify_text(text: str, model: str, language: str = "en") -> str:
    """Call the model to simplify text."""
    if not groq_client:
        return "[No API Client]"
    
    prompt = PROMPT_EN if language == "en" else PROMPT_DE
    
    try:
        response = groq_client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": text}
            ],
            temperature=0.1
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"[Error: {e}]"


def evaluate_rules(text: str, original_text: str = None) -> dict:
    """Check how well the text follows Easy Language rules.
    
    Args:
        text: The simplified output text
        original_text: The original text (for TF-IDF similarity comparison)
    """
    results = {}
    for rule_id, rule in EASY_LANGUAGE_RULES.items():
        
        if rule_id == "short_sentences":
            # For sentence length, lower is better, pass if max <= 10
            check_result = rule["check"](text)
            results[rule_id] = {"value": check_result, "pass": check_result <= 10}
        
        elif rule_id == "keep_meaning":
            # Use TF-IDF similarity to check meaning preservation
            if original_text:
                similarity = tfidf_similarity(original_text, text)
                # Pass if similarity >= 0.3 (threshold can be adjusted)
                # Note: Simple words replacing complex ones will lower this score
                results[rule_id] = {"value": similarity, "pass": similarity >= 0.3}
            else:
                results[rule_id] = {"value": "N/A", "pass": True}
        
        else:
            # For boolean checks, True = pass
            check_result = rule["check"](text)
            results[rule_id] = {"value": check_result, "pass": bool(check_result)}
    
    return results


def count_words(text: str) -> int:
    """Count words in text."""
    return len(text.split())

In [16]:
def display_side_by_side(original: str, results: dict, test_name: str):
    """Display model outputs side by side with rule evaluation."""
    
    n_models = len(results)
    col_width = 100 // n_models
    
    # Escape original text for safe HTML embedding
    original_escaped = html.escape(original[:200]) + ('...' if len(original) > 200 else '')
    
    # Header
    html_content = f"""<div style='background:#1a1a2e; padding:15px; border-radius:8px; margin:10px 0;'>
    <h3 style='color:#eee; margin:0 0 10px 0;'>📄 {html.escape(test_name)}</h3>
    <div style='background:#16213e; padding:10px; border-radius:5px; margin-bottom:15px;'>
        <strong style='color:#888;'>Original:</strong>
        <p style='color:#aaa; margin:5px 0; font-size:13px;'>{original_escaped}</p>
    </div>
    <div style='display:flex; gap:10px;'>"""
    
    # Each model column
    for model, data in results.items():
        output = data["output"]
        rules = data["rules"]
        model_name = MODEL_NAMES.get(model, model)
        
        # Escape output text for safe HTML embedding
        output_escaped = html.escape(output)
        
        # Calculate rule score
        passed = sum(1 for r in rules.values() if r["pass"])
        total = len(rules)
        score_pct = (passed / total) * 100
        score_color = "#4ade80" if score_pct >= 80 else "#fbbf24" if score_pct >= 60 else "#f87171"
        
        html_content += f"""
        <div style='flex:1; background:#0f3460; padding:12px; border-radius:6px;'>
            <div style='display:flex; justify-content:space-between; align-items:center; margin-bottom:10px;'>
                <strong style='color:#e0e0e0;'>🤖 {html.escape(model_name)}</strong>
                <span style='background:{score_color}; color:#000; padding:2px 8px; border-radius:10px; font-size:12px;'>
                    {passed}/{total} rules
                </span>
            </div>
            <div style='background:#1a1a2e; padding:10px; border-radius:4px; margin-bottom:10px; max-height:200px; overflow-y:auto;'>
                <pre style='color:#ddd; font-size:12px; white-space:pre-wrap; margin:0;'>{output_escaped}</pre>
            </div>
            <div style='font-size:11px;'>"""
        
        # Rule indicators
        for rule_id, result in rules.items():
            icon = "✅" if result["pass"] else "❌"
            rule_name = EASY_LANGUAGE_RULES[rule_id]["name"]
            
            # Show value for specific rules
            if rule_id == "short_sentences":
                value_str = f" ({result['value']}w)"
            elif rule_id == "keep_meaning" and isinstance(result['value'], float):
                value_str = f" ({result['value']:.0%})"
            else:
                value_str = ""
            
            html_content += f"<div style='color:#aaa;'>{icon} {html.escape(rule_name)}{value_str}</div>"
        
        html_content += "</div></div>"
    
    html_content += "</div></div>"
    display(HTML(html_content))

---
# 4. Run Evaluation

### English Tests

In [9]:
print("🇬🇧 Running English Evaluation...\n")

results_en = {}

for test_name, test_text in TEST_TEXTS_EN.items():
    results_en[test_name] = {}
    
    for model in MODELS:
        # Get simplified output
        output = simplify_text(test_text, model, language="en")
        
        # Evaluate rules (pass original text for TF-IDF similarity)
        rules = evaluate_rules(output, original_text=test_text)
        
        results_en[test_name][model] = {
            "output": output,
            "rules": rules
        }
        
        time.sleep(0.5)  # Rate limit
    
    # Display side by side
    display_side_by_side(test_text, results_en[test_name], f"🇬🇧 {test_name}")

üá¨üáß Running English Evaluation...



### German Tests

In [10]:
print("🇩🇪 Running German Evaluation...\n")

results_de = {}

for test_name, test_text in TEST_TEXTS_DE.items():
    results_de[test_name] = {}
    
    for model in MODELS:
        # Get simplified output
        output = simplify_text(test_text, model, language="de")
        
        # Evaluate rules (pass original text for TF-IDF similarity)
        rules = evaluate_rules(output, original_text=test_text)
        
        results_de[test_name][model] = {
            "output": output,
            "rules": rules
        }
        
        time.sleep(0.5)  # Rate limit
    
    # Display side by side
    display_side_by_side(test_text, results_de[test_name], f"🇩🇪 {test_name}")

üá©üá™ Running German Evaluation...



---
# 5. Summary Table

Overall performance across all test cases.

In [12]:
def calculate_summary(results_en: dict, results_de: dict) -> None:
    """Calculate and display summary statistics for each model."""
    
    # Combine results
    all_results = {}
    for test_name, models in results_en.items():
        all_results[f"EN: {test_name}"] = models
    for test_name, models in results_de.items():
        all_results[f"DE: {test_name}"] = models
    
    # Calculate per-model stats
    model_stats = {model: {"passed": 0, "total": 0, "max_sentence": [], "similarity": []} for model in MODELS}
    
    for test_name, models in all_results.items():
        for model, data in models.items():
            for rule_id, result in data["rules"].items():
                model_stats[model]["total"] += 1
                if result["pass"]:
                    model_stats[model]["passed"] += 1
                if rule_id == "short_sentences":
                    model_stats[model]["max_sentence"].append(result["value"])
                if rule_id == "keep_meaning" and isinstance(result["value"], float):
                    model_stats[model]["similarity"].append(result["value"])
    
    # Display summary
    # Display summary (named html_content so the imported html module is not shadowed)
    html_content = """<div style='background:#1a1a2e; padding:20px; border-radius:8px; margin:20px 0;'>
    <h2 style='color:#eee; margin:0 0 15px 0;'>📊 Summary</h2>
    <table style='width:100%; border-collapse:collapse;'>
        <tr style='background:#0f3460;'>
            <th style='color:#eee; padding:10px; text-align:left;'>Model</th>
            <th style='color:#eee; padding:10px; text-align:center;'>Rules Passed</th>
            <th style='color:#eee; padding:10px; text-align:center;'>Score</th>
            <th style='color:#eee; padding:10px; text-align:center;'>Avg Max Sentence</th>
            <th style='color:#eee; padding:10px; text-align:center;'>Avg TF-IDF Sim</th>
        </tr>"""
    
    for model in MODELS:
        stats = model_stats[model]
        score = (stats["passed"] / stats["total"]) * 100 if stats["total"] > 0 else 0
        avg_max = sum(stats["max_sentence"]) / len(stats["max_sentence"]) if stats["max_sentence"] else 0
        avg_sim = sum(stats["similarity"]) / len(stats["similarity"]) if stats["similarity"] else 0
        
        score_color = "#4ade80" if score >= 80 else "#fbbf24" if score >= 60 else "#f87171"
        sim_color = "#4ade80" if avg_sim >= 0.4 else "#fbbf24" if avg_sim >= 0.3 else "#f87171"
        model_name = MODEL_NAMES.get(model, model)
        
        html_content += f"""
        <tr style='border-bottom:1px solid #333;'>
            <td style='color:#ddd; padding:10px;'><strong>{html.escape(model_name)}</strong></td>
            <td style='color:#aaa; padding:10px; text-align:center;'>{stats['passed']}/{stats['total']}</td>
            <td style='padding:10px; text-align:center;'>
                <span style='background:{score_color}; color:#000; padding:3px 10px; border-radius:12px;'>{score:.0f}%</span>
            </td>
            <td style='color:#aaa; padding:10px; text-align:center;'>{avg_max:.1f} words</td>
            <td style='padding:10px; text-align:center;'>
                <span style='background:{sim_color}; color:#000; padding:3px 10px; border-radius:12px;'>{avg_sim:.0%}</span>
            </td>
        </tr>"""
    
    html_content += "</table></div>"
    display(HTML(html_content))

# Run summary
calculate_summary(results_en, results_de)

| Model | Rules Passed | Score | Avg Max Sentence | Avg TF-IDF Sim |
|---|---|---|---|---|
| Llama 8B | 34/42 | 81% | 16.8 words | 28% |
| Llama 70B | 31/42 | 74% | 18.3 words | 22% |
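
The totals in the summary can be checked by hand: 7 rules are applied to 3 English and 3 German texts, giving 42 checks per model. The short sketch below redoes that arithmetic and applies the same colour thresholds as `calculate_summary`. The pass counts are copied from the summary output, and `score_color` is a hypothetical helper name introduced here for illustration.

```python
# Figures taken from this notebook: 7 rules, 3 English + 3 German test texts.
N_RULES = 7
N_TESTS = 3 + 3
TOTAL_CHECKS = N_RULES * N_TESTS  # rule checks per model


def score_color(score_pct: float) -> str:
    """Badge colour thresholds from the summary: green >= 80, amber >= 60, else red."""
    if score_pct >= 80:
        return "#4ade80"
    if score_pct >= 60:
        return "#fbbf24"
    return "#f87171"


# Pass counts copied from the summary output.
passed_by_model = {"Llama 8B": 34, "Llama 70B": 31}
scores = {name: passed / TOTAL_CHECKS * 100 for name, passed in passed_by_model.items()}

for name, pct in scores.items():
    print(f"{name}: {passed_by_model[name]}/{TOTAL_CHECKS} = {pct:.0f}% ({score_color(pct)})")
```

Note that both average TF-IDF similarities in the summary (28% and 22%) sit below the 0.3 pass threshold used for `keep_meaning` in `evaluate_rules`, so that threshold may be worth revisiting for heavily reworded output.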


---
# 6. Example Sentences Comparison

Quick view of how each model handles specific phrases.

In [13]:
def extract_first_sentences(results: dict, n: int = 3) -> None:
    """Show the first n non-empty lines of each model's output for quick comparison."""
    
    html_content = """<div style='background:#1a1a2e; padding:20px; border-radius:8px; margin:20px 0;'>
    <h2 style='color:#eee; margin:0 0 15px 0;'>📝 Example Sentences</h2>"""
    
    for test_name, models in results.items():
        html_content += f"<h4 style='color:#888; margin:15px 0 10px 0;'>{html.escape(test_name)}</h4>"
        html_content += "<table style='width:100%; border-collapse:collapse;'>"
        
        for model, data in models.items():
            output = data["output"]
            # Extract first few sentences/lines
            lines = [l.strip() for l in output.split('\n') if l.strip()][:n]
            sentences = ' | '.join(lines)
            model_name = MODEL_NAMES.get(model, model)
            
            # Escape text for safe HTML embedding
            sentences_escaped = html.escape(sentences[:300]) + ('...' if len(sentences) > 300 else '')
            
            html_content += f"""
            <tr style='border-bottom:1px solid #333;'>
                <td style='color:#4ade80; padding:8px; width:120px;'><strong>{html.escape(model_name)}</strong></td>
                <td style='color:#ddd; padding:8px; font-size:13px;'>{sentences_escaped}</td>
            </tr>"""
        
        html_content += "</table>"
    
    html_content += "</div>"
    display(HTML(html_content))

# Show examples from English results
print("🇬🇧 English Examples:")
extract_first_sentences(results_en)

print("\n🇩🇪 German Examples:")
extract_first_sentences(results_de)

üá¨üáß English Examples:


0,1
Llama 8B,Make sure the area is safe to walk in. | * Check for streetlights and people around. | * Avoid dark or quiet areas.
Llama 70B,Be safe at night. | Walk with a friend. | * Walk with a friend

0,1
Llama 8B,These rules will stay in place forever. | They will not change even if this agreement ends. | They will stay until the secret information is no longer secret.
Llama 70B,Rules last forever. | They stay until secret info is not secret. | * Keep rules even if deal ends

0,1
Llama 8B,"Take medicine twice a day with food. | This helps your stomach feel better. | If you feel bad, stop taking the medicine."
Llama 70B,"Take medicine twice a day with food. | This helps your stomach feel better. | If you feel bad, stop taking it."



üá©üá™ German Examples:


0,1
Llama 8B,"Um Wohngeld zu bekommen, brauchst du: | - einen vollst√§ndigen Antrag | - beim richtigen Amt einreichen"
Llama 70B,Man muss einen Antrag ausf√ºllen. | Der Antrag geht zum Amt. | Man braucht:

0,1
Llama 8B,Das Medikament: | - Nimm es zweimal am Tag. | - Trink viel Wasser dazu.
Llama 70B,"Nimm das Medikament zweimal am Tag. | Trinke viel Wasser dazu. | Nimm das Medikament nicht mehr, wenn du dich schlecht f√ºhlst."

0,1
Llama 8B,Verpflichtungen bleiben nach Vertrag. | 1. Vertrag endet. | 2. Verpflichtungen bleiben bestehen.
Llama 70B,Die Regeln bleiben. | Sie gelten nach Vertragsende. | Hier sind die Regeln:


---
# 7. Custom Test (Add Your Own Text)

Test any text you want to compare.

In [14]:
# ====================================
# ADD YOUR CUSTOM TEXT HERE
# ====================================

CUSTOM_TEXT = """
This research analyzes the political framing evident in congressional debates surrounding the introduction and subsequent abolition of the so-called healthcare copayment. Given that framing is conceptually and methodologically applied inconsistently, a political framing approach was derived. In particular, the conflictual dimension of political framing has heretofore received insufficient scholarly attention.

The investigation demonstrates that the discursive construction of political reality is substantially influenced by strategic framing. Both institutional factors and media amplification effects play a significant role in establishing particular interpretive patterns.


"""

CUSTOM_LANGUAGE = "en"  # Change to "de" for German

# ====================================

if CUSTOM_TEXT.strip() and CUSTOM_TEXT.strip() != "Paste your text here to test.":
    custom_results = {}
    
    for model in MODELS:
        output = simplify_text(CUSTOM_TEXT, model, language=CUSTOM_LANGUAGE)
        rules = evaluate_rules(output, original_text=CUSTOM_TEXT)
        custom_results[model] = {"output": output, "rules": rules}
        time.sleep(0.5)
    
    display_side_by_side(CUSTOM_TEXT, custom_results, "🔧 Custom Test")
else:
    print("ℹ️ Add your text above and run this cell to test.")

---
# 8. Export Results (Optional)

Save results to CSV for later analysis.

In [15]:
import csv
from datetime import datetime

EXPORT_RESULTS = False  # Set to True to export

if EXPORT_RESULTS:
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"../outputs/reports/evaluation_{timestamp}.csv"
    
    # Combine all results
    all_results = []
    
    for lang, results in [("EN", results_en), ("DE", results_de)]:
        for test_name, models in results.items():
            for model, data in models.items():
                row = {
                    "language": lang,
                    "test_case": test_name,
                    "model": model,
                    "output": data["output"][:500],
                }
                for rule_id, result in data["rules"].items():
                    row[f"rule_{rule_id}"] = result["pass"]
                    if rule_id == "short_sentences":
                        row["max_sentence_words"] = result["value"]
                all_results.append(row)
    
    # Write CSV (create the output directory first if it does not exist)
    if all_results:
        os.makedirs(os.path.dirname(filename), exist_ok=True)
        with open(filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=all_results[0].keys())
            writer.writeheader()
            writer.writerows(all_results)
        print(f"✅ Results exported to: {filename}")
else:
    print("ℹ️ Set EXPORT_RESULTS = True to save results.")

ℹ️ Set EXPORT_RESULTS = True to save results.


---
# Notes for Future Use

## Adding New Models
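Add model IDs to the `MODELS` list in Section 2. Optionally add a short display name to `MODEL_NAMES`; models without one are shown under their full ID.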
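
As a small illustration of the display-name fallback, the sketch below registers one extra model without a `MODEL_NAMES` entry. The ID `example-provider-model-id` is a hypothetical placeholder, not a real model; substitute an ID that your provider actually serves.

```python
MODELS = ["llama-3.1-8b-instant", "llama-3.3-70b-versatile"]
MODEL_NAMES = {
    "llama-3.1-8b-instant": "Llama 8B",
    "llama-3.3-70b-versatile": "Llama 70B",
}

# Hypothetical placeholder ID, used only to show the fallback behaviour.
NEW_MODEL_ID = "example-provider-model-id"
MODELS.append(NEW_MODEL_ID)

# No MODEL_NAMES entry was added, so the lookup falls back to the raw ID.
labels = [MODEL_NAMES.get(model, model) for model in MODELS]
print(labels)
```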

## Adding New Rules
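Add new rule definitions to `EASY_LANGUAGE_RULES` in Section 2. Each rule needs a `name`, a `description`, and a `check` function that returns a truthy value when the text passes. Rules that report a numeric value (like `short_sentences`) also need their own branch in `evaluate_rules`.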
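
A minimal sketch of a boolean rule, assuming the generic branch of `evaluate_rules` (where `True` means pass). The rule `short_words` and its 15-letter threshold are hypothetical examples, not part of the template, and the empty dict stands in for the one defined in Section 2.

```python
import re

# Stand-in for the dict defined in Section 2, so this sketch runs on its own.
EASY_LANGUAGE_RULES = {}

MAX_WORD_LETTERS = 15  # hypothetical threshold, chosen only for illustration

EASY_LANGUAGE_RULES["short_words"] = {
    "name": "Short Words",
    "description": f"No word longer than {MAX_WORD_LETTERS} letters",
    # Returns True (pass) when every word is within the limit.
    "check": lambda text: all(
        len(word) <= MAX_WORD_LETTERS for word in re.findall(r"\w+", text)
    ),
}

check = EASY_LANGUAGE_RULES["short_words"]["check"]
print(check("Take the medicine twice a day."))
print(check("Das Bundesausbildungsförderungsgesetz gilt."))
```

Because the check returns a plain boolean, it needs no extra handling in `evaluate_rules` and is counted in the per-model score automatically.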

## Use Cases
This template can be used to evaluate:
- **PDF uploads**: Add extracted text to `CUSTOM_TEXT`
- **Pasted text**: Add to `CUSTOM_TEXT` or `TEST_TEXTS_EN/DE`
- **Chrome extension**: Add URL-extracted text to `CUSTOM_TEXT`

## Metrics to Track
- Rule adherence score (% rules passed)
- Average max sentence length
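- Compression ratio (output words / input words); not yet computed by this template, though `count_words` is available for it
- Response time per model; not yet measured by this template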
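
The last two metrics could be collected along these lines. This is a sketch under stated assumptions: `compression_ratio`, `timed_call`, and `fake_simplify` are hypothetical names, and `fake_simplify` is a stand-in for `simplify_text` so the example runs without an API key.

```python
import time
from typing import Callable


def compression_ratio(original: str, output: str) -> float:
    """Output words divided by input words (0.0 for empty input)."""
    n_in = len(original.split())
    return len(output.split()) / n_in if n_in else 0.0


def timed_call(fn: Callable[..., str], *args, **kwargs) -> tuple[str, float]:
    """Run fn and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_simplify(text: str) -> str:
    """Hypothetical stand-in for simplify_text(): keeps the first half of the words."""
    words = text.split()
    return " ".join(words[: len(words) // 2])


original = "Do not exceed the recommended dosage without medical supervision."
output, seconds = timed_call(fake_simplify, original)
ratio = compression_ratio(original, output)
print(f"ratio={ratio:.2f}, seconds={seconds:.4f}")
```

In the notebook itself, `timed_call(simplify_text, test_text, model, language="en")` would be the corresponding call. The measured time would then include network latency as well as model inference.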