# Prompt Evaluation Template

Compare and evaluate different prompts to find the best one for Easy Language simplification.

## What This Template Does

1. **Side-by-Side Prompt Comparison** - Test multiple prompts on the same text
2. **Output Quality Metrics** - Measure how well each prompt produces Easy Language output
3. **Prompt Structure Analysis** - Evaluate the prompt itself (clarity, completeness, specificity)
4. **A/B Testing** - Directly compare two prompts to pick the winner

## Evaluation Dimensions

| Dimension | What We Measure |
|-----------|-----------------|
| **Output Quality** | Does the output follow Easy Language rules? |
| **Instruction Adherence** | Does the model follow the prompt's instructions? |
| **Consistency** | Does the same prompt produce consistent results? |
| **Prompt Clarity** | Is the prompt well-structured and unambiguous? |
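
The Consistency dimension can be quantified by running the same prompt several times and comparing the outputs pairwise. The sketch below is one possible approach, not part of the notebook: `jaccard` and `consistency_score` are hypothetical helper names, and word-set overlap is used here as a dependency-free stand-in for the TF-IDF similarity computed later in the notebook.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts (1.0 = identical vocabulary)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def consistency_score(outputs: list[str]) -> float:
    """Mean pairwise similarity across repeated runs of one prompt."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # a single run cannot disagree with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Illustrative outputs from three runs of the same prompt (made up for the sketch).
runs = [
    "Take this medicine two times every day.",
    "Take this medicine two times every day.",
    "Take the medicine twice a day.",
]
print(round(consistency_score(runs), 2))
```

A score near 1.0 would suggest the prompt yields stable wording across runs; with only one run per prompt the score is trivially 1.0, so this is only informative when several runs are collected.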

---
# 1. Setup

In [1]:
import os
import re
import time
from dotenv import load_dotenv, find_dotenv
from groq import Groq
from IPython.display import display, HTML, Markdown
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load environment
found_path = find_dotenv(usecwd=True)
if found_path:
    load_dotenv(found_path, override=True)
    print(f"✅ Loaded .env from: {found_path}")

# Initialize Groq
groq_client = None
if os.getenv("GROQ_API_KEY"):
    groq_client = Groq(api_key=os.getenv("GROQ_API_KEY"))
    print("✅ Groq Client Ready")
else:
    print("⚠️ GROQ_API_KEY not found")

✅ Loaded .env from: /Users/simonvoegely/Desktop/easysprache/klartext/.env
✅ Groq Client Ready


---
# 2. Configuration

### Model to Use (fixed for fair comparison)

In [2]:
# Use ONE model to compare prompts fairly
EVAL_MODEL = "llama-3.3-70b-versatile"

# Number of runs per prompt (for consistency testing).
# Note: the evaluation loop in Section 4 currently calls the model once per prompt and text.
NUM_RUNS = 1  # Increase for more reliable consistency scores

### Prompts to Compare

Add your prompts here. Each prompt should be a complete system prompt.

In [3]:
# =============================================
# PROMPTS TO COMPARE
# =============================================

PROMPTS = {
    "Prompt A (Old)": """# Identity

You are an expert in plain language writing specialized in making complex information 
accessible to everyone, including people with learning disabilities or low literacy.

# Instructions

*   Rewrite the input text to be extremely simple and easy to understand (Level I).
*   Use very short sentences (maximum 10 words per sentence).
*   Use only simple, everyday words. Explain any uncommon words in parentheses locally.
*   Add blank lines between every paragraph.
*   Use bullet points for steps, lists, or multiple items. Otherwise use short sentences.
*   Do NOT include any introductory or concluding text (e.g., "Here is the simplified text").
*   Output ONLY the simplified text.

# Plain Language Rules

* Address the reader directly using "you". Use a friendly, neutral tone.
* Avoid bureaucratic, legalistic, or commanding language.
* Prefer active voice. Avoid passive voice whenever possible.
* Use positive wording. Avoid negations and never use double negatives.
* Use simple, familiar words. Avoid technical, foreign, or formal terms.
* Replace abstract nouns with concrete, active verbs.
* Explain necessary technical terms or abbreviations the first time they appear.
* Remove filler words and unnecessary details. Keep only essential information.
* Use the same words consistently. Do not switch terms for the same thing.
* Break up long sentences. No sentence longer than 10 words.
* Keep subjects and verbs close together.
* Use clear structure. Use bullet points for lists or steps.

# Examples

The following are example pairs. Learn the style and constraints from them.

<example id="1">
<original_text>
Upon arrival at the facility, visitors are required to sign in at the front desk and present valid photo identification.
</original_text>

<simplified_text>
When you arrive:

* Go to the front desk.
* Sign your name.
* Show your photo ID.
</simplified_text>
</example>

<example id="2">
<original_text>
The medication should be administered twice daily with food to minimize potential gastrointestinal discomfort.
</original_text>

<simplified_text>
Take this medicine two times every day.

* Eat food when you take it.
* This helps your stomach feel better.
</simplified_text>
</example>

Rewrite this text in simple language:""",



    "Prompt B (Structured)": """# Identity

You are an expert in plain language writing.
You specialise in rewriting text to be accessible 
to people with learning disabilities or low literacy.

# Core Task 

* Rewrite the input text to be extremely simple and easy to understand.
* Keep the same meaning as the original text.

# Constraints

* Do NOT include any introductory or concluding text (e.g., "Here is the simplified text").
* Output ONLY the simplified text.
* Never output any XML/HTML tags or attributes (no <...>, no id=...).

# Structure & Formatting Rules

* Use clear structure.
* Use bullet points for steps, lists, or multiple items. Otherwise prefer short sentences.
* Add blank lines between every paragraph.

# Plain Language Rules
# Sentence & Length Rules

* Use very short sentences (maximum 10 words per sentence).
* Break up long sentences.
* Keep subjects and verbs close together.

# Vocabulary & Wording Rules

* Use simple, familiar words. Avoid technical, foreign, or formal terms.
* Explain any uncommon or necessary technical words or abbreviations in parentheses the first time they appear.
* Explain complex ideas or uncommon nouns in parentheses.
* Use positive wording. Avoid negations and never use double negatives.
* Replace abstract nouns with concrete, active verbs.

# Tone & Audience Rules

* Prefer active voice. Avoid passive voice whenever possible.
* Address the reader directly using “you”.
* Use a friendly, neutral tone.
* Avoid bureaucratic, legalistic, or commanding language.

# Consistency Rules

* Remove filler words and unnecessary details. Keep only essential information.
* Use the same words consistently. Do not switch terms for the same thing.

# Examples
# The following are example pairs.
# Learn the style and constraints from them.
# Do NOT copy the XML tags into your output.

<examples>

  <example id="1">
    <original_text>
    Upon arrival at the facility, visitors are required to sign in at the front desk and present valid photo identification.
    </original_text>

    <simplified_text>
    When you arrive:

    * Go to the front desk.
    * Sign in with your name.
    * Show your photo ID.
    </simplified_text>
  </example>

  <example id="2">
    <original_text>
    The medication should be administered twice daily with food to minimize potential gastrointestinal discomfort.
    </original_text>

    <simplified_text>
    Take this medicine two times every day.

    Eat food when you take it. This helps your stomach feel better.
    </simplified_text>
  </example>

</examples>

Rewrite this text in simple language:"""
}

print(f"📝 Loaded {len(PROMPTS)} prompts to compare")

📝 Loaded 2 prompts to compare


### Test Texts (same for all prompts)

In [4]:
# Test texts to evaluate prompts on
TEST_TEXTS = {
    "Legal": """The obligations contained herein shall remain in full force and effect indefinitely, 
notwithstanding the termination of this Agreement, until such time as the Confidential Information 
no longer qualifies as confidential under applicable law.""",

    "Medical": """Patients should take the prescribed medication twice daily with food to minimize 
gastrointestinal discomfort. If adverse reactions occur, discontinue use immediately and consult 
your healthcare provider.""",

    "Bureaucratic": """For the application of housing benefit, a fully completed application form 
must be submitted to the responsible authority. The required proof of income and the rent 
certificate must be attached. Processing time is usually six to eight weeks."""
}

### Evaluation Metrics

We evaluate prompts on two dimensions:
1. **Prompt Quality** - Structure and clarity of the prompt itself
2. **Output Quality** - How well the output follows Easy Language rules
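
The Short Sentences rule (maximum 10 words) depends on how sentences are delimited. Splitting only on `.`, `!`, and `?` merges bullet items that lack end punctuation into one long "sentence". The sketch below is a possible alternative, not the notebook's check: `max_sentence_words` and `BULLET_PREFIX` are hypothetical names, and treating every line break as a sentence boundary is an assumption that suits bulleted Easy Language output.

```python
import re

# Leading list marker: "-", "*", "•", or "1." followed by whitespace.
BULLET_PREFIX = re.compile(r"^\s*(?:[-*•]|\d+\.)\s+")


def max_sentence_words(text: str) -> int:
    """Longest sentence in words; line breaks also end a sentence."""
    longest = 0
    for line in text.splitlines():
        line = BULLET_PREFIX.sub("", line)  # drop the list marker
        for sentence in re.split(r"[.!?]", line):
            longest = max(longest, len(sentence.split()))
    return longest


sample = "When you arrive:\n\n* Go to the front desk\n* Show your photo ID"
print(max_sentence_words(sample))
```

For this sample, a punctuation-only split would count all 14 whitespace-separated tokens (including the two `*` markers) as a single sentence and fail the 10-word limit, whereas the line-aware version reports the longest bullet. Decimal numbers such as "2.5" are still split at the period, so this remains a heuristic.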

In [12]:
# =============================================
# PROMPT QUALITY METRICS (evaluate the prompt itself)
# =============================================

PROMPT_METRICS = {
    "has_identity": {
        "name": "Has Identity/Role",
        "description": "Defines who the AI should be",
        "weight": 1,
        "check": lambda p: bool(re.search(r'(you are|act as|identity|role)', p.lower()))
    },
    "has_rules": {
        "name": "Has Explicit Rules",
        "description": "Contains numbered or bulleted rules",
        "weight": 2,
        "check": lambda p: bool(re.search(r'[-*•]\s|^\d+\.|rules:|instructions:', p, re.MULTILINE | re.IGNORECASE))
    },
    "has_constraints": {
        "name": "Has Constraints",
        "description": "Specifies what NOT to do",
        "weight": 1,
        "check": lambda p: bool(re.search(r'(do not|don\'t|never|avoid|no )', p.lower()))
    },
    "has_examples": {
        "name": "Has Examples",
        "description": "Includes example input/output",
        "weight": 2,
        "check": lambda p: bool(re.search(r'(example|original|simplified|before|after)', p.lower()))
    },
    "has_word_limit": {
        "name": "Specifies Word/Sentence Limit",
        "description": "Mentions specific word or sentence length",
        "weight": 1,
        "check": lambda p: bool(re.search(r'(\d+\s*words?|\d+\s*sentences?|maximum|max)', p.lower()))
    },
    "has_formatting": {
        "name": "Specifies Formatting",
        "description": "Mentions bullet points, paragraphs, structure",
        "weight": 1,
        "check": lambda p: bool(re.search(r'(bullet|paragraph|blank line|format|structure)', p.lower()))
    },
    "no_output_instruction": {
        "name": "Output-Only Instruction",
        "description": "Says to output only the result (no intro text)",
        "weight": 1,
        "check": lambda p: bool(re.search(r'(only|output only|no intro|no conclusion)', p.lower()))
    }
}

# =============================================
# OUTPUT QUALITY METRICS (evaluate the model output)
# =============================================

OUTPUT_METRICS = {
    "short_sentences": {
        "name": "Short Sentences",
        "description": "Maximum 10 words per sentence",
        "weight": 2,
        "check": lambda text: max([len(s.split()) for s in re.split(r'[.!?]', text) if s.strip()], default=0)
    },
    "uses_bullets": {
        "name": "Uses Bullet Points",
        "description": "Uses bullet points or numbered lists",
        "weight": 1,
        "check": lambda text: bool(re.search(r'[•\-\*]\s|^\d+\.\s', text, re.MULTILINE))
    },
    "has_paragraphs": {
        "name": "Clear Paragraphs",
        "description": "Has blank lines between sections",
        "weight": 1,
        "check": lambda text: '\n\n' in text
    },
    "no_intro_text": {
        "name": "No Intro/Outro Text",
        "description": "Starts directly without meta-text",
        "weight": 1,
        "check": lambda text: not bool(re.match(r'^(Here\'s|Here is|This is|The following|Sure|Certainly)', text.strip(), re.IGNORECASE))
    },
    "no_xml_tags": {
        "name": "No XML/HTML Tags",
        "description": "No markup in output",
        "weight": 1,
        "check": lambda text: not bool(re.search(r'<[^>]+>', text))
    }
}

print(f"📊 Prompt metrics: {len(PROMPT_METRICS)}")
print(f"📊 Output metrics: {len(OUTPUT_METRICS)}")

📊 Prompt metrics: 7
📊 Output metrics: 5


---
# 3. Helper Functions

In [13]:
def get_completion(text: str, system_prompt: str) -> str:
    """Call the model with a given prompt."""
    if not groq_client:
        return "[No API Client]"
    
    try:
        response = groq_client.chat.completions.create(
            model=EVAL_MODEL,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": text}
            ],
            temperature=0.1
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"[Error: {e}]"


def evaluate_prompt_quality(prompt: str) -> dict:
    """Evaluate the prompt itself (structure, clarity)."""
    results = {}
    total_score = 0
    max_score = 0
    
    for metric_id, metric in PROMPT_METRICS.items():
        passed = metric["check"](prompt)
        weight = metric["weight"]
        max_score += weight
        if passed:
            total_score += weight
        results[metric_id] = {"pass": passed, "weight": weight}
    
    results["_score"] = total_score / max_score if max_score > 0 else 0
    return results


def evaluate_output_quality(output: str) -> dict:
    """Evaluate the model output."""
    results = {}
    total_score = 0
    max_score = 0
    
    for metric_id, metric in OUTPUT_METRICS.items():
        check_result = metric["check"](output)
        weight = metric["weight"]
        max_score += weight
        
        if metric_id == "short_sentences":
            passed = check_result <= 10
            results[metric_id] = {"pass": passed, "value": check_result, "weight": weight}
        else:
            passed = bool(check_result)
            results[metric_id] = {"pass": passed, "weight": weight}
        
        if passed:
            total_score += weight
    
    results["_score"] = total_score / max_score if max_score > 0 else 0
    return results


def tfidf_similarity(text1: str, text2: str) -> float:
    """Calculate TF-IDF similarity between texts."""
    try:
        vectorizer = TfidfVectorizer(lowercase=True)
        matrix = vectorizer.fit_transform([text1, text2])
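        # Cast the NumPy scalar to a plain float before rounding
        return round(float(cosine_similarity(matrix[0:1], matrix[1:2])[0][0]), 3)
    except Exception:
        # e.g. empty vocabulary when a text has no usable tokens
        return 0.0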

In [7]:
def display_prompt_comparison(results: dict, test_name: str):
    """Display side-by-side comparison of prompts with checkboxes."""
    
    html = f"""<div style='background:#1a1a2e; padding:15px; border-radius:8px; margin:10px 0;'>
    <h3 style='color:#eee; margin:0 0 15px 0;'>📄 Test: {test_name}</h3>
    <div style='display:flex; gap:10px; flex-wrap:wrap;'>"""
    
    for prompt_name, data in results.items():
        prompt_eval = data["prompt_eval"]
        output_eval = data["output_eval"]
        prompt_score = prompt_eval["_score"]
        output_score = output_eval["_score"]
        combined = (prompt_score + output_score) / 2
        
        # Count passed metrics
        prompt_passed = sum(1 for k, v in prompt_eval.items() if k != "_score" and v.get("pass"))
        prompt_total = len([k for k in prompt_eval if k != "_score"])
        output_passed = sum(1 for k, v in output_eval.items() if k != "_score" and v.get("pass"))
        output_total = len([k for k in output_eval if k != "_score"])
        
        score_color = "#4ade80" if combined >= 0.7 else "#fbbf24" if combined >= 0.5 else "#f87171"
        
        html += f"""
        <div style='flex:1; min-width:300px; background:#0f3460; padding:12px; border-radius:6px;'>
            <div style='display:flex; justify-content:space-between; align-items:center; margin-bottom:10px;'>
                <strong style='color:#e0e0e0; font-size:14px;'>{prompt_name}</strong>
                <span style='background:{score_color}; color:#000; padding:2px 8px; border-radius:10px; font-size:11px;'>
                    {combined:.0%}
                </span>
            </div>
            
            <div style='background:#1a1a2e; padding:8px; border-radius:4px; margin-bottom:10px; max-height:150px; overflow-y:auto;'>
                <pre style='color:#ddd; font-size:11px; white-space:pre-wrap; margin:0;'>{data["output"][:400]}{'...' if len(data["output"]) > 400 else ''}</pre>
            </div>
            
            <div style='display:flex; gap:15px;'>
                <div style='flex:1;'>
                    <div style='color:#888; font-size:10px; margin-bottom:5px; text-transform:uppercase;'>Prompt ({prompt_passed}/{prompt_total})</div>
                    <div style='font-size:11px;'>"""
        
        # Prompt metric checkboxes
        for metric_id, result in prompt_eval.items():
            if metric_id == "_score":
                continue
            icon = "✅" if result["pass"] else "❌"
            metric_name = PROMPT_METRICS[metric_id]["name"]
            html += f"<div style='color:#aaa;'>{icon} {metric_name}</div>"
        
        html += f"""</div>
                </div>
                <div style='flex:1;'>
                    <div style='color:#888; font-size:10px; margin-bottom:5px; text-transform:uppercase;'>Output ({output_passed}/{output_total})</div>
                    <div style='font-size:11px;'>"""
        
        # Output metric checkboxes
        for metric_id, result in output_eval.items():
            if metric_id == "_score":
                continue
            icon = "✅" if result["pass"] else "❌"
            metric_name = OUTPUT_METRICS[metric_id]["name"]
            # Show value for sentence length
            value_str = f" ({result.get('value', '')}w)" if metric_id == "short_sentences" else ""
            html += f"<div style='color:#aaa;'>{icon} {metric_name}{value_str}</div>"
        
        html += """</div>
                </div>
            </div>
        </div>"""
    
    html += "</div></div>"
    display(HTML(html))


def display_prompt_scorecard(all_results: dict):
    """Display overall scorecard for all prompts."""
    
    # Aggregate scores
    prompt_scores = {name: {"prompt": [], "output": []} for name in PROMPTS.keys()}
    
    for test_name, prompts in all_results.items():
        for prompt_name, data in prompts.items():
            prompt_scores[prompt_name]["prompt"].append(data["prompt_eval"]["_score"])
            prompt_scores[prompt_name]["output"].append(data["output_eval"]["_score"])
    
    html = """<div style='background:#1a1a2e; padding:20px; border-radius:8px; margin:20px 0;'>
    <h2 style='color:#eee; margin:0 0 15px 0;'>🏆 Prompt Scorecard</h2>
    <table style='width:100%; border-collapse:collapse;'>
        <tr style='background:#0f3460;'>
            <th style='color:#eee; padding:12px; text-align:left;'>Prompt</th>
            <th style='color:#eee; padding:12px; text-align:center;'>Prompt Quality</th>
            <th style='color:#eee; padding:12px; text-align:center;'>Avg Output Quality</th>
            <th style='color:#eee; padding:12px; text-align:center;'>Combined Score</th>
            <th style='color:#eee; padding:12px; text-align:center;'>Rank</th>
        </tr>"""
    
    # Calculate combined scores and rank
    rankings = []
    for prompt_name in PROMPTS.keys():
        prompt_avg = sum(prompt_scores[prompt_name]["prompt"]) / len(prompt_scores[prompt_name]["prompt"])
        output_avg = sum(prompt_scores[prompt_name]["output"]) / len(prompt_scores[prompt_name]["output"])
        combined = (prompt_avg + output_avg) / 2
        rankings.append((prompt_name, prompt_avg, output_avg, combined))
    
    rankings.sort(key=lambda x: x[3], reverse=True)
    
    for rank, (prompt_name, prompt_avg, output_avg, combined) in enumerate(rankings, 1):
        score_color = "#4ade80" if combined >= 0.7 else "#fbbf24" if combined >= 0.5 else "#f87171"
        rank_icon = "🥇" if rank == 1 else "🥈" if rank == 2 else "🥉" if rank == 3 else f"#{rank}"
        
        html += f"""
        <tr style='border-bottom:1px solid #333;'>
            <td style='color:#ddd; padding:12px;'><strong>{prompt_name}</strong></td>
            <td style='color:#aaa; padding:12px; text-align:center;'>{prompt_avg:.0%}</td>
            <td style='color:#aaa; padding:12px; text-align:center;'>{output_avg:.0%}</td>
            <td style='padding:12px; text-align:center;'>
                <span style='background:{score_color}; color:#000; padding:4px 12px; border-radius:12px;'>{combined:.0%}</span>
            </td>
            <td style='color:#eee; padding:12px; text-align:center; font-size:18px;'>{rank_icon}</td>
        </tr>"""
    
    html += "</table></div>"
    display(HTML(html))
    
    return rankings

---
# 4. Run Prompt Evaluation

Compare all prompts on each test text.

In [14]:
print(f"🔄 Evaluating {len(PROMPTS)} prompts on {len(TEST_TEXTS)} test texts...\n")

all_results = {}

for test_name, test_text in TEST_TEXTS.items():
    all_results[test_name] = {}
    
    for prompt_name, prompt in PROMPTS.items():
        # Evaluate prompt structure
        prompt_eval = evaluate_prompt_quality(prompt)
        
        # Get model output
        output = get_completion(test_text, prompt)
        
        # Evaluate output quality
        output_eval = evaluate_output_quality(output)
        
        # Estimate meaning preservation via TF-IDF cosine similarity (lexical overlap only)
        similarity = tfidf_similarity(test_text, output)
        
        all_results[test_name][prompt_name] = {
            "prompt": prompt,
            "output": output,
            "prompt_eval": prompt_eval,
            "output_eval": output_eval,
            "similarity": similarity
        }
        
        time.sleep(0.5)
    
    # Display comparison for this test
    display_prompt_comparison(all_results[test_name], test_name)

print("\n✅ Evaluation complete!")

🔄 Evaluating 2 prompts on 3 test texts...

✅ Evaluation complete!


---
# 5. Overall Scorecard

See which prompt performs best overall.

In [9]:
rankings = display_prompt_scorecard(all_results)

# Winner announcement
winner = rankings[0]
print(f"\n🏆 Winner: {winner[0]} with {winner[3]:.0%} combined score")

| Prompt | Prompt Quality | Avg Output Quality | Combined Score | Rank |
|--------|----------------|--------------------|----------------|------|
| Prompt A (Old) | 100% | 100% | 100% | 🥇 |
| Prompt B (Structured) | 100% | 78% | 89% | 🥈 |

🏆 Winner: Prompt A (Old) with 100% combined score


---
# 6. Detailed Metrics Breakdown

See exactly which metrics each prompt passes or fails.

In [15]:
def display_detailed_metrics(all_results: dict):
    """Show detailed metric breakdown per prompt."""
    
    # Aggregate metrics across all tests
    prompt_metrics_agg = {name: {m: 0 for m in PROMPT_METRICS} for name in PROMPTS}
    output_metrics_agg = {name: {m: 0 for m in OUTPUT_METRICS} for name in PROMPTS}
    num_tests = len(TEST_TEXTS)
    
    for test_name, prompts in all_results.items():
        for prompt_name, data in prompts.items():
            for m in PROMPT_METRICS:
                if data["prompt_eval"].get(m, {}).get("pass"):
                    prompt_metrics_agg[prompt_name][m] += 1
            for m in OUTPUT_METRICS:
                if data["output_eval"].get(m, {}).get("pass"):
                    output_metrics_agg[prompt_name][m] += 1
    
    # Display prompt metrics
    html = """<div style='background:#1a1a2e; padding:20px; border-radius:8px; margin:20px 0;'>
    <h3 style='color:#eee; margin:0 0 15px 0;'>📋 Prompt Structure Metrics</h3>
    <table style='width:100%; border-collapse:collapse; font-size:13px;'>
        <tr style='background:#0f3460;'>
            <th style='color:#eee; padding:8px; text-align:left;'>Metric</th>"""
    
    for name in PROMPTS:
        short_name = name.split("(")[0].strip()
        html += f"<th style='color:#eee; padding:8px; text-align:center;'>{short_name}</th>"
    html += "</tr>"
    
    for metric_id, metric in PROMPT_METRICS.items():
        html += f"<tr style='border-bottom:1px solid #333;'><td style='color:#aaa; padding:8px;'>{metric['name']}</td>"
        for name in PROMPTS:
            passed = prompt_metrics_agg[name][metric_id] == num_tests
            icon = "✅" if passed else "❌"
            html += f"<td style='text-align:center; padding:8px;'>{icon}</td>"
        html += "</tr>"
    
    html += "</table></div>"
    
    # Display output metrics
    html += """<div style='background:#1a1a2e; padding:20px; border-radius:8px; margin:20px 0;'>
    <h3 style='color:#eee; margin:0 0 15px 0;'>📊 Output Quality Metrics (avg across tests)</h3>
    <table style='width:100%; border-collapse:collapse; font-size:13px;'>
        <tr style='background:#0f3460;'>
            <th style='color:#eee; padding:8px; text-align:left;'>Metric</th>"""
    
    for name in PROMPTS:
        short_name = name.split("(")[0].strip()
        html += f"<th style='color:#eee; padding:8px; text-align:center;'>{short_name}</th>"
    html += "</tr>"
    
    for metric_id, metric in OUTPUT_METRICS.items():
        html += f"<tr style='border-bottom:1px solid #333;'><td style='color:#aaa; padding:8px;'>{metric['name']}</td>"
        for name in PROMPTS:
            count = output_metrics_agg[name][metric_id]
            pct = count / num_tests
            color = "#4ade80" if pct >= 0.8 else "#fbbf24" if pct >= 0.5 else "#f87171"
            html += f"<td style='text-align:center; padding:8px; color:{color};'>{count}/{num_tests}</td>"
        html += "</tr>"
    
    html += "</table></div>"
    display(HTML(html))

display_detailed_metrics(all_results)

| Metric | Prompt A | Prompt B |
|--------|----------|----------|
| Has Identity/Role | ✅ | ✅ |
| Has Explicit Rules | ✅ | ✅ |
| Has Constraints | ✅ | ✅ |
| Has Examples | ✅ | ✅ |
| Specifies Word/Sentence Limit | ✅ | ✅ |
| Specifies Formatting | ✅ | ✅ |
| Output-Only Instruction | ✅ | ✅ |

| Metric | Prompt A | Prompt B |
|--------|----------|----------|
| Short Sentences | 2/3 | 2/3 |
| Uses Bullet Points | 3/3 | 2/3 |
| Clear Paragraphs | 3/3 | 3/3 |
| No Intro/Outro Text | 3/3 | 3/3 |
| No XML/HTML Tags | 3/3 | 3/3 |


---
# 7. A/B Test: Compare Two Prompts

Direct head-to-head comparison of two specific prompts.

In [11]:
# Show available prompts
print("Available prompts:")
for i, name in enumerate(PROMPTS.keys()):
    print(f"  {i+1}. {name}")

# Select two prompts to compare (use actual prompt names from PROMPTS dict)
prompt_names = list(PROMPTS.keys())
PROMPT_A = prompt_names[0] if len(prompt_names) > 0 else None
PROMPT_B = prompt_names[1] if len(prompt_names) > 1 else None

if PROMPT_A and PROMPT_B:
    # Count wins
    a_wins = 0
    b_wins = 0
    ties = 0
    
    print(f"\n⚔️ A/B Test: {PROMPT_A} vs {PROMPT_B}\n")
    
    for test_name in TEST_TEXTS:
        score_a = all_results[test_name][PROMPT_A]["output_eval"]["_score"]
        score_b = all_results[test_name][PROMPT_B]["output_eval"]["_score"]
        
        if score_a > score_b:
            a_wins += 1
            winner = f"🅰️ {PROMPT_A}"
        elif score_b > score_a:
            b_wins += 1
            winner = f"🅱️ {PROMPT_B}"
        else:
            ties += 1
            winner = "🤝 Tie"
        
        print(f"  {test_name}: {winner} ({score_a:.0%} vs {score_b:.0%})")
    
    print(f"\n📊 Results: {PROMPT_A} wins {a_wins}, {PROMPT_B} wins {b_wins}, Ties: {ties}")
    print(f"🏆 Overall Winner: {PROMPT_A if a_wins > b_wins else PROMPT_B if b_wins > a_wins else 'Tie'}")
else:
    print("⚠️ Need at least 2 prompts for A/B testing")

Available prompts:
  1. Prompt A (Old)
  2. Prompt B (Structured)

⚔️ A/B Test: Prompt A (Old) vs Prompt B (Structured)

  Legal: 🅰️ Prompt A (Old) (100% vs 50%)
  Medical: 🅰️ Prompt A (Old) (100% vs 83%)
  Bureaucratic: 🤝 Tie (100% vs 100%)

📊 Results: Prompt A (Old) wins 2, Prompt B (Structured) wins 0, Ties: 1
🏆 Overall Winner: Prompt A (Old)


---
# Notes for Future Use

## Prompt Quality Metrics (what makes a good prompt)

| Metric | Description | Weight |
|--------|-------------|--------|
| Has Identity | Defines role (e.g., "You are an expert...") | 1x |
| Has Rules | Contains explicit numbered/bulleted rules | 2x |
| Has Constraints | Specifies what NOT to do | 1x |
| Has Examples | Includes input/output examples | 2x |
| Word/Sentence Limit | Specifies max length | 1x |
| Formatting Instructions | Mentions bullets, paragraphs | 1x |
| Output-Only Instruction | Says to output only the result | 1x |

## Output Quality Metrics (what makes good output)

| Metric | Description | Weight |
|--------|-------------|--------|
| Short Sentences | Max 10 words per sentence | 2x |
| Uses Bullets | Has bullet points or numbered lists | 1x |
| Clear Paragraphs | Has blank lines between sections | 1x |
| No Intro Text | Starts directly without "Here is..." | 1x |
| No XML Tags | No markup in output | 1x |
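
The weights in the table above determine how much each failed check costs. As a rough worked example, the sketch below recomputes the output score as earned weight divided by total weight (6 for the five output metrics). The weights come from the table; `weighted_score` and `OUTPUT_WEIGHTS` are hypothetical names for this illustration and do not appear in the notebook.

```python
OUTPUT_WEIGHTS = {
    "short_sentences": 2,
    "uses_bullets": 1,
    "has_paragraphs": 1,
    "no_intro_text": 1,
    "no_xml_tags": 1,
}


def weighted_score(passed: dict[str, bool], weights: dict[str, int]) -> float:
    """Share of the total weight earned by the metrics that passed."""
    total = sum(weights.values())
    earned = sum(w for name, w in weights.items() if passed.get(name, False))
    return earned / total if total else 0.0


all_pass = {name: True for name in OUTPUT_WEIGHTS}
no_bullets = {**all_pass, "uses_bullets": False}
long_sentences = {**all_pass, "short_sentences": False}

for label, result in [("no bullets", no_bullets), ("long sentences", long_sentences)]:
    print(f"{label}: {weighted_score(result, OUTPUT_WEIGHTS):.0%}")
```

Missing a 1x metric leaves 5/6 of the weight (about 83%), while missing the 2x Short Sentences metric leaves 4/6 (about 67%), so a single long sentence costs twice as much as a missing bullet list.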

## Adding New Prompts

1. Add your prompt to the `PROMPTS` dictionary in Section 2
2. Re-run all cells
3. Check the scorecard to see how it compares
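
Before re-running all cells, a new prompt can be screened against the structural checks without calling the model. The sketch below copies two of the notebook's regex checks so that it runs on its own; the `CHECKS` dictionary, the `candidate` entry, and the prompt name "Prompt C (Minimal)" are hypothetical and serve only as an illustration.

```python
import re

# Two of the notebook's structural checks, copied here so the sketch runs alone.
CHECKS = {
    "Has Identity/Role": lambda p: bool(
        re.search(r"(you are|act as|identity|role)", p.lower())
    ),
    "Specifies Word/Sentence Limit": lambda p: bool(
        re.search(r"(\d+\s*words?|\d+\s*sentences?|maximum|max)", p.lower())
    ),
}

# Hypothetical candidate prompt, for illustration only.
candidate = {
    "Prompt C (Minimal)": "Rewrite the text in simple language. Use short sentences."
}

# For each candidate, list the checks it does not satisfy.
missing = {
    name: [label for label, check in CHECKS.items() if not check(prompt)]
    for name, prompt in candidate.items()
}
print(missing)
```

In the notebook itself, the same entry would be added with `PROMPTS.update(candidate)` (or written directly into the dictionary literal) before re-running the evaluation cells.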

## Tips for Better Prompts

- **Be specific**: "Max 10 words" beats "short sentences"
- **Add examples**: Few-shot prompts often outperform zero-shot
- **State constraints**: Tell the model what NOT to do
- **Structure clearly**: Use sections like Identity, Rules, Examples