# Prompt Evaluation and Testing Suite

This notebook shows how to automate prompt evaluation for LLM outputs: it A/B tests two prompt variants on the same test cases, checks each response against an expected answer, and reports accuracy and average response time per variant.

## Key Features

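- **A/B Testing:** Run the same test cases through different prompt variants and compare the responses.
- **Accuracy Checks:** Check whether each model output contains the expected answer (case-insensitive substring match).
- **Performance Analytics:** Report accuracy and average response time for each prompt variant.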

In [21]:
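import os

import pandas as pd
from dotenv import load_dotenv, find_dotenv
from openai import AzureOpenAI

# Load credentials from the nearest .env file into the process environment.
load_dotenv(find_dotenv())

# Azure OpenAI client; all connection settings come from environment variables.
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)
# Deployment name, passed as `model` when requesting completions.
AZURE_OPENAI_DEPLOYMENT_NAME = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME")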
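
Because `os.getenv` returns `None` for an unset variable, a missing setting only surfaces later, when the client is built or first used. A minimal sketch of an up-front check follows; the helper name `missing_env_vars` is hypothetical (it is not part of the notebook or of any library), and the variable names are the ones read in the cell above.

```python
import os

# Variable names taken from the setup cell above.
REQUIRED_VARS = (
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
)


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in `env` (default: os.environ).

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All Azure OpenAI settings are present.")
```

Running this before creating the client should make a misconfigured `.env` file easier to spot than a failed API call would.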

In [22]:
# === Test Data ===
TEST_CASES = [
    {
        "id": 1,
        "input": "Translate to French: 'Good morning'",
        "expected": "Bonjour"
    },
    {
        "id": 2,
        "input": "Translate to French: 'Good night'",
        "expected": "Bonne nuit"
    }
]

In [23]:
# === Prompt Variants ===
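PROMPTS = {
    # The parameter is named `text` rather than `input` to avoid shadowing the built-in input().
    "zero_shot": lambda text: text,
    # Note: "one-word" is looser than it reads; the expected answer "Bonne nuit" has two words.
    "instructional": lambda text: f"Please respond with a one-word French translation only.\n{text}"
}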
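
Each prompt variant is just a function from the test input to the final prompt string, so the prompts can be previewed without calling the model. The sketch below assumes that shape; `render_prompts`, `demo_prompts`, and `demo_cases` are hypothetical names, and the demo values are copies of the notebook's definitions so the block runs on its own.

```python
def render_prompts(prompts, test_cases):
    """Build every (prompt style, test case) combination without any API call.

    Hypothetical helper for illustration only.
    """
    rows = []
    for name, build in prompts.items():
        for case in test_cases:
            rows.append({
                "prompt_style": name,
                "test_id": case["id"],
                "prompt": build(case["input"]),
            })
    return rows


# Copies of the notebook's prompt variants and test cases.
demo_prompts = {
    "zero_shot": lambda text: text,
    "instructional": lambda text: f"Please respond with a one-word French translation only.\n{text}",
}
demo_cases = [
    {"id": 1, "input": "Translate to French: 'Good morning'", "expected": "Bonjour"},
    {"id": 2, "input": "Translate to French: 'Good night'", "expected": "Bonne nuit"},
]

for row in render_prompts(demo_prompts, demo_cases):
    print(f"[{row['prompt_style']} / test {row['test_id']}]\n{row['prompt']}\n")
```

A dry run like this is one cheap way to catch template mistakes before spending API calls on them.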


In [25]:
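import time
from typing import Callable, Dict, List


# === Evaluation Function ===
def evaluate_prompt_style(prompt_name: str, prompt_func: Callable[[str], str], test_cases: List[Dict]) -> None:
    """Run every test case through one prompt style and print pass/fail, accuracy, and average latency."""
    print(f"\nEvaluating Prompt Style: {prompt_name}")
    if not test_cases:
        # Avoid dividing by zero in the summary below.
        print("No test cases to evaluate.")
        return

    success_count = 0
    total_time = 0.0

    for test in test_cases:
        final_prompt = prompt_func(test["input"])
        print(f"\nPrompt: {final_prompt}")

        # perf_counter() is monotonic, so it suits interval timing better than time.time().
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=AZURE_OPENAI_DEPLOYMENT_NAME,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": final_prompt}
            ]
        )
        total_time += time.perf_counter() - start

        # message.content is typed as optional, so fall back to an empty reply if it is None.
        model_reply = (response.choices[0].message.content or "").strip().lower()
        expected = test["expected"].lower()
        # Lenient check: case-insensitive substring match, so extra wording around the answer still passes.
        passed = expected in model_reply
        if passed:
            success_count += 1

        print(f"Expected: {expected} | Got: {model_reply} | {'✅ Pass' if passed else '❌ Fail'}")

    total = len(test_cases)
    print(f"\n{prompt_name} Accuracy: {success_count}/{total} ({(success_count/total)*100:.1f}%)")
    print(f"Avg Response Time: {total_time/total:.2f} seconds")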

# === Run Evaluation for All Prompt Styles ===
for name, func in PROMPTS.items():
    evaluate_prompt_style(name, func, TEST_CASES)



Evaluating Prompt Style: zero_shot

Prompt: Translate to French: 'Good morning'
Expected: bonjour | Got: 'good morning' in french is translated as 'bonjour'. | ✅ Pass

Prompt: Translate to French: 'Good night'
Expected: bonne nuit | Got: the translation of "good night" in french is "bonne nuit." | ✅ Pass

zero_shot Accuracy: 2/2 (100.0%)
Avg Response Time: 1.38 seconds

Evaluating Prompt Style: instructional

Prompt: Please respond with a one-word French translation only.
Translate to French: 'Good morning'
Expected: bonjour | Got: bonjour | ✅ Pass

Prompt: Please respond with a one-word French translation only.
Translate to French: 'Good night'
Expected: bonne nuit | Got: bonne nuit | ✅ Pass

instructional Accuracy: 2/2 (100.0%)
Avg Response Time: 0.69 seconds
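
Both prompt styles score 2/2 because the check only asks whether the expected answer appears somewhere in the reply. A stricter metric can separate them. The sketch below compares the notebook's substring check with a normalized exact match on the replies recorded above; exact match is one possible stricter metric, not something the notebook itself implements, and the helper names (`normalize`, `contains_match`, `exact_match`) are hypothetical.

```python
import string


def normalize(text):
    """Lowercase, trim whitespace, and strip punctuation/quotes from both ends."""
    return text.strip().lower().strip(string.punctuation + " ")


def contains_match(expected, reply):
    """Lenient: the expected answer appears anywhere in the reply."""
    return normalize(expected) in normalize(reply)


def exact_match(expected, reply):
    """Strict: the reply is the expected answer and nothing else."""
    return normalize(expected) == normalize(reply)


# (prompt style, expected, reply) copied from the recorded output above.
recorded = [
    ("zero_shot", "bonjour", "'good morning' in french is translated as 'bonjour'."),
    ("zero_shot", "bonne nuit", 'the translation of "good night" in french is "bonne nuit."'),
    ("instructional", "bonjour", "bonjour"),
    ("instructional", "bonne nuit", "bonne nuit"),
]

scores = {}
for style, expected, reply in recorded:
    entry = scores.setdefault(style, {"contains": 0, "exact": 0, "total": 0})
    entry["contains"] += contains_match(expected, reply)
    entry["exact"] += exact_match(expected, reply)
    entry["total"] += 1

for style, entry in scores.items():
    print(f"{style}: contains {entry['contains']}/{entry['total']}, exact {entry['exact']}/{entry['total']}")
```

On these four replies the substring check cannot tell the styles apart, while exact match passes only the instructional replies. With two test cases this is illustrative at best; a larger test set would be needed before drawing conclusions about either prompt.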
