LLM Evaluation Framework


Production-grade framework for automated LLM quality evaluation, A/B testing, and benchmarking with LLM-as-a-judge patterns.

A comprehensive system for evaluating and comparing Large Language Models:

  • ✅ Automated benchmarking across multiple tasks
  • ✅ A/B testing with statistical significance
  • ✅ LLM-as-a-judge for quality assessment
  • ✅ Comprehensive metrics tracking
  • ✅ Production-ready evaluation pipelines

Key Features

1. Automated Benchmarks

Pre-built benchmark suites for:

  • Question answering
  • Summarization
  • Code generation
  • Logical reasoning
  • Classification

2. A/B Testing Framework

Compare prompt variants with:

  • Statistical significance testing
  • Confidence intervals
  • Pairwise comparisons
  • Performance metrics

3. LLM-as-a-Judge

Quality assessment using:

  • Multiple evaluation criteria
  • Configurable judges
  • Rule-based or LLM-based scoring
  • Detailed reasoning

4. Metrics Tracking

Monitor:

  • Quality scores
  • Latency
  • Token usage
  • Cost estimates
  • Experiment comparison

Architecture

Test Cases → Evaluation Runner → LLM → Judges → Metrics
                                    ↓
                              Responses
                                    ↓
                          Quality Scores + Metadata

Installation

# Clone repository
git clone https://github.com/KazKozDev/llm-evaluation-framework.git
cd llm-evaluation-framework

# Install dependencies
pip install -r requirements.txt

# Or install in development mode
pip install -e .

Requirements:

  • Python 3.8+
  • numpy, pandas, scipy

Quick Start

1. Basic Evaluation

from evaluator.core import EvaluationRunner, TestCase
from judges.llm_judge import LLMJudge, JudgmentCriteria
from metrics.tracker import MetricsTracker

# Define your LLM function
def my_llm(prompt: str) -> str:
    # Call your LLM API here
    return "Response from LLM"

# Create test cases
test_cases = [
    TestCase(
        task_id="qa_1",
        prompt="What is the capital of France?",
        reference="Paris",
    ),
]

# Setup judges
judges = [
    LLMJudge(
        criteria=[
            JudgmentCriteria.RELEVANCE,
            JudgmentCriteria.CORRECTNESS,
        ],
    )
]

# Create evaluation runner
runner = EvaluationRunner(
    model_fn=my_llm,
    judges=judges,
    metrics_tracker=MetricsTracker(),
)

# Run evaluation
results = runner.evaluate(test_cases)

# Get summary
summary = runner.get_summary()
print(f"Overall Score: {summary['overall_score']:.2f}")

2. A/B Testing Prompts

from evaluator.ab_testing import ABTest, PromptVariant
from judges.llm_judge import LLMJudge

# Define prompt variants
variant_a = PromptVariant(
    name="basic",
    prompt_template="Answer: {question}",
)

variant_b = PromptVariant(
    name="detailed",
    prompt_template="Provide a detailed answer to: {question}",
)

# Create A/B test
ab_test = ABTest(
    model_fn=my_llm,
    judges=[LLMJudge()],
    confidence_level=0.95,
)

# Run test
result = ab_test.run_test(
    variant_a=variant_a,
    variant_b=variant_b,
    test_inputs=[
        {"question": "What is ML?"},
        {"question": "Explain AI"},
    ],
)

# Check results
if result.is_significant:
    print(f"Winner: {result.winner}")
    print(f"Improvement: {result.improvement:.1f}%")

3. Benchmark Suite

from benchmarks.suite import BenchmarkSuite
from evaluator.core import EvaluationRunner, TestCase

# Load pre-built benchmarks
suite = BenchmarkSuite()
suite.load_default_benchmarks()

# Convert to test cases
test_cases = [
    TestCase(
        task_id=task.task_id,
        prompt=task.prompt,
        reference=task.reference,
    )
    for task in suite.tasks
]

# Evaluate
runner = EvaluationRunner(model_fn=my_llm)
results = runner.evaluate(test_cases)

Running Examples

Basic Evaluation

python examples/basic_evaluation.py

Demonstrates:

  • Creating test cases
  • Running evaluations
  • Using LLM judges
  • Tracking metrics
  • Getting summaries

A/B Testing

python examples/ab_testing_demo.py

Demonstrates:

  • Comparing prompt variants
  • Statistical significance testing
  • Finding best prompts
  • Multiple variant comparison

Benchmark Comparison

python examples/benchmark_comparison.py

Demonstrates:

  • Running standard benchmarks
  • Comparing models
  • Performance by task type
  • Making deployment decisions

Multi-Provider Comparison

python examples/multi_provider_comparison.py

Demonstrates:

  • Testing OpenAI, Anthropic, Hugging Face, and Ollama
  • Comparing quality and latency across providers
  • Using local vs cloud models
  • Cost-effective evaluation

Core Components

EvaluationRunner

Main engine for running evaluations.

runner = EvaluationRunner(
    model_fn=your_llm_function,
    judges=[judge1, judge2],
    metrics_tracker=tracker,
)

results = runner.evaluate(test_cases)
summary = runner.get_summary()

LLMJudge

Quality assessment using LLM-as-a-judge pattern.

judge = LLMJudge(
    judge_fn=judge_llm_function,  # Optional
    criteria=[
        JudgmentCriteria.RELEVANCE,
        JudgmentCriteria.CORRECTNESS,
        JudgmentCriteria.COHERENCE,
    ],
    use_reference=True,
)

judgment = judge.judge(prompt, response, reference)

ABTest

Statistical A/B testing for prompts.

ab_test = ABTest(
    model_fn=your_llm,
    judges=[judge],
    confidence_level=0.95,
)

result = ab_test.run_test(variant_a, variant_b, test_inputs)

MetricsTracker

Track and aggregate evaluation metrics.

tracker = MetricsTracker(experiment_name="exp_1")
tracker.record(result)
summary = tracker.get_summary()
tracker.save("results.json")

BenchmarkSuite

Pre-built evaluation benchmarks.

suite = BenchmarkSuite(name="standard")
suite.load_default_benchmarks()

# Get tasks by type
qa_tasks = suite.get_tasks(TaskType.QUESTION_ANSWERING)

Evaluation Criteria

Standard criteria supported by LLMJudge:

Criterion     Description
RELEVANCE     Response relevance to prompt
CORRECTNESS   Factual accuracy
COHERENCE     Logical flow and structure
HELPFULNESS   Utility of response
CLARITY       Ease of understanding
COMPLETENESS  Coverage of topic
CONCISENESS   Brevity without losing meaning
SAFETY        Absence of harmful content
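
For example, a judge can be configured with any subset of these criteria. The sketch below is illustrative and assumes the remaining enum members are named exactly as listed above (only RELEVANCE, CORRECTNESS, and COHERENCE appear in the other examples in this README):

from judges.llm_judge import LLMJudge, JudgmentCriteria

# Member names beyond RELEVANCE/CORRECTNESS/COHERENCE are assumed to match the table above.
judge = LLMJudge(
    criteria=[
        JudgmentCriteria.CORRECTNESS,
        JudgmentCriteria.HELPFULNESS,
        JudgmentCriteria.CLARITY,
        JudgmentCriteria.CONCISENESS,
    ],
    use_reference=True,
)

# Same positional signature as in the Core Components section: (prompt, response, reference)
judgment = judge.judge(
    "What is the capital of France?",
    "Paris is the capital of France.",
    "Paris",
)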

Metrics Tracked

  • Quality Scores: Per-criterion and overall scores
  • Latency: Response time in milliseconds
  • Token Usage: Input/output tokens (if available)
  • Cost: Estimated API costs (if configured)

Project Structure

llm-evaluation-framework/
├── evaluator/
│   ├── core.py              # Main evaluation engine
│   └── ab_testing.py        # A/B testing framework
├── judges/
│   └── llm_judge.py         # LLM-as-a-judge implementation
├── benchmarks/
│   └── suite.py             # Benchmark tasks
├── metrics/
│   └── tracker.py           # Metrics tracking
├── examples/
│   ├── basic_evaluation.py
│   ├── ab_testing_demo.py
│   ├── benchmark_comparison.py
│   └── multi_provider_comparison.py
├── requirements.txt
├── pyproject.toml
└── README.md

Use Cases

1. Model Selection

Compare different LLMs on your specific tasks to choose the best one.

2. Prompt Engineering

A/B test prompt variants to optimize performance.

3. Quality Monitoring

Track model quality over time in production.

4. Regression Testing

Ensure new model versions don't degrade quality (see the sketch after this list).

5. Research & Development

Systematic evaluation for model research.
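
As a concrete regression-testing workflow (use case 4), here is a minimal sketch; my_llm and my_llm_v2 stand for your current and candidate model functions, and the 2% threshold is purely illustrative:

from benchmarks.suite import BenchmarkSuite
from evaluator.core import EvaluationRunner, TestCase

# Build a fixed test set from the default benchmarks so both versions
# are scored on identical inputs.
suite = BenchmarkSuite()
suite.load_default_benchmarks()
test_cases = [
    TestCase(task_id=t.task_id, prompt=t.prompt, reference=t.reference)
    for t in suite.tasks
]

scores = {}
for name, model_fn in {"current": my_llm, "candidate": my_llm_v2}.items():
    runner = EvaluationRunner(model_fn=model_fn)
    runner.evaluate(test_cases)
    scores[name] = runner.get_summary()["overall_score"]

# Flag a regression if the candidate drops more than 2% below the current model.
if scores["candidate"] < scores["current"] * 0.98:
    raise SystemExit(f"Regression detected: {scores}")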

Advanced Usage

Custom Judges

def my_custom_judge_fn(prompt: str) -> str:
    # Your custom judging logic
    # Return formatted judgment
    return "Score: 8\nReasoning: Good response"

judge = LLMJudge(judge_fn=my_custom_judge_fn)

Custom Benchmarks

from benchmarks.suite import BenchmarkTask, TaskType

task = BenchmarkTask(
    task_id="custom_1",
    task_type=TaskType.QUESTION_ANSWERING,
    prompt="Your custom prompt",
    reference="Expected answer",
)

suite.add_task(task)

Multiple Experiments

tracker_v1 = MetricsTracker("experiment_v1")
tracker_v2 = MetricsTracker("experiment_v2")

# Run experiments...

# Compare
comparison = tracker_v1.compare_experiments(
    tracker_v2,
    metric_name="quality_correctness"
)

Best Practices

1. Use Reference Answers

Provide reference answers for better correctness evaluation.

2. Multiple Test Cases

Use sufficient test cases for statistical significance (recommended: 30+).

3. Appropriate Judges

Choose judges that match your evaluation needs.

4. Track Everything

Use MetricsTracker to maintain evaluation history.

5. Statistical Rigor

Use A/B testing with proper confidence levels for comparisons (see the sketch below).
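
Combining practices 2 and 5: the A/B example earlier uses two inputs for brevity, but a real comparison needs a larger set. A minimal sketch, with a placeholder question generator standing in for your own inputs:

from evaluator.ab_testing import ABTest, PromptVariant
from judges.llm_judge import LLMJudge

# 30+ inputs give the significance test something to work with (these are placeholders).
questions = [f"Explain concept #{i} in one paragraph." for i in range(40)]
test_inputs = [{"question": q} for q in questions]

ab_test = ABTest(
    model_fn=my_llm,  # your model function, as in Quick Start
    judges=[LLMJudge()],
    confidence_level=0.95,
)

result = ab_test.run_test(
    variant_a=PromptVariant(name="basic", prompt_template="Answer: {question}"),
    variant_b=PromptVariant(name="detailed", prompt_template="Provide a detailed answer to: {question}"),
    test_inputs=test_inputs,
)

if result.is_significant:
    print(f"Winner: {result.winner} ({result.improvement:.1f}% improvement)")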

Integration with LLM Providers

The framework includes built-in integrations for multiple LLM providers. Use the factory function for easy setup:

OpenAI (GPT models)

from evaluator.integrations import create_llm_function

# Create LLM function
llm = create_llm_function(
    "openai",
    api_key="your-api-key",
    model="gpt-5"  # or "gpt-5-mini", "gpt-5-nano"
)

# Use in evaluation
runner = EvaluationRunner(model_fn=llm)

Anthropic (Claude models)

llm = create_llm_function(
    "anthropic",
    api_key="your-api-key",
    model="claude-sonnet-4-5-20250929"  # or "claude-opus-4-1-20250805", "claude-haiku-4-5-20251001"
)

Hugging Face (Open-source models up to 70B)

# No API key needed! Runs locally
llm = create_llm_function(
    "huggingface",
    model_name="meta-llama/Llama-3.1-8B-Instruct",  # or Qwen2.5, Mistral, Gemma2
    device="cpu"  # or "cuda" for GPU
)

Popular models:

  • Llama 3.1: meta-llama/Llama-3.1-8B-Instruct, meta-llama/Llama-3.1-70B-Instruct
  • Qwen: Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen3-7B-Instruct
  • Mistral: mistralai/Mistral-7B-Instruct-v0.3
  • Gemma: google/gemma-2-9b-it
  • DeepSeek: deepseek-ai/DeepSeek-Coder-V2-Instruct

Install: pip install transformers torch

Ollama (Local models up to 70B)

# Run models locally with Ollama
llm = create_llm_function(
    "ollama",
    model="llama3.1:8b",  # or "qwen2.5-coder:7b", "deepseek-r1:7b"
    base_url="http://localhost:11434"
)

Popular models:

  • Llama 3.1: llama3.1:8b, llama3.1:70b
  • Qwen: qwen2.5-coder:7b, qwen3-vl:7b
  • Mistral: mistral-small3.1:24b
  • DeepSeek: deepseek-r1:7b
  • Phi: phi3:3.8b
  • Gemma: gemma2:9b, gemma3:9b

Setup:

  1. Install Ollama: https://ollama.ai
  2. Pull model: ollama pull llama3.1:8b
  3. Run server: ollama serve

Install: pip install requests

Compare Multiple Providers

# Evaluate different providers
providers = {
    "GPT-5": create_llm_function("openai", api_key=key, model="gpt-5"),
    "Claude-4.5": create_llm_function("anthropic", api_key=key, model="claude-sonnet-4-5-20250929"),
    "Llama-3.1": create_llm_function("ollama", model="llama3.1:8b"),
    "Qwen-2.5": create_llm_function("huggingface", model_name="Qwen/Qwen2.5-7B-Instruct"),
}

for name, llm in providers.items():
    runner = EvaluationRunner(model_fn=llm)
    results = runner.evaluate(test_cases)
    print(f"{name}: {runner.get_summary()['overall_score']:.2f}")

Performance Considerations

  • Parallel Evaluation: Process test cases in batches
  • Caching: Cache LLM responses for repeated evaluations (see the sketch after this list)
  • Sampling: Use stratified sampling for large test sets
  • Async: Use async LLM calls for better throughput
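
A minimal sketch of the caching and parallel-evaluation points above; it simply wraps the my_llm function from Quick Start, and the framework may offer its own batching hooks:

from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

from evaluator.core import EvaluationRunner

@lru_cache(maxsize=None)
def cached_llm(prompt: str) -> str:
    # Repeated prompts (e.g. re-running an evaluation) hit the provider only once.
    return my_llm(prompt)  # my_llm: your model function, as in Quick Start

def generate_batch(prompts, max_workers=4):
    # Issue prompts concurrently when the provider tolerates parallel requests.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(cached_llm, prompts))

# The cached wrapper can be passed to the runner unchanged.
runner = EvaluationRunner(model_fn=cached_llm)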

Resources

Papers

  • "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"
  • "G-Eval: NLG Evaluation using LLMs with Better Human Alignment"
  • "Self-Consistency Improves Chain of Thought Reasoning in Language Models"

License

MIT License - See LICENSE file

Citation

If you use this framework in research or production:

@software{llm_evaluation_framework_2025,
  title = {LLM Evaluation Framework: Automated Quality Assessment and A/B Testing},
  author = {Artem Kazakov Kozlov},
  year = {2025},
  url = {https://github.com/KazKozDev/llm-evaluation-framework}
}

Building better LLMs through systematic evaluation 🎯
