Production-grade framework for automated LLM quality evaluation, A/B testing, and benchmarking with LLM-as-a-judge patterns.
A comprehensive system for evaluating and comparing Large Language Models:
- ✅ Automated benchmarking across multiple tasks
- ✅ A/B testing with statistical significance
- ✅ LLM-as-a-judge for quality assessment
- ✅ Comprehensive metrics tracking
- ✅ Production-ready evaluation pipelines
Pre-built benchmark suites for:
- Question answering
- Summarization
- Code generation
- Logical reasoning
- Classification
Compare prompt variants with:
- Statistical significance testing
- Confidence intervals
- Pairwise comparisons
- Performance metrics
Quality assessment using:
- Multiple evaluation criteria
- Configurable judges
- Rule-based or LLM-based scoring
- Detailed reasoning
Monitor:
- Quality scores
- Latency
- Token usage
- Cost estimates
- Experiment comparison
```
Test Cases → Evaluation Runner → LLM → Judges → Metrics
                                  ↓
                              Responses
                                  ↓
                   Quality Scores + Metadata
```
```bash
# Clone repository
git clone https://github.com/KazKozDev/llm-evaluation-framework.git
cd llm-evaluation-framework

# Install dependencies
pip install -r requirements.txt

# Or install in development mode
pip install -e .
```

Requirements:
- Python 3.8+
- numpy, pandas, scipy
```python
from evaluator.core import EvaluationRunner, TestCase
from judges.llm_judge import LLMJudge, JudgmentCriteria
from metrics.tracker import MetricsTracker

# Define your LLM function
def my_llm(prompt: str) -> str:
    # Call your LLM API here
    return "Response from LLM"

# Create test cases
test_cases = [
    TestCase(
        task_id="qa_1",
        prompt="What is the capital of France?",
        reference="Paris",
    ),
]

# Setup judges
judges = [
    LLMJudge(
        criteria=[
            JudgmentCriteria.RELEVANCE,
            JudgmentCriteria.CORRECTNESS,
        ],
    )
]

# Create evaluation runner
runner = EvaluationRunner(
    model_fn=my_llm,
    judges=judges,
    metrics_tracker=MetricsTracker(),
)

# Run evaluation
results = runner.evaluate(test_cases)

# Get summary
summary = runner.get_summary()
print(f"Overall Score: {summary['overall_score']:.2f}")
```
```python
from evaluator.ab_testing import ABTest, PromptVariant
from judges.llm_judge import LLMJudge

# Define prompt variants
variant_a = PromptVariant(
    name="basic",
    prompt_template="Answer: {question}",
)
variant_b = PromptVariant(
    name="detailed",
    prompt_template="Provide a detailed answer to: {question}",
)

# Create A/B test
ab_test = ABTest(
    model_fn=my_llm,
    judges=[LLMJudge()],
    confidence_level=0.95,
)

# Run test
result = ab_test.run_test(
    variant_a=variant_a,
    variant_b=variant_b,
    test_inputs=[
        {"question": "What is ML?"},
        {"question": "Explain AI"},
    ],
)

# Check results
if result.is_significant:
    print(f"Winner: {result.winner}")
    print(f"Improvement: {result.improvement:.1f}%")
```
```python
from benchmarks.suite import BenchmarkSuite
from evaluator.core import EvaluationRunner, TestCase

# Load pre-built benchmarks
suite = BenchmarkSuite()
suite.load_default_benchmarks()

# Convert to test cases
test_cases = [
    TestCase(
        task_id=task.task_id,
        prompt=task.prompt,
        reference=task.reference,
    )
    for task in suite.tasks
]

# Evaluate
runner = EvaluationRunner(model_fn=my_llm)
results = runner.evaluate(test_cases)
```

```bash
python examples/basic_evaluation.py
```

Demonstrates:
- Creating test cases
- Running evaluations
- Using LLM judges
- Tracking metrics
- Getting summaries
```bash
python examples/ab_testing_demo.py
```

Demonstrates:
- Comparing prompt variants
- Statistical significance testing
- Finding best prompts
- Multiple variant comparison
```bash
python examples/benchmark_comparison.py
```

Demonstrates:
- Running standard benchmarks
- Comparing models
- Performance by task type
- Making deployment decisions
```bash
python examples/multi_provider_comparison.py
```

Demonstrates:
- Testing OpenAI, Anthropic, Hugging Face, and Ollama
- Comparing quality and latency across providers
- Using local vs cloud models
- Cost-effective evaluation
Main engine for running evaluations.
```python
runner = EvaluationRunner(
    model_fn=your_llm_function,
    judges=[judge1, judge2],
    metrics_tracker=tracker,
)

results = runner.evaluate(test_cases)
summary = runner.get_summary()
```

Quality assessment using the LLM-as-a-judge pattern.
```python
judge = LLMJudge(
    judge_fn=judge_llm_function,  # Optional
    criteria=[
        JudgmentCriteria.RELEVANCE,
        JudgmentCriteria.CORRECTNESS,
        JudgmentCriteria.COHERENCE,
    ],
    use_reference=True,
)

judgment = judge.judge(prompt, response, reference)
```

Statistical A/B testing for prompts.
```python
ab_test = ABTest(
    model_fn=your_llm,
    judges=[judge],
    confidence_level=0.95,
)

result = ab_test.run_test(variant_a, variant_b, test_inputs)
```

Track and aggregate evaluation metrics.
```python
tracker = MetricsTracker(experiment_name="exp_1")
tracker.record(result)

summary = tracker.get_summary()
tracker.save("results.json")
```

Pre-built evaluation benchmarks.
```python
suite = BenchmarkSuite(name="standard")
suite.load_default_benchmarks()

# Get tasks by type
qa_tasks = suite.get_tasks(TaskType.QUESTION_ANSWERING)
```

Standard criteria supported by LLMJudge:
| Criterion | Description |
|---|---|
| RELEVANCE | Response relevance to prompt |
| CORRECTNESS | Factual accuracy |
| COHERENCE | Logical flow and structure |
| HELPFULNESS | Utility of response |
| CLARITY | Ease of understanding |
| COMPLETENESS | Coverage of topic |
| CONCISENESS | Brevity without losing meaning |
| SAFETY | Absence of harmful content |
- Quality Scores: Per-criterion and overall scores
- Latency: Response time in milliseconds
- Token Usage: Input/output tokens (if available)
- Cost: Estimated API costs (if configured)
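To see what was recorded after a run, you can print the aggregated summary. This is a minimal sketch that uses only the `MetricsTracker.get_summary()` call shown above; it assumes the summary is a dict-like mapping, and the exact key names depend on your judges and configuration:

```python
from metrics.tracker import MetricsTracker

tracker = MetricsTracker(experiment_name="exp_1")
# ... run evaluations with EvaluationRunner(metrics_tracker=tracker) ...

# Print whatever the tracker aggregated for this experiment
# (key names vary with the configured judges and criteria)
summary = tracker.get_summary()
for metric, value in summary.items():
    print(f"{metric}: {value}")
```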
```
llm-evaluation-framework/
├── evaluator/
│   ├── core.py            # Main evaluation engine
│   └── ab_testing.py      # A/B testing framework
├── judges/
│   └── llm_judge.py       # LLM-as-a-judge implementation
├── benchmarks/
│   └── suite.py           # Benchmark tasks
├── metrics/
│   └── tracker.py         # Metrics tracking
├── examples/
│   ├── basic_evaluation.py
│   ├── ab_testing_demo.py
│   ├── benchmark_comparison.py
│   └── multi_provider_comparison.py
├── requirements.txt
├── pyproject.toml
└── README.md
```
- Model selection: Compare different LLMs on your specific tasks to choose the best one.
- Prompt optimization: A/B test prompt variants to optimize performance.
- Production monitoring: Track model quality over time in production.
- Regression testing: Ensure new model versions don't degrade quality (see the sketch below).
- Research: Systematic evaluation for model research.
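For example, a lightweight regression gate before rolling out a new model version might look like the following sketch. It uses only APIs shown earlier in this README; `baseline_llm`, `candidate_llm`, `judges`, `test_cases`, and the 5% threshold are placeholders you would replace with your own:

```python
from evaluator.core import EvaluationRunner

# Evaluate the current production model and the candidate on the same test cases
baseline_runner = EvaluationRunner(model_fn=baseline_llm, judges=judges)
candidate_runner = EvaluationRunner(model_fn=candidate_llm, judges=judges)

baseline_runner.evaluate(test_cases)
candidate_runner.evaluate(test_cases)

baseline_score = baseline_runner.get_summary()["overall_score"]
candidate_score = candidate_runner.get_summary()["overall_score"]

# Block deployment if overall quality drops by more than 5% (threshold is arbitrary)
if candidate_score < 0.95 * baseline_score:
    raise SystemExit("Quality regression detected; do not deploy.")
```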
```python
def my_custom_judge_fn(prompt: str) -> str:
    # Your custom judging logic
    # Return formatted judgment
    return "Score: 8\nReasoning: Good response"

judge = LLMJudge(judge_fn=my_custom_judge_fn)
```
```python
from benchmarks.suite import BenchmarkTask, TaskType

task = BenchmarkTask(
    task_id="custom_1",
    task_type=TaskType.QUESTION_ANSWERING,
    prompt="Your custom prompt",
    reference="Expected answer",
)

suite.add_task(task)
```
```python
tracker_v1 = MetricsTracker("experiment_v1")
tracker_v2 = MetricsTracker("experiment_v2")

# Run experiments...

# Compare
comparison = tracker_v1.compare_experiments(
    tracker_v2,
    metric_name="quality_correctness",
)
```

- Provide reference answers for better correctness evaluation.
- Use sufficient test cases for statistical significance (recommended: 30+).
- Choose judges that match your evaluation needs.
- Use MetricsTracker to maintain evaluation history.
- Use A/B testing with proper confidence levels for comparisons.
The framework includes built-in integrations for multiple LLM providers. Use the factory function for easy setup:
```python
from evaluator.integrations import create_llm_function

# Create LLM function
llm = create_llm_function(
    "openai",
    api_key="your-api-key",
    model="gpt-5"  # or "gpt-5-mini", "gpt-5-nano"
)

# Use in evaluation
runner = EvaluationRunner(model_fn=llm)
```
"anthropic",
api_key="your-api-key",
model="claude-sonnet-4-5-20250929" # or "claude-opus-4-1-20250805", "claude-haiku-4-5-20251001"
)# No API key needed! Runs locally
```python
# No API key needed! Runs locally
llm = create_llm_function(
    "huggingface",
    model_name="meta-llama/Llama-3.1-8B-Instruct",  # or Qwen2.5, Mistral, Gemma2
    device="cpu"  # or "cuda" for GPU
)
```

Popular models:
- Llama 3.1: `meta-llama/Llama-3.1-8B-Instruct`, `meta-llama/Llama-3.1-70B-Instruct`
- Qwen: `Qwen/Qwen2.5-7B-Instruct`, `Qwen/Qwen3-7B-Instruct`
- Mistral: `mistralai/Mistral-7B-Instruct-v0.3`
- Gemma: `google/gemma-2-9b-it`
- DeepSeek: `deepseek-ai/DeepSeek-Coder-V2-Instruct`
Install: `pip install transformers torch`
```python
# Run models locally with Ollama
llm = create_llm_function(
    "ollama",
    model="llama3.1:8b",  # or "qwen2.5-coder:7b", "deepseek-r1:7b"
    base_url="http://localhost:11434"
)
```

Popular models:
- Llama 3.1: `llama3.1:8b`, `llama3.1:70b`
- Qwen: `qwen2.5-coder:7b`, `qwen3-vl:7b`
- Mistral: `mistral-small3.1:24b`
- DeepSeek: `deepseek-r1:7b`
- Phi: `phi3:3.8b`
- Gemma: `gemma2:9b`, `gemma3:9b`
Setup:
- Install Ollama: https://ollama.ai
- Pull model: `ollama pull llama3.1:8b`
- Run server: `ollama serve`
Install: `pip install requests`
```python
# Evaluate different providers
providers = {
    "GPT-5": create_llm_function("openai", api_key=key, model="gpt-5"),
    "Claude-4.5": create_llm_function("anthropic", api_key=key, model="claude-sonnet-4-5-20250929"),
    "Llama-3.1": create_llm_function("ollama", model="llama3.1:8b"),
    "Qwen-2.5": create_llm_function("huggingface", model_name="Qwen/Qwen2.5-7B-Instruct"),
}

for name, llm in providers.items():
    runner = EvaluationRunner(model_fn=llm)
    results = runner.evaluate(test_cases)
    print(f"{name}: {runner.get_summary()['overall_score']:.2f}")
```

- Parallel Evaluation: Process test cases in batches
- Caching: Cache LLM responses for repeated evaluations
- Sampling: Use stratified sampling for large test sets
- Async: Use async LLM calls for better throughput
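As one illustration of the caching tip, any `model_fn` can be wrapped with an in-memory cache before it is passed to `EvaluationRunner`. This is a minimal sketch using only the Python standard library; it assumes your model function is a plain `str -> str` callable like `my_llm` from the Quick Start:

```python
from functools import lru_cache

def with_cache(model_fn, maxsize=1024):
    """Wrap a str -> str model function with an in-memory response cache."""
    @lru_cache(maxsize=maxsize)
    def cached(prompt: str) -> str:
        return model_fn(prompt)
    return cached

# Repeated evaluations over the same test cases reuse cached responses
cached_llm = with_cache(my_llm)
runner = EvaluationRunner(model_fn=cached_llm)
```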
- "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"
- "G-Eval: NLG Evaluation using LLMs with Better Human Alignment"
- "Self-Consistency Improves Chain of Thought Reasoning in Language Models"
MIT License - See LICENSE file
If you use this framework in research or production:
```bibtex
@software{llm_evaluation_framework_2025,
  title = {LLM Evaluation Framework: Automated Quality Assessment and A/B Testing},
  author = {Artem Kazakov Kozlov},
  year = {2025},
  url = {https://github.com/KazKozDev/llm-evaluation-framework}
}
```

Building better LLMs through systematic evaluation 🎯