Production-grade framework for automated LLM quality evaluation, A/B testing, and benchmarking with LLM-as-a-judge patterns.
A comprehensive system for evaluating and comparing Large Language Models:
- ✅ Automated benchmarking across multiple tasks
- ✅ A/B testing with statistical significance
- ✅ LLM-as-a-judge for quality assessment
- ✅ Comprehensive metrics tracking
- ✅ Production-ready evaluation pipelines
Pre-built benchmark suites for:
- Question answering
- Summarization
- Code generation
- Logical reasoning
- Classification
Compare prompt variants with:
- Statistical significance testing
- Confidence intervals
- Pairwise comparisons
- Performance metrics
Quality assessment using:
- Multiple evaluation criteria
- Configurable judges
- Rule-based or LLM-based scoring
- Detailed reasoning
Monitor:
- Quality scores
- Latency
- Token usage
- Cost estimates
- Experiment comparison
```
Test Cases → Evaluation Runner → LLM → Judges → Metrics
                                  ↓
                              Responses
                                  ↓
                   Quality Scores + Metadata
```
```bash
# Clone repository
git clone https://github.com/KazKozDev/llm-evaluation-framework.git
cd llm-evaluation-framework

# Install dependencies
pip install -r requirements.txt

# Or install in development mode
pip install -e .
```

Requirements:
- Python 3.8+
- numpy, pandas, scipy
```python
from evaluator.core import EvaluationRunner, TestCase
from judges.llm_judge import LLMJudge, JudgmentCriteria
from metrics.tracker import MetricsTracker

# Define your LLM function
def my_llm(prompt: str) -> str:
    # Call your LLM API here
    return "Response from LLM"

# Create test cases
test_cases = [
    TestCase(
        task_id="qa_1",
        prompt="What is the capital of France?",
        reference="Paris",
    ),
]

# Setup judges
judges = [
    LLMJudge(
        criteria=[
            JudgmentCriteria.RELEVANCE,
            JudgmentCriteria.CORRECTNESS,
        ],
    )
]

# Create evaluation runner
runner = EvaluationRunner(
    model_fn=my_llm,
    judges=judges,
    metrics_tracker=MetricsTracker(),
)

# Run evaluation
results = runner.evaluate(test_cases)

# Get summary
summary = runner.get_summary()
print(f"Overall Score: {summary['overall_score']:.2f}")
```
```python
from evaluator.ab_testing import ABTest, PromptVariant
from judges.llm_judge import LLMJudge

# Define prompt variants
variant_a = PromptVariant(
    name="basic",
    prompt_template="Answer: {question}",
)
variant_b = PromptVariant(
    name="detailed",
    prompt_template="Provide a detailed answer to: {question}",
)

# Create A/B test
ab_test = ABTest(
    model_fn=my_llm,
    judges=[LLMJudge()],
    confidence_level=0.95,
)

# Run test
result = ab_test.run_test(
    variant_a=variant_a,
    variant_b=variant_b,
    test_inputs=[
        {"question": "What is ML?"},
        {"question": "Explain AI"},
    ],
)

# Check results
if result.is_significant:
    print(f"Winner: {result.winner}")
    print(f"Improvement: {result.improvement:.1f}%")
```
```python
from benchmarks.suite import BenchmarkSuite
from evaluator.core import EvaluationRunner, TestCase

# Load pre-built benchmarks
suite = BenchmarkSuite()
suite.load_default_benchmarks()

# Convert to test cases
test_cases = [
    TestCase(
        task_id=task.task_id,
        prompt=task.prompt,
        reference=task.reference,
    )
    for task in suite.tasks
]

# Evaluate
runner = EvaluationRunner(model_fn=my_llm)
results = runner.evaluate(test_cases)
```

```bash
python examples/basic_evaluation.py
```

Demonstrates:
- Creating test cases
- Running evaluations
- Using LLM judges
- Tracking metrics
- Getting summaries
```bash
python examples/ab_testing_demo.py
```

Demonstrates:
- Comparing prompt variants
- Statistical significance testing
- Finding best prompts
- Multiple variant comparison
```bash
python examples/benchmark_comparison.py
```

Demonstrates:
- Running standard benchmarks
- Comparing models
- Performance by task type
- Making deployment decisions
```bash
python examples/multi_provider_comparison.py
```

Demonstrates:
- Testing OpenAI, Anthropic, Hugging Face, and Ollama
- Comparing quality and latency across providers
- Using local vs cloud models
- Cost-effective evaluation
Main engine for running evaluations.
```python
runner = EvaluationRunner(
    model_fn=your_llm_function,
    judges=[judge1, judge2],
    metrics_tracker=tracker,
)

results = runner.evaluate(test_cases)
summary = runner.get_summary()
```

Quality assessment using the LLM-as-a-judge pattern.
```python
judge = LLMJudge(
    judge_fn=judge_llm_function,  # Optional
    criteria=[
        JudgmentCriteria.RELEVANCE,
        JudgmentCriteria.CORRECTNESS,
        JudgmentCriteria.COHERENCE,
    ],
    use_reference=True,
)

judgment = judge.judge(prompt, response, reference)
```

Statistical A/B testing for prompts.
```python
ab_test = ABTest(
    model_fn=your_llm,
    judges=[judge],
    confidence_level=0.95,
)

result = ab_test.run_test(variant_a, variant_b, test_inputs)
```

Track and aggregate evaluation metrics.
```python
tracker = MetricsTracker(experiment_name="exp_1")
tracker.record(result)

summary = tracker.get_summary()
tracker.save("results.json")
```

Pre-built evaluation benchmarks.
```python
suite = BenchmarkSuite(name="standard")
suite.load_default_benchmarks()

# Get tasks by type
qa_tasks = suite.get_tasks(TaskType.QUESTION_ANSWERING)
```

Standard criteria supported by LLMJudge:
| Criterion | Description |
|---|---|
| RELEVANCE | Response relevance to prompt |
| CORRECTNESS | Factual accuracy |
| COHERENCE | Logical flow and structure |
| HELPFULNESS | Utility of response |
| CLARITY | Ease of understanding |
| COMPLETENESS | Coverage of topic |
| CONCISENESS | Brevity without losing meaning |
| SAFETY | Absence of harmful content |
- Quality Scores: Per-criterion and overall scores
- Latency: Response time in milliseconds
- Token Usage: Input/output tokens (if available)
- Cost: Estimated API costs (if configured)
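To see what was recorded after a run, you can print the aggregated summary. This is a minimal sketch that uses only the `MetricsTracker.get_summary()` call shown above; it assumes the summary is a dict-like mapping, and the exact key names depend on your judges and configuration:

```python
from metrics.tracker import MetricsTracker

tracker = MetricsTracker(experiment_name="exp_1")
# ... run evaluations with EvaluationRunner(metrics_tracker=tracker) ...

# Print whatever the tracker aggregated for this experiment
# (key names vary with the configured judges and criteria)
summary = tracker.get_summary()
for metric, value in summary.items():
    print(f"{metric}: {value}")
```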
```
llm-evaluation-framework/
├── evaluator/
│   ├── core.py            # Main evaluation engine
│   └── ab_testing.py      # A/B testing framework
├── judges/
│   └── llm_judge.py       # LLM-as-a-judge implementation
├── benchmarks/
│   └── suite.py           # Benchmark tasks
├── metrics/
│   └── tracker.py         # Metrics tracking
├── examples/
│   ├── basic_evaluation.py
│   ├── ab_testing_demo.py
│   ├── benchmark_comparison.py
│   └── multi_provider_comparison.py
├── requirements.txt
├── pyproject.toml
└── README.md
```
- Model selection: Compare different LLMs on your specific tasks to choose the best one.
- Prompt optimization: A/B test prompt variants to optimize performance.
- Production monitoring: Track model quality over time in production.
- Regression testing: Ensure new model versions don't degrade quality (see the sketch below).
- Research: Systematic evaluation for model research.
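For example, a lightweight regression gate before rolling out a new model version might look like the following sketch. It uses only APIs shown earlier in this README; `baseline_llm`, `candidate_llm`, `judges`, `test_cases`, and the 5% threshold are placeholders you would replace with your own:

```python
from evaluator.core import EvaluationRunner

# Evaluate the current production model and the candidate on the same test cases
baseline_runner = EvaluationRunner(model_fn=baseline_llm, judges=judges)
candidate_runner = EvaluationRunner(model_fn=candidate_llm, judges=judges)

baseline_runner.evaluate(test_cases)
candidate_runner.evaluate(test_cases)

baseline_score = baseline_runner.get_summary()["overall_score"]
candidate_score = candidate_runner.get_summary()["overall_score"]

# Block deployment if overall quality drops by more than 5% (threshold is arbitrary)
if candidate_score < 0.95 * baseline_score:
    raise SystemExit("Quality regression detected; do not deploy.")
```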
```python
def my_custom_judge_fn(prompt: str) -> str:
    # Your custom judging logic
    # Return formatted judgment
    return "Score: 8\nReasoning: Good response"

judge = LLMJudge(judge_fn=my_custom_judge_fn)
```
```python
from benchmarks.suite import BenchmarkTask, TaskType

task = BenchmarkTask(
    task_id="custom_1",
    task_type=TaskType.QUESTION_ANSWERING,
    prompt="Your custom prompt",
    reference="Expected answer",
)

suite.add_task(task)
```
```python
tracker_v1 = MetricsTracker("experiment_v1")
tracker_v2 = MetricsTracker("experiment_v2")

# Run experiments...

# Compare
comparison = tracker_v1.compare_experiments(
    tracker_v2,
    metric_name="quality_correctness",
)
```

- Provide reference answers for better correctness evaluation.
- Use sufficient test cases for statistical significance (recommended: 30+).
- Choose judges that match your evaluation needs.
- Use MetricsTracker to maintain evaluation history.
- Use A/B testing with proper confidence levels for comparisons.
The framework includes built-in integrations for multiple LLM providers. Use the factory function for easy setup:
```python
from evaluator.integrations import create_llm_function

# Create LLM function
llm = create_llm_function(
    "openai",
    api_key="your-api-key",
    model="gpt-5"  # or "gpt-5-mini", "gpt-5-nano"
)

# Use in evaluation
runner = EvaluationRunner(model_fn=llm)
```
"anthropic",
api_key="your-api-key",
model="claude-sonnet-4-5-20250929" # or "claude-opus-4-1-20250805", "claude-haiku-4-5-20251001"
)# No API key needed! Runs locally
```python
# No API key needed! Runs locally
llm = create_llm_function(
    "huggingface",
    model_name="meta-llama/Llama-3.1-8B-Instruct",  # or Qwen2.5, Mistral, Gemma2
    device="cpu"  # or "cuda" for GPU
)
```

Popular models:
- Llama 3.1: `meta-llama/Llama-3.1-8B-Instruct`, `meta-llama/Llama-3.1-70B-Instruct`
- Qwen: `Qwen/Qwen2.5-7B-Instruct`, `Qwen/Qwen3-7B-Instruct`
- Mistral: `mistralai/Mistral-7B-Instruct-v0.3`
- Gemma: `google/gemma-2-9b-it`
- DeepSeek: `deepseek-ai/DeepSeek-Coder-V2-Instruct`
Install: `pip install transformers torch`
```python
# Run models locally with Ollama
llm = create_llm_function(
    "ollama",
    model="llama3.1:8b",  # or "qwen2.5-coder:7b", "deepseek-r1:7b"
    base_url="http://localhost:11434"
)
```

Popular models:
- Llama 3.1: `llama3.1:8b`, `llama3.1:70b`
- Qwen: `qwen2.5-coder:7b`, `qwen3-vl:7b`
- Mistral: `mistral-small3.1:24b`
- DeepSeek: `deepseek-r1:7b`
- Phi: `phi3:3.8b`
- Gemma: `gemma2:9b`, `gemma3:9b`
Setup:
- Install Ollama: https://ollama.ai
- Pull model: `ollama pull llama3.1:8b`
- Run server: `ollama serve`
Install: `pip install requests`
```python
# Evaluate different providers
providers = {
    "GPT-5": create_llm_function("openai", api_key=key, model="gpt-5"),
    "Claude-4.5": create_llm_function("anthropic", api_key=key, model="claude-sonnet-4-5-20250929"),
    "Llama-3.1": create_llm_function("ollama", model="llama3.1:8b"),
    "Qwen-2.5": create_llm_function("huggingface", model_name="Qwen/Qwen2.5-7B-Instruct"),
}

for name, llm in providers.items():
    runner = EvaluationRunner(model_fn=llm)
    results = runner.evaluate(test_cases)
    print(f"{name}: {runner.get_summary()['overall_score']:.2f}")
```

- Parallel Evaluation: Process test cases in batches
- Caching: Cache LLM responses for repeated evaluations
- Sampling: Use stratified sampling for large test sets
- Async: Use async LLM calls for better throughput
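As one illustration of the caching tip, any `model_fn` can be wrapped with an in-memory cache before it is passed to `EvaluationRunner`. This is a minimal sketch using only the Python standard library; it assumes your model function is a plain `str -> str` callable like `my_llm` from the Quick Start:

```python
from functools import lru_cache

def with_cache(model_fn, maxsize=1024):
    """Wrap a str -> str model function with an in-memory response cache."""
    @lru_cache(maxsize=maxsize)
    def cached(prompt: str) -> str:
        return model_fn(prompt)
    return cached

# Repeated evaluations over the same test cases reuse cached responses
cached_llm = with_cache(my_llm)
runner = EvaluationRunner(model_fn=cached_llm)
```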
- "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"
- "G-Eval: NLG Evaluation using LLMs with Better Human Alignment"
- "Self-Consistency Improves Chain of Thought Reasoning in Language Models"
MIT License - See LICENSE file
If you use this framework in research or production:
```bibtex
@software{llm_evaluation_framework_2025,
  title = {LLM Evaluation Framework: Automated Quality Assessment and A/B Testing},
  author = {Artem Kazakov Kozlov},
  year = {2025},
  url = {https://github.com/KazKozDev/llm-evaluation-framework}
}
```

Building better LLMs through systematic evaluation 🎯