md-evals


Evaluate AI skills with scientific rigor. Compare prompts with and without injected context using A/B testing, multiple LLM providers, and production-grade evaluation techniques.

Lightweight CLI tool for evaluating AI skills (SKILL.md) with Control vs Treatment testing using LiteLLM.

Inspired by LangChain skills-benchmarks.

📚 Full Documentation | Quick Start | GitHub Models Guide | Examples

Why md-evals?

Building AI applications that work reliably requires scientific validation. md-evals makes it easy:

| Challenge | Solution |
|---|---|
| 🤔 "Does my skill actually help?" | A/B test Control vs Treatment automatically |
| 💰 "Can't afford to evaluate with expensive APIs?" | Use free GitHub Models (Claude, GPT-4, DeepSeek) |
| 📊 "How do I know if my results are real?" | Hybrid regex + LLM-as-judge evaluation |
| 🔄 "Evaluating 100+ test cases manually is tedious" | Parallel workers, beautiful terminal output, JSON/Markdown export |
| ✅ "How do I prevent bad skills from merging?" | Built-in linter (400-line limit, best practices) |
| 🏗️ "Will this integrate with my CI/CD?" | Simple YAML config, exit codes for automation |

Features

  • ✨ A/B Testing: Compare Control (no skill) vs Treatment (with skill) prompts side-by-side
  • 🎯 Multiple Treatments: Run wildcards like LCC_* to test different skill variations in one go
  • 🧠 Hybrid Evaluation: Combine regex pattern matching + LLM-as-a-judge for flexible validation
  • 🚀 Multiple LLM Providers: GitHub Models (free!), OpenAI, Anthropic, LiteLLM, and more
  • 📋 Linter: Enforce 400-line limit, quality checks, and best practices for SKILL.md
  • 📊 Rich Output: Beautiful terminal tables with pass rates, comparisons, and statistics
  • 💾 Export: JSON, Markdown, or table format for reporting and analysis
  • ⚡ Parallel Execution: Run multiple tests concurrently for faster feedback
  • 🎉 GitHub Models Support: Use free/low-cost models (Claude 3.5, GPT-4, DeepSeek, Grok)
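
The A/B comparison at the heart of the tool reduces to running every test under both treatments and comparing pass rates. A simplified sketch of that idea (function names and sample results are illustrative, not md-evals internals):

```python
# Simplified Control-vs-Treatment comparison; names and data are illustrative.
def pass_rate(results: list[bool]) -> float:
    """Fraction of test runs that passed all evaluators."""
    return sum(results) / len(results)

control   = [True, False, False, True]   # baseline: no skill injected
treatment = [True, True, True, True]     # SKILL.md injected into the prompt

lift = pass_rate(treatment) - pass_rate(control)
print(f"control={pass_rate(control):.0%} "
      f"treatment={pass_rate(treatment):.0%} lift={lift:+.0%}")
# control=50% treatment=100% lift=+50%
```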

Installation

Using uv (Recommended)

# Clone the repository
git clone https://github.com/JNZader/md-evals.git
cd md-evals

# Install with uv (fastest)
uv sync

# Activate virtual environment
source .venv/bin/activate

Using pip

git clone https://github.com/JNZader/md-evals.git
cd md-evals

# Install dependencies
pip install -e .

Requirements: Python 3.12+

Quick Start

1. Initialize your evaluation

md-evals init

This creates:

  • eval.yaml - Your evaluation config
  • SKILL.md - Template for your AI skill

2. Run evaluation

md-evals run

3. Check your skill

md-evals lint        # Validate SKILL.md
md-evals list        # List treatments and tests

⏱️ Complete example in 2 minutes

# 1. Create evaluation
md-evals init

# 2. Set your token (or rely on an existing `gh auth login` session)
export GITHUB_TOKEN="github_pat_..."

# 3. Preflight auth (env var first, gh login fallback)
md-evals smoke --provider github-models

# 4. Run with GitHub Models (free!)
md-evals run --provider github-models --model claude-3.5-sonnet --config eval.yaml

# 5. View results
# → Beautiful table with Control vs Treatment comparison
# → Pass rates and statistics

🎉 GitHub Models: Free LLM Evaluation

Evaluate your skills completely free using GitHub's Models API (public preview):

Setup (One-time)

# Preferred: set GITHUB_TOKEN directly
export GITHUB_TOKEN="github_pat_..."

# Fallback for users already logged in with GitHub CLI
gh auth login

# Verify auth preflight before first run
md-evals smoke --provider github-models --config examples/eval_with_github_models.yaml

Run Evaluation with Free Models

# Use Claude 3.5 Sonnet (200k context, free!)
md-evals run --config eval.yaml --provider github-models --model claude-3.5-sonnet

# Or use GPT-4o
md-evals run --config eval.yaml --provider github-models --model gpt-4o

# Or use DeepSeek R1 (fastest)
md-evals run --config eval.yaml --provider github-models --model deepseek-r1

Available Models

| Model | Context | Best For | Cost |
|---|---|---|---|
| claude-3.5-sonnet | 200k | Reasoning, complex tasks | 🟢 Free |
| gpt-4o | 128k | General-purpose, balanced | 🟢 Free |
| deepseek-r1 | 64k | Speed, cost efficiency | 🟢 Free |
| grok-3 | 128k | Latest, edge cases | 🟢 Free |

Rate Limits: 15 requests/min (public preview) · Full Guide →

Configuration

Create eval.yaml to define your evaluation. Here's a complete example:

name: "My AI Skill Evaluation"
version: "1.0"
description: "Evaluate skill effectiveness with Control vs Treatment"

defaults:
  model: "claude-3.5-sonnet"
  provider: "github-models"  # Free! (or: openai, anthropic, etc.)
  temperature: 0.7
  max_tokens: 500

treatments:
  CONTROL:
    description: "Baseline: No skill injected"
    skill_path: null
  
  WITH_SKILL:
    description: "Treatment: With skill injected"
    skill_path: "./SKILL.md"
  
  WITH_SKILL_V2:
    description: "Alternative skill variant"
    skill_path: "./SKILL_V2.md"

tests:
  - name: "test_basic_greeting"
    prompt: "Greet {name} and ask how they're doing."
    variables:
      name: "Alice"
    evaluators:
      - type: "regex"
        name: "has_greeting"
        pattern: "(hello|hi|greetings)"
      - type: "llm"
        name: "is_friendly"
        criteria: "Does the response feel warm and friendly?"
  
  - name: "test_complex_reasoning"
    prompt: "Explain {concept} to a {audience}."
    variables:
      concept: "quantum computing"
      audience: "5-year-old child"
    evaluators:
      - type: "llm"
        name: "is_age_appropriate"
        criteria: "Is the explanation suitable for a 5-year-old?"

Key Sections

| Section | Purpose |
|---|---|
| `defaults` | LLM model, provider, temperature, token limits |
| `treatments` | Different skill configurations to compare |
| `tests` | Test cases with prompts, variables, and evaluators |
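
Test prompts use `{variable}` placeholders filled from each test's variables map. A minimal sketch of that substitution, assuming Python str.format-style templating (the tool's actual mechanism may differ):

```python
# Sketch: how {variable} placeholders in a test prompt might be filled.
# Assumes str.format-style substitution; not necessarily md-evals' internals.
def render_prompt(template: str, variables: dict[str, str]) -> str:
    """Substitute {name}-style placeholders with per-test variables."""
    return template.format(**variables)

prompt = render_prompt(
    "Explain {concept} to a {audience}.",
    {"concept": "quantum computing", "audience": "5-year-old child"},
)
print(prompt)  # Explain quantum computing to a 5-year-old child.
```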

Evaluators

  • type: regex - Pattern matching (fast, deterministic)
  • type: llm - LLM-as-judge (flexible, intelligent)
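
A hybrid check passes only when every evaluator agrees. The sketch below illustrates both evaluator types with hypothetical names and a mocked judge; it is not md-evals' actual implementation:

```python
import re

# Illustrative evaluator sketches; signatures are assumptions, not the real API.

def regex_evaluator(response: str, pattern: str) -> bool:
    """Deterministic check: does the response match the pattern?"""
    return re.search(pattern, response, re.IGNORECASE) is not None

def llm_evaluator(response: str, criteria: str, judge) -> bool:
    """LLM-as-judge: ask a model whether the response meets the criteria.
    `judge` is any callable that returns the model's verdict as text."""
    verdict = judge(
        f"Criteria: {criteria}\nResponse: {response}\n"
        "Answer strictly YES or NO."
    )
    return verdict.strip().upper().startswith("YES")

# Hybrid evaluation: a test passes only if every evaluator passes.
response = "Hello Alice! How are you doing today?"
checks = [
    regex_evaluator(response, r"(hello|hi|greetings)"),
    llm_evaluator(response, "Does the response feel warm and friendly?",
                  judge=lambda prompt: "YES"),  # mocked judge for this sketch
]
print(all(checks))  # True
```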

Commands

| Command | Purpose |
|---|---|
| `md-evals init` | 🚀 Scaffold eval.yaml and SKILL.md templates |
| `md-evals run` | ▶️ Run evaluations (Control vs Treatment) |
| `md-evals run --treatment WITH_SKILL` | 🎯 Run specific treatment |
| `md-evals lint` | ✅ Validate SKILL.md (400-line limit, best practices) |
| `md-evals list` | 📋 List available treatments and tests |
| `md-evals list-models` | 🤖 List available models per provider |
| `md-evals smoke --provider github-models --config eval.yaml` | 🧪 Local preflight (provider, config, auth) |
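
The 400-line check that `md-evals lint` enforces amounts to a simple line count. A hypothetical sketch of that one rule (not the real linter, which also applies quality checks):

```python
# Hypothetical sketch of the lint rule; not md-evals' actual linter code.
MAX_LINES = 400  # the limit `md-evals lint` enforces on SKILL.md

def check_line_limit(skill_text: str) -> tuple[bool, int]:
    """Return (ok, line_count) for the contents of a SKILL.md file."""
    count = len(skill_text.splitlines())
    return count <= MAX_LINES, count

ok, count = check_line_limit("# My Skill\n" + "instruction line\n" * 10)
print(ok, count)  # True 11
```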

Common Workflows

# Evaluate with default provider
md-evals run

# Use specific provider and model
md-evals run --provider github-models --model claude-3.5-sonnet

# Run only specific treatment
md-evals run --treatment WITH_SKILL

# Export results as JSON
md-evals run --output json > results.json

# Run with 4 parallel workers
md-evals run -n 4

# Repeat each test 5 times (for statistical significance)
md-evals run --count 5

# Export to Markdown report
md-evals run --output markdown > report.md

# Validate before running
md-evals lint
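
Since the tool is designed for CI/CD automation via exit codes, a failing lint or run can gate a merge. A hypothetical GitHub Actions step (workflow layout, permissions, and the secret name are assumptions; check the GitHub Models documentation for the token scopes you actually need):

```yaml
# Hypothetical CI fragment; verify token permissions against GitHub Models docs.
permissions:
  models: read

steps:
  - name: Lint and evaluate skill
    env:
      GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    run: |
      md-evals lint
      md-evals run --provider github-models --output json > results.json
```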

Full Options Reference

run

  • -c, --config FILE - Config file (default: eval.yaml)
  • -t, --treatment TREATMENT - Run specific treatment(s)
  • -m, --model MODEL - Override model
  • -p, --provider PROVIDER - Provider: github-models, openai, anthropic, etc.
  • -n WORKERS - Parallel workers (default: 1)
  • --count N - Repeat tests N times for statistical validation
  • -o, --output FORMAT - Output format: table (default), json, markdown
  • --no-lint - Skip SKILL.md linting
  • --debug - Enable debug logging
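
The `-n` flag fans test cases out across workers; because LLM calls are I/O-bound, a thread pool is the natural pattern. A minimal sketch of that pattern (an assumption about the approach, not md-evals' actual scheduler):

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of running test cases across N workers, as `-n` suggests.
# `run_test` stands in for one Control-vs-Treatment round trip.
def run_test(name: str) -> tuple[str, bool]:
    return name, True  # placeholder result; real code would call the LLM

tests = ["test_basic_greeting", "test_complex_reasoning"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(run_test, tests))

print(results)  # {'test_basic_greeting': True, 'test_complex_reasoning': True}
```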

list-models

  • -p, --provider PROVIDER - Filter by provider
  • -v, --verbose - Show metadata (temperature ranges, costs, rate limits)

Development

Setup

# Install with dev dependencies
uv sync --extra dev

# Activate virtual environment
source .venv/bin/activate

Testing

md-evals has a comprehensive test suite with 94.95% code coverage and 321 passing tests.

Quick Start

# Run all tests
pytest

# Run tests in parallel (73% faster)
pytest -n 4

# View coverage report
pytest --cov=md_evals --cov-report=html
open htmlcov/index.html

Test Documentation

Complete testing guides for different audiences:

| Guide | Audience | Purpose |
|---|---|---|
| TESTING.md | Everyone | How to run tests, markers, parallel execution |
| TEST_DEVELOPMENT_GUIDE.md | Developers | Writing new tests, fixtures, mocking strategies |
| TEST_ARCHITECTURE.md | Tech Leads | Test organization, fixture hierarchy, isolation patterns |
| TEST_CI_INTEGRATION.md | DevOps/CI Engineers | CI/CD setup, Docker, reporting, multiple platforms |
| TEST_QUICK_REFERENCE.md | All | Command cheat sheet, one-liners, common patterns |
| TEST_COVERAGE_ANALYSIS.md | Maintainers | Coverage gaps, improvement roadmap, module analysis |

Common Testing Tasks

# Run only unit tests (fast feedback)
pytest -m unit

# Run only integration tests
pytest -m integration

# Run specific test file
pytest tests/test_github_models_provider.py -v

# Debug a specific test
pytest tests/test_engine.py::TestExecutionEngine::test_run_basic -vvv --pdb

# Run tests that match pattern
pytest -k "github_models"

# Skip slow tests (faster local development)
pytest -m "not slow"

# Generate all reports
pytest -n 4 \
  --cov=md_evals \
  --cov-report=html \
  --cov-report=xml \
  --cov-report=json \
  --junit-xml=test-results.xml

Test Coverage

  • Overall: 94.95% (production standard: 90%)
  • Critical modules: >95% (engine, evaluators, config)
  • Test count: 321 tests (unit, integration, E2E, performance)
  • Execution time: 6.63s parallel / 22.09s serial

Test Structure

tests/
├── conftest.py                    # Shared fixtures and config
├── test_cli.py                    # CLI command tests (100+ tests)
├── test_engine.py                 # Core evaluation engine
├── test_evaluator.py              # Regex & LLM evaluators
├── test_github_models_provider.py # Provider tests (43 tests)
├── test_e2e_workflow.py           # End-to-end workflow tests
├── test_linter.py                 # SKILL.md validation
├── test_reporter.py               # Report generation
└── ... (10+ test files total)

Performance

| Configuration | Time | Speedup |
|---|---|---|
| Serial | 22.09s | – |
| Parallel (4 workers) | 6.63s | 73% |
| Unit tests only | ~5s | 78% |
| Fast tests (no slow) | ~10s | 55% |

For more details, see TESTING.md.

Project Structure

md_evals/
├── cli.py                    # Command-line interface
├── engine.py                 # Evaluation engine (A/B testing)
├── llm.py                    # LLM provider interface
├── providers/                # LLM provider implementations
│   ├── github_models.py      # GitHub Models (free!)
│   ├── openai_provider.py
│   ├── anthropic_provider.py
│   └── litellm_provider.py
├── evaluators/               # Evaluation strategies
│   ├── regex_evaluator.py
│   └── llm_evaluator.py
└── config.py                 # YAML config parsing

tests/
├── test_engine.py
├── test_github_models_provider.py  # 43 tests
├── test_provider_registry.py       # 11 tests
└── ...
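
The `providers/` modules plug into a common interface defined in `llm.py`. A hypothetical sketch of what such an interface could look like (class and method names are invented for illustration; consult the source for the real API):

```python
from abc import ABC, abstractmethod

# Hypothetical provider interface; names are illustrative, not md-evals' API.
class LLMProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str, model: str, **kwargs) -> str:
        """Return the model's text completion for `prompt`."""

class EchoProvider(LLMProvider):
    """Toy provider used here only to show the plug-in pattern."""
    def complete(self, prompt: str, model: str, **kwargs) -> str:
        return f"[{model}] {prompt}"

provider = EchoProvider()
print(provider.complete("Greet Alice.", model="claude-3.5-sonnet"))
# [claude-3.5-sonnet] Greet Alice.
```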

Community & Support

📖 Documentation

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for:

  • Fork → Branch → Pull Request workflow
  • Code style guidelines (Ruff, 100 char lines)
  • Testing requirements (>80% coverage)
  • Conventional Commits format

📋 Community

πŸ“ License

MIT
