# Themis Quick Start Tutorial

This notebook demonstrates the basics of using Themis for LLM evaluation.

## Setup

First, make sure Themis is installed:

```bash
# Quote the extras so shells such as zsh do not treat the brackets as a glob
pip install "themis-eval[math,nlp]"
# or
uv pip install "themis-eval[math,nlp]"
```

In [None]:
# Import Themis
from themis import evaluate

## 1. Your First Evaluation

Let's evaluate a model on a built-in benchmark. We'll use the `demo` benchmark, which has only 10 samples, so it runs quickly and is well suited to testing. The `limit=5` argument caps the run at 5 of those samples.

In [None]:
# Evaluate on the demo benchmark, capped at 5 of its 10 samples
result = evaluate(
    benchmark="demo",
    model="fake-math-llm",  # Fake model for testing (no API key needed)
    limit=5,
)

print(f"Accuracy: {result.evaluation_report.metrics['ExactMatch'].mean:.2%}")
print(f"Total samples: {len(result.generation_results)}")
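To make the `ExactMatch` number above concrete, here is a minimal sketch of how an exact-match accuracy can be computed by hand. This is not Themis's implementation: the helper names `exact_match` and `mean_exact_match` are hypothetical, and the whitespace stripping is an assumption about normalization.

```python
def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 when the whitespace-stripped strings are identical, else 0.0."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0


def mean_exact_match(predictions: list[str], references: list[str]) -> float:
    """Average the per-sample exact-match scores (0.0 for an empty input)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    scores = [exact_match(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)


# Two of the three predictions match their reference after stripping whitespace.
predictions = ["4", "8 ", "2"]
references = ["4", "8", "3"]
accuracy = mean_exact_match(predictions, references)
print(f"Accuracy: {accuracy:.2%}")  # Accuracy: 66.67%
```

The mean of per-sample scores is what the `.mean` attribute in the cell above appears to report, so a value of 66.67% would correspond to two correct answers out of three.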

## 2. List Available Benchmarks

Themis includes 6 built-in benchmarks:

In [None]:
from themis.presets import list_benchmarks

benchmarks = list_benchmarks()
print("Available benchmarks:")
for benchmark in benchmarks:
    print(f"  - {benchmark}")

## 3. Using Real Models

To use real models, set your API keys as environment variables:

```python
import os
os.environ["OPENAI_API_KEY"] = "your-key-here"
```

Then evaluate:

In [None]:
# Uncomment and run with your API key
# result = evaluate(
#     benchmark="gsm8k",
#     model="gpt-3.5-turbo",
#     limit=10,  # Start small
# )
# print(f"Accuracy: {result.evaluation_report.metrics['ExactMatch'].mean:.2%}")
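Before uncommenting the cell above, it can help to confirm that the key is actually visible to the notebook process. The sketch below uses only the standard library; `require_api_key` and `mask_key` are hypothetical helpers written for this tutorial, not part of Themis.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in the environment, or raise if it is missing."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return key


def mask_key(key: str) -> str:
    """Show only the first 3 and last 4 characters so the key never lands in cell output."""
    return key[:3] + "..." + key[-4:] if len(key) > 8 else "***"


try:
    print("Key found:", mask_key(require_api_key()))
except RuntimeError as err:
    print("Skipping real-model run:", err)
```

Exporting the key in your shell, rather than assigning it in a cell, keeps it out of the saved notebook.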

## 4. Custom Configuration

You can customize sampling parameters:

In [None]:
result = evaluate(
    benchmark="demo",
    model="fake-math-llm",
    temperature=0.7,      # Sampling temperature
    max_tokens=256,       # Max response length
    num_samples=1,        # Samples per prompt
    workers=4,            # Parallel workers
    limit=5,
)

print(f"Results: {result.evaluation_report.metrics}")
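As a rough mental model for `num_samples` and `workers`, the sketch below draws several samples per prompt and fans the calls out over a thread pool. How Themis schedules work internally is not covered in this tutorial, so treat the thread pool as an assumption. `fake_model` and `generate_all` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a model call: returns the prompt in upper case."""
    return prompt.upper()


def generate_all(prompts: list[str], num_samples: int = 1, workers: int = 4) -> list[list[str]]:
    """Return one list of `num_samples` responses per prompt, in prompt order."""
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    # Repeat each prompt num_samples times so every sample is its own task.
    tasks = [p for p in prompts for _ in range(num_samples)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = list(pool.map(fake_model, tasks))  # map preserves input order
    # Regroup the flat output list into chunks of num_samples.
    return [outputs[i:i + num_samples] for i in range(0, len(outputs), num_samples)]


responses = generate_all(["a", "b", "c"], num_samples=2, workers=4)
print(responses)  # [['A', 'A'], ['B', 'B'], ['C', 'C']]
```

With a real API-backed model, more workers typically shortens wall-clock time until the provider's rate limit becomes the bottleneck.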

## 5. View Detailed Results

The result object contains all evaluation details:

In [None]:
# Access the report
print("Run ID:", result.run_id)
print("Metrics:", result.evaluation_report.metrics)
print("Number of samples:", len(result.generation_results))

# View the full report
print("\nFull Report:")
print(result.report)

## 6. Caching and Resume

Themis automatically caches results. Pass the same `run_id` together with `resume=True` to reuse cached results, for example when resuming a failed run:

In [None]:
# First run
result1 = evaluate(
    benchmark="demo",
    model="fake-math-llm",
    run_id="my-experiment",
    limit=5,
)

print("First run completed")

# Second run with same run_id (will use cache)
result2 = evaluate(
    benchmark="demo",
    model="fake-math-llm",
    run_id="my-experiment",
    resume=True,  # Use cached results
    limit=5,
)

print("Second run used cache (instant!)")
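The idea behind resuming can be sketched with a small cache keyed by run ID. This is only an illustration of the concept under simple assumptions (one JSON file per run, keyed by prompt text); it does not describe Themis's actual storage format. `run_with_cache` and `slow_model` are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Throwaway cache directory for the demonstration.
CACHE_DIR = Path(tempfile.mkdtemp(prefix="eval-cache-"))


def run_with_cache(run_id, prompts, generate, resume=True):
    """Generate a response per prompt, skipping prompts already cached for this run_id."""
    cache_file = CACHE_DIR / f"{run_id}.json"
    cached = {}
    if resume and cache_file.exists():
        cached = json.loads(cache_file.read_text())
    calls = 0
    for prompt in prompts:
        if prompt not in cached:
            cached[prompt] = generate(prompt)
            calls += 1
    cache_file.write_text(json.dumps(cached))
    return [cached[p] for p in prompts], calls


def slow_model(prompt: str) -> str:
    """Hypothetical stand-in for an expensive model call: reverses the prompt."""
    return prompt[::-1]


prompts = ["abc", "def"]
first, first_calls = run_with_cache("my-experiment", prompts, slow_model)
second, second_calls = run_with_cache("my-experiment", prompts, slow_model)
print(first_calls, second_calls)  # 2 0
```

The second call makes no model calls because every prompt is already stored under the same run ID, which is why a resumed run finishes almost immediately.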

## 7. Custom Dataset

You can evaluate on your own data:

In [None]:
# Your custom dataset
dataset = [
    {"prompt": "What is 2+2?", "answer": "4"},
    {"prompt": "What is 5+3?", "answer": "8"},
    {"prompt": "What is 10-7?", "answer": "3"},
]

result = evaluate(
    dataset,
    model="fake-math-llm",
    prompt="Question: {prompt}\nAnswer:",
    metrics=["ExactMatch"],
)

print(f"Accuracy: {result.evaluation_report.metrics['ExactMatch'].mean:.2%}")
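The `{prompt}` placeholder in the template above refers to the `prompt` field of each dataset row. Assuming the template is filled with ordinary Python string formatting (an assumption, since this tutorial does not show how Themis renders it), you can preview the rendered prompts before running an evaluation:

```python
template = "Question: {prompt}\nAnswer:"

dataset = [
    {"prompt": "What is 2+2?", "answer": "4"},
    {"prompt": "What is 5+3?", "answer": "8"},
    {"prompt": "What is 10-7?", "answer": "3"},
]

# str.format ignores keyword arguments the template does not use,
# so the extra "answer" field is harmless here.
rendered = [template.format(**row) for row in dataset]

for text in rendered:
    print(text)
    print("---")
```

Previewing the rendered text is a cheap way to catch a misspelled placeholder, which would raise a `KeyError` in this sketch.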

## Next Steps

- **Tutorial 02**: Learn how to compare multiple runs
- **Tutorial 03**: Explore custom metrics and advanced features
- **Documentation**: Check out [docs/index.md](../docs/index.md)

## Summary

In this tutorial, you learned:
- ✅ Running evaluations with `evaluate()`
- ✅ Using built-in benchmarks
- ✅ Customizing model parameters
- ✅ Caching and resuming runs
- ✅ Evaluating custom datasets

Happy evaluating! 🎯