# Evaluating a Basic Model with LM Evaluation Harness

This notebook demonstrates how to evaluate a basic language model using the LM Evaluation Harness framework.

## Overview

The LM Evaluation Harness provides a unified framework to test generative language models on various evaluation tasks. In this notebook, we'll:

1. Set up the environment and install dependencies
2. Load a simple model (HuggingFace transformers)
3. Evaluate it on a basic task (HellaSwag)/ set of tasks


## Step 1: Installation and Setup

First, let's ensure we have all the required dependencies installed.


In [1]:
# Install the package in development mode if not already installed
# Uncomment the line below if you need to install from scratch
# !pip install -e .

# Check if we can import the package
import sys
import os

# Add the current directory to the path
sys.path.insert(0, os.path.abspath('.'))

print("Setup complete!")


Setup complete!


## Step 2: Import Required Libraries


In [2]:
import torch
import json
from lm_eval import simple_evaluate
from lm_eval.tasks import TaskManager
from lm_eval.utils import setup_logging

# Setup logging to see what's happening
setup_logging(verbosity="INFO")

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")


  from .autonotebook import tqdm as notebook_tqdm


PyTorch version: 2.8.0a0+5228986c39.nv25.05
CUDA available: True
CUDA device: NVIDIA H100 80GB HBM3


## Step 3: Explore Available Tasks

Let's see what tasks are available for evaluation.


In [3]:
# Initialize task manager to explore available tasks
task_manager = TaskManager(verbosity="INFO")

# List some popular tasks
print("\n=== Popular Evaluation Tasks ===")
popular_tasks = [
    "hellaswag",      # Commonsense reasoning
    "arc_easy",       # Science questions (easy)
    "arc_challenge",  # Science questions (challenging)
    "lambada_openai", # Language modeling
    "piqa",           # Physical commonsense reasoning
    "winogrande",     # Pronoun resolution
]

for task in popular_tasks:
    if task in task_manager._all_tasks:
        print(f"✓ {task}")
    else:
        print(f"✗ {task} (not found)")

print("\nTo see all available tasks, run: lm_eval --tasks list")



=== Popular Evaluation Tasks ===
✓ hellaswag
✓ arc_easy
✓ arc_challenge
✓ lambada_openai
✓ piqa
✓ winogrande

To see all available tasks, run: lm_eval --tasks list


## Step 4: Choose a Model and Task

For this tutorial, we'll use:
- **Model**: `EleutherAI/pythia-160m` - A small, fast model perfect for testing
- **Task**: `hellaswag` - A commonsense reasoning task

You can easily change these to evaluate different models or tasks!


In [4]:
# Configuration
MODEL_NAME = "EleutherAI/pythia-160m"  # Small model for quick testing
# Alternative models to try:
#MODEL_NAME = "gpt2"  # OpenAI's GPT-2
# MODEL_NAME = "EleutherAI/gpt-neo-125M"  # GPT-Neo 125M

TASK_NAME = "hellaswag"  # Commonsense reasoning task
# Alternative tasks you caton try:
# TASK_NAME = "arc_easy"  # Science questions
# TASK_NAME = "piqa"  # Physical commonsense

# Limit the number of examples for faster evaluation (remove or set to None for full evaluation)
LIMIT = 50  # Only evaluate on 50 examples for this demo

# Device configuration
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
BATCH_SIZE = 8  # Adjust based on the GPU memory

print(f"Model: {MODEL_NAME}")
print(f"Task: {TASK_NAME}")
print(f"Device: {DEVICE}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Limit: {LIMIT} examples")


Model: EleutherAI/pythia-160m
Task: hellaswag
Device: cuda:0
Batch size: 8
Limit: 50 examples


## Step 5: Run Evaluation

Now let's evaluate the model using the `simple_evaluate` function. This is the main API for running evaluations.


## Step 10: Direct Model Testing (Alternative)

For easier debugging, you can also test the model class directly without going through the full evaluation harness. This is useful for:
- Testing schema loading
- Testing JSON extraction
- Testing validation
- Debugging specific issues


In [None]:
# Direct testing of the SGLangSchemaLM class
# This is useful for debugging individual components

from lm_eval.models.sglang_schema import SGLangSchemaLM
from lm_eval.api.instance import Instance

# Define configuration (can be run independently)
#SCHEMA_MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # Use a model compatible with SGLang
# Alternative: Use your medical model path
SCHEMA_MODEL_NAME = "OpenMeditron/Meditron3-8B"

schema_file_path = "test_schema.json"  # Should exist from previous cell, or create it here

# Create schema file if it doesn't exist
import json
import os
if not os.path.exists(schema_file_path):
    test_schema = {
        "type": "object",
        "properties": {
            "answer": {
                "type": "string",
                "description": "The answer to the question"
            },
            "confidence": {
                "type": "number",
                "minimum": 0,
                "maximum": 1,
                "description": "Confidence score between 0 and 1"
            },
            "reasoning": {
                "type": "string",
                "description": "Brief reasoning for the answer"
            }
        },
        "required": ["answer", "confidence"]
    }
    with open(schema_file_path, 'w') as f:
        json.dump(test_schema, f, indent=2)
    print(f"Created schema file: {schema_file_path}")

# Test 1: Check if the model can be instantiated
print("="*60)
print("Test 1: Model Instantiation")
print("="*60)

try:
    # Create model instance with schema
    test_model = SGLangSchemaLM.create_from_arg_string(
        f"pretrained={SCHEMA_MODEL_NAME},"
        f"response_schema={schema_file_path},"
        "validate_with_pydantic=True"
    )
    print("✓ Model instantiated successfully")
    print(f"  Schema loaded: {test_model.response_schema is not None}")
    print(f"  Pydantic model created: {test_model.schema_model is not None}")
    if test_model.schema_model:
        print(f"  Pydantic model name: {test_model.schema_model.__name__}")
except Exception as e:
    print(f"✗ Model instantiation failed: {e}")
    import traceback
    traceback.print_exc()
    test_model = None

# Test 2: Test JSON extraction (if model was created)
if test_model:
    print("\n" + "="*60)
    print("Test 2: JSON Extraction")
    print("="*60)
    
    test_cases = [
        '{"answer": "test", "confidence": 0.9}',
        '```json\n{"answer": "test", "confidence": 0.9}\n```',
        'Some text {"answer": "test", "confidence": 0.9} more text',
    ]
    
    for i, test_text in enumerate(test_cases, 1):
        extracted = test_model._extract_json(test_text)
        print(f"  Test {i}: {extracted is not None}")
        if extracted:
            print(f"    Extracted: {extracted[:50]}...")

# Test 3: Test schema validation (if model was created)
if test_model and test_model.schema_model:
    print("\n" + "="*60)
    print("Test 3: Schema Validation")
    print("="*60)
    
    # Valid JSON matching schema
    valid_json = '{"answer": "This is a test answer", "confidence": 0.85}'
    is_valid, model_instance, error = test_model._validate_schema(valid_json)
    print(f"  Valid JSON: {is_valid}")
    if is_valid:
        print(f"    ✓ Validation passed")
        print(f"    Model instance: {model_instance}")
    else:
        print(f"    ✗ Validation failed: {error}")
    
    # Invalid JSON (missing required field)
    invalid_json = '{"answer": "test"}'  # Missing "confidence"
    is_valid, model_instance, error = test_model._validate_schema(invalid_json)
    print(f"\n  Invalid JSON (missing field): {is_valid}")
    if not is_valid:
        print(f"    ✓ Correctly rejected: {error[:100]}...")

print("\n" + "="*60)
print("Direct testing complete!")
print("="*60)


Test 1: Model Instantiation
✗ Model instantiation failed: attempted to use 'sglang' LM type, but package `sglang` is not installed. Please install sglang via official document here:https://docs.sglang.ai/start/install.html#install-sglang

Direct testing complete!


Traceback (most recent call last):
  File "/tmp/ipykernel_8101/1124215433.py", line 85, in <module>
    test_model = SGLangSchemaLM.create_from_arg_string(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mloscratch/users/hatrouho/lm-evaluation-harness/lm_eval/models/sglang_schema.py", line 951, in create_from_arg_string
    return cls(**args)
           ^^^^^^^^^^^
  File "/mloscratch/users/hatrouho/lm-evaluation-harness/lm_eval/models/sglang_schema.py", line 93, in __init__
    super().__init__(pretrained=pretrained, **kwargs)
  File "/mloscratch/users/hatrouho/lm-evaluation-harness/lm_eval/models/sglang_causallms.py", line 70, in __init__
    raise ModuleNotFoundError(
ModuleNotFoundError: attempted to use 'sglang' LM type, but package `sglang` is not installed. Please install sglang via official document here:https://docs.sglang.ai/start/install.html#install-sglang


In [7]:
print(f"\n{'='*60}")
print(f"Starting evaluation of {MODEL_NAME} on {TASK_NAME}")
print(f"{'='*60}\n")

# Run evaluation
# The simple_evaluate function is the main entry point
results = simple_evaluate(
    model="hf",  # Use HuggingFace model backend
    model_args=f"pretrained={MODEL_NAME},dtype=float32",  # Model arguments
    tasks=[TASK_NAME],  # List of tasks to evaluate on
    device=DEVICE,  # Device to run on
    batch_size=BATCH_SIZE,  # Batch size for evaluation
    limit=LIMIT,  # Limit number of examples (for testing)
    num_fewshot=0,  # Number of few-shot examples (0 = zero-shot)
    log_samples=True,  # Log individual samples for analysis
    verbosity="INFO",  # Logging verbosity
)

print("\nEvaluation complete!")


2025-11-20:14:48:21 INFO     [evaluator:202] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-11-20:14:48:21 INFO     [evaluator:240] Initializing hf model, with arguments: {'pretrained': 'EleutherAI/pythia-160m', 'dtype': 'float32'}
2025-11-20:14:48:21 INFO     [models.huggingface:156] Using device 'cuda:0'



Starting evaluation of EleutherAI/pythia-160m on hellaswag



2025-11-20:14:48:22 INFO     [models.huggingface:423] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
2025-11-20:14:48:40 INFO     [api.task:434] Building contexts for hellaswag on rank 0...
100%|██████████| 50/50 [00:00<00:00, 4462.79it/s]
2025-11-20:14:48:40 INFO     [evaluator:574] Running loglikelihood requests
Running loglikelihood requests: 100%|██████████| 200/200 [00:00<00:00, 476.64it/s]



Evaluation complete!


## Step 6: Display Results

Let's examine the evaluation results in detail.


In [8]:
# Display results in a readable format
print("\n" + "="*60)
print("EVALUATION RESULTS")
print("="*60)

# The results dictionary from simple_evaluate() contains (as per evaluator.py):
# - 'results': Task-specific metrics (dict with task names as keys)
# - 'config': Configuration used for evaluation (model, model_args, batch_size, device, etc.)
# - 'versions': Version information for each task
# - 'samples': Individual sample results (if log_samples=True)
# - 'n-shot': Number of few-shot examples per task
# - 'higher_is_better': Whether higher metric values are better
# - 'n-samples': Number of samples evaluated
# - 'configs': Per-task configurations
# - 'git_hash': Git commit hash
# - 'date': Timestamp of evaluation

if results:
    # Print task results
    if 'results' in results:
        task_results = results['results']
        if TASK_NAME in task_results:
            print(f"\nTask: {TASK_NAME}")
            print("-" * 60)
            
            task_metrics = task_results[TASK_NAME]
            for metric_name, metric_value in task_metrics.items():
                # Skip internal keys like 'alias' and 'samples'
                if metric_name in ['alias', 'samples']:
                    continue
                if isinstance(metric_value, dict):
                    # Handle nested metrics
                    print(f"\n{metric_name}:")
                    for sub_metric, value in metric_value.items():
                        if isinstance(value, (int, float)):
                            print(f"  {sub_metric}: {value:.4f}")
                        else:
                            print(f"  {sub_metric}: {value}")
                elif isinstance(metric_value, (int, float)):
                    print(f"{metric_name}: {metric_value:.4f}")
                else:
                    print(f"{metric_name}: {metric_value}")
    
    # Print configuration info
    if 'config' in results:
        print("\n" + "="*60)
        print("CONFIGURATION")
        print("="*60)
        config = results['config']
        print(f"Model: {config.get('model', 'N/A')}")
        print(f"Model args: {config.get('model_args', 'N/A')}")
        print(f"Tasks evaluated: {list(results.get('results', {}).keys())}")
        print(f"Batch size: {config.get('batch_size', 'N/A')}")
        print(f"Device: {config.get('device', 'N/A')}")
        print(f"Limit: {config.get('limit', 'N/A')}")
        print(f"Number of few-shot examples: {results.get('n-shot', {}).get(TASK_NAME, 'N/A')}")
    
    # Print additional info if available
    if 'n-samples' in results and TASK_NAME in results['n-samples']:
        print(f"\nSamples evaluated: {results['n-samples'][TASK_NAME]}")
else:
    print("No results returned! (This can happen in multi-GPU setups where rank != 0)")



EVALUATION RESULTS

Task: hellaswag
------------------------------------------------------------
acc,none: 0.3600
acc_stderr,none: 0.0686
acc_norm,none: 0.4600
acc_norm_stderr,none: 0.0712

CONFIGURATION
Model: hf
Model args: pretrained=EleutherAI/pythia-160m,dtype=float32
Tasks evaluated: ['hellaswag']
Batch size: 8
Device: cuda:0
Limit: 50
Number of few-shot examples: 0

Samples evaluated: {'original': 10042, 'effective': 50}


## Step 7: Analyze Individual Samples

Let's look at some individual examples to understand how the model is performing.


In [19]:
# Examine individual samples (if available)
if results and 'samples' in results:
    samples = results['samples']
    
    if TASK_NAME in samples and len(samples[TASK_NAME]) > 0:
        print(f"\n{'='*60}")
        print(f"SAMPLE RESULTS (showing first 3 examples)")
        print(f"{'='*60}\n")
        
        # Show first few samples
        for i, sample in enumerate(samples[TASK_NAME][:3]):
            print(f"\n--- Example {i+1} ---")
            
            # Print the input/prompt
            if 'doc' in sample:
                doc = sample['doc']
                print(f"\nPrompt/Context:")
                # HellaSwag specific fields
                if 'ctx' in doc:
                    print(f"  Context: {doc['ctx']}")
                if 'endings' in doc:
                    print(f"  Options:")
                    for j, ending in enumerate(doc['endings']):
                        print(f"    {chr(65+j)}. {ending}")
            
            # Print the correct answer
            if 'doc' in sample and 'label' in sample['doc']:
                correct_idx = sample['doc']['label']
                # Convert to int if it's a string (some tasks store labels as strings)
                try:
                    correct_idx = int(correct_idx)
                except (ValueError, TypeError):
                    # If conversion fails, try to find the index in endings
                    if 'endings' in sample['doc']:
                        # Label might be the actual text, find its index
                        try:
                            correct_idx = sample['doc']['endings'].index(correct_idx)
                        except (ValueError, TypeError):
                            print(f"\nCorrect answer: {correct_idx} (could not convert to index)")
                            correct_idx = None
                
                if correct_idx is not None and 'endings' in sample['doc']:
                    print(f"\nCorrect answer: {chr(65+correct_idx)} ({sample['doc']['endings'][correct_idx]})")
                elif 'endings' in sample['doc']:
                    print(f"\nCorrect answer: {sample['doc']['label']}")
            
            # Print model's prediction
            if 'resps' in sample:
                print(f"Model responses: {sample['resps']}")
            
            # Print if correct
            if 'filtered_resps' in sample:
                print(f"Filtered responses: {sample['filtered_resps']}")
            
            # Check if correct
            if 'acc' in sample or 'acc_norm' in sample:
                acc = sample.get('acc', sample.get('acc_norm', None))
                if acc is not None:
                    status = "✓ CORRECT" if acc == 1.0 else "✗ INCORRECT"
                    print(f"Result: {status}")
    else:
        print("No samples available for this task.")
else:
    print("Samples not logged. Set log_samples=True to see individual examples.")



SAMPLE RESULTS (showing first 3 examples)


--- Example 1 ---

Prompt/Context:
  Context: A man is sitting on a roof. he
  Options:
    A. is using wrap to wrap a pair of skis.
    B. is ripping level tiles off.
    C. is holding a rubik's cube.
    D. starts pulling up roofing on a roof.

Correct answer: D (starts pulling up roofing on a roof.)
Model responses: [[(-43.672725677490234, False)], [(-34.15314483642578, False)], [(-28.880355834960938, False)], [(-35.52629852294922, False)]]
Filtered responses: [(-43.672725677490234, False), (-34.15314483642578, False), (-28.880355834960938, False), (-35.52629852294922, False)]
Result: ✗ INCORRECT

--- Example 2 ---

Prompt/Context:
  Context: A lady walks to a barbell. She bends down and grabs the pole. the lady
  Options:
    A. swings and lands in her arms.
    B. pulls the barbell forward.
    C. pulls a rope attached to the barbell.
    D. stands and lifts the weight over her head.

Correct answer: D (stands and lifts the weight over 

## Step 8: Evaluate on Multiple Tasks

We can also evaluate on multiple tasks at once. Let's try a few simple tasks.


In [10]:
# Evaluate on multiple tasks
MULTIPLE_TASKS = ["hellaswag", "arc_easy", "piqa"]

print(f"\n{'='*60}")
print(f"Evaluating on multiple tasks: {', '.join(MULTIPLE_TASKS)}")
print(f"{'='*60}\n")

# Run evaluation on multiple tasks
multi_results = simple_evaluate(
    model="hf",
    model_args=f"pretrained={MODEL_NAME},dtype=float32",
    tasks=MULTIPLE_TASKS,
    device=DEVICE,
    batch_size=BATCH_SIZE,
    limit=20,  # Small limit for demo
    num_fewshot=0,
    verbosity="INFO",
)

# Display summary of results
if multi_results and 'results' in multi_results:
    print("\n" + "="*60)
    print("MULTI-TASK RESULTS SUMMARY")
    print("="*60)
    
    for task in MULTIPLE_TASKS:
        if task in multi_results['results']:
            task_metrics = multi_results['results'][task]
            # Extract accuracy metric (common across tasks)
            acc = None
            if 'acc' in task_metrics:
                acc = task_metrics['acc']
            elif 'acc_norm' in task_metrics:
                acc = task_metrics['acc_norm']
            
            if acc is not None:
                print(f"{task:20s}: {acc:.4f} ({acc*100:.2f}%)")
            else:
                print(f"{task:20s}: {task_metrics}")


2025-11-16:16:07:50 INFO     [evaluator:202] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-11-16:16:07:50 INFO     [evaluator:240] Initializing hf model, with arguments: {'pretrained': 'EleutherAI/pythia-160m', 'dtype': 'float32'}
2025-11-16:16:07:50 INFO     [models.huggingface:156] Using device 'cuda:0'



Evaluating on multiple tasks: hellaswag, arc_easy, piqa



2025-11-16:16:07:50 INFO     [models.huggingface:423] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
Generating train split: 100%|██████████| 2251/2251 [00:00<00:00, 495740.53 examples/s]
Generating test split: 100%|██████████| 2376/2376 [00:00<00:00, 664023.61 examples/s]
Generating validation split: 100%|██████████| 570/570 [00:00<00:00, 290139.96 examples/s]
Generating train split: 100%|██████████| 16113/16113 [00:00<00:00, 1450368.49 examples/s]
Generating validation split: 100%|██████████| 1838/1838 [00:00<00:00, 694891.90 examples/s]
Generating test split: 100%|██████████| 3084/3084 [00:00<00:00, 1112899.73 examples/s]
2025-11-16:16:08:17 INFO     [api.task:434] Building contexts for piqa on rank 0...
100%|██████████| 20/20 [00:00<00:00, 1682.90it/s]
2025-11-16:16:08:17 INFO     [api.task:434] Building contexts for arc_easy on rank 0...
100%|██████████| 20/20 [00:00<00:00, 1873.46it/s]
2025-11-16:16:08:17 INFO     [api.task:434] 


MULTI-TASK RESULTS SUMMARY
hellaswag           : {'alias': 'hellaswag', 'acc,none': 0.25, 'acc_stderr,none': 0.09933992677987828, 'acc_norm,none': 0.3, 'acc_norm_stderr,none': 0.10513149660756935}
arc_easy            : {'alias': 'arc_easy', 'acc,none': 0.5, 'acc_stderr,none': 0.11470786693528086, 'acc_norm,none': 0.25, 'acc_norm_stderr,none': 0.09933992677987828}
piqa                : {'alias': 'piqa', 'acc,none': 0.65, 'acc_stderr,none': 0.1094243309804831, 'acc_norm,none': 0.7, 'acc_norm_stderr,none': 0.10513149660756936}


## Step 10: Understanding the Results

### Key Metrics Explained:

1. **Accuracy (acc)**: The percentage of examples the model got correct
2. **Normalized Accuracy (acc_norm)**: Accuracy after normalizing the predictions
3. **Perplexity**: For language modeling tasks, lower is better

### Interpreting HellaSwag Results:

- **HellaSwag** is a commonsense reasoning task where the model must choose the best ending for a given context
- Random guessing would achieve ~25% accuracy (1 out of 4 choices)
- Good models typically achieve 70-90%+ accuracy

### Tips for Better Evaluation:

1. **Remove the limit**: Set `limit=None` for full evaluation on all test examples
2. **Try different models**: Experiment with larger models for better performance
3. **Try different tasks**: Each task tests different capabilities
4. **Use few-shot learning**: Set `num_fewshot=5` or higher to provide examples
5. **Adjust batch size**: Larger batch sizes are faster but require more memory


## Additional Examples

### Example 1: Using a Different Model

```python
# Evaluate GPT-2
results = simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2,dtype=float32",
    tasks=["hellaswag"],
    device="cuda:0",
    batch_size=8,
    limit=100,
)
```

### Example 2: Using Few-Shot Learning

```python
# Provide 5 examples in the prompt
results = simple_evaluate(
    model="hf",
    model_args=f"pretrained={MODEL_NAME},dtype=float32",
    tasks=["hellaswag"],
    device=DEVICE,
    batch_size=BATCH_SIZE,
    num_fewshot=5,  # 5-shot learning
    limit=100,
)
```

### Example 3: Using the CLI Instead

We can also run evaluations from the command line:

```bash
lm_eval --model hf \\
    --model_args pretrained=EleutherAI/pythia-160m,dtype=float32 \\
    --tasks hellaswag \\
    --device cuda:0 \\
    --batch_size 8 \\
    --limit 50
```
