## Installation

### Basic Installation
```bash
pip install eval-framework
```

### With HuggingFace Support
```bash
pip install "eval-framework[transformers,accelerate]"
```

### With All Features
```bash
pip install "eval-framework[all]"
```

**Includes**: OpenAI, vLLM, Mistral, COMET metrics, and more
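
To confirm that any of the installs above worked, one quick check is whether the `eval_framework` command (the CLI name used in the Quick Start) is on `PATH`. This is a minimal sketch; the helper name `have_cli` is illustrative and not part of the package.

```bash
# Return success if the named command is reachable on PATH.
have_cli() {
    command -v "$1" >/dev/null 2>&1
}

if have_cli eval_framework; then
    echo "eval_framework found at: $(command -v eval_framework)"
else
    echo "eval_framework not found on PATH; check that the right virtual environment is active" >&2
fi
```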

## Quick Start: Running an Evaluation

A simple evaluation of MMLU with the HuggingFace model SmolLM-360M-Instruct, using 5 few-shot examples and 10 samples:

```bash
eval_framework \
    --llm-name eval_framework.llm.huggingface.HFLLM_from_name \
    --llm-args model_name="HuggingFaceTB/SmolLM-360M-Instruct" \
    --task-name "MMLU" \
    --output-dir ./eval \
    --num-fewshot 5 \
    --num-samples 10
```

## Output Structure

After an evaluation finishes, the output directory contains:

```
./eval/
├── aggregated_results.json    # Final metrics
├── results.jsonl              # Per-sample metrics
├── output.jsonl               # Model completions
├── metadata.json              # Run configuration
└── evaluation.log             # Detailed logs
```
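
A quick way to sanity-check a finished run is to confirm that each expected file exists and to count its lines (for the `.jsonl` files, one record per line is assumed). The sketch below relies only on the file names listed above; `summarize_run` is a hypothetical helper, not part of eval-framework, and it assumes the files sit directly in the output directory.

```bash
# Report, for each expected output file, its line count or that it is missing.
summarize_run() {
    run_dir="${1:-./eval}"
    for f in aggregated_results.json results.jsonl output.jsonl metadata.json evaluation.log; do
        if [ -f "$run_dir/$f" ]; then
            # tr strips the padding some wc implementations add
            printf '%-26s %s lines\n' "$f" "$(wc -l < "$run_dir/$f" | tr -d ' ')"
        else
            printf '%-26s missing\n' "$f"
        fi
    done
}

summarize_run ./eval
```

The aggregated metrics can then be pretty-printed with `python -m json.tool ./eval/aggregated_results.json`.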






## Eval-Framework Features

### 1. **Flexible Model Support**
```bash
# HuggingFace Transformers
--llm-name eval_framework.llm.huggingface.HFLLM_from_name \
--llm-args model_name="meta-llama/Llama-3.2-1B-Instruct"

# Predefined models from registry
--llm-name Smollm135MInstruct

# vLLM for high-performance inference (up to 10-20x faster, depending on model and hardware)
--llm-name eval_framework.llm.vllm.VLLM \
--llm-args model_name="meta-llama/Llama-3.1-8B-Instruct"

# OpenAI API models
--llm-name eval_framework.llm.openai.OpenAIModel \
--llm-args model_name="gpt-4o"

# Mistral AI models
--llm-name eval_framework.llm.mistral.MistralModel \
--llm-args model_name="mistral-large-latest"
```

### 2. **Built-in Benchmarks**
```bash
# Math & Reasoning: GSM8K, MATH, MMLU (57 subjects), ARC, HellaSwag
--task-name "GSM8K"

# Code Generation: HumanEval, MBPP, BigCodeBench
--task-name "HumanEval"

# Long Context: InfiniteBench, ZeroScrolls, QUALITY
--task-name "InfiniteBench_CodeDebug"

# Translation: WMT14/16/20, Flores200, FloresPlus
--task-name "WMT20"

# Multilingual: BELEBELE (122 languages), MMLU-DE, ARC-FI
--task-name "BELEBELE"

# Safety: TruthfulQA, Instruction Following (IFEval)
--task-name "TruthfulQA"
```

### 3. **Few-Shot Learning**
```bash
--num-fewshot 5  # Add 5 examples to each prompt
```
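
To see how the number of few-shot examples affects a score, the same task can be run with several `--num-fewshot` values, each into its own output directory. The following is a sketch built only from flags shown in this document. The `run` helper, the `DRY_RUN` variable, the directory names, and the assumption that `--num-fewshot 0` is accepted as a zero-shot baseline are illustrative and not confirmed by the framework's documentation. By default it only prints the commands.

```bash
# Print the command when DRY_RUN=1 (default); execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One run per few-shot setting, each with its own output directory.
for n in 0 1 5; do
    run eval_framework \
        --llm-name eval_framework.llm.huggingface.HFLLM_from_name \
        --llm-args model_name="HuggingFaceTB/SmolLM-360M-Instruct" \
        --task-name "GSM8K" \
        --output-dir "./eval/fewshot_${n}" \
        --num-fewshot "$n" \
        --num-samples 10
done
```

Setting `DRY_RUN=0` in the environment before running the loop executes the commands instead of printing them.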

### 4. **LLM-as-Judge Metrics**
```bash
# Evaluate with GPT-4o as judge
--llm-name eval_framework.llm.huggingface.HFLLM_from_name \
--llm-args model_name="meta-llama/Llama-3.2-1B-Instruct" \
--judge-model-name eval_framework.llm.openai.OpenAIModel \
--judge-model-args model_name="gpt-4o"
```

### 5. **Batch Processing**
```bash
# Parallel batch processing
--batch-size 8

# Multi-GPU inference (vLLM)
--llm-args tensor_parallel_size=4
```

### 6. **Experiment Tracking & Reproducibility**
```bash
# Weights & Biases integration
--wandb-project "llm-evaluations" \
--wandb-entity "my-team" \
--description "Baseline Llama-3.2-1B evaluation"

# HuggingFace Hub uploads
--hf-upload-repo "my-org/eval-results" \
--hf-upload-dir "experiment-1"
```

### 7. **Evaluation Options**
```bash
# Control generation parameters
--max-tokens 256 \
--llm-args sampling_params.temperature=0.7 \
--llm-args sampling_params.top_p=0.9

# Filter specific subjects (e.g., MMLU math only)
--task-subjects "abstract_algebra" "college_mathematics"
```


### 8. **Robustness Testing**
```bash
# Add character-level perturbations
--perturbation-type "replace" \
--perturbation-probability 0.05 \
--perturbation-seed 42

# Available: editor, permute, replace, delete, uppercase
```
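
To compare robustness across the perturbation types listed above, the same flags can be generated for each type with a fixed probability and seed, so that only the type varies between runs. This is a minimal sketch: `perturb_flags` and the `./eval/perturb_<type>` directory names are hypothetical, the model and task flags are left as a placeholder, and the loop only prints command templates rather than running anything.

```bash
# Print the perturbation flags for one type.
# $1 = type, $2 = probability (default 0.05), $3 = seed (default 42)
perturb_flags() {
    printf -- '--perturbation-type %s --perturbation-probability %s --perturbation-seed %s\n' \
        "$1" "${2:-0.05}" "${3:-42}"
}

# One command template per available perturbation type.
for ptype in editor permute replace delete uppercase; do
    echo "eval_framework <model and task flags> --output-dir ./eval/perturb_${ptype} $(perturb_flags "$ptype")"
done
```

Keeping the seed and probability constant across types makes differences in the resulting metrics easier to attribute to the perturbation type itself.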