# Language Model Evaluation using lm-eval and tinyBenchmarks

This notebook demonstrates how to evaluate a language model with the `lm-eval` library on `tinyGSM8k`, a benchmark from the `tinyBenchmarks` collection.

We will cover:
- Installing the necessary libraries.
- Listing available benchmarks.
- Setting up and running a model evaluation.
- Exploring the evaluation results.
- Manually examining sample predictions.

## Installing required packages

First, install the required packages: `lm_eval`, which runs the evaluation; `langdetect`, a dependency of some tasks; and `tinyBenchmarks`, which provides compact versions of popular benchmarks.

In [12]:
# Install lm_eval for language model evaluation and langdetect (a dependency of some tasks)
! pip install lm_eval langdetect -q
# Install tinyBenchmarks from the GitHub repository
! pip install git+https://github.com/felipemaiapolo/tinyBenchmarks

Collecting git+https://github.com/felipemaiapolo/tinyBenchmarks
  Cloning https://github.com/felipemaiapolo/tinyBenchmarks to /tmp/pip-req-build-ueqci1mo
  Running command git clone --filter=blob:none --quiet https://github.com/felipemaiapolo/tinyBenchmarks /tmp/pip-req-build-ueqci1mo
  Resolved https://github.com/felipemaiapolo/tinyBenchmarks to commit 9c7e20302301ad531bfdfd9a7288e6e916bf22e9
  Preparing metadata (setup.py) ... done
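
Because the first install command runs with `-q`, it prints nothing on success. A minimal sketch for confirming what was installed, using only the standard library, might look like the following. The helper name `installed_version` is hypothetical, and the sketch assumes the distribution names match the names passed to `pip` above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Assumption: these distribution names match the pip install commands above.
for name in ("lm_eval", "langdetect", "tinyBenchmarks"):
    print(name, installed_version(name) or "not installed")
```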


## List all available benchmarks

The `lm_eval` library supports a wide range of benchmarks, and the following command lists all of them so you can find the task name to pass to the evaluator. The list is long: Colab truncates the output below to its last 5000 lines.

In [2]:
# List all available evaluation tasks/benchmarks
! lm_eval --tasks list

Streaming output truncated to the last 5000 lines.
|global_mmlu_full_sw_professional_accounting                                           |lm_eval/tasks/global_mmlu/full/sw/global_mmlu_full_sw_professional_accounting.yaml                                                                              |multiple_choice      |
|global_mmlu_full_sw_professional_law                                                  |lm_eval/tasks/global_mmlu/full/sw/global_mmlu_full_sw_professional_law.yaml                                                                                     |multiple_choice      |
|global_mmlu_full_sw_professional_medicine                                             |lm_eval/tasks/global_mmlu/full/sw/global_mmlu_full_sw_professional_medicine.yaml                                                                                |multiple_choice      |
|global_mmlu_full_sw_professional_psychology                                           |lm_eval/tasks/global_mmlu/full/
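
Since the full listing runs to thousands of lines, it can be easier to filter task names in Python. The sketch below is illustrative only: `filter_tasks` is a hypothetical helper, and the example list stands in for the real task names, which (as an assumption about the installed `lm_eval` version) could be obtained from `lm_eval.tasks.TaskManager().all_tasks`.

```python
def filter_tasks(task_names, prefix="tiny"):
    """Return the sorted task names that start with `prefix` (case-insensitive)."""
    prefix = prefix.lower()
    return sorted(name for name in task_names if name.lower().startswith(prefix))


# Stand-in list for illustration. In the notebook, the names could come from
# lm_eval.tasks.TaskManager().all_tasks (an assumption, not verified here).
example_names = [
    "tinyGSM8k",
    "global_mmlu_full_sw_professional_law",
    "tinyMMLU",
    "gsm8k",
]

print(filter_tasks(example_names))
```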

## Model Evaluation Setup and Execution

Here, we set up the parameters for our evaluation, including the model we want to evaluate and the benchmark task. We use the `lm_eval.evaluator.simple_evaluate` function to run the evaluation.

In [3]:
import torch # Import the PyTorch library
from lm_eval import evaluator # Import the evaluator module from lm_eval
from joblib import dump # Import the dump function from joblib for saving results
from huggingface_hub import login # Import the login function from huggingface_hub
from google.colab import userdata # Import userdata to access Colab secrets


# Get the Hugging Face token from Colab secrets
HF_TOKEN = userdata.get('HF_TOKEN')

# Log in to Hugging Face to access models
login(HF_TOKEN)

# Define the configuration for the evaluation
config = {
    "model": "meta-llama/Llama-3.2-1B-Instruct", # Specify the model to evaluate
    "tasks": ["tinyGSM8k"], # Specify the benchmark task
    "batch_size": "2", # Define the batch size for evaluation
}

# Determine the device to use for evaluation (GPU, MPS, or CPU)
device = (
    "cuda" # Use CUDA if available
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu" # Use MPS if available, otherwise CPU
)

# Run the model evaluation
results = evaluator.simple_evaluate(
    model="hf", # Specify the model type as Hugging Face
    model_args=f"pretrained={config['model']},parallelize=True,trust_remote_code=True", # Model loading arguments; login() above already provides the token, so it is not embedded here (or in the saved results)
    tasks=config["tasks"], # The list of tasks to evaluate on
    device=device, # The device to use
    batch_size=config["batch_size"], # The batch size for evaluation
)

# Save the evaluation results to a file
dump(results, "evaluation_results.joblib")

        chat template is not applied. Recommend setting `apply_chat_template` (optionally `fewshot_as_multiturn`).


`torch_dtype` is deprecated! Use `dtype` instead!


Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

100%|██████████| 100/100 [00:00<00:00, 143.73it/s]
Running generate_until requests:   0%|          | 0/100 [00:00<?, ?it/s]
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Running generate_until requests: 100%|██████████| 100/100 [10:11<00:00,  6.12s/it]


['evaluation_results.joblib']
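
The log above warns that the chat template is not applied, which matters for an instruction-tuned model. The sketch below shows one way to assemble the arguments for a rerun with the template enabled. `build_model_args` is a hypothetical helper, and the `apply_chat_template` keyword is assumed to be accepted by `simple_evaluate` in the installed `lm_eval` version, as the warning suggests. The actual call is left commented out because it downloads and runs the model.

```python
def build_model_args(options):
    """Join a dict of loader options into a comma-separated key=value string."""
    parts = []
    for key, value in options.items():
        key, value = str(key).strip(), str(value).strip()
        # Commas separate arguments, so they cannot appear inside a key or value.
        if "," in key or "," in value or "=" in key:
            raise ValueError(f"cannot encode {key!r}={value!r} in a model_args string")
        parts.append(f"{key}={value}")
    return ",".join(parts)


model_args = build_model_args({
    "pretrained": "meta-llama/Llama-3.2-1B-Instruct",
    "parallelize": True,
    "trust_remote_code": True,
})
print(model_args)

eval_kwargs = {
    "model": "hf",
    "model_args": model_args,
    "tasks": ["tinyGSM8k"],
    "batch_size": 2,
    "apply_chat_template": True,  # assumption: supported by the installed lm_eval
}
# results_chat = evaluator.simple_evaluate(**eval_kwargs)
```

Building the string from a dict avoids stray spaces around keys, which are easy to introduce when the string is written by hand.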

In [13]:
# Display the keys available in the results dictionary
results.keys()

dict_keys(['results', 'group_subtasks', 'configs', 'versions', 'n-shot', 'higher_is_better', 'n-samples', 'samples', 'config', 'git_hash', 'date', 'pretty_env_info', 'transformers_version', 'lm_eval_version', 'upper_git_hash', 'tokenizer_pad_token', 'tokenizer_eos_token', 'tokenizer_bos_token', 'eot_token_id', 'max_length'])

In [7]:
results["results"]

{'tinyGSM8k': {'alias': 'tinyGSM8k',
  'exact_match,strict-match': np.float64(0.3901494861944388),
  'exact_match_stderr,strict-match': 'N/A',
  'exact_match,flexible-extract': np.float64(0.3901494861944388),
  'exact_match_stderr,flexible-extract': 'N/A'}}
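
Each key in this dictionary combines a metric name and a filter name separated by a comma. As a rough sketch, the entries can be flattened into one row per metric and filter pair for easier reading. `flatten_results` is a hypothetical helper, and the sample dictionary below copies the values printed above, with plain floats in place of `np.float64`.

```python
def flatten_results(results_block):
    """Turn {'task': {'metric,filter': value, ...}} into a list of row dicts."""
    rows = []
    for task, metrics in results_block.items():
        for key, value in metrics.items():
            # Skip the display alias and the stderr entries (attached to rows below).
            if key == "alias" or "_stderr," in key:
                continue
            metric, _, filter_name = key.partition(",")
            stderr = metrics.get(f"{metric}_stderr,{filter_name}", "N/A")
            rows.append({
                "task": task,
                "metric": metric,
                "filter": filter_name,
                "value": float(value),
                "stderr": stderr,
            })
    return rows


# Values copied from the output above.
sample_results = {
    "tinyGSM8k": {
        "alias": "tinyGSM8k",
        "exact_match,strict-match": 0.3901494861944388,
        "exact_match_stderr,strict-match": "N/A",
        "exact_match,flexible-extract": 0.3901494861944388,
        "exact_match_stderr,flexible-extract": "N/A",
    }
}

for row in flatten_results(sample_results):
    print(f"{row['task']:<10} {row['metric']:<12} {row['filter']:<17} {row['value']:.4f} (stderr: {row['stderr']})")
```

In the notebook, the same helper could be called as `flatten_results(results["results"])`.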

## Examine Samples Manually

An aggregate score hides how the model fails, so it is worth reading individual samples and comparing the model's generated answer with the target answer.

In [8]:
def print_sample(i, task="tinyGSM8k"):
    """Print the question, reference solution, and raw model response for sample i."""
    sample = results["samples"][task][i]
    print(f"""
  Question: {sample['doc']['question']}\n\n
  Target: {sample['target']}\n\n
  Answer: {sample['resps'][0][0]}
  """)

In [9]:
print_sample(0)


  Question: Rory orders 2 subs for $7.50 each, 2 bags of chips for $1.50 each and 2 cookies for $1.00 each for delivery.  There’s a 20% delivery fee added at check out and she wants to add a $5.00 tip.  What will her delivery order cost?


  Target: 2 subs are $7.50 each so that’s 2*7.50 = $<<2*7.5=15.00>>15.00
2 bags of chips are $1.50 each so that’s 2*1.50 = $<<2*1.50=3.00>>3.00
2 cookies are $1.00 each so that’s 2*1 = $<<2*1=2.00>>2.00
Her delivery order will be 15+3+2= $<<15+3+2=20.00>>20.00
There’s a 20% delivery fee on the $20.00 which adds .20*20 = $4.00 to her bill
The delivery order is $20.00, there’s a $4.00 delivery fee and she adds a $5.00 tip for a total of 20+4+5 = $<<20+4+5=29.00>>29.00
#### 29


  Answer:  Rory will spend $7.50 x 2 = $15.00 on subs.
She will spend $1.50 x 2 = $3.00 on chips.
She will spend $1.00 x 2 = $2.00 on cookies.
So, the total cost of the subs, chips, and cookies is $15.00 + $3.00 + $2.00 = $20.00.
The delivery fee is 20% of $20.00, which is 0.20

In [10]:
print_sample(55)


  Question: Miguel uses 2 pads of paper a week for his drawing. If there are 30 sheets of paper on a pad of paper, how many sheets of paper does he use every month?


  Target: Miguel uses 30 x 2 = <<30*2=60>>60 sheets of paper every week.
Therefore, he uses 60 x 4 = <<60*4=240>>240 sheets of paper every month.
#### 240


  Answer:  Miguel uses 2 pads a week, so 2 pads/week * 52 weeks/year = 104 pads/year.
104 pads/year * 12 months/year = 1,248 sheets/year.
1,248 sheets/year * 12 months/year = 14,784 sheets/year.
#### 14,784
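
The targets end with a `#### <number>` line, and sample 55 shows the model imitating that format with a wrong value. A minimal sketch for comparing final answers programmatically follows. The helper names are hypothetical, and this is a simplified stand-in for illustration, not the scoring that `lm_eval` itself applies.

```python
import re

# Matches integers and decimals with an optional sign and thousands separators.
NUMBER = r"-?\d[\d,]*(?:\.\d+)?"


def extract_final_number(text):
    """Return the number after '####' if present, else the last number in the text, else None."""
    marked = re.search(r"####\s*\$?(" + NUMBER + ")", text)
    if marked:
        raw = marked.group(1)
    else:
        candidates = re.findall(NUMBER, text)
        if not candidates:
            return None
        raw = candidates[-1]
    return float(raw.replace(",", ""))


def same_final_answer(target, response):
    """True when both texts yield a final number and the two numbers are equal."""
    expected = extract_final_number(target)
    predicted = extract_final_number(response)
    return expected is not None and expected == predicted


# Last lines of the target and model answer from sample 55 above.
target = "Therefore, he uses 60 x 4 = <<60*4=240>>240 sheets of paper every month.\n#### 240"
response = "1,248 sheets/year * 12 months/year = 14,784 sheets/year.\n#### 14,784"

print(extract_final_number(target), extract_final_number(response))
print(same_final_answer(target, response))
```

For these two strings the extracted values are 240.0 and 14784.0, so the comparison reports a mismatch, which agrees with reading the sample by eye.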


  
