This notebook shows how to evaluate LLMs with EleutherAI's LM Evaluation Harness (lm-eval), focusing on quantized LLMs and LoRA adapters. With the batch sizes used here, it requires at least 15 GB of VRAM. If you have less, reduce the batch size.
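
A back-of-the-envelope estimate suggests where most of that memory likely goes. The sketch below computes the size of the model weights alone for a given precision. The parameter count (7e9, a rounded "7B") and the helper name `weight_memory_gb` are assumptions for illustration; activations, the KV cache, and quantization metadata are ignored, so real usage is higher and grows with the batch size.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory taken by the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: "7B" rounded to 7e9 parameters; the exact count differs slightly.
N_PARAMS = 7e9

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 2 bytes per parameter
int4_gb = weight_memory_gb(N_PARAMS, 4)   # half a byte per parameter

print(f"float16 weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:   ~{int4_gb:.1f} GB")
```

Under these assumptions, the float16 weights alone take most of a 16 GB GPU, which would explain why the float16 run uses a smaller batch size than the 4-bit runs.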

First, install lm-eval, the package that provides the evaluation harness. Also install bitsandbytes to evaluate quantized models.


In [None]:
!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git
!pip install bitsandbytes

Collecting git+https://github.com/EleutherAI/lm-evaluation-harness.git
  Cloning https://github.com/EleutherAI/lm-evaluation-harness.git to /tmp/pip-req-build-oizipxlb
  Running command git clone --filter=blob:none --quiet https://github.com/EleutherAI/lm-evaluation-harness.git /tmp/pip-req-build-oizipxlb
  Resolved https://github.com/EleutherAI/lm-evaluation-harness.git to commit 65b8761db922513dada0320b860fabb1b4f01dc3
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting evaluate (from lm-eval==0.4.0)
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting peft>=0.2.0 (from lm-eval==0.4.0)
  Downloading peft-0.7.1-py3-none-any.whl (168 kB)

To evaluate gated or private models from the Hugging Face Hub, such as Llama 2, log in with your access token.

In [None]:
from huggingface_hub import notebook_login

notebook_login()


We can print the list of all the benchmarks available with the following command:

In [None]:
!lm-eval --tasks list

2023-12-19:08:58:54,533 INFO     [utils.py:160] NumExpr defaulting to 8 threads.
2023-12-19 08:58:55.356748: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-19 08:58:55.356796: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-19 08:58:55.358145: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-19:08:58:59,748 INFO     [__main__.py:155] Verbosity set to INFO
2023-12-19:08:59:08,404 INFO     [__main__.py:172] Available Tasks:
 - advanced_ai_risk
 - advanced_ai_risk_fewshot-coordinate-itself
 - advanced_ai_risk_fewshot-coordinate-other-ais
 - advanced_ai_risk_fewshot-

Evaluate Llama 2 7B on TruthfulQA, HellaSwag, and WinoGrande.

In [None]:
!lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype="float16" \
    --tasks truthfulqa,hellaswag,winogrande \
    --device cuda:0 \
    --batch_size 6 \
    --output_path ./eval_llama2_7b \
    --log_samples

The output stream was truncated and contains only the last 5000 lines.
2023-12-19:10:27:47,356 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:10:27:47,467 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:10:27:47,580 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:10:27:47,685 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:10:27:47,791 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:10:27:47,899 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:10:27:48,009 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:10:27:48,118 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:10:27:48,229 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:10:27:48,338 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:10:27:48,447 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:10:27:48,560 INFO     [rouge
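
With `--output_path`, the harness writes a JSON report of the scores, which can be read back in Python. The following is a rough sketch only: the exact file name and layout depend on the lm-eval version, so the `results*.json` pattern, the top-level `"results"` key, and metric names such as `acc,none` are assumptions. The helper names `load_results` and `flatten` are hypothetical, and the numbers in the demo are placeholders, not scores from this run.

```python
import json
import tempfile
from pathlib import Path


def load_results(output_dir):
    """Load the per-task metrics from the first results*.json under output_dir."""
    candidates = sorted(Path(output_dir).rglob("results*.json"))
    if not candidates:
        raise FileNotFoundError(f"no results*.json found under {output_dir}")
    with candidates[0].open() as f:
        report = json.load(f)
    # Assumption: metrics live under a top-level "results" key, keyed by task.
    return report["results"]


def flatten(results):
    """Turn {task: {metric: value}} into sorted (task, metric, value) rows,
    keeping numeric values only."""
    rows = []
    for task, metrics in sorted(results.items()):
        for name, value in sorted(metrics.items()):
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                rows.append((task, name, value))
    return rows


# Demo on a placeholder report (made-up numbers, NOT real scores).
placeholder = {
    "results": {
        "winogrande": {"acc,none": 0.5, "acc_stderr,none": 0.01, "alias": "winogrande"}
    }
}
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "results.json").write_text(json.dumps(placeholder))
    rows = flatten(load_results(tmp))

for task, metric, value in rows:
    print(f"{task:12s} {metric:18s} {value:.4f}")
```

For the run above, `load_results("./eval_llama2_7b")` would be the call to try, adjusting the file pattern if your version names the report differently.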

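Evaluate Llama 2 7B quantized with bitsandbytes NF4. Note that `load_in_4bit=True` alone relies on the default 4-bit data type, which is FP4 in Transformers' `BitsAndBytesConfig`; to be sure NF4 is used, `bnb_4bit_quant_type=nf4` can be added to `--model_args` (assuming your lm-eval version forwards it to the model loader).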

In [None]:
!lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf,load_in_4bit=True,dtype="float16" \
    --tasks truthfulqa,hellaswag,winogrande \
    --device cuda:0 \
    --batch_size 14 \
    --output_path ./eval_llama2_7b_nf4 \
    --log_samples

The output stream was truncated and contains only the last 5000 lines.
2023-12-19:14:23:20,656 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:14:23:20,764 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:14:23:20,871 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:14:23:20,977 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:14:23:21,086 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:14:23:21,194 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:14:23:21,306 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:14:23:21,411 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:14:23:21,522 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:14:23:21,631 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:14:23:21,736 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:14:23:21,841 INFO     [rouge

Evaluate Llama 2 7B quantized with bitsandbytes NF4, with a LoRA adapter fine-tuned on oasst-guanaco loaded on top.

In [None]:
!lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf,load_in_4bit=True,peft=kaitchup/Llama-2-7B-oasstguanaco-adapter,dtype="float16" \
    --tasks truthfulqa,hellaswag,winogrande \
    --device cuda:0 \
    --batch_size 14 \
    --output_path ./eval_llama2_7b_nf4_lora \
    --log_samples

The output stream was truncated and contains only the last 5000 lines.
2023-12-19:16:36:43,938 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:16:36:44,053 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:16:36:44,169 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:16:36:44,284 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:16:36:44,398 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:16:36:44,517 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:16:36:44,628 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:16:36:44,740 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:16:36:44,858 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:16:36:44,971 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:16:36:45,079 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:16:36:45,191 INFO     [rouge
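
The three commands so far differ only in `--model_args`, the batch size, and the output path. As a minimal sketch, they could be generated from one place in Python. The helpers `build_model_args` and `build_command` are hypothetical and not part of lm-eval; they only assemble the same flags used in the cells above, and nothing is executed here.

```python
import shlex


def build_model_args(**options):
    """Join keyword options into the key=value,key=value form used by --model_args."""
    return ",".join(f"{key}={value}" for key, value in options.items())


def build_command(model_args, tasks, batch_size, output_path, device="cuda:0"):
    """Assemble the lm_eval argument list with the flags used in this notebook."""
    return [
        "lm_eval", "--model", "hf",
        "--model_args", model_args,
        "--tasks", ",".join(tasks),
        "--device", device,
        "--batch_size", str(batch_size),
        "--output_path", output_path,
        "--log_samples",
    ]


BASE = "meta-llama/Llama-2-7b-hf"
ADAPTER = "kaitchup/Llama-2-7B-oasstguanaco-adapter"
TASKS = ["truthfulqa", "hellaswag", "winogrande"]

fp16_args = build_model_args(pretrained=BASE, dtype="float16")
nf4_args = build_model_args(pretrained=BASE, load_in_4bit=True, dtype="float16")
lora_args = build_model_args(
    pretrained=BASE, load_in_4bit=True, peft=ADAPTER, dtype="float16"
)

commands = [
    build_command(fp16_args, TASKS, 6, "./eval_llama2_7b"),
    build_command(nf4_args, TASKS, 14, "./eval_llama2_7b_nf4"),
    build_command(lora_args, TASKS, 14, "./eval_llama2_7b_nf4_lora"),
]

for command in commands:
    print(shlex.join(command))
```

Each list could then be passed to `subprocess.run(command, check=True)` outside a notebook, which avoids copying the shared flags by hand.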

To evaluate a model quantized with GPTQ, we need to install auto-gptq and optimum:

In [None]:
!pip install auto-gptq optimum

Collecting optimum
  Downloading optimum-1.16.1-py3-none-any.whl (403 kB)
Collecting coloredlogs (from optimum)
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)
Collecting humanfriendly>=9.1 (from coloredlogs->optimum)
  Downloading humanfriendly-10.0-py2.py3-none-any.whl (86 kB)
Installing collected packages: humanfriendly, coloredlogs, optimum
Successfully installed coloredlogs-15.0.1 humanfriendly-10.0 optimum-1.16.1


Evaluate Llama 2 7B quantized with GPTQ.

In [None]:
!lm_eval --model hf \
    --model_args pretrained=kaitchup/Llama-2-7b-gptq-4bit \
    --tasks truthfulqa,hellaswag,winogrande \
    --device cuda:0 \
    --batch_size 16 \
    --output_path ./eval_llama2_7b_gptq \
    --log_samples

The output stream was truncated and contains only the last 5000 lines.
2023-12-19:19:12:55,214 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:19:12:55,322 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:19:12:55,425 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:19:12:55,530 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:19:12:55,632 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:19:12:55,731 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:19:12:55,833 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:19:12:55,931 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:19:12:56,034 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:19:12:56,132 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:19:12:56,231 INFO     [rouge_scorer.py:83] Using default tokenizer.
2023-12-19:19:12:56,330 INFO     [rouge