### EleutherAI Evaluation Harness

EleutherAI describes the goal of the harness best in their own words:


>"...the LM Evaluation Harness, [is] a unifying framework that allows any causal language model to be tested on the same exact inputs and codebase. This provides a ground-truth location to evaluate new LLMs and saves practitioners time implementing few-shot evaluations repeatedly while ensuring that their results can be compared against previous work. The LM Eval Harness currently supports several different NLP tasks and model frameworks, all with a unified interface and task versioning for reproducibility."

Let's get started with a simple task called `hellaswag`!
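Before running anything, it helps to have a rough picture of how a multiple-choice task such as `hellaswag` is scored. The sketch below is an illustration under assumptions, not the harness's actual code: the context, endings, and log-probabilities are made up (they are not taken from the dataset or from any model), and `pick_ending` is a hypothetical helper. The idea is that the model assigns a log-likelihood to each candidate ending, and the prediction is the ending with the highest score, either raw or divided by the ending's length in characters (to our understanding, this mirrors the difference between the `acc` and `acc_norm` metrics the harness reports).

```python
def pick_ending(logprobs, endings, normalize=False):
    """Return the index of the highest-scoring ending.

    logprobs[i] is the (toy) log-likelihood of endings[i] given the context.
    With normalize=True each score is divided by the ending's character
    length, so long endings are not penalized just for being long.
    """
    scores = [
        lp / len(ending) if normalize else lp
        for lp, ending in zip(logprobs, endings)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up example in the style of a sentence-completion task.
context = "A man is standing on a ladder. He"
endings = [
    " starts painting the wall.",  # treated as the correct ending here
    " dives into a pool.",
    " eats the ladder.",
]
gold = 0

# Made-up log-likelihoods, NOT produced by a real model.
toy_logprobs = [-12.0, -11.0, -20.0]

raw_choice = pick_ending(toy_logprobs, endings)
norm_choice = pick_ending(toy_logprobs, endings, normalize=True)

print("raw prediction correct:       ", raw_choice == gold)
print("normalized prediction correct:", norm_choice == gold)
```

In this toy case the raw score prefers the shorter (wrong) ending, while the length-normalized score prefers the longer (correct) one, which is why the two metrics can disagree on the same model.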

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

First, we'll clone the EleutherAI repository so we can use its evaluation scripts.

In [None]:
!git clone https://github.com/EleutherAI/lm-evaluation-harness

Cloning into 'lm-evaluation-harness'...
remote: Enumerating objects: 10475, done.
remote: Counting objects: 100% (3020/3020), done.
remote: Compressing objects: 100% (439/439), done.
remote: Total 10475 (delta 2682), reused 2696 (delta 2564), pack-reused 7455
Receiving objects: 100% (10475/10475), 12.36 MiB | 19.99 MiB/s, done.
Resolving deltas: 100% (6936/6936), done.


Next, let's install the required dependencies.

In [None]:
%cd lm-evaluation-harness/

/content/lm-evaluation-harness


In [None]:
!pip install -q -e .

  Preparing metadata (setup.py) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done

These evaluations can take a long time to run!

The cell below is provided to show how an evaluation is launched, but you shouldn't run it yourself unless you have plenty of time to spare.
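If you only want to confirm that your setup works, one option is to evaluate on a small subset of examples. The sketch below assembles the same kind of `main.py` invocation used in this notebook and optionally appends a `--limit` flag. Treat that flag as an assumption: we believe the version of `main.py` used here accepts `--limit` to cap the number of examples per task, but check `python main.py --help` in your checkout before relying on it. `build_eval_command` is a hypothetical helper written for this illustration.

```python
import shlex


def build_eval_command(model_name, task, limit=None, device="cuda:0"):
    """Assemble a main.py invocation as a list of command-line arguments."""
    cmd = [
        "python", "main.py",
        "--model", "hf-causal",
        "--model_args", f"pretrained={model_name}",
        "--tasks", task,
        "--device", device,
    ]
    if limit is not None:
        # Assumption: --limit caps the number of examples per task.
        cmd += ["--limit", str(limit)]
    return cmd


# Quick smoke test on 50 examples instead of the full task.
cmd = build_eval_command("bigscience/bloomz-3b", "hellaswag", limit=50)
print(shlex.join(cmd))
```

Scores from a limited run are noisy and are not comparable to published numbers, so use this only to check that the pipeline runs end to end.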

In [None]:
!python main.py \
    --model hf-causal \
    --model_args pretrained=bigscience/bloomz-3b \
    --tasks hellaswag \
    --device cuda:0

Selected Tasks: ['hellaswag']
Using device 'cuda:0'
Downloading (…)lve/main/config.json: 100% 507/507 [00:00<00:00, 2.66MB/s]
Downloading (…)model.bin.index.json: 100% 26.8k/26.8k [00:00<00:00, 93.6MB/s]
Downloading shards:   0% 0/2 [00:00<?, ?it/s]

### Assignment Part 2:

Test your model on another task! The choice of task is up to you, but you'll need to explain what the task measures and determine the model's performance on it.

Again, this evaluation will take a long time to run.
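When you report the model's performance, it is worth stating the uncertainty alongside the score. The sketch below shows one common way to do that, assuming you have a list of per-example outcomes (1 for correct, 0 for incorrect): accuracy is the mean, and the standard error is the sample standard deviation divided by the square root of the number of examples. The outcomes here are toy values, not real results, and `accuracy_with_stderr` is a hypothetical helper; to our understanding the harness reports a comparable `acc_stderr` value next to `acc`, but check its output for your version.

```python
import math
import statistics


def accuracy_with_stderr(outcomes):
    """Return (accuracy, standard error of the mean) for 0/1 outcomes."""
    n = len(outcomes)
    accuracy = sum(outcomes) / n
    # Sample standard deviation (n - 1 in the denominator) over sqrt(n).
    stderr = statistics.stdev(outcomes) / math.sqrt(n)
    return accuracy, stderr


# Toy per-example outcomes, NOT real evaluation results.
toy_outcomes = [1, 1, 0, 1]

acc, stderr = accuracy_with_stderr(toy_outcomes)
print(f"acc = {acc:.2f} +/- {stderr:.2f}")
```

With only a handful of examples the standard error is large, which is one reason full evaluation runs take so long: the error shrinks only with the square root of the number of examples.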

In [1]:
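### YOUR CODE HERE
# Force a UTF-8 preferred encoding so shell commands work in Colab.
import locale
locale.getpreferredencoding = lambda: "UTF-8"

# Fetch the evaluation harness and move into the repository.
!git clone https://github.com/EleutherAI/lm-evaluation-harness

%cd lm-evaluation-harness/

# Install the harness and its dependencies in editable mode.
!pip install -q -e .

# Evaluate bloomz-3b on the arc_easy task using the first GPU.
!python main.py \
    --model hf-causal \
    --model_args pretrained=bigscience/bloomz-3b \
    --tasks arc_easy \
    --device cuda:0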

Cloning into 'lm-evaluation-harness'...
remote: Enumerating objects: 17419, done.
remote: Counting objects: 100% (3211/3211), done.
remote: Compressing objects: 100% (614/614), done.
remote: Total 17419 (delta 2906), reused 2664 (delta 2595), pack-reused 14208
Receiving objects: 100% (17419/17419), 19.76 MiB | 17.01 MiB/s, done.
Resolving deltas: 100% (11575/11575), done.
/content/lm-evaluation-harness
  Preparing metadata (setup.py) ... done