# Instructions
This notebook lets you load a Hugging Face model and evaluate it on a subset of BLiMP (a linguistic acceptability judgment dataset) and GLUE (a collection of natural language understanding benchmarks). We HIGHLY recommend cloning the GitHub repository and evaluating your model from the command line instead, since that gives you more freedom in the kinds of models you can evaluate. Colab does, however, provide a GPU that is sufficient for loading and evaluating smaller models.

To use this notebook:

1. Start by making a copy of this notebook so that you can make edits and run the code: File > Save a copy in Drive.

2. Set Runtime > Change runtime type > Hardware accelerator to GPU if it isn't already.

3. Run the setup script to install the packages required for evaluation.

4. Upload your model to the `/content/model_folder/` directory in Colab. This folder should include the following files, and probably a few more depending on the type of model and tokenizer you use:
* `config.json`
* `pytorch_model.bin`
* `tokenizer_config.json`
* `vocab.json`

  a. To obtain these files from your pre-trained model and tokenizer, load both with Hugging Face `transformers` and save them with these commands:
```
tokenizer.save_pretrained("./model_dir")
model.save_pretrained("./model_dir")
```
  b. Then, upload all the contents of `model_dir` (including any other files not mentioned above) to the `model_folder` folder in this Colab.

5. Choose the proper model type in the dropdown in the "load model and evaluate" cell. Use "decoder" for autoregressive (sometimes called "causal") language models, like GPT/OPT; "encoder" for masked language models, like BERT/RoBERTa; or "encoder-decoder" for text-to-text models, like T5/BART.

6. Run the cells below to load and evaluate your model.
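
As a quick sanity check before evaluating, the sketch below lists which of the files named in step 4 are missing from a folder. The helper name `missing_model_files` and the `REQUIRED_FILES` tuple are illustrative assumptions, not part of the evaluation pipeline; extend the tuple if your model or tokenizer needs additional files.

```python
from pathlib import Path

# Minimum set of files listed in step 4 (assumption: your tokenizer may
# need more, so add any extra file names here).
REQUIRED_FILES = (
    "config.json",
    "pytorch_model.bin",
    "tokenizer_config.json",
    "vocab.json",
)


def missing_model_files(folder, required=REQUIRED_FILES):
    """Return the names in `required` that are not regular files in `folder`."""
    folder = Path(folder)
    return [name for name in required if not (folder / name).is_file()]


# Example: check the upload directory used in this notebook.
missing = missing_model_files("/content/model_folder")
print("Missing files:", missing if missing else "none")
```

An empty list means every file in `REQUIRED_FILES` was found; it does not guarantee that the model will load.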

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import sys
sys.path.append('/content/drive/MyDrive/Colab Notebooks/')

In [None]:
%cd /content/drive/MyDrive/Colab Notebooks/

/content/drive/MyDrive/Colab Notebooks


In [None]:
%pwd

'/content/drive/MyDrive/Colab Notebooks'

In [None]:
#@title Setup script { display-mode: "form" }
#@markdown Run this cell to install the necessary packages (may take a few minutes).
%%shell
# Remove previous installation if it exists
cd /content
mkdir -p model_folder
pip uninstall -y lm-eval
rm -rf evaluation-pipeline/

# Install evaluation-pipeline
git clone -b colab https://github.com/babylm/evaluation-pipeline &> /dev/null
cd evaluation-pipeline/
pip install -e ".[colab]"
# Install other necessary packages
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113

# Unpack dataset
unzip filter_data.zip

Obtaining file:///content/evaluation-pipeline
  Preparing metadata (setup.py) ... done
Collecting accelerate@ git+https://github.com/huggingface/accelerate@main (from lm-eval==0.2.0)
  Cloning https://github.com/huggingface/accelerate (to revision main) to /tmp/pip-install-9iifhf38/accelerate_a09cef1fac23484aa9563400974b94e6
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/accelerate /tmp/pip-install-9iifhf38/accelerate_a09cef1fac23484aa9563400974b94e6
  Resolved https://github.com/huggingface/accelerate to commit 67308ca6ef1e99c20e59ab89b8dfa575e55f83fa
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting datasets>=2.0.0 (from lm-eval==0.2.0)
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)



# Evaluation

In [None]:
#@title Load model and evaluate (BLiMP)(curriculum learning) { display-mode: "form" }
model = "/content/drive/MyDrive/Colab Notebooks/babyberta_max_curriculum/checkpoint-160000" #@param {"type": "string"}
model_type = "encoder" #@param ["decoder", "encoder", "encoder-decoder"]
# file_name = "examples3.csv" #@param {"type": "string"}
# model_names = ["opt-125m", "opt-350m", "opt-1.3b", "opt-2.7b"] #@param {"type": "raw"}

%cd /content/evaluation-pipeline
%run /content/evaluation-pipeline/babylm_eval.py \
  "$model" \
  "$model_type" \
  -t "blimp"

/content/evaluation-pipeline


Generating train split: 0 examples [00:00, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/1956 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/1956 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 1956/1956 [00:00<00:00, 4874.77it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 3912/3912 [00:44<00:00, 88.42it/s]


anaphor_agreement:	64.62%


Generating train split: 0 examples [00:00, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/8248 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/8248 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 8248/8248 [00:02<00:00, 3875.47it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 16496/16496 [03:10<00:00, 86.76it/s]


argument_structure:	55.01%


Generating train split: 0 examples [00:00, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/6738 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/6738 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 6738/6738 [00:01<00:00, 4252.57it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 13476/13476 [02:40<00:00, 83.72it/s]


binding:	61.72%


Generating train split: 0 examples [00:00, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/4526 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/4526 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 4526/4526 [00:00<00:00, 6646.23it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 9052/9052 [01:50<00:00, 81.68it/s]


control_raising:	53.38%


Generating train split: 0 examples [00:00, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/7542 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/7542 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 7542/7542 [00:01<00:00, 3931.48it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 15084/15084 [02:56<00:00, 85.49it/s]


determiner_noun_agreement:	67.99%


Generating train split: 0 examples [00:00, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/1732 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/1732 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 1732/1732 [00:00<00:00, 6984.64it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 3464/3464 [00:44<00:00, 77.36it/s]


ellipsis:	44.92%


Generating train split: 0 examples [00:00, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/6426 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/6426 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 6426/6426 [00:01<00:00, 3462.08it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 12852/12852 [02:42<00:00, 78.91it/s]


filler_gap:	63.26%


Generating train split: 0 examples [00:00, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/1965 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/1965 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 1965/1965 [00:00<00:00, 6829.53it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 3930/3930 [00:48<00:00, 81.68it/s]


irregular_forms:	70.08%


Generating train split: 0 examples [00:00, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/2676 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/2676 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 2676/2676 [00:00<00:00, 4227.43it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 5352/5352 [01:05<00:00, 82.20it/s]


island_effects:	44.92%


Generating train split: 0 examples [00:00, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/6586 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/6586 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 6586/6586 [00:01<00:00, 5265.65it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 13172/13172 [02:40<00:00, 82.18it/s]


npi_licensing:	53.64%


Generating train split: 0 examples [00:00, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/3882 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/3882 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 3882/3882 [00:00<00:00, 4287.45it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 7764/7764 [01:32<00:00, 84.18it/s]


quantifiers:	54.71%


Generating train split: 0 examples [00:00, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/5535 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/5535 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 5535/5535 [00:00<00:00, 6626.20it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 11070/11070 [02:13<00:00, 82.75it/s]


subject_verb_agreement:	53.57%

Scores:
anaphor_agreement:	64.62%
argument_structure:	55.01%
binding:	61.72%
control_raising:	53.38%
determiner_noun_agreement:	67.99%
ellipsis:	44.92%
filler_gap:	63.26%
irregular_forms:	70.08%
island_effects:	44.92%
npi_licensing:	53.64%
quantifiers:	54.71%
subject_verb_agreement:	53.57%
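
To summarize a run in a single number, one option is an unweighted (macro) average over the twelve task scores printed above. The sketch below parses a copy of that `Scores:` block and averages it. The function names `parse_scores` and `macro_average` are illustrative assumptions, and the unweighted average is only one possible aggregate, not necessarily the one the evaluation pipeline reports.

```python
def parse_scores(text):
    """Parse lines of the form 'task_name:<whitespace>12.34%' into a dict."""
    scores = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skip headers such as 'Scores:' and any line without a percentage.
        if sep and value.endswith("%"):
            scores[name.strip()] = float(value[:-1])
    return scores


def macro_average(scores):
    """Unweighted mean over tasks (each task counts equally)."""
    return sum(scores.values()) / len(scores)


# Scores copied from the curriculum-learning run above.
SCORES_TEXT = """\
Scores:
anaphor_agreement:\t64.62%
argument_structure:\t55.01%
binding:\t61.72%
control_raising:\t53.38%
determiner_noun_agreement:\t67.99%
ellipsis:\t44.92%
filler_gap:\t63.26%
irregular_forms:\t70.08%
island_effects:\t44.92%
npi_licensing:\t53.64%
quantifiers:\t54.71%
subject_verb_agreement:\t53.57%
"""

curriculum_scores = parse_scores(SCORES_TEXT)
print(f"{len(curriculum_scores)} tasks, macro average: "
      f"{macro_average(curriculum_scores):.2f}%")
```

Because the task files differ in size (for example 1956 versus 8248 examples), a macro average weights small tasks as heavily as large ones.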


In [7]:
#@title Load model and evaluate (BLiMP)(random) { display-mode: "form" }
model = "/content/drive/MyDrive/Colab Notebooks/babyberta_max_random/checkpoint-160000" #@param {"type": "string"}
model_type = "encoder" #@param ["decoder", "encoder", "encoder-decoder"]
# file_name = "examples3.csv" #@param {"type": "string"}
# model_names = ["opt-125m", "opt-350m", "opt-1.3b", "opt-2.7b"] #@param {"type": "raw"}

%cd /content/evaluation-pipeline
%run /content/evaluation-pipeline/babylm_eval.py \
  "$model" \
  "$model_type" \
  -t "blimp"

/content/evaluation-pipeline


Generating train split:   0%|          | 0/1956 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/1956 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/1956 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 1956/1956 [00:00<00:00, 3521.44it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 3912/3912 [00:44<00:00, 87.08it/s]


anaphor_agreement:	65.49%


Generating train split:   0%|          | 0/8248 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/8248 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/8248 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 8248/8248 [00:01<00:00, 6626.37it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 16496/16496 [03:14<00:00, 84.76it/s]


argument_structure:	54.75%


Generating train split:   0%|          | 0/6738 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/6738 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/6738 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 6738/6738 [00:01<00:00, 5219.87it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 13476/13476 [02:46<00:00, 80.95it/s]


binding:	63.40%


Generating train split:   0%|          | 0/4526 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/4526 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/4526 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 4526/4526 [00:00<00:00, 5236.35it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 9052/9052 [01:55<00:00, 78.32it/s]


control_raising:	53.25%


Generating train split:   0%|          | 0/7542 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/7542 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/7542 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 7542/7542 [00:01<00:00, 6898.65it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 15084/15084 [02:54<00:00, 86.26it/s]


determiner_noun_agreement:	67.50%


Generating train split:   0%|          | 0/1732 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/1732 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/1732 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 1732/1732 [00:00<00:00, 6140.32it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 3464/3464 [00:45<00:00, 76.43it/s]


ellipsis:	45.73%


Generating train split:   0%|          | 0/6426 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/6426 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/6426 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 6426/6426 [00:00<00:00, 6542.06it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 12852/12852 [02:42<00:00, 79.27it/s]


filler_gap:	63.97%


Generating train split:   0%|          | 0/1965 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/1965 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/1965 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 1965/1965 [00:00<00:00, 5856.86it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 3930/3930 [00:47<00:00, 82.92it/s]


irregular_forms:	73.33%


Generating train split:   0%|          | 0/2676 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/2676 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/2676 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 2676/2676 [00:00<00:00, 6183.28it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 5352/5352 [01:05<00:00, 81.90it/s]


island_effects:	46.56%


Generating train split:   0%|          | 0/6586 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/6586 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/6586 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 6586/6586 [00:01<00:00, 5296.02it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 13172/13172 [02:37<00:00, 83.83it/s]


npi_licensing:	51.55%


Generating train split:   0%|          | 0/3882 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/3882 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/3882 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 3882/3882 [00:00<00:00, 4058.76it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 7764/7764 [01:31<00:00, 84.86it/s]


quantifiers:	67.00%


Generating train split:   0%|          | 0/5535 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/5535 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/5535 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 5535/5535 [00:01<00:00, 3994.18it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 11070/11070 [02:16<00:00, 81.27it/s]


subject_verb_agreement:	53.08%

Scores:
anaphor_agreement:	65.49%
argument_structure:	54.75%
binding:	63.40%
control_raising:	53.25%
determiner_noun_agreement:	67.50%
ellipsis:	45.73%
filler_gap:	63.97%
irregular_forms:	73.33%
island_effects:	46.56%
npi_licensing:	51.55%
quantifiers:	67.00%
subject_verb_agreement:	53.08%
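
With both the curriculum-learning and random-order runs available, a per-task difference makes the comparison easier to read. The sketch below hard-codes the two score lists printed above and reports the change from curriculum to random ordering. The variable names are illustrative assumptions, and each difference comes from a single evaluation run, so small gaps should be read with caution.

```python
# Scores (in percent) copied from the two BLiMP runs above.
curriculum = {
    "anaphor_agreement": 64.62, "argument_structure": 55.01,
    "binding": 61.72, "control_raising": 53.38,
    "determiner_noun_agreement": 67.99, "ellipsis": 44.92,
    "filler_gap": 63.26, "irregular_forms": 70.08,
    "island_effects": 44.92, "npi_licensing": 53.64,
    "quantifiers": 54.71, "subject_verb_agreement": 53.57,
}
random_order = {
    "anaphor_agreement": 65.49, "argument_structure": 54.75,
    "binding": 63.40, "control_raising": 53.25,
    "determiner_noun_agreement": 67.50, "ellipsis": 45.73,
    "filler_gap": 63.97, "irregular_forms": 73.33,
    "island_effects": 46.56, "npi_licensing": 51.55,
    "quantifiers": 67.00, "subject_verb_agreement": 53.08,
}

# Positive delta: the random-order model scored higher on that task.
deltas = {
    task: round(random_order[task] - curriculum[task], 2)
    for task in curriculum
}

# Task with the largest absolute gap between the two runs.
largest_gap = max(deltas, key=lambda task: abs(deltas[task]))

for task, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{task:28s} {delta:+.2f}")
print("Largest gap:", largest_gap)
```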


In [8]:
#@title Load model and evaluate (BLiMP)(babyBERTa-base model) { display-mode: "form" }
model = "/content/drive/MyDrive/Colab Notebooks/BabyBERTa-1" #@param {"type": "string"}
model_type = "encoder" #@param ["decoder", "encoder", "encoder-decoder"]
# file_name = "examples3.csv" #@param {"type": "string"}
# model_names = ["opt-125m", "opt-350m", "opt-1.3b", "opt-2.7b"] #@param {"type": "raw"}

%cd /content/evaluation-pipeline
%run /content/evaluation-pipeline/babylm_eval.py \
  "$model" \
  "$model_type" \
  -t "blimp"

/content/evaluation-pipeline


Generating train split:   0%|          | 0/1956 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/1956 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/1956 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 1956/1956 [00:00<00:00, 6550.01it/s]



» Running all `loglikelihood` requests


INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 3912/3912 [00:45<00:00, 85.22it/s]


anaphor_agreement:	69.73%


Generating train split:   0%|          | 0/8248 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/8248 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/8248 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 8248/8248 [00:01<00:00, 6453.88it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 16496/16496 [03:14<00:00, 84.66it/s]


argument_structure:	54.39%


Generating train split:   0%|          | 0/6738 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/6738 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/6738 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 6738/6738 [00:02<00:00, 3307.56it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 13476/13476 [02:44<00:00, 81.80it/s]


binding:	62.78%


Generating train split:   0%|          | 0/4526 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/4526 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/4526 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 4526/4526 [00:00<00:00, 5649.04it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 9052/9052 [01:54<00:00, 78.98it/s]


control_raising:	55.90%


Generating train split:   0%|          | 0/7542 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/7542 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/7542 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 7542/7542 [00:01<00:00, 5227.56it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 15084/15084 [02:59<00:00, 83.83it/s]


determiner_noun_agreement:	78.14%


Generating train split:   0%|          | 0/1732 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/1732 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/1732 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 1732/1732 [00:00<00:00, 5881.07it/s]


» Running all `loglikelihood` requests



INFO:lm_eval.evaluator:
» Running all `loglikelihood` requests
100%|██████████| 3464/3464 [00:47<00:00, 72.98it/s]


ellipsis:	51.62%


Generating train split:   0%|          | 0/6426 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs


INFO:lm_eval.evaluator:
» Assigning unique IDs to 'blimp_from_file+null' docs


Map:   0%|          | 0/6426 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'


INFO:lm_eval.evaluator:
» Filtering invalid docs from 'blimp_from_file+null'


Filter:   0%|          | 0/6426 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


INFO:lm_eval.evaluator:
» Constructing 'blimp_from_file+null' contexts and requests
100%|██████████| 6426/6426 [00:00<00:00, 6615.76it/s]


» Running all `loglikelihood` requests



100%|██████████| 12852/12852 [02:41<00:00, 79.56it/s]


filler_gap:	66.82%


Generating train split:   0%|          | 0/1965 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs




Map:   0%|          | 0/1965 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'




Filter:   0%|          | 0/1965 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


100%|██████████| 1965/1965 [00:00<00:00, 5985.19it/s]


» Running all `loglikelihood` requests



100%|██████████| 3930/3930 [00:46<00:00, 84.48it/s]


irregular_forms:	69.26%


Generating train split:   0%|          | 0/2676 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs




Map:   0%|          | 0/2676 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'




Filter:   0%|          | 0/2676 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


100%|██████████| 2676/2676 [00:00<00:00, 6334.49it/s]


» Running all `loglikelihood` requests



100%|██████████| 5352/5352 [01:08<00:00, 78.46it/s]


island_effects:	51.20%


Generating train split:   0%|          | 0/6586 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs




Map:   0%|          | 0/6586 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'




Filter:   0%|          | 0/6586 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


100%|██████████| 6586/6586 [00:01<00:00, 5226.32it/s]


» Running all `loglikelihood` requests



100%|██████████| 13172/13172 [02:42<00:00, 81.06it/s]


npi_licensing:	57.29%


Generating train split:   0%|          | 0/3882 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs




Map:   0%|          | 0/3882 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'




Filter:   0%|          | 0/3882 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


100%|██████████| 3882/3882 [00:00<00:00, 4056.62it/s]


» Running all `loglikelihood` requests



100%|██████████| 7764/7764 [01:33<00:00, 83.40it/s]


quantifiers:	60.25%


Generating train split:   0%|          | 0/5535 [00:00<?, ? examples/s]


» Assigning unique IDs to 'blimp_from_file+null' docs




Map:   0%|          | 0/5535 [00:00<?, ? examples/s]


» Filtering invalid docs from 'blimp_from_file+null'




Filter:   0%|          | 0/5535 [00:00<?, ? examples/s]


» Constructing 'blimp_from_file+null' contexts and requests


100%|██████████| 5535/5535 [00:01<00:00, 4787.41it/s]


» Running all `loglikelihood` requests



100%|██████████| 11070/11070 [02:16<00:00, 81.10it/s]


subject_verb_agreement:	55.03%

Scores:
anaphor_agreement:	69.73%
argument_structure:	54.39%
binding:	62.78%
control_raising:	55.90%
determiner_noun_agreement:	78.14%
ellipsis:	51.62%
filler_gap:	66.82%
irregular_forms:	69.26%
island_effects:	51.20%
npi_licensing:	57.29%
quantifiers:	60.25%
subject_verb_agreement:	55.03%
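
To summarize the per-category scores above in a single number, one option is the unweighted (macro) average across the 12 categories. The snippet below is a minimal sketch under that assumption: `blimp_scores` simply copies the printed values, and `macro_average` is a hypothetical helper, not part of the evaluation pipeline. The pipeline may aggregate differently.

```python
# Per-category BLiMP accuracies in percent, copied from the "Scores:" output above.
blimp_scores = {
    "anaphor_agreement": 69.73,
    "argument_structure": 54.39,
    "binding": 62.78,
    "control_raising": 55.90,
    "determiner_noun_agreement": 78.14,
    "ellipsis": 51.62,
    "filler_gap": 66.82,
    "irregular_forms": 69.26,
    "island_effects": 51.20,
    "npi_licensing": 57.29,
    "quantifiers": 60.25,
    "subject_verb_agreement": 55.03,
}


def macro_average(scores):
    """Unweighted mean over categories (hypothetical helper for illustration)."""
    return sum(scores.values()) / len(scores)


# Prints: BLiMP macro average: 61.03%
print(f"BLiMP macro average: {macro_average(blimp_scores):.2f}%")
```

Because the categories differ in size (for example, 1732 examples for `ellipsis` versus 7542 for `determiner_noun_agreement` in the logs above), an example-weighted average would generally give a different value.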


In [None]:
#@title Load model and evaluate ((Super)GLUE) (won't be used in this project) { display-mode: "form" }
#@markdown Run this cell to fine-tune your model on (Super)GLUE tasks.
#@markdown We provide some default hyperparameters that you may adjust.
model = "/content/drive/MyDrive/Colab Notebooks/checkpoint-160000" #@param {"type": "string"}
learning_rate = 5e-5 #@param {"type": "number"}
batch_size = 64 #@param {"type": "integer"}
eval_every = 200 #@param {"type": "integer"}
patience = 10 #@param {"type": "integer"}
max_epochs = 10 #@param {"type": "integer"}
seed = 12 #@param {"type": "integer"}
# file_name = "examples3.csv" #@param {"type": "string"}
# model_names = ["opt-125m", "opt-350m", "opt-1.3b", "opt-2.7b"] #@param {"type": "raw"}

%cd /content/evaluation-pipeline
!./finetune_all_tasks.sh \
    "$model" \
    "$learning_rate" \
    "$patience" \
    "$batch_size" \
    "$eval_every" \
    "$max_epochs" \
    "$seed"
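
The cell above passes the form values to `finetune_all_tasks.sh` as positional arguments, and their order (model, learning rate, patience, batch size, eval interval, max epochs, seed) differs from the order in which the form fields are declared. The sketch below only builds and prints the equivalent command so the ordering can be checked before running anything. `build_finetune_command` is a hypothetical helper introduced here for illustration; it is not part of the evaluation pipeline.

```python
import shlex


def build_finetune_command(model, learning_rate, patience, batch_size,
                           eval_every, max_epochs, seed,
                           script="./finetune_all_tasks.sh"):
    """Return the argv list in the same positional order used by the cell above.

    Hypothetical helper: it does not run the script, it only assembles the call.
    """
    return [
        script,
        str(model),
        str(learning_rate),
        str(patience),
        str(batch_size),
        str(eval_every),
        str(max_epochs),
        str(seed),
    ]


cmd = build_finetune_command(
    model="/content/drive/MyDrive/Colab Notebooks/checkpoint-160000",
    learning_rate=5e-5,
    patience=10,
    batch_size=64,
    eval_every=200,
    max_epochs=10,
    seed=12,
)

# shlex.join quotes arguments that contain spaces, such as the model path.
print(shlex.join(cmd))
```

Using keyword arguments at the call site makes it harder to swap two integers by accident, for example `patience` and `batch_size`, which sit next to each other in the positional list.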