## Diagnostic Task Evals

This notebook runs the diagnostic task evaluations for the T5 and MrT5 models reported in the paper.

In [4]:
import subprocess

def run_model_evals(tasks_and_models, model_class):
    """Run diagnostic_task_eval.py once per (task, model, seed) combination."""
    for task, models in tasks_and_models.items():
        for model, seed in models:
            # Pass arguments as a list so no shell parsing is involved
            command = [
                "python", "diagnostic_task_eval.py", task, model, model_class,
                "--random_seed", str(seed), "--hard_delete",
            ]
            subprocess.run(command)
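
As a hedged illustration of how each command is assembled, the sketch below builds the argument list for a single run and prints it without executing anything. The helper name `build_eval_command` and the dry-run loop are hypothetical; the positional arguments and the `--random_seed` and `--hard_delete` flags are the ones used in the cell above.

```python
import shlex

def build_eval_command(task, model, model_class, seed):
    # Hypothetical helper: mirrors the argument order used by run_model_evals
    return [
        "python", "diagnostic_task_eval.py", task, model, model_class,
        "--random_seed", str(seed), "--hard_delete",
    ]

# Dry run: print the commands that would be executed, without running them
example = {"vowel_removal": [("t5_simple_vowel_removal", 121)]}
for task, models in example.items():
    for model, seed in models:
        print(shlex.join(build_eval_command(task, model, "T5", seed)))
```

Printing the commands first is a cheap way to check task names, model names, and seeds before launching a full evaluation.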

### T5 Models

In [5]:
# Map each task to the (model name, random seed) pairs to evaluate
tasks_and_models = {
    "vowel_removal": [("t5_simple_vowel_removal", 121)],
    "contextual_vowel_removal": [("t5_contextual_vowel_removal", 11)],
    "merge_ABC": [("t5_merge_ABC", 241)],
}

run_model_evals(tasks_and_models, "T5")

Loading model...
vowel_removal
Path: /nlp/scr3/nlp/llms-in-llms/mrt5/models/vowel_removal/T5/t5_simple_vowel_removal_seed121/checkpoints/checkpoint-30000

Loading vowel_removal test dataset...


100%|█████████▉| 249/250 [00:41<00:00,  5.93it/s]


Eval cross entropy loss: 5.142786203714422e-05
Eval percent deleted tokens: 0.0
Eval new sequence length: 64.0
Eval token accuracy: 0.999987410263958
Eval sequence accuracy: 0.99934375
Examples evaluated: 32000

Loading model...
contextual_vowel_removal
Path: /nlp/scr3/nlp/llms-in-llms/mrt5/models/contextual_vowel_removal/T5/t5_contextual_vowel_removal_seed11/checkpoints/checkpoint-30000

Loading contextual_vowel_removal test dataset...


100%|█████████▉| 249/250 [00:43<00:00,  5.77it/s]


Eval cross entropy loss: 5.095320295799866e-05
Eval percent deleted tokens: 0.0
Eval new sequence length: 64.0
Eval token accuracy: 0.999982861869418
Eval sequence accuracy: 0.99909375
Examples evaluated: 32000

Loading model...
merge_ABC
Path: /nlp/scr3/nlp/llms-in-llms/mrt5/models/merge_ABC/T5/t5_merge_ABC_seed241/checkpoints/checkpoint-30000

Loading merge_ABC test dataset...


100%|█████████▉| 249/250 [00:42<00:00,  5.91it/s]


Eval cross entropy loss: 7.225487371078999e-05
Eval percent deleted tokens: 0.0
Eval new sequence length: 64.0
Eval token accuracy: 0.9999715122233978
Eval sequence accuracy: 0.9984375
Examples evaluated: 32000



### MrT5 Models

In [6]:
# Map each task to the (model name, random seed) pairs to evaluate
tasks_and_models = {
    "vowel_removal": [
        ("mrt5_simple_vowel_removal_1e-4", 429),
        ("mrt5_simple_vowel_removal_1e-3", 93),
    ],
    "contextual_vowel_removal": [
        ("mrt5_contextual_vowel_removal_1e-2", 934),
        ("mrt5_contextual_vowel_removal_1e-3", 510),
    ],
    "merge_ABC": [
        ("mrt5_merge_ABC_1e-2", 14),
        ("mrt5_merge_ABC_1.5e-2", 123),
    ],
}

run_model_evals(tasks_and_models, "MrT5")

Loading model...
vowel_removal
Path: /nlp/scr3/nlp/llms-in-llms/mrt5/models/vowel_removal/MrT5/mrt5_simple_vowel_removal_1e-4_seed429/checkpoints/checkpoint-30000

Loading vowel_removal test dataset...


100%|█████████▉| 249/250 [00:42<00:00,  5.85it/s]


Eval cross entropy loss: 4.533697633360134e-05
Eval percent deleted tokens: 18.58466796875
Eval new sequence length: 59.46
Eval token accuracy: 0.9999850047236116
Eval sequence accuracy: 0.99921875
Examples evaluated: 32000

Loading model...
vowel_removal
Path: /nlp/scr3/nlp/llms-in-llms/mrt5/models/vowel_removal/MrT5/mrt5_simple_vowel_removal_1e-3_seed93/checkpoints/checkpoint-30000

Loading vowel_removal test dataset...


100%|█████████▉| 249/250 [00:42<00:00,  5.84it/s]


Eval cross entropy loss: 5.423742860853053e-05
Eval percent deleted tokens: 20.14716796875
Eval new sequence length: 58.46
Eval token accuracy: 0.9999838014902892
Eval sequence accuracy: 0.9991875
Examples evaluated: 32000

Loading model...
contextual_vowel_removal
Path: /nlp/scr3/nlp/llms-in-llms/mrt5/models/contextual_vowel_removal/MrT5/mrt5_contextual_vowel_removal_1e-2_seed934/checkpoints/checkpoint-30000

Loading contextual_vowel_removal test dataset...


100%|█████████▉| 249/250 [00:43<00:00,  5.78it/s]


Eval cross entropy loss: 0.00013472470228134624
Eval percent deleted tokens: 18.96748046875
Eval new sequence length: 60.56
Eval token accuracy: 0.99994624392371
Eval sequence accuracy: 0.99715625
Examples evaluated: 32000

Loading model...
contextual_vowel_removal
Path: /nlp/scr3/nlp/llms-in-llms/mrt5/models/contextual_vowel_removal/MrT5/mrt5_contextual_vowel_removal_1e-3_seed510/checkpoints/checkpoint-30000

Loading contextual_vowel_removal test dataset...


100%|█████████▉| 249/250 [00:42<00:00,  5.82it/s]


Eval cross entropy loss: 0.00010874113667932761
Eval percent deleted tokens: 1.5625
Eval new sequence length: 63.0
Eval token accuracy: 0.9999592178636283
Eval sequence accuracy: 0.99784375
Examples evaluated: 32000

Loading model...
merge_ABC
Path: /nlp/scr3/nlp/llms-in-llms/mrt5/models/merge_ABC/MrT5/mrt5_merge_ABC_1e-2_seed14/checkpoints/checkpoint-30000

Loading merge_ABC test dataset...


100%|█████████▉| 249/250 [00:42<00:00,  5.82it/s]


Eval cross entropy loss: 0.00025213346126292893
Eval percent deleted tokens: 10.146875
Eval new sequence length: 61.996
Eval token accuracy: 0.9998968378415278
Eval sequence accuracy: 0.994375
Examples evaluated: 32000

Loading model...
merge_ABC
Path: /nlp/scr3/nlp/llms-in-llms/mrt5/models/merge_ABC/MrT5/mrt5_merge_ABC_1.5e-2_seed123/checkpoints/checkpoint-30000

Loading merge_ABC test dataset...


100%|█████████▉| 249/250 [00:42<00:00,  5.91it/s]


Eval cross entropy loss: 0.0004563363664965436
Eval percent deleted tokens: 17.371142578125
Eval new sequence length: 60.996
Eval token accuracy: 0.9998471618873562
Eval sequence accuracy: 0.991875
Examples evaluated: 32000
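
To compare runs side by side, the printed metrics can be collected into dictionaries. The following is a minimal sketch, assuming the eval output has been saved as text in the format shown above; the function name `parse_eval_metrics` and the regular expression are hypothetical and not part of `diagnostic_task_eval.py`.

```python
import re

# Matches lines such as "Eval token accuracy: 0.999987410263958"
METRIC_LINE = re.compile(r"^Eval (?P<name>.+?): (?P<value>\S+)$")

def parse_eval_metrics(log_text):
    """Return one dict of metrics per 'Loading model...' block in the log."""
    runs = []
    # The text before the first marker contains no run, so skip it
    for block in log_text.split("Loading model...")[1:]:
        metrics = {}
        for line in block.splitlines():
            match = METRIC_LINE.match(line.strip())
            if match:
                metrics[match.group("name")] = float(match.group("value"))
        runs.append(metrics)
    return runs

# Sample text copied from the first T5 run above
sample = """Loading model...
vowel_removal
Eval cross entropy loss: 5.142786203714422e-05
Eval percent deleted tokens: 0.0
Eval new sequence length: 64.0
Eval token accuracy: 0.999987410263958
Eval sequence accuracy: 0.99934375
Examples evaluated: 32000
"""
runs = parse_eval_metrics(sample)
print(runs[0]["sequence accuracy"])
```

Only lines starting with `Eval ` are captured, so the `Examples evaluated` count and the checkpoint path are ignored by this sketch.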

