## Diagnostic Task Evals

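This notebook runs the diagnostic task evals for the T5 and MrT5 models reported in the paper. Each eval invokes `diagnostic_task_eval.py` as a subprocess and prints the test-set loss, the percentage of deleted tokens, the resulting sequence length, and the token and sequence accuracy.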

In [1]:
import subprocess

def run_model_evals(tasks_and_models, model_class, random_seed=59):
    """Run diagnostic_task_eval.py once for every (task, model) pair."""
    for task, models in tasks_and_models.items():
        for model in models:
            # Pass the arguments as a list so no shell quoting is involved
            command = [
                "python", "diagnostic_task_eval.py",
                task, model, model_class,
                "--random_seed", str(random_seed),
            ]
            # check=True raises CalledProcessError if an eval run fails,
            # instead of silently moving on to the next model
            subprocess.run(command, check=True)
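
The commands that `run_model_evals` issues can be previewed before launching any long-running eval. The following is a minimal sketch, assuming a hypothetical helper `build_eval_commands` (not part of the notebook) that mirrors the positional arguments and the `--random_seed` flag used above:

```python
def build_eval_commands(tasks_and_models, model_class, random_seed=59):
    """Return one argument list per (task, model) pair without running it.

    Hypothetical helper for illustration only; it mirrors the arguments
    that run_model_evals passes to diagnostic_task_eval.py.
    """
    commands = []
    for task, models in tasks_and_models.items():
        for model in models:
            commands.append([
                "python", "diagnostic_task_eval.py",
                task, model, model_class,
                "--random_seed", str(random_seed),
            ])
    return commands


if __name__ == "__main__":
    preview = build_eval_commands(
        {"vowel_removal": ["t5_simple_vowel_removal"]}, "T5"
    )
    for cmd in preview:
        # Print each command as it would appear on the command line
        print(" ".join(cmd))
```

Printing the list first makes it easy to spot a misspelled task or model name before spending about a minute and a half on each eval run.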

### T5 Models

In [2]:
# Map each diagnostic task to the T5 models to evaluate on it
tasks_and_models = {
    "vowel_removal": ["t5_simple_vowel_removal"],
    "contextual_vowel_removal": ["t5_contextual_vowel_removal"],
    "merge_ABC": ["t5_merge_ABC"],
}

run_model_evals(tasks_and_models, "T5", random_seed=59)

Loading model...
vowel_removal
Path: /nlp/scr3/nlp/llms-in-llms/mrt5/models/vowel_removal/T5/t5_simple_vowel_removal_seed59/checkpoints/checkpoint-20000

Loading vowel_removal test dataset...


100%|█████████▉| 249/250 [01:28<00:00,  2.81it/s]


Eval cross entropy loss: 0.0005509855701238849
Eval percent deleted tokens: 0.0
Eval new sequence length: 128.0
Eval token accuracy: 0.99972265625
Eval sequence accuracy: 0.96534375
Examples evaluated: 32000

Loading model...
contextual_vowel_removal
Path: /nlp/scr3/nlp/llms-in-llms/mrt5/models/contextual_vowel_removal/T5/t5_contextual_vowel_removal_seed59/checkpoints/checkpoint-20000

Loading contextual_vowel_removal test dataset...


100%|█████████▉| 249/250 [01:30<00:00,  2.76it/s]


Eval cross entropy loss: 0.00017452824787233113
Eval percent deleted tokens: 0.0
Eval new sequence length: 128.0
Eval token accuracy: 0.999933349609375
Eval sequence accuracy: 0.99153125
Examples evaluated: 32000

Loading model...
merge_ABC
Path: /nlp/scr3/nlp/llms-in-llms/mrt5/models/merge_ABC/T5/t5_merge_ABC_seed59/checkpoints/checkpoint-20000

Loading merge_ABC test dataset...


100%|█████████▉| 249/250 [01:29<00:00,  2.79it/s]


Eval cross entropy loss: 0.00026714385767627393
Eval percent deleted tokens: 0.0
Eval new sequence length: 128.0
Eval token accuracy: 0.99987744140625
Eval sequence accuracy: 0.9844375
Examples evaluated: 32000



### MrT5 Models

In [3]:
# Map each diagnostic task to the MrT5 models to evaluate on it
tasks_and_models = {
    "vowel_removal": [
        "mrt5_simple_vowel_removal_0.0",
        "mrt5_simple_vowel_removal_0.0001",
    ],
    "contextual_vowel_removal": [
        "mrt5_contextual_vowel_removal_0.001",
        "mrt5_contextual_vowel_removal_0.001_L7",
        "mrt5_contextual_vowel_removal_0.001_L8",
    ],
    "merge_ABC": [
        "mrt5_merge_ABC_0.001",
        "mrt5_merge_ABC_0.001_L6",
        "mrt5_merge_ABC_0.001_L7",
    ],
}

run_model_evals(tasks_and_models, "MrT5", random_seed=59)

Loading model...
vowel_removal
Path: /nlp/scr3/nlp/llms-in-llms/mrt5/models/vowel_removal/MrT5/mrt5_simple_vowel_removal_0.0_seed59/checkpoints/checkpoint-20000

Loading vowel_removal test dataset...


100%|█████████▉| 249/250 [01:28<00:00,  2.81it/s]


Eval cross entropy loss: 9.813663885506685e-05
Eval percent deleted tokens: 18.9471435546875
Eval new sequence length: 114.564
Eval token accuracy: 0.999973388671875
Eval sequence accuracy: 0.9966875
Examples evaluated: 32000

Loading model...
vowel_removal
Path: /nlp/scr3/nlp/llms-in-llms/mrt5/models/vowel_removal/MrT5/mrt5_simple_vowel_removal_0.0001_seed59/checkpoints/checkpoint-20000

Loading vowel_removal test dataset...


100%|█████████▉| 249/250 [01:24<00:00,  2.94it/s]


Eval cross entropy loss: 0.008236341631039977
Eval percent deleted tokens: 51.1510986328125
Eval new sequence length: 76.956
Eval token accuracy: 0.997520263671875
Eval sequence accuracy: 0.781625
Examples evaluated: 32000

Loading model...
contextual_vowel_removal
Path: /nlp/scr3/nlp/llms-in-llms/mrt5/models/contextual_vowel_removal/MrT5/mrt5_contextual_vowel_removal_0.001_seed59/checkpoints/checkpoint-20000

Loading contextual_vowel_removal test dataset...


100%|█████████▉| 249/250 [01:31<00:00,  2.73it/s]


Eval cross entropy loss: 0.0003811254109177753
Eval percent deleted tokens: 1.5625
Eval new sequence length: 126.0
Eval token accuracy: 0.999878173828125
Eval sequence accuracy: 0.9844375
Examples evaluated: 32000

Loading model...
contextual_vowel_removal
Path: /nlp/scr3/nlp/llms-in-llms/mrt5/models/contextual_vowel_removal/MrT5/mrt5_contextual_vowel_removal_0.001_L7_seed59/checkpoints/checkpoint-20000

Loading contextual_vowel_removal test dataset...


100%|█████████▉| 249/250 [01:30<00:00,  2.76it/s]


Eval cross entropy loss: 0.0010879925589688355
Eval percent deleted tokens: 18.51201171875
Eval new sequence length: 120.16
Eval token accuracy: 0.999718994140625
Eval sequence accuracy: 0.964875
Examples evaluated: 32000

Loading model...
contextual_vowel_removal
Path: /nlp/scr3/nlp/llms-in-llms/mrt5/models/contextual_vowel_removal/MrT5/mrt5_contextual_vowel_removal_0.001_L8_seed59/checkpoints/checkpoint-20000

Loading contextual_vowel_removal test dataset...


100%|█████████▉| 249/250 [01:31<00:00,  2.71it/s]


Eval cross entropy loss: 0.0015164393637678586
Eval percent deleted tokens: 18.51201171875
Eval new sequence length: 120.152
Eval token accuracy: 0.99959228515625
Eval sequence accuracy: 0.9491875
Examples evaluated: 32000

Loading model...
merge_ABC
Path: /nlp/scr3/nlp/llms-in-llms/mrt5/models/merge_ABC/MrT5/mrt5_merge_ABC_0.001_seed59/checkpoints/checkpoint-20000

Loading merge_ABC test dataset...


100%|█████████▉| 249/250 [01:31<00:00,  2.73it/s]


Eval cross entropy loss: 0.00024240130726866483
Eval percent deleted tokens: 1.5625
Eval new sequence length: 126.0
Eval token accuracy: 0.99991259765625
Eval sequence accuracy: 0.988875
Examples evaluated: 32000

Loading model...
merge_ABC
Path: /nlp/scr3/nlp/llms-in-llms/mrt5/models/merge_ABC/MrT5/mrt5_merge_ABC_0.001_L6_seed59/checkpoints/checkpoint-20000

Loading merge_ABC test dataset...


100%|█████████▉| 249/250 [01:30<00:00,  2.75it/s]


Eval cross entropy loss: 0.0004778707785771985
Eval percent deleted tokens: 8.305029296875
Eval new sequence length: 126.0
Eval token accuracy: 0.9998212890625
Eval sequence accuracy: 0.97740625
Examples evaluated: 32000

Loading model...
merge_ABC
Path: /nlp/scr3/nlp/llms-in-llms/mrt5/models/merge_ABC/MrT5/mrt5_merge_ABC_0.001_L7_seed59/checkpoints/checkpoint-20000

Loading merge_ABC test dataset...


100%|█████████▉| 249/250 [01:30<00:00,  2.75it/s]


Eval cross entropy loss: 0.004423437299672514
Eval percent deleted tokens: 16.936083984375
Eval new sequence length: 124.452
Eval token accuracy: 0.998436767578125
Eval sequence accuracy: 0.85221875
Examples evaluated: 32000
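
The printed metrics follow a regular `Path: ...` and `Eval <name>: <value>` pattern, so they can be collected into records for side-by-side comparison. The following is a minimal sketch, assuming the eval output has been captured as text (for example with `subprocess.run(..., capture_output=True, text=True)`); `parse_eval_log` is a hypothetical helper, not part of the repository:

```python
import re

# Patterns for the two kinds of lines we care about in the eval output
PATH_RE = re.compile(r"^Path: (?P<path>\S+)$")
METRIC_RE = re.compile(r"^Eval (?P<name>[^:]+): (?P<value>\S+)$")


def parse_eval_log(text):
    """Return one dict per evaluated checkpoint, keyed by metric name."""
    records = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        path_match = PATH_RE.match(line)
        if path_match:
            # A "Path:" line starts the record for a new checkpoint
            current = {"path": path_match.group("path")}
            records.append(current)
            continue
        metric_match = METRIC_RE.match(line)
        if metric_match and current is not None:
            current[metric_match.group("name")] = float(metric_match.group("value"))
    return records


# Sample copied from the first MrT5 run in this notebook
SAMPLE = """\
Loading model...
vowel_removal
Path: /nlp/scr3/nlp/llms-in-llms/mrt5/models/vowel_removal/MrT5/mrt5_simple_vowel_removal_0.0_seed59/checkpoints/checkpoint-20000

Eval cross entropy loss: 9.813663885506685e-05
Eval percent deleted tokens: 18.9471435546875
Eval new sequence length: 114.564
Eval token accuracy: 0.999973388671875
Eval sequence accuracy: 0.9966875
Examples evaluated: 32000
"""

if __name__ == "__main__":
    for record in parse_eval_log(SAMPLE):
        print(record["path"].split("/")[-3], record["sequence accuracy"])
```

The sketch deliberately ignores progress bars and the `Examples evaluated` line; only lines that start with `Path:` or `Eval` contribute to a record.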

