GitHub - ruiqi-zhong/acl2021-instance-level: Code for the paper "Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level", accepted to ACL 2021 Findings

Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level

Authors: Ruiqi Zhong, Dhruba Ghosh, Dan Klein, Jacob Steinhardt

This GitHub Repo contains the pretrained models, model predictions (data) and code for the experiments. If you have any questions, need additional information about the models/data, or want to request a new feature, feel free to send an email to ruiqi-zhong@berkeley.edu .

The paper is accepted to ACL 2021 Findings. Paper link: https://arxiv.org/abs/2105.06020

1. Pre-trained Models

You can download all of our pre-trained BERT models from here. We experimented with 5 different sizes (mini, small, medium, base, large) as described in the official BERT github Repo and stored them in the same format. We used the exact same pre-training code as the oringal BERT paper and similar training settings, except that we used the training corpus from this paper, reduced the context size from 512 to 128, and increase the number of training steps from 1M to 2M. For each size, we pre-trained 10 times with the identical procedure but different random seeds.

The directory for the medium-sized BERT model with pre-training seed 3 after pretraining for 2000000 steps is pretrained/medium/pretrain_seed3step2000000.

2. Model Predictions

We release the model predictions on 3 different tasks (sst-2, MNLI, and QQP) for 5 different model sizes, each with 10 pretraining seeds and 5 finetuning seeds, as described in the paper. Additionally, we provide the model predictions for several out-of-domain datasets (e.g. SNLI/HANS for MNLI, TwitterPPDB for QQP) after training for 3.0, 3.33, 3.67 epochs (in the paper we always finetune for 4 epochs). You can download them from here.

2.1 model_data_for_release/

This folder contains the raw model predictions and the corresponding datapoints. For each task of MNLI, QQP and SST-2, there is a folder with the following files/folders (we performed 5 fold cross-validation on SST-2):

data.json contains the data used for fine-tuning (training) and prediction, where each datapoint is represented as a dictionary. For example,

import json
data = json.load(open('qqp/data.json'))
print('Number of datapoints for model prediction', len(data['predict'])) # prints the number of datapoints that are used for evaluation
print(data['predict'][0]) # prints the first datapoint for evaluation

, and we have

Number of datapoints for model prediction, 79497
{'guid': 'dev-0', 'label': '1', 
'text_b': 'Where are the best places to eat in New York City?', 
'text_a': 'Where are the best places to eat in New York City that have a great vibe?'}

Notice that string before "-" for "guid" denotes the original data split, not our train/test split. For example, "train" might occur in the evaluation set, since we used part of the original training split for testing. In our paper, we used the "dev_matched" split for MNLI, "dev" for QQP, and "train" for SST-2. We downloaded our QQP data from here, and the NLI challenge set from here.

[predict/train].tf_record contains the tokenized data in the tensorflow format that are used for fine-tuning.

size2hyperparam.json contains the hyper-parameter used for different model sizes.

results/ is the folder that contains the models' prediction, each in a .tsv format, representing a matrix of dimension (number of datapoints, number of classes). Each row is the models' predicted probability for each class, and correspond to one datapoint in data['predict']. results/slargep8f5epoch9over3.tsv means the predictions of the large size model with pretraining seed 8 finetuning seed 5 evaluated at (9/3)=3 epoch.

2.2 Processed Predictions: Correctness tensors

We extract 3 types of "correctness tensors" from the .tsv and data.json files, with the command

python3 dump_correctness.py

The results are dumped into correctness/, correctness_p/, ensemble_c/

correctness/: stores binary tensors. 1 if the model prediction is correct, 0 otherwise. For example

import pickle as pkl
qqp_size2correctness, qqp_data = pkl.load(open('correctness/qqp.pkl', 'rb')) # qqp_data is exactly data['predict'] as mentioned above
print(qqp_size2correctness.keys()) # output: dict_keys(['mini', 'small', 'medium', 'base', 'large']). qqp_size2correctness is a mapping from model size to the correctness tensor
print(qqp_size2correctness['large'].shape) # output: (79497, 10, 5, 4). 79497 is the number of datapoints, 10/5 is the number of pretraining/finetuning seeds, 4 represents different checkpoints at [3, 3.33, 3.67, 4] epochs.

correctness_p/: Same as "correctness", the probability assigned to the correct class.

ensemble_c/: the correctness tensor using the last checkpoint only, after marginalizing over all fine-tuning seeds.

2.3 Code to Produce Model Predictions

We use almost the exact same code as the original BERT paper to fine-tune a pre-trained model. To convert the data into the tensorflow format, run

python3 process_input.py

To finetune the pretrained models, run, for example,

python3 run_classifier.py --data_dir qqp/ --pretrain_seed 10 --model_size medium --dataorder_seed 10 --initialization_seed 10 --tpu_name lm-4

3. Code

Run pip3 install -r requirements.txt to install the required dependencies, and download the files to the corresponding folders. The results.ipynb jupyter notebook computes the following:

3.1 Variations within a Single Model Size

within/across pretraining seed difference
0/1 loss bias-finetuningvariance-pretrainingvariance-decomposition
variance conditioned on bias
squared loss decomposition using probability assigned to the correct label

3.2 Estimating the Fraction of Decaying Instances with Random Baseline

Our approach, with CDF visualization
Comparison with the conventional Benjamini-Hochberg Procedure
Example decaying instances

The decaying instances are stored in the decaying/ folder. Each .pkl file is a map from the data split name to a 3-tuple, which are

a list of datapoints
whether the datapoint belongs to the control group (random non-decaying instances), or the treatment group (decaying instances)
Ruiqi Zhong's annotation of whether he thinks the label is correct, wrong, reasonable or unsure. (None means not annotated)

3.3 Correlation of Instance Difference

instance difference as measured by pearson-r correlation (numbers represent models sizes, smaller numbers correspond to smaller models)
measuring with spearman rank

Some results might be different from that in the paper due to random seeds; however they should be close and lead to the exact qualitative conclusions.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
correctness		correctness
correctness_p		correctness_p
decaying		decaying
ensemble_c		ensemble_c
README.md		README.md
decompose_bias_variance.py		decompose_bias_variance.py
dump_correctness.py		dump_correctness.py
full-appendix.pdf		full-appendix.pdf
modeling.py		modeling.py
optimization.py		optimization.py
process_input.py		process_input.py
requirements.txt		requirements.txt
results.ipynb		results.ipynb
run_classifier.py		run_classifier.py
tokenization.py		tokenization.py
utils.py		utils.py

ruiqi-zhong/acl2021-instance-level

Folders and files

Latest commit

History

Repository files navigation

Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level

1. Pre-trained Models

2. Model Predictions

2.1 model_data_for_release/

2.2 Processed Predictions: Correctness tensors

2.3 Code to Produce Model Predictions

3. Code

3.1 Variations within a Single Model Size

3.2 Estimating the Fraction of Decaying Instances with Random Baseline

3.3 Correlation of Instance Difference

About

Resources

Stars

Watchers

Forks

Languages