# Multi-bootstrap for evaluating pretrained LMs

This notebook shows an example of the paired analysis described in Section 4.1 of the paper. This type of analysis is applicable to any kind of intervention that is applied independently to a particular pretraining (e.g. BERT) checkpoint, including:

- Interventions such as intermediate task training or pruning which directly manipulate a pretraining checkpoint.
- Changes to any fine-tuning or probing procedure which is applied after pretraining.

In the most general case, we'll have a set of $k$ pretraining checkpoints (seeds), to which we'll apply our intervention, perform any additional transformations (like fine-tuning), then evaluate a downstream metric $L$ on a finite evaluation set. The multiple bootstrap procedure allows us to account for three sources of variance:

1. Variation between pretraining checkpoints
2. Expected variance due to a finite evaluation set
3. Variation due to fine-tuning or other procedures
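
As a rough illustration of how these three sources can enter a single bootstrap draw, the sketch below resamples seeds and evaluation examples with replacement from a toy score array. The function name `toy_multibootstrap_mean`, the `[seeds, fine-tuning runs, examples]` layout, the random scores, and the choice to average over fine-tuning runs are all assumptions made for illustration; this is not the `multibootstrap` library's implementation.

```python
import numpy as np

def toy_multibootstrap_mean(scores, nboot, rng):
    """Toy bootstrap over seeds and examples (hypothetical helper).

    scores: array of shape [num_seeds, num_ft_runs, num_examples]
    Returns: array of shape [nboot] with one resampled mean per draw.
    """
    num_seeds, _, num_examples = scores.shape
    # Assumption: fold in fine-tuning variation (source 3) by averaging
    # over the fine-tuning runs of each pretraining seed.
    per_seed = scores.mean(axis=1)  # [num_seeds, num_examples]
    samples = np.empty(nboot)
    for i in range(nboot):
        # Source 1: resample pretraining seeds with replacement.
        seed_idx = rng.integers(num_seeds, size=num_seeds)
        # Source 2: resample evaluation examples with replacement.
        ex_idx = rng.integers(num_examples, size=num_examples)
        samples[i] = per_seed[np.ix_(seed_idx, ex_idx)].mean()
    return samples

rng = np.random.default_rng(0)
toy_scores = rng.random((5, 3, 200))  # made-up pointwise scores in [0, 1)
toy_samples = toy_multibootstrap_mean(toy_scores, nboot=100, rng=rng)
```

The spread of `toy_samples` then reflects both which seeds and which examples happened to be drawn.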

## MultiBERTs vs. Original BERT

Here, we'll compare the MultiBERTs models run for 2M steps with the single previously-released `bert-base-uncased` model. Our analysis will be unpaired with respect to seeds, but we'll still sample jointly over _examples_ in the evaluation set and report confidence intervals as described in Section 4.1 of the paper.

We'll use SQuAD 2.0 here, but the code below can easily be modified to handle other tasks.

In [1]:
import json
import os
import re

import numpy as np
import pandas as pd

from tqdm.notebook import tqdm  # for progress indicator

In [2]:
scratch_dir = "/tmp/multiberts_squad"
os.makedirs(scratch_dir, exist_ok=True)

preds_root = "https://storage.googleapis.com/multiberts/public/example-predictions/SQuAD"
# Fetch the SQuAD eval script. Save it as evaluate_squad2.py, because the
# original name (evaluate-v2.0.py) is not a valid Python module name.
!curl $preds_root/evaluate-v2.0.py -o $scratch_dir/evaluate_squad2.py
# Fetch development set labels
!curl -O $preds_root/dev-v2.0.json --output-dir $scratch_dir
# Fetch predictions index file
!curl -O $preds_root/index.tsv --output-dir $scratch_dir



In [3]:
!ls $scratch_dir

dev-v2.0.json  evaluate_squad2.py  index.tsv  __pycache__  v2.0


Load the run metadata. You can also just look through the directory, but this index file is convenient if (as we do here) you only want to download some of the files.

In [4]:
run_info = pd.read_csv(os.path.join(scratch_dir, 'index.tsv'), sep='\t')
# Filter to SQuAD 2.0 runs from either 2M MultiBERTs or the original BERT checkpoint ("public").
mask = run_info.task == "v2.0"
mask &= (run_info.n_steps == "2M") | (run_info.release == 'public')
run_info = run_info[mask]
run_info

| | file | task | pretrain_id | n_steps | lr | ft_seed | release |
|---|---|---|---|---|---|---|---|
| 55 | v2.0/release=multiberts,pretrain_id=0,n_steps=... | v2.0 | 0 | 2M | 0.00005 | 0 | multiberts |
| 56 | v2.0/release=multiberts,pretrain_id=0,n_steps=... | v2.0 | 0 | 2M | 0.00005 | 1 | multiberts |
| 57 | v2.0/release=multiberts,pretrain_id=0,n_steps=... | v2.0 | 0 | 2M | 0.00005 | 2 | multiberts |
| 58 | v2.0/release=multiberts,pretrain_id=0,n_steps=... | v2.0 | 0 | 2M | 0.00005 | 3 | multiberts |
| 59 | v2.0/release=multiberts,pretrain_id=0,n_steps=... | v2.0 | 0 | 2M | 0.00005 | 4 | multiberts |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 305 | v2.0/release=public,pretrain_id=0,n_steps=0,lr... | v2.0 | 0 | 0 | 0.00005 | 0 | public |
| 306 | v2.0/release=public,pretrain_id=0,n_steps=0,lr... | v2.0 | 0 | 0 | 0.00005 | 1 | public |
| 307 | v2.0/release=public,pretrain_id=0,n_steps=0,lr... | v2.0 | 0 | 0 | 0.00005 | 2 | public |
| 308 | v2.0/release=public,pretrain_id=0,n_steps=0,lr... | v2.0 | 0 | 0 | 0.00005 | 3 | public |


In [5]:
# Download all prediction files
for fname in tqdm(run_info.file):
    !curl $preds_root/$fname -o $scratch_dir/$fname --create-dirs --silent


In [6]:
!ls $scratch_dir/v2.0

'release=multiberts,pretrain_id=0,n_steps=2M,lr=5e-05,ft_seed=0.json'
'release=multiberts,pretrain_id=0,n_steps=2M,lr=5e-05,ft_seed=1.json'
'release=multiberts,pretrain_id=0,n_steps=2M,lr=5e-05,ft_seed=2.json'
'release=multiberts,pretrain_id=0,n_steps=2M,lr=5e-05,ft_seed=3.json'
'release=multiberts,pretrain_id=0,n_steps=2M,lr=5e-05,ft_seed=4.json'
'release=multiberts,pretrain_id=10,n_steps=2M,lr=5e-05,ft_seed=0.json'
'release=multiberts,pretrain_id=10,n_steps=2M,lr=5e-05,ft_seed=1.json'
'release=multiberts,pretrain_id=10,n_steps=2M,lr=5e-05,ft_seed=2.json'
'release=multiberts,pretrain_id=10,n_steps=2M,lr=5e-05,ft_seed=3.json'
'release=multiberts,pretrain_id=10,n_steps=2M,lr=5e-05,ft_seed=4.json'
'release=multiberts,pretrain_id=11,n_steps=2M,lr=5e-05,ft_seed=0.json'
'release=multiberts,pretrain_id=11,n_steps=2M,lr=5e-05,ft_seed=1.json'
'release=multiberts,pretrain_id=11,n_steps=2M,lr=5e-05,ft_seed=2.json'
'release=multiberts,pretrain_id=11,n_steps=2M,lr=5e-05,ft_seed=3.json'

Now we should have everything in our scratch directory, and can load individual predictions.

SQuAD has a monolithic eval script that isn't easily compatible with a bootstrap procedure (among other things, it parses a lot of JSON, and you don't want to do that in the inner loop!). Ultimately, though, it relies on computing some point-wise scores (exact-match $\in \{0,1\}$ and F1 $\in [0,1]$) and averaging these across examples. For efficiency, we'll pre-compute these before running our bootstrap.
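
To see why pre-computing helps, here is a minimal sketch on made-up data: once the per-example scores exist, a bootstrap draw over examples is just an index lookup followed by a mean, or equivalently a count-weighted average of the original scores. The array `pointwise_em` and its random contents are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Toy per-example exact-match scores in {0, 1} (made-up data).
pointwise_em = rng.integers(0, 2, size=n).astype(float)

# One bootstrap draw: resample example indices with replacement.
idx = rng.integers(n, size=n)
resampled_em = pointwise_em[idx].mean()

# Equivalent view: weight each original example by how often it was drawn.
counts = np.bincount(idx, minlength=n)
weighted_em = (counts * pointwise_em).sum() / n
```

No JSON parsing or string matching happens inside the draw, which is what keeps the inner loop cheap.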

In [7]:
# Import the SQuAD 2.0 eval script; we'll use some functions from this below.
import sys
sys.path.append(scratch_dir)
import evaluate_squad2 as squad_eval

In [8]:
# Load dataset
with open(os.path.join(scratch_dir, 'dev-v2.0.json')) as fd:
    dataset = json.load(fd)['data']

The official script supports thresholding for no-answer, but the default settings ignore this and treat only predictions of the empty string (`""`) as no-answer. So, we can score on `exact_raw` and `f1_raw` directly.
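
A simplified sketch of this convention follows. The helper `toy_exact` is hypothetical and skips the answer normalization that the official script applies; it only illustrates how an empty-string prediction is scored against answerable and unanswerable questions.

```python
def toy_exact(gold_answers, prediction):
    """Toy exact-match score (hypothetical helper, no text normalization).

    gold_answers: list of acceptable answer strings; empty if unanswerable.
    prediction: predicted answer string; "" means "no answer".
    """
    # An unanswerable question is treated as having the single target "".
    targets = gold_answers if gold_answers else [""]
    return int(any(prediction == t for t in targets))

no_answer_correct = toy_exact([], "")         # abstains on an unanswerable question
no_answer_wrong = toy_exact([], "Paris")      # answers an unanswerable question
missed_answer = toy_exact(["Paris"], "")      # abstains on an answerable question
matched_answer = toy_exact(["Paris"], "Paris")
```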

In [9]:
exact_scores = {}  # filename -> qid -> score
f1_scores = {}     # filename -> qid -> score
for fname in tqdm(run_info.file):
    with open(os.path.join(scratch_dir, fname)) as fd:
        preds = json.load(fd)
    
    exact_raw, f1_raw = squad_eval.get_raw_scores(dataset, preds)
    exact_scores[fname] = exact_raw
    f1_scores[fname] = f1_raw
    
def dict_of_dicts_to_matrix(dd):
    """Convert a dict of dicts of scores to a dense matrix.
    
    Outer keys assumed to be rows, inner keys are columns (e.g. example IDs).
    Uses pandas to ensure that different rows are correctly aligned.
    
    Args:
      dd: map of row -> column -> value
      
    Returns:
      np.ndarray of shape [num_rows, num_columns]
    """
    # Use pandas to ensure keys are correctly aligned.
    df = pd.DataFrame(dd).transpose()
    return df.values

exact_scores = dict_of_dicts_to_matrix(exact_scores)
f1_scores = dict_of_dicts_to_matrix(f1_scores)


In [10]:
exact_scores.shape

(130, 11873)

## Run multibootstrap

Here, base (`L`) is the original BERT checkpoint and expt (`L'`) is MultiBERTs with 2M steps. Since we pre-computed the pointwise exact match and F1 scores for each run and each example, we can just pass dummy labels and use a simple average over predictions as our scoring function.
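
The sketch below shows what one unpaired draw could look like on toy data: seeds are resampled independently on each side, while the same resampled examples are used for both. The function `toy_unpaired_draw` and the toy arrays are assumptions for illustration, not the `multibootstrap` library's code.

```python
import numpy as np

def toy_unpaired_draw(base, expt, rng):
    """One toy bootstrap draw, unpaired over seeds (hypothetical helper).

    base: [num_base_seeds, num_examples] pointwise scores
    expt: [num_expt_seeds, num_examples] pointwise scores
    Returns: (mean base score, mean expt score) for this draw.
    """
    num_examples = base.shape[1]
    # Examples are resampled once and shared by both sides.
    ex_idx = rng.integers(num_examples, size=num_examples)
    # Seeds are resampled independently on each side (unpaired).
    base_idx = rng.integers(base.shape[0], size=base.shape[0])
    expt_idx = rng.integers(expt.shape[0], size=expt.shape[0])
    l_base = base[np.ix_(base_idx, ex_idx)].mean()
    l_expt = expt[np.ix_(expt_idx, ex_idx)].mean()
    return l_base, l_expt

rng = np.random.default_rng(0)
toy_base = rng.random((1, 50)) * 0.5              # one base seed
toy_expt = np.repeat(toy_base, 3, axis=0) + 0.1   # three expt seeds, each 0.1 better
l_base, l_expt = toy_unpaired_draw(toy_base, toy_expt, rng)
```

Because the examples are shared, the toy difference `l_expt - l_base` stays at the constructed offset of 0.1 in every draw, even though each side's mean varies from draw to draw.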

In [11]:
import multibootstrap

num_bootstrap_samples = 1000

selected_runs = run_info.copy()
selected_runs['seed'] = selected_runs['pretrain_id']
selected_runs['intervention'] = (selected_runs['release'] == 'multiberts')

# Dummy labels
dummy_labels = np.zeros_like(exact_scores[0])  # [num_examples]
def score_fn(y_true, y_pred):
    # y_true is ignored: y_pred already holds the pre-computed per-example scores.
    return np.mean(y_pred)

# Targets; run once for each.
targets = {'exact': exact_scores, 'f1': f1_scores}

stats = {}
for name, preds in targets.items():
    print(f"Metric: {name:s}")
    samples = multibootstrap.multibootstrap(selected_runs, preds, dummy_labels, score_fn,
                                            nboot=num_bootstrap_samples,
                                            paired_seeds=False,
                                            progress_indicator=tqdm)
    stats[name] = multibootstrap.report_ci(samples, c=0.95)
    print("")

pd.concat({k: pd.DataFrame(v) for k,v in stats.items()}).transpose()

Metric: exact
Multibootstrap (unpaired) on 11873 examples
  Base seeds (1): [0]
  Expt seeds (25): [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]



Bootstrap statistics from 1000 samples:
  E[L]  = 0.722 with 95% CI of (0.715 to 0.729)
  E[L'] = 0.747 with 95% CI of (0.741 to 0.754)
  E[L'-L] = 0.0254 with 95% CI of (0.0211 to 0.0296); p-value = 0

Metric: f1
Multibootstrap (unpaired) on 11873 examples
  Base seeds (1): [0]
  Expt seeds (25): [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]



Bootstrap statistics from 1000 samples:
  E[L]  = 0.754 with 95% CI of (0.748 to 0.762)
  E[L'] = 0.778 with 95% CI of (0.773 to 0.785)
  E[L'-L] = 0.0239 with 95% CI of (0.0191 to 0.0286); p-value = 0



| | exact mean | exact low | exact high | exact p | f1 mean | f1 low | f1 high | f1 p |
|---|---|---|---|---|---|---|---|---|
| base | 0.721923 | 0.714658 | 0.728527 | | 0.754427 | 0.748034 | 0.761508 | |
| expt | 0.747291 | 0.740529 | 0.753553 | | 0.778307 | 0.772585 | 0.784535 | |
| delta | 0.025368 | 0.02109 | 0.02965 | 0.0 | 0.02388 | 0.019124 | 0.028629 | 0.0 |
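
For intuition about how a summary like this can be derived from bootstrap samples, here is a minimal sketch. It assumes the samples form an array of shape `[nboot, 2]` with columns (base, expt), uses a percentile interval, and takes the p-value to be the fraction of draws in which the difference is not positive. These choices, and the helper name `toy_report_ci`, are assumptions for illustration and may differ from what `multibootstrap.report_ci` actually does.

```python
import numpy as np

def toy_report_ci(samples, c=0.95):
    """Summarize bootstrap samples (hypothetical helper).

    samples: array of shape [nboot, 2] with columns (base, expt).
    c: confidence level for the percentile interval.
    """
    samples = np.asarray(samples, dtype=float)
    delta = samples[:, 1] - samples[:, 0]
    lo_q, hi_q = 100 * (1 - c) / 2, 100 * (1 + c) / 2

    def summarize(x):
        low, high = np.percentile(x, [lo_q, hi_q])
        return {"mean": float(x.mean()), "low": float(low), "high": float(high)}

    out = {
        "base": summarize(samples[:, 0]),
        "expt": summarize(samples[:, 1]),
        "delta": summarize(delta),
    }
    # Assumed one-sided p-value: share of draws with no improvement.
    out["delta"]["p"] = float(np.mean(delta <= 0))
    return out

rng = np.random.default_rng(0)
# Made-up samples loosely shaped like the exact-match results above.
toy_samples = np.stack([rng.normal(0.72, 0.003, size=1000),
                        rng.normal(0.75, 0.003, size=1000)], axis=1)
toy_stats = toy_report_ci(toy_samples, c=0.95)
```

Under this definition, a p-value of 0 only means that none of the draws showed a non-positive difference, so with 1000 samples it is better read as p < 0.001 than as exactly zero.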
