# Worse Fine Tuning

Try to make BERT / RoBERTa worse by continuing masked-LM pre-training on Wikipedia sentences whose words have been shuffled.

Author: Bai Li  
Loosely based on this tutorial: https://huggingface.co/blog/how-to-train

In [1]:
import sys
sys.path.append('../')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tqdm
import torch
from collections import defaultdict
import random
import math
import pickle

from torch.utils.data import Dataset
from transformers import (
  AutoTokenizer,
  AutoModelForMaskedLM,
  DataCollatorForLanguageModeling,
  Trainer,
  TrainingArguments,
  pipeline,
)
from datasets import load_dataset

%matplotlib inline
%load_ext autoreload
%autoreload 2
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

In [2]:
# The GPU to use for training
%env CUDA_VISIBLE_DEVICES=0

env: CUDA_VISIBLE_DEVICES=0


## Load pretrained model

To get a baseline for comparison with the corrupted model, first run a simple fill-mask task with the original model.

In [3]:
model_name = 'roberta-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).cuda()

In [4]:
fill_mask = pipeline(
  "fill-mask",
  model=model,
  tokenizer=tokenizer,
  device=0
)

In [5]:
fill_mask("I <mask> a book about animals.")

[{'score': 0.8505654335021973,
  'token': 875,
  'token_str': ' wrote',
  'sequence': 'I wrote a book about animals.'},
 {'score': 0.04043307527899742,
  'token': 33,
  'token_str': ' have',
  'sequence': 'I have a book about animals.'},
 {'score': 0.029625510796904564,
  'token': 3116,
  'token_str': ' write',
  'sequence': 'I write a book about animals.'},
 {'score': 0.01930156722664833,
  'token': 1027,
  'token_str': ' published',
  'sequence': 'I published a book about animals.'},
 {'score': 0.012223951518535614,
  'token': 222,
  'token_str': ' did',
  'sequence': 'I did a book about animals.'}]

## Construct scrambled sentences from Wikipedia

In [6]:
wiki_dataset = load_dataset('wikitext', 'wikitext-2-v1', split='train')

Using the latest cached version of the module from /h/zining/.cache/huggingface/modules/datasets_modules/datasets/wikitext/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20 (last modified on Tue Aug 31 17:24:58 2021) since it couldn't be found locally at wikitext., or remotely on the Hugging Face Hub.
Reusing dataset wikitext (/h/zining/.cache/huggingface/datasets/wikitext/wikitext-2-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20)


In [7]:
random.seed(12345)
scrambled_sentences = []
for sent in wiki_dataset:
  sent_toks = sent['text'].split()
  random.shuffle(sent_toks)
  scrambled_sentences.append(' '.join(sent_toks))
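
The scrambling step is a word-level shuffle: each line is split on whitespace and its tokens are permuted, so the bag of words is preserved while the word order is destroyed. Empty lines simply stay empty. A minimal sketch on a toy sentence, where the helper name `scramble` and the example sentence are illustrative and not part of the notebook:

```python
import random

def scramble(sentence, rng):
  """Return the sentence with its whitespace-separated tokens shuffled."""
  toks = sentence.split()
  rng.shuffle(toks)  # in-place permutation of the word list
  return ' '.join(toks)

rng = random.Random(12345)  # local RNG, so the global seed is left untouched
original = "the quick brown fox jumps over the lazy dog"
shuffled = scramble(original, rng)

print(shuffled)
# The multiset of words is unchanged; only the order differs.
print(sorted(shuffled.split()) == sorted(original.split()))
```

Using a dedicated `random.Random` instance here is a design choice for the sketch; the cell above seeds the global generator instead, which works equally well for a single pass over the dataset.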

## Dataloader

In [8]:
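class ShuffledWikiDataset(Dataset):
  """Serves the scrambled sentences, tokenized on the fly."""
  def __len__(self):
    return len(scrambled_sentences)
  def __getitem__(self, i):
    # Pass truncation=True explicitly; max_length alone only triggers a
    # warning and falls back to a default truncation strategy.
    return tokenizer(scrambled_sentences[i], truncation=True, max_length=128)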

## Do more pre-training to degrade model

In [9]:
# This controls the amount of degradation.
corrupt_training_steps = 6400

training_args = TrainingArguments(
  output_dir='./checkpoints/',
  per_device_train_batch_size=16,
  max_steps=corrupt_training_steps,
)

data_collator = DataCollatorForLanguageModeling(
  tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
  model=model,
  tokenizer=tokenizer,
  data_collator=data_collator,
  train_dataset=ShuffledWikiDataset(),
  args=training_args
)

max_steps is given, it will override any value given in num_train_epochs
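
With `mlm=True` and `mlm_probability=0.15`, the collator selects roughly 15% of the tokens as prediction targets; of those, about 80% are replaced by the mask token, 10% by a random token, and 10% are left unchanged. The following is a simplified pure-Python sketch of that rule, not the library's implementation. The function `mask_tokens`, the toy vocabulary, and the string-level tokens are illustrative assumptions, and special-token handling is omitted.

```python
import random

MASK = "<mask>"
IGNORE = -100  # label value that the cross-entropy loss skips

def mask_tokens(tokens, vocab, rng, mlm_probability=0.15):
  """Return (inputs, labels) following the 15% / 80-10-10 masking rule."""
  inputs, labels = [], []
  for tok in tokens:
    if rng.random() >= mlm_probability:
      inputs.append(tok)       # not selected: input unchanged, no loss
      labels.append(IGNORE)
      continue
    labels.append(tok)         # selected: the model must predict the original
    r = rng.random()
    if r < 0.8:
      inputs.append(MASK)               # 80%: replace with the mask token
    elif r < 0.9:
      inputs.append(rng.choice(vocab))  # 10%: replace with a random token
    else:
      inputs.append(tok)                # 10%: keep the original token
  return inputs, labels

rng = random.Random(0)
vocab = ["I", "wrote", "a", "book", "about", "animals", "."]
tokens = vocab * 200  # 1400 toy tokens
inputs, labels = mask_tokens(tokens, vocab, rng)
selected = sum(label != IGNORE for label in labels)
print(f"selected {selected / len(tokens):.1%} of tokens as targets")
```

Because the sentences are shuffled, the context around each masked position carries far less syntactic signal than in normal text, which is presumably why this objective degrades the model rather than improving it.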


In [10]:
trainer.train()

***** Running training *****
  Num examples = 36718
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 6400
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: ziningzhu (use `wandb login --relogin` to force relogin)
wandb: wandb version 0.12.9 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


| Step | Training Loss |
|------|---------------|
| 500  | 5.2187 |
| 1000 | 5.0274 |
| 1500 | 5.0019 |
| 2000 | 4.9624 |
| 2500 | 4.883  |
| 3000 | 4.8794 |
| 3500 | 4.8196 |
| 4000 | 4.7865 |
| 4500 | 4.7598 |
| 5000 | 4.7101 |


Saving model checkpoint to ./checkpoints/checkpoint-500
Configuration saved in ./checkpoints/checkpoint-500/config.json
Model weights saved in ./checkpoints/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./checkpoints/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./checkpoints/checkpoint-500/special_tokens_map.json
Saving model checkpoint to ./checkpoints/checkpoint-1000
Configuration saved in ./checkpoints/checkpoint-1000/config.json
Model weights saved in ./checkpoints/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in ./checkpoints/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./checkpoints/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to ./checkpoints/checkpoint-1500
Configuration saved in ./checkpoints/checkpoint-1500/config.json
Model weights saved in ./checkpoints/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in ./checkpoints/checkpoint-1500/tokenizer_config.json
Special tok

TrainOutput(global_step=6400, training_loss=4.858393764495849, metrics={'train_runtime': 3413.6661, 'train_samples_per_second': 29.997, 'train_steps_per_second': 1.875, 'total_flos': 6720978055895136.0, 'train_loss': 4.858393764495849, 'epoch': 2.79})
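
The reported `epoch` value follows from the step count: 6400 steps at batch size 16 consume 102,400 examples, which is about 2.79 passes over the 36,718 training examples. The throughput figures in `TrainOutput` can be checked the same way. A quick sanity check, with all numbers copied from the logs above and variable names chosen for illustration:

```python
import math

num_examples = 36718       # "Num examples" from the training log
batch_size = 16            # per_device_train_batch_size on a single GPU
max_steps = 6400           # corrupt_training_steps
train_runtime = 3413.6661  # seconds, from TrainOutput

samples_seen = max_steps * batch_size
epochs = samples_seen / num_examples
steps_per_epoch = math.ceil(num_examples / batch_size)

print(samples_seen)                            # 102400
print(round(epochs, 2))                        # 2.79
print(steps_per_epoch)                         # 2295
print(round(samples_seen / train_runtime, 3))  # compare with train_samples_per_second
print(round(max_steps / train_runtime, 3))     # compare with train_steps_per_second
```

Since 6400 steps is more than two full epochs of 2295 steps but fewer than three, the log line `Num Epochs = 3` is consistent with the final epoch being only partially completed.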

## Try fill-mask on corrupted model

As expected, the predictions are still reasonable but worse: the top prediction is still ' wrote', but its score drops from about 0.85 to about 0.31.

In [13]:
fill_mask = pipeline(
  "fill-mask",
  model=model,
  tokenizer=tokenizer,
  device=0
)

In [14]:
fill_mask("I <mask> a book about animals.")

[{'sequence': 'I wrote a book about animals.',
  'score': 0.3081257939338684,
  'token': 875,
  'token_str': ' wrote'},
 {'sequence': 'I have a book about animals.',
  'score': 0.08756621181964874,
  'token': 33,
  'token_str': ' have'},
 {'sequence': 'I published a book about animals.',
  'score': 0.0630703717470169,
  'token': 1027,
  'token_str': ' published'},
 {'sequence': 'I had a book about animals.',
  'score': 0.06285504996776581,
  'token': 56,
  'token_str': ' had'},
 {'sequence': 'I was a book about animals.',
  'score': 0.05396825447678566,
  'token': 21,
  'token_str': ' was'}]
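
One rough way to quantify the degradation is to compare the top-1 score and the probability mass captured by the five best candidates before and after the extra training. The sketch below only reuses the scores printed in the two fill-mask outputs of this notebook; the variable names are illustrative, and a single prompt is of course not a proper evaluation.

```python
# Top-5 fill-mask scores copied from the two outputs in this notebook.
before = {' wrote': 0.8505654335021973, ' have': 0.04043307527899742,
          ' write': 0.029625510796904564, ' published': 0.01930156722664833,
          ' did': 0.012223951518535614}
after = {' wrote': 0.3081257939338684, ' have': 0.08756621181964874,
         ' published': 0.0630703717470169, ' had': 0.06285504996776581,
         ' was': 0.05396825447678566}

for name, scores in [('original', before), ('corrupted', after)]:
  top_token = max(scores, key=scores.get)
  print(f"{name}: top-1 {top_token!r} = {scores[top_token]:.3f}, "
        f"top-5 mass = {sum(scores.values()):.3f}")

# Candidates that appear in both top-5 lists.
shared = sorted(set(before) & set(after))
print(shared)
```

On these numbers the top-5 mass falls from roughly 0.95 to roughly 0.58, so the corrupted model spreads noticeably more probability over candidates outside its five best guesses.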

## Save model to disk

To load the saved model later, call `AutoModelForMaskedLM.from_pretrained(model_path)`.

In [15]:
model.save_pretrained(f"checkpoints/{model_name}-corrupt-{corrupt_training_steps}-steps")

Configuration saved in checkpoints/roberta-base-corrupt-200-steps/config.json
Model weights saved in checkpoints/roberta-base-corrupt-200-steps/pytorch_model.bin
