# Worse Fine Tuning

Deliberately degrade a pretrained masked language model (BERT / RoBERTa) by continuing its masked-LM pre-training on Wikipedia sentences whose words have been shuffled.

Loosely based on this tutorial: https://huggingface.co/blog/how-to-train

In [1]:
import sys
sys.path.append('../')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tqdm
import torch
from collections import defaultdict
import random
import math
import pickle

from torch.utils.data import Dataset
from transformers import (
  AutoTokenizer,
  AutoModelForMaskedLM,
  DataCollatorForLanguageModeling,
  Trainer,
  TrainingArguments,
  pipeline,
)
from datasets import load_dataset

%matplotlib inline
%load_ext autoreload
%autoreload 2
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

In [2]:
# The GPU to use for training
%env CUDA_VISIBLE_DEVICES=0

env: CUDA_VISIBLE_DEVICES=0


## Load pretrained model

As a baseline for comparison with the corrupted model, first run a simple fill-mask task on the original model.

In [4]:
model_name = 'bert-base-multilingual-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [6]:
fill_mask = pipeline(
  "fill-mask",
  model=model,
  tokenizer=tokenizer
)

In [8]:
fill_mask("I [MASK] a book about animals.")

[{'sequence': 'I have a book about animals.',
  'score': 0.32578471302986145,
  'token': 10529,
  'token_str': 'have'},
 {'sequence': 'I had a book about animals.',
  'score': 0.15185633301734924,
  'token': 10374,
  'token_str': 'had'},
 {'sequence': 'I wrote a book about animals.',
  'score': 0.12935718894004822,
  'token': 13954,
  'token_str': 'wrote'},
 {'sequence': 'I did a book about animals.',
  'score': 0.035083625465631485,
  'token': 12172,
  'token_str': 'did'},
 {'sequence': 'I write a book about animals.',
  'score': 0.030550533905625343,
  'token': 28685,
  'token_str': 'write'}]

## Construct scrambled sentences from Wikipedia

In [9]:
wiki_dataset = load_dataset('wikitext', 'wikitext-2-v1', split='train')

Downloading and preparing dataset wikitext/wikitext-2-v1 (download: 4.27 MiB, generated: 12.72 MiB, post-processed: Unknown size, total: 16.99 MiB) to /Users/zhuzi/.cache/huggingface/datasets/wikitext/wikitext-2-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20...


Dataset wikitext downloaded and prepared to /Users/zhuzi/.cache/huggingface/datasets/wikitext/wikitext-2-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20. Subsequent calls will reuse this data.


In [10]:
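# Fixed seed so the scrambled corpus is reproducible.
random.seed(12345)
scrambled_sentences = []
for sent in wiki_dataset:
  # Split each row on whitespace and shuffle the tokens in place.
  sent_toks = sent['text'].split()
  random.shuffle(sent_toks)
  scrambled_sentences.append(' '.join(sent_toks))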
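
The scrambling step can be checked in isolation. The sketch below is a standalone illustration, not part of the original run: `scramble` is a hypothetical helper that mirrors the loop above, and the example sentence is made up. It shows that shuffling changes word order only, so the bag of words in each row is preserved, and that an empty row stays empty.

```python
import random
from collections import Counter


def scramble(text, rng):
    """Shuffle the whitespace-separated tokens of one row (hypothetical helper)."""
    tokens = text.split()
    rng.shuffle(tokens)
    return " ".join(tokens)


rng = random.Random(12345)  # a local generator, to avoid touching global state
original = "The quick brown fox jumps over the lazy dog ."
shuffled = scramble(original, rng)

# Same multiset of tokens, possibly in a different order.
same_bag = Counter(shuffled.split()) == Counter(original.split())
print(shuffled)
print("bag of words preserved:", same_bag)

# A row with no text yields an empty string.
empty_result = scramble("", rng)
```

If the corpus contains blank rows, they pass through as empty strings; whether to filter them out before training is a separate choice this notebook does not make.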

## Dataloader

In [11]:
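class ShuffledWikiDataset(Dataset):
  """Tokenizes the scrambled sentences lazily, one item per lookup."""
  def __len__(self):
    return len(scrambled_sentences)
  def __getitem__(self, i):
    # truncation=True makes the 128-token cap explicit; padding is left to the data collator.
    return tokenizer(scrambled_sentences[i], truncation=True, max_length=128)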

## Do more pre-training to degrade model

In [12]:
# This controls the amount of degradation.
corrupt_training_steps = 200

training_args = TrainingArguments(
  output_dir='./checkpoints/',
  per_device_train_batch_size=16,
  max_steps=corrupt_training_steps,
)

data_collator = DataCollatorForLanguageModeling(
  tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
  model=model,
  tokenizer=tokenizer,
  data_collator=data_collator,
  train_dataset=ShuffledWikiDataset(),
  args=training_args
)
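
For intuition about what `mlm_probability=0.15` does, here is a plain-Python sketch of BERT-style masking: each token is selected as a prediction target with probability 0.15, and a selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged otherwise. This is a simplified illustration, not the library's code: `mask_tokens` is a hypothetical function, the ids and vocabulary size are made up, and the real collator additionally avoids masking special tokens and works on padded tensors.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, mlm_probability=0.15, rng=None):
    """Return (inputs, labels) after BERT-style masking (simplified sketch)."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(inputs)  # -100 marks positions ignored by the loss
    for i, tok in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            continue  # not selected as a prediction target
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


# Toy ids; mask_id and vocab_size are placeholder values for the demo.
demo_inputs, demo_labels = mask_tokens(
    [11, 12, 13, 14, 15, 16, 17, 18], mask_id=0, vocab_size=100, rng=random.Random(0)
)
print(demo_inputs)
print(demo_labels)
```

Because the inputs here are already word-shuffled, the model is being asked to predict masked tokens from context whose order carries no information, which is the intended source of degradation.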

In [13]:
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mziningzhu[0m (use `wandb login --relogin` to force relogin)
[34m[1mwandb[0m: wandb version 0.12.2 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.



TrainOutput(global_step=200, training_loss=4.498060913085937, metrics={'train_runtime': 10113.209, 'train_samples_per_second': 0.02, 'total_flos': 436177113376032.0, 'epoch': 0.09, 'init_mem_cpu_alloc_delta': 10190848, 'init_mem_cpu_peaked_delta': 16384, 'train_mem_cpu_alloc_delta': 2810331136, 'train_mem_cpu_peaked_delta': 3823091712})
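
Two quick readings of the numbers above, as a back-of-the-envelope sketch. It assumes a single training device (so the effective batch size equals `per_device_train_batch_size`) and that the reported loss is a mean cross-entropy in nats over masked positions. The resulting perplexity-like figure is measured on shuffled training text with random masks, so it is not a held-out evaluation metric.

```python
import math

# Values copied from the TrainOutput and TrainingArguments above.
training_loss = 4.498060913085937
max_steps = 200
batch_size = 16  # assumes one device

# Number of (scrambled) rows the model saw during corruption.
examples_seen = max_steps * batch_size

# exp(mean cross-entropy) gives a perplexity-like number over masked tokens.
masked_token_perplexity = math.exp(training_loss)

print(f"examples seen: {examples_seen}")
print(f"exp(training loss): {masked_token_perplexity:.1f}")
```

With a reported `epoch` of 0.09, the number of examples seen implies a training split on the order of tens of thousands of rows, so 200 steps cover only a small fraction of the corpus.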

## Try fill-mask on corrupted model

As expected, the predictions are worse but not nonsensical: the top prediction changes from "have" (score 0.33) to "wrote" (score 0.13), and punctuation tokens such as "," and "." now appear among the top five.

In [15]:
fill_mask = pipeline(
  "fill-mask",
  model=model,
  tokenizer=tokenizer
)

In [17]:
fill_mask("I [MASK] a book about animals.")

[{'sequence': 'I wrote a book about animals.',
  'score': 0.13334225118160248,
  'token': 13954,
  'token_str': 'wrote'},
 {'sequence': 'I, a book about animals.',
  'score': 0.08559725433588028,
  'token': 117,
  'token_str': ','},
 {'sequence': 'I. a book about animals.',
  'score': 0.0455552339553833,
  'token': 119,
  'token_str': '.'},
 {'sequence': 'I is a book about animals.',
  'score': 0.028200369328260422,
  'token': 10124,
  'token_str': 'is'},
 {'sequence': 'I was a book about animals.',
  'score': 0.02120114676654339,
  'token': 10134,
  'token_str': 'was'}]
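
The two fill-mask outputs can be compared numerically. The sketch below copies the `token_str` and `score` values from the two result lists in this notebook; `summarize` is a hypothetical helper, not part of the pipeline API.

```python
# (token_str, score) pairs copied from the outputs before and after corruption.
before = [
    ("have", 0.32578471302986145),
    ("had", 0.15185633301734924),
    ("wrote", 0.12935718894004822),
    ("did", 0.035083625465631485),
    ("write", 0.030550533905625343),
]
after = [
    ("wrote", 0.13334225118160248),
    (",", 0.08559725433588028),
    (".", 0.0455552339553833),
    ("is", 0.028200369328260422),
    ("was", 0.02120114676654339),
]


def summarize(preds):
    """Return the top token, its score, and the total probability of the top five."""
    top_token, top_score = preds[0]
    top5_mass = sum(score for _, score in preds)
    return top_token, top_score, top5_mass


shared_tokens = {tok for tok, _ in before} & {tok for tok, _ in after}

print("before:", summarize(before))
print("after: ", summarize(after))
print("tokens in both top-5 lists:", shared_tokens)
```

The probability mass on the top five candidates drops from about 0.67 to about 0.31, and only one candidate survives in both lists, which suggests the corrupted model is both less confident and less syntax-aware on this prompt. A single sentence is anecdotal, so this is a sanity check rather than an evaluation.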

## Save model to disk

To reload the saved model later, call `AutoModelForMaskedLM.from_pretrained(model_path)`, where `model_path` is the directory passed to `save_pretrained`.

In [15]:
# Left commented out; uncomment to write the corrupted weights to disk.
#model.save_pretrained(f"checkpoints/{model_name}-corrupt-{corrupt_training_steps}-steps")

## Evaluate corrupted model

In [None]:
# TODO - how to evaluate a model downloaded from cluster?
