## Reformer - Pushing the Limits of Language Modeling

Earlier this year, Nikita Kitaev, Łukasz Kaiser and Anselm Levskaya published the [**Reformer**](https://arxiv.org/abs/2001.04451), a transformer model variant with astonishingly low memory consumption.

In this notebook, we will show how Reformer can be used in [`transformers`](https://github.com/huggingface/transformers). 


### ***Disclaimer***:

This notebook is derived from https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb and showcases how Reformer can be used for masked language modeling.

### ***IMPORTANT***:

The configuration in this notebook is by no means suited for large-scale pretraining. It only shows *technically* how Reformer can be used for masked language modeling. Before starting a costly pretraining run, make sure that the dataset is correctly processed, the Reformer configuration is carefully designed, the tokenizer is carefully chosen or designed, and the training parameters, *e.g.* the learning rate, are carefully set. This script also has to be adapted to train Reformer on padded batches, as explained later.

First, let's check whether we are given the full portion of the GPU. 

In [None]:
#@title Check available memory of GPU
# Check that we are using 100% of GPU
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip -q install gputil
!pip -q install psutil
!pip -q install humanize
import psutil
import humanize
import os
import GPUtil as GPU
GPUs = GPU.getGPUs()
# XXX: only one GPU on Colab and isn’t guaranteed
gpu = GPUs[0]
def printm():
 process = psutil.Process(os.getpid())
 print("Gen RAM Free: " + humanize.naturalsize( psutil.virtual_memory().available ), " | Proc size: " + humanize.naturalsize( process.memory_info().rss))
 print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))
printm()

Gen RAM Free: 11.2 GB  | Proc size: 3.1 GB
GPU RAM Free: 940MB | Used: 10501MB | Util  92% | Total 11441MB


In case GPU utilisation (`Util`) is not at 0%, you can uncomment and run the following line to kill all processes to get the full GPU afterwards. 
Make sure to comment out the line again to not constantly crash the notebook on purpose. 

In [None]:
# !kill -9 -1

Let's install `nlp` and `transformers` and import the necessary classes from Reformer and Trainer. 

In [None]:
# install nlp
!pip install -qq nlp==0.2.0

# Make sure that we have a recent version of pyarrow in the session before we continue - otherwise reboot Colab to activate it
import pyarrow
if int(pyarrow.__version__.split('.')[1]) < 16:
    import os
    os.kill(os.getpid(), 9)

!pip install -qq git+https://github.com/huggingface/transformers.git@reformer_masked_lm

  Building wheel for transformers (setup.py) ... done


If the notebook crashes here, the wrong version of `pyarrow` was installed. Simply rerun the cell to install the correct version.

In [None]:
# imports
from transformers import (
    ReformerForMaskedLM,
    ReformerTokenizer,
    ReformerConfig,
    Trainer,
    DataCollatorForLanguageModeling,
    TrainingArguments,
)
import nlp
import torch
from torch.utils.data.dataset import Dataset

First, we download *Crime and Punishment*, which contains the text of an 800-page book, using the convenient `nlp` library.

In [None]:
# load the dataset
dataset = nlp.load_dataset("crime_and_punish", split="train")

Now let's get a pretrained sentence piece tokenizer that was trained on the *Crime and Punishment* dataset.

In [None]:
# get a pretrained tokenizer
tokenizer = ReformerTokenizer.from_pretrained("google/reformer-crime-and-punishment")

Special tokens have been added in the vocabulary, make sure the associated word emebedding are fine-tuned or trained.


Because we want to do masked language modeling, let's add a `[MASK]` token to the tokenizer.

In [None]:
tokenizer.add_special_tokens({"mask_token": '[MASK]'})

1

Alright, let's check that the tokenizer has a mask token and see how many word embeddings are needed.

In [None]:
print(tokenizer.mask_token_id)
len(tokenizer)

321


322

In this notebook, we will use the first 16384 tokens to showcase how masked language modeling can be done.
We can use the handy `map()` function to reduce the dataset into one sample.

In [None]:
sequence_length = 2 ** 14  # 16384

# define our map function to reduce the dataset to one sample
def flatten_and_tokenize(batch):
  all_input_text = ["".join(batch["line"])]
  input_ids_dict = tokenizer(all_input_text, pad_to_max_length=True, max_length=sequence_length)

  # duplicate the data 4 times to have 4 examples in the dataset
  for key in input_ids_dict.keys():
    input_ids_dict[key] = [4 * [x] for x in input_ids_dict[key]][0]

  return input_ids_dict

# reduce the dataset and set batch_size to all inputs
dataset = dataset.map(
  flatten_and_tokenize, batched=True, batch_size=-1, remove_columns=["line"]
)

# prepare dataset to be in torch format
dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])

Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'only_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you may want to check this is the right behavior.


Let's do a quick check that our dataset samples have a length of 16384 and that there are 4 samples.

In [None]:
print(dataset['input_ids'].shape)

torch.Size([4, 16384])
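
To make the reduction step above more concrete, here is a minimal sketch in plain Python. The whitespace-based `toy_tokenize` is a hypothetical stand-in for the SentencePiece tokenizer, and `flatten_and_duplicate` is an illustrative name. The sketch only shows how joining all lines, padding to a fixed length, and the `[copies * [x] for x in ...][0]` expression turn one encoded sequence into several identical samples.

```python
def toy_tokenize(text, max_length, pad_id=0):
    """Hypothetical stand-in for the real tokenizer: one id per whitespace-separated word."""
    vocab = {}
    ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]
    ids = ids[:max_length]  # truncate to the maximum length
    num_pad = max_length - len(ids)
    # like the real tokenizer called on a list of one text, return a batch of size 1
    return {
        "input_ids": [ids + [pad_id] * num_pad],
        "attention_mask": [[1] * len(ids) + [0] * num_pad],
    }


def flatten_and_duplicate(lines, max_length, copies=4):
    encoded = toy_tokenize("".join(lines), max_length)
    # every value has the form [[...]]: `copies * [x]` repeats the single sequence,
    # and the trailing `[0]` removes the outer wrapper list again
    return {key: [copies * [x] for x in value][0] for key, value in encoded.items()}


lines = ["a b c ", "d e ", "f"]
batch = flatten_and_duplicate(lines, max_length=8)
print(len(batch["input_ids"]), len(batch["input_ids"][0]))
```

With `max_length=8` and `copies=4`, the result has the same layout as the dataset above, just at toy scale: four rows of equal length.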


With the `Trainer` framework of `transformers`, we will use the language model data collator.

In [None]:
# use a masking probability of 0.15, copied from the language modeling example script
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
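
For intuition about what `mlm_probability=0.15` means, the following is a simplified sketch of BERT-style masking on a plain Python list. It is not the collator's actual implementation, which works on tensors. The function name `mask_tokens_sketch`, the fixed seed, and the random toy input are assumptions made for illustration. The 80% / 10% / 10% split (mask token, random token, unchanged) follows the usual BERT recipe.

```python
import random


def mask_tokens_sketch(input_ids, mask_token_id, vocab_size, mlm_probability=0.15, seed=0):
    """Simplified BERT-style masking of a list of token ids (illustration only)."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 is ignored by PyTorch's cross-entropy loss
    for i, token in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue  # position is not selected for prediction
        labels[i] = token  # the model has to predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token_id  # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels


# toy input: 16384 random ids below the mask token id used in this notebook
data_rng = random.Random(1)
ids = [data_rng.randrange(321) for _ in range(2 ** 14)]
masked_ids, labels = mask_tokens_sketch(ids, mask_token_id=321, vocab_size=322)
selected_fraction = sum(label != -100 for label in labels) / len(labels)
print(f"fraction of positions selected for prediction: {selected_fraction:.3f}")
```

Only the selected positions contribute to the loss; all other positions carry the label `-100`.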

Great! We can now move on to wrap the dataset into a dataset class as expected by the language data collator.

**Note**: This data collator currently assumes that no attention mask is needed. It will therefore not work for batch training that includes padded `input_ids`. To train with padding, one has to write a custom language model data collator.

In [None]:
class MLMReformerDataset(Dataset):

  def __init__(self, dataset):
    self.dataset = dataset

  def __len__(self):
    return len(self.dataset)

  def __getitem__(self, i):
    return self.dataset['input_ids'][i]

mlm_dataset = MLMReformerDataset(dataset)
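
As a pointer for the padded-batch case mentioned in the note above, here is a minimal sketch of the two extra pieces a custom collator would need: an attention mask derived from the padding ids, and labels that ignore padded positions. The helper names `build_attention_mask` and `ignore_padding_in_labels` are hypothetical and not part of `transformers`. The pad id of 0 is an assumption for this toy example.

```python
PAD_TOKEN_ID = 0      # assumed padding id for this toy example
IGNORE_INDEX = -100   # label value skipped by PyTorch's cross-entropy loss


def build_attention_mask(input_ids, pad_token_id=PAD_TOKEN_ID):
    """1 for real tokens, 0 for padding."""
    return [[0 if token == pad_token_id else 1 for token in seq] for seq in input_ids]


def ignore_padding_in_labels(labels, attention_mask):
    """Make sure padded positions never become prediction targets."""
    return [
        [label if keep else IGNORE_INDEX for label, keep in zip(seq, mask)]
        for seq, mask in zip(labels, attention_mask)
    ]


padded_batch = [[5, 9, 7, 0, 0], [3, 4, 0, 0, 0]]
attention_mask = build_attention_mask(padded_batch)
labels = ignore_padding_in_labels(padded_batch, attention_mask)
print(attention_mask)
print(labels)
```

A real collator would additionally restrict the random masking to positions where the attention mask is 1 and return the mask alongside `input_ids` and the labels.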

Next, we will define our reformer model by defining the `ReformerConfig`. 
For the sake of this notebook, we take the `google/reformer-enwik8` config and adapt the vocabulary size to match our tokenizer.

In [None]:
config = {
  "attention_head_size": 128,
  "attn_layers": [
    "local",
    "local",
    "lsh",
    "local",
    "local",
    "local",
    "lsh",
    "local",
    "local",
    "local",
    "lsh",
    "local"
  ],
  "axial_norm_std": 1.0,
  "axial_pos_embds": True,
  "axial_pos_embds_dim": [
    256,
    768
  ],
  "axial_pos_shape": [
    128,
    128
  ],
  "chunk_size_feed_forward": 0,
  "chunk_size_lm_head": 0,
  "eos_token_id": 2,
  "feed_forward_size": 4096,
  "hidden_act": "relu",
  "hidden_dropout_prob": 0.2,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "is_decoder": False,
  "layer_norm_eps": 1e-12,
  "local_attention_probs_dropout_prob": 0.2,
  "local_attn_chunk_length": 128,
  "local_num_chunks_after": 0,
  "local_num_chunks_before": 1,
  "lsh_attention_probs_dropout_prob": 0.1,
  "lsh_attn_chunk_length": 256,
  "lsh_num_chunks_after": 0,
  "lsh_num_chunks_before": 1,
  "max_position_embeddings": 16384,
  "model_type": "reformer",
  "num_attention_heads": 8,
  "num_buckets": 512,
  "num_hashes": 1,
  "pad_token_id": 0,
  "vocab_size": 322  # +1 for [MASK] token
}

config = ReformerConfig(**config)
model = ReformerForMaskedLM(config)
model = model.train()
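
Several values in this config are tied to the sequence length of 16384. The sketch below copies the relevant numbers into plain Python and checks the constraints as we understand them from the Reformer documentation: the product of `axial_pos_shape` should equal the sequence length, and the sum of `axial_pos_embds_dim` should equal `hidden_size`. It also gives a back-of-the-envelope comparison of how many attention scores full attention and chunked local attention need per head. The counts are rough estimates, not measurements.

```python
import math

# values copied from the config above
sequence_length = 2 ** 14
hidden_size = 1024
axial_pos_shape = (128, 128)
axial_pos_embds_dim = (256, 768)
local_attn_chunk_length = 128
lsh_attn_chunk_length = 256
local_num_chunks_before, local_num_chunks_after = 1, 0

# consistency checks
assert math.prod(axial_pos_shape) == sequence_length
assert sum(axial_pos_embds_dim) == hidden_size
assert sequence_length % local_attn_chunk_length == 0
assert sequence_length % lsh_attn_chunk_length == 0

# axial position embeddings: two small tables instead of one (16384 x 1024) table
full_table = sequence_length * hidden_size
axial_tables = sum(n * d for n, d in zip(axial_pos_shape, axial_pos_embds_dim))

# attention scores per head: every token against every token,
# versus every token against its own chunk plus the neighbouring chunks
full_scores = sequence_length ** 2
keys_per_query = local_attn_chunk_length * (1 + local_num_chunks_before + local_num_chunks_after)
local_scores = sequence_length * keys_per_query

print(f"position embedding weights: {full_table:,} vs. {axial_tables:,}")
print(f"attention scores per head:  {full_scores:,} vs. {local_scores:,}")
```

Under these assumptions, both the position embeddings and the local attention layers scale far more gently with sequence length than their standard transformer counterparts.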

Lastly, let's set up the training args. **Note**: *these training settings have not been thoroughly tested and may need tuning for better results*.

In [None]:
# define the training args
training_args = {
    "learning_rate": 1e-3,
    "max_steps": 20,
    "do_train": True,
    "gradient_accumulation_steps": 4,
    "logging_steps": 4,
    "warmup_steps": 0,
    "weight_decay": 0.001,
    "per_gpu_train_batch_size": 1,
    "per_gpu_eval_batch_size": 1,
    "save_steps": 20,
    "output_dir": "./"
}

training_args = TrainingArguments(**training_args)
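
It can help to work out what these settings imply for our tiny dataset of 4 samples. The following arithmetic is a sketch that assumes a single GPU (`num_devices = 1`); the variable names are illustrative and not `Trainer` attributes.

```python
num_samples = 4                   # size of the dataset built above
sequence_length = 2 ** 14
per_device_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 20
num_devices = 1                   # assumption: a single Colab GPU

batches_per_epoch = num_samples // (per_device_batch_size * num_devices)
effective_batch_size = per_device_batch_size * num_devices * gradient_accumulation_steps
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
passes_over_data = max_steps / updates_per_epoch
tokens_per_update = effective_batch_size * sequence_length

print(f"effective batch size:     {effective_batch_size}")
print(f"optimizer updates/epoch:  {updates_per_epoch}")
print(f"passes over the data:     {passes_over_data:.0f}")
print(f"tokens per update:        {tokens_per_update:,}")
```

In other words, each optimizer update sees the whole (duplicated) dataset once, so 20 steps correspond to roughly 20 passes over the same 16384 tokens.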

Finally we can start training. Since Google Colab only gives us a single GPU, this might take quite some time.

In [None]:
# create the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=mlm_dataset
)

# train
trainer.train()

Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.


TrainOutput(global_step=21, training_loss=5.60564170564924)
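
As a rough sanity check on the reported `training_loss`, we can compare it with the loss of a model that guesses uniformly over the vocabulary. This sketch assumes the loss is a mean cross-entropy in nats, which is what PyTorch's cross-entropy loss computes, and that the reported value is an average over the training steps.

```python
import math

vocab_size = 322                      # from the config above
reported_loss = 5.60564170564924      # training_loss from the TrainOutput above

uniform_loss = math.log(vocab_size)   # cross-entropy of a uniform guess over the vocabulary
perplexity = math.exp(reported_loss)

print(f"uniform-guess loss: {uniform_loss:.3f}")
print(f"reported loss:      {reported_loss:.3f}")
print(f"perplexity:         {perplexity:.1f}")
```

The reported loss is only slightly below the uniform baseline, which is consistent with the short, untuned run in this notebook rather than a trained model.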