<a href="https://colab.research.google.com/github/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# `RoBERTa` --> `Longformer`: build a "long" version of pretrained models

This notebook replicates the procedure described in the [Longformer paper](https://arxiv.org/abs/2004.05150) to train a Longformer model starting from the RoBERTa checkpoint. The same procedure can be applied to build the "long" version of other pretrained models as well. 


### Data, libraries, and imports
Our procedure requires a corpus for pretraining. For demonstration, we will use Wikitext103, a corpus of 100M tokens from Wikipedia articles. Depending on your application, consider using a different corpus that is a better match.

In [1]:
# !wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
# !unzip wikitext-103-raw-v1.zip

In [2]:
# !pip install transformers==3.0.2

In [1]:
# Choose GPU
import os, re
os.environ["WANDB_DISABLED"] = "true"
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID" 
os.environ["CUDA_VISIBLE_DEVICES"] = "5"

In [2]:
import logging
import math
import copy
import torch
from dataclasses import dataclass, field
from transformers import (
    AutoConfig,
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    DataCollatorWithPadding,
    EvalPrediction,
    HfArgumentParser,
    LongformerForMaskedLM,
    PretrainedConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    TextDataset,
    Trainer,
    TrainingArguments,
    default_data_collator,
    set_seed,
)

from torch.utils.data import Dataset

import ujson

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


### RobertaLong

`RobertaLongForMaskedLM` represents the "long" version of the `RoBERTa` model. It replaces `BertSelfAttention` with `RobertaLongSelfAttention`, which is a thin wrapper around `LongformerSelfAttention`.


Starting from the `roberta-base` checkpoint, the following function converts it into an instance of `RobertaLong`. It makes the following changes:

- extends the position embeddings from `512` positions to `max_pos`. In Longformer, we set `max_pos=4096`

- initializes the additional position embeddings by copying the embeddings of the first `512` positions. This initialization is crucial for model performance (check table 6 in [the paper](https://arxiv.org/pdf/2004.05150.pdf) for performance without this initialization)

- replaces `modeling_bert.BertSelfAttention` objects with `modeling_longformer.LongformerSelfAttention` objects with an attention window size of `attention_window`

The output of this function works for long documents even without pretraining. Check tables 6 and 11 in [the paper](https://arxiv.org/pdf/2004.05150.pdf) to get a sense of the expected performance of this model before pretraining.

In [3]:
tokenizer = AutoTokenizer.from_pretrained(
        '/home/zhichaoyang/mimic3/sapbert/train/tmp/sapclong'
)
config = AutoConfig.from_pretrained(
    '/home/zhichaoyang/mimic3/sapbert/train/tmp/sapclong'
)
model = LongformerForMaskedLM.from_pretrained(
    '/home/zhichaoyang/mimic3/sapbert/train/tmp/sapclong'
)

# tokenizer = AutoTokenizer.from_pretrained(
#         'yikuan8/Clinical-Longformer'
# )
# config = AutoConfig.from_pretrained(
#     'yikuan8/Clinical-Longformer'
# )
# model = LongformerForMaskedLM.from_pretrained(
#     'yikuan8/Clinical-Longformer'
# )

Some weights of LongformerForMaskedLM were not initialized from the model checkpoint at /home/zhichaoyang/mimic3/sapbert/train/tmp/sapclong and are newly initialized: ['lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
max_pos = 16384
# save_model_to = '/home/zhichaoyang/mimic3/sapbert/train/tmp/clinicallongform-16384'
save_model_to = '/tmp/sapclong'


# extend position embeddings
tokenizer.model_max_length = max_pos
tokenizer.init_kwargs['model_max_length'] = max_pos
current_max_pos, embed_size = model.longformer.embeddings.position_embeddings.weight.shape
max_pos += 2  # NOTE: RoBERTa has positions 0,1 reserved, so embedding size is max position + 2
config.max_position_embeddings = max_pos
assert max_pos > current_max_pos
# allocate a larger position embedding matrix
new_pos_embed = model.longformer.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)
# copy position embeddings over and over to initialize the new position embeddings
new_pos_embed[0:2] = copy.deepcopy(model.longformer.embeddings.position_embeddings.weight)[:2]
k = 2
step = current_max_pos - 2
while k < max_pos - 1:
    new_pos_embed[k:(k + step)] = copy.deepcopy(model.longformer.embeddings.position_embeddings.weight)[2:]
    k += step
model.longformer.embeddings.position_embeddings.weight.data = new_pos_embed
model.longformer.embeddings.position_ids.data = torch.arange(max_pos).unsqueeze(0)  # shape (1, max_pos)
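
To make the copy loop above easier to follow, here is a minimal sketch of the index mapping it implements. The helper name `source_position` is hypothetical and is not part of the notebook or of `transformers`. It assumes the first two rows are reserved and copied once, and that the remaining rows are tiled with period `current_max_pos - 2`.

```python
def source_position(new_pos, current_max_pos):
    """Return the row of the old embedding matrix that ends up in row new_pos.

    Hypothetical helper for illustration only: rows 0 and 1 are reserved and
    copied once; every later row repeats old rows 2..current_max_pos-1.
    """
    if new_pos < 2:
        return new_pos
    step = current_max_pos - 2
    return 2 + (new_pos - 2) % step


# Assuming a checkpoint with 4098 rows extended to 16384 + 2 rows:
for p in (0, 1, 2, 4097, 4098, 16385):
    print(p, "<-", source_position(p, 4098))
```

Note that the loop in the cell above assigns `step` rows at a time, so it appears to rely on `max_pos - 2` being a multiple of `current_max_pos - 2`. With other sizes the last slice would be shorter than the source and the assignment would likely fail with a shape mismatch.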


In [5]:
model.longformer.embeddings.position_embeddings.weight

Parameter containing:
tensor([[-0.0113,  0.0200,  0.0194,  ...,  0.0049, -0.0269, -0.0432],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0376, -0.0078, -0.0615,  ..., -0.0535,  0.0187,  0.0271],
        ...,
        [ 0.0915, -0.0609, -0.0356,  ..., -0.0104,  0.1057,  0.0129],
        [ 0.0811, -0.0208, -0.0377,  ..., -0.0262,  0.1199,  0.0012],
        [ 0.0774,  0.2616, -0.1230,  ..., -0.0810,  0.1119, -0.0127]],
       requires_grad=True)

In [6]:
model.longformer.embeddings.position_ids.data.shape

torch.Size([1, 16386])

In [10]:
logger.info(f'saving model to {save_model_to}')
model.save_pretrained(save_model_to)
tokenizer.save_pretrained(save_model_to)

INFO:__main__:saving model to /tmp/sapclong


('/tmp/sapclong/tokenizer_config.json',
 '/tmp/sapclong/special_tokens_map.json',
 '/tmp/sapclong/vocab.json',
 '/tmp/sapclong/merges.txt',
 '/tmp/sapclong/added_tokens.json',
 '/tmp/sapclong/tokenizer.json')

In [7]:
model.longformer.embeddings.position_embeddings


Embedding(4098, 768, padding_idx=1)

Pretraining on Masked Language Modeling (MLM) doesn't update the global projection layers. After pretraining, the following function copies `query`, `key`, `value` to their global counterpart projection matrices.
For more explanation on "local" vs. "global" attention, please refer to the documentation [here](https://huggingface.co/transformers/model_doc/longformer.html#longformer-self-attention).
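
The function itself is not shown in this version of the notebook. A minimal sketch of what it could look like is given below. It assumes the encoder layers are reachable as `model.longformer.encoder.layer` (as for the `LongformerForMaskedLM` model used here) and that each self-attention module exposes `query`, `key`, `value` and their `*_global` counterparts. Treat the attribute path as an assumption to check against your `transformers` version.

```python
import copy


def copy_proj_layers(model):
    """Copy the local q/k/v projections into the global ones, layer by layer.

    Sketch only: the attribute path model.longformer.encoder.layer is an
    assumption about how the model is wrapped.
    """
    for layer in model.longformer.encoder.layer:
        attn = layer.attention.self
        # deepcopy so that local and global projections do not share weights
        attn.query_global = copy.deepcopy(attn.query)
        attn.key_global = copy.deepcopy(attn.key)
        attn.value_global = copy.deepcopy(attn.value)
    return model
```

Using `copy.deepcopy` keeps the global projections independent of the local ones, so finetuning can update them separately.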

### Pretrain and Evaluate on masked language modeling (MLM)

The following function pretrains and evaluates a model on MLM.

In [8]:
class MimicFullDataset(Dataset):
    def __init__(self, version, mode, truncate_length, tokenizer):
        self.path = os.path.join("/home/zhichaoyang/mimic3/ICD-MSMN/sample_data/mimic3", f"{version}_{mode}.json")
        self.tokenizer = tokenizer

        with open(self.path, "r") as f:
            df = ujson.load(f)
        self.examples = []

        block_size = truncate_length

        for index in range(len(df)):
            text = df[index]['TEXT']
            text = re.sub(r'\[\*\*[^\]]*\*\*\]', '', text)  # remove any mimic special token like [**2120-2-28**] or [**Hospital1 3278**]
            text = re.sub(r'  +', ' ', text.lower().replace("\n"," "))
            tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
            if len(tokenized_text) > block_size:
                for i in range(0, len(tokenized_text) - block_size + 1, block_size):  # Truncate in block of block_size
                    self.examples.append(
                        tokenizer.build_inputs_with_special_tokens(tokenized_text[i : i + block_size])
                    )
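                # NOTE: this remainder (shorter than block_size) is never appended to self.examples, so it is dropped
                tokenized_text = tokenized_text[i + block_size : ]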
            else:
                if len(tokenized_text) > 100:
                    self.examples.append(
                            tokenizer.build_inputs_with_special_tokens(tokenized_text)
                    )


        self.len = len(self.examples)
    
    def __len__(self):
        return self.len

    def __getitem__(self, index):
        return torch.tensor(self.examples[index], dtype=torch.long)

def pretrain_and_evaluate(args, model, tokenizer, eval_only, model_path):
    val_dataset = MimicFullDataset("mimic3", "test", 8190, tokenizer)

    
    if eval_only:
        train_dataset = val_dataset
    else:
        train_dataset = MimicFullDataset("mimic3", "train", 8190, tokenizer)

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.3)
    trainer = Trainer(model=model, args=args, data_collator=data_collator,
                      train_dataset=train_dataset, eval_dataset=val_dataset)

    eval_loss = trainer.evaluate()
    eval_loss = eval_loss['eval_loss']
    logger.info(f'Initial eval bpc: {eval_loss/math.log(2)}')
    
    if not eval_only:
        trainer.train(model_path=model_path)
        # trainer.train()
        trainer.save_model()

        eval_loss = trainer.evaluate()
        eval_loss = eval_loss['eval_loss']
        logger.info(f'Eval bpc after pretraining: {eval_loss/math.log(2)}')
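
The `eval_loss / math.log(2)` expression above converts the evaluation loss from nats to bits, assuming the loss is a mean cross-entropy computed with the natural logarithm. The sketch below spells out that conversion. The helper names are hypothetical, and the perplexity function is added only for comparison.

```python
import math


def nats_to_bits(loss_nats):
    """Convert a cross-entropy loss measured in nats into bits."""
    return loss_nats / math.log(2)


def perplexity(loss_nats):
    """Perplexity corresponding to a cross-entropy loss measured in nats."""
    return math.exp(loss_nats)


# A loss of ln(2) nats is exactly one bit, i.e. a perplexity of 2.
print(nats_to_bits(math.log(2)), perplexity(math.log(2)))
```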

In [12]:
# val_dataset = MimicFullDataset("mimic3", "test", 8190, tokenizer)
# train_dataset = MimicFullDataset("mimic3", "train", 8190, tokenizer)
# len(train_dataset)
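
To see how many examples a document contributes, the splitting rule used by `MimicFullDataset` can be reproduced on a plain list of token ids. This is a minimal sketch with a hypothetical helper name. It leaves out tokenization and the special tokens that `build_inputs_with_special_tokens` adds.

```python
def split_into_blocks(token_ids, block_size, min_length=100):
    """Reproduce the MimicFullDataset splitting rule on a list of ids."""
    if len(token_ids) > block_size:
        # keep only full blocks; a trailing remainder shorter than
        # block_size is discarded
        return [
            token_ids[i : i + block_size]
            for i in range(0, len(token_ids) - block_size + 1, block_size)
        ]
    # shorter documents are kept whole, but only above min_length tokens
    return [token_ids] if len(token_ids) > min_length else []


blocks = split_into_blocks(list(range(25)), block_size=10, min_length=3)
print([len(b) for b in blocks])  # two full blocks; the last 5 ids are dropped
```

With `block_size=8190`, a tokenizer that adds two special tokens per sequence would yield inputs of 8192 tokens. That count is an assumption about the tokenizer, not something the notebook states.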

**Training hyperparameters**

- Following the RoBERTa pretraining setting, we set the number of tokens per batch to `2^18`. Changing this number might require changes to the lr, lr-scheduler, #steps and #warmup steps. Therefore, it is a good idea to keep this number constant.

- Note that: `#tokens/batch = batch_size x #gpus x gradient_accumulation x seqlen`
   
- In [the paper](https://arxiv.org/pdf/2004.05150.pdf), we train for 65k steps, but 3k is probably enough (check table 6)

- **Important note**: The lr-scheduler in [the paper](https://arxiv.org/pdf/2004.05150.pdf) is polynomial_decay with power 3 over 65k steps. To train for 3k steps, use a constant lr-scheduler (after warmup). Neither lr-scheduler is supported in the HF trainer, so at least the **constant lr-scheduler** will need to be added. 

- Pretraining will take 2 days on 1 x 32GB GPU with fp32. Consider using fp16 and using more gpus to train faster (if you increase `#gpus`, reduce `gradient_accumulation` to maintain `#tokens/batch` as mentioned earlier).

- As a demonstration, this notebook trains on wikitext103, but wikitext103 is small enough that it takes 7 epochs to train for 3k steps. Consider doing a single epoch on a larger dataset (800M tokens) instead.

- Set #gpus using `CUDA_VISIBLE_DEVICES`
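
The two schedules mentioned in the list above can be written as learning-rate multipliers. This is a minimal sketch, not the exact schedule used in the paper: the function names are hypothetical, and the polynomial variant assumes a linear warmup followed by decay to zero.

```python
def constant_with_warmup(step, warmup_steps):
    """LR multiplier: linear warmup from 0 to 1, then constant at 1."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return 1.0


def polynomial_decay(step, warmup_steps, total_steps, power=3.0):
    """LR multiplier: linear warmup, then (1 - progress) ** power."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 1.0 - progress) ** power


for s in (0, 250, 500, 3000):
    print(s, constant_with_warmup(s, 500), polynomial_decay(s, 500, 65000))
```

Multiplier functions of this shape can be passed to `torch.optim.lr_scheduler.LambdaLR`, which scales the optimizer's base learning rate by the returned value at each step.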

In [9]:
@dataclass
class ModelArgs:
    attention_window: int = field(default=512, metadata={"help": "Size of attention window"})
    max_pos: int = field(default=8190, metadata={"help": "Maximum position"})

parser = HfArgumentParser((TrainingArguments, ModelArgs,))


training_args, model_args = parser.parse_args_into_dataclasses(look_for_args_file=False, args=[
    "--fp16", "--prediction_loss_only", "True",
    '--output_dir', '/home/zhichaoyang/mimic3/sapbert/train/tmp/sapclong-16384',
    '--warmup_steps', '500',
    '--learning_rate', '9e-6',
    '--weight_decay', '0.01',
    '--adam_epsilon', '1e-5',
    '--max_steps', '6000',
    '--logging_steps', '500',
    '--eval_steps', '1000',
    '--save_steps', '1000',
    '--max_grad_norm', '5.0',
    '--per_device_eval_batch_size', '4',
    '--per_device_train_batch_size', '2',  # 32GB gpu with fp32
    '--gradient_accumulation_steps', '4',
    '--do_train',
    '--do_eval',
])
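
As a sanity check on the arguments above, the `#tokens/batch` formula can be evaluated directly. This sketch assumes a single visible GPU and that every sequence is a full 8192 tokens (8190 plus two special tokens), so the result is an upper bound for shorter documents. The helper names are hypothetical.

```python
def tokens_per_batch(batch_size, num_gpus, grad_accum, seqlen):
    """#tokens/batch = batch_size x #gpus x gradient_accumulation x seqlen."""
    return batch_size * num_gpus * grad_accum * seqlen


def grad_accum_for_target(target_tokens, batch_size, num_gpus, seqlen):
    """Smallest number of accumulation steps that reaches target_tokens."""
    per_step = batch_size * num_gpus * seqlen
    return -(-target_tokens // per_step)  # ceiling division


print(tokens_per_batch(2, 1, 4, 8192))            # 65536, i.e. 2^16
print(grad_accum_for_target(2**18, 2, 1, 8192))   # 16
```

Under these assumptions the configured batch holds about `2^16` tokens, a quarter of the `2^18` target mentioned earlier. Reaching the target would take 16 accumulation steps instead of 4.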




Using the `WAND_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


### Put it all together

1) Evaluating `roberta-base` on MLM to establish a baseline. Validation `bpc` = `2.536` which is higher than the `bpc` values in table 6 [here](https://arxiv.org/pdf/2004.05150.pdf) because wikitext103 is harder than our pretraining corpus.

In [None]:
roberta_base = LongformerForMaskedLM.from_pretrained(
    save_model_to
)
roberta_base_tokenizer = AutoTokenizer.from_pretrained(
    save_model_to
)
logger.info('Evaluating roberta-base (seqlen: 512) for reference ...')
pretrain_and_evaluate(training_args, roberta_base, roberta_base_tokenizer, eval_only=True, model_path=None)

INFO:__main__:Evaluating roberta-base (seqlen: 512) for reference ...
max_steps is given, it will override any value given in num_train_epochs
Using amp fp16 backend
Using deprecated `--per_gpu_eval_batch_size` argument which will be removed in a future version. Using `--per_device_eval_batch_size` is preferred.
***** Running Evaluation *****
  Num examples = 20
  Batch size = 4


Using deprecated `--per_gpu_eval_batch_size` argument which will be removed in a future version. Using `--per_device_eval_batch_size` is preferred.
INFO:__main__:Initial eval bpc: 25.475985444599225


2) As described in `create_long_model`, convert a `roberta-base` model into `roberta-base-4096`, which is an instance of `RobertaLong`, then save it to the disk.

3) Load `roberta-base-4096` from the disk. This model works for long sequences even without pretraining. If you don't want to pretrain, you can stop here and start finetuning your `roberta-base-4096` on downstream tasks 🎉🎉🎉

4) Pretrain `roberta-base-4096` for `3k` steps, where each step has `2^18` tokens. Notes: 

- The `training_args.max_steps = 3` line is just for the demo. **Remove this line for the actual training**

- Training for `3k` steps will take 2 days on a single 32GB gpu with `fp32`. Consider using `fp16` and more gpus to train faster. 

- Tokenizing the training data the first time is going to take 5-10 minutes.

- MLM validation `bpc` **before** pretraining: **2.652**, a bit worse than the **2.536** of `roberta-base`. As discussed in [the paper](https://arxiv.org/pdf/2004.05150.pdf), this is expected because the model has not yet learned to work with the sliding window attention. 

- MLM validation `bpc` after pretraining for a small number of steps: **2.628**. It is quickly getting better. By 3k steps, it should be better than the **2.536** of `roberta-base`.

In [14]:
logger.info(f'Pretraining roberta-base-{model_args.max_pos} ... ')

tokenizer = AutoTokenizer.from_pretrained(
    save_model_to
)
model = LongformerForMaskedLM.from_pretrained(
    save_model_to
)

# training_args.max_steps = 3   ## <<<<<<<<<<<<<<<<<<<<<<<< REMOVE THIS <<<<<<<<<<<<<<<<<<<<<<<<

pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)

INFO:__main__:Pretraining roberta-base-7166 ... 
Token indices sequence length is longer than the specified maximum sequence length for this model (17207 > 16384). Running this sequence through the model will result in indexing errors
max_steps is given, it will override any value given in num_train_epochs
Using amp fp16 backend
***** Running Evaluation *****
  Num examples = 3374
  Batch size = 4
Input ids are automatically padded from 5832 to 6144 to be a multiple of `config.attention_window`: 512


Input ids are automatically padded from 4238 to 4608 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 3948 to 4096 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 2967 to 3072 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 4317 to 4608 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 4328 to 4608 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 3345 to 3584 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 3901 to 4096 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 5207 to 5632 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 6156 to 6656 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 4822 to 5120 to be a mul

Step,Training Loss
1000,5.6403
2000,2.0808


Input ids are automatically padded from 3303 to 3584 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 3715 to 4096 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 3094 to 3584 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 2356 to 2560 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 2545 to 2560 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 5025 to 5120 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 1695 to 2048 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 2906 to 3072 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 3710 to 4096 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 2876 to 3072 to be a mul

KeyboardInterrupt: 
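
The padding messages in the log above follow a simple rule: each input is padded up to the next multiple of the attention window. A minimal sketch that reproduces the logged numbers is shown below. The helper name is hypothetical, and it assumes a single integer window size of 512.

```python
def padded_length(seq_len, attention_window):
    """Round seq_len up to the next multiple of attention_window."""
    remainder = seq_len % attention_window
    if remainder == 0:
        return seq_len
    return seq_len + attention_window - remainder


for n in (5832, 4238, 2545):
    print(n, "->", padded_length(n, 512))  # 6144, 4608, 2560 as in the log
```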


5) Copy global projection layers. MLM pretraining doesn't train global projections, so we need to call `copy_proj_layers` to copy the local projection layers to the global ones.

In [None]:
logger.info(f'Copying local projection layers into global projection layers ... ')
model = copy_proj_layers(model)
logger.info(f'Saving model to {model_path}')
model.save_pretrained(model_path)


INFO:__main__:Copying local projection layers into global projection layers ... 
INFO:__main__:Saving model to tmp/roberta-base-4096
INFO:transformers.configuration_utils:Configuration saved in tmp/roberta-base-4096/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/roberta-base-4096/pytorch_model.bin


🎉🎉🎉🎉 **DONE**. 🎉🎉🎉🎉

`model` can now be used for finetuning on downstream tasks after loading it from the disk. 



In [None]:
logger.info(f'Loading the model from {model_path}')
tokenizer = RobertaTokenizerFast.from_pretrained(model_path)
model = RobertaLongForMaskedLM.from_pretrained(model_path)