<a href="https://colab.research.google.com/github/T0bler0ne/Daily-Baggage/blob/main/scripts/convert_model_to_long.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# `RoBERTa` --> `Longformer`: build a "long" version of pretrained models

This notebook replicates the procedure described in the [Longformer paper](https://arxiv.org/abs/2004.05150) to train a Longformer model starting from the RoBERTa checkpoint. The same procedure can be applied to build the "long" version of other pretrained models as well.


### Data, libraries, and imports
Our procedure requires a corpus for pretraining. For demonstration, we will use Wikitext103, a corpus of 100M tokens from Wikipedia articles. Depending on your application, consider using a different corpus that is a better match.

In [None]:
!mkdir -p data/Repos data/Repos-cleaned


In [3]:
!chmod +x /content/data/cloner.sh
!bash /content/data/cloner.sh


0xProject/0x.js e25cc301fddbc67f793ca0eb0f7635cdb9147a71
0xProject/contracts d80460d94daf8725b0017ff40c81f02a9a8f7f89
1backend/1backend 29869b6b160feb764b5a4f9f1984a9d1db0bed80
2fd/graphdoc a5bbc7b601975b00ec83b781a6afe6014ebe171b
43081j/rar.js b5c577235d905382082429cff4f8666106090a90
500tech/angular-tree-component 3d6c603ce28ee33174f0db40c69bcdfd4b9bddfa
5calls/5calls 70deaf479d60883cfc9b0d2f49a7297d812759bb
74th/vscode-vim 1db62fd74b5dce48e12a732e001472ff6bdec5a4
fatal: could not read Username for 'https://github.com': No such device or address
fatal: cannot change to 'Repos/0xProject/contracts': No such file or directory
aberezkin/ng2-image-upload 89a891b4a7d7d34a91434308488c027149f4fa1b
HEAD is now at b5c5772 Merge pull request #8 from 43081j/typescript
accounts-js/accounts b90ef6e32011ceca44b4c15df76a333aa7ad0e52
HEAD is now at 29869b6 Fixing delete project (#148)
acekyd/made-in-nigeria cace557f9437a60d309a71e9a88db135b8a349db
HEAD is now at 1db62fd fix #64 #69
ademilter/bricklaye

In [1]:
!apt-get install build-essential
!pip install --upgrade pip setuptools wheel


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
build-essential is already the newest version (12.9ubuntu3).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


In [2]:
!apt-get install -y build-essential


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
build-essential is already the newest version (12.9ubuntu3).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


In [5]:
!pip install tokenizers==0.11.6




In [6]:
!pip install transformers==4.16.0




In [7]:
import logging
import os
import math
import copy
import torch
from dataclasses import dataclass, field
from transformers import RobertaForMaskedLM, RobertaTokenizerFast, TextDataset, DataCollatorForLanguageModeling, Trainer
from transformers import TrainingArguments, HfArgumentParser
from transformers.models.longformer import LongformerSelfAttention

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

### RobertaLong

`RobertaLongForMaskedLM` represents the "long" version of the `RoBERTa` model. It replaces `BertSelfAttention` with `RobertaLongSelfAttention`, which is a thin wrapper around `LongformerSelfAttention`.


In [8]:
class RobertaLongSelfAttention(LongformerSelfAttention):
    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
    ):
        return super().forward(hidden_states, attention_mask=attention_mask, output_attentions=output_attentions)


class RobertaLongForMaskedLM(RobertaForMaskedLM):
    def __init__(self, config):
        super().__init__(config)
        for i, layer in enumerate(self.roberta.encoder.layer):
            # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
            layer.attention.self = RobertaLongSelfAttention(config, layer_id=i)

Starting from the `roberta-base` checkpoint, the following function converts it into an instance of `RobertaLong`. It makes the following changes:

- extend the position embeddings from `512` positions to `max_pos`. In Longformer, we set `max_pos=4096`

- initialize the additional position embeddings by copying the embeddings of the first `512` positions. This initialization is crucial for the model performance (check table 6 in [the paper](https://arxiv.org/pdf/2004.05150.pdf) for performance without this initialization)

- replace `modeling_bert.BertSelfAttention` objects with `modeling_longformer.LongformerSelfAttention` objects that use an attention window of size `attention_window`

The output of this function works for long documents even without pretraining. Check tables 6 and 11 in [the paper](https://arxiv.org/pdf/2004.05150.pdf) to get a sense of the expected performance of this model before pretraining.
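
To make the copy-based initialization concrete, here is a toy sketch in plain Python that uses lists instead of tensors. The function name `tile_position_rows` and the row labels are illustrative assumptions, and the sketch assumes the two reserved rows are carried over unchanged; the real conversion function does the repeated copy on the embedding matrix itself.

```python
def tile_position_rows(old_rows, new_max_pos, reserved=2):
    """Build a longer position table by repeating the learned rows.

    old_rows    -- existing rows; the first `reserved` rows are special
    new_max_pos -- number of usable positions wanted in the new table
    """
    step = len(old_rows) - reserved          # number of learned positions
    new_len = new_max_pos + reserved         # table size = positions + reserved rows
    new_rows = list(old_rows[:reserved])     # carry the reserved rows over unchanged
    k = reserved
    while k < new_len:
        # copy as many learned rows as still fit in the new table
        chunk = old_rows[reserved:reserved + min(step, new_len - k)]
        new_rows.extend(chunk)
        k += len(chunk)
    return new_rows


# Toy table: 2 reserved rows + 4 learned positions, extended to 8 positions.
old = ["pad0", "pad1", "p0", "p1", "p2", "p3"]
new = tile_position_rows(old, new_max_pos=8)
print(new)
```

With `roberta-base` the same pattern repeats the 512 learned rows 8 times to fill 4096 positions.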

In [9]:
def create_long_model(save_model_to, attention_window, max_pos):
    model = RobertaForMaskedLM.from_pretrained('roberta-base')
    tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', model_max_length=max_pos)
    config = model.config

    # extend position embeddings
    tokenizer.model_max_length = max_pos
    tokenizer.init_kwargs['model_max_length'] = max_pos
    current_max_pos, embed_size = model.roberta.embeddings.position_embeddings.weight.shape
    max_pos += 2  # NOTE: RoBERTa has positions 0,1 reserved, so embedding size is max position + 2
    config.max_position_embeddings = max_pos
    assert max_pos > current_max_pos
    # allocate a larger position embedding matrix
    new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)
    # keep the two reserved rows (positions 0 and 1); `new_empty` would otherwise leave them uninitialized
    new_pos_embed[:2] = model.roberta.embeddings.position_embeddings.weight[:2]
    # copy position embeddings over and over to initialize the new position embeddings
    k = 2
    step = current_max_pos - 2
    while k < max_pos - 1:
        new_pos_embed[k:(k + step)] = model.roberta.embeddings.position_embeddings.weight[2:]
        k += step
    model.roberta.embeddings.position_embeddings.weight.data = new_pos_embed
    model.roberta.embeddings.position_ids.data = torch.tensor([i for i in range(max_pos)]).reshape(1, max_pos)

    # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
    config.attention_window = [attention_window] * config.num_hidden_layers
    for i, layer in enumerate(model.roberta.encoder.layer):
        longformer_self_attn = LongformerSelfAttention(config, layer_id=i)
        longformer_self_attn.query = layer.attention.self.query
        longformer_self_attn.key = layer.attention.self.key
        longformer_self_attn.value = layer.attention.self.value

        longformer_self_attn.query_global = copy.deepcopy(layer.attention.self.query)
        longformer_self_attn.key_global = copy.deepcopy(layer.attention.self.key)
        longformer_self_attn.value_global = copy.deepcopy(layer.attention.self.value)

        layer.attention.self = longformer_self_attn

    logger.info(f'saving model to {save_model_to}')
    model.save_pretrained(save_model_to)
    tokenizer.save_pretrained(save_model_to)
    return model, tokenizer

Pretraining on Masked Language Modeling (MLM) doesn't update the global projection layers. After pretraining, the following function copies `query`, `key`, `value` to their global counterpart projection matrices.
For more explanation on "local" vs. "global" attention, please refer to the documentation [here](https://huggingface.co/transformers/model_doc/longformer.html#longformer-self-attention).

In [10]:
def copy_proj_layers(model):
    for i, layer in enumerate(model.roberta.encoder.layer):
        layer.attention.self.query_global = copy.deepcopy(layer.attention.self.query)
        layer.attention.self.key_global = copy.deepcopy(layer.attention.self.key)
        layer.attention.self.value_global = copy.deepcopy(layer.attention.self.value)
    return model

### Pretrain and Evaluate on masked language modeling (MLM)

The following function pretrains and evaluates a model on MLM.

In [11]:
def pretrain_and_evaluate(args, model, tokenizer, eval_only, model_path):
    val_dataset = TextDataset(tokenizer=tokenizer,
                              file_path=args.val_datapath,
                              block_size=tokenizer.model_max_length)
    if eval_only:
        train_dataset = val_dataset
    else:
        logger.info(f'Loading and tokenizing training data is usually slow: {args.train_datapath}')
        train_dataset = TextDataset(tokenizer=tokenizer,
                                    file_path=args.train_datapath,
                                    block_size=tokenizer.model_max_length)

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
    trainer = Trainer(model=model, args=args, data_collator=data_collator,
                      train_dataset=train_dataset, eval_dataset=val_dataset, prediction_loss_only=True,)

    eval_loss = trainer.evaluate()
    eval_loss = eval_loss['eval_loss']
    logger.info(f'Initial eval bpc: {eval_loss/math.log(2)}')

    if not eval_only:
        trainer.train(model_path=model_path)
        trainer.save_model()

        eval_loss = trainer.evaluate()
        eval_loss = eval_loss['eval_loss']
        logger.info(f'Eval bpc after pretraining: {eval_loss/math.log(2)}')
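
The `bpc` values logged by this function are the evaluation loss (a cross-entropy in nats) converted to base 2 by dividing by `ln 2`. As a quick check, the sketch below applies that conversion to the two `eval_loss` values that appear in the training log later in this notebook. The helper name `loss_to_bpc` is ours and not part of any library.

```python
import math


def loss_to_bpc(eval_loss):
    """Convert a cross-entropy loss in nats to bits by dividing by ln(2)."""
    return eval_loss / math.log(2)


# eval_loss values copied from the training log in this notebook
before = loss_to_bpc(1.8383642137050629)
after = loss_to_bpc(1.8214442729949951)
print(f"before pretraining: {before:.3f}")
print(f"after a few steps:  {after:.3f}")
```

The results agree with the `Initial eval bpc` and `Eval bpc after pretraining` lines in the log.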

**Training hyperparameters**

- Following the RoBERTa pretraining setup, we set the number of tokens per batch to `2^18`. Changing this number might require changes to the learning rate, the lr-scheduler, the number of steps, and the number of warmup steps. Therefore, it is a good idea to keep this number constant.

- Note that: `#tokens/batch = batch_size x #gpus x gradient_accumulation x seqlen`
   
- In [the paper](https://arxiv.org/pdf/2004.05150.pdf), we train for 65k steps, but 3k is probably enough (check table 6)

- **Important note**: The lr-scheduler in [the paper](https://arxiv.org/pdf/2004.05150.pdf) is polynomial_decay with power 3 over 65k steps. To train for 3k steps, use a constant lr-scheduler (after warmup). Neither lr-scheduler is supported in the HF trainer, and at least the **constant lr-scheduler** will need to be added.

- Pretraining will take 2 days on 1 x 32GB GPU with fp32. Consider using fp16 and using more gpus to train faster (if you increase `#gpus`, reduce `gradient_accumulation` to maintain `#tokens/batch` as mentioned earlier).

- As a demonstration, this notebook trains on wikitext103, but wikitext103 is small enough that it takes 7 epochs to train for 3k steps. Consider doing a single epoch on a larger dataset (800M tokens) instead.

- Set #gpus using `CUDA_VISIBLE_DEVICES`
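
The batch-size arithmetic above can be checked with a few lines of Python. This is a small sketch, not part of the training code: the function name `tokens_per_batch` and the 4-GPU configuration are hypothetical, while the single-GPU numbers are the ones used in this notebook and `29114` is the number of training examples reported in the training log.

```python
import math


def tokens_per_batch(batch_size, num_gpus, grad_accum, seqlen):
    """#tokens/batch = batch_size x #gpus x gradient_accumulation x seqlen"""
    return batch_size * num_gpus * grad_accum * seqlen


# Notebook settings: 2 sequences per GPU, 1 GPU, 32 accumulation steps, 4096 tokens
single_gpu = tokens_per_batch(batch_size=2, num_gpus=1, grad_accum=32, seqlen=4096)

# Hypothetical 4-GPU setup: reduce accumulation from 32 to 8 to keep the product constant
four_gpus = tokens_per_batch(batch_size=2, num_gpus=4, grad_accum=8, seqlen=4096)

# Epochs needed for 3k steps over the 29114 training sequences from the log
sequences_per_step = 2 * 1 * 32
epochs = 3000 * sequences_per_step / 29114

print(single_gpu == 2 ** 18, four_gpus == single_gpu, math.ceil(epochs))
```

Both configurations give `2^18` tokens per batch, and 3k steps need between 6 and 7 passes over the training set, which is where the 7 epochs come from.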

In [15]:
!cp -r /content/data/Repos/* /content/data/Repos-cleaned/

In [12]:
!npm install typescript


[1G[0K⠙[1G[0K⠹[1G[0K⠸[1G[0K⠼[1G[0K⠴[1G[0K⠦[1G[0K⠧[1G[0K⠇[1G[0K⠏[1G[0K⠋[1G[0K
added 1 package in 2s
[1G[0K⠋[1G[0K[1mnpm[22m [96mnotice[39m
[1mnpm[22m [96mnotice[39m New [31mmajor[39m version of npm available! [31m10.8.2[39m -> [34m11.0.0[39m
[1mnpm[22m [96mnotice[39m Changelog: [34mhttps://github.com/npm/cli/releases/tag/v11.0.0[39m
[1mnpm[22m [96mnotice[39m To update run: [4mnpm install -g npm@11.0.0[24m
[1mnpm[22m [96mnotice[39m
[1G[0K⠋[1G[0K

In [16]:
!node /content/CleanRepos.js


Config in: /content/data/Repos/1backend/1backend
Error processing vs
Config in: /content/data/Repos/2fd/graphdoc
Config in: /content/data/Repos/43081j/rar.js
Config in: /content/data/Repos/500tech/angular-tree-component
Config in: /content/data/Repos/5calls/5calls
Config in: /content/data/Repos/74th/vscode-vim
Config in: /content/data/Repos/AFASSoftware/maquette
Config in: /content/data/Repos/Alberplz/angular2-color-picker
Skipping: /content/data/Repos/aberezkin/ng2-image-upload/.git
Config in: /content/data/Repos/aberezkin/ng2-image-upload/demo
Config in: /content/data/Repos/aberezkin/ng2-image-upload/src
Config in: /content/data/Repos/accounts-js/accounts
Skipping: /content/data/Repos/acekyd/made-in-nigeria/.git
Skipping: /content/data/Repos/ademilter/bricklayer/.git
Config in: /content/data/Repos/adriancarriger/angularfire2-offline
Config in: /content/data/Repos/afrad/angular2-websocket
Config in: /content/data/Repos/aggarwalankush/ionic-mosum
Config in: /content/data/Repos/aggarwal

In [18]:
!node GetTypes.js


Config in: data/Repos-cleaned/1backend/1backend
Error processing vs
Config in: data/Repos-cleaned/2fd/graphdoc
Config in: data/Repos-cleaned/43081j/rar.js
Config in: data/Repos-cleaned/500tech/angular-tree-component
Config in: data/Repos-cleaned/5calls/5calls
Config in: data/Repos-cleaned/74th/vscode-vim
Config in: data/Repos-cleaned/AFASSoftware/maquette
Config in: data/Repos-cleaned/Alberplz/angular2-color-picker
Skipping: data/Repos-cleaned/aberezkin/ng2-image-upload/.git
Config in: data/Repos-cleaned/aberezkin/ng2-image-upload/demo
Config in: data/Repos-cleaned/aberezkin/ng2-image-upload/src
Config in: data/Repos-cleaned/accounts-js/accounts
!? 1, 0
Skipping: data/Repos-cleaned/acekyd/made-in-nigeria/.git
Skipping: data/Repos-cleaned/ademilter/bricklayer/.git
Config in: data/Repos-cleaned/adriancarriger/angularfire2-offline
!? 1, 0
Config in: data/Repos-cleaned/afrad/angular2-websocket
Config in: data/Repos-cleaned/aggarwalankush/ionic-mosum
Config in: data/Repos-cleaned/aggarwalan

In [19]:

@dataclass
class ModelArgs:
    attention_window: int = field(default=512, metadata={"help": "Size of attention window"})
    max_pos: int = field(default=4096, metadata={"help": "Maximum position"})

parser = HfArgumentParser((TrainingArguments, ModelArgs,))


training_args, model_args = parser.parse_args_into_dataclasses(look_for_args_file=False, args=[
    '--output_dir', 'tmp',
    '--warmup_steps', '500',
    '--learning_rate', '0.00003',
    '--weight_decay', '0.01',
    '--adam_epsilon', '1e-6',
    '--max_steps', '3000',
    '--logging_steps', '500',
    '--save_steps', '500',
    '--max_grad_norm', '5.0',
    '--per_gpu_eval_batch_size', '8',
    '--per_gpu_train_batch_size', '2',  # 32GB gpu with fp32
    '--gradient_accumulation_steps', '32',
    '--evaluation_strategy','steps',
    '--do_train',
    '--do_eval',
])
training_args.val_datapath = '/content/data/outputs-gold'
training_args.train_datapath = '/content/data/outputs-all'

# Choose GPU
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [21]:
!python lexer.py


Processing 0: akserg__ng2-toasty.json
Processing 1: Alberplz__angular2-color-picker.json
Processing 2: airbrake__airbrake-js.json
Processing 3: akserg__ng2-slim-loading-bar.json
Processing 4: adriancarriger__angularfire2-offline.json
Processing 5: 500tech__angular-tree-component.json
Processing 6: ahomu__Talkie.json
Processing 7: alamgird__angular-next-starter-kit.json
Processing 8: 43081j__rar.js.json
Processing 9: akserg__ng2-dnd.json
Processing 10: 1backend__1backend.json
Processing 11: aggarwalankush__ionic-push-base.json
Processing 12: afrad__angular2-websocket.json
Processing 13: 74th__vscode-vim.json
Processing 14: 2fd__graphdoc.json
Processing 15: AFASSoftware__maquette.json
Processing 16: 5calls__5calls.json
Processing 17: akfish__node-vibrant.json
Processing 18: aggarwalankush__ionic-mosum.json
Processing 19: accounts-js__accounts.json
Processing 20: alefragnani__vscode-project-manager.json
Processing 21: aioutecism__amVim-for-VSCode.json
Processing 22: aikoven__typescript-fs

### Put it all together

1) Evaluate `roberta-base` on MLM to establish a baseline. The validation `bpc` is `2.536`, which is higher than the `bpc` values in table 6 [here](https://arxiv.org/pdf/2004.05150.pdf) because wikitext103 is harder than our pretraining corpus.

In [1]:
roberta_base = RobertaForMaskedLM.from_pretrained('roberta-base')
roberta_base_tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
logger.info('Evaluating roberta-base (seqlen: 512) for reference ...')
pretrain_and_evaluate(training_args, roberta_base, roberta_base_tokenizer, eval_only=True, model_path=None)

NameError: name 'RobertaForMaskedLM' is not defined

2) As described in `create_long_model`, convert a `roberta-base` model into `roberta-base-4096`, which is an instance of `RobertaLong`, then save it to disk.

In [None]:
model_path = f'{training_args.output_dir}/roberta-base-{model_args.max_pos}'
if not os.path.exists(model_path):
    os.makedirs(model_path)

logger.info(f'Converting roberta-base into roberta-base-{model_args.max_pos}')
model, tokenizer = create_long_model(
    save_model_to=model_path, attention_window=model_args.attention_window, max_pos=model_args.max_pos)

3) Load `roberta-base-4096` from the disk. This model works for long sequences even without pretraining. If you don't want to pretrain, you can stop here and start finetuning your `roberta-base-4096` on downstream tasks 🎉🎉🎉

In [None]:
logger.info(f'Loading the model from {model_path}')
tokenizer = RobertaTokenizerFast.from_pretrained(model_path)
model = RobertaLongForMaskedLM.from_pretrained(model_path)

4) Pretrain `roberta-base-4096` for `3k` steps, where each step has `2^18` tokens. Notes:

- The `training_args.max_steps = 3 ` is just for the demo. **Remove this line for the actual training**

- Training for `3k` steps will take 2 days on a single 32GB gpu with `fp32`. Consider using `fp16` and more gpus to train faster.

- Tokenizing the training data the first time is going to take 5-10 minutes.

- MLM validation `bpc` **before** pretraining: **2.652**, a bit worse than the **2.536** of `roberta-base`. As discussed in [the paper](https://arxiv.org/pdf/2004.05150.pdf), this is expected because the model has not yet learned to work with the sliding window attention.

- MLM validation `bpc` after pretraining for a few steps: **2.628**. It is quickly getting better. By 3k steps, it should be better than the **2.536** of `roberta-base`.

In [None]:
logger.info(f'Pretraining roberta-base-{model_args.max_pos} ... ')

training_args.max_steps = 3   ## <<<<<<<<<<<<<<<<<<<<<<<< REMOVE THIS <<<<<<<<<<<<<<<<<<<<<<<<

pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)

INFO:__main__:Pretraining roberta-base-4096 ... 
INFO:filelock:Lock 140598563391376 acquired on wikitext-103-raw/cached_lm_RobertaTokenizerFast_4094_wiki.valid.raw.lock
INFO:transformers.data.datasets.language_modeling:Loading features from cached file wikitext-103-raw/cached_lm_RobertaTokenizerFast_4094_wiki.valid.raw [took 0.017 s]
INFO:filelock:Lock 140598563391376 released on wikitext-103-raw/cached_lm_RobertaTokenizerFast_4094_wiki.valid.raw.lock
INFO:__main__:Loading and tokenizing training data is usually slow: wikitext-103-raw/wiki.train.raw
INFO:filelock:Lock 140599059908048 acquired on wikitext-103-raw/cached_lm_RobertaTokenizerFast_4094_wiki.train.raw.lock
INFO:transformers.data.datasets.language_modeling:Loading features from cached file wikitext-103-raw/cached_lm_RobertaTokenizerFast_4094_wiki.train.raw [took 5.838 s]
INFO:filelock:Lock 140599059908048 released on wikitext-103-raw/cached_lm_RobertaTokenizerFast_4094_wiki.train.raw.lock
INFO:transformers.trainer:You are ins


INFO:__main__:Initial eval bpc: 2.6521989344600327
INFO:transformers.trainer:***** Running training *****
INFO:transformers.trainer:  Num examples = 29114
INFO:transformers.trainer:  Num Epochs = 1
INFO:transformers.trainer:  Instantaneous batch size per device = 2
INFO:transformers.trainer:  Total train batch size (w. parallel, distributed & accumulation) = 64
INFO:transformers.trainer:  Gradient Accumulation steps = 32
INFO:transformers.trainer:  Total optimization steps = 3
INFO:transformers.trainer:  Starting fine-tuning.



{"eval_loss": 1.8383642137050629, "step": null}



INFO:transformers.trainer:

Training completed. Do not forget to share your model on huggingface.co/models =)


INFO:transformers.trainer:Saving model checkpoint to tmp
INFO:transformers.configuration_utils:Configuration saved in tmp/config.json






INFO:transformers.modeling_utils:Model weights saved in tmp/pytorch_model.bin
INFO:transformers.trainer:***** Running Evaluation *****
INFO:transformers.trainer:  Num examples = 61
INFO:transformers.trainer:  Batch size = 8



INFO:__main__:Eval bpc after pretraining: 2.6277886199054827



{"eval_loss": 1.8214442729949951, "epoch": 0.008793020539946418, "step": 4}


5) Copy global projection layers. MLM pretraining doesn't train global projections, so we need to call `copy_proj_layers` to copy the local projection layers to the global ones.

In [None]:
logger.info(f'Copying local projection layers into global projection layers ... ')
model = copy_proj_layers(model)
logger.info(f'Saving model to {model_path}')
model.save_pretrained(model_path)


INFO:__main__:Copying local projection layers into global projection layers ... 
INFO:__main__:Saving model to tmp/roberta-base-4096
INFO:transformers.configuration_utils:Configuration saved in tmp/roberta-base-4096/config.json
INFO:transformers.modeling_utils:Model weights saved in tmp/roberta-base-4096/pytorch_model.bin


🎉🎉🎉🎉 **DONE**. 🎉🎉🎉🎉

`model` can now be used for finetuning on downstream tasks after loading it from the disk.



In [None]:
logger.info(f'Loading the model from {model_path}')
tokenizer = RobertaTokenizerFast.from_pretrained(model_path)
model = RobertaLongForMaskedLM.from_pretrained(model_path)