# Pretrain DistilBERT

In this notebook, we'll pretrain a DistilBERT model on the masked language modeling (MLM) objective using all 5'UTR exonic sequences.

## Prerequisites

This notebook requires:
- [DS_BASE.tsv](https://drive.google.com/file/d/1gPjOoxWOAPpfPmKFbQlVT0hncjIpst5E/view?usp=sharing)

Download and unpack the required data, then provide a path to a base dataset below.

For instance, starting in the project's root:

```bash
cd data
# Quote the URL so that the shell does not treat `?` as a glob character
gdown --fuzzy 'https://drive.google.com/file/d/1gPjOoxWOAPpfPmKFbQlVT0hncjIpst5E/view?usp=sharing'
tar -xzf DS_BASE.tsv.tar.gz
rm DS_BASE.tsv.tar.gz
```

In [1]:
import logging
from pathlib import Path

import numpy as np
import pandas as pd
import pytorch_lightning as pl
import torch
from torch.utils.data import RandomSampler, SequentialSampler, DataLoader, TensorDataset
from transformers import DistilBertConfig, DistilBertForMaskedLM, get_linear_schedule_with_warmup

from uBERTa.loader import uBERTaLoader
from uBERTa.model import uBERTa_mlm
from uBERTa.tokenizer import DNATokenizer
from uBERTa.utils import split_values, fill_row_around_ones

## Setup

In [6]:
def load_existing(paths):
    loader = (
        lambda p: None if not p.exists() else 
        (pd.read_hdf(p) if p.suffix == '.h5' else torch.load(p))
    )
    return {k: loader(v) for k, v in paths.items()}

def parse_base(path_base, min_seq_size):
    """
    Read the base dataset, split sequence values and filter by the sequence size.
    """
    df = pd.read_csv(path_base, sep='\t')
    df['SeqSize'] = df['Seq'].apply(len)
    print(f'Initial ds: {len(df)}')
    df = df[df.SeqSize >= min_seq_size]
    print(f'Conforming to size threshold: {len(df)}')
    split_values(df, 'SeqEnum')
    split_values(df, 'SeqEnumPositive')
    split_values(df, 'Classes')
    split_values(df, 'Signal', dtype=float)
    return df

def calc_scheduler_steps(loader, warmup_perc=0.1, max_epochs=100):
    """
    Calculate warmup steps based on the number of batches in train_dataloader.
    """
    epoch_steps = len(loader.train_dataloader())
    total_steps = epoch_steps * max_epochs
    warmup_steps = int(warmup_perc * total_steps)
    return warmup_steps, total_steps

class Loader(pl.LightningDataModule):
    """
    Loader for the MLM objective.
    """
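    def __init__(self, tds, batch_size, val_frac=0.05, num_proc=4):
        super().__init__()
        self.batch_size = batch_size
        self.val_frac = val_frac
        self.tds = tds
        self.num_proc = num_proc

        self.val_tds, self.train_tds = None, None

    def setup(self, stage=None):
        # `stage` is supplied by the Trainer. Split only once, so that a manual
        # call followed by `trainer.fit` keeps the same subsets.
        if self.train_tds is not None:
            return
        # Assign each sample to the validation subset with probability `val_frac`.
        idx = np.random.binomial(1, self.val_frac, len(self.tds)) == 1
        idx_val = np.where(idx)[0]
        idx_train = np.where(~idx)[0]
        self.val_tds = TensorDataset(*self.tds[idx_val])
        self.train_tds = TensorDataset(*self.tds[idx_train])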
    
    def train_dataloader(self) -> DataLoader:
        return DataLoader(
            self.train_tds, sampler=RandomSampler(self.train_tds),
            batch_size=self.batch_size, num_workers=self.num_proc)

    def val_dataloader(self) -> DataLoader:
        return DataLoader(
            self.val_tds, sampler=SequentialSampler(self.val_tds),
            batch_size=self.batch_size, num_workers=self.num_proc)

In [3]:
KMER = 3
WINDOW = 100
MIN_SEQ_SIZE = 25
STEP = WINDOW // 2
MAX_EPOCHS = 100

DS = f'ws{WINDOW}_step{STEP}'
MODEL = f'{DS}_pretrain'

DATA = Path('../data')
DATA.mkdir(exist_ok=True)
# Path to save the trained model
MODEL_PATH = Path(f'../models/{MODEL}')
MODEL_PATH.mkdir(exist_ok=True, parents=True)
# Path to a base dataset
PATH_BASE_DS = DATA / 'DS_BASE.tsv'

# Base dir for the prepared datasets
datasets_base = DATA / f'datasets/{DS}'
datasets_base.mkdir(exist_ok=True, parents=True)

DATASETS = {
    'train_ds': datasets_base / 'train_ds.h5',
    'val_ds': datasets_base / 'val_ds.h5',
    'test_ds': datasets_base / 'test_ds.h5',
    'train_tds': datasets_base / 'train_tds.bin',
    'val_tds': datasets_base / 'val_tds.bin',
    'test_tds': datasets_base / 'test_tds.bin'
}

np.random.seed(666)

In [4]:
logging.basicConfig(level=logging.DEBUG)

## Prepare datasets

From the base dataset, we prepare the `.h5` dataframes and `TensorDataset` objects, where the latter are used to initialize `Loader`.

First, we load and parse the base dataset.

In [7]:
ds_paths = (DATASETS['train_ds'], DATASETS['val_ds'], DATASETS['test_ds'])
if any(not p.exists() for p in ds_paths):
    ds = parse_base(PATH_BASE_DS, MIN_SEQ_SIZE)
else:
    ds = None

Initial ds: 79677
Conforming to size threshold: 75169


Then, we initialize the tokenizer and `uBERTaLoader`. The latter won't be used directly, but we'll use some of its methods to prepare the MLM dataset. Namely, we'll kmerize the sequence data (the sequence, its coordinates, classes, and the experimental signal), i.e., slide a window of size `KMER` (three) with step one over it. Then, we'll roll a window of size `WINDOW - 2` with step `STEP`, both defined above, over the kmerized data. We subtract two from `WINDOW` to account for the special tokens `CLS` and `SEP`, prepended and appended to each sequence, respectively.
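To make the two windowing steps concrete, here is a minimal pure-Python sketch. The helpers `kmerize_seq` and `roll` are hypothetical stand-ins written for illustration; they are not the `uBERTaLoader` methods, whose handling of sequence ends and of the accompanying arrays may differ.

```python
def kmerize_seq(seq, k=3):
    """Slide a window of size `k` with step one over a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def roll(tokens, size, step):
    """Cut `tokens` into chunks of at most `size` items, one every `step` items."""
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # this chunk already reached the end
            break
    return chunks


kmers = kmerize_seq("AAATGACCGG")
print(kmers)
for chunk in roll(kmers, size=4, step=2):
    print(" ".join(chunk))
```

With the settings used in this notebook (window of 98 tokens, step of 50), consecutive windows would overlap by 48 tokens.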

In [8]:
tokenizer = DNATokenizer(kmer=KMER)

In [9]:
loader = uBERTaLoader(tokenizer=tokenizer)
ds = loader.kmerize(ds)
ds = loader.roll_window(ds, WINDOW - 2, STEP)

INFO:uBERTa.loader:Using kmer 3 on ('Seq', 'SeqEnum', 'Signal', 'Classes')
INFO:uBERTa.loader:Rolling window with size 98, step 50


Next, we encode windowed sequences using tokenizer.

In [10]:
encoded = np.array(list(map(
    tokenizer.encode, 
    ds.Seq.apply(lambda x: x.split())
)))
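For intuition, the sketch below builds a toy 3-mer vocabulary and encodes a short k-mer list. It is not the `DNATokenizer` implementation: the special-token names, their order, and the `encode` helper are assumptions made for illustration. The resulting size (5 special tokens plus 4^3 = 64 k-mers) does agree with the `vocab_size` of 69 reported in the model config further below.

```python
from itertools import product

# Assumed special tokens and ordering (hypothetical, for illustration only)
SPECIAL = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
KMERS = ["".join(p) for p in product("ACGT", repeat=3)]
VOCAB = {token: i for i, token in enumerate(SPECIAL + KMERS)}


def encode(kmers, max_len):
    """Map k-mers to IDs, wrap them in CLS/SEP and right-pad to `max_len`."""
    ids = [VOCAB["[CLS]"]]
    ids += [VOCAB.get(k, VOCAB["[UNK]"]) for k in kmers]
    ids.append(VOCAB["[SEP]"])
    return ids + [VOCAB["[PAD]"]] * (max_len - len(ids))


print(len(VOCAB))  # 69
print(encode(["AAA", "AAT", "ATG"], max_len=8))
```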

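The MLM objective conventionally masks ~15% of the input tokens. Since adjacent k-mers overlap, a single masked k-mer could be largely reconstructed from its neighbors. Therefore, we'll select ~6% of the tokens in the encoded sequences at random and expand the selection by one position in both directions, so that we mask runs of consecutive (overlapping) tokens, e.g.,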

```
... AAA AAA ATG ACC GGC CGG ...
...  0   0   1   0   0   0  ...

-> 

... AAA AAA ATG ACC GGC CGG ...
...  0   1   1   1   0   0  ...
```
(We show actual k-mers instead of token IDs for clarity.)

In some cases, this will also select `PAD`, `CLS`, and `SEP` tokens, which we'll unmask manually. Because of this, and because expanded selections may overlap, the total fraction of masked tokens will be somewhat less than 6% * 3 = 18%. We'll use the resulting binary mask to replace the selected input tokens with the `MASK` token, and to set the labels at all other positions to -100 so that the loss ignores them. Finally, we'll also create an attention mask that excludes `PAD` tokens.
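The lateral expansion can be sketched in a few lines of NumPy. The `expand_mask` helper below is a hypothetical stand-in for `fill_row_around_ones` (whose actual implementation is not shown here); it reproduces the toy example above and estimates the masked fraction before any special tokens are unmasked.

```python
import numpy as np


def expand_mask(mask):
    """Mark the left and right neighbors of every selected position (row-wise)."""
    expanded = mask.copy()
    expanded[:, 1:] |= mask[:, :-1]   # position to the right of each 1
    expanded[:, :-1] |= mask[:, 1:]   # position to the left of each 1
    return expanded


toy = np.array([[0, 0, 1, 0, 0, 0]])
print(expand_mask(toy))  # [[0 1 1 1 0 0]]

# Empirical vs. analytic fraction of masked positions for p = 0.06
rng = np.random.default_rng(0)
selected = rng.binomial(1, 0.06, size=(1000, 100))
print(expand_mask(selected).mean())
print(1 - (1 - 0.06) ** 3)  # a position is masked if it or a neighbor was selected
```

Under these assumptions, overlap alone brings the expected fraction down from 18% to roughly 16.9%; unmasking `PAD`, `CLS`, and `SEP` positions would presumably lower it further.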

In [11]:
mask = np.random.binomial(1, 0.06, encoded.shape)
mask = fill_row_around_ones(mask)
mask[encoded == tokenizer.pad_token_id] = 0
mask[:, 0] = 0
mask[:, -1] = 0
mask = mask.astype(bool)

In [12]:
labels = encoded.copy()
labels[~mask] = -100

In [13]:
encoded[mask] = tokenizer.mask_token_id

In [14]:
att_mask = (encoded != tokenizer.pad_token_id).astype(int)

Below, we'll verify that:
- The labels do not contain `MASK` tokens.
- The number of labels other than -100 equals the number of `MASK` tokens in the encoded input.
- The fraction of masked tokens is ~15% (15.56%, to be more precise).

In [15]:
mask_id = tokenizer.mask_token_id  # equals 4 for this tokenizer
(labels == mask_id).sum(), (labels != -100).sum(), (encoded == mask_id).sum(), (encoded == mask_id).sum() / np.prod(encoded.shape)

(0, 5405680, 5405680, 0.15556182265017526)

Finally, we'll wrap the encoded input, attention mask and labels into a tensor dataset. `Loader` will accept this `TensorDataset` object and split it into the training and validation subsets internally.

In [16]:
tds = TensorDataset(
    torch.tensor(encoded, dtype=torch.long),
    torch.tensor(att_mask, dtype=torch.int),
    torch.tensor(labels, dtype=torch.long)
)

## Setup model and loader

In [17]:
loader = Loader(tds, 2 ** 6)
loader.setup()

Calculate the exact number of warmup steps based on the number of batches and the maximum training epochs.

In [18]:
warmup_steps, total_steps = calc_scheduler_steps(
    loader, warmup_perc=0.05, max_epochs=MAX_EPOCHS)
warmup_steps, total_steps

(25800, 516000)
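The numbers above imply 5160 batches per epoch (516000 / 100) and a warmup spanning the first 5% of all steps. The sketch below reproduces this arithmetic and evaluates a learning-rate multiplier with the linear-warmup, linear-decay shape that `get_linear_schedule_with_warmup` is designed to produce. The `linear_warmup_factor` helper is written for illustration and is not taken from the library; the base learning rate of 1e-5 is the one passed to the optimizer below.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


total_steps = 516_000
epoch_steps = total_steps // 100        # batches per epoch
warmup_steps = int(0.05 * total_steps)  # first 5% of all steps

base_lr = 1e-5
for step in (0, 12_900, 25_800, 270_900, 516_000):
    factor = linear_warmup_factor(step, warmup_steps, total_steps)
    print(step, base_lr * factor)
```

The rate rises linearly from zero to `base_lr` at step 25800 and then decays linearly back to zero at the last step. If early stopping ends training sooner, the decay is simply cut short.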

Initialize the config. We'll use the default configuration, except that we set the vocabulary size to that of the tokenizer and halve the model's dimension (`dim`, from 768 to 384).

In [19]:
config = DistilBertConfig()
config.vocab_size = tokenizer.vocab_size
config.dim = config.dim // 2
config

DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 384,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "transformers_version": "4.16.2",
  "vocab_size": 69
}

`uBERTa_mlm` is a Lightning module that encapsulates the model, its config, and the optimizer and scheduler setup.

In [20]:
model = uBERTa_mlm(
    model=DistilBertForMaskedLM,
    config=config,
    opt_kwargs={'lr': 1e-5, 'weight_decay': 0.01, 'eps': 1e-8}, 
    scheduler=get_linear_schedule_with_warmup,
    scheduler_kwargs={'num_warmup_steps': warmup_steps, 'num_training_steps': total_steps},
)

In [21]:
model.summarize()



  | Name  | Type                  | Params
------------------------------------------------
0 | model | DistilBertForMaskedLM | 18.1 M
------------------------------------------------
18.1 M    Trainable params
0         Non-trainable params
18.1 M    Total params
72.426    Total estimated model params size (MB)
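As a sanity check, the reported parameter count can be reproduced by hand from the config values. The breakdown below assumes the standard DistilBERT layout: learned position embeddings, four linear maps per attention block, a two-layer feed-forward block, and an MLM head whose output projection shares its weight matrix with the token embeddings (so only its bias is counted).

```python
vocab, dim, hidden, n_layers, max_pos = 69, 384, 3072, 6, 512


def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * dim  # scale and shift vectors

embeddings = vocab * dim + max_pos * dim + layer_norm
attention = 4 * linear(dim, dim)                 # query, key, value, output
ffn = linear(dim, hidden) + linear(hidden, dim)
block = attention + layer_norm + ffn + layer_norm
head = linear(dim, dim) + layer_norm + vocab     # tied projector: bias only

total = embeddings + n_layers * block + head
print(total)            # number of parameters
print(total * 4 / 1e6)  # size in MB at 4 bytes per float32 parameter
```

Under these assumptions, the total agrees with the 18.1 M parameters and 72.426 MB shown in the summary, and the feed-forward blocks account for most of it because `hidden_dim` was left at 3072.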

## Train and save the weights

In [22]:
stopper = pl.callbacks.early_stopping.EarlyStopping(
    monitor='val_loss', 
    verbose=True, mode='min', 
    min_delta=1e-6,
    patience=20)
pointer = pl.callbacks.ModelCheckpoint(
    monitor='val_loss', 
    dirpath=f'../models/checkpoints/{MODEL}', 
    verbose=True, mode='min')
logger = pl.loggers.TensorBoardLogger('../logs', f'{MODEL}')
lr_monitor = pl.callbacks.LearningRateMonitor('epoch')
bar = pl.callbacks.TQDMProgressBar()

In [23]:
gpus = [0]
trainer = pl.Trainer(
    gradient_clip_val=1.0, 
    # stochastic_weight_avg=True,
    accelerator="gpu",
    precision=16,
    gpus=gpus,
    callbacks=[stopper, pointer, bar, lr_monitor],
    logger=logger,
    max_epochs=MAX_EPOCHS
)

Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs


In [None]:
trainer.fit(model, loader)

In [None]:
model.model.save_pretrained(MODEL_PATH)