# Train uBERTa

Here, we fine-tune the pretrained model on the token-level classification objective.

## Prerequisites

This notebook requires:
- [DS_BASE.tsv](https://drive.google.com/file/d/1gPjOoxWOAPpfPmKFbQlVT0hncjIpst5E/view?usp=sharing) (`../data/DS_BASE.tsv`)
- [dataset_labeling.tsv](https://drive.google.com/file/d/1z_dQtERIPvf_ZqnGLCNGLx_l6GcKHJZ1/view?usp=sharing) (`../data/dataset_labeling.tsv`)
- [pretrained_model](https://drive.google.com/file/d/1fiYXNEOUsqSHGbzwbuWKAu5e26-q0UM1/view?usp=sharing) (`../models/ws100_step50_pretrain`)

You can either download these files or generate them yourself: run `prepare_base_dataset.ipynb` for the first two and `pretrain_distil.ipynb` for the pretrained model.

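For instance, to download the data, starting from the project's `data` directory (so that the files end up where the notebook expects them, and `../models` resolves to the project's `models` directory):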
```bash
# URLs are quoted so that the shell does not treat `?` as a glob character
gdown --fuzzy "https://drive.google.com/file/d/1gPjOoxWOAPpfPmKFbQlVT0hncjIpst5E/view?usp=sharing"
gdown --fuzzy "https://drive.google.com/file/d/1z_dQtERIPvf_ZqnGLCNGLx_l6GcKHJZ1/view?usp=sharing"
tar -xzf DS_BASE.tsv.tar.gz
tar -xzf dataset_labeling.tsv.tar.gz
mkdir -p ../models
cd ../models
gdown --fuzzy "https://drive.google.com/file/d/1fiYXNEOUsqSHGbzwbuWKAu5e26-q0UM1/view?usp=sharing"
tar -xzf ws100_step50_pretrain.tar.gz
```

## Setup

In [1]:
import logging
from collections import Counter
from pathlib import Path

import pandas as pd
import pytorch_lightning as pl
import torch
from sklearn.utils.class_weight import compute_class_weight
from transformers import DistilBertConfig, get_linear_schedule_with_warmup

from uBERTa.loader import uBERTaLoader
from uBERTa.model import uBERTa_classifier, WeightedDistilBertClassifier
from uBERTa.tokenizer import DNATokenizer
from uBERTa.utils import split_values

In [2]:
def load_existing(paths):
    loader = (
        lambda p: None if not p.exists() else 
        (pd.read_hdf(p) if p.suffix == '.h5' else torch.load(p))
    )
    return {k: loader(v) for k, v in paths.items()}

def parse_base(path_base, path_labels, min_seq_size):
    df = pd.read_csv(path_base, sep='\t')
    df['SeqSize'] = df['Seq'].apply(len)
    print(f'Initial ds: {len(df)}')
    df = df.merge(pd.read_csv(path_labels, sep='\t'), on='GeneID')
    print(f'Labeled genes: {len(df)}')
    df = df[df.SeqSize >= min_seq_size]
    print(f'Conforming to size threshold: {len(df)}')
    split_values(df, 'SeqEnum')
    split_values(df, 'SeqEnumPositive')
    split_values(df, 'Classes')
    split_values(df, 'Signal', dtype=float)
    return df

def calc_scheduler_steps(loader, warmup_perc=0.1, max_epochs=100):
    epoch_steps = len(loader.train_dataloader())
    total_steps = epoch_steps * max_epochs
    warmup_steps = int(warmup_perc * total_steps)
    return warmup_steps, total_steps

In [3]:
KMER = 3
MIN_SEQ_SIZE = 10
WINDOW = 100
STEP = WINDOW // 4
MAX_EPOCHS = 30

# Valid start codons
STARTS = ('ACG', 'ATC', 'ATG', 'ATT', 'CTG', 'GTG')
# Name of the dataset
DS = f'ws{WINDOW}_step{STEP}_{"_".join(STARTS)}'
# Name of the model
MODEL = f'{DS}_pretrain_tokenlevel_signal'

MODEL_PATH = Path(f'../models/{MODEL}')
PRETRAINED_PATH = Path('../models/ws100_step50_pretrain/')
PATH_BASE_DS = Path('../data/DS_BASE.tsv')
PATH_LABELS = Path('../data/dataset_labeling.tsv')

datasets_base = Path(f'../data/datasets/{DS}')
datasets_base.mkdir(exist_ok=True, parents=True)

DATASETS = {
    'train_ds': datasets_base / 'train_ds.h5',
    'val_ds': datasets_base / 'val_ds.h5',
    'test_ds': datasets_base / 'test_ds.h5',
    'train_tds': datasets_base / 'train_tds.bin',
    'val_tds': datasets_base / 'val_tds.bin',
    'test_tds': datasets_base / 'test_tds.bin'
}

In [4]:
logging.basicConfig(level=logging.DEBUG)

## Prepare datasets
Preparing the datasets is time-consuming. The code below checks whether the required datasets are present in `datasets_base`; if not (e.g., when running the notebook for the first time), preparation starts from the base dataset. Data preparation is fully encapsulated in the `setup` method of `uBERTaLoader`: it kmerizes the sequence data, composes token labels, aggregates the ribo-seq signal for each token, and applies a sliding window over the sequences to unify the input size.
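
To make the kmerization and sliding-window steps more concrete, here is a minimal sketch in plain Python. It is not the `uBERTaLoader` implementation: the helper names `kmerize` and `sliding_windows` are hypothetical, and the choices of overlapping k-mers (stride 1) and of anchoring the last window to the end of the sequence are assumptions made for illustration only.

```python
def kmerize(seq, k=3):
    """Split a sequence into overlapping k-mers (stride 1; an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def sliding_windows(tokens, window, step):
    """Cut a token list into fixed-size windows taken every `step` tokens.

    Sequences that fit into a single window are returned as is. The last
    window is anchored to the end so that no trailing token is dropped
    (an illustrative choice; the real loader may handle this differently).
    """
    if len(tokens) <= window:
        return [tokens]
    last_start = len(tokens) - window
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [tokens[s:s + window] for s in starts]


tokens = kmerize("ATGGCCATTGTA", k=3)
windows = sliding_windows(tokens, window=4, step=2)
for w in windows:
    print(w)
```

In the notebook itself, the corresponding parameters are `KMER`, `WINDOW`, and `STEP`, and the per-token labels and signal would have to be windowed in the same way as the tokens.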

In [5]:
ds_paths = (DATASETS['train_ds'], DATASETS['val_ds'], DATASETS['test_ds'])
if any(not p.exists() for p in ds_paths):
    ds = parse_base(PATH_BASE_DS, PATH_LABELS, MIN_SEQ_SIZE)
else:
    ds = None

Initial ds: 79677
Labeled genes: 19652
Conforming to size threshold: 19377


In [6]:
tokenizer = DNATokenizer(kmer=KMER)
loader = uBERTaLoader(
    ds, WINDOW, STEP, tokenizer, 
    **load_existing(DATASETS),
    scale_signal_bounds=(0.0, 10.0),
    is_mlm_task=False,
    valid_start_codons=STARTS,
    batch_size=2 ** 6)

In [7]:
loader.setup()
loader.save_all(datasets_base)

INFO:uBERTa.loader:Total initial sequences: 19377
INFO:uBERTa.loader:Split datasets. Train: 15459, Val: 1924, Test: 1994.
INFO:uBERTa.loader:Preparing Train with 15459 records for token-level task
INFO:uBERTa.loader:Using kmer 3 on ('Seq', 'SeqEnum', 'Signal', 'Classes')
DEBUG:uBERTa.loader:Reducing kmers for Train
DEBUG:uBERTa.loader:Filtering to ('ACG', 'ATC', 'ATG', 'ATT', 'CTG', 'GTG') for Train
DEBUG:uBERTa.loader:Capping and scaling signal for Train
DEBUG:uBERTa.loader:Capped signal in (0.1, 5000.0)
DEBUG:uBERTa.loader:Scaled signal between 0 and 1. Min 0.1, Max 5000.0
INFO:uBERTa.loader:Rolling window with size 98, step 25
INFO:uBERTa.loader:Preparing Val with 1924 records for token-level task
INFO:uBERTa.loader:Using kmer 3 on ('Seq', 'SeqEnum', 'Signal', 'Classes')
DEBUG:uBERTa.loader:Reducing kmers for Val
DEBUG:uBERTa.loader:Filtering to ('ACG', 'ATC', 'ATG', 'ATT', 'CTG', 'GTG') for Val
DEBUG:uBERTa.loader:Capping and scalin

Below, we count the unmasked tokens (the codons listed in `STARTS` above) and the positive and negative classes among them. The resulting `weight` can be passed to `uBERTa_classifier`, although it does not lead to performance gains and will likely require manually adjusting the threshold that converts predictions into binary labels to reach the desired balance between recall and precision.

In [8]:
for tds in [loader.val_tds, loader.test_tds, loader.train_tds]:
    classes, inp_ids = tds.tensors[2], tds.tensors[0]
    mask = classes != -100
    mask0 = classes == 0
    mask1 = classes == 1
    weight = compute_class_weight('balanced', classes=[0, 1], y=classes[mask].numpy())
    codons = inp_ids[mask]
    print(Counter(codons.numpy()), weight, mask0.sum(), mask1.sum())

Counter({44: 39445, 60: 25270, 10: 17082, 12: 14558, 11: 14270, 16: 9337}) [ 0.52456622 10.67657529] tensor(114344) tensor(5618)
Counter({44: 41275, 60: 25270, 10: 17345, 12: 15340, 11: 13694, 16: 9643}) [ 0.52253562 11.593549  ] tensor(117281) tensor(5286)
Counter({44: 300604, 60: 196150, 10: 123232, 12: 105487, 11: 105119, 16: 72656}) [ 0.52224572 11.73811566] tensor(864773) tensor(38475)
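
As a sanity check on the printed weights, the "balanced" heuristic can be reproduced by hand: each class weight equals `n_samples / (n_classes * class_count)`, computed over unmasked tokens only. The sketch below uses plain Python rather than scikit-learn; the helper name `balanced_class_weights` is hypothetical, and the label list is synthetic, built from the counts in the first printed line above (the validation set).

```python
from collections import Counter


def balanced_class_weights(labels, ignore_index=-100):
    """Return {class: n_samples / (n_classes * class_count)}, skipping masked labels."""
    kept = [y for y in labels if y != ignore_index]
    counts = Counter(kept)
    return {c: len(kept) / (len(counts) * counts[c]) for c in sorted(counts)}


# Synthetic labels matching the validation counts printed above:
# 114344 negatives, 5618 positives, plus a few masked (-100) tokens.
labels = [0] * 114344 + [1] * 5618 + [-100] * 10
weights = balanced_class_weights(labels)
print({c: round(w, 4) for c, w in weights.items()})
```

The positive class ends up weighted roughly twenty times more heavily than the negative one, which is why predictions made with `weight` tend to shift towards the positive class and call for a re-tuned decision threshold.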


## Setup model

In [9]:
warmup_steps, total_steps = calc_scheduler_steps(
    loader, warmup_perc=0.1, max_epochs=MAX_EPOCHS)
warmup_steps, total_steps

(5463, 54630)
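
The numbers above follow from simple arithmetic: 54630 total steps over 30 epochs implies 1821 batches per epoch (a value inferred here, not printed by the notebook), and 10% of the total gives 5463 warmup steps. The sketch below reproduces this and also shows the shape of a linear warmup and decay schedule as a learning-rate multiplier. The function names are hypothetical, and the multiplier is a plain-Python illustration of the schedule rather than the `transformers` code itself.

```python
def scheduler_steps(epoch_steps, max_epochs, warmup_perc=0.1):
    """Same arithmetic as `calc_scheduler_steps`, given the batches per epoch."""
    total_steps = epoch_steps * max_epochs
    return int(warmup_perc * total_steps), total_steps


def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp 0 -> 1, then linear decay 1 -> 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# 1821 batches per epoch is inferred from 54630 / 30
warmup_steps, total_steps = scheduler_steps(epoch_steps=1821, max_epochs=30)
for step in (0, warmup_steps // 2, warmup_steps, total_steps):
    print(step, linear_warmup_multiplier(step, warmup_steps, total_steps))
```

With `lr=1e-5`, the effective learning rate would thus peak at `1e-5` after the warmup and decay linearly to zero by the last scheduled step. Note that early stopping may end training before the schedule completes.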

We'll use the config of the pretrained model and add one field so that the model uses the experimental signal internally.

In [10]:
config = DistilBertConfig.from_pretrained(PRETRAINED_PATH)
config.use_signal = True

Note: manually providing the `device` argument is needed only when using `weight`. It also restricts the model to the provided device, so distributed training may raise an exception. I couldn't (yet) find a way to use `weight` with distributed training.

In [11]:
model = uBERTa_classifier(
    model=WeightedDistilBertClassifier,
    config=config,
    opt_kwargs={'lr': 1e-5, 'weight_decay': 1.5, 'eps': 1e-8}, 
    scheduler=get_linear_schedule_with_warmup,
    scheduler_kwargs={'num_warmup_steps': warmup_steps, 'num_training_steps': total_steps},
    weight=None, device='cuda:1'
)
# It's a bit nested, most certainly unnecessarily
model.model.bert = model.model.bert.from_pretrained(PRETRAINED_PATH)
model.model.config.use_signal = True

Some weights of the model checkpoint at ../models/ws100_step50_pretrain were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [12]:
model.summarize()



  | Name  | Type                         | Params
-------------------------------------------------------
0 | model | WeightedDistilBertClassifier | 20.9 M
-------------------------------------------------------
20.9 M    Trainable params
0         Non-trainable params
20.9 M    Total params
83.657    Total estimated model params size (MB)

## Train the model

In [13]:
stopper = pl.callbacks.early_stopping.EarlyStopping(
    monitor='val_loss', 
    verbose=True, mode='min', 
    min_delta=1e-6,
    patience=10)
pointer = pl.callbacks.ModelCheckpoint(
    monitor='val_loss', 
    dirpath=f'../models/checkpoints/{MODEL}', 
    verbose=True, mode='min')
logger = pl.loggers.TensorBoardLogger('../logs', f'{MODEL}')
lr_monitor = pl.callbacks.LearningRateMonitor('epoch')
bar = pl.callbacks.TQDMProgressBar()
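
To illustrate what `patience=10` and `min_delta=1e-6` mean for the `stopper` above, here is a simplified sketch of early stopping in `min` mode. It only mirrors the general idea (an epoch counts as an improvement when the monitored loss drops by more than `min_delta`); the function name is hypothetical, and this is not the PyTorch Lightning implementation.

```python
def early_stop_epoch(val_losses, patience=10, min_delta=1e-6):
    """Return the epoch index at which training would stop, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improvement: remember the new best value and reset the counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: improvement for three epochs, then a plateau
losses = [1.0, 0.9, 0.8] + [0.8] * 10
print(early_stop_epoch(losses))
```

Since `MAX_EPOCHS` is 30 here, a patience of 10 epochs is fairly lenient: training stops early only after a third of the epoch budget passes without improvement in `val_loss`.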

In [14]:
gpus = [1]
trainer = pl.Trainer(
    gradient_clip_val=0.5, 
    stochastic_weight_avg=True,
    accelerator="gpu",
    precision=16,
    gpus=gpus,
    callbacks=[stopper, pointer, bar, lr_monitor],
    logger=logger,
    max_epochs=MAX_EPOCHS
)

Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs


In [None]:
trainer.fit(model, loader)

## Reinitialize the model from the best checkpoint and save the weights

Check the checkpoints directory (`../models/checkpoints/{MODEL}`) and enter the name of the desired checkpoint file below.

In [None]:
ckpt = 'epoch=20-step=38597.ckpt'
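
Rather than reading file names off the directory by eye, the checkpoints can be listed and sorted by epoch and step. The sketch below assumes the `epoch=<N>-step=<M>.ckpt` naming seen in the cell above; the helpers `parse_checkpoint_name` and `list_checkpoints` are hypothetical and not part of `uBERTa`.

```python
import re
from pathlib import Path

# File names such as 'epoch=20-step=38597.ckpt'
CKPT_PATTERN = re.compile(r"epoch=(\d+)-step=(\d+)\.ckpt$")


def parse_checkpoint_name(name):
    """Return (epoch, step) parsed from a checkpoint file name, or None if it does not match."""
    match = CKPT_PATTERN.search(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def list_checkpoints(directory):
    """Return matching checkpoint file names in `directory`, sorted by (epoch, step)."""
    found = []
    for path in Path(directory).glob("*.ckpt"):
        parsed = parse_checkpoint_name(path.name)
        if parsed is not None:
            found.append((parsed, path.name))
    return [name for _, name in sorted(found)]


print(parse_checkpoint_name("epoch=20-step=38597.ckpt"))
```

Sorting by epoch does not reveal which checkpoint has the lowest `val_loss`. If the training session is still alive, the `best_model_path` attribute of the `ModelCheckpoint` callback (`pointer` above) should point to the best checkpoint directly.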

In [None]:
_model = uBERTa_classifier.load_from_checkpoint(
    f'../models/checkpoints/{MODEL}/{ckpt}', 
    model=WeightedDistilBertClassifier,
    config=config)

In [None]:
_model.model.save_pretrained(MODEL_PATH)