# Predict 5'UTR TISs

Here, we'll use the fine-tuned `uBERTa_classifier` to predict translation initiation sites (TISs) in the 5'UTR sequences of human protein-coding genes.

## Prerequisites

This notebook requires:
- [DS_BASE](https://drive.google.com/file/d/15fQP5ldYNvV1YY2T2Qza9CNFdYj4zZg8/view?usp=sharing) (`../data/DS_BASE_v4.7_seqs.tsv`)
- [dataset_labeling](https://drive.google.com/file/d/1-R1zLJRrJg3KXAaqDe9T60s9MXdrv4RO/view?usp=sharing) (`../data/dataset_labeling_v4.7.tsv`)
- [trained_model](https://drive.google.com/file/d/1weL5Wp3DrCIoW-kCxJ6aIveYZyQQbDOQ/view?usp=sharing) (`../models/ws100_step20_AAG_ACG_AGG_ATA_ATC_ATG_ATT_CTG_GTG_TTG_nopretrain_tokenlevel_signal`)

The requirements can either be downloaded from the links above or generated locally: run `prepare_base_dataset.ipynb` for the first two and `train_uBERTa.ipynb` for the trained model.

For instance, starting from the project's root:
```bash
gdown --fuzzy https://drive.google.com/file/d/15fQP5ldYNvV1YY2T2Qza9CNFdYj4zZg8/view?usp=sharing
gdown --fuzzy https://drive.google.com/file/d/1-R1zLJRrJg3KXAaqDe9T60s9MXdrv4RO/view?usp=sharing
tar -xzf DS_BASE_v4.7_seqs.tsv.tar.gz
tar -xzf dataset_labeling_v4.7.tsv.tar.gz
mkdir -p ../models
cd ../models
gdown --fuzzy https://drive.google.com/file/d/1weL5Wp3DrCIoW-kCxJ6aIveYZyQQbDOQ/view?usp=sharing
tar -xzf ws100_step20_AAG_ACG_AGG_ATA_ATC_ATG_ATT_CTG_GTG_TTG_nopretrain_tokenlevel_signal.tar.gz
```
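
Before running the cells below, it may help to confirm that the three prerequisites are in place. The following is a minimal sketch, not part of the original notebook: `find_missing` is a hypothetical helper, and the paths are the ones listed above, assumed to be relative to the notebook's directory.

```python
from pathlib import Path

# Paths as listed in the prerequisites (assumed relative to the notebook's directory).
REQUIRED = [
    Path('../data/DS_BASE_v4.7_seqs.tsv'),
    Path('../data/dataset_labeling_v4.7.tsv'),
    Path('../models/ws100_step20_AAG_ACG_AGG_ATA_ATC_ATG_ATT_CTG_GTG_TTG_nopretrain_tokenlevel_signal'),
]


def find_missing(paths):
    """Return the subset of `paths` that does not exist on disk (hypothetical helper)."""
    return [Path(p) for p in paths if not Path(p).exists()]


missing = find_missing(REQUIRED)
if missing:
    print('Missing prerequisites:', *missing, sep='\n  ')
else:
    print('All prerequisites found.')
```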

In [1]:
import logging
import operator as op
from itertools import chain
from math import ceil
from pathlib import Path

import numpy as np
import pandas as pd
import pytorch_lightning as pl
import torch
from more_itertools import sliced, unzip
from scipy.stats import pearsonr
from sklearn.metrics import f1_score, recall_score, precision_score, roc_auc_score, balanced_accuracy_score
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset
from tqdm.auto import tqdm
from transformers import DistilBertConfig

from uBERTa.base import VALID_START
from uBERTa.loader import uBERTaLoader
from uBERTa.model import uBERTa_classifier, WeightedDistilBertClassifier2
from uBERTa.tokenizer import DNATokenizer
from uBERTa.utils import split_values, kmerize

In [2]:
MIN_SEQ_SIZE = 30
MAX_SEQ_SIZE = 3000
WINDOW = 100
STEP = WINDOW // 5
BATCH_SIZE = 2 ** 10

BASE = Path('../data/NN_pred')
BASE.mkdir(exist_ok=True)
DATA = BASE.parent
MODEL_PATH = Path('../models/ws100_step20_AAG_ACG_AGG_ATA_ATC_ATG_ATT_CTG_GTG_TTG_nopretrain_tokenlevel_signal/')
DS = DATA / 'DS_BASE_v4.7_seqs.tsv'
DS_LABELS = DATA / 'dataset_labeling_v4.7.tsv'
STARTS = VALID_START

In [3]:
logging.basicConfig(level=logging.DEBUG)

In [4]:
def parse_base(path_base, path_labels, min_seq_size, max_seq_size):
    """
    Read base dataset, merge gene labels (splitting genes into Train, 
        Test, and Validation) and filter by seq size
    """
    df = pd.read_csv(path_base, sep='\t')
    df['SeqSize'] = df['Seq'].apply(len)
    print(f'Initial ds: {len(df)}')
    df_labels = pd.read_csv(path_labels, sep='\t')
    df = df.merge(df_labels, on='GeneID', how='left')
    df = df[(df.SeqSize >= min_seq_size) & (df.SeqSize <= max_seq_size)]
    print(f'Conforming to size threshold: {len(df)}')
    split_values(df, 'SeqEnum')
    split_values(df, 'Classes')
    split_values(df, 'Signal', dtype=float)
    return df

def aggregate_predictions(predictions):
    """
    Detach and concatenate raw batch predictions
    """
    y_prob = [x[1].detach().cpu().numpy()[:, 1] for x in predictions]
    y_true = [x[2].detach().cpu().numpy() for x in predictions]
    
    return np.concatenate(y_prob), np.concatenate(y_true)

# def safe_take_fst(xs):
#     assert len(xs.unique()) == 1
#     return xs.iloc[0]

# def take_fst(xs):
#     return xs.iloc[0]

# def unravel_base_ds(path, keep_cols=('GeneID', 'TranscriptID')):
#     """
#     Unravel sequence data of the base dataset into the codon-per-row format.
#     """
    
#     def unravel_row(row):
#         keep_values = [row[col] for col in keep_cols]
#         for i, (codon, en, cls) in enumerate(
#             zip(row.Seq.split(), row.SeqEnum, row.Classes)
#         ):
#             if cls != -100:
#                 yield (row.Chrom, row.Strand, en, codon, *keep_values)
    
#     df = pd.read_csv(path, sep='\t')
#     split_values(df, 'SeqEnum')
#     df['Seq'] = df['Seq'].apply(lambda s: kmerize(s, 3))
#     unraveled = chain.from_iterable(
#         map(unravel_row, map(op.itemgetter(1), df.iterrows())))
#     # print(next(unraveled))

#     return pd.DataFrame(
#         unraveled,
#         columns=['Chrom', 'Strand', 'Start', 'Codon'] + list(keep_cols))


# def center_ds(df: pd.DataFrame):
#     def unravel(row):
#         mask = row.Classes != -100
#         seq = row['Seq'].split()
#         sig = row['Signal']

#         prepend_values = [row[c] for c in cols_prepend]
#         for idx in np.where(mask)[0]:
#             pos = row['SeqEnum'][idx]
#             cls = row['Classes'][idx]
#             start = ''.join(seq_c[RNA_SIDE: RNA_SIDE + 3])
#             yield *prepend_values, seq_c, start, cls, pos, sig_c

#     cols_roll = ['Seq', 'Start', 'Classes', 'SeqEnum', 'Signal']
#     cols_prepend = [c for c in df.columns if c not in cols_roll]
#     columns = cols_prepend + cols_roll

#     rows = tqdm(df.iterrows(), total=len(df), desc='Unraveling')

#     unraveled = chain.from_iterable(map(unravel, map(op.itemgetter(1), rows)))

#     return pd.DataFrame(unraveled, columns=columns)


# def unravel_and_group(df, y_prob, y_true, threshold=0.5, pred_agg='mean'):
#     """
#     Unravel sequences in the dataset to restructure as codon-per-row.
#     Aggregate predictions for the same codon.
#     """
#     def unravel_row(row):
#         for i, (codon, en, cls, sig) in enumerate(
#             zip(row.Seq.split(), row.SeqEnum, row.Classes, row.Signal)
#         ):
#             if cls != -100:
#                 yield row.Chrom, row.Strand, en, codon, cls, sig, row.Dataset
    
#     # Unravel the dataset with predictions
#     unraveled = map(unravel_row, map(op.itemgetter(1), df.iterrows()))
#     _df = pd.DataFrame(
#         chain.from_iterable(unraveled), 
#         columns=['Chrom', 'Strand', 'Start', 'Codon', 
#                  'Label', 'Signal', 'Dataset'])
    
#     # This serves as additional sanity check
#     # working iff the number of codons match
#     _df['y_prob'] = np.squeeze(y_prob[:, 1])  # Take the probability of the positive class
#     _df['y_true'] = np.squeeze(y_true)
    
#     _df = _df[_df.Start != 0]
    
#     # Average the predictions for each start codon across transcripts and windows
#     _df = _df.groupby(
#         ['Chrom', 'Strand', 'Start', 'Codon'], 
#         as_index=False
#     ).agg({
#         'y_prob': pred_agg, 
#         'y_true': safe_take_fst,
#         'Signal': take_fst,
#         'Dataset': take_fst,
#     })
    
#     _df['y_pred'] = (_df['y_prob'] > threshold).astype(int)
    
#     return _df

def calc_pred_scores(df):
    y_prob = df['y_prob'].values
    y_pred = df['y_pred'].values
    y_true = df['y_true'].values
    fn, fp, tn, tp = map(
        lambda x: len(df[df.PredictionType == x]), 
        ['FN', 'FP', 'TN', 'TP'])
    return {
        'f1': f1_score(y_true, y_pred, zero_division=0), 
        'prc': precision_score(y_true, y_pred, zero_division=0), 
        'rec': recall_score(y_true, y_pred, zero_division=0),
        'bac': balanced_accuracy_score(y_true, y_pred),
        'FN': fn, 'FP': fp, 'TN': tn, 'TP': tp,
    }

def annotate_predictions(df):
    df = df.copy()
    df.loc[(df.y_true == 1) & (df.y_pred == 1), 'PredictionType'] = 'TP'
    df.loc[(df.y_true == 1) & (df.y_pred == 0), 'PredictionType'] = 'FN'
    df.loc[(df.y_true == 0) & (df.y_pred == 1), 'PredictionType'] = 'FP'
    df.loc[(df.y_true == 0) & (df.y_pred == 0), 'PredictionType'] = 'TN'
    return df

def score(df):
    """Score predictions per start codon and for all codons pooled together."""
    df = df.copy()
    df['y_true'] = df['y_true'].astype(int)
    scores = {
        codon: calc_pred_scores(group) 
        for codon, group in df.groupby('Seq')}
    scores['All'] = calc_pred_scores(df)
    return scores

def split_into_loaders(tds, chunk_size, batch_size = 2 ** 8):
    for tensors in sliced(tds, chunk_size):
        tensors = list(tensors)
        if all(x.shape[0] > 0 for x in tensors):
            _tds = TensorDataset(*tensors)
            yield DataLoader(
                _tds, 
                sampler=SequentialSampler(_tds), 
                batch_size=batch_size,
                num_workers=4)
        else:
            return
        
# def split_df(df, chunk_size):
#     n = ceil(len(df) // chunk_size)
#     for i in range(n):
#         yield df.iloc[i * chunk_size: (i + 1) * chunk_size]

# def predict(loader, ds, model, trainer, threshold, agg_fn):
#     predictions = trainer.predict(model, loader)
#     y_prob, y_true = aggregate_predictions(predictions)
#     df = unravel_and_group(ds, y_prob, y_true, threshold, agg_fn)
#     return df

def agg_y_true(vs):
    if len(vs) == 1:
        return vs
    s = set(vs)
    if len(s) > 1:
        return ';'.join(map(str, vs))
    return s.pop()

def unravel_scores(scores):
    for ds_name, ds_vs in scores.items():
        for codon_name, codon_scores in ds_vs.items():
            for score_name, score_val in codon_scores.items():
                yield ds_name, codon_name, score_name, score_val
                
def get_color(y_pred, y_true, dataset):
    """ Make RGB colors for the bed file based on the prediction type"""
    green, blue, red, black = (
        '0,255,0', '0,0,255', '255,0,0', '0,0,0')
    if dataset == 'Inference':
        if y_pred == 1:
            return green  # Inference positive
        return blue       # Inference negative
    if y_pred == 1 and y_true == 1:
        return green      # TP
    if y_pred == 0 and y_true == 0:
        return blue       # TN
    if y_pred == 0 and y_true == 1:
        return red        # FN
    return black          # FP

def wrap_row(row, ts=0.5):
    label = row.Dataset
    color = get_color(row.y_pred, row.y_true, row.Dataset)
    start = row.SeqEnum if row.Strand == '+' else row.SeqEnum - 2
    end = start + 3
    return (f'{row.Chrom} {start} {end} {label} '
            f'{int(row.y_prob * 100)} {row.Strand} {start} {end} {color}')

def pred2bed(df, out_path):
    with open(out_path, 'w') as f:
        print('track name="uBERTa predictions v4.7" itemRgb="On"', file=f)
        for _, row in tqdm(df.iterrows()):
            print(wrap_row(row), file=f)

## Prepare data

We'll initialize `uBERTaLoader` without a base dataset and use its methods to prepare the sequence data for prediction. This takes care of encoding the inputs and sliding the window over the sequence data. Be careful to use the same setup as in `train_uBERTa.ipynb`, especially with respect to the window parameters and the experimental signal bounds.
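
To illustrate what sliding a window over the sequence data means, here is a minimal sketch. It is not the `uBERTaLoader` implementation: `roll_window` is a hypothetical helper, and the way it handles the sequence tail (adding one final window aligned to the end) is an assumption.

```python
def roll_window(tokens, window, step):
    """Yield (offset, chunk) pairs covering `tokens` with overlapping windows.

    Hypothetical helper; the tail handling is an assumption, not uBERTa's behavior.
    """
    if len(tokens) <= window:
        # Short sequences fit into a single window
        yield 0, tokens
        return
    last_start = len(tokens) - window
    for start in range(0, last_start + 1, step):
        yield start, tokens[start:start + window]
    # If the regular grid stopped short of the end, add a window aligned to the end
    if last_start % step:
        yield last_start, tokens[last_start:]


codons = ['AAG', 'CTG', 'ATG', 'GTG', 'TTG', 'ACG', 'ATT']
for offset, chunk in roll_window(codons, window=4, step=2):
    print(offset, chunk)
```

Because the step is smaller than the window, most positions fall into several windows and thus receive several predictions, which is why predictions are averaged per position later on.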

In [6]:
tokenizer = DNATokenizer(kmer=3)
loader = uBERTaLoader(
    None, WINDOW, STEP, tokenizer, 
    scale_signal_bounds=(0.0, 10.0),
    is_mlm_task=False,
    valid_start_codons=STARTS,
    batch_size=BATCH_SIZE)

In [7]:
ds = parse_base(DS, DS_LABELS, MIN_SEQ_SIZE, MAX_SEQ_SIZE)

Initial ds: 79453
Conforming to size threshold: 73797


In [8]:
ds = loader._prep_token_level(
    ds, 'Main'
)

INFO:uBERTa.loader:Preparing Main with 73797 records for token-level task
INFO:uBERTa.loader:Using kmer 3 on ('Seq', 'SeqEnum', 'Signal', 'Classes')
DEBUG:uBERTa.loader:Reducing kmers for Main
DEBUG:uBERTa.loader:Filtering to ('AAG', 'ACG', 'AGG', 'ATA', 'ATC', 'ATG', 'ATT', 'CTG', 'GTG', 'TTG') for Main
DEBUG:uBERTa.loader:Capping and scaling signal for Main
DEBUG:uBERTa.loader:Capped signal in (0.1, 5000.0)
DEBUG:uBERTa.loader:Scaled signal between 0 and 1. Min 0.1, Max 5000.0
INFO:uBERTa.loader:Rolling window with size 98, step 20


In [9]:
# NOTE: `drop` returns a new frame; the result is not assigned, so `SeqSize` is kept
ds.drop(columns='SeqSize')
ds.loc[ds.Dataset.isna(), 'Dataset'] = 'Inference'

In [10]:
tds = loader._prep_tds_cls(ds)

In [11]:
torch.save(tds, DATA / 'tds.torch')

In [12]:
ds['Seq'] = ds['Seq'].apply(lambda x: np.array(x.split()))

In [13]:
cls_indices = [np.where(x != -100)[0] for x in ds['Classes']]
for col in ['Seq', 'Classes', 'SeqEnum', 'Signal']:
    ds[col] = [x[i] for i, x in zip(cls_indices, ds[col])]

In [14]:
ds = ds.explode(['Seq', 'Classes', 'SeqEnum', 'Signal'])

In [15]:
ds.to_csv(DATA / 'base_unraveled.csv', index=False)

## Init loader

In [5]:
tds = torch.load(DATA / 'tds.torch')

In [6]:
predict_loader = DataLoader(
    tds, 
    sampler=SequentialSampler(tds), 
    batch_size=BATCH_SIZE,
    num_workers=10)

## Load model

In [7]:
config = DistilBertConfig.from_pretrained(MODEL_PATH)
model = uBERTa_classifier(
    model=WeightedDistilBertClassifier2,
    config=config,
)
model.model = model.model.from_pretrained(MODEL_PATH)
model.model.config.use_signal = True

Some weights of the model checkpoint at ../models/ws100_step20_AAG_ACG_AGG_ATA_ATC_ATG_ATT_CTG_GTG_TTG_nopretrain_tokenlevel_signal were not used when initializing WeightedDistilBertClassifier2: ['loss.weight']
- This IS expected if you are initializing WeightedDistilBertClassifier2 from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing WeightedDistilBertClassifier2 from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Predict

- Running prediction over the whole dataset at once may exhaust RAM, so we split it into chunks of `chunk_size` samples, wrap each chunk in its own loader, and predict the chunks separately.
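
The chunking idea can be sketched independently of PyTorch. In the snippet below, `iter_chunks` and `predict_in_chunks` are hypothetical helpers and `predict_fn` stands in for the actual model call. The point is that chunks are processed in order, so the concatenated results line up with the input samples.

```python
def iter_chunks(n_samples, chunk_size):
    """Yield (start, stop) index pairs that partition range(n_samples) in order."""
    for start in range(0, n_samples, chunk_size):
        yield start, min(start + chunk_size, n_samples)


def predict_in_chunks(samples, chunk_size, predict_fn):
    """Apply `predict_fn` to consecutive chunks and concatenate the results.

    `predict_fn` is a placeholder for the real prediction routine.
    """
    results = []
    for start, stop in iter_chunks(len(samples), chunk_size):
        results.extend(predict_fn(samples[start:stop]))
    return results


# Toy "model" that doubles each input value
doubled = predict_in_chunks(list(range(10)), chunk_size=4,
                            predict_fn=lambda xs: [2 * x for x in xs])
print(doubled)
```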

In [9]:
gpus = [0]
trainer = pl.Trainer(
    accelerator="gpu",
    precision=16,
    gpus=gpus,
)

Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs


In [11]:
chunk_size = 100000
loaders = list(split_into_loaders(tds, chunk_size, BATCH_SIZE))
predictions = [aggregate_predictions(trainer.predict(model, l)) for l in loaders]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]


Predicting: 0it [00:00, ?it/s]


## Parse and dump results

- The model outputs one prediction per valid start codon.
- As a result, the number of predicted instances and their order must match those in the `base_unraveled` prepared above.

In [12]:
y_prob, y_true = map(np.concatenate, map(list, unzip(predictions)))
y_prob.shape, y_true.shape

((9568908,), (9568908,))

In [13]:
cds = pd.read_csv(DATA / 'base_unraveled.csv')
len(cds)

9568908

In [14]:
cds['y_prob'] = y_prob
cds['y_true'] = y_true

- We double-check that the true classes output by the model exactly match those in `base_unraveled`.

In [15]:
not len(cds[cds.Classes != cds.y_true])

True

- Explicitly mark the "inference" dataset

In [16]:
idx = cds.Dataset == 'Inference'
cds.loc[idx, 'y_true'] = -1
cds.loc[idx, 'Classes'] = -1
cds.loc[idx, 'Dataset'] = 'Inference'

In [17]:
cds.head()

| | Chrom | Strand | TranscriptID | GeneID | SeqSize | Dataset | Seq | Classes | SeqEnum | Signal | y_prob | y_true |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr1 | + | ENST00000003912 | ENSG00000001461 | 715 | Inference | ATT | -1 | 24415803 | 0.0018 | 0.000878 | -1 |
| 1 | chr1 | + | ENST00000003912 | ENSG00000001461 | 715 | Inference | TTG | -1 | 24415804 | 0.0018 | 0.000853 | -1 |
| 2 | chr1 | + | ENST00000003912 | ENSG00000001461 | 715 | Inference | AAG | -1 | 24415809 | 0.0038 | 0.000697 | -1 |
| 3 | chr1 | + | ENST00000003912 | ENSG00000001461 | 715 | Inference | AGG | -1 | 24415810 | 0.081802 | 0.011008 | -1 |
| 4 | chr1 | + | ENST00000003912 | ENSG00000001461 | 715 | Inference | ATG | -1 | 24415814 | 0.335807 | 0.921499 | -1 |


- Group predictions by positions (across transcripts and slices created by the sliding window approach)
- Aggregate predictions:
    - Merge genes.
    - Merge transcripts.
    - Safely aggregate y_true; there should be no positions having opposing `y_true` labels.
    - Take average probability of predictions.
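
The aggregation steps above can be sketched in plain Python. This is an illustration only: `aggregate_by_position` and the record keys are hypothetical, and unlike `agg_y_true` in this notebook (which joins conflicting labels with `;`), the sketch raises an error when one position carries opposing labels.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_position(records):
    """Group per-window predictions by genomic position and average them.

    Hypothetical helper; each record is a dict with the keys
    chrom, strand, codon, pos, transcript, y_true, y_prob.
    """
    groups = defaultdict(list)
    for r in records:
        groups[(r['chrom'], r['strand'], r['codon'], r['pos'])].append(r)

    aggregated = {}
    for key, rows in groups.items():
        labels = {r['y_true'] for r in rows}
        if len(labels) > 1:
            # Assumption: treat opposing labels at one position as an error
            raise ValueError(f'Conflicting labels at {key}: {sorted(labels)}')
        aggregated[key] = {
            'transcripts': ';'.join(sorted({r['transcript'] for r in rows})),
            'y_true': labels.pop(),
            'y_prob': mean(r['y_prob'] for r in rows),
        }
    return aggregated


# Made-up records: one position seen in two transcripts
records = [
    dict(chrom='chr1', strand='+', codon='ATG', pos=100, transcript='T2', y_true=1, y_prob=0.75),
    dict(chrom='chr1', strand='+', codon='ATG', pos=100, transcript='T1', y_true=1, y_prob=0.25),
    dict(chrom='chr1', strand='+', codon='CTG', pos=130, transcript='T1', y_true=0, y_prob=0.125),
]
result = aggregate_by_position(records)
print(result[('chr1', '+', 'ATG', 100)])
```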

In [18]:
cds = cds.groupby(
        ['Chrom', 'Strand', 'Seq', 'SeqEnum', 'Dataset'], 
        as_index=False
    ).agg(
    {
        'GeneID':  lambda vs: ';'.join(sorted(set(vs))),
        'TranscriptID': lambda vs: ';'.join(sorted(set(vs))),
        'y_true': agg_y_true,
        'y_prob': 'mean',
})
# len(cds)

In [19]:
cds['y_pred'] = (cds['y_prob'] > 0.5).astype(int)

In [20]:
cds = annotate_predictions(cds)

In [21]:
len(cds)

1421737

In [22]:
cds.y_true.value_counts()

-1    1100276
 0     316056
 1       5405
Name: y_true, dtype: int64

In [23]:
cds = cds[
    ['GeneID', 'TranscriptID', 'Chrom', 'Strand', 'Seq', 'SeqEnum',
     'Dataset', 'y_true', 'y_pred', 'y_prob', 'PredictionType']
]

- Score predictions per dataset and codon.

In [24]:
scores = {ds_name: score(cds[cds.Dataset == ds_name]) for ds_name in ('Train', 'Val', 'Test')}

- Format the table for easier visual representation

In [25]:
df_scores = pd.DataFrame(
    unravel_scores(scores), 
    columns=['Dataset', 'Codon', 'ScoreType', 'ScoreVal']
).round(2).pivot(
    index=['Dataset', 'Codon'], columns='ScoreType', values='ScoreVal'
)

for c in ['FN', 'FP', 'TN', 'TP']:
    df_scores[c] = df_scores[c].astype(int)

df_scores['P'] = df_scores['TP'] + df_scores['FN']

df_scores = df_scores.reset_index().sort_values(
    ['Dataset', 'P'], ascending=[True, False]
).set_index(['Dataset', 'Codon'])[[
    'f1', 'prc', 'rec', 'bac', 'TN', 'FN', 'FP', 'TP', 'P'
]]

df_scores

| Dataset | Codon | f1 | prc | rec | bac | TN | FN | FP | TP | P |
|---|---|---|---|---|---|---|---|---|---|---|
| Test | All | 0.61 | 0.48 | 0.83 | 0.91 | 29746 | 98 | 526 | 486 | 584 |
| Test | CTG | 0.64 | 0.5 | 0.88 | 0.93 | 5502 | 23 | 173 | 171 | 194 |
| Test | ATG | 0.62 | 0.48 | 0.88 | 0.9 | 1783 | 20 | 163 | 150 | 170 |
| Test | GTG | 0.63 | 0.51 | 0.81 | 0.9 | 3733 | 16 | 65 | 69 | 85 |
| Test | ACG | 0.6 | 0.47 | 0.82 | 0.89 | 1376 | 8 | 41 | 36 | 44 |
| Test | TTG | 0.58 | 0.51 | 0.68 | 0.84 | 3032 | 14 | 29 | 30 | 44 |
| Test | ATC | 0.68 | 0.57 | 0.84 | 0.92 | 1907 | 4 | 16 | 21 | 25 |
| Test | ATT | 0.37 | 0.26 | 0.6 | 0.79 | 2336 | 6 | 25 | 9 | 15 |
| Test | ATA | 0.0 | 0.0 | 0.0 | 0.5 | 1307 | 4 | 4 | 0 | 4 |
| Test | AAG | 0.0 | 0.0 | 0.0 | 0.5 | 3758 | 2 | 5 | 0 | 2 |
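
As a sanity check, the reported metrics can be recomputed from the confusion counts alone. The sketch below uses the standard definitions of precision, recall, F1, and balanced accuracy; `scores_from_counts` is a hypothetical helper rather than part of the notebook, and the counts are those of the `Test`/`All` row.

```python
def scores_from_counts(tp, fp, tn, fn):
    """Compute binary classification scores from confusion counts.

    Hypothetical helper; undefined ratios are reported as 0.0.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        'f1': f1,
        'prc': precision,
        'rec': recall,
        # Balanced accuracy is the mean of recall and specificity
        'bac': (recall + specificity) / 2,
    }


test_all = scores_from_counts(tp=486, fp=526, tn=29746, fn=98)
print({name: round(value, 2) for name, value in test_all.items()})
# → {'f1': 0.61, 'prc': 0.48, 'rec': 0.83, 'bac': 0.91}
```

Codons without any true positives, such as `ATA` and `AAG`, get zero precision and recall, so their balanced accuracy rounds to 0.5 despite the many true negatives.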


Dump the scores, the predictions table, and the predictions in BED format.

In [26]:
df_scores.to_csv(BASE / 'prediction_scores_v4.7.tsv', sep='\t')

In [27]:
cds.to_csv(BASE / 'predictions_5UTR_v4.7.tsv', sep='\t', index=False)

In [28]:
pred2bed(cds, BASE / 'predictions_5UTR_v4.7.bed')

0it [00:00, ?it/s]
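
For reference, each line written by `pred2bed` follows the nine-field BED layout: chrom, chromStart, chromEnd, name, score, strand, thickStart, thickEnd, itemRgb. The sketch below mirrors the logic of `wrap_row` on plain values; `bed_line` is a hypothetical helper. BED coordinates are 0-based and half-open, and the source does not state which convention `SeqEnum` follows, so verify the coordinates before using the track.

```python
def bed_line(chrom, pos, strand, name, prob, rgb):
    """Format one codon as a BED9 line, mirroring `wrap_row` (hypothetical helper)."""
    # On the minus strand the codon extends towards smaller coordinates
    start = pos if strand == '+' else pos - 2
    end = start + 3
    score = int(prob * 100)  # probability truncated to an integer score
    fields = (chrom, start, end, name, score, strand, start, end, rgb)
    return ' '.join(map(str, fields))


# Values taken from the ATG row of the `cds.head()` output; green marks a positive call
print(bed_line('chr1', 24415814, '+', 'Inference', 0.921499, '0,255,0'))
# → chr1 24415814 24415817 Inference 92 + 24415814 24415817 0,255,0
```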