# Modeling using XGBoost

Here, we'll train an XGBoost classifier (`XGBClassifier`) and make predictions for the 5'UTR sequences.

**NOTE**: due to different data preparation routines, this notebook may require more RAM than the uBERTa notebooks.
For this reason, we may run the cells in chunks: e.g., prepare the data first, restart the kernel, then run training.
As a result, the cell execution numbers below are not sequential.


## Packages

**NOTE**: this notebook purposefully doesn't depend on uBERTa's source code and has different dependencies. If running from the uBERTa environment, please additionally install the following packages:

- scikit-learn
- optuna
- joblib
- xgboost
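
As a quick sanity check before running the notebook, the snippet below reports which of the packages listed above cannot be imported. It is a minimal sketch: the helper name `missing_packages` is hypothetical, and note that scikit-learn is imported under the name `sklearn`.

```python
from importlib.util import find_spec

# Import names, not PyPI names: scikit-learn is imported as `sklearn`.
REQUIRED = ("sklearn", "optuna", "joblib", "xgboost")


def missing_packages(names=REQUIRED):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
print("Missing packages:", ", ".join(missing) if missing else "none")
```

Any name reported as missing can then be installed into the active environment.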


## Data prerequisites

- [DS_BASE](https://drive.google.com/file/d/15fQP5ldYNvV1YY2T2Qza9CNFdYj4zZg8/view?usp=sharing) (`../data/DS_BASE_v4.7_seqs.tsv`)
- [dataset_labeling](https://drive.google.com/file/d/1-R1zLJRrJg3KXAaqDe9T60s9MXdrv4RO/view?usp=sharing) (`../data/dataset_labeling_v4.7.tsv`)
- [expression_data](https://drive.google.com/file/d/1AsrwNL5rsmnQoI6Mw1XdYlT4MUtZLNOZ/view?usp=sharing) (`../data/rna_single_cell_type.tsv.zip`)

One can either download these files using the links above or prepare them manually using `prepare_base_dataset.ipynb`. The expression data can be downloaded either from [proteinatlas.org](https://www.proteinatlas.org/about/download) or using the link above.

For instance, starting from the project's root:
```bash
gdown --fuzzy 'https://drive.google.com/file/d/15fQP5ldYNvV1YY2T2Qza9CNFdYj4zZg8/view?usp=sharing'
gdown --fuzzy 'https://drive.google.com/file/d/1-R1zLJRrJg3KXAaqDe9T60s9MXdrv4RO/view?usp=sharing'
gdown --fuzzy 'https://drive.google.com/file/d/1AsrwNL5rsmnQoI6Mw1XdYlT4MUtZLNOZ/view?usp=sharing'
tar -xzf DS_BASE_v4.7_seqs.tsv.tar.gz
tar -xzf dataset_labeling_v4.7.tsv.tar.gz
```
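
Once the files are downloaded and unpacked, a short check like the one below can confirm that they are where the notebook expects them. This is a sketch only: the helper `missing_inputs` is hypothetical, and the file names are taken from the paths listed above.

```python
from pathlib import Path

# File names as listed in the data prerequisites above
EXPECTED = (
    "DS_BASE_v4.7_seqs.tsv",
    "dataset_labeling_v4.7.tsv",
    "rna_single_cell_type.tsv.zip",
)


def missing_inputs(data_dir, names=EXPECTED):
    """Return the expected input files that are absent from data_dir."""
    data_dir = Path(data_dir)
    return [name for name in names if not (data_dir / name).is_file()]


# The notebook reads its inputs from `../data`, relative to the notebook itself
print(missing_inputs("../data"))
```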

In [1]:
import operator as op
from collections import namedtuple
from itertools import chain
from pathlib import Path

import joblib
import numpy as np
import optuna
import pandas as pd
from more_itertools import chunked, unzip, sliding_window
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score, balanced_accuracy_score
from tqdm.auto import tqdm
from xgboost import XGBClassifier

In [3]:
DATA_ = Path('../data')
DS = DATA_ / 'DS_BASE_v4.7_seqs.tsv'
DS_LABELS = DATA_ / 'dataset_labeling_v4.7.tsv'
RNA_CELL = DATA_ / 'rna_single_cell_type.tsv.zip'

RNA_SIDE = 50  # Flank sizes around start codons
SIG_SIDE = 50

MIN_SEQ_SIZE = 30
MAX_SEQ_SIZE = 3000

STARTS = ('AAG', 'ACG', 'AGG', 'ATA', 'ATC', 'ATG', 'ATT', 'CTG', 'GTG', 'TTG')
np.random.seed(666)

In [4]:
# ref = Ref(BASE / 'hg38.fa')

In [5]:
# logging.basicConfig(level=logging.DEBUG)

In [6]:
DATA = DATA_ / 'XGB'
DATA.mkdir(exist_ok=True)

DATA_PATHS = {
    'base_inp': DS,
    'labels_inp': DS_LABELS,
    'rna_cell_inp': RNA_CELL,

    'base_prep': DATA / 'base_centered.tsv',
    'meta': DATA / 'metadata.csv',
    'signal': DATA / 'signal.csv',
    'seq_rna': DATA / 'seq_rna.csv',
    'seq_raw': DATA / 'seq_raw.csv',
    'idx': DATA / 'idx.joblib',
    'labels': DATA / 'labels.csv',
    'rna_cell': DATA / 'rna_cell.csv',
    'train_x': DATA / 'train_x.csv',
    'train_y': DATA / 'train_y.csv',
    'val_x': DATA / 'val_x.csv',
    'val_y': DATA / 'val_y.csv',
    'test_x': DATA / 'test_x.csv',
    'test_y': DATA / 'test_y.csv',
    'cv': DATA / 'cv_predictions.csv',
}

In [7]:
XY = namedtuple('xy', ['x', 'y'])


def split_values(
        df: pd.DataFrame, col: str, to_array: bool = True,
        dtype=int, sep=',', conv_to=None) -> pd.DataFrame:  # `np.int` was removed in NumPy 1.24
    def split(vs):
        if not isinstance(vs, str):
            return vs
        _vs = vs.split(sep)
        if conv_to:
            _vs = list(map(conv_to, _vs))
        if to_array:
            _vs = np.array(_vs, dtype=dtype)
        return _vs

    df[col] = df[col].apply(split)
    return df


def parse_base(path_base, min_seq_size, max_seq_size):
    df = pd.read_csv(path_base, sep='\t')
    df['SeqSize'] = df['Seq'].apply(len)
    print(f'Initial ds: {len(df)}')
    # Copy the filtered frame so that the in-place column updates below don't act on a view
    df = df[(df.SeqSize >= min_seq_size) & (df.SeqSize <= max_seq_size)].copy()
    print(f'Conforming to size threshold: {len(df)}')
    split_values(df, 'SeqEnum')
    split_values(df, 'Classes')
    split_values(df, 'Signal', dtype=float)
    return df


def pad_and_slice_around(a, idx, side, **kwargs):
    assert idx < len(a), 'index lower than array size'
    assert idx >= 0, 'index at least 0'
    size_l = side - len(a[max([0, idx - side]):idx])
    size_r = side - len(a[idx + 1: idx + side + 1])

    a_pad = np.pad(a, (size_l, size_r), **kwargs)

    idx_new = idx + size_l

    return a_pad[idx_new - side: idx_new + side + 1]


def center_ds(df: pd.DataFrame):
    def unravel(row):
        mask = row.Classes != -100
        seq = list(row['Seq'])
        sig = row['Signal']

        prepend_values = [row[c] for c in cols_prepend]
        for idx in np.where(mask)[0]:
            pos = row['SeqEnum'][idx]
            cls = row['Classes'][idx]
            seq_c = pad_and_slice_around(seq, idx, RNA_SIDE, constant_values='X')
            seq_c = ''.join(seq_c)
            sig_c = pad_and_slice_around(sig, idx, SIG_SIDE, constant_values=0.0)
            sig_c = ','.join(map(str, sig_c))
            start = ''.join(seq_c[RNA_SIDE: RNA_SIDE + 3])
            yield *prepend_values, seq_c, start, cls, pos, sig_c

    cols_roll = ['Seq', 'Start', 'Classes', 'SeqEnum', 'Signal']
    cols_prepend = [c for c in df.columns if c not in cols_roll]
    columns = cols_prepend + cols_roll

    rows = tqdm(df.iterrows(), total=len(df), desc='Unraveling')

    unraveled = chain.from_iterable(map(unravel, map(op.itemgetter(1), rows)))

    return pd.DataFrame(unraveled, columns=columns)


def encode_one_hot(df, feature_range, col='Seq', prefix='', chars='ACGT', missing='XN'):
    feature_names = list(chain.from_iterable(
        (f'{prefix}{x}_{c}' for c in chars) for x in feature_range))
    mapping = {c: x for c, x in zip(chars, np.eye(len(chars), dtype=int))}
    zero = np.zeros(len(chars), dtype=int)
    mapping.update({c: zero for c in missing})
    encode = lambda s: np.hstack([mapping[c] for c in s])
    xs = np.vstack(df[col].map(encode).values)
    return pd.DataFrame(xs, columns=feature_names)


def scale(x, a, b):
    min_x = np.min(x)
    max_x = np.max(x)
    return a + (b - a) * (x - min_x) / (max_x - min_x)


def centered_range(l):
    mid = l // 2
    return range(-mid, mid + 1)


def prep_rna(df):
    l = len(df.iloc[0]['Seq'])
    return encode_one_hot(df, centered_range(l), 'Seq', 'r_')


def prep_signal(
        x_sig, cap_max: float = 5000.0, scale_min: float = 0.0, scale_max: float = 10.0
) -> pd.DataFrame:
    x_sig[x_sig >= cap_max] = cap_max
    x_sig = scale(x_sig, scale_min, scale_max)
    columns = [f's_{i}' for i in centered_range(x_sig.shape[1])]
    return pd.DataFrame(x_sig, columns=columns)


def standardize(x):
    return (x - np.mean(x)) / np.std(x)


def prep_atlas_data(df, path, columns, idx_name='Gene', val_name='nTPM', val_transform=standardize):
    df_atlas = pd.read_csv(path, sep='\t')
    if val_transform is not None:
        df_atlas[val_name] = val_transform(df_atlas[val_name])
    df_atlas = df_atlas.pivot_table(
        index=idx_name, columns=columns, values=val_name
    ).reset_index().rename(columns={'Gene': 'GeneID'})

    return df[['GeneID']].merge(
        df_atlas, on='GeneID', how='left'
    ).fillna(0.0).drop(columns='GeneID')


def load_chunks(size, datasets):

    chunks = (
        pd.read_csv(DATA_PATHS[n], chunksize=size) for n in 
        ['rna_cell', 'signal', 'seq_rna', 'labels', 'meta'])
    bar = tqdm(desc='Loading chunks')
    
    for i, (cell_chunk, signal_chunk, seq_chunk, ys_chunk, meta_chunk) in enumerate(zip(*chunks), start=1):
        bar.update(1)
        idx = meta_chunk['Dataset'].isin(datasets)
        if idx.sum():
            ds_x = pd.concat([signal_chunk[idx], seq_chunk[idx], cell_chunk[idx]], axis=1)
            meta_chunk = meta_chunk[idx].copy()
            meta_chunk['y_true'] = ys_chunk.loc[idx, 'Classes'].values
            yield ds_x, meta_chunk
    bar.close()

def compute_scores(y_pred, y):
    return {
        'f1': f1_score(y, y_pred, zero_division=0),
        'prc': precision_score(y, y_pred, zero_division=0),
        'rec': recall_score(y, y_pred, zero_division=0),
        'bac': balanced_accuracy_score(y, y_pred)
    }

## Prepare centered dataset

- As explained in the accompanying paper, the dataset preparation routine differs from the one used for distilBERT. Namely, we'll create a "centered" dataset, where each valid start codon in the 5'UTR transcript sequences is used as an anchor around which the flanking sequence and the ribo-seq signal are sliced. As a result, the features and positions are the same for each instance. Furthermore, the sequences themselves are one-hot encoded, while the ribo-seq signal is prepared in the same manner as for uBERTa.
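
To make the centering idea concrete, here is a minimal string-only sketch of slicing a fixed-size window around an anchor position and padding with `X` where the window runs past either end. The helper `center_window` is hypothetical and only mirrors the intent of `pad_and_slice_around` defined above; the toy sequence is made up.

```python
def center_window(seq, idx, side, pad="X"):
    """Return the 2 * side + 1 symbols centered at idx, padding past either end."""
    if not 0 <= idx < len(seq):
        raise IndexError("idx must point inside seq")
    left = seq[max(0, idx - side):idx]
    right = seq[idx + 1:idx + side + 1]
    return pad * (side - len(left)) + left + seq[idx] + right + pad * (side - len(right))


seq = "GGCTGACC"  # toy transcript with a CTG codon starting at index 2

window = center_window(seq, 2, 3)
print(window)       # XGGCTGA: one pad symbol is added on the left
print(window[3:6])  # CTG: the codon always starts at the window center

print(center_window(seq, 6, 3))  # TGACCXX: padded on the right instead
```

Because every window has the same length and the anchor always sits at the same offset, column `k` of the resulting feature table means the same relative position for every instance.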

In [7]:
ds = parse_base(DS, MIN_SEQ_SIZE, MAX_SEQ_SIZE).drop(
    columns=['SeqSize']
)

Initial ds: 79453
Conforming to size threshold: 73797


- Create k-merized sequences. The padding is added to match the shape of the sequence-level features and won't be used further.

In [8]:
ds['SeqKmers'] = ds['Seq'].apply(
    lambda x: np.array([''.join(y) for y in sliding_window(x, 3)] + ['PAD', 'PAD']))
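
For illustration, the same k-merization can be sketched without `more_itertools`. The helper name `kmers_with_padding` is hypothetical; it pads the list so that it has one entry per nucleotide, which for `k=3` matches the two `'PAD'` entries appended in the cell above.

```python
def kmers_with_padding(seq, k=3, pad="PAD"):
    """Overlapping k-mers of seq, padded so the list has one entry per nucleotide."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return kmers + [pad] * (len(seq) - len(kmers))


print(kmers_with_padding("ACTGA"))  # ['ACT', 'CTG', 'TGA', 'PAD', 'PAD']
```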

- Create masks based on the class labels and apply them to the sequence-level features to obtain the classes and positions of the valid start codons.

In [9]:
ds['ClsIdx'] = [np.where(x != -100)[0] for x in ds['Classes']]
for col in ['SeqKmers', 'Classes', 'SeqEnum']:
    ds[col] = [x[i] for i, x in zip(ds['ClsIdx'], ds[col])]
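
The masking step can be illustrated on a single toy transcript in plain Python. This is a sketch under assumptions: `select_labeled` is a hypothetical helper, the coordinates are made up, and `-100` is the label the notebook uses for positions that are not valid start codons.

```python
IGNORE_LABEL = -100  # marks positions that are not valid start codons


def select_labeled(classes, *features):
    """Keep only the positions whose class label is not IGNORE_LABEL."""
    idx = [i for i, c in enumerate(classes) if c != IGNORE_LABEL]
    selected = [[feature[i] for i in idx] for feature in features]
    return idx, selected


# Toy transcript "GCTGATG": one label and one k-mer per nucleotide position
classes = [-100, 1, -100, -100, 0, -100, -100]
kmers = ["GCT", "CTG", "TGA", "GAT", "ATG", "PAD", "PAD"]
positions = [500, 501, 502, 503, 504, 505, 506]  # hypothetical coordinates

idx, (starts, labels, coords) = select_labeled(classes, kmers, classes, positions)
print(idx, starts, labels, coords)  # [1, 4] ['CTG', 'ATG'] [1, 0] [501, 504]
```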

- Explode the dataset based on the masked sequence-level features, so that each row corresponds to a single candidate start codon.

In [10]:
ds = ds.explode(['SeqKmers', 'Classes', 'SeqEnum', 'ClsIdx'])

In [11]:
ds = ds[~ds['ClsIdx'].isna()].copy()

In [12]:
ds.SeqKmers.value_counts()

CTG    511620
AGG    432923
AAG    326845
GTG    324174
TTG    254256
ATT    190227
ATC    170906
ATG    167271
ACG    124085
ATA    111804
Name: SeqKmers, dtype: int64

- Label dataset for each valid codon based on GeneID

In [13]:
labels = pd.read_csv(DS_LABELS, sep='\t')
ds = ds.merge(labels, on='GeneID', how='left')

- Slice and pad sequences and signal around the start codon position

In [14]:
ds['Seq'] = [
    ''.join(pad_and_slice_around(list(s), i, RNA_SIDE, constant_values='X'))
    for i, s in tqdm(ds[['ClsIdx', 'Seq']].itertuples(index=False), total=len(ds))]


In [15]:
ds['Signal'] = [
    pad_and_slice_around(list(s), i, SIG_SIDE, constant_values=0.0)
    for i, s in tqdm(ds[['ClsIdx', 'Signal']].itertuples(index=False), total=len(ds))]


In [16]:
len(ds)

2614111

- Drop duplicates: entries from different transcripts that share the same sequence window around the same start codon position

In [17]:
ds = ds.drop_duplicates(['Seq', 'SeqEnum', 'Chrom', 'Strand', 'Dataset'])

In [18]:
len(ds)

1858737

In [19]:
ds = ds.rename(columns={'SeqKmers': 'Start'})

- We'll dump classes as a separate table for convenience

In [20]:
ds[['Classes']].to_csv(DATA_PATHS['labels'], index=False)

- Stack all experimental signals into a single matrix and transform it by capping at 5000 and linearly scaling between 0 and 10.

In [21]:
signals = np.vstack([x for x in ds['Signal']])
prep_signal(signals).to_csv(DATA_PATHS['signal'], index=False)
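
The capping and scaling can be followed on a tiny example. The sketch below uses plain lists instead of NumPy, and the helper name `cap_and_scale` is hypothetical. As in `prep_signal`, the minimum and maximum are taken over the whole matrix rather than per row; the check for a constant signal is an addition that the notebook's `scale` does not have.

```python
def cap_and_scale(rows, cap_max=5000.0, lo=0.0, hi=10.0):
    """Cap values at cap_max, then min-max scale the whole matrix into [lo, hi]."""
    capped = [[min(v, cap_max) for v in row] for row in rows]
    flat = [v for row in capped for v in row]
    vmin, vmax = min(flat), max(flat)
    if vmax == vmin:  # added guard: a constant signal cannot be rescaled
        raise ValueError("signal is constant")
    return [[lo + (hi - lo) * (v - vmin) / (vmax - vmin) for v in row]
            for row in capped]


signal = [[0.0, 2500.0, 12000.0],
          [500.0, 5000.0, 0.0]]
print(cap_and_scale(signal))  # [[0.0, 5.0, 10.0], [1.0, 10.0, 0.0]]
```

The value 12000 is first capped to 5000 and therefore maps to the upper bound of 10.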

In [22]:
del signals

In [23]:
with DATA_PATHS['seq_raw'].open('w') as f:
    print('Seq', file=f)
    for s in tqdm(ds['Seq'], total=len(ds)):
        print(s, file=f)


- Dump metadata that we'll use later for predictions and validation

In [24]:
ds[
    ['GeneID', 'TranscriptID', 'Chrom', 'Strand', 
     'SeqEnum', 'Start', 'Dataset']
].to_csv(DATA_PATHS['meta'], index=False)

## Prep sequences

- Here, we'll restart the notebook and run only the first 7 cells.
- Then we'll encode each sequence using a one-hot approach.

In [8]:
df_seq = pd.read_csv(DATA_PATHS['seq_raw'])

In [9]:
prep_rna(df_seq).to_csv(DATA_PATHS['seq_rna'], index=False)
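
The layout of the one-hot features can be sketched in plain Python as follows. The helpers `one_hot` and `feature_names` are hypothetical; they follow the same ordering as `encode_one_hot` (positions in the outer loop, nucleotides in the inner loop) and the same `r_` prefix. One difference to note: this sketch encodes any symbol outside `ACGT` as all zeros, whereas the notebook only maps `X` and `N` that way.

```python
def one_hot(seq, chars="ACGT"):
    """Concatenate one indicator vector per symbol; unknown symbols become all zeros."""
    vec = []
    for base in seq:
        vec.extend(1 if base == c else 0 for c in chars)
    return vec


def feature_names(length, prefix="r_", chars="ACGT"):
    """Column names for a window of odd `length`, indexed relative to its center."""
    mid = length // 2
    return [f"{prefix}{pos}_{c}" for pos in range(-mid, mid + 1) for c in chars]


window = "XAT"  # toy window: padding, then A at the center, then T
print(one_hot(window))                  # [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
print(feature_names(len(window))[4:8])  # ['r_0_A', 'r_0_C', 'r_0_G', 'r_0_T']
```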

## Prep gene-level features

- Per-cell-type expression levels are standardized and merged on GeneID

In [10]:
meta = pd.read_csv(DATA_PATHS['meta'])

In [11]:
prep_atlas_data(
    meta, DATA_PATHS['rna_cell_inp'], ['Cell type']
).round(5).to_csv(DATA_PATHS['rna_cell'], index=False)
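
The standardization applied to the expression values is a z-score. A minimal standard-library sketch is given below; the input values are made up. It uses the population standard deviation, which is what `np.std` computes by default.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Z-score: subtract the mean and divide by the population standard deviation."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


ntpm = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values: mean 5, std 2
print(standardize(ntpm))  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Note that `prep_atlas_data` standardizes the whole `nTPM` column before pivoting, so the statistics are computed across all genes and cell types together.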

In [12]:
# prep_atlas_data(
#     ds, DATA_PATHS['rna_tissue_inp'], ['Tissue']
# ).to_csv(DATA_PATHS['rna_tissue'], index=False)

## Prep xy

- Merge the data into a single table. We'll load the data using sizeable chunks and save the training/testing/validation folds.

In [14]:
for ds in tqdm(['Train', 'Val', 'Test']):
    xs, ms = map(pd.concat, unzip(load_chunks(50000, [ds])))
    ys = ms[['y_true']]
    name = ds.lower()
    xs.to_csv(DATA_PATHS[f'{name}_x'], index=False)
    ys.to_csv(DATA_PATHS[f'{name}_y'], index=False)


## Optimize params

In [7]:
train = XY(
    pd.read_csv(DATA_PATHS['train_x']), 
    pd.read_csv(DATA_PATHS['train_y']))
val = XY(
    pd.read_csv(DATA_PATHS['val_x']), 
    pd.read_csv(DATA_PATHS['val_y']))

In [8]:
def objective(trial, train=train, val=val):
    params = {
        'grow_policy': trial.suggest_categorical('grow_policy', ['lossguide']),
        'learning_rate': trial.suggest_float('learning_rate', 1e-2, 2),
        'max_depth': trial.suggest_int('max_depth', 4, 20),
        'gamma': trial.suggest_float('gamma', 1e-2, 2.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0, 20.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 0, 20.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.1, 1.0),
        'scale_pos_weight': trial.suggest_float('scale_pos_weight', 1.0, 20.0),
    }
    model = XGBClassifier(
        **params, objective='binary:logistic', tree_method='gpu_hist',
        gpu_id='1',
        early_stopping_rounds=10, n_jobs=-1)
    model.fit(
        train.x.values, train.y.values,
        eval_set=[(val.x.values, val.y.values)],
        verbose=False)
    y_pred = model.predict(val.x)
    score = f1_score(val.y, y_pred)
    return score

In [9]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

[I 2022-10-19 10:13:04,657] A new study created in memory with name: no-name-47795dca-c848-4088-aa95-a934b40086e8
[I 2022-10-19 10:13:15,593] Trial 0 finished with value: 0.5859065716547902 and parameters: {'grow_policy': 'lossguide', 'learning_rate': 1.4333147796033734, 'max_depth': 16, 'gamma': 1.456887486756961, 'reg_lambda': 4.343441428512205, 'reg_alpha': 18.87490044289007, 'colsample_bytree': 0.5930214226127545, 'colsample_bylevel': 0.43924726554219595, 'scale_pos_weight': 19.240142988514062}. Best is trial 0 with value: 0.5859065716547902.
[I 2022-10-19 10:13:22,449] Trial 1 finished with value: 0.585209003215434 and parameters: {'grow_policy': 'lossguide', 'learning_rate': 1.6081560824429457, 'max_depth': 10, 'gamma': 1.2661162484795707, 'reg_lambda': 18.725049515380377, 'reg_alpha': 0.8034072764836298, 'colsample_bytree': 0.6737178419831429, 'colsample_bylevel': 0.58996071887815, 'scale_pos_weight': 17.720787826200667}. Best is trial 0 with v

In [10]:
study.best_params

{'grow_policy': 'lossguide',
 'learning_rate': 0.09936924483593312,
 'max_depth': 8,
 'gamma': 1.477343705723823,
 'reg_lambda': 2.2045941577676516,
 'reg_alpha': 17.903008925728273,
 'colsample_bytree': 0.5943715796308908,
 'colsample_bylevel': 0.7329001180439547,
 'scale_pos_weight': 10.655241464899479}

## Cross-validate

In [7]:
def generate_fold_idx(df: pd.DataFrame, on: str = 'GeneID', frac: float = 0.1):
    vs = df[on].unique()
    np.random.shuffle(vs)
    chunk_size = int(len(vs) * frac)
    chunks = list(chunked(vs, chunk_size, strict=False))
    for i in range(len(chunks)):
        test_vs = chunks[i]
        train_vs = np.concatenate(
            [chunks[j] for j in range(len(chunks)) if i != j])
        yield df[on].isin(train_vs).values, df[on].isin(test_vs).values


def fit_and_predict(train_x, train_y, idx_train, idx_test, i_fold, metadata, params, frac=0.1):

    train_fold_x, train_fold_y = train_x[idx_train], train_y[idx_train]
    test_fold_x, test_fold_y = train_x[idx_test], train_y[idx_test]

    # Hold out roughly `frac` of the training fold as the early-stopping evaluation set
    idx_eval = np.random.binomial(1, frac, len(train_fold_x)).astype(bool)

    train_sub_x, train_sub_y = train_fold_x[~idx_eval], train_fold_y[~idx_eval]
    eval_sub_x, eval_sub_y = train_fold_x[idx_eval], train_fold_y[idx_eval]

    classifier = XGBClassifier(**params)
    classifier.fit(train_sub_x, train_sub_y, eval_set=[(eval_sub_x, eval_sub_y)])
    y_prob = classifier.predict_proba(test_fold_x)
    metadata.loc[idx_test, 'fold'] = i_fold
    metadata.loc[idx_test, 'y_prob'] = y_prob[:, 1]
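
The gene-wise splitting can be sketched without pandas to show its key property: all rows of a gene end up on the same side of the split. The helper `gene_folds` is hypothetical and uses a seeded `random.Random` rather than NumPy's global generator; the gene identifiers are made up.

```python
import random


def gene_folds(gene_ids, frac=0.1, seed=0):
    """Yield (train_mask, test_mask) pairs that keep all rows of a gene together."""
    genes = sorted(set(gene_ids))
    random.Random(seed).shuffle(genes)
    size = max(1, int(len(genes) * frac))
    for start in range(0, len(genes), size):
        held_out = set(genes[start:start + size])
        test_mask = [g in held_out for g in gene_ids]
        train_mask = [not t for t in test_mask]
        yield train_mask, test_mask


gene_ids = [f"G{i % 25}" for i in range(100)]  # 25 made-up genes, 4 rows each
folds = list(gene_folds(gene_ids))
print(len(folds))  # 13: twelve chunks of 2 genes plus one leftover gene
```

As the example shows, when the number of genes is not a multiple of the chunk size, the leftover genes form an extra fold, so the fold count can exceed `1 / frac`.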

In [8]:
best_params = {
    'grow_policy': 'lossguide',
    'learning_rate': 0.09936924483593312,
    'max_depth': 8,
    'gamma': 1.477343705723823,
    'reg_lambda': 2.2045941577676516,
    'reg_alpha': 17.903008925728273,
    'colsample_bytree': 0.5943715796308908,
    'colsample_bylevel': 0.7329001180439547,
    'scale_pos_weight': 10.655241464899479,
    'n_estimators': 10000,
    'tree_method': 'gpu_hist',
    'early_stopping_rounds': 10,
    'gpu_id': '1'
}

In [9]:
xs, meta = map(pd.concat, unzip(load_chunks(100000, ['Train', 'Val'])))
ys = meta[['y_true']]


In [None]:
for i, (idx_train, idx_test) in tqdm(
    enumerate(generate_fold_idx(meta), start=1), desc='Fitting models', total=10
):
    fit_and_predict(xs, ys, idx_train, idx_test, i, meta, best_params)


356258 39549 35348
Fitting the model
[0]	validation_0-logloss:0.61558
[1]	validation_0-logloss:0.54505
[2]	validation_0-logloss:0.48637
[3]	validation_0-logloss:0.43698
[4]	validation_0-logloss:0.39429
[5]	validation_0-logloss:0.35759
[6]	validation_0-logloss:0.32996
[7]	validation_0-logloss:0.30146
[8]	validation_0-logloss:0.27869
[9]	validation_0-logloss:0.25651
[10]	validation_0-logloss:0.23687
[11]	validation_0-logloss:0.21923
[12]	validation_0-logloss:0.20504
[13]	validation_0-logloss:0.19090
[14]	validation_0-logloss:0.17834
[15]	validation_0-logloss:0.16830
[16]	validation_0-logloss:0.15958
[17]	validation_0-logloss:0.15019
[18]	validation_0-logloss:0.14269
[19]	validation_0-logloss:0.13516
[20]	validation_0-logloss:0.12913
[21]	validation_0-logloss:0.12275
[22]	validation_0-logloss:0.11689
[23]	validation_0-logloss:0.11168
[24]	validation_0-logloss:0.10709
[25]	validation_0-logloss:0.10307
[26]	validation_0-logloss:0.09919
[27]	validation_0-logloss:0.09562
[28]	validation_0-log

In [None]:
meta['fold'] = meta['fold'].astype(int)
meta.head()

In [None]:
meta = meta.groupby(
    ['Chrom', 'Strand', 'SeqEnum', 'Start', 'y_true', 'fold'],
    as_index=False
).agg(
    {
        'GeneID': lambda vs: ';'.join(vs),
        'TranscriptID': lambda vs: ';'.join(vs),
        'y_prob': 'mean'
    }
)

In [None]:
meta.to_csv(DATA_PATHS['cv'], index=False)

In [None]:
def unravel_scores(metadata, ts = 0.5, group_vs=['fold', 'Start']):
    for g, gg in metadata.groupby(group_vs):
        y_pred = np.zeros(len(gg), dtype=int)
        y_pred[gg['y_prob'] >= ts] = 1
        # print(g, gg, y_pred, sep='\n')
        num_pos = gg['y_true'].sum()
        num_neg = len(gg) - num_pos
        s = compute_scores(y_pred, gg['y_true'].values)
        if not isinstance(g, tuple):
            g = (g, )
        yield *g, len(gg), num_pos, num_neg, *iter(s.values())

In [None]:
scores_cv = pd.DataFrame(
    unravel_scores(meta),
    columns=['Fold', 'Start', 'Size', 'NumPos',
             'NumNeg', 'F1', 'PRC', 'REC', 'BAC'])

In [None]:
scores_cv_agg = scores_cv[
    scores_cv.NumPos != 0
].groupby(
    ['Start']
).agg(
    {'Size': 'sum', 'NumPos': 'sum', 'NumNeg': 'sum',
     'F1': 'mean', 'PRC': 'mean', 'REC': 'mean'}
)

In [21]:
scores_cv.to_csv(DATA / 'cv_scores.csv', index=False)
scores_cv_agg.to_csv(DATA / 'cv_scores_agg.csv')

## Retrain and save model

In [None]:
best_params = {
    'grow_policy': 'lossguide',
    'learning_rate': 0.09936924483593312,
    'max_depth': 8,
    'gamma': 1.477343705723823,
    'reg_lambda': 2.2045941577676516,
    'reg_alpha': 17.903008925728273,
    'colsample_bytree': 0.5943715796308908,
    'colsample_bylevel': 0.7329001180439547,
    'scale_pos_weight': 10.655241464899479,
    'n_estimators': 10000,
    'tree_method': 'gpu_hist',
    'early_stopping_rounds': 10,
    'gpu_id': '1'
}

In [8]:
train_x = pd.read_csv(DATA_PATHS['train_x'])
train_y = pd.read_csv(DATA_PATHS['train_y'])
val_x = pd.read_csv(DATA_PATHS['val_x'])
val_y = pd.read_csv(DATA_PATHS['val_y'])
test_x = pd.read_csv(DATA_PATHS['test_x'])
test_y = pd.read_csv(DATA_PATHS['test_y'])

In [9]:
len(train_x) + len(val_x) + len(test_x)

437970

In [None]:
classifier = XGBClassifier(**best_params)
classifier.fit(train_x, train_y, eval_set=[(val_x, val_y)])

In [None]:
joblib.dump(classifier, DATA / 'XGBoost_model.joblib')

## Predict 5'UTRs

In [None]:
def load_chunks(size):
    
    chunks = (
        pd.read_csv(DATA_PATHS[n], chunksize=size) for n in 
        ['rna_cell', 'signal', 'seq_rna', 'labels', 'meta'])
    
    for i, (cell_chunk, signal_chunk, seq_chunk, ys_chunk, meta_chunk) in enumerate(zip(*chunks), start=1):
        ds_x = pd.concat([signal_chunk, seq_chunk, cell_chunk], axis=1)
        meta_chunk['y_true'] = ys_chunk['Classes'].values
        yield ds_x, meta_chunk
        
def calc_pred_scores(df, threshold=0.5):
    y_prob = df['y_prob'].values
    y_pred = (df['y_prob'].values > threshold).astype(int)
    y_true = df['y_true'].values
    fn, fp, tn, tp = map(
        lambda x: len(df[df.PredictionType == x]), 
        ['FN', 'FP', 'TN', 'TP'])
    return {
        'f1': f1_score(y_true, y_pred, zero_division=0), 
        'prc': precision_score(y_true, y_pred, zero_division=0), 
        'rec': recall_score(y_true, y_pred, zero_division=0),
        'bac': balanced_accuracy_score(y_true, y_pred),
        'roc_auc': roc_auc_score(y_true, y_prob),
        'FN': fn, 'FP': fp, 'TN': tn, 'TP': tp,
    }

def annotate_predictions(df):
    df = df.copy()
    df.loc[(df.y_true == 1) & (df.y_pred == 1), 'PredictionType'] = 'TP'
    df.loc[(df.y_true == 1) & (df.y_pred == 0), 'PredictionType'] = 'FN'
    df.loc[(df.y_true == 0) & (df.y_pred == 1), 'PredictionType'] = 'FP'
    df.loc[(df.y_true == 0) & (df.y_pred == 0), 'PredictionType'] = 'TN'
    return df

def score(df, threshold):
    df = df.copy()
    df['y_true'] = df['y_true'].astype(int)
    scores = {
        codon: calc_pred_scores(group, threshold) 
        for codon, group in df.groupby('Start')}
    scores['All'] = calc_pred_scores(df, threshold)
    return scores

def agg_y_true(vs):
    if len(vs) == 1:
        return vs.iloc[0]  # return the single label as a scalar rather than a one-element Series
    s = set(vs)
    if len(s) > 1:
        return ';'.join(map(str, vs))
    return s.pop()

def unravel_scores(scores):
    for ds_name, ds_vs in scores.items():
        for codon_name, codon_scores in ds_vs.items():
            for score_name, score_val in codon_scores.items():
                yield ds_name, codon_name, score_name, score_val
                
def get_color(y_pred, y_true, dataset):
    green, blue, red, black = (
        '0,255,0', '0,0,255', '255,0,0', '0,0,0')
    if dataset == 'Inference':
        if y_pred == 1:
            return green
        return blue
    if y_pred == 1 and y_true == 1:
        return green
    if y_pred == 0 and y_true == 0:
        return blue
    if y_pred == 0 and y_true == 1:
        return red
    return black
                
def wrap_row(row):
    label = row.Dataset
    color = get_color(row.y_pred, row.y_true, row.Dataset)
    start = row.SeqEnum if row.Strand == '+' else row.SeqEnum - 2
    end = start + 3
    return (f'{row.Chrom} {start} {end} {label} '
            f'{int(row.y_prob * 100)} {row.Strand} {start} {end} {color}')

def pred2bed(df, out_path):
    with open(out_path, 'w') as f:
        print('track name="XGBoost predictions v4.7" '
              'itemRgb="On"', file=f)
        for _, row in tqdm(df.iterrows()):
            print(wrap_row(row), file=f)
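
To show what a single exported line looks like, here is a standalone sketch that mirrors `wrap_row` for one made-up prediction. The `Row` tuple and `bed_line` helper are hypothetical. The nine fields follow the BED9 column order: chrom, chromStart, chromEnd, name, score, strand, thickStart, thickEnd, itemRgb.

```python
from collections import namedtuple

Row = namedtuple("Row", "Chrom SeqEnum Strand Dataset y_prob")


def bed_line(row, color="0,255,0"):
    """Format one prediction as a nine-field BED line spanning the 3-nt start codon."""
    # On the minus strand the codon extends towards smaller coordinates
    start = row.SeqEnum if row.Strand == "+" else row.SeqEnum - 2
    end = start + 3
    score = int(row.y_prob * 100)
    return (f"{row.Chrom} {start} {end} {row.Dataset} "
            f"{score} {row.Strand} {start} {end} {color}")


print(bed_line(Row("chr1", 1000, "-", "Test", 0.75)))
# chr1 998 1001 Test 75 - 998 1001 0,255,0
```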

In [None]:
# classifier = joblib.load(DATA / 'XGBoost_model.joblib')

In [None]:
results = []
chunks_iter = load_chunks(100000)
for ds_x, meta in tqdm(chunks_iter):
    y_prob = classifier.predict_proba(ds_x)[:, 1]
    meta['y_prob'] = y_prob
    results.append(meta)

In [None]:
df_pred = pd.concat(results)

In [None]:
df_pred.loc[df_pred.Dataset.isna(), 'Dataset'] = 'Inference'

In [None]:
df_pred.loc[df_pred.Dataset == 'Inference', 'y_true'] = -1

In [None]:
df_pred = df_pred.groupby(
        ['Chrom', 'Strand', 'Start', 'SeqEnum', 'Dataset'], 
        as_index=False
    ).agg(
    {
        'GeneID':  lambda vs: ';'.join(sorted(set(vs))),
        'TranscriptID': lambda vs: ';'.join(sorted(set(vs))),
        'y_true': agg_y_true,
        'y_prob': 'mean',
})

In [None]:
df_pred['y_pred'] = (df_pred['y_prob'] > 0.5).astype(int)

In [None]:
df_pred = annotate_predictions(df_pred)

In [None]:
df_pred.shape

In [None]:
df_pred.y_true.value_counts()

In [40]:
scores = {ds_name: score(df_pred[(df_pred.Dataset == ds_name)], 0.5) for ds_name in ('Train', 'Val', 'Test')}

In [42]:
df_scores = pd.DataFrame(
    unravel_scores(scores), 
    columns=['Dataset', 'Codon', 'ScoreType', 'ScoreVal']
).round(2).pivot(
    index=['Dataset', 'Codon'], columns='ScoreType', values='ScoreVal'
)

for c in ['FN', 'FP', 'TN', 'TP']:
    df_scores[c] = df_scores[c].astype(int)

df_scores['P'] = df_scores['TP'] + df_scores['FN']

df_scores = df_scores.reset_index().sort_values(
    ['Dataset', 'P'], ascending=[True, False]
).set_index(['Dataset', 'Codon'])[[
    'f1', 'prc', 'rec', 'bac', 'TN', 'FN', 'FP', 'TP', 'P'
]]

df_scores

| Dataset | Codon | f1 | prc | rec | bac | TN | FN | FP | TP | P |
|---|---|---|---|---|---|---|---|---|---|---|
| Test | All | 0.71 | 0.66 | 0.77 | 0.88 | 30037 | 133 | 235 | 451 | 584 |
| Test | CTG | 0.73 | 0.67 | 0.79 | 0.89 | 5600 | 40 | 75 | 154 | 194 |
| Test | ATG | 0.76 | 0.7 | 0.82 | 0.89 | 1887 | 31 | 59 | 139 | 170 |
| Test | GTG | 0.66 | 0.61 | 0.72 | 0.85 | 3759 | 24 | 39 | 61 | 85 |
| Test | ACG | 0.69 | 0.65 | 0.73 | 0.86 | 1400 | 12 | 17 | 32 | 44 |
| Test | TTG | 0.66 | 0.62 | 0.7 | 0.85 | 3042 | 13 | 19 | 31 | 44 |
| Test | ATC | 0.8 | 0.8 | 0.8 | 0.9 | 1918 | 5 | 5 | 20 | 25 |
| Test | ATT | 0.55 | 0.44 | 0.73 | 0.86 | 2347 | 4 | 14 | 11 | 15 |
| Test | ATA | 0.55 | 0.43 | 0.75 | 0.87 | 1307 | 1 | 4 | 3 | 4 |
| Test | AAG | 0.0 | 0.0 | 0.0 | 0.5 | 3762 | 2 | 1 | 0 | 2 |
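
As a consistency check, the reported scores can be recomputed from the confusion-matrix counts. The sketch below does this for the `Test`/`All` row using the standard definitions of precision, recall, F1 and balanced accuracy; the helper name `scores_from_counts` is hypothetical.

```python
def scores_from_counts(tp, fp, fn, tn):
    """Precision, recall, F1 and balanced accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {"f1": f1, "prc": precision, "rec": recall,
            "bac": (recall + specificity) / 2}


row_all = scores_from_counts(tp=451, fp=235, fn=133, tn=30037)
print({name: round(value, 2) for name, value in row_all.items()})
# {'f1': 0.71, 'prc': 0.66, 'rec': 0.77, 'bac': 0.88}
```

The high balanced accuracy relative to F1 reflects the class imbalance: specificity is close to 1 because true negatives vastly outnumber false positives.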


In [43]:
df_pred.to_csv(DATA / 'predictions_5UTR_v4.7.csv', index=False)

In [44]:
df_scores.to_csv(DATA / 'prediction_scores_v4.7.tsv', sep='\t')

In [45]:
pred2bed(df_pred, DATA / 'predictions_5UTR_v4.7.bed')


In [46]:
print(df_scores.to_latex())

\begin{tabular}{llrrrrrrrrr}
\toprule
    & ScoreType &    f1 &   prc &   rec &   bac &      TN &   FN &    FP &    TP &     P \\
Dataset & Codon &       &       &       &       &         &      &       &       &       \\
\midrule
Test & All &  0.71 &  0.66 &  0.77 &  0.88 &   30037 &  133 &   235 &   451 &   584 \\
    & CTG &  0.73 &  0.67 &  0.79 &  0.89 &    5600 &   40 &    75 &   154 &   194 \\
    & ATG &  0.76 &  0.70 &  0.82 &  0.89 &    1887 &   31 &    59 &   139 &   170 \\
    & GTG &  0.66 &  0.61 &  0.72 &  0.85 &    3759 &   24 &    39 &    61 &    85 \\
    & ACG &  0.69 &  0.65 &  0.73 &  0.86 &    1400 &   12 &    17 &    32 &    44 \\
    & TTG &  0.66 &  0.62 &  0.70 &  0.85 &    3042 &   13 &    19 &    31 &    44 \\
    & ATC &  0.80 &  0.80 &  0.80 &  0.90 &    1918 &    5 &     5 &    20 &    25 \\
    & ATT &  0.55 &  0.44 &  0.73 &  0.86 &    2347 &    4 &    14 &    11 &    15 \\
    & ATA &  0.55 &  0.43 &  0.75 &  0.87 &    1307 &    1 &     4 &     3 &    

In future versions `DataFrame.to_latex` is expected to utilise the base implementation of `Styler.to_latex` for formatting and rendering. The arguments signature may therefore change. It is recommended instead to use `DataFrame.style.to_latex` which also contains additional functionality.
