# Base dataset preparation

This notebook prepares the "base" dataset, i.e., a collection of properly annotated 5'UTRs.
One can use this dataset with `uBERTa_loader` to prepare the training/validation/testing data for the model.

In [1]:
from collections import Counter
from itertools import chain, starmap
from pathlib import Path

import numpy as np
import pandas as pd
from more_itertools import sliding_window
from toolz import curry
from tqdm import tqdm
from uBERTa.base import VALID_START
from uBERTa.utils import Ref, pBWs, reverse_complement

In [2]:
np.random.seed(666)

In [3]:
DATA = Path('../data')
DATA.mkdir(exist_ok=True)

## Expected outputs
- DS_BASE_v4.7_seqs.tsv -- base dataset of annotated 5'UTRs with sequence-level features merged transcript-wise
- DS_BASE_v4.7_kmers.tsv -- k-merized 5'UTRs with per-position signal and matched curated start codons
- dataset_labeling_v4.7.tsv -- labeling of the analyzed genes as training, validation, and testing

## Initial data

This notebook requires:
- hg38 reference genome
- ribo-seq experimental signal from GWIPS-viz (we use P-site identification experiments only)
- our hand-crafted dataset as a bed file
- 5'UTR regions: provided with the rest of the data; also can be obtained via R package [ORFik](https://bioconductor.org/packages/release/bioc/html/ORFik.html)
- A mapping between Ensembl gene IDs and OMIM gene names
- A list of analyzed genes

Download the data and unpack it into the `DATA` dir initialized above.

There are two archives you'll need:
1. [prepare_base_dataset.zip](https://drive.google.com/file/d/1phgab69jDsvgMqOeGUp9mGdDDRU12pk_/view?usp=sharing) (riboseq data, 5'UTRs, etc.) 
2. [hg38.fa.tar.gz](https://drive.google.com/file/d/1obZdHGf06FFeGw7PCDh5gvMtRjHLKMHu/view?usp=sharing) (reference genome in FASTA format and its index)

Download both of them and unpack into `DATA`. The expected structure of the `DATA` after this would be:

```
|____ENSG2OMIM.csv                                                                         
|____List_of_analysed_unique_genes.csv
|____hg38.fa
|____hg38.fa.fai
|____All_starts_AD+AR_v4.7.bed
|____p-sites
| |____Raj16_All.RiboProInit.bw
| |____Fijalkowska17_All.RiboProInit.bw
| |____Gao14_All.RiboProInit.bw
| |____Ji15_All.RiboProInit.bw
| |____Zhang17_All.RiboProInit.bw
| |____Chen20_All.RiboProInit.bw
| |____Gawron16_All.RiboProInit.bw
| |____Crappe15_All.RiboProInit.bw
|____uORF_search_space
| |____uORF_search_space_cage_genes.gff
| |____uORF_search_space_cage_transcripts.gff
```

For instance, starting from the project's root:

1. Download the data

```bash
gdown --fuzzy 'https://drive.google.com/file/d/1obZdHGf06FFeGw7PCDh5gvMtRjHLKMHu/view?usp=sharing'
gdown --fuzzy 'https://drive.google.com/file/d/1phgab69jDsvgMqOeGUp9mGdDDRU12pk_/view?usp=sharing'
```

2. Unpack downloads

```bash
unzip prepare_base_dataset.zip
rm prepare_base_dataset.zip

for path in *.tar.gz; do
    tar -xzf "$path"
done

rm *.tar.gz

ls
```

->

```bash
All_starts_AD+AR_v4.7.bed  ENSG2OMIM.csv  hg38.fa  hg38.fa.fai  List_of_analysed_unique_genes.csv  p-sites  uORF_search_space
```

In [4]:
# Reference genome to fetch sequence regions
ref = Ref(DATA / 'hg38.fa')
# Experimental ribo-seq signal as a collection of BigWig files queried simultaneously
pbws = pBWs(Path(DATA / 'p-sites').glob('*'))

## Parse the hand-crafted dataset

- Convert the hand-crafted dataset into a dataframe
- Validate annotation correctness
- Fetch start codon sequences
- Filter invalid start codons (should be removed in the final version, but exercising some caution never hurts)
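
As a minimal sketch of the validation step, assume the annotation column packs five dash-separated fields in the order used by `parse_hand_crafted` below (group, gene, start codon, Kozak score, level). The annotation strings and the helper name `split_annotation` are made up for illustration:

```python
# Field order follows the column names used in this notebook.
FIELDS = ('Group', 'Gene', 'StartCodon', 'KozakScore', 'Level')

def split_annotation(ann):
    """Return a dict of fields, or None if the annotation is malformed."""
    parts = ann.split('-')
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))

# Hypothetical records, not taken from the real bed file.
good = split_annotation('u-RAB39B-ATC-0.5-88')
bad = split_annotation('u-HLA-A-ATG-0.5-10')  # hyphenated gene name: 6 parts

print(good)
print(bad)
```

A gene symbol containing a hyphen would fail this check, which is one reason to inspect the rejected rows (`ds_inv`) rather than drop them silently.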

In [5]:
def parse_hand_crafted(path):
    """
    Convert a bed file with hand-crafted uORFs to a Pandas dataframe
    """
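    # `skipfooter` is only supported by the Python parsing engine
    df = pd.read_csv(
        path, sep=r'\s+', skiprows=1, skipfooter=1, engine='python',
        names=['Chrom', 'Start', 'End', 'Ann', 'X', 'Strand'])
    # A valid annotation has exactly five dash-separated fields
    valid_ann_idx = np.array(
        [len(x.split('-')) == 5 for x in df['Ann']])
    df_invalid = df[~valid_ann_idx]
    # Copy to avoid assigning new columns to a view of the original frame
    df = df[valid_ann_idx].copy()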
    df[['Group', 'Gene', 'StartCodon', 'KozakScore', 'Level']] = [
        x.split('-') for x in df['Ann']]
    return df, df_invalid

In [6]:
ds, ds_inv = parse_hand_crafted(
    DATA / 'All_starts_AD+AR_v4.7.bed'
)
ds = ds.drop(
    columns=['Ann', 'X', 'KozakScore']
).sort_values(
    ['Chrom', 'Start']
)

print(len(ds), len(ds_inv))
ds.tail()

7741 0


| | Chrom | Start | End | Strand | Group | Gene | StartCodon | Level |
|---|---|---|---|---|---|---|---|---|
| 5398 | chrX | 155264457 | 155264460 | - | u | RAB39B | ATC | 88 |
| 5400 | chrX | 155264493 | 155264496 | - | u | RAB39B | ATG | 113 |
| 6802 | chrX | 155545273 | 155545276 | - | m | TMLHE | ATG | 0 |
| 6800 | chrX | 155612830 | 155612833 | - | u | TMLHE | CTG | 247 |
| 6801 | chrX | 155612861 | 155612864 | - | u | TMLHE | CTG | 127 |


In [7]:
ds['StartCodonFetched'] = [
    ref.fetch(*x[1:]).upper() for x in
    tqdm(ds[['Chrom', 'Start', 'End']].itertuples(), total=len(ds))]

neg_idx = ds['Strand'] == '-'
ds.loc[neg_idx, 'StartCodonFetched'] = ds.loc[
    neg_idx, 'StartCodonFetched'].apply(reverse_complement)

100%|██████████| 7741/7741 [00:00<00:00, 215928.62it/s]


In [8]:
print(f'Initial {len(ds)}')
ds = ds[ds.StartCodon == ds.StartCodonFetched]
print(f'Matching start codons: {len(ds)}')
ds = ds.drop(columns=['StartCodonFetched'])
ds = ds[ds.StartCodon.isin(VALID_START)]
print(f'Supported start codons: {len(ds)}')

Initial 7741
Matching start codons: 7725
Supported start codons: 7719


Map gene names to Ensembl IDs

In [9]:
name2id = dict(
    x[1:] for x in
    pd.read_csv(DATA / 'ENSG2OMIM.csv', sep=';').itertuples())
ds['GeneID'] = ds['Gene'].map(name2id.get)

In [10]:
print(f'Final ds size: {len(ds)}')
ds.head()

Final ds size: 7719


| | Chrom | Start | End | Strand | Group | Gene | StartCodon | Level | GeneID |
|---|---|---|---|---|---|---|---|---|---|
| 3336 | chr1 | 1013523 | 1013526 | + | u | ISG15 | TTG | 770 | ENSG00000187608 |
| 3337 | chr1 | 1013546 | 1013549 | + | ma | ISG15 | GTG | 556 | ENSG00000187608 |
| 3338 | chr1 | 1013573 | 1013576 | + | m | ISG15 | ATG | 0 | ENSG00000187608 |
| 224 | chr1 | 1020172 | 1020175 | + | m | AGRN | ATG | 174 | ENSG00000188157 |
| 1990 | chr1 | 1349062 | 1349065 | - | m | DVL1 | ATG | 379 | ENSG00000107404 |


## Parse 5'UTR regions

- Parse GFF files with 5'UTR coordinates obtained externally using ORFik
- Fetch and enumerate sequences
- Fetch experimental signal for 5'UTRs
- Concatenate data transcript-wise
- Handle strand direction
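
The attribute column of these GFF files is read as `key=value` pairs separated by semicolons. A minimal sketch of that parsing step, assuming GFF3-style attributes; the attribute string and the exon ID in it are hypothetical, and `parse_gff_attributes` is an illustrative name rather than part of `uBERTa`:

```python
def parse_gff_attributes(anno):
    """Parse a 'key=value;key=value' attribute string into a dict."""
    # Split on the first '=' only, so values may themselves contain '='.
    pairs = (item.split('=', 1) for item in anno.split(';') if item)
    return {key: value for key, value in pairs}

# Hypothetical attribute string; real files may carry more keys.
anno = 'ID=exon:1;Name=ENST00000000233;exon_name=ENSE00000000001'
attrs = parse_gff_attributes(anno)
print(attrs['Name'], attrs['exon_name'])
```

Splitting on the first `=` only and skipping empty items makes the parser tolerant of trailing semicolons and of values containing `=`.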

In [11]:
def get_anno_value(field_name, anno):
    _anno = dict(x.split('=') for x in anno.split(';'))
    return _anno[field_name]

def parse_granges_gff(path):
    allowed_chr = list(map(str, range(1, 23))) + ['X', 'Y']

    df_gff = pd.read_csv(
        path, usecols=[0, 3, 4, 6, 8], sep=r'\s+', low_memory=False,
        names=['Chrom', 'Start', 'End', 'Strand', 'Anno'], skiprows=3)
    print(f'Initial size: {len(df_gff)}')
    df_gff = df_gff[df_gff.Anno.apply(lambda x: 'exon' in x)]
    print(f'Filtered to exons: {len(df_gff)}')
    df_gff = df_gff[df_gff.Chrom.isin(allowed_chr)]
    print(f'Filtered to canonical chromosomes: {len(df_gff)}')
    df_gff['Chrom'] = df_gff.Chrom.apply(lambda x: 'chr' + x)
    df_gff['ExonID'] = df_gff.Anno.apply(lambda x: get_anno_value('exon_name', x))
    df_gff['ID'] = df_gff.Anno.apply(lambda x: get_anno_value('Name', x))
    df_gff = df_gff.drop(columns=['Anno'])
    df_gff = df_gff.drop_duplicates(ignore_index=True)
    print(f'Filtered out duplicates: {len(df_gff)}')
    return df_gff

@curry
def join(it, sep=';'):
    return sep.join(map(str, it))

def safe_take_fst(vs):
    if len(vs.unique()) != 1:
        raise ValueError(vs)
    return vs.iloc[0]

The two files parsed below encompass exactly the same 5'UTR regions; only the IDs attached to the regions are at different levels (genes and transcripts, respectively).

In [12]:
genes_path = DATA / 'uORF_search_space' / 'uORF_search_space_cage_genes.gff'
transcripts_path = DATA / 'uORF_search_space' / 'uORF_search_space_cage_transcripts.gff'
assert genes_path.exists()
assert transcripts_path.exists()

In [13]:
df_genes = parse_granges_gff(
    genes_path
).rename(
    columns={'ID': 'GeneID'}
)
df_trans = parse_granges_gff(
    transcripts_path
).rename(
    columns={'ID': 'TranscriptID'}
)

Initial size: 252701
Filtered to exons: 163270
Filtered to canonical chromosomes: 163233
Filtered out duplicates: 112368
Initial size: 252701
Filtered to exons: 163270
Filtered to canonical chromosomes: 163233
Filtered out duplicates: 163233


In [14]:
var = ['Chrom', 'Start', 'End', 'Strand']

df_utr = pd.merge(
    df_genes, df_trans, on=var + ['ExonID']
).sort_values(
    var
).reset_index(drop=True)
# GFF coordinates are 1-based and inclusive; shift Start to get 0-based, half-open ranges
df_utr['Start'] = df_utr['Start'] - 1
len(df_utr)

163233
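
Subtracting one from `Start` converts the 1-based, inclusive GFF ranges into 0-based, half-open ranges, the convention also used by BED files and Python slicing. The sketch below illustrates this on a toy sequence; that `Ref.fetch` and `pBWs.query` expect this convention is an assumption inferred from how they are called in this notebook.

```python
def gff_to_half_open(start, end):
    """Convert a 1-based inclusive GFF range to a 0-based half-open range."""
    return start - 1, end

chrom_seq = 'ACGTACGTAC'          # toy "chromosome"
gff_start, gff_end = 3, 6         # bases 3..6 inclusive, i.e. 'GTAC'
start, end = gff_to_half_open(gff_start, gff_end)

print(chrom_seq[start:end])       # slicing uses 0-based half-open coordinates
print(list(range(start, end)))    # per-base enumeration, as in `SeqEnum`
```

The range length is preserved: `end - start` equals `gff_end - gff_start + 1`.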

In [15]:
# df.to_csv('data/intermediate/5UTR_exons.tsv', sep='\t', index=False)

Parsed genomic ranges are contiguous exonic regions within 5'UTRs. We first compose sequence-level features for them and then concatenate these features (sequences, enumeration, signal) for each transcript.

Example (exons are already sorted by the starting coordinate in ascending order):
```
ENST   ENSE  Seq  Enum    Signal
100500 1     ACGT 1,2,3,4 0.1,0.0,0.1,20.0
100500 2     CCGT 6,7,8,9 0.2,0.1,4.0,0.0
```
-->
```
ENST   Seq      Enum            Signal
100500 ACGTCCGT 1,2,3,4,6,7,8,9 0.1,0.0,0.1,20.0,0.2,0.1,4.0,0.0
```
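
The same concatenation can be sketched in plain Python. This is an illustration of the idea on toy values, not the pandas aggregation used in the notebook:

```python
from collections import defaultdict

# Toy exon records: (transcript, sequence, positions, signal), already
# sorted by start coordinate. All values are illustrative.
exons = [
    ('100500', 'ACGT', [1, 2, 3, 4], [0.1, 0.0, 0.1, 20.0]),
    ('100500', 'CCGT', [6, 7, 8, 9], [0.2, 0.1, 4.0, 0.0]),
]

merged = defaultdict(lambda: {'Seq': '', 'Enum': [], 'Signal': []})
for transcript, seq, enum, signal in exons:
    entry = merged[transcript]
    entry['Seq'] += seq
    entry['Enum'] += enum
    entry['Signal'] += signal

for transcript, entry in merged.items():
    # Every base must carry exactly one position and one signal value.
    assert len(entry['Seq']) == len(entry['Enum']) == len(entry['Signal'])
    print(transcript, entry['Seq'], entry['Enum'], entry['Signal'])
```

The gap between positions 4 and 6 marks an intron: the enumeration keeps genomic coordinates, so the concatenated sequence can still be mapped back to the genome.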

In [16]:
df_utr['Seq'] = [
    ref.fetch(*x[1:]).upper() for x in 
    tqdm(df_utr[['Chrom', 'Start', 'End']].itertuples(), total=len(df_utr), desc='Fetching seqs')
]
df_utr['SeqEnum'] = [
    join(range(row['Start'], row['End']), sep=',') for _, row in 
    tqdm(df_utr.iterrows(), total=len(df_utr), desc='Enumerating seqs')
]
df_utr['Signal'] = [
    join(pbws.query(*x[1:]), sep=',') for x in 
    tqdm(df_utr[['Chrom', 'Start', 'End']].itertuples(), total=len(df_utr), desc='Fetching signal')
]

Fetching seqs: 100%|██████████| 163233/163233 [00:00<00:00, 320805.13it/s]
Enumerating seqs: 100%|██████████| 163233/163233 [00:07<00:00, 21522.80it/s]
Fetching signal: 100%|██████████| 163233/163233 [01:47<00:00, 1520.43it/s]


In [17]:
df_utr = df_utr.groupby(
    'TranscriptID', as_index=False
).agg(
    {'ExonID': join, 
     'GeneID': safe_take_fst, 
     'Seq': join(sep=''), 
     'SeqEnum': join(sep=','),
     'Signal': join(sep=','),
     'Chrom': safe_take_fst,
     'Strand': safe_take_fst
    })

- Remove anomalously short sequences right away

In [18]:
len(df_utr)

89404

In [19]:
df_utr = df_utr[df_utr.Seq.apply(lambda x: len(x) > 5)]

In [20]:
len(df_utr)

88507

Sanity check: for every transcript, the concatenated sequence, enumeration, and signal have equal lengths

In [21]:
all(len(enum.split(',')) == len(signal.split(',')) == len(seq) 
    for seq, enum, signal in df_utr[
        ['Seq', 'SeqEnum', 'Signal']].itertuples(index=False))

True

Handle direction:
- For sequences: take the reverse complement
- For signal and enumeration: reverse and join
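
A minimal sketch of this orientation step on toy data. The `reverse_complement` defined here is a stand-in for `uBERTa.utils.reverse_complement`, whose implementation is not shown in this notebook, and `to_transcript_orientation` is an illustrative helper name:

```python
COMPLEMENT = str.maketrans('ACGTN', 'TGCAN')

def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def to_transcript_orientation(seq, enum, signal, strand):
    """Orient per-base features 5'->3' along the transcript."""
    if strand == '-':
        return reverse_complement(seq), enum[::-1], signal[::-1]
    return seq, enum, signal

# Forward-strand bases 'AACAT' at genomic positions 10..14, '-' strand gene.
seq, enum, signal = to_transcript_orientation(
    'AACAT', [10, 11, 12, 13, 14], [0.0, 1.0, 2.0, 3.0, 4.0], '-')
print(seq, enum, signal)
```

After the flip, the i-th base of the sequence still pairs with the i-th position and the i-th signal value, so the equal-length invariant checked earlier is preserved.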

In [22]:
reverse_join = lambda x: join(x.split(',')[::-1], sep=',')
idx = df_utr.Strand == '-'
df_utr.loc[idx, 'Seq'] = df_utr.loc[idx, 'Seq'].apply(reverse_complement)
df_utr.loc[idx, 'SeqEnum'] = df_utr.loc[idx, 'SeqEnum'].apply(reverse_join)
df_utr.loc[idx, 'Signal'] = df_utr.loc[idx, 'Signal'].apply(reverse_join)

- K-merize sequences so that they can be matched against the hand-crafted dataset

In [23]:
df_kmers = df_utr[['TranscriptID', 'GeneID', 'Seq', 'SeqEnum', 'Signal', 'Chrom', 'Strand']].copy()
df_kmers['Seq'] = df_kmers['Seq'].apply(
    lambda x: [''.join(y) for y in sliding_window(x + 'XX', 3)])
df_kmers['SeqEnum'] = df_kmers['SeqEnum'].apply(lambda x: x.split(','))
df_kmers['Signal'] = df_kmers['Signal'].apply(lambda x: x.split(','))
df_kmers = df_kmers.explode(['SeqEnum', 'Seq', 'Signal']).dropna()
df_kmers['SeqEnum'] = df_kmers['SeqEnum'].astype(int)

In [24]:
df_kmers = df_kmers.rename(columns={'Seq': 'StartCodon'})

- Account for position shift for the reversed sequences
- Merge with the hand-crafted dataset on genomic positions
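
The shift is needed because, on the `-` strand, the first base of a codon in transcript orientation carries the codon's largest genomic coordinate, while the curated dataset stores the smallest one (`Start`). A small sketch with made-up coordinates; `codon_genomic_start` is an illustrative helper:

```python
def codon_genomic_start(first_base_pos, strand, k=3):
    """Genomic start (smallest coordinate) of a k-mer, given the
    coordinate of its first base in transcript orientation."""
    return first_base_pos - (k - 1) if strand == '-' else first_base_pos

# Toy '-' strand transcript: the forward-strand bases at positions
# 10..14 are 'AACAT', so the transcript reads 'ATGTT' and its bases
# map to positions 14, 13, 12, 11, 10.
seq, enum = 'ATGTT', [14, 13, 12, 11, 10]

for i in range(len(seq) - 2):
    kmer = seq[i:i + 3]
    print(kmer, enum[i], codon_genomic_start(enum[i], '-'))
```

Here `ATG` starts at transcript position 14 but occupies genomic positions 12..14, so it matches a curated record with `Start == 12`.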

In [25]:
df_kmers['SeqEnumShift'] = df_kmers['SeqEnum'].copy()
idx = df_kmers.Strand == '-'
df_kmers.loc[idx, 'SeqEnumShift'] = df_kmers.loc[idx, 'SeqEnumShift'] - 2

In [26]:
df_kmers = df_kmers.merge(
    ds[['StartCodon', 'Start', 'Chrom', 'Strand']], 
    left_on=['SeqEnumShift', 'Chrom', 'Strand'], 
    right_on=['Start', 'Chrom', 'Strand'],
    suffixes=['_transcript', '_curated'],
    how='left'
).drop(columns='Start')

- The rows below are curated start codons that differ from the transcript sequence at the same position, likely because of alternative splicing

In [27]:
df_kmers[(df_kmers.StartCodon_curated != df_kmers.StartCodon_transcript) & 
         ~df_kmers.StartCodon_curated.isna()]

| | TranscriptID | GeneID | StartCodon_transcript | SeqEnum | Signal | Chrom | Strand | SeqEnumShift | StartCodon_curated |
|---|---|---|---|---|---|---|---|---|---|
| 12442888 | ENST00000532463 | ENSG00000198561 | ATA | 57762066 | 52.0 | chr11 | + | 57762066 | ATG |
| 13056604 | ENST00000540610 | ENSG00000159403 | ATA | 7092387 | 0.0 | chr12 | - | 7092385 | ATG |


In [28]:
len(df_kmers)

20579712

In [29]:
df_kmers = df_kmers.drop_duplicates()
len(df_kmers)

20579712

In [30]:
df_kmers.head()

| | TranscriptID | GeneID | StartCodon_transcript | SeqEnum | Signal | Chrom | Strand | SeqEnumShift | StartCodon_curated |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ENST00000000233 | ENSG00000004059 | CTG | 127588410 | 110.0 | chr7 | + | 127588410 | NaN |
| 1 | ENST00000000233 | ENSG00000004059 | TGC | 127588411 | 9.0 | chr7 | + | 127588411 | NaN |
| 2 | ENST00000000233 | ENSG00000004059 | GCT | 127588412 | 28.0 | chr7 | + | 127588412 | NaN |
| 3 | ENST00000000233 | ENSG00000004059 | CTG | 127588413 | 65.0 | chr7 | + | 127588413 | NaN |
| 4 | ENST00000000233 | ENSG00000004059 | TGC | 127588414 | 1.0 | chr7 | + | 127588414 | NaN |


## Annotate 5'UTRs

For the 5'UTRs, we assign classes to all supported start codons according to our hand-crafted dataset.
Namely, for each codon, we place:
- `1`, if it is within our data
- `0`, if it is a valid start codon but is absent from our data
- `-100`, if none of the above is true (`-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so such positions are masked)

The two trailing k-mers produced by padding during kmerization are not valid codons and get `-100`, so the `Classes` field of each sequence effectively ends with `(-100, -100)`.

For example, in the sequence below, where ATG is absent in our data, and CTG is present, we'll compose the following sequence of classes:
```
ATGCTG -> ATG TGC GCT CTG -> 0 -100 -100 1 -100 -100
```
Hence, effectively we attribute a class to each character of a sequence.
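
The labeling scheme above can be sketched in a few lines. The `VALID_START` set below is an assumed stand-in for `uBERTa.base.VALID_START` (the real set may contain other codons), and `kmerize` and `label_kmers` are illustrative helpers:

```python
# Assumed stand-in for `uBERTa.base.VALID_START`; the real set may differ.
VALID_START = {'ATG', 'CTG', 'GTG', 'TTG'}

def kmerize(seq, k=3, pad='X'):
    """Overlapping k-mers, one per base; the tail is padded."""
    padded = seq + pad * (k - 1)
    return [padded[i:i + k] for i in range(len(seq))]

def label_kmers(kmers, curated_idx):
    """1 = curated start, 0 = other valid start codon, -100 = masked."""
    labels = []
    for i, kmer in enumerate(kmers):
        if i in curated_idx:
            labels.append(1)
        elif kmer in VALID_START:
            labels.append(0)
        else:
            labels.append(-100)
    return labels

kmers = kmerize('ATGCTG')
labels = label_kmers(kmers, curated_idx={3})  # CTG at index 3 is curated
print(kmers)
print(labels)
```

The two padded k-mers at the end (`TGX`, `GXX`) fall through to `-100`, which reproduces the class sequence from the example.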

In [31]:
def make_classes(df):
    def classify_codon(codon_transcript, codon_curated):
        if isinstance(codon_curated, str):
            return '1'
        if codon_transcript in VALID_START:
            return '0'
        return '-100'
    
    classes = ','.join(
        classify_codon(x, y) for x, y in 
        df[['StartCodon_transcript', 'StartCodon_curated']].itertuples(index=False))
    
    return pd.Series({'Classes': classes})

- We group k-merized sequences transcript-wise and aggregate (join) sequence data and associated genes.
- Then we create classes for the same groups and add them to the aggregated dataframe.

In [32]:
groups = df_kmers.groupby(
    ['Chrom', 'Strand', 'TranscriptID'], 
    as_index=False
)
df_seqs = groups.agg(
    GeneID=pd.NamedAgg('GeneID', lambda x: x.iloc[0]),
    SeqEnum=pd.NamedAgg('SeqEnum', join(sep=',')),
    Signal=pd.NamedAgg('Signal', join(sep=',')),
    Seq=pd.NamedAgg('StartCodon_transcript', lambda vs: ''.join(x[0] for x in vs)),
)
df_seqs['Classes'] = groups.apply(make_classes)['Classes']

In [33]:
len(df_seqs)

88507

## Validate and save

- Below, we verify the labeled start codons: we use the `-100` values of the assigned classes as a mask and check that the unmasked positions of the sequence are all valid start codons.
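
A compact sketch of this check in plain Python. `VALID_START` is again an assumed stand-in for `uBERTa.base.VALID_START`, and `invalid_labeled_codons` is an illustrative helper rather than the function used below:

```python
VALID_START = {'ATG', 'CTG', 'GTG', 'TTG'}  # assumed; the real set may differ

def invalid_labeled_codons(seq, classes):
    """Return (position, codon) pairs that are labeled (class != -100)
    but are not valid start codons."""
    return [
        (i, seq[i:i + 3]) for i, c in enumerate(classes)
        if c != -100 and seq[i:i + 3] not in VALID_START
    ]

ok = invalid_labeled_codons('ATGCTG', [0, -100, -100, 1, -100, -100])
broken = invalid_labeled_codons('ATACTG', [1, -100, -100, 1, -100, -100])
print(ok)
print(broken)
```

The second call mimics a curated label landing on a transcript codon that is not a valid start, as can happen when the transcript sequence differs from the curated one.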

In [34]:
# def validate_codon(start, codon, seq, seq_enum):
#     """Validate the codon sequence"""
#     idx_start = np.where(seq_enum == start)[0]
#     seq_codon = seq[idx_start:idx_start + 3]
#     return seq_codon == codon

STARTS = np.array(VALID_START)

def verify_codons(row):
    classes = np.array(list(map(int, row['Classes'].split(','))))
    m = classes != -100
    starts = np.array(tuple(map(
        lambda x: ''.join(x), sliding_window(row['Seq'], 3))) + ('PAD', 'PAD'))
    try:
        codons = starts[m]
        mask = np.isin(codons, STARTS)
        invalid = codons[~mask]
        has_invalid = len(invalid) > 0
        if has_invalid:
            print(
                f'Invalid start codons {invalid} at positions {np.where(m)[0][~mask]} '
                f'of transcript {row.TranscriptID}')
        return not has_invalid
    except Exception as e:
        print(e)
        return False

In [35]:
df_kmers[~df_kmers.StartCodon_transcript.isin(VALID_START) & ~df_kmers.StartCodon_curated.isna()]

| | TranscriptID | GeneID | StartCodon_transcript | SeqEnum | Signal | Chrom | Strand | SeqEnumShift | StartCodon_curated |
|---|---|---|---|---|---|---|---|---|---|


In [36]:
idx = np.array([verify_codons(row) for _, row in df_seqs.iterrows()])
print(f'{(~idx).sum()} problematic, {idx.sum()} OK')

0 problematic, 88507 OK


In [37]:
# df_seqs = df_seqs[idx]

Different transcripts frequently share the same 5'UTR sequence. Below, we get rid of such duplicates by retaining only unique combinations of sequence and signal. Each retained sequence is thus labeled by the first transcript ID carrying it in the dataset.
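
A plain-Python sketch of "keep the first record per (sequence, signal) pair", with made-up transcript IDs and values:

```python
records = [
    {'TranscriptID': 'T1', 'Seq': 'ACGT', 'Signal': '0.0,1.0,0.0,2.0'},
    {'TranscriptID': 'T2', 'Seq': 'ACGT', 'Signal': '0.0,1.0,0.0,2.0'},
    {'TranscriptID': 'T3', 'Seq': 'ACGT', 'Signal': '5.0,1.0,0.0,2.0'},
]

seen, unique = set(), []
for rec in records:
    key = (rec['Seq'], rec['Signal'])
    if key not in seen:          # keep the first transcript per key
        seen.add(key)
        unique.append(rec)

kept = [rec['TranscriptID'] for rec in unique]
print(kept)
```

Because the key includes the signal, identical sequences with different ribo-seq signal (for instance, from different loci) are kept as separate entries.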

In [38]:
print(f'Initial size: {len(df_seqs)}')
df_seqs = df_seqs.drop_duplicates(['Seq', 'Signal'])
print(f'Removed duplicates: {len(df_seqs)}')

Initial size: 88507
Removed duplicates: 79453


In [39]:
df_seqs.to_csv(DATA / 'DS_BASE_v4.7_seqs.tsv', sep='\t', index=False)

In [40]:
df_kmers.to_csv(DATA / 'DS_BASE_v4.7_kmers.tsv', sep='\t', index=False)

## Partition the modeling dataset into training, validation, and testing

- The modeling dataset encompasses the sequences of analyzed genes. Here, we divide genes into training, validation, and testing subsets and label their sequences accordingly.

Certain analyzed genes overlap with genes outside of the supplied list. To avoid placing the same genomic position in both the training and the validation/testing subsets, we handle such cases explicitly: we group k-merized sequences by genomic position, aggregate the genes sharing a position into overlapping groups, and then distribute the groups between the train/validation/test subsets. This ad hoc solution covers only genes that directly overlap manually analyzed sequences. If an analyzed gene X overlaps some gene Y, and Y overlaps some gene Z that doesn't overlap X, gene Z won't be considered "analyzed".
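
The grouping idea can be sketched on toy data as follows. This is a simplified illustration with made-up gene IDs: it draws a random label per group instead of using the fixed train/validation/test fractions applied below.

```python
import random
from collections import defaultdict

# Toy k-mer rows: (chrom, strand, position, gene). Gene IDs are made up.
rows = [
    ('chr1', '+', 100, 'X'), ('chr1', '+', 100, 'Y'),   # X and Y overlap
    ('chr1', '+', 200, 'Y'), ('chr1', '+', 200, 'Z'),   # Y and Z overlap
    ('chr2', '-', 300, 'W'),
]

# 1. Collect the genes found at each genomic position.
genes_at = defaultdict(set)
for chrom, strand, pos, gene in rows:
    genes_at[(chrom, strand, pos)].add(gene)

# 2. Each distinct gene set becomes one group.
groups = sorted({tuple(sorted(genes)) for genes in genes_at.values()})

# 3. Assign one label per group; every gene inherits the label of the
#    first group it appears in.
rng = random.Random(0)
labeling = {}
for group in groups:
    label = rng.choice(['Train', 'Val', 'Test'])
    for gene in group:
        labeling.setdefault(gene, label)

print(groups)
print(labeling)
```

Groups are not merged transitively in this sketch: X and Y always share a label, but Z is labeled independently of X even though both overlap Y.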

In [41]:
df_analyzed = pd.read_csv(DATA / 'List_of_analysed_unique_genes.csv', names=['GeneName'])
df_analyzed['GeneID'] = df_analyzed['GeneName'].map(name2id)
df_analyzed['Analyzed'] = True
genes_curated = set(df_analyzed.GeneID)
print('Genes in the supplied list: ', len(genes_curated))

Genes in the supplied list:  3644


In [42]:
df_genes = pd.merge(
    df_kmers[['GeneID', 'Chrom', 'Strand', 'SeqEnum', 'StartCodon_curated']], 
    df_analyzed[['GeneID', 'Analyzed']], 
    on=['GeneID'], how='left',
)

In [43]:
df_genes.loc[~df_genes.StartCodon_curated.isna(), 'Analyzed'] = True

In [44]:
df_genes = df_genes.groupby(
    ['Chrom', 'Strand', 'SeqEnum'], as_index=False
).agg({'Analyzed': 'any', 'GeneID': list})

In [45]:
genes_factual = set(chain.from_iterable(df_genes.loc[df_genes.Analyzed, 'GeneID']))
print('Factual analyzed genes: ', len(genes_factual))

Factual analyzed genes:  3761


In [46]:
analyzed_gene_groups = list(df_genes.loc[
    df_genes.Analyzed, 'GeneID'
].apply(lambda x: ';'.join(sorted(set(x)))).unique())
len(analyzed_gene_groups)

3757

In [47]:
# Groups joining several genes that overlap at some position
grouped_entries = [x for x in analyzed_gene_groups if ';' in x]
print('Total overlapping genes groups: ', len(grouped_entries))
overlapping = list(chain.from_iterable(x.split(';') for x in grouped_entries))
print('Total overlapping genes: ', len(overlapping))
# Keep multi-gene groups as they are and drop single-gene entries
# whose gene is already a member of some multi-gene group
non_overlapping = [x for x in analyzed_gene_groups if x not in overlapping]
print('Total non-overlapping gene groups: ', len(non_overlapping))

Total overlapping genes groups:  123
Total overlapping genes:  249
Total non-overlapping gene groups:  3640


In [48]:
df_genes = pd.DataFrame({'GeneID': non_overlapping})

val_frac, test_frac = 0.1, 0.1
l = len(df_genes)
n_val = int(l * val_frac)
n_test = int(l * test_frac)
n_train = l - n_test - n_val
labels = np.concatenate(
    [np.full(n_train, 'Train'), 
     np.full(n_val, 'Val'), 
     np.full(n_test, 'Test')])
np.random.shuffle(labels)
df_genes['Dataset'] = labels
df_genes['GeneID'] = df_genes['GeneID'].apply(lambda x: x.split(';'))
df_genes = df_genes.explode('GeneID').drop_duplicates('GeneID')
print(len(df_genes))
df_genes['Dataset'].value_counts()

3761


Train    3011
Test      379
Val       371
Name: Dataset, dtype: int64

In [49]:
df_genes.to_csv(DATA / 'dataset_labeling_v4.7.tsv', sep='\t', index=False)