<a href="https://colab.research.google.com/github/ML-Bioinfo-CEITEC/mirna_binding/blob/master/notebook/generate_encori_train_test_datasets_lock.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Setup

The cell installs `bedtools` and the Python package `pybedtools`, then downloads the ENCORI AGO2 dataset, the human reference genome (GRCh37) used by the ENCORI database, the GENCODE annotation, and the TargetScan miRNA family table.

In [49]:
# install pybedtools
!apt-get install bedtools
!pip install pybedtools

# download encori ago2 dataset
!wget https://github.com/ML-Bioinfo-CEITEC/mirna_binding/raw/master/data/encori/encori_ago2.tsv.gz

# download reference genome, annotation, targetscan families
!wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz
!wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/GRCh37_mapping/gencode.v34lift37.basic.annotation.gtf.gz
!wget http://www.targetscan.org/vert_72/vert_72_data_download/miR_Family_Info.txt.zip

# decompress files
!gunzip GRCh37.primary_assembly.genome.fa.gz
!gunzip gencode.v34lift37.basic.annotation.gtf.gz
!unzip miR_Family_Info.txt.zip
!gunzip encori_ago2.tsv.gz


Reading package lists... Done
Building dependency tree       
Reading state information... Done
bedtools is already the newest version (2.26.0+dfsg-5).
The following package was automatically installed and is no longer required:
  libnvidia-common-440
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 59 not upgraded.
--2020-06-20 11:02:11--  https://github.com/ML-Bioinfo-CEITEC/mirna_binding/raw/master/data/encori/encori_ago2.tsv.gz
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ML-Bioinfo-CEITEC/mirna_binding/master/data/encori/encori_ago2.tsv.gz [following]
--2020-06-20 11:02:11--  https://raw.githubusercontent.com/ML-Bioinfo-CEITEC/mirna_binding/master/data/encori/encori_ago2.tsv.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.

# General Python Modules

In [1]:
import pandas as pd
import numpy as np
import pybedtools

# Parameters
This section allows the user to define the parameters used to build the datasets:  
- NEG_RATIO: number of negative samples generated per positive sample (int., def. 1; set to 20 in the cell below);
- RANDOM_STATE: random state for reproducibility (int., def. 1789);
- WINDOW_SIZE: size in nt of the window overlapping the real binding site (int.; set to 50 below);
- TIMES: number of windows added on each side of the real binding site (int.; set to 10 below).




In [2]:
NEG_RATIO = 20
RANDOM_STATE = 1789
WINDOW_SIZE = 50
TIMES = 10
np.random.seed(RANDOM_STATE)


## Negative class generator

The function generates the negative class by pairing each binding site with every miRNA except the real one. If the argument `mirna_dict` is provided as a dictionary of miRNA sequences, it is used to create the negative class. Otherwise, all unique miRNA sequences in the input `df` are used to generate samples for the negative class.

In [3]:
def negative_class_generator(df, mirna_dict=None,
                            neg_ratio=None, random_state=None):
    """
    Create miRNA negatives by pairing each binding site with every
    miRNA except the one it was observed with.

    parameters:
    df=original pandas dataframe
    mirna_dict=optional collection of miRNA sequences
    neg_ratio=number of negatives to sample per positive (None keeps all)
    random_state=set seed for reproducibility

    outputs:
    miRNA negative set
    """
    if not mirna_dict:
        # generate mirna db of unique sequences
        mirnadb = pd.DataFrame(
            df.mirna_binding_sequence.unique(), columns=['mirnaid']
        )
    else:
        mirnadb = pd.DataFrame(mirna_dict)
        mirnadb.columns = ['mirnaid']
    # add mirna db to each row of df
    connections = mirnadb.assign(key=1).merge(
          df.assign(key=1), on='key'
          ).drop(['key', 'label'],axis=1)
    # find index of positive connection
    positive_samples_mask = (connections.mirnaid == 
                             connections.mirna_binding_sequence)
    # drop positive connection to create negative samples
    negative_df = connections[~positive_samples_mask].copy().drop(
      ['mirna_binding_sequence'], axis=1
      ).reset_index(drop=True)
    # rename cols
    negative_df.columns = ['mirna_binding_sequence', 'binding_sequence']
    # add negative labels
    negative_df['label'] = 'negative'
    if neg_ratio is None:
        return negative_df
    else:
        neg_samples = int(df.shape[0] * neg_ratio)
        return negative_df.sample(
            n = neg_samples,
            random_state=random_state)
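
As a rough illustration of the pairing strategy used by `negative_class_generator`, the sketch below builds every miRNA/binding-site combination for a toy table and drops the true pairs. The toy sequences are made up, and `how='cross'` assumes pandas 1.2 or newer; the function above reaches the same result with the `key=1` merge.

```python
import pandas as pd

# Toy positives: each binding site is paired with its true miRNA (made-up sequences).
toy_df = pd.DataFrame({
    'mirna_binding_sequence': ['AAAA', 'CCCC', 'GGGG'],
    'binding_sequence': ['site_1', 'site_2', 'site_3'],
    'label': ['positive'] * 3,
})

# All unique miRNAs, one per row.
mirnas = pd.DataFrame({'mirnaid': toy_df['mirna_binding_sequence'].unique()})

# Cartesian product: every miRNA against every binding site.
pairs = mirnas.merge(toy_df.drop(columns='label'), how='cross')

# Keep only pairs where the miRNA is NOT the one observed for that site.
toy_negatives = (
    pairs[pairs['mirnaid'] != pairs['mirna_binding_sequence']]
    .drop(columns='mirna_binding_sequence')
    .rename(columns={'mirnaid': 'mirna_binding_sequence'})
    .assign(label='negative')
    .reset_index(drop=True)
)

print(toy_negatives.shape)
```

With three miRNAs and three sites, nine pairs are built and three true pairs are removed, leaving six negatives. Note that `DataFrame.sample` without replacement raises an error when `n` exceeds the number of available rows, so `neg_ratio` cannot be larger than the number of unique miRNAs minus one.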

# ENCORI Preprocessing

## Load Encori Dataset and Create Test Set
The Encori Dataset was obtained by downloading the whole dataset through the Encori API at [URL](http://www.sysu.edu.cn/403.html).

The cell loads `encori_ago2.tsv`, the ENCORI dataset restricted to binding regions obtained by immunoprecipitation of AGO2. The chromosome 1 filter is left commented out in this cell; samples mapping on chromosome 1 are set aside later for the test set.

The original ENCORI database should contain 2055403 samples, and the AGO2 dataset loaded here should contain 460120 samples.

expected output:

Encori original AGO2 samples are: (460120, 23)



In [4]:
#load encori
encori_original = pd.read_csv('encori_ago2.tsv', sep='\t', comment='#')
# filter encori
# encori_original_chr1 = encori_original[(
#     encori_original.chromosome == 'chr1') &
#     (encori_original.RBP == 'AGO2')].reset_index(drop=True)
# assign unique id to each sample
encori_original['name'] = encori_original.index.to_list()

print('Encori original AGO2 samples are:',
      encori_original.shape)
# print('Encori original AGO2 samples mapping to chrom1 are:',
#       encori_original.shape)

Encori original AGO2 samples are: (460120, 23)


## Filter ENCORI

This cell removes ENCORI samples not mapping on UTR.

1. load annotation;
2. filter `encori_original` for UTR regions only.

expected output:

samples mapping on UTR are: (425370, 22)

In [5]:
# 1. load annotation and keep UTR features only.

gtf_annotation = pd.read_csv('gencode.v34lift37.basic.annotation.gtf',
                             sep='\t', comment='#',
                             names=['chromosome', 'source', 'feature', 'start',
                                    'end', 'score', 'strand', 'frame',
                                    'attribute'])
gtf_annotation_UTR_only = gtf_annotation[
                                      (gtf_annotation.feature == 'UTR')
                                      ].copy()

# filter encori_original with pybedtools.
encori_original[['chromosome',
       'narrowStart', 'narrowEnd', 'name']].to_csv(
           'a.bed', sep='\t',
           header=False, index=False)

gtf_annotation_UTR_only[['chromosome',
       'start', 'end']].to_csv('b.bed', sep='\t', header=False, index=False)

a = pybedtools.BedTool('a.bed')
b = pybedtools.BedTool('b.bed')

overlaps = a.intersect(b, wa=True, u=True).to_dataframe()

ago2_utr = pd.merge(
    encori_original, overlaps, on='name').drop(
    ['chrom', 'start', 'end', 'name'], axis=1)

print('samples mapping on UTR are:', ago2_utr.shape)

samples mapping on UTR are: (425370, 22)
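
To make the effect of `intersect` with `wa=True, u=True` more concrete, here is a brute-force sketch in plain pandas: a query interval is kept once if it overlaps at least one feature on the same chromosome. This is only an illustration of the semantics, not how bedtools works internally, and all coordinates are made up.

```python
import pandas as pd

# Toy query intervals (like a.bed) and feature intervals (like b.bed).
sites = pd.DataFrame({
    'chromosome': ['chr1', 'chr1', 'chr2'],
    'start': [100, 500, 100],
    'end': [120, 520, 120],
    'name': [0, 1, 2],
})
utrs = pd.DataFrame({
    'chromosome': ['chr1', 'chr1'],
    'start': [90, 110],
    'end': [115, 200],
})

# Pair every site with every feature on the same chromosome.
candidates = sites.merge(utrs, on='chromosome', suffixes=('', '_utr'))

# Half-open intervals overlap when each one starts before the other ends.
overlapping = candidates[
    (candidates['start'] < candidates['end_utr'])
    & (candidates['start_utr'] < candidates['end'])
]

# Report each site once, however many features it overlaps (the -u behaviour).
kept_names = sorted(overlapping['name'].unique().tolist())
print(kept_names)
```

Site `0` overlaps both features but is reported a single time; site `1` overlaps nothing and site `2` has no feature on its chromosome, so both are dropped.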


## Filter for unique binding site

This cell removes binding sites that are targeted by multiple miRNAs.

expected output:

ambiguous binding sites are: 222480

certain binding sites are: 202890

certain Encori database samples are: (202890, 22)


In [6]:
ambiguous_bs = list()
unique = 0
for name, group in ago2_utr.groupby([
            'chromosome', 'narrowStart', 'narrowEnd']):
    if group.shape[0] != 1: # more than one miRNA targeted this region
        ambiguous_bs += group.index.to_list()
    else:
        unique += 1
print('ambiguous binding sites are:', len(ambiguous_bs))
print('certain binding sites are:', unique)

ago2_utr_certain = ago2_utr.drop(
    ambiguous_bs).reset_index(drop=True)
print('certain Encori database samples are:', ago2_utr_certain.shape)

ambiguous binding sites are: 222480
certain binding sites are: 202890
certain Encori database samples are: (202890, 22)
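
The same filter can probably be written without an explicit loop. The sketch below, on a made-up toy table, uses `DataFrame.duplicated` with `keep=False` to flag every row whose site coordinates occur more than once; it is an alternative formulation, not the code used to produce the numbers above.

```python
import pandas as pd

# Toy table: the first two rows share the same site, so both are ambiguous.
toy = pd.DataFrame({
    'chromosome': ['chr1', 'chr1', 'chr1', 'chr2'],
    'narrowStart': [100, 100, 300, 100],
    'narrowEnd': [120, 120, 320, 120],
    'miRNAid': ['miR_a', 'miR_b', 'miR_a', 'miR_c'],
})

site_cols = ['chromosome', 'narrowStart', 'narrowEnd']

# keep=False marks every row of a repeated site, not only the later copies.
is_ambiguous = toy.duplicated(subset=site_cols, keep=False)

toy_certain = toy[~is_ambiguous].reset_index(drop=True)

print('ambiguous rows:', int(is_ambiguous.sum()))
print('certain rows:', len(toy_certain))
```

As in the loop above, the ambiguous count refers to rows (samples), not to distinct genomic sites.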


## Assign unique name to each sample



In [7]:
ago2_utr_certain['name'] = [ mirna + "_" + str(index) 
    for mirna, index in zip(ago2_utr_certain['miRNAid'].to_list(),
                            ago2_utr_certain.index.to_list())]

## Generate binding site coordinates and sequences for miRNA Neg. test set

The cell replaces the original narrowStart and narrowEnd binding site coordinates with a window of 50 nt (`WINDOW_SIZE`) that starts a random 15 to 29 nt before narrowStart in genomic coordinates.

expected output:

average Encori database binding site length is: 50.0

min. Encori database binding site length is: 50

max Encori database binding site length is: 50

Encori with positive binding sites sequence shape is: (202890, 28)

In [8]:
def midpoint(row):
    """
    Extract the genomic coordinate of the binding site's midpoint.

    parameters:
    row=dataframe row

    returns:
    midpoint (int.)
    """
    start = row.narrowStart
    end = row.narrowEnd
    mid = int(round((end - start) / 2, 0))
    midpoint = start + mid
    return midpoint
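
A quick check of the midpoint arithmetic on plain integers may help. `midpoint_of` is a hypothetical helper introduced only for this sketch; it repeats the computation of `midpoint` without needing a dataframe row. The coordinates are taken from the first rows of the training table printed later in this notebook.

```python
def midpoint_of(start, end):
    """Same arithmetic as `midpoint`, taking plain integers instead of a row."""
    return start + int(round((end - start) / 2, 0))

# Site of width 5: the half-width 2.5 is rounded to 2.
print(midpoint_of(72059919, 72059924))  # → 72059921

# Site of width 18: the half-width is exactly 9.
print(midpoint_of(85008253, 85008271))  # → 85008262
```

Python 3's `round` rounds halves to the nearest even number, so a half-width of 2.5 becomes 2 while 3.5 becomes 4.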

In [9]:
ago2_utr_certain['midpoint'] = ago2_utr_certain.apply(
    midpoint, axis=1)
ago2_utr_certain['random_int'] = np.random.randint(
    15, 30, ago2_utr_certain.shape[0])
ago2_utr_certain['random_start'] = (
    ago2_utr_certain['narrowStart'] - 
    ago2_utr_certain['random_int'])
ago2_utr_certain['random_end'] = (
    ago2_utr_certain['random_start'] + WINDOW_SIZE)

print('average Encori database binding site length is:', (
    ago2_utr_certain['random_end'] -
    ago2_utr_certain['random_start']).mean() )
print('min. Encori database binding site length is:', (
    ago2_utr_certain['random_end'] -
    ago2_utr_certain['random_start']).min() )
print('max Encori database binding site length is:', (
    ago2_utr_certain['random_end'] -
    ago2_utr_certain['random_start']).max() )

average Encori database binding site length is: 50.0
min. Encori database binding site length is: 50
max Encori database binding site length is: 50
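
The window construction can be sketched on a toy table as follows. The coordinates are made up, and a local `RandomState` is used here instead of the global seed set at the top of the notebook, so the offsets will not match the ones drawn above.

```python
import numpy as np
import pandas as pd

WINDOW_SIZE = 50
rng = np.random.RandomState(1789)  # local generator, assumed only for this sketch

# Toy binding-site starts.
toy = pd.DataFrame({'narrowStart': [1000, 2000, 3000, 4000]})

# randint's upper bound is exclusive, so offsets fall in 15..29.
toy['random_int'] = rng.randint(15, 30, toy.shape[0])
toy['random_start'] = toy['narrowStart'] - toy['random_int']
toy['random_end'] = toy['random_start'] + WINDOW_SIZE

window_length = toy['random_end'] - toy['random_start']
print(window_length.unique())
```

Every window is exactly `WINDOW_SIZE` long by construction. Because the window starts 15 to 29 nt before narrowStart, between 21 and 35 nt of the window lie at or after narrowStart, so a longer site would extend past the window end.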


In [10]:
positive_bindings_col = [
                 'chromosome', 'random_start',
                 'random_end', 'name',
                 'pancancerNum', 'strand']
positive_bindings_bed = 'positive_bindings.bed'

ago2_utr_certain[positive_bindings_col].to_csv(
    positive_bindings_bed, sep='\t', index=False, header=False)

bedfile = pybedtools.BedTool(positive_bindings_bed)
seq_filename = bedfile.sequence(
    fi='GRCh37.primary_assembly.genome.fa',
    s=True, tab=True, name=True).seqfn

seq_df = pd.read_csv(
    seq_filename, sep='\t', header=None, names=['name', 'pos_sequence'])

ago2_utr_certain = pd.concat(
    [ago2_utr_certain, seq_df['pos_sequence']], axis=1)

print('Encori with positive binding sites sequence shape is:',
      ago2_utr_certain.shape)

Encori with positive binding sites sequence shape is: (202890, 28)


In [16]:
ago2_utr_certain[ago2_utr_certain.chromosome == 'chr1'].shape

(21628, 28)

## Plug-in miRNA family sequence

In [19]:
targetscan_family_df = pd.read_csv('miR_Family_Info.txt', sep='\t')
targetscan_family_df.columns = [ 'miRFamily', 'Seed+m8', 'SpeciesID',
               'MiRBase ID', 'MatureSequence',
               'Family Conservation?', 'miRNAid']

# assign miRNA family to Encori df
ago2_with_family = pd.merge(
    ago2_utr_certain,
    targetscan_family_df, on=['miRNAid'])

print(ago2_with_family[ago2_with_family.chromosome == 'chr1'].shape)

(21385, 34)


In [21]:
updated_families = list()
for family, group in ago2_with_family.groupby('miRFamily'):
    filter_seq = sorted(
        group.MatureSequence.unique().tolist(),
        key=lambda x: len(x), reverse=True)[0]
    if len(filter_seq) < 20: # fill with N
        filter_seq = filter_seq + ('N' * 20)
    # trim seq at len 20
    filter_seq = filter_seq[ : 20].replace('U', 'T')
    x = group.copy()
    x['familyseqRepresentative'] = filter_seq
    updated_families.append(x)

encori_ago2_precessed = pd.concat(updated_families)

print(encori_ago2_precessed[encori_ago2_precessed.chromosome == 'chr1'].shape)

(21385, 35)
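
The rule for choosing a family representative can be tried on a few sequences in isolation. `family_representative` is a hypothetical helper written for this sketch; it is meant to mirror the loop above (longest sequence, padded with `N`, trimmed to 20 nt, `U` replaced by `T`).

```python
def family_representative(mature_sequences, length=20):
    """Pick the longest sequence, pad with N, trim to `length`, convert U to T."""
    longest = max(mature_sequences, key=len)  # the first longest sequence wins
    padded = longest + 'N' * length           # padding is harmless for long sequences
    return padded[:length].replace('U', 'T')

# let-7 family sequences as they appear in the training table of this notebook.
let7 = [
    'UGAGGUAGUAGGUUGUGUGGUU',
    'UGAGGUAGUAGGUUGUAUGGUU',
    'AGAGGUAGUAGGUUGCAUAGUU',
]
print(family_representative(let7))  # → TGAGGTAGTAGGTTGTGTGG

# A made-up short sequence shows the N padding.
print(family_representative(['UGAGGUAG']))  # → TGAGGTAGNNNNNNNNNNNN
```

When several sequences share the maximum length, the first one encountered is used, which is why all let-7 members in the table carry the same representative.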


# Generate Train set

The cell generates the positive sample class dataset for the training of the model. The negative class is generated on the fly during the training process.

This cell extracts all samples except those mapping on chromosome 1, which are kept for the test set.

expected output:

train set size is: (179148, 35)

In [49]:
encori_train_positive_samples = encori_ago2_precessed[
                encori_ago2_precessed.chromosome != 'chr1'
                ].reset_index(drop=True)

print('train set size is:', encori_train_positive_samples.shape)

encori_train_positive_samples.head(10)

train set size is: (179148, 35)


Unnamed: 0,miRNAid,miRNAname,geneID,geneName,geneType,chromosome,narrowStart,narrowEnd,broadStart,broadEnd,strand,clipExpNum,degraExpNum,RBP,PITA,RNA22,miRmap,microT,miRanda,PicTar,TargetScan,pancancerNum,name,midpoint,random_int,random_start,random_end,pos_sequence,miRFamily,Seed+m8,SpeciesID,MiRBase ID,MatureSequence,Family Conservation?,familyseqRepresentative
0,MIMAT0000063,hsa-let-7b-5p,ENSG00000172731,LRRC20,protein_coding,chr10,72059919,72059924,72059905,72059939,-,3,0,AGO2,1,1,0,0,1,0,1,12,MIMAT0000063_0,72059921,27,72059892,72059942,GCTTGGTCCTCAGGTATCTACCTCCCACCTTCTCCTCATCTGTGGA...,let-7-5p/98-5p,GAGGUAG,9606,hsa-let-7b-5p,UGAGGUAGUAGGUUGUGUGGUU,2,TGAGGTAGTAGGTTGTGTGG
1,MIMAT0000063,hsa-let-7b-5p,ENSG00000175161,CADM2,protein_coding,chr3,85008253,85008271,85008253,85008271,+,2,0,AGO2,0,1,0,0,0,0,0,3,MIMAT0000063_1,85008262,22,85008231,85008281,TGCTACCGCCACTAGCGCTGCTTCCACTGCTTCTACCTCCCCTCCC...,let-7-5p/98-5p,GAGGUAG,9606,hsa-let-7b-5p,UGAGGUAGUAGGUUGUGUGGUU,2,TGAGGTAGTAGGTTGTGTGG
2,MIMAT0000063,hsa-let-7b-5p,ENSG00000165102,HGSNAT,protein_coding,chr8,43056584,43056606,43056584,43056606,+,1,0,AGO2,0,1,0,0,0,0,0,1,MIMAT0000063_2,43056595,21,43056563,43056613,AGCGGGGCTTCTCCTGCCTCCATCACATCACAGAAGTACCTCCTGC...,let-7-5p/98-5p,GAGGUAG,9606,hsa-let-7b-5p,UGAGGUAGUAGGUUGUGUGGUU,2,TGAGGTAGTAGGTTGTGTGG
3,MIMAT0000064,hsa-let-7c-5p,ENSG00000175697,GPR156,protein_coding,chr3,119885669,119885674,119885668,119885675,-,2,0,AGO2,0,0,1,0,0,1,1,4,MIMAT0000064_4,119885671,18,119885651,119885701,AAAGAGAGGCATTGCCTGTCCTGTTACTACCTCGCAGCCTTACTAT...,let-7-5p/98-5p,GAGGUAG,9606,hsa-let-7c-5p,UGAGGUAGUAGGUUGUAUGGUU,2,TGAGGTAGTAGGTTGTGTGG
4,MIMAT0000064,hsa-let-7c-5p,ENSG00000178177,LCORL,protein_coding,chr4,17846943,17846948,17846942,17846963,-,1,0,AGO2,1,0,1,0,1,1,1,10,MIMAT0000064_5,17846945,24,17846919,17846969,GTAAAACAGGTGGATACCAACTACCTCGAAGCAAATTGGTGAAAGA...,let-7-5p/98-5p,GAGGUAG,9606,hsa-let-7c-5p,UGAGGUAGUAGGUUGUAUGGUU,2,TGAGGTAGTAGGTTGTGTGG
5,MIMAT0000065,hsa-let-7d-5p,ENSG00000139687,RB1,protein_coding,chr13,49054893,49054898,49054893,49054899,+,2,0,AGO2,1,0,0,0,0,1,1,10,MIMAT0000065_7,49054895,19,49054874,49054924,CTACTGAAACAGATTTCATACCTCAGAATGTAAAAGAACTTACTGA...,let-7-5p/98-5p,GAGGUAG,9606,hsa-let-7d-5p,AGAGGUAGUAGGUUGCAUAGUU,2,TGAGGTAGTAGGTTGTGTGG
6,MIMAT0000065,hsa-let-7d-5p,ENSG00000144554,FANCD2,protein_coding,chr3,10082444,10082450,10082432,10082451,+,1,0,AGO2,0,0,1,0,1,0,0,5,MIMAT0000065_8,10082447,19,10082425,10082475,GTATAGTTGTCTGCAAAGCTACCTCCAAAACATCCAGTGATTTCTT...,let-7-5p/98-5p,GAGGUAG,9606,hsa-let-7d-5p,AGAGGUAGUAGGUUGCAUAGUU,2,TGAGGTAGTAGGTTGTGTGG
7,MIMAT0000065,hsa-let-7d-5p,ENSG00000182606,TRAK1,protein_coding,chr3,42245375,42245380,42245358,42245381,+,2,0,AGO2,0,0,1,0,1,0,0,9,MIMAT0000065_9,42245377,24,42245351,42245401,ATTAAAAAGAATCCAATTATGTTTACCTCAAAAGAACCTGTTTTTG...,let-7-5p/98-5p,GAGGUAG,9606,hsa-let-7d-5p,AGAGGUAGUAGGUUGCAUAGUU,2,TGAGGTAGTAGGTTGTGTGG
8,MIMAT0000065,hsa-let-7d-5p,ENSG00000119446,RBM18,protein_coding,chr9,125002095,125002114,125002095,125002114,-,1,0,AGO2,0,0,0,0,1,0,0,3,MIMAT0000065_10,125002105,18,125002077,125002127,AACTCTCTTTACCCTTTATGCCTGCCTACCTCTGTTGTTAGAGATG...,let-7-5p/98-5p,GAGGUAG,9606,hsa-let-7d-5p,AGAGGUAGUAGGUUGCAUAGUU,2,TGAGGTAGTAGGTTGTGTGG
9,MIMAT0000066,hsa-let-7e-5p,ENSG00000150051,MKX,protein_coding,chr10,27961900,27961905,27961899,27961916,-,2,0,AGO2,1,0,0,0,1,0,0,3,MIMAT0000066_20,27961902,21,27961879,27961929,ATTTATGTTGTAGAGAAATAGAATTACCTCTATTCTTTGTTTTGCC...,let-7-5p/98-5p,GAGGUAG,9606,hsa-let-7e-5p,UGAGGUAGGAGGUUGUAUAGUU,2,TGAGGTAGTAGGTTGTGTGG
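
The chromosome hold-out used for the train/test split can be sketched on a toy table (sample names are made up):

```python
import pandas as pd

# Toy samples on four chromosomes.
toy = pd.DataFrame({
    'name': ['s0', 's1', 's2', 's3', 's4'],
    'chromosome': ['chr1', 'chr2', 'chr1', 'chr10', 'chr3'],
})

is_test = toy['chromosome'] == 'chr1'
toy_train = toy[~is_test].reset_index(drop=True)
toy_test = toy[is_test].reset_index(drop=True)

# Exact string comparison keeps chr10 in the train split.
print(toy_train['chromosome'].tolist())
print(toy_test['name'].tolist())
```

Splitting by chromosome keeps every test binding site on a region of the genome that the model never sees during training.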


## Generate coordinates for Scan Test Set

The cell extends the binding site window by 10 windows of 50nt on each of its 5' and 3' sides.

expected outputs:

encori ago2 on chromosome 1 shape is: (21385, 37)

Encori with scan binding sites sequence shape is: (21385, 38)

In [39]:
encori_chrom1 = encori_ago2_precessed[
                encori_ago2_precessed.chromosome == 'chr1'
                ].reset_index(drop=True)

encori_chrom1['scan_start'] = (
    encori_chrom1['random_start'] - (WINDOW_SIZE * TIMES))
encori_chrom1['scan_end'] = (
    encori_chrom1['random_end'] + (WINDOW_SIZE * TIMES))

print('encori ago2 on chromosome 1 shape is:', encori_chrom1.shape)

encori ago2 on chromosome 1 shape is: (21385, 37)
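
The arithmetic behind the scan region is worth spelling out. The sketch below uses a hypothetical window start and the parameter values set at the top of the notebook.

```python
WINDOW_SIZE = 50
TIMES = 10

# Hypothetical 50 nt window around a binding site.
random_start = 10_000
random_end = random_start + WINDOW_SIZE

flank = WINDOW_SIZE * TIMES  # nucleotides added on each side
scan_start = random_start - flank
scan_end = random_end + flank

scan_length = scan_end - scan_start
n_windows = scan_length // WINDOW_SIZE

print(scan_length, n_windows, n_windows - 1)  # → 1050 21 20
```

Each scan region therefore spans 1050 nt, that is 21 windows: the real binding window in the middle and 20 surrounding ones. Neither this sketch nor the cell above clamps coordinates for sites closer than 500 nt to a chromosome end.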


In [43]:
scan_col = [
            'chromosome', 'scan_start', 'scan_end',
            'name', 'pancancerNum', 'strand']

scan_name = 'scan.bed'
encori_chrom1[scan_col].to_csv(
    scan_name, sep='\t', index=False, header=False)

bedfile = pybedtools.BedTool(scan_name)
seq_filename = bedfile.sequence(
    fi='GRCh37.primary_assembly.genome.fa',
    s=True, tab=True, name=True).seqfn

seq_df = pd.read_csv(
    seq_filename, sep='\t', header=None, names=['name', 'scan_sequence'])

encori_chrom1_scan = pd.concat(
    [encori_chrom1, seq_df['scan_sequence']], axis=1)

print('Encori with scan binding sites sequence shape is:', 
      encori_chrom1_scan.shape)


Encori with scan binding sites sequence shape is: (21385, 38)


## Create Scan and miRNA Neg Test sets

The function `break_seq` breaks the extracted scanning region into windows of 50nt size. It returns the miRNA sequence, the real binding site sequence, and the 20 segmented surrounding windows.

expected output:

positive df samples are: (21385, 3)

scan negative df samples are: (427700, 3)

scan test df samples are: (449085, 3)

miRNA neg samples are: (427700, 3)

mirna neg test df samples are: (449085, 3)

In [44]:
def break_seq(row):
    """
    breaks scan sequence into windows of 50nt length.

    parameters:

    row=dataframe row

    returns:
    mirnaseq, pos_sequence, segments
    """
    segments = list()
    mirnaseq = row.familyseqRepresentative
    region = row.scan_sequence
    pos_sequence = row.pos_sequence
    for window, index in enumerate(range(0, len(region), WINDOW_SIZE)):
        if window == TIMES: # this is the real binding window
            continue
        segment = region[index : index + WINDOW_SIZE]
        segments.append(segment)
    return (mirnaseq, pos_sequence, segments)


breaks = encori_chrom1_scan.apply(break_seq, axis=1)
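
To see which window is skipped, the splitting step can be run on a synthetic region where every window is filled with a different letter. `split_windows` is a hypothetical stand-alone version of the loop inside `break_seq`, written only for this sketch.

```python
WINDOW_SIZE = 50
TIMES = 10

def split_windows(region, window_size=WINDOW_SIZE, skip=TIMES):
    """Cut `region` into consecutive windows, leaving out the one at position `skip`."""
    return [
        region[start:start + window_size]
        for position, start in enumerate(range(0, len(region), window_size))
        if position != skip
    ]

# Synthetic region: 21 windows, each made of a single repeated letter.
letters = 'ABCDEFGHIJKLMNOPQRSTU'
region = ''.join(letter * WINDOW_SIZE for letter in letters)

segments = split_windows(region)
print(len(segments))                    # → 20
print('K' * WINDOW_SIZE in segments)    # → False
```

The letter `K` sits at position 10, the middle of the 21 windows, so it is the one left out. Since the region is symmetric around the binding window, the middle position should stay the same when the sequence is extracted on the reverse strand.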

In [45]:
positive_samples_list = list()
scan_neg_list = list()

# create positive sample list and negative segments for scan test set
for sample in breaks:
    mirnaseq, pos_sequence, segments = sample
    positive_samples_list.append([mirnaseq, pos_sequence, 'positive'])
    for neg_segment in segments:
        scan_neg_list.append([mirnaseq, neg_segment, 'negative'])

# generate dataframe of positive bindings
positive_samples_df = pd.DataFrame(
    positive_samples_list,
    columns = ['mirna_binding_sequence', 'binding_sequence', 'label'])
# generate dataframe of negative bindings
scan_neg_df = pd.DataFrame(
    scan_neg_list,
    columns = ['mirna_binding_sequence', 'binding_sequence', 'label'])
# generate dataframe of miRNA negative
mirna_neg_set = negative_class_generator(
        positive_samples_df, neg_ratio=NEG_RATIO,
        random_state=RANDOM_STATE)
# concatenate dataframes to create scan test set and miRNA negative test set
scan_test_df = pd.concat([positive_samples_df, scan_neg_df])
mirna_neg_test_df = pd.concat( [positive_samples_df, mirna_neg_set])


print('positive df samples are:', positive_samples_df.shape)
print('scan negative df samples are:', scan_neg_df.shape)
print('scan test df samples are:', scan_test_df.shape)
print('miRNA neg samples are:', mirna_neg_set.shape)
print('mirna neg test df samples are:', mirna_neg_test_df.shape)

positive df samples are: (21385, 3)
scan negative df samples are: (427700, 3)
scan test df samples are: (449085, 3)
miRNA neg samples are: (427700, 3)
mirna neg test df samples are: (449085, 3)
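
The reported shapes follow directly from the parameters, as this small check suggests (the number of positives is the one printed above):

```python
NEG_RATIO = 20
TIMES = 10

positives = 21385  # chr1 positives reported above

scan_negatives = positives * 2 * TIMES   # 10 windows on each side of every site
mirna_negatives = positives * NEG_RATIO  # sampled miRNA negatives

print(scan_negatives, positives + scan_negatives)    # → 427700 449085
print(mirna_negatives, positives + mirna_negatives)  # → 427700 449085
```

With `TIMES = 10` and `NEG_RATIO = 20`, both test sets end up with the same 1:20 ratio of positive to negative samples.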


## Write output dataset

In [50]:
scan_test_df.to_csv('scan_test_set.tsv', header=True, index=False, sep='\t')
mirna_neg_set.to_csv('mirna_neg_set.tsv', header=True, index=False, sep='\t')
encori_train_positive_samples.to_csv(
    'train_set_positives.tsv', header=True, index=False, sep='\t')

In [51]:
! tar -czvf encori_datasets.tar.gz scan_test_set.tsv mirna_neg_set.tsv train_set_positives.tsv

scan_test_set.tsv
mirna_neg_set.tsv
train_set_positives.tsv
