<a href="https://colab.research.google.com/github/ML-Bioinfo-CEITEC/mirna_binding/blob/master/notebook/generate_families.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generate microRNA families DB

Download microRNA families from [TargetScan](http://www.targetscan.org) and sanitize them for the `microRNA_binding` project.

The sanitized ("sanified") dataset contains one microRNA per family and keeps only microRNAs of length >= 20 nt.

The nucleotide "U" is replaced with "T".


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# set seed for reproducibility
np.random.seed(1789)
# keep only microRNAs at least this long (nt)
min_microrna_len = 20
# targetscan species id field
human_id = 9606
# save info
save_to = './'
save_as = 'tarbased.sanified'



### Download microRNA families dataset
We use the microRNA families dataset from [TargetScan](http://www.targetscan.org).

In [2]:
mirna_families_db_url = 'http://www.targetscan.org/vert_72/vert_72_data_download/miR_Family_Info.txt.zip'
# load df from url
families_db = pd.read_csv(mirna_families_db_url, sep='\t')
# retrieve only human (9606) microRNAs
families_humans_db = families_db[families_db['Species ID'] == human_id].copy().reset_index(drop=True)

print('original targetscan dataframe shape is:', families_db.shape)
print('only human entries dataset shape is:', families_humans_db.shape)


original targetscan dataframe shape is: (9994, 7)
only human entries dataset shape is: (2606, 7)
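The loading step relies on `pd.read_csv` inferring zip compression from the `.zip` extension and on a boolean mask over `Species ID`. The following is a minimal offline sketch of the same pattern; the archive name `toy_families.txt.zip`, its two rows, and the family names are made up for illustration and are not TargetScan data.

```python
import os
import tempfile
import zipfile

import pandas as pd

human_id = 9606  # same species id as in the notebook

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical archive holding a single tab-separated file (made-up rows).
    zip_path = os.path.join(tmp, 'toy_families.txt.zip')
    with zipfile.ZipFile(zip_path, 'w') as archive:
        archive.writestr(
            'toy_families.txt',
            'miR family\tSpecies ID\nfam-A\t9606\nfam-B\t10090\n',
        )
    # compression is inferred from the ".zip" extension
    toy = pd.read_csv(zip_path, sep='\t')

# Keep only rows of the requested species.
human = toy[toy['Species ID'] == human_id].reset_index(drop=True)
print(toy.shape, human.shape)
```

Note that pandas expects the zip archive to contain exactly one file when reading it this way.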


### Pre-processing
1. Remove microRNA with `NA` values in field `MiRBase Accession`.

In [3]:
print(families_humans_db.isna().sum())

print('original dataset shape is:', families_humans_db.shape)
families_humans_db = families_humans_db.dropna()
print('original dataset shape without "NA" values is:', families_humans_db.shape)




miR family               0
Seed+m8                  0
Species ID               0
MiRBase ID               0
Mature sequence          0
Family Conservation?     0
MiRBase Accession       26
dtype: int64
original dataset shape is: (2606, 7)
original dataset shape without "NA" values is: (2580, 7)
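The cell above calls `dropna()` without arguments, which drops a row when any column is missing. Here that matches the stated rule because only `MiRBase Accession` contains `NA` values. If the rule should stay tied to that one field, `dropna(subset=...)` states it explicitly. A small sketch on a toy frame (the IDs and accessions below are invented):

```python
import numpy as np
import pandas as pd

# Toy frame with invented rows; only the column names follow the TargetScan table.
toy = pd.DataFrame({
    'MiRBase ID': ['hsa-miR-a', 'hsa-miR-b', 'hsa-miR-c'],
    'Mature sequence': [
        'UGAGGUAGUAGGUUGUAUGGUU',
        'CUAUACAAUCUACUGUCUUUC',
        'ACAUACUUCUUUAUAUGCCCAU',
    ],
    'MiRBase Accession': ['MIMAT0000001', np.nan, 'MIMAT0000003'],
})

# Drop rows only when the accession is missing; NA in other columns is ignored.
toy_clean = toy.dropna(subset=['MiRBase Accession']).reset_index(drop=True)
print(toy_clean.shape)
```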


2. Filter out miRNA sequences shorter than 20 nt, and replace nucleotide "U" with "T".


In [9]:
families_humans_db['sequence_len'] = families_humans_db['Mature sequence'].str.len()
families_humans_db_filtered = families_humans_db[families_humans_db['sequence_len'] >= min_microrna_len].copy()
families_humans_db_filtered['Mature sequence'] = families_humans_db_filtered['Mature sequence'].str.replace('U', 'T')

print('original dataset shape is:', families_humans_db.shape)
print('processed dataset shape is:', families_humans_db_filtered.shape)

original dataset shape is: (2580, 8)
processed dataset shape is: (2339, 8)
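After the U-to-T replacement it can be useful to confirm that only DNA letters remain. The sketch below repeats the filter on three toy sequences and adds such a check; the check itself is an addition for illustration and is not part of the original notebook, and `Series.str.fullmatch` assumes pandas 1.1 or later.

```python
import pandas as pd

min_len = 20  # same threshold as `min_microrna_len` in the notebook

toy = pd.DataFrame({'Mature sequence': [
    'UGAGGUAGUAGGUUGUAUGGUU',  # 22 nt, kept
    'CUAUACAAUCUACUGUCUUUC',   # 21 nt, kept
    'UGGAAUGUAAGGAAGUG',       # 17 nt, dropped
]})

# Vectorized length, then the same filter and replacement as in the cell above.
toy['sequence_len'] = toy['Mature sequence'].str.len()
kept = toy[toy['sequence_len'] >= min_len].copy()
kept['Mature sequence'] = kept['Mature sequence'].str.replace('U', 'T', regex=False)

# Sanity check (an addition): every remaining sequence uses only A, C, G, T.
only_dna = kept['Mature sequence'].str.fullmatch('[ACGT]+').all()
print(kept.shape, only_dna)
```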


### Retrieve a random microRNA for each family
This cell loops over `families_humans_db_filtered`, grouped by family, and randomly selects one microRNA to represent each family.

MicroRNAs of each family are listed under the new column field `members`.

In [19]:
# generate array of family size distribution.
family_size = list()
family_repres = list()
for index, (family_id, group) in enumerate(families_humans_db_filtered.groupby('miR family')):
    family_size.append(group.shape[0])
    x = group.copy()
    x['members'] = ','.join(np.unique(group['MiRBase ID']).tolist())
    family_repres.append(x.sample(n=1))

array = np.array(family_size, dtype='int')
families_humans_unique = pd.concat(family_repres)

print('family size min,max,mean values', array.min(), array.max(), array.mean(), sep=':')
print('family size original db shape is', families_humans_db_filtered.shape, sep=':')
print('family size unique db shape is', families_humans_unique.shape, sep=':')

family size min,max,mean values:1:22:1.2441489361702127
family size original db shape is:(2339, 8)
family size unique db shape is:(1880, 9)
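The same selection can probably be written without an explicit loop. The sketch below is an alternative formulation, not the notebook's method: it uses `groupby(...).sample`, which assumes pandas 1.1 or later, and passes `random_state` so the draw does not depend on the order in which cells were run. The family names and IDs are invented.

```python
import pandas as pd

# Toy data with invented family names and IDs.
toy = pd.DataFrame({
    'miR family': ['fam-A', 'fam-A', 'fam-B'],
    'MiRBase ID': ['hsa-miR-a2', 'hsa-miR-a1', 'hsa-miR-b1'],
})

# List the sorted, unique members of a family on every row of that family.
toy['members'] = toy.groupby('miR family')['MiRBase ID'].transform(
    lambda ids: ','.join(sorted(ids.unique()))
)

# One random representative per family; random_state makes the draw repeatable.
representatives = toy.groupby('miR family').sample(n=1, random_state=1789)

# Family size distribution, as summarized in the cell above.
family_size = toy.groupby('miR family').size()
print(representatives)
print(family_size.min(), family_size.max(), family_size.mean())
```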


### Save filtered microRNA dataset as FASTA file

In [21]:
with open(save_to + save_as + '.fasta', 'w') as outfasta:
    for index, row in families_humans_unique.iterrows():
        outfasta.write(f'>{row["miR family"]}\n{row["Mature sequence"]}\n')
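To check the written file, it can be read back with a few lines of plain Python. This is a minimal sketch that assumes the layout produced above (a `>` header line with the family name followed by a single sequence line); `read_fasta` is a hypothetical helper, and the example writes to a temporary file rather than to the notebook's output.

```python
import os
import tempfile


def read_fasta(path):
    """Parse a FASTA file with one-line sequences into (header, sequence) pairs."""
    records = []
    with open(path) as handle:
        header = None
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith('>'):
                header = line[1:]
            else:
                records.append((header, line))
    return records


# Write two records in the same layout as the cell above, then read them back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.fasta')
    with open(path, 'w') as out:
        out.write('>let-7c-3p\nCTGTACAACCTTCTAGCTTTCC\n')
        out.write('>miR-1-5p\nACATACTTCTTTATATGCCCAT\n')
    records = read_fasta(path)

print(records)
```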


### Save filtered microRNA dataset as TSV table

In [24]:
families_humans_unique.to_csv(save_to + save_as + '.tsv', sep='\t', index=False)

families_humans_unique.head(10)

| | miR family | Seed+m8 | Species ID | MiRBase ID | Mature sequence | Family Conservation? | MiRBase Accession | sequence_len | members |
|---|---|---|---|---|---|---|---|---|---|
| 2 | let-7-5p/98-5p | GAGGUAG | 9606 | hsa-let-7c-5p | TGAGGTAGTAGGTTGTATGGTT | 2 | MIMAT0000064 | 22 | hsa-let-7a-5p,hsa-let-7b-5p,hsa-let-7c-5p,hsa-... |
| 11 | let-7a-2-3p/let-7g-3p | UGUACAG | 9606 | hsa-let-7a-2-3p | CTGTACAGCCTCCTAGCTTTCC | -1 | MIMAT0010195 | 22 | hsa-let-7a-2-3p,hsa-let-7g-3p |
| 13 | let-7a-3p/let-7b-3p/let-7f-1-3p/98-3p | UAUACAA | 9606 | hsa-let-7a-3p | CTATACAATCTACTGTCTTTC | -1 | MIMAT0004481 | 21 | hsa-let-7a-3p,hsa-let-7b-3p,hsa-let-7f-1-3p,hs... |
| 17 | let-7c-3p | UGUACAA | 9606 | hsa-let-7c-3p | CTGTACAACCTTCTAGCTTTCC | -1 | MIMAT0026472 | 22 | hsa-let-7c-3p |
| 18 | let-7d-3p | UAUACGA | 9606 | hsa-let-7d-3p | CTATACGACCTGCTGCCTTTCT | -1 | MIMAT0004484 | 22 | hsa-let-7d-3p |
| 19 | let-7e-3p | UAUACGG | 9606 | hsa-let-7e-3p | CTATACGGCCTCCTAGCTTTCC | -1 | MIMAT0004485 | 22 | hsa-let-7e-3p |
| 20 | let-7f-2-3p/1185-3p | UAUACAG | 9606 | hsa-let-7f-2-3p | CTATACAGTCTACTGTCTTTCC | -1 | MIMAT0004487 | 22 | hsa-let-7f-2-3p,hsa-miR-1185-1-3p,hsa-miR-1185... |
| 23 | let-7i-3p | UGCGCAA | 9606 | hsa-let-7i-3p | CTGCGCAAGCTACTGCCTTGCT | -1 | MIMAT0004585 | 22 | hsa-let-7i-3p |
| 265 | miR-1-3p/206 | GGAAUGU | 9606 | hsa-miR-206 | TGGAATGTAAGGAAGTGTGTGG | 2 | MIMAT0000462 | 22 | hsa-miR-1-3p,hsa-miR-206,hsa-miR-613 |
| 338 | miR-1-5p | CAUACUU | 9606 | hsa-miR-1-5p | ACATACTTCTTTATATGCCCAT | -1 | MIMAT0031892 | 22 | hsa-miR-1-5p |
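Downstream code will likely need the comma-joined `members` field as individual IDs. The sketch below reads a small tab-separated snippet (two rows copied from the preview above, reduced to three columns) and expands `members` into one row per member; the in-memory `tsv_text` stands in for the saved file and is an assumption made to keep the example self-contained.

```python
import io

import pandas as pd

# Stand-in for the saved .tsv file: two rows from the preview, three columns only.
tsv_text = (
    'miR family\tMature sequence\tmembers\n'
    'miR-1-3p/206\tTGGAATGTAAGGAAGTGTGTGG\thsa-miR-1-3p,hsa-miR-206,hsa-miR-613\n'
    'miR-1-5p\tACATACTTCTTTATATGCCCAT\thsa-miR-1-5p\n'
)
table = pd.read_csv(io.StringIO(tsv_text), sep='\t')

# Split the comma-joined field into lists, then expand to one row per member.
table['members'] = table['members'].str.split(',')
long_table = table.explode('members').reset_index(drop=True)
print(long_table[['miR family', 'members']])
```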
