<a href="https://colab.research.google.com/github/ML-Bioinfo-CEITEC/mirna_binding/blob/master/notebook/generate_families.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generate microRNA families DB

Download microRNA families from [TargetScan](http://www.targetscan.org) and sanitize ("sanify") them for the `microRNA_binding` project.

The "sanified" dataset contains one representative microRNA per family and keeps only microRNAs of length >= 20 nt.

Nucleotide "U" is replaced with "T".


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# set seed for reproducibility
np.random.seed(1789)
# retrieve only microRNAs at least this long (nt)
min_microrna_len = 20
# targetscan species id field
human_id = 9606
# save info
save_to = './'
save_as = 'tarbased.sanified'

### Download microRNA families dataset
We use the microRNA families dataset from [TargetScan](http://www.targetscan.org).

In [35]:
mirna_families_db_url = 'http://www.targetscan.org/vert_72/vert_72_data_download/miR_Family_Info.txt.zip'
# load df from url
families_db = pd.read_csv(mirna_families_db_url, sep='\t')
# retrieve only human (9606) microRNAs
families_humans_db = families_db[families_db['Species ID'] == human_id].copy().reset_index(drop=True)

print('original targetscan dataframe shape is:', families_db.shape)
print('only human entries dataset shape is:', families_humans_db.shape)


original targetscan dataframe shape is: (9994, 7)
only human entries dataset shape is: (2606, 7)
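
The download cell relies on pandas inferring zip compression from the `.zip` suffix of the URL. As a network-free sketch of the same idea, the example below builds a tiny archive in memory and reads it back. The file name `toy.txt`, the family names, and the rows are made up for illustration. With a buffer instead of a path or URL, `compression='zip'` has to be passed explicitly.

```python
import io
import zipfile

import pandas as pd

# Hypothetical two-row table; only the column names mirror the notebook.
raw = 'miR family\tSpecies ID\nfam-A\t9606\nfam-B\t10090\n'

# Build a zip archive in memory holding a single tab-separated file.
archive = io.BytesIO()
with zipfile.ZipFile(archive, 'w') as zf:
    zf.writestr('toy.txt', raw)
archive.seek(0)

# A buffer has no file name, so the compression is named explicitly.
toy = pd.read_csv(archive, sep='\t', compression='zip')

# Same species filter as in the notebook, applied to the toy table.
humans = toy[toy['Species ID'] == 9606].reset_index(drop=True)
print(humans)
```

The archive must contain exactly one data file for `read_csv` to accept it.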


### Pre-processing
1. Remove microRNAs with `NA` values in the `MiRBase Accession` field.

In [40]:
print(families_humans_db.isna().sum())

print('original dataset shape is:', families_humans_db.shape)
# drop only rows whose accession is missing, as described above
families_humans_db = families_humans_db.dropna(subset=['MiRBase Accession'])
print('original dataset shape without "NA" values is:', families_humans_db.shape)




miR family              0
Seed+m8                 0
Species ID              0
MiRBase ID              0
Mature sequence         0
Family Conservation?    0
MiRBase Accession       0
dtype: int64
original dataset shape is: (2580, 7)
original dataset shape without "NA" values is: (2580, 7)


2. Filter out miRNA sequences shorter than 20 nt, and replace nucleotide "U" with "T".


In [41]:
families_humans_db['sequence_len'] = families_humans_db['Mature sequence'].str.len()
families_humans_db_filtered = families_humans_db[families_humans_db['sequence_len'] >= min_microrna_len].copy()
families_humans_db_filtered['Mature sequence'] = families_humans_db_filtered['Mature sequence'].str.replace('U', 'T')

print('original dataset shape is:', families_humans_db.shape)
print('processed dataset shape is:', families_humans_db_filtered.shape)

original dataset shape is: (2580, 8)
processed dataset shape is: (2339, 8)
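
The two pre-processing steps can be sketched as one small function and checked on toy data. This is a minimal illustration, not the notebook's exact code: the helper name `preprocess`, the IDs, the accessions, and the artificial `ACGU` repeats are all hypothetical. Only the column names come from the TargetScan table.

```python
import numpy as np
import pandas as pd


def preprocess(df, min_len=20):
    """Drop rows lacking an accession, keep long sequences, rewrite U as T."""
    # Step 1: only a missing accession disqualifies a row.
    kept = df.dropna(subset=['MiRBase Accession']).copy()
    # Step 2: keep sequences of at least `min_len` nucleotides.
    kept = kept[kept['Mature sequence'].str.len() >= min_len].copy()
    # Literal (non-regex) replacement of RNA "U" with DNA "T".
    kept['Mature sequence'] = kept['Mature sequence'].str.replace('U', 'T', regex=False)
    return kept.reset_index(drop=True)


# Hypothetical rows: one without accession, one too short, one that passes.
toy = pd.DataFrame({
    'MiRBase ID': ['toy-no-accession', 'toy-short', 'toy-kept'],
    'Mature sequence': ['ACGU' * 6, 'ACGU' * 4, 'ACGU' * 6],
    'MiRBase Accession': [np.nan, 'TOY0000002', 'TOY0000003'],
})

result = preprocess(toy)
print(result)
```

Only `toy-kept` survives: the first row has no accession and the second is 16 nt long.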


### Retrieve a random microRNA for each family
This cell loops over `families_humans_db_filtered`, grouped by family, and randomly selects one microRNA to represent each family.

All microRNAs of each family are listed in the new column `members`.

In [43]:
# collect the size of each family and one random representative per family
family_size = list()
family_repres = list()
for family_id, group in families_humans_db_filtered.groupby('miR family'):
    family_size.append(group.shape[0])
    # all distinct members of this family
    members = np.unique(group['MiRBase ID']).tolist()
    # pick one member at random to represent the family
    representative = group.sample(n=1).copy()
    # wrap in a list so the single row stores the whole member list
    representative['members'] = [members]
    family_repres.append(representative)

array = np.array(family_size, dtype='int')
families_humans_unique = pd.concat(family_repres)

print('family size min,max,mean values', array.min(), array.max(), array.mean(), sep=':')
print('family size original db shape is', families_humans_db_filtered.shape, sep=':')
print('family size unique db shape is', families_humans_unique.shape, sep=':')

family size min,max,mean values:1:22:1.2441489361702127
family size original db shape is:(2339, 8)
family size unique db shape is:(1880, 9)
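
As an alternative to the explicit loop, the same selection can be sketched with `groupby(...).sample`, which is available in pandas 1.1 and later. This swaps in a different technique from the cell above. The toy families and IDs are made up, and storing `members` as a comma-separated string is an assumption made here so the column stays readable in a TSV file.

```python
import pandas as pd

# Toy families with hypothetical IDs; only the column names mirror the notebook.
toy = pd.DataFrame({
    'miR family': ['fam-A', 'fam-A', 'fam-A', 'fam-B'],
    'MiRBase ID': ['toy-a3', 'toy-a1', 'toy-a2', 'toy-b1'],
})

# One random row per family; random_state makes the draw repeatable.
representatives = toy.groupby('miR family').sample(n=1, random_state=1789)

# Comma-separated, sorted member IDs for every family.
members = toy.groupby('miR family')['MiRBase ID'].agg(
    lambda ids: ','.join(sorted(ids.unique()))
)

# Attach the family-wide member string to each representative row.
representatives = representatives.assign(
    members=representatives['miR family'].map(members)
).reset_index(drop=True)

print(representatives)
```

Passing `random_state` to `sample` keeps the draw reproducible without depending on the global NumPy seed.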


### Save filtered microRNA dataset as FASTA file

In [44]:
with open(save_to + save_as + '.fasta', 'w') as outfasta:
    for index, row in families_humans_unique.iterrows():
        outfasta.write(f'>{row["miR family"]}\n{row["Mature sequence"]}\n')
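
A quick way to check the two-line FASTA layout written above is to parse it back. The sketch below does this on an in-memory buffer. The helper names `write_fasta` and `read_fasta` and the toy records are hypothetical, and the parser assumes each sequence sits on a single line, as in the cell above.

```python
import io


def write_fasta(records, handle):
    """Write (header, sequence) pairs as two-line FASTA records."""
    for header, sequence in records:
        handle.write(f'>{header}\n{sequence}\n')


def read_fasta(handle):
    """Parse two-line FASTA records back into (header, sequence) pairs."""
    records = []
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            header = line[1:]  # remember the header for the next sequence line
        else:
            records.append((header, line))
    return records


# Hypothetical family names and artificial sequences.
toy_records = [('fam-A', 'ACGT' * 5), ('fam-B', 'TTGCA' * 4)]

buffer = io.StringIO()
write_fasta(toy_records, buffer)
buffer.seek(0)
round_trip = read_fasta(buffer)
print(round_trip == toy_records)
```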


### Save filtered microRNA dataset as TSV table

In [45]:
families_humans_unique.to_csv(save_to + save_as + '.tsv', sep='\t', index=False)
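
If a column such as `members` holds Python lists, `to_csv` writes each list as its text representation, so it comes back as a plain string unless it is parsed on load. The sketch below shows one way to restore the lists with `ast.literal_eval`. The toy table is made up, and an in-memory buffer stands in for the `.tsv` file.

```python
import ast
import io

import pandas as pd

# Hypothetical table with a list-valued column.
toy = pd.DataFrame({
    'miR family': ['fam-A', 'fam-B'],
    'members': [['toy-a1', 'toy-a2'], ['toy-b1']],
})

# Write the table as TSV into an in-memory buffer.
buffer = io.StringIO()
toy.to_csv(buffer, sep='\t', index=False)
buffer.seek(0)

# Parse the text representation of each list back into a real list.
restored = pd.read_csv(buffer, sep='\t', converters={'members': ast.literal_eval})
print(restored)
```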