# RetNet filtering

In [1]:
import os
import glob
import pandas as pd

In [2]:
retnet_url = 'https://sph.uth.edu/retnet/disease.htm'
filtered_dir = 'filtered/*.txt'
output_dir = 'retnet_filtered'

We start by reading the data from the RetNet HTML page. Because the site splits the data across multiple tables, `pd.read_html` returns a list of data frames, which we concatenate into one.

In [3]:
retnet_dfs = pd.read_html(retnet_url, header=1)
retnet_df = pd.concat(retnet_dfs, ignore_index=True)
retnet_df.head()

Unnamed: 0,Symbols;OMIM Numbers,Location,Diseases;Protein,How Identified;Comments,References
0,SAMD11; 616765,1p36.33,recessive retinitis pigmentosa; protein: steri...,"homozygosity mapping, whole-exome sequencing; ...",Corton 16
1,"NPHP4, SLSN4; 606966, 606996, 607215",1p36.31,recessive Senior-Loken syndrome; recessive nep...,"linkage mapping, candidate gene; Senior-Loken ...",Mollet 02; Otto 02; Schuermann 02
2,"ESPN, DFNB36; 606351, 609006",1p36.31,recessive Usher syndrome; protein: espin prote...,"homozygosity mapping, whole-exome sequencing; ...",Ahmed 18; Donaudy 06; Naz 04
3,"NMNAT1, LCA9, PNAT1; 204000, 608553, 608700",1p36.22,recessive Leber congenital amaurosis; protein:...,"linkage mapping, whole-exome sequencing; a hom...",Chiang 12; Falk 12; Keen 03; Koenekoop 12; Per...
4,"MFN2, CMT6, CMT2A2, MARF; 608507, 609260, 601152",1p36.22,dominant optic atrophy with neuropathy and myo...,candidate gene; dominant mutation in a large T...,Rouzier 12
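To make the concatenation step concrete, here is a minimal offline sketch. The two small frames and their values are made up for illustration and only stand in for the separate tables returned by `pd.read_html`; the column names are borrowed from the output above.

```python
import pandas as pd

# Hypothetical stand-ins for two of the tables returned by pd.read_html.
# Gene symbols, OMIM numbers and locations here are invented.
table_a = pd.DataFrame({
    'Symbols;OMIM Numbers': ['GENE1; 100001', 'GENE2, ALIAS2; 100002'],
    'Location': ['1p36.33', '1p36.31'],
})
table_b = pd.DataFrame({
    'Symbols;OMIM Numbers': ['GENE3; 100003'],
    'Location': ['2q11.2'],
})

# ignore_index=True discards each table's own 0..n index
# and renumbers the combined rows from 0.
combined = pd.concat([table_a, table_b], ignore_index=True)
print(combined)
```

Without `ignore_index=True`, the combined frame would keep the per-table indices, so the label `0` would appear once for each source table.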


We need the name of every gene. Each name is the first entry in the `Symbols;OMIM Numbers` column, so we extract the leading word with a regular expression.

In [4]:
retnet_df['gene'] = retnet_df['Symbols;OMIM Numbers'].str.extract(r'^(\w+)', expand=False)
retnet_df.head()

Unnamed: 0,Symbols;OMIM Numbers,Location,Diseases;Protein,How Identified;Comments,References,gene
0,SAMD11; 616765,1p36.33,recessive retinitis pigmentosa; protein: steri...,"homozygosity mapping, whole-exome sequencing; ...",Corton 16,SAMD11
1,"NPHP4, SLSN4; 606966, 606996, 607215",1p36.31,recessive Senior-Loken syndrome; recessive nep...,"linkage mapping, candidate gene; Senior-Loken ...",Mollet 02; Otto 02; Schuermann 02,NPHP4
2,"ESPN, DFNB36; 606351, 609006",1p36.31,recessive Usher syndrome; protein: espin prote...,"homozygosity mapping, whole-exome sequencing; ...",Ahmed 18; Donaudy 06; Naz 04,ESPN
3,"NMNAT1, LCA9, PNAT1; 204000, 608553, 608700",1p36.22,recessive Leber congenital amaurosis; protein:...,"linkage mapping, whole-exome sequencing; a hom...",Chiang 12; Falk 12; Keen 03; Koenekoop 12; Per...,NMNAT1
4,"MFN2, CMT6, CMT2A2, MARF; 608507, 609260, 601152",1p36.22,dominant optic atrophy with neuropathy and myo...,candidate gene; dominant mutation in a large T...,Rouzier 12,MFN2
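As a quick check of what the pattern captures, the sketch below applies the same kind of extraction to a few strings. The first two are copied from the output above; the third, `HYPO-1`, is a hypothetical symbol added only to show that `\w+` stops at a hyphen. Whether RetNet actually contains hyphenated symbols is not verified here.

```python
import pandas as pd

symbols = pd.Series([
    'SAMD11; 616765',
    'NPHP4, SLSN4; 606966, 606996, 607215',
    'HYPO-1; 000000',  # hypothetical entry, not taken from RetNet
])

# ^ anchors at the start of the string; (\w+) captures the leading run of
# letters, digits and underscores. expand=False returns a Series.
genes = symbols.str.extract(r'^(\w+)', expand=False)
print(genes.tolist())  # ['SAMD11', 'NPHP4', 'HYPO']
```

If hyphenated symbols do turn up, a pattern such as `r'^([\w-]+)'` would keep them whole.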


These genes are now used to further filter the results obtained in the [filtering notebook](01_filtering.ipynb).

In [5]:
filtered_path = glob.glob(filtered_dir)
print('Files considered for further filtering:')
filtered_path

Files considered for further filtering:


['filtered/376_filtered.txt',
 'filtered/484_filtered.txt',
 'filtered/226_filtered.txt',
 'filtered/352_filtered.txt',
 'filtered/339_filtered.txt',
 'filtered/429_filtered.txt',
 'filtered/312_filtered.txt',
 'filtered/367_filtered.txt']

We use an inner join to create a new data frame containing only the records whose `gene` value appears in both tables. Each result is saved to a new folder.

In [6]:
# Create the output folder if it does not exist yet.
os.makedirs(output_dir, exist_ok=True)

for path in filtered_path:
    filtered_df = pd.read_csv(path, sep='\t')
    # Build the output name, e.g. 376_filtered.txt -> 376_filtered_retnet.csv
    filename = os.path.basename(path)
    filename_without_extension = os.path.splitext(filename)[0]
    new_filename = filename_without_extension + '_retnet.csv'
    full_output_path = os.path.join(output_dir, new_filename)
    # Keep only rows whose gene also appears in the RetNet table.
    df = pd.merge(left=filtered_df, right=retnet_df, how='inner', on='gene')
    df.to_csv(full_output_path)
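The effect of the inner join can be seen on a toy example. In the sketch below, the `variant` column and the gene `NOTINRETNET` are hypothetical placeholders for the contents of a filtered file; the three RetNet rows reuse genes and locations from the output shown earlier.

```python
import pandas as pd

# Hypothetical filtered results: the 'variant' column and the last gene are invented.
filtered = pd.DataFrame({
    'gene': ['NPHP4', 'ESPN', 'NOTINRETNET'],
    'variant': ['v1', 'v2', 'v3'],
})

# Small excerpt shaped like the RetNet table.
retnet = pd.DataFrame({
    'gene': ['SAMD11', 'NPHP4', 'ESPN'],
    'Location': ['1p36.33', '1p36.31', '1p36.31'],
})

# how='inner' keeps only rows whose gene occurs in both frames.
merged = pd.merge(left=filtered, right=retnet, how='inner', on='gene')
print(merged)
```

Only `NPHP4` and `ESPN` survive: `NOTINRETNET` has no RetNet match, and `SAMD11` has no match in the filtered data. Note that if a gene were listed more than once in the RetNet table, each matching filtered row would be repeated once per listing.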