### Complex Pheno/Geno workflow

This is the Complex Pheno/Geno workflow, to be implemented as a web-based app using Flask.

In [2]:
from Bio import Entrez
from Settings import EMAIL # requires the presence of file Settings.py in current directory 

In [3]:
Entrez.email = EMAIL
# In Settings.py: EMAIL = <username>@<something>.<something>
# This is the email address you used to register with NCBI

##### 1. Enter rsID
The user is asked to enter an rsID that they know is associated with the complex disease of interest.
They can then either proceed with the analysis or use PhenVar to check that the rsID returns the expected results (word cloud and D3 graph).

In [5]:
# Get PMIDs from a single rsID

ID = input("Please enter an rsID number (do not include the letters 'rs'): ")

Please enter an rsID number (do not include the letters 'rs'): 800292
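Since the prompt relies on the user leaving out the `rs` prefix, a small validation step may be useful once this moves into the web app. The following is a minimal sketch, not part of the workflow above; `normalize_rsid` is a hypothetical helper name.

```python
import re


def normalize_rsid(raw):
    """Return the numeric part of an rsID as a string, or None if invalid.

    Hypothetical helper: accepts '800292', 'rs800292' or 'RS800292',
    with optional surrounding whitespace.
    """
    match = re.fullmatch(r"(?:rs)?([0-9]+)", raw.strip(), flags=re.IGNORECASE)
    return match.group(1) if match else None
```

The returned string could then be assigned to `ID`, and a `None` result used to re-prompt the user.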


In [6]:
print('Do you want to proceed with analysis?')
print('Answering "no" will redirect to the PhenVar results for the given rsID. ')
print('Answering "yes" will take you to the next step where you will be required to enter VCF files. ')
conditional = input('Your answer: ')

Do you want to proceed with analysis?
Answering "no" will redirect to the PhenVar results for the given rsID. 
Answering "yes" will take you to the next step where you will be required to enter VCF files. 
Your answer: no


In [125]:
import webbrowser

conditional = conditional.strip().lower()
if conditional == 'no':
    # Build the PhenVar results URL for this rsID and open it in the default browser
    url = 'https://phenvar.colorado.edu/results/?rsids=' + ID + '&visualization=png-wordcloud&visualization=js-graph&normalization_type=default'
    print(url)
    webbrowser.open_new(url)
    # Also works from a terminal: python -m webbrowser -t "http://www.python.org"
elif conditional != 'yes':
    print('Please go back and enter either "yes" or "no".')

https://phenvar.colorado.edu/results/?rsids=800292&visualization=png-wordcloud&visualization=js-graph&normalization_type=default
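As an alternative to string concatenation, the same URL can be assembled with `urllib.parse.urlencode`, which also escapes any unexpected characters in the rsID. This is only a sketch; `phenvar_url` and `PHENVAR_RESULTS` are names introduced here for illustration, and the query parameters are copied from the URL printed above.

```python
from urllib.parse import urlencode

PHENVAR_RESULTS = "https://phenvar.colorado.edu/results/"


def phenvar_url(rsid):
    """Build the PhenVar results URL for one rsID (numeric part only)."""
    # A list of pairs is used because 'visualization' appears twice
    params = [
        ("rsids", rsid),
        ("visualization", "png-wordcloud"),
        ("visualization", "js-graph"),
        ("normalization_type", "default"),
    ]
    return PHENVAR_RESULTS + "?" + urlencode(params)
```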


In [216]:
data =  Entrez.read(Entrez.elink(dbfrom='snp', db = 'pubmed', linkname='snp_pubmed_cited', id=ID))
pmids =  [id_dict['Id'] for id_dict in data[0]['LinkSetDb'][0]["Link"]]
print('There were {} PMIDs retrieved that are related to rs{}'.format(len(pmids), ID))

There were 128 PMIDs retrieved that are related to rs800292


In [220]:
import sys

rsids = []
for pmid in pmids:
    try:
        data = Entrez.read(Entrez.elink(dbfrom='pubmed', db='snp', linkname='pubmed_snp_cited', id=pmid))
    except Exception:
        # Catch Exception rather than using a bare except, so Ctrl-C still interrupts the loop
        print("There has been some problem with your connection. Please try again.")
        sys.exit()  # raises SystemExit, which stops the running script
    # A PMID without linked SNPs returns an empty LinkSetDb; skip it instead of failing
    if data[0]['LinkSetDb']:
        rsids.extend(id_dict['Id'] for id_dict in data[0]['LinkSetDb'][0]['Link'])
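Because this loop issues one request per PMID, a single transient network error currently ends the whole run. One option is to retry a failed call a few times before giving up. The helper below is a generic sketch under that assumption; `call_with_retries` is a hypothetical name, and the choice of `OSError` as the default reflects that `urllib` network errors subclass it.

```python
import time


def call_with_retries(func, attempts=3, delay=1.0, exceptions=(OSError,)):
    """Call func() up to `attempts` times, sleeping `delay` seconds between tries.

    The last failure is re-raised so the caller can decide how to report it.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)
```

In the loop, the Entrez request could then be wrapped as `call_with_retries(lambda: Entrez.read(Entrez.elink(...)))`, with the arguments unchanged.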

In [385]:
# Remove duplicates: keep one entry per rsID, regardless of how many publications support it
rsidSet = list(set(rsids))

print('A total of {} unique rsIDs were retrieved from {} PMIDs using rs{} as a search term\n'.format(len(rsidSet), len(pmids),ID))

A total of 757 unique rsIDs were retrieved from 128 PMIDs using rs800292 as a search term



##### 2. Filter retrieved rsIDs by number of supporting publications

In [398]:
#create a dictionary where the key is the number of publications supporting each rsID and values are the actual rsIDs
from collections import Counter
from Functions import rev_dict
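`rev_dict` lives in the local `Functions` module, which is not shown here. Judging from how its result is used in the following cells (keys are publication counts, values are lists of rsIDs), it could look roughly like the sketch below. `rev_dict_sketch` is a hypothetical stand-in, not the actual implementation, and the rsIDs in the example are made up.

```python
from collections import Counter, defaultdict


def rev_dict_sketch(counts):
    """Invert {item: count} into {count: [items with that count]}."""
    grouped = defaultdict(list)
    for item, n in counts.items():
        grouped[n].append(item)
    return dict(grouped)


# Made-up rsIDs: '111' and '222' are each cited twice, '333' once
example = rev_dict_sketch(Counter(["111", "111", "222", "222", "333"]))
print(example)
```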

In [399]:
rsDict = rev_dict(Counter(rsids)) 
 
# Sorted list of the distinct publication counts present in the results
numPmids = sorted(rsDict)

In [400]:
temp = rsDict.copy()
for i in numPmids:
    temp.pop(i, 0)  # discard the rsIDs cited by exactly i PMIDs
    remaining = sum(len(ids) for ids in temp.values())
    print('{} rsIDs are cited by more than {} PMIDs'.format(remaining, i))

201 rsIDs are cited by more than 1 PMIDs
99 rsIDs are cited by more than 2 PMIDs
59 rsIDs are cited by more than 3 PMIDs
49 rsIDs are cited by more than 4 PMIDs
39 rsIDs are cited by more than 5 PMIDs
36 rsIDs are cited by more than 6 PMIDs
28 rsIDs are cited by more than 7 PMIDs
24 rsIDs are cited by more than 8 PMIDs
19 rsIDs are cited by more than 9 PMIDs
17 rsIDs are cited by more than 10 PMIDs
16 rsIDs are cited by more than 11 PMIDs
14 rsIDs are cited by more than 12 PMIDs
13 rsIDs are cited by more than 13 PMIDs
12 rsIDs are cited by more than 15 PMIDs
11 rsIDs are cited by more than 16 PMIDs
9 rsIDs are cited by more than 18 PMIDs
8 rsIDs are cited by more than 20 PMIDs
7 rsIDs are cited by more than 23 PMIDs
6 rsIDs are cited by more than 24 PMIDs
5 rsIDs are cited by more than 26 PMIDs
4 rsIDs are cited by more than 32 PMIDs
3 rsIDs are cited by more than 37 PMIDs
2 rsIDs are cited by more than 64 PMIDs
1 rsIDs are cited by more than 84 PMIDs
0 rsIDs are cited by more than 12

In [426]:
# Upper bound offered to the user: the largest count, excluding the last (highest) entry in numPmids
max_pubs = max(numPmids[:-1])
condition_filter = False
while not condition_filter:
    print("What is the minimum acceptable number of publications supporting each rsID you want to consider?")
    print("Possible values are between 1 and {}.".format(max_pubs))
    try:
        no = int(input(''))
    except ValueError:
        no = 0  # non-numeric input falls through to the retry message
    if 1 <= no <= max_pubs:
        selected_rsids = set()
        for i in numPmids:
            if i >= no:
                selected_rsids.update(rsDict[i])
        condition_filter = True
    else:
        print("You have not entered a value between 1 and {}. Please retry.\n".format(max_pubs))
print('\nYou have selected to analyse {} different rsIDs that are co-cited together with rs{} (included).'.format(len(selected_rsids), ID))

What is the minimum acceptable number of publications supporting each rsID you want to consider?
Possible values are between 1 and 84.
2

You have selected to analyse 201 different rsIDs that are co-cited together with rs800292 (included).
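In the Flask version the threshold would arrive from a form rather than from `input()`, so it may help to separate the selection logic from the prompt. The following sketch assumes the same dictionary layout as `rsDict` (publication count mapped to a list of rsIDs); `select_rsids` is a hypothetical name and the toy data are made up.

```python
def select_rsids(rs_dict, min_pubs):
    """Return the set of rsIDs supported by at least `min_pubs` publications."""
    selected = set()
    for n_pubs, ids in rs_dict.items():
        if n_pubs >= min_pubs:
            selected.update(ids)
    return selected


# Toy data: count -> rsIDs with that many supporting publications
toy = {1: ["333"], 2: ["111", "222"], 5: ["444"]}
print(sorted(select_rsids(toy, 2)))
```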


In [362]:
# Export the selected rsIDs in a space-delimited format that plink can read.
# The file will be used to extract the loci of interest from the VCFs.
# TODO: assign a unique file per user (no mix-ups) and delete it at the end of the pipeline.

# 'w' overwrites the file on each run; 'a+' would append and duplicate rsIDs across runs
with open('temp_rsids.txt', 'w') as f:
    for item in selected_rsids:
        f.write("rs{} ".format(item))
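The note in this cell about giving each user their own file and cleaning up afterwards could be handled with the standard `tempfile` module. This is one possible sketch, not the chosen design; `write_rsid_file` is a hypothetical helper, and the caller is responsible for removing the file at the end of the pipeline.

```python
import os
import tempfile


def write_rsid_file(rsids, directory=None):
    """Write rsIDs space-delimited to a uniquely named file and return its path."""
    # mkstemp creates a new file with a unique name, so concurrent users cannot collide
    fd, path = tempfile.mkstemp(prefix="rsids_", suffix=".txt", dir=directory, text=True)
    with os.fdopen(fd, "w") as handle:
        handle.write(" ".join("rs{}".format(r) for r in sorted(rsids)))
    return path
```

The returned path can be passed to plink's `--extract` option and deleted with `os.remove(path)` once the run is finished.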

##### 3. Upload VCFs and create patient - rsID matrix

I am currently working with the VCF for chr22 (one of the smallest autosomes) from [1000 Genomes](ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/) (downloaded on 10/13/2017).


In [1]:
from os import system, listdir, chdir, getcwd
import re

In [510]:
# In Flask, path will be the directory from which the files are uploaded.
# A temp directory with the same name will be created on the server.

In [2]:
path = '/Users/nikiathanasiadou/NIH/'
files = listdir(path + 'Inputs')
vcfs = [name for name in files if name.endswith('.vcf.gz')]
# Escape the dots and anchor at the end so that only the literal suffix is removed
prefix = [re.sub(r'\.vcf\.gz$', '', vcf) for vcf in vcfs]
print(prefix)

['Chr1', 'Chr22']


In [32]:
system('plink --vcf /Users/nikiathanasiadou/NIH/Inputs/Chr22.vcf.gz --recode --extract /Users/nikiathanasiadou/NIH/Outputs/rsids.txt --out Chr22')

0

In [34]:
system('plink --file Chr22 --recodeAD --out final')

0
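The `0` printed by each cell is the exit status returned by `os.system`, which is easy to overlook when a run fails. A possible alternative is `subprocess.run` with `check=True`, which raises an exception on a non-zero exit code and avoids building one long shell string. The sketch below only reuses the plink options already shown above; `plink_extract_cmd` and `run_checked` are hypothetical helper names.

```python
import subprocess


def plink_extract_cmd(vcf_path, rsid_file, out_prefix):
    """Return the argument list for the extraction step shown above."""
    return ["plink", "--vcf", vcf_path, "--recode",
            "--extract", rsid_file, "--out", out_prefix]


def run_checked(cmd):
    """Run cmd; raise subprocess.CalledProcessError if it exits non-zero."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

For example, `run_checked(plink_extract_cmd(path + 'Inputs/Chr22.vcf.gz', 'temp_rsids.txt', 'Chr22'))` would run the first step, assuming plink is on the `PATH`.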

In [37]:
import pandas as pd

In [66]:
table = pd.read_table('final.raw', sep=' ')
# Drop columns 1-6 by position, then the heterozygosity (_HET) columns written by plink
table.drop(table.columns[[1, 2, 3, 4, 5, 6]], axis=1, inplace=True)
table.drop(columns=[col for col in table.columns if col.endswith('HET')], inplace=True)
# TODO: also drop columns that are identical across samples (i.e. no variation)
table.head()

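|   | FID | rs9624909_T | rs9613094_G | rs9608466_A | rs59371099_A | rs740223_A | rs5749088_T | rs3205187_C | rs713685_T | rs6518799_A | rs743751_G | rs61741884_T | rs855791_A | rs8135665_T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HG00096 | 1 | 1 | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 1 | 2 |
| 1 | HG00097 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | HG00099 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 3 | HG00100 | 2 | 0 | 0 | 0 | 2 | 2 | 2 | 1 | 0 | 0 | 1 | 2 | 0 |
| 4 | HG00101 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |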
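The remaining step noted in the cell above, dropping columns that show no variation across samples, could be done with `DataFrame.nunique`. This is a sketch on made-up toy data rather than on `final.raw`; `drop_invariant_columns` is a hypothetical helper, and keeping `FID` regardless of its values is an assumption.

```python
import pandas as pd


def drop_invariant_columns(df, keep=("FID",)):
    """Drop columns whose values are identical in every row, except those in `keep`."""
    varying = [col for col in df.columns
               if col in keep or df[col].nunique(dropna=False) > 1]
    return df[varying]


# Toy genotype matrix: rsB_G is 0 for every sample, so it carries no information
toy = pd.DataFrame({
    "FID": ["S1", "S2", "S3"],
    "rsA_T": [0, 1, 2],
    "rsB_G": [0, 0, 0],
})
print(list(drop_invariant_columns(toy).columns))  # ['FID', 'rsA_T']
```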
