### Complex Pheno/Geno workflow

This is the Complex Pheno/Geno workflow, to be implemented as a web-based app using Flask.

In [2]:
from Bio import Entrez
from Settings import EMAIL # requires the presence of file Settings.py in current directory 

In [3]:
Entrez.email = EMAIL
# In the Settings.py file: EMAIL = '<username>@<something>.<something>'
# This is the email address you used to register with NCBI

##### 1. Enter rsID
The user is asked to enter an rsID that they know is associated with the complex disease of interest.
They can then either proceed with the analysis or use PhenVar to check that the rsID returns the expected results (word cloud and D3 graph).

In [5]:
#get PMIDs from single rsID 

ID = input("Please enter an rsID number (do not include the letters 'rs'): ")

Please enter an rsID number (do not include the letters 'rs'): 800292
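
The prompt above relies on the user leaving out the `rs` prefix. A minimal sketch of how the input could be normalised before it is passed on; `normalize_rsid` is a hypothetical helper name, not part of the workflow yet:

```python
import re

def normalize_rsid(raw):
    """Return the numeric part of an rsID.

    Accepts '800292', 'rs800292' or ' RS800292 ' (hypothetical helper).
    """
    match = re.fullmatch(r"(?:rs)?(\d+)", raw.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError("Not a valid rsID: {!r}".format(raw))
    return match.group(1)
```

It could then be used as `ID = normalize_rsid(input(...))`, so that both `800292` and `rs800292` give the same downstream result.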


In [6]:
print('Do you want to proceed with analysis?')
print('Answering "no" will redirect to the PhenVar results for the given rsID. ')
print('Answering "yes" will take you to the next step where you will be required to enter VCF files. ')
conditional = input('Your answer: ')

Do you want to proceed with analysis?
Answering "no" will redirect to the PhenVar results for the given rsID. 
Answering "yes" will take you to the next step where you will be required to enter VCF files. 
Your answer: no


In [7]:
conditional = conditional.strip().lower()  # tolerate stray whitespace and capital letters
if conditional == 'no':
    url = 'https://phenvar.colorado.edu/results/?rsids='+ID+'&visualization=png-wordcloud&visualization=js-graph&normalization_type=default'
    print(url)
    
    import webbrowser
    webbrowser.open_new(url)
    #also works in Terminal as: python -m webbrowser -t "http://www.python.org" 
    
elif conditional != 'yes':
    print('Please go back and enter either "yes" or "no".')

https://phenvar.colorado.edu/results/?rsids=800292&visualization=png-wordcloud&visualization=js-graph&normalization_type=default
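
For the Flask version, the same URL could be assembled with `urllib.parse.urlencode` instead of string concatenation, which keeps the query string valid if the parameters change. This is only a sketch: `phenvar_url` and `PHENVAR_RESULTS` are hypothetical names, and the parameters are copied from the URL printed above.

```python
from urllib.parse import urlencode

PHENVAR_RESULTS = "https://phenvar.colorado.edu/results/"

def phenvar_url(rsid):
    """Build the PhenVar results URL for one rsID (hypothetical helper)."""
    params = [
        ("rsids", rsid),
        ("visualization", "png-wordcloud"),  # repeated keys are allowed in a list of pairs
        ("visualization", "js-graph"),
        ("normalization_type", "default"),
    ]
    return PHENVAR_RESULTS + "?" + urlencode(params)
```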


In [8]:
data = Entrez.read(Entrez.elink(dbfrom='snp', db='pubmed', linkname='snp_pubmed_cited', id=ID))
linksets = data[0]['LinkSetDb']  # empty when the rsID has no cited publications
pmids = [id_dict['Id'] for id_dict in linksets[0]['Link']] if linksets else []
print('There were {} PMIDs retrieved that are related to rs{}'.format(len(pmids), ID))

There were 122 PMIDs retrieved that are related to rs800292
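
Since the same link-extraction pattern is used in both directions (SNP to PubMed and PubMed to SNP), it could be factored into one helper. A minimal sketch, assuming the parsed ELink result has the nested `LinkSetDb` / `Link` / `Id` layout used in the cell above; `linked_ids` is a hypothetical name, and unlike the cell above it collects links from every `LinkSetDb` entry rather than only the first:

```python
def linked_ids(elink_result):
    """Collect all linked IDs from a parsed ELink result (hypothetical helper).

    Returns an empty list instead of raising when no links were found.
    """
    ids = []
    for linkset in elink_result:
        for linksetdb in linkset.get("LinkSetDb", []):
            ids.extend(link["Id"] for link in linksetdb.get("Link", []))
    return ids
```

With this in place, `pmids = linked_ids(data)` would replace the list comprehension.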


In [9]:
rsids = []
for pmid in pmids:
    data = Entrez.read(Entrez.elink(dbfrom='pubmed', db='snp', linkname='pubmed_snp_cited', id=pmid))
    linksets = data[0]['LinkSetDb']
    if linksets:  # skip PMIDs that return no linked SNPs
        rsids.extend(id_dict['Id'] for id_dict in linksets[0]['Link'])
rsids = list(set(rsids))  # remove duplicates (via a set) and convert back to a list
print('{} unique rsIDs were retrieved from {} PMIDs using rs{} as a search term'.format(len(rsids), len(pmids), ID))
# A possible strictness criterion: keep only rsIDs supported by several publications, to make the list more specific

885 unique rsIDs were retrieved from 122 PMIDs using rs800292 as a search term
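
The strictness criterion mentioned in the cell above could be sketched as follows. It assumes the rsIDs are kept per PMID (a dict of PMID to list of rsIDs) rather than pooled into one list; `filter_by_support` and `min_pmids` are hypothetical names, and the PMIDs in the usage lines are made up.

```python
from collections import Counter

def filter_by_support(rsids_per_pmid, min_pmids=2):
    """Keep rsIDs cited by at least `min_pmids` publications (hypothetical helper)."""
    counts = Counter()
    for rsid_list in rsids_per_pmid.values():
        counts.update(set(rsid_list))  # count each rsID once per publication
    return sorted(rsid for rsid, n in counts.items() if n >= min_pmids)

# Made-up example: rsID '2' appears in three papers, '3' in two, '1' in one
example = {"111": ["1", "2"], "222": ["2", "3"], "333": ["2", "3", "3"]}
print(filter_by_support(example, min_pmids=2))
```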


In [15]:
# Export the identified rsIDs in a space-delimited format that plink can read.
# The file will be used to extract the loci of interest from the VCFs.
# TODO: assign a number to each user (no mix-ups) and arrange to delete the file at the end of the pipeline

with open('temp_rsids.txt', 'w') as f:  # 'w' overwrites any list left by a previous run
    for item in rsids:
        f.write("rs{} ".format(item))
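
One way to avoid mix-ups between users would be to give every request its own file name and remove the file when the pipeline finishes. A minimal sketch, assuming a random `uuid4` token is an acceptable per-user identifier; `write_rsid_file` is a hypothetical helper:

```python
import os
import tempfile
import uuid

def write_rsid_file(rsids, directory=None):
    """Write rsIDs space-delimited to a uniquely named file and return its path."""
    directory = directory or tempfile.gettempdir()
    path = os.path.join(directory, "rsids_{}.txt".format(uuid.uuid4().hex))
    with open(path, "w") as handle:
        handle.write(" ".join("rs{}".format(r) for r in rsids))
    return path
```

The returned path would be passed to plink's `--extract` and deleted with `os.remove(path)` at the end of the run.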

##### 2. Upload VCFs and create patient - rsID matrix

I am currently working with the VCFs for chr21 and chr22 (the smallest autosomes) and chrY (which may not have any rsIDs; will that throw an error?) from [1000 Genomes](ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/) (downloaded on 10/13/2017).


In [26]:
from os import system, listdir
import re

In [38]:
#In Flask path will be the directory from which to upload. 
#A temp directory with the same name will be created in the server

path = input("Please enter the directory path where the vcf.gz files are located: ")

Please enter the directory path where the vcf.gz files are located: VCFs


In [84]:
# Create lists of all the uploaded VCFs and of their file prefixes (carried over as file names during plink processing)

files = listdir(path)
if not all(file.endswith('.vcf.gz') for file in files):  # checks that all files are .vcf.gz
    print("The files you uploaded are not in the <name>.vcf.gz format. Please check your uploaded folder and try again.")
else:
    prefix = [file[:-len('.vcf.gz')] for file in files]  # strip the extension only at the end of the name
# chr25 and chr26 are duplicates of chr22.

In [78]:
# The following function requires plink 1.9 (https://www.cog-genomics.org/plink2) to be installed in a directory on the PATH

def ExtractRsID(Path, Prefix):
    print("Processing {}".format(Prefix))
    system('plink --vcf {}/{}.vcf.gz --recode12 --tab --extract temp_rsids.txt --out {}'.format(Path, Prefix,Prefix))
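
In the web app the directory and file names will come from user uploads, so it may be safer to pass the arguments to plink as a list through `subprocess.run` rather than through a shell string. A minimal sketch using the same plink flags as above; `build_plink_command` and `extract_rsids` are hypothetical names:

```python
import subprocess

def build_plink_command(vcf_dir, prefix, rsid_file="temp_rsids.txt"):
    """Return the plink call from ExtractRsID as an argument list (no shell involved)."""
    return [
        "plink",
        "--vcf", "{}/{}.vcf.gz".format(vcf_dir, prefix),
        "--recode12", "--tab",
        "--extract", rsid_file,
        "--out", prefix,
    ]

def extract_rsids(vcf_dir, prefix, rsid_file="temp_rsids.txt"):
    """Run plink and return its exit code, so that failures can be detected."""
    print("Processing {}".format(prefix))
    completed = subprocess.run(
        build_plink_command(vcf_dir, prefix, rsid_file),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return completed.returncode
```

A non-zero return code could then be reported to the user instead of silently producing no `.ped` file.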

In [88]:
# Useful example: http://wltrimbl.github.io/2014-06-10-spelman/intermediate/python/04-multiprocessing.html
import multiprocessing
import progressbar

bar = progressbar.ProgressBar()
pool = multiprocessing.Pool(len(files)) # run as many processes as vcfs. The default is the number of CPU cores.
results = [pool.apply_async( ExtractRsID, (path, p) ) for p in prefix]
for result in bar(results):
    result.get()
# chr25 and chr26 are duplicates of chr22. The set of 3 chromosomes (21, 22, Y) took 1 min 6 s; the set of 4 (= no. of cores) took 1 min 18 s

## So this is not that fast (on my computer); perhaps using the faster cores at NCBI will help
## TODO: implement some sort of timer in the web interface of the app
## http://www.java2s.com/Tutorial/Java/0240__Swing/Timerbasedanimation.htm

N/A% (0 of 5) |                          | Elapsed Time: 0:00:00 ETA:  --:--:--

Processing chr22
Processing chr21
Processing chrY
Processing chr25
Processing chr26


100% (5 of 5) |###########################| Elapsed Time: 0:01:51 Time: 0:01:51
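
Before a timer exists in the web interface, the server side could at least measure and report wall-clock time for each step. A minimal sketch with the standard library; `timed` and `format_elapsed` are hypothetical helpers, and the output format mimics the progress bar's `H:MM:SS`:

```python
import time

def format_elapsed(seconds):
    """Format a duration in seconds as H:MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return "{}:{:02d}:{:02d}".format(hours, minutes, secs)

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, `result, seconds = timed(ExtractRsID, path, 'chr21')` followed by `format_elapsed(seconds)` would give a string that the app can display.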


In [113]:
#check for missing ped files

#files I have

#expected files

#make temp txt with ped files to merge

#here merge all files
#system('plink --file output --recodeAD --out twostep')  # --file takes the fileset prefix, not the .ped file name
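
The first steps listed above (check for missing ped files, then write a text file of the filesets to merge) could be sketched as follows. It assumes plink wrote `<prefix>.ped` and `<prefix>.map` for every prefix, and that the list is meant for plink's `--merge-list`, which accepts one `.ped .map` pair per line; `missing_ped_outputs`, `write_merge_list` and the file names in the usage lines are hypothetical.

```python
import os
import tempfile

def missing_ped_outputs(prefixes, directory="."):
    """Return the expected .ped/.map files that are not present in `directory`."""
    missing = []
    for prefix in prefixes:
        for ext in (".ped", ".map"):
            if not os.path.isfile(os.path.join(directory, prefix + ext)):
                missing.append(prefix + ext)
    return missing

def write_merge_list(prefixes, list_path="merge_list.txt", directory="."):
    """Write one '<prefix>.ped <prefix>.map' line per fileset and return the list path."""
    with open(list_path, "w") as handle:
        for prefix in prefixes:
            base = os.path.join(directory, prefix)
            handle.write("{}.ped {}.map\n".format(base, base))
    return list_path

# Usage on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("chr21.ped", "chr21.map", "chr22.ped"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_missing = missing_ped_outputs(["chr21", "chr22"], demo_dir)
    demo_list = write_merge_list(["chr21"], os.path.join(demo_dir, "merge_list.txt"), demo_dir)
    with open(demo_list) as handle:
        demo_lines = handle.read().splitlines()
print(demo_missing)  # ['chr22.map']
```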


