#### Prerequisites:
- pandas
- glob (part of the Python standard library)
- subprocess (part of the Python standard library)

#### Introduction
This notebook guides you through two steps: first, extending the initial list of risk SNPs (the seed SNPs, for PD or AD) with Haploreg (linkage disequilibrium) and GTEx (eQTLs); second, mapping those SNPs to their corresponding genes.

In [5]:
import glob
import pandas as pd
import subprocess

1. Query Haploreg (https://pubs.broadinstitute.org/mammals/haploreg/haploreg.php) v4.1 with the initial list of seed SNPs provided in the folder, for PD and AD separately. The files are called "PD_SNPS.txt" for Parkinson's disease and "AD_SNPS.txt" for Alzheimer's disease.

    - options: default, set output mode to Text, LD Threshold 0.8 
    
###### Important: Haploreg becomes slow when the SNP count exceeds ~80. So we first divide the seed list into subsets of moderate size (at most 80 SNPs each) and then use each subset as a separate input for Haploreg.

In [27]:
# the path to your input file = newline separated list of seed snps
# in this example we are using PD snps. Change the path according to your study (AD or PD)
seed_snps="PD_SNPS.txt"

In [6]:
#Import the initial SNP list and divide it into smaller subsets.
#The subsets are saved in the current working directory as "snps_subset_xx.txt",
#where xx is the last line number the subset can hold (80, 160, ...).
snps_per_file = 80
smallfile=None
snps_list=[] #list of seed snps for gtex later
with open(seed_snps) as snps: # newline-separated file of seed SNPs chosen above (PD in this example)
    for lineno, line in enumerate(snps):
        snps_list.append(line.rstrip())
        if lineno % snps_per_file == 0:
            if smallfile:
                smallfile.close() #close the previous subset before starting a new one
            small_filename = 'snps_subset_{}.txt'.format(lineno + snps_per_file)
            smallfile = open(small_filename,"w")
        smallfile.write(line)
    if smallfile:
        smallfile.close()
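
The splitting logic can also be sketched in memory, which makes it easy to check the subset sizes before any files are written. The helper name `chunk_snps` and the rsIDs below are hypothetical and only serve as an illustration; they are not part of the original script.

```python
def chunk_snps(snps, size=80):
    """Split a list of SNP IDs into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    return [snps[i:i + size] for i in range(0, len(snps), size)]

# Made-up rsIDs, for illustration only
example_snps = ["rs{}".format(i) for i in range(1, 201)]

chunks = chunk_snps(example_snps, size=80)
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)
```

Only the last chunk can be shorter than `size`, which mirrors the behavior of the file-based cell above.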

 #### Now query Haploreg manually on their website with each of the smaller files and put the results into the current working directory
 #### Follow these steps
 1. Go to https://pubs.broadinstitute.org/mammals/haploreg/haploreg.php
 2. Select "Build Query"
 3. Build a query with each subset; output format: Text, LD Threshold: 0.8
 4. Save the results as "haploreg_x" with x = 1,2,3,4 etc. in the current working directory. It is important to follow this naming convention, because the next cell collects the files with the pattern "haploreg*"

In [2]:
#list of all files we created
files=glob.glob("haploreg*")

#read all files dynamically and merge them into one big dataframe
#(pd.concat is used because DataFrame.append was removed in pandas 2.0)
haploreg_frames=[pd.read_csv(file,delimiter = '\t') for file in files]
haploreg_df = pd.concat(haploreg_frames) #the final big dataframe containing all snps

In [3]:
#keep columns 6 and 25 (index 5 and 24): "rsID" (SNP ID) and "GENCODE_name" (gene name)
haploreg_df=haploreg_df[['rsID','GENCODE_name']]

In [4]:
#remove rows that contain NaN values
haploreg_df_unique=haploreg_df.dropna(inplace=False)
print("NaN values removed")

NaN values removed


 #### The Haploreg query is done.
 ### 2.
 #### Now we query the GTEx portal (http://gtexportal.org/home/index.html) to extend our mapping with eQTLs. Since both Haploreg and the GTEx portal use genome build hg38, no further genome coordinate mapping is needed.
 
 #### We access it through the API and fetch only brain tissue eQTLs

In [7]:
with open('snps_gene_eqtls.csv', 'a') as outfile: # tab-separated output in the current working directory; append mode, so delete an old file before re-running
    for snp in snps_list:
        #fetch brain eQTLs only
        command="curl -X GET --header \'Accept: text/html\' \'https://gtexportal.org/rest/v1/association/singleTissueEqtl?format=tsv&snpId="+snp+"&tissueSiteDetailId=Brain_Amygdala%2CBrain_Anterior_cingulate_cortex_BA24%2CBrain_Caudate_basal_ganglia%2CBrain_Cerebellar_Hemisphere%2CBrain_Cerebellum%2CBrain_Cortex%2CBrain_Frontal_Cortex_BA9%2CBrain_Hippocampus%2CBrain_Hypothalamus%2CBrain_Nucleus_accumbens_basal_ganglia%2CBrain_Putamen_basal_ganglia%2CBrain_Spinal_cord_cervical_c-1%2CBrain_Substantia_nigra&datasetId=gtex_v8\'"
        subprocess.call(command,shell=True,stdout=outfile) #that might take a few minutes
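
The long query string in the cell above is easier to read and edit when it is assembled from a list of tissues. The following is a minimal sketch that only builds the URL and does not send a request; the endpoint, parameter names and tissue IDs are copied from the curl command above, while the names `GTEX_EQTL_URL`, `BRAIN_TISSUES` and `build_eqtl_url` are hypothetical.

```python
from urllib.parse import urlencode

# Endpoint copied from the curl command above
GTEX_EQTL_URL = "https://gtexportal.org/rest/v1/association/singleTissueEqtl"

# Brain tissue IDs copied from the curl command above
BRAIN_TISSUES = [
    "Brain_Amygdala",
    "Brain_Anterior_cingulate_cortex_BA24",
    "Brain_Caudate_basal_ganglia",
    "Brain_Cerebellar_Hemisphere",
    "Brain_Cerebellum",
    "Brain_Cortex",
    "Brain_Frontal_Cortex_BA9",
    "Brain_Hippocampus",
    "Brain_Hypothalamus",
    "Brain_Nucleus_accumbens_basal_ganglia",
    "Brain_Putamen_basal_ganglia",
    "Brain_Spinal_cord_cervical_c-1",
    "Brain_Substantia_nigra",
]

def build_eqtl_url(snp_id):
    """Return the query URL for one SNP; commas are percent-encoded as %2C."""
    params = {
        "format": "tsv",
        "snpId": snp_id,
        "tissueSiteDetailId": ",".join(BRAIN_TISSUES),
        "datasetId": "gtex_v8",
    }
    return GTEX_EQTL_URL + "?" + urlencode(params)

url = build_eqtl_url("rs6088792")
print(url)
```

The resulting string could be passed to curl in place of the hand-written URL; adding or removing a tissue then only means editing the list.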

In [11]:
#open the output
eQTL_query=pd.read_csv("snps_gene_eqtls.csv",sep="\t")
#keep only snpID and gene symbol column
eQTL_query=eQTL_query[["snpId","geneSymbolUpper"]]
print (eQTL_query)

           snpId  geneSymbolUpper
0      rs6088792             GDF5
1      rs6088792          RPL36P4
2      rs6088792          RPL36P4
3      rs6088792             GDF5
4      rs6088792          RPL36P4
...          ...              ...
15687  rs9841498        MCCC1-AS1
15688      snpId  geneSymbolUpper
15689      snpId  geneSymbolUpper
15690      snpId  geneSymbolUpper
15691      snpId  geneSymbolUpper

[15692 rows x 2 columns]


In [13]:
#drop duplicates
eQTL_query=eQTL_query.drop_duplicates(keep="first",inplace=False)
print ("Size without duplicates:",len(eQTL_query))

Size without duplicates: 2434


In [15]:
eQTL_query.head(10) #row number 6 contains column headers

Unnamed: 0,snpId,geneSymbolUpper
0,rs6088792,GDF5
1,rs6088792,RPL36P4
5,rs6088792,UQCC1
10,rs6088792,MAP1LC3A
13,rs6088792,MYH7B
15,snpId,geneSymbolUpper
18,rs2274432,TSEN15
28,rs6088813,MMP24-AS1
29,rs6088813,GDF5
30,rs6088813,UQCC1


In [18]:
#we drop 6th row because it contains column headers. 
#Since we removed all duplicates, this row is the only instance of column headers within the data.
eQTL_query=eQTL_query.drop(eQTL_query.index[5]) #drop it
print("final size",len(eQTL_query))
print (eQTL_query.head(10)) # done

final size 2431
        snpId geneSymbolUpper
0   rs6088792            GDF5
1   rs6088792         RPL36P4
5   rs6088792           UQCC1
10  rs6088792        MAP1LC3A
13  rs6088792           MYH7B
29  rs6088813            GDF5
30  rs6088813           UQCC1
31  rs6088813         RPL36P4
34  rs6088813        MAP1LC3A
36  rs6088813         TRPC4AP
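
Dropping the header row by position works only after duplicates have been removed and only while the header happens to sit at that position. A more robust variant, sketched here on a small toy table, filters the repeated header rows by value. The toy input imitates the concatenated output of several queries, and the names `raw`, `toy` and `toy_clean` are hypothetical.

```python
import io
import pandas as pd

# Toy stand-in for snps_gene_eqtls.csv: the header line is repeated
# in the middle and at the end, as in the concatenated query output
raw = (
    "snpId\tgeneSymbolUpper\n"
    "rs6088792\tGDF5\n"
    "rs6088792\tGDF5\n"
    "snpId\tgeneSymbolUpper\n"
    "rs6088813\tUQCC1\n"
    "snpId\tgeneSymbolUpper\n"
)
toy = pd.read_csv(io.StringIO(raw), sep="\t")

# Keep only rows whose snpId is not the literal column name, then deduplicate
toy_clean = toy[toy["snpId"] != "snpId"].drop_duplicates().reset_index(drop=True)
print(toy_clean)
```

Because the filter does not depend on row order, it can be applied before or after `drop_duplicates`.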


### 3. 
#### Now combine Haploreg and GTEx results.
####  Important: only consider eQTL SNPs that are in the Haploreg results

In [19]:
#Drop snps from the GTEx results that are NOT in the Haploreg results
haploreg_ids=haploreg_df_unique.rsID.tolist() #haploreg snp ids in a list
eQTL_query_haplofilter=eQTL_query[eQTL_query["snpId"].isin(haploreg_ids)] #filter GTEx results using the haploreg ID list

In [20]:
#renaming eQTL_query_haplofilter column names to make it consistent with the Haploreg dataframe
eQTL_query_haplofilter.columns=["rsID","GENCODE_name"]

In [22]:
# finally join both haploreg_df_unique and eQTL_query_haplofilter
frames = [haploreg_df_unique,eQTL_query_haplofilter]
snps_gene_haploreg_gtex = pd.concat(frames, keys=['haplo', 'gtex'])
snps_gene_haploreg_gtex #we are almost done

Unnamed: 0,Unnamed: 1,rsID,GENCODE_name
haplo,0,rs1533317,U6
haplo,1,rs71566446,RP11-715G15.1
haplo,2,rs13220141,RP11-715G15.1
haplo,3,rs13215778,RP11-715G15.1
haplo,4,rs71566448,RP11-715G15.1
...,...,...,...
gtex,15599,rs878888,MAPT
gtex,15600,rs878888,MAPT-IT1
gtex,15610,rs878888,NSF
gtex,15613,rs894278,MMRN1


In [25]:
#remove duplicates from joined dataframe
snps_gene_haploreg_gtex_unique=snps_gene_haploreg_gtex.drop_duplicates(keep="first",inplace=False)
print("Size of the final Dataframe:",len(snps_gene_haploreg_gtex_unique)) #done

Size of the final Dataframe: 8911
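
The whole combination step (filter, rename, concatenate, deduplicate) can be checked on toy data before running it on the real tables. This is a minimal sketch; the SNP IDs and gene names are placeholders, and the variable names `haplo_toy`, `gtex_toy`, `gtex_kept`, `combined` and `pairs` are hypothetical.

```python
import pandas as pd

# Placeholder Haploreg-style table (rsID, GENCODE_name)
haplo_toy = pd.DataFrame({
    "rsID": ["rsA", "rsB", "rsB"],
    "GENCODE_name": ["GENE1", "GENE2", "GENE2"],
})

# Placeholder GTEx-style table (snpId, geneSymbolUpper)
gtex_toy = pd.DataFrame({
    "snpId": ["rsA", "rsA", "rsC"],
    "geneSymbolUpper": ["GENE1", "GENE3", "GENE4"],
})

# Keep only eQTL SNPs that also appear in the Haploreg table
gtex_kept = gtex_toy[gtex_toy["snpId"].isin(haplo_toy["rsID"])]

# Align the column names with the Haploreg table
gtex_kept = gtex_kept.rename(columns={"snpId": "rsID", "geneSymbolUpper": "GENCODE_name"})

# Stack both tables and drop duplicate SNP-gene pairs
combined = pd.concat([haplo_toy, gtex_kept], ignore_index=True).drop_duplicates()
pairs = list(combined.itertuples(index=False, name=None))
print(pairs)
```

Here `rsC` is discarded because it is absent from the Haploreg table, and the pair (`rsA`, `GENE1`) is kept only once although both sources report it.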


In [None]:
# write the final (deduplicated) dataframe to your current working directory
snps_gene_haploreg_gtex_unique.to_csv("snps_gene_haploreg_gtex.csv",index=False)