# ALZML

The goal of this project is to use publicly available datasets to construct a classifier that identifies the features that confer Alzheimer's Disease (AD). To do so, we are combining SNP data with metadata to generate a feature-rich dataset.

First, we will identify regions of interest (ROIs) from SNP data that are known to be involved in AD. From there, adjacent features will be extracted to serve as a feature profile of each ROI. Once all ROIs have been profiled, we will run a hierarchical clustering algorithm to determine the distinguishing features of these ROIs.

From these features, we hope to construct a hidden Markov model (HMM) that can identify some of the underrepresented variants that may confer AD.

Data are sourced from:
- gnomAD
- IGAP
- TBI Study
- Zou eGWAS
- Mayo eGWAS
- NG00061
- NG00039

IGAP data will be used to define each ROI. A ±5 kb window around each SNP position will serve as an anchor for gathering local features with bedtools, which retains positional information.
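The ±5 kb window can be sketched as a small helper that turns SNP positions into BED intervals. This is only an illustration: the names `snp_window` and `to_bed_lines` are hypothetical, the toy coordinates are made up, and it assumes the SNP positions are 1-based while BED intervals are 0-based and half-open.

```python
def snp_window(chrom, pos, flank=5000):
    """Return a 0-based, half-open BED interval covering pos +/- flank.

    Assumes `pos` is a 1-based SNP coordinate.
    """
    start = max(0, pos - 1 - flank)  # BED start is 0-based; clamp at the chromosome start
    end = pos + flank                # BED end is exclusive
    return (str(chrom), start, end)


def to_bed_lines(snps, flank=5000):
    """Format (chrom, pos) pairs as tab-separated BED lines."""
    return ['\t'.join(map(str, snp_window(c, p, flank))) for c, p in snps]


# Toy (chr, pos) pairs, not real IGAP hits
toy_snps = [(10, 60494), (10, 3000)]
bed_lines = to_bed_lines(toy_snps)
```

Written to a file, lines like these could then be intersected with a feature track, for example with `bedtools intersect -a windows.bed -b features.bed -wa -wb`, where both file names are placeholders.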

### Only run bash block if files need to be parsed properly (shouldn't happen)

In [1]:
%%bash
#head -n 10 ng00061/WES_release3AtlasOnly_vep80_most_severe_consequence_per_gene.txt
# These have different numbers of columns of course. Need to fix this maybe? Will only use the coordinates from this
# for now, this is the primary key to link all the other datasets.
#head -n 10 ng00061/WES_release3AtlasOnly_rolling_flat_annotation.txt
#cut -f1-5 WGS_v1_rolling_flat_annotation.txt > WGS_v1_rolling_flat_annotation.pos_only.txt

#head -n 10 ../IGAP_summary_statistics/IGAP_stage_1_2_combined.txt

# I will join all of the chromosomes into one file.
#cat /home/twaddlac/Hackthan_AD/ng00039/pvalue_only/metaanalysis/pvalueonly_METAANALYSIS1_chr*.TBL | perl -pe 's/^(\d+)-(\d+)/$1\t$2/g' > pvalue_only.tsv
#head -n 10 ../pvalue_only.tsv

#I'm not sure what the difference is between these but I am assuming we get the inverse of controls here.
# We will use the controls for a confusion matrix
#ls /data/Hackthan_AD/Mayo_eGWAS/
#head /data/Hackthan_AD/Mayo_eGWAS/Hap300_CER_All.txt

# Only going to use the coordinates of the annotation files since they have mismatching columns.
##### Only run this once #####
#cut -f1-5 /data/Hackthan_AD/NG00061/WGS_v1_rolling_flat_annotation.txt > /data/Hackthan_AD/NG00061/WGS_v1_rolling_flat_annotation.pos_only.txt

## only need to run once
#cat <(grep -m1 '^Marker' /home/twaddlac/Hackthan_AD/ng00039/pvalue_only/pvalue/pvalueonly_METAANALYSIS1_chr10.TBL) <(cat /home/twaddlac/Hackthan_AD/ng00039/pvalue_only/pvalue/*TBL | perl -pe 's/(\d+)-(\d+)/$1\t$2/g'| grep -v '^Marker') > /home/twaddlac/Hackthan_AD/ng00039/pvalue_only/pvalue/pvalue.tsv

## IGAP Data

In [3]:
import pandas as pd
igap1 = pd.read_csv('/data/Hackthan_AD/IGAP_summary_statistics/IGAP_stage_1.txt', sep='\t')
igap12 = pd.read_csv('/data/Hackthan_AD/IGAP_summary_statistics/IGAP_stage_1_2_combined.txt', sep='\t')
igap1.rename(columns={
    'Chromosome':'chr',
    'Position':'pos'
}, inplace=True)
igap12.rename(columns={
    'Chromosome':'chr',
    'Position':'pos'
}, inplace=True)

## NG00061

In [4]:
anno = pd.read_csv('/data/Hackthan_AD/NG00061/WGS_v1_rolling_flat_annotation.pos_only.txt', sep='\t', header=0)

In [5]:
conseq = pd.read_csv('/data/Hackthan_AD/NG00061/WGS_v1_vep80_most_severe_consequence_per_gene.txt', sep='\t', header=0)

## NG00039

In [6]:
pvalue = pd.read_csv('/home/twaddlac/Hackthan_AD/ng00039/pvalue_only/pvalue/pvalue.tsv', sep='\t', header=0, index_col=False)
# pvalue.columns = ['chr','pos','allele1','allele2','pvalue']

## Mayo_eGWAS

In [7]:
hapCerAd = pd.read_csv('/home/twaddlac/Hackthan_AD/Mayo_eGWAS/Hap300_CER_AD.txt', sep='\t', header=0)
hapTxAd = pd.read_csv('/home/twaddlac/Hackthan_AD/Mayo_eGWAS/Hap300_TX_AD.txt', sep='\t', header=0)
hapmapCerAd = pd.read_csv('/home/twaddlac/Hackthan_AD/Mayo_eGWAS/HapMap2_CER_AD.txt', sep='\t', header=0)
hapmapTxAd = pd.read_csv('/home/twaddlac/Hackthan_AD/Mayo_eGWAS/HapMap2_TX_AD.txt', sep='\t', header=0)

hapCerAd.rename(columns={'CHR':'chr','BP':'pos'}, inplace=True)
hapTxAd.rename(columns={'CHR':'chr','BP':'pos'}, inplace=True)
hapmapCerAd.rename(columns={'CHR':'chr','BP':'pos'}, inplace=True)
hapmapTxAd.rename(columns={'CHR':'chr','BP':'pos'}, inplace=True)

## TBI Study Expression Data
This dataset contains much more data, which we can import later.

In [7]:
ge = pd.read_csv('/home/twaddlac/Hackthan_AD/TBI_study/gene_expression_matrix_2016-03-03/fpkm_table_normalized.csv', sep=',', header=0)

# Joining Datasets
The NG00061 annotation data will link everything together, since it should hold the complete set of SNPs. I will join on the coordinates (`chr`, `pos`) from this table.

In [8]:
headers = ['chr','pos']
temp = anno.merge(pvalue, how='left', on=headers)

In [9]:
temp

| | chr | pos | alt_allele | seq_meta_var_id | epacts_var_id | Allele1 | Allele2 | P-value |
|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 60494 | G | 10:60494A,G | 10:60494_A/G | a | g | 0.32420 |
| 1 | 10 | 60523 | G | 10:60523T,G | 10:60523_T/G | t | g | 0.37210 |
| 2 | 10 | 61331 | G | 10:61331A,G | 10:61331_A/G | a | g | 0.75160 |
| 3 | 10 | 61334 | A | 10:61334G,A | 10:61334_G/A | | | |
| 4 | 10 | 61646 | G | 10:61646A,G | 10:61646_A/G | | | |
| 5 | 10 | 61654 | A | 10:61654G,A | 10:61654_G/A | | | |
| 6 | 10 | 61766 | A | 10:61766G,A | 10:61766_G/A | | | |
| 7 | 10 | 62450 | A | 10:62450G,A | 10:62450_G/A | | | |
| 8 | 10 | 66295 | T | 10:66295C,T | 10:66295_C/T | | | |
| 9 | 10 | 66326 | G | 10:66326A,G | 10:66326_A/G | a | g | 0.58810 |
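The empty `P-value` cells in the output above are annotation rows with no matching NG00039 record. A minimal sketch of how the match rate could be checked uses the `indicator=True` option of `DataFrame.merge`. The frames `toy_anno` and `toy_pvalue` are small stand-ins built from three of the rows shown, not the real tables.

```python
import pandas as pd

# Toy stand-ins for `anno` and `pvalue`, using coordinates from the output above
toy_anno = pd.DataFrame({
    'chr': [10, 10, 10],
    'pos': [60494, 60523, 61334],
    'alt_allele': ['G', 'G', 'A'],
})
toy_pvalue = pd.DataFrame({
    'chr': [10, 10],
    'pos': [60494, 60523],
    'P-value': [0.3242, 0.3721],
})

# indicator=True adds a `_merge` column marking each row as 'both' or 'left_only'
checked = toy_anno.merge(toy_pvalue, how='left', on=['chr', 'pos'], indicator=True)
match_counts = checked['_merge'].value_counts()
```

Here `match_counts['left_only']` gives the number of annotation rows that found no p-value. If the real files store `chr` with different dtypes (for example, integers in one and strings in another), the keys would need to be cast to a common type before merging.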


In [None]:
# Breaks kernel. Don't run in notebook.
from functools import reduce
dataframes = [
    anno,
    pvalue,
    conseq,
    igap1,
    igap12,
    hapCerAd,
    hapTxAd,
    hapmapCerAd,
    hapmapTxAd
]
df_merged = reduce(
    lambda left, right: pd.merge(left, right, on=['chr', 'pos'], how='outer'),
    dataframes
).fillna('void')
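One possible reason the chained outer merge exhausts memory is that repeated `(chr, pos)` keys multiply rows at every step. That is an assumption; the notebook does not confirm the cause. The sketch below shows a lighter alternative under that assumption: deduplicate each table on the keys, prefix its columns with a table name, and left-join it onto the annotation. The helpers `prepare` and `left_join_all`, the table names, and the toy values are all hypothetical.

```python
from functools import reduce

import pandas as pd

KEYS = ['chr', 'pos']


def prepare(df, name):
    """Keep one row per (chr, pos) and prefix non-key columns with `name`."""
    out = df.drop_duplicates(subset=KEYS)
    return out.rename(columns={c: f'{name}_{c}' for c in out.columns if c not in KEYS})


def left_join_all(base, others):
    """Left-join every table in `others` (name -> DataFrame) onto `base`."""
    return reduce(
        lambda left, item: left.merge(prepare(item[1], item[0]), on=KEYS, how='left'),
        others.items(),
        base,
    )


# Toy data only; the second frame repeats a key to show that rows do not multiply
toy_anno = pd.DataFrame({'chr': [10, 10, 10], 'pos': [60494, 60523, 61334]})
toy_pvalue = pd.DataFrame({
    'chr': [10, 10, 10],
    'pos': [60494, 60494, 60523],
    'P-value': [0.3242, 0.3242, 0.3721],
})
toy_igap = pd.DataFrame({'chr': [10], 'pos': [61334], 'Pvalue': [0.01]})

joined = left_join_all(toy_anno, {'ng00039': toy_pvalue, 'igap1': toy_igap})
```

Because every join is a left join against deduplicated keys, the result keeps exactly one row per annotation row. Dropping duplicates discards information when a table legitimately has several records per position, so this trade-off would need checking against the real data.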