The goal of this notebook is to join all of the datasets into one large matrix that can then be fed into the autoencoder. To start, let's look at each dataset we have.

### Only run the bash block if the files need to be re-parsed (this shouldn't be necessary)

In [8]:
%%bash
#head -n 10 ng00061/WES_release3AtlasOnly_vep80_most_severe_consequence_per_gene.txt
# These files have differing numbers of columns, which may need fixing later. For now, only the coordinates
# are used, since they are the primary key that links all the other datasets.
#head -n 10 ng00061/WES_release3AtlasOnly_rolling_flat_annotation.txt
#cut -f1-5 WGS_v1_rolling_flat_annotation.txt > WGS_v1_rolling_flat_annotation.pos_only.txt

#head -n 10 ../IGAP_summary_statistics/IGAP_stage_1_2_combined.txt

# I will join all of the chromosomes into one file.
#cat /home/twaddlac/Hackthan_AD/ng00039/pvalue_only/metaanalysis/pvalueonly_METAANALYSIS1_chr*.TBL | perl -pe 's/^(\d+)-(\d+)/$1\t$2/g' > pvalue_only.tsv
#head -n 10 ../pvalue_only.tsv

#I'm not sure what the difference is between these but I am assuming we get the inverse of controls here.
# We will use the controls for a confusion matrix
#ls /data/Hackthan_AD/Mayo_eGWAS/
#head /data/Hackthan_AD/Mayo_eGWAS/Hap300_CER_All.txt

# Only going to use the coordinates of the annotation files since they have mismatching columns.
##### Only run this once #####
#cut -f1-5 /data/Hackthan_AD/NG00061/WGS_v1_rolling_flat_annotation.txt > /data/Hackthan_AD/NG00061/WGS_v1_rolling_flat_annotation.pos_only.txt

## only need to run once
#cat <(grep -m1 '^Marker' /home/twaddlac/Hackthan_AD/ng00039/pvalue_only/pvalue/pvalueonly_METAANALYSIS1_chr10.TBL) <(cat /home/twaddlac/Hackthan_AD/ng00039/pvalue_only/pvalue/*TBL | perl -pe 's/(\d+)-(\d+)/$1\t$2/g'| grep -v '^Marker') > /home/twaddlac/Hackthan_AD/ng00039/pvalue_only/pvalue/pvalue.tsv
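
The perl substitution above turns a `chr-pos` marker into two tab-separated fields. For reference, the same split can be sketched in pandas. The column name `MarkerName` and the toy rows are assumptions for illustration only; the real header in the `.TBL` files may differ.

```python
import pandas as pd

# Toy stand-in for one pvalueonly_METAANALYSIS1_chr*.TBL file.
# ASSUMPTION: the marker column is called "MarkerName" and looks like "chr-pos".
tbl = pd.DataFrame({
    "MarkerName": ["10-60494", "10-60523", "10-61331"],
    "Allele1": ["a", "t", "a"],
    "Allele2": ["g", "g", "g"],
    "P-value": [0.3242, 0.3721, 0.7516],
})

# Split "chr-pos" on the first dash into two separate columns.
split_marker = tbl["MarkerName"].str.split("-", n=1, expand=True)
tbl.insert(0, "chr", split_marker[0])
tbl.insert(1, "pos", split_marker[1].astype(int))

# The combined marker is no longer needed once chr and pos exist.
tbl = tbl.drop(columns="MarkerName")
```

Doing the split in pandas keeps the header and the data rows in step, so the column count stays consistent.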

## IGAP Data

In [9]:
import pandas as pd
igap1 = pd.read_csv('/data/Hackthan_AD/IGAP_summary_statistics/IGAP_stage_1.txt', sep='\t')
igap12 = pd.read_csv('/data/Hackthan_AD/IGAP_summary_statistics/IGAP_stage_1_2_combined.txt', sep='\t')
igap1.rename(columns={
    'Chromosome':'chr',
    'Position':'pos'
}, inplace=True)
igap12.rename(columns={
    'Chromosome':'chr',
    'Position':'pos'
}, inplace=True)

## NG00061

In [10]:
anno = pd.read_csv('/data/Hackthan_AD/NG00061/WGS_v1_rolling_flat_annotation.pos_only.txt', sep='\t', header=0)

In [11]:
conseq = pd.read_csv('/data/Hackthan_AD/NG00061/WGS_v1_vep80_most_severe_consequence_per_gene.txt', sep='\t', header=0)

## NG00039

In [None]:
pvalue = pd.read_csv('/home/twaddlac/Hackthan_AD/ng00039/pvalue_only/pvalue/pvalue.tsv', sep='\t', header=0, index_col=False)
# pvalue.columns = ['chr','pos','allele1','allele2','pvalue']

## Mayo_eGWAS

In [None]:
hapCerAd = pd.read_csv('/home/twaddlac/Hackthan_AD/Mayo_eGWAS/Hap300_CER_AD.txt', sep='\t', header=0)
hapTxAd = pd.read_csv('/home/twaddlac/Hackthan_AD/Mayo_eGWAS/Hap300_TX_AD.txt', sep='\t', header=0)
hapmapCerAd = pd.read_csv('/home/twaddlac/Hackthan_AD/Mayo_eGWAS/HapMap2_CER_AD.txt', sep='\t', header=0)
hapmapTxAd = pd.read_csv('/home/twaddlac/Hackthan_AD/Mayo_eGWAS/HapMap2_TX_AD.txt', sep='\t', header=0)

hapCerAd.rename(columns={'CHR':'chr','BP':'pos'}, inplace=True)
hapTxAd.rename(columns={'CHR':'chr','BP':'pos'}, inplace=True)
hapmapCerAd.rename(columns={'CHR':'chr','BP':'pos'}, inplace=True)
hapmapTxAd.rename(columns={'CHR':'chr','BP':'pos'}, inplace=True)

## TBI Study Expression Data
There's a lot more data for this dataset but we can import it later.

In [None]:
ge = pd.read_csv('/home/twaddlac/Hackthan_AD/TBI_study/gene_expression_matrix_2016-03-03/fpkm_table_normalized.csv', sep=',', header=0)

# Joining Datasets
The annotation data from NG00061 links everything together, since it should hold the complete set of SNPs. The `chr` and `pos` coordinates from this table serve as the join keys.
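
Joining on coordinates only works if `chr` and `pos` have the same dtype in every table. If one file is read with integer chromosomes and another with strings, the merge can raise an error or fail to match rows. A minimal sketch of a normalization step follows; `normalize_keys` is a hypothetical helper that is not part of this notebook, and the toy frames stand in for the real tables.

```python
import pandas as pd

def normalize_keys(df):
    """Return a copy with 'chr' as string and 'pos' as nullable integer."""
    out = df.copy()
    out["chr"] = out["chr"].astype(str)
    # Unparseable positions become <NA> instead of raising.
    out["pos"] = pd.to_numeric(out["pos"], errors="coerce").astype("Int64")
    return out

# Toy tables whose key dtypes disagree (int vs. str).
left = pd.DataFrame({"chr": [10, 10], "pos": [60494, 60523], "alt_allele": ["G", "G"]})
right = pd.DataFrame({"chr": ["10", "10"], "pos": ["60494", "61331"], "pvalue": [0.3242, 0.7516]})

merged = normalize_keys(left).merge(normalize_keys(right), how="left", on=["chr", "pos"])
```

Here only position 60494 is present in both toy tables, so the second row of `merged` keeps a missing `pvalue`.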

In [None]:
headers = ['chr', 'pos']
temp = anno.merge(pvalue, how='left', on=headers)

In [None]:
temp.head()

| | chr | pos | alt_allele | seq_meta_var_id | epacts_var_id | Allele1 | Allele2 | P-value |
|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 60494 | G | 10:60494A,G | 10:60494_A/G | a | g | 0.3242 |
| 1 | 10 | 60523 | G | 10:60523T,G | 10:60523_T/G | t | g | 0.3721 |
| 2 | 10 | 61331 | G | 10:61331A,G | 10:61331_A/G | a | g | 0.7516 |
| 3 | 10 | 61334 | A | 10:61334G,A | 10:61334_G/A | NaN | NaN | NaN |
| 4 | 10 | 61646 | G | 10:61646A,G | 10:61646_A/G | NaN | NaN | NaN |
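
The empty allele and p-value cells in rows 3 and 4 are annotation positions with no match in the p-value table. One way to count such rows is the `indicator` parameter of `merge`, sketched here on toy frames that mirror the five rows above (`anno_toy` and `pvalue_toy` are illustrative names, not notebook variables).

```python
import pandas as pd

# Toy copies of the coordinates shown in the head() output.
anno_toy = pd.DataFrame({
    "chr": [10, 10, 10, 10, 10],
    "pos": [60494, 60523, 61331, 61334, 61646],
})
pvalue_toy = pd.DataFrame({
    "chr": [10, 10, 10],
    "pos": [60494, 60523, 61331],
    "P-value": [0.3242, 0.3721, 0.7516],
})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only".
checked = anno_toy.merge(pvalue_toy, how="left", on=["chr", "pos"], indicator=True)
match_counts = checked["_merge"].value_counts()

# Annotation rows that found no p-value.
n_unmatched = int((checked["_merge"] == "left_only").sum())
```

Running the same check on the full tables would show how sparse each joined column will be before it reaches the autoencoder.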


In [None]:
from functools import reduce
dataframes = [
    anno,
    pvalue,
    conseq,
    igap1,
    igap12,
    hapCerAd,
    hapTxAd,
    hapmapCerAd,
    hapmapTxAd
]
# Outer-join every table on the shared coordinates, then mark missing values as 'void'.
df_merged = reduce(
    lambda left, right: pd.merge(left, right, on=['chr', 'pos'], how='outer'),
    dataframes
).fillna('void')
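
Two things may be worth checking in the merge above. Several tables probably share column names other than `chr` and `pos`, which makes pandas append `_x`/`_y` suffixes and can cause duplicate-column problems over repeated merges. In addition, `fillna('void')` turns numeric columns into object columns, which an autoencoder cannot consume directly. The sketch below shows one possible alternative: prefix each table's non-key columns with a label before merging and leave missing values as NaN. `merge_on_coords`, the labels, and the toy frames are all hypothetical.

```python
from functools import reduce

import pandas as pd

KEYS = ["chr", "pos"]

def merge_on_coords(named_frames, how="outer"):
    """Merge {label: DataFrame} on KEYS, prefixing non-key columns with the label."""
    prefixed = [
        df.rename(columns={c: f"{label}_{c}" for c in df.columns if c not in KEYS})
        for label, df in named_frames.items()
    ]
    return reduce(lambda left, right: pd.merge(left, right, on=KEYS, how=how), prefixed)

# Toy tables that all carry a column with the same name.
toy = {
    "igap1": pd.DataFrame({"chr": [10, 10], "pos": [60494, 60523], "Pvalue": [0.3, 0.4]}),
    "igap12": pd.DataFrame({"chr": [10], "pos": [60494], "Pvalue": [0.2]}),
    "mayo": pd.DataFrame({"chr": [10], "pos": [61331], "Pvalue": [0.7]}),
}

# Missing values stay NaN, so the value columns remain numeric.
merged = merge_on_coords(toy)
```

Keeping NaN rather than a string placeholder leaves the choice of imputation or masking to the model-preparation step.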