# Pre-process eQTL meta-analysis data

This script reads in the eQTL meta-analysis output at [syn16984815](https://www.synapse.org/#!Synapse:syn16984815) and condenses it for ingest into Agora. For each gene (Ensembl ID), the file lists multiple SNPs, each with its own p-value and FDR. Here we take the smallest FDR for each gene, mark the gene as significant if that FDR is <= 0.05, and discard all other rows for that gene.

The output is a data frame with one row per Ensembl ID, with two columns: one for the Ensembl ID and one for whether the gene was significant for at least one SNP. The output file is uploaded to [syn12514912](https://www.synapse.org/#!Synapse:syn12514912).

In [1]:
from agoradatatools.etl import utils
import pandas as pd
syn = utils._login_to_synapse()

Welcome, Jaclyn Beck!



This file will take a while to download, as it's >17 GB.

In [2]:
eqtl_meta = syn.get("syn16984815")

This file is extremely large and has a lot of columns we don't need, so we ask `read_csv` to load only the two columns we are interested in.

In [3]:
eqtl_meta = pd.read_csv(eqtl_meta.path, usecols=["gene", "FDR"])
eqtl_meta.head()

Unnamed: 0,gene,FDR
0,ENSG00000227232,0.9628
1,ENSG00000227232,0.756904
2,ENSG00000227232,0.958278
3,ENSG00000227232,0.961634
4,ENSG00000227232,0.892694
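To see what `usecols` does without downloading the full file, here is a minimal sketch on a made-up, in-memory CSV. The `snp_id` and `beta` columns and the `GENE_A`/`GENE_B` IDs are hypothetical and are not taken from the real file. Note that `read_csv` returns the selected columns in file order, not in the order they are listed in `usecols`.

```python
import io

import pandas as pd

# Hypothetical toy file: only "gene" and "FDR" mirror the real column names.
toy_csv = io.StringIO(
    "snp_id,FDR,gene,beta\n"
    "snp1,0.90,GENE_A,0.10\n"
    "snp2,0.01,GENE_A,0.42\n"
    "snp3,0.20,GENE_B,0.05\n"
)

# Only the requested columns are parsed; the others are skipped entirely.
toy = pd.read_csv(toy_csv, usecols=["gene", "FDR"])

# Columns come back in file order ("FDR" precedes "gene" in this toy file).
print(list(toy.columns))
```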


Take the minimum FDR for each gene, and determine significance. 

In [4]:
eqtl_meta = eqtl_meta.groupby("gene")["FDR"].agg("min").reset_index()
eqtl_meta["has_eqtl"] = eqtl_meta["FDR"] <= 0.05
eqtl_meta["has_eqtl"].value_counts()

True     18395
False      997
Name: has_eqtl, dtype: int64
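If loading both columns at once ever becomes a memory problem, the same per-gene minimum could be computed in chunks. The sketch below is an alternative to the cell above, not what this notebook runs: `min_fdr_per_gene` is a hypothetical helper, and the toy data and gene IDs are invented. It also shows that a gene whose smallest FDR is exactly 0.05 counts as significant.

```python
import io

import pandas as pd


def min_fdr_per_gene(csv_source, chunksize):
    """Return the smallest FDR per gene, reading the CSV in chunks."""
    partial_minima = []
    for chunk in pd.read_csv(csv_source, usecols=["gene", "FDR"], chunksize=chunksize):
        # Minimum within this chunk only; a gene may appear in several chunks.
        partial_minima.append(chunk.groupby("gene")["FDR"].min())
    # Combine the per-chunk minima and take the minimum again per gene.
    return pd.concat(partial_minima).groupby(level=0).min()


# Invented toy data; rows for the same gene are deliberately split across chunks.
toy_csv = io.StringIO(
    "gene,FDR\n"
    "GENE_A,0.90\n"
    "GENE_B,0.20\n"
    "GENE_A,0.01\n"
    "GENE_C,0.05\n"
    "GENE_B,0.06\n"
)

result = min_fdr_per_gene(toy_csv, chunksize=2).reset_index()
result["has_eqtl"] = result["FDR"] <= 0.05
print(result)
```

Taking the minimum of per-chunk minima gives the same answer as a single `groupby` over the whole table, because the minimum is unaffected by how the rows are partitioned.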

Rename "gene" column and get rid of "FDR" column.

In [5]:
eqtl_meta = eqtl_meta.rename(columns={"gene": "ensembl_gene_id"}).drop(columns="FDR")
eqtl_meta

Unnamed: 0,ensembl_gene_id,has_eqtl
0,ENSG00000000419,True
1,ENSG00000000457,True
2,ENSG00000000460,True
3,ENSG00000000938,True
4,ENSG00000000971,True
...,...,...
19387,ENSG00000282936,True
19388,ENSG00000283041,True
19389,ENSG00000283050,True
19390,ENSG00000283078,True


Make sure there are no NA values in the data frame.

In [6]:
print(eqtl_meta["ensembl_gene_id"].isna().any())
print(eqtl_meta["has_eqtl"].isna().any())

False
False
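The NA check could be extended into a slightly broader sanity check on the output. This is an optional sketch, not part of the original workflow: `find_output_problems` is a hypothetical helper, and the expectations it encodes (exactly two columns, unique IDs, a boolean flag) are assumptions based on the description at the top of this notebook.

```python
import pandas as pd


def find_output_problems(df):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    expected = ["ensembl_gene_id", "has_eqtl"]
    if list(df.columns) != expected:
        # Stop early: the remaining checks rely on these columns existing.
        return [f"unexpected columns: {list(df.columns)}"]
    problems = []
    if df.isna().any().any():
        problems.append("NA values present")
    if not df["ensembl_gene_id"].is_unique:
        problems.append("duplicate Ensembl IDs")
    if not pd.api.types.is_bool_dtype(df["has_eqtl"]):
        problems.append("has_eqtl is not boolean")
    return problems


# Invented toy frames for illustration.
good = pd.DataFrame({"ensembl_gene_id": ["GENE_A", "GENE_B"], "has_eqtl": [True, False]})
bad = pd.DataFrame({"ensembl_gene_id": ["GENE_A", "GENE_A"], "has_eqtl": [True, False]})

print(find_output_problems(good))
print(find_output_problems(bad))
```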


Write to a file.

In [7]:
eqtl_meta.to_csv("../output/eqtl_meta_analysis.csv", index=False)
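`to_csv` does not create missing directories, so the write above assumes `../output` already exists. A minimal sketch of creating the directory first and reading the file back to confirm it round-trips is shown below. It uses a temporary directory and an invented two-row frame so it can run anywhere; the real path and data are not touched.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented stand-in for the real output frame.
toy = pd.DataFrame({"ensembl_gene_id": ["GENE_A", "GENE_B"], "has_eqtl": [True, False]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "output" / "eqtl_meta_analysis.csv"
    # Create the parent directory if it is missing; no error if it already exists.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)
    # Read the file back: "True"/"False" are parsed as booleans again.
    round_trip = pd.read_csv(out_path)

print(round_trip)
```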