# Process GWAS data
This notebook creates a dataset of genes that GWAS have identified as significantly associated with risk of, or protection against, Alzheimer's disease (AD). The gene list is curated by the ADSP Gene Verification Committee (https://adsp.niagads.org/index.php/gvc-top-hits-list/).

The gene list is downloaded as an Excel file. This notebook ingests that file, queries BioMart for the Ensembl IDs of the genes, and writes the result to a CSV file for use in Agora.

In [1]:
import pandas as pd  # Requires install of package "openpyxl" for read_excel
import preprocessing_utils

The Excel file contains 2 sheets:

    table1 = Table 1: List of AD Loci with Genetic Evidence Compiled by ADSP Gene Verification Committee
    table2 = Table 2: AD risk/protective causal genes
    
We want the genes from both tables. 

In [2]:
gwas = pd.read_excel(
    "../../input/gwas_gvc_compiled_list.xlsx", sheet_name=[0, 1], skiprows=1
)
print(gwas[0].shape)
print(gwas[1].shape)

(76, 5)
(20, 4)


Concatenate the tables into one data frame.

In [3]:
gwas[0] = gwas[0].rename(columns={"Reported Gene/ Closest gene": "Gene"})
gwas_df = pd.concat(gwas, axis=0)
print(gwas_df.shape)
gwas_df.head()

(96, 6)


Unnamed: 0,Unnamed: 1,Number,Chr,Location (hg38),SNV,Gene,Source
0,0,1,1.0,109345810,rs141749679,SORT1,
0,1,2,1.0,207577223,rs679515,CR1,
0,2,3,2.0,9558882,rs72777026,ADAM17,
0,3,4,2.0,37304796,rs17020490,PRKD3,
0,4,5,2.0,105749599,rs143080277,NCK2,
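
Passing a dict of data frames to `pd.concat` uses the dict keys as an outer index level, which is why the preview above shows two unnamed index columns. The following is a minimal sketch of that behavior using made-up toy tables; the rows and the `flat` variable are illustrative assumptions, not part of the GVC file or this notebook.

```python
import pandas as pd

# Toy stand-ins for the two sheets; the rows are made up for illustration.
tables = {
    0: pd.DataFrame({"Number": [1, 2], "Gene": ["SORT1", "CR1"]}),
    1: pd.DataFrame({"Gene": ["APP"], "Source": ["example"]}),
}

# The dict keys (0 and 1) become the outer level of a MultiIndex,
# and columns missing from one table are filled with NaN.
combined = pd.concat(tables, axis=0)

# Drop both index levels and renumber rows if a flat index is preferred.
flat = combined.reset_index(drop=True)
print(combined.index.nlevels, flat.shape)
```

Keeping the MultiIndex preserves which sheet each row came from; flattening it is only a convenience for display or export.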


## Get Ensembl IDs
Query Ensembl for a list of Ensembl IDs that match the gene symbols in this table. There is no Python library that allows searching on `external_gene_name` when querying BioMart, so this code makes the request manually via GET. See http://uswest.ensembl.org/info/data/biomart/biomart_restful.html

In [4]:
attributes = ["ensembl_gene_id", "external_gene_name", "chromosome_name"]
filters = {"external_gene_name": set(gwas_df["Gene"])}

result = preprocessing_utils.manual_query_biomart(
    attributes=attributes, filters=filters
)

result = result.rename(
    columns={
        "Gene stable ID": "ensembl_gene_id",
        "Gene name": "hgnc_symbol",
        "Chromosome/scaffold name": "chromosome_name",
    }
)
result

Unnamed: 0,ensembl_gene_id,hgnc_symbol,chromosome_name
0,ENSG00000281614,INPP5D,HG2232_PATCH
1,ENSG00000154734,ADAMTS1,21
2,ENSG00000284816,EPHA1,HG708_PATCH
3,ENSG00000138613,APH1B,15
4,ENSG00000285132,CTSB,HG76_PATCH
...,...,...,...
96,ENSG00000066336,SPI1,11
97,ENSG00000091536,MYO15A,17
98,ENSG00000151694,ADAM17,2
99,ENSG00000203710,CR1,1
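
The implementation of `preprocessing_utils.manual_query_biomart` is not shown in this notebook. As a rough sketch of what a manual BioMart GET request can look like, the code below builds the XML query and the request URL without sending it. The helper name `build_biomart_url`, the endpoint in `BIOMART_URL`, and the `hsapiens_gene_ensembl` dataset are assumptions made for illustration and may differ from what the utility actually does.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import quoteattr

# Assumed endpoint and dataset; adjust to the mirror and species in use.
BIOMART_URL = "http://www.ensembl.org/biomart/martservice"
DATASET = "hsapiens_gene_ensembl"


def build_biomart_url(attributes, filters):
    """Build a GET URL for a BioMart XML query (hypothetical helper)."""
    filter_parts = []
    for name, values in filters.items():
        # Multiple filter values are passed as one comma-separated string.
        joined = ",".join(sorted(values))
        filter_parts.append(f"<Filter name={quoteattr(name)} value={quoteattr(joined)}/>")
    attribute_parts = [f"<Attribute name={quoteattr(a)}/>" for a in attributes]
    query = (
        '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
        '<Query virtualSchemaName="default" formatter="TSV" header="1" '
        'uniqueRows="1" datasetConfigVersion="0.6">'
        f'<Dataset name="{DATASET}" interface="default">'
        + "".join(filter_parts)
        + "".join(attribute_parts)
        + "</Dataset></Query>"
    )
    # urlencode percent-encodes the XML so it is safe in a query string.
    return f"{BIOMART_URL}?{urlencode({'query': query})}"


url = build_biomart_url(
    attributes=["ensembl_gene_id", "external_gene_name", "chromosome_name"],
    filters={"external_gene_name": {"SORT1", "CR1"}},
)
```

With `header="1"`, the response is tab-separated text whose column headers are display names such as "Gene stable ID", which is consistent with the rename step in the cell above.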


Remove human alternative sequence genes and patches from the list. These can be identified from the `chromosome_name`: genes on the primary assembly have either a numerical chromosome value (1-22) or X, Y, or MT. All other chromosome names identify alternative sequences or patches.

In [5]:
result = preprocessing_utils.filter_hasgs(
    df=result, chromosome_name_column="chromosome_name"
)
result = result[["ensembl_gene_id", "hgnc_symbol"]]
len(result)

86
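
The filtering rule described above can be sketched with plain pandas. This is a hypothetical stand-in for `preprocessing_utils.filter_hasgs`, whose actual implementation is not shown here; the function name `keep_primary_chromosomes` and the constant `PRIMARY_CHROMOSOMES` are assumptions, and the example rows are taken from the query output above.

```python
import pandas as pd

# Primary-assembly chromosome names: autosomes 1-22 plus X, Y, and MT.
PRIMARY_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def keep_primary_chromosomes(df, chromosome_name_column="chromosome_name"):
    """Keep rows on primary chromosomes (hypothetical stand-in for filter_hasgs)."""
    # Cast to str so numeric and string chromosome values compare the same way.
    on_primary = df[chromosome_name_column].astype(str).isin(PRIMARY_CHROMOSOMES)
    return df[on_primary].reset_index(drop=True)


example = pd.DataFrame(
    {
        "ensembl_gene_id": ["ENSG00000281614", "ENSG00000154734", "ENSG00000284816"],
        "hgnc_symbol": ["INPP5D", "ADAMTS1", "EPHA1"],
        "chromosome_name": ["HG2232_PATCH", "21", "HG708_PATCH"],
    }
)
filtered = keep_primary_chromosomes(example)
print(filtered)
```

An allow-list of chromosome names is used rather than pattern-matching on "PATCH", so any unexpected scaffold name is also dropped.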

Check: The output should contain every gene in the GWAS input. 

In [6]:
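gwas_genes = set(gwas_df["Gene"])
matched_genes = set(result["hgnc_symbol"])
print(len(gwas_genes))  # unique gene symbols in the GWAS input
print(len(gwas_genes & matched_genes))  # input symbols found in the BioMart result
print(gwas_genes.issubset(matched_genes))  # True if every input symbol was matched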

86
86
True
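
If this check ever fails, a set difference shows which input symbols were not matched. The sketch below uses made-up toy sets in place of the notebook's data; `input_symbols`, `matched_symbols`, and `missing` are illustrative names.

```python
# Toy symbol sets; in the notebook these would come from
# gwas_df["Gene"] and result["hgnc_symbol"].
input_symbols = {"SORT1", "CR1", "ADAM17"}
matched_symbols = {"SORT1", "CR1"}

# Symbols that were requested but are absent from the query result.
missing = sorted(input_symbols - matched_symbols)
print(missing)
```

Unmatched symbols usually point to an outdated or alias gene name in the input list, which would need to be corrected before re-running the query.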


Write to file. Note: some gene symbols map to multiple Ensembl IDs, and that's okay.

In [7]:
result.to_csv(
    "../../output/igap_genetic_association_genes_2023.csv", index=False, header=True
)
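
To see which symbols map to more than one Ensembl ID, the IDs can be counted per symbol. This is a minimal sketch on a made-up mapping; the IDs and symbols below are placeholders, not real genes, and the same two lines could be applied to `result`.

```python
import pandas as pd

# Made-up mapping in which one symbol has two Ensembl IDs.
mapping = pd.DataFrame(
    {
        "ensembl_gene_id": ["ENSG_A1", "ENSG_A2", "ENSG_B1"],
        "hgnc_symbol": ["GENE_A", "GENE_A", "GENE_B"],
    }
)

# Count distinct Ensembl IDs per symbol and keep symbols with more than one.
ids_per_symbol = mapping.groupby("hgnc_symbol")["ensembl_gene_id"].nunique()
multi_mapped = ids_per_symbol[ids_per_symbol > 1]
print(multi_mapped)
```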

The file is then uploaded to Synapse at [syn12514826](https://www.synapse.org/#!Synapse:syn12514826).