<a href="https://polly.elucidata.io/manage/workspaces?action=open_polly_notebook&source=github&path=path_place_holder&kernel=elucidata/Genomics-hail&machine=gp" target="_parent"><img src="https://elucidatainc.github.io/PublicAssets/open_polly.svg" alt="Open in Polly"/></a>


# Exploring Pan-UK Biobank GWAS data and gnomAD data on Polly

## 1- What information is there in the Pan-UK Biobank GWAS data?
### Summary statistics of GWAS studies conducted on the data collected under the UK Biobank project.

UK Biobank is a collection of about half a million individuals with paired genetic and phenotype information that has been valuable in studies of the genetic etiology of common diseases and traits.

Phenotypes studied include:
- physical attributes (e.g. height, BMI, bone density) 
- blood panel traits (e.g. white blood cell count, cholesterol, blood glucose) 
- common diseases (e.g. diabetes, cardiovascular disease, psychiatric disorders)
- electronic health record data (e.g. diagnosis codes entered by clinicians)
- prescription data (e.g. prescribed to take statins) 
- health surveys (e.g. dietary intake, activity levels, general health satisfaction) 
- social surveys (e.g. educational attainment, occupation), and many other measures.

To summarize, the phenotypes include data pulled from electronic medical records as well as participants' responses to questionnaires given online or at the clinic.

The Pan-UK Biobank GWAS data consists of a multi-ancestry analysis of 7,221 phenotypes across 6 continental ancestry groups:

- **EUR** = European ancestry
- **CSA** = Central/South Asian ancestry
- **AFR** = African ancestry
- **EAS** = East Asian ancestry
- **MID** = Middle Eastern ancestry
- **AMR** = Admixed American ancestry

Overview of the data:
- **7,221 phenotypes studied**
- **~28 million variants studied for each phenotype**

Reference: [Pan-UK Biobank](https://pan.ukbb.broadinstitute.org/)


## Let's take a case study of Alzheimer's disease

- **Alzheimer's disease** is one of the most common neurodegenerative diseases



## UK Biobank GWAS data phenotype codes

In UK Biobank, disease phenotypes are defined using ICD codes (version 10).

In the International Classification of Diseases (ICD) diagnosis codes, Alzheimer's disease is listed as: Chapter VI Diseases of the nervous system | G30 Alzheimer's disease

We are specifically looking at phenotypes with the following ICD-10 code:

- **G30** Alzheimer's disease
- Alzheimer's disease is the most frequent cause of dementia in Western societies
- Worldwide prevalence is estimated to be as high as 24 million


In [9]:
!sudo pip3 install polly-python --upgrade --quiet



Setting up polly-python 

In [10]:
import os
from polly.omixatlas import OmixAtlas
import pandas as pd
import numpy as np
from json import dumps

In [11]:
pd.set_option('display.max_columns',None)

In [12]:
omixatlas = OmixAtlas(os.environ['POLLY_REFRESH_TOKEN'])

## UK Biobank

In [72]:
tables = omixatlas.query_metadata("SHOW TABLES IN ukbiobank")
tables

Query execution succeeded (time taken: 0.86 seconds, data scanned: 0.000 MB)
Fetched 2 rows


Unnamed: 0,table_name
0,ukbiobank.datasets
1,ukbiobank.variant_data


#### Looking for fields available in the UK Biobank OmixAtlas

In [13]:
schema = omixatlas.get_schema('ukbiobank',['dataset'],'all')
schema.dataset[['Field Name','Field Description']]

Unnamed: 0,Field Name,Field Description
0,category_l3,Category L3
1,category_l4,Category L4
2,src_uri,Unique URI derived from data file's S3 location
3,sample_id,Unique ID assocaited with every dataset
4,modifier,Modified
...,...,...
59,n_cases_afr,no. of cases in Afr ancestry
60,lambda_gc_afr,Lambda GC in Afr
61,data_type,The type of biomolecular data captured (eg - E...
62,category,category of the phenotype


#### Querying for Alzheimer's disease, ALS and cerebellar ataxia datasets using ICD-10 phenocodes

In [20]:
query = """SELECT dataset_id, phenocode, gene, pheno_sex, population, ukbb__description, description, category
            FROM ukbiobank.datasets WHERE phenocode LIKE 'G30' OR phenocode LIKE 'G32' OR phenocode LIKE 'G12'"""

results = omixatlas.query_metadata(query=query,query_api_version='v2')
results

Query execution succeeded (time taken: 1.92 seconds, data scanned: 0.513 MB)
Fetched 109 rows


Unnamed: 0,dataset_id,phenocode,gene,pheno_sex,population,ukbb__description,description,category
0,icd10_G12_both_sexes_ACAN,G12,ACAN,both_sexes,[EUR],G12 Spinal muscular atrophy and related syndromes,Genome-Wide Association Study (GWAS) results o...,Chapter VI Diseases of the nervous system | G1...
1,icd10_G12_both_sexes_ADAMTS17,G12,ADAMTS17,both_sexes,[EUR],G12 Spinal muscular atrophy and related syndromes,Genome-Wide Association Study (GWAS) results o...,Chapter VI Diseases of the nervous system | G1...
2,icd10_G12_both_sexes_ANGPTL7,G12,ANGPTL7,both_sexes,[EUR],G12 Spinal muscular atrophy and related syndromes,Genome-Wide Association Study (GWAS) results o...,Chapter VI Diseases of the nervous system | G1...
3,icd10_G12_both_sexes_ANKRD12,G12,ANKRD12,both_sexes,[EUR],G12 Spinal muscular atrophy and related syndromes,Genome-Wide Association Study (GWAS) results o...,Chapter VI Diseases of the nervous system | G1...
4,icd10_G12_both_sexes_APC,G12,APC,both_sexes,[EUR],G12 Spinal muscular atrophy and related syndromes,Genome-Wide Association Study (GWAS) results o...,Chapter VI Diseases of the nervous system | G1...
...,...,...,...,...,...,...,...,...
104,icd10_G30_both_sexes_THBS4,G30,THBS4,both_sexes,[EUR],G30 Alzheimer's disease,Genome-Wide Association Study (GWAS) results o...,Chapter VI Diseases of the nervous system | G3...
105,icd10_G30_both_sexes_TNFRSF13B,G30,TNFRSF13B,both_sexes,[EUR],G30 Alzheimer's disease,Genome-Wide Association Study (GWAS) results o...,Chapter VI Diseases of the nervous system | G3...
106,icd10_G30_both_sexes_TTN,G30,TTN,both_sexes,[EUR],G30 Alzheimer's disease,Genome-Wide Association Study (GWAS) results o...,Chapter VI Diseases of the nervous system | G3...
107,icd10_G30_both_sexes_WNT1,G30,WNT1,both_sexes,[EUR],G30 Alzheimer's disease,Genome-Wide Association Study (GWAS) results o...,Chapter VI Diseases of the nervous system | G3...
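The three `LIKE` conditions in the query above contain no wildcards, so they act as exact matches. A minimal sketch of building the same filter for any list of phenocodes is shown here. `build_phenocode_query` is a hypothetical helper, not part of polly-python, and the sketch assumes the query engine accepts a standard SQL `IN` list.

```python
def build_phenocode_query(phenocodes, columns=("dataset_id", "phenocode", "gene")):
    """Build a SELECT over ukbiobank.datasets for a list of ICD-10 phenocodes.

    Hypothetical helper for illustration; it only assembles a SQL string.
    """
    if not phenocodes:
        raise ValueError("at least one phenocode is required")
    # Accept only plain alphanumeric codes so nothing unexpected reaches the SQL.
    for code in phenocodes:
        if not code.isalnum():
            raise ValueError(f"unexpected phenocode: {code!r}")
    in_list = ", ".join(f"'{code}'" for code in phenocodes)
    return (
        f"SELECT {', '.join(columns)} "
        f"FROM ukbiobank.datasets WHERE phenocode IN ({in_list})"
    )


query = build_phenocode_query(["G30", "G32", "G12"])
print(query)
```

The resulting string can be passed to `omixatlas.query_metadata(query=query, query_api_version='v2')` in the same way as the hand-written query.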


#### Querying the UK Biobank variant data table using polly-python

In [84]:
query = """SELECT rsid,ref,alt,chromosome,position,pval_EUR,src_dataset_id 
            FROM ukbiobank.variant_data WHERE src_dataset_id = 'icd10_G30_both_sexes_APP' 
            """

mt = omixatlas.query_metadata(query)

mt

Query execution succeeded (time taken: 1.27 seconds, data scanned: 0.038 MB)
Fetched 3005 rows


Unnamed: 0,rsid,ref,alt,chromosome,position,pval_EUR,src_dataset_id
0,rs115342875,G,[A],21,27436647,-0.21900,icd10_G30_both_sexes_APP
1,rs111287103,G,[T],21,27436767,,icd10_G30_both_sexes_APP
2,rs114803281,A,[G],21,27436791,-0.28000,icd10_G30_both_sexes_APP
3,rs540366301,G,[C],21,27436848,-0.15920,icd10_G30_both_sexes_APP
4,21:27436901_TAG_T,TAG,[T],21,27436901,-0.09831,icd10_G30_both_sexes_APP
...,...,...,...,...,...,...,...
3000,rs3737415,T,[G],21,27348460,-0.70910,icd10_G30_both_sexes_APP
3001,rs45550035,C,[CT],21,27348592,-0.10480,icd10_G30_both_sexes_APP
3002,rs3737416,C,[T],21,27348604,-0.18430,icd10_G30_both_sexes_APP
3003,21:27348618_CATTTG_C,CATTTG,[C],21,27348618,-2.90100,icd10_G30_both_sexes_APP


#### Filtering for significant variants, i.e. ln p-value < -5

In [9]:
# Per the heading above, pval_EUR is treated as ln(p); keep rows with ln(p) < -5
sig_vars = mt[mt['pval_EUR'] < -5].reset_index(drop=True)
sig_vars

Unnamed: 0,rsid,ref,alt,chromosome,position,pval_EUR,src_dataset_id
0,rs113407022,C,[T],21,27397458,-6.72,icd10_G30_both_sexes_APP
1,rs577136121,T,[C],21,27411298,-6.078,icd10_G30_both_sexes_APP
2,21:27412896_AAC_A,AAC,[A],21,27412896,-7.283,icd10_G30_both_sexes_APP
3,rs17001455,G,[A],21,27254298,-8.132,icd10_G30_both_sexes_APP
4,rs2829974,A,[T],21,27260486,-7.74,icd10_G30_both_sexes_APP
5,rs17001473,T,[C],21,27265073,-7.058,icd10_G30_both_sexes_APP
6,rs147618536,G,[T],21,27267613,-8.365,icd10_G30_both_sexes_APP
7,rs12482330,T,[C],21,27299185,-5.89,icd10_G30_both_sexes_APP
8,21:27320893_GA_G,GA,[G],21,27320893,-5.024,icd10_G30_both_sexes_APP
9,rs2829997,G,[A],21,27326859,-5.113,icd10_G30_both_sexes_APP
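To put the `-5` cutoff in perspective, the sketch below converts natural-log p-values into plain p-values and into the -log10 scale used on Manhattan plots. It assumes, as the heading above states, that `pval_EUR` in this table holds ln(p). The helper names are illustrative only.

```python
import math


def ln_p_to_p(ln_p):
    """Convert a natural-log p-value back to a plain p-value."""
    return math.exp(ln_p)


def ln_p_to_neglog10(ln_p):
    """Convert a natural-log p-value to the -log10(p) scale."""
    return -ln_p / math.log(10)


# The cutoff used in the filter above
cutoff_p = ln_p_to_p(-5)

# Two of the rows returned by the filter (rsid, ln p-value)
for rsid, ln_p in [("rs147618536", -8.365), ("rs17001455", -8.132)]:
    print(rsid, f"p = {ln_p_to_p(ln_p):.2e}", f"-log10(p) = {ln_p_to_neglog10(ln_p):.2f}")

# Common thresholds expressed on the ln scale
print(f"ln(1e-5) = {math.log(1e-5):.1f}, ln(5e-8) = {math.log(5e-8):.1f}")
```

On this scale `-5` corresponds to p of about 0.0067, which is far more lenient than ln(1e-5) ≈ -11.5 or ln(5e-8) ≈ -16.8.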


## Using Hail to analyse VCF files

In [45]:
!polly files copy --workspace-id 9412 -s polly://files/icd10-G30-both_sexes.vcf.bgz -d ./icd10-G30-both_sexes.vcf.bgz -y

In [49]:
import hail as hl
hl.init()

In [50]:
from hail.plot import show
from pprint import pprint
hl.plot.output_notebook()

In [51]:
hl.import_vcf('./icd10-G30-both_sexes.vcf.bgz',force_bgz=True,reference_genome='GRCh37').write('file.vcf.bgz.mt', overwrite=True)

2022-07-08 11:56:49 Hail: INFO: Coerced sorted dataset
2022-07-08 11:58:18 Hail: INFO: wrote matrix table with 11416285 rows and 0 columns in 12 partitions to file.vcf.bgz.mt


In [55]:
mt = hl.read_matrix_table('./file.vcf.bgz.mt')

Browsing through the data

In [18]:
mt.info.show(n_rows=100)

Unnamed: 0_level_0,Unnamed: 1_level_0,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info
locus,alleles,pass_gnomad_genomes,n_passing_populations,high_quality,nearest_genes,info,ac_AFR,af_AFR,an_AFR,gnomad_genomes_ac_AFR,gnomad_genomes_af_AFR,gnomad_genomes_an_AFR,ac_AMR,af_AMR,an_AMR,gnomad_genomes_ac_AMR,gnomad_genomes_af_AMR,gnomad_genomes_an_AMR,ac_CSA,af_CSA,an_CSA,ac_EAS,af_EAS,an_EAS,gnomad_genomes_ac_EAS,gnomad_genomes_af_EAS,gnomad_genomes_an_EAS,ac_EUR,af_EUR,an_EUR,gnomad_genomes_ac_EUR,gnomad_genomes_af_EUR,gnomad_genomes_an_EUR,ac_MID,af_MID,an_MID,af_cases_EUR,af_controls_EUR,beta_EUR,se_EUR,pval_EUR,low_confidence_EUR
locus<GRCh37>,array<str>,str,int32,str,str,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,str
1:11063,"[""T"",""G""]",,0,"""false""",,0.816,285.0,0.0154,18400.0,,,,3.71,0.00162,2300.0,,,,9.58,0.000433,22100.0,2.88,0.000496,5800.0,,,,47.8,5.21e-05,918000.0,,,,7.96,0.00239,3330.0,3.78e-05,4.8e-05,-3.62,14.8,0.807,"""true"""
1:13259,"[""G"",""A""]","""true""",4,"""true""",,0.81,18.6,0.00101,18400.0,1.0,0.000117,8580.0,0.592,0.000258,2300.0,0.0,0.0,802.0,0.518,2.34e-05,22100.0,0.165,2.84e-05,5800.0,0.0,0.0,1530.0,247.0,0.000269,918000.0,0.0,0.0,8100.0,0.267,8.01e-05,3330.0,0.000616,0.000277,1.12,1.5,0.455,"""true"""
1:17641,"[""G"",""A""]",,0,"""false""",,0.851,2.25,0.000122,18400.0,,,,0.184,8.03e-05,2300.0,,,,1.35,6.12e-05,22100.0,0.149,2.57e-05,5800.0,,,,739.0,0.000806,918000.0,,,,0.502,0.000151,3330.0,0.00124,0.00083,0.718,0.963,0.456,"""true"""
1:30741,"[""C"",""A""]",,0,"""false""",,0.81,41.8,0.00227,18400.0,,,,0.961,0.000418,2300.0,,,,1.23,5.55e-05,22100.0,0.0,0.0,5800.0,,,,15.3,1.67e-05,918000.0,,,,0.235,7.07e-05,3330.0,,,,,,
1:51427,"[""T"",""G""]","""false""",4,"""false""",,0.821,27.4,0.00149,18400.0,18.0,0.00253,7100.0,0.114,4.95e-05,2300.0,0.0,0.0,606.0,0.157,7.09e-06,22100.0,0.0431,7.44e-06,5800.0,0.0,0.0,1520.0,0.659,7.18e-07,918000.0,0.0,0.0,5440.0,0.0431,1.3e-05,3330.0,,,,,,
1:57222,"[""T"",""C""]",,0,"""false""",,0.827,3.01,0.000163,18400.0,,,,0.467,0.000203,2300.0,,,,5.48,0.000248,22100.0,1.02,0.000176,5800.0,,,,574.0,0.000625,918000.0,,,,0.49,0.000147,3330.0,0.000743,0.000658,0.286,1.11,0.796,"""true"""
1:58396,"[""T"",""C""]","""true""",4,"""true""",,0.914,58.1,0.00315,18400.0,30.0,0.00371,8090.0,2.82,0.00123,2300.0,0.0,0.0,738.0,3.55,0.00016,22100.0,0.604,0.000104,5800.0,0.0,0.0,1540.0,226.0,0.000247,918000.0,0.0,0.0,7760.0,6.86,0.00206,3330.0,0.0,0.000241,-1.02,1.67,0.543,"""true"""
1:62745,"[""C"",""G""]",,0,"""false""",,0.995,28.9,0.00157,18400.0,,,,0.996,0.000434,2300.0,,,,0.0,0.0,22100.0,0.0,0.0,5800.0,,,,0.996,1.09e-06,918000.0,,,,0.0,0.0,3330.0,,,,,,
1:63668,"[""G"",""A""]","""false""",4,"""false""",,0.833,0.0,0.0,18400.0,2.0,0.000248,8070.0,0.0,0.0,2300.0,0.0,0.0,714.0,33.2,0.0015,22100.0,0.0,0.0,5800.0,1.0,0.000652,1530.0,25.2,2.75e-05,918000.0,0.0,0.0,7230.0,0.984,0.000296,3330.0,0.0,2.78e-05,-1.33,4.84,0.783,"""true"""
1:64658,"[""A"",""T""]","""false""",4,"""false""",,0.881,57.8,0.00313,18400.0,1.0,0.00012,8300.0,0.0157,6.83e-06,2300.0,0.0,0.0,768.0,1.0,4.52e-05,22100.0,0.0,0.0,5800.0,0.0,0.0,1550.0,1.82,1.98e-06,918000.0,0.0,0.0,7870.0,1.16,0.000348,3330.0,,,,,,


#### Plotting a Manhattan plot of all the variants 

In [25]:
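# Extra fields displayed when hovering over a point in the plot
hover_fields = {'rsid': mt.rsid,
                'allele': mt.alleles,
                'nearest_gene': mt.info.nearest_genes}

# significance_line draws a horizontal reference line at p = 1e-5
p1 = hl.plot.manhattan(mt.info.pval_EUR, locus=mt.locus, hover_fields=hover_fields, significance_line=1e-5)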

show(p1)



In [78]:
p = hl.plot.qq(mt.info.pval_EUR)
show(p)

2022-07-08 12:43:24 Hail: INFO: Ordering unsorted dataset with network shuffle
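A QQ plot is often summarised by the genomic inflation factor (lambda GC), which also appears among the dataset fields listed earlier (for example `lambda_gc_afr`). The sketch below shows one common way to compute it from plain p-values using only the standard library. It is an illustration on synthetic, evenly spaced p-values, not the procedure used to produce the stored `lambda_gc_*` values.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom
CHI2_1DF_MEDIAN = 0.4549364


def lambda_gc(pvals):
    """Genomic inflation factor: median observed chi-square / expected median."""
    nd = NormalDist()
    # A two-sided p-value maps to chi-square = z**2, where z = Phi^-1(p / 2)
    chi2 = [nd.inv_cdf(p / 2) ** 2 for p in pvals if 0 < p <= 1]
    return median(chi2) / CHI2_1DF_MEDIAN


# Evenly spaced p-values mimic a test statistic with no inflation
n = 1001
uniform_pvals = [(i + 0.5) / n for i in range(n)]
lam = lambda_gc(uniform_pvals)
print(f"lambda GC = {lam:.3f}")
```

Values close to 1 indicate that the bulk of the test statistics follow the null distribution; values well above 1 suggest inflation, for example from population structure.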

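Interpreting results for GWAS:
\begin{equation}
Y_i = \alpha_0 + \beta_1 X_i + \epsilon_i
\end{equation}

Here $Y_i$ is the phenotype of individual $i$, $X_i$ is that individual's genotype at the variant being tested, and $\epsilon_i$ is the residual error.

- beta = effect size (the estimate of $\beta_1$)
- se = standard error of the beta estimate
- pval = significance of the test that $\beta_1 = 0$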
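The three quantities are linked: dividing beta by its standard error gives a z-score, and the z-score gives a p-value. The sketch below assumes a simple two-sided Wald test; the pipeline that produced these summary statistics may use a different test, so the result is only expected to be close to the stored `pval_EUR`. The input values are those reported in this notebook for the variant at 1:18121538 (`beta_EUR` = 4.02, `se_EUR` = 0.901, `pval_EUR` = 8.05e-06).

```python
import math


def wald_test(beta, se):
    """Return (z, two-sided p-value) for the null hypothesis beta = 0."""
    z = beta / se
    # erfc(|z| / sqrt(2)) is the two-sided tail probability of a standard normal
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


z, p = wald_test(4.02, 0.901)
print(f"z = {z:.2f}, p = {p:.2e}")
```

The p-value obtained this way is of the same order as the reported one; small differences are expected because beta and se are rounded in the displayed table and because of the Wald-test assumption.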

### Filtering for significant variants i.e. pval < 1e-5

In [26]:
mt_filter = mt.filter_rows(mt.info.pval_EUR < 1e-5)

In [28]:
print("Number of significant variants = {}".format(mt_filter.count()[0]))

mt_filter.info.show(10)




Number of significant variants = 386




Unnamed: 0_level_0,Unnamed: 1_level_0,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info,info
locus,alleles,pass_gnomad_genomes,n_passing_populations,high_quality,nearest_genes,info,ac_AFR,af_AFR,an_AFR,gnomad_genomes_ac_AFR,gnomad_genomes_af_AFR,gnomad_genomes_an_AFR,ac_AMR,af_AMR,an_AMR,gnomad_genomes_ac_AMR,gnomad_genomes_af_AMR,gnomad_genomes_an_AMR,ac_CSA,af_CSA,an_CSA,ac_EAS,af_EAS,an_EAS,gnomad_genomes_ac_EAS,gnomad_genomes_af_EAS,gnomad_genomes_an_EAS,ac_EUR,af_EUR,an_EUR,gnomad_genomes_ac_EUR,gnomad_genomes_af_EUR,gnomad_genomes_an_EUR,ac_MID,af_MID,an_MID,af_cases_EUR,af_controls_EUR,beta_EUR,se_EUR,pval_EUR,low_confidence_EUR
locus<GRCh37>,array<str>,str,int32,str,str,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,str
1:18121538,"[""T"",""C""]","""true""",4,"""true""","""ACTL8""",0.841,5.37,0.000291,18400.0,2.0,0.00023,8710.0,4.01,0.00175,2300.0,2.0,0.00236,848.0,14.1,0.000639,22100.0,0.533,9.2e-05,5800.0,0.0,0.0,1560.0,1500.0,0.00163,918000.0,16.0,0.00186,8590.0,5.27,0.00158,3330.0,0.00681,0.00161,4.02,0.901,8.05e-06,"""false"""
1:110145296,"[""CT"",""C""]",,0,"""false""","""GNAT2""",0.852,6460.0,0.35,18400.0,,,,923.0,0.402,2300.0,,,,8320.0,0.376,22100.0,2150.0,0.371,5800.0,,,,381000.0,0.415,918000.0,,,,1350.0,0.405,3330.0,0.366,0.417,-0.243,0.0543,7.81e-06,"""false"""
1:112135958,"[""T"",""C""]","""true""",4,"""true""","""RAP1A""",0.978,8080.0,0.438,18400.0,3880.0,0.447,8680.0,682.0,0.297,2300.0,256.0,0.302,848.0,6930.0,0.314,22100.0,2000.0,0.345,5800.0,471.0,0.303,1550.0,278000.0,0.303,918000.0,2500.0,0.291,8580.0,915.0,0.275,3330.0,0.356,0.304,0.252,0.0545,3.58e-06,"""false"""
1:112145503,"[""T"",""C""]","""true""",4,"""true""","""RAP1A""",0.993,1330.0,0.072,18400.0,710.0,0.0815,8710.0,431.0,0.188,2300.0,170.0,0.2,848.0,5200.0,0.235,22100.0,1370.0,0.236,5800.0,336.0,0.217,1550.0,217000.0,0.236,918000.0,1940.0,0.226,8580.0,509.0,0.153,3330.0,0.283,0.237,0.26,0.0586,9.07e-06,"""false"""
1:112147956,"[""C"",""A""]","""true""",4,"""true""","""RAP1A""",0.994,1310.0,0.0708,18400.0,698.0,0.0802,8710.0,424.0,0.185,2300.0,167.0,0.197,848.0,4900.0,0.221,22100.0,1360.0,0.234,5800.0,329.0,0.212,1550.0,214000.0,0.234,918000.0,1910.0,0.222,8580.0,481.0,0.145,3330.0,0.281,0.235,0.265,0.0588,6.59e-06,"""false"""
1:112150109,"[""C"",""T""]","""true""",4,"""true""","""RAP1A""",0.991,1300.0,0.0703,18400.0,703.0,0.0808,8700.0,431.0,0.188,2300.0,167.0,0.2,834.0,5240.0,0.237,22100.0,1370.0,0.237,5800.0,335.0,0.215,1560.0,217000.0,0.237,918000.0,1890.0,0.224,8420.0,503.0,0.151,3330.0,0.284,0.238,0.263,0.0587,7.43e-06,"""false"""
1:112155679,"[""C"",""T""]","""true""",4,"""true""","""RAP1A""",0.995,1310.0,0.071,18400.0,705.0,0.0812,8680.0,425.0,0.185,2300.0,168.0,0.198,848.0,4910.0,0.222,22100.0,1360.0,0.234,5800.0,327.0,0.21,1560.0,216000.0,0.235,918000.0,1930.0,0.225,8580.0,484.0,0.145,3330.0,0.284,0.236,0.268,0.0587,5.03e-06,"""false"""
1:112156965,"[""C"",""A""]","""true""",4,"""true""","""RAP1A""",0.995,1310.0,0.071,18400.0,703.0,0.0807,8710.0,425.0,0.185,2300.0,167.0,0.197,848.0,4910.0,0.222,22100.0,1360.0,0.234,5800.0,326.0,0.21,1550.0,216000.0,0.235,918000.0,1930.0,0.225,8590.0,484.0,0.145,3330.0,0.284,0.236,0.267,0.0587,5.34e-06,"""false"""
1:112158504,"[""A"",""G""]","""true""",4,"""true""","""RAP1A""",0.995,1310.0,0.071,18400.0,701.0,0.0806,8700.0,425.0,0.185,2300.0,165.0,0.196,844.0,4920.0,0.222,22100.0,1360.0,0.234,5800.0,322.0,0.208,1550.0,216000.0,0.235,918000.0,1920.0,0.224,8580.0,484.0,0.145,3330.0,0.284,0.236,0.268,0.0587,5.03e-06,"""false"""
1:112161489,"[""T"",""G""]","""true""",4,"""true""","""RAP1A""",0.996,1310.0,0.0708,18400.0,706.0,0.081,8710.0,425.0,0.185,2300.0,168.0,0.198,848.0,4930.0,0.223,22100.0,1360.0,0.234,5800.0,330.0,0.212,1560.0,216000.0,0.235,918000.0,1930.0,0.224,8590.0,488.0,0.146,3330.0,0.284,0.236,0.267,0.0587,5.18e-06,"""false"""
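Several of the rows above sit within a few kilobases of each other near RAP1A, so they likely reflect one association signal rather than many. A rough way to summarise this is to group significant variants that lie close together and keep the smallest p-value per group. The sketch below uses a purely distance-based grouping with an assumed 500 kb window; it is not LD-based clumping, and `group_into_loci` is a hypothetical helper. The input values are the ten rows shown above.

```python
def group_into_loci(variants, window=500_000):
    """Group (chrom, pos, pval) tuples into loci by genomic distance."""
    loci = []
    # Chromosomes are sorted as strings here, which is enough for a sketch
    for chrom, pos, pval in sorted(variants, key=lambda v: (v[0], v[1])):
        last = loci[-1] if loci else None
        if last and last["chrom"] == chrom and pos - last["end"] <= window:
            # Extend the current locus and update its lead variant if needed
            last["end"] = pos
            last["n_variants"] += 1
            if pval < last["lead_pval"]:
                last["lead_pos"], last["lead_pval"] = pos, pval
        else:
            loci.append({"chrom": chrom, "start": pos, "end": pos,
                         "n_variants": 1, "lead_pos": pos, "lead_pval": pval})
    return loci


significant = [
    ("1", 18121538, 8.05e-06), ("1", 110145296, 7.81e-06),
    ("1", 112135958, 3.58e-06), ("1", 112145503, 9.07e-06),
    ("1", 112147956, 6.59e-06), ("1", 112150109, 7.43e-06),
    ("1", 112155679, 5.03e-06), ("1", 112156965, 5.34e-06),
    ("1", 112158504, 5.03e-06), ("1", 112161489, 5.18e-06),
]

loci = group_into_loci(significant)
for locus in loci:
    print(locus)
```

With this window the ten rows collapse into three loci, and the eight variants near RAP1A are represented by the one with the smallest p-value.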


In [29]:
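# Same hover fields as before, taken from the filtered matrix table
hover_fields = {'rsid': mt_filter.rsid,
                'allele': mt_filter.alleles,
                'nearest_gene': mt_filter.info.nearest_genes}

# Manhattan plot restricted to variants with pval_EUR < 1e-5
p3 = hl.plot.manhattan(mt_filter.info.pval_EUR, hover_fields=hover_fields, significance_line=1e-5)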

show(p3)

