
# Genome-wide association study (GWAS)

This notebook demonstrates conducting a genome-wide association study using the public 1000 Genomes dataset stored in BigQuery.  

Related Links:
* [BigQuery](https://cloud.google.com/bigquery/)
* BigQuery [SQL reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax)
* [1,000 Genomes summary](https://www.internationalgenome.org/1000-genomes-summary/)
* [Verily Workbench documentation](https://support.workbench.verily.com/docs/)


In this experiment, we'll identify variant positions within chromosome 12 that differ significantly between the case and control groups. For the purposes of this notebook, the case group is the set of individuals from the "EAS" (East Asian) super population, and all other individuals form the control group. Variant data from the 1000 Genomes dataset is publicly accessible within BigQuery. 

This notebook is intended to be run using [Verily Workbench](https://workbench.verily.com/).   
It's most straightforward to run it from a Workbench [workspace's](https://support.workbench.verily.com/docs/technical_reference/workspaces/) [JupyterLab cloud environment](xxx).

However, you can also run the notebook locally. To do so, you will need to have **first [installed and authorized the Workbench CLI](https://support.workbench.verily.com/docs/technical_reference/cli/cli_install_and_run/), and [created a workspace](https://support.workbench.verily.com/docs/technical_reference/workspaces/workspace_operations/)**.

**TODO**: notebook approximate costs (including BQ query costs)

## Setup and Configuration

First, do some imports.

In [None]:
import pandas as pd
from google.cloud import bigquery

import numpy as np
import seaborn as sns

# In JupyterLab, enable IPython to display matplotlib graphs.
import matplotlib.pyplot as plt
%matplotlib inline
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

In [None]:
def get_bq_dataset_from_reference(resource_name):
    """Resolve the BQ dataset name from its reference name in the workspace.

    Uses the Workbench CLI via IPython's `!` shell syntax, so it only runs in a notebook.
    """
    # The first line of the command output is taken as the resolved dataset name.
    bq_cmd_output = !wb resolve --name={resource_name}
    return bq_cmd_output[0]

### Create a BigQuery dataset for our explorations

Next, we'll create a workspace-managed (["controlled"](https://support.workbench.verily.com/docs/technical_reference/data_resources/#referenced-vs-workspace-controlled-data-resources)) BigQuery dataset to use for this example. 

This dataset needs to be in the `US` region.  That is because the `bigquery-public-data.human_genome_variants` dataset, which we'll use, is in the `US` region. To save the results of queries over that dataset, the workspace dataset must be in the same region.

In [None]:
bq_dataset_name = 'GWAS_experiments'

Run the command to create the dataset.  You'll see an ignorable error if you run this more than once.

**Note**: if you're not running this notebook in the context of a Verily Workbench workspace, first ensure that the `wb` utility is set to the workspace in which you want to create the dataset. You can check this by running `wb workspace describe`. If need be, you can run `wb workspace set --id=<your_workspace_id>` to set the workspace.  
(If you're running the notebook from a workspace cloud environment, `wb` will be set already to use that workspace).

In [None]:
!wb resource create bq-dataset --location=US --id $bq_dataset_name

Now, get the full BQ dataset name from the reference name. It should have the form `<project_id>.<dataset_name>`.

In [None]:
gwas_experiments_dataset = get_bq_dataset_from_reference(bq_dataset_name)
print(gwas_experiments_dataset)
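
The table names used later in this notebook are built by appending a table name to this dataset string. As a minimal sketch, the hypothetical helper below (`table_id` is not part of the Workbench CLI or the BigQuery client) checks that the resolved name has the `<project_id>.<dataset_name>` form before building a table ID. It assumes a plain project ID without a domain prefix.

```python
def table_id(dataset, table_name):
    """Join a '<project_id>.<dataset_name>' string with a table name.

    Hypothetical helper for illustration; assumes the project ID contains no dots.
    """
    parts = dataset.split(".")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected '<project_id>.<dataset_name>', got {dataset!r}")
    return f"{dataset}.{table_name}"


# 'my-project' is a made-up project ID used only for this example.
example_table = table_id("my-project.GWAS_experiments", "var12")
print(example_table)
```

A check like this can surface an unexpected `wb resolve` result early, before any query is submitted.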

Next, create a BQ client object. We'll use this for our queries.

In [None]:
job_query_config = bigquery.QueryJobConfig(default_dataset=gwas_experiments_dataset)
client = bigquery.Client(default_query_job_config=job_query_config)

Now we're set up to run queries and create new tables in the workspace dataset we created.

## Classifying per-call variant positions into variant/non-variant groups

We can tally the reference/alternate allele counts for *individual* variant positions within chromosome 12. Each value in the `call.genotype` field is an integer in the range `[-1, num_alternate_bases]`. 
* A value of negative one indicates that the genotype for the call is ambiguous (i.e., a no-call).
* A value of zero indicates that the genotype for the call is the same as the reference (i.e., non-variant). 
* A value of one would indicate that the genotype for the call is the 1st value in the list of alternate bases (likewise for values >1).
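
To make the encoding concrete, here is a small sketch in plain Python. It assumes each call carries a list of genotype codes, one per allele copy (two for a diploid call), which is how the queries in this notebook iterate over `call_info.genotype`. The example genotypes are made up.

```python
def count_alleles(genotype):
    """Return (ref_count, alt_count, no_call_count) for one call's genotype codes."""
    ref_count = sum(1 for g in genotype if g == 0)
    alt_count = sum(1 for g in genotype if g == 1)  # first alternate allele only
    no_call_count = sum(1 for g in genotype if g == -1)
    return ref_count, alt_count, no_call_count


# Made-up diploid genotypes for illustration.
examples = {
    "homozygous reference": [0, 0],
    "heterozygous": [0, 1],
    "homozygous alternate": [1, 1],
    "no-call": [-1, -1],
}
for label, genotype in examples.items():
    print(label, count_alleles(genotype))
```

Note that a code greater than one (a second or later alternate allele) is counted in neither the reference nor the alternate tally here, so the two counts can sum to less than 2 for such calls.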

In [None]:
variants_table = 'bigquery-public-data.human_genome_variants.1000_genomes_phase_3_variants_20150220'

In the query below, we're setting a destination table (the value of `var12_table`) in the workspace dataset. (For most of the queries in this notebook, the result sets are way too large to hold in-memory on the notebook server).  

In [None]:
query = f"""
SELECT reference_name, start_position, reference_bases,
alternate_bases[SAFE_OFFSET(0)].alt AS alt_bases, end_position, VT, call_info
FROM `{variants_table}`
CROSS JOIN UNNEST(call) AS call_info
   WHERE
      reference_name = '12'
"""

In [None]:
var12_table = f'{gwas_experiments_dataset}.var12'

Now we'll run the actual query, which will create a new 'var12' table in the workspace dataset with the results. In this case, we don't need a handle to the returned `RowIterator`, but we'll show how to use that in subsequent queries.

If the given destination table already exists, the query will fail.  If you'd like to override this and overwrite the existing table with the new query results, uncomment the line that sets `bigquery.WriteDisposition.WRITE_TRUNCATE`. 

You can similarly add this `WRITE_TRUNCATE` config for any of the queries below.

In [None]:
# Start the query, passing in the destination table config.
job_query_config = bigquery.QueryJobConfig(destination=var12_table)
#job_query_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE  # uncomment to overwrite the existing table

query_job = client.query(query, job_config=job_query_config)  # Make an API request.
query_job.result()

Let's verify that our allele counts match our expectations before moving on. For any given row, the alternate + reference counts should sum to 2 for this experiment.

In [None]:
query1 = f"""
SELECT reference_name, start_position, end_position, VT[SAFE_OFFSET(0)] as vt, reference_bases, alt_bases, call_info.genotype,
(select sum(CAST(num = 0 as int64)) from t.call_info.genotype num) ref_count,
(select sum(CAST(num = 1 as int64)) from t.call_info.genotype num) alt_count
FROM `{var12_table}`  t
LIMIT 1000
"""

In [None]:
alleles_df = client.query(query1).result().to_dataframe()
alleles_df

### Assigning case and control groups

Now we can join our allele counts with metadata available in the sample info table. We'll use this sample metadata to split the set of genomes into case and control groups based upon the super population group.

In [None]:
sample_info_table = 'bigquery-public-data.human_genome_variants.1000_genomes_sample_info'

In [None]:
query2 = f"""
WITH alleles AS (
  SELECT reference_name, start_position, end_position, VT[SAFE_OFFSET(0)] as vt, reference_bases, alt_bases,
  call_info.genotype, call_info.name as cn,
(select sum(CAST(num = 0 as int64)) from t.call_info.genotype num) ref_count,
(select sum(CAST(num = 1 as int64)) from t.call_info.genotype num) alt_count
FROM `{var12_table}`  t
)
SELECT
  super_population,
  ('EAS' = super_population) AS is_case,
  cn,
  reference_name,
  start_position,
  reference_bases,
  alt_bases,
  end_position,
  vt,
  ref_count,
  alt_count,
FROM alleles
JOIN `{sample_info_table}` AS samples
  ON alleles.cn = samples.sample
"""

In [None]:
exp_groups_table = f'{gwas_experiments_dataset}.exp_groups'

In [None]:
# Start the query, passing in the destination table config.
job_query_config = bigquery.QueryJobConfig(destination=exp_groups_table)

query_job = client.query(query2, job_config=job_query_config)
query_job.result()
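
The join and the `is_case` flag can be pictured on toy data. The sketch below is plain Python, not what BigQuery executes; the sample names, populations, and counts are invented for illustration, and `assign_groups` is a hypothetical helper.

```python
# Toy stand-ins for the sample info table and the per-call allele counts.
sample_info = {"S1": "EAS", "S2": "EUR", "S3": "EAS"}  # sample -> super_population
calls = [
    {"cn": "S1", "start_position": 100, "ref_count": 1, "alt_count": 1},
    {"cn": "S2", "start_position": 100, "ref_count": 2, "alt_count": 0},
    {"cn": "S4", "start_position": 100, "ref_count": 0, "alt_count": 2},  # no metadata
]


def assign_groups(calls, sample_info, case_population="EAS"):
    """Inner-join calls to sample metadata and flag rows in the case group."""
    joined = []
    for call in calls:
        population = sample_info.get(call["cn"])
        if population is None:
            continue  # an inner join drops calls without a matching sample
        joined.append({
            **call,
            "super_population": population,
            "is_case": population == case_population,
        })
    return joined


groups = assign_groups(calls, sample_info)
for row in groups:
    print(row["cn"], row["super_population"], row["is_case"])
```

As in the SQL, every joined row that is not in the case population ends up in the control group.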

The variants table contains a few different types of variant: structural variants ("SV"), indels ("INDEL") and SNPs ("SNP").

In [None]:
query3 = f"""
SELECT
  vt,
  COUNT(*)
FROM `{exp_groups_table}`
GROUP BY vt
"""

This is a small result, so we can save it to a dataframe.

In [None]:
query_job = client.query(query3)
variant_types = query_job.result().to_dataframe()
variant_types

For the purposes of this experiment, let's limit the variants to only SNPs. To keep things simple, we'll create a new dedicated table for just the SNP variants.

In [None]:
query4 = f"""
SELECT *
FROM `{exp_groups_table}`
where vt = 'SNP'
"""

In [None]:
snps_table = f'{gwas_experiments_dataset}.snps'

In [None]:
# Start the query, passing in the extra configuration.
job_query_config = bigquery.QueryJobConfig(destination=snps_table)

query_job = client.query(query4, job_config=job_query_config)
query_job.result()

### Tallying reference/alternate allele counts for case/control groups

Now that we've assigned each call set to either the case or the control group, we can tally up the counts of reference and alternate alleles within each of our assigned case/control groups, for each variant position, like so:

In [None]:
query5 = f"""
SELECT
    reference_name,
    start_position,
    end_position,
    reference_bases,
    alt_bases,
    vt,
    SUM(ref_count + alt_count) AS allele_count,
    SUM(ref_count) AS ref_count,
    SUM(alt_count) AS alt_count,
    SUM(IF(TRUE = is_case, SAFE_CAST((ref_count + alt_count) AS INT64), 0)) AS case_count,
    SUM(IF(FALSE = is_case, SAFE_CAST((ref_count + alt_count) AS INT64), 0)) AS control_count,
    SUM(IF(TRUE = is_case, ref_count, 0)) AS case_ref_count,
    SUM(IF(TRUE = is_case, alt_count, 0)) AS case_alt_count,
    SUM(IF(FALSE = is_case, ref_count, 0)) AS control_ref_count,
    SUM(IF(FALSE = is_case, alt_count, 0)) AS control_alt_count,

FROM `{snps_table}`
GROUP BY
    reference_name,
    start_position,
    end_position,
    reference_bases,
    alt_bases,
    vt
"""

In [None]:
grouped_counts_table = f'{gwas_experiments_dataset}.grouped_counts'

For this query, we'll grab a handle to the returned query result (a `RowIterator`).

In [None]:
# Start the query, passing in the extra configuration.
job_query_config = bigquery.QueryJobConfig(destination=grouped_counts_table)
#job_query_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE  # to overwrite the existing table

query_job = client.query(query5, job_config=job_query_config)  # Make an API request.
grouped_counts_res = query_job.result()

We can use the query result iterator to write a subset of results to a pandas dataframe.

In [None]:
for df in grouped_counts_res.to_dataframe_iterable():
    grouped_counts_df = df
    break
grouped_counts_df

Again, validate that the group-level counts (still per variant position) are sensible.
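
One way to do this validation is to check a few invariants that follow from how the grouped columns are defined in the query above. The sketch below is a hypothetical helper working on one row represented as a dict; the toy row is made up. With the dataframe above, rows could be obtained with `grouped_counts_df.to_dict("records")`.

```python
def check_grouped_row(row):
    """Return the names of the invariants that fail for one grouped-counts row."""
    checks = {
        "ref + alt == allele_count":
            row["ref_count"] + row["alt_count"] == row["allele_count"],
        "case + control == allele_count":
            row["case_count"] + row["control_count"] == row["allele_count"],
        "case_ref + case_alt == case_count":
            row["case_ref_count"] + row["case_alt_count"] == row["case_count"],
        "control_ref + control_alt == control_count":
            row["control_ref_count"] + row["control_alt_count"] == row["control_count"],
        "case_ref + control_ref == ref_count":
            row["case_ref_count"] + row["control_ref_count"] == row["ref_count"],
        "case_alt + control_alt == alt_count":
            row["case_alt_count"] + row["control_alt_count"] == row["alt_count"],
    }
    return [name for name, ok in checks.items() if not ok]


# Made-up row for illustration.
toy_row = {
    "allele_count": 30, "ref_count": 24, "alt_count": 6,
    "case_count": 10, "control_count": 20,
    "case_ref_count": 6, "case_alt_count": 4,
    "control_ref_count": 18, "control_alt_count": 2,
}
print(check_grouped_row(toy_row))
```

An empty list means all invariants hold. A failure of the case/control sums would suggest rows whose `is_case` value is missing.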

## Quantify the statistical significance at each variant position

We can quantify the statistical significance of each variant position using the Chi-squared test. Furthermore, we can restrict our result set to *only* statistically significant variant positions for this experiment by ranking each position by its statistical significance (decreasing) and thresholding the results for significance at `p <= 5e-8` (chi-squared score >= 29.7).  
(Chi-squared critical value for df=1, p-value=5*10^-8 is 29.71679)
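
The link between the score threshold and the p-value can be checked with the standard library alone. For one degree of freedom, the upper-tail probability of a chi-squared score `x` equals `erfc(sqrt(x / 2))`. The helper name below is hypothetical; this is a sketch for df=1 only, not a general chi-squared routine.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# The critical value quoted above should map back to roughly p = 5e-8.
p_value = chi2_sf_df1(29.71679)
print(f"p-value at chi-squared score 29.71679: {p_value:.3e}")
```

For other degrees of freedom, a statistics library such as `scipy.stats.chi2` would be the usual choice.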

We now run this query over **all** the variants within chromosome 12.

In [None]:
query6 = f"""
WITH sres AS (
SELECT
  reference_name, start_position, end_position, reference_bases, alt_bases, vt,
  case_count, control_count, allele_count, ref_count, alt_count,
  case_ref_count, case_alt_count, control_ref_count, control_alt_count,
  # http://homes.cs.washington.edu/~suinlee/genome560/lecture7.pdf
  # https://en.wikipedia.org/wiki/Yates%27s_correction_for_continuity
  ROUND(
    POW(ABS(case_ref_count - (ref_count/allele_count)*case_count) - 0.5,
      2)/((ref_count/allele_count)*case_count) +
    POW(ABS(control_ref_count - (ref_count/allele_count)*control_count) - 0.5,
      2)/((ref_count/allele_count)*control_count) +
    POW(ABS(case_alt_count - (alt_count/allele_count)*case_count) - 0.5,
      2)/((alt_count/allele_count)*case_count) +
    POW(ABS(control_alt_count - (alt_count/allele_count)*control_count) - 0.5,
      2)/((alt_count/allele_count)*control_count),
    3) AS chi_squared_score
FROM `{grouped_counts_table}`
WHERE
  # For chi-squared, expected counts must be at least 5 for each group
  (ref_count/allele_count)*case_count >= 5.0
  AND (ref_count/allele_count)*control_count >= 5.0
  AND (alt_count/allele_count)*case_count >= 5.0
  AND (alt_count/allele_count)*control_count >= 5.0
)
SELECT * from sres WHERE chi_squared_score >= 29.71679
ORDER BY
  chi_squared_score DESC,
  allele_count DESC
"""

In [None]:
stats_results_table = f'{gwas_experiments_dataset}.stats_results'

In [None]:
# Start the query, passing in the extra configuration.
job_query_config = bigquery.QueryJobConfig(destination=stats_results_table)
#job_query_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE  # to overwrite the existing table

query_job = client.query(query6, job_config=job_query_config)  # Make an API request.
stats_res = query_job.result()

Let's look at the first few most significant variants:

In [None]:
for df in stats_res.to_dataframe_iterable():
    stats_df = df
    break
stats_df

Scroll to the right in the above results to see that the positions deemed significant do in fact have significantly different case/control counts for the alternate/reference bases.

### Computing Chi-squared statistics in BigQuery vs Python

Let's compare these BigQuery-computed Chi-squared scores to ones calculated via Python's statistical packages, for the first row in the dataframe.

In [None]:
import numpy as np
from scipy.stats import chi2_contingency

chi2, p, dof, expected = chi2_contingency(np.array([
    [281, 727], # case
    [3794, 206]  # control
]))

print('Python Chi-sq score = %.3f' % chi2)

The score computed in Python matches the one computed by BigQuery.
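
To see what both computations are doing, the sketch below mirrors the SQL expression in plain Python: expected counts come from the row and column totals, and each cell contributes `(|observed - expected| - 0.5)^2 / expected`. The function name is hypothetical, and reading the columns of the table above as reference then alternate counts is an assumption (swapping them does not change the score).

```python
def yates_chi_squared(case_ref, case_alt, control_ref, control_alt):
    """Yates-corrected chi-squared score for a 2x2 allele-count table.

    Returns None when any expected count is below 5, mirroring the SQL filter.
    """
    case_count = case_ref + case_alt
    control_count = control_ref + control_alt
    ref_count = case_ref + control_ref
    alt_count = case_alt + control_alt
    allele_count = case_count + control_count

    # (observed, expected) pairs for the four cells of the table.
    cells = [
        (case_ref, ref_count / allele_count * case_count),
        (control_ref, ref_count / allele_count * control_count),
        (case_alt, alt_count / allele_count * case_count),
        (control_alt, alt_count / allele_count * control_count),
    ]
    if min(expected for _, expected in cells) < 5.0:
        return None
    return sum((abs(obs - exp) - 0.5) ** 2 / exp for obs, exp in cells)


# Counts taken from the contingency table in the cell above.
score = yates_chi_squared(281, 727, 3794, 206)
print(f"Pure-Python Chi-sq score = {score:.3f}")
```

`chi2_contingency` applies the same continuity correction by default for 2x2 tables, which is why the scores agree.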

## Analyzing the GWAS results

First, how many statistically significant variant positions did we find?

In [None]:
query7 = f"""
SELECT COUNT(*) AS num_significant_snps
FROM `{stats_results_table}`
"""

In [None]:
query_job = client.query(query7)
df = query_job.result().to_dataframe()
df

Let's pull in the top 1000 SNP positions to local memory. Since we only need a subset of the columns, we can project our data first to remove unneeded columns.

In [None]:
query8 = f"""
SELECT * FROM (
  SELECT
    reference_name,
    start_position,
    reference_bases,
    alt_bases,
    chi_squared_score
  FROM `{stats_results_table}`
  ORDER BY chi_squared_score DESC
  LIMIT 1000
)
ORDER BY start_position asc
"""

In [None]:
query_job = client.query(query8)  # Make an API request.
sig_snps_dataset = query_job.result().to_dataframe()
sig_snps_dataset

Let's visualize the distribution of significant SNPs along the length of the chromosome. The y-value of the charts indicates the Chi-squared score: larger values are more significant.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

#g = sns.distplot(sig_snps['start'], rug=False, hist=False, kde_kws=dict(bw=0.1))
fig, ax = plt.subplots()
ax.scatter(sig_snps_dataset['start_position'], sig_snps_dataset['chi_squared_score'], alpha=0.3, c='red')
ax.set_ylabel('Chi-squared score')
ax.set_xlabel('SNP position (bp)')

Let's zoom in on one region that contains a large number of very significant SNPs:

In [None]:
fig, ax = plt.subplots()
ax.scatter(sig_snps_dataset['start_position'], sig_snps_dataset['chi_squared_score'], alpha=0.5, c='red')
ax.set_xlim([10.7e7, 12.2e7])
# ax.set_xlim([3.3e7, 3.5e7])
ax.set_ylabel('Chi-squared score')
ax.set_xlabel('SNP position (bp)')

## Summary

This notebook illustrated how to conduct a GWAS experiment using variant data stored within the Google Genomics BigQuery tables, retrieve a local copy of the top results and visualize the data with Python libraries.

## Cleanup

You can delete the dataset you created (`GWAS_experiments`) if you are done with the tables that you created.

You can do this via the `wb` command-line utility like this:  
`wb resource delete --id GWAS_experiments`

You can also delete the dataset via the [Verily Workbench UI](https://support.workbench.verily.com/docs/technical_reference/data_resources/data_resources_operations/).

---
Copyright 2024 Verily Life Sciences LLC

Use of this source code is governed by a BSD-style  
license that can be found in the LICENSE file or at  
https://developers.google.com/open-source/licenses/bsd