# GWAS in the Cloud

## Overview 
We adapted the NIH CFDE tutorial from [here](https://training.nih-cfde.org/en/latest/Bioinformatic-Analyses/GWAS-in-the-cloud/background/) into a notebook so that you can run it on Vertex AI. We have greatly simplified the instructions, so if you need more detail, refer to the full tutorial.
Most of this notebook is bash run from a Python kernel. When you reach the plotting section, you will need to switch your kernel to R.

## Learning Objectives
+ Learn how to run a GWAS analysis in Google Cloud

## Prerequisites
+ You only need access to a Vertex AI environment to run this notebook

## Get Started

### Install packages and set up environment

#### Download the data
Use `%%bash` at the top of a cell to run the whole cell as bash. You can also use `!` to run a single bash command from within a Python notebook.

In [None]:
%%bash
mkdir -p GWAS
curl -LO https://de.cyverse.org/dl/d/E0A502CC-F806-4857-9C3A-BAEAA0CCC694/pruned_coatColor_maf_geno.vcf.gz
curl -LO https://de.cyverse.org/dl/d/3B5C1853-C092-488C-8C2F-CE6E8526E96B/coatColor.pheno

In [None]:
%%bash
mv *.gz GWAS
mv *.pheno GWAS
ls GWAS

#### Install dependencies
Here we install mamba, which is faster than conda. Adding it to the path in a Vertex AI notebook can be tricky, so we append its install directory to `PATH` from Python after installing. You could also skip this install and just use conda, since that is preinstalled in the kernel.

In [None]:
! curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
! bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge

In [None]:
# add mamba's bin directory to your PATH
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/mambaforge/bin"

In [None]:
! mamba install -y -c bioconda plink vcftools
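Before moving on, it can help to confirm that the tools are actually visible from the Python kernel. The sketch below is one way to do that with the standard library. The helper name `find_missing_tools` is ours and is not part of the original tutorial.

```python
import shutil


def find_missing_tools(tools):
    """Return the subset of `tools` that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# The two command-line tools installed in the cell above.
# An empty list means both were found on PATH.
missing = find_missing_tools(["plink", "vcftools"])
print("Missing tools:", missing)
```

If either tool is reported as missing, re-run the cell that appends `mambaforge/bin` to `PATH` and try again.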

## Begin the Analysis

### Make map and ped files from the vcf file to feed into plink

In [None]:
%cd GWAS

In [None]:
! vcftools --gzvcf pruned_coatColor_maf_geno.vcf.gz --plink --out coatColor

### Create a list of minor alleles

For more info on these terms, look at step 2 [here](https://training.nih-cfde.org/en/latest/Bioinformatic-Analyses/GWAS-in-the-cloud/analyze/).

In [None]:
# write an uncompressed copy of the vcf (pruned_coatColor_maf_geno.recode.vcf)
! vcftools --gzvcf pruned_coatColor_maf_geno.vcf.gz --recode --out pruned_coatColor_maf_geno

In [None]:
# create list of minor alleles: variant ID (or CHR:POS when the ID is ".") and the ALT allele
! awk 'BEGIN{FS="\t";OFS="\t";}/^#/{next;}{if($3==".")$3=$1":"$2; print $3,$5;}' pruned_coatColor_maf_geno.recode.vcf > minor_alleles

In [None]:
! head minor_alleles
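If the awk one-liner is hard to read, the following Python sketch does roughly the same thing on a few VCF lines held in memory. It skips header lines beginning with `#`, replaces a missing ID (`.`) with `CHR:POS`, and emits the ID together with the ALT column. The function name and the sample records are made up for illustration; they are not taken from the coat color data.

```python
def minor_allele_rows(vcf_lines):
    """Yield (variant_id, alt_allele) pairs from uncompressed VCF lines."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        chrom, pos, variant_id, alt = fields[0], fields[1], fields[2], fields[4]
        if variant_id == ".":
            variant_id = f"{chrom}:{pos}"  # build an ID when the VCF has none
        yield variant_id, alt


# Hypothetical records, only to show the transformation.
sample = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t1000\t.\tA\tG\t.\t.\t.\n",
    "2\t2000\tsnp_demo\tC\tT\t.\t.\t.\n",
]

for variant_id, alt in minor_allele_rows(sample):
    print(f"{variant_id}\t{alt}")
```

To apply it to the real file, you could pass an open file handle for `pruned_coatColor_maf_geno.recode.vcf` in place of `sample`.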

### Run quality controls

In [None]:
# calculate missingness per locus and per individual
! plink --file coatColor --make-pheno coatColor.pheno "yellow" --missing --out miss_stat --noweb --dog --reference-allele minor_alleles --allow-no-sex --adjust

In [None]:
# take a look at lmiss, which holds the per-locus rates of missingness
! head miss_stat.lmiss

In [None]:
# peek at imiss, which holds the per-individual rates of missingness
! head miss_stat.imiss
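Looking at `head` output only shows the first few rows. As a rough sketch, the helper below scans a missingness table and lists the entries whose `F_MISS` value exceeds a threshold. It assumes the whitespace-delimited layout PLINK writes, with a header row and the SNP or individual ID in the second column. The 0.1 threshold, the function name, and the sample rows are all illustrative choices, not values from the tutorial.

```python
def high_missingness(lines, threshold=0.1, column="F_MISS"):
    """Return (id, fraction_missing) for rows whose `column` exceeds `threshold`."""
    rows = [line.split() for line in lines if line.strip()]
    header, records = rows[0], rows[1:]
    value_idx = header.index(column)
    id_idx = 1  # assumption: SNP (in .lmiss) or IID (in .imiss) is the second column
    flagged = []
    for fields in records:
        try:
            value = float(fields[value_idx])
        except (ValueError, IndexError):
            continue  # skip malformed rows or non-numeric entries
        if value > threshold:
            flagged.append((fields[id_idx], value))
    return flagged


# Made-up rows in the shape of a .lmiss file.
sample_lmiss = [
    " CHR        SNP   N_MISS   N_GENO   F_MISS",
    "   1    snp_a        0       50        0",
    "   1    snp_b       10       50      0.2",
    "   2    snp_c        3       50     0.06",
]

print(high_missingness(sample_lmiss))
```

On the real output you could call `high_missingness(open("miss_stat.lmiss"))` or `high_missingness(open("miss_stat.imiss"))`.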

### Convert to plink binary format

In [None]:
! plink --file coatColor --allow-no-sex --dog --make-bed --noweb --out coatColor.binary

### Run a simple association step (the GWAS part!)

In [None]:
! plink --bfile coatColor.binary --make-pheno coatColor.pheno "yellow" --assoc --reference-allele minor_alleles --allow-no-sex --adjust --dog --noweb --out coatColor

### Identify statistical cutoffs
This code looks in `coatColor.assoc.adjusted` for the first SNP whose adjusted p value (column 10) reaches 0.05 and 0.01, and reports the unadjusted p value (column 3) of that SNP. On a Manhattan plot, where p values are negative-log-transformed, these cutoffs are meant to be drawn as horizontal lines to highlight haplotypes that cross the 0.05 and 0.01 statistical thresholds (i.e. have a statistically significant association with yellow coat color).

In [None]:
%%bash
unad_cutoff_sug=$(tail -n+2 coatColor.assoc.adjusted | awk '$10>=0.05' | head -n1 | awk '{print $3}')
unad_cutoff_conf=$(tail -n+2 coatColor.assoc.adjusted | awk '$10>=0.01' | head -n1 | awk '{print $3}')
# print the cutoffs; shell variables do not persist once this cell finishes
echo "suggestive (0.05): $unad_cutoff_sug  confident (0.01): $unad_cutoff_conf"
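The same lookup can be sketched in Python, which also makes it easy to convert a cutoff to the negative-log scale used on a Manhattan plot. This assumes, as the `head -n1` in the bash cell implies, that rows are ordered from most to least significant. The function name and the sample numbers are invented for illustration; the header shown is PLINK's default column order for `.assoc.adjusted` files.

```python
import math


def unadjusted_cutoff(lines, threshold, adjusted_col=10, unadjusted_col=3):
    """Return the unadjusted p-value of the first data row whose adjusted
    p-value is >= threshold, or None if no row qualifies.
    Column numbers are 1-based, matching the awk fields in the bash cell."""
    for line in list(lines)[1:]:  # skip the header row
        fields = line.split()
        try:
            adjusted = float(fields[adjusted_col - 1])
            unadjusted = float(fields[unadjusted_col - 1])
        except (ValueError, IndexError):
            continue  # skip rows with missing or non-numeric values
        if adjusted >= threshold:
            return unadjusted
    return None


# Made-up values laid out like a .assoc.adjusted file.
sample_adjusted = [
    "CHR SNP UNADJ GC BONF HOLM SIDAK_SS SIDAK_SD FDR_BH FDR_BY",
    "5 snp_a 1e-8 1e-7 1e-4 1e-4 1e-4 1e-4 1e-4 0.001",
    "2 snp_b 1e-4 1e-3 0.5 0.5 0.4 0.4 0.01 0.03",
    "28 snp_c 0.002 0.01 1 1 1 1 0.05 0.2",
]

for threshold in (0.05, 0.01):
    cutoff = unadjusted_cutoff(sample_adjusted, threshold)
    print(threshold, cutoff, -math.log10(cutoff))
```

The `-math.log10(cutoff)` value is what a plotting function working on the negative-log scale would expect for a horizontal threshold line.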

## Plotting
In this tutorial, plotting is done in R, so at this point change your kernel to R using the kernel selector in the top right. Wait for the status in the bottom left to say 'idle', then continue. You could also stay on the Python kernel and plot with Python packages instead.
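If you would rather stay on the Python kernel, a Manhattan plot can be approximated with a few helper functions. The sketch below is not the qqman implementation. It assumes that `coatColor.assoc` has `CHR`, `BP`, and `P` columns with numeric chromosome codes, and that matplotlib is installed in your environment. All function names are ours.

```python
import math


def read_assoc(path):
    """Read (CHR, BP, P) triples from a PLINK .assoc file, skipping rows
    whose values are not numeric (for example, P reported as NA)."""
    records = []
    with open(path) as handle:
        header = handle.readline().split()
        chr_i, bp_i, p_i = (header.index(name) for name in ("CHR", "BP", "P"))
        for line in handle:
            fields = line.split()
            try:
                records.append((int(fields[chr_i]), int(fields[bp_i]), float(fields[p_i])))
            except (ValueError, IndexError):
                continue
    return records


def manhattan_coordinates(records):
    """Turn (chrom, bp, p) triples into (x, -log10(p), chrom) points, laying
    the chromosomes end to end along the x-axis."""
    ordered = sorted((r for r in records if r[2] > 0), key=lambda r: (r[0], r[1]))
    points, offset, current, last_bp = [], 0, None, 0
    for chrom, bp, p in ordered:
        if chrom != current:
            offset += last_bp  # shift by the largest position on the previous chromosome
            current = chrom
        last_bp = bp
        points.append((offset + bp, -math.log10(p), chrom))
    return points


def plot_manhattan(points):
    """Scatter the points, alternating colors by chromosome. Returns the figure."""
    import matplotlib.pyplot as plt  # assumption: matplotlib is available

    fig, ax = plt.subplots(figsize=(12, 4))
    xs = [x for x, _, _ in points]
    ys = [y for _, y, _ in points]
    colors = ["navy" if chrom % 2 else "darkorange" for _, _, chrom in points]
    ax.scatter(xs, ys, c=colors, s=6)
    ax.set_xlabel("Genomic position (chromosomes concatenated)")
    ax.set_ylabel("-log10(P)")
    return fig
```

From inside the `GWAS` directory, something like `fig = plot_manhattan(manhattan_coordinates(read_assoc("coatColor.assoc")))` should draw the figure. Threshold lines could be added with `ax.axhline` at whatever negative-log cutoffs you choose.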

### Install qqman

In [None]:
install.packages('qqman', contriburl=contrib.url('https://cran.r-project.org/'))

### Run the plotting function

In [None]:
# make sure you are still in the GWAS directory; changing the kernel may reset the working directory to home
setwd('GWAS')

In [None]:
library(qqman)

In [None]:
data=read.table("coatColor.assoc", header=TRUE)

In [None]:
data=data[!is.na(data$P),]

In [None]:
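# suggestiveline and genomewideline are given on the -log10(P) scale; 12 and 15 are hard-coded here
manhattan(data, p = "P", col = c("blue4", "orange3"),
          suggestiveline = 12,
          genomewideline = 15,
          chrlabs = c(1:38, "X"), annotateTop=TRUE, cex = 1.2)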

## Conclusions

In our graph, haplotypes in four parts of the genome (chromosomes 2, 5, 28, and X) are found to be associated with an increased occurrence of the yellow coat color phenotype.

The top associated mutation is a nonsense SNP in the gene MC1R, which is known to control pigment production. The MC1R allele encoding yellow coat color contains a single base change (from C to T) at the 916th nucleotide.

## Clean Up
To clean up, stop this instance. Optionally, you can also delete the instance and any storage bucket you created.