# 1000 Genomes Single Chromosome PCA Example: Reading variant data into an R sparse matrix

Adapted from:  
http://bwlewis.github.io/1000_genomes_examples/PCA.html  
https://github.com/bwlewis/1000_genomes_examples  

## Introduction

This example walks through a principal component analysis (PCA) of genomic variant data across one chromosome from 2,504 people in the [1000 Genomes Project](https://www.internationalgenome.org/1000-genomes-summary/). It projects all of the variant data for one chromosome onto a three-dimensional subspace, and then plots the result.

The example uses:

- a very simple C parsing program to efficiently read variant data into an R sparse matrix,
- the `irlba` package to efficiently compute principal components,
- the `threejs` package to visualize the result.

This example is intended to be run in a Verily Workbench notebook cloud environment ('Jupyterlab Vertex AI Workbench instance'), using the R environment image.  You can take the defaults when creating the notebook environment, though some compute-heavy aspects of the analysis will run more quickly with additional cores.

## Setup and configuration

Create Workbench [referenced resources](https://support.workbench.verily.com/docs/technical_reference/data_resources/) pointing to 1000 Genomes data, if the resources have not already been created.
The GCS URIs point to folders from the 1000 Genomes public dataset. Each command first tries to resolve the resource by name and adds it only if that fails, so it is safe to run these commands more than once.

In [None]:
system("wb resource resolve --name vcf-20150220 || wb resource add-ref gcs-object --path gs://genomics-public-data/1000-genomes-phase-3/vcf-20150220 --name vcf-20150220",
       intern = TRUE)
system("wb resource resolve --name 1000genomes_ftp || wb resource add-ref gcs-object --path gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp --name 1000genomes_ftp",
       intern = TRUE)
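
Both commands above follow the same resolve-or-add pattern. As a minimal sketch, a small helper can build that command string for any resource. `ensure_ref_cmd` is a hypothetical name introduced here for illustration; the `wb` subcommands and flags are the ones already used in the cell above.

```r
# Hypothetical helper (not part of the wb CLI): build the shell command that
# adds a referenced GCS resource only if it cannot already be resolved by name.
ensure_ref_cmd <- function(name, gcs_path) {
  sprintf(
    "wb resource resolve --name %s || wb resource add-ref gcs-object --path %s --name %s",
    shQuote(name, type = "sh"),
    shQuote(gcs_path, type = "sh"),
    shQuote(name, type = "sh")
  )
}

cmd <- ensure_ref_cmd(
  "vcf-20150220",
  "gs://genomics-public-data/1000-genomes-phase-3/vcf-20150220"
)
print(cmd)
# To actually run it in the notebook: system(cmd, intern = TRUE)
```

Building the string separately from running it makes it easy to inspect the command before it touches the workspace.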

Next, create a [controlled resource](https://support.workbench.verily.com/docs/technical_reference/data_resources/#referenced-vs-workspace-controlled-data-resources) bucket, which we'll use later to store some analysis results. It doesn't hurt to run this cell if the resource already exists.

In [None]:
system("wb resource resolve --name workspace_files || wb resource create gcs-bucket --name=workspace_files --cloning=COPY_NOTHING --description='Bucket for data and reports'", intern = TRUE)

Mount the new resources so that you can access the contents as if they are part of the file system.  
Once the resources are defined for a workspace, they will be automounted for any new cloud environments that you create.  Because this cloud environment already exists, we'll run the command now so that we can access these new resources.

In [None]:
system("wb resource mount")

After you've run this command, you should be able to see these new resources listed under `~/workspace`:

In [None]:
system("ls -l /home/jupyter/workspace", intern = TRUE)

### Install some packages

Install some necessary R packages. You only need to run the following two installation commands once per notebook environment.

In [None]:
# Fast and memory efficient methods for truncated singular value decomposition and
# principal components analysis of large sparse and dense matrices.
install.packages("irlba")

In [None]:
# Create interactive 3D scatter plots, network plots, and globes using the 'three.js' visualization library
install.packages("threejs")

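Load the packages. `Matrix` and `tidyverse` are not installed above, so they are assumed to be available already in the R environment image: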

In [None]:
library(Matrix)
library(irlba)
library(tidyverse)
library(threejs)

Download and compile a small C program to parse VCF files. We could read and parse the VCF file with R alone, but it would take longer. Compiling without `-o` produces an executable named `a.out` in the working directory, which a later cell calls as `./a.out`. **You only need to run this cell once per notebook environment** (though it's harmless to run again).

In [None]:
system("wget https://raw.githubusercontent.com/bwlewis/1000_genomes_examples/master/parse.c")
system("cc -O2 parse.c")
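
For intuition, here is a minimal R-only sketch of the kind of parsing involved, run on three made-up lines of genotype columns. It assumes the parser emits one (variant number, sample number) pair for every genotype with at least one non-reference allele; that is an assumption about `parse.c`, not something verified here, and `parse_genotypes` is a hypothetical name. It also assumes each field holds only the genotype (for example `0|1`).

```r
# Toy genotype columns (VCF columns 10 onward): three variants, four samples
toy_lines <- c("0|0\t0|1\t1|1\t0|0",
               "0|0\t0|0\t0|0\t1|0",
               "0|1\t0|0\t0|0\t0|0")

# Hypothetical helper: return one (variant, sample) row per non-reference genotype
parse_genotypes <- function(lines) {
  pairs <- lapply(seq_along(lines), function(variant) {
    gt <- strsplit(lines[variant], "\t", fixed = TRUE)[[1]]
    # A genotype counts as non-reference when any allele digit is not 0
    carriers <- which(grepl("[1-9]", gt))
    data.frame(variant = rep(variant, length(carriers)), sample = carriers)
  })
  do.call(rbind, pairs)
}

toy_pairs <- parse_genotypes(toy_lines)
print(toy_pairs)
```

Looping over lines like this is fine for a toy input but slow for a full chromosome, which is the motivation for the compiled parser.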

## Analysis

All the remaining steps in this example run from R.  
Let’s read the variant data for chromosome 20 into an R sparse matrix. Note that we only care about the variant number and sample (person) number in this exercise and ignore everything else.

Note also that we're using the filepath of a file automounted from the referenced resource we set up earlier.

In [None]:
p <- pipe("cat /home/jupyter/workspace/vcf-20150220/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf  | sed /^#/d  | cut  -f '10-' | ./a.out | cut -f '1-2'")

The next step will take a few minutes to run.

In [None]:
x <- read.table(p, colClasses = c("integer", "integer"), fill = TRUE, row.names = NULL)

In [None]:
# Convert to a sparse matrix of people (rows) x variant (columns)
chr20 <- sparseMatrix(i = x[,2], j = x[,1], x = 1.0)
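
As a small illustration of this coordinate-style construction, the sketch below builds a sparse matrix from four made-up (variant, sample) pairs. The toy data and the fifth sample are hypothetical; `sparseMatrix` and its `dims` argument come from the `Matrix` package.

```r
library(Matrix)

# Toy (variant, sample) pairs in the same two-column layout as `x`
toy <- data.frame(variant = c(1L, 1L, 2L, 3L), sample = c(2L, 3L, 4L, 1L))

# Without `dims`, the shape is inferred from the largest indices present
m <- sparseMatrix(i = toy$sample, j = toy$variant, x = 1.0)
print(dim(m))

# Passing `dims` keeps rows for samples that carry none of the listed variants
# (here a hypothetical fifth sample)
m5 <- sparseMatrix(i = toy$sample, j = toy$variant, x = 1.0, dims = c(5, 3))
print(as.matrix(m5))
```

If the last samples in a file happened to carry no variants, the inferred row count would come up short, so supplying `dims` explicitly is a reasonable safeguard.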

In [None]:
# Inspect the dimensions of this matrix
print(dim(chr20))
# [1]    2504 1812841

We’ve loaded a sparse matrix with 2,504 rows (people) by 1,812,841 columns (variants). The next step computes the first three principal component vectors using the `irlba` package and plots a 3D scatterplot using the `threejs` package. Note that the next cell reassigns `p`, which until now held the pipe connection.

In [None]:
cm <- colMeans(chr20)
p <- irlba(chr20, nv = 3, nu = 3, tol = 0.1, center = cm)
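
Passing `center = cm` asks `irlba` to subtract the column means implicitly, so the sparse matrix is never turned into a dense one; the result is a truncated SVD of the centered matrix, which is what PCA computes. The sketch below checks that relationship on a tiny dense matrix using only base R (`svd` and `prcomp`), since forming the centered matrix explicitly is only practical at this scale. The matrix `A` is made-up data for illustration.

```r
set.seed(1)
# Toy 0/1 matrix: 8 "people" x 5 "variants"
A <- matrix(rbinom(40, size = 1, prob = 0.4), nrow = 8)

# Center each column explicitly, then take the SVD
A_centered <- sweep(A, 2, colMeans(A))
s <- svd(A_centered)

# PCA scores are U %*% D; prcomp centers columns by default
scores_svd <- s$u[, 1:3] %*% diag(s$d[1:3])
scores_pca <- unname(prcomp(A)$x[, 1:3])

# Compare up to sign, since singular vectors are only defined up to sign
print(all.equal(abs(scores_svd), abs(scores_pca)))
```

The notebook plots `p$u`, the left singular vectors, rather than the scores; the two differ only by a per-axis scaling with the singular values, so the grouping of points looks the same.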

In [None]:
scatterplot3js(p$u)

### Using ancillary "superpopulation" data

The data exhibit obvious groups, and those groups correspond to the samples' ancestry. We can illustrate this by loading ancillary data from the 1000 Genomes Project that identifies the “superpopulation” of each sample.

Read just the header of the chromosome file to obtain the sample identifiers.  Again, we're using a file automounted from the 1000 Genomes folder we added as a referenced resource:

In [None]:
# Read just the header of the chromosome file to obtain the sample identifiers
ids <- readLines(pipe("cat /home/jupyter/workspace/vcf-20150220/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf  | sed -n /^#CHROM/p | tr '\t' '\n' | tail -n +10"))
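
The shell pipeline above keeps the `#CHROM` header line, splits it on tabs, and drops the first nine fixed VCF columns (`tail -n +10`). A minimal sketch of the same logic in plain R, run on a made-up header line with three sample columns:

```r
# Toy header line: nine fixed VCF columns followed by three sample IDs
toy_header <- paste(c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
                      "INFO", "FORMAT", "HG00096", "HG00097", "HG00099"),
                    collapse = "\t")

# Split on tabs and drop the nine fixed columns, leaving the sample IDs
fields <- strsplit(toy_header, "\t", fixed = TRUE)[[1]]
toy_ids <- fields[-(1:9)]
print(toy_ids)
```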

Download and parse the population code for each sample:

In [None]:
# Download the sample information and keep the population code for each sample, ordered by ids.
# With row.names = 2, the sample ID column becomes the row names, so column 6 is Population.
ped <- read.table(url("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped"),
                  sep = "\t", header = TRUE, row.names = 2)[ids, 6, drop = FALSE]

Read the subpopulation and superpopulation codes.  We're again using the path to one of the files automounted from a referenced resource we defined earlier.

In [None]:
# Read the subpopulation and superpopulation codes
pop <- read.table("/home/jupyter/workspace/1000genomes_ftp/20131219.populations.tsv", sep = "\t", header = TRUE)
pop <- pop[1:26,]
superPopulation <- pop[,3]
names(superPopulation) <- pop[,2]
superPopulation <- factor(superPopulation)

In [None]:
# Map sample sub-populations to super-populations
ped$Superpopulation <- superPopulation[as.character(ped$Population)]

# Plot with colors corresponding to super-populations
N <- length(levels(superPopulation))
scatterplot3js(p$u, col = rainbow(N)[ped$Superpopulation], size = 0.5)
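
The mapping above works by indexing a named factor with each sample's sub-population code. A minimal sketch of that lookup on toy data follows; the sample IDs `s1` to `s4` and the three-entry lookup table are hypothetical stand-ins for `ped` and `superPopulation`.

```r
# Toy lookup table: names are sub-population codes, values are super-populations
toy_super <- c(GBR = "EUR", YRI = "AFR", CHB = "EAS")
toy_super <- factor(toy_super)

# Toy per-sample table with hypothetical sample IDs
toy_ped <- data.frame(Population = c("YRI", "GBR", "GBR", "CHB"),
                      row.names = c("s1", "s2", "s3", "s4"))

# Index the named factor by sub-population code to get one label per sample
toy_ped$Superpopulation <- toy_super[as.character(toy_ped$Population)]

# Factor codes (1..N, in alphabetical level order) pick one color per sample
N <- length(levels(toy_super))
toy_cols <- rainbow(N)[as.integer(toy_ped$Superpopulation)]
print(toy_ped)
```

Samples from the same super-population share a factor code and therefore a color, which is what makes the groups visible in the 3D plot.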

## Save your work

Earlier in the notebook, we created a workspace GCS bucket named `workspace_files`, and mounted it to the notebook server's file system.  
You can directly write to such buckets as if they are part of the local file system, which makes it easy to persist analysis results, data, and notebooks, and to share data across the workspace's cloud environments.

The example below shows how you can save a dataframe to a `.tsv` file in that bucket.

In [None]:
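system("mkdir -p /home/jupyter/workspace/workspace_files/1kgenomes_analysis")
# write_tsv() drops row names, so move the sample IDs into a "Sample" column first
write_tsv(rownames_to_column(ped, "Sample"), '/home/jupyter/workspace/workspace_files/1kgenomes_analysis/ped.tsv')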

If you like, you can also save your notebook(s) in progress to your workspace bucket. This is useful if you've made some modifications:

In [None]:
system("mkdir -p /home/jupyter/workspace/workspace_files/1kgenomes_analysis/notebooks")
system("cp ./R_1k_genomes.ipynb /home/jupyter/workspace/workspace_files/1kgenomes_analysis/notebooks", intern = TRUE)

While we don't show it here, note that if you've set up [GitHub integration](https://support.workbench.verily.com/docs/technical_reference/cloud_environments/git_repos_ssh_keys/) for your workspace, you can also use source control to checkpoint your work.

---
Copyright 2023 Verily Life Sciences LLC

Use of this source code is governed by a BSD-style  
license that can be found in the LICENSE file or at  
https://developers.google.com/open-source/licenses/bsd