# Parse Principal Components

This notebook accesses the results of the genomic PCA and genomic ancestry prediction for use in our analysis.
This notebook can be run with a standard VM.

## Get principal component data from the CDR bucket

More information on these files can be found here: https://support.researchallofus.org/hc/en-us/articles/4616869437204-Controlled-CDR-Directory

In [None]:
system("gsutil -u $GOOGLE_PROJECT cp gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux/ancestry/ancestry_preds.tsv ./")
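
`system()` returns the exit status of the command rather than raising an error, so a failed copy can go unnoticed until the file is read. A minimal sketch of a wrapper that stops on a non-zero status is shown below. The helper name `run_checked` is hypothetical (it is not part of this notebook or of any package), and the `echo` command is only a stand-in for the `gsutil` call.

```r
# Hypothetical helper: run a shell command and stop if it exits with a non-zero status.
run_checked <- function(cmd) {
  status <- system(cmd)  # with the default intern = FALSE, system() returns the exit status
  if (status != 0) {
    stop("Command failed with exit status ", status, ": ", cmd)
  }
  invisible(status)
}

# Stand-in command for illustration; in this notebook the gsutil copy would go here.
run_checked("echo copy finished")
```

Wrapping the copy this way makes the cell fail at the point of the problem instead of later, when `read_tsv` cannot find the file.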

In [None]:
library(tidyverse)

## Read the localized file

Now that the file is on our VM, we can load it into R.

In [None]:
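# Compact column spec: i = integer, c = character, - = skip.
# This keeps columns 1, 2 and 4 and skips columns 3 and 5.
ancestry_tab = read_tsv("ancestry_preds.tsv", col_types="ic-c-")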

In [None]:
#Uncomment to view the raw file format
#head(ancestry_tab)

## Transform the data

Because the PCs are stored as a single string in a format like [pc1,pc2,...], we need to split them into independent columns. The cell below keeps the first five.

In [None]:
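# Split on commas and brackets. The leading "[" yields an empty first piece, which NA discards.
# extra="drop" silently drops pieces beyond pc5; convert=TRUE turns the PC strings into numbers.
pcs = ancestry_tab %>%
    separate(pca_features, sep="[,[\\]]", into=c(NA,paste0("pc",1:5)), extra="drop", convert=TRUE) %>%
    rename(person_id=research_id)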
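
To make the split easier to follow, here is a small sketch of the same technique on toy data. The values and the three-component width are invented for illustration and do not come from the real file. Both brackets are escaped in this pattern, which is one assumed way to write the character class.

```r
library(tidyr)
library(tibble)

# Toy stand-in for the ancestry table; values are made up for illustration.
toy <- tibble(
  research_id = c(1L, 2L),
  pca_features = c("[0.1,-0.2,0.3]", "[0.4,0.5,-0.6]")
)

toy_pcs <- separate(
  toy, pca_features,
  sep = "[,\\[\\]]",                 # split on ",", "[" or "]"
  into = c(NA, paste0("pc", 1:3)),   # NA discards the empty piece before "["
  extra = "drop",                    # drop the empty piece after "]"
  convert = TRUE                     # convert the resulting strings to numeric
)

print(toy_pcs)
```

Each input string is cut into an empty leading piece, one piece per component, and an empty trailing piece; only the middle pieces are kept as columns.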

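Here are the results. Note that, for presentation, `person_id` is dropped and each remaining column is shuffled independently, so no displayed row corresponds to a real participant.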

In [None]:
pcs %>% transmute(across(-person_id, \(x) sample(x))) %>% head()
#head(pcs)

## Filter related individuals

One additional step we can take now is to filter out participants who were flagged for relatedness. It's possible to make decisions about which individual in a family group to keep, but we will just use the AoU-created list for simplicity.

In [None]:
system("gsutil -u $GOOGLE_PROJECT cp gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux/relatedness/relatedness_flagged_samples.tsv ./")

In [None]:
to_drop = read_tsv("relatedness_flagged_samples.tsv")

In [None]:
pcs = pcs %>% filter(!(person_id %in% to_drop$sample_id))
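
As a sketch of what this filter does, the toy example below removes flagged IDs with `%in%` and, as an alternative, with `anti_join()`. The tables and IDs are invented for illustration; only the column names `person_id` and `sample_id` mirror the ones used above.

```r
library(dplyr)
library(tibble)

# Toy data; IDs and values are made up for illustration.
toy_pcs  <- tibble(person_id = c(1L, 2L, 3L, 4L), pc1 = c(0.1, 0.2, 0.3, 0.4))
toy_drop <- tibble(sample_id = c(2L, 4L))

# Same pattern as the cell above: keep rows whose ID is not in the flagged list.
kept_in <- toy_pcs %>% filter(!(person_id %in% toy_drop$sample_id))

# Alternative: anti_join() keeps rows of toy_pcs that have no match in toy_drop.
kept_join <- toy_pcs %>% anti_join(toy_drop, by = c("person_id" = "sample_id"))

# Report how many rows were removed.
n_removed <- nrow(toy_pcs) - nrow(kept_in)
message(n_removed, " flagged participants removed")
```

Comparing row counts before and after the filter is a quick way to confirm that the flagged list matched some of the IDs.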

## Save the file and store it in our bucket

In [None]:
write_csv(pcs, "pcs.csv")

In [None]:
system("gsutil cp pcs.csv $WORKSPACE_BUCKET/")