# Tutorial: Principal Component Analysis using rTASSEL

## Enter your title here  

**Objective**: Describe the objective of this analysis   
**Data**: Describe your data       
**User and contact**: your name, your contact     

### Table of contents
* [Notes](#Notes) 
* [Load libraries](#Load-libraries)
* [Parameters and functions](#Parameters-and-functions)
* [Data](#Data)
* [Analysis](#Analysis)
    * [PCA with genotype data](#PCA-with-genotype-data)
    * [Add metadata to PCA](#Add-metadata-to-PCA)
* [References and additional resources](#References-and-additional-resources)

## Notes

This tutorial assumes: 
1. You already know how to load your data (via a flat file or a BrAPI database) into rTASSEL and have inspected your data:
    - See 01_rTASSEL_Load_Data.ipynb for a tutorial on how to load flat files into rTASSEL
    - See brapi_template.ipynb for how to load data via BrAPI databases
2. You have filtered your genotype data:
    - See 02_rTASSEL_filter_geno.ipynb for a tutorial on how to filter genotype data in rTASSEL
3. You have a CSV file of metadata with a "Taxa" field whose values match the taxa in your genotype file
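
The third assumption is worth verifying before running the analysis. The following is a minimal sketch in base R, using made-up taxa names and a made-up metadata table in place of your real data; `genoTaxa` is a hypothetical vector standing in for the taxa names of your genotype table.

```r
# Toy stand-ins (hypothetical values); in practice these would come from
# your genotype table and your metadata CSV.
genoTaxa <- c("taxon_A", "taxon_B", "taxon_C", "taxon_D")
taxaMetadata <- data.frame(
    Taxa = c("taxon_A", "taxon_B", "taxon_C", "taxon_E"),
    Subpopulation = c("pop_1", "pop_2", "pop_1", "pop_2"),
    stringsAsFactors = FALSE
)

# The join key must exist in the metadata
stopifnot("Taxa" %in% names(taxaMetadata))

# Taxa in the genotype data that have no metadata row
missingFromMetadata <- setdiff(genoTaxa, taxaMetadata$Taxa)

# Taxa in the metadata that are absent from the genotype data
missingFromGeno <- setdiff(taxaMetadata$Taxa, genoTaxa)

# Repeated metadata rows would duplicate points after a join
duplicatedTaxa <- taxaMetadata$Taxa[duplicated(taxaMetadata$Taxa)]

print(missingFromMetadata)
print(missingFromGeno)
print(duplicatedTaxa)
```

If any of the three vectors is non-empty, it is usually easier to fix the metadata file first than to debug missing colors in the final plot.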

In [None]:
getwd()

In [None]:
Sys.Date()

## Load libraries

In [None]:
library(data.table) # Efficient I/O handling for delimited data
library(ggplot2)    # Plotting and visualization
library(magrittr)   # Provides the `%>%` pipe operator
library(dplyr)      # Data manipulation (joins)
library(rTASSEL)    # R interface to TASSEL

## Parameters and functions

In [None]:
### PLEASE EDIT WITH THE PATHS TO YOUR DATA ###

# Path to genotype data
myGenoPath <- "/path/to/genotype/data"

# Path to metadata 
myMetadataPath <- "/path/to/metadata"

## Data

In [None]:
tasGeno <- rTASSEL::readGenotypeTableFromPath(
    path = myGenoPath
)
tasGeno

## Analysis

### PCA with genotype data 

In [None]:
pcaGeno <- tasGeno %>% rTASSEL::pca()

In [None]:
str(pcaGeno)

In [None]:
## Inspect `pcaGeno` object ----
pcaGeno %>% class() %>% print()

In [None]:
pcaGeno %>% names() %>% print()

In [None]:
pcaGeno$Eigenvalues_Datum %>% head()

In [None]:
## Plot the proportion of total variance explained by the first `nPCs` PCs ----
nPCs <- 10 # edit this value to visualize a different number of PCs
pcaGeno$Eigenvalues_Datum %>% 
    as.data.frame() %>% 
    head(n = nPCs) %>% 
    ggplot2::ggplot() + 
    aes(x = PC, y = proportion_of_total, group = 1) + 
    geom_line(color = "red") + 
    geom_point(size = 3) +
    xlab("PC") +
    ylab("Proportion of total variance")
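
For readers unfamiliar with the quantity plotted above: in a standard PCA, the proportion of total variance for a component is its eigenvalue divided by the sum of all eigenvalues. The sketch below illustrates that definition with base R's `prcomp()` on random toy data. It is not rTASSEL's implementation, and it assumes that rTASSEL's `proportion_of_total` column follows this usual definition. The column names `eigenvalue` and `cumulative_proportion` are chosen here for illustration.

```r
set.seed(1)

# Toy numeric matrix: 20 samples x 5 variables (not real genotype data)
toyData <- matrix(rnorm(100), nrow = 20, ncol = 5)

# prcomp() is base R's PCA; sdev^2 gives the eigenvalues of the covariance matrix
toyPca <- prcomp(toyData, center = TRUE, scale. = FALSE)
eigenvalues <- toyPca$sdev^2

# Each PC's share of the total variance, plus the running total
varianceTable <- data.frame(
    PC = seq_along(eigenvalues),
    eigenvalue = eigenvalues,
    proportion_of_total = eigenvalues / sum(eigenvalues)
)
varianceTable$cumulative_proportion <- cumsum(varianceTable$proportion_of_total)

print(varianceTable)
```

The proportions sum to 1 and never increase from one PC to the next, which is why the plot of these values typically falls steeply and then flattens.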

In [None]:
pcaGeno$PC_Datum %>% head()

In [None]:
pcaGeno$PC_Datum %>%
    as.data.frame() %>%
    ggplot() +
    aes(x = PC1, y = PC2) +
    geom_point()

### Add metadata to PCA

In [None]:
taxaMetadata <- read.csv(file = myMetadataPath)

In [None]:
head(taxaMetadata)

In [None]:
pcaDatum <- pcaGeno$PC_Datum

In [None]:
head(pcaDatum)

In [None]:
# use left join to add subpopulation information to pcaDatum table
pcaDatum %>%
    left_join(taxaMetadata, by = "Taxa") %>% # uses dplyr
    head()

In [None]:
# Re-visualize the PCA, coloring points by subpopulation
pcaDatum %>%
    left_join(taxaMetadata, by = "Taxa") %>% 
    ggplot() +
    aes(PC1, PC2, color = Subpopulation) +
    geom_point(size = 2)
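
A left join keeps every row of the PC table, so any taxon without a matching metadata row ends up with `NA` in the joined columns rather than being dropped. The sketch below shows one way to spot such rows. It uses base R's `merge()` with `all.x = TRUE` as a stand-in for `left_join()` so that it runs without extra packages, and the tables are made-up toy data rather than real PCA output.

```r
# Toy PC scores (hypothetical values standing in for pcaGeno$PC_Datum)
pcaDatum <- data.frame(
    Taxa = c("taxon_A", "taxon_B", "taxon_C"),
    PC1 = c(-1.2, 0.3, 0.9),
    PC2 = c(0.5, -0.8, 0.3),
    stringsAsFactors = FALSE
)

# Toy metadata that is deliberately missing taxon_C
taxaMetadata <- data.frame(
    Taxa = c("taxon_A", "taxon_B"),
    Subpopulation = c("pop_1", "pop_2"),
    stringsAsFactors = FALSE
)

# all.x = TRUE keeps every PCA row, like a left join
joined <- merge(pcaDatum, taxaMetadata, by = "Taxa", all.x = TRUE)

# Rows with no metadata match have NA in the joined columns
unmatched <- joined$Taxa[is.na(joined$Subpopulation)]

print(joined)
print(unmatched)
```

Running the same `is.na()` check on the `Subpopulation` column of your own joined table, before plotting, shows whether any points will be colored as `NA`.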

## References and additional resources

To cite rTASSEL, please use the following citation:

Monier et al., (2022). rTASSEL: An R interface to TASSEL for analyzing genomic diversity. Journal of Open Source Software, 7(76), 4530, https://doi.org/10.21105/joss.04530

You can find more information about rTASSEL here:

https://maize-genetics.github.io/rTASSEL/index.html

An interactive rTASSEL tutorial is available on Binder here:

https://mybinder.org/v2/gh/btmonier/rTASSEL_sandbox/HEAD?labpath=getting_started.ipynb