# Tutorial: Principal Component Analysis (PCA) using rTASSEL

## Enter your notebook title here

**Objective**: Enter your objective here  
**Data**: Describe your data set here  
**User and contact**: Enter your name and contact here

### Table of contents
* [Notes](#Notes) 
* [Libraries](#Libraries)
* [Parameters and functions](#Parameters-and-functions)
* [Data](#Data)
    + [Load metadata into R](#Load-metadata-into-R)
    + [Genotype data](#Genotype-data)
        + [Retrieve BrAPI data and filter](#Retrieve-BrAPI-data-and-filter)
        + [Inspect genotype data in R](#Inspect-genotype-data-in-R)
        + [Load genotype data into rTASSEL](#Load-genotype-data-into-rTASSEL)
* [Analysis](#Analysis)
    + [Filter genotype data in rTASSEL](#Filter-genotype-data-in-rTASSEL)
    + [PCA with genotype data](#PCA-with-genotype-data)
    + [Add metadata to scatterplot](#Add-metadata-to-scatterplot)
* [References and additional resources](#References-and-additional-resources)

## Notes

This tutorial assumes: 
1. You already know how to load your data via a BrAPI database into rTASSEL and will inspect your data:
    - See 01_rTASSEL_load_data.brapi.ipynb on how to load files via BrAPI databases.
2. You will filter your genotype data as appropriate for your data set and analysis:
    - See 02_rTASSEL_GenotypeFiltering.brapi.ipynb for a tutorial on how to filter genotype data when retrieving data via BrAPI and using rTASSEL.
3. You have a CSV metadata file with a "Taxa" column whose values match the taxa names in your genotype data. (Alternatively, you can use data from a phenotype table imported via BrAPI.)

More on the `pca()` function can be found [here](https://rtassel.maizegenetics.net/reference/pca.html), `plotScree()` [here](https://rtassel.maizegenetics.net/reference/plotScree.html) and `plotPCA()` [here](https://rtassel.maizegenetics.net/reference/plotPCA.html).

In [None]:
getwd()

In [None]:
Sys.Date()

## Libraries

In [None]:
library(rTASSEL)
library(QBMS)

## Parameters and functions

**Please edit the paths to your own data:**

In [None]:
# Path to taxa metadata
myMetadataPath <- "/path/to/metadata"

Create a function for setting the dimensions of a plot:

In [None]:
# Set the plot width and height (in inches) used by the Jupyter R kernel
fig <- function(width, height) {
    options(
        repr.plot.width  = width,
        repr.plot.height = height
    )
}
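
If you want to return to the previous plot size after a figure, one option is a variant of the helper that hands back the old settings. The sketch below is only an illustration: `figWithReset()` is a hypothetical name, not part of rTASSEL or this notebook, and it relies on the base R behavior that `options()` returns the previous values.

```r
# Hypothetical variant of fig(): sets the plot size and returns the old settings
figWithReset <- function(width, height) {
    old <- options(
        repr.plot.width  = width,
        repr.plot.height = height
    )
    invisible(old)
}

# Enlarge the plotting area, remembering what was set before
oldDims <- figWithReset(10, 5)
widthDuringPlot <- getOption("repr.plot.width")

# ... create your plot here ...

# Restore the previous dimensions
options(oldDims)
```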

## Data

### Load metadata into R

In [None]:
taxaMetadata <- read.csv(file = myMetadataPath)
taxaMetadata |> head()
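
Because the metadata is matched to the genotype data by taxa name, it can help to check the "Taxa" column before plotting. The following is a minimal sketch in base R using made-up data: `taxaMetadataToy` and `genotypeTaxaToy` are illustrative stand-ins for your own metadata table and the taxa names in your genotype data.

```r
# Toy metadata table (stand-in for the table read from your CSV file)
taxaMetadataToy <- data.frame(
    Taxa          = c("line_A", "line_B", "line_C"),
    Subpopulation = c("pop1", "pop2", "pop1")
)

# Toy vector of taxa names (stand-in for the taxa in your genotype data)
genotypeTaxaToy <- c("line_A", "line_B", "line_D")

# The metadata must contain a column named "Taxa"
stopifnot("Taxa" %in% names(taxaMetadataToy))

# Taxa present in the genotype data but absent from the metadata
missingFromMetadata <- setdiff(genotypeTaxaToy, taxaMetadataToy$Taxa)

# Taxa present in the metadata but absent from the genotype data
unusedMetadata <- setdiff(taxaMetadataToy$Taxa, genotypeTaxaToy)

# Taxa listed more than once in the metadata
duplicatedTaxa <- taxaMetadataToy$Taxa[duplicated(taxaMetadataToy$Taxa)]

missingFromMetadata
unusedMetadata
duplicatedTaxa
```

Any taxa reported in `missingFromMetadata` would have no metadata value available when the scatter plot is colored by a metadata column.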

### Genotype data

#### Retrieve BrAPI data and filter

**You will need to log into Gigwa using the BrAPI helper.**

In [None]:
geno_provider$gigwa_list_dbs()

**Please edit the code to set your database (db):**

In [None]:
geno_provider$gigwa_set_db("myDataBase")

In [None]:
geno_provider$gigwa_list_projects()

**Please edit the code to set your project:**

In [None]:
geno_provider$gigwa_set_project("myProject")

**Edit the code below to use appropriate filters for your data set and analysis. Additional filtering can be done after retrieving the data and loading it into rTASSEL.**

In [None]:
genoDataFromGigwa <- geno_provider$gigwa_get_variants(
    max_missing = 0.2,
    min_maf = 0.05)
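
To make the two filters more concrete, the sketch below computes a per-site missing rate and minor allele frequency (MAF) on a small made-up matrix in base R. It is an illustration only, not how Gigwa applies its filters: the 0/1/2 allele-count coding, the toy values, and the inclusive thresholds are all assumptions.

```r
# Toy genotype matrix: rows are taxa, columns are sites.
# Values are counts of the alternate allele (0, 1, 2); NA means missing.
geno <- matrix(
    c(0, 1, 2, 0, 1,      # site1: no missing calls
      0, 0, 0, 0, 0,      # site2: monomorphic
      NA, NA, 1, 2, 2),   # site3: 2 of 5 calls missing
    nrow = 5,
    dimnames = list(paste0("taxon", 1:5), paste0("site", 1:3))
)

# Proportion of missing calls per site
missingRate <- colMeans(is.na(geno))

# Alternate allele frequency per site (diploid: 2 alleles per called taxon)
altFreq <- colSums(geno, na.rm = TRUE) / (2 * colSums(!is.na(geno)))

# Minor allele frequency is the smaller of the two allele frequencies
maf <- pmin(altFreq, 1 - altFreq)

# Sites that would pass thresholds like those used above (assumed inclusive)
keep <- missingRate <= 0.2 & maf >= 0.05

data.frame(missingRate, maf, keep)
```

In this toy example only `site1` passes: `site2` has no minor allele and `site3` has too many missing calls.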

#### Inspect genotype data in R

In [None]:
genoDataFromGigwa |> head()
genoDataFromGigwa |> dim()
genoDataFromGigwa |> names()

#### Load genotype data into rTASSEL 

In [None]:
tasGeno <- genoDataFromGigwa |> rTASSEL::readGenotypeTableFromGigwa()

In [None]:
tasGeno

## Analysis

### Filter genotype data in rTASSEL

Perform additional filtering steps in rTASSEL for your data set and analysis:  
- See 02_rTASSEL_GenotypeFiltering.brapi.ipynb for more details about filtering.

In [None]:
# Example only
#tasGeno |>
#    filterGenotypeTableTaxa(
#        minNotMissing = .5
#    ) |>
#    filterGenotypeTableSites(
#        siteMinAlleleFreq = 0.05,
#        maxHeterozygous = 0.5
#    )

### PCA with genotype data 

Run principal component analysis on your genotype data using the `pca()` function in rTASSEL:

In [None]:
pcaGeno <- tasGeno |> rTASSEL::pca()

In [None]:
pcaGeno
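
For intuition about what a PCA returns, here is a minimal sketch using base R's `prcomp()` on a small simulated matrix. This is not rTASSEL's implementation and the settings may differ from those used by `pca()`; the toy 0/1/2 genotype coding and all object names are assumptions made for illustration.

```r
# Simulate a toy genotype matrix: 6 taxa x 8 sites, coded 0/1/2
set.seed(1)
genoToy <- matrix(
    sample(0:2, 6 * 8, replace = TRUE),
    nrow = 6,
    dimnames = list(paste0("taxon", 1:6), paste0("site", 1:8))
)

# PCA on the centered (unscaled) matrix
pcaToy <- prcomp(genoToy, center = TRUE, scale. = FALSE)

# Eigenvalues: the variance explained by each principal component
eigenvaluesToy <- pcaToy$sdev^2

# Scores: the coordinates of each taxon on each principal component
scoresToy <- pcaToy$x

head(scoresToy[, 1:2])
```

The eigenvalues and the per-taxon scores are the same kinds of quantities that are inspected in the tables and plots in this section.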

In [None]:
pcaGeno |> reportNames()

In [None]:
pcaGeno |> tableReport("Eigenvalues_Datum") |> head()

Set the plot dimensions with the `fig()` function created at the start of the notebook:

In [None]:
fig(10,10)

Create a scree plot using the eigenvalues generated in your PCA with the `plotScree()` function:

In [None]:
pcaGeno |> plotScree()
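
A scree plot is usually read in terms of the proportion of variance each component explains. The sketch below shows that arithmetic in base R on made-up eigenvalues; the numbers are illustrative only and do not come from any real data set.

```r
# Made-up eigenvalues, sorted from largest to smallest
eigenvalues <- c(4.0, 2.0, 1.0, 0.5, 0.5)

# Proportion of total variance explained by each component
propVar <- eigenvalues / sum(eigenvalues)

# Cumulative proportion of variance explained
cumVar <- cumsum(propVar)

screeTable <- data.frame(
    PC         = paste0("PC", seq_along(eigenvalues)),
    eigenvalue = eigenvalues,
    proportion = propVar,
    cumulative = cumVar
)
screeTable
```

Here the first two components together explain 75% of the total variance, which is the kind of information used to decide how many components to plot or keep.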

In [None]:
pcaGeno |> tableReport("PC_Datum") |> head()

Create a scatter plot with your chosen principal components using `plotPCA()`:

In [None]:
pcaGeno |> plotPCA(
    x = 1,
    y = 2
)

### Add metadata to scatterplot

In [None]:
taxaMetadata |> head()

In [None]:
pcaGeno |> plotPCA(
    x = 1,
    y = 2,
    metadata = taxaMetadata,
    mCol = "Subpopulation")
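
Conceptually, coloring the scatter plot by metadata amounts to joining the PC scores to the metadata by taxa name and using one metadata column as the grouping variable. The following base R sketch illustrates this idea with made-up data. It is not how `plotPCA()` is implemented, and it assumes a PC score table with a "Taxa" column.

```r
# Toy PC scores (stand-in for a PC score table with a "Taxa" column)
pcScoresToy <- data.frame(
    Taxa = c("line_A", "line_B", "line_C"),
    PC1  = c(-1.2, 0.3, 0.9),
    PC2  = c(0.4, -0.8, 0.4)
)

# Toy metadata with a matching "Taxa" column
metadataToy <- data.frame(
    Taxa          = c("line_A", "line_B", "line_C"),
    Subpopulation = c("pop1", "pop2", "pop1")
)

# Join the two tables on the shared "Taxa" column
plotData <- merge(pcScoresToy, metadataToy, by = "Taxa")

# Color points by subpopulation
groups <- factor(plotData$Subpopulation)
plot(
    plotData$PC1, plotData$PC2,
    col = as.integer(groups), pch = 19,
    xlab = "PC1", ylab = "PC2"
)
legend(
    "topright",
    legend = levels(groups),
    col = seq_along(levels(groups)), pch = 19
)
```

Because `merge()` keeps only taxa found in both tables by default, taxa without a metadata row would drop out of a plot built this way.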

## References and additional resources

**To cite rTASSEL, please use the following citation:**

Monier et al., (2022). rTASSEL: An R interface to TASSEL for analyzing genomic diversity. Journal of Open Source Software, 7(76), 4530, https://doi.org/10.21105/joss.04530.

You can find more information about rTASSEL [here](https://rtassel.maizegenetics.net) and an rTASSEL tutorial in Binder [here](https://mybinder.org/v2/gh/btmonier/rTASSEL_sandbox/HEAD?labpath=getting_started.ipynb).

**Please also cite QBMS using the following citation:**  

Al-Shamaa K (2023). QBMS: Query the Breeding Management System(s). R package version 0.9.1, https://icarda-git.github.io/QBMS/.