# Tutorial: Sequence diversity metrics using rTASSEL

## Enter your notebook title here

**Objective**: Enter your objective here  
**Data**: Describe your data set here  
**User and contact**: Enter your name and contact here

### Table of contents
* [Notes](#Notes)
* [Libraries](#Libraries)
* [Parameters and functions](#Parameters-and-functions)
* [Data](#Data)
    * [Genotype data](#Genotype-data)
        * [Inspect genotype data in R](#Inspect-genotype-data-in-R)
        * [Load genotype data into rTASSEL](#Load-genotype-data-into-rTASSEL)
* [Analysis](#Analysis)
    * [Sequence diversity](#Sequence-diversity)
* [References and additional resources](#References-and-additional-resources)

## Notes

This tutorial assumes that:
1. You already know how to load data from a BrAPI database into rTASSEL, and that you will inspect your data:
    - See 01_rTASSEL_load_data.brapi.ipynb for how to load files via BrAPI databases.
2. You will filter your genotype data as appropriate for your data set and analysis:
    - See 02_rTASSEL_GenotypeFiltering.brapi.ipynb for a tutorial on filtering genotype data retrieved via BrAPI with rTASSEL.
    
Additional documentation on the `seqDiversity()` function in rTASSEL can be found [here](https://rtassel.maizegenetics.net/reference/seqDiversity.html).

In [None]:
getwd()

In [None]:
Sys.Date()

## Libraries

In [None]:
library(QBMS)
library(rTASSEL)
library(ggplot2)
library(dplyr)

## Parameters and functions

**Please edit the path to your own data:**

In [None]:
# Path to genotype data
myGenoPath <- "/path/to/genotype/data"

Create a function for setting the dimensions of a plot:

In [None]:
# Set the rendered plot size for subsequent plots in the notebook
fig <- function(width, height) {
    options(
        repr.plot.width  = width, 
        repr.plot.height = height
    )
}
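
As a quick check, the sketch below calls a copy of the helper and reads the options back with `getOption()`. It assumes the notebook runs on the IRkernel, where the `repr.plot.width` and `repr.plot.height` options control the rendered plot size in inches; outside Jupyter the options are still set but have no visible effect. The name `plot_dims` is only illustrative.

```r
# Copy of the helper, repeated so this sketch is self-contained
fig <- function(width, height) {
    options(
        repr.plot.width  = width,
        repr.plot.height = height
    )
}

# Request a wide, short plotting area
fig(12, 5)

# Read the options back to confirm they were set
plot_dims <- c(
    width  = getOption("repr.plot.width"),
    height = getOption("repr.plot.height")
)
plot_dims
```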

## Data

### Genotype data

In [None]:
myGenoTable <- data.table::fread(myGenoPath)

#### Inspect genotype data in R

In [None]:
myGenoTable |> head()
myGenoTable |> dim()
myGenoTable |> names()

#### Load genotype data into rTASSEL 

In [None]:
tasGeno <- rTASSEL::readGenotypeTableFromPath(
    path = myGenoPath
)
tasGeno

**Perform filtering steps in rTASSEL as appropriate for your data set and analysis.**  
See 02_rTASSEL_GenotypeFiltering.ipynb for more details about filtering.

## Analysis

### Sequence diversity 

In rTASSEL, `seqDiversity()` reports:

- segregating sites
- average pairwise divergence (𝜋)
- estimated population-scaled mutation rate (𝜃 = 4𝑁𝜇)
- Tajima's D

By default, `seqDiversity()` calculates these metrics across all sites in the genotype data and returns a single set of values for all markers.

Diversity metrics are returned in `$Diversity`.  
Polymorphic distribution is returned in `$PolyDist`.

In [None]:
tasGeno |> seqDiversity()
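
To make the four metrics concrete, the sketch below computes them by hand in base R on a small made-up alignment of four haploid sequences. It is not a reproduction of the TASSEL implementation: TASSEL may scale 𝜋 and 𝜃 per base pair and handles missing or heterozygous calls in its own way, so values from `seqDiversity()` are not expected to match this toy calculation. All object names (`seqs`, `aln`, `pi_hat`, `theta_w`, `tajima_d`) are illustrative, and 𝜃 is estimated here with Watterson's estimator.

```r
# Toy alignment: 4 haploid sequences x 8 sites (made-up data)
seqs <- c("ACGTACGT", "ACGTACGA", "ATGTACGA", "ATGAACGT")
aln  <- do.call(rbind, strsplit(seqs, ""))
n    <- nrow(aln)

# Segregating sites: columns with more than one distinct allele
is_seg <- apply(aln, 2, function(site) length(unique(site)) > 1)
S <- sum(is_seg)

# pi: mean number of differences over all pairs of sequences
pairs <- combn(n, 2)
pair_diffs <- apply(pairs, 2, function(p) sum(aln[p[1], ] != aln[p[2], ]))
pi_hat <- mean(pair_diffs)

# Watterson's theta: S divided by the harmonic number a1
i  <- seq_len(n - 1)
a1 <- sum(1 / i)
a2 <- sum(1 / i^2)
theta_w <- S / a1

# Tajima's D: difference between pi and theta, divided by its
# estimated standard deviation (standard constants for sample size n)
b1 <- (n + 1) / (3 * (n - 1))
b2 <- 2 * (n^2 + n + 3) / (9 * n * (n - 1))
c1 <- b1 - 1 / a1
c2 <- b2 - (n + 2) / (a1 * n) + a2 / a1^2
e1 <- c1 / a1
e2 <- c2 / (a1^2 + a2)
tajima_d <- (pi_hat - theta_w) / sqrt(e1 * S + e2 * S * (S - 1))

c(S = S, pi = pi_hat, theta = theta_w, TajimaD = tajima_d)
```

In this toy data set 𝜋 is larger than 𝜃, so Tajima's D is positive, which reflects the excess of intermediate-frequency alleles in the made-up sequences.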

Options can be provided to change how the diversity metrics are calculated:

- use `startSite` and `endSite` to restrict the analysis to a range of sites, or
- set `slidingWindowAnalysis = TRUE` to calculate the metrics in sliding windows, as in the example below:

In [None]:
seqRestrict <- tasGeno |>
    seqDiversity(
        slidingWindowAnalysis = TRUE,
        stepSize = 50,
        windowSize = 100
    )

In [None]:
seqRestrict$Diversity |> head()
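
The roles of `windowSize` and `stepSize` can be illustrated with a small base R sketch. The helper `windowBounds()` is hypothetical and not part of rTASSEL. It assumes that windows are measured in numbers of sites, uses 1-based indices, and keeps only full-length windows; TASSEL's own indexing and handling of the last window may differ.

```r
# Hypothetical helper: lay out sliding windows over a run of sites.
# Uses 1-based site indices and keeps only full-length windows.
windowBounds <- function(nSites, windowSize, stepSize) {
    stopifnot(nSites >= windowSize, stepSize > 0)
    starts <- seq(1, nSites - windowSize + 1, by = stepSize)
    data.frame(
        start = starts,
        end   = starts + windowSize - 1
    )
}

# Same window and step sizes as the call above, on 300 made-up sites
wins <- windowBounds(nSites = 300, windowSize = 100, stepSize = 50)
wins
```

Because the step is half the window size, each window shares 50 sites with the one before it, so neighbouring rows of a sliding window result are not independent.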

## Visualization

Visualize sequence diversity across the genome using `ggplot()`.

First, set the plot dimensions with the figure function created at the start of the notebook:

In [None]:
fig(12,5)

In this example, Tajima's D from the sliding window analysis above is plotted along chromosome 1:

In [None]:
seqRestrict$Diversity |>
    filter(Chromosome == "1") |>
    ggplot() +
    aes(x = StartChrPosition, y = TajimaD) +
    geom_line()
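
To find the windows behind a peak or trough in the plot, the same table can be subset and sorted. The sketch below uses base R on a made-up stand-in for `seqRestrict$Diversity`; only the column names `Chromosome`, `StartChrPosition`, and `TajimaD` are taken from the plotting code above, and all values are invented for illustration.

```r
# Made-up stand-in for seqRestrict$Diversity (column names as used above)
toyDiversity <- data.frame(
    Chromosome       = c("1", "1", "1", "2"),
    StartChrPosition = c(1000, 6000, 11000, 500),
    TajimaD          = c(0.8, -1.9, 0.1, 2.3)
)

# Keep chromosome 1, then order windows from lowest to highest Tajima's D
chr1 <- toyDiversity[toyDiversity$Chromosome == "1", ]
chr1 <- chr1[order(chr1$TajimaD), ]
chr1

# Window with the most negative Tajima's D on chromosome 1
lowest <- chr1[1, ]
lowest
```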

## References and additional resources

To cite rTASSEL, please use the following citation:

Monier et al., (2022). rTASSEL: An R interface to TASSEL for analyzing genomic diversity. Journal of Open Source Software, 7(76), 4530, https://doi.org/10.21105/joss.04530.

You can find more information about rTASSEL [here](https://rtassel.maizegenetics.net) and an rTASSEL tutorial on Binder [here](https://mybinder.org/v2/gh/btmonier/rTASSEL_sandbox/HEAD?labpath=getting_started.ipynb).