# Tutorial: Filter genotype data with rTASSEL 

## Enter your notebook title here

**Objective**: Filter genotype data with rTASSEL  
**Data**: Describe your data set here  
**User and contact**: Enter your name and contact here

### Table of contents
* [Notes](#Notes) 
* [Libraries](#Libraries)
* [Parameters and functions](#Parameters-and-functions)
* [Data](#Data)
* [Analysis](#Analysis)
    * [Filter by variant site](#Filter-by-variant-site)
        * [Examples](#Examples)
    * [Filter by taxa](#Filter-by-taxa)
        * [Examples](#Examples)
    * [Filter by variant site and taxa](#Filter-by-variant-site-and-taxa)
* [References and additional resources](#References-and-additional-resources)

## Notes

This tutorial assumes: 
1. You already know how to load your data from a flat file into rTASSEL and have inspected it:
- See 01_rTASSEL_load_data.ipynb for a tutorial on how to load flat files into rTASSEL.

2. You will decide which filters and threshold values to apply to your data; this notebook only describes the methods for applying filters.
- See [this paper](https://www.frontiersin.org/articles/10.3389/fgene.2020.00447/full) for a discussion on genotyping and data quality control. 

Additional documentation on filtering genotype data with rTASSEL can be found [here](http://rtassel.maizegenetics.net/articles/genotype_filtration.html).

In [None]:
getwd()

In [None]:
Sys.Date()

## Libraries

In [None]:
library(rTASSEL) #R interface to TASSEL

## Parameters and functions

**Please edit with the path to your data**

In [None]:
# Path to genotype data
myGenoPath <- "/path/to/genotype/data"

## Data

Inspect genotype data in R

In [None]:
myGenoTable <- data.table::fread(myGenoPath)
myGenoTable |> head()

Load genotype data into rTASSEL

In [None]:
tasGeno <- rTASSEL::readGenotypeTableFromPath(
    path = myGenoPath
)
tasGeno

## Analysis 

### Filter by variant site

Filtering by variant site uses the method: `filterGenotypeTableSites()`  

Variant sites can be filtered by:  

* Genotype information
    + `siteMinCount`
    + `siteMinAlleleFreq`
    + `siteMaxAlleleFreq`
    + `minHeterozygous`
    + `maxHeterozygous`
* Indexed variant sites
    + `startSite`
    + `endSite`
* Physical marker positions
    + `startChr`
    + `endChr`
    + `startPos`
    + `endPos`
* R objects
    + `gRangesObj`
* External files
    + `bedFile`
    + `chrPosFile`

#### Examples

Filter by minor allele frequency (MAF):

In [None]:
tasGeno |> filterGenotypeTableSites(siteMinAlleleFreq = 0.05)

Filter by max heterozygosity:

In [None]:
tasGeno |> filterGenotypeTableSites(maxHeterozygous = 0.5)

Filter by MAF and max heterozygosity:

In [None]:
tasGeno |>
    filterGenotypeTableSites(
        siteMinAlleleFreq = 0.05,
        maxHeterozygous = 0.5
    )

Filter using physical chromosome positions:

In [None]:
tasGeno |>
    filterGenotypeTableSites(
        siteRangeFilterType = "position",
        startChr = 9,
        endChr = 10,
        startPos = 250,
        endPos = 700
    )

Filter with a BED file of positions (edit the code with your BED file name):

In [None]:
tasGeno |>
    filterGenotypeTableSites(
        siteRangeFilterType = "none",
        bedFile = "my_ranges.bed"
    )
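
If you do not yet have a BED file, a minimal one can be written from R. The sketch below uses made-up regions and writes to a temporary directory. It relies only on the general BED convention (tab-separated chromosome, 0-based start, exclusive end, no header). The chromosome labels are assumptions and need to match the chromosome names in your own genotype data.

```r
# Hypothetical regions of interest. BED starts are 0-based and ends are
# exclusive, so bases 250 to 700 on chromosome 9 become start 249, end 700.
bedRanges <- data.frame(
    chrom = c("9", "10"),
    start = c(249L, 0L),
    end   = c(700L, 500L)
)

# Write a tab-separated file with no header, row names, or quotes
bedPath <- file.path(tempdir(), "my_ranges.bed")
write.table(
    bedRanges,
    file      = bedPath,
    sep       = "\t",
    quote     = FALSE,
    row.names = FALSE,
    col.names = FALSE
)

# Inspect the file contents before passing bedPath to bedFile
readLines(bedPath)
```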

Assign the filtered result to a new TASSEL object to use it in downstream analyses, for example:

In [None]:
myFiltered_tasGeno <- tasGeno |>
    filterGenotypeTableSites(
        siteMinAlleleFreq = 0.05,
        maxHeterozygous = 0.5
    )

### Filter by taxa 

Filtering by taxa uses the method: `filterGenotypeTableTaxa()`  

Taxa can be filtered by:  

* Genotype information
  + `minNotMissing`
  + `minHeterozygous`
  + `maxHeterozygous`
* R objects
  + `taxa`

#### Examples

Filter taxa by the proportion of variant sites called (not missing).
In this example, taxa with fewer than 80% of variant sites called (`minNotMissing = 0.8`) are removed:

In [None]:
tasGeno |>
    filterGenotypeTableTaxa(
        minNotMissing = 0.8
    )

Filter taxa by max heterozygosity:

In [None]:
tasGeno |>
    filterGenotypeTableTaxa(
        maxHeterozygous = 0.1
    )

Filter taxa by a list of taxa IDs:

In [None]:
# Character vector of taxa IDs, one ID per element
myTaxa <- c("TaxaA", "TaxaB", "TaxaC")

tasGeno |>
    filterGenotypeTableTaxa(
        taxa = myTaxa
    )
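
Taxa IDs often arrive as a text file or as a single comma-separated string rather than as an R vector. The sketch below shows one way to turn either into a character vector with one ID per element, assuming (as the rTASSEL documentation describes) that `taxa` takes a vector of IDs. The file contents and ID names are made up for illustration.

```r
# Case 1: IDs stored one per line in a text file (hypothetical file contents,
# including stray whitespace and a blank line)
taxaFile <- tempfile(fileext = ".txt")
writeLines(c("TaxaA", " TaxaB ", "", "TaxaC"), taxaFile)

taxaFromFile <- trimws(readLines(taxaFile))              # strip whitespace
taxaFromFile <- unique(taxaFromFile[nzchar(taxaFromFile)])  # drop blanks, duplicates
taxaFromFile

# Case 2: IDs stored in a single comma-separated string
taxaString <- "TaxaA, TaxaB, TaxaC"
taxaFromString <- strsplit(taxaString, ",\\s*")[[1]]
taxaFromString

# Either vector can then be supplied as the `taxa` argument
```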

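### Filter by variant site and taxa

The two filter functions can be chained with the pipe. In this example, taxa are filtered first and sites are then filtered on the remaining taxa: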

In [None]:
tasGeno |>
    filterGenotypeTableTaxa(
        minNotMissing = 0.5
    ) |>
    filterGenotypeTableSites(
        siteMinAlleleFreq = 0.05,
        maxHeterozygous = 0.5
    )
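
The order of the two steps can change the result, because removing taxa changes the allele frequencies that the site filter sees. The base R sketch below illustrates this on a made-up matrix with two toy helper functions (`filterTaxaToy` and `filterSitesToy`, both hypothetical and not part of rTASSEL). It assumes each step computes its statistics on the table it receives.

```r
# Toy genotype matrix (hypothetical data): alternate allele counts, NA = missing
geno <- matrix(
    c(0,  0,  2,
      0,  2,  0,
      0,  0,  2,
      2, NA, NA),
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("Taxa", LETTERS[1:4]), paste0("Site", 1:3))
)

# Toy stand-in for a taxa filter: keep taxa with enough non-missing calls
filterTaxaToy <- function(g, minNotMissing) {
    g[rowMeans(!is.na(g)) >= minNotMissing, , drop = FALSE]
}

# Toy stand-in for a site filter: keep sites with enough minor allele frequency
filterSitesToy <- function(g, minMaf) {
    altFreq <- colMeans(g, na.rm = TRUE) / 2
    g[, pmin(altFreq, 1 - altFreq) >= minMaf, drop = FALSE]
}

taxaFirst  <- geno |> filterTaxaToy(0.5) |> filterSitesToy(0.05)
sitesFirst <- geno |> filterSitesToy(0.05) |> filterTaxaToy(0.5)

dim(taxaFirst)   # 3 taxa, 2 sites
dim(sitesFirst)  # 3 taxa, 3 sites
```

In this toy, only TaxaD carries the minor allele at Site1. Filtering taxa first removes TaxaD, so Site1 becomes monomorphic and is dropped; filtering sites first keeps Site1 even though it no longer varies among the retained taxa.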

## References and additional resources

To cite rTASSEL, please use the following citation:

Monier et al., (2022). rTASSEL: An R interface to TASSEL for analyzing genomic diversity. Journal of Open Source Software, 7(76), 4530, https://doi.org/10.21105/joss.04530.

You can find more information about rTASSEL [here](https://rtassel.maizegenetics.net) and an rTASSEL tutorial in Binder [here](https://mybinder.org/v2/gh/btmonier/rTASSEL_sandbox/HEAD?labpath=getting_started.ipynb).