# Tutorial: Filter genotype data with rTASSEL 

## Enter your notebook title here

**Objective**: Filter genotype data with rTASSEL  
**Data**: Describe your data set here  
**User and contact**: Enter your name and contact here

### Table of contents
* [Notes](#Notes) 
* [Libraries](#Libraries)
* [Parameters and functions](#Parameters-and-functions)
* [Data](#Data)
* [Analysis](#Analysis)
    * [Filter by variant site](#Filter-by-variant-site)
        * [Examples](#Examples)
    * [Filter by taxa](#Filter-by-taxa)
        * [Examples](#Examples)
* [References and additional resources](#References-and-additional-resources)

## Notes

This tutorial assumes:
1. You already know how to load your data (via a flat file or BrAPI database) into rTASSEL and have inspected it:
    - See 01_rTASSEL_Load_Data.ipynb for a tutorial on how to load flat files into rTASSEL
    - See brapi_template.ipynb for how to load data via BrAPI databases
2. You will decide which filters and threshold values to apply to your data; this notebook only describes how to apply them.
    - See this paper for a discussion of genotyping and data quality control: https://www.frontiersin.org/articles/10.3389/fgene.2020.00447/full

For additional documentation on filtering genotype data with rTASSEL, see: http://rtassel.maizegenetics.net/articles/genotype_filtration.html

In [None]:
getwd()

In [None]:
Sys.Date()

## Libraries

In [None]:
library(rTASSEL) #R interface to TASSEL

## Parameters and functions

In [None]:
### PLEASE EDIT WITH THE PATHS TO YOUR DATA ###

# Path to genotype data
myGenoPath <- "/shared/commons/data/workshop_senegal/demo_data_genotype_01.vcf"

## Data

In [None]:
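# Inspect genotype data in R
# skip = "#CHROM" starts reading at the VCF column header line,
# skipping the "##" metadata lines above it
myGenoTable <- data.table::fread(myGenoPath, skip = "#CHROM")
myGenoTable |> head()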

In [None]:
# Load genotype data into rTASSEL
tasGeno <- rTASSEL::readGenotypeTableFromPath(
    path = myGenoPath
)
tasGeno

## Analysis 

### Filter by variant site

**Filtering by variant site uses the method:** `filterGenotypeTableSites()`  

**Variant sites can be filtered by:**  

* Genotype information
    + `siteMinCount`
    + `siteMinAlleleFreq`
    + `siteMaxAlleleFreq`
    + `minHeterozygous`
    + `maxHeterozygous`
* Indexed variant sites
    + `startSite`
    + `endSite`
* Physical marker positions
    + `startChr`
    + `endChr`
    + `startPos`
    + `endPos`
* R objects
    + `gRangesObj`
* External files
    + `bedFile`
    + `chrPosFile`
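
To build intuition for the genotype-information thresholds, the following base R sketch computes a per-site minor allele frequency and heterozygous proportion on a toy matrix, then applies the same cutoffs used in the examples of this notebook. The matrix, its 0/1/2 coding, and all variable names are hypothetical and exist only for illustration. This is not rTASSEL's internal calculation, and rTASSEL may treat missing calls or threshold boundaries differently.

```r
# Toy genotype matrix: rows = taxa, columns = sites (hypothetical data).
# Calls are alternate-allele counts: 0 = hom ref, 1 = het, 2 = hom alt, NA = missing.
geno <- matrix(
    c(0, 0, 2, NA,
      0, 1, 2, 1,
      0, 0, 0, 1,
      0, 2, 2, 0),
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("Taxa", LETTERS[1:4]), paste0("site", 1:4))
)

# Alternate allele frequency per site, ignoring missing calls
altFreq <- colMeans(geno, na.rm = TRUE) / 2

# Minor allele frequency is the smaller of the two allele frequencies
maf <- pmin(altFreq, 1 - altFreq)

# Proportion of non-missing calls that are heterozygous, per site
hetProp <- colMeans(geno == 1, na.rm = TRUE)

# Assumed thresholds: MAF of at least 0.05 and at most 50% heterozygous calls
keep <- maf >= 0.05 & hetProp <= 0.5

data.frame(maf, hetProp, keep)
```

In this toy data, `site1` is monomorphic and fails the MAF cutoff, and `site4` exceeds the heterozygosity cutoff, so only `site2` and `site3` would be retained.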

#### Examples

In [None]:
# Minor allele frequency (MAF)
tasGeno |> filterGenotypeTableSites(siteMinAlleleFreq = 0.05)

In [None]:
# Max heterozygosity
tasGeno |> filterGenotypeTableSites(maxHeterozygous = 0.5)

In [None]:
# MAF and max heterozygosity
tasGeno |> filterGenotypeTableSites(siteMinAlleleFreq = 0.05, 
                                    maxHeterozygous = 0.5)

In [None]:
# Save the filtered genotype table as a new object for use in downstream analyses
myFiltered_tasGeno <- tasGeno |> 
    filterGenotypeTableSites(
        siteMinAlleleFreq = 0.05,
        maxHeterozygous = 0.5
    )

In [None]:
# Filter using physical chromosome positions
tasGeno |> 
    filterGenotypeTableSites(
        siteRangeFilterType = "position",
        startChr = 1,
        endChr = 2,
        startPos = 250,
        endPos = 700
    )

In [None]:
# Filter with a BED file of positions
tasGeno |> 
    filterGenotypeTableSites(
        siteRangeFilterType = "none",
        bedFile = "my_ranges.bed"
    )
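
The `bedFile` argument expects a file on disk. As a minimal sketch, a three-column BED file can be written from a data frame in base R. This assumes the standard BED conventions (tab-delimited, no header, 0-based start, exclusive end). The chromosome names and coordinates here are hypothetical and would need to match the chromosome names in your own genotype data.

```r
# Hypothetical ranges; chromosome names must match those in your genotype data
myRanges <- data.frame(
    chrom = c("1", "1", "2"),
    start = c(0L, 500L, 250L),    # BED start coordinates are 0-based
    end   = c(250L, 1000L, 700L)  # BED end coordinates are exclusive
)

# Write to a temporary location so the working directory is left untouched
bedPath <- file.path(tempdir(), "my_ranges.bed")
write.table(
    myRanges,
    file = bedPath,
    sep = "\t",
    quote = FALSE,
    row.names = FALSE,
    col.names = FALSE
)

# Confirm the file contents
bedLines <- readLines(bedPath)
bedLines
```

The resulting path could then be supplied as `bedFile = bedPath` in place of the literal file name.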

### Filter by taxa

**Filtering by taxa uses the method:** `filterGenotypeTableTaxa()`  

**Taxa can be filtered by:**  

* Genotype information
  + `minNotMissing`
  + `minHeterozygous`
  + `maxHeterozygous`
* R objects
  + `taxa`

#### Examples

In [None]:
# Filter taxa by frequency of variants called (not missing)
tasGeno |> 
    filterGenotypeTableTaxa(
        minNotMissing = 0.8 # remove taxa with <80% of variant sites called
    )

In [None]:
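# Filter taxa by proportion of heterozygous calls
# (with both bounds at 0, only taxa with no heterozygous calls are kept)
tasGeno |> 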
    filterGenotypeTableTaxa(
        minHeterozygous = 0.0,
        maxHeterozygous = 0.0
    )

In [None]:
# Filter taxa by a character vector of taxa names
myTaxa <- c("TaxaA", "TaxaB", "TaxaC")

tasGeno |> 
    filterGenotypeTableTaxa(
        taxa = myTaxa
    )
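
Taxa names are often kept in a plain text file rather than typed by hand. As a minimal sketch, assuming a file with one taxon name per line, the names can be read into a character vector in base R. The file name and its contents below are hypothetical and are written to a temporary directory only so the example is self-contained.

```r
# Write a hypothetical taxa list (one name per line) to a temporary file
taxaPath <- file.path(tempdir(), "my_taxa.txt")
writeLines(c("TaxaA", "TaxaB ", "", "TaxaC"), taxaPath)

# Read the names, trim stray whitespace, and drop blank lines
myTaxaFromFile <- trimws(readLines(taxaPath))
myTaxaFromFile <- myTaxaFromFile[nzchar(myTaxaFromFile)]
myTaxaFromFile
```

The cleaned vector could then be supplied as `taxa = myTaxaFromFile`. Trimming whitespace and dropping blank lines helps avoid silent mismatches against the taxa names in the genotype table.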

## References and additional resources

To cite rTASSEL, please use the following citation:

Monier et al., (2022). rTASSEL: An R interface to TASSEL for analyzing genomic diversity. Journal of Open Source Software, 7(76), 4530, https://doi.org/10.21105/joss.04530

You can find more information about rTASSEL here:

https://maize-genetics.github.io/rTASSEL/index.html

An rTASSEL tutorial is also available on Binder:

https://mybinder.org/v2/gh/btmonier/rTASSEL_sandbox/HEAD?labpath=getting_started.ipynb