# Tutorial: Filter genotype data with QBMS and rTASSEL 

## Enter your notebook title here

**Objective**: Filter genotype data with QBMS and rTASSEL  
**Data**: Describe your data set here  
**User and contact**: Enter your name and contact here

### Table of contents
* [Notes](#Notes) 
* [Libraries](#Libraries)
* [Data](#Data)
    * [Filter with QBMS](#Filter-with-QBMS)
    * [Inspect genotype data in R](#Inspect-genotype-data-in-R)
    * [Load genotype data into rTASSEL](#Load-genotype-data-into-rTASSEL)
    * [Filter with rTASSEL](#Filter-with-rTASSEL)
        * [Filter by variant site](#Filter-by-variant-site)
            * [Examples](#Examples)
        * [Filter by taxa](#Filter-by-taxa)
            * [Examples](#Examples)
    * [Filter by variant site and taxa](#Filter-by-variant-site-and-taxa)
* [References and additional resources](#References-and-additional-resources)

## Notes

This tutorial assumes that:
1. You already know how to load data from a BrAPI database into rTASSEL and have inspected your data.
    - See `01_rTASSEL_load_data.brapi.ipynb` for how to load data from BrAPI databases.

2. You will decide which filters and threshold values suit your data. This notebook only describes how to apply filters.
    - See [this paper](https://www.frontiersin.org/articles/10.3389/fgene.2020.00447/full) for a discussion of genotyping and data quality control.

Additional documentation on filtering genotype data with rTASSEL can be found [here](http://rtassel.maizegenetics.net/articles/genotype_filtration.html).

In [None]:
getwd()

In [None]:
Sys.Date()

## Libraries

In [None]:
library(QBMS) #Retrieve data from BrAPI databases
library(rTASSEL) #R interface to TASSEL

## Data

**You will need to log into Gigwa using the BrAPI helper.** The cells below assume that a `geno_provider` object is available after login.

In [None]:
geno_provider$gigwa_list_dbs()

**Please edit the code to set your database (db):**

In [None]:
geno_provider$gigwa_set_db("myDataBase")

In [None]:
geno_provider$gigwa_list_projects()

**Please edit the code to set your project:**

In [None]:
geno_provider$gigwa_set_project("myProject")

In [None]:
samples <- gigwa_get_samples()
samples |> head()

### Filter with QBMS
**Filtering with QBMS happens before the data are retrieved, which can save time.** This is done with `gigwa_get_variants()` using the optional arguments:

* `max_missing`: maximum missing ratio by sample, between 0 and 1; defaults to 1 (no filtering)
* `min_maf`: minimum minor allele frequency, between 0 and 1; defaults to 0 (no filtering)
* `samples`: a subset of sample names to retrieve; defaults to `NULL` (all samples)

For example:

In [None]:
genoDataFromGigwa <- geno_provider$gigwa_get_variants(
    max_missing = 0.2,
    min_maf = 0.05,
    samples = c("33-16", "4722", "A214N")
)
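
To make the two thresholds concrete, here is a minimal base R sketch that computes a missing ratio and a minor allele frequency on a toy matrix. It does not call QBMS or Gigwa; the matrix, the 0/1/2 allele-count coding, and the choice to compute both quantities per marker are assumptions made for illustration, and the server-side filter may differ in detail.

```r
# Toy genotype matrix (hypothetical data): rows are markers, columns are samples.
# Values are alternate-allele counts (0, 1, 2); NA marks a missing call.
geno <- matrix(
    c(0, 0,  1,  2, 0,
      0, 0,  0,  0, 0,
      2, NA, NA, 1, 0,
      1, 1,  2,  0, 0),
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("m", 1:4), paste0("s", 1:5))
)

# Fraction of missing calls per marker
missing_ratio <- rowMeans(is.na(geno))

# Alternate-allele frequency per marker, ignoring missing calls
alt_freq <- rowMeans(geno, na.rm = TRUE) / 2

# The minor allele frequency is the smaller of the two allele frequencies
maf <- pmin(alt_freq, 1 - alt_freq)

# Markers that pass thresholds equivalent to max_missing = 0.2 and min_maf = 0.05
keep <- missing_ratio <= 0.2 & maf >= 0.05
geno[keep, , drop = FALSE]
```

In this toy example, `m2` is dropped because it is monomorphic and `m3` because 40% of its calls are missing.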

### Inspect genotype data in R

In [None]:
genoDataFromGigwa |> head()
genoDataFromGigwa |> dim()
genoDataFromGigwa |> names()

### Load genotype data into rTASSEL

In [None]:
tasGeno <- genoDataFromGigwa |> rTASSEL::readGenotypeTableFromGigwa()

In [None]:
tasGeno

### Filter with rTASSEL 

### Filter by variant site

Filtering by variant site uses the method: `filterGenotypeTableSites()`  

Variant sites can be filtered by:  

* Genotype information
    + `siteMinCount`
    + `siteMinAlleleFreq` (redundant with `min_maf` from QBMS)
    + `siteMaxAlleleFreq`
    + `minHeterozygous`
    + `maxHeterozygous`
* Indexed variant sites
    + `startSite`
    + `endSite`
* Physical marker positions
    + `startChr`
    + `endChr`
    + `startPos`
    + `endPos`
* R objects
    + `gRangesObj`
* External files
    + `bedFile`
    + `chrPosFile`

#### Examples

Filter sites by maximum heterozygosity:

In [None]:
tasGeno |> filterGenotypeTableSites(maxHeterozygous = 0.2)

Filter using physical chromosome positions:

In [None]:
tasGeno |>
    filterGenotypeTableSites(
        siteRangeFilterType = "position",
        startChr = 9,
        endChr = 10,
        startPos = 250,
        endPos = 700
    )

Filter with a BED file of positions:

In [None]:
tasGeno |> 
    filterGenotypeTableSites(
        siteRangeFilterType = "none",
        bedFile = "my_ranges.bed"
    )
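
If you do not yet have a BED file, the sketch below shows one way to write a minimal three-column file from R. The regions are hypothetical and the file is written to a temporary directory. BED intervals are 0-based and half-open, and the chromosome names are assumed to match those in your genotype table.

```r
# Hypothetical regions. BED intervals are 0-based and half-open,
# so start = 249 and end = 700 covers 1-based positions 250 to 700.
ranges <- data.frame(
    chrom = c("9", "10"),
    start = c(249L, 0L),
    end   = c(700L, 500L)
)

# Write a tab-separated file without a header, quotes, or row names
bed_path <- file.path(tempdir(), "my_ranges.bed")
write.table(
    ranges,
    file = bed_path,
    sep = "\t",
    quote = FALSE,
    row.names = FALSE,
    col.names = FALSE
)

# Inspect the file contents
readLines(bed_path)
```

The resulting path can then be passed as the `bedFile` argument in place of `"my_ranges.bed"`.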

### Filter by taxa 

Filtering by taxa uses the method: `filterGenotypeTableTaxa()`  

Taxa can be filtered by:  

* Genotype information
  + `minNotMissing` (overlaps with `max_missing` in QBMS, but expressed as its inverse: the minimum proportion of calls present)
  + `minHeterozygous`
  + `maxHeterozygous`
* R objects
  + `taxa` (redundant with `samples` in QBMS)

#### Examples

Filter taxa by the proportion of variant sites that were called (not missing).
In this example, taxa with calls at fewer than 80% of variant sites are removed (`minNotMissing = 0.8`):

In [None]:
tasGeno |> 
    filterGenotypeTableTaxa(
        minNotMissing = 0.8
    )
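
The relationship between a missing-data ceiling and a call-rate floor can be sketched in base R as follows. This is an illustration on hypothetical data, not the rTASSEL implementation; the variable names and the 0.25 threshold are assumptions chosen for the example.

```r
# Toy genotype matrix (hypothetical data): rows are taxa, columns are variant sites.
# NA marks a missing call.
geno <- matrix(
    c(0, 1,  2,  0, 1,
      0, NA, NA, 2, 1,
      2, 2,  NA, 0, 0),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("taxonA", "taxonB", "taxonC"), paste0("site", 1:5))
)

# Proportion of sites with a call for each taxon
call_rate <- rowMeans(!is.na(geno))

# A missing-data ceiling converts to a call-rate floor
max_missing <- 0.25
min_not_missing <- 1 - max_missing

# Taxa that meet the call-rate floor
kept_taxa <- names(call_rate)[call_rate >= min_not_missing]
kept_taxa
```

Here `taxonB` has calls at only 3 of 5 sites, so it falls below the floor of 0.75 and is dropped.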

Filter taxa by heterozygosity:

In [None]:
tasGeno |> 
    filterGenotypeTableTaxa(
        maxHeterozygous = 0.2
    )

### Filter by variant site and taxa

In [None]:
tasGeno |>
    filterGenotypeTableTaxa(
        minNotMissing = 0.8
    ) |>
    filterGenotypeTableSites(
        maxHeterozygous = 0.2
    )
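
When filters are chained, the order can change the result, because each step recomputes its statistic on whatever the previous step left behind. The base R sketch below illustrates this on hypothetical data. The helper functions `filter_taxa()` and `filter_sites()` are stand-ins written for this example, and heterozygosity is assumed to be the proportion of heterozygous calls among non-missing calls; rTASSEL may compute it differently.

```r
# Toy genotype matrix (hypothetical data): rows are taxa, columns are sites.
# Values are alternate-allele counts; 1 is a heterozygous call, NA is missing.
geno <- matrix(
    c(0, 0,  2,  1,
      0, 2,  2,  1,
      0, 2,  0,  0,
      1, NA, NA, 0),
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("t", 1:4), paste0("s", 1:4))
)

# Keep taxa whose proportion of non-missing calls meets the floor
filter_taxa <- function(g, min_not_missing) {
    g[rowMeans(!is.na(g)) >= min_not_missing, , drop = FALSE]
}

# Keep sites whose heterozygous proportion (among called taxa) is under the ceiling
filter_sites <- function(g, max_het) {
    het <- colMeans(g == 1, na.rm = TRUE)
    g[, !is.na(het) & het <= max_het, drop = FALSE]
}

# Taxa first, then sites (the order used in the cell above)
taxa_first <- filter_sites(filter_taxa(geno, 0.75), 0.2)

# Sites first, then taxa
sites_first <- filter_taxa(filter_sites(geno, 0.2), 0.75)

dim(taxa_first)
dim(sites_first)
```

In this example, removing the poorly called taxon `t4` first lets site `s1` pass the heterozygosity filter, whereas filtering sites first removes `s1` because of the heterozygous call in `t4`.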

## References and additional resources

**To cite rTASSEL, please use the following citation:**

Monier et al., (2022). rTASSEL: An R interface to TASSEL for analyzing genomic diversity. Journal of Open Source Software, 7(76), 4530, https://doi.org/10.21105/joss.04530.

You can find more information about rTASSEL [here](https://rtassel.maizegenetics.net) and an rTASSEL tutorial on Binder [here](https://mybinder.org/v2/gh/btmonier/rTASSEL_sandbox/HEAD?labpath=getting_started.ipynb).

**Please also cite QBMS using the following citation:**

Al-Shamaa K (2023). QBMS: Query the Breeding Management System(s). R package version 0.9.1, https://icarda-git.github.io/QBMS/.