# Module: Quality Control of Sequencing Datasets

High-quality sequencing data is the foundation of reliable bioinformatics analyses. Before diving into downstream analyses, it's important to assess the quality of your sequencing data. By identifying and addressing potential issues early on, you can minimize noise in your data and produce results that are likely to be more accurate.

In this notebook, you'll explore a suite of bioinformatics tools widely used to evaluate sequencing data quality, identify and remove low-quality regions in the reads, and remove reads that are irrelevant to the analyses.

Created by: _Microbial Oceanography Laboratory (MOLab)_

---
## How to Use This Notebook

1. Make sure the tools are already installed (see the Tools Used section below if they are not).
2. Activate the conda environment, replacing the environment name with your own if it differs.
```bash
conda activate quality-control-env
```
3. Open Jupyter Notebook with the command below and select this notebook.
```bash
jupyter notebook
```
4. To run a cell in this notebook, press Shift+Enter.

---
## Tools Used
1. **FastQC**
2. **Trimmomatic**
3. **MultiQC**
4. **Minimap2**
5. **Samtools**

To install these tools, find the `quality-control.yaml` file located in the same folder as this notebook in the repository. Then run the command below in the terminal:

> `conda env create -f quality-control.yaml`

---
## Starting Files 

1. Paired-end FASTQ sequences.

---
## Expected Outputs

1. Clean paired-end FASTQ sequences.

---
## Table of Contents
 * [**FASTQ**](#FASTQ)
 * [**Inspect Raw Data**](#Inspect-Raw-Data)
     * [FastQC](#FastQC)
     * [MultiQC](#MultiQC)
 * [**Trimming and Filtering**](#Trimming-and-Filtering)
     * [Trimmomatic](#Trimmomatic)
     * [Inspect Quality Again](#Inspect-Quality-Again)
 * [**(Optional) Removing Host Sequences**](#(Optional)-Removing-Host-Sequences)
     * [Map Reads](#Map-Reads)
     * [Get Non-host Reads](#Get-Non-host-Reads)

---
# <font color = 'gray'>FASTQ</font>

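FASTQ is a text file format that stores the sequence of each read together with a quality score for every base call. Each record spans four lines: a header, the sequence, a separator line starting with _+_, and the quality string. A typical FASTQ file is formatted as shown below.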

---

![FASTQ](https://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10142-023-01259-x/MediaObjects/10142_2023_1259_Fig1_HTML.png)

---

Sequence IDs start with the _@_ symbol. The sequence and quality strings must be the same length. Each character in the quality string encodes a quality score through its ASCII code. More recent Illumina-derived sequences use the [Phred+33 encoding](https://support.illumina.com/help/BaseSpace_Sequence_Hub_OLH_009008_2/Source/Informatics/BS/QualityScoreEncoding_swBS.htm), in which the score is the character's ASCII code minus 33. At position $i$, the quality score $Q_i$ corresponds to the $i$-th base of the sequence ($S_i$). This means $Q_i$ describes the estimated accuracy of the base call $S_i$. The quantitative relationship between Q-score and accuracy is described below.

![Qscore and Accuracy](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSOQbRyUYgCvuojUWJWKuLDQG3zz2Q0-5ZejQ&s)
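
To make the encoding concrete, here is a minimal shell sketch of the two conversions, assuming Phred+33 input and the standard relationship $Q = -10\log_{10}P$, where $P$ is the estimated probability that a base call is wrong. The function names `phred33_to_q` and `q_to_error` are hypothetical helpers written for this illustration and are not part of any tool used in this notebook.

```bash
# Convert a single Phred+33 quality character to its Q-score.
phred33_to_q() {
    # printf with a leading quote yields the ASCII code of the character
    ascii=$(printf '%d' "'$1")
    echo $(( ascii - 33 ))
}

# Convert a Q-score to the estimated error probability, P = 10^(-Q/10).
q_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.4g\n", 10 ^ (-q / 10) }'
}

phred33_to_q I     # prints 40
q_to_error 30      # prints 0.001 (1 error in 1,000 base calls)
```

Run in a terminal, the quality character `I` decodes to Q40, and Q30 corresponds to an estimated error probability of 0.001.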

---
# <font color = 'gray'>Inspect Raw Data</font>

### FastQC

Before anything else, assess the quality of your raw data. This helps you judge whether the data are good enough for the type of analysis you plan to run. For instance, if your goal is to perform SNP or phylogenetic analyses, where the accuracy of each base call is crucial, your dataset should ideally have high base quality (typically $Q\geq30$, i.e. an estimated error rate of at most 1 in 1,000 base calls).

To generate quality metrics and visuals of your data, you can use the `fastqc` tool. The command below writes one quality report (an HTML file) to the `fastqc` folder for each FASTQ file stored inside the `raw-reads` folder. You can also add the `-t` option to increase the number of threads and make the execution faster.

In [None]:
# FastQC does not create the output folder, so make sure it exists first
!mkdir -p fastqc
!fastqc \
    -o fastqc/ \
    raw-reads/*

A few things to look at in the report are:

1. **Basic Statistics** - Check the library depth and read lengths.
2. **Per Base Sequence Quality** - This module shows how accurate the base calls are at different positions along the reads in your library.
3. **Adapter Content** - This module shows whether your library contains adapter sequences. These are typically trimmed before proceeding to downstream steps.

More detailed descriptions of `fastqc` modules are available here: [FastQC Documentation](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/).

<div class="alert alert-block alert-info">
<b>Note:</b> 
    
The interpretation of other <code>fastqc</code> modules/reports is context-dependent. For instance, in a metabarcoding dataset, you would expect the <i>Overrepresented Sequences</i> report to yield a warning or error. This is not a concern, since metabarcoding datasets naturally contain many copies of the same amplicons/sequences.
     
As another example, if you are working with WGS data from an isolate and you find multiple peaks in the <i>Per Sequence GC Content</i> report, this could indicate contamination. However, if your data is, say, shotgun metagenomics, then the presence of multiple peaks should not be concerning.
</div>
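
When you have many reports, it can help to list only the modules that FastQC flagged. The sketch below assumes you have extracted the `summary.txt` file that FastQC places inside each `*_fastqc.zip` archive, and that each line holds a status (`PASS`, `WARN`, or `FAIL`), the module name, and the file name, separated by tabs. The function name `list_flagged_modules` is a hypothetical helper, not part of FastQC.

```bash
# Print "file <tab> status <tab> module" for every WARN or FAIL entry
# found in one or more FastQC summary.txt files.
list_flagged_modules() {
    awk -F '\t' '$1 == "WARN" || $1 == "FAIL" {
        printf "%s\t%s\t%s\n", $3, $1, $2
    }' "$@"
}

# Example usage (the path is an assumption; adjust it to where you extracted the reports):
# list_flagged_modules fastqc/*_fastqc/summary.txt
```

Whether a flagged module matters still depends on the type of dataset, as described in the note above.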

### MultiQC

In the previous step, `fastqc` generated one quality report per FASTQ file. You can aggregate these reports into a single report using `multiqc`. This is especially useful when you are working with many libraries, since it gives you an overview of the quality profile of all your samples at once.

The command below summarizes the `fastqc` reports stored inside the `./fastqc` folder and places the aggregated report inside the `./multiqc` folder.

In [None]:
!multiqc \
    --outdir ./multiqc \
    ./fastqc

<div class="alert alert-block alert-info">
<b>Tip:</b> 
    
Samples processed in different batches often exhibit variations in quality profiles. It’s recommended to inspect individual libraries or review <code>fastqc</code> reports to ensure that each sample batch is handled appropriately based on its specific data quality.
</div>

---
# <font color = 'gray'>Trimming and Filtering</font>

### Trimmomatic

Next, low-quality regions in the reads will be trimmed off using `trimmomatic`. The strictness of trimming parameters will depend on the quality of your data and your objectives.

The table below describes the arguments used in the command:

| Parameter | Description |
| :-: | :- |
| `PE` | Paired-end mode |
| `raw-reads/1_F.fastq.gz` | Path to forward read FASTQ file |
| `raw-reads/1_R.fastq.gz` | Path to reverse read FASTQ file |
| `trimmomatic/ofp.fastq` | **O**utput **F**orward **P**aired. Trimmed forward reads that still have pairs |
| `trimmomatic/ofu.fastq` | **O**utput **F**orward **U**npaired. Trimmed forward reads that no longer have pairs (i.e. orphan forward reads) |
| `trimmomatic/orp.fastq` | **O**utput **R**everse **P**aired. Trimmed reverse reads that still have pairs |
| `trimmomatic/oru.fastq` | **O**utput **R**everse **U**npaired. Trimmed reverse reads that no longer have pairs (i.e. orphan reverse reads) |
| `ILLUMINACLIP` | Trim adapters from the reads. A FASTA file containing adapter sequences should be provided. Other sub-parameters are described in the manual. More adapter sequences can be found [here](https://github.com/timflutre/trimmomatic/tree/master/adapters) |
| `LEADING` | Removes bases from the 5' end of the read for as long as their quality is below the specified threshold. |
| `TRAILING` | Removes bases from the 3' end of the read for as long as their quality is below the specified threshold. |
| `SLIDINGWINDOW:n:m` | Scans the read from the 5' end using a window of _n_ consecutive bases. Once the average quality within the window falls below the threshold _m_, the read is cut there and the rest of the read towards the 3' end is discarded. |
| `MINLEN` | Trimmed reads shorter than the specified length are discarded. |

Besides these, the `MAXINFO` parameter is also worth considering, since it balances quality against information content (i.e. sequence length) when deciding where to trim.

You can find the description of other `trimmomatic` parameters here: [Trimmomatic Documentation](http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf)
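
To build intuition for how a sliding-window trimmer decides where to cut, here is a simplified shell sketch. It assumes Phred+33 quality strings, advances the window one base at a time, and cuts at the start of the first window whose mean quality falls below the threshold. This is an illustration of the idea only: Trimmomatic's own implementation differs in detail (for example, in exactly where within the failing window the cut is placed), and `sliding_window_keep` is a hypothetical helper name.

```bash
# Print how many bases would be kept from the 5' end of a read.
# Usage: sliding_window_keep QUALITY_STRING WINDOW_SIZE MIN_MEAN_QUALITY
sliding_window_keep() {
    printf '%s\n' "$1" | LC_ALL=C awk -v n="$2" -v m="$3" '
        # Lookup table from printable ASCII character to its code
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
        {
            len = length($0)
            keep = len                      # default: nothing is trimmed
            for (start = 1; start + n - 1 <= len; start++) {
                total = 0
                for (j = start; j < start + n; j++)
                    total += ord[substr($0, j, 1)] - 33
                if (total / n < m) {        # first failing window
                    keep = start - 1
                    break
                }
            }
            print keep
        }'
}

sliding_window_keep 'IIIIIIIIII' 5 25   # all Q40: prints 10 (no trimming)
sliding_window_keep 'IIIII#####' 5 25   # Q40 then Q2: prints 2
```

In the second call, the window starting at the third base has a mean quality of 24.8, which is below 25, so only the first two bases are kept. A read trimmed this way would then be discarded by `MINLEN:75`.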

In [None]:
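# Trimmomatic does not create the output folder, so make sure it exists first
!mkdir -p trimmomatic
!trimmomatic PE \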
    raw-reads/1_F.fastq.gz \
    raw-reads/1_R.fastq.gz \
    trimmomatic/ofp.fastq \
    trimmomatic/ofu.fastq \
    trimmomatic/orp.fastq \
    trimmomatic/oru.fastq \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:8:true \
    LEADING:20 \
    TRAILING:20 \
    SLIDINGWINDOW:5:25 \
    MINLEN:75

<div class="alert alert-block alert-info">
<b>Tip:</b> 
    
Check the standard output of <code>trimmomatic</code> to see how many paired reads are left after filtering. If you think you're losing too many reads from your data and your objective permits, try to be a bit more lenient with your trimming and filtering parameters (e.g. lower <code>MINLEN</code>).
</div>

<div class="alert alert-block alert-warning">
<b>Warning:</b> 
    
The command above only processes the <code>1_F.fastq.gz</code> and <code>1_R.fastq.gz</code> read pairs. If you have multiple datasets, you have to rerun the code block for those libraries. Make sure to replace the filenames of the outputs as well. Alternatively, you can also automate this by using a for loop in bash.
</div>
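
One way to automate the step above is a bash loop over the forward-read files. The sketch below assumes every library follows the `<sample>_F.fastq.gz` / `<sample>_R.fastq.gz` naming used in this notebook, that file names contain no spaces, and that each sample's outputs are prefixed with the sample name (e.g. `1_ofp.fastq`), which is a naming choice made here for illustration. The function `trim_commands` is a hypothetical helper that only prints the commands, so you can review them before running anything.

```bash
# Print one Trimmomatic command per read pair found in RAW_DIR.
# Usage: trim_commands RAW_DIR OUT_DIR
trim_commands() (
    raw_dir=$1
    out_dir=$2
    for fwd in "$raw_dir"/*_F.fastq.gz; do
        [ -e "$fwd" ] || continue                  # no matching files
        sample=$(basename "$fwd" _F.fastq.gz)      # e.g. "1" from 1_F.fastq.gz
        echo trimmomatic PE \
            "$fwd" "$raw_dir/${sample}_R.fastq.gz" \
            "$out_dir/${sample}_ofp.fastq" "$out_dir/${sample}_ofu.fastq" \
            "$out_dir/${sample}_orp.fastq" "$out_dir/${sample}_oru.fastq" \
            ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:8:true \
            LEADING:20 TRAILING:20 SLIDINGWINDOW:5:25 MINLEN:75
    done
)

# Review the commands first:
#   trim_commands raw-reads trimmomatic
# Then run them:
#   mkdir -p trimmomatic && trim_commands raw-reads trimmomatic | bash
```

Printing the commands before executing them makes it easy to spot a wrong file name or a missing reverse-read file.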

### Inspect Quality Again

After trimming and filtering the reads, inspect the quality of the sequencing libraries again as a sanity check, following the steps described in the [Inspect Raw Data](#Inspect-Raw-Data) section.
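
Alongside the quality reports, a quick structural check can confirm that the paired outputs are still consistent. The sketch below assumes uncompressed, four-line-per-record FASTQ files and uses a hypothetical helper, `fastq_stats`, which reports the number of records and how many of them have sequence and quality strings of different lengths.

```bash
# Read a four-line FASTQ stream on stdin and print:
#   <number of records> <records whose sequence and quality lengths differ>
fastq_stats() {
    awk 'NR % 4 == 2 { seq_len = length($0) }
         NR % 4 == 0 { records++; if (length($0) != seq_len) mismatched++ }
         END { printf "%d %d\n", records, mismatched }'
}

# Example usage on the paired Trimmomatic outputs:
#   fastq_stats < trimmomatic/ofp.fastq
#   fastq_stats < trimmomatic/orp.fastq
# For gzipped files, decompress on the fly:
#   gzip -cd raw-reads/1_F.fastq.gz | fastq_stats
```

The forward and reverse paired files should report the same number of records, and the second number should be 0 for both.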

---
# <font color = 'gray'>(Optional) Removing Host Sequences</font>

If your data originates from a host organism and your research focuses, for example, on the associated community only, removing host-derived sequences is essential to ensure that the libraries represent only the relevant information.

### Map Reads

First, you need to map the clean reads to a genome assembly of your host organism. The [NCBI Genomes database](https://www.ncbi.nlm.nih.gov/home/genomes/) is a useful resource for accessing genome assemblies of various organisms.

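`minimap2` is used to perform the mapping, and the assembly file is assumed to be named `Host_Genome.fna`. In the command below, `-a` requests SAM output and the `sr` preset passed to `-x` selects settings for short reads. The alignments are piped to `samtools` to be converted to BAM and sorted.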

In [None]:
!minimap2 \
    --split-prefix=tmp$$ \
    -a \
    -x sr Host_Genome.fna \
    trimmomatic/ofp.fastq trimmomatic/orp.fastq | samtools view -bh - | samtools sort -o output.bam

Then index the alignment file (`output.bam`).

In [None]:
!samtools index output.bam

### Get Non-host Reads

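After mapping, extract the read pairs in which neither mate mapped to the host genome. This subset of reads should now be largely free of host sequences and better represent the community of interest. In the commands below, `-f` keeps only reads that have all of the given flag bits set, and `-F` discards reads that have any of the given flag bits set.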

In [None]:
!mkdir -p not_host
!samtools fastq -F 3584 -f 77 output.bam  | gzip -c > not_host/1_F_no_host.fastq.gz
!samtools fastq -F 3584 -f 141 output.bam | gzip -c > not_host/1_R_no_host.fastq.gz
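
The numeric flag values used above are sums of individual SAM flag bits. The sketch below decodes a flag value into bit names following the SAM specification; `decode_sam_flag` is a hypothetical helper written for this illustration (the `samtools flags` subcommand provides a similar lookup).

```bash
# Decode a SAM flag value into the names of the bits that are set.
# Usage: decode_sam_flag FLAG
decode_sam_flag() (
    flag=$1
    bit=1          # value of the current bit: 1, 2, 4, 8, ...
    out=""
    for name in PAIRED PROPER_PAIR UNMAP MUNMAP REVERSE MREVERSE \
                READ1 READ2 SECONDARY QCFAIL DUP SUPPLEMENTARY; do
        if [ $(( flag & bit )) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
        bit=$(( bit * 2 ))
    done
    echo "$out"
)

decode_sam_flag 77     # prints PAIRED,UNMAP,MUNMAP,READ1
decode_sam_flag 141    # prints PAIRED,UNMAP,MUNMAP,READ2
decode_sam_flag 3584   # prints QCFAIL,DUP,SUPPLEMENTARY
```

In other words, `-f 77` and `-f 141` select the first and second reads of pairs in which both the read and its mate are unmapped, while `-F 3584` drops reads that failed quality checks, are marked as duplicates, or are supplementary alignments.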