# Quality Control of High-throughput sequencing data

The [FastQ](https://en.wikipedia.org/wiki/FASTQ_format) format (Fasta with **quality**) is the most common file format for high-throughput sequencing data. The *quality* refers to the [Phred score](https://en.wikipedia.org/wiki/Phred_quality_score) that the file stores for every base call. A quick look at an example entry from a file reveals some basic information you should be aware of:

## FastQ Format

A single entry (read) in the file looks like this: 

In [None]:
@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN
+SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!

Every line has some significance:

|Line|Information|
|----|-----------|
|1|Always begins with `@`, followed by information about the read|
|2|The actual DNA sequence|
|3|Always begins with `+`, sometimes followed by the same information as line 1|
|4|A string of characters that represent the quality scores; must have the same number of characters as line 2|
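
The four-line layout in the table can be checked programmatically. The following is a minimal sketch in Python, assuming each record occupies exactly four unwrapped lines. The function name `parse_fastq_record` and the eight-base read are made up for illustration and do not come from any library or from the tutorial data.

```python
def parse_fastq_record(lines):
    """Split one four-line FastQ record into its parts and sanity-check it."""
    if len(lines) != 4:
        raise ValueError("a FastQ record must have exactly 4 lines")
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("line 1 must begin with '@'")
    if not separator.startswith("+"):
        raise ValueError("line 3 must begin with '+'")
    if len(sequence) != len(quality):
        raise ValueError("lines 2 and 4 must have the same length")
    return {"id": header[1:], "sequence": sequence, "quality": quality}


# A made-up record, used only to exercise the checks above.
record = parse_fastq_record([
    "@read_1 illustrative example",
    "GATTACAN",
    "+",
    "IIIIII#!",
])
print(record["id"])
print(record["sequence"], record["quality"])
```

A record whose quality string is shorter or longer than its sequence raises a `ValueError`, which is the rule stated for line 4.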

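Phred scores are encoded in the 4th line, one character per base. Each symbol corresponds to a position on a number line (scores from 0 to 40 here), and a higher score means a lower probability that the base call is wrong: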

In [None]:
Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
                  |         |         |         |         |
Quality score:    0........10........20........30........40     
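
The number line above can be reproduced in a few lines of Python. This sketch assumes the Phred+33 encoding implied by the line starting at `!` (ASCII code 33 maps to score 0); some older Illumina pipelines used a different offset, which this sketch does not handle. The function names are illustrative only.

```python
def phred33_scores(quality):
    """Convert a quality string to integer Phred scores (ASCII code minus 33)."""
    return [ord(char) - 33 for char in quality]


def error_probability(score):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


scores = phred33_scores("!#5?I")
print(scores)  # [0, 2, 20, 30, 40]

# Estimated probability of a wrong base call at a few common scores.
for q in (10, 20, 30):
    print(q, error_probability(q))
```

In the example read shown earlier, every base is `!` or `#`, that is a score of 0 or 2, which is why the sequence consists almost entirely of `N` (no confident base call).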

## Examining some sample data

Run the cell below to see the FastQ files provided in this tutorial:

In [None]:
!ls -F $HOME/tutorial-data/qc_examples/

We see two FastQ files, `SRR3191542_1.fastq.gz` and `SRR3191542_2.fastq.gz`. The `_1` and `_2` suffixes indicate that these are a pair of read files from a paired-end sequencing run: the left and right reads, respectively. There is also a directory, `processed/`, which contains 'backup' results that have been pre-processed for you.

## Sequencing QC with FASTQC

[fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) is a simple and popular sequencing QC tool. It is already installed on this computer. To run it, we use the command `fastqc` followed by the name of the file we want to analyze:

**Note: In this Jupyter Notebook, commands we want to run on the Linux shell must begin with a `!`**

In [None]:
!fastqc $HOME/tutorial-data/qc_examples/SRR3191542_1.fastq.gz

In the cell below, run the same command on the `SRR3191542_2.fastq.gz` file. Remember to start your command with `!` and to use the path `$HOME/tutorial-data/qc_examples/`.

We now have results for each of our files including an HTML report and a zip file containing additional information. 

In [None]:
!ls $HOME/tutorial-data/qc_examples
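
The zip file can also be inspected without unpacking it. The sketch below assumes that the archive is named `SRR3191542_1_fastqc.zip`, that it sits next to the HTML report, and that it contains a tab-separated `summary.txt` with one `STATUS<TAB>module<TAB>filename` line per FastQC module. The helper name `read_fastqc_summary` is hypothetical.

```python
import zipfile
from pathlib import Path


def read_fastqc_summary(zip_path):
    """Return {module name: status} from the summary.txt inside a FastQC zip."""
    results = {}
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the archive holds exactly one */summary.txt member.
        member = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        for line in archive.read(member).decode().splitlines():
            if not line.strip():
                continue
            status, module, _filename = line.split("\t")
            results[module] = status
    return results


# Assumed location and name of the archive written by the fastqc cell above.
zip_path = Path.home() / "tutorial-data/qc_examples/SRR3191542_1_fastqc.zip"
if zip_path.exists():
    for module, status in read_fastqc_summary(zip_path).items():
        print(f"{status}\t{module}")
```

This gives a quick PASS/WARN/FAIL overview per module; the HTML report remains the place to look at the plots themselves.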

We can preview the FastQC HTML report by running the next cell:

In [None]:
%%HTML
<iframe width="100%" height="550" src="/user/tutorial-user/view/tutorial-data/qc_examples/SRR3191542_1_fastqc.html"></iframe>

## Sequencing QC with Fastp

[Fastp](https://github.com/OpenGene/fastp) is a more recently developed and popular QC tool. It has the advantage of not just reporting on quality, but also trimming and filtering the reads in the same run to improve their quality. Have a look over the [Fastp manual](https://github.com/OpenGene/fastp/blob/master/README.md); what do the options in the following command do? Run the command and then examine the results:

In [None]:
!fastp -QLyAG -i $HOME/tutorial-data/qc_examples/SRR3191542_1.fastq.gz -o $HOME/tutorial-data/qc_examples/fastp_SRR3191542_1.fastq.gz

Let's see the reports generated:

In [None]:
!ls $HOME/tutorial-data/qc_examples/
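
Alongside the HTML report, fastp also writes a JSON report (`fastp.json` by default) that is convenient to read from Python. The sketch below assumes that the file is in the notebook's working directory and that it contains a `summary` section with `before_filtering` and `after_filtering` entries, each holding `total_reads` and `q30_rate`. Check these keys against your own report; the function name `summarize_fastp` is hypothetical.

```python
import json
from pathlib import Path


def summarize_fastp(report):
    """Pull read counts and Q30 rates out of a parsed fastp JSON report."""
    before = report["summary"]["before_filtering"]
    after = report["summary"]["after_filtering"]
    return {
        "reads_before": before["total_reads"],
        "reads_after": after["total_reads"],
        "q30_rate_before": before["q30_rate"],
        "q30_rate_after": after["q30_rate"],
    }


# Assumption: fastp.json sits in the directory the notebook was run from.
json_path = Path("fastp.json")
if json_path.exists():
    print(summarize_fastp(json.loads(json_path.read_text())))
```

Comparing the before and after values shows how many reads the chosen filters removed.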

Again, we can preview the results:

In [None]:
%%HTML
<iframe width="100%" height="600" src="/user/tutorial-user/view/tutorial-data/qc_examples/fastp.html"></iframe>

You can also view the HTML results by going back to the Jupyter notebook home screen. Click `..` to go up one level in the directory listing; your results will be in `tutorial-data/qc_examples`.

## Running Fastp on the tutorial sample data

The sample data for this tutorial is [described in this paper](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0175744) and is located on this computer here:

In [None]:
!ls -R /home/tutorial-user/tutorial-data/kallisto-bulk/data/fastq_files/

To analyze these data with fastp, we can use a `for` loop over the files, passing either a single read file or a pair of read files to each fastp run and naming the output files to match. We will not need to execute the loop, but an example is given below:

For paired-end files (*hint* - remove the # in front of the fastp command for this to actually work). 

In [None]:
!cd /home/tutorial-user/tutorial-data/kallisto-bulk/data/fastq_files/paired-end/ &&\
 for r1infile in *_1.fastq.gz;\
 do\
 r2infile=$(basename $r1infile _1.fastq.gz)_2.fastq.gz;\
 r1outfile=fastp_$(basename $r1infile _1.fastq.gz)_1.fastq.gz;\
 r2outfile=fastp_$(basename $r1infile _1.fastq.gz)_2.fastq.gz;\
 reportname=$(basename $r1infile _1.fastq.gz).fastp-report.html;\
 echo "processing $r1infile and $r2infile";\
 #fastp -h $reportname -i $r1infile -o $r1outfile -I $r2infile -O $r2outfile;\
 done
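
The file-name bookkeeping in the loop can also be written in Python, which makes it easy to check the generated names before running anything. This is a sketch only: it builds the argument list but does not call fastp, it assumes read 1 files end in `_1.fastq.gz`, and both the function name `paired_fastp_command` and the `SAMPLE` file name are placeholders.

```python
from pathlib import Path


def paired_fastp_command(r1_path):
    """Build (without running) a fastp command for one paired-end sample."""
    r1_path = Path(r1_path)
    suffix = "_1.fastq.gz"
    if not r1_path.name.endswith(suffix):
        raise ValueError(f"expected a file ending in {suffix}: {r1_path.name}")
    # Strip only the trailing suffix, so underscores elsewhere are kept.
    sample = r1_path.name[: -len(suffix)]
    folder = r1_path.parent
    return [
        "fastp",
        "-h", str(folder / f"{sample}.fastp-report.html"),
        "-i", str(r1_path),
        "-o", str(folder / f"fastp_{sample}_1.fastq.gz"),
        "-I", str(folder / f"{sample}_2.fastq.gz"),
        "-O", str(folder / f"fastp_{sample}_2.fastq.gz"),
    ]


# Placeholder path; replace with a real read 1 file from the paired-end folder.
print(" ".join(paired_fastp_command("paired-end/SAMPLE_1.fastq.gz")))
```

Working from the file name rather than the full path matters here, because the directory `fastq_files` itself contains an underscore and would confuse any approach that splits the whole path on `_`.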

For single-end files (*hint* - remove the # in front of the fastp command for this to actually work).

In [None]:
!cd /home/tutorial-user/tutorial-data/kallisto-bulk/data/fastq_files/single-end/ &&\
 for infile in *_1.fastq.gz;\
 do\
 outfile="fastp_${infile}"; reportname="fastp_${infile}.report.html";\
 echo "processing $infile";\
 #fastp -h "$reportname" -i "$infile" -o "$outfile";\
 done