# Data trimming and filtering with fastp

[fastp](https://github.com/OpenGene/fastp) is "A tool designed to provide fast all-in-one preprocessing for FastQ files. This tool is developed in C++ with multithreading supported to afford high performance."

In other words, we can take the FASTQ files (our sequencing data) and do various things to improve the quality of that data, such as:

- Remove reads that have an overall low quality (e.g., low [Phred score](https://en.wikipedia.org/wiki/Phred_quality_score))
- Trim portions of a read (e.g., where the first few/last few bases are of low quality)
- Remove reads that are shorter than some desired length (e.g., drop short reads and keep long reads)
- Remove adapters (e.g., in sequencing, some artificial DNA "tags" may be added to the DNA to be sequenced)
- And more...

## Installing fastp

As always, we will install the software; again we will use conda. 

**Important**: Make sure you execute each numbered step.

1. We will search for the tool we want to install

We will use the `conda search` command and the channel (`-c`) flag to search [bioconda](https://bioconda.github.io/)

In [None]:
conda search fastp -c bioconda

2. Create a conda environment

Conda uses something called "environments", which are essentially isolated configurations on our computer where we can include all the compatible tools we need and exclude tools that are unnecessary or would conflict with our desired tool. We will use the `-y` option to install without prompting the user for input and the `--name` option to name the environment after the tool. We will also pin the version (`tool==version`) so that we know which version of a tool was used for an analysis, should we wish to repeat it.

**Tip**: Use the latest version where possible, but if you get an error with dependencies, using a lower version may help. Some tools may never install successfully using conda, but we will deal with those when we have to.


In [None]:
conda create -y --name fastp fastp==0.20.0 -c bioconda

3. We will use the `conda init` command so that conda can be configured for this shell

In [None]:
conda init

4. **DON'T SKIP**: We need to restart the computer's [kernel](https://en.wikipedia.org/wiki/Kernel_(operating_system)). Go to the **Kernel** menu and choose **Restart Kernel**.

5. Finally, we can activate the conda environment (using the name we gave it when we created it). When you run the next cell it should return the name of the environment.

In [None]:
conda activate fastp
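
Before moving on, it can help to confirm that the shell can actually find fastp. The following is a minimal sketch and not part of the original exercise: `check_tool` is a hypothetical helper name, and it relies only on the shell built-in `command -v`.

```shell
# check_tool is a hypothetical helper: report whether a program is on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# If fastp is found, print its version so it can be recorded with your results.
if check_tool fastp; then
    fastp --version
else
    echo "Activate the environment (conda activate fastp) and try again."
fi
```

If the tool is reported as missing, the most likely cause is that the kernel was not restarted or the environment was not activated.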

## Running fastp

First, we will run fastp on a small subset of the sequence reads. In this exercise, the reads are in the `concat_fastq` folder. We will run it on a sample file that contains 100 reads from the polyrhiza experiment (`100_spolyrhiza_reads.fastq.gz`).

**Tip**: When using commands or searching for files, the tab key will help you autocomplete (and help ensure the files and commands you think you have are actually accessible).



### fastp options

When we call the `fastp` program here are some options we are likely to use (see the [fastp](https://github.com/OpenGene/fastp) documentation for all options):

**Note**: We are working with "single-end" data so we will only cover these options

- `--in1 <file_name.fastq.gz>`: Name of the input file (this is the file to be trimmed)
- `--out1 <file_name.fastq.gz>`: Name of the output FASTQ file that should be created (i.e., the reads that remain after trimming and filtering). You get to name the output file; choose something descriptive like "trimmed".
- `--length_required <some number>`: Remove reads shorter than this length (bp)
- `--average_qual <some number>`: Remove reads when their average quality (Phred score) is below some number. Don't be too strict with Nanopore reads since we expect a lower Phred score on average. Reads with an average less than 9 have been automatically excluded during the initial sequencing.
- `--thread <some number>`: How many CPUs to use (more will be faster, but you can only use as many CPUs as you launched this application with).
- `--report_title "some title in quotes"`: A descriptive title for your summary report
- `--html <filename.html>`: The name of the file to save your HTML report in
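
To get a feel for what an `--average_qual` threshold means, recall that a Phred score Q corresponds to an estimated per-base error probability of 10^(-Q/10). The sketch below uses `awk` only to do the arithmetic; `phred_to_error` is a hypothetical helper name and the scores in the loop are illustrative.

```shell
# Convert a Phred score to an estimated error probability: P = 10^(-Q/10).
# phred_to_error is a hypothetical helper used only for this illustration.
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

# Print the error probability for a few example scores.
for q in 9 10 20 30; do
    echo "Q$q -> error probability $(phred_to_error "$q")"
done
```

An average of Q9 corresponds to roughly a 1 in 8 chance of error per base, while Q20 corresponds to 1 in 100, which is why a threshold that is reasonable for other platforms may discard most Nanopore reads.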

### Example 1 - running fastp on `100_spolyrhiza_reads.fastq.gz`

1. Review and run the following `fastp` analysis. Ensure you have a sense of what this command should do

**Tip**: A backslash at the end of a line tells the shell that the command continues on the next line. It does not change the command; it just lets us split a long command over several lines so it is easier to read.

In [None]:
fastp --in1 concat_fastq/100_spolyrhiza_reads.fastq.gz \
      --out1 concat_fastq/100_spolyrhiza_reads_filtered.fastq.gz \
      --length_required 1000 \
      --report_title "100 Reads Sample Reads>= 1000bp" \
      --html 100_spolyrhiza_reads_filtered_example.html

2. In your Jupyter file browser, you should now have an HTML file entitled `100_spolyrhiza_reads_filtered_example.html`. Double-click to open and examine the file. 
**Tip**: This file contains graphics, but you must click the **Trust HTML** button in the upper-left of the tab that opens to see them.

### Results

Filtering only by length clearly changes the total number of reads (from 100 to only 55 that passed this filter). In Nanopore sequencing, length is not really related to sequencing quality, so the other statistics (e.g., the number of bases with a Phred score of at least 20) also change, but only slightly.
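
The read counts in the report can be cross-checked from the command line, since each FASTQ record normally occupies four lines. The sketch below builds a tiny toy file so that it runs on its own; `count_reads` and `toy_reads.fastq.gz` are hypothetical names, and the sequences are made up.

```shell
# count_reads is a hypothetical helper. It assumes the usual 4-line FASTQ
# layout, so the number of reads is the line count divided by 4.
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Build a toy gzipped FASTQ file with three made-up reads.
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGT\n+\nIIII\n@read3\nACGTAC\n+\nIIIIII\n' \
    | gzip -c > toy_reads.fastq.gz

# Count the reads in the toy file.
count_reads toy_reads.fastq.gz
```

Running `count_reads` on `concat_fastq/100_spolyrhiza_reads.fastq.gz` and on `concat_fastq/100_spolyrhiza_reads_filtered.fastq.gz` should agree with the before and after counts shown in the HTML report.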

## Challenge - Use fastp to create your input sequences for assembly

Now, it's up to you to take the entire sequencing dataset (located in `concat/spolyrhiza_reads.fastq.gz`) and make choices about how much you want to clean and filter this data. The better data you present to the assembly software, the better the genome assembly you will get. However, there are tradeoffs. 

Think of this as if you were making an apple pie. You go to buy apples, but some have brown spots and some are spoiled. No one wants a pie made of spoiled apples. However, if you insist only on 100% perfect apples, there may not be enough for your pie.

In your genome assembly, having longer, high-quality reads will help. However, you need a minimum of 30-40X [coverage](https://en.wikipedia.org/wiki/Coverage_(genetics)) for most assemblers to work. Our genome is approximately 140-150 million bases, meaning we want somewhere around 6 billion bases (150 million bases × 40X coverage).
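
The coverage arithmetic can be sketched in the shell. The genome size and target coverage are the upper values from the paragraph above; `total_bases` and `toy_cov.fastq.gz` are hypothetical names that exist only so the example runs on its own.

```shell
# Target number of bases = genome size x desired coverage.
genome_size=150000000   # upper estimate of the genome size, in bases
target_coverage=40      # upper end of the 30-40X range
target_bases=$((genome_size * target_coverage))
echo "Target bases: $target_bases"

# total_bases is a hypothetical helper: sum the lengths of the sequence lines
# (line 2 of every 4-line FASTQ record) in a gzipped file.
total_bases() {
    gzip -dc "$1" | awk 'NR % 4 == 2 { n += length($0) } END { printf "%.0f\n", n }'
}

# Toy file with two made-up reads (8 + 4 = 12 bases).
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n' | gzip -c > toy_cov.fastq.gz
bases=$(total_bases toy_cov.fastq.gz)

# Estimated coverage = total bases / genome size.
awk -v b="$bases" -v g="$genome_size" 'BEGIN { printf "Coverage: %.8fX\n", b / g }'
```

Pointing `total_bases` at one of your filtered output files gives a quick estimate of whether a set of filters still leaves enough data for assembly.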


### What to do

1. Try some fastp filter/trimming tools on the `spolyrhiza_reads.fastq.gz` file located at `concat/spolyrhiza_reads.fastq.gz`

**Your command should**
- start with `fastp`
- specify the `--in1` and `--out1` parameters (give your output file a unique and descriptive name)
- use the `--thread` parameter, set to the number of CPUs you chose when you launched this application on CyVerse
- specify the `--report_title` and `--html` parameters (again use unique and descriptive names/titles)

2. Take note of the number of reads you start and end with - don't be too strict
3. You can combine several of the trimming and filtering tools in the [fastp](https://github.com/OpenGene/fastp) documentation
4. You should create at least two outputs of reads for trimming and filtering - perhaps one that is very conservative (higher read lengths, base qualities) and one not so conservative.

In the end, the only way you will know which of your choices are better is to run the genome assembler and see what results you get. Different assemblers will be tolerant of different lengths and qualities. Some assemblers may actually do better with large numbers of lower quality reads (because they use the sheer number of reads to correct for errors).
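
One way to organize the two runs is a small loop that builds each command from a settings label. This is only a sketch: the thresholds, labels, output names, and thread count are illustrative assumptions rather than recommended values, and the loop prints each command instead of running it. Remove the `echo` in front of `fastp` once you are happy with the settings.

```shell
# Dry run: print one fastp command per filtering strategy.
# All thresholds and names below are illustrative assumptions, not recommendations.
input=concat/spolyrhiza_reads.fastq.gz
threads=4   # set to the number of CPUs you launched the application with

# Each entry is label:minimum_length:minimum_average_quality
for setting in lenient:1000:9 strict:5000:12; do
    label=${setting%%:*}        # text before the first colon
    rest=${setting#*:}          # text after the first colon
    min_len=${rest%%:*}         # minimum read length
    min_qual=${rest#*:}         # minimum average Phred score

    echo fastp --in1 "$input" \
        --out1 "spolyrhiza_${label}.fastq.gz" \
        --length_required "$min_len" \
        --average_qual "$min_qual" \
        --thread "$threads" \
        --report_title "spolyrhiza_${label}_len${min_len}_q${min_qual}" \
        --html "spolyrhiza_${label}.html"
done
```

Encoding the settings in the file names and report title makes it easier to tell the outputs apart later, when you compare assembly results.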

In [None]:
# Add as many cells as you need to run your commands. Make sure you use unique file names and reports each time to keep track. Be sure to save your work to your folder on CyVerse


## Document your work

This is the first notebook where you will be manipulating a file. We need to keep good track of what changes were made and how a file was produced so that we can fully document our work. In this exercise, it will be critical to know how you trimmed and filtered reads. Knowing your settings will allow us to compare results across everyone who does this experiment. You will also be able to go back and reproduce your work if needed.

**Make sure your student folder has your HTML report (which has a copy of your command/settings at the bottom of the report) and your trimmed/filtered FASTQ files.**