#Exploring, trimming and processing of Illumina (fastq) data 



To follow the exercise below, you'll need to download the Illumina data as described [here](https://github.com/HullUni-bioinformatics/metabarcode-course-2016/blob/master/data/raw_Illumina_data/Download_Illumina_data_from_SRA.ipynb).

The cells below assume that the relative path to your reads is `../raw_Illumina_data/reads`. Amend if necessary.

__TASK__


__How to display the first 12 lines of a gzipped file?__

The pipe symbol `|` is an extremely powerful shell operator that passes the output of one command directly to the next command as its input. This is often referred to as 'piping'.



In [None]:
!gunzip -c ../raw_Illumina_data/reads/Windermere_01-CytB_2.fastq.gz | head -n 12

In [None]:
!zcat -c ../raw_Illumina_data/reads/Windermere_01-CytB_2.fastq.gz | head -n 12

__TASK__

__Determine the number of sequences in a fastq file__

Display only every 4th line of the file and count the lines.

In [None]:
!zcat ../raw_Illumina_data/reads/Windermere_01-CytB_2.fastq.gz | sed -n '1~4p' | wc -l
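The same count can be sketched in Python. This is a minimal illustration that assumes every record occupies exactly four lines (no wrapped sequence or quality lines); the function name `count_fastq_records` is made up for this example.

```python
import io


def count_fastq_records(handle):
    """Count FASTQ records, assuming exactly four lines per record."""
    n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        # A remainder suggests a truncated file or wrapped (multi-line) records
        raise ValueError("line count is not a multiple of 4")
    return n_lines // 4


# Tiny in-memory stand-in for a real file
demo = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nTTGA\n+\nIIII\n"
)
n_records = count_fastq_records(demo)
print(n_records)
```

For the real data you could pass a handle opened with `gzip.open('../raw_Illumina_data/reads/Windermere_01-CytB_2.fastq.gz', 'rt')` instead of the in-memory example.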

As we did before for the `fasta` file, we could search the `fastq` file for lines with a specific pattern that we are sure will occur only once per sequence, and count them.

As you know by now, by definition, each sequence header in fastq format starts with an '@' character. Couldn't we simply look for that?



In [None]:
!zcat ../raw_Illumina_data/reads/Windermere_01-CytB_2.fastq.gz | grep "@" | wc -l

Let's be more specific and look only for lines that start with an '@' character.

In [None]:
!zcat ../raw_Illumina_data/reads/Windermere_01-CytB_2.fastq.gz | grep "^@" | wc -l

Still not quite... __Why could that be?__

Another useful pattern to search for is a line that contains only '+', i.e. the 3rd line in the standard fastq format.

In [None]:
!zcat ../raw_Illumina_data/reads/Windermere_01-CytB_2.fastq.gz | grep "^+$" | wc -l

##Quality trimming

__TASK__

Quality trim to a Phred score of at least 30 (Q >= 30), discarding sequences shorter than 250 bp.

Two steps:
 - decompress (gunzip)
 - perform trimming


In [None]:
%%bash
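# -c writes to stdout and keeps the .gz file, which the later cells still read
gunzip -c ../raw_Illumina_data/reads/Windermere_01-CytB_2.fastq.gz > ../raw_Illumina_data/reads/Windermere_01-CytB_2.fastq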
fastq_quality_trimmer -Q 33 -t 30 -l 250 -v -i ../raw_Illumina_data/reads/Windermere_01-CytB_2.fastq -o test1.fastq

Alternatively, you could use a pipe here too, which avoids writing the decompressed file to disk:

In [None]:
!zcat ../raw_Illumina_data/reads/Windermere_01-CytB_2.fastq.gz | \
fastq_quality_trimmer -Q 33 -t 30 -l 250 -v -o test1.fastq
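To make the parameters more concrete, here is a simplified Python sketch of end trimming. It assumes Phred+33 encoding (the `-Q 33` setting) and models only the idea of removing low-quality bases from the end of a read and then applying a minimum length; it is not the actual implementation of `fastq_quality_trimmer`, and the function names are made up for this example.

```python
def phred33_scores(quality_line):
    """Decode a Phred+33 quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in quality_line]


def trim_read_end(seq, qual, threshold=30, min_len=250):
    """Trim bases below `threshold` from the end of the read.

    Returns (sequence, quality) or None if the trimmed read is shorter
    than `min_len`. Simplified model, not the FASTX-toolkit code.
    """
    scores = phred33_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], qual[:end]


# 'I' encodes Q40 and '5' encodes Q20 in Phred+33
kept = trim_read_end("ACGTACGT", "IIIII555", threshold=30, min_len=4)
dropped = trim_read_end("ACGTACGT", "IIIII555", threshold=30, min_len=6)
print(kept)
print(dropped)
```

The short `min_len` values are only there so that the toy read survives; the task above uses 250.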

__TASK__

Discard all reads with a quality of Q < 30 in more than 40% of their bases, i.e. keep only reads in which at least 60% of the bases have Q >= 30.

In [None]:
!zcat ../raw_Illumina_data/reads/Windermere_01-CytB_2.fastq.gz | \
fastq_quality_filter -Q 33 -q 30 -p 60 -v -o test2.fastq
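The relationship between `-q 30` and `-p 60` can be sketched in Python as follows. This is an illustration only, assuming Phred+33 encoding and that a read passes when the percentage of bases at or above the threshold is at least `-p`; the function name is made up for this example.

```python
def passes_quality_filter(qual, min_q=30, min_percent=60):
    """Return True if at least `min_percent` % of bases have quality >= `min_q`."""
    scores = [ord(ch) - 33 for ch in qual]  # Phred+33 decoding
    if not scores:
        return False
    good = sum(1 for s in scores if s >= min_q)
    # Integer comparison avoids floating-point rounding at the boundary
    return good * 100 >= min_percent * len(scores)


# 'I' encodes Q40 and '5' encodes Q20 in Phred+33
borderline = passes_quality_filter("IIIIII5555")  # 6 of 10 bases pass (60%)
too_poor = passes_quality_filter("IIIII55555")    # 5 of 10 bases pass (50%)
print(borderline, too_poor)
```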


__TASK__

Combine both of the above steps via a pipe.

In [None]:
!zcat ../raw_Illumina_data/reads/Windermere_01-CytB_2.fastq.gz | \
fastq_quality_filter -Q 33 -q 30 -p 60 -v | \
fastq_quality_trimmer -Q 33 -t 30 -l 250 -v -o test3.fastq

__Other tools__

[Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) is another efficient tool for trimming Illumina data.

To run the command below successfully, you will have to specify the correct input files. Can you determine the desired trimming strategy?

In [None]:
%%bash

java -jar /usr/bin/trimmomatic-0.32.jar PE -phred33 -trimlog trimmomatic.log \
../raw_Illumina_data/reads/Windermere_01-CytB_1.fastq.gz ../raw_Illumina_data/reads/Windermere_01-CytB_2.fastq.gz \
trimmomatic_paired_1.fastq trimmomatic_orphan_1.fastq \
trimmomatic_paired_2.fastq trimmomatic_orphan_2.fastq \
LEADING:30 TRAILING:30 SLIDINGWINDOW:5:20 MINLEN:200


[flash](http://ccb.jhu.edu/software/FLASH/) for merging of overlapping read pairs.

In [None]:
!flash trimmomatic_paired_1.fastq trimmomatic_paired_2.fastq -o flash

Explore read length distribution after merging with flash.

In [None]:
!cat flash.hist
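If you want to compute a length distribution yourself, for example for a file that did not come from flash, a minimal Python sketch could look like this. It assumes four lines per fastq record; the function name is made up for this example.

```python
import io
from collections import Counter


def read_length_histogram(handle):
    """Count how many reads have each length (four-line FASTQ assumed)."""
    lengths = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the 2nd line of each record holds the sequence
            lengths[len(line.strip())] += 1
    return lengths


# Tiny in-memory stand-in for a real file
demo = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGTAC\n+\nIIIIII\n"
    "@r3\nTTGA\n+\nIIII\n"
)
hist = read_length_histogram(demo)
for length in sorted(hist):
    print(length, hist[length], sep="\t")
```

For the merged reads you could pass `open('flash.extendedFrags.fastq')` instead of the in-memory example.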

[vsearch](https://github.com/torognes/vsearch) for read clustering. 

A real world metabarcoding use case for `vsearch` could be:

Cluster the merged reads at 99% similarity, retaining one sequence per cluster (aka a 'centroid') and write a table summarizing the fate of each sequence in the dataset.


[vsearch](https://github.com/torognes/vsearch) is an extremely versatile tool, and it is worth knowing your way around it.


Note that to perform clustering you'll first have to convert your fastq sequences into fasta format.

You could do this using a tool from the FASTX-toolkit, e.g.:

In [None]:
!fastq_to_fasta -h

In [None]:
!fastq_to_fasta -Q 33 -v -n -i flash.extendedFrags.fastq -o flash.extendedFrags.fasta

Or use some simple python code and [Biopython](http://biopython.org/wiki/Biopython) functions.

In [None]:
from Bio import SeqIO

# Parse the merged reads as FASTQ and write them back out as FASTA.
# SeqIO.write returns the number of records written.
with open("flash.extendedFrags.fastq", "r") as input_handle, \
     open("flash.extendedFrags.fasta", "w") as output_handle:
    Seqs = SeqIO.parse(input_handle, "fastq")
    count = SeqIO.write(Seqs, output_handle, "fasta")

print("Converted %i reads from 'fastq' to 'fasta' format" % count)

Now, let's cluster at 99 % similarity.

In [None]:
%%bash

vsearch --cluster_fast flash.extendedFrags.fasta \
--id 0.99 --centroids flash.centroids.99.fasta --uc flash.centroids.99.uc

Now, let's cluster at 95 % similarity.

In [None]:
%%bash

vsearch --cluster_fast flash.extendedFrags.fasta \
--id 0.95 --centroids flash.centroids.95.fasta --uc flash.centroids.95.uc

How many clusters did you retain? Compare 99% vs. 95% clustering. Which similarity setting do you think would be appropriate? This is something to discuss.
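One way to answer the first question is to summarize the `.uc` table. The sketch below assumes the USEARCH-style tab-separated layout, in which column 1 is the record type (`S` for a centroid, `H` for a hit, `C` for a cluster summary), column 2 is the cluster number, and column 3 holds the cluster size on `C` lines. Check this against the vsearch manual for your version; the function name and the example records are made up for illustration.

```python
import io
from collections import Counter


def summarize_uc(handle):
    """Tally record types and per-cluster sizes from a .uc table."""
    record_types = Counter()
    cluster_sizes = {}
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        record_types[fields[0]] += 1
        if fields[0] == "C":
            # Assumed layout: column 2 = cluster number, column 3 = size
            cluster_sizes[int(fields[1])] = int(fields[2])
    return record_types, cluster_sizes


# Hypothetical records, not real vsearch output
demo = io.StringIO(
    "S\t0\t300\t*\t*\t*\t*\t*\treadA\t*\n"
    "H\t0\t300\t99.3\t+\t0\t0\t300M\treadB\treadA\n"
    "S\t1\t298\t*\t*\t*\t*\t*\treadC\t*\n"
    "C\t0\t2\t*\t*\t*\t*\t*\treadA\t*\n"
    "C\t1\t1\t*\t*\t*\t*\t*\treadC\t*\n"
)
record_types, cluster_sizes = summarize_uc(demo)
print("clusters:", record_types["C"])
print("sizes:", cluster_sizes)
```

Running this on `flash.centroids.99.uc` and `flash.centroids.95.uc` would give you the two cluster counts to compare.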

__Well Done!__