#Exploring, trimming and processing of Illumina (fastq) data 




This exercise uses a number of programs, all of which are pre-installed in the metaBEAT Docker image.

__TASK__


__How do you display the first 12 lines of a gzipped file?__





In [None]:
!gunzip -c example.fastq.gz > example.fastq

In [19]:
!head -n 12 example.fastq

@1_1101_20378_1073_1
GTACAGGTTGAACTGTTTACCCTCCCTATCCTCCAACCTCTCCCACAACGGAGCATCAGTAGATTTAGCCATCTTCTCCCTACATTTAGCAGGAGTATCA
+
EGGGFGFEFGGGGGGGGGGGGGGGGGGGE@EEFGGGGG@FE<FFGGDGGEGCCFF@FGFFGGD<FFGGGGGGGGGFFGFFFFDFEEEEC@,6C==9E?EF
@1_1101_19728_1076_1
GTACTGGATGAACAGTTTATCCCCCCCTATCCTCAACCTCTCTCACAACGGAGCATCAGTAGATTTAGCCATCTTCTCCCTACATTTAGCAGGAGTATCA
+
GGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGCFFGG8EGGGF@FFFCFGGGFFGGDGFGEFGGGFGGGGGFGGGGGGGC
@1_1101_11882_1079_1
GAACTGGTTGAACAGTATATCCTCCGCTCTCTAGAGCGATTGCCCACACAGGGGCTTCGGTTGATCTCGCCATCTTCTCTCTTCACCTTGCAGGTGTAAG
+
GGGGGGGFGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGG@FGGGCGGFGGGGGGGGEGGGGGGGGGGGGGGCEGGGGGGGGFGGGGGGC


Explore the above output and try to understand the basic characteristics of the fastq format. How many lines are there per sequence? How is `fastq` different from `fasta` format?

The pipe symbol `|` passes the output of one command directly to the next command as input. This is often referred to as 'piping', and you have already used it in the command line exercise on the first day. You can also use it to peek into compressed files without first decompressing the data to disk.

In [None]:
!gunzip -c example.fastq.gz | head -n 12

In [None]:
!zcat example.fastq.gz | head -n 12
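
The same peek into a compressed file can be sketched in Python with the standard `gzip` module. This is a minimal illustration rather than part of the exercise: the helper name `head_gz` and the two-record fastq used for the demo are made up, and the data is compressed in memory so that the example runs without any input file.

```python
import gzip
import io
import itertools

def head_gz(source, n=12):
    """Return the first n lines of gzip-compressed text.

    `source` may be a file name or a binary file object.
    """
    with gzip.open(source, "rt") as handle:
        # islice stops after n lines, so the rest is never decompressed
        return [line.rstrip("\n") for line in itertools.islice(handle, n)]

# Hypothetical two-record fastq, compressed in memory for the demo
raw = b"@read_1\nACGT\n+\nGGGG\n@read_2\nTTGA\n+\nFFFF\n"
demo = io.BytesIO(gzip.compress(raw))

first_record = head_gz(demo, n=4)
print(first_record)
```

With a real file you would call, for example, `head_gz("example.fastq.gz")`, assuming that file is in your working directory.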

__TASK__

__Determine the number of sequences in a fastq file__

Display only every 4th line of the file and count the lines (assuming your file is still compressed).

In [None]:
!zcat example.fastq.gz | sed -n '1~4p' | wc -l

As we did before for the `fasta` file, we could search for lines with a specific pattern in the `fastq` file that we are sure will only occur once per sequence and count.

As you know by now, each sequence header in fastq format starts, by definition, with an '@' character. Couldn't we simply look for that?



In [None]:
!zcat example.fastq.gz | grep "@" | wc -l

Let's be more specific and look only for lines that start with an '@' character.

In [None]:
!zcat example.fastq.gz  | grep "^@" | wc -l

Still not quite. __Why could that be?__

Another useful pattern to search for is a line that contains only '+', i.e. the 3rd line in the standard fastq format.

In [None]:
!zcat example.fastq.gz | grep "^+$" | wc -l
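
To see why the three counting strategies can disagree, here is a small Python sketch on a made-up two-record fastq (the read names and sequences are hypothetical). It assumes the standard four-line record layout and Phred+33 quality encoding, in which '@' is itself a valid quality character.

```python
# Hypothetical two-record fastq; the second quality line starts with '@'
fastq_text = (
    "@read_1\n"
    "ACGT\n"
    "+\n"
    "GGGG\n"
    "@read_2\n"
    "TTGA\n"
    "+\n"
    "@FG<\n"
)
lines = fastq_text.splitlines()

# Strategy 1: four lines per record
by_line_count = len(lines) // 4

# Strategy 2: lines starting with '@' (also catches quality lines)
starts_with_at = sum(line.startswith("@") for line in lines)

# Strategy 3: lines that consist of '+' only
plus_only = sum(line == "+" for line in lines)

# '@' is ASCII 64, which is Phred 31 in the Phred+33 encoding
phred_of_at = ord("@") - 33

print(by_line_count, starts_with_at, plus_only, phred_of_at)
```

In this toy example the '@' count overshoots by one, because a quality string happens to begin with '@'. Note that the '+' approach relies on the separator line being a bare '+'; some fastq files repeat the header after it.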

##Quality trimming

__TASK__

Quality trim using a Phred score threshold of Q30 (bases with lower quality are trimmed from the end of each read), discarding sequences shorter than 250 bp.



Two steps:
 - decompress (gunzip)
 - perform trimming


In [None]:
%%bash
gunzip -c AHA_ASH_2.fastq.gz > AHA_ASH_2.fastq
fastq_quality_trimmer -Q 33 -t 30 -l 250 -v -i AHA_ASH_2.fastq -o AHA_ASH_2-trimmed.fastq

Alternatively, you could use a pipe here as well, so that the data is never decompressed to disk:

In [None]:
!zcat AHA_ASH_2.fastq.gz | \
fastq_quality_trimmer -Q 33 -t 30 -l 250 -v -o AHA_ASH_2-trimmed.fastq
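
To make the idea behind end trimming concrete, the following Python sketch trims low-quality bases from the 3' end of a single read and discards the read if it ends up too short. It is a simplified illustration under the assumption of Phred+33 encoding, not the actual implementation of `fastq_quality_trimmer`; the function name `trim_3prime` and the example read are hypothetical.

```python
def trim_3prime(seq, qual, threshold=30, min_len=250, offset=33):
    """Trim bases below `threshold` from the 3' end.

    Returns (sequence, quality) or None if the trimmed read is
    shorter than `min_len`.
    """
    end = len(seq)
    # Walk backwards until a base with sufficient quality is found
    while end > 0 and ord(qual[end - 1]) - offset < threshold:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], qual[:end]

# Hypothetical read: 'I' is Q40 and '5' is Q20 in Phred+33
trimmed = trim_3prime("ACGTACGT", "IIIII555", min_len=4)
too_short = trim_3prime("ACGTACGT", "IIIII555", min_len=6)
print(trimmed, too_short)
```

The short `min_len` values are only there to keep the toy read readable; the exercise itself uses 250.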

__TASK__

Remove all reads with a quality of Q < 30 in more than 40% of their bases, i.e. keep only reads in which at least 60% of the bases have Q ≥ 30 (`-q 30 -p 60`).

In [None]:
!fastq_quality_filter -i AHA_ASH_2.fastq -Q 33 -q 30 -p 60 -v -o AHA_ASH_2-filtered.fastq
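
The relationship between the task ("Q < 30 in more than 40% of bases") and the `-q 30 -p 60` settings can be sketched in Python. This is an illustration of the criterion only, assuming Phred+33 encoding and assuming that a read exactly at the percentage cutoff is kept; it is not the code of `fastq_quality_filter`, and the name `passes_filter` is made up.

```python
def passes_filter(qual, min_quality=30, min_percent=60, offset=33):
    """Return True if at least `min_percent` % of bases have Q >= `min_quality`."""
    if not qual:
        return False
    good = sum(ord(char) - offset >= min_quality for char in qual)
    return 100.0 * good / len(qual) >= min_percent

# Hypothetical quality strings: 'I' is Q40 and '5' is Q20 in Phred+33
kept = passes_filter("IIIIII5555")     # 6 of 10 bases are Q >= 30
removed = passes_filter("IIIII55555")  # only 5 of 10 bases are Q >= 30
print(kept, removed)
```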


__TASK__

Combine both of the above via a pipe. __Note__ that the `\` symbols at the end of some lines below only wrap the command across several lines to make it easier to read. It is still a single command.

In [None]:
!zcat AHA_ASH_1.fastq.gz  | \
fastq_quality_filter -Q 33 -q 30 -p 60 -v | \
fastq_quality_trimmer -Q 33 -t 30 -l 250 -v -o AHA_ASH_1-trim-filter.fastq

##Reduce redundancy by read clustering

We will use [vsearch](https://github.com/torognes/vsearch) for read clustering.

A real world metabarcoding use case for `vsearch` could be:

Cluster the merged reads at 97% similarity, retaining one sequence per cluster (aka a 'centroid') and write a table summarizing the fate of each sequence in the dataset.


[vsearch](https://github.com/torognes/vsearch) is an extremely versatile tool and well worth getting to know.


Note that to perform clustering you'll first have to convert your fastq sequences into fasta format.

You could do this using a tool from the FASTX-toolkit, e.g.:

In [None]:
!fastq_to_fasta -h

In [None]:
%%bash
gunzip AHA_ASH.extendedFrags.fastq.gz
fastq_to_fasta -Q 33 -v -n -i AHA_ASH.extendedFrags.fastq -o AHA_ASH.extendedFrags.fasta

Or use some simple python code and [Biopython](http://biopython.org/wiki/Biopython) functions.

In [None]:
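from Bio import SeqIO

# Read the fastq records and write them back out in fasta format.
# The 'with' statement closes both files when the block ends.
with open('AHA_ASH.extendedFrags.fastq', 'r') as input_handle, \
     open('AHA_ASH.extendedFrags.fasta', 'w') as output_handle:
    seqs = SeqIO.parse(input_handle, 'fastq')
    count = SeqIO.write(seqs, output_handle, 'fasta')

print("Converted %i reads from 'fastq' to 'fasta' format" % count)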

Now, let's cluster at 97 % similarity.

In [None]:
%%bash

vsearch --cluster_fast AHA_ASH.extendedFrags.fasta --id 0.97 \
--centroids AHA_ASH.extended.clustered-0.97.fasta \
--uc AHA_ASH.extended.clustered-0.97.uc

Now, let's cluster at 95 % similarity.

In [None]:
%%bash

vsearch --cluster_fast AHA_ASH.extendedFrags.fasta --id 0.95 \
--centroids AHA_ASH.extended.clustered-0.95.fasta \
--uc AHA_ASH.extended.clustered-0.95.uc

How many clusters did you retain? Compare 97% vs. 95% clustering. Which similarity setting do you think would be appropriate? This is something to discuss.
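
One way to count clusters is to tally the record types in the `.uc` table. The sketch below assumes the usual tab-separated layout in which the first column holds the record type (`S` for a centroid, `H` for a hit to a centroid, `C` for a cluster summary). The helper name `uc_record_counts` and the shortened example records are hypothetical, so check the vsearch manual against your own output before relying on it.

```python
from collections import Counter

def uc_record_counts(lines):
    """Count .uc records by type (first tab-separated column)."""
    return Counter(line.split("\t", 1)[0] for line in lines if line.strip())

# Hypothetical, shortened .uc content for illustration only
uc_lines = [
    "S\t0\t100\t*\t*\t*\t*\t*\tread_1\t*",
    "H\t0\t100\t98.0\t+\t0\t0\t100M\tread_2\tread_1",
    "S\t1\t100\t*\t*\t*\t*\t*\tread_3\t*",
    "C\t0\t2\t*\t*\t*\t*\t*\tread_1\t*",
    "C\t1\t1\t*\t*\t*\t*\t*\tread_3\t*",
]

counts = uc_record_counts(uc_lines)
print("clusters:", counts["C"], "hits:", counts["H"])

# With real output you could pass an open file instead, e.g.:
# with open("AHA_ASH.extended.clustered-0.97.uc") as handle:
#     print(uc_record_counts(handle)["C"])
```

The number of `C` records should match the number of sequences in the corresponding centroids fasta file, which gives you a simple cross-check.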

__Well Done!__

##BLAST 

Run a BLAST search of the cluster centroids against the full nucleotide (nt) database downloaded from GenBank.

In [None]:
!blastn -db ../reference-dbs/BLAST/nt/nt -query AHA_ASH.extended.clustered-0.97.fasta \
-num_alignments 50 -out AHA_ASH.extended.clustered-0.97.blastn.nt.out