**Overview**

Preprocessing is an important first step in all NGS data analysis. We will pre-process several types of NGS data, and the output from each dataset will be used in the alignment exercise tomorrow.

 - P. aeruginosa single end Illumina reads
 - Human Illumina paired end reads
 - E. coli - 454 and Ion Torrent data

# Introduction

Let's try some reads from a study of Pseudomonas aeruginosa, an opportunistic pathogen that can live in the environment and may infect humans. Inside humans it may live in the lungs and form biofilms. The reads are from a strain that has infected ~40 patients over the last 30 years in Denmark, where it adapted to the human-host lung environment. In the alignment exercise we are going to compare it to the PAO1 reference genome, which was isolated in 1955 in Australia from a wound (i.e. it probably came straight from the environment), and see which genes it has lost during the adaptation (assuming that the PAO1 genome is a relatively close ancestor). But first let us check the quality of the data.

Make a subdirectory in your own folder and make a symbolic link to the data. <br>
The first file we will work with is called <font color=green>Paeruginosa.fastq.gz</font>. <br>
It is located at: <font color=green>/exercises/alignment/paeruginosa/Paeruginosa.fastq.gz</font>

Look at the reads. They are gzipped to save space, so we view them with zcat, which decompresses gzipped files to standard output. When the output is piped to head, gzip may report "Broken pipe"; this only means that head stopped reading early and can be ignored.

_First step is to prepare the folder._

In [1]:
! mkdir -p exercises/alignment/paeruginosa
%cd exercises/alignment/paeruginosa
! pwd

/home/jupyter-admin/ngs/exercises/alignment/paeruginosa
/home/jupyter-admin/ngs/exercises/alignment/paeruginosa


In [2]:
! zcat /exercises/alignment/paeruginosa/Paeruginosa.fastq.gz | head -n 12

@HWUSI-EAS656_0037_FC:3:1:16637:1035#NNNNNN/1
CATATTTTGTGGCTCATCCCAAGGGAGAGGTTTTTCTATACTCAGGAGAAGTTACTCACGATAAAGAGAA
+
41?8FFF@@DAGGGEDF@FGECGGGBG@GE.EEBGBDADBBEEBEEC>ACE>CD?EEC?CAB>EB:<C##
@HWUSI-EAS656_0037_FC:3:1:4655:1043#NNNNNN/1
CATGGTGTTGGCCAGCAGCACATTCCTGCCCATGTAGAACTCGCGCAGCCCGGAACTGGACGGCAGGGGC
+
&&77?2>8:GDGDGEDGBDDCDDGEDGBDD>DDCAA>A?5>>>5>>C>>39:;83889157;78?:?>?#
@HWUSI-EAS656_0037_FC:3:1:11313:1042#NNNNNN/1
GTAGGGGTGGTAGAGCGCCTTGCGGCCGACCTGCCGGGCAAGGGAGCGGGTGATGTCGTAGACGATGCCG
+
)&1?EEE?@D<BGGGDBDDDDCDDGDGBCA@A>@DD@A<B>CB?@AA@AB:<<*/048)99;;AA@AA@6

gzip: stdout: Broken pipe


**Q1. How long are the reads?
  Q2. How many reads are there? 
  Q3. What is the average depth? The genome size is 6,588,339 nt.**
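
The three questions above come down to counting records and bases. Below is a minimal sketch of that calculation, assuming a standard four-line-per-record FASTQ (no wrapped sequence lines). The function names and the tiny in-memory example are made up for illustration; average depth is taken as total sequenced bases divided by genome size.

```python
import gzip
import io


def fastq_stats(handle):
    """Return (number of reads, total bases) for an open text-mode FASTQ handle.

    Assumes every record is exactly four lines: header, sequence, '+', qualities.
    """
    n_reads = 0
    n_bases = 0
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the sequence is the second line of each record
            n_reads += 1
            n_bases += len(line.strip())
    return n_reads, n_bases


def average_depth(n_bases, genome_size):
    """Average depth = total sequenced bases / genome size."""
    return n_bases / genome_size


def gzipped_fastq_stats(path):
    """Same as fastq_stats, but opens a gzipped FASTQ file by path."""
    with gzip.open(path, "rt") as handle:
        return fastq_stats(handle)


# Tiny made-up example: two reads (4 and 6 bases) on a 5 nt "genome".
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
reads, bases = fastq_stats(demo)
print(reads, bases, average_depth(bases, 5))
```

For the exercise data you would call `gzipped_fastq_stats` on the P. aeruginosa file and pass the genome size given in Q3 to `average_depth`.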
   

We need to find out which quality encoding the reads use in order to trim the reads. Instead of looking at the data let's use this little program: 

In [3]:
! zcat /exercises/alignment/paeruginosa/Paeruginosa.fastq.gz | python /exercises/alignment/paeruginosa/fastx_detect_fq.py

Traceback (most recent call last):
  File "/exercises/alignment/paeruginosa/fastx_detect_fq.py", line 5, in <module>
    from Bio.SeqIO.QualityIO import FastqGeneralIterator
ModuleNotFoundError: No module named 'Bio'

gzip: stdout: Broken pipe
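
If the helper script cannot run (as in the traceback above, where Biopython is missing), the same idea can be sketched with the standard library only. This is not the course script, just an illustration of the usual heuristic: quality characters below ASCII 59 can only occur with an offset of 33, while characters above ASCII 74 point towards an offset of 64. The function name, the cut-offs and the `max_records` parameter are assumptions of this sketch.

```python
import io


def guess_phred_offset(handle, max_records=10000):
    """Guess the quality offset (33 or 64) from the range of quality characters.

    Returns 33, 64, or None when the observed range is ambiguous.
    Assumes four-line FASTQ records.
    """
    lowest, highest = 255, 0
    for i, line in enumerate(handle):
        if i // 4 >= max_records:
            break
        if i % 4 == 3:  # the quality string is the fourth line of each record
            codes = [ord(c) for c in line.rstrip("\n")]
            if codes:
                lowest = min(lowest, min(codes))
                highest = max(highest, max(codes))
    if highest == 0:
        raise ValueError("no quality lines found")
    if lowest < 59:   # too low to be valid in any +64 encoding
        return 33
    if highest > 74:  # higher than typical +33 Illumina data
        return 64
    return None


# Made-up records, one per encoding.
print(guess_phred_offset(io.StringIO("@a\nACGT\n+\n#5?I\n")))
print(guess_phred_offset(io.StringIO("@b\nACGT\n+\nfghh\n")))
```

To apply it to the exercise file, open it with `gzip.open(path, "rt")` and pass the handle to the function.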


**Q4. What are the qualities encoded in? This information is needed when we want to trim the reads and perform alignment.**

**FastQC**

Let's take a look at the reads and see whether they need adapter or quality trimming. We will run fastqc on the reads and view the report using Firefox. 


In [4]:
! fastqc /exercises/alignment/paeruginosa/Paeruginosa.fastq.gz --outdir="."

/bin/sh: 1: fastqc: not found


This program outputs several statistics in the report. Look at each of the figures and see if it reports something that we should look at (red/white cross). Look at the "Per base sequence quality". 

In [6]:
import IPython.display as disp
disp.IFrame(src="./exercises/alignment/paeruginosa/Paeruginosa_fastqc.html" ,width=1400, height=700)


**Q5. 
a) Briefly, what does the "Per base sequence quality" plot show? 
b) Is that what you expect, and why?**

Now, look at "Overrepresented sequences". Here you see that a sequence often found in the reads 
matches the Illumina Paired End PCR Primer 2. It is especially important to remove such sequences when you are doing a de novo assembly, because they overlap between reads and lead to wrong assemblies. They are also troublesome for alignment, as aligners often perform global alignment (i.e. they try to align the whole read to the genome). 
Copy the primer sequence and save it for use with the cutadapt software.

**Cut adaptors/primers**

Let's cut the adapter from the reads. We only keep reads that are at least 30 nucleotides long after cutting (-m 30), and at least 3 bases must match the adapter before it is cut (-O 3). When it is done, some statistics will be printed. 

In [6]:
! cutadapt -b "CTAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTGAAAAAAA" -m 30 -o Paeruginosa.fastq.cut.gz -O 3 /exercises/alignment/paeruginosa/Paeruginosa.fastq.gz

This is cutadapt 1.15 with Python 3.6.9
Command line parameters: -b CTAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTGAAAAAAA -m 30 -o Paeruginosa.fastq.cut.gz -O 3 /exercises/alignment/paeruginosa/Paeruginosa.fastq.gz
Running on 1 core
Trimming 1 adapter with at most 10.0% errors in single-end mode ...
Finished in 40.27 s (36 us/read; 1.65 M reads/minute).

=== Summary ===

Total reads processed:               1,106,297
Reads with adapters:                    21,068 (1.9%)
Reads that were too short:               6,636 (0.6%)
Reads written (passing filters):     1,099,661 (99.4%)

Total basepairs processed:    77,440,790 bp
Total written (filtered):     76,924,907 bp (99.3%)

=== Adapter 1 ===

Sequence: CTAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTGAAAAAAA; Type: variable 5'/3'; Length: 70; Trimmed: 21068 times.
13770 times, it overlapped the 5' end of a read
7298 times, it overlapped the 3' end or was within the read

No. of allowed errors:
0-9 bp: 0; 1

**Q6. Look at the top of the output. Does the number of reads that were cut fit with the number of times the sequence was found in the FastQC report?**
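
To get a feeling for what the minimum overlap (`-O 3`) and minimum length (`-m 30`) settings mean, here is a deliberately simplified sketch. It only handles an adapter at the 3' end and requires exact matches, whereas cutadapt itself uses error-tolerant alignment and, with `-b`, also considers the 5' end. The function names and the short adapter in the example are hypothetical.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove an exact 3' adapter match from a read (simplified illustration).

    A full adapter anywhere in the read is cut together with everything after it.
    Otherwise, a prefix of the adapter at the very end of the read is cut,
    but only if at least `min_overlap` bases match.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    longest = min(len(adapter), len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    return read


def long_enough(read, min_length=30):
    """Keep only reads with at least `min_length` bases after trimming."""
    return len(read) >= min_length


adapter = "AGATCGGAAGAGC"  # short made-up adapter for the example
print(trim_3prime_adapter("ACGTACGTACGTAGATC", adapter))  # 5-base overlap: cut
print(trim_3prime_adapter("ACGTACGTACGTAG", adapter))     # 2-base overlap: kept
```

A small minimum overlap removes more true adapter remnants but also cuts more reads that match by chance, which is why a threshold such as 3 is used instead of 1.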

Also try trimming low quality bases from the ends. Afterwards, take a look at the reads. You should see the varying length now: 

In [8]:
! cutadapt -b "CTAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTGAAAAAAA"  -m 30 -o Paeruginosa.fastq.cut.trim.gz -O 3 -q 20 --quality-base=33 /exercises/alignment/paeruginosa/Paeruginosa.fastq.gz
! gzip -dc Paeruginosa.fastq.cut.trim.gz | head -n10
#You can also re-run fastqc on the trimmed files and view them to see the difference. 

This is cutadapt 1.15 with Python 3.6.9
Command line parameters: -b CTAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTGAAAAAAA -m 30 -o Paeruginosa.fastq.cut.trim.gz -O 3 -q 20 --quality-base=33 /exercises/alignment/paeruginosa/Paeruginosa.fastq.gz
Running on 1 core
Trimming 1 adapter with at most 10.0% errors in single-end mode ...
Finished in 37.98 s (34 us/read; 1.75 M reads/minute).

=== Summary ===

Total reads processed:               1,106,297
Reads with adapters:                    22,181 (2.0%)
Reads that were too short:              98,506 (8.9%)
Reads written (passing filters):     1,007,791 (91.1%)

Total basepairs processed:    77,440,790 bp
Quality-trimmed:              12,534,718 bp (16.2%)
Total written (filtered):     62,833,041 bp (81.1%)

=== Adapter 1 ===

Sequence: CTAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTGAAAAAAA; Type: variable 5'/3'; Length: 70; Trimmed: 22181 times.
15110 times, it overlapped the 5' end of a read
7071 times, it 

In [10]:
! fastqc /exercises/alignment/paeruginosa/Paeruginosa.fastq.cut.trim.gz --outdir="."

Started analysis of Paeruginosa.fastq.cut.trim.gz
Approx 5% complete for Paeruginosa.fastq.cut.trim.gz
Approx 10% complete for Paeruginosa.fastq.cut.trim.gz
Approx 15% complete for Paeruginosa.fastq.cut.trim.gz
Approx 20% complete for Paeruginosa.fastq.cut.trim.gz
Approx 25% complete for Paeruginosa.fastq.cut.trim.gz
Approx 30% complete for Paeruginosa.fastq.cut.trim.gz
Approx 35% complete for Paeruginosa.fastq.cut.trim.gz
Approx 40% complete for Paeruginosa.fastq.cut.trim.gz
Approx 45% complete for Paeruginosa.fastq.cut.trim.gz
Approx 50% complete for Paeruginosa.fastq.cut.trim.gz
Approx 55% complete for Paeruginosa.fastq.cut.trim.gz
Approx 60% complete for Paeruginosa.fastq.cut.trim.gz
Approx 65% complete for Paeruginosa.fastq.cut.trim.gz
Approx 70% complete for Paeruginosa.fastq.cut.trim.gz
Approx 75% complete for Paeruginosa.fastq.cut.trim.gz
Approx 80% complete for Paeruginosa.fastq.cut.trim.gz
Approx 85% complete for Paeruginosa.fastq.cut.trim.gz
Approx 90% complete for Paerugino

In [None]:
import IPython.display as disp
disp.IFrame(src="./exercises/alignment/paeruginosa/Paeruginosa.fastq.cut.trim_fastqc.html" ,width=1400, height=700)

**Q7. What was the effect on the qualities, and what was the effect on the per base sequence content? Can you suggest an additional trimming to increase the quality?** <br>

## Human Illumina Paired end reads
Let us look at some paired-end Illumina reads. These reads are from ~40X whole-genome sequencing (WGS) of an Asian individual. To save time we are not going to use all reads: they have been filtered so that we only have reads that map to chromosome 21. 

The data we will be using is :

/exercises/alignment/human/HG00418_A_*.fastq.gz

Let's look at the data using fastqc. To begin with we are only going to use the "A" reads (A_1 and A_2). The _1 file contains the first of all pairs (forward reads) and the _2 contains the second of the pairs (reverse reads). Try to type in the commands needed to run fastqc. 

It looks like the reads can use some trimming. We don't find any adaptor sequences, so let's skip that and only trim the reads based on quality. For this we are going to use __prinseq__. 


**RESTART THE KERNEL BEFORE YOU MOVE ON**

In [1]:
! mkdir -p exercises/alignment/human
%cd exercises/alignment/human
! pwd

/home/jupyter-panos/exercises/alignment/human
/home/jupyter-panos/exercises/alignment/human


In [2]:
! fastqc /exercises/alignment/human/HG00418_A_1.fastq.gz --outdir="."
! fastqc /exercises/alignment/human/HG00418_A_2.fastq.gz --outdir="."

Started analysis of HG00418_A_1.fastq.gz
Approx 5% complete for HG00418_A_1.fastq.gz
Approx 10% complete for HG00418_A_1.fastq.gz
Approx 15% complete for HG00418_A_1.fastq.gz
Approx 20% complete for HG00418_A_1.fastq.gz
Approx 25% complete for HG00418_A_1.fastq.gz
Approx 30% complete for HG00418_A_1.fastq.gz
Approx 35% complete for HG00418_A_1.fastq.gz
Approx 40% complete for HG00418_A_1.fastq.gz
Approx 45% complete for HG00418_A_1.fastq.gz
Approx 50% complete for HG00418_A_1.fastq.gz
Approx 55% complete for HG00418_A_1.fastq.gz
Approx 60% complete for HG00418_A_1.fastq.gz
Approx 65% complete for HG00418_A_1.fastq.gz
Approx 70% complete for HG00418_A_1.fastq.gz
Approx 75% complete for HG00418_A_1.fastq.gz
Approx 80% complete for HG00418_A_1.fastq.gz
Approx 85% complete for HG00418_A_1.fastq.gz
Approx 90% complete for HG00418_A_1.fastq.gz
Approx 95% complete for HG00418_A_1.fastq.gz
Analysis complete for HG00418_A_1.fastq.gz
Started analysis of HG00418_A_2.fastq.gz
Approx 5% complete fo

In [9]:
import IPython.display as disp
disp.IFrame(src="./exercises/alignment/human/HG00418_A_1_fastqc.html" ,width=1400, height=700)

In [6]:
import IPython.display as disp
disp.IFrame(src="./exercises/alignment/human/HG00418_A_2_fastqc.html" ,width=1400, height=700)

Prinseq cannot read/write compressed files, so we need to pipe the data in and out using gzip. We trim low-quality bases from the right (3') end using a quality threshold of 20. We also require a mean read quality of at least 20, and after trimming a read must be at least 35 bp long for us to keep it. Each command is written over two lines; paste in both lines. Each command takes ~2 mins to complete.

In [8]:
! gzip -dc /exercises/alignment/human/HG00418_A_1.fastq.gz | perl /exercises/alignment/scripts/prinseq-lite.pl -fastq stdin -out_bad null -out_good stdout \
-trim_qual_right 20 -min_qual_mean 20 -min_len 35 -log prinseq.HG00418_A_1.log | gzip > HG00418_A_1.trim.gz

Input and filter stats:
	Input sequences: 1,059,560
	Input bases: 105,956,000
	Input mean length: 100.00
	Good sequences: 1,053,458 (99.42%)
	Good bases: 101,320,041
	Good mean length: 96.18
	Bad sequences: 6,102 (0.58%)
	Bad bases: 610,200
	Bad mean length: 100.00
	Sequences filtered by specified parameters:
	trim_qual_right: 141
	min_len: 5957
	min_qual_mean: 4


In [7]:
! gzip -dc /exercises/alignment/human/HG00418_A_2.fastq.gz | perl /exercises/alignment/scripts/prinseq-lite.pl -fastq stdin -out_bad null -out_good stdout \
-trim_qual_right 20 -min_qual_mean 20 -min_len 35 -log prinseq.HG00418_A_2.log | gzip > HG00418_A_2.trim.gz

Input and filter stats:
	Input sequences: 1,059,560
	Input bases: 105,956,000
	Input mean length: 100.00
	Good sequences: 1,048,379 (98.94%)
	Good bases: 100,627,714
	Good mean length: 95.98
	Bad sequences: 11,181 (1.06%)
	Bad bases: 1,118,100
	Bad mean length: 100.00
	Sequences filtered by specified parameters:
	trim_qual_right: 1593
	min_len: 9546
	min_qual_mean: 42
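
The three prinseq options used above can be illustrated with a small sketch. It assumes that trimming removes bases from the 3' end one at a time while their quality is below the threshold, and that the length and mean-quality filters are applied to the trimmed read; prinseq's exact internals may differ. All function names are made up for the example.

```python
def decode_quals(qual_string, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores."""
    return [ord(c) - offset for c in qual_string]


def trim_qual_right(seq, quals, threshold=20):
    """Drop bases from the 3' end while their quality is below `threshold`."""
    end = len(quals)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return seq[:end], quals[:end]


def passes_filters(quals, min_len=35, min_qual_mean=20):
    """Apply the length filter first, then the mean-quality filter."""
    if not quals or len(quals) < min_len:
        return False
    return sum(quals) / len(quals) >= min_qual_mean


# Made-up read: only the last base is trimmed, because trimming stops
# at the first base (from the right) that reaches the threshold.
seq, quals = trim_qual_right("ACGTACGT", [30, 30, 30, 30, 30, 10, 25, 5])
print(seq, quals)
print(passes_filters(quals, min_len=5))  # long enough and mean quality >= 20
print(passes_filters(quals))             # shorter than the default 35
```

Note that a low-quality base inside the read (the 10 in the example) survives this kind of end trimming; it only lowers the mean quality.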


When we trimmed our paired end reads, we removed reads from each file independently. This means that the ordering is no longer in sync between the two files, e.g. if we removed read no. 5 in file1 but not in file2, then read 6 in file1 will be paired with read 5 in file2. We need to fix this using pairfq before we can use the reads. 

We use "-f" for the input forward reads (first of pair) and "-r" for the input reverse reads (second of pair). Reads that still have their mate are written to the "common" files (-fp and -rp), and reads that lost their mate are written to the "unique" files (-fs and -rs). It is the "common" files that we are going to use.
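
The re-pairing step can be sketched as follows. This is not how pairfq is implemented, only an illustration of the idea: match reads by name, keep the ones present in both files in the same order, and put the rest aside. It assumes that mates share the same name apart from a trailing /1 or /2; the records are simple (header, sequence, quality) tuples made up for the example.

```python
def read_id(header):
    """Return the read name without the leading '@' and a trailing /1 or /2."""
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def make_pairs(forward, reverse):
    """Split two lists of (header, seq, qual) records into paired and unpaired.

    Returns (common_fwd, common_rev, unique_fwd, unique_rev); the two
    "common" lists have the same length and the same read order.
    """
    reverse_by_id = {read_id(rec[0]): rec for rec in reverse}
    forward_ids = set()
    common_fwd, common_rev, unique_fwd = [], [], []
    for rec in forward:
        key = read_id(rec[0])
        forward_ids.add(key)
        mate = reverse_by_id.get(key)
        if mate is not None:
            common_fwd.append(rec)
            common_rev.append(mate)
        else:
            unique_fwd.append(rec)
    unique_rev = [rec for rec in reverse if read_id(rec[0]) not in forward_ids]
    return common_fwd, common_rev, unique_fwd, unique_rev


# Made-up example: r2 lost its reverse mate, r3 lost its forward mate.
fwd = [("@r1/1", "ACGT", "IIII"), ("@r2/1", "ACGT", "IIII"), ("@r4/1", "ACGT", "IIII")]
rev = [("@r1/2", "TTTT", "IIII"), ("@r3/2", "TTTT", "IIII"), ("@r4/2", "TTTT", "IIII")]
cf, cr, uf, ur = make_pairs(fwd, rev)
print([r[0] for r in cf], [r[0] for r in cr])
print([r[0] for r in uf], [r[0] for r in ur])
```

Holding all reverse reads in a dictionary is fine for a small example but uses a lot of memory on full datasets, which is one reason to use a dedicated tool.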

In [9]:
! gunzip /home/jupyter-panos/exercises/alignment/human/HG00418_A_1.trim.gz
! gunzip /home/jupyter-panos/exercises/alignment/human/HG00418_A_2.trim.gz

In [10]:
! perl /exercises/alignment/scripts/pairfq makepairs -f /home/jupyter-panos/exercises/alignment/human/HG00418_A_1.trim -r /home/jupyter-panos/exercises/alignment/human/HG00418_A_2.trim -fp HG00418_A_1.trim.fastq-common.out.gz \
-rp HG00418_A_2.trim.fastq-common.out.gz -fs HG00418_A_1.trim.fastq-unique.out.gz -rs HG00418_A_2.trim.fastq-unique.out.gz --compress gzip

**Q8. 
a) How many paired reads are left? Please report absolute count and percentage of the untrimmed files. 
b) Were most removed reads from the forward or reverse file?**




__E. coli - 454 and Ion Torrent data__

The 454 and Ion Torrent data are from the German E. coli outbreak (cucumber/sprouts) in the summer of 2011. The files are here: 

/exercises/alignment/ecoli/GOS1.fastq.gz 
/exercises/alignment/ecoli/iontorrent.fq.gz 

Take a look at the fastq.gz files, and then run fastqc and look at the reports. 

Look at the different quality profiles. What differences can you see between the data types?

We can also look at a graphical report that visualizes what the 454 data looks like: 
*/ecoli/GOS1.gd.html 

**Q9. Looking at the html report, how long are the 454 reads?** Let's trim the 454 reads. 

In [11]:
! fastqc /exercises/alignment/ecoli/GOS1.fastq.gz  --outdir="."

Started analysis of GOS1.fastq.gz
Approx 5% complete for GOS1.fastq.gz
Approx 10% complete for GOS1.fastq.gz
Approx 15% complete for GOS1.fastq.gz
Approx 20% complete for GOS1.fastq.gz
Approx 25% complete for GOS1.fastq.gz
Approx 30% complete for GOS1.fastq.gz
Approx 35% complete for GOS1.fastq.gz
Approx 40% complete for GOS1.fastq.gz
Approx 45% complete for GOS1.fastq.gz
Approx 50% complete for GOS1.fastq.gz
Approx 55% complete for GOS1.fastq.gz
Approx 60% complete for GOS1.fastq.gz
Approx 65% complete for GOS1.fastq.gz
Approx 70% complete for GOS1.fastq.gz
Approx 75% complete for GOS1.fastq.gz
Approx 80% complete for GOS1.fastq.gz
Approx 85% complete for GOS1.fastq.gz
Approx 90% complete for GOS1.fastq.gz
Approx 95% complete for GOS1.fastq.gz
Approx 100% complete for GOS1.fastq.gz
Analysis complete for GOS1.fastq.gz


In [15]:
! fastqc /exercises/alignment/ecoli/iontorrent.fq.gz  --outdir="."

Started analysis of iontorrent.fq.gz
Approx 5% complete for iontorrent.fq.gz
Approx 10% complete for iontorrent.fq.gz
Approx 15% complete for iontorrent.fq.gz
Approx 20% complete for iontorrent.fq.gz
Approx 25% complete for iontorrent.fq.gz
Approx 30% complete for iontorrent.fq.gz
Approx 35% complete for iontorrent.fq.gz
Approx 40% complete for iontorrent.fq.gz
Approx 45% complete for iontorrent.fq.gz
Approx 50% complete for iontorrent.fq.gz
Approx 55% complete for iontorrent.fq.gz
Approx 60% complete for iontorrent.fq.gz
Approx 65% complete for iontorrent.fq.gz
Approx 70% complete for iontorrent.fq.gz
Approx 75% complete for iontorrent.fq.gz
Approx 80% complete for iontorrent.fq.gz
Approx 85% complete for iontorrent.fq.gz
Approx 90% complete for iontorrent.fq.gz
Approx 95% complete for iontorrent.fq.gz
Approx 100% complete for iontorrent.fq.gz
Analysis complete for iontorrent.fq.gz


In [13]:
import IPython.display as disp
disp.IFrame(src="./exercises/alignment/human/GOS1_fastqc.html" ,width=1400, height=700)

In [16]:
import IPython.display as disp
disp.IFrame(src="./exercises/alignment/human/iontorrent_fastqc.html" ,width=1400, height=700)


In [24]:
import IPython.display as disp
disp.IFrame(src="/exercises/alignment/ecoli/GOS1.gd.html" ,width=1400, height=700)


Let's say we don't want reads shorter than 150 bp, and we want to chop reads longer than 500 bp down to 500 bp. Let's also remove reads with an average quality lower than 20.

In [23]:
! gzip -dc /exercises/alignment/ecoli/GOS1.fastq.gz | perl /exercises/alignment/scripts/prinseq-lite.pl -fastq stdin -graph_data GOS1.gd -out_good stdout \
-out_bad null -min_len 150 -min_qual_mean 20 -trim_to_len 500 | gzip -c > GOS1.trim.fastq.gz



Input and filter stats:
	Input sequences: 175,000
	Input bases: 63,292,714
	Input mean length: 361.67
	Good sequences: 148,282 (84.73%)
	Good bases: 61,096,439
	Good mean length: 412.03
	Bad sequences: 26,718 (15.27%)
	Bad bases: 2,081,746
	Bad mean length: 77.92
	Sequences filtered by specified parameters:
	min_len: 26641
	min_qual_mean: 77
Filehandle STDIN reopened as FH only for output at /exercises/alignment/scripts/prinseq-lite.pl line 2278.
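
For variable-length reads such as these, the combination of a length cap and a minimum length can be sketched as below. It assumes that reads are first cut to at most 500 bases from the 3' end and then filtered on length and mean quality; the function name is made up and prinseq may apply its options in a different order.

```python
def process_454_read(seq, quals, min_len=150, trim_to_len=500, min_qual_mean=20):
    """Cap a read at `trim_to_len` bases, then filter on length and mean quality.

    Returns the (possibly shortened) (seq, quals) pair, or None if the read
    is discarded.
    """
    seq, quals = seq[:trim_to_len], quals[:trim_to_len]
    if not seq or len(seq) < min_len:
        return None
    if sum(quals) / len(quals) < min_qual_mean:
        return None
    return seq, quals


# Made-up reads: too long (gets cut), too short (dropped), low quality (dropped).
kept = process_454_read("A" * 600, [30] * 600)
print(len(kept[0]))
print(process_454_read("A" * 100, [30] * 100))
print(process_454_read("A" * 200, [10] * 200))
```

Reads longer than the cap are kept in shortened form, which is why the mean length of the good reads can go up even though nothing exceeds 500 bp afterwards.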
