## 2. Reads Cleaning

*Tools*

**Trimmomatic**
* Conda: https://bioconda.github.io/recipes/trimmomatic/README.html#package-trimmomatic
* Manual: http://www.usadellab.org/cms/?page=trimmomatic
* Other resource: https://academic.oup.com/bioinformatics/article/30/15/2114/2390096

**cutadapt**
* Conda: https://bioconda.github.io/recipes/cutadapt/README.html#package-cutadapt
* Manual: https://cutadapt.readthedocs.io/en/stable/

In [1]:
! ls ./P1/

Archive               reads_1.fastq.gz      trimmo_adapters.fasta


Steps to be performed: 
* Remove the adapters (ILLUMINACLIP:trimmo_adapters.fasta:2:30:10) 
 * *Cut adapter and other illumina-specific sequences from the read*
 * ILLUMINACLIP **:** fastaWithAdaptersEtc **:** seed mismatches **:** palindrome clip threshold **:** simple clip threshold
   * **fastaWithAdaptersEtc**: specifies the path to a fasta file containing all the adapters, PCR sequences etc. The naming of the various sequences within this file determines how they are used.
   * **seedMismatches**: specifies the maximum mismatch count which will still allow a full match to be performed
   * **palindromeClipThreshold**: specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment
   * **simpleClipThreshold**: specifies how accurate the match between any adapter etc. sequence must be against a read
* Quality trimming (SLIDINGWINDOW:4:15)
 * *Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold*
 * SLIDINGWINDOW **:** windowSize **:** requiredQuality
   * **windowSize**: specifies the number of bases to average across
   * **requiredQuality**: specifies the average quality required
* Remove remaining sequences shorter than 40 nucleotides (MINLEN:40)
 * *Drop the read if it is below a specified length*
 * MINLEN **:** length
   * **length**: Specifies the minimum length of reads to be kept
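
The colon-separated step strings above can be read as a step name followed by positional values. The following is a minimal sketch that maps those values onto the parameter names listed above. `parse_step` and `STEP_FIELDS` are hypothetical helpers written for this illustration, not part of Trimmomatic, and they only cover the three step forms used in this section.

```python
from typing import Dict, List, Union

# Parameter names per step, copied from the descriptions above.
STEP_FIELDS: Dict[str, List[str]] = {
    "ILLUMINACLIP": ["fastaWithAdaptersEtc", "seedMismatches",
                     "palindromeClipThreshold", "simpleClipThreshold"],
    "SLIDINGWINDOW": ["windowSize", "requiredQuality"],
    "MINLEN": ["length"],
}


def parse_step(step: str) -> Dict[str, Union[str, int]]:
    """Split a step string such as 'SLIDINGWINDOW:4:15' into named values."""
    name, *values = step.split(":")
    fields = STEP_FIELDS[name]
    if len(values) != len(fields):
        raise ValueError(f"{name} expects {len(fields)} values, got {len(values)}")
    parsed: Dict[str, Union[str, int]] = {"step": name}
    for field, value in zip(fields, values):
        # The adapter file is a path; every other value here is an integer.
        parsed[field] = value if field == "fastaWithAdaptersEtc" else int(value)
    return parsed


for text in ["ILLUMINACLIP:trimmo_adapters.fasta:2:30:10",
             "SLIDINGWINDOW:4:15",
             "MINLEN:40"]:
    print(parse_step(text))
```

Note that a plain split on `:` assumes the adapter path itself contains no colon.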
   
*Trimming occurs in the order in which the steps are specified on the command line. In most cases it is recommended that adapter clipping, if required, is done as early as possible.*
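
To make SLIDINGWINDOW:4:15 followed by MINLEN:40 concrete, here is a simplified sketch of the idea in Python. It is an illustration under stated assumptions, not Trimmomatic's implementation: it assumes Phred+33 qualities, cuts at the start of the first window whose mean quality falls below the threshold, and the function names are made up for this example. The exact cut position chosen by Trimmomatic may differ.

```python
from typing import List, Optional, Tuple


def phred33_scores(quality_line: str) -> List[int]:
    """Decode a FASTQ quality string (Phred+33) into integer scores."""
    return [ord(ch) - 33 for ch in quality_line]


def sliding_window_keep(scores: List[int], window_size: int = 4,
                        required_quality: int = 15) -> int:
    """Return the number of leading bases to keep.

    Windows are scanned from the 5' end; the read is cut at the start of
    the first window whose mean quality is below required_quality.
    """
    for start in range(len(scores) - window_size + 1):
        window = scores[start:start + window_size]
        # Comparing sums avoids a floating-point mean.
        if sum(window) < required_quality * window_size:
            return start
    return len(scores)


def trim_read(seq: str, qual: str, window_size: int = 4,
              required_quality: int = 15,
              min_len: int = 40) -> Optional[Tuple[str, str]]:
    """Quality-trim one read, then drop it if it is shorter than min_len."""
    keep = sliding_window_keep(phred33_scores(qual), window_size, required_quality)
    if keep < min_len:
        return None  # corresponds to the read being dropped by MINLEN
    return seq[:keep], qual[:keep]


# A 60-base read with 50 good bases ('?' = Q30) and a poor tail ('#' = Q2).
print(trim_read("A" * 60, "?" * 50 + "#" * 10))
```

Because the length filter runs after the trimming step, a read that is long on input can still be dropped once its low-quality tail has been removed.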

In [2]:
# Checking reads quality before trimming
! fastqc ./P1/reads_1.fastq.gz

Started analysis of reads_1.fastq.gz
Approx 10% complete for reads_1.fastq.gz
Approx 20% complete for reads_1.fastq.gz
Approx 30% complete for reads_1.fastq.gz
Approx 40% complete for reads_1.fastq.gz
Approx 50% complete for reads_1.fastq.gz
Approx 60% complete for reads_1.fastq.gz
Approx 70% complete for reads_1.fastq.gz
Approx 80% complete for reads_1.fastq.gz
Approx 90% complete for reads_1.fastq.gz
Approx 100% complete for reads_1.fastq.gz
Analysis complete for reads_1.fastq.gz


In [6]:
# Trimming
! trimmomatic SE ./P1/reads_1.fastq.gz ./P1/reads_1.clean.fastq.gz ILLUMINACLIP:./P1/trimmo_adapters.fasta:2:30:10 SLIDINGWINDOW:4:15 MINLEN:40

TrimmomaticSE: Started with arguments:
 ./P1/reads_1.fastq.gz ./P1/reads_1.clean.fastq.gz ILLUMINACLIP:./P1/trimmo_adapters.fasta:2:30:10 SLIDINGWINDOW:4:15 MINLEN:40
Automatically using 4 threads
Using Long Clipping Sequence: 'AAGCAGTGGTATCAACGCAGAGTACATGGG'
Using Long Clipping Sequence: 'AAGCAGTGGTATCAACGCAGAGTACTTTTT'
ILLUMINACLIP: Using 0 prefix pairs, 2 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Reads: 10000 Surviving: 6075 (60.75%) Dropped: 3925 (39.25%)
TrimmomaticSE: Completed successfully
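
The summary line above can be checked or collected programmatically. This is a small sketch that parses the single-end summary line exactly as printed in this run; the function name and the regular expression are assumptions made for this example, and other Trimmomatic modes print a different line.

```python
import re

SUMMARY = "Input Reads: 10000 Surviving: 6075 (60.75%) Dropped: 3925 (39.25%)"


def parse_trimmomatic_se_summary(line: str) -> dict:
    """Extract read counts from a Trimmomatic SE summary line."""
    match = re.search(
        r"Input Reads: (\d+) Surviving: (\d+) \(.*?\) Dropped: (\d+)", line
    )
    if match is None:
        raise ValueError("not a Trimmomatic SE summary line")
    total, surviving, dropped = map(int, match.groups())
    return {
        "input": total,
        "surviving": surviving,
        "dropped": dropped,
        # Recomputed from the counts rather than read from the log.
        "surviving_pct": round(100 * surviving / total, 2),
    }


stats = parse_trimmomatic_se_summary(SUMMARY)
print(stats)
```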


In [7]:
# Checking reads quality after trimming
! fastqc ./P1/reads_1.clean.fastq.gz

Started analysis of reads_1.clean.fastq.gz
Approx 15% complete for reads_1.clean.fastq.gz
Approx 30% complete for reads_1.clean.fastq.gz
Approx 45% complete for reads_1.clean.fastq.gz
Approx 65% complete for reads_1.clean.fastq.gz
Approx 80% complete for reads_1.clean.fastq.gz
Approx 95% complete for reads_1.clean.fastq.gz
Analysis complete for reads_1.clean.fastq.gz


In [8]:
! ls ./P2/

Archivo               reads_1.fastq.gz      trimmo_adapters.fasta


Steps to be performed
* Remove the adapters
* Trim the low-quality ends
* Discard the reads that end up being too short (fewer than 40 nucleotides)
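
The cells that follow run these steps as three separate cutadapt calls. cutadapt also accepts `-a`, `-q` and `-m` in a single invocation; the sketch below only builds such a command line so it can be inspected before running. The helper name and the output file name `reads_1.cutadapt.fastq.gz` are hypothetical. One caveat: within a single cutadapt run, quality trimming is applied before adapter removal, so the result is not guaranteed to be identical to the step-by-step order used in this notebook.

```python
import shlex
from typing import List


def build_cutadapt_command(adapters_fasta: str, reads_in: str, reads_out: str,
                           quality_cutoff: int = 10,
                           min_length: int = 40) -> List[str]:
    """Assemble one cutadapt call covering adapters, quality and length."""
    return [
        "cutadapt",
        "-a", f"file:{adapters_fasta}",  # 3' adapters read from a FASTA file
        "-q", str(quality_cutoff),       # quality cutoff for 3' end trimming
        "-m", str(min_length),           # discard reads shorter than this
        "-o", reads_out,
        reads_in,
    ]


cmd = build_cutadapt_command(
    "./P1/trimmo_adapters.fasta",
    "./P1/reads_1.fastq.gz",
    "./P1/reads_1.cutadapt.fastq.gz",  # hypothetical output name
)
print(" ".join(shlex.quote(part) for part in cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```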

In [9]:
# Remove the adapters
! cutadapt -a file:./P1/trimmo_adapters.fasta -o ./P1/reads_2_adapt.fastq.gz ./P1/reads_1.fastq.gz


This is cutadapt 1.18 with Python 3.6.10
Command line parameters: -a file:./P1/trimmo_adapters.fasta -o ./P1/reads_2_adapt.fastq.gz ./P1/reads_1.fastq.gz
Processing reads on 1 core in single-end mode ...
Finished in 0.64 s (64 us/read; 0.93 M reads/minute).

=== Summary ===

Total reads processed:                  10,000
Reads with adapters:                     3,869 (38.7%)
Reads written (passing filters):        10,000 (100.0%)

Total basepairs processed:     1,010,000 bp
Total written (filtered):        635,276 bp (62.9%)

=== Adapter a1 ===

Sequence: AAGCAGTGGTATCAACGCAGAGTACATGGG; Type: regular 3'; Length: 30; Trimmed: 1914 times.

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30 bp: 3

Bases preceding removed adapters:
  A: 1.8%
  C: 2.3%
  G: 2.7%
  T: 1.8%
  none/other: 91.4%

Overview of removed sequences
length	count	expect	max.err	error counts
3	126	156.2	0	126
4	29	39.1	0	29
5	5	9.8	0	5
6	1	2.4	0	1
7	1	0.6	0	1
9	1	0.0	0	1
100	1	0.0	3	0 1
101	1750	0.0	3	1130 3
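
The "No. of allowed errors" line in the report follows from cutadapt's default maximum error rate of 0.1 (option `-e`): the number of errors tolerated is the match length times the error rate, rounded down. A minimal sketch that reproduces the ranges printed above for the 30 bp adapter, with helper names chosen for this example:

```python
from typing import List, Tuple


def allowed_errors(match_length: int, max_error_rate: float = 0.1) -> int:
    """Errors tolerated for a match of the given length (rounded down)."""
    return int(match_length * max_error_rate)


def error_table(adapter_length: int,
                max_error_rate: float = 0.1) -> List[Tuple[int, int, int]]:
    """Group match lengths into (first_length, last_length, errors) ranges."""
    rows: List[Tuple[int, int, int]] = []
    for length in range(adapter_length + 1):
        errors = allowed_errors(length, max_error_rate)
        if rows and rows[-1][2] == errors:
            # Same error count as the previous length: extend the range.
            rows[-1] = (rows[-1][0], length, errors)
        else:
            rows.append((length, length, errors))
    return rows


for first, last, errors in error_table(30):
    print(f"{first}-{last} bp: {errors}")
```

This also explains why the very short matches (3 to 9 bp) in the overview table were all removed with zero errors: no mismatch is tolerated below 10 bp at this error rate.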

In [11]:
# Quality trimming
! cutadapt -q 10 -o ./P1/reads_3.qual.fastq.gz ./P1/reads_2_adapt.fastq.gz 

This is cutadapt 1.18 with Python 3.6.10
Command line parameters: -q 10 -o ./P1/reads_3.qual.fastq.gz ./P1/reads_2_adapt.fastq.gz
Processing reads on 1 core in single-end mode ...
Finished in 0.23 s (23 us/read; 2.62 M reads/minute).

=== Summary ===

Total reads processed:                  10,000
Reads with adapters:                         0 (0.0%)
Reads written (passing filters):        10,000 (100.0%)

Total basepairs processed:       635,276 bp
Quality-trimmed:                  12,402 bp (2.0%)
Total written (filtered):        622,874 bp (98.0%)
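
For context on what `-q 10` does: the cutadapt documentation describes its 3' quality trimming as subtracting the cutoff from every quality value, computing partial sums from each position to the end of the read, and cutting where that sum is smallest. The sketch below follows that description in plain Python; it is an illustration rather than cutadapt's own code, and the function name is made up for this example.

```python
from typing import List


def quality_trim_index(qualities: List[int], cutoff: int) -> int:
    """Return the number of leading bases to keep after 3' quality trimming.

    Walks from the 3' end, summing (quality - cutoff). The read is cut at
    the position where this running sum is minimal; the scan stops early
    once the sum becomes positive.
    """
    running = 0
    lowest = 0
    keep = len(qualities)
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running > 0:
            break
        if running < lowest:
            lowest = running
            keep = i
    return keep


# Example quality values with a declining 3' end.
quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
keep = quality_trim_index(quals, cutoff=10)
print(keep, quals[:keep])
```

Unlike a fixed sliding window, this approach can keep an isolated base above the cutoff inside a poor tail only if the bases after it do not pull the running sum lower, which is why the single Q11 base in the example is still trimmed.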



In [13]:
# Remove short reads
! cutadapt -m 40 -o ./P1/reads_4.len.fastq.gz ./P1/reads_3.qual.fastq.gz

This is cutadapt 1.18 with Python 3.6.10
Command line parameters: -m 40 -o ./P1/reads_4.len.fastq.gz ./P1/reads_3.qual.fastq.gz
Processing reads on 1 core in single-end mode ...
Finished in 0.30 s (30 us/read; 1.97 M reads/minute).

=== Summary ===

Total reads processed:                  10,000
Reads with adapters:                         0 (0.0%)
Reads that were too short:               3,741 (37.4%)
Reads written (passing filters):         6,259 (62.6%)

Total basepairs processed:       622,874 bp
Total written (filtered):        621,732 bp (99.8%)
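
The percentages in the three cutadapt reports can be cross-checked from the base-pair totals they print. The sketch below uses only numbers copied from the outputs above; the variable and function names are chosen for this example.

```python
from typing import List, Tuple

# Base pairs remaining after each stage, copied from the reports above.
BASEPAIRS: List[Tuple[str, int]] = [
    ("raw", 1010000),
    ("adapter removal", 635276),
    ("quality trimming", 622874),
    ("length filter", 621732),
]


def retained_percentages(stages: List[Tuple[str, int]]) -> List[Tuple[str, float]]:
    """Percentage of base pairs kept by each stage relative to its input."""
    result = []
    for (_, before), (name, after) in zip(stages, stages[1:]):
        result.append((name, round(100 * after / before, 1)))
    return result


for name, pct in retained_percentages(BASEPAIRS):
    print(f"{name}: {pct}% of input bp kept")

# Read survival: cutadapt pipeline vs. the Trimmomatic run on the same input.
total_reads = 10000
print("cutadapt reads kept:   ", round(100 * 6259 / total_reads, 2), "%")
print("Trimmomatic reads kept:", round(100 * 6075 / total_reads, 2), "%")
```

The two tools were run with different quality settings (a Q15 sliding window versus `-q 10`), so the difference in surviving reads reflects the parameters as much as the tools themselves.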

