<hr style="height:0px; visibility:hidden;" />

<h1><center>5. Amplicon processing</center></h1>


<div class="alert alert-block alert-success">
Now we're ready to start getting into the actual processing! Note that this is an R kernel we are working in now, and while we will break down most of the code we see, don't feel like you need to digest and completely understand all of the R code right away.
</div>

---

<center>This is notebook 5 of 6 of <a href="00-overview.ipynb">GL4U's Amplicon Bootcamp</a>. It is expected that the previous notebooks have been completed already.</center>

---

[**Previous:** 4. Setup and QC](04-setup-QC.ipynb)
<br>

<div style="text-align: right"><a href="06-amplicon-analysis.ipynb"><b>Next:</b> 6. Amplicon analysis</a></div>

---
---

## Setting up our environment

### Loading libraries

In [None]:
library(dada2)
library(tidyverse)

### Setting our location and some general variables

In [None]:
setwd("~/GL4U-amplicon-tutorial/")

In [None]:
list.files()

In [None]:
raw_reads_dir <- "raw-reads"
trimmed_and_filtered_reads_dir <- "trimmed-and-filtered-reads"
fastqc_outputs_dir <- "fastqc-outputs"
final_outputs_dir <- "final-outputs"

In [None]:
# reading in our sample info table
sample_info_tab <- read.table(file = "sample-info.tsv", header = TRUE, sep = "\t", row.names = 1)

<div class="alert alert-block alert-info">
<b>Code Breakdown</b>
<br>

- `read.table()`      - the primary function we're using
    - `file = `       - where we specify the input file we want to read
    - `header = `     - where we state if the first row should be treated as a header (TRUE/FALSE)
    - `sep = `        - where we specify the delimiter that separates values ("\t" is for tab)
    - `row.names = `  - where we can tell it which column (if any) should be treated as row names; `1` says to use the first column (we would set this to `NULL` if we wanted to explicitly not use any column as row names)

</div>
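
To see those arguments in action without touching our real file, here is a minimal sketch that reads a tiny made-up table from a text string instead of a file (using the `text = ` argument of `read.table()`). The column names and the `F11` sample are hypothetical, just for illustration:

```r
# a tiny, made-up tab-separated table held in a string (not our real sample info)
toy_tsv <- "sample\ttype\tsite\nF10\tsoil\tA\nF11\twater\tB\n"

# same arguments as above, but reading from the string rather than a file
toy_tab <- read.table(text = toy_tsv, header = TRUE, sep = "\t", row.names = 1)

# the first column became the row names, leaving two regular columns
toy_tab
row.names(toy_tab)
```

Because the first column was used as row names, pulling out `row.names()` afterwards is a handy way to get a vector of sample names.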

In [None]:
sample_info_tab

In [None]:
sample_names <- row.names(sample_info_tab)
sample_names

### Creating some variables to help with processing

In [None]:
# making an object that holds all forward read starting files
forward_raw_files <- list.files(path = raw_reads_dir, pattern = "*R1_raw.fastq.gz", full.names = TRUE)

<div class="alert alert-block alert-info">
<b>Code Breakdown</b>
<br>

- `list.files()`      - the primary function we're using
    - `path = `       - the location to look for files
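    - `pattern = `    - the pattern to match; `list.files()` treats this as a regular expression rather than a shell-style wildcard (so, unlike in the shell, `*` is not a wildcard here), and in our case it picks out the files whose names contain "R1_raw.fastq.gz"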
    - `full.names = ` - here allows us to specify if we want to retain the directory structure leading to the files, which we do want in this case (TRUE/FALSE)

</div>
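
Since `pattern = ` is a regular expression, it can help to see how a shell-style wildcard maps onto one. This is a small sketch on made-up file names (the `.md5` file is hypothetical), using base R's `glob2rx()` to do the conversion and `grepl()` to test the match:

```r
# made-up file names, including one we would NOT want to pick up
toy_files <- c("F10_R1_raw.fastq.gz", "F10_R2_raw.fastq.gz", "F10_R1_raw.fastq.gz.md5")

# converting a shell-style wildcard into the equivalent regular expression
toy_regex <- glob2rx("*R1_raw.fastq.gz")
toy_regex

# which file names match the converted pattern
glob_matches <- grepl(toy_regex, toy_files)
glob_matches

# a hand-written regular expression doing the same job:
# "\\." is a literal dot, and "$" anchors the match to the end of the name
regex_matches <- grepl("R1_raw\\.fastq\\.gz$", toy_files)
regex_matches
```

Either of these could be passed to `pattern = ` if we wanted to be strict about matching only names that end with our forward-read suffix.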

In [None]:
# looking at them
forward_raw_files

In [None]:
# now doing the same for the reverse reads
reverse_raw_files <- list.files(raw_reads_dir, pattern = "*R2_raw.fastq.gz", full.names = TRUE)
reverse_raw_files

In [None]:
# making an object holding what will be the output trimmed/filtered forward files
forward_filtered_files <- paste0(trimmed_and_filtered_reads_dir, "/", sample_names, "_R1_filtered.fastq.gz")

<div class="alert alert-block alert-info">
<b>Code Breakdown</b>
<br>

- `paste0()`      - the primary function we're using
    - `... `      - any arguments we give positionally, like the above, are converted to text and stuck together with nothing in between (a vector such as `sample_names` gives one result per element)

</div>
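
Here is a quick sketch of how `paste0()` builds one file name per sample. The sample names are made up for illustration, and `file.path()` is shown only as an alternative base R way to join a directory and file name:

```r
# hypothetical directory and sample names, just for illustration
toy_dir <- "trimmed-and-filtered-reads"
toy_samples <- c("F10", "F11")

# paste0() recycles the single strings across the vector of sample names
toy_paths <- paste0(toy_dir, "/", toy_samples, "_R1_filtered.fastq.gz")
toy_paths

# file.path() adds the "/" separator for us and gives the same result
toy_paths_alt <- file.path(toy_dir, paste0(toy_samples, "_R1_filtered.fastq.gz"))
toy_paths_alt
```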

In [None]:
forward_filtered_files

In [None]:
# now doing the same for the reverse reads
reverse_filtered_files <- paste0(trimmed_and_filtered_reads_dir, "/", sample_names, "_R2_filtered.fastq.gz")
reverse_filtered_files

---

## Checking on settings to remove primers

It is imperative that we properly remove the primers, otherwise we will end up with non-biological sequences introduced by the ambiguous bases in the primers that were used. We can try trimming the primers with dada2's quality trimming/filtering program. But before we run it on everything, we're going to look closely at one sample and test things on it, comparing the sequences before and after so we can visibly check that the primers are indeed being removed.

These are the primers for this dataset, and the IUPAC degenerate-base codes.

```
f primer: GTGYCAGCMGCCGCGGTAA
r primer: GGACTACNVGGGTWTCTAAT

Y = C/T  
M = A/C  
N = A/T/G/C  
V = A/C/G  
W = A/T  
```

### Ensuring we can spot the primers

In [None]:
# building the path to the F10 forward read file
fwd_test_file <- paste0(raw_reads_dir, "/F10_R1_raw.fastq.gz")

In [None]:
fwd_test_file

In [None]:
incon <- gzcon(file(fwd_test_file, open = "rb"))

In [None]:
# this reads in the first 8 lines, with each set of 4 lines holding one fastq entry
fwd_lines <- readLines(incon, 8)
# closing the connection now that we have what we need
close(incon)

In [None]:
# here is how we can just get the sequences for the first 2 entries
fwd_lines[c(2,6)]
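
Indexing with `c(2,6)` works because a fastq entry is always 4 lines, with the sequence on the second line of each entry. As a self-contained sketch of that layout, here we write two made-up entries to a temporary gzipped file and read the sequences back. The read names and sequences are invented, and `gzfile()` is used here as a base R shortcut for opening gzipped text:

```r
# two made-up fastq entries: header, sequence, "+", and quality line for each
toy_fastq <- c("@read_1", "ACGTACGT", "+", "IIIIIIII",
               "@read_2", "TTGGCCAA", "+", "IIIIIIII")

# writing them to a temporary gzipped file
toy_file <- tempfile(fileext = ".fastq.gz")
out_con <- gzfile(toy_file, open = "w")
writeLines(toy_fastq, out_con)
close(out_con)

# reading the first 8 lines back in, then closing the connection
in_con <- gzfile(toy_file, open = "r")
toy_lines <- readLines(in_con, 8)
close(in_con)

# sequences sit on lines 2, 6, 10, ... so we step through in jumps of 4
toy_seqs <- toy_lines[seq(2, length(toy_lines), by = 4)]
toy_seqs

# cleaning up the temporary file
unlink(toy_file)
```

Using `seq(2, length(toy_lines), by = 4)` instead of typing out the positions scales to however many entries we read in.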

They each start exactly with the forward primer sequence right up front, which isn't always the case (the asterisks are over the degenerate bases):

```
                           *    *
forward primer:         GTGYCAGCMGCCGCGGTAA
forward read 1 start:   GTGCCAGCAGCCGCGGTAA
forward read 2 start:   GTGCCAGCCGCCGCGGTAA
```

Let's look at a couple reverse reads:

In [None]:
# building the path to the F10 reverse read file
rev_test_file <- paste0(raw_reads_dir, "/F10_R2_raw.fastq.gz")

In [None]:
rev_test_file

In [None]:
# establishing a connection
incon <- gzcon(file(rev_test_file, open = "rb"))

# storing the first 8 lines in a variable
rev_lines <- readLines(incon, 8)

# closing the connection now that we have what we need
close(incon)

In [None]:
# and looking at the first 2 sequences
rev_lines[c(2,6)]

They each start exactly with the reverse primer sequence right up front:

```
                                **    *
reverse primer:          GGACTACNVGGGTWTCTAAT
reverse read 1 start:    GGACTACTAGGGTTTCTAAT
reverse read 2 start:    GGACTACCCGGGTTTCTAAT
```
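
Checking by eye works for a couple of reads, but the same comparison can be sketched in code by turning a degenerate primer into a regular expression. The helper below, `iupac_to_regex()`, is a hypothetical function written just for this illustration (it is not part of dada2), and it only knows the IUPAC codes listed above:

```r
# hypothetical helper: converts a primer with IUPAC codes into a regular
# expression anchored to the start of the read
iupac_to_regex <- function(primer) {
    codes <- c(A = "A", C = "C", G = "G", T = "T",
               Y = "[CT]", M = "[AC]", N = "[ACGT]", V = "[ACG]", W = "[AT]")
    bases <- strsplit(primer, "")[[1]]
    # stopping early if the primer holds a code this sketch doesn't cover
    stopifnot(all(bases %in% names(codes)))
    paste0("^", paste(codes[bases], collapse = ""))
}

fwd_primer_regex <- iupac_to_regex("GTGYCAGCMGCCGCGGTAA")
rev_primer_regex <- iupac_to_regex("GGACTACNVGGGTWTCTAAT")

# the read starts shown above
toy_fwd_starts <- c("GTGCCAGCAGCCGCGGTAA", "GTGCCAGCCGCCGCGGTAA")
toy_rev_starts <- c("GGACTACTAGGGTTTCTAAT", "GGACTACCCGGGTTTCTAAT")

# TRUE means the read begins with something the primer could have produced
fwd_has_primer <- grepl(fwd_primer_regex, toy_fwd_starts)
rev_has_primer <- grepl(rev_primer_regex, toy_rev_starts)
fwd_has_primer
rev_has_primer
```

The same `grepl()` call could be run on many more reads than we would want to inspect by eye, for example to estimate what fraction of reads begin with the primer.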

### Doing a test trimming where we specify to cut these off
The forward primer is 19 bases, the reverse is 20. We can pass these values to the `trimLeft` argument of dada2's `filterAndTrim()` function:

In [None]:
filterAndTrim(fwd = fwd_test_file, 
              rev = rev_test_file, 
              filt = "test-F10_R1_filtered.fastq.gz",
              filt.rev = "test-F10_R2_filtered.fastq.gz", 
              trimLeft = c(19, 20))

<div class="alert alert-block alert-info">
<b>Code Breakdown</b>
<br>

- `filterAndTrim()` - primary function
    - `fwd = `      - where we provide the object holding all the forward read input file(s)
    - `rev = `      - where we provide the object holding all the reverse read input file(s)
    - `filt = `     - where we provide the object holding what will be the output forward read file(s)
    - `filt.rev = ` - where we provide the object holding what will be the output reverse read file(s)
    - `trimLeft = ` - how many bases we want to have trimmed off the left side of the reads (providing them as a vector like done above with two numbers means the first will be used for the forward reads and the second for the reverse reads)

</div>
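
Conceptually, `trimLeft` just drops a fixed number of bases from the start of each read. Here is a minimal sketch of that idea using base R's `substring()` on the beginnings of the reads we looked at above. This only mimics the trimming, it is not how `filterAndTrim()` is implemented, and the `toy_` names are just for illustration:

```r
# the starts of the two forward and two reverse reads (primer + next 10 bases)
toy_fwd_reads <- c("GTGCCAGCAGCCGCGGTAATACGGAGGAT", "GTGCCAGCCGCCGCGGTAATACGTAGGGG")
toy_rev_reads <- c("GGACTACTAGGGTTTCTAATCCTGTTTGAT", "GGACTACCCGGGTTTCTAATCCTTTTTGCT")

# dropping the first 19 bases of the forward reads means starting at base 20
toy_fwd_trimmed <- substring(toy_fwd_reads, 19 + 1)

# dropping the first 20 bases of the reverse reads means starting at base 21
toy_rev_trimmed <- substring(toy_rev_reads, 20 + 1)

toy_fwd_trimmed
toy_rev_trimmed
```

What is left is only the sequence that followed each primer, which is what we want to see at the start of the filtered files.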


In [None]:
list.files()

### Ensuring those settings successfully removed the primers
Now we are going to peek at the output trimmed files to make sure we cut off the primers, doing the same things we did above to read in part of the file and then just look at the first 2 sequences of the forward and reverse reads:

In [None]:
# establishing a connection and storing the first 8 lines into a variable in one line now
fwd_filt_lines <- gzcon(file("test-F10_R1_filtered.fastq.gz", open = "rb")) %>% readLines(8)
    # reminder that this is the same as writing things nested this way
# fwd_filt_lines <- readLines(gzcon(file("test-F10_R1_filtered.fastq.gz", open = "rb")), 8)

In [None]:
fwd_filt_lines[c(2,6)]

These previously started:

```
                            *    *
forward primer:          GTGYCAGCMGCCGCGGTAA
original fwd read 1:     GTGCCAGCAGCCGCGGTAA   TACGGAGGAT
original fwd read 2:     GTGCCAGCCGCCGCGGTAA   TACGTAGGGG
```

They each now begin right after the forward primer 👍

In [None]:
rev_filt_lines <- gzcon(file("test-F10_R2_filtered.fastq.gz", open = "rb")) %>% readLines(8)

In [None]:
rev_filt_lines[c(2,6)]

These previously started:

```
                                **    *
reverse primer:          GGACTACNVGGGTWTCTAAT
original rev read 1:     GGACTACTAGGGTTTCTAAT  CCTGTTTGAT
original rev read 2:     GGACTACCCGGGTTTCTAAT  CCTTTTTGCT
```

They each now begin right after the reverse primer 👍

So with that confirmation (or looking at some more samples if wanted), we can be fairly confident in using that `trimLeft` argument for all our samples to remove the primers (since these were all prepared and sequenced together the same way).

Now just removing those test output files so we know for sure we run everything the same way when we do all of them:

In [None]:
file.remove("test-F10_R1_filtered.fastq.gz", "test-F10_R2_filtered.fastq.gz")

In [None]:
list.files()

---

## Processing with dada2

### Quality trimming/filtering (including removing primers)

In [None]:
filtered_out <- filterAndTrim(fwd = forward_raw_files, 
                              rev = reverse_raw_files, 
                              filt = forward_filtered_files, 
                              filt.rev = reverse_filtered_files, 
                              trimLeft = c(19, 20), 
                              maxEE = c(1,1),
                              multithread = 6)

<div class="alert alert-block alert-info">
<b>Code Breakdown</b>
<br>

- `filterAndTrim()` - primary function
    - `fwd = `      - where we provide the object holding all the forward read input files
    - `rev = `      - where we provide the object holding all the reverse read input files
    - `filt = `     - where we provide the object holding what will be the output forward read files
    - `filt.rev = ` - where we provide the object holding what will be the output reverse read files
    - `trimLeft = ` - how many bases we want to have trimmed off the left side of the reads (providing them as a vector like this with two numbers means the first will be used for the forward reads and the second for the reverse reads)
    - `maxEE = `    - maximum "expected error" to allow for the forward and reverse reads (given as a two-number vector in the same way as `trimLeft`; you can read more about "expected error" [here](https://www.drive5.com/usearch/manual/exp_errs.html) and in its original publication [here](https://academic.oup.com/bioinformatics/article/31/21/3476/194979))
    - `multithread = ` - where we can specify how many jobs to run in parallel

</div>
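
To make "expected error" a little more concrete: it is the sum of the per-base error probabilities implied by a read's quality scores, where a Phred score Q corresponds to an error probability of 10^(-Q/10). The sketch below computes that for made-up quality strings. The helper `expected_errors()` is hypothetical (not a dada2 function), and it assumes the common Phred+33 encoding:

```r
# hypothetical helper: expected errors for one read's quality string,
# assuming Phred+33 encoding (quality = ASCII code minus 33)
expected_errors <- function(quality_string, offset = 33) {
    q <- utf8ToInt(quality_string) - offset
    sum(10^(-q / 10))
}

# ten bases all at Q40 ("I"): very few expected errors
ee_high_quality <- expected_errors("IIIIIIIIII")

# eight Q40 bases followed by two Q2 bases ("#"): the two poor bases dominate
ee_poor_tail <- expected_errors("IIIIIIII##")

ee_high_quality
ee_poor_tail

# with a threshold of 1, as used above, only the first read would be kept
c(ee_high_quality, ee_poor_tail) <= 1
```

This is why a read with a few very low-quality bases can be discarded even when most of its bases are excellent.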


And we can check our files are where we expect:

In [None]:
list.files()

In [None]:
list.files(trimmed_and_filtered_reads_dir)

**Now let's switch back to the [Setup and QC notebook](04-setup-QC.ipynb#Quality-assessment-of-filtered-reads) to run fastqc and multiqc on these files.**

### Generate error model of data

Now we are going to generate error models by learning the specific error-signatures of our dataset. Each sequencing run, even when all goes well, will have its own subtle variations to its error profile. dada2 tries to learn and incorporate this information when it later tries to infer the true, starting biological sequences. Here we are running the function that does this on both the forward and reverse reads.

In [None]:
err_forward_reads <- learnErrors(fls = forward_filtered_files, multithread = 6)
err_reverse_reads <- learnErrors(fls = reverse_filtered_files, multithread = 6)

<div class="alert alert-block alert-info">
<b>Code Breakdown</b>
<br>

- `learnErrors()` - primary function
    - `fls = `          - where we provide the object holding all the input read files
    - `multithread = `  - where we can specify how many jobs to run in parallel

</div>


### Inferring sequences

Here’s where dada2 gets to do what it was born to do, that is to do its best to infer true biological sequences. It does this by incorporating the error models it generated above, quality information for the reads, and abundances of each unique sequence, and then figuring out if each sequence is more likely to be of biological origin or more likely to have been introduced by a sequencing error. You can read more about the details of this in the [dada2 paper](https://www.nature.com/articles/nmeth.3869#methods) or by looking through their [site](https://benjjneb.github.io/dada2/index.html).

This step can be run on individual samples, which is the least computationally intensive manner, or on all samples together, which increases the function’s ability to resolve low-abundance ASVs. Imagine Sample A has 10,000 copies of sequence Z, and Sample B has 1 copy of sequence Z. If the samples are processed individually, sequence Z would likely be filtered out of Sample B, even though it was a “true” singleton among perhaps thousands of spurious singletons we needed to remove; processing the samples together lets the evidence from Sample A support keeping it. Because running all samples together on large datasets can become impractical computationally, the developers also added a way to try to combine the best of both worlds that they refer to as pseudo-pooling, which is explained very nicely [on this page](https://benjjneb.github.io/dada2/pseudo.html#Pseudo-pooling). We will be using that method here:

In [None]:
forward_seqs <- dada(derep = forward_filtered_files, err = err_forward_reads, pool = "pseudo", multithread = 6)
reverse_seqs <- dada(derep = reverse_filtered_files, err = err_reverse_reads, pool = "pseudo", multithread = 6)

<div class="alert alert-block alert-info">
<b>Code Breakdown</b>
<br>

- `dada()` - primary function
    - `derep = ` - where we provide the object holding all the input read files
    - `err = `   - where we provide the object created by the `learnErrors()` function we ran above
    - `pool = `  - where we tell it the method to use (if any) to try to pool information across samples, as explained on [this page](https://benjjneb.github.io/dada2/pseudo.html#Pseudo-pooling)
    - `multithread = `  - where we can specify how many jobs to run in parallel

</div>


### Merging forward and reverse reads

Now dada2 merges the forward and reverse ASVs to reconstruct our full target amplicons, by default requiring the overlapping region (at least 12 bases) to be identical between the two reads.

In [None]:
merged_amplicons <- mergePairs(dadaF = forward_seqs, derepF = forward_filtered_files, 
                               dadaR = reverse_seqs, derepR = reverse_filtered_files)

<div class="alert alert-block alert-info">
<b>Code Breakdown</b>
<br>

- `mergePairs()` - primary function
    - `dadaF = ` - where we provide the forward read object from the dada() function we ran above
    - `derepF = ` - where we provide the object holding all the input forward read files
    - `dadaR = ` - where we provide the reverse read object from the dada() function we ran above
    - `derepR = ` - where we provide the object holding all the input reverse read files

</div>
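
To get a feel for what merging involves, here is a toy sketch: the reverse read is reverse-complemented, and then we look for the longest stretch where the end of the forward read is identical to its start. This is only an illustration of the idea on made-up sequences. `rev_comp()` and `merge_pair()` are hypothetical helpers, and dada2's `mergePairs()` is more sophisticated than this (for example, it works on the denoised sequences and aligns them):

```r
# hypothetical helper: reverse complement of a DNA string (A/C/G/T only)
rev_comp <- function(seq) {
    comp <- chartr("ACGT", "TGCA", seq)
    paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# hypothetical helper: merges a read pair if they share an identical overlap
# of at least min_overlap bases, otherwise returns NA
merge_pair <- function(fwd, rev, min_overlap = 12) {
    rev_rc <- rev_comp(rev)
    max_ov <- min(nchar(fwd), nchar(rev_rc))
    if (max_ov < min_overlap) return(NA_character_)
    # trying the longest possible overlap first, then shorter ones
    for (ov in seq(max_ov, min_overlap)) {
        fwd_end <- substring(fwd, nchar(fwd) - ov + 1)
        rev_start <- substring(rev_rc, 1, ov)
        if (fwd_end == rev_start) {
            return(paste0(fwd, substring(rev_rc, ov + 1)))
        }
    }
    NA_character_
}

# a made-up 40-base "amplicon", read 28 bases in from each end
toy_amplicon <- "ACGTTGCAAGGCTTACCGATGGCATTCGAAGTCCATGACT"
toy_fwd <- substring(toy_amplicon, 1, 28)
toy_rev <- rev_comp(substring(toy_amplicon, 13, 40))

# the two reads overlap by 16 identical bases, so they merge back together
toy_merged <- merge_pair(toy_fwd, toy_rev)
toy_merged == toy_amplicon
```

If the overlapping bases disagree, this sketch returns `NA` rather than a merged sequence, which mirrors the idea that read pairs failing to merge are dropped.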


### Generating sequence table with counts per sample

Now we can generate a count table with the `makeSequenceTable()` function. This is one of the main outputs from processing an amplicon dataset. It is also often referred to as a BIOM table, or an OTU matrix.

In [None]:
seqtab <- makeSequenceTable(merged_amplicons)

This isn't very friendly to look at yet, because it uses the full sequences as column names, but we'll make a more traditional one where we change that in a few steps.

### Removing putative chimeras

Chimeras are technical artifacts made during PCR where different sequences merge together to form a new sequence, and this problem is extremely common during the generation of amplicon data. dada2 identifies likely chimeras by aligning each sequence with those that were recovered in greater abundance and then seeing if there are any lower-abundance sequences that can be made exactly by mixing left and right portions of two of the more-abundant ones. If so, these likely chimeric sequences are removed.

In [None]:
seqtab.nochim <- removeBimeraDenovo(unqs = seqtab, multithread = 6)

<div class="alert alert-block alert-info">
<b>Code Breakdown</b>
<br>

- `removeBimeraDenovo()` - primary function
    - `unqs = ` - where we provide the object we created with the `makeSequenceTable()` function above
    - `multithread = `  - where we can specify how many jobs to run in parallel

</div>


We can see how many unique sequences we had prior to chimera removal by looking at the number of columns in the object we made above with the `makeSequenceTable()` function:

In [None]:
ncol(seqtab)

And how many we had after removing likely chimeras:

In [None]:
ncol(seqtab.nochim)

In [None]:
ncol(seqtab.nochim) / ncol(seqtab) * 100

That says we dropped quite a bit in terms of number of unique sequences, and we're only retaining ~17% of the total unique sequences recovered. But this is not the same as the number of actual fragments sequenced, because many of them are seen more than once. Here's a way we can look at that value:

In [None]:
sum(seqtab.nochim) / sum(seqtab) * 100

Which tells us we retained ~96% of the initial sequences. This is a very common scenario with amplicon data, having many chimeric unique sequences recovered, but only making up a small portion of the total data sequenced.
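
The difference between those two percentages is easier to see on a small made-up table. In this sketch (all counts are invented), two abundant sequences are kept and three rare ones are treated as chimeras, so we lose most of the columns but very few of the reads:

```r
# a made-up sequence table: 2 samples (rows) by 5 unique sequences (columns)
toy_seqtab <- matrix(c(500, 400,
                       450, 550,
                        20,  10,
                        15,   5,
                        30,  20),
                     nrow = 2,
                     dimnames = list(c("sample_1", "sample_2"),
                                     paste0("seq_", 1:5)))

# pretending the three low-abundance sequences were flagged as chimeras
toy_seqtab_nochim <- toy_seqtab[, 1:2]

# percent of unique sequences retained
toy_perc_seqs <- ncol(toy_seqtab_nochim) / ncol(toy_seqtab) * 100

# percent of reads retained
toy_perc_reads <- sum(toy_seqtab_nochim) / sum(toy_seqtab) * 100

toy_perc_seqs
toy_perc_reads
```

Here only 40% of the unique sequences remain, but they account for 95% of the reads.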

### Generating an overview of counts throughout processing

It can be helpful to have a count of how many reads we had at each step along the way in one table. This can aid in finding any potentially problematic steps. The developers’ [DADA2 tutorial](https://benjjneb.github.io/dada2/tutorial.html) provides an example of a nice, quick way to pull out how many reads were dropped at various points of the pipeline. Here’s a slightly modified version adding in a final column of percent of reads retained from the start:

In [None]:
# making a helper function
getN <- function(x) sum(getUniques(x))

summary_tab <- data.frame(row.names = sample_names,
                          starting_read_pairs = filtered_out[, 1],
                          filtered_read_pairs = filtered_out[, 2],
                          fwd_ASVs = sapply(forward_seqs, getN),
                          rev_ASVs = sapply(reverse_seqs, getN),
                          merged_ASVs = sapply(merged_amplicons, getN),
                          non_chimeras = rowSums(seqtab.nochim),
                          final_perc_reads_retained = round(rowSums(seqtab.nochim) / filtered_out[, 1] * 100, 1)
                         )

<div class="alert alert-block alert-info">

This is a very busy code block and not that straightforward for where we are at. So we aren't going to break every component down on this one, but in the future, running each part piece-by-piece and looking at what each is doing would be good practice if wanting to understand it better.

</div>


And here is what our summary table looks like:

In [None]:
summary_tab

This shows we retained about 80% of our starting reads, with most of the loss happening at the initial filtering step.

### Assigning taxonomy
To assign taxonomy, we are going to use the [DECIPHER package](https://bioconductor.org/packages/release/bioc/html/DECIPHER.html). There are some DECIPHER-formatted databases available [here](http://www2.decipher.codes/Classification/TrainingSets/), which is where the one that we use below comes from.

In [None]:
# loading library used for taxonomy assignment
library(DECIPHER)

In [None]:
# creating the type of object needed
dna <- DNAStringSet(getSequences(seqtab.nochim))
    # this is pulling the sequences out of our seqtab.nochim object with the getSequences() function, 
    # and passing them to the DNAStringSet() function

In [None]:
# downloading reference
download.file(url = "http://www2.decipher.codes/Classification/TrainingSets/SILVA_SSU_r138_2019.RData", destfile = "SILVA_SSU_r138_2019.RData")

In [None]:
# loading reference into R objects
load("SILVA_SSU_r138_2019.RData")

In [None]:
# took about 60 seconds with subset dataset on local with 4 cpus
# classifying sequences
tax_info <- IdTaxa(test = dna, trainingSet = trainingSet, strand = "both", processors = 6)

<div class="alert alert-block alert-info">
<b>Code Breakdown</b>
<br>

- `IdTaxa()` - primary function
    - `test = ` - where we provide the dna object holding our sequences we want to classify
    - `trainingSet = ` - where we provide the object holding the reference information (it was loaded as "trainingSet" by the above `load()` function)
    - `strand = ` - specifying to check both forward and reverse strands
    - `processors = ` - where we can specify how many jobs to run in parallel

</div>


And we can peek at this object holding our classifications:

In [None]:
tax_info

In [None]:
# and removing the reference file as we don't need it anymore
unlink("SILVA_SSU_r138_2019.RData")

### Generating and writing standard outputs

The typical standard outputs from amplicon processing are: 1) a fasta file of our unique ASVs; 2) a count table showing how many times each unique ASV was detected in each sample; and 3) a taxonomy table linking our ASV IDs to their assigned taxonomy. Here is one way we can generate those files from our dada2 objects in R.

<div class="alert alert-block alert-info">

This code can get a little busy too, and it's a little beyond our current scope to dig into it all. So like above, we aren't going to break every component down, but running each part piece-by-piece and looking at what it's doing would be good practice in the future if wanting to understand it better.
</div>


**1. Making and writing out a fasta file of our recovered ASV sequences**

In [None]:
# giving sequences more manageable names (">ASV_1", ">ASV_2", ...)
asv_seqs <- colnames(seqtab.nochim)
asv_headers <- paste0(">ASV_", seq_len(ncol(seqtab.nochim)))

# making then writing out a fasta of final ASV sequences
asv_fasta <- c(rbind(asv_headers, asv_seqs))
write(asv_fasta, paste0(final_outputs_dir, "/ASVs.fasta"))

In [None]:
# we can look at the fasta object we made with the head() function
head(asv_fasta)
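
The `c(rbind(...))` step is what interleaves the headers and sequences into fasta order. A minimal sketch with made-up headers and sequences shows why it works: `rbind()` stacks the two vectors as rows of a matrix, and `c()` then reads that matrix column by column:

```r
# made-up headers and sequences, just for illustration
toy_headers <- c(">ASV_1", ">ASV_2")
toy_seqs <- c("ACGT", "GGCC")

# a matrix with headers in row 1 and sequences in row 2
toy_stacked <- rbind(toy_headers, toy_seqs)
toy_stacked

# flattening column by column gives header, sequence, header, sequence
toy_fasta <- c(toy_stacked)
toy_fasta
```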

**2. Making and writing out a count table of how many times each ASV was detected in each sample**

In [None]:
# making and writing out a count table
asv_starting_tab <- t(seqtab.nochim)
colnames(asv_starting_tab) <- sample_names

asv_ids <- sub(">", "", asv_headers)

count_tab <- data.frame("ASV_ID" = asv_ids, asv_starting_tab, check.names = FALSE, row.names = NULL)

write.table(count_tab, paste0(final_outputs_dir, "/ASV_counts.tsv"), sep = "\t", quote = FALSE, row.names = FALSE)

In [None]:
# we can peek at this table with the head() function
head(count_tab)

**3. Making and writing out a table of taxonomy**

In [None]:
# making and writing out a table of taxonomy, with any unclassified as "NA"
ranks <- c("domain", "phylum", "class", "order", "family", "genus", "species")

starting_tax_tab <- t(sapply(tax_info, function(x) {
    m <- match(ranks, x$rank)
    taxa <- x$taxon[m]
    taxa[startsWith(taxa, "unclassified_")] <- NA
    taxa
}))

colnames(starting_tax_tab) <- ranks
tax_tab <- data.frame("ASV_ID" = asv_ids, starting_tax_tab, row.names = NULL)

write.table(tax_tab, paste0(final_outputs_dir, "/ASV_taxonomy.tsv"), sep = "\t", quote = FALSE, row.names = FALSE)

In [None]:
# we can also peek at this table
head(tax_tab)
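
The function inside that `sapply()` call does the same thing to each ASV's classification, so it can help to walk through it once on a single made-up entry. The sketch below assumes each entry holds `rank` and `taxon` vectors, as the code above relies on. The classification itself is invented, and the extra `!is.na()` check is just a defensive addition for this example:

```r
toy_ranks <- c("domain", "phylum", "class", "order", "family", "genus", "species")

# a made-up classification that only reaches the class level
toy_classification <- list(
    rank  = c("rootrank", "domain", "phylum", "class"),
    taxon = c("Root", "Bacteria", "Proteobacteria", "unclassified_Proteobacteria")
)

# position of each wanted rank in this entry (NA where the rank is missing)
m <- match(toy_ranks, toy_classification$rank)

# pulling out the taxon at each of those positions
toy_taxa <- toy_classification$taxon[m]

# blanking out anything labeled as unclassified
toy_taxa[!is.na(toy_taxa) & startsWith(toy_taxa, "unclassified_")] <- NA

names(toy_taxa) <- toy_ranks
toy_taxa
```

Ranks that were never reached, or that were only reached as "unclassified", both end up as `NA`, which is why the final table has the same seven columns for every ASV.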

---

And that's it for baseline processing now that we have our standard goods. **Next we'll move on to the [analysis notebook](06-amplicon-analysis.ipynb).**


---
---

[**Previous:** 4. Setup and QC](04-setup-QC.ipynb)
<br>

<div style="text-align: right"><a href="06-amplicon-analysis.ipynb"><b>Next:</b> 6. Amplicon analysis</a></div>
