In [1]:
library(dada2)
library(tidyverse)

Loading required package: Rcpp

── Attaching packages ── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0

## Checking on settings to remove primers

Here we can try trimming the primers with dada2's quality trimming/filtering function, `filterAndTrim()`. To make sure the primers are actually removed, we first run one sample and compare its sequences before and after trimming. These are the primers for this dataset, along with the IUPAC degenerate-base codes they use:

```
f primer: GTGYCAGCMGCCGCGGTAA
r primer: GGACTACNVGGGTWTCTAAT

Y = C/T  
M = A/C  
N = A/T/G/C  
V = A/C/G  
W = A/T  
```
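
As a rough sketch of what "matching a degenerate primer" means, the codes above can be expanded into a regular expression in base R. The helper name `primer_to_regex` and the two test strings are assumptions made for illustration; they are not part of dada2.

```r
# Map each base or IUPAC code used by these primers to a regex character class
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           Y = "[CT]", M = "[AC]", N = "[ACGT]", V = "[ACG]", W = "[AT]")

# Hypothetical helper: turn a degenerate primer into a regex anchored at the read start
primer_to_regex <- function(primer) {
    bases <- strsplit(primer, "")[[1]]
    paste0("^", paste(iupac[bases], collapse = ""))
}

fwd_regex <- primer_to_regex("GTGYCAGCMGCCGCGGTAA")
rev_regex <- primer_to_regex("GGACTACNVGGGTWTCTAAT")

# Illustrative read starts: the first begins with the forward primer, the second does not
fwd_hits <- grepl(fwd_regex, c("GTGCCAGCAGCCGCGGTAATACGGAGGAT", "TACGGAGGAT"))
fwd_hits
```

Anchoring with `^` matters here: we only care whether the primer sits at the very start of the read, which is the assumption behind trimming a fixed number of bases.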

In [2]:
setwd("~/Documents/temp/GLDS-249/testing")

### Ensuring we can spot the primers

In [11]:
incon <- gzcon(file("F10_R1_raw.fastq.gz", open = "rb"))

In [4]:
# this reads in the first 8 lines, with each set of 4 lines holding one fastq entry
stuff <- readLines(incon, 8)

In [6]:
# here is how we can just get the sequences for the first 2 entries
stuff[c(2,6)]

Each read starts with a sequence that matches the forward primer, with the degenerate positions (Y, M) resolved to specific bases:

```
forward primer:         GTGYCAGCMGCCGCGGTAA
forward read 1 start:   GTGCCAGCAGCCGCGGTAA
forward read 2 start:   GTGCCAGCCGCCGCGGTAA
```

In [14]:
stuff_rev <- readLines(gzcon(file("F10_R2_raw.fastq.gz", open = "rb")), 8)

In [15]:
stuff_rev[c(2,6)]

Each read starts with a sequence that matches the reverse primer, with the degenerate positions (N, V, W) resolved to specific bases:

```
reverse primer:          GGACTACNVGGGTWTCTAAT
reverse read 1 start:    GGACTACTAGGGTTTCTAAT
reverse read 2 start:    GGACTACCCGGGTTTCTAAT
```
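
The `c(2, 6)` indexing used above relies on the FASTQ layout of four lines per record, with the sequence on the second line of each record. A minimal sketch of generalizing that to any number of records, using a made-up two-record file written to a temporary path (the toy reads and the use of `gzfile()` rather than `gzcon(file())` are choices made for this example only):

```r
# Toy two-record FASTQ (made-up data): header, sequence, "+", qualities
toy_fastq <- c("@read1", "GTGCCAGCAGCCGCGGTAATACG", "+", strrep("I", 23),
               "@read2", "GTGCCAGCCGCCGCGGTAATACG", "+", strrep("I", 23))

# Write it gzip-compressed to a temporary file
toy_path <- tempfile(fileext = ".fastq.gz")
con <- gzfile(toy_path, open = "w")
writeLines(toy_fastq, con)
close(con)

# Read the first 8 lines back, closing the connection when done
con <- gzfile(toy_path, open = "r")
toy_lines <- readLines(con, n = 8)
close(con)

# Sequence lines sit at positions 2, 6, 10, ... of the lines read
seq_lines <- toy_lines[seq(2, length(toy_lines), by = 4)]
seq_lines
```

Closing each connection explicitly avoids R's warnings about unused open connections when peeking at several files in a row.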

### Doing a test trimming where we specify to cut these off
The forward primer is 19 bases, the reverse is 20. We can pass these values to the `trimLeft` argument of dada2's `filterAndTrim()` function:

In [16]:
filterAndTrim(fwd = "F10_R1_raw.fastq.gz", 
              rev = "F10_R2_raw.fastq.gz", 
              filt = "F10_R1_filtered.fastq.gz",
              filt.rev = "F10_R2_filtered.fastq.gz", 
              trimLeft = c(19, 20))

### Ensuring those settings successfully removed the primers
Now we peek at the trimmed output files to make sure the primers were cut off. As above, we read in the first 8 lines of each file and look at the first 2 sequences of the forward and reverse reads:

In [18]:
filt_stuff <- readLines(gzcon(file("F10_R1_filtered.fastq.gz", open = "rb")), 8)

In [19]:
filt_stuff[c(2,6)]

These previously started:

```
forward primer:          GTGYCAGCMGCCGCGGTAA
original fwd read 1:     GTGCCAGCAGCCGCGGTAA   TACGGAGGAT
original fwd read 2:     GTGCCAGCCGCCGCGGTAA   TACGTAGGGG
```

They each now begin right after the forward primer 👍

In [21]:
filt_stuff_rev <- readLines(gzcon(file("F10_R2_filtered.fastq.gz", open = "rb")), 8)

In [22]:
filt_stuff_rev[c(2,6)]

These previously started:

```
reverse primer:          GGACTACNVGGGTWTCTAAT
original rev read 1:     GGACTACTAGGGTTTCTAAT  CCTGTTTGAT
original rev read 2:     GGACTACCCGGGTTTCTAAT  CCTTTTTGCT
```

They each now begin right after the reverse primer 👍

So with that confirmation, I'm confident in using that `trimLeft` argument for all our samples to remove the primers (since these were all prepared and sequenced together the same way).
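
The before/after comparison can also be sketched in base R: dropping the first 19 (forward) or 20 (reverse) bases from the original read starts listed above should leave exactly the bases that follow each primer. This only mimics the fixed-length trimming part of `trimLeft`; it does not reproduce any of the quality filtering that `filterAndTrim()` also performs.

```r
# Original read starts copied from the comparisons above (primer + next 10 bases)
fwd_raw_starts <- c("GTGCCAGCAGCCGCGGTAATACGGAGGAT", "GTGCCAGCCGCCGCGGTAATACGTAGGGG")
rev_raw_starts <- c("GGACTACTAGGGTTTCTAATCCTGTTTGAT", "GGACTACCCGGGTTTCTAATCCTTTTTGCT")

# Primer lengths, counted from the primer sequences themselves
fwd_primer_len <- nchar("GTGYCAGCMGCCGCGGTAA")
rev_primer_len <- nchar("GGACTACNVGGGTWTCTAAT")

# Removing the first N bases means keeping everything from position N + 1 onward
fwd_trimmed <- substring(fwd_raw_starts, fwd_primer_len + 1)
rev_trimmed <- substring(rev_raw_starts, rev_primer_len + 1)

fwd_trimmed
rev_trimmed
```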

## Setting up some variables

In [34]:
# making an object that holds all forward read starting files
# (pattern is a regular expression, not a shell glob)
forward_raw_files <- list.files(pattern = "_R1_raw\\.fastq\\.gz$")

# making an object that holds all reverse read starting files
reverse_raw_files <- list.files(pattern = "_R2_raw\\.fastq\\.gz$")

In [35]:
forward_raw_files

In [36]:
reverse_raw_files

In [37]:
# getting an object just holding unique sample names
sample_names <- gsub(x = forward_raw_files, pattern = "_.*", replacement = "")

In [38]:
sample_names

In [41]:
# making an object holding what will be the output trimmed/filtered forward files
forward_filtered_files <- paste0(sample_names, "_R1_filtered.fastq.gz")

# making an object holding what will be the output trimmed/filtered reverse files
reverse_filtered_files <- paste0(sample_names, "_R2_filtered.fastq.gz")

In [42]:
forward_filtered_files

In [44]:
reverse_filtered_files
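
A small sketch of how the sample-name extraction behaves, plus a sanity check that forward and reverse files pair up in the same order. Only `F10` appears in this notebook; the `F11` file names are hypothetical and added just to have more than one sample.

```r
# File names following this notebook's naming pattern (F11 is hypothetical)
forward_example <- c("F10_R1_raw.fastq.gz", "F11_R1_raw.fastq.gz")
reverse_example <- c("F10_R2_raw.fastq.gz", "F11_R2_raw.fastq.gz")

# Everything from the first underscore onward is dropped
fwd_names <- sub("_.*", "", forward_example)
rev_names <- sub("_.*", "", reverse_example)

# Forward and reverse files should describe the same samples in the same order
stopifnot(identical(fwd_names, rev_names))

# Output names are then rebuilt from the sample names
filtered_example <- paste0(fwd_names, "_R1_filtered.fastq.gz")
filtered_example
```

One caveat with this pattern: a sample name that itself contains an underscore would be cut at its first underscore, so it is worth eyeballing `sample_names` as done above.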

## Quality trimming/filtering (including removing primers)

In [45]:
filtered_out <- filterAndTrim(fwd = forward_raw_files, 
                              rev = reverse_raw_files, 
                              filt = forward_filtered_files, 
                              filt.rev = reverse_filtered_files, 
                              trimLeft = c(19, 20), 
                              maxEE = c(2,2))
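
For intuition about `maxEE = c(2,2)`: the dada2 documentation defines a read's expected errors as `EE = sum(10^(-Q/10))` over its quality scores, and reads above the threshold are discarded. Below is a base-R sketch of that formula for Phred+33 quality strings; the helper name `expected_errors` and both quality strings are made up for illustration.

```r
# Hypothetical helper: expected errors for one Phred+33-encoded quality string
expected_errors <- function(qual_string) {
    q <- utf8ToInt(qual_string) - 33L   # convert ASCII characters to Phred scores
    sum(10^(-q / 10))                   # sum of per-base error probabilities
}

# "I" encodes Q40 and "+" encodes Q10 in Phred+33
high_quality  <- strrep("I", 100)                            # Q40 at every base
mixed_quality <- paste0(strrep("I", 70), strrep("+", 30))    # last 30 bases at Q10

ee_high  <- expected_errors(high_quality)
ee_mixed <- expected_errors(mixed_quality)

# A read passes a maxEE of 2 only if its expected errors do not exceed 2
c(high = ee_high <= 2, mixed = ee_mixed <= 2)
```

Because the threshold is on a sum, a long stretch of mediocre bases can fail a read even when no single base looks terrible.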

**Switch back to amplicon-QC.ipynb in order to run fastqc/multiqc and look at these**

## Generate error model of data

In [46]:
err_forward_reads <- learnErrors(forward_filtered_files, multithread = 4)
err_reverse_reads <- learnErrors(reverse_filtered_files, multithread = 4)

96596166 total bases in 721228 reads from 10 samples will be used for learning the error rates.
95882808 total bases in 721228 reads from 10 samples will be used for learning the error rates.


### Inferring sequences

In [47]:
forward_seqs <- dada(forward_filtered_files, err = err_forward_reads, pool = "pseudo", multithread = 4)
reverse_seqs <- dada(reverse_filtered_files, err = err_reverse_reads, pool = "pseudo", multithread = 4)

Sample 1 - 72133 reads in 15758 unique sequences.
Sample 2 - 76179 reads in 15353 unique sequences.
Sample 3 - 76458 reads in 15540 unique sequences.
Sample 4 - 72963 reads in 14768 unique sequences.
Sample 5 - 67127 reads in 14067 unique sequences.
Sample 6 - 64008 reads in 12385 unique sequences.
Sample 7 - 75799 reads in 16263 unique sequences.
Sample 8 - 78005 reads in 16752 unique sequences.
Sample 9 - 76973 reads in 14572 unique sequences.
Sample 10 - 61583 reads in 12296 unique sequences.

   selfConsist step 2
Sample 1 - 72133 reads in 16623 unique sequences.
Sample 2 - 76179 reads in 15967 unique sequences.
Sample 3 - 76458 reads in 16094 unique sequences.
Sample 4 - 72963 reads in 15731 unique sequences.
Sample 5 - 67127 reads in 14718 unique sequences.
Sample 6 - 64008 reads in 12618 unique sequences.
Sample 7 - 75799 reads in 16688 unique sequences.
Sample 8 - 78005 reads in 17051 unique sequences.
Sample 9 - 76973 reads in 15213 unique sequences.
Sample 10 - 61583 reads in 

### Merging forward and reverse reads

In [50]:
merged_amplicons <- mergePairs(dadaF = forward_seqs, derepF = forward_filtered_files, 
                               dadaR = reverse_seqs, derepR = reverse_filtered_files)

### Generating sequence table with counts per sample

In [67]:
seqtab <- makeSequenceTable(merged_amplicons)

### Removing putative chimeras

In [70]:
seqtab.nochim <- removeBimeraDenovo(seqtab, multithread = 4)

In [78]:
dim(seqtab)

In [79]:
dim(seqtab.nochim)

In [71]:
sum(seqtab.nochim) / sum(seqtab) * 100

By read count, we retained 94% of the merged sequences after chimera removal.
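
The `dim()` and `sum()` calls above answer two different questions: how many distinct sequences were dropped versus how many reads they accounted for. A toy illustration with made-up counts (the matrices below are not from this dataset, and which column counts as chimeric is simply assumed):

```r
# Toy count table (made-up numbers): rows are samples, columns are sequences
toy_seqtab <- matrix(c(90,  5, 5,
                       80, 15, 5),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("S1", "S2"), c("seqA", "seqB", "seqC")))

# Pretend seqC was flagged as chimeric and removed
toy_nochim <- toy_seqtab[, c("seqA", "seqB"), drop = FALSE]

# Percent of reads kept (what sum(...) / sum(...) * 100 measures)
pct_reads_kept <- sum(toy_nochim) / sum(toy_seqtab) * 100

# Percent of distinct sequences kept (what comparing the dim() outputs shows)
pct_seqs_kept <- ncol(toy_nochim) / ncol(toy_seqtab) * 100

c(reads = pct_reads_kept, sequences = pct_seqs_kept)
```

In this toy case a third of the distinct sequences is lost but only 5% of the reads, which is the usual pattern when chimeras are numerous but low in abundance.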

### Assigning taxonomy

In [62]:
# loading library used for taxonomy assignment
library(DECIPHER)


In [72]:
# creating the type of object needed
dna <- DNAStringSet(getSequences(seqtab.nochim))

In [None]:
# downloading reference
download.file(url = "http://www2.decipher.codes/Classification/TrainingSets/SILVA_SSU_r138_2019.RData", destfile = "SILVA_SSU_r138_2019.RData")

In [64]:
# loading reference
load("SILVA_SSU_r138_2019.RData")

In [65]:
# classifying sequences
tax_info <- IdTaxa(dna, trainingSet = trainingSet, strand = "both", processors = 4)


Time difference of 60.31 secs



In [66]:
tax_info

  A test set of class 'Taxa' with length 106
      confidence taxon
  [1]       100% Root; Bacteria; Bacteroidota; Bacteroidia; Bacteroidales; Ta...
  [2]        94% Root; Bacteria; Firmicutes; Bacilli; Erysipelotrichales; Ery...
  [3]        68% Root; Bacteria; Firmicutes; Clostridia; Oscillospirales; Rum...
  [4]        99% Root; Bacteria; Firmicutes; Clostridia; Lachnospirales; Lach...
  [5]        63% Root; Bacteria; Firmicutes; Clostridia; Lachnospirales; Lach...
  ...        ... ...
[102]        97% Root; Bacteria; Proteobacteria; Alphaproteobacteria; Rickett...
[103]       100% Root; Bacteria; Fusobacteriota; Fusobacteriia; Fusobacterial...
[104]       100% Root; Bacteria; Proteobacteria; Gammaproteobacteria; Pseudom...
[105]        63% Root; Bacteria; Firmicutes; Bacilli; Paenibacillales; Paenib...
[106]        77% Root; Bacteria; Proteobacteria; Alphaproteobacteria; Rhizobi...

### Generating and writing standard outputs

In [73]:
# giving sequences more manageable names
asv_seqs <- colnames(seqtab.nochim)
asv_headers <- paste0(">ASV_", seq_along(asv_seqs))

In [74]:
# making then writing out a fasta of final ASV sequences
asv_fasta <- c(rbind(asv_headers, asv_seqs))
write(asv_fasta, "ASVs.fasta")
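
The `c(rbind(...))` idiom above interleaves headers and sequences: `rbind()` stacks them as two rows of a matrix, and `c()` then flattens that matrix column by column. A tiny sketch with made-up sequences:

```r
# Made-up headers and sequences, just to show the interleaving
toy_headers <- c(">ASV_1", ">ASV_2")
toy_seqs    <- c("ACGT", "TTGA")

# Row 1 holds headers, row 2 holds sequences; flattening goes down each column
toy_fasta <- c(rbind(toy_headers, toy_seqs))
toy_fasta
```

Each header therefore lands directly before its own sequence, which is the line order a FASTA file needs.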

In [75]:
# making and writing out a count table
asv_tab <- t(seqtab.nochim)
row.names(asv_tab) <- sub(">", "", asv_headers)

write.table(asv_tab, "ASV_counts.tsv", sep = "\t", quote = FALSE, col.names = NA)

In [76]:
# making and writing out a table of taxonomy, with any unclassified as "NA"
ranks <- c("domain", "phylum", "class", "order", "family", "genus", "species")

tax_tab <- t(sapply(tax_info, function(x) {
    m <- match(ranks, x$rank)
    taxa <- x$taxon[m]
    taxa[startsWith(taxa, "unclassified_")] <- NA
    taxa
}))

colnames(tax_tab) <- ranks
rownames(tax_tab) <- gsub(pattern = ">", replacement = "", x = asv_headers)

write.table(tax_tab, "ASV_taxonomy.tsv", sep = "\t", quote = FALSE, col.names = NA)
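
To see what the `sapply()` call above does to each classification, here is a sketch run on a mock list. The mock only imitates the `taxon` and `rank` fields used in that code; its contents, including the `rootrank` label, are assumptions for illustration and not real `IdTaxa()` output.

```r
ranks <- c("domain", "phylum", "class", "order", "family", "genus", "species")

# Mock of two classification results (made-up values)
mock_tax_info <- list(
    list(taxon = c("Root", "Bacteria", "Firmicutes", "Clostridia", "Lachnospirales",
                   "Lachnospiraceae", "unclassified_Lachnospiraceae"),
         rank  = c("rootrank", "domain", "phylum", "class", "order", "family", "genus")),
    list(taxon = c("Root", "Bacteria", "unclassified_Bacteria"),
         rank  = c("rootrank", "domain", "phylum"))
)

mock_tab <- t(sapply(mock_tax_info, function(x) {
    m <- match(ranks, x$rank)                       # position of each wanted rank, NA if absent
    taxa <- x$taxon[m]                              # absent ranks come back as NA
    taxa[startsWith(taxa, "unclassified_")] <- NA   # blank out "unclassified_" placeholders
    taxa
}))
colnames(mock_tab) <- ranks
mock_tab
```

Both routes to `NA` show up here: ranks missing from a result, and ranks present only as an `unclassified_` placeholder.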

In [77]:
tax_tab

| | domain | phylum | class | order | family | genus | species |
|---|---|---|---|---|---|---|---|
| ASV_1 | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | Tannerellaceae | Parabacteroides | NA |
| ASV_2 | Bacteria | Firmicutes | Bacilli | Erysipelotrichales | Erysipelatoclostridiaceae | Erysipelatoclostridium | NA |
| ASV_3 | Bacteria | Firmicutes | Clostridia | Oscillospirales | Ruminococcaceae | Ruminococcus | NA |
| ASV_4 | Bacteria | Firmicutes | Clostridia | Lachnospirales | Lachnospiraceae | NA | NA |
| ASV_5 | Bacteria | Firmicutes | Clostridia | Lachnospirales | Lachnospiraceae | NA | NA |
| ASV_6 | Bacteria | Firmicutes | Clostridia | Lachnospirales | Lachnospiraceae | NA | NA |
| ASV_7 | Bacteria | Firmicutes | Clostridia | Lachnospirales | Lachnospiraceae | NA | NA |
| ASV_8 | Bacteria | Firmicutes | Clostridia | Lachnospirales | Lachnospiraceae | NA | NA |
| ASV_9 | Bacteria | Firmicutes | Clostridia | Lachnospirales | Lachnospiraceae | NA | NA |
| ASV_10 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Lactobacillaceae | Lactobacillus | NA |
