/
manuscript_supplemental.Rmd
559 lines (460 loc) · 29.1 KB
/
manuscript_supplemental.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
---
title: "Supplemental Material - _metagenomeFeatures_: An R package for working with 16S rRNA reference databases and marker-gene survey feature data."
author: "Nathan D. Olson and Nidhi Shah"
date: "`r Sys.Date()`"
output: bookdown::pdf_document2
bibliography: manuscript_supplemental.bib
---
# Background
16S rRNA amplicon sequencing is commonly used for microbial community characterization, including differential abundance and diversity analysis.
A limitation to 16S rRNA amplicon sequencing is a lack of taxonomic resolution, where organisms are only identifiable to the genus or family level.
We define taxonomic resolution as the ability to differentiate between groups within a taxonomic level, for example, differentiating between species within a genus.
We are interested in determining whether the 16S rRNA region of interest contains sufficient information for species-level taxonomic assignment.
Taxonomic resolution varies by clade and amplicon regions.
However, the extent to which taxonomic resolution varies is not well characterized.
There are multiple 16S rRNA databases each using their own file format and whose maintainers use different curation approaches resulting in different taxonomic and sequence compositions.
One can perform taxonomic or other 16S rRNA sequence analysis on an individual database but with `metagenomeFeatures` and `MgDb` annotation packages the same analysis can easily be performed on multiple databases, increasing the power of the analysis.
Here we demonstrate how `metagenomeFeatures` and the `MgDb` annotation packages can be used to characterize taxonomic resolution for the _Paenibacillus_ genus and V12 and V4 amplicon regions.
Originally, _Paenibacillus_ was classified under the _Bacillus_ genus, a novel genus was formed based on the 16S rRNA gene similarity in the 1990s.
_Paenibacillus_ spp. are facultative anaerobic bacteria present in a variety of environments including the soil, water, and can act as opportunistic pathogens in humans [@ouyang2008].
It has been shown that _Paenibacillus_ spp. will play an important role in sustainable agricultural industries [@grady2016].
Thus, an appropriate speciation of this genus is of an interest to the community.
We used V12 and V4 region as they represent two commonly used amplicons for 16S rRNA marker-gene surveys.
We will use the Greengenes 13.5 database, accessed using the `greengenes13.5MgDb`, Silva 128 database, accessed using the `silva128.1MgDb`, and RDP 11.5 database, accessed using the `ribosomaldatabaseproject11.5MgDb`, as annotation packages for our analysis of the _Paenibacillus_ genus.
First, we show the distribution of sequences belonging to _Paenibacillus_ in these three databases, and then a way to combine different subsets of database sequences to build a custom database object.
This helps investigators to pool sequences from different databases, to create an accurate and most representative database for their sample.
This is especially helpful when a certain clade of interest is not very well represented in the databases.
We then evaluate 16S rRNA amplicon sequencing taxonomic resolution for _Paenibacillus_ species for V12 and V4 regions.
In many analyses, the Greengenes 13.5 database is used for demonstration purposes but the other `MgDb` annotation packages such as RDP 11.5, SILVA 128, can also be used.
While the databases provide a consistent interface to the different 16S rRNA databases, the databases use different approaches to formatting the taxonomic names.
As a result, the majority of the code used to produce this document is tidying the data.
This code is not shown in the document but is available in the source Rmarkdown file which is provided as supplemental material and available in the metagenomeFeatures GitHub repository.[^1]
[^1]: https://github.com/HCBravoLab/metagenomeFeatures/blob/master/inst/manuscript/manuscript_supplemental.Rmd
# Required Packages
In addition to `metagenomeFeatures`, `greengenes13.5MgDb`, `ribosomaldatabaseproject11.5MgDb`, and `silva128.1MgDb`, the `DECIPHER`, `tidyverse`, and `ggpubr` packages are also used in the following analysis.
Our analysis uses the `DECIPHER` package to extract the amplicon regions, perform multiple sequence alignment, and generate a pairwise sequence distance matrix [@Wright2015-me].
The `tidyverse` and `ggpubr` packages will be used to reformat the taxonomic and distance matrix data and generate summary figures [@tidyverse @ggpubr].
See [Session Information Section](#sessionInfo) for package version information.
```{r message = FALSE, echo = TRUE}
library(tidyverse)
library(ggpubr)
library(UpSetR)
library(DECIPHER)
library(metagenomeFeatures)
library(greengenes13.5MgDb)
library(silva128.1MgDb)
library(ribosomaldatabaseproject11.5MgDb)
```
# 16S rRNA Database Comparison
We developed 16S rRNA MgDb annotation packages for three 16S rRNA databases, Greengenes, SILVA, and RDP.
RNAcentral, a meta-database for non-coding RNA sequences provides cross database unique identifiers for sequences in the databases.
The number of sequences in the databases and fraction of sequences with RNAcentralIDs varies for the three databases (Table \@ref(tab:dbSize)).
Using the RNAcentral IDs we can evaluate the overlap between the three databases (Fig. \@ref(fig:dbVenn)).
```{r dbSize, echo = FALSE}
get_rnacentral_entries <- function(mgdb){
if("rnacentral_ids" %in% taxa_columns(mgdb)){
rna_filt <- mgDb_taxa(mgdb) %>%
filter(!is.na(rnacentral_ids))
} else {
rna_filt <- mgDb_taxa(mgdb) %>%
filter(!is.na(RNAcentralID))
}
rna_filt
}
get_size_and_rnacentral <- function(mgdb){
db_size <- mgDb_taxa(mgdb) %>%
tally() %>% collect() %>% as.numeric()
db_wrna_id <- get_rnacentral_entries(mgdb) %>%
tally() %>% collect() %>% as.numeric()
## return size and size with rna central ids
data_frame(Size = db_size,
n_entries_rnaid = db_wrna_id)
}
db_list <- list(Greengenes = gg13.5MgDb,
SILVA = slv128.1MgDb,
RDP = rdp11.5MgDb)
db_info_df <- db_list %>%
map_dfr(get_size_and_rnacentral, .id = "Database")
db_info_df %>%
mutate(`Frac Ids` = (n_entries_rnaid)/Size) %>%
dplyr::select(-n_entries_rnaid) %>%
knitr::kable(digits = 2, format.args = list(big.mark = ","), booktabs = TRUE,
caption = "Number of sequences (size) and fraction of sequences with RNAcentral ids (frac ids) for the three 16S rRNA databases with MgDb annotation packages.")
```
```{r echo = FALSE, warning = FALSE}
get_rnacentral_ids <- function(mgdb){
if("rnacentral_ids" %in% taxa_columns(mgdb)){
rna_filt <- mgDb_taxa(mgdb) %>%
filter(!is.na(rnacentral_ids))
} else {
rna_filt <- mgDb_taxa(mgdb) %>%
filter(!is.na(RNAcentralID))
}
rna_filt %>%
select(contains("central")) %>%
collect() %>%
distinct() %>%
.[[1,]]
}
## This is really slow ...
rna_central_id_list <- db_list %>%
map(get_rnacentral_ids)
```
```{r dbVenn, echo = FALSE, fig.cap = "Overlap in sequence composition between the three 16S rRNA databases with MgDb annotation packages for sequences with RNAcentral IDs."}
upset(fromList(rna_central_id_list), order.by = "freq")
```
# _Paenibacillus_ Taxonomic Resolution
For the taxonomic resolution analysis, we will:
1. characterize the taxonomic composition of the _Paenibacillus_ genus in the three databases,
2. calculate pairwise distances between sequences for 30 species with the most sequences,
3. evaluate pairwise distances within and between species.
## Taxonomic Characterization
First obtain the taxonomic information for _Paenibacillus_ genus from the three databases using the `mgDb_select()` function.
```{r subsetDb, echo = TRUE}
paeni_16S_greengenes <- metagenomeFeatures::mgDb_select(gg13.5MgDb,
type = c("taxa"),
keys = "Paenibacillus",
keytype = "Genus")
paeni_16S_silva <- metagenomeFeatures::mgDb_select(slv128.1MgDb,
type = c("taxa"),
keys = "Paenibacillus",
keytype = "Genus")
paeni_16S_rdp <- metagenomeFeatures::mgDb_select(rdp11.5MgDb,
type = c("taxa"),
keys = "Paenibacillus",
keytype = "Genus")
```
```{r getCounts, echo = FALSE, warning = FALSE, message = FALSE}
## Per genus count data - greengenes ###########################################
tidy_gg_taxa <- paeni_16S_greengenes %>%
## cleaning up species names
mutate(Species = if_else(Species == "s__", "Unassigned", Species),
Species = str_replace(Species, "s__",""))
taxa_df_greengenes <- tidy_gg_taxa %>%
group_by(Species) %>%
summarise(Count = n()) %>%
ungroup() %>%
mutate(Species = fct_reorder(Species, Count))
## Count info for text
total_otus_greengenes <- sum(taxa_df_greengenes$Count)
unassigned_idx_greengenes <- taxa_df_greengenes$Species == "Unassigned"
no_species_assignment_greengenes <- taxa_df_greengenes$Count[unassigned_idx_greengenes]
## Per genus count data - SILVA ################################################
tidy_slv_taxa <- paeni_16S_silva %>% rowwise() %>%
mutate(new = str_split(Species, boundary("word")),Species = paste(new[1],new[2])) %>%
mutate(new = NULL) %>%
mutate(Species = if_else(!str_detect(Species, "Paenibacillus"), "Unassigned", Species)) %>%
mutate(Species = if_else(str_detect(Species, paste(c("Paenibacillus sp", "uncultured", "unidentified"), collapse = '|')), "Unassigned", Species)) %>%
mutate(Species = str_remove(Species, "Paenibacillus "))
taxa_df_silva <- tidy_slv_taxa %>%
group_by(Species) %>%
summarise(Count = n()) %>%
ungroup() %>%
mutate(Species = fct_reorder(Species, Count))
## Count info for text
total_otus_silva <- sum(taxa_df_silva$Count)
unassigned_idx_silva <- taxa_df_silva$Species == "Unassigned"
no_Species_assignment_silva <- taxa_df_silva$Count[unassigned_idx_silva]
## Per genus count data - RDP ##################################################
tidy_rdp_taxa <- paeni_16S_rdp %>% rowwise() %>%
dplyr::rename(Species = species, Genus = genus) %>%
mutate(new = str_split(Species, boundary("word")), Species = paste(new[1],new[2])) %>%
mutate(new = NULL) %>%
mutate(Species = if_else(!str_detect(Species, "Paenibacillus"), "Unassigned", Species)) %>%
mutate(Species = if_else(str_detect(Species, paste(c("Paenibacillus sp", "uncultured", "unidentified"), collapse = '|')), "Unassigned", Species)) %>%
mutate(Species = str_remove(Species, "Paenibacillus "))
taxa_df_rdp <- tidy_rdp_taxa %>%
group_by(Species) %>%
summarise(Count = n()) %>%
ungroup() %>%
mutate(Species = fct_reorder(Species, Count))
## Count info for text
total_otus_rdp <- sum(taxa_df_rdp$Count)
unassigned_idx_rdp <- taxa_df_rdp$Species == "Unassigned"
no_Species_assignment_rdp <- taxa_df_rdp$Count[unassigned_idx_rdp]
```
For _Paenibacillus_ genus the databases vary in the number of sequences for each species with some databases not having any sequences for some species (Fig. \@ref(fig:speciesCount)).
With a relatively small overlap between the databases, multiple database analysis provides additional power (Fig. \@ref(fig:dbVenn)).
The three database maintainers differ in their curation approaches, taxonomic name formatting, and update frequency.
There are benefits to using all three databases in a cross database analysis.
The Greengenes database taxonomic name is more rigidly formatted, only including the major taxonomic levels, e.g. does not include intermediate level such as suborder, and only goes to the species level.
As a result, species name parsing is unlikely to negatively impact the results.
SILVA and RDP are consistently updated while Greengenes has not been updated since 2013, and therefore it does not contain sequences that were not available in 2013. As a result, the SILVA and RDP databases are significantly more comprehensive than Greengenes.
Both, the SILVA and RDP taxonomic name formatting include intermediate taxonomic levels allowing for more fine grained taxonomic analysis.
Additionally, the SILVA database is the only one of the three to include eukaryotic 16S sequences as well as prokaryotic sequences.
For our analysis of the _Paenibacillus_, the and issue with all databases is that the sequences only classified to the genus level ("Unassigned" at species level) is the most abundant group, Greeengenes - `r no_species_assignment_greengenes`, SILVA - `r no_Species_assignment_silva`, and RDP - `r no_Species_assignment_rdp` (Fig. \@ref(fig:speciesCount)).
<!-- Maybe table or upset plot for Paenibacillus - Number of sequences and species for each database -->
```{r speciesCount, echo = FALSE, fig.cap = "Number of sequences in the genus _Paenibacillus_ assigned to the 10 most represented species as well as sequences not assigned at the species level for the Greengenes, SILVA, and RDP databases.", message = FALSE, warning = FALSE}
combined_taxa_df <- bind_rows(list(Greengenes = taxa_df_greengenes,
SILVA = taxa_df_silva,
RDP = taxa_df_rdp),
.id = "Database") %>%
group_by(Species) %>%
mutate(max_count = max(Count, na.rm = TRUE)) %>%
ungroup() %>%
top_n(30, max_count) %>%
mutate(Species = fct_reorder(Species, max_count))
combined_taxa_df %>%
group_by(Species) %>%
mutate(ymin = min(Count),
ymax = max(Count)) %>%
ggplot() +
geom_linerange(aes(x = Species, ymin = ymin + 1, ymax = ymax + 1)) +
geom_point(aes(x = Species, y = Count + 1, color = Database)) +
labs(y = "Number of sequences") +
scale_y_log10() +
coord_flip() +
theme_bw() +
theme(axis.text.y = element_text(face = "italic"),
legend.position = "bottom")
```
## Taxonomic Resolution
Next, we evaluate the 16S rRNA amplicon sequencing taxonomic resolution for _Paenibacillus_ Species by comparing within and between Species amplicon pairwise distance for the V12 and V4 regions.
To differentiate between Species, the pairwise distances for within-Species amplicon sequences must be less than the between Species distances.
Additionally, the difference in amplicon sequence pairwise distances between and within Species must be greater than the sequencing error rate to detect the difference.
### Generating Distance Matrix for 16S Amplicons
Here, we will demonstrate the process for extracting the target amplicon region and calculating pairwise distances using the Greengenes 13.5 database for simplicity.
See the source Rmarkdown file for code used to calculate pairwise distances for other databases and the V4 region as well as tidying the data for downstream analysis.
The V12 and V4 regions were extracted from the database sequences using pattern matching.
Some sequences do not contain both forward and reverse primers as not all the 16S rRNA sequences are full length.
Only sequences with both forward and reverse primers are included in the analysis.
The following PCR primers were used for our _in-silico_ PCR:
|Region |Direction |Primer |
|:------|:-------- |:----------------------------------|
|V12 |Forward | 27F - AGAGTTTGATCATGGCTCAG |
| |Reverse | 336R - CACTGCTGCSYCCCGTAGGAGTCT |
|V4 |Forward | 515F - GTGCCAGCMGCCGCGGTAA |
| |Reverse | 806R - GGACTACHVGGGTWTCTAAT |
```{r trimSeqs, echo = TRUE}
forward_primer <- "AGAGTTTGATCATGGCTCAG"
## reverse complementing reverse primer
reverse_primer <- DNAString("CACTGCTGCSYCCCGTAGGAGTCT") %>%
reverseComplement() %>%
as.character()
## Selecting Paenibacillus taxa and seq
paeni_16S_greengenes <- metagenomeFeatures::mgDb_select(gg13.5MgDb,
type = c("taxa", "seq"),
keys = "Paenibacillus",
keytype = "Genus")
## Finding sequences with forward primer
forward_match <- Biostrings::vmatchPattern(forward_primer,
subject = paeni_16S_greengenes$seq,
max.mismatch = 2) %>%
as.list() %>% map_dfr(as.data.frame,.id = "seq_id")
## Finding sequences with reverse primer
reverse_match <- Biostrings::vmatchPattern(reverse_primer,
subject = paeni_16S_greengenes$seq,
max.mismatch = 2,
fixed = FALSE) %>%
as.list() %>% map_dfr(as.data.frame,.id = "seq_id")
## sequences with both forward and reverse primers
seqs_to_use_ids <- intersect(forward_match$seq_id, reverse_match$seq_id)
seqs_to_use <- names(paeni_16S_greengenes$seq) %in% seqs_to_use_ids
## Trimming sequences with both primers
paeni_V12 <- TrimDNA(paeni_16S_greengenes$seq[seqs_to_use],
leftPatterns = forward_primer,
rightPatterns = reverse_primer,
type = "both")
## Excluding seqs with length 0
paeni_V12_seqs <- paeni_V12[[2]][width(paeni_V12[[2]]) != 0]
```
For more accurate distance estimates, a multiple sequence alignment is used to calculate pairwise distances.
We will use the `AlignSeqs` function in the `DECIPHER` package to generate the multiple sequence alignment.
```{r align, echo = TRUE}
v12_align <- AlignSeqs(paeni_V12[[2]], verbose = FALSE)
```
The resulting alignment can be viewed using the `BrowseSeqs` function in the `DECIPHER` package.
```{r eval = FALSE, echo = TRUE}
BrowseSeqs(v12_align)
```
Next a pairwise distance matrix is generated using the `DistanceMatrix` function in the `DECIPHER` package for taxonomic resolution analysis and converting distance matrix to a data frame for analysis.
```{r v12Anno, echo = TRUE}
v12_dist <- DistanceMatrix(v12_align,
correction = "none",
verbose = FALSE,
includeTerminalGaps = FALSE)
```
### Multiple Database Comparison
Now that we have our pairwise distance matrix, we can evaluate the taxonomic resolution of the genus _Paenibacillus_ by comparing the within and between species distances.
To reduce the complexity of our analysis, we are only going to look at the pairwise distances between sequences assigned to the 10 species with the most sequences (Fig. \@ref(fig:speciesCount)).
With more sequences we will have better estimates for within species pairwise distances but is likely to result in the exclusion of closely related species.
We first compare the distribution of within and between species pairwise distances for the genus.
Then look at the pairwise distances within species _P. amylolyticus_ compared distances between _P. amylolyticus_ and the other nine most represented _Paenibacillus_ species.
```{r distDFfuns, results= "hide", echo = FALSE, message = FALSE}
get_amp <- function(paeni_16S, forward_primer, reverse_primer){
## Finding sequeces with forward primer
forward_match <- Biostrings::vmatchPattern(forward_primer,
subject = paeni_16S,
max.mismatch = 2) %>%
as.list() %>% map_dfr(as.data.frame,.id = "seq_id")
## Finding sequences with reverse primer
reverse_match <- Biostrings::vmatchPattern(reverse_primer,
subject = paeni_16S,
max.mismatch = 2,
fixed = FALSE) %>%
as.list() %>% map_dfr(as.data.frame,.id = "seq_id")
## sequences with both forward and reverse primers
seqs_to_use_ids <- intersect(forward_match$seq_id, reverse_match$seq_id)
seqs_to_use <- names(paeni_16S) %in% seqs_to_use_ids
## Trimming sequences with both primers
paeni_amp <- TrimDNA(paeni_16S[seqs_to_use],
leftPatterns = forward_primer,
rightPatterns = reverse_primer,
type = "both")
## Excluding seqs with lenght 0
paeni_amp <- paeni_amp[[2]][width(paeni_amp[[2]]) != 0]
## Returning target seqs
paeni_amp
}
get_dist <- function(amp_seqs){
aligned_seqs <- AlignSeqs(amp_seqs, verbose = FALSE)
amp_dist <- DistanceMatrix(aligned_seqs,
correction = "none",
verbose = FALSE,
includeTerminalGaps = FALSE)
Key_ord <- data.frame(Keys = rownames(amp_dist),
Keys_ord = 1:nrow(amp_dist))
Key2_ord <- data.frame(Keys2 = rownames(amp_dist),
Keys2_ord = 1:nrow(amp_dist))
## Creating a data frame for exploratory analysis
amp_dist_df <- amp_dist %>%
as.data.frame() %>%
rownames_to_column(var = "Keys") %>%
gather("Keys2","distance", -Keys) %>%
left_join(Key_ord) %>%
left_join(Key2_ord) %>%
filter(Keys_ord < Keys2_ord) %>%
select(-Keys_ord, -Keys2_ord)
}
get_tidy_dist_df <- function(mgdb, tax_df, forward_primer, reverse_primer){
## Get seqs and taxa
paeni_16S <- metagenomeFeatures::mgDb_select(mgdb,
type = "seq",
keys = tax_df$Keys,
keytype = "Keys")
## Subset seqs
amp_seqs <- get_amp(paeni_16S, forward_primer, reverse_primer)
## Get distance DF
amp_dist_df <- get_dist(amp_seqs)
## get distance df
amp_dist_anno_df <- amp_dist_df %>%
left_join(tax_df) %>%
left_join(tax_df,by = c("Keys2" = "Keys")) %>%
dplyr::rename(Keys_Species = Species.x,
Keys2_Species = Species.y) %>%
mutate(group_comp = if_else(Keys_Species == Keys2_Species,
"within","between"))
## Return tidy dataframe
amp_dist_anno_df
}
```
```{r getDistDF, echo = FALSE, message = FALSE, results = "hide", warning = FALSE}
### GG
gg_for_seq_comp <- tidy_gg_taxa %>%
filter(!is.na(rnacentral_ids),
Species != "Unassigned",
Species %in% combined_taxa_df$Species)
### SILVA
slv_for_seq_comp <- tidy_slv_taxa %>%
filter(!is.na(RNAcentralID),
Species != "Unassigned",
Species %in% combined_taxa_df$Species)
### RDP
rdp_for_seq_comp <- tidy_rdp_taxa %>%
filter(!is.na(RNAcentralID),
Species != "Unassigned",
Species %in% combined_taxa_df$Species)
## V12 Amplicon
forward_primer <- "AGAGTTTGATCATGGCTCAG"
## reverse complementing reverse primer
reverse_primer <- DNAString("CACTGCTGCSYCCCGTAGGAGTCT") %>%
reverseComplement() %>%
as.character()
taxa_list <- list(Greengenes = gg_for_seq_comp,
SILVA = slv_for_seq_comp,
RDP = rdp_for_seq_comp)
multidb_dist_V12_df <- map2_dfr(db_list, taxa_list,
get_tidy_dist_df,
forward_primer,
reverse_primer,
.id = "Database")
## V4 Amplicon
forward_primer <- "GTGCCAGCMGCCGCGGTAA"
## reverse complementing reverse primer
reverse_primer <- "ATTAGAWACCCBDGTAGTCC"
multidb_dist_V4_df <- map2_dfr(db_list, taxa_list,
get_tidy_dist_df,
forward_primer,
reverse_primer,
.id = "Database")
## Combine into a single dataframe
multidb_dist_df <- bind_rows(
list(V12 = multidb_dist_V12_df,
V4 = multidb_dist_V4_df),
.id = "amplicon")
## Excluding outlier sequence "329842" - mean pairwise distance to all other
## sequences is 0.2
multidb_dist_df <- filter(multidb_dist_df, Keys != "329842",Keys2 != "329842")
```
#### Genus Level Comparison
Pairwise distance is smaller within than between species, indicating that the V12 and V4 regions are potentially suitable for classifying members of the _Paenibacillus_ genus to the Species level (Fig. \@ref(fig:pairDist)).
The median and interquartile pairwise distance range for within and between species comparisons in consistent across the three databases further supporting our observation.
A large number of outliers, pairwise distances past the boxplot whiskers, were observed for all three databases.
Outliers below the between species boxplots and above the within species are problematic for species level classification.
The sequences and taxonomic assignment of these outliers should be evaluated to ensure they are not errors in the database, a result of incorrect database parsing, or are examples where phylogeny and taxonomy are inconsistent.
Overall, the V12 region had greater pairwise distances than V4 for both within and between Species.
It is important also to consider that we only included sequences with species assignments and the species with the most sequences in the database.
Species-level information for the unclassified sequences and inclusion of less well-represented species might yield results that are inconsistent with our analysis.
Additionally, our analysis does not identify the pairwise sequence distance required to classify a sequence as a novel _Paenibacillus_ Species.
```{r pairDist, fig.cap = "Distribution of within and between Species pairwise distances for the V12 and V4 16S rRNA region. Sequences not classified to the Species level were excluded from the analysis.", echo = FALSE}
multidb_dist_df %>%
ggplot(aes(x = Database, y = distance, fill = group_comp)) +
geom_boxplot() +
facet_grid(.~amplicon) +
labs(x = "Species Comparison", fill = "Species Comparison") +
theme_bw() +
theme(legend.position = "bottom")
```
#### Species level comparison
While the overall pairwise distance is greater between than within species for the _Paenibacillus_ genus, it is important to understand how within and between species pairwise distances compare for individual Species.
The within species pairwise for _P. amylolyticus_ distances are similar to between species distances for _P. pabuli_ and _P. illnoisensis_ (Fig. \@ref(fig:speciesComp)).
When including outliers the within pairwise distances for _P. amylolyticus_ is in the range of between pairwise distances for other _P. spp._.
Similar to the genus level comparison investigation of the outliers sequences and strains is required to evaluate the taxonomic resolution for _Paenibacillus_.
```{r speciesComp, fig.cap = "Pairwise distances within the _Paenibacillus amylolyticus_ species and between _P. amylolyticus_ and the 9 other most represented species.", echo = FALSE}
multidb_dist_df %>%
mutate(Species_Comp = paste0(Keys_Species, Keys2_Species)) %>%
filter(str_detect(Species_Comp, "amylolyticus")) %>%
mutate(Species_Comp = str_remove(Species_Comp, "amylolyticus")) %>%
mutate(Species_Comp = fct_reorder(Species_Comp, distance),
Species_Comp = fct_relevel(Species_Comp, "amylolyticus", after = 0)) %>%
ggplot() +
geom_boxplot(aes(x = Species_Comp,
y = distance,
fill = Database)) +
facet_wrap(~amplicon, ncol = 1) +
theme_bw() +
theme(axis.text.x = element_text(angle = -45, hjust = 0),
legend.position = "bottom") +
labs(x = "Species compared to amylolyticus", y = "Pairwise Distance")
```
# Conclusion
The `metagenomeFeatures` package and associated 16S rRNA database packages `greengenes13.5MgDb`, `silva128.1MgDb`, `ribosomaldatabaseproject11.5MgDb`, provide a consistent interface for working with the different databases.
Furthermore, through the use of the RNAcentralIDs we are able to evaluate database overlap and perform cross database analysis.
We demonstrate how the _metagenomeFeatures_ package in conjunction with one of the associated 16S rRNA database packages, and other R packages, can be used to evaluate whether species-level taxonomic classification is possible for a specific amplicon region.
The approach used here can easily be extended to evaluate taxonomic groups (changing filtering parameters), or amplicon regions (changing primer sequences).
********************************************************************************
# Session Information {#sessionInfo}
## System Information
```{r}
sessioninfo::platform_info()
```
## Package Versions
```{r}
sessioninfo::package_info() %>%
filter(attached == TRUE) %>%
select(package, loadedversion, source) %>%
knitr::kable(booktabs = TRUE)
```
# References
<!-- __Potential reference__ https://www.biorxiv.org/content/early/2018/03/25/288654?%3Fcollection= -->
<!-- Sequencing of the 16S ribosomal RNA (rRNA) gene and the fungal Internal Transcribed Spacer (ITS) region is widely used to survey microbial communities. Specialized ribosomal sequence databases have been developed to support this approach including Greengenes, SILVA and RDP. Most taxonomy annotations in these databases are predictions from sequence rather than authoritative assignments based on studies of type strains or isolates. Here, I investigate the error rates of taxonomy annotations in these databases. I found 253,485 sequences with conflicting annotations in SILVA v128 and Greengenes v13.5 at ranks up to phylum (9,644 conflicts), indicating that the annotation error rate in these databases is ~15%. I found that 34% of non-singleton genera have overlapping subtrees in the Greengenes tree from 2001 according to the RDP taxonomy, most of which are probably due to branching order errors in the Greengenes tree, which is therefore an unreliable guide to phylogeny. Using a blinded test, I estimated that the annotation error rate of the RDP database is ~10%. -->