inst/manuscript/manuscript_supplemental.Rmd

---
title: "Supplemental Material - _metagenomeFeatures_: An R package for working with 16S rRNA reference databases and marker-gene survey feature data."
author: "Nathan D. Olson and Nidhi Shah"
date: "`r Sys.Date()`"
output: bookdown::pdf_document2
bibliography: manuscript_supplemental.bib
---

# Background
16S rRNA amplicon sequencing is commonly used for microbial community characterization, including differential abundance and diversity analysis.
A limitation to 16S rRNA amplicon sequencing is a lack of taxonomic resolution, where organisms are only identifiable to the genus or family level.
We define taxonomic resolution as the ability to differentiate between groups within a taxonomic level, for example, differentiating between species within a genus.
We are interested in determining whether the 16S rRNA region of interest contains sufficient information for species-level taxonomic assignment.
Taxonomic resolution varies by clade and amplicon regions.
However, the extent to which taxonomic resolution varies is not well characterized.

There are multiple 16S rRNA databases each using their own file format and whose maintainers use different curation approaches resulting in different taxonomic and sequence compositions. 
One can perform taxonomic or other 16S rRNA sequence analysis on an individual database but with `metagenomeFeatures` and `MgDb` annotation packages the same analysis can easily be performed on multiple databases, increasing the power of the analysis.
Here we demonstrate how `metagenomeFeatures` and the `MgDb` annotation packages can be used to characterize taxonomic resolution for the _Paenibacillus_ genus and V12 and V4 amplicon regions.
Originally, _Paenibacillus_ was classified under the _Bacillus_ genus, a novel genus was formed based on the 16S rRNA gene similarity in the 1990s.
_Paenibacillus_ spp. are facultative anaerobic bacteria present in a variety of environments including the soil, water, and can act as opportunistic pathogens in humans [@ouyang2008].
It has been shown that _Paenibacillus_ spp. will play an important role in sustainable agricultural industries [@grady2016].
Thus, an appropriate speciation of this genus is of an interest to the community.

We used V12 and V4 region as they represent two commonly used amplicons for 16S rRNA marker-gene surveys.
We will use the Greengenes 13.5 database, accessed using the `greengenes13.5MgDb`, Silva 128 database, accessed using the `silva128.1MgDb`, and RDP 11.5 database, accessed using the `ribosomaldatabaseproject11.5MgDb`, as annotation packages for our analysis of the _Paenibacillus_ genus.
First, we show the distribution of sequences belonging to _Paenibacillus_ in these three databases, and then a way to combine different subsets of database sequences to build a custom database object.
This helps investigators to pool sequences from different databases, to create an accurate and most representative database for their sample.
This is especially helpful when a certain clade of interest is not very well represented in the databases.
We then evaluate 16S rRNA amplicon sequencing taxonomic resolution for _Paenibacillus_ species for V12 and V4 regions.
In many analyses, the Greengenes 13.5 database is used for demonstration purposes but the other `MgDb` annotation packages such as RDP 11.5, SILVA 128, can also be used.

While the databases provide a consistent interface to the different 16S rRNA databases, the databases use different approaches to formatting the taxonomic names.
As a result, the majority of the code used to produce this document is tidying the data.
This code is not shown in the document but is available in the source Rmarkdown file which is provided as supplemental material and available in the metagenomeFeatures GitHub repository.[^1] 

[^1]: https://github.com/HCBravoLab/metagenomeFeatures/blob/master/inst/manuscript/manuscript_supplemental.Rmd

# Required Packages
In addition to `metagenomeFeatures`, `greengenes13.5MgDb`, `ribosomaldatabaseproject11.5MgDb`, and `silva128.1MgDb`, the `DECIPHER`, `tidyverse`, and `ggpubr` packages are also used in the following analysis.
Our analysis uses the `DECIPHER` package to extract the amplicon regions, perform multiple sequence alignment, and generate a pairwise sequence distance matrix [@Wright2015-me].
The `tidyverse` and `ggpubr` packages will be used to reformat the taxonomic and distance matrix data and generate summary figures [@tidyverse @ggpubr]. 
See [Session Information Section](#sessionInfo) for package version information. 

```{r message = FALSE, echo = TRUE}
library(tidyverse)
library(ggpubr)
library(UpSetR)
library(DECIPHER)
library(metagenomeFeatures)
library(greengenes13.5MgDb)
library(silva128.1MgDb)
library(ribosomaldatabaseproject11.5MgDb)
```

# 16S rRNA Database Comparison
We developed 16S rRNA MgDb annotation packages for three 16S rRNA databases, Greengenes, SILVA, and RDP. 
RNAcentral, a meta-database for non-coding RNA sequences provides cross database unique identifiers for sequences in the databases. 
The number of sequences in the databases and fraction of sequences with RNAcentralIDs varies for the three databases (Table \@ref(tab:dbSize)). 
Using the RNAcentral IDs we can evaluate the overlap between the three databases (Fig. \@ref(fig:dbVenn)). 

```{r dbSize, echo = FALSE}
get_rnacentral_entries <- function(mgdb){
   if("rnacentral_ids" %in% taxa_columns(mgdb)){
    rna_filt <- mgDb_taxa(mgdb) %>%
      filter(!is.na(rnacentral_ids))
  } else {
    rna_filt <- mgDb_taxa(mgdb) %>%
      filter(!is.na(RNAcentralID))
  }
  rna_filt
}

get_size_and_rnacentral <- function(mgdb){
  db_size <- mgDb_taxa(mgdb) %>%
    tally() %>% collect() %>% as.numeric()

  db_wrna_id <- get_rnacentral_entries(mgdb) %>%
    tally() %>% collect() %>% as.numeric()

  ## return size and size with rna central ids
  data_frame(Size = db_size,
             n_entries_rnaid = db_wrna_id)
}

db_list <- list(Greengenes = gg13.5MgDb,
                SILVA = slv128.1MgDb,
                RDP = rdp11.5MgDb)

db_info_df <- db_list %>%
  map_dfr(get_size_and_rnacentral, .id = "Database")
db_info_df %>% 
  mutate(`Frac Ids` = (n_entries_rnaid)/Size) %>% 
  dplyr::select(-n_entries_rnaid) %>% 
  knitr::kable(digits = 2, format.args = list(big.mark = ","), booktabs = TRUE,
               caption = "Number of sequences (size) and fraction of sequences with RNAcentral ids (frac ids) for the three 16S rRNA databases with MgDb annotation packages.")
```

```{r echo = FALSE, warning = FALSE}
get_rnacentral_ids <- function(mgdb){
   if("rnacentral_ids" %in% taxa_columns(mgdb)){
    rna_filt <- mgDb_taxa(mgdb) %>%
      filter(!is.na(rnacentral_ids))
  } else {
    rna_filt <- mgDb_taxa(mgdb) %>%
      filter(!is.na(RNAcentralID))
  }
  rna_filt %>%
    select(contains("central")) %>%
    collect() %>%
    distinct() %>%
    .[[1,]]
}

## This is really slow ...
rna_central_id_list <- db_list %>%
  map(get_rnacentral_ids)
```


```{r dbVenn, echo = FALSE, fig.cap = "Overlap in sequence composition between the three 16S rRNA databases with  MgDb annotation packages for sequences with RNAcentral IDs."}
upset(fromList(rna_central_id_list), order.by = "freq")
```

# _Paenibacillus_ Taxonomic Resolution
For the taxonomic resolution analysis, we will:
1. characterize the taxonomic composition of the _Paenibacillus_ genus in the three databases,
2. calculate pairwise distances between sequences for 30 species with the most sequences,
3. evaluate pairwise distances within and between species.


## Taxonomic Characterization
First obtain the taxonomic information for _Paenibacillus_ genus from the three databases using the `mgDb_select()` function.

```{r subsetDb, echo = TRUE}
paeni_16S_greengenes <- metagenomeFeatures::mgDb_select(gg13.5MgDb,
                          type = c("taxa"),
                          keys = "Paenibacillus",
                          keytype = "Genus")
paeni_16S_silva <- metagenomeFeatures::mgDb_select(slv128.1MgDb,
                          type = c("taxa"),
                          keys = "Paenibacillus",
                          keytype = "Genus")
paeni_16S_rdp <- metagenomeFeatures::mgDb_select(rdp11.5MgDb,
                          type = c("taxa"),
                          keys = "Paenibacillus",
                          keytype = "Genus")
```

```{r getCounts, echo = FALSE, warning = FALSE, message = FALSE}
## Per genus count data - greengenes ###########################################
tidy_gg_taxa <- paeni_16S_greengenes %>%
    ## cleaning up species names
    mutate(Species = if_else(Species == "s__", "Unassigned", Species),
           Species = str_replace(Species, "s__",""))

taxa_df_greengenes <- tidy_gg_taxa %>%
    group_by(Species) %>%
    summarise(Count = n()) %>%
    ungroup() %>%
    mutate(Species = fct_reorder(Species, Count))

## Count info for text
total_otus_greengenes <- sum(taxa_df_greengenes$Count)
unassigned_idx_greengenes <- taxa_df_greengenes$Species == "Unassigned"
no_species_assignment_greengenes <- taxa_df_greengenes$Count[unassigned_idx_greengenes]

## Per genus count data - SILVA ################################################
tidy_slv_taxa <- paeni_16S_silva %>% rowwise() %>%
  mutate(new = str_split(Species, boundary("word")),Species = paste(new[1],new[2])) %>%
  mutate(new = NULL) %>%
  mutate(Species = if_else(!str_detect(Species, "Paenibacillus"), "Unassigned", Species)) %>%
  mutate(Species = if_else(str_detect(Species, paste(c("Paenibacillus sp", "uncultured", "unidentified"), collapse = '|')), "Unassigned", Species)) %>%
  mutate(Species = str_remove(Species, "Paenibacillus "))

taxa_df_silva <- tidy_slv_taxa %>%
  group_by(Species) %>%
  summarise(Count = n()) %>%
  ungroup() %>%
  mutate(Species = fct_reorder(Species, Count))

## Count info for text
total_otus_silva <- sum(taxa_df_silva$Count)
unassigned_idx_silva <- taxa_df_silva$Species == "Unassigned"
no_Species_assignment_silva <- taxa_df_silva$Count[unassigned_idx_silva]

## Per genus count data - RDP ##################################################
tidy_rdp_taxa <- paeni_16S_rdp %>% rowwise() %>%
  dplyr::rename(Species = species, Genus = genus) %>%
  mutate(new = str_split(Species, boundary("word")), Species = paste(new[1],new[2])) %>%
  mutate(new = NULL) %>%
  mutate(Species = if_else(!str_detect(Species, "Paenibacillus"), "Unassigned", Species)) %>%
  mutate(Species = if_else(str_detect(Species, paste(c("Paenibacillus sp", "uncultured", "unidentified"), collapse = '|')), "Unassigned", Species)) %>%
  mutate(Species = str_remove(Species, "Paenibacillus "))

taxa_df_rdp <- tidy_rdp_taxa %>%
  group_by(Species) %>%
  summarise(Count = n()) %>%
  ungroup() %>%
  mutate(Species = fct_reorder(Species, Count))

## Count info for text
total_otus_rdp <- sum(taxa_df_rdp$Count)
unassigned_idx_rdp <- taxa_df_rdp$Species == "Unassigned"
no_Species_assignment_rdp <- taxa_df_rdp$Count[unassigned_idx_rdp]
```

 
For _Paenibacillus_ genus the databases vary in the number of sequences for each species with some databases not having any sequences for some species (Fig. \@ref(fig:speciesCount)). 
With a relatively small overlap between the databases, multiple database analysis provides additional power  (Fig. \@ref(fig:dbVenn)). 
The three database maintainers differ in their curation approaches, taxonomic name formatting, and update frequency. 
There are benefits to using all three databases in a cross database analysis. 
The Greengenes database taxonomic name is more rigidly formatted, only including the major taxonomic levels, e.g. does not include intermediate level such as suborder, and only goes to the species level. 
As a result, species name parsing is unlikely to negatively impact the results. 
SILVA and RDP are consistently updated while Greengenes has not been updated since 2013, and therefore it does not contain sequences that were not available in 2013. As a result, the SILVA and RDP databases are significantly more comprehensive than Greengenes.
Both, the SILVA and RDP taxonomic name formatting include intermediate taxonomic levels allowing for more fine grained taxonomic analysis. 
Additionally, the SILVA database is the only one of the three to include eukaryotic 16S sequences as well as prokaryotic sequences.
For our analysis of the  _Paenibacillus_, the and issue with all databases is that the sequences only classified to the genus level ("Unassigned" at species level) is the most abundant group, Greeengenes - `r no_species_assignment_greengenes`, SILVA - `r no_Species_assignment_silva`, and RDP - `r no_Species_assignment_rdp` (Fig. \@ref(fig:speciesCount)).  

<!--  Maybe table or upset plot for Paenibacillus - Number of sequences and species for each database   -->


```{r speciesCount, echo = FALSE, fig.cap = "Number of sequences in the genus _Paenibacillus_ assigned to the 10 most represented species as well as sequences not assigned at the species level for the Greengenes, SILVA, and RDP databases.", message = FALSE, warning = FALSE}
combined_taxa_df <- bind_rows(list(Greengenes = taxa_df_greengenes,
                                   SILVA = taxa_df_silva,
                                   RDP = taxa_df_rdp),
                              .id = "Database") %>%
  group_by(Species) %>%
  mutate(max_count = max(Count, na.rm = TRUE)) %>%
  ungroup() %>%
  top_n(30, max_count) %>%
  mutate(Species = fct_reorder(Species, max_count))

combined_taxa_df %>%
  group_by(Species) %>%
  mutate(ymin = min(Count),
         ymax = max(Count)) %>%
ggplot() +
    geom_linerange(aes(x = Species, ymin = ymin + 1, ymax = ymax + 1)) +
    geom_point(aes(x = Species, y = Count + 1, color = Database)) +
    labs(y = "Number of sequences") +
  scale_y_log10() +
    coord_flip() +
    theme_bw() +
    theme(axis.text.y = element_text(face = "italic"),
          legend.position = "bottom")
```


## Taxonomic Resolution
Next, we evaluate the 16S rRNA amplicon sequencing taxonomic resolution for _Paenibacillus_ Species by comparing within and between Species amplicon pairwise distance for the V12 and V4 regions.
To differentiate between Species, the pairwise distances for within-Species amplicon sequences must be less than the between Species distances.
Additionally, the difference in amplicon sequence pairwise distances between and within Species must be greater than the sequencing error rate to detect the difference.

### Generating Distance Matrix for 16S Amplicons
Here, we will demonstrate the process for extracting the target amplicon region and calculating pairwise distances using the Greengenes 13.5 database for simplicity.
See the source Rmarkdown file for code used to calculate pairwise distances for other databases and the V4 region as well as tidying the data for downstream analysis.

The V12 and V4 regions were extracted from the database sequences using pattern matching.
Some sequences do not contain both forward and reverse primers as not all the  16S rRNA sequences are full length.
Only sequences with both forward and reverse primers are included in the analysis.

The following PCR primers were used for our _in-silico_ PCR:

|Region |Direction |Primer                             |
|:------|:-------- |:----------------------------------|
|V12    |Forward   | 27F - AGAGTTTGATCATGGCTCAG        |
|       |Reverse   | 336R - CACTGCTGCSYCCCGTAGGAGTCT   |
|V4     |Forward   | 515F - GTGCCAGCMGCCGCGGTAA        |
|       |Reverse   | 806R - GGACTACHVGGGTWTCTAAT       |

```{r trimSeqs, echo = TRUE}
forward_primer <- "AGAGTTTGATCATGGCTCAG"
## reverse complementing reverse primer
reverse_primer <- DNAString("CACTGCTGCSYCCCGTAGGAGTCT") %>%
    reverseComplement() %>%
    as.character()

## Selecting Paenibacillus taxa and seq
paeni_16S_greengenes <- metagenomeFeatures::mgDb_select(gg13.5MgDb,
                          type = c("taxa", "seq"),
                          keys = "Paenibacillus",
                          keytype = "Genus")

## Finding sequences with forward primer
forward_match <- Biostrings::vmatchPattern(forward_primer,
                                           subject = paeni_16S_greengenes$seq,
                                           max.mismatch = 2) %>%
    as.list() %>% map_dfr(as.data.frame,.id = "seq_id")

## Finding sequences with reverse primer
reverse_match <- Biostrings::vmatchPattern(reverse_primer,
                                           subject = paeni_16S_greengenes$seq,
                                           max.mismatch = 2,
                                           fixed = FALSE) %>%
    as.list() %>% map_dfr(as.data.frame,.id = "seq_id")

## sequences with both forward and reverse primers
seqs_to_use_ids <- intersect(forward_match$seq_id, reverse_match$seq_id)
seqs_to_use <- names(paeni_16S_greengenes$seq) %in% seqs_to_use_ids

## Trimming sequences with both primers
paeni_V12 <- TrimDNA(paeni_16S_greengenes$seq[seqs_to_use],
                     leftPatterns = forward_primer,
                     rightPatterns = reverse_primer,
                     type = "both")

## Excluding seqs with length 0
paeni_V12_seqs <- paeni_V12[[2]][width(paeni_V12[[2]]) != 0]
```

For more accurate distance estimates, a multiple sequence alignment is used to calculate pairwise distances.
We will use the `AlignSeqs` function in the `DECIPHER` package to generate the multiple sequence alignment.

```{r align, echo = TRUE}
v12_align <- AlignSeqs(paeni_V12[[2]], verbose = FALSE)
```

The resulting alignment can be viewed using the `BrowseSeqs` function in the `DECIPHER` package.

```{r eval = FALSE, echo = TRUE}
BrowseSeqs(v12_align)
```

Next a pairwise distance matrix is generated using the `DistanceMatrix` function in the `DECIPHER` package for taxonomic resolution analysis and converting distance matrix to a data frame for analysis.

```{r v12Anno, echo = TRUE}
v12_dist <- DistanceMatrix(v12_align,
                           correction = "none",
                           verbose = FALSE,
                           includeTerminalGaps = FALSE)
```

### Multiple Database Comparison
Now that we have our pairwise distance matrix, we can evaluate the taxonomic resolution of the genus _Paenibacillus_ by comparing the within and between species distances.
To reduce the complexity of our analysis, we are only going to look at the pairwise distances between sequences assigned to the 10 species with the most sequences (Fig. \@ref(fig:speciesCount)).
With more sequences we will have better estimates for within species pairwise distances but is likely to result in the exclusion of closely related species.
We first compare the distribution of within and between species pairwise distances for the genus.
Then look at the pairwise distances within species _P. amylolyticus_ compared distances between _P. amylolyticus_ and the other nine most represented _Paenibacillus_ species.


```{r distDFfuns, results= "hide", echo = FALSE, message = FALSE}
get_amp <- function(paeni_16S, forward_primer, reverse_primer){

    ## Finding sequeces with forward primer
    forward_match <- Biostrings::vmatchPattern(forward_primer,
                                               subject = paeni_16S,
                                               max.mismatch = 2) %>%
        as.list() %>% map_dfr(as.data.frame,.id = "seq_id")

    ## Finding sequences with reverse primer
    reverse_match <- Biostrings::vmatchPattern(reverse_primer,
                                               subject = paeni_16S,
                                               max.mismatch = 2,
                                               fixed = FALSE) %>%
        as.list() %>% map_dfr(as.data.frame,.id = "seq_id")

    ## sequences with both forward and reverse primers
    seqs_to_use_ids <- intersect(forward_match$seq_id, reverse_match$seq_id)
    seqs_to_use <- names(paeni_16S) %in% seqs_to_use_ids

    ## Trimming sequences with both primers
    paeni_amp <- TrimDNA(paeni_16S[seqs_to_use],
                         leftPatterns = forward_primer,
                         rightPatterns = reverse_primer,
                         type = "both")

    ## Excluding seqs with lenght 0
    paeni_amp <- paeni_amp[[2]][width(paeni_amp[[2]]) != 0]

    ## Returning target seqs
    paeni_amp
}

get_dist <- function(amp_seqs){
  aligned_seqs <- AlignSeqs(amp_seqs, verbose = FALSE)
  amp_dist <- DistanceMatrix(aligned_seqs,
                            correction = "none",
                            verbose = FALSE,
                            includeTerminalGaps = FALSE)

  Key_ord <- data.frame(Keys = rownames(amp_dist),
                        Keys_ord = 1:nrow(amp_dist))
  Key2_ord <- data.frame(Keys2 = rownames(amp_dist),
                         Keys2_ord = 1:nrow(amp_dist))

  ## Creating a data frame for exploratory analysis
  amp_dist_df <- amp_dist %>%
      as.data.frame() %>%
      rownames_to_column(var = "Keys") %>%
      gather("Keys2","distance", -Keys) %>%
      left_join(Key_ord) %>%
      left_join(Key2_ord) %>%
      filter(Keys_ord < Keys2_ord) %>%
      select(-Keys_ord, -Keys2_ord)
}

get_tidy_dist_df <- function(mgdb, tax_df, forward_primer, reverse_primer){
    ## Get seqs and taxa
    paeni_16S <- metagenomeFeatures::mgDb_select(mgdb,
                              type = "seq",
                              keys = tax_df$Keys,
                              keytype = "Keys")

    ## Subset seqs
    amp_seqs <- get_amp(paeni_16S, forward_primer, reverse_primer)

    ## Get distance DF
    amp_dist_df <- get_dist(amp_seqs)

    ## get distance df
    amp_dist_anno_df <- amp_dist_df %>%
      left_join(tax_df) %>%
      left_join(tax_df,by = c("Keys2" = "Keys")) %>%
      dplyr::rename(Keys_Species = Species.x,
                    Keys2_Species = Species.y) %>%
      mutate(group_comp = if_else(Keys_Species == Keys2_Species,
                                  "within","between"))

    ## Return tidy dataframe
    amp_dist_anno_df
}
```


```{r getDistDF, echo = FALSE, message = FALSE, results = "hide", warning = FALSE}
### GG
gg_for_seq_comp <- tidy_gg_taxa %>% 
  filter(!is.na(rnacentral_ids), 
         Species != "Unassigned", 
         Species %in% combined_taxa_df$Species)

### SILVA
slv_for_seq_comp <- tidy_slv_taxa %>% 
  filter(!is.na(RNAcentralID), 
         Species != "Unassigned", 
         Species %in% combined_taxa_df$Species)

### RDP
rdp_for_seq_comp <- tidy_rdp_taxa %>% 
  filter(!is.na(RNAcentralID), 
         Species != "Unassigned", 
         Species %in% combined_taxa_df$Species)

## V12 Amplicon
forward_primer <- "AGAGTTTGATCATGGCTCAG"
## reverse complementing reverse primer
reverse_primer <- DNAString("CACTGCTGCSYCCCGTAGGAGTCT") %>%
    reverseComplement() %>%
    as.character()

taxa_list <- list(Greengenes = gg_for_seq_comp,
                  SILVA = slv_for_seq_comp,
                  RDP = rdp_for_seq_comp)

multidb_dist_V12_df <- map2_dfr(db_list, taxa_list,
                                get_tidy_dist_df,
                                forward_primer,
                                reverse_primer,
                                .id = "Database")

## V4 Amplicon
forward_primer <- "GTGCCAGCMGCCGCGGTAA"
## reverse complementing reverse primer
reverse_primer <- "ATTAGAWACCCBDGTAGTCC"

multidb_dist_V4_df <- map2_dfr(db_list, taxa_list,
                                get_tidy_dist_df,
                                forward_primer,
                                reverse_primer,
                                .id = "Database")

## Combine into a single dataframe
multidb_dist_df <- bind_rows(
  list(V12 = multidb_dist_V12_df,
       V4  = multidb_dist_V4_df),
  .id = "amplicon")
## Excluding outlier sequence "329842" - mean pairwise distance to all other
## sequences is 0.2
multidb_dist_df <- filter(multidb_dist_df, Keys != "329842",Keys2 != "329842")
```


#### Genus Level Comparison

Pairwise distance is smaller within than between species, indicating that the V12 and V4 regions are potentially suitable for classifying members of the _Paenibacillus_ genus to the Species level (Fig. \@ref(fig:pairDist)). 
The median and interquartile pairwise distance range for within and between species comparisons in consistent across the three databases further supporting our observation. 
A large number of outliers, pairwise distances past the boxplot whiskers, were observed for all three databases. 
Outliers below the between species boxplots and above the within species are problematic for species level classification. 
The sequences and taxonomic assignment of these outliers should be evaluated to ensure they are not errors in the database, a result of incorrect database parsing, or are examples where phylogeny and taxonomy are inconsistent. 


Overall, the V12 region had greater pairwise distances than V4 for both within and between Species.
It is important also to consider that we only included sequences with species assignments and the species with the most sequences in the database.
Species-level information for the unclassified sequences and inclusion of less well-represented species might yield results that are inconsistent with our analysis.
Additionally, our analysis does not identify the pairwise sequence distance required to classify a sequence as a novel _Paenibacillus_ Species.

```{r pairDist, fig.cap = "Distribution of within and between Species pairwise distances for the V12 and V4 16S rRNA region. Sequences not classified to the Species level were excluded from the analysis.", echo = FALSE}
multidb_dist_df  %>%
    ggplot(aes(x = Database, y = distance, fill = group_comp)) +
    geom_boxplot() +
    facet_grid(.~amplicon) +
    labs(x = "Species Comparison", fill = "Species Comparison") +
        theme_bw() +
    theme(legend.position = "bottom")
```


#### Species level comparison
While the overall pairwise distance is greater between than within species for the _Paenibacillus_ genus, it is important to understand how within and between species pairwise distances compare for individual Species.
The within species pairwise for _P. amylolyticus_ distances are similar to between species distances for _P. pabuli_ and _P. illnoisensis_ (Fig. \@ref(fig:speciesComp)). 
When including outliers the within pairwise distances for _P. amylolyticus_ is in the range of between pairwise distances for other _P. spp._.
Similar to the genus level comparison investigation of the outliers sequences and strains is required to evaluate the taxonomic resolution for _Paenibacillus_.

```{r speciesComp, fig.cap = "Pairwise distances within the _Paenibacillus amylolyticus_  species and between  _P. amylolyticus_ and the 9 other most represented species.", echo = FALSE}
multidb_dist_df %>%
  mutate(Species_Comp = paste0(Keys_Species, Keys2_Species)) %>% 
  filter(str_detect(Species_Comp, "amylolyticus")) %>% 
  mutate(Species_Comp = str_remove(Species_Comp, "amylolyticus")) %>% 
  mutate(Species_Comp = fct_reorder(Species_Comp, distance),
         Species_Comp = fct_relevel(Species_Comp, "amylolyticus", after = 0)) %>% 
    ggplot() +
    geom_boxplot(aes(x = Species_Comp,
                    y = distance,
                    fill = Database)) +
    facet_wrap(~amplicon, ncol = 1) +
   theme_bw() + 
  theme(axis.text.x = element_text(angle = -45, hjust = 0), 
        legend.position = "bottom") + 
  labs(x = "Species compared to amylolyticus", y = "Pairwise Distance")
```


# Conclusion
The `metagenomeFeatures` package and associated 16S rRNA database packages `greengenes13.5MgDb`, `silva128.1MgDb`,   `ribosomaldatabaseproject11.5MgDb`, provide a consistent interface for working with the different databases. 
Furthermore, through the use of the RNAcentralIDs we are able to evaluate database overlap and perform cross database analysis. 
We demonstrate how the _metagenomeFeatures_ package in conjunction with one of the associated 16S rRNA database packages, and other R packages, can be used to evaluate whether species-level taxonomic classification is possible for a specific amplicon region.
The approach used here can easily be extended to evaluate taxonomic groups (changing filtering parameters), or amplicon regions (changing primer sequences).

********************************************************************************

# Session Information {#sessionInfo}
## System Information
```{r}
sessioninfo::platform_info()
```


## Package Versions
```{r}
sessioninfo::package_info() %>% 
  filter(attached == TRUE) %>% 
  select(package, loadedversion, source) %>% 
    knitr::kable(booktabs = TRUE)
```

# References
<!-- __Potential reference__ https://www.biorxiv.org/content/early/2018/03/25/288654?%3Fcollection= -->
<!-- Sequencing of the 16S ribosomal RNA (rRNA) gene and the fungal Internal Transcribed Spacer (ITS) region is widely used to survey microbial communities. Specialized ribosomal sequence databases have been developed to support this approach including Greengenes, SILVA and RDP. Most taxonomy annotations in these databases are predictions from sequence rather than authoritative assignments based on studies of type strains or isolates. Here, I investigate the error rates of taxonomy annotations in these databases. I found 253,485 sequences with conflicting annotations in SILVA v128 and Greengenes v13.5 at ranks up to phylum (9,644 conflicts), indicating that the annotation error rate in these databases is ~15%. I found that 34% of non-singleton genera have overlapping subtrees in the Greengenes tree from 2001 according to the RDP taxonomy, most of which are probably due to branching order errors in the Greengenes tree, which is therefore an unreliable guide to phylogeny. Using a blinded test, I estimated that the annotation error rate of the RDP database is ~10%. -->