
Answers to questions

Robyn Wright edited this page May 25, 2023 · 9 revisions

These are the answers for the 2023 ICG bioinformatics workshop running from May 24th-25th with access to ACENET servers.

Authors: Robyn Wright, Monica Alvaro Fuss and Morgan Langille

The answers shown here are for three tutorials: Microbiome analysis using QIIME2 with 16S data, Microbial diversity metrics and data visualization with QIIME2, and Metagenome sequence data analysis with Kraken2 and Phyloseq.

Microbiome analysis using QIIME2 with 16S data

Question 1: How many samples are there?

We have 12 samples. The raw data folder contains one fastq.gz file for each set of the forward and reverse sequences (labelled R1 and R2, respectively) for each of the samples (BBxxx).

Question 2: Into what group(s) are the samples classified?

The samples are classified by two metadata categories: bulk vs rhizosphere soil, and forest vs managed sites.

Question 3: What would happen if you ran this exact command on V4/V5-amplified sequences?

Cutadapt only retains reads that match the specified primer sequences. Reads amplified with different (V4/V5) primers would not match, so most reads would be discarded because we are including the --p-discard-untrimmed option.
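For context, this is roughly what the trimming step looks like in QIIME2; the file names and primer sequences below are placeholders rather than the tutorial's exact values:

```shell
# Trim primers from paired-end reads. --p-discard-untrimmed drops any
# read in which the primer was not found, which is why reads amplified
# with other (e.g. V4/V5) primers would mostly be discarded.
# FORWARD_PRIMER/REVERSE_PRIMER and file names are placeholders.
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f FORWARD_PRIMER \
  --p-front-r REVERSE_PRIMER \
  --p-discard-untrimmed \
  --o-trimmed-sequences primer-trimmed.qza
```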

Question 4: How long are our forward reads? Why are there no reverse reads in our file?

The forward read median length is 405 nucleotides. There are no reverse reads because forward and reverse reads were merged into one sequence during read joining.

Question 5: What would be a good cut-off?

There is no single right answer for this question, but a round trim length of 400 nucleotides maintains a good balance between read depth and quality.

Question 6: What is the mean sequencing depth per sample after denoising?

The mean sequencing depth (frequency) across all denoised samples is 10,777 reads.

Question 7: Which sample has the least reads?

Sample BB203 has the lowest sequencing depth (7,397 reads).

Question 8: What is the maximum sequencing depth across all samples?

The final maximum sequencing depth is 11,536 reads.

Microbial diversity metrics and data visualization with QIIME2

Question 1: What is a good rarefaction depth for diversity analysis?

A cut-off of 4,000 reads will be sufficient: the curve plateaus around this depth and we won't exclude any samples.
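This depth is what gets passed to QIIME2's core diversity pipeline as the rarefaction (sampling) depth. A minimal sketch, where the input/output file names are assumptions rather than the tutorial's exact names:

```shell
# Rarefy all samples to 4,000 reads and compute the core alpha and beta
# diversity metrics in one step. File names here are placeholders.
qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted-tree.qza \
  --i-table table.qza \
  --p-sampling-depth 4000 \
  --m-metadata-file metadata.tsv \
  --output-dir core-metrics-results
```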

Question 2: Are there any significant differences in alpha diversity between any of our metadata categories?

There are significant differences in richness and phylogenetic diversity between forest and managed environment samples. There are no significant differences between bulk and rhizosphere soil.
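These comparisons come from QIIME2's alpha-group-significance action, which runs Kruskal-Wallis tests across the groups in each metadata column. A sketch with placeholder file names:

```shell
# Test whether Faith's phylogenetic diversity differs between metadata
# groups (Kruskal-Wallis). File names are placeholders.
qiime diversity alpha-group-significance \
  --i-alpha-diversity core-metrics-results/faith_pd_vector.qza \
  --m-metadata-file metadata.tsv \
  --o-visualization faith-pd-group-significance.qzv
```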

Question 3: Which metadata category appears to provide more separation in the beta diversity PCoA plots?

This is hard to say just by looking at the Emperor plots provided by QIIME2, but the forest/managed category appears to exhibit more distinct separation. You could verify this by exploring the beta-group-significance command within the diversity plugin.
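For example, a PERMANOVA on one metadata column could be run like this; the distance matrix file name and the metadata column name are placeholders, not the tutorial's exact values:

```shell
# PERMANOVA test of whether samples cluster by a metadata column,
# with pairwise group comparisons. File/column names are placeholders.
qiime diversity beta-group-significance \
  --i-distance-matrix core-metrics-results/weighted_unifrac_distance_matrix.qza \
  --m-metadata-file metadata.tsv \
  --m-metadata-column forest_or_managed \
  --p-pairwise \
  --o-visualization weighted-unifrac-group-significance.qzv
```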

Question 4: Can you identify any patterns between the metadata groups?

Because stacked barcharts are limited in their analytical capabilities, it is hard to discern anything except very obvious patterns.

Question 5: Does ANCOM identify any differentially abundant taxa between any of the metadata groups? If so, which one(s)?

ANCOM does not identify any taxa as significantly differentially abundant between bulk and rhizosphere soils, but three ASVs are identified as differentially abundant between forest and managed environments. You can look up the identity of each ASV in the taxonomy.tsv file.
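For reference, ANCOM is run through QIIME2's composition plugin; the feature table first needs a pseudocount added so that all counts are strictly positive. File names and the metadata column name below are placeholders:

```shell
# ANCOM works on log-ratios, so zero counts must first be replaced by
# adding a pseudocount. File/column names are placeholders.
qiime composition add-pseudocount \
  --i-table table.qza \
  --o-composition-table comp-table.qza

qiime composition ancom \
  --i-table comp-table.qza \
  --m-metadata-file metadata.tsv \
  --m-metadata-column forest_or_managed \
  --o-visualization ancom-forest-managed.qzv
```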

Metagenome sequence data analysis with Kraken2 and Phyloseq

Question 1: How many more reads are classified using the full database than with the mini database?

Using the mini database, 357,516 reads were classified (7.21%), while using the full database, 1,974,579 reads were classified (39.80%), so approximately 1.6 million (1,617,063) more reads are classified using the full database.

Question 2: What do you notice about the number, or proportion, of reads that have been classified?

With a confidence threshold of 0.2 (using the full database), only 383,770 reads have been classified (7.74%), so as the confidence threshold increases, fewer reads are classified.

Question 3: What do you notice now about the number of reads that have been classified? Does this seem like an appropriate confidence threshold to use?

With a confidence threshold of 0.2 (using the mini database), only 3,138 reads have been classified (0.06%), so with this database too, increasing the confidence threshold leads to fewer reads being classified. This seems like too low a proportion of classified reads to get an accurate overview of the community, though, so if you see this with your own data, I'd suggest using a lower confidence threshold or a bigger database. The right database/confidence threshold combination will depend on your data as well as the research question at hand, and I'd often suggest running a few different confidence thresholds on a subset of your data to see what seems sensible.
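For reference, the confidence threshold is set with Kraken2's --confidence flag; a sketch with placeholder paths and file names:

```shell
# Classify one sample's paired-end reads with a confidence threshold
# of 0.2. Database path and file names are placeholders.
kraken2 --db /path/to/kraken2_database \
  --paired sample_R1.fastq.gz sample_R2.fastq.gz \
  --confidence 0.2 \
  --threads 4 \
  --output sample_conf0.2.kraken \
  --report sample_conf0.2.kreport
```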

Question 4: Here you should see that there are also Eukaryotes in your samples. What do you notice about the distribution of the Eukaryotes in different samples?

In the bulk samples, Eukaryotes make up about 10% of the community, while in the rhizosphere samples they make up 35-70% of the community. They don't seem to vary much between forest and managed samples, although they are particularly abundant in a single managed rhizosphere sample.

Question 5: Can you modify this phylum level code to work at a lower rank?

An example of this at the family level:

ps.family = tax_glom(physeq_relabun, taxrank="ta6", NArm=FALSE)
rank.sum = tapply(taxa_sums(ps.family), tax_table(ps.family)[, "ta6"], sum, na.rm=TRUE)
top30 = names(sort(rank.sum, TRUE))[1:30]
ps.family = prune_taxa((tax_table(ps.family)[, "ta6"] %in% top30), ps.family)
plot_bar(ps.family, fill="ta6") + facet_wrap(c(~Description_1, ~Description_3), scales="free_x", nrow=1) + theme(legend.text=element_text(size=5), legend.key.size = unit(0.3, "cm")) + guides(fill=guide_legend(ncol=1)) + scale_fill_manual(values=palette)

Note that here we've changed phylum and ta3 to family and ta6 everywhere they appeared (domain=ta1, kingdom=ta2, phylum=ta3, class=ta4, order=ta5, family=ta6, genus=ta7, species=ta8).

Question 6: Why do we not usually have a phylogenetic tree for read-based metagenomic analyses?

In amplicon analyses, we're looking at reads from only a single gene, so we can build trees directly from these sequences. In metagenomic analyses, the reads come from different parts of different genomes, so building a tree from them directly would be much more computationally intensive. Often, when we assemble the reads into contigs and then use these to build metagenome-assembled genomes (MAGs), we will perform tree-based analyses on the MAGs. However, there are some methods that help to generate trees for read-based metagenomic data, e.g. PhyloT, which generates a taxonomic tree from a list of NCBI taxonomy IDs, or mapping of the prokaryotic reads to genomes that have already been placed in a phylogenetic tree, such as the tree provided with each Genome Taxonomy Database release.

Question 7: What patterns do you notice with the alpha diversity between groups? How might you explain these differences in terms of the number and abundance of taxa in the groups?

Richness (observed taxa and Chao1) is higher in rhizosphere than bulk samples, but Simpson's diversity is lower. Richness is slightly lower in managed than forest samples but Simpson's diversity is similar. This suggests that there are more taxa present in rhizosphere than bulk samples (higher richness), but that just a few of these taxa are dominant in the rhizosphere samples (lower diversity).

Question 8: What can you tell about the samples from how they group on the PCoA plots? Does this agree with the results of the PERMANOVA tests?

The samples group more strongly by whether they are from bulk or rhizosphere soil (Description_1) than by whether they are from forest or managed sites (Description_3). We also see that the bulk and rhizosphere samples are separated on Axis 1, which accounts for a large amount (81%) of the overall variation between samples.

You should see an output similar to this for the PERMANOVA test:

adonis2(formula = distance ~ sample_data(ps)$Description_1 * sample_data(ps)$Description_3)
                                                            Df SumOfSqs      R2       F Pr(>F)   
sample_data(ps)$Description_1                                1  0.66799 0.72628 24.8658  0.002 **
sample_data(ps)$Description_3                                1  0.02319 0.02522  0.8633  0.404   
sample_data(ps)$Description_1:sample_data(ps)$Description_3  1  0.01365 0.01484  0.5080  0.567   
Residual                                                     8  0.21491 0.23366                  
Total                                                       11  0.91973 1.00000                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Here we can see the R2 (effect size) and p (Pr(>F)) values for Description_1 (bulk vs rhizosphere), Description_3 (forest vs managed), and the interaction between the two (sample_data(ps)$Description_1:sample_data(ps)$Description_3). The residual shows us how much of the variation between samples is not explained by the variables we're testing (23% in this case). The stronger grouping on the PCoA plot by bulk vs rhizosphere is therefore supported by the PERMANOVA test: Description_1 accounts for 73% of the between-sample variation (R2=0.72628) and is the only significant factor (p=0.002). Description_3 explains only a small amount of the variation between samples (2.5%) and this effect is not significant (p=0.404); the interaction between Description_1 and Description_3 is likewise not significant (p=0.567) and accounts for only a small amount of the variation (1.5%).

Question 9: How do these results compare with those for Bray-Curtis dissimilarity? What does this tell you about the abundances of different taxa?

A lower proportion of the variation is explained with the Jaccard distance than with Bray-Curtis dissimilarity (Jaccard residual R2=0.32533), but the grouping of samples and the results of the PERMANOVA test are similar. This suggests that the differences between the sample groups are driven more by the presence/absence of taxa than by the abundances of those taxa.
