
Metagenomics (IMPACTT December 2022) Answers


These are answers for the 2022 IMPACTT bioinformatics workshop running from Dec 6 - 7th.

Authors: Robyn Wright and Morgan Langille

This tutorial has been adapted from the previous 2021 version written for the 2021 Canadian Bioinformatics Workshop by Jacob Nearing and Morgan Langille.

The answers shown here are for the Metagenomic Taxonomic Composition Tutorial and the Metagenomic Assembly Tutorial.

Taxonomic composition

Question 1: Based on the above line counts can you tell how many reads should be in the reverse FASTQs? Remember that a standard convention is to mark the reverse FASTQs with the characters R2.

The number of reads should be the same in the reverse FASTQs as it is in the forward FASTQs.
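If you want to check this yourself, a minimal sketch (assuming the FASTQs are uncompressed and follow the _R1/_R2 naming convention; adjust the paths and patterns to match the tutorial files) is:

wc -l *_R1*.fastq
wc -l *_R2*.fastq

Each read in a FASTQ file takes up four lines, so dividing each line count by four gives the number of reads, and the R1 and R2 counts should match for properly paired files.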

Question 2: Take a look at kneaddata_read_counts.txt: how many reads in sample CSM7KOMH were dropped?

You can look at the number of reads in the "final pair1" or "final pair2" columns for this - they should show 24,007 reads, meaning that 993 reads were dropped.
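If the raw table is hard to read, one option (just standard shell tools, nothing kneaddata-specific) is to line the columns up before paging through it:

column -t kneaddata_read_counts.txt | less -S

The -t flag aligns the columns and -S stops less from wrapping long lines, which makes it easier to find the "final pair1" and "final pair2" columns.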

Question 3: It's reasonable that the processed data was provided by IBDMDB since their goal is to provide standardized data that can be used for many projects. However, why is it important that normally all raw data be uploaded to public repositories?

There are a few reasons why it's important that all raw data normally be uploaded to public repositories. The main one is that someone else should be able to verify your results by repeating exactly what you did, but the best standards for processing sequencing data are also constantly evolving, and someone else may want to look at something different in those samples. For example, you may have looked at only the bacterial reads in your samples, but someone else may be interested in the archaeal, fungal or viral reads.

Question 4: How many more reads in each sample are classified using the full RefSeq database (RefSeqCompleteV205) than with the minikraken database (with no confidence threshold)? *Hint: you can use the less command to look at both the .kreport and the .kraken.txt files.*

For example, if we look at the CSM7KOMH_0.0_minikraken.kreport and CSM7KOMH_0.0_RefSeqCompleteV205.kreport files, we can see that with the minikraken database, 33,117 reads or 68.97% of reads were classified, while with the RefSeqCompleteV205 database, 49,463 reads or 98.98% of reads were classified, so 16,346 more reads were classified with the full RefSeq database for this sample.
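A quick way to pull out just these numbers is to look at the first couple of lines of each report, which typically contain the unclassified and classified ("root") totals (the paths here assume the .kreport files are in your current directory; adjust as needed):

head -n 2 CSM7KOMH_0.0_minikraken.kreport
head -n 2 CSM7KOMH_0.0_RefSeqCompleteV205.kreport

In a Kraken2 report, the first column is the percentage of reads in a clade and the second column is the number of reads, so the "root" line gives the total classified reads.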

Question 5: How many more reads in each sample are classified using a confidence threshold of 0.0 compared with a confidence threshold of 0.1 for the minikraken database?

For example, if we look at the same sample but with a confidence threshold of 0.1 (CSM7KOMH_0.1_minikraken.kreport and CSM7KOMH_0.1_RefSeqCompleteV205.kreport), then we find that 26,558 reads or 55.31% of reads were classified with the minikraken database, while 49,409 reads or 98.87% of reads were classified with the RefSeqCompleteV205 database. So there was a fairly large decrease in the number of reads classified with the minikraken database when we went from a confidence threshold of 0.0 to 0.1 (6,559 fewer reads classified), while with the RefSeqCompleteV205 database this difference was very small (only 54 fewer reads). As you can see, we need to take into account both the database and the confidence threshold when choosing which to use for your samples. If you want to read more about this, you can do so in our preprint.

Question 6: Take a look at the raw output for the same sample but run with a confidence threshold of 0.1. Was the second read classified? Why or why not?

In this example, you should see that with a confidence threshold of 0.0 the second read is:

C CB1APANXX170426:1:1101:10001:37312/1 Bacteroides (taxid 816) 101 0:57 816:5 0:4 816:1

and with a confidence threshold of 0.1 it is:

U CB1APANXX170426:1:1101:10001:37312/1 unclassified (taxid 0) 101 0:57 816:5 0:4 816:1

If we calculate the confidence as we had before for the taxid 816, which the read is initially classified as, then this gives us (5+1)/(57+5+4+1) = 6/67 = 0.089, so this is below the confidence threshold of 0.1. Because the other k-mers within the read were unclassified, this read becomes unclassified when we set the confidence threshold to 0.1.
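As a quick check of that arithmetic on the command line (this is just shell arithmetic, not something Kraken2 calculates for you in this output):

echo "scale=3; (5+1)/(57+5+4+1)" | bc

This prints .089, which is below the 0.1 threshold, so the read is left unclassified.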

Question 7: How would you go about figuring out the total number of species we found within our ten different samples?

You can simply count the number of lines within the Bracken summary files, like so:

wc -l bracken_out_merged/merged_output_minikraken.species.bracken
wc -l bracken_out_merged/merged_output_RefSeqCompleteV205.species.bracken

Remember that these files also have a header row, so we'll need to subtract one from each count, giving 415 species in the MiniKraken-classified samples and 2,381 species in the RefSeqCompleteV205-classified samples.
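If you'd rather not subtract the header row by hand, one option is to skip it with tail before counting:

tail -n +2 bracken_out_merged/merged_output_minikraken.species.bracken | wc -l
tail -n +2 bracken_out_merged/merged_output_RefSeqCompleteV205.species.bracken | wc -l

tail -n +2 prints everything from the second line onwards, so the header is excluded from the count.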

Question 8: How does the number of species classified by each of the confidence thresholds differ, and how does it differ between the minikraken and the RefSeqCompleteV205 databases?

You can use commands like these to see the number of species classified within each file, grouping the files in different ways:

wc -l bracken_out/*minikraken*
wc -l bracken_out/*RefSeqCompleteV205*
wc -l bracken_out/*0.0_minikraken*
wc -l bracken_out/*0.1_minikraken*
wc -l bracken_out/*0.0_RefSeqCompleteV205*
wc -l bracken_out/*0.1_RefSeqCompleteV205*
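Remember that each of these files also has a header row. If you want counts with the header already subtracted, a small loop like this (a sketch; adjust the glob pattern for whichever group you're interested in) will do it:

for f in bracken_out/*minikraken*; do echo "$f $(( $(wc -l < "$f") - 1 ))"; done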

Question 9: Have a look at the figures and see what differences you notice in the abundance profiles between: (A) the confidence thresholds of 0.0 and 0.1; (B) the CD and nonIBD samples; and (C) the MiniKraken and RefSeq Complete V205 databases.

(A) For both the MiniKraken and the RefSeq Complete V205 database, the overall profiles are fairly similar, but in each case the relative abundance of the "Other" category is higher with a confidence threshold of 0.0 than with 0.1. This is likely because, with no confidence threshold set (0.0), lots of taxa are identified that are only present at very low abundances and are likely to be false positives, whereas with a confidence threshold of 0.1 far fewer of these false positives are identified.

(B) There are some differences in even the most abundant taxa between the CD and nonIBD samples. The specific differences vary between the databases, but with the RefSeq Complete V205 database, for example, Prevotella copri is only identified in nonIBD samples, while some Bacteroides species (Bacteroides ovatus, Bacteroides fragilis and Bacteroides uniformis) tend to be present only at very low abundances in nonIBD samples but are very abundant in some CD samples.

(C) For example, Eubacterium rectale is on average the most abundant species with the RefSeq Complete V205 database, but it is completely absent from the MiniKraken classifications, even though many of the other most abundant species are the same (e.g., Prevotella copri, Bacteroides ovatus, Faecalibacterium prausnitzii and Phocaeicola vulgatus). When we see something like this, it could be, for example, because the species isn't present in the smaller MiniKraken database, or because it had a different species name in a previous NCBI taxonomy version.

Assembly

Question 1: How many contigs are there in the megahit_out/final.contigs.fa file?

There are 37 contigs in this file.
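One way to count them is to count the header lines, since each contig record in a FASTA file starts with ">":

grep -c ">" megahit_out/final.contigs.fa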

Question 2: What proportion of the reads from sample HSM6XRQY were mapped to the contigs?

Here you should have seen that the sample names were printed before the read mapping statistics. HSM6XRQY was the last sample to be mapped and therefore the last to be printed out: 1.47% of its reads were mapped to the contigs.

Question 3: How many fasta files/bins are there in the fasta_bins folder?

There are three bins in this folder. This was easy to count with such a small number of files, but if we had lots of files in this folder then we could have used ls concoct_output/fasta_bins | wc -l to count the number of files for us.

Question 4: How many contigs are there in the largest fasta file?

There are 35 contigs in the largest fasta file/bin.
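You can count the contigs in every bin at once in the same way as before (this assumes the bins have a .fa extension; adjust the pattern if yours differ):

grep -c ">" concoct_output/fasta_bins/*.fa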

Question 5: If you were keeping only the MAGs with >90% completeness and <10% contamination/redundancy, how many MAGs would you be left with? How about if you kept the MAGs with >50% completeness and <10% contamination/redundancy?

Note that if you open the summary-CHECKM-gtdbtk.tsv file in Excel or a similar program, you will be able to sort the MAGs by these columns and see this more easily.

If we only kept those with >90% completeness and <10% redundancy then we would have 20 MAGs, while we would have 41 if we expanded this to include all MAGs with >50% completeness.
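If you'd rather filter on the command line than in Excel, an awk sketch like the one below works, assuming the file is tab-separated and that completeness and contamination are in columns 2 and 3 - check the header first and change the column numbers to match your file:

head -n 1 summary-CHECKM-gtdbtk.tsv
awk -F'\t' 'NR > 1 && $2 > 90 && $3 < 10' summary-CHECKM-gtdbtk.tsv | wc -l
awk -F'\t' 'NR > 1 && $2 > 50 && $3 < 10' summary-CHECKM-gtdbtk.tsv | wc -l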

Question 6: Take a look at the most complete MAGs and the Kraken2/Bracken output. Do the most complete MAGs make sense to you based on the community profiles output by Kraken2/Bracken? Are there any taxa missing? If there are, why do you think this might be?

Typically, we'd expect that the most abundant taxa identified in our reads would also be the most likely to assemble into close-to-complete genomes, simply because with more reads we're more likely, just by chance, to have more of the genome covered. Generally, this seems to be the case here, with MAGs classified as the most abundant taxa identified from the reads (e.g. Phocaeicola vulgatus, Prevotella copri and Bacteroides ovatus). There are some taxa missing from these "best" MAGs, though. For example, Faecalibacterium prausnitzii and Eubacterium rectale are both missing.

In the full MAG output (summary-CHECKM-gtdbtk.tsv), we actually do have several Faecalibacterium MAGs; however, those that are complete have high contamination (e.g. MAG 151), and the others are not very complete. If we were doing our own analysis, there are methods for refining relatively complete MAGs that can hopefully reduce this contamination/redundancy. There is one Eubacterium MAG (MAG 136, Eubacterium_G ventriosum), but a quick look at Eubacterium rectale in the NCBI taxonomy browser suggests that it sometimes goes by different names, and it may also be quite closely related to the Roseburia species, of which we do have several MAGs. The process of assembling MAGs is definitely not perfect, and we need to interpret the resulting genomes with caution.

Question 7: Are there any MAGs with unusually small or large genome sizes? How do you think the completion or contamination/redundancy contributes to this?

Some of the largest genomes are around 5 Mb (MAG 152 Bacteroides fragilis_A and MAG 81 Bacteroides faecis). One of these is relatively complete (MAG 152, ~94% completion), but the other (MAG 81) is only 55% complete and already has >5% redundancy. This suggests that it may also contain some contamination, as the genome would be very large if it were 100% complete, and the redundancy could then be much higher, too. Some of the MAGs are small but almost 100% complete (e.g. MAG 34), while some are small and not very complete (e.g. MAG 21).
