Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sourmash vs aligners, & comparison with other tools #3079

Open
ctb opened this issue Mar 13, 2024 · 9 comments
Open

sourmash vs aligners, & comparison with other tools #3079

ctb opened this issue Mar 13, 2024 · 9 comments

Comments

@ctb
Copy link
Contributor

ctb commented Mar 13, 2024

(copied from #461 (comment))

Hi Titus @ctb , I don't know whether I could ask the following questions related to this issue, anyway....

We've used your tool to quantify the abundance of a set number of reference genomes against raw reads (query fastq).
In addtion, according to your suggestion here, we used median_abund to determine which genomes are differentially abundant among sample groups. However, we had some doubts on this value.

  1. when we use aligner like bwa or blast, we have to set a reasonable coverage cutoff (say 60% coverage) to decrease false positive when we quantify abundance between genomes and reads. However, when we used your tool, how to make sure those genomes present in samples are not false positive?
  2. we've compared your tool with bwa together to see if their results are comparable. The answer is Yes. why we used your tool to replace our original mapper was that your tool is way faster than bwa-based methods since our reference set was big (~60k genomes).
  3. we also spend some time on comparing biobakery results with sourmash evaluating abundance and we wanted to start to draft our manuscripts on these findings, so we need to obtain more information related to this discussion from you, and thanks in advance.

forgive me for my offensiveness to hijack this disscussion.
Yuzie

Originally posted by @yuzie0314 in #461 (comment)

@ctb
Copy link
Contributor Author

ctb commented Mar 13, 2024

will draft longer answer in a bit, but can I confirm what 'coverage' means here?

  1. when we use aligner like bwa or blast, we have to set a reasonable coverage cutoff (say 60% coverage) to decrease false positive when we quantify abundance between genomes and reads. However, when we used your tool, how to make sure those genomes present in samples are not false positive?

do you mean fraction of genome covered by at least one read ('breadth' in inStrain, or 'detection' in anvi'o), or something else?

@ctb
Copy link
Contributor Author

ctb commented Mar 13, 2024

In addtion, according to your suggestion here, we used median_abund to determine which genomes are differentially abundant among sample groups. However, we had some doubts on this value.

  1. when we use aligner like bwa or blast, we have to set a reasonable coverage cutoff (say 60% coverage) to decrease false positive when we quantify abundance between genomes and reads. However, when we used your tool, how to make sure those genomes present in samples are not false positive?

The best reference to use is Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets, Portik et al., which shows that sourmash has very few false positives at a taxonomic level. This does suggest relatively few false positives at a genomic level as well.

The underlying algorithm that does the actual assignment is implemented as sourmash gather as is described in Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers, Irber et al. (bioRxiv).

But... we don't have anything written as to why it works well. I have unpublished analyses and a general understanding of what's going on, and there's information scattered around the repo (see #2360 specifically). The short answer appears to be that the combinatorial nature of sourmash matching with FracMinHash and containment makes matches as low as 3 hashes (a --threshold-bp of 3kb when using a scaled of 1000) very specific at a species level.

  1. we've compared your tool with bwa together to see if their results are comparable. The answer is Yes. why we used your tool to replace our original mapper was that your tool is way faster than bwa-based methods since our reference set was big (~60k genomes).

We always like to hear that ;)

  1. we also spend some time on comparing biobakery results with sourmash evaluating abundance and we wanted to start to draft our manuscripts on these findings, so we need to obtain more information related to this discussion from you, and thanks in advance.

Sure - I think biobakery uses mapping underneath, right? Happy to chat more here about such things; if we need to move to e-mail for detailed stuff, drop me a note at ctbrown@ucdavis.edu (but I'd be happiest to keep as much public as possible).

In general, sourmash has the goal of rapid and extremely large scale analysis against all available genomes, including private genomes; while other tools focus more on a curated set of high quality genomes, usually those available from NCBI or GTDB.

@yuzie0314
Copy link

will draft longer answer in a bit, but can I confirm what 'coverage' means here?

  1. when we use aligner like bwa or blast, we have to set a reasonable coverage cutoff (say 60% coverage) to decrease false positive when we quantify abundance between genomes and reads. However, when we used your tool, how to make sure those genomes present in samples are not false positive?

do you mean fraction of genome covered by at least one read ('breadth' in inStrain, or 'detection' in anvi'o), or something else?

The coverage I meaned here was specific fraction of genome covered by at least one read (the breadth )

@yuzie0314
Copy link

Dear the author Titus,

The biobakery is using their genome set to generate the results, and we think their so-called biomarkers are derived from their manually curated. On the other hand, we had gtdbtk representative genome and some denovo assembled genomes sketched from sourmash and applied this customed db to generate the abundance table against PE raw reads. Overall, we could see their phylum and genus constitution or even the alpha diversity are comparable. In addition, we also conducted deseq2 or maaslin2 to see the differences between different sample groups. We found some species showed higher abundance within specific group, and those results presented in the biobakery and your methods. We actually not 100% trust the results from biobakery so we wanted to use sourmash to validate or supplement any interesting results to our hypothesis. This project was actually a big cohort study related to smoker and non-smoker in Asian, which was not only including quantification analysis but also the assembled genomes built from the samples. Because this project was complicated, forgive me unable to make it clearer. We are actually your big fans because our pipelines heavily relying your tool, and look forward to any improvements within your updates.

if we have any good news, will let you know and bring out to discuss with you.
YZ

@ctb
Copy link
Contributor Author

ctb commented Mar 15, 2024

will draft longer answer in a bit, but can I confirm what 'coverage' means here?

  1. when we use aligner like bwa or blast, we have to set a reasonable coverage cutoff (say 60% coverage) to decrease false positive when we quantify abundance between genomes and reads. However, when we used your tool, how to make sure those genomes present in samples are not false positive?

do you mean fraction of genome covered by at least one read ('breadth' in inStrain, or 'detection' in anvi'o), or something else?

The coverage I meaned here was specific fraction of genome covered by at least one read (the breadth )

Great! Yes, in that case, our experience is that sourmash is accurate at the species level down to about 3 hashes, or --threshold-bp=3000 at a scaled of 1000.

@ctb
Copy link
Contributor Author

ctb commented Mar 15, 2024

Dear the author Titus,

The biobakery is using their genome set to generate the results, and we think their so-called biomarkers are derived from their manually curated. On the other hand, we had gtdbtk representative genome and some denovo assembled genomes sketched from sourmash and applied this customed db to generate the abundance table against PE raw reads. Overall, we could see their phylum and genus constitution or even the alpha diversity are comparable. In addition, we also conducted deseq2 or maaslin2 to see the differences between different sample groups. We found some species showed higher abundance within specific group, and those results presented in the biobakery and your methods. We actually not 100% trust the results from biobakery so we wanted to use sourmash to validate or supplement any interesting results to our hypothesis. This project was actually a big cohort study related to smoker and non-smoker in Asian, which was not only including quantification analysis but also the assembled genomes built from the samples. Because this project was complicated, forgive me unable to make it clearer. We are actually your big fans because our pipelines heavily relying your tool, and look forward to any improvements within your updates.

if we have any good news, will let you know and bring out to discuss with you. YZ

Excellent, really glad to hear this - please post questions and so on as you have them!

@jorondo1
Copy link

jorondo1 commented Mar 19, 2024

Hi @ctb !

I'd like to slide into this discussion, hopefully not too far away from the subject. @yuzie0314 is comparing (i assume) Metaphlan results with Sourmash, which I've also been doing (along with kraken/bracken using conservative confidence cutoffs, as suggested a great paper I wish had also considered sourmash).

My question is: Would it be accurate to say that Sourmash gather reports sequence abundance, and not taxonomic abundance? I'm referring to the distinction made by this paper, which explains k-mer based methods (again kraken/bracken) reports sequence abundance, which are not genome-length normalised, as opposed to marker-gene methods. Are we getting something similar conceptually with Sourmash, or is there a form of genome length normalisation going on in the abundance estimation?

Looking forward to your reply,
Jonathan

@ctb
Copy link
Contributor Author

ctb commented Mar 19, 2024

My question is: Would it be accurate to say that Sourmash gather reports sequence abundance, and not taxonomic abundance? I'm referring to the distinction made by this paper, which explains k-mer based methods (again kraken/bracken) reports sequence abundance, which are not genome-length normalised, as opposed to marker-gene methods. Are we getting something similar conceptually with Sourmash, or is there a form of genome length normalisation going on in the abundance estimation?

You are correct - no genome-length normalization is going on! (Thanks for the references, too :)

I've been exploring this a little bit with @bryshalm who is looking at sourmash for OTU-style processing, and my hot take is that failing to normalize for genome length is similar to our general disregard for 16S/SSU rRNA copy number - you just need to be aware that it's a thing :).

I also do not think, even in principle, that you can properly normalize for genome length - there are too many unknowns around pangenomic content in metagenomes vs what's in the reference database - but happy to discuss further!

@jorondo1
Copy link

I agree, the important thing is to acknowledge this. Thanks for the quick reply!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants