-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Meaning of abundance in classification summary #152
Comments
I have the same question actually. Contacted the developers via email and no answer. It would be really helpful to get some formula and this clarified... |
I've got a somewhat related issue, in that I get loads of classifications for which the centrifuge summary indicates there are thousands of reads (and unique reads as well since I restricted it to scoring down to 1 classification) which have an abundance of just 0. Looking at the equation in the paper for calculating abundance I'm not sure why there would ever be anything it says is present at all with an abundance of exactly 0. There doesn't seem to be a pattern to it either, includes eukaryotes and prokaryotes, species level through kingdom level classifications. |
@pxo289 I see the same output as you, where everything gets an abundance of 0 despite a large proportion of all reads being uniquely assigned to one organism. There's a bug somewhere, evidently. Note that there's a separate issue tracking it: #158 @lec49, I'll also attempt answer your original question, based on my best attempting at understanding the paper at https://genome.cshlp.org/content/26/12/1721.full.pdf, with the caveat that, as noted above, I've not actually got Centrifuge to work on my reads yet.
No - the length of reads is irrelevant. It's the count of reads that matters. Rather, the abundance values attempt to convey what proportion of organisms in a sample were from each species. That calculation is informed by both the number of reads assigned to each species and the lengths of those species' reference genomes. The key formula is this one from the end of page 4: This gives the likelihood of a given configuration of abundances, α, given a set C of assignments of reads to species. (Equivalently, then, it's the probability of a set of assignments C of reads to species given a set of abundances.) From the formula we can reverse-engineer Centrifuge's assumptions:
That second assumption is where the normalized by its genomic length idea comes in. Subject to those assumptions, Centrifuge uses the screenshotted formula to find the highest-likelihood configuration of abundances. |
Using Illumina data from mock communities, also often found unexpected abundance estimates, many reads mapped uniquely but very low or zero estimated abundances. Seems that the expectation maximization may be the cause of this. Could the developers please comment on this? Thanks. |
Any developments with relation to this thread? |
Hi,
I have a question about the 'abundance' column on the centrifuge classification summary output.
The manual says: 'it is the proportion of this genome normalized by its genomic length'
Can someone clarify in more simple terms exactly what this means?
My understanding was it was the length of all the reads that map to that genome divided by the total length of the genome. If you have reads that map to the whole genome you will get a score of 1.0? Is this correct or does it mean something else?
Thanks,
Lyndsay
The text was updated successfully, but these errors were encountered: