
Meaning of abundance in classification summary #152

Open
lec49 opened this issue Oct 11, 2018 · 5 comments

Comments

@lec49 commented Oct 11, 2018

Hi,
I have a question about the 'abundance' column on the centrifuge classification summary output.

The manual says: 'it is the proportion of this genome normalized by its genomic length'

Can someone clarify in more simple terms exactly what this means?

My understanding was that it was the length of all the reads that map to that genome divided by the total length of the genome. If you have reads that map across the whole genome, would you get a score of 1.0? Is this correct, or does it mean something else?

Thanks,

Lyndsay

@t-neumann

I have the same question, actually. I contacted the developers via email and got no answer.

It would be really helpful to get a formula and have this clarified...

@pxo289 commented Jan 10, 2019

I've got a somewhat related issue: I get loads of classifications for which the Centrifuge summary indicates thousands of reads (and thousands of unique reads as well, since I restricted scoring to a single classification per read) but an abundance of exactly 0. Looking at the equation in the paper for calculating abundance, I'm not sure why anything reported as present would ever have an abundance of exactly 0. There doesn't seem to be a pattern to it either: it affects eukaryotes and prokaryotes, and classifications from species level all the way up to kingdom level.

@ExplodingCabbage

@pxo289 I see the same output as you, where everything gets an abundance of 0 despite a large proportion of all reads being uniquely assigned to one organism. There's a bug somewhere, evidently. Note that there's a separate issue tracking it: #158

@lec49, I'll also attempt to answer your original question, based on my best attempt at understanding the paper at https://genome.cshlp.org/content/26/12/1721.full.pdf, with the caveat that, as noted above, I haven't actually got Centrifuge working on my reads yet.

You wrote: "My understanding was that it was the length of all the reads that map to that genome divided by the total length of the genome."

No - the length of reads is irrelevant. It's the count of reads that matters.

Rather, the abundance values attempt to convey what proportion of organisms in a sample were from each species. That calculation is informed by both the number of reads assigned to each species and the lengths of those species' reference genomes. The key formula is this one from the end of page 4:

[Screenshot of the likelihood formula from the end of page 4 of the paper]
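
In case the screenshot doesn't come through, here's my best attempt at transcribing it (treat the notation as my reading of the paper, not something authoritative): with R reads, S species, α_j the abundance of species j, l_j the genome length of species j, and C_ij equal to 1 if read i is assigned to species j and 0 otherwise,

```latex
L(C \mid \alpha) = \prod_{i=1}^{R} \sum_{j=1}^{S} \frac{\alpha_j \, l_j}{\sum_{k=1}^{S} \alpha_k \, l_k} \, C_{ij}
```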

This gives the likelihood of a given configuration of abundances, α, given a set C of assignments of reads to species. (Equivalently, then, it's the probability of a set of assignments C of reads to species given a set of abundances.) From the formula we can reverse-engineer Centrifuge's assumptions:

  • If a read is assigned to multiple species (let's say x, y and z) then all we can infer is that it came from species x or y or z. Since these possibilities are mutually exclusive, P(read came from x or y or z | α) = P(read came from x | α) + P(read came from y | α) + P(read came from z | α).
  • The probability of a given read coming from a given species, given a particular configuration of abundances, is proportional to the species' genome length. That is, if species A and B are present in equal abundance, but B has twice as long a genome as A, then we'll see twice as many reads from B (since there's twice as much DNA present per organism).

That second assumption is where the 'normalized by its genomic length' idea comes in.
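
To make that concrete with made-up numbers: say A's genome is 2 Mbp, B's is 4 Mbp, and the two species are equally abundant (α_A = α_B = 0.5). Then under the model the chance that any individual read came from B is

```latex
P(\text{read from } B \mid \alpha)
  = \frac{\alpha_B l_B}{\alpha_A l_A + \alpha_B l_B}
  = \frac{0.5 \times 4}{0.5 \times 2 + 0.5 \times 4}
  = \frac{2}{3}
```

so B contributes two thirds of the reads even though it's only half of the organisms; dividing read fractions by genome length is what undoes that bias and turns them back into abundances.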

Subject to those assumptions, Centrifuge uses the screenshotted formula to find the highest-likelihood configuration of abundances.
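
If it helps to see the mechanics, here's a rough EM-style sketch in Python of that maximization. To be clear, this is not Centrifuge's actual implementation — the function name, the uniform initialization, the convergence threshold and the toy data are all my own assumptions — it's only meant to show how read assignments plus genome lengths can be turned into length-normalized abundances under the model above.

```python
import numpy as np

def estimate_abundances(assignments, genome_lengths, n_iter=1000, tol=1e-10):
    """EM-style abundance estimation sketch (not Centrifuge's actual code).

    assignments    -- one set of species indices per read (a read assigned to
                      several species just lists all of them)
    genome_lengths -- genome length for each species
    Returns abundances that sum to 1.
    """
    lengths = np.asarray(genome_lengths, dtype=float)
    n_species = len(lengths)
    alpha = np.full(n_species, 1.0 / n_species)   # assumption: start uniform

    for _ in range(n_iter):
        # E-step: split each read fractionally among its candidate species,
        # in proportion to alpha_j * l_j (the per-read probability above).
        expected_reads = np.zeros(n_species)
        for candidates in assignments:
            idx = list(candidates)
            w = alpha[idx] * lengths[idx]
            expected_reads[idx] += w / w.sum()

        # M-step: divide expected read counts by genome length ("normalized by
        # its genomic length"), then renormalize so the abundances sum to 1.
        new_alpha = expected_reads / lengths
        new_alpha /= new_alpha.sum()

        if np.abs(new_alpha - alpha).max() < tol:   # converged
            return new_alpha
        alpha = new_alpha

    return alpha

# Toy example: species B's genome is twice as long as species A's.
# 10 reads hit only A, 20 hit only B, 5 are ambiguous between the two.
lengths = [2_000_000, 4_000_000]
assignments = [{0}] * 10 + [{1}] * 20 + [{0, 1}] * 5
print(estimate_abundances(assignments, lengths))   # -> approximately [0.5, 0.5]
```

In the toy example, B gets twice as many uniquely assigned reads as A, but because its genome is twice as long the estimated abundances come out equal, which is exactly the "proportion of this genome normalized by its genomic length" behaviour the manual describes.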

@aistBMRG commented Jun 5, 2019

Using Illumina data from mock communities, I have also often found unexpected abundance estimates: many reads map uniquely, yet the estimated abundances are very low or zero. It seems that the expectation-maximization step may be the cause. Could the developers please comment on this?

Thanks.

@harisankarsadasivan

Any developments in relation to this thread?
