
Discrepancies in L1 norm between opal and CAMI #23

Closed
dkoslicki opened this issue Dec 7, 2017 · 7 comments

@dkoslicki
Member

Metrics in CAMI were computed with this code. See lines 154-175 for the computation of L1 norm.

@fernandomeyer
Contributor

The results for L1 norm don't match because the indicated code normalizes the abundances by default. For each rank, it sums up all abundances and then divides the abundance of each taxon by that sum.

For example, the CAMI gold standard lc contains two taxa for superkingdom:

10239 superkingdom 10239 Viruses 6.3464
2 superkingdom 2 Bacteria 28.5714

The considered abundances will be:

10239: 0.1817525732
2: 0.8182474268
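A minimal sketch of the per-rank normalization described above (an illustrative reimplementation, not the actual CAMI/OPAL code): sum all abundances at a rank, then divide each taxon's abundance by that sum.

```python
def normalize_rank(abundances):
    """Normalize one rank's abundances so they sum to 1.

    abundances: dict mapping taxid -> abundance at a single rank.
    """
    total = sum(abundances.values())
    if total == 0:
        return dict(abundances)
    return {taxid: a / total for taxid, a in abundances.items()}

# Superkingdom entries of the CAMI gold standard lc from the example above
superkingdom = {"10239": 6.3464, "2": 28.5714}
normalized = normalize_rank(superkingdom)
# normalized["10239"] ≈ 0.1817525732, normalized["2"] ≈ 0.8182474268
```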

Do we want this normalization in OPAL? It also affects many other metrics.

@fernandomeyer
Contributor

Another difference: the indicated code looks for multiple predictions for the same taxon in the same profile, summing up the repeated predictions. OPAL only considers one prediction per taxon, which seems logical.

@alicemchardy

alicemchardy commented Dec 10, 2017 via email

@fernandomeyer fernandomeyer self-assigned this Dec 18, 2017
@dkoslicki
Member Author

@fernandomeyer the issue with summing up multiple predictions was my attempt at error handling. Using just one (or the first) of multiple predictions also makes sense (but is somewhat arbitrary). In general, just taking the first prediction might lead to unexpected results, but it's sort of the user's fault for a malformed *.profile file. So whichever direction you choose to go is fine with me.

@dkoslicki
Member Author

With respect to the normalization:
The rationale for normalization was that it standardizes (somewhat) the metric values. Without normalizing, the metric is "biased" towards samples that make fewer predictions. For example, if a tool only makes a prediction for 1% of the sample, the metric will be at worst 1.01, whereas a tool that predicts 50% of the abundances exactly correctly will have an L1 norm of 1. Normalizing would change this to 1.99 in the former case (close to the maximal value of 2), and still 1 in the latter.
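A toy sketch of that bias (hypothetical taxa, not OPAL code): a tool that places 1% of its abundance on a wrong taxon gets a raw L1 norm barely above 1, but a score near the maximum of 2 once its prediction is normalized.

```python
def l1_norm(gold, pred):
    """L1 distance between two abundance profiles (dicts taxid -> abundance)."""
    taxa = set(gold) | set(pred)
    return sum(abs(gold.get(t, 0.0) - pred.get(t, 0.0)) for t in taxa)

def normalize(profile):
    """Rescale a profile so its abundances sum to 1."""
    total = sum(profile.values())
    return {t: a / total for t, a in profile.items()} if total else dict(profile)

gold = {"A": 0.5, "B": 0.5}
sparse = {"C": 0.01}  # predicts only 1% of the sample, on a wrong taxon

l1_norm(gold, sparse)              # → 1.01 raw ("at worst 1.01")
l1_norm(gold, normalize(sparse))   # → 2.0 after normalizing the prediction
```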

So in summary, I do think we should allow this as an option (which is the way I coded it originally, if I recall correctly).

@fernandomeyer
Contributor

With normalization (now the default option in OPAL), OPAL matches the L1 norm of the results in https://github.com/CAMI-challenge/firstchallenge_evaluation/tree/master/profiling/data/submissions_evaluation/56bb3485727d7a24678adf67
However, the UniFrac values no longer match. From the results above, one can conclude that normalized abundances were used to compute the L1 norm but not UniFrac. Is this the desired behavior?

@fernandomeyer
Contributor

Already implemented:
- Abundances will be normalized by default for all metrics, as discussed.
- Multiple predictions for the same taxon will be summed up.
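The duplicate-handling behavior can be sketched as follows (assumed input shape of `(taxid, abundance)` pairs; not the actual OPAL parser):

```python
from collections import defaultdict

def merge_predictions(rows):
    """Sum repeated predictions for the same taxon within one profile.

    rows: iterable of (taxid, abundance) pairs.
    """
    merged = defaultdict(float)
    for taxid, abundance in rows:
        merged[taxid] += abundance  # repeated taxids accumulate
    return dict(merged)

merge_predictions([("2", 10.0), ("10239", 5.0), ("2", 2.5)])
# → {"2": 12.5, "10239": 5.0}
```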
