
Add some popular distance metrices #33

Closed
wants to merge 18 commits into from

Conversation

@sgibb (Member) commented Dec 11, 2019

This PR implements a few distance/similarity measurements that are discussed in #29 and #30.

Would be great if @tobiasko could also have a look at the implementation.

The implemented scores are proposed in:

- Stein and Scott 1994
- Toprak et al. 2014

Please note that there is also a .calibrate function for scaling. I thought I would need it for the distance functions to scale them to [0, 1], but it turned out not to be necessary. I would, however, definitely need it for the normalise function. I can remove it if you like.
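
For illustration only, a minimal sketch of such a range scaling, assuming a simple min-max rescale to [0, 1]; scale01 is a made-up name, and the actual .calibrate in this PR may have a different signature and behaviour:

## Hypothetical helper (not the PR's .calibrate): linearly rescale a numeric
## vector into the range [0, 1].
scale01 <- function(x, na.rm = TRUE) {
    rng <- range(x, na.rm = na.rm)
    (x - rng[1L]) / (rng[2L] - rng[1L])
}

scale01(c(2, 4, 6, 10))
## [1] 0.00 0.25 0.50 1.00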

@sgibb sgibb added the enhancement New feature or request label Dec 11, 2019
@sgibb sgibb self-assigned this Dec 11, 2019
@jorainer (Member) left a comment:

Nice and very important contribution! The documentation could, however, be improved slightly.

x <- matrix(c(1:5, 1:5), ncol = 2)
y <- matrix(c(1:5, 5:1), ncol = 2)

test_that("ndotproduct", {
A Member commented:

Is there a way that we could cross-check with external results (i.e. results from another/original implementation of the methods)?

@sgibb (Member, Author) replied Dec 12, 2019

That would be really great. But I am only aware of our implementation in MSnbase and the one in OrgMassSpecR. The latter produces (of course) different results for the dotproduct. Especially the weighting isn't available in any other tools I know of:
https://github.com/OrgMassSpec/OrgMassSpecR/blob/aa0e4ee927b7f7051efa6400ad6f3fae8539f7c8/OrgMassSpecR/R/SpectrumSimilarity.R#L42

similarity_score <- as.vector((u %*% v) / (sqrt(sum(u^2)) * sqrt(sum(v^2))))

vs. ours (MsCoreUtils/R/distance.R, lines 79–84 at b1e1ad0):

ndotproduct <- function(x, y, m = 0L, n = 0.5, na.rm = TRUE) {
    wx <- .weightxy(x[, 1L], x[, 2L], m, n)
    wy <- .weightxy(y[, 1L], y[, 2L], m, n)
    sum(wx * wy, na.rm = na.rm)^2L /
        (sum(wx^2L, na.rm = na.rm) * sum(wy^2L, na.rm = na.rm))
}

Which (if m = 0 and n = 0.5) boils down to:

sum(sqrt(u) * sqrt(v))^2L / (sum(u) * sum(v))
# vs OrgMassSpecR
sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))

As you can see, even this "simple" normalised dotproduct is hard to implement correctly.

@lgatto, @tobiasko, @tnaake are you aware of any software/web service where we could double-check our results?

Reply:

For the spectral contrast angle (SCA) I found this paper:

https://link.springer.com/content/pdf/10.1016/S1044-0305(01)00327-0.pdf

but they do simulations and report means/medians in the paper. However, there is some basic geometric material in there that might help!

Reply:

Hmmm... what about the following: define s (actual position and intensity values do not matter) and generate s' that is orthogonal in all dimensions. This could be done by a function f(s) -> s'. The geometric definition of SCA would require that the SCA of s vs. s' is 90 degrees.
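
For illustration, a minimal sketch of that geometric check, using a plain (unweighted) spectral contrast angle on already aligned intensity vectors; sca_deg is a made-up helper, not part of this PR:

## Hypothetical helper: spectral contrast angle (in degrees) between two
## aligned intensity vectors.
sca_deg <- function(u, v) {
    acos(sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))) * 180 / pi
}

s  <- c(1, 0, 2, 0)   # peaks only in dimensions 1 and 3
so <- c(0, 3, 0, 4)   # peaks only in dimensions 2 and 4, i.e. orthogonal to s
sca_deg(s, so)
## [1] 90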

@sgibb (Member, Author) replied:

@tobiasko I am not sure that this would work, because the dotproduct is part of the spectral angle and we use a normalised dotproduct here. So maybe we have to implement an unnormalized dotproduct and spectral angle first?

@sgibb (Member, Author) replied:

@tobiasko: thanks for your code suggestions, but we already have a binning function and an outer join.

dotproduct etc. are needed in Chromatograms as well; that's why we put them here. We already have a compareSpectra in Spectra.

> This is far more powerful than R/dotproduct.R

Why? And R/dotproduct.R is obsolete with this PR.

Reply:

Why? Because...

> ## copied from
> ## https://github.com/rformassspectrometry/MsCoreUtils/blob/distance/R/distance.R
> 
> .weightxy <- function(x, y, m = 0, n = 0.5) {
+   x ^ m * y ^ n
+ }
> 
> ndotproduct <- function(x, y, m = 0L, n = 0.5, na.rm = TRUE) {
+   wx <- .weightxy(x[, 1L], x[, 2L], m, n)
+   wy <- .weightxy(y[, 1L], y[, 2L], m, n)
+   sum(wx * wy, na.rm = na.rm)^2L /
+     (sum(wx^2L, na.rm = na.rm) * sum(wy^2L, na.rm = na.rm))
+ }
> 
> ## test on case with no shared peaks
> x <- matrix(c(6:10, 1:5), ncol = 2, dimnames = list(c(), c("mz", "intensity")))
> y <- matrix(c(1:5, 1:5), ncol = 2, dimnames = list(c(), c("mz", "intensity")))
> ndotproduct(x, y)
[1] 1
> # :-(

and that should be zero by definition!? Please compare to case 2 above.

@sgibb (Member, Author) replied:

This is expected. As you mentioned above, all distance/similarity measurements require binned/aligned spectra. The input in your example is not aligned, but ndotproduct assumes it is (and for aligned intensities of 1:5 against 1:5, a similarity of 1 is correct). So a fair comparison would be:

joinndotproduct <- function(x, y) {
    j <- join(x[, "mz"], y[, "mz"], type = "outer")
    ndotproduct(x[j$x,], y[j$y,])
}

joinndotproduct(x, y)
# [1] 0

That is exactly what Spectra::compareSpectra does. compareSpectra is more powerful because it allows the user to select a joining function (outer/inner/left/right, graph-based if the other PR is accepted, or user-provided) and a distance/similarity function (one of those in this PR or any other function the user provides).

We try to build functions for just one task, following the Unix philosophy: do one thing and do it well. Your normDotProduct does at least two tasks: joining and distance measurement.

@sgibb (Member, Author) replied:

As already mentioned above, my problem with the dotproduct is the following:

The general definition of a normalized dotproduct is:

sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))

but in the literature (e.g. Stein and Scott 1994) it is

sum(sqrt(u) * sqrt(v))^2L / (sum(u) * sum(v))

(for the default weighting with m = 0 and n = 0.5)

This will definitely yield different results.
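
A quick numeric illustration of the difference, on arbitrary aligned intensities (values chosen only for demonstration):

u <- c(1, 2, 3, 4, 5)
v <- c(5, 4, 3, 2, 1)

## cosine-style normalised dotproduct
sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))
## [1] 0.6363636

## Stein and Scott 1994 style (m = 0, n = 0.5)
sum(sqrt(u) * sqrt(v))^2L / (sum(u) * sum(v))
## [1] 0.7660906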

A Member replied:

Also, to provide some background info: we separated the binning/peak matching and similarity calculation steps because we wanted to give users the choice of a specific method for each of these steps (i.e. some people might prefer binning the data while others might want to match peaks based on the difference of their m/z). The similarity calculations are performed on the matched (or aligned) spectra. It would thus also be possible to use different, base R, similarity calculation functions.

As for the reason to have these functions in MsCoreUtils and not in Spectra: some/many of them will be re-used in the Chromatograms package.
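
For illustration, a rough sketch of the two matching strategies mentioned above; bin_intensities and match_mz are made-up helpers, not the actual MsCoreUtils/Spectra API:

## Hypothetical, simplified peak-matching strategies.

## (a) binning: aggregate intensities into fixed-width m/z bins
bin_intensities <- function(mz, intensity, binwidth = 1) {
    bins <- floor(mz / binwidth)
    lvls <- seq(min(bins), max(bins))
    tapply(intensity, factor(bins, levels = lvls), sum, default = 0)
}

## (b) tolerance-based matching: pair peaks whose m/z differ by at most tol
match_mz <- function(mz1, mz2, tol = 0.01) {
    hits <- outer(mz1, mz2, function(a, b) abs(a - b) <= tol)
    which(hits, arr.ind = TRUE)
}

Either strategy yields aligned intensities (or peak pairs) that can then be handed to any of the similarity functions in this PR.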

@codecov-io commented Dec 13, 2019

Codecov Report

Merging #33 into master will decrease coverage by 0.01%.
The diff coverage is 98.03%.


@@            Coverage Diff             @@
##           master      #33      +/-   ##
==========================================
- Coverage   98.20%   98.18%   -0.02%     
==========================================
  Files          24       25       +1     
  Lines         500      495       -5     
==========================================
- Hits          491      486       -5     
  Misses          9        9              
Impacted Files Coverage Δ
src/impNeighbourAvg.c 94.73% <94.73%> (ø)
R/distance.R 100.00% <100.00%> (ø)
R/imputation.R 97.59% <100.00%> (-0.06%) ⬇️
R/rla.R 100.00% <100.00%> (ø)
src/init.c 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3df233e...a3698aa.

@jorainer jorainer self-requested a review January 7, 2020 06:54
@jorainer (Member) left a comment:

I'm fine with this PR.

Regarding the discussion on the unit tests and the evaluation of whether the functionality is working correctly: I suggest adding a @note section saying that the methods have been implemented as described in the papers but, because there is no reference implementation available, we are unable to guarantee that results would be identical.
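
One possible wording for such a note (purely illustrative; not necessarily the text that was added to the package):

#' @note These functions are implemented as described in Stein and Scott 1994
#'   and Toprak et al. 2014. Since no reference implementation was available,
#'   the results could not be cross-checked against external software and may
#'   differ from other implementations.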

@sgibb (Member, Author) commented Mar 5, 2020

@jorainer I added the note you suggested. Could we merge this now?

@sgibb (Member, Author) commented Mar 5, 2020

BTW: the Travis failure seems to be related to an upstream S4Vectors problem.

@jorainer (Member) commented Mar 5, 2020

OK for me @sgibb - you can merge.

@jorainer (Member) commented Mar 5, 2020

Regarding S4Vectors: strangely enough, Travis for PR #45 does not fail...

@lgatto (Member) commented Mar 5, 2020

I apologise in advance, because this will lead to a snake_case vs camelCase discussion...

I initially found the leading n confusing, and then inferred from the documentation that these return normalised distances. What about a more telling name, starting with norm or normali[z|s]ed?

@jorainer (Member) commented:

Honestly - I prefer dotproduct/ndotproduct over dotproduct/normDotproduct (or norm_dotproduct or normdotproduct)... but I agree, this should be clearly (better?) described in the documentation.

@sgibb (Member, Author) commented Mar 30, 2020

I don't really like normDotProduct etc. (nor norm_dot_product), and I am happy to extend the documentation. Every section starts with

ndotproduct: the normalized dot product ...

Any suggestions for improvement?

@jorainer (Member) commented:

Maybe add one general section where you specify that function names for methods that produce normalized similarities start with an n?

@sgibb (Member, Author) commented Mar 31, 2020

OK, I added a few more words.

@sgibb sgibb requested a review from jorainer March 31, 2020 10:24
@jorainer (Member) left a comment:

OK for me - again 😉

@sgibb sgibb changed the title Add some popular distance matrices Add some popular distance metrices May 8, 2020
@sgibb (Member, Author) commented May 8, 2020

Merged manually due to conflicts introduced by the addition of imputation/normalisation and bioc submission.

@sgibb sgibb closed this May 8, 2020