New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add some popular distance metrices #33
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice and very important contribution! Documentation could however be improved slightly.
x <- matrix(c(1:5, 1:5), ncol = 2) | ||
y <- matrix(c(1:5, 5:1), ncol = 2) | ||
|
||
test_that("ndotproduct", { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a way that we could cross-check with external results (i.e. results from another/original implementation of the methods)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be really great. But I only aware of our implementation in MSnbase
and the one in OrgMassSpecR
. The latter produce (of course) different results for the dotproduct. Especially the weighting isn't available in any tools I know:
https://github.com/OrgMassSpec/OrgMassSpecR/blob/aa0e4ee927b7f7051efa6400ad6f3fae8539f7c8/OrgMassSpecR/R/SpectrumSimilarity.R#L42
similarity_score <- as.vector((u %*% v) / (sqrt(sum(u^2)) * sqrt(sum(v^2))))
vs our:
Lines 79 to 84 in b1e1ad0
ndotproduct <- function(x, y, m = 0L, n = 0.5, na.rm = TRUE) { | |
wx <- .weightxy(x[, 1L], x[, 2L], m, n) | |
wy <- .weightxy(y[, 1L], y[, 2L], m, n) | |
sum(wx * wy, na.rm = na.rm)^2L / | |
(sum(wx^2L, na.rm = na.rm) * sum(wy^2L, na.rm = na.rm)) | |
} |
Which (if m = 0
and n = 0.5
) boils down to:
sum(sqrt(u) * sqrt(v))^2L / (sum(u) * sum(v))
# vs OrgMassSpecR
sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))
As you see already this "simple" normalised dotproduct is hard to implement correctly.
@lgatto, @tobiasko, @tnaake are you aware of any software/web service were we can double check our results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the spectral contrast angle (SCA) I found this paper:
https://link.springer.com/content/pdf/10.1016/S1044-0305(01)00327-0.pdf
but they do simulations and list means/medians in the paper. But there is some basic geometric stuff in there that might help!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm...what about the following: Define s (actual position and intensity values do not matter), generate s' that is orthogonal in all dimension. Could be done by a function f(s) -> s'. The geometric definition of SCA would require that SCA of s vs. s` is 90 degree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tobiasko I am not sure that this is working because we the dotproduct
is part of the spectralangle
and we use a normalised dotpoduct here. So may we have to implement an unnoramlized dotproduct and spectral angle first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tobiasko: thanks for your code suggestions but we already have a binning and an outer join.
dotproduct
etc. are needed in Chromatograms
as well. That's why we put it here. We already have a compareSpectra
in Spectra
.
This is far more powerful than R/dotproduct.R
Why? And R/dotproduct.R
is obsolete with this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? Because...
> ## copied from
> ## https://github.com/rformassspectrometry/MsCoreUtils/blob/distance/R/distance.R
>
> .weightxy <- function(x, y, m = 0, n = 0.5) {
+ x ^ m * y ^ n
+ }
>
> ndotproduct <- function(x, y, m = 0L, n = 0.5, na.rm = TRUE) {
+ wx <- .weightxy(x[, 1L], x[, 2L], m, n)
+ wy <- .weightxy(y[, 1L], y[, 2L], m, n)
+ sum(wx * wy, na.rm = na.rm)^2L /
+ (sum(wx^2L, na.rm = na.rm) * sum(wy^2L, na.rm = na.rm))
+ }
>
> ## test on case with no shared peaks
> x <- matrix(c(6:10, 1:5), ncol = 2, dimnames = list(c(), c("mz", "intensity")))
> y <- matrix(c(1:5, 1:5), ncol = 2, dimnames = list(c(), c("mz", "intensity")))
> ndotproduct(x, y)
[1] 1
> # :-(
and that should be zero by definition!? Please compare to case 2 above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These is expected. As you mentioned above all distance/similarity measurements require binned spectra. The input in your example is not aligned but ndotproduct
assumes it is (and for that an intensity of 1:5
against an intensity 1:5
should return a similarity of 1
). So a fair comparison would be:
joinndotproduct <- function(x, y) {
j <- join(x[, "mz"], y[, "mz"], type = "outer")
ndotproduct(x[j$x,], y[j$y,])
}
joinndotproduct(x, y)
# [1] 0
That is exactly what Spectra::compareSpectra
is doing. compareSpectra
is more powerful because it allows the user to select a joining function (outer/inner/left/right/(graph based if the other PR will be accepted)/user provided) and a distance/similarity function (one of these that are part of this PR or any other function the user provides).
We try to build functions for just one task (following the unix philosophy: doing on thing and do it well. Your normDotProduct
does at least 2 task: joining and distance measurement).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As already mentioned above my problem with the dotproduct is the following:
The general definition of a normalized dotproduct is:
sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2))
but in the literature (e.g. Stein and Scott 1994) it is
sum(sqrt(u) * sqrt(v))^2L / (sum(u) * sum(v)))
(for the default weighting with m = 0, and n = 0.5)
This will definitively yield different results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also providing some background info: we did separate the binning/peak matching and similarity calculation steps because we wanted to give the users the choice to select a specific method for each of these steps (i.e. some people might prefer binning the data while others might want to match peaks based on difference of their m/z). The similarity calculations are performed on the matched (or aligned spectra). It would thus also be possible to use different, base R, similarity calculation functions.
For the reason to have these in MsCoreUtils
and not in Spectra
: some/many of these functions will be re-used in the Chromatograms
package.
Codecov Report
@@ Coverage Diff @@
## master #33 +/- ##
==========================================
- Coverage 98.20% 98.18% -0.02%
==========================================
Files 24 25 +1
Lines 500 495 -5
==========================================
- Hits 491 486 -5
Misses 9 9
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with this PR.
Regarding the discussion on the unit tests and evaluation whether the functionality is working correctly: I suggest to add a @note
section saying that the methods have been implemented as described in the papers, but, because there is no reference implementation available, we are unable to guarantee that results would be identical.
@jorainer I added the note you suggested. Could we merge this now? |
BTW: the travis failure seems to be related to a |
OK for me @sgibb - you can merge. |
Regarding |
I apologise in advance, because this will lead to a snake_case vs camelCase discussion... I initially found the leading n confusing, and then inferred from the documentation that these returned normalised distances. What about a more telling name, starting with norm or normali[z|s]ed. |
Honestly - I prefer |
I don't really like
Any suggestions for improvement? |
maybe one general section where you specify that function names for methods that produce normalized similarities start with an n? |
Ok, I a few more words. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK for me - again 😉
Merged manually due to conflicts introduced by the addition of imputation/normalisation and bioc submission. |
This PR implements a few distance/similarity measurements that are discussed in #29 and #30.
Would be great if @tobiasko could also have a look at the implementation.
The implemented scores are proposed in:
Stein and Scott 1994
Toprak et al. 2014
Please note that there is also a
.calibrate
function for scaling. I thought I would need it for the distance functions to scale them to0:1
but it seems not to be necessary. But I would definitively need it for thenormalise
function. I can remove it if you like.