
Add some popular distance metrices #33

Closed
wants to merge 18 commits into from

Conversation

@sgibb (Member) commented Dec 11, 2019

This PR implements a few distance/similarity measurements that are discussed in #29 and #30.

Would be great if @tobiasko could also have a look at the implementation.

The implemented scores are proposed in:

- Stein and Scott 1994
- Toprak et al. 2014

Please note that there is also a .calibrate function for scaling. I thought I would need it for the distance functions to scale them to [0, 1], but it turned out not to be necessary. I would, however, definitely need it for the normalise function. I can remove it if you like.
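
For illustration only, a minimal sketch of such a range scaling, assuming a simple min-max rescale to [0, 1]; scale01 is a made-up name, and the actual .calibrate in this PR may have a different signature and behaviour:

## Hypothetical helper (not the PR's .calibrate): linearly rescale a numeric
## vector into the range [0, 1].
scale01 <- function(x, na.rm = TRUE) {
    rng <- range(x, na.rm = na.rm)
    (x - rng[1L]) / (rng[2L] - rng[1L])
}

scale01(c(2, 4, 6, 10))
## [1] 0.00 0.25 0.50 1.00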

@sgibb sgibb added the enhancement New feature or request label Dec 11, 2019
@sgibb sgibb self-assigned this Dec 11, 2019
@jorainer (Member) left a comment:

Nice and very important contribution! The documentation could, however, be improved slightly.

x <- matrix(c(1:5, 1:5), ncol = 2)
y <- matrix(c(1:5, 5:1), ncol = 2)

test_that("ndotproduct", {
A Member commented:

Is there a way that we could cross-check with external results (i.e. results from another/original implementation of the methods)?

@sgibb (Member, Author) replied Dec 12, 2019

That would be really great. But I am only aware of our implementation in MSnbase and the one in OrgMassSpecR. The latter produces (of course) different results for the dotproduct. Especially the weighting isn't available in any other tools I know of:
https://github.com/OrgMassSpec/OrgMassSpecR/blob/aa0e4ee927b7f7051efa6400ad6f3fae8539f7c8/OrgMassSpecR/R/SpectrumSimilarity.R#L42

similarity_score <- as.vector((u %*% v) / (sqrt(sum(u^2)) * sqrt(sum(v^2))))

vs. ours (MsCoreUtils/R/distance.R, lines 79–84 at b1e1ad0):

ndotproduct <- function(x, y, m = 0L, n = 0.5, na.rm = TRUE) {
    wx <- .weightxy(x[, 1L], x[, 2L], m, n)
    wy <- .weightxy(y[, 1L], y[, 2L], m, n)
    sum(wx * wy, na.rm = na.rm)^2L /
        (sum(wx^2L, na.rm = na.rm) * sum(wy^2L, na.rm = na.rm))
}

Which (if m = 0 and n = 0.5) boils down to:

sum(sqrt(u) * sqrt(v))^2L / (sum(u) * sum(v))
# vs OrgMassSpecR
sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))

As you can see, even this "simple" normalised dotproduct is hard to implement correctly.

@lgatto, @tobiasko, @tnaake are you aware of any software/web service where we could double-check our results?

Reply:

For the spectral contrast angle (SCA) I found this paper:

https://link.springer.com/content/pdf/10.1016/S1044-0305(01)00327-0.pdf

but they do simulations and report means/medians in the paper. However, there is some basic geometric material in there that might help!

Reply:

Hmmm... what about the following: define s (actual position and intensity values do not matter) and generate s' that is orthogonal in all dimensions. This could be done by a function f(s) -> s'. The geometric definition of SCA would require that the SCA of s vs. s' is 90 degrees.
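
For illustration, a minimal sketch of that geometric check, using a plain (unweighted) spectral contrast angle on already aligned intensity vectors; sca_deg is a made-up helper, not part of this PR:

## Hypothetical helper: spectral contrast angle (in degrees) between two
## aligned intensity vectors.
sca_deg <- function(u, v) {
    acos(sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))) * 180 / pi
}

s  <- c(1, 0, 2, 0)   # peaks only in dimensions 1 and 3
so <- c(0, 3, 0, 4)   # peaks only in dimensions 2 and 4, i.e. orthogonal to s
sca_deg(s, so)
## [1] 90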

@sgibb (Member, Author) replied:

@tobiasko I am not sure that this would work, because the dotproduct is part of the spectral angle and we use a normalised dotproduct here. So maybe we have to implement an unnormalized dotproduct and spectral angle first?

@sgibb (Member, Author) replied:

@tobiasko: thanks for your code suggestions, but we already have a binning function and an outer join.

dotproduct etc. are needed in Chromatograms as well; that's why we put them here. We already have a compareSpectra in Spectra.

> This is far more powerful than R/dotproduct.R

Why? And R/dotproduct.R is obsolete with this PR.

Reply:

Why? Because...

> ## copied from
> ## https://github.com/rformassspectrometry/MsCoreUtils/blob/distance/R/distance.R
> 
> .weightxy <- function(x, y, m = 0, n = 0.5) {
+   x ^ m * y ^ n
+ }
> 
> ndotproduct <- function(x, y, m = 0L, n = 0.5, na.rm = TRUE) {
+   wx <- .weightxy(x[, 1L], x[, 2L], m, n)
+   wy <- .weightxy(y[, 1L], y[, 2L], m, n)
+   sum(wx * wy, na.rm = na.rm)^2L /
+     (sum(wx^2L, na.rm = na.rm) * sum(wy^2L, na.rm = na.rm))
+ }
> 
> ## test on case with no shared peaks
> x <- matrix(c(6:10, 1:5), ncol = 2, dimnames = list(c(), c("mz", "intensity")))
> y <- matrix(c(1:5, 1:5), ncol = 2, dimnames = list(c(), c("mz", "intensity")))
> ndotproduct(x, y)
[1] 1
> # :-(

and that should be zero by definition!? Please compare to case 2 above.

@sgibb (Member, Author) replied:

This is expected. As you mentioned above, all distance/similarity measurements require binned/aligned spectra. The input in your example is not aligned, but ndotproduct assumes it is (and for aligned intensities of 1:5 against 1:5, a similarity of 1 is correct). So a fair comparison would be:

joinndotproduct <- function(x, y) {
    j <- join(x[, "mz"], y[, "mz"], type = "outer")
    ndotproduct(x[j$x,], y[j$y,])
}

joinndotproduct(x, y)
# [1] 0

That is exactly what Spectra::compareSpectra does. compareSpectra is more powerful because it allows the user to select a joining function (outer/inner/left/right, graph-based if the other PR is accepted, or user-provided) and a distance/similarity function (one of those in this PR or any other function the user provides).

We try to build functions for just one task, following the Unix philosophy: do one thing and do it well. Your normDotProduct does at least two tasks: joining and distance measurement.

@sgibb (Member, Author) replied:

As already mentioned above, my problem with the dotproduct is the following:

The general definition of a normalized dotproduct is:

sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))

but in the literature (e.g. Stein and Scott 1994) it is

sum(sqrt(u) * sqrt(v))^2L / (sum(u) * sum(v))

(for the default weighting with m = 0 and n = 0.5)

This will definitely yield different results.
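
A quick numeric illustration of the difference, on arbitrary aligned intensities (values chosen only for demonstration):

u <- c(1, 2, 3, 4, 5)
v <- c(5, 4, 3, 2, 1)

## cosine-style normalised dotproduct
sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))
## [1] 0.6363636

## Stein and Scott 1994 style (m = 0, n = 0.5)
sum(sqrt(u) * sqrt(v))^2L / (sum(u) * sum(v))
## [1] 0.7660906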

A Member replied:

Also, to provide some background info: we separated the binning/peak matching and similarity calculation steps because we wanted to give users the choice of a specific method for each of these steps (i.e. some people might prefer binning the data while others might want to match peaks based on the difference of their m/z). The similarity calculations are performed on the matched (or aligned) spectra. It would thus also be possible to use different, base R, similarity calculation functions.

As for the reason to have these functions in MsCoreUtils and not in Spectra: some/many of them will be re-used in the Chromatograms package.
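
For illustration, a rough sketch of the two matching strategies mentioned above; bin_intensities and match_mz are made-up helpers, not the actual MsCoreUtils/Spectra API:

## Hypothetical, simplified peak-matching strategies.

## (a) binning: aggregate intensities into fixed-width m/z bins
bin_intensities <- function(mz, intensity, binwidth = 1) {
    bins <- floor(mz / binwidth)
    lvls <- seq(min(bins), max(bins))
    tapply(intensity, factor(bins, levels = lvls), sum, default = 0)
}

## (b) tolerance-based matching: pair peaks whose m/z differ by at most tol
match_mz <- function(mz1, mz2, tol = 0.01) {
    hits <- outer(mz1, mz2, function(a, b) abs(a - b) <= tol)
    which(hits, arr.ind = TRUE)
}

Either strategy yields aligned intensities (or peak pairs) that can then be handed to any of the similarity functions in this PR.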

@codecov-io commented Dec 13, 2019

Codecov Report

Merging #33 into master will decrease coverage by 0.01%.
The diff coverage is 98.03%.


@@            Coverage Diff             @@
##           master      #33      +/-   ##
==========================================
- Coverage   98.20%   98.18%   -0.02%     
==========================================
  Files          24       25       +1     
  Lines         500      495       -5     
==========================================
- Hits          491      486       -5     
  Misses          9        9              
Impacted Files Coverage Δ
src/impNeighbourAvg.c 94.73% <94.73%> (ø)
R/distance.R 100.00% <100.00%> (ø)
R/imputation.R 97.59% <100.00%> (-0.06%) ⬇️
R/rla.R 100.00% <100.00%> (ø)
src/init.c 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3df233e...a3698aa.

@jorainer jorainer self-requested a review January 7, 2020 06:54
@jorainer (Member) left a comment:

I'm fine with this PR.

Regarding the discussion on the unit tests and the evaluation of whether the functionality is working correctly: I suggest adding a @note section saying that the methods have been implemented as described in the papers but, because there is no reference implementation available, we are unable to guarantee that results would be identical.
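
One possible wording for such a note (purely illustrative; not necessarily the text that was added to the package):

#' @note These functions are implemented as described in Stein and Scott 1994
#'   and Toprak et al. 2014. Since no reference implementation was available,
#'   the results could not be cross-checked against external software and may
#'   differ from other implementations.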

@sgibb (Member, Author) commented Mar 5, 2020

@jorainer I added the note you suggested. Could we merge this now?

@sgibb (Member, Author) commented Mar 5, 2020

BTW: the Travis failure seems to be related to an upstream S4Vectors problem.

@jorainer (Member) commented Mar 5, 2020

OK for me @sgibb - you can merge.

@jorainer (Member) commented Mar 5, 2020

Regarding S4Vectors: strangely enough, Travis for PR #45 does not fail...

@lgatto (Member) commented Mar 5, 2020

I apologise in advance, because this will lead to a snake_case vs camelCase discussion...

I initially found the leading n confusing, and then inferred from the documentation that these return normalised distances. What about a more telling name, starting with norm or normali[z|s]ed?

@jorainer (Member) commented:

Honestly - I prefer dotproduct/ndotproduct over dotproduct/normDotproduct (or norm_dotproduct or normdotproduct)... but I agree, this should be clearly (better?) described in the documentation.

@sgibb (Member, Author) commented Mar 30, 2020

I don't really like normDotProduct etc. (nor norm_dot_product), and I am happy to extend the documentation. Every section starts with

ndotproduct: the normalized dot product ...

Any suggestions for improvement?

@jorainer (Member) commented:

Maybe add one general section where you specify that function names for methods that produce normalized similarities start with an n?

@sgibb (Member, Author) commented Mar 31, 2020

OK, I added a few more words.

@sgibb sgibb requested a review from jorainer March 31, 2020 10:24
@jorainer (Member) left a comment:

OK for me - again 😉

@sgibb sgibb changed the title Add some popular distance matrices Add some popular distance metrices May 8, 2020
@sgibb (Member, Author) commented May 8, 2020

Merged manually due to conflicts introduced by the addition of imputation/normalisation and bioc submission.

@sgibb sgibb closed this May 8, 2020