Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inverted-list all-pairs can be optimised #83

Open
hamishmorgan opened this issue Jul 20, 2012 · 0 comments
Open

Inverted-list all-pairs can be optimised #83

hamishmorgan opened this issue Jul 20, 2012 · 0 comments

Comments

@hamishmorgan
Copy link
Member

The inverted-list based all-pairs method can be optimised by constructing the list on the fly, rather than pre-constructing it. The method was used in previous thesauri software versions, but was omitted in later revisions. The method is described in:

  • Bayardo, R., Ma, Y., and Srikant, R. (2007). Scaling up all pairs similarity search. In Proceedings of the 16th international conference on World Wide Web, pages 131–140.

The current all-pairs algorithm is as follows:

InvertedAllPairs0(V):
  S = {}
  for x in V where x[i] != 0:
    for i = 1 ... m:
      I[i] = I[i] union {x}
  for x in V:
    C = FindCandidates(x,I)
    for y in C where sim(x,y) > 0:
      S = S union sim(x,y)
  return S    

FindCandidates(x, I):
  C = {}
  for i in 1 to m where x[i] != 0:
    C = C union I[i]
  return C

This algorithm can be improved by building the inverted index on the fly. The first vector is compared agains an empty index, then added to it, and so on. This reduces the amount of work in FindCandidates by half, over the algorithm lifetime. This is particularly effective for commutative similarity measures (those define a symmetric notion of similarity), but will also improve non-commutative measures. To support non-commutative measures to similarity calculation sim(x,y) must be performed in both direction at once. The proposed change will result in the following algorithm:

InvertedAllPairs1(V):
  S = {}
  I[1...m] = {}
  for x in V:
    for y in FindCandidates(x,I):
      S = S union {sim(x,y), sim(y,x)}
    for i in 1 to m  where x[i] != 0:
      I[i] = I[i] union {x}      
  return S    

There exists further optimisations, proposed in the above-cited and elsewhere. However they all have undesirable side-effects with respect the flexibility and scalability. For example score accumulation can be performed during index construction; but this increases the memory requirements since all scores must be stored until the end of the process. The inclusion of a similarity threshold can be leveraged to further limit the candidates set, by only includes those neighbours which can possibly contribute to a similarity that excedes the threshold. The problem with this is that it pre-supposes properties of the similarity measure that are not always true.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant