Inverted-list all-pairs can be optimised #83

hamishmorgan · 2012-07-20T10:58:29Z

The inverted-list based all-pairs method can be optimised by constructing the list on the fly, rather than pre-constructing it. The method was used in previous thesauri software versions, but was omitted in later revisions. The method is described in:

Bayardo, R., Ma, Y., and Srikant, R. (2007). Scaling up all pairs similarity search. In Proceedings of the 16th international conference on World Wide Web, pages 131–140.

The current all-pairs algorithm is as follows:

InvertedAllPairs0(V):
  S = {}
  for x in V where x[i] != 0:
    for i = 1 ... m:
      I[i] = I[i] union {x}
  for x in V:
    C = FindCandidates(x,I)
    for y in C where sim(x,y) > 0:
      S = S union sim(x,y)
  return S    

FindCandidates(x, I):
  C = {}
  for i in 1 to m where x[i] != 0:
    C = C union I[i]
  return C

This algorithm can be improved by building the inverted index on the fly. The first vector is compared agains an empty index, then added to it, and so on. This reduces the amount of work in FindCandidates by half, over the algorithm lifetime. This is particularly effective for commutative similarity measures (those define a symmetric notion of similarity), but will also improve non-commutative measures. To support non-commutative measures to similarity calculation sim(x,y) must be performed in both direction at once. The proposed change will result in the following algorithm:

InvertedAllPairs1(V):
  S = {}
  I[1...m] = {}
  for x in V:
    for y in FindCandidates(x,I):
      S = S union {sim(x,y), sim(y,x)}
    for i in 1 to m  where x[i] != 0:
      I[i] = I[i] union {x}      
  return S

There exists further optimisations, proposed in the above-cited and elsewhere. However they all have undesirable side-effects with respect the flexibility and scalability. For example score accumulation can be performed during index construction; but this increases the memory requirements since all scores must be stored until the end of the process. The inclusion of a similarity threshold can be leveraged to further limit the candidates set, by only includes those neighbours which can possibly contribute to a similarity that excedes the threshold. The problem with this is that it pre-supposes properties of the similarity measure that are not always true.

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inverted-list all-pairs can be optimised #83

Inverted-list all-pairs can be optimised #83

hamishmorgan commented Jul 20, 2012

Inverted-list all-pairs can be optimised #83

Inverted-list all-pairs can be optimised #83

Comments

hamishmorgan commented Jul 20, 2012