
Implement Levenshtein term similarity matrix and fast SCM between corpora #2016

Merged
merged 66 commits into from
Jan 14, 2019

Conversation

Witiko
Contributor

@Witiko Witiko commented Apr 4, 2018

Introduction

This is a follow-up to #1827 (Implement Soft Cosine Measure). The original implementation included only a single term similarity matrix based on word embeddings. The new commits add a Levenshtein term similarity matrix. To prevent code duplication and to reduce complexity, I have separated the matrix building algorithm from the code that retrieves the most similar terms. In reaction to #1955, the Soft Cosine Measure (SCM) can now be computed not only between a pair of vectors, but also between a corpus and a vector, and between a pair of corpora. This last point is also the future work suggested in #1827. The open issues concern the placement of new code, the deprecation of old code, and speeding up the Levenshtein distance implementation.

The gensim.similarities.termsim module

A major structural change is the addition of the gensim.similarities.termsim module. Before this addition, there existed the WordEmbeddingsKeyedVectors.similarity_matrix method that contained both the matrix building algorithm and the algorithm for retrieving the most similar terms for a given term. Now the matrix building algorithm has been moved into a separate SparseTermSimilarityMatrix director class, and the algorithm for retrieving the most similar terms has been moved into a separate WordEmbeddingSimilarityIndex builder class that implements the TermSimilarityIndex interface. This change follows the single responsibility principle.

Two new classes implementing the TermSimilarityIndex interface have also been added. The LevenshteinSimilarityIndex retrieves the most similar terms according to the “Levenshtein similarity” described by Charlet and Damnati, 2017 [1, sec. 2.2]. The UniformTermSimilarityIndex assumes all distinct terms are equally similar and its main use is in testing SparseTermSimilarityMatrix.
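To make the new architecture concrete, here is a minimal self-contained sketch of the builder/director split. The class and method names mirror the PR, but the bodies are simplified stand-ins rather than the actual Gensim implementation:

```python
from scipy.sparse import dok_matrix


class TermSimilarityIndex(object):
    """Interface: given a term, yield (most similar term, similarity) pairs."""
    def most_similar(self, term, topn=10):
        raise NotImplementedError


class UniformTermSimilarityIndex(TermSimilarityIndex):
    """All pairs of distinct terms get the same similarity (mainly for testing)."""
    def __init__(self, terms, term_similarity=0.5):
        self.terms = list(terms)
        self.term_similarity = term_similarity

    def most_similar(self, term, topn=10):
        similar = (t for t in self.terms if t != term)
        return [(t, self.term_similarity) for _, t in zip(range(topn), similar)]


class SparseTermSimilarityMatrix(object):
    """Director: builds a sparse symmetric term similarity matrix from an index."""
    def __init__(self, index, dictionary, topn=10):
        n = len(dictionary)
        term2id = {term: i for i, term in enumerate(dictionary)}
        matrix = dok_matrix((n, n))
        matrix.setdiag(1.0)  # every term is maximally similar to itself
        for t1, i in term2id.items():
            for t2, similarity in index.most_similar(t1, topn):
                j = term2id[t2]
                matrix[i, j] = matrix[j, i] = similarity
        self.matrix = matrix.tocsc()


terms = ["holiday", "vacation", "pizza"]
index = UniformTermSimilarityIndex(terms, term_similarity=0.3)
matrix = SparseTermSimilarityMatrix(index, terms).matrix
```

Since any object implementing most_similar can drive the director, plugging in a word-embedding or Levenshtein index requires no changes to the matrix building code.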

The following UML class diagram captures the new structure:

[UML class diagram: soft-cosine-uml]

The WordEmbeddingsKeyedVectors.similarity_matrix method and the similarity_matrix function in the gensim.similarities.levenshtein module currently serve as facades that construct a SparseTermSimilarityMatrix using the appropriate TermSimilarityIndex behind the scenes. This keeps the code backwards-compatible. I marked the functions for deprecation in 4.0.0, but if the Gensim policies allow, we can get rid of them sooner. Besides backwards compatibility, the facades are also convenient for the users, but there is an associated maintenance cost if we decide to keep them.

The gensim.similarities.levenshtein module

The gensim.similarities.levenshtein module contains code for computing the “Levenshtein similarity” described by Charlet and Damnati, 2017 [1]. See the benchmark for a detailed performance analysis.
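For reference, the “Levenshtein similarity” of [1, sec. 2.2] rescales the edit distance between two terms into a similarity. A plain-Python sketch (the alpha and beta defaults below are illustrative assumptions, not taken from this PR):

```python
def levdist(t1, t2):
    """Levenshtein (edit) distance via dynamic programming over a single row."""
    if len(t1) < len(t2):
        t1, t2 = t2, t1
    previous = list(range(len(t2) + 1))
    for i, c1 in enumerate(t1):
        current = [i + 1]
        for j, c2 in enumerate(t2):
            insertion = current[j] + 1
            deletion = previous[j + 1] + 1
            substitution = previous[j] + (c1 != c2)
            current.append(min(insertion, deletion, substitution))
        previous = current
    return previous[-1]


def levsim(t1, t2, alpha=1.8, beta=5.0):
    """Levenshtein similarity: alpha * (1 - normalized edit distance) ** beta."""
    distance = levdist(t1, t2)
    return alpha * (1.0 - distance / max(len(t1), len(t2))) ** beta
```

Identical terms get similarity alpha, and the beta exponent sharply penalizes pairs whose edit distance is a large fraction of the longer term.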

The SparseTermSimilarityMatrix.inner_product method

The SparseTermSimilarityMatrix.inner_product method contains code for computing the inner product between two vectors or corpora expressed in a non-orthogonal basis.
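Concretely, in the non-orthogonal basis induced by a term similarity matrix S, the inner product of two bag-of-words vectors x and y is xᵀSy, and normalizing by the corresponding vector norms yields the Soft Cosine Measure. A dense NumPy sketch (illustrative only; the actual method operates on sparse matrices and corpora):

```python
import numpy as np


def inner_product(x, y, S, normalized=True):
    """<x, y> = x^T S y in the basis induced by term similarity matrix S."""
    product = x.dot(S).dot(y)
    if normalized:
        product /= np.sqrt(x.dot(S).dot(x)) * np.sqrt(y.dot(S).dot(y))
    return product


S = np.array([[1.0, 0.5],   # two distinct but similar terms
              [0.5, 1.0]])
x = np.array([1.0, 0.0])    # document containing only the first term
y = np.array([0.0, 1.0])    # document containing only the second term
print(inner_product(x, y, S))  # 0.5: nonzero despite disjoint vocabularies
```

With S equal to the identity matrix, the normalized inner product degenerates to the ordinary cosine similarity, which would be 0 for these two documents.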

There is a bit of “smart” linear algebra involved in computing the inner product between two L2-normalized m×n corpus matrices X and Y, which I will briefly describe here. We need to normalize each column document vector x in X by sqrt(xᵀSx), which is equivalent to the entrywise (Hadamard) division of each row in X by the diagonal of sqrt(XᵀSX), where S is the m×m term similarity matrix. However, computing XᵀSX is an O(mn²) ≈ O(m⁵) operation (the asymptotic estimates assume n ≈ m² documents). We can instead directly compute the column vector of the diagonal as the square root of (XᵀS) ∘ Xᵀ summed along the row axis, where ∘ is the entrywise product, which is an O(m²n) ≈ O(m⁴) operation.
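A quick NumPy sketch with random data (purely to illustrate the equivalence, not code from the PR) confirms that the entrywise product summed along the row axis yields the same document norms as the diagonal of the full product:

```python
import numpy as np

rng = np.random.RandomState(42)
m, n = 5, 4                         # m terms, n documents
S = rng.rand(m, m)
S = (S + S.T) / 2 + m * np.eye(m)   # symmetric, diagonally dominant similarity matrix
X = rng.rand(m, n)                  # column-oriented corpus matrix

# Naive O(mn^2): build the full n-by-n product and take its diagonal.
norms_slow = np.sqrt(np.diag(X.T.dot(S).dot(X)))

# Fast O(m^2 n): entrywise product summed along the row axis.
norms_fast = np.sqrt((X.T.dot(S) * X.T).sum(axis=1))

assert np.allclose(norms_slow, norms_fast)

# Normalizing each document column then amounts to an entrywise division.
X_normalized = X / norms_fast
```

After the division, every document column has unit norm with respect to the inner product induced by S.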

The SparseTermSimilarityMatrix.inner_product method is a more general variant of the gensim.matutils.softcossim function, which only computes the inner product between two normalized vectors, not between general vectors or corpora. I marked the matutils.softcossim function for deprecation in 4.0.0, but if the Gensim policies allow, we can get rid of it sooner.

Since the gensim.similarities.termsim module imports the corpus2csc function from the matutils module, importing SparseTermSimilarityMatrix.inner_product from gensim.matutils would result in a circular dependency. Therefore, the softcossim function carries all the old code instead of calling inner_product.

Future work

The SparseTermSimilarityMatrix class constructor uses the dictionary of keys (DOK) sparse matrix format to incrementally build a sparse matrix. This is convenient, but, as explained by @maciejkula, space-inefficient. I observed a 10-fold increase in RAM usage compared to using three dynamic arrays (two for the indices, and one for the data) with the shortest possible unsigned integer data types. A similar technique will be implemented in the SparseTermSimilarityMatrix class constructor.

The levdist function admits a max_distance parameter that allows us to terminate the computation of the Levenshtein distance early. This optimization will be contributed to Antti Haapala's python-Levenshtein module (see also the discussion below).
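The early-termination idea can be sketched as follows (a hypothetical pure-Python levdist with a max_distance cutoff, not the python-Levenshtein implementation): the minimum of a dynamic-programming row never decreases from one row to the next, so the computation can stop as soon as every entry of the current row exceeds the cutoff.

```python
def levdist(t1, t2, max_distance=float("inf")):
    """Levenshtein distance, returning max_distance + 1 early once exceeded."""
    if abs(len(t1) - len(t2)) > max_distance:
        return max_distance + 1  # the length difference alone rules the pair out
    previous = list(range(len(t2) + 1))
    for i, c1 in enumerate(t1):
        current = [i + 1]
        for j, c2 in enumerate(t2):
            current.append(min(
                current[j] + 1,            # insertion
                previous[j + 1] + 1,       # deletion
                previous[j] + (c1 != c2),  # substitution (free if chars match)
            ))
        if min(current) > max_distance:
            return max_distance + 1  # no path can get back under the cutoff
        previous = current
    return previous[-1]
```

When retrieving most similar terms, most candidate pairs are far apart, so a tight cutoff lets the loop abandon them after only a few rows.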

The LevenshteinSimilarityIndex.most_similar method currently uses the pointwise Levenshtein distance to retrieve the most similar terms to a given term in average time O(mn²), where m is the number of terms in a dictionary and n is the average word length. When used to compute a term similarity matrix, this results in average time O(m ⋅ mn²) = O(m²n²). A more time-efficient procedure for computing the Levenshtein distance between all terms in a dictionary will be implemented.

References

  1. Delphine Charlet and Geraldine Damnati, SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering, 2017. (link to PDF)
  2. Vít Novotný. Implementation Notes for the Soft Cosine Measure, 2018. (link to PDF)

@@ -775,6 +776,9 @@ def cossim(vec1, vec2):
return result


@deprecated(
"Function will be removed in 4.0.0, use " +
Contributor

nitpick: no need to use + for concatenation if this happens in ().

Contributor Author

I will fix this once we figure out what to actually deprecate.

setup.py Outdated
@@ -309,6 +309,7 @@ def finalize_options(self):
'scipy >= 0.18.1',
'six >= 1.5.0',
'smart_open >= 1.2.1',
'python-Levenshtein >= 0.10.2'
Contributor

sorry, but we should have serious reasons for adding a new core dependency
CC: @piskvorky

Contributor

I see 2 ways

  • implement this functionality yourself
  • add it as "conditional" import (and move to test dependencies)

Contributor Author

This dependency might be temporary, since the current Levenshtein distance implementation is inefficient. We discussed more efficient bulk Levenshtein distance implementation in #1955.

Owner

@piskvorky piskvorky Apr 6, 2018

Yes, introducing new core dependencies is not desirable. Especially if it's only used in new, experimental modules.

I agree that naive implementations that just call dist(s1, s2) repeatedly are not very useful in practice, but it's good as a baseline and for checking regressions.

For "non-experimental" code we'd want something that takes the concrete problem constraints into account: pre-calculating static parts, tries, automata, indexes, early-out when maximum acceptable distance exceeded etc.

@@ -371,6 +372,7 @@ def testIter(self):


class TestSoftCosineSimilarity(unittest.TestCase, _TestSimilarityABC):
@deprecated("Method will be removed in 4.0.0")
Contributor

I don't think it is a good idea to deprecate something in tests. Anyway, if we remove the old code, the tests will break and we'll see it clearly.

Contributor Author

After we have decided what to actually deprecate, I will remove the annotations from tests. They are currently there just to make it clear what parts of the code are proposed for deprecation.

@@ -0,0 +1,131 @@
#!/usr/bin/env python
Contributor

why not place it in test_similarities (same for test_term_similarity)?

logger = logging.getLogger(__name__)


class TermSimilarityIndex(SaveLoad):
Contributor

I don't think that gensim.models is the right place for it (same for levenshtein), we have gensim.similarities for this kind of stuff (exception only for KeyedVectors)

CC: @piskvorky

Contributor Author

@Witiko Witiko Apr 5, 2018

The module placement is one of the things that I hoped we could discuss, since there is little documentation about where things should go. Both modules implement term similarity models; if you think they would feel more at home in gensim.similarities, then to gensim.similarities they will go.

Contributor

Yes, I think it is better to place it in gensim.similarities, because

  • This is purely about the similarity (and we already have submodule for it)
  • This doesn't "quack" like a standard gensim model

self.dictionary = dictionary
self.term_similarity = term_similarity

def most_similar(self, t1, topn=10):
Contributor

Looks like this isn't ready, am I right?

Contributor Author

@Witiko Witiko Apr 5, 2018

This is actually quite ready. UniformTermSimilarityIndex assigns a constant similarity to any pair of distinct words; the main use is for testing SparseTermSimilarityMatrix.

Contributor Author

It is also quite useful for benchmarking the maximum throughput of the SparseTermSimilarityMatrix constructor.

Contributor Author

Another use would be the construction of an identity term similarity matrix by setting term_similarity to zero.

@Witiko Witiko force-pushed the levenshtein-softcossim branch 2 times, most recently from fb4e3ff to 44e68f8 Compare April 5, 2018 09:49
@menshikh-iv
Contributor

@Witiko please write a TO-DO list for this PR (what else needs to change) for simpler navigation / better understanding


@Witiko
Contributor Author

Witiko commented Apr 8, 2018

I have finished the description and the TODO list as requested, and I inserted both into the first post for improved legibility. Any comments, especially on the deprecation TODO list items, are welcome.

@Witiko Witiko force-pushed the levenshtein-softcossim branch 3 times, most recently from a2de996 to f7388e1 Compare April 11, 2018 11:45
@Witiko Witiko force-pushed the levenshtein-softcossim branch 2 times, most recently from 190bf91 to 2c11ff2 Compare April 11, 2018 16:22
@Witiko
Contributor Author

Witiko commented Sep 11, 2018

@menshikh-iv @piskvorky Just a reminder that this PR:

  • passes tests on Python 2.7.9 and 3.4.4 (the CI failures are false positives) and has no conflicts with the develop branch,
  • improves the design of the SCM by decomposing it into a matrix builder, a number of term similarity indices, and a routine for computing the inner product of corpora / documents,
  • improves the speed of the inner product of two corpora by a factor of thousands for large corpora,
  • maintains backwards compatibility with the API established in PR Implement Soft Cosine Measure #1827,
  • adds preliminary support for the Levenshtein distance,
  • adds support for producing symmetric positive definite matrices that can later be factorized.

In other words, this PR should be good to go for the upcoming release. What are your thoughts?

@menshikh-iv
Contributor

@Witiko please resolve the merge conflict

@Witiko Witiko force-pushed the levenshtein-softcossim branch 3 times, most recently from 55a19d8 to 597191b Compare January 9, 2019 22:14
@Witiko
Contributor Author

Witiko commented Jan 10, 2019

@menshikh-iv I merged the develop branch. There should be no conflicts for the moment.

Contributor

@menshikh-iv menshikh-iv left a comment

Looks nice @Witiko 👍

please resolve the current review comments and I'll merge it

Review threads (since resolved) on:

  • docs/notebooks/soft_cosine_tutorial.ipynb
  • gensim/similarities/levenshtein.py
  • gensim/models/keyedvectors.py
@menshikh-iv
Contributor

Awesome work @Witiko, I hope your research is going well :)

@menshikh-iv menshikh-iv merged commit f3cf463 into piskvorky:develop Jan 14, 2019
@Witiko
Contributor Author

Witiko commented Jan 17, 2019

Below, you will find the suggested changelog.

🌟 New Features

👍 Improvements

⚠️ Deprecations (will be removed in the next major release)

  • gensim.matutils.softcossim
  • gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similarity_matrix
