Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make termsim matrix positive definite even with negative similarities #2397

Merged
merged 4 commits into from
May 4, 2019

Conversation

Witiko
Copy link
Contributor

@Witiko Witiko commented Feb 27, 2019

When a SparseTermSimilarityMatrix object is constructed with non-default parameters that satisfy the following constraints: symmetric and positive_definite and threshold < 0, the matrix embedded in the constructed object may not be positive definite. This PR hardens the code against negative term similarities, making sure that the term similarity matrix is always positive definite.

@Witiko
Copy link
Contributor Author

Witiko commented Mar 1, 2019

Note that this set of parameters is unusual, since it makes sense to threshold the similarities at zero (that is why we are using a sparse matrix afterall), so although this bugfix changes the produced matrices, few users should be affected.

@mpenkov mpenkov added the bugfix label Apr 28, 2019
@mpenkov
Copy link
Collaborator

mpenkov commented Apr 28, 2019

Thank you for this PR. Unfortunately, I'm not very familiar with the underlying code, so it's difficult for me to comment on the substance of your PR. I do notice that it appears to be a bugfix. So, I have a few questions:

  • Is it possible to add a unit test? Is it worth it?
  • How does the bug manifest itself? In other words, what can happen if we don't fix this?

@Witiko
Copy link
Contributor Author

Witiko commented Apr 28, 2019

Is it possible to add a unit test? Is it worth it?

It is possible to unit-test this specific bug. This will be useful if someone reverts the fix in the future, or reimplements the SparseTermSimilarityMatrix class.

How does the bug manifest itself? In other words, what can happen if we don't fix this?

If this bug is not fixed, the term similarity matrix produced by the SparseTermSimilarityMatrix class constructor will not always be positive definite, violating the API contract.

Gensim does not depend on the positive definiteness of a term similarity matrix at the moment. However, a user may produce word embeddings from the matrix using Cholesky factorization. This will fail for matrices that are not symmetric positive definite.

As I noted above, this bug manifests itself rarely. The users affected by the bug will be mainly those who use grid search or a similar technique to optimize the threshold parameter.

@mpenkov
Copy link
Collaborator

mpenkov commented Apr 28, 2019

Ok, could you please add a test to this PR?

@Witiko
Copy link
Contributor Author

Witiko commented Apr 29, 2019

I have just added the unit test in c87aa31.

Copy link
Collaborator

@mpenkov mpenkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this looks good to me. Thank you for your contribution.

@mpenkov mpenkov merged commit 18bcd11 into piskvorky:develop May 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants