Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix str() method in WmdSimilarity #3282

Merged
merged 8 commits into from
Mar 30, 2022
Merged

Conversation

DingQK
Copy link
Contributor

@DingQK DingQK commented Dec 25, 2021

This PR fixes #3269 , by updating str() function through the using of latest names. The test result is as follows:

from gensim.models import Word2Vec
from gensim.similarities import WmdSimilarity

a = "Trump’s decision not to join the Trans-Pacific Partnership (TPP) came as little surprise. During his election campaign he railed against international trade deals, blaming them for job losses and focusing anger in the industrial heartland. Obama had argued that this deal would provide an effective counterweight to China in the region."
b = "Trump's decision to withdraw the US from TPP is also a first step in the administration's efforts to amass a governing coalition to push the new President's agenda, one that includes the blue-collar workers who defected from Democrats and flocked to Trump's candidacy in November."
c = "Although the Trans-Pacific Partnership had not been approved by Congress, Mr. Trump’s decision to withdraw not only doomed former President Barack Obama’s signature trade achievement, but also carried broad geopolitical implications in a fast-growing region. The deal, which was to link a dozen nations from Canada and Chile to Australia and Japan in a complex web of trade rules, was sold as a way to permanently tie the United States to East Asia and create an economic bulwark against a rising China."

def w2v(corpus):
    w2v_model = Word2Vec(corpus, alpha=0.025, vector_size=50,
         window=5, min_count=1, workers=2, epochs=50, sg = 0)
    return w2v_model

corpus = "\n".join([a,b,c])
model = w2v(corpus)
print(model, type(model))
print(WmdSimilarity([a,b,c], model))

image

@piskvorky piskvorky changed the title Fixes #3269 Fix str() method in WmdSimilarity Dec 25, 2021
@codecov
Copy link

codecov bot commented Dec 25, 2021

Codecov Report

Merging #3282 (68ca983) into develop (acd8308) will decrease coverage by 1.85%.
The diff coverage is n/a.

❗ Current head 68ca983 differs from pull request most recent head d5ba108. Consider uploading reports for the commit d5ba108 to get more accurate results

@@             Coverage Diff             @@
##           develop    #3282      +/-   ##
===========================================
- Coverage    81.39%   79.53%   -1.86%     
===========================================
  Files          122       68      -54     
  Lines        21107    11783    -9324     
===========================================
- Hits         17179     9372    -7807     
+ Misses        3928     2411    -1517     
Impacted Files Coverage Δ
gensim/similarities/docsim.py 23.95% <ø> (-0.76%) ⬇️
gensim/scripts/glove2word2vec.py 76.19% <0.00%> (-7.15%) ⬇️
gensim/corpora/wikicorpus.py 93.75% <0.00%> (-1.04%) ⬇️
gensim/models/rpmodel.py 89.47% <0.00%> (-0.53%) ⬇️
gensim/models/ldamulticore.py 90.58% <0.00%> (-0.33%) ⬇️
gensim/corpora/dictionary.py 94.17% <0.00%> (-0.09%) ⬇️
gensim/models/hdpmodel.py 71.27% <0.00%> (-0.08%) ⬇️
gensim/models/coherencemodel.py 93.24% <0.00%> (-0.07%) ⬇️
gensim/corpora/hashdictionary.py 97.56% <0.00%> (-0.06%) ⬇️
... and 91 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0bd1d06...d5ba108. Read the comment docs.

@DingQK
Copy link
Contributor Author

DingQK commented Dec 25, 2021

Hi @piskvorky ,

I have checked the only failed build wheel (3.6, ubuntu-latest, x64), but it seems to be unrelated to this PR. The error is in test_word2vec.py, about word similarity ranks. The correct rank is 2 but it returns 3. Thank you so much!

@piskvorky
Copy link
Owner

piskvorky commented Dec 25, 2021

@austereantelope seems related to your recent PR change – please check.

@austereantelope
Copy link
Contributor

austereantelope commented Dec 25, 2021

@piskvorky yes I noticed this issue, this new PR #3281 addresses it

@DingQK
Copy link
Contributor Author

DingQK commented Dec 25, 2021

@austereantelope Hi! I am wondering why would tests on Ubuntu with different python version have different check results for test_word2vec.py (some pass, some do not). Do you find out the underlying reason? Thank you so much!

@austereantelope
Copy link
Contributor

austereantelope commented Dec 25, 2021

@DingQK this test will have different results not neccessarily because of the different python version, the underlying reason is the test is non-deterministic meaning if I ran this test multiple times I might get different results (regardless of the environment). The source of the non-determinisim might be randomness in the algorithm and in this case I believe the test is using multiple workers which also might introduce some non-determinisim

@DingQK
Copy link
Contributor Author

DingQK commented Dec 25, 2021

@austereantelope Thank you so much for your explanation! Maybe the tests need to consider this non-deterministic characteristic (in test_word2vec.py).

@DingQK
Copy link
Contributor Author

DingQK commented Dec 26, 2021

Hi @piskvorky ,

According to the kind explanation of @austereantelope, if the test is indeterministic, perhaps rerunning the failed build wheel (3.6, ubuntu-latest, x64) might help? The fix on wmdsimilarity's str() method seems irrelevant to this error. Thank you so much!

@piskvorky
Copy link
Owner

piskvorky commented Dec 26, 2021

I don't see a way to re-run just one job in Github Actions (@mpenkov is this possible?). So I re-ran the entire CI workflow.

@piskvorky
Copy link
Owner

@DingQK also, can you add a unit test? Tests are under https://github.com/RaRe-Technologies/gensim/tree/develop/gensim/test

@DingQK
Copy link
Contributor Author

DingQK commented Dec 26, 2021

@piskvorky You mean a unit test solely for this str() method in wmdsimilarity (for example test_wmdsimilarity.py)? I can do that!

@piskvorky
Copy link
Owner

Yes. Ideally add the new test among existing WmdSimilarity tests (if they already exist). Thanks!

@DingQK
Copy link
Contributor Author

DingQK commented Dec 26, 2021

Yes. Ideally add the new test among existing WmdSimilarity tests (if they already exist). Thanks!

Hi @piskvorky ,

While adding unit test, I found out an inconsistency between WmdSimilarity usage in documentation and in unit tests.

In document,

model = Word2Vec(common_texts, vector_size=20, min_count=1)
index = WmdSimilarity(common_texts, model)

whereas in the definition of function and unit test,

self.cls = similarities.WmdSimilarity
self.w2v_model = Word2Vec(TEXTS, min_count=1).wv
index = self.cls(TEXTS, self.w2v_model)

Maybe the documentation needs an update? And which usage should I follow in adding the unit test? Thank you so much!

@piskvorky
Copy link
Owner

What documentation? And what inconsistency? I don't follow.

@DingQK
Copy link
Contributor Author

DingQK commented Dec 27, 2021

What documentation? And what inconsistency? I don't follow.

@piskvorky Sorry for my unclear description. The documentation I am referring to is https://radimrehurek.com/gensim/similarities/docsim.html#gensim.similarities.docsim.WmdSimilarity. The documentation's directory in this repo is docs/similarities/docsim.html.

In the documentation, the second input of the init function of WmdSimilarity is Word2Vec(). But in gensim/test/test_similarities.py, the unit tests (TestWmdSimilarity) for WmdSimilarity uses Word2Vec().wv as the second input. I am guessing through the development of WmdSimilarity, the version of WmdSimilarity.__init__ and TestWmdSimilarity are not updated simultaneously.

If this is an inconsistency, which usage should I follow and may I fix this issue? Thank you so much for your time!

@piskvorky
Copy link
Owner

I see what you mean. I'm not sure which one is correct – can you check the code please?

@DingQK
Copy link
Contributor Author

DingQK commented Dec 27, 2021

Hi @piskvorky , I check the code closely and I believe the documentation needs an update as the second input is class KeyedVectors. I updated the documentation and the annotation code. Also, I add a unit test for this str method in test_similarities.py. Thank you so much!

@piskvorky
Copy link
Owner

piskvorky commented Dec 27, 2021

Looks good, thanks!

I left one minor request in the code review; otherwise ready for review & merge under @mpenkov's watchful eye.

@DingQK
Copy link
Contributor Author

DingQK commented Dec 27, 2021

I have changed the code as requested in my latest commit. It is my pleasure to contribute! And thank you so much for your patient and kind help! @piskvorky

gensim/similarities/docsim.py Outdated Show resolved Hide resolved
@piskvorky piskvorky added the bug Issue described a bug label Feb 18, 2022
@piskvorky piskvorky added this to the Next release milestone Feb 18, 2022
@DingQK DingQK requested a review from piskvorky March 10, 2022 09:19
@mpenkov mpenkov merged commit f32581e into piskvorky:develop Mar 30, 2022
@mpenkov
Copy link
Collaborator

mpenkov commented Mar 30, 2022

Merged. Thank you for your patience @DingQK !!

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Possible bug in WmdSimilarity
4 participants