
Evaluation of word2vec models against semantic similarity datasets #1047

Merged
55 commits merged into piskvorky:develop on Dec 22, 2016

Conversation

akutuzov
Contributor

Gensim has long had analogy-based evaluation of word2vec models (also known as analogical inference). However, another type of evaluation is widespread in the distributional semantics world: ranking word pairs by their semantic similarity (see SimLex-999 and other datasets) and correlating these human similarity scores with those produced by the model.

This PR adds the self.evaluate function to perform such evaluation against arbitrary datasets.
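To make this concrete, here is a minimal sketch of the idea (illustrative only, not the code in this PR): read tab-separated word1/word2/human-score triples, as in SimLex-999, and compute the Spearman rank correlation between the human scores and the model's cosine similarities, skipping out-of-vocabulary pairs. The function name and file format here are assumptions.

```python
# Minimal sketch of word-pair evaluation (illustrative, not the PR's code).
# Assumes lines of "word1<TAB>word2<TAB>human_score", e.g. SimLex-999.
from scipy.stats import spearmanr

def word_pair_correlation(wv, pairs_path):
    """Spearman correlation between human scores and model similarities."""
    gold, predicted = [], []
    with open(pairs_path) as infile:
        for line in infile:
            word1, word2, score = line.strip().split('\t')
            if word1 in wv and word2 in wv:  # skip out-of-vocabulary pairs
                gold.append(float(score))
                predicted.append(wv.similarity(word1, word2))
    return spearmanr(gold, predicted)
```

Called with a trained model's vectors, e.g. word_pair_correlation(model.wv, 'SimLex-999.txt'), this returns the correlation statistic and its p-value.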

tmylk and others added 30 commits November 5, 2015 19:07
Conflicts:
	CHANGELOG.txt
	gensim/models/word2vec.py
@piskvorky
Owner

piskvorky commented Dec 15, 2016

Thanks @akutuzov, that looks useful!

But what's with all those commits? Most look unrelated, and some look downright scary (like f3f2a52).

Also, we'll have to change the name evaluate to something more specific -- how about evaluate_word_pairs?

@akutuzov
Contributor Author

Thanks @piskvorky!
I am certainly not against renaming; done.
As for the extra commits, I am trying to understand why GitHub has linked this PR to my previous one (#538).
Only the last few commits, starting from e11909f, make sense in the context of this PR; in fact, only two files are changed.

@akutuzov
Contributor Author

This is crazy.
So, can you squash all these commits into one, or should I just start another PR from scratch?

@piskvorky added the feature label on Dec 16, 2016
@tmylk
Contributor

tmylk commented Dec 19, 2016

@akutuzov thanks for the feature. Could we please add some simple unit tests for this new feature?

@akutuzov
Contributor Author

@tmylk what would those be? Evaluating against a toy dataset? Should they follow the same structure as testAccuracy in https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_word2vec.py#L370?

Also, what should we do with the old, unneeded commits in this PR? As I said, I can start a new one from scratch if it is not possible to just squash them all into one on the Gensim side.

@tmylk
Contributor

tmylk commented Dec 19, 2016

That test is not a good example: it is not a test of accuracy but a test of KeyedVectors. A good test would train a model on the Lee corpus and give it a single pair to evaluate, like in the sanity test.

There is another point. Having the small, canonical questions-words.txt in the repo helps a lot of people test the accuracy of their models. So we should add a semantic similarity dataset that is smaller than 1 MB.

Don't worry about commits, I will squash them.
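A test along the lines described above might look like the following sketch: train a small model on the bundled Lee corpus and run the renamed evaluate_word_pairs method on a small similarity file. The datapath helper, the test-file names, and the exact return shape are assumptions here, not the test that was eventually committed.

```python
# Sketch of such a unit test, assuming gensim.test.utils.datapath, the
# bundled lee_background.cor and wordsim353.tsv test files, and
# evaluate_word_pairs returning ((pearson, p), (spearman, p), oov_ratio).
import unittest

from gensim.models import word2vec
from gensim.test.utils import datapath

class TestEvaluateWordPairs(unittest.TestCase):
    def test_evaluate_word_pairs(self):
        sentences = word2vec.LineSentence(datapath('lee_background.cor'))
        model = word2vec.Word2Vec(sentences, min_count=3)
        pearson, spearman, oov_ratio = model.wv.evaluate_word_pairs(
            datapath('wordsim353.tsv'))
        # Sanity checks: the correlations are defined and the
        # out-of-vocabulary ratio is reported as a percentage.
        self.assertTrue(-1.0 <= spearman[0] <= 1.0)
        self.assertTrue(0.0 <= oov_ratio <= 100.0)

if __name__ == '__main__':
    unittest.main()
```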

@akutuzov
Contributor Author

OK, I will add a test, then.

@tmylk merged commit baf0f16 into piskvorky:develop on Dec 22, 2016
@tmylk
Contributor

tmylk commented Dec 22, 2016

Thanks for the PR! Merging to add it to this year's release. Tests and a dataset should be in a separate PR.

@akutuzov
Contributor Author

Cool, thanks!
I will implement tests in a few days.
