
Word2vec coherence #1530

Merged
merged 31 commits into from
Sep 18, 2017

Conversation

@macks22 (Contributor) commented Aug 13, 2017

Resolves #1380.

This PR adds a new coherence measure, "c_w2v", to CoherenceModel. It uses word2vec word vectors to compute similarity between terms when calculating coherence. The implementation adds a new accumulator in the text_analysis module that either uses pre-trained KeyedVectors or trains a Word2Vec model on the input corpus to derive them. A new keyword argument, keyed_vectors, is added to CoherenceModel for passing in pre-trained vectors.

Rather than requiring parity between the corpus dictionary and the keyed-vectors vocabulary, the coherence calculation ignores terms that are missing from the vocabulary (with a warning). This PR also adds a new with_std argument to the get_coherence_per_topic method for calculating the standard deviation across the segments of each topic. This can help flag topics that have good overall coherence but contain a few notable outlier terms or subsets.
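The idea can be sketched in plain Python. This is a toy illustration only — the function names are hypothetical and this is not gensim's actual c_w2v implementation, which operates over topic segmentations — but it shows the two behaviors described above: out-of-vocabulary terms are skipped, and with_std optionally returns the spread of the pairwise similarities alongside their mean.

```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def topic_coherence(terms, vectors, with_std=False):
    # Mean pairwise similarity over a topic's top terms, ignoring
    # terms missing from the vector vocabulary (as the PR does).
    present = [t for t in terms if t in vectors]
    sims = [cosine(vectors[a], vectors[b])
            for i, a in enumerate(present)
            for b in present[i + 1:]]
    mean = sum(sims) / len(sims)
    if with_std:
        std = sqrt(sum((s - mean) ** 2 for s in sims) / len(sims))
        return mean, std
    return mean
```

A high mean with a large standard deviation is exactly the "good overall coherence with outlier terms" situation the with_std option is meant to surface.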

Finally, I noticed that CoherenceModel's method of getting topics from models was implemented in a switch-style manner, so I refactored it to use polymorphism. Specifically, I introduced a new get_topics method on all topic models that returns the topic-term distributions (except for LSI, which just returns the weights, since it's not a real distribution).
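The refactoring described above can be sketched as follows. The class names here are illustrative toys rather than gensim's actual model classes (only the get_topics method name comes from the PR): each model answers get_topics in its own way, so the consumer needs no per-model switch.

```python
class BaseTopicModel:
    def get_topics(self):
        # Subclasses return a topics x terms matrix.
        raise NotImplementedError

class ToyLdaModel(BaseTopicModel):
    def __init__(self, topic_term):
        self.topic_term = topic_term

    def get_topics(self):
        # Probabilistic models normalize each row into a distribution.
        return [[w / sum(row) for w in row] for row in self.topic_term]

class ToyLsiModel(BaseTopicModel):
    def __init__(self, weights):
        self.weights = weights

    def get_topics(self):
        # LSI weights are not probabilities, so they are returned as-is.
        return self.weights

def top_term_ids(model, topn):
    # A CoherenceModel-style consumer needs no isinstance switch:
    # it just calls get_topics() on whatever model it is given.
    return [sorted(range(len(row)), key=row.__getitem__, reverse=True)[:topn]
            for row in model.get_topics()]
```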

Sweeney, Mack added 18 commits June 14, 2017 13:46
… to allow passing in pre-trained, pre-loaded word embeddings, and adjust the similarity measure to handle missing terms in the vocabulary. Add a `with_std` option to all confirmation measures that allows the caller to get the standard deviation between the topic segment sets as well as the means.
…ures, and add test case to sanity check `word2vec_similarity`.
…st coverage for this, and update the `CoherenceModel` to use this for getting topics from models.
…bility distributions for the probabilistic topic models.
… will uncache the accumulator and the topics will be shrunk/expanded accordingly.
…f the executables are not installed, instead of passing them inappropriately.
…c_coherence

# Conflicts:
#	gensim/test/test_coherencemodel.py
#	gensim/topic_coherence/direct_confirmation_measure.py
#	gensim/topic_coherence/indirect_confirmation_measure.py
@macks22 macks22 mentioned this pull request Aug 13, 2017
Sweeney, Mack added 8 commits August 15, 2017 07:32
…ew Word2Vec-based coherence metric "c_w2v".
…t of models or top-N lists efficiently. Update the notebook to use the helper methods. Add `TextDirectoryCorpus` import in `corpora.__init__` so it can be imported from package level. Update notebook to use `corpora.TextDirectoryCorpus` instead of redefining it.
@macks22 (Contributor, Author) commented Aug 29, 2017

@menshikh-iv I believe this PR is now ready to go. I responded to the change requests in the original PR: I've added the get_topics method to BaseTopicModel and updated the notebook to demonstrate how to use the new coherence measure. I believe the existing CI failures are all due to the imports in corpora.__init__, which are flagged as "unused imports" even though they are intentional.

Please let me know if there are any additional changes needed. Thanks!

@@ -261,6 +260,10 @@ def for_topics(cls, topics_as_topn_terms, **kwargs):
for topic in topic_list:
topn = max(topn, len(topic))

if 'topn' in kwargs:
topn = min(kwargs.get('topn'), topn)
del kwargs['topn']
Review comment (Owner):
Have a look at dict.pop().
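The suggestion replaces the membership check plus del in the diff above with a single dict.pop call that takes a default. A minimal sketch of the equivalent rewrite (the helper name is hypothetical):

```python
def effective_topn(topics, **kwargs):
    # Derive topn from the longest topic, as in the diff above.
    topn = max(len(topic) for topic in topics)
    # dict.pop reads and removes the key in one step; its default
    # argument makes the `if 'topn' in kwargs: ... del` dance unnecessary.
    topn = min(kwargs.pop('topn', topn), topn)
    return topn, kwargs
```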

@piskvorky piskvorky removed the request for review from menshikh-iv August 29, 2017 13:12
@macks22 (Contributor, Author) commented Sep 9, 2017

@piskvorky @menshikh-iv all merge conflicts have been resolved and all requested changes made. Please let me know if there is anything else to be done here. Thanks!

@menshikh-iv (Contributor) left a comment:

Very nice PR, thank you @macks22 🔥 !!
Great update to the notebook (easy to read), and the general idea looks good.

Please resolve my small comments and I'll merge it!

"Wall time: 43.7 s\n"
"Dictionary(24593 unique tokens: [u'woods', u'hanging', u'woody', u'localized', u'gaa']...)\n",
"CPU times: user 20 s, sys: 797 ms, total: 20.8 s\n",
"Wall time: 21.1 s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"corpus = NewsgroupCorpus(data_path)\n",
@menshikh-iv (Contributor):
What's the reason for using NewsgroupCorpus instead of the sklearn data interface? Convenience?

@macks22 (Contributor, Author):
This was a comfort thing -- I already had code for this. I've updated to use the sklearn.datasets.fetch_20newsgroups function for retrieving the data.

"name": "stderr",
"output_type": "stream",
"text": [
"WARNING:gensim.topic_coherence.indirect_confirmation_measure:at least one topic word not in word2vec model\n",
@menshikh-iv (Contributor):
Please suppress this warning here (and below)
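One way to suppress this warning in the notebook is to raise the threshold of the specific logger that emits it (the logger name is taken from the warning output shown above):

```python
import logging

# Silence "at least one topic word not in word2vec model" warnings
# emitted by gensim's indirect confirmation measure module.
logging.getLogger(
    'gensim.topic_coherence.indirect_confirmation_measure'
).setLevel(logging.ERROR)
```

Scoping the level to this one logger keeps other gensim warnings visible.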

@macks22 (Contributor, Author):
done



def cosine_similarity(segmented_topics, accumulator, topics, measure='nlr', gamma=1,
with_std=False):
@menshikh-iv (Contributor):

Don't use a vertical indent here (the line is not very long); please put the definition on one line.

@macks22 (Contributor, Author):
I added a new argument here, which made the line longer, so I used a hanging indent instead.

…er to get 20 newsgroups corpus. Add `with_support` option to the confirmation measures to determine how many words were ignored during calculation. Add `flatten` function to `utils` that recursively flattens an iterable into a list. Improve the robustness of coherence model comparison by using nanmean and mean value imputation when looping over the grid of top-N values to compute coherence for a model. Fix too-long logging statement lines in `text_analysis`.
@macks22 (Contributor, Author) commented Sep 16, 2017

@menshikh-iv thank you for the review; I've addressed your requested changes. While updating the notebook, I also noticed a few other small ways to improve the robustness of this code, so I made those changes as well. Please let me know if there is anything else you'd like modified. Thanks!

@menshikh-iv (Contributor):
Thanks for your work @macks22, you are a TOP contributor 🔥

@menshikh-iv menshikh-iv merged commit 4c0737a into piskvorky:develop Sep 18, 2017
@macks22 macks22 deleted the word2vec_coherence branch September 18, 2017 11:58
@ydennisy commented Sep 3, 2020

@macks22 this functionality is not documented; it would be great to add the docs 👍

@mpenkov (Collaborator) commented Sep 17, 2020

@ydennisy May I interest you in making a PR? ;)


Successfully merging this pull request may close these issues.

Add a new indirect confirmation measure based on word2vec similarity
5 participants