Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add topic coherence pipeline to gensim #710

Merged
merged 18 commits into from Jun 22, 2016

Conversation

Projects
None yet
5 participants
@dsquareindia
Copy link
Contributor

commented May 27, 2016

Addresses #278.
@tmylk @piskvorky is this alright?

@piskvorky

This comment has been minimized.

Copy link
Member

commented May 28, 2016

What is "segmentation"?

Can you add some background info / motivation?

@dsquareindia

This comment has been minimized.

Copy link
Contributor Author

commented May 29, 2016

@piskvorky sorry I forgot to tag the issue before. I've tagged it now.

@piskvorky

This comment has been minimized.

Copy link
Member

commented May 29, 2016

Ah, nice! :-)

@tmylk please review.

@piskvorky piskvorky added the feature label May 29, 2016

@dsquareindia dsquareindia changed the title Created segmentation module. Added S_One_Pre segmentation. [WIP] Add topic coherence pipeline to gensim May 29, 2016

@dsquareindia

This comment has been minimized.

Copy link
Contributor Author

commented May 29, 2016

@piskvorky @tmylk I've added the API I was thinking of for adding the topic coherence pipeline. I've currently implemented only the U_mass topic coherence (still have to work a lot on it though) using this pipeline. Sorry for not adding the tests yet. Just wanted to show you what kind of an API I was thinking of. Is this API fine?

@piskvorky

This comment has been minimized.

Copy link
Member

commented May 30, 2016

Looks ok to me conceptually (but have a look at LdaModel.get_topic_terms() method).


EPSILON = 1e-12 # Should be small. Value as suggested in paper.

def Log_Conditional_Probability(segmented_topics, per_topic_probability):

This comment has been minimized.

Copy link
@piskvorky

piskvorky May 30, 2016

Member

PEP8: function names are lowercase.

Model persistency is achieved via its load/save methods.
"""
def __init__(self, corpus, topics, coherence='u_mass'): # Have to validate coherence input. Check for invalid pipeline methods.

This comment has been minimized.

Copy link
@piskvorky

piskvorky May 30, 2016

Member

Always put FIXME around notes-to-self. That makes it easy to grep them & make sure we don't forget about it before merge.

@dsquareindia

This comment has been minimized.

Copy link
Contributor Author

commented May 31, 2016

I've added initial tests for the segmentation module. Although I'm still thinking about what kind of inputs can I get in the segmentation module (or what different outputs can the users use from LDA to input into the pipeline).


logger = logging.getLogger(__name__)

def s_one_pre(topics):

This comment has been minimized.

Copy link
@dsquareindia

dsquareindia May 31, 2016

Author Contributor

Add a model argument here. Handle it inside

"cell_type": "markdown",
"metadata": {},
"source": [
"## Demonstration for the `u_mass` topic coherence using topic coherence pipeline"

This comment has been minimized.

Copy link
@dsquareindia

dsquareindia Jun 7, 2016

Author Contributor

@tmylk I've added an example notebook for u_mass topic coherence. Is this alright?

@dsquareindia dsquareindia force-pushed the dsquareindia:topic_coherence branch from 8749621 to 7a04893 Jun 9, 2016

@dsquareindia

This comment has been minimized.

Copy link
Contributor Author

commented Jun 9, 2016

@tmylk @piskvorky I have updated this PR with the latest version. I have also added a U_mass tutorial notebook. Could you please have a look at this intermediate version briefly?

@dsquareindia

This comment has been minimized.

Copy link
Contributor Author

commented Jun 14, 2016

Added initial draft of c_v topic coherence (the best one from the paper). Will now check for mathematical correctness and other optimizations

self.conf = direct_confirmation_measure.log_conditional_probability
self.aggr = aggregation.arithmetic_mean

elif self.coherence == 'c_v':

This comment has been minimized.

Copy link
@tmylk

tmylk Jun 15, 2016

Contributor

'c_v' should be a constant to avoid duplication

This comment has been minimized.

Copy link
@dsquareindia

dsquareindia Jun 16, 2016

Author Contributor

Sorry can you please elaborate on this a bit more?

@dsquareindia dsquareindia force-pushed the dsquareindia:topic_coherence branch from 213fcff to b45c00e Jun 15, 2016

@dsquareindia

This comment has been minimized.

Copy link
Contributor Author

commented Jun 15, 2016

I've added c_v coherence to the existing notebook.

@tmylk

This comment has been minimized.

Copy link
Contributor

commented Jun 16, 2016

Thanks! Why doesn't pyldavis show on github?

@dsquareindia

This comment has been minimized.

Copy link
Contributor Author

commented Jun 16, 2016

Yeah even I don't know why... Have to follow the nbviewer link to view it fully :/

@dsquareindia

This comment has been minimized.

Copy link
Contributor Author

commented Jun 16, 2016

Added tests for checking mathematical correctness of the direct_confirmation_measures module.

@dsquareindia

This comment has been minimized.

Copy link
Contributor Author

commented Jun 16, 2016

Added mathematical correctness test for indirect_confirmation_measure module.

@dsquareindia

This comment has been minimized.

Copy link
Contributor Author

commented Jun 17, 2016

I've modified the ipython notebook a bit. Hopefully it's more human-interpretable now 😄

@dsquareindia

This comment has been minimized.

Copy link
Contributor Author

commented Jun 17, 2016

Added introduction to the notebook. @tmylk can you please review?

@tmylk

This comment has been minimized.

Copy link
Contributor

commented Jun 21, 2016

@dsquareindia Could you move the new files you created from gensim/.py to gensim/topic_coherence/.py ? These modules are only needed for topic coherence and shouldn't be in root.

@dsquareindia

This comment has been minimized.

Copy link
Contributor Author

commented Jun 22, 2016

@tmylk done. Could you please check topic_coherence/__init__.py? Not quite sure about that.

@dsquareindia dsquareindia force-pushed the dsquareindia:topic_coherence branch from 4cbd3ae to 738235d Jun 22, 2016

@dsquareindia

This comment has been minimized.

Copy link
Contributor Author

commented Jun 22, 2016

@tmylk tests pass now. Just used HashDictionary instead of Dictionary.

@dsquareindia

This comment has been minimized.

Copy link
Contributor Author

commented Jun 22, 2016

@tmylk I have added support for LdaVowpalWabbit wrapper. You can check out an example here

@tmylk tmylk merged commit 6151747 into RaRe-Technologies:develop Jun 22, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@dsquareindia dsquareindia deleted the dsquareindia:topic_coherence branch Jun 23, 2016

This was referenced Oct 3, 2016

@guntherzhao

This comment has been minimized.

Copy link

commented Sep 24, 2018

I am trying to use Umass measure to pick the best number of topics, but I do not know what Umass exactly means? About the coherence score, is it the bigger, the better, or just the opposite? Below is the output of my test with Umass measure. How many topics should I pick?
image

@menshikh-iv

This comment has been minimized.

Copy link
Collaborator

commented Sep 28, 2018

@guntherzhao simple rule: more - better, here - 7 topics.

Our coherence is port of https://github.com/dice-group/Palmetto , you can read more about it on Palmetto wiki page or http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf

For future - please use mailing list for questions (GitHub only for feature requests & bug reports)

@guntherzhao

This comment has been minimized.

Copy link

commented Sep 28, 2018

@menshikh-iv Thanks for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.