Fix PR #963: Passing all the params through the apply call in lda.get_document_topics #978
Conversation
Is it just that they are filtered? Feel free to change the
Hi @tmylk, for the code below:
Should the output be like this?
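A minimal sketch of the kind of single-document call being discussed, assuming a trained gensim `LdaModel` named `model` and a bag-of-words `corpus` (names are illustrative, not the original snippet):

```python
# Topic distribution for one document; the result is a list of
# (topic_id, probability) pairs, e.g. [(0, 0.62), (3, 0.31)].
doc_topics = model.get_document_topics(corpus[0])
print(doc_topics)
```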
So what is happening in topic 3?
At this line we return the result of the call. I will show below, so that others can see what mistake they may be making when calling the function. A call like this:
will output:
As we can see, each tuple has 3 lists, so the proper way to access the data for the entire corpus is like this:
which will return:
And if we want to get the data for a particular document:
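As a sketch of the access pattern being described, assuming a trained `LdaModel` named `model` and a bag-of-words `corpus` (names are illustrative, not the original snippet):

```python
# With per_word_topics=True, each document yields a 3-tuple:
#   doc_topics  - [(topic_id, probability), ...]
#   word_topics - [(word_id, [topic_id, ...]), ...] most relevant topics per word
#   word_phis   - [(word_id, [(topic_id, phi_value), ...]), ...]
all_topics = model.get_document_topics(corpus, per_word_topics=True)

for doc_topics, word_topics, word_phis in all_topics:
    print(doc_topics)
```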
In order to get the aggregated result, I wrote a helper function. The helper function is as follows:
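A rough sketch of what such an aggregation helper might look like, assuming the goal is to sum per-document topic probabilities over the corpus (the function name and strategy are illustrative, not the original helper):

```python
from collections import defaultdict

def aggregate_document_topics(model, corpus):
    # Sum the per-document topic probabilities over the whole corpus.
    totals = defaultdict(float)
    for doc_topics, word_topics, word_phis in model.get_document_topics(corpus, per_word_topics=True):
        for topic_id, prob in doc_topics:
            totals[topic_id] += prob
    return dict(totals)
```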
And now, I can call the above function as follows:
The actual result is as follows:
And the aggregated results are as follows:
Is the aggregated result desirable? If yes, I will work on improving this helper function. Can you also give me some tips on improving it? In case the document topics are unique throughout the corpus (can that happen?), I can structure the data so that it looks like this:
I don't see a need for an aggregated function - please convince me why you think it is helpful. The current API below looks OK, though it needs to be documented in the docstring and in the lda ipynb tutorial.
Yes, I agree that the aggregated function is not actually needed, but users were confused and calling it like this:
instead of this:
It would be a good idea to document this in the docstring and the lda tutorial. Plus, users can store the data for the whole corpus like this:
and access a particular document's details like this:
with output like this:
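A sketch of the pattern suggested above, assuming the same `model` and `corpus` as before (variable names are illustrative):

```python
# Store the per-document results for the whole corpus ...
all_topics = list(model.get_document_topics(corpus, per_word_topics=True))

# ... then pull out the details of one particular document.
doc_topics, word_topics, word_phis = all_topics[1]
print(doc_topics)
print(word_topics)
print(word_phis)
```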
Sounds great. Let's update the docstring and tutorial.
@tmylk I have updated the tutorial in
@@ -977,7 +985,9 @@ def __getitem__(self, bow, eps=None):
        Ignore topics with very low probability (below `eps`).

        """
        return self.get_document_topics(bow, eps)
        # Is eps equivalent to minimum_probability?
        eps = self.minimum_probability
What is the reason to add this line? It is used anyway in https://github.com/parthoiiitm/gensim/blob/2ac2f951452e7d968acace2b6d7b11be8b79ad62/gensim/models/ldamodel.py#L912
@@ -80,7 +80,8 @@ class LdaMulticore(LdaModel):
    def __init__(self, corpus=None, num_topics=100, id2word=None, workers=None,
                 chunksize=2000, passes=1, batch=False, alpha='symmetric',
                 eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50,
                 gamma_threshold=0.001, random_state=None):
                 gamma_threshold=0.001, random_state=None,
Where is `minimum_probability`? Please add it and pass it through.
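For context, a small sketch of what the requested pass-through would enable once `minimum_probability` is added to the `LdaMulticore` signature and forwarded to `LdaModel.__init__` (which already accepts it); the toy corpus below is only for illustration:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

texts = [["human", "interface", "computer"],
         ["graph", "trees", "minors"],
         ["graph", "minors", "survey"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Only topics with probability >= 0.4 would be reported per document.
lda = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=2,
                   workers=1, minimum_probability=0.4)
print(lda.get_document_topics(corpus[0]))
```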
@@ -249,6 +249,26 @@ def testGetDocumentTopics(self):
        self.assertTrue(isinstance(k, int))
        self.assertTrue(isinstance(v, float))

        # Test case to use the get_document_topic function for the corpus
Please add a new test for `minimum_prob` and `minimum_phi_value`. Select values that change the number of topics and docs returned.
        # Test case to use the get_document_topic function for the corpus
        all_topics = model.get_document_topics(self.corpus, minimum_probability=0.5, minimum_phi_value=0.3, per_word_topics=True)

        self.assertEqual(model.state.numdocs, len(corpus))
So the filtering params have no effect? It is a better test if they change the outcome.
I am seeing the effect: if I choose minimum_probability=0.8 and minimum_phi_value=1.0, then most of the doc topic lists and word phi lists will be blank. Should I use that as a test case?
Yep
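A sketch of what such a test might look like (the method name and thresholds are illustrative, not the exact test added in the PR; `self.model` and `self.corpus` are assumed to exist on the test class):

```python
def testGetDocumentTopicsFiltering(self):
    # Assumes this lives on the existing LDA test class (self.model, self.corpus).
    # Strict thresholds should prune topics compared to permissive ones.
    strict = self.model.get_document_topics(
        self.corpus, minimum_probability=0.8, minimum_phi_value=1.0,
        per_word_topics=True)
    loose = self.model.get_document_topics(
        self.corpus, minimum_probability=0.0, minimum_phi_value=0.0,
        per_word_topics=True)
    for (s_topics, _, _), (l_topics, _, _) in zip(strict, loose):
        self.assertTrue(len(s_topics) <= len(l_topics))
```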
Hi, I have added the test case to check whether get_document_topics() works on a corpus. However, the build will fail because of the new test case I have added. For some words in the corpus, `word_topics` and `word_phis` are not returned. If the corpus is small, I get the correct values; however, if I take the corpus as defined in `test_ldamodel.py`, then it throws an error. Let me show this via a toy example called `app.py`:
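A minimal sketch of the kind of toy script being described, assuming standard gensim usage (this is an illustrative reconstruction, not the original `app.py`; the texts are placeholders):

```python
# Hypothetical reconstruction of the toy example (app.py) described above.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system", "response", "time"],
         ["graph", "minors", "trees"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

model = LdaModel(corpus, id2word=dictionary, num_topics=2)
all_topics = model.get_document_topics(corpus, per_word_topics=True)

for doc_topics, word_topics, word_phis in all_topics:
    print(doc_topics, word_topics, word_phis)
```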
For the above code, I get the complete output as follows:
However, if I take the corpus as:
then the following error will be thrown:
Please let me know how to modify the code for word_topics and word_phis in the returned tuple so that it won't throw the ValueError.