Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add inference for new unseen author for gensim.models.AuthorTopicModel #1766

Merged
merged 24 commits into from
Mar 26, 2018

Conversation

Stamenov
Copy link
Contributor

@Stamenov Stamenov commented Dec 6, 2017

Add function get_new_author_topics() to infer topics distribution of a new unseen author, by passing a corpus - list of documents in "bag of words" format.

@Stamenov
Copy link
Contributor Author

Stamenov commented Dec 6, 2017

Sill some open TODOs:

  • how should the rho() function be defined inside the get_new_author_topics()
  • unit test

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contribution @Stamenov, please continue your work according to your plan and my comments.
Also, please have a look at PEP8 problems https://travis-ci.org/RaRe-Technologies/gensim/jobs/312612778#L516

@olavurmortensen can you review PR too?

@@ -882,6 +883,98 @@ def get_document_topics(self, word_id, minimum_probability=None):

raise NotImplementedError('Method "get_document_topics" is not valid for the author-topic model. Use the "get_author_topics" method.')

def get_new_author_topics(self, corpus, minimum_probability=None):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

except Exception as e:
#something went wrong! Rollback temporary changes in object and log
rollback_new_author_chages()
logging.error(traceback.format_exc())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to use traceback here, use logging.exception(e) instead

corpus, self.author2doc, self.doc2author, rho(),
collect_sstats=False, chunk_doc_idx=corpus_doc_idx
)
except Exception as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too broad clause, please specify the concrete type(s) of exception.

@Stamenov
Copy link
Contributor Author

Stamenov commented Dec 7, 2017

Of course I will continue my work here. Thanks for quick review @menshikh-iv . I have already addressed your remarks in my code.

@olavurmortensen
Copy link
Contributor

Great work @Stamenov. At a glance, it looks good to me, although do I have some comments.

Let me see if I understand the logic of the code, correct me if I'm wrong. First of all, you are inferring gamma for a collection of documents, assuming that they are attributed to a single author. So what you do is:

  • Add documents to self.corpus.
  • Add a single temporary author to author dictionaries.
  • Randomly initialize gamma, as per usual.
  • Run self.inference to obtain gammas for the documents.
  • Remove authors and documents from model.

This is what you do, right?

I don't think adding the documents to self.corpus is necessary, because self.inference only uses the documents you pass to it. Calling self.extend_corpus is also very slow, unfortunately, so always avoid this if possible.

As a side note, the chunk_doc_idx argument to self.inference has to do with getting an author list from self.doc2author[doc_no], it doesn't have to do with what documents are accessed.

rho is a bit tricky, as you mention. The point of rho is to interpolate between a previously computed gamma and the new one, as in line 443 in the code. I'm not quite sure about how to use it here this just yet.

It makes sense to make it in a single pass, since you're not updating the lambdas, and only one author is being updated. So it may be a good idea to make sure the number of iterations (over each document) is high enough. Maybe let self.inference accept iterations as an (optional) argument.

Finally, I must note that it isn't obvious what inference on held-out data should be in the author-topic model. I think that is because observations aren't independent. This method doesn't take into account that many author may contribute to single document, which is really the strength of the AT model. That being said, I think this way is the best way to do it.

Sorry for the wall of text 😝 Keep up the good work!

new_author_name = "placeholder_name"

# Add new documents in corpus to self.corpus.
self.extend_corpus(corpus)
Copy link
Contributor

@olavurmortensen olavurmortensen Dec 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove line 939. self.inference uses the corpus supplied as input, and calling self.extend_corpus is very slow.

@menshikh-iv
Copy link
Contributor

ping @Stamenov, what's status here?

@Stamenov
Copy link
Contributor Author

Sorry, christmas activities. Will try to push the recommended changes these days.

@Stamenov
Copy link
Contributor Author

Stamenov commented Jan 2, 2018

https://travis-ci.org/RaRe-Technologies/gensim/jobs/324067238
Can someone help me out with this build error? Tox package?
Thanks.

@piskvorky
Copy link
Owner

@Stamenov thanks! @menshikh-iv will help out with the build error once he's back from holiday (next week) :)

@Stamenov
Copy link
Contributor Author

Stamenov commented Jan 2, 2018

Great, 10x for the quick response.

@menshikh-iv
Copy link
Contributor

@Stamenov this isn't a tox problem (pip can't install tox package because the network disappeared in Travis), I re-run job, let's wait for the result.

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add tests for this functionality

CC @olavurmortensen, can you review this again?

@@ -882,6 +882,84 @@ def get_document_topics(self, word_id, minimum_probability=None):

raise NotImplementedError('Method "get_document_topics" is not valid for the author-topic model. Use the "get_author_topics" method.')

def get_new_author_topics(self, corpus, minimum_probability=None):
"""Infers topics for new author.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -882,6 +882,84 @@ def get_document_topics(self, word_id, minimum_probability=None):

raise NotImplementedError('Method "get_document_topics" is not valid for the author-topic model. Use the "get_author_topics" method.')

def get_new_author_topics(self, corpus, minimum_probability=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's a reason to minimum_probability=None here (instead of 1e-8)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv It is the same in get_document_topics in LDA (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/ldamodel.py#L987), and get_term_topics as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, I should leave it like this, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Stamenov yes, leave as is

len_input_corpus = sum(1 for _ in corpus)
if len_input_corpus == 0:
logger.warning("AuthorTopicModel.get_new_author_topics() called with an empty corpus")
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why "empty return" needed? Maybe better raise exception explicitly (for all empty return)

if len_input_corpus == 0:
logger.warning("AuthorTopicModel.get_new_author_topics() called with an empty corpus")
return
if not len_input_corpus < self.chunksize:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for this limitation? I don't see it implemented anywhere else.

return

new_author_name = "placeholder_name"
corpus_doc_idx = list(range(0, len_input_corpus))
Copy link
Contributor

@olavurmortensen olavurmortensen Jan 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead set corpus_doc_idx to list(range(self.total_docs, self.total_docs + len_input_corpus)) (see comment below).

# Add new author in author2doc and doc into doc2author.
self.author2doc[new_author_name] = corpus_doc_idx
for new_doc_id in corpus_doc_idx:
self.doc2author[new_doc_id] = [new_author_name]
Copy link
Contributor

@olavurmortensen olavurmortensen Jan 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of how you set corpus_doc_idx, you're overwriting self.doc2author[0] (0 through len_corpus_idx) here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@olavurmortensen
Copy link
Contributor

@Stamenov I made some comments in the code.

I'm afraid I've written the inference method in a sort of clumsy way, which makes this more cumbersome than it should be, sorry about that. The code really could use a refactor, but I don't have the time to do that.

@menshikh-iv
Copy link
Contributor

ping @Stamenov, when you plan to finish this PR?

@Stamenov
Copy link
Contributor Author

I am doing this parallel to my thesis, so mids February.

@menshikh-iv
Copy link
Contributor

@Stamenov what else do you need to complete this PR?

@Stamenov
Copy link
Contributor Author

Hi, I am have done some fixing around the builds.
Also in order to satisfy the Travis CI build I had to change the docstring for get_aurhor_topics(), which was not in the scope of my PR. Is this ok, can you please look into it?
All but one is now green, ci/circleci, and it seems to be ldamodel-related, so I could use some help there.
Furthermore, I need some test cases to implement, so if you have some off the top of your head, I will appreciate them.
I am also working on a accompanying evaluation report (paper) for authorship prediction on a couple of datasets, using this inference function, if this will be of any interest for Gensim.

return pow(self.offset + 1 + 1, -self.decay)
=======
return pow(self.offset + 1, -self.decay)
>>>>>>> Stashed changes
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

D'oh!

Topic distribution for the given `corpus`.

"""
# TODO: how should this function look like for get_new_author_topics?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned I am iteration over a few versions for the function and will commit the best performing.
It was already discussed here that it is not clear how we should implement this one.

except ValueError as e:
# Something went wrong! Rollback temporary changes in object and log
rollback_new_author_chages()
logging.exception(e)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is critical to be caught, since if it is not, we won't call rollback_new_author_changes*() and the state of the model will be altered with the temp vars.

except ValueError as e:
# Something went wrong! Rollback temporary changes in object and log
rollback_new_author_chages()
logging.exception(e)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would even much rather catch any exception, rollback the changes and log the exception.
The warning can be an addition to that.

@@ -450,6 +450,31 @@ def testTermTopics(self):
self.assertTrue(isinstance(topic_no, int))
self.assertTrue(isinstance(probability, float))

def testNewAuthorTopics(self):
Copy link
Contributor Author

@Stamenov Stamenov Feb 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the sanity check, I tried another test case, where I would infer the topics for a new author, using the exact same documents from the corpus that an author already is assigned to. We can expect that the topic distributions should be very similar, as it is, but not close enough so that we can use np.allclose(..,..). Example:

In [2]: model.get_new_author_topics(corpus=corpus[1:3])
Out[2]: [(0, 0.914887958236441), (1, 0.08511204176355897)]

In [3]: model.get_author_topics(author_name="sally")
Out[3]: [(0, 0.9290420023752036), (1, 0.07095799762479651)]

Of course something li this would work.
sally_topics = model.get_author_topics(author_name="sally")
new_authortopics = model.get_new_author_topics(corpus=corpus[1:3])
self.assertTrue(np.allclose(jillnewauthortopics,jilltopics, atol=1e-1))

@menshikh-iv do you approve of this test?

@Stamenov
Copy link
Contributor Author

@menshikh-iv I have pushed the test and exception improvement, also the rho() function which works best for me. Hope it is satisfactory.

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Feb 27, 2018

Great work @Stamenov 👍 LGTM for me (only fix PEP8 mistakes please)

@olavurmortensen ping, PR looks ready, are you approve?

@olavurmortensen
Copy link
Contributor

@Stamenov @menshikh-iv Sorry for not responding, I was sick for a while. I'll take a look soon.

raise ValueError("AuthorTopicModel.get_new_author_topics() called with an empty corpus")

new_author_name = "placeholder_name"
corpus_doc_idx = list(range(self.total_docs, self.total_docs + len_input_corpus))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment explaining this line. For example: "indexes representing the documents in the input corpus".


# Add the new placeholder author to author2id/id2author dictionaries.
num_new_authors = 1
author_id = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks confusing. Just say author_id = self.num_authors, or something like that.

if len_input_corpus == 0:
raise ValueError("AuthorTopicModel.get_new_author_topics() called with an empty corpus")

new_author_name = "placeholder_name"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for completeness, check that this author isn't already in the dictionary.

self.state.gamma = np.vstack([self.state.gamma, gamma_new])

# Should not record the sstats, as we are goint to delete the new author after calculated.
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this try block throw an exception if there is a problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it does, for example, when corpus is invalid (list of something else), or if the inference is somehow interrupted, or maybe something else. Must make sure to rollback the temporary changes in any case. If makes sense to show the exception to the user, but leave the model state untouched.

@olavurmortensen
Copy link
Contributor

I made some comments in the code. I think it would also nice to see some example use-cases on real data, hopefully with good results, to see if this actually works.

@Stamenov
Copy link
Contributor Author

Stamenov commented Mar 5, 2018

@olavurmortensen I will post a notebook, where I train and predict authors on the Reuters 50x50 dataset soon!

@menshikh-iv menshikh-iv changed the title Feature: New Author Inference Add inference for new unseen author for gensim.models.AuthorTopicModel Mar 9, 2018
@menshikh-iv
Copy link
Contributor

@Stamenov how is going? when you plan to finish PR (looks like this almost done).

@Stamenov
Copy link
Contributor Author

Hi, the PR is ready, as far as code and functionality is concerned.
The only thing is the playbook, which I have prepared yesterday, but I was getting very strange and inconsistent results. I have to take my time to debug it properly.
I can commit a preliminary version of it though.
I would say this PR is closable and I can open up a new one for the playbook?

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Mar 12, 2018

@Stamenov better to add notebook in current PR (this is hardly-related changes), we'll merge together
new functionality & notebook with an example as one PR.

Also, don't forget to resolve cosmetic issues based on @olavurmortensen comments.

@menshikh-iv
Copy link
Contributor

@Stamenov great, code looks good to merge 👍, the last thing - please add an example to https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/atmodel_tutorial.ipynb

@Stamenov
Copy link
Contributor Author

Thanks. I am working on it. I believe this weel it should be done.

@menshikh-iv
Copy link
Contributor

Hello @Stamenov

  1. You upload some html (instead of ipynb), look inside your file
  2. Please add a new section to https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/atmodel_tutorial.ipynb (instead of adding new notebook)

@Stamenov
Copy link
Contributor Author

Hi @menshikh-iv ,
I see your point with adding a new section instead of a new notebook, but I find it would be very difficult in keeping it clean. There would be a lot of double code from the atmodel_tutorial.ipynb and the notebook itself would become very long. Also I use a different dataset, not the NIPS one.

@menshikh-iv
Copy link
Contributor

Thanks for your work @Stamenov 👍 nice feature!

Thanks @olavurmortensen for review 👍

Copy link
Contributor

@jonaschn jonaschn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am interested in an appropriate value of rhot for visualizing the AT model in bmabey/pyLDAvis#161

Topic distribution for the given `corpus`.

"""
def rho():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Stamenov @olavurmortensen Please document the rationale behind this definition of rho.
Why didn't you write return pow(self.offset + 2 , -self.decay)?
In an earlier version this was return pow(self.offset + 1, -self.decay).
Could you provide any hint?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants