[MRG] Dynamic Topic Models in python. #739
Could you please update it to be more up to date with your private fork?
@tmylk all the methods are more or less done. I still have to see
Note: as of now, the classes and methods are not well arranged, and there are a few mock classes (which will be removed) to help me with testing. Once all the testing is done and a crude DTM is fit, I will rearrange the structure of the DTM code.
import sys

# this is a mock LDA class to help with testing until this is figured out
class mockLDA(utils.SaveLoad):
Should the mock class be in the test folder rather than in the prod file?
Cleaned up the code and added docstrings so that it is more reviewable now. I made changes to many core classes, so most of the tests will be failing; I will fix them and clean up further tomorrow.
totals = numpy.zeros(counts.shape[1])

for w in range(0, W):
    self.variance[w], self.fwd_variance[w] = self.compute_post_variance(w, self.chain_variance)
zip* and a list comprehension is easier to read
Will do.
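A minimal sketch of what the reviewer's suggestion might look like, assuming the "zip*" remark means `zip(*...)` over a list comprehension. The `compute_post_variance` function below is a hypothetical stand-in that only mimics the "returns two arrays per word" shape seen in the diff; it is not the PR's actual method.

```python
import numpy as np


def compute_post_variance(word, chain_variance, num_time_slices=3):
    """Hypothetical stand-in: returns (variance, fwd_variance) for one word."""
    variance = np.full(num_time_slices + 1, chain_variance * (word + 1))
    fwd_variance = np.full(num_time_slices + 1, chain_variance * (word + 2))
    return variance, fwd_variance


W = 4  # vocabulary size (illustrative value)
chain_variance = 0.005  # illustrative value

# Loop form, mirroring the diff under review.
variance_loop = np.zeros((W, 4))
fwd_variance_loop = np.zeros((W, 4))
for w in range(W):
    variance_loop[w], fwd_variance_loop[w] = compute_post_variance(w, chain_variance)

# zip(*...) form: the comprehension yields one (variance, fwd_variance) pair
# per word, and zip(*pairs) regroups them into two tuples of rows.
pairs = [compute_post_variance(w, chain_variance) for w in range(W)]
variance, fwd_variance = (np.array(rows) for rows in zip(*pairs))
```

Both forms build the same two arrays; the second avoids indexing into preallocated arrays inside the loop.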
1) Include DIM mode. Most of the infrastructure for this is in place.
2) See if LdaPost can be replaced by LdaModel completely without breaking anything.
3) Heavy lifting going on in the sslm class - efforts can be made to cythonise mathematical methods.
How much time is spent there according to the line profiler? Is it really a perf bottleneck?
fit_sslm takes up the majority of the time, with update_obs being particularly slow.
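For readers who want to check a claim like this themselves, here is a rough sketch using the standard-library `cProfile` module (swapped in for the line profiler mentioned above, which is a separate third-party tool). The `fit_sslm` and `update_obs` functions below are hypothetical toy stand-ins that only borrow the names from the discussion; they do not reproduce the PR's code.

```python
import cProfile
import io
import pstats


def update_obs(n):
    """Hypothetical stand-in for the slow inner routine."""
    total = 0.0
    for i in range(n):
        total += (i % 7) ** 0.5
    return total


def fit_sslm(iterations=20, n=2000):
    """Hypothetical stand-in for the outer fitting routine."""
    return sum(update_obs(n) for _ in range(iterations))


profiler = cProfile.Profile()
profiler.enable()
fit_sslm()
profiler.disable()

# Collect the report in a string so it can be inspected or logged.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time shows which call trees dominate; a function-level profile like this is usually enough to decide whether a routine is worth cythonising before reaching for per-line timings.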
@tmylk, @piskvorky, I've added a tutorial notebook and tests to accompany the code.
@piskvorky I think this is ready for your review. (Though I still have some comments about the code and notebook to discuss with Bhargav today.)
OK, I'll have a look, probably this weekend. Thanks @bhargavvader! One thing I see immediately is that this PR needs a better description. When people come here from the CHANGELOG/web/google/blog etc, they shouldn't be greeted with
@piskvorky, haha, yes. I've changed the title and description with a brief explanation and a TODO list for future work.
@tmylk I see a lot of TODOs in the description, and a request for review. Was this really meant to be merged?
@piskvorky, the TODOs in the description are more about future work to improve the code: things like improving performance and making it distributed. I've changed it from TODO to
A Python translation from C of Dynamic Topic Models, as described by Blei et al. in this paper, originally written in C here.
Dynamic Topic Models are used to topic-model time-stamped documents such that the topics evolve from one time slice to the next. This is useful for finding out how certain key words in topics change over time, and for measuring similarity between documents that are far apart in time but contextually similar.
This was my Google Summer of Code 2016 project.
Future Works:
- in particular, a lot of DIM depends on LdaPost being in place.
- in particular, update_obs and the optimization take a lot of time.