Hierarchical Dirichlet Process #73

Merged
merged 17 commits into develop from hdp on Mar 8, 2012

Conversation

Owner

piskvorky commented Jan 14, 2012

jesterhazy's HDP code

jesterhazy and others added some commits Jan 14, 2012

minor code style changes
* also set the topic formatter to default ("gensim-style")

this var seems to be unused; is that a bug or is it really not needed?

Owner

piskvorky commented Jan 14, 2012

@jesterhazy , in the documentation it says "allows inference of topic distribution on new, unseen documents" -- how do you trigger this functionality (I see no __getitem__ there)?

I tried running HDP over a subset of wikipedia articles and the resulting topics seemed pretty random. I'll try again with the NY corpus, but I wonder if there's a standard set of results that we could use as "ground truth", in unittests etc.?

Like this it's kinda hard to say whether it's doing what it should...
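
For comparison, this is how the rest of gensim handles it -- a tiny self-contained example with LdaModel (toy data, just to show that inference on an unseen document is triggered via __getitem__):

from gensim import corpora, models

# toy data, purely illustrative -- the point is the transformation API
texts = [["human", "computer", "interaction"], ["graph", "trees", "minors"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)

# folding in a new, unseen document goes through __getitem__
new_bow = dictionary.doc2bow("human graph interaction".split())
print(lda[new_bow])   # [(topic_id, probability), ...]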

Contributor

jesterhazy commented Jan 14, 2012

That code and the var_sticks_ss stuff come directly from the original author; it's probably orphaned code.

I also noticed unimpressive topics on the NYT corpus, but didn't dig too deep. You are probably right about the bugs. I will compare with the math in Wang's paper to see if I can spot the problem.



Contributor

jesterhazy commented Jan 20, 2012

@piskvorky The problem definitely exists in the original author's version too. I reached out to him for some assistance, and will update you when I have news.

Contributor

jesterhazy commented Jan 22, 2012

@piskvorky do you have a dataset (not too large) that produces good results with your LDA code? The advice I got from Chong was to improve the vocabulary (remove stops, prune w/ tf-idf), but I am not seeing better results from this. If I had data known to work well with the LDA code, that would eliminate one unknown.

Owner

piskvorky commented Jan 24, 2012

@jesterhazy Any reasonable dataset ought to produce reasonable results -- LSI, LDA etc. all claim to be independent of any particular corpus. I usually try on a subset of wikipedia.

But here, we can try tfidf+LSI and LDA on the NY corpus, so that we can compare whether the topics make sense there.
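
Concretely, I mean the standard pipeline along these lines (a sketch; corpus/dictionary stand for the NY data, and 100 topics is an arbitrary choice):

from gensim import models

tfidf = models.TfidfModel(corpus)                                    # idf weights fit on the NY corpus
lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=100)
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=100)    # LDA directly on the bow counts

lsi.print_topics(10)   # eyeball the top 10 topics from each model
lda.print_topics(10)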

I'll get to this at the beginning of next month -- if you have time sooner, let me know so we can discuss the results.

Contributor

jesterhazy commented Jan 24, 2012

Sounds good to me. I've been running tfidf + HDP and LDA on the NYT data, and another set of ohsumed medical abstracts, and found all the results hard to judge. I will send some details tonight.


Contributor

jesterhazy commented Feb 9, 2012

@piskvorky

I've been running a lot of tests on the online hdp code, and have some results. I wonder if you could take a peek at them, and let me know what you think.

I am using the ohsumed corpus of ~57,000 medical abstracts. I trimmed the dictionary to about 17,400 features by eliminating terms with numbers or punctuation (except hyphens), and then calling dict2.filter_extremes(10, 0.2, 30000).
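
For reference, the pruning step looks roughly like this (the regex and the tokenized_abstracts iterator are reconstructions; the filter_extremes(10, 0.2, 30000) call is the exact one I used):

import re
from gensim import corpora

keep = re.compile(r'^[a-z]+(-[a-z]+)*$')          # letters plus internal hyphens only

dict2 = corpora.Dictionary(tokenized_abstracts)   # tokenized_abstracts: iterator over token lists
bad_ids = [tokenid for token, tokenid in dict2.token2id.items() if not keep.match(token)]
dict2.filter_tokens(bad_ids=bad_ids)              # drop terms with digits/punctuation
dict2.filter_extremes(no_below=10, no_above=0.2, keep_n=30000)   # leaves ~17,400 features here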

I then ran (separately) gensim's LDA, and HDP implementations, and printed the top 20 topics:

lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=100, chunksize=2048, decay=0.6)
hdp = models.HdpModel(corpus=corpus, id2word=dictionary, outputdir=options.outdir, chunksize=2048)

I also hacked the HdpModel code to process the training corpus ten times, to see if that helped.
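
(Roughly like this -- a sketch, assuming HdpModel.update() can simply be called again with the same corpus, which is what my hack boils down to:)

hdp = models.HdpModel(corpus=corpus, id2word=dictionary, chunksize=2048)
for extra_pass in range(9):    # the constructor already makes one full pass over `corpus`
    hdp.update(corpus)         # re-feed the same documents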

Then I converted the corpus and dictionary into the format used by the original author's code, and re-ran the HDP tests using an unmodified version of his code.

You can see the topics produced at https://gist.github.com/1781113. I also have LSI results for the same tests, if you think those would be interesting.

What I see in them is:

  • lda did a pretty good job
  • all of the hdp runs (gensim or original code) produced junk topics

What do you think?

If you are interested in trying this corpus yourself, let me know, I can send you my files.

Owner

piskvorky commented Feb 12, 2012

I see words like "she" and "more" and "there" in the topics, which could indicate that dictionary pruning was not aggressive enough.

Still, a method must be a little robust and not rely on absolutely "perfect" input data. LDA and LSI seem to work reasonably well on this dataset -- maybe HDP is more finicky about its input?

I'll try asking on the topic-models forum, hopefully someone can give practical hints on what to expect from HDP or what we're doing wrong. EDIT: https://lists.cs.princeton.edu/pipermail/topic-models/2012-February/001746.html

Owner

piskvorky commented Feb 16, 2012

@jesterhazy, it looks like we're on our own; the academicians' response was rather ... academic :-) Though Mr. Buntine's suggestion may be worth exploring once the basics work.

I see two courses of action:

  1. include HDP in gensim anyway, as it is, aka "let users sort it out"
  2. try with different preprocessing/dictionary/corpora first, in hopes of gaining more insight into what works + when + how.

I am a bit wary of 1., because then all the support requests and questions fall on my shoulders, and I won't know what to say :) Also it seems a bit unprofessional, even for open-source software like gensim.

But of course 2. means further delays and more work.

Your opinion?

Contributor

jesterhazy commented Feb 16, 2012

I will try some more tests this weekend on the medical abstracts, to see if I can get decent topics out of it.

Chong Wang's nyt topics certainly look good, but they are very far from what I've been able to generate. It would be nice to know more about his pre-processing steps. A complicating factor: I think he is using the NYT Annotated Corpus, which is more recent, larger, and cleaner than the UCI vector data I have access to right now.

Owner

piskvorky commented Feb 18, 2012

@jesterhazy Chong replied with more info, including concrete parameters and dictionary: https://lists.cs.princeton.edu/pipermail/topic-models/2012-February/thread.html

That is cool indeed; perhaps the problem was really only in our dataset size (too small).

I'll leave your code running over the entire Wikipedia. Once you run it on the medical abstracts as well, we'll have the two data points, which ought to be enough to judge whether including HDP makes sense or not.

I will prune the wikipedia dictionary very aggressively -- I noticed Chong's vocabulary never exceeded 10k features.

Contributor

jesterhazy commented Feb 18, 2012

I am just running it on the medical abstracts, after aggressively cleaning the dictionary, and am not seeing better results yet. I will post them soon. The corpus is relatively small though (~57k docs).


Contributor

jesterhazy commented Feb 18, 2012

New results: https://gist.github.com/1859673

The Wang hdp results are starting to look better. Perhaps with a larger corpus, they would turn out well?

The gensim hdp results don't look that great, but they do look like they are heading in the same direction as the Wang results. The gensim version runs much more quickly -- it processes each document once, whereas Wang's code processes random chunks of the corpus for a fixed number of iterations (an online algorithm applied in a very batchy way). Maybe this extra processing partly helps compensate for the small corpus?

Contributor

jesterhazy commented Feb 18, 2012

Additional hdp results using Wang code: https://gist.github.com/1860242

I let the analysis run with more iterations, longer maxtime settings.

They are definitely starting to look like reasonable topics.

I'm not sure what this means for the gensim version. I think the hdpmodel code is probably ok, but with the corpus I am using, it clearly needs multiple passes over the same docs to reach a decent model. I could redo the chunking code to support this, but it seems out of place in an online algorithm.

What do you think?

Contributor

jesterhazy commented Feb 20, 2012

@piskvorky:

more results, this time from my nyt corpus: https://gist.github.com/1869822

The Wang hdp code is producing pretty good results on a cleaned-up version of the corpus, if allowed to run for a longer time. The results above were for a 3-hour run; 3 hours gives hdp enough time to see each document ~3 times. My gensim version still only sees each doc once. I guess fixing that is the next step, although that basically disqualifies this as an online algorithm.

BTW -- the corpus I am using is not the same as Wang's. He appears to be using the "NYT Annotated Corpus" from LDC. I am using the older, messier, vectors-only dataset from the UCI bag-of-words collection (http://archive.ics.uci.edu/ml/datasets/Bag+of+Words): 300,000 docs, 102,660 features, no access to full text. I will try to post this info to topic-models at the next opportunity.

piskvorky added some commits Mar 2, 2012

WIP: added lemmatizer
* uses the optional python package `pattern`
* now lives in gensim.utils, but will be moved to `gensim.parsing` in the future
* the wikipedia parsing script uses lemmas automatically, if `pattern` is installed
Owner

piskvorky commented Mar 2, 2012

I decided to do the vocabulary thing more thoroughly, so I integrated a shallow parser (POS) and lemmatizer into the wikipedia parsing, rebuilding the entire corpus.

The lemmatizer I used is pretty slow, so I also extended the code to work in parallel (multicore), so that we get results this year :) That's why it's taking so long (apologies for the delays). I'm still on it, it's still running.
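
The lemmatization itself is just `pattern` wrapped in gensim.utils; the parallel part is, simplified, something like this (wiki_texts and process() are placeholders for the real streaming/serialization code):

from multiprocessing import Pool
from gensim import utils

def lemmatize_article(text):
    # shallow-parse + lemmatize via the `pattern` package; returns tagged
    # tokens such as 'article/NN', 'run/VB'
    return utils.lemmatize(text)

pool = Pool(processes=8)
for tokens in pool.imap(lemmatize_article, wiki_texts, chunksize=100):
    process(tokens)   # in the real script: feed into the Dictionary / MmCorpus serialization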

Re. your NYT results: the CW HDP results are indeed starting to look reasonable. But the gensim HDP still resembles random noise :(

What exactly are the changes you made wrt the original code? I thought this was a 1:1 translation of Wang's code?

HDP is such an arcane algo that I don't think anybody would mind it being multi-pass. Optimizing it might come after people start using it and complain it's slow, but with the results right now, that's never gonna happen :)

Owner

piskvorky commented Mar 4, 2012

Update: I used the param values suggested by CW (kappa=0.8, tau=1) for the English Wikipedia run. I checked the result yesterday and all topics were exactly equal... probably a bug :(

I re-ran the Wikipedia HDP, this time leaving the default parameters (kappa=1.0, tau=64.0).

After processing half a million documents, the results look much more reasonable:

gensim.models.hdpmodel : 2012-03-04 14:24:23,159 : INFO : topic 0: 0.006*album/NN + 0.006*film/NN + 0.005*game/NN + 0.004*song/NN + 0.004*band/NN + 0.004*season/NN + 0.003*award/NN + 0.003*show/NN + 0.003*music/NN + 0.002*episode/NN + 0.002*player/NN + 0.002*league/NN + 0.002*record/VB + 0.002*television/NN + 0.002*role/NN + 0.002*character/NN + 0.002*star/NN + 0.002*rock/NN + 0.002*perform/VB + 0.002*tv/NN
gensim.models.hdpmodel : 2012-03-04 14:24:23,324 : INFO : topic 1: 0.011*age/NN + 0.010*county/NN + 0.010*population/NN + 0.008*household/NN + 0.008*town/NN + 0.006*median/JJ + 0.005*average/JJ + 0.005*income/NN + 0.005*mile/NN + 0.004*race/NN + 0.004*square/NN + 0.004*township/NN + 0.003*census/NN + 0.003*district/NN + 0.003*size/NN + 0.003*park/NN + 0.003*older/JJ + 0.003*density/NN + 0.003*male/NN + 0.003*female/NN
gensim.models.hdpmodel : 2012-03-04 14:24:23,957 : INFO : topic 2: 0.008*party/NN + 0.005*government/NN + 0.004*election/NN + 0.004*president/NN + 0.004*minister/NN + 0.003*political/JJ + 0.003*law/NN + 0.003*court/NN + 0.002*leader/NN + 0.002*force/NN + 0.002*elect/VB + 0.002*military/JJ + 0.002*union/NN + 0.002*british/JJ + 0.002*democratic/JJ + 0.002*power/NN + 0.002*vote/NN + 0.002*prime/JJ + 0.002*council/NN + 0.002*german/NN
gensim.models.hdpmodel : 2012-03-04 14:24:24,133 : INFO : topic 3: 0.002*story/NN + 0.002*art/NN + 0.002*science/NN + 0.002*woman/NN + 0.002*century/NN + 0.002*novel/NN + 0.002*study/NN + 0.002*society/NN + 0.002*isbn/NN + 0.002*social/JJ + 0.002*film/NN + 0.002*theory/NN + 0.002*press/NN + 0.001*god/NN + 0.001*often/RB + 0.001*law/NN + 0.001*human/JJ + 0.001*form/NN + 0.001*term/NN + 0.001*describe/VB
gensim.models.hdpmodel : 2012-03-04 14:24:24,768 : INFO : topic 4: 0.008*upload/VB + 0.004*jpg/NN + 0.004*datum/NN + 0.004*function/NN + 0.004*user/NN + 0.003*computer/NN + 0.003*example/NN + 0.003*image/NN + 0.003*version/NN + 0.003*software/NN + 0.002*space/NN + 0.002*value/NN + 0.002*language/NN + 0.002*wikipedia/NN + 0.002*png/NN + 0.002*application/NN + 0.002*bit/NN + 0.002*window/NN + 0.002*code/NN + 0.002*model/NN
gensim.models.hdpmodel : 2012-03-04 14:24:24,948 : INFO : topic 5: 0.004*cell/NN + 0.003*water/NN + 0.003*energy/NN + 0.003*effect/NN + 0.002*cause/VB + 0.002*type/NN + 0.002*process/NN + 0.002*often/RB + 0.002*material/NN + 0.002*light/NN + 0.002*surface/NN + 0.002*study/NN + 0.002*body/NN + 0.002*acid/NN + 0.002*increase/VB + 0.002*temperature/NN + 0.002*occur/VB + 0.002*patient/NN + 0.002*level/NN + 0.002*protein/NN
gensim.models.hdpmodel : 2012-03-04 14:24:25,581 : INFO : topic 6: 0.008*force/NN + 0.006*air/NN + 0.005*ship/NN + 0.005*aircraft/NN + 0.005*army/NN + 0.004*battle/NN + 0.003*operation/NN + 0.003*division/NN + 0.003*british/JJ + 0.003*gun/NN + 0.003*navy/NN + 0.003*german/NN + 0.003*military/JJ + 0.003*squadron/NN + 0.003*royal/JJ + 0.003*unit/NN + 0.003*command/NN + 0.003*naval/JJ + 0.002*officer/NN + 0.002*regiment/NN
gensim.models.hdpmodel : 2012-03-04 14:24:25,760 : INFO : topic 7: 0.007*king/NN + 0.004*son/NN + 0.004*century/NN + 0.003*emperor/NN + 0.003*ii/NN + 0.003*roman/NN + 0.003*prince/NN + 0.003*duke/NN + 0.003*church/NN + 0.003*kingdom/NN + 0.003*earl/NN + 0.003*battle/NN + 0.002*town/NN + 0.002*castle/NN + 0.002*royal/JJ + 0.002*henry/NN + 0.002*charle/NN + 0.002*empire/NN + 0.002*lord/NN + 0.002*marry/VB
gensim.models.hdpmodel : 2012-03-04 14:24:26,392 : INFO : topic 8: 0.009*river/NN + 0.007*island/NN + 0.004*species/NN + 0.004*park/NN + 0.004*lake/NN + 0.003*water/NN + 0.003*mountain/NN + 0.003*town/NN + 0.003*population/NN + 0.003*region/NN + 0.003*north/JJ + 0.002*north/RB + 0.002*bird/NN + 0.002*south/RB + 0.002*sea/NN + 0.002*district/NN + 0.002*forest/NN + 0.002*northern/JJ + 0.002*km/NN + 0.002*land/NN
gensim.models.hdpmodel : 2012-03-04 14:24:26,572 : INFO : topic 9: 0.005*student/NN + 0.004*college/NN + 0.004*station/NN + 0.003*radio/NN + 0.003*program/NN + 0.003*business/NN + 0.003*science/NN + 0.002*channel/NN + 0.002*education/NN + 0.002*news/NN + 0.002*fm/NN + 0.002*network/NN + 0.002*bank/NN + 0.002*tv/NN + 0.002*research/NN + 0.002*technology/NN + 0.002*television/NN + 0.002*government/NN + 0.002*market/NN + 0.002*building/NN
gensim.models.hdpmodel : 2012-03-04 14:24:27,205 : INFO : topic 10: 0.007*car/NN + 0.007*engine/NN + 0.006*station/NN + 0.006*airport/NN + 0.005*railway/NN + 0.004*train/NN + 0.004*model/NN + 0.003*air/NN + 0.003*airline/NN + 0.003*road/NN + 0.003*passenger/NN + 0.003*aircraft/NN + 0.003*operate/VB + 0.003*race/NN + 0.003*vehicle/NN + 0.002*speed/NN + 0.002*route/NN + 0.002*ford/NN + 0.002*ret/NN + 0.002*rail/NN
gensim.models.hdpmodel : 2012-03-04 14:24:27,383 : INFO : topic 11: 0.009*language/NN + 0.004*word/NN + 0.004*century/NN + 0.003*form/NN + 0.003*king/NN + 0.002*chinese/NN + 0.002*greek/NN + 0.002*modern/JJ + 0.002*temple/NN + 0.002*example/NN + 0.002*speak/VB + 0.002*ancient/JJ + 0.002*dialect/NN + 0.002*often/RB + 0.002*vowel/NN + 0.002*period/NN + 0.002*bc/NN + 0.002*term/NN + 0.002*refer/VB + 0.002*god/NN
gensim.models.hdpmodel : 2012-03-04 14:24:28,015 : INFO : topic 12: 0.011*game/NN + 0.005*character/NN + 0.004*player/NN + 0.003*power/NN + 0.003*story/NN + 0.003*comic/NN + 0.003*kill/VB + 0.003*version/NN + 0.002*episode/NN + 0.002*film/NN + 0.002*earth/NN + 0.002*battle/NN + 0.002*star/NN + 0.002*black/JJ + 0.002*issue/NN + 0.002*reveal/VB + 0.002*destroy/VB + 0.002*video/NN + 0.001*card/NN + 0.001*comic/JJ
gensim.models.hdpmodel : 2012-03-04 14:24:28,194 : INFO : topic 13: 0.006*william/NN + 0.005*elect/VB + 0.005*player/NN + 0.004*minister/NN + 0.004*actor/NN + 0.004*jame/NN + 0.004*president/NN + 0.004*british/JJ + 0.004*democratic/JJ + 0.004*re/NN + 0.004*politician/NN + 0.003*george/NN + 0.003*republican/JJ + 0.003*robert/NN + 0.003*singer/NN + 0.003*charle/NN + 0.003*actress/NN + 0.003*thoma/NN + 0.003*french/JJ + 0.003*prime/JJ
gensim.models.hdpmodel : 2012-03-04 14:24:28,823 : INFO : topic 14: 0.015*music/NN + 0.005*song/NN + 0.004*orchestra/NN + 0.004*piano/NN + 0.004*composer/NN + 0.004*band/NN + 0.004*album/NN + 0.003*opera/NN + 0.003*instrument/NN + 0.003*musical/JJ + 0.003*perform/VB + 0.003*string/NN + 0.003*guitar/NN + 0.003*op/NN + 0.003*recording/NN + 0.003*dance/NN + 0.003*performance/NN + 0.002*style/NN + 0.002*record/VB + 0.002*symphony/NN
gensim.models.hdpmodel : 2012-03-04 14:24:29,007 : INFO : topic 15: 0.015*delete/NN + 0.009*wikipedia/NN + 0.009*feb/NN + 0.008*user/NN + 0.008*jan/NN + 0.008*mar/NN + 0.007*deletion/NN + 0.007*keep/NN + 0.007*live/NN + 0.006*nov/NN + 0.006*dec/NN + 0.006*delete/VB + 0.005*vote/NN + 0.005*debate/NN + 0.005*here/RB + 0.005*oct/NN + 0.005*longer/RB + 0.005*no/RB + 0.004*edit/VB + 0.004*sep/NN
gensim.models.hdpmodel : 2012-03-04 14:24:29,633 : INFO : topic 16: 0.011*club/NN + 0.010*cup/NN + 0.010*league/NN + 0.009*season/NN + 0.007*game/NN + 0.006*player/NN + 0.006*football/NN + 0.006*goal/NN + 0.006*championship/NN + 0.005*round/NN + 0.004*final/JJ + 0.004*score/VB + 0.003*champion/NN + 0.003*lose/VB + 0.003*division/NN + 0.003*stadium/NN + 0.003*town/NN + 0.003*finish/VB + 0.003*match/NN + 0.003*winner/NN
gensim.models.hdpmodel : 2012-03-04 14:24:29,813 : INFO : topic 17: 0.021*church/NN + 0.006*catholic/JJ + 0.005*bishop/NN + 0.005*christian/NN + 0.004*god/NN + 0.004*jesu/NN + 0.003*saint/NN + 0.003*holy/JJ + 0.003*orthodox/JJ + 0.003*church/VB + 0.003*roman/NN + 0.003*century/NN + 0.002*religious/JJ + 0.002*christ/NN + 0.002*priest/NN + 0.002*pope/NN + 0.002*jewish/JJ + 0.002*council/NN + 0.002*baptist/NN + 0.002*father/NN
gensim.models.hdpmodel : 2012-03-04 14:24:30,439 : INFO : topic 18: 0.005*wine/NN + 0.004*food/NN + 0.003*often/RB + 0.003*fruit/NN + 0.003*product/NN + 0.003*plant/NN + 0.003*variety/NN + 0.003*beer/NN + 0.003*sell/VB + 0.002*popular/JJ + 0.002*dish/NN + 0.002*water/NN + 0.002*usually/RB + 0.002*white/JJ + 0.002*tea/NN + 0.002*meat/NN + 0.002*red/JJ + 0.002*grow/VB + 0.002*common/JJ + 0.002*contain/VB
gensim.models.hdpmodel : 2012-03-04 14:24:30,616 : INFO : topic 19: 0.042*kategori/NN + 0.038*category/NN + 0.020*kategorija/NN + 0.017*categoria/NN + 0.015*kategoria/NN + 0.014*категория/NN + 0.011*categorie/NN + 0.011*kategorie/NN + 0.010*zh/NN + 0.010*categoria/VB + 0.010*kategória/NN + 0.010*категорија/NN + 0.010*ko/NN + 0.009*catégorie/NN + 0.009*kategorie/VB + 0.009*ru/NN + 0.008*es/NN + 0.008*categoría/VB + 0.008*sv/NN + 0.008*fr/NN
gensim.models.hdpmodel : 2012-03-04 14:24:30,636 : INFO : PROGRESS: finished document 468992 of 3533010

Btw this wiki corpus uses a vocab of 50k, but it should be much cleaner than the one produced by the simple alphabetic tokenizer.

Contributor

jesterhazy commented Mar 4, 2012

The new wikipedia run is still with (my) gensim version of the HDP code?

Owner

piskvorky commented Mar 4, 2012

Also, regarding the mini-batches:

Mini-batches. To improve stability of the online learning algorithm, practitioners typically use multiple samples to compute gradients at a time -- a small set of documents in our case. Let S be a small set of documents and S = |S| be its size. In this case, rather than computing the natural gradients using D*L_j, we use (D/S) * sum_{j in S} L_j. The update equations can then be similarly derived.

(from the original paper).

So the mini-batches are meant to improve stability; the algo can still be considered online, as long as S is reasonably small.
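
Spelled out, with D the corpus size and S the mini-batch (my reading of the paper's notation):

\[
  \frac{D}{|S|} \sum_{j \in S} \mathcal{L}_j
  \quad \text{instead of the single-document term} \quad
  D \, \mathcal{L}_j
\]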

piskvorky added some commits Mar 4, 2012

commented out periodic model saves (took up an insane amount of disk space)

* but retained periodic topic printing/logging
use only `lemma` as feature id; was: `lemma/tag`
* because the tags seem to be buggy -- all NN words are also present as VB (abbreviation/VB). So I just collapse words with the same lemma into one, oh well.
Contributor

jesterhazy commented Mar 4, 2012

Radim,

The hdp algorithm in the hdpmodel class should be a direct adaptation of the CW code. However, CW's code includes a "driver" script that divides the corpus into test and training slices, and handles batching. It does batches by choosing random chunks out of the whole corpus and processing them; the chunk size is equal to the batchsize setting, and it repeats this until it reaches an iteration or timeout limit. This means that a) it needs the whole corpus up front to select from, and b) it can process the same documents multiple times, or skip documents.

This doesn't fit my idea of an online algorithm, so I took that code out and replaced it with chunking code that processes the corpus once from start to finish in appropriate-sized batches. This is the only intentional difference. I'm thinking of adding some iteration code (and parameters to control it) to see if that makes the ohsumed or nyt results I got more comparable to the CW results.
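
Roughly, the replacement looks like this (simplified sketch; update_chunk stands for the per-batch update step inside hdpmodel):

def iter_chunks(corpus, chunksize):
    # sequential mini-batches: one pass, start to finish, no random resampling
    chunk = []
    for doc in corpus:
        chunk.append(doc)
        if len(chunk) == chunksize:
            yield chunk
            chunk = []
    if chunk:
        yield chunk    # final, possibly smaller, batch

for chunk in iter_chunks(corpus, chunksize=2048):
    hdp.update_chunk(chunk)    # per-batch variational update (the step CW's driver invoked on random chunks)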



Owner

piskvorky commented Mar 4, 2012

Yep, the same code, with only cosmetic changes: https://github.com/piskvorky/gensim/blob/hdp_wip/gensim/models/hdpmodel.py

Contributor

jesterhazy commented Mar 4, 2012

re: mini-batches. Agreed. CW's batching code was combined with random doc selection, test/training split code, and iteration control. The combined effect was not really online.

Owner

piskvorky commented Mar 4, 2012

Yes, I see, I also went through the paper and the code.

As it is now, the HDP code does mini-batches in the same way the online LDA code does (except in LDA, the batches can be processed by different cores/computers ~ parallelism speed up).

I'll try to hunt down why the previous kappa/tau combination produced such degenerate results; perhaps it's something trivial (I'm thinking maybe a bad float/int cast).
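
(The sort of trivial thing I mean, purely illustrative and not the actual gensim code: under Python 2, a stray integer division or int cast in the learning-rate/scaling computation silently truncates, and a zero or constant update weight would leave every topic stuck at its initial values:)

# Python 2 semantics, illustrative only
tau, t, kappa = 1, 10, 0.8
rhot = pow(tau + t, -kappa)    # ~0.147, fine: a float exponent gives a float result
bad = 1 / (tau + t)            # 0 under Python 2 integer division (0.0909... under Python 3)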

Contributor

jesterhazy commented Mar 4, 2012

I'm running now with the new iteration/timeout code. Interim results for nyt corpus look promising. Full results in a few hours.

topic 0: 0.024*game + 0.019*plai + 0.016*team + 0.015*season + 0.010*player + 0.010*run + 0.009*point + 0.009*win + 0.008*coach + 0.008*hit
topic 1: 0.024*compani + 0.014*percent + 0.010*market + 0.009*stock + 0.008*million + 0.007*busi + 0.007*price + 0.006*share + 0.005*execut + 0.005*billion
topic 2: 0.011*vote + 0.010*campaign + 0.009*elect + 0.008*presid + 0.008*polit + 0.008*democrat + 0.006*candid + 0.006*support + 0.005*republican + 0.005*issu
topic 3: 0.008*palestinian + 0.006*offici + 0.006*kill + 0.005*offic + 0.005*case + 0.005*govern + 0.005*attack + 0.005*isra + 0.005*polic + 0.004*famili
topic 4: 0.011*offici + 0.008*attack + 0.008*militari + 0.007*govern + 0.007*forc + 0.006*countri + 0.006*war + 0.006*terrorist + 0.005*leader + 0.005*secur
topic 5: 0.007*film + 0.007*plai + 0.006*movi + 0.006*music + 0.005*look + 0.005*live + 0.004*book + 0.004*love + 0.004*famili + 0.003*feel
topic 6: 0.005*room + 0.005*look + 0.005*hous + 0.005*home + 0.004*com + 0.004*place + 0.004*www + 0.004*live + 0.004*open + 0.004*build
topic 7: 0.008*tax + 0.008*compani + 0.007*million + 0.007*percent + 0.007*govern + 0.006*plan + 0.006*billion + 0.005*offici + 0.005*cut + 0.004*industri
topic 8: 0.009*school + 0.008*patient + 0.008*percent + 0.007*studi + 0.007*student + 0.006*drug + 0.006*doctor + 0.006*test + 0.005*women + 0.005*diseas
topic 9: 0.012*case + 0.009*compani + 0.007*court + 0.007*law + 0.006*lawyer + 0.006*investig + 0.005*offici + 0.005*govern + 0.005*offic + 0.005*feder
PROGRESS: finished document 1004448 of 30000

Owner

piskvorky commented Mar 4, 2012

Ok. I will clean up the work-in-progress branch hdp_wip and push it here to hdp.

Contributor

jesterhazy commented Mar 4, 2012

Here are the final results of that run. They look OK to me.

https://gist.github.com/1975163

Owner

piskvorky commented Mar 5, 2012

And here are the Wikipedia results, after a single pass through the 3.5M docs: https://gist.github.com/1979640

In both our results, the topics near the beginning seem to have converged well; the tail is questionable. But there'll be time to tune/diagnose/improve the parameters later (a non-parametric method, ha!); I agree these results look reasonable.

I will include HDP in the next release of gensim, some time this week. Thanks a lot, Jonathan!

piskvorky added a commit that referenced this pull request Mar 8, 2012

Merge pull request #73 from piskvorky/hdp
Hierarchical Dirichlet Process

piskvorky merged commit 197ff6f into develop on Mar 8, 2012

TC-Rudel commented

@jesterhazy, could you please comment on this mailing list thread regarding how to determine the number of topics found by the HDP-LDA model?

Contributor

jesterhazy commented Nov 10, 2017

@TC-Rudel done.
