LdaMulticore error in multiprocessing #445

Closed
lucidfrontier45 opened this Issue Sep 4, 2015 · 7 comments

lucidfrontier45 commented Sep 4, 2015

Hi. I tried to build an LDA model with LdaMulticore as follows.

import pymongo
import gensim

# load parsed documents from MongoDB, all at once
db = pymongo.MongoClient("192.168.254.226:37017").documents
ret = list(db.parsed.find({}, {"_id": 1, "words": 1, "title": 1}).limit(150000))
documents = [r["words"] for r in ret]
print("data was loaded")

# build the token -> id mapping
word_dict = gensim.corpora.Dictionary(documents)
print("dictionary created")

# convert every document to a bag-of-words vector, held entirely in memory
corpus = [word_dict.doc2bow(doc) for doc in documents]
print("corpus created")

model = gensim.models.LdaMulticore(corpus=corpus, num_topics=500,
    id2word=word_dict, iterations=200, workers=3)
print("lda done")

model.save("/tmp/test/model")

With a small amount of data it ran fine. However, when I increased the data size, it crashed in multiprocessing:

Traceback (most recent call last):
  File "/home/isc/anaconda/envs/gensim/lib/python3.4/multiprocessing/queues.py", line 248, in _feed
    send_bytes(obj)
  File "/home/isc/anaconda/envs/gensim/lib/python3.4/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/isc/anaconda/envs/gensim/lib/python3.4/multiprocessing/connection.py", line 399, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

Any idea? I am using Python 3.4 from the Anaconda distribution and gensim 0.12 installed with pip.
Thank you.

piskvorky (Member) commented Sep 4, 2015

What is your dictionary size? (even better, paste your entire log either here or using gist)

My guess would be that filtering out too frequent/infrequent words both improves performance (less memory, faster training) and quality (better topics).

Have a look at corpus streaming and word frequency filtering. The way you're using gensim is very suboptimal; data streaming is a big part of it.
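For example, a rough sketch of the frequency filtering using Dictionary.filter_extremes (the thresholds here are only illustrative placeholders, not recommendations):

word_dict = gensim.corpora.Dictionary(documents)
# drop tokens that appear in fewer than 5 documents or in more than half
# of all documents, then keep at most the 100000 most frequent of the rest
word_dict.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)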

lucidfrontier45 commented Sep 5, 2015

Hi @piskvorky, thank you for the reply.

What is your dictionary size? (even better, paste your entire log either here or using gist)

The above was the whole traceback; there were no other error messages.
The dictionary contains 306938 words according to len(word_dict.token2id).

Have a look at corpus streaming and word frequency filtering. The way you're using gensim is very suboptimal; data streaming is a big part of it.

I don't understand why corpus streaming can improve performance or fix the above error.
Since LDA is an iterative algorithm, doesn't the whole corpus have to be in memory for efficient iteration? Otherwise it would have to be loaded from disk or over the network on every pass.

piskvorky (Member) commented Sep 5, 2015

LDA in gensim is a streamed (online) algorithm.

To turn on logging, see the instructions at the very top of the tutorial page.
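That is just the standard library's logging configured at INFO level, along these lines:

import logging
# gensim reports its progress (dictionary sizes, training passes, ...) at INFO level
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)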

If len(word_dict) (no need for .token2id) is 300k, and you're asking for 500 topics, then we're talking 300k * 500 * 8 bytes = ~1.2GB for the word-topic matrix. That's the same order of magnitude as the 2**31 - 1 byte limit in your traceback, and it looks like multiprocessing has problems sending such large models between processes. Like I said, I think token filtering is your best bet, both to improve performance and quality.
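To make the streaming point concrete, here is a hypothetical sketch (the MongoCorpus name and the reuse of the db handle above are assumptions, not gensim API) — gensim only needs an iterable that yields one bag-of-words vector at a time:

class MongoCorpus(object):
    """Stream bag-of-words vectors from MongoDB, one document at a time."""
    def __init__(self, collection, dictionary, limit=150000):
        self.collection = collection
        self.dictionary = dictionary
        self.limit = limit

    def __iter__(self):
        # every training pass re-queries the database, so only a single
        # document is held in memory at any moment
        for r in self.collection.find({}, {"words": 1}).limit(self.limit):
            yield self.dictionary.doc2bow(r["words"])

corpus = MongoCorpus(db.parsed, word_dict)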

lucidfrontier45 commented Sep 5, 2015

That's the same order of magnitude as the 2**31 - 1 byte limit in your traceback, and it looks like multiprocessing has problems sending such large models between processes.

You're right. I reduced num_topics to 100 and the error was gone. I'll try filtering words next.

I still have one question. Is there a documented limit on num_words x num_topics?
As you said, a model approaching 2**31 bytes will cause multiprocessing problems.

piskvorky (Member) commented Sep 5, 2015

There is no limitation in gensim. It's a problem with multiprocessing (Python's built-in library), which fails to serialize large objects to send between processes.
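The failing call at the bottom of the traceback can be reproduced on its own:

import struct

# multiprocessing's connection layer prefixes each message with a signed
# 32-bit length header, so any pickled payload of 2**31 bytes or more overflows it
struct.pack("!i", 2**31 - 1)  # fine
struct.pack("!i", 2**31)      # struct.error: 'i' format requires -2147483648 <= number <= 2147483647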

If you really want a large model, use plain LdaModel (not LdaMulticore). That one doesn't use multiprocessing, so it's not affected by this limitation.

But of course, you won't be able to utilize multiple cores then...
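A sketch of the single-process variant, keeping the parameters from the original script and just swapping the class:

# LdaModel trains in a single process, so the word-topic matrix is never
# pickled across a multiprocessing pipe and the 32-bit header limit never applies
model = gensim.models.LdaModel(corpus=corpus, num_topics=500,
    id2word=word_dict, iterations=200)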

cscorley (Collaborator) commented Sep 7, 2015

FWIW, you can get numpy to utilize multiple cores for its operations by ensuring it's compiled with OpenMP directives enabled (I have no idea if Anaconda ships this way). That would let you use LdaModel while still getting some juice from your other cores. However, there may be tradeoffs in doing so: http://stackoverflow.com/questions/15414027/multiprocessing-pool-makes-numpy-matrix-multiplication-slower
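To check which BLAS/LAPACK libraries a given numpy build is actually linked against:

import numpy as np
np.show_config()  # prints the BLAS/LAPACK build configuration numpy was compiled with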

lucidfrontier45 commented Sep 7, 2015

@piskvorky
Thanks a lot for the help!

@cscorley
Recent Anaconda ships with OpenBLAS, but multi-threading doesn't seem to be enabled. Still, I can build numpy against a different OpenBLAS for Anaconda.
