cPickle Problem #31

Closed
dedan opened this Issue May 20, 2011 · 5 comments

@dedan
Contributor

dedan commented May 20, 2011

As described in this thread on the mailing list, I have a problem when pickling models that contain very large matrices. As an intermediate solution, I would like to contribute something that stores the matrices using the numpy methods.

@piskvorky: We wanted to discuss this when you were in Berlin but then forgot about it. Do you have any ideas on how to implement this nicely? I would like to hear your suggestions so that the solution fits your ideas for gensim. If you have no time or no ideas, I will just implement something and send you a pull request.

@piskvorky

Member

piskvorky commented May 20, 2011

Hmm, so the goal here is to avoid the cPickle bug? Then maybe only override the save/load methods, to store a .npy file (for large matrices) and a .pkl file (everything else). Or store both in a single archive, with zipfile, which could have a positive impact on file size. But on the other hand, zipfile would mean no mmap = no sharing of memory for the same model between multiple processes...
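To make the trade-off concrete, here is a minimal sketch (not gensim code) comparing a plain `.npy` file with a zipped `.npz` archive. The file names and the small example matrix are made up for illustration, and the only calls it relies on are numpy's `np.save`, `np.savez_compressed` and `np.load`:

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for a "very large" matrix.
matrix = np.arange(12, dtype=np.float32).reshape(3, 4)
tmpdir = tempfile.mkdtemp()

# Plain .npy file: it can be opened memory-mapped, so several processes
# reading the same file can share its pages through the OS page cache.
npy_path = os.path.join(tmpdir, "matrix.npy")
np.save(npy_path, matrix)
mapped = np.load(npy_path, mmap_mode="r")
npy_is_memmap = isinstance(mapped, np.memmap)

# Zipped .npz archive: smaller on disk for compressible data, but each array
# is decompressed into an ordinary in-memory ndarray when it is accessed.
npz_path = os.path.join(tmpdir, "matrix.npz")
np.savez_compressed(npz_path, matrix=matrix)
with np.load(npz_path) as archive:
    unzipped = archive["matrix"]
npz_is_memmap = isinstance(unzipped, np.memmap)
```

Both variants round-trip the data; only the `.npy` one comes back as a `np.memmap`.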

A more general solution would be using PyTables, like Matt Goodman suggested in the thread you link to. I'm not sure we need another dependency yet, and it's certainly more work, but up to you :)

@Dieterbe

Contributor

Dieterbe commented Jun 9, 2011

dedan, you could also try jsonpickle.
All you need to do (IIRC) is apply this patch: #30 (comment)

@piskvorky

Member

piskvorky commented Jun 13, 2011

@dedan: I changed (Sparse)MatrixSimilarity to use numpy binary format instead of cPickle. Maybe that solution works for you too?

It's really simple: I override the object's save/load to store the large matrix separately (so that it can be mmap'ed back later) and store the rest of the object normally, with cPickle.
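A rough sketch of that approach might look like the following. This is not the actual gensim implementation: the class `MatrixModel`, its attributes `index` and `num_best`, and the `.npy` file suffix are hypothetical names chosen for illustration, and it uses Python 3's `pickle` module in place of Python 2's `cPickle`.

```python
import os
import pickle
import tempfile

import numpy as np


class MatrixModel:
    """Hypothetical model: one large dense matrix plus small metadata."""

    def __init__(self, index, num_best=None):
        self.index = index        # the large matrix
        self.num_best = num_best  # small attribute, pickled as usual

    def save(self, fname):
        # Store the large matrix separately in numpy's binary format.
        np.save(fname + ".npy", self.index)
        # Pickle everything else, with the matrix left out of the state.
        state = dict(self.__dict__)
        state["index"] = None
        with open(fname, "wb") as fout:
            pickle.dump(state, fout, protocol=pickle.HIGHEST_PROTOCOL)

    @classmethod
    def load(cls, fname, mmap_mode="r"):
        with open(fname, "rb") as fin:
            state = pickle.load(fin)
        obj = cls.__new__(cls)
        obj.__dict__.update(state)
        # Memory-map the matrix back instead of reading it into RAM.
        obj.index = np.load(fname + ".npy", mmap_mode=mmap_mode)
        return obj


# Usage: round-trip a model through a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
original = MatrixModel(np.random.rand(100, 50), num_best=10)
original.save(path)
restored = MatrixModel.load(path)
```

Because the matrix never passes through the pickler, its size no longer matters to pickle, and passing `mmap_mode=None` to `load` would read it fully into memory instead.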

@dedan

Contributor

dedan commented Jun 14, 2011

Thank you @piskvorky. I'll have a look at this solution later or tomorrow. I think it is similar to my current solution.

@Dieterbe

Contributor

Dieterbe commented Jun 22, 2011

@dedan, does the new code (based on the numpy binary format instead of cPickle) fix your issue?
The code is actually already included in gensim 0.8.0.

@piskvorky piskvorky closed this Aug 22, 2011
