As described in this thread on the mailing list, I have a problem when pickling models that contain very large matrices. As an intermediate solution I would like to contribute something which stores the matrices using the numpy methods.
@piskvorky: We wanted to discuss this when you were in Berlin, but then forgot about it. Do you have any ideas on how to implement this nicely? I would like to hear your suggestions so that the solution fits your vision for gensim. If you have no time or no ideas, I will just implement something and send you a pull request.
Hmm, so the goal here is to avoid the cPickle bug? Then maybe only override the save/load methods, to store a .npy file (for large matrices) and a .pkl file (everything else). Or store both in a single archive, with zipfile, which could have a positive impact on file size. But on the other hand, zipfile would mean no mmap = no sharing of memory for the same model between multiple processes...
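The .npy/.pkl split and the mmap point can be illustrated with plain NumPy and the standard pickle module (a minimal sketch; the file names and the dummy attributes are made up, not gensim's actual layout):

```python
import os
import pickle
import tempfile

import numpy as np

tmpdir = tempfile.mkdtemp()
fname = os.path.join(tmpdir, "model")

# Store the large matrix in NumPy's binary format...
matrix = np.random.rand(1000, 200)
np.save(fname + ".npy", matrix)

# ...and everything else (small attributes) in a separate pickle file.
with open(fname + ".pkl", "wb") as f:
    pickle.dump({"num_features": 200}, f)

# mmap_mode="r" maps the .npy file instead of copying it into RAM, so
# several processes loading the same file share the pages read-only.
mapped = np.load(fname + ".npy", mmap_mode="r")
```

A zip archive would block exactly this last step: `np.load` can only memory-map a plain, uncompressed .npy file on disk.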
A more general solution would be using PyTables, like Matt Goodman suggested in the thread you link to. I'm not sure we need another dependency yet, and it's certainly more work, but up to you :)
dedan, you could also try jsonpickle.
All you need to do (IIRC) is apply this patch: #30 (comment)
@dedan: I changed (Sparse)MatrixSimilarity to use numpy binary format instead of cPickle. Maybe that solution works for you too?
It's really simple, I override the object's save/load to store the large matrix separately (so that it can be mmap'ed back later) and store the rest of the object normally, with cPickle.
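The override described above can be sketched roughly like this (class and attribute names are illustrative stand-ins, not the actual gensim implementation):

```python
import pickle

import numpy as np


class MatrixSimilarity:
    """Toy stand-in: one large matrix plus a few small attributes."""

    def __init__(self, index, num_features):
        self.index = index  # the large numpy array
        self.num_features = num_features

    def save(self, fname):
        # Store the big matrix separately, in numpy binary format...
        np.save(fname + ".npy", self.index)
        # ...then pickle the rest of the object without it.
        state = self.__dict__.copy()
        del state["index"]
        with open(fname, "wb") as f:
            pickle.dump(state, f)

    @classmethod
    def load(cls, fname, mmap=None):
        with open(fname, "rb") as f:
            state = pickle.load(f)
        obj = cls.__new__(cls)
        obj.__dict__.update(state)
        # Pass mmap="r" to map the matrix back instead of copying it
        # into RAM, so multiple processes can share one model.
        obj.index = np.load(fname + ".npy", mmap_mode=mmap)
        return obj
```

Because the matrix lives in its own .npy file, `load(fname, mmap="r")` gives back a read-only memory-mapped view rather than a fresh in-memory copy.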
Thank you @piskvorky. I'll have a look at this solution later today or tomorrow. I think it is similar to my current solution.
@dedan, does the new code (based on the numpy binary format instead of cPickle) fix your issue?
The code is actually already included in gensim 0.8.0.