Added save/load functionality to AnnoyIndexer #845

fortiema · 2016-08-30T11:33:26Z

Having Annoy integrated directly into gensim is really great, but one feature I was personally missing is the ability to save and load indexes. I am working with indexes in the tens of GB, and having to recreate them every time I run my code is a waste of time.

So I added a simple save/load interface that is similar to Annoy.

For example, this code:

fname = 'index'
if os.path.exists(fname):
    self.index_en = AnnoyIndexer()
    self.index_en.load(fname)
    self.index_en.model = model
else:
    self.index_en = AnnoyIndexer(model, 1)
    self.index_en.save(fname)

It creates two files, index and index.d. Both files must be present when calling load; otherwise nothing happens.

For this to work, I also added an if branch in the constructor so that an object can be created without passing model and num_trees.
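A minimal sketch of that constructor pattern, assuming a hypothetical stand-in class (`LazyIndexer` and its `build` method are illustrative names, not gensim's actual code). The index is built only when both arguments are supplied; otherwise the object starts empty and waits for a later load:

```python
class LazyIndexer(object):
    """Hypothetical stand-in for an indexer; not gensim's AnnoyIndexer."""

    def __init__(self, model=None, num_trees=None):
        # Start empty so that a later load() can fill these in.
        self.index = None
        self.labels = None
        self.model = model
        self.num_trees = num_trees
        # Build eagerly only when the caller supplied everything needed.
        if model is not None and num_trees is not None:
            self.build()

    def build(self):
        # Placeholder for real index construction: here the "model" is
        # just any iterable of labels.
        self.labels = list(self.model)
        self.index = dict((label, i) for i, label in enumerate(self.labels))
```

With this shape, `LazyIndexer()` yields an empty object, while `LazyIndexer(model, 1)` builds immediately.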

I used try/except on the import and pickle protocol v2 to stay compatible with both Python 2 and 3.
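As a rough illustration of that compatibility approach, the sketch below assumes the import in question is the usual `cPickle` fallback; the helper names `save_metadata` and `load_metadata` are hypothetical and only mirror the `.d` file convention described above:

```python
try:
    import cPickle as pickle  # Python 2: C implementation
except ImportError:
    import pickle  # Python 3: cPickle no longer exists as a separate module


def save_metadata(fname, metadata):
    """Write the metadata dict next to the index file, as fname + '.d'."""
    with open(fname + '.d', 'wb') as fout:
        # Protocol 2 is the highest protocol that Python 2 can read.
        pickle.dump(metadata, fout, 2)


def load_metadata(fname):
    """Read back the dict written by save_metadata."""
    with open(fname + '.d', 'rb') as fin:
        return pickle.load(fin)
```

Pinning the protocol to 2 means a file written under Python 3 can still be read under Python 2, which the Python 3 default protocol would not allow.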

All comments and suggestions are welcome.

fortiema · 2016-08-30T11:55:21Z

Added basic test cases to test_similarities for both Word2Vec and Doc2Vec

tmylk · 2016-09-04T16:20:07Z

Thanks for the PR!

Could you please add a line to CHANGELOG and update the annoy notebook tutorial with the new functionality?

fortiema · 2016-09-05T05:27:28Z

Sure! Will push when done.

fortiema · 2016-09-09T06:56:18Z

Any other suggestions to make this interface more robust?

tmylk · 2016-09-15T14:38:32Z

gensim/similarities/index.py

+    def save(self, fname):
+        self.index.save(fname)
+        d = {'f': self.model.vector_size, 'num_trees': self.num_trees, 'labels': self.labels}
+        pickle.dump(d, open(fname+'.d', 'wb'), 2)


Please use smart_open as in https://github.com/RaRe-Technologies/gensim/blob/6a289fefd72f038c8cc14826f63624950f5de1f8/gensim/utils.py#L896

tmylk · 2016-09-15T14:39:03Z

gensim/similarities/index.py

+
+    def load(self, fname):
+        if os.path.exists(fname) and os.path.exists(fname+'.d'):
+            d = pickle.load(open(fname+'.d', 'rb'))


Please use smart_open as in https://github.com/RaRe-Technologies/gensim/blob/6a289fefd72f038c8cc14826f63624950f5de1f8/gensim/utils.py#L907

tmylk · 2016-09-15T14:39:22Z

gensim/test/test_similarities.py

+        from gensim.similarities.index import AnnoyIndexer
+        self.test_index = AnnoyIndexer()
+        self.test_index.load('test-index')
+


It has to raise IOError
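One way such a check and its test could look, sketched with a hypothetical stub (`StubIndexer` is not gensim's class and models only the file-existence check). The test uses the callable form of `assertRaises`, since the context-manager form is not available on Python 2.6:

```python
import os
import unittest


class StubIndexer(object):
    """Hypothetical stand-in; only the missing-file behaviour is modelled."""

    def load(self, fname):
        fname_dict = fname + '.d'
        if not (os.path.exists(fname) and os.path.exists(fname_dict)):
            raise IOError(
                "Can't find index files '%s' and '%s'" % (fname, fname_dict))
        # A real loader would read the index and its metadata here.


class TestMissingIndex(unittest.TestCase):
    def test_load_missing_raises(self):
        indexer = StubIndexer()
        # Callable form: assertRaises(exception, callable, *args)
        self.assertRaises(IOError, indexer.load, 'no-such-test-index')
```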

tmylk · 2016-09-16T10:04:38Z

Thanks for the quick fix. Once the Python 2.6 tests run, we can merge.

Also, it would be interesting to see a test where two parallel processes load the same model from disk and mmap the same index file.

fortiema · 2016-09-19T02:13:28Z

Great suggestion, let me add this as well!
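The shape of such a test can be sketched with the standard library alone. This is only an illustration of the idea under stated assumptions: an ordinary file stands in for the Annoy index, the child processes use Python's `mmap` module directly, and the names `CHILD_CODE`, `checksums_from_parallel_readers`, and `expected_checksum` are hypothetical:

```python
import subprocess
import sys
import zlib

# Code run by each child: map the file read-only and print a checksum.
CHILD_CODE = """
import mmap, sys, zlib
with open(sys.argv[1], 'rb') as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        print(zlib.crc32(mapped[:]) & 0xffffffff)
    finally:
        mapped.close()
"""


def checksums_from_parallel_readers(path, n_procs=2):
    """Start n_procs child processes, each of which mmaps `path`."""
    procs = [
        subprocess.Popen([sys.executable, '-c', CHILD_CODE, path],
                         stdout=subprocess.PIPE)
        for _ in range(n_procs)
    ]
    # All children have been started already; collect each one's output.
    return [int(p.communicate()[0].decode().strip()) for p in procs]


def expected_checksum(data):
    """Checksum of the raw bytes, masked to match the children's output."""
    return zlib.crc32(data) & 0xffffffff
```

A real test would replace the checksum with a nearest-neighbour query against the loaded index in each process and compare the results.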

Matt Fortier added 2 commits August 30, 2016 19:27

Added save/load functionality to AnnoyIndexer

e0af7b1

Added test cases, Added default parameter values to AnnoyIndexer cons…

d92e501

…tructor

piskvorky assigned tmylk Aug 30, 2016

Initialize index and labels to None at creation. Also verify that fna…

61d3ef1

…me+'.d' exists before trying to load index. Added test case for unexistant index file.

Added entry to CHANGELOG. Added section in tutorial notebook about pe…

8ce445e

…rsisting AnnoyIndexer instances.

tmylk reviewed Sep 15, 2016

Matt Fortier added 2 commits September 16, 2016 15:41

Use smart_open interface. Raise IOError if any of the index file is n…

7df7d1e

…ot found.

Fix assertRaises py2.6 support in test_similarities

31b5e53

Fix string formatting py2.6 support in test_similarities

3a546ca

tmylk merged commit 3a546ca into piskvorky:develop Sep 27, 2016

fortiema deleted the annoy-saveload branch September 28, 2016 01:16
