Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #1196 - Gensim error when loading FastText #1214

Merged
merged 9 commits into from Mar 30, 2017

Conversation

jaksmid
Copy link
Contributor

@jaksmid jaksmid commented Mar 14, 2017

Gensim can load large fasttext model on Mac
Loading of binary fasttext models is faster
Fasttext vector size is correctly set
Vector_size is also saved when loading just the vector_file, which can be useful if you are not interested in training

@tmylk
Copy link
Contributor

tmylk commented Mar 17, 2017

Travis tests re-ran after smart_open update

@tmylk
Copy link
Contributor

tmylk commented Mar 20, 2017

What is the purpose of adding a new attribute vector_size to all KeyedVectors? isn't it always syn0.shape[0]? If necessary, maybe just add it to FastText class?

@jaksmid
Copy link
Contributor Author

jaksmid commented Mar 21, 2017

Thanks @tmylk for the comment. To my understanding, you would use FastText in the case you want to load both the vec and bin files of the fasttext. In case you just want to load the vectors you can use FastTextKeyedVectors. As you pointed out, you can use syn0.shape[0] which may not be that straightforward if you are not familiar with the gensim internals. So I thought it can be a good idea to add vector_size explicitly so people can figure out the right name of the attribute more easily.

By adding it to the FastText class you meant FastTextKeyedVectors class, right?
Thanks!

EDIT: It seems like it should be self.syn0.shape[1] as seen in keyedvectors - save_word2vec_format which kind of proves my points that it is not that staightforward to get it right :)

Aslo, there is currenty no override of load_word2vec_format in the FastTextKeydVector, so unless we increase the complexity of FastTextKeyedVector by adding this override we have no place to initialize vector_size is FastTextKeyedVector.

Looking forward to your suggestions.

@jaksmid
Copy link
Contributor Author

jaksmid commented Mar 22, 2017

Furthermore, you cannot just use FastTextKeyedVectors without FastText initialization (which needs both vec and bin) as __contains__ in FastTextKeyedVectors references self.min_n which is actually set in FastText load_model_params.

Copy link
Contributor

@tmylk tmylk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just indent changes requested. Glad it became a comprehensive fix!
Please add a note in the changelog.md as well.

@@ -169,10 +189,12 @@ def testMostSimilarCosmul(self):
"""Test most_similar_cosmul for in-vocab and out-of-vocab words"""
# In vocab, sanity check
self.assertEqual(len(self.test_model.most_similar_cosmul(positive=['the', 'and'], topn=5)), 5)
self.assertEqual(self.test_model.most_similar_cosmul('the'), self.test_model.most_similar_cosmul(positive=['the']))
self.assertEqual(self.test_model.most_similar_cosmul('the'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use hanging indent

# Out of vocab check
self.assertEqual(len(self.test_model.most_similar_cosmul(['night', 'nights'], topn=5)), 5)
self.assertEqual(self.test_model.most_similar_cosmul('nights'), self.test_model.most_similar_cosmul(positive=['nights']))
self.assertEqual(self.test_model.most_similar_cosmul('nights'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hanging indent preferred

@jaksmid
Copy link
Contributor Author

jaksmid commented Mar 23, 2017

Thanks for reviewing the code.
I incorporated the code review and merged with the current develop.

It seems like the build for python 2 stalled, could you rerun it @tmylk please?

@gojomo gojomo changed the title Issue 1196 Fix #1196 - Gensim error when loading FastText Mar 23, 2017
@tmylk
Copy link
Contributor

tmylk commented Mar 24, 2017

These tests are know to occasionally fail but it's the first time they fail constantly. Will disable them in the main branch soon.

@jaksmid
Copy link
Contributor Author

jaksmid commented Mar 24, 2017

Ok. Let me know if I can help with anything.

@tmylk tmylk merged commit ed49ea6 into piskvorky:develop Mar 30, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants