
Improve pre-trained embeddings loading #24

Merged
JohnGiorgi merged 9 commits into master from fix-embedding-loading on Aug 14, 2018
Conversation

JohnGiorgi
Contributor

@JohnGiorgi JohnGiorgi commented Aug 14, 2018

Previously, pre-trained embeddings were loaded by simply opening the file and looping over it in Python. This was a problem for a few reasons, chiefly because it required converting embeddings from the C binary format (.bin) to the C plain text format (.txt), and the resulting .txt files were much larger.

TODO

  • Use gensim's KeyedVectors.load_word2vec_format to load pre-trained embeddings into memory from a file (see the sketch after this list).
  • Sanity check: run the model with the new loading mechanism and compare performance to previously achieved performance.
  • Add back the debug functionality, whereby passing the --debug argument loads only some N number of embeddings. Gensim has a mechanism for loading only the top N embeddings; use this!
  • Unit tests are breaking, likely because my dummy_word_embeddings are not properly formatted. Maybe just use the Google300 ones?
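
A minimal sketch of how the gensim-based loading and the --debug limit could fit together. The function name `load_embeddings`, the `debug_limit` parameter, and the defaults are illustrative choices, not Saber's actual API:

```python
from gensim.models import KeyedVectors

def load_embeddings(filepath, binary=True, debug=False, debug_limit=10000):
    """Load pre-trained embeddings in word2vec format into memory.

    Note: this function and its defaults are a sketch, not Saber's actual API.
    """
    # gensim's `limit` argument stops reading after the first N vectors,
    # which is what the --debug flag needs.
    limit = debug_limit if debug else None
    return KeyedVectors.load_word2vec_format(filepath, binary=binary, limit=limit)

# Usage (the path is a placeholder):
# word_vectors = load_embeddings('path/to/embeddings.bin', debug=True)
```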

Changed

  • Pre-trained embeddings are now loaded using gensim's KeyedVectors.load_word2vec_format method, so there is no need to convert from .bin to .txt. Saber can load any word embeddings in the word2vec format (see the sketch after this list).
  • ~43% speed-up in loading pre-trained embeddings (one of Saber's big bottlenecks).
  • The .bin file is (at least in our case) ~65% smaller than the corresponding .txt file.
  • No more confusing instructions for the user about converting pre-trained embeddings from .bin to .txt format.
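
One plausible way the loaded KeyedVectors could then be turned into an embedding matrix for the model. The `type_to_idx` mapping, variable names, and file path below are hypothetical and not taken from Saber's code:

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical word-to-index mapping; Saber's actual token mapping may differ.
type_to_idx = {'<PAD>': 0, 'protein': 1, 'kinase': 2}

# 'path/to/embeddings.bin' is a placeholder for any word2vec-format binary file.
word_vectors = KeyedVectors.load_word2vec_format('path/to/embeddings.bin', binary=True)

# One row per known type; rows stay zero for padding and out-of-vocabulary words.
embedding_matrix = np.zeros((len(type_to_idx), word_vectors.vector_size))
for word, idx in type_to_idx.items():
    try:
        embedding_matrix[idx] = word_vectors[word]
    except KeyError:
        continue
```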

Resources

Closes #15.

@JohnGiorgi JohnGiorgi self-assigned this Aug 14, 2018
@JohnGiorgi JohnGiorgi added the enhancement (New feature or request) label Aug 14, 2018
@coveralls

coveralls commented Aug 14, 2018

Pull Request Test Coverage Report for Build 122

  • 15 of 15 (100.0%) changed or added relevant lines in 2 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.3%) to 75.236%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| saber/sequence_processor.py | 2 | 70.65% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 118 | 0.3% |
| Covered Lines | 1197 |
| Relevant Lines | 1591 |

💛 - Coveralls

@JohnGiorgi JohnGiorgi merged commit d7a63b2 into master Aug 14, 2018
@JohnGiorgi JohnGiorgi deleted the fix-embedding-loading branch August 14, 2018 23:31
JohnGiorgi added a commit that referenced this pull request Aug 3, 2019
Improve pre-trained embeddings loading

Former-commit-id: 709d6860a493c67fbf188ef8e52c47dbb547b368 [formerly 877737dc69044c4faa6d2c97bd7616493a67db7d] [formerly e86cfaf5d153397b298d203ce2cef18c827a9af9 [formerly d7a63b2]]
Former-commit-id: d26724f8c28f149560b644c3a8fa331980c8e1d9 [formerly 7c938cafeb9ec2347d6009db378e41722f8c1fbb]
Former-commit-id: 0362b1994af23ba803124dd9606b7317e3ff2c03