
Improve pre-trained embeddings loading #24

Merged
JohnGiorgi merged 9 commits into master from fix-embedding-loading on Aug 14, 2018
Conversation

JohnGiorgi
Contributor

@JohnGiorgi JohnGiorgi commented Aug 14, 2018

Previously, pre-trained embeddings were loaded by simply opening the file and looping over it in Python. This was a problem for a few reasons, chiefly because it required converting embeddings from the C binary format (.bin) to the C plain text format (.txt), and the resulting .txt files were much larger.

TODO

  • Use gensim's KeyedVectors.load_word2vec_format to load pre-trained embeddings into memory from a file (see the sketch after this list).
  • Sanity check: run the model with the new loading mechanism and compare performance to previously achieved performance.
  • Add back the debug functionality, whereby passing the --debug argument loads only some N number of embeddings. Gensim has a mechanism for loading only the top N embeddings; use this!
  • Unit tests are breaking, likely because my dummy_word_embeddings are not properly formatted. Maybe just use the Google300 ones?
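
A minimal sketch of how the gensim-based loading and the --debug limit could fit together. The function name `load_embeddings`, the `debug_limit` parameter, and the defaults are illustrative choices, not Saber's actual API:

```python
from gensim.models import KeyedVectors

def load_embeddings(filepath, binary=True, debug=False, debug_limit=10000):
    """Load pre-trained embeddings in word2vec format into memory.

    Note: this function and its defaults are a sketch, not Saber's actual API.
    """
    # gensim's `limit` argument stops reading after the first N vectors,
    # which is what the --debug flag needs.
    limit = debug_limit if debug else None
    return KeyedVectors.load_word2vec_format(filepath, binary=binary, limit=limit)

# Usage (the path is a placeholder):
# word_vectors = load_embeddings('path/to/embeddings.bin', debug=True)
```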

Changed

  • Pre-trained embeddings are now loaded using gensim's KeyedVectors.load_word2vec_format method, so there is no need to convert from .bin to .txt. Saber can load any word embeddings in the word2vec format (see the sketch after this list).
  • ~43% speed-up in loading pre-trained embeddings (one of Saber's big bottlenecks).
  • The .bin file is (at least in our case) ~65% smaller than the corresponding .txt file.
  • No more confusing instructions for the user about converting pre-trained embeddings from .bin to .txt format.
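
One plausible way the loaded KeyedVectors could then be turned into an embedding matrix for the model. The `type_to_idx` mapping, variable names, and file path below are hypothetical and not taken from Saber's code:

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical word-to-index mapping; Saber's actual token mapping may differ.
type_to_idx = {'<PAD>': 0, 'protein': 1, 'kinase': 2}

# 'path/to/embeddings.bin' is a placeholder for any word2vec-format binary file.
word_vectors = KeyedVectors.load_word2vec_format('path/to/embeddings.bin', binary=True)

# One row per known type; rows stay zero for padding and out-of-vocabulary words.
embedding_matrix = np.zeros((len(type_to_idx), word_vectors.vector_size))
for word, idx in type_to_idx.items():
    try:
        embedding_matrix[idx] = word_vectors[word]
    except KeyError:
        continue
```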

Resources

Closes #15.

@JohnGiorgi JohnGiorgi self-assigned this Aug 14, 2018
@JohnGiorgi JohnGiorgi added the enhancement (New feature or request) label Aug 14, 2018
@coveralls

coveralls commented Aug 14, 2018

Pull Request Test Coverage Report for Build 122

  • 15 of 15 (100.0%) changed or added relevant lines in 2 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.3%) to 75.236%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| saber/sequence_processor.py | 2 | 70.65% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 118 | 0.3% |
| Covered Lines | 1197 |
| Relevant Lines | 1591 |

💛 - Coveralls

@JohnGiorgi JohnGiorgi merged commit d7a63b2 into master Aug 14, 2018
@JohnGiorgi JohnGiorgi deleted the fix-embedding-loading branch August 14, 2018 23:31
JohnGiorgi added a commit that referenced this pull request Aug 3, 2019
Improve pre-trained embeddings loading

Former-commit-id: 709d6860a493c67fbf188ef8e52c47dbb547b368 [formerly 877737dc69044c4faa6d2c97bd7616493a67db7d] [formerly e86cfaf5d153397b298d203ce2cef18c827a9af9 [formerly d7a63b2]]
Former-commit-id: d26724f8c28f149560b644c3a8fa331980c8e1d9 [formerly 7c938cafeb9ec2347d6009db378e41722f8c1fbb]
Former-commit-id: 0362b1994af23ba803124dd9606b7317e3ff2c03