
How to use pretrained word vectors #23

Closed
orph opened this Issue Dec 22, 2016 · 6 comments

orph commented Dec 22, 2016

Hi there! I'm looking for some guidance on how to use pretrained word vectors, either Google word2vec or GloVe. Any examples of how to convert these from download to a format that can be passed using -pre_word_vecs_enc or -pre_word_vecs_dec would be very helpful.

jroakes commented Dec 22, 2016

You can read more here: http://opennmt.net/Guide/#pre-trained-embeddings

I would start by looking at the code here: https://github.com/zhexxian/From-Machine-Learning-To-Zero-Day-Exploits/blob/5e619235f2248ddadde26849ba4acf2fda01f925/util/GloVeEmbedding.lua. In https://github.com/OpenNMT/OpenNMT/blob/master/onmt/modules/WordEmbedding.lua, they basically load a torch .t7 file and assign the indices and weights to the nn.LookupTable weights. One important thing is to make sure the words in your src and target files are found in whatever embeddings you use. Also, memory usage will be lower if you parse the large GloVe files down to only the words that appear in your training data.
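To make that concrete, here is a rough sketch of such a conversion in Torch/Lua. All file paths are placeholders, and the "word index" per-line .dict format is an assumption based on what preprocess.lua writes, so adjust both to your setup:

```lua
-- Sketch: align a GloVe text file with an OpenNMT *.dict file and save
-- the result as a .t7 tensor usable with -pre_word_vecs_enc/_dec.
require('torch')

local dictPath  = 'data/demo.src.dict'    -- hypothetical path
local glovePath = 'glove.6B.300d.txt'     -- hypothetical path
local outPath   = 'embeddings-src-300.t7' -- hypothetical path
local dim       = 300

-- Read the dictionary: assumed one "word index" pair per line.
local vocab, vocabSize = {}, 0
for line in io.lines(dictPath) do
  local word, idx = line:match('^(%S+)%s+(%d+)')
  if word then
    idx = tonumber(idx)
    vocab[word] = idx
    if idx > vocabSize then vocabSize = idx end
  end
end

-- Small random init so dictionary words missing from GloVe
-- (including the special tokens) still get a vector.
local weights = torch.randn(vocabSize, dim):mul(0.1)

-- Copy over the GloVe vector for every word that is in the dictionary.
for line in io.lines(glovePath) do
  local parts = {}
  for token in line:gmatch('%S+') do parts[#parts + 1] = token end
  local idx = vocab[parts[1]]
  if idx and #parts == dim + 1 then
    for i = 1, dim do
      weights[idx][i] = tonumber(parts[i + 1])
    end
  end
end

torch.save(outPath, weights)
```

Keeping the tensor sized to the full dictionary (with random rows for out-of-GloVe words) is what keeps its shape consistent with the model's nn.LookupTable.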

Probably the biggest reason this was not incorporated into the codebase by default up to this point is that the toolkit is meant for translation (embeddings are used on both the source and target sides), and most of the pre-trained embeddings I am familiar with are English.

srush commented Dec 22, 2016

Thanks @jroakes, this is a very nice explanation. A couple of further points.

  1. There are word embeddings in many different languages! Here is a list: https://sites.google.com/site/rmyeid/projects/polyglot#TOC-Download-the-Embeddings (I will include this on the site).
  2. We could add a tool script that takes a dictionary and a GloVe file and aligns them into the correct format.

If either of you wanted to submit a pull request for (2), we would be happy to give credit and include it in the tools/ directory.
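In the meantime, passing the resulting tensors at training time would look something like the sketch below (file names are hypothetical; check `th train.lua -h` for the exact option names in your version):

```bash
th train.lua -data data/demo-train.t7 \
  -save_model demo-model \
  -word_vec_size 300 \
  -pre_word_vecs_enc embeddings-src-300.t7 \
  -pre_word_vecs_dec embeddings-tgt-300.t7
```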

@srush added the enhancement label Dec 22, 2016

jroakes commented Dec 22, 2016

@srush I am working on (2)

jroakes commented Jan 8, 2017

@srush and @orph: the pull request for this is here: #54

Sorry this took me a bit longer than anticipated.

StevenLOL commented Jan 13, 2017

More word vectors are here:

https://github.com/Kyubyong/wordvectors

guillaumekln commented Feb 14, 2017

Closing this enhancement, which is tracked by the above PR.
