Added support for initializing embeddings from pre-trained word vectors. #32
Conversation
Thank you for your contribution. I had a quick look at the code. It looks good, but I would refactor the loading/projecting code into a separate function. That's just a minor modification; the overall code is good. I'll do it if you don't.
It's merged. I just have a small question. In your code, you disable training for the embedded vectors. Done at this line: This code is only executed the first time the model is created. Are you sure that when the model is reloaded, training will still be disabled for those variables?
…model When initEmbeddings is used, the embeddings are removed from the training variables. This was, however, not done when restoring a model from a checkpoint.
Good catch! This is indeed part of the model and is not saved in the checkpoint state. We must therefore disable training for the embeddings upon reload as well. I've just created a pull request to solve this.
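For reference, a minimal sketch of the idea being discussed, assuming TensorFlow 1.x and that the embedding variables can be identified by name (the scope name and hyperparameters below are assumptions, not the project's actual code):

```python
import tensorflow as tf

def build_train_op(loss, learning_rate=0.002):
    """Sketch: keep pre-trained embedding variables out of the optimizer's var_list."""
    all_vars = tf.trainable_variables()
    # Exclude any variable whose name marks it as an embedding matrix (assumed naming).
    train_vars = [v for v in all_vars if 'embedding' not in v.name]
    optimizer = tf.train.AdamOptimizer(learning_rate)
    # var_list is recomputed every time the graph is rebuilt, so the exclusion also
    # holds after restoring from a checkpoint, which is the point of the fix above.
    return optimizer.minimize(loss, var_list=train_vars)
```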
Fixes #32 - Embeddings were not disabled when restoring model
Do you see significant differences when using pre-trained embeddings? Do you need less training to achieve the same qualitative results? When you say "small datasets" @eschnou, how small are we talking here?
Really difficult for me to draw a conclusion at this stage. I have poor results whatever I try. This is probably due to my hardware limitations: I can't go above 256 units in my hidden layer, which seems too low (Google uses 4096 units in their paper A Neural Conversational Model). I need to try this on EC2 with a decent layer size. With respect to size, I've read somewhere (can't find the source anymore) that you need millions of words to properly train word2vec. That means 100k sentences at least. Richard Socher gives similar numbers in his class. So I assume that if you have a 'small' set (in the 10k sentences range), then using pre-trained embeddings can only help. It also makes sense intuitively: vectors trained on millions of sentences will embed more 'meaning' than vectors trained on much smaller sets. One way to visualize this is to use TensorBoard and have a look at the embeddings after training. You'll see that with pre-trained ones, the distribution looks much nicer.
What GPU setup do you have? I can spin you up a p2.xlarge instance if you need (I have some free credits).
Thanks Julien! My home setup is two years old, with a GTX 760 and a Core i7. Good enough to play, learn and investigate small models, but I definitely need to switch to EC2 for real stuff. Testing on EC2 is on my to-do list; I'll let you know when I get some results!
Hi @eschnou, I have a data set I would like to train on (it includes books and long-form video interviews). Would word2vec be the right place to start putting the data into a readable source?
The following patch enables initializing embeddings with pre-trained word vectors. This is especially useful when working with small datasets. In order to use this feature, just download the init vectors and place them in /data/word2vec, then launch with --initEmbeddings.
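Roughly, the idea behind such an initialization looks like the sketch below. It assumes gensim is available and that a vocabulary mapping and embedding variable exist under these (hypothetical) names; the word2vec file name is also an assumption:

```python
import tensorflow as tf
from gensim.models import KeyedVectors  # assumes gensim is installed

def init_embeddings(session, embedding_var, word2id,
                    path='data/word2vec/GoogleNews-vectors-negative300.bin'):
    """Sketch: copy pre-trained word2vec vectors into the model's embedding matrix."""
    w2v = KeyedVectors.load_word2vec_format(path, binary=True)
    emb = session.run(embedding_var)  # start from the random initialization
    for word, idx in word2id.items():
        if word in w2v:
            # Truncate/project if the pre-trained dimension differs from the model's.
            emb[idx] = w2v[word][:emb.shape[1]]
    session.run(tf.assign(embedding_var, emb))
```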
The difference can easily be seen by looking at the embeddings in TensorBoard:
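For anyone wanting to reproduce that view, a minimal sketch of hooking the embedding variable into the TensorBoard projector (TF 1.x; the variable and log directory names are placeholders, not the project's own):

```python
import os
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector

def link_embedding_to_projector(embedding_var, log_dir='save/model'):
    """Sketch: register the embedding tensor with the TensorBoard projector."""
    config = projector.ProjectorConfig()
    emb_conf = config.embeddings.add()
    emb_conf.tensor_name = embedding_var.name
    # Optional: a metadata.tsv with one word per line gives labeled points.
    emb_conf.metadata_path = os.path.join(log_dir, 'metadata.tsv')
    writer = tf.summary.FileWriter(log_dir)
    projector.visualize_embeddings(writer, config)
```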