Added support for initializing embeddings from pre-trained word vectors. #32
Conversation
Thank you for your contribution. I had a quick look at the code. It looks good, but I would refactor the loading/projecting code into a separate function. That's just a minor modification; the overall code is good. I'll do it if you don't.
It's merged. I just have a small question. In your code, you disable training for the embedded vectors. Done at this line: This code is only executed the first time the model is created. Are you sure that when the model is reloaded, training will still be disabled for those variables?
…model When initEmbeddings is used, the embeddings are removed from the training variables. This was, however, not done when restoring a model from a checkpoint.
Good catch! This is indeed part of the model and is not saved in the checkpoint state. We must therefore disable training for the embeddings upon reload as well. I've just created a pull request to solve this.
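For reference, a minimal sketch of the idea being discussed, assuming TensorFlow 1.x and that the embedding variables can be identified by name (the scope name and hyperparameters below are assumptions, not the project's actual code):

```python
import tensorflow as tf

def build_train_op(loss, learning_rate=0.002):
    """Sketch: keep pre-trained embedding variables out of the optimizer's var_list."""
    all_vars = tf.trainable_variables()
    # Exclude any variable whose name marks it as an embedding matrix (assumed naming).
    train_vars = [v for v in all_vars if 'embedding' not in v.name]
    optimizer = tf.train.AdamOptimizer(learning_rate)
    # var_list is recomputed every time the graph is rebuilt, so the exclusion also
    # holds after restoring from a checkpoint, which is the point of the fix above.
    return optimizer.minimize(loss, var_list=train_vars)
```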
Fixes #32 - Embeddings were not disabled when restoring model
Do you see significant differences when using pre-trained embeddings? Do you need less training to achieve the same qualitative results? When you say "small datasets" @eschnou, how small are we talking here?
Really difficult for me to draw a conclusion at this stage. I have poor results whatever I try. This is probably due to my hardware limitations: I can't go above 256 units in my hidden layer, which seems too low (Google uses 4096 units in their paper A Neural Conversational Model). I need to try this on EC2 with a decent layer size. With respect to size, I've read somewhere (can't find the source anymore) that you need millions of words to properly train word2vec. That means 100k sentences at least. Richard Socher gives similar numbers in his class. So I assume that if you have a 'small' set (in the 10k sentences range), then using pre-trained embeddings can only help. It also makes sense intuitively: vectors trained on millions of sentences will embed more 'meaning' than vectors trained on much smaller sets. One way to visualize this is to use TensorBoard and have a look at the embeddings after training. You'll see that with pre-trained ones, the distribution looks much nicer.
What GPU setup do you have? I can spin you up a p2.xlarge instance if you need (I have some free credits).
Thanks Julien! My home setup is two years old, with a GTX 760 and a Core i7. Good enough to play, learn and investigate small models, but I definitely need to switch to EC2 for real stuff. Testing on EC2 is on my to-do list; I'll let you know when I get some results!
Hi @eschnou, I have a data set I would like to train on (it includes books and long-form video interviews). Would word2vec be the right place to start putting the data into a readable source?
The following patch enables initializing embeddings with pre-trained word vectors. This is especially useful when working with small datasets. In order to use this feature, just download the init vectors and place them in /data/word2vec, then launch with --initEmbeddings.
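Roughly, the idea behind such an initialization looks like the sketch below. It assumes gensim is available and that a vocabulary mapping and embedding variable exist under these (hypothetical) names; the word2vec file name is also an assumption:

```python
import tensorflow as tf
from gensim.models import KeyedVectors  # assumes gensim is installed

def init_embeddings(session, embedding_var, word2id,
                    path='data/word2vec/GoogleNews-vectors-negative300.bin'):
    """Sketch: copy pre-trained word2vec vectors into the model's embedding matrix."""
    w2v = KeyedVectors.load_word2vec_format(path, binary=True)
    emb = session.run(embedding_var)  # start from the random initialization
    for word, idx in word2id.items():
        if word in w2v:
            # Truncate/project if the pre-trained dimension differs from the model's.
            emb[idx] = w2v[word][:emb.shape[1]]
    session.run(tf.assign(embedding_var, emb))
```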
The difference can easily be seen by looking at the embeddings in TensorBoard:
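For anyone wanting to reproduce that view, a minimal sketch of hooking the embedding variable into the TensorBoard projector (TF 1.x; the variable and log directory names are placeholders, not the project's own):

```python
import os
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector

def link_embedding_to_projector(embedding_var, log_dir='save/model'):
    """Sketch: register the embedding tensor with the TensorBoard projector."""
    config = projector.ProjectorConfig()
    emb_conf = config.embeddings.add()
    emb_conf.tensor_name = embedding_var.name
    # Optional: a metadata.tsv with one word per line gives labeled points.
    emb_conf.metadata_path = os.path.join(log_dir, 'metadata.tsv')
    writer = tf.summary.FileWriter(log_dir)
    projector.visualize_embeddings(writer, config)
```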