
Added support for initializing embeddings from pre-trained word vectors. #32

Merged

Conversation

@eschnou commented Dec 20, 2016

The following patch enables initializing the embeddings with pre-trained word vectors. This is especially useful when working with small datasets. To use this feature, just download the init vectors, place them in /data/word2vec, then launch with --initEmbeddings.

The difference can easily be seen by looking at the embeddings in TensorBoard:

  • embedding_rnn_seq2seq/RNN/EmbeddingWrapper/embedding
  • embedding_rnn_seq2seq/embedding_rnn_decoder/embedding
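For context, here is a minimal sketch of what such an initialization can look like. This is not the actual patch: the helper name follows the loadEmbedding suggestion discussed below, and the gensim-based loading plus the argument names are assumptions.

```python
# Minimal sketch, not the actual DeepQA patch: copy pre-trained word2vec
# vectors into an embedding variable for every word the vocabulary knows.
import tensorflow as tf
from gensim.models import KeyedVectors  # assumes gensim is installed

def load_embedding(sess, word_to_id, embedding_var, vector_path):
    """Overwrite rows of `embedding_var` with pre-trained vectors."""
    w2v = KeyedVectors.load_word2vec_format(vector_path, binary=True)
    values = sess.run(embedding_var)        # keep the random init for OOV words
    dim = values.shape[1]
    for word, idx in word_to_id.items():
        if word in w2v:
            # Assumes the pre-trained dimension is >= the model's embedding size.
            values[idx] = w2v[word][:dim]
    sess.run(embedding_var.assign(values))  # write the matrix back into the graph
```

The two variables listed above would each be fetched from the graph (for example by filtering tf.global_variables() on their names) and passed to such a helper.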

@Conchylicultor commented Dec 20, 2016

Thank you for your contribution. I had a quick look at the code. It looks good, but I would refactor the loading/projecting code into a separate loadEmbedding function in order to keep the main function clean.
Also, to check whether the model has been restored, I think it's better to just check self.glob_step instead of self.restored, to avoid a redundant variable.
Finally, your math import doesn't seem to be used. Not sure if that was intentional.

These are just minor modifications. The overall code is good. I'll do it if you don't.
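For illustration, the suggested structure might look roughly like this. Only self.glob_step, loadEmbedding and --initEmbeddings come from the thread; the surrounding method and attribute names are hypothetical, not the actual repository code.

```python
# Hedged sketch of the suggested refactor, not the actual DeepQA code.
def loadEmbedding(self, sess):
    """All the word2vec loading / projecting code, moved out of the main function."""
    ...

def initialize(self, sess):  # hypothetical wrapper name
    # glob_step == 0 means the model was just created rather than restored
    # from a checkpoint, so no separate self.restored flag is needed.
    if self.args.initEmbeddings and self.glob_step == 0:
        self.loadEmbedding(sess)
```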

@Conchylicultor merged commit c85d6a7 into Conchylicultor:master Dec 21, 2016
@Conchylicultor commented Dec 21, 2016

It's merged. I just have a small question. In your code, you disable training for the embedded vectors, done at this line:
https://github.com/Conchylicultor/DeepQA/blob/master/chatbot/chatbot.py#L414

This code is only executed once, the first time the model is created. Are you sure that when the model is reloaded, training will still be disabled for those variables?
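For readers unfamiliar with the mechanism: "disabling training" for those vectors typically means building the optimizer over a variable list that excludes the two embedding matrices. A rough sketch under assumed names (not the actual chatbot.py code):

```python
# Sketch with assumed names, not the actual chatbot.py code: exclude the two
# embedding matrices (listed in the PR description) from the optimizer updates.
import tensorflow as tf

def build_train_op(loss, learning_rate, freeze_embeddings):
    var_list = tf.trainable_variables()
    if freeze_embeddings:
        # Matches .../EmbeddingWrapper/embedding and .../embedding_rnn_decoder/embedding
        var_list = [v for v in var_list if not v.name.endswith('/embedding:0')]
    return tf.train.AdamOptimizer(learning_rate).minimize(loss, var_list=var_list)
```

If such filtering only runs in the "create a new model" code path, a model restored from a checkpoint will keep training the embeddings, which is the gap the follow-up commit below closes.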

eschnou added a commit to eschnou/DeepQA that referenced this pull request Dec 21, 2016
…model

When initEmbeddings is used, the embeddings are removed from the
training variables. This was however not done when restoring a
model from checkpoint.
@eschnou commented Dec 21, 2016

Good catch! This is indeed part of the model and not saved in the checkpoint state. We must therefore disable training for the embeddings upon reload as well. I've just created a pull request to solve this.

@eschnou deleted the feature/init_embeddings branch December 21, 2016 07:14
Conchylicultor added a commit that referenced this pull request Dec 21, 2016
Fixes #32 - Embeddings were not disabled when restoring model
@julien-c commented
Do you see significant differences when using pre-trained embeddings? Do you need less training to achieve the same qualitative results?

When you say "small datasets" @eschnou, how small are we talking here?

@eschnou commented Jan 28, 2017

It's really difficult for me to draw a conclusion at this stage. I get poor results whatever I try, which is probably due to my hardware limitations. I can't go above 256 units in my hidden layer, which seems too low (Google uses 4096 units in their paper A Neural Conversational Model). I need to try this on EC2 with a decent layer size.

With respect to size, I've read somewhere (can't find the source anymore) that you need millions of words to properly train word2vec, which means at least 100k sentences. Richard Socher gives similar numbers in his class.

So, I assume that if you have a 'small' set (in the 10k-sentence range), then using pre-trained embeddings can only help. It also makes sense intuitively: vectors trained on millions of sentences will embed more 'meaning' than vectors trained on much smaller sets.

One way to visualize this is to use TensorBoard and look at the embeddings after training. You'll see that with pre-trained ones, the distribution looks much nicer.
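If anyone wants to reproduce that TensorBoard view, the era-appropriate way is the embedding projector from tf.contrib. A small sketch follows; the log directory and metadata file are assumptions, not DeepQA defaults.

```python
# Sketch: point TensorBoard's embedding projector at one of the embedding
# variables. Paths and the metadata file are assumptions, not DeepQA defaults.
import os
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector  # TF 1.x API

log_dir = 'save/model'                     # wherever the summaries are written
writer = tf.summary.FileWriter(log_dir)

config = projector.ProjectorConfig()
emb = config.embeddings.add()
emb.tensor_name = 'embedding_rnn_seq2seq/RNN/EmbeddingWrapper/embedding'
emb.metadata_path = os.path.join(log_dir, 'metadata.tsv')  # one word per row
projector.visualize_embeddings(writer, config)             # writes the projector config
```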

@julien-c commented
What GPU setup do you have? I can spin up a p2.xlarge instance for you if you need it (I have some free credits).

@eschnou commented Jan 29, 2017

Thanks Julien! My home setup is two years old, with a GTX 760 and a Core i7. Good enough to play, learn, and investigate small models, but I definitely need to switch to EC2 for the real stuff. Testing on EC2 is on my to-do list; I'll let you know when I get some results!

@eschnou commented Jan 30, 2017

@julien-c Have a look at my latest comment on #47. I've finally launched training on an EC2 p2.xlarge. I could do it with a hidden size of 2048 (4096 still blows up).

@d-w-h commented Feb 3, 2017

Hi @eschnou, I have a dataset I would like to train on (it includes books and long-form video interviews). Would word2vec be the right place to start for getting the data into a usable format?
