
[MRG] Keras wrapper for Word2Vec model in Gensim #1248

Merged

Conversation

chinmayapancholi13
Contributor

This PR adds a Keras wrapper for Word2Vec Model in Gensim.

@chinmayapancholi13
Contributor Author

@tmylk I have tried to use the wrapper for a smaller version of the 20NewsGroups task. The code used is as follows (it is based on the code used here).

from __future__ import print_function

import os
import sys
import numpy as np
from gensim.sklearn_integration.keras_wrapper_gensim_word2vec import KerasWrapperWord2VecModel
from gensim.models import word2vec
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from keras.layers import Dense, Input, Flatten
from keras.layers import Conv1D, MaxPooling1D
from keras.models import Model

BASE_DIR = ''
TEXT_DATA_DIR = BASE_DIR + './path/to/text/data/dir'
MAX_SEQUENCE_LENGTH = 1000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2
BATCH_SIZE = 128

# prepare text samples and their labels

texts = []  # list of text samples
labels_index = {}  # dictionary mapping label name to numeric id
labels = []  # list of label ids

for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                if sys.version_info < (3,):
                    f = open(fpath)
                else:
                    f = open(fpath, encoding='latin-1')
                t = f.read()
                i = t.find('\n\n')  # skip header
                if 0 < i:
                    t = t[i:]
                texts.append(t)
                f.close()
                labels.append(label_id)

# vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(labels))

# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]

# train the embedding matrix
data1 = word2vec.LineSentence('./path/to/input/data')
Keras_w2v = KerasWrapperWord2VecModel(data1, min_count=1)
embedding_layer = Keras_w2v.get_embedding_layer()

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)

x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)  # global max pooling
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=BATCH_SIZE)

@tmylk
Contributor

tmylk commented Mar 30, 2017

Please add unit tests and an ipynb

@chinmayapancholi13
Contributor Author

@tmylk Sure.

@chinmayapancholi13
Contributor Author

@tmylk I have added unit tests for the word similarity task (using cosine distance) as well as for a smaller version of the 20NewsGroups classification task. I have also created an IPython notebook for the wrapper explaining both examples.

Addresses of Atheist Organizations
USA
FREEDOM FROM RELIGION FOUNDATION
Darwin fish bumper stickers and assorted other atheist paraphernalia are available from the Freedom From Religion Foundation in the US.
Contributor

Please put data into test/test_data

Contributor Author

@tmylk The data used in the unit tests is already present in the test/test_data folder. The data used in the IPython notebook is in the docs/notebooks/datasets folder. Since we use the same data in both places, should I avoid the duplication and point the notebook at the data in test/test_data instead (i.e. set the path accordingly in the ipynb)?

Contributor

yes, test/test_data is a better location for data used in both

Contributor Author

Got it. So, I'll change the path set in the ipynb for this functionality.

#
# Copyright (C) 2011 Radim Rehurek <radimrehurek@seznam.cz>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html
"""Keras wrappers for gensim.
Contributor

"Wrapper to allow gensim word2vec as input into Keras." is clearer. And in the other docstrings and ipynbs too.

Contributor Author

@tmylk The file __init__.py is common for the entire folder gensim/keras_integration (which in turn would have the files for integration of the various models with Keras). So shouldn't the docstring here be more generic (like it already is)?

Contributor

It can be more general, just pointing out the difference in meaning between "keras wrapper for gensim" vs "gensim wrapper for keras". Which one is it?
I think it is a "Wrapper to allow gensim models as input into Keras."

Contributor Author

Okay, understood now. Thanks for pointing this out. I'll change "Keras wrappers for gensim" to "Wrappers to allow gensim models as input into Keras".

@tmylk
Contributor

tmylk commented May 2, 2017

Thanks for the feature. The change seems quite small: just 5 lines in get_embedding_layer().

Instead of creating a new class, could you please just add this one method to KeyedVectors?
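For context, a minimal numpy-only sketch of what such a method could do: build the weight matrix (one row per vocabulary word, in index order) that a Keras Embedding layer would be initialised with. The `embedding_weights` helper and the toy vocabulary below are illustrative assumptions, not the actual gensim code.

```python
import numpy as np

def embedding_weights(index2word, vectors):
    """Stack one vector per vocabulary word, in index order -- the weight
    matrix a Keras Embedding layer would be initialised with."""
    return np.vstack([vectors[word] for word in index2word])

# Toy stand-in for KeyedVectors: 3 words with 4-dimensional vectors.
vocab = ['graph', 'trees', 'minors']
vecs = {w: np.full(4, i, dtype=np.float32) for i, w in enumerate(vocab)}

weights = embedding_weights(vocab, vecs)
print(weights.shape)  # (3, 4)

# The actual method would then return something like (assuming Keras
# is installed):
# from keras.layers import Embedding
# Embedding(input_dim=weights.shape[0], output_dim=weights.shape[1],
#           weights=[weights], trainable=False)
```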

@chinmayapancholi13
Contributor Author

@tmylk Sure. I'll add the function in KeyedVectors instead of creating a new class.

@tmylk
Contributor

tmylk commented May 18, 2017

@chinmayapancholi13 this is more appropriate to have in the KeyedVectors class.
Please also add an example of integration with classification, as in https://github.com/stephenhky/PyShortTextCategorization/blob/db246e3ade2fcdea58953ff807d259464765a661/shorttext/classifiers/embed/nnlib/VarNNEmbedVecClassification.py

@chinmayapancholi13
Contributor Author

@tmylk I have moved the get_embedding_layer function to keyedvectors.py. I am adding the classification example now.
Also, the tests are failing because PEP8 checks are being run on the test data. Is this expected?

@chinmayapancholi13
Contributor Author

@tmylk Thanks a lot for your feedback. I have incorporated your suggestions as follows:

  • rename model_wv as wv: Done

  • cosine of 2 vectors is not the probability of them occurring together: I have replaced the comment with "output is the cosine distance between the two words (as a similarity measure)"

  • Please fix "The Merge layer is deprecated and will be removed after 08/2017": I have replaced merge with the dot function (with the param normalize set to True for taking the cosine distance). That particular warning is gone now.

  • Please compare the result of the keras cos model with a simple wv[word_b].dot(wv[word_a]); it should be the same: The values are the same when we set normalize to False in the dot function, because in that case the output is only the dot product (not the cosine distance) of the two vectors.
    Would we want this check to be added explicitly somewhere in the unit tests or the ipynb notebook? Or was this check only for our own verification of the Keras function's behavior?

  • In the final cell the score of {'mathematics': 0.97023982, is very good. Please call it a good result! : I have updated the comment in the last cell.
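The equivalence discussed above can be sketched with plain numpy (no Keras needed): the dot layer with normalize=True computes the cosine similarity, while with normalize=False it returns the raw dot product, which matches the cosine only after both vectors are unit-normalised. The helper name below is an illustrative assumption.

```python
import numpy as np

def cosine_similarity(a, b):
    """What the Keras dot layer computes with normalize=True."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.RandomState(0)
vec_a, vec_b = rng.rand(100), rng.rand(100)

# With normalize=False the layer returns the raw dot product ...
raw = float(np.dot(vec_a, vec_b))
# ... which equals the cosine only once both vectors are unit-normalised.
unit_a = vec_a / np.linalg.norm(vec_a)
unit_b = vec_b / np.linalg.norm(vec_b)
print(abs(float(np.dot(unit_a, unit_b)) - cosine_similarity(vec_a, vec_b)) < 1e-12)  # True
```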

@@ -0,0 +1,8 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
Contributor

This file and folder are no longer needed

Contributor Author

Right! Removing this folder right away.

setup.py Outdated
@@ -233,6 +233,9 @@ def finalize_options(self):
'scikit-learn',
'pyemd',
'annoy',
'theano',
Contributor

Would only one of them be enough? TF is preferred.

@chinmayapancholi13 chinmayapancholi13 changed the title [WIP] Keras wrapper for Word2Vec model in Gensim [MRG] Keras wrapper for Word2Vec model in Gensim Jun 1, 2017
@tmylk
Contributor

tmylk commented Jun 2, 2017

Let's add a note about shorttext to ipynb.

And then LGTM

@menshikh-iv
Contributor

menshikh-iv commented Jun 3, 2017

Need to fix problems with Travis (strange, because Keras is installed, but in the log I see unittest.case.SkipTest: Test requires Keras to be installed, which is not available).

@chinmayapancholi13
Contributor Author

@menshikh-iv This is because Keras also needs TensorFlow to be installed (when TensorFlow is the default backend). And we are not installing TensorFlow in .travis.yml because pip install tensorflow installs the CPU-only version by default, so if a user already has the GPU-supported version (installed via pip install tensorflow-gpu), it would get overwritten. This was also the reason for removing the automatic installation of TF when installing Keras (see keras-team/keras#5776 (comment)).

Thus we (like Keras) expect users to install TensorFlow themselves.

@menshikh-iv
Contributor

menshikh-iv commented Jun 3, 2017

@chinmayapancholi13 Thanks for the clarification, but what will we do about Travis?

@chinmayapancholi13
Contributor Author

@menshikh-iv One solution would be to include TF in the installation and add a note that users may have to re-install the TF build of their choice afterwards (in case it gets overwritten). For the tests to pass on Travis, I believe we must install TF somehow, so this seems like a workable solution to me.
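A hypothetical .travis.yml fragment for this approach (package names are the real pip packages, but the exact placement in gensim's CI config is an assumption):

```yaml
# Hypothetical CI fragment: install the CPU-only TensorFlow build so the
# Keras-dependent tests are not skipped. CI machines have no GPU anyway,
# so overwriting a GPU build is not a concern there.
install:
  - pip install tensorflow
  - pip install keras
```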

@menshikh-iv
Contributor

@chinmayapancholi13 Sounds good for me, let's do it.

@menshikh-iv
Contributor

@chinmayapancholi13 Great 🥇

@menshikh-iv menshikh-iv merged commit 7e74d15 into piskvorky:develop Jun 4, 2017
Owner

@piskvorky piskvorky left a comment

Thanks for the interesting feature and notebook!

There are some code style issues -- can you fix that? @menshikh-iv

"cell_type": "markdown",
"metadata": {},
"source": [
"Then, we call the wrapper and pass appropriate parameters."
Owner

@piskvorky piskvorky Jun 5, 2017

What wrapper?

Here and below, the text refers to some "wrapper", but I see no wrapper.

"source": [
"word_a = 'graph'\n",
"word_b = 'trees'\n",
"output = keras_model.predict([np.asarray([model.wv.vocab[word_a].index]), np.asarray([model.wv.vocab[word_b].index])]) # output is the cosine distance between the two words (as a similarity measure)\n",
Owner

The comment would be better moved after the code, on a separate line, to improve readability.

"source": [
"# global variables\n",
"\n",
"nb_filters=1200 # number of filters\n",
Owner

@piskvorky piskvorky Jun 5, 2017

PEP8: space around the assignment operator (x = 1).

Here and in other places in the notebook.

" category_col, descp_col = df.columns.values.tolist()\n",
" shorttextdict = defaultdict(lambda : [])\n",
" for category, descp in zip(df[category_col], df[descp_col]):\n",
" if type(descp)==str:\n",
Owner

Will this work across Python 2 / Python 3?

Contributor Author

Yes. It is working fine for both Python 2 and 3.

" shorttextdict = defaultdict(lambda : [])\n",
" for category, descp in zip(df[category_col], df[descp_col]):\n",
" if type(descp)==str:\n",
" shorttextdict[category] += [descp]\n",
Owner

@piskvorky piskvorky Jun 5, 2017

append would be simpler, faster and more readable?

Also, if you need to convert to plain dict anyway below, a plain setdefault(key, []).append(x) may be easier than defaultdict.
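The reviewer's setdefault suggestion, sketched on toy data (the `rows` list is made up for illustration): grouping values by key this way yields a plain dict directly, so no later conversion from defaultdict is needed.

```python
# Group descriptions by category without defaultdict: setdefault returns
# the existing list for a key, or inserts and returns a fresh [] first.
rows = [('math', 'algebra'), ('cs', 'parsing'), ('math', 'topology')]

shorttextdict = {}
for category, descp in rows:
    shorttextdict.setdefault(category, []).append(descp)

print(shorttextdict)  # {'math': ['algebra', 'topology'], 'cs': ['parsing']}
```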

" \"\"\"\n",
" df = pd.read_csv(filepath)\n",
" category_col, descp_col = df.columns.values.tolist()\n",
" shorttextdict = defaultdict(lambda : [])\n",
Owner

defaultdict(list)

" Return an example data set, with three subjects and corresponding keywords.\n",
" This is in the format of the training input.\n",
" \"\"\"\n",
" data_path = './datasets/keras_classifier_training_data.csv'\n",
Owner

@piskvorky piskvorky Jun 5, 2017

os.path.join is better (it will work on Windows too).

Here and elsewhere.
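Concretely, the suggested change for the path in the notebook snippet above:

```python
import os.path

# os.path.join inserts the separator appropriate for the host OS, so the
# resulting path works on Windows as well as on Unix.
data_path = os.path.join('datasets', 'keras_classifier_training_data.csv')
print(data_path)
```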

"cell_type": "markdown",
"metadata": {},
"source": [
"The result above clearly suggests (~ 98% probability!) that the input `artificial intellegence` should belong to the category `mathematics`, which conforms very well with the expected output in this case.\n",
Owner

intellegence => intelligence

@chinmayapancholi13
Contributor Author

@piskvorky Thanks a lot for your comprehensive feedback! :) I'd be happy to make these changes in a new PR. I'll also try to keep in mind these code-style issues in the future.
