
[MRG] Keras wrapper for Word2Vec model in Gensim #1248

Merged

Conversation

chinmayapancholi13
Contributor

This PR adds a Keras wrapper for Word2Vec Model in Gensim.

@chinmayapancholi13
Contributor Author

@tmylk I have tried to use the wrapper for a smaller version of the 20NewsGroups task. The code used is as follows (it is based on the code used here).

from __future__ import print_function

import os
import sys
import numpy as np
from gensim.sklearn_integration.keras_wrapper_gensim_word2vec import KerasWrapperWord2VecModel
from gensim.models import word2vec
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from keras.layers import Dense, Input, Flatten
from keras.layers import Conv1D, MaxPooling1D
from keras.models import Model

BASE_DIR = ''
TEXT_DATA_DIR = BASE_DIR + './path/to/text/data/dir'
MAX_SEQUENCE_LENGTH = 1000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2
BATCH_SIZE = 128

# prepare text samples and their labels

texts = []  # list of text samples
labels_index = {}  # dictionary mapping label name to numeric id
labels = []  # list of label ids

for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                if sys.version_info < (3,):
                    f = open(fpath)
                else:
                    f = open(fpath, encoding='latin-1')
                t = f.read()
                i = t.find('\n\n')  # skip header
                if 0 < i:
                    t = t[i:]
                texts.append(t)
                f.close()
                labels.append(label_id)

# vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(labels))

# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]

# train the embedding matrix
data1 = word2vec.LineSentence('./path/to/input/data')
Keras_w2v = KerasWrapperWord2VecModel(data1, min_count=1)
embedding_layer = Keras_w2v.get_embedding_layer()

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)

x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)  # global max pooling
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=BATCH_SIZE)

@tmylk
Contributor

tmylk commented Mar 30, 2017

Please add unit tests and an ipynb

@chinmayapancholi13
Contributor Author

@tmylk Sure.

@chinmayapancholi13
Contributor Author

@tmylk I have added unit tests for the word similarity task (using cosine distance) as well as for a smaller version of the 20NewsGroups classification task. I have also created an IPython notebook for the wrapper explaining both examples.

Addresses of Atheist Organizations
USA
FREEDOM FROM RELIGION FOUNDATION
Darwin fish bumper stickers and assorted other atheist paraphernalia are available from the Freedom From Religion Foundation in the US.
Contributor

Please put data into test/test_data

Contributor Author

@tmylk The data used in the unit tests is already present in the test/test_data folder. The data used in the IPython notebook is in the docs/notebooks/datasets folder. Since we use the same data in both places, should I avoid the duplication and point the notebook at the data in test/test_data instead (i.e. set the path accordingly in the ipynb)?

Contributor

yes, test/test_data is a better location for data used in both

Contributor Author

Got it. So, I'll change the path set in the ipynb for this functionality.

#
# Copyright (C) 2011 Radim Rehurek <radimrehurek@seznam.cz>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html
"""Keras wrappers for gensim.
Contributor

"Wrapper to allow gensim word2vec as input into Keras." is clearer. And in the other docstrings and ipynbs too.

Contributor Author

@tmylk The file __init__.py is common for the entire folder gensim/keras_integration (which in turn would have the files for integration of the various models with Keras). So shouldn't the docstring here be more generic (like it already is)?

Contributor

It can be more general, just pointing out the difference in meaning between "keras wrapper for gensim" vs "gensim wrapper for keras". Which one is it?
I think it is a "Wrapper to allow gensim models as input into Keras."

Contributor Author

Okay, understood now. Thanks for pointing this out. I'll change "Keras wrappers for gensim" to "Wrappers to allow gensim models as input into Keras".

@tmylk
Contributor

tmylk commented May 2, 2017

Thanks for the feature. The change seems quite small: just 5 lines in get_embedding_layer().

Instead of creating a new class, could you please just add this one method to KeyedVectors?
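For context, a minimal numpy-only sketch of what such a method could do: build the weight matrix (one row per vocabulary word, in index order) that a Keras Embedding layer would be initialised with. The `embedding_weights` helper and the toy vocabulary below are illustrative assumptions, not the actual gensim code.

```python
import numpy as np

def embedding_weights(index2word, vectors):
    """Stack one vector per vocabulary word, in index order -- the weight
    matrix a Keras Embedding layer would be initialised with."""
    return np.vstack([vectors[word] for word in index2word])

# Toy stand-in for KeyedVectors: 3 words with 4-dimensional vectors.
vocab = ['graph', 'trees', 'minors']
vecs = {w: np.full(4, i, dtype=np.float32) for i, w in enumerate(vocab)}

weights = embedding_weights(vocab, vecs)
print(weights.shape)  # (3, 4)

# The actual method would then return something like (assuming Keras
# is installed):
# from keras.layers import Embedding
# Embedding(input_dim=weights.shape[0], output_dim=weights.shape[1],
#           weights=[weights], trainable=False)
```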

@chinmayapancholi13
Contributor Author

@tmylk Sure. I'll add the function in KeyedVectors instead of creating a new class.

@tmylk
Contributor

tmylk commented May 18, 2017

@chinmayapancholi13 this is more appropriate to have in the KeyedVectors class.
Please also add an example of integration with classification, as in https://github.com/stephenhky/PyShortTextCategorization/blob/db246e3ade2fcdea58953ff807d259464765a661/shorttext/classifiers/embed/nnlib/VarNNEmbedVecClassification.py

@chinmayapancholi13
Contributor Author

@tmylk I have moved the get_embedding_layer function to keyedvectors.py. I am adding the classification example now.
Also, the tests are failing because PEP8 checks are being run on the test data. Is this expected?

@chinmayapancholi13
Contributor Author

@tmylk Thanks a lot for your feedback. I have incorporated your suggestions as follows:

  • rename model_wv as wv: Done

  • cosine of 2 vectors is not the probability of them occurring together: I have replaced the comment with "output is the cosine distance between the two words (as a similarity measure)"

  • Please fix "The Merge layer is deprecated and will be removed after 08/2017": I have replaced merge with the dot function (with the param normalize set to True for taking the cosine distance). That particular warning is gone now.

  • Please compare the result of the keras cos model with a simple wv[word_b].dot(wv[word_a]); it should be the same: The values are the same when we set normalize to False in the dot function, because in that case the output is only the dot product (not the cosine distance) of the two vectors.
    Would we want this check to be added explicitly somewhere in the unit tests or the ipynb notebook? Or was this check only for our own verification of the Keras function's behavior?

  • In the final cell the score of {'mathematics': 0.97023982, is very good. Please call it a good result! : I have updated the comment in the last cell.
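The equivalence discussed above can be sketched with plain numpy (no Keras needed): the dot layer with normalize=True computes the cosine similarity, while with normalize=False it returns the raw dot product, which matches the cosine only after both vectors are unit-normalised. The helper name below is an illustrative assumption.

```python
import numpy as np

def cosine_similarity(a, b):
    """What the Keras dot layer computes with normalize=True."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.RandomState(0)
vec_a, vec_b = rng.rand(100), rng.rand(100)

# With normalize=False the layer returns the raw dot product ...
raw = float(np.dot(vec_a, vec_b))
# ... which equals the cosine only once both vectors are unit-normalised.
unit_a = vec_a / np.linalg.norm(vec_a)
unit_b = vec_b / np.linalg.norm(vec_b)
print(abs(float(np.dot(unit_a, unit_b)) - cosine_similarity(vec_a, vec_b)) < 1e-12)  # True
```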

@@ -0,0 +1,8 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
Contributor

This file and folder are no longer needed

Contributor Author

Right! Removing this folder right away.

setup.py Outdated
@@ -233,6 +233,9 @@ def finalize_options(self):
'scikit-learn',
'pyemd',
'annoy',
'theano',
Contributor

Would only one of them be enough? TF is preferred.

@chinmayapancholi13 chinmayapancholi13 changed the title [WIP] Keras wrapper for Word2Vec model in Gensim [MRG] Keras wrapper for Word2Vec model in Gensim Jun 1, 2017
@tmylk
Contributor

tmylk commented Jun 2, 2017

Let's add a note about shorttext to ipynb.

And then LGTM

@menshikh-iv
Contributor

menshikh-iv commented Jun 3, 2017

Need to fix problems with Travis (strange, because Keras is installed, but in the log I see unittest.case.SkipTest: Test requires Keras to be installed, which is not available).

@chinmayapancholi13
Contributor Author

@menshikh-iv This is because Keras also needs TensorFlow to be installed (when TensorFlow is the default backend). And we are not installing TensorFlow in .travis.yml because pip install tensorflow installs the CPU-only version by default, so if a user already has the GPU-supported version (installed via pip install tensorflow-gpu), it would get overwritten. This was also the reason for removing the automatic installation of TF when installing Keras (see keras-team/keras#5776 (comment)).

Thus we (like Keras) expect users to install TensorFlow themselves.

@menshikh-iv
Contributor

menshikh-iv commented Jun 3, 2017

@chinmayapancholi13 Thanks for the clarification, but what will we do about Travis?

@chinmayapancholi13
Contributor Author

@menshikh-iv One solution would be to include TF in the installation and add a note that users may have to re-install the TF build of their choice afterwards (in case it gets overwritten). For the tests to pass on Travis, I believe we must install TF somehow, so this seems like a workable solution to me.
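A hypothetical .travis.yml fragment for this approach (package names are the real pip packages, but the exact placement in gensim's CI config is an assumption):

```yaml
# Hypothetical CI fragment: install the CPU-only TensorFlow build so the
# Keras-dependent tests are not skipped. CI machines have no GPU anyway,
# so overwriting a GPU build is not a concern there.
install:
  - pip install tensorflow
  - pip install keras
```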

@menshikh-iv
Contributor

@chinmayapancholi13 Sounds good for me, let's do it.

@menshikh-iv
Contributor

@chinmayapancholi13 Great 🥇

@menshikh-iv menshikh-iv merged commit 7e74d15 into piskvorky:develop Jun 4, 2017
Owner

@piskvorky piskvorky left a comment

Thanks for the interesting feature and notebook!

There are some code style issues -- can you fix that? @menshikh-iv

"cell_type": "markdown",
"metadata": {},
"source": [
"Then, we call the wrapper and pass appropriate parameters."
Owner

@piskvorky piskvorky Jun 5, 2017

What wrapper?

Here and below, the text refers to some "wrapper", but I see no wrapper.

"source": [
"word_a = 'graph'\n",
"word_b = 'trees'\n",
"output = keras_model.predict([np.asarray([model.wv.vocab[word_a].index]), np.asarray([model.wv.vocab[word_b].index])]) # output is the cosine distance between the two words (as a similarity measure)\n",
Owner

The comment would be better moved after the code, on a separate line, to improve readability.

"source": [
"# global variables\n",
"\n",
"nb_filters=1200 # number of filters\n",
Owner

@piskvorky piskvorky Jun 5, 2017

PEP8: space around the assignment operator (x = 1).

Here and in other places in the notebook.

" category_col, descp_col = df.columns.values.tolist()\n",
" shorttextdict = defaultdict(lambda : [])\n",
" for category, descp in zip(df[category_col], df[descp_col]):\n",
" if type(descp)==str:\n",
Owner

Will this work across Python 2 / Python 3?

Contributor Author

Yes. It is working fine for both Python 2 and 3.

" shorttextdict = defaultdict(lambda : [])\n",
" for category, descp in zip(df[category_col], df[descp_col]):\n",
" if type(descp)==str:\n",
" shorttextdict[category] += [descp]\n",
Owner

@piskvorky piskvorky Jun 5, 2017

append would be simpler, faster and more readable?

Also, if you need to convert to plain dict anyway below, a plain setdefault(key, []).append(x) may be easier than defaultdict.
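The reviewer's setdefault suggestion, sketched on toy data (the `rows` list is made up for illustration): grouping values by key this way yields a plain dict directly, so no later conversion from defaultdict is needed.

```python
# Group descriptions by category without defaultdict: setdefault returns
# the existing list for a key, or inserts and returns a fresh [] first.
rows = [('math', 'algebra'), ('cs', 'parsing'), ('math', 'topology')]

shorttextdict = {}
for category, descp in rows:
    shorttextdict.setdefault(category, []).append(descp)

print(shorttextdict)  # {'math': ['algebra', 'topology'], 'cs': ['parsing']}
```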

" \"\"\"\n",
" df = pd.read_csv(filepath)\n",
" category_col, descp_col = df.columns.values.tolist()\n",
" shorttextdict = defaultdict(lambda : [])\n",
Owner

defaultdict(list)

" Return an example data set, with three subjects and corresponding keywords.\n",
" This is in the format of the training input.\n",
" \"\"\"\n",
" data_path = './datasets/keras_classifier_training_data.csv'\n",
Owner

@piskvorky piskvorky Jun 5, 2017

os.path.join is better (it will work on Windows too).

Here and elsewhere.
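Concretely, the suggested change for the path in the notebook snippet above:

```python
import os.path

# os.path.join inserts the separator appropriate for the host OS, so the
# resulting path works on Windows as well as on Unix.
data_path = os.path.join('datasets', 'keras_classifier_training_data.csv')
print(data_path)
```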

"cell_type": "markdown",
"metadata": {},
"source": [
"The result above clearly suggests (~ 98% probability!) that the input `artificial intellegence` should belong to the category `mathematics`, which conforms very well with the expected output in this case.\n",
Owner

intellegence => intelligence

@chinmayapancholi13
Contributor Author

@piskvorky Thanks a lot for your comprehensive feedback! :) I'd be happy to make these changes in a new PR. I'll also try to keep in mind these code-style issues in the future.
