# Using pre-trained GloVe vectors with scikit-learn and Keras

Pre-trained GloVe word vectors can be downloaded here:

http://nlp.stanford.edu/data/glove.6B.zip

## Building the embeddings index

Loading the word vectors is straightforward: each line of the file holds a word followed by its vector coefficients.

In [1]:
import os
import numpy as np

GLOVE_DIR = 'glove/'

# Map each word to its 100-dimensional vector.
embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding="utf8") as f:
    for line in f:
        # Each line holds a word followed by its space-separated coefficients.
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.
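The loop above relies on the GloVe text format: one token per line, followed by its space-separated coefficients. As a minimal sketch, the same parsing can be wrapped in a function and tried on a tiny in-memory sample. The function name `load_glove` and the three-dimensional sample vectors are made up for illustration; they are not part of the GloVe distribution.

```python
import io

import numpy as np


def load_glove(lines):
    """Parse GloVe-style text lines into a {word: vector} dictionary.

    `load_glove` is a hypothetical helper name used only in this sketch.
    """
    index = {}
    for line in lines:
        values = line.split()
        if not values:
            continue  # skip blank lines
        index[values[0]] = np.asarray(values[1:], dtype='float32')
    return index


# Made-up 3-dimensional vectors standing in for the real 100-dimensional file.
sample = io.StringIO("hockey 0.1 0.2 0.3\nice -0.5 0.0 0.5\n")
toy_index = load_glove(sample)

print(len(toy_index))             # number of parsed words
print(toy_index['hockey'].shape)  # dimension of each vector
```

Any iterable of lines works here, so an open file handle can be passed in place of the `StringIO` object.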


Example usage of the embeddings_index:

In [2]:
embeddings_index['hockey']

array([ 0.1309    ,  0.011308  ,  0.85140002, -0.44016999, -0.92250001,
       -0.15672   ,  0.073091  , -0.10772   , -1.40629995, -0.2111    ,
       -0.35148999, -1.31570005,  0.38323   , -0.070724  , -0.3457    ,
        0.028638  ,  0.26027   ,  0.25404   , -0.38690999, -0.08535   ,
       -0.85211998,  1.89660001, -0.032885  ,  0.24192999,  0.28060001,
       -0.35846999,  0.13362999, -0.65781999,  0.89960998,  0.58815002,
       -0.59443998,  1.26090002, -0.57046002, -0.14374   ,  0.51617998,
       -1.11679995, -0.79299998,  1.04059994, -0.86282998, -0.81149   ,
        0.80037999,  0.19261   , -0.25137001, -1.15079999,  0.21164   ,
        0.75827998,  0.72103   , -0.72103   ,  0.48985001, -0.18956   ,
       -0.45517999, -0.37231001, -0.11526   ,  0.37957999, -0.047891  ,
       -1.74249995,  0.10357   ,  0.21019   ,  0.81901997,  1.21640003,
       -0.53884   ,  0.58485001, -0.26903999,  0.35712001,  0.30651999,
        0.22014999, -0.20929   ,  0.77614999, -0.21447   ,  0.36
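
Individual vectors are rarely inspected by eye; a common way to use them is to compare two words through the cosine similarity of their vectors. The sketch below is an illustration only: the helper `cosine_similarity` is defined here (it is not part of the notebook above), and the three-dimensional vectors are made up.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Made-up 3-dimensional vectors; with the real index you would pass
# embeddings_index['hockey'] and another word's vector instead.
a = np.array([1.0, 2.0, 0.0], dtype='float32')
b = np.array([2.0, 4.0, 0.0], dtype='float32')   # same direction as a
c = np.array([0.0, 0.0, 3.0], dtype='float32')   # orthogonal to a

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(sim_ab, sim_ac)
```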

## Loading the training and testing set

For training we will use the well-known 20 newsgroups dataset, restricted to four categories.

In [3]:
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

train_set = fetch_20newsgroups(subset='train',
                           categories=categories,
                           remove=('headers', 'footers', 'quotes'),
                           shuffle=True, 
                           random_state=42)

test_set = fetch_20newsgroups(subset='test',                          
                          categories=categories,
                          remove=('headers', 'footers', 'quotes'),
                          shuffle=True, 
                          random_state=42)

print("Train documents: %d"%(len(train_set.data)))
print("Test documents: %d"%(len(test_set.data)))

Train documents: 2257
Test documents: 1502


## Define the embedding vectorizer

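We will define a vectorizer class to use in our pipeline. It represents each document as the mean of the GloVe vectors of its words; a document with no known words maps to a zero vector. Tokens come from a plain whitespace split, so capitalized words and words with attached punctuation are not found in the lowercase GloVe vocabulary and are skipped.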

In [4]:
class EmbeddingVectorizer(object):
    """Represent each document as the mean of its word vectors."""

    def __init__(self, word2vec):
        self.word2vec = word2vec
        # Infer the embedding dimension from any one vector.
        self.dim = len(next(iter(word2vec.values())))

    def fit(self, X, y=None):
        # Nothing to learn: the embeddings are pre-trained.
        return self

    def transform(self, X):
        # Average the vectors of known words; fall back to a zero vector
        # when a document contains no known word.
        return np.array([
            np.mean([self.word2vec[w] for w in text.split() if w in self.word2vec] or
                    [np.zeros(self.dim)], axis=0)
            for text in X])
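
To see what `transform` returns, the sketch below repeats the class in compact form (so that it runs on its own) and applies it to a toy two-dimensional vocabulary. The vocabulary, its vectors, and the sample documents are invented for illustration.

```python
import numpy as np


class EmbeddingVectorizer(object):
    """Compact version of the vectorizer, repeated to keep the sketch self-contained."""

    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.dim = len(next(iter(word2vec.values())))

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in text.split() if w in self.word2vec] or
                    [np.zeros(self.dim)], axis=0)
            for text in X])


# Invented 2-dimensional vectors, not real GloVe values.
toy_vectors = {
    'ice': np.array([1.0, 0.0]),
    'hockey': np.array([0.0, 1.0]),
}

docs = ['ice hockey', 'Hockey!', 'ice ice hockey unknownword']
features = EmbeddingVectorizer(toy_vectors).transform(docs)
print(features)
```

The second document maps to the zero vector because `Hockey!`, with its capital letter and punctuation, is not a key of the dictionary. The third shows that repeated words weigh more in the average and unknown words are ignored.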

## Support Vector Machine classifier

We will build a pipeline using our vectorizer and the SVM classifier (SVC) with a linear kernel.

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

text_clf = Pipeline([('vectorizer', EmbeddingVectorizer(embeddings_index)),
                     ('clf', SVC(kernel="linear"))
])

text_clf.fit(train_set.data, train_set.target)
predicted = text_clf.predict(test_set.data)

print("Accuracy on the test set: %.2f%%" % (np.mean(predicted == test_set.target) * 100))

Accuracy on the test set: 75.63%


## Comparison with a bag-of-words approach

For comparison, we train the same linear SVM on TF-IDF features instead of averaged embeddings.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

text_clf = Pipeline([('tfidf_vectorizer', TfidfVectorizer()),
                     ('clf', SVC(kernel="linear"))
])

text_clf.fit(train_set.data, train_set.target)
predicted = text_clf.predict(test_set.data)

print("Accuracy on the test set: %.2f%%" % (np.mean(predicted == test_set.target) * 100))

Accuracy on the test set: 80.23%


## Convolutional neural network in Keras

In [7]:
from keras.utils import to_categorical

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalAveragePooling1D, MaxPooling1D
from keras.optimizers import RMSprop

SEQ_LEN = 100

Using TensorFlow backend.


Prepare input embeddings and transform labels to binary class matrices:

In [8]:
vect = EmbeddingVectorizer(embeddings_index)

x_train = vect.transform(train_set.data)
x_test = vect.transform(test_set.data)

y_train = to_categorical(train_set.target) 
y_test = to_categorical(test_set.target)

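# Add a trailing channel axis: Conv1D expects (samples, steps, channels).
# Here the 100 "steps" are the embedding dimensions of the averaged vector.
x_train = np.reshape(x_train, (x_train.shape[0], vect.dim, 1))
x_test = np.reshape(x_test, (x_test.shape[0], vect.dim, 1))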

x_train.shape

(2257, 100, 1)
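
The reshape adds a trailing axis of size 1 because `Conv1D` expects three-dimensional input of shape (samples, steps, channels). A small sketch with stand-in data (five made-up "documents" instead of the real 2257) shows the effect, together with `np.expand_dims` as an equivalent that avoids hard-coding the dimension.

```python
import numpy as np

# Stand-in for the averaged embeddings: 5 documents, 100 features each.
x = np.arange(5 * 100, dtype='float32').reshape(5, 100)

# Add a trailing channel axis so the shape becomes (samples, steps, channels).
x_3d = np.reshape(x, (x.shape[0], 100, 1))

# np.expand_dims gives the same result without hard-coding the 100.
x_alt = np.expand_dims(x, axis=-1)

print(x_3d.shape)
print(np.array_equal(x_3d, x_alt))
```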

Define our model:

In [9]:
model = Sequential()
model.add(Conv1D(64, 3, activation='relu', input_shape=(100,1)))
model.add(Conv1D(64, 3, activation='relu'))
model.add(MaxPooling1D(3))
model.add(Conv1D(128, 3, activation='relu'))
model.add(Conv1D(128, 3, activation='relu'))
model.add(GlobalAveragePooling1D())
model.add(Dropout(0.5))
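# One sigmoid output per category, trained with binary cross-entropy.
# For single-label data like this, softmax with categorical_crossentropy
# is the more common pairing.
model.add(Dense(len(categories), activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])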

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_1 (Conv1D)            (None, 98, 64)            256       
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 96, 64)            12352     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 32, 64)            0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 30, 128)           24704     
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 28, 128)           49280     
_________________________________________________________________
global_average_pooling1d_1 ( (None, 128)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
__________
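
The output shapes and parameter counts listed in the summary can be checked by hand. With the default `'valid'` padding and a stride of 1, a `Conv1D` layer with kernel size k shortens the sequence by k - 1 and has (k * input_channels + 1) * filters parameters; `MaxPooling1D(3)` divides the length by 3, rounding down. The helper names below are made up for this sketch.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a 'valid', stride-1 convolution."""
    return length - kernel_size + 1


def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return (kernel_size * in_channels + 1) * filters


length, channels = 100, 1
rows = []
for layer, filters in [('conv', 64), ('conv', 64), ('pool', None),
                       ('conv', 128), ('conv', 128)]:
    if layer == 'conv':
        params = conv1d_params(3, channels, filters)
        length = conv1d_out_len(length, 3)
        channels = filters
    else:
        params = 0
        length = length // 3  # MaxPooling1D(3): stride defaults to the pool size
    rows.append((layer, length, channels, params))

# Each row is (layer kind, output length, output channels, parameter count).
for row in rows:
    print(row)
```

The computed lengths (98, 96, 32, 30, 28) and parameter counts (256, 12352, 0, 24704, 49280) agree with the convolution and pooling rows of the summary.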

Now let's train it and check the accuracy:

In [10]:
%%time

model.fit(x_train, y_train,
          batch_size=128,
          epochs=10,
          validation_data=(x_test, y_test))

Train on 2257 samples, validate on 1502 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Wall time: 16.2 s


<keras.callbacks.History at 0x24008b14fd0>

In [11]:
scores = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 75.00%
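
This number should be read with care. The model was compiled with `binary_crossentropy`, and as far as we understand Keras' behaviour in that setting, `metrics=['accuracy']` is then computed per output label rather than per document. With four one-hot classes, predictions that stay below 0.5 everywhere score exactly 75% on such a metric, so the figure may not be directly comparable to the SVM accuracies above. The sketch below illustrates the difference between the two ways of counting on made-up labels and predictions; it does not reproduce the model's actual outputs.

```python
import numpy as np

# Made-up one-hot labels for 4 documents and 4 classes.
y_true = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 1, 0],
                   [0, 0, 0, 1]], dtype='float32')

# Made-up predictions that are below 0.5 everywhere and favour class 0.
y_pred = np.array([[0.4, 0.1, 0.1, 0.1]] * 4, dtype='float32')

# Per-label accuracy: threshold each output at 0.5 and compare entry by entry.
binary_acc = np.mean((y_pred > 0.5) == (y_true > 0.5))

# Per-document accuracy: compare the most probable class with the true class.
categorical_acc = np.mean(np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1))

print(binary_acc, categorical_acc)
```

With the real model, the per-document figure could be obtained the same way by applying `np.argmax` to `model.predict(x_test)` and to `y_test`.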
