### Text Classification Using Word Embeddings and CNN


In [1]:
import os

import numpy as np

from keras.models import Model

from keras.layers import Input, Dense, Flatten
from keras.layers import Conv1D, MaxPooling1D
from keras.layers import Embedding

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

from keras.utils import to_categorical

Using TensorFlow backend.


#### Preparing the text data
Iterate over the folders that hold the text documents and collect their contents into a list. 
At the same time, build a matching list of class indices, one per document. Each subfolder name becomes a class label.

In [2]:
docs = []          # list of text samples
labels = []        # list of label ids
labels_Index = {}  # dictionary mapping label index to label name

PATH = os.getcwd()

TEXT_DATA_DIR = os.path.join(PATH, "txt")

In [3]:
for name in os.listdir(TEXT_DATA_DIR):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        # each subfolder is one class; the next free integer becomes its label id
        label_Id = len(labels_Index)
        labels_Index[label_Id] = name
        for fname in sorted(os.listdir(path)):
            fpath = os.path.join(path, fname)
            # the context manager closes the file even if read() fails
            with open(fpath, encoding="ISO-8859-1") as f:
                docs.append(f.read())
            labels.append(label_Id)

print('Found %s docs.' % len(docs))

Found 36 docs.


Next, format the text samples and labels into tensors that can be fed into a neural network. 

To do this, we rely on two Keras utilities:

    keras.preprocessing.text.Tokenizer
    keras.preprocessing.sequence.pad_sequences

__Tokenizer__

    Class for vectorizing texts and/or turning texts into sequences (lists of word indexes, where the word of rank i in the dataset, starting at 1, has index i).

__fit_on_texts(texts)__

    Arguments:  
        texts: list of texts to train on.
        
__word_index__ attribute: 

    Dictionary mapping words (str) to their rank/index (int). Only set after fit_on_texts was called.
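
The ranking behind `word_index` can be illustrated with a minimal pure-Python sketch. It is a simplification, not the Keras implementation: the helper name `build_word_index` is hypothetical, and the sketch only lowercases and splits on whitespace, whereas `Tokenizer` also filters punctuation by default.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to its frequency rank, starting at 1 (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() orders words by descending count; index 0 is left unused
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

toy_docs = ["the cat sat", "the dog sat", "the end"]
toy_index = build_word_index(toy_docs)
print(toy_index["the"], toy_index["sat"])  # 1 2
```

The most frequent word gets index 1, so small indices correspond to common words and index 0 stays free for padding.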

In [4]:
# Prepare tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)

word_Index = tokenizer.word_index

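# index 0 is reserved for padding, so the vocabulary size is one more
# than the number of unique tokens
vocab_Size = len(word_Index) + 1
print('Vocabulary size (unique tokens + 1): %s' % vocab_Size)

Vocabulary size (unique tokens + 1): 7314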


__texts_to_sequences(texts)__

    Arguments:
        texts: list of texts to turn to sequences.
    Return: list of sequences (one per text input).
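
As a rough illustration of the encoding step, the sketch below replaces each known word by its index. The helper `encode` and the toy `toy_word_index` are hypothetical. The sketch mirrors the `Tokenizer` behavior of skipping words that are not in the index when no out-of-vocabulary token is configured, but it does not reproduce the Keras punctuation filtering.

```python
def encode(texts, word_index):
    """Replace each known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

toy_word_index = {"the": 1, "sat": 2, "cat": 3}
toy_sequences = encode(["The cat sat", "the bird sat"], toy_word_index)
print(toy_sequences)  # [[1, 3, 2], [1, 2]]
```

The second sequence is shorter because "bird" has no index, which is one reason the encoded documents end up with different lengths.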

In [5]:
# integer encode the documents
sequences = tokenizer.texts_to_sequences(docs)
print(docs[1], sequences[1])

for seq in sequences:
    print(len(seq))

I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. I thank President Bush for his service to our nation, as well as the generosity and cooperation he has shown throughout this transition.

Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents.

So it has been. So it must be with this generation of Americans.

That we are in the midst of crisis is now well understood. Our nation is at war, against a far-reaching network of violence and hatred. Our economy is badly weakened, a consequence of greed and i

The sequences have different lengths, but Keras expects every input in a batch to have the same length. We therefore bring all sequences to a length of 1000 with the built-in Keras function pad_sequences(). By default it pads shorter sequences with zeros at the front and truncates longer sequences from the front.
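
The default padding behavior can be sketched in plain Python. The helper `pad_pre` is hypothetical and only imitates the default `padding='pre'` and `truncating='pre'` settings; it returns nested lists rather than a NumPy array.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last maxlen items of long ones."""
    padded = []
    for seq in sequences:
        kept = seq[max(len(seq) - maxlen, 0):]   # drop items from the front
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

print(pad_pre([[5, 6], [1, 2, 3, 4, 5]], maxlen=4))
# [[0, 0, 5, 6], [2, 3, 4, 5]]
```

With a fixed length of 1000, any document longer than 1000 tokens loses its beginning, so the value should be chosen with the document lengths in mind.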

In [6]:
MAX_SEQUENCE_LENGTH = 1000

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

print('Shape of data tensor:', data.shape)

Shape of data tensor: (36, 1000)


In [7]:
# split the data into a training set and a test set:
# documents 10-11, 22-23 and 34-35 are held out for testing
X_train = np.concatenate([data[:10], data[12:22], data[24:34]])
X_test = np.concatenate([data[10:12], data[22:24], data[34:]])

y_train = np.concatenate([labels[:10], labels[12:22], labels[24:34]])
y_test = np.concatenate([labels[10:12], labels[22:24], labels[34:]])
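
The hard-coded slices above only work if the documents are grouped by class in blocks of 12. A sketch of a less brittle alternative derives the held-out indices from the labels. The helper `holdout_last_per_class` and the toy label list are hypothetical, and the 12-per-class layout is an assumption inferred from the slice boundaries.

```python
import numpy as np

def holdout_last_per_class(labels, n_test=2):
    """Return (train_idx, test_idx), holding out the last n_test items per class.

    n_test must be a positive integer.
    """
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)   # positions of this class, in order
        train_idx.extend(idx[:-n_test])
        test_idx.extend(idx[-n_test:])
    return np.array(train_idx), np.array(test_idx)

toy_labels = [0] * 12 + [1] * 12 + [2] * 12   # assumed layout: 3 classes x 12 docs
train_idx, test_idx = holdout_last_per_class(toy_labels)
print(test_idx)  # [10 11 22 23 34 35]
```

The resulting index arrays could then be used as `data[train_idx]` and `data[test_idx]`, which keeps the split correct if the number of documents per class changes.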

In [8]:
Y_train = to_categorical(y_train)
Y_test = to_categorical(y_test)
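
For integer class ids, the one-hot encoding produced by `to_categorical` can be illustrated with NumPy alone. This is a sketch on toy labels, not the Keras implementation; the array `toy_y` is made up for the example.

```python
import numpy as np

toy_y = np.array([0, 0, 1, 2])
num_classes = 3
# row k of the identity matrix is the one-hot vector for class k
one_hot = np.eye(num_classes, dtype="float32")[toy_y]
print(one_hot)
```

Each row contains a single 1 in the column of its class, which is the target format expected by the categorical cross-entropy loss.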

##### Preparing the Embedding layer

Compute an index mapping words to known embeddings, by parsing the data dump of pre-trained embeddings:

In [9]:
embeddings_index = {}
# each line of the GloVe file holds a word followed by its vector components
with open(os.path.join(PATH, 'glove.6B.50d.txt'), encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [10]:
# rows for words not found in the embedding index stay all-zeros,
# as does row 0, which is reserved for padding
embedding_Matrix = np.zeros((vocab_Size, 50))
for word, i in word_Index.items():
    embedding_Vector = embeddings_index.get(word)
    if embedding_Vector is not None:
        embedding_Matrix[i] = embedding_Vector

print(embedding_Matrix.shape)

(7314, 50)
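
Because words missing from the pre-trained vectors silently end up as all-zero rows, it can be useful to check how much of the vocabulary is covered. The following is a sketch on toy data; the dictionaries `toy_word_index` and `toy_embeddings` are invented for illustration and use 2-dimensional vectors instead of 50.

```python
import numpy as np

toy_word_index = {"the": 1, "cat": 2, "zyzzyva": 3}
toy_embeddings = {"the": np.array([0.1, 0.2], dtype="float32"),
                  "cat": np.array([0.3, 0.4], dtype="float32")}

toy_matrix = np.zeros((len(toy_word_index) + 1, 2))
missing = []
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is None:
        missing.append(word)      # this row stays all-zeros
    else:
        toy_matrix[i] = vector

coverage = 1 - len(missing) / len(toy_word_index)
print(missing, round(coverage, 2))
```

A low coverage value would suggest that many tokens carry no information into the network, for example because of tokenization differences between the corpus and the embedding file.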


###### Create the embedding layer

Unlike an embedding layer that is learned from scratch, this embedding layer is seeded with the pre-trained GloVe word embedding weights. 

    We chose the 50-dimensional version, so the Embedding layer must be defined with output_dim set to 50. 
    We do not want to update the pre-trained word weights in this model, so we set the trainable attribute of the layer to False.

In [11]:
embedding_layer = Embedding(vocab_Size,
                            50,
                            weights=[embedding_Matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
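
With `trainable=False`, the layer behaves like a fixed lookup table: each integer in the input selects one row of the weight matrix. A NumPy sketch of that lookup follows; the toy matrix and batch are made up, with 2-dimensional vectors for readability.

```python
import numpy as np

toy_matrix = np.array([[0.0, 0.0],    # index 0: padding
                       [0.1, 0.2],    # index 1
                       [0.3, 0.4]])   # index 2
toy_batch = np.array([[0, 0, 1, 2],
                      [0, 2, 2, 1]])  # 2 padded sequences of length 4

embedded = toy_matrix[toy_batch]      # integer indexing selects one row per token
print(embedded.shape)  # (2, 4, 2)
```

The output shape is (batch size, sequence length, embedding dimension), which corresponds to `(None, 1000, 50)` for the layer defined above.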

Finally, build a small 1D convnet to solve our classification problem:

In [12]:
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(64, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(4)(x)
x = Conv1D(64, 5, activation='relu')(x)
x = MaxPooling1D(4)(x)
x = Conv1D(64, 5, activation='relu')(x)
x = MaxPooling1D(4)(x)  # pool size 4 (this is not a global pooling layer)
x = Flatten()(x)
x = Dense(64, activation='relu')(x)
preds = Dense(len(labels_Index), activation='softmax')(x)

model = Model(sequence_input, preds)

In [13]:
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

In [14]:
# summarize the model
model.summary()  # prints the table itself; wrapping it in print() also prints "None"

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 1000)              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 1000, 50)          365700    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 996, 64)           16064     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 249, 64)           0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 245, 64)           20544     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 61, 64)            0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 57, 64)            20544     
__________
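
The output lengths and parameter counts in the summary can be reproduced by hand. The sketch below assumes the Keras defaults used in this model ('valid' padding, stride 1 for the convolutions, pooling stride equal to the pool size); the helper names are hypothetical.

```python
def conv1d_len(length, kernel_size):
    """Output length of a 'valid' convolution with stride 1."""
    return length - kernel_size + 1

def pool1d_len(length, pool_size):
    """Output length of max pooling with stride equal to pool_size."""
    return length // pool_size

def conv1d_params(in_channels, kernel_size, filters):
    """Kernel weights plus one bias per filter."""
    return in_channels * kernel_size * filters + filters

length = 1000
lengths = []
for _ in range(3):                      # three Conv1D + MaxPooling1D pairs
    length = conv1d_len(length, 5)
    lengths.append(length)
    length = pool1d_len(length, 4)
    lengths.append(length)

print(lengths)                          # [996, 249, 245, 61, 57, 14]
print(7314 * 50)                        # 365700 embedding weights
print(conv1d_params(50, 5, 64))         # 16064
print(conv1d_params(64, 5, 64))         # 20544
```

The first five lengths and all three parameter counts match the rows shown in the summary. The final length of 14 follows from the same formula applied to the last pooling layer.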

In [15]:
model.fit(X_train, Y_train, epochs=25)

Epoch 1/25
...
Epoch 25/25


<keras.callbacks.History at 0x7f6ec5d60cc0>

#### Make predictions on test data 

In [16]:
Y_pred = model.predict(X_test)
print(Y_pred)

[[  5.84911704e-01   1.23765849e-01   2.91322559e-01]
 [  4.75651026e-01   1.95072576e-01   3.29276413e-01]
 [  1.34671293e-03   9.84617352e-01   1.40358750e-02]
 [  7.57883477e-04   9.94684517e-01   4.55762073e-03]
 [  3.93927187e-01   3.76961708e-01   2.29111016e-01]
 [  3.58260930e-01   7.78342336e-02   5.63904881e-01]]


In [17]:
# pick the class with the highest predicted probability in each row
y_pred = np.argmax(Y_pred, axis=1).tolist()

print(y_pred)

[0, 0, 1, 1, 0, 2]


In [18]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred)

0.83333333333333337

In [19]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred)

array([[2, 0, 0],
       [0, 2, 0],
       [1, 0, 1]])
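
The relationship between the confusion matrix and the accuracy can be checked with a few lines of NumPy. This sketch assumes the true test labels are `[0, 0, 1, 1, 2, 2]`, which is consistent with the matrix shown above but is not printed anywhere in the notebook. Rows are true classes and columns are predicted classes, following the scikit-learn convention.

```python
import numpy as np

y_true = [0, 0, 1, 1, 2, 2]   # assumed true labels (two test documents per class)
y_hat = [0, 0, 1, 1, 0, 2]    # predictions printed above

cm = np.zeros((3, 3), dtype=int)
for t, p in zip(y_true, y_hat):
    cm[t, p] += 1             # row = true class, column = predicted class

accuracy = np.trace(cm) / cm.sum()   # correct predictions lie on the diagonal
print(cm)
print(accuracy)
```

Five of the six test documents sit on the diagonal, which gives the accuracy of 5/6 reported above. The single off-diagonal entry shows one class 2 document predicted as class 0.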

Ref:
    
    https://keras.io
        
    https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
    