# Code Lab 4A: CNN for Text Classification

In this Code Lab, we are going to implement *Convolutional Neural Networks for Sentence Classification* (Yoon Kim, 2014).

In his [paper](https://arxiv.org/abs/1408.5882), Yoon Kim proposed several techniques to achieve good text classification accuracy with minimal hyper-parameter tuning.

This notebook consists of four main sections:

1. Preparing the data
2. Implementing Yoon Kim's CNN model
3. Training the model
4. Evaluating the model

In [None]:
MAX_NB_WORDS = 100000    # max no. of words for tokenizer
MAX_SEQUENCE_LENGTH = 100 # max length of each entry (sentence), including padding
VALIDATION_SPLIT = 0.2   # data for validation (not used in training)
EMBEDDING_DIM = 100      # embedding dimensions for word vectors (word2vec/GloVe)

In [None]:
import numpy as np
import re, sys, csv, pickle
from tqdm import tqdm_notebook

from keras import regularizers, initializers, optimizers, callbacks
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.layers import *
from keras.models import Model

### 1. Prepare the data
**Read from dataset**

In [None]:
# download pre-trained GloVe vectors

import keras

GLOVE_URL = "https://s3-ap-southeast-1.amazonaws.com/deeplearning-iap-material/glove.6B.100d.txt.zip"
GLOVE_DIR = keras.utils.get_file("glove.6B.100d.txt.zip", GLOVE_URL, cache_subdir='datasets', extract=True)
print("GloVe data present at", GLOVE_DIR)
GLOVE_DIR = GLOVE_DIR.replace(".zip", "")

In [None]:
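from nltk.corpus import stopwords

def clean_text(text, remove_stopwords=False):
    """Lowercase `text`, strip punctuation and collapse whitespace."""
    # lines read in binary mode arrive as bytes; decode them instead of relying on str(bytes)
    if isinstance(text, bytes):
        text = text.decode("utf-8", errors="ignore")
    text = re.sub(r'[^\w\s]', '', str(text)).lower()
    words = text.split()  # split() also drops newlines and repeated spaces
    if remove_stopwords:
        stop_words = set(stopwords.words("english"))  # build the set once, not once per word
        words = [word for word in words if word not in stop_words]
    return " ".join(words)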
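
To see what the punctuation-stripping regex used in `clean_text` does on its own, here is a minimal sketch on a made-up sentence. The sample string and the helper name `strip_punctuation` are illustrative assumptions, not part of the dataset or the lab code.

```python
import re

def strip_punctuation(text):
    # hypothetical helper: the same regex as clean_text, applied to an already-decoded str
    # [^\w\s] matches any character that is neither a word character nor whitespace
    return re.sub(r'[^\w\s]', '', text).lower()

sample = "A truly GREAT film , isn't it ?"   # made-up example sentence
stripped = strip_punctuation(sample)
tokens = stripped.split()                    # split() also collapses repeated spaces
print(tokens)
```

Note that the apostrophe is removed as well, so contractions such as "isn't" become a single token without punctuation.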

In [None]:
texts, labels = [], [] # empty lists for the sentences and labels

# read in binary mode; clean_text receives each raw line
with open("datasets/stanford_movie_neg.txt", "rb") as data_neg:
    for line in tqdm_notebook(data_neg, total=5331):
        texts.append(clean_text(line, remove_stopwords=False))
        labels.append(0)  # 0 = negative

In [None]:
with open("datasets/stanford_movie_pos.txt", "rb") as data_pos:
    for line in tqdm_notebook(data_pos, total=5331):
        texts.append(clean_text(line, remove_stopwords=False))
        labels.append(1)  # 1 = positive

In [None]:
print("Sample negative:", texts[0], labels[0])
print("Sample positive:", texts[-1], labels[-1])

**Word Tokenizer**

In [None]:
CACHE_TOKENIZER = True

tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)

if CACHE_TOKENIZER:
    with open('tokenizer.pickle', 'wb') as handle:
        pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
        print("[i] Saved word tokenizer to file: tokenizer.pickle")

# to use cached tokenizer:
"""
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
"""

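As a rough picture of what the word tokenizer produces, the sketch below builds a frequency-ranked word index and converts texts to integer sequences in plain Python. It is a simplified stand-in, not the Keras implementation: `build_word_index` and `to_sequences` are hypothetical helpers, the texts are made up, and details such as lowercasing, punctuation filtering and the `num_words` cap are left out.

```python
from collections import Counter

def build_word_index(texts):
    # hypothetical stand-in for fit_on_texts: rank words by frequency,
    # starting at 1 so that index 0 stays free for padding
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    # hypothetical stand-in for texts_to_sequences: words not in the index are dropped
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

toy_texts = ["good movie", "bad movie", "movie was good fun"]   # made-up corpus
toy_index = build_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts + ["great movie"], toy_index)
print(toy_index)
print(toy_sequences)
```

The most frequent word gets the smallest index, and the unseen word "great" simply disappears from the last sequence.
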
**Generate the array of sequences from dataset**

In [None]:
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
print('[i] Vocabulary size:', len(word_index))

# left-pad (or truncate) to MAX_SEQUENCE_LENGTH-5 tokens, then append 5 trailing zeros
data_int = pad_sequences(sequences, padding='pre', maxlen=(MAX_SEQUENCE_LENGTH-5))
data = pad_sequences(data_int, padding='post', maxlen=(MAX_SEQUENCE_LENGTH))
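
The effect of the two `pad_sequences` calls can be reproduced for a single sequence with a small NumPy sketch. `pad_both_ends` is a hypothetical helper written for illustration, and it assumes `front_margin` is smaller than `max_len`; the toy length of 10 stands in for `MAX_SEQUENCE_LENGTH`.

```python
import numpy as np

def pad_both_ends(seq, max_len, front_margin=5):
    # hypothetical helper mirroring the two pad_sequences calls for one sequence
    inner_len = max_len - front_margin
    seq = list(seq)[-inner_len:]                   # truncate from the front (default truncating='pre')
    left = [0] * (inner_len - len(seq)) + seq      # padding='pre' up to inner_len
    return np.array(left + [0] * front_margin)     # padding='post' up to max_len

padded = pad_both_ends([7, 3, 9], max_len=10)
print(padded)
```

Every row therefore ends with the same number of trailing zeros, while short sentences also receive leading zeros.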

**Create the train-validation split**

In [None]:
labels = to_categorical(np.asarray(labels)) # convert the category label to one-hot encoding
print('[i] Shape of data tensor:', data.shape)
print('[i] Shape of label tensor:', labels.shape)

indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]

print('[i] Number of entries in each category:')
print("[+] Training:",y_train.sum(axis=0))
print("[+] Validation:",y_val.sum(axis=0))
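
The same shuffle-then-split logic can be checked on a toy array, where it is easy to confirm that samples and labels stay aligned. All values below are made up, and the seed is only there to make the example repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)            # seeded only to make this toy example repeatable
toy_data = np.arange(20).reshape(10, 2)   # 10 made-up samples
toy_labels = np.array([0, 1] * 5)

one_hot = np.eye(2)[toy_labels]           # same values as one-hot encoding of integer labels 0/1
order = rng.permutation(len(toy_data))    # one shared permutation keeps samples and labels aligned
toy_data, one_hot = toy_data[order], one_hot[order]

n_val = int(0.2 * len(toy_data))          # mirrors VALIDATION_SPLIT = 0.2
x_tr, x_va = toy_data[:-n_val], toy_data[-n_val:]
y_tr, y_va = one_hot[:-n_val], one_hot[-n_val:]
print(x_tr.shape, x_va.shape, y_tr.sum(axis=0), y_va.sum(axis=0))
```

Because the split is random rather than stratified, the class counts in the two parts are usually close to balanced but not guaranteed to be equal.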

**What does the data look like?**

In [None]:
print("Tokenized sequence:\n", data[0])
print("")
print("One-hot label:\n", labels[0])

### 2. Create the model
Yoon Kim's model has several notable features:

* Two sets of word embeddings, for what he terms a **"multi-channel" approach**.
  * One set of word embeddings is frozen (**"static channel"**), and the other is fine-tuned during training (**"non-static channel"**).
* Multiple convolutional kernel sizes, so that filters cover word windows of different widths.

![model-structure](notebook_imgs/yoon_kim_structure.png)

We will now start to create the model in `Keras`.

**Load word embeddings into an `embeddings_index`**

Create an index of words mapped to known embeddings, by parsing the data dump of pre-trained embeddings.

We use a set from [pre-trained GloVe vectors from Stanford](https://nlp.stanford.edu/projects/glove/).

In [None]:
embeddings_index = {}
f = open(GLOVE_DIR, encoding="utf-8")  # GloVe files are UTF-8; do not rely on the platform default
print("[i] (long) Loading GloVe from:",GLOVE_DIR,"...",end="")
for line in f:
    values = line.split()
    word = values[0]
    embeddings_index[word] = np.asarray(values[1:], dtype='float32')
f.close()
print("Done.\n[+] Proceeding with Embedding Matrix...", end="")
embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in the GloVe index keep their random initial values
        embedding_matrix[i] = embedding_vector
print(" Completed!")
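
The parsing and lookup steps are easier to follow on a toy example. The sketch below uses three made-up lines in the GloVe text format (a word followed by its vector components) and a hypothetical `toy_word_index`; none of these values come from the real GloVe file.

```python
import numpy as np

# three made-up lines in the GloVe text format: a word followed by its vector components
toy_glove_lines = ["good 0.1 0.2 0.3", "bad -0.1 -0.2 -0.3", "movie 0.5 0.0 0.5"]
toy_word_index = {"movie": 1, "good": 2, "zzyzx": 3}   # hypothetical tokenizer output
dim = 3

toy_embeddings = {}
for line in toy_glove_lines:
    values = line.split()
    toy_embeddings[values[0]] = np.asarray(values[1:], dtype="float32")

# row 0 belongs to the padding index; rows start random and are overwritten when a vector is known
toy_matrix = np.random.random((len(toy_word_index) + 1, dim))
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:
        toy_matrix[i] = vector

print(toy_matrix)
```

Rows 1 and 2 now hold the pre-trained vectors, while row 0 and the row for the unknown word "zzyzx" keep their random values in [0, 1).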

In [None]:
# second embedding matrix for non-static channel
embedding_matrix_ns = np.random.random((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in the GloVe index keep their random initial values
        embedding_matrix_ns[i] = embedding_vector

**Create the `Embedding` layers**

In [None]:
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32') # input to the model

# static channel
embedding_layer_frozen = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
embedded_sequences_frozen = embedding_layer_frozen(sequence_input)

# non-static channel
embedding_layer_train = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix_ns],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)
embedded_sequences_train = embedding_layer_train(sequence_input)

l_embed = Concatenate(axis=1)([embedded_sequences_frozen, embedded_sequences_train])
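
An embedding lookup is just row indexing into a weight table, so the shapes produced by the two channels can be sketched in NumPy. The sizes and token ids below are toy assumptions, not the lab's settings.

```python
import numpy as np

seq_len, dim, vocab = 6, 4, 10                 # toy sizes, not the lab's settings
token_ids = np.array([[0, 0, 3, 7, 1, 0]])     # one padded sequence, shape (1, seq_len)

static_table = np.random.random((vocab, dim))  # stands in for the frozen channel
trainable_table = static_table.copy()          # stands in for the fine-tuned channel

static_out = static_table[token_ids]           # embedding lookup is row indexing: (1, seq_len, dim)
trainable_out = trainable_table[token_ids]

stacked = np.concatenate([static_out, trainable_out], axis=1)
print(static_out.shape, stacked.shape)
```

With `axis=1` the two channels are placed one after the other along the sequence dimension, so the convolution layers that follow see a sequence twice as long as the input.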

**Create the CNN layer with multiple kernel (filter) sizes**

In [None]:
l_conv_3 = Conv1D(filters=64,kernel_size=3,activation='relu')(l_embed)
l_conv_4 = Conv1D(filters=64,kernel_size=4,activation='relu')(l_embed)
l_conv_5 = Conv1D(filters=64,kernel_size=5,activation='relu')(l_embed)

l_conv = Concatenate(axis=1)([l_conv_3, l_conv_4, l_conv_5])
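
To make the kernel sizes concrete, here is a NumPy reference for a "valid" 1-D convolution followed by ReLU. It is a sketch under simplifying assumptions: `conv1d_valid_relu` is a hypothetical helper, it omits the bias term and the batch axis, and the weights are random rather than learned.

```python
import numpy as np

def conv1d_valid_relu(x, kernels):
    # x: (steps, channels); kernels: (kernel_size, channels, filters)
    # hypothetical NumPy reference for a 'valid' 1-D convolution with ReLU, without bias
    k = kernels.shape[0]
    steps_out = x.shape[0] - k + 1
    out = np.stack([np.tensordot(x[t:t + k], kernels, axes=([0, 1], [0, 1]))
                    for t in range(steps_out)])
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 100))   # 200 steps of 100-d embeddings (two 100-step channels stacked)
outputs = [conv1d_valid_relu(x, rng.normal(size=(k, 100, 64))) for k in (3, 4, 5)]
merged = np.concatenate(outputs, axis=0)   # step axis; this is axis=1 in Keras, after the batch axis
print([o.shape for o in outputs], merged.shape)
```

A kernel of size `k` slides over `steps - k + 1` positions, which is why the three branches have slightly different lengths before they are concatenated.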

Next comes the rest of the model: max pooling, dropout and two dense layers (boring!!)

In [None]:
l_pool = MaxPooling1D(4)(l_conv)
l_drop = Dropout(0.5)(l_pool)
l_flat = Flatten()(l_drop)
l_dense = Dense(32, activation='relu')(l_flat)
preds = Dense(2, activation='softmax')(l_dense) #follows the number of classes
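
The layer shapes can be worked out by hand before looking at the model summary. The arithmetic below assumes the input to the convolutions has 200 steps (two 100-step channels concatenated along axis 1) and that `MaxPooling1D` uses its default stride, equal to the pool size; `pooled_length` is a hypothetical helper.

```python
def pooled_length(steps, pool_size):
    # MaxPooling1D with default strides (= pool_size) and 'valid' padding
    return (steps - pool_size) // pool_size + 1

conv_steps = sum(200 - k + 1 for k in (3, 4, 5))   # 198 + 197 + 196 after concatenation
pool_steps = pooled_length(conv_steps, 4)
flat_units = pool_steps * 64                       # Flatten: steps x filters (Dropout keeps the shape)
dense_params = flat_units * 32 + 32                # weights + biases of Dense(32)
print(conv_steps, pool_steps, flat_units, dense_params)
```

Most of the non-embedding parameters therefore sit in the first dense layer, not in the convolutions.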

**Compile the model into a static graph for training**

In [None]:
model = Model(sequence_input, preds)
model.compile(loss='binary_crossentropy',
              optimizer="rmsprop",
              metrics=['acc'])
model.summary()
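
The model ends in a two-unit softmax and is trained on one-hot labels. For exactly this two-class case, binary and categorical cross-entropy give the same per-sample value, which the NumPy sketch below checks on made-up scores. The function names are illustrative and the formulas are the textbook definitions, not Keras internals.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_ce(y_true, y_prob):
    return -(y_true * np.log(y_prob)).sum(axis=1)

def binary_ce(y_true, y_prob):
    # element-wise binary cross-entropy, averaged over the output units
    return -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)).mean(axis=1)

logits = np.array([[2.0, -1.0], [0.3, 0.8]])   # made-up pre-softmax scores
y_true = np.array([[1.0, 0.0], [0.0, 1.0]])    # one-hot labels
probs = softmax(logits)
print(categorical_ce(y_true, probs), binary_ce(y_true, probs))
```

The equality relies on the two softmax outputs summing to one; with three or more classes the two losses differ, and categorical cross-entropy is the usual choice.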

**Model Architecture Visualisation**

In [None]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
SVG(model_to_dot(model).create(prog='dot', format='svg'))

**Word Embeddings Visualisation**

In [None]:
# clear stale TensorBoard files; -f avoids an error when nothing exists yet
!rm -f metadata.txt logs/*

In [None]:
with open("metadata.txt", "w") as metadata_file:
    # placeholder row for index 0, which the Tokenizer never assigns to a word
    metadata_file.write("<UNK>\n")
    for word in tqdm_notebook(tokenizer.word_index):
        metadata_file.write(word + "\n")

In [None]:
# TensorBoard callback
from keras import callbacks
tb = keras.callbacks.TensorBoard(log_dir='./logs',write_graph=True,
                                 embeddings_freq=1, embeddings_layer_names=[],
                                 embeddings_metadata="../metadata.txt", embeddings_data=x_val)

### 3. Train the model

In [None]:
print("Training Progress:")
model_log = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                      epochs=1, batch_size=64, callbacks=[tb])

In [None]:
# save the model
#model.save("best_weights.h5")

### 4. Evaluate the model

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(model_log.history['acc'])
plt.plot(model_log.history['val_acc'])
plt.title('Accuracy (Higher Better)')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

plt.plot(model_log.history['loss'])
plt.plot(model_log.history['val_loss'])
plt.title('Loss (Lower Better)')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import itertools, pickle

classes = ["negative", "positive"]  # index order must match the labels: 0 = negative, 1 = positive

In [None]:
Y_test = np.argmax(y_val, axis=1) # Convert one-hot to index
y_pred = model.predict(x_val)
y_pred_class = np.argmax(y_pred,axis=1)
print(classification_report(Y_test, y_pred_class, target_names=classes))
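
The numbers in a classification report come from simple counts of true positives, false positives and false negatives. The sketch below computes them per class on made-up labels, using the same encoding as the data loading cells (0 = negative, 1 = positive). `per_class_metrics` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes=2):
    # hypothetical helper: precision, recall and F1 per class, computed from raw counts
    rows = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rows.append((precision, recall, f1))
    return rows

toy_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])   # made-up labels: 0 = negative, 1 = positive
toy_pred = np.array([0, 0, 1, 1, 1, 1, 1, 0])
metrics = per_class_metrics(toy_true, toy_pred)
print(metrics)
```

Here one negative review is predicted as positive, which lowers recall for the negative class and precision for the positive class.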

In [None]:
plt.style.use('seaborn-dark')
def plot_confusion_matrix(cm, labels,
                          normalize=True,
                          title='Confusion Matrix (Validation Set)',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        #print("Normalized confusion matrix")
    else:
        #print('Confusion matrix, without normalization')
        pass

    #print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(labels))
    plt.xticks(tick_marks, labels, rotation=45)
    plt.yticks(tick_marks, labels)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

plt.figure(figsize=(14,7))
cnf_matrix = confusion_matrix(Y_test, y_pred_class)
plot_confusion_matrix(cnf_matrix, labels=classes)
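
Row normalisation, as applied inside `plot_confusion_matrix`, turns raw counts into per-class fractions. A minimal sketch on made-up predictions, with a hypothetical `toy_confusion_matrix` helper standing in for the scikit-learn function:

```python
import numpy as np

def toy_confusion_matrix(y_true, y_pred, n_classes=2):
    # hypothetical helper: rows are true classes, columns are predicted classes
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

cm = toy_confusion_matrix([0, 0, 0, 1, 1, 1, 1, 0], [0, 0, 1, 1, 1, 1, 1, 0])
cm_norm = cm.astype(float) / cm.sum(axis=1)[:, np.newaxis]   # divide each row by its number of true samples
print(cm)
print(cm_norm)
```

After normalisation each row sums to 1, so the diagonal entries read directly as per-class recall.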

In [None]:
model.save("cnn.h5")