# Convolutional Network Benchmark

To establish a baseline for convolutional networks, we build a simple network that shows what performance to expect from this family of models. It contains no novel specializations; its only purpose is to give us a baseline that we can improve upon.

The architecture used here is inspired by the paper ['A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification'](https://arxiv.org/pdf/1510.03820.pdf), and it is not complex.

While previous experiments have explored the applicability of convolutional networks, this is the first to run one over a large set of data.

In [1]:
from exp8_feature_extraction import get_balanced_dataset
from scripts.cross_validate import run_cross_validate
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D
from tensorflow.keras.activations import relu, sigmoid

import numpy as np
import gensim
import json
import pickle

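import tensorflow as tf
# Import from tensorflow.keras throughout; mixing in the standalone keras package can cause subtle incompatibilities.
from tensorflow.keras.preprocessing.sequence import pad_sequences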


In [2]:
all_reviews = get_balanced_dataset()

In [3]:
reviews_contents = [x.review_content for x in all_reviews]
labels = [1 if x.label else 0 for x in all_reviews]

In [4]:
short_reviews = []
filtered_labels = []
# Drop very long reviews. The limit counts whitespace-separated words, not characters.
max_raw_review_words = 1000
for review, label in zip(reviews_contents, labels):
    if len(review.split()) > max_raw_review_words:
        continue
    short_reviews.append(review)
    filtered_labels.append(label)

In [5]:
max_review_words = 150
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(short_reviews)

short_sequences = []
short_labels = []
for i, sequence in enumerate(tokenizer.texts_to_sequences(short_reviews)):
    if len(sequence) <= max_review_words:
        short_sequences.append(sequence)
        # Index into filtered_labels, which is aligned with short_reviews.
        short_labels.append(filtered_labels[i])
        
word_sequences = np.array(pad_sequences(short_sequences))
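
For reference, `pad_sequences` with its default arguments left-pads every sequence with zeros up to the length of the longest sequence, and truncates from the front when `maxlen` is given. The pure-Python sketch below mimics that default behaviour on toy data. `pad_pre` is a hypothetical helper written only for illustration; it is not part of Keras.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) integer sequences, mimicking the
    default behaviour of Keras' pad_sequences. Illustrative helper only."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Keep at most the last `maxlen` tokens (guard against seq[-0:]).
        kept = list(seq[-maxlen:]) if maxlen > 0 else []
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


toy_sequences = [[4, 7], [1, 2, 3], [9]]
print(pad_pre(toy_sequences))            # [[0, 4, 7], [1, 2, 3], [0, 0, 9]]
print(pad_pre(toy_sequences, maxlen=2))  # [[4, 7], [2, 3], [0, 9]]
```

Index 0 is reserved for padding, and the tokenizer's word indices start at 1, which is why the vocabulary size used later is `len(corpus_words) + 1`. Passing `maxlen=max_review_words` to `pad_sequences` would guarantee that the padded length matches the model's input shape even if no review happens to contain exactly 150 tokens.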

In [9]:
corpus_words = tokenizer.word_index
corpus_vocab_size = len(corpus_words)+1

In [10]:
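word_vectors = gensim.models.KeyedVectors.load_word2vec_format(
    "../../data/GoogleNews-vectors-negative300.bin", binary=True)
embedding_length = word_vectors.vector_size

# Row i holds the pretrained vector for the word with tokenizer index i.
# Row 0 (padding) and words missing from word2vec stay all-zero.
embedding_matrix = np.zeros((corpus_vocab_size, embedding_length))
for word, index in corpus_words.items():
    # `.vocab` is the gensim < 4.0 API; gensim 4.x uses `word_vectors.key_to_index`.
    if word in word_vectors.vocab:
        embedding_matrix[index] = np.array(word_vectors[word], dtype=np.float32)

# NOTE: `embeddings_file_name` is not defined in the cells shown here; it must be set before this runs.
with open(embeddings_file_name, 'w') as outfile:
    json.dump(embedding_matrix.tolist(), outfile)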
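
To make the construction above easier to follow, here is a minimal sketch of the same idea on toy data. The word index and the vector dictionary are made-up stand-ins for the tokenizer and the word2vec model. It also counts the words that have no pretrained vector, since those end up with the same all-zero row as padding.

```python
import numpy as np

# Toy stand-ins (assumptions): a tokenizer-style word index starting at 1,
# and a plain dict playing the role of the pretrained word vectors.
toy_word_index = {"good": 1, "bad": 2, "zzzunknown": 3}
toy_vectors = {"good": [0.1, 0.2], "bad": [-0.1, -0.2]}
toy_dim = 2

# One row per word index, plus row 0 for padding.
toy_matrix = np.zeros((len(toy_word_index) + 1, toy_dim), dtype=np.float32)
missing = []
for word, index in toy_word_index.items():
    if word in toy_vectors:
        toy_matrix[index] = toy_vectors[word]
    else:
        missing.append(word)  # row stays all-zero, just like padding

coverage = 1 - len(missing) / len(toy_word_index)
print(toy_matrix)
print("missing:", missing, "coverage:", round(coverage, 2))
```

Tracking coverage this way on the real vocabulary would show how much of the corpus the network actually sees as non-zero input.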

We add a quick check to make sure any modifications haven't caused our sequences and embedding matrix to misalign:

In [11]:
hello_wv = embedding_matrix[tokenizer.texts_to_sequences(["hello"])[0][0]]
assert hello_wv[0] > -0.055 and hello_wv[0] < -0.054
assert hello_wv[1] > 0.017 and hello_wv[1] < 0.018
assert hello_wv[2] > -0.006 and hello_wv[2] < -0.005

In [12]:
def to_vectorized_reviews(word_sequences, max_review_len, embedding_matrix):
    vectorized_reviews = np.zeros((len(word_sequences), max_review_len, embedding_matrix.shape[1], 1))
    for i, word_sequence in enumerate(word_sequences):
        for j, word in enumerate(word_sequence):
            for k, val in enumerate(embedding_matrix[word]):
                vectorized_reviews[i][j][k][0] = val
    return vectorized_reviews

In [13]:
test_embedding_matrix = np.array([[1, 2, 3], [3, 2, 1]])

actual_vectorized_reviews = to_vectorized_reviews([[0, 1]], 2, test_embedding_matrix)
assert np.array_equal(actual_vectorized_reviews, np.array([[[[1], [2], [3]], [[3], [2], [1]]]]))

actual_vectorized_reviews = to_vectorized_reviews([[1, 0]], 2, test_embedding_matrix)
assert np.array_equal(actual_vectorized_reviews, np.array([[[[3], [2], [1]], [[1], [2], [3]]]]))
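
The triple loop in `to_vectorized_reviews` is easy to verify but slow on a large corpus. As a possible alternative, NumPy integer-array indexing can perform the same lookup in one step. The sketch below assumes every sequence has the same length, which holds after padding; `to_vectorized_reviews_fast` is a hypothetical name, not a function from this project.

```python
import numpy as np


def to_vectorized_reviews_fast(word_sequences, embedding_matrix):
    """Look up every word index at once and add a trailing channel axis.
    Assumes all sequences share the same length (true after padding)."""
    indices = np.asarray(word_sequences, dtype=np.intp)  # (reviews, words)
    vectors = np.asarray(embedding_matrix)[indices]      # (reviews, words, dim)
    return vectors[..., np.newaxis]                      # (reviews, words, dim, 1)


toy_matrix = np.array([[1, 2, 3], [3, 2, 1]])
fast = to_vectorized_reviews_fast([[0, 1], [1, 0]], toy_matrix)
print(fast.shape)  # (2, 2, 3, 1)
```

Unlike the loop version, which always returns a float64 array, this variant keeps the dtype of the embedding matrix.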

In [14]:
vectorized_reviews = to_vectorized_reviews(word_sequences, len(word_sequences[0]), embedding_matrix)

I will set some data aside and hold it back until I have finished tweaking the model. Tuning a model against the same data it is tested on is likely to overstate how well it performs, so results on data the model has never seen are more representative of its performance on future data.

In [25]:
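# The labels are the ones collected alongside the sequences in In [5], so they stay aligned with the vectors.
training_vectors = vectorized_reviews[:-10000]
training_labels = short_labels[:-10000]
held_vectors = vectorized_reviews[-10000:]
held_labels = short_labels[-10000:]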
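
The split above takes the last 10,000 rows in their stored order. Whether `get_balanced_dataset` returns reviews in a meaningful order is not shown here, so if it does, the held-out set may not be representative. A minimal sketch of a shuffled hold-out split follows, assuming a shuffled split is acceptable; `split_holdout` and its parameters are hypothetical names used for illustration.

```python
import numpy as np


def split_holdout(features, labels, holdout_size, seed=0):
    """Shuffle indices with a fixed seed, then hold out the last
    `holdout_size` examples. `holdout_size` must be greater than zero."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    order = np.random.default_rng(seed).permutation(len(features))
    train_idx, held_idx = order[:-holdout_size], order[-holdout_size:]
    return (features[train_idx], labels[train_idx],
            features[held_idx], labels[held_idx])


toy_features = np.arange(10).reshape(10, 1)
toy_labels = np.arange(10) % 2
train_x, train_y, held_x, held_y = split_holdout(toy_features, toy_labels, holdout_size=3)
print(train_x.shape, held_x.shape)  # (7, 1) (3, 1)
```

Note that indexing with an index array copies the data, which matters for an array as large as `vectorized_reviews`; shuffling the reviews before vectorizing them would avoid the extra copy.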

In [15]:
# Cache the vectorized reviews and their labels. Pickle protocol 4 supports objects larger than 4 GB.
with open("vectorized_reviews.pkl", 'wb') as outfile:
    pickle.dump(vectorized_reviews, outfile, protocol=4)
with open("vectorized_reviews_targets.pkl", 'wb') as outfile:
    pickle.dump(short_labels, outfile, protocol=4)

In [16]:
with open("vectorized_reviews.pkl", 'rb') as infile:
    loaded_vectorized_reviews = pickle.load(infile)
with open("vectorized_reviews_targets.pkl", 'rb') as infile:
    loaded_short_labels_2 = pickle.load(infile)

In [24]:
def get_conv_wv_model():
    model = Sequential([
        # Each filter spans 10 consecutive words across the full embedding width.
        Conv2D(
            filters=50,
            kernel_size=(10, embedding_length),
            data_format="channels_last",
            input_shape=(max_review_words, embedding_length, 1),
            activation=relu),
        # Keep the strongest response of each filter over the whole review.
        tf.keras.layers.GlobalMaxPooling2D(data_format="channels_last"),
        Dropout(0.5),
        Flatten(),  # no-op here: the pooled output is already flat
        Dense(2, activation='softmax')
    ])
    model.compile(
        loss='binary_crossentropy',
        optimizer='adam',
        metrics=['accuracy'])
    return model
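
To see what this model does to a single review, it helps to work through the tensor shapes by hand. The sketch below assumes the Keras defaults for `Conv2D` (`padding='valid'`, stride 1) and the sizes used in this notebook: 150 words, 300-dimensional embeddings, 50 filters of size 10 x 300. `valid_conv_output` is a hypothetical helper for the arithmetic only.

```python
def valid_conv_output(input_size, kernel_size, stride=1):
    """Output length of a convolution with 'valid' padding along one axis."""
    return (input_size - kernel_size) // stride + 1


max_words, embedding_dim, filters = 150, 300, 50
kernel_h, kernel_w = 10, 300

conv_h = valid_conv_output(max_words, kernel_h)      # 141 window positions
conv_w = valid_conv_output(embedding_dim, kernel_w)  # 1: the kernel spans the full width
conv_params = (kernel_h * kernel_w * 1 + 1) * filters  # weights plus one bias per filter
dense_params = (filters + 1) * 2                       # 50 pooled features to 2 classes

print("conv output:", (conv_h, conv_w, filters))  # (141, 1, 50)
print("pooled output:", (filters,))               # (50,)
print("parameters:", conv_params + dense_params)  # 150152
```

Because each kernel covers the full embedding width, the convolution effectively slides over 10-word windows, and global max pooling keeps one value per filter regardless of where in the review the window matched.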

In [None]:
conv_wv_scores = run_cross_validate(get_conv_wv_model, training_vectors, training_labels, cv=6, categorical=True)
print(conv_wv_scores)

Fitting with:  (96649, 150, 300, 1) labels (96649, 2)
Train on 67654 samples, validate on 28995 samples
Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Fitting with:  (96649, 150, 300, 1) labels (96649, 2)
Train on 67654 samples, validate on 28995 samples
Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Fitting with:  (96649, 150, 300, 1) labels (96649, 2)
Train on 67654 samples, validate on 28995 samples
Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12

In [29]:
print(conv_wv_scores)
print("Average accuracy:", sum(conv_wv_scores['accuracies'])/len(conv_wv_scores['accuracies']))

{'accuracies': [0.66613211939372, 0.6612177331746935, 0.6639077130029282, 0.6568368772311035, 0.654612240674634, 0.6592684567230587]}
Average accuracy: 0.6603291900333562
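
Alongside the average, the spread across folds indicates how stable the baseline is. A small sketch using only the standard library and the six fold accuracies printed above:

```python
from statistics import mean, stdev

# Fold accuracies copied from the output above.
accuracies = [0.66613211939372, 0.6612177331746935, 0.6639077130029282,
              0.6568368772311035, 0.654612240674634, 0.6592684567230587]

mean_accuracy = mean(accuracies)
std_accuracy = stdev(accuracies)  # sample standard deviation across folds

print("Mean accuracy:", mean_accuracy)
print("Std deviation:", std_accuracy)
print("Range:", min(accuracies), "to", max(accuracies))
```

A later model only clearly beats this baseline if its improvement is larger than the fold-to-fold variation.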
