# Convolutional Network Benchmark

To establish a benchmark for a basic convolutional network, we build a simple network that shows what to expect from this class of model. It contains no novel specializations; its only purpose is to provide a baseline that we can improve upon.

The architecture used here is inspired by the paper ['A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification'](https://arxiv.org/pdf/1510.03820.pdf), and it is not complex.
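
To make the idea concrete, here is a minimal pure-Python sketch of the core operation described in that paper: filters spanning a few consecutive word vectors slide over the sentence, and each resulting feature map is reduced to its single largest value (1-max pooling). The function names, the toy embeddings, and the filter weights are illustrative assumptions and are not the network built in this notebook.

```python
def conv_feature_map(embeddings, filt, bias=0.0):
    """Slide a filter spanning len(filt) consecutive word vectors over a sentence.

    embeddings: list of word vectors (one list of floats per word)
    filt: list of weight vectors, each as wide as a word vector
    Returns one ReLU-activated value per window position.
    """
    region = len(filt)
    feature_map = []
    for start in range(len(embeddings) - region + 1):
        window = embeddings[start:start + region]
        total = bias
        for word_vec, weight_vec in zip(window, filt):
            total += sum(x * w for x, w in zip(word_vec, weight_vec))
        feature_map.append(max(0.0, total))  # ReLU activation
    return feature_map


def one_max_pool(feature_map):
    """1-max pooling: keep only the strongest activation of a feature map."""
    return max(feature_map)


# Toy input: a 4-word sentence with 2-dimensional embeddings (made-up numbers).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Two hypothetical filters with region sizes 2 and 3.
filters = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]],
]

# One pooled value per filter; a dense layer would classify from this vector.
pooled = [one_max_pool(conv_feature_map(sentence, f)) for f in filters]
print(pooled)
```

Because pooling keeps one value per filter, the pooled vector has a fixed size regardless of sentence length, which is what allows a dense classification layer to follow it.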

While previous experiments explored the applicability of convolutional networks, this is the first to run one over a large set of data.

In [1]:
from exp8_feature_extraction import get_balanced_dataset
from scripts.cross_validate import run_cross_validate
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D
from tensorflow.keras.activations import relu, sigmoid

import numpy as np
import gensim
import json
import pickle

import tensorflow as tf
from keras.preprocessing.sequence import pad_sequences

Using TensorFlow backend.


In [2]:
all_reviews = get_balanced_dataset()

In [3]:
reviews_contents = [x.review_content for x in all_reviews]
labels = [1 if x.label else 0 for x in all_reviews]

In [4]:
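short_reviews = []
short_labels = []
# Note: despite its name, this threshold is compared against the number of
# whitespace-separated words in a review, not its character count.
max_review_chars = 1000
for review, label in zip(reviews_contents, labels):
    # Skip very long reviews; reviews and labels are appended together so they stay aligned.
    if len(review.split()) > max_review_chars:
        continue
    short_reviews.append(review)
    short_labels.append(label)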

In [5]:
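max_review_words = 150
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(short_reviews)

# Keep only reviews of at most max_review_words tokens.
short_sequences = [x for x in tokenizer.texts_to_sequences(short_reviews) if len(x) <= max_review_words]
# pad_sequences already returns a NumPy array; maxlen makes the padded width explicit.
word_sequences = pad_sequences(short_sequences, maxlen=max_review_words)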
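
For intuition about what this cell produces, the following pure-Python sketch mimics the two steps on toy data: words are mapped to integer indices ranked by frequency (starting at 1, which leaves 0 free for padding), and shorter sequences are left-padded with zeros, matching the default `padding='pre'` behaviour of `pad_sequences`. The helper names `build_word_index`, `to_sequences`, and `pad_pre` are hypothetical, and the whitespace tokenization is a simplification that skips the punctuation filtering the real `Tokenizer` applies.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer rank, most frequent word first (index 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index):
    """Replace every known word by its index; unknown words are dropped."""
    return [
        [word_index[w] for w in text.lower().split() if w in word_index]
        for text in texts
    ]


def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and, if needed, left-truncate) sequences to a common length.

    Assumes maxlen >= 1 when given; defaults to the longest sequence.
    """
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    return [[value] * (maxlen - len(s)) + s[-maxlen:] for s in sequences]


toy_texts = ["good good food", "bad food", "good"]
word_index = build_word_index(toy_texts)
sequences = to_sequences(toy_texts, word_index)
padded = pad_pre(sequences)
print(word_index)
print(padded)
```

Every row of `padded` has the same length, which is why the real `word_sequences` can be handed to a network as a single rectangular array.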

In [6]:
# Apply the same length filter to the labels so they stay aligned with word_sequences.
short_labels_2 = []
for sequence, label in zip(tokenizer.texts_to_sequences(short_reviews), short_labels):
    if len(sequence) <= max_review_words:
        short_labels_2.append(label)
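
The two cells above apply the same length filter twice, once to the sequences and once to the labels. As a minimal sketch, both can be filtered in a single pass, which keeps the two lists aligned by construction and avoids tokenizing the reviews a second time. `filter_by_length` is a hypothetical helper, and the toy sequences stand in for the tokenizer output.

```python
def filter_by_length(sequences, labels, max_len):
    """Keep (sequence, label) pairs whose sequence has at most max_len tokens."""
    if len(sequences) != len(labels):
        raise ValueError("sequences and labels must have the same length")
    kept = [(s, y) for s, y in zip(sequences, labels) if len(s) <= max_len]
    kept_sequences = [s for s, _ in kept]
    kept_labels = [y for _, y in kept]
    return kept_sequences, kept_labels


# Toy data: the second sequence is too long and is dropped with its label.
toy_sequences = [[4, 2], [7, 1, 3, 9], [5]]
toy_labels = [1, 0, 1]
seqs, ys = filter_by_length(toy_sequences, toy_labels, max_len=3)
print(seqs, ys)
```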

In [7]:
len(short_labels_2)

125980

In [8]:
print(len(word_sequences[1]))

150
