## Binary classification of movie reviews

Based on [Deep learning with Python](https://www.manning.com/books/deep-learning-with-python).

This example applies a multilayer perceptron to classify movie reviews as "positive" or "negative", based on the text content of the reviews.

We will be using the [IMDB dataset](https://keras.io/api/datasets/imdb/) packaged with Keras. This is a dataset of movie reviews from IMDB, labeled by sentiment (positive/negative), with 25,000 reviews for training and 25,000 for testing. Reviews have been preprocessed, and each review is encoded as a list of word indexes (integers). For convenience, words are ranked by overall frequency in the dataset, so that for instance rank 3 belongs to the 3rd most frequent word in the data. This allows for quick filtering operations, such as "only consider the top 10,000 most common words".

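By convention, the lowest indices do not stand for specific words: with the default arguments of `imdb.load_data`, "0" is reserved for padding, "1" marks the start of a review, and "2" encodes any unknown word. Word ranks are shifted by 3 in the encoded reviews to make room for these reserved values.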
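
The frequency-based indexing described above can be sketched on a toy corpus. This is a minimal illustration, not the way Keras builds its index: the helper names `build_frequency_index` and `encode`, the `top_k` parameter, and the choice of `0` as the placeholder for filtered words are all assumptions made for this sketch.

```python
from collections import Counter


def build_frequency_index(tokenized_docs):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    # Sort by descending count; break ties alphabetically so the result is deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def encode(doc, index, top_k, oov_index=0):
    """Encode a document as ranks, replacing words outside the top_k by oov_index."""
    encoded = []
    for word in doc:
        rank = index.get(word)
        encoded.append(rank if rank is not None and rank <= top_k else oov_index)
    return encoded


# Hypothetical toy corpus, already tokenized.
docs = [
    ["the", "film", "was", "great"],
    ["the", "film", "was", "bad"],
    ["the", "plot", "was", "thin"],
]
index = build_frequency_index(docs)
encoded = encode(["the", "film", "was", "thin"], index, top_k=3)
print(index)
print(encoded)
```

Because the index is sorted by frequency, keeping "the top k words" reduces to a single comparison on the integer rank.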


In [None]:
# importing necessary libraries
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
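from keras.datasets import imdb
from keras import models
from keras import layers
# imported from the same package as models and layers; mixing keras and tensorflow.keras objects can cause errors
from keras import optimizers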


### Loading the dataset

The argument `num_words=10000` means that we will only keep the top 10,000 most frequently occurring words in the training data. Rarer words are not kept as distinct words: they are replaced by the "unknown word" index.

The variables `train_data` and `test_data` are arrays of reviews, each review being a list of word indices (encoding a sequence of words). `train_labels` and `test_labels` are arrays of `0`s and `1`s, where 0 stands for *negative* and 1 stands for *positive*.

In [None]:
# loading the dataset and splitting into training and testing sets
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

### Exploratory analysis

In [None]:
# looking at the training data
train_data[0]

In [None]:
# looking at the labels
train_labels

In [None]:
# checking the largest word index in the training data (with num_words=10000 it stays below 10,000)
max([max(sequence) for sequence in train_data])

These are helper commands to decode reviews back to English words.

In [None]:
# word_index is a dictionary mapping words to an integer index
word_index = imdb.get_word_index()
# We reverse it, mapping integer indices to words
reverse_word_index = {value: key for key, value in word_index.items()}
# We decode the review; note that our indices were offset by 3
# because 0, 1 and 2 are reserved indices for "padding", "start of sequence", and "unknown".
decoded_review = ' '.join(reverse_word_index.get(i - 3, '?') for i in train_data[0])

In [None]:
decoded_review
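
To make the offset-by-3 decoding concrete, here is a small sketch that runs without downloading anything. The dictionary `toy_word_index` and the encoded review are invented for illustration; only the offset of 3 and the `'?'` fallback mirror the decoding cell above.

```python
# Hypothetical stand-in for imdb.get_word_index(): word -> frequency rank (1-based).
toy_word_index = {"the": 1, "and": 2, "film": 3, "great": 4}

INDEX_OFFSET = 3  # ranks are shifted by 3 in the encoded reviews

# Reverse mapping: rank -> word.
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}


def decode(encoded):
    """Turn a list of encoded integers back into text; reserved values become '?'."""
    return " ".join(toy_reverse_index.get(i - INDEX_OFFSET, "?") for i in encoded)


# 1 and 2 are reserved values, so they do not map back to any word.
encoded_review = [1, 4, 6, 7, 2]
decoded = decode(encoded_review)
print(decoded)
```

Subtracting the offset sends the reserved values 0, 1 and 2 to non-positive ranks, which are absent from the reversed dictionary and therefore fall back to `'?'`.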

### Data pre-processing

We need to **transform the lists of integers into tensors**, so we can feed them into a neural network. 

There are different ways to do that. Our approach will be to one-hot-encode our lists to turn them into vectors of 0s and 1s (since several positions can be set at once, this is also called multi-hot encoding). For instance, a sequence `[3, 5]` will be encoded into a 10,000-dimensional vector that is all zeros except for indices 3 and 5, which are ones. Then, the first layer of our network can be a `Dense` layer, capable of handling floating point vector data.

In [None]:
# helper function for encoding the integer sequences into a binary matrix
def vectorize_sequences(sequences, dimension=10000):
    # create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set specific indices of results[i] to 1s
    return results
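
A quick way to see what this encoding keeps and what it drops is to run the same helper on a tiny input. The sequences and `dimension=8` below are arbitrary values chosen for this sketch.

```python
import numpy as np


def vectorize_sequences(sequences, dimension=10000):
    # Same helper as above, repeated so this example runs on its own.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results


# Two toy "reviews"; the second one repeats index 1.
toy = vectorize_sequences([[3, 5], [1, 1, 2]], dimension=8)
print(toy)
```

The second row has only two ones even though the sequence has three entries: the encoding records whether a word occurs, not how often or in which order.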

In [None]:

# vectorized training data
x_train = vectorize_sequences(train_data)
# vectorized test data
x_test = vectorize_sequences(test_data)

In [None]:
# instance of the training data
x_train[0]

In [None]:
# vectorized labels
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

To monitor, during training, the accuracy of the model on data that it has never seen before, we will create a "validation set" by setting apart 10,000 samples from the original training data:

In [None]:
# setting a validation sample for accuracy assessment
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]
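
The same slicing can be wrapped in a small helper and checked on toy arrays. The function name `holdout_split` and the toy data are assumptions of this sketch; a plain slice like this is only a fair split if the samples are not ordered by label.

```python
import numpy as np


def holdout_split(x, y, n_val):
    """Set apart the first n_val samples for validation; the rest stay for training."""
    if not 0 < n_val < len(x):
        raise ValueError("n_val must be between 1 and len(x) - 1")
    return (x[n_val:], y[n_val:]), (x[:n_val], y[:n_val])


# Toy data: 10 samples with 2 features each, alternating labels.
x = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2

(x_part, y_part), (x_hold, y_hold) = holdout_split(x, y, n_val=4)
print(x_part.shape, x_hold.shape)
```

With the sizes used in this notebook, setting apart 10,000 of the 25,000 training reviews leaves 15,000 for fitting.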

### Model definition

In [None]:
# model specification
# we are using two hidden layers with 16 units each and 'relu' activation
# also, an output layer with a single unit using 'sigmoid' activation to output
# a probability indicating how likely the sample is to have a label 1
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
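
To show what these three layers compute, here is a NumPy sketch of the forward pass for the same layer sizes. It is not how Keras implements `Dense`; the random initialisation, the helper names, and the toy input are assumptions made for illustration.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, params):
    """Apply relu(x @ W + b) for each hidden layer, then a sigmoid output layer."""
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)
    W, b = params[-1]
    return sigmoid(h @ W + b)


# Layer sizes of the model above: 10,000 inputs -> 16 -> 16 -> 1.
sizes = [10000, 16, 16, 1]
rng = np.random.default_rng(0)
params = [
    (rng.normal(scale=0.01, size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

# Two toy inputs: one with words 3 and 5 present, one with no words at all.
x = np.zeros((2, 10000))
x[0, [3, 5]] = 1.0

probs = forward(x, params)
n_params = sum(W.size + b.size for W, b in params)
print(probs.shape, n_params)
```

Counting weights and biases gives 10,000 × 16 + 16 + 16 × 16 + 16 + 16 + 1 = 160,305 trainable parameters, which should match what `model.summary()` reports for the Keras model.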

In [None]:
# model parameters
# optimizer, loss function, and performance metric
model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])
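
The `binary_crossentropy` loss can be written out by hand to see how it penalises predictions. This is a sketch of the standard formula, not Keras's own implementation; the clipping constant `eps` and the example probabilities are assumptions.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    # Clip predictions away from 0 and 1 so that log() stays finite.
    p = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))


y_true = np.array([1.0, 0.0, 1.0, 0.0])
mostly_right = np.array([0.9, 0.1, 0.8, 0.2])   # probabilities close to the labels
mostly_wrong = np.array([0.1, 0.9, 0.2, 0.8])   # probabilities far from the labels
undecided = np.full(4, 0.5)                     # always predicts 0.5

loss_right = binary_crossentropy(y_true, mostly_right)
loss_wrong = binary_crossentropy(y_true, mostly_wrong)
loss_undecided = binary_crossentropy(y_true, undecided)
print(loss_right, loss_undecided, loss_wrong)
```

A model that always outputs 0.5 gets a loss of ln 2 (about 0.693), so a training loss well below that value indicates the network has learned something beyond guessing.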

### Training the model

We will train the model for 20 epochs (20 iterations over all samples in the `partial_x_train` and `partial_y_train` tensors), in mini-batches of 512 samples. We will keep track of loss and accuracy (stored in the `history.history` dictionary) on the 10,000 samples that we set apart, by passing the validation data as the `validation_data` argument.

In [None]:
# training the model
# training parameters are: training features and labels, number of epochs, batch size, and validation data
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))
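
The training settings above translate into a number of weight updates that is easy to work out. The arithmetic below assumes the last, smaller batch of each epoch is kept rather than dropped.

```python
import math

n_train = 25000 - 10000   # training samples left after setting apart the validation set
batch_size = 512
epochs = 20

# Each epoch walks through the data once, one mini-batch per weight update.
steps_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch_size, total_updates)
```

So each epoch consists of 30 updates (29 full batches and one of 152 samples), for 600 updates over the whole run.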

In [None]:
# checking the history object
history_dict = history.history
history_dict.keys()

### Visualising model performance

In [None]:
# Getting necessary data for plotting
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)

In [None]:
# Plotting training and validation loss
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

In [None]:
# Plotting training and validation accuracy
plt.clf()   # clear figure
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
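
Besides reading the curves by eye, the epoch with the lowest validation loss can be located programmatically. The helper `best_epoch` is a name introduced for this sketch, and the numbers below are made up for illustration; they are not results from this model.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch at which the validation loss is lowest."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1


# Invented values with the typical shape of an overfitting run:
# the validation loss falls at first, then rises again.
example_val_loss = [0.40, 0.31, 0.28, 0.29, 0.33, 0.38]

epoch = best_epoch(example_val_loss)
print(epoch)
```

In the notebook itself, the same helper could be called as `best_epoch(history.history['val_loss'])`.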

### Model (results) interpretation

What is happening with this model?

---

Let's train a second model from scratch for fewer epochs (4 instead of 20), this time on the full training set, and compare the results.

In [None]:
# model definition
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

# model parameters
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# training on the full training set for 4 epochs, then evaluating on the test set
history2 = model.fit(x_train, y_train, epochs=4, batch_size=512)
results = model.evaluate(x_test, y_test)

### Checking the final results

In [None]:
results

In [None]:
# Getting the training metrics of the second model (no validation data was passed this time)
acc = history2.history['accuracy']
loss = history2.history['loss']
epochs = range(1, len(acc) + 1)

In [None]:
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.title('Training loss - new model')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

In [None]:
# Plotting training accuracy
plt.clf()   # clear figure
plt.plot(epochs, acc, 'bo', label='Training accuracy')
plt.title('Training accuracy - new model')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

### Making predictions

After training a network, we can use it to estimate the probability that each review is positive, using the `predict` method:

In [None]:
model.predict(x_test)
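
The probabilities returned by `predict` can be turned into class labels by thresholding. The sketch below uses made-up probabilities and labels (not output from this model) and assumes a cut-off of 0.5.

```python
import numpy as np

# Invented probabilities with the same shape as the predict output: (n_samples, 1).
example_probs = np.array([[0.98], [0.03], [0.51], [0.49]])
example_labels = np.array([1.0, 0.0, 0.0, 0.0])

# Flatten to one value per sample, then label a review positive above the threshold.
threshold = 0.5
predicted = (example_probs.ravel() > threshold).astype("float32")

# Fraction of samples where the predicted label matches the true label.
accuracy = float(np.mean(predicted == example_labels))
print(predicted, accuracy)
```

Values close to 0 or 1 indicate a confident prediction, while values near the threshold, such as 0.51 and 0.49 here, are the ones most likely to flip with small changes to the model.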