THE IMDB DATASET

The IMDB dataset is a set of 50,000 highly polarised reviews from the Internet Movie Database (https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification).
The dataset is partitioned into 25,000 reviews for training and 25,000 reviews for testing. Within each partition there is a 50-50 balance between positive and negative reviews.

Classifying movie reviews as positive or negative is an example of a binary classification problem, also called two-class classification. Binary classification is the general task of separating the elements of a dataset into two groups on the basis of a classification rule. It is a type of supervised learning: the positive and negative classes are predefined, and the dataset used in training is already labelled with the correct class by an expert annotator. An expert annotator could be a human or an external system.

The IMDB dataset comes pre-packaged in Keras.
The reviews, which are sequences of words, have been pre-processed into sequences of integers, where each integer stands for a specific word in a dictionary.
The words in the reviews are indexed by overall frequency in the dataset, so that, for instance, in the underlying word index the number '3' encodes the 3rd most frequent word in the data.
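With the default arguments of 'load_data', the loaded sequences shift these indices by 3, because 0, 1 and 2 are reserved: 0 is conventionally kept for padding, 1 marks the start of a review and 2 encodes any unknown word.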
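
To make the encoding concrete, here is a minimal sketch of how an encoded review could be turned back into words. The vocabulary is a made-up toy mapping and 'decode_review' is a hypothetical helper; with Keras, the real mapping would come from imdb.get_word_index(). The sketch assumes the default 'load_data' arguments, under which the indices 0, 1 and 2 are reserved and every word index is shifted by 3.

```python
# Toy vocabulary (hypothetical): maps each word to its frequency rank, 1 = most frequent.
# With Keras, the real mapping would be obtained from imdb.get_word_index().
toy_word_index = {"the": 1, "film": 2, "was": 3, "great": 4}


def decode_review(encoded, word_index, index_from=3):
    """Turn a list of integer indices back into a string of words."""
    # Invert the mapping and shift the ranks by index_from,
    # because indices 0, 1 and 2 are assumed to be reserved.
    index_to_word = {rank + index_from: word for word, rank in word_index.items()}
    # Any index without a word (including the reserved ones) is shown as '?'.
    return " ".join(index_to_word.get(i, "?") for i in encoded)


# 1 marks the start of the review, 2 stands for an unknown word.
print(decode_review([1, 4, 5, 6, 2, 7], toy_word_index))  # ? the film was ? great
```

Decoding a few reviews this way is a quick sanity check that the integer sequences really do correspond to readable text.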


We start off by importing the IMDB dataset from the Keras package installed on our machine (line 1 of the code cell).

We load the dataset into two tuples (line 3). A tuple is a simple data structure that groups several values together.
The first tuple contains the data we will use to train our model: 'train_data' and 'train_labels'.
'train_data' contains a list of sequences, where each sequence is a list of indices that encodes the words of one review.
'train_labels' contains a list of integer labels that are either 1 (positive) or 0 (negative).
The second tuple, 'test_data' and 'test_labels', has the same structure and is held back for testing.

The imdb module provides the function 'load_data', which takes several parameters (https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification).
We only use the 'num_words' parameter, which is set to 10,000. This means that only the 10,000 most frequent words in the dataset are kept.
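
The effect of limiting the vocabulary can be sketched in a few lines of plain Python. This is an illustration of my understanding of the behaviour under the default arguments, not the Keras source: 'cap_vocabulary' is a hypothetical helper, the reviews are invented, and the marker value 2 for out-of-vocabulary words is assumed from the 'load_data' defaults.

```python
def cap_vocabulary(sequences, num_words, oov_char=2):
    """Replace every index that is >= num_words with the out-of-vocabulary marker."""
    return [[w if w < num_words else oov_char for w in seq] for seq in sequences]


# Hypothetical encoded reviews containing a few rare (high-index) words.
toy_reviews = [[1, 14, 22, 12000, 43], [1, 9999, 10000, 5]]

capped = cap_vocabulary(toy_reviews, num_words=10000)
print(capped)  # [[1, 14, 22, 2, 43], [1, 9999, 2, 5]]

# The largest index that survives is num_words - 1.
print(max(max(seq) for seq in capped))  # 9999
```

Because no index can reach 10,000, every review can later be encoded as a vector of exactly 10,000 entries.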

In [17]:
from keras.datasets import imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

# Check the size of the components of the tuples
print(train_data.size)
print(test_data.size)
print(train_labels.size)
print(test_labels.size)

# An example of what is contained in the training data:
# a sequence of integers that represents the words of one movie review
print(train_data[4])

# An example of what is contained in the training labels
print(train_labels)

25000
25000
25000
25000
[1, 249, 1323, 7, 61, 113, 10, 10, 13, 1637, 14, 20, 56, 33, 2401, 18, 457, 88, 13, 2626, 1400, 45, 3171, 13, 70, 79, 49, 706, 919, 13, 16, 355, 340, 355, 1696, 96, 143, 4, 22, 32, 289, 7, 61, 369, 71, 2359, 5, 13, 16, 131, 2073, 249, 114, 249, 229, 249, 20, 13, 28, 126, 110, 13, 473, 8, 569, 61, 419, 56, 429, 6, 1513, 18, 35, 534, 95, 474, 570, 5, 25, 124, 138, 88, 12, 421, 1543, 52, 725, 6397, 61, 419, 11, 13, 1571, 15, 1543, 20, 11, 4, 2, 5, 296, 12, 3524, 5, 15, 421, 128, 74, 233, 334, 207, 126, 224, 12, 562, 298, 2167, 1272, 7, 2601, 5, 516, 988, 43, 8, 79, 120, 15, 595, 13, 784, 25, 3171, 18, 165, 170, 143, 19, 14, 5, 7224, 6, 226, 251, 7, 61, 113]
[1 0 0 ... 0 1 0]


Before we proceed to building and training our model, we first have to preprocess the data we will train and test it with.
Data preprocessing is a common practice; it ensures that the dataset matches the input type expected by the neural network.
A neural network takes tensors as input, usually tensors of floating-point values, although in some cases tensors of integers are used.

Transforming your data into tensors is known as data vectorization. Essentially, we need to take the sequences of words, represented by lists of indices, and use 'one-hot' encoding to turn them into a tensor of float32 data (the default datatype in NumPy is float64; we use float32 to save memory).

'One-hot encoding' is the process of taking data made up of categories and transforming the categories into a binary representation. For example, the word sequence [3, 5] would be transformed into a 10,000-dimensional vector that is all 0s except at indices 3 and 5, which are 1s.
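
As a small worked example of the idea, the sketch below encodes one short sequence into a vector of length 10 rather than 10,000, purely so that the result fits on one line. The sequence and the dimension are chosen for illustration only.

```python
import numpy as np

dimension = 10          # small stand-in for the 10,000 used in this article
sequence = [3, 5, 3]    # the index 3 appears twice

vector = np.zeros(dimension, dtype=np.float32)
vector[sequence] = 1.0  # set every listed index to 1 in one step

print(vector)  # [0. 0. 0. 1. 0. 1. 0. 0. 0. 0.]
```

Note that the repeated index 3 still produces a single 1: this encoding records which words are present in a review, not how often they occur or in what order.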

We create a function called 'vectorize_sequences' to carry out the data vectorisation.
It creates a 2D tensor of zeros with the NumPy function 'np.zeros'.
The number of sequences sets the length of the first axis and the length of the second axis is set to 10,000, so each row is the encoding of one review.

In [16]:
import numpy as np


def vectorize_sequences(sequences, dimensions=10000):
    results = np.zeros((len(sequences), dimensions), np.float32)
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

# Vectorizing training data and test data
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

# Vectorizing training labels and test labels
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

[0. 1. 1. ... 0. 0. 0.]


After all the pre-requisite setup we have gone through, we can now proceed to building our network.

We begin by importing the models and layers APIs from the Keras module.

The models API enables the initialisation and definition of the network we are composing.
It provides classes that describe the structural composition of the network.
For this task we use the 'Sequential' class to define that composition.

The sequential model is a linear stack of layers: the layers are stacked one after the other, layer by layer.
In a sequential model, each layer takes its input from the layer before it and passes its output directly to the layer after.

(For more detail, see the Keras guide to the Sequential model: https://keras.io/getting-started/sequential-model-guide/)

We begin by calling the Sequential constructor from the models API.
Thereafter we call the add method on the Sequential instance we created.
The add method takes a layer object, which defines one layer of the network.
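
The idea of a linear stack can be sketched without Keras at all. 'TinySequential' below is a hypothetical toy class written for this illustration, not part of any library, and its "layers" are plain Python functions rather than real neural network layers.

```python
class TinySequential:
    """Hypothetical, stripped-down stand-in for a linear stack of layers."""

    def __init__(self):
        self.layers = []

    def add(self, layer):
        # Each new layer is appended to the end of the stack.
        self.layers.append(layer)

    def __call__(self, x):
        # The output of one layer becomes the input of the next.
        for layer in self.layers:
            x = layer(x)
        return x


stack = TinySequential()
stack.add(lambda x: x + 1)   # first "layer"
stack.add(lambda x: x * 10)  # second "layer"

print(stack(2))  # (2 + 1) * 10 = 30
```

The order in which layers are added is the order in which the data flows through them.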

One key point to note is that the first layer in your network needs to be told what input shape to expect; the layers after the first do not need an input shape, as they can perform automatic shape inference.

To begin, we create our first layer by calling the 'model.add()' method, to which we pass an object that defines the type of layer.
The layer itself is created by calling the 'layers.Dense()' constructor.
We are creating a dense layer, which means that each neuron within the layer receives input from all the neurons in the previous layer.
The 'layers.Dense()' constructor takes parameters that define the number of units, the type of activation function and the expected input shape; remember that the input shape is only defined for the first layer.

The number of units in our first layer will be 16 (you can consider units to be a synonym for neurons in this case); the second layer also has 16 units, so each of the first two layers has 16 neurons.
In the last layer we define a layer with just 1 unit.

A hidden unit is a dimension in the representation space of the layer.
Having 16 hidden units means that the weight matrix will have shape (input_dimensions, 16).
The dot product of W with the input projects the data onto a 16-dimensional representation space; then the bias is added and the relu operation is applied.
The dimensionality of the representation space is how much freedom you are giving the network when learning internal representations. A higher-dimensional representation space allows the network to learn more complex representations, but it makes the network more computationally expensive and can lead to the learning of unwanted patterns.
The key architecture decisions to be made about such a stack of Dense layers are the following:
- How many layers to use
- How many hidden units to choose for each layer
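
To get a feel for how these two choices affect the size of the network, the sketch below counts the trainable parameters of a stack of dense layers. 'dense_param_count' is a hypothetical helper written for this illustration; it assumes each dense layer has one weight per input-unit pair plus one bias per unit.

```python
def dense_param_count(input_dim, units):
    """Weights (input_dim x units) plus one bias per unit."""
    return input_dim * units + units


layer_sizes = [16, 16, 1]   # units per layer, as in this article
input_dim = 10000           # length of each vectorised review

total = 0
for units in layer_sizes:
    count = dense_param_count(input_dim, units)
    print(f"Dense({units}) on {input_dim} inputs: {count} parameters")
    total += count
    input_dim = units       # this layer's output feeds the next layer

print("Total:", total)  # Total: 160305
```

Almost all of the parameters sit in the first layer, because it is the only one connected to the 10,000-dimensional input.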

The last layer of our network will output a scalar prediction of the sentiment of the current review.
The final layer uses a sigmoid activation so as to output a probability: a score between 0 and 1 that indicates how likely the review is to be positive.
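
A minimal NumPy sketch of the sigmoid function shows how arbitrary scores are squashed into that range. The raw scores are invented for illustration, and the 0.5 cut-off used to turn a probability into a label is a common convention that I am assuming here, not something fixed by the model.

```python
import numpy as np


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


raw_scores = np.array([-4.0, 0.0, 4.0])   # hypothetical pre-activation values
probabilities = sigmoid(raw_scores)
predicted_labels = (probabilities > 0.5).astype(int)  # 1 = positive, 0 = negative

print(probabilities)
print(predicted_labels.tolist())  # [0, 0, 1]
```

A score of exactly 0 maps to a probability of 0.5, which is the point of maximum uncertainty.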

The inclusion of an activation function such as relu (rectified linear unit) increases the representational power of the network.
Without the relu activation function the dense layer would consist of two linear operations, a dot product and an addition:

output = dot(W, input) + b

The layer could then only learn linear transformations (affine transformations) of the input data.
The hypothesis space would be restricted to the set of all possible linear transformations of the input data into a 16-dimensional space, and adding more layers would not help, because a stack of linear layers is still a linear operation.
To gain access to a richer hypothesis space, we need the deeper representations that are obtained by adding a non-linearity, or activation function.
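
The limitation of purely linear layers can be checked numerically. The sketch below uses small random matrices (the sizes are arbitrary choices for illustration) and the same dot(W, input) + b convention as the formula above, and shows that two stacked layers without an activation collapse into one equivalent layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)                       # toy input vector

# Two stacked layers without an activation function.
W1, b1 = rng.normal(size=(4, 5)), rng.normal(size=4)
W2, b2 = rng.normal(size=(3, 4)), rng.normal(size=3)
two_layers = np.dot(W2, np.dot(W1, x) + b1) + b2

# One equivalent layer: W = W2 W1 and b = W2 b1 + b2.
W, b = np.dot(W2, W1), np.dot(W2, b1) + b2
one_layer = np.dot(W, x) + b

print(np.allclose(two_layers, one_layer))  # True
```

Placing a non-linear function between the two layers breaks this equivalence, which is what lets depth add expressive power.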

Relu is simply an operation that sets negative input values to zero and leaves positive input values unchanged.
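
The relu operation itself is a one-liner in NumPy. This is a sketch of the mathematical definition with invented input values, not the Keras implementation.

```python
import numpy as np


def relu(z):
    """Element-wise max(z, 0)."""
    return np.maximum(z, 0.0)


values = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(values).tolist())  # [0.0, 0.0, 0.0, 1.5, 3.0]
```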



In [19]:
from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))


Now that we have defined our model structure, layers and units, we can move on to configuring the model for training.

The model class has a method called compile that takes several parameters that define the configuration and behaviour of the model during training.

One of the parameters of the compile method is the optimizer. An optimizer updates the weight parameters in order to reduce the loss (or cost) function. We will be using rmsprop.
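
To give a rough idea of what such an optimizer does, here is a sketch of the RMSprop update rule as it is commonly described, applied to a single weight. It is not the Keras implementation: 'rmsprop_step' is a hypothetical helper, the loss is a made-up toy function, and the learning rate is larger than the 0.001 used in this article so that the toy example converges in a few thousand steps.

```python
import math


def rmsprop_step(w, grad, avg_sq, lr=0.01, rho=0.9, eps=1e-7):
    """One RMSprop update for a single scalar weight."""
    # Running average of the squared gradient.
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2
    # Scale the step by the root of that average.
    w = w - lr * grad / (math.sqrt(avg_sq) + eps)
    return w, avg_sq


# Toy loss (hypothetical): loss(w) = (w - 3)^2, with gradient 2 * (w - 3).
w, avg_sq = 0.0, 0.0
for _ in range(2000):
    grad = 2.0 * (w - 3.0)
    w, avg_sq = rmsprop_step(w, grad, avg_sq)

print(round(w, 2))  # close to 3, the minimum of the toy loss
```

Dividing by the running average keeps the step size roughly constant regardless of how large or small the raw gradient is.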

Another parameter is the loss function to be used during training. The loss function is the quantity that training tries to minimise; its gradient guides the weight updates towards a local minimum. The loss function we will be using is 'binary_crossentropy'. Cross-entropy is usually the best choice when dealing with models that output probabilities.
Cross-entropy measures the distance between the probability distribution of the ground truth and the predictions made by our model.
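
The standard formula for binary cross-entropy can be written out in NumPy as a sketch. The labels and predictions below are invented, and the small clipping constant is an assumption added to avoid taking the logarithm of zero; Keras may handle this detail differently.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=np.float64)
    # Clip to avoid log(0) for predictions that are exactly 0 or 1.
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1.0 - eps)
    per_example = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(per_example))


labels = [1, 0, 1, 0]                       # hypothetical ground truth
confident_and_right = [0.9, 0.1, 0.9, 0.1]
confident_and_wrong = [0.1, 0.9, 0.1, 0.9]

print(binary_crossentropy(labels, confident_and_right))  # small loss, about 0.105
print(binary_crossentropy(labels, confident_and_wrong))  # much larger loss, about 2.303
```

The loss grows sharply when the model is confident and wrong, which is exactly the behaviour we want to penalise.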

In [20]:
from keras import optimizers, losses, metrics

model.compile(optimizer=optimizers.RMSprop(lr=0.001), loss=losses.binary_crossentropy,
              metrics=[metrics.binary_accuracy])



We will create a validation set by setting apart 10,000 examples from our training data.
The purpose of doing this is to measure the accuracy of our model on unseen data during training.

In [21]:
x_val = x_train[:10000]
y_val = y_train[:10000]
partial_x_train = x_train[10000:]
partial_y_train = y_train[10000:]

In order to train our model we make use of the fit method of the model we have instantiated.
The fit method takes parameters such as the data to train on, the number of epochs, the batch size and the validation data to use during training.

model.fit() returns a History object with a member called history, a dictionary that contains information on what occurred during training.
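
As a sketch of how that dictionary can be used, the snippet below finds the epoch with the lowest validation loss. 'toy_history' is a hypothetical stand-in for history.history with invented numbers; the keys 'loss' and 'val_loss' are assumed to hold one value per epoch.

```python
# Hypothetical stand-in for history.history; the numbers are invented for illustration.
toy_history = {
    "loss": [0.52, 0.31, 0.23, 0.18, 0.14],
    "val_loss": [0.40, 0.31, 0.28, 0.29, 0.33],
}

val_loss = toy_history["val_loss"]
# Epochs are numbered from 1, list positions from 0.
best_epoch = val_loss.index(min(val_loss)) + 1

print("Lowest validation loss at epoch", best_epoch)  # epoch 3
```

When the training loss keeps falling while the validation loss starts to rise, as in these made-up numbers, the model is likely beginning to overfit.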

In [22]:
history = model.fit(partial_x_train, partial_y_train, epochs=20, batch_size=512, validation_data=(x_val, y_val))



Train on 15000 samples, validate on 10000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [25]:
%matplotlib notebook
import matplotlib.pyplot as plt

history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']

epochs = range(1, len(loss_values) + 1)

plt.plot(epochs, loss_values, 'bo', label='Training Loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation Loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()


In [26]:
plt.clf()
acc = history_dict['binary_accuracy']
val_acc = history_dict['val_binary_accuracy']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()


In [27]:
model.predict(x_test)

array([[0.01435599],
       [0.9999999 ],
       [0.94098186],
       ...,
       [0.00154093],
       [0.00842685],
       [0.7272435 ]], dtype=float32)