# Classifying movie reviews

In this example, we will learn to classify movie reviews as "positive" or "negative", based only on the text content of the reviews. This is an example of binary classification.

## The Dataset

We will use the IMDB dataset: a set of 50,000 reviews from the Internet Movie Database. They are split into 25,000 reviews for training and 25,000 reviews for testing, each set consisting of 50% negative and 50% positive reviews.

The IMDB dataset comes pre-built in Keras. It has already been preprocessed: the reviews (sequences of words) have been turned into sequences of integers, where each integer stands for a specific word in a dictionary.

The following code will load the dataset (when you run it for the first time, about 80MB of data will be downloaded to your machine):

In [None]:
from keras.datasets import imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

**Check the dimensions of the training and test data.**

## Prepare the Data

We cannot feed lists of integers into a neural network. We have to turn our lists into tensors. There are two ways to do that:

- We could pad our lists so that they all have the same length, turn them into an integer tensor of shape (samples, word_indices), and then use as the first layer in our network a layer capable of handling such integer tensors (the Embedding layer, which we will cover in detail later).
- We could one-hot-encode our lists to turn them into vectors of 0s and 1s. Concretely, this would mean, for instance, turning the sequence [3, 5] into a 10,000-dimensional vector that is all zeros except for indices 3 and 5, which are ones. Then we could use as the first layer in our network a Dense layer, capable of handling floating-point vector data.

We will go with the latter solution. Let's vectorize our data, which we will do manually for maximum clarity:

In [None]:
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set specific indices of results[i] to 1s
    return results

# Our vectorized training data
x_train = vectorize_sequences(train_data)
# Our vectorized test data
x_test = vectorize_sequences(test_data)

**Check one sample.**
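
As a small illustration of what the encoding above produces, here is a sketch on toy input. The sequences and the dimension of 10 are hypothetical and chosen only to keep the output readable; the function body is the same as `vectorize_sequences` above.

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # Same logic as the function defined earlier in this notebook
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0
    return results

# Hypothetical "reviews", already encoded as word indices
toy_sequences = [[3, 5], [0, 3, 3, 9]]
toy_vectors = vectorize_sequences(toy_sequences, dimension=10)

print(toy_vectors.shape)  # (2, 10)
print(toy_vectors[0])     # [0. 0. 0. 1. 0. 1. 0. 0. 0. 0.]
print(toy_vectors[1])     # [1. 0. 0. 1. 0. 0. 0. 0. 0. 1.]
```

Note that index 3 appears twice in the second sequence but still yields a single 1: this encoding records which words occur, not how often or in what order.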

In [None]:
# Our vectorized labels
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

## Building the Network

The input data is simply vectors, and the labels are scalars (1 and 0): this is the easiest setup you will ever encounter. 
Key questions: how many layers do we need? How many hidden units for each layer? 
We will build a simple stack of fully-connected (Dense) layers. The network should have the following structure:
- Dense layer with 16 hidden units and ReLU activation function;
- Dense layer with 16 hidden units and ReLU activation function;
- Dense output layer with a single unit and sigmoid activation function.

Remember: for a fully connected layer with ReLU activation, output = relu(dot(W, input) + b), where W is the weight matrix and b is the bias.
Remember: the sigmoid function outputs a score between 0 and 1, which indicates how likely the sample is to have the label 1, that is, how likely the review is to be positive.
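
The two reminders above can be sketched in plain NumPy. This is only an illustration of the arithmetic, not how Keras implements its layers: the weights below are hypothetical random values rather than trained parameters, and the names `W1`, `b1`, and so on are made up for this sketch.

```python
import numpy as np

def relu(x):
    # Element-wise max(x, 0)
    return np.maximum(x, 0.0)

def sigmoid(x):
    # Squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
input_dim, hidden_units = 10000, 16

# A toy one-hot-encoded review containing only words 3 and 5
x = np.zeros(input_dim)
x[[3, 5]] = 1.0

# Hypothetical, randomly initialised parameters (assumption: not trained)
W1, b1 = rng.normal(scale=0.01, size=(hidden_units, input_dim)), np.zeros(hidden_units)
W2, b2 = rng.normal(scale=0.01, size=(hidden_units, hidden_units)), np.zeros(hidden_units)
W3, b3 = rng.normal(scale=0.01, size=(1, hidden_units)), np.zeros(1)

h1 = relu(np.dot(W1, x) + b1)        # first hidden layer, shape (16,)
h2 = relu(np.dot(W2, h1) + b2)       # second hidden layer, shape (16,)
score = sigmoid(np.dot(W3, h2) + b3) # output, shape (1,), between 0 and 1

print(h1.shape, h2.shape, score.shape)
```

With untrained weights the score carries no information about the review; training is what adjusts W and b so that the score becomes meaningful.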

**Import models and layers from Keras.**

**Create the network model as described above.**

**Configure the model with an optimizer and a loss function. Check the Keras documentation and try to work out which ones best suit your problem. Use `accuracy` as the metric.**

**Do the same as in the previous cell, but pass an optimizer class instance as the `optimizer` argument. Check the Keras documentation for more details.**

**Do the same as in the previous cell, but pass function objects as the `loss` or `metrics` arguments.**

## Validating the approach

**Create a validation set of 10,000 samples, and use the remaining samples as the training set.**
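
One common way to hold out a validation set is simple array slicing. The sketch below shows the idea on toy arrays; the array names and sizes are hypothetical stand-ins, not the notebook's actual data.

```python
import numpy as np

# Toy stand-ins for the vectorized samples and labels (hypothetical sizes)
x_all = np.arange(20, dtype='float32').reshape(10, 2)
y_all = np.array([0, 1, 0, 1, 1, 0, 1, 0, 0, 1], dtype='float32')

n_val = 4  # number of samples to set aside for validation

# The first n_val samples become the validation set, the rest remain for training
x_val, partial_x = x_all[:n_val], x_all[n_val:]
y_val, partial_y = y_all[:n_val], y_all[n_val:]

print(x_val.shape, partial_x.shape)  # (4, 2) (6, 2)
```

The same slice boundary must be used for the samples and the labels so that each sample stays paired with its own label.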

**Train the model for 20 epochs in mini-batches of 512 samples. Use the validation set you have just created. Check the Keras documentation for more information.**

**Check the `.history` attribute of the object returned by `fit` and print it to see what format it has and how to access its values.**
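
The `.history` attribute is a dictionary that maps each tracked quantity to a list with one value per epoch. The sketch below uses a hand-written stand-in with made-up numbers, purely to show how such a dictionary can be inspected. The exact key names for accuracy depend on the Keras version and on how the metric was specified, so print the keys of your own dictionary to check.

```python
# Hypothetical stand-in for `history.history`; every number here is made up
history_dict = {
    'loss': [0.52, 0.31, 0.23, 0.18],
    'val_loss': [0.40, 0.31, 0.28, 0.30],
}

# List the quantities that were recorded
print(list(history_dict.keys()))

# Each entry is a list with one value per epoch
val_loss = history_dict['val_loss']
print(len(val_loss))

# Epoch (counted from 1) with the lowest validation loss
best_epoch = val_loss.index(min(val_loss)) + 1
print(best_epoch)
```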

**Complete the #TO DO items to obtain the plots of the training and validation loss and accuracy.**

In [None]:
import matplotlib.pyplot as plt

acc = #TO DO: take from the history dictionary of the model the training accuracy
val_acc =  #TO DO: take from the history dictionary of the model the validation accuracy
loss =  #TO DO: take from the history dictionary of the model the training loss
val_loss =  #TO DO: take from the history dictionary of the model the validation loss 

epochs = range(1, len(acc) + 1)

# Plot the Loss

# "bo" is for "blue dot"
plt.plot(epochs,#TO DO: use the training loss, 'bo', label='Training loss') 
# b is for "solid blue line"
plt.plot(epochs,  #TO DO: use the validation loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

In [None]:
# Plot the Accuracy

plt.clf()   # clear figure
acc_values =  #TO DO: take from the history dictionary of the model the training accuracy
val_acc_values =  # TO DO: take from the history dictionary of the model the validation accuracy

# Plot Epochs vs Training accuracy and Epochs vs Validation accuracy
plt.plot(epochs, #TO DO: use training accuracy, 'bo', label='Training acc')
plt.plot(epochs, #TO DO: use validation accuracy, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

**What can you conclude from these results? Try to analyse and understand them.**

**Train a new network from scratch for 4 epochs, then evaluate it on the test data.**

## Test the Network

**Make predictions with your model.**
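
Because the output layer uses a sigmoid, each prediction is a score between 0 and 1 rather than a class label. A minimal sketch of turning scores into labels follows. The scores and labels are hypothetical, and the 0.5 threshold is a conventional choice, not something fixed by the model.

```python
import numpy as np

# Hypothetical sigmoid scores, shaped (samples, 1) like the output of a 1-unit layer
scores = np.array([[0.92], [0.08], [0.55], [0.47]])
true_labels = np.array([1, 0, 0, 0])

# Scores above the threshold are read as "positive" (1), the rest as "negative" (0)
predicted = (scores[:, 0] > 0.5).astype('int32')
accuracy = np.mean(predicted == true_labels)

print(predicted)  # [1 0 1 0]
print(accuracy)   # 0.75
```

Scores close to 0 or 1 indicate a confident prediction, while scores near 0.5 (such as the last two above) indicate that the model is unsure.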

## Further experiments
**NOTE: In the next lab, we will focus more on optimizing the hyperparameters and understanding how they influence the outcomes. In the meantime, here are some simple experiments you can try now.**

**Try out these experiments, and re-compute the accuracy each time to check how it changes.**
#### Experiment 1
We used 2 hidden layers. Try using 1 or 3 hidden layers and see how this affects the validation and test accuracy.
#### Experiment 2
Try using layers with more or fewer hidden units: 32 units, 64 units...
#### Experiment 3
Try using another loss function.
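
To get a feel for how two loss functions can judge the same predictions differently, here is a NumPy sketch comparing binary cross-entropy with mean squared error. These two are chosen only as an illustration and are written by hand rather than taken from Keras. The predictions are hypothetical.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0) for predictions of exactly 0 or 1
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 0.0, 1.0])
confident_right = np.array([0.9, 0.1, 0.9])  # hypothetical good predictions
confident_wrong = np.array([0.1, 0.9, 0.1])  # hypothetical bad predictions

for name, y_pred in [('right', confident_right), ('wrong', confident_wrong)]:
    print(name,
          binary_crossentropy(y_true, y_pred),
          mean_squared_error(y_true, y_pred))
```

Mean squared error on probabilities can never exceed 1, whereas cross-entropy grows without bound as a confident prediction turns out to be wrong.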
#### Experiment 4
Try to use the tanh activation instead of ReLU.
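
Before running this experiment, it can help to compare the two activations on a few sample values. The inputs below are arbitrary and chosen only for illustration.

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # arbitrary pre-activation values

relu_out = np.maximum(z, 0.0)  # zero for negative inputs, identity otherwise
tanh_out = np.tanh(z)          # squashed into the interval (-1, 1)

print(relu_out)
print(tanh_out)
```

ReLU discards negative inputs and is unbounded above, while tanh keeps the sign of its input and saturates towards -1 and 1.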