# Text Classification using CNNs
In the previous workbook we managed to create a reasonable Sentiment Classifier. You may have found that when testing the classifier it was easy to fool the model.

Part of the limitation of the previous model is that it based its classification on the presence of words; the relationships between words in a sequence were not really being taken into account.
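
To make the difference between word presence and word order concrete, here is a small sketch. It is a simplified illustration and not the previous workbook's model: `bag_of_words` and `bigrams` are hypothetical helper names, and the two reviews are made up.

```python
# Two short reviews that use exactly the same words in a different order.
review_a = "not good just bad"
review_b = "not bad just good"

def bag_of_words(text):
    """Word-presence view: which words occur, ignoring order."""
    return set(text.split())

def bigrams(text):
    """Order-aware view: pairs of neighbouring words."""
    tokens = text.split()
    return list(zip(tokens, tokens[1:]))

same_words = bag_of_words(review_a) == bag_of_words(review_b)
same_bigrams = bigrams(review_a) == bigrams(review_b)

print("Same set of words:", same_words)
print("Same neighbouring pairs:", same_bigrams)
```

A model that only sees which words are present cannot tell these two reviews apart, whereas anything that looks at neighbouring words can.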

We had a similar issue with images and resolved it with Convolutional layers to great effect. We can use a similar approach with text. With images we used a 2-dimensional Convolutional layer, but a sequence of text has only one dimension (its length), so instead we will use 1-dimensional Convolutional layers.

In this workbook we will continue with our Sentiment Classifier example but implement the model using 1-D Convolutional layers instead of Dense layers.

# Importing some packages
We are using the Python programming language and a set of Machine Learning packages. Importing packages for use is a common task. For this workshop you don't really need to pay much attention to this step (but you do need to execute the cell) since we are focusing on building models. However, the following is a description of what this cell does that you can read if you are interested.

### Description of imports (Optional)
You don't need to worry about this code as it is not the focus of the workshop, but if you are interested in what the next cell does, here is an explanation.

|Statement|Meaning|
|---|---|
|__import tensorflow as tf__ |TensorFlow (from Google) is our main machine learning library. It performs all of the various calculations for us and so hides much of the detailed complexity in Machine Learning. This _import_ statement makes the power of TensorFlow available to us and for convenience we will refer to it as __tf__ |
|__from tensorflow import keras__ |TensorFlow is quite a low-level machine learning library which, while powerful and flexible, can be confusing, so instead we use a higher-level framework called Keras to make our machine learning models more readable and easier to build and test. This _import_ statement makes the Keras framework available to us.|
|__import numpy as np__ |NumPy is a Python library for scientific computing and is commonly used for machine learning. This _import_ statement makes NumPy available to us and for convenience we will refer to it as __np__.|
|__import matplotlib.pyplot as plt__ |To visualise what is happening in our network we will use a set of graphs. Matplotlib is the standard Python library for producing graphs, so we __import__ this to enable us to make pretty graphs.|
|__%matplotlib inline__|This is a Jupyter Notebook __magic__ command that tells the workbook to produce any graphs as part of the workbook and not as pop-up windows.|

In [6]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
from tensorflow import keras

import numpy as np
import matplotlib.pyplot as plt

print(tf.__version__)
%matplotlib inline

2.0.0-beta1


## Helper functions
The following cell contains a set of helper functions that make our models a little clearer. We will not be going through these functions (since they require Python knowledge), so just make sure you have run this cell.

In [7]:
def printLossAndAccuracy(history):
  import matplotlib.pyplot as plt
  history_dict = history.history

  acc = history_dict['acc']
  val_acc = history_dict['val_acc']
  loss = history_dict['loss']
  val_loss = history_dict['val_loss']

  epochs = range(1, len(acc) + 1)

  # "bo" is for "blue dot"
  plt.plot(epochs, loss, 'bo', label='Training loss')
  # b is for "solid blue line"
  plt.plot(epochs, val_loss, 'b', label='Validation loss')
  plt.title('Training and validation loss')
  plt.xlabel('Epochs')
  plt.ylabel('Loss')
  plt.legend()
  
  plt.show()

  plt.clf()   # clear figure

  plt.plot(epochs, acc, 'bo', label='Training acc')
  plt.plot(epochs, val_acc, 'b', label='Validation acc')
  plt.title('Training and validation accuracy')
  plt.xlabel('Epochs')
  plt.ylabel('Accuracy')
  plt.legend()


  plt.show()

def predictStarRating(review, model, word_index):
  encoded_review = [encodeReview(review, word_index)]
  # Pad and truncate in the same way as the training data
  sequence = keras.preprocessing.sequence.pad_sequences(encoded_review,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        truncating='post',
                                                        maxlen=256)
  # model.predict returns an array of shape (1, 1); take the single score
  prediction = model.predict(sequence)[0][0]
  if prediction >= 0.9:
    return "5 Stars"
  if prediction >= 0.7:
    return "4 Stars"
  if prediction >= 0.5:
    return "3 Stars"
  if prediction >= 0.3:
    return "2 Stars"
  return "1 Star"
      

def getWordIndex(corpus):
  word_index = corpus.get_word_index()

  # The first indices are reserved
  word_index = {k:(v+3) for k,v in word_index.items()}
  word_index["<PAD>"] = 0
  word_index["<START>"] = 1
  word_index["<UNK>"] = 2  # unknown
  word_index["<UNUSED>"] = 3

  reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])  

  return word_index, reverse_word_index

def decodeReview(text, reverse_word_index):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

def encodeReview(text, word_index):
  remove_words = ["!", "$", "%", "&", "*", ".", "?", "<", ">", ","]  
  for word in remove_words:
      text = text.replace(word, "")
  
  return [word_index[token] if token in word_index else 2 for token in text.lower().split(" ")]
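
One thing worth being aware of: the dataset is loaded with `num_words=10000`, while `get_word_index()` returns the index for the full IMDB vocabulary, so `encodeReview` can produce indices that are larger than the vocabulary the Embedding layer was built for. The following is a minimal sketch of one way to guard against that. `encode_review_capped`, its `vocab_size` and `unk_index` parameters, and the toy word index are all hypothetical and are not part of the workbook's helpers.

```python
def encode_review_capped(text, word_index, vocab_size=10000, unk_index=2):
    """Hypothetical variant of encodeReview that never emits an index >= vocab_size."""
    for symbol in ["!", "$", "%", "&", "*", ".", "?", "<", ">", ","]:
        text = text.replace(symbol, "")
    encoded = []
    for token in text.lower().split():
        # Unknown words and words outside the top vocab_size both become <UNK>
        index = word_index.get(token, unk_index)
        encoded.append(index if index < vocab_size else unk_index)
    return encoded

# Toy word index for illustration only (not the real IMDB index).
toy_index = {"<PAD>": 0, "<START>": 1, "<UNK>": 2,
             "great": 87, "film": 22, "sesquipedalian": 54321}

print(encode_review_capped("Great, sesquipedalian film!", toy_index))
```

Here the rare word is mapped to the `<UNK>` index (2) instead of an index the Embedding layer has no row for.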


# Load the dataset

In [8]:
# Load the data
imdb = keras.datasets.imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))

# A dictionary mapping words to an integer index
word_index, reverse_word_index = getWordIndex(imdb)

Training entries: 25000, labels: 25000


## Preprocessing Textual Data
As before, we will pad our sequences so that they all have a standard length.

In [9]:
# We will limit our Reviews to the first 256 words and truncate any reviews that are longer than this
max_words = 256

# We will pad our Training  and Testing reviews so they are all 256 words long and truncate any that are longer
# We will add any padding to the end of the sentence.
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        truncating='post',
                                                        maxlen=max_words)

test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                        truncating='post',
                                                       maxlen=max_words)
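
To make the padding step above concrete, here is a plain-Python sketch of what `padding='post'` together with `truncating='post'` does to a single review. `pad_post` is a hypothetical helper written for illustration; the workbook itself relies on Keras' `pad_sequences`.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Keep the first maxlen items, then pad at the end up to maxlen."""
    trimmed = list(sequence[:maxlen])
    return trimmed + [pad_value] * (maxlen - len(trimmed))

short_review = [1, 14, 22]
long_review = [1, 14, 22, 16, 43, 530, 973]

print(pad_post(short_review, maxlen=5))  # [1, 14, 22, 0, 0]
print(pad_post(long_review, maxlen=5))   # [1, 14, 22, 16, 43]
```

Short reviews gain `<PAD>` values (index 0) at the end, and long reviews lose everything after the first `maxlen` words.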

# Define your Model
## Using 1D Convolutions
In the Image Classification lessons we introduced the idea of Convolutions and Pooling (a CNN) to improve image classification. While Convolutions were originally conceived for Image Processing, we can use the same idea with textual data (it's all numbers after all!).

For images we used `Conv2D()` to define a Convolutional layer that would scan a 2-D image with a 2-D filter. Our text sentences only have one dimension (length), so we use a 1-D Convolutional layer. Keras provides this as `Conv1D()`, where rather than providing a 2-D _kernel_ we specify a 1-D _kernel_: a single `kernel_size` giving the number of consecutive words each filter looks at.

Aside from these small differences it's very similar to image processing with a CNN.
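
As a rough illustration of what a single 1-D filter does, the sketch below slides one filter over a short sequence of word vectors in plain Python. It assumes an odd `kernel_size`, zero padding so the output is as long as the input (like `padding="same"`), and a ReLU activation. `conv1d_same` and the toy numbers are hypothetical; Keras learns the filter weights during training rather than having them written by hand.

```python
def conv1d_same(sequence, kernel, bias=0.0):
    """Slide one filter over a sequence of vectors (zero padded) and apply ReLU."""
    kernel_size = len(kernel)          # number of neighbouring words per window
    channels = len(sequence[0])        # size of each word vector
    pad = kernel_size // 2             # assumes an odd kernel_size
    zero = [0.0] * channels
    padded = [zero] * pad + list(sequence) + [zero] * pad

    outputs = []
    for start in range(len(sequence)):
        window = padded[start:start + kernel_size]
        total = bias
        for vector, weights in zip(window, kernel):
            total += sum(v * w for v, w in zip(vector, weights))
        outputs.append(max(0.0, total))  # ReLU
    return outputs

# Four "words", each represented by a 2-number embedding (toy values).
sequence = [[1, 0], [0, 1], [1, 1], [0, 0]]
# One filter with kernel_size=3: weights for the previous, current and next word.
kernel = [[1, 0], [0, 1], [-1, 0]]

print(conv1d_same(sequence, kernel))  # [0.0, 1.0, 1.0, 1.0]
```

Each output value depends on a word and its immediate neighbours, which is how the layer picks up on short word patterns rather than single words.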

Let's build one and see if we can improve on our previous model.
## Exercise
As before we have provided a skeleton model that consists of the Input/Embedding layer and the Output Layer. We have also provided you with a single hidden layer showing an example of a 1D Convolutional network.

Review the model definition and in your teams decide on the structure of the models you want to train and evaluate. Then each of you implement one of these designs.

Options for your hidden layers include:
- Having a single Convolution layer (and Pooling) followed by one or more Dense layers
- Having a sequence of Convolution (and Pooling) layers connected to the output layer
- Having a sequence of Convolution (and Pooling) layers followed by one or more Dense layers

At each layer you should consider and specify the number of filters/units and the kernel and pool sizes.
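
When comparing designs it can help to work out how big the data is by the time it reaches the `Flatten` layer. The sketch below does that arithmetic, assuming every block is a `Conv1D` with `padding="same"` (which keeps the sequence length) followed by a `MaxPooling1D` with its default stride equal to `pool_size`. `flattened_size` is a hypothetical helper, not part of Keras.

```python
def flattened_size(seq_len, blocks):
    """blocks: list of (filters, pool_size) pairs, one per Conv1D + MaxPooling1D block."""
    channels = None
    for filters, pool_size in blocks:
        # Conv1D with padding="same" leaves seq_len unchanged;
        # pooling then shrinks it by a factor of pool_size (rounded down).
        seq_len = (seq_len - pool_size) // pool_size + 1
        channels = filters
    return seq_len, channels, seq_len * channels

print(flattened_size(256, [(32, 2)]))            # (128, 32, 4096)
print(flattened_size(256, [(32, 2), (64, 2)]))   # (64, 64, 4096)
print(flattened_size(256, [(32, 4), (64, 4)]))   # (16, 64, 1024)
```

The last number in each tuple is the number of inputs the first Dense layer after `Flatten` will receive.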

In [10]:
# input shape is the vocabulary count used for the movie reviews (10,000 words)
vocab_size = 10000

# Classification using CNN
model = keras.Sequential()

# Input Layer
model.add(keras.layers.Embedding(input_dim = vocab_size, 
                                     output_dim=32, input_length=max_words))

# YOUR CHANGES START HERE

# Hidden Layers - CNN
# TODO: Decide on how many Filters you want to use for each of the Convolutional Layers
#       If you want to try additional Convolutional layers then copy the lines below to add additional layers
model.add(keras.layers.Conv1D(filters=32, kernel_size=3, padding="same", activation="relu"))
model.add(keras.layers.MaxPooling1D(pool_size=2))



# YOUR CHANGES END HERE 

# Output Layer
model.add(tf.keras.layers.Flatten())
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])

# Print the Summary of the model
model.summary()


# Train the Model
We are now in a position to train our model against our data. We have set the number of epochs to 50 with the option to stop early if the validation loss stagnates.

In [None]:
# We'll train for some epochs
epochs = 50

# Stop early if our Validation Loss stagnates
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)

# Train our model
cnn_history = model.fit(train_data,
                    train_labels,
                    epochs=epochs,
                    batch_size=512,
                    validation_split = 0.2,
                    verbose=2,
                   callbacks=[early_stop])
print("Training Done")

## Evaluate the model
We will now evaluate our model against a set of previously unseen reviews.

In [None]:
val_loss, val_acc = model.evaluate(test_data, test_labels)

print ("Test Loss:", val_loss)
print ("Test Accuracy:", val_acc)

In [None]:
printLossAndAccuracy(cnn_history)

### Exercise
We have a method that will predict a star rating for a given review. A really positive review will be predicted as 5 Stars, whereas a really negative review will be predicted as 1 Star.

Try some sample reviews by changing the value of _my_review_ and see if the predicted Star rating matches your expectations.

Given this system and the specific requirement to estimate a Star rating, what are the risks you should test for?

Work in your teams and:
- List out some of the risks you would want to test for
- Try some of your ideas out and record any issues you find

In [None]:
# Try a review out
my_review = "ok directing but good casting."

print(my_review)
print("\nPredicted Star Rating (1-5 Stars) is {}".format(predictStarRating(my_review, 
                                                         model, word_index)))
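
One place to start when thinking about risks is the mapping from the model's score to a star rating. The sketch below repeats the thresholds used in `predictStarRating` on a plain number so the boundaries can be checked without a trained model. `stars_from_score` is a hypothetical name introduced for this illustration.

```python
def stars_from_score(score):
    """Same thresholds as predictStarRating, applied to a plain float in [0, 1]."""
    if score >= 0.9:
        return "5 Stars"
    if score >= 0.7:
        return "4 Stars"
    if score >= 0.5:
        return "3 Stars"
    if score >= 0.3:
        return "2 Stars"
    return "1 Star"

# Check values on and either side of each boundary
for score in [0.0, 0.29, 0.3, 0.5, 0.69, 0.7, 0.9, 1.0]:
    print(score, "->", stars_from_score(score))
```

Note that the bands are not the same width: scores from 0 up to 0.3 all map to 1 Star, while only scores of 0.9 and above map to 5 Stars.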

# Optional Exercise 1
When we built our model some design decisions were made for you; these included:
- The sequence length of the inputs
- The Vocabulary size for the Embedding layer
- The size of the output of the Embedding layer

In the next cell you have the option to change these values and re-train the model to see if you can improve the model by tweaking these parameters.

Discuss in your teams what values you want to try, then implement, train and evaluate the models.

In [12]:
# How long do you want the review sequences to be
max_words = 256

# This determines the top set of words we will train with
vocab_size = 10000

# This determines the size of the vector the Embedding layer produces for each word
embedding_output = 32

## Get a fresh set of data

In [13]:
# Get a fresh set of data
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=vocab_size)
print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))

# A dictionary mapping words to an integer index
word_index, reverse_word_index = getWordIndex(imdb)

Training entries: 25000, labels: 25000


## Sequence Pre-processing

In [14]:
# We will pad our Training and Testing reviews so they are all max_words long and truncate any that are longer
# We will add any padding to the end of the sentence.
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        truncating='post',
                                                        maxlen=max_words)

test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                        truncating='post',
                                                       maxlen=max_words)

## Define our model

In [17]:
# Classification using CNN
model = keras.Sequential()

# Input Layer
model.add(keras.layers.Embedding(input_dim = vocab_size, 
                                     output_dim=embedding_output, input_length=max_words))

# YOUR CHANGES START HERE

# Hidden Layers - CNN
# TODO: Decide on how many Filters you want to use for each of the Convolutional Layers
#       If you want to try additional Convolutional layers then copy the lines below to add additional layers
model.add(keras.layers.Conv1D(filters=32, kernel_size=3, padding="same", activation="relu"))
model.add(keras.layers.MaxPooling1D(pool_size=2))



# YOUR CHANGES END HERE 

# Output Layer
model.add(tf.keras.layers.Flatten())
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])

# Print the Summary of the model
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 256, 128)          1280000   
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 256, 32)           12320     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 128, 32)           0         
_________________________________________________________________
flatten (Flatten)            (None, 4096)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 4097      
Total params: 1,296,417
Trainable params: 1,296,417
Non-trainable params: 0
_________________________________________________________________
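
The parameter counts in a summary like the one above can be reproduced by hand. The sketch below uses the standard formulas for Embedding, Conv1D and Dense layers; the function names are hypothetical helpers. The figures shown above are consistent with an embedding output size of 128, which suggests the summary was captured from a run where `embedding_output` had been changed from 32 to 128.

```python
def embedding_params(vocab_size, embedding_dim):
    # One vector of length embedding_dim per word in the vocabulary
    return vocab_size * embedding_dim

def conv1d_params(kernel_size, in_channels, filters):
    # Each filter has kernel_size * in_channels weights plus one bias
    return kernel_size * in_channels * filters + filters

def dense_params(inputs, units):
    # One weight per input per unit, plus one bias per unit
    return inputs * units + units

def total_params(embedding_dim, vocab_size=10000, max_words=256,
                 filters=32, kernel_size=3, pool_size=2):
    # Pooling divides the sequence length; Flatten multiplies by the filter count
    flattened = (max_words // pool_size) * filters
    return (embedding_params(vocab_size, embedding_dim)
            + conv1d_params(kernel_size, embedding_dim, filters)
            + dense_params(flattened, 1))

print(conv1d_params(3, 128, 32))   # 12320
print(total_params(128))           # 1296417
print(total_params(32))            # 327201
```

Most of the parameters sit in the Embedding layer, so the vocabulary size and embedding size dominate the size of the model.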


## Train the new model

In [18]:
# We'll train for some epochs
epochs = 50

# Stop early if our Validation Loss stagnates
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)

# Train our model
cnn_history = model.fit(train_data,
                    train_labels,
                    epochs=epochs,
                    batch_size=512,
                    validation_split = 0.2,
                    verbose=2,
                   callbacks=[early_stop])
print("Training Done")

Train on 20000 samples, validate on 5000 samples
Epoch 1/50
20000/20000 - 19s - loss: 0.6858 - acc: 0.5556 - val_loss: 0.6495 - val_acc: 0.6480
Epoch 2/50
20000/20000 - 20s - loss: 0.4711 - acc: 0.8045 - val_loss: 0.3355 - val_acc: 0.8622
Epoch 3/50
20000/20000 - 20s - loss: 0.2390 - acc: 0.9060 - val_loss: 0.3061 - val_acc: 0.8780
Epoch 4/50
20000/20000 - 17s - loss: 0.1638 - acc: 0.9414 - val_loss: 0.3143 - val_acc: 0.8794
Epoch 5/50
20000/20000 - 20s - loss: 0.1205 - acc: 0.9614 - val_loss: 0.3541 - val_acc: 0.8726
Epoch 6/50
20000/20000 - 20s - loss: 0.0879 - acc: 0.9753 - val_loss: 0.3851 - val_acc: 0.8732
Training Done
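
Training stopped after epoch 6 even though `epochs` was set to 50. The sketch below mimics the patience rule with the validation losses from the log above, assuming the default behaviour where any decrease in `val_loss` counts as an improvement. `early_stop_epoch` is a hypothetical helper, not the Keras callback itself.

```python
def early_stop_epoch(val_losses, patience):
    """Return the epoch at which training would stop, or None if it never stops early."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# val_loss values taken from the training log above
val_losses = [0.6495, 0.3355, 0.3061, 0.3143, 0.3541, 0.3851]

print(early_stop_epoch(val_losses, patience=3))  # 6
```

The best validation loss was at epoch 3; epochs 4, 5 and 6 did not improve on it, so with `patience=3` training ends at epoch 6.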


## Evaluate the accuracy

In [None]:
val_loss, val_acc = model.evaluate(test_data, test_labels)

print ("Test Loss:", val_loss)
print ("Test Accuracy:", val_acc)