# Lesson 5 - Text Classification
In the previous two lessons we saw how a machine can take a digital image and learn to classify it. In this lesson we will look at how a machine can learn from textual data to perform some form of analysis and classification.

The new concepts in this lesson are:
- Learning from Text
- Word Embeddings
- Sentiment Analysis


## Importing some packages
We are using the Python programming language and a set of Machine Learning packages. Importing packages for use is a common task. For this workshop you don't need to pay much attention to this step (but you do need to execute the cell), since we are focusing on building models. However, the following is a description of what the cell does, which you can read if you are interested.

### Description of imports (Optional)
You don't need to worry about this code, as it is not the focus of the workshop, but if you are interested in what the next cell does, here is an explanation.
- __import tensorflow as tf__ - TensorFlow (from Google) is our main machine learning library. It performs all of the various calculations for us and so hides much of the detailed complexity in Machine Learning. This _import_ statement makes the power of TensorFlow available to us and for convenience we will refer to it as __tf__
- __from tensorflow import keras__ - TensorFlow is quite a low-level machine learning library which, while powerful and flexible, can be confusing, so instead we use a higher-level framework called Keras to make our machine learning models more readable and easier to build and test. This _import_ statement makes the Keras framework available to us.
- __import numpy as np__ - NumPy is a Python library for scientific computing and is commonly used for machine learning. This _import_ statement makes NumPy available to us and for convenience we will refer to it as __np__
- __import matplotlib.pyplot as plt__ - To visualise what is happening in our network we will use a set of graphs. Matplotlib is the standard Python library for producing graphs, so we __import__ this to enable us to make pretty graphs.
- __%matplotlib inline__ - this is a Jupyter Notebook __magic__ command that tells the workbook to produce any graphs as part of the workbook and not as a pop-up window.
- __import lesson5__ - a helper module that accompanies this lesson. It provides the functions we call later, such as `getWordIndex`, `decodeReview` and `printLossAndAccuracy`.

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
from tensorflow import keras

import numpy as np
import lesson5

print(tf.__version__)

# About Our Dataset
In this lesson we will be using a standard text dataset called the IMDB Reviews dataset. It contains a large number of reviews from the Internet Movie Database site about a range of films.

This is a standard dataset used in Machine Learning for Natural Language Processing and _Keras_ provides us with easy access to this data. Each movie review has a sentiment attached, either Positive or Negative. We want to train a classifier to estimate whether it thinks a provided review is Positive, Negative or somewhere in between.


### Discussion Point
Consider the review:

__"Overall the movie was utterly awsome although some of the scenes could be better litter. The story and cast were captivating. It's up there in my top 100 movies"__

Based on what you know about Deep Learning already, how do we make this review suitable for a learning model?

In [None]:
# Load the data
imdb = keras.datasets.imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))
# A dictionary mapping words to an integer index
word_index, reverse_word_index = lesson5.getWordIndex(imdb)

## Let's look at a review
The IMDB dataset in Keras is already pre-encoded and ready for use: each word in a review has been replaced by an integer index. So we can print out a review as the computer will work with it, and then decode it to show the human-readable review.
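
To make "pre-encoded" more concrete, here is a minimal sketch of how a sentence could be turned into a list of integers. The tiny `toy_word_index` dictionary and the `encode_review` helper are invented for illustration (they are not part of Keras or of the `lesson5` module), and the reserved entries `<PAD>`, `<START>` and `<UNK>` are assumed to follow the convention used in the TensorFlow text classification tutorial.

```python
# Toy vocabulary: a hypothetical stand-in for the real IMDB word index.
toy_word_index = {"<PAD>": 0, "<START>": 1, "<UNK>": 2,
                  "the": 4, "movie": 5, "was": 6, "great": 7}

def encode_review(text, word_index):
    """Lower-case the text, split on whitespace and look each word up.

    Words missing from the vocabulary are replaced by the <UNK> index.
    """
    tokens = text.lower().split()
    unknown = word_index["<UNK>"]
    return [word_index["<START>"]] + [word_index.get(t, unknown) for t in tokens]

encoded = encode_review("The movie was utterly great", toy_word_index)
print(encoded)  # [1, 4, 5, 6, 2, 7]
```

Notice that "utterly" is not in the toy vocabulary, so it becomes the `<UNK>` index. The same thing happens to rare words and misspellings in a real vocabulary.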

In [None]:
print("Machine Readable Review: \n{}".format(train_data[0]))

In [None]:
print("Human Readable Review: \n{}".format(lesson5.decodeReview(train_data[0], reverse_word_index)))

## Preprocessing Textual Data
In the lessons on classifying images, all our images were the same size (28x28). This wasn't a coincidence. Under the hood, Machine Learning makes heavy use of Linear Algebra and there are optimisations we can make if our inputs are all the same size and shape.

With images (which you will see in the next lesson) we re-size the images to a standard size. With textual data we want all of our movie reviews to be the same length, so we use an approach called __Padding__, which inserts some standard value (usually '0') at either the start or the end of each review to make them the same length. If a review is longer than the size we specify, it is truncated.
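
As a rough illustration of the idea, the sketch below pads short sequences at the end and truncates long ones to a fixed length. The `pad_or_truncate` helper is hypothetical and written in plain Python; it is not how Keras implements padding.

```python
def pad_or_truncate(sequence, max_len, pad_value=0):
    """Return a copy of `sequence` that is exactly `max_len` items long."""
    trimmed = sequence[:max_len]                      # keep the first max_len items
    padding = [pad_value] * (max_len - len(trimmed))  # empty if already long enough
    return trimmed + padding                          # pad at the end ("post" padding)

short_review = [1, 14, 22, 16]
long_review = [1, 14, 22, 16, 43, 530, 973, 1622]

print(pad_or_truncate(short_review, 6))  # [1, 14, 22, 16, 0, 0]
print(pad_or_truncate(long_review, 6))   # [1, 14, 22, 16, 43, 530]
```

One detail worth knowing: Keras' `pad_sequences` truncates from the start of a sequence by default (`truncating='pre'`), so it keeps the last words of a long review unless `truncating='post'` is passed.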

_Keras_ makes padding very easy using the `keras.preprocessing.sequence.pad_sequences()` function.

In [None]:
# We will limit our reviews to the first 256 words and truncate any reviews that are longer than this
max_words = 256

# We will pad our training and testing reviews so they are all 256 words long.
# padding='post' adds any padding to the end of the review.
# truncating='post' cuts words from the end of long reviews, keeping the first 256
# (the Keras default, 'pre', would keep the last 256 instead).
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        truncating='post',
                                                        maxlen=max_words)

test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       truncating='post',
                                                       maxlen=max_words)

## Defining our Model
For a basic text classifier, we can use the same layers as in the image lessons (_Dense_ and _Flatten_) to learn the relationship between a sentence and its sentiment (Positive or Negative).

However, that will not perform very well on its own, so we need a specialist layer called an __Embedding__ layer. This is a specialised layer that learns relationships between inputs.

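An Embedding layer maps each word index to a short vector of numbers (32 values per word in our model) that is learned during training, so that words used in similar ways tend to end up with similar vectors.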

So when we build our model, our _Input Layer_ is an _Embedding_ layer. The IMDB reviews contain over 88,000 distinct "words", which is probably more than we need, so we will limit our Embedding layer to the 10,000 most frequent words only.
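
The sketch below shows, in plain Python, what an embedding lookup amounts to: a table with one row of numbers per word index. The sizes and the `embed` helper are toy assumptions chosen for readability. In a real Keras `Embedding` layer the table values are adjusted during training rather than left random.

```python
import random

random.seed(0)
toy_vocab_size = 10      # toy value; the lesson uses 10,000
toy_embedding_dim = 4    # toy value; the lesson uses 32

# One row of `toy_embedding_dim` numbers per word index.
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(toy_embedding_dim)]
                   for _ in range(toy_vocab_size)]

def embed(sequence, table):
    """Replace every word index in `sequence` by its row in `table`."""
    return [table[index] for index in sequence]

review = [1, 4, 7, 0, 0]                 # an encoded, padded toy review
vectors = embed(review, embedding_table)
print(len(vectors), len(vectors[0]))     # 5 4

# Flattening joins the rows into one long list for the Dense layers.
flattened = [value for row in vectors for value in row]
print(len(flattened))                    # 20
```

The same word index always maps to the same row, which is why the two padding positions at the end of the toy review receive identical vectors.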

Our Hidden Layers can be the standard _Dense_ layer (later in this workbook we will use Convolutions but just use Dense layers for now).

We are building a model to classify our input as either positive or negative, so we only really need a single unit in our _Output Layer_.
- Why? We can use a single numerical output and treat values close to '0' as Negative reviews and values close to '1' as Positive reviews.
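
As a small illustration of the single-output idea, the sketch below applies the sigmoid function to a few made-up raw values and turns the result into a label. The `label_from_score` helper and its 0.5 cut-off are assumptions for illustration, not something the model itself defines.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def label_from_score(score, threshold=0.5):
    """Turn a sigmoid output into a label; 0.5 is an assumed cut-off."""
    return "Positive" if score >= threshold else "Negative"

for raw_output in (-4.0, 0.0, 3.0):
    score = sigmoid(raw_output)
    print("{:+.1f} -> {:.3f} -> {}".format(raw_output, score, label_from_score(score)))
```

Scores near 0.5 are the interesting ones: they are where the model is least sure which side a review belongs on.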

Let's define our model

In [None]:
# input shape is the vocabulary count used for the movie reviews (10,000 words)
vocab_size = 10000

model = keras.Sequential()
# Input Layer
model.add(keras.layers.Embedding(input_dim = vocab_size, output_dim=32, input_length=max_words))
model.add(keras.layers.Flatten())

# Hidden Layers
# YOUR CHANGES START HERE
# TODO: Decide how many layers you want and how many units on each layer
model.add(keras.layers.Dense(units=256, activation=tf.nn.relu))
# YOUR CHANGES END HERE

# Output Layer
model.add(keras.layers.Dense(units=1, activation=tf.nn.sigmoid))

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])

# Summarise the model
model.summary()
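
If you want to check the parameter counts that `model.summary()` reports, the arithmetic can be sketched as follows. This assumes the model is left at its starting configuration (one hidden Dense layer with 256 units); the numbers will differ if you change the hidden layers.

```python
vocab_size = 10000     # words kept in the vocabulary
embedding_dim = 32     # output_dim of the Embedding layer
max_words = 256        # length of every padded review
hidden_units = 256     # units in the hidden Dense layer

embedding_params = vocab_size * embedding_dim               # one vector per word
flatten_size = max_words * embedding_dim                    # Flatten has no weights
hidden_params = flatten_size * hidden_units + hidden_units  # weights + biases
output_params = hidden_units * 1 + 1                        # single sigmoid unit

total = embedding_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total)
# 320000 2097408 257 2417665
```

Most of the parameters sit in the first Dense layer, because flattening 256 words of 32 values each produces 8,192 inputs.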

In [None]:
# Stop early if our Validation Loss stagnates
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)

# We'll train for some epochs
epochs = 20

# Train our model
history = model.fit(train_data,
                    train_labels,
                    epochs=epochs,
                    batch_size=512,
                    validation_split=0.2,
                    verbose=2,
                    callbacks=[early_stop])
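
To see what `patience=3` means in practice, here is a simplified sketch of the early-stopping rule: training stops once the validation loss has failed to improve for three epochs in a row. The `epochs_before_stopping` function and the loss values are invented for illustration and leave out options that the real Keras callback supports.

```python
def epochs_before_stopping(val_losses, patience=3):
    """Return how many epochs run before training would be stopped early."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch           # stop after this epoch
    return len(val_losses)             # never triggered: all epochs run

# Hypothetical validation losses, made up for this example.
made_up_losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.48, 0.40]
print(epochs_before_stopping(made_up_losses))  # 6
```

In this made-up run the loss would have improved again at epoch 7, but training has already stopped at epoch 6. That is the trade-off a small patience value makes.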

## Evaluate our model
We will now evaluate our model against a set of previously unseen reviews

In [None]:
# These are results on the test set, so we name them accordingly
test_loss, test_acc = model.evaluate(test_data, test_labels)

print("Test Loss:", test_loss)
print("Test Accuracy:", test_acc)
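
The two numbers reported above can be reproduced by hand on a small scale. The sketch below computes binary cross-entropy (the loss we compiled the model with) and accuracy for four made-up predictions. The helper functions are illustrative and are not the Keras implementations.

```python
import math

def binary_crossentropy(labels, predictions, eps=1e-7):
    """Average binary cross-entropy; `eps` avoids taking log(0)."""
    total = 0.0
    for y, p in zip(labels, predictions):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

def accuracy(labels, predictions, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    correct = sum(1 for y, p in zip(labels, predictions)
                  if (p > threshold) == (y == 1))
    return correct / len(labels)

# Made-up labels and model outputs, for illustration only.
labels = [1, 0, 1, 0]
predictions = [0.9, 0.2, 0.4, 0.6]

print(accuracy(labels, predictions))             # 0.5
print(binary_crossentropy(labels, predictions))
```

Accuracy only counts which side of 0.5 each prediction falls on, whereas the loss also penalises predictions for being confidently wrong.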

In [None]:
lesson5.printLossAndAccuracy(history)

### Exercise
We have a method that will predict a star rating for a given review. A really positive review will be predicted as 5 Stars, whereas a really negative review will be predicted as 1 Star.

Try some sample reviews by changing the value of _my_review_ and see if the predicted Star rating matches your expectations.

Given this system and the specific requirement to estimate a Star rating, what are the risks you should test for?

Work in your teams and:
- List out some of the risks you would want to test for
- Try some of your ideas out and record any issues you find

In [None]:
# Try a review out
my_review = "ok directing but good casting."


print(my_review)
print("\nPredicted Star Rating (1-5 Stars) is {}".format(lesson5.predictStarRating(my_review, 
                                                         model, word_index)))
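
The implementation of `lesson5.predictStarRating` is not shown in this notebook. One plausible approach, and this is only an assumption rather than a description of what the helper actually does, is to split the model's 0-to-1 output into five equal bins. The `stars_from_score` function below is hypothetical.

```python
def stars_from_score(score):
    """Map a score in [0, 1] onto 1-5 stars using five equal-width bins."""
    return min(5, 1 + int(score * 5))

for score in (0.05, 0.5, 0.95):
    print(score, "->", stars_from_score(score), "stars")
```

When you think about risks, it may help to remember that the model was trained only on Positive and Negative labels, so any mapping to star ratings is an interpretation layered on top of it.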

### Exercise
Think about your current work, social or other situation and how the use of Text Classification could be used for __good__.

Work in your teams and:
- Identify possible uses for Text Classification in your context
- Think about what data you might need and where you can obtain it from
- Consider the Ethical and Social implications of doing this 

# Text Classification using CNNs (Optional)
In the Image Classification lessons we introduced the idea of Convolutions and Pooling (a CNN) to improve image classification. While Convolutions were originally conceived for Image Processing, we can use the same idea with textual data (it's all numbers after all!)

For images we used `Conv2D()` to define a Convolutional layer that scans a 2-D image with a 2-D filter. Our text sentences only have one dimension (length), so we use a 1-D Convolutional layer. Keras provides this as `Conv1D()` where, rather than providing a 2-D _Kernel_, we specify a 1-D _Kernel_.

Aside from these small differences, it's very similar to image processing with a CNN.
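
To give a feel for what a 1-D convolution followed by pooling does, here is a deliberately simplified plain-Python sketch. It works on a single list of numbers with one made-up filter and an odd kernel size. The real `Conv1D` layer slides its filters across all of the embedding values at each position and also learns a bias and applies an activation, none of which is shown here.

```python
def conv1d_same(sequence, kernel):
    """Slide `kernel` over `sequence`, zero-padding so the length is kept."""
    pad = len(kernel) // 2
    padded = [0] * pad + list(sequence) + [0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(sequence))]

def max_pool1d(sequence, pool_size=2):
    """Keep the largest value in each non-overlapping window."""
    return [max(sequence[i:i + pool_size])
            for i in range(0, len(sequence) - pool_size + 1, pool_size)]

signal = [1, 2, 3, 4]
kernel = [1, 0, -1]          # made-up filter weights

features = conv1d_same(signal, kernel)
print(features)              # [-2, -2, -2, 3]
print(max_pool1d(features))  # [-2, 3]
```

With a kernel size of 3, each output value depends on a word and its two neighbours, which is how the layer can pick up on short phrases rather than single words.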

Let's build one and see if we can improve on our previous model.

In [None]:
# Classification using CNN
cnn_model = keras.Sequential()

# Input Layer
cnn_model.add(keras.layers.Embedding(input_dim = vocab_size, 
                                     output_dim=32, input_length=max_words))

# YOUR CHANGES START HERE

# Hidden Layers - CNN
# TODO: Decide on how many Filters you want to use for each of the Convolutional Layers
#       If you want to try additional Convolutional layers then copy the line below to add additional layers
cnn_model.add(keras.layers.Conv1D(filters=64, kernel_size=3, padding="same", activation="relu"))
cnn_model.add(keras.layers.MaxPooling1D(pool_size=2))

cnn_model.add(keras.layers.Flatten())
# Hidden Dense Layers
# TODO: Decide how many units you want in the Dense Layer
cnn_model.add(keras.layers.Dense(units=256, activation=tf.nn.relu))


# YOUR CHANGES END HERE 

# Output Layer
cnn_model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

# Compile the model
cnn_model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['acc'])

# Print the Summary of the model
cnn_model.summary()
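
The shapes and parameter counts in the CNN summary can be checked with a little arithmetic. The sketch below assumes the starting configuration of this cell (one `Conv1D` layer with 64 filters and a kernel size of 3, pooling with a pool size of 2, and a 256-unit Dense layer); the numbers will change if you edit the layers.

```python
vocab_size, embedding_dim, max_words = 10000, 32, 256
filters, kernel_size, pool_size, hidden_units = 64, 3, 2, 256

embedding_params = vocab_size * embedding_dim
# Each filter looks at kernel_size positions across all embedding values, plus a bias.
conv_params = kernel_size * embedding_dim * filters + filters
# padding="same" keeps the length at max_words; pooling then divides it by pool_size.
pooled_length = max_words // pool_size
flatten_size = pooled_length * filters
hidden_params = flatten_size * hidden_units + hidden_units
output_params = hidden_units + 1

total = embedding_params + conv_params + hidden_params + output_params
print(pooled_length, flatten_size)   # 128 8192
print(conv_params, total)            # 6208 2423873
```

The convolutional layer itself is small. As in the first model, the Dense layer after `Flatten` holds most of the weights.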


In [None]:
# Stop early if our Validation Loss stagnates
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)

# We'll train for some epochs
epochs = 20

# Train our model
# Note: we fit cnn_model here, not the earlier Dense-only model
cnn_history = cnn_model.fit(train_data,
                            train_labels,
                            epochs=epochs,
                            batch_size=512,
                            validation_split=0.2,
                            verbose=2,
                            callbacks=[early_stop])
print("Training Done")

## Evaluate the model

In [None]:
# These are results on the test set, so we name them accordingly
test_loss, test_acc = cnn_model.evaluate(test_data, test_labels)

print("Test Loss:", test_loss)
print("Test Accuracy:", test_acc)

In [None]:
lesson5.printLossAndAccuracy(cnn_history)

### Exercise
We have a method that will predict a star rating for a given review. A really positive review will be predicted as 5 Stars, whereas a really negative review will be predicted as 1 Star.

Try some sample reviews by changing the value of _my_review_ and see if the predicted Star rating matches your expectations.

Given this system and the specific requirement to estimate a Star rating, what are the risks you should test for?

Work in your teams and:
- List out some of the risks you would want to test for
- Try some of your ideas out and record any issues you find

In [None]:
# Try a review out
my_review = "ok directing but good casting."


print(my_review)
print("\nPredicted Star Rating (1-5 Stars) is {}".format(lesson5.predictStarRating(my_review, 
                                                         cnn_model, word_index)))