# Text Classification using Pre-trained Embedding Layers
When we train our networks, we learn our embeddings from scratch each time, and the quality of these embeddings is limited by our training data and the time we spend training.

In recent years, there has been a move towards using pre-trained layers, and we saw an example of this when we used a pre-trained base model for image classification.

For text we can leverage some pre-trained embedding layers; these are layers that have previously been trained on large amounts of data and for a long time.

In this workbook we will use a pre-trained embedding layer to improve our model.
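
To give a feel for what an embedding provides, here is a minimal sketch that compares word vectors using cosine similarity. The three-dimensional vectors and the `cosine_similarity` helper are invented for illustration and are not taken from GloVe; the real GloVe 6B vectors have between 50 and 300 dimensions.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Real pre-trained vectors are learned from a large corpus.
toy_embeddings = {
    "good":  [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "awful": [-0.7, 0.1, 0.2],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Words with similar meaning should point in a similar direction
print(cosine_similarity(toy_embeddings["good"], toy_embeddings["great"]))
print(cosine_similarity(toy_embeddings["good"], toy_embeddings["awful"]))
```

In this toy data the first similarity is close to 1 and the second is negative, which is the kind of structure a well-trained embedding is expected to capture.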

# Importing some packages
We are using the Python programming language and a set of machine learning packages, and importing packages for use is a common task. For this workshop you don't need to pay much attention to this step (but you do need to execute the cell) since we are focusing on building models. However, the following is a description of what this cell does that you can read if you are interested.

### Description of imports (Optional)
You don't need to worry about this code as it is not the focus of the workshop, but if you are interested in what the next cell does, here is an explanation.

|Statement|Meaning|
|---|---|
|__import tensorflow as tf__ |TensorFlow (from Google) is our main machine learning library. It performs all of the various calculations for us and so hides much of the detailed complexity of machine learning. This _import_ statement makes the power of TensorFlow available to us, and for convenience we will refer to it as __tf__. |
|__from tensorflow import keras__ |TensorFlow is quite a low-level machine learning library which, while powerful and flexible, can be confusing, so instead we use a higher-level framework called Keras to make our machine learning models more readable and easier to build and test. This _import_ statement makes the Keras framework available to us.|
|__import numpy as np__ |NumPy is a Python library for scientific computing and is commonly used for machine learning. This _import_ statement makes NumPy available to us, and for convenience we will refer to it as __np__.|
|__import matplotlib.pyplot as plt__ |To visualise what is happening in our network we will use a set of graphs. Matplotlib is the standard Python library for producing graphs, so we _import_ it to enable us to make pretty graphs.|
|__%matplotlib inline__|This is a Jupyter Notebook __magic__ command that tells the workbook to produce any graphs as part of the workbook and not as a pop-up window.|

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
from tensorflow import keras

import numpy as np
import os

print(tf.__version__)
%matplotlib inline

In [None]:
# download pretrained GloVe embeddings
!wget http://nlp.stanford.edu/data/glove.6B.zip

In [None]:
# Unzip the file
!unzip glove.6B.zip

## Helper functions
The following cell contains a set of helper functions that make our models a little clearer. We will not be going through these functions (since they require Python knowledge), so just make sure you have run this cell.

In [None]:
MAX_NUM_WORDS = 20000
EMBEDDING_DIM = 100
MAX_SEQUENCE_LENGTH = 100

def printLossAndAccuracy(history):
  import matplotlib.pyplot as plt
  history_dict = history.history

  acc = history_dict['acc']
  val_acc = history_dict['val_acc']
  loss = history_dict['loss']
  val_loss = history_dict['val_loss']

  epochs = range(1, len(acc) + 1)

  # "bo" is for "blue dot"
  plt.plot(epochs, loss, 'bo', label='Training loss')
  # b is for "solid blue line"
  plt.plot(epochs, val_loss, 'b', label='Validation loss')
  plt.title('Training and validation loss')
  plt.xlabel('Epochs')
  plt.ylabel('Loss')
  plt.legend()
  
  plt.show()

  plt.clf()   # clear figure

  plt.plot(epochs, acc, 'bo', label='Training acc')
  plt.plot(epochs, val_acc, 'b', label='Validation acc')
  plt.title('Training and validation accuracy')
  plt.xlabel('Epochs')
  plt.ylabel('Accuracy')
  plt.legend()


  plt.show()

def predictStarRating(review, model, word_index):
    encoded_review = [encodeReview(review, word_index)]
    # Pad and truncate the same way as the training data
    sequence = keras.preprocessing.sequence.pad_sequences(encoded_review,
                                                          value=word_index["<PAD>"],
                                                          padding='post',
                                                          truncating='post',
                                                          maxlen=MAX_SEQUENCE_LENGTH)
    # model.predict returns an array of shape (1, 1); take the single score
    prediction = model.predict(sequence)[0][0]
    if prediction >= 0.9:
        return "5 Stars"
    if prediction >= 0.7:
        return "4 Stars"
    if prediction >= 0.5:
        return "3 Stars"
    if prediction >= 0.3:
        return "2 Stars"
    return "1 Star"
      

def getWordIndex(corpus):
  word_index = corpus.get_word_index()

  # The first indices are reserved
  word_index = {k:(v+3) for k,v in word_index.items()}
  word_index["<PAD>"] = 0
  word_index["<START>"] = 1
  word_index["<UNK>"] = 2  # unknown
  word_index["<UNUSED>"] = 3

  reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])  

  return word_index, reverse_word_index

def decodeReview(text, reverse_word_index):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

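def encodeReview(text, word_index):
  remove_words = ["!", "$", "%", "&", "*", ".", "?", "<", ">", ","]  
  for word in remove_words:
      text = text.replace(word, "")
  
  # Map words that are unknown, or whose index falls outside the embedding
  # matrix (MAX_NUM_WORDS rows), to the <UNK> index 2
  return [word_index[token] if word_index.get(token, MAX_NUM_WORDS) < MAX_NUM_WORDS else 2
          for token in text.lower().split(" ")]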

def load_glove_file_embeddings():
    # Extract the Embeddings
    glove_file = os.path.join('.', 'glove.6B.100d.txt')
    embeddings_index = {}
    # Each line holds a word followed by the values of its vector
    with open(glove_file, encoding="utf8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index

def get_embedding_layer():
    embeddings_index = load_glove_file_embeddings()
    
    # prepare embedding matrix
    num_words = min(MAX_NUM_WORDS, len(word_index) + 1)
    embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
    for word, i in word_index.items():
        if i >= MAX_NUM_WORDS:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector

    # load pre-trained word embeddings into an Embedding layer
    # note that we set trainable = False so as to keep the embeddings fixed
    embedding_layer = keras.layers.Embedding(num_words,
                                EMBEDDING_DIM,
                                embeddings_initializer=keras.initializers.Constant(embedding_matrix),
                                input_length=MAX_SEQUENCE_LENGTH,
                                trainable=False)
    return embedding_layer

# Load the dataset

In [None]:
# Load the data
imdb = keras.datasets.imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))

# A dictionary mapping words to an integer index
word_index, reverse_word_index = getWordIndex(imdb)
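
The `getWordIndex` helper shifts every index by 3 so that the first indices can be reserved for special tokens. The sketch below shows the same idea on a hypothetical three-word vocabulary; `raw_index`, `word_index_toy`, `encode` and `decode` are illustrative names and not part of the notebook or of Keras.

```python
# Hypothetical raw index, standing in for imdb.get_word_index()
raw_index = {"the": 1, "movie": 2, "great": 3}

# Shift by 3 so that indices 0 to 3 are free for the reserved tokens
word_index_toy = {word: idx + 3 for word, idx in raw_index.items()}
word_index_toy["<PAD>"] = 0
word_index_toy["<START>"] = 1
word_index_toy["<UNK>"] = 2
word_index_toy["<UNUSED>"] = 3

reverse_toy = {idx: word for word, idx in word_index_toy.items()}

def encode(text):
    # Words missing from the index map to <UNK>, as encodeReview does
    unk = word_index_toy["<UNK>"]
    return [word_index_toy.get(token, unk) for token in text.lower().split()]

def decode(ids):
    return " ".join(reverse_toy.get(i, "?") for i in ids)

encoded = encode("The movie was great")
print(encoded)          # [4, 5, 2, 6]
print(decode(encoded))  # the movie <UNK> great
```

Note that "was" is not in the toy vocabulary, so it is encoded as the `<UNK>` index and cannot be recovered when decoding.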

## Preprocessing Textual Data
As before, we will pad our sequences to a standard length.
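
As a rough illustration of what padding and truncating at the end of a sequence means, here is a small pure-Python sketch for a single sequence. The `pad_post` function is a hypothetical stand-in written for this example; it is not the Keras implementation, which works on a whole batch at once.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Keep the first maxlen items, then pad at the end with pad_value."""
    trimmed = list(sequence[:maxlen])
    return trimmed + [pad_value] * (maxlen - len(trimmed))

print(pad_post([5, 6, 7], 5))           # [5, 6, 7, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6], 5))  # [1, 2, 3, 4, 5]
```

Short reviews gain padding tokens at the end, while long reviews lose everything after the first `maxlen` words.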

In [None]:
# We will pad our training and testing reviews so they are all MAX_SEQUENCE_LENGTH (100) words long and truncate any that are longer
# We will add any padding to the end of the sentence.
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        truncating='post',
                                                        maxlen=MAX_SEQUENCE_LENGTH)

test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       truncating='post',
                                                       maxlen=MAX_SEQUENCE_LENGTH)

# Define your Model
Before we can build our model we need to create our Embedding layer. This is not as straightforward as it was for image classification, although this will probably change in the future.

To support the creation of this pre-trained layer we have created some functions in the Helper Functions section - if you want you can review these functions but it is optional.
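
The core idea behind those helper functions can be sketched without TensorFlow: read each word and its vector from a GloVe-style text file, then copy the vectors into a matrix with one row per word index. The file contents, the word index and all `toy_` names below are made up for illustration, and plain lists are used instead of NumPy arrays to keep the sketch dependency-free.

```python
import io

# Three made-up lines in the GloVe text format: a word followed by its vector.
# The real glove.6B.100d.txt file has 100 numbers per line.
fake_glove_file = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "movie 0.4 0.5 0.6\n"
    "great 0.7 0.8 0.9\n"
)

toy_dim = 3
toy_vectors = {}
for line in fake_glove_file:
    word, *numbers = line.split()
    toy_vectors[word] = [float(n) for n in numbers]

# Hypothetical word index; "zzyzx" has no pre-trained vector
toy_word_index = {"<PAD>": 0, "the": 1, "movie": 2, "zzyzx": 3}

# One row per index; a row stays all-zero when the word is missing from the file
toy_matrix = [[0.0] * toy_dim for _ in range(len(toy_word_index))]
for word, i in toy_word_index.items():
    if word in toy_vectors:
        toy_matrix[i] = toy_vectors[word]

for row in toy_matrix:
    print(row)
```

In the notebook the resulting matrix is passed to the Embedding layer as its initial weights and the layer is marked as not trainable, so the pre-trained vectors stay fixed during training.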

## Using a CNN or RNN layer

## Exercise
As before we have provided a skeleton model that consists of the Input/Embedding layer and the Output Layer. We have also provided you with a single hidden layer showing an example of a 1D Convolutional network.

Review the model definition and in your teams decide on the structure of the models you want to train and evaluate. Then each of you implement one of these designs.

Options for your hidden layers include:
- Having a single Convolution layer (and Pooling) followed by one or more Dense layers
- Having a sequence of Convolution (and Pooling) layers connected to the output layer
- Having a sequence of Convolution (and Pooling) layers followed by one or more Dense layers

At each layer you should consider and specify the number of filters/units, and the kernel and pool sizes.
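
When choosing filters, kernel sizes and pool sizes, it can help to work out the resulting shapes and parameter counts by hand. The sketch below does this arithmetic for a single Conv1D layer (stride 1) followed by MaxPooling1D with its default stride, using the sequence length and embedding size from this workbook. The two helper functions are hypothetical and written only for this example.

```python
def conv1d_output_length(length, kernel_size, padding="same"):
    """Output length of a stride-1 Conv1D layer."""
    if padding == "same":
        return length
    return length - kernel_size + 1  # "valid" padding

def maxpool1d_output_length(length, pool_size):
    """Output length of MaxPooling1D when the stride equals pool_size."""
    return (length - pool_size) // pool_size + 1

sequence_length = 100   # MAX_SEQUENCE_LENGTH
embedding_dim = 100     # EMBEDDING_DIM
filters, kernel_size, pool_size = 32, 3, 2

after_conv = conv1d_output_length(sequence_length, kernel_size, "same")
after_pool = maxpool1d_output_length(after_conv, pool_size)
flattened = after_pool * filters

# Each filter has kernel_size * embedding_dim weights plus one bias
conv_params = kernel_size * embedding_dim * filters + filters
# A single-unit Dense layer has one weight per input plus one bias
dense_params = flattened + 1

print(after_conv, after_pool, flattened)  # 100 50 1600
print(conv_params, dense_params)          # 9632 1601
```

You can compare these numbers with the output of `model.summary()` for your own design to check that the layers are connected the way you intended.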

In [None]:
# input shape is the vocabulary count used for the movie reviews (10,000 words)
vocab_size = 10000

embedding_layer = get_embedding_layer()
# Classification using CNN
model = keras.Sequential()

# Input Layer
model.add(keras.layers.Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32'))
model.add(embedding_layer)


# Hidden Layers
# TODO: Create either a CNN or LSTM based hidden layer
#model.add(keras.layers.LSTM(units = 50))
model.add(keras.layers.Conv1D(filters=32, kernel_size=3, padding="same", activation="relu"))
model.add(keras.layers.MaxPooling1D(pool_size=2))

# Output Layer
model.add(tf.keras.layers.Flatten())
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])

# Print the Summary of the model
model.summary()

# Train the Model
We are now in a position to train our model against our data. We have set the number of epochs to 50 with the option to stop early if the validation loss stagnates.
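
To make the early stopping rule concrete, here is a simplified pure-Python sketch of the patience logic: training stops once the validation loss has failed to improve for `patience` epochs in a row. This is an approximation written for illustration, not the Keras implementation, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None if it never stops early."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation losses: the best value so far is reached at epoch 3
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.48, 0.40]
print(early_stopping_epoch(losses))  # 6
```

In this example training stops at epoch 6 and never reaches the lower loss at epoch 7, which is one of the trade-offs of a small patience value. Keras's `EarlyStopping` also accepts `restore_best_weights=True` if you want the model rolled back to its best epoch.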

In [None]:
# We'll train for some epochs
epochs = 50

# Stop early if our Validation Loss stagnates
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)

# Train our model
history = model.fit(train_data,
                    train_labels,
                    epochs=epochs,
                    batch_size=512,
                    validation_split = 0.2,
                    verbose=2,
                    callbacks=[early_stop])
print("Training Done")

## Evaluate the model
We will now evaluate our model against a set of previously unseen reviews.

In [None]:
# These are results on the held-out test set, not the validation split
test_loss, test_acc = model.evaluate(test_data, test_labels)

print("Test Loss:", test_loss)
print("Test Accuracy:", test_acc)

In [None]:
# Plot the training curves returned by model.fit above
printLossAndAccuracy(history)

### Exercise
We have a method that will predict a star rating for a given review. A really positive review will be predicted as 5 Stars, whereas a really negative review will be predicted as 1 Star.

Try some sample reviews by changing the value of _my_review_ and see if the predicted Star rating matches your expectations.

Given this system and the specific requirement to estimate a Star rating, what are the risks you should test for?

Work in your teams and:
- List out some of the risks you would want to test for
- Try out some of your ideas and record any issues you find
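
One place to start is the mapping from the model's score to a star rating. The sketch below copies the cut-offs used by `predictStarRating` into a standalone function, named `score_to_stars` here purely for illustration, so the boundaries can be probed without running the model.

```python
def score_to_stars(score):
    """Map a sigmoid output in [0, 1] to a star label using the predictStarRating cut-offs."""
    if score >= 0.9:
        return "5 Stars"
    if score >= 0.7:
        return "4 Stars"
    if score >= 0.5:
        return "3 Stars"
    if score >= 0.3:
        return "2 Stars"
    return "1 Star"

# Probe values on and just below each boundary
for score in [0.9, 0.89, 0.7, 0.69, 0.5, 0.49, 0.3, 0.29]:
    print(score, "->", score_to_stars(score))
```

A small change in the score near a boundary flips the rating, and the model was trained only on positive and negative labels rather than on star ratings, so both are worth considering when you list your risks.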

In [None]:
# Try a review out
my_review = "ok directing but good casting."

print(my_review)
print("\nPredicted Star Rating (1-5 Stars) is {}".format(predictStarRating(my_review, 
                                                         model, word_index)))