# Sentiment analysis
In the video exercise, you were exposed to the various applications of sequence to sequence models. In this exercise you will see how to use a pre-trained model for sentiment analysis.

The model is pre-loaded in the environment in the variable model. The tokenized test set variables X_test and y_test and the pre-processed original text data sentences from IMDb are also available. You will learn how to pre-process the text data and how to create and train the model using Keras later in the course.

You will use the pre-trained model to obtain predictions of sentiment. The model returns a number between zero and one representing the probability that the sentence has a positive sentiment. You will therefore create a decision rule to set the prediction to positive or negative.
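
The decision rule described above can be sketched on its own, without the model. This is a minimal illustration: the function name `to_sentiment` and the example probabilities are assumptions made up for this sketch, not part of the exercise environment.

```python
def to_sentiment(probabilities, threshold=0.5):
    """Map positive-class probabilities to 'positive' / 'negative' labels."""
    return ["positive" if p > threshold else "negative" for p in probabilities]

# Hypothetical model outputs, for illustration only
example_probs = [0.91, 0.50, 0.08]
labels = to_sentiment(example_probs)
print(labels)
```

Note that a probability of exactly 0.5 falls on the negative side, matching the rule "positive (> 0.5) or negative (<= 0.5)".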

In [None]:
import pandas as pd

# Inspect the first sentence in `X_test`
print(X_test[0])

# Get the prediction for all the sentences
pred = model.predict(X_test)

# Transform the prediction into positive (> 0.5) or negative (<= 0.5)
pred_sentiment = ["positive" if x > 0.5 else "negative" for x in pred]

# Create a data frame with sentences, predictions and true values
result = pd.DataFrame({'sentence': sentences, 'y_pred': pred_sentiment, 'y_true': y_test})

# Print the first lines of the data frame
print(result.head())

In [19]:
import spacy
nlp=spacy.load('en_core_web_lg')
text ="You're afraid of insects and women, Ladybugs must render you catatonic.", 'Scissors cuts paper, paper covers rock, rock crushes lizard, lizard poisons Spock, Spock smashes scissors, scissors decapitates lizard, lizard eats paper, paper disproves Spock, Spock vaporizes rock, and as it always has, rock crushes scissors.'
DOC=nlp(str(text))
# Transform the list of sentences into a list of words
all_words = ' '.join(text).split(' ')
#all_words=[w for w in DOC ]
# Get the unique words
unique_words = list(set(all_words))
# Dictionary of indexes as keys and words as values
index_to_word = {i:wd for i, wd in enumerate(sorted(unique_words))}
print(index_to_word)
# Dictionary of words as keys and indexes as values
word_to_index = {wd:i for i, wd in enumerate(sorted(unique_words))}
print(word_to_index)

{0: 'Ladybugs', 1: 'Scissors', 2: 'Spock', 3: 'Spock,', 4: "You're", 5: 'afraid', 6: 'always', 7: 'and', 8: 'as', 9: 'catatonic.', 10: 'covers', 11: 'crushes', 12: 'cuts', 13: 'decapitates', 14: 'disproves', 15: 'eats', 16: 'has,', 17: 'insects', 18: 'it', 19: 'lizard', 20: 'lizard,', 21: 'must', 22: 'of', 23: 'paper', 24: 'paper,', 25: 'poisons', 26: 'render', 27: 'rock', 28: 'rock,', 29: 'scissors', 30: 'scissors,', 31: 'scissors.', 32: 'smashes', 33: 'vaporizes', 34: 'women,', 35: 'you'}
{'Ladybugs': 0, 'Scissors': 1, 'Spock': 2, 'Spock,': 3, "You're": 4, 'afraid': 5, 'always': 6, 'and': 7, 'as': 8, 'catatonic.': 9, 'covers': 10, 'crushes': 11, 'cuts': 12, 'decapitates': 13, 'disproves': 14, 'eats': 15, 'has,': 16, 'insects': 17, 'it': 18, 'lizard': 19, 'lizard,': 20, 'must': 21, 'of': 22, 'paper': 23, 'paper,': 24, 'poisons': 25, 'render': 26, 'rock': 27, 'rock,': 28, 'scissors': 29, 'scissors,': 30, 'scissors.': 31, 'smashes': 32, 'vaporizes': 33, 'women,': 34, 'you': 35}
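
The dictionaries above contain both 'Spock' and 'Spock,' because the text is split on spaces only, so punctuation stays attached to the words. One possible way to avoid this is sketched below, assuming a simple regular-expression tokenization and lowercasing; the function name `simple_tokenize` and the shortened sample sentence are illustrative choices, not part of the original exercise.

```python
import re

def simple_tokenize(sentence):
    """Lowercase a sentence and keep runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", sentence.lower())

# Shortened sample sentence, for illustration only
sample = "Spock smashes scissors, scissors decapitates lizard, lizard poisons Spock."
tokens = simple_tokenize(sample)

# Same dictionary construction as above, now on cleaned tokens
vocab = sorted(set(tokens))
word_to_index_clean = {wd: i for i, wd in enumerate(vocab)}
print(word_to_index_clean)
```

With this tokenization, 'Spock', 'Spock.' and 'spock' all map to the same entry.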


# Preparing text data for model input
Previously, you learned how to create dictionaries of indexes to words and vice versa. In this exercise, you will split the text by characters and continue to prepare the data for supervised learning.

Splitting the texts into characters may seem strange, but it is often done for text generation. The process of preparing the data is otherwise the same; the only change is how the texts are split.

You will create the training data containing a list of fixed-length texts and their labels, which are the corresponding next characters.

You will continue to use the dataset containing quotes from Sheldon (The Big Bang Theory), available in the sheldon_quotes variable.

The print_examples() function prints the pairs so you can see how the data was transformed. Use help() for details.

In [18]:
# Create lists to keep the sentences and the next character
sentences = []   # ~ Training data
next_chars = []  # ~ Training labels

# Define hyperparameters
step = 2         # ~ Step to take when reading the texts in characters
chars_window = 10 # ~ Number of characters to use to predict the next one  

# Loop over the text: length `chars_window` per time with step equal to `step`
for i in range(0, len(sheldon_quotes) - chars_window, step):
    sentences.append(sheldon_quotes[i:i + chars_window])
    next_chars.append(sheldon_quotes[i + chars_window])

# Print 10 pairs
print_examples(sentences, next_chars, 10)
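
To see what the loop produces without access to sheldon_quotes, here is the same windowing logic applied to a short string. The helper name `make_windows` and the toy input are assumptions for this sketch.

```python
def make_windows(text, chars_window, step):
    """Return fixed-length character windows and the character following each one."""
    sentences, next_chars = [], []
    for i in range(0, len(text) - chars_window, step):
        sentences.append(text[i:i + chars_window])
        next_chars.append(text[i + chars_window])
    return sentences, next_chars

# Toy input: windows of 4 characters, moving 2 characters at a time
demo_sentences, demo_next = make_windows("bazinga!", chars_window=4, step=2)
for s, c in zip(demo_sentences, demo_next):
    print(repr(s), "->", repr(c))
```

Each window is a training example and the character right after it is its label, so a larger step yields fewer, less overlapping examples.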


In [20]:
new_text =['A man either lives life as it happens to him meets it head-on and licks it or he turns his back on it and starts to wither away', 'To the brave crew and passengers of the Kobayshi Maru sucks to be you', 'Beware of more powerful weapons They often inflict as much damage to your soul as they do to you enemies', 'They are merely scars not mortal wounds and you must use them to propel you forward', 'You cannot explain away a wantonly immoral act because you think that it is connected to some higher purpose']
# Loop through the sentences and get indexes
new_text_split = []
for sentence in new_text:
    sent_split = []
    for wd in sentence.split(' '):
        # Words missing from the dictionary fall back to index 0
        index = word_to_index.get(wd, 0)
        sent_split.append(index)
    new_text_split.append(sent_split)

# Print the first sentence's indexes
print(new_text_split[0])

# Print the sentence converted using the dictionary
print(' '.join([index_to_word[index] for index in new_text_split[0]]))

[0, 0, 0, 0, 0, 8, 18, 0, 0, 0, 0, 18, 0, 7, 0, 18, 0, 0, 0, 0, 0, 0, 18, 7, 0, 0, 0, 0]
Ladybugs Ladybugs Ladybugs Ladybugs Ladybugs as it Ladybugs Ladybugs Ladybugs Ladybugs it Ladybugs and Ladybugs it Ladybugs Ladybugs Ladybugs Ladybugs Ladybugs Ladybugs it and Ladybugs Ladybugs Ladybugs Ladybugs
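
In the output above, every word missing from the dictionary is decoded as 'Ladybugs', because the fallback index 0 is also the index of a real word. A common workaround is to reserve index 0 for unknown words. The sketch below assumes a placeholder token named `<UNK>` and a helper `build_vocab`; both are hypothetical and not part of the exercise.

```python
UNK = "<UNK>"  # hypothetical placeholder token for out-of-vocabulary words

def build_vocab(words):
    """Reserve index 0 for unknown words and number known words from 1."""
    index_to_word = {0: UNK}
    for i, wd in enumerate(sorted(set(words)), start=1):
        index_to_word[i] = wd
    word_to_index = {wd: i for i, wd in index_to_word.items()}
    return word_to_index, index_to_word

# Tiny vocabulary, for illustration only
w2i, i2w = build_vocab(["as", "it", "and", "rock"])

encoded = [w2i.get(wd, 0) for wd in "A man lives life as it happens".split(" ")]
decoded = " ".join(i2w[i] for i in encoded)
print(encoded)
print(decoded)
```

Unknown words now decode to the placeholder instead of being confused with a word from the vocabulary.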


# Keras models
In this exercise you'll practice using two classes from the keras.models module. You will create the same model twice, once with the Sequential class and once with the Model class.

The Sequential class is easier since the layers are assumed to be in order, while the Model class is more flexible and allows multiple inputs, multiple outputs and shared layers (shared weights).

The Model class needs to explicitly declare the input layer, while in the Sequential class, this is done with the input_shape parameter.

The objects and modules Sequential, Model, Dense, Input, LSTM and np (numpy) are already loaded on the environment.

In [22]:
import tensorflow as tf

# Shorthand for the layers module used below
layers = tf.keras.layers
# Instantiate the class
model = tf.keras.Sequential(name="sequential_model")

# One LSTM layer (defining the input shape because it is the 
# initial layer)
model.add(layers.LSTM(128, input_shape=(None, 10), name="LSTM"))

# Add a dense layer with one unit
model.add(layers.Dense(1, activation="sigmoid", name="output"))

# The summary shows the layers and the number of parameters 
# that will be trained
model.summary()

Model: "sequential_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
LSTM (LSTM)                  (None, 128)               71168     
_________________________________________________________________
output (Dense)               (None, 1)                 129       
Total params: 71,297
Trainable params: 71,297
Non-trainable params: 0
_________________________________________________________________
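
The parameter counts in the summary can be checked by hand. The sketch below uses the standard LSTM count (four gates, each with an input kernel, a recurrent kernel and a bias) and the usual Dense count; the helper names are made up for this illustration.

```python
def lstm_params(units, input_dim):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(units, input_dim):
    """One weight per input per unit, plus one bias per unit."""
    return units * input_dim + units

lstm_count = lstm_params(units=128, input_dim=10)   # input_shape=(None, 10)
dense_count = dense_params(units=1, input_dim=128)  # fed by the 128 LSTM units
print(lstm_count, dense_count, lstm_count + dense_count)
```

The three numbers match the LSTM, output and total rows of the summary above.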


# Functional Model

In [26]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input

layers = tf.keras.layers

# Define the input layer
main_input = Input(shape=(None, 10), name="input")

# One LSTM layer (input shape is already defined)
lstm_layer = layers.LSTM(128, name="LSTM")(main_input)

# Add a dense layer with one unit
main_output = layers.Dense(1, activation="sigmoid", name="output")(lstm_layer)

# Instantiate the class at the end
model = tf.keras.Model(inputs=main_input, outputs=main_output, name="modelclass_model")

# Same amount of parameters to train as before (71,297)
model.summary()

Model: "modelclass_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           [(None, None, 10)]        0         
_________________________________________________________________
LSTM (LSTM)                  (None, 128)               71168     
_________________________________________________________________
output (Dense)               (None, 1)                 129       
Total params: 71,297
Trainable params: 71,297
Non-trainable params: 0
_________________________________________________________________


Great! You can see that keras.models.Sequential makes it very easy to add layers in sequence. On the other hand, the keras.models.Model class is very flexible and is usually the choice when scientists need deep customization in their solution. You also saw how one layer is connected to another in both cases: either by adding them in sequence with the add method, or by creating a layer and calling it on the output of the previous layer like a function. In the Model class API, every layer is callable on a tensor and always returns a tensor.

# Keras preprocessing

The second most important module of Keras is keras.preprocessing. You will see how to use its most important modules and functions to prepare raw data into the correct input shape. Keras provides functionality that replaces the dictionary approach you learned before.

You will use the class keras.preprocessing.text.Tokenizer to create a dictionary of words with the method .fit_on_texts(), and to convert the texts into numerical ids, representing the index of each word in the dictionary, with the method .texts_to_sequences().

Then, use the function pad_sequences() from keras.preprocessing.sequence to make all the sequences the same size (necessary for the model) by adding zeros to the short texts and cutting the long ones.

In [5]:
# Import relevant classes/functions
from tensorflow.keras.preprocessing.text import  Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
texts =['A man either lives life as it happens to him meets it head-on and licks it or he turns his back on it and starts to wither away', 'To the brave crew and passengers of the Kobayshi Maru sucks to be you', 'Beware of more powerful weapons They often inflict as much damage to your soul as they do to you enemies', 'They are merely scars not mortal wounds and you must use them to propel you forward', 'You cannot explain away a wantonly immoral act because you think that it is connected to some higher purpose']
# Build the dictionary of indexes
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
# Change texts into sequence of indexes
texts_numeric = tokenizer.texts_to_sequences(texts)
print("Number of words in the sample texts: ({0}, {1})".format(len(texts_numeric[0]), len(texts_numeric[1])))
# Pad the sequences
texts_pad = pad_sequences(texts_numeric, maxlen=60, padding='post')
print("Now the texts have fixed length: 60. Let's see the first one: \n{0}".format(texts_pad[0]))

Number of words in the sample texts: (29, 14)
Now the texts have fixed length: 60. Let's see the first one: 
[ 7 12 13 14 15  5  3 16  1 17 18  3 19  8  4 20  3 21 22 23 24 25  8  3
  4 26  1 27  9  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0]
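
To make the padding step concrete, here is a pure-Python sketch of the behavior used above: zeros are appended at the end (padding='post'), and sequences that are too long keep their last maxlen items, which is my understanding of the Keras default truncating='pre'. The helper `pad_post` is a hypothetical stand-in, not the Keras implementation.

```python
def pad_post(sequences, maxlen, value=0):
    """Append `value` to short sequences; keep the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        trimmed = list(seq)[-maxlen:]  # drop items from the front if too long
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded

# One short and one long toy sequence
demo = pad_post([[7, 12, 13], [1, 2, 3, 4, 5, 6]], maxlen=5)
print(demo)
```

Index 0 is safe to use as the padding value because the Tokenizer numbers words starting from 1.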


# Exploding gradient problem
In the video exercise, you learned about two problems that may arise when working with RNN models: the vanishing and exploding gradient problems.

This exercise explores the exploding gradient problem, showing that the derivative of a function can increase exponentially, and how to solve it with a simple technique.

The data is already loaded on the environment as X_train, X_test, y_train and y_test.

You will use a Stochastic Gradient Descent (SGD) optimizer and Mean Squared Error (MSE) as the loss function.

In the first step you will observe the gradient exploding by computing the MSE on the train and test sets. In step 2, you will change the optimizer using the clipvalue parameter to solve the problem.

The Stochastic Gradient Descent optimizer in Keras is loaded as SGD.

In [None]:
# Create a Keras model with one hidden Dense layer
model = Sequential()
model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer=he_uniform(seed=42)))
model.add(Dense(1, activation='linear'))

# Compile and fit the model
model.compile(loss='mean_squared_error', optimizer=SGD(lr=0.01, momentum=0.9))
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, verbose=0)

# See Mean Square Error for train and test data
train_mse = model.evaluate(X_train, y_train, verbose=0)
test_mse = model.evaluate(X_test, y_test, verbose=0)

# Print the values of MSE
print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))

# Avoid exploding gradients
In step 2, the same model is compiled with the clipvalue parameter set on the optimizer.

In [None]:
# Create a Keras model with one hidden Dense layer
model = Sequential()
model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer=he_uniform(seed=42)))
model.add(Dense(1, activation='linear'))

# Compile and fit the model
model.compile(loss='mean_squared_error', optimizer=SGD(lr=0.01, momentum=0.9, clipvalue=3.0))
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, verbose=0)

# See Mean Square Error for train and test data
train_mse = model.evaluate(X_train, y_train, verbose=0)
test_mse = model.evaluate(X_test, y_test, verbose=0)

# Print the values of MSE
print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))

The exploding gradient problem can happen when using RNN models. Luckily, it can be addressed with simple techniques such as gradient clipping. Notice how, after applying this technique, the outputs are no longer NaN, meaning that the gradients didn't 'explode' during step 2.
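
Clipping by value limits every gradient component to a fixed range before the weights are updated. The sketch below illustrates the idea on a plain list of numbers with the same limit of 3.0 used above; the function `clip_by_value` and the raw gradient values are made up for this example and do not reproduce the optimizer's internals.

```python
def clip_by_value(gradients, clipvalue):
    """Limit each gradient component to the range [-clipvalue, clipvalue]."""
    return [max(-clipvalue, min(clipvalue, g)) for g in gradients]

# Hypothetical raw gradients, one of them very large
raw = [0.2, -7.5, 1e6, -3.0]
clipped = clip_by_value(raw, clipvalue=3.0)
print(clipped)
```

Small gradients pass through unchanged, while very large ones can no longer produce a huge weight update.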


# Vanishing gradient problem
The other possible gradient problem is when the gradients vanish, or go to zero. This is a much harder problem to solve because it is not as easy to detect. If the loss function does not improve on every step, is it because the gradients went to zero and thus didn't update the weights? Or is it because the model is not able to learn?

This problem occurs more often in RNN models when long memory is required, for example when the sentences are long.
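
A rough way to see why long sequences matter: backpropagating through time multiplies the gradient by a similar factor at every step. The sketch below assumes, as a simplification, that this factor is a single constant number; real RNN gradients involve matrices and activation derivatives.

```python
def gradient_through_time(factor, steps):
    """Multiply a unit gradient by the same factor once per time step."""
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad

short_small = gradient_through_time(0.5, 10)    # short sequence, factor below 1
long_small = gradient_through_time(0.5, 100)    # long sequence, factor below 1
long_large = gradient_through_time(1.5, 100)    # long sequence, factor above 1
print(f"{short_small:.3e} {long_small:.3e} {long_large:.3e}")
```

With a factor below one the gradient shrinks toward zero as the sequence grows (vanishing), and with a factor above one it grows without bound (exploding).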

In this exercise you will observe the problem on the IMDB data, with longer sentences selected. The data is loaded in the X and y variables, and the classes Sequential, SimpleRNN and Dense and the module matplotlib.pyplot (as plt) are also loaded. The model was pre-trained for 100 epochs and its weights are stored in the file model_weights.h5.

In [None]:
# Create the model
model = Sequential()
model.add(SimpleRNN(units=600, input_shape=(None, 1)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])

# Load pre-trained weights
model.load_weights('model_weights.h5')

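# Plot the accuracy x epoch graph
# (assumes `history`, the training history of the pre-trained model, is available in the environment)
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])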
plt.legend(['train', 'val'], loc='upper left')
plt.show()

You can observe that at some point the accuracy stopped improving, which can happen because of the vanishing gradient problem. This kind of problem is harder to detect than the exploding gradient problem and demands deeper analysis by the data scientist. Researchers found an architectural solution to this problem, which you will study later in this course: instead of SimpleRNN cells, you can use more complex ones such as Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) cells.

# GRU cells are better than SimpleRNN
In this exercise you will re-run the same model as in the first chapter of the course to compare the accuracy of the model after simply changing the SimpleRNN cell to a GRU cell.

The model was already trained with 10 epochs, as in the previous model with a SimpleRNN cell. In order to compare the models, a test set (X_test, y_test) is already loaded in the environment, as well as the old model SimpleRNN_model.

In [None]:
# Import the modules
from keras.layers import GRU, Dense

# Print the old and new model summaries
SimpleRNN_model.summary()
gru_model.summary()

# Evaluate the models' performance (ignore the loss value)
_, acc_simpleRNN = SimpleRNN_model.evaluate(X_test, y_test, verbose=0)
_, acc_GRU = gru_model.evaluate(X_test, y_test, verbose=0)

# Print the results
print("SimpleRNN model's accuracy:\t{0}".format(acc_simpleRNN))
print("GRU model's accuracy:\t{0}".format( acc_GRU))

# Stacking RNN layers
Deep RNN models can have tens to hundreds of layers in order to achieve state-of-the-art results.

In this exercise, you will get a glimpse of how to create deep RNN models by stacking layers of LSTM cells one after the other.

To do this, you will set the return_sequences argument to True on the first two LSTM layers and to False on the last LSTM layer.

To create models with even more layers, you can keep adding them one after the other or create a function that uses the .add() method inside a loop to add many layers with few lines of code.
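
The loop idea can be sketched without Keras by generating the settings for each layer first. The helper `stacked_lstm_configs` is a hypothetical name chosen for this illustration; it only encodes the rule that every layer except the last returns sequences.

```python
def stacked_lstm_configs(n_layers, units):
    """Build keyword arguments for a stack of LSTM layers.

    Every layer returns the full sequence except the last one,
    which returns only its final output.
    """
    configs = []
    for i in range(n_layers):
        is_last = (i == n_layers - 1)
        configs.append({"units": units, "return_sequences": not is_last})
    return configs

configs = stacked_lstm_configs(n_layers=3, units=128)
for cfg in configs:
    print(cfg)

# With Keras available, each entry could then be passed on inside the loop, e.g.:
#     model.add(LSTM(**cfg))
```

In a real model the first layer would additionally need the input shape, as in the three-layer example in this exercise.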

In [None]:
# Import the LSTM layer
from keras.layers import LSTM

# Build model
model = Sequential()
model.add(LSTM(units=128, input_shape=(None, 1), return_sequences=True))
model.add(LSTM(units=128, return_sequences=True))
model.add(LSTM(units=128, return_sequences=False))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Load pre-trained weights
model.load_weights('lstm_stack_model_weights.h5')

print("Loss: %0.04f\nAccuracy: %0.04f" % tuple(model.evaluate(X_test, y_test, verbose=0)))

Awesome! Stacking more layers also improves the accuracy of the model compared to the baseline 'simple_RNN' model! In the next lesson you will learn what else you can do to improve the model.

# The Embedding layer


# Number of parameters comparison
You saw that the one-hot representation is not a good representation of words because it is very sparse. Using the Embedding layer creates a dense representation of the vectors, but also demands a lot of parameters to be learned.

In this exercise you will compare the number of parameters of two models using embeddings and one-hot encoding to see the difference.

The model model_onehot is already loaded in the environment, as well as the classes Sequential, Dense and GRU from Keras. Finally, the parameters vocabulary_size=80000 and sentence_len=200 are also loaded.

In [None]:
# Import the embedding layer
from keras.layers import Embedding

# Create a model with embeddings
model = Sequential(name="emb_model")
model.add(Embedding(input_dim=vocabulary_size+1, output_dim=wordvec_dim, input_length=sentence_len, trainable=True))
model.add(GRU(128))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print the summary of the one-hot model
model_onehot.summary()

# Print the summary of the model with embeddings
model.summary()

You can see the immense difference in the number of parameters when using the embedding layer! Don't worry, in the next exercise you will learn how to use transfer learning to avoid having to train this layer.

# Transfer learning
You saw that when training an embedding layer, you need to learn a lot of parameters.

In this exercise, you will see that when using transfer learning it is possible to use the pre-trained weights without updating them, meaning that all the parameters of the embedding layer will be fixed, and the model will only need to learn the parameters of the other layers.

The function load_glove is already loaded in the environment and retrieves the GloVe matrix as a numpy.ndarray. It uses the function covered in the lesson's slides to retrieve the GloVe vectors with 200 embedding dimensions for the vocabulary present in this exercise.
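
The actual load_glove function is not shown here. As a rough idea of what building such a matrix involves, the sketch below fills one row per word index from a dictionary of pre-trained vectors and leaves zeros for words without a vector. All names and the toy 3-dimensional vectors are assumptions for illustration; plain lists are used to keep the sketch dependency-free, and numpy.array(matrix) would convert the result to an array.

```python
def build_embedding_matrix(word_to_index, word_vectors, dim):
    """Return a (vocabulary size + 1) x dim matrix as a list of rows.

    Row 0 is left as zeros (reserved for padding); words without a
    pre-trained vector also keep a row of zeros.
    """
    matrix = [[0.0] * dim for _ in range(len(word_to_index) + 1)]
    for word, idx in word_to_index.items():
        vec = word_vectors.get(word)
        if vec is not None:
            matrix[idx] = list(vec)
    return matrix

# Toy pre-trained vectors and vocabulary, for illustration only
toy_vectors = {"rock": [0.1, 0.2, 0.3], "paper": [0.4, 0.5, 0.6]}
toy_index = {"rock": 1, "paper": 2, "bazinga": 3}

toy_matrix = build_embedding_matrix(toy_index, toy_vectors, dim=3)
for row in toy_matrix:
    print(row)
```

The extra row is why the embedding layer below uses vocabulary_size + 1 as its input dimension.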

In [None]:
# Import the initializer used to set the embedding weights
from keras.initializers import Constant

# Load the glove pre-trained vectors
glove_matrix = load_glove('glove_200d.zip')

# Create a model with embeddings
model = Sequential(name="emb_model")
model.add(Embedding(input_dim=vocabulary_size + 1, output_dim=wordvec_dim, 
                    embeddings_initializer= Constant(glove_matrix), 
                    input_length=sentence_len, trainable=False))
model.add(GRU(128))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print the summaries of the model with embeddings
model.summary()

As you can see, the total number of parameters is very large, but the number of parameters that will be trained is much smaller. The pre-trained vectors already have values for known words, but are equal to a vector of zeros for new words not present in the pre-trained vocabulary. This can lead to problems if the task at hand is very specific.
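
The size of the frozen embedding layer can be estimated by hand: it stores one vector per row of its input dimension. The calculation below assumes that wordvec_dim equals 200, matching the 200-dimensional GloVe vectors; the exercise does not state its value explicitly.

```python
def embedding_params(input_dim, output_dim):
    """An embedding layer stores one output_dim-sized vector per input index."""
    return input_dim * output_dim

vocabulary_size = 80000  # value stated in the previous exercise
wordvec_dim = 200        # assumption: matches the 200-dimensional GloVe vectors

frozen_params = embedding_params(vocabulary_size + 1, wordvec_dim)
print(frozen_params)
```

With trainable=False, all of these weights are counted as non-trainable parameters in the summary.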

# Embeddings improve performance
Does the embedding layer improve the accuracy of the model? Let's check it out on the same IMDB data.

The model was already trained with 10 epochs, as in the previous model with a SimpleRNN cell. In order to compare the models, a test set (X_test, y_test) is available in the environment, as well as the old model SimpleRNN_model. The old model's accuracy is loaded in the variable acc_simpleRNN.

All required modules and functions are loaded in the environment: Sequential() from keras.models, Embedding and Dense from keras.layers and SimpleRNN from keras.layers.recurrent.

In [None]:
# Create the model with embedding
model = Sequential(name="emb_model")
model.add(Embedding(input_dim=max_vocabulary, output_dim=wordvec_dim, input_length=max_len))
model.add(SimpleRNN(units=128))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Load pre-trained weights
model.load_weights('embedding_model_weights.h5')

# Evaluate the model's performance (ignore the loss value)
_, acc_embeddings = model.evaluate(X_test, y_test, verbose=0)

# Print the results
print("SimpleRNN model's accuracy:\t{0}\nEmbeddings model's accuracy:\t{1}".format(acc_simpleRNN, acc_embeddings))

# Sentiment classification revisited

## Better sentiment classification
In this exercise, you go back to the sentiment classification problem seen in Chapter 1.

You are going to add more complexity to the model and improve its accuracy. You will use an Embedding layer to train word vectors on the training set and two LSTM layers to keep track of longer texts. Also, you will add an extra Dense layer before the output.

This is no longer a simple model, and the training can take some time. For this reason, a pre-trained model is available by loading its weights with the method .load_weights() from the keras.models.Sequential class. The model was trained with 10 epochs and its weights are available on the file model_weights.h5.

The following modules are loaded on the environment: Sequential, Embedding, LSTM, Dropout, Dense.


In [None]:
# Build and compile the model
model = Sequential()
model.add( Embedding(vocabulary_size, wordvec_dim, trainable=True, input_length=max_text_len))
model.add(LSTM(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.15))
model.add(LSTM(64, return_sequences=False, dropout=0.2, recurrent_dropout=0.15))
model.add(Dense(16))
model.add(Dropout(rate=0.25))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Load pre-trained weights
model.load_weights('model_weights.h5')

# Print the obtained loss and accuracy
print("Loss: {0}\nAccuracy: {1}".format(*model.evaluate(X_test, y_test, verbose=0)))

Superb! You just increased the accuracy of your sentiment classification task from a poor 50% to more than 80%.

# Using the CNN layer
In this exercise, you will use a pre-trained model that makes use of the Conv1D and MaxPooling1D layers from the keras.layers.convolutional module, and achieves even better accuracy on the classification task.

This architecture has achieved good results in language tasks such as classification, and is added here as an extra exercise so you can see it in action and build some intuition.

Because this layer is not in the scope of the course, you will focus on how to use the layers together with the RNN layers you already learned.

Please follow the instructions to see the results.

In [None]:
# Print the model summary
model_cnn.summary()

# Load pre-trained weights
model_cnn.load_weights('model_weights.h5')

# Evaluate the model to get the loss and accuracy values
loss, acc = model_cnn.evaluate(X_test, y_test, verbose=0)

# Print the loss and accuracy obtained
print("Loss: {0}\nAccuracy: {1}".format(loss, acc))

# Data pre-processing Multi Class Classification

# Prepare label vectors
In the video exercise, you learned the differences between binary classification and multi-class classification. You learned that there are some modifications to the data preparation process that need to be done before training the models.

In this exercise, you will prepare a raw dataset with labels given as text. The data is given as a pandas.DataFrame called df, with two columns: text with the text data and label with the label names. Your task is to make all the necessary transformations to the labels: change string to number and one-hot encode.

The module pandas (as pd) and the function to_categorical() from keras.utils.np_utils are already loaded in the environment, and the first lines of the dataset are printed on the console for you to see.


In [None]:
# Get the numerical ids of column label
# (astype('category') is a no-op if the column is already categorical)
numerical_ids = df.label.astype('category').cat.codes

# Print initial shape
print(numerical_ids.shape)

# One-hot encode the indexes
Y = to_categorical(numerical_ids)

# Check the new shape of the variable
print(Y.shape)

# Print the first 5 rows
print(Y[0:5])

With this approach, you are able to transform any dataset to the format needed by RNN models. You can try with a dataset of your liking. Also, you can see that each class is now equidistant and the loss function will treat every misclassification in the same way, allowing for a better learning phase. To train RNN models, it is necessary to transform the text representation of the classes to a numeric one-hot vector.
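
The two label transformations, string to number and number to one-hot vector, can be sketched in plain Python. The helper `encode_labels` and the sample labels are assumptions for illustration; ids are assigned in sorted order of the class names, which is also how pandas orders string categories by default as far as I know.

```python
def encode_labels(labels):
    """Map string labels to integer ids and to one-hot vectors."""
    classes = sorted(set(labels))
    class_to_id = {c: i for i, c in enumerate(classes)}
    ids = [class_to_id[label] for label in labels]
    one_hot = [[1.0 if j == i else 0.0 for j in range(len(classes))] for i in ids]
    return classes, ids, one_hot

# Hypothetical sample labels
sample_labels = ["sci.space", "alt.atheism", "sci.space", "soc.religion.christian"]
classes, ids, one_hot = encode_labels(sample_labels)
print(classes)
print(ids)
print(one_hot)
```

Each one-hot row has a single 1 at the position of its class, so no class is numerically "closer" to another.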

# Pre-process data
You learned the differences for pre-processing the data in the case of multi-class classification. Let's put that into practice by preprocessing the data in anticipation of creating a simple multi-class classification model.

The dataset is loaded in the variable news_dataset, and has the following attributes:

- news_dataset.data: array with texts
- news_dataset.target: array with target categories as numerical indexes

The sample data contains 5,000 observations.

In [None]:
# Create and fit tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(news_dataset.data)
# Prepare the data
prep_data = tokenizer.texts_to_sequences(news_dataset.data)
prep_data = pad_sequences(prep_data, maxlen=200)
# Prepare the labels
prep_labels = to_categorical(news_dataset.target)
# Print the shapes
print(prep_data.shape)
print(prep_labels.shape)

# Transfer learning for language models
# Transfer learning starting point
In this exercise you will see the benefit of using pre-trained vectors as a starting point for your model.

You will compare the accuracy of two models trained with two epochs. The architecture of the models is the same: one embedding layer, one LSTM layer with 128 units and an output layer with 5 units, which is the number of classes in the sample data. The difference is that one model uses pre-trained vectors on the embedding layer (transfer learning) and the other doesn't.

The pre-trained vectors used were the GloVe vectors with 200 dimensions. The accuracy history on the validation set for both models is available in the variables history_no_emb and history_emb.

In [None]:
# Import plotting package
import matplotlib.pyplot as plt

# Insert lists of accuracy obtained on the validation set
plt.plot(history_no_emb['acc'], marker='o')
plt.plot(history_emb['acc'], marker='o')

# Add extra descriptions to plot
plt.title('Learning with and without pre-trained embedding vectors')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['no_embeddings', 'with_embeddings'], loc='upper left')

# Display the plot
plt.show()

Transfer learning provides an initial knowledge of the meaning of the words, and you can see that the model that used pre-trained embeddings started with higher accuracy. Of course, the model without transfer learning learns directly from the corpus and is more specialized in the vocabulary present in it, while the word embeddings obtained through transfer learning are more generic. By training the embeddings directly on the corpus, the model can become even better than the one initialized with the weights from transfer learning, but in many cases the computing power needed to train embeddings on a very big dataset is prohibitive.

# Word2Vec
In this exercise you will load a pre-trained Word2Vec model with gensim and explore the words it considers similar.

The corpus used to pre-train the model is the script of all episodes of The Big Bang Theory TV show, divided sentence by sentence. It is available in the variable bigbang.

The text of the corpus was transformed to lower case and all words were tokenized. The result is stored in the tokenized_corpus variable.

A Word2Vec model was pre-trained using a window size of 10 words for context (5 before and 5 after the center word). Words with fewer than 3 occurrences were removed, and the skip-gram method was used with 50 dimensions. The model is saved in the file bigbang_word2vec.model.
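
To illustrate what the context window means for the skip-gram method, the sketch below lists the (center word, context word) pairs for a tiny sentence. It is a simplification: the helper `skipgram_pairs` is a made-up name, a window of 1 word on each side is used instead of 5, and real Word2Vec training involves further details that are not shown here.

```python
def skipgram_pairs(tokens, window):
    """Pair every word with each neighbor at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Toy sentence with one word of context on each side
pairs = skipgram_pairs(["rock", "crushes", "scissors"], window=1)
print(pairs)
```

The model is trained to predict the context word from the center word, so words that appear in similar contexts end up with similar vectors.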

The class Word2Vec is already loaded in the environment from gensim.models.word2vec.

In [None]:
from gensim.models.word2vec import Word2Vec
# Word2Vec model
w2v_model = Word2Vec.load("bigbang_word2vec.model")

# Selected words to check similarities
words_of_interest = ['bazinga', 'penny', 'universe', 'spock', 'brain']

# Compute top 5 similar words for each of the words of interest
top5_similar_words = []
for word in words_of_interest:
    top5_similar_words.append(
      {word: [item[0] for item in w2v_model.wv.most_similar([word], topn=5)]}
    )

# Print the similar words
print(top5_similar_words)
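
Similarity between word vectors is typically measured with cosine similarity, which is what most_similar ranks by as far as I know. The sketch below reimplements the idea on hypothetical 2-dimensional vectors; the function names and the vector values are made up and are not taken from the trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, vectors, topn=5):
    """Rank all other words by cosine similarity to `word`."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]

# Hypothetical 2-d word vectors, for illustration only
toy_word_vectors = {
    "spock": [1.0, 0.0],
    "kirk": [0.9, 0.1],
    "lizard": [0.0, 1.0],
    "paper": [-1.0, 0.0],
}

top2 = most_similar("spock", toy_word_vectors, topn=2)
print([word for word, _ in top2])
```

Vectors pointing in nearly the same direction score close to 1, unrelated ones close to 0, and opposite ones close to -1.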

# Multi-class classification models


In [12]:
from sklearn.datasets import fetch_20newsgroups
cats = ['alt.atheism', 'sci.space','soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
print(newsgroups_train.data[2])

From: rolfe@dsuvax.dsu.edu (Tim Rolfe)
Subject: Re: What did Lazarus smell like?
Lines: 15

In the discussion as to why Jesus spoke aloud the "Lazarus, come out",
I'm surprised that no one has noticed the verse immediately preceeding.

Jn 12:41  "Father, I thank you for listening to me, though I knew that
you always listen to me.  But I have said this for the sake of the
people that are standing around me that they may believe that you have
made my your messenger."  (Goodspeed translation)

My guess is that the "Lazarus, come out!" was also for the sake of the
crowd.
-- 
                                                    --- Tim Rolfe
                                                 rolfe@dsuvax.dsu.edu
                                                 rolfe@junior.dsu.edu
                                                RolfeT@columbia.dsu.edu



In [17]:
# Import relevant classes/functions
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf

# Create the tokenizer and fit it (`texts` is assumed to be defined in an earlier cell)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
# Transform the text into numerical indexes
news_num_indices = tokenizer.texts_to_sequences(newsgroups_train.data)

# Print the transformed example article
print(news_num_indices[2])

# Transform the labels into one-hot encoded vectors
labels_onehot = tf.keras.utils.to_categorical(newsgroups_train.target)

# Check before and after for the sample article
print("Before: {0}\nAfter: {1}".format(newsgroups_train.target[2] ,labels_onehot[2]))

[10, 5, 1, 10, 65, 10, 2, 1, 65, 2, 1, 10, 11, 10, 65, 47, 65, 6, 65, 2, 43, 66, 65, 10, 10, 11, 10]
Before: 2
After: [0. 0. 1.]
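
The one-hot transformation shown above can be sketched with plain NumPy. This is only an illustration of the idea on made-up labels, not the implementation used by to_categorical.

```python
import numpy as np

# Made-up integer labels for three classes
labels = np.array([0, 2, 1, 2])
num_classes = int(labels.max()) + 1

# Row i of the identity matrix is the one-hot vector for class i
onehot = np.eye(num_classes)[labels]

print(onehot[1])
```

Each row has a single 1 in the position of the class id, so class 2 becomes `[0. 0. 1.]`, as in the sample article.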


# Classifying news articles
In this exercise you will create a multi-class classification model.

The dataset is already loaded in the environment as news_novel. All the pre-processing of the training data is already done, and tokenizer is also available in the environment.

An RNN model was pre-trained with the following architecture: an Embedding layer, one LSTM layer and an output Dense layer expecting three classes: sci.space, alt.atheism, and soc.religion.christian. The weights of this trained model are available in the classify_news_weights.h5 file.

You will pre-process the novel data and evaluate the model on this new dataset, news_novel.

In [None]:
# Change text for numerical ids and pad
X_novel = tokenizer.texts_to_sequences(news_novel.data)
X_novel = pad_sequences(X_novel, maxlen=400)

# One-hot encode the labels
Y_novel = to_categorical(news_novel.target)

# Load the model pre-trained weights
model.load_weights('classify_news_weights.h5')

# Evaluate the model on the new dataset
loss, acc = model.evaluate(X_novel, Y_novel, batch_size=64)

# Print the loss and accuracy obtained
print("Loss:\t{0}\nAccuracy:\t{1}".format(loss, acc))

You can see that it can take some time to train one epoch of an RNN model. You can also modify the model architecture to add or change layers; the more layers the model has, the more time it needs to train all the parameters.
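
The padding step in the exercise above forces every article to the same length. With its default settings, pad_sequences adds zeros at the beginning of short sequences and drops tokens from the beginning of long ones. The sketch below mimics that default behavior for a single sequence; `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad or left-truncate a list of ids to exactly `maxlen` items."""
    if len(sequence) >= maxlen:
        # Keep only the last `maxlen` ids
        return sequence[len(sequence) - maxlen:]
    # Fill the missing positions at the start with the padding value
    return [value] * (maxlen - len(sequence)) + sequence

short = pad_pre([7, 8, 9], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)
print(long)
```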
# Assessing the model's performance
# Precision-Recall trade-off
When working with classification tasks, the term Precision-Recall trade-off often appears. Where does it come from?

Usually, the document is assigned to the class with the highest probability (obtained by the .predict_proba() method). But what if the maximum probability is equal to 0.1? Should you consider that the document belongs to this class with only 10% probability?

The answer varies according to the problem at hand. It is possible to add a minimum threshold to accept the classification, and as the threshold changes, precision and recall tend to move in opposite directions.

The variable y_true and the model model are loaded. If the highest probability is lower than the threshold, the document will be assigned to DEFAULT_CLASS (chosen to be class 2).


In [None]:
# Get probabilities for each class
pred_probabilities = model.predict_proba(X_test)

# Thresholds at 0.5 and 0.8
y_pred_50 = [np.argmax(x) if np.max(x) >= 0.5  else DEFAULT_CLASS for x in pred_probabilities]
y_pred_80 = [np.argmax(x) if np.max(x) >= 0.8 else DEFAULT_CLASS for x in pred_probabilities]

trade_off = pd.DataFrame({
    'Precision_50': precision_score(y_true, y_pred_50, average=None), 
    'Precision_80': precision_score(y_true, y_pred_80, average=None), 
    'Recall_50': recall_score(y_true, y_pred_50, average=None), 
    'Recall_80': recall_score(y_true, y_pred_80, average=None)}, 
  index=['Class 1', 'Class 2', 'Class 3'])

print(trade_off)

You can see that for some classes precision increased and recall decreased, and the opposite can also happen. As the threshold changes, an increase in one metric usually comes with a decrease in the other. The trade-off depends on the specific problem you are solving. If misclassification is not acceptable for the class of interest, then you should change the threshold to increase the precision. Likewise, if misclassification is acceptable and you are interested in covering all the observations of a specific class, then you should tune the threshold for higher recall values.
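
The effect of the threshold can be checked by hand on a small example. The sketch below uses made-up probabilities and labels (not the exercise data) and computes precision and recall for class 0 without scikit-learn; `predict_with_threshold` and `precision_recall` are hypothetical helper names.

```python
import numpy as np

# Made-up class probabilities for five documents and their true classes
toy_probs = np.array([
    [0.90, 0.05, 0.05],
    [0.55, 0.30, 0.15],
    [0.40, 0.35, 0.25],
    [0.10, 0.85, 0.05],
    [0.60, 0.30, 0.10],
])
toy_true = np.array([0, 0, 0, 1, 1])
FALLBACK_CLASS = 2  # plays the role of DEFAULT_CLASS

def predict_with_threshold(probs, threshold):
    """Pick the most likely class, or the fallback when confidence is too low."""
    best = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= threshold
    return np.where(confident, best, FALLBACK_CLASS)

def precision_recall(y_true, y_pred, cls):
    """Precision and recall of a single class."""
    tp = np.sum((y_pred == cls) & (y_true == cls))
    predicted = np.sum(y_pred == cls)
    actual = np.sum(y_true == cls)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return float(precision), float(recall)

results = {}
for threshold in (0.5, 0.8):
    toy_pred = predict_with_threshold(toy_probs, threshold)
    results[threshold] = precision_recall(toy_true, toy_pred, cls=0)
    print(threshold, results[threshold])
```

On this toy data, raising the threshold from 0.5 to 0.8 lifts the precision of class 0 from 2/3 to 1 while its recall drops from 2/3 to 1/3, because only the most confident prediction is still accepted.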
# Performance on multi-class classification
In this exercise, you will compute the performance metrics for models using the module sklearn.metrics.

The model is already trained and stored in the variable model. The variables X_test and y_true are also loaded, together with the functions confusion_matrix() and classification_report() from the sklearn.metrics package.

You will first compute the confusion matrix of the model. Then, to summarize the model's performance, you will compute the precision, recall and F1-score using the classification_report() function. In this function, you can optionally pass a list containing the class names (they are stored in the news_cat variable) to the parameter target_names to make the report more readable.

In [None]:
# Use the model to predict on new data
predicted = model.predict(X_test)

# Choose the class with the highest probability
y_pred = np.argmax(predicted, axis=1)

# Compute and print the confusion matrix
print(confusion_matrix(y_true, y_pred))

# Create the performance report
print(classification_report(y_true, y_pred, target_names=news_cat))

That's great! You can get all the metrics with a single function call. It is an easy and fast way to evaluate model performance in classification tasks. Also, remember that precision measures how reliable the model's predictions are: high precision on a class means that when the model predicts that class, it is usually right. On the other hand, recall measures how well the model finds the true members of each class: if you are interested in a specific class and need high coverage of its true cases, you want high recall values.
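
To see where the numbers in the report come from, the sketch below builds a confusion matrix by hand and derives per-class precision, recall and F1-score from it. The labels are made up for illustration, and the matrix follows the same layout as confusion_matrix(): rows are true classes and columns are predicted classes.

```python
import numpy as np

# Made-up true and predicted labels for three classes
toy_true = np.array([0, 0, 1, 1, 2, 2])
toy_pred = np.array([0, 1, 1, 1, 2, 0])
n_classes = 3

# Count each (true class, predicted class) combination
cm = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(toy_true, toy_pred):
    cm[t, p] += 1

tp = np.diag(cm)                  # correct predictions per class
precision = tp / cm.sum(axis=0)   # divide by the number of predictions of each class
recall = tp / cm.sum(axis=1)      # divide by the number of true cases of each class
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print(precision, recall, f1)
```

Precision reads the matrix column by column and recall reads it row by row, which is why the two metrics can differ so much for the same class.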