# Multi-class classification

In this chapter you will learn how to prepare data for multi-class classification and how it differs from binary classification (sentiment analysis). Finally, you will learn how to create models and measure their performance with Keras.

# (1) Data pre-processing

## Text classification
Applications of text classification:
- Automatic news classification
- Document classification
- Queue segmentation for customer support

## Changes from binary classification
What changes from binary to multi-class:
- Shape of the output variable `y`
- Number of units on the output layer
- Activation function on the output layer
- Loss function

## Changes from binary classification
Shape of the output variable `y`:

- One-hot encoding of the classes

In [None]:
# Example: num_classes = 3
# y[0] is a one-hot vector, e.g. [0, 1, 0]
# y.shape is (N, num_classes), where N is the number of samples
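
As a rough illustration of this shape change, the sketch below one-hot encodes a small vector of class ids with NumPy. The ids and `num_classes = 3` are made-up values, not taken from the course data.

```python
import numpy as np

# Hypothetical class ids for N = 4 samples and 3 classes
class_ids = np.array([1, 0, 2, 1])
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i
y = np.eye(num_classes, dtype=int)[class_ids]

print(y[0])     # [0 1 0]
print(y.shape)  # (4, 3)
```

Each row contains a single 1, in the column of the sample's class.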

## Changes from binary classification

<p align='center'>
    <img src='image/Screenshot 2021-02-12 191952.png'>
</p>

Activation function on the output layer:

- `softmax` gives the probability of every class

In [None]:
# Output layer
model.add(Dense(num_classes, activation="softmax"))
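
To make the role of `softmax` more concrete, here is a minimal NumPy sketch of the function. The input scores are hypothetical, and the helper is only an illustration, not the Keras implementation.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the max does not change the result but avoids overflow
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical raw scores for 3 classes
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)

print(probs)           # one probability per class
print(probs.sum())     # sums to 1, up to floating point error
print(probs.argmax())  # index of the most likely class
```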

Loss function:
- Instead of binary cross-entropy, we use categorical cross-entropy

In [None]:
# Compile the model
model.compile(loss='categorical_crossentropy')
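
The following sketch computes categorical cross-entropy by hand with NumPy, assuming one-hot labels and predicted probabilities. The arrays are invented for illustration, and the helper is a simplified stand-in for the Keras loss.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    # Clip to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

# Hypothetical one-hot labels and predicted probabilities (3 classes)
y_true = np.array([[0, 1, 0], [1, 0, 0]])
good_pred = np.array([[0.1, 0.8, 0.1], [0.7, 0.2, 0.1]])
bad_pred = np.array([[0.6, 0.2, 0.2], [0.1, 0.1, 0.8]])

loss_good = categorical_crossentropy(y_true, good_pred)
loss_bad = categorical_crossentropy(y_true, bad_pred)
print(loss_good, loss_bad)
```

Only the probability assigned to the true class contributes to each sample's loss, so confident correct predictions give a lower value.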

## Preparing text categories for Keras

In [None]:
import pandas as pd

y = ["sports", "economy", "data_science", "sports", "finance"]
# Transform to pandas series object
y_series = pd.Series(y, dtype="category")
# Print the category codes
print(y_series.cat.codes)
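
For string labels like these, pandas sorts the unique categories and uses each label's position in that sorted list as its code. The pure-Python sketch below mimics that mapping; the names `categories` and `label_to_code` are introduced only for this illustration.

```python
# Labels from the cell above
y = ["sports", "economy", "data_science", "sports", "finance"]

# Sorted unique labels play the role of `y_series.cat.categories`
categories = sorted(set(y))
label_to_code = {label: code for code, label in enumerate(categories)}

# Integer codes, analogous to `y_series.cat.codes`
codes = [label_to_code[label] for label in y]
print(codes)  # [3, 1, 0, 3, 2]

# Map codes back to names, e.g. to report predictions
print([categories[c] for c in codes])
```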

## Pre-processing y

In [None]:
import numpy as np
# In recent Keras versions: from keras.utils import to_categorical
from keras.utils.np_utils import to_categorical
y = np.array([0, 1, 2])

# Change to categorical
y_prep = to_categorical(y)
print(y_prep)

# Exercise I: Prepare label vectors
In the video exercise, you learned the differences between binary classification and multi-class classification. You learned that there are some modifications to the data preparation process that need to be done before training the models.

In this exercise, you will prepare a raw dataset with labels given as text. The data is given as a `pandas.DataFrame` called `df`, with two columns: `text` with the text data and `label` with the label names. Your task is to make all the necessary transformations to the labels: change string to number and one-hot encode.

The module `pandas` as `pd` and the function `to_categorical()` from `keras.utils.np_utils` are already loaded in the environment, and the first lines of the dataset are printed on the console for you to see.

### Instructions 1/3

- Get the attribute `.cat.codes` of the column `label` of the DataFrame `df` and print its shape.

In [None]:
# Get the numerical ids of column label
numerical_ids = df.label.cat.codes

# Print initial shape
print(numerical_ids.shape)

### Instructions 2/3

- One-hot encode the vector using the `to_categorical()` function and store the results in `Y` while printing the new shape.

In [None]:
# Get the numerical ids of column label
numerical_ids = df.label.cat.codes

# Print initial shape
print(numerical_ids.shape)

# One-hot encode the indexes
Y = to_categorical(numerical_ids)

# Check the new shape of the variable
print(Y.shape)

### Instructions 3/3

- Print the first 5 rows of the variable `Y`.

In [None]:
# Get the numerical ids of column label
numerical_ids = df.label.cat.codes

# Print initial shape
print(numerical_ids.shape)

# One-hot encode the indexes
Y = to_categorical(numerical_ids)

# Check the new shape of the variable
print(Y.shape)

# Print the first 5 rows
print(Y[:5])

# Exercise II: Pre-process data

You learned the differences for pre-processing the data in the case of multi-class classification. Let's put that into practice by preprocessing the data in anticipation of creating a simple multi-class classification model.

The dataset is loaded in the variable `news_dataset`, and has the following attributes:

- `news_dataset.data`: array with texts
- `news_dataset.target`: array with target categories as numerical indexes

The sample data contains 5,000 observations.

### Instructions

- Instantiate the `Tokenizer` class on the `tokenizer` variable.
- Fit the `tokenizer` variable on the text data.
- Use the `.texts_to_sequences()` method on the text data.
- Use the `to_categorical()` function to prepare the target indexes.

In [None]:
# Create and fit tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(news_dataset.data)

# Prepare the data
prep_data = tokenizer.texts_to_sequences(news_dataset.data)
prep_data = pad_sequences(prep_data, maxlen=200)

# Prepare the labels
prep_labels = to_categorical(news_dataset.target)

# Print the shapes
print(prep_data.shape)
print(prep_labels.shape)
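
To show what tokenizing and padding do to the texts, the sketch below reproduces the two steps in plain Python on a made-up corpus. The helpers `texts_to_sequences` and `pad` are simplified stand-ins written for this example, not the Keras implementations. The padding follows the Keras defaults of adding zeros at the front and truncating from the front.

```python
from collections import Counter

# Hypothetical mini-corpus
texts = ["the cat sat", "the dog sat on the mat"]

# Index words by descending frequency, starting at 1 (0 is kept for padding)
counts = Counter(word for text in texts for word in text.lower().split())
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts):
    """Replace every known word by its integer index."""
    return [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]

def pad(sequences, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` items."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in sequences]

sequences = texts_to_sequences(texts)
padded = pad(sequences, maxlen=5)
print(word_index)
print(padded)  # [[0, 0, 1, 3, 2], [4, 2, 5, 1, 6]]
```

After padding, every row has the same length, which is what lets the data be stored as a single array of shape `(N, maxlen)`.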

# (2) Transfer learning for language models

## The idea behind transfer learning
Transfer learning:
- Start with better than random initial weights
- Use models trained on very big datasets
- "Open-source" data science models

## Available architectures
Base example: `I really loved this movie`

- Word2Vec
    - Continuous Bag of Words (CBOW) `X = [I, really, this, movie], y = loved`
    - Skip-gram `X = loved, y = [I, really, this, movie]`
- FastText `X = [I, rea, eal, all, lly, really, ...], y = loved`
    - Uses word and n-grams of chars
- ELMo `X = [I, really, loved, this], y = movie`
    - Uses words, with embeddings that depend on the context
    - Uses deep bidirectional language models (biLM)
- Word2Vec and FastText are available in the `gensim` package, and ELMo on `tensorflow_hub`
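
The `X`/`y` pairs listed above can be generated with a few lines of plain Python. This is only a sketch of how the training examples are formed, assuming a context window of 2 words on each side. The helper names are hypothetical, and the real FastText additionally wraps words in boundary markers and uses several n-gram lengths.

```python
sentence = ["I", "really", "loved", "this", "movie"]
window = 2  # assumed context size on each side

def context(tokens, i, window):
    """Words within `window` positions of index i, excluding the word itself."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

center = sentence.index("loved")
ctx = context(sentence, center, window)

# CBOW: predict the center word from its context
cbow_example = (ctx, sentence[center])
# Skip-gram: predict each context word from the center word
skipgram_examples = [(sentence[center], w) for w in ctx]

def char_ngrams(word, n=3):
    """Character n-grams of a word, the kind of sub-word unit FastText adds."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(cbow_example)           # (['I', 'really', 'this', 'movie'], 'loved')
print(skipgram_examples)
print(char_ngrams("really"))  # ['rea', 'eal', 'all', 'lly']
```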

## Example using Word2Vec

In [None]:
from gensim.models import word2vec
# Train the model (gensim 3.x argument names; gensim 4 renamed
# `size` to `vector_size` and `iter` to `epochs`)
w2v_model = word2vec.Word2Vec(tokenized_corpus, size=embedding_dim,
                                window=neighbor_words_num, iter=100)
# Get top 3 similar words to "captain"
w2v_model.wv.most_similar(["captain"], topn=3)
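
Similar words are ranked by the cosine similarity between their vectors. The sketch below illustrates the idea with NumPy on tiny invented vectors. The vocabulary, the values and the helper functions are all assumptions made for this example, and gensim's own implementation is more elaborate.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors (real models use many more dimensions)
vectors = {
    "captain": np.array([0.9, 0.1, 0.3]),
    "commander": np.array([0.8, 0.2, 0.4]),
    "ship": np.array([0.5, 0.5, 0.1]),
    "banana": np.array([-0.2, 0.9, -0.5]),
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

result = most_similar("captain", topn=2)
print(result)
```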

## Example using FastText

In [None]:
from gensim.models import fasttext
# Instantiate the model
ft_model = fasttext.FastText(size=embedding_dim, window=neighbor_words_num)
# Build vocabulary
ft_model.build_vocab(sentences=tokenized_corpus)
# Train the model
ft_model.train(sentences=tokenized_corpus,
                total_examples=len(tokenized_corpus),
                epochs=100)

# Exercise III: Transfer learning starting point

In this exercise you will see the benefit of using pre-trained vectors as a starting point for your model.

You will compare the accuracy of two models trained for two epochs. The architecture of the models is the same: one embedding layer, one LSTM layer with 128 units, and an output layer with 5 units, which is the number of classes in the sample data. The difference is that one model uses pre-trained vectors on the embedding layer (transfer learning) and the other doesn't.

The pre-trained vectors used were `GloVe` vectors with 200 dimensions. The accuracy histories of both models on the validation set are available in the variables `history_no_emb` and `history_emb`.

### Instructions

- Import module `matplotlib.pyplot` as `plt`.
- Add the list of accuracy from the model without embeddings to the plot.
- Add the list of accuracy from the model with embeddings to the plot.
- Display the plot using the method `.show()`.


In [None]:
# Import plotting package
import matplotlib.pyplot as plt

# Insert lists of accuracy obtained on the validation set
plt.plot(history_no_emb['acc'], marker='o')
plt.plot(history_emb['acc'], marker='o')

# Add extra descriptions to plot
plt.title('Learning with and without pre-trained embedding vectors')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['no_embeddings', 'with_embeddings'], loc='upper left')

# Display the plot
plt.show()

<p align='center'>
    <img src='image/[2021-02-14 120104].svg'>
</p>

# Exercise IV: Word2Vec

In this exercise you will load a pre-trained Word2Vec model with `gensim` and explore it.

The corpus used to pre-train the model is the script of all episodes of The Big Bang Theory TV show, divided sentence by sentence. It is available in the variable `bigbang`.

The text on the corpus was transformed to lower case and all words were tokenized. The result is stored in the `tokenized_corpus` variable.

A `Word2Vec` model was pre-trained using a window size of 10 words for context (5 before and 5 after the center word). Words with fewer than 3 occurrences were removed, and the skip-gram method was used with 50 dimensions. The model is saved in the file `bigbang_word2vec.model`.

The class `Word2Vec` is already loaded in the environment from `gensim.models.word2vec`.

### Instructions

- Load the pre-trained Word2Vec model.
- Store a `list` with the words `"bazinga", "penny", "universe", "spock", "brain"` in the variable `words_of_interest`, keeping them in that order.
- Iterate over each word of interest, use the `.most_similar()` method of the `wv` attribute, and append the top 5 similar words to `top5_similar_words` as a dictionary.
- Print the found top 5 words for each of the words of interest.


In [None]:
# Word2Vec model
w2v_model = Word2Vec.load('bigbang_word2vec.model')

# Selected words to check similarities
words_of_interest = ["bazinga", "penny", "universe", "spock", "brain"]

# Compute top 5 similar words for each of the words of interest
top5_similar_words = []
for word in words_of_interest:
    top5_similar_words.append(
      {word: [item[0] for item in w2v_model.wv.most_similar([word], topn=5)]}
    )

# Print the similar words
print(top5_similar_words)

# (3) Multi-class classification models

## Review of the Sentiment classification model

In [None]:
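# Import the model and layer classes
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
# Build and compile the model
model = Sequential()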
model.add(Embedding(10000, 128))
model.add(LSTM(128, dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

## Model architecture
The same architecture can be used:

In [None]:
# Build the model
model = Sequential()
model.add(Embedding(10000, 128))
model.add(LSTM(128, dropout=0.2))
# Output layer has 'num_classes' units and uses 'softmax'
model.add(Dense(num_classes, activation="softmax"))
# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
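
As a worked example of how little the architecture changes, the arithmetic below counts the trainable parameters of each layer, assuming the standard Keras parameter layout for `Embedding`, `LSTM` and `Dense`. The value `num_classes = 20` is an assumption chosen for illustration.

```python
vocab_size, embedding_dim, lstm_units = 10000, 128, 128

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

def dense_params(units):
    """Weights from every LSTM unit plus one bias per output unit."""
    return lstm_units * units + units

num_classes = 20  # hypothetical, e.g. one unit per news group
print(embedding_params)           # 1280000
print(lstm_params)                # 131584
print(dense_params(1))            # 129  (binary model)
print(dense_params(num_classes))  # 2580 (multi-class model)
```

Only the output layer grows with the number of classes, and the embedding and LSTM layers are unchanged.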

## 20 News Groups dataset
20 News Groups dataset:

- Available through `fetch_20newsgroups` in `sklearn.datasets`

In [None]:
# Import the function to load the data
from sklearn.datasets import fetch_20newsgroups
# Download train and test sets
news_train = fetch_20newsgroups(subset='train')
news_test = fetch_20newsgroups(subset='test')

The data has the following attributes:
- `news_train.DESCR`: Documentation.
- `news_train.data`: Text data.
- `news_train.filenames`: Path to the files on disk.
- `news_train.target`: Numerical index of the classes.
- `news_train.target_names`: Unique names of the classes.

## Pre-processing text data

In [None]:
# Import modules
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
# Create and fit the tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(news_train.data)
# Create the (X, Y) variables
X_train = tokenizer.texts_to_sequences(news_train.data)
X_train = pad_sequences(X_train, maxlen=400)
Y_train = to_categorical(news_train.target)

## Training on data
Train the model on training data

In [None]:
# Train the model
model.fit(X_train, Y_train, batch_size=64, epochs=100)
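# Prepare the test data with the tokenizer fitted on the training data
X_test = pad_sequences(tokenizer.texts_to_sequences(news_test.data), maxlen=400)
Y_test = to_categorical(news_test.target)
# Evaluate on test data
model.evaluate(X_test, Y_test)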
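
The accuracy reported during evaluation can be understood as comparing the most likely predicted class with the true class. The sketch below shows that computation with NumPy on invented probabilities. The arrays stand in for model predictions and one-hot test labels and do not come from the course data.

```python
import numpy as np

# Hypothetical softmax outputs for 4 samples and 3 classes
pred_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.3, 0.5, 0.2],
])
# Hypothetical one-hot test labels
Y_true = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
])

# The predicted class is the one with the highest probability
pred_classes = pred_probs.argmax(axis=1)
true_classes = Y_true.argmax(axis=1)

accuracy = float((pred_classes == true_classes).mean())
print(pred_classes)  # [0 2 1 1]
print(accuracy)      # 0.75
```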

# Exercise V: Exploring 20 News Groups dataset

In this exercise, you will be given a sample of the **20 News Groups** dataset obtained using the `fetch_20newsgroups()` function from `sklearn.datasets`, filtering only three classes: `sci.space`, `alt.atheism` and `soc.religion.christian`.

The dataset is loaded in the variable `news_dataset`. Its attributes are printed so you can explore them on the console.

For more details on how to use this function, see the [Sklearn documentation](https://scikit-learn.org/stable/datasets/index.html#the-20-newsgroups-text-dataset).

You will tokenize the texts and one-hot encode the labels step by step to understand how the transformations happen.

### Instructions 1/4

- Print the example article with index `5` from `news_dataset.data`.

In [None]:
# See example article
print(news_dataset.data[5])

### Instructions 2/4

- Transform the data into a sequence of numerical indexes.

In [None]:
# See example article
print(news_dataset.data[5])

# Transform the text into numerical indexes
news_num_indices = tokenizer.texts_to_sequences(news_dataset.data)

### Instructions 3/4

- Print the transformed example (index `5`) article.

In [None]:
# See example article
print(news_dataset.data[5])

# Transform the text into numerical indexes
news_num_indices = tokenizer.texts_to_sequences(news_dataset.data)

# Print the transformed example article
print(news_num_indices[5])

### Instructions 4/4

- Transform the labels into one-hot vectors using the function `to_categorical()` and print the original text label and the transformed one-hot vector at index `5` to see the transformed example.

In [None]:
# See example article
print(news_dataset.data[5])

# Transform the text into numerical indexes
news_num_indices = tokenizer.texts_to_sequences(news_dataset.data)

# Print the transformed example article
print(news_num_indices[5])

# Transform the labels into one-hot encoded vectors
labels_onehot = to_categorical(news_dataset.target)

# Check before and after for the sample article
print("Before: {0}\nAfter: {1}".format(news_dataset.target[5], labels_onehot[5]))
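
Going in the other direction, a one-hot vector can be mapped back to a class name with `argmax`. The snippet below is a small sketch of that step. The class order in `target_names` is assumed here and in practice should be read from `news_dataset.target_names`. The one-hot vector is also made up.

```python
import numpy as np

# Assumed class order; in practice read it from `news_dataset.target_names`
target_names = ["alt.atheism", "sci.space", "soc.religion.christian"]

# Hypothetical one-hot vector for a single article
onehot = np.array([0.0, 1.0, 0.0])

# The position of the 1 is the numerical class index
class_id = int(onehot.argmax())
print(class_id, target_names[class_id])  # 1 sci.space
```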

# Exercise VI: Classifying news articles

In this exercise you will create a multi-class classification model.

The dataset is already loaded in the environment as `news_novel`. Also, all the pre-processing of the training data is already done and `tokenizer` is also available in the environment.

An RNN model was pre-trained with the following architecture: an `Embedding` layer, one `LSTM` layer and an output `Dense` layer expecting three classes: `sci.space`, `alt.atheism`, and `soc.religion.christian`. The weights of this trained model are available in the `classify_news_weights.h5` file.

You will pre-process the new data in `news_novel` and evaluate the model on it.

### Instructions

- Transform the data present on `news_novel.data` using the loaded `tokenizer`.
- Pad the obtained sequences of numerical indexes.
- Transform the labels present on `news_novel.target` into a one-hot representation.
- Evaluate the model using the method `.evaluate()` and print the loss and accuracy obtained.


In [None]:
# Change text for numerical ids and pad
X_novel = tokenizer.texts_to_sequences(news_novel.data)
X_novel = pad_sequences(X_novel, maxlen=400)

# One-hot encode the labels
Y_novel = to_categorical(news_novel.target)

# Load the model pre-trained weights
model.load_weights('classify_news_weights.h5')

# Evaluate the model on the new dataset
loss, acc = model.evaluate(X_novel, Y_novel, batch_size=64)

# Print the loss and accuracy obtained
print("Loss:\t{0}\nAccuracy:\t{1}".format(loss, acc))