In [None]:
!pip install keras keras-hub matplotlib --upgrade -q

In [None]:
import os
os.environ["KERAS_BACKEND"] = "jax"

In [None]:
# @title
import os
from IPython.core.magic import register_cell_magic

@register_cell_magic
def backend(line, cell):
    current, required = os.environ.get("KERAS_BACKEND", ""), line.split()[-1]
    if current == required:
        get_ipython().run_cell(cell)
    else:
        print(
            f"This cell requires the {required} backend. To run it, change KERAS_BACKEND to "
            f"\"{required}\" at the top of the notebook, restart the runtime, and rerun the notebook."
        )

## Chapter 4 - Classification and regression
In this chapter, we will explore the fundamental concepts of classification and regression, two essential problems in supervised machine learning. We will discuss binary classification, multi-class classification, and at the end of the chapter, we will cover regression problems as well.

### 4.1 Binary Classification - IMDb Movie Reviews
Binary classification involves categorizing data into one of two classes. The example we are going to use is the sentiment analysis of IMDb movie reviews, where the goal is to classify reviews as either positive or negative. The reviews are pre-labeled and quite polarized, making it a suitable dataset for binary classification tasks. Let's start by loading the dataset:

In [None]:
from keras.datasets import imdb

# Every review is a list of word indices into a dictionary. With 'num_words=10000',
# we keep only the 10,000 most frequent words of that dictionary.
# The dictionary is ordered by frequency, with the most frequent words first.
# That means that the words dropped beyond the first 10k are rare ones that
# probably appear only once or twice, so they are not very descriptive of the
# 'sentiment' of the review.
num_words = 10000
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(
    num_words=num_words
)

print(train_data[0][:10]) # First 10 word indices of the first review
print(train_labels[0]) # Labels are in {0, 1}: 0 is negative, 1 is positive

# To decode a review, we can take the dictionary (word -> index):
word_index = imdb.get_word_index()

# Reverse it (index -> word):
reverse_word_index = {value: key for (key, value) in word_index.items()}

# And finally decode a review. The indices are offset by 3 because
# 0, 1 and 2 are reserved for "padding", "start of sequence" and "unknown":
decoded_review = " ".join(
    [reverse_word_index.get(i - 3, "?") for i in train_data[0]]
)
print(decoded_review)
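
The `i - 3` offset used when decoding can be checked on a toy vocabulary. The sketch below is only an illustration: `toy_word_index` and `toy_review` are made up and are not part of the IMDb data. It assumes the convention of `imdb.load_data` with its default arguments, where index 1 marks the start of a review and index 2 replaces unknown words.

```python
# Hypothetical word ranks, in the same spirit as imdb.get_word_index()
toy_word_index = {"the": 1, "movie": 2, "great": 3}
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

# A made-up encoded review: 1 marks the start, 2 marks an unknown word,
# every other index is a word rank shifted by 3.
toy_review = [1, 4, 5, 2, 6]

# Reserved indices map to negative ranks, which are not in the dictionary,
# so they are rendered as "?".
toy_decoded = " ".join(toy_reverse_index.get(i - 3, "?") for i in toy_review)
print(toy_decoded)
```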

Now, how can we build a model to classify these reviews? The input cannot be fed to a network as it is: the reviews are lists of integers of different lengths, which cannot be stacked directly into a tensor. We need to preprocess the text data. Two common approaches are:
1. Padding and truncating sequences to a fixed length. We then turn the data into an integer tensor of shape (num_samples, max_length), where each integer represents a word index.
2. Using multi-hot encoding to represent each review as a binary vector of size equal to the vocabulary size. The vector will have 1s at the indices corresponding to the words present in the text and 0s elsewhere. Beware, with this approach we lose the concept of word order. The neural network will not be able to distinguish between "The movie was great" and "Great was the movie".
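
For comparison, the first approach can be sketched in a few lines of plain Python. The helper `pad_or_truncate` and the toy sequences are hypothetical names introduced here for illustration. Padding and truncating at the end of the sequence is a choice of this sketch, not something prescribed above.

```python
def pad_or_truncate(sequence, max_length, pad_value=0):
    """Return a copy of `sequence` with exactly `max_length` items."""
    truncated = sequence[:max_length]
    # Fill the missing positions (if any) with the padding value
    return truncated + [pad_value] * (max_length - len(truncated))

# Two made-up reviews of different lengths
toy_sequences = [[1, 14, 22], [1, 7, 8, 9, 10, 11]]
padded = [pad_or_truncate(seq, max_length=4) for seq in toy_sequences]
print(padded)
```

Every row now has the same length, so the result can be turned into a tensor of shape (num_samples, max_length).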

We will use the second approach for this example. Here's how we can preprocess the data with NumPy:

In [None]:
import numpy as np

def multi_hot_encode(sequences, num_classes):
    results = np.zeros((len(sequences), num_classes))
    for i, sequence in enumerate(sequences):
        # Take the i-th row of results and set to one every
        # element whose index appears in the sequence
        results[i][sequence] = 1.0
    return results

train_data_processed = multi_hot_encode(train_data, num_classes=num_words)
test_data_processed = multi_hot_encode(test_data, num_classes=num_words)

print(train_data_processed[0])

train_labels = train_labels.astype("float32")
test_labels = test_labels.astype("float32")
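
The word-order caveat mentioned earlier is easy to verify. The following is a minimal pure-Python sketch: `toy_multi_hot` mirrors the idea of `multi_hot_encode` for a single sequence, and the word indices are made up.

```python
def toy_multi_hot(sequence, num_classes):
    # Start from all zeros and switch on the positions listed in the sequence
    vector = [0.0] * num_classes
    for index in sequence:
        vector[index] = 1.0
    return vector

# Hypothetical indices for "the movie was great" and "great was the movie"
review_a = [4, 5, 6, 7]
review_b = [7, 6, 4, 5]

encoded_a = toy_multi_hot(review_a, num_classes=10)
encoded_b = toy_multi_hot(review_b, num_classes=10)
print(encoded_a == encoded_b)
```

Both reviews end up with the same vector, so a model trained on this encoding cannot tell them apart. Repeated words are also collapsed, since a position is either 0 or 1.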

Now we are ready to build and train our model. When designing a model for binary classification, the last layer should have a single unit with a sigmoid activation function, which restricts the output to the range [0, 1] so that it can be read as a probability. Every element of the dataset is a 1D tensor, and for this class of problems a stack of Dense layers with relu activations works well.
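
To make the two activations concrete, here is a minimal sketch of what a single unit of a Dense layer computes. It is plain Python rather than the Keras implementation, and the inputs, weights and biases are made-up values, not trained ones.

```python
import math

def relu(x):
    # Negative values are clipped to zero, positive values pass through
    return max(0.0, x)

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def dense_unit(inputs, weights, bias, activation):
    """One unit of a dense layer: activation(dot(inputs, weights) + bias)."""
    pre_activation = sum(i * w for i, w in zip(inputs, weights)) + bias
    return activation(pre_activation)

toy_inputs = [1.0, 0.0, 1.0]     # a tiny multi-hot vector
toy_weights = [0.5, -1.0, -2.0]  # made-up values, not trained weights

# Hidden unit with relu, followed by an output unit with sigmoid
hidden = dense_unit(toy_inputs, toy_weights, bias=0.25, activation=relu)
probability = dense_unit([hidden], [1.0], bias=0.0, activation=sigmoid)
print(hidden, probability)
```

Here the pre-activation of the hidden unit is negative, so relu outputs 0, and the sigmoid of 0 is 0.5: the output unit is exactly undecided.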

We still have to decide on the number of layers and the number of units per layer. As a rough guide, we can think of every layer as learning a representation of the input data. Thanks to the non-linearity of the relu activation, stacked layers can learn complex representations. The more units a layer has, the richer the representation it can learn. However, having too many units can lead to overfitting: the model learns unwanted patterns in the training data that do not generalize well to unseen data.

We will use two layers with 16 units each. The reason for this choice will become clear in the next chapters.

In [None]:
import keras
from keras import layers

model = keras.Sequential(
    [
        layers.Dense(16, activation="relu"),
        layers.Dense(16, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ]
)
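
To get a feeling for the size of this model, we can count its trainable parameters by hand. This is a small sketch, assuming each Dense layer has one weight per input-unit pair plus one bias per unit (the Keras default) and that the input vectors have `num_words = 10000` entries. The helper `dense_param_count` is a name introduced here.

```python
def dense_param_count(input_dim, units):
    # One weight per (input, unit) pair plus one bias per unit
    return input_dim * units + units

input_dim = 10000          # size of each multi-hot vector (num_words)
layer_units = [16, 16, 1]  # units of the three Dense layers

total = 0
for units in layer_units:
    count = dense_param_count(input_dim, units)
    print(f"Dense({units}): {count} parameters")
    total += count
    input_dim = units  # the next layer consumes this layer's output
print("Total:", total)
```

Almost all the parameters sit in the first layer, because it is the one connected to the 10,000-dimensional input.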

Now we have to compile the model. For binary classification, we typically use the **binary cross-entropy** loss function, which measures the difference between the predicted probability distribution and the ground truth distribution. The binary cross-entropy loss is defined as:
$$
L = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log(p_i) + (1 - y_i) \log(1 - p_i)]
$$

What does this formula mean? Cross-entropy is a measure of dissimilarity between two probability distributions, here the true labels \(y_i\) and the predicted probabilities \(p_i\). If we plug the same distribution into both arguments, for example the true labels, the formula gives the entropy of that distribution. The cross-entropy is always greater than or equal to the entropy, with equality only when the two distributions are identical. Therefore, the closer the predictions get to the true labels, the lower the loss, and minimizing the cross-entropy leads to better predictions. For more details on binary cross-entropy, refer to this [blog post](https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a/).
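
The formula can be evaluated by hand on a few made-up values. The sketch below is a direct pure-Python translation of the definition, not the Keras implementation. The clipping with `epsilon` is an assumption added here to avoid computing `log(0)`.

```python
import math

def binary_cross_entropy(labels, probabilities, epsilon=1e-7):
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, epsilon), 1.0 - epsilon)  # keep p away from 0 and 1
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / len(labels)

toy_labels = [1.0, 0.0, 1.0, 1.0]

good_predictions = [0.9, 0.1, 0.8, 0.6]     # mostly right
unsure_predictions = [0.5, 0.5, 0.5, 0.5]   # always undecided
wrong_predictions = [0.1, 0.9, 0.2, 0.4]    # mostly wrong

loss_good = binary_cross_entropy(toy_labels, good_predictions)
loss_unsure = binary_cross_entropy(toy_labels, unsure_predictions)
loss_wrong = binary_cross_entropy(toy_labels, wrong_predictions)
print(loss_good, loss_unsure, loss_wrong)
```

The loss grows as the predictions move away from the labels, and a model that always answers 0.5 gets a loss of log(2), about 0.693.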


Back to the model compilation, we will use the Adam optimizer, which is a solid default choice for most problems. We will additionally monitor the accuracy metric during training.

It is also important to set aside a validation set to monitor the model's performance on unseen data during training. We can do this by manually splitting the training data into a smaller training set and a validation set, or we can use the `validation_split` parameter of the `fit` method, which holds out the last fraction of the training samples before any shuffling.

In [None]:
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

history = model.fit(
    train_data_processed,
    train_labels,
    epochs=20,
    batch_size=512,
    validation_split=0.2, # Alternative to validation_data=(val_data, val_label)
)
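
It helps to know how many samples and batches each epoch involves. The arithmetic below is a sketch: it relies on the IMDb training set containing 25,000 reviews, and the exact rounding of the split point is an assumption of this example.

```python
import math

num_samples = 25000      # size of the IMDb training set
validation_split = 0.2
batch_size = 512

# Assumption of this sketch: the split point is the training share, rounded down
split_at = int(num_samples * (1 - validation_split))
num_train = split_at
num_val = num_samples - split_at

# The last batch of an epoch may be smaller than batch_size, hence the ceiling
steps_per_epoch = math.ceil(num_train / batch_size)
print(num_train, num_val, steps_per_epoch)
```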

We can check what happened during training by plotting the training and validation loss and accuracy over the epochs. The `History` object returned by the `fit` method gives us access to this information.

In [None]:
import matplotlib.pyplot as plt

history_dict = history.history
print("History labels:", history_dict.keys())

loss = history_dict["loss"]
val_loss = history_dict["val_loss"]
epochs = range(1, len(loss) + 1)
plt.plot(epochs, loss, "r--", label="Training loss")
plt.plot(epochs, val_loss, "b", label="Validation loss")
plt.title("[IMDB] Training and validation loss")
plt.xlabel("Epochs")
plt.xticks(epochs)
plt.ylabel("Loss")
plt.legend()
plt.show()

In [None]:
# Clears the figure
plt.clf()
acc = history_dict["accuracy"]
val_acc = history_dict["val_accuracy"]
plt.plot(epochs, acc, "r--", label="Training acc")
plt.plot(epochs, val_acc, "b", label="Validation acc")
plt.title("[IMDB] Training and validation accuracy")
plt.xlabel("Epochs")
plt.xticks(epochs)
plt.ylabel("Accuracy")
plt.legend()
plt.show()

In the plots, we can see that the training loss decreases over epochs, indicating that the model is learning from the training data. However, the validation loss starts to increase after a certain number of epochs, indicating that the model is **overfitting** to the training data. It means that the model is learning patterns that are specific to the training data, maybe due to spurious correlations, and does not generalize well to unseen data. We will see techniques to mitigate overfitting in the next chapters. Here we can see that after 4 epochs the model starts to overfit. Therefore, we can choose to stop training after 4 epochs to get the best performance on unseen data.
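
Instead of reading the best epoch off the plot, we can also compute it. The values in `toy_val_loss` are made up for illustration only and are not results from this notebook. In the notebook, the same expression could be applied to `history_dict["val_loss"]`.

```python
# Made-up validation losses, one per epoch (illustration only)
toy_val_loss = [0.52, 0.38, 0.31, 0.29, 0.30, 0.33, 0.37]

# Index of the smallest validation loss; epochs are counted from 1
best_index = min(range(len(toy_val_loss)), key=lambda i: toy_val_loss[i])
best_epoch = best_index + 1
print("Lowest validation loss at epoch", best_epoch)
```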

Finally, we train a fresh model from scratch for 4 epochs, this time on the whole training set, and evaluate it on the test set to see how well it generalizes to unseen data:

In [None]:
model = keras.Sequential(
    [
        layers.Dense(16, activation="relu"),
        layers.Dense(16, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ]
)
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
model.fit(train_data_processed, train_labels, epochs=4, batch_size=512)
results = model.evaluate(test_data_processed, test_labels)
print("Result loss on test data:", results[0])
print("Result accuracy on test data:", results[1])

Let's test the model with a new review written from scratch:

In [None]:
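import string

review: str = "I don't think I ever saw a movie like this one. It was extremely boring and long! I fell asleep around 13 times!"

def word_to_index(word):
    # The dictionary keys are lowercase, so lowercase the word and strip surrounding punctuation
    rank = word_index.get(word.lower().strip(string.punctuation))
    # Same convention as imdb.load_data: ranks are offset by 3, and words that are
    # unknown or fall outside the 'num_words' vocabulary become 2 (out-of-vocabulary)
    return rank + 3 if rank is not None and rank + 3 < num_words else 2

# 1 is the start-of-sequence index
indicized_review = [1] + [word_to_index(word) for word in review.split()]

# Let's see what the decoded review looks like once out-of-vocabulary words are replaced by "?"
decoded_review = " ".join(
    [reverse_word_index.get(i - 3, "?") for i in indicized_review]
)
print(decoded_review)

tokenized_review = multi_hot_encode([indicized_review], num_words) # [indicized_review] because the function expects a list of sequences
prediction = model.predict(tokenized_review)
print("Prediction:", prediction)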
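
The model outputs a probability, so we still need a decision rule to turn it into a label. A common choice is a threshold of 0.5, which is sketched below. The helper `to_sentiment` and the probabilities are hypothetical. In the notebook, the value to test would be something like `float(prediction[0][0])`, assuming the prediction has shape (1, 1).

```python
def to_sentiment(probability, threshold=0.5):
    # Probabilities at or above the threshold are read as class 1 (positive)
    return "positive" if probability >= threshold else "negative"

toy_probabilities = [0.03, 0.47, 0.92]  # made-up model outputs
toy_sentiments = [to_sentiment(p) for p in toy_probabilities]
print(toy_sentiments)
```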

We can now experiment with some changes. For example, we can change the number of units in each layer, or add more layers, and see whether the model's performance improves. However, larger models should be used with caution, since they overfit more easily, especially with a small dataset. We can also experiment with different activation functions, optimizers, learning rates, and loss functions to see how they affect the model's performance. The cell below shrinks the second layer from 16 to 4 units:

In [None]:
model = keras.Sequential(
    [
        layers.Dense(16, activation="relu"),
        layers.Dense(4, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ]
)
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"]
)
history = model.fit(
    x=train_data_processed,
    y=train_labels,
    epochs=5,
    batch_size=512,
    validation_split=0.2
)

history_dict = history.history
print("History labels:", history_dict.keys())

loss = history_dict["loss"]
val_loss = history_dict["val_loss"]
epochs = range(1, len(loss) + 1)
plt.subplot(121)
plt.plot(epochs, loss, "r--", label="Training loss")
plt.plot(epochs, val_loss, "b", label="Validation loss")
plt.title("Training and validation loss")
plt.xlabel("Epochs")
plt.xticks(epochs)
plt.ylabel("Loss")
plt.legend()

plt.subplot(122)
acc = history_dict["accuracy"]
val_acc = history_dict["val_accuracy"]
plt.plot(epochs, acc, "r--", label="Training acc")
plt.plot(epochs, val_acc, "b", label="Validation acc")
plt.title("Training and validation accuracy")
plt.xlabel("Epochs")
plt.xticks(epochs)
plt.ylabel("Accuracy")
plt.legend()
plt.show()

results = model.evaluate(test_data_processed, test_labels)
print("Result loss on test data:", results[0])
print("Result accuracy on test data:", results[1])