<div class="alert alert-success"><h1>Building a More Robust Deep Learning Model in Python</h1></div>

Deep learning models can often be made more stable, efficient, and generalizable by introducing techniques that address common training pitfalls. Methods like **Batch Normalization** mitigate internal covariate shift, **Gradient Clipping** keeps gradients from exploding, **Early Stopping** helps prevent overfitting, and **Learning Rate Scheduling** adjusts the step size of the optimizer over time. In this tutorial, we will illustrate how to apply each of these techniques in the context of the MNIST digit classification task using Keras.

## Learning Objectives
By the end of this tutorial, you will:
+ Understand how to use Batch Normalization to stabilize and accelerate training.
+ Learn how to use Gradient Clipping to control exploding gradients.
+ Apply Early Stopping to preserve your best-trained model and prevent overfitting.
+ Explore Learning Rate Scheduling to dynamically adjust the learning rate based on validation performance.
+ Combine all these techniques to build a more robust and accurate deep learning model compared to a baseline approach.


## Prerequisites
Before we begin, ensure you have:
+ Basic knowledge of Python programming (variables, functions, classes).
+ Familiarity with the fundamentals of how to build a deep learning model in Python using Keras.
+ A Python (version 3.x) environment with the `tensorflow`, `keras`, and `matplotlib` packages installed.

<div class="alert alert-info"><b>Note:</b> To learn more about deep learning and how to build a deep learning model using Keras in Python, refer to the LinkedIn Learning course titled <b>"Deep Learning with Python: Foundations"</b>.</div>

<div class="alert alert-success"><h2>1. Import and Preprocess the Data</h2></div>

We start by importing the data. For this tutorial, we'll use the **MNIST dataset**, a classic dataset in the machine learning community. It consists of 70,000 grayscale images of handwritten digits ranging from 0 to 9. Each image is 28 x 28 pixels, and the dataset is divided into 60,000 training images and 10,000 testing images. Our goal is to develop a model that learns to correctly identify a handwritten digit from its image.

In [None]:
from tensorflow import keras

keras.utils.set_random_seed(1234)
(train_images, train_labels), (test_images, test_labels) = keras.datasets.mnist.load_data()

Our deep learning model expects the images as a vector of size 784 (i.e. 28 $\times$ 28). So, let's flatten the images.

In [None]:
train_images = train_images.reshape(60000, 28 * 28)
test_images = test_images.reshape(10000, 28 * 28)

The model also expects the image pixel values scaled. Let's do that as well.

In [None]:
train_images = train_images.astype('float32') / 255
test_images = test_images.astype('float32') / 255

Finally, we also need to one-hot encode the image labels.

In [None]:
num_classes = 10
train_labels = keras.utils.to_categorical(train_labels, num_classes)
test_labels = keras.utils.to_categorical(test_labels, num_classes)
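
To make the three preprocessing steps concrete, the following minimal sketch reproduces them with NumPy on a tiny made-up batch. The arrays `fake_images` and `fake_labels` and the helper `one_hot` are hypothetical and exist only for illustration; the tutorial itself relies on `keras.utils.to_categorical` for the encoding.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Illustrative stand-in for keras.utils.to_categorical."""
    encoded = np.zeros((len(labels), num_classes), dtype="float32")
    # Set a single 1.0 per row, in the column given by the label.
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

# Hypothetical batch of two 28 x 28 "images" with pixel values in 0..255.
fake_images = np.full((2, 28, 28), 255, dtype="uint8")
fake_labels = np.array([3, 7])

flat = fake_images.reshape(2, 28 * 28)   # flatten each image to 784 values
scaled = flat.astype("float32") / 255    # rescale pixel values to [0, 1]
encoded = one_hot(fake_labels, 10)       # one row of 10 values per label

print(flat.shape, scaled.max(), encoded[0])
```

Each label becomes a vector of length 10 with a single 1 at the position of the digit, which is the target format that the `categorical_crossentropy` loss expects.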

<div class="alert alert-success"><h2>2. Define the Model with Batch Normalization</h2></div>

Our model consists of an input layer with 784 nodes, two hidden layers with 512 and 128 nodes (respectively), and an output layer with 10 nodes. After each hidden layer, we will normalize that layer's outputs before feeding them into the next layer. This is known as **Batch Normalization**. Batch Normalization can stabilize the training of a deep learning model and help it converge faster.

To apply Batch Normalization to a model, we simply add a `BatchNormalization()` layer to the model architecture.

In [None]:
from keras.layers import Input, Dense, BatchNormalization

model = keras.Sequential([
    Input(shape = (784,)),
    Dense(512, activation = 'relu'),
    BatchNormalization(),
    Dense(128, activation = 'relu'),
    BatchNormalization(),
    Dense(10, activation = 'softmax')
])
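
To build intuition for what a `BatchNormalization()` layer computes during training, here is a minimal NumPy sketch of the forward pass. It is a simplification, not the Keras implementation: the real layer learns `gamma` and `beta` and also keeps moving averages of the mean and variance for use at inference time. The function name `batch_norm_forward`, the sample data, and the `epsilon` value are assumptions made for this illustration.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, epsilon=1e-3):
    """Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)          # per-feature mean over the batch
    variance = x.var(axis=0)       # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(variance + epsilon)
    return gamma * x_hat + beta    # gamma and beta are learned in practice

# Hypothetical activations: a batch of 128 samples with 4 features each,
# deliberately shifted and scaled away from zero mean and unit variance.
rng = np.random.default_rng(0)
activations = rng.normal(loc=5.0, scale=3.0, size=(128, 4))

normalized = batch_norm_forward(activations, gamma=np.ones(4), beta=np.zeros(4))
print(normalized.mean(axis=0).round(3), normalized.std(axis=0).round(3))
```

With `gamma` set to ones and `beta` set to zeros, each feature of the output has a mean close to 0 and a standard deviation close to 1, regardless of the scale of the incoming activations.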

<div class="alert alert-success"><h2>3. Compile the Model with Gradient Clipping</h2></div>

Now that we've defined our model's architecture, let's compile it by specifying the optimizer, the loss function, and the performance metric to track. We choose to use the `Adam()` optimizer for our model. By default, optimizers don’t impose any bounds on gradients. However, large gradients can cause a model’s parameters to fluctuate significantly during training and hamper convergence. **Gradient Clipping** mitigates this issue by limiting the magnitude (or norm) of gradients.

To implement Gradient Clipping, we set the `clipnorm` argument within the `Adam()` optimizer to `1.0`. With this setting, the gradient of each weight is rescaled whenever its L2 norm exceeds 1.0, so that the norm never exceeds 1.0 during training.

In [None]:
from keras.optimizers import Adam

model.compile(
    optimizer = Adam(clipnorm = 1.0),
    loss = 'categorical_crossentropy',
    metrics = ['accuracy']
)

Note that we can adjust the `clipnorm` value as we see fit based on our dataset or problem. Alternatively, we could have used the `clipvalue` argument to clip gradients by value instead of by norm.
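
The difference between the two clipping modes can be sketched with NumPy. This is an illustration of the idea rather than the optimizer's internal code, and the helper names `clip_by_norm` and `clip_by_value` as well as the example gradient are assumptions.

```python
import numpy as np

def clip_by_norm(grad, clipnorm):
    """Rescale grad so that its L2 norm is at most clipnorm."""
    norm = np.linalg.norm(grad)
    if norm > clipnorm:
        return grad * (clipnorm / norm)
    return grad

def clip_by_value(grad, clipvalue):
    """Clamp every element of grad to the range [-clipvalue, clipvalue]."""
    return np.clip(grad, -clipvalue, clipvalue)

grad = np.array([3.0, 4.0])           # hypothetical gradient with L2 norm 5

by_norm = clip_by_norm(grad, 1.0)     # approximately [0.6, 0.8]
by_value = clip_by_value(grad, 1.0)   # [1.0, 1.0]

print(by_norm, np.linalg.norm(by_norm))
print(by_value)
```

Clipping by norm shrinks the whole vector and preserves its direction, whereas clipping by value clamps each element independently and can change the direction of the update.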

<div class="alert alert-success"><h2>4. Train the Model using Callbacks</h2></div>

In Keras, a **callback** is an object that can perform custom actions at specific points during the training process, such as at the end of each epoch or batch. One of the most commonly used callbacks is **EarlyStopping**, which monitors a particular metric (e.g., validation loss) and stops training if that metric does not improve after a specified number of epochs (known as patience). This helps prevent overfitting by halting training when the model has reached its optimal performance on the validation set.

In [None]:
from keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    monitor = 'val_loss', 
    patience = 3, 
    restore_best_weights = True
)   

+ `monitor = 'val_loss'`: This tells EarlyStopping to track the validation loss. You can change this to `'val_accuracy'` or another relevant metric if preferred.
+ `patience = 3`: If the monitored metric does not improve for 3 consecutive epochs, training will stop. This patience value can be tuned based on how quickly your metric tends to plateau.
+ `restore_best_weights = True`: Once EarlyStopping concludes training, the model automatically reverts to the weights from the best-performing epoch (lowest validation loss). This ensures you retain the optimal model parameters even if subsequent epochs lead to worse performance.
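
The patience rule described above can be sketched in plain Python. This is a simplified model of the callback's bookkeeping, not the Keras source: it treats any decrease in the loss as an improvement and ignores options such as `min_delta`. The function name `early_stopping_epoch` and the validation losses are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both counted from 0."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # patience exhausted: stop here
    return len(val_losses) - 1, best_epoch  # ran through every epoch

# Hypothetical validation losses; the final value is never reached.
val_losses = [0.30, 0.20, 0.15, 0.16, 0.17, 0.18, 0.10]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

In this example the loss last improves at index 2 and training stops at index 5, after three epochs without improvement. With `restore_best_weights = True`, the weights from index 2 would be the ones kept.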

Another commonly used callback in Keras is `ReduceLROnPlateau`, which implements a **Learning Rate Scheduling** strategy. With this callback, the learning rate is automatically reduced when a monitored metric (e.g., validation loss) has stopped improving. Reducing the learning rate once progress stalls lets the optimizer take smaller, more precise steps, which often leads to better convergence. This is especially useful toward the later stages of training, when larger steps (learning rates) might cause oscillations or prevent fine-tuning of the model parameters.

In [None]:
from keras.callbacks import ReduceLROnPlateau

lrate_schedule = ReduceLROnPlateau(
    monitor = 'val_loss',
    factor = 0.1, 
    patience = 2,
    min_lr = 1e-6
)

+ `monitor = 'val_loss'`: Similar to EarlyStopping, this callback watches the validation loss for improvements. You can also monitor other metrics such as `'val_accuracy'`.
+ `factor = 0.1`: Every time the monitored metric fails to improve after a specified patience period, the current learning rate is multiplied by `0.1`. For instance, if your learning rate was $0.001$, it will be reduced to $0.0001$.
+ `patience = 2`: This tells the callback to wait for 2 consecutive epochs of no improvement in validation loss before reducing the learning rate.
+ `min_lr = 1e-6`: Ensures that the learning rate never goes below $10^{-6}$. This prevents the learning rate from becoming so small that training effectively stalls.
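
To see how these four arguments interact, the sketch below simulates the schedule in plain Python. It is a simplified model of the behavior described above, not the Keras source: it ignores the callback's `cooldown` and `min_delta` options. The function name `simulate_reduce_lr`, the starting learning rate, and the validation losses are assumptions made for illustration.

```python
def simulate_reduce_lr(val_losses, initial_lr=1e-3, factor=0.1,
                       patience=2, min_lr=1e-6):
    """Return the learning rate in effect during each epoch."""
    lr = initial_lr
    best_loss = float("inf")
    wait = 0  # number of consecutive epochs without improvement
    rates = []
    for loss in val_losses:
        rates.append(lr)  # rate used while training this epoch
        if loss < best_loss:
            best_loss, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)  # never drop below min_lr
                wait = 0
    return rates

# Hypothetical validation losses with two plateaus.
val_losses = [0.30, 0.20, 0.21, 0.22, 0.19, 0.20, 0.21]
schedule = simulate_reduce_lr(val_losses)
print(schedule)
```

Here the loss fails to improve at indices 2 and 3, so the rate drops by a factor of 10 starting with the epoch at index 4. A long run without improvement would keep reducing the rate until it reaches `min_lr`.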

After defining our callbacks, we can combine them into a list and pass them to the `callbacks` argument of the `fit()` method. This way, both callbacks are invoked at the end of each training epoch, helping us avoid overfitting and adjust the model’s learning rate for improved convergence.

In [None]:
my_callbacks = [early_stopping, lrate_schedule]

history = model.fit(
    train_images, 
    train_labels,
    epochs = 20,
    validation_split = 0.1,
    batch_size = 128,
    callbacks = my_callbacks
)

Notice that even though we specified 20 epochs within the `fit()` method, the training process stopped early. Early Stopping detected that the validation loss had not improved for the configured patience period of 3 epochs, and so it halted training before reaching epoch 20.

Let's plot the training and validation loss metrics to get a better sense of this.

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize = (8, 6))
plt.plot(history.history['loss'], label = 'Training Loss', marker = 'o')
plt.plot(history.history['val_loss'], label = 'Validation Loss', marker = 's')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()
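
Alongside the plot, it can help to read the key numbers directly from the training history. The sketch below is one possible way to do that; the helper `summarize_history` and the values in `example_history` are hypothetical, standing in for the dictionary that `history.history` provides after training.

```python
def summarize_history(history_dict):
    """Report how many epochs ran and which one had the lowest val_loss."""
    val_losses = history_dict["val_loss"]
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return {
        "epochs_run": len(val_losses),
        "best_epoch": best_index + 1,  # epochs are reported starting at 1
        "best_val_loss": val_losses[best_index],
    }

# Made-up values for illustration only; not results from this tutorial.
example_history = {
    "loss": [0.25, 0.10, 0.07, 0.05, 0.04, 0.03],
    "val_loss": [0.12, 0.09, 0.08, 0.085, 0.09, 0.088],
}

summary = summarize_history(example_history)
print(summary)
```

In the notebook, calling `summarize_history(history.history)` would report the epoch whose weights `restore_best_weights = True` brings back.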

Having added batch normalization, gradient clipping, early stopping, and learning rate scheduling to our training process, we have introduced multiple methods that help stabilize and optimize our deep learning model’s behavior. Each of these techniques addresses a different potential pitfall in model training, from exploding gradients to overfitting. Together, they yield a model that converges more reliably and generalizes better to unseen data.