---
# Optimisation Techniques for Machine Learning - Part 3

#### Program: `Deep Learning Indabax, Ghana, 2025`
#### 🏫 Institution: AIMS-RIC and ACITY
#### 📅 Date: `16th June, 2025`

---

##### 👨‍🏫 Facilitator: Ishaya, Jeremiah Ayock & Toufiq Musah       

**Lecturer and Researcher in Machine Learning**  

✉️ Email: [Jeremiah Ayock Ishaya](mailto:ayockishaya1029@gmail.com)  
🔗 LinkedIn: [Jeremiah](https://www.linkedin.com/in/jeremiah-ayock-ishaya-a49a9999/)  

✉️ Email: [Toufiq Musah](mailto:toufiqmusah32@gmail.com)  
🔗 LinkedIn: [Toufiq](https://www.linkedin.com/in/toufiqmusah/)

---

> 💡 *Optimization is not just math, it is the engine behind breakthroughs in modern AI.*

---

### 🛠️ Tools and Frameworks Used

- Python 3.x
- TensorFlow 2.x / Keras
- Optuna for Hyperparameter Tuning
- Matplotlib / Seaborn for Visualization

---

#### Learning Objectives

By the end of this session, participants will be able to:


1. **Understand the Role of Optimization in Deep Learning**  

   - Explain how different optimizers (SGD, Momentum, RMSProp, Adam) affect training dynamics.  
   - Recognize the trade-offs between convergence speed, stability, and generalization.

2. **Compare and Evaluate Optimizers Experimentally**  

   - Implement multiple optimizers in TensorFlow/Keras and assess their performance using loss/accuracy curves.  
   - Diagnose underfitting/overfitting and interpret training behavior based on optimizer choice.

3. **Apply Learning Rate Scheduling Strategies**  

   - Integrate learning rate schedules such as `ExponentialDecay`, `CosineDecay`, and `OneCycleLR` in training pipelines.  
   - Visualize and interpret how learning rate dynamics impact convergence and final performance.

4. **Use Regularization Techniques to Improve Generalization**  

   - Apply dropout and L2 regularization in CNNs to reduce overfitting.  
   - Analyze the effect of regularization strategies on validation accuracy and loss.

5. **Perform Hyperparameter Tuning Using Optuna**  

   - Define an objective function and search space for tuning optimizers and regularization parameters.  
   - Run automated hyperparameter optimization and analyze results.

6. **Engage in Hands-on Practice**  

   - Complete guided exercises to solidify understanding of each concept.  
   - Collaborate on an integrated optimization challenge using best practices.

---


#### Optimization Areas in Neural Networks




| Area                 | Description                                                  |
|---------------------|--------------------------------------------------------------|
| Optimizer Choice     | Affects convergence speed and generalization                 |
| Learning Rate        | Most sensitive hyperparameter; needs careful tuning or scheduling |
| Batch Size           | Impacts training stability and generalization               |
| Weight Initialization| Affects early training dynamics                             |
| Regularization       | Prevents overfitting via L1, L2, Dropout                    |
| Early Stopping       | Reduces overfitting and computational cost                  |
| Architecture Tuning  | Number of layers, units, filters                            |
| Data Augmentation    | Improves generalization from limited data                  |
| Gradient Clipping    | Stabilizes training by limiting extreme updates             |
| Normalization Layers | E.g., ``BatchNorm`` speeds up training and improves stability   |
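
As one concrete illustration of the table, the gradient clipping row can be sketched in a few lines of NumPy. This is a minimal sketch of norm-based clipping for a single gradient vector; the function name `clip_by_norm` and the threshold are assumptions made for illustration, not part of the notebook's code.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale `grad` so its L2 norm is at most `max_norm` (direction is kept)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

large = np.array([3.0, 4.0])   # L2 norm 5, above the threshold
small = np.array([0.3, 0.4])   # L2 norm 0.5, below the threshold

print(clip_by_norm(large, max_norm=1.0))  # shrunk to unit norm
print(clip_by_norm(small, max_norm=1.0))  # left unchanged
```

Keras optimizers expose clipping through constructor arguments such as `clipnorm`, so in practice this logic rarely needs to be written by hand.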


### Step 1: Importing Libraries of Interest

In [None]:
import optuna
import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.optimizers import SGD, Adam  
from tensorflow.keras.metrics import SparseCategoricalAccuracy
from tensorflow.keras.losses import SparseCategoricalCrossentropy  
from tensorflow.keras import layers, models, optimizers, callbacks, regularizers

### Step 2: Data Loading & Preprocessing

**Objective:** Load and preprocess the CIFAR-10 dataset.

**Reason:** CIFAR-10 is widely used for benchmarking image classification.

In [None]:
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

In [None]:
# Visualize some samples from the training set (labels are integer class IDs)

plt.figure(figsize=(10,10))
for i in range(20):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(x_train[i])
    plt.xlabel(y_train[i][0])
plt.show()

In [None]:
x_train = x_train / 255.0
x_test = x_test / 255.0

In [None]:
# Normalize using mean and standard deviation 
mean = np.mean(x_train, axis=(0, 1, 2),
               keepdims=True)
std = np.std(x_train, axis=(0, 1, 2),
             keepdims=True)

x_train = (x_train - mean) / std
x_test = (x_test - mean) / std
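
The standardization step above can be checked on synthetic data. The sketch below uses random arrays as a stand-in for CIFAR-10 (the names `fake_train` and `fake_test` and their shapes are assumptions) and shows that the statistics are computed per colour channel on the training split only, then reused for the test split.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for image batches shaped (N, H, W, C), values in [0, 1)
fake_train = rng.random((100, 8, 8, 3))
fake_test = rng.random((20, 8, 8, 3))

# One mean and one standard deviation per colour channel, from training data only
channel_mean = fake_train.mean(axis=(0, 1, 2), keepdims=True)
channel_std = fake_train.std(axis=(0, 1, 2), keepdims=True)

fake_train_norm = (fake_train - channel_mean) / channel_std
fake_test_norm = (fake_test - channel_mean) / channel_std  # reuse training statistics

print(channel_mean.shape)                      # broadcastable over (N, H, W, C)
print(fake_train_norm.mean(axis=(0, 1, 2)))    # close to 0 for each channel
print(fake_train_norm.std(axis=(0, 1, 2)))     # close to 1 for each channel
```

Reusing the training statistics on the test split keeps test data from influencing preprocessing, which is why the test set will not be exactly zero-mean.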

### Step 3: Define the CNN Model

**Objective:** Define a deep neural network using Keras for CIFAR-10 image classification.

**Includes:**

- CNN architecture
- Loss function: Cross-entropy
- Evaluation metric: Accuracy

**Rationale:** Establish a controlled baseline model for evaluating how different optimizers, schedulers, and regularization techniques affect training.


In [None]:
def build_base_model():
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), 
                      activation='relu', 
                      padding='same', 
                      input_shape=(32, 32, 3)),
        layers.BatchNormalization(),
        layers.Conv2D(64, (3, 3), 
                      activation='relu',
                      padding='same'),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(128, 
                     activation='relu'),
        layers.BatchNormalization(),
        layers.Dropout(0.5),
        layers.Dense(10, 
                     activation='softmax')
    ])
    return model

In [None]:
loss_fn = SparseCategoricalCrossentropy()
metric = SparseCategoricalAccuracy()

### Student Task 1

Modify the model above to:

1. Add one more `Conv2D` + `BatchNorm` block.

2. Use `ReLU` for all activations.

3. Use `Dropout(0.3)` instead of `Dropout(0.25)` in the first dropout layer.

4. Consider how these changes may affect the model's learning capacity.

In [None]:
# Block of code here 


### 2. Optimization Comparison

**Objective:** Compare different optimizers: SGD, Momentum, Adam, RMSProp.

**Rationale:** Different optimizers converge at different rates and can settle in different local minima. Practitioners need to know which optimizer best fits their task.
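
To make the differences concrete before training a full network, the sketch below applies the textbook update rules for the four optimizers to a small ill-conditioned quadratic in NumPy. The quadratic, the step count, and all hyperparameters are illustrative assumptions; the Keras implementations may differ in details such as where epsilon is applied.

```python
import numpy as np

CURVATURE = np.array([1.0, 25.0])  # assumed toy problem: one flat and one steep direction

def loss(w):
    return 0.5 * np.sum(CURVATURE * w ** 2)

def grad(w):
    return CURVATURE * w

def sgd(w, g, state, t, lr=0.02):
    return w - lr * g

def momentum(w, g, state, t, lr=0.02, mu=0.9):
    # Velocity accumulates an exponentially decaying sum of past gradients
    state["v"] = mu * state.get("v", 0.0) - lr * g
    return w + state["v"]

def rmsprop(w, g, state, t, lr=0.05, rho=0.9, eps=1e-7):
    # Running average of squared gradients rescales each coordinate
    state["s"] = rho * state.get("s", 0.0) + (1 - rho) * g ** 2
    return w - lr * g / (np.sqrt(state["s"]) + eps)

def adam(w, g, state, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-7):
    # First and second moment estimates with bias correction
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g
    state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * g ** 2
    m_hat = state["m"] / (1 - b1 ** t)
    v_hat = state["v"] / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

def run(update, steps=200):
    w, state = np.array([1.0, 1.0]), {}
    for t in range(1, steps + 1):
        w = update(w, grad(w), state, t)
    return loss(w)

results = {fn.__name__: run(fn) for fn in (sgd, momentum, rmsprop, adam)}
for name, final_loss in results.items():
    print(f"{name:>8}: final loss = {final_loss:.3e}")
```

Which rule ends lowest depends heavily on the chosen learning rates, which is the reason the comparison below trains each optimizer on the same model and data.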


In [None]:
# Relies on `optimizers` imported from tensorflow.keras in Step 1

optimizers_dict = {
    "SGD": optimizers.SGD(learning_rate=0.01),
    "Momentum": optimizers.SGD(learning_rate=0.01, 
                               momentum=0.9),
    "Adam": optimizers.Adam(learning_rate=0.001),
    "RMSprop": optimizers.RMSprop(learning_rate=0.001) 
}

In [None]:
histories = {}
for name, opt in optimizers_dict.items():
    print(f"Training with {name}")
    model = build_base_model()
    model.compile(optimizer=opt, 
                  loss=loss_fn, 
                  metrics=[metric])
    history = model.fit(x_train, y_train,
                        validation_split=0.1, 
                        epochs=10, 
                        batch_size=64, 
                        verbose=0)
    histories[name] = history

In [None]:
plt.figure(figsize=(12, 6))
for name, history in histories.items():
    plt.plot(history.history['val_sparse_categorical_accuracy'],
             label=name)
plt.title('Validation Accuracy Comparison')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show() 

### Student Task 2

Based on the curves above,

1. Which optimizer converges fastest?  

2. Which optimizer leads to highest validation accuracy?  

3. Suggest reasons for these behaviors based on your prior experience.  


### 3. Learning Rate Scheduling

**Objective:** Improve learning dynamics using learning rate schedulers (Exponential Decay, Cosine Annealing).

**Rationale:** A fixed learning rate may prevent the model from reaching optimal weights. Scheduling can accelerate convergence and improve generalization.
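
Exponential decay is the simpler of the two schedules named in the objective. The sketch below is a NumPy version of the formula documented for `tf.keras.optimizers.schedules.ExponentialDecay`, namely `initial_lr * decay_rate ** (step / decay_steps)`; the default values are assumptions chosen for illustration.

```python
import numpy as np

def exponential_decay(step, initial_lr=0.01, decay_rate=0.5,
                      decay_steps=1000, staircase=False):
    """Learning rate after `step` optimizer steps under exponential decay."""
    exponent = step / decay_steps
    if staircase:
        # Staircase mode holds the rate constant within each decay interval
        exponent = np.floor(exponent)
    return initial_lr * decay_rate ** exponent

for step in (0, 500, 1000, 1500, 2000):
    smooth = exponential_decay(step)
    stepped = exponential_decay(step, staircase=True)
    print(f"step {step:>4}: smooth={smooth:.6f}  staircase={stepped:.6f}")
```

With these values the learning rate halves every 1000 steps; note that Keras schedules count optimizer steps (batches), not epochs.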


##### Mathematical Formulation of Cosine Annealing

The learning rate schedule follows:

$$
\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)
$$

Where:
- $\eta_t$: Learning rate at epoch $t$
- $\eta_{max}$: Maximum learning rate (0.01 in the implementation below)
- $\eta_{min}$: Minimum learning rate (1e-5 in the implementation below)
- $T$: Period length in epochs (20 in the implementation below)
- $t$: Current epoch number

##### Boundary Conditions:

- At $t = 0$: $\cos(0) = 1 \Rightarrow \eta_0 = \eta_{max}$
- At $t = T$: $\cos(\pi) = -1 \Rightarrow \eta_T = \eta_{min}$

##### Phase Interpretation:

For $t \in [0, T]$:
1. The $\cos$ term decreases monotonically from 1 to -1
2. The learning rate decays smoothly from $\eta_{max}$ to $\eta_{min}$
3. The curve follows the first half-period of a cosine wave
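
The boundary conditions and the monotonic decay can be verified numerically. The sketch below evaluates the formula at every epoch from $0$ to $T$ inclusive; the helper name `cosine_lr` is an assumption, and the default values mirror those listed above.

```python
import numpy as np

def cosine_lr(t, T=20, eta_max=0.01, eta_min=1e-5):
    """Cosine-annealed learning rate at epoch t, for t in [0, T]."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + np.cos(np.pi * t / T))

epochs = np.arange(0, 21)          # 0, 1, ..., T inclusive
schedule = cosine_lr(epochs)

print(schedule[0])                 # eta_max at t = 0
print(schedule[10])                # midpoint between eta_max and eta_min at t = T / 2
print(schedule[-1])                # eta_min at t = T
print(bool(np.all(np.diff(schedule) < 0)))  # strictly decreasing on [0, T]
```

A training run of 20 epochs indexed 0 to 19 stops one step short of $t = T$, so the final learning rate it uses sits slightly above $\eta_{min}$.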


In [None]:
def cosine_annealing(epoch, lr):
    max_lr = 0.01
    min_lr = 1e-5
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + np.cos(np.pi * epoch / 20))

lrs = [cosine_annealing(e, 0) for e in range(20)]
plt.plot(lrs)
plt.title("Cosine Annealing Schedule")
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.grid(True)
plt.show()

### Student Task 3

1. Use the cosine scheduler above in a model training loop.  

2. Observe and log how the learning rate changes during training using the callback:  
    `tf.keras.callbacks.LearningRateScheduler(cosine_annealing)`


In [None]:
### Task 3 Implementation here




### 4. Regularization Techniques

**Objective:** Control overfitting using `Dropout` and `L2` regularization.

**Rationale:** Complex models often overfit. Regularization improves generalization by constraining the weights and activations during training, without changing the architecture.
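
Both techniques reduce to a few lines of arithmetic. The NumPy sketch below shows the penalty that an L2 regularizer adds to the loss and the inverted-dropout masking applied during training; the function names and example values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_penalty(weights, lam):
    """Term added to the loss: lam times the sum of squared weights."""
    return lam * np.sum(weights ** 2)

def inverted_dropout(activations, rate, rng):
    """Zero a fraction `rate` of units and rescale survivors by 1 / (1 - rate)."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

weights = np.array([[0.5, -1.0], [2.0, 0.0]])
print(l2_penalty(weights, lam=0.001))   # 0.001 * (0.25 + 1 + 4 + 0)

activations = np.ones((1000, 100))
dropped = inverted_dropout(activations, rate=0.5, rng=rng)
print(np.unique(dropped))               # surviving units are scaled up, the rest are 0
print(dropped.mean())                   # stays close to the original mean of 1
```

The rescaling keeps the expected activation unchanged, which is why no adjustment is needed at inference time when dropout is switched off.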


Build a regularized model with Dropout and L2 regularization:

- Use 0.3 dropout for the first two dropout layers and 0.5 for the last dropout layer.
- Use L2 regularization with a lambda of 0.001 for all layers.
- Use ReLU for all activations.
- Use BatchNormalization after each Conv2D layer.

In [None]:
def build_regularized_model(l2_lambda=0.001):
    # L2 penalty shared by the convolutional and hidden dense layers
    l2 = regularizers.l2(l2_lambda)
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', padding='same',
                      kernel_regularizer=l2, input_shape=(32, 32, 3)),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.3),

        layers.Conv2D(64, (3, 3), activation='relu', padding='same',
                      kernel_regularizer=l2),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.3),  # 0.3 for the first two dropout layers, as specified

        layers.Flatten(),
        layers.Dense(128, activation='relu', kernel_regularizer=l2),
        layers.Dropout(0.5),
        layers.Dense(10, activation='softmax')
    ])
    return model

### Student Task 4 

1. Experiment with different L2 values: `0.0001`, `0.001`, `0.01`.  

2. Track validation accuracy and loss. Which setting overfits least?

In [None]:
# Task 4 Implementation here  




### 5. Hyperparameter Tuning with Optuna

**Objective:** Use Optuna to find the best learning rate and dropout combination.

**Rationale:** Manual tuning is inefficient. Tools like Optuna automate the search and accelerate experimentation.
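
The idea of a search space and an objective can be illustrated without training a network. The sketch below is plain random search in pure Python, not Optuna's default TPE sampler, and the scoring function `toy_validation_score` is a hypothetical stand-in for validation accuracy. It samples the learning rate log-uniformly and the dropout rate uniformly, over the same ranges used in the Optuna objective.

```python
import math
import random

def toy_validation_score(lr, dropout):
    """Hypothetical score that peaks near lr = 1e-3 and dropout = 0.3."""
    return 1.0 - 0.1 * (math.log10(lr) + 3) ** 2 - (dropout - 0.3) ** 2

def sample_log_uniform(rng, low, high):
    """Sample so that every order of magnitude in [low, high] is equally likely."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best_score, best_params = -math.inf, None
    for _ in range(n_trials):
        params = {
            "lr": sample_log_uniform(rng, 1e-5, 1e-2),
            "dropout": rng.uniform(0.2, 0.5),
        }
        score = toy_validation_score(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

best_score, best_params = random_search(n_trials=50)
print(best_score, best_params)
```

Log-uniform sampling matters for the learning rate because useful values span several orders of magnitude; sampling uniformly on a linear scale would almost never try values near `1e-5`.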

In [None]:
import optuna

def objective(trial):
    # suggest_loguniform is deprecated; suggest_float(..., log=True) is the replacement
    lr = trial.suggest_float('lr', 1e-5, 1e-2, log=True)
    dropout_rate = trial.suggest_float('dropout', 0.2, 0.5)

    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(dropout_rate),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dropout(dropout_rate),
        layers.Dense(10, activation='softmax')
    ])
    
    opt = tf.keras.optimizers.Adam(learning_rate=lr)
    model.compile(optimizer=opt, loss=loss_fn, metrics=[metric])
    history = model.fit(x_train, y_train, epochs=5, batch_size=64, validation_split=0.1, verbose=0)
    return max(history.history['val_sparse_categorical_accuracy'])

In [None]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=5)
study.best_params

### Student Task 5

1. Change `Adam` to `SGD` in the objective function. Run another study with `momentum` as an additional hyperparameter.  

2. Observe how the optimal values shift.  

In [None]:
# Task 5 Implementation here  




### 6. Wrap-up Discussion 

**Main Discussion:**

- How does the choice of optimizer affect final accuracy?  

- What happens when you use an overly aggressive learning rate?  

- What combination of techniques gave the best performance?  

- Would you use the same setup in production?  

- How can you log, monitor, and scale these experiments using tools like Weights & Biases?  



**Reminder:** 
Theory guides intuition, but experiments validate solutions!

