# Advanced Optimizations in Deep Learning

Optimization algorithms play a crucial role in deep learning: they fit a model to data by minimizing (or maximizing) an objective function. Adaptive optimizers such as RMSProp and Adam go beyond the raw gradient of the loss. They keep running statistics of past gradients (the second moment for RMSProp, both the first and second moments for Adam) and use them to give each parameter its own effective learning rate, which often improves the convergence speed and final performance of deep neural networks.

Further reading: [An overview of gradient descent optimization algorithms](https://www.ruder.io/optimizing-gradient-descent/) by Sebastian Ruder.

<img src="./imgs/deep_learning_optimization.jpeg" alt="drawing" width="725"/>

### Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is the fundamental optimizer in deep learning. At each step it estimates the gradient of the loss on a mini-batch of training examples and moves the weights a small step along the negative gradient, the direction in which the loss decreases fastest locally. SGD has two well-known limitations:

* **Slow Convergence:** SGD can be slow to converge, especially on loss surfaces with flat regions or long, narrow valleys, and it can settle in a shallow local minimum. Imagine a ball rolling down a bumpy hill: it might come to rest in a small dip instead of reaching the lowest point at the bottom.

* **Sensitivity to Learning Rate:** The learning rate hyperparameter controls the step size taken during each update. Too high a learning rate can cause the weights to overshoot the minimum, bouncing from one side of the valley to the other or even diverging. Conversely, a very small learning rate makes the journey to the minimum painstakingly slow.
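
The learning-rate sensitivity described above can be seen on a toy problem. The sketch below is an illustration, not part of the MNIST experiment: it minimizes the assumed loss `f(w) = w**2` with plain gradient descent, using the exact gradient instead of a mini-batch estimate. The helper names `gradient_descent` and `grad_quadratic` are hypothetical.

```python
def gradient_descent(grad_fn, w0, learning_rate, steps):
    """Plain gradient descent: w <- w - learning_rate * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad_fn(w)
    return w


def grad_quadratic(w):
    """Gradient of the toy loss f(w) = w**2 (minimum at w = 0)."""
    return 2.0 * w


# Same starting point and step count, three different step sizes.
w_small = gradient_descent(grad_quadratic, w0=1.0, learning_rate=0.001, steps=50)
w_good = gradient_descent(grad_quadratic, w0=1.0, learning_rate=0.1, steps=50)
w_large = gradient_descent(grad_quadratic, w0=1.0, learning_rate=1.1, steps=50)

print(f"lr=0.001 -> w = {w_small:.4f}")  # too small: barely moves
print(f"lr=0.1   -> w = {w_good:.6f}")   # reasonable: close to 0
print(f"lr=1.1   -> w = {w_large:.1f}")  # too large: overshoots and diverges
```

With the smallest rate the weight has hardly moved after 50 steps, with the middle rate it is very close to the minimum, and with the largest rate every step overshoots by more than it corrects, so the iterate grows without bound.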


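### RMSProp (Root Mean Square Propagation)

RMSProp addresses SGD's sensitivity to the learning rate by keeping an **exponentially decaying average of squared gradients** for each parameter. Each update is divided by the square root of this average, so parameters with consistently large gradients take smaller steps and parameters with small gradients take larger ones. The result is **adaptive, per-parameter learning behavior**.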

* **Analogy:** Imagine a ball rolling down a hill, but the ground is constantly changing its friction. RMSProp adjusts the ball's speed based on the recent steepness it encountered. 

* **Benefits:**
    * **Faster Convergence:** Compared to SGD's constant learning rate, RMSProp's adaptation often leads to faster convergence.
    * **Handles Non-stationary Problems:** Because the average decays, old gradients are gradually forgotten, so RMSProp can follow a changing loss landscape more effectively when working with noisy or non-stationary data.
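
The scaling described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used by Keras: the function name `rmsprop_step`, the toy loss, and the hyperparameter values (`rho`, `eps`, `learning_rate`) are chosen for the example.

```python
import math


def rmsprop_step(params, grads, avg_sq, learning_rate=0.001, rho=0.9, eps=1e-7):
    """One RMSProp update; returns the new parameters and running averages."""
    new_params, new_avg_sq = [], []
    for p, g, v in zip(params, grads, avg_sq):
        # Exponentially decaying average of squared gradients.
        v = rho * v + (1.0 - rho) * g * g
        # Divide the step by the root of that average (per parameter).
        new_params.append(p - learning_rate * g / (math.sqrt(v) + eps))
        new_avg_sq.append(v)
    return new_params, new_avg_sq


def grad(w):
    """Gradient of the toy loss f(w) = 100 * w[0]**2 + 0.01 * w[1]**2."""
    return [200.0 * w[0], 0.02 * w[1]]


w, avg_sq = [1.0, 1.0], [0.0, 0.0]
w_next, avg_sq = rmsprop_step(w, grad(w), avg_sq, learning_rate=0.01)

# Distance moved along each coordinate on the first update.
step_sizes = [w[0] - w_next[0], w[1] - w_next[1]]
print(step_sizes)
```

The two gradients differ by a factor of 10,000, yet both coordinates move by almost the same amount, because each step is normalized by that parameter's own gradient magnitude. Plain SGD with a single learning rate would move the first coordinate 10,000 times further than the second.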


<img src="./imgs/contour.webp" alt="drawing" width="500"/>


### Adam (Adaptive Moment Estimation)

Adam is an optimizer that combines the ideas of Momentum and RMSProp on top of SGD. It maintains an exponentially decaying average of past gradients (the first moment, as in Momentum) and an exponentially decaying average of past squared gradients (the second moment, as in RMSProp). Both averages start at zero, so Adam applies a bias correction to each of them.

* **Analogy:** Imagine a ball with built-in momentum rolling on a complex surface. Adam considers both the recent average slope and the historical size of the fluctuations to set the direction and length of each step, leading to efficient navigation.

* **Benefits:**
    * **Strong Default:** Adam is frequently used as the default choice because it performs well across a wide range of problems.
    * **Minimal Hyperparameter Tuning:** Adam's default settings often work well, so it generally requires less hyperparameter tuning than SGD or RMSProp.
    * **Works Well for Sparse Gradients:** Because each parameter gets its own step size, weights that receive gradients only rarely, or only very small ones, still get meaningful updates.
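
The two moving averages and the bias correction can be sketched as follows. This is a minimal plain-Python illustration, assuming the hyperparameter defaults from the original Adam paper; the function name `adam_step` and the toy loss `f(w) = (w - 3)**2` are hypothetical and not part of the MNIST experiment.

```python
import math


def adam_step(params, grads, m, v, t,
              learning_rate=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (counting from 1); returns (params, m, v)."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g      # first moment, as in Momentum
        v_i = beta2 * v_i + (1.0 - beta2) * g * g  # second moment, as in RMSProp
        # Both averages start at zero, so early values are biased toward zero.
        m_hat = m_i / (1.0 - beta1 ** t)
        v_hat = v_i / (1.0 - beta2 ** t)
        new_params.append(p - learning_rate * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


def grad(w):
    """Gradient of the toy loss f(w) = (w[0] - 3)**2 (minimum at w[0] = 3)."""
    return [2.0 * (w[0] - 3.0)]


w, m, v = [0.0], [0.0], [0.0]
first_step = None
for t in range(1, 2001):
    w_next, m, v = adam_step(w, grad(w), m, v, t, learning_rate=0.01)
    if t == 1:
        first_step = w_next[0] - w[0]
    w = w_next

print(f"first step size: {first_step:.4f}")
print(f"w after 2000 steps: {w[0]:.4f}")
```

Thanks to the bias correction, the very first step has a magnitude of about `learning_rate` regardless of how large the gradient is. Without the correction, the zero-initialized averages would distort the first few updates.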


![Adam Optimization](imgs/gradient_descent.gif)

### Comparison of Optimizers

To compare the performance of the SGD, RMSProp, and Adam optimizers on the MNIST dataset, we'll perform the following steps:

1. Load the MNIST dataset and normalize the pixel values.
2. Define a simple neural network model for digit classification.
3. Train the model using each of the three optimizers: SGD, RMSProp, and Adam.
4. Evaluate and compare the performance of the models.


#### 1. Load the MNIST Dataset

```python
import tensorflow as tf

# Load the MNIST dataset (60,000 training and 10,000 test images of 28x28 pixels)
mnist = tf.keras.datasets.mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Normalize the data: scale pixel values from [0, 255] to [0, 1]
X_train, X_test = X_train / 255.0, X_test / 255.0
```


#### 2. Define the Neural Network Model

```python
def create_model(optimizer):
    """Build and compile a small fully connected classifier for MNIST."""
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),    # 28x28 image -> 784 vector
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),                     # drop 20% of units during training
        tf.keras.layers.Dense(10, activation='softmax')   # one probability per digit
    ])

    model.compile(optimizer=optimizer,  # set the optimizer here
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```


#### 3. Train Models with Different Optimizers

```python
import time

# String names select each optimizer with its Keras default hyperparameters
optimizers = ['sgd', 'rmsprop', 'adam']
results = {}

for opt in optimizers:
    print(f"Training with {opt.upper()} optimizer...")
    start_time = time.time()
    model = create_model(opt)  # fresh model so every optimizer starts from scratch
    history = model.fit(X_train, y_train, epochs=5, validation_split=0.2, verbose=0)
    duration = time.time() - start_time  # includes model construction and training
    test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
    results[opt] = (test_loss, test_acc, duration)
    print(f"{opt.upper()} - Test Loss: {test_loss:.4f}, Test Accuracy: {test_acc:.4f}, Training Time: {duration:.2f} seconds\n")
```


```text
Training with SGD optimizer...
SGD - Test Loss: 0.2307, Test Accuracy: 0.9362, Training Time: 5.30 seconds

Training with RMSPROP optimizer...
RMSPROP - Test Loss: 0.1050, Test Accuracy: 0.9720, Training Time: 6.45 seconds

Training with ADAM optimizer...
ADAM - Test Loss: 0.0792, Test Accuracy: 0.9763, Training Time: 5.62 seconds
```



#### 4. Evaluation and Comparison

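```python
# Use `duration` rather than `time` as the loop variable so the `time` module is not shadowed
for opt, (loss, acc, duration) in results.items():
    print(f"{opt.upper()}: Loss = {loss:.4f}, Accuracy = {acc:.4f}, Time = {duration:.2f} seconds")
```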


```text
SGD: Loss = 0.2307, Accuracy = 0.9362, Time = 5.30 seconds
RMSPROP: Loss = 0.1050, Accuracy = 0.9720, Time = 6.45 seconds
ADAM: Loss = 0.0792, Accuracy = 0.9763, Time = 5.62 seconds
```


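Comparing SGD, RMSProp, and Adam on the MNIST dataset illustrates the impact of adaptive optimization techniques in deep learning training. In this run, Adam reached the highest test accuracy (97.63%) and the lowest test loss (0.0792), RMSProp followed closely (97.20%, 0.1050), and SGD trailed at 93.62% after the same five epochs. Training times were similar, between 5.3 and 6.5 seconds, with SGD slightly fastest. Each optimizer was trained once with its Keras default hyperparameters, so part of the gap reflects those defaults: SGD with a tuned learning rate or more epochs may narrow it. Even so, SGD serves as a solid baseline, and the result shows how adaptive optimizers can reach good accuracy quickly with little tuning, which makes Adam an excellent choice for many deep learning tasks.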
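
Accuracy differences near 100% are easier to judge as error rates. The short sketch below takes the test accuracies reported above and computes how much of SGD's error each adaptive optimizer removes. The dictionary names are hypothetical, and the numbers come from the single run shown here, so they will vary between runs.

```python
# Test accuracies reported in the run above.
reported_accuracy = {"sgd": 0.9362, "rmsprop": 0.9720, "adam": 0.9763}

# Error rate = fraction of misclassified test images.
error_rate = {name: 1.0 - acc for name, acc in reported_accuracy.items()}
baseline = error_rate["sgd"]

# Relative reduction in error compared with the SGD baseline.
reduction = {name: (baseline - err) / baseline for name, err in error_rate.items()}

for name in reported_accuracy:
    print(f"{name.upper():8s} error = {error_rate[name]:.2%}, "
          f"reduction vs SGD = {reduction[name]:.1%}")
```

Seen this way, a gap of about four percentage points in accuracy means Adam made well under half as many test errors as SGD in this run.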

### Choosing the Right Optimizer

* **There's No Silver Bullet:** The optimal optimizer choice depends on the specific problem, dataset, and model architecture.
* **Start with Adam:** Due to its robustness and adaptability, Adam is an excellent starting point for most deep learning tasks.
* **Consider RMSProp for Noisy Data:** If your data exhibits significant noise or non-stationarity, RMSProp is worth trying alongside Adam and might turn out to be the better choice.
* **Fine-Tuning with SGD:** In some cases, switching to SGD with a carefully selected learning rate can help refine a model that has already been trained with Adam or RMSProp.

**Remember:** The best way to determine the most suitable optimizer for your deep learning project is through experimentation! 
