### Part 1: Understanding Optimizers

1. What is the role of optimization algorithms in artificial neural networks? Why are they necessary?

Ans: Optimization algorithms adjust a network's weights and biases to minimize the loss function, which measures the gap between predictions and targets. 
        They are necessary because the loss of a neural network has no closed-form minimizer, so the parameters have to be improved iteratively using gradient information. 
        Optimizers differ in how they take these steps: full-batch Gradient Descent is simple but slow on large datasets, Stochastic Gradient Descent (SGD) updates more often but with noisier steps, 
        and Adam and RMSprop adapt the learning rate for each parameter. 
The choice depends on the specific task, architecture, and data. Experimentation and tuning are essential to find the optimizer that trains best.

2. Explain the concept of gradient descent and its variants. Discuss their differences and tradeoffs in terms of convergence speed and memory requirements.

Ans: Gradient descent is an optimization technique used in neural networks to minimize the loss during training. 
        It calculates the gradient of the loss (the direction of steepest ascent) and moves the weights a small step in the opposite direction, repeating until the loss stops decreasing. 
        Batch GD computes the gradient over the whole dataset for every update, which gives stable steps but needs the most memory and computation per update. 
        Stochastic GD uses one random sample per update, which is cheap and needs little memory but produces noisy steps. 
        Mini-batch GD strikes a balance by averaging the gradient over a small batch of samples.
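
The three variants differ only in how many samples feed each update, which a small sketch can make concrete. The example below is illustrative and not part of the original notebook: the synthetic data (`y = 3x + 2` plus noise), the helper names `make_data`, `gradient`, and `train`, and the hyperparameters are all assumptions chosen to keep it self-contained.

```python
import random

def make_data(n=200, seed=0):
    """Synthetic 1-D regression data: y = 3x + 2 plus small Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + 2.0 + rng.gauss(0.0, 0.05) for x in xs]
    return xs, ys

def gradient(w, b, xs, ys):
    """Gradient of the mean squared error over the given samples."""
    n = len(xs)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb

def train(xs, ys, batch_size, epochs=100, lr=0.1, seed=1):
    """batch_size=len(xs) is batch GD, 1 is stochastic GD, anything else is mini-batch GD."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit samples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            gw, gb = gradient(w, b, [xs[i] for i in batch], [ys[i] for i in batch])
            w -= lr * gw  # step against the gradient
            b -= lr * gb
    return w, b

xs, ys = make_data()
for name, bs in [("batch", len(xs)), ("mini-batch", 32), ("stochastic", 1)]:
    w, b = train(xs, ys, batch_size=bs)
    print(f"{name:>10}: w={w:.3f}, b={b:.3f}")
```

For the same number of epochs, the stochastic variant performs 200 updates per epoch here while the batch variant performs one, which is why smaller batches tend to make faster progress per pass at the cost of noisier steps.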

3. Describe the challenges associated with traditional gradient descent optimization methods (e.g., slow convergence, local minima). How do modern optimizers address these challenges?

Ans: Traditional gradient descent methods can converge slowly, especially on poorly conditioned loss surfaces, and can stall in local minima, saddle points, or plateaus. Modern optimizers address these challenges by using 
        adaptive learning rates (e.g., Adam, RMSprop) that speed up convergence by adjusting the step size for each parameter. 
        They also use momentum, which accumulates past gradients so that updates can carry through flat regions and shallow local minima. Neither technique guarantees reaching the global minimum.

4. Discuss the concepts of momentum and learning rate in the context of optimization algorithms. How do they impact convergence and model performance?

Ans: In optimization algorithms, momentum is a technique that helps the optimizer move faster towards the minimum by adding a fraction of the previous update to the current one. 
        Learning rate controls the step size in each update: too large and training diverges or oscillates, too small and it converges very slowly. 
        A well-chosen momentum dampens oscillations, while an appropriate learning rate balances convergence speed and stable model performance.
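
To see the effect of momentum in isolation, the sketch below counts how many steps plain gradient descent and classical momentum need on an elongated quadratic bowl. It is an illustration only: the test function, the starting point, and the values `lr=0.02` and `momentum=0.9` are assumptions, and `steps_to_converge` is a hypothetical helper.

```python
import math

def grad(theta, curvatures=(1.0, 50.0)):
    """Gradient of f(theta) = 0.5 * sum(c_i * theta_i**2), a bowl that is 50x steeper along one axis."""
    return [c * t for c, t in zip(curvatures, theta)]

def steps_to_converge(lr, momentum, start=(10.0, 1.0), tol=1e-3, max_steps=10_000):
    """Run classical momentum and return the first step at which the distance to the minimum drops below tol."""
    theta = list(start)
    velocity = [0.0] * len(theta)
    for step in range(1, max_steps + 1):
        g = grad(theta)
        # velocity = fraction of the previous update minus the current gradient step
        velocity = [momentum * v - lr * gi for v, gi in zip(velocity, g)]
        theta = [t + v for t, v in zip(theta, velocity)]
        if math.hypot(*theta) < tol:
            return step
    return max_steps

plain = steps_to_converge(lr=0.02, momentum=0.0)  # momentum=0 reduces to plain gradient descent
heavy = steps_to_converge(lr=0.02, momentum=0.9)
print(f"plain GD: {plain} steps, momentum 0.9: {heavy} steps")
```

The learning rate is capped by the steep direction, so plain gradient descent crawls along the shallow one. Momentum builds up speed there without requiring a larger learning rate.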

### Part 2: Optimizer Techniques

5. Explain the concept of Stochastic Gradient Descent (SGD) and its advantages compared to traditional gradient descent. Discuss its limitations and scenarios where it is most suitable.

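Ans: Stochastic Gradient Descent (SGD) is an optimization method that updates model weights using one random training sample at a time instead of the entire dataset. 
        Each update is cheap, so it often makes faster progress per pass over the data, uses little memory, and the noise in its updates can help it escape shallow local minima. 
        However, the same noise makes the loss fluctuate and can keep the weights from settling at a minimum unless the learning rate is decayed. 
        It is most suitable for large datasets and online learning, and less suitable when smooth, precise convergence is required.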

6. Describe the concept of Adam optimizer and how it combines momentum and adaptive learning rates. Discuss its benefits and potential drawbacks.

Ans: The Adam optimizer combines the benefits of momentum and adaptive learning rates. It keeps exponential moving averages of past gradients (the momentum term) and of 
        squared gradients, and uses them to compute an individual step size for each parameter. This helps it converge faster and be more robust to the choice of learning rate. 
        However, it stores two extra values per parameter, may still require tuning, and could overshoot the optimal values in some cases.
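
The update described above can be written out in a few lines. This is a minimal sketch of the standard Adam rule with bias correction, not the implementation used by Keras. The function name `adam_step`, the toy objective, and `lr=0.05` are assumptions made for illustration.

```python
import math

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. Returns new (params, m, v); t is the 1-based step count."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g      # moving average of gradients (momentum)
        v_i = beta2 * v_i + (1.0 - beta2) * g * g  # moving average of squared gradients
        m_hat = m_i / (1.0 - beta1 ** t)           # bias correction for the zero initialization
        v_hat = v_i / (1.0 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))  # per-parameter step size
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Toy usage: minimize f(x, y) = x**2 + 10 * y**2 starting from (5, -3)
params, m, v = [5.0, -3.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 2001):
    grads = [2.0 * params[0], 20.0 * params[1]]
    params, m, v = adam_step(params, grads, m, v, t, lr=0.05)
print(params)
```

On the first step the bias-corrected ratio is the sign of the gradient, so every parameter moves by about `lr` regardless of gradient scale. That normalization is what makes Adam less sensitive to the raw magnitude of the gradients.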

7. Explain the concept of RMSprop optimizer and how it addresses the challenges of adaptive learning rates. Compare it with Adam and discuss their relative strengths and weaknesses. 

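Ans: RMSprop is an optimizer that divides each parameter's step by the square root of a moving average of its squared gradients. Because old gradients decay out of the average, 
        the effective learning rate does not shrink towards zero as it does in AdaGrad, which accumulates all past squared gradients. This helps prevent oscillations in training and speeds up convergence. 
        Compared to Adam, RMSprop is simpler and requires less memory (one stored value per parameter instead of two), but in its basic form it has no momentum term or bias correction. 
        Adam performs well in many scenarios but might need more tuning.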
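
For comparison with Adam, the basic RMSprop update can be sketched as follows. This is an illustrative version without momentum, not a library implementation. The name `rmsprop_step` and the example gradients are assumptions.

```python
import math

def rmsprop_step(params, grads, avg_sq, lr=0.001, rho=0.9, eps=1e-7):
    """One basic RMSprop update. Returns new (params, avg_sq)."""
    new_params, new_avg = [], []
    for p, g, s in zip(params, grads, avg_sq):
        s = rho * s + (1.0 - rho) * g * g  # moving average of squared gradients
        new_params.append(p - lr * g / (math.sqrt(s) + eps))  # scale the step by the recent gradient size
        new_avg.append(s)
    return new_params, new_avg

# Two parameters whose gradients differ by four orders of magnitude
params, avg_sq = [1.0, 1.0], [0.0, 0.0]
new_params, avg_sq = rmsprop_step(params, [100.0, 0.01], avg_sq, lr=0.01)
steps = [old - new for old, new in zip(params, new_params)]
print(steps)
```

Both parameters move by nearly the same amount even though their gradients differ greatly in scale. Unlike the Adam sketch, there is only one state value per parameter and no first-moment average.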
        

### Part 3: Applying Optimizers

In [1]:
import tensorflow as tf




In [3]:
import numpy as np
from tensorflow.keras.datasets import mnist
from tensorflow.keras.layers import Dense, Flatten

In [6]:
def model(optimizer):
    """Train a small MNIST classifier with the given optimizer and return its final validation accuracy."""
    # Load the dataset
    (X_train_full, y_train_full), (X_test, y_test) = mnist.load_data()

    # Scale pixel values to [0, 1] and hold out the first 5,000 training images for validation
    X_valid, X_train = X_train_full[:5000] / 255., X_train_full[5000:] / 255.
    y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

    # Scale the test data the same way (not used below)
    X_test = X_test / 255.

    # The local name `net` avoids shadowing this function's own name
    net = tf.keras.models.Sequential()
    net.add(Flatten(input_shape=[28, 28], name="inputLayer"))
    net.add(Dense(200, activation="relu", name="hiddenLayer1"))
    net.add(Dense(100, activation="relu", name="hiddenLayer2"))
    net.add(Dense(10, activation="softmax", name="outputLayer"))

    net.compile(loss='sparse_categorical_crossentropy',
                optimizer=optimizer,
                metrics=["accuracy"])
    history = net.fit(X_train, y_train, epochs=25,
                      validation_data=(X_valid, y_valid), batch_size=1000)

    # Validation accuracy after the last epoch
    return history.history['val_accuracy'][-1]

8. Implement SGD, Adam, and RMSprop optimizers in a deep learning model using a framework of your choice. 
- Train the model on a suitable dataset and compare their impact on model convergence and performance


In [7]:
score_data={
    'score_SGD':model('SGD'),
    'score_Adam':model('adam'),
    'score_RMSProp':model('RMSProp')
}
print("\n\n")
print(score_data)

[Training log: three runs of 25 epochs each (SGD, Adam, RMSprop); per-epoch metrics not shown]



{'score_SGD': 0.907800018787384, 'score_Adam': 0.9814000129699707, 'score_RMSProp': 0.978600025177002}
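
To make the comparison easier to read, the scores can be ranked and expressed as a gap to the best run. The snippet below is a small sketch that copies the values printed above into a literal dictionary. Note that these are single-run, final-epoch validation accuracies, so small differences (such as the one between Adam and RMSprop) may fall within run-to-run variation.

```python
# Final-epoch validation accuracies copied from the output above
score_data = {
    "score_SGD": 0.907800018787384,
    "score_Adam": 0.9814000129699707,
    "score_RMSProp": 0.978600025177002,
}

best = max(score_data.values())
# Sort optimizers from highest to lowest validation accuracy
ranked = sorted(score_data.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranked:
    gap = (best - acc) * 100  # gap to the best score, in percentage points
    print(f"{name:<14} val_accuracy={acc:.4f}  gap to best={gap:.2f} pts")
```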


9. Discuss the considerations and tradeoffs when choosing the appropriate optimizer for a given neural network architecture and task. Consider factors such as convergence speed, stability and generalization performance. 

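Ans: Choosing the right optimizer for a neural network involves tradeoffs among convergence speed, stability, and generalization performance. 
- Full-batch Gradient Descent is simple and stable but slow on large datasets. 
- Stochastic Gradient Descent (SGD) makes cheaper, more frequent updates but is noisier and more sensitive to the learning rate. 
- Adam combines momentum with per-parameter adaptive learning rates and usually converges quickly with default settings. 
- RMSprop also adapts the learning rate per parameter but has no momentum term by default, so it might converge slower than Adam. 

In the run above, Adam (98.14%) and RMSprop (97.86%) reached clearly higher validation accuracy after 25 epochs than SGD with default settings (90.78%). The choice depends on the specific task, architecture, and data. Experimentation and tuning are essential to find the best optimizer for optimal training performance.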