## Training models using the MNIST dataset
### Comparison between SGD, Mini-batch SGD and Batch SGD

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

SGD training time:  1939.6208429336548  
Mini-batch SGD training time:  54.13084030151367  
Batch SGD training time:  8.09151005744934  
> SGD (Stochastic Gradient Descent) updates the model parameters using the gradient of the loss function computed on a single randomly selected data point at a time. Each update is cheap, but the updates are noisy and one epoch requires one update per training sample, which is consistent with SGD having by far the longest training time above.

>Batch SGD (batch gradient descent) updates the model parameters using the gradient of the loss function computed on the entire training dataset. Each update follows the exact gradient, but there is only one update per epoch, and each update can be expensive in computation and memory, especially for large datasets.

>Mini-batch SGD updates the model parameters using the gradient of the loss function computed on a small random subset (or mini-batch) of the training dataset. This strikes a balance between the cheap, frequent updates of SGD and the stable gradient estimate of batch SGD. It is the most commonly used variant for training deep learning models.  
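
To make the difference between the three variants concrete, here is a minimal sketch in plain Python. It is not the notebook's training code: the toy linear-regression data, the function name `train_linear`, and the learning rate are all assumptions chosen for illustration. The only thing that changes between the three runs is `batch_size`.

```python
import random

def train_linear(xs, ys, batch_size, lr=0.05, epochs=20, seed=0):
    """Fit y = w*x + b by gradient descent on the mean squared error.

    batch_size=1 gives SGD, batch_size=len(xs) gives batch gradient descent,
    and anything in between gives mini-batch SGD.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    n_updates = 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average the gradient over the samples in this batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
            n_updates += 1
    return w, b, n_updates

# Toy data (hypothetical): y = 3x + 1 sampled at 64 points in [0, 1).
xs = [i / 64 for i in range(64)]
ys = [3 * x + 1 for x in xs]

for name, bs in [("SGD", 1), ("Mini-batch SGD", 16), ("Batch SGD", len(xs))]:
    w, b, n_updates = train_linear(xs, ys, batch_size=bs)
    print(f"{name:15s} batch_size={bs:3d} updates={n_updates:5d} w={w:.3f} b={b:.3f}")
```

With 64 samples, one epoch means 64 updates for SGD, 4 for a mini-batch of 16, and a single update for batch SGD. This is why SGD is the slowest per epoch in a Python-level loop and why batch SGD makes the least progress per epoch.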

![image-3.png](attachment:image-3.png)

> Batch SGD reaches the lowest accuracy of the three models, while SGD (batch size = 1) reaches the highest.

SGD training time:  2999.7093324661255  
Mini-batch SGD training time:  1114.2193298339844  
Batch SGD training time:  1068.17999958992  
Mini-batch SGD with decay training time:  31.212229251861572  
SGD with momentum training time:  1917.5881078243256

#### SGD Mini-batch with Decay & SGD Momentum
>SGD with momentum and SGD mini-batch with decay are two variants of the stochastic gradient descent algorithm.

>SGD with momentum is an optimization technique that helps accelerate convergence during training. In plain SGD, the parameter update is directly proportional to the current gradient of the cost function. In SGD with momentum, the update combines the current gradient with a fraction of the previous update. The optimizer therefore keeps moving in directions where successive gradients agree and dampens oscillations in directions where the gradient keeps changing sign. This can lead to faster convergence and fewer oscillations during training.
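
The momentum update described above can be sketched as follows. This uses the classical "velocity" formulation, which is one common convention; the function name `sgd_momentum_step` and the constant-gradient experiment are assumptions made for illustration, not the notebook's code.

```python
def sgd_momentum_step(w, velocity, grad, lr=0.01, momentum=0.9):
    """One momentum update: the new step is a fraction of the previous step
    plus the usual gradient step."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Apply the same gradient repeatedly to see how large the step becomes.
w, velocity = 0.0, 0.0
for _ in range(500):
    w, velocity = sgd_momentum_step(w, velocity, grad=1.0)

print(f"step with momentum: {velocity:.4f}")
print(f"step without momentum: {-0.01 * 1.0:.4f}")
```

When the gradient keeps pointing the same way, the step size approaches `lr / (1 - momentum)`, which is 0.1 here, ten times the plain SGD step of 0.01. A learning rate that is fine for plain SGD can therefore be too aggressive once momentum 0.9 is added.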

>SGD mini-batch with decay is another technique that can speed up convergence. In plain mini-batch SGD, the learning rate remains constant during training. With decay, the learning rate is reduced over time by repeatedly multiplying it by a decay factor smaller than 1. The optimizer takes larger steps at the beginning of training, when it is far from a minimum, and gradually smaller steps as it approaches the minimum of the cost function. This can lead to faster convergence and better generalization of the model.
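
As an illustration of the multiplicative decay described above, here is a small sketch of an exponential schedule. The exact schedule used in the notebook is not shown, so the per-epoch form, the function name `decayed_learning_rate`, and the values 0.1 and 0.5 are assumptions.

```python
def decayed_learning_rate(initial_lr, decay_factor, epoch):
    """Learning rate after `epoch` epochs when it is multiplied by
    `decay_factor` once per epoch."""
    return initial_lr * decay_factor ** epoch

# Hypothetical values: start at 0.1 and halve the learning rate every epoch.
schedule = [decayed_learning_rate(0.1, 0.5, epoch) for epoch in range(4)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.4f}")
```

Other schedules (for example inverse-time decay, where the rate is divided by a growing term) follow the same idea of large early steps and small late steps.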

![image-3.png](attachment:image-3.png)
![image-4.png](attachment:image-4.png)

SGD with a momentum of 0.9 performs very poorly: its accuracy never exceeds 0.1, which is chance level for the 10 MNIST classes and suggests that the model is not learning at all.  
It is difficult to determine the exact cause, but here are some possible reasons:

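>Learning rate: The learning rate (lr) of 0.01 may be too high or too low for this particular model and dataset. With a momentum of 0.9, the effective step size can grow to roughly ten times the nominal learning rate, so a value that works for plain SGD may be too high here.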

>Number of epochs: 10 epochs may not be enough for the model to converge to a good solution.

>Overfitting: It's possible that the model is overfitting to the training data, which means that it is not generalizing well to new, unseen data.

>Model architecture: The architecture of the model may not be suitable for the task at hand.

#### Adam
![image.png](attachment:image.png)

#### RMSprop
![image.png](attachment:image.png)

SGD (Stochastic Gradient Descent) is a simple and widely used optimization algorithm that updates the model parameters in the direction of the negative gradient of the loss function with a fixed learning rate. SGD performs the update after every mini-batch of training examples, and this process can be repeated multiple times (epochs) until convergence. SGD is computationally efficient and can work well on small datasets, but it can suffer from slow convergence or getting stuck in local minima when dealing with complex, high-dimensional, or noisy datasets.

RMSprop (Root Mean Square Propagation) is an optimization algorithm that adapts the step size of SGD by scaling each weight's gradient by a running average of its squared past gradients. RMSprop accumulates an exponential moving average of the squared gradients and divides the gradient by the square root of that average, so each weight effectively gets its own learning rate. This can result in faster convergence than SGD, especially on problems with noisy or sparse gradients.
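
A minimal sketch of a single RMSprop update for one scalar weight is shown below. The function name `rmsprop_step` and the hyperparameter values are assumptions for illustration; real implementations apply the same rule element-wise to every weight tensor.

```python
import math

def rmsprop_step(w, avg_sq, grad, lr=0.001, rho=0.9, eps=1e-7):
    """One RMSprop update: scale the gradient by the root of a moving
    average of its squared values."""
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2
    w = w - lr * grad / (math.sqrt(avg_sq) + eps)
    return w, avg_sq

# Two weights whose gradients differ by four orders of magnitude.
w_big, _ = rmsprop_step(0.0, 0.0, grad=100.0)
w_small, _ = rmsprop_step(0.0, 0.0, grad=0.01)
print(f"step for large gradient: {-w_big:.6f}")
print(f"step for small gradient: {-w_small:.6f}")
```

Both weights move by almost the same amount even though their raw gradients are very different, which is the per-weight learning rate adjustment described above.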

Adam (Adaptive Moment Estimation) is an optimization algorithm that combines the advantages of RMSprop and SGD with momentum. Adam computes adaptive learning rates for each weight from estimates of the first and second moments of the gradients. The first moment is an exponential moving average of the gradients (the momentum term, which speeds up convergence), and the second moment is an exponential moving average of the squared gradients (the uncentered variance). Adam often converges faster than both RMSprop and plain SGD with little tuning, which makes it a common default for large-scale, non-convex optimization problems.
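
The Adam update can be sketched for one scalar weight as follows. The function name `adam_step`, the toy objective `(w - 2)^2`, and the learning rate of 0.1 are assumptions for illustration; the bias-correction terms follow the standard formulation of the algorithm.

```python
import math

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimize the toy objective f(w) = (w - 2)^2, whose gradient is 2 * (w - 2).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2 * (w - 2)
    w, m, v = adam_step(w, m, v, grad, t, lr=0.1)
print(f"w after 500 steps: {w:.4f}")
```

Because of the bias correction, the very first step has a magnitude close to `lr` regardless of how large the gradient is, which is part of why Adam is less sensitive to gradient scale than plain SGD.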

> The best results are obtained with Adam, followed by RMSprop, then SGD.

SGD training time:  2999.7093324661255   
Mini-batch SGD training time:  1114.2193298339844    
Mini-batch SGD with decay training time:  31.212229251861572  
SGD with momentum training time:  1917.5881078243256  

![image.png](attachment:image.png)

## Training models using the CIFAR10 dataset
### Comparison between models with different hyperparameters

Learning curves of the SGD model with the same architecture as before and a learning rate of 0.01:

![image.png](attachment:image.png)  
Using L2 regularization:
>L2 regularization (known as ridge regression in the linear regression setting) is a technique used to prevent overfitting in machine learning models. It works by adding a penalty term to the loss function that encourages the model to have smaller weights.  
The penalty term is the sum of the squares of all the weights in the model, multiplied by a hyperparameter called the regularization parameter, denoted lambda. The overall loss function becomes:  
`loss = original_loss + lambda * sum_of_weights_squared`  
The penalty grows fastest for large weights, so it encourages the model to spread its weight values more evenly across all the features rather than relying too heavily on any one feature. This helps to prevent overfitting, because the model is less likely to memorize the training data and instead learns more general patterns.  
In practice, the value of lambda is typically chosen by cross-validation to find the best balance between fitting the training data and avoiding overfitting.
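
The penalized loss above can be written out directly. The sketch below uses hypothetical weights, a hypothetical unregularized loss of 1.0, and `lam = 0.01`; the helper names are assumptions, not part of any library.

```python
def l2_regularized_loss(original_loss, weights, lam):
    """loss = original_loss + lambda * (sum of squared weights)."""
    penalty = lam * sum(w * w for w in weights)
    return original_loss + penalty

def l2_penalty_gradient(weights, lam):
    """Gradient of the penalty term: each weight is pulled toward zero
    in proportion to its own size."""
    return [2 * lam * w for w in weights]

weights = [0.5, -1.0, 2.0]  # hypothetical weights
loss = l2_regularized_loss(original_loss=1.0, weights=weights, lam=0.01)
print(f"regularized loss: {loss:.4f}")
print("penalty gradient:", l2_penalty_gradient(weights, lam=0.01))
```

The gradient of the penalty is proportional to each weight, so the largest weight receives the strongest pull toward zero.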

![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)

![image-4.png](attachment:image-4.png)


![image-5.png](attachment:image-5.png)

Training stopped after 40 epochs  


![image-6.png](attachment:image-6.png)

Batch normalization is a technique that normalizes the inputs of a neural network layer, that is, the activations produced by the previous layer. For each feature, it subtracts the mean and divides by the standard deviation computed over the current mini-batch, then applies a learned scale and offset. By doing so, batch normalization helps to reduce the impact of internal covariate shift, which can slow down training and make the network difficult to optimize.
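
The normalization step can be sketched for a single feature as follows. This shows only the training-time computation on one mini-batch; the function name `batch_norm`, the example activations, and the value of `eps` are assumptions, and a real layer also tracks running statistics for use at inference time.

```python
import math

def batch_norm(column, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize one feature across a mini-batch, then apply the
    scale (gamma) and offset (beta) that a real layer would learn."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    # eps avoids division by zero when the batch variance is tiny.
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in column]

# Hypothetical activations of one neuron for a mini-batch of four samples.
activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)
print([round(x, 3) for x in normalized])
```

After normalization the values have a mean of about 0 and a variance of about 1, whatever the original scale of the activations was.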

To add a batch normalization layer after the first hidden layer, we can use the BatchNormalization layer provided by Keras. 

![image.png](attachment:image.png)





![image.png](attachment:image.png)

![image.png](attachment:image.png)


![image-2.png](attachment:image-2.png)

The models appear to be overfitting, which would explain the poor score and why the validation accuracy curve is so noisy and does not converge.

### Using RandomSearch to find the best hyperparameters
After running RandomSearch, I obtained the following best parameters:  
Best params: {'batch_size': 128, 'dropout_rate': 0.2136695399132326, 'learning_rate': 0.03199722257734033}
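
For readers unfamiliar with the idea, the sketch below shows random search in its simplest form. It is not the RandomSearch tool used in the notebook: the search ranges, the helper names, and especially `toy_score`, which stands in for "train a model and return its validation accuracy", are all assumptions.

```python
import math
import random

def sample_params(rng):
    """Draw one random hyperparameter combination (ranges are hypothetical)."""
    return {
        "batch_size": rng.choice([32, 64, 128, 256]),
        "dropout_rate": rng.uniform(0.0, 0.5),
        # Log-uniform sampling spreads trials evenly across orders of magnitude.
        "learning_rate": 10 ** rng.uniform(-4, -1),
    }

def random_search(score_fn, n_trials=20, seed=0):
    """Evaluate n_trials random combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, -math.inf
    for _ in range(n_trials):
        params = sample_params(rng)
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    """Stand-in for validation accuracy; a real run would train a model here."""
    return (-abs(math.log10(params["learning_rate"]) + 2)
            - abs(params["dropout_rate"] - 0.2))

best_params, best_score = random_search(toy_score)
print("Best params:", best_params)
```

Each trial is independent, so the quality of the result depends on the number of trials and on how well the sampled ranges cover the useful region of the search space.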

Using these parameters, I got the following results:

![image.png](attachment:image.png)

The model still overfits, and the accuracy remains poor.