## <div style ="font-size:25px; border-radius:25px; border:3666; padding:10px;  background-color:#fffafa; text-align:center; color:#000000;">Optimizers</div>


<div style ="font-size:25px; padding:10px; border-radius:15px; background-color:#fffafa; color:#000000;">
<p>Optimizers are algorithms that adjust the weights of a neural network during training to minimize the loss function. They determine how the model's parameters are updated from the gradients computed by backpropagation. The sections below cover:</p>
<ul>
<li>Gradient Descent</li>
<li>Stochastic Gradient Descent</li>
<li>Momentum</li>
<li>AdaGrad</li>
<li>RMSProp</li>
<li>Adam</li>
</ul>
</div>




## <div style ="font-size:25px; border-radius:25px; border:3666; padding:10px;  background-color:#fffafa; text-align:center; color:#000000;"> Gradient Descent</div>


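Gradient descent (also called batch gradient descent) computes the gradient of the loss $J(\theta)$ over the entire dataset and updates the weights once per pass, scaled by the learning rate $\eta$.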



$$
\theta := \theta - \eta \nabla_{\theta} J(\theta)
$$

In [1]:
class GradientDescent:
    def __init__(self, learning_rate=0.001):
        self.learning_rate = learning_rate

    def update(self, weights, gradients):
        # Step against the gradient, scaled by the learning rate.
        weights = weights - self.learning_rate * gradients
        return weights
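
To make the "entire dataset" part concrete, here is a minimal sketch of full-batch gradient descent on a small linear regression problem. The data, the helper names `batch_gradient` and `run_gradient_descent`, and the hyperparameters are assumptions chosen for illustration, not part of the class above.

```python
import numpy as np


def batch_gradient(theta, X, y):
    """Gradient of J(theta) = 0.5 * mean((X @ theta - y)**2) over all rows."""
    residual = X @ theta - y
    return X.T @ residual / len(y)


def run_gradient_descent(X, y, learning_rate=0.1, n_steps=500):
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        # One update per pass: the gradient uses every row of X.
        theta = theta - learning_rate * batch_gradient(theta, X, y)
    return theta


rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])  # bias column + one feature
true_theta = np.array([2.0, -3.0])  # hypothetical ground-truth weights
y = X @ true_theta  # noiseless targets, so the minimizer is true_theta

theta_hat = run_gradient_descent(X, y)
print(theta_hat)
```

Because every step looks at all 100 rows, the path to the minimum is smooth, but the cost of each step grows with the size of the dataset.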


## <div style ="font-size:25px; border-radius:25px; border:3666; padding:10px;  background-color:#fffafa; text-align:center; color:#000000;">Stochastic Gradient Descent</div>



SGD updates the weights based on a single training example at each iteration.



$$
\theta := \theta - \eta \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})
$$

In [2]:
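class SGD:
    def __init__(self, learning_rate=0.001):
        self.learning_rate = learning_rate

    def update(self, weights, gradients):
        # Same rule as GradientDescent; what makes it stochastic is that
        # `gradients` comes from a single training example, not the full dataset.
        weights = weights - self.learning_rate * gradients
        return weights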
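
The difference from batch gradient descent lives in the training loop rather than in the update rule. The sketch below, under the same illustrative linear regression setup (the data and the names `example_gradient` and `run_sgd` are assumptions), updates the weights once per training example.

```python
import numpy as np


def example_gradient(theta, x_i, y_i):
    """Gradient of 0.5 * (x_i @ theta - y_i)**2 for a single example."""
    return (x_i @ theta - y_i) * x_i


def run_sgd(X, y, learning_rate=0.05, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        # Visit the examples in a fresh random order each epoch.
        for i in rng.permutation(len(y)):
            # One update per example, not one per pass over the data.
            theta = theta - learning_rate * example_gradient(theta, X[i], y[i])
    return theta


rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])  # bias column + one feature
true_theta = np.array([2.0, -3.0])  # hypothetical ground-truth weights
y = X @ true_theta  # noiseless targets

theta_hat = run_sgd(X, y)
print(theta_hat)
```

Each step is cheap because it touches one row, at the price of a noisier path. With noisy targets the iterates would keep jittering around the minimum unless the learning rate is decayed.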

## <div style ="font-size:25px; border-radius:25px; border:3666; padding:10px;  background-color:#fffafa; text-align:center; color:#000000;">Momentum</div>


Momentum helps accelerate SGD by adding a fraction of the previous update to the current update.


$$
v_t := \beta v_{t-1} + (1 - \beta) \nabla_{\theta} J(\theta)
$$

$$
\theta := \theta - \eta v_t 
$$

In [3]:
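class Momentum:
    def __init__(self, learning_rate=0.001, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = 0

    def update(self, weights, gradients):
        # Classical form: the learning rate is folded into the velocity.
        # It matches the averaged form in the equations above up to
        # rescaling the learning rate by (1 - beta).
        self.velocity = self.momentum * self.velocity - self.learning_rate * gradients
        weights = weights + self.velocity
        return weights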
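
A quick way to see the acceleration is to run both update rules on an ill-conditioned quadratic. The sketch below uses the classical velocity form (learning rate folded into the velocity); the curvatures, starting point, and helper names are assumptions chosen for illustration.

```python
import numpy as np

curvatures = np.array([1.0, 100.0])  # hypothetical ill-conditioned quadratic


def gradient(theta):
    """Gradient of J(theta) = 0.5 * sum(curvatures * theta**2)."""
    return curvatures * theta


def run_plain(theta, learning_rate, n_steps):
    for _ in range(n_steps):
        theta = theta - learning_rate * gradient(theta)
    return theta


def run_momentum(theta, learning_rate, momentum, n_steps):
    velocity = np.zeros_like(theta)
    for _ in range(n_steps):
        # Keep a fraction of the previous update, then add the new gradient step.
        velocity = momentum * velocity - learning_rate * gradient(theta)
        theta = theta + velocity
    return theta


start = np.array([1.0, 1.0])
plain_distance = np.linalg.norm(run_plain(start, 0.01, 100))
momentum_distance = np.linalg.norm(run_momentum(start, 0.01, 0.9, 100))
print(plain_distance, momentum_distance)  # distance to the minimum at the origin
```

The learning rate is capped by the steep direction, so plain gradient descent crawls along the shallow one. Momentum accumulates the consistent gradients along that shallow direction and ends up much closer to the minimum after the same number of steps.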

## <div style ="font-size:25px; border-radius:25px; border:3666; padding:10px;  background-color:#fffafa; text-align:center; color:#000000;"> AdaGrad</div>



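AdaGrad adapts the learning rate for each parameter individually: it divides the learning rate by the square root of $G_{t,ii}$, the accumulated sum of that parameter's past squared gradients. Parameters with large or frequent gradients take smaller steps, and rarely updated parameters take larger ones.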


$$
\theta := \theta - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \nabla_{\theta} J(\theta)
$$

In [4]:
import numpy as np


class Adagrad:
    def __init__(self, learning_rate=0.001, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.epsilon = epsilon  # small constant that guards against division by zero
        self.cache = 0  # running sum of squared gradients

    def update(self, weights, gradients):
        self.cache += gradients ** 2
        weight_update = (self.learning_rate / (np.sqrt(self.cache) + self.epsilon)) * gradients
        # Return a new array instead of modifying the caller's weights in place.
        weights = weights - weight_update
        return weights
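
The per-parameter scaling is easiest to see with two parameters whose gradients differ by orders of magnitude. This is a small sketch with a made-up gradient stream; the values are assumptions chosen only to illustrate the effect.

```python
import numpy as np

learning_rate = 0.1
epsilon = 1e-8
cache = np.zeros(2)  # running sum of squared gradients, one entry per parameter

# Hypothetical gradient stream: parameter 0 sees large gradients, parameter 1 small ones.
gradients = np.array([[10.0, 0.1]] * 5)

for g in gradients:
    cache += g ** 2
    effective_lr = learning_rate / (np.sqrt(cache) + epsilon)

last_step = effective_lr * gradients[-1]
print(effective_lr)  # per-parameter learning rates after five steps
print(last_step)     # actual step taken by each parameter
```

Although the raw gradients differ by a factor of 100, both parameters end up taking the same step, `learning_rate / sqrt(5)`, because each gradient is normalized by its own history. The cache only grows, so the effective learning rate keeps shrinking as training goes on.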
        

## <div style ="font-size:25px; border-radius:25px; border:3666; padding:10px;  background-color:#fffafa; text-align:center; color:#000000;"> RMSProp</div>



RMSProp scales the learning rate based on a moving average of the square of the gradients.


$$
E[g^2]_t := \beta E[g^2]_{t-1} + (1 - \beta) \left(\nabla_{\theta} J(\theta)\right)^2
$$

$$
\theta := \theta - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \nabla_{\theta} J(\theta)
$$

In [5]:
class RMSprop:
    def __init__(self, learning_rate=0.0001, beta=0.9, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta = beta
        self.epsilon = epsilon
        self.cache = 0  # exponential moving average of squared gradients

    def update(self, weights, gradients):
        self.cache = self.beta * self.cache + (1 - self.beta) * gradients ** 2
        weight_update = (self.learning_rate * gradients) / (np.sqrt(self.cache) + self.epsilon)
        weights = weights - weight_update
        return weights
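
To see what the moving average buys over a running sum, the sketch below feeds the same constant gradient to an AdaGrad-style cache and an RMSProp-style cache and records the step each would take. The function name `step_sizes` and the constant gradient are assumptions for illustration.

```python
import numpy as np


def step_sizes(n_steps, learning_rate=0.01, beta=0.9, epsilon=1e-8, g=1.0):
    """Step taken at each iteration under a constant gradient g."""
    adagrad_cache, rmsprop_cache = 0.0, 0.0
    adagrad_steps, rmsprop_steps = [], []
    for _ in range(n_steps):
        adagrad_cache += g ** 2  # running sum: grows without bound
        rmsprop_cache = beta * rmsprop_cache + (1 - beta) * g ** 2  # moving average: levels off
        adagrad_steps.append(learning_rate * g / (np.sqrt(adagrad_cache) + epsilon))
        rmsprop_steps.append(learning_rate * g / (np.sqrt(rmsprop_cache) + epsilon))
    return np.array(adagrad_steps), np.array(rmsprop_steps)


adagrad_steps, rmsprop_steps = step_sizes(200)
print(adagrad_steps[-1], rmsprop_steps[-1])
```

After 200 iterations the AdaGrad-style step has decayed to `learning_rate / sqrt(200)`, while the RMSProp-style step has settled at roughly `learning_rate`. The early RMSProp steps are larger than `learning_rate` because the cache starts at zero and takes a few iterations to warm up.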
    
        
        

## <div style ="font-size:25px; border-radius:25px; border:3666; padding:10px;  background-color:#fffafa; text-align:center; color:#000000;"> Adam</div>


Adam combines momentum and RMSProp, with separate moving averages for the gradients and squared gradients.


$$
m_t := \beta_1 m_{t-1} + (1 - \beta_1) \nabla_{\theta} J(\theta)
$$
$$
v_t := \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_{\theta} J(\theta))^2
$$

$$
\hat{m}_t := \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t := \frac{v_t}{1 - \beta_2^t}
$$
$$
\theta := \theta - \frac{\eta \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

In [6]:
class Adam:
    def __init__(self, learning_rate=0.0001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon

        self.m = 0  # moving average of the gradients (first moment)
        self.v = 0  # moving average of the squared gradients (second moment)
        self.t = 0  # step counter used for bias correction

    def update(self, weights, gradients):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * gradients
        self.v = self.beta2 * self.v + (1 - self.beta2) * gradients ** 2
        # Correct the bias toward zero caused by initializing m and v at 0.
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)

        weight_update = self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon)
        weights = weights - weight_update
        return weights