RMSprop
RMSprop (Root Mean Square Propagation) is a popular optimization algorithm used in deep learning to update the parameters of a neural network. It was introduced by Geoffrey Hinton in 2012 as an improvement over the standard stochastic gradient descent (SGD) optimizer.

The main idea behind RMSprop is to adjust the learning rate for each weight based on the average of the squared gradients for that weight. The algorithm keeps track of an exponential moving average of the squared gradients, which is then used to normalize the learning rate for each weight. This normalization prevents the learning rate from becoming too small or too large, which can slow down the training process or cause it to diverge.

RMSprop tends to work well with sparse or unevenly scaled gradients, which are common in deep neural networks. It has become a popular choice for optimizing neural networks because it often converges faster than plain SGD, especially in deep architectures.

Overall, RMSprop is a powerful optimization algorithm that can help accelerate the training of deep neural networks and improve their performance.

$$E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta)\left(\frac {dC}{dw}\right)^2$$
$$w_t = w_{t - 1} - \frac {lr}{\sqrt {E[g^2]_t + \epsilon}} \frac {dC}{dw}$$
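
A single update can be sketched in NumPy as follows. This is a minimal sketch, not a library implementation: the helper name `rmsprop_step`, the variable `avg_sq_grad`, and the toy cost $C(w) = w^2$ are assumptions made for illustration. It keeps a running average of the squared gradient, as described in the text, and divides the step by its square root.

```python
import numpy as np

def rmsprop_step(w, grad, avg_sq_grad, lr=0.001, beta=0.9, epsilon=1e-7):
    """One RMSprop step; returns the new weights and the new running average."""
    # Exponential moving average of the squared gradient.
    avg_sq_grad = beta * avg_sq_grad + (1 - beta) * grad ** 2
    # Scale each weight's step by the root of its own running average.
    w = w - lr / np.sqrt(avg_sq_grad + epsilon) * grad
    return w, avg_sq_grad

w = np.array([0.5, -2.0])
avg = np.zeros_like(w)
# Assumed toy cost C(w) = w**2, so the gradient is 2 * w.
w_new, avg = rmsprop_step(w, 2 * w, avg)
```

Because the average starts at zero, the very first step has roughly the same size for every weight, whatever the size of its gradient.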

In [132]:
import numpy as np

RMSprop is, in the end, just an optimization algorithm built on the underlying concept of gradient descent. Gradient descent needs initial weights and biases as a starting point. Let's assume we have two parameters, so there are $2$ weights and $2$ biases.

What to remember when initializing weights
* Experiments show that when weights are initialized to $0$, they do not change at all
* Experiments show that initializing weights with random numbers that differ hugely from each other is bad practice
* Experiments show that initializing weights with random numbers that differ very little from each other is also bad practice
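
The first point can be checked on a tiny example. The sketch below assumes a hypothetical two-weight network, $y = w_2 \tanh(w_1 x)$ with a squared-error cost; the function `grads` and the input and target values are made up for illustration. With both weights at zero, both gradients vanish, so a gradient-based update has nothing to move.

```python
import numpy as np

def grads(w1, w2, x=1.5, y=2.0):
    """Gradients of C = 0.5 * (w2 * tanh(w1 * x) - y)**2 with respect to w1 and w2."""
    h = np.tanh(w1 * x)      # hidden activation
    err = w2 * h - y         # prediction error
    return err * w2 * (1 - h ** 2) * x, err * h

zero_grads = grads(0.0, 0.0)     # both gradients are zero
small_grads = grads(0.04, 0.1)   # small random-like start gives non-zero gradients
```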


In [133]:
weights = np.random.rand(2) * 0.1
baises = np.random.rand(2) * 0.1

In [134]:
weights 

array([0.03937234, 0.09868363])

In [135]:
baises

array([0.08005411, 0.07454717])

Let's store both of these arrays in another array, `params`

In [136]:
params = np.array([weights , baises])

In [137]:
params

array([[0.03937234, 0.09868363],
       [0.08005411, 0.07454717]])

The gradient is the derivative of the cost with respect to each parameter. Here the cost is assumed to have the form $x^2$, so its derivative is $2x$, since $$\frac {d(x^n)}{dx} = nx^{n-1} \implies \frac {d(x^2)}{dx} = 2x^{2-1} = 2x$$
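
The $2x$ result can be double-checked numerically. This is a small sketch using a central finite difference; the helper names `cost` and `numerical_gradient` are illustrative, and the sample values are the weights printed earlier.

```python
import numpy as np

def cost(x):
    return x ** 2

def numerical_gradient(f, x, h=1e-6):
    """Central finite difference approximation of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = np.array([0.03937234, 0.09868363])
approx = numerical_gradient(cost, x)
exact = 2 * x   # analytical derivative of x**2
```

The two arrays should agree to within floating-point error.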

In [138]:
gradient = np.array([[params[0] * 2] , [params[1] * 2]])

In [139]:
gradient 

array([[[0.07874469, 0.19736726]],

       [[0.16010821, 0.14909435]]])

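Now let's have a look at the formula $$E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta)\left(\frac {dC}{dw}\right)^2$$

What is $\frac {dC}{dw}$ here? It is the gradient of the cost $C$ with respect to the weight $w$. $E[g^2]_t$ is the weighted moving average of its square.

Let's first apply this with $\beta$ left out of the equation. As a rough warm-up, the cell below takes the squared gradient in place of the previous average $E[g^2]_{t-1}$ and adds the unsquared gradient as the new term.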


In [142]:
# Drop the singleton middle axis so the gradient is a plain 2x2 array.
grad = gradient[:, 0, :]

# Without beta: squared gradient plus the raw gradient, element-wise.
updated_gradient = grad ** 2 + grad

In [143]:
updated_gradient

array([[0.08494541, 0.23632109],
       [0.18574285, 0.17132347]])

Let's now add $\beta$ to this. What is $\beta$? It is a hyperparameter (the decay rate) that controls how much weight the previous average gets relative to the new gradient. Here its value is $0.9$.
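
To get a feel for what $\beta$ does, the sketch below feeds a constant gradient into the moving average of squared gradients. The gradient value `g = 0.2` and the list `history` are made up for illustration. With $\beta = 0.9$ the average starts at a tenth of $g^2$ and climbs towards $g^2$ over a few dozen steps; a larger $\beta$ would make it climb more slowly.

```python
import numpy as np

g = 0.2        # hypothetical constant gradient
beta = 0.9
avg = 0.0      # the running average starts at zero
history = []

for t in range(1, 51):
    avg = beta * avg + (1 - beta) * g ** 2
    history.append(avg)

# For a constant gradient the average after t steps is (1 - beta**t) * g**2.
closed_form = (1 - beta ** 50) * g ** 2
```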

In [144]:
beta = 0.9

In [145]:
updated_gradient = beta * gradient[0][0] ** 2 +(1 - beta) * gradient[0][0] 

In [146]:
# Same calculation as before, now weighted by beta and (1 - beta).
grad = gradient[:, 0, :]
updated_gradient = beta * grad ** 2 + (1 - beta) * grad

In [147]:
updated_gradient

array([[0.01345512, 0.05479518],
       [0.039082  , 0.03491565]])

Now we need to understand the idea that RMSprop builds on. I found a great explanation [here](https://towardsdatascience.com/understanding-rmsprop-faster-neural-network-learning-62e116fcf29a), or see the explanation below, which describes Rprop, the sign-based predecessor of RMSprop:
* First, we look at the signs of the last two gradients for the weight.
  * If they have the same sign, that means we’re going in the right direction and should accelerate by a small fraction, meaning we should increase the step size multiplicatively (e.g. by a factor of 1.2).
  * If they’re different, that means we took too large a step and jumped over a local minimum, thus we should decrease the step size multiplicatively (e.g. by a factor of 0.5).
* Then, we limit the step size between two values. These values really depend on your application and dataset; good default values to start with are 50 and a millionth.
* Now we can apply the weight update
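
The steps above can be sketched as follows. This is a simplified, hedged version of the sign-based rule, not a reference implementation: the function name `rprop`, its parameter names, the starting step size of 0.1, and the toy cost $x^2$ (gradient $2x$) are all assumptions for illustration.

```python
import numpy as np

def rprop(w, grad_fn, steps=50, step_size=0.1, up=1.2, down=0.5,
          min_step=1e-6, max_step=50.0):
    """Sign-based update: only the sign of the gradient is used, not its size."""
    w = np.array(w, dtype=float)
    step = np.full_like(w, step_size)
    prev_grad = np.zeros_like(w)
    for _ in range(steps):
        grad = grad_fn(w)
        same_sign = grad * prev_grad > 0
        flipped = grad * prev_grad < 0
        # Same sign: grow the step. Different sign: we overshot, so shrink it.
        step = np.where(same_sign, step * up, step)
        step = np.where(flipped, step * down, step)
        # Keep the step size between the two limits.
        step = np.clip(step, min_step, max_step)
        # Apply the weight update.
        w = w - np.sign(grad) * step
        prev_grad = grad
    return w

# Assumed toy cost x**2, whose gradient is 2 * x.
w_final = rprop([3.0, -4.0], lambda w: 2 * w)
```

Both weights should end up close to the minimum at zero after accelerating towards it and then shrinking their steps once they overshoot.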

In [154]:
# Running average of the squared gradient, E[g^2], starting from zero.
updated_gradient = np.zeros(shape = (2,2))
gradient_rec = []
grad = np.reshape(gradient , (2,2))   # flatten the (2,1,2) gradient to 2x2

for k in range(0 ,10):
    # The parameters are not updated yet, so the gradient stays fixed and
    # the average simply moves towards grad ** 2 step by step.
    updated_gradient = beta * updated_gradient + (1 - beta) * grad ** 2
    gradient_rec.append(updated_gradient)

Now we need to apply the second formula, the weight update $$w_t = w_{t - 1} - \frac {lr}{\sqrt {E[g^2]_t + \epsilon}} \frac {dC}{dw}$$

In [162]:
def rms_prop(params , lr = 0.001 , beta = 0.9 , epochs = 100 , epsilon = 1e-7):
    # Work on a float copy so the caller's array is left untouched.
    params = np.array(params , dtype = float)
    # Running average of the squared gradient, E[g^2], one entry per parameter.
    avg_sq_grad = np.zeros_like(params)

    for epoch in range(epochs):
        # The cost is assumed to be x^2 for every parameter, so dC/dw = 2x.
        gradient = 2 * params

        # E[g^2]_t = beta * E[g^2]_{t-1} + (1 - beta) * (dC/dw)^2
        avg_sq_grad = beta * avg_sq_grad + (1 - beta) * gradient ** 2

        # w_t = w_{t-1} - lr / sqrt(E[g^2]_t + epsilon) * dC/dw
        params = params - lr / np.sqrt(avg_sq_grad + epsilon) * gradient

    return params
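
To see the per-weight scaling at work, here is a hedged comparison with plain gradient descent. Everything in it is an assumption for illustration: the cost $C(w) = \frac{1}{2}\sum_i s_i w_i^2$ with made-up curvatures `scale`, the helper names `run_sgd` and `run_rmsprop`, and the starting point.

```python
import numpy as np

def grad(w, scale):
    """Gradient of the assumed cost C(w) = 0.5 * sum(scale * w**2)."""
    return scale * w

def run_sgd(w, scale, lr=0.001, epochs=100):
    w = np.array(w, dtype=float)
    for _ in range(epochs):
        w = w - lr * grad(w, scale)
    return w

def run_rmsprop(w, scale, lr=0.001, beta=0.9, epochs=100, epsilon=1e-7):
    w = np.array(w, dtype=float)
    avg_sq_grad = np.zeros_like(w)
    for _ in range(epochs):
        g = grad(w, scale)
        avg_sq_grad = beta * avg_sq_grad + (1 - beta) * g ** 2
        w = w - lr / np.sqrt(avg_sq_grad + epsilon) * g
    return w

scale = np.array([100.0, 1.0])   # the first weight has a much steeper cost
start = np.array([0.1, 0.1])
sgd_result = run_sgd(start, scale)
rmsprop_result = run_rmsprop(start, scale)
```

In this toy setup plain gradient descent barely moves the second, flatter weight at this learning rate, while RMSprop divides out the gradient scale and should move it noticeably closer to zero.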