<a href="https://colab.research.google.com/github/KayalvizhiT513/Gradient_Descent_Comparison/blob/main/SGD_Adagrad.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Adagrad (Adaptive Gradient Algorithm)**
Whatever the optimizer we learned till SGD with momentum, the learning rate remains constant. In Adagrad optimizer, there is no momentum concept so, it is much simpler compared to SGD with momentum.

The idea behind Adagrad is to use different learning rates for each parameter base on iteration. The reason behind the need for different learning rates is that the learning rate for sparse features parameters needs to be higher compare to the dense features parameter because the frequency of occurrence of sparse features is lower.

* Equation:

![link text](https://miro.medium.com/v2/resize:fit:828/format:webp/1*XWvo73EMLhIeGs35xkimVw.png)

In the above Adagrad optimizer equation, the learning rate has been modified in such a way that it will automatically decrease because the summation of the previous gradient square will always keep on increasing after every time step.





In [2]:
import numpy as np
import csv

# Step 1: Initialize weights and learning rate
w_0 = 0.8260560647266798
w_1 = 0.5782539087214469
learning_rate = 0.01

# Step 2: Load data from CSV file
data = np.genfromtxt('randXY.csv', delimiter=',', skip_header=1)  # Adjust the filename accordingly

# Extract X and Y from the loaded data
X = data[:, 0]  # Assuming the first column is X
Y = data[:, 1]  # Assuming the second column is Y

print("Initial w0: ", w_0, "\n Initial w1: ", w_1)

Initial w0:  0.8260560647266798 
 Initial w1:  0.5782539087214469


In [3]:
def percentage_difference(value1, value2):
    return (np.abs(value1 - value2) / ((value1 + value2) / 2)) * 100

In [11]:
import matplotlib.pyplot as plt

# Function to plot epoch vs. loss
def plot_loss_vs_epoch(loss_history, algorithm_name):
    plt.plot(range(len(loss_history)), loss_history, label=algorithm_name)
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title(f'{algorithm_name} - Epoch vs. Loss')
    plt.legend()
    plt.show()

In [34]:
from math import sqrt

def update_w_with_adagrad(w, learning_rate, a, gradient_w, e = 1e-8):
    # Update learning_rate
    learning_rate = learning_rate / sqrt(a + e)
    print(learning_rate)
    # Update w0
    w = w - (learning_rate * gradient_w)

    return w, learning_rate

In [36]:
# Step 3: Define SGD function
def sgd_one_sample(X, Y, w0, w1, learning_rate, epochs=6000, tol=1, consecutive_instances=10):
    n = len(X)
    prev_loss = float('inf')
    count = 0
    a = 0

    # Initialize previous gradients and accumulators
    learning_rate_w0 = 0.01
    learning_rate_w1 = 0.01
    a_w0 = 0.0
    a_w1 = 0.0

    for epoch in range(epochs):
        for i in range(n):
            # Select one random data point
            random_index = np.random.randint(0, n)
            x_i = X[random_index]
            y_i = Y[random_index]

            # Calculate prediction and loss for the selected point
            prediction = w0 + w1 * x_i
            loss = (y_i - prediction)**2

            # Calculate gradients
            gradient_w0 = -2 * (y_i - prediction)
            gradient_w1 = -2 * (y_i - prediction) * x_i

            # Update weights
            w0, learning_rate_w0 = update_w_with_adagrad(w0, learning_rate_w0, a_w0, gradient_w0)
            w1, learning_rate_w1 = update_w_with_adagrad(w1, learning_rate_w1, a_w1, gradient_w1)

            # Update a for next iteration
            a_w0 += gradient_w0**2
            a_w1 += gradient_w1**2


        # Calculate overall loss for monitoring
        predictions = w0 + w1 * X
        overall_loss = np.mean((Y - predictions)**2)

        percent_diff = percentage_difference(prev_loss, overall_loss)
        if percent_diff < tol:
          count += 1
        else:
          count = 0

        # Print loss for monitoring
        if epoch % 500 == 0:
            #print(f"learning_rate_w0: {learning_rate_w0}, learning_rate_w1: {learning_rate_w1}")
            print(f"Epoch {epoch}, Loss: {overall_loss}")

        # Update the stopping criteria to consider non-inf values
        if count >= consecutive_instances:
            print(f"Epoch {epoch}, Loss: {overall_loss}")
            print("Converged! ", count)
            break

        # Append loss to the history
        loss_history_sgd_one_sample.append(overall_loss)

        # Update previous loss for the next iteration
        prev_loss = overall_loss

    return w0, w1

# Step 4: Run SGD with one training sample at a time
w_0 = 0.8260560647266798
w_1 = 0.5782539087214469
learning_rate = 0.01
loss_history_sgd_one_sample = []
w0_sgd_one_sample, w1_sgd_one_sample = sgd_one_sample(X, Y, w_0, w_1, learning_rate)
print(f"Final weights for SGD with one sample: w0={w0_sgd_one_sample}, w1={w1_sgd_one_sample}")

100.0
100.0
9.298760886226711
27.711526218244984
0.0036023646685775968
0.017442514937483083
2.8022630092704322e-08
1.433760460564053e-07
1.5904773403561191e-13
8.757906342326613e-13
7.822421432084659e-19
4.929176536329724e-18
3.5355489010365785e-24
2.6914289373142426e-23
1.446132819221398e-29
1.3626529008405143e-28
5.508979513396002e-35
6.617130028231591e-34
1.9437754498358028e-40
2.997171322465401e-39
6.327137774256306e-46
1.225898325697963e-44
2.0259622228089563e-51
5.011740320542445e-50
6.128391524645302e-57
1.9406746303984138e-55
1.749685875737617e-62
7.020602418340032e-61
4.72091578629556e-68
2.3563663881737488e-66
1.2639880157749611e-73
7.908789841080244e-72
3.290862619393649e-79
2.6204588310039686e-77
8.384074779397232e-85
8.624745597275803e-83
2.0528379728937005e-90
2.716277826625664e-88
4.970645942152942e-96
8.548447773251359e-94
1.1556433837579513e-101
2.54524341149201e-99
2.6199698160963326e-107
7.45176244501222e-105
5.715671069920386e-113
2.0588110669795995e-110
1.233036935

  return (np.abs(value1 - value2) / ((value1 + value2) / 2)) * 100


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0

Lastly, despite not having to manually tune the learning rate there is one huge disadvantage i.e due to monotonically decreasing learning rates, at some point in time step, the model will stop learning as the learning rate is almost close to 0.