
# What is Nesterov Accelerated Gradient Descent?
Nesterov accelerated gradient descent (NAG) is a variant of gradient descent for minimizing an objective function. It builds on momentum-based gradient descent, which carries a fraction of the previous update over into the current one. Instead of evaluating the gradient at the current point, NAG first moves to a look-ahead point (the current parameters shifted by the momentum term) and evaluates the gradient there. Because the gradient is measured where the momentum is about to carry the parameters, the update can correct course earlier and overshoot less.
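
Written out, the look-ahead update described above can be sketched as follows. The symbols are assumptions introduced here for illustration: $w_t$ for the parameters, $v_t$ for the update at step $t$, $\gamma$ for the momentum coefficient, $\eta$ for the learning rate, and $E$ for the objective.

```latex
\begin{aligned}
\tilde{w}_t &= w_t - \gamma\, v_{t-1} \\
v_t &= \gamma\, v_{t-1} + \eta\, \nabla E(\tilde{w}_t) \\
w_{t+1} &= w_t - v_t
\end{aligned}
```

Setting $\tilde{w}_t = w_t$ in the second line gives the plain momentum update, so the only change NAG makes is where the gradient is evaluated.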

Momentum gradient descent (MGD) and NAG both add a momentum term to the standard gradient descent update, so each step combines the current gradient with a decayed history of previous updates. Only NAG adds the look-ahead step.

The main difference between MGD and NAG is where the gradient is evaluated. In MGD, the update is an exponentially weighted sum of past gradients, each evaluated at the parameters current at that step. In NAG, the gradient is evaluated at the look-ahead point `w - gamma * prev_v_w` (and likewise for `b`), which is what `do_nesterov_accelerated_gradient_descent` does.
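
To make the difference concrete, here is a minimal sketch that applies both update rules to a toy one-dimensional objective. The objective `g(w) = w**2`, the helper names `momentum_step`, `nesterov_step` and `quad_grad`, and the values `eta=0.1`, `gamma=0.9` are assumptions chosen for illustration; they are not part of the notebook.

```python
def momentum_step(w, v, grad, eta=0.1, gamma=0.9):
    """Classical momentum: the gradient is evaluated at the current point w."""
    v_new = gamma * v + eta * grad(w)
    return w - v_new, v_new


def nesterov_step(w, v, grad, eta=0.1, gamma=0.9):
    """NAG: the gradient is evaluated at the look-ahead point w - gamma * v."""
    v_new = gamma * v + eta * grad(w - gamma * v)
    return w - v_new, v_new


def quad_grad(w):
    """Gradient of the hypothetical toy objective g(w) = w**2."""
    return 2.0 * w


# Start both methods from the same point with zero velocity.
w_m, v_m = 1.0, 0.0
w_n, v_n = 1.0, 0.0
for _ in range(2):
    w_m, v_m = momentum_step(w_m, v_m, quad_grad)
    w_n, v_n = nesterov_step(w_n, v_n, quad_grad)

print("momentum:", w_m)
print("nesterov:", w_n)
```

The first step is identical for both methods because the velocity starts at zero. On the second step, NAG measures the gradient closer to the minimum, so it takes the smaller step: momentum ends near 0.46 and NAG near 0.496.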

In [None]:
import numpy as np

X = [0.5, 2.5]
Y = [0.2, 0.9]

In [None]:
def f(w,x,b):
  return 1.0/(1.0 + np.exp(-(w*x + b)))

In [None]:
def error(w,b):
  err = 0.0
  for x,y in zip(X,Y):
    fx = f(w,x,b)
    err += 0.5 * (fx - y)**2
  return err
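
The same loss can also be written without the Python loop. The sketch below is a hypothetical vectorized equivalent of `error`, assuming the data is held in NumPy arrays; the names `sigmoid` and `error_vectorized` are introduced here for illustration only.

```python
import numpy as np

# Same two data points as the notebook, stored as arrays.
X = np.array([0.5, 2.5])
Y = np.array([0.2, 0.9])


def sigmoid(w, x, b):
    """Sigmoid neuron output for input x (scalar or array)."""
    return 1.0 / (1.0 + np.exp(-(w * x + b)))


def error_vectorized(w, b):
    """Half the sum of squared residuals over all data points."""
    residual = sigmoid(w, X, b) - Y
    return 0.5 * np.sum(residual ** 2)


initial_error = error_vectorized(0.0, 0.0)
print(initial_error)
```

At `w = b = 0` the neuron outputs 0.5 for every input, so the loss is `0.5 * (0.3**2 + 0.4**2)`, which is about 0.125. This is a handy reference value when checking that the loss goes down during training.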

In [None]:
def grad_w(w,b,x,y):
  fx = f(w,x,b)  # f takes its arguments in the order (w, x, b)
  return (fx - y) * fx * (1-fx) * x

def grad_b(w,b,x,y):
  fx = f(w,x,b)  # f takes its arguments in the order (w, x, b)
  return (fx - y) * fx * (1-fx)

In [None]:
def do_nesterov_accelerated_gradient_descent():
  w, b, eta, max_epochs = 0, 0, 1.0, 100
  prev_v_w, prev_v_b, gamma = 0, 0, 0.9
  for i in range(max_epochs):
    dw, db = 0, 0
    # Momentum-only part of the update, used to find the look-ahead point
    v_w = gamma * prev_v_w
    v_b = gamma * prev_v_b
    # Accumulate the gradients evaluated at the look-ahead point
    for x,y in zip(X,Y):
      dw += grad_w(w - v_w, b - v_b, x, y)
      db += grad_b(w - v_w, b - v_b, x, y)

    # Full update: momentum plus the look-ahead gradient
    v_w = gamma * prev_v_w + eta* dw
    v_b = gamma * prev_v_b + eta* db
    w = w - v_w
    b = b - v_b
    prev_v_w = v_w
    prev_v_b = v_b
    print(w,b)
    print(error(w,b))  # call error; print(error) would show the function object

In [None]:
do_nesterov_accelerated_gradient_descent()

Each epoch prints the updated `w` and `b`, followed by the error at that point.
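
One way to sanity-check analytic gradients such as `grad_w` and `grad_b` is to compare them with central finite differences of the loss. The sketch below is self-contained and redefines the model with `f(w, x, b)`; the helpers `analytic_grad` and `numeric_grad`, the step size `h`, and the test point `(0.3, -0.2)` are assumptions made for illustration.

```python
import numpy as np

X = [0.5, 2.5]
Y = [0.2, 0.9]


def f(w, x, b):
    """Sigmoid neuron output."""
    return 1.0 / (1.0 + np.exp(-(w * x + b)))


def error(w, b):
    """Half the sum of squared residuals over the data."""
    return sum(0.5 * (f(w, x, b) - y) ** 2 for x, y in zip(X, Y))


def analytic_grad(w, b):
    """Sum the per-example gradients of the squared error."""
    dw, db = 0.0, 0.0
    for x, y in zip(X, Y):
        fx = f(w, x, b)
        dw += (fx - y) * fx * (1 - fx) * x
        db += (fx - y) * fx * (1 - fx)
    return dw, db


def numeric_grad(w, b, h=1e-6):
    """Central finite differences of the loss (hypothetical checking helper)."""
    dw = (error(w + h, b) - error(w - h, b)) / (2 * h)
    db = (error(w, b + h) - error(w, b - h)) / (2 * h)
    return dw, db


dw_a, db_a = analytic_grad(0.3, -0.2)
dw_n, db_n = numeric_grad(0.3, -0.2)
print(abs(dw_a - dw_n), abs(db_a - db_n))
```

Both printed differences should be tiny compared with the gradients themselves. A large mismatch usually points to a mistake in the derivative or in the argument order of a call.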