# Gradient descent

```{prf:algorithm} Gradient descent
:label: GD
:nonumber:


To solve the optimization problem


$$
   \mathcal L(\boldsymbol w) \to \min\limits_{\boldsymbol w}
$$


do the following steps:


1. initialize $\boldsymbol w$ by some random values (e.g., from $\mathcal N(0, 1)$)
2. choose **tolerance** $\varepsilon > 0$ and **learning rate** $\eta > 0$
3. while $\Vert \nabla\mathcal L(\boldsymbol w) \Vert > \varepsilon$ do the **gradient step**


   $$
   \boldsymbol w := \boldsymbol w - \eta\nabla\mathcal L(\boldsymbol w)
   $$
4. return $\boldsymbol w$
```


```{figure} https://developers.google.com/machine-learning/crash-course/images/LearningRateJustRight.svg
:align: center
:width: 400px
Main idea of gradient descent algorithm
```


The algorithm initializes parameters randomly, sets hyperparameters such as ***tolerance*** and ***learning rate***, and iteratively updates the parameters using gradient descent until convergence, ultimately returning the optimized parameter vector. This general framework is widely used for solving optimization problems in machine learning and other fields.


```{figure} https://media.tenor.com/7zVwMezTxiAAAAAC/gradientdescent-graph.gif
:align: center
:width: 400px
Iteration process of gradient descent
```


```{note}
**Learning rate** determines the size of the steps taken during the optimization and therefore influences the convergence speed: a smaller learning rate may lead to more precise results but slower convergence.
```
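To make this trade-off concrete, here is a small sketch (an illustration, not part of the original material) that applies the gradient step to $f(x) = x^2$ with several learning rates. For this function each step multiplies $x$ by $1 - 2\eta$, so the iterates shrink when $0 < \eta < 1$, flip sign forever when $\eta = 1$, and grow without bound when $\eta > 1$. The function name `run_gd`, the starting point and the chosen rates are assumptions made for the example.

```python
def run_gd(learning_rate, x0=3.0, steps=20):
    """Apply `steps` gradient steps to f(x) = x**2 and return all iterates."""
    xs = [x0]
    for _ in range(steps):
        grad = 2 * xs[-1]                      # f'(x) = 2x
        xs.append(xs[-1] - learning_rate * grad)
    return xs


for lr in (0.01, 0.1, 0.5, 1.0, 1.1):
    final = run_gd(lr)[-1]
    print(f"learning rate {lr:<4}: |x| after 20 steps = {abs(final):.4g}")
```

A very small rate is safe but slow, while a rate that is too large prevents convergence altogether.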


```{note}
If the condition $\Vert \nabla\mathcal L(\boldsymbol w) \Vert > \varepsilon$ keeps holding for too long, the loop in step 3 is terminated after a fixed number of iterations `max_iter`.
```
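The algorithm and the `max_iter` safeguard can be put together in a few lines of NumPy. This is a minimal sketch under stated assumptions: the function name `gradient_descent_sketch`, its default values and the example loss $\mathcal L(\boldsymbol w) = \Vert \boldsymbol w - \boldsymbol c \Vert^2$ are chosen for illustration and do not come from the text above.

```python
import numpy as np


def gradient_descent_sketch(grad, dim, learning_rate=0.1, tol=1e-6, max_iter=10_000, seed=0):
    """Minimize a differentiable loss, given a function `grad` that returns its gradient."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 1.0, size=dim)      # step 1: random initialization from N(0, 1)
    g = grad(w)
    n_iter = 0
    # step 3: repeat the gradient step while the gradient norm exceeds the tolerance,
    # but never perform more than max_iter iterations
    while np.linalg.norm(g) > tol and n_iter < max_iter:
        w = w - learning_rate * g
        g = grad(w)
        n_iter += 1
    return w, n_iter                         # step 4


# Example loss (an assumption): L(w) = ||w - c||^2 with gradient 2 (w - c), minimized at w = c.
c = np.array([1.0, -2.0, 0.5])
w_opt, n_used = gradient_descent_sketch(lambda w: 2 * (w - c), dim=3)
print(w_opt, n_used)
```

Returning the iteration count alongside the result makes it easy to tell whether the loop stopped because of the tolerance or because it hit `max_iter`.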

### Example


Let's consider the simple quadratic function $f(x) = x^2$. The purpose is to demonstrate a function where gradient descent can converge to the minimum in just a few steps.


1. Calculate the gradient:
   $\nabla f(x) = 2x$
   - Initialize the parameter: choose an initial value for $x$, e.g., $x_{\text{old}} = 3.0$.
   - Set the learning rate: choose a small learning rate, e.g., $\eta = 0.1$.
2. Update rule:
   $x_{\text{new}} = x_{\text{old}} - \eta \cdot \nabla f(x_{\text{old}})$
3. Iterative updates:

   **Iteration 1:**
   $\nabla f(x_{\text{old}}) = 2 \cdot 3.0 = 6.0$,
   $x_{\text{new}} = x_{\text{old}} - 0.1 \cdot 6.0 = 3.0 - 0.6 = 2.4$.
   Result: $x_{\text{new}} = 2.4$, $f(x_{\text{new}}) = (2.4)^2 = 5.76$

   **Iteration 2:**
   $\nabla f(x_{\text{old}}) = 2 \cdot 2.4 = 4.8$,
   $x_{\text{new}} = x_{\text{old}} - 0.1 \cdot 4.8 = 2.4 - 0.48 = 1.92$.
   Result: $x_{\text{new}} = 1.92$, $f(x_{\text{new}}) = (1.92)^2 = 3.6864$

   $$\ldots$$

4. General form:
   **Iteration n:**
   $\nabla f(x_{n-1}) = 2 \cdot x_{n-1}$,
   $x_{n} = x_{n-1} - 0.1 \cdot \nabla f(x_{n-1})$.
   Result: $f(x_{n}) = (x_{n})^2$


The process should show that the value of $x$ approaches the minimum $(x=0)$ relatively quickly due to the simplicity of the quadratic function.
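
The iterations above can be reproduced with a short loop. The sketch below uses the same starting value 3.0 and learning rate 0.1 as the example; the function names `quadratic` and `quadratic_grad` and the choice of ten iterations are assumptions for illustration. Since each step replaces $x$ by $x - 0.1 \cdot 2x = 0.8x$, the iterates follow $x_n = 3 \cdot 0.8^n$.

```python
def quadratic(x):
    return x ** 2


def quadratic_grad(x):
    return 2 * x


x = 3.0               # initial value from the example
learning_rate = 0.1   # learning rate from the example
history = [x]

for step in range(1, 11):
    x = x - learning_rate * quadratic_grad(x)   # gradient step
    history.append(x)
    print(f"iteration {step:2d}: x = {x:.4f}, f(x) = {quadratic(x):.4f}")
```

After ten iterations $x$ has dropped from 3.0 to roughly 0.32, and it keeps shrinking by a factor of 0.8 per step.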


<span style="display:none" id="quiz_1">W3sicXVlc3Rpb24iOiAiV2hhdCBpcyB0aGUgcHJpbWFyeSBnb2FsIHdoZW4gdXNpbmcgR3JhZGllbnQgRGVzY2VudCBvcHRpbWl6YXRpb24gdG8gdHJhaW4gYSBwcmVkaWN0aXZlIG1vZGVsPyIsICJ0eXBlIjogIm1hbnlfY2hvaWNlIiwgImFuc3dlcnMiOiBbeyJhbnN3ZXIiOiAiTWF4aW1pemUgdGhlIGVycm9yIHRvIGV4cGxvcmUgcGFyYW1ldGVyIHNwYWNlIHJhbmRvbWx5IiwgImNvcnJlY3QiOiBmYWxzZSwgImZlZWRiYWNrIjogIkluY29ycmVjdC4gVHJ5aW5nIHRvIG1heGltaXplIGVycm9ycyBpcyBsaWtlIHRyeWluZyB0byB3aW4gYSByYWNlIGJ5IGdvaW5nIGJhY2t3YXJkLiBOb3QgdGhlIGJlc3Qgc3RyYXRlZ3kgZm9yIGFjY3VyYXRlIHByZWRpY3Rpb25zISJ9LCB7ImFuc3dlciI6ICJTZXQgcGFyYW1ldGVycyByYW5kb21seSB0byBhY2hpZXZlIGFjY3VyYXRlIHByZWRpY3Rpb25zIiwgImNvcnJlY3QiOiBmYWxzZSwgImZlZWRiYWNrIjogIkluY29ycmVjdC4gU2V0dGluZyBwYXJhbWV0ZXJzIHJhbmRvbWx5IGlzIG5vdCBhIHN5c3RlbWF0aWMgYXBwcm9hY2ggZm9yIGFjY3VyYXRlIHByZWRpY3Rpb25zLiJ9LCB7ImFuc3dlciI6ICJNaW5pbWl6ZSB0aGUgZXJyb3IgdG8gYWRqdXN0IHBhcmFtZXRlcnMgZm9yIGFjY3VyYXRlIHByZWRpY3Rpb25zIiwgImNvcnJlY3QiOiB0cnVlLCAiZmVlZGJhY2siOiAiQ29ycmVjdCEgTWluaW1pemluZyBlcnJvcnMgaXMgbGlrZSBmaW5kaW5nIHRoZSBwZXJmZWN0IHNsaWNlIGluIGEgcGl6emFcdTIwMTRwcmVjaXNlIGFuZCBzYXRpc2Z5aW5nIGZvciBhY2N1cmF0ZSBwcmVkaWN0aW9ucy4ifSwgeyJhbnN3ZXIiOiAiSWdub3JlIGVycm9ycyBhbmQgZm9jdXMgb24gZmVhdHVyZSBlbmdpbmVlcmluZyIsICJjb3JyZWN0IjogZmFsc2UsICJmZWVkYmFjayI6ICJJbmNvcnJlY3QuIElnbm9yaW5nIGVycm9ycyB3b3VsZCBoaW5kZXIgdGhlIG1vZGVsJ3MgYWJpbGl0eSB0byBtYWtlIGFjY3VyYXRlIHByZWRpY3Rpb25zLiJ9XX1d</span>

In [None]:
from jupyterquiz import display_quiz

display_quiz("#quiz_1")

Nevertheless, gradient descent has its drawbacks: two notable issues can arise in practical applications. Like any vector, the negative gradient has both a direction and a magnitude. Depending on the function being minimized, either one or both of these can cause problems when the negative gradient is used as a descent direction.


### Vanishing behavior of gradient magnitude


Below we show an example run of gradient descent using the function


$$
g(w) = w^4 + 0.1
$$


whose minimum is at the origin $w = 0$. This example highlights how the gradient magnitude influences the step length in gradient descent. Steps are large far from a stationary point but become small, resembling crawling, near the minimum of the function. A deliberately chosen step length (learning rate) of $10^{-1}$, run for 10 steps, accentuates this behavior. The left panel shows the original function, and the right panel displays the steps from start <span style='color:green'>(green)</span> to final <span style='color:red'>(red)</span>. The natural crawling near the minimum, due to the vanishing gradient magnitude, hampers quick progress, as reflected in the cost function history plot.


```{figure} https://iili.io/JTfn4ft.png
:align: center
:width: 400px
Behavior of gradient descent near the minimum of a function
```
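The crawling behavior can also be seen numerically. The sketch below runs 10 steps with step length 0.1 on $g(w) = w^4 + 0.1$ and prints the length of every step. The starting point `w = 1.0` is an assumption, since the text does not state which one was used for the figure.

```python
def g(w):
    return w ** 4 + 0.1


def grad_g(w):
    return 4 * w ** 3


w = 1.0               # starting point: an assumption, not given in the text
learning_rate = 0.1   # step length used in the text
step_sizes = []

for step in range(1, 11):                    # 10 steps, as in the text
    update = learning_rate * grad_g(w)
    w -= update
    step_sizes.append(abs(update))
    print(f"step {step:2d}: w = {w:.4f}, step length = {abs(update):.4f}, g(w) = {g(w):.4f}")
```

The first step covers a distance of 0.4, while later steps shrink to about a hundredth, so the iterate is still far from $w = 0$ when the run ends.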


The other case of improper performance is the zig-zagging behavior of gradient descent, which is caused by the direction of the negative gradient rather than by its magnitude. It is described in the [provided source](https://jermwatt.github.io/machine_learning_refined/notes/3_First_order_methods/3_7_Problems.html#The-'zig-zagging'-behavior-of-gradient-descent).


## Adding Momentum




```{admonition} Purpose of Momentum




When employing gradient descent, several challenges arise:

1. Becoming stuck at a local minimum, a consequence of the algorithm's greediness.
2. Overshooting and missing the global optimum because of excessively rapid movement along the gradient direction.
3. Oscillation, a phenomenon occurring when the function's value remains relatively constant, resembling navigation on a plateau where the height remains the same regardless of the direction.

To address these issues, a momentum term with coefficient $\alpha$ is introduced into the expression for the update $\Delta \textbf{w}$: each update reuses a fraction of the previous one, which smooths the trajectory while approaching the optimum.

In the following, the superscript ***i*** indicates the iteration number:

$$\Delta \textbf{w}^i = - \eta \nabla_\textbf{w} f(\textbf{w}^i) + \alpha \Delta \textbf{w}^{i-1}$$

The parameters are then updated by this step:

$$\textbf{w}^{i+1} = \textbf{w}^i + \Delta \textbf{w}^i$$

The momentum coefficient $\alpha$ is a hyperparameter, typically chosen in $[0, 1)$; with $\alpha = 0$ the rule reduces to plain gradient descent.


```
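As a rough illustration of what the momentum term does, the sketch below counts how many steps are needed with and without momentum on an elongated quadratic bowl. Each update is the negative gradient scaled by the learning rate plus a fraction of the previous update, mirroring the `delta` variable used in the code of the next section. The loss, the starting point, the tolerance and the function name `count_steps` are assumptions made for this example.

```python
import numpy as np


def grad(w):
    """Gradient of the illustrative loss f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return np.array([1.0, 25.0]) * w


def count_steps(learning_rate, momentum, tol=1e-6, max_iter=10_000):
    """Return the number of steps taken until the gradient norm drops to tol or below."""
    w = np.array([1.0, 1.0])
    delta = np.zeros_like(w)                  # previous update, zero at the start
    for i in range(max_iter):
        if np.linalg.norm(grad(w)) <= tol:
            return i
        delta = -learning_rate * grad(w) + momentum * delta   # update with momentum
        w = w + delta
    return max_iter


plain = count_steps(learning_rate=0.03, momentum=0.0)
with_momentum = count_steps(learning_rate=0.03, momentum=0.5)
print(f"plain gradient descent: {plain} steps, with momentum: {with_momentum} steps")
```

On this particular loss the run with momentum needs noticeably fewer steps; on other functions or with other hyperparameters the effect can differ.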


## Gradient Descent for a Univariate Function


Let's define the objective function and its **derivative**, and set up the gradient descent function.


```{note}
**Derivative $f'(x)$** describes how the function changes at a specific point; its sign and magnitude indicate the direction and the speed of movement toward the function's minimum.
```
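A hand-derived derivative is easy to get wrong, so it can be worth comparing it with a numerical approximation before running gradient descent. The sketch below does this for the objective $f(x) = x^4 - 4x^2 + 5x$ used in this section; the helper `central_difference`, the step `h` and the test points are assumptions for illustration.

```python
def f(x):
    return x ** 4 - 4 * x ** 2 + 5 * x


def derivative_f(x):
    return 4 * x ** 3 - 8 * x + 5


def central_difference(func, x, h=1e-5):
    """Approximate func'(x) by the symmetric difference quotient."""
    return (func(x + h) - func(x - h)) / (2 * h)


for x in (-2.0, 0.0, 1.0, 2.0):
    analytic = derivative_f(x)
    numeric = central_difference(f, x)
    print(f"x = {x:5.1f}: analytic = {analytic:9.4f}, numeric = {numeric:9.4f}")
```

If the two columns disagree by more than a small rounding error, the analytic derivative most likely contains a mistake.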


<span style="display:none" id="quiz_2">W3sicXVlc3Rpb24iOiAiV2hlbiBlbXBsb3lpbmcgR3JhZGllbnQgRGVzY2VudCwgd2hhdCBjaGFsbGVuZ2VzIGNhbiBiZSBhZGRyZXNzZWQgYnkgaW50cm9kdWNpbmcgYSBtb21lbnR1bSB0ZXJtICRcXGFscGhhJCA/IFNlbGVjdCBhbGwgdGhhdCBhcHBseSIsICJ0eXBlIjogIm1hbnlfY2hvaWNlIiwgImFuc3dlcnMiOiBbeyJhbnN3ZXIiOiAiVGhlIGFsZ29yaXRobSBjb252ZXJnZXMgdG9vIHNsb3dseSBkdWUgdG8gYSBzbWFsbCBsZWFybmluZyByYXRlIiwgImNvcnJlY3QiOiBmYWxzZSwgImZlZWRiYWNrIjogIkluY29ycmVjdC4gTm90IGFib3V0IGEgc3Ryb2xsLCBtb21lbnR1bSBkZWFscyB3aXRoIG90aGVyIGlzc3Vlcy4ifSwgeyJhbnN3ZXIiOiAiR2V0dGluZyB0cmFwcGVkIGluIGEgbG9jYWwgbWluaW11bSBiZWNhdXNlIG9mIHRoZSBhbGdvcml0aG0ncyBncmVlZGluZXNzIiwgImNvcnJlY3QiOiBmYWxzZSwgImZlZWRiYWNrIjogIkluY29ycmVjdC4gR3JlZWRpbmVzcyBpc24ndCBmaXhlZCBieSBkaWV0YXJ5IGFkanVzdG1lbnRzLCBtb21lbnR1bSBoYXMgYSBkaWZmZXJlbnQgam9iLiJ9LCB7ImFuc3dlciI6ICJVbmludGVuZGVkIG92ZXJzaG9vdGluZyBhbmQgbWlzc2luZyB0aGUgZ2xvYmFsIG9wdGltdW0gZHVlIHRvIHJhcGlkIG1vdmVtZW50IGFsb25nIHRoZSBncmFkaWVudCIsICJjb3JyZWN0IjogdHJ1ZSwgImZlZWRiYWNrIjogIkNvcnJlY3QhIE1vbWVudHVtIHByZXZlbnRzIG92ZXJzaG9vdGluZywga2VlcGluZyB0aGUgYWxnb3JpdGhtJ3MganVtcGluZyBza2lsbHMgaW4gY2hlY2suIn0sIHsiYW5zd2VyIjogIk9zY2lsbGF0aW9uLCByZXNlbWJsaW5nIG5hdmlnYXRpb24gb24gYSBwbGF0ZWF1IHdoZXJlIHRoZSBmdW5jdGlvbidzIHZhbHVlIHJlbWFpbnMgcmVsYXRpdmVseSBjb25zdGFudCByZWdhcmRsZXNzIG9mIHRoZSBkaXJlY3Rpb24iLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJDb3JyZWN0ISBEaW5nIGRpbmchIE1vbWVudHVtIGFjdHMgbGlrZSBhIEdQUywgZ3VpZGluZyB0aGUgYWxnb3JpdGhtIHRocm91Z2ggcGxhdGVhdSB3aXRob3V0IGNpcmNsZXMuIn1dfV0=</span>

In [5]:
display_quiz("#quiz_2")


In [261]:
def f(x):
    return x ** 4 - 4 * x ** 2 + 5 * x


def derivative_f(x):
    return 4 * x ** 3 - 8 * x + 5


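def gradient_descent(iterations_limit, threshold, start, obj_func, derivative_f, learning_rate=0.05, momentum=0.5):
    """Run gradient descent with momentum and return the visited points and their function values."""
    point = start
    points = [start]
    values = [obj_func(start)]

    delta = 0  # previous update, reused by the momentum term
    i = 0
    diff = 1.0e10  # change of the objective between consecutive steps

    # Stop after iterations_limit steps or once the objective changes by at most threshold
    while i < iterations_limit and diff > threshold:
        delta = -learning_rate * derivative_f(point) + momentum * delta
        point += delta

        points.append(point)
        values.append(obj_func(point))

        diff = abs(values[-1] - values[-2])
        i += 1

    return points, values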


start_point = 2
learning_rate = 0.05
momentum = 0.3

points, values = gradient_descent(100, 0.1, start_point, f, derivative_f, learning_rate, momentum)

The code below defines a set of helper functions for creating interactive 2D and 3D plots using the Plotly library. Each function is designed to generate a specific type of plot or visualization.

In [262]:
#Helper functions to draw interactive plots
import plotly.graph_objects as go
import numpy as np


def plot_function(x, y, name, color):
    return go.Scatter(x=x, y=y, mode='lines', name=name, line=dict(width=2, color=color))


def plot_function_3d(x, y, z, colorscale, highlightcolor):
    return go.Surface(
        x=x,
        y=y,
        z=z,
        colorscale=colorscale,
        opacity=0.8,
        contours=dict(
            z=dict(
                show=True,
                highlightcolor=highlightcolor,
                project=dict(z=True)
            )
        ),
        lighting=dict(ambient=0.5, diffuse=0.9)
    )


def plot_points_3d_labeled(x, y, z, symbol, color, colorscale, name, text):
    return go.Scatter3d(
        x=x,
        y=y,
        z=z,
        mode='text',
        marker=dict(
            size=6,
            symbol=symbol,
            color=color,
            colorscale=colorscale,
            opacity=1
        ),
        text=text,
        name=name,
        textposition='bottom center',
        showlegend=False
    )


def plot_points_3d(x, y, z, symbol, color, colorscale, name):
    return go.Scatter3d(
        x=x,
        y=y,
        z=z,
        mode='markers',
        marker=dict(
            size=6,
            symbol=symbol,
            color=color,
            colorscale=colorscale,
            opacity=1
        ),
        name=name
    )


def plot_points_labeled(x, y, color, symbol, name, text, size=8):
    return go.Scatter(
        x=x,
        y=y,
        mode='text',
        marker=dict(symbol=symbol, color=color, size=size),
        name=name,
        text=text,
        textposition='bottom right',
        showlegend=False,
    )


def plot_points(x, y, color, symbol, name, size=8):
    return go.Scatter(
        x=x,
        y=y,
        mode='markers',
        marker=dict(symbol=symbol, color=color, size=size),
        name=name
    )


def plot_line(x, y, color, name):
    return go.Scatter(
        x=x,
        y=y,
        mode='lines',
        line=dict(color=color),
        name=name
    )


def plot_frames(data):
    return go.Frame(data=data)


def get_layout(title, title_x, title_y, updatemenus, annotations=[], show_legend=True):
    return go.Layout(
        title=title,
        xaxis=dict(title=title_x),
        yaxis=dict(title=title_y),
        showlegend=show_legend,
        hovermode='closest',
        annotations=annotations,
        updatemenus=updatemenus,
    )


def get_layout_3d(title, title_x, title_y, title_z, updatemenus, annotations=[]):
    return go.Layout(
        scene=dict(
            xaxis=dict(title=title_x),
            yaxis=dict(title=title_y),
            zaxis=dict(title=title_z),
            camera=dict(eye=dict(x=-1.25, y=-1.25, z=0.55)),
        ),
        margin=dict(l=0, r=0, b=0, t=0),
        title=title,
        annotations=annotations,
        updatemenus=updatemenus,
    )


def plot_contour_map_3d(x, y, z, colorscale):
    return go.Contour(x=x, y=y, z=z, colorscale=colorscale)


The code below visualizes the gradient descent optimization process for the given objective function using the Plotly library. It lets users interactively explore how the optimization trajectory changes with different learning rates and momentum values.

In [263]:
import plotly.offline as pyo


def gradient_descent_visualization_plot(learning_rate, momentum):
    def get_gradient_descent_points():
        iterations_limit = 100
        threshold = 0.1
        start = 2.0

        return gradient_descent(iterations_limit, threshold, start, f, derivative_f, learning_rate, momentum)

    def build_figure(obj_x, obj_y, gd_x, gd_y):
        return go.Figure(data=[plot_function(obj_x, obj_y, 'Function Line', 'blue'),
                               plot_function(obj_x, obj_y, 'Function Line', 'blue')],
                         layout=get_layout(f'Learning Rate = {learning_rate}',
                                           'X',
                                           'Y',
                                           [dict(type="buttons",
                                                 buttons=[dict(label="Play",
                                                               method="animate",
                                                               args=[None])])]),
                         frames=[plot_frames(
                             data=[
                                 plot_points([gd_x[k]], [gd_y[k]], 'red', 'circle', 'GD point')
                             ]) for k in range(len(gd_x))])

    gradient_descent_points, gradient_descent_values = get_gradient_descent_points()

    x_values = np.linspace(-3, 3, 100)
    y_values = f(x_values)

    figure = build_figure(x_values, y_values, gradient_descent_points, gradient_descent_values)
    pyo.iplot(figure)

In [264]:
learning_rate = 0.03
momentum = 0.5
gradient_descent_visualization_plot(learning_rate, momentum)

In [265]:
learning_rate = 0.05
momentum = 0.5
gradient_descent_visualization_plot(learning_rate, momentum)

In [266]:
learning_rate = 0.07
momentum = 0.5
gradient_descent_visualization_plot(learning_rate, momentum)

In [267]:
learning_rate = 0.1
momentum = 0.5
gradient_descent_visualization_plot(learning_rate, momentum)

## Gradient Descent for a Bivariate Function


The procedure is the same as in the univariate case considered earlier. Now let's consider a function of two variables, because real-world scenarios mostly involve complex relationships between several factors. The computations therefore involve **partial derivatives** and finding a local minimum.
```{note}
**Partial derivatives** tell us how much a function changes with respect to just one variable, while keeping all other variables constant. In our case, we have the gradient, which is the vector of partial derivatives of $\mathcal{L}(\boldsymbol w)$ with respect to each component $w_j$ of $\boldsymbol w$:


$$\nabla\mathcal{L}(\boldsymbol{w}) = \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial w_1}, \frac{\partial \mathcal{L}}{\partial w_2}, \ldots, \frac{\partial \mathcal{L}}{\partial w_n} \end{bmatrix}$$
```
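For the function $f(x, y) = x^3 + x^2 - y^3 - 4x + 22y - 5$ used in this section, the gradient is $\left[3x^2 + 2x - 4,\; -3y^2 + 22\right]$, and a local minimum must be a point where both components vanish. The sketch below finds these stationary points with the quadratic formula and classifies them by the signs of the second partial derivatives. The function names `gradient` and `hessian_diagonal` are assumptions for illustration; looking only at the diagonal works here because the mixed second derivatives of this particular function are zero.

```python
import numpy as np


def gradient(x, y):
    """Vector of partial derivatives of f(x, y) = x^3 + x^2 - y^3 - 4x + 22y - 5."""
    return np.array([3 * x ** 2 + 2 * x - 4, -3 * y ** 2 + 22])


def hessian_diagonal(x, y):
    """Second partial derivatives d2f/dx2 and d2f/dy2 (the mixed ones are zero)."""
    return np.array([6 * x + 2, -6 * y])


# Roots of 3x^2 + 2x - 4 = 0 and of -3y^2 + 22 = 0
disc = np.sqrt(2 ** 2 - 4 * 3 * (-4))
x_roots = [(-2 + disc) / 6, (-2 - disc) / 6]
y_roots = [np.sqrt(22 / 3), -np.sqrt(22 / 3)]

minima = []
for x in x_roots:
    for y in y_roots:
        is_minimum = bool(np.all(hessian_diagonal(x, y) > 0))
        if is_minimum:
            minima.append((x, y))
        kind = "local minimum" if is_minimum else "not a minimum"
        print(f"stationary point ({x:.4f}, {y:.4f}): {kind}")
```

Only one of the four stationary points is a local minimum, near $(0.87, -2.71)$; gradient descent started close enough to it should move toward this point.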


In [268]:
def f(x, y):
    return x ** 3 + x ** 2 - y ** 3 - 4 * x + 22 * y - 5


def derivative_f_x(x):
    return 3 * x ** 2 + 2 * x - 4


def derivative_f_y(y):
    return -3 * y ** 2 + 22


def gradient_descent(iterations_limit, threshold, start, obj_func, derivative_f_x, derivative_f_y, learning_rate=0.05,
                     momentum=0.5):
    point = start
    points = [start]
    values = [obj_func(*start)]

    x = point[0]
    y = point[1]

    delta_x = 0
    delta_y = 0
    i = 0
    diff = 1.0e10

    while i < iterations_limit and diff > threshold:
        delta_x = -learning_rate * derivative_f_x(x) + momentum * delta_x
        # each coordinate reuses its own previous step in the momentum term
        delta_y = -learning_rate * derivative_f_y(y) + momentum * delta_y
        x += delta_x
        y += delta_y

        points.append([x, y])
        values.append(obj_func(*[x, y]))

        diff = abs(values[-1] - values[-2])
        i += 1

    return points, values


start_point = [4.5, 2]
learning_rate = 0.05
momentum = 0.5

points, values = gradient_descent(10, 0.01, start_point, f, derivative_f_x, derivative_f_y, learning_rate, momentum)

3D plots can visualize a trajectory of the optimization process on a surface for a bivariate function. This can help to see how the algorithm moves toward the minimum.


In [269]:
def gradient_descent_3d_visualization_plot(gradient_descent_points, gradient_descent_values):
    x_obj_values = np.linspace(-4, 5, 100)
    y_obj_values = np.linspace(-4, 5, 100)

    x_obj, y_obj = np.meshgrid(x_obj_values, y_obj_values)
    z_obj = f(x_obj, y_obj)

    gd_x = [elem[0] for elem in gradient_descent_points]
    gd_y = [elem[1] for elem in gradient_descent_points]
    gd_z = gradient_descent_values

    def build_figure(obj_x, obj_y, obj_z, gd_x, gd_y, gd_z):
        return go.Figure(data=[plot_function_3d(obj_x, obj_y, obj_z, 'Plasma', 'limegreen'),
                               plot_function_3d(obj_x, obj_y, obj_z, 'Plasma', 'limegreen')],
                         layout=get_layout_3d(
                             'Gradient Descent 3D Visualization',
                             'X',
                             'Y',
                             'Z',
                             [dict(type="buttons",
                                   buttons=[dict(label="Play",
                                                 method="animate",
                                                 args=[None])])]),
                         frames=[plot_frames(
                             data=[
                                 plot_points_3d([gd_x[k]], [gd_y[k]], [gd_z[k]], 'circle', 'red', 'Viridis', 'GD point')
                             ]) for k in range(len(gd_x))])

    figure = build_figure(x_obj, y_obj, z_obj, gd_x, gd_y, gd_z)
    figure.show()


gradient_descent_3d_visualization_plot(points, values)

An alternative implementation of the ***gradient_descent_3d_visualization_plot*** function provides a more detailed visualization by annotating the gradient descent path with numbered points and connecting lines between them.

In [272]:
def gradient_descent_3d_visualization_plot(gradient_descent_points, gradient_descent_values):
    x_obj_values = np.linspace(-4, 5, 100)
    y_obj_values = np.linspace(-4, 5, 100)

    x_obj, y_obj = np.meshgrid(x_obj_values, y_obj_values)
    z_obj = f(x_obj, y_obj)

    gd_x = [elem[0] for elem in gradient_descent_points]
    gd_y = [elem[1] for elem in gradient_descent_points]
    gd_z = gradient_descent_values

    def build_figure(obj_x, obj_y, obj_z, gd_x, gd_y, gd_z):
        def annotate(x, y, z):
            lines = []
            arrows = []

            for i in range(len(x) - 1):
                lines.append(
                    go.Scatter3d(
                        x=[x[i], x[i + 1]],
                        y=[y[i], y[i + 1]],
                        z=[z[i], z[i + 1]],
                        mode='lines',
                        line=dict(color='black', width=3),
                        hoverinfo='all',
                        showlegend=False,
                        name = 'GD point'
                    )
                )
                arrows.append(
                    go.Scatter3d(
                        x=[x[i], x[i + 1]],
                        y=[y[i], y[i + 1]],
                        z=[z[i], z[i + 1]],
                        mode='markers',
                        marker=dict(size=8, color='red'),
                        hoverinfo='all',
                        showlegend=False,
                        name = 'GD point'
                    )
                )

            return lines, arrows

        lines, arrows = annotate(gd_x, gd_y, gd_z)

        return go.Figure(
            data=[plot_function_3d(obj_x, obj_y, obj_z, 'Plasma', 'limegreen'),
                  plot_points_3d_labeled(
                      gd_x, gd_y, gd_z, 'circle', 'red', 'Viridis', 'GD point',
                      [str(i + 1) for i in range(len(gd_x))]
                  )
                  ] + lines + arrows,
            layout=get_layout_3d(
                'Gradient Descent 3D Visualization',
                'X',
                'Y',
                'Z',
                []))

    figure = build_figure(x_obj, y_obj, z_obj, gd_x, gd_y, gd_z)
    figure.show()


gradient_descent_3d_visualization_plot(points, values)

Below, a 2D contour plot of the previous 3D visualization is implemented and drawn. In 2D space, the contour lines represent the objective function, and the annotated points show the steps taken by the algorithm during optimization.


```{note}
A 2D representation of 3D space makes it easier to interpret the function's features and the trajectory of the optimization algorithm.
```


In [273]:
def gradient_descent_interactive_contour_plot(gradient_descent_points):
    x_obj_values = np.linspace(-4, 5, 100)
    y_obj_values = np.linspace(-4, 5, 100)

    x_obj, y_obj = np.meshgrid(x_obj_values, y_obj_values)
    z_obj = f(x_obj, y_obj)

    x_gd = [elem[0] for elem in gradient_descent_points]
    y_gd = [elem[1] for elem in gradient_descent_points]

    def build_figure(obj_x, obj_y, obj_z, gd_x, gd_y):
        def annotate(x, y):
            lines = []
            arrows = []

            for i in range(len(x) - 1):
                lines.append(
                    go.Scatter(
                        x=[x[i], x[i + 1]],
                        y=[y[i], y[i + 1]],
                        mode='lines',
                        line=dict(color='black', width=1),
                        hoverinfo='all',
                        showlegend=False,
                        name = 'GD point'
                    )
                )
                arrows.append(
                    go.Scatter(
                        x=[x[i], x[i + 1]],
                        y=[y[i], y[i + 1]],
                        mode='markers',
                        marker=dict(size=8, color='red'),
                        hoverinfo='all',
                        showlegend=False,
                        name = 'GD point'
                    )
                )

            return lines, arrows

        lines, arrows = annotate(gd_x, gd_y)

        fig = go.Figure(data=[plot_contour_map_3d(obj_x, obj_y, obj_z, 'Plasma'),
                              plot_points_labeled(gd_x, gd_y, 'blue', 'circle', 'GD point',
                                                  [str(i + 1) for i in range(len(gd_x))])] + lines + arrows,
                        layout=get_layout('2D Contour Plot for 3D representation of GD', 'X-axis', 'Y-axis',
                                          [],
                                          [], False),
                        frames=[])
        return fig

    figure = build_figure(x_obj_values, y_obj_values, z_obj, x_gd, y_gd)
    figure.show()


gradient_descent_interactive_contour_plot(points)

## Gradient Descent for a Linear Regression Model

```{figure} https://iili.io/JTBGoej.webp
:width: 400px
:align: center
```


To implement gradient descent for a linear regression model, we follow the steps below:
1. Import the testing dataset
2. Manually set the intercept and slope values of a linear regression model
3. Implement the mean squared error (MSE) function ([see regression metrics section](https://fedmug.github.io/kbtu-ml-book/eval_metrics/regression.html#mean-squared-error-mse)) and compute the error value for the model.


In [274]:
from sklearn.model_selection import train_test_split
import pandas as pd

dataset_path = 'datasets/experience_salary.csv'
data = pd.read_csv(dataset_path)

intercept = 1
slope = 2


def linear_regression(x, intercept, slope):
    return intercept + slope * x


def mean_squared_error(x, intercept, slope, y_actual):
    n = len(x)
    total_error = 0

    for i in range(n):
        total_error += (intercept + slope * x[i] - y_actual[i]) ** 2

    return total_error / n


experience = data[['experience']]
salary = data['salary']

X_train, X_test, Y_train, Y_test = train_test_split(experience, salary, test_size=0.2, random_state=42)
X = X_test.values.flatten()
Y = Y_test.values.flatten()

linear_regression_values = [linear_regression(x, intercept, slope) for x in X]
mse = mean_squared_error(X_test.values.flatten(), intercept, slope, Y)
print(f'MSE: {mse}')

MSE: 876.7019407082549


Below is a visualization of the linear relationship defined by our manually parameterized model.
The helper functions ***plot_points*** and ***plot_line*** are used to create traces for the testing data points and the regression line.
Then we build a ***go.Figure*** object with the specified data and layout and display it.


In [275]:
def linear_regression_without_gd_plot(X, Y, linear_regression_values):
    def build_figure(x_obj, y_obj, x_reg, y_reg):
        return go.Figure(
            data=[plot_points(x_obj, y_obj, 'blue', 'x', 'Testing Data'),
                  plot_line(x_reg, y_reg, 'red', 'Regression Line')],
            layout=get_layout('Linear Regression Without GD Plot', 'Experience', 'Salary', []))

    figure = build_figure(X, Y, X, linear_regression_values)
    figure.show()


#Plotting linear regression without Gradient Descent
linear_regression_without_gd_plot(X, Y, linear_regression_values)

The function below computes the partial derivatives of the mean squared error with respect to the intercept and the slope:
- Iterate over each data point in the input arrays *x* and *y_actual*
- Accumulate the partial derivatives of the mean squared error with respect to the intercept (*derivative_intercept*) and the slope (*derivative_slope*), following the formulas for the derivative of the mean squared error:


  - For the intercept: $\frac{\partial}{\partial \text{intercept}} \text{MSE} = \frac{2}{N} \sum_{i=1}^{N} (\text{intercept} + \text{slope} \cdot x[i] - \text{y_actual}[i])$

  - For the slope: $\frac{\partial}{\partial \text{slope}} \text{MSE} = \frac{2}{N} \sum_{i=1}^{N} (\text{intercept} + \text{slope} \cdot x[i] - \text{y_actual}[i]) \cdot x[i]$
      
- Return the average derivatives by dividing the accumulated derivatives by the number of data points *n*
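
The same two partial derivatives can be written without an explicit loop by using NumPy arrays. The sketch below is one possible vectorized form, checked against central finite differences. The function names `mse` and `mse_gradient` and the synthetic data are assumptions for illustration; they are not the dataset or the functions used in this section.

```python
import numpy as np


def mse(x, y, intercept, slope):
    return np.mean((intercept + slope * x - y) ** 2)


def mse_gradient(x, y, intercept, slope):
    """Partial derivatives of the MSE with respect to the intercept and the slope."""
    residuals = intercept + slope * x - y
    return 2 * residuals.mean(), 2 * (residuals * x).mean()


# Synthetic data (an assumption, not the experience/salary dataset)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 1.5 * x + rng.normal(0, 1, size=50)

d_intercept, d_slope = mse_gradient(x, y, intercept=1.0, slope=2.0)

# Sanity check with central finite differences
h = 1e-6
num_intercept = (mse(x, y, 1.0 + h, 2.0) - mse(x, y, 1.0 - h, 2.0)) / (2 * h)
num_slope = (mse(x, y, 1.0, 2.0 + h) - mse(x, y, 1.0, 2.0 - h)) / (2 * h)
print(d_intercept, num_intercept)
print(d_slope, num_slope)
```

The vectorized form computes the same quantities as the loop version and is usually faster on large arrays.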


In [276]:
def derivative_mean_squared_error(x, y_actual, intercept, slope):
    n = len(x)
    derivative_intercept = 0
    derivative_slope = 0

    for i in range(n):
        derivative_intercept += 2 * (intercept + slope * x[i] - y_actual[i])
        derivative_slope += 2 * (intercept + slope * x[i] - y_actual[i]) * x[i]

    return derivative_intercept / n, derivative_slope / n

The function below returns updated values of the intercept and slope parameters. Each update subtracts the product of the learning rate and the respective partial derivative from the current value of the intercept or the slope.

In [277]:
def gradient_descent(x, y, intercept, slope, learning_rate, num_iterations):
    for i in range(num_iterations):
        partial_derivative_intercept, partial_derivative_slope = derivative_mean_squared_error(x, y, intercept, slope)
        intercept -= learning_rate * partial_derivative_intercept
        slope -= learning_rate * partial_derivative_slope

    return intercept, slope


To call the ***gradient_descent*** function, we first set arbitrary values for the hyperparameters ***learning_rate*** and the number of ***iterations***.


1. In each *iteration*, the algorithm calculates the gradient of the cost function with respect to each parameter. The gradient points in the direction of the steepest increase in the cost.


2. The parameters are then updated in the opposite direction of the gradient to move towards the minimum of the mean square error function. The update is proportional to the *learning_rate*, a hyperparameter that determines the size of the steps taken in each iteration.



In [278]:
learning_rate = 0.001
iterations = 1000

gd_intercept, gd_slope = gradient_descent(X, Y, intercept, slope, learning_rate, iterations)
print(gd_intercept, gd_slope)

1.9080791607526073 0.9313247403333909


The code below prints the mean squared error and computes the predicted values for the testing data using the ***linear_regression*** model with the parameters optimized by gradient descent.

In [279]:
gd_mse = mean_squared_error(X, gd_intercept, gd_slope, Y)
print(f'MSE with gradient descent: {gd_mse}')

gd_linear_regression_values = [linear_regression(x, gd_intercept, gd_slope) for x in X]

MSE with gradient descent: 29.349835951481545


For comparison, compute the *mean squared error (MSE)* of the built-in linear regression model from the sklearn library, fitted on the training split and evaluated on the testing split.

In [280]:
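from sklearn.linear_model import LinearRegression
# Alias the import so that it does not shadow the mean_squared_error function defined earlier
from sklearn.metrics import mean_squared_error as sklearn_mean_squared_error

model = LinearRegression()
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
sklearn_mse = sklearn_mean_squared_error(Y_test, Y_pred)
print(f'Sklearn mse {sklearn_mse}')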

Sklearn mse 27.650268732842278
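
The gradient descent MSE above is somewhat larger than the sklearn one, which suggests that 1000 iterations with a learning rate of 0.001 were not enough to reach the minimum: a least-squares line has the smallest possible MSE on the data it is fitted to, so a converged run on the testing data could not score worse there than a line fitted elsewhere. The sketch below illustrates this on synthetic data, comparing gradient descent after different numbers of iterations with the closed-form least-squares fit from `np.polyfit`. The data, the function names `fit_gd` and `mse`, and the iteration counts are assumptions for illustration.

```python
import numpy as np

# Synthetic data (an assumption, not the dataset used above)
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 1.0 * x + rng.normal(0, 1, size=100)


def fit_gd(x, y, learning_rate, num_iterations, intercept=1.0, slope=2.0):
    """Batch gradient descent on the MSE of the line intercept + slope * x."""
    for _ in range(num_iterations):
        residuals = intercept + slope * x - y
        intercept -= learning_rate * 2 * residuals.mean()
        slope -= learning_rate * 2 * (residuals * x).mean()
    return intercept, slope


def mse(x, y, intercept, slope):
    return np.mean((intercept + slope * x - y) ** 2)


ls_slope, ls_intercept = np.polyfit(x, y, deg=1)   # closed-form least squares
print(f"least squares: MSE = {mse(x, y, ls_intercept, ls_slope):.4f}")

for n in (100, 1_000, 10_000, 50_000):
    b, k = fit_gd(x, y, learning_rate=0.001, num_iterations=n)
    print(f"GD, {n:>6} iterations: MSE = {mse(x, y, b, k):.4f}")
```

With enough iterations the gradient descent MSE approaches the least-squares value; increasing the number of iterations or the learning rate (within the stable range) narrows the gap.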


Let's define a function that draws the regression lines after the optimization has been applied, together with the initial line and the sklearn fit.

In [281]:
def linear_regression_with_gd_plot(X, Y, sk_learn_reg_values, linear_regression_values,
                                   gd_linear_regression_values):
    def build_figure(x_obj, y_obj, x_reg, y_reg):
        return go.Figure(
            data=[plot_points(x_obj, y_obj, 'blue', 'x', 'Testing Data'),
                  plot_line(x_reg, y_reg, 'red', 'Regression Line'),
                  plot_line(x_reg, gd_linear_regression_values, 'green', 'GD Regression Line'),
                  plot_line(x_reg, sk_learn_reg_values, 'orange', 'Sklearn Regression Line')],
            layout=get_layout('Linear Regression With GD Plot', 'Experience', 'Salary', []))

    figure = build_figure(X, Y, X, linear_regression_values)
    figure.show()


#Plotting linear regression with Gradient Descent
linear_regression_with_gd_plot(X, Y, Y_pred, linear_regression_values, gd_linear_regression_values)

```{figure} https://iili.io/JTBwBnf.jpg
:align: center
:width: 400px
```