<a href="https://colab.research.google.com/github/GDS-Education-Community-of-Practice/DSECOP/blob/main/Intro_to_Deep_Learning/04_Gradient_Descent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lecture IV: Gradient Descent

As we discussed in the previous lecture, we convert the learning problem into an optimization problem: Try to minimize the loss/cost function.

Finding the minimum of a function frequently requires the knowledge of its derivative, which we will discuss in this lecture.

**Gradient descent**, one of the most well-known minimization algorithms, is an iterative first-order optimization algorithm used to find a local minimum of a given function. (Flipping the sign of the update gives gradient ascent, which finds a local maximum.)

In our problem, we are interested in inferring the best values of the model parameters: $\omega$ and $b$, corresponding to the cost function minimum. Mathematically speaking, the gradient descent method takes the following steps to find the minimum:

1. Initialize $\omega$ and $b$ (by guessing or randomly).
2. Find the slope of the cost function at that point.
3. Change the model parameters ($\omega$ and $b$) along the direction of steepest descent at the current location to obtain the next set of parameter values.
4. Repeat steps 2-3 until the parameter values no longer change significantly, implying that we have reached a minimum of the cost function, where the slope is effectively zero.

By definition, we change the parameter values at each iteration of the algorithm such that

\begin{equation}
\omega := \omega - \alpha \frac{dJ(\omega, b)}{d\omega}, \tag{1}
\label{eq:changingOmega}
\end{equation}
\begin{equation}
b := b - \alpha \frac{dJ(\omega, b)}{db}, \tag{2}
\label{eq:changingb}
\end{equation}

where $\alpha$ is the **learning rate**, a positive number that sets the size of each step.

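
As a minimal sketch of the update rules in Eqs. (1) and (2), the loop below minimizes a toy cost function whose minimum is known in advance. The cost $J(\omega, b) = (\omega - 3)^2 + (b + 1)^2$, the starting point, and the values of `alpha`, `tol`, and `max_iters` are illustrative assumptions, not part of this lecture's model.

```python
def cost(w, b):
    # Hypothetical toy cost with its minimum at w = 3, b = -1.
    return (w - 3.0) ** 2 + (b + 1.0) ** 2


def gradient(w, b):
    # Derivatives of the toy cost with respect to w and b.
    return 2.0 * (w - 3.0), 2.0 * (b + 1.0)


def gradient_descent(w, b, alpha=0.1, tol=1e-8, max_iters=10_000):
    """Repeat the updates of Eqs. (1) and (2) until the parameters stop changing."""
    for step in range(max_iters):
        dw, db = gradient(w, b)
        new_w = w - alpha * dw  # Eq. (1)
        new_b = b - alpha * db  # Eq. (2)
        # Stop when neither parameter moved by more than tol.
        if abs(new_w - w) < tol and abs(new_b - b) < tol:
            return new_w, new_b, step + 1
        w, b = new_w, new_b
    return w, b, max_iters


w_min, b_min, n_steps = gradient_descent(w=0.0, b=0.0)
print(w_min, b_min, n_steps, cost(w_min, b_min))
```

The returned parameters should land very close to $(3, -1)$. Trying a much larger or much smaller `alpha` is a quick way to see how the learning rate affects the number of steps.
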
Let's see this method in action. Suppose we have two inputs $x_1$ and $x_2$. Using the logistic regression method, we need to define three parameters, $\omega_1$, $\omega_2$, and $b$ such that

\begin{equation}
z = \omega_1 x_1 + \omega_2 x_2 + b, \tag{3}
\end{equation}

and then the output is going to be

\begin{equation}
a = \sigma(z) . \tag{4}
\end{equation}
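
Eqs. (3) and (4) can be sketched in a few lines of Python. The function names and the input and parameter values below are arbitrary choices made for illustration.

```python
import math


def sigmoid(z):
    # Logistic function: maps any real z to a value between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))


def forward(x1, x2, w1, w2, b):
    z = w1 * x1 + w2 * x2 + b  # Eq. (3)
    a = sigmoid(z)             # Eq. (4)
    return z, a


# Illustrative values only: here z works out to exactly 0, so a = 0.5.
z, a = forward(x1=1.0, x2=2.0, w1=0.5, w2=-0.25, b=0.0)
print(z, a)
```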

Now let's define the loss function for the $i$-th element in our training set:

\begin{equation}
L(a, y^i) = - y^i \log(a) - (1 - y^i) \log(1 - a) . \tag{5}
\label{eq:loss_1node}
\end{equation}
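
To get a feel for Eq. (5), the short sketch below evaluates the loss for a single example with label $y = 1$ and three different outputs $a$. The function name and the sample values are assumptions chosen for illustration, and the code assumes $0 < a < 1$ so that both logarithms are defined.

```python
import math


def loss(a, y):
    """Loss of Eq. (5) for one training example; assumes 0 < a < 1."""
    return -y * math.log(a) - (1 - y) * math.log(1 - a)


# True label y = 1 with three hypothetical model outputs.
confident_right = loss(0.9, 1)  # output close to the label
unsure = loss(0.5, 1)           # output halfway between the two classes
confident_wrong = loss(0.1, 1)  # output far from the label
print(confident_right, unsure, confident_wrong)
```

The loss grows as the output moves away from the true label, which is why minimizing it pushes $a$ toward $y$.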

We want to find the slope of the loss function at the point $(\omega_1, \omega_2, b)$ and then, based on that slope, update $\omega_1$, $\omega_2$, and $b$ using Eq. \ref{eq:changingOmega} and Eq. \ref{eq:changingb}, so that the loss decreases and eventually settles at its minimum.

How can we find the derivative of the loss function with respect to $\omega_1$, $\omega_2$, and $b$? The most popular answer is the **backpropagation method**, which works backward from the output:

1. find the derivative of the loss function with respect to $a$, then
2. with respect to $z$, and, in the last step,
3. with respect to $\omega_1$, $\omega_2$, and $b$.

![GD1-2.jpg](attachment:GD1-2.jpg)
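
Written out with the chain rule, and using $z$ and $a$ from Eqs. (3) and (4), each backward step reuses the result of the step before it. The individual factors are left unevaluated here:

\begin{equation}
\frac{dL}{dz} = \frac{dL}{da}\,\frac{da}{dz},
\end{equation}
\begin{equation}
\frac{dL}{d\omega_1} = \frac{dL}{dz}\,\frac{dz}{d\omega_1}, \qquad
\frac{dL}{d\omega_2} = \frac{dL}{dz}\,\frac{dz}{d\omega_2}, \qquad
\frac{dL}{db} = \frac{dL}{dz}\,\frac{dz}{db}.
\end{equation}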

After calculating the derivative of the loss function with respect to $\omega_1$, $\omega_2$, and $b$, we update these parameters using Eq. \ref{eq:changingOmega} and Eq. \ref{eq:changingb}. Then we repeat the procedure: we recalculate $z$, $a$, and the loss function with the updated $\omega_1$, $\omega_2$, and $b$, find the derivatives again, and update the parameters once more based on the new slope of the loss function.
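
The loop described above can be sketched as follows. Note the substitution: instead of the analytic backpropagation derivatives, this sketch estimates each slope numerically with central finite differences, so the same code can later serve as a check on hand-derived formulas. The single training example, the zero initialization, `alpha`, `eps`, and the number of iterations are all illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(params, x1, x2, y):
    # Forward pass, Eqs. (3) and (4), followed by the loss of Eq. (5).
    w1, w2, b = params
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    return -y * math.log(a) - (1 - y) * math.log(1 - a)


def numerical_gradient(params, x1, x2, y, eps=1e-6):
    # Central finite differences: nudge one parameter at a time.
    grads = []
    for k in range(len(params)):
        up = list(params)
        down = list(params)
        up[k] += eps
        down[k] -= eps
        grads.append((loss(up, x1, x2, y) - loss(down, x1, x2, y)) / (2 * eps))
    return grads


params = [0.0, 0.0, 0.0]   # omega_1, omega_2, b
x1, x2, y = 1.0, 2.0, 1    # one hypothetical training example
alpha = 0.5                # learning rate (assumed value)

history = []
for _ in range(100):
    history.append(loss(params, x1, x2, y))
    grads = numerical_gradient(params, x1, x2, y)
    # Update every parameter as in Eqs. (1) and (2).
    params = [p - alpha * g for p, g in zip(params, grads)]

print(history[0], history[-1], params)
```

The recorded loss should shrink from iteration to iteration. Finite differences need two loss evaluations per parameter, which is one reason backpropagation is preferred in practice.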

# Homework
In programming, the term $\frac{dL}{dq}$ is also denoted by $dq$, where $q$ is a parameter, such as $z$ or $\omega_i$. Calculate $da$, $dz$, $d\omega_1$, $d\omega_2$, and $db$ using the loss function in Eq. \ref{eq:loss_1node} and the sigmoid function for $\sigma$.