# GRADIENT DESCENT

The gradient vector of a cost function in the context of linear regression is a vector that contains all the partial derivatives of the cost function with respect to each parameter in the model. For a model with parameters $\theta_0$ (intercept), $\theta_1$ (coefficient for $x_1$), and $\theta_2$ (coefficient for $x_2$), the gradient vector can be represented as:

$\nabla_{\theta} J(\theta) = \left[ \frac{\partial J(\theta)}{\partial \theta_0}, \frac{\partial J(\theta)}{\partial \theta_1}, \frac{\partial J(\theta)}{\partial \theta_2} \right]$

### Mean Squared Error (MSE) Cost Function:

The MSE cost function for linear regression is given by:

$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$

Where $\hat{y}_i = \theta_0 + \theta_1 x_{i1} + \theta_2 x_{i2}$ is the predicted value and $n$ is the number of observations.

### Partial Derivatives:

The partial derivatives of the MSE cost function with respect to each parameter are:

1. $\frac{\partial J(\theta)}{\partial \theta_0} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \cdot 1$
2. $\frac{\partial J(\theta)}{\partial \theta_1} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \cdot x_{i1}$
3. $\frac{\partial J(\theta)}{\partial \theta_2} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \cdot x_{i2}$
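
The factor $\frac{2}{n}$ and the trailing $1$, $x_{i1}$, $x_{i2}$ terms come from the chain rule. A short derivation, assuming the MSE is the mean of the squared residuals, $J(\theta) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$, and writing $\theta_j$ for any one of the three parameters:

$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{n} \sum_{i=1}^{n} 2 (\hat{y}_i - y_i) \cdot \frac{\partial \hat{y}_i}{\partial \theta_j} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \cdot \frac{\partial \hat{y}_i}{\partial \theta_j}$

$\frac{\partial \hat{y}_i}{\partial \theta_0} = 1, \quad \frac{\partial \hat{y}_i}{\partial \theta_1} = x_{i1}, \quad \frac{\partial \hat{y}_i}{\partial \theta_2} = x_{i2}$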

### Given Dataset:

Assuming a dataset with two features ($x_1$ and $x_2$) and two data points:

| $x_1$ | $x_2$   | $y$   |
|-------|---------|-------|
| 1     | 2       | 3     |
| 4     | 5       | 6     |

### Model Parameters:

Assuming initial model parameters:

- $\theta_0 = 0.5$
- $\theta_1 = 1$
- $\theta_2 = 0.5$

### Computation:

We will compute the gradient vector $\nabla_{\theta} J(\theta)$ using the given data and parameters.

1. Compute $\hat{y}_i$ for each observation.
2. Compute the partial derivatives $\frac{\partial J(\theta)}{\partial \theta_0}$, $\frac{\partial J(\theta)}{\partial \theta_1}$, and $\frac{\partial J(\theta)}{\partial \theta_2}$.
3. Combine the partial derivatives to form the gradient vector $\nabla_{\theta} J(\theta)$.

Let's perform these computations.
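
Written out by hand with the dataset and initial parameters above, the arithmetic looks like this (each residual is $\hat{y}_i - y_i$, and $n = 2$):

$\hat{y}_1 = 0.5 + 1 \cdot 1 + 0.5 \cdot 2 = 2.5, \quad \hat{y}_1 - y_1 = 2.5 - 3 = -0.5$

$\hat{y}_2 = 0.5 + 1 \cdot 4 + 0.5 \cdot 5 = 7.0, \quad \hat{y}_2 - y_2 = 7.0 - 6 = 1.0$

$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{2}{2} \left[ (-0.5) \cdot 1 + (1.0) \cdot 1 \right] = 0.5$

$\frac{\partial J(\theta)}{\partial \theta_1} = \frac{2}{2} \left[ (-0.5) \cdot 1 + (1.0) \cdot 4 \right] = 3.5$

$\frac{\partial J(\theta)}{\partial \theta_2} = \frac{2}{2} \left[ (-0.5) \cdot 2 + (1.0) \cdot 5 \right] = 4.0$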

The computed gradient vector $\nabla_{\theta} J(\theta)$ of the Mean Squared Error (MSE) cost function with respect to all parameters ($\theta_0, \theta_1, \theta_2$) for the given dataset and model parameters is:

$\nabla_{\theta} J(\theta) = \begin{bmatrix} 0.5 \\ 3.5 \\ 4.0 \end{bmatrix}$

This gradient vector contains the partial derivatives of the MSE cost function with respect to each parameter in the model:
- The first element (0.5) is the partial derivative with respect to $\theta_0$ (the intercept term).
- The second element (3.5) is the partial derivative with respect to $\theta_1$ (the coefficient of feature $x_1$).
- The third element (4.0) is the partial derivative with respect to $\theta_2$ (the coefficient of feature $x_2$).

These values indicate the direction and magnitude of the steepest ascent in the cost function space. In gradient descent optimization, the parameters $\theta$ would be updated in the opposite direction of this gradient to minimize the cost function.

In this example there are three parameters, so the cost surface lives in a four-dimensional space: three axes for $\theta_0$, $\theta_1$, and $\theta_2$, and a fourth axis for the value of the cost function at those parameters. Gradient descent moves downhill on this surface toward a minimum.

If you do not want to compute each partial derivative individually, you can use the following formula to compute the gradient vector all at once:

$\nabla_{\theta} J(\theta) = \frac{2}{m} \cdot X^T \cdot (X \cdot \theta - y)$

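Where $X$ is the matrix of features with a leading column of ones for the intercept, $y$ is the vector of target values, $\theta$ is the vector of model parameters, and $m$ is the number of observations (written $n$ in the formulas above).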

This means that in this particular example:
$\nabla_{\theta} J(\theta) = \frac{2}{m} \cdot X^T \cdot (X \cdot \theta - y)$ = $\frac{2}{2} \cdot \begin{bmatrix} 1 & 1 & 2 \\ 1 & 4 & 5 \end{bmatrix}^T \cdot \left( \begin{bmatrix} 1 & 1 & 2 \\ 1 & 4 & 5 \end{bmatrix} \cdot \begin{bmatrix} 0.5 \\ 1 \\ 0.5 \end{bmatrix} - \begin{bmatrix} 3 \\ 6 \end{bmatrix} \right)$ = $\begin{bmatrix} 0.5 \\ 3.5 \\ 4.0 \end{bmatrix}$

This means that also in this particular example:
$\nabla_{\theta} J(\theta) = \left[ \frac{\partial J(\theta)}{\partial \theta_0}, \frac{\partial J(\theta)}{\partial \theta_1}, \frac{\partial J(\theta)}{\partial \theta_2} \right]$ = $\frac{2}{m} \cdot X^T \cdot (X \cdot \theta - y)$

And in general, for a model with $k$ features:
$\nabla_{\theta} J(\theta) = \left[ \frac{\partial J(\theta)}{\partial \theta_0}, \frac{\partial J(\theta)}{\partial \theta_1}, \frac{\partial J(\theta)}{\partial \theta_2}, \dots, \frac{\partial J(\theta)}{\partial \theta_k} \right]$ = $\frac{2}{m} \cdot X^T \cdot (X \cdot \theta - y)$

Where $m$ is the number of observations in the dataset, $X$ is the $m \times (k + 1)$ matrix of features (including the leading column of ones), $y$ is the vector of target values, and $\theta$ is the vector of model parameters.
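
One way to gain confidence in the vectorized formula is to compare it against a numerical approximation of the gradient. The sketch below is illustrative only: the helper names `mse_cost`, `mse_gradient`, and `numerical_gradient`, and the step size `eps`, are assumptions and not part of the original notebook. It uses central finite differences on the MSE cost and checks that both gradients agree for the example data.

```python
import numpy as np

def mse_cost(X, y, theta):
    # Mean of squared residuals: (1/m) * sum((X @ theta - y)^2)
    return np.mean((X @ theta - y) ** 2)

def mse_gradient(X, y, theta):
    # Vectorized gradient: (2/m) * X^T (X theta - y)
    m = X.shape[0]
    return (2 / m) * X.T @ (X @ theta - y)

def numerical_gradient(X, y, theta, eps=1e-6):
    # Central finite differences, one parameter at a time
    grad = np.zeros_like(theta, dtype=float)
    for j in range(theta.size):
        step = np.zeros_like(theta, dtype=float)
        step[j] = eps
        grad[j] = (mse_cost(X, y, theta + step) - mse_cost(X, y, theta - step)) / (2 * eps)
    return grad

# Example data from this section (first column of X is the intercept column of ones)
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 4.0, 5.0]])
y = np.array([3.0, 6.0])
theta = np.array([0.5, 1.0, 0.5])

analytic = mse_gradient(X, y, theta)
numeric = numerical_gradient(X, y, theta)
print(analytic)
print(np.allclose(analytic, numeric, atol=1e-6))
```

If the two disagree, the usual suspects are a sign error in the residual or a mismatch between the scaling of the cost and the scaling of the gradient.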


In [81]:
y0_pred = 0.5*1 + 1*1 + 0.5*2
y1_pred = 0.5*1 + 1*4 + 0.5*5
y0_true = 3
y1_true = 6
y0_pred, y1_pred

(2.5, 7.0)

In [82]:
# Partial derivatives for each parameter, following the formulas above:
# (2/n) * sum((y_pred - y_true) * x), with n = 2 observations.
d_theta_0 = 2/2 * ((y0_pred - y0_true)*1 + (y1_pred - y1_true)*1)
d_theta_1 = 2/2 * ((y0_pred - y0_true)*1 + (y1_pred - y1_true)*4)
d_theta_2 = 2/2 * ((y0_pred - y0_true)*2 + (y1_pred - y1_true)*5)

d_theta_0, d_theta_1, d_theta_2

(0.5, 3.5, 4.0)

In [83]:
import numpy as np

# Feature matrix X, with a leading column of ones added manually for the intercept term
X = np.array([[1, 1, 2],
              [1, 4, 5]])

# Theta vector (assuming the first element is the intercept term)
thetas = np.array([0.5, 1, 0.5])

# Target vector y remains the same
y = np.array([3,
              6])

# Computing the gradient vector with the vectorized formula (2/m) * X^T (X theta - y), where m = 2
gradient_vector = (2 / 2) * X.T.dot(X.dot(thetas) - y)

gradient_vector

array([0.5, 3.5, 4. ])

In [84]:
# Gradient under the alternative cost convention J = 1/(2m) * sum of squared errors,
# which gives (1/m) * X^T (X theta - y), i.e. half of the gradient used in the text
gradient_vector = (1 / 2) * X.T.dot(X.dot(thetas) - y)
gradient_vector

array([0.25, 1.75, 2.  ])

### Gradient Descent Step:

The gradient descent step is given by:

$\theta = \theta - \alpha \cdot \nabla_{\theta} J(\theta)$

Where:
- $\alpha$ is the learning rate.
- $\theta$ is the vector of model parameters.
- $\nabla_{\theta} J(\theta)$ is the gradient vector of the cost function with respect to the model parameters.

### Computation:
Each parameter can be updated separately:
- $\theta_0 = \theta_0 - \alpha \cdot \frac{\partial J(\theta)}{\partial \theta_0}$
- $\theta_1 = \theta_1 - \alpha \cdot \frac{\partial J(\theta)}{\partial \theta_1}$
- $\theta_2 = \theta_2 - \alpha \cdot \frac{\partial J(\theta)}{\partial \theta_2}$

With a learning rate of $\alpha = 0.001$ and the gradient $[0.5, 3.5, 4.0]$ computed earlier:
- $\theta_0 = 0.5 - 0.001 \cdot 0.5 = 0.4995$
- $\theta_1 = 1 - 0.001 \cdot 3.5 = 0.9965$
- $\theta_2 = 0.5 - 0.001 \cdot 4.0 = 0.496$

This is the computation of all parameters at once:
$\theta = \theta - \alpha \cdot \nabla_{\theta} J(\theta)$
$\theta = \begin{bmatrix} 0.5 \\ 1 \\ 0.5 \end{bmatrix} - 0.001 \cdot \begin{bmatrix} 0.5 \\ 3.5 \\ 4.0 \end{bmatrix}$ = $\begin{bmatrix} 0.4995 \\ 0.9965 \\ 0.496 \end{bmatrix}$



In [85]:
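# Recompute the gradient with the (2/m) formula from the text so this cell matches the worked example above
alpha = 0.001  # learning rate
gradient_vector = (2 / 2) * X.T.dot(X.dot(thetas) - y)
theta_new = thetas - alpha * gradient_vector
print(f"Old thetas:\n{thetas}")
print(f"New thetas:\n{theta_new}")

Old thetas:
[0.5 1.  0.5]
New thetas:
[0.4995 0.9965 0.496 ]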
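
A single update only nudges the parameters; gradient descent repeats the same step many times. The following is a minimal sketch of that loop, assuming the $\frac{2}{m}$ gradient and the learning rate of 0.001 used in this section. The function names and the choice of 100 steps are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def mse_cost(X, y, theta):
    # Mean of squared residuals
    return np.mean((X @ theta - y) ** 2)

def mse_gradient(X, y, theta):
    # (2/m) * X^T (X theta - y)
    m = X.shape[0]
    return (2 / m) * X.T @ (X @ theta - y)

def gradient_descent(X, y, theta, alpha=0.001, n_steps=100):
    # Repeat the update theta <- theta - alpha * gradient, recording the cost after each step
    theta = np.asarray(theta, dtype=float).copy()
    costs = [mse_cost(X, y, theta)]
    for _ in range(n_steps):
        theta = theta - alpha * mse_gradient(X, y, theta)
        costs.append(mse_cost(X, y, theta))
    return theta, costs

# Example data from this section (first column of X is the intercept column of ones)
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 4.0, 5.0]])
y = np.array([3.0, 6.0])
theta_start = np.array([0.5, 1.0, 0.5])

theta_one_step, _ = gradient_descent(X, y, theta_start, n_steps=1)
theta_final, costs = gradient_descent(X, y, theta_start, n_steps=100)

print(theta_one_step)        # parameters after a single update
print(costs[0], costs[-1])   # cost before the first step and after the last step
```

With a learning rate this small, the recorded cost should shrink at every step for this dataset. A learning rate that is too large can make the cost grow instead, which is why tracking `costs` is a useful sanity check.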
