Course: http://cs229.stanford.edu/syllabus.html

pdf from <http://cs229.stanford.edu/notes2021fall/cs229-notes1.pdf>
https://github.com/chasinginfinity/ml-from-scratch/blob/master/02%20Linear%20Regression%20using%20Gradient%20Descent/Linear%20Regression%20using%20Gradient%20Descent.ipynb


![](https://github.com/chasinginfinity/ml-from-scratch/raw/24c0c0472d87f31c65cb9ad82ff0836afce924f1/02%20Linear%20Regression%20using%20Gradient%20Descent/animation1.gif)

<a href="https://colab.research.google.com/github/farishijazi/ai-ml-dl-course/blob/master/3_implementing_LinearRegression.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg"></a>

In [None]:
from IPython.display import YouTubeVideo
import numpy as np
from matplotlib import pyplot as plt
import IPython as ipy

YouTubeVideo('aircAruvnKk', width=800, height=300)

## Why weight and bias?

Play around with the graph below and see how W and b each change the line.

In [None]:
#play around with Y=MX+b
# why have a weight and a bias?

In [None]:
ipy.display.HTML('<iframe src="https://www.desmos.com/calculator/os5lfdggic" width="800px" height="500px"></iframe>')

In [None]:
# we could instead append a column of 1s to the input data and add one extra row to the weight matrix;
# the bias then becomes just another weight, which is simpler and more efficient
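
The bias trick mentioned above can be checked numerically. This is a minimal sketch on made-up data; the names `X_demo`, `W_demo`, `b_demo`, `X_aug` and `W_aug` are only for illustration and are not used elsewhere in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 3))   # 5 samples, 3 features (made-up data)
W_demo = rng.normal(size=(3, 1))   # one weight per feature
b_demo = 0.7                       # scalar bias

# Explicit form: weights and bias kept separate.
y_separate = X_demo @ W_demo + b_demo

# Folded form: append a column of ones to X and append b as an extra row of W.
X_aug = np.hstack((X_demo, np.ones((X_demo.shape[0], 1))))
W_aug = np.vstack((W_demo, [[b_demo]]))
y_folded = X_aug @ W_aug

# Both forms give the same predictions.
print(np.allclose(y_separate, y_folded))
```

With the folded form there is a single matrix of parameters to update, so the gradient descent code does not need a separate rule for the bias.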


### Prediction

(assuming $\theta$ and $x$ are both vectors; $\theta$ is the weight vector `W`, and $x_0 = 1$ so that $\theta_0$ acts as the bias)

$$
\hat{y} = \sum_{i=0}^{d} \theta_{i} x_{i}
$$

as vectors:

$$
\hat{y} = \theta^T x
$$


```python
def predict(W, X):
    y_pred = np.dot(W.T, X)
    return y_pred
```
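
To see that the sum form and the vector form of the prediction agree, here is a small sketch with hypothetical numbers (`theta_demo` and `x_demo` are made up for this example):

```python
import numpy as np

theta_demo = np.array([0.5, -1.0, 2.0])  # hypothetical weights, theta_0 is the bias
x_demo = np.array([1.0, 3.0, 4.0])       # hypothetical input, x_0 = 1 is the bias input

# Element-by-element sum: theta_0*x_0 + theta_1*x_1 + theta_2*x_2
y_sum = sum(t * x for t, x in zip(theta_demo, x_demo))

# The same computation as a single dot product
y_dot = np.dot(theta_demo.T, x_demo)

# 0.5*1 - 1*3 + 2*4 = 5.5 in both cases
print(y_sum, y_dot)
```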

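### Loss function


$$J(\theta) = \frac{1}{2} \cdot \sum_{i=1}^{n}(\hat{y}^{(i)} - y^{(i)})^2 $$

Substituting the prediction $\hat{y}^{(i)} = \theta^T x^{(i)}$, where $x^{(i)}$ is the $i$-th of $n$ training examples:

$$J(\theta) = \frac{1}{2} \cdot \sum_{i=1}^{n}(\theta^T x^{(i)} - y^{(i)})^2 $$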

```python
def J(y_pred, Y):
    loss = 1 / 2 * (y_pred - Y)**2
    return loss
```
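
Note that the function above returns one loss value per sample, while the formula sums them. A short sketch with made-up numbers shows both steps; `squared_error`, `y_true_demo` and `y_pred_demo` are hypothetical names used only here.

```python
import numpy as np

def squared_error(y_pred, y_true):
    # per-sample squared error, halved as in the formula above
    return 0.5 * (y_pred - y_true) ** 2

y_true_demo = np.array([1.0, 2.0, 3.0])
y_pred_demo = np.array([1.0, 2.5, 5.0])

per_sample = squared_error(y_pred_demo, y_true_demo)  # values: 0, 0.125 and 2
total = per_sample.sum()                              # 2.125, the J(theta) of the formula

print(per_sample)
print(total)
```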

### gradient descent

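Gradient descent needs the derivative of the loss with respect to each weight $\theta_j$. Each step moves the weight against that derivative, scaled by the learning rate $\alpha$: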
$$
\theta_{j}:=\theta_{j}-\alpha \frac{\partial}{\partial \theta_{j}} J(\theta)
$$

```python
def W_update(W_old, j_grad, alpha=0.1):
    W_new = W_old - alpha * j_grad
    return W_new
```

![](https://miro.medium.com/max/1024/1*G1v2WBigWmNzoMuKOYQV_g.png)
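
To get a feel for the update rule before applying it to regression, here is a minimal sketch on a toy one-dimensional loss $J(\theta) = (\theta - 3)^2$. The toy loss and the name `grad_demo` are assumptions made for this example only.

```python
def grad_demo(theta):
    # derivative of the toy loss J(theta) = (theta - 3)**2
    return 2 * (theta - 3)

theta = 0.0   # starting guess
alpha = 0.1   # learning rate

for step in range(50):
    # the same rule as W_update: step against the gradient
    theta = theta - alpha * grad_demo(theta)

# theta ends up very close to 3, the minimum of the toy loss
print(theta)
```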

---

Let's find the derivative of the loss function with respect to the weights (the sum runs over the $n$ training examples):

$$J(\theta) = \frac{1}{2} \cdot \sum_{i=1}^{n}(\theta^T x^{(i)} - y^{(i)})^2 $$

Derivative of the prediction:

$$\frac{\partial}{\partial \theta_{j}}\hat{y} = x_j$$

 
Official derivation (for a single training example):
$$
\begin{aligned}
\frac{\partial}{\partial \theta_{j}} J(\theta) &=\frac{\partial}{\partial \theta_{j}} \frac{1}{2}\left(\hat{y}-y\right)^{2} \\
&=2 \cdot \frac{1}{2}\left(\hat{y}-y\right) \cdot \frac{\partial}{\partial \theta_{j}}\left(\hat{y}-y\right) \\
&=\left(\hat{y}-y\right) \cdot \frac{\partial}{\partial \theta_{j}}\left(\sum_{i=0}^{d} \theta_{i} x_{i}-y\right) \\
&=\left(\hat{y}-y\right) x_{j}
\end{aligned}
$$

```python
def J_gradient(y_pred, Y, X):
    j_gradients = (y_pred - Y) * X  # per-sample gradients, one row per training sample
    j_gradient_sum = j_gradients.sum(axis=0).reshape(-1, 1)  # sum over samples, as a column vector shaped like W
    return j_gradient_sum
```
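
One way to convince yourself that the derivation is right is to compare it with a numerical derivative. The sketch below does that on random made-up data; `X_chk`, `Y_chk`, `W_chk` and `total_loss` are hypothetical names for this check only.

```python
import numpy as np

rng = np.random.default_rng(0)
X_chk = rng.normal(size=(6, 3))  # 6 made-up samples, 3 features
Y_chk = rng.normal(size=(6, 1))
W_chk = rng.normal(size=(3, 1))

def total_loss(W):
    # J(theta): half the sum of squared errors over all samples
    return 0.5 * np.sum((X_chk @ W - Y_chk) ** 2)

# Analytical gradient from the derivation: sum over samples of (y_hat - y) * x_j
analytic = ((X_chk @ W_chk - Y_chk) * X_chk).sum(axis=0).reshape(-1, 1)

# Numerical gradient: nudge one weight at a time (central difference)
eps = 1e-6
numeric = np.zeros_like(W_chk)
for j in range(W_chk.shape[0]):
    step = np.zeros_like(W_chk)
    step[j] = eps
    numeric[j] = (total_loss(W_chk + step) - total_loss(W_chk - step)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))
```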

<!-- 
For one training example, the gradient update is as follows:

$$
\theta_{j}:=\theta_{j}-\alpha (\theta^{(i)} * X^{(i)}_j - y^{(i)})*(X^{(i)}_j)
$$
 -->

The value of $\alpha$ is a hyperparameter: it is not learned from the data, you choose it yourself, like turning a knob.

<img src="https://www.lampandlight.eu/blog/wp-content/uploads/sites/7/2019/04/Blog700x510_dimmen-1.png" width=200px>
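
To see what turning that knob does, here is a small sketch on a toy loss $J(\theta) = \theta^2$ (an assumption for illustration, not the regression loss). The helper `run_gd` is hypothetical.

```python
def run_gd(alpha, steps=20, theta=1.0):
    # gradient descent on the toy loss J(theta) = theta**2, whose gradient is 2*theta
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

for alpha in (0.01, 0.1, 1.1):
    print(alpha, run_gd(alpha))
```

A small $\alpha$ moves slowly toward the minimum at 0, a moderate one gets there much faster, and a value that is too large overshoots on every step so the result grows instead of shrinking.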


In [None]:
import os

def show_plot(y_pred, Y):
    plt.scatter(np.arange(len(Y)), Y, label='data')
    plt.plot(y_pred, label='prediction', color='red')
    plt.legend()
    plt.xlabel('x')
    plt.ylabel('y')

In [None]:
# HW: find the derivative of J with respect to b in Y = WX + b

In [None]:

def predict(W, X):
    y_pred = np.dot(X, W)
    return y_pred

def J(y_pred, Y):
    loss = 1 / 2 * (y_pred - Y)**2
    return loss

def J_gradient(y_pred, Y, X):
    j_gradients = (y_pred - Y)*X # per-sample gradients, one row per training sample
    j_gradient_sum = j_gradients.sum(axis=0) # summing the gradients over all samples
    j_gradient_sum = np.expand_dims(j_gradient_sum, -1) # reshaping so it will have the same shape as W
    return j_gradient_sum

def W_update(W_old, j_grad, alpha=0.0001):
    W_new = W_old - alpha * j_grad
    return W_new



## Now implement the gradient descent algorithm

All the functions you need are already defined above.

<!-- hide the answer -->
<details>
<summary>If you REALLY give up, the answer is hiding here</summary>

```python
y_pred = predict(W, X) # make prediction
loss = J(y_pred, Y) # calculate loss from prediction
j_grad = J_gradient(y_pred, Y, X) # calculate gradient
W = W_update(W, j_grad, alpha=0.0001) # update weights
```
</details>

In [None]:

import sklearn.datasets
iris_dataset = sklearn.datasets.load_iris()
X = iris_dataset['data']
Y = np.expand_dims(iris_dataset['target'].T, 1)


print(X.shape)
## now that we chose the data, we will initialize the weights
!rm -f plots/*  # delete old plots before making the gif (-f: no error if there are none)

use_bias = True

W = 2*np.random.uniform(size=(X.shape[1] + (1 if use_bias else 0), 1)) - 1
if use_bias:
    X = np.hstack((X, np.ones((X.shape[0], 1)))) ## adding bias term
print(X.shape)


for i in range(200):
    #########################################
    # #TODO: implement the gradient descent #
    # hint, look at the functions above     #
    #########################################
    y_pred = # make prediction                ## TODO: complete this line
    loss =   # calculate loss from prediction ## TODO: complete this line
    j_grad = # calculate gradient             ## TODO: complete this line
    W =      # update weights                 ## TODO: complete this line

    print(f'[{i}] loss', loss.mean())
    if i % 4 == 0:
        show_plot(y_pred, Y)
        plt.title(f'linear regression iter:{i}. W={list(W.reshape(-1))}')
        os.makedirs('plots', exist_ok=True)
        plt.savefig(f'plots/{str(i).zfill(3)}.png')
        if i % 20 == 0:
            plt.show()
        plt.close()

print('y_pred', y_pred.shape)
show_plot(y_pred, Y)
plt.show()

In [None]:
## (optional) create GIF

!apt install -y imagemagick > /dev/null
!convert plots/*.png plots/linearregression1.gif
ipy.display.HTML('<img src="plots/linearregression1.gif">')

### Comparing with `sklearn`

In [None]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X, Y)
# reg.score(X, Y)
y_pred = reg.predict(X)

show_plot(y_pred, Y)
plt.show()

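SUCCESS!!

The two plots should look alike. `LinearRegression` solves the least-squares problem directly, while our gradient descent only approaches that solution step by step, so small differences are expected after 200 iterations.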
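
The same comparison can be made on the weights themselves rather than on plots. The sketch below uses synthetic data (not the iris data above) and `np.linalg.lstsq` as the exact least-squares reference; all names (`X_syn`, `Y_syn`, `W_gd`, `W_exact`) and the learning rate are assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 50

# Two random features plus a column of ones for the bias
X_syn = np.hstack((rng.normal(size=(n_samples, 2)), np.ones((n_samples, 1))))
true_W = np.array([[2.0], [-1.0], [0.5]])
Y_syn = X_syn @ true_W + 0.1 * rng.normal(size=(n_samples, 1))  # noisy targets

# Gradient descent with the summed gradient, as in J_gradient
W_gd = np.zeros((3, 1))
for _ in range(1000):
    y_hat = X_syn @ W_gd
    grad = ((y_hat - Y_syn) * X_syn).sum(axis=0).reshape(-1, 1)
    W_gd = W_gd - 0.005 * grad

# Exact least-squares solution for reference
W_exact, *_ = np.linalg.lstsq(X_syn, Y_syn, rcond=None)

# Largest difference between the two weight vectors
print(np.abs(W_gd - W_exact).max())
```

On well-scaled data like this, gradient descent gets very close to the exact solution. On the raw iris features it converges more slowly, which is why the number of iterations and $\alpha$ matter.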

In [None]:
X = np.array([[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1], ])
Y = np.array([[0, 1, 1, 0, 0, 1, 1, 0]]).T

W = 2*np.random.uniform(size=(3, 1))-1
b = np.zeros((1, 3))

# initializing weights between -1 and 1
plt.hist(2*np.random.uniform(size=1000)-1)