### Setup / Imports

In [1]:
import numpy as np

In [2]:
X = np.random.randint(low=1, high=10, size=(20, 3))
y = 1*X[:, 0] + 2*X[:, 1] + 1*X[:, 2]

### Overview

The goal of linear regression is to fit a hypothesis, $h(x)$, as a linear combination of terms:
$$\theta_0 + \theta_1 \cdot x_1 + \ldots + \theta_n \cdot x_n = h(x)$$
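
As a rough sketch, the hypothesis can be evaluated as a dot product. One common convention, assumed here and not used in the code cells of this notebook, is to prepend a constant feature $x_0 = 1$ so that the intercept $\theta_0$ is handled like any other weight. The name `hypothesis` and the numbers below are illustrative only.

```python
import numpy as np

def hypothesis(theta, x):
    """h(x) = theta_0 + theta_1*x_1 + ... + theta_n*x_n."""
    # Assumption: prepend x_0 = 1 so theta_0 acts as the intercept
    x_with_bias = np.concatenate(([1.0], x))
    return np.dot(theta, x_with_bias)

# Hypothetical values: [theta_0, theta_1, theta_2, theta_3]
theta_example = np.array([0.5, 1.0, 2.0, 1.0])
x_example = np.array([3.0, 1.0, 4.0])

print(hypothesis(theta_example, x_example))  # 0.5 + 1*3 + 2*1 + 1*4
```

The data generated in the setup cell has no intercept, which is why the later cells can get away with `np.dot(theta, x)` on the raw features.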

The loss function, $J$, that we choose determines what kind of linear regression we are doing. For example:
- $J(\theta) = \frac{1}{2} \sum_{i=1}^n (h(x^{(i)}) - y)^2$, OLS or "Ordinary Least Squares"
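
As a minimal sketch, the OLS loss above can be evaluated directly with NumPy. The function name `ols_loss` and the toy arrays are assumptions made for illustration; like the code cells in this notebook, the sketch omits the intercept $\theta_0$.

```python
import numpy as np

def ols_loss(theta, X, y):
    """Half the sum of squared residuals: 1/2 * sum((h(x) - y)^2)."""
    residuals = X @ theta - y  # h(x^(i)) - y for every row i
    return 0.5 * np.sum(residuals ** 2)

# Hypothetical toy data, built with the same weights (1, 2, 1) as y in the setup cell
X_toy = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
y_toy = X_toy @ np.array([1.0, 2.0, 1.0])

print(ols_loss(np.array([1.0, 2.0, 1.0]), X_toy, y_toy))  # loss at the true weights
print(ols_loss(np.zeros(3), X_toy, y_toy))                # loss at the all-zero starting point
```

The loss is zero at the weights that generated the data and grows as $\theta$ moves away from them, which is what gradient descent exploits.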

### Gradient Derivation
Let's work out the gradient of our "Ordinary Least Squares" loss function:
$$\frac{\partial}{\partial \theta_j} J_\theta = \frac{\partial}{\partial \theta_j} \frac{1}{2} \sum_{i=1}^n (h(x^{(i)}) - y)^2$$

To make the derivation simpler, let's replace $\frac{1}{2} \sum_{i=1}^n (h(x^{(i)}) - y)^2$ with $\frac{1}{2} (h(x^{(i)}) - y)^2$, thus computing the gradient over a single training example $i$ rather than all $n$ training examples:
$$\frac{\partial}{\partial \theta_j} J_\theta = \frac{\partial}{\partial \theta_j} \frac{1}{2} (h(x^{(i)}) - y)^2$$

By the chain rule, this then breaks into
$$= \frac{1}{2} \cdot 2 \cdot (h(x^{(i)}) - y) \cdot \frac{\partial}{\partial \theta_j} (h(x^{(i)}) - y)$$
$$= (h(x^{(i)}) - y) \cdot (x_j) $$

Note: When we take the partial derivative with respect to $\theta_j$, every term that does not contain $\theta_j$ is treated as a constant and differentiates to $0$. Since $h(x) = \theta_0 + \theta_1 \cdot x_1 + \ldots + \theta_n \cdot x_n$ and $y$ does not depend on $\theta$, the only term that survives is $\theta_j \cdot x_j$, whose derivative is $x_j$.
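
One way to sanity-check this derivation is to compare the analytic gradient $(h(x^{(i)}) - y) \cdot x_j$ against a finite-difference approximation of the single-example loss. This is only a sketch: the function names, the step size `eps`, and the sample values are all assumptions chosen for illustration.

```python
import numpy as np

def single_example_loss(theta, x, y):
    # 1/2 * (h(x) - y)^2 for one training example
    return 0.5 * (np.dot(theta, x) - y) ** 2

def analytic_gradient(theta, x, y):
    # The result derived above: (h(x) - y) * x_j for every j
    return (np.dot(theta, x) - y) * x

def numeric_gradient(theta, x, y, eps=1e-6):
    # Central differences: perturb one theta_j at a time
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (single_example_loss(theta + step, x, y)
                   - single_example_loss(theta - step, x, y)) / (2 * eps)
    return grad

# Hypothetical values for the check
theta_chk = np.array([0.5, -1.0, 2.0])
x_chk = np.array([3.0, 1.0, 4.0])
y_chk = 9.0

print(analytic_gradient(theta_chk, x_chk, y_chk))
print(numeric_gradient(theta_chk, x_chk, y_chk))
```

If the derivation is right, the two printed vectors should agree up to floating-point error.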

### Gradient Descent Rule
Gradient Descent, by definition, is
$$\theta_j := \theta_j - \alpha \cdot \frac{\partial}{\partial \theta_j} J_\theta$$
(where $\alpha$ is a parameter called the learning rate, which controls the step size of each gradient update)

Plugging in the gradient we derived above, we get the update rule for our first OLS linear regression model:
$$\theta_j := \theta_j - \alpha \cdot (h(x^{(i)}) - y) \cdot (x_j)$$ 
or
$$\theta_j := \theta_j + \alpha \cdot (y- h(x^{(i)})) \cdot (x_j)$$
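
To make the rule concrete, here is a single update worked through by hand. The learning rate and the training example are hypothetical values picked so the arithmetic is easy to follow; they are not the settings used in the cells of this notebook.

```python
import numpy as np

alpha_demo = 0.05                     # hypothetical learning rate
theta_before = np.zeros(3)            # start from all-zero weights
x_i = np.array([1.0, 2.0, 3.0])       # one training example
y_i = 8.0                             # its target: 1*1 + 2*2 + 1*3

h_before = np.dot(theta_before, x_i)  # prediction before the update
theta_after = theta_before + alpha_demo * (y_i - h_before) * x_i
h_after = np.dot(theta_after, x_i)    # prediction after the update

print(theta_after)
print(abs(y_i - h_before), abs(y_i - h_after))  # residual before and after
```

The residual shrinks after one step, but not to zero. Repeating the update over many examples and epochs is what drives $\theta$ toward the weights that generated the data.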

In [3]:
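# let's try out our derivation

def loss(h, x, y, alpha):
    # Despite its name, this returns the update step alpha * (y - h) * x,
    # i.e. the negative gradient scaled by the learning rate.
    # h is a scalar (current prediction for x)
    # x is a vector (training sample)
    # y is a scalar (target for x)
    # alpha is a scalar (learning rate)

    return (alpha * (y - h)) * x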

In [4]:
def simple_gradient_descent(X, y):
    epochs, alpha = 1000, .0001
    (n_rows, n_features) = X.shape
    theta = np.zeros(n_features)
    
    for _ in range(epochs):
        for x, y_i in zip(X, y):
            # estimate value
            h = np.dot(theta, x)
            
            # compute the update step (alpha times the negative gradient)
            loss_value = loss(h, x, y_i, alpha)
            
            # update weights
            theta = theta + loss_value
    
    return theta

In [5]:
theta = simple_gradient_descent(X, y)
theta

array([1.00000011, 1.99999802, 1.00000201])

### Batch Gradient Descent
Our original algorithm is slow because it computes a gradient and updates $\theta$ once per training example. What if we instead averaged the squared residuals over all of our training samples and took one step per pass? In math terms:

$$\frac{\partial}{\partial \theta_j} J_\theta = \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^n (h(x^{(i)}) - y)^2$$

We already know (from above)
$$\frac{\partial}{\partial \theta_j} \frac{1}{2} (h(x^{(i)}) - y)^2 = (h(x^{(i)}) - y) \cdot (x_j)$$

Therefore 

$$\frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^n (h(x^{(i)}) - y)^2 = \frac{1}{n} \sum_{i=1}^n (h(x^{(i)}) - y) \cdot (x_j)$$
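
The averaged gradient can be computed either by looping over the examples or with a single matrix product. The sketch below compares the two on a small hypothetical dataset; the function names and data are assumptions for illustration.

```python
import numpy as np

def batch_gradient_loop(theta, X, y):
    # Average the per-example gradients (h(x) - y) * x one row at a time
    grad = np.zeros_like(theta)
    for x_i, y_i in zip(X, y):
        grad += (np.dot(theta, x_i) - y_i) * x_i
    return grad / len(y)

def batch_gradient_vectorized(theta, X, y):
    # Same quantity as one matrix product: (1/n) * X^T (X theta - y)
    return X.T @ (X @ theta - y) / len(y)

# Hypothetical toy data with true weights (1, 2, 1)
X_small = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
y_small = X_small @ np.array([1.0, 2.0, 1.0])
theta_small = np.zeros(3)

print(batch_gradient_loop(theta_small, X_small, y_small))
print(batch_gradient_vectorized(theta_small, X_small, y_small))
```

Both functions return the same vector; the vectorized form is the one that maps onto `np.dot(X.T, y_hat - y)`.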

In [6]:
def batch_gradient_descent(X, y):
    (n_rows, n_features) = X.shape
    theta = np.zeros(n_features)
    alpha, epochs = .0001, 1000
    for _ in range(epochs):
        # predictions for all training examples at once
        y_hat = np.dot(X, theta)
        # one update per epoch, using the gradient averaged over all n_rows examples
        theta = theta - alpha * (1.0 / n_rows) * np.dot(X.T, y_hat - y)
    return theta

In [7]:
batch_gradient_descent(X, y)

array([1.11764269, 1.70510187, 1.20338965])

### The Normal Equations

It's worth noting that in this specific case (ordinary least squares) there is a closed-form solution for $\theta$, so it can be computed directly rather than iteratively.
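
For reference, the closed form is usually written as the normal equations, $X^T X \theta = X^T y$, which give $\theta = (X^T X)^{-1} X^T y$ whenever $X^T X$ is invertible. The sketch below solves that linear system with `np.linalg.solve` rather than forming the inverse explicitly. The function name `normal_equations` and the toy data are assumptions for illustration.

```python
import numpy as np

def normal_equations(X, y):
    # Solve (X^T X) theta = X^T y; assumes X^T X is invertible
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical toy data with true weights (1, 2, 1)
X_ne = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0],
                 [7.0, 8.0, 10.0],
                 [2.0, 1.0, 1.0]])
y_ne = X_ne @ np.array([1.0, 2.0, 1.0])

print(normal_equations(X_ne, y_ne))  # should be close to [1, 2, 1]
```

No learning rate or epoch count is needed here, at the cost of solving a linear system whose size grows with the number of features.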