# Linear Regression

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Program Variables
num_iters = 1500  # Number of iterations for gradient descent
alpha = 0.01  # Gradient descent learning rate

## Linear Regression Model Representation

Given the following assumptions:
- Training set contains $m$ observations with $n$ features
- Training set of features $X = [(x_1^{(1)}, x_2^{(1)}, \ldots, x_n^{(1)}), \ldots , (x_1^{(m)}, x_2^{(m)}, \ldots, x_n^{(m)}) ]$
- Training set of labels $y = [y^{(1)}, \ldots , y^{(m)}]$
- Parameter vector $\theta = [\theta_0, \ldots , \theta_n]$

The goal is to create a hypothesis function $h_{\theta}(x)$ with parameters (coefficients) $\theta$ that accurately predicts the label for a new, unseen $x$. The form of the hypothesis function is:

$$
h_{\theta}(x) = \theta_0 + \theta_1x_1 + \ldots + \theta_nx_n
$$

Assuming $x_0 = 1$ for all observations, this is equivalent to:

$$
h_{\theta}(x) = \displaystyle \sum_{j=0}^{n} \theta_jx_j
$$

Applying the assumption $x_0^{(i)} = 1$ for all observations adds a new first column of ones to $X$. This creates a design matrix of dimension $m \times (n+1)$ that can be multiplied by the $(n+1) \times 1$ vector $\theta$. The vectorized representation of the hypothesis function, which yields an $m \times 1$ vector containing $h_{\theta}(x^{(i)})$ for every observation $i = 1, \ldots, m$, is:

$$
h_{\theta}(X) = X\theta
$$
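
As a minimal sketch of the vectorized hypothesis, the snippet below builds a design matrix from a small made-up feature matrix and multiplies it by a made-up parameter vector. The names `X_raw`, `theta`, and `h`, and all of the numbers, are illustrative assumptions and not part of the original notebook.

```python
import numpy as np

# Hypothetical toy data: m = 3 observations, n = 2 features
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
m = X_raw.shape[0]

# Prepend the x_0 = 1 column to form the m x (n+1) design matrix
X = np.hstack([np.ones((m, 1)), X_raw])

# Hypothetical (n+1) x 1 parameter vector [theta_0, theta_1, theta_2]
theta = np.array([[0.5], [1.0], [-1.0]])

# Vectorized hypothesis: one prediction per observation, shape m x 1
h = X @ theta
print(X.shape, h.shape)
```

Each row of `h` is $\theta_0 + \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)}$ for the corresponding observation, so no explicit loop over observations is needed.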

## Linear Regression Cost Function

Given a training set of $m$ observations with design matrix $X$ and labels $y$, we want to find the $n+1$ parameters $\theta$ that minimize the following cost function:

$$
J(\theta_0, \ldots, \theta_n) = J(\theta) = \frac{1}{2m} \displaystyle \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2
$$

Vectorized formula:

$$
J(\theta) = \frac{1}{2m} (X\theta - y)^{T} (X\theta - y)
$$

In [2]:
def calc_cost(X, y, theta):
    '''
    Calculates the cost of using theta as the parameters for linear
        regression to fit the data points in X and y
    X: mx(n+1) design matrix of training set features
    y: mx1 vector of training set labels
    theta: (n+1)x1 vector of parameters
    Output: J(theta), float
    '''
    m = len(y)  # Number of training observations
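    residual = X @ theta - y  # m x 1 vector of prediction errors
    # Vectorized cost: J = (1/(2m)) * (X*theta - y)^T (X*theta - y)
    J = (residual.T @ residual).item() / (2 * m)
    return J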
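
As a quick sanity check of the two cost formulas above, the following sketch evaluates the summation form and the vectorized form on a tiny made-up data set and confirms they agree. The helper names `cost_sum` and `cost_vectorized` and the data are assumptions introduced only for illustration.

```python
import numpy as np

def cost_sum(X, y, theta):
    """Summation form of J(theta), looping over the m observations."""
    m = len(y)
    total = 0.0
    for i in range(m):
        h_i = (X[i] @ theta).item()  # prediction for observation i
        total += (h_i - y[i, 0]) ** 2
    return total / (2 * m)

def cost_vectorized(X, y, theta):
    """Vectorized form of J(theta)."""
    m = len(y)
    residual = X @ theta - y
    return (residual.T @ residual).item() / (2 * m)

# Hypothetical data lying exactly on the line y = x (first column is x_0 = 1)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([[1.0], [2.0], [3.0]])

theta_zero = np.zeros((2, 1))             # all-zero parameters
theta_exact = np.array([[0.0], [1.0]])    # parameters that fit the data exactly

print(cost_sum(X, y, theta_zero), cost_vectorized(X, y, theta_zero))
print(cost_vectorized(X, y, theta_exact))
```

With all-zero parameters the cost is $(1^2 + 2^2 + 3^2) / (2 \cdot 3) = 14/6$, and it drops to zero when $\theta$ reproduces the labels exactly.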

## Gradient Descent Method

Gradient descent formula:

Repeat until convergence {  
$\quad \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta) $  
}

Substituting the partial derivative of the cost function (all $\theta_j$, for $j = 0, \ldots, n$, are updated simultaneously):

Repeat {  
$ \quad \theta_j := \theta_j - \frac {\alpha}{m} \displaystyle \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)} $  
}

Vectorized formula:

$\theta := \theta - \frac {\alpha}{m} (X^T (X\theta - y))$
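
The vectorized update can be sketched in NumPy roughly as follows. The function name `gradient_descent`, the toy data, and the specific `alpha` and `num_iters` values used in the call are assumptions chosen for illustration. A fixed iteration count stands in for a convergence test.

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Run num_iters steps of batch gradient descent and return the fitted theta.

    X: m x (n+1) design matrix, y: m x 1 labels, theta: (n+1) x 1 initial parameters.
    """
    m = len(y)
    theta = theta.astype(float).copy()  # avoid modifying the caller's array
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m  # (1/m) * X^T (X theta - y)
        theta = theta - alpha * gradient      # updates every theta_j at once
    return theta

# Hypothetical data generated from y = 1 + 2x (first column is x_0 = 1)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([[1.0], [3.0], [5.0]])

theta_fit = gradient_descent(X, y, np.zeros((2, 1)), alpha=0.1, num_iters=2000)
print(theta_fit.ravel())  # should approach [1, 2] for this toy data
```

Because the whole vector $\theta$ is replaced in a single assignment, the simultaneous-update requirement is satisfied automatically.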

Note that $X$ is the design matrix of features; scaling its feature columns to comparable ranges helps the gradient descent algorithm converge more quickly.
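
One common way to scale features is standardization (subtract each column's mean and divide by its standard deviation). The sketch below assumes that choice; the original text does not specify a scaling method, and the helper name `standardize_features` and the sample values are hypothetical.

```python
import numpy as np

def standardize_features(X_raw):
    """Scale each feature column to zero mean and unit standard deviation."""
    mu = X_raw.mean(axis=0)
    sigma = X_raw.std(axis=0)
    sigma[sigma == 0] = 1.0  # leave constant columns unscaled to avoid division by zero
    return (X_raw - mu) / sigma, mu, sigma

# Hypothetical raw features on very different scales
X_raw = np.array([[2104.0, 3.0],
                  [1600.0, 3.0],
                  [2400.0, 4.0]])

X_scaled, mu, sigma = standardize_features(X_raw)

# Add the x_0 = 1 column only after scaling, so it stays a column of ones
X = np.hstack([np.ones((X_scaled.shape[0], 1)), X_scaled])
print(X)
```

The returned `mu` and `sigma` should be kept so that any new observation can be scaled the same way before it is passed to the hypothesis function.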

## Normal Equations Method

There's an analytical solution 