# Linear Regression

In [93]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import Data - Housing Prices
data = pd.read_csv('ex1data2.txt', header=None)
data.columns = ['SquareFeet', 'NumBedrooms', 'Price']

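# DataFrame.as_matrix() was removed from pandas; to_numpy() is the replacement
X = data[['SquareFeet', 'NumBedrooms']].to_numpy()
# y = data['Price']
# reshape(-1, 1) infers the number of rows instead of hard-coding 47
y = data['Price'].to_numpy().reshape(-1, 1)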
m = len(y)
n = X.shape[1]

data.head(10)

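| | SquareFeet | NumBedrooms | Price |
|---|---|---|---|
| 0 | 2104 | 3 | 399900 |
| 1 | 1600 | 3 | 329900 |
| 2 | 2400 | 3 | 369000 |
| 3 | 1416 | 2 | 232000 |
| 4 | 3000 | 4 | 539900 |
| 5 | 1985 | 4 | 299900 |
| 6 | 1534 | 3 | 314900 |
| 7 | 1427 | 3 | 198999 |
| 8 | 1380 | 3 | 212000 |
| 9 | 1494 | 3 | 242500 |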


## Linear Regression Model Representation

Given the following assumptions:
- Training set contains $m$ observations with $n$ features
- Training set of features $X = [(x_1^{(1)}, x_2^{(1)}, \ldots, x_n^{(1)}), \ldots , (x_1^{(m)}, x_2^{(m)}, \ldots, x_n^{(m)}) ]$
- Training set of labels $y = [y^{(1)}, \ldots , y^{(m)}]$
- Parameter vector $\theta = [\theta_0, \ldots , \theta_n]$

The goal is to build a hypothesis function $h_{\theta}(x)$ with parameters (coefficients) $\theta$ that accurately predicts the label for a new, unseen $x$. The hypothesis function has the form:

$$
h_{\theta}(x) = \theta_0 + \theta_1x_1 + \ldots + \theta_nx_n
$$

Assuming $x_0 = 1$ for all observations, this is equivalent to:

$$
h_{\theta}(x) = \displaystyle \sum_{j=0}^{n} \theta_j x_j = \theta^T x
$$

You can add a new first column to $X$ by applying the assumption $x_0^{(i)} = 1$ for all observations. This creates a design matrix of dimension $m \times (n+1)$ that can be multiplied by the $(n+1) \times 1$ vector $\theta$. The vectorized hypothesis function, which yields an $m \times 1$ vector containing $h_{\theta}(x^{(i)})$ for every observation $i = 1, \ldots, m$, is:

$$
h_{\theta}(x) = X\theta
$$
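
As a small illustration of the vectorized form, the sketch below builds a design matrix for a toy feature matrix and compares $X\theta$ with the explicit sum. The arrays `X_toy` and `theta_toy` hold made-up values chosen only for this example; they are not taken from the housing data.

```python
import numpy as np

# Toy feature matrix: m = 3 observations, n = 2 features (illustrative values)
X_toy = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])

# (n+1) x 1 parameter vector: theta_0 = 1, theta_1 = 2, theta_2 = 3
theta_toy = np.array([[1.0], [2.0], [3.0]])

# Prepend the x_0 = 1 column to form the m x (n+1) design matrix
X_toy_design = np.hstack([np.ones((X_toy.shape[0], 1)), X_toy])

# Vectorized hypothesis: one prediction per observation, shape (m, 1)
h_vectorized = X_toy_design @ theta_toy

# Same predictions from the explicit sum theta_0 + theta_1*x_1 + theta_2*x_2
h_explicit = np.array([[1.0 + 2.0 * x1 + 3.0 * x2] for x1, x2 in X_toy])

print(h_vectorized.ravel())  # [ 9. 19. 29.]
```

Both approaches give the same $m \times 1$ vector; the matrix product simply evaluates all $m$ sums at once.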

## Feature Scaling

Adjusting the model's numeric features so they're all on the same scale is important for linear regression, particularly if you use the gradient descent algorithm to minimize the cost function, or you use regularization.

A simple technique is to normalize each feature so it has a mean of zero and standard deviation of 1:

$$
x_j := \frac {x_j - \bar{x}_j}{s_j}
$$

Where $\bar{x}_j$ is the sample mean of feature $j$ and $s_j$ is its sample standard deviation. Each feature is scaled with its own mean and standard deviation.

In [99]:
def feature_normalize(X):
    '''
    Normalizes the features in X where mean value of each
        feature is 0 and the standard deviation is 1
    Input: mxn feature matrix X
    Output: returns mxn normalized version of X,
        the 1xn row vector of means for each feature,
        and the 1xn row vector of standard deviations
    '''
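    # Compute the statistics per feature (column), not over the whole matrix
    mean = X.mean(axis=0, keepdims=True)
    # ddof=1 gives the sample standard deviation
    s = X.std(axis=0, ddof=1, keepdims=True)
    X_norm = (X - mean) / s
    return (X_norm, mean, s)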


X_norm, mean, s = feature_normalize(X)
# print('Mean:\n{}\nStd Dev:\n{}'.format(mean, s))

# Add new first column of 1's
x_0 = np.ones((m, 1))
X_design = np.concatenate((x_0, X_norm), axis=1)
X_design.shape

(47, 3)

## Linear Regression Cost Function

Given a training set of $m$ observations with feature matrix $X$ and labels $y$, and the $n + 1$ parameters $\theta$, we want to minimize the following cost function:

$$
J(\theta_0, \ldots, \theta_n) = J(\theta) = \frac{1}{2m} \displaystyle \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2
$$

Vectorized formula:

$$
J(\theta) = \frac{1}{2m} (X\theta - y)^{T} (X\theta - y)
$$
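
To see that the two formulas agree, the following sketch evaluates the cost with an explicit loop over observations and with the vectorized expression. The function names `cost_loop` and `cost_vectorized` and the toy arrays are assumptions made for this illustration only.

```python
import numpy as np


def cost_loop(X, y, theta):
    """Cost from the summation formula, one observation at a time."""
    m = len(y)
    total = 0.0
    for i in range(m):
        h_i = np.dot(X[i], theta[:, 0])  # h_theta(x^(i)), a scalar
        total += (h_i - y[i, 0]) ** 2
    return total / (2 * m)


def cost_vectorized(X, y, theta):
    """Cost from the vectorized formula (X theta - y)^T (X theta - y) / 2m."""
    m = len(y)
    residual = X @ theta - y
    return (residual.T @ residual).item() / (2 * m)


# Toy design matrix (first column is x_0 = 1) where y = x_1 exactly
X_toy = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_toy = np.array([[1.0], [2.0], [3.0]])

theta_zero = np.zeros((2, 1))            # predicts 0 everywhere
theta_exact = np.array([[0.0], [1.0]])   # reproduces y exactly

print(cost_loop(X_toy, y_toy, theta_zero))        # (1 + 4 + 9) / 6
print(cost_vectorized(X_toy, y_toy, theta_zero))  # same value
print(cost_vectorized(X_toy, y_toy, theta_exact)) # 0.0
```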

In [103]:
def calc_cost(X, y, theta):
    '''
    Calculates the cost of using theta as the parameters for linear
        regression to fit the data points in X and y
    X: mx(n+1) design matrix of scaled features
    y: mx1 vector of labels
    theta: (n+1)x1 vector of parameters
    Output: returns J(theta), float
    '''
    m = len(y)
    residual = np.dot(X, theta) - y  # mx1 vector of prediction errors
    J = (1 / (2 * m)) * np.dot(residual.T, residual)
    return J[0, 0]

## Gradient Descent Method

Gradient descent formula:

Repeat until convergence {  
$\quad \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta) $  
}

Substituting the partial derivative of the cost function:

Repeat {  
$ \quad \theta_j := \theta_j - \frac {\alpha}{m} \displaystyle \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)} $  
}
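
The partial derivative used in this update follows from the chain rule applied to the cost function, together with $\frac{\partial}{\partial \theta_j} h_{\theta}(x^{(i)}) = x_j^{(i)}$:

$$
\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{2m} \displaystyle \sum_{i=1}^{m} 2 \, (h_{\theta}(x^{(i)}) - y^{(i)}) \, \frac{\partial}{\partial \theta_j} h_{\theta}(x^{(i)}) = \frac{1}{m} \displaystyle \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}
$$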

Vectorized formula:

$\theta := \theta - \frac {\alpha}{m} (X^T (X\theta - y))$

Note that $X$ is the design matrix of features, and should be scaled appropriately to help the 
gradient descent algorithm converge more quickly.
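
One way to sanity-check the vectorized gradient $\frac{1}{m} X^T (X\theta - y)$ is to compare it with a central finite-difference approximation of $J(\theta)$. The sketch below does this on randomly generated toy data; the helper names (`cost`, `analytic_gradient`, `numeric_gradient`) and the step size `eps` are assumptions for this example, not part of the notebook's code.

```python
import numpy as np


def cost(X, y, theta):
    """J(theta) = (X theta - y)^T (X theta - y) / 2m."""
    residual = X @ theta - y
    return (residual.T @ residual).item() / (2 * len(y))


def analytic_gradient(X, y, theta):
    """Vectorized gradient of J: X^T (X theta - y) / m."""
    return X.T @ (X @ theta - y) / len(y)


def numeric_gradient(X, y, theta, eps=1e-6):
    """Central finite-difference approximation of the gradient of J."""
    grad = np.zeros_like(theta)
    for j in range(theta.shape[0]):
        step = np.zeros_like(theta)
        step[j, 0] = eps
        grad[j, 0] = (cost(X, y, theta + step) - cost(X, y, theta - step)) / (2 * eps)
    return grad


# Random toy problem: 5 observations, 2 features plus the x_0 = 1 column
rng = np.random.default_rng(0)
X_toy = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])
y_toy = rng.normal(size=(5, 1))
theta_toy = rng.normal(size=(3, 1))

max_diff = np.max(np.abs(analytic_gradient(X_toy, y_toy, theta_toy)
                         - numeric_gradient(X_toy, y_toy, theta_toy)))
print(max_diff)  # should be very close to zero
```

A large difference between the two gradients would usually point to a shape or sign error in the update step.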

In [115]:
# Gradient Descent Variables
num_iters = 700  # Number of iterations for gradient descent
alpha = 0.01  # Gradient descent learning rate


def gradient_descent(X, y, theta, alpha, num_iters):
    '''
    Gradient descent algorithm to iteratively find the minimum value
        of the cost function J(theta)
    X: mx(n+1) design matrix of scaled features
    y: mx1 vector of labels
    theta: (n+1)x1 vector of parameters
    alpha: the learning rate
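    num_iters: number of iterations the algorithm should perform
    Output: returns (theta, J_history), the final (n+1)x1 parameter vector
        and the num_iters x 1 vector of costs, one per iteration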
    '''
    m = len(y)
    J_history = np.zeros((num_iters, 1))
    
    for i in range(num_iters):
        # X' (n+1)xm matrix times h(x)-y mx1 vector -> (n+1)x1 vector
        # Scale vector by alpha/m
        adj = (alpha / m) * np.dot(X.T, (np.dot(X, theta) - y))

        # Simultaneously update thetas
        theta = theta - adj

        # Save the cost J in every iteration
        J_history[i] = calc_cost(X, y, theta)
    
    return (theta, J_history)


theta = np.zeros((n + 1, 1))
theta, J_history = gradient_descent(X_design, y, theta, alpha, num_iters)
# theta
# J_history[-10:]
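
Because the parameters are fitted on scaled features, a new observation has to be scaled with the same `mean` and `s` before the hypothesis is applied. The sketch below shows one way this might look. `predict` is a hypothetical helper, and `mean_demo`, `s_demo`, and `theta_demo` are made-up placeholder values rather than results from the housing data.

```python
import numpy as np


def predict(x_raw, mean, s, theta):
    """Hypothetical helper: predict the label for unscaled features x_raw.

    x_raw: length-n sequence of raw feature values
    mean, s: training mean and standard deviation used for scaling
    theta: (n+1)x1 parameter vector fitted on the scaled features
    """
    # Scale the new observation exactly as the training data was scaled
    x_scaled = (np.asarray(x_raw, dtype=float).reshape(1, -1) - mean) / s
    # Prepend x_0 = 1 to match the design matrix layout
    x_design = np.hstack([np.ones((1, 1)), x_scaled])
    return (x_design @ theta).item()


# Placeholder values for illustration only (not fitted results)
mean_demo = np.array([[2000.0, 3.0]])
s_demo = np.array([[800.0, 0.75]])
theta_demo = np.array([[340000.0], [110000.0], [-6500.0]])

# An observation equal to the mean scales to zeros, so the prediction is theta_0
print(predict([2000.0, 3.0], mean_demo, s_demo, theta_demo))  # 340000.0
```

In the notebook, the same call would use the `mean` and `s` returned by `feature_normalize` and the `theta` returned by `gradient_descent`.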

## Normal Equations Method

Linear regression has an analytical solution to find the optimal values for $\theta$.
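
The closed-form solution is $\theta = (X^T X)^{-1} X^T y$, where $X$ is the design matrix. The following is a minimal sketch of how it might be computed, not a completion of the notebook's own function: `normal_equation_sketch` is a hypothetical name, and it assumes $X^T X$ is invertible. It solves the linear system $(X^T X)\theta = X^T y$ with `np.linalg.solve` instead of forming the inverse explicitly.

```python
import numpy as np


def normal_equation_sketch(X, y):
    """Solve (X^T X) theta = X^T y for theta.

    X: mx(n+1) design matrix (first column of ones)
    y: mx1 vector of labels
    Assumes X^T X is invertible.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)


# Toy data generated from y = 1 + 2 * x_1, so theta should be [1, 2]
X_toy = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_toy = np.array([[3.0], [5.0], [7.0]])

theta_toy = normal_equation_sketch(X_toy, y_toy)
print(theta_toy.ravel())  # approximately [1. 2.]
```

No feature scaling or learning rate is needed with this method. If $X^T X$ is singular (for example, with redundant features), `np.linalg.pinv` or `np.linalg.lstsq` are common alternatives.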

In [116]:
def normal_equation():
    '''
    TO DO
    '''
    pass


# Create design matrix with non-normalized features
X_not_norm_design = np.concatenate((x_0, X), axis=1)
X_not_norm_design[:10]

array([[  1.00000000e+00,   2.10400000e+03,   3.00000000e+00],
       [  1.00000000e+00,   1.60000000e+03,   3.00000000e+00],
       [  1.00000000e+00,   2.40000000e+03,   3.00000000e+00],
       [  1.00000000e+00,   1.41600000e+03,   2.00000000e+00],
       [  1.00000000e+00,   3.00000000e+03,   4.00000000e+00],
       [  1.00000000e+00,   1.98500000e+03,   4.00000000e+00],
       [  1.00000000e+00,   1.53400000e+03,   3.00000000e+00],
       [  1.00000000e+00,   1.42700000e+03,   3.00000000e+00],
       [  1.00000000e+00,   1.38000000e+03,   3.00000000e+00],
       [  1.00000000e+00,   1.49400000e+03,   3.00000000e+00]])