# Linear Regression with `scikit-learn`

In this notebook, we use the Python package `scikit-learn` to perform a linear regression task.

In [1]:
import warnings
import numpy as np
import pandas as pd
from helper import display_data  # This function helps us display the first 5 data points
from sklearn import linear_model

# This is used for ignoring a harmless warning from the *scipy* package
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

## Load Data

In [2]:
data = np.loadtxt('data/ex1data2.txt', delimiter=',')
display_data(data)

array([[  2.10400000e+03,   3.00000000e+00,   3.99900000e+05],
       [  1.60000000e+03,   3.00000000e+00,   3.29900000e+05],
       [  2.40000000e+03,   3.00000000e+00,   3.69000000e+05],
       [  1.41600000e+03,   2.00000000e+00,   2.32000000e+05],
       [  3.00000000e+03,   4.00000000e+00,   5.39900000e+05]])

In [3]:
# pandas gives us a clearer view of tabular data.
df = pd.DataFrame(data, columns=['Area', '# of Bedroom', 'Price'])
df.head() # will return the first 5 data points

| | Area | # of Bedroom | Price |
|---|---|---|---|
| 0 | 2104.0 | 3.0 | 399900.0 |
| 1 | 1600.0 | 3.0 | 329900.0 |
| 2 | 2400.0 | 3.0 | 369000.0 |
| 3 | 1416.0 | 2.0 | 232000.0 |
| 4 | 3000.0 | 4.0 | 539900.0 |


In [4]:
X = data[:,0:2]
display_data(X)

array([[  2.10400000e+03,   3.00000000e+00],
       [  1.60000000e+03,   3.00000000e+00],
       [  2.40000000e+03,   3.00000000e+00],
       [  1.41600000e+03,   2.00000000e+00],
       [  3.00000000e+03,   4.00000000e+00]])

## Normalize Data

Real-world problems usually require a lot of feature engineering. Since the dataset used here is very simple, we apply only a simple feature normalization technique (sometimes called **standardization**).

This is often a good preprocessing step to do when working with learning algorithms.

The formula used here (for each feature) is:

$$x' = \frac{x-\bar{x}}{\sigma}$$

- **x' --> normalized x**
- **x_bar --> mean value of x**
- **sigma --> standard deviation**

In [5]:
def normalize(X):
    """Returns a normalized version of X where the mean value of each feature is 0 and the standard deviation is 1."""
    num_feature = X.shape[1]
    
    mu = np.zeros(num_feature)
    sigma = np.zeros(num_feature)
    X_norm = np.zeros(X.shape)

    for i in range(0, num_feature):
        mu[i] = np.mean(X[:,i])
        sigma[i] = np.std(X[:,i])
        X_norm[:,i] = (X[:,i] - mu[i]) / sigma[i]

    return X_norm, mu, sigma

In [6]:
X_norm, mu, sigma = normalize(X)
display_data(X_norm)

array([[ 0.13141542, -0.22609337],
       [-0.5096407 , -0.22609337],
       [ 0.5079087 , -0.22609337],
       [-0.74367706, -1.5543919 ],
       [ 1.27107075,  1.10220517]])
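As a cross-check, the same standardization can be written without an explicit loop, and `scikit-learn` provides `StandardScaler` for it. The sketch below is only illustrative: it uses just the five rows displayed earlier (the name `X_sample` is an assumption, not part of the notebook), so its means and standard deviations differ from those of the full dataset, and the printed values will not match the output above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Assumption: only the five rows displayed earlier, not the full dataset.
X_sample = np.array([
    [2104.0, 3.0],
    [1600.0, 3.0],
    [2400.0, 3.0],
    [1416.0, 2.0],
    [3000.0, 4.0],
])

# Manual standardization, vectorized over columns.
# np.std uses the population standard deviation (ddof=0) by default.
mu_sample = X_sample.mean(axis=0)
sigma_sample = X_sample.std(axis=0)
X_manual = (X_sample - mu_sample) / sigma_sample

# StandardScaler learns the same per-feature mean and standard deviation.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_sample)

print(np.allclose(X_manual, X_scaled))
```

The fitted scaler stores the learned statistics in `scaler.mean_` and `scaler.scale_`, which play the role of `mu` and `sigma` returned by `normalize`.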

In [7]:
y = data[:,2]
display_data(y)

array([ 399900.,  329900.,  369000.,  232000.,  539900.])

## Linear Regression Models

Here, we compare two linear regression models fit on different inputs.
The first is fit on the original features **X**; the second is fit on **X_norm**, the features we normalized above. Note that `LinearRegression` does not normalize its input by default.

### Linear Regression Model with Original Data

We pass our original dataset **X** to the model directly. Ordinary least squares does not require scaled features, so no normalization is applied here and the coefficients are expressed in the original units.

In [8]:
lr1 = linear_model.LinearRegression()
lr1.fit(X, y)
lr1.coef_, lr1.intercept_

(array([  139.21067402, -8738.01911233]), 89597.909542797483)

The above result corresponds to

$$\hat{y} = 89597.909542797483 + 139.21067402\cdot{x_{1}} - 8738.01911233\cdot{x_{2}}$$

- **y_hat --> hypothesis, prediction value**
- **x1 --> area (original scale)**
- **x2 --> # of bedroom (original scale)**

In [9]:
to_pred = np.array([1650, 3]).reshape((1, -1))

In [10]:
lr1.predict(to_pred)

array([ 293081.4643349])
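To make the link between the fitted parameters and `predict` explicit, the prediction can be reproduced by hand from the intercept and coefficients printed above. This is a small sketch; the variable names are illustrative, and because the printed coefficients are rounded, the result agrees with the model's prediction only up to that rounding.

```python
import numpy as np

# Values copied from the printed output of lr1.coef_ and lr1.intercept_.
intercept = 89597.909542797483
coef = np.array([139.21067402, -8738.01911233])

# New house: 1650 square feet of area, 3 bedrooms (original units).
x_new = np.array([1650.0, 3.0])

# y_hat = intercept + coef[0] * x1 + coef[1] * x2
manual_prediction = intercept + coef @ x_new
print(manual_prediction)
```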

### Linear Regression Model with Normalized Data

We pass our normalized dataset **X_norm** to the model.

In [11]:
# The `normalize` parameter was removed in scikit-learn 1.2; the default applies no normalization.
lr2 = linear_model.LinearRegression()
lr2.fit(X_norm, y)
lr2.coef_, lr2.intercept_

(array([ 109447.79646964,   -6578.35485416]), 340412.6595744681)

The above result corresponds to

$$\hat{y} = 340412.6595744681 + 109447.79646964\cdot{x_{1}} - 6578.35485416\cdot{x_{2}}$$

- **y_hat --> hypothesis, prediction value**
- **x1 --> normalized area**
- **x2 --> normalized # of bedroom**

Therefore, when we use the model to make a new prediction, we need to normalize the input data first (as below).

In [12]:
to_pred = np.array([(1650 - mu[0]) / sigma[0], (3 - mu[1]) / sigma[1]]).reshape((1, -1))

In [13]:
lr2.predict(to_pred)

array([ 293081.4643349])
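Normalizing new inputs by hand is easy to forget. One way to avoid it, sketched here as an alternative rather than as the notebook's method, is to chain `StandardScaler` and `LinearRegression` with `make_pipeline`, so the scaling learned during `fit` is reapplied automatically in `predict`. The example assumes only the five rows displayed earlier (`X_sample` and `y_sample` are illustrative names), so its prediction differs from the full-dataset value above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: only the five rows displayed earlier, not the full dataset.
X_sample = np.array([
    [2104.0, 3.0],
    [1600.0, 3.0],
    [2400.0, 3.0],
    [1416.0, 2.0],
    [3000.0, 4.0],
])
y_sample = np.array([399900.0, 329900.0, 369000.0, 232000.0, 539900.0])

# The pipeline standardizes the features, then fits the regression.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_sample, y_sample)

# New inputs are given in original units; the pipeline scales them itself.
x_new = np.array([[1650.0, 3.0]])
pipeline_prediction = pipeline.predict(x_new)

# Reference model fit directly on the raw features.
raw_model = LinearRegression().fit(X_sample, y_sample)
raw_prediction = raw_model.predict(x_new)

print(np.allclose(pipeline_prediction, raw_prediction))
```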

### Comparison

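Although the two models have different coefficients and expect different inputs, they produce the same prediction. This is expected: standardization is an invertible affine transformation of each feature, and ordinary least squares predictions are unchanged by such a rescaling. The matching results also confirm that the normalization was applied consistently at training and prediction time.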
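The two sets of parameters are related in closed form. Substituting $x_j' = (x_j - \mu_j)/\sigma_j$ into the normalized model gives raw-scale coefficients $w_j / \sigma_j$ and a raw-scale intercept $b - \sum_j w_j \mu_j / \sigma_j$, where $w$ and $b$ are the normalized model's coefficients and intercept. The sketch below checks this numerically; it assumes only the five rows displayed earlier (all variable names are illustrative), so the numbers differ from the full-dataset results above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumption: only the five rows displayed earlier, not the full dataset.
X_sample = np.array([
    [2104.0, 3.0],
    [1600.0, 3.0],
    [2400.0, 3.0],
    [1416.0, 2.0],
    [3000.0, 4.0],
])
y_sample = np.array([399900.0, 329900.0, 369000.0, 232000.0, 539900.0])

# Standardize each feature (population standard deviation, as in np.std).
mu_sample = X_sample.mean(axis=0)
sigma_sample = X_sample.std(axis=0)
X_norm_sample = (X_sample - mu_sample) / sigma_sample

raw_model = LinearRegression().fit(X_sample, y_sample)
norm_model = LinearRegression().fit(X_norm_sample, y_sample)

# Map the normalized model's parameters back to the original scale.
coef_back = norm_model.coef_ / sigma_sample
intercept_back = norm_model.intercept_ - np.sum(norm_model.coef_ * mu_sample / sigma_sample)

print(np.allclose(coef_back, raw_model.coef_))
print(np.isclose(intercept_back, raw_model.intercept_))
```

Because the standardized features have zero mean, the normalized model's intercept equals the mean of the targets, which is why `lr2.intercept_` is so much larger than `lr1.intercept_`.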

## Reference

[sklearn.linear_model.LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)