# Basic Linear Regression Analysis

Dataset: the Boston house prices dataset from the scikit-learn toy datasets.  
Train and compare different linear models with their default settings.

## Import libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 30)
sns.set(style="white", color_codes=True)


In [2]:
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error

## Load the Boston dataset
Use the Boston dataset to practice with the different models.

In [3]:
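# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version.
boston = datasets.load_boston()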

## View the data description

In [10]:
print(boston.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

## Data Features

In [11]:
col = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT']
df = pd.DataFrame(boston.data, columns=col)

In [12]:
df.head()

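|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |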


## Data Target

In [13]:
df_T = pd.Series(boston.target)
df_T.head()

0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
dtype: float64

## Split the data into a training set and a test set

In [15]:
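# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split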

In [16]:
x_train, x_test, y_train, y_test = train_test_split(df, df_T, random_state=42)
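
When no `test_size` is given, `train_test_split` holds out 25% of the rows by default. The sketch below shows the resulting split sizes. It uses a placeholder array of zeros with the same shape as the Boston feature matrix (506 rows, 13 columns) rather than the real data, and the names `X_demo` and `y_demo` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the Boston features (assumption:
# only the shape matters here, so the values are just zeros).
X_demo = np.zeros((506, 13))
y_demo = np.zeros(506)

# No test_size given, so the default 25% hold-out applies.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(X_tr.shape, X_te.shape)  # (379, 13) (127, 13)
```

So the models in the following cells are fitted on 379 rows and evaluated on the remaining 127.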

## Train the model and view the results

In [37]:
lr = linear_model.LinearRegression()
lr.fit(x_train, y_train)
predict1 = lr.predict(x_test)

print('Linear Regression Model:\n')
print('Coefficients:', lr.coef_)
print('MSE =', mean_squared_error(y_test, predict1))
print('R^2:',lr.score(x_test, y_test))

Linear Regression Model:

Coefficients: [ -1.27824912e-01   2.95208977e-02   4.92643105e-02   2.77594439e+00
  -1.62801962e+01   4.36089596e+00  -9.19111559e-03  -1.40172019e+00
   2.57458956e-01  -9.94705777e-03  -9.24266403e-01   1.33164215e-02
  -5.18565634e-01]
MSE = 22.1316778943
R^2: 0.683955724318
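
For reference, the two metrics printed above can be computed by hand. The sketch below is a plain NumPy version of the standard definitions (MSE as the mean of squared residuals, R^2 as one minus the ratio of residual to total sum of squares). The helper names `mse` and `r2` and the toy values are illustrative, not part of the notebook.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy values, chosen only for illustration.
y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]

print(mse(y_true_demo, y_pred_demo))           # 0.375
print(round(r2(y_true_demo, y_pred_demo), 4))  # 0.9486
```

A lower MSE and a higher R^2 both indicate a better fit on the test set, which is how the models are compared in the rest of this notebook.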


## Try different regression models

In [38]:
rg = linear_model.Ridge()
rg.fit(x_train, y_train)
predict2 = rg.predict(x_test)

print('Ridge Regression Model:\n')
print('Coefficients:', rg.coef_)
print('MSE =', mean_squared_error(y_test, predict2))
print('R^2:',rg.score(x_test, y_test))

Ridge Regression Model:

Coefficients: [-0.12317515  0.03135993  0.01800043  2.54498006 -8.79329845  4.37248993
 -0.01533593 -1.2913009   0.24364675 -0.01081661 -0.83432505  0.01361388
 -0.53530193]
MSE = 22.5123297109
R^2: 0.678519949035
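
Ridge adds an L2 penalty whose strength is set by `alpha` (1.0 by default in scikit-learn). That is why the coefficient vector above is shrunk compared with plain least squares, most visibly for NOX, which moves from about -16.3 to -8.8. The sketch below illustrates this on synthetic data (an assumption, not the Boston data): as `alpha` grows, the norm of the fitted coefficients decreases.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem (illustrative values, not the Boston data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
true_coef = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=100)

# Fit Ridge with increasing penalty strength and record the coefficient norm.
coef_norms = []
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    coef_norms.append(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:>8}: ||coef|| = {coef_norms[-1]:.3f}")
```

A very small `alpha` behaves almost like ordinary least squares, while a very large one pushes all coefficients toward zero.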


In [39]:
ls = linear_model.Lasso()
ls.fit(x_train, y_train)
predict3 = ls.predict(x_test)

print('Lasso Regression Model:\n')
print('Coefficients:', ls.coef_)
print('MSE =', mean_squared_error(y_test, predict3))
print('R^2:',ls.score(x_test, y_test))

Lasso Regression Model:

Coefficients: [-0.08446735  0.02647731 -0.          0.         -0.          1.5400203
  0.01347528 -0.58361292  0.20760129 -0.0112081  -0.7054379   0.01207364
 -0.75834494]
MSE = 24.4258144891
R^2: 0.651195047884
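
Unlike Ridge, the L1 penalty in Lasso can set coefficients exactly to zero, as happens for three of the features above (INDUS, CHAS and NOX). The sketch below shows the same effect on synthetic data (an assumption, not the Boston data) in which only the first two of ten features influence the target.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: only features 0 and 1 carry signal (illustrative setup).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 5.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)  # alpha=1.0 is the default

# Count coefficients that are exactly zero in each model.
n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("Zero coefficients (least squares):", n_zero_ols)
print("Zero coefficients (Lasso):", n_zero_lasso)
print("Lasso coefficients:", np.round(lasso.coef_, 3))
```

The same penalty also shrinks the coefficients it keeps, which is one possible reason the default Lasso scores lower than plain linear regression here.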


In [40]:
en = linear_model.ElasticNet()
en.fit(x_train, y_train)
predict4 = en.predict(x_test)

print('ElasticNet Regression Model:\n')
print('Coefficients:', en.coef_)
print('MSE =', mean_squared_error(y_test, predict4))
print('R^2:',en.score(x_test, y_test))

ElasticNet Regression Model:

Coefficients: [-0.10322494  0.03413852 -0.00667195  0.         -0.          1.142079
  0.01442598 -0.70972042  0.26480823 -0.0134646  -0.74460471  0.0121163
 -0.78320085]
MSE = 24.1368243834
R^2: 0.65532187772


## Simple conclusion
With the default settings for every model, the test-set MSE ranks as follows (lower is better):  
Linear < Ridge < ElasticNet < Lasso

The test-set R^2 ranks in the reverse order (higher is better):  
Linear > Ridge > ElasticNet > Lasso

With default settings, plain linear regression performs best on this train/test split, although its lead over Ridge is small (MSE 22.13 vs. 22.51).  
Next, I will try further techniques to find a better model.
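
The four cells above repeat the same fit, predict and score steps, so the comparison can also be written as one loop. The sketch below does this on synthetic data (an assumption, since `load_boston` is not available in recent scikit-learn versions), so the ranking it prints need not match the Boston results. The names `models` and `results` are illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset (illustrative, not the Boston data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 8))
y_demo = X_demo @ rng.normal(size=8) + rng.normal(scale=1.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# All models use their default settings, as in the notebook.
models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "ElasticNet": ElasticNet(),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = {
        "mse": mean_squared_error(y_te, model.predict(X_te)),
        "r2": model.score(X_te, y_te),
    }

# Print the models from lowest to highest test-set MSE.
for name, m in sorted(results.items(), key=lambda kv: kv[1]["mse"]):
    print(f"{name:<10} MSE = {m['mse']:.3f}  R^2 = {m['r2']:.3f}")
```

Keeping the models in a dictionary makes it easy to add further candidates or non-default `alpha` values later without copying the evaluation code.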