### Linear Regression

#### Predict House Price Based on Boston Housing Dataset

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston) and has been used extensively in the literature to benchmark algorithms.

There are 14 attributes in each case of the dataset. They are:

- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town.
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per 10,000 dollars.
- PTRATIO - pupil-teacher ratio by town
- BLACK - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT - % lower status of the population
- MEDV - median value of owner-occupied homes, in thousands of dollars

### Importing Libraries

In [1]:
# Import the libraries used for data management

import numpy as np
import pandas as pd

In [2]:
# load Boston Dataset
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
data = pd.read_csv('Boston.csv', index_col=0)

# display the first 10 records
data.head(10)

In [None]:
data.info()
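
The `read_csv` call above passes `index_col=0`, which suggests that the first column of `Boston.csv` is a row counter rather than an attribute. This matters later, because the features are selected by position with `data.columns[0:13]`. The sketch below illustrates the effect on a small in-memory CSV; the three rows and their values are made up for illustration and are not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical three-row extract; the leading unnamed column is a row counter.
csv_text = """,crim,rm,medv
1,0.10,6.0,20.0
2,0.20,6.5,25.0
3,0.30,7.0,30.0
"""

# With index_col=0 the counter becomes the row index and is not a column.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without it, pandas keeps the counter as an extra column.
without_index = pd.read_csv(io.StringIO(csv_text))

print(list(with_index.columns))
print(list(without_index.columns))
```

Without `index_col=0`, the counter would occupy position 0 and shift every positional slice of the columns by one.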

**Now let's fit a simple linear model (OLS, for "ordinary least squares") with MEDV (column `medv`) as the target variable and the other 13 attributes as the predictors:**

In [None]:
# use the first 13 attributes as independent variables
features = list(data.columns[0:13])

features

In [None]:
# use the names of attributes to split them into independent variables X and target variable y

X = data[features]
y = data['medv']

In [None]:
X

In [None]:
# Show the descriptive statistics of the training dataset (before normalization)
X.describe()

In [None]:
from sklearn import preprocessing
# Apply z-score normalization on all explanatory attributes

zscore_scaler = preprocessing.StandardScaler().fit(X)
X = pd.DataFrame(zscore_scaler.transform(X), columns = X.columns)


In [None]:
# Show the descriptive statistics of the normalized training dataset
X.describe()
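
As a check on what the z-score step does, the following sketch compares `StandardScaler` with the formula written out by hand: subtract each column's mean and divide by its standard deviation. The toy matrix `X_toy` is a made-up stand-in for the housing features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix standing in for the housing features.
X_toy = np.array([[1.0, 200.0],
                  [2.0, 300.0],
                  [3.0, 400.0],
                  [4.0, 500.0]])

# Manual z-score: subtract the column mean, divide by the population
# standard deviation (ddof=0, which is NumPy's default).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# StandardScaler applies the same transformation.
scaled = StandardScaler().fit_transform(X_toy)

same = np.allclose(manual, scaled)
print(same)
```

`StandardScaler` divides by the population standard deviation, whereas `DataFrame.describe()` reports the sample standard deviation (`ddof=1`), so the `std` row in the summary of the normalized data is expected to be slightly above 1 rather than exactly 1.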

#### Use Cross validation to evaluate the model

In [None]:
# import cross validation 
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

In [None]:
# Import Linear Regression Model from sklearn
from sklearn.linear_model import LinearRegression

# Define model to be linear regression
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
lm = LinearRegression()


In [None]:
score_cv = cross_val_score(lm, X, y, scoring = 'neg_mean_squared_error', cv=10)

In [None]:
score_cv

In [None]:
-score_cv.mean()
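
The scaler above was fit on all rows before cross-validation, so the held-out rows of every fold contributed to the means and standard deviations. One common alternative, sketched here under the assumption of synthetic data from `make_regression` (not the housing data), is to wrap the scaler and the model in a pipeline so that the scaler is refit on the training part of each fold. The sketch also converts the mean squared error into a root mean squared error, which is in the same units as the target.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (an assumption for this sketch).
X_demo, y_demo = make_regression(n_samples=200, n_features=13,
                                 noise=10.0, random_state=0)

# The scaler is fit inside each training fold, never on the held-out rows.
pipe = make_pipeline(StandardScaler(), LinearRegression())

# scikit-learn reports the negated MSE so that larger scores are better.
neg_mse = cross_val_score(pipe, X_demo, y_demo,
                          scoring='neg_mean_squared_error', cv=10)
mse_per_fold = -neg_mse

# Square root of the average fold MSE, in the units of the target.
rmse = np.sqrt(mse_per_fold.mean())
print(rmse)
```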

In [None]:
pred_y = cross_val_predict(lm, X, y, cv=10)

In [None]:
pred_y
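
The out-of-fold predictions returned by `cross_val_predict` can be scored directly against the true targets. The sketch below does this on synthetic data (an assumption; the names `X_demo`, `y_demo` and `pred_demo` are illustrative) and compares the pooled error with the average of the per-fold errors from `cross_val_score`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_predict, cross_val_score

# Synthetic stand-in for the housing data (an assumption for this sketch).
X_demo, y_demo = make_regression(n_samples=200, n_features=13,
                                 noise=10.0, random_state=0)

# Each prediction comes from a model that did not see that row in training.
pred_demo = cross_val_predict(LinearRegression(), X_demo, y_demo, cv=10)

# Error computed once over all pooled out-of-fold predictions.
pooled_mse = mean_squared_error(y_demo, pred_demo)
pooled_r2 = r2_score(y_demo, pred_demo)

# Average of the ten per-fold errors, computed on the same splits.
fold_mean_mse = -cross_val_score(LinearRegression(), X_demo, y_demo,
                                 scoring='neg_mean_squared_error', cv=10).mean()

print(pooled_mse, fold_mean_mse, pooled_r2)
```

In this sketch the ten folds have equal size, so the pooled MSE and the fold average agree. When the number of rows is not divisible by the number of folds, the two can differ slightly.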

#### Fit the model and get the coefficients

In [None]:
# train the model using all the training data
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit
lm.fit(X, y)

In [None]:
# show the intercept of the trained model (Theta_0)
lm.intercept_

In [None]:
lm.coef_

In [None]:
# show the coefficients of independent attributes
coeff_df = pd.DataFrame(lm.coef_, X.columns, columns=['Coefficient'])  
coeff_df 
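
To make the meaning of `intercept_` and `coef_` concrete, the sketch below solves the same least-squares problem directly with NumPy and compares the result with `LinearRegression`. The data are synthetic and the "true" coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known, made-up coefficients (an assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = 4.0 + X_demo @ true_coef + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so that the first weight plays the role of Theta_0.
A = np.column_stack([np.ones(len(X_demo)), X_demo])

# Minimize ||A @ theta - y||^2 directly.
theta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# Same problem solved by scikit-learn.
model = LinearRegression().fit(X_demo, y_demo)

print(theta[0], model.intercept_)
print(theta[1:], model.coef_)
```

One consequence for the notebook: when every feature has been centered to mean zero, as the z-score step does, the fitted OLS intercept equals the mean of the target.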

### LASSO Regression

- **Next we fit a Lasso regression to see how it controls model complexity and eliminates uninformative features.**

In [None]:
# Import Lasso Model from sklearn
from sklearn.linear_model import Lasso

In [None]:
# Define model to be Lasso, set alpha=0.1 (alpha is the regularization parameter)
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
lasso = Lasso(alpha = 0.1)

In [None]:
score_lasso = cross_val_score(lasso, X, y, scoring = 'neg_mean_squared_error', cv=10)

In [None]:
score_lasso

In [None]:
-score_lasso.mean()

In [None]:
# train model using whole dataset
lasso.fit(X, y)

In [None]:
# show the intercept of the trained model (Theta_0)
lasso.intercept_

In [None]:
# show the coefficients of independent attributes
coeff_df = pd.DataFrame(lasso.coef_, X.columns, columns=['Coefficient'])  
coeff_df 

**Note that the coefficients of both 'indus' and 'age' become zero.**
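
How many coefficients Lasso sets to zero depends on `alpha`. The sketch below counts the zeroed coefficients for a few values of `alpha` on synthetic data (an assumption; only 5 of the 13 synthetic features are informative). It also computes `alpha_max`, the smallest penalty at which every coefficient is zero for scikit-learn's Lasso objective with centered features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 13 features, only 5 of which drive the target.
X_demo, y_demo = make_regression(n_samples=200, n_features=13,
                                 n_informative=5, noise=5.0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

# Smallest alpha at which all coefficients are zero (features are centered).
alpha_max = np.max(np.abs(X_demo.T @ (y_demo - y_demo.mean()))) / len(y_demo)

# A few small penalties, then one just above alpha_max.
alphas = [0.01, 0.1, 1.0, alpha_max * 1.01]
zero_counts = []
for alpha in alphas:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    # Count coefficients that the penalty has driven exactly to zero.
    zero_counts.append(int(np.sum(model.coef_ == 0)))

for alpha, n_zero in zip(alphas, zero_counts):
    print(f"alpha={alpha:.3f}: {n_zero} of 13 coefficients are zero")
```

On the housing data, the same loop could be run with `X` and `y` from the cells above to see which attributes drop out as `alpha` grows.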

**Let's compare the magnitudes of coefficients under linear regression and Lasso regression.**

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(lm.coef_,linestyle='none',marker='*',markersize=6,color='red',label='Linear') 

plt.plot(lasso.coef_,linestyle='none',marker='d',markersize=6,color='blue',label='Lasso; alpha = 0.1') 

# draw a horizontal line at 0.
plt.axhline(y=0, color='grey', linestyle='-')

plt.xlabel('Coefficient Index',fontsize=14)
plt.ylabel('Coefficient Magnitude',fontsize=14)
#plt.legend(fontsize=13,loc=10)
plt.legend(fontsize=14, loc='center left', bbox_to_anchor=(1, 0.5))
plt.xticks(np.arange(len(features)), features, fontsize=10)
plt.show()