# Regression predictive modelling on Boston House Prices (Linear/Lasso/Ridge Regression)
[Boston House Prices Dataset on Kaggle](https://www.kaggle.com/vikrishnan/boston-house-prices)

## 1. Import data for analysis

In [None]:
import os
import pandas as pd
import numpy as np

os.chdir('/kaggle/input')
os.getcwd()

In [None]:
df=pd.read_csv('boston-house-prices/housing.csv')
df.head() #all values are in the first column and header is missing

**X: Predictors** 
* CRIM: per capita crime rate by town
* ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
* INDUS: proportion of non-retail business acres per town
* CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
* NOX: nitric oxides concentration (parts per 10 million)
* RM: average number of rooms per dwelling
* AGE: proportion of owner-occupied units built prior to 1940
* DIS: weighted distances to five Boston employment centers
* RAD: index of accessibility to radial highways
* TAX: full-value property-tax rate per $10,000
* PTRATIO: pupil-teacher ratio by town
* B: 1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town
* LSTAT: % lower status of the population

**Y: Outcome** 
* MEDV: Median value of owner-occupied homes in $1000s

In [None]:
names=['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV'] 
df=pd.read_csv('boston-house-prices/housing.csv',sep=r'\s+',names=names) #whitespace-delimited file without a header row (delim_whitespace is deprecated in recent pandas)

#df.head()
#df.columns
#df.shape #506*14
df.info()  #no missing value 

## 2. Data Cleaning & Wrangling 

#### 2.1 Descriptive Analysis
**View descriptive statistics for all variables**

In [None]:
df.describe()

#### 2.2 Check relationship between predictors and outcome variable
* **1. Scatter plot**
* According to the plots in the last row (MEDV against each predictor), some predictors, such as RM and LSTAT, show a clear relationship with median house price, while others look weaker. Together, the predictors could explain house prices to some extent.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
scatterplots=sns.PairGrid(df)
scatterplots.map_offdiag(plt.scatter) 
plt.show() 

* **2. Correlation matrix / heatmap**

In [None]:
#correlation matrix
#df.corr()

plt.figure(figsize=(25, 12))
sns.heatmap(df.corr(), vmin = -1, vmax = 1, center = 0, cmap = 'coolwarm', annot = True)
plt.show()

## 3. Build Regression Model from Scikit-learn
* Train the model: .fit()
* Predict on new data: .predict()

In [None]:
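#split the data into predictors X and outcome y
#df.info() #columns 0-12 are predictors, column 13 (MEDV) is the outcome
#note: iloc[:,:12] keeps columns 0-11 (CRIM to B), so LSTAT (column 12) is not in X; iloc[:,:13] would include it
X=df.iloc[:,:12]
y=df.iloc[:,13]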

In [None]:
#Splitting to training and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=1)

### 3.1 Linear regression
* Y=aX+b
* Y=target, X=features
* a,b=parameters of the model (a: coefficients, b: intercept)
* line of best fit: choose a,b to minimize the error function, the sum of squared errors (SSE)
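
To make the idea of "minimize SSE to get the best a,b" concrete, here is a minimal sketch on synthetic data (the data, the seed, and the helper `sse` are assumptions made up for illustration; they are not part of the housing analysis). It fits a line with NumPy's least-squares solver and checks that nudging the slope away from the fitted value makes the SSE larger.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3x + 2 plus small noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, size=50)

# Design matrix with a column of ones so that the intercept b is estimated too
A = np.column_stack([x, np.ones_like(x)])

# Least squares: the (a, b) pair that minimizes the sum of squared errors
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

def sse(a_, b_):
    """Sum of squared errors of the line y = a_*x + b_ on the synthetic data."""
    return float(np.sum((y - (a_ * x + b_)) ** 2))

print('a =', a, 'b =', b)
print('SSE at the fitted line:', sse(a, b))
print('SSE with a slightly different slope:', sse(a + 0.1, b))
```

`LinearRegression().fit()` solves the same least-squares problem, with one coefficient per feature instead of a single slope.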

In [None]:
from sklearn.linear_model import LinearRegression

lr_all=LinearRegression()  
lr_all.fit(X_train, y_train) 

y_pred1=lr_all.predict(X_test)

In [None]:
# intercept term (b) of the fitted model
lr_all.intercept_

In [None]:
#Converting the coefficient values to a dataframe
lr_all_coeffcients = pd.DataFrame([X_train.columns,lr_all.coef_]).T
lr_all_coeffcients = lr_all_coeffcients.rename(columns={0: 'Attribute', 1: 'Coefficients'}) #put into dataframe
lr_all_coeffcients #print out

**Model Evaluation**

In [None]:
#R^2 on the test set (for regressors, .score() returns R^2, not accuracy)
lr_all.score(X_test, y_test)

* 𝑅^2 : The proportion of the variance in the dependent variable that is predictable from the independent variables. A value of 1 means perfect prediction; on test data it can be negative if the model does worse than always predicting the mean.

* Adjusted 𝑅^2 : Adjusts 𝑅^2 for the number of predictors, so it can be used to compare the explanatory power of regression models that contain different numbers of predictors.

* MAE : The mean of the absolute values of the errors, i.e. the average absolute difference between actual and predicted values of y.

* MSE : The mean squared error is like the MAE, but squares each difference before averaging instead of taking the absolute value, so large errors are penalized more heavily.

* RMSE : The root mean squared error is the square root of the MSE, which puts the error back in the same units as y (here, $1000s).
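
As a small worked example of these definitions, the sketch below computes each metric by hand with NumPy on four made-up values (the arrays and the choice `p = 1` are assumptions for illustration only, not results from the housing data).

```python
import numpy as np

# Toy values, made up for illustration (not taken from the housing data)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])
n = len(y_true)
p = 1  # number of predictors, assumed to be 1 for this toy example

errors = y_true - y_hat                 # [1, 0, -1, -2]
mae = np.mean(np.abs(errors))           # (1 + 0 + 1 + 2) / 4
mse = np.mean(errors ** 2)              # (1 + 0 + 1 + 4) / 4
rmse = np.sqrt(mse)                     # same units as y

ss_res = np.sum(errors ** 2)                      # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print('MAE:', mae, 'MSE:', mse, 'RMSE:', rmse)
print('R^2:', r2, 'Adjusted R^2:', adj_r2)
```

The adjusted 𝑅^2 line uses the same formula as the notebook's evaluation cell, with `p` in place of `X_train.shape[1]`.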

In [None]:
# other evaluation metrics
from sklearn import metrics
print('R^2:',metrics.r2_score(y_test, y_pred1))
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_test, y_pred1))*(len(y_test)-1)/(len(y_test)-X_train.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_test, y_pred1))
print('MSE:',metrics.mean_squared_error(y_test, y_pred1))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test, y_pred1)))

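## 4. Overfitting, Regularization
* Default performance metric: for scikit-learn regressors, .score() returns 𝑅^2 (accuracy = correct predictions / total # of predictions applies to classifiers)
* The loss function: OLS minimizes the sum of squared residuals
* :) the smaller the loss function on the training data, the better the fit; a very small training loss combined with a poor test score is a sign of overfitting

* Regularization: penalizing large coefficients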

### 4.1 Ridge Regression
* Ridge regression is a simple technique to reduce model complexity and prevent the over-fitting that may result from linear regression
* The loss function is altered by adding a penalty proportional to the sum of the squared coefficients (L2 penalty)
* **One parameter: Alpha (also called 'lambda')**
* **higher alpha --> more restriction on the coeffs (simpler model, risk of under-fitting)**
* **lower alpha --> less restriction; as alpha approaches 0 the model approaches ordinary linear regression**
* **Rule of thumb used here: alpha>1** (e.g. 150, 230); the best value depends on the data and is tuned in section 4.3
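
The effect of alpha on the coefficients can be seen in a minimal sketch using the closed-form ridge solution without an intercept. The synthetic data, the seed, and the helper `ridge_fit` are assumptions for illustration; scikit-learn's `Ridge` also fits an intercept by default and may use a different solver, so this is not a reproduction of its internals.

```python
import numpy as np

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
true_w = np.array([4.0, -2.0, 0.5])
y_demo = X_demo @ true_w + rng.normal(0, 0.1, size=100)

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution without intercept: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = [0.0, 1.0, 100.0, 10000.0]
norms = []
for alpha in alphas:
    w = ridge_fit(X_demo, y_demo, alpha)
    norms.append(float(np.linalg.norm(w)))
    print('alpha =', alpha, 'coefficients =', w.round(3), 'norm =', round(norms[-1], 3))
```

With alpha=0 the solution is ordinary least squares; as alpha grows, the coefficient vector is pulled toward zero.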

In [None]:
from sklearn.linear_model import Ridge
ridge=Ridge(alpha=100)
ridge.fit(X_train, y_train)
y_pred2=ridge.predict(X_test)

In [None]:
ridge.score(X_test, y_test)

* ***compare Linear regression vs Ridge(alpha=0.01) vs Ridge(alpha=100)*** 

* 1. in terms of test score: Ridge regression with high alpha has the lowest test score

In [None]:
from sklearn.linear_model import Ridge
rr1=Ridge(alpha=0.01)
rr1.fit(X_train,y_train)

rr2=Ridge(alpha=100)
rr2.fit(X_train,y_train)

print('Linear regression test score:',lr_all.score(X_test,y_test))
print('Ridge regression test score with low alpha(0.01):',rr1.score(X_test,y_test))
print('Ridge regression test score with high alpha(100):',rr2.score(X_test,y_test)) #a high alpha penalizes the score heavily

* 2. in terms of magnitude of coefficients: Ridge regression with high alpha shrinks the coefficients on CHAS, NOX, and RM a lot

In [None]:
import matplotlib.pyplot as plt
plt.plot(names[0:12],lr_all.coef_,alpha=0.4,linestyle='none',marker='o',markersize=7,color='green',label='Linear Regression')
plt.plot(names[0:12],rr1.coef_,alpha=0.4,linestyle='none',marker='*',markersize=7,color='red',label=r'Ridge;$\alpha=0.01$')
plt.plot(names[0:12],rr2.coef_,alpha=0.4,linestyle='none',marker='d',markersize=7,color='blue',label=r'Ridge;$\alpha=100$')
plt.xlabel('Coefficient Index',fontsize=16)
plt.ylabel('Coefficient Magnitude',fontsize=16)
plt.legend(fontsize=13,loc=4)
plt.show()

### 4.2 Lasso Regression
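* Lasso regression is another simple technique to reduce model complexity and prevent the over-fitting that may result from linear regression
* The penalty is proportional to the sum of the absolute values of the coefficients (L1 penalty), which can shrink some coefficients to exactly zero
* Lasso regression therefore not only helps in **reducing over-fitting** but can also help with **feature selection**
* **Rule of thumb used here: alpha<1** (e.g. 0.1, 0.03) 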
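
To illustrate how Lasso can act as a feature selector, here is a minimal sketch on synthetic data where only the first two of five features affect the target (the data, the seed, the value `alpha=0.5`, and the names `lasso_demo` and `kept` are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only features 0 and 1 influence y; features 2-4 are pure noise
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 0.5, size=200)

lasso_demo = Lasso(alpha=0.5)
lasso_demo.fit(X_demo, y_demo)

# Indices of the features whose coefficients are not exactly zero
kept = np.flatnonzero(lasso_demo.coef_)
print('coefficients:', lasso_demo.coef_)
print('features kept:', kept)
```

The surviving coefficients are also shrunk toward zero, so a Lasso fit is often used to choose features rather than to read off effect sizes.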

In [None]:
from sklearn.linear_model import Lasso
lasso=Lasso(alpha=0.8)
lasso.fit(X_train, y_train)
y_pred3=lasso.predict(X_test)

lasso.score(X_test, y_test)

#### :) Feature Selection 
* Removing the predictors with zero coefficients: **CHAS and NOX**

In [None]:
#print(lasso.coef_) 

#Converting the coefficient values to a dataframe
lasso_coeffcients = pd.DataFrame([X_train.columns,lasso.coef_]).T
lasso_coeffcients = lasso_coeffcients.rename(columns={0: 'Attribute', 1: 'Coefficients'}) #put into dataframe
lasso_coeffcients #print out

In [None]:
#Viewing by comparing linear and lasso regression coefficient plots 
import matplotlib.pyplot as plt
plt.plot(names[0:12],lasso.coef_,alpha=0.4,linestyle='none',marker='o',markersize=7,color='green',label='Lasso Regression')
plt.plot(names[0:12],lr_all.coef_,alpha=0.4,linestyle='none',marker='d',markersize=7,color='blue',label='Linear Regression')
plt.xlabel('Coefficient Index',fontsize=16)
plt.ylabel('Coefficient Magnitude',fontsize=16)
plt.legend(fontsize=13,loc=4)
plt.show()

### 4.3 Hyperparameter tuning 
* Ridge and Lasso regression: Choosing alpha
* Hyperparameters cannot be learned by fitting the model; they are set before fitting
* **Solution: GridSearch/RandomizedSearch** (try candidate values and compare cross-validated scores)
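
A minimal sketch of a grid search over alpha, on synthetic data generated with `make_regression` (the dataset, the log-spaced grid, and `cv=5` are assumptions for illustration, not the settings used on the housing data). A log-spaced grid is a common choice because useful alpha values can span several orders of magnitude.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=10.0, random_state=0)

# 13 candidate alphas from 0.001 to 1000, evenly spaced on a log scale
alphas = np.logspace(-3, 3, 13)

# 5-fold cross-validation; each candidate is scored by its mean R^2 across folds
search = GridSearchCV(Ridge(), {'alpha': alphas}, cv=5, scoring='r2')
search.fit(X_demo, y_demo)

print('best alpha:', search.best_params_['alpha'])
print('best mean cross-validated R^2:', search.best_score_)
```

`best_score_` is a cross-validated score on the training data, so a held-out test set is still needed for a final estimate.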

In [None]:
#find best alpha for Ridge Regression
from sklearn.model_selection import GridSearchCV
param_grid={'alpha':np.arange(1,500,10)} #alphas 1, 11, ..., 491 (np.arange takes start, stop, step)
ridge=Ridge() 
ridge_best_alpha=GridSearchCV(ridge, param_grid)
ridge_best_alpha.fit(X_train,y_train)

In [None]:
print("Best alpha for Ridge Regression:",ridge_best_alpha.best_params_)
print("Best score for Ridge Regression with best alpha:",ridge_best_alpha.best_score_)

In [None]:
#find best alpha for Lasso Regression
from sklearn.model_selection import GridSearchCV
param_grid={'alpha':np.linspace(0.1,1.0,10)} #alphas 0.1 to 1.0 in steps of 0.1; alpha=0 is left out because it reduces Lasso to plain least squares
lasso=Lasso() 
lasso_best_alpha=GridSearchCV(lasso, param_grid) 
lasso_best_alpha.fit(X_train,y_train)

In [None]:
print("Best alpha for Lasso Regression:",lasso_best_alpha.best_params_)
print("Best score for Lasso Regression with best alpha:",lasso_best_alpha.best_score_)

## 5. Preprocessing Data + Pipeline 
* 1. Handling missing values: dropna; fillna; SimpleImputer
* 2. Normalizing (centering and scaling): features on larger scales can unduly influence the model
* 3. Pipeline: missing values + normalization + fit model + predict + score
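
To show what the two preprocessing steps do to the numbers, here is a minimal sketch on a tiny made-up array with one missing value and two very different scales (the array and the names `prep` and `X_prepared` are assumptions for illustration).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: the second column is on a much larger scale and has a missing value
X_raw = np.array([[1.0, 1000.0],
                  [2.0, np.nan],
                  [3.0, 3000.0],
                  [4.0, 2000.0]])

# Step 1 fills NaN with the column mean; step 2 centers and scales each column
prep = Pipeline([('imputation', SimpleImputer(strategy='mean')),
                 ('scaler', StandardScaler())])
X_prepared = prep.fit_transform(X_raw)

print(X_prepared)
print('column means:', X_prepared.mean(axis=0))
print('column standard deviations:', X_prepared.std(axis=0))
```

After the transform both columns have mean 0 and standard deviation 1, so neither dominates a penalized model just because of its units. Putting the steps in a pipeline also means the imputation and scaling statistics are learned from the data passed to `.fit()` only.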

In [None]:
#Preprocessing data 

#1. handling missing values (fill with the column mean)
from sklearn.impute import SimpleImputer 
import numpy as np
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')


#2. Normalizing raw data
from sklearn.preprocessing import StandardScaler

#3. Select a prediction model 
from sklearn.linear_model import LinearRegression

#4. Set up pipeline 
from sklearn.pipeline import Pipeline
steps=[('imputation',imputer),('scaler',StandardScaler()),('predict',LinearRegression())]
pipeline=Pipeline(steps) 

#5. Fit data into pipeline 
reg=pipeline.fit(X_train, y_train)
y_pred4=reg.predict(X_test)
reg.score(X_test,y_test)