# Boston house price prediction

Given a set of features that describe a house in Boston, our machine learning model must predict the house price. To train the model on the Boston housing dataset, we will try several regression algorithms.

### Dataset information

The Boston House Prices dataset has 506 rows and 14 attributes (13 features plus the target, MEDV) describing homes in various Boston suburbs.

```
Boston Housing Dataset Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's
```

In [None]:
# import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from ipywidgets import interact
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb

In [None]:
# read data
df = pd.read_csv('../input/boston-house-prices/housing.csv', sep=r'\s+', header=None)
df.head()

In [None]:
# set column names
df.columns = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
df.head()

In [None]:
# shape of dataset
df.shape

In [None]:
#Statistical info
df.describe()

In [None]:
# datatype info
df.info()

In [None]:
#check for null values
df.isnull().sum()

In [None]:
# See rows with missing values
df[df.isnull().any(axis=1)]

In [None]:
# Columns distributions 
hist = df.hist(bins=40,figsize=(20,15))

In [None]:
# Feature scaling
scaler = StandardScaler()
sc_cols = ['CRIM','ZN','INDUS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
df[sc_cols] = scaler.fit_transform(df[sc_cols])
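
The cell above fits the scaler on the full dataset, target included, before any train/test split, so statistics from the future test rows influence the training features. A minimal sketch of one common alternative is shown here, assuming a scikit-learn `Pipeline` and synthetic stand-in data. The names `X_demo`, `y_demo`, and `pipe`, the array shapes, and the coefficients are illustrative and not part of this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 200 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The scaler is fitted on the training rows only when pipe.fit is called;
# the test rows are later transformed with the training mean and variance.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

r2_test = pipe.score(X_te, y_te)
print(f"Test R^2: {r2_test:.3f}")
```

Passing a pipeline like this to `cross_val_score` or `GridSearchCV` would also refit the scaler inside each fold, which is the main reason to prefer it over scaling the whole frame once.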

In [None]:
# Columns distributions after scaling
hist = df.hist(bins=40,figsize=(20,15))

In [None]:
# create box plots
fig, ax = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
ax = ax.flatten()

for col, value in df.items():
    sns.boxplot(y=col, data=df, ax=ax[index])
    index += 1
plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)

In [None]:
# Finding out the correlation between the features
corr = df.corr()
corr

In [None]:
# Plotting the heatmap of correlation between features
fig, ax = plt.subplots(figsize=(15 ,10))
sns.heatmap(corr,  annot=True, cmap='RdYlGn')

#### "RM" is strongly positively correlated with the target variable, and "LSTAT" is strongly negatively correlated with it. "RAD" and "TAX" are also highly correlated with each other, so I'll drop one of them.
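
Reading correlated pairs off a heatmap works for 14 columns, but the same check can be done programmatically. The sketch below is one possible way to list such pairs. The helper name `highly_correlated_pairs`, the 0.9 threshold, and the synthetic frame `demo` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for column pairs with |r| >= threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle only, so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Synthetic stand-in data: 'b' is almost a copy of 'a', 'c' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=300),
    "c": rng.normal(size=300),
})

pairs = highly_correlated_pairs(demo, threshold=0.9)
print(pairs)
```

On the notebook's data, the call would be `highly_correlated_pairs(df.drop(columns=['MEDV']))`, with the threshold chosen to taste.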

In [None]:
# Plotting correlations interactively
def f(corr):
    if corr == 'MEDV and LSTAT':
        # Feature on the x-axis, target on the y-axis, matching the axis labels
        plt.scatter(df['LSTAT'], df['MEDV'], marker='*', c='red')
        plt.xlabel('LSTAT')
        plt.ylabel('MEDV')
        plt.title('Correlation between MEDV and LSTAT')
        plt.show()
    else:
        plt.scatter(df['RM'], df['MEDV'])
        plt.xlabel('RM')
        plt.ylabel('MEDV')
        plt.title('Correlation between MEDV and RM')
        plt.show()
interact(f, corr=['MEDV and RM', 'MEDV and LSTAT'])

### Split data 

In [None]:
# Create dependent and independent features 
X = df.drop(columns=['MEDV', 'RAD'])
y = df['MEDV']

In [None]:
X.columns

### __Try different regression models__

### Linear Regression

In [None]:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [None]:
# The features were already standardized above, so no extra normalization is needed
LR = LinearRegression()
LR.fit(X_train,y_train)

In [None]:
LR_cv_score = cross_val_score(LR, X_train, y_train, cv = 5) 
LR_cv_score
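
When `scoring` is omitted, `cross_val_score` falls back to the estimator's own `score` method, which for scikit-learn regressors is R². The sketch below shows how an RMSE-based cross-validation score could be requested alongside it. The synthetic data and the names `X_demo`, `y_demo`, and `model` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (hypothetical): 150 rows, 4 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=150)

model = LinearRegression()

# Default scorer for a regressor: R^2 per fold.
r2_scores = cross_val_score(model, X_demo, y_demo, cv=5)

# scikit-learn scorers follow "higher is better", so RMSE is returned negated.
neg_rmse = cross_val_score(
    model, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
)
rmse_scores = -neg_rmse

print(r2_scores.mean(), rmse_scores.mean())
```

Reporting both numbers makes the CV score directly comparable with the held-out RMSE computed later.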

In [None]:
LR.coef_

In [None]:
LR_predictions = LR.predict(X_test)

In [None]:
LR_RMSE = np.sqrt(mean_squared_error(y_test, LR_predictions))
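
Because MEDV was standardized together with the features, this RMSE is expressed in standard deviations of the target rather than in $1000s. The sketch below illustrates the relationship using NumPy only. All numbers in it are made up for the example and are not taken from the dataset.

```python
import numpy as np

# Hypothetical target values in $1000s (illustrative, not from the dataset).
y_raw = np.array([24.0, 21.6, 34.7, 33.4, 36.2, 28.7])
mean, std = y_raw.mean(), y_raw.std()  # population std (ddof=0), as StandardScaler uses

y_scaled = (y_raw - mean) / std
# Hypothetical predictions on the standardized scale.
pred_scaled = y_scaled + np.array([0.1, -0.2, 0.05, 0.0, -0.1, 0.15])

rmse_scaled = np.sqrt(np.mean((y_scaled - pred_scaled) ** 2))

# Undo the scaling on the predictions, then recompute the error in raw units.
pred_raw = pred_scaled * std + mean
rmse_raw = np.sqrt(np.mean((y_raw - pred_raw) ** 2))

# The two differ exactly by the scale factor: rmse_raw == rmse_scaled * std.
print(rmse_scaled, rmse_raw)
```

With a fitted `StandardScaler`, the per-column factor is available in `scaler.scale_`, so multiplying the RMSE by the entry for MEDV would give the error in $1000s.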

In [None]:
Models_eval = np.array([])
Models_eval = np.append(Models_eval,[('Linear Regression',LR_RMSE,np.mean(LR_cv_score))])
Models_eval

In [None]:
# Plotting Linear Regression coefficients
coef = pd.Series(LR.coef_, X.columns).sort_values()
coef.plot(kind='bar', title='Linear Regression Coefficients')

In [None]:
x_ax = range(len(y_test))
plt.scatter(x_ax, y_test, label="original")
plt.scatter(x_ax, LR_predictions, label="predicted")
plt.title("Original and predicted data")
plt.legend()
plt.show()

In [None]:
LR_eval = pd.DataFrame({'Model': 'Linear Regression','RMSE':[LR_RMSE],'CV Score':[np.mean(LR_cv_score)]})
print('Model evaluation')
LR_eval

### Decision Tree Regressor

In [None]:
DT = DecisionTreeRegressor(max_depth = 2)
DT.fit(X_train,y_train)

In [None]:
DT_params = {'max_depth': range(1,11)}
DT_grid_search = GridSearchCV(estimator = DT, param_grid = DT_params, cv=10,return_train_score = True )
DT_grid_search.fit(X_train, y_train)

In [None]:
DT_cv_results = pd.DataFrame(DT_grid_search.cv_results_)
DT_cv_results

In [None]:
DT_cv_results[['params','mean_train_score','mean_test_score']]

In [None]:
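# Mean R² per max_depth value (the default scorer for regressors is R², not accuracy)
depths = list(DT_params['max_depth'])
plt.plot(depths, DT_grid_search.cv_results_['mean_test_score'])
plt.plot(depths, DT_grid_search.cv_results_['mean_train_score'])
plt.legend(['test score', 'train score'], loc='upper left')
plt.xlabel('max_depth')
plt.ylabel('R² score')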

In [None]:
DT = DT_grid_search.best_estimator_
DT

In [None]:
DT_cv_score = DT_grid_search.best_score_
DT_cv_score

In [None]:
DT.fit(X_train,y_train)

In [None]:
DT_predictions = DT.predict(X_test)

In [None]:
# Visualize the fitted tree; feature names come from the training columns
fig = plt.figure(figsize=(40,40))
_ = tree.plot_tree(DT, feature_names=list(X.columns),
                   filled=True, node_ids=True)

In [None]:
DT_RMSE = np.sqrt(mean_squared_error(y_test, DT_predictions))

In [None]:
Models_eval = np.append(Models_eval,[('Decision Tree',DT_RMSE,np.mean(DT_cv_score))])
Models_eval

In [None]:
coef = pd.Series(DT.feature_importances_, X.columns).sort_values(ascending=False)
coef.plot(kind='bar', title='Feature Importance')

In [None]:
x_ax = range(len(y_test))
plt.scatter(x_ax, y_test, label="original")
plt.scatter(x_ax, DT_predictions, label="predicted")
plt.title("Original and predicted data")
plt.legend(loc='best')
plt.show()

In [None]:
DT_eval = pd.DataFrame({'Model': 'Decision Tree','RMSE':[DT_RMSE],'CV Score':[np.mean(DT_cv_score)]})
print('Model evaluation')
DT_eval

### Random Forest Regression 

In [None]:
RF = RandomForestRegressor(n_estimators = 30) 

In [None]:
RF_params = {'max_depth': range(2, 10), 'min_samples_split': [2, 4, 6, 8, 10],'n_estimators': range(1, 50) }
RF_grid_search = GridSearchCV(estimator = RF, param_grid = RF_params, cv=10,return_train_score = True )
RF_grid_search.fit(X_train, y_train)
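
The grid above spans 8 × 5 × 49 = 1,960 parameter combinations, and with `cv=10` that amounts to 19,600 model fits before the final refit. One way to cap the cost is a randomized search over the same space. The sketch below assumes a budget of 10 sampled combinations, 3 folds, and synthetic stand-in data, all chosen only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (hypothetical): 120 rows, 4 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo[:, 0] * 2.0 + X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=120)

# Same search space as the grid, given as lists to sample from.
param_distributions = {
    "max_depth": list(range(2, 10)),
    "min_samples_split": [2, 4, 6, 8, 10],
    "n_estimators": list(range(1, 50)),
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,        # assumed budget: evaluate only 10 sampled combinations
    cv=3,             # fewer folds than the notebook, for speed
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

The trade-off is that a randomized search may miss the single best grid point, in exchange for a run time that is fixed by `n_iter`.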

In [None]:
RF_cv_results = pd.DataFrame(RF_grid_search.cv_results_)
RF_cv_results.columns

In [None]:
RF_cv_results[['params','param_n_estimators','mean_train_score','mean_test_score']]

In [None]:
# Mean R² for each parameter combination, in the order the grid was evaluated
plt.plot(RF_grid_search.cv_results_['mean_test_score'])
plt.plot(RF_grid_search.cv_results_['mean_train_score'])
plt.legend(['test score', 'train score'], loc='upper left')
plt.xlabel('parameter combination (index)')
plt.ylabel('R² score')

In [None]:
RF = RF_grid_search.best_estimator_
RF

In [None]:
RF_cv_score = RF_grid_search.best_score_
RF_cv_score

In [None]:
RF.fit(X_train,y_train)

In [None]:
RF_predictions = RF.predict(X_test)

In [None]:
RF_RMSE = np.sqrt(mean_squared_error(y_test, RF_predictions))

In [None]:
Models_eval = np.append(Models_eval,[('Random Forest',RF_RMSE,np.mean(RF_cv_score))])
Models_eval

In [None]:
RF_eval = pd.DataFrame({'Model': 'Random Forest','RMSE':[RF_RMSE],'CV Score':[np.mean(RF_cv_score)]})
print('Model evaluation')
RF_eval

In [None]:
coef = pd.Series(RF.feature_importances_, X.columns).sort_values(ascending=False)
coef.plot(kind='bar', title='Feature Importance')

In [None]:
x_ax = range(len(y_test))
plt.scatter(x_ax, y_test, label="original")
plt.scatter(x_ax, RF_predictions, label="predicted")
plt.title("Original and predicted data")
plt.legend()
plt.show()

In [None]:
RF_eval = pd.DataFrame({'Model': 'Random Forest','RMSE':[RF_RMSE],'CV Score':[np.mean(RF_cv_score)]})
RF_eval

### XGB regression

In [None]:
XGB = xgb.XGBRegressor()

In [None]:
XGB_params = {'n_estimators':range(1, 50),'learning_rate':[0.1,0.07],'gamma':[0,0.03,0.1,0.3],'max_depth':[3,5]}

In [None]:
XGB_grid_search = GridSearchCV(estimator = XGB, param_grid = XGB_params, cv=10, return_train_score = True)
XGB_grid_search.fit(X_train, y_train)

In [None]:
XGB_cv_results = pd.DataFrame(XGB_grid_search.cv_results_)
XGB_cv_results.columns

In [None]:
# Mean R² for each parameter combination, in the order the grid was evaluated
plt.plot(XGB_grid_search.cv_results_['mean_test_score'])
plt.plot(XGB_grid_search.cv_results_['mean_train_score'])
plt.legend(['test score', 'train score'], loc='upper left')
plt.xlabel('parameter combination (index)')
plt.ylabel('R² score')

In [None]:
XGB = XGB_grid_search.best_estimator_
XGB

In [None]:
XGB_cv_score = XGB_grid_search.best_score_
XGB_cv_score

In [None]:
XGB.fit(X_train,y_train)

In [None]:
XGB_predictions = XGB.predict(X_test)

In [None]:
XGB_RMSE = np.sqrt(mean_squared_error(y_test, XGB_predictions))

In [None]:
Models_eval = np.append(Models_eval,[('XGB Regression',XGB_RMSE,np.mean(XGB_cv_score))])
Models_eval

In [None]:
coef = pd.Series(XGB.feature_importances_, X.columns).sort_values(ascending=False)
coef.plot(kind='bar', title='Feature Importance')

In [None]:
x_ax = range(len(y_test))
plt.scatter(x_ax, y_test, label="original")
plt.scatter(x_ax, XGB_predictions, label="predicted")
plt.title("Original and predicted data")
plt.legend()
plt.show()

In [None]:
XGB_eval = pd.DataFrame({'Model': 'XGB Regression','RMSE':[XGB_RMSE],'CV Score':[np.mean(XGB_cv_score)]})
print('Model evaluation')
XGB_eval

# Evaluation and comparison of all the models

In [None]:
Models_eval = Models_eval.reshape(4,3)
models_eval = pd.DataFrame(Models_eval,columns = ['Model','RMSE','CV Score'])
models_eval['RMSE'] = pd.to_numeric(models_eval['RMSE'])
models_eval['CV Score'] = pd.to_numeric(models_eval['CV Score'])
models_eval
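
Appending tuples to a NumPy array coerces every value to a string, which is why the cell above has to reshape the array and convert the numeric columns back. A sketch of one alternative is to collect one dict per model and build the frame once. The helper `record` and the placeholder numbers are hypothetical and are not results from this notebook.

```python
import pandas as pd

results = []  # one dict per evaluated model

def record(model_name, rmse, cv_score):
    """Store one model's metrics; numeric values keep their float type."""
    results.append({"Model": model_name, "RMSE": rmse, "CV Score": cv_score})

# Placeholder numbers for illustration only.
record("Model A", 0.50, 0.70)
record("Model B", 0.40, 0.80)

summary = pd.DataFrame(results)
print(summary.dtypes)
```

With this layout no `reshape` or `pd.to_numeric` step is needed, and adding a fifth model does not require changing a hard-coded shape.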

In [None]:
f, axe = plt.subplots(1,1, figsize=(10,3))

models_eval.sort_values(by=['CV Score'], ascending=False, inplace=True)

sns.barplot(x='CV Score', y='Model', data = models_eval, ax = axe)
axe.set_xlabel('Cross-Validation Score', size=16)
axe.set_ylabel('Model', size=16)
axe.set_xlim(0,1.0)
plt.show()

In [None]:
models_eval.sort_values(by=['RMSE'], ascending=False, inplace=True)

f, axe = plt.subplots(1,1, figsize=(10,3))
sns.barplot(x='Model', y='RMSE', data=models_eval, ax = axe)
axe.set_xlabel('Model', size=16)
axe.set_ylabel('RMSE', size=16)

plt.show()

In this notebook, I built four regression models on the Boston Housing dataset: linear regression, decision tree regression, random forest regression, and XGB regression. I then calculated and visualized each model's performance measures (RMSE and cross-validation score). Of the four, XGB regression was the best fit for this dataset.
