# Boston house prices

The aim of this notebook is to build machine learning models that predict Boston house prices.  
We will use the data provided by Kaggle: [Boston housing data](https://www.kaggle.com/c/boston-housing/data).

## Data

##### Loading the data

In [1]:
import pandas as pd

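# Paths are local to the author's machine; adjust them to your own setup.
train = pd.read_csv("/Users/Paul-Noel/Desktop/Programming/boston-housing/data/train.csv", sep=",")
# The test set lives in its own file (the original cell loaded train.csv twice).
test = pd.read_csv("/Users/Paul-Noel/Desktop/Programming/boston-housing/data/test.csv", sep=",")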

##### Overview

There are 13 features and one target, which is `medv`. The features are the following (descriptions given by Kaggle):

| Feature | Description |
|---------|-------------|
| crim | per capita crime rate by town. |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres per town. |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). |
| nox | nitrogen oxides concentration (parts per 10 million). |
| rm | average number of rooms per dwelling. |
| age | proportion of owner-occupied units built prior to 1940. |
| dis | weighted mean of distances to five Boston employment centres. |
| rad | index of accessibility to radial highways. |
| tax | full-value property-tax rate per \$10,000. |
| ptratio | pupil-teacher ratio by town. |
| black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. |
| lstat | lower status of the population (percent). |

In [7]:
train.head(2)

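|   | ID | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|----|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|-------|-------|------|
| 0 | 1 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 2 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |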


In [8]:
train = train.drop('ID', axis=1)

In [9]:
train.describe()

|       | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|-------|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|-------|-------|------|
| count | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 |
| mean | 3.360341 | 10.689189 | 11.293483 | 0.06006 | 0.557144 | 6.265619 | 68.226426 | 3.709934 | 9.633634 | 409.279279 | 18.448048 | 359.466096 | 12.515435 | 22.768769 |
| std | 7.352272 | 22.674762 | 6.998123 | 0.237956 | 0.114955 | 0.703952 | 28.133344 | 1.981123 | 8.742174 | 170.841988 | 2.151821 | 86.584567 | 7.067781 | 9.173468 |
| min | 0.00632 | 0.0 | 0.74 | 0.0 | 0.385 | 3.561 | 6.0 | 1.1296 | 1.0 | 188.0 | 12.6 | 3.5 | 1.73 | 5.0 |
| 25% | 0.07896 | 0.0 | 5.13 | 0.0 | 0.453 | 5.884 | 45.4 | 2.1224 | 4.0 | 279.0 | 17.4 | 376.73 | 7.18 | 17.4 |
| 50% | 0.26169 | 0.0 | 9.9 | 0.0 | 0.538 | 6.202 | 76.7 | 3.0923 | 5.0 | 330.0 | 19.0 | 392.05 | 10.97 | 21.6 |
| 75% | 3.67822 | 12.5 | 18.1 | 0.0 | 0.631 | 6.595 | 93.8 | 5.1167 | 24.0 | 666.0 | 20.2 | 396.24 | 16.42 | 25.0 |
| max | 73.5341 | 100.0 | 27.74 | 1.0 | 0.871 | 8.725 | 100.0 | 10.7103 | 24.0 | 711.0 | 21.2 | 396.9 | 37.97 | 50.0 |


We check whether any column contains missing values (NAs).

In [6]:
len(train) - train.count()

ID         0
crim       0
zn         0
indus      0
chas       0
nox        0
rm         0
age        0
dis        0
rad        0
tax        0
ptratio    0
black      0
lstat      0
medv       0
dtype: int64
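
The same check can also be written with `isna()`, which some readers find more explicit. The sketch below is only an illustration: it uses a small made-up frame named `toy` (an assumption, not the Kaggle data) so that one missing value is actually present.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `train`; the values are made up.
toy = pd.DataFrame({
    "crim": [0.1, 0.2, np.nan],
    "rm": [6.0, 6.5, 7.0],
    "medv": [20.0, 22.0, 30.0],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# The notebook's formulation (rows minus non-null entries) gives the same counts.
same_as_notebook = (missing_per_column == len(toy) - toy.count()).all()
print(same_as_notebook)
```

On the real training set both formulations should report zero for every column, as in the output above.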

##### Feature selection

In [20]:
corr = train.corr().abs().loc['medv', :]
corr.sort_values(ascending=False)

medv       1.000000
lstat      0.738600
rm         0.689598
ptratio    0.481376
indus      0.473932
tax        0.448078
nox        0.413054
crim       0.407454
age        0.358888
rad        0.352251
zn         0.344842
black      0.336660
dis        0.249422
chas       0.204390
Name: medv, dtype: float64

First, we will try models that use only the features whose absolute correlation with `medv` exceeds 0.4. Then we will test regularized models, which perform feature selection on their own.
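
Both ideas can be sketched on synthetic data. The example below is a minimal illustration under stated assumptions: the frame `df`, its columns `strong`, `weak`, `noise` and `target`, and the choice of `LassoCV` as the regularized model are all hypothetical and are not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one informative feature, one weak feature, one pure-noise feature.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
df["target"] = 3.0 * df["strong"] + 0.3 * df["weak"] + rng.normal(scale=0.5, size=n)

# 1) Filter: keep features whose absolute correlation with the target exceeds 0.4.
corr = df.corr().abs()["target"].drop("target")
selected = corr[corr > 0.4].index.tolist()
print("selected by correlation:", selected)

# 2) Regularization: an L1-penalized model shrinks unhelpful coefficients towards zero.
X = df.drop(columns="target")
y = df["target"]
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)
coefs = pd.Series(model.named_steps["lassocv"].coef_, index=X.columns)
print(coefs)
```

The correlation filter looks at each feature in isolation, whereas the L1 penalty chooses features jointly while fitting, which is one reason to try both.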

In [24]:
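# Features with an absolute correlation above 0.4, plus the target `medv`,
# which is split off from the features in the Models section.
features = ['lstat', 'rm', 'ptratio', 'indus', 'tax', 'nox', 'crim', 'medv']

train_1 = train[features]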

## Models

In [39]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

In [53]:
y = train_1['medv']
X = train_1.drop('medv', axis=1)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"TRAIN SHAPE : {X_train.shape} | CV SHAPE : {X_cv.shape}")

TRAIN SHAPE : (266, 7) | CV SHAPE : (67, 7)


### Linear regression

In [36]:
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [37]:
y_train_pred = lr.predict(X_train)
y_cv_pred = lr.predict(X_cv)
# mean_squared_error returns the MSE, not its square root, so the values are labelled as MSE.
mse_train = mean_squared_error(y_train, y_train_pred)
mse_cv = mean_squared_error(y_cv, y_cv_pred)

print(f"""TRAIN SET : mse = {mse_train} \n
CV SET : mse = {mse_cv}""")

TRAIN SET : mse = 29.465442176803005 

CV SET : mse = 23.783315141581987
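
The values printed above come from `mean_squared_error`, which by default returns the mean squared error (MSE) in squared units of the target. The root mean squared error (RMSE) is its square root and is expressed in the same units as `medv`. The sketch below shows the relationship on made-up numbers; the helper functions `mse` and `rmse` are hypothetical and written in plain Python for illustration.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


# Made-up targets and predictions; every prediction is off by exactly 2.
y_true = [24.0, 22.0, 30.0, 18.0]
y_pred = [22.0, 24.0, 28.0, 20.0]

print(mse(y_true, y_pred))   # 4.0
print(rmse(y_true, y_pred))  # 2.0
```

With scikit-learn, the same conversion amounts to taking the square root of the value returned by `mean_squared_error`.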


### Regression Tree

In [58]:
params = {'max_depth': [3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30]}
RegTree = DecisionTreeRegressor()
clf = GridSearchCV(estimator=RegTree,
                  param_grid=params,
                  scoring='neg_mean_squared_error',
                  cv=5)
clf.fit(X_train, y_train)

print(clf.best_params_)

{'max_depth': 7}




In [59]:
TreeReg = DecisionTreeRegressor(max_depth=7)
TreeReg.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=7, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [60]:
y_train_pred = TreeReg.predict(X_train)
y_cv_pred = TreeReg.predict(X_cv)
# mean_squared_error returns the MSE, not its square root, so the values are labelled as MSE.
mse_train = mean_squared_error(y_train, y_train_pred)
mse_cv = mean_squared_error(y_cv, y_cv_pred)

print(f"""TRAIN SET : mse = {mse_train} \n
CV SET : mse = {mse_cv}""")

TRAIN SET : mse = 2.714090917742065 

CV SET : mse = 20.33276809793344
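
With its default `refit=True`, `GridSearchCV` retrains the best configuration on the full training split and exposes it as `best_estimator_`, so a separate refit is optional. The sketch below illustrates this on synthetic data from `make_regression`; the data, the shortened depth grid and the names `search` and `best_tree` are assumptions made for the example, not the notebook's setup.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem with 7 features, standing in for the housing data.
X, y = make_regression(n_samples=300, n_features=7, noise=10.0, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.2, random_state=42)

# Search over tree depth; fixing random_state makes the search reproducible.
search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [3, 5, 7, 10]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)

# best_estimator_ is already refit on all of X_train with the best depth.
best_tree = search.best_estimator_
mse_train = mean_squared_error(y_train, best_tree.predict(X_train))
mse_cv = mean_squared_error(y_cv, best_tree.predict(X_cv))

print(search.best_params_)
print(f"TRAIN SET : mse = {mse_train} | CV SET : mse = {mse_cv}")
```

A large gap between the training error and the held-out error, like the one in the output above, usually indicates that the tree is overfitting the training split.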
