# Boston house-prices prediction with a decision tree

In [57]:
import matplotlib.pyplot as plt
import numpy as np
import sys
# Note: load_boston was removed in scikit-learn 1.2; this import needs an older release
from sklearn.datasets import load_boston

sys.path.append("..")
from models.decision_tree import DecisionTreeClassifier, DecisionTreeRegressor
from models.linear_regression import LinearRegression
from utils.visualization import plot_decision_boundary

In [2]:
%matplotlib inline

np.random.seed(1)

## Load the dataset

The Boston house-prices dataset contains housing values in suburbs of Boston. Each instance is described by 14 attributes: 13 are used for prediction, and the remaining one, MEDV (median value of owner-occupied homes in $1000's), is the target.

In [20]:
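# Load and partition the data
(X, y) = load_boston(return_X_y=True)

# NOTE: perm is computed but never applied, so the 70/30 split below
# follows the dataset's original row order rather than a shuffled one
perm = np.random.permutation(X.shape[0])
pivot = int(X.shape[0]*0.7)
x_train, y_train = X[:pivot, :], y[:pivot]
x_test, y_test = X[pivot:, :], y[pivot:]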
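A shuffled split can be sketched as follows. This is a minimal illustration, not the cell that produced the outputs recorded in this notebook: the helper name `shuffled_split`, its parameters, and the synthetic data are assumptions made for the example.

```python
import numpy as np

def shuffled_split(X, y, train_fraction=0.7, seed=1):
    """Shuffle rows with a random permutation, then split into train/test parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])
    pivot = int(X.shape[0] * train_fraction)
    train_idx, test_idx = perm[:pivot], perm[pivot:]
    # Index features and targets with the same indices so rows stay aligned
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Synthetic stand-in data (not the Boston dataset): 10 rows, 3 features
X_demo = np.arange(30, dtype=float).reshape(10, 3)
y_demo = np.arange(10, dtype=float)

x_tr, y_tr, x_te, y_te = shuffled_split(X_demo, y_demo)
print(x_tr.shape, x_te.shape)  # (7, 3) (3, 3)
```

Applying the same index arrays to both `X` and `y` is what keeps each target paired with its feature row after shuffling.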

## Define a baseline

To assess the quality of the model, we can compare its performance against that of a simple baseline model, for instance linear regression.

In [59]:
# Create and fit baseline model
baseline = LinearRegression()
baseline.fit(x_train, y_train)

# Compute metrics on training and test set
rmse_train = np.sqrt(np.mean(np.square(baseline.predict(x_train)-y_train)))
y_hat = baseline.predict(x_test)
rmse = np.sqrt(np.mean(np.square(y_hat-y_test)))

print("RMSE on the train set: %.2f" % rmse_train)
print("RMSE on the holdout set: %.2f" % rmse)

RMSE on the train set: 3.00
RMSE on the holdout set: 23.39
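The RMSE expression is repeated in several cells, so it could be factored into a small helper. The sketch below assumes a function named `rmse`, which is not part of the original notebook, and computes the same quantity as the inline expressions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(np.square(y_pred - y_true))))

# Errors are +3 and -3, so the mean squared error is 9 and the RMSE is 3
print(rmse([2.0, 4.0], [5.0, 1.0]))  # 3.0
```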


Our benchmark to beat is therefore an RMSE of 23.39 on the holdout set: any model that performs worse is not worth using, since the simple baseline already does better.

To fit a decision tree we have to select some hyperparameters: a maximum depth and a minimum impurity. Let's start with an educated guess to assess the model.

In [60]:
# Create and fit a tree
tree = DecisionTreeRegressor(max_depth=5, min_impurity=1)
tree.fit(X=x_train, y=y_train)
rmse_train = np.sqrt(np.mean(np.square(tree.predict(x_train)-y_train)))

y_hat = tree.predict(x_test)
rmse = np.sqrt(np.mean(np.square(y_hat-y_test)))

print("RMSE on the train set: %.2f" % rmse_train)
print("RMSE on the holdout set: %.2f" % rmse)

RMSE on the train set: 2.36
RMSE on the holdout set: 6.93


## Improving the model

The new model clearly outperforms the baseline, but hyperparameter search may improve it further. Because the dataset is small, the holdout set may be too small to discriminate with confidence between models of similar performance. Cross-validation can help overcome this problem.
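One simple form of hyperparameter search is an exhaustive grid search. The sketch below is an assumption about how such a search might look, not the notebook's method: `grid_search` and `toy_score` are hypothetical names, and `toy_score` merely stands in for a function that would fit a tree with the given settings and return its validation RMSE.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Return (best_params, best_score) minimizing score_fn over all grid combinations."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical scoring function; a real one would train and evaluate a model
def toy_score(max_depth, min_impurity):
    return (max_depth - 4) ** 2 + min_impurity

best, score = grid_search(
    toy_score, {"max_depth": [2, 4, 6], "min_impurity": [0.5, 1.0]}
)
print(best, score)  # {'max_depth': 4, 'min_impurity': 0.5} 0.5
```

The number of evaluations grows as the product of the grid sizes, which is one reason to keep each evaluation cheap or the grid coarse.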

### Cross-validation

In [67]:
def cross_validate(model, k):
    """Return the mean RMSE of `model` over k folds of the global X, y."""
    X_fold = X
    y_fold = y
    pivot = int(X_fold.shape[0]/k)
    cum_rmse = 0
    for _ in range(k):
        # Always take the first fold as test
        x_test, y_test = X_fold[:pivot, :], y_fold[:pivot]
        x_train, y_train = X_fold[pivot:, :], y_fold[pivot:]
        
        # fit the model
        model.fit(X=x_train, y=y_train)
        y_hat = model.predict(x_test)
        
        cum_rmse += np.sqrt(np.mean(np.square(y_hat-y_test)))
        
        # Rotate: move the fold just tested to the end so the next fold comes first
        X_fold = np.concatenate((x_train, x_test))
        y_fold = np.concatenate((y_train, y_test))
        
    return cum_rmse/k

In [68]:
cross_validate(tree, 5)

6.2953829714342975
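The same idea can also be written with explicit index arrays and the data passed in as arguments instead of read from globals. This is a sketch under stated assumptions: `k_fold_rmse` and `MeanRegressor` are hypothetical names introduced here, the rows are shuffled before folding (which the function above does not do), and the stand-in model only mimics the `fit`/`predict` interface used in this notebook.

```python
import numpy as np

class MeanRegressor:
    """Hypothetical stand-in model: always predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(X.shape[0], self.mean_)

def k_fold_rmse(model, X, y, k, seed=0):
    """Mean RMSE over k folds, each fold serving once as the test set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(X.shape[0])
    # array_split tolerates sizes not divisible by k, so every row is tested once
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X=X[train_idx], y=y[train_idx])
        y_hat = model.predict(X[test_idx])
        scores.append(np.sqrt(np.mean(np.square(y_hat - y[test_idx]))))
    return float(np.mean(scores))

# Synthetic data with a constant target, so the mean predictor is exact
X_demo = np.arange(24, dtype=float).reshape(12, 2)
y_demo = np.full(12, 5.0)
print(k_fold_rmse(MeanRegressor(), X_demo, y_demo, k=5))  # 0.0
```

Unlike the rotation approach, splitting indices with `np.array_split` does not drop the remainder rows when the number of samples is not a multiple of `k`.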