# California housing dataset regression with decision trees 

In this notebook, we'll use [decision trees](http://scikit-learn.org/stable/modules/tree.html) and [ensembles of trees](http://scikit-learn.org/stable/modules/ensemble.html) to estimate median house values for California housing districts, using scikit-learn and [XGBoost](https://xgboost.readthedocs.io/en/latest/).

First, the needed imports. 

In [None]:
%matplotlib inline

import numpy as np
from sklearn import datasets, __version__
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

import matplotlib.pyplot as plt
import seaborn as sns
import graphviz
sns.set()

## Data

In [None]:
# Download (and cache) the data: 20,640 districts, 8 features, target in units of $100,000
chd = datasets.fetch_california_housing()

In [None]:
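test_size = 5000  # number of districts held out for testing

# A fixed random_state makes the split, and thus the reported errors, reproducible
X_train_all, X_test_all, y_train, y_test = train_test_split(
    chd.data, chd.target, test_size=test_size, shuffle=True, random_state=42)

# Single-feature variant: column 0 is MedInc (median income in the district)
X_train_single = X_train_all[:, 0].reshape(-1, 1)
X_test_single = X_test_all[:, 0].reshape(-1, 1)
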
print()
print('California housing data: train:',len(X_train_all),'test:',len(X_test_all))
print()
print('X_train_all:', X_train_all.shape)
print('X_train_single:', X_train_single.shape)
print('y_train:', y_train.shape)
print()
print('X_test_all:', X_test_all.shape)
print('X_test_single:', X_test_single.shape)
print('y_test:', y_test.shape)

In [None]:
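# Start with a single feature so that the fitted models can be plotted in 2D
X_train = X_train_single
X_test = X_test_single

# Uncomment to train on all eight features instead (the plots are then skipped)
#X_train = X_train_all
#X_test = X_test_all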

## Decision tree

### Learning

In [None]:
%%time

max_depth = 3
dt_reg = DecisionTreeRegressor(max_depth=max_depth)
dt_reg.fit(X_train, y_train)

In [None]:
if X_train.shape[1] == 1:
    plt.figure(figsize=(10, 10))
    plt.scatter(X_train, y_train, s=5)
    reg_x = np.arange(np.min(X_train), np.max(X_train), 0.01).reshape(-1, 1)
    plt.plot(reg_x, dt_reg.predict(reg_x), lw=4, c=sns.color_palette()[1],
             label='decision tree')
    plt.legend(loc='best');

### Inference


In [None]:
%%time

predictions = dt_reg.predict(X_test)
print("Mean squared error: %.2f"
      % mean_squared_error(y_test, predictions))
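
The mean squared error reported above is the average of the squared differences between the true targets and the predictions. As a minimal sketch of what `mean_squared_error` computes, the same quantity can be written with plain NumPy. The helper name `mse` and the toy values are made up for illustration. Because the target is expressed in units of $100,000, the square root of the MSE (the RMSE) is back in those units and is often easier to interpret.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Toy values, made up for illustration (not taken from the dataset).
toy_true = [3.0, 1.0, 2.0]
toy_pred = [2.0, 1.0, 4.0]
toy_mse = mse(toy_true, toy_pred)  # the squared errors are 1, 0 and 4

print("Toy MSE: %.4f" % toy_mse)
print("Toy RMSE: %.4f" % np.sqrt(toy_mse))
```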

## Random forest

### Learning


In [None]:
%%time

n_estimators = 10
max_depth = 3
rf_reg = RandomForestRegressor(n_estimators=n_estimators,
                               max_depth=max_depth)
rf_reg.fit(X_train, y_train)
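
A random forest regressor predicts by averaging the predictions of its individual trees. The sketch below illustrates this on synthetic stand-in data, an assumption made so that the example runs without downloading anything; the names `X_demo`, `y_demo` and `demo_forest` are hypothetical and separate from the notebook's own variables.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption, not the California housing data).
X_demo, y_demo = make_regression(n_samples=200, n_features=1, noise=10.0,
                                 random_state=0)

demo_forest = RandomForestRegressor(n_estimators=10, max_depth=3,
                                    random_state=0)
demo_forest.fit(X_demo, y_demo)

# Predict with every individual tree, then average across the trees.
per_tree = np.array([tree.predict(X_demo) for tree in demo_forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The manual average should agree with the forest's own prediction.
print(np.allclose(manual_average, demo_forest.predict(X_demo)))
```

The same check should work on `rf_reg` with `X_test` in place of the stand-in objects.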

In [None]:
if X_train.shape[1] == 1:
    plt.figure(figsize=(10, 10))
    plt.scatter(X_train, y_train, s=5)
    reg_x = np.arange(np.min(X_train), np.max(X_train), 0.01).reshape(-1, 1)
    plt.plot(reg_x, dt_reg.predict(reg_x), lw=4, c=sns.color_palette()[1],
             label='decision tree')
    plt.plot(reg_x, rf_reg.predict(reg_x), lw=4, c=sns.color_palette()[2],
             label='random forest')
    plt.legend(loc='best');

### Inference

In [None]:
%%time

predictions = rf_reg.predict(X_test)
print("Mean squared error: %.2f"
      % mean_squared_error(y_test, predictions))

## Gradient boosted trees (XGBoost)

### Learning

In [None]:
%%time

xgb_reg = XGBRegressor()
xgb_reg.fit(X_train, y_train)

In [None]:
if X_train.shape[1] == 1:
    plt.figure(figsize=(10, 10))
    plt.scatter(X_train, y_train, s=5)
    reg_x = np.arange(np.min(X_train), np.max(X_train), 0.01).reshape(-1, 1)
    plt.plot(reg_x, dt_reg.predict(reg_x), lw=4, c=sns.color_palette()[1],
             label='decision tree')
    plt.plot(reg_x, rf_reg.predict(reg_x), lw=4, c=sns.color_palette()[2],
             label='random forest')
    plt.plot(reg_x, xgb_reg.predict(reg_x), lw=4, c=sns.color_palette()[3],
             label='XGBoost')
    plt.legend(loc='best');

### Inference

In [None]:
%%time

predictions = xgb_reg.predict(X_test)
print("Mean squared error: %.2f"
      % mean_squared_error(y_test, predictions))

## Model tuning

Study the documentation of the tree-based models used in this notebook ([decision trees](http://scikit-learn.org/stable/modules/tree.html), [tree ensembles](http://scikit-learn.org/stable/modules/ensemble.html), [XGBoost](https://xgboost.readthedocs.io/en/latest/)), and experiment with different hyperparameter values.

Report the lowest mean squared error you manage to obtain. Also note the parameters you used, so that others can try to reproduce your results.
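
One possible way to organize such experiments is a cross-validated grid search. The following is only a sketch under stated assumptions: it uses synthetic stand-in data so that it runs on its own, and the grid values are arbitrary starting points rather than recommendations. In the notebook, `X_train` and `y_train` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; in the notebook, use X_train and y_train instead.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0,
                                 random_state=0)

# Hypothetical grid: arbitrary starting values, not recommendations.
param_grid = {
    "max_depth": [3, 5, 10],
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # scores are maximized, hence the sign
    cv=3,
)
search.fit(X_demo, y_demo)

# Flip the sign to recover the cross-validated mean squared error.
best_cv_mse = -search.best_score_
print("Best parameters:", search.best_params_)
print("Cross-validated MSE: %.2f" % best_cv_mse)
```

Selecting hyperparameters by cross-validation on the training data keeps the test set untouched until the final evaluation, so the reported test error is not biased by the tuning itself.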