# Exercise - RFs for regression

1. Use the **fetch_california_housing** data (remember to split your data into a train, validation, and test data). Using your training and validation data, optimize the parameters of your RF. How well does your optimized model perform on the test data?
   
**Note**: This dataset is **much** larger than what we have otherwise been using. This means you cannot try a million different things without the code running very slowly!

**See slides for more details!**

In [39]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn import ensemble
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.metrics import accuracy_score

from sklearn.datasets import clear_data_home
clear_data_home() #Clears the data cache

X, y = fetch_california_housing(return_X_y=True, as_frame=True)

# Use `train_test_split` to split your data into a train and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Use `train_test_split` to split your train data into a train and a validation  set.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

print(X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape)

(13209, 8) (3303, 8) (4128, 8) (13209,) (3303,) (4128,)


# Exercise 1

Use the **fetch_california_housing** data (remember to split your data into a train, validation, and test data). Using your training and validation data, optimize the parameters of your RF. How well does your optimized model perform on the test data?

Let us start by ensuring we can just run an RF without any optimization. Note how it is slower than a lot of what we have done so far!

In [40]:
rf_current = ensemble.RandomForestRegressor()
rf_current.fit(X_train, y_train)
y_val_hat = rf_current.predict(X_val)
mse = mean_squared_error(y_val, y_val_hat)

print(f'RF with default settings has validation MSE of {mse}.')

RF with default settings has validation MSE of 0.27398128108362224.


In [41]:
# Remember you can try other stuff than these specific parameters.
# Just here to get you started!

n_estimators_list = [5,10]  # Add more values to explore
min_samples_split_list = [5, 10]  # Add more values to explore
min_samples_leaf_list = [5, 10] 

results = []

for n_estimators in n_estimators_list:
    for min_samples_split in min_samples_split_list:
        for min_samples_leaf in min_samples_leaf_list:
            rf_current = ensemble.RandomForestRegressor(
                n_estimators=n_estimators,
                min_samples_split=min_samples_split,
                min_samples_leaf=min_samples_leaf,
                )
            rf_current.fit(X_train, y_train)
            y_val_hat = rf_current.predict(X_val)
            mse = mean_squared_error(y_val, y_val_hat)

            results.append([mse, n_estimators, min_samples_split, min_samples_leaf])

results = pd.DataFrame(results)
results.columns = ['MSE', 'n_estimators', 'min_samples_split', 'min_samples_leaf']
print(results)

        MSE  n_estimators  min_samples_split  min_samples_leaf
0  0.314479             5                  5                 5
1  0.323947             5                  5                10
2  0.320992             5                 10                 5
3  0.309901             5                 10                10
4  0.307990            10                  5                 5
5  0.309809            10                  5                10
6  0.286948            10                 10                 5
7  0.311502            10                 10                10


In [42]:
# Extract best parameters.
results[results['MSE'] == results['MSE'].max()]

Unnamed: 0,MSE,n_estimators,min_samples_split,min_samples_leaf
1,0.323947,5,5,10


In [43]:
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

(13209, 8) (3303, 8) (13209,) (3303,)


In [44]:
# from sklearn.tree import DecisionTreeRegressor
# Initialize your final model
final_rf = ensemble.RandomForestRegressor(
    n_estimators=5,
    min_samples_split=5,
    min_samples_leaf=10,
    )

# Use both training and validation data to fit it (np.concatenate "stacks" the array like rbind in R)
final_rf.fit(np.concatenate([X_train, X_val]), np.concatenate([y_train, y_val]))

# Predict on test data
y_test_hat_optimized = final_rf.predict(X_test)

# Obtain and check mse on test data
rf_optimized = mean_squared_error(y_test, y_test_hat_optimized)
print(f'Optimized RF achieved MSE = {round(rf_optimized, 2)}. - lower values indicating better performance.')


Optimized RF achieved MSE = 0.31. - lower values indicating better performance.


