# Exercise - RFs for regression

1. Use the **fetch_california_housing** data (remember to split your data into training, validation, and test sets). Using your training and validation data, optimize the hyperparameters of your RF. How well does your optimized model perform on the test data?
   
**Note**: This dataset is **much** larger than the ones we have used so far. This means you cannot try a large number of parameter combinations without the code running very slowly!

**See slides for more details!**
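Since fitting time is the main constraint here, a minimal sketch of two standard `RandomForestRegressor` options that usually reduce it may help: `n_jobs=-1` (build trees on all CPU cores) and `max_samples` (draw a smaller bootstrap sample per tree). The synthetic data, the helper `time_fit`, and the specific values below are illustrative assumptions, not part of the exercise, and the speed-up you see will depend on your machine.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the housing data (assumption: 8 numeric features).
X_demo, y_demo = make_regression(
    n_samples=2000, n_features=8, noise=10.0, random_state=0
)


def time_fit(**rf_kwargs):
    """Hypothetical helper: fit a 50-tree forest and return (model, seconds)."""
    model = RandomForestRegressor(n_estimators=50, random_state=0, **rf_kwargs)
    start = time.perf_counter()
    model.fit(X_demo, y_demo)
    return model, time.perf_counter() - start


# Default settings: one core, full-size bootstrap sample per tree.
rf_serial, t_serial = time_fit()

# All cores, and each tree sees only half of the rows.
rf_fast, t_fast = time_fit(n_jobs=-1, max_samples=0.5)

print(f"default: {t_serial:.2f}s, n_jobs=-1 + max_samples=0.5: {t_fast:.2f}s")
```

Smaller bootstrap samples can cost some accuracy, so it is worth checking the validation MSE before relying on them.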

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn import ensemble
import tqdm
import pandas as pd
import numpy as np

X, y = fetch_california_housing(return_X_y=True)

housing_data = fetch_california_housing()
print(housing_data.DESCR)

# Use `train_test_split` to split your data into a train and a test set.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)

# Use `train_test_split` to split your train data into a train and a validation set.
X_train, X_val, y_train, y_val = train_test_split(X_train,
                                                  y_train,
                                                  test_size=0.2,
                                                  random_state=42)

print(X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived
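To make the effect of the two-stage split concrete, here is a small sketch that repeats the same two `train_test_split` calls on zero-filled placeholder arrays with the dataset's shape (20640 rows, 8 features). The names `X_stub` and `y_stub` are placeholders introduced for this illustration; only the resulting sizes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the housing data (no download needed).
X_stub = np.zeros((20640, 8))
y_stub = np.zeros(20640)

# First split: hold out 20% of all rows as the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_stub, y_stub, test_size=0.2, random_state=42
)

# Second split: hold out 20% of the remaining rows as the validation set.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tr, y_tr, test_size=0.2, random_state=42
)

for name, part in [("train", X_tr), ("validation", X_va), ("test", X_te)]:
    print(f"{name:<10} {len(part):>6} rows ({len(part) / len(X_stub):.0%})")
```

The second split takes 20% of the remaining 80%, so the final proportions are roughly 64% train, 16% validation, and 20% test.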

# Exercise 1

Use the **fetch_california_housing** data (remember to split your data into training, validation, and test sets). Using your training and validation data, optimize the hyperparameters of your RF. How well does your optimized model perform on the test data?

Let us start by ensuring we can just run an RF without any optimization. Note how it is slower than a lot of what we have done so far!

In [2]:
rf_current = ensemble.RandomForestRegressor()
rf_current.fit(X_train, y_train)
y_val_hat = rf_current.predict(X_val)
mse = mean_squared_error(y_val, y_val_hat)

print(f'RF with default settings has validation MSE of {mse}.')

RF with default settings has validation MSE of 0.2755603043591978.
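An MSE is easier to interpret after taking its square root. The target is expressed in units of $100,000, so the RMSE can be converted into dollars. The sketch below simply reuses the validation MSE printed above; the variable names are illustrative.

```python
import math

# Validation MSE reported above for the default random forest.
val_mse = 0.2755603043591978

# RMSE is in the same units as the target (hundreds of thousands of dollars).
val_rmse = math.sqrt(val_mse)
val_rmse_dollars = val_rmse * 100_000

print(f"Validation RMSE: {val_rmse:.3f} (about ${val_rmse_dollars:,.0f})")
```

In other words, the default model's typical validation error is on the order of $50,000 in median house value.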


In [3]:
# Remember you can try other stuff than these specific parameters.
# Just here to get you started!
n_estimators_list = [20, 200, 500]
min_samples_split_list = [15, 20]  # input values separated by ",".
min_samples_leaf_list = [5, 10, 15]  # input values separated by ",".

results = []

for n_estimators in tqdm.tqdm(n_estimators_list):
    for min_samples_split in min_samples_split_list:
        for min_samples_leaf in min_samples_leaf_list:
            rf_current = ensemble.RandomForestRegressor(
                n_estimators=n_estimators,
                min_samples_split=min_samples_split,
                min_samples_leaf=min_samples_leaf,
                )
            rf_current.fit(X_train, y_train)
            y_val_hat = rf_current.predict(X_val)
            mse = mean_squared_error(y_val, y_val_hat)

            results.append([mse, n_estimators, min_samples_split, min_samples_leaf])

results = pd.DataFrame(
    results,
    columns=['MSE', 'n_estimators', 'min_samples_split', 'min_samples_leaf'],
)
print(results)

100%|██████████| 3/3 [07:54<00:00, 158.31s/it]

         MSE  n_estimators  min_samples_split  min_samples_leaf
0   0.290417            20                 15                 5
1   0.299068            20                 15                10
2   0.314495            20                 15                15
3   0.287619            20                 20                 5
4   0.297947            20                 20                10
5   0.303775            20                 20                15
6   0.283999           200                 15                 5
7   0.296333           200                 15                10
8   0.308104           200                 15                15
9   0.287921           200                 20                 5
10  0.294348           200                 20                10
11  0.306711           200                 20                15
12  0.283572           500                 15                 5
13  0.293775           500                 15                10
14  0.306329           500              
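The three nested loops above can also be written with scikit-learn's `ParameterGrid`, which enumerates every combination of a parameter dictionary. The following is a sketch on a small synthetic dataset with a deliberately tiny grid so that it runs quickly; the data, the grid values, and the names `X_tr`, `X_va`, and `grid_results` are assumptions made for illustration, not the settings used above.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ParameterGrid, train_test_split

# Small synthetic regression problem (assumption: stands in for the housing data).
X_demo, y_demo = make_regression(
    n_samples=600, n_features=8, noise=10.0, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Every combination of these values is tried once (2 * 2 * 2 = 8 fits).
param_grid = ParameterGrid({
    "n_estimators": [10, 30],
    "min_samples_split": [15, 20],
    "min_samples_leaf": [5, 10],
})

rows = []
for params in param_grid:
    rf = RandomForestRegressor(random_state=0, n_jobs=-1, **params)
    rf.fit(X_tr, y_tr)
    val_mse = mean_squared_error(y_va, rf.predict(X_va))
    rows.append({"MSE": val_mse, **params})

# Sort so the best (lowest validation MSE) combination comes first.
grid_results = pd.DataFrame(rows).sort_values("MSE").reset_index(drop=True)
print(grid_results.head())
```

Fixing `random_state` makes repeated runs comparable, which helps when the differences between settings are as small as in the table above.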




In [4]:
# Extract best parameters.

# Find the index of the row where MSE has the minimum value
min_index = results['MSE'].idxmin()

print("Index of the row with the minimum value in MSE:", min_index)

min_row = results.loc[min_index]
print("Row with the minimum value in MSE:")
print(min_row)


Index of the row with the minimum value in MSE: 12
Row with the minimum value in MSE:
MSE                    0.283572
n_estimators         500.000000
min_samples_split     15.000000
min_samples_leaf       5.000000
Name: 12, dtype: float64
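Note that the extracted row has dtype `float64`, so `n_estimators` comes back as `500.0`. Recent scikit-learn versions validate parameter types and may reject a float here, so it is safer to convert the values to integers before passing them to the model. A minimal sketch, using a few rows copied from the table above (`results_demo` and `best_params` are names introduced for this example):

```python
import pandas as pd

# A few rows copied from the results table above.
results_demo = pd.DataFrame(
    [
        [0.290417, 20, 15, 5],
        [0.283999, 200, 15, 5],
        [0.283572, 500, 15, 5],
        [0.293775, 500, 15, 10],
    ],
    columns=["MSE", "n_estimators", "min_samples_split", "min_samples_leaf"],
)

# Row with the lowest validation MSE (a float64 Series).
best_row = results_demo.loc[results_demo["MSE"].idxmin()]

# Drop the score and cast the remaining values back to int.
best_params = {name: int(value) for name, value in best_row.drop("MSE").items()}
print(best_params)
```

The resulting dictionary can then be unpacked into the constructor, for example `ensemble.RandomForestRegressor(**best_params)`, instead of typing the values by hand.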


In [6]:
# Initialize your final model with the best parameters found above
rf_optimized = ensemble.RandomForestRegressor(
    n_estimators=500,
    min_samples_split=15,
    min_samples_leaf=5,
)
# Use both training and validation data to fit it
# (np.concatenate stacks the arrays row-wise, like rbind in R)
rf_optimized.fit(np.concatenate([X_train, X_val]), np.concatenate([y_train, y_val]))

# Predict on test data
y_test_hat_optimized = rf_optimized.predict(X_test)

# Obtain and check MSE on test data
mse_optimized = mean_squared_error(y_test, y_test_hat_optimized)
print(f'Optimized RF achieved MSE = {round(mse_optimized, 2)}.')

Optimized RF achieved MSE = 0.26.
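For readers unfamiliar with `np.concatenate`, the toy example below shows how it stacks the training and validation arrays along the first axis (rows) before the final fit. The small arrays are made up purely for illustration.

```python
import numpy as np

# Toy "training" and "validation" feature matrices with 2 columns each.
X_a = np.arange(6).reshape(3, 2)      # 3 rows
X_b = np.arange(6, 10).reshape(2, 2)  # 2 rows

# Matching toy targets.
y_a = np.array([0.1, 0.2, 0.3])
y_b = np.array([0.4, 0.5])

# Default axis=0 appends rows, so features and targets stay aligned.
X_all = np.concatenate([X_a, X_b])
y_all = np.concatenate([y_a, y_b])

print(X_all.shape, y_all.shape)
```

The number of columns must match between the arrays being stacked; only the number of rows grows.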
