# Exercise - RFs for regression

1. Use the **fetch_california_housing** data (remember to split your data into training, validation, and test sets). Using your training and validation data, optimize the parameters of your RF. How well does your optimized model perform on the test data?
   
**Note**: This dataset is **much** larger than what we have otherwise been using. This means you cannot try a million different things without the code running very slowly!

**See slides for more details!**
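
One way to keep run times manageable on a dataset of this size is to train the trees in parallel and, optionally, to build each tree on a subsample of the rows. The block below is only a sketch on synthetic data: `make_regression` stands in for the housing data so it runs without a download, and the specific values (50 trees, 50% of the rows per tree) are assumptions to tune rather than recommendations. `n_jobs` and `max_samples` are existing `RandomForestRegressor` parameters.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the housing data (assumption: 8 features, like the real set).
X_demo, y_demo = make_regression(n_samples=2000, n_features=8, noise=10.0, random_state=0)

rf_fast = RandomForestRegressor(
    n_estimators=50,
    n_jobs=-1,        # use all available CPU cores
    max_samples=0.5,  # each tree is built on a bootstrap sample of 50% of the rows
    random_state=0,   # fixed seed so repeated runs are comparable
)
rf_fast.fit(X_demo, y_demo)

preds_demo = rf_fast.predict(X_demo[:5])
print(preds_demo.shape)
```

Subsampling trades a little accuracy for speed, so it is probably most useful while exploring; the final model can be refit without it.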

In [1]:
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# from sklearn.datasets import clear_data_home
# clear_data_home() #Clears the data cache

housing_data = fetch_california_housing()
print(housing_data.DESCR)

X, y = fetch_california_housing(return_X_y=True, as_frame=True)

# Use `train_test_split` to split your data into a train and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Use `train_test_split` to split your train data into a train and a validation set.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

print(X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

# Exercise 1

Use the **fetch_california_housing** data (remember to split your data into training, validation, and test sets). Using your training and validation data, optimize the parameters of your RF. How well does your optimized model perform on the test data?

Let us start by ensuring we can just run an RF without any optimization. Note how it is slower than a lot of what we have done so far!

In [2]:
rf_current = ensemble.RandomForestRegressor()
rf_current.fit(X_train, y_train)
y_val_hat = rf_current.predict(X_val)
mse = mean_squared_error(y_val, y_val_hat)

print(f'RF with default settings has validation MSE of {mse}.')

RF with default settings has validation MSE of 0.27654700171374774.
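
An MSE value is easier to judge next to a trivial baseline. As a hedged illustration, the sketch below compares a random forest with a model that always predicts the training mean. It uses synthetic data (`make_friedman1`) instead of the housing data so that it runs offline; the dataset, sizes, and seeds are assumptions made for the example only.

```python
from sklearn.datasets import make_friedman1
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (stand-in for the housing data).
X_demo, y_demo = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Baseline: always predict the mean of the training targets.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
mse_baseline = mean_squared_error(y_va, baseline.predict(X_va))

# Random forest with default hyperparameters.
rf_demo = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
mse_rf = mean_squared_error(y_va, rf_demo.predict(X_va))

print(f"Baseline validation MSE: {mse_baseline:.3f}")
print(f"RF validation MSE:       {mse_rf:.3f}")
```

The same comparison can be run on `X_train`/`X_val` from the housing split to see how far the default RF is from the mean-only baseline.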


In [3]:
# Remember you can try other stuff than these specific parameters.
# Just here to get you started!

n_estimators_list = [20, 200, 500]  # Add more values to explore
min_samples_split_list = [15, 20]   # Add more values to explore
min_samples_leaf_list = [5, 10, 15]

results = []

for n_estimators in n_estimators_list:
    for min_samples_split in min_samples_split_list:
        for min_samples_leaf in min_samples_leaf_list:
            rf_current = ensemble.RandomForestRegressor(
                n_estimators=n_estimators,
                min_samples_split=min_samples_split,
                min_samples_leaf=min_samples_leaf,
                )
            rf_current.fit(X_train, y_train)
            y_val_hat = rf_current.predict(X_val)
            mse = mean_squared_error(y_val, y_val_hat)

            results.append([mse, n_estimators, min_samples_split, min_samples_leaf])

results = pd.DataFrame(
    results,
    columns=['MSE', 'n_estimators', 'min_samples_split', 'min_samples_leaf'],
)
print(results)
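
The nested loops above can also be written with scikit-learn's `ParameterGrid`, which enumerates every combination of a parameter dictionary. The following is a minimal sketch under stated assumptions: it runs on synthetic data (`make_friedman1`) with a small grid so that it finishes quickly, and the names `grid_spec`, `grid_results`, and `best_params` are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in for the housing data.
X_demo, y_demo = make_friedman1(n_samples=600, noise=1.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Hypothetical small grid; extend the lists to explore more settings.
grid_spec = {
    "n_estimators": [20, 50],
    "min_samples_split": [15, 20],
    "min_samples_leaf": [5, 10],
}

rows = []
for params in ParameterGrid(grid_spec):
    model = RandomForestRegressor(random_state=0, n_jobs=-1, **params)
    model.fit(X_tr, y_tr)
    mse = mean_squared_error(y_va, model.predict(X_va))
    rows.append({"MSE": mse, **params})

# Sort so that the best (lowest-MSE) combination is in row 0.
grid_results = pd.DataFrame(rows).sort_values("MSE").reset_index(drop=True)
best_params = {name: int(grid_results.loc[0, name]) for name in grid_spec}

print(grid_results)
print("Best parameters:", best_params)
```

Fixing `random_state` makes the comparison between parameter settings less sensitive to run-to-run noise, which matters when the MSE differences are small.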

In [None]:
# Extract the best parameters: the row with the lowest validation MSE.
min_index = results['MSE'].idxmin()
print('Index of the row with the minimum MSE:', min_index)

min_row = results.loc[min_index]
print(min_row)

In [None]:
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

In [None]:
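# Initialize the final model with the best parameters found on the validation data.
# Values are cast to int because a row of `results` is returned as floats.
best = results.loc[results['MSE'].idxmin()]
final_rf = ensemble.RandomForestRegressor(
    n_estimators=int(best['n_estimators']),
    min_samples_split=int(best['min_samples_split']),
    min_samples_leaf=int(best['min_samples_leaf']),
    )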

# Fit on training and validation data combined (pd.concat stacks the rows, like rbind in R,
# and keeps the column names so they match X_test at prediction time)
final_rf.fit(pd.concat([X_train, X_val]), pd.concat([y_train, y_val]))

# Predict on test data
y_test_hat_optimized = final_rf.predict(X_test)

# Compute the MSE on the test data
mse_test = mean_squared_error(y_test, y_test_hat_optimized)
print(f'Optimized RF achieved test MSE = {mse_test:.4f} (lower values indicate better performance).')
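
Because the target is expressed in units of $100,000, an MSE is easier to interpret after taking its square root (RMSE) and converting to dollars. The sketch below does this for the default-settings validation MSE printed earlier in this notebook; the same conversion would apply to the test MSE. The variable names are introduced here for illustration.

```python
import math

# Validation MSE reported earlier for the RF with default settings.
val_mse = 0.27654700171374774

rmse = math.sqrt(val_mse)       # back in target units (hundreds of thousands of dollars)
rmse_dollars = rmse * 100_000   # convert to dollars

print(f"RMSE = {rmse:.3f} (x $100,000), i.e. roughly ${rmse_dollars:,.0f}")
```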
