![logo](https://github.com/donatellacea/DL_tutorials/blob/main/notebooks/figures/1128-191-max.png?raw=true)

# Modeling with Random Forests

In this notebook, we will show you how to train a Random Forest regressor or classifier. You will learn how to tune your Random Forest model to achieve the best performance.

--------

## Getting Started

### Setup Colab environment

If you installed the packages and requirements on your own machine, you can skip this section and start from the import section.
Otherwise, you can follow and execute the tutorial in your browser. To start working on the notebook, click on the following button. This will open this page in the Colab environment, where you can execute the code yourself.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HelmholtzAI-Consultants-Munich/XAI-Tutorials/blob/main/data_and_models/Dataset-Penguins_Model-RandomForest.ipynb)



Now that you are viewing the notebook in Colab, run the next cell to install the packages we will use.
There are a few things you should do to set the notebook up properly:

1. Warning: This notebook was not authored by Google. *Click* on 'Run anyway'.
2. When the installation commands are done, there might be a "Restart runtime" button at the end of the output. If so, please *click* it.

In [1]:
# no additional installations needed

By running the next cell, you will create a folder in your Google Drive. All the files for this tutorial will be uploaded to this folder. After the first execution, you might receive some warnings and notifications. Please follow these instructions:
1. Permit this notebook to access your Google Drive files? *Click* on 'Yes', and select your account.
2. Google Drive for desktop wants to access your Google Account. *Click* on 'Allow'.

At this point, a folder has been created and you can navigate it through the left-hand panel in Colab. You might also have received an email informing you about the access to your Google Drive.

In [2]:
# Create a folder in your Google Drive
# from google.colab import drive                                                                          
# drive.mount('/content/drive')

In [3]:
# %cd drive/MyDrive

In [4]:
# Don't run this cell if you already cloned the repo 
# !git clone https://github.com/HelmholtzAI-Consultants-Munich/XAI-Tutorials.git

In [5]:
# %cd XAI-Tutorials

### Imports

Let's start with importing all required Python packages.

In [6]:
# Load the required packages
import joblib
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

We fix the random seeds to ensure reproducible results, as we work with (pseudo) random numbers.

In [7]:
# assert reproducible random number generation
seed = 1
np.random.seed(seed)

--------

## Random Forest Models

*Note: Please visit our [Introduction to Random Forest Models](https://xai-tutorials.readthedocs.io/en/latest/_ml_basics/random_forest.html) to get more theoretical background information on the Random Forest algorithm.*

In the subsequent sections, we will show you how to train a Random Forest model for regression and for binary or multi-class classification. But before we start training our Random Forest model, try to answer the following questions:

<font color='green'>

#### Question 1: What are model hyperparameters and why do we have to tune them?

<font color='grey'>

#### Your Answer: 

While model parameters are learned during training — such as the slope and intercept in a linear regression — hyperparameters must be set by the data scientist before training. Hyperparameter tuning is essential for the overall performance of the machine learning model. The best hyperparameters are usually impossible to determine ahead of time, and tuning a model is where machine learning turns from a science into trial-and-error based engineering.
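
To make the distinction concrete, here is a minimal sketch using a linear regression on made-up toy data (the values and variable names are illustrative assumptions, not part of the tutorial datasets). The hyperparameter `fit_intercept` is set before training, while the slope and intercept are learned during `fit`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data that follows y = 2x + 1 exactly (illustrative values only)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 2.0 * X_toy.ravel() + 1.0

# Hyperparameter: chosen by us before training
linreg = LinearRegression(fit_intercept=True)
print("Hyperparameters:", linreg.get_params())

# Parameters: learned from the data during fit
linreg.fit(X_toy, y_toy)
print("Learned slope:", linreg.coef_[0])
print("Learned intercept:", linreg.intercept_)
```

The same split applies to Random Forests: the number of trees is a hyperparameter, while the split thresholds inside each tree are learned.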

<font color='green'>

#### Question 2: How do we find the best hyperparameters?

<font color='grey'>

#### Your Answer: 

Hyperparameter tuning relies more on experimental results than on theory, so the best way to determine the optimal settings is to try many different combinations and evaluate the performance of each model. However, evaluating each model only on the training set can lead to one of the most fundamental problems in machine learning: overfitting. If we optimize the model for the training data, it will score very well on the training set but will not generalize to new data, such as a test set. A model that performs well on the training set but poorly on the test set is said to overfit: it knows the training set very well but cannot be applied to new problems. An overfit model may look impressive on the training set, but it will be useless in a real application. Therefore, the standard procedure for hyperparameter optimization accounts for overfitting through cross-validation.
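
The gap between training and cross-validated performance can be illustrated with a small sketch. It assumes a synthetic, deliberately noisy dataset from `make_classification` and a single fully grown decision tree; both are stand-ins chosen for illustration, not the models or data used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% randomly flipped labels (illustrative assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)

# A tree without depth limit can memorize the training set, noise included
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_demo, y_demo)
train_acc = tree.score(X_demo, y_demo)

# 5-fold cross-validation scores the same kind of model on held-out folds
cv_acc = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=5
).mean()

print(f"Training accuracy:        {train_acc:.2f}")
print(f"Cross-validated accuracy: {cv_acc:.2f}")
```

A large difference between the two numbers is the typical signature of overfitting.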

### Hyperparameters of Random Forest

The hyperparameters of the model are configured up-front and are provided by the caller of the model before the model is trained. They guide the learning process for a specific dataset and hence, they are very important for training a machine learning model. 

Some important hyperparameters for Random Forest models:

- `n_estimators` = number of trees in the model
- `criterion` = a function to measure the quality of the split
- `max_depth` = maximal depth of the tree (the longest path between the root node and the leaf node)
- `max_samples` = the number or fraction of training samples drawn (with replacement) to train each tree in the forest
- `max_features` = maximum number of features to consider when doing a split

The full list of hyperparameters of the Random Forest models can be found in the scikit-learn documentation.
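
As a small sketch, the hyperparameters listed above can be passed directly to the constructor of a scikit-learn Random Forest. The concrete values below are arbitrary choices for illustration. Note that the scikit-learn argument is spelled `max_samples`, and that it only takes effect when bootstrapping is enabled (the default).

```python
from sklearn.ensemble import RandomForestClassifier

# Arbitrary example values, chosen only to show where each hyperparameter goes
rf_demo = RandomForestClassifier(
    n_estimators=200,      # number of trees
    criterion="gini",      # split quality measure
    max_depth=5,           # maximal depth of each tree
    max_samples=0.8,       # fraction of samples drawn per tree
    max_features="sqrt",   # features considered per split
    random_state=0,
)

# get_params() returns all hyperparameters, including the defaults we did not set
params = rf_demo.get_params()
for name in ["n_estimators", "criterion", "max_depth", "max_samples", "max_features"]:
    print(f"{name} = {params[name]!r}")
```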

Now that we have learned about the hyperparameters of Random Forest and the choices they offer, it is time to choose the optimal hyperparameters for our model. We will systematically search through different values for the Random Forest hyperparameters and choose the set of hyperparameters that results in the model with the best performance on held-out validation data. To do this, we will define a search space as a grid of hyperparameter values and evaluate every position in the grid. This hyperparameter optimization technique is called **grid search**. To evaluate the grid-search results, we can use n-fold cross-validation. The n-fold cross-validation strategy splits the training data into n folds, trains the model on n-1 folds, and tests its performance on the remaining fold, iterating until each fold has served as the validation fold once. Hence, the reported score is the average score across the n validation folds.
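
The n-fold procedure can be written out by hand, which may help to see what a cross-validated score actually averages. The sketch below assumes a synthetic dataset and a small forest of 50 trees (both illustrative choices) and compares the manual loop with scikit-learn's `cross_val_score` on the same folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption, not one of the tutorial datasets)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual 5-fold cross-validation: train on 4 folds, validate on the remaining one
fold_scores = []
for train_idx, val_idx in kfold.split(X_demo):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))
manual_mean = np.mean(fold_scores)

# The same computation done by scikit-learn with the same splitter
auto_mean = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0), X_demo, y_demo, cv=kfold
).mean()

print(f"Manual mean accuracy:          {manual_mean:.4f}")
print(f"cross_val_score mean accuracy: {auto_mean:.4f}")
```

When an integer such as `cv=5` is passed for a classifier, scikit-learn uses stratified folds instead of the plain `KFold` shown here, so class proportions are preserved in each fold.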

The grid-search technique searches through every combination of the hyperparameters you define. Hence, the run time can grow very quickly, which you should take into account when training the model. For the sake of example, we will define rather small hyperparameter grids in this notebook.
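
To get a feeling for how quickly the number of model fits grows, the candidates of a grid can be counted before running the search. The sketch below uses scikit-learn's `ParameterGrid` on the classification grid used later in this notebook and assumes 5-fold cross-validation.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the classification examples later in this notebook
grid = {
    'n_estimators': [100, 1000],
    'max_depth': [2, 5, 10],
    'max_samples': [0.8],
    'criterion': ['gini', 'entropy'],
    'max_features': ['sqrt', 'log2'],
}

n_candidates = len(ParameterGrid(grid))  # 2 * 3 * 1 * 2 * 2 combinations
n_folds = 5
n_fits = n_candidates * n_folds

print(f'{n_candidates} candidates x {n_folds} folds = {n_fits} fits')
```

Adding a single hyperparameter with three values would triple this number. With the default `refit=True`, `GridSearchCV` additionally trains the best candidate once more on the full training data.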

--------

## Training a Random Forest Model for Binary Classification

We will now use the preprocessed Breast Cancer dataset (see [*Dataset-BreastCancer.ipynb*](../data_and_models/Dataset-BreastCancer.ipynb) for preprocessing steps) to train a Random Forest Classifier that can predict breast cancer malignancy for patients from 30 numeric features computed from a digitized image of a breast mass. Therefore, let's first load the preprocessed dataset:

In [8]:
# Load the data
data = joblib.load('../data_and_models/data_breastcancer_preprocessed.joblib')
data.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | malignant |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | malignant |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | malignant |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | malignant |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | malignant |


First, we will split the data into a **train and test set**, so the model does not use all the available information for training. That way, we can also check the performance on previously unseen data, mirroring the most probable practical use case.

In [9]:
# A Random Forest instance from sklearn requires a separate input of feature matrix and target values.
# Hence, we will first separate the target and feature columns.
X = data.loc[:, data.columns != 'target']
y = data.target

# split into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=seed)
X_train.reset_index(inplace=True, drop=True)
X_test.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)
y_test.reset_index(inplace=True, drop=True)

print(f'Number of training samples: {len(X_train.index)} with {sum(y_train=="malignant")} malignant and {sum(y_train=="benign")} benign samples.')
print(f'Number of test samples: {len(X_test.index)} with {sum(y_test=="malignant")} malignant and {sum(y_test=="benign")} benign samples.')

Number of training samples: 455 with 170 malignant and 285 benign samples.
Number of test samples: 114 with 42 malignant and 72 benign samples.


In addition, we standardize our features. This is not necessary for tree-based methods, but it is required for many other models. To avoid information leakage between the train and test sets through the standardization procedure, we fit the `StandardScaler` on the training set only and use it to transform both the train and test sets.

In [10]:
scaler = StandardScaler()
scaler.set_output(transform="pandas")
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
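
When a scaler is fit on the whole training set before a cross-validated search, each validation fold has already contributed to the scaling statistics. For tree-based models this should matter little, because splits depend only on the ordering of feature values, but for scale-sensitive models it can bias the validation scores. One possible way to avoid this, sketched here on synthetic data with hypothetical step names `scaler` and `rf`, is to wrap the scaler and the model in a scikit-learn `Pipeline` so that the scaler is refit inside every fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not one of the tutorial datasets)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Scaler and model are fit together, separately for each training fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Hyperparameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"rf__max_depth": [2, 5]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
```

The refit `best_estimator_` is then the whole pipeline, so new data can be passed to `predict` without scaling it manually.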

Next, we define the hyperparameter grid that we want to use for the grid search and store it as a dictionary. Feel free to change the grid based on your knowledge of and research on Random Forest hyperparameters! Just keep an eye on the computation time for now.


In [11]:
hyper_grid_classifier = {'n_estimators': [100, 1000], 
            'max_depth': [2, 5, 10], 
            'max_samples': [0.8],
            'criterion': ['gini', 'entropy'],
            'max_features': ['sqrt','log2']
}

Now we will start the training process. First, we define an instance of the `RandomForestClassifier`. Then, we run the `GridSearchCV` with the 5-fold cross validation using the grid we defined above. 

In [12]:
# Define a classifier. We set the oob_score = True, as OOB is a good approximation of the validation set score
classifier = RandomForestClassifier(oob_score=True, random_state=42, n_jobs=3)

# Define a grid search with 5-fold CV and fit 
gridsearch_classifier = GridSearchCV(classifier, hyper_grid_classifier, cv=5, verbose=1)
gridsearch_classifier.fit(X_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
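
The out-of-bag (OOB) score mentioned in the code comment relies on the fact that each tree is trained on a bootstrap sample, so some training samples are left out for that tree. Each sample is then scored using only the trees that did not see it. The following sketch, on synthetic data and with an arbitrary forest size, compares the OOB score with the score on a separate held-out set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption, not one of the tutorial datasets)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)

# oob_score=True makes the forest evaluate each sample with the trees that did not train on it
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_tr, y_tr)

print(f"OOB accuracy:      {forest.oob_score_:.3f}")
print(f"Held-out accuracy: {forest.score(X_te, y_te):.3f}")
```

The two numbers are usually close, which is why the OOB score can serve as an inexpensive validation estimate without setting data aside.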


Then we can check how well the best model performed during cross-validation and which hyperparameters led to the best results.

In [13]:
# Check the results
print(f'The mean cross-validated score of the best model is {round(gridsearch_classifier.best_score_*100, 2)}% accuracy and the parameters of best prediction model are:')
print(gridsearch_classifier.best_params_)

The mean cross-validated score of the best model is 95.38% accuracy and the parameters of best prediction model are:
{'criterion': 'gini', 'max_depth': 10, 'max_features': 'sqrt', 'max_samples': 0.8, 'n_estimators': 1000}
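
Besides the single best candidate, it can be informative to look at the scores of all candidates, for example to see whether several settings perform almost equally well. A possible way to do this is the `cv_results_` attribute of `GridSearchCV`, shown here in a minimal sketch on synthetic data with a deliberately tiny grid (both are assumptions made to keep the example fast).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a tiny grid, only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
demo_grid = {"n_estimators": [10, 50], "max_depth": [2, 5]}

search = GridSearchCV(RandomForestClassifier(random_state=0), demo_grid, cv=3)
search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays with one entry per candidate
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))
```

The row with `rank_test_score` equal to 1 corresponds to `best_params_`, and its `mean_test_score` is the value reported as `best_score_`.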


The model with the best hyperparameters is stored as `best_estimator_` in the `GridSearchCV` instance. Note that the returned model is a Random Forest Classifier that was refit on the whole training dataset using the best parameters found.

We can estimate the training, validation and test score, using the training, OOB and test set, respectively.

In [14]:
# Take the best estimator
rf = gridsearch_classifier.best_estimator_

# is the model performing reasonably on the training data?
print(f'Model Performance on training data: {round(rf.score(X_train, y_train)*100,2)} % subset accuracy.')

# is the model performing reasonably on the OOB data?
print(f'Model Performance on OOB data: {round(rf.oob_score_*100,2)} % subset accuracy.')

# is the model performing reasonably on the test data?
print(f'Model Performance on test data: {round(rf.score(X_test, y_test)*100,2)} % subset accuracy.')

Model Performance on training data: 100.0 % subset accuracy.
Model Performance on OOB data: 95.82 % subset accuracy.
Model Performance on test data: 95.61 % subset accuracy.


Great, you have now trained your Random Forest model! It generalizes well, reaching an accuracy of about 95.6% on the test set.

*Note: if your classes are strongly imbalanced, it is not recommended to use simple accuracy as a performance score. If all classes of the imbalanced dataset are equally important, using the macro accuracy (the per-class accuracy averaged over classes with equal weight) is recommended, as it treats all classes equally.*
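
The effect described in the note can be seen in a small made-up example. The sketch below interprets "macro accuracy" as scikit-learn's `balanced_accuracy_score` (the unweighted mean of the per-class recalls), which is an assumption about the intended metric. The labels are invented: 90 samples of class 0, 10 of class 1, and a trivial model that always predicts class 0.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Invented, strongly imbalanced labels: 90 x class 0, 10 x class 1
y_true = np.array([0] * 90 + [1] * 10)

# A trivial "model" that always predicts the majority class
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)           # fraction of correct predictions
bal = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recalls

print(f"Accuracy:          {acc:.2f}")
print(f"Balanced accuracy: {bal:.2f}")
```

Plain accuracy rewards the trivial model with 0.90, while the balanced score of 0.50 reveals that the minority class is never detected.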

Let's now save the model in a ``joblib`` file, such that we can load the trained model into other notebooks later on.

In [15]:
# Save the model with joblib
data_and_model = [X_train, X_test, y_train, y_test, rf, scaler]
joblib.dump(data_and_model, './model_randomforest_breastcancer.joblib')
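
Loading works symmetrically: `joblib.load` returns the saved list, which can be unpacked in the order in which it was stored. The sketch below demonstrates the round trip with a small model on synthetic data and a temporary file (the file name `demo_model.joblib` is hypothetical), so it does not touch the files written by this notebook.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model on synthetic data (assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    # Save data and model together as one list
    joblib.dump([X_demo, y_demo, model], path)
    # Load the list and unpack it in the same order
    X_loaded, y_loaded, model_loaded = joblib.load(path)

# The reloaded model reproduces the predictions of the original one
same = bool((model_loaded.predict(X_loaded) == model.predict(X_demo)).all())
print("Identical predictions after reload:", same)
```

Keep in mind that joblib files are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that created them.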

--------

## Training a Random Forest Model for Multiclass Classification

### Dataset: Wine

We will now use the preprocessed Wine dataset (see [*Dataset-Wine.ipynb*](../data_and_models/Dataset-Wine.ipynb) for preprocessing steps) to train a Random Forest Classifier that can predict the wine class from different chemical properties. Therefore, let's first load the preprocessed dataset:

In [16]:
# Load the data
data = joblib.load('../data_and_models/data_wine_preprocessed.joblib')
data.head()

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |


First, we will split the data into a **train and test set**, so the model does not use all the available information for training. That way, we can also check the performance on previously unseen data, mirroring the most probable practical use case.

In [17]:
# A Random Forest instance from sklearn requires a separate input of feature matrix and target values.
# Hence, we will first separate the target and feature columns.
X = data.loc[:, data.columns != 'target']
y = data.target

# split into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=seed)
X_train.reset_index(inplace=True, drop=True)
X_test.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)
y_test.reset_index(inplace=True, drop=True)

print(f'Number of training samples: {len(X_train.index)} with {sum(y_train==0)} class 0, {sum(y_train==1)} class 1 and {sum(y_train==2)} class 2 samples.')
print(f'Number of test samples: {len(X_test.index)} with {sum(y_test==0)} class 0, {sum(y_test==1)} class 1 and {sum(y_test==2)} class 2 samples.')

Number of training samples: 142 with 45 class 0, 58 class 1 and 39 class 2 samples.
Number of test samples: 36 with 14 class 0, 13 class 1 and 9 class 2 samples.


In addition, we standardize our features. This is not necessary for tree-based methods, but it is required for many other models. To avoid information leakage between the train and test sets through the standardization procedure, we fit the `StandardScaler` on the training set only and use it to transform both the train and test sets.

In [18]:
scaler = StandardScaler()
scaler.set_output(transform="pandas")
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Next, we define the hyperparameter grid that we want to use for the grid search and store it as a dictionary. Feel free to change the grid based on your knowledge of and research on Random Forest hyperparameters! Just keep an eye on the computation time for now.


In [19]:
hyper_grid_classifier = {'n_estimators': [100, 1000], 
            'max_depth': [2, 5, 10], 
            'max_samples': [0.8],
            'criterion': ['gini', 'entropy'],
            'max_features': ['sqrt','log2']
}

Now we will start the training process. First, we define an instance of the `RandomForestClassifier`. Then, we run the `GridSearchCV` with the 5-fold cross validation using the grid we defined above. 

In [20]:
# Define a classifier. We set the oob_score = True, as OOB is a good approximation of the validation set score
classifier = RandomForestClassifier(oob_score=True, random_state=42, n_jobs=3)

# Define a grid search with 5-fold CV and fit 
gridsearch_classifier = GridSearchCV(classifier, hyper_grid_classifier, cv=5, verbose=1)
gridsearch_classifier.fit(X_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


Then we can check how well the best model performed during cross-validation and which hyperparameters led to the best results.

In [21]:
# Check the results
print(f'The mean cross-validated score of the best model is {round(gridsearch_classifier.best_score_*100, 2)}% accuracy and the parameters of best prediction model are:')
print(gridsearch_classifier.best_params_)

The mean cross-validated score of the best model is 100.0% accuracy and the parameters of best prediction model are:
{'criterion': 'entropy', 'max_depth': 5, 'max_features': 'sqrt', 'max_samples': 0.8, 'n_estimators': 100}


The model with the best hyperparameters is stored as `best_estimator_` in the `GridSearchCV` instance. Note that the returned model is a Random Forest Classifier that was refit on the whole training dataset using the best parameters found.

We can estimate the training, validation and test score, using the training, OOB and test set, respectively.

In [22]:
# Take the best estimator
rf = gridsearch_classifier.best_estimator_

# is the model performing reasonably on the training data?
print(f'Model Performance on training data: {round(rf.score(X_train, y_train)*100,2)} % subset accuracy.')

# is the model performing reasonably on the OOB data?
print(f'Model Performance on OOB data: {round(rf.oob_score_*100,2)} % subset accuracy.')

# is the model performing reasonably on the test data?
print(f'Model Performance on test data: {round(rf.score(X_test, y_test)*100,2)} % subset accuracy.')

Model Performance on training data: 100.0 % subset accuracy.
Model Performance on OOB data: 99.3 % subset accuracy.
Model Performance on test data: 97.22 % subset accuracy.


Great, you have now trained your Random Forest model! It generalizes well, reaching an accuracy of about 97.2% on the test set.

*Note: if your classes are strongly imbalanced, it is not recommended to use simple accuracy as a performance score. If all classes of the imbalanced dataset are equally important, using the macro accuracy (the per-class accuracy averaged over classes with equal weight) is recommended, as it treats all classes equally.*
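
For a multi-class problem, a confusion matrix shows which classes are mixed up, and the per-class scores behind a macro average can be read off its rows. The sketch below uses invented labels for three classes (not predictions from the model above) and assumes that the macro score of interest is the macro-averaged recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Invented true and predicted labels for three classes
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class recall: correct predictions on the diagonal divided by the row sums
per_class_recall = cm.diagonal() / cm.sum(axis=1)
macro_recall = recall_score(y_true, y_pred, average="macro")

print("Per-class recall:", np.round(per_class_recall, 3))
print("Macro-averaged recall:", round(macro_recall, 3))
```

Each class contributes equally to the macro average, regardless of how many samples it has.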

Let's now save the model in a ``joblib`` file, such that we can load the trained model into other notebooks later on.

In [23]:
# Save the model with joblib
data_and_model = [X_train, X_test, y_train, y_test, rf, scaler]
joblib.dump(data_and_model, './model_randomforest_wine.joblib')

### Dataset: Penguins

We will now use the preprocessed Penguins dataset (see [*Dataset-Penguins.ipynb*](../data_and_models/Dataset-Penguins.ipynb) for preprocessing steps) to train a Random Forest Classifier that can predict the species of Palmer penguins from the features *bill length*, *bill depth*, *flipper length*, *body mass*, *year*, *island* and *sex*. Therefore, let's first load the preprocessed dataset:

In [24]:
# Load the data
data = joblib.load('../data_and_models/data_penguins_preprocessed.joblib')
data.head()

| | species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year | island_Dream | island_Torgersen | sex_male |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | 39.1 | 18.7 | 181.0 | 3750.0 | 2007 | 0 | 1 | 1 |
| 1 | Adelie | 39.5 | 17.4 | 186.0 | 3800.0 | 2007 | 0 | 1 | 0 |
| 2 | Adelie | 40.3 | 18.0 | 195.0 | 3250.0 | 2007 | 0 | 1 | 0 |
| 4 | Adelie | 36.7 | 19.3 | 193.0 | 3450.0 | 2007 | 0 | 1 | 0 |
| 5 | Adelie | 39.3 | 20.6 | 190.0 | 3650.0 | 2007 | 0 | 1 | 1 |


First, we will split the data into a **train and test set**, so the model does not use all the available information for training. That way, we can also check the performance on previously unseen data, mirroring the most probable practical use case.

In [25]:
# A Random Forest instance from sklearn requires a separate input of feature matrix and target values.
# Hence, we will first separate the target and feature columns.
X = data.loc[:, data.columns != 'species']
y = data.species

# split into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=seed)
X_train.reset_index(inplace=True, drop=True)
X_test.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)
y_test.reset_index(inplace=True, drop=True)

print(f'Number of training samples: {len(X_train.index)} with {sum(y_train=="Adelie")} Adelie, {sum(y_train=="Chinstrap")} Chinstrap and {sum(y_train=="Gentoo")} Gentoo samples.')
print(f'Number of test samples: {len(X_test.index)} with {sum(y_test=="Adelie")} Adelie, {sum(y_test=="Chinstrap")} Chinstrap and {sum(y_test=="Gentoo")} Gentoo samples.')

Number of training samples: 266 with 118 Adelie, 55 Chinstrap and 93 Gentoo samples.
Number of test samples: 67 with 28 Adelie, 13 Chinstrap and 26 Gentoo samples.


In addition, we standardize our features. This is not necessary for tree-based methods, but it is required for many other models. To avoid information leakage between the train and test sets through the standardization procedure, we fit the `StandardScaler` on the training set only and use it to transform both the train and test sets.

In [26]:
scaler = StandardScaler()
scaler.set_output(transform="pandas")
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Next, we define the hyperparameter grid that we want to use for the grid search and store it as a dictionary. Feel free to change the grid based on your knowledge of and research on Random Forest hyperparameters! Just keep an eye on the computation time for now.


In [27]:
hyper_grid_classifier = {'n_estimators': [100, 1000], 
            'max_depth': [2, 5, 10], 
            'max_samples': [0.8],
            'criterion': ['gini', 'entropy'],
            'max_features': ['sqrt','log2']
}

Now we will start the training process. First, we define an instance of the `RandomForestClassifier`. Then, we run the `GridSearchCV` with the 5-fold cross validation using the grid we defined above. 

In [28]:
# Define a classifier. We set the oob_score = True, as OOB is a good approximation of the validation set score
classifier = RandomForestClassifier(oob_score=True, random_state=42, n_jobs=3)

# Define a grid search with 5-fold CV and fit 
gridsearch_classifier = GridSearchCV(classifier, hyper_grid_classifier, cv=5, verbose=1)
gridsearch_classifier.fit(X_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


Then we can check how well the best model performed during cross-validation and which hyperparameters led to the best results.

In [29]:
# Check the results
print(f'The mean cross-validated score of the best model is {round(gridsearch_classifier.best_score_*100, 2)}% accuracy and the parameters of best prediction model are:')
print(gridsearch_classifier.best_params_)

The mean cross-validated score of the best model is 98.88% accuracy and the parameters of best prediction model are:
{'criterion': 'gini', 'max_depth': 5, 'max_features': 'sqrt', 'max_samples': 0.8, 'n_estimators': 100}


The model with the best hyperparameters is stored as `best_estimator_` in the `GridSearchCV` instance. Note that the returned model is a Random Forest Classifier that was refit on the whole training dataset using the best parameters found.

We can estimate the training, validation and test score, using the training, OOB and test set, respectively.

In [30]:
# Take the best estimator
rf = gridsearch_classifier.best_estimator_

# is the model performing reasonably on the training data?
print(f'Model Performance on training data: {round(rf.score(X_train, y_train)*100,2)} % subset accuracy.')

# is the model performing reasonably on the OOB data?
print(f'Model Performance on OOB data: {round(rf.oob_score_*100,2)} % subset accuracy.')

# is the model performing reasonably on the test data?
print(f'Model Performance on test data: {round(rf.score(X_test, y_test)*100,2)} % subset accuracy.')

Model Performance on training data: 99.62 % subset accuracy.
Model Performance on OOB data: 98.87 % subset accuracy.
Model Performance on test data: 95.52 % subset accuracy.


Great, you have now trained your Random Forest model! It generalizes well, reaching an accuracy of about 95.5% on the test set.

*Note: if your classes are strongly imbalanced, it is not recommended to use simple accuracy as a performance score. If all classes of the imbalanced dataset are equally important, using the macro accuracy (the per-class accuracy averaged over classes with equal weight) is recommended, as it treats all classes equally.*

Let's now save the model in a ``joblib`` file, such that we can load the trained model into other notebooks later on.

In [31]:
# Save the model with joblib
data_and_model = [X_train, X_test, y_train, y_test, rf, scaler]
joblib.dump(data_and_model, './model_randomforest_penguins.joblib')

--------

## Training a Random Forest Model for Regression

We will now use the preprocessed California Housing dataset (see [*Dataset-Housing.ipynb*](../data_and_models/Dataset-Housing.ipynb) for preprocessing steps) to train a Random Forest Regressor that can predict the prices of housing blocks from the 8 descriptive features. Therefore, let's first load the preprocessed dataset:

In [32]:
# Load the data
data = joblib.load('../data_and_models/data_housing_preprocessed.joblib')

# for sake of runtime we only use the first 1000 samples
data = data.iloc[:1000]
data.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


First, we will split the data into a **train and test set**, so the model does not use all the available information for training. That way, we can also check the performance on previously unseen data, mirroring the most probable practical use case.

In [33]:
# A Random Forest instance from sklearn requires a separate input of feature matrix and target values.
# Hence, we will first separate the target and feature columns.
X = data.loc[:, data.columns != 'MedHouseVal']
y = data.MedHouseVal

# split into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=seed)
X_train.reset_index(inplace=True, drop=True)
X_test.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)
y_test.reset_index(inplace=True, drop=True)

print(f'Number of training samples: {len(X_train.index)} samples.')
print(f'Number of test samples: {len(X_test.index)} samples.')

Number of training samples: 800 samples.
Number of test samples: 200 samples.


In addition, we standardize our features. This is not necessary for tree-based methods, but it is required for many other models. To avoid information leakage between the train and test sets through the standardization procedure, we fit the `StandardScaler` on the training set only and use it to transform both the train and test sets.

In [34]:
scaler = StandardScaler()
scaler.set_output(transform="pandas")
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Next, we define the hyperparameter grid that we want to use for the grid search and store it as a dictionary. Feel free to change the grid based on your knowledge of and research on Random Forest hyperparameters! Just keep an eye on the computation time for now.


In [35]:
hyper_grid_classifier = {'n_estimators': [100, 1000], 
            'max_depth': [2, 5], 
            'max_samples': [0.8],
            'criterion': ['squared_error', 'poisson'],
            'max_features': ['sqrt','log2'],
            'bootstrap': [True]
}

Now we will start the training process. First, we define an instance of the `RandomForestRegressor`. Then, we run the `GridSearchCV` with the 5-fold cross validation using the grid we defined above. 

In [36]:
# Define a regressor. We set oob_score=True, as the OOB score is a good approximation of the validation set score
classifier = RandomForestRegressor(oob_score=True, random_state=42, n_jobs=3)

# Define a grid search with 5-fold CV and fit 
gridsearch_classifier = GridSearchCV(classifier, hyper_grid_classifier, cv=5, verbose=1)
gridsearch_classifier.fit(X_train, y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits


Then we can check how well the best model performed during cross-validation and which hyperparameters led to the best results.

In [37]:
# Check the results
print(f'The mean cross-validated score of the best model is R^2 score of {round(gridsearch_classifier.best_score_, 2)} and the parameters of best prediction model are:')
print(gridsearch_classifier.best_params_)

The mean cross-validated score of the best model is R^2 score of 0.74 and the parameters of best prediction model are:
{'bootstrap': True, 'criterion': 'squared_error', 'max_depth': 5, 'max_features': 'log2', 'max_samples': 0.8, 'n_estimators': 1000}


The model with the best hyperparameters is stored as `best_estimator_` in the `GridSearchCV` instance. Note that the returned model is a Random Forest Regressor that was refit on the whole training dataset using the best parameters found.

We can estimate the training, validation and test score, using the training, OOB and test set, respectively.

In [38]:
# Take the best estimator
rf = gridsearch_classifier.best_estimator_

# is the model performing reasonably on the training data?
print(f'Model Performance on training data: {round(rf.score(X_train, y_train),2)} R^2 score.')

# is the model performing reasonably on the OOB data?
print(f'Model Performance on OOB data: {round(rf.oob_score_,2)} R^2 score.')

# is the model performing reasonably on the test data?
print(f'Model Performance on test data: {round(rf.score(X_test, y_test),2)} R^2 score.')

Model Performance on training data: 0.84 R^2 score.
Model Performance on OOB data: 0.74 R^2 score.
Model Performance on test data: 0.72 R^2 score.


Great, you have now trained your Random Forest model! It generalizes reasonably well, with an $R^2$ of 0.72 on the test set.

*Note: $R^2$ is the coefficient of determination, and the closer this value is to 1, the better our model explains the data. A constant model that always predicts the average target value, disregarding the input features, would get an $R^2$ score of 0. However, the $R^2$ score can also be negative, because a model can be arbitrarily worse than this constant baseline.*
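
These three cases can be checked numerically. The coefficient of determination is defined as $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the targets from their mean. The sketch below uses four invented target values and scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented target values, only for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect predictions: SS_res = 0, so R^2 = 1
perfect = r2_score(y_true, y_true)

# Constant model predicting the mean: SS_res = SS_tot, so R^2 = 0
mean_pred = np.full_like(y_true, y_true.mean())
constant = r2_score(y_true, mean_pred)

# Predictions in reversed order: SS_res = 20, SS_tot = 5, so R^2 = 1 - 20/5 = -3
worse = r2_score(y_true, y_true[::-1])

print(f"Perfect model:  {perfect:.1f}")
print(f"Constant model: {constant:.1f}")
print(f"Reversed model: {worse:.1f}")
```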

Let's now save the model in a ``joblib`` file, such that we can load the trained model into other notebooks later on.

In [39]:
# Save the model with joblib
data_and_model = [X_train, X_test, y_train, y_test, rf, scaler]
joblib.dump(data_and_model, './model_randomforest_housing.joblib')