<a target="_blank" href="https://colab.research.google.com/github/JLDC/Data-Science-Fundamentals/blob/master/notebooks/109_random-forests.ipynb">
    <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Open this notebook in Google Colab
</a>

___

# Random Forests
___

In this notebook, we will look into **random forests** and use them to predict the sale prices of individual residential properties. We will use data from real estate transactions in Ames, Iowa, collected between 2006 and 2010.

This is a neat *real-world* example. For instance, imagine that you work as a data science consultant for a real estate agency. What you are going to learn below is typical of the work a data scientist might encounter in practice, except that you would surely have spent a few days gathering and cleaning the data beforehand...

[A relatively raw version of the data can be found on Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data). Don't hesitate to have a look so you can get a feel for how much time it takes until you can run your first model. For now, we will simply provide a clean version of the data to save time.

If you want to read more on Random Forests, consult Chapter 8 of the book [Introduction to Statistical Learning](https://www.statlearning.com/) and [this blog](http://uc-r.github.io/2018/04/28/regression-trees/).

___
## Data pre-processing
We already provide clean data, which we obtained from the R package [`AmesHousing`](https://cran.r-project.org/web/packages/AmesHousing/index.html). For more information on the variables, see https://github.com/topepo/AmesHousing or the [Kaggle data description](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).

In [None]:
import matplotlib.pyplot as plt # Plotting
import numpy as np # Numerical computing
import pandas as pd # Dataframes

# Define the path where the data is stored
DATA_PATH = "https://raw.githubusercontent.com/JLDC/Data-Science-Fundamentals/master/data"

In [None]:
# Read the data
ames = pd.read_csv(f"{DATA_PATH}/ames_housing.csv")
ames # Display the data

In [None]:
# Make sure that there are no missing values
ames.isnull().sum().max()

As mentioned above, we are interested in predicting the price of the residential properties, i.e., the `Sale_Price` column. Let's go ahead and create our features and targets.

In [None]:
# Features (notice the np.array(), this is to avoid a warning with random forests)
X = np.array(pd.get_dummies(ames.drop(columns=["Sale_Price"]), drop_first=True))
# Targets
y = ames["Sale_Price"]
X.shape # Display the shape of X

As the above cell shows, if we build dummies for every categorical feature, we end up with a total of 306 features. Luckily for us, a high number of features is not necessarily a problem for random forests. It makes the computation a bit longer, but it should not impact the performance of our estimator much if we are careful not to overfit!
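
To see where the extra columns come from, here is a minimal sketch of dummy encoding on a toy table. The column names and values are made up for illustration and are not taken from the Ames data. With `drop_first=True`, a categorical column with $k$ levels becomes $k-1$ indicator columns, while numeric columns are left untouched.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the Ames dataset)
toy = pd.DataFrame({
    "Lot_Area": [8450, 9600, 11250],        # numeric: kept as is
    "Street": ["Pave", "Grvl", "Pave"],     # 2 levels -> 1 dummy column
    "Heating": ["GasA", "GasW", "Wall"],    # 3 levels -> 2 dummy columns
})

# drop_first=True drops the first level of each categorical column
dummies = pd.get_dummies(toy, drop_first=True)
print(dummies.columns.tolist())
print(dummies.shape)
```

Three original columns turn into four here; the same mechanism applied to the many categorical variables of the housing data explains the large feature count.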

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Create our training and validation data
Xtrain, Xval, ytrain, yval = train_test_split(X, y, test_size=0.33, random_state=72)

___
## Fitting a random forest
As with many of the `scikit-learn` models, the [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor) comes with a lot of hyperparameters. We won't cover them in depth, but when you are working with any model for your work or research, it is important that you understand what those hyperparameters do and which ones you want to tune.

The random forest estimator is a modified form of bootstrap aggregating (**bagging**) of decision trees. The important difference between random forests and bagging is that random forests *de-correlate* the trees, typically by considering only a random subset of the features at each split. Averaging de-correlated trees will typically lower the prediction variance compared to plain bagging.
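
The difference between bagging and a de-correlated forest can be sketched with the `max_features` argument. The snippet below uses synthetic data from `make_regression` (an assumption made purely for illustration, not the Ames data) and compares a forest whose splits see all features with one whose splits see a random third of them. To our knowledge, recent scikit-learn versions default to `max_features=1.0` for the regressor, which corresponds to the bagging case, so it is worth checking the documentation of the version you use.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only for illustration (not the Ames data)
X_demo, y_demo = make_regression(n_samples=300, n_features=20,
                                 n_informative=5, noise=10.0, random_state=0)

# max_features=1.0: every split considers all 20 features (bagged trees)
bagging = RandomForestRegressor(n_estimators=100, max_features=1.0,
                                oob_score=True, random_state=0)
# max_features=1/3: every split considers a random third of the features
decorrelated = RandomForestRegressor(n_estimators=100, max_features=1/3,
                                     oob_score=True, random_state=0)

for name, model in [("bagging", bagging), ("de-correlated", decorrelated)]:
    model.fit(X_demo, y_demo)
    print(f"{name:>14}: out-of-bag R² = {model.oob_score_:.3f}")
```

Which variant scores better depends on the data, so `max_features` is usually treated as a hyperparameter to tune rather than fixed in advance.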

So, how many trees are there in a forest? Well, this is one of the hyperparameters. The default value in `scikit-learn` is 100 but there's no reason that we shouldn't use something else. So we will start by comparing the performance of different forest sizes.

How do we compare the performance? Cross-validation seems like an intuitive choice. However, due to the time it takes to cross-validate many forests on such a large dataset, we will use the **out-of-bag error**.

### Out-of-bag vs. validation error
The typical method of growing a random forest consists of using **bootstrap samples** to build the trees. This implies that, for each tree, some of the training data (roughly a third of the observations) has not been used for the fit. Consequently, we can use this data to approximate how well our tree would perform out-of-sample. The error we compute using this leftover data is called the **out-of-bag error**.
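
The share of observations left out of a bootstrap sample can be checked with a short simulation. This is a minimal sketch with an arbitrary sample size; it only uses NumPy. When drawing $n$ observations with replacement, the probability that a given observation is never drawn is $(1 - 1/n)^n$, which approaches $e^{-1} \approx 0.368$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # arbitrary sample size for the illustration

# Draw n row indices with replacement (one bootstrap sample)
boot_idx = rng.integers(0, n, size=n)

# Mark which observations ended up in the bootstrap sample
in_bag = np.zeros(n, dtype=bool)
in_bag[boot_idx] = True

# Everything that was never drawn is "out-of-bag" for this tree
oob_fraction = 1 - in_bag.mean()
print(f"Out-of-bag fraction: {oob_fraction:.3f} (theory: {np.exp(-1):.3f})")
```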

So why would we use the out-of-bag error instead of the validation error? Well, we have the out-of-bag data *already* due to the way we are building each tree. Of course, we could use a validation set, or even cross-validation, which would probably give a more accurate error estimate. However, this is much more costly to compute, whereas the out-of-bag error is much faster. In a sense, this is a trade-off between speed and accuracy. If your dataset is not too large and you have enough computing power, you might prefer running cross-validation. Let us start with the out-of-bag error to select our number of trees.

The scikit-learn package only implements the R² (recall the notebook `106_ols-train-test-cv.ipynb`) as the out-of-bag score. However, it also provides a vector of out-of-bag predictions, such that we can use this vector to compute our own metric. This makes things slightly more complicated and we will have to add our own computations of MSE on out-of-bag predictions, but it's nothing we can't deal with!
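
As a minimal sketch of this idea, the snippet below fits a small forest on synthetic data (an assumption for illustration only) and computes metrics from the `oob_prediction_` attribute. It also checks that the built-in `oob_score_` agrees with the R² computed from those predictions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data, used only for illustration (not the Ames data)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=5.0, random_state=1)

# oob_score=True is required for oob_score_ and oob_prediction_ to exist
demo_forest = RandomForestRegressor(n_estimators=50, oob_score=True,
                                    random_state=1)
demo_forest.fit(X_demo, y_demo)

# One out-of-bag prediction per TRAINING observation
oob_pred = demo_forest.oob_prediction_
oob_r2 = r2_score(y_demo, oob_pred)
print(f"Built-in oob_score_      : {demo_forest.oob_score_:.4f}")
print(f"R² from oob_prediction_  : {oob_r2:.4f}")
print(f"MSE from oob_prediction_ : {mean_squared_error(y_demo, oob_pred):.2f}")
```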

In [None]:
from sklearn.ensemble import RandomForestRegressor # Random Forest estimator
from sklearn.metrics import mean_absolute_error, mean_squared_error # Metrics

In [None]:
ntrees = [25, 50, 75, 100, 150, 200, 250, 300, 350] # Number of trees
# Instantiate lists to keep track of out-of-bag results
oob_mse = []
oob_mae = []

In [None]:
# Proceed with forest growing (this might take a while)
for trees in ntrees:
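    # Instantiate the estimator (notice the cost-complexity pruning)
    # Note: warm_start only reuses trees when the SAME estimator object is fitted again.
    # Because a new forest is created on each iteration, every forest here is grown from scratch;
    # see https://www.kaggle.com/questions-and-answers/83501 for a more complete explanation of warm_start.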
    forest = RandomForestRegressor(n_estimators=trees, ccp_alpha=0.01, 
                                   oob_score=True, warm_start=True, 
                                   random_state=144)
    # Fit the model
    forest.fit(Xtrain, ytrain)
    # Store the out-of-bag MSE and MAE. Notice how it is on the TRAIN set
    oob_mse.append(mean_squared_error(ytrain, forest.oob_prediction_))
    oob_mae.append(mean_absolute_error(ytrain, forest.oob_prediction_))
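
For reference, here is a hedged sketch of how `warm_start` can be used to grow a single forest incrementally instead of refitting from scratch for every size. It runs on synthetic data (an assumption for illustration), and its numbers will not match those of the loop above, since the trees are built in a different sequence.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data, used only for illustration (not the Ames data)
X_demo, y_demo = make_regression(n_samples=300, n_features=10,
                                 noise=5.0, random_state=2)

# Create the forest ONCE; warm_start=True keeps the trees already fitted
warm_forest = RandomForestRegressor(n_estimators=25, oob_score=True,
                                    warm_start=True, random_state=2)
demo_oob_mse = []
for n_trees in [25, 50, 75, 100]:
    # Raise the target size, then fit: only the missing trees are grown
    warm_forest.set_params(n_estimators=n_trees)
    warm_forest.fit(X_demo, y_demo)
    demo_oob_mse.append(mean_squared_error(y_demo, warm_forest.oob_prediction_))

print(f"Trees in the final forest: {len(warm_forest.estimators_)}")
```

The total work is that of fitting 100 trees, rather than 25 + 50 + 75 + 100 = 250 trees when each forest is built separately.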

In [None]:
# Time to visualize the results (plot the MSE)
fig, ax = plt.subplots(figsize=(12, 8))
# Compute best MSE
best = np.argmin(oob_mse)
# Plot out-of-bag error
ax.step(ntrees, oob_mse, where="post")
# Plot best
ax.scatter(ntrees[best], oob_mse[best], s=100, color="red", label="Best")
# Add axis labels, grid, legend
ax.set_xlabel("Number of trees")
ax.set_ylabel("Out-of-bag MSE")
ax.grid(True)
ax.legend()

As the above plot shows, the largest forest, i.e., the one with 350 trees, has the best out-of-bag MSE.

___
#### 🤔 Pause and ponder
In this example, the best random forest (according to the OOB score) is the one with the largest number of trees. If you were faced with this situation in your own work or research, what would you do? Would you pick this number of trees or would you try again with larger forests? Why?
___

As we have already discussed in this course, the MSE can be difficult to interpret because it is expressed in squared units of the target. Thus, we also want to have a look at the mean absolute error of our random forest model.
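
A small numerical sketch makes the difference in units concrete. The prices and predictions below are invented for illustration; the MSE comes out in squared dollars, whereas the RMSE and MAE are in dollars and can be compared directly to a typical price.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical sale prices and predictions (in dollars), made up for illustration
y_true = np.array([100_000, 200_000, 300_000])
y_hat = np.array([110_000, 190_000, 330_000])

mse = mean_squared_error(y_true, y_hat)   # squared dollars, hard to read
rmse = np.sqrt(mse)                       # back in dollars
mae = mean_absolute_error(y_true, y_hat)  # dollars, average size of a miss

print(f"MSE : {mse:>16.2f}")
print(f"RMSE: {rmse:>16.2f}")
print(f"MAE : {mae:>16.2f}")
print(f"MAE relative to the mean price: {mae / y_true.mean():.1%}")
```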

In [None]:
# Print the best MAE and compare it to the mean price of our training set
print(f"The best out-of-bag MAE is {oob_mae[best]:>9.2f}")
print(f"The mean sales price is    {ytrain.mean():>9.2f}")

We have discussed the difference between out-of-bag and validation errors above. Let us now look at it in practice. As before, we will grow random forests of different sizes, but this time we will use a different range of tree counts and compare the out-of-bag error with the error on a validation set.

___
#### 🤔 Pause and ponder
Before we go ahead and run the code below, do you have an intuition as to what the results might look like? Try to think a bit about it and try to justify your intuition.
___

In [None]:
ntrees = range(50, 501, 50) # Number of trees
# Instantiate lists to keep track of out-of-bag results
oob_mse = []
oob_mae = []
# Instantiate lists to keep track of out-of-sample results
oos_mse = []
oos_mae = []

In [None]:
# Proceed with forest growing (this might take a while)
for trees in ntrees:
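    # Instantiate the estimator (notice the cost-complexity pruning)
    # Note: warm_start only reuses trees when the SAME estimator object is fitted again.
    # Because a new forest is created on each iteration, every forest here is grown from scratch.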
    forest = RandomForestRegressor(n_estimators=trees, ccp_alpha=0.01, 
                                   oob_score=True, warm_start=True, 
                                   random_state=144)
    # Fit the model
    forest.fit(Xtrain, ytrain)
    # Store the out-of-bag MSE and MAE. Notice how it is on the TRAIN set
    oob_mse.append(mean_squared_error(ytrain, forest.oob_prediction_))
    oob_mae.append(mean_absolute_error(ytrain, forest.oob_prediction_))
    # Compute the out-of-sample MSE and MAE. Notice how it is on the VALIDATION set
    pred = forest.predict(Xval) # Predict on validation set
    oos_mse.append(mean_squared_error(yval, pred))
    oos_mae.append(mean_absolute_error(yval, pred))

In [None]:
# Time to visualize the results (this time, we plot the MAE)
fig, ax = plt.subplots(figsize=(12, 8))
# Compute best MAE (out-of-bag)
oob_best = np.argmin(oob_mae)
# Compute best MAE (out-of-sample)
oos_best = np.argmin(oos_mae)
# Plot out-of-bag error
ax.step(ntrees, oob_mae, where="post", label="Out-of-bag")
# Plot out-of-sample error
ax.step(ntrees, oos_mae, where="post", label="Out-of-sample")
# Plot best out-of-bag
ax.scatter(ntrees[oob_best], oob_mae[oob_best], s=100, color="red", label="Best (OOB)")
# Plot best out-of-sample
ax.scatter(ntrees[oos_best], oos_mae[oos_best], s=100, color="purple", label="Best (OOS)")
# Add axis labels, grid, legend
ax.set_xlabel("Number of trees")
ax.set_ylabel("Mean absolute error")
ax.grid(True)
ax.legend()

In [None]:
print("The best model (according to OOB selection) has:")
print(f" - A total number of {ntrees[oob_best]} trees")
print(f" - An out-of-bag mean absolute error of    {oob_mae[oob_best]:>10.2f}")
print(f" - An out-of-sample mean absolute error of {oos_mae[oob_best]:>10.2f}")

#### ➡️ ✏️ Task 1
What conclusions do you draw from the above plot? Was your intuition correct? Discuss with your classmates.

## Comparison with linear regression
___
Let us now observe how the random forest model performs compared to a linear regression. We have seen above that selecting our model with the out-of-bag error would not have given us the lowest validation error (which was reached with 300 trees in the forest). For 450 trees (the best model according to the out-of-bag selection), the mean absolute error on the **validation set** is **15'607.28**, while the **out-of-bag** MAE is **16'699.76**.


#### ➡️ ✏️ Task 2
Can you try to guess how the linear regression will perform compared to the random forest? Think about its performance:  
+ on the training data, and
+ on the validation data.

Will it differ? Why? Discuss with your classmates before you run the code below.


In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# Create the linear regression object
linreg = LinearRegression()
# Fit it to the training data
linreg.fit(Xtrain, ytrain)

In [None]:
# Compute the errors
lr_mae_train = mean_absolute_error(ytrain, linreg.predict(Xtrain))
lr_mae_val = mean_absolute_error(yval, linreg.predict(Xval))

In [None]:
# Print results
print("----- Training data MAE -----")
print(f"Linear regression MAE : {lr_mae_train:>10.2f}")
print(f"Random forest MAE     : {oob_mae[oob_best]:>10.2f}")
print() # Empty line
print("----- Validation data MAE -----")
print(f"Linear regression MAE : {lr_mae_val:>10.2f}")
print(f"Random forest MAE     : {oos_mae[oob_best]:>10.2f}")

## Hyperparameter Tuning
___
A **hyperparameter** is a parameter whose value is not learned during training. Instead, it is used to control the learning process. We didn't discuss it in detail, but we have already seen a hyperparameter when looking at decision trees; this was the *cost-complexity pruning parameter* $\alpha$ (`ccp_alpha`). There are a few hyperparameters that can be tuned when using a random forest. Typically, you want to look at [the official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) to better understand which hyperparameters are available and how they are implemented in the package you use. Some of the hyperparameters of the random forest implemented in scikit-learn are the following:

+ `n_estimators`: The number of trees in the forest. We have already played a bit with this hyperparameter above.
+ `max_features`: The number of features considered at each split. Be sure to look at the documentation: this argument accepts several kinds of input, e.g., the number of features as an integer or the fraction of features as a float.
+ `min_samples_leaf`: The minimum number of samples required to be at a leaf node. A split will only be considered if it leaves at least `min_samples_leaf` training samples in each of the resulting branches. An integer is read as a number of samples, a float as a fraction of the training samples.
+ `ccp_alpha`: The [cost-complexity pruning parameter](https://scikit-learn.org/stable/modules/tree.html#minimal-cost-complexity-pruning).

Because hyperparameters are not derived during training, we have to think about how to select the best value of our hyperparameters. While there are [different strategies](https://scikit-learn.org/stable/modules/grid_search.html) to do so, **grid search** is the simplest method. If you consider what we did above with the size of the forest, we have already done hyperparameter tuning through grid search, i.e., we have selected the number of trees in the forest by trying out different values and choosing the best on the out-of-bag sample.

The problem with grid search is that it can take a long time to run, because we are trying many different models. For instance, when considering $10$ different values for `n_estimators`, we have to grow $10$ forests. However, if we now also want to try $5$ values of `max_features`, we need to try $10 \cdot 5=50$ models. Hence, as we tune more hyperparameters and as the grid gets finer, the number of models we need to estimate grows multiplicatively.
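
The size of a grid can be counted before running anything. The sketch below uses `ParameterGrid` from scikit-learn on the same grid as the grid search cell further down, and assumes 5 cross-validation folds, which to our knowledge is the default of `GridSearchCV` when `cv` is not set.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid used in the grid search cell of this notebook
demo_grid = {
    "n_estimators": [100, 200, 300, 400],
    "max_features": [0.25, 0.5, 1.0],
    "ccp_alpha": [0, 0.001, 0.01],
    "min_samples_leaf": [0.1, 0.25, 1],
}

# Number of hyperparameter combinations: 4 * 3 * 3 * 3
n_combinations = len(ParameterGrid(demo_grid))
n_folds = 5  # assumption: default number of folds in GridSearchCV
n_fits = n_combinations * n_folds

print(f"Combinations to try : {n_combinations}")
print(f"Forests to fit (CV) : {n_fits}")
```

Every one of these fits grows between 100 and 400 trees, which is why the grid search below takes so long.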

⚠️ Below, we provide an example of grid search for hyperparameter tuning. Because this can take a lot of time, **do not run it in class!** ⚠️

Note that, even at home, this will take very long to run!

In [None]:
from sklearn.model_selection import GridSearchCV # Grid search function

In [None]:
if True: # Remove the first two lines to run the code, this is just a safeguard
    raise Exception("Remove these two lines first")
    
# Grid search on some hyperparameters
hyperparams = {
    "n_estimators": [100, 200, 300, 400],
    "max_features": [0.25, 0.5, 1.0],
    "ccp_alpha": [0, 0.001, 0.01],
    "min_samples_leaf": [0.1, 0.25, 1]
}
# Run grid search cross validation
forest = RandomForestRegressor()
grid_search_forest = GridSearchCV(forest, hyperparams, verbose=1)
grid_search_forest.fit(Xtrain, ytrain)
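
Once a grid search has finished, the selected hyperparameters and their cross-validated score can be read from the fitted object. The sketch below is a deliberately tiny example on synthetic data so that it runs in seconds; the grid, the data, and the choice of `scoring="neg_mean_absolute_error"` are assumptions for illustration and differ from the cell above, which uses the default R² score.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only for illustration (not the Ames data)
X_demo, y_demo = make_regression(n_samples=150, n_features=8,
                                 noise=5.0, random_state=3)

# A very small grid: 2 * 2 = 4 combinations, 3 folds each
small_grid = {"n_estimators": [10, 20], "max_features": [0.5, 1.0]}
demo_search = GridSearchCV(RandomForestRegressor(random_state=3), small_grid,
                           cv=3, scoring="neg_mean_absolute_error")
demo_search.fit(X_demo, y_demo)

# scikit-learn maximizes scores, so the MAE is stored with a negative sign
print(f"Best hyperparameters     : {demo_search.best_params_}")
print(f"Best cross-validated MAE : {-demo_search.best_score_:.2f}")
```

By default the best combination is refitted on the full training data, so `demo_search.predict(...)` can be used directly afterwards.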