# Random Forest

TO DO:
– Explain and justify why you selected the three algorithms and describe their respective advantages and drawbacks.
– How well do the models perform? Evaluate and benchmark your models' performance using suitable evaluation metrics. Which model would you select for deployment?
– How could the selected model be improved further? Explain some of the improvement levers that you might focus on in a follow-up project.
– Evaluate your methodology and clearly state why you have opted for a specific approach in your analysis.
– Relate your findings to the real world and interpret them for non-technical audiences (e.g. "What do the coefficients in your regression model mean?", "What does the achieved error mean for your model?").
– Make sure to clearly state the implications (i.e. the "so what?") of your findings for managers/decision makers.

We use random forest regression to predict the hourly number of starting trips (`starting_trips`).

In [7]:
import os
import sys

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Make the parent directory importable (only needed for the commented-out local helpers below)
sys.path.append(os.path.abspath(os.path.join("..")))
# from utils.evaluation import mean_average_percentage_error, root_mean_squared_error

In [8]:
# Load the data
df_bike_trips_hourly = pd.read_parquet('../../data/bike_trips_hourly_FINAL.parquet')

### Define X and y

In [9]:
X = df_bike_trips_hourly.drop(['starting_trips'], axis=1)
y = df_bike_trips_hourly['starting_trips']

### Train the model on the training set

In [10]:
# Hold out 20% of the rows as a test set (random split, fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

We use a grid search with 3-fold cross-validation to find the best combination of the following hyperparameters.

The definitions follow https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74:

- `max_features` = max number of features considered for splitting a node
- `min_samples_leaf` = min number of data points allowed in a leaf node
- `min_samples_split` = min number of data points placed in a node before the node is split
- `max_depth` = max number of levels in each decision tree
- `max_leaf_nodes` = max number of leaf nodes

`n_estimators` is not tuned; it is fixed at 100 trees.
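To make the definitions above more concrete, the following sketch fits two small forests on synthetic data (not the bike trip data) and compares the resulting tree shapes. The data, the variable names and the chosen limits are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, for illustration only (not the bike trip data)
rng = np.random.default_rng(42)
X_demo = rng.uniform(size=(500, 5))
y_demo = 10 * X_demo[:, 0] + rng.normal(scale=0.5, size=500)

# Default settings: trees grow until the leaves cannot be split any further
unrestricted = RandomForestRegressor(n_estimators=10, random_state=42)
unrestricted.fit(X_demo, y_demo)

# Restricted settings: cap the depth and the number of leaves of every tree
restricted = RandomForestRegressor(
    n_estimators=10, max_depth=5, max_leaf_nodes=10, random_state=42
)
restricted.fit(X_demo, y_demo)


def describe(forest):
    """Return (largest depth, largest leaf count) over all trees in the forest."""
    depths = [tree.get_depth() for tree in forest.estimators_]
    leaves = [tree.get_n_leaves() for tree in forest.estimators_]
    return max(depths), max(leaves)


unrestricted_depth, unrestricted_leaves = describe(unrestricted)
restricted_depth, restricted_leaves = describe(restricted)
print("unrestricted:", unrestricted_depth, unrestricted_leaves)
print("restricted:  ", restricted_depth, restricted_leaves)
```

The restricted forest never exceeds the configured depth or leaf count, which is why these parameters act as regularizers: smaller trees fit the training data less closely.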




In [11]:
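# n_estimators is fixed at 100; the grid below tunes the remaining parameters
estimator = RandomForestRegressor(n_estimators=100, bootstrap=True, random_state=42)
param_grid = {
    # 'auto' (all features, for regressors) was removed in scikit-learn 1.3; use 1.0 or None there
    'max_features': ['auto', 'sqrt', 'log2'],
    'min_samples_leaf': [1, 2, 4, 8],
    'min_samples_split': [2, 4, 8],
    'max_depth': [None, 5, 10, 50, 100],
    'max_leaf_nodes': [None, 10, 50, 100, 150],
}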

In [12]:
# 3-fold cross-validated grid search; scores are negative MSE, so higher is better
model = GridSearchCV(
    estimator, param_grid, cv=3, scoring="neg_mean_squared_error", n_jobs=-1, verbose=1
)
model.fit(X_train, y_train)

Fitting 3 folds for each of 900 candidates, totalling 2700 fits


GridSearchCV(cv=3, estimator=RandomForestRegressor(random_state=42), n_jobs=-1,
             param_grid={'max_depth': [None, 5, 10, 50, 100],
                         'max_features': ['auto', 'sqrt', 'log2'],
                         'max_leaf_nodes': [None, 10, 50, 100, 150],
                         'min_samples_leaf': [1, 2, 4, 8],
                         'min_samples_split': [2, 4, 8]},
             scoring='neg_mean_squared_error', verbose=1)
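The number of candidates in the log above follows directly from the grid: it is the product of the number of values per parameter, and each candidate is fitted once per fold. A short check using scikit-learn's `ParameterGrid`, with the grid copied from the cell above:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
param_grid = {
    "max_features": ["auto", "sqrt", "log2"],
    "min_samples_leaf": [1, 2, 4, 8],
    "min_samples_split": [2, 4, 8],
    "max_depth": [None, 5, 10, 50, 100],
    "max_leaf_nodes": [None, 10, 50, 100, 150],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 3 * 5 * 5
n_fits = n_candidates * 3  # one fit per candidate and fold (cv=3)
print(n_candidates, n_fits)  # 900 2700
```

Because the count grows multiplicatively, adding a single extra value to one parameter list adds hundreds of fits, which is worth keeping in mind before extending the grid.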

In [13]:
model.best_params_

{'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_samples_leaf': 1,
 'min_samples_split': 8}
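Alongside `best_params_`, the fitted search also exposes `best_score_`, the mean cross-validated score of the winning combination. With `scoring="neg_mean_squared_error"` this value is the negative MSE, so it has to be negated (and optionally square-rooted) before it can be compared with the test metrics. The sketch below shows the conversion on a small synthetic problem; the data and the reduced grid are assumptions for illustration and do not reproduce the search above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data, for illustration only (not the bike trip data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 3))
y_demo = 5 * X_demo[:, 0] + rng.normal(scale=0.3, size=200)

# A deliberately tiny grid so that the example runs quickly
search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=42),
    {"min_samples_split": [2, 8], "max_depth": [None, 3]},
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

cv_mse = -search.best_score_  # undo the sign flip applied by the scorer
cv_rmse = np.sqrt(cv_mse)     # back to the units of the target
print(search.best_params_)
print(f"CV MSE: {cv_mse:.3f}, CV RMSE: {cv_rmse:.3f}")
```

Comparing the cross-validated RMSE with the RMSE on the held-out test set is one simple way to see whether the tuned model generalizes as expected.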

#### TO DO : We can see that ... depends on new values 

In [14]:
best_model = model.best_estimator_

## Evaluate the model
We evaluate the model with error metrics: mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE) and a relative error. MAE and RMSE are expressed in the units of the target (trips per hour), which makes them easy to interpret. R^2 can be computed for a random forest as well, but its reading as the share of explained variance is only exact for linear least-squares fits evaluated on the training data, and it does not say how large the errors are in absolute terms. We therefore do not use it as the primary metric.
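The metrics discussed above can be written out directly from their definitions. The following sketch uses three made-up values (not taken from the bike trip data) so that every number can be checked by hand.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([10.0, 20.0, 40.0])
y_hat = np.array([12.0, 18.0, 44.0])

residuals = y_true - y_hat

mae = np.mean(np.abs(residuals))  # average absolute deviation
mse = np.mean(residuals ** 2)     # average squared deviation
rmse = np.sqrt(mse)               # same units as the target
# R^2 compares the squared error with that of always predicting the mean
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE: {mae:.3f}, MSE: {mse:.3f}, RMSE: {rmse:.3f}, R^2: {r2:.3f}")
```

MSE and RMSE weight large misses more heavily than MAE does, so a gap between RMSE and MAE hints at a few hours with large errors.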

In [15]:
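# Evaluate the tuned model on the held-out test set
y_pred = best_model.predict(X_test)

print(f"MAE: {mean_absolute_error(y_test, y_pred):.2f}")
print(f"MSE: {mean_squared_error(y_test, y_pred):.2f}")
# Note: this is the MAE relative to the mean of y_test (a normalized MAE),
# not the per-sample MAPE, which breaks down if any hour has zero trips
print(f"MAE / mean: {(mean_absolute_error(y_test, y_pred) / y_test.mean()) * 100:.2f}%")
# Note: `squared=False` is deprecated in newer scikit-learn releases; use root_mean_squared_error there
print(f"RMSE: {mean_squared_error(y_test, y_pred, squared=False):.2f}")


MAE: 16.07
MSE: 677.48
MAE / mean: 21.12%
RMSE: 26.03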
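The percentage reported above divides the MAE by the mean of `y_test`. This is not the same quantity as the per-sample mean absolute percentage error, which averages `|error| / |actual|` over all observations. The sketch below contrasts the two on made-up numbers (not taken from the bike trip data); the function names are hypothetical helpers introduced only for this example.

```python
import numpy as np


def normalized_mae(y_true, y_hat):
    """MAE divided by the mean of the actual values."""
    return np.mean(np.abs(y_true - y_hat)) / np.mean(y_true)


def per_sample_mape(y_true, y_hat):
    """Mean of |error| / |actual|; requires all actual values to be non-zero."""
    return np.mean(np.abs(y_true - y_hat) / np.abs(y_true))


# Case 1: all actual values are reasonably large
y_true_a = np.array([10.0, 20.0, 40.0])
y_hat_a = np.array([12.0, 18.0, 44.0])

# Case 2: same absolute errors, but one actual value is very small
y_true_b = np.array([1.0, 20.0, 40.0])
y_hat_b = np.array([3.0, 18.0, 44.0])

nmae_a, mape_a = normalized_mae(y_true_a, y_hat_a), per_sample_mape(y_true_a, y_hat_a)
nmae_b, mape_b = normalized_mae(y_true_b, y_hat_b), per_sample_mape(y_true_b, y_hat_b)

print(f"case 1: normalized MAE = {nmae_a:.3f}, MAPE = {mape_a:.3f}")
print(f"case 2: normalized MAE = {nmae_b:.3f}, MAPE = {mape_b:.3f}")
```

In the second case a single hour with a small actual value dominates the per-sample MAPE, while the normalized MAE barely changes. For count data with quiet hours, the normalized MAE is therefore the more stable relative measure.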
