# Random Forest

1. **Imports necessary libraries**: These are pandas, NumPy, scikit-learn's `RandomizedSearchCV` and `RandomForestRegressor`, three metrics from `sklearn.metrics` (`mean_squared_error`, `mean_absolute_error`, and `r2_score`), and `sqrt` from `math`.

2. **Loads data**: Reads training and testing data from CSV files, dropping the "date" column from the feature sets and reshaping the target variables into one-dimensional arrays.

3. **Initializes a RandomForestRegressor model**: This model will be used for the subsequent training.

4. **Defines parameter distributions for random search**: These parameters include `n_estimators`, `max_depth`, `min_samples_split`, and `min_samples_leaf`.

    - `n_estimators`: This parameter defines the number of trees in the forest. The values being considered in the random search are 100, 200, 300, 400, and 500. More trees generally improve the model's performance but also increase computational cost.

    - `max_depth`: This parameter specifies the maximum depth of the trees. The values being considered are None (which means nodes are expanded until all leaves are pure or contain fewer than `min_samples_split` samples), 25, 50, 75, and 100. A higher maximum depth can lead to a more complex model, which might result in overfitting.

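    - `min_samples_split`: This parameter determines the minimum number of samples required to split an internal node. The values being considered are 1, 2, 4, 8, and 16. Note that 1 is not a valid integer value (scikit-learn requires an integer of at least 2, or a float in (0, 1] interpreted as a fraction of the samples), so any sampled candidate that draws it fails to fit; this is the likely cause of the failed fits reported in the output. Higher values prevent the model from learning relations that are highly specific to the particular sample selected for a tree.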

    - `min_samples_leaf`: This parameter sets the minimum number of samples required to be at a leaf node. The values being considered are 1, 2, 4, 8, and 16. Smaller values let leaves fit individual samples, which makes the model more prone to capturing noise in the training data.

5. **Performs a RandomizedSearchCV**: This samples 10 parameter combinations (`n_iter=10`) and scores each one with 5-fold cross-validation (`cv=5`) to find the best parameters for the RandomForestRegressor model, using negative mean squared error as the scoring metric.

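6. **Fits the random search model**: The search is fitted on the training data. With the default `refit=True`, the best parameter combination is then retrained on the full training set.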

7. **Evaluates the best model**: The best model from the random search is evaluated by making predictions on the test set.

8. **Calculates evaluation metrics**: These metrics include mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and R-squared (R2) score.

9. **Prints the evaluation metrics**: The metrics, along with the best parameters found by the random search, are printed out.
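
The candidate sampling in steps 4 and 5 can be previewed without fitting anything. The sketch below assumes the same search space, `n_iter`, and `random_state` as this notebook and relies on `RandomizedSearchCV` drawing its candidates through `ParameterSampler`, so it should list the same 10 combinations. The helper `is_valid_split` is hypothetical (it is not part of scikit-learn) and only mirrors the documented rule for `min_samples_split`.

```python
from numbers import Integral

from sklearn.model_selection import ParameterSampler

# Same search space as in this notebook.
param_dist = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_depth": [None, 25, 50, 75, 100],
    "min_samples_split": [1, 2, 4, 8, 16],
    "min_samples_leaf": [1, 2, 4, 8, 16],
}

# Draw 10 combinations with the same seed the search uses.
candidates = list(ParameterSampler(param_dist, n_iter=10, random_state=30))


def is_valid_split(value):
    """Hypothetical helper: True if value is an int >= 2 or a float in (0.0, 1.0]."""
    if isinstance(value, Integral):
        return value >= 2
    return 0.0 < value <= 1.0


# Flag every candidate whose min_samples_split would be rejected at fit time.
invalid = [c for c in candidates if not is_valid_split(c["min_samples_split"])]

for c in candidates:
    flag = "ok     " if is_valid_split(c["min_samples_split"]) else "INVALID"
    print(flag, c)
print(f"{len(invalid)} of {len(candidates)} candidates would fail to fit")
```

Each flagged candidate fails once per fold, so with `cv=5` the number of failed fits would be five times the number of invalid candidates.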

In [1]:
# Import packages
import pandas as pd
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from math import sqrt

# Load the split data from csv
X_train = pd.read_csv("X_train.csv").drop(columns=["date"])
X_test = pd.read_csv("X_test.csv").drop(columns=["date"])
y_train = np.ravel(pd.read_csv("y_train.csv"))
y_test = np.ravel(pd.read_csv("y_test.csv"))

# Initialize the model
model = RandomForestRegressor()

# Define parameter distributions for random search
param_dist = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_depth": [None, 25, 50, 75, 100],
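    # NOTE: an integer min_samples_split must be >= 2, so candidates that draw 1
    # fail to fit (see the failed fits reported in the output)
    "min_samples_split": [1, 2, 4, 8, 16],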
    "min_samples_leaf": [1, 2, 4, 8, 16]
}

# Use RandomizedSearchCV to find the best parameters
random_search = RandomizedSearchCV(
    model,
    param_distributions=param_dist,
    n_iter=10,
    scoring="neg_mean_squared_error",
    cv=5,
    verbose=1,
    # n_jobs=-1,
    random_state=30,
)

# Fit the random search
random_search.fit(X_train, y_train)

# Evaluate the best model
best_model = random_search.best_estimator_

# Make predictions on the test set
predictions = best_model.predict(X_test)

# Calculate the evaluation metrics
mse = mean_squared_error(y_test, predictions)
rmse = sqrt(mse)
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

# Print the evaluation metrics
print(f"Model: {model.__class__.__name__}")
print(f"- Best Parameters: {random_search.best_params_}")
print(f"- MSE: {mse}")
print(f"- RMSE: {rmse}")
print(f"- MAE: {mae}")
print(f"- R2 Score: {r2}")

Fitting 5 folds for each of 10 candidates, totalling 50 fits


20 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\Arjun\Desktop\Subway Sales Forecasting\.venv\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\Arjun\Desktop\Subway Sales Forecasting\.venv\Lib\site-packages\sklearn\base.py", line 1467, in wrapper
    estimator._validate_params()
  File "c:\Users\Arjun\Desktop\Subway Sales Forecasting\.venv\Lib\site-packages\sklearn\base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "c:\Users\Arjun\Desktop\Subway Sales Forecasting\.venv\Lib\site
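
The warning above suggests setting `error_score='raise'` to debug failed fits. A minimal sketch of that option follows; it uses small synthetic data and a deliberately invalid one-value search space (both are assumptions for illustration, not the notebook's data or grid) so that the first failing fit raises immediately instead of being scored as nan.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption; not the notebook's CSV files).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=60)

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=10, random_state=0),
    param_distributions={"min_samples_split": [1]},  # deliberately invalid value
    n_iter=1,
    cv=3,
    scoring="neg_mean_squared_error",
    error_score="raise",  # re-raise the first fit error instead of recording nan
)

try:
    search.fit(X, y)
    failure = None
except ValueError as exc:  # scikit-learn's parameter errors are ValueError subclasses
    failure = exc

print(type(failure).__name__, failure)
```

With the default `error_score=np.nan`, the same search would run to completion and only emit a warning, which is what happened in the output here.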

Model: RandomForestRegressor
- Best Parameters: {'n_estimators': 300, 'min_samples_split': 4, 'min_samples_leaf': 1, 'max_depth': None}
- MSE: 240.64202109364464
- RMSE: 15.512640687311901
- MAE: 5.26923728959828
- R2 Score: 0.3532845156705644
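
To make the reported numbers easier to interpret, the sketch below computes the four metrics by hand and checks them against scikit-learn. The arrays are toy values chosen for illustration, not the notebook's data. It shows that RMSE is the square root of MSE (so it is in the units of the target) and that R2 is one minus the ratio of the squared errors to the squared deviations of the target from its mean.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values for illustration only (not the notebook's data).
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)      # average squared error
rmse = np.sqrt(mse)             # same units as the target
mae = np.mean(np.abs(errors))   # average absolute error
# R2 = 1 - (residual sum of squares) / (total sum of squares)
r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The manual formulas should agree with scikit-learn's implementations.
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))

print(f"MSE={mse:.4f} RMSE={rmse:.4f} MAE={mae:.4f} R2={r2:.4f}")
```

Because squaring weights large errors more heavily, an RMSE well above the MAE, as in the results above, usually indicates that a small number of large errors dominate the squared loss.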
