# Random Forest - Hyperparameter Tuning & Training (All Features)

1. Define Random Forest model.
2. Define hyperparameter search space.
3. Use RandomizedSearchCV with Stratified K-Fold Cross-Validation on the resampled training data (all features).
4. Find the best hyperparameters based on Average Precision score.
5. Evaluate the best model on the validation set.
6. Perform final evaluation on the test set.
7. Save the best model.
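
Step 4 selects hyperparameters by Average Precision, which summarizes the precision-recall curve. The sketch below uses toy labels and scores (made up for illustration, not taken from this project's data) to show how the metric is computed and why its no-skill baseline equals the positive-class prevalence.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy example: two negatives and two positives with made-up scores
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Ranked by score: 0.8 (pos), 0.4 (neg), 0.35 (pos), 0.1 (neg)
# Recall 0.5 is reached at precision 1/1, recall 1.0 at precision 2/3,
# so AP = 0.5 * 1 + 0.5 * (2/3)
ap_small = average_precision_score(y_true, y_score)
print("AP (toy example):", round(ap_small, 4))

# A model that gives every sample the same score has AP equal to the
# fraction of positives (10% here), which is the no-skill baseline
y_imbalanced = np.array([1] * 10 + [0] * 90)
constant_scores = np.full(100, 0.5)
ap_baseline = average_precision_score(y_imbalanced, constant_scores)
print("AP (constant scores):", ap_baseline)
```

Because the baseline depends on prevalence, Average Precision values measured on balanced (resampled) data and on imbalanced data are not directly comparable.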

In [1]:
import pandas as pd
import numpy as np
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import classification_report, roc_auc_score, average_precision_score, ConfusionMatrixDisplay, PrecisionRecallDisplay, RocCurveDisplay
import matplotlib.pyplot as plt
from scipy.stats import randint

## 1. Load Data
Load the datasets: resampled training set, scaled validation set, scaled test set.

In [2]:
try:
    X_train = pd.read_csv('../../data/processed/transformed/X_train_transform_scaled_resampled.csv') # Contains all features
    y_train = pd.read_csv('../../data/processed/transformed/y_train_transform_scaled_resampled.csv')
    X_val = pd.read_csv('../../data/processed/transformed/X_val_transform_scaled.csv') # Contains all features
    y_val = pd.read_csv('../../data/processed/transformed/y_val_transform.csv')
    X_test = pd.read_csv('../../data/processed/transformed/X_test_transform_scaled.csv') # Contains all features
    y_test = pd.read_csv('../../data/processed/transformed/y_test_transform.csv')
    print("Data loaded successfully.")
    print("X_train shape:", X_train.shape)
    print("X_validate shape:", X_val.shape)
    print("X_test shape:", X_test.shape)
except FileNotFoundError as e:
    print(f"Error loading data: {e}")
    print("Please ensure the data files are present in the correct paths.")
    # Re-raise so later cells do not run without data; exit() would shut down the notebook kernel
    raise

Data loaded successfully.
X_train shape: (4762, 19)
X_validate shape: (984, 19)
X_test shape: (984, 19)


## 2. Define Model and Hyperparameter Space

In [3]:
# Define the base model
rf = RandomForestClassifier(random_state=42, class_weight=None, n_jobs=-1) # class_weight=None as y_train is resampled

In [5]:
# Define the parameter distribution for Randomized Search
param_distributions = {
    'n_estimators': randint(100, 601),       # Number of trees (e.g., 100 to 600)
    'max_depth': [10, 20, 30, 40, None],      # Max depth of trees
    'min_samples_split': randint(2, 11),    # Min samples to split node (e.g., 2 to 10)
    'min_samples_leaf': randint(1, 5),      # Min samples per leaf node (e.g., 1 to 4)
    'max_features': ['sqrt', 'log2', 0.5, None] # Number of features to consider for split
    # 'criterion': ['gini', 'entropy']      # Split criterion
}
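
To see what kind of candidates this search space produces, a few settings can be drawn with `ParameterSampler`, the scikit-learn utility for sampling from a dictionary of lists and distributions. This is a minimal sketch: the dictionary is repeated here so the snippet runs on its own, and `n_iter=5` is an arbitrary choice for display.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Same search space as above, repeated so this snippet is self-contained
param_distributions = {
    'n_estimators': randint(100, 601),       # upper bound is exclusive: 100 to 600
    'max_depth': [10, 20, 30, 40, None],
    'min_samples_split': randint(2, 11),     # 2 to 10
    'min_samples_leaf': randint(1, 5),       # 1 to 4
    'max_features': ['sqrt', 'log2', 0.5, None],
}

# Draw five candidate settings; lists are sampled uniformly,
# scipy distributions are sampled via their rvs() method
samples = list(ParameterSampler(param_distributions, n_iter=5, random_state=42))
for params in samples:
    print(params)
```

Each printed dictionary is one candidate that a randomized search could evaluate.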

## 3. Setup Cross-Validation and Randomized Search

In [6]:
# Stratified K-Fold for cross-validation within the training set
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
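
Stratification keeps the class proportions of the full set in every fold. The following sketch illustrates this on synthetic labels (90 negatives and 10 positives, chosen only for illustration); the names `X_demo`, `y_demo`, and `cv_demo` are hypothetical and unrelated to the project data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels: 90 negatives, 10 positives (illustrative only)
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for the split itself

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_positive_counts = []
for fold, (train_idx, val_idx) in enumerate(cv_demo.split(X_demo, y_demo)):
    n_pos = int(y_demo[val_idx].sum())
    fold_positive_counts.append(n_pos)
    print(f"Fold {fold}: {len(val_idx)} validation samples, {n_pos} positives")
```

Each validation fold holds 20 samples with the same 10% positive rate as the full label vector.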

In [7]:
# Randomized Search setup
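# n_iter = number of parameter settings sampled. Increase for a more thorough search, decrease for speed.
# scoring = 'average_precision' suits imbalanced problems: the training data is resampled (balanced),
# but the validation and test sets keep the original class imbalance.
# Note: CV scores are computed on resampled folds, so they may be optimistic relative to validation/test.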
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_distributions,
    n_iter=50,        # Adjust number of iterations
    cv=cv_strategy,
    scoring='average_precision',
    n_jobs=-1,        # Use all cores
    random_state=42,
    verbose=1         # Show progress
)
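
The search object exposes its results only after fitting. As a rough illustration of what becomes available, here is a scaled-down sketch on synthetic data from `make_classification`; the data, the reduced search space, and the `demo_` names are assumptions made for speed and are not the settings used in this notebook.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in data (hypothetical; not the project's dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=42
)

# Deliberately tiny search so the sketch finishes quickly
demo_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions={
        'n_estimators': randint(10, 31),
        'max_depth': [3, 5, None],
    },
    n_iter=3,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring='average_precision',
    random_state=42,
)
demo_search.fit(X_demo, y_demo)

# best_params_: the sampled setting with the highest mean CV score
print("Best params:", demo_search.best_params_)
# best_score_: mean average precision across folds for that setting
print("Best mean CV average precision:", round(demo_search.best_score_, 3))
# refit=True by default, so best_estimator_ is retrained on all of X_demo
print("Refit estimator:", type(demo_search.best_estimator_).__name__)
```

The per-candidate scores are stored in `demo_search.cv_results_`, which can be passed to `pd.DataFrame` for inspection.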

## 4. Run Hyperparameter Search

In [None]:
print("Starting Randomized Search CV for Random Forest (All Features)...")
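# Fit on the resampled training data
# Assuming y_train holds a single label column: ravel() gives the 1-D array scikit-learn expects
# and avoids a DataConversionWarning on every fit
random_search.fit(X_train, y_train.values.ravel())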
print("Search complete.")

Starting Randomized Search CV for Random Forest (All Features)...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
