<center>

### COSC2753 - Machine Learning

# **Random Forest**

<center>────────────────────────────</center>
&nbsp;

# I. Global Configuration

In [1]:
import sys
import importlib
import tabulate
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import sklearn
import statsmodels
import imblearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# Reload modules
sys.path.append("../../")  # Root directory
modules_to_reload = [
    "scripts.styler",
    "scripts.neko",
    "scripts.utils",
    "scripts.outlier_detector",
]

# Reload modules if they have been modified
missing_modules = []

for module_name in modules_to_reload:
    if module_name in sys.modules:
        importlib.reload(sys.modules[module_name])
    else:
        missing_modules.append(module_name)

# Recache missing modules
if missing_modules:
    print(f"Modules {missing_modules} not found. \nRecaching...")

# Import user-defined scripts
from scripts.styler import Styler
from scripts.neko import Neko
from scripts.utils import Utils
from scripts.outlier_detector import OutlierDetector

# Initialize styler
styler = Styler()  # Text Styler

# Check package versions
styler.draw_box("Validating Package Versions...")

try:
    with open("../../requirements.txt", "r") as file:
        requirements = file.readlines()
except FileNotFoundError:
    # Fall back to an empty list so `requirements` is always defined below
    requirements = []
    print("File '../../requirements.txt' not found. Please check your directory!")

packages_to_check = [np, pd, sns, matplotlib, tabulate, sklearn, statsmodels, imblearn]

for package in packages_to_check:
    Utils.version_check(package, requirements=requirements)

styled_text = styler.style("\nDone validating packages\n", bold=True, italic=True)
print(styled_text)

# Initialize objects
styler.draw_box("Initializing Project...")
neko = Neko()  # Panda extension
bullet = ">>>"  # Bullet point
import matplotlib.pyplot as plt  # Matplotlib plotting interface (explicit submodule import)

# Configuration
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.precision", 3)

styled_text = styler.style("Done initializing project...", bold=True, italic=True)
print(styled_text)

Modules ['scripts.styler', 'scripts.neko', 'scripts.utils', 'scripts.outlier_detector'] not found. 
Recaching...
┌──────────────────────────────────┐
│  Validating Package Versions...  │
└──────────────────────────────────┘
>>> numpy is up to date: 1.26.4
>>> pandas is up to date: 2.2.1
>>> seaborn is up to date: 0.13.2
>>> matplotlib is up to date: 3.8.3
>>> tabulate is up to date: 0.9.0
>>> sklearn is up to date: 1.4.1.post1
>>> statsmodels is up to date: 0.14.1
>>> imblearn is up to date: 0.12.2
[1m[3m
Done validating packages
[0m
┌───────────────────────────┐
│  Initializing Project...  │
└───────────────────────────┘

    /\_____/\
   /  x   o  \
  ( ==  ^  == )       Neko has arrived!
   )         (        An data visualizing extension for analyzing DataFrames.
  (           )       Art: https://www.asciiart.eu/animals/cats.
 ( (  )   (  ) )
(__(__)___(__)__)

[1m[3mDone initializing project...[0m


# II. Data Loading

In [2]:
try:
    # Load data
    df_train = pd.read_csv("../../data/processed/data_train_processed.csv")
    df_test = pd.read_csv("../../data/test/data_test.csv")

    styler.draw_box("Data Loaded Successfully")

except FileNotFoundError:
    print("Error: File not found. Please check the file path.")
except Exception as e:
    print("An error occurred:", e)

┌────────────────────────────┐
│  Data Loaded Successfully  │
└────────────────────────────┘


# III. Model Development

## 1. Model Training

As with the decision tree model, feature scaling and outlier handling are **skipped** here. **Tree-based methods** split on per-feature thresholds, so they depend only on the ordering of values within each feature and are unaffected by monotonic rescaling.
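The insensitivity of tree-based models to feature scale can be checked with a small experiment. The sketch below is illustrative only: it uses synthetic data (an assumption, not the project dataset), trains two forests with the same seed on raw and rescaled copies of the features, and compares their predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration, not the project dataset)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 100, size=(400, 4)).astype(float)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 100).astype(int)

# Rescale every feature by a constant factor; the ordering within each feature is unchanged
X_rescaled = X_demo * 1024.0

# Same hyperparameters and seed, so the only difference is the feature scale
forest_raw = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_demo, y_demo)
forest_rescaled = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_rescaled, y_demo)

pred_raw = forest_raw.predict(X_demo)
pred_rescaled = forest_rescaled.predict(X_rescaled)

same_predictions = np.array_equal(pred_raw, pred_rescaled)
print("Identical predictions after rescaling:", same_predictions)
```

Because each split depends only on how samples are ordered along a feature, both forests should choose equivalent splits and agree on every prediction.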

In [3]:
# Separate the features from the target, then hold out 20% of the rows for testing
X = df_train.drop(columns=["Status"])
y = df_train["Status"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

### 1.1 Model Evaluation (First Attempt)

In [4]:
# Initialize the Random Forest classifier
rf_classifier = RandomForestClassifier()

neko.evaluate_model(rf_classifier, X_train, y_train, X_test, y_test)

Classification Report for Training Data:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     28473
           1       1.00      1.00      1.00     28536

    accuracy                           1.00     57009
   macro avg       1.00      1.00      1.00     57009
weighted avg       1.00      1.00      1.00     57009

Classification Report for Testing Data:
               precision    recall  f1-score   support

           0       0.83      0.91      0.87      7158
           1       0.90      0.81      0.85      7095

    accuracy                           0.86     14253
   macro avg       0.86      0.86      0.86     14253
weighted avg       0.86      0.86      0.86     14253
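The source of `neko.evaluate_model` is not shown in this notebook. Judging only from its printed output, it appears to fit the model and print a classification report for each split. The function below is a hypothetical stand-in written under that assumption (the name `evaluate_model_sketch` and the synthetic data are not part of the project):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


def evaluate_model_sketch(model, X_train, y_train, X_test, y_test):
    """Hypothetical stand-in for neko.evaluate_model: fit, then report on both splits."""
    model.fit(X_train, y_train)
    train_report = classification_report(y_train, model.predict(X_train))
    test_report = classification_report(y_test, model.predict(X_test))
    print("Classification Report for Training Data:")
    print(train_report)
    print("Classification Report for Testing Data:")
    print(test_report)
    return train_report, test_report


# Synthetic data (an assumption for illustration, not the project dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

train_report, test_report = evaluate_model_sketch(
    RandomForestClassifier(n_estimators=50, random_state=42), X_tr, y_tr, X_te, y_te
)
```

Comparing the two reports side by side is what exposes the gap between training and testing performance.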



### 1.2 Feature Selection

In [5]:
# Train the model (with reduced features)
styler.draw_box("Training the model (With Reduced Features)")
model = RandomForestClassifier(random_state=42)

# Drop the specified columns from X_train and X_test
X_train_reduced = X_train.drop(columns=["AnyHealthcare", "MentHlth"])
X_test_reduced = X_test.drop(columns=["AnyHealthcare", "MentHlth"])

model.fit(X_train_reduced, y_train)

# Evaluate on the training and test splits
neko.evaluate_model(model, X_train_reduced, y_train, X_test_reduced, y_test)

┌──────────────────────────────────────────────┐
│  Training the model (With Reduced Features)  │
└──────────────────────────────────────────────┘
Classification Report for Training Data:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     28473
           1       1.00      1.00      1.00     28536

    accuracy                           1.00     57009
   macro avg       1.00      1.00      1.00     57009
weighted avg       1.00      1.00      1.00     57009

Classification Report for Testing Data:
               precision    recall  f1-score   support

           0       0.83      0.90      0.86      7158
           1       0.88      0.81      0.84      7095

    accuracy                           0.85     14253
   macro avg       0.85      0.85      0.85     14253
weighted avg       0.85      0.85      0.85     14253



These results are comparable to those of the **Decision Tree model**. Dropping the two candidate columns (`AnyHealthcare` and `MentHlth`) lowers test accuracy from `0.86` to `0.85`, so both columns are retained.

Training accuracy of `1.00` against a test accuracy of about `0.86` points to overfitting. The same hyperparameter selection process used to prune the **Decision Tree** is therefore applied here.
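One complementary way to judge whether a column is worth keeping is to inspect the forest's impurity-based feature importances. This is not the procedure used in this notebook; the sketch below is only an illustration on synthetic data with hypothetical column names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical column names (not the project dataset)
X_arr, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]
X_demo = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances: one value per column, normalized to sum to 1
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

Columns with importances near zero are candidates for removal, although the effect on held-out accuracy should still be checked as done above.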


## 2. Hyperparameter Tuning

In [6]:
# Define the parameter grid to search
param_grid = {
    "n_estimators": [100, 150, 200, 225, 250, 300],  # Number of trees in the forest
    "max_depth": [None] + list(range(10, 110, 10)),  # Maximum depth of the tree
    "min_samples_split": np.arange(
        5, 20, 2
    ).tolist(),  # Minimum number of samples required to split an internal node
    "min_samples_leaf": np.arange(
        2, 6
    ).tolist(),  # Minimum number of samples required to be at a leaf node
    "criterion": ["gini", "entropy"],  # Function to measure the quality of a split
    "max_features": [
        None,
        "sqrt",
        "log2",
    ],  # Number of features to consider when looking for the best split
    "ccp_alpha": np.arange(
        0.0, 0.1, 0.01
    ),  # Complexity parameter used for Minimal Cost-Complexity Pruning
}
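To get a sense of how large this search space is, the number of combinations can be counted with scikit-learn's `ParameterGrid`. The sketch below restates the same candidate values as plain lists (the name `grid_sketch` is illustrative) and assumes the 5-fold cross-validation used in the search.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as the search space above, written out as plain lists
grid_sketch = {
    "n_estimators": [100, 150, 200, 225, 250, 300],
    "max_depth": [None] + list(range(10, 110, 10)),
    "min_samples_split": list(range(5, 20, 2)),
    "min_samples_leaf": list(range(2, 6)),
    "criterion": ["gini", "entropy"],
    "max_features": [None, "sqrt", "log2"],
    "ccp_alpha": [round(0.01 * i, 2) for i in range(10)],
}

n_combinations = len(ParameterGrid(grid_sketch))
n_fits_with_cv = n_combinations * 5  # cv=5 folds per combination

print("Parameter combinations:", n_combinations)
print("Model fits for an exhaustive 5-fold search:", n_fits_with_cv)
```

The product is 6 × 11 × 8 × 4 × 2 × 3 × 10 = 126,720 combinations, or 633,600 model fits with 5 folds, which is why an exhaustive grid search is impractical on limited hardware.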

In [9]:
# Create the base random forest classifier (n_estimators is chosen by the search)
clf = RandomForestClassifier(random_state=42)

# Instantiate RandomizedSearchCV
search = RandomizedSearchCV(
    estimator=clf,
    param_distributions=param_grid,
    cv=5,
    n_jobs=5,
    scoring="f1_weighted",
    n_iter=1,  # Number of sampled parameter settings (explained in the conclusion)
)

# Perform the randomized search
search.fit(X_train, y_train)

# Print the best parameters and the corresponding score
print("Best Parameters:", search.best_params_)
print("Best Score:", search.best_score_)

Best Parameters: {'n_estimators': 225, 'min_samples_split': 17, 'min_samples_leaf': 4, 'max_features': None, 'max_depth': 40, 'criterion': 'entropy', 'ccp_alpha': 0.0}
Best Score: 0.8712939008589056
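The gap between training and validation scores can be read directly from the search results by passing `return_train_score=True`. The sketch below is a minimal illustration on synthetic data with a deliberately tiny search space (both are assumptions, as is the name `search_sketch`); it also sets `random_state` on the search so that the sampled settings are reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data and a deliberately tiny search space (assumptions for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
small_space = {
    "n_estimators": [10, 20, 30],
    "max_depth": [None, 3, 5],
    "min_samples_leaf": [1, 2, 4],
}

search_sketch = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=small_space,
    n_iter=3,
    cv=3,
    scoring="f1_weighted",
    return_train_score=True,  # also record scores on the training folds
    random_state=42,  # makes the sampled settings reproducible
)
search_sketch.fit(X_demo, y_demo)

# Compare mean training and validation scores for the best setting
best = search_sketch.best_index_
train_score = search_sketch.cv_results_["mean_train_score"][best]
validation_score = search_sketch.cv_results_["mean_test_score"][best]
gap = train_score - validation_score
print(f"Train: {train_score:.3f}  Validation: {validation_score:.3f}  Gap: {gap:.3f}")
```

A large positive gap suggests the selected setting still overfits, while a small one suggests the regularizing parameters are doing their job.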


### **Conclusion**

This model behaves much like the decision tree. Tuning yields a slight accuracy improvement (approximately `1%`), and a `6%` gap remains between training and validation accuracy. A gap of this size is generally considered acceptable and not indicative of overfitting `[1]` `[2]`. Hardware limitations necessitated a randomized search rather than an exhaustive grid search. However, S. Weiran `[3]` suggests that a randomized search has a `96%` chance of finding a setting within the top `5%` of parameter combinations after `60` iterations.

[1] [Free Code Camp - What is Overfitting in Machine Learning?](https://www.freecodecamp.org/news/what-is-overfitting-machine-learning/#:~:text=The%20accuracy%20gap%20is%20a,what%20you%20should%20look%20for.)

[2] [Data Science - Ideal difference in the training accuracy and testing accuracy](https://datascience.stackexchange.com/questions/20256/ideal-difference-in-the-training-accuracy-and-testing-accuracy#:~:text=If%20test%20data%20starts%20decreasing%2C%20you%20have%20overfitting.&text=A%20difference%20of%205%25%20is%20fine.,and%20verify%20with%20mean%20accuracies.)

[3] [Medium - Hyper Parameter Tuning with Randomised Grid Search](https://medium.com/m/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fhyper-parameter-tuning-with-randomised-grid-search-54f865d27926)
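The probability quoted in the conclusion can be approximated with a short calculation. Assuming each iteration draws a setting independently and uniformly from the search space, the chance that at least one of `n` draws lands in the top `5%` is `1 - 0.95^n`. The helper name below is hypothetical.

```python
def chance_of_top_fraction(n_iter: int, top_fraction: float = 0.05) -> float:
    """Probability that at least one of n_iter independent uniform draws
    lands in the best top_fraction of the search space."""
    return 1.0 - (1.0 - top_fraction) ** n_iter


for n in (1, 10, 60):
    print(f"n_iter={n:>2}: {chance_of_top_fraction(n):.3f}")
```

Under these assumptions, 60 iterations give roughly a 95% chance, in line with the figure cited above, while a single iteration (`n_iter=1`, as in the search cell) gives about 5%.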