<center>

### COSC2753 - Machine Learning

# **Decision Tree**

<center>────────────────────────────</center>
&nbsp;

# I. Global Configuration

In [360]:
import sys
import importlib
import tabulate
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import sklearn
import statsmodels
import imblearn
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import (
    RandomizedSearchCV,
    GridSearchCV,
    HalvingGridSearchCV,
)
from sklearn.exceptions import FitFailedWarning
import warnings
from sklearn.feature_selection import RFECV

# Reload modules
sys.path.append("../../")  # Root directory
modules_to_reload = [
    "scripts.styler",
    "scripts.neko",
    "scripts.utils",
]

# Reload modules if they have been modified
missing_modules = []

for module_name in modules_to_reload:
    if module_name in sys.modules:
        importlib.reload(sys.modules[module_name])
    else:
        missing_modules.append(module_name)

# Recache missing modules
if missing_modules:
    print(f"Modules {missing_modules} not found. \nRecaching...")

# Import user-defined scripts
from scripts.styler import Styler
from scripts.neko import Neko
from scripts.utils import Utils


# Initialize styler
styler = Styler()  # Text Styler

# Check package versions
styler.draw_box("Checking Package Versions...")

requirements = []  # Fallback so the version check still runs if the file is missing
try:
    with open("../../requirements.txt", "r") as file:
        requirements = file.readlines()
except FileNotFoundError:
    print("File '../../requirements.txt' not found.")

packages_to_check = [np, pd, sns, matplotlib, tabulate, sklearn, statsmodels, imblearn]

for package in packages_to_check:
    Utils.version_check(package, requirements=requirements)

styled_text = styler.style(
    "\nDone checking packages version...\n", bold=True, italic=True
)
print(styled_text)

# Initialize objects
styler.draw_box("Initializing Project...")
neko = Neko()  # Panda extension
bullet = ">>>"  # Bullet point
import matplotlib.pyplot as plt  # Explicit import; do not rely on seaborn loading pyplot

# Configuration
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.precision", 3)

styled_text = styler.style("Done initializing project...", bold=True, italic=True)
print(styled_text)

┌────────────────────────────────┐
│  Checking Package Versions...  │
└────────────────────────────────┘
>>> numpy is up to date: 1.26.4
>>> pandas is up to date: 2.2.1
>>> seaborn is up to date: 0.13.2
>>> matplotlib is up to date: 3.8.3
>>> tabulate is up to date: 0.9.0
>>> sklearn is up to date: 1.4.1.post1
>>> statsmodels is up to date: 0.14.1
>>> imblearn is up to date: 0.12.2

Done checking packages version...
┌───────────────────────────┐
│  Initializing Project...  │
└───────────────────────────┘

    /\_____/\
   /  x   o  \
  ( ==  ^  == )       Neko has arrived!
   )         (        An data visualizing extension for analyzing DataFrames.
  (           )       Art: https://www.asciiart.eu/animals/cats.
 ( (  )   (  ) )
(__(__)___(__)__)

Done initializing project...


# II. Data Loading

In [361]:
try:
    # Load data
    df_train = pd.read_csv("../../data/processed/data_train_processed.csv")
    df_test = pd.read_csv("../../data/test/data_test.csv")

    styler.draw_box("Data Loaded Successfully")
except FileNotFoundError:
    print("Error: File not found. Please check the file path.")
except Exception as e:
    print("An error occurred:", e)

┌────────────────────────────┐
│  Data Loaded Successfully  │
└────────────────────────────┘


# III. Model Development

## 1. Model Training

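Unlike models such as *Logistic Regression*, whose regularization and convergence depend on feature magnitudes, tree-based methods are inherently **insensitive** to the scaling of input features. Each split compares a single feature against a threshold, so any order-preserving rescaling of that feature leaves the resulting partitions unchanged. The data can therefore be used directly, without standardization or normalization.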
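
As a small illustration of this property, the sketch below fits the same tree on raw and rescaled copies of a synthetic dataset and compares the predictions. The data, the scale factors, and every variable name here are assumptions made for the demonstration; none of them come from the project dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic integer-valued features (assumption: stand-in for the real data)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 100, size=(400, 4)).astype(float)
y_demo = (X_demo[:, 0] + 2 * X_demo[:, 1] > 150).astype(int)

# Rescale each column by a different positive factor (e.g. a change of units)
scale = np.array([1.0, 1000.0, 0.5, 8.0])
X_rescaled = X_demo * scale

# Fit identical trees on the first 300 rows of the raw and rescaled data
tree_raw = DecisionTreeClassifier(random_state=42).fit(X_demo[:300], y_demo[:300])
tree_rescaled = DecisionTreeClassifier(random_state=42).fit(
    X_rescaled[:300], y_demo[:300]
)

# Compare predictions on the remaining 100 rows
pred_raw = tree_raw.predict(X_demo[300:])
pred_rescaled = tree_rescaled.predict(X_rescaled[300:])
same_predictions = bool(np.array_equal(pred_raw, pred_rescaled))
print("Identical predictions after rescaling:", same_predictions)
```

Positive scale factors keep the ordering of each feature intact, which is why the split structure carries over. A transformation that reorders values would not share this guarantee.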

In [362]:
# Split the data into training and testing sets
X = df_train.drop(columns=["Status"])
y = df_train["Status"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

## 2. Model Evaluation (First Attempt)

In [363]:
# Create and train Decision Tree classifier
dt = DecisionTreeClassifier()

neko.evaluate_model(dt, X_train, y_train, X_test, y_test)

Classification Report for Training Data:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     28473
           1       1.00      1.00      1.00     28536

    accuracy                           1.00     57009
   macro avg       1.00      1.00      1.00     57009
weighted avg       1.00      1.00      1.00     57009

Classification Report for Testing Data:
               precision    recall  f1-score   support

           0       0.82      0.81      0.82      7158
           1       0.81      0.83      0.82      7095

    accuracy                           0.82     14253
   macro avg       0.82      0.82      0.82     14253
weighted avg       0.82      0.82      0.82     14253



### 2.1 Feature Selection

In the previous analysis, the filter method flagged the features **'MentHlth'** and **'AnyHealthcare'** as insignificant for predicting the target variable. The model is therefore retrained without them to check whether excluding them improves performance.

In [364]:
# Train the model (with reduced features)
styler.draw_box("Training the model (With Reduced Features)")
model = DecisionTreeClassifier(random_state=42)

# Drop the specified columns from X_train and X_test
X_train_reduced = X_train.drop(columns=["AnyHealthcare", "MentHlth"])
X_test_reduced = X_test.drop(columns=["AnyHealthcare", "MentHlth"])

model.fit(X_train_reduced, y_train)

# Evaluate on the training and testing sets
neko.evaluate_model(model, X_train_reduced, y_train, X_test_reduced, y_test)

┌──────────────────────────────────────────────┐
│  Training the model (With Reduced Features)  │
└──────────────────────────────────────────────┘
Classification Report for Training Data:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     28473
           1       1.00      1.00      1.00     28536

    accuracy                           1.00     57009
   macro avg       1.00      1.00      1.00     57009
weighted avg       1.00      1.00      1.00     57009

Classification Report for Testing Data:
               precision    recall  f1-score   support

           0       0.81      0.81      0.81      7158
           1       0.81      0.81      0.81      7095

    accuracy                           0.81     14253
   macro avg       0.81      0.81      0.81     14253
weighted avg       0.81      0.81      0.81     14253



**Initial Model Performance**

The initial model reached a test accuracy of **82%** with default hyperparameters. This is a **promising starting point**, but the perfect training accuracy (**100%**) shows that the unconstrained tree memorizes the training data, so additional optimization techniques will be explored.

**Feature Selection Analysis**

Removing the features **'MentHlth'** and **'AnyHealthcare'** did not improve the model: test accuracy decreased slightly, from 82% to **81%**. Consequently, these features will be retained in the dataset for further analysis.
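
A one-point difference on a single train/test split can fall within sampling noise, so a cross-validated comparison is one way to check such a result. The sketch below is only an illustration on synthetic data: `make_classification` stands in for the real dataset, and the two trailing noise columns play the role of the dropped features (both are assumptions, not part of the original analysis).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (assumption): 10 features, of which 4 are informative
X_syn, y_syn = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=4,
    n_redundant=0,
    shuffle=False,  # keeps informative columns first, noise columns last
    random_state=42,
)

# Drop the last two columns, which are pure noise under shuffle=False
X_syn_reduced = X_syn[:, :-2]

tree = DecisionTreeClassifier(random_state=42)
scores_full = cross_val_score(tree, X_syn, y_syn, cv=5, scoring="accuracy")
scores_reduced = cross_val_score(tree, X_syn_reduced, y_syn, cv=5, scoring="accuracy")

print(f"All features:     {scores_full.mean():.3f} +/- {scores_full.std():.3f}")
print(f"Reduced features: {scores_reduced.mean():.3f} +/- {scores_reduced.std():.3f}")
```

If the two means differ by less than the fold-to-fold spread, the data does not clearly favor either feature set.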


### 2.2 Post-Pruning

To reduce overfitting, **post-pruning** will be applied to the decision tree. The tree is first grown in full, and branches that contribute little to performance on held-out data are then removed, even if they improve accuracy on the training set. Favoring pruning levels at which training and validation accuracy are close improves generalization.

Post-pruning is the chosen strategy because cost-complexity pruning is governed by a single parameter (`ccp_alpha`), which makes it less sensitive to tuning than **pre-pruning**, where several stopping criteria (such as maximum depth and minimum samples per leaf) must be set jointly.

Given the large dataset size and the application of oversampling, **RandomizedSearchCV** will be employed for efficient hyperparameter optimization of the decision tree model.
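
The following is a minimal sketch of cost-complexity post-pruning with scikit-learn, run on synthetic data. It does not reproduce `neko.post_pruning`, whose internals are not shown here: the dataset, the separate validation split, and the simple loop over candidate alphas are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption), with some label noise
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_informative=4, flip_y=0.1, random_state=42
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42
)

# Grow the full tree and compute its pruning path
full_tree = DecisionTreeClassifier(random_state=42).fit(X_fit, y_fit)
path = full_tree.cost_complexity_pruning_path(X_fit, y_fit)

# Skip the last alpha (it prunes the tree down to the root) and clip any
# tiny negative values caused by floating-point error
candidate_alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0.0, None))

# Pick the alpha with the best validation accuracy
best_alpha, best_score = 0.0, -np.inf
for alpha in candidate_alphas:
    pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    score = pruned.fit(X_fit, y_fit).score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (simpler tree)
        best_alpha, best_score = alpha, score

pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
pruned_tree.fit(X_fit, y_fit)

print("Best ccp_alpha:", best_alpha)
print("Nodes (full / pruned):", full_tree.tree_.node_count, "/", pruned_tree.tree_.node_count)
print("Validation accuracy:", round(best_score, 3))
```

Selecting the alpha on a validation split, rather than on the final test set, keeps the test set untouched for the last evaluation.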


In [365]:
# # Compute the candidate ccp_alpha values using cost_complexity_pruning_path()
# clf = DecisionTreeClassifier(random_state=42)

# # Get the effective alphas and the corresponding total leaf impurities
# path = clf.cost_complexity_pruning_path(X_train, y_train)
# ccp_alphas, impurities = path.ccp_alphas, path.impurities

# best_alpha = neko.post_pruning(
#     clf,
#     X_train,
#     y_train,
#     X_test,
#     y_test,
#     classifier="decision_tree",
#     n_iterations=300,
#     n_jobs=6,
# )

# # Train the model with the best alpha
# clf = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
# neko.evaluate_model(clf, X_train, y_train, X_test, y_test)

In [366]:
# # Confusion matrix
# clf = DecisionTreeClassifier(random_state=42)

# # Fit the model
# clf.fit(X_train, y_train)

# # Confusion matrix for training set
# train_conf_matrix = confusion_matrix(y_train, clf.predict(X_train))

# # Confusion matrix for testing set
# test_conf_matrix = confusion_matrix(y_test, clf.predict(X_test))

# # Plot the confusion matrices
# plt.figure(figsize=(20, 7))

# plt.subplot(1, 2, 1)
# sns.heatmap(train_conf_matrix, annot=True, fmt="d", cmap="Reds")
# plt.title("Confusion Matrix - Training Set")
# plt.xlabel("Predicted")
# plt.ylabel("Actual")

# plt.subplot(1, 2, 2)
# sns.heatmap(test_conf_matrix, annot=True, fmt="d", cmap="Blues")
# plt.title("Confusion Matrix - Testing Set")
# plt.xlabel("Predicted")
# plt.ylabel("Actual")

# plt.show()

### 2.3 Model Evaluation (After Feature Selection)

In [367]:
# neko.evaluate_model(model, X_train, y_train, X_test, y_test)

The model overfits heavily: training accuracy is **100%**, while the test accuracy reported in the first evaluation is only **82%**. To address this issue, hyperparameter tuning will be performed to optimize the model's performance.

### 2.4 Hyperparameter Tuning

In [392]:
# Warning Suppression
warnings.filterwarnings("ignore", category=FitFailedWarning)
warnings.filterwarnings("ignore", category=UserWarning)

# Hyperparameter tuning
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": np.arange(3, 40, 2).tolist(),
    "min_samples_split": np.arange(2, 20, 2).tolist(),
    "min_samples_leaf": np.arange(1, 20, 1).tolist(),
    "max_features": [None, "sqrt", "log2"],
    "ccp_alpha": np.arange(0.0, 0.1, 0.01),
}
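
Before launching a search it can help to count how many candidates a grid contains. The sketch below rebuilds the same grid under the hypothetical name `param_grid_demo` and counts its combinations with `ParameterGrid`; the factor of 10 assumes 10-fold cross-validation.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same values as the grid above, rebuilt so the snippet is self-contained
param_grid_demo = {
    "criterion": ["gini", "entropy"],
    "max_depth": np.arange(3, 40, 2).tolist(),
    "min_samples_split": np.arange(2, 20, 2).tolist(),
    "min_samples_leaf": np.arange(1, 20, 1).tolist(),
    "max_features": [None, "sqrt", "log2"],
    "ccp_alpha": np.arange(0.0, 0.1, 0.01),
}

# Number of distinct hyperparameter combinations in the grid
n_candidates = len(ParameterGrid(param_grid_demo))

# An exhaustive grid search would fit every candidate once per fold
n_fits_exhaustive = n_candidates * 10

print("Candidates in grid:", n_candidates)
print("Fits for an exhaustive 10-fold search:", n_fits_exhaustive)
```

With a grid on the order of 10^5 candidates, an exhaustive search is expensive, which motivates a cheaper strategy such as randomized search or successive halving.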

In [394]:
# Decision tree classifier
classifier = DecisionTreeClassifier(random_state=42)

# Create a HalvingGridSearchCV object (successive halving over the grid)
search = HalvingGridSearchCV(
    estimator=classifier,
    param_grid=param_grid,
    scoring="f1_weighted",
    n_jobs=2,
    cv=10,
)

# Fit the model
search.fit(X_train, y_train)

# Print the best hyperparameters and the best cross-validated score
# (scoring="f1_weighted", so this is a weighted F1 score, not accuracy)
print("Best hyperparameters:", search.best_params_)
print("Best weighted F1 (CV):", search.best_score_)

Best hyperparameters: {'ccp_alpha': 0.0, 'criterion': 'gini', 'max_depth': 13, 'max_features': None, 'min_samples_leaf': 2, 'min_samples_split': 18}
Best weighted F1 (CV): 0.8545458622595306
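
The score printed above is a cross-validated estimate computed on the training data, so a final check on held-out data is still useful. The sketch below illustrates that step, together with how successive halving narrows the candidate set, on synthetic data with a deliberately small grid. The dataset, the grid, and names such as `demo_search` are assumptions for illustration and do not reproduce the run above.

```python
from sklearn.datasets import make_classification
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.metrics import f1_score
from sklearn.model_selection import HalvingGridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption)
X_syn, y_syn = make_classification(
    n_samples=1200, n_features=10, n_informative=4, random_state=42
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Small illustrative grid: 4 x 3 = 12 candidates
small_grid = {"max_depth": [3, 5, 9, None], "min_samples_leaf": [1, 5, 20]}

demo_search = HalvingGridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=small_grid,
    scoring="f1_weighted",
    factor=3,  # keep roughly the best third of candidates each round
    cv=3,
    random_state=42,
)
demo_search.fit(X_fit, y_fit)

# Each round evaluates fewer candidates on more samples
print("Candidates per iteration:", demo_search.n_candidates_)
print("Samples per iteration:   ", demo_search.n_resources_)

# Score the refitted best estimator on the held-out split
hold_f1 = f1_score(y_hold, demo_search.predict(X_hold), average="weighted")
print("Best parameters:", demo_search.best_params_)
print("Held-out weighted F1:", round(hold_f1, 3))
```

In the notebook itself, the analogous step would be to evaluate `search.best_estimator_` on `X_test` and `y_test`.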
