<center>

### COSC2753 - Machine Learning

# **Decision Tree**

<center>────────────────────────────</center>
&nbsp;

# I. Global Configuration

In [396]:
import sys
import importlib
import tabulate
import pandas as pd
import numpy as np
import sklearn
import statsmodels
import imblearn
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.exceptions import FitFailedWarning
import warnings

# Reload modules
sys.path.append("../../")  # Root directory
modules_to_reload = [
    "scripts.styler",
    "scripts.neko",
    "scripts.utils",
]

# Reload modules if they have been modified
missing_modules = []

for module_name in modules_to_reload:
    if module_name in sys.modules:
        importlib.reload(sys.modules[module_name])
    else:
        missing_modules.append(module_name)

# Recache missing modules
if missing_modules:
    print(f"Modules {missing_modules} not found. \nRecaching...")

# Import user-defined scripts
from scripts.styler import Styler
from scripts.neko import Neko
from scripts.utils import Utils

# Initialize styler
styler = Styler()  # Text Styler

# Check package versions
styler.draw_box("Validating Package Versions...")

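try:
    with open("../../requirements.txt", "r") as file:
        requirements = file.readlines()
except FileNotFoundError:
    # Fall back to an empty list so the version check below does not raise a NameError
    requirements = []
    print("File '../../requirements.txt' not found. Please check your directory!")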

packages_to_check = [np, pd, tabulate, sklearn, statsmodels, imblearn]

for package in packages_to_check:
    Utils.version_check(package, requirements=requirements)

styled_text = styler.style("\nDone validating packages\n", bold=True, italic=True)
print(styled_text)

# Initialize objects
styler.draw_box("Initializing Project...")
neko = Neko()  # Panda extension
bullet = ">>>"  # Bullet point

# Configuration
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.precision", 3)

styled_text = styler.style("Done initializing project...", bold=True, italic=True)
print(styled_text)

┌──────────────────────────────────┐
│  Validating Package Versions...  │
└──────────────────────────────────┘
>>> numpy is up to date: 1.26.4
>>> pandas is up to date: 2.2.1
>>> tabulate is up to date: 0.9.0
>>> sklearn is up to date: 1.4.1.post1
>>> statsmodels is up to date: 0.14.1
>>> imblearn is up to date: 0.12.2
[1m[3m
Done validating packages
[0m
┌───────────────────────────┐
│  Initializing Project...  │
└───────────────────────────┘

    /\_____/\
   /  x   o  \
  ( ==  ^  == )       Neko has arrived!
   )         (        An data visualizing extension for analyzing DataFrames.
  (           )       Art: https://www.asciiart.eu/animals/cats.
 ( (  )   (  ) )
(__(__)___(__)__)

[1m[3mDone initializing project...[0m


# II. Data Loading

In [361]:
try:
    # Load data
    df_train = pd.read_csv("../../data/processed/data_train_processed.csv")

    styler.draw_box("Data Loaded Successfully")

except FileNotFoundError:
    print("Error: File not found. Please check the file path.")
except Exception as e:
    print("An error occurred:", e)

┌────────────────────────────┐
│  Data Loaded Successfully  │
└────────────────────────────┘


# III. Model Development

## 1. Model Training

Tree-based methods have a practical advantage over models such as Logistic Regression: they are insensitive to the scale of the input features. Each split compares a single feature against a threshold, so any monotonic rescaling of that feature leaves the resulting partitions unchanged. The data can therefore be passed to the tree directly, without standardizing the features first.
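The scale-invariance claim can be checked with a small experiment. The sketch below is illustrative only: it uses synthetic integer-valued data rather than the project dataset, and the names `X_demo`, `scale`, and `shift` are hypothetical. Each column is rescaled and shifted, and the two fitted trees are compared on their predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic integer-valued features (illustrative only, not the project data)
X_demo = rng.integers(0, 50, size=(500, 5)).astype(float)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 50).astype(int)

# Rescale and shift every column; powers of two keep the arithmetic exact
scale = np.array([1.0, 2.0, 0.5, 1024.0, 8.0])
shift = np.array([0.0, -10.0, 3.0, 100.0, 7.0])
X_rescaled = X_demo * scale + shift

# Same hyperparameters and seed, different feature scales
tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_demo, y_demo)
tree_rescaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(
    X_rescaled, y_demo
)

pred_raw = tree_raw.predict(X_demo)
pred_rescaled = tree_rescaled.predict(X_rescaled)

same_predictions = np.array_equal(pred_raw, pred_rescaled)
print("Identical predictions after rescaling:", same_predictions)
```

A scale-sensitive model such as regularized Logistic Regression would generally not pass the same check, because its penalty depends on the magnitude of the coefficients.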

In [362]:
# Split the data into training and testing sets
X = df_train.drop(columns=["Status"])
y = df_train["Status"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

### 1.1 Model Evaluation (First Attempt)

In [363]:
# Create and train Decision Tree classifier
dt = DecisionTreeClassifier()

neko.evaluate_model(dt, X_train, y_train, X_test, y_test)

Classification Report for Training Data:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     28473
           1       1.00      1.00      1.00     28536

    accuracy                           1.00     57009
   macro avg       1.00      1.00      1.00     57009
weighted avg       1.00      1.00      1.00     57009

Classification Report for Testing Data:
               precision    recall  f1-score   support

           0       0.82      0.81      0.82      7158
           1       0.81      0.83      0.82      7095

    accuracy                           0.82     14253
   macro avg       0.82      0.82      0.82     14253
weighted avg       0.82      0.82      0.82     14253



### 1.2 Feature Selection

In the previous analysis, the filtering method identified features **'MentHlth'** and **'AnyHealthcare'** as insignificant for predicting the target variable. These features are therefore dropped in the experiment below to test whether a reduced feature set improves model performance.

In [364]:
# Train the model (with reduced features)
styler.draw_box("Training the model (With Reduced Features)")
model = DecisionTreeClassifier(random_state=42)

# Drop the specified columns from X_train and X_test
X_train_reduced = X_train.drop(columns=["AnyHealthcare", "MentHlth"])
X_test_reduced = X_test.drop(columns=["AnyHealthcare", "MentHlth"])

model.fit(X_train_reduced, y_train)

# Evaluate on the training and testing sets
neko.evaluate_model(model, X_train_reduced, y_train, X_test_reduced, y_test)

┌──────────────────────────────────────────────┐
│  Training the model (With Reduced Features)  │
└──────────────────────────────────────────────┘
Classification Report for Training Data:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     28473
           1       1.00      1.00      1.00     28536

    accuracy                           1.00     57009
   macro avg       1.00      1.00      1.00     57009
weighted avg       1.00      1.00      1.00     57009

Classification Report for Testing Data:
               precision    recall  f1-score   support

           0       0.81      0.81      0.81      7158
           1       0.81      0.81      0.81      7095

    accuracy                           0.81     14253
   macro avg       0.81      0.81      0.81     14253
weighted avg       0.81      0.81      0.81     14253



**Initial Model Performance**

With default hyperparameters, the initial model reached an accuracy of 82% on the test set. This is a reasonable baseline, which the steps below try to improve on.

**Feature Selection Analysis**

Removing the features `MentHlth` and `AnyHealthcare` from the dataset did not improve the model's performance: test accuracy dropped slightly, from 82% to 81%. Consequently, these features will be retained in the dataset for further analysis.
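A one-point difference measured on a single train/test split can fall within split-to-split noise, so a cross-validated comparison gives a steadier picture. The sketch below is a hedged illustration on synthetic data, not the project dataset: with `shuffle=False`, `make_classification` places the noise columns last, and these stand in for the hypothetical "insignificant" features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 4 informative + 2 redundant columns first,
# followed by 2 pure-noise columns (because shuffle=False)
X_demo, y_demo = make_classification(
    n_samples=1000,
    n_features=8,
    n_informative=4,
    n_redundant=2,
    shuffle=False,
    random_state=0,
)
X_reduced = X_demo[:, :-2]  # Drop the two noise columns

model = DecisionTreeClassifier(random_state=42)

# 5-fold cross-validated accuracy with and without the noise columns
scores_full = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
scores_reduced = cross_val_score(model, X_reduced, y_demo, cv=5, scoring="accuracy")

print(f"All features:     {scores_full.mean():.3f} +/- {scores_full.std():.3f}")
print(f"Reduced features: {scores_reduced.mean():.3f} +/- {scores_reduced.std():.3f}")
```

If the two means differ by less than the fold-to-fold standard deviation, the comparison does not clearly favor either feature set.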

**Addressing Overfitting**

The model currently exhibits overfitting, with a training accuracy of `100%` against a test accuracy of only `81%` to `82%`. To address this, hyperparameter tuning will be employed to constrain the tree. This includes adjusting **ccp_alpha** for post-pruning and parameters such as **min_samples_leaf** and **max_depth** for pre-pruning.
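The difference between the two pruning styles can be sketched on synthetic data. The example below is a minimal illustration, not the tuning procedure used in this notebook: it uses scikit-learn's `cost_complexity_pruning_path` to list the effective `ccp_alpha` values of a fully grown tree, picks a mid-range value for post-pruning, and compares it with a pre-pruned tree. The choices `min_samples_leaf=10` and the mid-range alpha are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise, so an unpruned tree overfits
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Effective alphas at which the fully grown tree would lose a subtree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_tr, y_tr)
ccp_alphas = path.ccp_alphas
mid_alpha = ccp_alphas[len(ccp_alphas) // 2]  # Arbitrary mid-range choice

trees = {
    "unpruned": DecisionTreeClassifier(random_state=42),
    # Post-pruning: grow fully, then cut back using cost-complexity pruning
    "post-pruned": DecisionTreeClassifier(random_state=42, ccp_alpha=mid_alpha),
    # Pre-pruning: stop splitting once a leaf would hold fewer than 10 samples
    "pre-pruned": DecisionTreeClassifier(random_state=42, min_samples_leaf=10),
}

for name, tree in trees.items():
    tree.fit(X_tr, y_tr)
    print(
        f"{name:<12} leaves={tree.get_n_leaves():4d} "
        f"train={tree.score(X_tr, y_tr):.3f} test={tree.score(X_te, y_te):.3f}"
    )
```

Taking candidate alphas from the pruning path, rather than from a fixed evenly spaced grid, keeps the search focused on values that actually change the tree.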


## 2. Hyperparameter Tuning

In [392]:
# Warning Suppression
warnings.filterwarnings("ignore", category=FitFailedWarning)
warnings.filterwarnings("ignore", category=UserWarning)

# Hyperparameter tuning
param_grid = {
    "criterion": ["gini", "entropy"],  # Splitting criterion
    "max_depth": np.arange(3, 40, 2).tolist(),  # Maximum depth of the tree
    "min_samples_split": np.arange(
        2, 20, 2
    ).tolist(),  # Minimum number of samples required to split an internal node
    "min_samples_leaf": np.arange(
        1, 20, 1
    ).tolist(),  # Minimum number of samples required to be at a leaf node
    "max_features": [
        None,
        "sqrt",
        "log2",
    ],  # Number of features to consider when looking for the best split
    "ccp_alpha": np.arange(
        0.0, 0.1, 0.01
    ),  # Complexity parameter used for Minimal Cost-Complexity Pruning
}

In [394]:
# Decision tree classifier
classifier = DecisionTreeClassifier(random_state=42)

# Create a HalvingGridSearchCV object
search = HalvingGridSearchCV(
    estimator=classifier,
    param_grid=param_grid,
    scoring="f1_weighted",
    n_jobs=2,
    cv=10,
)

# Fit the model
search.fit(X_train, y_train)

# Print the best hyperparameters and the mean cross-validated score
# (the score is a weighted F1, since scoring="f1_weighted")
print("Best hyperparameters:", search.best_params_)
print("Best weighted F1 score:", search.best_score_)

Best hyperparameters: {'ccp_alpha': 0.0, 'criterion': 'gini', 'max_depth': 13, 'max_features': None, 'min_samples_leaf': 2, 'min_samples_split': 18}
Best weighted F1 score: 0.8545458622595306
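`HalvingGridSearchCV` does not fit every candidate on the full training set. It starts all candidates on a small sample, keeps roughly the best `1/factor` of them, and repeats with more data. The sketch below shows that schedule on synthetic data with a deliberately small grid. The grid, `factor=3`, `min_resources=200`, and all `*_demo` names are assumptions for illustration, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV, ParameterGrid
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=0)

# Deliberately small grid so the example runs quickly
small_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7, 9],
    "min_samples_leaf": [1, 5, 10],
}
n_total = len(ParameterGrid(small_grid))  # Number of candidate combinations
print("Candidates in the grid:", n_total)

search_demo = HalvingGridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=small_grid,
    scoring="f1_weighted",
    factor=3,           # Keep about one third of the candidates per round
    min_resources=200,  # Samples given to each candidate in the first round
    cv=5,
    random_state=42,    # Fixes the subsampling so the run is repeatable
)
search_demo.fit(X_demo, y_demo)

# Candidates shrink while the sample budget grows from round to round
for i, (n_cand, n_res) in enumerate(
    zip(search_demo.n_candidates_, search_demo.n_resources_)
):
    print(f"Round {i}: {n_cand} candidates, {n_res} samples each")

print("Best hyperparameters:", search_demo.best_params_)
print("Best mean CV weighted F1:", search_demo.best_score_)
```

Because the early rounds see only a subsample, passing `random_state` to the search itself (not only to the estimator) is what makes the selected hyperparameters reproducible.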


## 3. Model Evaluation (Final Attempt)

In [400]:
styler.draw_box("Evaluation of the best model")

# Evaluate the best model
best_model = search.best_estimator_

neko.evaluate_model(best_model, X_train, y_train, X_test, y_test)

┌────────────────────────────────┐
│  Evaluation of the best model  │
└────────────────────────────────┘
Classification Report for Training Data:
               precision    recall  f1-score   support

           0       0.86      0.91      0.88     28473
           1       0.91      0.85      0.88     28536

    accuracy                           0.88     57009
   macro avg       0.88      0.88      0.88     57009
weighted avg       0.88      0.88      0.88     57009

Classification Report for Testing Data:
               precision    recall  f1-score   support

           0       0.83      0.89      0.86      7158
           1       0.88      0.82      0.85      7095

    accuracy                           0.86     14253
   macro avg       0.86      0.85      0.85     14253
weighted avg       0.86      0.86      0.85     14253



### **Conclusion**

The final model achieved an accuracy of `86%` on the held-out test set, up from `82%` for the untuned baseline. The gap between training and test accuracy also narrowed from 18 percentage points (`100%` vs. `82%`) to 2 (`88%` vs. `86%`). This consistency suggests that the tuned tree is no longer memorizing the training data and should generalize reasonably well to new, unseen data.

<center><em><sub>─────── End Of Section ───────</sub></em></center>