<center>

### COSC2753 - Machine Learning

# **Decision Tree**

<center>────────────────────────────</center>
&nbsp;

# I. Global Configuration

In [1]:
import sys
import importlib
import tabulate
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot  # Load the pyplot submodule explicitly for the `plt` alias
import sklearn
import statsmodels
import imblearn
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.exceptions import FitFailedWarning
import warnings

# Reload modules
sys.path.append("../../")  # Root directory
modules_to_reload = [
    "scripts.styler",
    "scripts.neko",
    "scripts.utils",
]

# Reload modules if they have been modified
missing_modules = []

for module_name in modules_to_reload:
    if module_name in sys.modules:
        importlib.reload(sys.modules[module_name])
    else:
        missing_modules.append(module_name)

# Recache missing modules
if missing_modules:
    print(f"Modules {missing_modules} not found. \nRecaching...")

# Import user-defined scripts
from scripts.styler import Styler
from scripts.neko import Neko
from scripts.utils import Utils


# Initialize styler
styler = Styler()  # Text Styler

# Check package versions
styler.draw_box("Checking Package Versions...")

try:
    with open("../../requirements.txt", "r") as file:
        requirements = file.readlines()
except FileNotFoundError:
    requirements = []  # Keep the name defined so the version check below does not raise NameError
    print("File '../../requirements.txt' not found.")

packages_to_check = [np, pd, sns, matplotlib, tabulate, sklearn, statsmodels, imblearn]

for package in packages_to_check:
    Utils.version_check(package, requirements=requirements)

styled_text = styler.style(
    "\nDone checking packages version...\n", bold=True, italic=True
)
print(styled_text)

# Initialize objects
styler.draw_box("Initializing Project...")
neko = Neko()  # Pandas extension
bullet = ">>>"  # Bullet point
plt = matplotlib.pyplot  # Matplotlib

# Configuration
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.precision", 3)

styled_text = styler.style("Done initializing project...", bold=True, italic=True)
print(styled_text)

Modules ['scripts.styler', 'scripts.neko', 'scripts.utils'] not found. 
Recaching...
┌────────────────────────────────┐
│  Checking Package Versions...  │
└────────────────────────────────┘
>>> numpy is up to date: 1.26.4
>>> pandas is up to date: 2.2.1
>>> seaborn is up to date: 0.13.2
>>> matplotlib is up to date: 3.8.3
>>> tabulate is up to date: 0.9.0
>>> sklearn is up to date: 1.4.1.post1
>>> statsmodels is up to date: 0.14.1
>>> imblearn is up to date: 0.12.2
[1m[3m
Done checking packages version...
[0m
┌───────────────────────────┐
│  Initializing Project...  │
└───────────────────────────┘

    /\_____/\
   /  x   o  \
  ( ==  ^  == )       Neko has arrived!
   )         (        An data visualizing extension for analyzing DataFrames.
  (           )       Art: https://www.asciiart.eu/animals/cats.
 ( (  )   (  ) )
(__(__)___(__)__)

[1m[3mDone initializing project...[0m


# II. Data Loading

In [2]:
try:
    # Load data
    df_train = pd.read_csv("../../data/processed/data_train_processed.csv")
    df_test = pd.read_csv("../../data/test/data_test.csv")

    styler.draw_box("Data Loaded Successfully")
except FileNotFoundError:
    print("Error: File not found. Please check the file path.")
except Exception as e:
    print("An error occurred:", e)

┌────────────────────────────┐
│  Data Loaded Successfully  │
└────────────────────────────┘


# III. Model Development

## 1. Model Training

Compared to techniques like *Logistic Regression*, a key advantage of tree-based methods is their inherent **insensitivity** to the scaling of input features. Each split compares a single feature against a threshold, so only the ordering of values within a feature matters, not their magnitude. As a result, tree-based algorithms can use the data directly, without first scaling it to a specific range.
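The scale-insensitivity claim can be checked empirically. The following is a minimal sketch on synthetic data; the dataset, the scaling factors, and all variable names are illustrative assumptions and not part of this notebook's pipeline. Each column is multiplied by an arbitrary positive constant, and the tree fitted on the rescaled data is expected to predict exactly what the tree fitted on the raw data predicts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic integer-valued features with a noisy label (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 100, size=(2000, 5)).astype(float)
signal = X_demo[:, 0] + 2 * X_demo[:, 1] > 150
noise = rng.random(2000) < 0.1
y_demo = (signal ^ noise).astype(int)

# Rescale every column by a different positive constant. Powers of two keep
# the floating-point arithmetic exact, so rounding cannot blur the comparison.
scale = np.array([1024.0, 0.5, 4.0, 0.25, 8.0])
X_scaled = X_demo * scale

# Split raw and rescaled features with the same row assignment
Xa_tr, Xa_te, Xb_tr, Xb_te, y_tr, y_te = train_test_split(
    X_demo, X_scaled, y_demo, test_size=0.2, random_state=42
)

tree_raw = DecisionTreeClassifier(random_state=42).fit(Xa_tr, y_tr)
tree_scaled = DecisionTreeClassifier(random_state=42).fit(Xb_tr, y_tr)

same_predictions = np.array_equal(
    tree_raw.predict(Xa_te), tree_scaled.predict(Xb_te)
)
print("Identical predictions after rescaling:", same_predictions)
```

A linear model with regularization would generally not pass the same check, which is why scaling matters there but can be skipped for a single decision tree.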

In [3]:
# Split the data into training and testing sets
X = df_train.drop(columns=["Status"])  # `columns=` already selects the axis
y = df_train["Status"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

## 2. Model Evaluation (First Attempt)

In [4]:
# Create and train Decision Tree classifier
dt = DecisionTreeClassifier(random_state=42)  # Fixed seed for reproducible results
dt.fit(X_train, y_train)

# Calculate and print accuracy
accuracy = dt.score(X_test, y_test)
print("Decision Tree Accuracy:", accuracy)

# Print classification report
print("Classification Report:")
print(classification_report(y_test, dt.predict(X_test)))

Decision Tree Accuracy: 0.8893958888124234
Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.88      0.89     33403
           1       0.88      0.90      0.89     33439

    accuracy                           0.89     66842
   macro avg       0.89      0.89      0.89     66842
weighted avg       0.89      0.89      0.89     66842



### 2.1 Feature Selection

In the previous analysis, the filter-based feature selection flagged **'MentHlth'** and **'AnyHealthcare'** as insignificant for predicting the target variable. The cell below trains the tree once with all features and once without these two, so that the effect of excluding them can be compared directly.

In [5]:
# Train the model (with all features)
styler.draw_box("Training the model (With All Features)")
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Classification report
report = classification_report(y_test, y_pred)
print(report)

# Train the model (with reduced features)
styler.draw_box("Training the model (With Reduced Features)")
model = DecisionTreeClassifier(random_state=42)

# Drop the specified columns from X_train and X_test
X_train_reduced = X_train.drop(columns=["AnyHealthcare", "MentHlth"])
X_test_reduced = X_test.drop(columns=["AnyHealthcare", "MentHlth"])

model.fit(X_train_reduced, y_train)

# Make predictions
y_pred = model.predict(X_test_reduced)

# Classification report
report = classification_report(y_test, y_pred)
print(report)

┌──────────────────────────────────────────┐
│  Training the model (With All Features)  │
└──────────────────────────────────────────┘
              precision    recall  f1-score   support

           0       0.90      0.88      0.89     33403
           1       0.88      0.90      0.89     33439

    accuracy                           0.89     66842
   macro avg       0.89      0.89      0.89     66842
weighted avg       0.89      0.89      0.89     66842

┌──────────────────────────────────────────────┐
│  Training the model (With Reduced Features)  │
└──────────────────────────────────────────────┘


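The initial model achieved an accuracy of **89%** on the held-out split without hyperparameter tuning. This is a promising result; however, an unconstrained decision tree keeps splitting until its leaves are pure and can therefore memorize the training set. To ensure the model generalizes well to unseen data, techniques to address overfitting will be applied.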

**Overfitting Mitigation Strategies:**

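1. **Complexity Regularization:** This approach tunes the hyperparameters that control tree complexity, such as the maximum depth, the minimum number of samples per split and per leaf, the maximum number of leaf nodes, and the cost-complexity pruning strength (`ccp_alpha`). Constraining these discourages overly complex trees and encourages the model to learn generalizable patterns instead of memorizing specific training examples.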

2. **In-depth Analysis (if necessary):** If overfitting persists, a deeper investigation will be conducted. This may involve:
   - **Data Re-Balancing:** This step will explore alternative methods to address any imbalances within the training data.
   - **Feature Selection:** This process will identify and remove features that are irrelevant or redundant for the prediction task, potentially improving model generalizability.
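
Whether a tree is overfitting can be gauged by comparing its training accuracy with its held-out accuracy. The sketch below does this on synthetic data for an unconstrained tree and a few constrained variants; the dataset and the specific parameter values are illustrative assumptions, not tuned settings for this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=20, n_informative=5, flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# An unconstrained tree versus trees with one complexity constraint each
settings = {
    "unconstrained": {},
    "max_depth=5": {"max_depth": 5},
    "min_samples_leaf=20": {"min_samples_leaf": 20},
    "ccp_alpha=0.005": {"ccp_alpha": 0.005},
}

results = {}
for name, params in settings.items():
    tree = DecisionTreeClassifier(random_state=42, **params).fit(X_tr, y_tr)
    train_acc = tree.score(X_tr, y_tr)
    test_acc = tree.score(X_te, y_te)
    n_leaves = tree.get_n_leaves()
    results[name] = (train_acc, test_acc, n_leaves)
    print(f"{name:<22} train={train_acc:.3f}  test={test_acc:.3f}  leaves={n_leaves}")
```

A large gap between training and held-out accuracy for the unconstrained tree, which narrows under the constrained settings, is the usual sign that the constraints are reducing overfitting.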

### 2.2 Hyperparameter Tuning

In [None]:
# Warning Suppression
warnings.filterwarnings("ignore", category=FitFailedWarning)
warnings.filterwarnings("ignore", category=UserWarning)

# Hyperparameter tuning
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": np.arange(5, 31, 2).tolist(),
    "min_samples_split": np.arange(5, 21, 2).tolist(),
    "min_samples_leaf": np.arange(1, 11).tolist(),
    "max_leaf_nodes": np.arange(3, 51, 2).tolist(),
    "ccp_alpha": np.arange(0.0, 0.101, 0.01),
}
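
Before running an exhaustive search, it helps to count how many candidates the grid contains. The sketch below repeats the grid from the cell above so that it is self-contained, and assumes 5-fold cross-validation. The grid expands to 549,120 combinations, which means 2,745,600 model fits.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell, repeated here to keep the snippet standalone
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": np.arange(5, 31, 2).tolist(),
    "min_samples_split": np.arange(5, 21, 2).tolist(),
    "min_samples_leaf": np.arange(1, 11).tolist(),
    "max_leaf_nodes": np.arange(3, 51, 2).tolist(),
    "ccp_alpha": np.arange(0.0, 0.101, 0.01),
}

n_candidates = len(ParameterGrid(param_grid))
n_folds = 5  # assumed number of cross-validation folds
n_fits = n_candidates * n_folds
print(f"{n_candidates:,} candidates x {n_folds} folds = {n_fits:,} fits")
```

With a training set of this size and a single worker, a search of that scale may be impractical; coarsening the grid or sampling from it are common ways to reduce the cost.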

In [None]:
# Decision tree classifier
classifier = DecisionTreeClassifier(random_state=42)

# Create a GridSearchCV object (exhaustive search over param_grid)
search = GridSearchCV(
    estimator=classifier,
    param_grid=param_grid,
    scoring="f1_weighted",
    n_jobs=1,
    cv=5,
)

# Fit the model
search.fit(X_train, y_train)

# Print the best hyperparameters
print("Best hyperparameters:", search.best_params_)
print("Best weighted F1 (CV):", search.best_score_)

# Make predictions
y_pred = search.predict(X_test)

# Classification report
report = classification_report(y_test, y_pred)
print(report)
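
If the exhaustive search proves too slow, a randomized search over the same value ranges is one possible alternative. The following is a minimal sketch on a small synthetic dataset, not the configuration used in this notebook; the sampling budget `n_iter=20` and the dataset are assumptions chosen so that the example runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset so the sketch finishes quickly (illustrative only)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)

# Same value ranges as the grid, now sampled instead of enumerated
param_distributions = {
    "criterion": ["gini", "entropy"],
    "max_depth": np.arange(5, 31, 2).tolist(),
    "min_samples_split": np.arange(5, 21, 2).tolist(),
    "min_samples_leaf": np.arange(1, 11).tolist(),
    "max_leaf_nodes": np.arange(3, 51, 2).tolist(),
    "ccp_alpha": np.arange(0.0, 0.101, 0.01).tolist(),
}

random_search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=20,  # number of sampled candidates (assumed budget)
    scoring="f1_weighted",
    cv=5,
    random_state=42,  # makes the sampled candidates reproducible
)
random_search.fit(X_demo, y_demo)

print("Best hyperparameters:", random_search.best_params_)
print("Best weighted F1 (CV):", random_search.best_score_)
```

Here 20 candidates are evaluated instead of the full grid, trading completeness for run time. Setting `n_jobs=-1` on either search class would additionally spread the fits across all available CPU cores.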