# Worksheet 8 — Decision Tree, Ensemble Methods, and Hyperparameter Tuning

## 1. Import Required Libraries

In this section, we import all necessary Python libraries for:
- Data handling
- Machine learning models
- Evaluation metrics
- Hyperparameter tuning
- Visualization

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time


from sklearn.datasets import load_iris, load_wine, load_diabetes
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error

## 2. Helper Functions for Evaluation


The following helper functions are used to evaluate:
- **Classification models** using Accuracy and F1-score
- **Regression models** using MSE and RMSE
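
To make these metrics concrete, the following minimal sketch computes accuracy, macro F1, MSE, and RMSE by hand with NumPy. The helper name `macro_f1` and the small arrays are made up for illustration and are not part of the worksheet's datasets.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in np.unique(np.concatenate([y_true, y_pred])):
        tp = np.sum((y_pred == c) & (y_true == c))  # true positives for class c
        fp = np.sum((y_pred == c) & (y_true != c))  # false positives
        fn = np.sum((y_pred != c) & (y_true == c))  # false negatives
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Illustrative labels: one sample of class 0 is misclassified as class 1.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2]
accuracy = float(np.mean(np.array(y_true) == np.array(y_pred)))
f1 = macro_f1(y_true, y_pred)
print(f"Accuracy: {accuracy:.4f}, F1 (macro): {f1:.4f}")

# Illustrative regression errors (prediction minus target).
errors = np.array([3.0, -4.0])
mse = float(np.mean(errors ** 2))
rmse = float(np.sqrt(mse))  # same units as the target
print(f"RMSE: {rmse:.4f}, MSE: {mse:.4f}")
```

Macro averaging gives every class the same weight, so a weak score on a small class lowers the result as much as a weak score on a large one.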

In [6]:
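def report_classification_results(y_true, y_pred, label="Model"):
    """Print and return accuracy and macro-averaged F1 for a classifier."""
    acc = accuracy_score(y_true, y_pred)
    # Macro averaging weights every class equally, regardless of its size.
    f1_macro = f1_score(y_true, y_pred, average='macro')
    print(f"{label} -> Accuracy: {acc:.4f}, F1 (macro): {f1_macro:.4f}")
    return acc, f1_macro


def report_regression_results(y_true, y_pred, label="Model"):
    """Print and return MSE and RMSE for a regressor."""
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)  # RMSE is in the same units as the target
    print(f"{label} -> RMSE: {rmse:.4f}, MSE: {mse:.4f}")
    return mse, rmse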

## 3. Custom Decision Tree (Classification)


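This is a **from-scratch implementation** of a Decision Tree classifier that uses:
- **Entropy** as the impurity measure of a node
- **Information Gain** (the reduction in entropy after a split) to choose each split

This implementation is **educational**, not optimized for performance: it tries every unique feature value as a candidate threshold and expects class labels encoded as non-negative integers.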
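
As a worked illustration of the two quantities, here is a small standalone sketch that mirrors the entropy and information-gain calculations on a toy label array. The function names and the toy labels are assumptions made for this example.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits of an array of non-negative integer labels."""
    counts = np.bincount(labels)
    probs = counts[counts > 0] / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy node with two balanced classes.
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A split that separates the classes completely.
perfect = information_gain(parent, parent[:4], parent[4:])

# A split that leaves each child with a 3:1 class mix.
partial = information_gain(parent, np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1]))

print(f"parent entropy: {entropy(parent):.4f}")
print(f"gain (perfect split): {perfect:.4f}")
print(f"gain (partial split): {partial:.4f}")
```

The perfect split recovers the full 1 bit of parent entropy, while the 3:1 split gains only about 0.19 bits, which is why the tree prefers the former.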

In [7]:
class CustomDecisionTree:
    def __init__(self, max_depth=None, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.tree = None

    def fit(self, X, y):
        self.tree = self._build_tree(np.array(X), np.array(y), depth=0)

    def _entropy(self, y):
        counts = np.bincount(y)
        probs = counts / counts.sum()
        probs = probs[probs > 0]
        return -np.sum(probs * np.log2(probs))

    def _information_gain(self, y, y_left, y_right):
        w_left = len(y_left) / len(y)
        w_right = len(y_right) / len(y)
        return self._entropy(y) - (w_left * self._entropy(y_left) + w_right * self._entropy(y_right))

    def _best_split(self, X, y):
        best_gain, split_idx, split_thr = -1, None, None
        for idx in range(X.shape[1]):
            thresholds = np.unique(X[:, idx])
            for thr in thresholds:
                left = y[X[:, idx] <= thr]
                right = y[X[:, idx] > thr]
                if len(left) == 0 or len(right) == 0:
                    continue
                gain = self._information_gain(y, left, right)
                if gain > best_gain:
                    best_gain, split_idx, split_thr = gain, idx, thr
        return split_idx, split_thr

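    def _build_tree(self, X, y, depth):
        # Stop when the node is pure, too small to split, or at the depth limit.
        if len(np.unique(y)) == 1:
            return y[0]
        if len(y) < self.min_samples_split:
            return np.bincount(y).argmax()
        if self.max_depth is not None and depth >= self.max_depth:
            return np.bincount(y).argmax()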

        idx, thr = self._best_split(X, y)
        if idx is None:
            return np.bincount(y).argmax()

        left_mask = X[:, idx] <= thr
        right_mask = X[:, idx] > thr
        return {
            'feature': idx,
            'threshold': thr,
            'left': self._build_tree(X[left_mask], y[left_mask], depth + 1),
            'right': self._build_tree(X[right_mask], y[right_mask], depth + 1)
        }

    def _predict(self, x, node):
        if not isinstance(node, dict):
            return node
        if x[node['feature']] <= node['threshold']:
            return self._predict(x, node['left'])
        return self._predict(x, node['right'])

    def predict(self, X):
        # Convert to an array first so lists and DataFrames are iterated row by row.
        return np.array([self._predict(x, self.tree) for x in np.asarray(X)])

## 4. Custom vs scikit-learn Decision Tree (Iris Dataset)

We compare:
- Custom Decision Tree
- `DecisionTreeClassifier` from scikit-learn
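
One caveat when reading this comparison: `DecisionTreeClassifier` defaults to the Gini criterion, whereas the custom tree splits on entropy. A minimal sketch of a closer like-for-like check fits the scikit-learn tree with both criteria on the same split. The `scores` dictionary is an illustrative name, not part of the worksheet.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

# Fit the same depth-limited tree once per split criterion.
scores = {}
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=42)
    tree.fit(X_train, y_train)
    scores[criterion] = accuracy_score(y_test, tree.predict(X_test))
    print(f"criterion={criterion}: accuracy={scores[criterion]:.4f}")
```

Even with the same criterion, the two implementations need not build identical trees, since they can differ in how candidate thresholds are chosen and how ties between equally good splits are broken.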

In [8]:
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

custom_tree = CustomDecisionTree(max_depth=3)
custom_tree.fit(X_train, y_train)
y_pred_custom = custom_tree.predict(X_test)
report_classification_results(y_test, y_pred_custom, "Custom Decision Tree (Iris)")

sk_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
sk_tree.fit(X_train, y_train)
y_pred_sklearn = sk_tree.predict(X_test)
report_classification_results(y_test, y_pred_sklearn, "Scikit-learn Decision Tree (Iris)")

Custom Decision Tree (Iris) -> Accuracy: 0.9667, F1 (macro): 0.9666
Scikit-learn Decision Tree (Iris) -> Accuracy: 0.9667, F1 (macro): 0.9666


(0.9666666666666667, 0.9665831244778612)

## 5. Wine Dataset — Decision Tree vs Random Forest

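This experiment compares a single Decision Tree with a Random Forest, an ensemble that combines many trees trained on bootstrap samples of the training data, on the Wine dataset.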
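
To see the ensemble effect directly, the sketch below scores each individual tree inside a fitted forest and compares that with the forest's combined prediction. It relies on the `estimators_` attribute of a fitted `RandomForestClassifier`; the variable names `tree_acc` and `forest_acc` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

wine = load_wine()
X_tr, X_te, y_tr, y_te = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Each fitted member of the forest is itself a decision tree.
# Wine labels are already 0, 1, 2, so the trees' encoded outputs match them.
tree_acc = np.array([
    accuracy_score(y_te, est.predict(X_te).astype(int)) for est in rf.estimators_
])
forest_acc = accuracy_score(y_te, rf.predict(X_te))

print(f"trees in forest: {len(rf.estimators_)}")
print(f"single-tree accuracy: mean={tree_acc.mean():.4f}, min={tree_acc.min():.4f}")
print(f"forest accuracy: {forest_acc:.4f}")
```

With only 36 test samples, a single misclassification moves accuracy by almost three percentage points, so small differences between models on this split should be read with care.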

In [9]:
wine = load_wine()
X_tr, X_te, y_tr, y_te = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_tr, y_tr)
report_classification_results(y_te, dt.predict(X_te), "Decision Tree (Wine)")

rf = RandomForestClassifier(random_state=42)
rf.fit(X_tr, y_tr)
report_classification_results(y_te, rf.predict(X_te), "Random Forest (Wine)")

Decision Tree (Wine) -> Accuracy: 0.9444, F1 (macro): 0.9457
Random Forest (Wine) -> Accuracy: 1.0000, F1 (macro): 1.0000


(1.0, 1.0)

## 6. Hyperparameter Tuning — GridSearchCV (Random Forest Classifier)

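Grid search exhaustively evaluates every combination in a parameter grid with cross-validation and keeps the best-scoring one. The grid below has 3 × 3 × 2 = 18 combinations, so 5-fold cross-validation trains 90 models, plus one final refit of the best combination on the full training set.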
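
The cost of a grid search grows with the product of the list lengths. As a quick check before launching a search, scikit-learn's `ParameterGrid` can enumerate the candidates; the sketch below assumes the same grid as this section and a hypothetical `cv_folds` variable.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'max_features': ['sqrt', 0.5],
}
cv_folds = 5  # matches cv=5 in the search

# ParameterGrid expands the dictionary into every combination of values.
candidates = list(ParameterGrid(param_grid))
n_fits = len(candidates) * cv_folds
print(f"{len(candidates)} candidates x {cv_folds} folds = {n_fits} fits")
print("first candidate:", candidates[0])
```

Adding one more value to any list multiplies the total, which is why large grids quickly become expensive.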

In [10]:
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'max_features': ['sqrt', 0.5]
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring='f1_macro',
    cv=5,
    n_jobs=-1
)

grid.fit(X_tr, y_tr)
print("Best parameters:", grid.best_params_)

best_rf = grid.best_estimator_
report_classification_results(y_te, best_rf.predict(X_te), "Best Random Forest (Wine)")

Best parameters: {'max_depth': None, 'max_features': 'sqrt', 'n_estimators': 50}
Best Random Forest (Wine) -> Accuracy: 1.0000, F1 (macro): 1.0000


(1.0, 1.0)

## 7. Regression — Diabetes Dataset

We now switch to a **regression problem** and compare:
- Decision Tree Regressor
- Random Forest Regressor

In [11]:
diabetes = load_diabetes()
X_tr, X_te, y_tr, y_te = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=42
)

dt_reg = DecisionTreeRegressor(random_state=42)
dt_reg.fit(X_tr, y_tr)
report_regression_results(y_te, dt_reg.predict(X_te), "Decision Tree Regressor")

rf_reg = RandomForestRegressor(random_state=42)
rf_reg.fit(X_tr, y_tr)
report_regression_results(y_te, rf_reg.predict(X_te), "Random Forest Regressor")

Decision Tree Regressor -> RMSE: 70.5464, MSE: 4976.7978
Random Forest Regressor -> RMSE: 54.3324, MSE: 2952.0106


(2952.0105887640448, np.float64(54.332408273184846))

## 8. Hyperparameter Tuning — RandomizedSearchCV (Regression)

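Randomized search samples a fixed number of parameter combinations (`n_iter`) from the given distributions instead of trying them all, which makes it more efficient for large search spaces. Here, 20 sampled combinations with 3-fold cross-validation require 60 fits.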
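
To preview what a randomized search will try, scikit-learn's `ParameterSampler` draws candidates from the same kind of distributions. The sketch below assumes the distributions used in this section. Note that `scipy.stats.randint(low, high)` excludes `high`.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

param_dist = {
    'n_estimators': randint(50, 300),      # integers 50..299
    'max_depth': randint(2, 30),           # integers 2..29
    'min_samples_split': randint(2, 20),   # integers 2..19
}

# Draw 20 candidate settings, as a search with n_iter=20 would.
samples = list(ParameterSampler(param_dist, n_iter=20, random_state=42))
for params in samples[:3]:
    print(params)
```

Unlike a grid, the number of candidates is set by `n_iter` alone, so widening a range does not increase the cost of the search.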

In [12]:
from scipy.stats import randint

param_dist = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(2, 30),
    'min_samples_split': randint(2, 20)
}

rsearch = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=3,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    random_state=42
)

rsearch.fit(X_tr, y_tr)
print("Best parameters:", rsearch.best_params_)

best_rf_reg = rsearch.best_estimator_
report_regression_results(y_te, best_rf_reg.predict(X_te), "Best Random Forest Regressor")

Best parameters: {'max_depth': 21, 'min_samples_split': 4, 'n_estimators': 278}
Best Random Forest Regressor -> RMSE: 54.8719, MSE: 3010.9214


(3010.9214408853236, np.float64(54.87186383644466))
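
Because the search uses `scoring='neg_mean_squared_error'`, its `best_score_` is a negated MSE averaged over the folds. The following sketch, which uses a deliberately tiny search with illustrative parameter ranges so that it runs quickly, shows one way to turn that score into an RMSE-scale number.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Deliberately small search (3 candidates, few trees) to keep the sketch fast.
small_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions={'n_estimators': randint(10, 30), 'max_depth': randint(2, 10)},
    n_iter=3,
    cv=3,
    scoring='neg_mean_squared_error',
    random_state=42,
)
small_search.fit(X_tr, y_tr)

# best_score_ is the mean negated MSE across folds, so flip the sign first.
cv_mse = -small_search.best_score_
cv_rmse = float(np.sqrt(cv_mse))
print(f"cross-validated RMSE of best candidate: {cv_rmse:.4f}")
```

Comparing this cross-validated value with the test-set RMSE helps judge whether a tuned model's apparent gain, or loss, is within the noise of a single split.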

## 9. Conclusion

In this worksheet, we:
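- Built a Decision Tree classifier from scratch using entropy and information gain
- Compared it with scikit-learn's `DecisionTreeClassifier`; both reached 0.9667 accuracy on the Iris test split
- Demonstrated ensemble methods using Random Forests, which beat a single Decision Tree on Wine (accuracy 1.0000 vs 0.9444) and on Diabetes (RMSE 54.33 vs 70.55)
- Applied GridSearchCV and RandomizedSearchCV for hyperparameter tuning; on these test splits, tuning did not improve on the default Random Forest (Wine accuracy stayed at 1.0000, Diabetes RMSE rose slightly from 54.33 to 54.87)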