### Foundations of Decision Trees

A decision tree is a supervised machine learning algorithm used for classification and regression tasks. It models decisions in a tree-like structure, where:

* Each internal node tests a feature (attribute).
* Each branch represents one outcome of that test.
* Each leaf node holds the final prediction (a class label or a numeric value).
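
To make the structure concrete, here is a minimal sketch of a hand-built tree stored as nested dictionaries. The fruit features (`diameter_cm`, `weight_g`), the thresholds, and the helper `predict_one` are hypothetical and chosen only for illustration; they are not learned from data.

```python
# A hand-written tree for a hypothetical fruit dataset.
# Internal nodes are dicts holding a test; leaves are plain class labels.
toy_tree = {
    "feature": "diameter_cm", "threshold": 7.0,
    "left": "apple",                 # leaf reached when diameter_cm <= 7.0
    "right": {                       # internal node reached when diameter_cm > 7.0
        "feature": "weight_g", "threshold": 150.0,
        "left": "apple",             # leaf: weight_g <= 150.0
        "right": "orange",           # leaf: weight_g > 150.0
    },
}

def predict_one(node, sample):
    """Follow branches from the root until a leaf (a non-dict value) is reached."""
    while isinstance(node, dict):
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node

print(predict_one(toy_tree, {"diameter_cm": 8.0, "weight_g": 180.0}))  # orange
```

Each prediction is a single root-to-leaf path, which is why small trees are easy to read and explain.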

#### Why Use Decision Trees?

Decision trees are popular because:

✅ They are interpretable (easy to understand and visualize). \
✅ They handle both categorical and numerical data. \
✅ They require little data preprocessing (no need for feature scaling). \
✅ They work well with nonlinear relationships. \
✅ They can be used for both classification and regression tasks. 

However, they have drawbacks:

❌ Prone to overfitting (if too deep). \
❌ Sensitive to small changes in data. \
❌ Biased toward features with many distinct values, which offer more candidate splits (if not properly tuned).

#### Measuring Impurity in Decision Trees

A decision tree aims to create pure groups (nodes where all samples belong to the same class). To measure impurity, we use three common metrics:

1️⃣ Entropy (used in ID3 Algorithm) \
2️⃣ Gini Impurity (used in CART Algorithm) \
3️⃣ Mean Squared Error (MSE) (used in regression trees) 

Let’s break them down one by one.

#### Decision Tree Impurity Measures & Information Gain

#### 1️⃣ Entropy (Used in Classification Trees)

Entropy quantifies the **uncertainty** (randomness) in a dataset.
* A **pure** node (all samples belong to the same class) has **entropy = 0**.
* A **mixed** node has **higher entropy** (more disorder).

**Formula**:
$$H(S) = - \sum p_i \log_2 p_i$$

Where:
* $p_i$ = proportion of samples in class $i$
* The sum runs over all classes, with the convention $0 \log_2 0 = 0$

#### 📌 Example Calculation
If a dataset has:
* **4 apples** 🍎
* **6 oranges** 🍊

The entropy is:
$$H = -\left(\frac{4}{10} \log_2 \frac{4}{10} + \frac{6}{10} \log_2 \frac{6}{10}\right) \approx 0.97$$

🔵 **Lower entropy = purer node**  
🔴 **Higher entropy = more mixed node**
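
The calculation above can be checked with a few lines of NumPy. This is a minimal sketch; the helper name `entropy` and its count-based input are assumptions made for illustration.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (in bits) of a node, given the number of samples per class."""
    counts = np.asarray(counts, dtype=float)
    probs = counts / counts.sum()
    probs = probs[probs > 0]          # apply the convention 0 * log2(0) = 0
    return float(-np.sum(probs * np.log2(probs)))

# 4 apples and 6 oranges, as in the example above
print(round(entropy([4, 6]), 2))  # 0.97
```

A pure node such as `[10, 0]` evaluates to zero, and a 50/50 node such as `[5, 5]` reaches the two-class maximum of 1 bit.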


---


#### 2️⃣ Gini Impurity (Used in CART Algorithm)

Gini impurity is another measure of impurity. It represents the probability that a randomly chosen element is incorrectly classified if we label it randomly based on class proportions.

**Formula**:
$$\text{Gini}(S) = 1 - \sum p_i^2$$

#### 📌 Example Calculation
For the same dataset:
$$\text{Gini} = 1 - \left(\left(\frac{4}{10}\right)^2 + \left(\frac{6}{10}\right)^2\right) = 1 - (0.16 + 0.36) = 0.48$$

🔹 **Lower Gini = purer node**  
🔹 **Higher Gini = more mixed node**

👉 **Gini vs. Entropy**:
* For two classes, **entropy** peaks at 1 and **Gini** peaks at 0.5 (both at a 50/50 mix). The two curves have a similar shape, so in practice they usually choose very similar splits.
* **Gini** is faster to compute since it avoids logarithms.
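
As a quick check of the Gini example, here is a minimal sketch in NumPy. The helper name `gini` and its count-based input are assumptions made for illustration.

```python
import numpy as np

def gini(counts):
    """Gini impurity of a node, given the number of samples per class."""
    counts = np.asarray(counts, dtype=float)
    probs = counts / counts.sum()
    return float(1.0 - np.sum(probs ** 2))

print(round(gini([4, 6]), 2))   # 0.48  (4 apples, 6 oranges)
print(round(gini([5, 5]), 2))   # 0.5   (two-class maximum)
print(round(gini([10, 0]), 2))  # 0.0   (pure node)
```

Only squaring and summing are involved, which is why Gini avoids the cost of computing logarithms.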

---

#### 3️⃣ Mean Squared Error (Used in Regression Trees)

For regression trees, we don't classify data but predict **continuous values**. Instead of entropy/Gini, we minimize **Mean Squared Error (MSE)**:

$$\text{MSE} = \frac{1}{n} \sum (y_i - \hat{y})^2$$

Where:
* $y_i$ = actual value
* $\hat{y}$ = predicted value (in a regression tree, the mean of the target values in the node)
* $n$ = number of samples

🔹 **Lower MSE = better split**  
🔹 **Higher MSE = poor split**
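
A candidate split in a regression tree is typically scored by the size-weighted MSE of its two children. The sketch below illustrates this on made-up numbers; the helpers `node_mse` and `split_mse` and the toy arrays are hypothetical, and the code assumes both sides of the split are non-empty.

```python
import numpy as np

def node_mse(y):
    """MSE of a node that predicts the mean of its own targets."""
    y = np.asarray(y, dtype=float)
    return float(np.mean((y - y.mean()) ** 2))

def split_mse(y, left_mask):
    """Size-weighted MSE of the two children produced by a boolean mask."""
    y = np.asarray(y, dtype=float)
    left_mask = np.asarray(left_mask, dtype=bool)
    y_left, y_right = y[left_mask], y[~left_mask]
    w_left = len(y_left) / len(y)
    return w_left * node_mse(y_left) + (1 - w_left) * node_mse(y_right)

# Toy data: targets jump from around 2 to around 11 when x crosses 0.3
x = np.array([0.1, 0.2, 0.3, 0.8, 0.9, 1.0])
y = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

print(round(node_mse(y), 2))             # 20.92 (before splitting)
print(round(split_mse(y, x <= 0.3), 2))  # 0.67  (after splitting at x <= 0.3)
```

The large drop in MSE is what marks `x <= 0.3` as a good split for this toy data.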

---

#### Information Gain

To decide which feature to split on, we use **Information Gain (IG)**. It measures how much impurity decreases when splitting a node.

**Formula (for entropy-based trees)**:
$$\text{IG} = H(\text{parent}) - \sum \left( \frac{N_{\text{child}}}{N_{\text{parent}}} \times H(\text{child}) \right)$$

#### 📌 Example Calculation
If we split our apple/orange dataset by color and get two subsets:
* 🍎 **(4 apples, 0 oranges) → Entropy = 0**
* 🍊 **(0 apples, 6 oranges) → Entropy = 0**

The **Information Gain** is:
$$\text{IG} = 0.97 - (0.4 \times 0 + 0.6 \times 0) = 0.97$$

✅ Since **entropy drops to 0**, this is a **perfect split**.
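
The same computation can be sketched in code. The helpers `entropy` and `information_gain`, and the `color` array standing in for the color feature, are assumptions made for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(labels, left_mask):
    """Parent entropy minus the size-weighted entropy of the two children."""
    labels = np.asarray(labels)
    left_mask = np.asarray(left_mask, dtype=bool)
    n = len(labels)
    gain = entropy(labels)
    for child in (labels[left_mask], labels[~left_mask]):
        if len(child) > 0:            # an empty child contributes nothing
            gain -= len(child) / n * entropy(child)
    return gain

# 4 apples and 6 oranges, perfectly separated by a hypothetical color feature
fruit = np.array(["apple"] * 4 + ["orange"] * 6)
color = np.array(["red"] * 4 + ["orange"] * 6)

print(round(information_gain(fruit, color == "red"), 2))  # 0.97
```

A split that leaves every sample on one side has zero gain, because the single non-empty child has the same entropy as the parent.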

---

#### **Coding the Decision Tree from Scratch:**

```python
import numpy as np
from collections import Counter
```

```python
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature        # Feature index to split on
        self.threshold = threshold    # Value to split at (for numerical features)
        self.left = left              # Left child node
        self.right = right            # Right child node
        self.value = value            # Predicted class label (set only on leaf nodes)

    def is_leaf(self):
        """Return True if the node is a leaf, i.e. it stores a prediction."""
        return self.value is not None
```

```python
class DecisionTree:
    def __init__(self, max_depth=10, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.root = None

    def fit(self, X, y):
        """Builds the decision tree."""
        # Accept lists as well as arrays; boolean-mask indexing below needs arrays
        X, y = np.asarray(X), np.asarray(y)
        self.root = self._grow_tree(X, y, depth=0)

    def _grow_tree(self, X, y, depth):
        """Recursively splits the data and builds the tree."""
        num_samples, num_features = X.shape
        unique_classes = np.unique(y)

        # Stopping conditions
        if (depth >= self.max_depth or len(unique_classes) == 1 or num_samples < self.min_samples_split):
            leaf_value = self._most_common_label(y)
            return Node(value=leaf_value)

        # Find the best split
        best_feature, best_threshold = self._best_split(X, y, num_features)

        # Split the data
        left_indices = X[:, best_feature] <= best_threshold
        right_indices = ~left_indices

        # If no threshold separates the samples (e.g. identical rows with
        # different labels), make a leaf instead of recursing into an empty child
        if not left_indices.any() or not right_indices.any():
            return Node(value=self._most_common_label(y))

        left_child = self._grow_tree(X[left_indices], y[left_indices], depth + 1)
        right_child = self._grow_tree(X[right_indices], y[right_indices], depth + 1)

        return Node(feature=best_feature, threshold=best_threshold, left=left_child, right=right_child)

    def _best_split(self, X, y, num_features):
        """Finds the best feature and threshold for splitting the data."""
        best_gain = -1
        best_feature, best_threshold = None, None

        for feature in range(num_features):
            thresholds = np.unique(X[:, feature])  # Possible split points
            for threshold in thresholds:
                gain = self._information_gain(y, X[:, feature], threshold)

                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature
                    best_threshold = threshold

        return best_feature, best_threshold

    def _information_gain(self, y, feature_column, threshold):
        """Computes the impurity reduction of a split, using Gini impurity (as in CART)."""
        parent_gini = self._gini(y)
        left_indices = feature_column <= threshold
        right_indices = ~left_indices

        if len(y[left_indices]) == 0 or len(y[right_indices]) == 0:
            return 0  # No split occurs

        left_gini = self._gini(y[left_indices])
        right_gini = self._gini(y[right_indices])
        left_weight = len(y[left_indices]) / len(y)
        right_weight = len(y[right_indices]) / len(y)

        # Parent impurity minus the weighted average of the child impurities
        gini_gain = parent_gini - (left_weight * left_gini + right_weight * right_gini)
        return gini_gain

    def _gini(self, y):
        """Computes Gini impurity of a node."""
        class_counts = Counter(y)
        probabilities = np.array(list(class_counts.values())) / len(y)
        return 1 - np.sum(probabilities ** 2)

    def _most_common_label(self, y):
        """Returns the most common class label in a node."""
        return Counter(y).most_common(1)[0][0]

    def predict(self, X):
        """Predicts class labels for given samples."""
        return np.array([self._traverse_tree(x, self.root) for x in X])

    def _traverse_tree(self, x, node):
        """Traverses the tree to make a prediction."""
        if node.is_leaf():
            return node.value
        if x[node.feature] <= node.threshold:
            return self._traverse_tree(x, node.left)
        else:
            return self._traverse_tree(x, node.right)
```

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train our decision tree
tree = DecisionTree(max_depth=5)
tree.fit(X_train, y_train)

# Make predictions
y_pred = tree.predict(X_test)

# Check accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Custom Decision Tree Accuracy: {accuracy:.4f}")
```

Output:

```
Custom Decision Tree Accuracy: 1.0000
```
