Let's explore **Topic 13: Decision Trees**. These are intuitive, powerful models used for both classification and regression, and they are known for their interpretability.


---

**1. Introduction: What are Decision Trees?**

* **Versatile Models:** Decision Trees are supervised learning algorithms that can predict a target variable by learning simple decision rules inferred from the data features. They can be used for:
    * **Classification tasks:** Predicting a discrete class label (e.g., "spam" or "not spam").
    * **Regression tasks:** Predicting a continuous numerical value (e.g., price of a house).
* **Structure (Conceptual Diagram):**
    Imagine an **upside-down tree** or a **flowchart**.
    * It starts with a **root node** at the top, representing the entire dataset.
    * This root node branches out into several **internal nodes** (or decision nodes). Each internal node represents a "test" or a "question" about a specific feature (e.g., "Is Petal Width < 0.8 cm?").
    * Each **branch** extending from an internal node represents an outcome or answer to that test (e.g., "Yes" or "No"; or for numerical features, a range like "< 0.8 cm" vs. ">= 0.8 cm").
    * The branches lead either to other internal nodes (further questions) or to **leaf nodes** (also called terminal nodes).
    * **Leaf nodes** represent the final outcome or decision.
        * In a **classification tree**, each leaf node corresponds to a class label (e.g., "Iris-setosa").
        * In a **regression tree**, each leaf node corresponds to a predicted continuous value (e.g., an average value like "$250,000").
* **Interpretability ("White Box" Model):** One of the biggest advantages of decision trees is their interpretability. The decision rules are explicit and easy to understand, unlike "black box" models like complex neural networks or SVMs with non-linear kernels.

---

**2. Tree Structure Terminology**

Let's formalize the parts of a decision tree:

* **Root Node:** The topmost node where the decision-making process begins. It contains all the training samples.
* **Internal Node (Decision Node):** A node that performs a test on a feature and splits the data into two or more subsets (child nodes) based on the outcome of the test.
* **Branch (Edge):** A link between nodes, representing the outcome of a test.
* **Leaf Node (Terminal Node):** A node that does not split further. It represents a final decision:
    * For classification: Contains a class label. The prediction for an instance reaching this leaf is that class label.
    * For regression: Contains a continuous value (typically the average of the target values of the training instances that reached this leaf).
* **Parent Node:** A node that is split into child nodes.
* **Child Node:** Nodes that are created as a result of a split from a parent node.
* **Depth of a Node:** The number of edges on the path from the root node to that node.
* **Depth of a Tree:** The depth of its deepest leaf node (length of the longest path from the root to a leaf).

**Text-based Diagram Example (Simple Classification Tree):**
Imagine we are classifying fruits based on color and size:

```
Is Color == Red?
|--- Yes: Is Size < 10cm?
|      |--- Yes: Leaf (Class: Cherry)
|      |--- No:  Leaf (Class: Apple)
|--- No: Is Color == Yellow?
       |--- Yes: Leaf (Class: Banana)
       |--- No:  Leaf (Class: Grape)  (assuming other colors lead here)
```
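The same tree can be written down as nested data and walked from the root to a leaf. This is only a sketch, not a library API: the dictionary layout, the key names (`test`, `yes`, `no`, `label`), and the sample fields (`color`, `size_cm`) are assumptions chosen for illustration.

```python
# Hypothetical encoding of the fruit tree above.
# Internal nodes hold a test and two branches; leaf nodes hold a class label.
FRUIT_TREE = {
    "test": lambda s: s["color"] == "red",           # root node
    "yes": {
        "test": lambda s: s["size_cm"] < 10,         # internal node
        "yes": {"label": "Cherry"},                  # leaf
        "no": {"label": "Apple"},                    # leaf
    },
    "no": {
        "test": lambda s: s["color"] == "yellow",    # internal node
        "yes": {"label": "Banana"},                  # leaf
        "no": {"label": "Grape"},                    # leaf (all other colors)
    },
}


def predict(node, sample):
    """Follow branches from the given node until a leaf is reached."""
    while "label" not in node:
        node = node["yes"] if node["test"](sample) else node["no"]
    return node["label"]


def tree_depth(node):
    """Number of edges on the longest path from this node down to a leaf."""
    if "label" in node:
        return 0
    return 1 + max(tree_depth(node["yes"]), tree_depth(node["no"]))


print(predict(FRUIT_TREE, {"color": "red", "size_cm": 3}))    # Cherry
print(predict(FRUIT_TREE, {"color": "green", "size_cm": 2}))  # Grape
print(tree_depth(FRUIT_TREE))                                 # 2
```

Each prediction is just one path through the tree, which is why the rules are easy to read back to a person.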

---

**3. How Decision Trees Work (Building the Tree - Recursive Partitioning)**

Decision trees are typically built using a **greedy, top-down, recursive partitioning** approach.

1.  **Start at the Root:** Begin with all training samples at the root node.
2.  **Find the Best Split:** For the current node, examine all possible features and all possible split points (thresholds for numerical features, or categories for categorical features). The goal is to find the feature and split point that results in the "best" split.
    * "Best" split means it divides the data into subsets (child nodes) that are as **"pure"** as possible with respect to the target variable. A pure node ideally contains samples from only one class (for classification) or samples with very similar target values (for regression).
3.  **Create Child Nodes:** Based on the best split, create child nodes and move the corresponding subsets of data into these child nodes.
4.  **Recurse:** Repeat steps 2 and 3 for each newly created child node.
5.  **Stopping Condition:** The recursion stops for a branch when one of the following conditions is met:
    * The node is perfectly pure (all samples belong to the same class or have identical target values).
    * A predefined stopping criterion is met (e.g., maximum tree depth reached, minimum number of samples in a node for splitting, etc.). These are hyperparameters used to control tree growth and prevent overfitting.
    * No split further improves the purity of the nodes significantly.

This algorithm is **greedy** because it makes the best local choice at each step (the split that looks best at the current node) without looking ahead to see if this choice will lead to a globally optimal tree.
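The steps above can be sketched in a few dozen lines of plain Python. This is a simplified illustration under stated assumptions, not how any particular library implements it: it handles only numerical features with binary splits, scores splits with Gini impurity (defined in Section 4), and uses the hypothetical names `best_split`, `build_tree`, `max_depth`, and `min_samples_split`. The toy data is made up.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y):
    """Step 2: return the (feature_index, threshold) with the lowest weighted
    child impurity, or None if no split beats the parent's impurity."""
    n = len(y)
    best, best_score = None, gini(y)
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [y[i] for i in range(n) if X[i][f] < thr]
            right = [y[i] for i in range(n) if X[i][f] >= thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, thr), score
    return best


def build_tree(X, y, depth=0, max_depth=3, min_samples_split=2):
    """Steps 1-5: greedy, top-down, recursive partitioning."""
    majority = Counter(y).most_common(1)[0][0]
    # Step 5: stop if the node is pure or a growth limit is reached.
    if len(set(y)) == 1 or depth >= max_depth or len(y) < min_samples_split:
        return {"label": majority}
    split = best_split(X, y)
    if split is None:  # no split improves purity
        return {"label": majority}
    f, thr = split
    left = [i for i in range(len(y)) if X[i][f] < thr]
    right = [i for i in range(len(y)) if X[i][f] >= thr]
    # Steps 3 and 4: create the child nodes and recurse into each.
    return {
        "feature": f,
        "threshold": thr,
        "left": build_tree([X[i] for i in left], [y[i] for i in left],
                           depth + 1, max_depth, min_samples_split),
        "right": build_tree([X[i] for i in right], [y[i] for i in right],
                            depth + 1, max_depth, min_samples_split),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while "label" not in tree:
        side = "left" if row[tree["feature"]] < tree["threshold"] else "right"
        tree = tree[side]
    return tree["label"]


# Made-up toy data: two numerical features, two classes.
X = [[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]]
y = ["a", "a", "a", "b", "b", "b"]
tree = build_tree(X, y)
print(tree)
print(predict(tree, [2.5, 9.0]), predict(tree, [9.0, 1.0]))  # a b
```

On this data a single split on feature 0 at 4.5 separates the classes, so both children are pure and the recursion stops immediately.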

---




**4. Splitting Criteria (Measuring Purity or Impurity)**

How does the algorithm decide which split is "best"? It uses a criterion to measure the purity (or impurity) of a node.

**a) For Classification Trees:**

The goal is to create child nodes where samples predominantly belong to a single class.

* **Gini Impurity:**
    * Measures the probability of misclassifying a randomly chosen element from the node if it were randomly labeled according to the distribution of labels in that node.
    * For a node $t$ with $K$ classes, and $p(C_k|t)$ being the proportion of samples of class $C_k$ in node $t$:
        $$Gini(t) = 1 - \sum_{k=1}^{K} [p(C_k|t)]^2$$
    * **Interpretation:**
        * $Gini(t) = 0$: Perfectly pure node (all samples belong to a single class).
        * $Gini(t) = 0.5$ (for binary classification): Maximally impure node (50/50 split of classes). With $K$ classes, the maximum is $1 - 1/K$.
    * The algorithm selects the split that results in the **lowest weighted average Gini impurity** of the child nodes (i.e., maximizes the Gini gain or reduction in impurity).
    * **Conceptual Diagram for Purity:** Imagine a basket of fruits.
        * Basket A: 10 Apples, 0 Oranges (Pure, Gini = 0).
        * Basket B: 5 Apples, 5 Oranges (Impure, Gini = 0.5).
        A good split would separate an impure basket into purer ones.

* **Entropy / Information Gain:**
    * **Entropy** is a measure of uncertainty or disorder in a set of samples.
        $$Entropy(t) = -\sum_{k=1}^{K} p(C_k|t) \log_2(p(C_k|t))$$
        (Conventionally, $0 \log_2 0 = 0$).
    * **Interpretation:**
        * $Entropy(t) = 0$: Perfectly pure node.
        * $Entropy(t) = 1$ (for binary classification, $\log_2$): Maximally impure node. With $K$ classes, the maximum is $\log_2 K$.
    * **Information Gain (IG):** The reduction in entropy achieved by splitting the data on a particular feature. The algorithm chooses the split that **maximizes information gain**.
        $$IG(Parent, Split) = Entropy(Parent) - \sum_{j \in \text{children}} \frac{N_j}{N_{Parent}} Entropy(Child_j)$$
        where $N_j$ is the number of samples in child node $j$, and $N_{Parent}$ is the number of samples in the parent node.

* **Gini vs. Entropy:** Both are common and usually lead to similar trees. Gini impurity is often slightly faster to compute as it doesn't involve logarithmic calculations. Scikit-learn uses Gini by default for classification.
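To make the formulas concrete, here is a small sketch that evaluates Gini impurity, entropy, and information gain on the fruit baskets described above. The function names and the example split of Basket B are assumptions for illustration; only the formulas come from the text.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """p(C_k | t) for every class that occurs in the node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini(labels):
    """Gini(t) = 1 - sum_k p_k^2"""
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))


def entropy(labels):
    """Entropy(t) = -sum_k p_k * log2(p_k), written as p_k * log2(1 / p_k).
    Only classes that occur are counted, so p_k > 0 and the log is defined."""
    return sum(p * log2(1 / p) for p in class_proportions(labels))


def weighted_impurity(children, impurity):
    """Average child impurity, weighted by N_j / N_Parent."""
    n = sum(len(child) for child in children)
    return sum(len(child) / n * impurity(child) for child in children)


def information_gain(parent, children):
    """IG = Entropy(Parent) - weighted average entropy of the children."""
    return entropy(parent) - weighted_impurity(children, entropy)


basket_a = ["apple"] * 10
basket_b = ["apple"] * 5 + ["orange"] * 5

print(gini(basket_a), entropy(basket_a))  # 0.0 0.0  (pure)
print(gini(basket_b), entropy(basket_b))  # 0.5 1.0  (maximally impure)

# A hypothetical split of Basket B into two mostly-pure children.
children = [["apple"] * 4 + ["orange"], ["apple"] + ["orange"] * 4]
print(weighted_impurity(children, gini))      # about 0.32, down from 0.5
print(information_gain(basket_b, children))   # about 0.278
```

Both criteria agree that the split helps: weighted Gini falls from 0.5 to about 0.32, and entropy falls by about 0.278 bits.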

**b) For Regression Trees:**

The goal is to create child nodes where the target values of the samples are as similar as possible (i.e., have low variance).

* **Mean Squared Error (MSE):**
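    * The algorithm chooses the split that **minimizes the weighted average MSE** of the resulting child nodes (each child weighted by its number of samples). Because the MSE around a node's mean is its variance, this is the same as maximizing the reduction in variance.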
    * For a node $t$ with $N_t$ samples, and $\bar{y}_t$ being the average target value of samples in node $t$:
        $$MSE(t) = \frac{1}{N_t} \sum_{i \in t} (y_i - \bar{y}_t)^2$$
    * The prediction made at a leaf node is typically the average of the target values of the training instances in that leaf.
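* **Mean Absolute Error (MAE):** Less common but can also be used. It's less sensitive to outliers than MSE. With MAE as the criterion, the leaf prediction is typically the median of the target values rather than the mean.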

Scikit-learn uses MSE by default for regression (the criterion is named `"squared_error"` in recent versions; older versions called it `"mse"`).
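As a rough illustration of the regression case, the sketch below scans the candidate thresholds of a single numerical feature and keeps the one with the lowest weighted child MSE. The helper name `best_threshold` and the house-size data are made up for this example, and real implementations are considerably more efficient.

```python
def mse(values):
    """MSE(t): mean squared deviation from the node's average target value."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_threshold(x, y):
    """Return (threshold, weighted_child_mse) for one numerical feature.
    The threshold is None if no split beats the parent's MSE."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    best = (None, mse(y))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        score = (len(left) * mse(left) + len(right) * mse(right)) / n
        if score < best[1]:
            best = (thr, score)
    return best


# Made-up data: house size in square meters, price in thousands of dollars.
sizes = [50, 60, 70, 120, 130, 140]
prices = [100, 110, 120, 300, 310, 320]

thr, score = best_threshold(sizes, prices)
print(thr)                 # 95.0
print(mse(prices), score)  # parent MSE is about 10066.7, children about 66.7

# Each leaf predicts the mean target of its training samples.
left_mean = sum(p for s, p in zip(sizes, prices) if s < thr) / 3
right_mean = sum(p for s, p in zip(sizes, prices) if s >= thr) / 3
print(left_mean, right_mean)  # 110.0 310.0
```

The split at 95 separates the small, cheap houses from the large, expensive ones, so the target values within each child are far more similar than in the parent.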

---
This covers the basic structure, how trees are built, and the splitting criteria. Next, we'll look at controlling tree growth (hyperparameters and pruning), visualization, pros/cons, and then the Scikit-learn implementation.