# Decision Trees

A **Decision Tree** is a versatile, non-parametric supervised learning algorithm used for both classification and regression tasks. Its model forms a tree-like structure, mimicking human decision-making processes.



### Core Components

* **Root Node:** The topmost node, representing the entire dataset. It is split into two or more subsets that are each more homogeneous than the whole.
* **Internal Nodes (Decision Nodes):** Nodes that represent a decision (or test) on a specific feature, splitting into further nodes.
* **Leaf Nodes (Terminal Nodes):** The final output nodes. They represent a class label (in classification) or a continuous value (in regression) and do not split further.
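* **Branches (Sub-Trees):** A branch connects a node to one of its children and corresponds to one outcome of that node's test; a node together with everything below it forms a sub-tree.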

---

## How Does a Decision Tree Learn?

The algorithm learns by recursively splitting the data into purer subsets based on the features. At each node, it selects the feature and split point that best separates the data according to the target variable. This process continues until a stopping criterion is met (e.g., maximum depth is reached, or a node is 100% pure).

### Key Splitting Criteria
The "best" split is determined by a metric that quantifies the impurity or disorder of a node.

#### 1. Gini Impurity
Measures the probability of misclassifying a randomly chosen element from the node if it were randomly labeled according to the class distribution in the node.

* **Formula:** $Gini = 1 - \sum p_i^2$
    *(where $p_i$ is the proportion of class $i$ in the node)*
* **Range:** $0$ (perfectly pure) to $0.5$ for a binary problem with a 50/50 class split; with $K$ classes the maximum is $1 - 1/K$.
* **Characteristics:** Computationally efficient as it doesn't involve logarithms.
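
As a quick check of the formula above, here is a minimal pure-Python sketch. The helper name `gini_impurity` is illustrative only and not part of any library.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a collection of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


pure = gini_impurity(["yes", "yes", "yes", "yes"])    # single class -> 0.0
balanced = gini_impurity(["yes", "yes", "no", "no"])  # 50/50 binary -> 0.5
three_way = gini_impurity(["a", "b", "c"])            # 3 equal classes -> 1 - 1/3
print(pure, balanced, round(three_way, 4))
```

The three-class case shows that the $0.5$ ceiling applies only to binary problems.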

#### 2. Entropy & Information Gain
**Entropy** measures the amount of uncertainty or disorder in a node.

* **Formula:** $Entropy = -\sum (p_i \times \log_2(p_i))$
* **Range:** $0$ (perfectly pure) to $1$ for binary classification with a 50/50 split; with $K$ classes the maximum is $\log_2 K$. By convention, $0 \times \log_2(0)$ is treated as $0$.

**Information Gain (IG)** is the reduction in entropy after a dataset is split on an attribute. The split with the highest IG is chosen.

* **Formula:** $IG = Entropy(parent) - \sum_k \frac{N_k}{N} \, Entropy(child_k)$
    *(where $N_k$ is the number of samples in child $k$ and $N$ is the number of samples in the parent)*
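
To make the calculation concrete, the sketch below computes entropy and information gain for one hand-made split. The function names and the toy labels are assumptions for illustration; the children's entropies are weighted by their share of the parent's samples.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the sample-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5  # entropy = 1.0
left = ["yes"] * 4 + ["no"] * 1    # mostly "yes"
right = ["yes"] * 1 + ["no"] * 4   # mostly "no"

ig = information_gain(parent, [left, right])
print(f"IG = {ig:.4f}")  # about 0.278
```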

---

### Gini vs. Entropy: Which to Use?

1.  **Gini Impurity** is slightly faster to compute and is the default in many libraries (like `scikit-learn`).
2.  **Entropy** may lead to more balanced trees as it tends to create slightly more granular splits.

> **Note:** In practice, the difference is often negligible; the resulting trees are usually very similar.

---
#### Where Are Decision Trees Used?
* **Finance:** Credit scoring, loan approval.
* **Healthcare:** Disease diagnosis, patient risk stratification.
* **E-commerce:** Customer segmentation, product recommendation engines.
* **Manufacturing:** Quality control and fault detection.

#### Practical Challenges & Solutions
**Handling Missing Values:**
* **Imputation:** Fill with mean, median, or mode.
* **Surrogate Splits:** (CART) Use other features that mimic the original split to route data points whose primary split feature is missing. C4.5 instead distributes such points fractionally across the branches.

**Imbalanced Data:**
* Use `class_weight` parameter to assign higher weights to minority classes.
* Employ sampling techniques (SMOTE, undersampling) before training.
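
The effect of `class_weight` can be seen on a deliberately tiny, contrived dataset (the values below are made up for illustration). At `x = 1` the negatives outnumber the positives 3 to 2, so an unweighted tree labels that leaf negative; with `class_weight="balanced"` the rarer positive class is up-weighted and wins the leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: 11 negatives, 2 positives. All positives sit at x = 1.
X = np.array([[0]] * 8 + [[1]] * 5, dtype=float)
y = np.array([0] * 8 + [0, 0, 0, 1, 1])

plain = DecisionTreeClassifier(random_state=0).fit(X, y)
weighted = DecisionTreeClassifier(class_weight="balanced", random_state=0).fit(X, y)

# Leaf for x = 1 holds 3 negatives and 2 positives.
plain_pred = plain.predict([[1.0]])[0]        # majority by count -> 0
weighted_pred = weighted.predict([[1.0]])[0]  # majority by weight -> 1
print(plain_pred, weighted_pred)
```

Up-weighting trades false negatives for false positives, so the choice of weights is best validated against the metric that matters for the application.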

**Feature Types:**
* **Continuous:** Find an optimal threshold (e.g., Age $\le 30$).
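* **Categorical:** Can be handled natively by some algorithms (C4.5, and CART as originally described) or require numeric encoding (e.g., one-hot or ordinal encoding) in implementations that accept only numeric inputs, such as `scikit-learn`'s CART-based trees.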
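
For implementations that expect numeric inputs, one common approach is to put an encoder in front of the tree. The sketch below assumes a single categorical column with made-up color values and uses `scikit-learn`'s `OneHotEncoder` inside a pipeline.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data: one categorical feature; only "red" items belong to class 1.
X = np.array([["red"], ["blue"], ["green"], ["red"], ["blue"], ["green"]])
y = np.array([1, 0, 0, 1, 0, 0])

model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),  # unseen categories encode as all zeros
    DecisionTreeClassifier(random_state=0),
)
model.fit(X, y)

pred = model.predict(np.array([["red"], ["green"]]))
print(pred)
```

One-hot encoding avoids imposing an artificial order on the categories, at the cost of one column per category.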

---

### System Design Angle: Production Considerations

#### When to Use Decision Trees?
**Advantages:**
* **Highly interpretable and visualizable** (White Box model).
* Can handle both numerical and categorical data without scaling.
* Mirrors human decision-making, which is great for business logic.

**Disadvantages:**
* **Prone to Overfitting:** Requires careful tuning and pruning.
* **Unstable:** Small changes in data can lead to a completely different tree (high variance).
* **Biased:** Tend to favor features with more levels (e.g., continuous features).

#### Challenges in Production
* **Model Drift:** The tree's rules can become obsolete as data distributions change, requiring periodic retraining.
* **Scalability:** While fast to predict, training a very deep tree on large datasets can be memory-intensive.
* **Instability:** A single tree in production is risky because of its high variance, which is why ensemble methods (Random Forests, Gradient Boosting) are usually preferred for production systems.




### 1. How do Decision Trees handle categorical vs. continuous features?

* **Continuous Features:** The algorithm sorts the feature values and evaluates all possible split points (e.g., $X > 5.2$). It chooses the threshold that minimizes impurity (maximizes information gain).
* **Categorical Features:**
    * **Binary trees (like CART):** It tries all possible binary partitions of the categories (e.g., Color in {Red, Blue} vs. Color in {Green}).
    * **Multi-way trees (like ID3):** It can create a branch for each category.
    * *Note:* In libraries like `scikit-learn`, categorical features need to be numerically encoded beforehand (e.g., with `OneHotEncoder` or `OrdinalEncoder`; `LabelEncoder` is intended for target labels, not features).
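
The threshold search for a continuous feature can be sketched as follows. This is a simplified illustration, not any library's actual implementation: the helper names are hypothetical, candidate thresholds are taken as midpoints between consecutive distinct sorted values, and Gini impurity is used as the criterion.

```python
import numpy as np


def gini(y):
    """Gini impurity of an array of class labels."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def best_threshold(x, y):
    """Return (threshold, weighted child impurity) of the best binary split on x."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_t, best_impurity = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold can separate identical values
        t = (x[i] + x[i - 1]) / 2
        left, right = y[:i], y[i:]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if weighted < best_impurity:
            best_t, best_impurity = t, weighted
    return best_t, best_impurity


age = np.array([22, 25, 28, 35, 40, 52])
bought = np.array([0, 0, 0, 1, 1, 1])
t, impurity = best_threshold(age, bought)
print(t, impurity)  # 31.5 0.0: splitting at Age <= 31.5 separates the classes
```

A full tree builder would run this search over every feature at every node and recurse on the two resulting subsets.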

### 2. What is the role of entropy and information gain?

* **Entropy:** Quantifies the disorder within a node. A pure node has an entropy of $0$.
* **Information Gain:** Measures how much a split reduces this entropy. The algorithm greedily selects the feature and split point that results in the highest information gain at each step, effectively creating the most homogeneous child nodes.

### 3. Why might a Decision Tree overfit, and how can you prevent it?

**Why?** The tree can keep splitting until every leaf node is perfectly pure, effectively memorizing the training data, including noise and outliers.

**Prevention:**
1.  **Pruning:** Use cost-complexity pruning (`ccp_alpha`).
2.  **Limit Tree Size:** Set a `max_depth`, `min_samples_split`, or `min_samples_leaf`.
3.  **Use Ensemble Methods:** Train multiple trees (e.g., Random Forest) to average out the instability.
4.  **Use More Data:** Helps the model learn general patterns instead of noise.
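
The gap between an unconstrained and a size-limited tree can be inspected with a short experiment. The dataset here is synthetic (`make_classification` with 10% label noise via `flip_y`), and the specific hyperparameter values are arbitrary choices for illustration rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels, so a perfect training fit implies memorized noise.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
constrained = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=10, random_state=0
).fit(X_train, y_train)

for name, model in [("unconstrained", full), ("constrained", constrained)]:
    print(
        f"{name}: leaves={model.get_n_leaves()}, depth={model.get_depth()}, "
        f"train={model.score(X_train, y_train):.3f}, "
        f"test={model.score(X_test, y_test):.3f}"
    )
```

The unconstrained tree reaches 100% training accuracy by construction; comparing the two test scores shows how much of that fit generalizes.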

### 4. Advantages and disadvantages of Gini Impurity vs. Entropy?

| Criterion | Advantages | Disadvantages |
| :--- | :--- | :--- |
| **Gini Impurity** | Faster to compute (no logarithms). | Tends to isolate the most frequent class in its own branch. |
| **Entropy** | More theoretically grounded. May create more balanced trees. | Slightly slower computation due to logarithms. |

> **Note:** In practice, they produce very similar results, and the choice is often a matter of preference.

### 5. How do Decision Trees handle missing values?

* **Imputation:** Before training, fill missing values with a statistic like the mean or mode.
* **Surrogate Splits (CART):** A powerful technique where the tree finds a "backup" feature that produces a split most similar to the primary split. If the primary feature is missing for a data point, the surrogate feature is used.
* **Ignore:** Some implementations simply skip data points with missing values in the feature being considered for a split.
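
The imputation route can be sketched with a `scikit-learn` pipeline, which learns the fill values on the training data and reuses them at prediction time. The feature values below (age, salary) are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data: columns are [age, salary]; np.nan marks missing entries.
X = np.array([
    [25.0, 50000.0],
    [np.nan, 62000.0],
    [47.0, np.nan],
    [52.0, 110000.0],
    [23.0, 48000.0],
    [51.0, 98000.0],
])
y = np.array([0, 0, 1, 1, 0, 1])

model = make_pipeline(
    SimpleImputer(strategy="median"),  # medians are learned during fit
    DecisionTreeClassifier(max_depth=2, random_state=0),
)
model.fit(X, y)

# A new sample with a missing age is imputed with the training median.
pred = model.predict(np.array([[np.nan, 105000.0]]))
print(pred)
```

Keeping the imputer inside the pipeline prevents test-set statistics from leaking into training during cross-validation.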

### 6. Why are Decision Trees prone to instability?
They are **high-variance estimators**. Because of their hierarchical nature, a small change in the training data can lead to a completely different choice at the root node, which then propagates down and changes the entire structure of the tree. This instability is why they are rarely used alone in practice.

### 7. A Decision Tree overfits: performs well on training data but poorly on test data. What do you do?

1.  **Apply Pruning:** This is the most direct method. Use `ccp_alpha` for post-pruning.
2.  **Increase Regularization:** Increase `min_samples_leaf` or `min_samples_split`, or reduce `max_depth`.
3.  **Gather More Training Data.**
4.  **Perform Feature Selection:** Remove irrelevant features that may be adding noise.
5.  **Switch to an Ensemble Method:** Use a Random Forest or Gradient Boosting machine, which are built to overcome the limitations of a single tree.

### 8. How does pruning work, and what are the different types?

* **Pre-Pruning (Early Stopping):** Stops the tree from growing once a predefined condition is met (e.g., `max_depth=5`). It is simple but can be short-sighted: a split that looks weak now may enable a strong split further down.
* **Post-Pruning (Cost-Complexity Pruning):**
    1.  Grow the tree to its full depth.
    2.  Repeatedly collapse the sub-tree whose removal costs the least accuracy per leaf removed into a single leaf node ("weakest link" pruning), producing a sequence of progressively smaller trees.
    3.  Each tree in the sequence corresponds to a value of the hyperparameter `alpha`, which penalizes the number of leaves: the larger `alpha` is, the smaller the tree.
    4.  Evaluate each pruned tree on a validation set (or with cross-validation).
    5.  Select the `alpha` whose pruned tree maximizes validation performance.
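
In `scikit-learn`, this procedure can be approximated with `cost_complexity_pruning_path`, which returns the candidate `alpha` values, followed by refitting with each `ccp_alpha`. The sketch below uses synthetic data and a single validation split; cross-validation would be a more robust way to choose `alpha`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Candidate alphas come from pruning the fully grown tree (ascending order).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (simpler tree)
        best_alpha, best_score = alpha, score

print(f"best ccp_alpha={best_alpha:.5f}, validation accuracy={best_score:.3f}")
```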

### 9. Do Decision Trees require feature scaling?
**No.** Splits depend only on the ordering of each feature's values, so any monotonic rescaling of a feature (e.g., Age in years vs. Salary in dollars) leaves the chosen partitions unchanged. This is a significant advantage over scale-sensitive models such as SVMs or K-Nearest Neighbors.
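
This invariance can be checked empirically. The sketch below is an illustration on synthetic integer-valued features: one column is multiplied by 1024 (a power of two, chosen so the rescaling is exact in floating point), and the two fitted trees are compared on held-out rows.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(300, 2)).astype(float)
y = (X[:, 0] + X[:, 1] > 100).astype(int)

# Same data, but the second feature is expressed in much larger units.
X_rescaled = X.copy()
X_rescaled[:, 1] *= 1024.0

tree_raw = DecisionTreeClassifier(random_state=0).fit(X[:200], y[:200])
tree_rescaled = DecisionTreeClassifier(random_state=0).fit(X_rescaled[:200], y[:200])

same = np.array_equal(
    tree_raw.predict(X[200:]), tree_rescaled.predict(X_rescaled[200:])
)
print("identical predictions:", same)
```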

### 10. How do Decision Trees determine feature importance?
Feature importance is calculated as the (normalized) total reduction of the impurity criterion (Gini/Entropy/MSE) brought by that feature.

* **Calculation:** For each feature, sum the impurity decrease for every node that splits on that feature, weighted by the fraction of samples it handles.
* **Limitations:**
    * **Bias towards high-cardinality features:** Continuous features or categorical features with many levels have more split options and can appear more important.
    * **Correlation:** If two features are highly correlated, the importance may be arbitrarily assigned to one, making the other seem less important.
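
The correlation caveat is easy to reproduce with `feature_importances_`. In this contrived example, two columns carry the identical signal and a third is pure noise; the tree needs only one of the duplicates, so it receives all the importance while its twin receives none.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

signal = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
noise = np.array([0, 1, 0, 1, 0, 1, 0, 1], dtype=float)

# Columns: [signal, exact copy of signal, noise]
X = np.column_stack([signal, signal.copy(), noise])
y = signal.astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
importances = tree.feature_importances_
print(importances)  # normalized to sum to 1
```

Which of the two duplicate columns ends up with the importance depends on the order in which the splitter visits features, so a zero importance does not by itself mean a feature is uninformative.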


# Decision Tree: Classifier vs. Regressor

The decision on whether to use a **Classifier** or a **Regressor** depends entirely on your **Target Variable** (the thing you are trying to predict).

### 1. Decision Tree Classifier
**Use this when:** Your target variable is **Categorical** (distinct classes or labels).

* **The Goal:** Split the data to separate different classes (e.g., Yes vs. No).
* **Prediction Output:** The **majority class** (mode) of the samples in the leaf node.
* **Evaluation Metrics:** Accuracy, Precision, Recall, F1-Score, AUC-ROC.
* **Splitting Criteria:** Gini Impurity, Information Gain (Entropy).
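
The majority-class rule can be observed on a depth-1 tree fitted to a small hand-made dataset (the values are illustrative). The right-hand leaf ends up with four samples of class 1 and one of class 0, so every point falling into it is predicted as class 1.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.arange(1, 9, dtype=float).reshape(-1, 1)  # x = 1..8
y = np.array([0, 0, 0, 1, 1, 1, 1, 0])           # x = 8 is an odd one out

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

label = stump.predict([[8.0]])[0]        # majority class of the leaf -> 1
proba = stump.predict_proba([[8.0]])[0]  # class fractions in the leaf -> [0.2, 0.8]
print(label, proba)
```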

**Real-World Examples:**
* **Email Filter:** Is this email `Spam` or `Not Spam`?
* **Medical Diagnosis:** Does the patient have `Diabetes`, `Heart Disease`, or `No Disease`?
* **Loan Approval:** Should the loan be `Approved` or `Denied`?

### 2. Decision Tree Regressor
**Use this when:** Your target variable is **Continuous** (numerical values).



* **The Goal:** Split the data to group similar numerical values together.
* **Prediction Output:** The **average value** (mean) of the samples in the leaf node.
* **Evaluation Metrics:** MSE (Mean Squared Error), RMSE, MAE, R-Squared ($R^2$).
* **Splitting Criteria:** MSE (Mean Squared Error), Friedman MSE, MAE.
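
The mean-of-the-leaf rule can be confirmed with a depth-1 regressor on toy data (values invented for illustration). The tree separates the two obvious groups and predicts each group's average target.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([100.0, 110.0, 120.0, 300.0, 310.0, 320.0])

stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)

low, high = stump.predict([[2.5], [10.5]])
print(low, high)  # 110.0 and 310.0: the means of the two leaves
```

Because every point in a leaf gets the same value, tree regressors produce piecewise-constant predictions and cannot extrapolate beyond the range of targets seen in training.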

**Real-World Examples:**
* **House Pricing:** Predicting the exact **price** of a house (e.g., $350,000) based on square footage and location.
* **Sales Forecasting:** Predicting **how many units** of a product will sell next month.
* **Temperature Prediction:** Predicting the specific **temperature** (e.g., 24.5°C) for tomorrow.

---

### Quick Comparison Table

| Feature | Decision Tree **Classifier** | Decision Tree **Regressor** |
| :--- | :--- | :--- |
| **Target Type** | **Categories** (Discrete) | **Numbers** (Continuous) |
| **Prediction** | Majority Vote (Mode) | Average (Mean) |
| **Splitting Metric** | Gini Impurity / Entropy | Variance / Mean Squared Error (MSE) |
| **Example Output** | "Dog", "Cat", "Bird" | 10.5, 342.1, 0.95 |
| **Python Class** | `sklearn.tree.DecisionTreeClassifier` | `sklearn.tree.DecisionTreeRegressor` |

---

### Code Implementation Difference

Notice that the setup is almost identical, but you import a different class from `sklearn`.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# -------------------------------------------------------
# SCENARIO 1: CLASSIFICATION (Predicting Fruit Type)
# -------------------------------------------------------
# X features: [Weight, Texture]
# y target:   ['Apple', 'Orange', 'Apple'] (Categories)

clf = DecisionTreeClassifier(criterion='gini', max_depth=3)
# clf.fit(X, y)


# -------------------------------------------------------
# SCENARIO 2: REGRESSION (Predicting Price)
# -------------------------------------------------------
# X features: [Rooms, Location_Score]
# y target:   [200000, 150000, 300000] (Continuous Numbers)

reg = DecisionTreeRegressor(criterion='squared_error', max_depth=3)
# reg.fit(X, y)
```

The following end-to-end example fits a regressor to noisy sine data and evaluates it on a held-out split:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt


# Synthetic data: a sine curve with Gaussian noise
np.random.seed(42)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train model
tree_reg = DecisionTreeRegressor(
    max_depth=3,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42
)
tree_reg.fit(X_train, y_train)

# Predictions
y_pred = tree_reg.predict(X_test)

# Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f}, R²: {r2:.4f}")
```

Output:

```
MSE: 0.0496, R²: 0.8916
```
