# Decision Tree

Great for:
* Retail: customer segmentation, churn prediction
* Finance: credit risk scoring, fraud detection
* E-commerce (Booking.com): content ranking, matching signals
* Healthcare: interpretable diagnosis rules



## 1. Definition
A decision tree is a supervised learning algorithm that models decisions as a tree-like structure of rules (binary rules in CART). It recursively splits data into smaller subsets to predict a target class or value. It handles non-linear classification and regression by breaking complex data into simple, interpretable 'if-then-else' logic.


## 2. Core Idea
A decision tree assumes that simple, hierarchical rules are sufficient to describe complex relationships. It partitions the feature space by asking a sequence of if/else questions (splits) that try to make the labels in each region as 'pure' as possible. Each split aims to create child nodes that are more homogeneous in the target variable than the parent node.
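To make the "sequence of if/else questions" concrete, here is a hand-written two-level tree for a churn example. The feature names (`months_active`, `support_tickets`) and the thresholds are made up for illustration; a trained tree would learn them from data.

```python
def predict_churn(months_active: int, support_tickets: int) -> str:
    """Hand-written tree; features and thresholds are hypothetical."""
    if months_active <= 6:          # root split
        if support_tickets > 3:     # second split on the left branch
            return "churn"
        return "stay"
    return "stay"                   # the right branch is already a leaf


print(predict_churn(months_active=3, support_tickets=5))   # churn
print(predict_churn(months_active=24, support_tickets=5))  # stay
```

Each root-to-leaf path reads as one rule, which is where the interpretability comes from.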


## 3. Mechanism
Decision trees are trained by maximizing impurity reduction at each split.
* Classification: minimize Gini impurity or entropy
* Regression: minimize variance (MSE)

At each node, the tree selects the feature and threshold that maximize the information gain (impurity decrease).
The process starts at the root with the full dataset. For each feature, the algorithm evaluates all candidate split thresholds and computes the impurity before and after the split. It then repeats recursively on each child and stops when the maximum depth is reached or the impurity falls below a defined threshold.

The final prediction is assigned based on: the most common class (classification) or mean target value (regression).
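The mechanism can be sketched in plain Python. This is a minimal illustration and not the implementation of any particular library. It assumes numeric features, Gini impurity, candidate thresholds at midpoints between distinct values, and a stop when no split gives a positive gain or `max_depth` is reached. The toy data and all function names are made up.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (gain, feature_index, threshold) of the best split, or None."""
    parent, n, best = gini(labels), len(labels), None
    for j in range(len(rows[0])):
        values = sorted({r[j] for r in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - child
            if gain > 0 and (best is None or gain > best[0]):
                best = (gain, j, t)
    return best


def build(rows, labels, depth=0, max_depth=3):
    """Grow the tree recursively; a leaf stores the majority class."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:  # pure node, no useful split, or depth limit reached
        return Counter(labels).most_common(1)[0][0]
    _, j, t = split
    go_left = [r[j] <= t for r in rows]
    l_rows = [r for r, g in zip(rows, go_left) if g]
    l_labels = [y for y, g in zip(labels, go_left) if g]
    r_rows = [r for r, g in zip(rows, go_left) if not g]
    r_labels = [y for y, g in zip(labels, go_left) if not g]
    return {
        "feature": j,
        "threshold": t,
        "left": build(l_rows, l_labels, depth + 1, max_depth),
        "right": build(r_rows, r_labels, depth + 1, max_depth),
    }


def predict(node, row):
    """Follow the splits until a leaf (a plain label) is reached."""
    while isinstance(node, dict):
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node


# Toy data: two numeric features, two classes.
rows = [(1.0, 5.0), (2.0, 4.0), (3.0, 7.0), (6.0, 5.0), (7.0, 8.0), (8.0, 6.0)]
labels = ["a", "a", "a", "b", "b", "b"]

tree = build(rows, labels)
print(tree)  # {'feature': 0, 'threshold': 4.5, 'left': 'a', 'right': 'b'}
print(predict(tree, (2.5, 9.0)))  # a
```

Requiring a strictly positive gain is a simplification. With it, the sketch stops early on patterns such as XOR, where no single split helps on its own.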


## 4. Mathematical Details and Training
The training process is a Greedy Algorithm (e.g. CART, classification and regression trees). It iterates through every feature to find the split that maximizes data purity.

1. Classification Criteria:
* Gini Impurity (used in CART): The probability of misclassifying a sample if you label it randomly according to the class distribution in the node. It is an alternative to entropy for measuring the impurity of a dataset, and it is computationally simpler because it needs no logarithm. A Gini impurity of 0 indicates that all samples in the set belong to the same class.


$$ G = 1 - \sum_{i=1}^{C}(p_i)^2$$
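As a quick check of the formula, the sketch below computes $G$ from a list of labels, estimating $p_i$ as class frequencies. The function name and the sample labels are illustrative only.

```python
from collections import Counter


def gini(labels):
    """G = 1 - sum(p_i^2), with p_i estimated as class frequencies."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini(["yes", "yes", "yes", "yes"]))  # 0.0 for a pure node
print(gini(["yes", "yes", "no", "no"]))    # 0.5, the maximum for two classes
print(round(gini(["a", "b", "c"]), 3))     # 0.667 for three equally likely classes
```

For $C$ equally likely classes the value is $1 - 1/C$, so the maximum grows with the number of classes.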

* Entropy / Information Gain (used in ID3 / C4.5): Measures disorder. We want to maximize the gain (the reduction in entropy).

> Entropy is a measure of the randomness or impurity in a set of data, and it quantifies the uncertainty in a collection of samples.

* A set with a mix of different classes has high entropy (high impurity)
* A set where all samples belong to the same class has zero entropy (zero impurity)

In simple terms, the formula calculates the weighted average of the information content of each class. A higher value indicates a higher degree of unpredictability and disorder in the data.

$$H(S) = -\sum_{i=1}^{C}p_i\log_2(p_i)$$

$C$: number of unique classes

$p_i$: the proportion of samples that belong to class $i$

$\log_2(p_i)$: measures the information content of a class. Base 2 is used because information is measured in bits.

$-\sum$: the negative sign is necessary because the logarithm of a probability (a number between 0 and 1) is never positive, while entropy is defined as a non-negative value.
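A small sketch of the entropy formula, again estimating $p_i$ from label counts. It uses the equivalent form $p_i \log_2(1/p_i)$, which is the same as $-p_i \log_2(p_i)$ and keeps every term non-negative. The function name and sample labels are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """H(S) = sum(p_i * log2(1 / p_i)), identical to -sum(p_i * log2(p_i))."""
    n = len(labels)
    probabilities = [count / n for count in Counter(labels).values()]
    return sum(p * math.log2(1 / p) for p in probabilities)


print(entropy(["yes", "yes", "yes", "yes"]))  # 0.0, no uncertainty
print(entropy(["yes", "yes", "no", "no"]))    # 1.0, one full bit
print(entropy(["a", "b", "c", "d"]))          # 2.0, two bits for four equal classes
```

Classes that do not occur in the node never appear in the counts, so the $p_i = 0$ case needs no special handling here.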

<br>

Information Gain is the primary metric used to select the best feature to split on at each step of the tree-building process. It measures the *reduction in entropy achieved by splitting the data* on a particular feature. The feature with the highest information gain is chosen for the split.

**IG(S,A)**: This is the value we are trying to find. It represents the Information Gain achieved by splitting the dataset S on the feature A.
$$IG(S,A) = H(S) - \sum_{v \in \text{Values}(A)}\frac{|S_v|}{|S|} H(S_v)$$


$S$: Original Set

$A$: The feature being considered

$\text{Values}(A)$: possible values / thresholds for $A$

$S_v$: the subset of $S$ where feature $A$ has value $v$.


The model tries to choose splits that produce children with low impurity, i.e., clean class separation.
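The sketch below computes $IG(S, A)$ for a categorical feature by grouping the labels by feature value. It compares a feature that separates the classes perfectly with one that carries no signal. The data and names are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(S, A) = H(S) - sum(|S_v| / |S| * H(S_v)) over the values v of A."""
    subsets = defaultdict(list)
    for value, label in zip(feature_values, labels):
        subsets[value].append(label)  # S_v: labels of rows where A == v
    n = len(labels)
    weighted = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


labels = ["yes", "yes", "no", "no"]
informative = ["a", "a", "b", "b"]    # separates the classes perfectly
uninformative = ["a", "b", "a", "b"]  # each value still has a 50/50 mix

print(information_gain(informative, labels))    # 1.0
print(information_gain(uninformative, labels))  # 0.0
```

The tree would split on the first feature, since it removes all of the parent's one bit of entropy.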

<br><br>

2. Regression:
* Variance Reduction: It chooses the split that minimizes the weighted Mean Squared Error (MSE) of the child nodes. The tree stops growing when it hits max depth or min samples per leaf, or when a split creates no purity gain.
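For regression, the same split search can be sketched with MSE in place of Gini or entropy. This is a minimal single-feature illustration with made-up data; each leaf would predict the mean of its targets.

```python
def mse(values):
    """Mean squared error around the mean, i.e. the variance of the node."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_regression_split(x, y):
    """Return (threshold, weighted_child_mse) with the lowest weighted MSE."""
    n, best = len(y), None
    values = sorted(set(x))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        weighted = (len(left) * mse(left) + len(right) * mse(right)) / n
        if best is None or weighted < best[1]:
            best = (t, weighted)
    return best


x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

threshold, child_mse = best_regression_split(x, y)
print(threshold)                         # 6.5
print(round(mse(y), 2))                  # 56.92, parent MSE before the split
print(round(child_mse, 2))               # 0.67, weighted MSE after the split
print(sum(y[:3]) / 3, sum(y[3:]) / 3)    # 6.0 21.0, the two leaf predictions
```

The variance reduction is the parent MSE minus the weighted child MSE, which plays the role information gain plays in classification.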



## 5. Pros and Cons
* Pros:
    * Interpretability: We can visualize the exact path taken to make a decision (good for banking / medical compliance).
    * Little preprocessing: Does not require feature scaling. Some implementations split on categorical features directly, so dummy variables are not always needed.
    * Captures non-linear relationships
    * Fast predictions
    * Captures feature interactions automatically
* Cons:
    * Overfitting: Trees tend to grow very deep and memorize noise, leading to high variance.
    * Instability: A small change in the data can result in a completely different tree structure.
    * Greedy splitting (locally optimal, not globally optimal)
    * Orthogonal boundaries: Decision trees only split parallel to the axes ($X > 5$). They struggle with diagonal relationships (e.g., $X > Y$).


## 6. Production Considerations
* Pruning: In production, a tree is rarely left to grow fully. Use pre-pruning (setting max_depth or min_samples_split) or post-pruning (cutting back weak branches) to generalize better.
* Ensembling: A single decision tree is rarely used in high-performance production systems. Trees are almost always used as the building blocks for Random Forests (Bagging) or XGBoost/LightGBM (Boosting) to mitigate the instability and overfitting issues.
* Inference Speed: Prediction costs O(depth) comparisons per sample, which makes trees suitable for low-latency applications.
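The O(depth) claim can be illustrated with a tree stored as flat parallel lists, where prediction is a short loop from the root to a leaf. The layout, node contents and labels below are assumptions for illustration and do not reflect how any specific library serializes its trees.

```python
# Node i splits on FEATURE[i] at THRESHOLD[i]; LEFT/RIGHT hold child indices.
# A child index of -1 marks a leaf, whose prediction is VALUE[i].
FEATURE = [0, 1, -1, -1, -1]
THRESHOLD = [4.5, 2.0, 0.0, 0.0, 0.0]
LEFT = [1, 2, -1, -1, -1]
RIGHT = [4, 3, -1, -1, -1]
VALUE = [None, None, "low", "medium", "high"]


def predict(row):
    """Return (prediction, number_of_comparisons) for one sample."""
    node, comparisons = 0, 0
    while LEFT[node] != -1:  # stop once a leaf is reached
        comparisons += 1
        if row[FEATURE[node]] <= THRESHOLD[node]:
            node = LEFT[node]
        else:
            node = RIGHT[node]
    return VALUE[node], comparisons


print(predict((3.0, 1.0)))  # ('low', 2)
print(predict((3.0, 5.0)))  # ('medium', 2)
print(predict((9.0, 1.0)))  # ('high', 1)
```

The number of comparisons never exceeds the depth of the tree (2 here), regardless of how many training rows were used to build it.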



## 7. Other variants
* Bagging: Random Forest -> reduces variance (overfitting), parallelizable.
* Boosting: XGBoost -> reduces bias.

```python
import sys, os

# Add the parent directory to the import path so that `src` can be found.
root = os.path.abspath("..")
sys.path.append(root)

from src.decision_tree import DecisionTree
import pandas as pd
```

## Case 1: Content ranking and search for hotel booking system

### Business Problem
The ranking system (built on decision trees) is a predictive model that sorts a list of properties (hotels, apartments) for a specific user query by predicting the probability of a "Conversion" (Booking) based on hierarchical feature interactions. For example, a search for "Hotels in London" returns 4,000 results, and the system must order them so that the 10 most relevant appear on the first screen.

The existing heuristic ranking relied heavily on static signals (e.g., popularity, star rating) and didn’t capture personalization or property–user match quality. This resulted in:
* Lower than expected CTR on search result pages
* Weak alignment between user preferences and shown properties
* Difficulty explaining ranking decisions to PMs and Legal/Compliance


The specific goal was to increase the Conversion Rate (CVR) by at least 1% without increasing search latency beyond 50ms, ensuring we could better match property attributes to specific user contexts.


### Data Problem




## Dataset and Feature Engineering




## Case 2: Credit Risk Scoring