# Decision Tree - Theoretical Questions

### **1. What is a Decision Tree, and how does it work?**

A **Decision Tree** is a supervised machine learning algorithm that can be used for both classification and regression tasks.
It works by recursively splitting the dataset into branches based on feature values, forming a tree-like structure. The procedure is greedy: at each node it picks the split that looks best locally, and it keeps splitting until a stopping criterion is met or the leaf nodes become completely pure (homogeneous).

Each internal node represents a decision on a feature, each branch represents the outcome of a decision, and each leaf represents a class label or prediction result.

The goal is to create a model that predicts the target variable by learning simple decision rules inferred from data features.
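
To make the node, branch, and leaf roles concrete, here is a minimal sketch of prediction with an already-built tree. The tree itself, the feature names (`outlook`, `humidity`, `windy`), and the nested-dict layout are all hypothetical and chosen only for illustration.

```python
# Hypothetical hand-built tree: internal nodes test one feature,
# branches are the possible outcomes, leaves hold a class label.
tree = {
    "feature": "outlook",
    "branches": {
        "sunny": {
            "feature": "humidity",
            "branches": {"high": {"label": "no"}, "normal": {"label": "yes"}},
        },
        "overcast": {"label": "yes"},
        "rainy": {
            "feature": "windy",
            "branches": {True: {"label": "no"}, False: {"label": "yes"}},
        },
    },
}


def predict(node, sample):
    """Follow branches from the root until a leaf is reached."""
    while "label" not in node:
        value = sample[node["feature"]]   # outcome of the decision at this node
        node = node["branches"][value]    # move down the matching branch
    return node["label"]


print(predict(tree, {"outlook": "sunny", "humidity": "normal", "windy": False}))  # yes
print(predict(tree, {"outlook": "rainy", "windy": True}))  # no
```

Each path from the root to a leaf reads as one decision rule, for example "if outlook is sunny and humidity is normal, predict yes".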

### **2. What are impurity measures in Decision Trees?**

**Impurity measures** quantify the level of disorder or impurity in a dataset. They help determine the quality of a split.

Common impurity measures include:
- **Gini Impurity**
- **Entropy** (the basis of Information Gain)

A split is better when it produces child nodes with lower impurity, i.e., more homogeneous subgroups.

### **3. What is the mathematical formula for Gini Impurity?**

The Gini Impurity is calculated using the formula:

$$\text{Gini} = 1 - \sum_{i=1}^{n} p_i^2$$

Where:
- $n$ = number of classes
- $p_i$ = proportion of class $i$ elements in the node

For binary classification, this becomes:
$$\text{Gini} = 2p(1-p)$$
where $p$ is the probability of one class.
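
As a quick check of both formulas, the sketch below computes Gini impurity from a list of labels and compares it with the binary shortcut. The function name `gini` and the sample labels are illustrative assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


labels = ["a"] * 3 + ["b"] * 7    # class "a" has proportion p = 0.3
p = 0.3

print(gini(labels))               # general formula, about 0.42
print(2 * p * (1 - p))            # binary shortcut, about 0.42
print(gini(["a", "a", "a"]))      # pure node: impurity 0
```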

### **4. What is the mathematical formula for Entropy?**

The formula for **Entropy** is:

$$H(S) = - \sum_{i=1}^{n} p_i \cdot \log_2(p_i)$$

Where:
- $H(S)$ = entropy of the set $S$
- $n$ = number of classes
- $p_i$ = proportion of class $i$ in set $S$

Entropy reaches its maximum, $\log_2 n$, when the classes are perfectly balanced, and is zero when all data belongs to a single class (taking $0 \cdot \log_2 0 = 0$ by convention).
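
The two extremes can be verified with a short sketch. The function name `entropy` is an illustrative choice; classes that do not occur in the list are simply never counted, which matches treating their contribution as zero.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of the class distribution in `labels`."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


print(entropy(["a", "b", "a", "b"]))   # two balanced classes: 1.0
print(entropy(["a", "b", "c", "d"]))   # four balanced classes: 2.0
print(entropy(["a", "a", "a", "a"]))   # single class: 0.0
```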

### **5. What is Information Gain, and how is it used in Decision Trees?**

**Information Gain (IG)** measures the reduction in entropy after a dataset is split on a feature. It is used to decide which feature to split on, starting at the root node and then at every internal node.

It is calculated as:
$$IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \cdot H(S_v)$$

Where $S$ is the dataset, $A$ is the feature, the sum runs over the distinct values $v$ of $A$, and $S_v$ is the subset of $S$ for which feature $A$ has value $v$.

The feature with the highest IG is chosen for splitting at each step.
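
The following sketch applies the formula to a tiny made-up dataset with two categorical features. The rows, the column order (outlook, windy, label), and the function names are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, feature_index, label_index=-1):
    """Entropy of the labels minus the weighted entropy of each subset S_v."""
    labels = [row[label_index] for row in rows]
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[feature_index]].append(row[label_index])
    weighted = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Made-up rows: (outlook, windy, play)
rows = [
    ("sunny", True, "no"),
    ("sunny", False, "no"),
    ("overcast", False, "yes"),
    ("overcast", True, "yes"),
    ("rainy", False, "yes"),
    ("rainy", True, "no"),
]

gains = {"outlook": information_gain(rows, 0), "windy": information_gain(rows, 1)}
print(gains)                        # outlook is about 0.667, windy about 0.082
print(max(gains, key=gains.get))    # outlook
```

On this data, splitting on `outlook` leaves only the "rainy" subset mixed, so it removes far more uncertainty than `windy` and would be chosen.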

### **6. What is the difference between Gini Impurity and Entropy?**

| Criterion        | Gini Impurity                     | Entropy                             |
|------------------|----------------------------------|--------------------------------------|
| Formula          | $1 - \sum p_i^2$                 | $- \sum p_i \cdot \log_2 p_i$        |
| Interpretation   | Probability of misclassification| Amount of information (uncertainty) |
| Computation      | Slightly faster (no logarithm)   | Slightly slower due to logarithms    |
| Usage            | CART algorithm                   | ID3, C4.5 algorithms                  |

### **7. What is the mathematical explanation behind Decision Trees?**

Decision Trees aim to partition the data space recursively by maximizing an objective function such as Information Gain or minimizing impurity.

At each node, the algorithm chooses the feature, together with a threshold (for numeric features) or a value (for categorical features), that splits the data best according to the chosen criterion, e.g., the highest Information Gain.

Let $D$ be a dataset and $A$ a feature. We compute the split quality (e.g., Information Gain) of $A$ on $D$ and recursively apply the same process to the resulting subsets until a stopping criterion is met, such as reaching a specified `max_depth` or zero impurity (i.e., the leaf node is completely homogeneous, or pure).
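
The recursion can be sketched in a few lines. This is a simplified illustration rather than any particular library's algorithm: it assumes numeric features only, scores splits by weighted Gini impurity instead of Information Gain, and uses a hypothetical nested-dict tree layout.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y):
    """Return (weighted_impurity, feature, threshold), or None if no split exists."""
    best = None
    for f in range(len(X[0])):
        # Skip the largest value: splitting there would leave the right side empty.
        for t in sorted(set(row[f] for row in X))[:-1]:
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build(X, y, depth=0, max_depth=3):
    leaf = {"label": Counter(y).most_common(1)[0][0]}
    if gini(y) == 0.0 or depth == max_depth:   # stopping criteria: pure node or depth limit
        return leaf
    split = best_split(X, y)
    if split is None:                          # all rows identical: nothing to split on
        return leaf
    _, f, t = split
    left = [(row, label) for row, label in zip(X, y) if row[f] <= t]
    right = [(row, label) for row, label in zip(X, y) if row[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
        "right": build([r for r, _ in right], [l for _, l in right], depth + 1, max_depth),
    }


X = [[1, 5], [2, 4], [3, 7], [6, 1], [7, 3], [8, 2]]
y = ["a", "a", "a", "b", "b", "b"]
tree = build(X, y)
print(tree)
```

Here a single split on feature 0 at threshold 3 separates the two classes, so both children are pure and the recursion stops after one level.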

### **8. What is Pre-Pruning in Decision Trees?**

**Pre-Pruning** is a technique that *stops* tree growth **early**, during construction, to prevent overfitting.

Common pre-pruning strategies include:
- Limiting tree depth
- Requiring a minimum number of samples at a node
- Setting a minimum information gain threshold

By restricting tree complexity upfront, pre-pruning reduces variance at the cost of some added bias.
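
The three strategies above amount to a check that runs before each split. The sketch below is one possible way to express it; the function name, parameter names, and default thresholds are hypothetical.

```python
def should_stop(n_samples, depth, best_gain,
                max_depth=3, min_samples_split=10, min_gain=0.01):
    """Return the reason for not splitting this node, or None if a split is allowed."""
    if depth >= max_depth:
        return "max depth reached"
    if n_samples < min_samples_split:
        return "too few samples"
    if best_gain < min_gain:
        return "gain below threshold"
    return None


print(should_stop(n_samples=50, depth=3, best_gain=0.20))    # max depth reached
print(should_stop(n_samples=6, depth=1, best_gain=0.20))     # too few samples
print(should_stop(n_samples=50, depth=1, best_gain=0.001))   # gain below threshold
print(should_stop(n_samples=50, depth=1, best_gain=0.20))    # None
```

A tree builder would call such a check at every node and turn the node into a leaf whenever a reason is returned.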

### **9. What is Post-Pruning in Decision Trees?**

**Post-Pruning** involves growing the full decision tree and *then trimming back* branches that have little statistical significance.

This is typically done by evaluating the tree’s performance on a validation dataset.

Common techniques include cost-complexity pruning (also known as weakest-link pruning, used in CART) and reduced error pruning.
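
As an illustration of pruning against a validation set, here is a minimal sketch of reduced error pruning. It assumes a hypothetical nested-dict tree in which every internal node also stores the `majority` training label it would predict if collapsed into a leaf. The tree and the validation rows are made up.

```python
def predict(node, x):
    while "label" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["label"]


def errors(node, rows):
    """Number of validation rows (x, y) that `node` misclassifies."""
    return sum(predict(node, x) != y for x, y in rows)


def prune(node, rows):
    """Bottom-up: replace a subtree by a leaf when that does not hurt validation error."""
    if "label" in node:
        return node
    f, t = node["feature"], node["threshold"]
    left_rows = [(x, y) for x, y in rows if x[f] <= t]
    right_rows = [(x, y) for x, y in rows if x[f] > t]
    node = dict(node,
                left=prune(node["left"], left_rows),
                right=prune(node["right"], right_rows))
    leaf = {"label": node["majority"]}
    return leaf if errors(leaf, rows) <= errors(node, rows) else node


# Hypothetical fully grown tree; the right subtree has overfit a noisy training point.
full_tree = {
    "feature": 0, "threshold": 5, "majority": "A",
    "left": {"label": "A"},
    "right": {
        "feature": 1, "threshold": 2, "majority": "B",
        "left": {"label": "B"},
        "right": {"label": "A"},
    },
}
validation = [((1, 0), "A"), ((3, 9), "A"), ((7, 1), "B"), ((8, 3), "B"), ((9, 1), "B")]

pruned = prune(full_tree, validation)
print(pruned)
```

On this data the inner split misclassifies one validation row while a plain "B" leaf misclassifies none, so the subtree is collapsed. The root split is kept because removing it would add errors.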

### **10. What is the difference between Pre-Pruning and Post-Pruning?**

| Feature           | Pre-Pruning                                 | Post-Pruning                                |
|--------------------|----------------------------------------------|---------------------------------------------|
| Timing             | Stops tree growth early                      | Prunes after full tree is built             |
| Decision Basis     | Thresholds like depth, min samples, gain     | Validation set or cross-validation          |
| Complexity Control | Proactive                                   | Reactive                                     |
| Common Techniques  | Depth limits, minimum-sample rules, gain thresholds | Cost-complexity pruning (CART), reduced error pruning, error-based pruning (C4.5) |

### **11. What is a Decision Tree Regressor?**

A **Decision Tree Regressor** is used for predicting continuous numerical values.

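Instead of class labels, the leaf nodes hold numeric values (typically the mean target of the training samples in that leaf), and at each node the tree chooses the split that most reduces the variance of the target within the resulting subsets (in place of Information Gain).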

The goal is to partition the feature space such that the mean squared error (MSE) of the predictions is minimized.
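
A single regression split can be sketched as follows. It assumes one numeric feature and made-up data, and the function names are illustrative. The candidate threshold with the lowest weighted variance is chosen, and each side predicts its mean.

```python
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_threshold(xs, ys):
    """Return (weighted_variance, threshold, left_mean, right_mean) of the best split."""
    best = None
    for t in sorted(set(xs))[:-1]:   # the largest value would leave the right side empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        weighted = (len(left) * variance(left) + len(right) * variance(right)) / len(ys)
        if best is None or weighted < best[0]:
            best = (weighted, t, sum(left) / len(left), sum(right) / len(right))
    return best


xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

weighted, threshold, left_mean, right_mean = best_threshold(xs, ys)
print(threshold, left_mean, right_mean)   # 3 6.0 21.0
```

Because each leaf predicts the mean of its samples, the weighted variance of a split equals the MSE of the resulting two-leaf tree on the training data.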

### **12. What are the advantages and disadvantages of Decision Trees?**

**Advantages:**
- Easy to interpret and visualize
- No need for feature scaling
- Can handle both numerical and categorical features, and supports both classification and regression
- Works well with non-linear data

**Disadvantages:**
- Prone to overfitting
- Unstable: small changes in the data can produce a very different tree
- Biased toward the majority class on imbalanced datasets
- Often less accurate than ensemble methods like Random Forest

### **13. How does a Decision Tree handle missing values?**

Missing values can be handled with techniques such as:

- **Surrogate Splits**: Find alternative splits using other correlated features.
- **Imputation**: Replace missing values with the mean/median/mode, or with a value predicted by another model.
- **Splitting on presence**: Create an additional category or branch for missing values.

Some algorithms handle missing values directly, such as CART (via surrogate splits) and C4.5 (by distributing samples with missing values fractionally across branches).
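
Two of the simpler options, imputation and a dedicated "missing" category, can be sketched as a preprocessing step. The column names, values, and the `"<missing>"` marker are made up for illustration.

```python
from statistics import median

# Imputation: replace missing numeric values with the median of the observed ones.
ages = [25, None, 40, 31, None, 58]
observed = [a for a in ages if a is not None]
fill_value = median(observed)
ages_imputed = [fill_value if a is None else a for a in ages]
print(ages_imputed)   # [25, 35.5, 40, 31, 35.5, 58]

# Splitting on presence: give missing categorical values their own category,
# so the tree can create a branch for them.
colors = ["red", None, "blue", "red"]
colors_with_missing = ["<missing>" if c is None else c for c in colors]
print(colors_with_missing)   # ['red', '<missing>', 'blue', 'red']
```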

### **14. How does a Decision Tree handle categorical features?**

Categorical features are handled by evaluating splits on individual categories or on groups of categories:

- For **binary splits**, the tree finds the best subset of categories for a decision boundary.
- For **multiway splits**, a branch is created for each category.
- Algorithms like CART use binary splits, while ID3 and C4.5 allow multiway splits.
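
For the binary case, a brute-force sketch of the subset search is shown below. It assumes a small number of categories, scores each partition by weighted Gini impurity, and uses made-up data. Real implementations use more efficient strategies than enumerating every subset.

```python
from collections import Counter
from itertools import combinations


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_binary_partition(values, labels):
    """Try every proper subset of categories as the left branch; return (score, subset)."""
    categories = sorted(set(values))
    best = None
    for size in range(1, len(categories)):
        # Note: each partition is visited twice (a subset and its complement).
        for subset in combinations(categories, size):
            left = [l for v, l in zip(values, labels) if v in subset]
            right = [l for v, l in zip(values, labels) if v not in subset]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, set(subset))
    return best


values = ["red", "red", "green", "green", "blue", "blue"]
labels = ["y", "y", "n", "n", "y", "y"]

print(best_binary_partition(values, labels))   # (0.0, {'green'})
```

The best binary split sends "green" one way and groups "red" with "blue" the other way. A multiway split would instead create three branches, one per color.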

### **15. What are some real-world applications of Decision Trees?**

- **Medical Diagnosis**: Predicting disease based on symptoms
- **Loan Approval**: Determining creditworthiness
- **Customer Segmentation**: Marketing and targeting strategies
- **Fraud Detection**: Identifying suspicious transactions
- **Risk Assessment**: In insurance and financial domains
- **Manufacturing**: Predictive maintenance and quality control