
# Decision Tree | Assignment Solution

# Theory Questions


### Question 1
**What is a Decision Tree, and how does it work in the context of classification?**



**Answer:**  
A Decision Tree is a flowchart-like model that splits the feature space into regions using a sequence of if–else rules. In classification, each internal node tests a feature; branches correspond to test outcomes; and leaves assign a class. The tree is learned by greedily choosing splits that best reduce class impurity (e.g., Gini or entropy) on the training data, and predictions for a new sample follow the path dictated by its feature values until a leaf is reached.
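
To make the root-to-leaf traversal concrete, here is a minimal sketch that fits a shallow tree and prints its learned rules. The Iris dataset, `max_depth=2`, and the choice of the first sample are illustrative assumptions, not part of the question.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Print the learned if-else rules, one line per node.
print(export_text(tree, feature_names=iris.feature_names))

# decision_path marks the nodes a sample visits on its way from root to leaf.
sample = iris.data[:1]
node_ids = tree.decision_path(sample).indices
print("Nodes visited:", node_ids.tolist())
print("Predicted class:", iris.target_names[tree.predict(sample)[0]])
```

Each printed rule corresponds to one internal node test, and the visited node list is the path the sample follows until it reaches a leaf.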



### Question 2
**Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?**



**Answer:**  
- **Gini impurity:** \( G = \sum_k p_k (1 - p_k) = 1 - \sum_k p_k^2 \), where \(p_k\) is the fraction of class *k* in a node. Lower Gini means purer nodes.  
- **Entropy:** \( H = -\sum_k p_k \log_2 p_k \). Lower entropy means more certainty.  

At each node, the algorithm evaluates candidate splits and chooses the one that **maximizes impurity reduction** (Information Gain for entropy, or Gini decrease). This yields child nodes that are purer than the parent, improving class separability.
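
As a quick check of the two formulas, the following sketch computes both measures from the class counts in a node. The helper names `gini` and `entropy` and the example counts are hypothetical, chosen only for illustration.

```python
import numpy as np

def gini(counts):
    """Gini impurity: 1 - sum_k p_k^2."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Entropy in bits: sum_k p_k * log2(1 / p_k), equal to -sum_k p_k * log2(p_k)."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]  # an empty class contributes 0 to the sum
    return np.sum(p * np.log2(1.0 / p))

# Class counts for a balanced node, a mostly pure node, and a pure node.
for counts in ([50, 50], [90, 10], [100, 0]):
    print(counts, f"gini={gini(counts):.3f}", f"entropy={entropy(counts):.3f}")
```

For two classes, both measures peak at a 50/50 node (Gini 0.5, entropy 1 bit) and drop to 0 for a pure node, which is why they usually rank candidate splits similarly.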



### Question 3
**What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**



**Answer:**  
- **Pre-pruning (early stopping):** Limit the tree during growth using constraints like `max_depth`, `min_samples_split`, or `min_impurity_decrease`.  
  - *Advantage:* Faster training and lower variance out-of-the-box (prevents very deep, overfit trees).  
- **Post-pruning:** First grow a large tree, then prune back subtrees that do not justify their complexity. Cost-complexity pruning (`ccp_alpha` in scikit-learn) is a common form, with the penalty strength chosen by validation or cross-validation.  
  - *Advantage:* Often yields a better bias–variance trade-off because the algorithm can explore rich splits before pruning subtrees that don’t generalize.
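
The two strategies can be compared side by side with a short sketch. The breast cancer dataset, the specific constraint values, and the mid-path choice of `ccp_alpha` are assumptions made for illustration; in practice the alpha would be selected by cross-validation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Unconstrained baseline.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning: constrain growth up front.
pre = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: get the candidate alphas of the fully grown tree,
# then refit with one of them.
path = full.cost_complexity_pruning_path(X_train, y_train)
# Arbitrary mid-path value, clipped at 0 to guard against rounding noise.
alpha = max(path.ccp_alphas[len(path.ccp_alphas) // 2], 0.0)
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name}: leaves={model.get_n_leaves()}, "
          f"test accuracy={model.score(X_test, y_test):.4f}")
```

Both pruned trees end up with no more leaves than the full tree; which one generalizes better depends on the data and on how the constraints or alpha are tuned.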



### Question 4
**What is Information Gain in Decision Trees, and why is it important for choosing the best split?**



**Answer:**  
**Information Gain (IG)** is the reduction in impurity from a parent node to its children after a split. Using entropy,  
\[ IG = H(\text{parent}) - \sum_i \frac{N_i}{N} H(\text{child}_i) \]  
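Higher IG indicates a split that creates purer child nodes. At each node, the tree picks the split with **maximum IG** (or maximum Gini decrease). This greedy choice gives the most informative partition of the training data at that node, although it does not guarantee a globally optimal tree or better generalization.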
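
A small worked example of the IG formula, assuming a toy parent node with five samples of each class and two hypothetical candidate splits:

```python
import numpy as np

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * np.log2(1.0 / p))

def information_gain(parent, children):
    """IG = H(parent) - sum_i (N_i / N) * H(child_i)."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [0] * 5 + [1] * 5

# Split A leaves one sample of the wrong class in each child.
split_a = ([0, 0, 0, 0, 1], [0, 1, 1, 1, 1])
# Split B separates the classes perfectly.
split_b = ([0] * 5, [1] * 5)

ig_a = information_gain(parent, split_a)
ig_b = information_gain(parent, split_b)
print(f"IG(split A) = {ig_a:.4f}")
print(f"IG(split B) = {ig_b:.4f}")
```

Split B removes all uncertainty (IG of 1 bit, the full parent entropy), so a tree choosing by maximum IG would prefer it over split A.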



### Question 5
**What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?**



**Answer:**  
- **Applications:** credit risk scoring, churn prediction, medical diagnosis triage, fraud detection, lead qualification, and simple rule-based recommenders.  
- **Advantages:** easy to interpret and visualize; works with numerical and categorical features (scikit-learn requires categorical features to be encoded first); no feature scaling needed; captures non-linear relations and interactions.  
- **Limitations:** prone to overfitting without pruning; decision boundaries are axis-aligned; small data changes can alter the tree (instability); typically lower accuracy than ensembles like Random Forests/Gradient Boosting on complex tasks.


# Practical Questions

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, mean_squared_error

RANDOM_STATE = 42

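def describe_feature_importances(model, feature_names):
    """Return the model's feature importances as a Series sorted in descending order.

    Returns an empty Series if the model does not expose `feature_importances_`.
    """
    if hasattr(model, "feature_importances_"):
        return pd.Series(model.feature_importances_, index=feature_names).sort_values(ascending=False)
    return pd.Series(dtype=float)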



### Question 6
**Write a Python program to:**
- Load the Iris Dataset  
- Train a Decision Tree Classifier using the Gini criterion  
- Print the model’s accuracy and feature importances


In [2]:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load data
iris = load_iris(as_frame=True)
X = iris.data
y = iris.target
feature_names = X.columns.tolist()

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y)

# Model with Gini
clf_gini = DecisionTreeClassifier(criterion="gini", random_state=RANDOM_STATE)
clf_gini.fit(X_train, y_train)

# Evaluate
y_pred = clf_gini.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print(f"Accuracy (Gini): {acc:.4f}")
fi = describe_feature_importances(clf_gini, feature_names)
# Print the header and table separately so the first row is not indented by print's separator.
print("\nFeature Importances (descending):")
print(fi.to_string())


Accuracy (Gini): 0.8947

Feature Importances (descending):
petal length (cm)    0.919887
petal width (cm)     0.046629
sepal width (cm)     0.020091
sepal length (cm)    0.013394



### Question 7
**Write a Python program to:**
- Load the Iris Dataset  
- Train a Decision Tree Classifier with `max_depth=3` and compare its accuracy to a fully-grown tree.


In [3]:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load data
iris = load_iris(as_frame=True)
X = iris.data
y = iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y)

# Shallow tree
clf_depth3 = DecisionTreeClassifier(max_depth=3, random_state=RANDOM_STATE)
clf_depth3.fit(X_train, y_train)
acc_depth3 = accuracy_score(y_test, clf_depth3.predict(X_test))

# Fully-grown tree (no max_depth)
clf_full = DecisionTreeClassifier(random_state=RANDOM_STATE)
clf_full.fit(X_train, y_train)
acc_full = accuracy_score(y_test, clf_full.predict(X_test))

print(f"Accuracy (max_depth=3): {acc_depth3:.4f}")
print(f"Accuracy (fully-grown): {acc_full:.4f}")


Accuracy (max_depth=3): 0.8947
Accuracy (fully-grown): 0.8947
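
Both trees score the same on this 38-sample test split, so a single split cannot separate them. As a hedged follow-up, the sketch below compares the two settings with 5-fold stratified cross-validation on the full dataset; the fold count and shuffling seed are assumptions, not part of the original task.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=42),
    "fully-grown": DecisionTreeClassifier(random_state=42),
}

# Score each model on the same folds so the comparison is paired.
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv)
    results[name] = scores
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std():.4f}")
```

Averaging over folds gives a less noisy estimate than one small test set, though on a dataset this small the difference between the two trees may still be within the fold-to-fold spread.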



### Question 8
**Write a Python program to:**
- Load the California Housing dataset from `sklearn`  
- Train a Decision Tree Regressor  
- Print the Mean Squared Error (MSE) and feature importances


In [4]:

from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor

# Load data
cal = fetch_california_housing(as_frame=True)
X = cal.data
y = cal.target  # Median house value, in units of $100,000
feature_names = X.columns.tolist()

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=RANDOM_STATE)

# Model
regr = DecisionTreeRegressor(random_state=RANDOM_STATE)
regr.fit(X_train, y_train)

# Evaluate
pred = regr.predict(X_test)
mse = mean_squared_error(y_test, pred)
print(f"Test MSE: {mse:.4f}")

fi = describe_feature_importances(regr, feature_names)
# Print the header and table separately so the first row is not indented by print's separator.
print("\nFeature Importances (descending):")
print(fi.to_string())


Test MSE: 0.5285

Feature Importances (descending):
MedInc        0.526241
AveOccup      0.134914
Latitude      0.088012
Longitude     0.086799
HouseAge      0.050926
AveRooms      0.048155
Population    0.036914
AveBedrms     0.028039



### Question 9
**Write a Python program to:**
- Load the Iris Dataset  
- Tune the Decision Tree’s `max_depth` and `min_samples_split` using `GridSearchCV`  
- Print the best parameters and the resulting model accuracy


In [5]:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load data
iris = load_iris(as_frame=True)
X = iris.data
y = iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y)

# Grid search
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 3, 4, 5, 10]
}
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=RANDOM_STATE),
    param_grid=param_grid,
    cv=5,
    n_jobs=-1
)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_
test_acc = accuracy_score(y_test, best_model.predict(X_test))

print("Best Params:", grid.best_params_)
print(f"Test Accuracy of Best Model: {test_acc:.4f}")


Best Params: {'max_depth': 3, 'min_samples_split': 2}
Test Accuracy of Best Model: 0.8947



### Question 10
**Healthcare use case:** You have a large dataset with mixed data types and some missing values. Explain the step-by-step process to:
- Handle missing values
- Encode categorical features
- Train a Decision Tree model
- Tune its hyperparameters
- Evaluate its performance  
And describe the business value.



**Answer:**  
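1. **Data audit & split:** Profile missingness by column and class, then create a stratified train/validation/test split before fitting any preprocessing. Fit imputers and encoders on the training data only, so that no information from the held-out sets leaks into the model.  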
2. **Missing values:**  
   - Numerical: impute with median (robust to outliers).  
   - Categorical: impute with a dedicated category like `"Missing"` (preserves signal of missingness).  
   - Optionally add *missing-indicator* features for high-impact columns.  
3. **Encoding:** Use **one-hot encoding** for nominal categories, since it imposes no artificial order, and **ordinal encoding** only for categories with a genuine order (e.g., disease stage). Decision Trees do not require feature scaling.  
4. **Model training:** Train a `DecisionTreeClassifier` with a fixed `random_state`. Start simple (Gini or entropy) and use class weighting (e.g., `class_weight="balanced"`) if classes are imbalanced.  
5. **Hyperparameter tuning:** Use cross-validation to search over `max_depth`, `min_samples_split`, `min_samples_leaf`, and `ccp_alpha` (cost-complexity pruning). Optimize a metric aligned to the business goal (e.g., recall or balanced accuracy if missing a disease is costly).  
6. **Evaluation:** Report accuracy plus **precision/recall/F1**, ROC–AUC, confusion matrix, and calibration (Brier score, reliability plot). Perform threshold analysis to balance false negatives vs. false positives.  
7. **Interpretability & monitoring:** Inspect feature importances, decision-path explanations, and partial dependence/ICE plots for key drivers; set up drift and performance monitoring post-deployment.  

**Business value:** Earlier and more accurate triage reduces missed diagnoses (fewer false negatives), optimizes resource allocation (high-risk patients are flagged for further testing), and supports clinicians with transparent, explainable rules, which improves patient outcomes and operational efficiency.
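
The steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration under stated assumptions: the patient data is synthetic, every column name (`age`, `systolic_bp`, `smoker`, `ward`) is hypothetical, and the grid values and the `balanced_accuracy` metric are example choices rather than recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 600

# Synthetic stand-in for patient records; all column names are hypothetical.
df = pd.DataFrame({
    "age": rng.integers(20, 90, size=n).astype(float),
    "systolic_bp": rng.normal(130, 20, size=n),
    "smoker": rng.choice(["yes", "no"], size=n),
    "ward": rng.choice(["A", "B", "C"], size=n),
})
risk = 0.04 * (df["age"] - 55) + 0.03 * (df["systolic_bp"] - 130) + (df["smoker"] == "yes")
y = (risk + rng.normal(0, 1, size=n) > 0.5).astype(int)

# Blank out roughly 10% of two columns to simulate missing values.
df.loc[rng.random(n) < 0.10, "systolic_bp"] = np.nan
df.loc[rng.random(n) < 0.10, "smoker"] = np.nan

# Step 1: split before fitting any preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, stratify=y, random_state=42
)

numeric_cols = ["age", "systolic_bp"]
categorical_cols = ["smoker", "ward"]

# Steps 2-3: median imputation (plus missing indicators) for numeric columns,
# a "Missing" category followed by one-hot encoding for categorical columns.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median", add_indicator=True), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Step 4: the tree, with class weighting in case of imbalance.
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])

# Step 5: cross-validated search; preprocessing is refit inside each fold.
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_leaf": [1, 5, 20],
    "tree__ccp_alpha": [0.0, 0.005, 0.02],
}
search = GridSearchCV(model, param_grid, scoring="balanced_accuracy", cv=5)
search.fit(X_train, y_train)

# Step 6: evaluate once on the held-out test set.
y_pred = search.predict(X_test)
y_prob = search.predict_proba(X_test)[:, 1]
print("Best params:", search.best_params_)
print(classification_report(y_test, y_pred))
print(f"ROC AUC: {roc_auc_score(y_test, y_prob):.3f}")
```

Wrapping the imputers and encoder in the pipeline means `GridSearchCV` fits them on each training fold only, which is what keeps the cross-validation scores free of leakage.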

