
# Decision Tree — Assignment Solutions

**Assignment Code:** DA-AG-012


**Date:** 2025-10-09


## Question 1: Decision Tree in Classification

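A Decision Tree is a supervised learning algorithm that recursively splits the data into branches based on feature values.
Each internal node applies a decision rule to one feature (of the form `feature <= threshold`), each branch corresponds to an outcome of that rule, and each leaf assigns a class label.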
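
To make the node/leaf structure concrete, the sketch below fits a shallow tree on the Iris data and prints its rules as text. The `max_depth=2` setting and the variable names are illustrative assumptions, not part of the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree (assumed depth of 2) keeps the printed rule set short
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Indented lines with a comparison are decision rules; "class:" lines are leaves
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Reading the printout from top to bottom follows one path from the root to a leaf.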

---

## Question 2: Gini Impurity and Entropy

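- **Gini Impurity:** The probability that a randomly chosen sample in a node would be misclassified if it were labeled at random according to the node's class distribution: $G = 1 - \sum_k p_k^2$.
- **Entropy:** The disorder (uncertainty) of the class distribution in a node: $H = -\sum_k p_k \log_2 p_k$.

Both are zero for a pure node and largest when the classes are evenly mixed. At each node, the tree picks the split that most reduces the size-weighted impurity of the child nodes.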
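
As a rough illustration of the two measures, the following sketch computes them from a list of class labels with NumPy. The helper names `gini` and `entropy` are hypothetical and written for this example; scikit-learn computes these internally.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over observed classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

pure = [0, 0, 0, 0]    # one class only
mixed = [0, 0, 1, 1]   # evenly split between two classes

print("pure :", gini(pure), entropy(pure))
print("mixed:", gini(mixed), entropy(mixed))
```

A pure node scores zero on both measures, while an even two-class split gives a Gini impurity of 0.5 and an entropy of 1 bit.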

---

## Question 3: Pre-Pruning vs Post-Pruning

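- Pre-Pruning: Stops tree growth early by constraining it during training (e.g., `max_depth`, `min_samples_split`, `min_samples_leaf`).
- Post-Pruning: Grows the full tree first, then removes branches that add little predictive value (e.g., cost-complexity pruning via `ccp_alpha` in scikit-learn).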
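
The two strategies might look like this in scikit-learn. This is a minimal sketch on the Iris data: the specific limits (`max_depth=3`, `min_samples_leaf=5`) and the choice of the middle `ccp_alpha` from the pruning path are arbitrary assumptions for illustration; in practice the value would usually be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: constrain growth while the tree is being built
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute its cost-complexity pruning path
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Arbitrary pick from the path for illustration only
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post.fit(X_train, y_train)

print("Leaves (full, pre-pruned, post-pruned):",
      full.get_n_leaves(), pre.get_n_leaves(), post.get_n_leaves())
```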

---

## Question 4: Information Gain

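Information Gain measures the reduction in entropy achieved by a split: the parent node's entropy minus the size-weighted average entropy of its child nodes. At each node, the tree selects the feature and threshold with the highest gain.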
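
A small worked example may help. The sketch below computes the gain for two candidate splits of a six-sample node; the helper names and the toy labels are made up for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 1, 1, 1])

# Split A separates the classes completely
gain_perfect = information_gain(parent, parent[:3], parent[3:])

# Split B leaves both children as mixed as the parent
gain_useless = information_gain(parent, parent[[0, 3]], parent[[1, 2, 4, 5]])

print("Perfect split gain:", gain_perfect)
print("Uninformative split gain:", gain_useless)
```

The perfect split recovers the full 1 bit of parent entropy, while the uninformative split gains nothing, so the tree would prefer split A.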

---

## Question 5: Applications and Limitations

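Applications include healthcare (e.g., diagnosis support), finance (e.g., credit scoring), and fraud detection.
Limitations include overfitting (an unpruned tree can memorize noise in the training data) and instability (small changes in the data can produce a very different tree).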
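
The overfitting limitation can be seen on noisy data. The sketch below uses a synthetic dataset with 20% label noise; the dataset and all its parameters are assumptions chosen for the demonstration. It compares training and test accuracy for an unrestricted tree and a depth-limited one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical): flip_y=0.2 assigns random labels to 20% of samples
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("unrestricted", deep), ("max_depth=3", shallow)]:
    print(f"{name:>12}: train={model.score(X_train, y_train):.3f} "
          f"test={model.score(X_test, y_test):.3f}")
```

The unrestricted tree fits the training set perfectly, noise included, so the gap between its training and test accuracy is the overfitting to look for.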


## Question 6: Iris Dataset — Decision Tree (Gini)

In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(criterion='gini')
model.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("Feature Importances:", model.feature_importances_)


Accuracy: 1.0
Feature Importances: [0.         0.01911002 0.55727376 0.42361622]


## Question 7: Depth Comparison

In [2]:
model_depth3 = DecisionTreeClassifier(max_depth=3, random_state=42)
model_depth3.fit(X_train, y_train)

model_full = DecisionTreeClassifier(random_state=42)
model_full.fit(X_train, y_train)

print("Accuracy (depth=3):", accuracy_score(y_test, model_depth3.predict(X_test)))
print("Accuracy (full):", accuracy_score(y_test, model_full.predict(X_test)))


Accuracy (depth=3): 1.0
Accuracy (full): 1.0


## Question 8: Boston Housing Regression

In [6]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import pandas as pd
import numpy as np

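# Original CMU copy of the dataset; each record spans two physical lines in the file
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Stitch each record back together: 11 + 2 feature columns, then the target column
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]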

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3, random_state=42)

reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("Feature Importances:", reg.feature_importances_)

MSE: 11.588026315789474
Feature Importances: [5.84654523e-02 9.88919249e-04 9.87244881e-03 2.97334284e-04
 7.05056208e-03 5.75807411e-01 7.17019866e-03 1.09624049e-01
 1.64635669e-03 2.18111251e-03 2.50428658e-02 1.18729904e-02
 1.89980299e-01]


## Question 9: GridSearchCV Tuning

In [9]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Candidate hyperparameter values to search over
param_grid = {
    'max_depth': [2, 3, 4, 5],
    'min_samples_split': [2, 4, 6]
}

# scikit-learn maximizes scores, so MSE is exposed as its negative
grid = GridSearchCV(DecisionTreeRegressor(random_state=42),
                    param_grid,
                    cv=5,
                    scoring='neg_mean_squared_error')

grid.fit(X_train, y_train)

print("Best Params:", grid.best_params_)
print("Best MSE:", -grid.best_score_)  # negate to report the cross-validated MSE as a positive number

Best Params: {'max_depth': 5, 'min_samples_split': 2}
Best MSE: 22.448019032252112


## Question 10: Healthcare Use Case

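Steps:
1. Handle missing values via imputation (e.g., the median for numeric features, the most frequent value for categorical ones).
2. Encode categorical variables (e.g., one-hot encoding).
3. Train a Decision Tree model on a training split.
4. Tune hyperparameters such as `max_depth` and `min_samples_split` with cross-validation.
5. Evaluate on a held-out test set using accuracy, precision, and recall; recall matters most when missing a true case is costly.

**Business Value:** Improves diagnostic accuracy and speeds up decision-making, and the tree's rules can be inspected by clinicians.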
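
The steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is hypothetical: the patient table is randomly generated, the column names (`age`, `blood_pressure`, `smoker`) and the rule that produces the label are invented for illustration, and optimizing for recall is an assumed choice rather than a requirement.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic patient data (not a real dataset)
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(130, 20, n),
    "smoker": rng.choice(["yes", "no"], n),
})
# Invented labeling rule, used only to give the model something to learn
y = ((df["age"] > 60) & (df["blood_pressure"] > 125)).astype(int)
# Blank out some measurements to simulate missing values
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan

# Steps 1-2: impute numeric columns, one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["smoker"]),
])
# Step 3: the tree is the final stage of the pipeline
pipe = Pipeline([("prep", preprocess),
                 ("tree", DecisionTreeClassifier(random_state=42))])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=42
)

# Step 4: tune hyperparameters with 5-fold cross-validation, scoring by recall
grid = GridSearchCV(pipe,
                    {"tree__max_depth": [2, 3, 4, 5],
                     "tree__min_samples_leaf": [1, 5, 10]},
                    cv=5, scoring="recall")
grid.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set
y_pred = grid.predict(X_test)
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
print("Best Params:", grid.best_params_)
print(f"Accuracy={acc:.3f} Precision={prec:.3f} Recall={rec:.3f}")
```

Keeping imputation and encoding inside the pipeline means they are refit on each cross-validation fold, which avoids leaking information from the validation folds into preprocessing.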
