
# Decision Tree Assignment Solutions

---

## Question 1: What is a Decision Tree?
A Decision Tree is a supervised learning algorithm used for classification and regression.
It recursively splits the data on feature values. Internal nodes represent decision rules (a test on one feature),
branches represent the outcomes of those tests, and leaf nodes hold the prediction: a class label in classification, or a numeric value in regression.
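
To make the node/branch/leaf structure concrete, here is a minimal sketch that fits a shallow tree on the Iris dataset and prints its rules as text. The choice of `max_depth=2` is an assumption made only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree (depth 2 is an illustrative choice) keeps the printout readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Each "|---" line with a comparison is an internal node (decision rule);
# each "class: ..." line is a leaf node.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

In the printed output, indentation shows the branch taken, so each path from the top to a `class:` line is one complete decision rule.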

---

## Question 2: Gini Impurity and Entropy

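Gini Impurity:
Gini = 1 − Σ(p²), where p is the proportion of samples in the node that belong to each class.
Measures the probability that a randomly chosen sample from the node would be misclassified if it were labeled at random according to the node's class distribution.

Entropy:
Entropy = −Σ(p log₂ p)
Measures the randomness (uncertainty) of the class labels in the node, in bits.

Both are 0 for a pure node and largest when the classes are evenly mixed.
The split with the lowest weighted impurity across the resulting child nodes is selected.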
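
As a quick check of the two formulas, the following sketch computes both measures for a pure node and an evenly mixed node. The helper names (`class_probabilities`, `gini`, `entropy`) and the toy labels are illustrative assumptions, not part of any library.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Return the fraction of samples belonging to each class."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]

def gini(labels):
    """Gini impurity: 1 - sum(p^2)."""
    return 1.0 - sum(p ** 2 for p in class_probabilities(labels))

def entropy(labels):
    """Entropy in bits; p * log2(1/p) is the same as -p * log2(p)."""
    return sum(p * math.log2(1 / p) for p in class_probabilities(labels))

pure = ["a", "a", "a", "a"]   # one class only
mixed = ["a", "a", "b", "b"]  # two classes, evenly split

print("pure:  gini =", gini(pure), " entropy =", entropy(pure))
print("mixed: gini =", gini(mixed), " entropy =", entropy(mixed))
```

For two classes, Gini peaks at 0.5 and entropy at 1 bit, both at a 50/50 mix.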

---

## Question 3: Pre-Pruning vs Post-Pruning

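Pre-Pruning:
Stops tree growth early using constraints such as max_depth, min_samples_split, or min_samples_leaf.
Advantage: Reduces overfitting and computation, since the full tree is never built.

Post-Pruning:
Grows the full tree, then removes branches that add little predictive value.
Advantage: Often generalizes better, because pruning decisions are made after seeing the complete tree.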
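
One way to post-prune in scikit-learn is cost-complexity pruning through the `ccp_alpha` parameter. The sketch below is an illustration on the Iris dataset, not the only way to prune; reporting test accuracy for each alpha is done here only for demonstration, and in practice alpha would normally be chosen with cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Grow the full tree, then ask for the sequence of effective alphas
# at which successive branches would be pruned away.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

leaf_counts = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    leaf_counts.append(tree.get_n_leaves())
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  "
          f"test_acc={tree.score(X_test, y_test):.3f}")
```

Larger values of alpha remove more branches, so the number of leaves shrinks as alpha grows.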

---

## Question 4: Information Gain

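Information Gain measures the reduction in entropy after a split: the entropy of the parent node minus the size-weighted average entropy of the child nodes.
The feature (and split point) with the highest information gain is selected.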
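
A small worked example can make the calculation concrete. The sketch below compares a perfect split with a weak one; the function names and the toy "yes"/"no" labels are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c)
               for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely.
perfect_split = [["yes"] * 5, ["no"] * 5]
# A split that leaves both children almost as mixed as the parent.
weak_split = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]

gain_perfect = information_gain(parent, perfect_split)
gain_weak = information_gain(parent, weak_split)
print("Perfect split gain:", gain_perfect)
print("Weak split gain:", round(gain_weak, 4))
```

The perfect split recovers the full 1 bit of parent entropy, while the weak split gains very little, so a tree using this criterion would prefer the first.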

---

## Question 5: Applications, Advantages & Limitations

Applications:
- Medical diagnosis
- Fraud detection
- Loan approval

Advantages:
- Easy to interpret
- Minimal preprocessing (no feature scaling needed)

Limitations:
- Can overfit
- Sensitive to small data changes

---


## Question 6

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Feature Importances:", model.feature_importances_)


## Question 7

full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
full_acc = accuracy_score(y_test, full_tree.predict(X_test))

pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_tree.fit(X_train, y_train)
pruned_acc = accuracy_score(y_test, pruned_tree.predict(X_test))

print("Full Tree Accuracy:", full_acc)
print("Max Depth=3 Accuracy:", pruned_acc)


## Question 8

# load_boston was removed in scikit-learn 1.2, so the California housing
# dataset is used here as a replacement regression dataset.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

housing = fetch_california_housing()
X_housing = housing.data
y_housing = housing.target

X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_housing, y_housing, test_size=0.2, random_state=42
)

reg_model = DecisionTreeRegressor(random_state=42)
reg_model.fit(X_train_h, y_train_h)

y_pred_h = reg_model.predict(X_test_h)

print("Mean Squared Error:", mean_squared_error(y_test_h, y_pred_h))
print("Feature Importances:", reg_model.feature_importances_)


## Question 9

from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [None, 2, 3, 4, 5],
    'min_samples_split': [2, 5, 10]
}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5
)

grid.fit(X_train, y_train)

best_model = grid.best_estimator_
best_accuracy = accuracy_score(y_test, best_model.predict(X_test))

print("Best Parameters:", grid.best_params_)
print("Test Accuracy of Best Model:", best_accuracy)



## Question 10

Step 1: Handle Missing Values
- Numerical: Mean/Median imputation
- Categorical: Mode imputation

Step 2: Encode Categorical Features
- One-Hot Encoding (for nominal categories)
- Ordinal Encoding (if the categories have a natural order)

Step 3: Train Model
- Split data
- Fit DecisionTreeClassifier

Step 4: Tune Hyperparameters
- Use GridSearchCV
- Optimize max_depth, min_samples_split

Step 5: Evaluate Model
- Accuracy
- Precision, Recall, F1-score
- ROC-AUC
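
The five steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption made for illustration: the column names (`age`, `blood_pressure`, `smoker`), the synthetic label, and the injected missing values are hypothetical and stand in for a real patient dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic, hypothetical patient data (not real records).
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
risk = 0.03 * (df["age"] - 50) + 1.0 * (df["smoker"] == "yes") + rng.normal(0, 0.5, n)
y = (risk > 0.5).astype(int)
# Inject missing values so the imputation step has something to do.
df.loc[rng.choice(n, 40, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

# Steps 1 and 2: impute, then one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

# Step 3: split the data and define the model.
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: tune hyperparameters with cross-validation.
param_grid = {
    "tree__max_depth": [2, 3, 4, 5],
    "tree__min_samples_split": [2, 5, 10],
}
grid = GridSearchCV(model, param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
y_pred = grid.predict(X_test)
y_prob = grid.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
print("Best Parameters:", grid.best_params_)
print(classification_report(y_test, y_pred))
print("ROC-AUC:", round(auc, 3))
```

Putting imputation and encoding inside the pipeline means they are re-fit on each cross-validation fold, which avoids leaking information from the validation folds into preprocessing.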

Business Value:
- Early disease detection
- Better patient risk stratification
- Cost reduction
- Improved healthcare decisions
