


# **Assignment -- Decision Tree**

# **Q1: What is a Decision Tree, and how does it work in the context of classification?**

A **Decision Tree** is a supervised machine learning model that predicts an output by learning **if-else decision rules** from the data.

- It splits the dataset based on **feature values** to create branches.
- Each **internal node** represents a decision (e.g., “Petal length ≤ 2.5?”)
- Each **leaf node** represents a final prediction (e.g., Iris Setosa)

### **In Classification:**
The goal is to **separate different classes** step-by-step by choosing the **best splitting condition** at each node.

A Decision Tree tries to make **each split as pure as possible**, meaning that most records in each resulting node belong to a single class.
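
The if-else structure described above can be made visible by printing a fitted tree as text. The following is a minimal sketch, assuming the Iris dataset bundled with scikit-learn and an arbitrary `max_depth=2` chosen only to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rule set short and readable
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_text renders the fitted tree as nested threshold rules:
# internal nodes appear as "feature <= value" tests, leaves as "class: k"
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is an internal node's test, and each `class:` line is a leaf prediction.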

---

# **Q2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact splits in a Decision Tree?**

Both measure **how mixed the classes are** in a node.

### **Gini Impurity (used in CART / sklearn default)**
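Gini = 1 − Σ pₙ², where pₙ is the fraction of samples in the node that belong to class n  
- 0 = pure node (only one class)  
- Higher = more mixed (maximum 0.5 for two classes)  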

### **Entropy (used in ID3 / Information Gain)**
Entropy = − Σ pₙ log₂(pₙ), with pₙ defined as above  
- 0 = pure  
- High value = high disorder (maximum 1 for two equally frequent classes)  

### **Impact on splits:**
- Decision Tree chooses the **split that reduces impurity the most**
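- Both usually produce similar trees, but **Gini is slightly cheaper to compute** because it avoids the logarithm
- Entropy comes from **information theory** and is the basis of Information Gain; it tends to penalize mixed nodes a little more strongly than Gini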
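
To make the two formulas concrete, here is a small sketch that evaluates both measures on a few made-up class counts. The helper names `gini` and `entropy` are illustrative and not part of any library.

```python
import numpy as np

def gini(counts):
    """Gini impurity from a list of per-class sample counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return float(1.0 - np.sum(p ** 2))

def entropy(counts):
    """Shannon entropy (base 2) from per-class sample counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]  # classes with zero samples contribute nothing
    # sum p * log2(1/p) is the same quantity as -sum p * log2(p)
    return float(np.sum(p * np.log2(1.0 / p)))

# Pure node, mostly-one-class node, and a perfectly mixed node
for counts in ([10, 0], [8, 2], [5, 5]):
    print(counts, f"gini={gini(counts):.3f}", f"entropy={entropy(counts):.3f}")
```

Both measures are 0 for the pure node and reach their two-class maximum (0.5 for Gini, 1.0 for entropy) at the 50/50 node.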

---

# **Q3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of each.**

|Type | Description | Advantage |
|------|-------------|-----------|
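| **Pre-Pruning** | Stop growing the tree early using constraints such as `max_depth`, `min_samples_split`, or `min_samples_leaf` | **Reduces overfitting and trains faster**, since the full tree is never built |
| **Post-Pruning** | Grow the full tree first, then remove branches that add little predictive value (e.g., cost-complexity pruning via `ccp_alpha`) | **Often generalizes better**, since branches are removed only after their actual contribution is known |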
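
The two strategies in the table can be compared side by side in scikit-learn. This is only a sketch: the mid-range `alpha` below is an arbitrary illustrative choice, and in practice it would normally be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Pre-pruning: growth is capped while the tree is being built
pre = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute candidate alphas
# for minimal cost-complexity pruning
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice only: take a mid-range alpha (clipped at 0 for safety)
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    acc = model.score(X_test, y_test)
    print(f"{name:12s} leaves={model.get_n_leaves():2d} test acc={acc:.4f}")
```

A larger `ccp_alpha` prunes more aggressively, so sweeping over `path.ccp_alphas` traces the trade-off between tree size and accuracy.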

---

# **Q4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?**

**Information Gain = Reduction in Entropy after a split**  
IG = Entropy(parent) − Σ (n_child / n_parent) × Entropy(child), where each child's entropy is weighted by its share of the parent's samples

- It measures **how much uncertainty is removed** after a split.
- The **split with the highest Information Gain** is selected.
- Ensures the tree splits on the **most informative feature first**
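
As a worked example of the formula above, the sketch below computes Information Gain for two candidate splits of a small made-up label array. The function names are illustrative, not a library API.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Ten samples, five per class
parent = np.array([0] * 5 + [1] * 5)

# Split A separates the classes perfectly; split B barely changes the mix
split_a = [np.array([0, 0, 0, 0, 0]), np.array([1, 1, 1, 1, 1])]
split_b = [np.array([0, 0, 0, 1, 1]), np.array([0, 0, 1, 1, 1])]

ig_a = information_gain(parent, split_a)
ig_b = information_gain(parent, split_b)
print(f"IG(split A) = {ig_a:.4f}")
print(f"IG(split B) = {ig_b:.4f}")
```

Split A removes all uncertainty (gain of 1 bit), while split B gains almost nothing, so a tree using this criterion would choose split A.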

---

# **Q5: What are some common real-world applications of Decision Trees? What are their advantages and limitations?**

### **Real-world Applications:**
- Medical diagnosis  
- Credit loan approval  
- Fraud detection  
- Customer churn prediction  
- Spam filtering  

### **Advantages:**
- **Easy to understand & interpret**
- **No need for feature scaling**
- Works with **numerical + categorical data** (in scikit-learn, categorical features must first be encoded as numbers)

### **Limitations:**
- **Can overfit easily** if not pruned
- **Unstable**: a small change in the data can produce a very different tree
- **Often less accurate than ensembles** such as Random Forest / XGBoost

---






# **Question 6: Write a Python program to:**

### **● Load the Iris Dataset**

### **● Train a Decision Tree Classifier using the Gini criterion**

### **● Print the model’s accuracy and feature importances**

```python
# Colab-ready cell for Q6
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# Load data
iris = load_iris(as_frame=True)
X = iris.data
y = iris.target
feature_names = iris.feature_names
class_names = iris.target_names

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

# Train Decision Tree with Gini criterion (default)
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)

# Feature importances
importances = clf.feature_importances_
feat_imp = pd.Series(importances, index=feature_names).sort_values(ascending=False)

print(f"Accuracy (test): {acc:.4f}")
print("\nFeature importances:")
print(feat_imp)
```

Output:

```
Accuracy (test): 0.8947

Feature importances:
petal length (cm)    0.919887
petal width (cm)     0.046629
sepal width (cm)     0.020091
sepal length (cm)    0.013394
dtype: float64
```


# **Question 7: Write a Python program to:**

### **● Load the Iris Dataset**

### **● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.**

```python
# Colab-ready cell for Q7
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris(as_frame=True)
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

# Fully grown tree (no depth limit)
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
acc_full = accuracy_score(y_test, y_pred_full)

# Shallow tree with max_depth=3
clf_shallow = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_shallow.fit(X_train, y_train)
y_pred_shallow = clf_shallow.predict(X_test)
acc_shallow = accuracy_score(y_test, y_pred_shallow)

print(f"Fully grown tree accuracy: {acc_full:.4f}")
print(f"max_depth=3 tree accuracy:  {acc_shallow:.4f}")

# Optional: print depth and number of leaves for comparison
print("\nFully grown tree depth:", clf_full.get_depth(), " leaves:", clf_full.get_n_leaves())
print("max_depth=3 tree depth:", clf_shallow.get_depth(), " leaves:", clf_shallow.get_n_leaves())
```

Output:

```
Fully grown tree accuracy: 0.8947
max_depth=3 tree accuracy:  0.8947

Fully grown tree depth: 6  leaves: 9
max_depth=3 tree depth: 3  leaves: 5
```


# **Question 8: Write a Python program to:**

### **● Load the Boston Housing Dataset**

### **● Train a Decision Tree Regressor**

### **● Print the Mean Squared Error (MSE) and feature importances**


```python
# Colab-ready cell for Q8
# Note: load_boston() was deprecated in scikit-learn 1.0 and removed in 1.2.
# We fetch the Boston dataset from OpenML instead (data_id=531). If fetch_openml is not
# available, an alternative is fetch_california_housing() or loading a CSV from a known source.
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import pandas as pd

# Fetch Boston from OpenML (ID 531 corresponds to "boston")
boston = fetch_openml(data_id=531, as_frame=True)
X = boston.data
y = boston.target.astype(float)  # target might be a string; convert to float
feature_names = X.columns.tolist()

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Train Decision Tree Regressor (default settings)
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Predict & evaluate
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

# Feature importances
importances = pd.Series(reg.feature_importances_, index=feature_names).sort_values(ascending=False)

print(f"Mean Squared Error (test): {mse:.4f}")
print("\nFeature importances:")
print(importances)
```

Output:

```
Mean Squared Error (test): 16.6884

Feature importances:
RM         0.587170
LSTAT      0.210344
DIS        0.073912
CRIM       0.066320
AGE        0.014041
INDUS      0.011459
B          0.011301
PTRATIO    0.009587
NOX        0.007032
TAX        0.005635
ZN         0.001315
CHAS       0.001127
RAD        0.000758
dtype: float64
```


# **Question 9: Write a Python program to:**

### **● Load the Iris Dataset**

### **● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV**

### **● Print the best parameters and the resulting model accuracy**


```python
# Colab-ready cell for Q9
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris(as_frame=True)
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

param_grid = {
    'max_depth': [None, 1, 2, 3, 4, 5],
    'min_samples_split': [2, 4, 6, 8, 10]
}

dt = DecisionTreeClassifier(random_state=42)
grid = GridSearchCV(dt, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

best_params = grid.best_params_
best_score_cv = grid.best_score_

# Evaluate best estimator on test set
best_est = grid.best_estimator_
test_pred = best_est.predict(X_test)
test_acc = accuracy_score(y_test, test_pred)

print("Best parameters (GridSearchCV):", best_params)
print(f"Best CV accuracy (train-folds): {best_score_cv:.4f}")
print(f"Test accuracy with best params:     {test_acc:.4f}")
```

Output:

```
Best parameters (GridSearchCV): {'max_depth': None, 'min_samples_split': 4}
Best CV accuracy (train-folds): 0.9375
Test accuracy with best params:     0.8947
```


# **Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to:**

### **● Handle the missing values**

### **● Encode the categorical features**

### **● Train a Decision Tree model**

### **● Tune its hyperparameters**

### **● Evaluate its performance**

### **And describe what business value this model could provide in the real-world setting.**

### **Answer**

### **Step 1: Handle Missing Values**
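- Numerical: **Median Imputation** (robust to outliers)
- Categorical: **“Missing” category or most frequent value**
- Optionally add a **“was_missing” flag**, because the fact that a value is missing (e.g., a test that was never ordered) can itself be informative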

### **Step 2: Encode Categorical Features**
- Use **One-Hot Encoding** (safe + interpretable)
- If a feature has too many categories → use **Target Encoding**, fitted inside the cross-validation folds to avoid target leakage

### **Step 3: Train Decision Tree Model**
- Use sklearn `Pipeline` + `ColumnTransformer`
- Control complexity with `max_depth`, `min_samples_split`

### **Step 4: Hyperparameter Tuning**
- Use `GridSearchCV` or `RandomizedSearchCV`
- Important params: `max_depth`, `min_samples_leaf`, `ccp_alpha`

### **Step 5: Evaluate Performance**
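- Metrics: **Accuracy, Precision, Recall, F1, ROC-AUC** (accuracy alone can mislead when the disease is rare)
- **Healthcare Priority = Minimize False Negatives**, i.e., maximize Recall while keeping Precision acceptable
- Show the **Confusion Matrix** to medical experts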
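
Steps 1 to 5 can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption for illustration: the column names (`age`, `blood_pressure`, `smoker`, `region`), the synthetic label, and the small parameter grid are hypothetical stand-ins, not a real patient dataset or a recommended configuration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (hypothetical columns, not real data)
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(125, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east"], n),
})
# Made-up label that depends on age and smoking, plus noise
risk = 0.04 * (df["age"] - 55) + 1.0 * (df["smoker"] == "yes") + rng.normal(0, 1, n)
y = (risk > 0.5).astype(int)

# Knock out roughly 10% of two columns to mimic missing values
df.loc[rng.random(n) < 0.1, "blood_pressure"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker", "region"]

# Steps 1 and 2: impute, flag missingness, and one-hot encode
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median", add_indicator=True), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Step 3: the tree sits at the end of the pipeline
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=42, stratify=y
)

# Step 4: tune complexity; recall is scored because false negatives are costly
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_leaf": [1, 5, 20],
    "tree__ccp_alpha": [0.0, 0.005, 0.02],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

# Step 5: evaluate on held-out data
y_pred = search.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print("Best params:", search.best_params_)
print(cm)
print(classification_report(y_test, y_pred, digits=3))
```

Keeping the imputers and encoder inside the `Pipeline` means they are refit on each training fold during the search, so no information from validation folds leaks into preprocessing. Scoring on recall alone is one possible choice; precision should be checked alongside it so the model does not simply flag every patient.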

### **Step 6: Business Value**
- **Earlier detection**: high-risk patients can be flagged for follow-up testing sooner
- **Better use of resources**: screening and specialist time can be prioritized for the patients most likely to benefit
- **Explainable decisions**: the learned rules can be reviewed by clinicians, which supports trust and auditing

---