

## **Q1. What is a Decision Tree, and how does it work in the context of classification?**

A **Decision Tree** is a supervised machine learning model used for **classification and regression**. It works by recursively splitting the data into subsets, at each step choosing the feature (and threshold) that best separates the target classes.

### **How it Works (Classification)**

1. The root node receives the entire dataset.
2. The algorithm selects the feature that best separates the classes (using Gini or Entropy).
3. Based on the selected feature, the dataset is split into branches.
4. Each branch continues splitting until:

   * All samples in a node belong to one class, or
   * A stopping criterion (max depth, min samples) is reached.
5. Leaf nodes assign a class label.

### **Why It Works**

A decision tree models human-like decision-making using simple yes/no questions. It learns rules such as:

> If petal length < 2.5 → Iris-setosa
> Else if petal width < 1.8 → Iris-versicolor
> Else → Iris-virginica
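
Rules of this kind can be read directly from a fitted tree. The following is a minimal sketch, assuming scikit-learn is installed; the variable name `shallow_tree` and the choice of `max_depth=2` are illustrative, and the exact thresholds printed depend on the data and the library version.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rule set short enough to read.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
shallow_tree.fit(iris.data, iris.target)

# export_text renders the fitted tree as nested threshold rules.
rules = export_text(shallow_tree, feature_names=iris.feature_names)
print(rules)
```

Each indented line in the printout is one yes/no question on a single feature, which is what makes the model easy to inspect.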

### **Advantages**

* Easy to understand and visualize
* Handles numerical and categorical data (scikit-learn's implementation requires categorical features to be encoded numerically first)
* Requires little data preprocessing

### **Limitations**

* Prone to overfitting
* High variance

---




## **Q2. Explain Gini Impurity and Entropy. How do they impact splits in a Decision Tree?**

### **1) Gini Impurity**

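Measures how often a randomly chosen sample from a node would be misclassified if it were labeled at random according to the node's class distribution.

$$
\text{Gini} = 1 - \sum_i p_i^2
$$

where $p_i$ is the fraction of samples in the node that belong to class $i$.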

* Range: 0 (pure) to 0.5 (maximally impure, binary case); with $k$ classes the maximum is $1 - 1/k$
* Slightly cheaper to compute than entropy because it needs no logarithm
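
The formula can be checked on a few small label lists. This is a minimal sketch in plain Python; the helper name `gini_impurity` is an assumption made for illustration and is not a scikit-learn function.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the class proportions in `labels`."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


print(gini_impurity(["a", "a", "a", "a"]))  # pure node
print(gini_impurity(["a", "a", "b", "b"]))  # evenly mixed binary node
print(gini_impurity(["a", "b", "c"]))       # evenly mixed three-class node
```

The pure node scores 0 and the evenly mixed binary node scores 0.5, while the three-class node scores about 0.667, which shows that the 0.5 ceiling applies only to the binary case.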

### **2) Entropy**

Measures the impurity using information theory.

$$
\text{Entropy} = -\sum_i p_i \log_2(p_i)
$$

* Range: 0 (pure) to 1 (maximally impure, binary case); with $k$ classes the maximum is $\log_2 k$
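
A small plain-Python sketch of the entropy formula follows. The helper name `entropy` is illustrative only; classes with zero samples never appear in the counts, so the logarithm is always defined.

```python
import math
from collections import Counter


def entropy(labels):
    """Return -sum(p_i * log2(p_i)) over the class proportions in `labels`."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for count in Counter(labels).values():
        p = count / total
        result -= p * math.log2(p)
    return result


print(entropy(["a", "a", "a", "a"]))  # pure node
print(entropy(["a", "a", "b", "b"]))  # evenly mixed binary node
print(entropy(["a", "b", "c", "d"]))  # evenly mixed four-class node
```

The evenly mixed binary node needs 1 bit and the four-class node needs 2 bits, matching the $\log_2 k$ upper bound for $k$ equally likely classes.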

### **Impact on Splits**

Decision Trees choose **splits that reduce impurity the most**.

* Using **Gini**, the tree tends to favor splits that isolate the most frequent class in its own branch.
* Using **Entropy**, the tree prefers splits that maximize “information gain.”

In practice:

* Gini is the default `criterion` in scikit-learn
* Both usually produce very similar trees
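
To make "reduce impurity the most" concrete, here is a minimal sketch of a threshold search on a single numeric feature using Gini. The function names and the six petal-length values are hypothetical, and real implementations are considerably more optimized than this brute-force loop.

```python
from collections import Counter


def gini(labels):
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best `value <= threshold` split."""
    candidates = sorted(set(values))
    best = (None, gini(labels))  # baseline: impurity of the unsplit parent node
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2  # midpoint between adjacent distinct values
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical petal lengths (cm) with their species labels.
petal_length = [1.4, 1.5, 1.3, 4.7, 4.5, 5.1]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

threshold, impurity = best_threshold(petal_length, species)
print(threshold, impurity)
```

On this toy data the search settles on the midpoint between 1.5 and 4.5, where both child nodes are pure. Swapping `gini` for an entropy function would give the entropy-based variant of the same search.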

---



## **Q3. Difference between Pre-Pruning and Post-Pruning. Give one advantage each.**

### **Pre-Pruning (Early Stopping)**

Stops the tree **before** it grows too deep.

Methods:

* max_depth
* min_samples_split
* min_samples_leaf

**Advantage:**
✔ Prevents overfitting early and reduces training time.
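
The effect of these limits can be seen by comparing tree sizes. This is a minimal sketch assuming scikit-learn; the specific values (`max_depth=3`, `min_samples_split=10`, `min_samples_leaf=5`) are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# No limits: the tree grows until every leaf is pure.
unrestricted = DecisionTreeClassifier(random_state=42).fit(X, y)

# Pre-pruned: growth stops as soon as any limit is hit.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=10, min_samples_leaf=5, random_state=42
).fit(X, y)

for name, tree in [("unrestricted", unrestricted), ("pre-pruned", pre_pruned)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```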

---

### **Post-Pruning (Cost Complexity Pruning)**

Grows the full tree first, then removes branches that contribute little to predictive performance.

Methods:

* Reduced error pruning
* Cost complexity pruning (ccp_alpha)

**Advantage:**
✔ Produces simpler, more generalizable models.
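
Cost complexity pruning can be explored with scikit-learn's `cost_complexity_pruning_path`, which lists the `ccp_alpha` values at which the full tree would lose another subtree. The sketch below refits one tree per alpha on an Iris split; the split parameters are assumptions chosen for illustration, and in practice alpha would be picked by cross-validation rather than by inspecting test accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate alphas at which the fully grown tree would lose another subtree.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)

node_counts = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    tree.fit(X_train, y_train)
    node_counts.append(tree.tree_.node_count)
    print(f"alpha={alpha:.4f} nodes={tree.tree_.node_count} "
          f"test accuracy={tree.score(X_test, y_test):.3f}")
```

Larger alphas penalize tree size more heavily, so the node count shrinks as alpha grows.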

---



## **Q4. What is Information Gain and why is it important?**

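**Information Gain (IG)** measures how much a split on a feature reduces entropy, comparing the parent node with the weighted average of its child nodes.

$$
IG = \text{Entropy}(\text{parent}) - \sum_i \frac{N_i}{N} \, \text{Entropy}(\text{child}_i)
$$

where $N$ is the number of samples in the parent node and $N_i$ is the number of samples in child $i$.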

### **Importance**

* Higher IG = better feature for splitting
* Helps select the **most informative feature**
* Guides the tree toward splits that actually separate the classes

Entropy-based decision trees greedily choose the split with **maximum IG** at each node. This keeps training fast, although a greedy choice does not guarantee the globally optimal tree.
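
A worked example helps make the formula concrete. The sketch below uses hypothetical class counts: a parent node with 5 positive and 5 negative samples is split into children with counts (4, 1) and (1, 4). The function names are illustrative only.

```python
import math


def entropy_from_counts(counts):
    """Entropy in bits of a node described by its per-class sample counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = sum(parent_counts)
    weighted_child_entropy = sum(
        (sum(child) / n) * entropy_from_counts(child) for child in children_counts
    )
    return entropy_from_counts(parent_counts) - weighted_child_entropy


ig = information_gain([5, 5], [[4, 1], [1, 4]])
print(round(ig, 4))  # prints 0.2781
```

The parent has an entropy of 1 bit and each child about 0.722 bits, so the split gains roughly 0.278 bits. A split into perfectly pure children would gain the full 1 bit.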

---



## **Q5. Real-World Applications of Decision Trees + Advantages & Limitations**

### **Applications**

* Medical diagnosis
* Customer churn prediction
* Credit risk scoring
* Fraud detection
* Loan approval systems
* Product recommendation
* Agriculture (crop disease prediction)

### **Advantages**

* Interpretable (explainable AI)
* Handles both numerical & categorical data (after encoding, in scikit-learn)
* No need for feature scaling

### **Limitations**

* Overfitting
* High variance
* Unstable with small dataset changes
* Biased towards features with many levels

---

# **Programming Questions**

---



## **Q6. Python Program – Train Decision Tree (Gini) on Iris Dataset**



```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predictions
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Feature Importances:", clf.feature_importances_)
```

Output:

```
Accuracy: 1.0
Feature Importances: [0.         0.01667014 0.90614339 0.07718647]
```


## **Q7. Write a Python program to:**

* Load the Iris Dataset
* Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

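```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris data and hold out 20% for testing (same split as Q6)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fully grown tree (no depth limit)
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
full_pred = full_tree.predict(X_test)

# Tree limited to depth 3
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_tree.fit(X_train, y_train)
pruned_pred = pruned_tree.predict(X_test)

print("Full Tree Accuracy:", accuracy_score(y_test, full_pred))
print("Depth=3 Accuracy:", accuracy_score(y_test, pruned_pred))
```

Output:

```
Full Tree Accuracy: 1.0
Depth=3 Accuracy: 1.0
```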


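## **Q8. Write a Python program to:**

* Load the Boston Housing Dataset (the code below uses the California Housing dataset instead, because `load_boston` was removed from scikit-learn in version 1.2)
* Train a Decision Tree Regressor
* Print the Mean Squared Error (MSE) and feature importances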

```python
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load data
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Predictions
y_pred = reg.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("Feature Importances:", reg.feature_importances_)
```

Output:

```
MSE: 0.495235205629094
Feature Importances: [0.52850909 0.05188354 0.05297497 0.02866046 0.03051568 0.13083768
 0.09371656 0.08290203]
```


## **Q9. Write a Python program to:**

* Load the Iris Dataset
* Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
* Print the best parameters and the resulting model accuracy

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Load Iris data specifically for this question
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Train-test split for Iris data
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)

# Candidate values for the two hyperparameters being tuned
params = {
    'max_depth': [2, 3, 4, None],
    'min_samples_split': [2, 3, 4, 5]
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid=params,
                    cv=5,
                    scoring='accuracy')

grid.fit(X_train_iris, y_train_iris)

# best_score_ is the mean cross-validated accuracy on the training folds
print("Best Parameters:", grid.best_params_)
print("Best CV Accuracy:", grid.best_score_)
```

Output:

```
Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Best CV Accuracy: 0.9416666666666668
```
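
The score reported by `best_score_` is a cross-validated average over the training folds. To report accuracy on the held-out test set as well, one option is the sketch below, which repeats the same search in a self-contained form and then calls `GridSearchCV.score`. It assumes the default `refit=True`, so the best configuration has been refit on the whole training set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 3, 4, None], "min_samples_split": [2, 3, 4, 5]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

# score() evaluates the refit best estimator with the configured scorer.
test_accuracy = grid.score(X_test, y_test)
print("Best Parameters:", grid.best_params_)
print("Held-out test accuracy:", test_accuracy)
```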




## **Q10. Healthcare Disease Prediction – End-to-End Workflow**

### **Step 1: Handle Missing Values**

* Numerical → impute using mean/median
* Categorical → impute using most frequent value
* Optional: advanced imputation (KNN Imputer)

### **Step 2: Encode Categorical Features**

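* Label Encoding for binary categories (in scikit-learn, `OrdinalEncoder` is the feature-side equivalent; `LabelEncoder` is intended for the target column)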
* One-Hot Encoding for multi-category features
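
Steps 1 and 2 can be combined in a single preprocessing object. The following is a minimal sketch assuming pandas and scikit-learn; the patient table, its column names, and its values are entirely hypothetical. For brevity it one-hot encodes every categorical column, including the binary one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical patient records; np.nan marks missing entries.
patients = pd.DataFrame({
    "age": [54, np.nan, 61, 47],
    "cholesterol": [230.0, 198.0, np.nan, 260.0],
    "smoker": ["yes", "no", np.nan, "no"],
    "chest_pain": ["typical", "atypical", "none", np.nan],
})

numeric_cols = ["age", "cholesterol"]
categorical_cols = ["smoker", "chest_pain"]

preprocess = ColumnTransformer(
    [
        # Numerical columns: fill gaps with the column median.
        ("num", SimpleImputer(strategy="median"), numeric_cols),
        # Categorical columns: fill gaps with the mode, then one-hot encode.
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(patients)
print(features.shape)
```

Wrapping the imputers and encoder in a `ColumnTransformer` means the same fitted statistics are reused on new patients, which avoids leaking test-set information into the imputation.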

### **Step 3: Train Decision Tree Model**

* Define X and y
* Split dataset
* Fit DecisionTreeClassifier
* Evaluate using accuracy, F1-score, confusion matrix

### **Step 4: Hyperparameter Tuning**

Parameters to tune:

* max_depth
* min_samples_split
* min_samples_leaf
* criterion (gini/entropy)

Use GridSearchCV or RandomizedSearchCV.

### **Step 5: Model Evaluation**

* Classification accuracy
* ROC-AUC
* Precision-Recall
* Confusion matrix
* Feature importance visualisation
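
Steps 3 to 5 can be sketched end to end on a public dataset. The example below uses scikit-learn's bundled breast cancer dataset as a stand-in for real patient data. The grid values and the choice of `scoring="f1"` are assumptions for illustration, and in this dataset class 1 denotes benign tumors.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: search over the four hyperparameters listed above.
param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring="f1"
)
search.fit(X_train, y_train)

# Step 5: evaluate the refit best model on the held-out test set.
y_pred = search.predict(X_test)
y_score = search.predict_proba(X_test)[:, 1]  # predicted probability of class 1

cm = confusion_matrix(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)

print("Best parameters:", search.best_params_)
print("Confusion matrix:\n", cm)
print("F1:", f1, "ROC-AUC:", auc)
```

In a clinical setting the confusion matrix deserves particular attention, because a missed disease case and a false alarm usually carry very different costs.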

### **Business Value**

* Helps doctors flag high-risk patients early
* Improves diagnosis speed
* Reduces manual workload
* Supports evidence-based treatment
* Saves cost by early detection
* Enhances patient care quality

---
