Assignment Code: DA-AG-012
# Decision Tree | Assignment


**Question 1:  What is a Decision Tree, and how does it work in the context of classification?**

**Ans-**
Decision Tree in Classification

Definition

 - A Decision Tree is a supervised machine learning algorithm used for classification and regression tasks.
In classification, it predicts the class label of an input by learning decision rules inferred from the data features.

How It Works (Step by Step in Classification)

Root Node Creation

 - The algorithm starts with the entire dataset at the root.

Splitting Criteria

 - At each node, it chooses the best feature to split on.

 - This is based on impurity measures like:

    - Gini Index (CART)

    - Entropy / Information Gain (ID3, C4.5)

 - Goal → create groups that are as pure as possible (mostly one class).

Recursive Partitioning

 - The dataset is split recursively into child nodes.

 - This continues until a stopping condition is met:

    - Max depth reached

    - Minimum samples per node

    - Node is pure (all samples belong to one class)

Leaf Nodes (Prediction)

 - Each leaf node corresponds to a class label.

 - For a new input, the model follows the rules from the root to a leaf → prediction.

Example

Imagine predicting if a patient has a disease (Yes/No) using features like:

 - Age

 - Blood Pressure

 - Cholesterol

The tree may look like:

In [None]:
              [Age > 50?]
               /       \
             Yes        No
             /            \
      [BP > 140?]      Disease = No
        /      \
      Yes       No
       |         |
 Disease = Yes  Disease = No
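
The example tree above can also be read as nested conditions. A minimal sketch in Python follows; the function name `predict_disease` is hypothetical, and the thresholds come from the illustrative diagram, not from a fitted model.

```python
def predict_disease(age: float, blood_pressure: float) -> str:
    """Follow the illustrative tree from the root to a leaf."""
    if age > 50:                    # root node test
        if blood_pressure > 140:    # internal node test
            return "Yes"            # leaf
        return "No"                 # leaf
    return "No"                     # leaf for patients aged 50 or younger


print(predict_disease(age=62, blood_pressure=150))  # Yes
print(predict_disease(age=62, blood_pressure=120))  # No
print(predict_disease(age=35, blood_pressure=150))  # No
```

Each call walks exactly one path through the tree, which is why prediction cost grows with tree depth and not with the number of training samples.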


**Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?**



Ans. Gini Impurity & Entropy in Decision Trees

1. Gini Impurity

  - Definition: Probability that a randomly chosen sample from the node would be misclassified if it were labeled according to the class distribution in that node.

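$$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2$$

where $p_i$ is the proportion of samples in the node that belong to class $i$, and $C$ is the number of classes.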
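
The Gini formula can be checked on small label lists. The helper below is a sketch; the name `gini_impurity` and the toy labels are assumptions for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini = 1 - sum(p_i^2) over the classes present in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini_impurity(["yes"] * 10))              # 0.0 (pure node)
print(gini_impurity(["yes"] * 5 + ["no"] * 5))  # 0.5 (evenly mixed, two classes)
```

For two classes the value ranges from 0 (pure) to 0.5 (evenly mixed).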

2. Entropy (Information Gain)

- Definition: Measures the uncertainty (disorder) in a node.

$$\text{Entropy} = -\sum_{i=1}^{C} p_i \log_2(p_i)$$

- Information Gain (IG) is used for splitting:

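$$\text{IG} = \text{Entropy}(\text{parent}) - \sum_{k} \frac{n_k}{n} \cdot \text{Entropy}(\text{child}_k)$$

where $n_k$ is the number of samples in child node $k$ and $n$ is the number of samples in the parent node.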
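
Entropy and Information Gain can be computed directly from class labels. This is a sketch under simple assumptions: the function names and the toy splits are hypothetical, and labels are plain Python lists.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 4 + ["no"] * 4
perfect_split = [["yes"] * 4, ["no"] * 4]            # each child is pure
useless_split = [["yes", "no"] * 2, ["yes", "no"] * 2]  # children mirror the parent

print(entropy(parent))                           # 1.0
print(information_gain(parent, perfect_split))   # 1.0
print(information_gain(parent, useless_split))   # 0.0
```

A split that leaves the class mix unchanged in every child gains nothing, while a split into pure children recovers the full parent entropy.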


3. Impact on Splits

 - At each node, the algorithm checks all possible splits across features.

 - It chooses the split that produces the highest reduction in impurity:

    - CART algorithm (Classification and Regression Trees) → uses Gini.

    - ID3/C4.5 algorithms → use Entropy (Information Gain).

 - Both measures usually lead to similar trees, but:

    - Gini is computationally faster (no log).

    - Entropy changes more steeply when a class probability is close to 0 or 1, so it can be more sensitive when class probabilities are skewed.
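
The split search described above can be sketched for a single numeric feature. This is a simplified illustration, not scikit-learn's implementation: the function names and toy data are hypothetical, and candidate thresholds are taken as midpoints between consecutive sorted values.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Try every midpoint; return (threshold, weighted Gini of the children)."""
    unique = sorted(set(values))
    best = (None, float("inf"))
    for low, high in zip(unique, unique[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:          # lower weighted impurity is better
            best = (threshold, score)
    return best


# Hypothetical toy data: patient ages and disease labels
ages = [25, 32, 41, 55, 63, 70]
disease = ["no", "no", "no", "yes", "yes", "yes"]
print(best_threshold(ages, disease))  # (48.0, 0.0)
```

A full tree builder would repeat this search over every feature at every node and recurse on the two resulting subsets.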

**Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**

Ans-  Pre-Pruning vs Post-Pruning

| Aspect           | Pre-Pruning                          | Post-Pruning                              |
| ---------------- | ------------------------------------ | ----------------------------------------- |
| **When applied** | During tree growth                   | After full tree is built                  |
| **Control**      | Stops growth early                   | Cuts back later                           |
| **Risk**         | Might underfit (if pruned too early) | Safer (tree has full info before pruning) |
| **Advantage**    | Faster, efficient                    | More accurate, interpretable              |

Practical Advantage (Pre-Pruning):



  - Saves training time and computational resources, especially on very large datasets.

Practical Advantage (Post-Pruning):

  - Produces a simpler, more interpretable tree without sacrificing much accuracy.
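
Both pruning styles can be tried in scikit-learn. The sketch below assumes the Iris data and an arbitrary choice of `ccp_alpha` from the middle of the pruning path; in practice the value would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fully grown reference tree (no pruning)
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning: constraints stop growth while the tree is being built
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow fully, then cut back with cost-complexity pruning
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = max(path.ccp_alphas[len(path.ccp_alphas) // 2], 0.0)  # arbitrary mid-path value
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name:12s} nodes={model.tree_.node_count:3d} "
          f"test accuracy={model.score(X_test, y_test):.3f}")
```

Comparing node counts next to test accuracy shows how much of the full tree can be removed before accuracy starts to drop.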



**Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?**


 **Ans-**
 Information Gain in Decision Trees

Definition

 - Information Gain (IG) measures how much uncertainty (entropy) is reduced after splitting a dataset on a feature.

 - In other words: it tells us how useful a feature is for classifying the data.

Why It’s Important

 - Decision Trees work by choosing the best split at each step.

 - Information Gain ensures we pick the feature that:

    - Maximizes purity of child nodes.

    - Reduces uncertainty about the target variable.

 - Without IG (or similar measures like Gini), the tree might split on irrelevant features.
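
As a worked illustration with hypothetical counts (not taken from any dataset in this assignment), suppose a parent node holds 10 patients, 5 with the disease and 5 without, so its entropy is 1 bit. Split A produces children with class counts (4, 1) and (1, 4); split B produces (3, 2) and (2, 3).

$$\text{Entropy}(4,1) = -\left(0.8 \log_2 0.8 + 0.2 \log_2 0.2\right) \approx 0.722$$

$$\text{IG}_A = 1 - \left(\tfrac{5}{10} \cdot 0.722 + \tfrac{5}{10} \cdot 0.722\right) \approx 0.278$$

$$\text{Entropy}(3,2) = -\left(0.6 \log_2 0.6 + 0.4 \log_2 0.4\right) \approx 0.971$$

$$\text{IG}_B = 1 - \left(\tfrac{5}{10} \cdot 0.971 + \tfrac{5}{10} \cdot 0.971\right) \approx 0.029$$

Split A removes almost ten times as much uncertainty as split B, so a tree that maximizes Information Gain would choose split A.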

**Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?**

**Dataset Info:**
    
    ● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or
    provided CSV).
    ● Boston Housing Dataset for regression tasks
    (sklearn.datasets.load_boston() or provided CSV).

Ans-   
**Real-World Applications of Decision Trees**

1. Common Applications

Classification

 - Healthcare → Predict whether a patient has a disease (Yes/No).

 - Finance → Fraud detection in credit card transactions.

 - Customer Analytics → Predict whether a customer will churn or not.

 - Education → Classify students as "pass/fail" based on grades.

Dataset Example: Iris (Classification)

 - Features: Petal length, petal width, sepal length, sepal width.

 - Task: Classify flowers into Setosa, Versicolor, Virginica.

 - A Decision Tree learns rules like:


       if petal_length < 2.5 → Setosa  
       else if petal_width < 1.75 → Versicolor  
       else → Virginica
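
The rules above are illustrative. The thresholds a fitted tree actually learns can be printed with scikit-learn's `export_text`; the sketch below assumes a shallow tree (`max_depth=2`) trained on the full Iris data purely to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rule set small
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(iris.data, iris.target)

# Render the learned splits as indented if/else style text
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

The printed class labels are the integer codes 0, 1, 2, which correspond to the entries of `iris.target_names`.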

Regression

 - Real Estate → Predict housing prices.

 - Economics → Predict GDP growth or demand forecasting.

 - Healthcare → Predict length of hospital stay.

 - Agriculture → Estimate crop yield.

Dataset Example: Boston Housing (Regression)

 - Features: Rooms per house, crime rate, proximity to jobs, etc.

 - Task: Predict median house value.

 - A Decision Tree regressor splits data into regions and assigns the average house price of each region as prediction.
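
The "average of each region" behaviour can be seen on a tiny example. The data below is hypothetical, and `max_depth=1` is chosen so that the tree makes exactly one split.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: rooms per house -> price (arbitrary units)
rooms = np.array([[3], [4], [5], [7], [8], [9]])
price = np.array([100.0, 110.0, 120.0, 300.0, 310.0, 320.0])

# A single split cuts the feature axis into two regions
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(rooms, price)

print(stump.predict([[4], [8]]))            # one prediction per region
print(price[:3].mean(), price[3:].mean())   # means of the two groups of houses
```

Every house that falls in the same region receives the same predicted value, which is why tree regressors produce piecewise-constant predictions.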

2. Advantages of Decision Trees

 - Interpretability – Easy to visualize (flowchart style) and explain to non-technical people.

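 - Handles Mixed Data – Works with both categorical & numerical features (scikit-learn's implementation expects categorical features to be encoded as numbers first).

 - No Feature Scaling Needed – Splits depend only on the ordering of feature values, unlike distance- or gradient-based models such as SVM or regularized Logistic Regression.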

 - Captures Nonlinear Relationships – Flexible decision boundaries.

 - Versatile – Can be used for both classification (Iris) and regression (Boston Housing).

3. Limitations of Decision Trees

 - Overfitting – Trees can grow too deep and memorize training data (needs pruning or ensembles).

 - Instability – Small changes in data can lead to very different trees.

 - Bias Toward Features with Many Categories – Categorical variables with many levels can dominate.

 - Lower Predictive Accuracy – A single tree is weaker compared to ensemble methods like Random Forests or Gradient Boosted Trees.

**Question 6: Write a Python program to:**

    ● Load the Iris Dataset
    ● Train a Decision Tree Classifier using the Gini criterion
    ● Print the model’s accuracy and feature importances
    (Include your Python code and output in the code box below.)

In [2]:
#Ans-
# Decision Tree Classifier on Iris Dataset

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Train a Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# 4. Make predictions
y_pred = clf.predict(X_test)

# 5. Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)

# 6. Get feature importances
feature_importances = pd.Series(clf.feature_importances_, index=feature_names)

# Print results
print("Decision Tree Classifier (Gini Criterion)")
print("Accuracy on test set: {:.2f}%".format(accuracy * 100))
print("\nFeature Importances:")
print(feature_importances.sort_values(ascending=False))


Decision Tree Classifier (Gini Criterion)
Accuracy on test set: 93.33%

Feature Importances:
petal length (cm)    0.558568
petal width (cm)     0.406015
sepal width (cm)     0.029167
sepal length (cm)    0.006250
dtype: float64


**Question 7:  Write a Python program to:**

    ● Load the Iris Dataset
    ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
    a fully-grown tree.
    (Include your Python code and output in the code box below.)

In [3]:
#Ans-
# Compare pruned vs fully-grown Decision Tree on Iris Dataset

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Train a pruned Decision Tree (max_depth = 3)
clf_pruned = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf_pruned.fit(X_train, y_train)
y_pred_pruned = clf_pruned.predict(X_test)
accuracy_pruned = accuracy_score(y_test, y_pred_pruned)

# 4. Train a fully-grown Decision Tree
clf_full = DecisionTreeClassifier(criterion="gini", random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# 5. Print results
print("Decision Tree Classifier Comparison on Iris Dataset")
print("Pruned Tree (max_depth=3) Accuracy: {:.2f}%".format(accuracy_pruned * 100))
print("Fully-grown Tree Accuracy: {:.2f}%".format(accuracy_full * 100))


Decision Tree Classifier Comparison on Iris Dataset
Pruned Tree (max_depth=3) Accuracy: 96.67%
Fully-grown Tree Accuracy: 93.33%


**Question 8: Write a Python program to:**

    ● Load the California Housing dataset from sklearn
    ● Train a Decision Tree Regressor
    ● Print the Mean Squared Error (MSE) and feature importances
    (Include your Python code and output in the code box below.)

In [4]:
# Ans-
# Decision Tree Regressor on California Housing Dataset

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import pandas as pd

# 1. Load the California Housing dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target
feature_names = housing.feature_names

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# 4. Make predictions
y_pred = regressor.predict(X_test)

# 5. Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

# 6. Get feature importances
feature_importances = pd.Series(regressor.feature_importances_, index=feature_names)

# Print results
print("Decision Tree Regressor on California Housing Dataset")
print("Mean Squared Error (MSE): {:.4f}".format(mse))
print("\nFeature Importances:")
print(feature_importances.sort_values(ascending=False))


Decision Tree Regressor on California Housing Dataset
Mean Squared Error (MSE): 0.4952

Feature Importances:
MedInc        0.528509
AveOccup      0.130838
Latitude      0.093717
Longitude     0.082902
AveRooms      0.052975
HouseAge      0.051884
Population    0.030516
AveBedrms     0.028660
dtype: float64


**Question 9: Write a Python program to:**

    ● Load the Iris Dataset
    ● Tune the Decision Tree’s max_depth and min_samples_split using
    GridSearchCV
    ● Print the best parameters and the resulting model accuracy
    (Include your Python code and output in the code box below.)

In [5]:
#Ans-
# Hyperparameter tuning with GridSearchCV on Iris Dataset

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Define parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5, 10]
}

# 4. Create GridSearchCV with Decision Tree Classifier
grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    cv=5, scoring='accuracy'
)

# 5. Fit the grid search to training data
grid_search.fit(X_train, y_train)

# 6. Get the best parameters
best_params = grid_search.best_params_

# 7. Evaluate best model on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Decision Tree Classifier with GridSearchCV (Iris Dataset)")
print("Best Parameters:", best_params)
print("Accuracy on test set: {:.2f}%".format(accuracy * 100))


Decision Tree Classifier with GridSearchCV (Iris Dataset)
Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Accuracy on test set: 93.33%


**Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.**

Explain the step-by-step process you would follow to:
    
    ● Handle the missing values
    ● Encode the categorical features
    ● Train a Decision Tree model
    ● Tune its hyperparameters
    ● Evaluate its performance
    And describe what business value this model could provide in the real-world
    setting.

**Ans**-  Step-by-Step Process

1. Handle the Missing Values

Explore Missingness

 - Check % of missing values per column.

 - Identify whether missingness is MCAR (Missing Completely at Random), MAR (Missing at Random), or MNAR (Missing Not at Random).

Strategies:

 - Numerical features → Use mean/median imputation (median is robust to outliers).

 - Categorical features → Use mode imputation or add a new category like "Unknown".

 - Advanced methods → Use KNN imputation or multivariate imputation (MICE) when simple imputation would distort relationships between features; both cost more to compute as the dataset grows.
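
The simple strategies above can be sketched with scikit-learn's `SimpleImputer`. The column names and values are hypothetical stand-ins for patient data, and the imputers are fitted on the training rows only so that no statistics leak from the test set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical patient records with gaps (column names are illustrative)
train = pd.DataFrame({
    "age": [34, 51, np.nan, 67, 45],
    "cholesterol": [180.0, np.nan, 240.0, 210.0, 1000.0],  # 1000 is an outlier
    "smoker": pd.Series(["no", "yes", np.nan, "no", "no"], dtype=object),
})
num_cols, cat_cols = ["age", "cholesterol"], ["smoker"]

print(train.isna().mean())  # share of missing values per column

# Median for numeric columns, an explicit "Unknown" category for categorical ones
num_imputer = SimpleImputer(strategy="median").fit(train[num_cols])
cat_imputer = SimpleImputer(strategy="constant", fill_value="Unknown").fit(train[cat_cols])

imputed = train.copy()
imputed[num_cols] = num_imputer.transform(train[num_cols])
imputed[cat_cols] = cat_imputer.transform(train[cat_cols])
print(imputed)
```

The same fitted imputers would then be applied to validation and test data with `transform`, never refitted on them.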

2. Encode the Categorical Features

 - Ordinal categorical (e.g., disease stage: low, medium, high) → Use Ordinal Encoding.

 - Nominal categorical (e.g., gender, blood group) → Use One-Hot Encoding.

 - High-cardinality features (like ZIP codes, hospital IDs) → Use Target Encoding or Frequency Encoding; fit a target encoder on the training data only so the label does not leak into the features.

3. Train a Decision Tree Model

  - Data Split → Train/Test (e.g., 80/20) or use Stratified K-Fold Cross Validation (because class imbalance is common in healthcare).

  - Scaling is not required for Decision Trees.

  - Train initial Decision Tree using scikit-learn:

In [None]:
from sklearn.tree import DecisionTreeClassifier

# X_train, y_train: the imputed and encoded training split from the steps above
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
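
The stratified cross-validation mentioned in this step can be sketched as follows. The data is a synthetic stand-in generated with `make_classification` (about 10% positives), not real patient data, and recall is used as the score on the assumption that missed cases are the costly error.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an imbalanced patient dataset
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# Each fold keeps roughly the same class ratio as the full dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = DecisionTreeClassifier(random_state=42)

# Recall on the positive class, evaluated once per fold
scores = cross_val_score(model, X, y, cv=cv, scoring="recall")
print("Recall per fold:", scores.round(3))
print("Mean recall:", scores.mean().round(3))
```

The spread of the per-fold scores gives a rough sense of how stable the model is across different subsets of the data.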


4. Tune Hyperparameters

 - Use GridSearchCV or RandomizedSearchCV to optimize:

    - max_depth: to control tree growth.

    - min_samples_split, min_samples_leaf: to avoid overfitting.

    - criterion: "gini" vs "entropy".

    - max_features: number of features to consider for best split.

Example:

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, cv=5, scoring='roc_auc')
grid.fit(X_train, y_train)

best_model = grid.best_estimator_


5. Evaluate Model Performance

 - Healthcare is high-risk (false negatives can be critical), so use metrics beyond accuracy:

    - Confusion Matrix

    - Precision, Recall, F1-score

    - ROC-AUC Score

    - PR Curve (important when classes are imbalanced).

Example:

In [None]:
from sklearn.metrics import classification_report, roc_auc_score
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, best_model.predict_proba(X_test)[:,1]))
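
The confusion matrix and a precision-recall summary from the list above can be computed in the same way. This sketch again uses synthetic data from `make_classification` as a stand-in, and `max_depth=5` is an arbitrary choice, not a tuned value.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for patient data (about 10% positives)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print("Recall (sensitivity):", round(recall_score(y_test, y_pred), 3))
print("Average precision (PR curve summary):",
      round(average_precision_score(y_test, y_score), 3))
```

The FN count is the number of patients with the disease that the model missed, which is usually the figure a clinical team will ask about first.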
