Q1] What is a Decision Tree, and how does it work in the context of
classification?
Ans] A Decision Tree is a supervised machine learning algorithm used for both classification and regression, but it is most commonly used for classification tasks. It works by splitting the dataset into branches based on feature values, eventually leading to a decision (a class label).  

Start at the Root Node  
The algorithm considers the entire dataset and tries to find the feature that best separates the classes (e.g., "Age", "Income", etc.).  
Select the Best Feature to Split  
Uses a measure such as:  
Gini Impurity  
Entropy / Information Gain  
Gain Ratio  
These measures help determine how well a feature divides the data into pure groups (groups where most samples belong to the same class).  

Split the Data Based on the Feature  
If the condition is:  
Age > 30?  
The dataset splits into two groups: Age > 30 and Age ≤ 30.  
Repeat the Process  
Each subgroup is then split again based on another feature, forming new branches.  

Stop Splitting When:  
All values belong to one class (pure node), or  
Maximum tree depth is reached, or  
Not enough samples to split further.  
Output the Class Label at the Leaf Node  
Each leaf stores the most common class among the training samples that reached it, and a new sample whose traversal ends at that leaf is assigned this class.  
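
The traversal described above can be sketched with a tiny hand-written tree. The nested-dict layout, the `predict` helper, and the thresholds and labels are all illustrative assumptions; this is not a learned model and not how scikit-learn stores its trees.

```python
# A small hand-written tree (hypothetical thresholds and labels, for illustration only).
# Internal nodes hold a feature and a threshold; leaves hold the majority class.
tree = {
    "feature": "age", "threshold": 30,
    "left": {"leaf": "No"},                      # age <= 30
    "right": {                                   # age > 30
        "feature": "income", "threshold": 50000,
        "left": {"leaf": "No"},                  # income <= 50000
        "right": {"leaf": "Yes"},                # income > 50000
    },
}


def predict(node, sample):
    """Walk from the root to a leaf and return the class stored there."""
    while "leaf" not in node:
        if sample[node["feature"]] > node["threshold"]:
            node = node["right"]
        else:
            node = node["left"]
    return node["leaf"]


print(predict(tree, {"age": 45, "income": 80000}))  # Yes
print(predict(tree, {"age": 25, "income": 90000}))  # No
```

Each prediction costs one comparison per level, which is why a trained tree is fast at inference time.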

Q2] Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?  
Ans] In a decision tree, before splitting the data at each node, the algorithm must decide which feature and threshold will best separate the classes.  
To do this, it measures how “impure” or “mixed” a node is.  
A node is pure if it contains only one class.  
A node is impure if it contains a mixture of classes.  
Two common impurity measures are:  
Gini Impurity  
Entropy (the basis of Information Gain)  

Gini Impurity:  
Gini measures the probability of incorrect classification if we randomly assign a label based on class proportions.  
Lower Gini → purer node.  

Entropy:  
Entropy measures uncertainty or disorder in the data.  
Entropy is 0 when the node is pure.  
Entropy is maximum when classes are equally mixed.  
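
As a small worked example, both measures can be computed directly from the class counts in a node, using the standard definitions Gini = 1 − Σ pᵢ² and Entropy = −Σ pᵢ log₂ pᵢ, where pᵢ is the proportion of class i. The helper names `gini` and `entropy` are my own, chosen for illustration.

```python
from math import log2


def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy (in bits) of a node given its per-class sample counts."""
    total = sum(counts)
    # Classes with zero samples contribute nothing, so they are skipped.
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


print(gini([10, 0]), entropy([10, 0]))  # pure node: both measures are zero
print(gini([5, 5]), entropy([5, 5]))    # evenly mixed two-class node: 0.5 and 1.0
print(round(gini([8, 2]), 3), round(entropy([8, 2]), 3))
```

For two classes, Gini peaks at 0.5 and entropy at 1.0, both at a 50/50 mix.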

How They Impact the Splits (Points)  

Used to evaluate split quality:  
At each decision point, the tree calculates either Gini Impurity or Entropy to determine how pure or mixed a node is.  

Lower impurity = better split:  
The algorithm always prefers splits that result in child nodes with lower impurity (purer nodes).  

Gini focuses on misclassification reduction:  
Gini Impurity tries to reduce the chance of incorrectly classifying a randomly chosen sample.  

Entropy focuses on information gain:  
Entropy selects splits that maximize information gain, meaning it aims to reduce uncertainty in the data.  

Both try to separate classes effectively:  
The chosen split should group similar class samples together, increasing purity in nodes.  

Entropy tends to produce more balanced splits:  
Its logarithm makes it somewhat more sensitive to small class proportions, which sometimes leads to more evenly distributed branches.  

Gini tends to isolate the dominant class faster:  
It tends to place the most frequent class in its own branch and is slightly cheaper to compute because it needs no logarithm. In practice, the two criteria usually produce very similar trees.  

The algorithm tests multiple split points:  
For each feature, the tree tries possible thresholds and calculates impurity values.  

The best split minimizes weighted impurity:  
The final chosen split is the one with the lowest weighted impurity across child nodes.  

Impurity is re-evaluated at every level:  
The same search is repeated recursively on each child node until a stopping condition is met (such as max depth or a pure node).  
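
The threshold search for a single numeric feature can be sketched as follows. This is a simplified illustration that uses midpoints between consecutive distinct values as candidate thresholds. The `best_threshold` helper and the toy data are assumptions made for the example, and real libraries use faster implementations.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split."""
    distinct = sorted(set(values))
    best = (None, float("inf"))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Weight each child's impurity by the share of samples it receives.
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (t, weighted)
    return best


# Toy data (made up for illustration).
ages = [22, 25, 28, 35, 40, 52]
bought = ["No", "No", "No", "Yes", "Yes", "Yes"]
print(best_threshold(ages, bought))  # (31.5, 0.0)
```

A full tree builder would run this search over every feature, keep the overall best split, and then recurse on the two child subsets.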

Q3] What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.  
Ans] Pre-Pruning (Early Stopping)  
Pruning is done during the tree-building process.  
The tree stops splitting a node once a preset limit is reached or a further split would not reduce impurity enough.  
Controlled using parameters like max_depth, min_samples_split, min_samples_leaf, etc.  
Helps avoid overfitting early by limiting tree growth.  
Faster and more efficient because the tree is smaller from the start.  
Risk: It may stop too early, causing underfitting.  
Practical Advantage:  
✅ Saves time and computational resources by preventing unnecessary growth.  

Post-Pruning (e.g., Cost Complexity Pruning)  
Pruning is done after the tree has been fully grown.  
The tree is first allowed to overfit, then branches that contribute little are removed.  
Scores each candidate subtree (e.g., by cross-validated accuracy) and prunes splits that do not help.  
Helps reduce overfitting and improves generalization.  
Often more accurate than pre-pruning, but takes longer to compute.  
Needs held-out validation data or cross-validation folds to decide which branches to remove.  

Practical Advantage:  
✅ Often produces a simpler model that generalizes better to unseen data.  
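
Both styles can be tried in scikit-learn, as in the sketch below. It assumes scikit-learn is installed and reuses the Iris data. The specific limits (`max_depth=2`, `min_samples_leaf=5`) are arbitrary values picked for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: limits are set up front, so growth stops early.
pre_pruned = DecisionTreeClassifier(max_depth=2, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then compute the candidate alphas.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit once per alpha; a larger ccp_alpha removes more branches.
node_counts = []
for alpha in path.ccp_alphas:
    # Guard against tiny negative values caused by floating-point error.
    pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=max(float(alpha), 0.0))
    pruned.fit(X_train, y_train)
    node_counts.append(pruned.tree_.node_count)

print("Pre-pruned depth:", pre_pruned.get_depth())
print("Nodes per alpha:", node_counts)
```

In practice the alpha would be chosen by cross-validation on the training data rather than by inspecting node counts.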

Q4] What is Information Gain in Decision Trees, and why is it important for
choosing the best split?  
Ans] Information Gain is a measure used in Decision Trees to determine which feature provides the best split at each node.  
It tells us how much uncertainty (impurity) is reduced after splitting the data using a particular feature.  
It is calculated using Entropy.  
Information Gain = Entropy(Parent Node) − Weighted Entropy(Child Nodes)  
So, higher Information Gain means the split makes the classes more separated (purer).   
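
A short worked example of this formula, where each child's entropy is weighted by the fraction of the parent's samples it receives. The helper names and the class counts are made up for illustration.

```python
from math import log2


def entropy(counts):
    """Entropy (in bits) of a node given its per-class sample counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Parent node: 5 positive and 5 negative samples.
good_split = information_gain([5, 5], [[4, 1], [1, 4]])     # mostly separates the classes
perfect_split = information_gain([5, 5], [[5, 0], [0, 5]])  # both children are pure
useless_split = information_gain([5, 5], [[2, 2], [3, 3]])  # children as mixed as the parent

print(round(good_split, 3), perfect_split, round(useless_split, 3))
```

The perfect split recovers the full 1 bit of parent entropy, the useless split gains nothing, and the partial split falls in between.
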
Why Is Information Gain Important for Choosing the Best Split?  

Helps Identify the Most Useful Feature:  
It selects the feature that best separates the data based on class labels.  

Reduces Uncertainty:  
A split with high Information Gain reduces disorder (entropy) and creates purer child nodes.  

Improves Model Accuracy:  
By choosing splits that maximize Information Gain, the tree learns better decision boundaries.  

Prevents Random or Unhelpful Splits:  
Without Information Gain, the tree might split on features that do not help classification.  

Builds the Tree Efficiently:  
Choosing the most informative split at each node tends to reach pure leaves in fewer levels, which keeps the tree smaller.  

Q5] What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?  
Ans] Real-World Applications of Decision Trees  

Medical Diagnosis  
Used to predict diseases based on symptoms, test results, age, etc.  
Example: Classifying patients as “High Risk” or “Low Risk.”  

Credit Risk Assessment (Banking & Finance)  
Used to decide whether to approve or deny loan applications.  
Example: Predicting if a customer will default based on income, credit score, etc.  

Customer Churn Prediction (Marketing)  
Helps identify customers likely to leave a service.  
Used for targeted retention campaigns.  

Fraud Detection  
Detects unusual transaction patterns that may indicate fraud.  

Recommendation Systems  
Identifies customer preferences for product/service recommendations.  

Manufacturing Quality Control  
Classifies products as “Defective” or “Non-defective” based on features.  

Human Resource Decision Making  
Used in employee performance evaluation and promotion eligibility.   

Advantages of Decision Trees  

Easy to Understand and Interpret  
Decision trees resemble human thinking and can be visualized.  

No Need for Feature Scaling  
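Splits depend only on the ordering of feature values, so numerical features need no normalization (some libraries, such as scikit-learn, still require categorical features to be encoded as numbers).  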

Handles Nonlinear Relationships  
Can model complex decision boundaries.  

Can Handle Missing Values  
Some implementations allow splitting based on available data only.  

Fast Prediction  
Once built, the tree makes decisions quickly.  

Limitations of Decision Trees  

High Risk of Overfitting  
Trees may grow too large and fit training data too closely.  

Unstable  
Small changes in data can lead to a completely different tree.  

Biased Toward Features with Many Levels  
Categorical features with many categories can dominate splits.  

Not Always the Most Accurate Alone  
Often improved using Ensembles like Random Forest or Gradient Boosting.  

Q6] Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

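# 1. Load the Iris dataset
iris = load_iris()
X = iris.data          # Features
y = iris.target        # Labels

# 2. Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train a Decision Tree using the Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# 4. Predict on the test set and print the accuracy
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# 5. Print the importance of each feature
print("\nFeature Importances:")
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.4f}")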


Model Accuracy: 1.0

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


Q7] Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.


In [2]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data          # Features
y = iris.target        # Labels

# 2. Split into training and test sets (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train a fully-grown Decision Tree
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
full_pred = full_tree.predict(X_test)
full_accuracy = accuracy_score(y_test, full_pred)

# 4. Train a Decision Tree with max_depth = 3
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_tree.fit(X_train, y_train)
pruned_pred = pruned_tree.predict(X_test)
pruned_accuracy = accuracy_score(y_test, pruned_pred)

# 5. Print the accuracies
print("Accuracy of Fully-Grown Tree: ", full_accuracy)
print("Accuracy of Tree with max_depth=3: ", pruned_accuracy)

# Optional: Show which model performed better
if pruned_accuracy > full_accuracy:
    print("\nPruned tree generalizes better (less overfitting).")
elif pruned_accuracy < full_accuracy:
    print("\nFully-grown tree performed better, but may overfit.")
else:
    print("\nBoth models performed equally.")


Accuracy of Fully-Grown Tree:  1.0
Accuracy of Tree with max_depth=3:  1.0

Both models performed equally.


Q8] Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances

In [3]:
# Import necessary libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# 1. Load the California Housing dataset
#    (used in place of Boston Housing, which was removed from scikit-learn in version 1.2)
housing = fetch_california_housing()
X = housing.data               # Features
y = housing.target             # Target (median house value, in units of $100,000)

# 2. Split into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train the Decision Tree Regressor
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# 4. Predict on the test set
y_pred = model.predict(X_test)

# 5. Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# 6. Print Feature Importances
print("\nFeature Importances:")
for feature, importance in zip(housing.feature_names, model.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Mean Squared Error (MSE): 0.495235205629094

Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


Q9] Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy

In [5]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split into training and testing data (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Define a Decision Tree Classifier
model = DecisionTreeClassifier(random_state=42)

# 4. Define parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5, 10]
}

# 5. Perform Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy'
)

grid_search.fit(X_train, y_train)

# 6. Print the best parameters
print("Best Parameters:", grid_search.best_params_)

# 7. Evaluate the best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Model Accuracy with Best Parameters:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy with Best Parameters: 1.0


Q10] Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.


A. Handle missing values
Audit missingness: compute % missing per feature and check patterns (MCAR / MAR / MNAR).
Keep missingness flags: for important features, add a binary indicator column (e.g., feature_missing = 1 where the value is null), since missingness itself can be predictive.
Simple imputation baseline:
Numeric: median (robust to outliers) or mean if symmetric.
Categorical: most frequent or "MISSING" label.
Advanced imputation (if needed):
IterativeImputer (model-based) or KNN imputer when relationships exist.
Avoid imputation leakage: fit the imputer on the training fold only, then apply it to the validation and test data.
Document assumptions: why you imputed and what values mean clinically.
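
A minimal sketch of median imputation with missingness flags, fitted on the training rows only. It assumes scikit-learn and NumPy are installed, and the tiny arrays are made-up values rather than patient data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric data with gaps (two features).
X_train = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, 30.0],
    [5.0, np.nan],
])
X_test = np.array([[np.nan, 40.0]])

# add_indicator=True appends one 0/1 column per feature that had missing
# values during fit, so the model can also learn from the missingness itself.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_train_imp = imputer.fit_transform(X_train)  # medians learned from training rows only
X_test_imp = imputer.transform(X_test)        # test rows reuse the training medians

print(X_train_imp.shape)  # (4, 4): 2 imputed features + 2 indicator columns
print(X_test_imp)         # [[ 3. 40.  1.  0.]]
```

The missing test value is filled with 3.0, the median of the training column, so nothing about the test set leaks into the imputer.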

B. Encode categorical features
Categorize by cardinality:
Low cardinality (<= ~10 unique): One-Hot Encoding.
High cardinality: Target encoding / leave-one-out / frequency encoding to avoid huge sparse matrices. Decision Trees tolerate ordinal numeric encoding — but beware of introducing spurious order.
Fit encoders only on training data (use ColumnTransformer / Pipeline to avoid leakage).
Group rare categories: map rare levels to __OTHER__.
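
Rare-level grouping can be sketched in plain Python as below. The helper names, the `min_count` cutoff, and the example category values are hypothetical. The set of kept levels is learned from training data only and then applied unchanged to new data.

```python
from collections import Counter


def fit_rare_levels(train_values, min_count=2):
    """Learn which category levels are frequent enough to keep."""
    counts = Counter(train_values)
    return {level for level, n in counts.items() if n >= min_count}


def apply_rare_levels(values, kept_levels, other="__OTHER__"):
    """Replace any level not seen often enough in training with a shared bucket."""
    return [v if v in kept_levels else other for v in values]


# Hypothetical category values for a single feature.
train = ["A", "A", "B", "B", "B", "C", "D"]
test = ["A", "C", "E"]

kept = fit_rare_levels(train, min_count=2)
print(sorted(kept))                   # ['A', 'B']
print(apply_rare_levels(test, kept))  # ['A', '__OTHER__', '__OTHER__']
```

Levels never seen in training (here "E") fall into the same bucket, so the encoder that runs afterwards never meets an unknown category.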

C. Train the Decision Tree model
Start with a pipeline: preprocessing (imputer, encoder, scaler if needed for other models) → DecisionTreeClassifier.
Class imbalance: if positive cases are rare, use class_weight='balanced' or resampling (SMOTE) inside cross-validation.
Baseline: train default tree, record metrics.

D. Tune hyperparameters
Use GridSearchCV or RandomizedSearchCV with StratifiedKFold.
Tune: max_depth, min_samples_split, min_samples_leaf, criterion (gini/entropy), max_features, class_weight.
Consider cost-sensitive tuning: use custom scoring that weights false negatives higher if missing disease is costly.
Use nested CV if you need an unbiased estimate of generalization performance.
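
One way to express cost-sensitive tuning is an F-beta scorer with beta > 1, which weights recall above precision. The sketch below assumes scikit-learn is installed and uses synthetic, imbalanced stand-in data. The choice of beta=2 and the small grid are illustrative assumptions, not clinical recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly 10% positives (not real patient data).
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=42
)

# F2 counts recall more heavily than precision, so missed positives cost more.
f2_scorer = make_scorer(fbeta_score, beta=2)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 4, None], "class_weight": [None, "balanced"]},
    scoring=f2_scorer,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X, y)

print("Best params:", search.best_params_)
print("Best cross-validated F2:", round(search.best_score_, 3))
```

The right beta, or a fully custom cost function, depends on how costly a missed diagnosis is relative to a false alarm, which is a question for clinical stakeholders.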

E. Evaluate performance
Primary metrics (healthcare context):
Recall / Sensitivity (catch sick patients)
Precision (avoid too many false alarms)
ROC-AUC and PR-AUC (PR especially if class imbalance)
Confusion matrix at operational threshold; consider threshold tuning using ROC/PR or business cost function.
Calibration: check predicted probabilities (reliability diagrams, calibration curve); calibrate with CalibratedClassifierCV if needed.
Explainability: feature importances, SHAP values, decision paths for sample patients.
Robustness checks: performance across subgroups (age, gender, hospital), temporal validation (train on earlier period, test on later).
Statistical significance / CI for metrics via bootstrapping.
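
A percentile bootstrap is one simple way to attach a confidence interval to a metric such as recall. The sketch below uses only the standard library. The labels and predictions are made up, and `bootstrap_ci` is a hypothetical helper rather than a library function.

```python
import random


def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else float("nan")


def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap interval for a metric on paired labels/predictions."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        # Resample test cases with replacement, keeping label/prediction pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        score = metric([y_true[i] for i in idx], [y_pred[i] for i in idx])
        if score == score:  # skip NaN from resamples that contain no positives
            scores.append(score)
    scores.sort()
    lower = scores[int((alpha / 2) * len(scores))]
    upper = scores[int((1 - alpha / 2) * len(scores)) - 1]
    return lower, upper


# Made-up test results: 20 positives (16 caught), 80 negatives (5 false alarms).
y_true = [1] * 20 + [0] * 80
y_pred = [1] * 16 + [0] * 4 + [1] * 5 + [0] * 75

low, high = bootstrap_ci(y_true, y_pred, recall)
print("Recall:", recall(y_true, y_pred))
print("95% bootstrap interval:", (round(low, 3), round(high, 3)))
```

With only 20 positive cases the interval is wide, a reminder that a single point estimate of recall can look more certain than it is.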

In [8]:
# Requires: scikit-learn >= 1.2 (for OneHotEncoder's sparse_output), category_encoders (optional)
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             roc_auc_score, confusion_matrix, classification_report)

# --- Assume `df` is your DataFrame, 'target' is 0/1 where 1 = disease present
# df = pd.read_csv('patient_data.csv')

# Example placeholders (replace with your data)
# df = ...
# target_col = 'disease'
# X = df.drop(columns=[target_col])
# y = df[target_col]

# For demonstration, using iris as surrogate (replace with your data)
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
df = iris.frame
# Make it a binary problem by mapping classes 0 vs (1 or 2)
df['target'] = (df['target'] != 0).astype(int)
target_col = 'target'
X = df.drop(columns=[target_col])
y = df[target_col]

# Identify column types (example logic)
numeric_features = X.select_dtypes(include=['number']).columns.tolist()
categorical_features = X.select_dtypes(include=['object','category']).columns.tolist()

# Preprocessing pipelines
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    # Decision Trees don't need scaling, so we omit scaler
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline
pipe = Pipeline(steps=[
    ('pre', preprocessor),
    ('clf', DecisionTreeClassifier(random_state=42))
])

# Parameter grid for tuning
param_grid = {
    'clf__criterion': ['gini', 'entropy'],
    'clf__max_depth': [3, 5, 7, None],
    'clf__min_samples_split': [2, 5, 10],
    'clf__min_samples_leaf': [1, 2, 4],
    'clf__class_weight': [None, 'balanced']
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Hold out a test set BEFORE tuning so the search never sees the test rows
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    test_size=0.2, random_state=42)

grid = GridSearchCV(pipe, param_grid, cv=cv, scoring='recall', n_jobs=-1, verbose=1)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
# With refit=True (the default), best_estimator_ is already refit on all of X_train
best_model = grid.best_estimator_

# Evaluate on the held-out test set
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Recall (Sensitivity):", recall_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
if y_proba is not None:
    print("ROC AUC:", roc_auc_score(y_test, y_proba))
print("\nConfusion matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification report:\n", classification_report(y_test, y_pred))

# Feature importances (map back to original feature names)
# After preprocessing, need to reconstruct column names for OneHot encoder
ohe_cols = []
if categorical_features:
    ohe = best_model.named_steps['pre'].named_transformers_['cat'].named_steps['onehot']
    ohe_cols = list(ohe.get_feature_names_out(categorical_features))
feature_names = numeric_features + ohe_cols
importances = best_model.named_steps['clf'].feature_importances_
feat_imp = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print("\nTop feature importances:\n", feat_imp.head(10))


Fitting 5 folds for each of 144 candidates, totalling 720 fits
Best params: {'clf__class_weight': None, 'clf__criterion': 'gini', 'clf__max_depth': 3, 'clf__min_samples_leaf': 1, 'clf__min_samples_split': 2}
Accuracy: 1.0
Recall (Sensitivity): 1.0
Precision: 1.0
ROC AUC: 1.0

Confusion matrix:
 [[10  0]
 [ 0 20]]

Classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00        20

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30


Top feature importances:
 petal length (cm)    1.0
sepal length (cm)    0.0
sepal width (cm)     0.0
petal width (cm)     0.0
dtype: float64


3) Evaluation checklist (operational concerns)  
Choose metric aligned with business cost (e.g., missing a diseased patient might cost more than false alarm).  
Threshold selection: pick a probability threshold based on cost matrix, not necessarily 0.5.  
Subgroup fairness: measure performance across demographics; mitigate biased performance.  
Calibration: if probabilities are used for triage, ensure good calibration.  
Temporal/External validation: test on later periods and different hospitals.  
Explainability: provide per-prediction explanations (SHAP, simplest decision path).  
Monitoring: set up drift detection (data distribution, performance drop).  
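
Threshold selection from a cost matrix can be sketched as follows. The costs (a missed case counted as 10 times worse than a false alarm), the candidate thresholds, and the probabilities are made-up values for illustration.

```python
def total_cost(y_true, y_proba, threshold, cost_fn=10.0, cost_fp=1.0):
    """Total cost of errors when predicting positive for proba >= threshold."""
    fn = sum(1 for t, p in zip(y_true, y_proba) if t == 1 and p < threshold)
    fp = sum(1 for t, p in zip(y_true, y_proba) if t == 0 and p >= threshold)
    return cost_fn * fn + cost_fp * fp


def pick_threshold(y_true, y_proba, candidates, **costs):
    """Return the candidate threshold with the lowest total cost (first one on ties)."""
    return min(candidates, key=lambda t: total_cost(y_true, y_proba, t, **costs))


# Made-up validation labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_proba = [0.9, 0.6, 0.3, 0.4, 0.2, 0.1, 0.35, 0.05]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]

# Missing a sick patient costs 10x a false alarm: a lower threshold wins.
print(pick_threshold(y_true, y_proba, candidates, cost_fn=10.0, cost_fp=1.0))  # 0.3
# With equal costs, the conventional 0.5 is the cheapest here.
print(pick_threshold(y_true, y_proba, candidates, cost_fn=1.0, cost_fp=1.0))   # 0.5
```

The threshold should be chosen on validation data, not on the final test set, so that the reported test metrics remain unbiased.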

4) Deployment, privacy & ethics notes  
Data privacy: follow HIPAA/GDPR rules; de-identify PHI; control access to model outputs.  
Human-in-the-loop: use model as triage/decision support, not as sole final decision maker.  
Documentation: keep model card that lists intended use, limitations, and performance.  
Regulatory: in some jurisdictions medical decision tools require approval; check compliance.  

5) Business value (short & concrete)  
Early detection → earlier interventions, better patient outcomes.  
Efficient triage → prioritize high-risk patients for further tests or specialist review.  
Cost reduction → avoid unnecessary tests for low-risk patients, allocate resources efficiently.  
Operational planning → predict caseloads and resource demand (beds, staff).  
Quality improvement → discover predictive features that inform new clinical guidelines.  