Question 1: What is a Decision Tree, and how does it work in the context of classification?

  - Definition:
A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks.
In the context of classification, it works by splitting the dataset into smaller subsets based on feature values, forming a tree-like structure of decisions that leads to class labels.

     How it works (Step by Step for Classification):

1. Root Node Selection

- The process starts at the root node (the top of the tree).

- The algorithm chooses the best feature to split the data using criteria such as:

- Gini Impurity

- Entropy / Information Gain

- Chi-Square

2. Splitting

- The chosen feature splits the dataset into branches (subgroups).

- Each branch represents a decision based on the feature value.

3. Recursive Partitioning

- For each branch, the algorithm again selects the best feature to further split the data.

- This process continues recursively until:

- A stopping criterion is met (e.g., max depth, min samples per leaf).

- Or all samples in a node belong to the same class (pure node).

4. Leaf Nodes (Prediction)

- The final nodes (leaves) contain the class label prediction.

- For classification:

- If a leaf has 80% "Yes" and 20% "No" samples, the model predicts "Yes" for new data falling into that leaf.
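
To make the split criteria and the leaf prediction concrete, here is a minimal sketch in plain Python. The helper names `gini` and `entropy` are illustrative (they are not library functions), and the label counts simply mirror the 80% / 20% leaf described above.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the classes present in the node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


# A leaf holding 80% "Yes" and 20% "No" samples (illustrative counts).
leaf = ["Yes"] * 8 + ["No"] * 2

print(f"Gini impurity: {gini(leaf):.3f}")
print(f"Entropy:       {entropy(leaf):.3f}")

# The leaf predicts its majority class.
prediction = Counter(leaf).most_common(1)[0][0]
print("Prediction:", prediction)
```

Lower values mean a purer node. A node containing a single class scores 0 under both measures, which is why the algorithm prefers splits that reduce them.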

- Example (Binary Classification):

-  Suppose we want to classify whether a person will buy a computer.

     Root Node: "Age"

    If Age < 30 → Check "Income"

    If Age 30–50 → Predict "Yes"

     If Age > 50 → Check "Student"

This process continues until we reach class labels ("Yes" or "No").
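
The example tree can be read as nested if/else rules. The sketch below is only an illustration: the text specifies the "Age" branches, while the outcomes under "Income" and "Student" are hypothetical placeholders chosen to complete the function.

```python
def will_buy_computer(age, income, student):
    """Hand-written version of the example tree (leaf outcomes partly assumed)."""
    if age < 30:
        # Hypothetical rule: the text only says this branch checks "Income".
        return "Yes" if income == "high" else "No"
    if age <= 50:
        # Stated in the example: ages 30-50 are predicted "Yes".
        return "Yes"
    # Hypothetical rule: the text only says this branch checks "Student".
    return "Yes" if student else "No"


print(will_buy_computer(age=40, income="low", student=False))
print(will_buy_computer(age=25, income="high", student=False))
print(will_buy_computer(age=60, income="low", student=False))
```

A trained tree has the same shape; the difference is that the algorithm learns the features and thresholds from data instead of having them written by hand.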

Advantages:

-  Easy to understand and interpret (like a flowchart).

-  Handles both numerical and categorical data.

-  Requires little data preprocessing (no scaling needed).

Limitations:

- Can easily overfit if not pruned or limited.

-  Sensitive to small changes in data.

-  Biased toward features with many levels (distinct values).

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

1. Pre-Pruning (Early Stopping)

  Definition: Pre-pruning stops the tree from growing too deep while it is being built.

   It applies constraints during tree construction to avoid overly complex trees.

  Common pre-pruning techniques:

  Limit maximum depth (max_depth).

  Minimum samples required to split a node (min_samples_split).

  Minimum samples per leaf (min_samples_leaf).

  Maximum number of leaf nodes (max_leaf_nodes).

  Stop splitting if impurity improvement is below a threshold (min_impurity_decrease).

- Practical Advantage:
 Efficiency: Saves time and memory by preventing unnecessary tree growth.
Example: Useful when dealing with very large datasets where training speed matters.
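
As a rough illustration, the pre-pruning constraints listed above map directly onto `DecisionTreeClassifier` arguments in scikit-learn. The specific values below are arbitrary choices for the demo, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# No constraints: the tree grows until its leaves are pure.
unrestricted = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruned: growth stops as soon as any constraint would be violated.
pre_pruned = DecisionTreeClassifier(
    max_depth=3,                 # limit maximum depth
    min_samples_split=10,        # a node needs at least 10 samples to be split
    min_samples_leaf=5,          # every leaf keeps at least 5 samples
    max_leaf_nodes=8,            # cap on the number of leaves
    min_impurity_decrease=0.01,  # skip splits that barely reduce impurity
    random_state=0,
).fit(X, y)

print("Unrestricted: depth", unrestricted.get_depth(), "leaves", unrestricted.get_n_leaves())
print("Pre-pruned:   depth", pre_pruned.get_depth(), "leaves", pre_pruned.get_n_leaves())
```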

2. Post-Pruning (Pruning After Full Growth)

-  Definition: Post-pruning allows the tree to grow fully and then removes branches that do not improve model performance significantly.

   The idea is: start complex → simplify.

   Common post-pruning techniques:

   Reduced Error Pruning: Replace a subtree with a leaf if it doesn’t worsen accuracy on validation data.

   Cost Complexity Pruning (used in CART): Prune nodes that provide the least benefit relative to their complexity.

Practical Advantage:
Better Generalization: Because the full tree is available, each branch can be judged on measured performance (e.g., validation accuracy) and removed if it only fits noise.
Example: Useful when interpretability and accuracy on unseen data are more important than speed.
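
A minimal sketch of cost complexity pruning with scikit-learn, assuming a simple hold-out validation split on Iris (the split and the tie-breaking rule are choices made for this demo). `cost_complexity_pruning_path` returns the candidate `ccp_alpha` values, and a larger alpha prunes more aggressively.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Grow the full tree, then ask it for the sequence of effective alphas.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    candidate = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    score = candidate.fit(X_train, y_train).score(X_val, y_val)
    if score >= best_score:  # on ties, keep the larger alpha (simpler tree)
        best_alpha, best_score = alpha, score

pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha)
pruned_tree.fit(X_train, y_train)

print("Full tree leaves:  ", full_tree.get_n_leaves())
print("Pruned tree leaves:", pruned_tree.get_n_leaves())
print(f"Chosen ccp_alpha: {best_alpha:.4f}, validation accuracy: {best_score:.2f}")
```

In practice the alpha is often chosen with cross-validation rather than a single validation split.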

Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

Datasets:
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).
● Boston Housing Dataset for regression tasks (sklearn.datasets.load_boston() or provided CSV).

1. Real-World Applications

     (a) Classification

   Iris Dataset (Classic Example in ML):

     Task: Classify iris flowers into species (Setosa, Versicolor, Virginica) based on features like petal length, sepal width, etc.

Dataset: sklearn.datasets.load_iris()

  Use Case: Demonstrates how Decision Trees can handle multi-class classification.

  Other Real-World Classification Examples:

  Medical Diagnosis: Classifying patients as “disease” or “no disease”.

  Fraud Detection: Detecting fraudulent vs. legitimate transactions.

Customer Churn Prediction: Predicting if a customer will leave a service.

(b) Regression

Boston Housing Dataset (Classic Example in Regression):

-  Task: Predict median house prices based on features like crime rate, number of rooms, distance to employment centers, etc.

-  Dataset: sklearn.datasets.load_boston() (deprecated in scikit-learn 1.0 and removed in 1.2); a modern alternative is the California Housing dataset (sklearn.datasets.fetch_california_housing()).

-  Use Case: Demonstrates how Decision Trees can predict continuous outcomes.

-  Other Real-World Regression Examples:

-  Predicting sales based on advertisement spending.

-  Forecasting demand for energy or products.

-  Estimating credit risk score.

2. Advantages of Decision Trees

-  Easy to Interpret & Visualize

-  Looks like a flowchart → business/non-technical stakeholders can understand.

-  No Need for Feature Scaling

-  Trees split on feature thresholds, so unlike SVM or regularized Logistic Regression they do not need normalization/standardization.

-  Handles Both Categorical & Numerical Data

-  Works with mixed data types.

-  Captures Non-linear Relationships

-  Can model complex decision boundaries.

- Fast Inference

-  Once trained, predictions are quick (just follow tree branches).
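
As an illustration of the interpretability point, scikit-learn's `export_text` prints a fitted tree as indented if/else rules. The depth limit of 2 below is an arbitrary choice to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree so the printed rules stay readable.
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# Each line is a threshold test on one feature; leaves show the predicted class.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Prediction follows a single root-to-leaf path through these rules, which is also why inference is fast.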

3. Limitations of Decision Trees

-  Overfitting

-  Trees can grow too deep → memorizing noise in training data.

-   (Needs pruning or ensemble methods like Random Forest).

-  Unstable

-  Small changes in data can produce very different trees.

-  Biased Toward Features with More Levels

-  Features with many unique values (like IDs) can dominate splits.

-  Not Great at Extrapolation in Regression

-  If values outside training range appear, tree can’t predict well.

- Less Accurate Alone

-  Often outperformed by ensemble models (Random Forest, XGBoost, LightGBM).
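
The extrapolation limitation is easy to reproduce. The sketch below uses a synthetic linear relationship (y = 2x, an assumption made purely for the demo) and asks the tree to predict far outside the training range.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data: x = 0, 1, ..., 10 and y = 2x.
X_train = np.arange(0, 11, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

reg = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

pred_inside = reg.predict([[5.0]])[0]     # within the training range
pred_outside = reg.predict([[100.0]])[0]  # far outside the training range

print("Prediction at x=5:  ", pred_inside)
print("Prediction at x=100:", pred_outside, "(true value would be 200)")
print("Largest training target:", y_train.max())
```

A leaf can only return an average of training targets, so the prediction at x=100 cannot exceed the largest target seen during training.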

   

Question 6: Write a Python program to:

- Load the Iris Dataset

- Train a Decision Tree Classifier using the Gini criterion

-  Print the model’s accuracy and feature importances


In [1]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Train Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# 4. Evaluate the model
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# 5. Print results
print("Decision Tree Classifier (Gini Criterion)")
print(f"Accuracy on test set: {accuracy:.2f}")

# Feature importances
print("\nFeature Importances:")
for feature_name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature_name}: {importance:.3f}")


Decision Tree Classifier (Gini Criterion)
Accuracy on test set: 0.93

Feature Importances:
sepal length (cm): 0.006
sepal width (cm): 0.029
petal length (cm): 0.559
petal width (cm): 0.406


Question 7: Write a Python program to: ● Load the Iris Dataset ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

In [2]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Train a fully-grown Decision Tree
full_tree = DecisionTreeClassifier(random_state=42)  # no depth limit
full_tree.fit(X_train, y_train)
y_pred_full = full_tree.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# 4. Train a Decision Tree with max_depth = 3 (pre-pruned)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
shallow_tree.fit(X_train, y_train)
y_pred_shallow = shallow_tree.predict(X_test)
accuracy_shallow = accuracy_score(y_test, y_pred_shallow)

# 5. Print Results
print("Decision Tree Accuracy Comparison")
print(f"Fully-Grown Tree Accuracy : {accuracy_full:.2f}")
print(f"Max Depth = 3 Tree Accuracy: {accuracy_shallow:.2f}")


Decision Tree Accuracy Comparison
Fully-Grown Tree Accuracy : 0.93
Max Depth = 3 Tree Accuracy: 0.97


Question 8: Write a Python program to: ● Load the California Housing dataset from sklearn ● Train a Decision Tree Regressor ● Print the Mean Squared Error (MSE) and feature importances

In [3]:
# Import necessary libraries
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# 1. Load the California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# 2. Split into training and test sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# 4. Predictions and Evaluation
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

# 5. Print results
print("Decision Tree Regressor on California Housing")
print(f"Mean Squared Error (MSE): {mse:.2f}")

print("\nFeature Importances:")
for feature_name, importance in zip(housing.feature_names, regressor.feature_importances_):
    print(f"{feature_name}: {importance:.3f}")


Decision Tree Regressor on California Housing
Mean Squared Error (MSE): 0.50

Feature Importances:
MedInc: 0.529
HouseAge: 0.052
AveRooms: 0.053
AveBedrms: 0.029
Population: 0.031
AveOccup: 0.131
Latitude: 0.094
Longitude: 0.083


Question 9: Write a Python program to: ● Load the Iris Dataset ● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV ● Print the best parameters and the resulting model accuracy

In [4]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Define the Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# 4. Define the hyperparameter grid
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 3, 4, 5, 10]
}

# 5. GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

# 6. Get best parameters and evaluate on test set
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# 7. Print results
print("Best Parameters:", best_params)
print(f"Accuracy of best model on test set: {accuracy:.2f}")


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Accuracy of best model on test set: 0.93


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:

- Handle the missing values
- Encode the categorical features
- Train a Decision Tree model
- Tune its hyperparameters
- Evaluate its performance

And describe what business value this model could provide in the real-world setting.

Step 1: Handle Missing Values

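Why: Medical datasets often have missing entries (e.g., missing lab results, unrecorded symptoms). Some Decision Tree implementations can handle missing values natively (scikit-learn's tree estimators accept NaN inputs from version 1.3 onward), but explicit imputation keeps the pipeline predictable and compatible with other models.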

Approach:

1. Numerical features → impute with:
- Mean (if the feature is roughly symmetric) or median (if it is skewed or has outliers).
- More advanced: KNN imputation or regression imputation.

2. Categorical features → impute with:

- Mode (most frequent value).

- Or create a special category like "Unknown".

In [5]:
from sklearn.impute import SimpleImputer

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

Step 2: Encode Categorical Features

- Decision Trees don’t need feature scaling, but they do need numerical inputs.

- Encoding approaches:

- One-Hot Encoding: For nominal categories (e.g., blood type A/B/O).

- Ordinal Encoding: For ordered categories (e.g., disease stage I/II/III).

In [6]:
from sklearn.preprocessing import OneHotEncoder
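
To show how imputation and encoding fit together, here is a small sketch on made-up patient records. The arrays, the feature choices (age, BMI, blood type, disease stage), and all values are hypothetical and exist only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data (not from any real dataset).
age_bmi = np.array([[54.0, 27.1], [61.0, np.nan], [np.nan, 31.4], [47.0, 24.8]])
blood_type = np.array([["A"], ["O"], [np.nan], ["O"]], dtype=object)  # nominal
stage = np.array([["I"], ["III"], ["II"], ["I"]], dtype=object)       # ordered

# Impute: median for numeric columns, most frequent value for categories.
num_filled = SimpleImputer(strategy="median").fit_transform(age_bmi)
blood_filled = SimpleImputer(strategy="most_frequent").fit_transform(blood_type)

# Encode: one-hot for nominal categories, ordinal for ordered ones.
blood_onehot = OneHotEncoder(handle_unknown="ignore").fit_transform(blood_filled).toarray()
stage_ordinal = OrdinalEncoder(categories=[["I", "II", "III"]]).fit_transform(stage)

# Final numeric feature matrix that a Decision Tree can consume.
X_encoded = np.hstack([num_filled, blood_onehot, stage_ordinal])
print(X_encoded.shape)
print(X_encoded)
```

On a real dataset the imputers and encoders should be fitted on the training split only, for example by wrapping them in a `ColumnTransformer` inside a `Pipeline`, so that no information leaks from the test set.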

Step 3: Train a Decision Tree Model

Split dataset into train/test (e.g., 80/20).

Train using DecisionTreeClassifier (criterion = "gini" or "entropy").

In [7]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

Step 4: Hyperparameter Tuning

Tuning controls tree complexity to reduce overfitting. Use GridSearchCV to find the best parameters:

max_depth → controls tree depth.

min_samples_split → minimum samples a node needs before it can be split; min_samples_leaf → minimum samples each resulting leaf must keep.

criterion → "gini" or "entropy"

In [8]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "criterion": ["gini", "entropy"]
}
grid = GridSearchCV(clf, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_train, y_train)
best_model = grid.best_estimator_


Step 5: Model Evaluation

Metrics to use (Healthcare context):

Accuracy: Good, but not enough (especially with imbalanced data).

Precision & Recall: Important because recall measures how many actual disease cases the model catches, while precision measures how many flagged patients truly have the disease.

F1-score: Balance between precision & recall.

ROC-AUC: Measures model’s ability to distinguish disease vs. no disease.
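
A small worked example may help relate these metrics to the confusion matrix. The labels below are made up (1 = disease, 0 = no disease) and are not results from any model in this notebook.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical ground truth and predictions for 10 patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")

accuracy = accuracy_score(y_true, y_hat)     # (TP + TN) / all
precision = precision_score(y_true, y_hat)   # TP / (TP + FP)
recall = recall_score(y_true, y_hat)         # TP / (TP + FN)
f1 = f1_score(y_true, y_hat)                 # harmonic mean of precision and recall

print(f"Accuracy={accuracy:.2f} Precision={precision:.2f} Recall={recall:.2f} F1={f1:.2f}")
```

Here one sick patient is missed (a false negative), which accuracy alone would hide; in a screening setting that missed case is usually the costlier error.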

In [11]:
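from sklearn.metrics import classification_report, roc_auc_score

y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))
# Binary target: score the positive-class column. A multiclass target (like the
# 3-class split reported below) needs every column plus multi_class="ovr".
proba = best_model.predict_proba(X_test)
is_binary = proba.shape[1] == 2
auc = roc_auc_score(y_test, proba[:, 1]) if is_binary else roc_auc_score(y_test, proba, multi_class="ovr")
print("ROC-AUC:", auc)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       0.90      0.90      0.90        10
           2       0.90      0.90      0.90        10

    accuracy                           0.93        30
   macro avg       0.93      0.93      0.93        30
weighted avg       0.93      0.93      0.93        30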

Step 6: Business Value in Real-World Setting

Early Disease Detection: Helps doctors identify patients at risk → early treatment → improved outcomes.

Resource Allocation: Hospitals can prioritize high-risk patients (e.g., ICU beds, specialist referrals).

Cost Savings: Reduces unnecessary tests for low-risk patients while focusing on critical cases.

Decision Support: Provides interpretable rules (why a patient is classified as “at risk”), which doctors trust more than black-box models.