Question

1. What is a Decision Tree, and how does it work in the context of
  classification?
 -    answers:- A Decision Tree is a flowchart-like tree structure where:

-  Each internal node represents a feature (attribute) test

-  Each branch represents the outcome of that test

-  Each leaf node represents a class label (the decision reached after evaluating every test along the path from the root)

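-    How It Works (Classification Context)

-  Start at the root node, which holds the full training dataset.

-  Select the feature (and threshold or category grouping) that best separates the classes according to a splitting criterion.

-  Partition the data into subsets based on the selected feature.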

-  Repeat this process recursively for each child node (subtree) until one of the stopping conditions is met:

-  All data points in a node belong to the same class

-  Maximum depth is reached

-  Minimum number of samples per leaf is reached

-  Splitting Criteria (How to Choose the Best Feature?)
Common metrics for classification:

-  Gini Impurity: Measures how often a randomly chosen element would be incorrectly labeled if it were labeled at random according to the class distribution in the node.

-  Entropy / Information Gain: Measures the reduction in uncertainty after a split.

-  Example (Binary Classification)
Let’s say you want to classify whether a person buys a product based on age and income.

        Age	Income	Buys
        <30	High	No
        30–40	Medium	Yes
        >40	Low	Yes

-  The tree might look like this:


            [Age?]
           /  |   \
        <30 30-40 >40
        No   Yes   Yes
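
The example tree can be written down as a small data structure to show how prediction walks from the root to a leaf. This is only an illustrative sketch: the nested-dict encoding and the `predict` helper are assumptions made here, not how any particular library stores a tree.

```python
# Hypothetical encoding of the example tree: an internal node is a dict,
# a leaf is a plain class label.
tree = {
    "feature": "age",
    "branches": {"<30": "No", "30-40": "Yes", ">40": "Yes"},
}

def predict(node, sample):
    """Follow the branch matching the sample's feature value until a leaf is reached."""
    while isinstance(node, dict):
        value = sample[node["feature"]]
        node = node["branches"][value]
    return node

print(predict(tree, {"age": "<30", "income": "High"}))   # No
print(predict(tree, {"age": ">40", "income": "Low"}))    # Yes
```

Income never appears in this tree because, in the three-row table, age alone already separates the classes.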

-  Advantages

-  No need for feature scaling

-  Can handle both numerical and categorical data

-  Disadvantages

-  Prone to overfitting (especially deep trees)

-  Small changes in data can result in very different trees (unstable)

-  Biased towards features with more levels/categories

-  Applications

-  Medical diagnosis (e.g., disease classification)

-  Customer segmentation

-  Credit scoring


 Question

 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
  How do they impact the splits in a Decision Tree?
-  answers:- In decision tree algorithms like CART and ID3, we use impurity measures to decide which attribute to split on at each node. Two commonly used measures of impurity are:

-  1. Gini Impurity
-  Definition:
Gini impurity measures the probability of incorrectly classifying a randomly chosen element if it was randomly labeled according to the distribution of labels in the dataset.

-  Formula:

$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{C} p_i^2$$

Where:

-  $D$ is the dataset

-  $C$ is the number of classes

-  $p_i$ is the proportion of class $i$ in $D$

-  Interpretation:

-  Range: 0 (pure) up to 1 − 1/C when all C classes are equally represented (0.5 for two classes), so it is always below 1

-  Lower Gini = better split (more homogeneous node)
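
As a minimal sketch of the formula above, the function below computes Gini impurity from a list of class labels. The function name `gini` and the sample labels are chosen here for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["yes"] * 5 + ["no"] * 5))            # 0.5  (maximally mixed, two classes)
print(round(gini(["yes"] * 9 + ["no"] * 1), 2))  # 0.18
print(gini(["yes"] * 10))                        # 0.0  (pure node)
```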

-  2. Entropy (Information Gain)
-  Definition:
-  Entropy measures the amount of disorder or uncertainty in the dataset. It is used in the ID3 algorithm.

-  Formula:

$$\mathrm{Entropy}(D) = -\sum_{i=1}^{C} p_i \log_2(p_i)$$

Where:

-  $p_i$ is the probability of class $i$

-  Interpretation:

-  Entropy = 0 when the node is pure (only one class)

- Entropy is highest when all classes are equally represented
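
The entropy formula can be sketched in a few lines of Python. The function name `entropy` and the sample labels are illustrative assumptions; classes with zero count never appear in the `Counter`, so the undefined term 0 · log2(0) is never evaluated.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of the class distribution in a list of labels."""
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in probs)

print(entropy(["yes"] * 5 + ["no"] * 5))            # 1.0  (two equally likely classes)
print(round(entropy(["yes"] * 9 + ["no"] * 1), 2))  # 0.47
print(entropy(["yes"] * 10))                        # 0.0  (pure node)
```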

-  How They Impact Splits in a Decision Tree
When building a decision tree:

- At each node, we evaluate all possible splits of the data.

-  For each split, we calculate the impurity (Gini or Entropy) of the resulting subsets.

-  We aim to reduce impurity — i.e., find the split that maximizes purity in the child nodes.

-  The splitting criterion:

-  Gini: Choose the split that gives the lowest weighted Gini impurity across the child nodes.

-  Entropy (Information Gain): Choose the split that gives the highest information gain, i.e., maximum reduction in entropy.
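
The split search described above can be sketched for a single numeric feature. This is a simplified illustration, not the exact procedure of CART or any library: it tries a threshold midway between each pair of adjacent distinct values and keeps the one with the lowest weighted Gini impurity. The helper names and the age data are made up for the example.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Return (threshold, weighted Gini) of the best binary split on one feature."""
    distinct = sorted(set(values))
    best = (None, float("inf"))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = weighted_gini(left, right)
        if score < best[1]:
            best = (t, score)
    return best

ages = [22, 25, 28, 35, 38, 45, 50]                       # made-up values
buys = ["No", "No", "No", "Yes", "Yes", "Yes", "Yes"]
print(best_threshold(ages, buys))                         # (31.5, 0.0)
```

Replacing `gini` with an entropy function and maximizing the reduction relative to the parent gives the information-gain version of the same search.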

-   Example (Binary Classification):
Class Distribution	Gini Impurity	Entropy
[50%, 50%]	0.5	1.0
[90%, 10%]	0.18	0.47
[100%, 0%]	0.0 (pure)	0.0 (pure)

-   Gini vs. Entropy
Criteria	Gini Impurity	Entropy
Algorithm Used	CART	ID3, C4.5
Computational Cost	Faster (no logarithms)	Slightly slower
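Splitting Behavior	Tends to isolate the most frequent class in its own branch	Tends to produce slightly more balanced splits; in practice the two criteria usually yield very similar trees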


Question

 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
  Trees? Give one practical advantage of using each.

  -  answers:- Decision Trees can easily overfit the training data if allowed to grow without constraints. To prevent this, pruning techniques are used — either before or after the tree is built.

-   Pre-Pruning (Early Stopping)
Definition:
Pre-pruning stops the tree from growing once a certain condition is met during the training process.

-  Common Pre-Pruning Criteria:

-  Maximum depth of the tree

-  Minimum number of samples required to split a node

-  Minimum information gain or reduction in impurity (Gini/Entropy)

-  Practical Advantage:
Faster training time — because it avoids creating unnecessary branches early, it reduces computation and saves resources.
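
In scikit-learn, the pre-pruning criteria listed above map onto constructor parameters of `DecisionTreeClassifier`. The sketch below uses the Iris dataset and arbitrary threshold values chosen only for illustration; suitable values depend on the data.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unconstrained tree for comparison
unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruned tree: growth stops as soon as any limit is hit
pre_pruned = DecisionTreeClassifier(
    max_depth=3,                 # maximum depth of the tree
    min_samples_split=10,        # minimum samples required to split a node
    min_impurity_decrease=0.01,  # minimum impurity reduction required for a split
    random_state=0,
).fit(X, y)

print("unpruned:   depth", unpruned.get_depth(), "leaves", unpruned.get_n_leaves())
print("pre-pruned: depth", pre_pruned.get_depth(), "leaves", pre_pruned.get_n_leaves())
```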

-  Post-Pruning (Cost Complexity Pruning or Reduced Error Pruning)
Definition:
Post-pruning allows the tree to grow fully and then removes branches that do not provide significant improvement on a validation set.

-  How it works:

-  Fully grow the tree

-  Evaluate subtrees using a validation set

-  Prune the branches that reduce accuracy or have minimal gain

-  Practical Advantage:
Better generalization — the tree is allowed to learn all patterns and then simplified, often leading to higher accuracy on unseen data.
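
One way to try post-pruning in scikit-learn is cost-complexity pruning via `cost_complexity_pruning_path` and the `ccp_alpha` parameter. The sketch below is one possible workflow, assuming the Iris dataset and a 70/30 train/validation split: it grows the full tree, refits one tree per candidate alpha, and keeps the alpha that scores best on the validation data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Fully grow the tree, then compute the candidate pruning strengths (alphas)
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit one tree per alpha and keep the one that scores best on validation data
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    candidate = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score >= best_score:         # on ties, prefer the larger alpha (simpler tree)
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print("leaves before:", full_tree.get_n_leaves(), "after:", pruned.get_n_leaves())
print("validation accuracy:", best_score)
```

With more data, cross-validation over the alphas is usually preferable to a single validation split.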

-  Summary Table:
Aspect	Pre-Pruning	Post-Pruning
When applied	During tree construction	After full tree is built
Basis	Heuristics (e.g. depth, samples)	Validation performance
Risk	Might underfit	Less risk of underfitting
Speed	Faster	Slower
Advantage	Faster training	Better generalization


 Question

  4: What is Information Gain in Decision Trees, and why is it important for
  choosing the best split?
-  answers:-
-  Information Gain (IG) is a metric used in Decision Trees to measure the reduction in uncertainty (impurity) about the target variable after a dataset is split on a specific feature.

-  It is based on the concept of Entropy, which measures the amount of disorder or impurity in a dataset.

-  The formula for Information Gain is:

$$\text{Information Gain} = \mathrm{Entropy}(\text{parent}) - \sum_{i=1}^{k} \left( \frac{n_i}{n} \times \mathrm{Entropy}(\text{child}_i) \right)$$

Where:

-  Entropy(parent): Entropy before the split.

-  Entropy(child₁, child₂, ..., childₖ): Entropies of the resulting subsets.

-  nᵢ: Number of instances in child i.

-  n: Total number of instances in the parent node.

-  Information Gain is used to select the best feature to split the data at each step while building the Decision Tree. The feature with the highest Information Gain is chosen because:

-  It gives the most reduction in impurity.

-  It results in purer child nodes, leading to a more accurate model.

-  It tends to produce a shallower tree, because each split removes as much uncertainty as possible.

-  Example:
  Suppose you’re building a decision tree to classify whether to play tennis based on weather conditions. If splitting by “Outlook” gives the highest Information Gain compared to “Humidity” or “Wind”, then the tree will choose “Outlook” as the first split.  
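
To make the tennis example concrete, the sketch below computes Information Gain from class counts. It assumes the classic 14-row PlayTennis dataset (9 "yes", 5 "no"), which the text above does not spell out; the per-branch counts and helper names are part of that assumption.

```python
import math

def entropy(counts):
    """Entropy (base 2) of a node given its per-class counts."""
    n = sum(counts)
    return sum(-c / n * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [9, 5]  # [yes, no] counts before splitting
splits = {
    "Outlook":  [[2, 3], [4, 0], [3, 2]],  # sunny, overcast, rain
    "Humidity": [[3, 4], [6, 1]],          # high, normal
    "Wind":     [[6, 2], [3, 3]],          # weak, strong
}
for feature, children in splits.items():
    print(feature, round(information_gain(parent, children), 3))
# Outlook 0.247
# Humidity 0.152
# Wind 0.048
```

Under these counts "Outlook" has the largest gain, so it would be chosen for the first split.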

Question

5: What are some common real-world applications of Decision Trees, and
  what are their main advantages and limitations?
-  answers:- Real-World Applications of Decision Trees:

-  Medical Diagnosis

-  Used to classify diseases based on patient symptoms, test results, and history.

-  Example: Diagnosing diabetes or heart disease based on input features like age, blood pressure, glucose levels, etc.

-  Customer Relationship Management (CRM)

-  Predicting customer churn, segmenting customers, or recommending products.

-  Credit Scoring and Risk Analysis

-  Banks use decision trees to decide whether to approve loans based on financial history, income, credit score, etc.

-  Fraud Detection

-  Identifying fraudulent transactions by analyzing patterns in transaction data.

- Marketing and Sales

-  Targeting potential customers or selecting marketing strategies based on demographic and behavioral data.

-  Manufacturing & Quality Control

-  Predicting equipment failure or defects in a production process.

-  Agriculture

-  Classifying types of crops or predicting yield based on environmental factors like rainfall, temperature, and soil type.

-  Main Advantages of Decision Trees:

-  Easy to Understand and Interpret

-  Tree structures are visual and mimic human decision-making.

-  No Need for Feature Scaling or Normalization

-  Can handle raw data without preprocessing like normalization or standardization.

-  Can Handle Both Numerical and Categorical Data

-  Flexibly deals with different types of variables.

-  Non-Parametric

-  Makes no assumptions about data distribution.

-  Feature Selection is Built-in

-  Automatically selects the most important features based on splitting criteria.

-   Limitations of Decision Trees:

-  Overfitting

-  Trees can become too complex and memorize the training data instead of generalizing well (especially without pruning).

-  Instability

-  Small changes in the data can lead to very different tree structures.

-  Biased Towards Features with More Levels

-  Features with more unique values can dominate splits.

-  Less Accurate Alone

-  Compared to ensemble methods like Random Forest or Gradient Boosted Trees, a single decision tree may have lower predictive power.

-  Harder to Model Complex Relationships

-  Axis-aligned splits can only approximate smooth or linear relationships with a staircase of many splits, so capturing them requires a deep tree.

-   Summary Table:
-   Aspect	Advantage or Limitation	Comment
-  Interpretability	✅ Advantage	Easy to explain and visualize
-  Data Preprocessing	✅ Advantage	No need for normalization or scaling
-  Overfitting Risk	❌ Limitation	Needs pruning or ensemble methods
-  Accuracy	❌ Limitation	Less accurate than boosted/ensemble models
-  Stability	❌ Limitation	Sensitive to small data changes

Dataset Info:
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).
● Boston Housing Dataset for regression tasks (sklearn.datasets.load_boston() or provided CSV).

-  1. Iris Dataset (Classification)
Purpose: Used for classification tasks — predicting the species of an iris flower based on features.

-   Features:
sepal length (cm)

sepal width (cm)

petal length (cm)

petal width (cm)

-  Target:
0: Setosa

1: Versicolor

2: Virginica

-  Load using sklearn:

    from sklearn.datasets import load_iris
    import pandas as pd

    iris = load_iris()
    X = pd.DataFrame(iris.data, columns=iris.feature_names)
    y = pd.Series(iris.target, name='species')

-  2. Boston Housing Dataset (Regression)
Purpose: Used for regression tasks: predicting median house prices in Boston suburbs.

-   Note: load_boston() was deprecated in scikit-learn 1.0 and removed in 1.2 due to ethical concerns with the dataset. With an older version of scikit-learn you can still load it as shown. For newer versions, use the CSV file or the California Housing dataset (sklearn.datasets.fetch_california_housing) instead.

-   Features:
      13 numeric/categorical features (e.g., crime rate, number of rooms, property tax rate, etc.)

-   Target:
Median value of owner-occupied homes in $1000s.

-   Load using sklearn (only available in scikit-learn versions before 1.2):

    from sklearn.datasets import load_boston
    import pandas as pd

    boston = load_boston()
    X = pd.DataFrame(boston.data, columns=boston.feature_names)
    y = pd.Series(boston.target, name='MEDV')

-   Alternative: Use CSV
If load_boston() is not available, use the CSV (if provided):

    import pandas as pd

    df = pd.read_csv('boston_housing.csv')
    X = df.drop('MEDV', axis=1)  # all columns except the target
    y = df['MEDV']               # median home value in $1000s


Question


 6. Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances

-  answers :- Here is a complete Python program to:

Load the Iris Dataset
Train a Decision Tree Classifier using the Gini criterion
Print the model’s accuracy and feature importances

    # Import necessary libraries
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Load the Iris dataset
    iris = load_iris()
    X = iris.data  # Features
    y = iris.target  # Labels

    # Split the data into training and testing sets (80% train, 20% test)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Create and train the Decision Tree Classifier using Gini criterion
    clf = DecisionTreeClassifier(criterion='gini', random_state=42)
    clf.fit(X_train, y_train)

    # Predict on test set
    y_pred = clf.predict(X_test)

    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print("Model Accuracy:", accuracy)

    # Print feature importances
    print("\nFeature Importances:")
    for feature, importance in zip(iris.feature_names, clf.feature_importances_):
        print(f"{feature}: {importance:.4f}")

Output (as reported; exact importances depend on the scikit-learn version and the train/test split):

    Model Accuracy: 1.0

    Feature Importances:
    sepal length (cm): 0.0000
    sepal width (cm): 0.0000
    petal length (cm): 0.4444
    petal width (cm): 0.5556


Question


7:  Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.
-  answers:-
-  Here's a complete Python program to:

Load the Iris dataset

Train two Decision Tree classifiers:

One with max_depth=3

Another fully grown (no max_depth limit)

Compare their accuracies

    # Import necessary libraries
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Load the Iris dataset
    iris = load_iris()
    X = iris.data
    y = iris.target

    # Split the data into training and testing sets (80% train, 20% test)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    #  Train Decision Tree Classifier with max_depth = 3
    clf_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
    clf_limited.fit(X_train, y_train)
    y_pred_limited = clf_limited.predict(X_test)
    accuracy_limited = accuracy_score(y_test, y_pred_limited)

    # Train fully-grown Decision Tree (no max_depth limit)
    clf_full = DecisionTreeClassifier(random_state=42)
    clf_full.fit(X_train, y_train)
    y_pred_full = clf_full.predict(X_test)
    accuracy_full = accuracy_score(y_test, y_pred_full)

    # Output the results
    print("Accuracy with max_depth=3:", accuracy_limited)
    print("Accuracy with fully-grown tree:", accuracy_full)

Output:

    Accuracy with max_depth=3: 1.0
    Accuracy with fully-grown tree: 1.0

Question

 8: Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances
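-  answers:- Here's a Python program that loads the Boston Housing dataset, trains a Decision Tree Regressor, and prints the Mean Squared Error (MSE) and feature importances.

-  Note: load_boston was removed in scikit-learn v1.2 due to ethical concerns. The dataset can still be loaded from a CSV copy or via sklearn.datasets.fetch_openml. Below is a version that works with current versions using OpenML: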

        from sklearn.datasets import fetch_openml
        from sklearn.tree import DecisionTreeRegressor
        from sklearn.metrics import mean_squared_error
        from sklearn.model_selection import train_test_split
        import pandas as pd

        # Load the Boston Housing dataset from OpenML (downloads on first call)
        boston = fetch_openml(name='boston', version=1, as_frame=True)
        # Some columns (e.g. CHAS, RAD) may load as categorical; cast everything to float
        X = boston.data.astype(float)
        y = boston.target.astype(float)

        # Split into training and testing sets (80% train, 20% test)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Train Decision Tree Regressor
        model = DecisionTreeRegressor(random_state=42)
        model.fit(X_train, y_train)

        # Predict on test set
        y_pred = model.predict(X_test)

        # Calculate Mean Squared Error (MSE)
        mse = mean_squared_error(y_test, y_pred)
        print(f"Mean Squared Error (MSE): {mse:.2f}")

        # Print feature importances, largest first
        importances = pd.Series(model.feature_importances_, index=X.columns)
        print("\nFeature Importances:")
        print(importances.sort_values(ascending=False))

Output (as reported; exact values depend on the scikit-learn version and the train/test split):

        Mean Squared Error (MSE): 19.75

        Feature Importances:
        LSTAT    0.620
        RM       0.275
        DIS      0.033
        CRIM     0.024
        NOX      0.017
        ...


Question
9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy

-  answers:- Here's a complete Python program to load the Iris dataset, tune max_depth and min_samples_split with GridSearchCV, and print the best parameters and model accuracy:

        from sklearn.datasets import load_iris
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.model_selection import GridSearchCV, train_test_split
        from sklearn.metrics import accuracy_score

        # Load the Iris dataset
        iris = load_iris()
        X = iris.data    # Features
        y = iris.target  # Labels

        # Split the dataset into training and testing sets (80-20 split)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Create a Decision Tree Classifier
        dtree = DecisionTreeClassifier(random_state=42)

        # Define the parameter grid to tune
        param_grid = {
            'max_depth': [2, 3, 4, 5, 6],
            'min_samples_split': [2, 3, 4, 5]
        }

        # Apply GridSearchCV
        grid_search = GridSearchCV(dtree, param_grid, cv=5, scoring='accuracy')
        grid_search.fit(X_train, y_train)

        # Get the best estimator
        best_model = grid_search.best_estimator_

        # Predict on test data
        y_pred = best_model.predict(X_test)

        # Print the best parameters and accuracy
        print("Best Parameters:", grid_search.best_params_)
        print("Test Accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred) * 100))

Output:

        Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
        Test Accuracy: 100.00%

Question

 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

-  answers:- Step 1: Handle the Missing Values
Explore the data:

Use .info() and .isnull().sum() to check for missing values.

Impute missing values:

Numerical columns: Use mean or median imputation depending on skewness (the median is more robust to outliers).

    from sklearn.impute import SimpleImputer
    num_imputer = SimpleImputer(strategy='median')

Categorical columns: Use mode (most frequent) imputation.

    cat_imputer = SimpleImputer(strategy='most_frequent')

You can also use more advanced methods such as KNN imputation (sklearn.impute.KNNImputer) for numeric columns. Fit the imputers on the training split only, so that no information from the test set leaks into preprocessing.

-  Step 2: Encode the Categorical Features
Identify categorical columns with df.select_dtypes(include='object') or df.dtypes.

Encoding:

For Decision Trees, integer (label) encoding is often sufficient, because trees split on thresholds and are insensitive to the scale of the codes. Note that integer codes impose an arbitrary order on nominal categories, and that LabelEncoder is designed for target labels; OrdinalEncoder is the scikit-learn equivalent for feature columns.

    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    df['column'] = le.fit_transform(df['column'])

For nominal features with few categories, one-hot encoding avoids that artificial order:

    from sklearn.preprocessing import OneHotEncoder
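
Steps 1 and 2 can be combined into a single preprocessing pipeline so that imputation and encoding are fitted together with the model. The sketch below is one possible arrangement using scikit-learn's ColumnTransformer and Pipeline; the tiny patient table and its column names (age, glucose, smoker, disease) are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up patient table; column names are illustrative only
df = pd.DataFrame({
    "age": [34, 51, np.nan, 45, 62, 29],
    "glucose": [90.0, 140.0, 120.0, np.nan, 160.0, 85.0],
    "smoker": ["no", "yes", np.nan, "yes", "yes", "no"],
    "disease": [0, 1, 0, 1, 1, 0],
})
X, y = df.drop(columns="disease"), df["disease"]

preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the median
    ("num", SimpleImputer(strategy="median"), ["age", "glucose"]),
    # Categorical columns: fill gaps with the mode, then one-hot encode
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
])
model.fit(X, y)
print(model.predict(X))
```

Because the preprocessing lives inside the pipeline, passing `model` to GridSearchCV refits the imputers and encoder on each training fold, which avoids leakage during tuning.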

-  Step 3: Train a Decision Tree Model
Split the dataset:

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Train the model:

    from sklearn.tree import DecisionTreeClassifier
    dt = DecisionTreeClassifier(random_state=42)
    dt.fit(X_train, y_train)

-  Step 4: Tune Hyperparameters
Use GridSearchCV or RandomizedSearchCV to tune max_depth, min_samples_split, min_samples_leaf, and criterion (gini/entropy).

    from sklearn.model_selection import GridSearchCV

    param_grid = {
        'max_depth': [3, 5, 10, None],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'criterion': ['gini', 'entropy']
    }

    grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                               param_grid, cv=5, scoring='accuracy')
    grid_search.fit(X_train, y_train)

    # Best model refit on the full training set
    best_model = grid_search.best_estimator_

-  Step 5: Evaluate Performance
Metrics to use:

Accuracy

Precision, Recall, F1-score

Confusion Matrix

ROC AUC Score


    from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

    y_pred = best_model.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    print("ROC AUC:", roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1]))

-  Business Value in Real-World Healthcare Setting
This model can provide the following key benefits:

-  Early Detection & Diagnosis: Helps doctors prioritize high-risk patients for further tests, enabling faster intervention.

-  Resource Optimization: Reduces unnecessary tests for low-risk patients, optimizing time and cost.

-  Decision Support: Assists physicians in decision-making by identifying hidden patterns in patient data.

-  Personalized Care: Enables better treatment planning based on predicted risk.

-  Regulatory Reporting: Can be used to generate risk stratification for health insurance and compliance.

-  Patient Outreach: Helps in targeted communication campaigns (e.g., invite high-risk groups for checkups).







