Question 1: What is a Decision Tree, and how does it work in the context of
classification?
- A Decision Tree is a flowchart-like supervised model: each internal node tests a feature, each branch is an outcome of that test, and each leaf holds a predicted class.
- To classify a sample, the model starts at the root and follows the branches that match the sample's feature values until it reaches a leaf.
- During training, the tree repeatedly splits the data on feature values, aiming to create groups that are as pure (single-class) as possible. The result is intuitive, easy to interpret, and powerful for classification problems.
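
To make the flowchart idea concrete, the following is a minimal sketch that trains a very shallow tree on the Iris dataset and prints its learned rules as text. The choice of `max_depth=2` and `random_state=42` is an assumption made only to keep the printed tree small and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree so the printed rules stay readable
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Each indented line is one test on a feature; "class:" lines are leaves
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# Classifying a sample means walking from the root down to a leaf
sample = iris.data[:1]
predicted = iris.target_names[tree.predict(sample)[0]]
print("Predicted class for the first sample:", predicted)
```

Reading the printed rules from top to bottom mirrors how the model classifies a sample: each test sends the sample down one branch until a leaf is reached.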

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?
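- Both Gini Impurity and Entropy measure how mixed the class labels in a node are; a Decision Tree uses them to decide how “good” a split is.
- Gini Impurity: $G = 1 - \sum_i p_i^2$, the chance of mislabeling a randomly drawn sample if it were labeled according to the node's class distribution. It is 0 for a pure node.
- Entropy: $H = -\sum_i p_i \log_2 p_i$, the average uncertainty (in bits) about a sample's class. It is 0 for a pure node and 1 for a balanced two-class node.
- In both formulas, $p_i$ is the fraction of samples in the node that belong to class $i$.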

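- How They Impact Splitting
   - At each node, the algorithm tests candidate features and thresholds.
   - For each candidate split, it calculates the weighted impurity (using Gini or Entropy) of the child nodes.
   - It chooses the split that minimizes the weighted impurity, which for Entropy is the same as maximizing Information Gain.
- Formula:
    - $\text{Information Gain} = \text{Entropy}(\text{parent}) - \sum_k \frac{n_k}{n} \times \text{Entropy}(\text{child}_k)$, where $n_k$ is the number of samples in child $k$ and $n$ is the number of samples in the parent.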
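
As a small worked example, the sketch below computes both impurity measures and the weighted impurity of one split by hand. The helper names (`gini`, `entropy`, `weighted_impurity`) and the toy "yes"/"no" labels are made up for illustration; they are not part of any library.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class fractions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: -sum of p * log2(p) over the classes present."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())

def weighted_impurity(children, impurity):
    """Average child impurity, weighted by each child's share of the samples."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * impurity(child) for child in children)

# Toy node with 5 "yes" and 5 "no", split into two mostly-pure children
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

gini_after = weighted_impurity([left, right], gini)
info_gain = entropy(parent) - weighted_impurity([left, right], entropy)

print(f"Gini: parent {gini(parent):.3f} -> children {gini_after:.3f}")
print(f"Entropy: parent {entropy(parent):.3f}, Information Gain {info_gain:.3f}")
```

Both measures drop after the split, which is why the algorithm would prefer this split over one that leaves the children as mixed as the parent.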

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.
- Pruning is the process of reducing the size of a Decision Tree by removing unnecessary branches or splits that add little predictive power.
- It helps the model generalize better on unseen data.
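- 1. Pre-Pruning (early stopping):
    - Pre-pruning stops the tree from growing too deep during its construction, using limits such as a maximum depth or a minimum number of samples per split.
    - That means the algorithm stops splitting a node before it becomes perfectly pure.
    - Practical advantage: training is faster and cheaper, because the unneeded branches are never built.
- 2. Post-Pruning:
    - Post-pruning first allows the tree to grow fully, and then cuts back branches that do not improve model performance on validation data.
    - Practical advantage: each branch is judged with the full tree in view, so a split that only pays off deeper down is less likely to be discarded too early.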
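
The two strategies can be sketched with scikit-learn as follows. This assumes cost-complexity pruning (`ccp_alpha`) as the post-pruning method, which is only one of several possible techniques, and the specific limits (`max_depth=3`, `min_samples_leaf=5`) are arbitrary values chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: growth is limited while the tree is being built
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow fully, then try each pruning strength from the path
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative round-off
    candidate = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    candidate.fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score >= best_score:  # on ties, keep the larger alpha (smaller tree)
        best_alpha, best_score = alpha, score

post = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42)
post.fit(X_train, y_train)

print("Full tree leaves:", full.get_n_leaves())
print("Pre-pruned leaves:", pre.get_n_leaves(), "depth:", pre.get_depth())
print("Post-pruned leaves:", post.get_n_leaves(), "alpha:", best_alpha)
```

In a real project the pruning strength would usually be chosen with cross-validation rather than a single validation split.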

Question 4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?
- Information Gain (IG) measures how much uncertainty (entropy) in the dataset is reduced after splitting the data based on a particular feature.
- Information Gain tells us how much a feature improves our prediction power by reducing uncertainty. The higher the gain, the better the split for the Decision Tree.
- Importance of Information Gain
   - Guides the splitting process: The algorithm calculates Information Gain for each feature and selects the one with the highest gain for splitting.
   - Improves purity: Higher Information Gain → child nodes are purer → better classification.
   - Reduces uncertainty: Helps the model learn which features are most informative.
   - Foundation for Entropy-based algorithms: Used in the ID3 and C4.5 Decision Tree algorithms (C4.5 uses a normalized variant, the Gain Ratio).
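
To show how Information Gain guides the choice of split, here is a rough sketch that finds, for each Iris feature, the threshold with the highest gain and then picks the best feature for the root. The helpers `entropy`, `information_gain` and `best_threshold` are hypothetical names written for this example, and trying midpoints between sorted unique values is a simplifying assumption about how candidate thresholds are generated.

```python
import numpy as np
from sklearn.datasets import load_iris

def entropy(y):
    """Entropy in bits of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(y, mask):
    """Parent entropy minus the weighted entropy of the two children."""
    n = len(y)
    n_left = int(mask.sum())
    n_right = n - n_left
    if n_left == 0 or n_right == 0:
        return 0.0  # a split that sends everything one way gains nothing
    children = (n_left / n) * entropy(y[mask]) + (n_right / n) * entropy(y[~mask])
    return entropy(y) - children

def best_threshold(x, y):
    """Try midpoints between consecutive unique values; return the best one."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2
    gains = [information_gain(y, x <= t) for t in candidates]
    best = int(np.argmax(gains))
    return float(candidates[best]), float(gains[best])

iris = load_iris()
X, y = iris.data, iris.target

results = {
    name: best_threshold(X[:, j], y)
    for j, name in enumerate(iris.feature_names)
}
for name, (threshold, gain) in results.items():
    print(f"{name}: best threshold {threshold:.2f}, Information Gain {gain:.3f}")

best_feature = max(results, key=lambda name: results[name][1])
print("Feature chosen for the root split:", best_feature)
```

The feature with the largest gain is the one an entropy-based tree would split on first; the same search is then repeated inside each child node.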

Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?
- A. Finance & Banking
   - Credit Risk Assessment: Predict whether a loan applicant will default.
      - Example: “If income < ₹30,000 and past defaults > 2 → likely to default.”
   - Fraud Detection: Detect unusual transaction patterns.
- B. Healthcare
   - Disease Diagnosis: Classify patients based on symptoms or test results (e.g., “Does this X-ray indicate pneumonia?”).
   - Treatment Recommendations: Suggest treatment plans based on patient characteristics.
- C. Education
   - Student Performance Prediction: Predict if a student will pass/fail based on attendance, marks, etc.
   - Dropout Analysis: Identify students at risk of dropping out.
- D. E-commerce & Marketing
   - Purchase Prediction: Identify likely buyers vs. non-buyers based on behavior.
   - Campaign Targeting: Decide which customers to send marketing emails to.
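- Advantages of Decision Trees:
  - Easy to understand and interpret
  - No need for feature scaling
  - Can handle numerical and categorical data (some libraries, including scikit-learn, require categorical features to be encoded as numbers first)
  - Requires little data preparation
  - Captures nonlinear relationships
  - Provides feature importance
  - Can handle missing values in some implementations (for example, via surrogate splits in CART)
  - Fast prediction
- Disadvantages:
  - Overfitting: a fully grown tree can memorize the training data
  - High variance: small changes in the data can produce a very different tree
  - Biased toward features with many categories (notably when splitting on Information Gain)
  - Often less accurate alone than ensembles such as Random Forests
  - Sensitive to noisy data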

Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances


In [1]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1️⃣ Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2️⃣ Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3️⃣ Create the Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# 4️⃣ Train the model
clf.fit(X_train, y_train)

# 5️⃣ Make predictions
y_pred = clf.predict(X_test)

# 6️⃣ Evaluate model accuracy
accuracy = accuracy_score(y_test, y_pred)

# 7️⃣ Print results
print("Decision Tree Classifier using Gini Criterion")
print("-------------------------------------------------")
print(f"Accuracy: {accuracy * 100:.2f}%")

# 8️⃣ Print feature importances
print("\nFeature Importances:")
for feature_name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")



Decision Tree Classifier using Gini Criterion
-------------------------------------------------
Accuracy: 100.00%

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876


Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.

In [2]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1️⃣ Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2️⃣ Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3️⃣ Train a fully-grown Decision Tree
full_tree = DecisionTreeClassifier(criterion='gini', random_state=42)
full_tree.fit(X_train, y_train)
y_pred_full = full_tree.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# 4️⃣ Train a Decision Tree with max_depth=3 (pre-pruned)
limited_tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
limited_tree.fit(X_train, y_train)
y_pred_limited = limited_tree.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# 5️⃣ Print accuracy comparison
print("Decision Tree Accuracy Comparison")
print("-----------------------------------")
print(f"Fully Grown Tree Accuracy: {accuracy_full * 100:.2f}%")
print(f"Tree with max_depth=3 Accuracy: {accuracy_limited * 100:.2f}%")

# 6️⃣ Optional: show feature importance for the pruned tree
print("\nFeature Importances (max_depth=3 Tree):")
for feature_name, importance in zip(iris.feature_names, limited_tree.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")


Decision Tree Accuracy Comparison
-----------------------------------
Fully Grown Tree Accuracy: 100.00%
Tree with max_depth=3 Accuracy: 100.00%

Feature Importances (max_depth=3 Tree):
sepal length (cm): 0.0000
sepal width (cm): 0.0000
petal length (cm): 0.9251
petal width (cm): 0.0749
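
Both trees score 100% on this particular 45-sample test split, which makes them hard to tell apart. As a complementary check, the sketch below compares the same two configurations with 5-fold cross-validation and also reports the size of each tree. The use of `cross_val_score` with `cv=5` is an assumption added here, not part of the original exercise.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "fully grown": DecisionTreeClassifier(criterion="gini", random_state=42),
    "max_depth=3": DecisionTreeClassifier(
        criterion="gini", max_depth=3, random_state=42
    ),
}

cv_means = {}
for name, model in models.items():
    # Accuracy averaged over 5 folds is less dependent on one lucky split
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()

    # Fit on all data only to inspect how large each tree becomes
    model.fit(X, y)
    print(
        f"{name}: CV accuracy {scores.mean():.3f} ± {scores.std():.3f}, "
        f"depth {model.get_depth()}, leaves {model.get_n_leaves()}"
    )
```

If the shallower tree matches the fully grown one under cross-validation as well, it is usually the better choice because it is simpler to read and less prone to overfitting.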


Question 8: Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances


In [3]:
# Import necessary libraries
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

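# 1️⃣ Load dataset
# Note: the question asks for Boston Housing, but load_boston was removed in
# scikit-learn 1.2, so the California Housing dataset is used here instead.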
data = fetch_california_housing()
X = data.data
y = data.target

# 2️⃣ Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3️⃣ Train Decision Tree Regressor
regressor = DecisionTreeRegressor(criterion='squared_error', random_state=42)
regressor.fit(X_train, y_train)

# 4️⃣ Predict on test data
y_pred = regressor.predict(X_test)

# 5️⃣ Evaluate performance using Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

# 6️⃣ Display results
print("Decision Tree Regressor Results (California Housing Dataset)")
print("-------------------------------------------------------------")
print(f"Mean Squared Error (MSE): {mse:.4f}")

# 7️⃣ Print feature importances
print("\nFeature Importances:")
for name, importance in zip(data.feature_names, regressor.feature_importances_):
    print(f"{name}: {importance:.4f}")


Decision Tree Regressor Results (California Housing Dataset)
-------------------------------------------------------------
Mean Squared Error (MSE): 0.5280

Feature Importances:
MedInc: 0.5235
HouseAge: 0.0521
AveRooms: 0.0494
AveBedrms: 0.0250
Population: 0.0322
AveOccup: 0.1390
Latitude: 0.0900
Longitude: 0.0888


Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy

In [7]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# 1️⃣ Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2️⃣ Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3️⃣ Define the Decision Tree model
clf = DecisionTreeClassifier(random_state=42)

# 4️⃣ Define the hyperparameter grid to search
param_grid = {
    'max_depth': [1, 2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5, 6]
}

# 5️⃣ Perform Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)

# 6️⃣ Get the best parameters and the best model
# (GridSearchCV has already refit it on the full training set, since refit=True by default)
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# 7️⃣ Make predictions on the test set
y_pred = best_model.predict(X_test)

# 8️⃣ Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)

# 9️⃣ Print results
print("GridSearchCV Results for Decision Tree Classifier")
print("--------------------------------------------------")
print(f"Best Parameters: {best_params}")
print(f"Test Set Accuracy: {accuracy * 100:.2f}%")


GridSearchCV Results for Decision Tree Classifier
--------------------------------------------------
Best Parameters: {'max_depth': 4, 'min_samples_split': 6}
Test Set Accuracy: 100.00%


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

In [10]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# 1️⃣ Load Iris dataset as a DataFrame
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# 2️⃣ Introduce some missing values for demonstration
np.random.seed(42)
for col in df.columns[:-1]:  # exclude target
    df.loc[df.sample(frac=0.1).index, col] = np.nan

# 3️⃣ Handle missing values
# (in a real project, fit the imputer on the training split only to avoid data leakage)
numerical_cols = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
numerical_cols.remove('target')  # exclude target
num_imputer = SimpleImputer(strategy='median')
df[numerical_cols] = num_imputer.fit_transform(df[numerical_cols])

# 4️⃣ Split features and target
X = df.drop('target', axis=1)
y = df['target']

# 5️⃣ Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 6️⃣ Train Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# 7️⃣ Evaluate the model
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# 8️⃣ Hyperparameter tuning with GridSearchCV
param_grid = {
    'max_depth': [2, 3, 4, None],
    'min_samples_split': [2, 3, 4, 5]
}
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
print("Best Parameters:", grid_search.best_params_)
print("Test Accuracy after tuning:", accuracy_score(y_test, best_model.predict(X_test)))


Accuracy: 0.9111111111111111

Confusion Matrix:
 [[18  0  1]
 [ 1 11  1]
 [ 1  0 12]]

Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.95      0.92        19
           1       1.00      0.85      0.92        13
           2       0.86      0.92      0.89        13

    accuracy                           0.91        45
   macro avg       0.92      0.91      0.91        45
weighted avg       0.92      0.91      0.91        45

Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
Test Accuracy after tuning: 0.9111111111111111
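
The Iris demonstration above has no categorical columns, so the encoding step from the question is not exercised. The following is a hedged sketch of how the full process could look on mixed data: impute, encode, train, tune and evaluate inside one scikit-learn `Pipeline`, so that the preprocessing is learned from the training folds only. The patient table is synthetic and every column name (`age`, `blood_pressure`, `smoker`, `sex`, `disease`) is made up for illustration; scoring on `f1` is likewise an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic, made-up patient data (hypothetical columns, illustration only)
rng = np.random.default_rng(42)
n = 600
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(125, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "sex": rng.choice(["F", "M"], n),
})
risk = (0.04 * (df["age"] - 50) + 0.05 * (df["blood_pressure"] - 125)
        + 1.0 * (df["smoker"] == "yes"))
df["disease"] = (risk + rng.normal(0, 1, n) > 0.5).astype(int)

# Remove roughly 10% of values in one numeric and one categorical column
df.loc[rng.random(n) < 0.1, "blood_pressure"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker", "sex"]

# Steps 1-2: median imputation for numbers; mode imputation + one-hot for categories
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Step 3: the tree is the last step of the pipeline
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X = df.drop(columns="disease")
y = df["disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Step 4: tune the tree; preprocessing is re-fit inside every CV fold
param_grid = {
    "tree__max_depth": [2, 3, 4, 5, None],
    "tree__min_samples_split": [2, 5, 10],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Step 5: evaluate once on the held-out test set
y_pred = search.predict(X_test)
print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
```

For a disease-prediction task, recall on the positive class is often the number to watch in the report, since a missed patient tends to cost more than a false alarm.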


- Business Value in Real-World Healthcare
  - Early Disease Detection: Helps doctors identify high-risk patients sooner.
  - Resource Optimization: Prioritizes medical testing and treatment for the patients most likely to have the disease.
  - Improved Patient Outcomes: Timely intervention reduces complications and hospital costs.
  - Decision Support for Clinicians: Provides interpretable rules (Decision Trees are explainable) to aid diagnosis.
  - Population Health Management: Identifies trends and high-risk groups across demographics.