Question 1: What is a Decision Tree, and how does it work in the context of classification?



*   **Decision Tree:-**
    A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. In classification, it predicts categorical outcomes by learning simple if-then decision rules from the features in the data.


*   **How It Works for Classification:-**

a. The process starts at the root node, where the data is partitioned based on the value of the most informative feature (using criteria like Gini impurity or information gain).

b. At each node, the feature and threshold that best separates the classes are chosen, and the node branches to different sub-nodes or leaves depending on the feature value.

c. This recursive splitting continues (a divide-and-conquer strategy) until the data in each subset is as homogeneous as possible with respect to the target class, or until a stopping rule applies (such as maximum tree depth or minimum leaf size).

d. The tree assigns a class label (or probability of classes) to new instances by routing them from the root down to a leaf, following the rules learned at each decision node.
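
The routing step in (d) can be sketched in a few lines of plain Python. This is only an illustration: the tree below is written by hand, and its feature names and thresholds are hypothetical rather than learned from data.

```python
# Hypothetical hand-written tree; thresholds are illustrative, not learned.
tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}

def predict(node, sample):
    """Route a sample from the root to a leaf and return the leaf's label."""
    while "label" not in node:
        # Go left when the feature value is at or below the threshold.
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

print(predict(tree, {"petal_length": 1.4, "petal_width": 0.2}))
print(predict(tree, {"petal_length": 5.1, "petal_width": 2.0}))
```

A trained tree works the same way, except that the features and thresholds at each node are chosen by the impurity criterion during training.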

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?



*  **Gini Impurity:-**

a. Gini Impurity measures the probability of incorrectly classifying a randomly chosen sample from the node if that sample were randomly labeled based on the class distribution.

b. For a node with $k$ classes and class probabilities $p_1, p_2, \dots, p_k$, the Gini Impurity is given by:

$$\text{Gini} = 1 - \sum_{i=1}^{k} p_i^2$$

c.  A Gini value of 0 indicates all samples in the node belong to a single class (pure node), while a higher Gini value indicates more class mixing (maximum value for binary class is 0.5).


*  **Entropy:-**

a. Entropy, in the context of Decision Trees, measures the level of impurity or disorder at a node. It represents the average amount of information (or surprise) required to classify a randomly picked instance in the node.

b. For class probabilities $p_1, p_2, \dots, p_k$, entropy is given by:

$$\text{Entropy} = -\sum_{i=1}^{k} p_i \log_2 p_i$$


c. A node is pure (no disorder) when entropy is 0. Higher entropy indicates greater class mixing and increased uncertainty; the maximum is $\log_2 k$ for $k$ equally likely classes (1 bit in the binary case).
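
As a quick numeric check of the two formulas, here is a minimal sketch in plain Python. The function names and the example probability vectors are chosen for illustration only.

```python
import math

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Entropy in bits; classes with zero probability contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# A pure node, a perfectly mixed binary node, and a mostly pure node.
for probs in ([1.0, 0.0], [0.5, 0.5], [0.9, 0.1]):
    print(probs, round(gini(probs), 4), round(entropy(probs), 4))
```

Both measures are 0 for the pure node and reach their binary maximum (0.5 for Gini, 1 bit for entropy) at the 50/50 node.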

*   **Impact on Decision Tree Splits:-**

a. Both Gini Impurity and Entropy are used to select the best feature for splitting by minimizing impurity (or maximizing the reduction in impurity).

b. For each candidate split, the algorithm computes the weighted average impurity (using Gini or Entropy) for the child nodes, and selects the split that results in the lowest weighted impurity.

c. While both measures tend to produce similar trees, Gini Impurity is often preferred for efficiency since it does not require logarithmic computations.

d. The two measures can occasionally rank candidate splits differently, so the choice of impurity measure can slightly change the tree structure, but in practice the overall classification performance is usually similar.
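
The split search described in (b) can be sketched for a single numeric feature as follows. This is a simplified illustration under stated assumptions (candidate thresholds are midpoints between consecutive distinct values, and the toy data is made up), not the exact procedure used by any particular library.

```python
from collections import Counter

def gini_from_labels(labels):
    """Gini impurity computed from a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Weight each child's impurity by its share of the samples.
        weighted = (len(left) * gini_from_labels(left)
                    + len(right) * gini_from_labels(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Made-up toy data: the classes separate cleanly between 2.0 and 3.0.
values = [1.0, 1.5, 2.0, 3.0, 3.5, 4.0]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_threshold(values, labels))
```

Swapping `gini_from_labels` for an entropy function turns the same loop into an information-gain search.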

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.



*   **Pre-Pruning (Early Stopping):-**

a.  Pre-pruning halts the growth of the decision tree during the training phase, stopping further splitting when a certain condition is met—such as maximum depth, minimum samples per leaf, or lack of further accuracy improvement.

b. Advantage: Training is faster and cheaper, which matters on large datasets, because unnecessary branches are never created; the smaller tree also carries a lower risk of overfitting.

*   **Post-Pruning (e.g., Reduced-Error or Cost-Complexity Pruning):-**

a. Post-pruning lets the tree grow fully and then removes branches that do not contribute enough, using techniques such as reduced-error pruning on a validation set or cost-complexity pruning, often with the pruning strength chosen by cross-validation.

b. Advantage: Better generalization. Post-pruning often leads to higher accuracy on unseen data, especially with small or noisy datasets, because each branch is judged on the predictive value it actually contributes once the full tree is available.
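
A minimal scikit-learn sketch of both approaches on the Iris data follows. The specific settings (`max_depth=3`, `min_samples_leaf=5`, 5-fold cross-validation to pick `ccp_alpha`) are assumptions made for illustration, not recommended values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Pre-pruning: growth stops early because of the depth and leaf-size limits.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow a full tree, then choose a cost-complexity alpha.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
# Clip at zero to guard against tiny negative values from rounding.
alphas = sorted({max(float(a), 0.0) for a in path.ccp_alphas})

def cv_accuracy(alpha):
    """Mean 5-fold accuracy on the training set for a given alpha."""
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    return cross_val_score(tree, X_train, y_train, cv=5).mean()

best_alpha = max(alphas, key=cv_accuracy)
post = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42)
post.fit(X_train, y_train)

for name, model in [("pre-pruned", pre), ("full", full), ("post-pruned", post)]:
    print(name, model.tree_.node_count, round(model.score(X_test, y_test), 4))
```

The printed node counts show how much each strategy shrinks the tree relative to the fully grown one.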

Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

a. **Information Gain (IG)** measures the reduction in entropy (uncertainty) of the target variable after splitting the data on a given feature.

b. Formula:

$$IG(D, A) = H(D) - H(D \mid A)$$

where:

$H(D)$ is the entropy of the original dataset $D$.

$H(D \mid A) = \sum_{v} \frac{|D_v|}{|D|} H(D_v)$ is the weighted average entropy of the subsets $D_v$ obtained by splitting $D$ on feature $A$.

c. The feature with the largest information gain is chosen for the split, as it produces the purest (most homogeneous) child nodes.
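
The formula can be checked on a small example. The sketch below uses a made-up six-row dataset (the `outlook`, `windy`, and `play` columns are hypothetical) and computes IG for two categorical features.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(D, A) = H(D) - sum over values v of |D_v| / |D| * H(D_v)."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Made-up toy data for illustration only.
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
windy = ["yes", "no", "yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes", "yes", "yes"]

print("IG(outlook):", round(information_gain(outlook, play), 4))
print("IG(windy):  ", round(information_gain(windy, play), 4))
```

Here `outlook` separates the classes perfectly, so its gain equals the full entropy of `play`, while `windy` leaves the class mix unchanged in both groups and gains nothing. A tree would therefore split on `outlook` first.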

Importance for Splits:-

a. Information Gain identifies which feature provides the most "explanatory power" for classifying the samples at each node.

b. By selecting the feature with the highest IG, the tree maximally reduces class impurity at each split, contributing to quick and effective separation of classes and improving overall classification performance.

c. This greedy criterion helps build compact, accurate trees, although it does not prevent overfitting by itself; pruning or depth limits are still needed for that.

Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

**Common Real-World Applications**

a. Credit Scoring & Risk Assessment: Used by banks and financial institutions to assess loan eligibility, evaluate credit risk, and segment borrowers based on attributes like credit history and income.

b. Medical Diagnosis: Helps doctors diagnose diseases by analyzing patient symptoms, medical test results, and risk factors to guide treatment decisions.

c. Fraud Detection: Flags suspicious transactions in finance by identifying deviations from typical user behavior or patterns.

d. Customer Segmentation & Marketing: Segments customers by demographics or purchasing behavior to optimize targeted marketing campaigns and retention strategies.

e. Manufacturing Quality Control: Predicts product defects and guides process improvements to reduce waste and maintain standards.

f. E-commerce & Retail: Powers product recommendation engines, enables dynamic pricing, and helps manage inventory and promotions based on user data and purchase trends.

g. Agriculture & Environmental Science: Predicts crop yields, optimizes planting schedules, and manages resources for precision farming.

Main Advantages:-

a. **Interpretability:** Decision Trees deliver outputs that are easy to visualize and interpret, enabling transparent decision-making and regulatory compliance.

b. **Minimal Data Preparation:** No feature scaling or normalization is required, and splits are relatively insensitive to outliers in the features. Some implementations can also handle missing values directly.

c. **Handles Mixed Data Types:** Can work with both categorical and numerical features, although some libraries (scikit-learn among them) require categorical features to be numerically encoded first.

d. **Suitability for Nonlinear Relationships:** Can capture complex, nonlinear patterns in the data.

Principal Limitations:-

a. **Instability:** Small changes in data can lead to significant changes in the tree structure, making Decision Trees less stable than some other algorithms.

b. **Overfitting:** Deep trees are prone to overfitting, especially when not properly pruned or regularized.

c. **Lower Predictive Performance (Standalone):** Often outperformed by ensemble methods (e.g., Random Forests, Gradient Boosting) in terms of raw accuracy, especially on complex datasets.

d. **Bias Toward Features with Many Categories:** Trees can be biased towards variables with numerous levels, sometimes leading to less meaningful splits.

Question 6: Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier using the Gini criterion

● Print the model’s accuracy and feature importances.

In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and test sets (75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Initialize Decision Tree Classifier with Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Predict on test data
y_pred = clf.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on test data: {accuracy:.4f}")

# Print feature importances
feature_importances = clf.feature_importances_
for feature_name, importance in zip(iris.feature_names, feature_importances):
    print(f"{feature_name}: {importance:.4f}")


Accuracy on test data: 1.0000
sepal length (cm): 0.0000
sepal width (cm): 0.0179
petal length (cm): 0.8997
petal width (cm): 0.0824


Question 7: Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.

In [2]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and test sets (75% training, 25% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Train Decision Tree with max_depth=3
clf_max_depth = DecisionTreeClassifier(max_depth=3, criterion='gini', random_state=42)
clf_max_depth.fit(X_train, y_train)
y_pred_max_depth = clf_max_depth.predict(X_test)
accuracy_max_depth = accuracy_score(y_test, y_pred_max_depth)

# Train fully grown Decision Tree (no max_depth limit)
clf_full = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Print accuracies for comparison
print(f"Accuracy with max_depth=3: {accuracy_max_depth:.4f}")
print(f"Accuracy with fully grown tree: {accuracy_full:.4f}")


Accuracy with max_depth=3: 1.0000
Accuracy with fully grown tree: 1.0000


Question 8: Write a Python program to:

● Load the California Housing dataset from sklearn

● Train a Decision Tree Regressor

● Print the Mean Squared Error (MSE) and feature importances.

In [3]:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Split dataset into training and test sets (75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Initialize Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)

# Train the regressor
regressor.fit(X_train, y_train)

# Predict on test set
y_pred = regressor.predict(X_test)

# Calculate and print Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE) on test data: {mse:.4f}")

# Print feature importances
for feature_name, importance in zip(housing.feature_names, regressor.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")


Mean Squared Error (MSE) on test data: 0.5285
MedInc: 0.5262
HouseAge: 0.0509
AveRooms: 0.0482
AveBedrms: 0.0280
Population: 0.0369
AveOccup: 0.1349
Latitude: 0.0880
Longitude: 0.0868


Question 9: Write a Python program to:

● Load the Iris Dataset

● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV

● Print the best parameters and the resulting model accuracy.

In [4]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Define parameter grid for tuning max_depth and min_samples_split
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10]
}

# Set up GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')

# Fit GridSearchCV on training data to find the best parameters
grid_search.fit(X_train, y_train)

# Print the best parameters found
print("Best Parameters:", grid_search.best_params_)

# Use the best estimator to predict on the test data
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Calculate and print test accuracy
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy of Best Model: {test_accuracy:.4f}")


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Test Accuracy of Best Model: 1.0000


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:

● Handle the missing values

● Encode the categorical features

● Train a Decision Tree model

● Tune its hyperparameters

● Evaluate its performance

And describe what business value this model could provide in the real-world setting.

1. Handle Missing Values
Identify missing data patterns using data exploration techniques.

For numerical features, impute missing values with the median (robust to outliers) or the mean (reasonable when the distribution is roughly symmetric and free of extreme values).

For categorical features, impute with the most frequent category (mode) or create a special category like "Unknown."

Alternatively, use more advanced imputation like K-Nearest Neighbors or iterative imputation if the dataset and time allow.

2. Encode Categorical Features
Use One-Hot Encoding for nominal categorical variables to avoid ordinal assumptions.

For high-cardinality categorical features, consider target encoding or frequency encoding carefully, ensuring no leakage in cross-validation.

Ensure the encoding is consistent across training and test data (fit encoder on train, transform both sets).

3. Train a Decision Tree Model
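Split the dataset into training and hold-out test sets (stratified by the disease label) before fitting any imputer or encoder, so that preprocessing statistics are learned from the training data only.

Use a Decision Tree Classifier (e.g., from scikit-learn), which works with a mix of numerical and encoded categorical features and needs no feature scaling.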

Train the model on the training set using the Gini impurity or entropy criterion.

4. Tune Hyperparameters
Tune key hyperparameters such as max_depth, min_samples_split, and min_samples_leaf.

Use GridSearchCV or RandomizedSearchCV with cross-validation to reliably find the best hyperparameters.

Monitor performance metrics relevant to healthcare, such as accuracy, precision, recall, and F1-score, during tuning.

5. Evaluate Model Performance
Evaluate final model on a hold-out test set using metrics aligned with clinical goals (e.g., high recall to avoid missing disease cases).

Consider the confusion matrix to understand types of errors.

Report ROC-AUC when the model's predicted probabilities, rather than hard labels, are used to rank or threshold patients.
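
The five steps above can be tied together in one scikit-learn pipeline. The sketch below is an illustration under stated assumptions: the patient data is synthetic, the column names (`age`, `blood_pressure`, `sex`, `smoker`) and the rule that generates the label are hypothetical, and the small parameter grid is only an example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic, hypothetical patient data (not real clinical data).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, size=n).astype(float),
    "blood_pressure": rng.normal(125, 15, size=n),
    "sex": rng.choice(["F", "M"], size=n),
    "smoker": rng.choice(["yes", "no"], size=n),
})
y = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)
# Inject missing values after the label has been generated.
df.loc[rng.choice(n, size=40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, size=40, replace=False), "smoker"] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=0
)

# Steps 1-2: impute and encode, fitted on training folds only.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["sex", "smoker"]),
])

# Step 3: the tree, with class weights to counter class imbalance.
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=0)),
])

# Step 4: tune with cross-validation, optimizing recall.
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_leaf": [1, 5, 20],
}
search = GridSearchCV(model, param_grid, scoring="recall", cv=5)
search.fit(X_train, y_train)

# Step 5: evaluate once on the hold-out test set.
y_pred = search.predict(X_test)
test_recall = recall_score(y_test, y_pred)
print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
```

Keeping the imputers and the encoder inside the pipeline means they are refitted within each cross-validation fold, which avoids leaking information from validation data into the preprocessing step.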

Business Value in Real-World Healthcare:-

a. Early and accurate disease prediction can enable timely clinical interventions, improving patient outcomes.

b. The model can assist healthcare providers by flagging high-risk patients for further testing and monitoring, optimizing resource allocation.

c. Automating parts of diagnostic evaluation reduces human workload and potential for error.

d. Understanding feature importances enhances transparency for clinicians, aiding trust and adoption.

e. Ultimately, this predictive modeling supports personalized and proactive patient care, potentially reducing healthcare costs by preventing disease progression.

