**Question 1** - What is a Decision Tree, and how does it work in the context of classification?

**Answer** - A Decision Tree is a flowchart-like model used in machine learning to make decisions or predictions. In the context of classification, it helps sort data into categories (like "spam" or "not spam") by asking a series of yes/no or multiple-choice questions about the input features.

It starts at a root node and splits the data based on the feature that best separates the classes. Each internal node asks a question, and each branch represents an answer. The process continues until it reaches a leaf node, which gives the final classification.
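The root-to-leaf flow described above can be made visible by printing a small fitted tree as text. This is a minimal sketch, assuming scikit-learn is available and using the Iris dataset that appears in the later questions; the `max_depth=2` limit is chosen here only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree so the printed rules stay readable.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Each indented "feature <= threshold" line is a question asked at a node;
# each "class: ..." line is a leaf that gives the final classification.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Reading the output from top to bottom follows one sample from the root node, through the branches, to a leaf.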

**Question 2** - Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

**Answer** - Gini Impurity and Entropy are two commonly used measures in decision trees to decide how "pure" or "impure" a node is — that is, how mixed the classes are at that point.

 1. Gini Impurity:

Measures the probability of misclassifying a randomly chosen element.

Formula:

$$\text{Gini} = 1 - \sum_i p_i^2$$

where $p_i$ is the probability of class $i$.

Range: 0 (pure) to 0.5 (most impure with two classes, i.e. a 50/50 split). With K classes the maximum is 1 − 1/K.

Slightly cheaper to compute than entropy, since it needs no logarithm.

 2. Entropy:

Measures the level of disorder or uncertainty in the node.

Formula:

$$\text{Entropy} = -\sum_i p_i \log_2(p_i)$$

Range: 0 (pure) to 1 (for a 50/50 split in binary classification).

Based on information theory.

-  How They Impact Splits in a Decision Tree:

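At each node, the tree tries different features and thresholds and chooses the split that reduces impurity the most. The reduction is the parent's impurity minus the size-weighted average impurity of the child nodes; it is called information gain when entropy is used and Gini decrease when Gini is used.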

This helps the tree grow in a way that groups similar items together more effectively.
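To make the two measures and their effect on a split concrete, here is a small sketch in plain Python. The helper names (`gini`, `entropy`, `impurity_decrease`) and the class counts are made up for illustration; they are not part of any library.

```python
import math

def gini(counts):
    """Gini impurity from a list of per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy in bits; empty classes contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def impurity_decrease(parent, children, impurity):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

# Hypothetical node: 5 samples of each class, split into two children.
parent = [5, 5]
children = [[4, 0], [1, 5]]

print("Gini(parent):", gini(parent))        # 0.5
print("Entropy(parent):", entropy(parent))  # 1.0
print("Gini decrease:", round(impurity_decrease(parent, children, gini), 3))
print("Information gain:", round(impurity_decrease(parent, children, entropy), 3))
```

A tree-building algorithm would evaluate this decrease for every candidate split and keep the largest one.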


**Question 3** - What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

**Answer** - **Pre-Pruning**

**Definition**: Pre-pruning (also called early stopping) is the process of stopping the growth of the decision tree before it becomes too complex, based on certain conditions (like maximum depth, minimum samples per leaf, or minimum information gain).

**Post-Pruning**

**Definition**: Post-pruning is the process of growing the full decision tree first and then removing or trimming the less significant branches after the tree is built, to reduce overfitting.

**Practical Advantage**:

Pre-Pruning: It helps reduce computation time and memory usage because the tree is never allowed to grow too large.

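Post-Pruning: It often results in a more accurate and better-generalizing model. Because pruning happens after the full tree is visible, it can keep a useful split that sits below a weak one, which early stopping would have cut off, and remove only the branches that contribute little to prediction accuracy.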
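Both styles can be sketched with scikit-learn. Pre-pruning is expressed through limits such as `max_depth` and `min_samples_leaf`, while post-pruning is available as cost-complexity pruning through `ccp_alpha`. The specific limits below, and the choice of an alpha from the middle of the pruning path, are illustrative assumptions; in practice alpha would normally be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: growth stops early because of limits set before fitting.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree first, then collapse weak branches.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice only: an alpha from the middle of the pruning path.
alpha = max(0.0, path.ccp_alphas[len(path.ccp_alphas) // 2])
post_pruned = DecisionTreeClassifier(
    ccp_alpha=alpha, random_state=42
).fit(X_train, y_train)

for name, model in [
    ("full", full_tree),
    ("pre-pruned", pre_pruned),
    ("post-pruned", post_pruned),
]:
    print(name, "leaves:", model.get_n_leaves(),
          "test accuracy:", model.score(X_test, y_test))
```

Comparing the leaf counts shows how much each strategy simplifies the tree relative to the fully grown one.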

**Question 4** - What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

**Answer** - **Information Gain in Decision Trees:**

**Definition**:
Information Gain is a measure used in decision trees to determine how well a particular feature separates the data into target classes. It calculates the reduction in entropy (or uncertainty) about the target variable after splitting the dataset based on a specific attribute.

In simple terms, it tells us how much “information” about the target class we gain by splitting the data on a given feature.

 **Importance for Choosing the Best Split**:

Information Gain is important because it helps the decision tree choose the most informative attribute at each node.

A higher Information Gain means the feature reduces more uncertainty and creates purer subsets.

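At each node, the algorithm selects the feature (and, for numeric features, the threshold) with the highest Information Gain. The choice is greedy: it is the best split locally, which usually yields a compact and accurate tree but does not guarantee the globally optimal one.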
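As a worked illustration, Information Gain can be written as follows. The symbols are notation assumed here: $S$ is the set of samples at the node, $A$ the attribute being tested, $S_v$ the subset sent to branch $v$, and $H$ the entropy.

$$IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} \, H(S_v)$$

Suppose a node holds 8 samples, 4 positive and 4 negative, so $H(S) = 1$. A hypothetical split sends 3 positives and 1 negative to the left branch and 1 positive and 3 negatives to the right branch. Each branch then has

$$H(S_v) = -\left(\tfrac{3}{4}\log_2\tfrac{3}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4}\right) \approx 0.811$$

and the gain is

$$IG(S, A) = 1 - \left(\tfrac{4}{8}\cdot 0.811 + \tfrac{4}{8}\cdot 0.811\right) \approx 0.189$$

A split that separated the classes perfectly would instead give $H(S_v) = 0$ in both branches and a gain of 1, the maximum for a balanced binary node.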

**Question 5** - What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

**Answer** - **Real-World Applications of Decision Trees**

Decision trees are used in many practical situations because they are simple to understand and easy to apply. For example, in medical diagnosis, they help doctors predict diseases or suggest treatments based on patient data. In finance and banking, they assist with credit scoring, loan approvals, and detecting fraudulent activities. Companies also use decision trees in marketing to segment customers, predict their purchasing behavior, and plan targeted campaigns. In manufacturing, they can identify causes of defects or predict machine failures for better maintenance planning. Even in education, decision trees are useful for predicting student performance or identifying those at risk of dropping out.

 **Main Advantages of Decision Trees**

One major advantage is that decision trees are easy to understand and interpret because their results can be visualized like a simple flowchart. They also don't require feature scaling, since a split depends only on the ordering of values, which keeps preprocessing light. Another benefit is that the method can, in principle, handle both categorical and numerical data; note that some implementations, including scikit-learn's, still expect categorical features to be encoded as numbers first. Additionally, decision trees help in identifying the most important features in a dataset, which is valuable for analysis and feature selection.

**Main Limitations of Decision Trees**

Despite their usefulness, decision trees have some limitations. They are prone to overfitting, meaning they can become too complex and learn noise from the training data. They can also be unstable, as small changes in the dataset might result in a completely different tree structure. Another issue is that split criteria such as information gain can be biased towards features with many levels, because such features offer more ways to partition the data. Lastly, a single decision tree is often less accurate than ensemble methods like Random Forests or Gradient Boosted Trees.

**Question 6** -  Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier using the Gini criterion

● Print the model’s accuracy and feature importances

**Answer** -

In [1]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train a Decision Tree Classifier using the Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# 4. Predict on the test set
y_pred = clf.predict(X_test)

# 5. Calculate and print the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# 6. Print feature importances
print("Feature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Model Accuracy: 1.0
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


Explanation:

We load the Iris dataset using load_iris().

We split it into training and testing data.

We train a decision tree classifier using the Gini impurity criterion.

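Finally, we print the accuracy of the model and the feature importances, which tell us how much each feature contributed to the decision-making process. In this run the test accuracy is 1.0 on the 30 held-out samples, and petal length accounts for about 0.91 of the total importance.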

**Question 7** -  Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.


**Answer** -

In [2]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train a Decision Tree Classifier with max_depth = 3
clf_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_limited.fit(X_train, y_train)
y_pred_limited = clf_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# 4. Train a fully-grown Decision Tree Classifier
clf_full = DecisionTreeClassifier(random_state=42)  # no depth limit
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# 5. Print and compare accuracies
print("Accuracy with max_depth=3:", accuracy_limited)
print("Accuracy with fully-grown tree:", accuracy_full)


Accuracy with max_depth=3: 1.0
Accuracy with fully-grown tree: 1.0


Explanation:

We first load the Iris dataset and split it into training and testing data.

Then we train one decision tree with a limited depth (max_depth=3) and another without depth limitation (fully-grown).

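Finally, we compare their accuracies on the test set. On this split both trees score 1.0, so limiting the depth to 3 costs no accuracy here while producing a simpler tree.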
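A single 30-sample test set cannot separate two models that both score 1.0. One possible follow-up, sketched here as an assumption rather than as part of the original exercise, is to compare the two settings with 5-fold stratified cross-validation on the whole dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Shuffled, stratified folds so every fold contains all three classes.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for label, depth in [("max_depth=3", 3), ("fully grown", None)]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    results[label] = scores
    print(f"{label}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The mean and spread across folds give a steadier comparison than one split.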

**Question 8** -  Write a Python program to:

● Load the California Housing dataset from sklearn

● Train a Decision Tree Regressor

● Print the Mean Squared Error (MSE) and feature importances


**Answer** -

In [3]:
# Import necessary libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# 1. Load the California Housing dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target

# 2. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# 4. Make predictions on the test set
y_pred = regressor.predict(X_test)

# 5. Calculate and print the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# 6. Print feature importances
print("\nFeature Importances:")
for feature, importance in zip(housing.feature_names, regressor.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Mean Squared Error (MSE): 0.495235205629094

Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


Explanation:

The dataset is loaded with fetch_california_housing().

We split the data into training and testing sets.

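A Decision Tree Regressor is trained to predict the median house value of each district, which this dataset expresses in units of $100,000.

The Mean Squared Error (MSE) evaluates how close the predictions are to actual values (lower = better). Here the MSE of about 0.495 corresponds to an RMSE of about 0.70, i.e. a typical error of roughly $70,000.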

Feature importances show which features contribute most to the prediction — here, median income (MedInc) is the most important.

**Question 9** -  Write a Python program to:
● Load the Iris Dataset

● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV

● Print the best parameters and the resulting model accuracy


**Answer** -

In [4]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Set up the Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# 4. Define the parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5, 10]
}

# 5. Perform GridSearchCV
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# 6. Get the best parameters and the best model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# 7. Predict on the test set and calculate accuracy
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# 8. Print the results
print("Best Parameters:", best_params)
print("Model Accuracy with Best Parameters:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy with Best Parameters: 1.0


Explanation:

We load the Iris dataset and split it into training and test sets.

A parameter grid is defined for max_depth and min_samples_split.

GridSearchCV tries all 25 combinations (5 values of max_depth × 5 values of min_samples_split) using 5-fold cross-validation, which means 125 cross-validation fits, to find the best parameters.

Finally, we evaluate the tuned model on the test set and print the best hyperparameters and accuracy.

Tuning in this way selects the hyperparameters that perform best under cross-validation, which reduces the risk of overfitting. Note that a 30-sample test set gives only a coarse estimate of accuracy.

**Question 10** -  Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

**Answer** - 1) Handle missing values — step by step

Understand the missingness

Check patterns: is data MCAR (missing completely at random), MAR (missing at random), or MNAR (not at random)? Use missingness maps and cross-tabulations with other features/target.

If missingness correlates with the target or a feature, treat missingness as informative (don’t blindly drop it).

Decide what to drop vs impute

Drop a column only if it is mostly missing (a cutoff somewhere in the 60–90% range is a common rule of thumb) and it is neither recoverable nor useful.

Drop rows only if very few and no pattern of informativeness.

Imputation strategies (choose per column type and context)

Numerical: median (robust), mean (if symmetric), or model-based (KNNImputer, IterativeImputer) if relationships exist.

Categorical: treat missing as its own category ("__MISSING__") or impute with the most frequent value if missingness is non-informative.

If missingness is informative: add boolean indicator features (e.g., age_missing = 1) before imputation.

Prevent data leakage

Fit imputers only on training folds or training set within cross-validation — use Pipeline/ColumnTransformer.

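Advanced: for complex missing patterns, use model-based imputation (e.g., IterativeImputer, which can be run several times with sample_posterior=True and different seeds to approximate multiple imputation) or domain-specific rules (e.g., a lab test is absent because it was never ordered).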
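The imputation and leakage points above can be sketched with scikit-learn's `SimpleImputer`. The two columns (age and glucose) and their values are hypothetical. The imputer is fitted on the training rows only and then reused on the test rows, and `add_indicator=True` appends a 0/1 "was missing" column for each feature that had missing values during fitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical columns: [age, glucose]; np.nan marks a missing value.
X_train = np.array([
    [34.0, 5.1],
    [np.nan, 6.3],
    [58.0, np.nan],
    [41.0, 5.6],
])
X_test = np.array([
    [np.nan, 7.0],
    [47.0, np.nan],
])

# Median imputation plus missingness indicator columns.
imputer = SimpleImputer(strategy="median", add_indicator=True)

# Medians are learned from the training rows only (no leakage) ...
X_train_imputed = imputer.fit_transform(X_train)
# ... and the same medians are applied to the test rows.
X_test_imputed = imputer.transform(X_test)

print("Training medians:", imputer.statistics_)
print(X_test_imputed)
```

The output has four columns: the two imputed features followed by one indicator per feature.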

2) Encode categorical features — step by step

Identify variable type: ordinal vs nominal vs high-cardinality.

Choose encoding:

Nominal with few levels: one-hot encoding (OneHotEncoder(handle_unknown='ignore')) is a safe default; trees cope reasonably well with the resulting sparse, higher-dimensional features.

Ordinal: OrdinalEncoder with domain order.

High-cardinality: frequency encoding, target/mean encoding with out-of-fold (CV) scheme to avoid leakage, or hashing/binary encoding.

Avoid leakage: if using target encoding, implement it inside CV or use libraries that support out-of-fold target encoding.

Keep missing category if meaningful (see missingness step).

Note: Decision trees do not require scaling.
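A minimal sketch of the nominal and ordinal cases, using scikit-learn's `ColumnTransformer`. The two columns (a blood type and a severity grade) and the category order are invented for illustration. `sparse_threshold=0` is set only so the result prints as a dense array.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical columns: 0 = blood_type (nominal), 1 = severity (ordinal).
X_train = np.array(
    [["A", "low"], ["B", "high"], ["O", "medium"], ["A", "low"]],
    dtype=object,
)
# "AB" never appears in training, so it is an unknown category at test time.
X_test = np.array([["AB", "medium"]], dtype=object)

encoder = ColumnTransformer(
    [
        # Nominal: one column per level; unknown levels become all zeros.
        ("nominal", OneHotEncoder(handle_unknown="ignore"), [0]),
        # Ordinal: integers that follow the stated domain order.
        ("ordinal", OrdinalEncoder(categories=[["low", "medium", "high"]]), [1]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)

print(X_train_encoded)
print(X_test_encoded)
```

The unseen blood type encodes as three zeros instead of raising an error, which is the behaviour `handle_unknown='ignore'` is meant to provide.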

3) Train the Decision Tree model — step by step

Baseline model: fit a simple DecisionTreeClassifier, with criterion set to either 'gini' or 'entropy' and a fixed random_state, to get a baseline.

Pipeline: compose preprocessing (imputers + encoders) with the classifier in a single Pipeline so transforms are applied consistently inside CV.

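Class imbalance: if disease prevalence is low, either use class_weight='balanced' (or provide sample weights), or try resampling (SMOTE or undersampling). Resampling must be applied to the training folds only, never to validation or test data, so it belongs inside the cross-validated pipeline.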

Feature engineering: create clinically useful features (e.g., BMI from weight/height, lab ratios, time since last visit). Keep clinicians involved.
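The preprocessing and training steps can be combined into one pipeline so that imputers and encoders are refitted inside every cross-validation fold. The sketch below uses synthetic data as a stand-in for patient records; the column names (`age`, `glucose`, `smoker`), the label rule, and the 10% missingness rate are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data; every column name is hypothetical.
rng = np.random.default_rng(0)
n = 400
glucose = rng.normal(6.0, 1.5, n)
# Rare positive class, loosely driven by glucose plus noise.
y = (glucose + rng.normal(0.0, 1.0, n) > 8.0).astype(int)

smoker = rng.choice(["yes", "no"], n).astype(object)
# Remove roughly 10% of the values in two columns.
glucose[rng.random(n) < 0.1] = np.nan
smoker[rng.random(n) < 0.1] = np.nan

df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "glucose": glucose,
    "smoker": smoker,
})

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median", add_indicator=True),
     ["age", "glucose"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="__MISSING__")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(
        class_weight="balanced", max_depth=4, random_state=42)),
])

# Stratified folds keep the disease rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, df, y, cv=cv, scoring="recall")
print("Recall per fold:", scores.round(3))
```

Because the preprocessing lives inside the pipeline, the medians and category lists are learned from the training part of each fold only.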

4) Tune hyperparameters — step by step

Which hyperparameters matter: max_depth, min_samples_split, min_samples_leaf, max_features, max_leaf_nodes, ccp_alpha (cost-complexity pruning), criterion.

Search strategy:

Start with RandomizedSearchCV to explore wide ranges, then refine with GridSearchCV around good regions.

Use StratifiedKFold (maintain target distribution) for CV.

Scoring: choose scoring aligned with business goals:

For disease detection, recall (sensitivity) is often the priority because false negatives are the costliest error; ROC AUC and PR AUC are useful summaries for imbalanced targets.

Use multiple metrics (precision, recall, F1, AUC) for a fuller view.

Nested CV: consider nested CV for unbiased generalization estimate if you must pick a model and estimate performance from the same dataset.

Calibration: decision trees can produce poorly calibrated probabilities — use CalibratedClassifierCV (Platt/isotonic) if probabilities are used for decision thresholds.
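A sketch of the search strategy described above, using `RandomizedSearchCV` with stratified folds and two metrics. The data comes from `make_classification` as a stand-in for a real, imbalanced patient dataset, and the parameter ranges are illustrative guesses rather than recommended values. The `average_precision` scorer is used here as a summary of the precision-recall curve.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data: roughly 10% positives.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

# Illustrative ranges only; real ranges depend on the dataset.
param_distributions = {
    "max_depth": [3, 5, 8, None],
    "min_samples_split": [2, 10, 30],
    "min_samples_leaf": [1, 5, 20],
    "ccp_alpha": [0.0, 0.001, 0.01],
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_distributions,
    n_iter=20,                       # sample 20 of the 216 combinations
    scoring={"recall": "recall", "pr_auc": "average_precision"},
    refit="recall",                  # final model is chosen by recall
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X, y)

best = search.best_index_
print("Best parameters:", search.best_params_)
print("CV recall:", round(search.best_score_, 3))
print("CV PR AUC:", round(search.cv_results_["mean_test_pr_auc"][best], 3))
```

A narrower `GridSearchCV` around the best region found here would be the refinement step.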

5) Evaluate performance — step by step

Train/validation/test split:

Reserve a held-out test set (stratified) that is untouched until final evaluation. If data is time-ordered, use time-based split.

Primary metrics:

Confusion matrix (TP, FP, FN, TN), Precision, Recall (sensitivity), Specificity, F1.

ROC AUC and Precision-Recall AUC (PR AUC is often more informative for rare disease).

If business has explicit costs, compute expected cost / utility using a cost matrix.

Threshold selection:

Choose probability threshold based on maximizing business utility (trade-off between recall and precision), not always 0.5.

Explainability & trust:

Use tree visualization (plot_tree) and feature importances.

Use SHAP or LIME for local, per-patient explanations; the tree's own root-to-leaf paths already give rules that clinicians can read.

Robustness & fairness:

Test model across subgroups (age, gender, ethnicity) for disparate performance.

External validation:

Validate on data from a different hospital/time period if available.

Monitoring after deployment:

Monitor drift (feature distribution, model performance), label distribution, and collect feedback for retraining.
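The metric and threshold steps above can be sketched as follows. The data is synthetic, and the cost ratio (a missed case assumed to cost ten times as much as a false alarm) is a hypothetical business input. The threshold is chosen on a validation split; a separate untouched test set would still be needed for the final report.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import (
    average_precision_score, confusion_matrix, roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

clf = DecisionTreeClassifier(
    max_depth=4, class_weight="balanced", random_state=42
).fit(X_train, y_train)
proba = clf.predict_proba(X_val)[:, 1]

# Hypothetical costs: a false negative is assumed 10x worse than a false positive.
COST_FN, COST_FP = 10.0, 1.0

def expected_cost(threshold):
    """Total cost on the validation split at a given probability threshold."""
    pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_val, pred, labels=[0, 1]).ravel()
    return COST_FN * fn + COST_FP * fp

thresholds = np.linspace(0.05, 0.95, 19)
best_threshold = min(thresholds, key=expected_cost)

pred = (proba >= best_threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_val, pred, labels=[0, 1]).ravel()
recall = tp / (tp + fn)
specificity = tn / (tn + fp)

print("Chosen threshold:", round(float(best_threshold), 2))
print("Recall:", round(recall, 3), "Specificity:", round(specificity, 3))
print("ROC AUC:", round(roc_auc_score(y_val, proba), 3))
print("PR AUC (average precision):", round(average_precision_score(y_val, proba), 3))
```

Changing the cost ratio moves the chosen threshold, which is how the business trade-off between recall and precision enters the model.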

6) Deployment, governance, and safety notes

Ensure compliance with health data regulations (HIPAA/GDPR), and put access logging and encryption in place.

Involve clinicians to review model rules and flagged high-risk cases.

Maintain a feedback loop: capture outcomes to continuously retrain.

Build human-in-the-loop workflow for critical decisions — model flags cases, clinicians verify.

7) Business value (why this model matters)

Early detection / triage: identify patients likely to have disease so they can receive timely diagnostic testing or treatment.

Resource optimization: prioritize limited diagnostic resources (imaging, specialist referral) to highest-risk patients.

Cost savings: reduce expensive late-stage treatments through earlier intervention.

Improved outcomes: quicker intervention can improve morbidity/mortality metrics.

Operational: automate routine screening, reduce clinician workload for low-risk patients.

Explainability: decision trees yield human-readable decision rules that help clinician acceptance and auditability.