# **Question 1: What is a Decision Tree, and how does it work in the context of classification?**

A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. In the context of classification, it works by breaking down a dataset into smaller subsets based on feature values, following a tree-like structure.

**Structure:**

* Each internal node represents a decision based on a feature (e.g., "Is petal length > 2.5 cm?").

* Each branch represents the outcome of that decision.

* Each leaf node represents a class label (final prediction).

**Working:**

The algorithm starts at the root node and recursively splits the dataset on the feature and threshold that most reduce impurity (measured by criteria such as Gini impurity or Entropy). This continues until every node is pure (all samples in a node belong to one class) or a stopping condition (like maximum depth) is reached.

**Example:**

In the Iris dataset, a decision tree might first split on petal length to separate Setosa from the other species, then further split on petal width to distinguish Versicolor and Virginica.

**Key point:** A Decision Tree classifies data by asking a sequence of "yes/no" questions about features until it reaches a conclusion.
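The sequence of questions can be made visible by printing a fitted tree as text. The sketch below is only an illustration: it assumes scikit-learn is installed, fits on the full Iris data without a train/test split, and uses an arbitrary `max_depth=2` to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short enough to read
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders each internal node as a threshold test on one feature
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line in the printout is one threshold test (an internal node), and each `class:` line is a leaf.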

# **Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?**

**Gini Impurity and Entropy in Decision Trees**

When building a Decision Tree, the algorithm must decide which feature and threshold to split on at each step. To do this, it uses impurity measures to evaluate how “pure” a node is (i.e., how mixed the classes are). Two commonly used impurity measures are Gini Impurity and Entropy.

**Gini Impurity**

**Definition:**

Gini impurity measures the probability that a randomly chosen sample would be misclassified if it were randomly labeled according to the distribution of labels in the node.

**Range:**

0 → node is pure (only one class).

Maximum (0.5 for binary classification) → highly impure, classes evenly mixed.

**Entropy**

**Definition:**

Entropy measures the amount of uncertainty or disorder in the dataset. It comes from information theory.

**Range:**

0 → node is pure (all samples belong to one class).

Maximum (1 for binary classification when classes are 50/50) → maximum disorder.
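For class proportions p_i in a node, the standard formulas are Gini = 1 - Σ p_i² and Entropy = -Σ p_i log2(p_i). A minimal sketch of both follows; the function names are illustrative and not taken from any library.

```python
import math


def gini(proportions):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions."""
    return 1.0 - sum(p * p for p in proportions)


def entropy(proportions):
    """Shannon entropy in bits: sum(-p_i * log2(p_i)), skipping empty classes."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)


# Pure node, mostly-pure node, and evenly mixed node
for dist in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(dist, "gini =", round(gini(dist), 3), "entropy =", round(entropy(dist), 3))
```

Both measures are 0 for the pure node and reach their binary maximum (0.5 for Gini, 1 bit for Entropy) at the 50/50 split, matching the ranges given above.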

**Impact on Splits**

* The Decision Tree algorithm tries to reduce impurity at each split.

* It calculates the Information Gain (reduction in impurity) for each possible split and chooses the feature/threshold with the highest gain.

* Using Gini or Entropy usually leads to similar trees, but:

  * Gini is slightly cheaper to compute because it needs no logarithm, and it tends to isolate the most frequent class in its own branch.

  * Entropy comes from information theory, requires a logarithm, and tends to produce slightly more balanced trees.

**In short:**

* Both Gini and Entropy measure how mixed the classes are in a node.

* Decision Trees split data on the feature and threshold that produce the largest reduction in impurity, which makes the child nodes more homogeneous than their parent.

# **Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**

**Pre-Pruning vs Post-Pruning in Decision Trees**

Decision Trees have a tendency to overfit the training data if allowed to grow without restrictions. To address this, pruning techniques are used to control tree growth and improve generalization.

**Pre-Pruning (Early Stopping)**

**Definition:**

Pre-pruning stops the growth of the tree early, before it becomes overly complex. The algorithm imposes constraints such as:

* Maximum tree depth (max_depth)

* Minimum number of samples required to split a node (min_samples_split)

* Minimum samples per leaf (min_samples_leaf)

* Maximum number of leaf nodes (max_leaf_nodes)

**Advantage:**

It reduces computation time and prevents the tree from becoming too deep, thus lowering the risk of overfitting.

**Post-Pruning**

**Definition:**

Post-pruning allows the tree to grow to its full size first, then removes branches that do not contribute significantly to accuracy. In reduced error pruning, this is done by evaluating performance on a validation set and pruning branches that don’t improve generalization; scikit-learn instead implements minimal cost-complexity pruning through the ccp_alpha parameter.

**Advantage:**

It produces a simpler and more generalizable model, improving accuracy on unseen data by eliminating unnecessary complexity.

**Summary:**

Pre-Pruning: Stops tree growth early → faster training.

Post-Pruning: Trims the fully grown tree → better generalization.
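The two approaches can be compared side by side in scikit-learn. This is a sketch under stated assumptions: post-pruning here means minimal cost-complexity pruning (`ccp_alpha`), not validation-set-based reduced error pruning, and the alpha is picked arbitrarily from the middle of the pruning path rather than tuned.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unpruned baseline
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning: cap the depth while the tree is being grown
pre = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Post-pruning: compute the cost-complexity path, then refit with one alpha.
# The mid-path alpha is an arbitrary choice; in practice it would be tuned
# with cross-validation. max(..., 0.0) guards against tiny negative values.
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name:<12} nodes={model.tree_.node_count:<3} "
          f"test accuracy={model.score(X_test, y_test):.3f}")
```

Comparing node counts shows how much each strategy simplifies the tree; whether accuracy changes depends on the data and the chosen settings.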

# **Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?**

**Information Gain in Decision Trees**

When building a Decision Tree, the algorithm must decide which feature to split on at each step. This decision is guided by Information Gain (IG).

**Definition**

Information Gain (IG) measures the reduction in impurity (uncertainty) after a dataset is split on a feature.

It is based on Entropy, which quantifies the amount of disorder or randomness in a dataset.

**Why it is Important**

Information Gain helps the Decision Tree choose the best split at each node.

A feature with high IG means splitting on it makes the resulting subsets more “pure” (closer to containing only one class).

This tends to produce shorter, more efficient trees and better classification performance.

Without IG (or similar measures like Gini), the algorithm would not know which feature provides the most useful separation of classes.

**Example**

Suppose we are classifying whether a student passes or fails based on “Study Hours.”

* Before splitting: Dataset has 50% Pass, 50% Fail → high entropy.

* After splitting:

  * Group 1 (Study Hours > 5): 90% Pass, 10% Fail → lower entropy.

  * Group 2 (Study Hours ≤ 5): 20% Pass, 80% Fail → lower entropy.

The split reduces uncertainty, so “Study Hours” has high Information Gain.
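The gain for this example can be worked out numerically using IG = H(parent) - Σ (n_k / n) · H(child_k). The example gives only percentages, so the sketch below assumes hypothetical counts (70 students, 30 of whom study more than 5 hours) chosen to reproduce them; the function names are illustrative.

```python
import math


def entropy(counts):
    """Shannon entropy (bits) of a node given per-class sample counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical counts as [pass, fail], chosen to match the percentages above
parent = [35, 35]        # 50% pass, 50% fail
more_than_5h = [27, 3]   # 90% pass, 10% fail
at_most_5h = [8, 32]     # 20% pass, 80% fail

ig = information_gain(parent, [more_than_5h, at_most_5h])
print(f"Parent entropy:   {entropy(parent):.3f} bits")
print(f"Information gain: {ig:.3f} bits")
```

With these assumed counts the parent entropy is 1 bit and the gain is roughly 0.39 bits, so the split removes a bit more than a third of the uncertainty.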

**In short:**

Information Gain tells us how much a feature improves classification by reducing uncertainty. The Decision Tree always chooses the feature with the highest IG at each step, making it a key factor in building accurate and efficient models.

# **Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?**

**Real-World Applications of Decision Trees**

Decision Trees are widely used in different industries because they are easy to interpret, handle both numerical and categorical data, and require little preprocessing. Some common applications include:

1. **Healthcare**

   * Predicting whether a patient has a disease based on symptoms, lab results, and medical history.

   * Assisting doctors with diagnostic decision-making.

2. **Finance & Banking**

   * Credit scoring (deciding whether to approve a loan).

   * Fraud detection in transactions.

3. **Marketing & Customer Analytics**

   * Predicting customer churn (whether a customer will leave).

   * Recommending products based on past purchases.

   * Segmenting customers for targeted marketing campaigns.

4. **Operations & Manufacturing**

   * Predictive maintenance of machines.

   * Quality control (classifying defective vs. non-defective products).

5. **Education**

   * Predicting student performance based on attendance, assignments, and test scores.

**Advantages of Decision Trees**

* Easy to interpret and visualize → Trees resemble human decision-making.

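* Handles both numerical and categorical data with little preprocessing: no feature scaling is needed, although scikit-learn's implementation still requires categorical features to be encoded as numbers.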

* Non-parametric → No assumption about data distribution.

* Can capture nonlinear relationships between features and target.

**Limitations of Decision Trees**

* Prone to overfitting if grown too deep.

* Unstable → Small changes in data can lead to very different trees.

* Biased towards features with more categories (especially in categorical data).

* Less accurate alone compared to ensemble methods like Random Forest or Gradient Boosted Trees.

**In summary:**
Decision Trees are powerful and interpretable tools used in healthcare, finance, marketing, and many other fields. Their main strengths are simplicity and interpretability, while their weaknesses lie in overfitting and instability.

**Dataset Info:**

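● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).

● Boston Housing Dataset for regression tasks (sklearn.datasets.load_boston() or provided CSV). Note: load_boston() was removed in scikit-learn 1.2, so Question 8 uses the California Housing dataset (sklearn.datasets.fetch_california_housing()) instead.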

# **Question 6: Write a Python program to:**

● Load the Iris Dataset

● Train a Decision Tree Classifier using the Gini criterion

● Print the model’s accuracy and feature importances.

```python
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Decision Tree with Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Predictions
y_pred = clf.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Feature Importances
print("Feature Importances:", clf.feature_importances_)
```

Output:

```
Model Accuracy: 1.0
Feature Importances: [0.         0.01911002 0.89326355 0.08762643]
```
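The importances are printed in column order, so they are easier to read when paired with `iris.feature_names`. A small sketch that refits the same model and ranks the features (the sorting and formatting are illustrative choices, not part of the question):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
clf = DecisionTreeClassifier(criterion="gini", random_state=42).fit(X_train, y_train)

# Pair each importance with its column name and sort, largest first
ranked = sorted(
    zip(iris.feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:<20} {importance:.3f}")
```

The importances sum to 1, and the third and fourth columns in the array above correspond to petal length and petal width.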


# **Question 7: Write a Python program to:**

● Load the Iris Dataset

● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

```python
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fully-grown Decision Tree
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
full_acc = accuracy_score(y_test, y_pred_full)

# Decision Tree with max_depth=3
clf_depth3 = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_depth3.fit(X_train, y_train)
y_pred_depth3 = clf_depth3.predict(X_test)
depth3_acc = accuracy_score(y_test, y_pred_depth3)

# Print results
print("Full Tree Accuracy:", full_acc)
print("Max Depth=3 Accuracy:", depth3_acc)
```

Output:

```
Full Tree Accuracy: 1.0
Max Depth=3 Accuracy: 1.0
```
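Because the Iris test split holds only 45 samples, identical test accuracies do not say much about how the two trees differ. One way to look closer, sketched here with an arbitrary set of depths, is to compare training and test accuracy as the depth limit changes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

results = []
for depth in [1, 2, 3, 4, None]:  # None lets the tree grow fully
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    results.append(
        (depth, clf.get_depth(), clf.score(X_train, y_train), clf.score(X_test, y_test))
    )

for limit, actual, train_acc, test_acc in results:
    print(f"max_depth={limit!s:<5} actual depth={actual}  "
          f"train={train_acc:.3f}  test={test_acc:.3f}")
```

A growing gap between training and test accuracy at larger depths would be a sign of overfitting; on a dataset as small and clean as Iris the gap may be negligible.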


# **Question 8: Write a Python program to:**

● Load the California Housing dataset from sklearn

● Train a Decision Tree Regressor

● Print the Mean Squared Error (MSE) and feature importances

```python
# Import libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Train-test split (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predictions
y_pred = regressor.predict(X_test)

# Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Feature Importances
print("Feature Importances:", regressor.feature_importances_)
```

Output:

```
Mean Squared Error: 0.5280096503174904
Feature Importances: [0.52345628 0.05213495 0.04941775 0.02497426 0.03220553 0.13901245
 0.08999238 0.08880639]
```


# **Question 9: Write a Python program to:**

● Load the Iris Dataset

● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV

● Print the best parameters and the resulting model accuracy

```python
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define parameter grid for tuning
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 4, 6, 8, 10]
}

# Initialize GridSearchCV
grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,              # 5-fold cross-validation
    scoring="accuracy"
)

# Fit model
grid.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid.best_params_)

# Evaluate on test data
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Test Accuracy:", accuracy)
```

Output:

```
Best Parameters: {'max_depth': 4, 'min_samples_split': 6}
Test Accuracy: 1.0
```


# **Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.**

Explain the step-by-step process you would follow to:

● Handle the missing values

● Encode the categorical features

● Train a Decision Tree model

● Tune its hyperparameters

● Evaluate its performance

And describe what business value this model could provide in the real-world setting.

1. **Handling Missing Values**

Missing data can bias predictions or reduce model accuracy. Steps include:

*Identify missing values:* Check each column for missing data using methods like .isnull().sum().

**Decide on imputation strategy:**

*Numerical features:* Impute missing values with mean, median, or a model-based approach (e.g., KNN imputer) depending on distribution.

*Categorical features:* Impute with the mode (most frequent value) or create a special category “Unknown”.

*Optional:* If a column has >50% missing values, consider dropping it as it may not provide useful information.
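The imputation steps above can be sketched with pandas. The DataFrame and its column names (`age`, `cholesterol`, `smoker`) are hypothetical stand-ins for the patient data, and the median and “Unknown” strategies are just one of the options listed.

```python
import numpy as np
import pandas as pd

# Hypothetical patient records; column names are illustrative only
df = pd.DataFrame({
    "age": [34, np.nan, 58, 45, np.nan],
    "cholesterol": [180.0, 220.0, np.nan, 199.0, 240.0],
    "smoker": ["yes", np.nan, "no", "no", np.nan],
})

# 1. Identify missing values per column
print(df.isnull().sum())

# 2. Numerical columns: fill gaps with the column median
df_clean = df.copy()
for col in ["age", "cholesterol"]:
    df_clean[col] = df_clean[col].fillna(df_clean[col].median())

# 3. Categorical column: keep missingness as its own "Unknown" category
df_clean["smoker"] = df_clean["smoker"].fillna("Unknown")

print(df_clean)
```

In a real workflow the medians would be computed on the training split only and then applied to the test split, so that no information leaks from the test data.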

2. **Encoding Categorical Features**

Most Decision Tree implementations, including scikit-learn's, expect numerical input, so categorical features must be encoded:

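*Label (ordinal) Encoding:* Assign an integer to each category (useful if categories are ordinal). In scikit-learn, OrdinalEncoder does this for features; LabelEncoder is intended for the target column.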

*One-Hot Encoding:* Create binary columns for each category (preferred for nominal variables to avoid implying order).

*Avoid high cardinality:* For categorical features with many unique values, consider grouping rare categories into “Other” to reduce dimensionality.
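A minimal pandas sketch of the two encodings, using hypothetical columns: `severity` is treated as ordinal and mapped to integers in an assumed order, while `blood_type` is treated as nominal and one-hot encoded.

```python
import pandas as pd

# Hypothetical categorical features; names and values are illustrative only
df = pd.DataFrame({
    "severity": ["low", "high", "medium", "low"],  # ordinal
    "blood_type": ["A", "O", "B", "A"],            # nominal
})

# Ordinal encoding: the explicit mapping preserves low < medium < high
severity_order = {"low": 0, "medium": 1, "high": 2}
df["severity"] = df["severity"].map(severity_order)

# One-hot encoding: one binary column per blood type, no implied order
encoded = pd.get_dummies(df, columns=["blood_type"])
print(encoded)
```

Writing the mapping out by hand makes the intended order explicit instead of relying on alphabetical ordering of the category names.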

3. **Train a Decision Tree Model**

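Decision Trees can model non-linear relationships and don’t require feature scaling.

*Split data:* Divide into training and testing sets (e.g., 80/20 split). Ideally, fit imputers and encoders on the training set only and then apply them to the test set, so that no test information leaks into preprocessing.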

*Initialize model:* Use DecisionTreeClassifier() from scikit-learn.

*Fit the model:* Train on the preprocessed training dataset.

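```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# X and y are assumed to be the preprocessed features and disease labels.
# stratify=y keeps the disease rate the same in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
```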

4. **Hyperparameter Tuning**

To avoid overfitting and improve performance:

**Key parameters to tune:**

*max_depth:* Maximum depth of the tree.

*min_samples_split:* Minimum samples required to split a node.

*min_samples_leaf:* Minimum samples required at a leaf node.

*criterion:* “gini” or “entropy”.

Use GridSearchCV or RandomizedSearchCV for systematic search:

```python
from sklearn.model_selection import GridSearchCV

# Candidate values for each hyperparameter
param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

# 5-fold cross-validated search; if the classes are imbalanced, consider
# scoring='recall' or scoring='f1' instead of 'accuracy'
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                           param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
print(grid_search.best_params_)
```

5. **Evaluate Performance**

For a healthcare prediction model, evaluation metrics should include:

*Accuracy:* General correctness, but not enough for imbalanced data.

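*Precision & Recall:* Recall is critical when false negatives (missed disease) are costly; precision matters when false positives trigger unnecessary follow-up tests or treatment.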

*F1-Score:* Balance between precision and recall.

*ROC-AUC:* Measures discrimination ability across thresholds.

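```python
from sklearn.metrics import classification_report, roc_auc_score

# Per-class precision, recall, and F1 on the held-out test set
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))

# ROC-AUC uses the predicted probability of the positive (disease) class
print("ROC-AUC:", roc_auc_score(y_test, best_model.predict_proba(X_test)[:,1]))
```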

* Optionally, visualize feature importance to understand key predictors of disease.
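Since the patient dataset itself is not available here, the sketch below uses synthetic data from `make_classification` with roughly 10% positive cases as a stand-in, and a fixed `max_depth=5` instead of a tuned model. It shows how a confusion matrix separates missed cases from false alarms, which a single accuracy number hides.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data: roughly 10% positive (disease) cases
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"Missed cases (false negatives): {fn}")
print(f"False alarms (false positives): {fp}")
print(f"Recall: {recall_score(y_test, y_pred):.3f}")
```

In a screening setting the false-negative count is usually the number to watch, because each one is a patient with the disease whom the model cleared.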

6. **Business Value**

*Early disease detection:* Helps doctors identify high-risk patients quickly.

*Resource optimization:* Prioritizes patients for further testing or treatment, reducing unnecessary costs.

*Personalized healthcare:* Enables tailored treatment plans based on patient risk profiles.

*Data-driven decision making:* Supports hospital management and policy decisions using predictive insights.

**Summary:**

This workflow ensures clean, well-prepared data, optimizes model performance, and provides actionable insights in a healthcare context. Decision Trees offer interpretability, making it easier for medical professionals to trust predictions.