1. What is a Decision Tree, and how does it work in the context of
classification?

Ans- A Decision Tree is a tree-like model that classifies a sample by routing it through a sequence of tests on feature values, from the root down to a leaf.

**How It Works**
Root Node: Starts with the entire dataset.

Splitting: Asks a "Yes/No" question about a specific feature (e.g., "Is Age > 30?").

Branches: Based on the answer, the data moves down different paths.

Leaf Nodes: The final points where a class label is assigned (e.g., "Will Buy" or "Won't Buy").

**The Goal**
The tree uses metrics like Gini Impurity or Information Gain to find the questions that best separate the classes into "pure" groups.
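To make the root, branches, and leaves concrete, here is a minimal sketch that fits a shallow tree on the Iris data and prints its learned rules with scikit-learn's `export_text`. The `max_depth=2` and `random_state=0` settings are illustrative choices, not part of the answer above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short enough to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a threshold test on one feature (a split);
# lines containing "class:" are leaf nodes that assign a label.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the output top to bottom follows a sample's path: the first test is the root node, and each level of indentation is one step further down a branch.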

2. Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

Ans- **Gini Impurity**
Measures the probability of a random element being incorrectly classified if it were labeled according to the distribution in the node.

Formula: $G = 1 - \sum (p_i)^2$

Range: 0 (Pure) to 0.5 for two classes (Balanced/Max Impurity). In general the maximum is $1 - 1/k$ for $k$ equally likely classes.

Characteristic: Tends to isolate the most frequent class in its own branch, and is slightly cheaper to compute because it doesn't use logarithms.

**Entropy**
Measures the "disorder" or unpredictability of the data.

Formula: $H = -\sum p_i \log_2(p_i)$

Range: 0 (Pure) to 1.0 for two classes (Balanced/Max Impurity). In general the maximum is $\log_2 k$ for $k$ equally likely classes.

Characteristic: Used to calculate Information Gain. It is more computationally expensive than Gini due to the log calculation.

**Impact on Splits**

Selection: At each node, the tree evaluates the candidate splits across all features.

Comparison: It calculates the impurity of the parent node and the size-weighted impurity of the child nodes each split would produce.

Optimization: It chooses the split that results in the greatest reduction in impurity (the "purest" child nodes).

In short: The lower the weighted impurity of the children, the better the split.
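Both formulas are short enough to compute by hand. The sketch below implements them for a list of class counts and evaluates a pure node, a balanced two-class node, and a balanced three-class node. The helper names `gini` and `entropy` and the example counts are made up for illustration.

```python
import math

def gini(counts):
    """Gini impurity G = 1 - sum(p_i^2) for a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy H = -sum(p_i * log2(p_i)); classes with zero count contribute 0."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

examples = [
    ("pure", [10, 0]),
    ("balanced, 2 classes", [5, 5]),
    ("balanced, 3 classes", [4, 4, 4]),
]
for label, counts in examples:
    print(f"{label}: gini={gini(counts):.3f}, entropy={entropy(counts):.3f}")
```

The three-class case is included because the maximum of each measure depends on the number of classes: a balanced two-class node gives 0.5 and 1.0, while a balanced three-class node gives higher values for both.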

3. What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

Ans- **Pre-Pruning (Early Stopping)**

Mechanism: Stops the tree-building process before it becomes too complex. It uses parameters like max_depth, min_samples_split, or min_samples_leaf.


Advantage: Efficiency. It saves significant time and memory by preventing the tree from growing unnecessary branches in the first place.

**Post-Pruning (e.g., Cost Complexity Pruning)**

Mechanism: Allows the tree to grow to its full size (where it likely overfits) and then removes branches that provide little predictive power.

Advantage: Better Performance. It allows the model to capture complex relationships that might be missed by early stopping, only removing them if they prove to be "noise" during the pruning phase.
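The two approaches can be compared side by side in scikit-learn. This is a sketch under stated assumptions: the pre-pruning limits (`max_depth=3`, `min_samples_leaf=5`) are arbitrary, and the post-pruning strength is simply the middle value of the candidate `ccp_alphas`. In practice that value would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: growth is capped while the tree is being built.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then list the candidate
# cost-complexity alphas at which branches would be removed.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice only: take a mid-range alpha (clamped at 0 for safety).
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name}: leaves={model.get_n_leaves()}, "
          f"test accuracy={model.score(X_test, y_test):.3f}")
```

The leaf counts show the structural effect of each strategy. Pre-pruning never builds the extra branches, whereas post-pruning builds them and then collapses the ones whose impurity reduction does not justify the added complexity.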

4. What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

Ans- The reduction in Entropy (disorder) achieved by partitioning a dataset based on a specific attribute.

**How it works**

Initial Entropy: Measure the uncertainty of the parent node.

Weighted Entropy: Calculate the average entropy of the resulting child nodes after a split, weighting each child by its share of the parent's samples.

Subtraction: $\text{Information Gain} = \text{Entropy(Parent)} - \text{Weighted Entropy(Children)}$.

**Importance for Choosing Splits**

Optimization: It acts as the selection criterion. The algorithm calculates Information Gain for every possible split and chooses the one with the highest value.

Purity: High Information Gain means the resulting child nodes are much "purer" than the parent (each containing mostly one class), which usually makes the classification more accurate.
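A small worked example shows how the subtraction ranks candidate splits. The class counts below are invented for illustration, and `entropy` and `information_gain` are hypothetical helper names.

```python
import math

def entropy(counts):
    """Entropy in bits for a list of class counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy(parent) minus the size-weighted entropy of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [5, 5]  # 5 samples of each class, entropy = 1 bit

candidate_splits = {
    "perfect split": [[5, 0], [0, 5]],
    "decent split": [[4, 1], [1, 4]],
    "useless split": [[3, 3], [2, 2]],
}
for name, children in candidate_splits.items():
    print(f"{name}: information gain = {information_gain(parent, children):.4f}")

# The tree would pick the candidate with the highest gain.
best = max(candidate_splits, key=lambda k: information_gain(parent, candidate_splits[k]))
print("chosen:", best)
```

The perfect split recovers the full 1 bit of uncertainty, the decent split recovers about 0.28 bits, and the useless split recovers none because both children keep the parent's 50/50 mix.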

5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

Ans- **Real-World Applications**

Finance: Credit scoring and loan default prediction.

Healthcare: Diagnosing diseases based on patient symptoms.

Marketing: Predicting customer churn or response to a campaign.

**Advantages**

Interpretability: They are easy to visualize and explain to non-technical users.

Little Preprocessing: Splits depend only on the ordering of feature values, so no data scaling or normalization is required.

Feature Importance: Automatically identifies which variables are most significant.

**Limitations**

Overfitting: Trees can become overly complex and fail to generalize to new data.

Instability: Small changes in the data can result in a completely different tree structure.

Bias: Impurity-based splitting can be biased toward features with many distinct levels or categories.

6. Write a Python program to:

Load the Iris Dataset

Train a Decision Tree Classifier using the Gini criterion

Print the model’s accuracy and feature importances

In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model using Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predictions and Accuracy
y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")

# Feature Importances
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.4f}")

Accuracy: 1.00
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876


7. Write a Python program to:

Load the Iris Dataset

Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.

In [2]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 1. Pruned Tree (max_depth=3)
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_tree.fit(X_train, y_train)
acc_pruned = accuracy_score(y_test, pruned_tree.predict(X_test))

# 2. Fully-Grown Tree (No depth limit)
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
acc_full = accuracy_score(y_test, full_tree.predict(X_test))

print(f"Accuracy (max_depth=3): {acc_pruned:.4f}")
print(f"Accuracy (Fully Grown): {acc_full:.4f}")

Accuracy (max_depth=3): 1.0000
Accuracy (Fully Grown): 1.0000


**Comparison Summary**

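Fully-Grown Tree: High risk of overfitting, capturing noise as rules. On this small, clean test split both trees reach the same accuracy, so that risk does not show up in the numbers above.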

Pruned Tree (max_depth=3): Generally more robust, simpler to interpret, and often generalizes better to new data despite potentially lower training accuracy.

8. Write a Python program to:

Load the Boston Housing Dataset

Train a Decision Tree Regressor

Print the Mean Squared Error (MSE) and feature importances

In [5]:
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# 1. load_boston was removed in scikit-learn 1.2, so synthetic regression data
#    (100 samples, 5 features) stands in for the Boston Housing Dataset here
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)
feature_names = [f"Feature {i}" for i in range(5)]

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train Decision Tree Regressor
regressor = DecisionTreeRegressor(max_depth=3, random_state=42)
regressor.fit(X_train, y_train)

# 4. Predict and calculate MSE
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print(f"Mean Squared Error: {mse:.4f}")
print("\nFeature Importances:")
for name, importance in zip(feature_names, regressor.feature_importances_):
    print(f"{name}: {importance:.4f}")

Mean Squared Error: 7166.5039

Feature Importances:
Feature 0: 0.2276
Feature 1: 0.6522
Feature 2: 0.0000
Feature 3: 0.1203
Feature 4: 0.0000


9. Write a Python program to:

Load the Iris Dataset

Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV

Print the best parameters and the resulting model accuracy

In [6]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define parameters to tune
param_grid = {
    'max_depth': [2, 3, 4, 5, 6],
    'min_samples_split': [2, 5, 10, 20]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best parameters and accuracy
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Accuracy with Best Model: {accuracy_score(y_test, y_pred):.4f}")

Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Accuracy with Best Model: 1.0000


10. Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.

Explain the step-by-step process you would follow to:

Handle the missing values

Encode the categorical features

Train a Decision Tree model

Tune its hyperparameters

Evaluate its performance

And describe what business value this model could provide in the real-world
setting.

Ans- 1. **Handle Missing Values**

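Identify: Check whether values are missing at random or systematically (e.g., a test that is only ordered for sicker patients), since systematic missingness can itself carry signal.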

Imputation: For numerical values (e.g., blood pressure), use Median Imputation to avoid outlier bias. For categorical values (e.g., smoking status), use Mode Imputation or create a new "Missing" category.

2. **Encode Categorical Features**

Binary/Ordinal: Use Label (ordinal) Encoding for features with a logical order (e.g., Stage 1, 2, 3), with an explicit mapping so the integer codes follow that order.

Nominal: Use One-Hot Encoding for features without order (e.g., Blood Type) to ensure the model doesn't assume a false mathematical relationship between categories.

3. **Train a Decision Tree Model**

Split: Divide data into training (80%) and testing (20%) sets.

Fit: Initialize DecisionTreeClassifier and fit it to the training data. Decision trees suit mixed data types well because they don't require feature scaling (such as normalization).

4. **Tune Hyperparameters**

Grid Search: Use GridSearchCV to test combinations of max_depth (to prevent overfitting) and min_samples_leaf (to ensure every leaf is supported by enough patients to be trustworthy).

5. **Evaluate Performance**

Recall (Sensitivity): Crucial in healthcare. We must minimize False Negatives (missing a sick patient).

F1-Score: To balance Precision and Recall.

ROC-AUC: To evaluate how well the model distinguishes between healthy and diseased patients.
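The five steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption: the column names, the synthetic patient records, and the rule generating the disease label are all made up so the example runs on its own. The grid values and the choice of `scoring="recall"` are likewise illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records; all columns and values are invented.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(125, 15, n),
    "smoking_status": rng.choice(["never", "former", "current"], n),
    "blood_type": rng.choice(["A", "B", "AB", "O"], n),
})
# Hypothetical label loosely tied to age and blood pressure, plus noise.
risk = 0.03 * (df["age"] - 55) + 0.04 * (df["blood_pressure"] - 125)
y = (risk + rng.normal(0, 1, n) > 0).astype(int)

# Blank out roughly 10% of two columns to simulate missing values.
df.loc[rng.random(n) < 0.1, "blood_pressure"] = np.nan
df.loc[rng.random(n) < 0.1, "smoking_status"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoking_status", "blood_type"]

# Steps 1-2: median imputation for numbers; mode imputation + one-hot for categories.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: the tree sits at the end of the pipeline, so imputers and encoders
# are fitted on training folds only.
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: tune depth and leaf size, optimizing recall to limit false negatives.
search = GridSearchCV(
    model,
    {"tree__max_depth": [2, 3, 4, 5], "tree__min_samples_leaf": [5, 10, 20]},
    scoring="recall",
    cv=5,
)
search.fit(X_train, y_train)

# Step 5: report recall, F1 (via the classification report), and ROC-AUC.
y_pred = search.predict(X_test)
y_prob = search.predict_proba(X_test)[:, 1]
print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}")
```

Wrapping the preprocessing inside the pipeline is a deliberate design choice: during cross-validation the imputation statistics and category lists are learned from each training fold alone, which avoids leaking information from the validation fold.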

**Business Value**

Early Intervention: Identifies high-risk patients sooner, improving recovery rates and saving lives.

Resource Allocation: Helps hospitals prioritize urgent cases and manage staff more efficiently.

Cost Reduction: Prevents expensive late-stage treatments through proactive screening.