# Decision Tree:
1. What is a Decision Tree, and how does it work in the context of classification?

Ans:- A Decision Tree is a supervised learning model that predicts a target by learning simple decision rules from data, organized in a tree-like structure of nodes and branches.

What a decision tree is:

a. It consists of:

(i). Root node: represents the full dataset and the first splitting feature.

(ii). Internal nodes: decision tests on feature values (for example, “Age > 30?”).

(iii). Branches: outcomes of those tests (Yes/No or categorical values).

(iv). Leaf nodes: final class labels or predictions.

b. It is widely used for classification (categorical outputs) and also for regression (continuous outputs).

How it works for classification:

a. Starting at the root, the algorithm recursively splits the data based on feature values to create subsets that are as “pure” as possible (mostly containing one class).

b. At each node it chooses the best feature and threshold using an impurity criterion such as:

(i). Gini impurity (used in CART),

(ii). Entropy / information gain (used in ID3/C4.5).

c. This process continues until a stopping rule is met (all samples in a node share the same class, depth limit reached, or not enough samples). The node then becomes a leaf with a class label (often the majority class in that node).


Making a classification prediction:

For a new example:

a. Start at the root, apply the node's test (e.g., “Income > 50k?”).

b. Follow the corresponding branch (Yes/No or specific category).

c. Repeat at each internal node until reaching a leaf.

d. The class assigned to that leaf is the predicted class for that example.
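
The traversal in steps a to d can be sketched in a few lines. This is a minimal illustration, not how any particular library stores its trees: the nested-dict layout, the `toy_tree` contents, and the `predict` helper are all made up for the example, and the thresholds were not learned from data.

```python
# A hand-written toy tree; the features, thresholds, and labels are invented
# for illustration and were not learned from data.
toy_tree = {
    "feature": "income",
    "threshold": 50_000,
    "yes": {"label": "approve"},          # leaf reached when income > 50,000
    "no": {
        "feature": "age",
        "threshold": 30,
        "yes": {"label": "review"},       # income <= 50,000 and age > 30
        "no": {"label": "decline"},       # income <= 50,000 and age <= 30
    },
}


def predict(node, example):
    """Walk from the root to a leaf, answering 'feature > threshold?' at each node."""
    while "label" not in node:
        branch = "yes" if example[node["feature"]] > node["threshold"] else "no"
        node = node[branch]
    return node["label"]


print(predict(toy_tree, {"income": 62_000, "age": 25}))  # approve
print(predict(toy_tree, {"income": 40_000, "age": 45}))  # review
print(predict(toy_tree, {"income": 40_000, "age": 22}))  # decline
```

Each prediction costs one comparison per level, so the work is proportional to the depth of the path followed, not to the size of the training set.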

2. Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

Ans:- Gini impurity and entropy are node impurity measures that decision trees use to decide how to split the data. Both quantify how “mixed” the class labels are at a node: the lower the value, the purer the node, so splits that produce low-impurity children are preferred.

Gini impurity:

a. For a node with class probabilities p1,p2,…,pK, Gini impurity is,

Gini = 1 - Σ(pᵢ²)

b. Interpretation: the probability that a randomly chosen sample in the node would be misclassified if its label were assigned at random according to the node's class distribution.

c. Properties:

(i). G=0 when all samples belong to a single class (perfectly pure).

(ii). Larger when classes are more evenly mixed (maximum for balanced classes, e.g., 0.5 for a 50-50 binary node).

Entropy:

a. For the same node, entropy (from information theory) is,

H = -Σ pᵢ log₂(pᵢ)

b. Interpretation: expected number of bits needed to encode the class label; higher entropy = more disorder or uncertainty.

c. Properties:

(i). H=0 when the node is pure (all in one class).

(ii). Maximum when all classes are equally likely (most uncertain / mixed).
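
As a quick check of the properties listed for both measures, the two formulas can be evaluated directly on a few class distributions. The helper names `gini` and `entropy` are chosen for this sketch only; the convention that terms with pᵢ = 0 contribute nothing to the entropy is assumed.

```python
import math


def gini(probs):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)), skipping p_i = 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


# A pure node, a mostly pure node, and a perfectly mixed binary node.
for probs in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(probs, "gini =", round(gini(probs), 3), "entropy =", round(entropy(probs), 3))
```

Both measures are 0 for the pure node and reach their binary maximum (0.5 for Gini, 1 bit for entropy) at the 50-50 node.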

How they affect splits in a decision tree:

A. During training, a decision tree evaluates many possible splits at a node:

a. For each candidate split, it computes the weighted impurity of the child nodes (using either Gini or entropy).

b. It then chooses the split that maximizes impurity reduction (equivalently, minimizes the weighted impurity of children):

(i). For Gini: sometimes called Gini gain.

(ii). For entropy: called information gain.

B. Impact:

a. Both criteria prefer splits that create child nodes where one class dominates (high purity).

b. In practice the two criteria often pick very similar splits. Gini tends to be slightly more sensitive to the majority class, and it is a bit faster to compute because it needs no logarithms.

c. Using either measure biases the tree toward features and thresholds that most reduce class mixing early in the tree, which usually improves classification accuracy and makes decision boundaries sharper.
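
The split search described in part A can be sketched for a single numeric feature. This is a simplified illustration under stated assumptions: candidate thresholds are taken as midpoints between consecutive distinct values, the criterion is weighted Gini, and the helper names and toy data are invented for the example. Real implementations are far more optimized.

```python
from collections import Counter


def gini_of_labels(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best 'value <= threshold' split."""
    n = len(values)
    distinct = sorted(set(values))
    best = (None, float("inf"))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Weight each child's impurity by its share of the samples.
        weighted = (len(left) / n) * gini_of_labels(left) \
            + (len(right) / n) * gini_of_labels(right)
        if weighted < best[1]:
            best = (t, weighted)
    return best


ages = [22, 25, 31, 38, 45, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]
print(gini_of_labels(bought))        # 0.5 for the 50-50 parent node
print(best_threshold(ages, bought))  # (34.5, 0.0): both children are pure
```

Minimizing the weighted child impurity is the same as maximizing the impurity reduction, because the parent impurity is fixed for a given node.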

3. What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

Ans:- Pre-pruning and post-pruning are two ways of simplifying a decision tree to reduce overfitting, but they act at different stages of tree construction and have different practical benefits.

Pre-pruning (early stopping):

a. Concept: The tree is stopped from growing while it is being built if a split fails some heuristic test, such as “information gain below a threshold,” “node has fewer than min_samples_split,” or “depth reached max_depth.”

b. Goal: Prevent the tree from becoming too deep or complex in the first place, trading some bias for lower variance and faster training.

Practical advantage: Pre-pruning is computationally efficient—it avoids building unnecessary branches, which is valuable on large datasets or when models must be trained frequently or in real time.

Post-pruning (pruning after full growth):

a. Concept: First grow a large (often overfitted) tree with little or no stopping, then prune it back by cutting subtrees whose removal does not hurt—or even improves—performance on validation data or according to a cost-complexity criterion.

b. Methods: Examples include cost-complexity (minimal cost-complexity) pruning, reduced-error pruning, and other subtree-replacement strategies.

Practical advantage: Post-pruning often yields a better-performing and more robust tree, because it makes pruning decisions with knowledge of the entire grown tree and its behavior on validation data, rather than relying only on local heuristics while growing.

Impact on trees:

Pre-pruning: stops growth based on local criteria → smaller, faster-to-train trees, but risk of underfitting if stopped too early.

Post-pruning: lets tree overfit, then trims back using validation or complexity penalties → more computation, but usually better generalization and cleaner structure.
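
Both strategies can be tried side by side in scikit-learn. The sketch below assumes the Iris data and a single train/validation split purely for illustration; the specific `max_depth` and `min_samples_leaf` values are arbitrary choices, and selecting `ccp_alpha` on the validation split is just one simple way to pick a pruning level.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fully grown reference tree (no stopping rule, no pruning).
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruning: cap growth while the tree is being built.
pre = DecisionTreeClassifier(max_depth=2, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: compute the cost-complexity pruning path, refit one tree per
# alpha, and keep the one that scores best on the validation split.
path = full.cost_complexity_pruning_path(X_train, y_train)
candidates = [
    # max(alpha, 0.0) guards against tiny negative values from rounding.
    DecisionTreeClassifier(ccp_alpha=max(alpha, 0.0), random_state=0).fit(X_train, y_train)
    for alpha in path.ccp_alphas
]
post = max(candidates, key=lambda tree: tree.score(X_val, y_val))

for name, tree in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name:12s} leaves={tree.get_n_leaves():2d} "
          f"val_acc={tree.score(X_val, y_val):.3f}")
```

Because the validation split is used to choose the pruning level, its accuracy is an optimistic estimate; a separate test set or cross-validation would be needed for an unbiased figure.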

4. What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

Ans:- Information Gain measures how much a split reduces uncertainty (entropy) about the class labels at a node, and the tree chooses the feature/threshold with the highest Information Gain as the best split.

Definition:

a. Entropy at a node T with class probabilities p₁, …, p_K:

H(T) = -Σ pᵢ log₂(pᵢ)

b. After splitting on attribute a into child nodes Tⱼ, each holding a proportion |Tⱼ| / |T| of the samples and having entropy H(Tⱼ), the Information Gain of that split is:

IG(T, a) = H(T) - Σⱼ (|Tⱼ| / |T|) · H(Tⱼ)

This is the reduction in entropy from the parent node to the children.
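
A small worked example makes the definition concrete. The two candidate splits below are hypothetical, and the helper names are chosen for this sketch; the computation follows the formula above, with child entropies weighted by the share of samples in each child.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """IG = H(parent) - sum(|child| / |parent| * H(child))."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5                    # H(parent) = 1 bit
# Hypothetical split A separates the classes well; split B barely helps.
split_a = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]

print(round(information_gain(parent, split_a), 3))   # 0.278
print(round(information_gain(parent, split_b), 3))   # 0.029
```

A tree using this criterion would pick split A, since it removes roughly ten times more uncertainty than split B.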

Why it matters for choosing the best split:

a. Selects purer children: A high Information Gain means the child nodes have much lower entropy, i.e., they are more homogeneous in class labels, so classification there is easier and more accurate.

b. Feature ranking: At each node, the algorithm computes Information Gain for all candidate features (and thresholds for numeric features) and chooses the split with the largest gain, building the tree top-down.

c. Greedy but effective: This greedy choice of maximum Information Gain at each step tends to create short, informative paths from root to leaf, improving both accuracy and interpretability of the resulting tree.

5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

Ans:- Decision Trees are widely used because they combine reasonable predictive power with strong interpretability, but they also have well-known weaknesses like overfitting and instability.

Common real-world applications:

a. Credit scoring and loan approval (banking):

Assess whether to approve a loan based on features like credit score, income, employment stability, and past defaults.

b. Medical diagnosis and treatment support (healthcare):

Classify patients into disease/no-disease groups (e.g., diabetes, heart disease) using lab values and symptoms, or select treatment paths.

c. Customer churn and marketing segmentation:

Predict which customers are likely to churn or respond to an offer, using demographic and behavior features, then design targeted campaigns.

d. Fraud detection (finance, insurance, e-commerce):

Flag suspicious transactions or claims based on patterns in historical fraud vs. non-fraud records.

e. Manufacturing and quality control:

Identify process conditions that lead to defects or failures, so that operators can adjust parameters and improve yield.

f. Operations and strategy / decision support:

Evaluate business or project decisions (e.g., expansion options, pricing, logistics) by mapping scenarios, costs, and outcomes in tree form.

Main advantages:

a. Highly interpretable:

Trees read like nested if-else rules and can be visualized, making them easy for non-experts and stakeholders to understand and audit.

b. Handle different data types and non-linear patterns:

Work with both numerical and categorical features and naturally model non-linear decision boundaries via hierarchical splits.

c. Require little preprocessing:

No need for feature scaling; trees can automatically pick informative features and ignore many irrelevant ones.

d. Versatile:

Support both classification and regression, and serve as base learners in powerful ensembles like Random Forests and Gradient Boosted Trees.
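
The interpretability point can be seen by printing a fitted tree as text rules. This sketch assumes scikit-learn's `export_text` helper and a deliberately shallow tree (`max_depth=2`, an arbitrary choice) trained on the Iris data.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# export_text renders the fitted tree as indented if/else-style rules.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one test on a feature, and each `class:` line is a leaf, so the printout can be read directly as nested if-else rules.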

Main limitations:

a. Overfitting and high variance:

Deep trees can memorize training data and perform poorly on unseen data; small changes in data can lead to very different tree structures.

b. Instability and sensitivity to noise:

Because each split decision is greedy, noise or outliers can heavily influence structure and predictions.

c. Bias toward many-valued features and majority classes:

Standard splitting criteria may favor attributes with many distinct values and can be biased toward majority classes if data are imbalanced.

d. Often outperformed by ensembles or other models:

Single trees are usually less accurate than ensemble methods (Random Forest, XGBoost) or complex models like neural networks, especially for regression or high-dimensional data.

In [10]:
# 6 Write a Python program to:
# Load the Iris Dataset
# Train a Decision Tree Classifier using the Gini criterion
# Print the model’s accuracy and feature importances

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()                    # features in iris.data, labels in iris.target
X, y = iris.data, iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train a Decision Tree Classifier using the Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Compute accuracy on the test set
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")

# Print feature importances
print("Feature importances:")
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")


Accuracy: 0.933
Feature importances:
sepal length (cm): 0.006
sepal width (cm): 0.029
petal length (cm): 0.559
petal width (cm): 0.406


In [11]:
# 7 Write a Python program to:
# Load the Iris Dataset
# Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Decision Tree with max_depth = 3 (pre-pruned tree)
tree_depth3 = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_depth3.fit(X_train, y_train)
y_pred_3 = tree_depth3.predict(X_test)
acc_3 = accuracy_score(y_test, y_pred_3)

# Fully-grown Decision Tree (no depth limit)
tree_full = DecisionTreeClassifier(random_state=42)   # max_depth=None by default
tree_full.fit(X_train, y_train)
y_pred_full = tree_full.predict(X_test)
acc_full = accuracy_score(y_test, y_pred_full)

print(f"Accuracy with max_depth=3: {acc_3:.3f}")
print(f"Accuracy with full tree : {acc_full:.3f}")


Accuracy with max_depth=3: 0.967
Accuracy with full tree : 0.933


In [15]:
# 8. Write a Python program to:
# Load the California Housing dataset from sklearn
# Train a Decision Tree Regressor
# Print the Mean Squared Error (MSE) and feature importances

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load California housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Regressor
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)

# Predict and calculate MSE
y_pred = dt.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

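# Print results, pairing each importance with its feature name
print(f"Mean Squared Error (MSE): {mse:.3f}")
print("Feature importances:")
for name, importance in zip(housing.feature_names, dt.feature_importances_):
    print(f"{name}: {importance:.3f}")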



In [13]:
#9. Write a Python program to:
# Load the Iris Dataset
# Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
# Print the best parameters and the resulting model accuracy

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Set up the Decision Tree and parameter grid
dt = DecisionTreeClassifier(criterion="gini", random_state=42)

param_grid = {"max_depth": [2, 3, 4, 5, None], "min_samples_split": [2, 5, 10]}

# GridSearchCV to tune hyperparameters (5-fold CV)
grid = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# Evaluate best model on the test set
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)

print("Best parameters:", grid.best_params_)
print(f"Cross‑val best accuracy: {grid.best_score_:.3f}")
print(f"Test accuracy with best params: {test_acc:.3f}")


Best parameters: {'max_depth': 4, 'min_samples_split': 2}
Cross‑val best accuracy: 0.942
Test accuracy with best params: 0.933


10.  Imagine you're working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

Ans:- A concise step-by-step process is as follows.

A. Handle the missing values:

a. Explore missingness:

Check the percentage of missing values per column and whether values are missing completely at random (MCAR), at random (MAR), or not at random (MNAR).

b. Impute appropriately:

(i). For numerical features: use median (robust to outliers) or mean; consider more advanced imputation (KNN/iterative) if many values are missing.

(ii). For categorical features: impute with most frequent category or add a special “Unknown” category.

c. Use a pipeline so imputation is part of the training process and applied consistently to train and test sets.

B. Encode the categorical features:

a. Decision Trees can in principle split on categorical values directly, but many libraries (including scikit-learn) expect numeric input, so encode as:

(i). One-hot encoding for nominal variables (e.g., gender, blood type).

(ii). Ordinal encoding only when there is a genuine order (e.g., disease stage I < II < III).

b. Again, put encoders in a ColumnTransformer / pipeline so the transformations are learned only on training data and applied to validation/test data.

C. Train a Decision Tree model:

a. Split data into train/validation/test (or train/test with cross-validation).

b. Create a DecisionTreeClassifier with a reasonable starting configuration (e.g., criterion="gini", limited max_depth to avoid severe overfitting).

c. Fit the model on the processed training data.

D. Tune its hyperparameters:

a. Use GridSearchCV or RandomizedSearchCV over parameters such as:

(i). max_depth (tree depth)

(ii). min_samples_split, min_samples_leaf (minimum samples to form splits / leaves)

(iii). max_features (number of features considered per split)

b. Optimize for an appropriate metric (e.g., ROC-AUC, F1-score) using cross-validation to get a robust estimate.

c. Select the best model and refit it on the full training set.

E. Evaluate its performance:

a. On the held-out test set, compute:

(i). Confusion matrix, accuracy

(ii). Precision, recall, F1-score (especially important if the disease is rare)

(iii). ROC curve and AUC to understand discrimination across thresholds.

F. Business value in a real-world healthcare setting:

(i). Earlier, data-driven triage: High-risk patients can be flagged for additional tests or specialist review, potentially catching disease earlier and improving outcomes.

(ii). Resource optimization: Hospitals can prioritize expensive diagnostics for patients with the highest predicted risk, reducing unnecessary tests and costs.

(iii). Clinical decision support: Because Decision Trees are interpretable (if not too deep), clinicians can see which factors led to a high-risk prediction, supporting transparent, explainable decisions.

(iv). Population health management: Aggregated predictions help identify high-risk cohorts, guiding preventive programs and targeted interventions (screening campaigns, lifestyle counseling).
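
Steps A to E above can be sketched end to end with a scikit-learn pipeline. Everything about the data here is an assumption made for illustration: the records are synthetic, the column names (`age`, `glucose`, `smoker`, `blood_type`) and the rule generating the label are invented, and the parameter grid is an arbitrary small example rather than a recommended search space.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records; every column name and the label
# rule below are invented for illustration only.
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "glucose": rng.normal(110, 25, n),
    "smoker": rng.choice(["yes", "no"], n),
    "blood_type": rng.choice(["A", "B", "AB", "O"], n),
})
y = ((df["glucose"] > 120) | ((df["age"] > 65) & (df["smoker"] == "yes"))).astype(int)
# Knock out about 10% of two columns to mimic missing values.
df.loc[rng.random(n) < 0.1, "glucose"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

# A + B: impute and encode inside a ColumnTransformer so that statistics are
# learned from training folds only.
numeric, categorical = ["age", "glucose"], ["smoker", "blood_type"]
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# C: the tree is the final step of the pipeline.
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=0, stratify=y
)

# D: tune tree hyperparameters with 5-fold cross-validation on ROC-AUC.
grid = GridSearchCV(
    model,
    {"tree__max_depth": [2, 3, 5, None], "tree__min_samples_leaf": [1, 5, 20]},
    cv=5,
    scoring="roc_auc",
)
grid.fit(X_train, y_train)

# E: evaluate the refitted best pipeline on the held-out test set.
proba = grid.predict_proba(X_test)[:, 1]
print("Best parameters:", grid.best_params_)
print("Test ROC-AUC:", round(roc_auc_score(y_test, proba), 3))
print(classification_report(y_test, grid.predict(X_test)))
```

Keeping imputation and encoding inside the pipeline means each cross-validation fold fits them on its own training portion, which avoids leaking information from validation or test rows into preprocessing.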
