Decision Tree

###1. What is a Decision Tree, and how does it work in the context of classification?


A Decision Tree is a supervised machine learning algorithm used for both classification and regression, but it is most commonly applied in classification problems.

It works like a flowchart, where data is split into branches based on certain conditions until a decision (prediction) is made.


 Structure of a Decision Tree

	1.	Root Node – The starting point; represents the entire dataset and holds the first split, made on the feature that best separates the classes.
	2.	Internal Nodes – Each node represents a feature test (e.g., “Age > 30?”).
	3.	Branches – The outcome of the test (e.g., “Yes” or “No”).
	4.	Leaf Nodes – The final decision (class label in classification).


 How It Works in Classification

	1.	Feature Selection (Splitting Criteria)
The algorithm selects the feature that best separates the data into different classes. Common criteria include:

	•	Gini Index (used in CART)
	•	Entropy / Information Gain (used in ID3, C4.5)

Example: In a dataset of students, if we want to classify whether they “Pass” or “Fail”, the feature “Hours Studied” might be the most important factor for the first split.

	2.	Recursive Splitting
After the root node split, each branch creates a smaller subset of the dataset. The process is repeated recursively until:

	•	All records in a node belong to the same class, or
	•	A stopping criterion, such as the maximum tree depth, is reached.


	3.	Prediction

To classify a new instance, the model traverses the tree from the root node to a leaf by following the decision rules.




 Example (Binary Classification)

Dataset: Predict if a person will buy a product. Features: Age, Income.

	•	Root Node: “Age > 30?”
	•	If Yes, check “Income > 50k?”
		•	If Yes → Buy
		•	If No → Don’t Buy
	•	If No → Don’t Buy
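
The example tree above can be written out as nested conditions. This is only a hand-coded sketch of the traversal, not a trained model; the function name `predict_purchase` is made up for illustration, and "50k" is assumed to mean 50,000.

```python
def predict_purchase(age: float, income: float) -> str:
    """Walk the example tree from the root to a leaf and return the class label."""
    if age > 30:                # root node test: "Age > 30?"
        if income > 50_000:     # internal node test: "Income > 50k?"
            return "Buy"        # leaf
        return "Don't Buy"      # leaf
    return "Don't Buy"          # leaf reached directly from the root


print(predict_purchase(age=35, income=60_000))  # Buy
print(predict_purchase(age=35, income=40_000))  # Don't Buy
print(predict_purchase(age=25, income=80_000))  # Don't Buy
```

Each call follows exactly one path from the root to a leaf, which is why prediction with a tree is cheap.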

⸻

Advantages

	•	Easy to interpret and visualize.
	•	Handles both numerical and categorical data.
	•	Requires little data preprocessing.

 Disadvantages

	•	Can easily overfit (too many splits).
	•	Sensitive to noisy data.
	•	Greedy algorithm (locally optimal, not always globally optimal).

###2. Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

1. Impurity in Decision Trees

In a Decision Tree, the goal of each split is to make the child nodes as pure as possible.

	•	A pure node contains samples from only one class.
	•	An impure node contains a mix of classes.

To measure impurity, we use Gini Impurity and Entropy.



2. Gini Impurity

	•	Definition:
Probability that a randomly chosen sample from a node would be misclassified if labeled randomly according to the class distribution in that node.

	•	Formula:


Gini = 1 - \sum_{i=1}^{C} p_i^2

Where:

	•	C = number of classes
	•	p_i = proportion of samples of class i in the node
	•	Range:
	•	0 → node is completely pure (all samples belong to the same class)
	•	1 - 1/C → node is maximally impure (classes evenly mixed); for two classes this is 0.5
	•	Example:
Node has 70% Class A and 30% Class B:
Gini = 1 - (0.7^2 + 0.3^2) = 0.42
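
As a quick check of the formula and the 70/30 example, here is a minimal sketch in plain Python. The helper name `gini_impurity` is an illustrative choice, and it assumes the proportions passed in already sum to 1.

```python
def gini_impurity(proportions):
    """Gini = 1 - sum(p_i^2), for class proportions that sum to 1."""
    return 1.0 - sum(p ** 2 for p in proportions)


print(round(gini_impurity([0.7, 0.3]), 2))  # 0.42 (the example above)
print(gini_impurity([1.0, 0.0]))            # 0.0  (pure node)
print(gini_impurity([0.5, 0.5]))            # 0.5  (binary maximum)
```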



3. Entropy (the basis of Information Gain)

	•	Definition:

Measures the uncertainty or disorder in a node.

	•	Lower entropy → node is more pure
	•	Higher entropy → node is more mixed
	•	Formula:

Entropy = - \sum_{i=1}^{C} p_i \cdot \log_2(p_i)

	•	Range:
	•	0 → node is pure
	•	1 (binary case) → maximum disorder; in general the maximum is \log_2 C, reached when all C classes are evenly mixed
	•	Example: Node has 70% Class A, 30% Class B:
    Entropy = -(0.7 \cdot \log_2 0.7 + 0.3 \cdot \log_2 0.3) \approx 0.881
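
The same 70/30 node can be checked numerically. This is a small sketch, with `entropy` as an illustrative helper name; it skips classes with zero proportion, following the usual convention that 0 · log 0 counts as 0.

```python
import math


def entropy(proportions):
    """Entropy = -sum(p_i * log2(p_i)); classes with p_i = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)


print(round(entropy([0.7, 0.3]), 3))  # 0.881 (the example above)
print(entropy([0.5, 0.5]))            # 1.0   (binary maximum)
```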



4. Impact on Decision Tree Splits

	•	At each node, the Decision Tree chooses the feature (and split point) whose split reduces impurity the most.

How each measure works:

	1.	Gini Impurity:
	•	Choose the feature that minimizes the weighted Gini of child nodes after split.
	2.	Entropy:
	•	Choose the feature that maximizes Information Gain = reduction in entropy.

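Goal: After each split, nodes are more “pure” → easier to classify new samples. In practice the two measures usually select similar splits; Gini is slightly cheaper to compute because it needs no logarithm.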
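
To make the selection rule concrete, the sketch below compares two candidate splits by the weighted Gini of their child nodes. The class counts and the feature names "A" and "B" are invented for illustration; the parent node is assumed to hold 10 "Yes" and 10 "No" samples.

```python
def gini_from_counts(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(children):
    """Average child impurity, weighted by the share of samples in each child."""
    n = sum(sum(counts) for counts in children)
    return sum(sum(counts) / n * gini_from_counts(counts) for counts in children)


# Hypothetical candidate splits: each child is [count of "Yes", count of "No"]
candidates = {
    "A": [[9, 1], [1, 9]],  # children are nearly pure
    "B": [[6, 4], [4, 6]],  # children are still mixed
}

for name, children in candidates.items():
    print(name, round(weighted_gini(children), 2))

best = min(candidates, key=lambda name: weighted_gini(candidates[name]))
print("Best split:", best)
```

Feature "A" wins because its children have the lower weighted impurity (0.18 versus 0.48). Swapping in entropy for Gini would give the Information Gain variant of the same procedure.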

###3. What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

Decision trees are prone to overfitting, especially when grown to full depth. Pruning techniques help improve generalization by simplifying the tree. They fall into two broad categories:

Pre-Pruning (Early Stopping)

What it is: Stop growing the tree before it perfectly fits the training data. You set criteria to halt splitting when further splits are unlikely to yield significant gains.

Common criteria:

	•	Maximum depth (limit how deep the tree can grow)
	•	Minimum samples per leaf (stop splitting if a node has too few samples)
	•	Minimum impurity decrease (require a minimum gain to justify a split)
	•	Validation-based stopping (stop when performance on a hold-out set stops improving)

How it works in practice:

As you build the tree top-down, you check the stopping criteria at each potential split and refrain from adding new nodes if the criterion isn’t met.

Practical advantage

Faster and simpler models with built-in regularization: Pre-pruning prevents overfitting by design, often resulting in smaller trees that train quickly and generalize better without needing a separate pruning phase.
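
In scikit-learn, pre-pruning is expressed through constructor arguments of `DecisionTreeClassifier`. The sketch below uses the Iris data for illustration; the specific values (`max_depth=3`, `min_samples_leaf=5`, `min_impurity_decrease=0.01`) are arbitrary choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stopping criteria are checked while the tree is being grown
pre_pruned = DecisionTreeClassifier(
    max_depth=3,                 # maximum depth
    min_samples_leaf=5,          # minimum samples per leaf
    min_impurity_decrease=0.01,  # minimum gain required to split
    random_state=42,
)
pre_pruned.fit(X_train, y_train)

print(f"Depth: {pre_pruned.get_depth()}, leaves: {pre_pruned.get_n_leaves()}")
print(f"Test accuracy: {pre_pruned.score(X_test, y_test):.2f}")
```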

Post-Pruning (e.g., Cost-Complexity Pruning)

What it is: Grow the full tree (often to the point of overfitting) and then prune back branches that do not provide enough predictive power.

Common approaches:

	•	Cost-Complexity Pruning (also known as weakest link pruning): balance tree accuracy against tree size using a complexity parameter α.
	•	Reduced-error pruning: remove subtrees that do not improve validation set performance.

How it works in practice:
 After fully growing the tree, evaluate the impact of removing each subtree and iteratively prune the branches that yield the best improvement (or least degradation) on a separate validation set.

Practical advantage

Potentially better performance with optimal simplification: Post-pruning can produce a more accurate model by tailoring the final size of the tree to the data, especially when the initial fully grown tree captured noise. It often achieves a better bias-variance trade-off than a single fixed-depth pre-pruned tree.
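
One way to try cost-complexity pruning is scikit-learn's `cost_complexity_pruning_path` together with the `ccp_alpha` parameter. The sketch below is one possible selection loop, assuming a simple hold-out validation split on the Iris data; cross-validation would be a reasonable alternative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Grow the full tree, then get the candidate alphas of its pruning path
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    candidate = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    candidate.fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score >= best_score:  # ties go to the larger alpha, i.e. the smaller tree
        best_alpha, best_score = alpha, score

pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
pruned_tree.fit(X_train, y_train)

print(f"Leaves before pruning: {full_tree.get_n_leaves()}")
print(f"Leaves after pruning:  {pruned_tree.get_n_leaves()}")
print(f"Chosen alpha: {best_alpha:.4f}, validation accuracy: {best_score:.2f}")
```

Larger values of α prune more aggressively, so preferring the larger α on ties favors the simpler tree.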

Quick comparison

Growth phase:

	•	Pre-pruning: Stop early during growth.
	•	Post-pruning: Grow to full depth, then prune.

Control signal:

	•	Pre-pruning: Simple, resource-efficient criteria applied during building.
	•	Post-pruning: Uses validation metrics after full growth to decide what to prune.

Risk:

	•	Pre-pruning: Risk of underfitting if stopping criteria are too aggressive.
	•	Post-pruning: Risk of overfitting during growth, but can yield better final generalization after pruning.

Typical use cases:

	•	Pre-pruning: When you want fast training and a compact model (e.g., in resource-constrained environments).
	•	Post-pruning: When you suspect there is structure in the data that a fully grown tree can reveal but needs simplification to generalize well.

###4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?

1. What is Information Gain?

Information Gain (IG) measures how much uncertainty or disorder is reduced by splitting a node using a particular feature.

	•	It is based on Entropy, which quantifies the impurity of a node.
	•	When we split a node, we want child nodes to be more “pure” (less mixed classes) than the parent node.
	•	Information Gain = Reduction in Entropy after the split.

Formula:

IG(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \cdot Entropy(S_v)

Where:

	•	S = dataset at the current node
	•	A = feature we are splitting on
	•	S_v = subset of S for which feature A has value v
	•	|S_v|/|S| = proportion of samples in that subset



2. Step-by-step Explanation


	1.	Compute Entropy of the parent node → measures the current disorder.

	2.	Split the dataset based on a feature → create child nodes.

	3.	Compute Entropy of each child node → measure disorder after split.

	4.	Weighted average of child entropies → combine them.

	5.	Information Gain = Parent Entropy - Weighted Child Entropy

 A higher Information Gain means the feature splits the data better → child nodes are more pure.
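
The steps above can be sketched in a few lines of plain Python. The helper names `entropy_of` and `information_gain`, and the ten toy labels, are made up for illustration.

```python
import math
from collections import Counter


def entropy_of(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent_labels, child_label_groups):
    """Parent entropy minus the size-weighted average of the child entropies."""
    n = len(parent_labels)
    weighted_child_entropy = sum(
        len(group) / n * entropy_of(group) for group in child_label_groups
    )
    return entropy_of(parent_labels) - weighted_child_entropy


# Hypothetical node: 5 "Yes" and 5 "No", split into two children by some feature
parent = ["Yes"] * 5 + ["No"] * 5
left = ["No"] * 4 + ["Yes"] * 1
right = ["Yes"] * 4 + ["No"] * 1

ig = information_gain(parent, [left, right])
print(f"Parent entropy: {entropy_of(parent):.3f}")  # 1.000
print(f"Information Gain: {ig:.3f}")                # 0.278
```

A split that left both children as mixed as the parent would score an Information Gain of 0.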



3. Why is Information Gain Important?

	•	Decision Trees need to decide which feature to split on at each node.
	•	Information Gain is the metric ID3 uses to choose the best feature; C4.5 uses a normalized variant, the gain ratio.
	•	Feature with highest Information Gain → chosen for the split.
	•	This greedy choice gives the largest immediate reduction in uncertainty at each split, which tends to produce smaller and more accurate trees.



4. Example (Simple)


Suppose you want to predict if someone will buy a product.

	•	Feature “Age” splits the dataset into:
		•	Young → mostly “No” (Entropy low)
		•	Old → mostly “Yes” (Entropy low)
	•	Parent node entropy was high (mixed “Yes”/“No”)
	•	Information Gain = Parent Entropy – Weighted Child Entropy → high
	•	So, “Age” is a good feature to split on.

###5. What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?


1. Common Real-World Applications of Decision Trees


Decision Trees are versatile and widely used in many domains, especially for classification and regression tasks.

 Some examples:


a) Healthcare

	•	Diagnosing diseases based on patient symptoms and test results.

	•	Example: Predicting whether a patient has diabetes or heart disease.

b) Finance

	•	Credit scoring and loan approval.
	•	Example: Predicting if a person is a good credit risk based on income, age, debt, etc.
	•	Fraud detection in banking transactions.

c) Marketing

	•	Customer segmentation and targeting.
	•	Example: Predicting which customers are likely to respond to a promotion.

d) E-commerce

	•	Recommendation systems.
	•	Example: Predicting whether a user will buy a product based on browsing history.

e) Manufacturing

	•	Quality control and defect detection.
	•	Example: Predicting defective products using machine sensor data.

f) Operations & HR

	•	Employee attrition prediction.
	•	Example: Predicting whether an employee is likely to leave the company.



2. Main Advantages of Decision Trees

	1.	Easy to understand and interpret
	•	Trees can be visualized as flowcharts; non-technical people can understand them.
	2.	Handles both numerical and categorical data
	3.	No need for feature scaling
	•	Unlike algorithms like SVM or KNN, Decision Trees don’t require normalization.
	4.	Can capture non-linear relationships
	•	Splits allow complex decision boundaries.
	5.	Can handle missing values (some implementations)
	6.	Fast predictions
	•	Traversing a tree is computationally simple.



3. Main Limitations of Decision Trees

	1.	Overfitting
	•	Trees can become too deep and memorize training data, reducing generalization.
	2.	Sensitive to small data changes
	•	A small change in the data can lead to a completely different tree.
	3.	Biased splits
	•	With imbalanced classes the tree tends to favor the majority class, and features with many categories may dominate splits.
	4.	Greedy algorithm
	•	Locally optimal splits may not lead to globally optimal trees.
	5.	Less accurate than ensemble methods
	•	Single Decision Trees are often outperformed by Random Forests or Gradient Boosting.

###6. Write a Python program to:
● Load the Iris Dataset

● Train a Decision Tree Classifier using the Gini criterion

● Print the model’s accuracy and feature importances

In [10]:
# Import Libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Classifier using Gini
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy*100:.2f}%")

# Print Feature Importances
print("Feature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")

Model Accuracy: 100.00%
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


###7. Write a Python program to:
● Load the Iris Dataset

● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.


In [11]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree with max_depth=3
clf_depth3 = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_depth3.fit(X_train, y_train)
y_pred_depth3 = clf_depth3.predict(X_test)
accuracy_depth3 = accuracy_score(y_test, y_pred_depth3)

# Train fully-grown Decision Tree (no max_depth)
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Print accuracies
print(f"Accuracy (max_depth=3): {accuracy_depth3*100:.2f}%")
print(f"Accuracy (fully-grown tree): {accuracy_full*100:.2f}%")

Accuracy (max_depth=3): 100.00%
Accuracy (fully-grown tree): 100.00%


###8. Write a Python program to:
● Load the Boston Housing Dataset

● Train a Decision Tree Regressor

● Print the Mean Squared Error (MSE) and feature importances

In [12]:
# Import libraries
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load Boston Housing Dataset from OpenML (load_boston was removed in scikit-learn 1.2)
boston = fetch_openml(name='boston', version=1, as_frame=True)
X = boston.data
y = boston.target

# Split dataset into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Make predictions
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Print Feature Importances
print("Feature Importances:")
for feature, importance in zip(X.columns, regressor.feature_importances_):
    print(f"{feature}: {importance:.4f}")

Mean Squared Error: 10.42
Feature Importances:
CRIM: 0.0513
ZN: 0.0034
INDUS: 0.0058
CHAS: 0.0000
NOX: 0.0271
RM: 0.6003
AGE: 0.0136
DIS: 0.0707
RAD: 0.0019
TAX: 0.0125
PTRATIO: 0.0110
B: 0.0090
LSTAT: 0.1933


###9. Write a Python program to:
● Load the Iris Dataset

● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV

● Print the best parameters and the resulting model accuracy

In [13]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Define the hyperparameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10, 15]
}

# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best parameters
best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")

# Evaluate the model with best parameters on test data
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the best model: {accuracy*100:.2f}%")

Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Accuracy of the best model: 100.00%


###10. Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.

Explain the step-by-step process you would follow to:

● Handle the missing values

● Encode the categorical features

● Train a Decision Tree model

● Tune its hyperparameters

● Evaluate its performance

And describe what business value this model could provide in a real-world setting.

1. Handling Missing Values

	•	Importance: Missing data can introduce bias, reduce accuracy, or prevent the model from learning correctly.

	•	Steps:
	1.	Identify missing values: Check which features contain null values or placeholders like NA.
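	2.	Analyze patterns: Determine whether values are missing completely at random (MCAR), missing at random given other observed variables (MAR), or missing not at random, i.e. dependent on the unobserved value itself (MNAR).
	3.	Impute missing values (compute the fill values on the training split only, so no information leaks from the test set):
	•	Numerical features: Use median or mean to replace missing values. Median is preferred if the data has outliers.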

	•	Categorical features: Replace missing values with the mode (most frequent category) or a special label like "Unknown" to indicate missingness.

	4.	Drop features if necessary: If a feature has too many missing values (e.g., >50%), it may be safer to drop it to avoid noise.

	•	Business Impact: Proper handling of missing values ensures that the predictions are accurate and trustworthy, which is critical in healthcare.
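
A minimal sketch of these steps with pandas and scikit-learn's `SimpleImputer` follows. The tiny DataFrame and its column names (`age`, `blood_pressure`, `smoker`) are invented for illustration; in a real project the imputers would be fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical patient records with gaps
df = pd.DataFrame({
    "age": [34, np.nan, 58, 45],
    "blood_pressure": [120.0, 135.0, np.nan, 110.0],
    "smoker": ["yes", np.nan, "no", "no"],
})

# Step 1: identify missing values (share of missing entries per column)
print(df.isna().mean())

num_cols = ["age", "blood_pressure"]
cat_cols = ["smoker"]

# Step 3: median for numerical features, an explicit "Unknown" label for categorical ones
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")

df[num_cols] = num_imputer.fit_transform(df[num_cols])
df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

print(df)
```

Using `strategy="most_frequent"` instead of `"constant"` would give the mode-based variant described above.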



2. Encoding Categorical Features

	•	Why: Machine learning algorithms, including Decision Trees in libraries like scikit-learn, require numerical inputs.

	•	Techniques:

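	1.	Label (Ordinal) Encoding: Assigns a unique integer to each category. Suitable for ordinal variables (e.g., low, medium, high risk). In scikit-learn, OrdinalEncoder is intended for features, while LabelEncoder is meant for the target.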

	2.	One-Hot Encoding: Creates a binary column for each category. Best for nominal variables (e.g., gender, blood type) without any natural order.

	•	Combining features: After encoding categorical features, they are merged with numerical features to form the final dataset.
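
The two encodings can be combined in one step with scikit-learn's `ColumnTransformer`. This is a sketch on invented data: the column names and the category order `low < medium < high` are assumptions made for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical patient features
df = pd.DataFrame({
    "risk_level": ["low", "high", "medium", "low"],  # ordinal
    "blood_type": ["A", "O", "B", "A"],              # nominal
    "age": [34, 51, 58, 45],                         # already numerical
})

encoder = ColumnTransformer(
    transformers=[
        # The explicit category list fixes the order low=0, medium=1, high=2
        ("ordinal", OrdinalEncoder(categories=[["low", "medium", "high"]]), ["risk_level"]),
        # One binary column per blood type; unseen types are encoded as all zeros
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["blood_type"]),
    ],
    remainder="passthrough",  # numerical columns are merged in unchanged
    sparse_threshold=0,       # always return a dense array
)

X_encoded = encoder.fit_transform(df)
print(X_encoded.shape)  # (4, 5): 1 ordinal + 3 one-hot + 1 numerical column
print(X_encoded)
```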



3. Training a Decision Tree Model

	•	Why Decision Tree:

	•	Intuitive and interpretable, which is crucial in healthcare for explaining predictions.
	•	Can handle both numerical and categorical features (scikit-learn's implementation needs the categorical ones encoded first).
	•	Captures non-linear relationships between the features and the disease outcome.

	•	Steps:

	1.	Split data into training and testing sets to evaluate model performance on unseen data.
	2.	Initialize a DecisionTreeClassifier.
	3.	Fit the model on the training data to learn decision rules based on features.

	•	Outcome: The tree splits the data at each node based on features that best reduce impurity (using Gini or Entropy).



4. Hyperparameter Tuning

	•	Why: Default trees often overfit, especially on complex datasets. Tuning parameters ensures better generalization.

•	Key hyperparameters:
	•	max_depth: Maximum depth of the tree to control complexity.

	•	min_samples_split: Minimum samples required to split a node.

	•	min_samples_leaf: Minimum samples in a leaf node.

	•	criterion: “gini” or “entropy” to measure impurity.

	•	Technique: Use GridSearchCV to systematically search for the combination of hyperparameters that maximizes cross-validated accuracy.

	•	Business Impact: Tuning ensures that the model avoids overfitting, producing reliable predictions for real patients.
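
The sketch below shows what such a search could look like. It uses synthetic data from `make_classification` as a stand-in for the patient dataset, and it scores by recall rather than accuracy, which is a deliberate swap for this example because missed disease cases are the costlier error; the grid values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (roughly 20% positive cases)
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

param_grid = {
    "max_depth": [3, 5, 8, None],
    "min_samples_split": [2, 10, 20],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="recall",  # use scoring="accuracy" to match the description above
)
search.fit(X_train, y_train)

print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated recall: {search.best_score_:.3f}")
```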



5. Model Evaluation

•	Metrics:
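	•	Accuracy: Overall correctness of predictions; it can be misleading when the disease is rare, because always predicting “healthy” already scores high.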

	•	Precision: How many predicted positives are actually positive (important to avoid unnecessary treatment).
	•	Recall (Sensitivity): How many actual positives were correctly identified (critical in healthcare to avoid missed diagnoses).
	•	F1-score: Balance between precision and recall.
	•	ROC-AUC: Measures discriminative ability of the model.
•	Importance in healthcare: Prioritizing high recall ensures disease cases are not missed, even if it means some false positives.
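
These metrics are all available in `sklearn.metrics`. The sketch below computes them on eight made-up patients; `y_score` stands for the predicted probability of disease, and `y_pred` is that score thresholded at 0.5.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical results: 1 = disease, 0 = no disease
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]
y_pred = [1 if s > 0.5 else 0 for s in y_score]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_score),  # uses scores, not hard labels
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With one missed case and one false alarm among the eight patients, accuracy, precision, recall and F1 all come out at 0.75, while ROC-AUC is 0.9375. Lowering the threshold below 0.5 would trade precision for recall.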



6. Business Value

	•	Early Detection: Identify high-risk patients before severe symptoms appear.

	•	Resource Optimization: Prioritize patients for expensive tests or specialist consultations.

	•	Cost Reduction: Reduce unnecessary tests for low-risk patients.

	•	Improved Patient Outcomes: Early and accurate diagnosis leads to better treatment plans and outcomes.
  
	•	Data-Driven Decision Making: Hospitals can allocate staff, equipment, and budget more effectively using predictive insights.