## ***Decision Tree***

---



**Question 1:** What is a Decision Tree, and how does it work in the context of
classification?
**Answer:**

A Decision Tree is a supervised machine learning algorithm that splits data into branches based on feature values, forming a tree-like structure.

At each node, it chooses the feature (and, for numerical features, the threshold) that best separates the classes.

Internal nodes represent features, branches represent decisions, and leaves represent outcomes (class labels).

In classification, the algorithm predicts the class label by following the path from root to leaf.
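
To make the root-to-leaf idea concrete, here is a minimal sketch that prints the rules a small tree learns on Iris and the nodes one sample passes through. The `max_depth=2` setting and the choice of the first sample are illustrative assumptions, not part of the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree so the printed rules stay readable
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Each indented line is a test at an internal node; each "class:" line is a leaf
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)

# decision_path marks the nodes a sample visits on its way from root to leaf
sample = iris.data[:1]
visited_nodes = clf.decision_path(sample).indices
predicted_class = iris.target_names[clf.predict(sample)[0]]
print("Nodes visited:", visited_nodes, "-> predicted class:", predicted_class)
```

Node 0 is the root, and the last node listed is the leaf whose class label becomes the prediction.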

**Question 2:** Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?
**Answer:**


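Gini Impurity: Measures how often a randomly chosen element would be misclassified if it were labeled at random according to the class distribution in the node.
Formula:

$$G = 1 - \sum_i p_i^2$$

where $p_i$ is the proportion of samples in the node that belong to class $i$.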


Entropy: Measures disorder or uncertainty in the dataset.
Formula:

$$H = -\sum_i p_i \log_2(p_i)$$

Impact:

Lower Gini/Entropy means purer nodes.

The Decision Tree evaluates candidate splits and picks the one that minimizes the weighted impurity of the child nodes, making each node as pure as possible.
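
Both measures are easy to compute by hand. The sketch below is a minimal NumPy version; the helper names `class_proportions`, `gini`, and `entropy` are hypothetical and are not scikit-learn functions.

```python
import numpy as np

def class_proportions(labels):
    """Return the proportion p_i of each class present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    p = class_proportions(labels)
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy in bits: -sum(p_i * log2(p_i)), over classes that are present."""
    p = class_proportions(labels)
    return -np.sum(p * np.log2(p))

pure = ["yes"] * 6                  # one class only
mixed = ["yes"] * 3 + ["no"] * 3    # evenly split between two classes

print("pure :", gini(pure), entropy(pure))
print("mixed:", gini(mixed), entropy(mixed))
```

A pure node scores zero on both measures, while an even two-class mix reaches the two-class maximum: 0.5 for Gini and 1 bit for entropy.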

**Question 3:** What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

**Answer:**


**Pre-Pruning:** Stop tree growth early using conditions (e.g., max_depth, min_samples_split).
Advantage: Prevents overfitting and reduces training time.

**Post-Pruning:** Build full tree, then cut back unnecessary branches.
Advantage: Allows more flexibility and often yields better generalization.
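
As a rough illustration of the two strategies in scikit-learn, the sketch below pre-prunes with `max_depth` and `min_samples_split`, and post-prunes with minimal cost-complexity pruning (`ccp_alpha`), which is the post-pruning method scikit-learn provides. Picking the middle alpha from the pruning path is an arbitrary choice for demonstration; in practice it would be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: growth stops as soon as a limit is hit
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then prune it back
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # illustrative pick only
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post.fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "nodes:", model.tree_.node_count,
          "test accuracy:", model.score(X_test, y_test))
```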

**Question 4:** What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

**Answer:**

Information Gain (IG): Reduction in impurity (Entropy or Gini) after splitting.

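Formula:

$$IG = H(\text{parent}) - \sum_i \frac{n_i}{n} H(\text{child}_i)$$

where $n$ is the number of samples in the parent node and $n_i$ is the number of samples in child $i$.

Importance: The split with the highest Information Gain is chosen at each node, so the tree always asks the most informative question available.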
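
A small worked example can make the formula tangible. The sketch below uses entropy as the impurity measure; the helper names and the toy labels are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Entropy in bits of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """H(parent) minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted_child_entropy = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted_child_entropy

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# A split that separates the classes completely
perfect_split = [parent[:4], parent[4:]]
# A split that leaves each child as mixed as the parent
poor_split = [parent[[0, 1, 4, 5]], parent[[2, 3, 6, 7]]]

print("perfect split IG:", information_gain(parent, perfect_split))
print("poor split IG:   ", information_gain(parent, poor_split))
```

The perfect split recovers the full 1 bit of parent entropy, while the poor split gains nothing, which is why the tree would prefer the first.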

**Question 5:** What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?
**Answer:**

Applications:

Medical diagnosis

Credit risk analysis

Fraud detection

Marketing/customer segmentation

Advantages: Easy to interpret, handles both categorical & numerical data, non-parametric.

Limitations: Can overfit easily, unstable to small changes in the training data, and often less accurate than ensemble methods.


In [3]:
# Question 6: Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier using the Gini criterion
# ● Print the model’s accuracy and feature importances
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predictions
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Feature Importances:", clf.feature_importances_)


Accuracy: 1.0
Feature Importances: [0.         0.01911002 0.89326355 0.08762643]


In [7]:
# Question 7: Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a
#   fully-grown tree

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris Dataset
iris = load_iris()
X,y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fully-grown tree
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
acc_full = accuracy_score(y_test, y_pred_full)

# Limited depth tree
clf_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_limited.fit(X_train, y_train)
y_pred_limited = clf_limited.predict(X_test)
acc_limited = accuracy_score(y_test, y_pred_limited)

print("Accuracy of fully-grown tree:", acc_full)
print("Accuracy of depth-limited tree (max_depth=3):", acc_limited)



Accuracy of fully-grown tree: 1.0
Accuracy of depth-limited tree (max_depth=3): 1.0


In [8]:
#Question 8: Write a Python program to:
#● Load the California Housing dataset from sklearn
#● Train a Decision Tree Regressor
#● Print the Mean Squared Error (MSE) and feature importances
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split  # needed for the split below
from sklearn.metrics import mean_squared_error

# Load dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train regressor
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Predictions
y_pred = reg.predict(X_test)

print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("Feature Importances:", reg.feature_importances_)


Mean Squared Error: 0.495235205629094
Feature Importances: [0.52850909 0.05188354 0.05297497 0.02866046 0.03051568 0.13083768
 0.09371656 0.08290203]


In [10]:
#Question 9: Write a Python program to:
#● Load the Iris Dataset
#● Tune the Decision Tree’s max_depth and min_samples_split using
#GridSearchCV
#● Print the best parameters and the resulting model accuracy
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target  # discrete class labels (0, 1, 2)

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Set parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, 6],
    'min_samples_split': [2, 3, 4, 5, 6]
}

# Apply GridSearchCV
grid = GridSearchCV(dt, param_grid, cv=5)
grid.fit(X_train, y_train)

# Print best parameters
print("Best Parameters:", grid.best_params_)

# Make predictions using the best model
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)



Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy: 1.0


**Question 10:** Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:

● Handle the missing values

● Encode the categorical features

● Train a Decision Tree model

● Tune its hyperparameters

● Evaluate its performance

And describe what business value this model could provide in the real-world
setting.

**Answer:**
**Handle Missing Values:**

Use mean/median for numerical features.

Use mode or most frequent category for categorical features.

For more complex missingness patterns, use model-based imputation (KNN, MICE).

**Encode Categorical Features:**

One-Hot Encoding for nominal categories.

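Ordinal encoding (integer codes that preserve the category order) for ordinal categories. In scikit-learn, `OrdinalEncoder` is the feature-side tool; `LabelEncoder` is intended for target labels.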

**Train Decision Tree Model:**

Split data into train/test.

Fit DecisionTreeClassifier.

**Tune Hyperparameters:**

Use GridSearchCV/RandomizedSearchCV for max_depth, min_samples_split, criterion.

**Evaluate Performance:**

Use Accuracy, Precision, Recall, F1-score, and ROC-AUC. When the disease is rare, Recall and ROC-AUC are more informative than Accuracy alone.
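
The technical steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption: the column names, the synthetic target rule, and the injected missing values are hypothetical stand-ins for real patient records, and the parameter grid is only a starting point.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for patient records
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east"], n),
})
numeric = ["age", "blood_pressure"]
categorical = ["smoker", "region"]
df[categorical] = df[categorical].astype(object)  # plain object columns, NaN = missing

# Synthetic target (assumption): older smokers are labeled positive
y = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)

# Inject missing values to mimic incomplete records
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

# Steps 1 and 2: impute, then one-hot encode the categorical columns
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: the tree sits at the end of the pipeline, so imputation and
# encoding are fit on training folds only
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: tune hyperparameters with cross-validation
param_grid = {
    "tree__max_depth": [2, 3, 4, 5],
    "tree__min_samples_split": [2, 5, 10],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set
y_pred = search.predict(X_test)
y_prob = search.predict_proba(X_test)[:, 1]
print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_prob))
```

Wrapping preprocessing and the model in one `Pipeline` is a design choice that keeps test data out of the imputation statistics during cross-validation.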

**Business Value:**

Helps doctors quickly identify high-risk patients.

Reduces healthcare costs by prioritizing urgent cases.

Improves patient outcomes through early disease prediction.