# Assignment Code: DA-AG-012
# Decision Tree | Assignment



Question 1:  What is a Decision Tree, and how does it work in the context of
classification?

Ans> A decision tree is a supervised learning algorithm used for classification and regression tasks. In classification, it predicts the class or category of an input by applying a sequence of feature-based tests from the root node down to a leaf. The tree is built and used as follows:

1. Selecting the Best Split: The algorithm begins by selecting the best attribute to split the data at the root node. This selection is typically based on criteria like Information Gain, Gini Index, or Gain Ratio, which measure how well an attribute separates the classes.
2. Splitting the Data: Once the best attribute is selected, the data is split into subsets based on the attribute’s values. For numerical data, this might involve dividing the data into ranges, while for categorical data, the split is based on distinct categories.
3. Repeating the Process: The algorithm repeats the splitting process recursively for each subset, creating new internal nodes and branches. The process continues until one of the stopping criteria is met, such as when all data points in a node belong to the same class or when further splitting does not significantly improve the classification.
4. Assigning Class Labels: Once the splitting process is complete, the leaf nodes are assigned class labels based on the majority class in each subset. These labels represent the final decisions or classifications made by the tree.
5. Pruning the Tree: To improve generalization and prevent overfitting, the tree is pruned by removing nodes that do not contribute significantly to classification accuracy. Pruning can be performed using techniques such as Reduced Error Pruning or Cost-Complexity Pruning.
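
As a rough illustration of the steps above, the following sketch fits a shallow tree on the Iris dataset with scikit-learn and prints the learned rules with `export_text`. The `max_depth=2` limit is an arbitrary choice made here only to keep the printed tree short; it is not part of the answer above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the data and fit a deliberately shallow tree (max_depth=2 is an assumption)
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Each indented line is a split (internal node); "class:" lines are leaf labels
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the output from top to bottom follows the same path a prediction takes: one test at the root, further tests on each branch, and a class label at every leaf.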

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

Ans> **Gini Impurity**: Gini Impurity is the probability that a randomly chosen instance in a node would be misclassified if it were labeled at random according to the class distribution in that node. It is 0 for a pure node (all instances in one class) and grows as the classes become more evenly mixed.

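**Entropy**: Entropy, rooted in information theory, measures the amount of uncertainty or randomness in a dataset. In the context of decision trees, it quantifies how mixed the class labels in a node are: it is 0 when every instance belongs to the same class and highest when the classes are equally represented.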

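Impact on splits: At each node, the tree evaluates candidate splits and chooses the one that most reduces the selected impurity measure, so the choice of measure can affect which splits are selected, the sensitivity to minority classes, and the resulting tree shape. Gini is slightly cheaper to compute because it avoids logarithms. In most practical scenarios, both measures perform comparably.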
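
A minimal sketch of both measures, using the standard definitions Gini = 1 - sum(p_i^2) and Entropy = -sum(p_i * log2(p_i)), where p_i is the proportion of class i in the node. The helper names and the toy label lists are illustrative only.

```python
from collections import Counter
from math import log2

def class_proportions(labels):
    """Proportion of each class among the labels of a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]

def gini_impurity(labels):
    # Gini = 1 - sum(p_i^2)
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    # H = -sum(p_i * log2(p_i)), written equivalently as sum(p_i * log2(1 / p_i))
    return sum(p * log2(1 / p) for p in class_proportions(labels))

pure = ["yes", "yes", "yes", "yes"]
mixed = ["yes", "yes", "no", "no"]

print(gini_impurity(pure), entropy(pure))    # 0.0 0.0
print(gini_impurity(mixed), entropy(mixed))  # 0.5 1.0
```

For two classes, both measures are 0 for a pure node and peak at a 50/50 mix (0.5 for Gini, 1.0 for entropy), which is why they usually rank candidate splits similarly.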

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

Ans>
**Pre-pruning**:
1. Pre-Pruning, also known as early stopping, involves halting the construction of a decision tree before it fully fits the training data.
2. Risk of overfitting is lower if stopping criteria are well tuned.
3. Prevents large trees proactively.
4. Practical advantage: Training is faster and cheaper, so pre-pruning is often used when computational resources are limited or datasets are large.

**Post-pruning**:
1. Post-Pruning, also called backward pruning, allows the tree to grow completely and then removes branches that contribute little to predictive performance.
2. Risk of overfitting is controlled by pruning irrelevant branches.
3. Reduces complexity after seeing full growth.
4. Practical advantage: Because pruning decisions are made after the full tree has been grown, useful deeper splits are not cut off prematurely. This suits cases where prediction accuracy is critical and sufficient data is available for validation.
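
The two approaches can be sketched with scikit-learn as follows. Pre-pruning is expressed through growth limits such as `max_depth` and `min_samples_leaf`, and post-pruning through cost-complexity pruning (`ccp_alpha`). The specific limits and the choice of the middle alpha on the pruning path are assumptions made for illustration; in practice alpha would be selected with a validation set or cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Reference: fully grown tree with no pruning
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning: constraints stop growth early (values are illustrative)
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: compute the cost-complexity path, then refit with one alpha
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # illustrative choice only
post_pruned = DecisionTreeClassifier(
    ccp_alpha=alpha, random_state=42
).fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name}: leaves={model.get_n_leaves()}, test accuracy={model.score(X_test, y_test):.2f}")
```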

Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

Ans> Information Gain (IG) is a measure used in decision trees to quantify the effectiveness of a feature in splitting the dataset into classes. It calculates the reduction in entropy (uncertainty) of the target variable (class labels) when a particular feature is known.
Information Gain helps us understand how much a particular feature contributes to making accurate predictions in a decision tree. Features with higher Information Gain are considered more informative and are preferred for splitting the dataset, as they lead to nodes with more homogeneous classes.
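
A small worked sketch, assuming the usual definition: Information Gain = entropy of the parent node minus the size-weighted average entropy of the child nodes. The function names and toy splits are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum((c / total) * log2(total / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]  # children are pure
useless_split = [["yes", "no"], ["yes", "no"]]  # children are as mixed as the parent

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

A split that separates the classes completely recovers all of the parent's entropy, while a split that leaves each child as mixed as the parent gains nothing, so the tree would prefer the first.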

Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

Ans> Common Real-World Applications
1. Finance and Banking: Decision Trees help in credit scoring, loan approval predictions, fraud detection, and risk assessment, as they can handle both numerical and categorical data.
2. Marketing and Sales: Businesses use Decision Trees to segment customers, predict churn, recommend products, and optimize marketing strategies based on customer behavior.
3. Operations and Manufacturing: Used for quality control, predictive maintenance, and process optimization by analyzing machinery sensor data and production metrics.

Advantages:
* Interpretability: The tree structure is easy to visualize and understand, making results transparent and explainable.
* Non-parametric: Decision Trees do not assume any probability distributions, making them flexible for diverse datasets.
* Handles mixed data types: Can work with both numerical and categorical features, and some implementations handle missing values natively, so little preprocessing (for example, no feature scaling) is needed.

Limitations:
* Overfitting: Decision Trees can easily overfit, especially with deep trees and noisy data, which reduces generalization to new datasets.
* Lack of smooth predictions: In regression, predictions are piecewise constant rather than smooth, which may be inappropriate for certain continuous outputs.

In [None]:
# Question 6:   Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier using the Gini criterion
# ● Print the model’s accuracy and feature importances

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree Classifier using the Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate and print the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Print feature importances
print("Feature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")

Model Accuracy: 1.00
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


In [None]:
# Question 7:  Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
# a fully-grown tree

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree Classifier with max_depth=3
clf_limited_depth = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_limited_depth.fit(X_train, y_train)

# Make predictions on the test set
y_pred_limited_depth = clf_limited_depth.predict(X_test)

# Calculate and print the model's accuracy for the limited depth tree
accuracy_limited_depth = accuracy_score(y_test, y_pred_limited_depth)
print(f"Model Accuracy (max_depth=3): {accuracy_limited_depth:.2f}")

# Train a fully-grown Decision Tree Classifier (default behavior)
clf_full_depth = DecisionTreeClassifier(random_state=42)
clf_full_depth.fit(X_train, y_train)

# Make predictions on the test set
y_pred_full_depth = clf_full_depth.predict(X_test)

# Calculate and print the model's accuracy for the fully-grown tree
accuracy_full_depth = accuracy_score(y_test, y_pred_full_depth)
print(f"Model Accuracy (fully-grown): {accuracy_full_depth:.2f}")

Model Accuracy (max_depth=3): 1.00
Model Accuracy (fully-grown): 1.00


In [1]:
# Question 8: Write a Python program to:
# ● Load the California Housing dataset from sklearn
# ● Train a Decision Tree Regressor
# ● Print the Mean Squared Error (MSE) and feature importances

from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Make predictions on the test set
y_pred = regressor.predict(X_test)

# Calculate and print the Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.2f}")

# Print feature importances
print("Feature Importances:")
for feature, importance in zip(housing.feature_names, regressor.feature_importances_):
    print(f"{feature}: {importance:.4f}")

Mean Squared Error (MSE): 0.50
Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


In [2]:
# Question 9: Write a Python program to:
# ● Load the Iris Dataset
# ● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
# ● Print the best parameters and the resulting model accuracy

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter grid to tune
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10]
}

# Create a Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Perform GridSearchCV to find the best parameters
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print the best parameters found
print(f"Best parameters: {grid_search.best_params_}")

# Get the best model
best_clf = grid_search.best_estimator_

# Make predictions on the test set using the best model
y_pred = best_clf.predict(X_test)

# Calculate and print the accuracy of the best model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with best parameters: {accuracy:.2f}")

Best parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy with best parameters: 1.00


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.


Ans>
1. Handling Missing Values
In healthcare datasets, missing values are common due to incomplete records, patient non-responses, or data entry errors. Handling them is crucial for model accuracy.
Steps:
* Data Exploration: Identify missing values by column and calculate the proportion of missing entries. Visualizations like heatmaps or summary tables can help detect patterns.
* Assess Mechanism of Missingness: Classify missing data as MCAR (Missing Completely at Random), MAR (Missing at Random), or MNAR (Missing Not at Random). This informs the imputation strategy.
* Imputation Strategy:
  * Numerical Features: Fill missing values using the mean, median, or K-nearest neighbors (KNN) imputation, depending on distribution skewness.
  * Categorical Features: Use mode imputation or create a separate 'missing' category.
2. Encoding Categorical Features
Decision trees can handle categorical data natively in some implementations, but many libraries require numeric encoding.
* Nominal Features: One-hot encoding is standard to avoid implicit ordering. Target encoding can be used cautiously if data leakage is prevented.
* Integration: Combine the encoded features with the numerical data. Feature scaling is not required for decision trees.
3. Training a Decision Tree Model
Decision trees recursively partition the feature space based on splitting criteria like Gini impurity or entropy.

* Split Dataset: Use train-test split (e.g., 80%-20%) or cross-validation frameworks.
* Model Initialization: Choose parameters like criterion, max_depth, min_samples_split, and min_samples_leaf for initial trials.
* Training: Fit the tree on the training data using the features and the target variable (disease presence: yes/no).

4. Hyperparameter Tuning
To improve generalization and prevent overfitting, search over the key hyperparameters with cross-validation (for example, using GridSearchCV).
Key Hyperparameters:
* max_depth: Restricts tree depth.
* min_samples_split and min_samples_leaf: Avoid splits with very few records.
* max_features: Controls the number of features considered at each split.
* criterion: 'gini' or 'entropy', the impurity measure used to evaluate split quality.

5. Evaluating Model Performance
Steps:
* Confusion Matrix: Calculate TP, FP, FN, TN to derive precision, recall, and F1-score.
* ROC Curve / AUC: Assess the model’s discriminative ability.
* Cross-Validation Scores: Ensure stability across folds.
* Calibration: Verify predicted probabilities align with actual disease occurrence if probability outputs are relevant.
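
The steps above could be combined into a single scikit-learn pipeline, sketched below under stated assumptions. The data is synthetic, the column names (`age`, `blood_pressure`, `smoker`) and the label rule are hypothetical stand-ins for real patient records, and the parameter grid and `f1` scoring are illustrative choices rather than recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (hypothetical columns and label rule)
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": pd.Series(rng.choice(["yes", "no"], n), dtype=object),
})
y = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)

# Inject missing values to mimic incomplete records
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker"]

# Steps 1 and 2: impute, then one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: the tree is trained on the preprocessed features
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: cross-validated grid search over tree hyperparameters
param_grid = {
    "tree__max_depth": [2, 3, 5, None],
    "tree__min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set
y_pred = search.predict(X_test)
y_prob = search.predict_proba(X_test)[:, 1]
print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
print(f"ROC AUC: {roc_auc_score(y_test, y_prob):.3f}")
```

Keeping imputation and encoding inside the pipeline means they are refit on each cross-validation training fold, which avoids leaking information from the validation folds into preprocessing.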


Business Value in Real-World Healthcare
* Early Disease Detection: Predictive insights allow proactive intervention, potentially reducing hospitalizations and improving patient outcomes.
* Resource Allocation: Prioritize high-risk patients for further tests or monitoring.
* Cost Savings: Reduce unnecessary tests for low-risk patients and optimize operational efficiency.
* Personalized Care: Inform treatment plans based on model predictions, enabling targeted preventative measures.
* Compliance & Reporting: Support evidence-based recommendations for healthcare policies and insurance purposes.