# Decision Tree | Assignment

Question 1:  What is a Decision Tree, and how does it work in the context of
classification?

Answer:

*   A Decision Tree is a supervised machine learning algorithm used for both classification and regression. It recursively splits the data into subsets based on feature values, forming a tree-like structure of decisions that ends in a predicted class or value.

*   How Decision Trees Work in Classification

    * The process begins at the root node, representing the entire dataset.

    * At each decision (internal) node, data is split according to rules based on feature values (e.g., "Age > 30?", "Income > $50,000?").

    * This splitting is usually determined by measures like Gini impurity or information gain, which help find the most discriminative feature for the split.

    * Branches connect nodes and represent possible values or outcomes for each decision.

    * The process continues recursively, forming further internal nodes and splits, until a stopping criterion is reached (for example, all observations in a node belong to the same class, or a maximum tree depth is reached).

    * Leaf nodes represent the final decision or predicted class for the given subset of data.

    * For classification, the outcome in each leaf is generally a categorical value, like "spam" or "not spam".

*   Example:

    For classifying whether a customer will buy a product, a decision tree may ask questions at each node:

    "Income > $50,000?" → If yes, go to the next question; if no, predict "No Purchase".

    "Age > 30?" → If yes, go to next question; if no, predict "No Purchase".

    "Previous Purchases > 0?" → If yes, predict "Purchase"; if no, predict "No Purchase".

    Each path from the root to a leaf defines a decision rule mapping input features to a class label.
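
The three questions above can be written out as nested conditions. This is only a hand-coded sketch of the example tree, not a trained model; the function name `predict_purchase` and its argument names are made up for illustration.

```python
def predict_purchase(income: float, age: int, previous_purchases: int) -> str:
    """Walk the example tree from the root to a leaf and return the class label."""
    if income <= 50_000:            # Root: "Income > $50,000?" answered "no"
        return "No Purchase"
    if age <= 30:                   # Second node: "Age > 30?" answered "no"
        return "No Purchase"
    if previous_purchases <= 0:     # Third node: "Previous Purchases > 0?" answered "no"
        return "No Purchase"
    return "Purchase"               # All three answers were "yes"


print(predict_purchase(income=72_000, age=41, previous_purchases=3))  # Purchase
print(predict_purchase(income=72_000, age=25, previous_purchases=3))  # No Purchase
```

Each `return` corresponds to one leaf, and the chain of conditions leading to it is the decision rule for that leaf.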

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

Answer:

*   Gini Impurity and Entropy are both impurity measures used in Decision Trees to evaluate how well a feature splits the data at each node, guiding the tree-building process to create the most homogeneous branches possible.

*   Impurity Measures:

    To decide how to split nodes, Decision Trees use impurity measures that quantify how mixed the classes are within a node. Two common measures are Gini Impurity and Entropy:

*   Gini Impurity:
    
    Measures the chance of misclassifying a randomly chosen sample from the node if it were labeled at random according to the node's class distribution.

    Formula:

    Gini = 1 - sum(p_j^2 for each class j)
    
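    Here p_j is the proportion of samples in the node that belong to class j. Gini ranges from 0 (pure node with one class) to 1 - 1/k when k classes are equally frequent (0.5 for two classes).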

    Splits aim to minimize Gini impurity.

*   Entropy:
      
    Measures the disorder or uncertainty in the node.

    Formula:

    Entropy = - sum(p_j * log2(p_j) for each class j)

    Entropy is 0 for a pure node and reaches its maximum, log2(k) for k classes (1 bit for two classes), when the classes are evenly distributed.

    The algorithm chooses splits that reduce entropy the most (maximize information gain).

*   Impact on Decision Tree Splits:

    * At each node, possible splits are evaluated by calculating the weighted average impurity (Gini or Entropy) of child nodes.

    * The best split is the one that reduces impurity the most (lowest weighted average impurity).

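    * Gini is slightly cheaper to compute because it needs no logarithm; it is the default criterion in scikit-learn.

    * Entropy is grounded in information theory. Whether it helps on a given problem (multi-class or otherwise) is an empirical question, so the criterion is usually treated as a hyperparameter.

    * In practice, both usually lead to similar tree structures and accuracy.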
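
To make the two formulas and the weighted-average rule concrete, here is a small pure-Python sketch. The labels and the two candidate splits are invented toy data, and the helper names (`gini`, `entropy`, `weighted_impurity`) are illustrative, not a library API.

```python
from collections import Counter
from math import log2


def class_probabilities(labels):
    """Return the proportion p_j of each class present in the node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini(labels):
    """Gini = 1 - sum(p_j^2)."""
    return 1.0 - sum(p ** 2 for p in class_probabilities(labels))


def entropy(labels):
    """Entropy = -sum(p_j * log2(p_j))."""
    return -sum(p * log2(p) for p in class_probabilities(labels))


def weighted_impurity(left, right, measure):
    """Average the children's impurity, weighted by their share of the samples."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


# Toy parent node: 5 "spam" and 5 "ham" (maximally mixed for two classes).
parent = ["spam"] * 5 + ["ham"] * 5
# Candidate split A separates the classes well; candidate split B barely does.
split_a = (["spam"] * 4 + ["ham"] * 1, ["spam"] * 1 + ["ham"] * 4)
split_b = (["spam"] * 3 + ["ham"] * 2, ["spam"] * 2 + ["ham"] * 3)

print(f"parent  gini={gini(parent):.3f} entropy={entropy(parent):.3f}")  # 0.500, 1.000
for name, (left, right) in [("split A", split_a), ("split B", split_b)]:
    g = weighted_impurity(left, right, gini)
    h = weighted_impurity(left, right, entropy)
    print(f"{name} gini={g:.3f} entropy={h:.3f}")
# split A gini=0.320 entropy=0.722
# split B gini=0.480 entropy=0.971
```

Both measures rank split A ahead of split B, which is the usual situation: the criteria differ in scale but tend to agree on which split is better.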

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

Answer:

*   Pre-Pruning and Post-Pruning are two techniques used to prevent overfitting in Decision Trees by controlling their size and complexity.

*   Pre-Pruning (Early Stopping):

    * Pre-Pruning stops the tree growth early during the training process before it becomes too complex.

    * It uses criteria like maximum tree depth, minimum samples per leaf, or minimum information gain to halt further splitting.

    * This method avoids building deep branches that have little benefit, thus preventing overfitting from the start.

    * Practical Advantage: It is computationally efficient because it stops unnecessary splits early, saving time and resources during training.

*   Post-Pruning (Pruning after Full Growth):

    * Post-Pruning allows the tree to grow fully and then systematically removes or trims branches that do not add significant predictive power.

    * Techniques include cost-complexity pruning, reduced error pruning, and pessimistic error pruning.

    * It refines the large tree by replacing some subtrees with leaf nodes in a bottom-up manner.

    * Practical Advantage: It often results in better generalization and more accurate models because pruning decisions are based on actual performance of the fully grown tree.
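
The two approaches can be compared in scikit-learn roughly as follows. This is a sketch under assumptions: the Iris data and the 70/30 split stand in for any dataset, the pre-pruning limits (`max_depth=3`, `min_samples_leaf=5`) are arbitrary, and the pruning strength `alpha` is simply taken from the middle of the pruning path, whereas in practice it would normally be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: growth is stopped early by limits set before training.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune it with cost-complexity pruning.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
# Arbitrary choice for illustration: a mid-path alpha, clipped at zero.
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(
    X_train, y_train
)

for name, model in [
    ("full tree", full_tree),
    ("pre-pruned", pre_pruned),
    ("post-pruned", post_pruned),
]:
    print(
        f"{name}: leaves={model.get_n_leaves()}, "
        f"test accuracy={model.score(X_test, y_test):.4f}"
    )
```

Larger values of `ccp_alpha` prune more aggressively; `path.ccp_alphas` lists the values at which the tree actually changes.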

Question 4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?


Answer:

*   Information Gain in Decision Trees is a metric that measures the effectiveness of a feature in splitting the dataset into classes. It quantifies how much knowing a feature reduces the uncertainty (entropy) of the target variable.

    * In simple terms, Information Gain is calculated as:

      Information Gain = Entropy(parent) - sum((n_child / n_parent) * Entropy(child) for each child node)

    * Entropy measures how mixed or impure a dataset is; it is high when classes are evenly distributed and low when mostly one class dominates.

    * By splitting the data using a particular feature, the aim is to decrease entropy and make child nodes more homogeneous.

*   Information Gain is important because:

    * It helps decide the best feature and threshold to split the data at each node.

    * The feature with the highest Information Gain is chosen for the split, which produces purer child nodes and usually better classification accuracy.

    * It guides the tree-building process toward efficient and meaningful splits.
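
As a worked example, the sketch below scores every candidate threshold on a single numeric feature and keeps the one with the highest Information Gain. The six `ages` and `bought` values are invented toy data, and the helper names are illustrative only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy = -sum(p_j * log2(p_j)) over the classes present in labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy data (sorted by age): did the customer buy?
ages = [22, 25, 28, 35, 40, 52]
bought = ["no", "no", "no", "yes", "yes", "no"]


def gain_for_threshold(threshold):
    """Information Gain of the binary split "age <= threshold"."""
    left = [b for a, b in zip(ages, bought) if a <= threshold]
    right = [b for a, b in zip(ages, bought) if a > threshold]
    return information_gain(bought, [left, right])


# Candidate thresholds: midpoints between consecutive ages.
candidates = [(a + b) / 2 for a, b in zip(ages, ages[1:])]
for t in candidates:
    print(f"age <= {t}: gain = {gain_for_threshold(t):.3f}")

best = max(candidates, key=gain_for_threshold)
print(f"best threshold: {best}")  # best threshold: 31.5
```

The split at 31.5 wins with a gain of about 0.459 bits because it produces one pure child (three "no") and leaves all the remaining uncertainty in the other.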

Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

Answer:

Here are some common real-world applications of Decision Trees along with their main advantages and limitations in those contexts:

1. Loan Approval in Banking:

    * Application:
    
      Banks use Decision Trees to decide loan approvals based on credit score, income, employment status, and loan history.

    * Advantage:
    
      Provides clear, interpretable decisions that can be explained to customers and regulators.

    * Limitation:
      
      Can be biased if training data is imbalanced or does not account for all risk factors.

2. Medical Diagnosis:

    * Application:
    
      Predicts whether a patient has a disease like diabetes using clinical data such as glucose level, BMI, and blood pressure.

    * Advantage:
    
      Helps in early diagnosis and treatment with interpretable decision rules.

    * Limitation:
    
      May overfit if too deep, leading to inaccurate predictions on unseen patient data.

3. Customer Churn Prediction:

    * Application:
    
      Predicts if a customer is likely to leave using behavioral data and purchase history.

    * Advantage:
    
      Enables proactive retention strategies by identifying at-risk customers clearly.

    * Limitation:
    
      Can be sensitive to noisy data and may require frequent retraining to stay current.

4. Fraud Detection:

    * Application:
    
      Detects fraudulent transactions by analyzing patterns in transaction data.

    * Advantage:
    
      Provides transparency in alerts, aiding investigation teams.

    * Limitation:
    
      May generate false positives or miss evolving fraud patterns without updates.

5. Quality Control in Manufacturing:

    * Application:
    
      Predicts defective products based on production variables.

    * Advantage:
    
      Helps maintain quality standards and reduce waste with understandable rules.

    * Limitation:
    
      Limited by the quality and granularity of sensor/production data available.
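
The interpretability advantage that recurs in these applications can be seen by printing a fitted tree as plain-text rules. The sketch below uses the Iris data only as a stand-in (it is not a banking or medical dataset), and `max_depth=2` is an arbitrary choice to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rule set small enough to read at a glance.
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# export_text renders each split as an indented "feature <= threshold" rule.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one condition and each `class:` line is a leaf, so a reviewer can trace exactly why a given sample received its prediction.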

Question 6:   Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model's accuracy and feature importances

Answer:

In [1]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data          # Features
y = iris.target        # Target labels

# Split into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create Decision Tree Classifier with Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Predict on test data
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")

# Print feature importances
print("Feature Importances:")
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.4f}")


Model Accuracy: 1.0000
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876


Question 7:  Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.

Answer:

In [4]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree with max_depth=3
clf_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_limited.fit(X_train, y_train)
y_pred_limited = clf_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# Train fully grown Decision Tree (max_depth=None)
clf_full = DecisionTreeClassifier(max_depth=None, random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Print accuracy comparison
print(f"Accuracy with max_depth=3: {accuracy_limited:.4f}")
print(f"Accuracy with full depth: {accuracy_full:.4f}")

Accuracy with max_depth=3: 1.0000
Accuracy with full depth: 1.0000
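
Identical test accuracy does not by itself show that the two trees are the same model. A possible follow-up check, sketched here with the same split and seeds as above, is to compare their depth, number of leaves, and training accuracy; the `results` dictionary is just an illustrative container.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

results = {}
for label, depth in [("max_depth=3", 3), ("full depth", None)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    results[label] = {
        "depth": tree.get_depth(),                     # actual depth reached
        "leaves": tree.get_n_leaves(),                 # number of leaf nodes
        "train_acc": tree.score(X_train, y_train),     # fit to the training data
        "test_acc": tree.score(X_test, y_test),        # generalization estimate
    }

for label, stats in results.items():
    print(
        f"{label}: depth={stats['depth']}, leaves={stats['leaves']}, "
        f"train={stats['train_acc']:.4f}, test={stats['test_acc']:.4f}"
    )
```

A gap between training and test accuracy for the deeper tree would be a sign of overfitting; on a small, clean dataset such as Iris that gap can be too small to show up in a single 45-sample test set.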


Question 8: Write a Python program to:

● Load the Boston Housing Dataset

● Train a Decision Tree Regressor

● Print the Mean Squared Error (MSE) and feature importances


Answer:

In [6]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

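# Load the Boston Housing dataset from the CMU StatLib archive.
# The file is whitespace-delimited text (not CSV) with a 22-line description header.
url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(url, sep=r"\s+", skiprows=22, header=None)

# Each record spans two physical rows: 11 values on the first, 3 on the second.
# Stitch them into 13 feature columns and take the last value (MEDV) as the target.
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]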

# Split dataset into training and testing sets (70%-30%)
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3, random_state=42)

# Initialize and train Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict on the test set and compute MSE
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")

# Feature names as per Boston dataset description
feature_names = [
    'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
    'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'
]

# Print feature importances
print("Feature Importances:")
for name, importance in zip(feature_names, regressor.feature_importances_):
    print(f"{name}: {importance:.4f}")


Mean Squared Error: 11.5880
Feature Importances:
CRIM: 0.0585
ZN: 0.0010
INDUS: 0.0099
CHAS: 0.0003
NOX: 0.0071
RM: 0.5758
AGE: 0.0072
DIS: 0.1096
RAD: 0.0016
TAX: 0.0022
PTRATIO: 0.0250
B: 0.0119
LSTAT: 0.1900


Question 9: Write a Python program to:

● Load the Iris Dataset

● Tune the Decision Tree's max_depth and min_samples_split using
GridSearchCV

● Print the best parameters and the resulting model accuracy

Answer:

* The Python program below loads the Iris dataset, uses GridSearchCV to tune the Decision Tree Classifier's max_depth and min_samples_split parameters, and prints the best parameters and the resulting test accuracy:

In [7]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define Decision Tree classifier
dtree = DecisionTreeClassifier(random_state=42)

# Define parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10]
}

# Set up GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(dtree, param_grid, cv=5)

# Fit GridSearch to the training data
grid_search.fit(X_train, y_train)

# Best parameters found
best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")

# Predict on test set using best estimator
y_pred = grid_search.best_estimator_.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with tuned parameters: {accuracy:.4f}")

Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Model Accuracy with tuned parameters: 1.0000


Question 10: Imagine you're working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:

● Handle the missing values

● Encode the categorical features

● Train a Decision Tree model

● Tune its hyperparameters

● Evaluate its performance
And describe what business value this model could provide in the real-world setting.


Answer:


*   To build a disease prediction Decision Tree model with a healthcare dataset containing mixed data types and missing values, follow this step-by-step process:

1. Handle Missing Values:

    Identify missing data patterns and proportions per feature.

    Use appropriate imputation:

    * For numerical features, fill missing values with mean, median, or use advanced methods like KNN imputation.

    * For categorical features, fill with the mode or use a special 'missing' category.

    Fit imputers on the training split only and reuse them on the test split; imputing before the train/test split leaks test-set statistics into training.

2. Encode Categorical Features:

    Convert categorical variables to numeric representations compatible with Decision Trees.

    Use:

    * Ordinal (integer) encoding for ordinal categories, with the category order specified explicitly.

    * One-Hot Encoding for nominal categories.

    Ensure encoding is consistent across training and test datasets.

3. Train a Decision Tree Model:

    Split data into training and test sets.

    Initialize a DecisionTreeClassifier and train on the processed training data.

    Use criteria like Gini impurity or entropy.

    Handle class imbalance if present by adjusting class weights or resampling.

4. Tune Hyperparameters:

    Use GridSearchCV or RandomizedSearchCV to tune key hyperparameters such as:

    * max_depth (tree depth limit)

    * min_samples_split (minimum samples to split a node)

    * min_samples_leaf (minimum samples at a leaf node)

    * criterion (gini or entropy)

    Perform cross-validation during tuning to avoid overfitting.

5. Evaluate Model Performance:

    Measure accuracy, precision, recall, F1-score, and ROC-AUC on the test set.

    Use confusion matrix and feature importance for interpretability.

    Optionally, perform calibration or threshold tuning for better clinical decision-making.

6. Business Value:

    This process enables early and accurate disease detection, improving patient outcomes through timely intervention.

    Supports clinical decision-making by providing interpretable rules.

    Helps allocate medical resources efficiently by identifying high-risk patients.

    Improves operational efficiency by automating risk stratification and reducing manual workload.

    Facilitates personalized medicine and targeted treatments by identifying important predictors.

    This process balances data quality, model performance, and interpretability, all of which are essential for impactful healthcare predictive modeling.
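
Steps 1 to 5 can be sketched end to end with a scikit-learn pipeline. Everything about the data here is an assumption: the table is synthetic, the column names (`glucose`, `bmi`, `smoker`, `region`, `disease`) are hypothetical, and the parameter grid and `f1` scoring are example choices rather than recommendations. Placing imputation and encoding inside the pipeline means they are fitted on the training folds only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a patient table (hypothetical columns and label rule).
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "glucose": rng.normal(110, 25, n),
    "bmi": rng.normal(27, 5, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east"], n),
})
df["disease"] = (
    (df["glucose"] > 120) | ((df["bmi"] > 30) & (df["smoker"] == "yes"))
).astype(int)
# Knock out roughly 10% of two columns to simulate missing values.
df.loc[rng.random(n) < 0.1, "glucose"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

numeric = ["glucose", "bmi"]
categorical = ["smoker", "region"]

# Steps 1 and 2: impute and encode, separately per column type.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: the tree, with class weights to counter class imbalance.
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df[numeric + categorical], df["disease"],
    test_size=0.3, stratify=df["disease"], random_state=42,
)

# Step 4: cross-validated hyperparameter search on the training split only.
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_leaf": [1, 5, 20],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Step 5: evaluate the refitted best model on the held-out test split.
print("Best parameters:", search.best_params_)
print(classification_report(y_test, search.predict(X_test)))
auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(f"ROC-AUC: {auc:.4f}")
```

In a clinical setting the scoring metric would likely be chosen to reflect the cost of missed cases (for example recall), which is a design decision outside the scope of this sketch.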