# Decision Tree | Assignment


# Question 1: What is a Decision Tree, and how does it work in the context of classification?

A Decision Tree is a supervised machine learning algorithm used for classification (categorical outcomes) and regression (continuous outcomes). It works by splitting data into subsets based on the values of input features, forming a tree-like structure of decisions.

**1. Structure of a Decision Tree**

A decision tree consists of:

Root Node: Represents the entire dataset and the first decision point (feature to split on).

Internal Nodes: Represent features used to split the data further.

Branches: Represent the outcomes of the split.

Leaf Nodes (Terminal Nodes): Represent the final classification or decision.

----

**2. How it works for Classification**

The main goal is to classify data into categories by asking a series of yes/no (or multi-way) questions about features.
  
Step-by-step process:

1. Select the Best Feature to Split

• Use a splitting criterion such as:

   • Gini Impurity
   • Entropy / Information Gain
   • Chi-square

• Choose the feature that best separates the classes.

2. Split the Data

• Divide the dataset into subsets based on the selected feature’s values.

3. Repeat Recursively

• For each subset, repeat the process:

   • Select the best feature.
   • Split the subset.


Continue until one of the stopping criteria is met:

• All data points in a node belong to the same class.

• Maximum tree depth is reached.

• Minimum number of samples in a node is reached.


4. Assign Class Labels

 Once a leaf node is reached, assign the most common class of that node to all data points in it.
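
The recursive procedure above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how any particular library implements it: the function names (`gini`, `best_split`, `build_tree`, `predict_one`), the dict-based node layout, and the toy data are all invented for this example, and only numeric features with binary threshold splits are handled.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)  # a split must beat the parent impurity
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3, min_samples=2):
    """Recursively grow a tree; each leaf stores the majority class."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criteria: pure node, depth limit reached, or too few samples.
    if len(set(labels)) == 1 or depth >= max_depth or len(labels) < min_samples:
        return {"leaf": majority}
    split = best_split(rows, labels)
    if split is None:  # no candidate split reduces impurity
        return {"leaf": majority}
    f, threshold = split
    left_idx = [i for i, row in enumerate(rows) if row[f] <= threshold]
    right_idx = [i for i, row in enumerate(rows) if row[f] > threshold]
    return {
        "feature": f,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left_idx], [labels[i] for i in left_idx],
                           depth + 1, max_depth, min_samples),
        "right": build_tree([rows[i] for i in right_idx], [labels[i] for i in right_idx],
                            depth + 1, max_depth, min_samples),
    }

def predict_one(node, row):
    """Follow the branches until a leaf is reached."""
    while "leaf" not in node:
        node = node["left"] if row[node["feature"]] <= node["threshold"] else node["right"]
    return node["leaf"]

# Toy data (made up): [humidity %, wind speed] -> play?
rows = [[85, 5], [90, 20], [70, 5], [65, 25], [95, 10], [60, 8]]
labels = ["No", "No", "Yes", "Yes", "No", "Yes"]

tree = build_tree(rows, labels)
print(tree)
print(predict_one(tree, [88, 12]))  # No
```

On this toy data a single split on humidity at 70 separates the classes, so the recursion stops after one level because both children are pure.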


**3. Example**

Suppose you want to classify if someone will play tennis based on weather:


| Outlook  | Temperature | Humidity | Wind | Play Tennis |
| -------- | ----------- | -------- | ---- | ----------- |
| Sunny    | Hot         | High     | Weak | No          |
| Overcast | Hot         | High     | Weak | Yes         |
| Rain     | Mild        | High     | Weak | Yes         |


The tree might first split on "Outlook":

• If Sunny → split further by Humidity

• If Rain → split by Wind

• If Overcast → Yes (leaf node)
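
A fitted tree is equivalent to a set of nested if/else rules. The sketch below hand-codes the example tree; the function name `play_tennis` is hypothetical, and the leaf labels for the branches the table does not cover (Sunny with normal humidity, Rain with strong wind) are assumptions borrowed from the classic play-tennis dataset.

```python
def play_tennis(outlook, humidity, wind):
    """Hand-written version of the example tree (some leaf labels are assumed)."""
    if outlook == "Overcast":
        return "Yes"  # leaf node stated in the text
    if outlook == "Sunny":
        # Second-level split on Humidity; "Normal -> Yes" is an assumption.
        return "No" if humidity == "High" else "Yes"
    if outlook == "Rain":
        # Second-level split on Wind; "Strong -> No" is an assumption.
        return "Yes" if wind == "Weak" else "No"
    raise ValueError(f"unknown outlook: {outlook!r}")

# The three rows from the table above
print(play_tennis("Sunny", "High", "Weak"))     # No
print(play_tennis("Overcast", "High", "Weak"))  # Yes
print(play_tennis("Rain", "High", "Weak"))      # Yes
```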


**4. Advantages**

• Easy to understand and interpret.

• Can handle both numerical and categorical data.

• Requires little data preprocessing.


**5. Disadvantages**

• Can overfit easily on training data.

• Unstable: small variations in the data can produce a very different tree.

• Sometimes biased toward features with more levels.


# Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

**Decision Trees & Impurity**

When building a Decision Tree, the goal at each split is to separate the data into “purer” groups — meaning each node should ideally contain samples from only one class.

To measure this impurity (or disorder) of a node, we commonly use:

1. Gini Impurity

2. Entropy (Information Gain)


-----

**1. Gini Impurity**

**Formula:**

$$
Gini = 1 - \sum_{i=1}^{C} p_i^2
$$


**Meaning:**

Gini measures how often a randomly chosen sample from the node would be misclassified if it were labeled at random according to the node’s class distribution.

**Range:**

0 → Pure (all samples belong to one class)

0.5 → Maximum impurity for 2 classes (50-50 split); for C classes the maximum is 1 - 1/C

**Example:**

A node with 70% Class A and 30% Class B:

$$
Gini = 1 - (0.7^2 + 0.3^2) = 1 - (0.49 + 0.09) = 0.42
$$
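
The calculation above can be checked with a few lines of Python. The helper name `gini_impurity` is made up for this sketch, and it assumes the input is a list of class proportions that sum to 1.

```python
def gini_impurity(proportions):
    """Gini impurity from class proportions that sum to 1."""
    return 1.0 - sum(p ** 2 for p in proportions)

print(round(gini_impurity([0.7, 0.3]), 2))  # 0.42 (the example node)
print(round(gini_impurity([0.5, 0.5]), 2))  # 0.5  (two-class maximum)
print(round(gini_impurity([1.0, 0.0]), 2))  # 0.0  (pure node)
```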

--------

**2. Entropy (Information Gain)**

**Formula:**

$$
Entropy = - \sum_{i=1}^{C} p_i \log_2(p_i)
$$


**Meaning:**

Entropy measures the amount of uncertainty or “disorder” in the node.
The more evenly distributed the classes, the higher the entropy.

**Range:**

0 → Pure (all samples belong to one class)

1 → Maximum impurity for 2 classes (50-50 split); for C classes the maximum is log2(C)

**Example:**

Same node as above (70% Class A, 30% Class B):

$$
Entropy = -(0.7 \log_{2} 0.7 + 0.3 \log_{2} 0.3) \approx 0.88
$$
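
The same check works for entropy. The helper name `entropy` is an assumption of this sketch; it takes class proportions and skips zero-probability classes, since the term p log2(p) is taken to be 0 when p = 0.

```python
import math

def entropy(proportions):
    """Shannon entropy in bits; classes with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(round(entropy([0.7, 0.3]), 4))  # 0.8813 (the example node)
print(round(entropy([0.5, 0.5]), 4))  # 1.0    (two-class maximum)
```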


**How They Impact Splits in Decision Trees**

Both Gini and Entropy are used to choose the best split:

• The algorithm evaluates the candidate splits on each feature.

• For each split, it calculates the weighted average impurity of the child nodes (each child weighted by its share of the samples).

• It then picks the split that reduces impurity the most (highest information gain for entropy, or largest decrease in Gini impurity for Gini).


**Differences in practice:**

• Gini is computationally simpler (no log function) → slightly faster.

• Entropy penalizes mixed nodes somewhat more heavily, so it can produce slightly more balanced trees.

• In practice, both usually give very similar trees.
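
One way to see how close the two criteria are in practice is to train the same model with each and compare cross-validated accuracy. The sketch below uses scikit-learn and the Iris dataset purely as an illustration; the dataset choice and 5-fold setup are assumptions, and results on other data may differ.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=42)
    # Mean 5-fold cross-validated accuracy for this impurity measure
    scores[criterion] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{criterion}: {scores[criterion]:.4f}")
```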

# Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

**Pre-Pruning (Early Stopping)**

• **Definition:** The tree growth is stopped early, before it becomes too complex.

• **How:** Uses constraints like maximum depth, minimum samples per split, or minimum information gain.

• **Goal:** Prevents overfitting by not letting the tree grow unnecessarily deep.

**Practical Advantage**: Saves training time and computation, especially useful for large datasets where building a very deep tree would be slow and expensive.

--------

**Post-Pruning (Pruning after Full Growth)**

• **Definition:** The tree is first allowed to grow fully, and then unnecessary branches are cut back.

• **How:** Uses techniques like cost-complexity pruning (CART) or reduced error pruning to remove branches that don’t improve accuracy.

• **Goal:** Simplifies the model while keeping performance high.


**Practical Advantage:** Often produces a more accurate and generalizable model, since pruning decisions are made after seeing the complete tree, so a split that only pays off further down is not discarded prematurely.
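
Both styles of pruning are available in scikit-learn, and the sketch below contrasts them. The breast cancer dataset, the specific constraint values, and the choice of a mid-path `ccp_alpha` are arbitrary assumptions for illustration; in practice alpha would be chosen by cross-validation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

# Pre-pruning: constrain growth up front.
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, compute its cost-complexity path,
# then refit with one of the candidate alphas.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary mid-path value
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name}: leaves={model.get_n_leaves()}, "
          f"test accuracy={model.score(X_test, y_test):.4f}")
```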



# Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

**Information Gain (IG)**

**Definition:** Information Gain measures how much "uncertainty" (impurity) in the dataset is reduced after splitting on a feature.

Formula:

$$
IG(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \times Entropy(S_v)
$$

Here S is the set of samples at the node, A is the feature being tested, and S_v is the subset of S for which A takes the value v.

**Why It’s Important**

• A Decision Tree works by splitting data into purer subsets (where classes are more separated).

• Information Gain tells us which feature is the “best question” to ask at each step.

• Higher IG = bigger reduction in impurity = better split.
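
As a worked illustration of the formula, the sketch below computes the information gain of two candidate splits of the same parent node. The helper names and the label counts are made up for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["Yes"] * 5 + ["No"] * 5  # entropy = 1 bit

# Split A separates the classes well; split B barely changes the mix.
split_a = [["Yes"] * 4 + ["No"] * 1, ["Yes"] * 1 + ["No"] * 4]
split_b = [["Yes"] * 3 + ["No"] * 2, ["Yes"] * 2 + ["No"] * 3]

ig_a = information_gain(parent, split_a)
ig_b = information_gain(parent, split_b)
print(f"IG(split A) = {ig_a:.3f}")  # 0.278
print(f"IG(split B) = {ig_b:.3f}")  # 0.029
```

Split A has the higher gain, so it would be chosen as the better question to ask at this node.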

# Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

**Real-World Applications of Decision Trees**


• **Medical diagnosis** – Predicting if a patient has a disease based on symptoms and test results.

• **Credit scoring** – Classifying loan applicants as “low risk” or “high risk” based on financial history.

• **Fraud detection** – Identifying suspicious transactions.

• **Customer churn prediction** – Predicting if a customer will stop using a service.

• **Product recommendation** – Suggesting products based on user behavior and preferences.

• **Regression tasks** – Predicting house prices, sales forecasting, etc.

---------

**Main advantages:**


• **Easy to understand & interpret**   – Produces clear, human-readable rules.

• **Handles both numerical and categorical data**  – Works with mixed data types without complex preprocessing.

• **No need for feature scaling** – Normalization or standardization is not required.

• **Captures non-linear relationships** – Can model complex decision boundaries.

• **Feature importance** – Identifies which variables are most influential.
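
To illustrate the interpretability point, scikit-learn can print a fitted tree as text rules via `export_text`. The sketch below assumes the Iris dataset and a shallow `max_depth=2` tree chosen only to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# Render the learned splits as indented if/else style rules
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```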

-------

**Main limitations:**


• **Overfitting** – Fully grown trees can memorize the training data and perform poorly on unseen data.

• **High variance** – Small changes in data can produce very different trees.


• **Bias toward features with many levels**  – Features with more unique values can dominate splits.

• **Less accurate than ensembles** – Often outperformed by Random Forests or Gradient Boosting on complex problems.
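
The comparison with ensembles can be tried directly. The sketch below is only an illustration under assumed settings (breast cancer dataset, 5-fold cross-validation, default hyperparameters); the size of the gap, and even its direction, depends on the data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {results[name]:.4f}")
```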


# Question 6: Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier using the Gini criterion

● Print the model’s accuracy and feature importances

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Split into training and test sets (stratified to maintain class proportions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train Decision Tree Classifier with Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predict on test set
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Feature importances
importances = clf.feature_importances_

# Output
print(f"Accuracy: {accuracy:.4f}")
print("Feature importances:")
for name, importance in zip(feature_names, importances):
    print(f"  {name}: {importance:.4f}")
```

Output:

```text
Accuracy: 0.9333
Feature importances:
  sepal length (cm): 0.0000
  sepal width (cm): 0.0286
  petal length (cm): 0.5412
  petal width (cm): 0.4303
```


# Question 7: Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into train/test sets (stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fully-grown tree (no depth limit)
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
acc_full = accuracy_score(y_test, clf_full.predict(X_test))

# Tree with max_depth=3
clf_md3 = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_md3.fit(X_train, y_train)
acc_md3 = accuracy_score(y_test, clf_md3.predict(X_test))

# Output
print(f"Fully-grown tree accuracy: {acc_full:.4f}")
print(f"Max depth=3 tree accuracy: {acc_md3:.4f}")
```

Output:

```text
Fully-grown tree accuracy: 0.9333
Max depth=3 tree accuracy: 0.9778
```
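
To see why the shallower tree can score higher on the test set, it may help to look at tree size and the gap between training and test accuracy. The sketch below repeats the same split as an assumption and collects these figures in a hypothetical `summary` dict; a large train/test gap for the unrestricted tree would point to overfitting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

summary = {}
for max_depth in (None, 3):
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    clf.fit(X_train, y_train)
    summary[max_depth] = {
        "depth": clf.get_depth(),          # actual depth reached
        "leaves": clf.get_n_leaves(),      # number of leaf nodes
        "train_acc": clf.score(X_train, y_train),
        "test_acc": clf.score(X_test, y_test),
    }
    print(max_depth, summary[max_depth])
```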


# Question 8: Write a Python program to:

● Load the California Housing dataset from sklearn

● Train a Decision Tree Regressor

● Print the Mean Squared Error (MSE) and feature importances

```python
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names

# Split dataset into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict on test set
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)

# Print results
print(f"Mean Squared Error on test data: {mse:.4f}\n")

print("Feature Importances:")
for name, importance in zip(feature_names, regressor.feature_importances_):
    print(f" - {name}: {importance:.4f}")
```

Output:

```text
Mean Squared Error on test data: 0.4952

Feature Importances:
 - MedInc: 0.5285
 - HouseAge: 0.0519
 - AveRooms: 0.0530
 - AveBedrms: 0.0287
 - Population: 0.0305
 - AveOccup: 0.1308
 - Latitude: 0.0937
 - Longitude: 0.0829
```


# Question 9: Write a Python program to:

● Load the Iris Dataset

● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV

● Print the best parameters and the resulting model accuracy

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the Decision Tree classifier
dt = DecisionTreeClassifier(random_state=42)

# Set up the grid of parameters to search
param_grid = {
    'max_depth': [2, 3, 4, 5, 6, None],
    'min_samples_split': [2, 5, 10]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(dt, param_grid, cv=5, n_jobs=-1)

# Fit GridSearch to the training data
grid_search.fit(X_train, y_train)

# Best parameters found
best_params = grid_search.best_params_

# Evaluate the best estimator on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print the results
print(f"Best parameters: {best_params}")
print(f"Model accuracy on test set: {accuracy:.4f}")
```

Output:

```text
Best parameters: {'max_depth': 4, 'min_samples_split': 2}
Model accuracy on test set: 1.0000
```


# Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to:

● Handle the missing values

● Encode the categorical features

● Train a Decision Tree model

● Tune its hyperparameters

● Evaluate its performance

And describe what business value this model could provide in the real-world setting.


---

Answer:

**1. Handle Missing Values**

**Understand the missingness:**

First, analyze how data is missing. Is it random, or does it follow a pattern? This affects how you handle it.

**Imputation:**

• For **numerical features**, fill missing values with the mean or median, or use a more advanced method such as K-Nearest Neighbors imputation.

• For **categorical features**, fill missing values with the mode (most frequent value) or create a special category like "Unknown".

• If missing values are too prevalent in a feature or sample, consider dropping it, but do so carefully, since this discards information.

**Why it matters:** Many model implementations can’t handle missing values directly, so cleaning this up ensures your model sees complete, reliable inputs.
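
The imputation options above could look like this with pandas and scikit-learn. This is a minimal sketch on made-up records; the column names, the median strategy, and `n_neighbors=2` are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Made-up patient records with gaps (column names are hypothetical)
df = pd.DataFrame({
    "age": [34, np.nan, 51, 46],
    "blood_pressure": [120, 135, np.nan, 128],
    "smoker": ["no", "yes", np.nan, "no"],
})
num_cols = ["age", "blood_pressure"]

# Numerical features: fill each gap with the column median
filled = df.copy()
filled[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])

# Categorical feature: treat "missing" as its own category
filled["smoker"] = df["smoker"].fillna("Unknown")

# Alternative for numerical features: estimate gaps from the 2 most similar rows
knn_values = KNNImputer(n_neighbors=2).fit_transform(df[num_cols])

print(filled)
print(knn_values)
```

In a real project the imputer should be fitted on the training split only and then applied to the test split, so that no information leaks from the test data.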


-----


**2. Encode Categorical Features**

• **Identify categorical variables:** This could be patient gender, blood type, or any non-numeric info.

• **Encoding methods:**

   • For **nominal categories** without order (e.g., blood type), use One-Hot Encoding.

   • For **ordinal categories** (e.g., disease severity: mild, moderate, severe), use ordinal encoding, mapping the levels to integers that preserve their order.

• **Why it matters:** Most library implementations of Decision Trees, including scikit-learn’s, require numeric input, so encoding transforms your data into a usable form.

------

**3. Train a Decision Tree Model**

• **Split the data**: Use an 80-20 or 70-30 split between training and test sets, or use cross-validation to ensure your results generalize.

• **Initialize the model**: Start with a default Decision Tree classifier.

• **Train on processed data**: Fit the model on your training data.

• **Why Decision Trees**: They handle mixed data types well, are interpretable (important in healthcare), and can capture nonlinear patterns.

--------

**4. Tune Hyperparameters**

• **Key hyperparameters to tune:**
    
   • max_depth: controls tree complexity, balancing underfitting and overfitting.

   • min_samples_split and min_samples_leaf: control how many samples are needed to split a node or to form a leaf, affecting generalization.

   • max_features: number of features to consider at each split.

• **Use GridSearchCV or RandomizedSearchCV**: Explore combinations systematically with cross-validation to find the sweet spot.

• **Why tuning matters**: Proper tuning prevents overfitting or underfitting, improving the model’s predictive power on unseen data.
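
Steps 1 to 4 can be combined into a single scikit-learn pipeline so that preprocessing is re-fitted inside every cross-validation fold. The sketch below runs on synthetic data: the column names, the rule that generates the label, the grid values, and the choice of recall as the tuning metric are all assumptions made for illustration, not a recommended configuration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data (all values are made up)
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "glucose": rng.normal(100, 20, n),
    "sex": rng.choice(["F", "M"], n),
})
y = ((df["glucose"] > 110) | (df["age"] > 70)).astype(int)  # toy label rule
df.loc[rng.choice(n, 30, replace=False), "glucose"] = np.nan  # inject gaps

# Impute numeric columns, one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "glucose"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])
pipe = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_split": [2, 10],
    "tree__min_samples_leaf": [1, 5],
}

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=42, stratify=y
)
search = GridSearchCV(pipe, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

print(search.best_params_)
print(f"Test recall: {search.score(X_test, y_test):.4f}")
```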


-------------

**5. Evaluate Performance**

**Metrics:**

For classification, consider Accuracy, Precision, Recall, F1-score, and ROC-AUC.

In healthcare, Recall (sensitivity) is often critical — you want to catch as many patients with the disease as possible, even at the cost of some false positives.

**Validation**: Use a separate test set or cross-validation to ensure your model performs reliably.

**Interpretability**: Use feature importance and decision tree visualization to explain model decisions to clinicians and stakeholders.
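
The metrics listed above can be computed with `sklearn.metrics`. The labels, predictions, and scores below are hypothetical values invented for this sketch, with 1 meaning the patient has the disease.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical ground truth, hard predictions, and predicted probabilities
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3, 0.2, 0.95]

# Confusion matrix entries: one missed patient (fn) and one false alarm (fp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=4 FP=1 FN=1 TP=4

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)

print(f"Accuracy={accuracy:.2f} Precision={precision:.2f} "
      f"Recall={recall:.2f} F1={f1:.2f} ROC-AUC={auc:.2f}")
```

Here the single false negative is the costly error in a screening setting, which is why recall is tracked separately from accuracy.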


-----



**Business Value of This Model**

**Early detection**: Predicting disease early means patients can receive timely treatment, improving outcomes and reducing healthcare costs.

**Resource optimization**: Helps healthcare providers prioritize high-risk patients for screening or intervention, making better use of limited resources.

**Personalized care**: Tailors monitoring and care plans based on individual risk, improving patient satisfaction and effectiveness.

**Data-driven decisions**: Provides actionable insights backed by data, enabling the company to develop better products, policies, or outreach programs.

**Trust and transparency**: Decision Trees’ interpretability supports building trust with clinicians, regulators, and patients, crucial in healthcare.
