# Decision Tree Assignment Theory Questions

# Question 1: What is a Decision Tree, and how does it work in the context of classification?

# Answer:
A **Decision Tree** is a supervised machine learning algorithm that is used for both **classification** and **regression** tasks. In the context of **classification**, a decision tree is used to assign labels to input data by learning decision rules from the features of the data.



### **How It Works (Classification Context)**

1. **Structure**:

   * The tree is made up of **nodes**:

     * **Root Node**: The first decision point, split on the feature (and threshold) that gives the largest impurity reduction over the whole training set.
     * **Internal Nodes**: Each represents a decision based on a feature.
     * **Leaf Nodes**: Final output or class labels (e.g., "spam" or "not spam").

2. **Splitting**:

   * At each node, the algorithm chooses a feature and a threshold (for numerical data) or category (for categorical data) to split the data.
   * The goal is to **maximize the purity** of the resulting groups (i.e., make each group mostly belong to a single class).

3. **Feature Selection Criteria**:

   * Common metrics to decide splits, each measuring how well a split separates the classes:

     * **Gini Impurity**
     * **Entropy / Information Gain**
     * **Gain Ratio**

4. **Prediction**:

   * To classify a new instance, the model starts at the root and follows the path based on feature values until it reaches a leaf node, which gives the predicted class.



### **Example**

Suppose we are classifying emails as "Spam" or "Not Spam" based on features like:

* Contains the word "free"
* Has more than 100 words
* Includes a link

The decision tree might look like this:

```
              [Contains "free"?]
                  /       \
              Yes           No
            /                 \
     [Has link?]            Not Spam
       /    \
     Yes     No
   Spam   Not Spam
```



### **Advantages**

* Easy to understand and interpret
* Requires little data preprocessing
* Handles both numerical and categorical data

### **Disadvantages**

* Prone to overfitting
* Can become complex if not pruned
* Unstable: small changes in the training data can produce a very different tree


# Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

# Answer:
In a **Decision Tree**, the goal at each step is to **split the dataset** in a way that best separates the classes. To do this, the tree uses **impurity measures** to evaluate how "mixed" or "pure" a set of labels is.

Two of the most common impurity measures are:



## 1. **Gini Impurity**

### Definition:

Gini Impurity measures the probability that a randomly chosen sample from the dataset would be **incorrectly classified** if it were labeled at random according to the class distribution in the dataset.

### Formula:

For a dataset with classes $C_1, C_2, ..., C_k$:

$$
\text{Gini}(D) = 1 - \sum_{i=1}^{k} p_i^2
$$

Where:

* $p_i$ is the probability (frequency) of class $i$ in dataset $D$

### Interpretation:

* Gini = 0 → Pure node (only one class)
* Higher Gini = More mixed classes, up to a maximum of $1 - 1/k$ for $k$ equally frequent classes (0.5 for two classes)
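
As a minimal sketch, the formula can be computed directly from a list of labels. The helper name `gini` is an illustrative choice, not a library function.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum(p_i^2) for a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini(["Yes", "Yes", "Yes"]))        # pure node -> 0.0
print(gini(["Yes", "Yes", "No", "No"]))   # evenly mixed, two classes -> 0.5
```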



## 2. **Entropy**

###  Definition:

Entropy measures the level of **disorder** or **uncertainty** in a dataset. The more mixed the class labels are, the higher the entropy.

###  Formula:

$$
\text{Entropy}(D) = - \sum_{i=1}^{k} p_i \log_2(p_i)
$$

Where:

* $p_i$ is the probability of class $i$ in the dataset $D$

### Interpretation:

* Entropy = 0 → Pure node
* Maximum entropy → Classes are evenly mixed (the maximum is $\log_2 k$ for $k$ classes, i.e., 1 for two classes)
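
A small sketch of the same formula in code, assuming labels arrive as a plain Python list. The helper name `entropy` is illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)) over the classes present."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [count / n for count in Counter(labels).values()]
    # Classes with zero count never appear in the Counter, so log2(0) is avoided.
    return sum(-p * math.log2(p) for p in probs)


print(entropy(["Yes", "Yes", "Yes"]))        # pure node
print(entropy(["Yes", "Yes", "No", "No"]))   # evenly mixed, two classes
```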



### **How They Impact Splits in a Decision Tree**

###  Goal:

The decision tree algorithm chooses the split that **minimizes impurity** (Gini or Entropy) in the child nodes after the split.

### Process:

1. For each possible feature and threshold:

   * Calculate the impurity of the left and right child nodes after the split.
   * Compute the **weighted average impurity** of these child nodes.
2. Choose the split that **reduces impurity the most**.

This reduction is:

* **Gini Gain** = Parent Gini - Weighted Gini of children
* **Information Gain** = Parent Entropy - Weighted Entropy of children
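
The process above can be sketched for a single numeric feature. This is a simplified illustration, not how any particular library implements it: the function names and the toy data are hypothetical, and candidate thresholds are taken as midpoints between consecutive distinct values, which is one common choice.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gini_gain) of the best split on one numeric feature."""
    parent = gini(labels)
    n = len(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2  # candidate threshold between two observed values
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Weighted average impurity of the two child nodes
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        gain = parent - weighted
        if gain > best[1]:
            best = (t, gain)
    return best


# Hypothetical toy data: word counts and labels for six emails
word_counts = [20, 35, 50, 120, 150, 200]
labels = ["Not Spam", "Not Spam", "Not Spam", "Spam", "Spam", "Spam"]
threshold, gain = best_threshold(word_counts, labels)
print(threshold, gain)
```

A full tree builder would repeat this search over every feature and then recurse into each child node.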



## Example

Suppose you have a dataset:

| Class |
| ----- |
| Yes   |
| Yes   |
| No    |
| No    |

* Gini = $1 - (0.5^2 + 0.5^2) = 0.5$
* Entropy = $- (0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1$

If a split results in:

* One child node with all "Yes" (pure → Gini = 0, Entropy = 0)
* One with mixed classes

The algorithm will favor this split because it **reduces impurity**.
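
To put numbers on this, assume (the example does not fix the exact split) that the pure child $D_L$ holds one "Yes" and the mixed child $D_R$ holds the remaining one "Yes" and two "No". Under that assumption:

$$
\begin{aligned}
\text{Gini}(D_R) &= 1 - \left[(1/3)^2 + (2/3)^2\right] = 4/9 \approx 0.444 \\
\text{Weighted Gini} &= \tfrac{1}{4}(0) + \tfrac{3}{4}(4/9) = 1/3 \approx 0.333 \\
\text{Gini Gain} &= 0.5 - 0.333 \approx 0.167 \\
\text{Entropy}(D_R) &= -\left[\tfrac{1}{3}\log_2\tfrac{1}{3} + \tfrac{2}{3}\log_2\tfrac{2}{3}\right] \approx 0.918 \\
\text{Weighted Entropy} &= \tfrac{1}{4}(0) + \tfrac{3}{4}(0.918) \approx 0.689 \\
\text{Information Gain} &= 1 - 0.689 \approx 0.311
\end{aligned}
$$

Both measures report a positive gain, so either criterion would consider this split an improvement over leaving the node unsplit.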



##  Gini vs Entropy: Which is Better?

| Aspect            | Gini Impurity                                              | Entropy                                       |
| ----------------- | ---------------------------------------------------------- | --------------------------------------------- |
| Speed             | Slightly faster (no log)                                   | Slightly slower (uses log)                    |
| Splitting results | Often similar                                              | Often similar                                 |
| Tendency          | Tends to isolate the most frequent class in its own branch | Tends to produce slightly more balanced trees |

Both work well in practice, and many libraries like **scikit-learn** use Gini by default.


# Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

# Answer:

##  **What is Pruning in Decision Trees?**

**Pruning** is a technique used to reduce the size of a decision tree by removing sections that provide little or no predictive power. The goal is to prevent **overfitting** and improve the model's generalization on unseen data.



##  **Types of Pruning**

###  **1. Pre-Pruning (Early Stopping)**

**Definition:**
Pre-pruning stops the tree **growth early**, before it becomes too complex.

**How it works:**
It uses conditions to **stop splitting** a node during the tree-building process. Examples of stopping criteria:

* Maximum depth reached
* Minimum number of samples in a node
* Minimum information gain (or impurity reduction) from a split

**Practical Advantage:**
**Faster training** and reduced **computational cost**, especially on large datasets.
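
In scikit-learn, the stopping criteria listed above map onto constructor parameters of `DecisionTreeClassifier`. The sketch below uses the Iris dataset and arbitrary illustrative values; suitable settings depend on the data.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruned tree: growth stops as soon as any of these limits is hit.
pre_pruned = DecisionTreeClassifier(
    max_depth=3,                 # maximum depth reached
    min_samples_leaf=5,          # minimum number of samples in a leaf
    min_impurity_decrease=0.01,  # minimum impurity reduction from a split
    random_state=42,
).fit(X, y)

# Unconstrained tree for comparison
full = DecisionTreeClassifier(random_state=42).fit(X, y)

print("Pre-pruned depth / leaves:", pre_pruned.get_depth(), pre_pruned.get_n_leaves())
print("Full tree depth / leaves:", full.get_depth(), full.get_n_leaves())
```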



###  **2. Post-Pruning (Reduced Error Pruning / Cost Complexity Pruning)**

**Definition:**
Post-pruning allows the tree to grow fully and then **removes unnecessary branches** after it's built.

**How it works:**
After building the full tree:

* Evaluate the performance of each subtree on a **validation set**
* Prune branches that do not improve (or hurt) validation accuracy
* This simplifies the tree without much loss in performance

**Practical Advantage:**
**Improves generalization** by removing overfitted branches after seeing the full data structure.
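
scikit-learn implements the cost-complexity variant of post-pruning through the `ccp_alpha` parameter. The sketch below grows a full tree, asks it for the candidate alpha values, and keeps the one that scores best on a validation split. The split size, the tie-breaking rule, and the use of Iris are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Grow the full tree, then compute the sequence of effective alphas.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (smaller tree)
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
pruned.fit(X_train, y_train)

print("Chosen ccp_alpha:", best_alpha)
print("Nodes before / after pruning:", full.tree_.node_count, pruned.tree_.node_count)
```

In practice cross-validation is often used instead of a single validation split to choose the alpha.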


# Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

# Answer:
##  **What is Information Gain?**

**Information Gain (IG)** is a metric used in decision trees to **measure the effectiveness of an attribute (feature) in classifying the data**.

It quantifies the **reduction in entropy** (i.e., disorder or impurity) achieved by splitting the dataset based on a specific feature.

##  **Formula:**

$$
\text{Information Gain} = \text{Entropy(parent)} - \sum_{i=1}^{k} \frac{|D_i|}{|D|} \cdot \text{Entropy}(D_i)
$$

Where:

* $D$: Parent dataset
* $D_i$: Subsets after splitting by a feature
* $|D_i| / |D|$: Proportion of instances in subset $i$
* Entropy is calculated as:

  $$
  \text{Entropy}(D) = - \sum_{j=1}^{c} p_j \log_2(p_j)
  $$

  with $p_j$ being the probability of class $j$


## **Why Is Information Gain Important?**

In decision tree construction, **at each node**, the algorithm must decide **which feature to split on**.

* **Information Gain tells us which feature gives the most reduction in uncertainty** about the target class.
* The feature with the **highest Information Gain** is chosen as the **best split** at that node.



## **Example:**

Suppose you're classifying emails as "Spam" or "Not Spam" using the feature **"Contains the word 'free'"**.

1. **Before the split** (parent node):

   * 50% spam, 50% not spam → Entropy = 1 (maximum uncertainty)

2. **After the split**:

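   * Group 1 ("free" = yes): 90% spam, 10% not spam → Entropy ≈ 0.47
   * Group 2 ("free" = no): 20% spam, 80% not spam → Entropy ≈ 0.72

3. **Information Gain**:

   * For the parent to be 50% spam, Group 1 must hold 3/7 of the emails and Group 2 the other 4/7, so the weighted child entropy is about 0.61 and the Information Gain is about 1 - 0.61 = 0.39.

If no other feature offers a larger gain, the decision tree **selects this feature** to split the node.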
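
The example can be checked numerically. The email counts below are hypothetical; they are chosen only so that the groups match the stated percentages and the parent node is 50% spam.

```python
import math


def entropy_from_counts(counts):
    """Entropy in bits from a list of per-class counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


# [spam, not spam] counts (hypothetical, consistent with the percentages above)
parent = [350, 350]    # 50% spam
free_yes = [270, 30]   # 90% spam
free_no = [80, 320]    # 20% spam

n = sum(parent)
weighted_children = (
    sum(free_yes) / n * entropy_from_counts(free_yes)
    + sum(free_no) / n * entropy_from_counts(free_no)
)
info_gain = entropy_from_counts(parent) - weighted_children

print(f"Entropy (free = yes): {entropy_from_counts(free_yes):.3f}")
print(f"Entropy (free = no):  {entropy_from_counts(free_no):.3f}")
print(f"Information Gain:     {info_gain:.3f}")
```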



##  **Key Points**

| Concept      | Description                                                        |
| ------------ | ------------------------------------------------------------------ |
| What it does | Measures how much "information" a feature gives us about the class |
| Used for     | Selecting the **best feature** to split a node                     |
| Goal         | Maximize Information Gain = Maximize purity after split            |
| Result       | Each split removes as much uncertainty as possible, which tends to give compact trees; pruning is still needed to generalize well |




# Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

# Answer:

##  **Real-World Applications of Decision Trees**

Decision Trees are widely used across industries due to their simplicity and interpretability. Some common applications include:

### 🔹 1. **Medical Diagnosis**

* **Use case**: Classifying diseases based on symptoms, test results, and patient history.
* **Example**: Predicting whether a tumor is benign or malignant based on features like size, shape, etc.

### 🔹 2. **Credit Scoring & Risk Assessment**

* **Use case**: Deciding whether to approve a loan or credit card application.
* **Example**: Using features like income, debt, credit history, and employment status.

### 🔹 3. **Fraud Detection**

* **Use case**: Identifying fraudulent transactions.
* **Example**: A decision tree might flag a transaction as suspicious if it happens in a different country or involves an unusually large amount.

### 🔹 4. **Customer Churn Prediction**

* **Use case**: Predicting if a customer is likely to leave a service (telecom, SaaS, etc.).
* **Example**: Based on usage patterns, customer support calls, billing info.

### 🔹 5. **Marketing and Targeted Advertising**

* **Use case**: Segmenting customers and targeting offers based on behavior.
* **Example**: Deciding which promotion to send to a user based on age, spending history, and activity.

### 🔹 6. **Manufacturing and Quality Control**

* **Use case**: Detecting defective products based on sensor data or measurements.



## **Advantages of Decision Trees**

| Advantage                    | Explanation                                                            |
| ---------------------------- | ---------------------------------------------------------------------- |
| ✔️ Easy to understand        | Tree-like structure is intuitive, even for non-technical stakeholders  |
| ✔️ Requires little data prep | Handles both numerical and categorical data; no need for normalization |
| ✔️ Non-linear relationships  | Can capture complex decision boundaries without transformations        |
| ✔️ Fast inference            | Once built, predictions are fast and efficient                         |
| ✔️ Can work with missing values | Some implementations handle missing data natively (e.g., surrogate splits in CART); others need imputation first |



##  **Limitations of Decision Trees**

| Limitation                                  | Explanation                                                                 |
| ------------------------------------------- | --------------------------------------------------------------------------- |
| ❌ **Overfitting**                           | Trees can grow too complex and memorize training data                       |
| ❌ **Unstable**                              | Small changes in data can lead to very different tree structures            |
| ❌ **Bias toward features with more levels** | Can favor features with many unique values                                  |
| ❌ **Less accurate than ensemble methods**   | Alone, trees may underperform compared to models like Random Forests        |
| ❌ **Hard to model smooth functions**        | Decision trees split data step-wise and struggle with gradual relationships |



##  **Mitigation Tips**

* Use **pruning** (pre or post) to prevent overfitting.
* Use **ensemble methods** (like Random Forest or Gradient Boosting) to improve accuracy and stability.
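
As a rough sketch of the second tip, a single tree and a Random Forest can be compared with cross-validation. The breast cancer dataset bundled with scikit-learn, the 5 folds, and `n_estimators=100` are assumptions made for illustration; results will differ on other data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "Single tree": DecisionTreeClassifier(random_state=42),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold accuracy
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

The standard deviation across folds gives a hint of how stable each model is.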





# Decision Tree Assignment Practical Questions

# Question 6: Write a Python program to: ● Load the Iris Dataset ● Train a Decision Tree Classifier using the Gini criterion ● Print the model’s accuracy and feature importances.

# Answer:

```python
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# 1. Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

# 2. Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train a Decision Tree Classifier using the Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# 4. Make predictions
y_pred = clf.predict(X_test)

# 5. Print the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# 6. Print feature importances
importances = clf.feature_importances_
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

print("\nFeature Importances:")
print(feature_importance_df.to_string(index=False))
```

Output:

```
Model Accuracy: 1.00

Feature Importances:
          Feature  Importance
petal length (cm)    0.906143
 petal width (cm)    0.077186
 sepal width (cm)    0.016670
sepal length (cm)    0.000000
```


# Question 7: Write a Python program to: ● Load the Iris Dataset ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

# Answer:

```python
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train a Decision Tree Classifier with max_depth=3
clf_limited = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf_limited.fit(X_train, y_train)
y_pred_limited = clf_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# 4. Train a fully-grown Decision Tree (no max_depth limit)
clf_full = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# 5. Print the accuracies for comparison
print(f"Accuracy with max_depth=3: {accuracy_limited:.2f}")
print(f"Accuracy with fully-grown tree: {accuracy_full:.2f}")
```

Output:

```
Accuracy with max_depth=3: 1.00
Accuracy with fully-grown tree: 1.00
```
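
Both trees score the same here, but the test set holds only 30 samples, so a single split cannot separate them. One possible follow-up, sketched below as an optional extra rather than part of the assignment answer, is to compare the two settings with 5-fold cross-validation over the whole dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

cv_means = {}
for depth in (3, None):  # None means no depth limit (fully-grown tree)
    clf = DecisionTreeClassifier(criterion="gini", max_depth=depth, random_state=42)
    scores = cross_val_score(clf, X, y, cv=5)
    cv_means[depth] = scores.mean()
    print(f"max_depth={depth}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```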


# Question 8: Write a Python program to: ● Load the California Housing dataset from sklearn ● Train a Decision Tree Regressor ● Print the Mean Squared Error (MSE) and feature importances

# Answer:

```python
# Import necessary libraries
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd

# 1. Load the California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target
feature_names = data.feature_names

# 2. Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# 4. Predict on the test set
y_pred = regressor.predict(X_test)

# 5. Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.4f}")

# 6. Print feature importances
importances = regressor.feature_importances_
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

print("\nFeature Importances:")
print(importance_df.to_string(index=False))
```

Output:

```
Mean Squared Error (MSE): 0.4952

Feature Importances:
   Feature  Importance
    MedInc    0.528509
  AveOccup    0.130838
  Latitude    0.093717
 Longitude    0.082902
  AveRooms    0.052975
  HouseAge    0.051884
Population    0.030516
 AveBedrms    0.028660
```


# Question 9: Write a Python program to: ● Load the Iris Dataset ● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV ● Print the best parameters and the resulting model accuracy

# Answer:

```python
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Define the parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 4, 6, 8, 10]
}

# 4. Initialize the Decision Tree Classifier
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# 5. Set up GridSearchCV
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, scoring='accuracy')

# 6. Fit GridSearchCV
grid_search.fit(X_train, y_train)

# 7. Predict on the test set using the best estimator
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# 8. Print best parameters and test accuracy
print("Best Parameters Found:")
print(grid_search.best_params_)

accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy with Best Parameters: {accuracy:.2f}")
```

Output:

```
Best Parameters Found:
{'max_depth': 4, 'min_samples_split': 2}

Model Accuracy with Best Parameters: 1.00
```


# Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to: ● Handle the missing values ● Encode the categorical features ● Train a Decision Tree model ● Tune its hyperparameters ● Evaluate its performance And describe what business value this model could provide in the real-world setting.

# Answer:

## **Step 1: Handle Missing Values**

* **Identify missing data**: Explore the dataset to understand which features have missing values and how much is missing.
* **Decide on imputation strategy** based on feature type and missingness:

  * For **numerical features**: Use strategies like mean, median, or more advanced imputation methods (e.g., K-Nearest Neighbors Imputer).
  * For **categorical features**: Impute with the most frequent category or create a separate category like “Unknown.”
* Optionally, **flag missing values** with new binary features indicating “missingness” if missingness itself is informative.
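
A minimal sketch of these imputation choices with scikit-learn's `SimpleImputer`. The patient values and column meanings are made up for illustration, and median imputation is just one of the strategies listed above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns: age, blood pressure
X_num = np.array([
    [54.0, 130.0],
    [np.nan, 142.0],
    [61.0, np.nan],
    [47.0, 118.0],
])
# Hypothetical categorical column: smoker
X_cat = np.array([["yes"], [np.nan], ["no"], ["no"]], dtype=object)

# Median imputation plus binary "was missing" indicator columns
num_imputer = SimpleImputer(strategy="median", add_indicator=True)
num_values = num_imputer.fit_transform(X_num)

# Missing categories become their own "Unknown" level
cat_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")
cat_values = cat_imputer.fit_transform(X_cat)

print(num_values)
print(cat_values.ravel())
```

In a real project the imputers should be fitted on the training split only and then applied to the test split, for example by placing them inside a `Pipeline`.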


## **Step 2: Encode Categorical Features**

* Decision Trees can in principle split on categorical variables directly, but scikit-learn's implementation requires numeric inputs, so use an **encoding technique**:

  * **Ordinal Encoding** (e.g., `OrdinalEncoder`) for ordinal categorical features; `LabelEncoder` is meant for the target column.
  * **One-Hot Encoding** for nominal categorical features with a small number of categories.
* Avoid high-cardinality one-hot encoding; consider **target encoding** or **frequency encoding** if categories are numerous.
* Make sure encoding is consistent between training and test datasets.
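
One way to keep encoding consistent between training and test data is to fit the encoders once on the training set and reuse them. The sketch below assumes two hypothetical columns, an ordinal `severity` and a nominal `blood_type`; the category order is an assumption for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical columns: 0 = severity (ordinal), 1 = blood_type (nominal)
train = np.array(
    [["mild", "A"], ["severe", "B"], ["moderate", "O"], ["mild", "AB"]],
    dtype=object,
)
test = np.array([["moderate", "A"], ["mild", "Rh-null"]], dtype=object)

encoder = ColumnTransformer(
    [
        # Explicit order so that mild < moderate < severe
        ("ordinal", OrdinalEncoder(categories=[["mild", "moderate", "severe"]]), [0]),
        # Categories unseen during fit are encoded as all zeros
        ("onehot", OneHotEncoder(handle_unknown="ignore"), [1]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train_enc = encoder.fit_transform(train)  # fit on training data only
X_test_enc = encoder.transform(test)        # reuse the fitted mapping

print(X_train_enc)
print(X_test_enc)
```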


## **Step 3: Train a Decision Tree Model**

* Split data into **training** and **validation/test** sets.
* Initialize the Decision Tree Classifier, e.g., with default parameters initially.
* Fit the model on the processed training data.
* Compare training and validation scores to check for overfitting: an unconstrained Decision Tree can reach near-perfect training accuracy while doing much worse on unseen data.


## **Step 4: Tune Hyperparameters**

* Use **hyperparameter tuning** (e.g., GridSearchCV or RandomizedSearchCV) to optimize:

  * `max_depth` (controls tree complexity)
  * `min_samples_split` and `min_samples_leaf` (control minimum data to split/leaf)
  * `criterion` (e.g., Gini vs. Entropy)
  * Others like `max_features`, `max_leaf_nodes`
* Use **cross-validation** during tuning to ensure robust parameter selection.
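
Steps 3 and 4 can be combined in one pipeline so that imputation is re-fitted inside each cross-validation fold. This is a sketch on synthetic data standing in for the patient dataset: the data generator, the 5% missing rate, the grid values, and the choice of recall as the tuning metric are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the patient data (about 20% positive)
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=42,
)
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.05] = np.nan  # knock out roughly 5% of the values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_leaf": [1, 5, 20],
    "tree__criterion": ["gini", "entropy"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated recall: {search.best_score_:.3f}")
```

The held-out `X_test` and `y_test` are left untouched here so they can be used for the final evaluation in Step 5.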


## **Step 5: Evaluate Performance**

* Evaluate the tuned model on a **held-out test set**.
* Use relevant metrics for classification in healthcare:

  * **Accuracy**: Overall correctness.
  * **Precision and Recall**: Especially recall (sensitivity) to minimize false negatives (missing disease cases).
  * **F1-Score**: Balance between precision and recall.
  * **ROC-AUC**: How well the model separates classes.
* Consider **confusion matrix** to understand error types.
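* If the dataset is imbalanced, consider class weighting (e.g., `class_weight='balanced'`), resampling techniques like **SMOTE** applied to the training data only, or adjusting the decision threshold.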
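
The metrics above are all available in `sklearn.metrics`. The sketch below again uses synthetic data as a stand-in for the patient records; the class names, tree settings, and class balance are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the patient data (about 15% positive)
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.85, 0.15], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = DecisionTreeClassifier(max_depth=4, class_weight="balanced", random_state=0)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # estimated probability of the positive class

cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
recall = tp / (tp + fn)
auc = roc_auc_score(y_test, y_score)

print("Confusion matrix:\n", cm)
print(f"False negatives (missed disease cases): {fn}")
print(f"Recall (sensitivity): {recall:.3f}")
print(f"ROC-AUC: {auc:.3f}")
print(classification_report(y_test, y_pred, target_names=["healthy", "disease"]))
```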


## **Business Value of the Model**

* **Early detection and intervention**: Predicting disease presence early enables timely treatment, potentially saving lives and reducing healthcare costs.
* **Resource optimization**: Helps prioritize patients for further testing or specialist referral, optimizing limited medical resources.
* **Improved patient outcomes**: Proactive care based on prediction can improve prognosis and quality of life.
* **Cost savings**: Prevents unnecessary tests or late-stage treatments by focusing attention where it’s most needed.
* **Data-driven decision making**: Provides clinicians with actionable insights to support diagnosis, supplementing human expertise.

