# Decision Tree | Assignment

### Question 1: What is a Decision Tree, and how does it work in the context of classification?

### **Definition**:-  
A **Decision Tree** is a supervised machine learning algorithm that uses a tree-like structure to make decisions.  
- The **Root Node** represents the entire dataset.  
- **Internal Nodes** represent tests or conditions on features.  
- **Branches** represent outcomes of those tests.  
- **Leaf Nodes** represent the final predicted class labels.  

### **How it works in Classification**:-  
In the context of classification, a Decision Tree follows these steps:  

1. **Feature Selection**  
   - At each step, the algorithm chooses the feature that best splits the data into distinct classes.  
   - This selection is based on metrics such as:  
     - **Information Gain** (using Entropy)  
     - **Gini Impurity**  

2. **Splitting the Dataset**  
   - The dataset is divided into subsets based on the chosen feature’s values.  
   - Each subset moves to a new branch of the tree.  

3. **Recursive Partitioning**  
   - The process of selecting features and splitting is repeated recursively for each subset.  
   - This continues until:  
     - All instances in a node belong to the same class, or  
     - No further meaningful split can be made.  

4. **Leaf Node Prediction**  
   - Once a leaf node is reached, it represents the final predicted class label.  
   - For any new instance, the model traces the path from the root to a leaf node, based on its feature values, and assigns the corresponding class label.  

 **Example**:-  
Consider classifying emails into *Spam* or *Not Spam*:  
- The tree may first split on whether the subject line contains the word **“offer”**.  
- If yes, it may further split on whether the sender is in the recipient’s contact list.  
- Finally, the leaf nodes represent the decision (*Spam* or *Not Spam*).
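
At prediction time a trained tree is just a chain of nested conditions. The sketch below hand-codes the spam example to make that concrete; the function name `classify_email` and the label assigned to each leaf are illustrative assumptions, not the result of training a model.

```python
def classify_email(subject: str, sender_in_contacts: bool) -> str:
    """Trace one root-to-leaf path of a hand-written (hypothetical) spam tree."""
    # Root node: does the subject line contain the word "offer"?
    if "offer" in subject.lower():
        # Internal node: is the sender in the recipient's contact list?
        if sender_in_contacts:
            return "Not Spam"  # leaf (assumed label)
        return "Spam"          # leaf (assumed label)
    return "Not Spam"          # leaf (assumed label)


print(classify_email("Exclusive OFFER just for you", sender_in_contacts=False))
print(classify_email("Meeting notes", sender_in_contacts=False))
```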

### Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

### **Gini Impurity**:-  
- **Formula:**  
  \[
  Gini = 1 - \sum_{i=1}^{C} p_i^2
  \]  
- It measures the probability of misclassifying a randomly chosen instance if it were labeled at random according to the class distribution in the node.  
- Range: **0 (pure node)** to **0.5 (maximum impurity for binary classes)**.  
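
The Gini formula above can be checked with a few lines of plain Python. This is a minimal sketch; the helper name `gini_impurity` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini = 1 - sum(p_i^2), where p_i is the proportion of class i in the node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


pure_node = ["spam"] * 4
mixed_node = ["spam", "ham", "spam", "ham"]
print(gini_impurity(pure_node))   # pure node: no impurity
print(gini_impurity(mixed_node))  # 50/50 binary node: maximum impurity
```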

### **Entropy**:-  
- **Formula:**  
  \[
  Entropy = - \sum_{i=1}^{C} p_i \cdot \log_2(p_i)
  \]  
- It measures the amount of disorder or uncertainty in the data.  
- Range: **0 (pure node)** to **1 (maximum impurity for binary classes)**.  

- **Information Gain** = Reduction in entropy after a split.  
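
The entropy formula can be sketched the same way. The helper name `entropy` and the toy labels are illustrative assumptions; only classes that actually occur in the node are summed, so `log2(0)` never arises.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)) over the classes present in the node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(count / n) * math.log2(count / n) for count in Counter(labels).values())


pure_node = ["spam"] * 4
mixed_node = ["spam", "ham", "spam", "ham"]
print(entropy(pure_node))   # pure node: no uncertainty
print(entropy(mixed_node))  # 50/50 binary node: maximum uncertainty
```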

### **Impact on Splits**:-  
- At each node, the Decision Tree algorithm evaluates possible features using Gini or Entropy.  
- The feature that results in the **largest decrease in impurity (highest purity gain)** is selected for the split.  
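- **Gini** is slightly cheaper to compute because it needs no logarithm; it is the default criterion in scikit-learn's `DecisionTreeClassifier`.  
- **Entropy** is the basis of Information Gain; in practice the two criteria usually select the same or very similar splits.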

 Thus, both measures guide the tree to create splits that make the resulting child nodes purer, improving classification accuracy.
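
To make the split selection concrete, the sketch below scans candidate thresholds on one numeric feature and keeps the one whose children have the lowest weighted Gini impurity. It is a simplified illustration rather than scikit-learn's actual implementation, and the `hours`/`passed` data and helper names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the split with the purest children."""
    best = (None, float("inf"))
    candidates = sorted(set(values))
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2  # midpoint between consecutive distinct values
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (t, weighted)
    return best


# Hypothetical toy data: study hours and exam outcome
hours = [1, 2, 3, 6, 7, 8]
passed = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_threshold(hours, passed)
print(f"Best split: hours <= {threshold} (weighted Gini = {impurity})")
```

Here the split at 4.5 hours separates the two classes perfectly, so both children are pure and the weighted impurity is 0.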

### Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

### **Pre-Pruning (Early Stopping)**:-  
- **Concept:** Stops the tree from growing once a certain condition is met (e.g., maximum depth, minimum number of samples per node, minimum information gain).  
- **Goal:** Prevents the tree from becoming too complex during training.

**Advantage:-**  
- Saves computation time and avoids overfitting by restricting unnecessary splits.  
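
In scikit-learn, pre-pruning is expressed through constructor arguments. A minimal sketch on the Iris data follows; the specific values for `max_depth`, `min_samples_split`, and `min_impurity_decrease` are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Growth stops as soon as any of these conditions blocks a split
pre_pruned = DecisionTreeClassifier(
    max_depth=3,                 # never grow deeper than 3 levels
    min_samples_split=10,        # a node needs at least 10 samples to be split
    min_impurity_decrease=0.01,  # a split must reduce impurity by at least this much
    random_state=42,
).fit(X, y)

print("Depth:", pre_pruned.get_depth())
print("Leaves:", pre_pruned.get_n_leaves())
```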

### **Post-Pruning (Pruning after Full Growth)**:-  
- **Concept:** The tree is allowed to grow fully, and then branches that provide little to no improvement are removed.  
- **Goal:** Simplifies the tree after training by cutting back less important branches.  

**Advantage:-**  
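- Pruning decisions are made with the whole tree visible, so a split that looks weak on its own is kept if it enables useful splits below it; the result is usually a simpler model that generalizes better to unseen data.  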
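
scikit-learn implements post-pruning as minimal cost-complexity pruning, controlled by `ccp_alpha`. The sketch below grows a full tree, asks it for the candidate alphas, and refits with one of them. Picking the mid-range alpha is an arbitrary assumption made for illustration; in practice the value would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Grow the tree fully, then compute the sequence of effective alphas
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice: a mid-range alpha (clamped at 0 to guard against rounding error)
alpha = max(path.ccp_alphas[len(path.ccp_alphas) // 2], 0.0)
pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha).fit(X_train, y_train)

print("Nodes before pruning:", full_tree.tree_.node_count)
print("Nodes after pruning: ", pruned_tree.tree_.node_count)
```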

 **Key Difference:-**  
- **Pre-Pruning** controls tree growth during training.  
- **Post-Pruning** reduces the size of a fully grown tree by removing weak splits afterward.

### Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

### **Information Gain**  
- **Definition:** Information Gain (IG) is a measure of how much **uncertainty (entropy)** in the dataset is reduced after splitting based on a particular feature.  
- **Formula:**  
  \[
  IG(D, A) = Entropy(D) - \sum_{v \in Values(A)} \frac{|D_v|}{|D|} \cdot Entropy(D_v)
  \]  
  where:  
  - \(D\) = dataset  
  - \(A\) = feature used for splitting  
  - \(D_v\) = subset of \(D\) where feature \(A\) takes value \(v\)  

### **Importance for Choosing the Best Split**  
1. At each node, the Decision Tree algorithm calculates Information Gain for all features.  
2. The feature with the **highest Information Gain** is chosen, since it provides the most effective separation of classes.  
3. Higher IG → larger reduction in impurity → **purer child nodes**.  
4. This steers the tree toward splits that actually separate the classes, which generally improves classification accuracy.  

 **Example:-**  
If splitting student data by "Study Hours" reduces entropy more than splitting by "Attendance", then "Study Hours" will be chosen as the splitting feature because it gives higher Information Gain.
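
The formula and the student example can be worked through numerically. This is a minimal sketch: the pass/fail counts for each split are made up for illustration, and `entropy` and `information_gain` are hypothetical helper names.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """IG = Entropy(parent) - weighted average entropy of the child subsets."""
    n = len(parent)
    weighted_child_entropy = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted_child_entropy


# Hypothetical data: 5 students passed and 5 failed
parent = ["pass"] * 5 + ["fail"] * 5
by_study_hours = [["pass"] * 4 + ["fail"], ["pass"] + ["fail"] * 4]
by_attendance = [["pass"] * 3 + ["fail"] * 2, ["pass"] * 2 + ["fail"] * 3]

ig_study = information_gain(parent, by_study_hours)
ig_attendance = information_gain(parent, by_attendance)
print(f"IG(Study Hours) = {ig_study:.3f}")
print(f"IG(Attendance)  = {ig_attendance:.3f}")
```

With these counts, "Study Hours" yields an Information Gain of about 0.278 bits against roughly 0.029 for "Attendance", so it would be chosen for the split.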

### Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

Dataset Info:
- Iris Dataset for classification tasks (`sklearn.datasets.load_iris()` or provided CSV).
- Boston Housing Dataset for regression tasks (`sklearn.datasets.load_boston()` or provided CSV).

### **Real-World Applications**:-

1. **Classification Tasks**  
   - **Iris Dataset**: Predicting the species of a flower based on features like petal and sepal length/width.  
     ```python
     from sklearn.datasets import load_iris
     iris = load_iris()
     X, y = iris.data, iris.target
     ```  
   - **Medical Diagnosis**: Predicting diseases based on patient symptoms and test results.  
   - **Customer Segmentation**: Classifying customers based on behavior or demographics.


2. **Regression Tasks**  
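   - **Housing Price Datasets**: Predicting house prices based on features like number of rooms, location, etc. The Boston Housing loader (`load_boston`) was removed in scikit-learn 1.2, so the California Housing dataset is loaded here instead.  
     ```python
     # load_boston was removed in scikit-learn 1.2;
     # California Housing is a comparable regression dataset
     from sklearn.datasets import fetch_california_housing
     housing = fetch_california_housing()
     X, y = housing.data, housing.target
     ```  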
   - **Sales Forecasting**: Predicting sales revenue from historical data.  
   - **Risk Assessment**: Estimating financial risk scores in banking.  

### **Advantages of Decision Trees**:-  
- **Easy to Interpret:** Graphical tree structure is intuitive.  
- **Handles Both Data Types:** Can manage categorical and numerical features.  
- **Non-Linear Relationships:** Captures complex patterns without requiring data scaling.  
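- **Requires Minimal Data Preparation:** No need for normalization or feature scaling; some implementations also accept categorical features directly, although scikit-learn requires them to be encoded numerically.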
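
The interpretability claim is easy to demonstrate: scikit-learn's `export_text` prints a fitted tree as plain if/else rules. A minimal sketch on the Iris data, with `max_depth=2` chosen only to keep the printout short:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(iris.data, iris.target)

# Render the learned splits as human-readable rules
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```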

### **Limitations of Decision Trees**:-  
- **Overfitting:** Deep trees can memorize training data, reducing generalization.  
- **High Variance:** Small changes in data can lead to very different trees.  
- **Bias Towards High-Cardinality Features:** Features with many distinct levels offer more candidate splits and may dominate, especially under Information Gain.  
- **Limited Predictive Accuracy:** Alone, may not perform as well as ensemble methods like Random Forest or Gradient Boosting.








### Question 6: Write a Python program to:
- Load the Iris Dataset
- Train a Decision Tree Classifier using the Gini criterion
- Print the model’s accuracy and feature importances

(Include your Python code and output in the code box below.)

```python
# Question 6: Decision Tree Classifier on Iris Dataset

# Step 1: Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 2: Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names

# Step 3: Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: Train Decision Tree Classifier using Gini criterion
dt_model = DecisionTreeClassifier(criterion='gini', random_state=42)
dt_model.fit(X_train, y_train)

# Step 5: Make predictions and evaluate accuracy
y_pred = dt_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Step 6: Print accuracy and feature importances
print(f"Decision Tree Accuracy: {accuracy:.2f}")
print("Feature Importances:")
for name, importance in zip(feature_names, dt_model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

**Output:-**
```
Decision Tree Accuracy: 1.00
Feature Importances:
sepal length (cm): 0.000
sepal width (cm): 0.019
petal length (cm): 0.893
petal width (cm): 0.088
```
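
A perfect score on a 45-sample test set can be optimistic, since it depends on one particular split. As a hedged cross-check (not part of the original task), the sketch below estimates accuracy with 5-fold cross-validation over the whole dataset; `cv=5` is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each of the 5 folds serves once as the held-out set
scores = cross_val_score(
    DecisionTreeClassifier(criterion="gini", random_state=42), X, y, cv=5
)
print("Fold accuracies:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f}")
```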


### Question 7: Write a Python program to:
- Load the Iris Dataset
- Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

(Include your Python code and output in the code box below.)

```python
# Question 7: Compare Decision Tree with max_depth=3 vs fully-grown tree on Iris Dataset

# Step 1: Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 2: Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Step 3: Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: Train Decision Tree with max_depth=3
dt_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_limited.fit(X_train, y_train)

# Step 5: Train fully-grown Decision Tree
dt_full = DecisionTreeClassifier(random_state=42)
dt_full.fit(X_train, y_train)

# Step 6: Make predictions
y_pred_limited = dt_limited.predict(X_test)
y_pred_full = dt_full.predict(X_test)

# Step 7: Evaluate and print accuracies
accuracy_limited = accuracy_score(y_test, y_pred_limited)
accuracy_full = accuracy_score(y_test, y_pred_full)

print(f"Accuracy of Decision Tree with max_depth=3: {accuracy_limited:.2f}")
print(f"Accuracy of Fully-Grown Decision Tree: {accuracy_full:.2f}")
```

**Output:-**
```
Accuracy of Decision Tree with max_depth=3: 1.00
Accuracy of Fully-Grown Decision Tree: 1.00
```
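
Identical test accuracy does not mean the two trees are identical. As a quick check, the sketch below refits both models with the same split and settings as the cell above and reports their size through `get_depth()` and `get_n_leaves()`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

dt_limited = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
dt_full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Compare structural complexity rather than accuracy alone
for name, model in [("max_depth=3", dt_limited), ("fully grown", dt_full)]:
    print(f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}")
```

When a shallower tree matches the accuracy of the full one, the smaller model is usually the better choice because it is easier to interpret and less likely to overfit.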


### Question 8: Write a Python program to:
- Load the California Housing dataset from sklearn
- Train a Decision Tree Regressor
- Print the Mean Squared Error (MSE) and feature importances

(Include your Python code and output in the code box below.)


```python
# Question 8: Decision Tree Regressor on California Housing Dataset

# Step 1: Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Step 2: Load the California Housing dataset
california = fetch_california_housing()
X, y = california.data, california.target
feature_names = california.feature_names

# Step 3: Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: Train Decision Tree Regressor
dt_regressor = DecisionTreeRegressor(random_state=42)
dt_regressor.fit(X_train, y_train)

# Step 5: Make predictions
y_pred = dt_regressor.predict(X_test)

# Step 6: Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

# Step 7: Print MSE and feature importances
print(f"Decision Tree Regressor MSE: {mse:.3f}")
print("Feature Importances:")
for name, importance in zip(feature_names, dt_regressor.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

**Output:-**
```
Decision Tree Regressor MSE: 0.528
Feature Importances:
MedInc: 0.523
HouseAge: 0.052
AveRooms: 0.049
AveBedrms: 0.025
Population: 0.032
AveOccup: 0.139
Latitude: 0.090
Longitude: 0.089
```

### Question 9: Write a Python program to:
- Load the Iris Dataset
- Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
- Print the best parameters and the resulting model accuracy

(Include your Python code and output in the code box below.)

```python
# Question 9: Hyperparameter Tuning for Decision Tree using GridSearchCV on Iris Dataset

# Step 1: Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 2: Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Step 3: Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: Define the parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10]
}

# Step 5: Initialize GridSearchCV with Decision Tree Classifier
grid_search = GridSearchCV(estimator=DecisionTreeClassifier(random_state=42),
                           param_grid=param_grid,
                           cv=5,
                           scoring='accuracy')

# Step 6: Fit GridSearchCV to training data
grid_search.fit(X_train, y_train)

# Step 7: Retrieve the best parameters and evaluate model accuracy
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Step 8: Print results
print("Best Parameters:", best_params)
print(f"Accuracy of tuned Decision Tree: {accuracy:.2f}")
```

**Output:-**
```
Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Accuracy of tuned Decision Tree: 1.00
```
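
Beyond the single best setting, it can help to see how close the other candidates were. The sketch below reruns the same grid search and reads `best_score_` and `cv_results_` from the fitted `GridSearchCV` object; showing the top three candidates is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [2, 3, 4, 5, None], "min_samples_split": [2, 5, 10]},
    cv=5,
    scoring="accuracy",
)
grid_search.fit(X_train, y_train)

# Mean cross-validated accuracy of the winning combination
print(f"Best mean CV accuracy: {grid_search.best_score_:.3f}")

# Rank all candidates by mean CV score and show the top three
ranked = sorted(
    zip(grid_search.cv_results_["mean_test_score"], grid_search.cv_results_["params"]),
    key=lambda pair: pair[0],
    reverse=True,
)
for score, params in ranked[:3]:
    print(f"{score:.3f}  {params}")
```

If several combinations score almost identically, the simplest one (smallest depth, largest `min_samples_split`) is a reasonable pick.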


### Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.

Explain the step-by-step process you would follow to:
- Handle the missing values
- Encode the categorical features
- Train a Decision Tree model
- Tune its hyperparameters
- Evaluate its performance

And describe what business value this model could provide in the real-world setting.

### **Step 1: Handle Missing Values**
- **Identify missing values** in both numerical and categorical columns (`df.isnull().sum()`).  
- **Imputation Methods:**  
  - **Numerical Features:** Fill missing values with **mean, median**, or use advanced methods like KNN imputer.  
  - **Categorical Features:** Fill missing values with **mode** or create a separate category like "Unknown".  
- **Why:** Many Decision Tree implementations cannot handle missing values directly (scikit-learn's trees only gained native NaN support in version 1.3); proper imputation ensures the model can learn effectively.
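
A minimal sketch of this step, assuming a small made-up patient table (the column names `age`, `blood_pressure`, and `smoker` are hypothetical): numeric gaps are filled with the column median via `SimpleImputer`, and categorical gaps get an explicit "Unknown" category.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical patient records with gaps in both kinds of columns
df = pd.DataFrame({
    "age": [34, np.nan, 51, 46],
    "blood_pressure": [120, 135, np.nan, 128],
    "smoker": ["yes", np.nan, "no", "no"],
})
print(df.isnull().sum())  # missing values per column

# Numerical features: replace NaN with the column median
num_cols = ["age", "blood_pressure"]
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])

# Categorical feature: keep "missing" as its own category
df["smoker"] = df["smoker"].fillna("Unknown")
print(df)
```

In a real project the imputer should be fitted on the training split only and then applied to the test split, so that no information leaks from the test data.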

### **Step 2: Encode Categorical Features**
- **Identify categorical columns**.  
- **Encoding Options:**  
  - **One-Hot Encoding:** For nominal categories without order.  
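  - **Ordinal (Label) Encoding:** For ordinal categories with intrinsic order; in scikit-learn, use `OrdinalEncoder` for features (`LabelEncoder` is intended for the target).  
- **Why:** scikit-learn's Decision Trees require numerical input; encoding converts categorical data to numeric form while preserving information.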
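
Both encodings can be applied in one pass with a `ColumnTransformer`. This is a sketch under assumed column names (`blood_type` as a nominal feature, `pain_level` as an ordinal one) and an assumed category order; `sparse_threshold=0` is set only so the result prints as a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical features
df = pd.DataFrame({
    "blood_type": ["A", "B", "O", "A"],               # nominal: no order
    "pain_level": ["low", "high", "medium", "low"],   # ordinal: low < medium < high
})

encoder = ColumnTransformer(
    [
        ("nominal", OneHotEncoder(handle_unknown="ignore"), ["blood_type"]),
        ("ordinal", OrdinalEncoder(categories=[["low", "medium", "high"]]), ["pain_level"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = encoder.fit_transform(df)
print(encoded)  # columns: blood_type A, B, O (one-hot), then pain_level as 0/1/2
```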

### **Step 3: Train a Decision Tree Model**
- **Split dataset** into training and testing sets (e.g., 70:30).  
- **Initialize Decision Tree Classifier**:  
  ```python
  from sklearn.tree import DecisionTreeClassifier
  dt_model = DecisionTreeClassifier(criterion='gini', random_state=42)
  dt_model.fit(X_train, y_train)
  ```

### **Step 4: Tune Hyperparameters**
- Use **GridSearchCV** or **RandomizedSearchCV** to find the best parameters:  
  - `max_depth` → caps the depth of the tree, limiting its complexity  
  - `min_samples_split` → minimum number of samples a node must contain before it can be split  
  - `min_samples_leaf` → minimum number of samples required at each leaf node  
- Example:  
```python
from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth':[3,5,7,None], 'min_samples_split':[2,5,10]}
grid_search = GridSearchCV(dt_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
```

### **Step 5: Evaluate Model Performance**
- Use appropriate classification metrics:  
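  - **Accuracy**: Measures overall correctness, but can be misleading when the disease is rare  
  - **Precision, Recall, F1-score**: Recall shows how many diseased patients are caught and precision shows how many positive predictions are correct; F1 balances the two  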
  - **ROC-AUC**: Evaluates model’s ability to distinguish between diseased and non-diseased  
- Perform **cross-validation** to ensure the model generalizes well to unseen data.
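
The metrics above can be computed as follows. Since no patient data is given, this sketch uses a synthetic, imbalanced dataset from `make_classification` as a stand-in; the class names, the 80/20 class ratio, and `max_depth=5` are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data: about 20% positive ("diseased") cases
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# Precision, recall, and F1 per class
print(classification_report(y_test, y_pred, target_names=["healthy", "diseased"]))

auc = roc_auc_score(y_test, y_prob)
print(f"ROC-AUC: {auc:.3f}")

# Cross-validated recall on the training data
cv_recall = cross_val_score(model, X_train, y_train, cv=5, scoring="recall")
print(f"Mean CV recall: {cv_recall.mean():.3f}")
```

Recall is used for the cross-validation score here because, in a screening setting, a missed diseased patient is typically the costlier error.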

### **Step 6: Business Value in Real-World**
- **Early Detection:** Identify high-risk patients before severe symptoms develop.  
- **Resource Allocation:** Focus medical tests and interventions on high-risk patients.  
- **Decision Support:** Provide doctors with data-driven insights to reduce diagnostic errors.  
- **Cost Efficiency:** Optimize healthcare resources and reduce unnecessary procedures.

 **Summary:-**  
A systematic approach (handling missing values, encoding categorical features, training and tuning a Decision Tree, and evaluating its performance) yields a model that can provide actionable insights in healthcare, improving patient outcomes and operational efficiency.

