# **Decision Trees Assignment**

### **THEORY**

## 1. What is a Decision Tree, and how does it work in the context of classification?

- A Decision Tree is a supervised machine learning algorithm that makes a prediction by answering a sequence of questions about the input, one step at a time, until it reaches a conclusion.
- Think of it as a flowchart: each internal node tests a condition on one feature (for example, a yes/no question), each branch is an outcome of that test, and each leaf holds the final decision or prediction.

##### In the Context of Classification

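- In classification, a decision tree predicts which category (class) a sample belongs to. A sample starts at the root, follows the branch that matches each test, and is assigned the majority class of the training samples in the leaf it reaches.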
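
To make the flowchart idea concrete, here is a minimal hand-written sketch of a two-question tree for flowers. The function name `classify_flower` and both thresholds are illustrative assumptions, not values learned from data; a real decision tree learns its questions and thresholds during training.

```python
def classify_flower(petal_length_cm: float, petal_width_cm: float) -> str:
    """Classify a flower with a tiny hand-written decision tree.

    The thresholds below are illustrative assumptions, not learned values.
    """
    # Root node: first question
    if petal_length_cm < 2.5:
        return "setosa"          # leaf
    # Internal node: second question
    if petal_width_cm < 1.8:
        return "versicolor"      # leaf
    return "virginica"           # leaf


print(classify_flower(1.4, 0.2))
print(classify_flower(4.5, 1.4))
print(classify_flower(6.0, 2.3))
```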

## 2. Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
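- Gini Impurity measures the probability that a randomly chosen item would be misclassified if it were labeled at random according to the class distribution in the node. Formula: Gini = 1 - sum of p_i^2 over all classes, where p_i is the fraction of items in class i.
- Entropy measures the amount of randomness or uncertainty in the node. Formula: Entropy = - sum of p_i * log2(p_i) over all classes. Higher entropy = more disorder (mixed classes).
- Both measures are 0 for a pure node (all items in one class) and highest when the classes are evenly mixed.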
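
The two measures can be computed in a few lines. This is a minimal sketch using only the standard library; the helper names `gini` and `entropy` and the toy label lists are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes present."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


pure = ["A", "A", "A", "A"]    # a single class: no impurity
mixed = ["A", "A", "B", "B"]   # evenly mixed: maximum impurity for two classes

print("pure :", gini(pure), entropy(pure))
print("mixed:", gini(mixed), entropy(mixed))
```

For two classes, Gini ranges from 0 to 0.5 and entropy from 0 to 1, so the raw numbers differ even though both rank the pure node as better.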

##### How They Impact Splits in a Decision Tree

- At each node, the tree evaluates candidate splits on every feature and picks the best one: the split that leaves the lowest weighted impurity in the child nodes.
- For each candidate split, it:
  - Calculates the Gini or Entropy of the current node (before the split)
  - Calculates the weighted average of Gini/Entropy over the child nodes (after the split)
  - Chooses the split with the highest impurity reduction (the difference between the two)
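
The split search described above can be sketched for a single numeric feature. This is a simplified illustration, not how any particular library implements it: the function `best_split`, the midpoint rule for candidate thresholds, and the toy data are all assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, impurity_reduction) of the best binary split."""
    n = len(labels)
    parent_impurity = gini(labels)          # impurity before the split
    best = (None, 0.0)
    # Candidate thresholds: midpoints between consecutive distinct values
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Weighted average impurity of the two children (after the split)
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        reduction = parent_impurity - weighted
        if reduction > best[1]:
            best = (threshold, reduction)
    return best


# Toy data (made up for illustration)
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = ["setosa"] * 3 + ["versicolor"] * 3

threshold, reduction = best_split(petal_length, species)
print("Best threshold:", threshold)
print("Impurity reduction:", reduction)
```

A full tree repeats this search over every feature at every node and recurses into the two children.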

## 3. What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

- Pruning: Cutting off branches of a decision tree that are not useful or too specific, to make the model simpler and more general (better on new data).

| Feature            | **Pre-Pruning (Early Stopping)**             | **Post-Pruning (e.g., Reduced Error Pruning)**                       |
| ------------------ | -------------------------------------------- | -------------------------------------------------------------------- |
| **When?**          | During tree building (early stop)            | After full tree is built                                             |
| **How?**           | Stops splitting a node if conditions are met | Grows full tree, then removes unnecessary branches                   |
| **Criteria**       | Max depth, min samples at node, min impurity | Validation accuracy, error reduction after pruning                   |
| **Overfitting?**   | Tries to avoid it while growing the tree     | Handles it after seeing the full tree                                |
| **Complexity**     | Faster, needs fewer resources                | Slower, since the full tree is built first; often generalizes better |

##### Pre-Pruning - Example
- Suppose you're building a tree to predict loan approval. You set:
  - Max depth = 3
  - Min samples to split = 10
- If a node has fewer than 10 samples or the depth reaches 3, it stops splitting, even if a further split could improve accuracy on the training data.

- Advantage:
  - Faster training time, prevents tree from becoming too large.
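
The same two limits can be set directly on a scikit-learn tree. This is a minimal sketch that assumes the Iris dataset as a stand-in, since no loan data is provided here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris is used as a stand-in dataset (assumption: no loan data available)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: growth stops at depth 3 or when a node has fewer than 10 samples
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=10, random_state=42
)
pre_pruned.fit(X_train, y_train)

test_accuracy = pre_pruned.score(X_test, y_test)
print("Depth:", pre_pruned.get_depth())
print("Leaves:", pre_pruned.get_n_leaves())
print("Test accuracy:", round(test_accuracy, 4))
```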

##### Post-Pruning - Example
- You first allow the tree to grow fully (even if it overfits), then go back and remove branches that do not improve accuracy on a validation set.

- Advantage:
  - Produces a tree that is more generalizable to new data (better real-world performance).
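
scikit-learn does not provide reduced error pruning; its built-in post-pruning method is minimal cost-complexity pruning, controlled by `ccp_alpha`. The sketch below swaps in that technique and picks the `ccp_alpha` value that scores best on a validation set, which is in the same spirit as the description above. The bundled breast cancer dataset and the 70/30 split are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: a bundled dataset is used purely for illustration
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Step 1: grow the full tree (no pruning)
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Step 2: compute the candidate pruning strengths for this training set
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Step 3: keep the pruned tree that scores best on the validation set
best_alpha, best_score, best_tree = 0.0, -1.0, None
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (simpler tree)
        best_alpha, best_score, best_tree = alpha, score, tree

print("Full tree leaves  :", full_tree.get_n_leaves())
print("Pruned tree leaves:", best_tree.get_n_leaves())
print("Best ccp_alpha    :", best_alpha)
print("Validation accuracy:", round(best_score, 4))
```

Because the validation set is used to choose `ccp_alpha`, a separate test set would be needed for an unbiased final estimate.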

## 4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?

- Information Gain (IG) is a measure used to decide which feature to split on at each step of building a Decision Tree.
- It tells us how much “information” or “purity” we gain by splitting the data based on a particular feature.
- Information Gain = How much the uncertainty (Entropy) is reduced after the split.

- Formula: Information Gain = Entropy(Parent) - Weighted Average Entropy(Children), where each child's entropy is weighted by its share of the parent's samples.

##### Why is it Important?
- The higher the Information Gain, the better the feature is at separating the data.
- So, the decision tree chooses the feature with the highest IG for the split.
- This helps the tree make more accurate and meaningful splits.
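
The formula can be checked on a toy example. This is a minimal sketch; the helper names and the four-row weather dataset are made up for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy(parent) minus the weighted average entropy of the children."""
    # Group the labels by the value of the feature (one child per value)
    children = defaultdict(list)
    for value, label in zip(feature_values, labels):
        children[value].append(label)
    n = len(labels)
    weighted_child_entropy = sum(
        len(child) / n * entropy(child) for child in children.values()
    )
    return entropy(labels) - weighted_child_entropy


# Toy data (hypothetical): does the person play outside?
outlook = ["sunny", "sunny", "rainy", "rainy"]
windy = ["yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes"]

ig_outlook = information_gain(outlook, play)
ig_windy = information_gain(windy, play)
print("IG(outlook):", ig_outlook)
print("IG(windy)  :", ig_windy)
```

Here `outlook` separates the classes perfectly while `windy` tells us nothing, so a tree using Information Gain would split on `outlook`.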

## 5. What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

##### Real-World Applications of Decision Trees

| Area       | Application Example                                     |
| ---------- | ------------------------------------------------------- |
| Education  | Predicting student performance                          |
| Healthcare | Diagnosing diseases based on symptoms                   |
| Banking    | Approving loans or detecting fraud                      |
| Retail     | Customer purchase prediction, segmentation              |
| HR         | Predicting employee attrition (will someone leave?)     |
| Telecom    | Predicting customer churn (will a user cancel service?) |

##### Advantages of Decision Trees

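| Advantage                   | Explanation                                                                                                    |
| --------------------------- | -------------------------------------------------------------------------------------------------------------- |
| Easy to understand          | Works like a flowchart; no math degree needed!                                                                 |
| No need to scale data       | Splits compare a feature to a threshold, so normalization or standardization is not required                   |
| Handles mixed feature types | Works with both numeric and categorical features (some libraries, including scikit-learn, need them encoded)   |
| Fast training               | Especially for small to medium datasets                                                                        |
| Handles missing values      | Some implementations can work even if some values are missing; others need imputation first                    |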
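
The "easy to understand" point can be seen by printing a trained tree as text rules. This is a small sketch using scikit-learn's `export_text`; the choice of Iris and `max_depth=2` is an assumption made to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# Render the learned tests and leaf classes as indented text
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```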

##### Limitations of Decision Trees

| Limitation                               | Explanation                                                   |
| ---------------------------------------- | ------------------------------------------------------------- |
| Overfitting                              | Tree may become too complex and memorize the training data    |
| Unstable                                 | Small data changes can result in a completely different tree  |
| Biased towards features with more levels | Features with more categories may be chosen unfairly          |
| Not always the most accurate             | Often less accurate than models like Random Forest or XGBoost |

### **PRACTICAL**

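##### Dataset Info
- Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).
- Boston Housing Dataset for regression tasks (sklearn.datasets.load_boston() or provided CSV). Note that load_boston() was removed in scikit-learn 1.2, so Solution 8 uses fetch_california_housing() instead.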

## 6. Write a Python program to:
- Load the Iris Dataset
- Train a Decision Tree Classifier using the Gini criterion
- Print the model's accuracy and feature importances


In [None]:
# Solution 6

# Import libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Step 1: Load the Iris dataset
iris = load_iris()
X = iris.data            # Features
y = iris.target          # Labels

# Step 2: Split the data into train and test sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Train Decision Tree Classifier with Gini Index
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Step 4: Predict on the test set
y_pred = clf.predict(X_test)

# Step 5: Calculate Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", round(accuracy * 100, 2), "%")

# Step 6: Print Feature Importances
feature_names = iris.feature_names
importances = clf.feature_importances_

print("\nFeature Importances:")
for name, importance in zip(feature_names, importances):
    print(f"{name}: {round(importance, 4)}")

Model Accuracy: 100.0 %

Feature Importances:
sepal length (cm): 0.0
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


## 7. Write a Python program to:
- Load the Iris Dataset
- Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.


In [None]:
# Solution 7

# Import libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Step 1: Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Step 2: Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3a: Train fully-grown Decision Tree
clf_full = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Step 3b: Train pruned Decision Tree (max_depth=3)
clf_pruned = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf_pruned.fit(X_train, y_train)
y_pred_pruned = clf_pruned.predict(X_test)
accuracy_pruned = accuracy_score(y_test, y_pred_pruned)

# Step 4: Print the comparison
print("Model Accuracy Comparison:\n")
print(f"Fully-grown Tree Accuracy   : {round(accuracy_full * 100, 2)}%")
print(f"Pruned Tree (max_depth=3)   : {round(accuracy_pruned * 100, 2)}%")

Model Accuracy Comparison:

Fully-grown Tree Accuracy   : 100.0%
Pruned Tree (max_depth=3)   : 100.0%
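
Both trees score 100% here, but the test set holds only 30 samples, so a single split cannot really separate the two models. One way to get a steadier comparison is cross-validation. The sketch below is a suggestion rather than part of the assignment; the choice of 10 folds is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

results = {}
for name, depth in [("Fully-grown", None), ("max_depth=3", 3)]:
    clf = DecisionTreeClassifier(criterion="gini", max_depth=depth, random_state=42)
    # 10-fold cross-validation: every sample is used for testing exactly once
    scores = cross_val_score(clf, X, y, cv=10)
    results[name] = scores
    print(f"{name:12s}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```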


## 8. Write a Python program to:
- Load the Boston Housing Dataset
- Train a Decision Tree Regressor
- Print the Mean Squared Error (MSE) and feature importances

In [None]:
# Solution 8

# Import libraries
from sklearn.datasets import fetch_california_housing  # load_boston() was removed in scikit-learn 1.2
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Step 1: Load the housing dataset (alternative to Boston Housing)
data = fetch_california_housing()
X = data.data
y = data.target

# Step 2: Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Step 4: Predict and calculate Mean Squared Error (MSE)
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", round(mse, 4))

# Step 5: Print Feature Importances
print("\n Feature Importances:")
for name, importance in zip(data.feature_names, regressor.feature_importances_):
    print(f"{name}: {round(importance, 4)}")

Mean Squared Error (MSE): 0.4952

 Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.053
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


## 9. Write a Python program to:
- Load the Iris Dataset
- Tune the Decision Tree's max_depth and min_samples_split using GridSearchCV
- Print the best parameters and the resulting model accuracy

In [None]:
# Solution 9

# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Step 1: Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Step 2: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Set up the parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 4, 6, 10]
}

# Step 4: Create the Decision Tree model and apply GridSearchCV
clf = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Step 5: Predict and evaluate the best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Step 6: Output best parameters and accuracy
print("Best Parameters Found by GridSearchCV:")
print(grid_search.best_params_)

print("\n Accuracy of the Best Model:")
print(f"{round(accuracy * 100, 2)}%")

Best Parameters Found by GridSearchCV:
{'max_depth': 4, 'min_samples_split': 2}

 Accuracy of the Best Model:
100.0%


## 10. Imagine you're working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to:
- Handle the missing values
- Encode the categorical features
- Train a Decision Tree model
- Tune its hyperparameters
- Evaluate its performance
- And describe what business value this model could provide in the real-world setting.

In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

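# Step 1: Sample mock dataset (for demo purposes only)
# In real use, replace this with your CSV or real data. With only five rows the
# test set holds a single sample, so the report below is not a meaningful evaluation.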

data = {
    'age': [25, 35, 45, np.nan, 50],
    'blood_pressure': [120, 130, np.nan, 110, 140],
    'gender': ['Male', 'Female', np.nan, 'Male', 'Female'],
    'smoking_status': ['Never', 'Former', 'Current', 'Never', np.nan],
    'has_disease': [0, 1, 1, 0, 1]
}

df = pd.DataFrame(data)

# Separate features and target
X = df.drop('has_disease', axis=1)
y = df['has_disease']

# Step 2: Automatically detect column types
numeric_cols = X.select_dtypes(include=['float64', 'int64']).columns.tolist()
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()

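# Step 3: Handle missing values
# Note: for brevity the imputers are fit on the full dataset here. In real use,
# fit them on the training split only, so no test-set information leaks in.
num_imputer = SimpleImputer(strategy='median')
X[numeric_cols] = num_imputer.fit_transform(X[numeric_cols])

cat_imputer = SimpleImputer(strategy='most_frequent')
X[categorical_cols] = cat_imputer.fit_transform(X[categorical_cols])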

# Step 4: Encode categorical features
encoder = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
    remainder='passthrough'  # keep numeric columns
)

X_encoded = encoder.fit_transform(X)

# Step 5: Train-test split and Decision Tree
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Step 6: Predict and evaluate
y_pred = model.predict(X_test)
print("Classification Report:")
print(classification_report(y_test, y_pred))


Classification Report:
              precision    recall  f1-score   support

           1       1.00      1.00      1.00         1

    accuracy                           1.00         1
   macro avg       1.00      1.00      1.00         1
weighted avg       1.00      1.00      1.00         1
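
The cell above covers imputation, encoding, training and evaluation, but not hyperparameter tuning. Below is one possible sketch of all the steps in a single scikit-learn Pipeline, so that imputers and the encoder are fit only on training folds. Everything about the data is an assumption: the 300 rows are synthetic, the rule that generates `has_disease` is made up, and about 10% of values are blanked at random. The `f1` scoring choice is also an assumption; a real project would pick the metric with clinicians.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic mock data (hypothetical; replace with real patient data)
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 80, size=n).astype(float),
    "blood_pressure": rng.normal(125, 15, size=n),
    "gender": rng.choice(["Male", "Female"], size=n),
    "smoking_status": rng.choice(["Never", "Former", "Current"], size=n),
})
# Made-up rule that links the features to the target, plus noise
risk = (0.04 * (df["age"] - 50) + 0.03 * (df["blood_pressure"] - 125)
        + (df["smoking_status"] == "Current") * 1.0)
df["has_disease"] = (risk + rng.normal(0, 1, size=n) > 0).astype(int)

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["gender", "smoking_status"]

# Blank out roughly 10% of each feature to simulate missing values
for col in numeric_cols + categorical_cols:
    df.loc[rng.random(n) < 0.1, col] = np.nan

X = df.drop(columns="has_disease")
y = df["has_disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Steps 1-2: impute missing values, then one-hot encode categorical features
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Step 3: Decision Tree on top of the preprocessing
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Step 4: tune hyperparameters with 5-fold cross-validation on the training set
param_grid = {
    "tree__max_depth": [2, 3, 4, 5, None],
    "tree__min_samples_split": [2, 4, 6, 10],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Step 5: evaluate the best model on the held-out test set
y_pred = search.predict(X_test)
print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
```

Putting the preprocessing inside the pipeline means `GridSearchCV` refits the imputers and encoder within each fold, which avoids leaking validation data into the preprocessing.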



#### Business Value in a Real-World Healthcare Setting

| Area                     | Business Impact                                                               |
| ------------------------ | ----------------------------------------------------------------------------- |
| Early Diagnosis        | Predicts disease at early stages, enabling faster treatment                   |
| Risk Stratification    | Classifies patients into high-risk/low-risk groups for preventive care        |
| Resource Allocation    | Hospitals can better allocate beds, staff, and tests based on predicted needs |
| Cost Reduction         | Prevents unnecessary tests for low-risk patients, saving money                |
| Personalized Care      | Helps doctors make data-driven decisions customized to patient profiles       |
| Compliance & Reporting | Decision trees provide interpretable logic for audits and medical validation  |