# Decision Tree Assignment

---
## 1. What is a Decision Tree?

A **Decision Tree** is a popular supervised machine learning algorithm used for both classification and regression tasks. In the context of classification, it helps you decide the class (category) of a given input by learning simple decision rules inferred from the data features.

### How does a Decision Tree work for classification?

Think of a decision tree like a flowchart where:

* **Internal nodes** represent tests on features (e.g., "Is Age > 30?").
* **Branches** represent the outcome of those tests (yes or no).
* **Leaves** represent the final decision or class label (e.g., "Will buy product" or "Won't buy product").

#### Step-by-step process:

1. **Start at the root node**, which considers the entire dataset.
2. **Choose the best feature** (and, for numerical features, the best threshold) to split the data according to a criterion such as Information Gain or Gini Impurity, aiming to separate the data into distinct classes.
3. **Split the dataset** into subsets according to the chosen feature’s values.
4. **Repeat the process recursively** on each subset (child nodes), selecting features and splitting until:

   * All samples in a node belong to the same class, or
   * No more features are available, or
   * A stopping criterion like max depth or minimum samples per node is reached.
5. **Assign class labels** to the leaves based on the majority class in those nodes.
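
The steps above can be sketched in a few dozen lines of plain Python. This is a minimal illustration, not how scikit-learn implements trees: it assumes numerical features, binary threshold splits, Gini impurity as the criterion, and a hypothetical toy dataset. The names `gini`, `best_split`, `build`, and `predict` are made up for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature_index, threshold) that most lowers weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue  # a split must send samples to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build(rows, labels, depth=0, max_depth=3):
    """Recursively grow a tree; leaves are class labels, internal nodes are dicts."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:  # pure node, no improving split, or depth limit reached
        return Counter(labels).most_common(1)[0][0]  # majority class
    f, t = split
    left = [(row, y) for row, y in zip(rows, labels) if row[f] <= t]
    right = [(row, y) for row, y in zip(rows, labels) if row[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Hypothetical toy data: [age, owns_car] -> buys the product?
rows = [[22, 0], [25, 1], [31, 0], [35, 1], [42, 0], [50, 1]]
labels = ["no", "no", "yes", "yes", "yes", "yes"]

tree = build(rows, labels)
print(tree)
print(predict(tree, [28, 0]))
```

On this toy data a single split on age at 25 separates the classes, so both children become leaves immediately. Because the sketch only accepts splits that strictly lower impurity, it stops early on patterns where no single split helps.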

### Example:

If you have a dataset predicting whether someone will play tennis based on weather conditions, a decision tree might ask:

* Is it sunny?

  * Yes: Is humidity high?

    * Yes → Don’t play tennis
    * No → Play tennis
  * No: Play tennis
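
Read as code, this tree is just a pair of nested conditions. The function below is a hand-written sketch of the example; the name `play_tennis` and its boolean arguments are illustrative, and a trained model would learn these rules from data instead of having them hard-coded.

```python
def play_tennis(sunny: bool, humidity_high: bool) -> str:
    """Follow the example tree from the root to a leaf."""
    if sunny:                      # root node: "Is it sunny?"
        if humidity_high:          # child node: "Is humidity high?"
            return "Don't play tennis"
        return "Play tennis"
    return "Play tennis"           # not sunny: leaf reached directly


print(play_tennis(sunny=True, humidity_high=True))
print(play_tennis(sunny=False, humidity_high=True))
```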

### Why Decision Trees?

* Easy to understand and interpret.
* Requires little data preprocessing.
* Can handle both numerical and categorical data.
* Handles non-linear relationships naturally.

---
## 2. Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

### What is an impurity measure?

* It quantifies how *impure* or *uncertain* the class distribution is within a node.
* The goal of a split in a decision tree is to reduce impurity, ideally resulting in child nodes that are as *pure* as possible (i.e., containing mostly one class).

### 1. Gini Impurity

Gini Impurity measures the probability of misclassifying a randomly chosen element from the node if it were labeled at random according to the distribution of labels in that node.

$$
\text{Gini} = 1 - \sum_{i=1}^C p_i^2
$$

* $p_i$ = proportion of class $i$ in the node.
* $C$ = number of classes.

**Interpretation:**

* Gini impurity is 0 when all samples belong to one class (pure node).
* It reaches its maximum, $1 - 1/C$, when the classes are equally mixed (0.5 for two classes).
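
A small sketch of the formula, assuming the input is a list of class proportions that sum to 1 (the function name `gini` is just for this example):

```python
def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)


pure = gini([1.0])                        # one class only
two_class_even = gini([0.5, 0.5])         # two classes, equally mixed
three_class_even = gini([1 / 3, 1 / 3, 1 / 3])
skewed = gini([0.9, 0.1])                 # mostly one class

print(pure, two_class_even, three_class_even, skewed)
```

The pure node scores 0, the evenly mixed two-class node scores 0.5, and the skewed node falls in between at about 0.18.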

### 2. Entropy

Entropy comes from information theory and measures the amount of uncertainty or disorder in the node.

$$
\text{Entropy} = - \sum_{i=1}^C p_i \log_2(p_i)
$$

* $p_i$ = proportion of class $i$ in the node, as above. Terms with $p_i = 0$ are treated as contributing 0.

**Interpretation:**

* Entropy is 0 if the node is pure.
* It reaches its maximum, $\log_2 C$, when the classes are equally mixed (1 bit for two classes).
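
The same idea in code, again assuming a list of class proportions that sum to 1. Zero proportions are skipped, following the usual convention that $0 \log_2 0 = 0$.

```python
import math


def entropy(proportions):
    """Shannon entropy in bits; zero proportions contribute nothing."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)


pure = entropy([1.0])
two_class_even = entropy([0.5, 0.5])
four_class_even = entropy([0.25, 0.25, 0.25, 0.25])
skewed = entropy([0.9, 0.1])

print(pure, two_class_even, four_class_even, round(skewed, 3))
```

Entropy and Gini rank these nodes in the same order; they differ mainly in scale, since entropy can exceed 1 when there are more than two classes.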

### How do they impact splits?

* During tree building, the algorithm looks for splits that **reduce impurity the most**.
* The *reduction* is called **Information Gain** (using Entropy) or **Gini Gain** (using Gini Impurity).

**Formula for Information Gain (Entropy):**

$$
\text{Information Gain} = \text{Entropy(before split)} - \sum_{k} \frac{N_k}{N} \times \text{Entropy(after split}_k)
$$

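Here $N$ is the number of samples in the parent node and $N_k$ is the number of samples in child node $k$. Gini Gain is computed the same way, with Gini in place of Entropy.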
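
As a worked example, the sketch below computes both quantities from class counts. The counts are hypothetical: a parent node with 5 samples of each class is split into children holding `[4, 1]` and `[1, 4]`. The helper `impurity_reduction` is an illustrative name, not a library function.

```python
import math


def gini(counts):
    """Gini impurity from per-class sample counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    """Entropy in bits from per-class sample counts."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)


def impurity_reduction(parent, children, measure):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * measure(child) for child in children)
    return measure(parent) - weighted


parent = [5, 5]                 # hypothetical class counts before the split
children = [[4, 1], [1, 4]]     # class counts in each child after the split

info_gain = impurity_reduction(parent, children, entropy)
gini_gain = impurity_reduction(parent, children, gini)
print(f"Information Gain: {info_gain:.4f}")
print(f"Gini Gain: {gini_gain:.4f}")
```

For these counts the Information Gain is about 0.278 bits and the Gini Gain is 0.18. The two numbers are on different scales, so they should only be compared across candidate splits using the same criterion.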

### Choosing splits:

* The algorithm evaluates candidate splits for every feature (for a numerical feature, thresholds between adjacent sorted values).
* For each split, it calculates impurity for the child nodes.
* It picks the split that results in the **highest impurity reduction**.
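
One way to see this selection in practice is to fit a depth-1 tree (a "stump") and inspect the split it chose. The sketch below uses scikit-learn's fitted `tree_` attribute on the Iris data and recomputes the impurity reduction of the root split by hand; it is an illustration of the idea, not a description of scikit-learn's internal search.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# A stump makes exactly one split: the best one found at the root
stump = DecisionTreeClassifier(criterion="gini", max_depth=1, random_state=0)
stump.fit(iris.data, iris.target)

t = stump.tree_
left, right = t.children_left[0], t.children_right[0]
n = t.weighted_n_node_samples

# Parent impurity minus the size-weighted impurity of the two children
reduction = (
    t.impurity[0]
    - (n[left] / n[0]) * t.impurity[left]
    - (n[right] / n[0]) * t.impurity[right]
)

print("Chosen feature:", iris.feature_names[t.feature[0]])
print("Threshold:", t.threshold[0])
print(f"Gini reduction: {reduction:.4f}")
```

On Iris, a petal measurement separates setosa from the other two species perfectly, so one child is pure. Petal length and petal width give equally good root splits here, which is why the chosen feature can depend on `random_state`.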

### Differences and practical notes:

* Gini Impurity is slightly cheaper to compute because it avoids logarithms, and it is the default criterion of scikit-learn's `DecisionTreeClassifier`.
* Entropy is more theoretically grounded in information theory.
* Usually, both lead to very similar trees.

---
## 3. What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each. 

When a decision tree grows fully, it may fit the training data too closely, capturing noise instead of the underlying pattern. This is **overfitting**. Pruning reduces it by limiting the tree's growth or cutting the tree back.

### a. Pre-Pruning (Early Stopping)

* **What it is:** You stop the tree from growing too deep **during** the training process.
* You set conditions (like max depth, minimum samples per leaf, or minimum impurity decrease) to halt splitting early.
* The tree is prevented from becoming too complex.

#### Practical advantage of Pre-Pruning:

* **Faster training time** because the tree is smaller and simpler from the start.
* It can save computation and prevent overfitting early on without needing extra steps.

### b. Post-Pruning (Prune after full growth)

* **What it is:** First, grow a **full tree** (potentially overfitting).
* Then, **cut back** or remove branches that don’t improve performance on a validation set (or by some cost-complexity measure).
* This typically involves evaluating subtree removal to see if it improves generalization.

#### Practical advantage of Post-Pruning:

* **Potentially better accuracy** because you explore the full complexity before simplifying.
* It can find a better balance by removing only the truly unnecessary branches.
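
A minimal sketch of post-pruning with scikit-learn's cost-complexity pruning follows. It grows a full tree, asks for the candidate `ccp_alpha` values, and keeps the alpha that scores best on a held-out validation split. The dataset (`load_breast_cancer`), the 70/30 split, and the tie-breaking rule are choices made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Step 1: grow the full tree
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: get the sequence of alphas at which subtrees would be pruned away
path = full.cost_complexity_pruning_path(X_train, y_train)

# Step 3: refit once per alpha and keep the best validation score
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    score = pruned.score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (simpler tree)
        best_alpha, best_score = alpha, score

final = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha)
final.fit(X_train, y_train)

print("Nodes in full tree:", full.tree_.node_count)
print("Nodes in pruned tree:", final.tree_.node_count)
print(f"Validation accuracy after pruning: {best_score:.3f}")
```

Picking alpha on a single validation split keeps the example short. Cross-validating over `ccp_alpha` is a common alternative when data is limited.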

### Summary:

| Aspect        | Pre-Pruning                     | Post-Pruning                        |
| ------------- | ------------------------------- | ----------------------------------- |
| When pruning? | During tree growth              | After full tree is grown            |
| Advantage     | Faster training, simple model   | Often more accurate and flexible    |
| Risk          | Might stop too early (underfit) | Needs extra validation step, slower |

---
## 4. What is Information Gain in Decision Trees, and why is it important for choosing the best split? 

**Information Gain (IG)** measures how much *uncertainty* (or impurity) is reduced after splitting a dataset based on a particular feature.

* It’s based on the concept of **Entropy** from information theory.
* The idea: a good split should separate data into groups that are as pure (homogeneous) as possible.

### How is Information Gain calculated?

$$
\text{Information Gain} = \text{Entropy(before split)} - \sum_{k=1}^{m} \frac{N_k}{N} \times \text{Entropy(after split}_k)
$$

Where:

* $N$ = total number of samples before split.
* $m$ = number of child nodes (branches) created by the split.
* $N_k$ = number of samples in child node $k$.
* $\text{Entropy(before split)}$ = entropy of the parent node.
* $\text{Entropy(after split}_k)$ = entropy of child node $k$.
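
As a worked example with hypothetical counts, suppose a parent node holds 14 samples (9 positive, 5 negative) and a split produces one child with 8 samples (6 positive, 2 negative) and another with 6 samples (3 positive, 3 negative). Then $N = 14$, $m = 2$, $N_1 = 8$, $N_2 = 6$, and:

$$
\begin{aligned}
\text{Entropy(before split)} &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \\
\text{Entropy(after split}_1) &= -\tfrac{6}{8}\log_2\tfrac{6}{8} - \tfrac{2}{8}\log_2\tfrac{2}{8} \approx 0.811 \\
\text{Entropy(after split}_2) &= -\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6} = 1.000 \\
\text{Information Gain} &\approx 0.940 - \tfrac{8}{14} \times 0.811 - \tfrac{6}{14} \times 1.000 \approx 0.048
\end{aligned}
$$

A gain this small indicates that the split barely reduces uncertainty, so a feature producing it would rank low among the candidates.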

### Why is Information Gain important?

* **Choosing the best split:** The decision tree algorithm evaluates all possible splits on all features and selects the one with the **highest Information Gain**.
* A split with high IG means the data is divided into groups with lower entropy, i.e., the classes are more clearly separated.
* Because each split is chosen to maximize IG at that node, the tree tends to reach pure (or nearly pure) leaves in fewer levels, which keeps it shallower and easier to interpret.

### Intuition:

* Before splitting, you have uncertainty about the class labels.
* After splitting, if the uncertainty is reduced significantly, the split is valuable.
* Information Gain quantifies that reduction.

---
## 5. What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

### Common Real-World Applications of Decision Trees

1. **Healthcare**

   * Diagnosing diseases based on symptoms and patient data.
   * Predicting patient outcomes or risk stratification.

2. **Finance**

   * Credit scoring and risk assessment.
   * Fraud detection by classifying transactions.

3. **Marketing**

   * Customer segmentation.
   * Predicting customer churn or purchase behavior.

4. **Manufacturing**

   * Quality control by classifying defective vs. non-defective products.
   * Predictive maintenance based on sensor data.

5. **Customer Support**

   * Automating decision-making in chatbots.
   * Routing customer queries based on issue type.
   
### Advantages of Decision Trees

* **Easy to interpret and visualize:** Decision rules are straightforward and understandable.
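* **Handles both numerical and categorical data:** Little preprocessing is needed and no feature scaling, although some implementations (including scikit-learn) require categorical features to be encoded as numbers first.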
* **Non-parametric:** No assumptions about the data distribution.
* **Handles multi-class classification and regression tasks.**
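* **Can work with missing data:** Some implementations handle missing values during training (for example through surrogate splits in CART, or natively in recent scikit-learn versions); others need imputation first.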

### Limitations of Decision Trees

* **Overfitting:** Trees can grow very deep and fit noise in training data.
* **Instability:** Small changes in data can lead to very different trees.
* **Bias towards features with more levels:** Features with many categories can dominate splits.
* **Less accurate compared to ensemble methods:** Alone, they might underperform compared to random forests or gradient boosting.
* **Greedy splitting:** The algorithm makes locally optimal splits, which might not lead to the best overall tree.

---
## 6. Write a Python program to: 
> * #### Load the Iris Dataset 
> * #### Train a Decision Tree Classifier using the Gini criterion 
> * #### Print the model’s accuracy and feature importances 

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into train and test sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the Decision Tree Classifier with Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Predict on test data
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Print feature importances
feature_importances = clf.feature_importances_
for name, importance in zip(iris.feature_names, feature_importances):
    print(f"{name}: {importance:.3f}")
```

Output:

```text
Accuracy: 1.00
sepal length (cm): 0.000
sepal width (cm): 0.019
petal length (cm): 0.893
petal width (cm): 0.088
```


---
## 7. Write a Python program to: 
> * #### Load the Iris Dataset 
> * #### Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree. 

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Decision Tree with max_depth=3 (pruned tree)
clf_pruned = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_pruned.fit(X_train, y_train)
y_pred_pruned = clf_pruned.predict(X_test)
accuracy_pruned = accuracy_score(y_test, y_pred_pruned)

# Fully grown Decision Tree (no max_depth)
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

print(f"Accuracy of Decision Tree with max_depth=3: {accuracy_pruned:.2f}")
print(f"Accuracy of fully grown Decision Tree: {accuracy_full:.2f}")
```

Output:

```text
Accuracy of Decision Tree with max_depth=3: 1.00
Accuracy of fully grown Decision Tree: 1.00
```


---
## 8. Write a Python program to: 
> * #### Load the Boston Housing Dataset 
> * #### Train a Decision Tree Regressor 
> * #### Print the Mean Squared Error (MSE) and feature importances 

```python
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import pandas as pd

# Load the Boston Housing dataset from OpenML
boston = fetch_openml(name='boston', version=1, as_frame=True)

# Extract features and target
X = boston.data
y = boston.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Decision Tree Regressor
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)

# Print the MSE
print(f"Mean Squared Error: {mse:.4f}")

# Get feature importances
importances = model.feature_importances_

# Combine feature names and their importances into a DataFrame for better readability
feature_importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

print("\nFeature Importances:")
print(feature_importances)
```

Output:

```text
Mean Squared Error: 10.4161

Feature Importances:
    Feature  Importance
5        RM    0.600326
12    LSTAT    0.193328
7       DIS    0.070688
0      CRIM    0.051296
4       NOX    0.027148
6       AGE    0.013617
9       TAX    0.012464
10  PTRATIO    0.011012
11        B    0.009009
2     INDUS    0.005816
1        ZN    0.003353
8       RAD    0.001941
3      CHAS    0.000002
```


---
## 9.  Write a Python program to: 
> * #### Load the Iris Dataset 
> * #### Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV 
> * #### Print the best parameters and the resulting model accuracy 

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Define parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10]
}

# Setup GridSearchCV
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, n_jobs=-1)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Predict on test set using best estimator
y_pred = grid_search.best_estimator_.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of best model: {accuracy:.2f}")
```

Output:

```text
Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Accuracy of best model: 1.00
```


---
## 10. Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to: 
> * #### Handle the missing values 
> * #### Encode the categorical features 
> * #### Train a Decision Tree model 
> * #### Tune its hyperparameters 
> * #### Evaluate its performance 
>
> #### And describe what business value this model could provide in the real-world setting. 

### 1. **Handling Missing Values**

* **Analyze missingness:** Check which features have missing data and the percentage missing.
* **Decide on imputation strategy:**

  * For **numerical features**: use mean, median, or more advanced methods like KNN imputation.
  * For **categorical features**: use the most frequent category or create a separate category like "Missing".
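* **Apply imputation:** Use tools like `SimpleImputer` or `IterativeImputer` from scikit-learn, fitted on the training data only and then applied unchanged to the test data, so that values are filled consistently and without leakage.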
* **Optional:** Consider if missingness itself is informative and create indicator features if needed.
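
A short sketch of these choices with `SimpleImputer`, on a tiny made-up table. The column names `age` and `smoker` and all values are hypothetical. `add_indicator=True` appends a 0/1 column marking which rows were originally missing.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical patient records with gaps in both columns
df = pd.DataFrame({
    "age": [34.0, np.nan, 58.0, 41.0],
    "smoker": pd.Series(["no", "yes", np.nan, "no"], dtype=object),
})

# Numerical feature: fill with the median and keep a missingness indicator
num_imputer = SimpleImputer(strategy="median", add_indicator=True)
age_imputed = num_imputer.fit_transform(df[["age"]])

# Categorical feature: treat "Missing" as its own category
cat_imputer = SimpleImputer(strategy="constant", fill_value="Missing")
smoker_imputed = cat_imputer.fit_transform(df[["smoker"]])

print(age_imputed)
print(smoker_imputed.ravel())
```

In a real project the imputers would be fitted on the training split and reused on the test split, typically by placing them inside a `Pipeline`.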

### 2. **Encoding Categorical Features**

* Identify categorical features.
* For **ordinal categories** (with a clear order): use **Ordinal Encoding**.
* For **nominal categories** (no order): use **One-Hot Encoding** to create binary features.
* If categories have high cardinality, consider techniques like **Target Encoding** or **Frequency Encoding**.
* Use `ColumnTransformer` in scikit-learn to apply different transformations to numerical and categorical features seamlessly.
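
The sketch below shows one way to combine these encoders with `ColumnTransformer`. The columns (`age`, `severity`, `blood_type`) and their values are hypothetical; `severity` is treated as ordinal with an explicit order, and `blood_type` as nominal.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical patient data with one numerical and two categorical columns
df = pd.DataFrame({
    "age": [34, 52, 58, 41],
    "severity": ["mild", "severe", "moderate", "mild"],
    "blood_type": ["A", "O", "B", "A"],
})

preprocessor = ColumnTransformer(
    transformers=[
        # Ordinal feature: the category order is stated explicitly
        ("ordinal", OrdinalEncoder(categories=[["mild", "moderate", "severe"]]), ["severity"]),
        # Nominal feature: one binary column per category
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["blood_type"]),
    ],
    remainder="passthrough",  # numerical columns pass through unchanged
    sparse_threshold=0,       # always return a dense array
)

encoded = preprocessor.fit_transform(df)
print(encoded)
```

The output columns follow the transformer order: the ordinal code, then the one-hot columns for blood types A, B, and O, then the untouched `age` column. `handle_unknown="ignore"` encodes categories unseen during fitting as all zeros instead of raising an error.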

### 3. **Training a Decision Tree Model**

* Split data into **training** and **testing** sets to evaluate performance fairly.
* Initialize a **Decision Tree Classifier**, e.g., `DecisionTreeClassifier` in scikit-learn.
* Fit the model on the **training data** after preprocessing.
* Decision Trees do not require feature scaling. In scikit-learn, categorical features still have to be encoded as numbers (step 2) before fitting.

### 4. **Tuning Hyperparameters**

* Use **GridSearchCV** or **RandomizedSearchCV** to tune important hyperparameters like:

  * `max_depth`: to control tree complexity.
  * `min_samples_split` and `min_samples_leaf`: to prevent splits and leaves that are supported by too few samples, which tend to overfit.
  * `criterion`: 'gini' or 'entropy'.
* Use cross-validation during tuning to ensure the model generalizes well.
* Select the best hyperparameters based on validation performance (e.g., accuracy, recall, or AUC).
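
As a sketch of how tuning can be wired together with preprocessing, the example below puts an imputer and a tree in one `Pipeline` and tunes it with `GridSearchCV`, scoring by recall. Several things here are assumptions for illustration: scikit-learn's breast cancer dataset stands in for the company's data, about 5% of values are blanked out at random to simulate missingness, and the grid values are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X = data.data.copy()
y = (data.target == 0).astype(int)  # 1 = malignant, used as the positive class

# Synthetic missingness, for illustration only
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Imputation lives inside the pipeline, so it is refitted on each CV training fold
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_leaf": [1, 5, 10],
    "tree__criterion": ["gini", "entropy"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Mean cross-validated recall: {search.best_score_:.3f}")
```

Scoring by recall reflects the concern about missed diagnoses. A different metric, or several metrics with `refit` set to one of them, may suit other priorities.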

### 5. **Evaluating Performance**

* Evaluate the final model on the **test set**.
* Use metrics relevant for healthcare and classification:

  * **Accuracy** (overall correctness).
  * **Precision and Recall**: especially important to minimize false negatives (missing a disease).
  * **F1 Score**: balance between precision and recall.
  * **ROC-AUC**: how well the model discriminates between classes.
* Consider **confusion matrix** analysis to understand types of errors.
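
These metrics can be computed as in the sketch below. It again uses scikit-learn's breast cancer dataset as a stand-in for real patient data, with malignant cases as the positive class; the fixed `max_depth=4` is an arbitrary choice for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X = data.data
y = (data.target == 0).astype(int)  # 1 = malignant (positive class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
recall = tp / (tp + fn)  # share of actual positives that were caught
auc = roc_auc_score(y_test, y_score)

print(cm)
print(classification_report(y_test, y_pred, target_names=["benign", "malignant"]))
print(f"False negatives (missed cases): {fn}")
print(f"ROC-AUC: {auc:.3f}")
```

The false-negative count from the confusion matrix is often the number stakeholders care about most in a screening setting, since it corresponds to patients with the disease whom the model missed.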

### 6. **Business Value in Real-World Setting**

* **Early and accurate disease detection:** Enables timely intervention and treatment, improving patient outcomes.
* **Resource optimization:** Prioritize patients who need urgent care or further testing, saving costs.
* **Personalized care:** Tailor treatment plans based on risk prediction.
* **Reducing hospital readmissions:** Proactively monitor high-risk patients.
* **Supporting doctors’ decision-making:** Provide data-driven insights that augment clinical judgment.
* **Regulatory compliance and reporting:** Automated, consistent predictions can support audits and reporting requirements.

### Summary:

| Step                        | Description                                     | Tools/Techniques                          |
| --------------------------- | ----------------------------------------------- | ----------------------------------------- |
| Handle Missing Values       | Impute with mean/median or create indicators    | `SimpleImputer`, `IterativeImputer`       |
| Encode Categorical Features | One-hot, ordinal, target encoding               | `OneHotEncoder`, `OrdinalEncoder`         |
| Train Decision Tree         | Fit on processed data                           | `DecisionTreeClassifier`                  |
| Tune Hyperparameters        | Grid/random search with cross-validation        | `GridSearchCV`, `RandomizedSearchCV`      |
| Evaluate Performance        | Use metrics like precision, recall, ROC-AUC     | `classification_report`, `roc_auc_score`  |
| Business Value              | Early detection, cost-saving, personalized care | Better patient outcomes, efficiency       |