## Decision Tree | **Assignment**

### 1. What is a Decision Tree, and how does it work in the context of classification?

A Decision Tree is a supervised learning algorithm used for classification and regression tasks. In classification, it predicts which category a given data point belongs to by learning decision rules from the data features.

#### How It Works (Step-by-Step for Classification)
- Root Node
    - Starts with the entire dataset.
    - Chooses the best feature to split on using impurity metrics like Gini or Entropy.
- Splitting Criteria
    - At each node, the algorithm selects the feature and threshold that maximize class separation (e.g., highest Information Gain or largest Gini reduction).
- Branching
    - Data is partitioned into subsets.
    - Each subset forms a new branch with more specific conditions.
- Leaf Node
    - Splitting stops when:
        - All samples belong to the same class.
        - No remaining split reduces impurity (or the reduction falls below a threshold).
        - Predefined constraints are reached (e.g., max depth).
    - The final prediction is the majority class of the training samples in that leaf.
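
The steps above can be sketched in plain Python. This is a minimal illustration, not how any particular library implements it: the helper names (`gini`, `best_split`, `build_tree`, `predict`), the toy data, and the choice of observed feature values as thresholds are all assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature, threshold) that lowers weighted Gini the most, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for feature in range(len(rows[0])):
        # Simplification: candidate thresholds are the observed feature values.
        for threshold in sorted({r[feature] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:  # keep only splits that strictly reduce impurity
                best, best_score = (feature, threshold), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively split until the node is pure or max_depth is reached."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        # Leaf node: predict the majority class.
        return Counter(labels).most_common(1)[0][0]
    feature, threshold = split
    left_idx = [i for i, r in enumerate(rows) if r[feature] <= threshold]
    right_idx = [i for i, r in enumerate(rows) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left_idx],
                           [labels[i] for i in left_idx], depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right_idx],
                            [labels[i] for i in right_idx], depth + 1, max_depth),
    }

def predict(node, row):
    """Follow the decision rules from the root down to a leaf."""
    while isinstance(node, dict):
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node

# Hypothetical toy dataset: two numeric features, two classes.
rows = [[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [6.0, 1.0], [7.0, 3.0], [8.0, 2.0]]
labels = ["A", "A", "A", "B", "B", "B"]

tree = build_tree(rows, labels)
print(tree)
print(predict(tree, [2.5, 6.0]), predict(tree, [6.5, 2.0]))
```

On this toy data a single split on the first feature already separates the classes, so both children become leaves immediately.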


### 2. Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

**Gini Impurity** measures the probability that a randomly chosen sample from the node would be incorrectly classified if it were labeled randomly according to the distribution of labels in that node.<br>
Formula: $Gini=1-\sum_{i=1}^{N}p_i^2$<br>
where $p_i$ is the probability of class $i$ in the node and $N$ is the number of classes.<br>
Example: A node has:
- 60% Class A
- 40% Class B

Gini = $1-\left(0.6^2+0.4^2\right)=0.48$<br>

Entropy comes from information theory. It measures the amount of uncertainty or disorder at a node.<br>
Formula:		$Entropy=-\sum_{i=1}^{N}p_i\cdot\log_2{\left(p_i\right)}$
<br>Example: Same distribution:<br>
Entropy=$-\left(0.6\cdot\log_2{\left(0.6\right)}+0.4\cdot\log_2{\left(0.4\right)}\right)\approx0.97$<br>
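Impact on Tree Splits<br>
Both metrics are used to evaluate which feature and threshold split the node most effectively:
- Lower weighted impurity in the child nodes after the split -> better split

The algorithm evaluates the candidate splits, computes the resulting impurity, and selects the one that maximizes purity in the child nodes. In practice the two measures usually choose similar splits; Gini is slightly cheaper to compute because it avoids the logarithm.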
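
As a quick check of the two worked examples, the sketch below recomputes both measures for the 60/40 node and then compares two candidate splits by weighted child impurity. The function names and the two candidate splits are hypothetical, chosen only for illustration.

```python
import math

def gini_impurity(probs):
    """Gini = 1 - sum(p_i^2) over the class probabilities."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Entropy = -sum(p_i * log2(p_i)), skipping classes with p_i = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def weighted_impurity(children, measure):
    """children: list of (n_samples, class_probabilities) pairs."""
    total = sum(n for n, _ in children)
    return sum(n / total * measure(probs) for n, probs in children)

node = [0.6, 0.4]
gini_node = gini_impurity(node)   # 1 - (0.36 + 0.16) = 0.48
entropy_node = entropy(node)      # approximately 0.971

# Two hypothetical ways to split a 10-sample node holding 6 A and 4 B:
split_a = [(5, [1.0, 0.0]), (5, [0.2, 0.8])]  # left: 5 A; right: 1 A, 4 B
split_b = [(5, [0.6, 0.4]), (5, [0.6, 0.4])]  # children mirror the parent

for name, split in [("split_a", split_a), ("split_b", split_b)]:
    print(name,
          f"gini={weighted_impurity(split, gini_impurity):.3f}",
          f"entropy={weighted_impurity(split, entropy):.3f}")
```

Both measures rank `split_a` ahead of `split_b`, since `split_b` leaves the class mix unchanged in each child.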

### 3. What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

#### Pre-Pruning vs Post-Pruning

| Feature           | **Pre-Pruning**                             | **Post-Pruning**                            |
|-------------------|---------------------------------------------|---------------------------------------------|
| Timing            | Happens *while* building the tree           | Happens *after* the tree is fully grown     |
| Method           | Stops splitting if conditions fail          | Removes subtrees based on validation checks |
| Criteria Used    | Max depth, min samples, minimum impurity decrease | Validation accuracy, cost-complexity trade-off |
| Outcome          | May prevent overfitting early               | Cleans up a complex model post hoc          |


#### Practical Advantages

- **Pre-Pruning Advantage**  
  *Speeds up training time* — By halting unnecessary splits early, it reduces computational cost, especially useful for large datasets or real-time systems.

- **Post-Pruning Advantage**  
  *Improves generalization* — By trimming branches that don’t boost validation accuracy, it sharpens the model’s ability to perform well on unseen data.
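
A minimal sketch of both approaches in scikit-learn, assuming the Iris data used elsewhere in this assignment. Pre-pruning is expressed through constructor limits such as `max_depth` and `min_samples_leaf`; post-pruning uses scikit-learn's minimal cost-complexity pruning via `ccp_alpha`. The specific values (`max_depth=2`, `min_samples_leaf=5`, `ccp_alpha=0.02`) are arbitrary illustrations; in practice `ccp_alpha` would normally be chosen with a validation set or cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Baseline: fully grown tree with no pruning.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruning: stop growing early through constructor constraints.
pre = DecisionTreeClassifier(
    max_depth=2, min_samples_leaf=5, random_state=0
).fit(X_train, y_train)

# Post-pruning: grow fully, then remove weak subtrees (cost-complexity pruning).
post = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name:12} nodes={model.tree_.node_count:3d} "
          f"depth={model.get_depth()} "
          f"test accuracy={model.score(X_test, y_test):.2f}")
```

Comparing node counts and depth shows how much each strategy simplifies the tree relative to the fully grown baseline.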

### 4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?

**Information Gain (IG)** measures the reduction in entropy when a dataset is split on a particular feature.

Entropy quantifies uncertainty — so IG tells us how *much that uncertainty drops* when we make a split.

#### Formula:
```
Information Gain = Entropy(parent) - ∑ (Weighted Entropy(children))
```
We compute the total entropy before the split, then subtract the weighted sum of the entropies after splitting the data.

#### Why Is It Important?

Because it is the score used to pick the **best feature to split on**: the candidate split with the highest IG is chosen.

- If IG is high → the split creates purer child nodes → great candidate for a split!
- If IG is low → the feature doesn't reduce uncertainty much → not ideal.

In short: **Higher Information Gain = More confident and meaningful split**.

### 5. What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

### Common Real-World Applications of Decision Trees

#### Healthcare
- **Disease diagnosis**: Classify patients based on symptoms, lab results, and history.
- **Triage systems**: Prioritize emergency cases using decision paths.
- **Treatment recommendations**: Suggest therapies based on patient profiles.

#### Business & Finance
- **Loan approval**: Assess creditworthiness using income, credit score, and history.
- **Fraud detection**: Flag suspicious transactions by learning patterns.
- **Investment decisions**: Evaluate risk profiles and market conditions.

#### Retail & Marketing
- **Customer segmentation**: Group users by behavior, demographics, or purchase history.
- **Churn prediction**: Identify customers likely to leave and trigger retention strategies.
- **Product recommendation**: Suggest items based on decision paths from past purchases.

#### Education
- **Student performance prediction**: Use attendance, grades, and engagement to forecast outcomes.
- **Adaptive learning systems**: Tailor content based on learner responses.

#### Engineering & Operations
- **Quality control**: Classify defective vs. non-defective items.
- **Maintenance scheduling**: Predict equipment failure based on usage and sensor data.



#### Advantages of Decision Trees:
- **Interpretability** | Easy to visualize and explain to non-technical stakeholders.
- **Handles mixed data types** | Works with both categorical and numerical features.
- **Minimal preprocessing** | No need for scaling or normalization.
- **Captures non-linear relationships** | Splits can model complex decision boundaries.
- **Feature importance** | Highlights which variables drive decisions.
- **Robust to missing values** | Some implementations can handle gaps without imputation (e.g., via surrogate splits); others require imputing first.



#### Limitations of Decision Trees:
- **Overfitting** | Deep trees may memorize training data, hurting generalization.
- **Instability** | Small data changes can lead to very different trees.
- **Bias toward dominant features** | Features with many levels may dominate splits.
- **Greedy splitting** | Locally optimal splits may not yield the best global structure.
- **Poor extrapolation** | Regression trees predict piecewise-constant values, so they cannot follow a trend beyond the range of the training data.
- **Computational cost** | Large datasets can lead to deep, complex trees.
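
The extrapolation limitation can be seen with a tiny, hypothetical regression problem: the tree is trained on the linear trend y = 2x for x from 0 to 9 and then asked about x = 100. This is an illustrative sketch, not a benchmark.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical training data following a perfectly linear trend y = 2x.
X_train = np.arange(10, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

reg = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

inside = reg.predict([[4.0]])[0]     # within the training range
outside = reg.predict([[100.0]])[0]  # far beyond the training range

print(f"Prediction at x=4:   {inside}")
print(f"Prediction at x=100: {outside} (true trend value would be 200)")
```

Because every prediction is the mean target of some training leaf, the output for x = 100 cannot exceed the largest target seen during training.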

### 6. Write a Python program to:
- Load the Iris Dataset
- Train a Decision Tree Classifier using the Gini criterion
- Print the model’s accuracy and feature importances

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
x = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=4
)

# Train a Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=4)
clf.fit(x_train, y_train)

# Predict and print accuracy
y_pred = clf.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Print feature importances
print("\nFeature Importances:")
for feature, importance in zip(feature_names, clf.feature_importances_):
    print(f"{feature:20}: {importance:.3f}")
```

Output:

```
Model Accuracy: 0.97

Feature Importances:
sepal length (cm)   : 0.017
sepal width (cm)    : 0.000
petal length (cm)   : 0.517
petal width (cm)    : 0.467
```


### 7. Write a Python program to:
- Load the Iris Dataset
- Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
iris = load_iris()
x, y = iris.data, iris.target

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=1
)

# Fully-grown tree (no max_depth constraint)
full_tree = DecisionTreeClassifier(criterion='gini', random_state=1)
full_tree.fit(x_train, y_train)
full_pred = full_tree.predict(x_test)
full_accuracy = accuracy_score(y_test, full_pred)

# Limited-depth tree (max_depth=3)
shallow_tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=1)
shallow_tree.fit(x_train, y_train)
shallow_pred = shallow_tree.predict(x_test)
shallow_accuracy = accuracy_score(y_test, shallow_pred)

# Output comparison
print(f"Fully-grown Tree Accuracy: {full_accuracy:.2f}")
print(f"Shallow Tree Accuracy (max_depth=3): {shallow_accuracy:.2f}")
```

Output:

```
Fully-grown Tree Accuracy: 0.97
Shallow Tree Accuracy (max_depth=3): 0.97
```


### 8. Write a Python program to:
- Load the Boston Housing Dataset
- Train a Decision Tree Regressor
- Print the Mean Squared Error (MSE) and feature importances

```python
# The Boston Housing dataset was removed from scikit-learn (version 1.2),
# so this model uses the California Housing dataset instead.
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target
feature_names = housing.feature_names

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict and calculate MSE
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Feature Importances
print("\nFeature Importances:")
for name, importance in zip(feature_names, regressor.feature_importances_):
    print(f"{name:20}: {importance:.3f}")
```

Output:

```
Mean Squared Error: 0.50

Feature Importances:
MedInc              : 0.529
HouseAge            : 0.052
AveRooms            : 0.053
AveBedrms           : 0.029
Population          : 0.031
AveOccup            : 0.131
Latitude            : 0.094
Longitude           : 0.083
```


### 9. Write a Python program to:
- Load the Iris Dataset
- Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
- Print the best parameters and the resulting model accuracy

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Create a Decision Tree Classifier
dtree = DecisionTreeClassifier(random_state=1)

# Define the parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, 6],
    'min_samples_split': [2, 4, 6, 8]
}

# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(
    estimator=dtree,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Test set accuracy using best estimator
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Set Accuracy: {accuracy:.2f}")
```

Output:

```
Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Test Set Accuracy: 0.97
```


### 10. Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to: 
- Handle the missing values 
- Encode the categorical features 
- Train a Decision Tree model 
- Tune its hyperparameters 
- Evaluate its performance

And describe what business value this model could provide in the real-world setting.


#### 1. Handle Missing Values
- **Identify missing values** using `df.isnull().sum()` for a clear picture.
- **Numerical features:**
  - Use **mean/median imputation** for continuous data.
  - For skewed distributions, prefer `median` to avoid distortion.
- **Categorical features:**
  - Replace missing entries with the **mode** or a distinct label like `"Missing"`.
- Consider using **`SimpleImputer`** from `sklearn.impute` for streamlined preprocessing.
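
A minimal sketch of these imputation steps, assuming a small hypothetical patient table (the column names and values are invented for illustration). Median imputation is used for the numeric columns and a constant `"Missing"` label for the categorical one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical patient records with gaps in every column.
df = pd.DataFrame({
    "age": [34.0, np.nan, 51.0, 29.0],
    "blood_pressure": [120.0, 135.0, np.nan, 400.0],  # 400 is an outlier
    "smoker": ["yes", np.nan, "no", "no"],
})
print(df.isnull().sum())  # missing-value count per column

num_cols = ["age", "blood_pressure"]
cat_cols = ["smoker"]

# Median is robust to the outlier in blood_pressure.
num_imputer = SimpleImputer(strategy="median")
df[num_cols] = num_imputer.fit_transform(df[num_cols])

# Keep "missing" as its own category instead of guessing a value.
cat_imputer = SimpleImputer(strategy="constant", fill_value="Missing")
df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

print(df)
```

In a real project the imputers would be fitted on the training split only and then applied to the test split, so that test-set statistics do not leak into training.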

---

#### 2. Encode Categorical Features
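- **Ordinal (Label) Encoding**: For ordinal features with a natural order, e.g. `OrdinalEncoder` with an explicit category order (`LabelEncoder` is intended for target labels).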
- **One-Hot Encoding**: For nominal features (non-ordered), using `pd.get_dummies()` or `OneHotEncoder`.
- If cardinality is high, consider **target encoding** or **hash encoding** to prevent feature explosion.
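
The two basic encodings can be sketched as follows. The feature names (`severity`, `blood_type`) and their values are hypothetical; the ordinal feature is encoded with `OrdinalEncoder` and an explicit category order, and the nominal feature with `OneHotEncoder`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordinal feature: the order mild < moderate < severe carries meaning.
severity = np.array([["mild"], ["severe"], ["moderate"], ["mild"]])
ordinal = OrdinalEncoder(categories=[["mild", "moderate", "severe"]])
severity_encoded = ordinal.fit_transform(severity)
print(severity_encoded.ravel())  # mild -> 0, moderate -> 1, severe -> 2

# Nominal feature: no order, so each category gets its own 0/1 column.
blood_type = np.array([["A"], ["O"], ["B"], ["A"]])
onehot = OneHotEncoder(handle_unknown="ignore")  # unseen categories become all zeros
blood_type_encoded = onehot.fit_transform(blood_type).toarray()
print(onehot.categories_[0])  # column order: A, B, O
print(blood_type_encoded)
```

Setting `handle_unknown="ignore"` is one reasonable choice for production data, where categories unseen during training may appear later.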

---

#### 3. Train a Decision Tree Model
```python
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
```
- No need to scale features, which is a bonus with tree-based models.
- Decision trees naturally handle mixed data types after encoding.

---

#### 4. Tune Hyperparameters
Use **GridSearchCV** or **RandomizedSearchCV** to optimize parameters like:
- `max_depth`: Controls overfitting.
- `min_samples_split` and `min_samples_leaf`: Improve generalization.
- `criterion`: Try both `'gini'` and `'entropy'`.
- `class_weight`: Important if your dataset is imbalanced.

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, None],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
```

---

#### 5. Evaluate Performance
Use a mix of metrics:
- **Accuracy**: Good for balanced datasets.
- **Precision/Recall/F1 Score**: Crucial for disease prediction where false negatives matter.
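- **Confusion Matrix** and **ROC curve / ROC AUC**: The confusion matrix and ROC curve give visual insight into the error types; ROC AUC summarizes ranking quality in a single number.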
```python
from sklearn.metrics import classification_report, roc_auc_score

print(classification_report(y_test, best_model.predict(X_test)))
print("ROC AUC:", roc_auc_score(y_test, best_model.predict_proba(X_test)[:,1]))
```
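
To make the link between the confusion matrix and precision/recall concrete, here is a small plain-Python sketch. The labels and predictions are hypothetical (1 = disease present), and the helper name `confusion_counts` is illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

# Hypothetical results for 10 patients: 4 sick, 6 healthy.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In this example accuracy is 0.70 while recall is only 0.50: half of the sick patients are missed, which is exactly the kind of failure that accuracy alone hides.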

---

#### Business Value in Healthcare
A well-tuned model like this could:
- **Enable early detection** and intervention for at-risk patients.
- Support **personalized treatment plans**.
- Help hospitals **prioritize resources** and manage patient loads.
- Provide **predictive insights** for chronic condition management.
- Reduce operational costs by **automating screening** and triage.

By making disease prediction more proactive, the company could increase patient survival rates and reduce long-term care expenses.