# 🌳 Decision Tree - In-depth Notes

Decision Trees are a **supervised learning** algorithm used for both **classification and regression** tasks. They recursively split the data into subsets based on feature values, producing a tree in which each path from the root to a leaf is a sequence of decision rules.

## 🔑 1. Key Concepts
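- **Root Node**: The topmost node, holding the first split (the feature test that reduces impurity the most on the full training set)
- **Internal Nodes**: Nodes that test a feature and route samples to their child nodes
- **Leaf Nodes**: Terminal nodes that output the final label (classification) or value (regression)
- **Splitting**: Dividing the dataset at a node based on a feature value
- **Pruning**: Reducing tree size to avoid overfitting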
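
To make these terms concrete, here is a minimal sketch that fits a small tree on the Iris dataset and inspects its nodes. It assumes scikit-learn's fitted `tree_` attribute, where a leaf is marked by a child index of `-1`; the names `clf`, `is_leaf`, `n_leaves`, and `n_internal` are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(data.data, data.target)

tree = clf.tree_                      # low-level array representation of the fitted tree
is_leaf = tree.children_left == -1    # leaves have no children (sentinel value -1)
n_leaves = int(is_leaf.sum())
n_internal = int(tree.node_count - n_leaves)  # the root counts as an internal node here

# Node 0 is always the root; tree.feature holds the feature index tested at each node
root_feature = data.feature_names[tree.feature[0]]
print("Root node splits on:", root_feature, "at threshold", tree.threshold[0])
print("Internal nodes:", n_internal, "| Leaf nodes:", n_leaves)
```

Every non-leaf node stores one feature index and one threshold, which is the "feature test" described in the list above.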

## 🧠 2. Impurity Measures
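- **Gini Impurity**:
$$
G = 1 - \sum_{i=1}^{C} p_i^2
$$
- **Entropy**:
$$
H = - \sum_{i=1}^{C} p_i \log_2(p_i)
$$

Here $p_i$ is the fraction of samples in the node that belong to class $i$, and $C$ is the number of classes (terms with $p_i = 0$ are taken as 0 in the entropy). Both measures are used to evaluate the quality of a split: a lower value means a purer node, and both equal 0 when every sample in the node belongs to the same class.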
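
The two formulas can be checked by hand on a toy node. The sketch below uses only the standard library; the helper names (`class_proportions`, `gini`, `entropy`, `weighted_impurity`) and the toy labels are made up for illustration and are not part of scikit-learn.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Fraction of samples in the node that belong to each class present."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Entropy in bits: -sum(p_i * log2(p_i)). Absent classes never appear in the sum."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))

def weighted_impurity(left, right, measure):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)

# A perfectly mixed parent node and one candidate split of it
parent = ["a"] * 5 + ["b"] * 5
left = ["a"] * 4 + ["b"] * 1
right = ["a"] * 1 + ["b"] * 4

print("Parent Gini:", gini(parent))        # 0.5 for a 50/50 node
print("Parent entropy:", entropy(parent))  # 1.0 bit for a 50/50 node
print("Split Gini:", weighted_impurity(left, right, gini))  # roughly 0.32
```

The split lowers the weighted Gini impurity from 0.5 to about 0.32, so a tree builder comparing candidate splits would prefer it over any split that leaves the children more mixed.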

## ⚙️ 3. Implementing Decision Tree Classifier (Scikit-learn)

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

# Load Data
data = load_iris()
X = data.data
y = data.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train
model = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
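
The introduction notes that trees also handle regression. As a hedged sketch of the same workflow with `DecisionTreeRegressor`, the example below uses scikit-learn's bundled diabetes dataset; the dataset choice and the variable names (`reg`, `Xr_train`, and so on) are assumptions made for illustration and do not come from the original notes.

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Load a small regression dataset (continuous target)
X_reg, y_reg = load_diabetes(return_X_y=True)

# Split
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42
)

# Train: each leaf predicts the mean target of the training samples it contains
reg = DecisionTreeRegressor(max_depth=3, random_state=0)
reg.fit(Xr_train, yr_train)

# Predict and evaluate with regression metrics instead of accuracy
yr_pred = reg.predict(Xr_test)
print("MSE:", mean_squared_error(yr_test, yr_pred))
print("R^2:", r2_score(yr_test, yr_pred))
```

Because a depth-3 tree has at most 8 leaves, the regressor can output at most 8 distinct values, which is why tree predictions look like a step function.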

## 🌳 4. Visualizing the Tree

In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(12,8))
plot_tree(model, filled=True, feature_names=data.feature_names, class_names=data.target_names)
plt.show()

## ✂️ 5. Pruning to Prevent Overfitting
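- **max_depth**: Maximum depth of the tree
- **min_samples_split**: Minimum number of samples a node must contain before it can be split
- **min_samples_leaf**: Minimum number of samples required in each leaf

These are hyperparameters of `DecisionTreeClassifier()`. They limit growth while the tree is being built (pre-pruning); scikit-learn also offers cost-complexity post-pruning through the `ccp_alpha` parameter.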
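
To see the effect of these hyperparameters, the following sketch compares an unconstrained tree with a constrained one on the same Iris split. The specific values (`max_depth=3`, `min_samples_split=10`, `min_samples_leaf=5`) are arbitrary choices for illustration, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unconstrained tree: grows until no further split is possible
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Constrained tree: growth is stopped early by the three hyperparameters
limited = DecisionTreeClassifier(
    max_depth=3,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=0,
).fit(X_train, y_train)

for name, tree_model in [("full", full), ("limited", limited)]:
    print(
        name,
        "| depth:", tree_model.get_depth(),
        "| leaves:", tree_model.get_n_leaves(),
        "| train acc:", round(tree_model.score(X_train, y_train), 3),
        "| test acc:", round(tree_model.score(X_test, y_test), 3),
    )

# Smallest number of training samples that ended up in any leaf of the limited tree
leaf_mask = limited.tree_.children_left == -1
smallest_leaf = int(limited.tree_.n_node_samples[leaf_mask].min())
print("Smallest leaf in limited tree:", smallest_leaf)
```

A large gap between training and test accuracy for the unconstrained tree is the usual sign of overfitting; in practice these values would be chosen with cross-validation rather than fixed by hand.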

## 🔍 6. Feature Importance
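Decision Trees provide a measure of **feature importance** based on how much each feature decreases impurity. In scikit-learn, `feature_importances_` holds the total impurity reduction contributed by each feature across all of its splits, normalized so that the values sum to 1.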

In [None]:
import pandas as pd

importance = pd.Series(model.feature_importances_, index=data.feature_names)
importance.sort_values(ascending=False)
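
As a small sanity check on how these scores behave, the sketch below verifies two properties on a freshly fitted tree: the importances sum to 1, and a feature that is never used in a split gets an importance of exactly 0. The names `clf`, `used`, and `unused` are illustrative; leaves are identified by a negative entry in `tree_.feature`.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

importances = clf.feature_importances_
total = float(importances.sum())

# Feature indices that appear in at least one split (leaves carry a negative marker)
split_features = clf.tree_.feature
used = {int(f) for f in split_features[split_features >= 0]}
unused = [i for i in range(len(iris.feature_names)) if i not in used]

print("Sum of importances:", total)
for i in unused:
    print("Never used in a split:", iris.feature_names[i], "-> importance", importances[i])
```

A score of 0 therefore means the tree never split on that feature, not necessarily that the feature carries no information about the target.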

## 📌 Summary
- Used for both regression and classification
- Easy to visualize and interpret
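- Prone to overfitting (limit growth with `max_depth`, `min_samples_split`, and `min_samples_leaf`)
- Scikit-learn covers training, visualization, and feature importance in a few lines of code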