## Decision Tree Classification 

In this notebook, we explore **Decision Trees** using the classic Iris flower dataset.

The goal is to classify a flower into one of three species based on its physical measurements.

### Load the Dataset

The Iris dataset contains:
- Sepal length
- Sepal width
- Petal length
- Petal width
- Species (target label)


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("../datasets/iris.csv")
df.head()

### Features and Target

In [None]:
X = df.drop(columns=["species"])
y = df['species']

### Visualizing Feature Relationships

Before training a model, it is useful to visualize the data.

These scatter plots show how different flower species are distributed based on:
- Sepal measurements
- Petal measurements

This helps us understand whether the classes are separable.

In [None]:
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
for species in y.unique():
    subset = df[df["species"] == species]
    plt.scatter(subset["sepal_length"], subset["sepal_width"], label=species)
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.title("Sepal Features")
plt.legend()

plt.subplot(1, 2, 2)
for species in y.unique():
    subset = df[df["species"] == species]
    plt.scatter(subset["petal_length"], subset["petal_width"], label=species)
plt.xlabel("Petal Length")
plt.ylabel("Petal Width")
plt.title("Petal Features")
plt.legend()

plt.show()

### Train–Test Split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

### Decision Tree Classifier

A Decision Tree learns a series of rules to split the data based on feature values.
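
To make the idea of a "rule" concrete, here is a minimal pure-Python sketch of how one split could be chosen on a single feature. It scores each candidate threshold by the weighted Gini impurity of the two resulting groups (Gini is scikit-learn's default criterion). The helper names `gini` and `best_threshold` and the six sample values are made up for illustration; this is not scikit-learn's actual implementation.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_impurity) for the best rule `value <= threshold`."""
    n = len(values)
    distinct = sorted(set(values))
    best = (None, gini(labels))  # baseline: no split at all
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical petal lengths (cm) for illustration; not rows read from iris.csv.
petal_length = [1.4, 1.5, 1.3, 4.5, 4.7, 5.1]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "virginica"]

threshold, impurity = best_threshold(petal_length, species)
print(threshold, round(impurity, 3))
```

On this toy data the chosen rule is `petal_length <= 3.0`, which isolates all three setosa samples in one group. A real tree repeats this search over every feature and then recurses into each group.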

The `max_depth` parameter limits how deep the tree can grow, helping control overfitting.

In [None]:
from sklearn.tree import DecisionTreeClassifier

# random_state makes tie-breaking between equally good splits reproducible
model = DecisionTreeClassifier(max_depth=5, random_state=42)

model.fit(X_train, y_train)
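
To see what `max_depth` changes in practice, the sketch below fits trees with several depth limits and records train and test accuracy for each. It assumes scikit-learn's bundled copy of Iris (`load_iris`) as a stand-in for `../datasets/iris.csv`, and the variable names (`X_demo`, `results`, and so on) are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled Iris data stands in for ../datasets/iris.csv.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

results = {}
# None removes the limit: the tree grows until leaves are pure
# or cannot be split further.
for depth in (1, 2, 3, 5, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    results[depth] = {
        "actual_depth": tree.get_depth(),
        "train_accuracy": tree.score(X_tr, y_tr),
        "test_accuracy": tree.score(X_te, y_te),
    }

for depth, row in results.items():
    print(depth, row)
```

A large gap between train and test accuracy at greater depths would suggest overfitting. On a dataset this small the gap may be slight, so treat the numbers as illustrative rather than conclusive.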

### Model Performance

We evaluate the model using accuracy on the test data.

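Very high accuracy is common on Iris because the classes are well separated, especially by the petal measurements. The test set is also small (20% of the rows), so a single misclassified flower shifts the score noticeably.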
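
Accuracy is the fraction of test samples whose predicted label matches the true label. As a minimal sketch with made-up labels (not output from the model above), it can be computed by hand along with a count of which classes get confused. The helper names `accuracy` and `confusion_counts` are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count each (true_label, predicted_label) pair."""
    return Counter(zip(y_true, y_pred))


# Hypothetical labels for illustration only.
y_true_demo = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
y_pred_demo = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

acc = accuracy(y_true_demo, y_pred_demo)
# Keep only the pairs where the prediction was wrong.
errors = {
    pair: n
    for pair, n in confusion_counts(y_true_demo, y_pred_demo).items()
    if pair[0] != pair[1]
}

print(round(acc, 3))
print(errors)
```

Here five of six predictions are correct, and the single error is a versicolor flower predicted as virginica. A breakdown like this shows which classes are confused, something one accuracy number hides.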

In [None]:
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

### Visualizing the Tree

Decision Trees are easy to interpret because we can visualize the learned rules.

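Each internal node shows:
- The feature and threshold used for the decision
- The impurity, sample count, and class distribution at that node

Leaf nodes show the class the tree predicts for samples that reach them.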
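
The learned rules can also be printed as plain text with `sklearn.tree.export_text`, which is handy when a figure is hard to read. The sketch below assumes scikit-learn's bundled Iris data and a small illustrative tree (`demo_tree`, `max_depth=2`) rather than the `model` trained in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: the bundled Iris data stands in for ../datasets/iris.csv.
iris = load_iris()

# A shallow tree keeps the printed rule list short.
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(iris.data, iris.target)

# Each indented line is one decision; lines starting with "class:" are leaves.
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading from the top, each level of indentation corresponds to one step down the tree, so a path from the first line to a `class:` line is one complete rule.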

In [None]:
from sklearn.tree import plot_tree

feature_names = X.columns.tolist()
# model.classes_ holds the labels in the order the fitted tree uses them
class_names = [str(c) for c in model.classes_]

plt.figure(figsize=(18, 8))
plot_tree(
    model,
    feature_names=feature_names,
    class_names=class_names,
    filled=False,     # white background
    rounded=True
)
plt.show()


### Testing with New Data

In [None]:
test_samples = pd.DataFrame([
    {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2},
    {"sepal_length": 6.7, "sepal_width": 3.0, "petal_length": 5.2, "petal_width": 2.3}
])

predictions = model.predict(test_samples)
predictions