# Iris Flower Classification (Advanced)

This notebook demonstrates classification of iris flower species using several widely used machine learning models. We use the classic `Iris.csv` dataset and cover:

- Data exploration and visualization
- Data preprocessing
- Training multiple models: Logistic Regression, Decision Tree, Random Forest, Support Vector Machine (SVM)
- Model evaluation and comparison
- Explanation of key classification concepts
- Guidance on interpreting results

## Classification Concepts

- **Classification**: Predicting a categorical label (here: iris species) from input features.
- **Training/Test Split**: We split our data so the model can be trained and then tested on unseen data, which helps estimate real-world performance.
- **Accuracy**: Percentage of correct predictions on the test set.
- **Confusion Matrix**: Shows how often each class is correctly or incorrectly predicted.
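- **Precision, Recall, F1-score**: Per-class quality measures. Precision is the fraction of predictions for a class that are correct, recall is the fraction of a class's true samples that the model finds, and the F1-score is the harmonic mean of the two. All three range from 0 to 1, and higher is better.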
- **Cross-validation**: Evaluates the model on several different train/test splits and averages the scores, which gives a more robust estimate than a single split.
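
To make the idea of "multiple train/test splits" concrete, here is a minimal sketch of how k-fold cross-validation partitions sample indices. The helper `k_fold_indices` is a hypothetical name introduced for illustration only; it is not part of scikit-learn and is not used elsewhere in this notebook.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds.

    Hypothetical helper for illustration; real projects would normally
    rely on scikit-learn's splitters instead.
    """
    # Spread any remainder so that fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        # Everything outside the current test fold is used for training
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train_idx, test_idx) in enumerate(folds):
    print(f'Fold {i}: train={train_idx} test={test_idx}')
```

Every sample appears in exactly one test fold, so each one is used for evaluation once. This sketch does not shuffle or stratify; when rows are sorted by class, contiguous folds like these would be unrepresentative, which is why scikit-learn uses stratified folds for classifiers when `cv` is given as an integer.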

---

In [1]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

%matplotlib inline

## 1. Load and Explore Data

In [2]:
df = pd.read_csv('Iris.csv')
df.head()

In [3]:
df.info()

In [4]:
df['Species'].value_counts()

The dataset is well-balanced with 50 samples per species.

In [5]:
# Check for missing values
df.isnull().sum()

No missing values are found.

## 2. Data Visualization

In [6]:
# Pairplot to visualize feature distributions by species
sns.pairplot(df.drop('Id', axis=1), hue='Species', diag_kind='hist')
plt.suptitle('Pairplot of Iris Features by Species', y=1.02)
plt.show()

In [7]:
# Feature correlation heatmap
plt.figure(figsize=(8,6))
sns.heatmap(df.drop(['Id', 'Species'], axis=1).corr(), annot=True, cmap='Blues')
plt.title('Feature Correlation Heatmap')
plt.show()

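### Observations
- Petal length and petal width separate the species clearly in the pairplot, making them strong predictors. Setosa in particular forms its own cluster.
- Sepal measurements separate the classes less cleanly; versicolor and virginica overlap substantially.

---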

## 3. Preprocessing
- Select features and target
- Encode target labels
- Split data
- Scale features

In [8]:
# Features and target
X = df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
y = df['Species']

# Encode target labels
le = LabelEncoder()
y_encoded = le.fit_transform(y)
le.classes_
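
As a small illustration of what label encoding does, the sketch below encodes a short hand-written list of species strings and decodes them again. The example strings are assumptions about how the labels are spelled in `Iris.csv`; the behavior shown (classes sorted alphabetically, integer codes assigned in that order) is standard `LabelEncoder` behavior.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label spellings, used only for this illustration
species = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa']

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ holds the sorted unique labels; a label's position is its integer code
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps integer codes back to the original strings
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Keeping the fitted encoder around (as `le` in the cell above) is what lets later cells turn integer predictions back into readable species names.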

In [9]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f'Training samples: {X_train.shape[0]}, Test samples: {X_test.shape[0]}')
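
An alternative to scaling by hand is to bundle the scaler and the model into a single pipeline, so the scaler is always fit on training data only. The sketch below is one possible way to do this, not the approach used in the rest of this notebook. It assumes scikit-learn's bundled copy of the iris data (`load_iris`) in place of `Iris.csv`, and the `_demo` variable names are introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris data stands in for Iris.csv
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# fit() scales using training statistics only, then trains the classifier
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipe.fit(X_tr, y_tr)

# score() applies the already-fitted scaler to the test set before predicting
test_accuracy = pipe.score(X_te, y_te)
print(f'Test accuracy: {test_accuracy:.3f}')
```

With this design there is no separate `X_train_scaled` / `X_test_scaled` pair to keep in sync, which removes one common source of data leakage.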

## 4. Train and Evaluate Multiple Models

In [10]:
# Initialize models
models = {
    'Logistic Regression': LogisticRegression(max_iter=200),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM (RBF kernel)': SVC(kernel='rbf', probability=True, random_state=42)
}

# Train and evaluate
results = {}
for name, model in models.items():
    if 'SVM' in name or 'Logistic' in name:
        # Use scaled features for SVM and Logistic Regression
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
        X_eval = X_test_scaled
    else:
        # Tree-based models don't need scaling
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        X_eval = X_test
    acc = accuracy_score(y_test, y_pred)
    results[name] = {'model': model, 'accuracy': acc, 'y_pred': y_pred, 'X_eval': X_eval}

## 5. Compare Model Accuracies

In [11]:
for name, res in results.items():
    print(f'{name}: Accuracy = {res["accuracy"]:.3f}')

### All models achieve high accuracy (>0.93) on this well-separated dataset.
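- SVM (with an RBF kernel) and Random Forest can model non-linear decision boundaries, so they often do well when classes are not linearly separable.
- Logistic Regression is simple and interpretable, and a single Decision Tree is easy to explain. Random Forests and SVMs are usually more robust, at some cost in interpretability.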
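
To illustrate what "interpretable" means for Logistic Regression, the sketch below prints the learned weight of each feature for each class. It assumes scikit-learn's bundled iris data (`load_iris`) rather than `Iris.csv`, and it standardizes the full dataset because the goal here is inspection, not evaluation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris data stands in for Iris.csv
iris_demo = load_iris()

# Standardize so coefficient magnitudes are comparable across features
X_std = StandardScaler().fit_transform(iris_demo.data)
logreg = LogisticRegression(max_iter=200).fit(X_std, iris_demo.target)

# coef_ has one row per class and one column per feature
for class_name, weights in zip(iris_demo.target_names, logreg.coef_):
    formatted = ', '.join(
        f'{feat}: {w:+.2f}' for feat, w in zip(iris_demo.feature_names, weights))
    print(f'{class_name:>10} -> {formatted}')
```

A large positive weight means that higher values of that (standardized) feature push the prediction toward the class, and a large negative weight pushes it away.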

## 6. Detailed Evaluation: Confusion Matrix and Classification Report
Let's examine the model with the highest test accuracy. If several models tie, `max` returns the first of them in the `models` dictionary.

In [12]:
# Find best model
best_name = max(results, key=lambda k: results[k]['accuracy'])
best_res = results[best_name]
print(f'Best Model: {best_name} (Accuracy: {best_res["accuracy"]:.3f})')

# Confusion matrix
cm = confusion_matrix(y_test, best_res['y_pred'])
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Greens', xticklabels=le.classes_, yticklabels=le.classes_)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title(f'Confusion Matrix: {best_name}')
plt.show()

In [13]:
# Classification report
print(classification_report(y_test, best_res['y_pred'], target_names=le.classes_))

### Guidance on Results Interpretation
- **Accuracy**: Percentage of total correct predictions. High accuracy here means the model rarely misclassifies the species.
- **Confusion Matrix**: Diagonal values (top-left to bottom-right) are correctly classified samples for each class. Off-diagonal values show misclassifications.
- **Precision**: Of all predicted instances of a class, how many were correct?
- **Recall**: Of all actual instances of a class, how many did we correctly predict?
- **F1-score**: Harmonic mean of precision and recall. Best if close to 1.
- **Interpretation**: All classes are well separated; metrics close to 1.0 indicate excellent performance.
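
The definitions above can be checked by hand on a small confusion matrix. The matrix below is hypothetical, made up for this illustration, and is not output from this notebook. It follows the same layout as the heatmap above: rows are actual classes, columns are predicted classes.

```python
# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted)
cm_demo = [
    [10, 0, 0],
    [0, 9, 1],
    [0, 2, 8],
]

n_classes = len(cm_demo)
total = sum(sum(row) for row in cm_demo)

# Accuracy: correctly classified samples (the diagonal) over all samples
accuracy_demo = sum(cm_demo[i][i] for i in range(n_classes)) / total
print(f'accuracy = {accuracy_demo:.3f}')

per_class = {}
for i in range(n_classes):
    tp = cm_demo[i][i]
    predicted_i = sum(cm_demo[r][i] for r in range(n_classes))  # column sum
    actual_i = sum(cm_demo[i])                                  # row sum
    precision = tp / predicted_i if predicted_i else 0.0
    recall = tp / actual_i if actual_i else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    per_class[i] = (precision, recall, f1)
    print(f'class {i}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}')
```

Precision divides by a column sum (everything predicted as the class), while recall divides by a row sum (everything that truly belongs to the class), which is why the two can differ for the same class.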

> On real-world data or less-separated datasets, you'd want to check for class imbalance, overfitting, and generalization using cross-validation.

## 7. Visualize Test Predictions
Scatter plot of the petal features for the test set, colored by predicted species.

In [14]:
# Plot the unscaled test features, colored by the predicted (not the true) labels
plt.figure(figsize=(8,6))
scatter = plt.scatter(X_test['PetalLengthCm'], X_test['PetalWidthCm'], c=best_res['y_pred'], cmap='viridis', s=60, edgecolor='k')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.title(f'Test Set Predictions: {best_name}')
plt.legend(handles=scatter.legend_elements()[0], labels=le.classes_)
plt.show()

## 8. Cross-Validation (Bonus: Robust Performance Estimate)
Cross-validation helps estimate how well the model will generalize to unseen data.

In [15]:
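# Scale inside each fold via a pipeline so the held-out fold never influences the scaler
from sklearn.pipeline import make_pipeline
cv_scores = cross_val_score(make_pipeline(StandardScaler(), best_res['model']), X, y_encoded, cv=5)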
print(f'Cross-Validation Accuracy (mean ± std): {cv_scores.mean():.3f} ± {cv_scores.std():.3f}')
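
Cross-validation can also be used to compare all four models on equal footing rather than only the single best one. The sketch below is one way to do that under stated assumptions: it uses scikit-learn's bundled iris data (`load_iris`) in place of `Iris.csv`, and it uses an explicit `StratifiedKFold` with shuffling, which is a choice made here and not something the notebook above does.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled iris data stands in for Iris.csv
X_all, y_all = load_iris(return_X_y=True)

# Scale-sensitive models are wrapped in pipelines; tree-based models are used directly
candidates = {
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=200)),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM (RBF kernel)': make_pipeline(StandardScaler(), SVC(kernel='rbf')),
}

# The same shuffled, stratified folds are reused for every model
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_summary = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_all, y_all, cv=cv)
    cv_summary[name] = (scores.mean(), scores.std())
    print(f'{name}: {scores.mean():.3f} ± {scores.std():.3f}')
```

Because every model sees the same folds, differences in the mean scores reflect the models and not the luck of a particular split.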

## 9. Summary

- Used multiple models (Logistic Regression, Decision Tree, Random Forest, SVM)
- All models performed extremely well due to clear class separation in the data
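- SVM and Random Forest can capture non-linear decision boundaries, which matters more on complex datasets
- Classification metrics confirm very high model quality
- Cross-validation gives a more robust estimate of generalization than a single train/test split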

### Next Steps
- Try hyperparameter tuning
- Explore feature importance (Random Forest)
- Test on new data
- Share your notebook and results on GitHub!
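
Two of the next steps above, hyperparameter tuning and feature importance, can be sketched as follows. This is a starting point under stated assumptions, not a recommended configuration: it uses scikit-learn's bundled iris data (`load_iris`) in place of `Iris.csv`, and the parameter grid values are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled iris data stands in for Iris.csv
iris_data = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_data.data, iris_data.target,
    test_size=0.2, random_state=42, stratify=iris_data.target)

# Hyperparameter tuning: search over C and gamma for an RBF SVM.
# The grid values below are illustrative, not tuned recommendations.
svm_pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC(kernel='rbf'))])
param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': ['scale', 0.1, 1]}
search = GridSearchCV(svm_pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)
print('Best parameters:', search.best_params_)
print(f'Best CV accuracy: {search.best_score_:.3f}')
print(f'Held-out test accuracy: {search.score(X_te, y_te):.3f}')

# Feature importance: impurity-based importances from a Random Forest
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
importances = dict(zip(iris_data.feature_names, forest.feature_importances_))
for feature, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{feature}: {value:.3f}')
```

The grid search only ever sees the training split, so the final test accuracy remains an untouched estimate. The importances sum to 1 and describe how much each feature contributed to the forest's splits.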

---
## References
- [Scikit-learn Supervised Learning User Guide](https://scikit-learn.org/stable/supervised_learning.html)
- [Iris Dataset Info](https://en.wikipedia.org/wiki/Iris_flower_data_set)
---
### Happy Data Science!