# Task 1: Iris Species Classification with Scikit-learn

**Objective:** Build and evaluate multiple classification models to predict iris species.

**Dataset:** Iris Species Dataset (150 samples, 4 features, 3 classes)

**Approach:**
1. Data Loading and Exploration
2. Data Preprocessing
3. Model Training (Multiple Algorithms)
4. Hyperparameter Tuning
5. Model Evaluation and Comparison
6. Feature Importance Analysis

## 1. Import Libraries

In [None]:
# Data manipulation and analysis
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning - preprocessing and model selection
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline

# Machine learning - models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Machine learning - evaluation
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, auc
)

# Dataset
from sklearn.datasets import load_iris

# Model persistence
import joblib

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("Libraries imported successfully!")

## 2. Load and Explore Data

In [None]:
# Load the Iris dataset
iris = load_iris()

# Create a DataFrame for easier manipulation
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = iris.target
df['species_name'] = df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
df.head()

In [None]:
# Dataset information
print("Dataset Info:")
df.info()  # info() prints directly and returns None, so print() would add a stray "None"

print("\n" + "="*50)
print("Statistical Summary:")
print("="*50)
df.describe()

In [None]:
# Check for missing values
print("Missing Values:")
print(df.isnull().sum())

# Check class distribution
print("\nClass Distribution:")
print(df['species_name'].value_counts())
print("\nClass Balance: The dataset is perfectly balanced with 50 samples per class.")
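
The balance statement above is hard-coded in the print call. A minimal sketch of deriving it from the data instead, assuming only the bundled `load_iris` dataset (the names `per_class` and `is_balanced` are illustrative, not part of the notebook):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Count samples per class directly from the integer targets
counts = np.bincount(iris.target)
per_class = {str(name): int(n) for name, n in zip(iris.target_names, counts)}

# The dataset is balanced when every class has the same count
is_balanced = len(set(per_class.values())) == 1

print(per_class)
print("Balanced:", is_balanced)
```

Computing the check keeps the message correct if the notebook is later pointed at a different dataset.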

## 3. Exploratory Data Analysis (EDA)

In [None]:
# Correlation heatmap
plt.figure(figsize=(10, 8))
correlation = df.iloc[:, :-2].corr()  # Exclude species columns
sns.heatmap(correlation, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1, fmt='.2f')
plt.title('Feature Correlation Heatmap', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

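# Read the value from the matrix instead of hard-coding it
petal_corr = correlation.loc['petal length (cm)', 'petal width (cm)']
print(f"Observation: Petal length and petal width are highly correlated (r = {petal_corr:.2f})")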

In [None]:
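# Pairplot to visualize relationships between features.
# pairplot creates its own figure, so a preceding plt.figure() would only leave an empty one.
# The integer 'species' column is dropped so the target is not plotted as a feature.
sns.pairplot(df.drop(columns='species'), hue='species_name', markers=['o', 's', 'D'],
             palette='Set2', diag_kind='kde', plot_kws={'alpha': 0.6})
plt.suptitle('Pairwise Feature Relationships by Species', y=1.02, fontsize=14, fontweight='bold')
plt.show()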

print("Observation: Setosa is linearly separable, while Versicolor and Virginica have some overlap")
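
The separability observation can be checked numerically rather than read off the plot. The sketch below is one possible check, not the notebook's method: it tests whether a single threshold on petal length isolates setosa. The variable names are illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Column 2 is petal length (cm); class 0 is setosa
petal_length = X[:, 2]
setosa_max = petal_length[y == 0].max()
others_min = petal_length[y != 0].min()

# If the two ranges do not overlap, any threshold between them separates setosa
separable = bool(setosa_max < others_min)
threshold = (setosa_max + others_min) / 2

print(f"setosa max = {setosa_max}, others min = {others_min}, separable = {separable}")
```

A one-feature threshold is a stronger condition than linear separability in four dimensions, so passing this check is sufficient but not necessary.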

In [None]:
# Box plots for each feature by species
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
features = iris.feature_names

for idx, feature in enumerate(features):
    row, col = idx // 2, idx % 2
    sns.boxplot(data=df, x='species_name', y=feature, ax=axes[row, col], palette='Set3')
    axes[row, col].set_title(f'{feature.title()} by Species', fontweight='bold')
    axes[row, col].set_xlabel('Species')
    axes[row, col].set_ylabel(feature)

plt.tight_layout()
plt.show()

print("Observation: Petal measurements show better separation between species than sepal measurements")

## 4. Data Preprocessing

In [None]:
# Separate features and target
X = df.iloc[:, :-2].values  # All feature columns
y = df['species'].values     # Target variable

print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)

# Split data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("\nTraining set size:", X_train.shape[0])
print("Testing set size:", X_test.shape[0])

# Feature scaling - important for SVM and KNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\nData preprocessing completed!")
print("Training features have been standardized (mean=0, std=1); the test set reuses the training statistics")
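
An alternative to managing `X_train_scaled` and `X_test_scaled` by hand is to bundle the scaler and the estimator in a `Pipeline`, so scaling is always fit on whatever data the model is fit on. This is a minimal sketch under that assumption; the step names `"scaler"` and `"clf"` are arbitrary labels chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler on the training data only,
# then applies the same transform automatically at predict time
svm_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", SVC(random_state=42)),
])
svm_pipeline.fit(X_train, y_train)

test_accuracy = svm_pipeline.score(X_test, y_test)
print(f"Pipeline test accuracy: {test_accuracy:.4f}")
```

Because the pipeline accepts raw features, callers never need to remember which models expect scaled input.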

## 5. Model Training and Comparison

We'll train and compare multiple classification algorithms:
- Decision Tree
- Random Forest
- Support Vector Machine (SVM)
- Logistic Regression
- K-Nearest Neighbors (KNN)

In [None]:
# Initialize models
models = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM': SVC(random_state=42, probability=True),
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=200),
    'KNN': KNeighborsClassifier()
}

# Train and evaluate each model
results = {}

for name, model in models.items():
    # Train the model (use scaled data for SVM, LR, and KNN)
    if name in ['SVM', 'Logistic Regression', 'KNN']:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    
    # Store results
    results[name] = {
        'model': model,
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'predictions': y_pred
    }
    
    print(f"{name}:")
    print(f"  Accuracy: {accuracy:.4f}")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall: {recall:.4f}")
    print(f"  F1-Score: {f1:.4f}")
    print()

In [None]:
# Visualize model comparison
metrics_df = pd.DataFrame({
    'Model': list(results.keys()),
    'Accuracy': [results[m]['accuracy'] for m in results.keys()],
    'Precision': [results[m]['precision'] for m in results.keys()],
    'Recall': [results[m]['recall'] for m in results.keys()],
    'F1-Score': [results[m]['f1_score'] for m in results.keys()]
})

# Create bar plot
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(metrics_df))
width = 0.2

ax.bar(x - 1.5*width, metrics_df['Accuracy'], width, label='Accuracy', color='#3498db')
ax.bar(x - 0.5*width, metrics_df['Precision'], width, label='Precision', color='#2ecc71')
ax.bar(x + 0.5*width, metrics_df['Recall'], width, label='Recall', color='#e74c3c')
ax.bar(x + 1.5*width, metrics_df['F1-Score'], width, label='F1-Score', color='#f39c12')

ax.set_xlabel('Models', fontsize=12, fontweight='bold')
ax.set_ylabel('Score', fontsize=12, fontweight='bold')
ax.set_title('Model Performance Comparison', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(metrics_df['Model'], rotation=45, ha='right')
ax.legend()
ax.set_ylim([0.9, 1.01])
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\nBest performing model based on accuracy:")
best_model = metrics_df.loc[metrics_df['Accuracy'].idxmax(), 'Model']
best_accuracy = metrics_df['Accuracy'].max()
print(f"{best_model}: {best_accuracy:.4f}")

## 6. Cross-Validation for Robust Evaluation

In [None]:
# Perform 5-fold cross-validation for each model
cv_results = {}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("Cross-Validation Results (5-fold):")
print("="*60)

for name, model in models.items():
    # Scale inside the CV loop for scale-sensitive models, so each fold's scaler
    # is fit only on that fold's training portion (X_train_scaled would leak fold statistics)
    if name in ['SVM', 'Logistic Regression', 'KNN']:
        estimator = Pipeline([('scaler', StandardScaler()), ('model', model)])
    else:
        estimator = model
    scores = cross_val_score(estimator, X_train, y_train, cv=cv, scoring='accuracy')
    
    cv_results[name] = scores
    print(f"{name}:")
    print(f"  Mean CV Score: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
    print(f"  Individual Folds: {[f'{s:.4f}' for s in scores]}")
    print()
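
The loop above scores accuracy only. If precision, recall, and F1 are also wanted per fold, `cross_validate` accepts a list of scorer names. The sketch below assumes a scaled logistic regression and, for brevity, uses the full dataset rather than the notebook's training split.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Scaling lives inside the pipeline, so it is refit on each training fold
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

scoring = ["accuracy", "precision_weighted", "recall_weighted", "f1_weighted"]
cv_scores = cross_validate(clf, X, y, cv=cv, scoring=scoring)

# Results are keyed as "test_<scorer name>", one value per fold
for metric in scoring:
    fold_scores = cv_scores[f"test_{metric}"]
    print(f"{metric}: {fold_scores.mean():.4f} (std {fold_scores.std():.4f})")
```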

In [None]:
# Visualize cross-validation results
fig, ax = plt.subplots(figsize=(12, 6))

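model_names = list(cv_results.keys())
bp = ax.boxplot([cv_results[m] for m in model_names],
                patch_artist=True,
                showmeans=True)
# Set tick labels separately: boxplot's `labels` argument was renamed `tick_labels` in Matplotlib 3.9
ax.set_xticklabels(model_names)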

# Color the boxes
colors = ['#3498db', '#2ecc71', '#e74c3c', '#f39c12', '#9b59b6']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.6)

ax.set_xlabel('Models', fontsize=12, fontweight='bold')
ax.set_ylabel('Cross-Validation Accuracy', fontsize=12, fontweight='bold')
ax.set_title('5-Fold Cross-Validation Comparison', fontsize=14, fontweight='bold')
ax.grid(axis='y', alpha=0.3)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("The boxplot shows the distribution of accuracy across 5 folds for each model")

## 7. Hyperparameter Tuning for Best Model

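Random Forest is tuned here with GridSearchCV. The grid below has 3 × 4 × 3 × 3 × 2 = 216 parameter combinations, so 5-fold cross-validation runs 1,080 fits before the final refit.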

In [None]:
# Define parameter grid for Random Forest
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

# Initialize Random Forest
rf = RandomForestClassifier(random_state=42)

# Perform GridSearchCV
print("Performing GridSearchCV for Random Forest...")
print("This may take a few moments...\n")

grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=cv,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print("\nBest Parameters:")
print(grid_search.best_params_)
print(f"\nBest Cross-Validation Score: {grid_search.best_score_:.4f}")

# Evaluate on test set
best_rf = grid_search.best_estimator_
y_pred_best = best_rf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred_best)

print(f"Test Set Accuracy: {test_accuracy:.4f}")
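
Before launching a grid search it can help to count how many fits it will trigger. The sketch below reproduces the grid from the cell above and counts candidates with `ParameterGrid`; the variable names `n_candidates` and `total_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy'],
}

n_candidates = len(ParameterGrid(param_grid))  # product of the list lengths
n_folds = 5
total_fits = n_candidates * n_folds  # GridSearchCV adds one more fit when refit=True

print(f"{n_candidates} candidates x {n_folds} folds = {total_fits} fits")
```

On a 120-sample training set this is cheap, but the count grows multiplicatively with every added parameter value.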

## 8. Detailed Evaluation of Best Model

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_best)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=iris.target_names,
            yticklabels=iris.target_names,
            cbar_kws={'label': 'Count'})
plt.title('Confusion Matrix - Optimized Random Forest', fontsize=14, fontweight='bold')
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.tight_layout()
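import os
fig_path = '../reports/figures/iris_confusion_matrix.png'
os.makedirs(os.path.dirname(fig_path), exist_ok=True)  # create the output directory if it is missing
plt.savefig(fig_path, dpi=300, bbox_inches='tight')
plt.show()

print(f"Confusion Matrix saved to: {fig_path}")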

In [None]:
# Classification Report
print("Classification Report - Optimized Random Forest")
print("="*60)
print(classification_report(y_test, y_pred_best, target_names=iris.target_names))

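# Per-class recall: the fraction of each true class that was predicted correctly
# (accuracy restricted to the samples of one true class is that class's recall)
print("\nPer-Class Recall:")
for i, species in enumerate(iris.target_names):
    class_mask = y_test == i
    class_recall = accuracy_score(y_test[class_mask], y_pred_best[class_mask])
    print(f"  {species.capitalize()}: {class_recall:.4f}")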
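
The per-class figure computed in the loop above is each class's recall, and it can also be read straight off the confusion matrix: the diagonal divided by the row sums. The labels below are hypothetical values chosen for illustration, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_hat = np.array([0, 0, 0, 1, 1, 2, 2, 2, 1, 2])

cm = confusion_matrix(y_true, y_hat)

# Rows are true classes, so diagonal / row sum = correctly predicted share of each class
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(per_class_recall)
print(recall_score(y_true, y_hat, average=None))  # same values from scikit-learn
```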

## 9. Feature Importance Analysis

In [None]:
# Extract feature importances from the best Random Forest model
feature_importances = best_rf.feature_importances_
feature_names = iris.feature_names

# Create a DataFrame for better visualization
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importances
}).sort_values('Importance', ascending=False)

print("Feature Importances:")
print(importance_df.to_string(index=False))

# Visualize feature importances
plt.figure(figsize=(10, 6))
colors = ['#e74c3c' if x == importance_df['Importance'].max() else '#3498db' 
          for x in importance_df['Importance']]
bars = plt.barh(importance_df['Feature'], importance_df['Importance'], color=colors, alpha=0.7)
plt.xlabel('Importance Score', fontsize=12, fontweight='bold')
plt.ylabel('Features', fontsize=12, fontweight='bold')
plt.title('Feature Importance - Random Forest Classifier', fontsize=14, fontweight='bold')
plt.grid(axis='x', alpha=0.3)

# Add value labels on bars
for bar in bars:
    bar_width = bar.get_width()
    plt.text(bar_width, bar.get_y() + bar.get_height()/2, 
             f'{bar_width:.3f}', ha='left', va='center', fontsize=10)

plt.tight_layout()
plt.show()

print(f"\nMost important feature: {importance_df.iloc[0]['Feature']}")
print("This aligns with our EDA findings that petal measurements are more discriminative!")
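
`feature_importances_` is impurity-based and computed from the training data. Permutation importance on held-out data is a common complementary check; it is a different technique from the one used above, sketched here with an untuned Random Forest under the same split settings as the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Shuffle one feature at a time on the test set and measure the drop in accuracy
perm = permutation_importance(rf, X_test, y_test, n_repeats=20, random_state=42)

for idx in perm.importances_mean.argsort()[::-1]:
    print(f"{iris.feature_names[idx]:<20} "
          f"{perm.importances_mean[idx]:.3f} +/- {perm.importances_std[idx]:.3f}")
```

Because petal length and petal width are strongly correlated, shuffling one of them can understate its importance, since the model can still lean on the other.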

## 10. Save the Best Model

In [None]:
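# Save the trained model and scaler.
# Note: the Random Forest was trained on unscaled features and expects unscaled input;
# the scaler is only needed for the scale-sensitive models (SVM, Logistic Regression, KNN).
import os
model_path = '../models/iris_model.pkl'
scaler_path = '../models/iris_scaler.pkl'
os.makedirs(os.path.dirname(model_path), exist_ok=True)  # create the output directory if it is missing

joblib.dump(best_rf, model_path)
joblib.dump(scaler, scaler_path)

print(f"Model saved to: {model_path}")
print(f"Scaler saved to: {scaler_path}")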

# Verify model can be loaded
loaded_model = joblib.load(model_path)
test_prediction = loaded_model.predict(X_test[:5])
print("\nModel loaded successfully!")
print(f"Test prediction on first 5 samples: {test_prediction}")
print(f"Actual labels: {y_test[:5]}")
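
For models that do need scaling, persisting the scaler and the estimator as one pipeline avoids having to keep two files in sync. The sketch below assumes a KNN pipeline and writes to a temporary directory; the file name `iris_pipeline.pkl` is hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaler and classifier are fit and saved together as a single object
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "iris_pipeline.pkl"  # hypothetical file name
    joblib.dump(pipeline, path)
    reloaded = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions on raw features
same_predictions = np.array_equal(pipeline.predict(X), reloaded.predict(X))
print("Round trip preserved predictions:", same_predictions)
```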

## 11. Summary and Conclusions

### Key Findings:

1. **Dataset Characteristics:**
   - Balanced dataset with 50 samples per class
   - No missing values
   - Petal measurements are more discriminative than sepal measurements
   - High correlation between petal length and petal width (0.96)

2. **Model Performance:**
   - All models achieved >95% accuracy on the test set
   - Random Forest and SVM performed best
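   - After hyperparameter tuning, Random Forest achieved near-perfect accuracy
   - The test set holds only 30 samples, so each misclassification changes accuracy by about 3.3 percentage points; small gaps between models should not be over-interpreted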

3. **Feature Importance:**
   - Petal width is the most important feature
   - Petal length is the second most important
   - Sepal measurements contribute less to classification

4. **Model Selection:**
   - **Decision Tree:** Simple, interpretable, but prone to overfitting
   - **Random Forest:** Best overall performance, robust, handles non-linearity well
   - **SVM:** Excellent performance, good for small datasets
   - **Logistic Regression:** Fast, simple, good baseline
   - **KNN:** Simple, no training phase, sensitive to feature scaling
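
When comparing accuracies measured on a 30-sample test set, an interval around each score gives a sense of how much of the difference could be noise. The sketch below uses the Wilson score interval with hypothetical counts of correct predictions; `wilson_interval` is an illustrative helper, not a scikit-learn function.

```python
import math


def wilson_interval(n_correct, n_total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    p = n_correct / n_total
    denom = 1 + z**2 / n_total
    center = (p + z**2 / (2 * n_total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n_total + z**2 / (4 * n_total**2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts of correct predictions out of 30 test samples
for n_correct in (28, 29, 30):
    low, high = wilson_interval(n_correct, 30)
    print(f"{n_correct}/30 correct: accuracy {n_correct / 30:.4f}, "
          f"interval [{low:.4f}, {high:.4f}]")
```

The intervals for neighboring counts overlap heavily, which is why cross-validation scores are the more reliable basis for ranking the models.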

### Deliverables Completed:
- ✅ Data preprocessing (no missing values in Iris dataset)
- ✅ Decision tree classifier trained and evaluated
- ✅ Multiple models compared (5 algorithms)
- ✅ Accuracy, precision, and recall calculated for all models
- ✅ Hyperparameter tuning performed on best model
- ✅ Feature importance analysis
- ✅ Model saved for future use

### Why Scikit-learn?
- **Easy to use:** Consistent API across all algorithms
- **Well-documented:** Extensive documentation and examples
- **Efficient:** Optimized implementations of classical ML algorithms
- **Comprehensive:** Includes preprocessing, model selection, and evaluation tools
- **Community support:** Large user base and active development

This task demonstrates proficiency in classical machine learning using Scikit-learn, including proper data preprocessing, model selection, evaluation, and interpretation of results.