# Task 1: Classical ML with Scikit-learn - Iris Species Classification

In this notebook, we'll work with the Iris Species dataset to:
1. Preprocess the data (check for missing values; the labels from `load_iris` are already integer-encoded)
2. Train a decision tree classifier to predict iris species
3. Evaluate using accuracy, precision, and recall

## About the Dataset
The Iris dataset is one of the best-known datasets in machine learning. It contains measurements of 150 iris flowers, 50 from each of three species:
- Setosa
- Versicolor
- Virginica

Each flower has four features measured:
- Sepal length (cm)
- Sepal width (cm)
- Petal length (cm)
- Petal width (cm)

## 1. Import Required Libraries

First, let's import all the libraries we'll need for our analysis and modeling.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report
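import warnings
# Silence only FutureWarning noise from the plotting libraries; keep other warnings visible
warnings.filterwarnings('ignore', category=FutureWarning)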

# Set random seed for reproducibility
np.random.seed(42)

# Set up the plotting style (the 'seaborn-v0_8' name requires Matplotlib 3.6 or newer)
plt.style.use('seaborn-v0_8')

## 2. Load and Explore the Data

Let's load the Iris dataset and explore its characteristics to better understand what we're working with.

In [None]:
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

# Create a pandas DataFrame for easier manipulation
df = pd.DataFrame(X, columns=feature_names)
df['species'] = target_names[y]  # map integer labels to species names

# Display basic information
print(f"Dataset shape: {df.shape}")
print(f"Features: {feature_names}")
print(f"Target classes: {target_names}")
print("\nFirst 5 rows:")
df.head()
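
`load_iris` already returns the species as integers, so no label encoding is needed in this notebook. If the species arrived as strings instead (an assumption, for example when reading the data from a CSV file), a minimal sketch of the encoding step with `LabelEncoder` might look like this:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

iris = load_iris()

# Pretend the labels arrived as strings (illustrative assumption)
species = pd.Series(iris.target_names[iris.target], name="species")

# LabelEncoder assigns integers to the sorted unique class names
encoder = LabelEncoder()
encoded = encoder.fit_transform(species)

print("Classes:", list(encoder.classes_))
print("First five encoded labels:", encoded[:5])

# inverse_transform maps the integers back to the original names
decoded = encoder.inverse_transform(encoded)
print("First five decoded labels:", list(decoded[:5]))
```

Because the species names happen to sort in the same order that `load_iris` uses, the encoded values here match `iris.target`.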

In [None]:
# Basic statistics
df.describe()

In [None]:
# Check for missing values
print("Missing values in dataset:")
df.isnull().sum()

In [None]:
# Class distribution
print("Class distribution:")
class_dist = df['species'].value_counts()
print(class_dist)

# Visualize class distribution
plt.figure(figsize=(8, 6))
class_dist.plot(kind='bar', color='skyblue')
plt.title('Class Distribution of Iris Species')
plt.xlabel('Species')
plt.ylabel('Count')
plt.grid(axis='y', alpha=0.3)
plt.show()

## 3. Data Visualization

Visualizing the data will help us understand the relationships between features and classes.

In [None]:
# Pairwise scatter plots
sns.pairplot(df, hue='species', markers=['o', 's', 'D'], palette='Set1')
plt.suptitle('Pairplot of Iris Dataset', y=1.02, fontsize=16)
plt.show()

In [None]:
# Correlation heatmap
plt.figure(figsize=(10, 8))
correlation_matrix = df[feature_names].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, fmt=".2f")
plt.title('Feature Correlation Matrix')
plt.show()

In [None]:
# Box plots for each feature by species
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
axes = axes.flatten()

for i, feature in enumerate(feature_names):
    sns.boxplot(data=df, x='species', y=feature, ax=axes[i])
    axes[i].set_title(f'{feature} by Species')
    axes[i].set_xlabel('Species')
    axes[i].set_ylabel(feature)
    axes[i].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## 4. Data Preprocessing

Now we'll prepare the data for modeling:
1. Check for missing values (none are expected, so no imputation is needed)
2. Split the data into stratified training (70%) and testing (30%) sets

In [None]:
# Check for missing values (Iris dataset typically has none)
print(f"Missing values in features: {np.isnan(X).sum()}")
print(f"Missing values in target: {np.isnan(y).sum()}")

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"\nTraining set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
print(f"Feature dimensions: {X_train.shape[1]}")

# Check if classes are balanced in both train and test sets
print("\nTraining set class distribution:")
unique_train, counts_train = np.unique(y_train, return_counts=True)
for species, count in zip(target_names[unique_train], counts_train):
    print(f"- {species}: {count}")

print("\nTesting set class distribution:")
unique_test, counts_test = np.unique(y_test, return_counts=True)
for species, count in zip(target_names[unique_test], counts_test):
    print(f"- {species}: {count}")
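
The check above finds no missing values, so there is nothing to handle here. As a hedged sketch of what handling would look like, the cell below blanks out ten sepal-length values on a copy of the data (an artificial assumption, purely for illustration) and fills them with `SimpleImputer`. The imputer is fitted on the training split only, so no statistics leak from the test set.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Artificially remove ten sepal-length values (illustration only)
rng = np.random.default_rng(42)
X_missing = X.copy()
rows = rng.choice(X.shape[0], size=10, replace=False)
X_missing[rows, 0] = np.nan
print(f"Missing values before imputation: {np.isnan(X_missing).sum()}")

# Split first, then fit the imputer on the training data only
X_train, X_test, y_train, y_test = train_test_split(
    X_missing, y, test_size=0.3, random_state=42, stratify=y
)
imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)

print(f"Missing values after imputation (train): {np.isnan(X_train_imp).sum()}")
print(f"Missing values after imputation (test): {np.isnan(X_test_imp).sum()}")
```

The median is one reasonable default for numeric features; other strategies such as `"mean"` or `"most_frequent"` can be swapped in.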

## 5. Train the Decision Tree Classifier

Now we'll train a decision tree classifier on the training split. Decision trees split on per-feature thresholds, so the features don't need to be scaled first.

In [None]:
# Initialize the Decision Tree Classifier, capping the depth to limit overfitting
model = DecisionTreeClassifier(random_state=42, max_depth=5)

# Train the model
print("Training Decision Tree Classifier...")
model.fit(X_train, y_train)
print("Training complete!")

# Display model parameters
print("\nModel parameters:")
print(f"- Max depth: {model.max_depth}")
print(f"- Min samples split: {model.min_samples_split}")
print(f"- Min samples leaf: {model.min_samples_leaf}")
print(f"- Random state: {model.random_state}")
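
To see what the depth limit does, one option is to train the same tree at several depths and compare training and test accuracy. The sketch below assumes the same split settings as the preprocessing cell; the list of candidate depths is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same split settings as the preprocessing cell
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Candidate depths are illustrative; None means the tree grows without a limit
results = {}
for depth in [1, 2, 3, 5, None]:
    tree = DecisionTreeClassifier(random_state=42, max_depth=depth)
    tree.fit(X_train, y_train)
    results[depth] = {
        "train": tree.score(X_train, y_train),
        "test": tree.score(X_test, y_test),
        "actual_depth": tree.get_depth(),
    }
    print(f"max_depth={depth}: train={results[depth]['train']:.4f}, "
          f"test={results[depth]['test']:.4f}, "
          f"actual depth={results[depth]['actual_depth']}")
```

A widening gap between training and test accuracy as the depth grows is the usual sign of overfitting. A depth-1 tree can only separate two groups, so it cannot do well on three classes.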

In [None]:
# Feature importance
feature_importance = model.feature_importances_
print("\nFeature Importance:")
for name, importance in zip(feature_names, feature_importance):
    print(f"- {name}: {importance:.4f}")

# Visualize feature importance
plt.figure(figsize=(10, 6))
sorted_idx = np.argsort(feature_importance)
plt.barh(range(len(sorted_idx)), feature_importance[sorted_idx])
plt.yticks(range(len(sorted_idx)), [feature_names[i] for i in sorted_idx])
plt.title('Feature Importance in Decision Tree')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()
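
`feature_importances_` is impurity-based and computed from the training data. As a cross-check, permutation importance measures how much the test accuracy drops when one feature's values are shuffled. The sketch below retrains the same model so it runs on its own; `n_repeats=20` is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)
model = DecisionTreeClassifier(random_state=42, max_depth=5).fit(X_train, y_train)

# Shuffle each feature n_repeats times on the test set and record the accuracy drop
perm = permutation_importance(
    model, X_test, y_test, n_repeats=20, random_state=42
)

for name, gini, mean, std in zip(
    iris.feature_names,
    model.feature_importances_,
    perm.importances_mean,
    perm.importances_std,
):
    print(f"- {name}: impurity={gini:.4f}, permutation={mean:.4f} (+/- {std:.4f})")
```

When the two measures broadly agree on the ranking, that is some reassurance that the ranking is not an artifact of how the tree was grown.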

## 6. Visualize the Decision Tree

Let's visualize the actual decision tree to understand how it makes predictions.

In [None]:
# Visualize the decision tree
plt.figure(figsize=(20, 12))
# Plain lists for the names are accepted across scikit-learn versions
plot_tree(
    model,
    feature_names=list(feature_names),
    class_names=list(target_names),
    filled=True,
    rounded=True,
    fontsize=10,
)
plt.title('Decision Tree for Iris Species Classification', fontsize=18)
plt.show()
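
The plot can be hard to read at small sizes. As an alternative view, `export_text` prints the same rules as indented text. This is a minimal sketch that retrains the model with the same settings so it can run independently:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)
model = DecisionTreeClassifier(random_state=42, max_depth=5).fit(X_train, y_train)

# Each indented line is a threshold test; leaves show the predicted class index
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
```

The class indices in the leaves correspond to positions in `iris.target_names`.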

## 7. Evaluate the Model

Now we'll evaluate the model using various metrics:
- Accuracy
- Precision and Recall
- Confusion Matrix
- Detailed Classification Report

In [None]:
# Make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate accuracy
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Testing Accuracy: {test_accuracy:.4f}")

# Detailed classification metrics
print("\nDetailed Classification Report (Test Set):")
print(classification_report(y_test, y_test_pred, target_names=target_names))

In [None]:
# Precision and Recall for each class
precision = precision_score(y_test, y_test_pred, average=None)
recall = recall_score(y_test, y_test_pred, average=None)

print("Per-class Metrics:")
for i, class_name in enumerate(target_names):
    print(f"- {class_name}:")
    print(f"  * Precision: {precision[i]:.4f}")
    print(f"  * Recall: {recall[i]:.4f}")

# Overall averages
avg_precision = precision_score(y_test, y_test_pred, average='weighted')
avg_recall = recall_score(y_test, y_test_pred, average='weighted')

print("\nWeighted Averages:")
print(f"- Precision: {avg_precision:.4f}")
print(f"- Recall: {avg_recall:.4f}")

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_test_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
           xticklabels=target_names, yticklabels=target_names)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
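
Precision and recall can be read directly off the confusion matrix: for each class, precision is the diagonal entry divided by its column sum (everything predicted as that class), and recall is the diagonal entry divided by its row sum (everything that truly is that class). The sketch below recomputes both by hand and compares them with scikit-learn's values; it retrains the model with the same settings so it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_score, recall_score
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)
model = DecisionTreeClassifier(random_state=42, max_depth=5).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
true_positives = np.diag(cm)

# Assumes every class is predicted at least once, so no column sum is zero
precision_manual = true_positives / cm.sum(axis=0)
recall_manual = true_positives / cm.sum(axis=1)

for i, name in enumerate(iris.target_names):
    print(f"- {name}: precision={precision_manual[i]:.4f}, "
          f"recall={recall_manual[i]:.4f}")

# Support-weighted recall reduces to overall accuracy
weighted_recall = recall_score(y_test, y_pred, average="weighted")
accuracy = accuracy_score(y_test, y_pred)
print(f"Weighted recall: {weighted_recall:.4f}, accuracy: {accuracy:.4f}")
```

The last two numbers always agree: weighting each class's recall by its support sums the true positives over all samples, which is the definition of accuracy.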

In [None]:
# 5-fold cross-validation on the training set for a more robust performance estimate
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"Cross-validation Scores: {cv_scores}")
print(f"CV Mean Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
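
`cross_val_score` reports a single metric. Since this notebook also tracks precision and recall, one possible extension is `cross_validate`, which scores several metrics per fold. The sketch below uses an explicit shuffled `StratifiedKFold`; the macro-averaged scorers and the shuffle are choices made for this example, not part of the analysis above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
model = DecisionTreeClassifier(random_state=42, max_depth=5)

# Stratified folds keep the class proportions equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision_macro", "recall_macro"]
cv_results = cross_validate(model, X_train, y_train, cv=cv, scoring=scoring)

for metric in scoring:
    scores = cv_results[f"test_{metric}"]
    print(f"{metric}: mean={scores.mean():.4f}, std={scores.std():.4f}")
```

With only about 21 samples per fold, a single misclassification moves the fold score by several percentage points, so some spread between folds is expected.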

## 8. Conclusion

We've successfully trained and evaluated a Decision Tree classifier for the Iris Species dataset.

### Summary of Results:
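- Test accuracy: the tree classifies most held-out flowers correctly (the exact figure is printed in Section 7); on Iris, the usual errors are between versicolor and virginica, whose measurements overlap
- Features: petal length and petal width carried most of the importance
- Model robustness: 5-fold cross-validation on the training set showed consistent accuracy across folds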

### Next Steps:
- Try other algorithms like Random Forest or SVM for comparison
- Perform hyperparameter tuning to find optimal tree parameters
- For a more complex dataset, consider dimensionality reduction techniques
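
The first two next steps can be sketched together. The parameter grid, the choice of `RandomForestClassifier` and an RBF `SVC`, and their settings are all assumptions made for illustration, not recommendations. The SVM is wrapped in a pipeline with `StandardScaler` because, unlike trees, it is sensitive to feature scale.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Hyperparameter tuning: exhaustive search over a small, illustrative grid
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_leaf": [1, 2, 5],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)
print(f"Best parameters: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.4f}")

# Model comparison: score each candidate with the same 5-fold CV
candidates = {
    "tuned decision tree": search.best_estimator_,
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "svm (rbf)": make_pipeline(StandardScaler(), SVC()),
}
cv_means = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_train, y_train, cv=5)
    cv_means[name] = scores.mean()
    print(f"- {name}: CV accuracy = {cv_means[name]:.4f}")
```

The test set is left untouched here on purpose: models are selected on cross-validation scores, and the held-out data is used once at the end to evaluate the chosen model.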

### Key Takeaways:
- Decision trees provide both good performance and interpretability
- For well-separated classes like those in the Iris dataset, even simple models can perform well
- Understanding feature importance helps with feature selection in more complex datasets