# Machine Learning Study: Classification Example

This notebook demonstrates how to use the Machine Learning Study project for a classification task on the Iris dataset.

## Overview

We'll cover:
1. Setting up the environment
2. Loading and exploring data
3. Running a complete ML experiment
4. Analyzing results
5. Experiment tracking with MLflow
6. Comparing different models

## 1. Setup and Imports

In [None]:
import sys
from pathlib import Path

# Add the src directory to the Python path
# (assumes the working directory is a sibling of src/, e.g. notebooks/)
sys.path.insert(0, str(Path.cwd().parent / 'src'))

# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Import our ML study modules
from machine_learning_study.pipelines import run_experiment
from machine_learning_study.data.loader import DataLoader
from machine_learning_study.models.classifier import Classifier
from machine_learning_study.models.evaluator import ModelEvaluator
from machine_learning_study.utils.experiment_tracker import ExperimentTracker

print("✅ Environment setup complete!")

## 2. Data Loading and Exploration

In [None]:
# Initialize data loader
data_loader = DataLoader('data')

# Load the Iris dataset
iris_data = data_loader.load_sklearn_dataset('iris')

# Create DataFrame for easier manipulation
df = pd.DataFrame(
    iris_data['data'], 
    columns=iris_data['feature_names']
)
df['target'] = iris_data['target']
df['species'] = df['target'].map(dict(enumerate(iris_data['target_names'])))

print("Dataset shape:", df.shape)
print("\nFirst few rows:")
df.head()

In [None]:
# Basic data exploration
print("Dataset Information:")
print(df.info())

print("\nClass distribution:")
print(df['species'].value_counts())

print("\nBasic statistics:")
df.describe()

In [None]:
# Visualize the data
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Iris Dataset Feature Distributions', fontsize=16)

# Plot histograms for each feature
features = iris_data['feature_names']
for i, feature in enumerate(features):
    ax = axes[i//2, i%2]
    for species in df['species'].unique():
        species_data = df[df['species'] == species][feature]
        ax.hist(species_data, alpha=0.7, label=species, bins=15)
    ax.set_title(f'{feature} Distribution')
    ax.set_xlabel(feature)
    ax.set_ylabel('Frequency')
    ax.legend()

plt.tight_layout()
plt.show()

In [None]:
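# Pair plot to see feature relationships
# sns.pairplot creates its own figure, so a separate plt.figure() call would
# only leave an empty figure behind. 'target' is dropped because it duplicates 'species'.
g = sns.pairplot(df.drop(columns='target'), hue='species', diag_kind='kde', height=2)
g.figure.suptitle('Pair Plot of Iris Features', y=1.02)
plt.show()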

## 3. Running a Complete ML Experiment

In [None]:
# Define experiment configuration
# Note: Iris has only 4 features, so feature selection with k=4 keeps all of them
config = {
    "task": "classification",
    "target_column": "target",
    "random_state": 42,
    "data": {
        "sklearn_dataset": "iris"
    },
    "preprocessing": {
        "scaling": {
            "method": "standard"
        }
    },
    "features": {
        "selection": {
            "method": "mutual_info",
            "k": 4
        }
    },
    "model": {
        "type": "random_forest",
        "parameters": {
            "n_estimators": 100,
            "max_depth": 10,
            "random_state": 42
        }
    },
    "training": {
        "test_size": 0.2,
        "stratify": True
    }
}

# Save configuration to file
import yaml
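# Write next to the notebook: the sys.path setup above assumes the working
# directory is notebooks/, so a 'notebooks/' prefix would point to a missing folder
config_path = 'iris_experiment_config.yaml'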
with open(config_path, 'w') as f:
    yaml.dump(config, f)

print("Configuration saved to:", config_path)
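
The `"scaling": {"method": "standard"}` entry presumably selects z-score standardization; how the pipeline implements it is not shown in this notebook, so treat the following as a sketch of the idea rather than the project's code. The helper name `standardize` and the sample values are hypothetical. It divides by the population standard deviation, which is also what scikit-learn's `StandardScaler` does.

```python
import math


def standardize(column):
    """Rescale values to zero mean and unit variance (z-scores).

    Uses the population standard deviation (divide by n).
    """
    n = len(column)
    mean = sum(column) / n
    variance = sum((x - mean) ** 2 for x in column) / n
    std = math.sqrt(variance)
    if std == 0:
        # A constant column has no spread; map it to zeros instead of dividing by 0
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


# Hypothetical illustrative values, not taken from the Iris data
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standardize(values)
print(scaled)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

In a real experiment the mean and standard deviation should be computed on the training split only and then reused for the test split, so that no information leaks from the test data.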

In [None]:
# Run the experiment
print("🚀 Running ML experiment...")
results = run_experiment("iris_classification", config_path)
print("✅ Experiment completed!")
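
The `"test_size": 0.2` and `"stratify": True` settings in the configuration ask for a hold-out split that keeps the class proportions equal in both parts. The pipeline's own splitting code is not shown here, so the following is only a dependency-free sketch of the idea; `stratified_split` is a hypothetical helper, not part of the project.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) that preserve class proportions."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Iris has 150 samples: 50 per class, labelled 0, 1 and 2
labels = [0] * 50 + [1] * 50 + [2] * 50
train_idx, test_idx = stratified_split(labels, test_size=0.2)

print(len(train_idx), len(test_idx))  # 120 30
per_class_test = [sum(labels[i] == c for i in test_idx) for c in range(3)]
print(per_class_test)  # [10, 10, 10]
```

With 50 samples per class, a stratified 20% split always puts 10 samples of each species in the test set, which keeps the per-class metrics in the next section comparable.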

## 4. Analyzing Results

In [None]:
# Display evaluation metrics
metrics = results['evaluation_results']
cv_results = results['cross_validation_results']

print("📊 Model Performance Metrics:")
print(f"Accuracy: {metrics['accuracy']:.4f}")
print(f"Precision: {metrics['precision']:.4f}")
print(f"Recall: {metrics['recall']:.4f}")
print(f"F1-Score: {metrics['f1_score']:.4f}")

print("\n📈 Cross-Validation Results:")
print(f"Mean CV Score: {cv_results['metrics']['mean_score']:.4f}")
print(f"Std CV Score: {cv_results['metrics']['std_score']:.4f}")

In [None]:
# Get predictions and analyze them
predictions = results['pipeline_results']['predictions']
y_test = results['pipeline_results']['splits']['y_test']

# Create confusion matrix (seaborn is already imported as sns in the setup cell)
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, predictions)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=iris_data['target_names'],
            yticklabels=iris_data['target_names'])
plt.title('Confusion Matrix - Iris Classification')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
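
The accuracy, precision, recall and F1 values printed earlier can all be derived from a confusion matrix like the one plotted above, where rows are true classes and columns are predicted classes. The sketch below shows one way to do that with macro averaging. Whether the project's evaluator uses macro or weighted averaging is not stated in this notebook, so that choice is an assumption, and `macro_metrics` and the example matrix are hypothetical.

```python
def macro_metrics(cm):
    """Compute accuracy and macro-averaged precision, recall and F1.

    cm[i][j] is the number of samples with true class i predicted as class j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n_classes))

    precisions, recalls, f1_scores = [], [], []
    for k in range(n_classes):
        true_positives = cm[k][k]
        predicted_as_k = sum(cm[i][k] for i in range(n_classes))  # column sum
        actually_k = sum(cm[k])                                   # row sum
        precision = true_positives / predicted_as_k if predicted_as_k else 0.0
        recall = true_positives / actually_k if actually_k else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1_score": sum(f1_scores) / n_classes,
    }


# Hypothetical matrix for a 30-sample test set (not an actual experiment result)
example_cm = [
    [10, 0, 0],
    [0, 9, 1],
    [0, 1, 9],
]
example_metrics = macro_metrics(example_cm)
for name, value in example_metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the test classes are balanced here, macro and weighted averages coincide; they diverge only when class sizes differ.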

In [None]:
# Feature importance analysis
model = results['pipeline_results']['model']
feature_importance = model.get_feature_importance()

if feature_importance is not None:
    plt.figure(figsize=(10, 6))
    features = iris_data['feature_names']
    plt.barh(features, feature_importance)
    plt.title('Feature Importance - Random Forest')
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.show()
else:
    print("Feature importance not available for this model type")

## 5. Experiment Tracking with MLflow

In [None]:
# Initialize experiment tracker
tracker = ExperimentTracker(experiment_name="iris_classification_study")

# Log the experiment results
with tracker.start_run(run_name="notebook_iris_experiment"):
    # Log parameters
    tracker.log_parameters({
        "model_type": "random_forest",
        "n_estimators": 100,
        "max_depth": 10,
        "test_size": 0.2
    })
    
    # Log metrics
    tracker.log_metrics(metrics)
    
    # Log model
    tracker.log_model(model.model, "random_forest_model")
    
    # Log additional information
    tracker.log_dict({
        "dataset": "iris",
        "cv_folds": 5,
        "experiment_type": "classification_example"
    }, "experiment_metadata.json")

print("✅ Experiment logged to MLflow!")
print("View results at: http://localhost:5000 (if MLflow server is running)")

## 6. Model Comparison Example

In [None]:
# Compare different models
models_to_test = ['random_forest', 'gradient_boosting', 'svm', 'knn']
model_results = {}

X = df.drop(['target', 'species'], axis=1)
y = df['target']

evaluator = ModelEvaluator(task='classification')

for model_type in models_to_test:
    print(f"\n🔍 Testing {model_type}...")
    
    # Create and train model
    model = Classifier(model_type=model_type, random_state=42)
    model.fit(X, y)
    
    # Cross-validate
    cv_results = model.cross_validate(X, y, cv=5)
    
    model_results[model_type] = {
        'cv_mean': cv_results['mean_score'],
        'cv_std': cv_results['std_score']
    }
    
    print(f"{model_type}: {cv_results['mean_score']:.4f} (+/- {cv_results['std_score']:.4f})")

# Visualize comparison
model_names = list(model_results.keys())
cv_means = [model_results[m]['cv_mean'] for m in model_names]
cv_stds = [model_results[m]['cv_std'] for m in model_names]

plt.figure(figsize=(10, 6))
bars = plt.bar(model_names, cv_means, yerr=cv_stds, capsize=5)
plt.title('Model Comparison - Cross-Validation Accuracy')
plt.ylabel('Accuracy')
plt.ylim(0.8, 1.05)  # leave headroom so the value labels above the bars are not clipped

# Add value labels on bars
for bar, mean in zip(bars, cv_means):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
             f'{mean:.3f}', ha='center', va='bottom')

plt.show()
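
The `mean_score` and `std_score` values compared above summarize one score per cross-validation fold. The sketch below illustrates how `cv=5` partitions 150 samples and how per-fold scores reduce to those two numbers. It is not the project's `cross_validate` implementation: `k_fold_indices` and `summarize` are hypothetical helpers, the fold scores are made up for illustration, and the use of the population standard deviation is an assumption.

```python
import statistics


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def summarize(scores):
    """Mean and population standard deviation of per-fold scores."""
    return {
        "mean_score": statistics.fmean(scores),
        "std_score": statistics.pstdev(scores),
    }


folds = list(k_fold_indices(150, k=5))
print([len(validation) for _, validation in folds])  # [30, 30, 30, 30, 30]

# Hypothetical per-fold accuracies, for illustration only
summary = summarize([0.9, 1.0, 0.9, 1.0, 0.95])
print(f"{summary['mean_score']:.4f} (+/- {summary['std_score']:.4f})")
```

Contiguous folds are used here only for brevity. The Iris samples are ordered by class, so a real run needs shuffled or stratified folds; for classifiers, scikit-learn's `cross_val_score` uses stratified folds when `cv` is an integer.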

## Summary

This notebook demonstrated:

1. **Data Loading & Exploration**: Loading sklearn datasets and basic EDA
2. **ML Pipeline Execution**: Running complete experiments with configuration
3. **Result Analysis**: Evaluating model performance and visualizing results
4. **Experiment Tracking**: Logging experiments with MLflow
5. **Model Comparison**: Comparing different algorithms

### Key Takeaways:

- The project provides a clean, modular approach to ML experimentation
- Experiment tracking records the parameters, metrics, and model of each run, which supports reproducibility and makes runs comparable
- Configuration-driven approach makes it easy to modify experiments
- Integration with MLflow gives a browsable record of runs, logged parameters, metrics, and model artifacts

### Next Steps:

1. Try different datasets and configurations
2. Experiment with feature engineering techniques
3. Explore hyperparameter optimization with Optuna
4. Deploy models using the included Docker setup

Happy learning! 🚀