# Module 01: Experiment Tracking with MLflow

**Difficulty**: ⭐⭐ Intermediate  
**Estimated Time**: 60 minutes  
**Prerequisites**: 
- Module 00: Introduction to MLOps
- Basic machine learning model training

## Learning Objectives

By the end of this notebook, you will be able to:
1. Set up and configure MLflow for experiment tracking
2. Log parameters, metrics, and artifacts during model training
3. Compare multiple experiment runs to identify the best model
4. Organize experiments with tags and nested runs
5. Retrieve and load models from previous experiments

## 1. Why Experiment Tracking Matters

Imagine you've trained 50 different versions of a model with varying:
- Hyperparameters (learning rate, number of layers, etc.)
- Feature engineering approaches
- Training data subsets
- Preprocessing techniques

**Without experiment tracking:**
- ❌ "Which parameters gave the best accuracy?"
- ❌ "Can't remember which data preprocessing we used for model v23"
- ❌ "The model worked last week, but I changed something..."
- ❌ "Let me manually copy metrics into a spreadsheet"

**With experiment tracking:**
- ✅ All experiments automatically logged
- ✅ Easy comparison across runs
- ✅ Reproducible results
- ✅ Collaboration enabled (team can see all experiments)

In [None]:
# Setup: Import required libraries
import mlflow
import mlflow.sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
import warnings

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
%matplotlib inline

# Set random seed for reproducibility
np.random.seed(42)

print("✓ Libraries imported successfully")
print(f"✓ MLflow version: {mlflow.__version__}")

## 2. Setting Up MLflow

MLflow has four main components:
1. **MLflow Tracking**: Log parameters, metrics, and artifacts
2. **MLflow Projects**: Package code in a reproducible format
3. **MLflow Models**: Package models in a standard format for deployment
4. **MLflow Model Registry**: Centralized model store (covered in Module 02)

In this notebook, we'll focus on **MLflow Tracking**.
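Hard-coding the tracking location works for a local tutorial, but a shared project often needs the same notebook to point at different stores. One possible approach, sketched here under the assumption that deployments set the standard `MLFLOW_TRACKING_URI` environment variable, is to read that variable and fall back to the local `./mlruns` directory. The helper name `resolve_tracking_uri` is hypothetical and not part of MLflow.

```python
import os

LOCAL_DEFAULT_URI = "file:./mlruns"  # same local store this notebook uses


def resolve_tracking_uri(default: str = LOCAL_DEFAULT_URI) -> str:
    """Return MLFLOW_TRACKING_URI if it is set and non-empty, else `default`.

    Hypothetical helper for illustration; MLflow itself also reads this
    environment variable when no URI is set explicitly.
    """
    uri = os.environ.get("MLFLOW_TRACKING_URI", "").strip()
    return uri or default


print(resolve_tracking_uri())
```

Passing the result to `mlflow.set_tracking_uri(...)` would leave the rest of the notebook unchanged.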

In [None]:
# Set up MLflow tracking URI
# By default, MLflow logs to a local ./mlruns directory
# In production, you would typically point this at a remote tracking server

mlflow.set_tracking_uri("file:./mlruns")

# Create or set an experiment
# Experiments group related runs together
experiment_name = "credit_risk_classification"
mlflow.set_experiment(experiment_name)

print(f"✓ MLflow tracking URI: {mlflow.get_tracking_uri()}")
print(f"✓ Active experiment: {experiment_name}")
print("\nYou can view the MLflow UI by running:")
print("  mlflow ui")
print("Then navigate to http://localhost:5000")

## 3. Creating Sample Data for Experiments

Let's create a binary classification dataset to simulate a credit risk prediction problem.

In [None]:
# Generate synthetic credit risk dataset
# Features: income, debt, credit_history, employment_length, etc.
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    weights=[0.7, 0.3],  # Imbalanced classes (70% good credit, 30% bad)
    random_state=42
)

# Create feature names for better interpretability
feature_names = [f'feature_{i}' for i in range(20)]
X_df = pd.DataFrame(X, columns=feature_names)
X_df['target'] = y

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"\nClass distribution:")
print(f"  Class 0 (Good Credit): {(y == 0).sum()} ({(y == 0).mean()*100:.1f}%)")
print(f"  Class 1 (Bad Credit): {(y == 1).sum()} ({(y == 1).mean()*100:.1f}%)")

# Display first few rows
print("\nSample data:")
X_df.head()

## 4. Basic MLflow Tracking: Single Experiment Run

Let's start with a simple example of tracking one model training run.

### Key Concepts:
- **Parameters**: Input values that configure the model (e.g., max_depth, learning_rate)
- **Metrics**: Output values that measure performance (e.g., accuracy, F1-score)
- **Artifacts**: Files produced during the run (e.g., plots, models, datasets)

In [None]:
# Start an MLflow run
with mlflow.start_run(run_name="baseline_logistic_regression") as run:
    
    # Define model parameters
    params = {
        'C': 1.0,
        'max_iter': 100,
        'solver': 'lbfgs'
    }
    
    # Log parameters
    mlflow.log_params(params)
    
    # Train model
    model = LogisticRegression(**params, random_state=42)
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    # Calculate metrics
    metrics = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
        'f1_score': f1_score(y_test, y_pred),
        'roc_auc': roc_auc_score(y_test, y_pred_proba)
    }
    
    # Log metrics
    mlflow.log_metrics(metrics)
    
    # Log model
    mlflow.sklearn.log_model(model, "model")
    
    # Log additional metadata as tags
    mlflow.set_tag("model_type", "LogisticRegression")
    mlflow.set_tag("dataset", "synthetic_credit_risk")
    
    print("✓ Run completed and logged to MLflow")
    print(f"✓ Run ID: {run.info.run_id}")
    print(f"\nLogged Metrics:")
    for metric_name, metric_value in metrics.items():
        print(f"  {metric_name}: {metric_value:.4f}")

## 5. Tracking Multiple Experiments

In practice, you'll want to compare multiple models and hyperparameter configurations. Let's train several models and track them all.
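Since every run computes the same five metrics, one option is to factor that step into a small function. The sketch below is a possible refactoring, not part of the original notebook; `classification_metrics` is a hypothetical name, and the toy labels exist only to exercise it.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def classification_metrics(y_true, y_pred, y_score):
    """Return the five metrics used in this notebook as a plain dict.

    Hypothetical helper: y_pred holds hard labels, y_score holds the
    predicted probability of the positive class.
    """
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1_score": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_score),
    }


# Tiny hand-made example: one positive is missed, no false positives.
toy_metrics = classification_metrics(
    y_true=[0, 1, 1, 0],
    y_pred=[0, 1, 0, 0],
    y_score=[0.1, 0.9, 0.4, 0.2],
)
print(toy_metrics)
```

The returned dict has the shape that `mlflow.log_metrics` expects, so it could be passed straight through inside a run.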

In [None]:
# Define different model configurations to experiment with
experiments_config = [
    {
        'name': 'logistic_regression_c0.1',
        'model': LogisticRegression,
        'params': {'C': 0.1, 'max_iter': 100, 'solver': 'lbfgs', 'random_state': 42}
    },
    {
        'name': 'logistic_regression_c10',
        'model': LogisticRegression,
        'params': {'C': 10.0, 'max_iter': 100, 'solver': 'lbfgs', 'random_state': 42}
    },
    {
        'name': 'random_forest_depth5',
        'model': RandomForestClassifier,
        'params': {'n_estimators': 100, 'max_depth': 5, 'random_state': 42}
    },
    {
        'name': 'random_forest_depth10',
        'model': RandomForestClassifier,
        'params': {'n_estimators': 100, 'max_depth': 10, 'random_state': 42}
    },
    {
        'name': 'random_forest_depth20',
        'model': RandomForestClassifier,
        'params': {'n_estimators': 100, 'max_depth': 20, 'random_state': 42}
    }
]

# Store results for comparison
results = []

print("Running experiments...\n")

for config in experiments_config:
    with mlflow.start_run(run_name=config['name']):
        # Log parameters
        mlflow.log_params(config['params'])
        
        # Train model
        model = config['model'](**config['params'])
        model.fit(X_train, y_train)
        
        # Make predictions
        y_pred = model.predict(X_test)
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        
        # Calculate metrics
        metrics = {
            'accuracy': accuracy_score(y_test, y_pred),
            'precision': precision_score(y_test, y_pred),
            'recall': recall_score(y_test, y_pred),
            'f1_score': f1_score(y_test, y_pred),
            'roc_auc': roc_auc_score(y_test, y_pred_proba)
        }
        
        # Log metrics
        mlflow.log_metrics(metrics)
        
        # Log model
        mlflow.sklearn.log_model(model, "model")
        
        # Tag with model type
        mlflow.set_tag("model_type", config['model'].__name__)
        
        # Store results
        results.append({
            'run_name': config['name'],
            **metrics
        })
        
        print(f"✓ Completed: {config['name']}")

print("\n✓ All experiments completed!")

## 6. Comparing Experiment Results

Now let's visualize and compare the results of all our experiments.
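Picking a winner per metric can give conflicting answers. One simple heuristic (an assumption here, not something this notebook prescribes) is to rank the runs on each metric and average the ranks. The values below are toy numbers for illustration, not results from the experiments above.

```python
import pandas as pd

# Toy values for illustration only; these are not results from this notebook.
toy_results = pd.DataFrame({
    "run_name": ["run_a", "run_b", "run_c"],
    "accuracy": [0.80, 0.85, 0.83],
    "f1_score": [0.70, 0.72, 0.75],
    "roc_auc": [0.88, 0.90, 0.91],
})
metric_cols = ["accuracy", "f1_score", "roc_auc"]

# Rank 1 is the best (highest) value within each metric column.
ranks = toy_results[metric_cols].rank(ascending=False)
toy_results["mean_rank"] = ranks.mean(axis=1)

overall_best = toy_results.sort_values("mean_rank").iloc[0]["run_name"]
print(toy_results.sort_values("mean_rank").to_string(index=False))
print(f"Best by mean rank: {overall_best}")
```

Averaging ranks treats all metrics as equally important; with imbalanced classes it may be more appropriate to weight F1 or recall more heavily.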

In [None]:
# Create comparison DataFrame
results_df = pd.DataFrame(results)

print("Experiment Results Comparison:")
print("=" * 80)
print(results_df.to_string(index=False))

# Find best model for each metric
print("\n" + "=" * 80)
print("Best Models by Metric:")
print("=" * 80)
for metric in ['accuracy', 'precision', 'recall', 'f1_score', 'roc_auc']:
    best_idx = results_df[metric].idxmax()
    best_model = results_df.loc[best_idx, 'run_name']
    best_score = results_df.loc[best_idx, metric]
    print(f"{metric.upper()}: {best_model} ({best_score:.4f})")

In [None]:
# Visualize comparison
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
fig.suptitle('Model Performance Comparison Across Experiments', 
             fontsize=16, fontweight='bold')

metrics_to_plot = ['accuracy', 'precision', 'recall', 'f1_score', 'roc_auc']

for idx, metric in enumerate(metrics_to_plot):
    ax = axes[idx // 3, idx % 3]
    
    # Sort by metric value for better visualization
    sorted_df = results_df.sort_values(metric, ascending=True)
    
    # Create horizontal bar chart
    bars = ax.barh(range(len(sorted_df)), sorted_df[metric], 
                   color='steelblue', alpha=0.7)
    
    # Highlight best performer
    best_idx = sorted_df[metric].idxmax()
    bars[list(sorted_df.index).index(best_idx)].set_color('seagreen')
    
    ax.set_yticks(range(len(sorted_df)))
    ax.set_yticklabels(sorted_df['run_name'], fontsize=9)
    ax.set_xlabel(metric.replace('_', ' ').title(), fontweight='bold')
    ax.set_xlim(sorted_df[metric].min() - 0.02, sorted_df[metric].max() + 0.02)
    ax.grid(axis='x', alpha=0.3)
    
    # Add value labels
    for i, (bar, value) in enumerate(zip(bars, sorted_df[metric])):
        ax.text(value + 0.002, bar.get_y() + bar.get_height()/2,
                f'{value:.3f}',
                va='center', fontsize=8)

# Remove extra subplot
axes[1, 2].axis('off')

plt.tight_layout()
plt.show()

# Report the top run by F1 score instead of hard-coding a conclusion
best_by_f1 = results_df.loc[results_df['f1_score'].idxmax(), 'run_name']
print(f"\nBest run by F1 score: {best_by_f1}")

## 7. Logging Artifacts: Saving Plots and Files

Beyond metrics, we often want to save visualizations, datasets, or other files associated with an experiment.

In [None]:
# Train a model and log comprehensive artifacts
with mlflow.start_run(run_name="best_model_with_artifacts"):
    
    # Train best performing model (based on previous experiments)
    params = {'n_estimators': 100, 'max_depth': 10, 'random_state': 42}
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)
    
    # Log parameters and metrics
    mlflow.log_params(params)
    
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    metrics = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
        'f1_score': f1_score(y_test, y_pred),
        'roc_auc': roc_auc_score(y_test, y_pred_proba)
    }
    mlflow.log_metrics(metrics)
    
    # Create and log confusion matrix plot
    fig, ax = plt.subplots(figsize=(8, 6))
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax)
    ax.set_xlabel('Predicted', fontweight='bold')
    ax.set_ylabel('Actual', fontweight='bold')
    ax.set_title('Confusion Matrix', fontweight='bold')
    plt.tight_layout()
    plt.savefig('confusion_matrix.png', dpi=150, bbox_inches='tight')
    mlflow.log_artifact('confusion_matrix.png')
    plt.close()
    
    # Create and log ROC curve
    fig, ax = plt.subplots(figsize=(8, 6))
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    ax.plot(fpr, tpr, linewidth=2, label=f'ROC (AUC = {metrics["roc_auc"]:.3f})')
    ax.plot([0, 1], [0, 1], 'k--', linewidth=2, label='Random Classifier')
    ax.set_xlabel('False Positive Rate', fontweight='bold')
    ax.set_ylabel('True Positive Rate', fontweight='bold')
    ax.set_title('ROC Curve', fontweight='bold')
    ax.legend()
    ax.grid(alpha=0.3)
    plt.tight_layout()
    plt.savefig('roc_curve.png', dpi=150, bbox_inches='tight')
    mlflow.log_artifact('roc_curve.png')
    plt.close()
    
    # Log feature importance
    feature_importance = pd.DataFrame({
        'feature': feature_names,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    feature_importance.to_csv('feature_importance.csv', index=False)
    mlflow.log_artifact('feature_importance.csv')
    
    # Log the model
    mlflow.sklearn.log_model(model, "model")
    
    print("✓ Model, metrics, and artifacts logged successfully!")
    print("\nLogged artifacts:")
    print("  - confusion_matrix.png")
    print("  - roc_curve.png")
    print("  - feature_importance.csv")
    print("  - model/")

## 8. Retrieving and Loading Previous Runs

One of the key benefits of MLflow is the ability to retrieve past experiments and load models.

In [None]:
# Search for runs in the current experiment
from mlflow.tracking import MlflowClient

client = MlflowClient()
experiment = client.get_experiment_by_name(experiment_name)

# Get all runs from this experiment
runs = mlflow.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.f1_score DESC"]
)

print("All Runs in Experiment (sorted by F1 score):")
print("=" * 80)

# Display key information
display_cols = ['run_id', 'tags.mlflow.runName', 'metrics.accuracy', 
                'metrics.f1_score', 'metrics.roc_auc']
available_cols = [col for col in display_cols if col in runs.columns]
print(runs[available_cols].head(10).to_string(index=False))

# Get the best run by F1 score
best_run_id = runs.iloc[0]['run_id']
print(f"\n✓ Best run ID: {best_run_id}")

In [None]:
# Load the best model from MLflow
best_model_uri = f"runs:/{best_run_id}/model"
loaded_model = mlflow.sklearn.load_model(best_model_uri)

print(f"✓ Model loaded from run: {best_run_id}")
print(f"✓ Model type: {type(loaded_model).__name__}")

# Verify the model works
test_predictions = loaded_model.predict(X_test[:5])
print(f"\nTest predictions on first 5 samples: {test_predictions}")
print(f"Actual values: {y_test[:5]}")

print("\n✓ Successfully loaded and tested model from MLflow!")

## 9. Advanced: Nested Runs for Hyperparameter Tuning

For complex experiments like grid search or cross-validation, you can organize runs hierarchically using nested runs.

In [None]:
# Hyperparameter tuning with nested runs
from sklearn.model_selection import cross_val_score

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15]
}

# Parent run for the entire hyperparameter search
with mlflow.start_run(run_name="hyperparameter_tuning") as parent_run:
    
    best_score = 0
    best_params = {}
    
    # Track total number of combinations
    total_combinations = len(param_grid['n_estimators']) * len(param_grid['max_depth'])
    mlflow.log_param("total_combinations", total_combinations)
    
    combination_num = 0
    
    # Grid search
    for n_est in param_grid['n_estimators']:
        for max_d in param_grid['max_depth']:
            combination_num += 1
            
            # Nested run for each hyperparameter combination
            with mlflow.start_run(
                run_name=f"n{n_est}_d{max_d}", 
                nested=True
            ) as child_run:
                
                # Define and train model
                params = {
                    'n_estimators': n_est,
                    'max_depth': max_d,
                    'random_state': 42
                }
                mlflow.log_params(params)
                
                model = RandomForestClassifier(**params)
                
                # Use cross-validation for more robust evaluation
                cv_scores = cross_val_score(
                    model, X_train, y_train, 
                    cv=5, scoring='f1'
                )
                
                mean_cv_score = cv_scores.mean()
                std_cv_score = cv_scores.std()
                
                # Log metrics
                mlflow.log_metric("cv_f1_mean", mean_cv_score)
                mlflow.log_metric("cv_f1_std", std_cv_score)
                
                # Train on full training set and evaluate on test
                model.fit(X_train, y_train)
                y_pred = model.predict(X_test)
                test_f1 = f1_score(y_test, y_pred)
                mlflow.log_metric("test_f1", test_f1)
                
                # Update best parameters if this is better
                if mean_cv_score > best_score:
                    best_score = mean_cv_score
                    best_params = params
                
                print(f"[{combination_num}/{total_combinations}] "
                      f"n_estimators={n_est}, max_depth={max_d}: "
                      f"CV F1={mean_cv_score:.4f} (±{std_cv_score:.4f})")
    
    # Log best parameters to parent run
    mlflow.log_params({f"best_{k}": v for k, v in best_params.items()})
    mlflow.log_metric("best_cv_f1", best_score)
    
    print(f"\n✓ Hyperparameter tuning complete!")
    print(f"✓ Best parameters: {best_params}")
    print(f"✓ Best CV F1 score: {best_score:.4f}")

## 10. Exercises

### 🎯 Exercise 1: Track a New Experiment

Create a new experiment to compare different classifiers on the same dataset.

**Requirements:**
1. Create an experiment named "classifier_comparison"
2. Train and log at least 3 different classifier types (e.g., SVM, KNN, Gradient Boosting)
3. Log parameters, metrics, and a confusion matrix for each
4. Identify which classifier performs best

**Bonus**: Log the training time for each model as a metric.

In [None]:
# Your solution here
import time
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier

# TODO: Implement your solution
# 1. Set experiment
# 2. Define classifiers
# 3. Train and log each one
# 4. Compare results

### 🎯 Exercise 2: Log Custom Artifacts

Enhance the experiment tracking by logging additional useful artifacts.

**Requirements:**
1. Train a Random Forest model
2. Create and log a feature importance bar plot
3. Create and log a precision-recall curve
4. Save and log a text file with a model summary (parameters, metrics, insights)

**Hint**: Use `plt.savefig()` for plots and standard file I/O for text files.

In [None]:
# Your solution here
from sklearn.metrics import precision_recall_curve

# TODO: Implement your solution

### 🎯 Exercise 3: Query and Compare Past Runs

Practice retrieving and analyzing past experiments.

**Requirements:**
1. Search for all runs where accuracy > 0.85
2. Find the run with the best precision score
3. Load that model and make predictions on new data
4. Create a visualization comparing the top 5 runs across all metrics

**Hint**: Use `mlflow.search_runs()` with filter strings.

In [None]:
# Your solution here

# TODO: Implement your solution
# 1. Search runs with filter
# 2. Find best precision
# 3. Load model
# 4. Visualize comparison

## 11. Summary

### Key Concepts Covered

1. **MLflow Setup**: Configured tracking URI and experiments
2. **Logging**: Tracked parameters, metrics, and artifacts
3. **Comparison**: Compared multiple experiment runs
4. **Artifacts**: Saved plots, models, and files
5. **Retrieval**: Loaded past experiments and models
6. **Nested Runs**: Organized complex experiments hierarchically

### Best Practices

- ✅ **Always log parameters**: Even if you think they won't matter
- ✅ **Use descriptive run names**: Makes finding experiments easier
- ✅ **Log artifacts liberally**: Plots and files help future you understand results
- ✅ **Use tags**: Organize experiments by team, project, or model type
- ✅ **Version your data**: Track which dataset version was used
- ✅ **Document insights**: Add notes about why certain experiments were run
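For the data-versioning practice, one lightweight option (an assumption, not the only way to version data) is to hash the arrays and record the digest on the run. The helper `dataset_fingerprint` below is hypothetical, and the random arrays stand in for a real dataset.

```python
import hashlib

import numpy as np


def dataset_fingerprint(features: np.ndarray, labels: np.ndarray) -> str:
    """Return a short SHA-256 digest over shape, dtype, and bytes of both arrays.

    Hypothetical helper for illustration; any change to the data, its
    shape, or its dtype produces a different fingerprint.
    """
    digest = hashlib.sha256()
    for array in (features, labels):
        array = np.ascontiguousarray(array)
        digest.update(str(array.shape).encode())
        digest.update(str(array.dtype).encode())
        digest.update(array.tobytes())
    return digest.hexdigest()[:12]


# Stand-in data; in the notebook this would be X_train and y_train.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 3))
y_demo = rng.integers(0, 2, size=10)

print(dataset_fingerprint(X_demo, y_demo))
```

Inside a run, the digest could be stored with `mlflow.set_tag("dataset_fingerprint", ...)` so that runs trained on different data are easy to tell apart.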

### Common Pitfalls to Avoid

- ❌ Not logging random seeds (makes reproduction impossible)
- ❌ Overwriting runs (each experiment should be a new run)
- ❌ Logging too few metrics (log more than you think you need)
- ❌ Not cleaning up artifacts (can consume significant disk space)

### What's Next

In **Module 02: Model Versioning and Registry**, we'll learn:
- How to use MLflow Model Registry
- Model lifecycle management (staging, production, archived)
- Model lineage and governance
- Transitioning models between stages

### Additional Resources

- **MLflow Documentation**: https://mlflow.org/docs/latest/tracking.html
- **MLflow Tutorial**: https://mlflow.org/docs/latest/tutorials-and-examples/tutorial.html
- **Experiment Tracking Best Practices**: https://neptune.ai/blog/ml-experiment-tracking

---

## Next Steps

Proceed to **Module 02: Model Versioning and Registry** to learn how to manage model versions and promote models through different lifecycle stages.

**Before moving on, ensure you can:**
- ✅ Set up MLflow tracking and create experiments
- ✅ Log parameters, metrics, and artifacts
- ✅ Compare multiple experiment runs
- ✅ Retrieve and load past models
- ✅ Organize experiments with nested runs and tags