# Lesson 3: MLflow Basics

**Module 2: Reproducibility & Versioning**  
**Estimated Time**: 2-3 hours  
**Difficulty**: Beginner-Intermediate

---

## 🎯 Learning Objectives

By the end of this lesson, you will:

✅ Understand the 4 components of MLflow  
✅ Implement MLflow Tracking in Python scripts  
✅ Log parameters, metrics, and artifacts  
✅ Use the MLflow UI to compare experiments  
✅ Answer interview questions on experiment tracking  

---

## 📚 Table of Contents

1. [What is MLflow?](#1-what-is-mlflow)
2. [The MLflow Tracking Component](#2-the-mlflow-tracking-component)
3. [Hands-On: First MLflow Experiment](#3-hands-on-first-mlflow-experiment)
4. [Comparing Runs in UI](#4-comparing-runs-in-ui)
5. [Interview Preparation](#5-interview-preparation)

---

## 1. What is MLflow?

MLflow is an open-source platform for managing the machine learning lifecycle. It has four main components:

1. **MLflow Tracking**: Record and query experiments (code, data, config, results).
2. **MLflow Projects**: Package data science code in a reusable, reproducible format.
3. **MLflow Models**: Package models in a standard format so they can be deployed to diverse serving environments.
4. **MLflow Model Registry**: Store, version, annotate, discover, and manage models in a central place.

In this lesson, we focus on **Tracking**.

## 2. The MLflow Tracking Component

### Why do we need it?

Without tracking, questions like these are hard to answer:
- "Which hyperparameters gave that 98% accuracy?"
- "Where is the model file for the experiment I ran last Tuesday?"
- "Did the new dataset update improve performance?"

MLflow addresses this by logging four kinds of information for each run:
- **Parameters**: Key-value inputs, such as `n_estimators=100` or `learning_rate=0.01`
- **Metrics**: Numeric values that can be updated during a run (accuracy, loss)
- **Artifacts**: Output files (plots, models, data samples)
- **Tags**: Key-value metadata about the run (user, git commit hash)

## 3. Hands-On: First MLflow Experiment

This section requires `mlflow`, `scikit-learn`, and `pandas`.
If any of them is missing, run `!pip install mlflow scikit-learn pandas` in a notebook cell (drop the leading `!` when installing from a terminal).

```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
import pandas as pd
import numpy as np

# 1. Prepare Data
# A fixed random_state keeps the train/test split identical across runs,
# so differences in RMSE come from the hyperparameters, not the split.
db = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    db.data, db.target, random_state=42
)

# 2. Define Training Function
def train(n_estimators, max_depth):
    """Train a random forest and log its config, RMSE, and model to MLflow."""
    # Start MLflow run (it is closed automatically when the block exits)
    with mlflow.start_run():
        # Log Parameters
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("model_type", "RandomForestRegressor")

        # Create and Train Model (seeded for reproducibility)
        rf = RandomForestRegressor(
            n_estimators=n_estimators, max_depth=max_depth, random_state=42
        )
        rf.fit(X_train, y_train)

        # Evaluate on the held-out test set
        predictions = rf.predict(X_test)
        rmse = np.sqrt(mean_squared_error(y_test, predictions))

        # Log Metrics
        mlflow.log_metric("rmse", rmse)

        # Log Model (Artifact)
        mlflow.sklearn.log_model(rf, "model")

        print(f"Run Complete: n_est={n_estimators}, depth={max_depth}, RMSE={rmse:.4f}")

# 3. Run Experiments
print("Starting Experiments...")
train(50, 5)
train(100, 10)
train(200, 15)
print("Done!")
```

## 4. Comparing Runs in UI

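To view the results, start the MLflow UI, typically from the same directory where you ran the code so that it can find the locally stored runs:
```bash
mlflow ui
```
Then open `http://localhost:5000` in your browser.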

### What to Look For:
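1. **Experiment List**: Runs logged without calling `mlflow.set_experiment()` appear under `Default`; named experiments are listed alongside it.
2. **Table View**: Sort by the `rmse` column to compare RMSE across runs.
3. **Details Page**: Click a run to see its execution time and parameters, and to download the model artifact.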

## 5. Interview Preparation

### Common Questions

#### Q1: "What is the difference between logging a parameter and a metric?"
**Answer**: Parameters are inputs (config, hyperparameters). They are logged once, stored as strings, and stay constant for a run. Metrics are numeric outputs (accuracy, loss) that can be logged repeatedly, so their values can change over time (e.g., loss per epoch).

#### Q2: "How would you track models across a team?"
**Answer**: Use a **remote tracking server** (e.g., self-hosted on AWS EC2, or managed MLflow on Databricks/Azure). Each team member passes the shared server's URI to `mlflow.set_tracking_uri()` or sets the `MLFLOW_TRACKING_URI` environment variable. This creates a central repository of all experiments for the team.

#### Q3: "What are MLflow Artifacts?"
**Answer**: Artifacts are output files generated by the run. Common examples: serialized model files (.pkl), plots (confusion matrix images), and data samples (CSV, Parquet). They are stored in an artifact store (e.g., S3, Azure Blob Storage, or a local directory), while run metadata (parameters, metrics, tags) is stored in a backend store, typically a SQL database.