# MLflow 101: Experiment Tracking with Modern ML Pipelines

Welcome to the first notebook in our MLflow series! This notebook introduces the basics of MLflow, focusing on **experiment tracking** in modern machine learning pipelines.

**Note**: This notebook keeps things simple to get you comfortable with MLflow's core concepts. Starting from Notebook 2, I'll dive into more exciting, cutting-edge machine learning and generative AI challenges!

---

## Table of Contents

1. [Introduction to MLflow](#introduction-to-mlflow)
2. [Setting up Your MLflow Environment](#setting-up-mlflow)
3. [Loading and Preparing a Well-Known Dataset: California Housing](#loading-and-preparing-dataset)
4. [Building a Modern ML Model: XGBoost](#building-a-modern-ml-model)
5. [Tracking Your First Experiment with MLflow](#tracking-experiments-with-mlflow)
6. [Exploring and Comparing Experiment Runs via MLflow UI](#comparing-experiment-runs)
7. [Key Takeaways](#key-takeaways)
8. [Engaging Resources and Further Reading](#resources-and-further-reading)

---

## 1. Introduction to MLflow

MLflow is an open-source platform designed to manage the end-to-end machine learning lifecycle. This includes:
- **Experimentation:** Tracking parameters, code, data, and results.
- **Reproducibility:** Ensuring that your experiments and results can be consistently replicated.
- **Deployment:** Packaging and deploying models to various serving environments.
- **Model Registry:** A centralized hub to manage, version, and stage your models.

![MLflow Logo](https://www.the-odd-dataguy.com/images/posts/20191113/cover.jpg)

In this initial notebook, our primary focus will be on **MLflow Tracking**. This component allows you to log parameters, code versions, metrics, and output files when running your machine learning code. It provides a UI to visualize and compare results from different runs, which is invaluable for iterating on models and understanding what works.

Think of MLflow Tracking as your sophisticated lab notebook for every experiment you conduct. It helps you answer questions like:
- *What were the exact parameters used for that high-performing model?*
- *How did changing a particular feature affect the outcome?*
- *Which version of the code produced this specific result?*

By the end of this notebook, you'll be able to set up MLflow, run a simple ML experiment, log its details, and know how to view them.

---

## 2. Setting up Your MLflow Environment


![MLflow Workflow](https://mlflow.org/docs/latest/assets/images/learn-core-components-b2c38671f104ca6466f105a92ed5aa68.png)


First things first, let's get MLflow installed and imported into our notebook. We'll also set up a **tracking URI**. The tracking URI tells MLflow where to store your experiment data (runs, parameters, metrics, artifacts, etc.). For simplicity, we'll use a local directory named `mlruns` which MLflow will create automatically in your current working directory.

In [None]:
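# Install MLflow if you haven't already.
# 'pip install --quiet' suppresses most of pip's output for a cleaner notebook.
# XGBoost is installed here as well so the mlflow.xgboost flavor imported below has its library available;
# the install can take a few minutes.
!pip install --quiet mlflow xgboost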

# Import the MLflow library
import mlflow
import mlflow.xgboost # Using mlflow.xgboost for specific XGBoost model logging

# Set the tracking URI. MLflow will store tracking data in a local './mlruns' directory.
# If you have a dedicated MLflow tracking server, you would put its URI here (e.g., http://your-mlflow-server:5000).
mlflow.set_tracking_uri('mlruns') # Using a local directory for simplicity

print(f"MLflow Version: {mlflow.__version__}")
print(f"MLflow Tracking URI: {mlflow.get_tracking_uri()}")

With these few lines, MLflow is ready to start tracking our experiments. The `mlruns` directory will be created once we log our first experiment. 

---

## 3. Loading and Preparing a Well-Known Dataset: California Housing

For our first MLflow experiment, we'll use the **California Housing dataset**, a classic dataset derived from the 1990 U.S. census and commonly used for regression tasks in machine learning. Each row describes a housing district in California through features such as median income, house age, and average number of rooms; the target variable is the district's median house value.

We'll load it directly from Hugging Face using the `datasets` library. This dataset is well-structured with numerical features and is suitable for demonstrating MLflow with a regression model.

In [None]:
!pip install --quiet datasets pandas scikit-learn # Install Hugging Face datasets, pandas, and scikit-learn

from datasets import load_dataset
import pandas as pd

# Load the California Housing dataset from Hugging Face
try:
    # Using the gvlassis version which provides the 8 numeric features directly
    housing_dataset = load_dataset('gvlassis/california_housing', split='train')
    print("Successfully loaded California Housing dataset.")
except Exception as e:
    print(f"Failed to load dataset: {e}")
    print("Please ensure you have internet connectivity and the dataset name is correct.")
    # Nothing below can run without the data, so re-raise to halt the notebook here.
    raise

# Convert to pandas DataFrame for easier manipulation
df = housing_dataset.to_pandas()

print("\nDataset Preview (First 5 rows):")
print(df.head())
print(f"\nDataset Shape: {df.shape}")
print("\nDataset Features and Types:")
df.info()  # info() prints its summary directly and returns None, so it needs no print() wrapper

# Define features and target based on the dataset's known structure
# Features: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude
# Target: MedHouseVal
feature_columns = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
target_column = 'MedHouseVal'

X = df[feature_columns]
y = df[target_column]

# The California Housing dataset (gvlassis version) is clean and doesn't typically have NaNs
# However, adding a check or a simple imputer is good practice for robustness if needed in other contexts.
print(f"\nMissing values in features (X): {X.isnull().sum().sum()}")
print(f"Missing values in target (y): {y.isnull().sum()}")

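# If there were NaNs, a simple strategy would be to impute the features
# and drop rows whose target is missing (imputing the target would distort training):
# X = X.fillna(X.mean())
# X, y = X[y.notnull()], y[y.notnull()]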

print("\nFeatures (X) shape:", X.shape)
print("Target (y) shape:", y.shape)

We've loaded the California Housing data and prepared our features (X) and target (y). This is a regression task: predicting the median house value. The data preparation steps here are minimal as the dataset is quite clean, allowing us to focus on MLflow.

---

## 4. Building a Modern ML Model: XGBoost

Instead of a simplistic model, let's use **XGBoost (Extreme Gradient Boosting)**, a powerful and widely used gradient boosting library that performs well in many tabular data competitions and real-world applications. It is known for both its predictive performance and its scalability.

We'll train an `XGBRegressor` model to predict the `MedHouseVal` based on the selected features.

In [None]:
# It takes a few minutes to install xgboost.
!pip install --quiet xgboost # Ensure XGBoost is installed

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Validation set size: {X_val.shape[0]} samples")

# Initialize and train the XGBoost Regressor model
# We'll use a few common hyperparameters. More advanced tuning will be covered later.
params = {
    'objective': 'reg:squarederror', # Objective function for regression
    'n_estimators': 100,             # Number of boosting rounds (trees)
    'learning_rate': 0.1,            # Step size shrinkage
    'max_depth': 5,                  # Maximum depth of a tree (adjusted from 3 for potentially more complex data)
    'subsample': 0.8,                # Subsample ratio of the training instance
    'colsample_bytree': 0.8,         # Subsample ratio of columns when constructing each tree
    'random_state': 42               # For reproducibility
}

model = xgb.XGBRegressor(**params)
model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = model.predict(X_val)

# Evaluate the model
mse = mean_squared_error(y_val, y_pred)
mae = mean_absolute_error(y_val, y_pred)
r2 = r2_score(y_val, y_pred)

print("\nModel Performance on Validation Set:")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"R-squared (R2 Score): {r2:.4f}")

Great! We've trained a sophisticated XGBoost model and evaluated its performance using MSE, MAE, and R2 score. Now, let's see how MLflow helps us keep track of this process, especially if we want to try different parameters, features, or even models.
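
As a quick sanity check on what these three numbers mean, the sketch below recomputes them from their definitions in plain Python. The function name `regression_metrics` and the four sample values are illustrative assumptions, not part of the notebook's pipeline; in practice the scikit-learn functions used above remain the better choice.

```python
def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, and R2 from their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n   # mean of squared errors
    mae = sum(abs(e) for e in errors) / n  # mean of absolute errors

    # R2 = 1 - (residual sum of squares / total sum of squares around the mean)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return {"mse": mse, "mae": mae, "r2_score": r2}


# Made-up values chosen so the arithmetic is easy to follow by hand
y_true_demo = [1.0, 2.0, 3.0, 4.0]
y_pred_demo = [1.5, 2.0, 2.5, 4.0]

metrics = regression_metrics(y_true_demo, y_pred_demo)
print(metrics)
```

For these four points the errors are -0.5, 0, 0.5, and 0, which gives an MSE of 0.125, an MAE of 0.25, and an R2 of 0.9.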

---

## 5. Tracking Your First Experiment with MLflow

This is where MLflow shines. We'll use `mlflow.start_run()` to create a new MLflow run. Within the context of this run, we can log:

- **Parameters (`mlflow.log_param()` or `mlflow.log_params()`):** Key-value pairs representing input parameters to our model training, like hyperparameter values (`learning_rate`, `n_estimators`).
- **Metrics (`mlflow.log_metric()` or `mlflow.log_metrics()`):** Key-value pairs representing model performance metrics, like MSE or R2 score. Metrics can be updated throughout a run (e.g., logging loss at each epoch).
- **Artifacts (`mlflow.log_artifact()` or specific `mlflow.<flavor>.log_model()`):** Larger files, such as the trained model itself, images (like plots), or data files.

Let's re-run our training, but this time, wrapped in an MLflow run context.

In [None]:
# Define an experiment name. If it doesn't exist, MLflow creates it.
experiment_name = "California_Housing_Price_Prediction"
mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name="XGBoost_Initial_Run_Housing") as run:
    run_id = run.info.run_id
    experiment_id = run.info.experiment_id
    print(f"Starting MLflow Run: {run.info.run_name}")
    print(f"Run ID: {run_id}")
    print(f"Experiment ID: {experiment_id}")
    
    # Log parameters used for this run
    mlflow.log_params(params) # Log all parameters from the dict
    mlflow.log_param("train_test_split_random_state", 42)
    mlflow.log_param("dataset_name", "gvlassis/california_housing")
    mlflow.log_param("features_used_count", len(feature_columns))

    # Re-train the model and evaluate INSIDE the mlflow.start_run() context for proper logging
    model_in_run = xgb.XGBRegressor(**params)
    model_in_run.fit(X_train, y_train)
    y_pred_in_run = model_in_run.predict(X_val)
    
    mse_in_run = mean_squared_error(y_val, y_pred_in_run)
    mae_in_run = mean_absolute_error(y_val, y_pred_in_run)
    r2_in_run = r2_score(y_val, y_pred_in_run)
    
    # Log metrics
    metrics_to_log = {"mse": mse_in_run, "mae": mae_in_run, "r2_score": r2_in_run}
    mlflow.log_metrics(metrics_to_log)
    print(f"Logged Metrics: MSE={mse_in_run:.4f}, MAE={mae_in_run:.4f}, R2={r2_in_run:.4f}")

    # Log the trained model using MLflow's XGBoost flavor
    # This saves the model in a format MLflow understands, allowing for easy loading later.
    # The 'artifact_path' is a name for the model within this run's artifacts.
    mlflow.xgboost.log_model(model_in_run, artifact_path="xgboost-housing-model")
    print("Model logged under path: xgboost-housing-model")

    # You can also log arbitrary files as artifacts
    # For example, let's log the list of features used as a text file
    with open("features_california_housing.txt", "w") as f:
        for feature in feature_columns:
            f.write(f"{feature}\n")
    mlflow.log_artifact("features_california_housing.txt", artifact_path="feature_info")
    print("Logged 'features_california_housing.txt' artifact to 'feature_info' directory.")

    # Set a tag for this run for easier filtering/organization
    mlflow.set_tag("model_type", "XGBoost_Regressor")
    mlflow.set_tag("data_version", "1990_census")
    print("Set tags for the run.")

    print(f"\nMLflow Run {run.info.run_name} completed and logged.")

print("\nCheck the 'mlruns' directory in your file system. It should now contain experiment data.")

And that's it! We've successfully logged our first experiment. MLflow has captured the parameters we used, the performance metrics (MSE, MAE, and R2 score), the trained XGBoost model, and our feature list. All this information is now neatly organized and associated with a unique run ID within the specified experiment.

If you were to change a hyperparameter (e.g., `learning_rate` to 0.05) and re-run the cell above (perhaps with a new `run_name`), MLflow would create a *new run* with the updated parameters and corresponding metrics. This is the core of experiment tracking: systematically recording each attempt.

---

## 6. Exploring and Comparing Experiment Runs via MLflow UI

One of the most powerful features of MLflow is its User Interface (UI). The UI allows you to visualize, search, and compare your experiment runs graphically.

To launch the MLflow UI, open your terminal or command prompt, navigate to the directory **containing your `mlruns` folder** (which is likely the same directory where this notebook is saved), and run the following command:

`mlflow ui`

This will start a local web server, typically at `http://localhost:5000` (or `http://127.0.0.1:5000`). Open this URL in your web browser.

![MLflow UI Example](https://mlflow.org/docs/latest/assets/images/default-ui-e733e29706c434eb4443048ea57b8a17.png)

**What to explore in the MLflow UI:**

- **Experiments List:** On the left, you'll see your experiment (`California_Housing_Price_Prediction`). Clicking on it will show all associated runs.
- **Runs Table:** For each run, you'll see key information like start time, duration, parameters, and metrics you logged. You can customize the columns displayed.
- **Run Details:** Click on a specific run (e.g., `XGBoost_Initial_Run_Housing`) to see more details:
    - **Parameters:** All logged hyperparameters.
    - **Metrics:** Logged metrics like MSE, MAE, and R2. You can even see plots if metrics are logged over time (e.g., training epochs, though not done in this example).
    - **Artifacts:** Any logged artifacts, including the `xgboost-housing-model` directory (containing your model files) and the `feature_info` directory with `features_california_housing.txt`.
    - **Tags:** Any tags you set for the run, like `model_type`.
- **Comparing Runs:** If you run the experiment multiple times (e.g., with different hyperparameters), you can select multiple runs from the table and click "Compare." This provides a side-by-side comparison of their parameters and metrics, making it easy to see what changes led to better (or worse) performance. There are also parallel coordinate plots and scatter plots for visual comparison.

**Spend some time navigating the UI. It's very intuitive!** Try changing a parameter in the code cell for section 5 (e.g., `n_estimators` to 150 or `max_depth` to 7), give the run a new name (e.g., `XGBoost_Run_More_Estimators_Housing`), and re-run it. Then, go back to the MLflow UI, refresh, and see how you can compare these two runs.

---

## 7. Key Takeaways

In this first notebook, you've learned the essentials of MLflow experiment tracking:

- **What MLflow is:** An open-source platform for managing the end-to-end machine learning lifecycle.
- **Setting up MLflow:** Installing MLflow and setting a tracking URI (local `mlruns` directory for now).
- **Core Tracking Concepts:**
    - **Experiments:** A way to organize runs for a specific problem (e.g., `California_Housing_Price_Prediction`).
    - **Runs:** Single executions of your ML code.
    - **Parameters:** Inputs to your run (e.g., hyperparameters).
    - **Metrics:** Outputs/results of your run (e.g., MSE, R2 score).
    - **Artifacts:** Any other files you want to save (e.g., trained models, data files, plots).
    - **Tags:** Key-value metadata to help organize and query runs.
- **Logging with MLflow:** Using `mlflow.set_experiment()`, `mlflow.start_run()`, `mlflow.log_param()`, `mlflow.log_params()`, `mlflow.log_metric()`, `mlflow.log_metrics()`, `mlflow.xgboost.log_model()`, `mlflow.log_artifact()`, and `mlflow.set_tag()`.
- **MLflow UI:** Launching and using the UI to view, search, and compare experiments and runs.

This foundation is crucial. As we move to more complex models and tasks in subsequent notebooks, including those involving Large Language Models (LLMs) and Generative AI, these MLflow tracking skills will be indispensable for managing the increased complexity and iteration involved.

---

## 8. Engaging Resources and Further Reading

Want to dive deeper? Here are some excellent resources:

- **MLflow Official Documentation:**
    - [MLflow Quickstart](https://mlflow.org/docs/latest/getting-started/index.html)
    - [MLflow Tracking Guide](https://mlflow.org/docs/latest/tracking.html)
    - [MLflow Python API](https://mlflow.org/docs/latest/python_api/index.html)
- **Dataset Used:**
    - [California Housing on Hugging Face (gvlassis/california_housing)](https://huggingface.co/datasets/gvlassis/california_housing)
    - [Original California Housing Dataset Information (StatLib)](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html)
- **Model Library Used:**
    - [XGBoost Documentation](https://xgboost.readthedocs.io/en/stable/)
- **Community & Code:**
    - [MLflow GitHub Repository](https://github.com/mlflow/mlflow)
    - [MLflow Community Slack (via LF AI & Data Foundation)](https://lfaidata.foundation/projects/mlflow/)

--- 

Thank you for working through this first notebook! I hope you're excited about the power and simplicity of MLflow for experiment tracking. 

**Coming Up Next:** Get ready for more advanced MLflow features, including hyperparameter optimization, model registry, and applications in the realm of LLMs and GenAI. Stay tuned!

![Keep Learning](https://memento.epfl.ch/image/23136/1440x810.jpg)