# MLflow 101: Experiment Tracking with Modern ML Pipelines

Welcome to the first notebook in our MLflow series! It introduces the basics of MLflow, focusing on **experiment tracking** in modern machine learning pipelines.

**Note**: This notebook keeps things simple to get you comfortable with MLflow's core concepts. Starting from Notebook 2, we'll dive into more exciting, cutting-edge machine learning and generative AI challenges!

---

## Table of Contents

1. [Introduction to MLflow](#introduction-to-mlflow)
2. [Setting up Your MLflow Environment](#setting-up-mlflow)
3. [Loading and Preparing a Modern Dataset: OMat24](#loading-and-preparing-dataset)
4. [Building a Modern ML Model: XGBoost](#building-a-modern-ml-model)
5. [Tracking Your First Experiment with MLflow](#tracking-experiments-with-mlflow)
6. [Exploring and Comparing Experiment Runs via MLflow UI](#comparing-experiment-runs)
7. [Key Takeaways](#key-takeaways)
8. [Engaging Resources and Further Reading](#resources-and-further-reading)

---

## 1. Introduction to MLflow

MLflow is an open-source platform designed to manage the end-to-end machine learning lifecycle. This includes:
- **Experimentation:** Tracking parameters, code, data, and results.
- **Reproducibility:** Ensuring that your experiments and results can be consistently replicated.
- **Deployment:** Packaging and deploying models to various serving environments.
- **Model Registry:** A centralized hub to manage, version, and stage your models.

![MLflow Logo](https://www.the-odd-dataguy.com/images/posts/20191113/cover.jpg)

In this initial notebook, our primary focus will be on **MLflow Tracking**. This component allows you to log parameters, code versions, metrics, and output files when running your machine learning code. It provides a UI to visualize and compare results from different runs, which is invaluable for iterating on models and understanding what works.

Think of MLflow Tracking as your sophisticated lab notebook for every experiment you conduct. It helps you answer questions like:
- *What were the exact parameters used for that high-performing model?*
- *How did changing a particular feature affect the outcome?*
- *Which version of the code produced this specific result?*

By the end of this notebook, you'll be able to set up MLflow, run a simple ML experiment, log its details, and know how to view them.

---

## 2. Setting up Your MLflow Environment

First things first, let's get MLflow installed and imported into our notebook. We'll also set up a **tracking URI**. The tracking URI tells MLflow where to store your experiment data (runs, parameters, metrics, artifacts, etc.). For simplicity, we'll use a local directory named `mlruns` which MLflow will create automatically in your current working directory.

In [None]:
# Install MLflow if you haven't already. 
# We use 'pip install --quiet' to suppress extensive output for a cleaner notebook.
!pip install --quiet mlflow

# Import the MLflow library
import mlflow
import mlflow.xgboost  # XGBoost flavor, used in section 5 to log the trained model

# Set the tracking URI. MLflow will store tracking data in a local './mlruns' directory.
# If you have a dedicated MLflow tracking server, you would put its URI here (e.g., http://your-mlflow-server:5000).
mlflow.set_tracking_uri('mlruns') # Using a local directory for simplicity

print(f"MLflow Version: {mlflow.__version__}")
print(f"MLflow Tracking URI: {mlflow.get_tracking_uri()}")

With these few lines, MLflow is ready to start tracking our experiments. The `mlruns` directory will be created once we log our first experiment. 
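
Because `'mlruns'` is a relative path, where the data ends up depends on the directory the notebook is running from. The short sketch below is not part of MLflow: `resolve_local_store` is a hypothetical helper, and it assumes that a relative local path is resolved against the current working directory. It only shows which folder to expect.

```python
from pathlib import Path


def resolve_local_store(uri: str = "mlruns") -> Path:
    """Return the absolute folder a relative local tracking path points to.

    Assumption: a relative local path is interpreted relative to the
    current working directory of the process running the notebook.
    """
    return (Path.cwd() / uri).resolve()


store = resolve_local_store("mlruns")
print(f"Tracking data is expected under: {store}")
print(f"Already exists: {store.exists()}")
```

Knowing this folder helps later, since the MLflow UI has to be started from the directory that contains it.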

---

## 3. Loading and Preparing a Modern Dataset: OMat24

For our first MLflow experiment, we'll use the **Meta Open Materials 2024 (OMat24)** dataset. This is a rich, scientific dataset published in 2024, containing results of Density Functional Theory (DFT) computations for a vast number of materials. It's an excellent example of a modern dataset relevant for practical ML applications in materials science and beyond.

We'll load it directly from Hugging Face using the `datasets` library. To keep this introductory notebook nimble, we'll use the 'small' configuration of the dataset and only take a small slice of its training data.

In [None]:
!pip install --quiet datasets pandas # Install Hugging Face datasets and pandas

from datasets import load_dataset
import pandas as pd

# Load a small subset of the OMat24 dataset ('small' configuration, first 1000 training samples)
try:
    omat_dataset = load_dataset('facebook/OMAT24', 'small', split='train[:1000]')
    print("Successfully loaded OMat24 dataset.")
except Exception as e:
    print(f"Failed to load dataset: {e}")
    print("Please ensure you have internet connectivity and the dataset name is correct.")
    # Fallback for environments without internet access, or where OMat24 cannot be loaded this way:
    # build a small synthetic dataset so the rest of the notebook still runs.
    print("Using a dummy dataset for demonstration.")
    data = {
        'feature1': [i for i in range(1000)],
        'feature2': [i*2 for i in range(1000)],
        'feature3': [i*0.5 for i in range(1000)],
        'formation_energy_per_atom': [i*1.5 + 0.1*(-1)**i for i in range(1000)]
    }
    omat_dataset = pd.DataFrame(data)

# Convert to pandas DataFrame for easier manipulation
if not isinstance(omat_dataset, pd.DataFrame):
    df = omat_dataset.to_pandas()
else:
    df = omat_dataset

print("\nDataset Preview (First 5 rows):")
print(df.head())
print(f"\nDataset Shape: {df.shape}")

# For OMat24, let's select a few numerical features and our target variable.
# The actual OMat24 dataset has many features. We'll pick a few simple ones for this example.
# Target: 'formation_energy_per_atom' (a common target in materials science)
# Features: We'll try to use some lattice parameters if they exist and are numeric.
# If using the dummy dataset, these specific feature names will match.

potential_features = ['lattice_vector_1_x', 'lattice_vector_1_y', 'lattice_vector_1_z', 
                      'lattice_vector_2_x', 'lattice_vector_2_y', 'lattice_vector_2_z',
                      'feature1', 'feature2', 'feature3'] # Add dummy features for fallback
target_column = 'formation_energy_per_atom'

available_features = [f for f in potential_features if f in df.columns and pd.api.types.is_numeric_dtype(df[f])]

if not available_features:
    print("\nCould not find suitable numeric features from the predefined list.")
    # Fallback: use any available numeric columns if predefined are not found
    available_features = [col for col in df.columns if pd.api.types.is_numeric_dtype(df[col]) and col != target_column]
    if len(available_features) > 5: # Limit to 5 features for simplicity
        available_features = available_features[:5]

if not available_features or target_column not in df.columns:
    raise ValueError("Selected features or target column not found or not numeric in the dataset. Please check the dataset structure.")

print(f"\nSelected Features: {available_features}")
print(f"Target Variable: {target_column}")

X = df[available_features]
y = df[target_column]

# Basic preprocessing (simple strategy for this intro):
# drop rows whose target is missing, then fill missing feature values with the column mean
has_target = y.notna()
X, y = X[has_target], y[has_target]
X = X.fillna(X.mean())

print("\nFeatures (X) shape:", X.shape)
print("Target (y) shape:", y.shape)

We've loaded our data and selected a handful of numeric features along with our target, `formation_energy_per_atom`. This is a regression task: predicting a continuous value. The data preparation steps here are minimal to keep the focus on MLflow. In real-world scenarios, this stage would involve more extensive exploratory data analysis (EDA) and feature engineering.
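
To make the feature-selection fallback easier to follow, here is a minimal sketch of the same logic on a toy DataFrame. The helper name `pick_numeric_features` and the toy columns are assumptions made for illustration; they are not part of the dataset or of any library.

```python
import pandas as pd


def pick_numeric_features(df, candidates, target, max_fallback=5):
    """Keep candidate columns that exist and are numeric.

    If none qualify, fall back to the first `max_fallback` numeric
    columns other than the target.
    """
    chosen = [
        c for c in candidates
        if c in df.columns and pd.api.types.is_numeric_dtype(df[c])
    ]
    if not chosen:
        numeric = [
            c for c in df.columns
            if c != target and pd.api.types.is_numeric_dtype(df[c])
        ]
        chosen = numeric[:max_fallback]
    return chosen


# Toy data: two numeric features, one text column, one numeric target.
toy = pd.DataFrame(
    {"a": [1.0, 2.0], "label": ["x", "y"], "b": [3, 4], "target": [0.1, 0.2]}
)

print(pick_numeric_features(toy, ["a", "missing"], "target"))  # ['a']
print(pick_numeric_features(toy, ["missing"], "target"))       # ['a', 'b']
```

The first call finds one of the preferred columns; the second finds none and falls back to whatever numeric columns are available, skipping the text column and the target.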

---

## 4. Building a Modern ML Model: XGBoost

Instead of a simplistic model, let's use **XGBoost (Extreme Gradient Boosting)**. XGBoost is a powerful and widely-used gradient boosting library that excels in many tabular data competitions and real-world applications. It's known for its performance and scalability.

We'll train an `XGBRegressor` model to predict the `formation_energy_per_atom` based on the features selected above.

In [None]:
!pip install --quiet xgboost scikit-learn # Install XGBoost and scikit-learn

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Validation set size: {X_val.shape[0]} samples")

# Initialize and train the XGBoost Regressor model
# We'll use a few common hyperparameters. More advanced tuning will be covered later.
params = {
    'objective': 'reg:squarederror', # Objective function for regression
    'n_estimators': 100,             # Number of boosting rounds (trees)
    'learning_rate': 0.1,            # Step size shrinkage
    'max_depth': 3,                  # Maximum depth of a tree
    'random_state': 42               # For reproducibility
}

model = xgb.XGBRegressor(**params)
model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = model.predict(X_val)

# Evaluate the model
mse = mean_squared_error(y_val, y_pred)
r2 = r2_score(y_val, y_pred)

print("\nModel Performance on Validation Set:")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"R-squared (R2 Score): {r2:.4f}")

Great! We've trained a sophisticated XGBoost model and evaluated its performance. Now, how can MLflow help us keep track of this process, especially if we want to try different parameters, features, or even models?
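
Before we start logging these numbers, it helps to see what they measure. The sketch below recomputes both metrics in plain Python on four made-up values; `mse_and_r2` is a hypothetical helper written for illustration, following the usual definitions (MSE is the mean of the squared errors, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares).

```python
def mse_and_r2(y_true, y_pred):
    """Compute mean squared error and R-squared from two equal-length lists."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: how far the predictions are from the truth.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: how far the truth is from its own mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return ss_res / n, 1.0 - ss_res / ss_tot


# Made-up values: three perfect predictions and one that is off by 1.
mse_demo, r2_demo = mse_and_r2([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(f"MSE: {mse_demo:.4f}")  # MSE: 0.2500
print(f"R2:  {r2_demo:.4f}")   # R2:  0.8000
```

A lower MSE and an R2 closer to 1 both indicate a better fit, which is what we will compare across runs.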

---

## 5. Tracking Your First Experiment with MLflow

This is where MLflow shines. We'll use `mlflow.start_run()` to create a new MLflow run. Within the context of this run, we can log:

- **Parameters (`mlflow.log_param()`):** Key-value pairs representing input parameters to our model training, like hyperparameter values (`learning_rate`, `n_estimators`).
- **Metrics (`mlflow.log_metric()`):** Key-value pairs representing model performance metrics, like MSE or R2 score. Metrics can be updated throughout a run (e.g., logging loss at each epoch).
- **Artifacts (`mlflow.log_artifact()` or specific `mlflow.<flavor>.log_model()`):** Larger files, such as the trained model itself, images (like plots), or data files.

Let's re-run our training, but this time, wrapped in an MLflow run context.

In [None]:
# Define an experiment name. If it doesn't exist, MLflow creates it.
experiment_name = "OMat24_Material_Property_Prediction"
mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name="XGBoost_Initial_Run") as run: # You can give your run a custom name
    print(f"Starting MLflow Run: {run.info.run_name}")
    print(f"Run ID: {run.info.run_id}")
    print(f"Experiment ID: {run.info.experiment_id}")
    
    # Log parameters used for this run
    mlflow.log_params(params) # Log all parameters from the dict
    mlflow.log_param("train_test_split_random_state", 42)
    mlflow.log_param("dataset_subset_size", len(df))
    mlflow.log_param("features_used", ", ".join(available_features))

    # Train and evaluate inside the run so that everything logged here comes from this execution.
    # (The model from section 4 is not reused; in a typical workflow, training happens
    # inside the mlflow.start_run() context.)
    
    model_in_run = xgb.XGBRegressor(**params)
    model_in_run.fit(X_train, y_train)
    y_pred_in_run = model_in_run.predict(X_val)
    mse_in_run = mean_squared_error(y_val, y_pred_in_run)
    r2_in_run = r2_score(y_val, y_pred_in_run)
    
    # Log metrics
    mlflow.log_metric("mse", mse_in_run)
    mlflow.log_metric("r2_score", r2_in_run)
    print(f"Logged Metrics: MSE={mse_in_run:.4f}, R2={r2_in_run:.4f}")

    # Log the trained model using MLflow's XGBoost flavor.
    # This saves the model in a format MLflow understands, allowing for easy loading later.
    # The 'artifact_path' is a name for the model within this run's artifacts.
    mlflow.xgboost.log_model(model_in_run, artifact_path="xgboost-model")
    print("Model logged under path: xgboost-model")

    # You can also log arbitrary files as artifacts
    # For example, let's log the list of features used as a text file
    with open("features.txt", "w") as f:
        for feature in available_features:
            f.write(f"{feature}\n")
    mlflow.log_artifact("features.txt", artifact_path="feature_info")
    print("Logged 'features.txt' artifact to 'feature_info' directory.")

    print(f"\nMLflow Run {run.info.run_name} completed and logged.")

print("\nCheck the 'mlruns' directory in your file system. It should now contain experiment data.")

And that's it! We've successfully logged our first experiment. MLflow has captured the parameters we used, the performance metrics (MSE and R2 score), and even the trained XGBoost model itself, along with our feature list. All this information is now neatly organized and associated with a unique run ID within the specified experiment.
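
If you'd like to peek at what was written without leaving the notebook, the sketch below prints the top levels of the local store. `print_tree` is a hypothetical helper built only on the standard library, and it assumes the default local `mlruns` folder is in use; the exact layout you see depends on your MLflow version.

```python
from pathlib import Path


def print_tree(root: Path, max_depth: int = 3, _depth: int = 0) -> int:
    """Print files and folders under `root` down to `max_depth` levels.

    Returns the number of entries printed.
    """
    if not root.is_dir():
        print(f"{root} does not exist yet; log a run first.")
        return 0
    shown = 0
    for entry in sorted(root.iterdir()):
        suffix = "/" if entry.is_dir() else ""
        print("    " * _depth + entry.name + suffix)
        shown += 1
        # Descend only while we are above the depth limit.
        if entry.is_dir() and _depth + 1 < max_depth:
            shown += print_tree(entry, max_depth, _depth + 1)
    return shown


entries_shown = print_tree(Path("mlruns"))
print(f"\nEntries shown: {entries_shown}")
```

Increase `max_depth` to see the individual parameter, metric, and artifact entries stored for each run.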

If you were to change a hyperparameter (e.g., `learning_rate` to 0.05) and re-run the cell above (perhaps with a new `run_name`), MLflow would create a *new run* with the updated parameters and corresponding metrics. This is the core of experiment tracking: systematically recording each attempt.

---

## 6. Exploring and Comparing Experiment Runs via MLflow UI

One of the most powerful features of MLflow is its User Interface (UI). The UI allows you to visualize, search, and compare your experiment runs graphically.

To launch the MLflow UI, open your terminal or command prompt, navigate to the directory **containing your `mlruns` folder** (which is likely the same directory where this notebook is saved), and run the following command:

`mlflow ui`

This will start a local web server, typically at `http://localhost:5000` (or `http://127.0.0.1:5000`). Open this URL in your web browser.

![MLflow UI Example](https://blog.min.io/content/images/2025/03/Screenshot-2025-03-10-at-3.30.33-PM.png)
*Image hosted on the MinIO blog (`blog.min.io`). Your UI will show the experiment we just ran.*

**What to explore in the MLflow UI:**

- **Experiments List:** On the left, you'll see your experiment (`OMat24_Material_Property_Prediction`). Clicking on it will show all associated runs.
- **Runs Table:** For each run, you'll see key information like start time, duration, parameters, and metrics you logged.
- **Run Details:** Click on a specific run (e.g., `XGBoost_Initial_Run`) to see more details:
    - **Parameters:** All logged hyperparameters.
    - **Metrics:** Logged metrics like MSE and R2. You can even see plots if metrics are logged over time (e.g., training epochs).
    - **Artifacts:** Any logged artifacts, including the `xgboost-model` directory (containing your model files) and the `feature_info` directory with `features.txt`.
- **Comparing Runs:** If you run the experiment multiple times (e.g., with different hyperparameters), you can select multiple runs from the table and click "Compare." This provides a side-by-side comparison of their parameters and metrics, making it easy to see what changes led to better (or worse) performance.

**Spend some time navigating the UI. It's very intuitive!** Try changing a parameter in the code cell for section 5 (e.g., `n_estimators` to 150 or `learning_rate` to 0.05), give the run a new name (e.g., `XGBoost_Run_More_Estimators`), and re-run it. Then, go back to the MLflow UI, refresh, and see how you can compare these two runs.

---

## 7. Key Takeaways

In this first notebook, you've learned the essentials of MLflow experiment tracking:

- **What MLflow is:** An open-source platform for managing the end-to-end machine learning lifecycle.
- **Setting up MLflow:** Installing MLflow and setting a tracking URI (local `mlruns` directory for now).
- **Core Tracking Concepts:**
    - **Experiments:** A way to organize runs for a specific problem (e.g., `OMat24_Material_Property_Prediction`).
    - **Runs:** Single executions of your ML code.
    - **Parameters:** Inputs to your run (e.g., hyperparameters).
    - **Metrics:** Outputs/results of your run (e.g., MSE, R2 score).
    - **Artifacts:** Any other files you want to save (e.g., trained models, data files, plots).
- **Logging with MLflow:** Using `mlflow.set_experiment()`, `mlflow.start_run()`, `mlflow.log_param()`, `mlflow.log_params()`, `mlflow.log_metric()`, `mlflow.xgboost.log_model()`, and `mlflow.log_artifact()`.
- **MLflow UI:** Launching and using the UI to view, search, and compare experiments and runs.

This foundation is crucial. As we move to more complex models and tasks in subsequent notebooks, including those involving Large Language Models (LLMs) and Generative AI, these MLflow tracking skills will be indispensable for managing the increased complexity and iteration involved.

---

## 8. Engaging Resources and Further Reading

Want to dive deeper? Here are some excellent resources:

- **MLflow Official Documentation:**
    - [MLflow Quickstart](https://mlflow.org/docs/latest/getting-started/index.html)
    - [MLflow Tracking Guide](https://mlflow.org/docs/latest/tracking.html)
    - [MLflow Python API](https://mlflow.org/docs/latest/python_api/index.html)
- **Dataset Used:**
    - [Meta Open Materials 2024 (OMat24) on Hugging Face](https://huggingface.co/datasets/facebook/OMAT24)
    - [OMat24 announcement on the Meta AI blog (if interested in the science)](https://ai.meta.com/blog/meta-open-materials-omat-dataset-ai-accelerate-discovery/)
- **Model Library Used:**
    - [XGBoost Documentation](https://xgboost.readthedocs.io/en/stable/)
- **Community & Code:**
    - [MLflow GitHub Repository](https://github.com/mlflow/mlflow)
    - [MLflow Community Slack (via LF AI & Data Foundation)](https://lfaidata.foundation/projects/mlflow/)

--- 

Thank you for working through this first notebook! We hope you're excited about the power and simplicity of MLflow for experiment tracking. 

**Coming Up Next:** Get ready for more advanced MLflow features, including hyperparameter optimization, model registry, and applications in the realm of LLMs and GenAI. Stay tuned!
![Keep Learning](https://memento.epfl.ch/image/23136/1440x810.jpg)