# MLflow 101: Experiment Tracking with Modern ML Pipelines

Welcome to the first notebook in our MLflow series! This notebook introduces the basics of MLflow, focusing on experiment tracking for a small machine learning workflow. We keep things simple here so you can get comfortable with MLflow; more advanced topics are covered in the upcoming notebooks.

---

## Table of Contents
1. Introduction to MLflow
2. Setting up MLflow Tracking
3. Loading and Preparing the Dataset
4. Building a Modern ML Pipeline
5. Tracking Experiments with MLflow
6. Comparing Experiment Runs
7. Conclusion and Resources

---

## 1. Introduction to MLflow

MLflow is an open-source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. In this notebook, we focus on the experiment tracking component, which helps you log and compare your machine learning experiments efficiently.

![MLflow Logo](https://mlflow.org/docs/latest/_static/mlflow-logo.png)

Key features of MLflow experiment tracking:
- Log parameters, metrics, and artifacts
- Visualize and compare runs
- Organize experiments and runs

Let's get started!

## 2. Setting up MLflow Tracking

First, we need to install MLflow and tell it where to store runs. For simplicity, we'll use the local file system as the backend store, so no separate tracking server is required.

Let's import the necessary libraries and configure the MLflow tracking URI.

In [None]:
# Install MLflow and other dependencies
# (%pip installs into the environment of the running kernel)
%pip install mlflow scikit-learn pandas matplotlib

# Import libraries
import mlflow
import mlflow.sklearn
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import matplotlib.pyplot as plt

# Set MLflow tracking URI to local
mlflow.set_tracking_uri('file:./mlruns')

print('MLflow tracking URI set to:', mlflow.get_tracking_uri())

## 3. Loading and Preparing the Dataset

This notebook is written with the Meta Open Materials 2024 (OMat24) dataset from Hugging Face in mind. For simplicity, we load the tabular Adult dataset from OpenML as a placeholder for OMat24. Note that `fetch_openml` downloads the data the first time it is called and caches it locally, so the first execution needs network access.

We will prepare the data for a binary classification task: predicting the `class` column, which records whether a person's income is above or below $50K.

![Tabular Data Example](https://upload.wikimedia.org/wikipedia/commons/6/69/Tabular_data.png)

In [None]:
# Load dataset from OpenML (placeholder for OMat24)
data = fetch_openml(name='adult', version=2, as_frame=True)
df = data.frame

# Basic preprocessing
X = df.drop('class', axis=1)
y = df['class']

# Convert categorical columns to numeric
X = pd.get_dummies(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print('Training set shape (rows, columns):', X_train.shape)
print('Test set shape (rows, columns):', X_test.shape)
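
`pd.get_dummies` replaces each categorical column with one indicator column per category and leaves numeric columns untouched. Here is a toy illustration, using a made-up two-column frame rather than the Adult data:

```python
import pandas as pd

# Hypothetical miniature frame: one numeric and one categorical column.
toy = pd.DataFrame(
    {
        "age": [25, 40, 31],
        "workclass": ["Private", "State-gov", "Private"],
    }
)

# Each distinct category becomes its own indicator column.
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
print(encoded)
```

Encoding before the train/test split, as the cell above does, guarantees that both splits share the same set of columns.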

## 4. Building a Modern ML Pipeline

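We will use the **HistGradientBoostingClassifier**, scikit-learn's histogram-based gradient boosting model. It is well suited to tabular data, handles missing values natively, and is typically much faster than the older `GradientBoostingClassifier` on datasets with tens of thousands of samples or more.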

Let's train the model and evaluate its accuracy.

![Gradient Boosting Illustration](https://scikit-learn.org/stable/_images/sphx_glr_plot_gradient_boosting_regularization_001.png)

In [None]:
# Train HistGradientBoostingClassifier
model = HistGradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
preds = model.predict(X_test)
acc = accuracy_score(y_test, preds)
print(f'Accuracy: {acc:.4f}')

## 5. Tracking Experiments with MLflow

Now, let's track our experiment using MLflow. We will log parameters, metrics, and the trained model.


In [None]:
with mlflow.start_run():
    # Log model parameters
    mlflow.log_param('model_type', 'HistGradientBoostingClassifier')
    mlflow.log_param('random_state', 42)
    
    # Log accuracy metric
    mlflow.log_metric('accuracy', acc)
    
    # Log the model
    mlflow.sklearn.log_model(model, 'model')

print('Experiment logged with MLflow!')

## 6. Comparing Experiment Runs

The MLflow UI lets you compare runs side by side. Start it locally from the directory that contains the `mlruns` folder by running:

```
mlflow ui
```

Then open your browser at http://localhost:5000 to explore your experiments.

Let's visualize the accuracy metric from this run.

![MLflow UI Screenshot](https://mlflow.org/docs/latest/_images/mlflow-ui.png)

In [None]:
# Visualize accuracy metric
plt.bar(['HistGradientBoostingClassifier'], [acc])
plt.ylabel('Accuracy')
plt.title('Model Accuracy')
plt.show()

## 7. Conclusion and Resources

In this notebook, you learned the basics of MLflow experiment tracking: configuring a local tracking URI, training a `HistGradientBoostingClassifier`, logging parameters, metrics, and the model, and viewing runs in the MLflow UI. This foundation will help you as we dive into more advanced topics in the upcoming notebooks.

### Resources
- [MLflow Documentation](https://mlflow.org/docs/latest/index.html)
- [MLflow GitHub](https://github.com/mlflow/mlflow)
- [Scikit-learn HistGradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html)
- [Hugging Face Datasets](https://huggingface.co/datasets)

Happy experimenting! 🚀