<a href="https://colab.research.google.com/gist/ruvnet/772f5b12d3df4027f5c7952186cb0d1c/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MLflow Pipeline with H2O AutoML and DSPy

## Introduction

This tutorial builds an end-to-end machine learning pipeline with three tools: **H2O's AutoML** for automated model training, **MLflow** for experiment tracking and model storage, and **DSPy**, whose MLflow autologging is enabled to show where large language model components can plug in. We walk through setting up the environment, loading data, training multiple models automatically, tracking results with MLflow, and finally loading the best model back to make predictions. The pipeline is organized as separate steps, so each one can be reused or extended for different datasets or tasks.

## Installation and Setup

To get started, install the required libraries in your Google Colab environment:
- **H2O AutoML** – Open-source machine learning library for automatic model selection, training, and evaluation.
- **MLflow** – A framework for managing the ML lifecycle, including logging, model versioning, and deployment.
- **DSPy** – A declarative framework for building and optimizing language-model programs. It stands in for PyTorch as the framework for any LLM-based component; this notebook only enables its MLflow autologging.

In [None]:
!pip install h2o mlflow dspy

In [None]:
import mlflow
import mlflow.dspy
import h2o
import dspy

# Initialize H2O
h2o.init()

# Enable MLflow autologging for DSPy
mlflow.dspy.autolog()
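
Because the setup depends on three separately versioned packages, it can help to record which versions were actually installed before running the rest of the notebook. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of any of the three libraries.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names as passed to pip in the install cell.
for name, version in installed_versions(["h2o", "mlflow", "dspy"]).items():
    print(f"{name}: {version or 'not installed'}")
```

If any entry prints `not installed`, rerun the install cell before continuing.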

## Data Loading and Preprocessing

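We use the Iris dataset (150 samples, 4 numeric features, 3 species) as an example. The data is split 80/20 into training and test sets, stratified by species so both sets keep the same class balance, and then converted to H2O Frames with the target marked as categorical.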

In [None]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='species')
df = pd.concat([X, y], axis=1)

# Split into training and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['species'])

# Convert pandas DataFrames to H2O Frames
train_hf = h2o.H2OFrame(train_df)
test_hf = h2o.H2OFrame(test_df)
# Mark the target as categorical so H2O treats this as classification, not regression
train_hf['species'] = train_hf['species'].asfactor()
test_hf['species'] = test_hf['species'].asfactor()
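
To confirm that the stratified split keeps the classes balanced, the per-class counts in each set can be checked. This sketch repeats the same split with the same `random_state` so it runs on its own; the variable names `train_counts` and `test_counts` are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Rebuild the same DataFrame and split as in the cell above.
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target

train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["species"]
)

# Count rows per class in each split.
train_counts = train_df["species"].value_counts().sort_index()
test_counts = test_df["species"].value_counts().sort_index()
print(train_counts.to_dict())  # {0: 40, 1: 40, 2: 40}
print(test_counts.to_dict())   # {0: 10, 1: 10, 2: 10}
```

Each species contributes 40 rows to training and 10 to testing, which matches the 80/20 split of 50 samples per class.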

## Model Training with H2O AutoML

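H2O's AutoML trains up to 10 candidate models (`max_models=10`) and ranks them on a leaderboard. We take the top-ranked model (the leader), evaluate it on the held-out test set, and log the parameter, the test accuracy, and the model itself to a single MLflow run.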

In [None]:
from h2o.automl import H2OAutoML
from sklearn.metrics import accuracy_score

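max_models = 10
aml = H2OAutoML(max_models=max_models, seed=1)
with mlflow.start_run(run_name="H2O_AutoML_Iris") as run:
    # Train up to max_models candidates; AutoML ranks them on a leaderboard
    aml.train(x=list(X.columns), y='species', training_frame=train_hf)
    best_model = aml.leader

    # Evaluate the leader on the held-out test set
    perf = best_model.model_performance(test_hf)
    # Cast both sides to str so the comparison does not depend on how the labels are parsed
    y_true = test_df['species'].astype(str)
    y_pred = best_model.predict(test_hf).as_data_frame()['predict'].astype(str)
    test_accuracy = accuracy_score(y_true, y_pred)

    # Record the configuration, the metrics, and the model itself in MLflow
    mlflow.log_param('max_models', max_models)
    mlflow.log_metric('test_accuracy', test_accuracy)
    mlflow.log_metric('test_logloss', perf.logloss())
    mlflow.h2o.log_model(best_model, artifact_path='model')

# Keep the run ID so the logged model can be loaded back later
run_id = run.info.run_id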
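
The `test_accuracy` metric is the fraction of test rows whose predicted label equals the true label. The sketch below spells that out in plain Python. The helper `accuracy` is hypothetical, and it compares labels as strings on the assumption that predictions converted from an H2O Frame might come back as `"1"` rather than `1`; the labels in the example are made up for illustration and are not results from the AutoML run.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty inputs")
    # Compare as strings so that 1 and "1" count as the same label.
    matches = sum(str(t) == str(p) for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Made-up labels for illustration only.
y_true = [0, 1, 2, 2, 1]
y_pred = ["0", "1", "2", "1", "1"]
print(accuracy(y_true, y_pred))  # 4 of 5 labels match -> 0.8
```

This is the same quantity that `accuracy_score` reports, shown without the library call.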

## Model Deployment

We now load the best model back from MLflow, using the run ID saved during training, and use it to score new rows.

In [None]:
# Load the best model from MLflow
loaded_model = mlflow.h2o.load_model(f"runs:/{run_id}/model")

# Use the model to predict on a few test rows (features only, without the target column)
sample = h2o.H2OFrame(test_df.drop(columns='species').head())
predictions = loaded_model.predict(sample)
predictions.as_data_frame()
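
The `predict` column holds the integer class codes from the Iris target (0, 1, 2), not species names. A small helper can map them back for readability. This is a sketch: `SPECIES_NAMES` and `decode_species` are names introduced here, and the list hard-codes the class order used by scikit-learn's `load_iris()`.

```python
# Class names in the order scikit-learn's load_iris() encodes them (0, 1, 2).
SPECIES_NAMES = ["setosa", "versicolor", "virginica"]


def decode_species(codes):
    """Map integer class codes (given as ints or numeric strings) to species names."""
    return [SPECIES_NAMES[int(code)] for code in codes]


print(decode_species([0, 2, 1]))  # ['setosa', 'virginica', 'versicolor']
```

In the notebook, the codes would come from the `predict` column of the predictions frame after converting it to pandas.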

## Conclusion

This pipeline demonstrated how H2O AutoML and MLflow fit together in a single machine learning workflow, with DSPy autologging enabled for any LLM components. It covered:
- **Data Loading**: Loading and preprocessing structured data.
- **AutoML Training**: Automating model selection and training using H2O AutoML.
- **Model Tracking**: Using MLflow to log models and track metrics.
- **Deployment**: Loading a saved model and running predictions.

This modular structure makes it easy to extend the pipeline with new datasets or additional ML models. Future improvements could include hyperparameter tuning, explainability analysis, and deployment through cloud services like AWS or GCP.