# <span style="color:#1f77b4">**Machine Learning 02 - MLflow**</span>


### <span style="color:#1f77b4">**Unity Catalog configuration**</span>

Set up widgets for `CATALOG`, `SCHEMA`, and `VOLUME`, resolve the active catalog, and build the `BASE` path.


In [0]:
# Configure Unity Catalog widgets for this project and resolve the active catalog.
dbutils.widgets.removeAll()
dbutils.widgets.text("CATALOG", "")
dbutils.widgets.text("SCHEMA", "default")
dbutils.widgets.text("VOLUME", "ml_lab")

catalog_widget = dbutils.widgets.get("CATALOG")
if catalog_widget:
    CATALOG = catalog_widget
else:
    # Prefer current catalog, otherwise pick the first non-system catalog
    current = spark.sql("SELECT current_catalog()").first()[0]
    catalogs = [r.catalog for r in spark.sql("SHOW CATALOGS").collect()]
    CATALOG = current if current not in ("system",) else next(c for c in catalogs if c not in ("system",))

SCHEMA = dbutils.widgets.get("SCHEMA")
VOLUME = dbutils.widgets.get("VOLUME")
BASE = f"dbfs:/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}"
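
The fallback order in the cell above can be sketched in plain Python, without Spark or `dbutils`. The helper names `resolve_catalog` and `volume_base` and the catalog name `main` are hypothetical and used only for illustration. One difference from the cell: this sketch passes a default to `next()`, so it raises a descriptive error instead of `StopIteration` when every catalog is excluded.

```python
def resolve_catalog(widget_value, current_catalog, available_catalogs, excluded=("system",)):
    """Pick a catalog: widget value first, then the current catalog, then the first usable one."""
    if widget_value:
        return widget_value
    if current_catalog not in excluded:
        return current_catalog
    # next() with a default returns None instead of raising StopIteration.
    candidate = next((c for c in available_catalogs if c not in excluded), None)
    if candidate is None:
        raise ValueError("No usable catalog found; set the CATALOG widget explicitly.")
    return candidate


def volume_base(catalog, schema, volume):
    """Build the volume path in the same form as BASE in the notebook."""
    return f"dbfs:/Volumes/{catalog}/{schema}/{volume}"


# Hypothetical inputs: the widget is empty and the session sits in the system catalog.
chosen = resolve_catalog("", "system", ["system", "main"])
print(volume_base(chosen, "default", "ml_lab"))
```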


### <span style="color:#1f77b4">**Create schema and volume**</span>

Ensure the Unity Catalog schema and volume exist before loading data or saving models.


In [0]:
# Create the schema and volume if they do not already exist.
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.{VOLUME}")


DataFrame[]
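
The statements above interpolate names directly into SQL, which works for simple identifiers but fails when a name contains characters such as hyphens. A minimal sketch of backtick quoting follows, assuming Spark SQL's convention of doubling any backtick inside a quoted identifier. The helpers `quote_ident` and `qualified_name` and the name `my-catalog` are hypothetical.

```python
def quote_ident(name):
    """Wrap one identifier in backticks, doubling any backtick it contains."""
    return "`" + name.replace("`", "``") + "`"


def qualified_name(*parts):
    """Join quoted identifiers into a dotted name such as catalog.schema.volume."""
    return ".".join(quote_ident(part) for part in parts)


# Hypothetical catalog name containing a hyphen, which would break unquoted SQL.
statement = f"CREATE VOLUME IF NOT EXISTS {qualified_name('my-catalog', 'default', 'ml_lab')}"
print(statement)
```

The resulting string could then be passed to `spark.sql(...)` in the same way as the statements in the cell.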

### <span style="color:#1f77b4">**Load data into the UC volume**</span>

Copy the diabetes CSV into the Unity Catalog volume only if it is missing.


In [0]:
# Avoid overwriting shared files during pipeline runs.

# Sync raw data files into the UC volume (only if missing)
data_dir = f"{BASE}/diabetes"
data_file = f"{data_dir}/diabetes.csv"
try:
    dbutils.fs.ls(data_file)
    file_exists = True
except Exception:
    file_exists = False

if not file_exists:
    dbutils.fs.mkdirs(data_dir)
    dbutils.fs.cp("https://raw.githubusercontent.com/Ch3rry-Pi3-Azure/DataBricks-Machine-Learning/refs/heads/main/data/diabetes.csv", data_file)
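
The copy-only-if-missing pattern can be tried outside Databricks with the local filesystem standing in for the volume. This is a sketch under that assumption: `copy_if_missing` is a hypothetical helper, and the file contents are toy data, not the real dataset.

```python
import shutil
import tempfile
from pathlib import Path


def copy_if_missing(source, destination):
    """Copy source to destination unless destination already exists. Returns True if copied."""
    destination = Path(destination)
    if destination.exists():
        return False
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    return True


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "source.csv"
    src.write_text("Glucose,Outcome\n85,0\n")  # toy content for the demo
    dst = Path(tmp) / "diabetes" / "diabetes.csv"

    first = copy_if_missing(src, dst)        # destination missing, so the file is copied
    dst.write_text("edited by a colleague\n")
    second = copy_if_missing(src, dst)       # destination exists, so it is left untouched
    preserved = dst.read_text()

print(first, second)
```

The second call leaves the edited file alone, which is the behavior the cell relies on to avoid overwriting shared files.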


### <span style="color:#1f77b4">**Load, clean, and split data**</span>

Read the CSV, drop rows with nulls, cast each column to a numeric type, and create a 70/30 train/test split for model evaluation.


In [0]:
# Cast columns and split into train/test sets.

# Import only the function this cell uses; wildcard imports shadow Python builtins such as sum and max
from pyspark.sql.functions import col

data = spark.read.format("csv").option("header", "true").load(BASE + "/diabetes/diabetes.csv")

# Drop incomplete rows, then cast the string columns from the CSV to numeric types
data = data.dropna().select(
    col("Pregnancies").astype("int"),
    col("Glucose").astype("int"),
    col("BloodPressure").astype("int"),
    col("SkinThickness").astype("int"),
    col("Insulin").astype("int"),
    col("BMI").astype("float"),
    col("DiabetesPedigreeFunction").astype("float"),
    col("Age").astype("int"),
    col("Outcome").astype("int"),
)

# Random 70/30 split; no seed is set, so row counts vary between runs
splits = data.randomSplit([0.7, 0.3])
train = splits[0]
test = splits[1]
print("Training Rows:", train.count(), " Testing Rows:", test.count())


Training Rows: 523  Testing Rows: 245
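
The counts above add up to 768 rows, with about 68% (523 of 768) in the training set rather than exactly 70%. That is expected when each row is assigned to a split by an independent random draw, which is how `randomSplit` is generally understood to behave. The sketch below imitates that idea in plain Python. It is an illustration, not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to a split with probability proportional to its weight."""
    total = sum(weights)
    bounds, running = [], 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)  # cumulative upper bound of each split

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            # The last split catches any draw not claimed earlier (guards float rounding).
            if draw < bound or index == len(bounds) - 1:
                splits[index].append(row)
                break
    return splits


rows = list(range(768))  # stand-in for the 768 rows in the dataset
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the assignment repeatable. In the notebook, passing a seed such as `data.randomSplit([0.7, 0.3], seed=42)` should make the split reproducible in the same way, as long as the input data and its partitioning do not change.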


### <span style="color:#1f77b4">**Define the MLflow training function**</span>

Use MLflow to track parameters and metrics while training a Spark ML pipeline.


In [0]:
# Build a pipeline, log metrics with MLflow, and save the model.

def train_diabetes_model(training_data, test_data, maxIterations, regularization):
    import mlflow
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler, MinMaxScaler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    import time
    
    with mlflow.start_run():
        numFeatures = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
        numVector = VectorAssembler(inputCols=numFeatures, outputCol="numericFeatures")
        numScaler = MinMaxScaler(inputCol=numVector.getOutputCol(), outputCol="normalizedFeatures")
        featureVector = VectorAssembler(inputCols=["normalizedFeatures"], outputCol="features")
        algo = LogisticRegression(labelCol="Outcome", featuresCol="features", maxIter=maxIterations, regParam=regularization)
        pipeline = Pipeline(stages=[numVector, numScaler, featureVector, algo])
        
        mlflow.log_param('maxIter', algo.getMaxIter())
        mlflow.log_param('regParam', algo.getRegParam())
        model = pipeline.fit(training_data)
        
        prediction = model.transform(test_data)
        metrics = ["accuracy", "weightedRecall", "weightedPrecision"]
        for metric in metrics:
            evaluator = MulticlassClassificationEvaluator(labelCol="Outcome", predictionCol="prediction", metricName=metric)
            metricValue = evaluator.evaluate(prediction)
            print(f"{metric}: {metricValue}")
            mlflow.log_metric(metric, metricValue)
        
        unique_model_name = "classifier-" + str(time.time())
        model_path = BASE + f"/models/{unique_model_name}"
        model.write().overwrite().save(model_path)
        print("Experiment run complete. Model saved to", model_path)
        return model
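
The three logged metrics can be reproduced by hand on a small example. The sketch below assumes the usual definitions: per-class recall and precision, each weighted by the share of true labels in that class. The helper `weighted_metrics` and the toy labels are hypothetical. Under these definitions weighted recall always equals accuracy, which would explain why the two values match in the run outputs of this notebook.

```python
from collections import Counter


def weighted_metrics(labels, predictions):
    """Compute accuracy, support-weighted recall, and support-weighted precision."""
    n = len(labels)
    support = Counter(labels)            # number of true rows per class
    predicted = Counter(predictions)     # number of predicted rows per class
    correct = Counter(l for l, p in zip(labels, predictions) if l == p)

    accuracy = sum(correct.values()) / n
    weighted_recall = sum(
        (support[c] / n) * (correct[c] / support[c]) for c in support
    )
    weighted_precision = sum(
        (support[c] / n) * (correct[c] / predicted[c] if predicted[c] else 0.0)
        for c in support
    )
    return {
        "accuracy": accuracy,
        "weightedRecall": weighted_recall,
        "weightedPrecision": weighted_precision,
    }


# Toy labels and predictions, not taken from the diabetes data.
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_predictions = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]
toy_metrics = weighted_metrics(toy_labels, toy_predictions)
for name, value in toy_metrics.items():
    print(f"{name}: {value}")
```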


### <span style="color:#1f77b4">**Run experiment: config A**</span>

Train and log a model run with a smaller iteration count and higher regularization.


In [0]:
# First experiment run with chosen hyperparameters.

model_a = train_diabetes_model(train, test, 5, 0.5)


accuracy: 0.7020408163265306
weightedRecall: 0.7020408163265306
weightedPrecision: 0.7627482993197279


2025/12/28 21:24:53 INFO mlflow.tracking._tracking_service.client: 🏃 View run dazzling-mole-391 at: adb-7405608564792326.6.azuredatabricks.net/ml/experiments/2636880519013446/runs/eee233e8296d444fa5e80ab75493b707.
2025/12/28 21:24:53 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: adb-7405608564792326.6.azuredatabricks.net/ml/experiments/2636880519013446.


Experiment run complete. Model saved to dbfs:/Volumes/dbw_databricks_ml_jaguar/default/ml_lab/models/classifier-1766957085.9095488


### <span style="color:#1f77b4">**Run experiment: config B**</span>

Train and log a model run with more iterations and lower regularization.


In [0]:
# Second experiment run for comparison.

model_b = train_diabetes_model(train, test, 10, 0.2)


accuracy: 0.7591836734693878
weightedRecall: 0.7591836734693878
weightedPrecision: 0.7702799647777893


2025/12/28 21:25:08 INFO mlflow.tracking._tracking_service.client: 🏃 View run redolent-rook-161 at: adb-7405608564792326.6.azuredatabricks.net/ml/experiments/2636880519013446/runs/f00c1519aaf2471ab0e667a4dd474a43.
2025/12/28 21:25:08 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: adb-7405608564792326.6.azuredatabricks.net/ml/experiments/2636880519013446.


Experiment run complete. Model saved to dbfs:/Volumes/dbw_databricks_ml_jaguar/default/ml_lab/models/classifier-1766957101.2997136
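
With both runs logged, the printed metrics can be compared side by side. The sketch below copies the values from the two outputs above into a dictionary and picks the run with the highest value for a chosen metric. The helper `best_run` is hypothetical, and because the train/test split is unseeded, the numbers will differ on a rerun.

```python
# Metric values copied from the two run outputs shown in this notebook.
runs = {
    "config A (maxIter=5, regParam=0.5)": {
        "accuracy": 0.7020408163265306,
        "weightedRecall": 0.7020408163265306,
        "weightedPrecision": 0.7627482993197279,
    },
    "config B (maxIter=10, regParam=0.2)": {
        "accuracy": 0.7591836734693878,
        "weightedRecall": 0.7591836734693878,
        "weightedPrecision": 0.7702799647777893,
    },
}


def best_run(runs, metric="accuracy"):
    """Return the name of the run with the highest value for the given metric."""
    return max(runs, key=lambda name: runs[name][metric])


winner = best_run(runs)
print("Best by accuracy:", winner)
```

On these numbers config B leads on all three metrics. The same comparison is available in the MLflow experiment UI linked in the log lines.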


### <span style="color:#1f77b4">**Save and register the model in Unity Catalog**</span>

Persist an MLflow model artifact to the Unity Catalog volume and register it so it appears in the Models tab. This creates a new version each time you run it.


In [0]:
# Import required libraries
import os
import mlflow
import mlflow.spark
from mlflow.models.signature import infer_signature

# Use Unity Catalog volume for MLflow temp staging and artifacts
mlflow_tmp = f"{BASE}/mlflow_tmp"
dbutils.fs.mkdirs(mlflow_tmp)
os.environ["MLFLOW_DFS_TMP"] = mlflow_tmp

# Choose which trained model to register
model = model_b

# Build input/output samples to infer model signature
feature_cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
input_df = train.select(*feature_cols).limit(20)
output_df = model.transform(input_df).select("prediction").limit(20)
signature = infer_signature(input_df, output_df)
input_example = input_df.limit(5).toPandas()

# Register the model in Unity Catalog
mlflow.set_registry_uri("databricks-uc")
model_name = f"{CATALOG}.{SCHEMA}.diabetes_lr"

with mlflow.start_run() as run:
    mlflow.spark.log_model(
        spark_model=model,
        artifact_path="model",
        signature=signature,
        input_example=input_example,
        registered_model_name=model_name
    )
    print(f"Registered {model_name} from run {run.info.run_id}")


2025/12/28 21:25:27 INFO mlflow.spark: Inferring pip requirements by reloading the logged model from the databricks artifact repository, which can be time-consuming. To speed up, explicitly specify the conda_env or pip_requirements when calling log_model().


Registered model 'dbw_databricks_ml_jaguar.default.diabetes_lr' already exists. Creating a new version of this model...


Created version '5' of model 'dbw_databricks_ml_jaguar.default.diabetes_lr'.
2025/12/28 21:26:09 INFO mlflow.tracking._tracking_service.client: 🏃 View run bouncy-slug-708 at: adb-7405608564792326.6.azuredatabricks.net/ml/experiments/2636880519013446/runs/f0d54207f3fa4dbbb09c76e022a0ef9b.
2025/12/28 21:26:09 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: adb-7405608564792326.6.azuredatabricks.net/ml/experiments/2636880519013446.


Registered dbw_databricks_ml_jaguar.default.diabetes_lr from run f0d54207f3fa4dbbb09c76e022a0ef9b
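
Each registration adds a version under the three-level name `catalog.schema.model`, and a specific version is normally addressed through a `models:/<name>/<version>` URI. The sketch below only builds that string; `model_uri` is a hypothetical helper, and the values come from the output above.

```python
def model_uri(catalog, schema, model, version):
    """Build a models:/ URI for one version of a Unity Catalog registered model."""
    return f"models:/{catalog}.{schema}.{model}/{version}"


# Version 5 is the one reported in the output above.
uri = model_uri("dbw_databricks_ml_jaguar", "default", "diabetes_lr", 5)
print(uri)

# Assumption: in a Databricks session with the registry URI set to "databricks-uc",
# the URI could be passed to mlflow.spark.load_model(uri) to load that version.
```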


### <span style="color:#1f77b4">**Sample request payload**</span>

Example JSON payload for real-time model scoring endpoints.


In [0]:
{
    "dataframe_records": [
        {
            "Pregnancies": 8,
            "Glucose": 85,
            "BloodPressure": 65,
            "SkinThickness": 29,
            "Insulin": 0,
            "BMI": 26.6,
            "DiabetesPedigreeFunction": 0.672,
            "Age": 34
        }
    ]
}


{'dataframe_records': [{'Pregnancies': 8,
   'Glucose': 85,
   'BloodPressure': 65,
   'SkinThickness': 29,
   'Insulin': 0,
   'BMI': 26.6,
   'DiabetesPedigreeFunction': 0.672,
   'Age': 34}]}
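
A payload in this shape can be assembled and checked in code before it is sent. The sketch below uses only the standard library and prepares a POST request without sending it. The endpoint URL, the token placeholder, and the helpers `build_payload` and `build_request` are hypothetical; substitute the values for your own serving endpoint.

```python
import json
import urllib.request

FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]


def build_payload(rows):
    """Serialize rows as a dataframe_records JSON body, checking every feature is present."""
    records = []
    for row in rows:
        missing = [name for name in FEATURES if name not in row]
        if missing:
            raise ValueError(f"Missing features: {missing}")
        records.append({name: row[name] for name in FEATURES})
    return json.dumps({"dataframe_records": records})


def build_request(url, token, body):
    """Prepare, but do not send, an authenticated JSON POST request."""
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )


body = build_payload([{
    "Pregnancies": 8, "Glucose": 85, "BloodPressure": 65, "SkinThickness": 29,
    "Insulin": 0, "BMI": 26.6, "DiabetesPedigreeFunction": 0.672, "Age": 34,
}])
# Hypothetical endpoint URL and token placeholder.
request = build_request("https://example.invalid/serving-endpoints/diabetes-lr/invocations",
                        "<token>", body)
print(body)
```

Sending the request would be a matter of passing `request` to `urllib.request.urlopen`, which is left out here so the sketch runs without network access.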