### <span style="color:#1f77b4">**Unity Catalog configuration**</span>

Set up widgets for `CATALOG`, `SCHEMA`, and `VOLUME`, resolve the active catalog, and build the `BASE` path.


In [0]:
# Configure Unity Catalog widgets and resolve the active catalog.

# Unity Catalog config for this project
dbutils.widgets.removeAll()
dbutils.widgets.text("CATALOG", "")
dbutils.widgets.text("SCHEMA", "default")
dbutils.widgets.text("VOLUME", "ml_lab")

catalog_widget = dbutils.widgets.get("CATALOG")
if catalog_widget:
    CATALOG = catalog_widget
else:
    # Prefer current catalog, otherwise pick the first non-system catalog
    current = spark.sql("SELECT current_catalog()").first()[0]
    catalogs = [r.catalog for r in spark.sql("SHOW CATALOGS").collect()]
    CATALOG = current if current not in ("system",) else next(c for c in catalogs if c not in ("system",))

SCHEMA = dbutils.widgets.get("SCHEMA")
VOLUME = dbutils.widgets.get("VOLUME")
BASE = f"dbfs:/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}"
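
The fallback order in this cell (widget value, then current catalog, then the first non-system catalog) can be sketched as a small pure function. The name `resolve_catalog` and the explicit `ValueError` are illustrative assumptions and are not part of the notebook. The original `next(...)` call would raise a bare `StopIteration` if no usable catalog existed.

```python
def resolve_catalog(widget_value, current, catalogs, excluded=("system",)):
    """Pick a catalog: widget value first, then current, then first allowed one.

    Hypothetical helper that mirrors the notebook's fallback order.
    """
    if widget_value:
        return widget_value
    if current not in excluded:
        return current
    for name in catalogs:
        if name not in excluded:
            return name
    # Assumption: fail loudly instead of letting next() raise StopIteration.
    raise ValueError("No usable catalog found; set the CATALOG widget explicitly.")


print(resolve_catalog("", "system", ["system", "main"]))  # prints "main"
```

In the notebook this would be called with `dbutils.widgets.get("CATALOG")`, the result of `current_catalog()`, and the list from `SHOW CATALOGS`.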


### <span style="color:#1f77b4">**Create schema and volume**</span>

Ensure the Unity Catalog schema and volume exist before loading data.


In [0]:
# Create the schema and volume if needed.

# Ensure schema and volume exist
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.{VOLUME}")


DataFrame[]
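
The statements above interpolate the names directly into SQL, which works for plain names such as `default` and `ml_lab`. If a catalog or schema name contained a hyphen or other special character, it would need backtick quoting. The helpers below are a minimal sketch of that. `quote_ident` and `qualified` are hypothetical names, and `my-catalog` is a made-up example value.

```python
def quote_ident(name: str) -> str:
    """Wrap a SQL identifier in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def qualified(*parts: str) -> str:
    """Join quoted identifier parts into a dotted name."""
    return ".".join(quote_ident(p) for p in parts)


# "my-catalog" is an invented name used only to show the quoting.
print(qualified("my-catalog", "default", "ml_lab"))
```

A call such as `spark.sql(f"CREATE SCHEMA IF NOT EXISTS {qualified(CATALOG, SCHEMA)}")` would then work for both plain and special-character names.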

### <span style="color:#1f77b4">**Load data into the UC volume**</span>

Copy the diabetes CSV into the Unity Catalog volume only if it is missing.


In [0]:
# Copy the dataset only if it is missing.

# Sync raw data files into the UC volume (only if missing)
data_dir = f"{BASE}/diabetes"
data_file = f"{data_dir}/diabetes.csv"
try:
    dbutils.fs.ls(data_file)
    file_exists = True
except Exception:
    file_exists = False

if not file_exists:
    dbutils.fs.mkdirs(data_dir)
    dbutils.fs.cp("https://raw.githubusercontent.com/Ch3rry-Pi3-Azure/DataBricks-Machine-Learning/refs/heads/main/data/diabetes.csv", data_file)
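
The same "copy only if missing" pattern can be written with the Python standard library. This may be a useful fallback if `dbutils.fs.cp` does not accept an HTTPS source on a given runtime, which this notebook's output does not confirm either way. The sketch assumes the volume is reachable on the cluster as a local path of the form `/Volumes/<catalog>/<schema>/<volume>`. `sync_if_missing` is a hypothetical helper name.

```python
import os
import shutil
import urllib.request


def sync_if_missing(url: str, dest_path: str) -> bool:
    """Download url to dest_path unless the file already exists.

    Returns True when a download happened, False when it was skipped.
    """
    if os.path.exists(dest_path):
        return False
    parent = os.path.dirname(dest_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # Stream the response to disk instead of holding it all in memory.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return True
```

In the notebook, `dest_path` would be the volume path without the `dbfs:` prefix, for example `f"/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/diabetes/diabetes.csv"`.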


### <span style="color:#1f77b4">**Load, clean, and cache data**</span>

Read the CSV, cast types, and cache the dataset and splits to avoid inconsistent reads during tuning.


In [0]:
# Cast columns, split, and cache to stabilize reads.

# Import only what this cell uses; wildcard imports from pyspark.sql.functions
# shadow Python built-ins such as sum, min, and max
from pyspark.sql.functions import col

data = spark.read.format("csv").option("header", "true").load(BASE + "/diabetes/diabetes.csv")
data = data.dropna().select(
    col("Pregnancies").astype("int"),
    col("Glucose").astype("int"),
    col("BloodPressure").astype("int"),
    col("SkinThickness").astype("int"),
    col("Insulin").astype("int"),
    col("BMI").astype("float"),
    col("DiabetesPedigreeFunction").astype("float"),
    col("Age").astype("int"),
    col("Outcome").astype("int"),
)

data = data.cache()
_ = data.count()

splits = data.randomSplit([0.7, 0.3])
train = splits[0]
test = splits[1]
train = train.cache()
test = test.cache()
_ = train.count()
_ = test.count()
print("Training Rows:", train.count(), " Testing Rows:", test.count())


Training Rows: 545  Testing Rows: 223
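
The 545 and 223 rows add up to 768, but the split is not an exact 70/30 cut, which would be roughly 538 and 230. That is expected, because `randomSplit` assigns rows by random sampling rather than by counting. The plain-Python sketch below illustrates the idea with independent per-row draws. It is not Spark's implementation, and `bernoulli_split` is a hypothetical name.

```python
import random


def bernoulli_split(n_rows: int, train_fraction: float, seed: int):
    """Assign each row to train or test independently; return the two sizes."""
    rng = random.Random(seed)
    n_train = sum(1 for _ in range(n_rows) if rng.random() < train_fraction)
    return n_train, n_rows - n_train


# Different seeds give slightly different sizes; the same seed repeats exactly.
for seed in (1, 2, 3):
    print(seed, bernoulli_split(768, 0.7, seed))
```

In Spark, passing a seed makes the split repeatable across runs, for example `data.randomSplit([0.7, 0.3], seed=42)`. The value 42 is an arbitrary choice here.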


### <span style="color:#1f77b4">**Tune with TrainValidationSplit**</span>

Use Spark ML's built-in `TrainValidationSplit` to evaluate a small parameter grid without extra dependencies. This is faster than full cross-validation and works well on single-node clusters.
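
To make the cost difference concrete, the sketch below counts model fits for the grid used in the next cell: four `maxDepth` values and three `maxBins` values. It assumes each candidate is fitted once per split and that the best candidate is refitted once on the full input. The cross-validation figure assumes three folds, which is the default `numFolds` in Spark's `CrossValidator`. The helper names are hypothetical.

```python
from itertools import product

max_depths = [2, 4, 6, 8]
max_bins = [10, 20, 30]
grid = list(product(max_depths, max_bins))  # every (maxDepth, maxBins) pair


def tvs_fits(n_candidates: int) -> int:
    """One fit per candidate on the training portion, plus one final refit."""
    return n_candidates + 1


def cv_fits(n_candidates: int, num_folds: int) -> int:
    """One fit per candidate per fold, plus one final refit."""
    return n_candidates * num_folds + 1


print(len(grid), tvs_fits(len(grid)), cv_fits(len(grid), 3))  # prints "12 13 37"
```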


In [0]:
# Import required libraries
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, MinMaxScaler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# Assemble features and define the model
numFeatures = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
numVector = VectorAssembler(inputCols=numFeatures, outputCol="numericFeatures")
numScaler = MinMaxScaler(inputCol=numVector.getOutputCol(), outputCol="normalizedFeatures")
featureVector = VectorAssembler(inputCols=["normalizedFeatures"], outputCol="Features")
mlAlgo = DecisionTreeClassifier(labelCol="Outcome", featuresCol="Features")

pipeline = Pipeline(stages=[numVector, numScaler, featureVector, mlAlgo])

# Define a small hyperparameter grid
paramGrid = (ParamGridBuilder()
    .addGrid(mlAlgo.maxDepth, [2, 4, 6, 8])
    .addGrid(mlAlgo.maxBins, [10, 20, 30])
    .build())

evaluator = MulticlassClassificationEvaluator(labelCol="Outcome", predictionCol="prediction", metricName="accuracy")

# Use TrainValidationSplit for faster tuning on single-node compute
tvs = TrainValidationSplit(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    trainRatio=0.8,
    seed=42
)

tvs_model = tvs.fit(train)
best_model = tvs_model.bestModel

# Evaluate the best model on the held-out test set
pred = best_model.transform(test)
accuracy = evaluator.evaluate(pred)
print(f"Best model accuracy: {accuracy}")


Best model accuracy: 0.6636771300448431
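
Accuracy is the fraction of test rows predicted correctly, so the printed value can be converted back into counts. The sketch below uses the accuracy printed above and the 223 test rows reported earlier.

```python
accuracy = 0.6636771300448431  # value printed by the cell above
test_rows = 223                # test-set size printed earlier

# Recover the number of correct predictions from the fraction.
correct = round(accuracy * test_rows)
print(correct, test_rows - correct)  # prints "148 75"
```

That is 148 correct and 75 incorrect predictions on the held-out test set.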


### <span style="color:#1f77b4">**Register the tuned model**</span>

Log the best model from tuning to MLflow with a signature and register it in Unity Catalog so it appears in the Models tab.


In [0]:
# Import required libraries
import os
import mlflow
import mlflow.spark
from mlflow.models.signature import infer_signature

# Use UC volume for MLflow temp staging
mlflow_tmp = f"{BASE}/mlflow_tmp"
dbutils.fs.mkdirs(mlflow_tmp)
os.environ["MLFLOW_DFS_TMP"] = mlflow_tmp

# Prepare input/output samples for signature
feature_cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
input_df = train.select(*feature_cols).limit(20)
output_df = best_model.transform(input_df).select("prediction").limit(20)
signature = infer_signature(input_df, output_df)
input_example = input_df.limit(5).toPandas()

# Register the tuned model
mlflow.set_registry_uri("databricks-uc")
model_name = f"{CATALOG}.{SCHEMA}.diabetes_tree_tuned"

with mlflow.start_run() as run:
    mlflow.spark.log_model(
        spark_model=best_model,
        artifact_path="model",
        signature=signature,
        input_example=input_example,
        registered_model_name=model_name
    )
    print(f"Registered {model_name} from run {run.info.run_id}")


2025/12/28 16:45:34 INFO mlflow.spark: Inferring pip requirements by reloading the logged model from the databricks artifact repository, which can be time-consuming. To speed up, explicitly specify the conda_env or pip_requirements when calling log_model().


Successfully registered model 'dbw_databricks_ml_jaguar.default.diabetes_tree_tuned'.


Created version '1' of model 'dbw_databricks_ml_jaguar.default.diabetes_tree_tuned'.
2025/12/28 16:46:07 INFO mlflow.tracking._tracking_service.client: 🏃 View run marvelous-sloth-134 at: adb-7405608564792326.6.azuredatabricks.net/ml/experiments/2636880519013444/runs/e8323df8d9f749e4b19d5388281c9f53.
2025/12/28 16:46:07 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: adb-7405608564792326.6.azuredatabricks.net/ml/experiments/2636880519013444.


Registered dbw_databricks_ml_jaguar.default.diabetes_tree_tuned from run e8323df8d9f749e4b19d5388281c9f53
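
Once registered, a specific version of the model can be addressed with a `models:/` URI built from the three-level Unity Catalog name. The sketch below only builds that URI. `uc_model_uri` is a hypothetical helper, and the commented lines show how the URI might be used with `mlflow.spark.load_model`, assuming the registry URI is still set to `databricks-uc`.

```python
def uc_model_uri(catalog: str, schema: str, model: str, version: int) -> str:
    """Build a models:/ URI for a specific version of a Unity Catalog model."""
    return f"models:/{catalog}.{schema}.{model}/{version}"


uri = uc_model_uri("dbw_databricks_ml_jaguar", "default", "diabetes_tree_tuned", 1)
print(uri)

# In a Databricks notebook (not run here):
# loaded = mlflow.spark.load_model(uri)
# loaded.transform(test).select("prediction").show(5)
```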
