# <span style="color:#1f77b4">**Machine Learning 01 - Training Model**</span>


### <span style="color:#1f77b4">**Unity Catalog configuration**</span>

Set up widgets for `CATALOG`, `SCHEMA`, and `VOLUME`, resolve the active catalog, and build a reusable `BASE` path for storage.


In [0]:
# Configure Unity Catalog widgets and resolve the active catalog.

# Unity Catalog config for this project
dbutils.widgets.removeAll()
dbutils.widgets.text("CATALOG", "")
dbutils.widgets.text("SCHEMA", "default")
dbutils.widgets.text("VOLUME", "ml_lab")

catalog_widget = dbutils.widgets.get("CATALOG")
if catalog_widget:
    CATALOG = catalog_widget
else:
    # Prefer current catalog, otherwise pick the first non-system catalog
    current = spark.sql("SELECT current_catalog()").first()[0]
    catalogs = [r.catalog for r in spark.sql("SHOW CATALOGS").collect()]
    CATALOG = current if current not in ("system",) else next(c for c in catalogs if c not in ("system",))

SCHEMA = dbutils.widgets.get("SCHEMA")
VOLUME = dbutils.widgets.get("VOLUME")
BASE = f"dbfs:/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}"


### <span style="color:#1f77b4">**Create schema and volume**</span>

Ensure the schema and volume exist in Unity Catalog so read/write operations succeed.


In [0]:
# Create the schema and volume if they do not exist.

# Ensure schema and volume exist
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.{VOLUME}")


DataFrame[]

### <span style="color:#1f77b4">**Load data into the UC volume**</span>

Check for the diabetes CSV and copy it from GitHub only if missing to avoid overwriting files during jobs.


In [0]:
# Only copy the dataset when it is missing to avoid job conflicts.

# Sync raw data files into the UC volume (only if missing)
data_dir = f"{BASE}/diabetes"
data_file = f"{data_dir}/diabetes.csv"
try:
    dbutils.fs.ls(data_file)
    file_exists = True
except Exception:
    file_exists = False

if not file_exists:
    dbutils.fs.mkdirs(data_dir)
    dbutils.fs.cp("https://raw.githubusercontent.com/Ch3rry-Pi3-Azure/DataBricks-Machine-Learning/refs/heads/main/data/diabetes.csv", data_file)


### <span style="color:#1f77b4">**Preview the raw dataset**</span>

Read the CSV into a Spark DataFrame and display a sample to confirm the data loaded correctly.


In [0]:
# Load the CSV and quickly inspect a sample of records.

# Load dataset into a Spark DataFrame
df = spark.read.format("csv").option("header", "true").load(BASE + "/diabetes/diabetes.csv")
display(df)


Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1


### <span style="color:#1f77b4">**Clean and cast columns**</span>

Drop null rows and cast each feature to the correct numeric type using `pyspark.sql.functions` so ML models can train.


In [0]:
# Cast fields into numeric types for ML and remove nulls.

# Import required libraries
from pyspark.sql.types import *
from pyspark.sql.functions import *
   
data = df.dropna().select(col("Pregnancies").astype("int"),
                           col("Glucose").astype("int"),
                          col("BloodPressure").astype("int"),
                          col("SkinThickness").astype("int"),
                          col("Insulin").astype("int"),
                          col("BMI").astype("float"),
                          col("DiabetesPedigreeFunction").astype("float"),
                          col("Age").astype("int"),
                          col("Outcome").astype("int")
                          )
display(data)


Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1


### <span style="color:#1f77b4">**Train/test split**</span>

Split the cleaned dataset into training and testing partitions for evaluation.


In [0]:
# Split the dataset for training and evaluation.

# Split data into training and testing sets
splits = data.randomSplit([0.7, 0.3])
train = splits[0]
test = splits[1]
print ("Training Rows:", train.count(), " Testing Rows:", test.count())


Training Rows: 523  Testing Rows: 245


### <span style="color:#1f77b4">**Assemble and scale features**</span>

Use `VectorAssembler` to build a feature vector and `MinMaxScaler` to normalize values for stable model training.


In [0]:
# Assemble features and normalize them for model stability.

# Import required libraries
from pyspark.ml.feature import VectorAssembler, MinMaxScaler

numericFeatures = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction"]
numericColVector = VectorAssembler(inputCols=numericFeatures, outputCol = "numericFeatures")
vectorizedData = numericColVector.transform(train)

minMax = MinMaxScaler(inputCol= numericColVector.getOutputCol(), outputCol="normalizedFeatures")
scaledData = minMax.fit(vectorizedData).transform(vectorizedData)

compareNumerics = scaledData.select("numericFeatures", "normalizedFeatures")
display(compareNumerics)


numericFeatures,normalizedFeatures
"Map(vectorType -> sparse, length -> 7, indices -> List(1, 5, 6), values -> List(73.0, 21.100000381469727, 0.34200000762939453))","Map(vectorType -> sparse, length -> 7, indices -> List(1, 5, 6), values -> List(0.36683417085427134, 0.35521885251597257, 0.1127241663541023))"
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 78.0, 88.0, 29.0, 40.0, 36.900001525878906, 0.4339999854564667))","Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.3919597989949749, 0.7719298245614035, 0.29292929292929293, 0.04728132387706856, 0.6212121309424986, 0.15200682002330046))"
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 84.0, 82.0, 31.0, 125.0, 38.20000076293945, 0.2329999953508377))","Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.4221105527638191, 0.7192982456140351, 0.31313131313131315, 0.14775413711583923, 0.6430976394217227, 0.06618274500370587))"
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 91.0, 68.0, 32.0, 210.0, 39.900001525878906, 0.38100001215934753))","Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.457286432160804, 0.5964912280701754, 0.32323232323232326, 0.24822695035460993, 0.6717171801501655, 0.12937660157454303))"
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 91.0, 80.0, 0.0, 0.0, 32.400001525878906, 0.6010000109672546))","Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.457286432160804, 0.7017543859649122, 0.0, 0.0, 0.5454545571309983, 0.22331340421855359))"
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 93.0, 60.0, 25.0, 92.0, 28.700000762939453, 0.5320000052452087))","Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.46733668341708545, 0.5263157894736842, 0.25252525252525254, 0.10874704491725767, 0.48316498359744436, 0.19385140442278603))"
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 93.0, 100.0, 39.0, 72.0, 43.400001525878906, 1.0210000276565552))","Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.46733668341708545, 0.8771929824561403, 0.393939393939394, 0.0851063829787234, 0.7306397375591102, 0.4026473082731292))"
"Map(vectorType -> sparse, length -> 7, indices -> List(1, 6), values -> List(94.0, 0.25600001215934753))","Map(vectorType -> sparse, length -> 7, indices -> List(1, 6), values -> List(0.4723618090452261, 0.07600341796487435))"
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 94.0, 70.0, 27.0, 115.0, 43.5, 0.34700000286102295))","Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.4723618090452261, 0.6140350877192982, 0.27272727272727276, 0.1359338061465721, 0.7323232135111692, 0.11485909166246368))"
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 95.0, 64.0, 39.0, 105.0, 44.599998474121094, 0.3659999966621399))","Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.47738693467336685, 0.5614035087719298, 0.393939393939394, 0.12411347517730496, 0.750841705865784, 0.12297181292430033))"


### <span style="color:#1f77b4">**Prepare features and labels**</span>

Select the normalized feature vector as `features` and the outcome column as `label` for Spark ML.


In [0]:
# Create the feature vector and label columns expected by Spark ML.

preppedData = scaledData[col("normalizedFeatures").alias("features"), col("Outcome").alias("label")]
display(preppedData)


features,label
"Map(vectorType -> sparse, length -> 7, indices -> List(1, 5, 6), values -> List(0.36683417085427134, 0.35521885251597257, 0.1127241663541023))",0
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.3919597989949749, 0.7719298245614035, 0.29292929292929293, 0.04728132387706856, 0.6212121309424986, 0.15200682002330046))",0
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.4221105527638191, 0.7192982456140351, 0.31313131313131315, 0.14775413711583923, 0.6430976394217227, 0.06618274500370587))",0
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.457286432160804, 0.5964912280701754, 0.32323232323232326, 0.24822695035460993, 0.6717171801501655, 0.12937660157454303))",0
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.457286432160804, 0.7017543859649122, 0.0, 0.0, 0.5454545571309983, 0.22331340421855359))",0
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.46733668341708545, 0.5263157894736842, 0.25252525252525254, 0.10874704491725767, 0.48316498359744436, 0.19385140442278603))",0
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.46733668341708545, 0.8771929824561403, 0.393939393939394, 0.0851063829787234, 0.7306397375591102, 0.4026473082731292))",0
"Map(vectorType -> sparse, length -> 7, indices -> List(1, 6), values -> List(0.4723618090452261, 0.07600341796487435))",0
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.4723618090452261, 0.6140350877192982, 0.27272727272727276, 0.1359338061465721, 0.7323232135111692, 0.11485909166246368))",0
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.47738693467336685, 0.5614035087719298, 0.393939393939394, 0.12411347517730496, 0.750841705865784, 0.12297181292430033))",0


### <span style="color:#1f77b4">**Train logistic regression**</span>

Fit a logistic regression classifier with regularization to create a baseline model.


In [0]:
# Train the classifier with regularization.

# Import required libraries
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10, regParam=0.3)
model = lr.fit(preppedData)
print ("Model trained!")


Downloading artifacts:   0%|          | 0/15 [00:00<?, ?it/s]

Uploading artifacts:   0%|          | 0/4 [00:00<?, ?it/s]

Model trained!


### <span style="color:#1f77b4">**Generate predictions**</span>

Transform the test data and produce predicted labels and probabilities for evaluation.


In [0]:
# Transform the test data and compute predictions.

# Prepare the test data

vectorizedTestData = numericColVector.transform(test)
scaledTestData = minMax.fit(vectorizedTestData).transform(vectorizedTestData)
preppedTestData = scaledTestData[col("normalizedFeatures").alias("features"), col("Outcome").alias("label")]
   
# Get predictions
prediction = model.transform(preppedTestData)
predicted = prediction.select("features", "probability", col("prediction").astype("Int"), col("label").alias("trueLabel"))
display(predicted)


features,probability,prediction,trueLabel
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.08441558441558442, 0.49180327868852464, 0.0, 0.0, 0.3233979322862226, 0.2950521934452419))","Map(vectorType -> dense, length -> 2, values -> List(0.9108713032366744, 0.08912869676332558))",0,0
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.14935064935064937, 0.6229508196721312, 0.0, 0.0, 0.675111777454536, 0.04947798437920749))","Map(vectorType -> dense, length -> 2, values -> List(0.8652800824101606, 0.13471991758983937))",0,0
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.19480519480519481, 0.42622950819672134, 0.18518518518518517, 0.052941176470588235, 0.41430700252224834, 0.08352246212463038))","Map(vectorType -> dense, length -> 2, values -> List(0.8880690922630521, 0.11193090773694792))",0,0
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.25974025974025977, 0.5245901639344263, 0.4074074074074074, 0.09705882352941177, 0.5335320424912942, 0.2088061705306459))","Map(vectorType -> dense, length -> 2, values -> List(0.8348864652906366, 0.1651135347093634))",0,0
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.27272727272727276, 0.5573770491803279, 0.5925925925925926, 0.0, 0.5335320424912942, 0.06945074772288301))","Map(vectorType -> dense, length -> 2, values -> List(0.848453379182019, 0.15154662081798098))",0,0
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.3181818181818182, 0.49180327868852464, 0.0, 0.0, 0.5260804774932287, 0.08079891148071137))","Map(vectorType -> dense, length -> 2, values -> List(0.839122094148266, 0.160877905851734))",0,0
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.37012987012987014, 0.5245901639344263, 0.31481481481481477, 0.0, 0.3129657299187452, 0.07580571726277695))","Map(vectorType -> dense, length -> 2, values -> List(0.8606665915347972, 0.1393334084652028))",0,0
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.37662337662337664, 0.639344262295082, 0.7407407407407407, 0.1323529411764706, 0.51415798486651, 0.06945074772288301))","Map(vectorType -> dense, length -> 2, values -> List(0.8101917814828755, 0.18980821851712448))",0,0
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.37662337662337664, 0.7049180327868853, 0.31481481481481477, 0.15441176470588236, 0.4366616975164444, 0.2768951260214917))","Map(vectorType -> dense, length -> 2, values -> List(0.8047912654807838, 0.19520873451921616))",0,0
"Map(vectorType -> dense, length -> 7, values -> List(0.0, 0.3961038961038961, 0.5245901639344263, 0.7592592592592592, 0.2088235294117647, 0.6184798948394251, 0.03994552330533551))","Map(vectorType -> dense, length -> 2, values -> List(0.7775278431434555, 0.22247215685654453))",0,0


### <span style="color:#1f77b4">**Evaluate the model**</span>

Use `MulticlassClassificationEvaluator` to compute accuracy, precision, recall, and F1 scores.


In [0]:
# Compute multiple classification metrics.

# Import required libraries
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
   
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction")
   
# Simple accuracy
accuracy = evaluator.evaluate(prediction, {evaluator.metricName:"accuracy"})
print("Accuracy:", accuracy)
   
# Individual class metrics
labels = [0,1]
print("\nIndividual class metrics:")
for label in sorted(labels):
    print ("Class %s" % (label))
   
    # Precision
    precision = evaluator.evaluate(prediction, {evaluator.metricLabel:label,
                                                evaluator.metricName:"precisionByLabel"})
    print("\tPrecision:", precision)
   
    # Recall
    recall = evaluator.evaluate(prediction, {evaluator.metricLabel:label,
                                             evaluator.metricName:"recallByLabel"})
    print("\tRecall:", recall)
   
    # F1 score
    f1 = evaluator.evaluate(prediction, {evaluator.metricLabel:label,
                                         evaluator.metricName:"fMeasureByLabel"})
    print("\tF1 Score:", f1)
   
# Weighted (overall) metrics
overallPrecision = evaluator.evaluate(prediction, {evaluator.metricName:"weightedPrecision"})
print("Overall Precision:", overallPrecision)
overallRecall = evaluator.evaluate(prediction, {evaluator.metricName:"weightedRecall"})
print("Overall Recall:", overallRecall)
overallF1 = evaluator.evaluate(prediction, {evaluator.metricName:"weightedFMeasure"})
print("Overall F1 Score:", overallF1)


Accuracy: 0.710204081632653

Individual class metrics:
Class 0
	Precision: 0.6977777777777778
	Recall: 0.98125
	F1 Score: 0.8155844155844156
Class 1
	Precision: 0.85
	Recall: 0.2
	F1 Score: 0.3238095238095238
Overall Precision: 0.7505895691609977
Overall Recall: 0.7102040816326531
Overall F1 Score: 0.6449686368053715


### <span style="color:#1f77b4">**Build a reusable pipeline**</span>

Bundle feature engineering and the classifier into a `Pipeline` so the same steps can be reused consistently.


In [0]:
# Create a pipeline to keep feature steps and model together.

# Import required libraries
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, MinMaxScaler
from pyspark.ml.classification import LogisticRegression
   

numFeatures = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
   
# Define the feature engineering and model training algorithm steps
numVector = VectorAssembler(inputCols=numFeatures, outputCol="numericFeatures")
numScaler = MinMaxScaler(inputCol = numVector.getOutputCol(), outputCol="normalizedFeatures")
featureVector = VectorAssembler(inputCols=["normalizedFeatures"], outputCol="Features")
algo = LogisticRegression(labelCol="Outcome", featuresCol="Features", maxIter=10, regParam=0.3)
   
# Chain the steps as stages in a pipeline
pipeline = Pipeline(stages=[ numVector, numScaler, featureVector, algo])
   
# Use the pipeline to prepare data and fit the model algorithm
model = pipeline.fit(train)
print ("Model trained!")


Downloading artifacts:   0%|          | 0/35 [00:00<?, ?it/s]

Uploading artifacts:   0%|          | 0/4 [00:00<?, ?it/s]

Model trained!


### <span style="color:#1f77b4">**Pipeline inference**</span>

Run the pipeline on test data and review predictions from the pipeline output.


In [0]:
# Run inference using the pipeline output.

prediction = model.transform(test)
predicted = prediction.select("Features", "probability", col("prediction").astype("Int"), col("Outcome").alias("trueLabel"))
display(predicted)


Features,probability,prediction,trueLabel
"Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.2864321608040201, 0.5263157894736842, 0.0, 0.0, 0.3653198687795551, 0.2805294584733362, 0.7666666666666666))","Map(vectorType -> dense, length -> 2, values -> List(0.78425965272794, 0.21574034727206004))",0,0
"Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.33668341708542715, 0.6666666666666666, 0.0, 0.0, 0.7626262301916712, 0.04953031614584443, 0.4166666666666667))","Map(vectorType -> dense, length -> 2, values -> List(0.7467116489434827, 0.25328835105651726))",0,0
"Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.37185929648241206, 0.45614035087719296, 0.10101010101010102, 0.0425531914893617, 0.46801344314694787, 0.08155422122158221, 0.016666666666666666))","Map(vectorType -> dense, length -> 2, values -> List(0.8431388697180214, 0.15686113028197857))",0,0
"Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.4221105527638191, 0.5614035087719298, 0.22222222222222224, 0.07801418439716312, 0.6026935743673928, 0.19940222040465247, 0.0))","Map(vectorType -> dense, length -> 2, values -> List(0.7824959498108743, 0.21750405018912566))",0,0
"Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.4321608040201005, 0.5964912280701754, 0.32323232323232326, 0.0, 0.6026935743673928, 0.06831767667464654, 0.06666666666666667))","Map(vectorType -> dense, length -> 2, values -> List(0.7902558400734617, 0.20974415992653828))",0,0
"Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.46733668341708545, 0.5263157894736842, 0.0, 0.0, 0.5942760661661151, 0.07899231594161199, 0.06666666666666667))","Map(vectorType -> dense, length -> 2, values -> List(0.781961184168235, 0.218038815831765))",0,0
"Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.507537688442211, 0.5614035087719298, 0.17171717171717174, 0.0, 0.3535353444536679, 0.07429547262812182, 0.0))","Map(vectorType -> dense, length -> 2, values -> List(0.8286832200691818, 0.17131677993081817))",0,0
"Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.5125628140703518, 0.6842105263157894, 0.4040404040404041, 0.10638297872340426, 0.5808080658881687, 0.06831767667464654, 0.05))","Map(vectorType -> dense, length -> 2, values -> List(0.7596012157213764, 0.2403987842786236))",0,0
"Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.5125628140703518, 0.7543859649122806, 0.17171717171717174, 0.12411347517730496, 0.4932659677507813, 0.26345003055612803, 0.1))","Map(vectorType -> dense, length -> 2, values -> List(0.7516300815178086, 0.24836991848219137))",0,0
"Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.5276381909547738, 0.5614035087719298, 0.4141414141414142, 0.16784869976359337, 0.698653180706058, 0.04056361585305221, 0.016666666666666666))","Map(vectorType -> dense, length -> 2, values -> List(0.7233900915052113, 0.2766099084947887))",0,0


### <span style="color:#1f77b4">**Save the trained model**</span>

Persist the pipeline model to the Unity Catalog volume so it can be loaded later.


In [0]:
# Write the model to UC storage for reuse.

model.write().overwrite().save(BASE + "/models/diabetes.model")


### <span style="color:#1f77b4">**Load and infer with the saved model**</span>

Load the saved `PipelineModel` and run inference on a new sample record.


In [0]:
# Load the model back and score a new row.

# Import required libraries
from pyspark.ml.pipeline import PipelineModel

persistedModel = PipelineModel.load(BASE + "/models/diabetes.model")
   
newData = spark.createDataFrame ([{"Pregnancies": 8,
                                  "Glucose": 85,
                                  "BloodPressure": 65,
                                  "SkinThickness": 29,
                                  "Insulin": 0,
                                  "BMI": 26.6,
                                  "DiabetesPedigreeFunction": 0.672,
                                  "Age": 34
                                  }])
   
   
predictions = persistedModel.transform(newData)
display(predictions.select("Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",  col("prediction").alias("PredictedOutcome")))


Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,PredictedOutcome
8,85,65,29,0,26.6,0.672,34,0.0
