## Building an ML Model for Binary Classification

The task is to predict whether a patient has diabetes, based on certain diagnostic measurements. All patients in this dataset are women, and the dataset consists of the following columns:
- Pregnancies : number of pregnancies the woman has had
- Glucose : plasma glucose concentration of the patient
- BloodPressure : diastolic blood pressure of the patient in mm Hg
- SkinThickness : triceps skin fold thickness in mm
- Insulin : serum insulin value
- BMI : body mass index of the patient
- DiabetesPedigreeFunction : diabetes pedigree function
- Age : age in years
- Outcome : class variable; 0 denotes absence of diabetes and 1 denotes presence of diabetes (the CSV loaded below stores it as No/Yes, which is indexed to 0/1 during preprocessing)

In [0]:
pip install mlflow

Python interpreter will be restarted.


In [0]:
import mlflow
import pyspark.pandas as ps
import pandas as pd

mlflow.set_experiment("/Users/dhanasree.rajamani@sjsu.edu/diabetes_classification")
target_col = "Outcome"

In [0]:
#pip install databricks-cli
#pip install databricks-cli --upgrade

#### Load Data
Exploring and Understanding Data

In [0]:
#df = pd.read_csv("dbfs:/FileStore/shared_uploads/dhanasree.rajamani@sjsu.edu/diabetes.csv")
#df.head(5)

from pyspark.sql.types import DoubleType, StringType, StructField, StructType, IntegerType, FloatType

schema = StructType([
  StructField("Pregnancies", IntegerType(), False),
  StructField("Glucose", IntegerType(), False),
  StructField("BloodPressure", IntegerType(), False),
  StructField("SkinThickness", IntegerType(), False),
  StructField("Insulin", IntegerType(), False),
  StructField("BMI", FloatType(), False),
  StructField("DiabetesPedigreeFunction", FloatType(), False),
  StructField("Age", IntegerType(), False),
  StructField("Outcome", StringType(), False)
])

dataset = spark.read.format("csv").option("header","true").schema(schema).load("dbfs:/FileStore/shared_uploads/dhanasree.rajamani@sjsu.edu/diabetes___diabetes.csv")
cols = dataset.columns

In [0]:
display(dataset)

Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,Yes
1,85,66,29,0,26.6,0.351,31,No
8,183,64,0,0,23.3,0.672,32,Yes
1,89,66,23,94,28.1,0.167,21,No
0,137,40,35,168,43.1,2.288,33,Yes
5,116,74,0,0,25.6,0.201,30,No
3,78,50,32,88,31.0,0.248,26,Yes
10,115,0,0,0,35.3,0.134,29,No
2,197,70,45,543,30.5,0.158,53,Yes
8,125,96,0,0,0.0,0.232,54,Yes


In [0]:
from pyspark.sql.functions import isnan, when, count, col

# Count NaN values per column (note: isnan does not count nulls)
dataset.select([count(when(isnan(c), c)).alias(c) for c in dataset.columns]).show()

+-----------+-------+-------------+-------------+-------+---+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin|BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+---+------------------------+---+-------+
|          0|      0|            0|            0|      0|  0|                       0|  0|      0|
+-----------+-------+-------------+-------------+-------+---+------------------------+---+-------+
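
The `isnan` check reports no NaN values, but it does not flag zeros. In the ten rows displayed earlier, several measurements such as BloodPressure, SkinThickness, Insulin and BMI are 0, which may be a placeholder for a missing reading (that interpretation is an assumption; nothing in this notebook confirms it). A minimal plain-Python sketch that counts zeros per column over those ten displayed rows only:

```python
# The ten rows shown by display(dataset) above (Outcome column omitted).
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
sample_rows = [
    (6, 148, 72, 35, 0, 33.6, 0.627, 50),
    (1, 85, 66, 29, 0, 26.6, 0.351, 31),
    (8, 183, 64, 0, 0, 23.3, 0.672, 32),
    (1, 89, 66, 23, 94, 28.1, 0.167, 21),
    (0, 137, 40, 35, 168, 43.1, 2.288, 33),
    (5, 116, 74, 0, 0, 25.6, 0.201, 30),
    (3, 78, 50, 32, 88, 31.0, 0.248, 26),
    (10, 115, 0, 0, 0, 35.3, 0.134, 29),
    (2, 197, 70, 45, 543, 30.5, 0.158, 53),
    (8, 125, 96, 0, 0, 0.0, 0.232, 54),
]

# Assumption: a zero in these columns is implausible and may mean "not recorded".
suspect_columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

zero_counts = {}
for name in suspect_columns:
    position = columns.index(name)
    zero_counts[name] = sum(1 for row in sample_rows if row[position] == 0)

print(zero_counts)
```

If the zeros do turn out to be missing readings, they would need handling (for example imputation) before the models below can be trusted; this notebook trains on them as-is.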



In [0]:
cols

Out[6]: ['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age',
 'Outcome']

In [0]:
#pip install databricks-automl-runtime


In [0]:
#pip install category_encoders

In [0]:
print("Number of rows in dataset: ")
dataset.count()  # count is a method; without the parentheses only the bound method is shown

Number of rows in dataset: 
Out[9]: 768

In [0]:
print("Number of columns in dataset: ")
len(dataset.columns)

Number of columns in dataset: 
Out[10]: 9

### Preprocess Data (Data Preparation)
Make sure the data is in numeric format for the logistic regression algorithm.

In [0]:
import pyspark
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler



Use the StringIndexer to encode labels to label indices.

In [0]:
# Encode the string Outcome column (No/Yes) as a numeric index in a new "label" column
label_stringIdx = StringIndexer(inputCol="Outcome", outputCol="label")
stages = [label_stringIdx]
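
By default `StringIndexer` orders labels by descending frequency, so the most frequent label receives index 0.0. The following plain-Python sketch mimics that ordering on a made-up label list; the helper name `fit_label_index`, the sample labels, and the alphabetical tie-break are assumptions of the sketch, not part of the notebook.

```python
from collections import Counter

def fit_label_index(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; break ties alphabetically (assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}

# Hypothetical label column in which "No" is the more frequent class.
hypothetical_labels = ["No", "Yes", "No", "No", "Yes"]
label_to_index = fit_label_index(hypothetical_labels)
print(label_to_index)
```

This is consistent with the prepped data shown later, where Outcome "No" appears with label 0.0 and "Yes" with label 1.0.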

Use a VectorAssembler to combine all the feature columns into a single vector column.
This includes all the columns in the dataset, numeric columns in this case.

In [0]:
# Transform all features into a vector using VectorAssembler
numericCols = ['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI', 'DiabetesPedigreeFunction','Age',]
assemblerInputs = numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

Run the stages as a Pipeline. This puts the data through all of the feature transformations in a single call.

In [0]:
from pyspark.ml.classification import LogisticRegression
  
partialPipeline = Pipeline().setStages(stages)
pipelineModel = partialPipeline.fit(dataset)
preppedDataDF = pipelineModel.transform(dataset)

In [0]:
# Fit model to prepped data
lrModel = LogisticRegression().fit(preppedDataDF)

# ROC for training data
display(lrModel, preppedDataDF, "ROC")


False Positive Rate,True Positive Rate,Threshold
0.0,0.0,0.9493007747298532
0.0,0.0285714285714285,0.9493007747298532
0.0,0.0571428571428571,0.9203493013119196
0.0,0.0857142857142857,0.9063277683853448
0.0138888888888888,0.0857142857142857,0.8999879115347585
0.0138888888888888,0.1142857142857142,0.8977531062952016
0.0138888888888888,0.1428571428571428,0.8806445765830025
0.0138888888888888,0.1714285714285714,0.8730662815563599
0.0138888888888888,0.2,0.8371149495757508
0.0277777777777777,0.2,0.8001355942919259


In [0]:
display(lrModel, preppedDataDF)

fitted values,residuals
-2.973411411994336,-0.0486416145378205
-1.2677922008865747,-0.2196354251353299
2.172504678143269,0.1022468937047984
-1.1847477304467349,0.7658003842699722
-3.0372096204383743,-0.0457728938876003
0.1563135056158877,-0.5389990005436409
-1.7396741483748586,-0.1493543284720484
-0.4332366581350194,0.6066462870751475
-2.055074487788231,-0.1135406358698523
-3.2050946193208536,-0.0389744527323898


In [0]:
# Keep relevant columns
selectedcols = ["label", "features"] + cols
dataset = preppedDataDF.select(selectedcols)
display(dataset)

label,features,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
1.0,"Map(vectorType -> dense, length -> 8, values -> List(6.0, 148.0, 72.0, 35.0, 0.0, 33.599998474121094, 0.6269999742507935, 50.0))",6,148,72,35,0,33.6,0.627,50,Yes
0.0,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 85.0, 66.0, 29.0, 0.0, 26.600000381469727, 0.35100001096725464, 31.0))",1,85,66,29,0,26.6,0.351,31,No
1.0,"Map(vectorType -> dense, length -> 8, values -> List(8.0, 183.0, 64.0, 0.0, 0.0, 23.299999237060547, 0.671999990940094, 32.0))",8,183,64,0,0,23.3,0.672,32,Yes
0.0,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 89.0, 66.0, 23.0, 94.0, 28.100000381469727, 0.16699999570846558, 21.0))",1,89,66,23,94,28.1,0.167,21,No
1.0,"Map(vectorType -> dense, length -> 8, values -> List(0.0, 137.0, 40.0, 35.0, 168.0, 43.099998474121094, 2.2880001068115234, 33.0))",0,137,40,35,168,43.1,2.288,33,Yes
0.0,"Map(vectorType -> dense, length -> 8, values -> List(5.0, 116.0, 74.0, 0.0, 0.0, 25.600000381469727, 0.20100000500679016, 30.0))",5,116,74,0,0,25.6,0.201,30,No
1.0,"Map(vectorType -> dense, length -> 8, values -> List(3.0, 78.0, 50.0, 32.0, 88.0, 31.0, 0.24799999594688416, 26.0))",3,78,50,32,88,31.0,0.248,26,Yes
0.0,"Map(vectorType -> dense, length -> 8, values -> List(10.0, 115.0, 0.0, 0.0, 0.0, 35.29999923706055, 0.1340000033378601, 29.0))",10,115,0,0,0,35.3,0.134,29,No
1.0,"Map(vectorType -> dense, length -> 8, values -> List(2.0, 197.0, 70.0, 45.0, 543.0, 30.5, 0.15800000727176666, 53.0))",2,197,70,45,543,30.5,0.158,53,Yes
1.0,"Map(vectorType -> dense, length -> 8, values -> List(8.0, 125.0, 96.0, 0.0, 0.0, 0.0, 0.23199999332427979, 54.0))",8,125,96,0,0,0.0,0.232,54,Yes
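
The `features` vectors above show values such as 33.599998474121094 where the BMI column shows 33.6. The likely reason is that the schema declares BMI and DiabetesPedigreeFunction as `FloatType` (32-bit), while the assembled vector stores 64-bit doubles, so the 32-bit rounding becomes visible when the value is widened. A small sketch of that effect in plain Python (the helper `as_float32` is hypothetical):

```python
import struct

def as_float32(value):
    """Round a Python float to the nearest 32-bit float, then widen it back to 64 bits."""
    return struct.unpack("f", struct.pack("f", value))[0]

bmi = as_float32(33.6)
print(bmi)               # slightly below 33.6
print(as_float32(30.5))  # 30.5 is exactly representable, so it is unchanged
```

Declaring these columns as `DoubleType` in the schema would be one way to avoid the visible rounding, although the difference is far too small to matter for the models here.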


In [0]:
# Randomly split data into training and test sets; set a seed for reproducibility
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=100)
print(trainingData.count())
print(testData.count())

547
221
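
The weights passed to `randomSplit` are target proportions rather than exact ones, because each row is assigned to a split at random. A quick check of the realized proportions from the two counts printed above:

```python
train_rows = 547
test_rows = 221
total_rows = train_rows + test_rows

train_fraction = train_rows / total_rows
test_fraction = test_rows / total_rows

print(total_rows, round(train_fraction, 3), round(test_fraction, 3))
```

The realized split is roughly 71% / 29%, close to the requested 70% / 30%.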


### Fit and Evaluate Models

Try the following binary classification algorithms available in the Pipelines API:

- Logistic Regression
- Decision Tree Classifier
- Random Forest Classifier

The steps for each of these models are:
- Create initial model using the training set
- Tune parameters with a ParamGrid and 5-fold Cross Validation
- Evaluate the best model obtained from the Cross Validation using the test set
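
The 5-fold cross validation step can be pictured with a plain-Python sketch: the training rows are shuffled into five folds, and each fold takes one turn as the validation set while the other four are used for fitting. The helper `k_fold_indices` is hypothetical, and Spark's `CrossValidator` assigns rows to folds in its own way, so fold sizes there are only approximately equal.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # round-robin assignment to folds
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

# 547 is the size of the training set printed above.
splits = list(k_fold_indices(n_rows=547, k=5))
for training, validation in splits:
    print(len(training), len(validation))
```

Every row is used for validation exactly once, so each parameter combination is scored on data it was not fitted on, and the test set stays untouched until the final evaluation.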

Use the BinaryClassificationEvaluator to evaluate the models, which uses areaUnderROC as the default metric.

### Logistic Regression

In [0]:
from pyspark.ml.classification import LogisticRegression

# Create initial LogisticRegression model
lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10)

# Train model with Training Data
lrModel = lr.fit(trainingData)

In [0]:
# Make predictions on test data using the transform() method.
predictions = lrModel.transform(testData)

In [0]:
# View model's predictions and probabilities of each prediction class
selected = predictions.select("label", "prediction", "probability", 'Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI', 'DiabetesPedigreeFunction','Age')
display(selected)

label,prediction,probability,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9311847278893338, 0.06881527211066618))",6,114,0,0,0,0.0,0.189,26
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9233409523225907, 0.07665904767740928))",0,67,76,0,0,45.3,0.194,46
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9716566453730425, 0.028343354626957473))",0,74,52,10,36,27.8,0.269,22
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9419091382604333, 0.05809086173956668))",0,84,82,31,125,38.2,0.233,23
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9288717533070908, 0.07112824669290918))",0,86,68,32,0,35.8,0.238,25
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.932739454928316, 0.067260545071684))",0,93,60,25,92,28.7,0.532,22
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.805866770141839, 0.19413322985816095))",0,93,100,39,72,43.4,1.021,35
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.963469321231079, 0.03653067876892102))",0,98,82,15,84,25.2,0.299,22
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9612343711222975, 0.03876562887770252))",0,101,64,17,0,21.0,0.252,21
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9917208142213196, 0.008279185778680365))",0,102,75,23,0,0.0,0.572,21


Using BinaryClassificationEvaluator to evaluate the model.

In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate model
evaluator = BinaryClassificationEvaluator()
evaluator.evaluate(predictions)

Out[22]: 0.8320410932351223

In [0]:
evaluator.getMetricName()

Out[23]: 'areaUnderROC'
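
One way to read areaUnderROC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on a toy example; it illustrates the metric's meaning and is not how Spark computes it internally. The function name and the toy data are made up for illustration.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = area_under_roc(toy_labels, toy_scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the 0.832 reported above means the model orders most positive/negative pairs in the test set correctly.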

In [0]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Create ParamGrid for Cross Validation
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 0.5, 2.0])
             .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
             .addGrid(lr.maxIter, [1, 5, 10])
             .build())
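
The grid above multiplies out quickly. A short sketch that enumerates the combinations and counts the model fits a 5-fold cross validation performs over them (the fold count matches the `numFolds=5` used in this notebook; the extra fit is the refit of the best combination on the full training set):

```python
from itertools import product

grid_values = {
    "regParam": [0.01, 0.5, 2.0],
    "elasticNetParam": [0.0, 0.5, 1.0],
    "maxIter": [1, 5, 10],
}

# Every combination of one value per parameter.
combinations = [dict(zip(grid_values, values)) for values in product(*grid_values.values())]

num_folds = 5
fits_during_search = len(combinations) * num_folds
total_fits = fits_during_search + 1  # plus the final refit of the best combination

print(len(combinations), fits_during_search, total_fits)
```

Adding one more value to any of the three lists adds 9 combinations and 45 more fits, which is why grids are usually kept small.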

In [0]:
# Create 5-fold CrossValidator
cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

# Run cross validations
cvModel = cv.fit(trainingData)
# This will likely take a fair amount of time because of the number of models being fitted and evaluated

In [0]:
# Generate predictions on the test set to measure performance (areaUnderROC) on new data
predictions = cvModel.transform(testData)

In [0]:
# cvModel uses the best model found from the Cross Validation
# Evaluate best model
evaluator.evaluate(predictions)

Out[27]: 0.8304904051172699

In [0]:
print('Model Intercept: ', cvModel.bestModel.intercept)

Model Intercept:  -7.558342229780866


In [0]:
weights = cvModel.bestModel.coefficients
weights = [(float(w),) for w in weights]  # convert numpy type to float, and to tuple
weightsDF = spark.createDataFrame(weights, ["Feature Weight"])
display(weightsDF)

Feature Weight
0.1175757486251037
0.0349156263736814
-0.0083449179383371
0.0
0.0
0.0660619620125552
0.5588002197476035
0.0106182226526396


In [0]:
# View best model's predictions and probabilities of each prediction class
selected = predictions.select("label", "prediction", "probability", 'Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI', 'DiabetesPedigreeFunction','Age')
display(selected)

label,prediction,probability,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9234957076563698, 0.07650429234363021))",6,114,0,0,0,0.0,0.189,26
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9058305730995481, 0.09416942690045194))",0,67,76,0,0,45.3,0.194,46
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9603832708670305, 0.03961672913296954))",0,74,52,10,36,27.8,0.269,22
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9177171342007413, 0.0822828657992587))",0,84,82,31,125,38.2,0.233,23
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9136930511397868, 0.08630694886021317))",0,86,68,32,0,35.8,0.238,25
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.915678085254132, 0.08432191474586803))",0,93,60,25,92,28.7,0.532,22
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.7919003787366069, 0.2080996212633931))",0,93,100,39,72,43.4,1.021,35
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9402204343692724, 0.059779565630727616))",0,98,82,15,84,25.2,0.299,22
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9434723510055908, 0.05652764899440921))",0,101,64,17,0,21.0,0.252,21
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.983376926126135, 0.01662307387386497))",0,102,75,23,0,0.0,0.572,21
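
The intercept and feature weights printed above are enough to reproduce a prediction by hand. The sketch below applies the logistic function to the first test row of the table (Pregnancies 6, Glucose 114, and so on), assuming the weights are in the same order as the `VectorAssembler` input columns:

```python
import math

intercept = -7.558342229780866
weights = [0.1175757486251037, 0.0349156263736814, -0.0083449179383371, 0.0, 0.0,
           0.0660619620125552, 0.5588002197476035, 0.0106182226526396]

# Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age
first_test_row = [6, 114, 0, 0, 0, 0.0, 0.189, 26]

# Linear score, then the logistic (sigmoid) function.
margin = intercept + sum(w * x for w, x in zip(weights, first_test_row))
probability_of_1 = 1.0 / (1.0 + math.exp(-margin))

print(round(probability_of_1, 4))
```

The result is about 0.0765, which agrees with the second entry of the `probability` vector in the first row of the table. The two weights that are exactly 0.0 (SkinThickness and Insulin) play no part in the score.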


### Decision Tree

In [0]:
from pyspark.ml.classification import DecisionTreeClassifier

# Create initial Decision Tree Model
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=3)

# Train model with Training Data
dtModel = dt.fit(trainingData)

In [0]:
print("numNodes = ", dtModel.numNodes)
print("depth = ", dtModel.depth)

numNodes =  7
depth =  3


In [0]:
display(dtModel)

treeNode
"{""index"":1,""featureType"":""continuous"",""prediction"":null,""threshold"":128.5,""categories"":null,""feature"":1,""overflow"":false}"
"{""index"":0,""featureType"":null,""prediction"":0.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":5,""featureType"":""continuous"",""prediction"":null,""threshold"":30.15000057220459,""categories"":null,""feature"":5,""overflow"":false}"
"{""index"":3,""featureType"":""continuous"",""prediction"":null,""threshold"":148.5,""categories"":null,""feature"":1,""overflow"":false}"
"{""index"":2,""featureType"":null,""prediction"":0.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":4,""featureType"":null,""prediction"":1.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":6,""featureType"":null,""prediction"":1.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"


In [0]:
# Make predictions on test data using the Transformer.transform() method.
predictions = dtModel.transform(testData)

In [0]:
# View model's predictions and probabilities of each prediction class
selected = predictions.select("label", "prediction", "probability", 'Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI', 'DiabetesPedigreeFunction','Age')
display(selected)

label,prediction,probability,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.7843137254901961, 0.21568627450980393))",6,114,0,0,0,0.0,0.189,26
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.7843137254901961, 0.21568627450980393))",0,67,76,0,0,45.3,0.194,46
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.7843137254901961, 0.21568627450980393))",0,74,52,10,36,27.8,0.269,22
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.7843137254901961, 0.21568627450980393))",0,84,82,31,125,38.2,0.233,23
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.7843137254901961, 0.21568627450980393))",0,86,68,32,0,35.8,0.238,25
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.7843137254901961, 0.21568627450980393))",0,93,60,25,92,28.7,0.532,22
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.7843137254901961, 0.21568627450980393))",0,93,100,39,72,43.4,1.021,35
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.7843137254901961, 0.21568627450980393))",0,98,82,15,84,25.2,0.299,22
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.7843137254901961, 0.21568627450980393))",0,101,64,17,0,21.0,0.252,21
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.7843137254901961, 0.21568627450980393))",0,102,75,23,0,0.0,0.572,21
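
The seven nodes listed by `display(dtModel)` can be read as nested threshold tests on feature 1 (Glucose) and feature 5 (BMI). The sketch below is one reading of that list. It assumes the node indices follow an in-order traversal and that values less than or equal to a threshold go to the left child; neither assumption is stated in the output itself.

```python
def predict_from_tree(glucose, bmi):
    """One reading of the depth-3 tree: node 1 is the root, node 5 its right child."""
    if glucose <= 128.5:              # node 1 -> left leaf (node 0)
        return 0.0
    if bmi <= 30.15000057220459:      # node 5 -> left child (node 3)
        if glucose <= 148.5:          # node 3 -> leaf 2 or leaf 4
            return 0.0
        return 1.0
    return 1.0                        # node 5 -> right leaf (node 6)

print(predict_from_tree(glucose=114, bmi=0.0))
print(predict_from_tree(glucose=197, bmi=30.5))
```

Under this reading, all ten test rows shown above have Glucose at or below 128.5, so they fall into the same leaf. That would explain why they share the prediction 0.0 and the identical probability vector.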


In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Evaluate model
evaluator = BinaryClassificationEvaluator()
evaluator.evaluate(predictions)

Out[36]: 0.6670381856949021

In [0]:
dt.getImpurity()
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
paramGrid = (ParamGridBuilder()
             .addGrid(dt.maxDepth, [1, 2, 6, 10])
             .addGrid(dt.maxBins, [20, 40, 80])
             .build())
# Create 5-fold CrossValidator
cv = CrossValidator(estimator=dt, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

# Run cross validations
cvModel = cv.fit(trainingData)
# Takes ~5 minutes

print("numNodes = ", cvModel.bestModel.numNodes)
print("depth = ", cvModel.bestModel.depth)

predictions = cvModel.transform(testData)
evaluator.evaluate(predictions)

numNodes =  69
depth =  6
Out[37]: 0.7326032176778445

In [0]:
# View model's predictions and probabilities of each prediction class
selected = predictions.select("label", "prediction", "probability", 'Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI', 'DiabetesPedigreeFunction','Age')
display(selected)

label,prediction,probability,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(1.0, 0.0))",6,114,0,0,0,0.0,0.189,26
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(1.0, 0.0))",0,67,76,0,0,45.3,0.194,46
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(1.0, 0.0))",0,74,52,10,36,27.8,0.269,22
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.8181818181818182, 0.18181818181818182))",0,84,82,31,125,38.2,0.233,23
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.8181818181818182, 0.18181818181818182))",0,86,68,32,0,35.8,0.238,25
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(1.0, 0.0))",0,93,60,25,92,28.7,0.532,22
0.0,1.0,"Map(vectorType -> dense, length -> 2, values -> List(0.23076923076923078, 0.7692307692307693))",0,93,100,39,72,43.4,1.021,35
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(1.0, 0.0))",0,98,82,15,84,25.2,0.299,22
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(1.0, 0.0))",0,101,64,17,0,21.0,0.252,21
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(1.0, 0.0))",0,102,75,23,0,0.0,0.572,21


### Random Forest

In [0]:
from pyspark.ml.classification import RandomForestClassifier

# Create an initial RandomForest model.
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

# Train model with Training Data
rfModel = rf.fit(trainingData)

In [0]:
# Make predictions on test data using the Transformer.transform() method.
predictions = rfModel.transform(testData)

In [0]:
selected = predictions.select("label", "prediction", "probability", "age", "DiabetesPedigreeFunction", "Insulin")
display(selected)

label,prediction,probability,age,DiabetesPedigreeFunction,Insulin
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.8988697716056989, 0.10113022839430097))",26,0.189,0
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.8251192459438442, 0.1748807540561558))",46,0.194,0
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9493215294963834, 0.05067847050361653))",22,0.269,36
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.781355127114681, 0.21864487288531903))",23,0.233,125
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.812374267863617, 0.18762573213638303))",25,0.238,0
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9354076524714816, 0.06459234752851833))",22,0.532,92
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.6801218266707203, 0.31987817332927976))",35,1.021,72
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9722555789643952, 0.02774442103560481))",22,0.299,84
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9207397327855783, 0.07926026721442171))",21,0.252,0
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9616353371811828, 0.0383646628188173))",21,0.572,0
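
Unlike the single decision tree, whose probabilities repeat for every row that lands in the same leaf, the forest's probabilities vary from row to row. The general idea is that each tree produces its own class probabilities and the forest combines them. The sketch below uses simple averaging with made-up per-tree outputs; it shows the general technique and is not a description of Spark's exact internals.

```python
def average_tree_probabilities(per_tree_probabilities):
    """Average the class-probability vectors produced by the individual trees."""
    num_trees = len(per_tree_probabilities)
    num_classes = len(per_tree_probabilities[0])
    return [
        sum(tree[c] for tree in per_tree_probabilities) / num_trees
        for c in range(num_classes)
    ]

# Hypothetical outputs of three trees for one patient: [P(label 0), P(label 1)].
hypothetical_tree_outputs = [[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]]

forest_probability = average_tree_probabilities(hypothetical_tree_outputs)
prediction = float(forest_probability.index(max(forest_probability)))
print(forest_probability, prediction)
```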


In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate model
evaluator = BinaryClassificationEvaluator()
evaluator.evaluate(predictions)

Out[42]: 0.8304904051172708

In [0]:
# Create ParamGrid for Cross Validation
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

paramGrid = (ParamGridBuilder()
             .addGrid(rf.maxDepth, [2, 4, 6])
             .addGrid(rf.maxBins, [20, 60])
             .addGrid(rf.numTrees, [5, 20])
             .build())

In [0]:
# Create 5-fold CrossValidator
cv = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

# Run cross validations.  This can take about 6 minutes since it is training over 20 trees!
cvModel = cv.fit(trainingData)

In [0]:
# Generate predictions on the test set to measure performance (areaUnderROC) on new data
predictions = cvModel.transform(testData)

In [0]:
# cvModel uses the best model found from the Cross Validation
# Evaluate best model
evaluator.evaluate(predictions)

Out[46]: 0.8253052917232018

In [0]:
# View model's predictions and probabilities of each prediction class
selected = predictions.select("label", "prediction", "probability", 'Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI', 'DiabetesPedigreeFunction','Age')
display(selected)

label,prediction,probability,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.8318891249745025, 0.1681108750254974))",6,114,0,0,0,0.0,0.189,26
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.7059235811285653, 0.29407641887143454))",0,67,76,0,0,45.3,0.194,46
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.8924501319181182, 0.10754986808188174))",0,74,52,10,36,27.8,0.269,22
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.8298820308219442, 0.1701179691780557))",0,84,82,31,125,38.2,0.233,23
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.8132417644863087, 0.18675823551369125))",0,86,68,32,0,35.8,0.238,25
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9001342501872402, 0.09986574981275977))",0,93,60,25,92,28.7,0.532,22
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.7481116403268051, 0.2518883596731949))",0,93,100,39,72,43.4,1.021,35
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9403739878007563, 0.059626012199243704))",0,98,82,15,84,25.2,0.299,22
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9072946791951318, 0.09270532080486811))",0,101,64,17,0,21.0,0.252,21
0.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(0.9264647196809619, 0.07353528031903815))",0,102,75,23,0,0.0,0.572,21


Use the best model for deployment and to make predictions.
Logistic Regression gives the best test-set areaUnderROC of the three algorithms (0.832 for the initial model), so the bestModel obtained from Logistic Regression is the one to deploy and to use for generating predictions on new data.

In [0]:
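# Note: cvModel was last reassigned in the Random Forest section, so this picks the tuned
# Random Forest model. To deploy the Logistic Regression model instead, keep its
# CrossValidatorModel under a separate variable name before fitting the other models.
bestModel = cvModel.bestModel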

In [0]:
finalPredictions = bestModel.transform(dataset)

In [0]:
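# dataset includes the training rows, so this score is optimistic compared with the test-set scores
evaluator.evaluate(finalPredictions)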

Out[50]: 0.8734440298507473
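
To make the model comparison explicit, the test-set areaUnderROC values reported in the sections above can be collected and ranked. The dictionary keys are labels chosen for this sketch; the numbers are copied from the notebook outputs.

```python
test_auc = {
    "Logistic Regression (initial)": 0.8320410932351223,
    "Logistic Regression (cross-validated)": 0.8304904051172699,
    "Decision Tree (initial)": 0.6670381856949021,
    "Decision Tree (cross-validated)": 0.7326032176778445,
    "Random Forest (initial)": 0.8304904051172708,
    "Random Forest (cross-validated)": 0.8253052917232018,
}

best_name = max(test_auc, key=test_auc.get)

# Print the models from best to worst test-set areaUnderROC.
for name, auc in sorted(test_auc.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {auc:.4f}")
print("Best:", best_name)
```

The top four scores lie within about 0.007 of each other on a 221-row test set, so the ranking among Logistic Regression and Random Forest should be read with some caution.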