<a href="https://colab.research.google.com/github/Praxis-QR/BDSN/blob/main/ML_Pipeline_2_Diabetes_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![CC-BY-SA](https://licensebuttons.net/l/by-sa/3.0/88x31.png)<br>
<hr>

![alt text](https://github.com/Praxis-QR/RDWH/raw/main/images/YantraJaalBanner.png)<br>


<hr>

[Prithwis Mukerjee](http://www.linkedin.com/in/prithwis)<br>

In [1]:
from datetime import datetime
import pytz
print('Tested',datetime.now(pytz.timezone('Asia/Calcutta')))

Tested 2023-12-18 05:46:18.918904+05:30


# Common Pipelines for Logistic Regression, Random Forest, Gradient Boost <br>
[What are Pipelines & PipelineModels?](https://spark.apache.org/docs/latest/ml-pipeline.html)
![alt text](https://raw.githubusercontent.com/Praxis-QR/BDSN/main/images/pipeline.png)<br>

# Spark Install

In [2]:
!pip3 -q install pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Praxis').master("local[*]").getOrCreate()
sc = spark.sparkContext
#sc



## Imports

In [3]:
from pyspark.sql.functions import mean
from pyspark.ml.feature import (VectorAssembler,OneHotEncoder, StringIndexer)
from pyspark.ml import Pipeline
from pyspark.ml import PipelineModel
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator


# Data

## Load Data

In [4]:
!wget -O heart.csv -q https://raw.githubusercontent.com/Praxis-QR/BDSN/main/Documents/BigML_Heart_Dataset.csv

In [5]:
#loading dataset into spark dataframe
heartDF = spark.read.csv('heart.csv', inferSchema=True,header=True)
heartDF.show(5)

+------+----+------------+-------------+---------------+-----+--------+
|gender| age|hypertension|heart_disease|smoking_history|  BMI|diabetes|
+------+----+------------+-------------+---------------+-----+--------+
|Female|80.0|           0|            1|          never|25.19|       0|
|Female|54.0|           0|            0|           NULL| NULL|       0|
|  Male|28.0|           0|            0|          never| NULL|       0|
|Female|36.0|           0|            0|        current|23.45|       0|
|  Male|76.0|           1|            1|        current|20.14|       0|
+------+----+------------+-------------+---------------+-----+--------+
only showing top 5 rows



## Data Quick Look


In [6]:
print((heartDF.count(),len(heartDF.columns)))

(100000, 7)


In [7]:
heartDF.printSchema()

root
 |-- gender: string (nullable = true)
 |-- age: double (nullable = true)
 |-- hypertension: integer (nullable = true)
 |-- heart_disease: integer (nullable = true)
 |-- smoking_history: string (nullable = true)
 |-- BMI: double (nullable = true)
 |-- diabetes: integer (nullable = true)



In [8]:
#descriptive analysis
heartDF.describe().show()

+-------+------+-----------------+------------------+------------------+---------------+------------------+-------------------+
|summary|gender|              age|      hypertension|     heart_disease|smoking_history|               BMI|           diabetes|
+-------+------+-----------------+------------------+------------------+---------------+------------------+-------------------+
|  count|100000|           100000|            100000|            100000|          64184|             74556|             100000|
|   mean|  NULL|41.88585600000013|           0.07485|           0.03942|           NULL|27.321028891034764|              0.085|
| stddev|  NULL|22.51683987161704|0.2631504702289171|0.1945930169980986|           NULL| 7.686295651045002|0.27888308976661896|
|    min|Female|             0.08|                 0|                 0|        current|             10.01|                  0|
|    max| Other|             80.0|                 1|                 1|    not current|             95.

In [9]:
#diabetic and non-diabetic count
heartDF.groupBy('diabetes').count().show()

+--------+-----+
|diabetes|count|
+--------+-----+
|       1| 8500|
|       0|91500|
+--------+-----+



## Using Spark SQL

In [10]:
# create DataFrame as a temporary view
heartDF.createOrReplaceTempView('heart_T')

In [11]:
#group by gender
spark.sql(\
          "SELECT \
           gender, count(gender) as count_gender, \
           count(gender)*100/sum(count(gender)) over() as percent  \
           FROM heart_T GROUP BY gender" \
           ).show()

+------+------------+-------+
|gender|count_gender|percent|
+------+------------+-------+
|Female|       58552| 58.552|
| Other|          18|  0.018|
|  Male|       41430|  41.43|
+------+------------+-------+



In [12]:
#group by gender having diabetes
spark.sql(\
          "SELECT gender, count(gender), \
          round((COUNT(gender) * 100.0) /(SELECT count(gender) FROM heart_T ),2) as percentage \
          FROM heart_T WHERE diabetes = '1'  GROUP BY gender"\
          ).show()

+------+-------------+----------+
|gender|count(gender)|percentage|
+------+-------------+----------+
|Female|         4461|      4.46|
|  Male|         4039|      4.04|
+------+-------------+----------+
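
The percentages above are divided by the whole population (100,000 rows), so they give the share of all patients who are, say, diabetic women. A per-gender rate can be easier to compare. The sketch below is plain Python rather than Spark and uses only the counts printed in the outputs above; the variable names are illustrative.

```python
# Counts copied from the Spark outputs above
gender_totals = {"Female": 58552, "Male": 41430}
diabetic_counts = {"Female": 4461, "Male": 4039}

# Share of each gender that is diabetic, in percent
rate_by_gender = {
    gender: round(100.0 * diabetic_counts[gender] / gender_totals[gender], 2)
    for gender in gender_totals
}
print(rate_by_gender)
```

On these counts, men make up a smaller share of the dataset but show a higher diabetes rate within their own group.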



In [13]:
#group by gender having heart disease
spark.sql(\
          "SELECT gender, count(gender), \
          round((COUNT(gender) * 100.0) /(SELECT count(gender) FROM heart_T ),2) as percentage \
          FROM heart_T WHERE heart_disease = '1'  GROUP BY gender"\
          ).show()

+------+-------------+----------+
|gender|count(gender)|percentage|
+------+-------------+----------+
|Female|         1562|      1.56|
|  Male|         2380|      2.38|
+------+-------------+----------+



In [14]:
#group by gender having hypertension
spark.sql(\
          "SELECT gender, count(gender), \
          round((COUNT(gender) * 100.0) /(SELECT count(gender) FROM heart_T ),2) as percentage \
          FROM heart_T WHERE hypertension = '1'  GROUP BY gender"\
          ).show()

+------+-------------+----------+
|gender|count(gender)|percentage|
+------+-------------+----------+
|Female|         4197|      4.20|
|  Male|         3288|      3.29|
+------+-------------+----------+



In [15]:
#count of different types of smoker
heartDF.groupBy('smoking_history').count().show()

+---------------+-----+
|smoking_history|count|
+---------------+-----+
|    not current| 6447|
|           NULL|35816|
|         former| 9352|
|        current| 9286|
|          never|35095|
|           ever| 4004|
+---------------+-----+



In [16]:
#share of each smoking_history category (count() skips NULLs, so the NULL row shows 0)
spark.sql(\
          "SELECT smoking_history, count(smoking_history) as count, \
          round((COUNT(smoking_history) * 100.0) /(SELECT count(smoking_history) FROM heart_T ),2) as percentage \
          FROM heart_T   GROUP BY smoking_history"\
          ).show()

+---------------+-----+----------+
|smoking_history|count|percentage|
+---------------+-----+----------+
|    not current| 6447|     10.04|
|           NULL|    0|      0.00|
|         former| 9352|     14.57|
|        current| 9286|     14.47|
|          never|35095|     54.68|
|           ever| 4004|      6.24|
+---------------+-----+----------+
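
The NULL row shows a count of 0 because `count(smoking_history)` skips NULL values, both inside each group and in the subquery used as the denominator. The percentages are therefore shares of the 64,184 non-null rows, not of all 100,000. The plain-Python sketch below reproduces the figures from the `groupBy` counts shown earlier and, for contrast, recomputes the shares with NULL kept as its own category; the variable names are illustrative.

```python
# Counts copied from the groupBy('smoking_history') output; None stands for NULL
counts = {
    "not current": 6447,
    None: 35816,
    "former": 9352,
    "current": 9286,
    "never": 35095,
    "ever": 4004,
}

non_null_total = sum(c for label, c in counts.items() if label is not None)
all_rows = sum(counts.values())

# Denominator used by the SQL query: non-null rows only
pct_of_non_null = {
    label: round(100.0 * c / non_null_total, 2)
    for label, c in counts.items()
    if label is not None
}

# Alternative denominator: every row, with NULL treated as a category
pct_of_all = {label: round(100.0 * c / all_rows, 2) for label, c in counts.items()}

print(non_null_total, all_rows)
print(pct_of_non_null)
print(pct_of_all)
```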



In [17]:
#Age vs Diabetes
spark.sql("SELECT age, count(age) as age_count FROM heart_T WHERE diabetes == 1 GROUP BY age ORDER BY age_count DESC").show()

+----+---------+
| age|age_count|
+----+---------+
|80.0|     1024|
|62.0|      258|
|61.0|      250|
|66.0|      241|
|67.0|      236|
|65.0|      234|
|57.0|      233|
|59.0|      216|
|60.0|      213|
|64.0|      211|
|68.0|      208|
|69.0|      206|
|58.0|      205|
|63.0|      202|
|55.0|      201|
|71.0|      192|
|54.0|      191|
|56.0|      187|
|74.0|      184|
|70.0|      183|
+----+---------+
only showing top 20 rows



In [18]:
#count of diabetic patients over age 50
heartDF.filter((heartDF['diabetes'] == 1) & (heartDF['age'] > 50)).count()


6650

## Data Preprocessing

In [19]:
#checking null values
heartDF.toPandas().isnull().sum()

gender                 0
age                    0
hypertension           0
heart_disease          0
smoking_history    35816
BMI                25444
diabetes               0
dtype: int64

In [20]:
# fill in missing values for smoking_history
heartDF2 = heartDF.na.fill('No Info', subset=['smoking_history'])
# fill in missing values for BMI with the column mean

cmean = heartDF2.select(mean(heartDF2['BMI'])).collect()
meanBMI = cmean[0][0]
heartDF2 = heartDF2.na.fill(meanBMI,['BMI'])

In [21]:
heartDF2.describe().show()
# note, mean BMI has not changed, but std of BMI has reduced, as expected

+-------+------+-----------------+------------------+------------------+---------------+------------------+-------------------+
|summary|gender|              age|      hypertension|     heart_disease|smoking_history|               BMI|           diabetes|
+-------+------+-----------------+------------------+------------------+---------------+------------------+-------------------+
|  count|100000|           100000|            100000|            100000|         100000|            100000|             100000|
|   mean|  NULL|41.88585600000013|           0.07485|           0.03942|           NULL|27.321028891031315|              0.085|
| stddev|  NULL|22.51683987161704|0.2631504702289171|0.1945930169980986|           NULL|  6.63678340151884|0.27888308976661896|
|    min|Female|             0.08|                 0|                 0|        No Info|             10.01|                  0|
|    max| Other|             80.0|                 1|                 1|    not current|             95.
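
Why the standard deviation falls can be worked out directly. Every filled-in value equals the mean, so it adds nothing to the sum of squared deviations, while the number of rows in the denominator grows from 74,556 to 100,000. The sketch below assumes `describe()` reports the sample standard deviation (with an n-1 denominator) and checks the prediction against the value printed above; the variable names are illustrative.

```python
import math

# Figures copied from the two describe() outputs
n_observed = 74556                      # rows with a BMI value before imputation
n_total = 100000                        # rows after imputation
std_before = 7.686295651045002
std_after_reported = 6.63678340151884

# Sum of squared deviations is unchanged; only the (n - 1) denominator grows
std_after_predicted = std_before * math.sqrt((n_observed - 1) / (n_total - 1))

print(std_after_predicted, std_after_reported)
```

The two values agree to several decimal places, which supports the note in the code cell. The tiny change in the printed mean (in the last few digits) is consistent with floating-point rounding.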

In [22]:
heartDF2.toPandas().isnull().sum()

gender             0
age                0
hypertension       0
heart_disease      0
smoking_history    0
BMI                0
diabetes           0
dtype: int64

In [23]:
heartDF2.dtypes

[('gender', 'string'),
 ('age', 'double'),
 ('hypertension', 'int'),
 ('heart_disease', 'int'),
 ('smoking_history', 'string'),
 ('BMI', 'double'),
 ('diabetes', 'int')]

# Serial Data Conversions <br>
see https://spark.apache.org/docs/latest/ml-features.html

## String Indexer

In [24]:
# we have two categorical variables: gender and smoking history
# indexing all categorical columns in the dataset

GenderIndexer = StringIndexer(inputCol="gender", outputCol="genderIndex")                       # Estimator, i.e. an Unfitted Transformer
SmokeHistIndexer = StringIndexer(inputCol="smoking_history", outputCol="smoking_statusIndex")   # Estimator, i.e. an Unfitted Transformer

In [25]:
heartDF2.show(5)

+------+----+------------+-------------+---------------+------------------+--------+
|gender| age|hypertension|heart_disease|smoking_history|               BMI|diabetes|
+------+----+------------+-------------+---------------+------------------+--------+
|Female|80.0|           0|            1|          never|             25.19|       0|
|Female|54.0|           0|            0|        No Info|27.321028891034764|       0|
|  Male|28.0|           0|            0|          never|27.321028891034764|       0|
|Female|36.0|           0|            0|        current|             23.45|       0|
|  Male|76.0|           1|            1|        current|             20.14|       0|
+------+----+------------+-------------+---------------+------------------+--------+
only showing top 5 rows



In [26]:
# Using any one String Indexer
#GenderIndexer.fit(heartDF2).transform(heartDF2).show()
SmokeHistIndexer.fit(heartDF2).transform(heartDF2).show(5)                                   # Fitted Transformer - SmokeHistIndexer.fit(heartDF2)

+------+----+------------+-------------+---------------+------------------+--------+-------------------+
|gender| age|hypertension|heart_disease|smoking_history|               BMI|diabetes|smoking_statusIndex|
+------+----+------------+-------------+---------------+------------------+--------+-------------------+
|Female|80.0|           0|            1|          never|             25.19|       0|                1.0|
|Female|54.0|           0|            0|        No Info|27.321028891034764|       0|                0.0|
|  Male|28.0|           0|            0|          never|27.321028891034764|       0|                1.0|
|Female|36.0|           0|            0|        current|             23.45|       0|                3.0|
|  Male|76.0|           1|            1|        current|             20.14|       0|                3.0|
+------+----+------------+-------------+---------------+------------------+--------+-------------------+
only showing top 5 rows
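
The index values follow from `StringIndexer`'s default ordering, which assigns 0.0 to the most frequent label and counts upward as frequency falls. The plain-Python sketch below is a toy imitation, not Spark's implementation. It uses the category counts printed earlier, with the 35,816 NULLs now relabelled `No Info`, and reproduces the indices seen in the table.

```python
# Category counts after the NULLs were replaced with 'No Info'
counts = {
    "No Info": 35816,
    "never": 35095,
    "former": 9352,
    "current": 9286,
    "not current": 6447,
    "ever": 4004,
}

# Most frequent label first, then assign 0.0, 1.0, 2.0, ...
labels_by_frequency = sorted(counts, key=lambda label: -counts[label])
toy_index = {label: float(i) for i, label in enumerate(labels_by_frequency)}

print(toy_index)
```

Because the mapping depends on the data the indexer was fitted on, fitting on a different subset could in principle give a different ordering.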



In [27]:
# Putting TWO indexers, the start of the so-called Pipeline
                                                                        # Fitted Transformer 1 - SmokeHistIndexer.fit(heartDF2)
                                                                        # Fitted Transformer 2 - GenderIndexer.fit(SmokeHistIndexer.fit(heartDF2).transform(heartDF2))

GenderIndexer.fit(SmokeHistIndexer.fit(heartDF2).transform(heartDF2)).transform(SmokeHistIndexer.fit(heartDF2).transform(heartDF2)).show(5)

+------+----+------------+-------------+---------------+------------------+--------+-------------------+-----------+
|gender| age|hypertension|heart_disease|smoking_history|               BMI|diabetes|smoking_statusIndex|genderIndex|
+------+----+------------+-------------+---------------+------------------+--------+-------------------+-----------+
|Female|80.0|           0|            1|          never|             25.19|       0|                1.0|        0.0|
|Female|54.0|           0|            0|        No Info|27.321028891034764|       0|                0.0|        0.0|
|  Male|28.0|           0|            0|          never|27.321028891034764|       0|                1.0|        1.0|
|Female|36.0|           0|            0|        current|             23.45|       0|                3.0|        0.0|
|  Male|76.0|           1|            1|        current|             20.14|       0|                3.0|        1.0|
+------+----+------------+-------------+---------------+--------
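
The nested call above is what a `Pipeline` automates: each stage is fitted on the output of the stages before it, and the fitted stages are then applied in the same order. The sketch below is a pure-Python toy meant only to show that control flow. The `Toy*` classes are hypothetical and far simpler than Spark's; breaking ties alphabetically is a choice made for this sketch.

```python
from collections import Counter


class ToyIndexer:
    """Estimator-like stage: fit() learns a label -> index map."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        freq = Counter(row[self.input_col] for row in rows)
        # Most frequent label gets 0.0; ties are broken alphabetically
        ordered = sorted(freq, key=lambda label: (-freq[label], label))
        mapping = {label: float(i) for i, label in enumerate(ordered)}
        return ToyIndexerModel(self.input_col, self.output_col, mapping)


class ToyIndexerModel:
    """Fitted stage: transform() adds the index column."""

    def __init__(self, input_col, output_col, mapping):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = mapping

    def transform(self, rows):
        return [
            {**row, self.output_col: self.mapping[row[self.input_col]]}
            for row in rows
        ]


class ToyPipeline:
    """Fits each stage on the output of the previous fitted stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted, current = [], rows
        for stage in self.stages:
            model = stage.fit(current)
            current = model.transform(current)
            fitted.append(model)
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    """Applies the fitted stages in order."""

    def __init__(self, models):
        self.models = models

    def transform(self, rows):
        for model in self.models:
            rows = model.transform(rows)
        return rows


rows = [
    {"gender": "Female", "smoking_history": "never"},
    {"gender": "Female", "smoking_history": "No Info"},
    {"gender": "Male", "smoking_history": "never"},
]
pipe = ToyPipeline([
    ToyIndexer("smoking_history", "smoking_statusIndex"),
    ToyIndexer("gender", "genderIndex"),
])
result = pipe.fit(rows).transform(rows)
print(result)
```

With this structure the caller writes one `fit` and one `transform`, however many stages are chained.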

## One Hot Encoder

In [28]:
OHE_Gender = OneHotEncoder(inputCols=["genderIndex"], outputCols=["genderVec"])
#ohe.fit(??).transform(??).show()
            # Unfitted Transformer - OHE_Gender
            # Fitted Transformer - OHE_Gender.fit(GenderIndexer.fit(heartDF2).transform(heartDF2))

OHE_Gender.fit(GenderIndexer.fit(heartDF2).transform(heartDF2)).transform(GenderIndexer.fit(heartDF2).transform(heartDF2)).show(5)

+------+----+------------+-------------+---------------+------------------+--------+-----------+-------------+
|gender| age|hypertension|heart_disease|smoking_history|               BMI|diabetes|genderIndex|    genderVec|
+------+----+------------+-------------+---------------+------------------+--------+-----------+-------------+
|Female|80.0|           0|            1|          never|             25.19|       0|        0.0|(2,[0],[1.0])|
|Female|54.0|           0|            0|        No Info|27.321028891034764|       0|        0.0|(2,[0],[1.0])|
|  Male|28.0|           0|            0|          never|27.321028891034764|       0|        1.0|(2,[1],[1.0])|
|Female|36.0|           0|            0|        current|             23.45|       0|        0.0|(2,[0],[1.0])|
|  Male|76.0|           1|            1|        current|             20.14|       0|        1.0|(2,[1],[1.0])|
+------+----+------------+-------------+---------------+------------------+--------+-----------+-------------+
o

In [29]:
OHE_SmokeStat = OneHotEncoder(inputCols=["smoking_statusIndex"], outputCols=["smoking_statusVec"])
#ohe.fit(??).transform(??).show()
       # Unfitted Transformer - OHE_SmokeStat
       # Fitted Transformer - OHE_SmokeStat.fit(SmokeHistIndexer.fit(heartDF2).transform(heartDF2))
OHE_SmokeStat.fit(SmokeHistIndexer.fit(heartDF2).transform(heartDF2)).transform(SmokeHistIndexer.fit(heartDF2).transform(heartDF2)).show(5)

+------+----+------------+-------------+---------------+------------------+--------+-------------------+-----------------+
|gender| age|hypertension|heart_disease|smoking_history|               BMI|diabetes|smoking_statusIndex|smoking_statusVec|
+------+----+------------+-------------+---------------+------------------+--------+-------------------+-----------------+
|Female|80.0|           0|            1|          never|             25.19|       0|                1.0|    (5,[1],[1.0])|
|Female|54.0|           0|            0|        No Info|27.321028891034764|       0|                0.0|    (5,[0],[1.0])|
|  Male|28.0|           0|            0|          never|27.321028891034764|       0|                1.0|    (5,[1],[1.0])|
|Female|36.0|           0|            0|        current|             23.45|       0|                3.0|    (5,[3],[1.0])|
|  Male|76.0|           1|            1|        current|             20.14|       0|                3.0|    (5,[3],[1.0])|
+------+----+---
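
The vectors are printed in sparse form as `(size, [indices], [values])`. The size is one less than the number of categories because `OneHotEncoder` drops the last category by default (`dropLast=True`): three genders give a vector of size 2 and six smoking categories give size 5. The sketch below is a toy imitation in plain Python, not Spark's code, and `toy_one_hot` is a hypothetical helper.

```python
def toy_one_hot(index, num_categories, drop_last=True):
    """Return a sparse (size, indices, values) triple for a category index."""
    size = num_categories - 1 if drop_last else num_categories
    position = int(index)
    if position < size:
        return (size, [position], [1.0])
    # The dropped last category is encoded as an all-zero vector
    return (size, [], [])


print(toy_one_hot(0.0, 3))   # gender index 0.0 (Female)
print(toy_one_hot(1.0, 6))   # smoking index 1.0 (never)
print(toy_one_hot(5.0, 6))   # smoking index 5.0, the dropped last category
```

Dropping one category avoids a redundant column, since the last category is implied when all other positions are zero.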

In [30]:
# Encoding Both Simultaneously
OHE_Gender_Smoke = OneHotEncoder(inputCols=["genderIndex","smoking_statusIndex"],
                                 outputCols=["genderVec","smoking_statusVec"])

In [31]:
#The 'Pipeline' becomes even longer
# but still ohe.fit(??).transform(??).show()
       # OHE_Gender_Smoke.fit(what?a).transform(what?b)
       # what?a = GenderIndexer.fit(SmokeHistIndexer.fit(heartDF2).transform(heartDF2)).transform(SmokeHistIndexer.fit(heartDF2).transform(heartDF2))
       # what?b = GenderIndexer.fit(SmokeHistIndexer.fit(heartDF2).transform(heartDF2)).transform(SmokeHistIndexer.fit(heartDF2).transform(heartDF2))
OHE_Gender_Smoke.fit(GenderIndexer.fit(SmokeHistIndexer.fit(heartDF2).transform(heartDF2)).transform(SmokeHistIndexer.fit(heartDF2).transform(heartDF2))).transform(GenderIndexer.fit(SmokeHistIndexer.fit(heartDF2).transform(heartDF2)).transform(SmokeHistIndexer.fit(heartDF2).transform(heartDF2))).show(5)

+------+----+------------+-------------+---------------+------------------+--------+-------------------+-----------+-------------+-----------------+
|gender| age|hypertension|heart_disease|smoking_history|               BMI|diabetes|smoking_statusIndex|genderIndex|    genderVec|smoking_statusVec|
+------+----+------------+-------------+---------------+------------------+--------+-------------------+-----------+-------------+-----------------+
|Female|80.0|           0|            1|          never|             25.19|       0|                1.0|        0.0|(2,[0],[1.0])|    (5,[1],[1.0])|
|Female|54.0|           0|            0|        No Info|27.321028891034764|       0|                0.0|        0.0|(2,[0],[1.0])|    (5,[0],[1.0])|
|  Male|28.0|           0|            0|          never|27.321028891034764|       0|                1.0|        1.0|(2,[1],[1.0])|    (5,[1],[1.0])|
|Female|36.0|           0|            0|        current|             23.45|       0|                3.0|  

## Assembler

In [32]:
F_assembler = VectorAssembler(inputCols=['genderVec',
 'age',
 'hypertension',
 'heart_disease',
 'BMI',
 'smoking_statusVec'],outputCol='features')

In [33]:
#assembler.transform(??).show()
F_assembler.transform(OHE_Gender_Smoke.fit(GenderIndexer.fit(SmokeHistIndexer.fit(heartDF2).transform(heartDF2)).transform(SmokeHistIndexer.fit(heartDF2).transform(heartDF2))).transform(GenderIndexer.fit(SmokeHistIndexer.fit(heartDF2).transform(heartDF2)).transform(SmokeHistIndexer.fit(heartDF2).transform(heartDF2)))).show(5)

+------+----+------------+-------------+---------------+------------------+--------+-------------------+-----------+-------------+-----------------+--------------------+
|gender| age|hypertension|heart_disease|smoking_history|               BMI|diabetes|smoking_statusIndex|genderIndex|    genderVec|smoking_statusVec|            features|
+------+----+------------+-------------+---------------+------------------+--------+-------------------+-----------+-------------+-----------------+--------------------+
|Female|80.0|           0|            1|          never|             25.19|       0|                1.0|        0.0|(2,[0],[1.0])|    (5,[1],[1.0])|(11,[0,2,4,5,7],[...|
|Female|54.0|           0|            0|        No Info|27.321028891034764|       0|                0.0|        0.0|(2,[0],[1.0])|    (5,[0],[1.0])|(11,[0,2,5,6],[1....|
|  Male|28.0|           0|            0|          never|27.321028891034764|       0|                1.0|        1.0|(2,[1],[1.0])|    (5,[1],[1.0])|(1
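
The `features` column has 11 slots: 2 for `genderVec`, 1 each for `age`, `hypertension`, `heart_disease` and `BMI`, and 5 for `smoking_statusVec`. It is also printed in sparse form, so zero entries are left out. The sketch below is a plain-Python toy (`toy_assemble` is a hypothetical helper, not Spark's `VectorAssembler`) that concatenates the inputs in the listed order and reproduces the index lists shown for the first two rows.

```python
def toy_assemble(parts):
    """Concatenate scalars and sparse (size, indices, values) triples."""
    offset, indices, values = 0, [], []
    for part in parts:
        if isinstance(part, tuple):
            size, idxs, vals = part
            for i, v in zip(idxs, vals):
                indices.append(offset + i)
                values.append(v)
            offset += size
        else:
            if part != 0:          # zeros are omitted in the sparse form
                indices.append(offset)
                values.append(float(part))
            offset += 1
    return (offset, indices, values)


# Row 1: Female, age 80, no hypertension, heart disease, BMI 25.19, 'never'
row1 = [(2, [0], [1.0]), 80.0, 0, 1, 25.19, (5, [1], [1.0])]
# Row 2: Female, age 54, no hypertension, no heart disease, imputed BMI, 'No Info'
row2 = [(2, [0], [1.0]), 54.0, 0, 0, 27.321028891034764, (5, [0], [1.0])]

features1 = toy_assemble(row1)
features2 = toy_assemble(row2)
print(features1)
print(features2)
```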

# Train-Test Split

In [34]:
# splitting training and validation data
train_heart,val_heart = heartDF2.randomSplit([0.7,0.3])
print(train_heart.count())
print(val_heart.count())

69933
30067


# Apply Different Techniques

In [35]:
basePipe = Pipeline(stages=[GenderIndexer, SmokeHistIndexer, OHE_Gender_Smoke, F_assembler])
basePipe.fit(heartDF2).transform(heartDF2).show(5)

+------+----+------------+-------------+---------------+------------------+--------+-----------+-------------------+-------------+-----------------+--------------------+
|gender| age|hypertension|heart_disease|smoking_history|               BMI|diabetes|genderIndex|smoking_statusIndex|    genderVec|smoking_statusVec|            features|
+------+----+------------+-------------+---------------+------------------+--------+-----------+-------------------+-------------+-----------------+--------------------+
|Female|80.0|           0|            1|          never|             25.19|       0|        0.0|                1.0|(2,[0],[1.0])|    (5,[1],[1.0])|(11,[0,2,4,5,7],[...|
|Female|54.0|           0|            0|        No Info|27.321028891034764|       0|        0.0|                0.0|(2,[0],[1.0])|    (5,[0],[1.0])|(11,[0,2,5,6],[1....|
|  Male|28.0|           0|            0|          never|27.321028891034764|       0|        1.0|                1.0|(2,[1],[1.0])|    (5,[1],[1.0])|(1

## Logistic Regression Model Pipeline

In [36]:
lr = LogisticRegression(labelCol='diabetes',featuresCol='features',maxIter=5)
#lr_pipeline = Pipeline(stages=[GenderIndexer, SmokeHistIndexer, OHE_Gender_Smoke, F_assembler,lr])
lr_pipeline = Pipeline(stages=[basePipe,lr])                    # Appending a stage to a pipeline
# training model pipeline with data
lr_model = lr_pipeline.fit(train_heart)
lr_predictions=lr_model.transform(val_heart)

In [37]:
lr_predictions.show(5)

+------+----+------------+-------------+---------------+------------------+--------+-----------+-------------------+-------------+-----------------+--------------------+--------------------+--------------------+----------+
|gender| age|hypertension|heart_disease|smoking_history|               BMI|diabetes|genderIndex|smoking_statusIndex|    genderVec|smoking_statusVec|            features|       rawPrediction|         probability|prediction|
+------+----+------------+-------------+---------------+------------------+--------+-----------+-------------------+-------------+-----------------+--------------------+--------------------+--------------------+----------+
|Female|0.08|           0|            0|        No Info|             11.88|       0|        0.0|                0.0|(2,[0],[1.0])|    (5,[0],[1.0])|(11,[0,2,5,6],[1....|[6.81498723998303...|[0.99890399534772...|       0.0|
|Female|0.08|           0|            0|        No Info|             12.22|       0|        0.0|            

In [38]:
acc_evaluator = MulticlassClassificationEvaluator(labelCol="diabetes", predictionCol="prediction", metricName="accuracy")
evaluator = BinaryClassificationEvaluator(labelCol='diabetes')

In [39]:
lr_acc=acc_evaluator.evaluate(lr_predictions)
#print('A Logistic Regression algorithm had an accuracy of: {0:2.2f}%'.format(lr_acc*100))
print(round(lr_acc,3), 'is the accuracy of the LR pipeline')
lr_auroc = evaluator.evaluate(lr_predictions, {evaluator.metricName: "areaUnderROC"})
#auprc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderPR"})
print("Area under ROC Curve: {:.4f}".format(lr_auroc))
#print("Area under PR Curve: {:.4f}".format(auprc))


0.915 is the accuracy of the LR pipeline
Area under ROC Curve: 0.8341


## Random Forest Pipeline

In [40]:
rf = RandomForestClassifier(featuresCol = 'features', labelCol = 'diabetes')
#rf_pipeline = Pipeline(stages=[GenderIndexer, SmokeHistIndexer, OHE_Gender_Smoke, F_assembler, rf])
rf_pipeline = Pipeline(stages=[basePipe, rf])
# training model pipeline with data
rf_model = rf_pipeline.fit(train_heart)
rf_predictions=rf_model.transform(val_heart)

In [41]:
rf_acc=acc_evaluator.evaluate(rf_predictions)
print('A Random Forest algorithm had an accuracy of: {0:2.2f}%'.format(rf_acc*100))
# We have only two choices: area under ROC and PR curves :-(
rf_auroc = evaluator.evaluate(rf_predictions, {evaluator.metricName: "areaUnderROC"})
#auprc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderPR"})
print("Area under ROC Curve: {:.4f}".format(rf_auroc))
#print("Area under PR Curve: {:.4f}".format(auprc))

A Random Forest algorithm had an accuracy of: 91.46%
Area under ROC Curve: 0.8220


## Gradient Boost Model Pipeline

In [42]:
gbt = GBTClassifier(labelCol='diabetes',featuresCol='features')
#gbt_pipeline = Pipeline(stages=[GenderIndexer, SmokeHistIndexer, OHE_Gender_Smoke, F_assembler, gbt])
gbt_pipeline = Pipeline(stages=[basePipe, gbt])
gbt_model = gbt_pipeline.fit(train_heart)
gbt_predictions = gbt_model.transform(val_heart)

In [43]:
gbt_acc = acc_evaluator.evaluate(gbt_predictions)
print('Gradient Boost algorithm had an accuracy of: {0:2.2f}%'.format(gbt_acc*100))
# We have only two choices: area under ROC and PR curves :-(
gbt_auroc = evaluator.evaluate(gbt_predictions, {evaluator.metricName: "areaUnderROC"})
#auprc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderPR"})
print("Area under ROC Curve: {:.4f}".format(gbt_auroc))
#print("Area under PR Curve: {:.4f}".format(auprc))

Gradient Boost algorithm had an accuracy of: 91.51%
Area under ROC Curve: 0.8357


# Comparison of LR, RF, GBT

In [44]:
print(round(lr_acc,3), 'is the accuracy of the LR pipeline')
print(round(rf_acc,3), 'is the accuracy of the RF pipeline')
print(round(gbt_acc,3), 'is the accuracy of the GBT pipeline')

print(round(lr_auroc,3), 'is area under ROC curve of the LR pipeline')
print(round(rf_auroc,3), 'is area under ROC curve of the RF pipeline')
print(round(gbt_auroc,3), 'is area under ROC curve of the GBT pipeline')

0.915 is the accuracy of the LR pipeline
0.915 is the accuracy of the RF pipeline
0.915 is the accuracy of the GBT pipeline
0.834 is area under ROC curve of the LR pipeline
0.822 is area under ROC curve of the RF pipeline
0.836 is area under ROC curve of the GBT pipeline
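
All three accuracies round to 0.915, which is also the share of non-diabetic rows in the full dataset (91,500 of 100,000, from the class counts printed earlier). A model that always predicts "not diabetic" would score about the same. The class mix in the validation split may differ slightly, so this is a rough baseline rather than an exact one. The plain-Python sketch below computes it; the variable names are illustrative.

```python
# Class counts copied from the groupBy('diabetes') output
class_counts = {0: 91500, 1: 8500}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of always predicting the majority class
baseline_accuracy = class_counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 3))
```

On data this imbalanced, accuracy alone says little about how well the models separate the two classes. The area under the ROC curve does distinguish the three pipelines.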


# Convert Once, Apply Different Techniques <br>
The shared preprocessing stages are fitted once and reused for every model, without having to fit them again. See https://stackoverflow.com/questions/49337830/spark-add-new-fitted-stage-to-a-exitsting-pipelinemodel-without-fitting-again




In [45]:
commonEstimator = Pipeline(stages=[GenderIndexer, SmokeHistIndexer, OHE_Gender_Smoke, F_assembler]) # A pipeline is an estimator
commonModel = commonEstimator.fit(train_heart)                                                      # A fitted pipeline is a model

commonTrainHeart = commonModel.transform(train_heart)
commonTrainHeart.show(5)

+------+----+------------+-------------+---------------+-----+--------+-----------+-------------------+-------------+-----------------+--------------------+
|gender| age|hypertension|heart_disease|smoking_history|  BMI|diabetes|genderIndex|smoking_statusIndex|    genderVec|smoking_statusVec|            features|
+------+----+------------+-------------+---------------+-----+--------+-----------+-------------------+-------------+-----------------+--------------------+
|Female|0.08|           0|            0|        No Info| 12.5|       0|        0.0|                0.0|(2,[0],[1.0])|    (5,[0],[1.0])|(11,[0,2,5,6],[1....|
|Female|0.08|           0|            0|        No Info|12.77|       0|        0.0|                0.0|(2,[0],[1.0])|    (5,[0],[1.0])|(11,[0,2,5,6],[1....|
|Female|0.08|           0|            0|        No Info|12.82|       0|        0.0|                0.0|(2,[0],[1.0])|    (5,[0],[1.0])|(11,[0,2,5,6],[1....|
|Female|0.08|           0|            0|        No Info|13

## Logistic Regression

In [46]:
lr0 = LogisticRegression(labelCol='diabetes',featuresCol='features',maxIter=5)
lr0Estimator = Pipeline(stages=[lr0])                               # A pipeline is an estimator
lr0model = lr0Estimator.fit(commonTrainHeart)                       # A fitted pipeline is a model

lr1model = PipelineModel(stages = [commonModel , lr0model])         # Adding two models together
lr1model_predictions = lr1model.transform(val_heart)

In [47]:
acc_evaluator = MulticlassClassificationEvaluator(labelCol="diabetes", predictionCol="prediction", metricName="accuracy")
lr1_acc=acc_evaluator.evaluate(lr1model_predictions)

print(round(lr1_acc,3), 'is the accuracy of the new LR pipeline')
evaluator = BinaryClassificationEvaluator(labelCol='diabetes')
lr1_auroc = evaluator.evaluate(lr1model_predictions, {evaluator.metricName: "areaUnderROC"})
print("Area under ROC Curve: {:.4f}".format(lr1_auroc))


0.915 is the accuracy of the new LR pipeline
Area under ROC Curve: 0.8341


## Random Forest

In [48]:
rf0 = RandomForestClassifier(featuresCol = 'features', labelCol = 'diabetes')
rf0Estimator = Pipeline(stages=[rf0])                               # A pipeline is an estimator
rf0model = rf0Estimator.fit(commonTrainHeart)                       # A fitted pipeline is a model

rf1model = PipelineModel(stages = [commonModel , rf0model])         # Adding two models together
rf1model_predictions = rf1model.transform(val_heart)


In [49]:
acc_evaluator = MulticlassClassificationEvaluator(labelCol="diabetes", predictionCol="prediction", metricName="accuracy")
rf1_acc=acc_evaluator.evaluate(rf1model_predictions)

print(round(rf1_acc,3), 'is the accuracy of the new RF pipeline')
evaluator = BinaryClassificationEvaluator(labelCol='diabetes')
rf1_auroc = evaluator.evaluate(rf1model_predictions, {evaluator.metricName: "areaUnderROC"})
print("Area under ROC Curve: {:.4f}".format(rf1_auroc))

0.915 is the accuracy of the new RF pipeline
Area under ROC Curve: 0.8220


## Gradient Boost

In [50]:
gbt0 = GBTClassifier(featuresCol = 'features', labelCol = 'diabetes')
gbt0Estimator = Pipeline(stages=[gbt0])                               # A pipeline is an estimator
gbt0model = gbt0Estimator.fit(commonTrainHeart)                       # A fitted pipeline is a model

gbt1model = PipelineModel(stages = [commonModel , gbt0model])         # Adding two models together
gbt1model_predictions = gbt1model.transform(val_heart)

In [51]:
acc_evaluator = MulticlassClassificationEvaluator(labelCol="diabetes", predictionCol="prediction", metricName="accuracy")
gbt1_acc=acc_evaluator.evaluate(gbt1model_predictions)

print(round(gbt1_acc,3), 'is the accuracy of the new GBT pipeline')
evaluator = BinaryClassificationEvaluator(labelCol='diabetes')
gbt1_auroc = evaluator.evaluate(gbt1model_predictions, {evaluator.metricName: "areaUnderROC"})
print("Area under ROC Curve: {:.4f}".format(gbt1_auroc))

0.915 is the accuracy of the new GBT pipeline
Area under ROC Curve: 0.8357


In [52]:
!date
from datetime import datetime
import pytz
print('Tested',datetime.now(pytz.timezone('Asia/Calcutta')))

Mon Dec 18 12:20:21 AM UTC 2023
Tested 2023-12-18 05:50:21.247746+05:30

