<a href="https://colab.research.google.com/github/Hanifanta/Logistic-Regression-pyspark/blob/main/BDL_Regresi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Import pyspark**

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
!tar xf spark-3.0.0-bin-hadoop3.2.tgz

!pip install -q findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"

!pip install pyspark

In [None]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("spark").getOrCreate()

## **Load dataset**

Dataset:

https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset

In [None]:
# create a Spark DataFrame from the CSV file
df = spark.read.csv('/content/diabetes.csv', header=True, inferSchema = True)

## **Exploratory Data Analysis**

In [None]:
# display the contents of the DataFrame
df.show()

+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|          6|    148|           72|           35|      0|33.6|                   0.627| 50|      1|
|          1|     85|           66|           29|      0|26.6|                   0.351| 31|      0|
|          8|    183|           64|            0|      0|23.3|                   0.672| 32|      1|
|          1|     89|           66|           23|     94|28.1|                   0.167| 21|      0|
|          0|    137|           40|           35|    168|43.1|                   2.288| 33|      1|
|          5|    116|           74|            0|      0|25.6|                   0.201| 30|      0|
|          3|     78|           50|           32|     88|31.0|                   0.248| 26|      1|


In [None]:
# display the schema of the DataFrame
df.printSchema()

root
 |-- Pregnancies: integer (nullable = true)
 |-- Glucose: integer (nullable = true)
 |-- BloodPressure: integer (nullable = true)
 |-- SkinThickness: integer (nullable = true)
 |-- Insulin: integer (nullable = true)
 |-- BMI: double (nullable = true)
 |-- DiabetesPedigreeFunction: double (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Outcome: integer (nullable = true)



In [None]:
# print the row and column counts, then count diabetic vs. non-diabetic rows in the 'Outcome' column
print(df.count(), len(df.columns))
df.groupBy('Outcome').count().show()

768 9
+-------+-----+
|Outcome|count|
+-------+-----+
|      1|  268|
|      0|  500|
+-------+-----+
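
The class counts above are unbalanced, which matters when reading any later score. As a rough sketch (plain Python, using only the counts printed above; the variable names are illustrative), the share of positive rows and the accuracy of a trivial "always predict the majority class" baseline can be worked out like this:

```python
# Class counts copied from the groupBy('Outcome') output above.
counts = {1: 268, 0: 500}

total = sum(counts.values())

# Share of rows labeled diabetic (Outcome == 1).
positive_rate = counts[1] / total

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = max(counts.values()) / total

print(f"total rows: {total}")
print(f"positive rate: {positive_rate:.4f}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

The positive rate agrees with the mean of the Outcome column reported by `df.describe()`, and a classifier needs to beat roughly 0.65 accuracy to improve on the trivial baseline.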



In [None]:
# show summary statistics for the DataFrame
df.describe().show()

+-------+------------------+-----------------+------------------+------------------+------------------+------------------+------------------------+------------------+------------------+
|summary|       Pregnancies|          Glucose|     BloodPressure|     SkinThickness|           Insulin|               BMI|DiabetesPedigreeFunction|               Age|           Outcome|
+-------+------------------+-----------------+------------------+------------------+------------------+------------------+------------------------+------------------+------------------+
|  count|               768|              768|               768|               768|               768|               768|                     768|               768|               768|
|   mean|3.8450520833333335|     120.89453125|       69.10546875|20.536458333333332| 79.79947916666667|31.992578124999977|      0.4718763020833327|33.240885416666664|0.3489583333333333|
| stddev|  3.36957806269887|31.97261819513622|19.355807170644777|15.95

In [None]:
# count null values in each column
for col in df.columns:
  print(col+":",df[df[col].isNull()].count())

Pregnancies: 0
Glucose: 0
BloodPressure: 0
SkinThickness: 0
Insulin: 0
BMI: 0
DiabetesPedigreeFunction: 0
Age: 0
Outcome: 0


In [None]:
# count 0 values in the columns where 0 is not a plausible measurement
def count_zeros():
  columns_list = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
  for i in columns_list:
    print(i+":",df[df[i]==0].count())

In [None]:
count_zeros()

Glucose: 5
BloodPressure: 35
SkinThickness: 227
Insulin: 374
BMI: 11


In [None]:
# replace 0 values with the column mean
# note: the mean is computed over all rows (zeros included) and truncated by int()
from pyspark.sql.functions import when
for i in df.columns[1:6]:
  data = df.agg({i:'mean'}).first()[0]
  print("mean value for {} is {}".format(i,int(data)))
  df = df.withColumn(i,when(df[i]==0,int(data)).otherwise(df[i]))

mean value for Glucose is 120
mean value for BloodPressure is 69
mean value for SkinThickness is 20
mean value for Insulin is 79
mean value for BMI is 31
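
The loop above takes each column mean over all rows, so the zeros being replaced also pull the fill value down. The following plain-Python sketch is not the notebook's method. It only compares the two choices on the Insulin values of the first seven rows shown earlier, and the helper names `mean_for_imputation` and `fill_zeros` are hypothetical.

```python
def mean_for_imputation(values, skip_zeros):
    """Return the mean used as a fill value, optionally ignoring zeros."""
    kept = [v for v in values if v != 0] if skip_zeros else list(values)
    if not kept:
        raise ValueError("no values left to average")
    return sum(kept) / len(kept)


def fill_zeros(values, fill_value):
    """Replace every 0 with fill_value and leave other values unchanged."""
    return [fill_value if v == 0 else v for v in values]


# Insulin values of the first seven rows from the df.show() output.
insulin_sample = [0, 0, 0, 94, 168, 0, 88]

mean_with_zeros = mean_for_imputation(insulin_sample, skip_zeros=False)
mean_without_zeros = mean_for_imputation(insulin_sample, skip_zeros=True)

print("mean including zeros:", mean_with_zeros)
print("mean excluding zeros:", mean_without_zeros)
print("filled with truncated mean:", fill_zeros(insulin_sample, int(mean_with_zeros)))
```

On this small sample the two means differ by more than a factor of two, so which one to use is a modeling decision worth making deliberately.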


In [None]:
# display the DataFrame again after the replacement
df.show()

+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|          6|    148|           72|           35|     79|33.6|                   0.627| 50|      1|
|          1|     85|           66|           29|     79|26.6|                   0.351| 31|      0|
|          8|    183|           64|           20|     79|23.3|                   0.672| 32|      1|
|          1|     89|           66|           23|     94|28.1|                   0.167| 21|      0|
|          0|    137|           40|           35|    168|43.1|                   2.288| 33|      1|
|          5|    116|           74|           20|     79|25.6|                   0.201| 30|      0|
|          3|     78|           50|           32|     88|31.0|                   0.248| 26|      1|


### **Correlation**

In [None]:
# compute the correlation between each column and the 'Outcome' column
for i in df.columns:
  print("Correlation between Outcome and {}: {}".format(i,df.stat.corr('Outcome',i)))

Correlation between Outcome and Pregnancies: 0.22189815303398638
Correlation between Outcome and Glucose: 0.49288410274882094
Correlation between Outcome and BloodPressure: 0.16287909949861834
Correlation between Outcome and SkinThickness: 0.171856814176564
Correlation between Outcome and Insulin: 0.17869558803050842
Correlation between Outcome and BMI: 0.31289043493401536
Correlation between Outcome and DiabetesPedigreeFunction: 0.17384406565296007
Correlation between Outcome and Age: 0.23835598302719757
Correlation between Outcome and Outcome: 1.0
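
`df.stat.corr` computes the Pearson correlation coefficient. As a minimal plain-Python sketch of that formula (the function name `pearson` is illustrative), applied to the Glucose and Outcome values of the first seven rows shown earlier:

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Glucose and Outcome values of the first seven rows from the df.show() output.
glucose = [148, 85, 183, 89, 137, 116, 78]
outcome = [1, 0, 1, 0, 1, 0, 1]

print("Glucose vs Outcome:", pearson(glucose, outcome))
print("Outcome vs Outcome:", pearson(outcome, outcome))
```

A column correlated with itself always gives 1.0, which is why the last line of the output above carries no information. Seven rows are far too few to reproduce the full-dataset values.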


### **Feature Selection**

The features selected for modeling are:

**['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age']**

In [None]:
# feature selection: assemble the selected columns into a single 'features' vector column
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age'],outputCol='features')
output_data = assembler.transform(df)
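
Conceptually, `VectorAssembler` concatenates the listed columns of each row, in order, into one numeric vector. The sketch below imitates that in plain Python for the first row of the DataFrame after the zero replacement. The `assemble` helper is hypothetical and ignores details such as Spark's sparse vector representation.

```python
FEATURE_COLUMNS = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                   'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']


def assemble(row, input_cols):
    """Concatenate the named columns of one row into a list of floats, in order."""
    return [float(row[name]) for name in input_cols]


# First row of the DataFrame after zeros were replaced by the column means.
first_row = {'Pregnancies': 6, 'Glucose': 148, 'BloodPressure': 72,
             'SkinThickness': 35, 'Insulin': 79, 'BMI': 33.6,
             'DiabetesPedigreeFunction': 0.627, 'Age': 50, 'Outcome': 1}

features = assemble(first_row, FEATURE_COLUMNS)
print(features)
```

The label column Outcome is deliberately left out of the vector, since it is what the model has to predict.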

In [None]:
# display the schema after assembling the features
output_data.printSchema()

root
 |-- Pregnancies: integer (nullable = true)
 |-- Glucose: integer (nullable = true)
 |-- BloodPressure: integer (nullable = true)
 |-- SkinThickness: integer (nullable = true)
 |-- Insulin: integer (nullable = true)
 |-- BMI: double (nullable = true)
 |-- DiabetesPedigreeFunction: double (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Outcome: integer (nullable = true)
 |-- features: vector (nullable = true)



In [None]:
# display the DataFrame after assembling the features
output_data.show()

+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+--------------------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|            features|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+--------------------+
|          6|    148|           72|           35|     79|33.6|                   0.627| 50|      1|[6.0,148.0,72.0,3...|
|          1|     85|           66|           29|     79|26.6|                   0.351| 31|      0|[1.0,85.0,66.0,29...|
|          8|    183|           64|           20|     79|23.3|                   0.672| 32|      1|[8.0,183.0,64.0,2...|
|          1|     89|           66|           23|     94|28.1|                   0.167| 21|      0|[1.0,89.0,66.0,23...|
|          0|    137|           40|           35|    168|43.1|                   2.288| 33|      1|[0.0,137.0,40.0,3...|
|          5|    116|           

In [None]:
# create a new DataFrame containing only the 'features' and 'Outcome' columns
from pyspark.ml.classification import LogisticRegression
final_data = output_data.select('features','Outcome')

In [None]:
# display the schema of 'final_data'
final_data.printSchema()

root
 |-- features: vector (nullable = true)
 |-- Outcome: integer (nullable = true)



## **Split dataset**

In [None]:
# split the dataset into roughly 70% training and 30% test data
train, test = final_data.randomSplit([0.7,0.3])
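
`randomSplit` assigns rows at random instead of cutting at an exact count, so the split sizes are only approximately 70/30 (70% of 768 is 537.6, while the training summary further down reports 542 rows). Without a `seed` argument the split also changes on every run. The plain-Python sketch below illustrates the idea of per-row random assignment with a fixed seed. It is a simplified model, not Spark's actual sampler, and `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to a split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative upper bounds, e.g. about [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            if u < bound:
                splits[i].append(row)
                break
        else:
            # Guard against floating-point round-off in the last bound.
            splits[-1].append(row)
    return splits


rows = list(range(768))  # stand-in for the 768 rows of the dataset
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```

In the notebook itself, passing a seed, for example `final_data.randomSplit([0.7,0.3], seed=42)`, would make the split repeatable between runs. The value 42 is an arbitrary choice.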

## **Modeling**

In this modeling step, we use LogisticRegression to predict whether a patient is diabetic or not.

In [None]:
models = LogisticRegression(labelCol='Outcome')
model = models.fit(train)

In [None]:
# inspect the training summary of the fitted model
summary = model.summary
summary.predictions.describe().show()

+-------+------------------+-------------------+
|summary|           Outcome|         prediction|
+-------+------------------+-------------------+
|  count|               542|                542|
|   mean|0.3505535055350554| 0.2822878228782288|
| stddev|0.4775840964904406|0.45052846992996176|
|    min|               0.0|                0.0|
|    max|               1.0|                1.0|
+-------+------------------+-------------------+



## **Model Evaluation**

In [None]:
# evaluate the model on the test set; model.evaluate returns a summary object
# (BinaryClassificationEvaluator is imported here and used in a later cell)
from pyspark.ml.evaluation import BinaryClassificationEvaluator
predictions = model.evaluate(test)

In [None]:
predictions.predictions.show(20)

+--------------------+-------+--------------------+--------------------+----------+
|            features|Outcome|       rawPrediction|         probability|prediction|
+--------------------+-------+--------------------+--------------------+----------+
|[0.0,67.0,76.0,20...|      0|[2.77625916267863...|[0.94137934961120...|       0.0|
|[0.0,74.0,52.0,10...|      0|[4.04344191916773...|[0.98276523868226...|       0.0|
|[0.0,86.0,68.0,32...|      0|[2.78115304630996...|[0.94164883236902...|       0.0|
|[0.0,91.0,68.0,32...|      0|[2.37239863200590...|[0.91469820105114...|       0.0|
|[0.0,94.0,70.0,27...|      0|[1.71615517296890...|[0.84763293519777...|       0.0|
|[0.0,99.0,69.0,20...|      0|[3.49130752176337...|[0.97043942744512...|       0.0|
|[0.0,101.0,64.0,1...|      0|[3.80637448286474...|[0.97825474381092...|       0.0|
|[0.0,101.0,65.0,2...|      0|[3.33269853324334...|[0.96553368545205...|       0.0|
|[0.0,107.0,60.0,2...|      0|[2.98188382096029...|[0.95174895542512...|    
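
The three output columns are related: for binary logistic regression, the probability is the logistic (sigmoid) function of the raw prediction, and the predicted class is the one whose probability exceeds the threshold, which is 0.5 by default. A small sketch, assuming the first entry of each vector refers to class 0 and using the raw value as displayed (truncated) in the first row above:

```python
import math


def sigmoid(z):
    """Logistic function mapping a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# First entry of rawPrediction in the first displayed row (truncated in the output).
raw_class0 = 2.77625916267863

p_class0 = sigmoid(raw_class0)   # probability of class 0 (not diabetic)
p_class1 = 1.0 - p_class0        # probability of class 1 (diabetic)

# Default decision rule: predict class 1 only if its probability exceeds 0.5.
prediction = 1.0 if p_class1 > 0.5 else 0.0

print(p_class0, p_class1, prediction)
```

The computed class 0 probability agrees with the first entry of the probability column in that row, and the decision matches the prediction of 0.0.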

In [None]:
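# display the area under the ROC curve (the default metric of BinaryClassificationEvaluator)
evaluator = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction',labelCol='Outcome')
print('Area under ROC:', evaluator.evaluate(model.transform(test)))

Area under ROC: 0.8119369369369378


The evaluator reports an area under the ROC curve (AUC) of about 0.81 on the test set. This is not an accuracy, because BinaryClassificationEvaluator uses areaUnderROC as its default metric. An AUC of 0.81 suggests that this LogisticRegression model separates diabetic from non-diabetic patients reasonably well.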
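
The value printed by BinaryClassificationEvaluator is an area under the ROC curve, which is a different quantity from accuracy. The plain-Python sketch below shows how the two can differ on the same predictions. The labels and scores are made-up illustrative values, not results from this notebook, and both helper functions are hypothetical.

```python
def accuracy(labels, predictions):
    """Fraction of rows where the predicted class equals the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def auc_roc(labels, scores):
    """Probability that a random positive row is scored above a random negative row.

    Ties count as half a win. This pairwise form equals the area under the ROC curve.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and class-1 probabilities, for illustration only.
labels = [0, 0, 0, 1, 1, 0]
scores = [0.10, 0.20, 0.30, 0.35, 0.80, 0.40]
preds = [1 if s > 0.5 else 0 for s in scores]

print("accuracy:", accuracy(labels, preds))
print("AUC:", auc_roc(labels, scores))
```

Accuracy depends on the 0.5 threshold, while AUC depends only on how the scores rank positives against negatives. To report accuracy in PySpark, one option is `MulticlassClassificationEvaluator(labelCol='Outcome', metricName='accuracy')` from `pyspark.ml.evaluation`.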

In [None]:
# save the trained model
model.save('model')

## **Model Testing**

In [None]:
# load the saved model
from pyspark.ml.classification import LogisticRegressionModel
model = LogisticRegressionModel.load('model')

In [None]:
# create a new Spark DataFrame for testing
testing = spark.read.csv('/content/data_testing.csv',header=True, inferSchema =True)

In [None]:
# display the schema of the DataFrame
testing.printSchema()

root
 |-- Pregnancies: integer (nullable = true)
 |-- Glucose: integer (nullable = true)
 |-- BloodPressure: integer (nullable = true)
 |-- SkinThickness: integer (nullable = true)
 |-- Insulin: integer (nullable = true)
 |-- BMI: double (nullable = true)
 |-- DiabetesPedigreeFunction: double (nullable = true)
 |-- Age: integer (nullable = true)



In [None]:
# add the combined 'features' vector column
test_data = assembler.transform(testing)

In [None]:
# display the schema
test_data.printSchema()

root
 |-- Pregnancies: integer (nullable = true)
 |-- Glucose: integer (nullable = true)
 |-- BloodPressure: integer (nullable = true)
 |-- SkinThickness: integer (nullable = true)
 |-- Insulin: integer (nullable = true)
 |-- BMI: double (nullable = true)
 |-- DiabetesPedigreeFunction: double (nullable = true)
 |-- Age: integer (nullable = true)
 |-- features: vector (nullable = true)



In [None]:
# use the saved model to make predictions
results = model.transform(test_data)
results.printSchema()

root
 |-- Pregnancies: integer (nullable = true)
 |-- Glucose: integer (nullable = true)
 |-- BloodPressure: integer (nullable = true)
 |-- SkinThickness: integer (nullable = true)
 |-- Insulin: integer (nullable = true)
 |-- BMI: double (nullable = true)
 |-- DiabetesPedigreeFunction: double (nullable = true)
 |-- Age: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [None]:
# display the predictions
results.select('features','prediction').show()

+--------------------+----------+
|            features|prediction|
+--------------------+----------+
|[1.0,190.0,78.0,3...|       1.0|
|[0.0,80.0,84.0,36...|       0.0|
|[2.0,138.0,82.0,4...|       1.0|
|[1.0,110.0,63.0,4...|       1.0|
+--------------------+----------+



Above are the predictions for the data we created to test the model we built. The prediction column contains 1 and 0, where 1 means diabetic and 0 means not diabetic.