# Heart Disease Prediction Model

Heart disease is a leading cause of death worldwide, and identifying individuals who are at risk for developing heart disease is crucial for early intervention and prevention. This project aims to build a heart disease prediction model using machine learning techniques. The model will use various patient attributes such as age, gender, blood pressure, cholesterol levels, and other clinical features to predict the likelihood of a patient developing heart disease in the future.

The heart disease prediction model will be built using a dataset of patient information and medical records. The dataset will be preprocessed and cleaned to remove missing data, outliers, and other inconsistencies. The model will be trained using a supervised learning approach, with a machine learning algorithm such as RandomForestClassifier.

The final heart disease prediction model will be evaluated using BinaryClassificationEvaluator. The project aims to provide a useful tool for doctors and healthcare professionals to identify patients at high risk of developing heart disease and to take early preventive measures to reduce the risk.

importing necessary libraries

In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('HeartDisease').getOrCreate()
from pyspark.sql.functions import col, when, corr
from pyspark.ml.linalg import Vector
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

importing dataset

In [0]:
df = spark.sql("SELECT * FROM heart_disease_data_2_csv")

About the Dataset

1. ChestPainType:

                  TA - Typical Angina

                  ATA - Atypical Angina

                  NAP - Non-Anginal Pain

                  ASY - Asymptomatic

2. Resting BP: Resting blood pressure (mm Hg)

3. Cholesterol: Serum cholesterol (mg/dl)

4. Fasting BS: Fasting blood sugar

               1 - Fasting BS > 120 mg/dl

               0 - Fasting BS <= 120 mg/dl

5. Resting ECG: Resting electrocardiogram results

                Normal - Normal

                ST - Having ST-T wave abnormality (T wave inversion and/or ST elevation or depression of > 0.05 mV)

                LVH - Showing probable or definite left ventricular hypertrophy by Estes' criteria

6. Max HR:     Maximum heart rate achieved (numeric value between 60 and 202)

7. ST Slope:  The slope of the peak exercise ST segment

              Up - upsloping

              Down - downsloping

              Flat - flat

data exploration

In [0]:
#counting the number of rows in the dataset
df.count()

Out[3]: 918

In [0]:
#checking the datatypes
df.printSchema()

root
 |-- Age: integer (nullable = true)
 |-- Sex: string (nullable = true)
 |-- ChestPainType: string (nullable = true)
 |-- RestingBP: integer (nullable = true)
 |-- Cholesterol: integer (nullable = true)
 |-- FastingBS: integer (nullable = true)
 |-- RestingECG: string (nullable = true)
 |-- MaxHR: integer (nullable = true)
 |-- ExerciseAngina: string (nullable = true)
 |-- Oldpeak: double (nullable = true)
 |-- ST_Slope: string (nullable = true)
 |-- HeartDisease: integer (nullable = true)



In [0]:
#displaying the first and last 5 rows
display(df.head(5))
display(df.tail(5))

Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
45,M,TA,110,264,0,Normal,132,N,1.2,Flat,1
68,M,ASY,144,193,1,Normal,141,N,3.4,Flat,1
57,M,ASY,130,131,0,Normal,115,Y,1.2,Flat,1
57,F,ATA,130,236,0,LVH,174,N,0.0,Flat,1
38,M,NAP,138,175,0,Normal,173,N,0.0,Up,0


In [0]:
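#checking whether the DataFrame is empty (isEmpty() only reports whether there are no rows; it does not check for null values)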
df.isEmpty()

Out[6]: False

In [0]:
#describing the data
display(df.describe())

summary,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
count,918.0,918,918,918.0,918.0,918.0,918,918.0,918,918.0,918,918.0
mean,53.510893246187365,,,132.39651416122004,198.7995642701525,0.233115468409586,,136.80936819172112,,0.8873638344226581,,0.5533769063180828
stddev,9.43261650673202,,,18.514154119907808,109.38414455220344,0.4230456247393029,,25.46033413825029,,1.0665701510493264,,0.497413738284597
min,28.0,F,ASY,0.0,0.0,0.0,LVH,60.0,N,-2.6,Down,0.0
max,77.0,M,TA,200.0,603.0,1.0,ST,202.0,Y,6.2,Up,1.0
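
The `count` row above reports 918 for every column, which matches the row count, so no column contains nulls. The `min` row, however, shows 0 for both RestingBP and Cholesterol. Those values are not physiologically plausible and probably stand in for missing measurements. A minimal plain-Python sketch of counting such placeholders follows; the rows and the helper name `count_placeholders` are made up for illustration and are not taken from the dataset.

```python
# Hypothetical rows for illustration only; they are NOT taken from the dataset.
rows = [
    {"RestingBP": 140, "Cholesterol": 289},
    {"RestingBP": 0, "Cholesterol": 180},
    {"RestingBP": 130, "Cholesterol": 0},
    {"RestingBP": 120, "Cholesterol": None},
]


def count_placeholders(rows, columns, placeholder=0):
    """Count, per column, the values that are None or equal to the placeholder."""
    counts = {}
    for column in columns:
        counts[column] = sum(
            1
            for row in rows
            if row[column] is None or row[column] == placeholder
        )
    return counts


placeholder_counts = count_placeholders(rows, ["RestingBP", "Cholesterol"])
print(placeholder_counts)
```

Whether such rows should be dropped or imputed is a modelling decision that this notebook leaves open.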


In [0]:
#finding out how many patients have/ don't have heart disease in the dataset
dff = df.withColumn('HeartDisease_Str', col('HeartDisease').cast('string'))
dff = dff.replace('0', 'False', 'HeartDisease_Str').replace('1', 'True', 'HeartDisease_Str')
display(dff)

Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease,HeartDisease_Str
40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0,False
49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1,True
37,M,ATA,130,283,0,ST,98,N,0.0,Up,0,False
48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1,True
54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0,False
39,M,NAP,120,339,0,Normal,170,N,0.0,Up,0,False
45,F,ATA,130,237,0,Normal,170,N,0.0,Up,0,False
54,M,ATA,110,208,0,Normal,142,N,0.0,Up,0,False
37,M,ASY,140,207,0,Normal,130,Y,1.5,Flat,1,True
48,F,ATA,120,284,0,Normal,120,N,0.0,Up,0,False


Output can only be rendered in Databricks
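
The cell above only relabels the target column; the tally itself comes from the chart that Databricks renders. As a sketch of that tally, the ten displayed rows can be counted in plain Python. On the full DataFrame, grouping by `HeartDisease_Str` and counting would play the same role.

```python
from collections import Counter

# HeartDisease values of the ten rows displayed above (0 = no, 1 = yes).
heart_disease = [0, 1, 0, 1, 0, 0, 0, 0, 1, 0]

# Map each integer label to the strings used in HeartDisease_Str.
labels = ["True" if value == 1 else "False" for value in heart_disease]

# Tally how often each label occurs.
class_counts = Counter(labels)
print(class_counts["False"], class_counts["True"])  # prints: 7 3
```

These counts cover only the ten rows shown, not all 918 patients.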

In [0]:
#finding out how many male/ female patients have heart disease in the dataset
dff = dff.filter(dff['HeartDisease_Str'] == 'True')
display(dff)

Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease,HeartDisease_Str
49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1,True
48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1,True
37,M,ASY,140,207,0,Normal,130,Y,1.5,Flat,1,True
58,M,ATA,136,164,0,ST,99,Y,2.0,Flat,1,True
49,M,ASY,140,234,0,Normal,140,Y,1.0,Flat,1,True
38,M,ASY,110,196,0,Normal,166,N,0.0,Flat,1,True
60,M,ASY,100,248,0,Normal,125,N,1.0,Flat,1,True
36,M,ATA,120,267,0,Normal,160,N,3.0,Flat,1,True
44,M,ATA,150,288,0,Normal,150,Y,3.0,Flat,1,True
53,M,NAP,145,518,0,Normal,130,N,0.0,Flat,1,True


Output can only be rendered in Databricks

In [0]:
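#assigning an age category to each patient with heart disease
#otherwise() covers all remaining ages, including the minimum age of 28, which a strict '> 28' test would leave as null
dff = dff.withColumn('Age Category', when(col('Age') > 50, 'Very Old').when(col('Age') > 41, 'Old').when(col('Age') > 35, 'Adult').otherwise('Young Adult'))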
display(dff)

Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease,HeartDisease_Str,Age Category
49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1,True,Old
48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1,True,Old
37,M,ASY,140,207,0,Normal,130,Y,1.5,Flat,1,True,Adult
58,M,ATA,136,164,0,ST,99,Y,2.0,Flat,1,True,Very Old
49,M,ASY,140,234,0,Normal,140,Y,1.0,Flat,1,True,Old
38,M,ASY,110,196,0,Normal,166,N,0.0,Flat,1,True,Adult
60,M,ASY,100,248,0,Normal,125,N,1.0,Flat,1,True,Very Old
36,M,ATA,120,267,0,Normal,160,N,3.0,Flat,1,True,Adult
44,M,ATA,150,288,0,Normal,150,Y,3.0,Flat,1,True,Old
53,M,NAP,145,518,0,Normal,130,N,0.0,Flat,1,True,Very Old


Output can only be rendered in Databricks
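
A chain of `when(...)` conditions is evaluated in order and the first match wins, so the thresholds have to run from oldest to youngest. The same bucketing can be sketched as a plain-Python function. The function name `age_category` is hypothetical, and treating every remaining age as 'Young Adult' is an assumption made here so that the minimum age of 28 also receives a label.

```python
def age_category(age):
    """Return the first matching bucket, checked from oldest to youngest."""
    if age > 50:
        return "Very Old"
    if age > 41:
        return "Old"
    if age > 35:
        return "Adult"
    # Assumption: all remaining ages, including 28, count as 'Young Adult'.
    return "Young Adult"


# Ages of the ten rows displayed above.
ages = [49, 48, 37, 58, 49, 38, 60, 36, 44, 53]
categories = [age_category(age) for age in ages]
print(categories)
```

For these ten ages the function reproduces the 'Age Category' column shown in the output above.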

In [0]:
#Histogram of each numerical variable
display(df.select('Age'))
display(df.select('RestingBP'))
display(df.select('Cholesterol'))
display(df.select('FastingBS'))
display(df.select('MaxHR'))
display(df.select('Oldpeak'))
display(df.select('HeartDisease'))

Age
40
49
37
48
54
39
45
54
37
48


Output can only be rendered in Databricks

RestingBP
140
160
130
138
150
120
130
110
140
120


Output can only be rendered in Databricks

Cholesterol
289
180
283
214
195
339
237
208
207
284


Output can only be rendered in Databricks

FastingBS
0
0
0
0
0
0
0
0
0
0


Output can only be rendered in Databricks

MaxHR
172
156
98
108
122
170
170
142
130
120


Output can only be rendered in Databricks

Oldpeak
0.0
1.0
0.0
1.5
0.0
0.0
0.0
0.0
1.5
0.0


Output can only be rendered in Databricks

HeartDisease
0
1
0
1
0
0
0
0
1
0


Output can only be rendered in Databricks

In [0]:
#Box plot of each numerical variable
display(df)
display(df)
display(df)
display(df)
display(df)

Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0
39,M,NAP,120,339,0,Normal,170,N,0.0,Up,0
45,F,ATA,130,237,0,Normal,170,N,0.0,Up,0
54,M,ATA,110,208,0,Normal,142,N,0.0,Up,0
37,M,ASY,140,207,0,Normal,130,Y,1.5,Flat,1
48,F,ATA,120,284,0,Normal,120,N,0.0,Up,0


Output can only be rendered in Databricks

In [0]:
#Converting the text values in the non-numerical columns to numeric indices
indexer = StringIndexer(inputCols= ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope'], outputCols= ['Sex_ind', 'ChestPainType_ind', 'RestingECG_ind', 'ExerciseAngina_ind', 'ST_Slope_ind'])
df= indexer.fit(df).transform(df)
df= df.drop('Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope')
df.show(5)

+---+---------+-----------+---------+-----+-------+------------+-------+-----------------+--------------+------------------+------------+
|Age|RestingBP|Cholesterol|FastingBS|MaxHR|Oldpeak|HeartDisease|Sex_ind|ChestPainType_ind|RestingECG_ind|ExerciseAngina_ind|ST_Slope_ind|
+---+---------+-----------+---------+-----+-------+------------+-------+-----------------+--------------+------------------+------------+
| 40|      140|        289|        0|  172|    0.0|           0|    0.0|              2.0|           0.0|               0.0|         1.0|
| 49|      160|        180|        0|  156|    1.0|           1|    1.0|              1.0|           0.0|               0.0|         0.0|
| 37|      130|        283|        0|   98|    0.0|           0|    0.0|              2.0|           2.0|               0.0|         1.0|
| 48|      138|        214|        0|  108|    1.5|           1|    1.0|              0.0|           0.0|               1.0|         0.0|
| 54|      150|        195|       
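
By default, `StringIndexer` orders labels by descending frequency, so the most common label receives index 0.0. The sketch below imitates that rule in plain Python on the Sex values of the first ten rows shown earlier. The helper name `frequency_index` is hypothetical, ties between equally frequent labels are not handled the way Spark handles them, and the real indexer is fitted on all 918 rows rather than on this sample.

```python
from collections import Counter


def frequency_index(values):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(values)
    ordered = [label for label, _ in counts.most_common()]
    return {label: float(index) for index, label in enumerate(ordered)}


# Sex values of the first ten rows displayed earlier in the notebook.
sex = ["M", "F", "M", "F", "M", "M", "F", "M", "M", "F"]

mapping = frequency_index(sex)
print(mapping)  # prints: {'M': 0.0, 'F': 1.0}
```

The resulting indices are category codes, not quantities: an index of 2.0 does not mean "twice" an index of 1.0.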

In [0]:
#Correlation of each variable with the HeartDisease variable
display(df.select(corr('Age', 'HeartDisease')))
display(df.select(corr('RestingBP', 'HeartDisease')))
display(df.select(corr('Cholesterol', 'HeartDisease')))
display(df.select(corr('FastingBS', 'HeartDisease')))
display(df.select(corr('MaxHR', 'HeartDisease')))
display(df.select(corr('Oldpeak', 'HeartDisease')))
display(df.select(corr('Sex_ind', 'HeartDisease')))
display(df.select(corr('ChestPainType_ind', 'HeartDisease')))
display(df.select(corr('RestingECG_ind', 'HeartDisease')))
display(df.select(corr('ExerciseAngina_ind', 'HeartDisease')))
display(df.select(corr('ST_Slope_ind', 'HeartDisease')))

"corr(Age, HeartDisease)"
0.2820385058189964


"corr(RestingBP, HeartDisease)"
0.1075889803714038


"corr(Cholesterol, HeartDisease)"
-0.2327406389270114


"corr(FastingBS, HeartDisease)"
0.2672911861102978


"corr(MaxHR, HeartDisease)"
-0.4004207694631906


"corr(Oldpeak, HeartDisease)"
0.403950722062886


"corr(Sex_ind, HeartDisease)"
-0.3054449159631403


"corr(ChestPainType_ind, HeartDisease)"
-0.4713544961077811


"corr(RestingECG_ind, HeartDisease)"
0.1076278836514335


"corr(ExerciseAngina_ind, HeartDisease)"
0.4942819918242684


"corr(ST_Slope_ind, HeartDisease)"
-0.3978017182777483
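
`corr` returns the Pearson correlation coefficient, which measures linear association on a scale from -1 to 1. A minimal plain-Python sketch of the formula is given below, applied to the MaxHR and HeartDisease values of the first ten rows shown earlier; the helper name `pearson` is hypothetical. For the indexed categorical columns with more than two levels, such as ChestPainType_ind, the coefficient depends on the order in which the indexer happened to number the categories, so it should be read with caution.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# MaxHR and HeartDisease of the first ten rows displayed earlier.
max_hr = [172, 156, 98, 108, 122, 170, 170, 142, 130, 120]
heart_disease = [0, 1, 0, 1, 0, 0, 0, 0, 1, 0]

r = pearson(max_hr, heart_disease)
print(r)
```

A ten-row sample will not reproduce the coefficients above, which are computed over the whole dataset.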


In [0]:
#creating a VectorAssembler object that combines several input columns into a single feature vector column.
vec_assem = VectorAssembler(inputCols= ['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak', 'Sex_ind', 'ChestPainType_ind', 'RestingECG_ind', 'ExerciseAngina_ind', 'ST_Slope_ind'], outputCol= 'features')
df = vec_assem.transform(df)
df = df.select('features', 'HeartDisease')
df.show(10)

+--------------------+------------+
|            features|HeartDisease|
+--------------------+------------+
|(11,[0,1,2,4,7,10...|           0|
|[49.0,160.0,180.0...|           1|
|[37.0,130.0,283.0...|           0|
|[48.0,138.0,214.0...|           1|
|(11,[0,1,2,4,7,10...|           0|
|(11,[0,1,2,4,7,10...|           0|
|[45.0,130.0,237.0...|           0|
|(11,[0,1,2,4,7,10...|           0|
|(11,[0,1,2,4,5,9]...|           1|
|[48.0,120.0,284.0...|           0|
+--------------------+------------+
only showing top 10 rows
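
The `features` column mixes two notations. Rows such as `[49.0,160.0,180.0...` are dense vectors that list every value, while rows such as `(11,[0,1,2,4,7,10...` are sparse vectors written as (size, indices of the non-zero entries, their values). Both describe the same kind of 11-element vector. The plain-Python sketch below converts between the two forms for the first row of the dataset; the helper names `to_sparse` and `to_dense` are hypothetical and are not Spark functions.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, value in enumerate(dense) if value != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list, filling the missing positions with 0.0."""
    dense = [0.0] * size
    for i, value in zip(indices, values):
        dense[i] = value
    return dense


# First row, in the VectorAssembler column order: Age, RestingBP, Cholesterol,
# FastingBS, MaxHR, Oldpeak, Sex_ind, ChestPainType_ind, RestingECG_ind,
# ExerciseAngina_ind, ST_Slope_ind.
first_row = [40.0, 140.0, 289.0, 0.0, 172.0, 0.0, 0.0, 2.0, 0.0, 0.0, 1.0]

size, indices, values = to_sparse(first_row)
print(size, indices)  # prints: 11 [0, 1, 2, 4, 7, 10]
```

The indices match the first line of the output above, which is why that row appears in sparse form.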



In [0]:
#Splitting the data
train_data, test_data = df.randomSplit([0.7, 0.3])
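
`randomSplit` assigns rows to the splits at random, so the 70/30 proportions are approximate and, without a seed, the split and every score computed from it change between runs. `randomSplit` accepts an optional `seed` argument for this purpose. The idea is sketched below in plain Python; the function `random_split` and the seed value 42 are illustrative assumptions, and this is not Spark's implementation.

```python
import random


def random_split(rows, train_weight=0.7, seed=42):
    """Send each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)  # fixed seed, so the split is repeatable
    train, test = [], []
    for row in rows:
        if rng.random() < train_weight:
            train.append(row)
        else:
            test.append(row)
    return train, test


# Stand-in for the 918 rows of the dataset.
rows = list(range(918))

train_a, test_a = random_split(rows)
train_b, test_b = random_split(rows)

# The same seed gives the same split; the sizes are only roughly 70/30.
print(train_a == train_b, len(train_a), len(test_a))
```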

In [0]:
#training a Random Forest Classifier (RC) model using the training data 
RC = RandomForestClassifier(labelCol= 'HeartDisease')
RC_model = RC.fit(train_data)

In [0]:
#use the trained RC_model to predict the target variable (or dependent variable) on the test dataset
RC_model_pred = RC_model.transform(test_data)
RC_model_pred.show(10)

+--------------------+------------+--------------------+--------------------+----------+
|            features|HeartDisease|       rawPrediction|         probability|prediction|
+--------------------+------------+--------------------+--------------------+----------+
|(11,[0,1,2,3,4,7]...|           1|[4.19181618180143...|[0.20959080909007...|       1.0|
|(11,[0,1,2,3,4,8]...|           1|[2.42603362019913...|[0.12130168100995...|       1.0|
|(11,[0,1,2,4],[46...|           1|[2.65986215379536...|[0.13299310768976...|       1.0|
|(11,[0,1,2,4],[46...|           1|[2.51883651276972...|[0.12594182563848...|       1.0|
|(11,[0,1,2,4],[49...|           1|[5.08162347392299...|[0.25408117369614...|       1.0|
|(11,[0,1,2,4],[52...|           1|[3.73860736645183...|[0.18693036832259...|       1.0|
|(11,[0,1,2,4],[63...|           1|[3.00145601576520...|[0.15007280078826...|       1.0|
|(11,[0,1,2,4,5],[...|           1|[2.92717524468576...|[0.14635876223428...|       1.0|
|(11,[0,1,2,4,5],[...

In [0]:
#creating a BinaryClassificationEvaluator object to evaluate the performance of the model (its default metric is the area under the ROC curve)
test = BinaryClassificationEvaluator(labelCol= 'HeartDisease')
test_result = test.evaluate(RC_model_pred)
test_result

Out[19]: 0.9327775155996015
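
With its default settings, `BinaryClassificationEvaluator` reports the area under the ROC curve, so the value above is not an accuracy. The area under the ROC curve can be read as the probability that a randomly chosen positive case is scored higher than a randomly chosen negative one. The plain-Python sketch below computes it that way on four toy values; the labels, the scores, and the function name `area_under_roc` are made up for illustration, and this is one standard formulation rather than Spark's internal computation.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for positive_score in positives:
        for negative_score in negatives:
            if positive_score > negative_score:
                wins += 1.0
            elif positive_score == negative_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy values for illustration only; they are NOT model output.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = area_under_roc(labels, scores)
print(auc)  # prints: 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.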