# Heart Disease Detection

This data set dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments use a subset of 14 of them. The "target" field indicates the presence of heart disease in the patient. It is integer-valued: 0 = no disease and 1 = disease.

### Libraries

In [1]:
import sys

sys.executable

'/home/ec2-user/anaconda3/envs/python3/bin/python'

In [2]:
!/home/ec2-user/anaconda3/envs/python3/bin/python -m pip list

Package                            Version
---------------------------------- -------------------
aiobotocore                        1.3.0
aiohttp                            3.8.1
aioitertools                       0.7.1
aiosignal                          1.2.0
alabaster                          0.7.12
anaconda-client                    1.7.2
anaconda-project                   0.9.1
anyio                              3.4.0
appdirs                            1.4.4
argh                               0.26.2
argon2-cffi                        20.1.0
asn1crypto                         1.4.0
astroid                            2.9.0
astropy                            4.1
async-generator                    1.10
async-timeout                      4.0.1
asynctest                          0.13.0
atomicwrites                       1.4.0
attrs                              20.3.0
Automat                            20.2.0
autopep8                           1.5.5
autovizwidget  

### Read dataset

In [4]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PySpark').getOrCreate()

In [41]:
df = spark.read.csv('heart.csv', inferSchema=True, header=True)

In [42]:
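# Rename the target column to 'label', the default label column name for Spark ML estimators.
df = df.withColumnRenamed('target', 'label')

# 'label' is now the last column; every column before it is a feature.
feature_names = df.columns[0:-1]
label_name = df.columns[-1]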

In [43]:
df.toPandas().head(5)

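|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | label |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|-------|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |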


In [44]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=df.columns[:-1], outputCol='features')
df = assembler.transform(df).select('features', 'label')

In [45]:
df.show(5)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[52.0,1.0,0.0,125...|    0|
|[53.0,1.0,0.0,140...|    0|
|[70.0,1.0,0.0,145...|    0|
|[61.0,1.0,0.0,148...|    0|
|[62.0,0.0,0.0,138...|    0|
+--------------------+-----+
only showing top 5 rows



### Training and test sets

In [46]:
train, test = df.randomSplit([0.9, 0.1])
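
`randomSplit` treats the weights as probabilities rather than exact proportions, and when no `seed` is passed the partition differs from run to run, so the accuracy reported at the end of this notebook is not exactly reproducible. The plain-Python sketch below illustrates the idea of a weighted random split under simplifying assumptions. It is not Spark's implementation, and `random_split` is a hypothetical helper written for this illustration.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row.

    Hypothetical helper: mimics the idea behind weighted random splitting,
    not Spark's actual implementation.
    """
    total = sum(weights)
    # Cumulative upper bounds of each split on the [0, 1) interval.
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, upper in enumerate(bounds):
            if u < upper:
                splits[i].append(row)
                break
    return splits

rows = list(range(1025))  # same row count as the data set (526 + 499)
train_rows, test_rows = random_split(rows, [0.9, 0.1], seed=1234)

# Split sizes are only approximately 90/10, but identical for a fixed seed.
print(len(train_rows), len(test_rows))
```

In PySpark itself, the equivalent step is to pass a seed to the split, for example `df.randomSplit([0.9, 0.1], seed=1234)`; the seed value here is arbitrary.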

In [49]:
df.toPandas()['label'].value_counts()

1    526
0    499
Name: label, dtype: int64
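
The two classes are nearly balanced, which makes plain accuracy a reasonable first metric. A useful reference point is the accuracy of a trivial model that always predicts the majority class. The short sketch below computes it from the counts printed above; note that these counts cover the full data set, so the baseline on the test split alone may differ slightly.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 526, 0: 499}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(f"rows = {total}")
print(f"majority class = {majority_class}")
print(f"baseline accuracy = {baseline_accuracy:.3f}")
```

Any trained classifier should comfortably beat this baseline of roughly 51% to be considered useful.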

### Feature Scaling

In [47]:
from pyspark.ml.feature import MinMaxScaler

# Fit the scaler on the training split only, so no test-set statistics leak into training.
scaler = MinMaxScaler(inputCol='features', outputCol='scaledFeatures', min=0, max=1)
scalerModel = scaler.fit(train)

In [48]:
train = scalerModel.transform(train).select('scaledFeatures', 'label')
train = train.withColumnRenamed('scaledFeatures', 'features')

test = scalerModel.transform(test).select('scaledFeatures', 'label')
test = test.withColumnRenamed('scaledFeatures', 'features')
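
Because the scaler is fitted on `train` and then applied to both splits, a test value that lies outside the range seen during training is mapped outside [0, 1]. The plain-Python sketch below shows the per-feature min-max formula and that side effect. The helper names `fit_min_max` and `transform_min_max` and the sample ages are invented for illustration; this is not Spark's code.

```python
def fit_min_max(column):
    """Return the (min, max) of one feature, learned from training data."""
    return min(column), max(column)

def transform_min_max(value, col_min, col_max, lo=0.0, hi=1.0):
    """Rescale one value to [lo, hi] using statistics from the training data."""
    if col_max == col_min:
        # Constant feature: map to the midpoint of the target range.
        return 0.5 * (lo + hi)
    return (value - col_min) / (col_max - col_min) * (hi - lo) + lo

# Invented ages, for illustration only.
train_age = [40, 50, 60, 70]
test_age = [45, 80]

age_min, age_max = fit_min_max(train_age)
scaled_train = [transform_min_max(v, age_min, age_max) for v in train_age]
scaled_test = [transform_min_max(v, age_min, age_max) for v in test_age]

print(scaled_train)  # spans exactly 0.0 to 1.0
print(scaled_test)   # 80 is above the training maximum, so it maps above 1.0
```

Values slightly outside [0, 1] in the test split are expected with this approach and are usually harmless for a small neural network.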

### Training

In [54]:
input_neurons = len(feature_names)
output_neurons = df.select('label').distinct().count()

In [56]:
from pyspark.ml.classification import MultilayerPerceptronClassifier

# One hidden layer with 10 units between the input and output layers.
layers = [input_neurons, 10, output_neurons]

classifier = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
fitModel = classifier.fit(train)
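
The `layers` list fully determines the shape of the network. To get a feel for its size, the sketch below counts the weights and biases of a fully connected feed-forward network with the same layer sizes. It assumes 13 input features (`age` through `thal`) and 2 classes, as in this notebook; `count_parameters` is a hypothetical helper, not part of PySpark.

```python
def count_parameters(layers):
    """Count weights and biases of a fully connected feed-forward network."""
    total = 0
    for n_in, n_out in zip(layers, layers[1:]):
        total += n_in * n_out  # weights between consecutive layers
        total += n_out         # one bias per unit in the receiving layer
    return total

# Assumes 13 input features (age ... thal) and 2 classes, as in this notebook.
layers = [13, 10, 2]
print(count_parameters(layers))  # 162
```

With only 162 trainable parameters and on the order of 900 training rows, the network is small relative to the data.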

### Test

In [57]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

result = fitModel.transform(test)
predictionAndLabels = result.select("prediction", "label")
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")

print("Test set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))

Test set accuracy = 0.8947368421052632
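
Accuracy alone does not show which kind of error the model makes, and for disease detection a missed case (false negative) is usually more costly than a false alarm. The sketch below tallies a confusion matrix and derives precision and recall from (prediction, label) pairs. The pairs are invented for illustration; in this notebook they would come from `predictionAndLabels`. `confusion_counts` is a hypothetical helper.

```python
from collections import Counter

def confusion_counts(pairs):
    """Tally (prediction, label) pairs into TP, FP, TN, FN for the positive class 1."""
    tally = Counter(pairs)
    return {
        "tp": tally[(1, 1)],
        "fp": tally[(1, 0)],
        "tn": tally[(0, 0)],
        "fn": tally[(0, 1)],
    }

# Invented (prediction, label) pairs, for illustration only.
pairs = [(1, 1), (1, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 1), (0, 0)]

c = confusion_counts(pairs)
accuracy = (c["tp"] + c["tn"]) / len(pairs)
precision = c["tp"] / (c["tp"] + c["fp"])
recall = c["tp"] / (c["tp"] + c["fn"])

print(c)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Within Spark, `MulticlassClassificationEvaluator` also accepts other `metricName` values such as `"f1"`, `"weightedPrecision"`, and `"weightedRecall"`, which can complement the accuracy figure above.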
