# Project three: heart disease classification

## About Dataset
Cardiovascular diseases (CVDs) are the leading cause of death worldwide. CVDs include coronary heart disease, cerebrovascular disease, rheumatic heart disease, and other heart and blood vessel conditions. According to the World Health Organization, CVDs kill 17.9 million people each year. Heart attacks and strokes account for more than four out of every five CVD deaths, and one-third of these deaths occur before the age of 70. A comprehensive dataset of factors that contribute to a heart attack has been constructed.

The main purpose here is to collect characteristics of heart attacks and the factors that contribute to them.
The dataset has 1319 samples with nine fields: eight input fields and one output field. The input fields are age, gender (0 for female, 1 for male), heart rate (`impluse`, as the column is spelled in the file), systolic BP (`pressurehight`), diastolic BP (`pressurelow`), blood sugar (`glucose`), CK-MB (`kcm`), and the troponin test (`troponin`). The output field (`class`) records the presence of a heart attack and takes two values: negative (no heart attack) and positive (heart attack).

You will build a classification model to predict the presence of a heart attack. As a starting point, I have built a logistic regression model and a decision tree model for your reference.

Please check the site below for the classification models you can run in Spark.
[Spark MLLib for Classification](https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier)

Please run at least two other classification algorithms on this dataset and discuss the performance of each model (using the F1 score). Which model generates the best result? Which features are the most important in explaining the result? In addition, try some strategies to improve the performance of the models and discuss your experience and lessons learned. Were you able to improve the performance of the model, and why?

## Strategies to improve the model (some may not be applicable to this dataset):
- remove/replace outliers
- find better ways to deal with missing values
- add/delete/modify features, create additional features based on existing features
- conduct hyperparameter tuning and cross-validation
- try different models/algorithms
- use more data or anything else you find helpful
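
As one illustration of the first strategy, the sketch below caps extreme values with the interquartile-range (IQR) rule. It runs on a plain Python list so the arithmetic is easy to follow. The `kcm` values are copied from the first rows of the dataset printed later in this notebook; the helper name `clip_outliers_iqr` and the multiplier of 1.5 are assumptions chosen for illustration, not part of the assignment.

```python
import statistics

def clip_outliers_iqr(values, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at the nearest bound."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]

# kcm (CK-MB) values from the first nine rows of the dataset
kcm = [1.8, 6.75, 1.99, 13.87, 1.08, 1.83, 0.71, 300.0, 2.35]
kcm_clipped = clip_outliers_iqr(kcm)

# show each value before and after capping
for before, after in zip(kcm, kcm_clipped):
    print(f"{before:>7.2f} -> {after:>7.2f}")
```

Only the value 300.0 is changed in this sample. On the full Spark DataFrame, the quartiles could instead be estimated with `DataFrame.approxQuantile`, and whether capping actually helps the model should be checked against the test-set F1 score.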

In [117]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from helper_functions import displayByGroup
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# check whether a Spark session is already active; if it is, stop it

try:
    if spark:
        spark.stop()
except NameError:
    # `spark` is not defined yet, so there is nothing to stop
    pass

spark = (SparkSession.builder.appName("Heart Attack Prediction")
        .config("spark.port.maxRetries", "200")
        .config("spark.sql.mapKeyDedupPolicy", "LAST_WIN")  # allow duplicate keys in the map data type; the last value wins
        .config("spark.driver.memory", "16g")
        .getOrCreate())

# configure the log level (default is WARN)
spark.sparkContext.setLogLevel('ERROR')

# read the heart attack dataset

df=spark.read.csv('/opt/shared/Heart_Attack.csv', header=True, inferSchema=True)

In [69]:
df.show()

+---+------+-------+-------------+-----------+-------+-----+--------+--------+
|age|gender|impluse|pressurehight|pressurelow|glucose|  kcm|troponin|   class|
+---+------+-------+-------------+-----------+-------+-----+--------+--------+
| 64|     1|     66|          160|         83|  160.0|  1.8|   0.012|negative|
| 21|     1|     94|           98|         46|  296.0| 6.75|    1.06|positive|
| 55|     1|     64|          160|         77|  270.0| 1.99|   0.003|negative|
| 64|     1|     70|          120|         55|  270.0|13.87|   0.122|positive|
| 55|     1|     64|          112|         65|  300.0| 1.08|   0.003|negative|
| 58|     0|     61|          112|         58|   87.0| 1.83|   0.004|negative|
| 32|     0|     40|          179|         68|  102.0| 0.71|   0.003|negative|
| 63|     1|     60|          214|         82|   87.0|300.0|    2.37|positive|
| 44|     0|     60|          154|         81|  135.0| 2.35|   0.004|negative|
| 67|     1|     61|          160|         95|  100.

In [70]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- gender: integer (nullable = true)
 |-- impluse: integer (nullable = true)
 |-- pressurehight: integer (nullable = true)
 |-- pressurelow: integer (nullable = true)
 |-- glucose: double (nullable = true)
 |-- kcm: double (nullable = true)
 |-- troponin: double (nullable = true)
 |-- class: string (nullable = true)



In [118]:
# create label (target variable)

df1=df.withColumn('label', F.when(F.col('class')=="positive", 1).otherwise(0))

df1.show()

+---+------+-------+-------------+-----------+-------+-----+--------+--------+-----+
|age|gender|impluse|pressurehight|pressurelow|glucose|  kcm|troponin|   class|label|
+---+------+-------+-------------+-----------+-------+-----+--------+--------+-----+
| 64|     1|     66|          160|         83|  160.0|  1.8|   0.012|negative|    0|
| 21|     1|     94|           98|         46|  296.0| 6.75|    1.06|positive|    1|
| 55|     1|     64|          160|         77|  270.0| 1.99|   0.003|negative|    0|
| 64|     1|     70|          120|         55|  270.0|13.87|   0.122|positive|    1|
| 55|     1|     64|          112|         65|  300.0| 1.08|   0.003|negative|    0|
| 58|     0|     61|          112|         58|   87.0| 1.83|   0.004|negative|    0|
| 32|     0|     40|          179|         68|  102.0| 0.71|   0.003|negative|    0|
| 63|     1|     60|          214|         82|   87.0|300.0|    2.37|positive|    1|
| 44|     0|     60|          154|         81|  135.0| 2.35|   0.

In [119]:
df1.groupBy('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|    1|  810|
|    0|  509|
+-----+-----+
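
The class counts above give a reference point for the F1 scores reported later. The short calculation below, a sketch that uses only the two counts printed above, computes the share of positive cases and the scores of a trivial baseline that labels every sample as positive. The variable names are illustrative, and the counts cover the full dataset rather than the test split.

```python
# class counts from df1.groupBy('label').count()
n_pos, n_neg = 810, 509
total = n_pos + n_neg

positive_share = n_pos / total

# baseline: predict "positive" for every sample
tp, fp, fn = n_pos, n_neg, 0
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"positive share:      {positive_share:.3f}")
print(f"baseline precision:  {precision:.3f}")
print(f"baseline recall:     {recall:.3f}")
print(f"baseline F1 (pos.):  {f1:.3f}")
```

A model's positive-class F1 should clearly exceed this baseline (about 0.76) before it is considered informative.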



## Feature Engineering

In [120]:
from pyspark.ml.feature import StringIndexer, VectorAssembler

# split the data into train and test sets

(trainDF, testDF) = df1.randomSplit([.8, .2], seed=42)

# all features are numeric, so they can be assembled directly into one vector
# the target variable (label) must be excluded
numericCols = [field for (field, dataType) in df1.dtypes if dataType in ("double", "int") and field != "label"]

vecAssembler = VectorAssembler(inputCols=numericCols, outputCol="features")

In [121]:
# check the numeric features and make sure they look correct

numericCols

['age',
 'gender',
 'impluse',
 'pressurehight',
 'pressurelow',
 'glucose',
 'kcm',
 'troponin']

## Logistic Regression

In [122]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Create initial LogisticRegression model

lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10)

pipeline=Pipeline(stages=[vecAssembler, lr])

#train the model

pipelineModel_lr=pipeline.fit(trainDF)

#evaluate the model

lr_predDF = pipelineModel_lr.transform(testDF)
 
# Use areaUnderROC and areaUnderPR to evaluate the binary classification model. areaUnderROC is the default metric

evaluator_roc = BinaryClassificationEvaluator()

evaluator_pr=BinaryClassificationEvaluator(metricName="areaUnderPR")

# Evaluate logistic regression model

print("Logistic Regression: areaUnderROC is ", evaluator_roc.evaluate(lr_predDF))

print("Logistic Regression: areaUnderPR is ", evaluator_pr.evaluate(lr_predDF))

Logistic Regression: areaUnderROC is  0.9227418207681363
Logistic Regression: areaUnderPR is  0.9375381601020925


In [76]:
lr_predDF.show()

+---+------+-------+-------------+-----------+-------+-----+--------+--------+-----+--------------------+--------------------+--------------------+----------+
|age|gender|impluse|pressurehight|pressurelow|glucose|  kcm|troponin|   class|label|            features|       rawPrediction|         probability|prediction|
+---+------+-------+-------------+-----------+-------+-----+--------+--------+-----+--------------------+--------------------+--------------------+----------+
| 19|     0|     70|          117|         76|   91.0|36.24|   0.025|positive|    1|[19.0,0.0,70.0,11...|[-16.928114529994...|[4.44849648021547...|       1.0|
| 21|     0|     62|           76|         55|  111.0| 3.11|   0.003|negative|    0|[21.0,0.0,62.0,76...|[2.71461586807154...|[0.93788360377704...|       0.0|
| 21|     1|     85|          204|         84|   93.0| 2.71|   0.002|negative|    0|[21.0,1.0,85.0,20...|[3.08493006518893...|[0.95626682720980...|       0.0|
| 22|     1|     84|          160|         79|

In [123]:
lr_predDF.select('label', 'prediction','features', 'probability').show(10, False)

+-----+----------+-------------------------------------------+-----------------------------------------+
|label|prediction|features                                   |probability                              |
+-----+----------+-------------------------------------------+-----------------------------------------+
|1    |1.0       |[19.0,0.0,70.0,117.0,76.0,91.0,36.24,0.025]|[4.448496480215476E-8,0.9999999555150352]|
|0    |0.0       |[21.0,0.0,62.0,76.0,55.0,111.0,3.11,0.003] |[0.9378836037770448,0.06211639622295517] |
|0    |0.0       |[21.0,1.0,85.0,204.0,84.0,93.0,2.71,0.002] |[0.9562668272098092,0.043733172790190844]|
|0    |0.0       |[22.0,1.0,84.0,160.0,79.0,102.0,2.25,0.006]|[0.953170217306127,0.046829782693872946] |
|1    |1.0       |[25.0,1.0,64.0,153.0,93.0,110.0,3.09,0.097]|[0.41696592150395756,0.5830340784960424] |
|1    |1.0       |[26.0,1.0,54.0,104.0,62.0,88.0,14.21,0.004]|[0.015205955775295755,0.9847940442247043]|
|1    |0.0       |[27.0,1.0,94.0,157.0,79.0,141.0,6.25,

In [124]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

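def classification_performance(predDF):
  """Print the confusion matrix and classification report for a prediction DataFrame."""
  # collect labels and predictions to pandas so scikit-learn can score them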
  pd_prediction=predDF.select('label', 'prediction').toPandas()
  label=pd_prediction['label']
  pred=pd_prediction['prediction']
 
  confusion=confusion_matrix(label, pred)

  print('Confusion Matrix\n', confusion)

  print('\nClassification Report\n')

  print(classification_report(label, pred))

In [56]:
classification_performance(lr_predDF)

Confusion Matrix
 [[ 60  14]
 [ 24 128]]

Classification Report

              precision    recall  f1-score   support

           0       0.71      0.81      0.76        74
           1       0.90      0.84      0.87       152

    accuracy                           0.83       226
   macro avg       0.81      0.83      0.82       226
weighted avg       0.84      0.83      0.83       226
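
The figures in the classification report can be reproduced by hand from the confusion matrix, which is useful when comparing models by F1 score. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and uses the matrix printed above; the helper name `prf` is made up for this example.

```python
# confusion matrix from the logistic regression model:
# rows = true label (0, 1), columns = predicted label (0, 1)
confusion = [[60, 14],
             [24, 128]]

def prf(confusion, cls):
    """Return (precision, recall, f1) for class index `cls` of a 2x2 matrix."""
    other = 1 - cls
    tp = confusion[cls][cls]
    fp = confusion[other][cls]   # other class predicted as `cls`
    fn = confusion[cls][other]   # `cls` predicted as the other class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for cls in (0, 1):
    p, r, f1 = prf(confusion, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# per-class support is the row sum; macro averages classes equally,
# weighted averages them by support
support = [sum(row) for row in confusion]
f1_scores = [prf(confusion, c)[2] for c in (0, 1)]
macro_f1 = sum(f1_scores) / len(f1_scores)
weighted_f1 = sum(f * s for f, s in zip(f1_scores, support)) / sum(support)
print(f"macro f1={macro_f1:.2f} weighted f1={weighted_f1:.2f}")
```

The same function can be applied to the confusion matrix of any other model to compare F1 scores on equal terms.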



In [125]:
# option 1: extract feature importance

# define a function to return feature names for logistic regression
def feature_names(df, features):
  # the assembled vector column stores per-feature metadata grouped by type
  attrs=df.schema[features].metadata["ml_attr"]["attrs"]

  names=[]
  # collect numeric features first, then binary features; either group may be absent
  for group in ("numeric", "binary"):
    for attr in attrs.get(group, []):
      if "name" in attr:
        names.append(attr["name"])
  # always return the list (the original returned it only when no binary group existed)
  return names

# feature importance
def lr_coefficients(df, model, features="features"):
  coefficients =model.coefficients
  names=feature_names(df, features)
 
  # use `names` (the list built above); the original referenced an undefined `name`
  weightsDF = pd.DataFrame(zip(names, coefficients, list(map(abs, coefficients))), columns=['feature', 'weights', 'abs_weights'])
  sorted_list=weightsDF.sort_values('abs_weights', ascending=False)[['feature', 'weights']]
  return sorted_list

In [126]:
lr_coefficients(lr_predDF, pipelineModel_lr.stages[-1])

In [115]:
# option 2: extract feature importance

# named feature_cols so the feature_names() helper from option 1 is not overwritten
feature_cols=pipelineModel_lr.stages[0].getInputCols()
coefficients=pipelineModel_lr.stages[1].coefficients

weightsDF = pd.DataFrame(zip(feature_cols, coefficients, list(map(abs, coefficients))), columns=['feature', 'coefficient', 'abs_coefficient'])
sorted_list=weightsDF.sort_values('abs_coefficient', ascending=False)[['feature', 'coefficient']]

sorted_list.head(10)

         feature  coefficient
7       troponin    27.671303
6            kcm     0.577681
1         gender     0.247278
0            age     0.053819
4    pressurelow     0.010215
3  pressurehight    -0.005063
2        impluse    -0.000221
5        glucose     0.000106
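
When reading this ranking, keep in mind that a logistic regression coefficient is the change in log-odds per one-unit increase in the feature, so its size depends on the feature's units. The sketch below converts two of the coefficients above into odds ratios. The step sizes (0.01 for troponin, 10 years for age) are assumptions picked to be roughly plausible for each feature, and the helper `odds_ratio` is illustrative only.

```python
import math

def odds_ratio(coefficient, step=1.0):
    """Multiplicative change in the odds when the feature increases by `step`."""
    return math.exp(coefficient * step)

# coefficients taken from the table above
troponin_coef = 27.671303
age_coef = 0.053819

# step sizes are illustrative assumptions, not values from the dataset documentation
print(f"troponin +0.01: odds x {odds_ratio(troponin_coef, 0.01):.2f}")
print(f"age +10 years:  odds x {odds_ratio(age_coef, 10):.2f}")
```

The troponin coefficient is large partly because troponin values in the rows shown earlier are mostly far below 1. Standardizing the features before fitting (for example with Spark's `StandardScaler`) would put the coefficients on a more comparable scale.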


## Decision Tree

In [116]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", seed=42)

pipeline=Pipeline(stages=[vecAssembler, dt])

#train the model

pipelineModel_dt=pipeline.fit(trainDF)

#evaluate the model

dt_predDF = pipelineModel_dt.transform(testDF)
 
# Use areaUnderROC and areaUnderPR to evaluate the binary classification model. areaUnderROC is the default metric

evaluator_roc = BinaryClassificationEvaluator()

evaluator_pr=BinaryClassificationEvaluator(metricName="areaUnderPR")

# Evaluate decision tree model

print("Decision Tree: areaUnderROC is ", evaluator_roc.evaluate(dt_predDF))

print("Decision Tree: areaUnderPR is ", evaluator_pr.evaluate(dt_predDF))

TypeError: Cannot recognize a pipeline stage of type <class 'list'>.

In [92]:
treeModel = pipelineModel_dt.stages[-1]
# summary only
print(treeModel)

DecisionTreeClassificationModel: uid=DecisionTreeClassifier_115380d19d7b, depth=5, numNodes=23, numClasses=2, numFeatures=8


In [97]:
classification_performance(dt_predDF)

Confusion Matrix
 [[ 60  14]
 [ 24 128]]

Classification Report

              precision    recall  f1-score   support

           0       0.71      0.81      0.76        74
           1       0.90      0.84      0.87       152

    accuracy                           0.83       226
   macro avg       0.81      0.83      0.82       226
weighted avg       0.84      0.83      0.83       226



In [66]:
# check feature importance for the tree-based model without one-hot encoding (ohe)
def dt_featureImportance_no_ohe(model, vecAssembler):
    featureImp = pd.DataFrame(
        list(zip(vecAssembler.getInputCols(), model.featureImportances)),
      columns=["feature", "importance"])
    return featureImp.sort_values(by="importance", ascending=False)

In [99]:
dtModel=pipelineModel_dt.stages[1]
vecAssembler=pipelineModel_dt.stages[0]

dt_featureImportance_no_ohe(dtModel, vecAssembler)

         feature  importance
7       troponin    0.616069
6            kcm    0.353566
1         gender    0.021114
0            age    0.005510
3  pressurehight    0.003741
2        impluse    0.000000
4    pressurelow    0.000000
5        glucose    0.000000
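
Spark normalizes tree feature importances so that they sum to 1, which means each value can be read as a share of the total. As a small worked check using the numbers in the table above, the sketch below accumulates the shares in ranked order; the variable names are illustrative.

```python
from itertools import accumulate

# (feature, importance) pairs copied from the table above, already sorted
importances = [
    ("troponin", 0.616069),
    ("kcm", 0.353566),
    ("gender", 0.021114),
    ("age", 0.00551),
    ("pressurehight", 0.003741),
    ("impluse", 0.0),
    ("pressurelow", 0.0),
    ("glucose", 0.0),
]

# running total of importance in ranked order
cumulative = list(accumulate(value for _, value in importances))

for (name, value), running in zip(importances, cumulative):
    print(f"{name:<14}{value:>9.4f}{running:>9.4f}")
```

Troponin and CK-MB together account for about 97% of the total importance in this tree, while three features are never used for a split.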
