## Portfolio Project - E. Kenny April 2024

This notebook tries to predict the presence of heart disease in a patient.
The original dataset can be found on Kaggle: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction/download?datasetVersionNumber=1

The Machine Learning Process
1. Define the purpose of the ML project: to predict the presence of heart disease in a patient
2. Obtain the dataset for the analysis: the Kaggle dataset linked above
3. Explore, clean and pre-process the data

In [None]:
#An old version of scikit-learn was throwing an error, so I updated it from the terminal.
#The following cell was added after the update to confirm the installed version.

In [1]:
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

The scikit-learn version is 1.4.2.


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Read the dataset 'heart'
heart_df = pd.read_csv("heart.csv")

#Have a quick look at the data
heart_df.head(10)

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0
5,39,M,NAP,120,339,0,Normal,170,N,0.0,Up,0
6,45,F,ATA,130,237,0,Normal,170,N,0.0,Up,0
7,54,M,ATA,110,208,0,Normal,142,N,0.0,Up,0
8,37,M,ASY,140,207,0,Normal,130,Y,1.5,Flat,1
9,48,F,ATA,120,284,0,Normal,120,N,0.0,Up,0


In [3]:
#Check the column types and non-null counts
heart_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


In [4]:
#What datatypes are we dealing with?

heart_df.dtypes

Age                 int64
Sex                object
ChestPainType      object
RestingBP           int64
Cholesterol         int64
FastingBS           int64
RestingECG         object
MaxHR               int64
ExerciseAngina     object
Oldpeak           float64
ST_Slope           object
HeartDisease        int64
dtype: object

In [5]:
# Are there any missing data?

print(heart_df.isnull().sum())

Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64


In [6]:
#Do we have a good, even mix of outcomes?
#(Series.value_counts replaces the deprecated top-level pd.value_counts)
heart_df["HeartDisease"].value_counts()

HeartDisease
1    508
0    410
Name: count, dtype: int64
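
To put the class counts above in context, the following minimal sketch turns them into proportions and a majority-class baseline, i.e. the accuracy of a model that always predicts the larger class. The counts are copied from the output above; the variable names are illustrative only.

```python
import pandas as pd

# Class counts copied from the output above (1 = heart disease, 0 = none).
counts = pd.Series({1: 508, 0: 410}, name="count")

# Share of each class in the dataset.
proportions = counts / counts.sum()

# A model that always predicts the larger class reaches this accuracy,
# so any useful classifier should beat it.
baseline_accuracy = proportions.max()

print(proportions.round(3))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

The split is roughly 55% to 45%, so the classes are reasonably balanced and plain accuracy is a meaningful metric here.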

In [7]:
#We need to convert the 'object' data types to numeric values (one-hot encoding)
V_heart_df = pd.get_dummies(heart_df, columns=['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope'])
V_heart_df.head(10)

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease,Sex_F,Sex_M,ChestPainType_ASY,...,ChestPainType_NAP,ChestPainType_TA,RestingECG_LVH,RestingECG_Normal,RestingECG_ST,ExerciseAngina_N,ExerciseAngina_Y,ST_Slope_Down,ST_Slope_Flat,ST_Slope_Up
0,40,140,289,0,172,0.0,0,False,True,False,...,False,False,False,True,False,True,False,False,False,True
1,49,160,180,0,156,1.0,1,True,False,False,...,True,False,False,True,False,True,False,False,True,False
2,37,130,283,0,98,0.0,0,False,True,False,...,False,False,False,False,True,True,False,False,False,True
3,48,138,214,0,108,1.5,1,True,False,True,...,False,False,False,True,False,False,True,False,True,False
4,54,150,195,0,122,0.0,0,False,True,False,...,True,False,False,True,False,True,False,False,False,True
5,39,120,339,0,170,0.0,0,False,True,False,...,True,False,False,True,False,True,False,False,False,True
6,45,130,237,0,170,0.0,0,True,False,False,...,False,False,False,True,False,True,False,False,False,True
7,54,110,208,0,142,0.0,0,False,True,False,...,False,False,False,True,False,True,False,False,False,True
8,37,140,207,0,130,1.5,1,False,True,True,...,False,False,False,True,False,False,True,False,True,False
9,48,120,284,0,120,0.0,0,True,False,False,...,False,False,False,True,False,True,False,False,False,True
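
One possible variation on the encoding above: a binary column such as Sex becomes two complementary indicators (Sex_F and Sex_M), so a difference in that one attribute is counted twice in a distance-based model like k-nearest neighbors. The sketch below, on a small hypothetical frame that only borrows the dataset's column names, shows `drop_first=True` and `dtype=int` as one way to keep a single 0/1 column per binary feature. Whether this helps on the real data is an assumption that would need checking.

```python
import pandas as pd

# Tiny illustrative frame (hypothetical rows, column names borrowed from the dataset).
toy_df = pd.DataFrame({
    "Age": [40, 49, 37, 48],
    "Sex": ["M", "F", "M", "F"],
    "ExerciseAngina": ["N", "N", "N", "Y"],
})

# drop_first=True drops the first category of each encoded column,
# and dtype=int produces 0/1 instead of True/False.
encoded = pd.get_dummies(
    toy_df, columns=["Sex", "ExerciseAngina"], drop_first=True, dtype=int
)
print(encoded)
```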


In [8]:
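#Let's scale the numerical values now
#Note: the scaler is fitted on the full dataset, before the train/test split,
#so the test rows influence the scaling statistics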
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
Columnstoscale = ['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak']
V_heart_df[Columnstoscale] = scaler.fit_transform(V_heart_df[Columnstoscale])
V_heart_df.head()

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease,Sex_F,Sex_M,ChestPainType_ASY,...,ChestPainType_NAP,ChestPainType_TA,RestingECG_LVH,RestingECG_Normal,RestingECG_ST,ExerciseAngina_N,ExerciseAngina_Y,ST_Slope_Down,ST_Slope_Flat,ST_Slope_Up
0,-1.43314,0.410909,0.82507,0,1.382928,-0.832432,0,False,True,False,...,False,False,False,True,False,True,False,False,False,True
1,-0.478484,1.491752,-0.171961,0,0.754157,0.105664,1,True,False,False,...,True,False,False,True,False,True,False,False,True,False
2,-1.751359,-0.129513,0.770188,0,-1.525138,-0.832432,0,False,True,False,...,False,False,False,False,True,True,False,False,False,True
3,-0.584556,0.302825,0.13904,0,-1.132156,0.574711,1,True,False,True,...,False,False,False,True,False,False,True,False,True,False
4,0.051881,0.951331,-0.034755,0,-0.581981,-0.832432,0,False,True,False,...,True,False,False,True,False,True,False,False,False,True
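
The scaling cell above fits the scaler on every row. A common alternative is to wrap the scaler and the classifier in a pipeline, so that the means and standard deviations are learned from the training rows only. The following is a minimal sketch on synthetic stand-in data (not the heart dataset); all names and the `random_state` values are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the heart dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(200, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=5, size=200) > 50).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)

# fit() learns the scaling statistics from the training rows only;
# score() and predict() reuse those statistics on new rows.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipeline.fit(X_tr, y_tr)

fitted_scaler = knn_pipeline.named_steps["standardscaler"]
print("Scaler means learned from training rows:", fitted_scaler.mean_)
print("Test accuracy:", knn_pipeline.score(X_te, y_te))
```

A pipeline like this can also be passed directly to `cross_val_score` or `GridSearchCV`, in which case the scaler is refitted inside each fold.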


In [9]:
#Let's split the data into training (70%) and testing (30%) sets
#Note: no random_state is set, so the split and the scores below change between runs
from sklearn.model_selection import train_test_split
X = V_heart_df.drop(['HeartDisease'], axis=1)
Y = V_heart_df['HeartDisease']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30)
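
A reproducible variant of the split above could look like the following sketch. It uses placeholder data with the same class counts as the dataset (508 and 410); `random_state=42` is an arbitrary seed chosen for illustration, and `stratify` keeps the class ratio the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder labels with the same class counts as the dataset (508 vs 410).
y_demo = np.array([1] * 508 + [0] * 410)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy feature column

# random_state fixes the shuffle; stratify preserves the class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

print("Train size:", len(y_tr), "Test size:", len(y_te))
print("Train positive rate:", round(float(y_tr.mean()), 3))
print("Test positive rate: ", round(float(y_te.mean()), 3))
```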

In [10]:
#Let's fit our model for a first pass

from sklearn.neighbors import KNeighborsClassifier

Classifier = KNeighborsClassifier(n_neighbors=5, metric = 'euclidean')
Classifier.fit(X_train, Y_train)

In [13]:
from sklearn.metrics import confusion_matrix

In [14]:
y_pred = Classifier.predict(X_test)

In [15]:
#Let's check the performance of the model using a confusion matrix and some accuracy scores
cm_1 = confusion_matrix(Y_test, y_pred)
cm_1

array([[101,  24],
       [ 18, 133]], dtype=int64)
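
The confusion matrix above is computed on the test set, so the test accuracy and the per-class rates can be read directly from it. The sketch below copies the four counts from the output and works through the arithmetic; variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class, in the order 0, 1).
cm = np.array([[101, 24],
               [18, 133]])

tn, fp, fn, tp = cm.ravel()

test_accuracy = (tn + tp) / cm.sum()   # correct predictions / all predictions
recall_positive = tp / (tp + fn)       # share of true heart-disease cases caught
precision_positive = tp / (tp + fp)    # share of positive predictions that are right

print(f"Test accuracy:       {test_accuracy:.3f}")
print(f"Recall (class 1):    {recall_positive:.3f}")
print(f"Precision (class 1): {precision_positive:.3f}")
```

In a screening setting the 18 false negatives (patients with heart disease predicted as healthy) are usually the more costly error, which is why recall for class 1 is worth tracking alongside accuracy.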

In [16]:
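#Note: this scores the model on the training data, so it is the training accuracy
train_accuracy = Classifier.score(X_train, Y_train)
print("Train accuracy", train_accuracy)

Train accuracy 0.9034267912772586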


In [33]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(Classifier, X, Y, cv = 5)

In [34]:
print('Model accuracy: ',np.mean(scores))

Model accuracy:  0.83436089332383


In [17]:
#Hyperparameter tuning with a grid search
from sklearn.model_selection import GridSearchCV

In [18]:
#Note: 'minkowski' with the default p=2 is the same distance as 'euclidean'
grid_params = {'n_neighbors': [2, 4, 6, 8],
               'weights': ['uniform', 'distance'],
               'metric': ['minkowski', 'euclidean', 'manhattan']}

In [19]:
GS = GridSearchCV(KNeighborsClassifier(), grid_params, verbose = 1, cv=3, n_jobs = -1)

In [21]:
GridTry = GS.fit(X_train, Y_train)

Fitting 3 folds for each of 24 candidates, totalling 72 fits


In [22]:
GridTry.best_params_

{'metric': 'manhattan', 'n_neighbors': 8, 'weights': 'distance'}
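
Rather than retyping the winning parameters by hand, the fitted search object can supply the tuned model directly, which avoids transcription slips. The following is a minimal sketch on synthetic stand-in data (not the heart dataset); the grid is shortened and all names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (hypothetical, not the heart dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)

param_grid = {
    "n_neighbors": [2, 4, 6, 8],
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ has already been refitted
# on all of X_tr using exactly the parameters in best_params_.
best_knn = search.best_estimator_

print("Best parameters:", search.best_params_)
print("Mean CV accuracy of the best candidate:", search.best_score_)
print("Test accuracy:", best_knn.score(X_te, y_te))
```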

In [23]:
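#Okay, so we've got some new hyperparameters to try
#Note: the grid search selected weights='distance', but this cell uses weights='uniform',
#so the results below are for the uniform-weighted model, not the exact grid-search winner
NewNeighbors = KNeighborsClassifier(n_neighbors = 8, weights = 'uniform', metric = 'manhattan')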
NewNeighbors.fit(X_train, Y_train)

In [24]:
y_new = NewNeighbors.predict(X_train)
y_NewNeighbors = NewNeighbors.predict(X_test)

In [27]:
#Another confusion matrix and set of scores to evaluate performance
from sklearn import metrics
print('Train set accuracy: ', metrics.accuracy_score(Y_train, y_new))
print('Test set accuracy: ', metrics.accuracy_score(Y_test, y_NewNeighbors))

Train set accuracy:  0.8909657320872274
Test set accuracy:  0.8804347826086957


In [28]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix (Y_test, y_NewNeighbors))

[[105  20]
 [ 13 138]]


In [29]:
from sklearn.metrics import classification_report
print(classification_report(Y_test, y_NewNeighbors))

              precision    recall  f1-score   support

           0       0.89      0.84      0.86       125
           1       0.87      0.91      0.89       151

    accuracy                           0.88       276
   macro avg       0.88      0.88      0.88       276
weighted avg       0.88      0.88      0.88       276



In [31]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(NewNeighbors, X, Y, cv = 5)

In [32]:
print('Model accuracy: ',np.mean(scores))

Model accuracy:  0.8365763839391779
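
When two cross-validated means are this close, the spread across folds helps judge whether the difference is meaningful. The sketch below reports mean and standard deviation for two k-nearest neighbors configurations on synthetic stand-in data (not the heart dataset); the model labels and data are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (hypothetical, not the heart dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "k=5, euclidean": KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
    "k=8, manhattan": KNeighborsClassifier(n_neighbors=8, metric="manhattan"),
}

results = {}
for name, model in models.items():
    # One accuracy value per fold; the standard deviation shows fold-to-fold spread.
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

If the gap between two means is smaller than the fold-to-fold standard deviation, the two models are hard to tell apart on this evidence.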


In [None]:
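#The model with the new hyperparameters scores slightly higher than the original:
#5-fold cross-validation accuracy rises from 0.8344 to 0.8366 (about 0.002, a marginal gain),
#and test set accuracy rises from 0.848 (234/276 in the first confusion matrix) to 0.880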