<h1>Heart Disease Binary Classification</h1>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer,IterativeImputer

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, RandomForestRegressor

from xgboost import XGBClassifier

from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, classification_report,
                             mean_absolute_error, mean_squared_error, r2_score)

<h2>Inspecting the Data for Missing Values</h2>

For this project we are required to train a model on one dataset and test it on two or more other datasets, to check that the model generalises and was not overfitted or underfitted during training.

Since multiple datasets are used, we keep only the features common to all three datasets and remove the rest.

The common features found in all three datasets were:
1. age
2. sex
3. chest pain type
4. blood pressure
5. cholesterol
6. fbs
7. restecg
8. max heart rate
9. exang
10. ST depression (oldpeak)
11. slope
12. Num Vessels
13. thal
14. heart_disease
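Restricting several datasets to their shared features can be sketched as a column intersection. The three toy frames below are stand-ins with made-up columns, not the actual datasets used in this project.

```python
import pandas as pd

# Toy stand-ins for the three datasets; columns and values are illustrative only.
uci = pd.DataFrame({"age": [63], "sex": ["Male"], "chol": [233.0], "dataset": ["Cleveland"]})
other_a = pd.DataFrame({"age": [52], "sex": ["Female"], "chol": [199.0], "smoker": [True]})
other_b = pd.DataFrame({"chol": [240.0], "age": [47], "sex": ["Male"], "bmi": [27.1]})

frames = [uci, other_a, other_b]

# Keep the columns present in every frame, in the order of the first frame.
common = [c for c in frames[0].columns if all(c in f.columns for f in frames[1:])]

# Restrict every frame to the shared columns, in the same order.
uci, other_a, other_b = (f[common] for f in frames)
print(common)
```

Keeping the column order identical across frames matters later, because a fitted model expects the test features in the same order as the training features.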

The initial model will be trained on the UCI Heart Disease Dataset.

In [2]:
# Raw string so the backslash in the Windows-style path is not read as an escape sequence
df = pd.read_csv(r"Datasets\heart_disease_uci.csv")
df.head()

Unnamed: 0,id,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
0,1,63,Male,Cleveland,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0.0,fixed defect,0
1,2,67,Male,Cleveland,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,flat,3.0,normal,2
2,3,67,Male,Cleveland,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,2.0,reversable defect,1
3,4,37,Male,Cleveland,non-anginal,130.0,250.0,False,normal,187.0,False,3.5,downsloping,0.0,normal,0
4,5,41,Female,Cleveland,atypical angina,130.0,204.0,False,lv hypertrophy,172.0,False,1.4,upsloping,0.0,normal,0


In [3]:
# Drop the identifier and source columns, which are not features
df.drop(columns=['id', 'dataset'], inplace=True)
df = df.rename(columns={'cp': 'chest_pain', 'trestbps': 'b_pressure','fbs':'b_sugar','thalch':'maxHeart_rate','ca':'num_vessels'})
df.head()

Unnamed: 0,age,sex,chest_pain,b_pressure,chol,b_sugar,restecg,maxHeart_rate,exang,oldpeak,slope,num_vessels,thal,num
0,63,Male,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0.0,fixed defect,0
1,67,Male,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,flat,3.0,normal,2
2,67,Male,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,2.0,reversable defect,1
3,37,Male,non-anginal,130.0,250.0,False,normal,187.0,False,3.5,downsloping,0.0,normal,0
4,41,Female,atypical angina,130.0,204.0,False,lv hypertrophy,172.0,False,1.4,upsloping,0.0,normal,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   age            920 non-null    int64  
 1   sex            920 non-null    object 
 2   chest_pain     920 non-null    object 
 3   b_pressure     861 non-null    float64
 4   chol           890 non-null    float64
 5   b_sugar        830 non-null    object 
 6   restecg        918 non-null    object 
 7   maxHeart_rate  865 non-null    float64
 8   exang          865 non-null    object 
 9   oldpeak        858 non-null    float64
 10  slope          611 non-null    object 
 11  num_vessels    309 non-null    float64
 12  thal           434 non-null    object 
 13  num            920 non-null    int64  
dtypes: float64(5), int64(2), object(7)
memory usage: 100.8+ KB


In [25]:
df.head()

Unnamed: 0,age,sex,chest_pain,b_pressure,chol,b_sugar,restecg,maxHeart_rate,exang,oldpeak,slope,num_vessels,thal,num
0,63,1,typical angina,145.0,233.0,1.0,lv hypertrophy,150.0,2.0,2.3,downsloping,0.0,fixed defect,0
1,67,1,asymptomatic,160.0,286.0,2.0,lv hypertrophy,108.0,1.0,1.5,flat,3.0,normal,2
2,67,1,asymptomatic,120.0,229.0,2.0,lv hypertrophy,129.0,1.0,2.6,flat,2.0,reversable defect,1
3,37,1,non-anginal,130.0,250.0,2.0,normal,187.0,2.0,3.5,downsloping,0.0,normal,0
4,41,2,atypical angina,130.0,204.0,2.0,lv hypertrophy,172.0,2.0,1.4,upsloping,0.0,normal,0


In [5]:
df.describe()

Unnamed: 0,age,b_pressure,chol,maxHeart_rate,oldpeak,num_vessels,num
count,920.0,861.0,890.0,865.0,858.0,309.0,920.0
mean,53.51087,132.132404,199.130337,137.545665,0.878788,0.676375,0.995652
std,9.424685,19.06607,110.78081,25.926276,1.091226,0.935653,1.142693
min,28.0,0.0,0.0,60.0,-2.6,0.0,0.0
25%,47.0,120.0,175.0,120.0,0.0,0.0,0.0
50%,54.0,130.0,223.0,140.0,0.5,0.0,1.0
75%,60.0,140.0,268.0,157.0,1.5,1.0,2.0
max,77.0,200.0,603.0,202.0,6.2,3.0,4.0


Most of the columns contain missing values, so we will need to impute data and may have to drop some columns.

In [6]:
(df.isna().sum()/len(df) *100).sort_values(ascending=False)

num_vessels      66.413043
thal             52.826087
slope            33.586957
b_sugar           9.782609
oldpeak           6.739130
b_pressure        6.413043
maxHeart_rate     5.978261
exang             5.978261
chol              3.260870
restecg           0.217391
age               0.000000
sex               0.000000
chest_pain        0.000000
num               0.000000
dtype: float64
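One way to decide which columns to drop is a cut-off on the fraction of missing values. The sketch below uses a toy frame, and the 0.5 threshold is an assumed value, not one chosen in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame; values are illustrative only.
toy = pd.DataFrame({
    "age": [63, 67, 37, 41],
    "chol": [233.0, np.nan, 250.0, 204.0],
    "num_vessels": [0.0, np.nan, np.nan, np.nan],
})

missing_frac = toy.isna().mean()   # fraction of missing values per column
threshold = 0.5                    # hypothetical cut-off
to_drop = missing_frac[missing_frac > threshold].index.tolist()

reduced = toy.drop(columns=to_drop)
print(to_drop)
```

Columns below the cut-off stay in the frame and are candidates for imputation instead.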

In [7]:
# Names of the columns that contain at least one missing value
missing_data_cols = df.isnull().sum()[df.isnull().sum() > 0].index.tolist()
missing_data_cols

['b_pressure',
 'chol',
 'b_sugar',
 'restecg',
 'maxHeart_rate',
 'exang',
 'oldpeak',
 'slope',
 'num_vessels',
 'thal']

In [8]:
# Column groups, using the renamed column names from In [3]
categorical_cols = ['thal', 'num_vessels', 'slope', 'exang', 'restecg', 'b_sugar', 'chest_pain', 'sex', 'num']
bool_cols = ['b_sugar', 'exang']
numeric_cols = ['oldpeak', 'maxHeart_rate', 'chol', 'b_pressure', 'age']

In [9]:
imputer = SimpleImputer(strategy="mean")

In [17]:
# The mean strategy only accepts numeric data, so fit on the numeric columns only
imputed_data = imputer.fit_transform(df[numeric_cols])

In [16]:
for col in missing_data_cols:
    print("Missing Values", col, ":", str(round((df[col].isnull().sum() / len(df)) * 100, 2))+"%")
    if col in numeric_cols:
        # SimpleImputer expects 2-D input: pass the one-column DataFrame df[[col]], not the Series df[col]
        df[col] = imputer.fit_transform(df[[col]]).ravel()
    # Categorical columns are left untouched here; mean imputation does not apply to them

Missing Values b_pressure : 6.41%
Missing Values chol : 3.26%
Missing Values b_sugar : 9.78%
Missing Values restecg : 0.22%
Missing Values maxHeart_rate : 5.98%
Missing Values exang : 5.98%
Missing Values oldpeak : 6.74%
Missing Values slope : 33.59%
Missing Values num_vessels : 66.41%
Missing Values thal : 52.83%
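Mean imputation only applies to numeric columns, and `SimpleImputer` needs 2-D input (a DataFrame, not a Series). A minimal sketch on a toy frame follows, assuming the categorical columns are filled with their most frequent value. That strategy is an assumption made for illustration, not a choice taken in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame; values are illustrative only.
toy = pd.DataFrame({
    "chol": [233.0, np.nan, 229.0],
    "b_pressure": [145.0, 160.0, np.nan],
    "slope": pd.Series(["flat", np.nan, "flat"], dtype=object),
})
num_cols = ["chol", "b_pressure"]
cat_cols = ["slope"]

# Mean imputation for the numeric columns (2-D input: a DataFrame).
toy[num_cols] = SimpleImputer(strategy="mean").fit_transform(toy[num_cols])

# Most-frequent imputation for the categorical columns (assumed strategy).
toy[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(toy[cat_cols])

print(toy)
```

Selecting columns with a list (`toy[num_cols]`) always yields a DataFrame, which avoids the "Expected a 2-dimensional container" error that a single Series triggers.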

In [None]:
df.info()

<h1>Machine Learning Algorithms</h1>


In [None]:
X = df.drop('num', axis=1)
y = df['num']

# Encode the categorical columns with a LabelEncoder in a for loop
# (LabelEncoder is already imported in In [1])

le = LabelEncoder()
for col in X.columns:
    if X[col].dtype == 'object' or X[col].dtype == 'category':
        X[col] = le.fit_transform(X[col])


# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2 , random_state=42)
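The `num` column takes values from 0 to 4 (see the `df.describe()` output), while the task is framed as binary classification. A possible way to derive a binary target is sketched below on a toy series, assuming 0 means no disease and any value from 1 to 4 means disease. The name `heart_disease` is taken from the feature list, and the mapping itself is an assumption.

```python
import pandas as pd

# Toy target using the 0-4 coding seen in the `num` column.
num = pd.Series([0, 2, 1, 0, 4], name="num")

# Assumption: 0 = no disease, 1-4 = disease present.
heart_disease = (num > 0).astype(int).rename("heart_disease")
print(heart_disease.tolist())
```

With a binary target, the default `average='binary'` setting of the precision, recall and F1 functions reports scores for the positive class directly.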

<h2>Decision Tree</h2>

In [None]:
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)

# predict the test data
y_pred = dt.predict(X_test)

print('Accuracy score: ', accuracy_score(y_test, y_pred))
print('Precision score: ', precision_score(y_test, y_pred, average='micro'))
print('Recall score: ', recall_score(y_test, y_pred, average='micro'))
print('F1 score: ', f1_score(y_test, y_pred, average='micro'))
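With single-label multiclass targets, micro-averaged precision, recall and F1 all equal accuracy, so the four numbers printed above coincide. Macro averaging weights every class equally and can differ. A small check on toy labels (not the notebook's predictions):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels; illustrative only.
y_true = [0, 0, 1, 2, 2, 2]
y_hat = [0, 1, 1, 2, 2, 0]

acc = accuracy_score(y_true, y_hat)
micro_p = precision_score(y_true, y_hat, average="micro")
micro_r = recall_score(y_true, y_hat, average="micro")
micro_f1 = f1_score(y_true, y_hat, average="micro")
macro_f1 = f1_score(y_true, y_hat, average="macro")

print(f"accuracy={acc:.3f} micro-F1={micro_f1:.3f} macro-F1={macro_f1:.3f}")
```

Reporting a macro average or the full `classification_report` alongside accuracy gives more information when the classes are imbalanced.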

<h2>Logistic Regression</h2>

In [None]:
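# LogisticRegression has no max_depth parameter; max_iter=1000 is an assumed setting
# that gives the solver more iterations to converge on unscaled features
lr = LogisticRegression(max_iter=1000, random_state=42)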
lr.fit(X_train, y_train)

# predict the test data
y_pred = lr.predict(X_test)

print('Accuracy score: ', accuracy_score(y_test, y_pred))
print('Precision score: ', precision_score(y_test, y_pred, average='micro'))
print('Recall score: ', recall_score(y_test, y_pred, average='micro'))
print('F1 score: ', f1_score(y_test, y_pred, average='micro'))

<h2>K Nearest Neighbours</h2>

In [None]:
# KNeighborsClassifier takes neither max_depth nor random_state; n_neighbors=5 is the scikit-learn default
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# predict the test data
y_pred = knn.predict(X_test)

print('Accuracy score: ', accuracy_score(y_test, y_pred))
print('Precision score: ', precision_score(y_test, y_pred, average='micro'))
print('Recall score: ', recall_score(y_test, y_pred, average='micro'))
print('F1 score: ', f1_score(y_test, y_pred, average='micro'))
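K nearest neighbours is distance-based, so features on larger scales (for example cholesterol versus ST depression) can dominate the distance. A hedged sketch of combining `StandardScaler` with the classifier in a pipeline follows, using synthetic data from `make_classification` as a stand-in for the heart disease features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the heart disease dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is fit on the training split only, then applied to the test split.
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_scaled.fit(X_tr, y_tr)

acc = knn_scaled.score(X_te, y_te)
print(f"accuracy: {acc:.3f}")
```

Putting the scaler inside the pipeline keeps test-set statistics out of training, which also matters when the fitted model is later evaluated on the other datasets.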