<h1>Heart Disease Binary Classification</h1>

In [48]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer,IterativeImputer

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, RandomForestRegressor

from xgboost import XGBClassifier

from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score, confusion_matrix, classification_report, mean_absolute_error, mean_squared_error, r2_score

<h2>Inspecting the Data for Missing Values</h2>

For this project we are required to train a model on one dataset and test the dataset on 2 or more datasets to ensure that the model is a good fit and was not over/underfitted in training.

Therefore since multiple datasets were used we will focus only on the features common to all three datasets and remove the rest.

The common features found in all three datasets were:
1. age
2. sex
3. chestpain type
4. blood pressure
5. cholesterol
6. fbs
7. restecg
8. max heart rate
9. exang
10. ST depression(old peak)
11. slope
12. Num Vessels
13. thal
14. heart_disease

The initial model will be trained on the UCI Heart Disease Dataset.

In [15]:
df = pd.read_csv("Datasets\heart_disease_uci.csv")
df.head()

Unnamed: 0,id,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
0,1,63,Male,Cleveland,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0.0,fixed defect,0
1,2,67,Male,Cleveland,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,flat,3.0,normal,2
2,3,67,Male,Cleveland,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,2.0,reversable defect,1
3,4,37,Male,Cleveland,non-anginal,130.0,250.0,False,normal,187.0,False,3.5,downsloping,0.0,normal,0
4,5,41,Female,Cleveland,atypical angina,130.0,204.0,False,lv hypertrophy,172.0,False,1.4,upsloping,0.0,normal,0


In [16]:
df.drop(df[['id','dataset']],axis=1,inplace=True)
#df = df.rename(columns={'cp': 'chest_pain', 'trestbps': 'b_pressure','fbs':'b_sugar','thalch':'maxHeart_rate','ca':'num_vessels'})
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
0,63,Male,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0.0,fixed defect,0
1,67,Male,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,flat,3.0,normal,2
2,67,Male,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,2.0,reversable defect,1
3,37,Male,non-anginal,130.0,250.0,False,normal,187.0,False,3.5,downsloping,0.0,normal,0
4,41,Female,atypical angina,130.0,204.0,False,lv hypertrophy,172.0,False,1.4,upsloping,0.0,normal,0


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       920 non-null    int64  
 1   sex       920 non-null    object 
 2   cp        920 non-null    object 
 3   trestbps  861 non-null    float64
 4   chol      890 non-null    float64
 5   fbs       830 non-null    object 
 6   restecg   918 non-null    object 
 7   thalch    865 non-null    float64
 8   exang     865 non-null    object 
 9   oldpeak   858 non-null    float64
 10  slope     611 non-null    object 
 11  ca        309 non-null    float64
 12  thal      434 non-null    object 
 13  num       920 non-null    int64  
dtypes: float64(5), int64(2), object(7)
memory usage: 100.8+ KB


In [18]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
0,63,Male,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0.0,fixed defect,0
1,67,Male,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,flat,3.0,normal,2
2,67,Male,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,2.0,reversable defect,1
3,37,Male,non-anginal,130.0,250.0,False,normal,187.0,False,3.5,downsloping,0.0,normal,0
4,41,Female,atypical angina,130.0,204.0,False,lv hypertrophy,172.0,False,1.4,upsloping,0.0,normal,0


In [19]:
df.describe()

Unnamed: 0,age,trestbps,chol,thalch,oldpeak,ca,num
count,920.0,861.0,890.0,865.0,858.0,309.0,920.0
mean,53.51087,132.132404,199.130337,137.545665,0.878788,0.676375,0.995652
std,9.424685,19.06607,110.78081,25.926276,1.091226,0.935653,1.142693
min,28.0,0.0,0.0,60.0,-2.6,0.0,0.0
25%,47.0,120.0,175.0,120.0,0.0,0.0,0.0
50%,54.0,130.0,223.0,140.0,0.5,0.0,1.0
75%,60.0,140.0,268.0,157.0,1.5,1.0,2.0
max,77.0,200.0,603.0,202.0,6.2,3.0,4.0


We see that most of the columns contain missing values so we will have to do data imputation and we may have to drop some columns.

In [20]:
print(f"Percentage of missing values in each column is:\n {(df.isna().sum()/len(df) *100).sort_values(ascending=False)}")

Percentage of missing values in each column is:
 ca          66.413043
thal        52.826087
slope       33.586957
fbs          9.782609
oldpeak      6.739130
trestbps     6.413043
thalch       5.978261
exang        5.978261
chol         3.260870
restecg      0.217391
age          0.000000
sex          0.000000
cp           0.000000
num          0.000000
dtype: float64


In [21]:
df.isnull().sum()[df.isnull().sum() > 0].sort_values(ascending=False)
missing_data_cols = df.isnull().sum()[df.isnull().sum() > 0].index.tolist()
missing_data_cols

['trestbps',
 'chol',
 'fbs',
 'restecg',
 'thalch',
 'exang',
 'oldpeak',
 'slope',
 'ca',
 'thal']

In [22]:
categorical_cols = ['thal', 'ca', 'slope', 'exang', 'restecg','fbs', 'cp', 'sex', 'num']
bool_cols = ['fbs', 'exang']
numeric_cols = ['oldpeak', 'thalch', 'chol', 'trestbps', 'age']

In [23]:
le = LabelEncoder()

for col in categorical_cols:
    df[col] = le.fit_transform(df[col])

In [33]:
num_imputer = SimpleImputer(strategy="mean")
cat_imputer = SimpleImputer(strategy="most_frequent")

In [34]:
imputed_df = df.copy()
imputed_df['trestbps'] = num_imputer.fit_transform(imputed_df)

In [35]:
imputed_df['trestbps'].isna().sum()

0

In [42]:
for col in missing_data_cols:
    if col in categorical_cols:
        imputed_df[col] = cat_imputer.fit_transform(imputed_df[[col]])
    elif col in numeric_cols:
        imputed_df[col] = num_imputer.fit_transform(imputed_df[[col]])
    else:
        pass

In [44]:
imputed_df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
0,63,1,3,63.0,233.0,1,0,150.0,0,2.3,0,0,0,0
1,67,1,0,67.0,286.0,0,0,108.0,1,1.5,1,3,1,2
2,67,1,0,67.0,229.0,0,0,129.0,1,2.6,1,2,2,1
3,37,1,2,37.0,250.0,0,1,187.0,0,3.5,0,0,1,0
4,41,0,1,41.0,204.0,0,0,172.0,0,1.4,2,0,1,0


In [45]:
imputed_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       920 non-null    int64  
 1   sex       920 non-null    int32  
 2   cp        920 non-null    int32  
 3   trestbps  920 non-null    float64
 4   chol      920 non-null    float64
 5   fbs       920 non-null    int32  
 6   restecg   920 non-null    int32  
 7   thalch    920 non-null    float64
 8   exang     920 non-null    int32  
 9   oldpeak   920 non-null    float64
 10  slope     920 non-null    int32  
 11  ca        920 non-null    int64  
 12  thal      920 non-null    int32  
 13  num       920 non-null    int64  
dtypes: float64(4), int32(7), int64(3)
memory usage: 75.6 KB


<h1>Handling Outliers</h1>

Since outliers are usually wrt continuous data we will only examine our numeric columns

In [67]:
imputed_df[numeric_cols].describe()

Unnamed: 0,oldpeak,thalch,chol,trestbps,age
count,920.0,920.0,920.0,920.0,920.0
mean,0.878788,137.545665,199.130337,53.51087,53.51087
std,1.053774,25.138494,108.957634,9.424685,9.424685
min,-2.6,60.0,0.0,28.0,28.0
25%,0.0,120.0,177.75,47.0,47.0
50%,0.8,138.0,221.0,54.0,54.0
75%,1.5,156.0,267.0,60.0,60.0
max,6.2,202.0,603.0,77.0,77.0


<h1>Machine Learning Algorithms</h1>


In [46]:
X = imputed_df.drop('num', axis=1)
y = imputed_df['num']

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2 , random_state=42)

<h2>Support Vector Machine</h2>

In [55]:
svm_model = SVC()
svm_model.fit(X_train, y_train)

# predict the test data
y_pred = svm_model.predict(X_test)

print('Accuracy score: ', accuracy_score(y_test, y_pred))
print('Precision score: ', precision_score(y_test, y_pred, average='micro'))
print('Recall score: ', recall_score(y_test, y_pred, average='micro'))
print('F1 score: ', f1_score(y_test, y_pred, average='micro'))

Accuracy score:  0.44565217391304346
Precision score:  0.44565217391304346
Recall score:  0.44565217391304346
F1 score:  0.44565217391304346


<h2>Logistic Regression</h2>

In [59]:
lr = LogisticRegression()
lr.fit(X_train, y_train)

# predict the test data
y_pred = lr.predict(X_test)

print('Accuracy score: ', accuracy_score(y_test, y_pred))
print('Precision score: ', precision_score(y_test, y_pred, average='micro'))
print('Recall score: ', recall_score(y_test, y_pred, average='micro'))
print('F1 score: ', f1_score(y_test, y_pred, average='micro'))

Accuracy score:  0.5543478260869565
Precision score:  0.5543478260869565
Recall score:  0.5543478260869565
F1 score:  0.5543478260869565


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


<h2>K Nearest Neighbours</h2>

In [60]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

# predict the test data
y_pred = knn.predict(X_test)

print('Accuracy score: ', accuracy_score(y_test, y_pred))
print('Precision score: ', precision_score(y_test, y_pred, average='micro'))
print('Recall score: ', recall_score(y_test, y_pred, average='micro'))
print('F1 score: ', f1_score(y_test, y_pred, average='micro'))

Accuracy score:  0.4673913043478261
Precision score:  0.4673913043478261
Recall score:  0.4673913043478261
F1 score:  0.4673913043478261


In [61]:
xgb = XGBClassifier()
xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_test)

print('Accuracy score: ', accuracy_score(y_test, y_pred))
print('Precision score: ', precision_score(y_test, y_pred, average='micro'))
print('Recall score: ', recall_score(y_test, y_pred, average='micro'))
print('F1 score: ', f1_score(y_test, y_pred, average='micro'))

Accuracy score:  0.6032608695652174
Precision score:  0.6032608695652174
Recall score:  0.6032608695652174
F1 score:  0.6032608695652174


In [64]:
rf = RandomForestClassifier()
rf.fit(X_train,y_train)

y_pred = rf.predict(X_test)

print('Accuracy score: ', accuracy_score(y_test, y_pred))
print('Precision score: ', precision_score(y_test, y_pred, average='micro'))
print('Recall score: ', recall_score(y_test, y_pred, average='micro'))
print('F1 score: ', f1_score(y_test, y_pred, average='micro'))

Accuracy score:  0.5760869565217391
Precision score:  0.5760869565217391
Recall score:  0.5760869565217391
F1 score:  0.5760869565217391
