## Predicting Heart Disease using Machine Learning

This project builds a machine learning model to predict whether or not a patient has heart disease, based on their medical attributes.

## The approach is outlined below:

1. Problem Definition:
Given clinical parameters about a patient, can we predict whether or not they have heart disease?

2. Data:
The data comes from https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset

3. Evaluation:
If the model reaches 95% accuracy at predicting whether or not a patient has heart disease, we will pursue the project.

4. Features:
The features used by the model are described in the data dictionary below.

**Data Dictionary**

* id: unique id for each patient
* age: age of the patient in years
* origin: place of study
* sex: 1 = male; 0 = female
* cp: chest pain type [typical angina, atypical angina, non-anginal, asymptomatic]
* trestbps: resting blood pressure (in mm Hg on admission to the hospital)
* chol: serum cholesterol in mg/dl
* fbs: whether fasting blood sugar > 120 mg/dl
* restecg: resting electrocardiographic results; values: [normal, stt abnormality, lv hypertrophy]
* thalach: maximum heart rate achieved
* exang: exercise-induced angina (True/False)
* oldpeak: ST depression induced by exercise relative to rest
* slope: the slope of the peak exercise ST segment
* ca: number of major vessels (0-3) colored by fluoroscopy
* thal: [normal; fixed defect; reversible defect]
* num: the predicted attribute (the label column is named `target` in the CSV loaded below)



## 5. Modelling

In [1]:
# Import the data analysis and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load Data
df = pd.read_csv("heart-disease.csv")


In [3]:
df.head()

index,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [5]:
# The target column holds each patient's heart disease label.
# Count its values to check the class balance: the positive and negative classes are close in size.
df["target"].value_counts()

1    165
0    138
Name: target, dtype: int64
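
The counts above can also be read as proportions, which makes the class balance easier to judge at a glance. A minimal sketch, assuming a stand-in Series rebuilt from the reported counts rather than the CSV itself (the name `target` is illustrative):

```python
import pandas as pd

# Stand-in for df["target"], rebuilt from the counts reported above (165 positive, 138 negative)
target = pd.Series([1] * 165 + [0] * 138, name="target")

# normalize=True returns the share of each class instead of the raw count
proportions = target.value_counts(normalize=True)
print(proportions.round(3))
```

On the real DataFrame the same call would be `df["target"].value_counts(normalize=True)`.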

In [6]:
# Compare heart disease frequency by sex
pd.crosstab(df["target"], df["sex"])

target,sex=0,sex=1
0,24,114
1,72,93
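
Raw counts are hard to compare because the two groups differ in size. A small sketch, assuming the counts from the crosstab above are typed in by hand (the names `counts` and `share_by_sex` are illustrative), that converts them to within-group shares:

```python
import pandas as pd

# Counts copied from the crosstab above: rows are target (0/1), columns are sex (0 = female, 1 = male)
counts = pd.DataFrame({0: [24, 72], 1: [114, 93]}, index=[0, 1])
counts.index.name = "target"
counts.columns.name = "sex"

# Divide each column by its total to get the share of each target value within a sex
share_by_sex = counts / counts.sum(axis=0)
print(share_by_sex.round(2))
```

On the real DataFrame, `pd.crosstab(df["target"], df["sex"], normalize="columns")` should give the same shares directly.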


In [7]:
# Compare heart disease frequency by chest pain type
pd.crosstab(df["target"], df["cp"])

target,cp=0,cp=1,cp=2,cp=3
0,104,9,18,7
1,39,41,69,16


In [5]:
# Split the data into features (X) and labels (y)
X = df.drop("target", axis=1)  # axis=1 drops a column; axis=0 would drop rows
y = df["target"]
X

index,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3


In [6]:
y

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64

In [1]:
# The PyPI package is named scikit-learn; the `sklearn` name on PyPI is only a deprecated placeholder and should not be installed
!pip install scikit-learn



In [7]:
# Import the train/test splitting and hyperparameter search utilities

from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV


In [8]:
np.random.seed(42)  # seed NumPy's global random generator so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
X_train

index,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
132,42,1,1,120,295,0,1,162,0,0.0,2,0,2
202,58,1,0,150,270,0,0,111,1,0.8,2,0,3
196,46,1,2,150,231,0,1,147,0,3.6,1,0,2
75,55,0,1,135,250,0,0,161,0,1.4,1,0,2
176,60,1,0,117,230,1,1,160,1,1.4,2,2,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
188,50,1,2,140,233,0,1,163,0,0.6,1,1,3
71,51,1,2,94,227,0,1,154,1,0.0,2,1,3
106,69,1,3,160,234,1,0,131,0,0.1,1,1,2
270,46,1,0,120,249,0,0,144,0,0.8,2,0,3


In [15]:
len(y_train), len(y_test)

(242, 61)
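
An alternative way to make the split repeatable is to pass `random_state` directly, and `stratify` keeps the class ratio similar in both parts. A minimal sketch, assuming synthetic stand-in arrays with the same size and class counts as the dataset (`X_demo`, `y_demo` and the other names are illustrative, not the notebook's variables):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with the same size and class counts as the dataset above
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(303, 13))
y_demo = np.array([1] * 165 + [0] * 138)

# random_state fixes the shuffle; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(len(y_tr), len(y_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Stratifying is a design choice rather than a requirement; it matters most when the test set is small, as it is here.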

## Selecting the Right Model
We will try three estimators from scikit-learn: KNeighborsClassifier, RandomForestClassifier and LogisticRegression.

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score



In [10]:
# Put the models in a dictionary
models = {"Logistic Regression": LogisticRegression(),
          "KNN": KNeighborsClassifier(),
          "Random Forest": RandomForestClassifier()}

# Function to fit and score each model
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fit each scikit-learn model on the training data and return its accuracy on the test data.
    models: dict mapping a display name to an unfitted estimator
    """
    np.random.seed(42)
    # Dictionary to keep the model scores
    model_scores = {}
    # Loop through the models (name is the key and model is the value in the models dictionary)
    for name, model in models.items():
        # Fit the model to the training data
        model.fit(X_train, y_train)
        # Evaluate the model on the test data and store its accuracy under its name
        model_scores[name] = model.score(X_test, y_test)
    return model_scores

Instantiated_model_score = fit_and_score(models=models, X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test)
Instantiated_model_score

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


{'Logistic Regression': 0.8852459016393442,
 'KNN': 0.6885245901639344,
 'Random Forest': 0.8360655737704918}
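
The convergence warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of doing both in a pipeline, assuming synthetic data from `make_classification` in place of the heart-disease features (all names here are illustrative, and the score it prints says nothing about the real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the heart-disease features (303 rows, 13 columns)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scale features to zero mean and unit variance, then fit logistic regression
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_reg.fit(X_tr, y_tr)

# n_iter_ reports how many solver iterations were actually used
print(scaled_log_reg[-1].n_iter_)
test_accuracy = scaled_log_reg.score(X_te, y_te)
print(test_accuracy)
```

Scaling would likely help KNN as well, since it relies on distances between feature vectors.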

## Hyperparameter Tuning to Improve the Model
* Of the three models evaluated above, Logistic Regression has the highest test accuracy (about 88.5%), so it serves as the baseline to improve on.
* However, the goal set in the Evaluation step is 95% accuracy, so the cells below tune the hyperparameters of the KNN and Random Forest models to see how close they can get.
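
When comparing a score against the 95% goal, it helps to remember that the test set has only 61 patients, so a single accuracy figure is a rough estimate. A back-of-the-envelope sketch, assuming the normal approximation for a proportion (an approximation chosen here for illustration, not part of the original analysis):

```python
import math

n_test = 61                      # size of the test set above
accuracy = 0.8852459016393442    # Logistic Regression score reported above

# Standard error of a proportion, then an approximate 95% interval (normal approximation)
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)
lower = accuracy - 1.96 * std_err
upper = accuracy + 1.96 * std_err
print(f"{accuracy:.3f} +/- {1.96 * std_err:.3f} -> [{lower:.3f}, {upper:.3f}]")
```

The interval spans roughly eight percentage points on either side, so small differences between models on this test set should be read with caution.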

In [11]:
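# Tune the KNeighborsClassifier model using GridSearchCV
estimator_KNN = KNeighborsClassifier(algorithm='auto')
# Each tuple lists candidate values, not a range: (1, 30, 1) tries only n_neighbors = 1 and 30,
# and (20, 40, 1) tries only leaf_size = 20, 40 and 1. Use range(1, 31) to search every value from 1 to 30.
# Note that 'p' only has an effect with the minkowski metric.
parameters_KNN = {
    'n_neighbors': (1, 30, 1),
    'leaf_size': (20, 40, 1),
    'p': (1, 2),
    'weights': ('uniform', 'distance'),
    'metric': ('minkowski', 'chebyshev')}

grid_search_KNN = GridSearchCV(
    estimator=estimator_KNN,
    param_grid=parameters_KNN,
    scoring='accuracy',
    n_jobs=-1,  # use all available CPU cores
    cv=5        # 5-fold cross-validation
)

grid_search_KNN.fit(X_train, y_train)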




In [11]:
grid_search_KNN.best_params_

{'leaf_size': 20,
 'metric': 'minkowski',
 'n_neighbors': 30,
 'p': 1,
 'weights': 'uniform'}
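
The best `n_neighbors` sits at the edge of the values that were tried, which hints that a finer search could be worthwhile. A sketch of one way to search every neighbor count from 1 to 30 with scaling inside the pipeline, assuming synthetic data in place of `X_train` and `y_train` (the step names `scale` and `knn` and all variables are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=242, n_features=13, random_state=42)

# Scaling inside the pipeline means each CV fold fits its scaler on its own training part
knn_pipeline = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# range(1, 31) enumerates every neighbor count from 1 to 30
param_grid = {
    "knn__n_neighbors": list(range(1, 31)),
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],
}
knn_search = GridSearchCV(knn_pipeline, param_grid, scoring="accuracy", cv=5)
knn_search.fit(X_demo, y_demo)
print(knn_search.best_params_)
print(round(knn_search.best_score_, 3))
```

The `knn__` prefix routes each parameter to the pipeline step with that name.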

In [12]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Predict on the test set with the best KNN model found by the grid search
y_preds = grid_search_KNN.predict(X_test)
accuracy = accuracy_score(y_test, y_preds)
precision = precision_score(y_test, y_preds)
recall = recall_score(y_test, y_preds)
f1 = f1_score(y_test, y_preds)
metric_dict = {"accuracy": round(accuracy, 2),
               "precision": round(precision, 2),
               "recall": round(recall, 2),
               "f1": round(f1, 2)}
# accuracy: the share of test samples whose label was predicted correctly
print(f"accuracy score:{accuracy*100:.2f}%")
# precision: the ability of the model not to label a negative sample as positive (best value 1, worst 0)
print(f"precision score:{precision:.2f}")
# recall: the ability of the model to find all the positive samples (the closer to 1, the better)
print(f"recall score:{recall:.2f}")
# f1: the harmonic mean of precision and recall (the closer to 1, the better)
print(f"f1 score:{f1:.2f}")

accuracy score:75.41%
precision score:0.73
recall score:0.84
f1 score:0.78
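
To make the definitions of these metrics concrete, the sketch below derives them by hand from a confusion matrix. It uses a small set of toy labels invented for illustration (not the notebook's predictions):

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy labels for illustration only (not the notebook's predictions)
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(tn, fp, fn, tp)
print(precision, recall, f1)
```

The hand-computed values agree with `precision_score`, `recall_score` and `f1_score` on the same labels.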


In [21]:
# Tune the RandomForestClassifier model using RandomizedSearchCV
randforestclassifier = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
       "max_depth": [None, 5, 10, 20, 30],
       # "auto" is deprecated for max_features in newer scikit-learn releases (for classifiers it behaved like "sqrt")
       "max_features": ["auto", "sqrt"],
       "min_samples_split": [2, 4, 6],
       "min_samples_leaf": [1, 2, 4]}
randomized_log = RandomizedSearchCV(
    estimator=RandomForestClassifier(),
    param_distributions=randforestclassifier,
    n_iter=10,  # number of hyperparameter combinations to sample and try
    cv=5, verbose=2
)
# Fit the RandomizedSearchCV object on the training data
randomized_log.fit(X_train, y_train)
randomized_log.best_params_

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END max_depth=5, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   0.3s
[CV] END max_depth=5, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   0.3s
[CV] END max_depth=5, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   0.3s
[CV] END max_depth=5, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   0.2s
[CV] END max_depth=5, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   0.2s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=1200; total time=   1.8s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=1200; total time=   1.8s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=1200; total time=   1.7s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=1200; total time=   1.8s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=1200; total time=   1.8s
[CV] END max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=500; total time=   0.7s
[CV] END max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=500; total time=   0.8s
[CV] END max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=500; total time=   0.8s
[CV] END max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=500; total time=   0.8s
[CV] END max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=500; total time=   0.9s
[CV] END max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=1200; total time=   2.0s
[CV] END max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=1200; total time=   1.9s
[CV] END max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=1200; total time=   1.8s
[CV] END max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=1200; total time=   1.8s
[CV] END max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=1200; total time=   1.7s
[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   0.1s
[CV] END max_depth=5, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   0.1s
[CV] END max_depth=5, max_f
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=4, n_estimators=1000; total time=   1.6s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=4, n_estimators=1000; total time=   1.5s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=4, n_estimators=1000; total time=   1.8s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=4, n_estimators=1000; total time=   1.5s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=4, n_estimators=1000; total time=   1.7s
[CV] END max_depth=30, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=100; total time=   0.1s
[CV] END max_depth=30, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=100; total time=   0.1s
[CV] END max_depth=30, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=100; total time=   0.1s
[CV] END max_depth=30, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=100; total time=   0.1s
[CV] END max_depth=30, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=100; total time=   0.1s
[CV] END max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=10; total time=   0.0s
[CV] END max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=10; total time=   0.0s
[CV] END max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=10; total time=   0.0s
[CV] END max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=10; total time=   0.0s
[CV] END max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=10; total time=   0.0s
[CV] END max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=100; total time=   0.1s
[CV] END max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=100; total time=   0.1s
[CV] END max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=100; total time=   0.1s
[CV] END max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=100; total time=   0.1s
[CV] END max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=100; total time=   0.1s


{'n_estimators': 100,
 'min_samples_split': 2,
 'min_samples_leaf': 2,
 'max_features': 'sqrt',
 'max_depth': 5}
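
Because RandomizedSearchCV samples candidates at random, rerunning the cell above can return different best parameters. A sketch of a repeatable variant, assuming synthetic data and a deliberately small search space so that it runs quickly (the names and the grid values are illustrative, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=242, n_features=13, random_state=42)

# A small search space; "sqrt" and "log2" are used here instead of "auto"
param_distributions = {
    "n_estimators": [10, 50, 100],
    "max_depth": [None, 5, 10],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 2, 4],
}

# random_state on both the forest and the search makes the sampled candidates repeatable
rf_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=5,
    random_state=42,
)
rf_search.fit(X_demo, y_demo)
print(rf_search.best_params_)
print(round(rf_search.best_score_, 3))  # mean cross-validated accuracy of the best candidate
```

`best_score_` is a cross-validated estimate on the training data, so it is a useful companion to the single test-set score.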

In [22]:
# Make predictions with the best hyperparameters found by the search
y_preds = randomized_log.predict(X_test)
# Evaluate the predictions
accuracy = accuracy_score(y_test, y_preds)
precision = precision_score(y_test, y_preds)
recall = recall_score(y_test, y_preds)
f1 = f1_score(y_test, y_preds)

# accuracy: the share of test samples whose label was predicted correctly
print(f"accuracy score:{accuracy*100:.2f}%")
# precision: the ability of the model not to label a negative sample as positive (best value 1, worst 0)
print(f"precision score:{precision:.2f}")
# recall: the ability of the model to find all the positive samples (the closer to 1, the better)
print(f"recall score:{recall:.2f}")
# f1: the harmonic mean of precision and recall (the closer to 1, the better)
print(f"f1 score:{f1:.2f}")

accuracy score:86.89%
precision score:0.85
recall score:0.91
f1 score:0.88
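
The scores above come from a single 61-sample test set. One way to get a steadier picture is to cross-validate the same metrics. A sketch using the hyperparameters from `best_params_` above, assuming synthetic data in place of the real features (so the printed numbers say nothing about the heart-disease dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the full dataset (303 rows, 13 features)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)

# Hyperparameters copied from best_params_ above
model = RandomForestClassifier(
    n_estimators=100, min_samples_split=2, min_samples_leaf=2,
    max_features="sqrt", max_depth=5, random_state=42,
)

# cross_validate scores several metrics on each of the 5 folds
metrics = ["accuracy", "precision", "recall", "f1"]
cv_results = cross_validate(model, X_demo, y_demo, cv=5, scoring=metrics)
for metric in metrics:
    scores = cv_results[f"test_{metric}"]
    print(f"{metric}: {scores.mean():.2f} (+/- {scores.std():.2f})")
```

The spread across folds gives a rough sense of how much each metric could move with a different split.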
