# Heart attack prediction
Using the heart attack analysis and prediction dataset, I will try to predict the likelihood of someone having a heart attack based on the given data.

The following tools are used:
* KNN classifier and data pipelines to fit the model and make predictions.
* Scaling to put the features on a common scale and obtain a higher precision.
* train_test_split to split the data into training and test sets.
* classification_report to measure the performance of the model.
* GridSearchCV to choose the best hyperparameters for the KNN classifier.

Link to the original Kaggle dataset: https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset?select=heart.csv

Scikit-learn documentation: Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011. https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
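
To illustrate why scaling matters for a distance-based model such as KNN, here is a minimal sketch. The five rows are made-up values (not taken from heart.csv) for two columns with very different ranges, `oldpeak` and `chol`; the helper name `nearest_candidate` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: [oldpeak, chol]. Row 0 is the query patient,
# rows 1 and 2 are the two candidates, rows 3 and 4 only widen the value ranges.
X_toy = np.array([
    [1.0, 240.0],   # query
    [1.2, 260.0],   # candidate 1: almost the same oldpeak, chol differs by 20
    [4.0, 245.0],   # candidate 2: very different oldpeak, chol differs by 5
    [0.0, 180.0],
    [2.0, 320.0],
])

def nearest_candidate(data):
    """Return 1 or 2: the candidate row with the smaller Euclidean distance to row 0."""
    distances = np.linalg.norm(data[1:3] - data[0], axis=1)
    return int(np.argmin(distances)) + 1

raw_nearest = nearest_candidate(X_toy)
scaled_nearest = nearest_candidate(StandardScaler().fit_transform(X_toy))

print("nearest candidate on raw data:   ", raw_nearest)
print("nearest candidate on scaled data:", scaled_nearest)
```

On the raw values the cholesterol column dominates the distance because its numbers are larger; after standardising, both columns contribute on a comparable scale and the nearest candidate changes.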

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Dataset import
df = pd.read_csv("/content/drive/MyDrive/data_for_colab/heart.csv")
# It's important to look for missing values in the data
def na_status(df):
  # Number of missing values per column, largest first
  total = df.isnull().sum().sort_values(ascending=False)
  # Percentage of missing values per column
  percent_1 = df.isnull().sum()/df.isnull().count()*100
  percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
  missing_data = pd.concat([total, percent_2], axis=1, keys=['Total NaN', 'NaN %'])
  print(missing_data.head(10))
na_status(df)
# First glimpse into the dataframe
df.info()
df.describe()
# This dataframe does not contain missing values
# All of its columns are numerical, therefore an encoder won't be needed this time

          Total NaN  NaN %
output            0    0.0
thall             0    0.0
caa               0    0.0
slp               0    0.0
oldpeak           0    0.0
exng              0    0.0
thalachh          0    0.0
restecg           0    0.0
fbs               0    0.0
chol              0    0.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trtbps    303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalachh  303 non-null    int64  
 8   exng      303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slp       303 non-null    int64  
 11  caa       303 non-null    int64  
 12  thall     303 non-null    int64  
 13  output  

| | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.366337 | 0.683168 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 0.326733 | 1.039604 | 1.39934 | 0.729373 | 2.313531 | 0.544554 |
| std | 9.082101 | 0.466011 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.52586 | 22.905161 | 0.469794 | 1.161075 | 0.616226 | 1.022606 | 0.612277 | 0.498835 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 47.5 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 55.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 274.5 | 0.0 | 1.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |
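
The missing-value check in the cell above can also be written more compactly. The following is a sketch only: the helper name `na_summary` and the small `toy` dataframe are hypothetical, and the toy values are not from heart.csv. It relies on `isna().mean()`, which gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

def na_summary(frame):
    """Return the count and percentage of missing values per column, largest first."""
    summary = pd.DataFrame({
        "Total NaN": frame.isna().sum(),
        "NaN %": (frame.isna().mean() * 100).round(1),
    })
    return summary.sort_values("Total NaN", ascending=False)

# Hypothetical toy data with one missing age
toy = pd.DataFrame({
    "age": [63, 37, np.nan, 56],
    "chol": [233, 250, 204, 236],
})
summary = na_summary(toy)
print(summary)
```

Returning the summary instead of printing it inside the function makes it possible to inspect or filter the result later.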


In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
# The classifier and scaler that will be passed on to the pipeline
knn = KNeighborsClassifier()
scaler = StandardScaler()

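# The hyperparameter grid for the search ("knn__" refers to the pipeline step name)
# Note: leaf_size tunes the speed of the tree-based neighbour search, not which neighbours are found
leaf_size = list(range(1, 50))
p = [1, 2]
param_grid = {"knn__n_neighbors": np.arange(1, 50),
              "knn__leaf_size": leaf_size,
              "knn__p": p}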

# The target column is "output"; let's define the features and the target
X = df.drop("output", axis=1).values
y = df["output"].values

# Define the pipeline steps: scale first, then classify
steps = [("scaler", scaler), ("knn", knn)]

# The steps are passed to the Pipeline constructor
pipeline = Pipeline(steps)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)
cv = GridSearchCV(pipeline, param_grid=param_grid, cv=5)

# Fit both models on the training data
# GridSearchCV works on clones of the pipeline, so the basic fit is not changed by the search
knn = pipeline.fit(X_train, y_train)
knn_hyper = cv.fit(X_train, y_train)
# Now the predictions, kept in separate variables so that neither is overwritten
y_pred_basic = knn.predict(X_test)
y_pred = knn_hyper.predict(X_test)

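# Let's print the score for both models
print("Knn model with hyperparameter tuning")
print(knn_hyper.best_params_)
print(knn_hyper.score(X_test,y_test))
print("\n")
print("knn basic model")
print(knn.score(X_test,y_test))
print("\n",
      "Classification report")
# y_pred holds the predictions of the tuned model, so this report describes the tuned model
print(classification_report(y_test, y_pred))
# On the test set, the basic model (0.90) achieved a higher score than the tuned one (0.84).
# The grid search selects the parameters with the best cross-validation score on the training data,
# which does not guarantee the best score on the test set.

Knn model with hyperparameter tuning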
{'knn__leaf_size': 1, 'knn__n_neighbors': 5, 'knn__p': 1}
0.8360655737704918


knn basic model
0.9016393442622951

 Classification report
              precision    recall  f1-score   support

           0       0.83      0.83      0.83        29
           1       0.84      0.84      0.84        32

    accuracy                           0.84        61
   macro avg       0.84      0.84      0.84        61
weighted avg       0.84      0.84      0.84        61



In [None]:
# Importing the two other models to be tested
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier

# Defining a dictionary with these models, keyed by model name, with each model's hyperparameters
list_of_models = {    
    'svm': {'model': svm.SVC(gamma='auto'),'params' : {'C': [1,10,20],'kernel': ['rbf','linear']}},
    'random_forest': {'model': RandomForestClassifier(),'params' : {'n_estimators': [1,5,10]}}
}
# Creating the scores list that will later be converted to a dataframe
scores = []
for model_name, mp in list_of_models.items():
    clf =  GridSearchCV(mp['model'], mp['params'], cv=3, return_train_score=False)
    clf.fit(X_train, y_train)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
    
# Use a new name so that the original dataframe df is not overwritten
scores_df = pd.DataFrame(scores,columns=['model','best_score','best_params'])
scores_df

| | model | best_score | best_params |
|---|---|---|---|
| 0 | svm | 0.793673 | {'C': 20, 'kernel': 'linear'} |
| 1 | random_forest | 0.781121 | {'n_estimators': 5} |


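`The KNN classifier, using its default settings, was the best-performing classifier (test accuracy 0.90). Note that the SVM and random forest scores above are cross-validation scores on the training set, computed without scaling, so the comparison is only indicative.`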
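
For a like-for-like comparison, every model can be scored the same way: cross-validation on the training set and accuracy on the same held-out test set, with scaling inside the pipeline for the distance- and margin-based models. The sketch below shows one way to do this. It uses synthetic data from `make_classification` instead of heart.csv (an assumption, to keep the example self-contained), and the names `candidates` and `comparison` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the heart data (assumption: 303 rows, 13 features)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Each entry: (estimator, parameter grid). KNN and SVM are scaled inside a pipeline.
candidates = {
    "knn_default": (
        Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())]),
        {"knn__n_neighbors": [5]},  # single value: the default setting
    ),
    "svm": (
        Pipeline([("scaler", StandardScaler()), ("svm", SVC(gamma="auto"))]),
        {"svm__C": [1, 10, 20], "svm__kernel": ["rbf", "linear"]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [1, 5, 10]},
    ),
}

rows = []
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=3)
    search.fit(X_tr, y_tr)
    rows.append({
        "model": name,
        "cv_score": search.best_score_,            # mean cross-validation accuracy
        "test_score": search.score(X_te, y_te),    # accuracy on the held-out test set
        "best_params": search.best_params_,
    })

comparison = pd.DataFrame(rows, columns=["model", "cv_score", "test_score", "best_params"])
print(comparison)
```

With both columns side by side, the models can be ranked on the same metric instead of mixing test accuracy for one model with cross-validation scores for the others.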