# Heart Disease Prediction 🩺
This dataset contains clinical information about patients, and it is used here to predict whether a person has heart disease.

## Basic Objective 🎯
The objectives of this notebook are to:
- Explore the dataset
- Clean the dataset and label-encode the non-numerical variables
- Perform a numerical analysis of the dataset
- Perform feature engineering
- Split the dataset into training and testing sets
- Build machine learning models to predict heart disease
- Compare the models to find the best one

### Machine Learning Models 🤖
We will be using seven machine learning models in this notebook:
- K-Nearest Neighbour (KNN)
- Random Forest
- Decision Tree
- Logistic Regression
- Gaussian NB
- Support Vector Machine
- Linear Discriminant Analysis

### Dataset Description 📋
In this dataset, we have:
- 9 categorical variables
- 5 continuous variables.



In [None]:
# Importing Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier

### Functions 📌
These are the helper functions used throughout the notebook.

In [None]:
# importing label encoder
from sklearn.preprocessing import LabelEncoder


def dataset_information(df):
    print('# -------------------------\n# Basic Dataset Information\n# -------------------------')
    print("Shape of the dataset: ", df.shape)
    print('Number of rows: ', df.shape[0])
    print('Number of columns: ', df.shape[1])

    print('Number of Categorical Columns: ', df.select_dtypes(include=['object']).shape[1])
    print('Number of Numerical Columns: ', df.select_dtypes(include=['int64', 'float64']).shape[1])
    print('Missing values: ', df.isnull().sum().sum())

    # ------------------------
    # Unique Values In Dataset
    # ------------------------
    print('')
    print('# ------------------------\n# Unique Values In Dataset\n# ------------------------')
    print(df.nunique())


def replace_missing_values(df):
    df.replace('?', np.nan, inplace=True)
    df.dropna(inplace=True)

    
def label_encoding(df, label_column):
    print('----')
    print('Label Column: ', label_column)
    print(df[label_column].value_counts())
    print('')
    
    # Label Encoding
    le = LabelEncoder()
    df[label_column] = le.fit_transform(df[label_column])
    
    print(df[label_column].value_counts())
    print('')


def categorical_column_finder(df, print_data=True):
    # Columns stored with the object dtype are treated as categorical
    categorical = list(df.select_dtypes(include=['object']).columns)

    if print_data:
        print('')
        print('Non-Numeric Categorical Columns')
        print(categorical)
        print('')

    return categorical


def label_encoding_of_categorical_columns(df, print_data = False):
    print('')
    data = categorical_column_finder(df, print_data)
    for i in data:
        label_encoding(df, i)

## Reading Datasets 🕵️‍♀️
Here we will load the dataset and explore the contents present within it.

The column names are stored in variables so that the rest of the notebook can be reused with another heart disease dataset.

In [None]:
# Reading Dataset
df = pd.read_csv("heart-1.csv")
df_v2 = pd.read_csv("heart-1.csv")

df_v3 = pd.read_csv("heart.csv")
# Drop the target (heart disease), 'ca' and 'thal' columns in one call
df_v3 = df_v3.drop(['target', 'ca', 'thal'], axis=1)

# Assigning Variables for Reusability
columns = ['Age','Sex','ChestPainType','RestingBP','Cholesterol','FastingBS','RestingECG','MaxHR','ExerciseAngina','Oldpeak','ST_Slope','HeartDisease'] 

age = df[columns[0]]
gender = df[columns[1]]
chest_pain_type = df[columns[2]]

resting_blood_pressure = df[columns[3]]
cholestoral = df[columns[4]]
fasting_blood_sugar = df[columns[5]]

resting_electrocardiographic_results = df[columns[6]]
max_heart_rate_achieved = df[columns[7]]
exercise_induced_angina = df[columns[8]]

oldpeak = df[columns[9]]
slope_of_the_peak_exercise_st_segment = df[columns[10]]
heart_disease = df[columns[11]]


# age = df['Age']
# gender = df['Sex']
# chest_pain_type = df['ChestPainType']

# resting_blood_pressure = df['RestingBP']
# cholestoral = df['Cholesterol']
# fasting_blood_sugar = df['FastingBS']

# resting_electrocardiographic_results = df['RestingECG']
# max_heart_rate_achieved = df['MaxHR']
# exercise_induced_angina = df['ExerciseAngina']

# oldpeak = df['Oldpeak']
# slope_of_the_peak_exercise_st_segment = df['ST_Slope']
# heart_disease = df['HeartDisease']

In [None]:
df.head()

## Dataset Information 📕
This section will display basic information on the dataset. 

In [None]:
# Print basic information about the dataset
dataset_information(df)

In [None]:
# This will show data types of each column
df.info()

In [None]:
# Show the columns that hold categorical data
categorical_column_finder(df)

df.head()

### Label Encoding for Categorical Variables 📝
Here we use label encoding to convert the categorical variables into numerical values.

For example, the `Sex` column takes the values M or F; the label encoder turns them into 0s and 1s so they can be used in statistical calculations.

In [None]:
# Convert Non-numeric Data to Numeric Values
label_encoding_of_categorical_columns(df)

In [None]:
# Verifying for changes
df.head()

## Numerical Analysis 🔢
Now that we have examined the basic structure of the dataset, we turn to a numerical analysis.

In [None]:
# Shows basic statistics of the dataset non-categorical columns
df_v2.describe()

In [None]:
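# Shows correlation of the numeric (non-categorical) columns in the dataset
# numeric_only=True (pandas >= 1.5) skips the string columns still present in df_v2
df_v2.corr(numeric_only=True)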

In [None]:
# Categorical Analysis using Heart Disease
df.groupby('HeartDisease').size()  # target variable

In [None]:
# Mean of each column, grouped by heart disease status
df.groupby(heart_disease).mean().T

## Visualization of Dataset 📊
Now that we have the numerical analysis, we will visualize the numerical variables.

In [None]:
# Histogram of all columns
df.hist(figsize=(20,20))
plt.show()

In [None]:
# Pair plot of the numeric columns, colored by HeartDisease
sns.pairplot(df_v2, hue='HeartDisease')
print('# 0 = Blue = no heart disease')
print('# 1 = Orange = heart disease')

In [None]:
# Heatmap for correlation between the numeric variables
sns.heatmap(df_v2.corr(numeric_only=True), annot=True)

## Data Preprocessing 📈

In [None]:
# Data Preprocessing  
X = df.drop('HeartDisease', axis=1)
y = df['HeartDisease']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# # Training Samples
# print(len(X_train))
# print(len(y_train))

# # Testing Samples
# print(len(X_test))
# print(len(y_test))

### K-Nearest Neighbour (KNN) 📍
K-nearest neighbors (KNN) is a simple supervised machine learning algorithm that can be used for both classification and regression. It predicts the label of a new sample from the labels of the k closest training samples. We will use KNN to predict heart disease from the features and then compare the results with the other algorithms.
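
To make the idea concrete, here is a minimal from-scratch sketch of KNN classification using Euclidean distance and a majority vote. The function `knn_predict` and the toy data are made up for illustration; the notebook itself relies on scikit-learn's `KNeighborsClassifier`.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]


# Toy data (made up for illustration): two features, binary label
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (1.1, 1.0)))  # 0
print(knn_predict(train_X, train_y, (5.1, 5.0)))  # 1
```

Since the prediction depends on raw distances, features measured on large scales can dominate the vote unless the data is scaled first.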

In [None]:
import warnings
with warnings.catch_warnings():
    # ignore all caught warnings
    warnings.filterwarnings("ignore")
    knn = KNeighborsClassifier(n_neighbors=5) 
    knn.fit(X_train, y_train)

    # NN Score
    y_pred = knn.predict(X_test)
    predict = knn.predict(df_v3)
    print('Predicted values of original Dataset')
    print(y_pred[:15])
    print()
    print('Predicted values of other Dataset')
    print(predict[:15])
    print()
    print("{} NN Score: {:.2f}%".format(knn.n_neighbors, knn.score(X_test, y_test)*100))

    # predict df_v3
    # predict_df_v3 = knn.predict(df_v3)

In [None]:
scoreList = []
for i in range(1,25):
    knn2 = KNeighborsClassifier(n_neighbors = i)
    knn2.fit(X_train, y_train)
    scoreList.append(knn2.score(X_test, y_test))
    
plt.plot(range(1,25), scoreList)
plt.xticks(np.arange(1,25,1))
plt.xlabel("K value")
plt.ylabel("Score")
plt.show()
print("KNN Score Max {:.2f}%".format((max(scoreList))*100))

# Obtained Data
KNN_data = max(scoreList)*100
KNN_df = pd.DataFrame({'K': range(1,25), 'Score': scoreList})

In [None]:
# Model Evaluation
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

### Random Forest 🎲
Random Forest is a supervised machine learning algorithm that can be used for both classification and regression. It is an ensemble method: it trains many decision trees, each on a random sample of the data, and combines their predictions, which usually gives higher accuracy and less overfitting than a single tree.
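
The two ingredients of the ensemble, sampling rows with replacement and combining the trees' outputs, can be sketched as follows. The helpers `bootstrap_indices` and `majority_vote` are hypothetical names for this example. Note that scikit-learn's `RandomForestClassifier` averages the trees' class probabilities rather than counting hard votes, so this shows the general idea, not that implementation.

```python
import random
from collections import Counter


def bootstrap_indices(n_rows, rng):
    """Sample n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def majority_vote(predictions):
    """Combine per-tree class predictions into a single label."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(1)
sample = bootstrap_indices(10, rng)
print(len(sample))                     # 10
print(majority_vote([1, 0, 1, 1, 0]))  # 1
```

Each tree would be trained on its own bootstrap sample, so some rows appear several times and others not at all, which is what makes the trees differ from one another.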

In [None]:
# random forest 
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state = 1) 
rf.fit(X_train, y_train)

with warnings.catch_warnings():
    # ignore all caught warnings
    warnings.filterwarnings("ignore")
    y_pred = rf.predict(X_test)
    predict = rf.predict(df_v3)
    print('Predicted values of original Dataset')
    print(y_pred[0:15])
    print()
    print('Predicted values of other Dataset')
    print(predict[0:15])
    print()

print("Random Forest Algorithm Accuracy Score : {:.2f}%".format(rf.score(X_test,y_test)*100))

In [None]:
scoreListRF = []
for i in range(2,25):
    rf2 = RandomForestClassifier(n_estimators = 1000, random_state = 1, max_leaf_nodes=i)
    rf2.fit(X_train, y_train)
    scoreListRF.append(rf2.score(X_test, y_test))
    
plt.plot(range(2,25), scoreListRF)
plt.xticks(np.arange(2,25,1))
plt.xlabel("Leaf")
plt.ylabel("Score")
plt.show()
print("RF Score Max {:.2f}%".format((max(scoreListRF))*100))

# Obtained Data
RF_data = max(scoreListRF)*100
RF_df = pd.DataFrame({'Leaf': range(2,25), 'Score': scoreListRF})

In [None]:
# Model Evaluation
y_pred = rf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

### Decision Tree 🌲
Decision trees are widely used in data analytics and machine learning because they break complex data down into more manageable parts. A tree lays out the problem as a sequence of simple decisions, so every option can be examined and the possible consequences of each decision analyzed.
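
A single decision in such a tree is a threshold on one feature, chosen to make the resulting groups as pure as possible. The sketch below scores candidate thresholds with Gini impurity, the default criterion of `DecisionTreeClassifier`. The helper names and the numbers are made up for illustration, and candidate thresholds are taken from the observed values for simplicity.

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Return (threshold, weighted_impurity) of the best single split on one feature."""
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        if not left or not right:
            continue  # a split must leave samples on both sides
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Toy feature (made-up numbers): higher values go with label 1
values = [120, 130, 140, 160, 170, 180]
labels = [0, 0, 0, 1, 1, 1]
print(gini(labels))                    # 0.5
print(best_threshold(values, labels))  # (140, 0.0)
```

A full tree repeats this search over every feature and then recurses into each side of the chosen split.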

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)


with warnings.catch_warnings():
    # ignore all caught warnings
    warnings.filterwarnings("ignore")
    y_pred = dtc.predict(X_test)
    predict = dtc.predict(df_v3)
    print('Predicted values of original Dataset')
    print(y_pred[0:15])
    print()
    print('Predicted values of other Dataset')
    print(predict[0:15])
    print()
print("Decision Tree Test Accuracy {:.2f}%".format(dtc.score(X_test, y_test)*100))

In [None]:
scoreListDT = []
for i in range(2,25):
    dtc2 = DecisionTreeClassifier(max_leaf_nodes=i)
    dtc2.fit(X_train, y_train)
    scoreListDT.append(dtc2.score(X_test, y_test))
    
plt.plot(range(2,25), scoreListDT)
plt.xticks(np.arange(2,25,1))
plt.xlabel("Leaf")  
plt.ylabel("Score")
plt.show()
print("DT Score Max {:.2f}%".format((max(scoreListDT))*100))

# Obtained Data
DT_data = max(scoreListDT)*100
DT_df= pd.DataFrame({'Leaf': range(2,25), 'Score': scoreListDT})

In [None]:
# Model Evaluation
y_pred = dtc.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

### Logistic Regression 🚀
Logistic Regression is a statistical model used for classification. It is a supervised learning algorithm that estimates the parameters of a logistic function, whose output is the probability that an instance belongs to a particular class.

It is commonly used to understand the relationship between a dependent variable and one or more independent variables. This type of analysis can help us predict the likelihood of an event happening or a choice being made.
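
The logistic function is sigmoid(z) = 1 / (1 + e^(-z)), applied to a weighted sum of the features. The following sketch shows how a probability and a class label come out of it. The weights and bias are hypothetical values chosen for illustration, not coefficients fitted to this dataset.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """P(class = 1 | features) for a linear model passed through the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights, chosen only for illustration (not fitted to the dataset)
weights = [0.8, -0.5]
bias = 0.1
p = predict_proba([1.0, 2.0], weights, bias)
print(round(p, 3))    # 0.475
print(int(p >= 0.5))  # 0, using a 0.5 cut-off
```

Training the model means finding the weights and bias that make these probabilities agree with the observed labels as closely as possible.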

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='lbfgs',max_iter=1000)
logreg.fit(X_train, y_train) 


with warnings.catch_warnings():
    # ignore all caught warnings
    warnings.filterwarnings("ignore")
    y_pred = logreg.predict(X_test)
    predict = logreg.predict(df_v3)
    print('Predicted values of original Dataset')
    print(y_pred[0:15])
    print()
    print('Predicted values of other Dataset')
    print(predict[0:15])
    print()
print("Logistic Regression Test Accuracy {:.2f}%".format(logreg.score(X_test, y_test)*100))

In [None]:
scoreListLR = []
for i in range(1,25):
    logreg2 = LogisticRegression(solver='lbfgs',max_iter=1000, C=i)
    logreg2.fit(X_train, y_train)
    scoreListLR.append(logreg2.score(X_test, y_test))
       
plt.plot(range(1,25), scoreListLR)
plt.xticks(np.arange(1,25,1))
plt.xlabel("C value")
plt.ylabel("Score")
plt.show()
print("LR Score Max {:.2f}%".format((max(scoreListLR))*100))
 
# Obtained Data
LR_data = max(scoreListLR)*100
LR_df = pd.DataFrame({'C value': range(1,25), 'Score': scoreListLR})

In [None]:
# Model Evaluation
y_pred = logreg.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
 

### Gaussian NB 📉
Gaussian Naive Bayes is a variant of the Naive Bayes algorithm. It is used when the features are continuous, and it assumes that each feature follows a Gaussian (normal) distribution within each class.
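
A small from-scratch sketch of this idea: estimate a mean and variance per feature and per class, then pick the class whose prior and Gaussian likelihoods give the highest log score. The functions `fit` and `predict` and the toy data are assumptions made for this example; scikit-learn's `GaussianNB` additionally applies variance smoothing, which is omitted here.

```python
import math
from statistics import mean, pvariance


def log_gaussian(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def fit(X, y):
    """Per class: the prior plus (mean, variance) of every feature."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        stats = [(mean(col), pvariance(col)) for col in zip(*rows)]
        model[c] = (len(rows) / len(X), stats)
    return model


def predict(model, x):
    """Pick the class with the largest log prior + sum of log likelihoods."""
    def log_score(c):
        prior, stats = model[c]
        return math.log(prior) + sum(
            log_gaussian(v, mu, var) for v, (mu, var) in zip(x, stats)
        )
    return max(model, key=log_score)


# Toy data (made up): two continuous features, binary label
X = [(1.0, 2.1), (1.2, 1.9), (0.8, 2.0), (3.0, 4.1), (3.2, 3.9), (2.8, 4.0)]
y = [0, 0, 0, 1, 1, 1]
model = fit(X, y)
print(predict(model, (1.1, 2.0)))  # 0
print(predict(model, (3.1, 4.0)))  # 1
```

The "naive" part is the sum over features: each feature contributes independently, as if the features were uncorrelated within a class.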

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)


with warnings.catch_warnings():
    # ignore all caught warnings
    warnings.filterwarnings("ignore")
    y_pred = gnb.predict(X_test)
    predict = gnb.predict(df_v3)
    print('Predicted values of original Dataset')
    print(y_pred[0:15])
    print()
    print('Predicted values of other Dataset')
    print(predict[0:15])
    print()
print("Gaussian Naive Bayes Test Accuracy {:.2f}%".format(gnb.score(X_test, y_test)*100))

In [None]:
# GaussianNB is fitted with no varying hyperparameter here, so every run gives the same score
scoreListNB = []
for i in range(1,25):
    gnb2 = GaussianNB()
    gnb2.fit(X_train, y_train)
    scoreListNB.append(gnb2.score(X_test, y_test))
     
plt.plot(range(1,25), scoreListNB)
plt.xticks(np.arange(1,25,1)) 
plt.xlabel("Run")
plt.ylabel("Score")
plt.show()
print("NB Score Max {:.2f}%".format((max(scoreListNB))*100))
 
# Obtained Data
NB_data = max(scoreListNB)*100
NB_df = pd.DataFrame({'Run': range(1,25), 'Score': scoreListNB})

### Support Vector Machine (SVM) 📈
Support Vector Machine (SVM) is a supervised learning algorithm that can be used for both classification and regression. For classification, it finds the hyperplane that separates the classes with the largest margin and uses it to predict the class of a new data instance.
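
For a linear kernel, the prediction comes down to which side of a hyperplane a point falls on, and training penalizes points that are misclassified or too close to it. The sketch below shows the decision score and the hinge loss for labels in {-1, +1}. The hyperplane `w`, `b` is a hypothetical example, not one learned from this dataset.

```python
def decision_function(x, w, b):
    """Signed score w.x + b; its sign gives the predicted side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def hinge_loss(x, label, w, b):
    """Hinge loss for a label in {-1, +1}: zero once the point is beyond the margin."""
    return max(0.0, 1.0 - label * decision_function(x, w, b))


# Hypothetical hyperplane, chosen for illustration (not learned from the dataset)
w, b = [1.0, -1.0], 0.0
print(decision_function([3.0, 1.0], w, b))  # 2.0 -> positive side
print(hinge_loss([3.0, 1.0], +1, w, b))     # 0.0 -> correct and outside the margin
print(hinge_loss([1.5, 1.0], +1, w, b))     # 0.5 -> correct side but inside the margin
```

The `C` parameter of `SVC` sets how heavily these losses are weighted against keeping the margin wide: a larger `C` tolerates fewer margin violations.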

In [None]:
from sklearn.svm import SVC
svc = SVC(kernel='linear')
svc.fit(X_train, y_train)
 
with warnings.catch_warnings():
    # ignore all caught warnings
    warnings.filterwarnings("ignore")
    y_pred = svc.predict(X_test)
    predict = svc.predict(df_v3)
    print('Predicted values of original Dataset')
    print(y_pred[0:15])
    print()
    print('Predicted values of other Dataset')
    print(predict[0:15])
    print()
    
print("SVM Test Accuracy {:.2f}%".format(svc.score(X_test, y_test)*100))

In [None]:
scoreListSVM = []
for i in range(1,25):
    svc2 = SVC(kernel='linear', C=i)
    svc2.fit(X_train, y_train)
    scoreListSVM.append(svc2.score(X_test, y_test))
     
plt.plot(range(1,25), scoreListSVM)
plt.xticks(np.arange(1,25,1)) 
plt.xlabel("C value")
plt.ylabel("Score")
plt.show()
print("SVM Score Max {:.2f}%".format((max(scoreListSVM))*100))
 
# Obtained Data
SVM_data = max(scoreListSVM)*100
SVM_df = pd.DataFrame({'C value': range(1,25), 'Score': scoreListSVM})

### Linear Discriminant Analysis (LDA) 📈
Linear Discriminant Analysis (LDA) is a supervised learning algorithm used for classification and for dimensionality reduction. It models each class with a Gaussian distribution that shares a common covariance matrix, which yields linear decision boundaries used to predict the class of a new data instance.
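
For intuition, here is a one-feature sketch of an LDA classifier: each class gets its own mean, all classes share one pooled variance, and the class with the largest linear discriminant score wins. The functions `fit_lda_1d` and `predict_lda_1d` and the data are hypothetical; `LinearDiscriminantAnalysis` handles the multivariate case with a shared covariance matrix.

```python
import math
from statistics import mean


def fit_lda_1d(x, y):
    """Estimate class priors, class means and one pooled variance shared by all classes."""
    classes = sorted(set(y))
    groups = {c: [v for v, label in zip(x, y) if label == c] for c in classes}
    means = {c: mean(vals) for c, vals in groups.items()}
    priors = {c: len(vals) / len(x) for c, vals in groups.items()}
    pooled = sum((v - means[c]) ** 2 for c, vals in groups.items() for v in vals)
    variance = pooled / (len(x) - len(classes))
    return priors, means, variance


def predict_lda_1d(value, priors, means, variance):
    """Return the class with the largest linear discriminant score."""
    def score(c):
        return (value * means[c] / variance
                - means[c] ** 2 / (2 * variance)
                + math.log(priors[c]))
    return max(priors, key=score)


# Toy data (made up): one continuous feature, binary label
x = [1.0, 1.5, 2.0, 4.0, 4.5, 5.0]
y = [0, 0, 0, 1, 1, 1]
params = fit_lda_1d(x, y)
print(predict_lda_1d(1.2, *params))  # 0
print(predict_lda_1d(4.8, *params))  # 1
```

Because the variance is shared, the score is linear in the input, which is where the "linear" in the name comes from.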

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

with warnings.catch_warnings():
    # ignore all caught warnings
    warnings.filterwarnings("ignore")
    y_pred = lda.predict(X_test)
    predict = lda.predict(df_v3)
    print('Predicted values of original Dataset')
    print(y_pred[0:15])
    print()
    print('Predicted values of other Dataset')
    print(predict[0:15])
    print()
    
print("LDA Test Accuracy {:.2f}%".format(lda.score(X_test, y_test)*100))

In [None]:
# LinearDiscriminantAnalysis is fitted with no varying hyperparameter here, so every run gives the same score
scoreListLDA = []
for i in range(1,25): 
    lda2 = LinearDiscriminantAnalysis()
    lda2.fit(X_train, y_train)
    scoreListLDA.append(lda2.score(X_test, y_test))
    
plt.plot(range(1,25), scoreListLDA)
plt.xticks(np.arange(1,25,1))
plt.xlabel("Run")
plt.ylabel("Score")
plt.show()
print("LDA Score Max {:.2f}%".format((max(scoreListLDA))*100))
 
# Obtained Data
LDA_data = max(scoreListLDA)*100
LDA_df = pd.DataFrame({'Run': range(1,25), 'Score': scoreListLDA})

## Model Comparison 🔬
Finally, we compare the models to find the best one. The accuracy reported for each model is the highest test-set score obtained in its parameter sweep above.

In [None]:
comparison = pd.DataFrame({'Model': ["Logistic Regression", "KNN",  "Decision Tree", "Random Forest", 'Gaussian NB', 'SVM', 'LDA'], 
                        'Accuracy': [LR_data, KNN_data, DT_data, RF_data, NB_data, SVM_data, LDA_data]})

comparison.sort_values(by='Accuracy', ascending=False)

In [None]:
methods = ["Logistic Regression", "KNN", "Decision Tree", "Random Forest", 'Gaussian NB', 'SVM', 'LDA']
accuracy = [LR_data, KNN_data, DT_data, RF_data, NB_data, SVM_data, LDA_data]
colors = ["#16bc96", "#1885e4", "#9B287B","#170F11", "#FFCD00", "#FF8C00", "#FF0000"]

sns.set_style("whitegrid")
plt.figure(figsize=(16,5))
plt.yticks(np.arange(0,101,10))
plt.ylabel("Accuracy %")
plt.xlabel("Algorithms")
sns.barplot(x=methods, y=accuracy, palette=colors)
plt.show()

## Output 📥
We will now write the results to a CSV file.

In [None]:
# output result to csv file
comparison.to_csv('comparison.csv', index=False)

# head of the csv file
comparison.head()
 

In [None]:
df_v3.to_csv('prediction.csv', index=False)
# df_v3.head()

df.to_csv('prediction-2.csv', index=False)
# df.head()