<a href="https://colab.research.google.com/github/AmitShanbhoug/CSI4106_Project1_Classification/blob/main/CSI4016_Group_46_Project1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Predicting Heart Disease in U.S. Adults from key personal indicators using Naive Bayes, Logistic Regression, and Multi-Layer Perceptron learning algorithms**
---
**Prepared for**: CSI 4106, Introduction to Artificial Intelligence: Project 1, Classification Empirical Study

**Prepared by**: Group 46

1. Feyi Adesanya, 300120992
2. Amit Shanbhoug, 8677407

**Submission Date**: November 1, 2022

**Project Repository**: https://github.com/AmitShanbhoug/CSI4106_Project1_Classification.git

**Dataset Source**: https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease

# **1. Understand the classification task for your dataset**

---




Our selected dataset, *Personal Key Indicators of Heart Disease*, comes from the 2020 annual survey of about 400k adults in the United States (U.S.) conducted by the Centers for Disease Control and Prevention (CDC), specifically from the Behavioral Risk Factor Surveillance System. The CDC is a reliable source of health data, and the data is recent (2020), which allows us to form relevant conclusions and next steps. The dataset also contains a binary variable, *HeartDisease*, in which respondents indicated whether they have heart disease.

Looking at the dataset, we also note that the classes are not balanced: a large majority of respondents indicated the absence of heart disease. To rectify this, the project can do ABC for future iterations.
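To illustrate why the imbalance matters, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the data comes from `make_classification` rather than the survey, and `class_weight="balanced"` is only one commonly used option, not a choice recorded in this project. It compares the accuracy of always predicting the majority class with the minority-class recall of a weighted and an unweighted model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data (assumption): about 90% negatives.
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

# Always predicting the majority class already yields a high accuracy.
majority_accuracy = float(np.mean(y_te == 0))

# Same model with and without class re-weighting.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Recall on the minority (positive) class shows what accuracy alone hides.
recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))

print(f"Majority-class baseline accuracy: {majority_accuracy:.3f}")
print(f"Minority recall, unweighted: {recall_plain:.3f}")
print(f"Minority recall, class_weight='balanced': {recall_weighted:.3f}")
```

With classes this skewed, accuracy should be read next to per-class recall or the confusion matrix.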

This dataset contains key personal indicators of heart disease in U.S. adults; for example, the indicators include whether an individual drinks or smokes. In this project we perform binary classification, predicting whether a participant is likely to have heart disease. We apply three supervised machine learning algorithms: Naive Bayes, Multi-Layer Perceptron, and Logistic Regression.
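As a rough sketch of what this comparison looks like in scikit-learn, the block below fits the three model families on synthetic data. The data, the `GaussianNB` variant of Naive Bayes, and all hyperparameters are assumptions chosen for illustration; they are not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# Synthetic binary classification problem standing in for the survey data (assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# One estimator per algorithm family named in the text.
models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multi-Layer Perceptron": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
}

# Fit each model on the same split and record its test accuracy.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    test_accuracy[name] = model.score(X_te, y_te)
    print(f"{name}: test accuracy = {test_accuracy[name]:.3f}")
```

All three estimators share the `fit` / `predict` / `score` interface, so the same evaluation code can be reused for each.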


Aside from our goal to better predict risk of Heart Disease in adults, this project will augment our ability to understand and predict behavioral risk factors in the detection of Heart Disease.

# **2. Analyze your dataset**

---



General Overview of the Dataset
---


In [27]:
# All necessary imports will be placed here, we will utilize scikit-learn, matplotlib, and pandas.
import pandas as pandas
import sklearn
import matplotlib.pyplot as plt
import seaborn as seaborn
import numpy as numpy

from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

#  Read the Data
csv_file = "https://raw.githubusercontent.com/AmitShanbhoug/CSI4106_Project1_Classification/main/heart_2020_cleaned.csv"

# Place the dataset into a format we can use: Dataframe
data_df = pandas.read_csv(csv_file)
print("Dataset Columns: ")
print(list(data_df))

#Show name and data type of each column in the dataset
data_df.info()

# Commented out for future deletion
#data_df = pandas.read_csv("sample_data/heart_2020_cleaned.csv")

Dataset Columns: 
['HeartDisease', 'BMI', 'Smoking', 'AlcoholDrinking', 'Stroke', 'PhysicalHealth', 'MentalHealth', 'DiffWalking', 'Sex', 'AgeCategory', 'Race', 'Diabetic', 'PhysicalActivity', 'GenHealth', 'SleepTime', 'Asthma', 'KidneyDisease', 'SkinCancer']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319795 entries, 0 to 319794
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   HeartDisease      319795 non-null  object 
 1   BMI               319795 non-null  float64
 2   Smoking           319795 non-null  object 
 3   AlcoholDrinking   319795 non-null  object 
 4   Stroke            319795 non-null  object 
 5   PhysicalHealth    319795 non-null  float64
 6   MentalHealth      319795 non-null  float64
 7   DiffWalking       319795 non-null  object 
 8   Sex               319795 non-null  object 
 9   AgeCategory       319795 non-null  object 
 10  Race              319795 non-null  object 
 11  

**Column Name Explanation**:

| Column | Explanation |



1.   HeartDisease: Variable to indicate either absence or presence of Heart Disease (yes/no)

2.   BMI:

3. Smoking: 

4. AlcoholDrinking:

5. Stroke:

6




This output shows the 18 columns in the dataset. The column labelled 'HeartDisease' is the class, while the other 17 columns act as the features. This output also illustrates that our dataset has no missing values.

In [31]:
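# Count and percentage of missing values per column (len(data_df) avoids hard-coding the row count)
missing_data = pandas.DataFrame({'total_missing': data_df.isnull().sum(), 'perc_missing': (data_df.isnull().sum()/len(data_df))*100})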
missing_data

Unnamed: 0,total_missing,perc_missing
HeartDisease,0,0.0
BMI,0,0.0
Smoking,0,0.0
AlcoholDrinking,0,0.0
Stroke,0,0.0
PhysicalHealth,0,0.0
MentalHealth,0,0.0
DiffWalking,0,0.0
Sex,0,0.0
AgeCategory,0,0.0


In [29]:
print(data_df.head())

  HeartDisease    BMI Smoking AlcoholDrinking Stroke  PhysicalHealth  \
0           No  16.60     Yes              No     No             3.0   
1           No  20.34      No              No    Yes             0.0   
2           No  26.58     Yes              No     No            20.0   
3           No  24.21      No              No     No             0.0   
4           No  23.71      No              No     No            28.0   

   MentalHealth DiffWalking     Sex  AgeCategory   Race Diabetic  \
0          30.0          No  Female        55-59  White      Yes   
1           0.0          No  Female  80 or older  White       No   
2          30.0          No    Male        65-69  White      Yes   
3           0.0          No  Female        75-79  White       No   
4           0.0         Yes  Female        40-44  White       No   

  PhysicalActivity  GenHealth  SleepTime Asthma KidneyDisease SkinCancer  
0              Yes  Very good        5.0    Yes            No        Yes  
1       

Here is a sample of the dataset: 

Binary Categories - HeartDisease, Smoking, AlcoholDrinking, Stroke, DiffWalking, Sex, Diabetic, PhysicalActivity, Asthma, KidneyDisease, and SkinCancer

Non-Binary Categories - BMI, PhysicalHealth, MentalHealth, AgeCategory, Race, GenHealth, SleepTime

*Measures For the Dataset*

In [23]:
#Rows 
print("The total number of rows are: ",(len(data_df)))
#Columns
print("The total number of columns: ",(len(data_df.columns)))
#Total
print("the total number of data is: ",(data_df.size))
print("-----------------------------------")

print("Numeric and statistic measures for all Continuous Features (Numeric Variables)\n")
print(data_df.describe())


The total number of rows are:  319795
The total number of columns:  18
the total number of data is:  5756310
-----------------------------------
Numeric and statistic measures for all Continuous Features (Numeric Variables)

                 BMI  PhysicalHealth   MentalHealth      SleepTime
count  319795.000000    319795.00000  319795.000000  319795.000000
mean       28.325399         3.37171       3.898366       7.097075
std         6.356100         7.95085       7.955235       1.436007
min        12.020000         0.00000       0.000000       1.000000
25%        24.030000         0.00000       0.000000       6.000000
50%        27.340000         0.00000       0.000000       7.000000
75%        31.420000         2.00000       3.000000       8.000000
max        94.850000        30.00000      30.000000      24.000000


Prior Probabilities and Bias: What Can We Calculate Now
---


In [None]:
#value_counts() gives the number of rows per class; Series.iteritems() was removed in pandas 2.0, so items() is used
for label, count in data_df['HeartDisease'].value_counts().items():
    print("The Prior Probability of", label, "is", round(count / len(data_df) * 100, 2), "%")
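The prior probability of a class is simply its relative frequency in the data. A minimal sketch on toy labels (the ten-row series below is made up for illustration and is not taken from the dataset) shows that `value_counts(normalize=True)` returns these frequencies directly:

```python
import pandas as pd

# Toy labels standing in for data_df['HeartDisease'] (assumption: 9 "No", 1 "Yes").
labels = pd.Series(["No"] * 9 + ["Yes"], name="HeartDisease")

# normalize=True returns relative frequencies, i.e. the empirical class priors.
priors = labels.value_counts(normalize=True)
for label, p in priors.items():
    print(f"The prior probability of {label} is {p * 100:.2f} %")
```

A classifier that ignores all features and always predicts the most frequent class would reach an accuracy equal to the largest prior, which is a useful baseline for the models trained later.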

# ***3. Brainstorm about the attributes (Feature engineering)***

---




In [None]:
print("Dataset Columns: ")
print(list(data_df))

Convert Binary Columns into A Numerical Format For Graphing
---

In [24]:
binary_columns = ['HeartDisease','Smoking', 'AlcoholDrinking','Stroke','DiffWalking', 'Sex', 'PhysicalActivity', 'Asthma', 'KidneyDisease', 'SkinCancer' ]
#Convert Binary Columns to 0 or 1
data_df[binary_columns] = data_df[binary_columns].apply(LabelEncoder().fit_transform)

#Binary Columns New Output after conversion
for column in binary_columns:
    print(column,data_df[column].unique())

HeartDisease [0 1]
Smoking [1 0]
AlcoholDrinking [0 1]
Stroke [0 1]
DiffWalking [0 1]
Sex [0 1]
PhysicalActivity [1 0]
Asthma [1 0]
KidneyDisease [0 1]
SkinCancer [1 0]


Correlations
---
Using the `corr()` method of the pandas DataFrame, we can calculate the pairwise correlation matrix of the numeric columns in the dataset.

In [None]:
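#Pairwise correlation of the numeric columns (continuous features and 0/1-encoded binary columns)
plt.figure(figsize=(12, 8))
#numeric_only=True skips text columns such as Race, which make corr() raise an error in pandas 2.0 and later
seaborn.heatmap(data_df.corr(numeric_only=True), annot=True)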
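To make the correlation matrix concrete, here is a small sketch on a made-up five-row table (the values are illustrative only and are not taken from the dataset). It assumes pandas 1.5 or later, where `corr()` accepts `numeric_only`.

```python
import pandas as pd

# Toy table (assumption): two numeric columns and one text column.
toy = pd.DataFrame({
    "BMI": [22.0, 27.5, 31.0, 35.2, 24.1],
    "SleepTime": [8.0, 7.0, 6.0, 5.0, 7.5],
    "Race": ["White", "Black", "Asian", "White", "Other"],  # non-numeric, skipped
})

# Pearson correlation between every pair of numeric columns.
corr = toy.corr(numeric_only=True)
print(corr)
```

The result is a symmetric matrix with ones on the diagonal; in this toy table BMI rises while SleepTime falls, so the off-diagonal entry is negative.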

Choosing Our Features
---
We decided to keep certain features based on their relevance to a participant's general health and their direct effect on heart disease. We expect each selected feature to influence the weights of our system in a way that is relevant to the end result we want.

Columns kept: HeartDisease (the class), Smoking, AlcoholDrinking, BMI, Sex, KidneyDisease, MentalHealth, PhysicalHealth, AgeCategory, Race, and PhysicalActivity

In [None]:
#Drop Unrelated Categories
data_df.pop('Stroke')
data_df.pop('DiffWalking')
data_df.pop('SleepTime')
data_df.pop('Asthma')
data_df.pop('SkinCancer')
data_df.pop('GenHealth')
data_df.pop('Diabetic')
data_df.info()

Exploring Our Chosen Features
---

In [None]:
# Plotting Distributions using Histograms for Numerical Features
data_df.hist(bins=50, figsize=(20,15), color='g')
plt.show()

In [None]:
#Plotting the Distribution using a Horizontal Bar Chart for Categorical Features
def create_barh (columnNames):
  for col in columnNames:
    plt.figure()
    data_df[col].value_counts().plot(kind="barh", title=col, color='r')
    plt.show()  # call the function; the bare name plt.show does nothing
  
create_barh(['Race', 'AgeCategory'])

# ***4. Encode the features***

---



Encoding Columns to Numerical Representations
---
*   Binary columns are converted to 0s and 1s using the label encoder (already done earlier)
*   Nominal columns are converted using One-Hot Encoding (each is split into multiple columns)

In [None]:
#Convert Heart Disease to True or False
data_df['HeartDisease'] = data_df['HeartDisease'].map({1: 'TRUE', 0: 'FALSE'})

#Unique Values From Each Column 
columns = data_df.columns
for column in columns:
    print(column,data_df[column].unique())

In [None]:
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder

#Encoding Columns
onehot_categories = ['AgeCategory','Race']

#Convert Binary Columns to 0 or 1, Done Above but placing code here too
#data_df[binary_columns] = data_df[binary_columns].apply(LabelEncoder().fit_transform)

#Get rid of nan values
data_df=data_df.dropna()

#Convert Categorical Data using One Hot Encoding
new_data_df = pandas.get_dummies(data_df, columns=onehot_categories, drop_first=False)
#data_df = pandas.get_dummies(data_df)


print(new_data_df.head())

We have finished converting all categorical features into a numerical representation: binary categories are assigned 0 or 1, and nominal categories are split into multiple columns using One-Hot Encoding.
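The two encodings can be seen side by side on a three-row toy table. The rows below are made up for illustration; only the column names `Sex` and `Race` come from the dataset.

```python
import pandas as pd

# Toy rows (assumption) with one binary and one nominal column.
toy = pd.DataFrame({
    "Sex": ["Female", "Male", "Female"],
    "Race": ["White", "Black", "Asian"],
})

# Binary column: map the two categories to 0 and 1.
toy["Sex"] = toy["Sex"].map({"Female": 0, "Male": 1})

# Nominal column: one indicator column per category, named <column>_<category>.
encoded = pd.get_dummies(toy, columns=["Race"], dtype=int)
print(encoded)
```

Each row has exactly one 1 among the `Race_*` columns, so no artificial ordering between the categories is introduced.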

# ***5. Prepare your data for the experiment, using cross-validation***

---




Split into training and test sets
---

In [None]:
# split the large dataset into train and test
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

predictedClass = new_data_df.pop("HeartDisease").values
X_train, X_test, y_train, y_test = train_test_split(new_data_df, predictedClass, test_size = 0.25, random_state=1)

#Standardize Data
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Size of our training and test data
X_train.shape, X_test.shape, y_train.shape, y_test.shape
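The cell above uses a single hold-out split. For reference, a k-fold cross-validation of the same kind of model could be sketched as follows. This is an illustration under assumptions: the data is synthetic, and the 5 folds and the logistic regression model are arbitrary choices, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the encoded survey data (assumption).
X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=1)

# Keeping the scaler inside the pipeline refits it on the training part of each fold only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio roughly constant across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Accuracy per fold:", scores.round(3))
print("Mean accuracy:", round(scores.mean(), 3))
```

Reporting the mean and spread over folds gives a more stable estimate than a single train/test split.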

# ***6. Prepare your data for the experiment, using cross-validation***

---



*6A: Naive Bayes*
---

*6B: Logistic Regression*
---

*6C: Multi Layer Perceptron*
---

Default Parameters Variation Test

In [None]:
#Importing MLPClassifier
from sklearn.neural_network import MLPClassifier
#Importing Confusion Matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
#plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay is used instead
from sklearn import metrics


#Initializing the Default MLPClassifier
classifier = MLPClassifier()


#Fitting the training data to the network, 4 runs (warm_start is off, so each fit() call retrains the network from scratch)
print("Training Begins")
for i in range (4):
    print("Training Run commenced: ", i)
    classifier.fit(X_train, y_train)
    print("the score for this run is: ",classifier.score(X_train, y_train))
print("Training Done")

print()
print("Final Training Score: ", classifier.score(X_train, y_train))
print("Final Test Score: ", classifier.score(X_test, y_test))


y_pred = classifier.predict(X_test)

print()
print(metrics.classification_report(y_test, y_pred))
print(metrics.confusion_matrix(y_test, y_pred))

#Printing the accuracy
print()
print (f'Accuracy Score: {classifier.score(X_test,y_test):.3f}')

#Plot the confusion matrix - This method has been deprecated
#fig = plot_confusion_matrix(classifier, X_test, y_test, display_labels=classifier.classes_)
#fig.figure_.suptitle("Confusion Matrix for Heart Disease Dataset MLP Default Variation ")
#plt.show()
print()
cm = confusion_matrix(y_test, y_pred)
cm_obj = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classifier.classes_)
cm_obj.plot()
plt.show()

#Loss Curve
print()
plt.plot(classifier.loss_curve_)
plt.title("Loss Curve MLP Default Variation", fontsize=14)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.show()
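The training loop above calls `fit()` several times on the same classifier. A small sketch on synthetic data shows what repeated calls do with and without `warm_start`. The data, the helper `losses_over_runs`, and the tiny `max_iter=5` are assumptions chosen only to keep the example fast.

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# max_iter is deliberately tiny, so non-convergence warnings are expected here.
warnings.filterwarnings("ignore", category=ConvergenceWarning)

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

def losses_over_runs(warm_start, runs=3):
    """Hypothetical helper: call fit() `runs` times and record the final loss of each call."""
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=5,
                        warm_start=warm_start, random_state=0)
    losses = []
    for _ in range(runs):
        clf.fit(X, y)
        losses.append(clf.loss_)
    return losses

cold_losses = losses_over_runs(warm_start=False)  # restarts from the same seed each time
warm_losses = losses_over_runs(warm_start=True)   # resumes from the previous weights

print("warm_start=False:", np.round(cold_losses, 4))
print("warm_start=True: ", np.round(warm_losses, 4))
```

Without `warm_start`, every `fit()` call re-initializes the weights, so repeating it does not accumulate training; with a fixed `random_state` the runs are identical. With `warm_start=True`, each call continues where the previous one stopped.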

Default Parameter Variation Sample of Predictions

In [None]:
#Predictions for Training Set
print("--------Training Predictions--------")
y_pred_sample = classifier.predict(X_train[0:10])
print("Predictions from MLP Default For First Ten Values of Training Set: " , y_pred_sample)
print("Actual Values For First Ten Values of Training Set:", y_train[0:10])
print("Predicted probability of each class (columns follow classifier.classes_)")
print(classifier.predict_proba(X_train[0:10]))

print() #A separator

#Predictions for Test Set
print("--------Test Predictions--------")
y_pred_sample = classifier.predict(X_test[0:10])
print("Predictions from MLP Default For First Ten Values of Test Set: " , y_pred_sample)
print("Actual Values For First Ten Values of Test Set:", y_test[0:10])
print("Predicted probability of each class (columns follow classifier.classes_)")
print(classifier.predict_proba(X_test[0:10]))
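The array returned by `predict_proba` holds one column per class, in the order given by `classes_`, and each row sums to 1. The sketch below illustrates this on synthetic data; the `LogisticRegression` model is only a fast stand-in (an assumption), since any scikit-learn classifier with `predict_proba` follows the same convention.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with string labels like the notebook's 'TRUE' / 'FALSE' (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
labels = np.where(y == 1, "TRUE", "FALSE")

clf = LogisticRegression(max_iter=1000).fit(X, labels)

# One row per sample, one column per class, ordered like clf.classes_.
proba = clf.predict_proba(X[:5])
print("Class order:", clf.classes_)
print(proba.round(3))

# The predicted label is the class with the highest probability in each row.
predicted = clf.classes_[proba.argmax(axis=1)]
print("Predicted:", predicted)
```

Reading the second column as "probability of TRUE" is therefore only valid after checking the order of `classes_`.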



Variation 1 Test

In [None]:
#Importing MLPClassifier
from sklearn.neural_network import MLPClassifier
#Importing Confusion Matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
#plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay is used instead
from sklearn import metrics

#Initializing the MLPClassifier with variations
classifier_var1 = MLPClassifier(hidden_layer_sizes=(5,2), max_iter=300,activation = 'relu',solver='adam',random_state=1)
#classifier = MLPClassifier(hidden_layer_sizes=(150,100,50), max_iter=300,activation = 'relu',solver='adam',random_state=1)

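#Fitting the training data to the network, 4 runs (warm_start is off and random_state is fixed, so each fit() call retrains the same network from scratch)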
print("Training Begins")
for i in range (4):
    print("Training Run commenced: ", i)
    classifier_var1.fit(X_train, y_train)
    print("the score for this run is: ",classifier_var1.score(X_train, y_train))
print("Training Done")

print()
print("Final Training Score: ", classifier_var1.score(X_train, y_train))
print("Final Test Score: ", classifier_var1.score(X_test, y_test))


y_pred = classifier_var1.predict(X_test)

print()
print(metrics.classification_report(y_test, y_pred))
print(metrics.confusion_matrix(y_test, y_pred))

#Printing the accuracy
print()
print (f'Accuracy Score: {classifier_var1.score(X_test,y_test):.3f}')

# #Plot the confusion matrix
# fig = plot_confusion_matrix(classifier_var1, X_test, y_test, display_labels=classifier_var1.classes_)
# fig.figure_.suptitle("Confusion Matrix for Heart Disease Dataset MLP Variation 1")
# plt.show()
print()
cm = confusion_matrix(y_test, y_pred)
cm_obj = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classifier_var1.classes_)
cm_obj.plot()
plt.show()

#Loss Curve for Var1
print()
plt.plot(classifier_var1.loss_curve_)
plt.title("Loss Curve MLP Variation 1", fontsize=14)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.show()

Variation 1 Sample of Predictions

In [None]:
#Predictions for Training Set
print("--------Training Predictions--------")
y_pred_sample = classifier_var1.predict(X_train[0:10])
print("Predictions from MLP Variation 1 For First Ten Values of Training Set: " , y_pred_sample)
print("Actual Values For First Ten Values of Training Set:", y_train[0:10])
print("Predicted probability of each class (columns follow classifier_var1.classes_)")
print(classifier_var1.predict_proba(X_train[0:10]))

print() #A separator

#Predictions for Test Set
print("--------Test Predictions--------")
y_pred_sample = classifier_var1.predict(X_test[0:10])
print("Predictions from MLP Variation 1 For First Ten Values of Test Set: " , y_pred_sample)
print("Actual Values For First Ten Values of Test Set:", y_test[0:10])
print("Predicted probability of each class (columns follow classifier_var1.classes_)")
print(classifier_var1.predict_proba(X_test[0:10]))


Variation 2 Test

In [None]:
#Importing MLPClassifier
from sklearn.neural_network import MLPClassifier
#Importing Confusion Matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
#plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay is used instead
from sklearn import metrics

#Initializing the MLPClassifier with variations again
classifier_var2 = MLPClassifier(hidden_layer_sizes=(150, 150), max_iter=100,  solver='sgd', alpha=1e-4, random_state=1, learning_rate_init=0.01, warm_start=True)
#classifier = MLPClassifier(hidden_layer_sizes=(150,100,50), max_iter=300,activation = 'relu',solver='adam',random_state=1)


#Fitting the training data to the network, 4 runs (warm_start=True, so each fit() call resumes from the previous weights)
print("Training Begins")
for i in range (4):
    print("Training Run commenced: ", i)
    classifier_var2.fit(X_train, y_train)
    print("the score for this run is: ",classifier_var2.score(X_train, y_train))
print("Training Done")

print()
print("Final Training Score: ", classifier_var2.score(X_train, y_train))
print("Final Test Score: ", classifier_var2.score(X_test, y_test))


y_pred = classifier_var2.predict(X_test)

print()
print(metrics.classification_report(y_test, y_pred))
print(metrics.confusion_matrix(y_test, y_pred))

#Printing the accuracy
print()
print (f'Accuracy Score: {classifier_var2.score(X_test,y_test):.3f}')

# #Plot the confusion matrix
# fig = plot_confusion_matrix(classifier_var2, X_test, y_test, display_labels=classifier_var2.classes_)
# fig.figure_.suptitle("Confusion Matrix for Heart Disease Dataset MLP Variation 2")
# plt.show()
print()
cm = confusion_matrix(y_test, y_pred)
cm_obj = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classifier_var2.classes_)
cm_obj.plot()
plt.show()

#Loss Curve for Var2
print()
plt.plot(classifier_var2.loss_curve_)
plt.title("Loss Curve For MLP Variation 2", fontsize=14)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.show()

Variation 2 Sample of Predictions

In [None]:
#Predictions for Training Set
print("--------Training Predictions--------")
y_pred_sample = classifier_var2.predict(X_train[0:10])
print("Predictions from MLP Variation 2 For First Ten Values of Training Set: " , y_pred_sample)
print("Actual Values For First Ten Values of Training Set:", y_train[0:10])
print("Predicted probability of each class (columns follow classifier_var2.classes_)")
print(classifier_var2.predict_proba(X_train[0:10]))

print() #A separator

#Predictions for Test Set
print("--------Test Predictions--------")
y_pred_sample = classifier_var2.predict(X_test[0:10])
print("Predictions from MLP Variation 2 For First Ten Values of Test Set: " , y_pred_sample)
print("Actual Values For First Ten Values of Test Set:", y_test[0:10])
print("Predicted probability of each class (columns follow classifier_var2.classes_)")
print(classifier_var2.predict_proba(X_test[0:10]))

# **10. Analyze the obtained results**

---

