## The Heart Dataset

File name: 'D3_Heart_Dataset.csv'

This dataset has been obtained from Kaggle: https://www.kaggle.com/fedesoriano/heart-failure-prediction

The data contains 918 observations with 12 attributes as described below:
1. Age: patient's age, range: 28 to 77.
2. Gender: patient's gender, M (79%), F (21%).
3. ChestPainType: ASY (54%), NAP (22%), Other (24%).
4. RestingBP: resting blood pressure, range: 0 to 200.
5. Cholesterol: serum cholesterol, range: 0 to 603.
6. FastingBS: fasting blood sugar, 0 or 1.
7. RestingECG: resting electrocardiogram results, Normal (60%), LVH (20%), Other (19%).
8. MaxHR: maximum heart rate achieved, range: 60 to 202.
9. ExerciseAngina: exercise-induced angina, Y (371, 40%), N (547, 60%).
10. Oldpeak: ST depression (oldpeak = ST), range: -2.6 to 6.2.
11. ST_Slope: slope of the peak exercise ST segment, Up, Flat, or Down.
12. HeartDisease: target, 1 or 0.

The last column indicates the presence of heart disease, which is to be predicted from the remaining 11 attributes.

This is a binary classification problem.

The dataset contains categorical features that must be encoded before modelling; otherwise it is clean, with no null values.

## Mounting Google Drive


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## The `/content` Directory
`/content` is the default working directory in Google Colab, where all file operations start. When you mount your Google Drive, it becomes accessible under `/content/drive`.

In [None]:
pwd

'/content'

In [None]:
PATH = '/content/drive/MyDrive/MLPractical'

In [None]:
!ls /content/drive/MyDrive/MLPractical


 array_2d.txt	     D3_Heart_Dataset.csv   ML_Lab_02.ipynb		   PRACTICAL-SPRING25.jpg
 array_archive.npz   data.txt		   'ML_Lab 03_Naive Bayes.ipynb'   some_array.npy
 array_file.txt      ML_Lab_01.ipynb	   'ML Workbook 2024.pdf'	   some_array.txt.npy


## Loading and Exploring Dataset

In [None]:
import pandas as pd

# Read the file into a DataFrame
PATH = '/content/drive/MyDrive/MLPractical'
data = pd.read_csv(f'{PATH}/D3_Heart_Dataset.csv')
# Display the contents
data

Unnamed: 0,Age,Gender,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0
...,...,...,...,...,...,...,...,...,...,...,...,...
913,45,M,TA,110,264,0,Normal,132,N,1.2,Flat,1
914,68,M,ASY,144,193,1,Normal,141,N,3.4,Flat,1
915,57,M,ASY,130,131,0,Normal,115,Y,1.2,Flat,1
916,57,F,ATA,130,236,0,LVH,174,N,0.0,Flat,1


In [None]:
# Finding datatype of data
type(data)

In [None]:
# Displaying general info
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Gender          918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB
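
The five `object` columns in the summary above are the categorical features that need encoding. As a minimal sketch, they can be listed programmatically rather than read off by eye. The frame `sample` below is a hypothetical stand-in with a few made-up rows, not the real CSV.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the heart dataset (not the real file).
sample = pd.DataFrame({
    "Age": [40, 49, 37],
    "Gender": ["M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "ATA"],
    "Oldpeak": [0.0, 1.0, 0.0],
    "HeartDisease": [0, 1, 0],
})

# Non-numeric columns are the ones that need encoding. Excluding numeric
# dtypes works regardless of how pandas stores the strings internally.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)

# Number of distinct labels in each categorical column.
print(sample[categorical_cols].nunique())
```

On the real `data` frame, the same two calls would be applied to `data` instead of `sample`.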


## Separating Features and Target

In [None]:
# separating predictors
X = data.drop("HeartDisease",axis=1)
X

Unnamed: 0,Age,Gender,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up
...,...,...,...,...,...,...,...,...,...,...,...
913,45,M,TA,110,264,0,Normal,132,N,1.2,Flat
914,68,M,ASY,144,193,1,Normal,141,N,3.4,Flat
915,57,M,ASY,130,131,0,Normal,115,Y,1.2,Flat
916,57,F,ATA,130,236,0,LVH,174,N,0.0,Flat


In [None]:
# separating target
Y = data["HeartDisease"]
Y

Unnamed: 0,HeartDisease
0,0
1,1
2,0
3,1
4,0
...,...
913,1
914,1
915,1
916,1


## Applying Ordinal Encoding on all Five Categorical Features

In [None]:
# Feature: Gender
X['Gender'].unique()

array(['M', 'F'], dtype=object)

In [None]:
# Map each label to its integer code in a single step
X['Gender'] = X['Gender'].map({'M': 1, 'F': 0})
X


Unnamed: 0,Age,Gender,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope
0,40,1,ATA,140,289,0,Normal,172,N,0.0,Up
1,49,0,NAP,160,180,0,Normal,156,N,1.0,Flat
2,37,1,ATA,130,283,0,ST,98,N,0.0,Up
3,48,0,ASY,138,214,0,Normal,108,Y,1.5,Flat
4,54,1,NAP,150,195,0,Normal,122,N,0.0,Up
...,...,...,...,...,...,...,...,...,...,...,...
913,45,1,TA,110,264,0,Normal,132,N,1.2,Flat
914,68,1,ASY,144,193,1,Normal,141,N,3.4,Flat
915,57,1,ASY,130,131,0,Normal,115,Y,1.2,Flat
916,57,0,ATA,130,236,0,LVH,174,N,0.0,Flat


In [None]:
# Feature: ChestPainType
X['ChestPainType'].unique()

array(['ATA', 'NAP', 'ASY', 'TA'], dtype=object)

In [None]:
# Map each label to its integer code in a single step
X['ChestPainType'] = X['ChestPainType'].map({'ATA': 1, 'NAP': 2, 'ASY': 3, 'TA': 4})
X


Unnamed: 0,Age,Gender,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope
0,40,1,1,140,289,0,Normal,172,N,0.0,Up
1,49,0,2,160,180,0,Normal,156,N,1.0,Flat
2,37,1,1,130,283,0,ST,98,N,0.0,Up
3,48,0,3,138,214,0,Normal,108,Y,1.5,Flat
4,54,1,2,150,195,0,Normal,122,N,0.0,Up
...,...,...,...,...,...,...,...,...,...,...,...
913,45,1,4,110,264,0,Normal,132,N,1.2,Flat
914,68,1,3,144,193,1,Normal,141,N,3.4,Flat
915,57,1,3,130,131,0,Normal,115,Y,1.2,Flat
916,57,0,1,130,236,0,LVH,174,N,0.0,Flat


In [None]:
# Feature: RestingECG
X['RestingECG'].unique()

array(['Normal', 'ST', 'LVH'], dtype=object)

In [None]:
# Map each label to its integer code in a single step
X['RestingECG'] = X['RestingECG'].map({'Normal': 1, 'ST': 2, 'LVH': 3})
X


Unnamed: 0,Age,Gender,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope
0,40,1,1,140,289,0,1,172,N,0.0,Up
1,49,0,2,160,180,0,1,156,N,1.0,Flat
2,37,1,1,130,283,0,2,98,N,0.0,Up
3,48,0,3,138,214,0,1,108,Y,1.5,Flat
4,54,1,2,150,195,0,1,122,N,0.0,Up
...,...,...,...,...,...,...,...,...,...,...,...
913,45,1,4,110,264,0,1,132,N,1.2,Flat
914,68,1,3,144,193,1,1,141,N,3.4,Flat
915,57,1,3,130,131,0,1,115,Y,1.2,Flat
916,57,0,1,130,236,0,3,174,N,0.0,Flat


In [None]:
# Feature: ExerciseAngina
X['ExerciseAngina'].unique()

array(['N', 'Y'], dtype=object)

In [None]:
# Map each label to its integer code in a single step
X['ExerciseAngina'] = X['ExerciseAngina'].map({'Y': 1, 'N': 0})
X


Unnamed: 0,Age,Gender,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope
0,40,1,1,140,289,0,1,172,0,0.0,Up
1,49,0,2,160,180,0,1,156,0,1.0,Flat
2,37,1,1,130,283,0,2,98,0,0.0,Up
3,48,0,3,138,214,0,1,108,1,1.5,Flat
4,54,1,2,150,195,0,1,122,0,0.0,Up
...,...,...,...,...,...,...,...,...,...,...,...
913,45,1,4,110,264,0,1,132,0,1.2,Flat
914,68,1,3,144,193,1,1,141,0,3.4,Flat
915,57,1,3,130,131,0,1,115,1,1.2,Flat
916,57,0,1,130,236,0,3,174,0,0.0,Flat


In [None]:
# Feature: ST_Slope
X['ST_Slope'].unique()

array(['Up', 'Flat', 'Down'], dtype=object)

In [None]:
# Map each label to its integer code in a single step
X['ST_Slope'] = X['ST_Slope'].map({'Up': 0, 'Flat': 1, 'Down': 2})
X


Unnamed: 0,Age,Gender,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope
0,40,1,1,140,289,0,1,172,0,0.0,0
1,49,0,2,160,180,0,1,156,0,1.0,1
2,37,1,1,130,283,0,2,98,0,0.0,0
3,48,0,3,138,214,0,1,108,1,1.5,1
4,54,1,2,150,195,0,1,122,0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...
913,45,1,4,110,264,0,1,132,0,1.2,1
914,68,1,3,144,193,1,1,141,0,3.4,1
915,57,1,3,130,131,0,1,115,1,1.2,1
916,57,0,1,130,236,0,3,174,0,0.0,1
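
The five per-column steps above can also be collected into one dictionary and applied in a loop. This is only a sketch: `encodings` and `sample` are hypothetical names, the integer codes are assumed to match the ones chosen above, and the rows are made up for illustration.

```python
import pandas as pd

# Integer codes for each categorical column, matching the choices made above.
encodings = {
    "Gender": {"M": 1, "F": 0},
    "ChestPainType": {"ATA": 1, "NAP": 2, "ASY": 3, "TA": 4},
    "RestingECG": {"Normal": 1, "ST": 2, "LVH": 3},
    "ExerciseAngina": {"Y": 1, "N": 0},
    "ST_Slope": {"Up": 0, "Flat": 1, "Down": 2},
}

# Hypothetical rows standing in for the categorical part of X.
sample = pd.DataFrame({
    "Gender": ["M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "TA"],
    "RestingECG": ["Normal", "ST", "LVH"],
    "ExerciseAngina": ["N", "Y", "N"],
    "ST_Slope": ["Up", "Flat", "Down"],
})

encoded = sample.copy()
for column, mapping in encodings.items():
    encoded[column] = encoded[column].map(mapping)
    # A label missing from the mapping would silently become NaN, so fail loudly.
    assert encoded[column].notna().all(), f"unmapped label in {column}"

print(encoded)
```

Keeping the codes in one place makes it easier to reuse exactly the same encoding on new data later.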


## Splitting the Dataset into train and test sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=0)
# random_state=0 fixes the shuffling seed, so rerunning this cell
# always produces the same train/test split.
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(734, 11)
(184, 11)
(734,)
(184,)
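
The effect of `random_state` can be checked directly. The sketch below uses a small made-up array (`X_demo`, `y_demo` are hypothetical names) and splits it twice with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# Two independent calls with the same seed.
first = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)
second = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)

# Every returned piece (X_train, X_test, y_train, y_test) should be identical.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = first
print(X_train_demo.shape, X_test_demo.shape)
```

Omitting `random_state` (or changing its value) would generally give a different shuffle on each run.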


## Creating Gaussian Naive Bayes Model

In [None]:
from sklearn.naive_bayes import GaussianNB

# Creating Gaussian Naive Bayes Object
classifer1 = GaussianNB()

In [None]:
# Training the model
model1 = classifer1.fit(X_train, Y_train) # supervised learning

In [None]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report

# Evaluating the model
Y_pred1 = model1.predict(X_test)
print("The accuracy is "+str(metrics.accuracy_score(Y_test,Y_pred1)*100)+"%")
print(confusion_matrix(Y_test, Y_pred1))

The accuracy is 83.15217391304348%
[[60 17]
 [14 93]]


In [None]:
target_names = ['class 0', 'class 1']
print(classification_report(Y_test, Y_pred1, target_names=target_names))
# Displays precision, recall, F1-score, and support for each class.
# Support is the number of true instances of each class in the test set (Y_test).

              precision    recall  f1-score   support

     class 0       0.81      0.78      0.79        77
     class 1       0.85      0.87      0.86       107

    accuracy                           0.83       184
   macro avg       0.83      0.82      0.83       184
weighted avg       0.83      0.83      0.83       184
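
The figures in the report can be recovered by hand from the confusion matrix printed earlier. The sketch below hard-codes that matrix (rows are true classes, columns are predicted classes, as scikit-learn lays it out) and recomputes each metric with NumPy.

```python
import numpy as np

# Confusion matrix from the Gaussian Naive Bayes run above.
# Rows are true classes, columns are predicted classes.
cm = np.array([[60, 17],
               [14, 93]])

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = np.trace(cm) / cm.sum()

# Precision: correct predictions of a class over everything predicted as it.
precision = np.diag(cm) / cm.sum(axis=0)

# Recall: correct predictions of a class over everything truly in it.
recall = np.diag(cm) / cm.sum(axis=1)

# F1-score: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Support: number of true instances of each class.
support = cm.sum(axis=1)

print(round(accuracy, 4))
print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2), support)
```

Rounded to two decimals, these values agree with the precision, recall, F1-score, and support columns of the report.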



## Creating Gaussian Naive Bayes Model with Prior Probabilities of Classes

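- By default, GaussianNB estimates the class priors from the class frequencies in the training data. The `priors` parameter overrides those estimates; it is typically used when one target class has limited data and you want to specify, for example, equal initial probabilities.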
- Equal initial probabilities means that the prior probabilities for all classes are the same, implying no bias toward any particular class at the start.
- For example, in a binary classification problem with classes 0 and 1, equal probabilities would be P(class 0) = 0.5 and P(class 1) = 0.5.
- These priors represent your assumption that, in the absence of any feature data, both classes are equally likely to occur.
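
To see what the parameter changes, the sketch below fits two models on a made-up imbalanced toy set (`X_toy`, `y_toy` are hypothetical, not the heart data) and compares the priors each model ends up using.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Hypothetical imbalanced toy data: 90 samples of class 0, 10 of class 1.
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(90, 2)),
                   rng.normal(2.0, 1.0, size=(10, 2))])
y_toy = np.array([0] * 90 + [1] * 10)

# Without priors, the class frequencies in the training data are used.
default_model = GaussianNB().fit(X_toy, y_toy)

# With priors, the supplied values are used and not adjusted to the data.
equal_model = GaussianNB(priors=[0.5, 0.5]).fit(X_toy, y_toy)

print(default_model.class_prior_)
print(equal_model.class_prior_)

# Raising the prior of class 1 can only move predictions toward class 1.
n_default = int((default_model.predict(X_toy) == 1).sum())
n_equal = int((equal_model.predict(X_toy) == 1).sum())
print(n_default, n_equal)
```

The per-class means and variances are the same in both models; only the prior term in the posterior differs.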

In [None]:
# Creating Gaussian Naive Bayes Object
classifer2 = GaussianNB(priors=[0.25, 0.75]) # unequal priors
# This gives class 1 a higher initial probability than class 0, which might be useful if:
# - you know a priori that class 1 is more common than class 0, or
# - you have an imbalanced dataset and want to compensate for the class imbalance.

# Training the model
model2 = classifer2.fit(X_train, Y_train)

# Evaluating the model
Y_pred2 = model2.predict(X_test)
print("The accuracy is "+str(metrics.accuracy_score(Y_test,Y_pred2)*100)+"%")
print(confusion_matrix(Y_test, Y_pred2))

The accuracy is 84.23913043478261%
[[58 19]
 [10 97]]


## Creating Multinomial Naive Bayes Model

MultinomialNB is a type of Naive Bayes classifier used for classification problems, especially when features represent discrete frequency counts, like word counts in text classification.

## When to Use MultinomialNB

1. Discrete features: it works best when your features are counts of occurrences, such as word frequencies in text data, or categorical features represented as integers. Example: the number of times a symptom appears, or other count-based features.
2. Non-negative features: MultinomialNB requires all feature values to be non-negative (zero or positive).



In [None]:
from sklearn.naive_bayes import MultinomialNB

# Creating Multinomial Naive Bayes Object
classifer3 = MultinomialNB()

# Training the model
model3 = classifer3.fit(X_train, Y_train)

# Evaluating the model
Y_pred3 = model3.predict(X_test)
print("The accuracy is "+str(metrics.accuracy_score(Y_test,Y_pred3)*100)+"%")
print(confusion_matrix(Y_test, Y_pred3))

ValueError: Negative values in data passed to MultinomialNB (input X).
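
The error is consistent with the dataset description: Oldpeak ranges from -2.6 to 6.2, and MultinomialNB rejects negative inputs. One possible workaround, shown here only as a sketch on made-up data and not as the notebook's own method, is to rescale every feature to [0, 1] with `MinMaxScaler` before fitting. The names `X_toy`, `y_toy`, and `scaled_nb` are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy features; the second column contains negative values,
# in the way Oldpeak does in the heart dataset.
X_toy = np.array([[40, -2.6], [49, 1.0], [37, 0.0], [48, 1.5],
                  [54, 0.0], [45, 1.2], [68, 3.4], [57, -0.5]])
y_toy = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# Fitting directly on data with negative values raises a ValueError.
try:
    MultinomialNB().fit(X_toy, y_toy)
    raised = False
except ValueError as err:
    raised = True
    print(err)

# Rescaling each feature to [0, 1] removes the negative values.
# clip=True keeps unseen data inside that range at prediction time.
scaled_nb = make_pipeline(MinMaxScaler(clip=True), MultinomialNB())
scaled_nb.fit(X_toy, y_toy)
predictions = scaled_nb.predict(X_toy)
print(predictions)
```

Rescaling only makes the model run. The features are still not counts, so GaussianNB remains the more natural fit for continuous attributes such as Age, MaxHR, and Oldpeak.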