# **Bayesian Classifiers**

## 1. Introduction

In this notebook I use Naive Bayes classifiers to predict whether a Japanese person is predisposed to having a heart attack.

---


## 2. Setting up the data

In [23]:
import pandas as pd
from pathlib import Path

def importData():
    csvPath = Path('C:/Users/C00273530/Desktop/Datasets/japan_heart_attack_dataset.csv')
    data = pd.read_csv(csvPath)
    return data
data = importData()
data.head()

| Unnamed: 0 | Age | Gender | Region | Smoking_History | Diabetes_History | Hypertension_History | Cholesterol_Level | Physical_Activity | Diet_Quality | Alcohol_Consumption | ... | Extra_Column_6 | Extra_Column_7 | Extra_Column_8 | Extra_Column_9 | Extra_Column_10 | Extra_Column_11 | Extra_Column_12 | Extra_Column_13 | Extra_Column_14 | Extra_Column_15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 56 | Male | Urban | Yes | No | No | 186.400209 | Moderate | Poor | Low | ... | 0.007901 | 0.794583 | 0.290779 | 0.497193 | 0.521995 | 0.799657 | 0.722398 | 0.148739 | 0.83401 | 0.061632 |
| 1 | 69 | Male | Urban | No | No | No | 185.136747 | Low | Good | Low | ... | 0.083933 | 0.688951 | 0.830164 | 0.63449 | 0.302043 | 0.043683 | 0.451668 | 0.878671 | 0.535602 | 0.617825 |
| 2 | 46 | Male | Rural | Yes | No | No | 210.696611 | Low | Average | Moderate | ... | 0.227205 | 0.496344 | 0.752107 | 0.181501 | 0.62918 | 0.018276 | 0.063227 | 0.146512 | 0.997296 | 0.974455 |
| 3 | 32 | Female | Urban | No | No | No | 211.165478 | Moderate | Good | High | ... | 0.403182 | 0.741409 | 0.223968 | 0.329314 | 0.143191 | 0.907781 | 0.542322 | 0.922461 | 0.626217 | 0.228606 |
| 4 | 60 | Female | Rural | No | No | No | 223.814253 | High | Good | High | ... | 0.689787 | 0.904574 | 0.757098 | 0.337761 | 0.362375 | 0.728552 | 0.176699 | 0.484749 | 0.312091 | 0.452809 |


### 2.1 Data Preprocessing

Below I simply remove the columns that my algorithm does not use.

In [20]:
def dropEmptyColumns(data):
    data.drop(['Extra_Column_1',
            'Extra_Column_2',
            'Extra_Column_3',
            'Extra_Column_4',
            'Extra_Column_5',
            'Extra_Column_6', 
            'Extra_Column_7',
            'Extra_Column_8',
            'Extra_Column_9',
            'Extra_Column_10',
            'Extra_Column_11',
            'Extra_Column_12',
            'Extra_Column_13',
            'Extra_Column_14',
            'Extra_Column_15'
            ], axis=1, inplace=True)
    return data
data = dropEmptyColumns(data)
data.head()

| Unnamed: 0 | Age | Gender | Region | Smoking_History | Diabetes_History | Hypertension_History | Cholesterol_Level | Physical_Activity | Diet_Quality | Alcohol_Consumption | Stress_Levels | BMI | Heart_Rate | Systolic_BP | Diastolic_BP | Family_History | Heart_Attack_Occurrence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 56 | Male | Urban | Yes | No | No | 186.400209 | Moderate | Poor | Low | 3.644786 | 33.961349 | 72.301534 | 123.90209 | 85.682809 | No | No |
| 1 | 69 | Male | Urban | No | No | No | 185.136747 | Low | Good | Low | 3.384056 | 28.242873 | 57.45764 | 129.893306 | 73.524262 | Yes | No |
| 2 | 46 | Male | Rural | Yes | No | No | 210.696611 | Low | Average | Moderate | 3.810911 | 27.60121 | 64.658697 | 145.654901 | 71.994812 | No | No |
| 3 | 32 | Female | Urban | No | No | No | 211.165478 | Moderate | Good | High | 6.014878 | 23.717291 | 55.131469 | 131.78522 | 68.211333 | No | No |
| 4 | 60 | Female | Rural | No | No | No | 223.814253 | High | Good | High | 6.806883 | 19.771578 | 76.667917 | 100.694559 | 92.902489 | No | No |


Below is the function I used to call all the other preprocessing functions.

In [28]:
def preprocessData(data):
    data = data.dropna()
    data = dropEmptyColumns(data)
    return data
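
Calling an in-place `drop` on the frame returned by `dropna()` is a pattern that pandas may flag with a `SettingWithCopyWarning`, depending on the pandas version. The sketch below shows one possible way to avoid mutating anything in place. The function name `preprocess_data_no_inplace` and the toy rows are hypothetical and are not part of the original notebook; only the column naming scheme is borrowed from the dataset.

```python
import pandas as pd

def preprocess_data_no_inplace(data):
    """Hypothetical variant of preprocessData that never mutates its input."""
    # Pick up every placeholder column by its name prefix
    extra_columns = [c for c in data.columns if c.startswith('Extra_Column_')]
    # dropna() and drop() both return new DataFrames, so nothing is changed in place
    return data.dropna().drop(columns=extra_columns)

# Made-up rows that only mimic the layout of the dataset
toy = pd.DataFrame({
    'Age': [56, 69, None],
    'Gender': ['Male', 'Male', 'Female'],
    'Extra_Column_1': [0.1, 0.2, 0.3],
    'Extra_Column_2': [0.4, 0.5, 0.6],
})

clean = preprocess_data_no_inplace(toy)
print(clean.columns.tolist())  # ['Age', 'Gender']
print(len(clean))              # 2 (the row with a missing Age is dropped)
```

Because the input frame is left untouched, the function can be called repeatedly on the same `data` without side effects.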

Unlike a decision tree, a single Naive Bayes model cannot handle categorical and numerical features at the same time, so I have to split the data into two separate datasets, which I will combine later on.

In [None]:
# Collect the names of the categorical (object-dtype) columns, to be used by the Multinomial Naive Bayes classifier
def extractCategoricalData(data):
    categorical = [var for var in data.columns if data[var].dtype=='O']
    return categorical

In [None]:
# Collect the names of the numerical columns, to be used by the Gaussian Naive Bayes classifier
def extractNumericalData(data):
    numerical = [var for var in data.columns if data[var].dtype!='O']
    return numerical
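
One possible way to combine the two datasets later is to fit one Naive Bayes model per feature type and merge their predictions. Under the naive independence assumption, the joint posterior is proportional to the product of the two per-model posteriors divided by the class prior (which would otherwise be counted twice). The sketch below illustrates that idea; it is an assumption about how the combination could be done, not the notebook's actual method. The toy rows are made up, the one-hot encoding of the categorical columns is my own choice, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.preprocessing import OneHotEncoder

# Made-up rows that only borrow column names from the dataset
toy = pd.DataFrame({
    'Age': [56, 69, 46, 32, 60, 41, 73, 28],
    'Cholesterol_Level': [186.4, 185.1, 210.7, 211.2, 223.8, 240.3, 251.9, 170.2],
    'Gender': ['Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Female'],
    'Smoking_History': ['Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No'],
    'Heart_Attack_Occurrence': ['No', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'No'],
})
y = toy['Heart_Attack_Occurrence']
num_cols = ['Age', 'Cholesterol_Level']   # continuous features
cat_cols = ['Gender', 'Smoking_History']  # categorical features

# Gaussian Naive Bayes on the numerical columns
gnb = GaussianNB().fit(toy[num_cols], y)

# Multinomial Naive Bayes on one-hot encoded categorical columns
encoder = OneHotEncoder(handle_unknown='ignore')
mnb = MultinomialNB().fit(encoder.fit_transform(toy[cat_cols]), y)

# Sum the two log-posteriors and subtract the log prior once,
# since each model has already included it
log_post = (gnb.predict_log_proba(toy[num_cols])
            + mnb.predict_log_proba(encoder.transform(toy[cat_cols]))
            - np.log(gnb.class_prior_))

# Renormalise so that each row is a proper log-probability distribution
log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)

combined_pred = gnb.classes_[np.argmax(log_post, axis=1)]
print(combined_pred)
```

Both models are fitted on the same `y`, so their `classes_` arrays share the same order and the columns of the two probability matrices line up.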

## 3. Naive Bayes Classifier Algorithm 

In [36]:
from sklearn.model_selection import train_test_split

data = importData()
data = preprocessData(data)

# preprocessData() emits a SettingWithCopyWarning (in-place drop on the result of dropna()); fix later
X = data.drop(['Heart_Attack_Occurrence'], axis=1)
y = data['Heart_Attack_Occurrence']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
X_train.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.drop(['Extra_Column_1',


| Unnamed: 0 | Age | Gender | Region | Smoking_History | Diabetes_History | Hypertension_History | Cholesterol_Level | Physical_Activity | Diet_Quality | Alcohol_Consumption | Stress_Levels | BMI | Heart_Rate | Systolic_BP | Diastolic_BP | Family_History |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 27203 | 18 | Male | Urban | No | No | No | 255.090564 | High | Good | High | 4.064282 | 24.761419 | 91.126405 | 130.290083 | 69.356415 | No |
| 22325 | 75 | Male | Rural | No | No | No | 179.789535 | High | Average | High | 4.847468 | 25.725442 | 84.450749 | 128.389267 | 69.478532 | Yes |
| 2163 | 73 | Female | Urban | No | No | No | 219.149282 | Moderate | Poor | Low | 6.284325 | 29.987494 | 75.611959 | 153.278732 | 80.571403 | Yes |
| 18517 | 46 | Female | Urban | Yes | No | No | 210.167143 | High | Good | Moderate | 5.932378 | 23.897982 | 62.152381 | 128.26941 | 90.468372 | Yes |
| 21458 | 19 | Female | Rural | No | No | Yes | 237.700084 | Moderate | Average | Moderate | 3.406807 | 18.441184 | 70.240041 | 109.855903 | 80.297938 | No |
