# Heart Disease Prediction

· Qianhao Zheng - 300125316

· Lan Luo - 300127181

According to the CDC, heart disease is one of the leading causes of death for people of most races in the US (African Americans, American Indians and Alaska Natives, and white people). 

About half of all Americans (47%) have at least 1 of 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking. 

Other key indicator include diabetic status, obesity (high BMI), not getting enough physical activity or drinking too much alcohol. 

Detecting and preventing the factors that have the greatest impact on heart disease is very important in healthcare. 

# Classification and Goal

We treat the variable "HeartDisease" as a binary ("Yes" - respondent had heart disease; "No" - respondent had no heart disease).

This project is for the "Heart Disease Prediction" application. We will ask users to complete a survey form, such as age, gender, smoking and drinking. We will analyze the data provided by users and predict the risk of heart disease. 

We will model and analyze data from a 2020 annual CDC survey data of 400k adults related to their health status.

# Analyzing and describing the dataset

Number of training examples: 319795

Number of features: 17

Distribution of features:

· HeartDisease: Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI)

· BMI: Body Mass Index

· Smoking: Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]

· AlcoholDrinking: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week

· Stroke: (Ever told) (you had) a stroke?
PhysicalHealth: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30

· MentalHealth: Thinking about your mental health, for how many days during the past 30 days was your mental health not good?

· DiffWalking: Do you have serious difficulty walking or climbing stairs?

· Sex: Are you male or female?

· AgeCategory: Fourteen-level age category

· Race: Imputed race/ethnicity value

· PhysicalActivity: Adults who reported doing physical activity or exercise during the past 30 days other than their regular job

· GenHealth: Would you say that in general your health is…

· SleepTime: On average, how many hours of sleep do you get in a 24-hour period?

· Asthma: (Ever told) (you had) asthma?

· KidneyDisease: Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease?

· SkinCancer: (Ever told) (you had) skin cancer?

# Justification of dataset choice

In my opinion, the survey needs to know the basic information of the people under investigation, so several basic features are indispensable, such as HeartDisease, Sex, Age, BMI and Race. 

I classified Smoking, AlcoholDrinking,  PhysicalActivity and SleepTime into one category.

I group Stroke, Asthma, KidneyDisease and SkinCancer, all of which represent whether or not someone has experienced some illness.

I will lump together the features of PhysicalHealth, MentalHealth, and GenHealth. In my opinion, these three features represent people's subjective assessment of their own physical condition.

Finally, there is DiffWalking, which I do not classify in any category. I think this feature is very important, because it represents the precursor of the body suffering from diseases.

Because we don't have too many features to choose and we don't know enough about heart disease, we think every feature is very important and we will use all features for modeling and analysis.

In [1]:
import pandas as pd

In [2]:
from sklearn import model_selection

# Show the whole dataset

In [3]:
file_name = "heart_2020_cleaned.csv"
df = pd.read_csv(file_name)
df

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
0,No,16.60,Yes,No,No,3.0,30.0,No,Female,55-59,White,Yes,Yes,Very good,5.0,Yes,No,Yes
1,No,20.34,No,No,Yes,0.0,0.0,No,Female,80 or older,White,No,Yes,Very good,7.0,No,No,No
2,No,26.58,Yes,No,No,20.0,30.0,No,Male,65-69,White,Yes,Yes,Fair,8.0,Yes,No,No
3,No,24.21,No,No,No,0.0,0.0,No,Female,75-79,White,No,No,Good,6.0,No,No,Yes
4,No,23.71,No,No,No,28.0,0.0,Yes,Female,40-44,White,No,Yes,Very good,8.0,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319790,Yes,27.41,Yes,No,No,7.0,0.0,Yes,Male,60-64,Hispanic,Yes,No,Fair,6.0,Yes,No,No
319791,No,29.84,Yes,No,No,0.0,0.0,No,Male,35-39,Hispanic,No,Yes,Very good,5.0,Yes,No,No
319792,No,24.24,No,No,No,0.0,0.0,No,Female,45-49,Hispanic,No,Yes,Good,6.0,No,No,No
319793,No,32.81,No,No,No,0.0,0.0,No,Female,25-29,Hispanic,No,No,Good,12.0,No,No,No


BMI is discrete data, so we do not process BMI.

Setting "Yes" and "No" to "1" and "0" respectively for "HeartDisease", "Smoking", "AlcoholDrinking", "Stroke", "DiffWalking", "Diabetic", "PhysicalActivity", "Asthma", "KidneyDisease", "SkinCancer".

Separate the data of "PhysicalHealth" and "MentalHealth" to three classes(0,1,2)

In [4]:
yn_to_10 = {"Yes":1, "No":0, 1:1, 0:0}
col_name_lists = ["HeartDisease", "Smoking", "AlcoholDrinking", "Stroke", "DiffWalking", 
                  "Diabetic", "PhysicalActivity", "Asthma", "KidneyDisease", "SkinCancer"]
for col_name in col_name_lists:
    df[col_name] = df[col_name].map(yn_to_10)


def thirty_days_to_3cls(x):
    if x == 0.0:
        return 0
    elif x == 30.0:
        return 2
    return 1

def convert_thirty_days_to_3cls(df, key):
    df[key] = df[key].map(thirty_days_to_3cls)
    

convert_thirty_days_to_3cls(df, "PhysicalHealth")
convert_thirty_days_to_3cls(df, "MentalHealth")
df

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
0,0,16.60,1,0,0,1,2,0,Female,55-59,White,1.0,1,Very good,5.0,1,0,1
1,0,20.34,0,0,1,0,0,0,Female,80 or older,White,0.0,1,Very good,7.0,0,0,0
2,0,26.58,1,0,0,1,2,0,Male,65-69,White,1.0,1,Fair,8.0,1,0,0
3,0,24.21,0,0,0,0,0,0,Female,75-79,White,0.0,0,Good,6.0,0,0,1
4,0,23.71,0,0,0,1,0,1,Female,40-44,White,0.0,1,Very good,8.0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319790,1,27.41,1,0,0,1,0,1,Male,60-64,Hispanic,1.0,0,Fair,6.0,1,0,0
319791,0,29.84,1,0,0,0,0,0,Male,35-39,Hispanic,0.0,1,Very good,5.0,1,0,0
319792,0,24.24,0,0,0,0,0,0,Female,45-49,Hispanic,0.0,1,Good,6.0,0,0,0
319793,0,32.81,0,0,0,0,0,0,Female,25-29,Hispanic,0.0,0,Good,12.0,0,0,0


# Changing discrete features to continuous features

# Setting "Female" and "Male" to "1" and "0" respectively

In [5]:
def sex_to_binary(x):
    return 1 if x == "Female" else 0
df['Sex'] = df['Sex'].map(sex_to_binary)

# Coding the data for each age group

From younger to older

In [6]:
a = df["AgeCategory"].value_counts()
age_num_dict = a.to_dict()
age_map = {key:i for i, key in enumerate(sorted(age_num_dict.keys()))}
print(age_map)
df["AgeCategory"] = df["AgeCategory"].map(age_map)
df

{'18-24': 0, '25-29': 1, '30-34': 2, '35-39': 3, '40-44': 4, '45-49': 5, '50-54': 6, '55-59': 7, '60-64': 8, '65-69': 9, '70-74': 10, '75-79': 11, '80 or older': 12}


Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
0,0,16.60,1,0,0,1,2,0,1,7,White,1.0,1,Very good,5.0,1,0,1
1,0,20.34,0,0,1,0,0,0,1,12,White,0.0,1,Very good,7.0,0,0,0
2,0,26.58,1,0,0,1,2,0,0,9,White,1.0,1,Fair,8.0,1,0,0
3,0,24.21,0,0,0,0,0,0,1,11,White,0.0,0,Good,6.0,0,0,1
4,0,23.71,0,0,0,1,0,1,1,4,White,0.0,1,Very good,8.0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319790,1,27.41,1,0,0,1,0,1,0,8,Hispanic,1.0,0,Fair,6.0,1,0,0
319791,0,29.84,1,0,0,0,0,0,0,3,Hispanic,0.0,1,Very good,5.0,1,0,0
319792,0,24.24,0,0,0,0,0,0,1,5,Hispanic,0.0,1,Good,6.0,0,0,0
319793,0,32.81,0,0,0,0,0,0,1,1,Hispanic,0.0,0,Good,12.0,0,0,0


# We use one-hot to process this dataset ("Race")
There are six features
Firstly, we convert the column of race to 0, 1, 2, 3, 4, 5. Since there are 6 FEATURES in total, but 5 degrees of freedom, we map 0, 1, 2, 3, 4, 5 to five columns.

In [7]:
print(df["Race"].value_counts())
# df["AgeCategory"] = df["AgeCategory"].map(age_map)
race_table = pd.get_dummies(df.Race, prefix='Race')
with_race_on_hot = pd.concat([df, race_table], axis=1, join='outer')
df = with_race_on_hot.drop(columns=['Race', "Race_American Indian/Alaskan Native"])
df

White                             245212
Hispanic                           27446
Black                              22939
Other                              10928
Asian                               8068
American Indian/Alaskan Native      5202
Name: Race, dtype: int64


Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,...,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer,Race_Asian,Race_Black,Race_Hispanic,Race_Other,Race_White
0,0,16.60,1,0,0,1,2,0,1,7,...,Very good,5.0,1,0,1,0,0,0,0,1
1,0,20.34,0,0,1,0,0,0,1,12,...,Very good,7.0,0,0,0,0,0,0,0,1
2,0,26.58,1,0,0,1,2,0,0,9,...,Fair,8.0,1,0,0,0,0,0,0,1
3,0,24.21,0,0,0,0,0,0,1,11,...,Good,6.0,0,0,1,0,0,0,0,1
4,0,23.71,0,0,0,1,0,1,1,4,...,Very good,8.0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319790,1,27.41,1,0,0,1,0,1,0,8,...,Fair,6.0,1,0,0,0,0,1,0,0
319791,0,29.84,1,0,0,0,0,0,0,3,...,Very good,5.0,1,0,0,0,0,1,0,0
319792,0,24.24,0,0,0,0,0,0,1,5,...,Good,6.0,0,0,0,0,0,1,0,0
319793,0,32.81,0,0,0,0,0,0,1,1,...,Good,12.0,0,0,0,0,0,1,0,0


# The five "Genhealth" are represented by numbers 0-4

In [8]:
a = df["GenHealth"].value_counts()
gen_health_dict = {"Poor":0, "Fair":1,"Good":2, "Very good":3, "Excellent":4}
df["GenHealth"] = df["GenHealth"].map(gen_health_dict)
df

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,...,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer,Race_Asian,Race_Black,Race_Hispanic,Race_Other,Race_White
0,0,16.60,1,0,0,1,2,0,1,7,...,3,5.0,1,0,1,0,0,0,0,1
1,0,20.34,0,0,1,0,0,0,1,12,...,3,7.0,0,0,0,0,0,0,0,1
2,0,26.58,1,0,0,1,2,0,0,9,...,1,8.0,1,0,0,0,0,0,0,1
3,0,24.21,0,0,0,0,0,0,1,11,...,2,6.0,0,0,1,0,0,0,0,1
4,0,23.71,0,0,0,1,0,1,1,4,...,3,8.0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319790,1,27.41,1,0,0,1,0,1,0,8,...,1,6.0,1,0,0,0,0,1,0,0
319791,0,29.84,1,0,0,0,0,0,0,3,...,3,5.0,1,0,0,0,0,1,0,0
319792,0,24.24,0,0,0,0,0,0,1,5,...,2,6.0,0,0,0,0,0,1,0,0
319793,0,32.81,0,0,0,0,0,0,1,1,...,2,12.0,0,0,0,0,0,1,0,0


# The new dataset after adjustment
We have changed all the discrete features to continuous features.

In [9]:
Y = df["HeartDisease"]
X = df.drop(columns=["HeartDisease"])
print(X)
print(Y)

          BMI  Smoking  AlcoholDrinking  Stroke  PhysicalHealth  MentalHealth  \
0       16.60        1                0       0               1             2   
1       20.34        0                0       1               0             0   
2       26.58        1                0       0               1             2   
3       24.21        0                0       0               0             0   
4       23.71        0                0       0               1             0   
...       ...      ...              ...     ...             ...           ...   
319790  27.41        1                0       0               1             0   
319791  29.84        1                0       0               0             0   
319792  24.24        0                0       0               0             0   
319793  32.81        0                0       0               0             0   
319794  46.56        0                0       0               0             0   

        DiffWalking  Sex  A

# Training set and testing set (5-fold cross-validation)
We extracted 1000 rows of data and divided them into five sections (0-199, 200-399, 400-599, 600-799, 800-999). The length of training set is 800, and the length of testing set is 200. The five parts are respectively used as testing set, and the remaining four parts are used as training set.

In [10]:
from sklearn.model_selection import KFold  
import numpy as np 
kf = KFold(n_splits=5,shuffle=False)  
for train_index , test_index in kf.split(X[:1000]):  
    print('train_index:%s , test_index: %s ' %(train_index,test_index)) 
    print("train_length:%s, test_length:%s" % (len(train_index), len(test_index)))


train_index:[200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217
 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235
 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253
 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289
 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307
 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325
 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343
 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361
 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379
 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397
 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415
 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433
 434 435 436 437 438 439 440 441 442 44

In [11]:
import torch
from torch.nn import Linear, ReLU, Sigmoid
from torch.nn.init import kaiming_uniform_, xavier_uniform_
import torch.nn.functional as F
import random

# Data analysis for the dataset

In [12]:
df.describe()

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,...,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer,Race_Asian,Race_Black,Race_Hispanic,Race_Other,Race_White
count,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0,...,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0
mean,0.085595,28.325399,0.412477,0.068097,0.03774,0.35246,0.412036,0.13887,0.524727,6.514536,...,2.595028,7.097075,0.134061,0.036833,0.093244,0.025229,0.07173,0.085824,0.034172,0.766779
std,0.279766,6.3561,0.492281,0.251912,0.190567,0.591813,0.59238,0.345812,0.499389,3.564759,...,1.042918,1.436007,0.340718,0.188352,0.290775,0.156819,0.258041,0.280104,0.181671,0.422883
min,0.0,12.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,24.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,2.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,0.0,27.34,0.0,0.0,0.0,0.0,0.0,0.0,1.0,7.0,...,3.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,0.0,31.42,1.0,0.0,0.0,1.0,1.0,0.0,1.0,9.0,...,3.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,1.0,94.85,1.0,1.0,1.0,2.0,2.0,1.0,1.0,12.0,...,4.0,24.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


# See if there are NaN rows in the data frame

In [13]:
df = df.dropna(axis=0,how='any')
df_numpy = df.to_numpy()
df_numpy
df.isnull().any().any() 

False

In [21]:
def setup_seed(seed):
     torch.manual_seed(seed)
     torch.cuda.manual_seed_all(seed)
     np.random.seed(seed)
     random.seed(seed)
     torch.backends.cudnn.deterministic = True
setup_seed(1234)
def rectify_correct(num):
    return num if num > 0.5 else 1 - num

# Multi-Layer Perceptron
We got the result: Multi-Layer Perception precision of test is 89.00%.

In [15]:
from tqdm import trange
import time
class MLP(torch.nn.Module):
    # define model elements
    def __init__(self, n_inputs):
        super(MLP, self).__init__()
        # input to first hidden layer
        self.hidden1 = Linear(n_inputs, 10)
        kaiming_uniform_(self.hidden1.weight, nonlinearity='relu')
        self.act1 = ReLU()
        # second hidden layer
        self.hidden2 = Linear(10, 8)
        kaiming_uniform_(self.hidden2.weight, nonlinearity='relu')
        self.act2 = ReLU()
        # third hidden layer and output
        self.hidden3 = Linear(8, 1)
        xavier_uniform_(self.hidden3.weight)
        self.act3 = Sigmoid()

    # forward propagate input
    def forward(self, X):
        # input to first hidden layer
        X = self.hidden1(X)
        X = self.act1(X)
        # second hidden layer
        X = self.hidden2(X)
        X = self.act2(X)
        # third hidden layer and output
        X = self.hidden3(X)
        X = self.act3(X)
        return X
     
iter_train_test_index = kf.split(X[:10000])
model=MLP(21)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
epochs = 5
model.train()
    
for epoch in range(epochs):
    train_indexes, test_indexes = next(iter_train_test_index)
    sum_loss = 0
    train_correct = 0
    for i, train_index in enumerate(train_indexes):
        single_line = df_numpy[train_index]
        train_input, train_label = torch.from_numpy(single_line[1:]), torch.from_numpy(single_line[:1])
        train_input, train_label = train_input.type(torch.DoubleTensor).to(torch.float32),train_label.type(torch.DoubleTensor).to(torch.float32)
        # clear the gradients
        optimizer.zero_grad()
        # compute the model output
        yhat = model(train_input)
        # calculate loss
        loss = criterion(yhat, train_label)
        sum_loss += loss
        # credit assignment
        loss.backward()
        # update model weights
        optimizer.step()
        
        y_predict = 0 if yhat.item() < 0.5 else 1 
#         print(y_predict, train_label)
        if y_predict == train_label:
            train_correct += 1
#     print("correct:", train_correct, "total:", train_indexes)
    print('[%d,%d] loss:%.03f' % (epoch + 1, epochs, sum_loss / len(train_indexes)))
    print('        correct:%.03f%%' % (100 * rectify_correct(train_correct / len(train_indexes))))
    print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime()))
model.eval()
total_test_num = len(test_indexes)
total_correct_num = 0
for i, test_index in enumerate(test_indexes):
    single_line = df_numpy[test_index]
    test_input, test_label = torch.from_numpy(single_line[1:]), torch.from_numpy(single_line[:1])
    # compute the model output
    yhat = model(train_input)
    # calculate loss
    
    y_predict = 0 if yhat.item() < 0.5 else 1
    if y_predict == test_label:
        total_correct_num += 1
    
print("correct:%.3f %%" % (100 * rectify_correct(total_correct_num / total_test_num)))

[1,5] loss:0.000
        correct:91.038%
2022-10-29 22:29:45
[2,5] loss:0.000
        correct:90.237%
2022-10-29 22:29:50
[3,5] loss:0.000
        correct:89.775%
2022-10-29 22:29:55
[4,5] loss:0.000
        correct:89.675%
2022-10-29 22:30:00
[5,5] loss:0.000
        correct:90.575%
2022-10-29 22:30:05
correct:89.000 %


# Numpy split data into train and test

In [16]:
from sklearn.naive_bayes import MultinomialNB


np.random.seed(1234)

np.random.shuffle(df_numpy)
num_test = int(len(df_numpy) / 5)
test, train = df_numpy[:num_test,:], df_numpy[num_test:,:]
(test.shape, train.shape)


((62091, 22), (248364, 22))

In [17]:
X_train, y_train = train[:, 1:], train[:, :1]
(X.shape, y_train.shape)

((319795, 21), (248364, 1))

# Naïve Bayes
We got the result: Naïve Bayes precision of test is 89.536%.

In [18]:
mnb = MultinomialNB() 
mnb.fit(X_train, y_train)
# number of samples in each class。
print(mnb.class_count_)
# The number of times each feature occurs under each category. 
print(mnb.feature_count_)
# The proportion (probability) of each feature under each category, i.e., P(x}y). Note that the value is the probability 
# After taking the logarithm of the result, if you need to see the original probability, you need to use the exponential reduction. 
print(np.exp(mnb.feature_log_prob_))

[227239.  21125.]
[[6.40367848e+06 9.00300000e+04 1.62040000e+04 5.92600000e+03
  7.34860000e+04 9.30720000e+04 2.64020000e+04 1.20583000e+05
  1.41839000e+06 2.55800000e+04 1.79159000e+05 6.08913000e+05
  1.61257300e+06 2.92790000e+04 6.40600000e+03 1.92910000e+04
  5.86300000e+03 1.64300000e+04 1.99980000e+04 7.68200000e+03
  1.73681000e+05]
 [6.19578840e+05 1.23880000e+04 8.81000000e+02 3.40300000e+03
  1.36130000e+04 8.75100000e+03 7.74800000e+03 8.65600000e+03
  1.95362000e+05 7.13300000e+03 1.34700000e+04 3.72660000e+04
  1.51059000e+05 3.72700000e+03 2.67500000e+03 3.89700000e+03
  1.95000000e+02 1.33000000e+03 1.07800000e+03 6.64000000e+02
  1.74410000e+04]]
[[5.84669551e-01 8.22002170e-03 1.47955095e-03 5.41147701e-04
  6.70951933e-03 8.49776277e-03 2.41065003e-03 1.10095756e-02
  1.29502114e-01 2.33559968e-03 1.63576889e-02 5.55951427e-02
  1.47231434e-01 2.67332625e-03 5.84972721e-04 1.76140061e-03
  5.35395667e-04 1.50018523e-03 1.82595122e-03 7.01474234e-04
  1.58575358e-0

  y = column_or_1d(y, warn=True)


In [19]:
X_test, y_test = test[:, 1:], test[:, :1]
y_estimate = mnb.predict(X_test)

print("Naive Bayes precision:")
print(len([y_test[i] for i in range(len(y_test)) if y_test[i] == y_estimate[i]])/len(y_test))

Naive Bayes precision:
0.8953632571548211


# Logistic Regression
We got the result: Logistic Regression precision of test is 91.471%.

In [20]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
y_estimate_linear = model.predict(X_test)
print("Linear regression precision:")
print(len([y_test[i] for i in range(len(y_test)) if y_test[i] == (0 if y_estimate_linear[i] < 0.5 else 1)])/len(y_test))

Linear regression precision:
0.9147058349841362


# Analyzing and comparing the obtained results
Through three models of testing, we get the following results:

1. Multi-Layer Perception precision of test is 89.00%.
2. Naïve Bayes precision of test is 89.536%.
3. Logistic Regression precision of test is 91.471%.

For the precision of the three models, the results are very similar. It can be seen that the linear regression model can be used to obtain higher precision than the other two models.

# References

Kaggle, Personal Key Indicators of Heart Disease. https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease?resource=download

Train/Test Split and Cross Validation in Python. https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6

Scikit-learn. https://scikit-learn.org/stable/modules/cross_validation.html

Numpy. https://numpy.org/doc/stable/reference/random/generated/numpy.random.seed.html

How to split/partition a dataset into training and test datasets for, e.g., cross validation? https://stackoverflow.com/questions/3674409/how-to-split-partition-a-dataset-into-training-and-test-datasets-for-e-g-cros

Sentiment Analysis. https://blog.csdn.net/jclian91/article/details/90316414

Convert Python dict into a dataframe. https://stackoverflow.com/questions/18837262/convert-python-dict-into-a-dataframe