**Notebook Description**

In this notebook, I start by reading a CSV file containing heart disease data. I then provide descriptive statistics and print the first and last five rows of the dataset to get a better understanding of the data. Next, I prepare the data by selecting the prediction target, removing the target column from the features, splitting the data into training and testing sets, and encoding the categorical columns. Finally, I train a random forest classifier, save the model and the encoder, and evaluate the predictions.

First we read the CSV file.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the CSV file
file_path = '../../data/personal-key-indicators-of-heart-disease/2020/heart_2020_cleaned.csv'
data = pd.read_csv(file_path)

Then we use the `.describe()` method to get an overview of the numeric columns in our dataset.

In [2]:

# Get the descriptive statistics
description = data.describe()

# Print the description
print(description)

                 BMI  PhysicalHealth   MentalHealth      SleepTime
count  319795.000000    319795.00000  319795.000000  319795.000000
mean       28.325399         3.37171       3.898366       7.097075
std         6.356100         7.95085       7.955235       1.436007
min        12.020000         0.00000       0.000000       1.000000
25%        24.030000         0.00000       0.000000       6.000000
50%        27.340000         0.00000       0.000000       7.000000
75%        31.420000         2.00000       3.000000       8.000000
max        94.850000        30.00000      30.000000      24.000000
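
The summary above only covers the four numeric columns, because `.describe()` skips non-numeric columns by default. As a minimal sketch on a made-up toy frame (the values are hypothetical, only the column names mirror the dataset), passing `exclude=["number"]` summarizes the remaining columns instead:

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset
toy = pd.DataFrame({
    "BMI": [16.60, 20.34, 26.58, 24.21],
    "Smoking": ["Yes", "No", "Yes", "No"],
    "Sex": ["Female", "Female", "Male", "Female"],
})

# Default behaviour: only numeric columns are summarized
numeric_summary = toy.describe()

# Excluding numeric columns gives count, unique, top and freq for the rest
categorical_summary = toy.describe(exclude=["number"])

print(numeric_summary.columns.tolist())
print(categorical_summary)
```

For categorical columns, `unique` shows how many distinct values a column has, and `top` and `freq` show its most frequent value and how often it occurs.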


`.head()` returns the first five rows and `.tail()` returns the last five rows.
We print both to get a better understanding of our data.


In [3]:
print(data.head())

  HeartDisease    BMI Smoking AlcoholDrinking Stroke  PhysicalHealth  \
0           No  16.60     Yes              No     No             3.0   
1           No  20.34      No              No    Yes             0.0   
2           No  26.58     Yes              No     No            20.0   
3           No  24.21      No              No     No             0.0   
4           No  23.71      No              No     No            28.0   

   MentalHealth DiffWalking     Sex  AgeCategory   Race Diabetic  \
0          30.0          No  Female        55-59  White      Yes   
1           0.0          No  Female  80 or older  White       No   
2          30.0          No    Male        65-69  White      Yes   
3           0.0          No  Female        75-79  White       No   
4           0.0         Yes  Female        40-44  White       No   

  PhysicalActivity  GenHealth  SleepTime Asthma KidneyDisease SkinCancer  
0              Yes  Very good        5.0    Yes            No        Yes  
1       

In [4]:
print(data.tail())

       HeartDisease    BMI Smoking AlcoholDrinking Stroke  PhysicalHealth  \
319790          Yes  27.41     Yes              No     No             7.0   
319791           No  29.84     Yes              No     No             0.0   
319792           No  24.24      No              No     No             0.0   
319793           No  32.81      No              No     No             0.0   
319794           No  46.56      No              No     No             0.0   

        MentalHealth DiffWalking     Sex  AgeCategory      Race Diabetic  \
319790           0.0         Yes    Male        60-64  Hispanic      Yes   
319791           0.0          No    Male        35-39  Hispanic       No   
319792           0.0          No  Female        45-49  Hispanic       No   
319793           0.0          No  Female        25-29  Hispanic       No   
319794           0.0          No  Female  80 or older  Hispanic       No   

       PhysicalActivity  GenHealth  SleepTime Asthma KidneyDisease SkinCancer  


The next thing we want to do is select the prediction target. By convention, this is typically named **y**; this notebook uses an uppercase `Y`. We map 'Yes' to 1 and every other value to 0.

In [5]:
Y = data['HeartDisease'].apply(lambda x: 1 if x == 'Yes' else 0)

Now to the features. First we get rid of the 'HeartDisease' column, since we already chose that as our prediction target. The next step would be to choose features from the remaining columns; here we keep all of them.

In [6]:
# Prepare the data
X = data.drop(columns=['HeartDisease'])

Now we split the data into training and test sets. We use the common approach of keeping 80% of the rows for training and 20% for testing.

In [7]:

# Split the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
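
To illustrate what the 80/20 split does, here is a small sketch on made-up data. The `random_state` and `stratify` arguments are not used in this notebook; they are shown only as optional extras that make the split reproducible and keep the class proportions equal in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 50 rows, 10 of them positive (20%)
toy_X = pd.DataFrame({"BMI": [20.0 + i for i in range(50)]})
toy_y = pd.Series([0] * 40 + [1] * 10)

# test_size=0.2 keeps 80% of the rows for training.
# random_state fixes the shuffle; stratify keeps the share of positives
# the same in the training and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0, stratify=toy_y
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

Without `random_state`, every run produces a different split, so timings and accuracy values can vary slightly between runs.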

We need to convert all categorical variables to floats so that sklearn can process them. First we print all the categorical variables:

In [8]:
# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)


Categorical variables:
['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex', 'AgeCategory', 'Race', 'Diabetic', 'PhysicalActivity', 'GenHealth', 'Asthma', 'KidneyDisease', 'SkinCancer']


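After that we replace each category with a number using ordinal encoding. The encoder is fitted on the training data only and then applied unchanged to the test data.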

In [9]:
from sklearn.preprocessing import OrdinalEncoder

# Make copy to avoid changing original data 
ordinal_X_train = X_train.copy()
ordinal_X_test = X_test.copy()

# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
ordinal_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
ordinal_X_test[object_cols] = ordinal_encoder.transform(X_test[object_cols])
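
As a rough sketch of what the encoder does, the toy example below (values are made up, column names mirror the dataset) shows that `OrdinalEncoder` sorts the categories of each column and replaces every value with its position in that sorted list:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training rows
toy_train = pd.DataFrame({
    "Smoking": ["Yes", "No", "No"],
    "GenHealth": ["Poor", "Very good", "Fair"],
})

toy_encoder = OrdinalEncoder()
encoded_train = toy_encoder.fit_transform(toy_train)

# categories_ holds the sorted categories per column;
# the position of a category in its list is the code it gets
print(toy_encoder.categories_)
print(encoded_train)

# A new row is encoded with the mapping learned above (transform, not fit_transform)
toy_new = pd.DataFrame({"Smoking": ["Yes"], "GenHealth": ["Fair"]})
encoded_new = toy_encoder.transform(toy_new)
print(encoded_new)
```

Note that the default order is alphabetical, so the codes for a column like 'GenHealth' do not follow its natural ranking. If that matters, the `categories` parameter of `OrdinalEncoder` accepts an explicit order per column.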

In the context of a RandomForestClassifier, `n_estimators` sets the number of individual decision trees that form the 'forest'. The default is 100; here we use 500. Each tree contributes to the final prediction: scikit-learn averages the class probabilities predicted by the trees and picks the class with the highest average, which is a soft form of majority voting. Combining many trees helps improve the accuracy and robustness of the model.

In [10]:
# Initialize the model
model = RandomForestClassifier(n_estimators=500)
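
The way the trees are combined can be checked on a small synthetic problem. The sketch below is only an illustration: the data from `make_classification`, the 10 trees and the fixed seeds are assumptions and have nothing to do with the heart disease data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem, for illustration only
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_toy, y_toy)

# The fitted trees are available in estimators_
n_trees = len(forest.estimators_)

# Average the class probabilities predicted by the individual trees
tree_probas = np.stack([tree.predict_proba(X_toy) for tree in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)

# The forest's own probabilities should match that average
forest_proba = forest.predict_proba(X_toy)
matches = bool(np.allclose(mean_proba, forest_proba))

print(n_trees)
print(matches)
```

More trees usually make the averaged prediction more stable, at the price of longer training time and a larger saved model.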

Training the model takes significantly longer than all the other cells. For example, the run recorded below took about 169 seconds; the exact time depends on the machine and on `n_estimators`.

In [11]:
# Train the model
import time
startTime = time.time()
model.fit(ordinal_X_train, Y_train)
endTime = time.time()
print(f'Training took {endTime - startTime:.1f} seconds')

Training took 169.2 seconds


In [12]:
# Save the model for the backend server to use
import pickle

model_pkl_file = "predict_disease_model.pkl"  

with open(model_pkl_file, 'wb') as file:  
    pickle.dump(model, file)

In [13]:
# Save the encoder for the backend server to use

encoder_pkl_file = "predict_model_encoder.pkl"  

with open(encoder_pkl_file, 'wb') as file:  
    pickle.dump(ordinal_encoder, file)
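
On the backend side, both files have to be loaded again before a prediction can be made. The following is a minimal round-trip sketch, not the actual backend code: the toy data, the temporary directory and the file names `model.pkl` and `encoder.pkl` are placeholders chosen for this example.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data standing in for the real training set
toy_X = pd.DataFrame({"Smoking": ["Yes", "No", "Yes", "No"],
                      "BMI": [31.0, 22.5, 35.2, 24.1]})
toy_y = [1, 0, 1, 0]

toy_encoder = OrdinalEncoder()
encoded_X = toy_X.copy()
encoded_X["Smoking"] = toy_encoder.fit_transform(toy_X[["Smoking"]]).ravel()
toy_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(encoded_X, toy_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.pkl"      # placeholder file name
    encoder_path = Path(tmp_dir) / "encoder.pkl"  # placeholder file name

    # Notebook side: save both objects
    with open(model_path, "wb") as file:
        pickle.dump(toy_model, file)
    with open(encoder_path, "wb") as file:
        pickle.dump(toy_encoder, file)

    # Backend side: load them again
    with open(model_path, "rb") as file:
        loaded_model = pickle.load(file)
    with open(encoder_path, "rb") as file:
        loaded_encoder = pickle.load(file)

# Encode a new row with the loaded encoder, then predict with the loaded model
new_row = pd.DataFrame({"Smoking": ["Yes"], "BMI": [33.0]})
new_row["Smoking"] = loaded_encoder.transform(new_row[["Smoking"]]).ravel()
same = bool((loaded_model.predict(new_row) == toy_model.predict(new_row)).all())
print(same)
```

Pickle files can execute arbitrary code when loaded, so the backend should only load files it produced itself. Loading is also most reliable with the same scikit-learn version that was used for saving.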

In [14]:
# Make predictions
predictions = model.predict(ordinal_X_test)

In [15]:
# Make single prediction

person_data = { "AgeCategory": "80 or older", "AlcoholDrinking": "No", "Asthma": "Yes", "BMI": 47, "Diabetic": "Yes", "DiffWalking": "Yes", "GenHealth": "Poor", "KidneyDisease": "Yes", "MentalHealth": 20, "PhysicalActivity": "Yes", "PhysicalHealth": 20, "Race": "Asian", "Sex": "Male", "SkinCancer": "No", "SleepTime": 4, "Smoking": "Yes", "Stroke": "Yes" }

# Convert the dictionary into a DataFrame
person_dataframe = pd.DataFrame([person_data])

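# Use the categorical columns in the same order the encoder was fitted on
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

ordinal_person_dataframe = person_dataframe.copy()

# Encode categorical data with the already-fitted encoder.
# Use transform, not fit_transform: refitting on a single row would discard
# the category mapping learned from the training data
ordinal_person_dataframe[object_cols] = ordinal_encoder.transform(person_dataframe[object_cols])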

# Reorder the dataframe to match the feature names to those that were passed during fit.
# Feature names must be in the same order as they were in fit
ordinal_person_dataframe = ordinal_person_dataframe[[
    "BMI",
    "Smoking",
    "AlcoholDrinking",
    "Stroke",
    "PhysicalHealth",
    "MentalHealth",
    "DiffWalking",
    "Sex",
    "AgeCategory",
    "Race",
    "Diabetic",
    "PhysicalActivity",
    "GenHealth",
    "SleepTime",
    "Asthma",
    "KidneyDisease",
    "SkinCancer"
]]

# Utilizing model to predict HeartDisease based on person_data
HeartDisease_prediction = model.predict(ordinal_person_dataframe)

# Output the prediction
print("Predicted Heart Disease Status:", HeartDisease_prediction[0])

Predicted Heart Disease Status: 0


The `accuracy_score` function is from the `sklearn.metrics` module of the `scikit-learn` library. It calculates the accuracy of a set of predictions by comparing the predicted values with the actual values and returning the proportion of correctly predicted instances.

In [16]:
# Evaluate the model
accuracy = accuracy_score(Y_test, predictions)
print(f'Accuracy: The model correctly predicted {accuracy * 100:.2f}% of the instances.')

Accuracy: The model correctly predicted 90.60% of the instances.
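
To make the accuracy figure easier to interpret, here is a small worked example on made-up labels. It also computes the accuracy of a baseline that always predicts 0 and a confusion matrix. Whether such a baseline is strong on the heart disease data depends on how many rows are labelled 1, which can be checked with `Y.value_counts()`; the labels below are purely hypothetical.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 8 negatives and 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# 8 of the 10 predictions match the true labels
toy_accuracy = accuracy_score(y_true, y_pred)

# A baseline that always predicts 0 is also right for 8 of 10 rows
baseline_accuracy = accuracy_score(y_true, [0] * len(y_true))

# Rows are true classes, columns are predicted classes
toy_confusion = confusion_matrix(y_true, y_pred)

print(toy_accuracy, baseline_accuracy)
print(toy_confusion)
```

In this toy case the classifier and the trivial baseline reach the same accuracy, although only the classifier finds one of the two positives. When one class is much more common than the other, the confusion matrix shows such differences more clearly than accuracy alone.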
