###  Predicting cardiovascular disease risk with machine learning models
#### Jhair Ante - National University of Colombia


According to data from the National Institutes of Health, cardiovascular disease is one of the leading causes of death and disability in the United States. 

In this notebook we use machine learning models to predict cardiovascular disease. Four models are trained (an artificial neural network, XGBoost, logistic regression, and a support vector machine) so that their performance can be compared.

### Risk Factors

#### What are risk factors?

Risk factors are characteristics, conditions, or behaviors that increase a person's likelihood of developing a disease. Risk factors for coronary artery disease (CAD), a common form of cardiovascular disease, include:

- Age
- Gender
- Family history
- Tobacco smoking
- High cholesterol
- High blood pressure
- Physical inactivity
- Obesity
- Diabetes


### Dataset

https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset

The dataset contains 70,000 records.


### Package installation


In [2]:
!pip install pandas
!pip install scikit-learn
!pip install xgboost
!pip install tensorflow==2.14.0



### Imports


In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
import tensorflow as tf
from sklearn.linear_model import LogisticRegression
from sklearn import svm

### Read dataset

In [2]:
cardio_df = pd.read_csv('./dataset/cardio_train.csv', sep=';')

In [3]:
cardio_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           70000 non-null  int64  
 1   age          70000 non-null  int64  
 2   gender       70000 non-null  int64  
 3   height       70000 non-null  int64  
 4   weight       70000 non-null  float64
 5   ap_hi        70000 non-null  int64  
 6   ap_lo        70000 non-null  int64  
 7   cholesterol  70000 non-null  int64  
 8   gluc         70000 non-null  int64  
 9   smoke        70000 non-null  int64  
 10  alco         70000 non-null  int64  
 11  active       70000 non-null  int64  
 12  cardio       70000 non-null  int64  
dtypes: float64(1), int64(12)
memory usage: 6.9 MB


In [4]:
print(cardio_df.describe())

                 id           age        gender        height        weight  \
count  70000.000000  70000.000000  70000.000000  70000.000000  70000.000000   
mean   49972.419900  19468.865814      1.349571    164.359229     74.205690   
std    28851.302323   2467.251667      0.476838      8.210126     14.395757   
min        0.000000  10798.000000      1.000000     55.000000     10.000000   
25%    25006.750000  17664.000000      1.000000    159.000000     65.000000   
50%    50001.500000  19703.000000      1.000000    165.000000     72.000000   
75%    74889.250000  21327.000000      2.000000    170.000000     82.000000   
max    99999.000000  23713.000000      2.000000    250.000000    200.000000   

              ap_hi         ap_lo   cholesterol          gluc         smoke  \
count  70000.000000  70000.000000  70000.000000  70000.000000  70000.000000   
mean     128.817286     96.630414      1.366871      1.226457      0.088129   
std      154.011419    188.472530      0.680250    

In [5]:
cardio_df['cardio'].value_counts()

cardio
0    35021
1    34979
Name: count, dtype: int64
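
The counts above can be turned into class proportions to quantify how balanced the target is. The following is a minimal sketch that only reuses the two numbers printed by `value_counts()`; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 35021, 1: 34979}

total = sum(class_counts.values())
proportions = {label: count / total for label, count in class_counts.items()}

# Ratio of the largest to the smallest class; 1.0 means perfectly balanced.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for label, share in proportions.items():
    print(f"class {label}: {share:.4f}")
print(f"imbalance ratio: {imbalance_ratio:.4f}")
```

Both classes account for roughly half of the records, so accuracy is a reasonable headline metric for this dataset.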

In [6]:
cardio_df.head(10)

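| | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |
| 5 | 8 | 21914 | 1 | 151 | 67.0 | 120 | 80 | 2 | 2 | 0 | 0 | 0 | 0 |
| 6 | 9 | 22113 | 1 | 157 | 93.0 | 130 | 80 | 3 | 1 | 0 | 0 | 1 | 0 |
| 7 | 12 | 22584 | 2 | 178 | 95.0 | 130 | 90 | 3 | 3 | 0 | 0 | 1 | 1 |
| 8 | 13 | 17668 | 1 | 158 | 71.0 | 110 | 70 | 1 | 1 | 0 | 0 | 1 | 0 |
| 9 | 14 | 19834 | 1 | 164 | 68.0 | 110 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |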


### Data preprocessing

The age is given in days, so we convert it to years. We also cast the weight to an integer and drop the `id` column, which is only a row identifier.

In [7]:
cardio_df['age'] = (cardio_df['age'] / 365).round().astype(int)
cardio_df['weight'] = cardio_df['weight'].astype(int)
cardio_df = cardio_df.drop(columns = 'id')
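
To make the age conversion concrete, here is a small sketch applied to the first three rows shown earlier. The `sample` DataFrame and the `age_years_leap` column are illustrative additions; dividing by 365.25 is an assumption about how one might account for leap years, not something the notebook does.

```python
import pandas as pd

# First three rows of the raw data shown earlier (age in days, weight as float).
sample = pd.DataFrame({
    "id": [0, 1, 2],
    "age": [18393, 20228, 18857],
    "weight": [62.0, 85.0, 64.0],
})

# Same transformation as the notebook cell: days -> rounded years.
sample["age_years"] = (sample["age"] / 365).round().astype(int)

# Assumption (not in the notebook): dividing by 365.25 accounts for leap years.
sample["age_years_leap"] = (sample["age"] / 365.25).round().astype(int)

print(sample[["age", "age_years", "age_years_leap"]])
```

For these rows both divisors give the same rounded ages. Note that `astype(int)` truncates rather than rounds, so a weight of 62.9 would become 62.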

In [8]:
cardio_df.head(10)

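| | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | 2 | 168 | 62 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 55 | 1 | 156 | 85 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 52 | 1 | 165 | 64 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 48 | 2 | 169 | 82 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 48 | 1 | 156 | 56 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |
| 5 | 60 | 1 | 151 | 67 | 120 | 80 | 2 | 2 | 0 | 0 | 0 | 0 |
| 6 | 61 | 1 | 157 | 93 | 130 | 80 | 3 | 1 | 0 | 0 | 1 | 0 |
| 7 | 62 | 2 | 178 | 95 | 130 | 90 | 3 | 3 | 0 | 0 | 1 | 1 |
| 8 | 48 | 1 | 158 | 71 | 110 | 70 | 1 | 1 | 0 | 0 | 1 | 0 |
| 9 | 54 | 1 | 164 | 68 | 110 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |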


### Assignment of inputs and outputs

In [9]:
X = cardio_df.drop(columns='cardio')
y = cardio_df['cardio']

In [10]:
X.shape, y.shape

((70000, 11), (70000,))

### Standardization of features

In [11]:
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [12]:
X

array([[-0.49350546,  1.36405487,  0.44345206, ..., -0.31087913,
        -0.23838436,  0.49416711],
       [ 0.24556599, -0.73310834, -1.01816804, ..., -0.31087913,
        -0.23838436,  0.49416711],
       [-0.19787688, -0.73310834,  0.07804703, ..., -0.31087913,
        -0.23838436, -2.02360695],
       ...,
       [-0.19787688,  1.36405487,  2.27047718, ..., -0.31087913,
         4.19490608, -2.02360695],
       [ 1.13245175, -0.73310834, -0.16555632, ..., -0.31087913,
        -0.23838436, -2.02360695],
       [ 0.39338029, -0.73310834,  0.68705541, ..., -0.31087913,
        -0.23838436,  0.49416711]])
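
As a rough illustration of what `StandardScaler` computes, the sketch below compares it with a manual z-score on a toy matrix. The values in `X_toy` are hypothetical and chosen only to show two columns on different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical values, two columns on different scales).
X_toy = np.array([[50.0, 168.0],
                  [55.0, 156.0],
                  [52.0, 165.0],
                  [48.0, 169.0]])

scaled_sklearn = StandardScaler().fit_transform(X_toy)

# Manual z-score; StandardScaler uses the population standard deviation (ddof=0).
scaled_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled_sklearn, scaled_manual))
```

One design point worth considering: the notebook fits the scaler on the full dataset before splitting, so the test rows contribute to the mean and standard deviation. A common alternative is to call `fit_transform` on the training split only and `transform` on the test split.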

### Dividing the dataset into training and test data
#### 20% of data for test and 80% for training


In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
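
The split above is random, so the reported metrics will vary slightly from run to run. A possible variation, sketched here on synthetic stand-in data, fixes the seed and stratifies on the label. The value `random_state=42` and the names `X_demo` and `y_demo` are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 rows, 11 features, balanced labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 11))
y_demo = np.array([0, 1] * 500)

# random_state=42 is an arbitrary choice made for this sketch.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_te.mean())
```

With `stratify`, the test split keeps the same class proportions as the full label vector.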

### Creation of an ANN
Artificial neural networks (ANNs) are computational models inspired by the functioning of the human brain. They are composed of units called artificial neurons, which are organized in layers and connected to each other by weights. Each neuron applies an activation function to a linear combination of its inputs, which allows the network to learn and represent complex patterns in the data.

In the context of binary classification, ANNs can be very effective. 
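
The description of a neuron above can be sketched in a few lines of NumPy. This is an illustration of the forward pass only, with a tiny hidden layer of 4 units and randomly drawn, hypothetical weights; it is not the Keras model used in this notebook.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, weights, bias, activation):
    """One dense layer: activation(x @ weights + bias)."""
    return activation(x @ weights + bias)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 11))  # one standardized sample with 11 features

# Hypothetical, randomly initialized parameters (training would adjust these).
w_hidden, b_hidden = rng.normal(size=(11, 4)), np.zeros(4)
w_out, b_out = rng.normal(size=(4, 1)), np.zeros(1)

hidden = dense(x, w_hidden, b_hidden, relu)
probability = dense(hidden, w_out, b_out, sigmoid)
print(hidden.shape, probability.shape)
```

The sigmoid output lies between 0 and 1, which is why it can be read as the predicted probability of the positive class in binary classification.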

#### Artificial Neural Network Architecture
- Sequential model
- Dense layer 1: 400 neurons, ReLU activation function, input dimension 11
- Dropout layer: rate 0.2
- Dense layer 2: 400 neurons, ReLU activation function
- Dense layer 3: 400 neurons, ReLU activation function
- Output layer: 1 neuron, sigmoid activation function
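
To get a sense of the size of this architecture, the number of trainable parameters can be worked out by hand. The sketch below assumes each dense layer has one weight per input plus one bias per unit, and that the dropout layer adds no parameters; `dense_params` is a helper defined only for this example.

```python
def dense_params(n_inputs, n_units):
    # Each unit has one weight per input plus one bias.
    return n_inputs * n_units + n_units

layer_sizes = [11, 400, 400, 400, 1]  # input, three hidden layers, output
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)      # [4800, 160400, 160400, 401]
print(total_params)   # 326001
```

With roughly 326,000 parameters for 11 input features, the network is large relative to the problem, which is one reason a dropout layer is useful here.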

In [16]:
classifier = tf.keras.models.Sequential()
classifier.add( tf.keras.layers.Dense(units=400, activation='relu', input_shape=(11, )))
classifier.add( tf.keras.layers.Dropout(0.2))

classifier.add(tf.keras.layers.Dense(units=400, activation='relu'))
classifier.add(tf.keras.layers.Dense(units=400, activation='relu'))

classifier.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))

In [17]:
classifier.compile(optimizer='Adam', loss='binary_crossentropy', metrics = ['accuracy'])

### ANN training

In [18]:
epochs_hist = classifier.fit(X_train, y_train, epochs = 50)

Epoch 1/50
...
Epoch 50/50
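
The notebook does not show an evaluation of the trained network. One way to produce a report comparable to those of the other models is to threshold the sigmoid outputs at 0.5. The sketch below uses small stand-in arrays; `probabilities_to_labels` and the `demo_*` names are hypothetical, and in the notebook the probabilities would come from `classifier.predict(X_test)`.

```python
import numpy as np
from sklearn.metrics import classification_report

def probabilities_to_labels(probabilities, threshold=0.5):
    """Turn sigmoid outputs of shape (n, 1) or (n,) into 0/1 labels."""
    return (np.asarray(probabilities).ravel() > threshold).astype(int)

# Stand-in values; in the notebook these would come from
# classifier.predict(X_test) and y_test respectively.
demo_probabilities = np.array([[0.10], [0.80], [0.65], [0.30]])
demo_y_true = np.array([0, 1, 1, 1])

demo_y_pred = probabilities_to_labels(demo_probabilities)
print(classification_report(demo_y_true, demo_y_pred))
```

The 0.5 threshold is a conventional default; a different value trades precision against recall.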


### Use of the XGBClassifier model
This model uses decision trees as base estimators and applies boosting: trees are added one at a time, each trained to correct the errors of the ensemble built so far.
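
The idea of boosting can be illustrated with a simplified loop in which each small tree is fitted to the residuals of the current prediction. This is a sketch of plain gradient boosting with squared error on a synthetic regression problem, not XGBoost's actual algorithm (which adds regularization and second-order gradient information); the data and the `learning_rate` value are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y_demo, y_demo.mean())  # start from a constant model
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(20):
    # Residuals are the negative gradient of the squared error (up to a factor).
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"MSE before boosting: {mse_history[0]:.4f}")
print(f"MSE after 20 trees:  {mse_history[-1]:.4f}")
```

The training error shrinks as trees are added, which is the gradual improvement described above.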

In [14]:
model = XGBClassifier()

### Training the model

In [15]:
model.fit(X_train, y_train)

### Prediction with test data

In [16]:
predict = model.predict(X_test)

### Report on the predictions

In [17]:
print(classification_report(y_test, predict))

              precision    recall  f1-score   support

           0       0.72      0.78      0.75      6999
           1       0.76      0.69      0.72      7001

    accuracy                           0.73     14000
   macro avg       0.74      0.73      0.73     14000
weighted avg       0.74      0.73      0.73     14000
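
The f1-score column in the report is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it from the rounded values printed above; because the inputs are already rounded, the results are approximate.

```python
def f1_score_from(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Rounded values copied from the XGBClassifier report above.
f1_class_0 = f1_score_from(0.72, 0.78)
f1_class_1 = f1_score_from(0.76, 0.69)

print(round(f1_class_0, 2), round(f1_class_1, 2))  # 0.75 0.72
```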



### Logistic Regression

In [18]:
model = LogisticRegression()

In [19]:
model.fit(X_train, y_train)

In [20]:
prediction_on_test_data = model.predict(X_test)

In [21]:
print(classification_report(y_test, prediction_on_test_data))

              precision    recall  f1-score   support

           0       0.71      0.78      0.74      6999
           1       0.76      0.68      0.72      7001

    accuracy                           0.73     14000
   macro avg       0.73      0.73      0.73     14000
weighted avg       0.73      0.73      0.73     14000



### Support Vector Machine (SVC)

In [22]:
model = svm.SVC(kernel='linear')

In [23]:
model.fit(X_train, y_train)

In [24]:
prediction_on_test_data = model.predict(X_test)

In [25]:
print(classification_report(y_test, prediction_on_test_data))

              precision    recall  f1-score   support

           0       0.69      0.82      0.75      6999
           1       0.78      0.64      0.70      7001

    accuracy                           0.73     14000
   macro avg       0.74      0.73      0.73     14000
weighted avg       0.74      0.73      0.73     14000
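
Since the same fit, predict, and report steps are repeated for each scikit-learn model, they can be collected in a loop to compare results side by side. The sketch below runs on synthetic data from `make_classification` rather than the cardiovascular dataset; the dictionary keys and `demo` names are illustrative, and an `XGBClassifier()` could be added to `candidates` in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data with 11 features, like the notebook's X.
X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(),
    "svm_linear": SVC(kernel="linear"),
}

accuracies = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    accuracies[name] = accuracy_score(y_te, estimator.predict(X_te))

for name, accuracy in accuracies.items():
    print(f"{name}: {accuracy:.3f}")
```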



### Conclusion
Machine learning models are valuable for prediction because they make it possible to identify risk early and act before problems develop. On the test set, XGBoost, logistic regression, and the linear SVM all reached an accuracy of about 0.73. Regarding the data, it is crucial that the datasets used are balanced. In this case, although the dataset is not especially large, its records are almost evenly distributed between the two classes. In contrast, other datasets contain many more records but are noticeably skewed toward a single label.