# Cardiovascular Disease Prediction Model

## Data Description
The dataset consists of 70,000 patient records, each with 11 features and a target variable.

Attribute information:

| # | Attribute | Feature type | Column | Type / values |
|---|---|---|---|---|
| 1 | Age | Objective Feature | age | int (days) |
| 2 | Height | Objective Feature | height | int (cm) |
| 3 | Weight | Objective Feature | weight | float (kg) |
| 4 | Gender | Objective Feature | gender | 1 - female; 2 - male |
| 5 | Systolic blood pressure | Examination Feature | ap_hi | int |
| 6 | Diastolic blood pressure | Examination Feature | ap_lo | int |
| 7 | Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| 8 | Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| 9 | Smoking | Subjective Feature | smoke | binary |
| 10 | Alcohol intake | Subjective Feature | alco | binary |
| 11 | Physical activity | Subjective Feature | active | binary |
| 12 | Presence or absence of cardiovascular disease | Target Variable | cardio | binary |
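
The attribute list gives age in days and records height and weight separately. If years or a body-mass index are more convenient for modelling, the conversions are short. The following is a minimal sketch, assuming a 365.25-day year and the standard BMI formula; the function names and the example values are hypothetical and are not taken from the dataset.

```python
DAYS_PER_YEAR = 365.25  # assumption: average year length including leap years


def age_in_years(age_days: int) -> float:
    """Convert an age in days to years, rounded to two decimals."""
    return round(age_days / DAYS_PER_YEAR, 2)


def body_mass_index(weight_kg: float, height_cm: float) -> float:
    """BMI = weight (kg) / height (m) squared, rounded to two decimals."""
    height_m = height_cm / 100
    return round(weight_kg / height_m ** 2, 2)


# Illustrative inputs only, not rows from the dataset
example_age = age_in_years(18_263)
example_bmi = body_mass_index(70.0, 175.0)
print(example_age, example_bmi)
```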

The patients' names and social security numbers were recently
removed from the database and replaced with dummy values.

In [16]:
import dice_ml
from dice_ml.utils import helpers # helper functions
import sklearn
from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

In [3]:
heart = pd.read_csv("cardio_min.csv")

heart = heart.drop(columns=['id'])
heart = heart[['age', 'gender', 'cholesterol', 'gluc', 'smoke', 'bmi', 'systolic_bp', 'diastolic_bp', 'cardio']]

heart.head()

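|   | age | gender | cholesterol | gluc | smoke | bmi | systolic_bp | diastolic_bp | cardio |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 50.39 | 2 | 1 | 1 | 0 | 21.97 | 110 | 80 | 0 |
| 1 | 51.66 | 1 | 3 | 1 | 0 | 23.51 | 130 | 70 | 1 |
| 2 | 48.41 | 1 | 1 | 1 | 0 | 28.44 | 110 | 70 | 0 |
| 3 | 51.55 | 2 | 1 | 1 | 0 | 20.05 | 120 | 80 | 0 |
| 4 | 58.35 | 1 | 1 | 1 | 0 | 25.95 | 130 | 70 | 0 |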


In [4]:
# Suppress Jupyter warnings, if required, for cleaner output
import warnings
warnings.simplefilter('ignore')

In [6]:
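# Separate the features from the target column
X = heart.drop("cardio", axis=1)
y = heart["cardio"]

# Hold out 20% of the rows for testing, preserving the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)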

In [7]:
numerical = ["age", "bmi", "systolic_bp", "diastolic_bp"]
categorical = X_train.columns.difference(numerical)

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

transformations = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical),
        ('cat', categorical_transformer, categorical)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', transformations),
                      ('classifier', RandomForestClassifier())])
model = clf.fit(X_train, y_train)
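
Before explaining a model it is usually worth checking how well it predicts held-out data. The sketch below fits the same kind of pipeline (scaler plus one-hot encoder feeding a random forest) and scores it on a test split. It runs on synthetic stand-in data so that it is self-contained; the `demo` frame, its labelling rule, and every name ending in `_demo` or starting with `demo_` are assumptions made for illustration and say nothing about the cardio dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (NOT the cardio dataset): the label depends only
# on systolic_bp, so a reasonable model should score well on it.
rng = np.random.default_rng(0)
n_rows = 400
demo = pd.DataFrame({
    "age": rng.uniform(30, 65, n_rows),
    "systolic_bp": rng.integers(90, 180, n_rows),
    "cholesterol": rng.integers(1, 4, n_rows),
})
demo["cardio"] = (demo["systolic_bp"] > 135).astype(int)

X_demo = demo.drop(columns="cardio")
y_demo = demo["cardio"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Scale numeric columns, one-hot encode the categorical one
preprocess = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age", "systolic_bp"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cholesterol"]),
])
demo_clf = Pipeline(steps=[
    ("preprocessor", preprocess),
    ("classifier", RandomForestClassifier(random_state=0)),
])
demo_clf.fit(X_tr, y_tr)

# Accuracy on rows the model never saw during fitting
test_accuracy = accuracy_score(y_te, demo_clf.predict(X_te))
print(f"held-out accuracy: {test_accuracy:.3f}")
```

With the notebook's own objects, the equivalent check would be `model.score(X_test, y_test)`, which returns the mean accuracy of the fitted pipeline on the test split.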

In [10]:
# DiCE data interface: the dataframe plus its continuous features and outcome column
d = dice_ml.Data(dataframe=heart,
                 continuous_features=['age', 'bmi', 'systolic_bp', 'diastolic_bp'],
                 outcome_name='cardio')

# Wrap the fitted scikit-learn pipeline for DiCE
m = dice_ml.Model(model=model, backend='sklearn')
# DiCE explainer that generates counterfactuals by random sampling
exp = dice_ml.Dice(d, m, method="random")

In [11]:
exp

<dice_ml.explainer_interfaces.dice_random.DiceRandom at 0x2809a3160>

In [15]:
# Generate counterfactual examples
idx = 12
query_instance = X_train.iloc[idx:idx + 1]  # one-row DataFrame, selected by position
print(query_instance)

dice_exp = exp.generate_counterfactuals(query_instance, total_CFs=10, desired_class="opposite")
# Visualize counterfactual explanation
dice_exp.visualize_as_dataframe(show_only_changes=True)

        age  gender  cholesterol  gluc  smoke    bmi  systolic_bp  \
6566  56.05       2            1     1      0  23.53          120   

      diastolic_bp  
6566            70  



Query instance (original outcome : 0)





|   | age | gender | cholesterol | gluc | smoke | bmi | systolic_bp | diastolic_bp | cardio |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 56.049999 | 2 | 1 | 1 | 0 | 23.530001 | 120 | 70 | 0 |



Diverse Counterfactual set (new outcome: 1)


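|   | age | gender | cholesterol | gluc | smoke | bmi | systolic_bp | diastolic_bp | cardio |
|---|---|---|---|---|---|---|---|---|---|
| 0 | - | - | - | - | - | 157.85 | - | - | - |
| 1 | - | 1.0 | - | - | - | - | 220.0 | - | 1.0 |
| 2 | - | - | 2.0 | - | - | - | 106.0 | - | 1.0 |
| 3 | - | - | - | 2.0 | - | 100.4 | - | - | 1.0 |
| 4 | - | - | - | - | - | 148.04 | 137.0 | - | 1.0 |
| 5 | - | - | - | - | - | - | 231.0 | 81.0 | 1.0 |
| 6 | 64.06 | - | - | - | - | - | - | 46.0 | 1.0 |
| 7 | - | - | - | - | - | 113.03 | 191.0 | - | 1.0 |
| 8 | - | - | - | - | 1.0 | - | 188.0 | - | 1.0 |
| 9 | - | 1.0 | - | - | - | 31.33 | - | - | 1.0 |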
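
Several of the counterfactuals above are hard to act on: a BMI of 157.85 or a systolic pressure of 231 is not a realistic target, and rows 1 and 9 change gender. DiCE's `generate_counterfactuals` accepts `features_to_vary` and `permitted_range` arguments that constrain the search up front. Independently of that, the returned rows can be screened afterwards. The following is a minimal sketch of such a post-hoc filter in plain Python. The bounds in `PLAUSIBLE_RANGE` and the `IMMUTABLE` set are hypothetical values chosen for illustration, not clinical guidance, and only a subset of the rows above is copied in.

```python
# Hypothetical plausibility bounds (illustrative assumptions, not clinical guidance)
PLAUSIBLE_RANGE = {
    "bmi": (15.0, 50.0),
    "systolic_bp": (80, 200),
    "diastolic_bp": (50, 120),
}
# Hypothetical set of features a patient cannot change
IMMUTABLE = {"gender"}

# Query instance from the output above
query = {
    "age": 56.05, "gender": 2, "cholesterol": 1, "gluc": 1,
    "smoke": 0, "bmi": 23.53, "systolic_bp": 120, "diastolic_bp": 70,
}

# Changed cells of counterfactual rows 0, 2, 3, 5, 8 and 9 from the table above
changes = [
    {"bmi": 157.85},
    {"cholesterol": 2, "systolic_bp": 106},
    {"gluc": 2, "bmi": 100.4},
    {"systolic_bp": 231, "diastolic_bp": 81},
    {"smoke": 1, "systolic_bp": 188},
    {"gender": 1, "bmi": 31.33},
]


def is_actionable(cf, original, bounds=PLAUSIBLE_RANGE, immutable=IMMUTABLE):
    """Keep a counterfactual only if it stays in range and leaves immutable features alone."""
    in_range = all(lo <= cf[name] <= hi for name, (lo, hi) in bounds.items())
    unchanged = all(cf[name] == original[name] for name in immutable)
    return in_range and unchanged


# Fill the "-" cells from the query instance, then filter
counterfactuals = [{**query, **delta} for delta in changes]
actionable = [cf for cf in counterfactuals if is_actionable(cf, query)]

for cf in actionable:
    print({k: v for k, v in cf.items() if v != query[k]})
```

Filtering after generation can leave fewer counterfactuals than requested, so constraining the search itself is generally preferable when the explainer supports it.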
