# Cardiovascular Diseases Risk Prediction

#### Project by Roberta Solom and Kadri-Ketter Kont
##### Dataset: <a href="https://www.kaggle.com/datasets/alphiree/cardiovascular-diseases-risk-prediction-dataset?fbclid=IwAR0HByCc2BdZzRXrOsv2GPAqviBa4R6kwMPwol5TCrTnExaOaBmaQplR59E">Cardiovascular Diseases Risk Prediction</a>
Our primary objective is to create visualizations, such as plots, that illustrate how different lifestyle factors relate to the presence of cardiovascular diseases. We are also building a prediction model for cardiovascular disease risk, aiming for high accuracy.


## Our data

In [1]:
import pandas as pd

data = pd.read_csv("CVD_cleaned.csv")
df = pd.DataFrame(data)

data

Unnamed: 0,General_Health,Checkup,Exercise,Heart_Disease,Skin_Cancer,Other_Cancer,Depression,Diabetes,Arthritis,Sex,Age_Category,Height_(cm),Weight_(kg),BMI,Smoking_History,Alcohol_Consumption,Fruit_Consumption,Green_Vegetables_Consumption,FriedPotato_Consumption
0,Poor,Within the past 2 years,No,No,No,No,No,No,Yes,Female,70-74,150.0,32.66,14.54,Yes,0.0,30.0,16.0,12.0
1,Very Good,Within the past year,No,Yes,No,No,No,Yes,No,Female,70-74,165.0,77.11,28.29,No,0.0,30.0,0.0,4.0
2,Very Good,Within the past year,Yes,No,No,No,No,Yes,No,Female,60-64,163.0,88.45,33.47,No,4.0,12.0,3.0,16.0
3,Poor,Within the past year,Yes,Yes,No,No,No,Yes,No,Male,75-79,180.0,93.44,28.73,No,0.0,30.0,30.0,8.0
4,Good,Within the past year,No,No,No,No,No,No,No,Male,80+,191.0,88.45,24.37,Yes,0.0,8.0,4.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
308849,Very Good,Within the past year,Yes,No,No,No,No,No,No,Male,25-29,168.0,81.65,29.05,No,4.0,30.0,8.0,0.0
308850,Fair,Within the past 5 years,Yes,No,No,No,No,Yes,No,Male,65-69,180.0,69.85,21.48,No,8.0,15.0,60.0,4.0
308851,Very Good,5 or more years ago,Yes,No,No,No,Yes,"Yes, but female told only during pregnancy",No,Female,30-34,157.0,61.23,24.69,Yes,4.0,40.0,8.0,4.0
308852,Very Good,Within the past year,Yes,No,No,No,No,No,No,Male,65-69,183.0,79.38,23.73,No,3.0,30.0,12.0,0.0


### Data consists of 308854 rows (people) and 19 columns (features)

In [2]:
for column in data:
    print(data[column].value_counts())
    print()

General_Health
Very Good    110395
Good          95364
Excellent     55954
Fair          35810
Poor          11331
Name: count, dtype: int64

Checkup
Within the past year       239371
Within the past 2 years     37213
Within the past 5 years     17442
5 or more years ago         13421
Never                        1407
Name: count, dtype: int64

Exercise
Yes    239381
No      69473
Name: count, dtype: int64

Heart_Disease
No     283883
Yes     24971
Name: count, dtype: int64

Skin_Cancer
No     278860
Yes     29994
Name: count, dtype: int64

Other_Cancer
No     278976
Yes     29878
Name: count, dtype: int64

Depression
No     246953
Yes     61901
Name: count, dtype: int64

Diabetes
No                                            259141
Yes                                            40171
No, pre-diabetes or borderline diabetes         6896
Yes, but female told only during pregnancy      2646
Name: count, dtype: int64

Arthritis
No     207783
Yes    101071
Name: count, dtype: int64

Sex
Fe
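
The Heart_Disease counts above suggest that the target is imbalanced. As a quick worked check that uses only the counts printed above, the share of positive cases can be computed as follows (the variable names are ours, chosen for illustration):

```python
# Counts copied from the Heart_Disease value_counts() output above
no_count = 283883
yes_count = 24971

total = no_count + yes_count          # should match the number of rows
positive_share = yes_count / total    # fraction of people with heart disease

print(f"{total} rows, {positive_share:.1%} with heart disease")
```

With roughly 8% positives, a model that always predicts 'No' would already be right about 92% of the time, so accuracy alone may be a misleading measure for this target.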

## Analyzing the data

### Relationship between age category and general health

In [11]:
import pandas as pd
import plotnine as p9

# Keep the original column names (no spaces) so they can be used directly in aes()
df = data[['General_Health', 'Age_Category']]

# Stacked bar chart: number of people per age category, split by general health
(p9.ggplot(
    data = df, 
    mapping = p9.aes(x = 'Age_Category', fill = 'General_Health')
) + 
    p9.geom_bar() +
    p9.labs(x = 'Age category', y = 'Number of people', fill = 'General health') +
    p9.theme(axis_text_x=p9.element_text(angle=90, hjust=1))
)

## Preparing data for model training

All categorical columns are converted into binary features.<br>

Columns that contain only the values Yes and No are mapped to booleans: Exercise, Heart_Disease, Skin_Cancer, Other_Cancer, Depression, Arthritis, Smoking_History.

The remaining categorical columns are one-hot encoded with get_dummies: General_Health, Checkup, Diabetes, Age_Category, Sex.<br>

We also leave out Fruit_Consumption, Green_Vegetables_Consumption and FriedPotato_Consumption, because we are not certain of the unit in which consumption is measured.


In [3]:
# Removing consumption columns with .drop

data_without_consumption = data.drop(columns=['Fruit_Consumption', 'Green_Vegetables_Consumption', 'FriedPotato_Consumption'])

# Changing columns with values 'yes' and 'no' into columns with binary values
# Yes - True
# No - False

yes_and_no_columns = ['Exercise', 'Heart_Disease', 'Skin_Cancer', 'Other_Cancer', 'Depression', 'Arthritis', 'Smoking_History']

for column in yes_and_no_columns:
    data_without_consumption[column] = data_without_consumption[column].map({'Yes': True, 'No': False})

# Changing categorical columns into binary columns

data_dum = pd.get_dummies(data_without_consumption, columns = ['General_Health', 'Checkup', 'Diabetes', 'Age_Category','Sex'])

data_dum


Unnamed: 0,Exercise,Heart_Disease,Skin_Cancer,Other_Cancer,Depression,Arthritis,Height_(cm),Weight_(kg),BMI,Smoking_History,...,Age_Category_45-49,Age_Category_50-54,Age_Category_55-59,Age_Category_60-64,Age_Category_65-69,Age_Category_70-74,Age_Category_75-79,Age_Category_80+,Sex_Female,Sex_Male
0,False,False,False,False,False,True,150.0,32.66,14.54,True,...,False,False,False,False,False,True,False,False,True,False
1,False,True,False,False,False,False,165.0,77.11,28.29,False,...,False,False,False,False,False,True,False,False,True,False
2,True,False,False,False,False,False,163.0,88.45,33.47,False,...,False,False,False,True,False,False,False,False,True,False
3,True,True,False,False,False,False,180.0,93.44,28.73,False,...,False,False,False,False,False,False,True,False,False,True
4,False,False,False,False,False,False,191.0,88.45,24.37,True,...,False,False,False,False,False,False,False,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
308849,True,False,False,False,False,False,168.0,81.65,29.05,False,...,False,False,False,False,False,False,False,False,False,True
308850,True,False,False,False,False,False,180.0,69.85,21.48,False,...,False,False,False,False,True,False,False,False,False,True
308851,True,False,False,False,True,False,157.0,61.23,24.69,True,...,False,False,False,False,False,False,False,False,True,False
308852,True,False,False,False,False,False,183.0,79.38,23.73,False,...,False,False,False,False,True,False,False,False,False,True
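
To make the two encoding steps easier to follow, here is a minimal sketch on a toy DataFrame (the rows are invented for illustration and are not taken from CVD_cleaned.csv). It also shows one possible sanity check: `map` turns any value missing from the dictionary into NaN, so counting NaN values afterwards reveals unexpected categories.

```python
import pandas as pd

# Toy data for illustration only; not rows from CVD_cleaned.csv
toy = pd.DataFrame({
    "Exercise": ["Yes", "No", "Yes"],
    "Sex": ["Female", "Male", "Female"],
})

# Yes/No column -> booleans; values missing from the dict would become NaN
toy["Exercise"] = toy["Exercise"].map({"Yes": True, "No": False})
unmapped = int(toy["Exercise"].isna().sum())

# Categorical column -> one indicator column per category
toy_dum = pd.get_dummies(toy, columns=["Sex"])

print(unmapped)
print(list(toy_dum.columns))
```

The same NaN count could be run on each column in `yes_and_no_columns` after the mapping loop, assuming those columns are expected to contain only 'Yes' and 'No'.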


### Splitting data into train and test set

The target we predict is the 'Heart_Disease' column; all other columns are used as features.

In [6]:
from sklearn.model_selection import train_test_split

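# Features: every column except the target
X = data_dum.drop(columns='Heart_Disease')
# Target: whether the person has heart disease
y = data_dum['Heart_Disease']

# Hold out 25% of the rows; the fixed seed makes the split reproducible
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=5)
len(X_train)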

231640
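
The training-set size of 231640 follows from the split arithmetic. As far as we understand scikit-learn's behaviour, a fractional `test_size` is rounded up to a whole number of rows and the remaining rows go to training. The sketch below reproduces the number under that assumption (the variable names are illustrative):

```python
import math

n_samples = 308854   # rows in the dataset
test_size = 0.25     # fraction passed to train_test_split

# Assumption: the held-out count is rounded up, training gets the rest
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)
```

Here 0.25 × 308854 = 77213.5, which rounds up to 77214 held-out rows and leaves 231640 rows for training.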