# ***Can we predict whether a person is at risk of developing coronary heart disease within 10 years?***


## Summary:
 1. Initialization
 2. Show our dataset
 3. Data Exploration
 4. Modify and work on the dataset 

## Introduction
The World Health Organization has estimated that 12 million deaths occur worldwide every year due to heart diseases; in fact, cardiovascular diseases are the number **1** cause of death globally!
Early prediction of cardiovascular disease can prompt lifestyle changes in high-risk patients and thereby reduce complications.
This project intends to show the correlation between a person's current behaviours and their future risk of heart disease, using --models--

# 1. Initialization

In [None]:
import argparse
from pathlib import Path
import pandas as pd
import math
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
import numpy as np
from termcolor import colored
from sklearn.model_selection import train_test_split

First, we:
 - load our dataset
 - delete rows with a NaN value in 'heartRate', 'cigsPerDay', 'BMI', 'glucose', 'BPMeds', 'totChol'
 - replace NaN values in 'education' with 0

In [None]:
df = pd.read_csv('datasets/framingham.csv', header='infer', encoding='utf-8')

In [None]:
# Delete rows with a NaN value in 'heartRate', 'cigsPerDay', 'BMI', 'glucose', 'BPMeds', 'totChol'
columns_null = ['heartRate', 'cigsPerDay', 'BMI', 'glucose', 'BPMeds', 'totChol']
df = df.dropna(subset=columns_null)
    
    
#in 'education', substitute NaN values with 0
df['education'] = df['education'].fillna(0)
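
As a quick sanity check, the cleaning policy above can be tried on a small made-up table. This is only a sketch: the `toy` frame and its values are hypothetical and stand in for the real Framingham data, and only two of the six required columns are shown.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Framingham table (values are made up for illustration).
toy = pd.DataFrame({
    "heartRate": [80.0, np.nan, 75.0, 90.0],
    "glucose": [77.0, 85.0, np.nan, 70.0],
    "education": [1.0, 2.0, 4.0, np.nan],
})

# Count missing values per column before cleaning.
missing_before = toy.isna().sum()

# Same policy as above: drop rows with NaN in the required columns,
# then mark unknown education as 0.
required = ["heartRate", "glucose"]
cleaned = toy.dropna(subset=required).copy()
cleaned["education"] = cleaned["education"].fillna(0)

# Count missing values per column after cleaning.
missing_after = cleaned.isna().sum()

print(missing_before)
print(missing_after)
```

Comparing the two counts shows at a glance whether any NaN values survive the cleaning step.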

# 2. Show our dataset

The dataset comes from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts.

We have a dataset consisting of **3749 rows** and **16 columns**

In [None]:
df.head(10)

### Description of the attributes

#### Demographic:
- Male:
 - 1 if male 
 - 0 if female


- Age: age of the patient in range (32,70)

#### Behavioral
- Education:
 - 0 if unknown
 - 1 = Some High School
 - 2 = High School or GED
 - 3 = Some College or Vocational School
 - 4 = College and further
 
 
- Current Smoker: 
  - 1 if the patient is a current smoker 
  - 0 if the patient is NOT a current smoker 
  

- Cigs Per Day: the number of cigarettes that the person smoked on average in one day

#### Medical (history)
- BP Meds: 
  - 1 if the patient is on blood pressure medication
  - 0 if the patient is NOT on blood pressure medication
  
  
- Prevalent Stroke: 
  - 1 if the patient had a stroke previously
  - 0 if the patient has NOT had a stroke previously
  
  
- Prevalent Hyp:
  - 1 if the patient is hypertensive 
  - 0 if the patient is NOT hypertensive
  
  
- Diabetes: 
  - 1 if the patient has diabetes 
  - 0 if the patient does NOT have diabetes
  
  
#### Medical (current)
- Tot Chol: total cholesterol level


- Sys BP: systolic blood pressure 


- Dia BP: diastolic blood pressure


- BMI: Body Mass Index


- Heart Rate: heart rate


- Glucose: glucose level 


#### Predict variable (desired target)
- TenYearCHD: 10-year risk of coronary heart disease (CHD)
  - 1 if “Yes”
  - 0 if “No”

# 3. Data Exploration

For some variables our dataset is reasonably balanced, but for other variables it is NOT balanced.


#### 1<sup>st</sup> example: SMOKERS vs NON-smokers

For example, **current smokers** and **current non-smokers** each make up roughly half of the sample:

In [None]:
countNoSmoker = len(df[df.currentSmoker == 0])
countSmoker = len(df[df.currentSmoker == 1])
print(colored("Percentage of Current NON-Smoker Patients: {:.2f}%".format((countNoSmoker / (len(df.currentSmoker)) *100)), 'green', attrs=['bold']))
print(colored("Percentage of Current Smoker Patients: {:.2f}%".format((countSmoker / (len(df.currentSmoker))*100)), 'green', attrs=['bold']))

#### 2<sup>nd</sup> example: patients WITH diabetes vs withOUT diabetes


*But this is not always the case!*

*For example, the proportions of patients with and without ***diabetes*** are not balanced:*

In [None]:
diabetes0 = len(df[df.diabetes == 0])
diabetes1 = len(df[df.diabetes == 1])
print(colored("Percentage Patients WITH Diabetes: {:.2f}%".format((diabetes1 / (len(df.diabetes))*100)), 'green', attrs=['bold']))
print(colored("Percentage Patients withOUT Diabetes: {:.2f}%".format((diabetes0 / (len(df.diabetes))*100)), 'green', attrs=['bold']))



*As with diabetes, we can also see an imbalance in ***the risk of coronary heart disease (CHD) within 10 years***:*



#### 3<sup>rd</sup> example: patients WITH or withOUT risk of coronary heart disease CHD within 10 years 

In [None]:
target0 = len(df[df.TenYearCHD == 0])
target1 = len(df[df.TenYearCHD == 1])
print(colored("Percentage of Patients withOUT risk of coronary heart disease CHD within 10 years: {:.2f}%".format((target0 / (len(df.TenYearCHD))*100)),'green',attrs=['bold']))
print(colored("Percentage of Patients WITH risk of coronary heart disease CHD within 10 years: {:.2f}%".format((target1 / (len(df.TenYearCHD))*100)),'green',attrs=['bold']))
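
The three percentage checks above follow the same pattern, so they could also be written with a single helper. The sketch below is one possible way to do it: `class_percentages` is a hypothetical helper name (not part of the original notebook), and the labels are made-up toy data.

```python
import pandas as pd


def class_percentages(series: pd.Series) -> pd.Series:
    """Return the share of each distinct value, in percent, rounded to 2 decimals."""
    return (series.value_counts(normalize=True) * 100).round(2).sort_index()


# Toy labels (made up): 17 negatives and 3 positives.
toy_target = pd.Series([0] * 17 + [1] * 3, name="TenYearCHD")

shares = class_percentages(toy_target)
print(shares)
```

On the real data the same call could be applied to `df.currentSmoker`, `df.diabetes` or `df.TenYearCHD`.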

#### 4<sup>th</sup> example: frequency of a previous Stroke by Sex

Now let's look at another example of how balanced the dataset is: the ***frequency of a previous stroke by sex***

In [None]:
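# prevalentStroke is numeric (0/1), so comparing it with the string '1' never matches;
# build the full crosstab and keep only the column for patients with a previous stroke (1)
pd.crosstab(df.male, df.prevalentStroke)[[1]].plot(kind="bar", figsize=(19.2, 10.8), color=['#AA1111'])
plt.title('Frequency of a previous Stroke by Sex', fontsize=20)
plt.xlabel('Sex:\n0 = Female\n1 = Male')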
plt.xticks(rotation=0)
plt.legend(["had a stroke previously"])
plt.ylabel('Frequency')
plt.show()
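
Raw frequencies depend on how many women and men are in the sample, so it can help to look at the stroke rate within each sex as well. The sketch below uses the `normalize="index"` option of `pd.crosstab` on made-up toy data; the numbers are hypothetical and are not results from the Framingham dataset.

```python
import pandas as pd

# Toy data (made up): 6 women with 1 previous stroke, 4 men with 1 previous stroke.
toy = pd.DataFrame({
    "male": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "prevalentStroke": [0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
})

# normalize="index" divides each row by its total, giving rates within each sex.
rates = pd.crosstab(toy["male"], toy["prevalentStroke"], normalize="index")

# Column 1 holds the share of patients with a previous stroke; convert to percent.
stroke_rate = rates[1] * 100
print(stroke_rate.round(2))
```

Here both groups have one stroke each, yet the rates differ because the group sizes differ.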

# 4. Modify and work on the dataset 

We chose to divide our dataset into 2 parts:
  - a **training** dataset ('train_set.csv'), which is **80%** of our dataset
  - a **test** dataset ('test_set.csv'), which is **20%** of our dataset

First, we must normalize the columns whose values span too wide a range, so that all their values fall in the range **[0, 1]**:

In [None]:
columns_to_normalize = ['age', 'cigsPerDay', 'totChol', 'sysBP', 'diaBP', 'BMI', 'heartRate', 'glucose']
# MinMaxScaler rescales each column independently, so all columns can be transformed in one call
df[columns_to_normalize] = MinMaxScaler().fit_transform(df[columns_to_normalize])

In [None]:
df.head(8)
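
Min-max normalization maps each value x of a column to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. As a small illustration, the sketch below compares this formula with `MinMaxScaler` on a made-up column; the three ages are toy values chosen only to make the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy column (made-up ages).
toy = pd.DataFrame({"age": [32.0, 51.0, 70.0]})

# Manual min-max formula: (x - min) / (max - min).
manual = (toy["age"] - toy["age"].min()) / (toy["age"].max() - toy["age"].min())

# Same computation through scikit-learn.
scaled = MinMaxScaler().fit_transform(toy[["age"]]).ravel()

print(manual.tolist())
print(np.allclose(manual, scaled))
```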

Then, we create the ***training*** dataset and the ***test*** dataset using '***train_test_split***', a function in scikit-learn's `model_selection` module that splits data arrays into two subsets: one for training and one for testing.

In [None]:
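# `args` is never defined in this notebook; as an assumption, use 'TenYearCHD' as the label
# and every other column as a feature
label = 'TenYearCHD'
features = [column for column in df.columns if column != label]
x_train, x_test, y_train, y_test = train_test_split(df[features], df[label], test_size=0.2)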
pd.concat([x_train, y_train], axis=1, copy=False).to_csv(Path('datasets', 'train_set.csv'), index=False, encoding='utf-8')
pd.concat([x_test, y_test], axis=1, copy=False).to_csv(Path('datasets', 'test_set.csv'), index=False, encoding='utf-8')
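
Because the target is imbalanced, a purely random split may leave the two subsets with different shares of positive cases, and it also changes on every run. One possible variation, not used in the original split above, is to pass the `stratify` and `random_state` parameters of `train_test_split`. The sketch below illustrates this on made-up toy data; the seed value 42 is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (made up): 100 rows, 15% positives, mimicking an imbalanced target.
toy = pd.DataFrame({
    "age": range(100),
    "TenYearCHD": [1] * 15 + [0] * 85,
})

features = toy.drop(columns="TenYearCHD")
label = toy["TenYearCHD"]

# stratify keeps the positive rate equal in both subsets;
# random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    features, label, test_size=0.2, stratify=label, random_state=42
)

print(len(x_train), len(x_test))
print(y_train.mean(), y_test.mean())
```

With stratification, both subsets keep the same 15% share of positive cases as the full toy table.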

#### TRAINING dataset

Our new TRAINING dataset will consist of **915 rows**

In [None]:
df_train = pd.read_csv('datasets/train_set.csv', header='infer',encoding='utf-8')
df_train.head(7)

#### TEST dataset

Our new TEST dataset will consist of **229 rows**

In [None]:
df_test = pd.read_csv('datasets/test_set.csv', header='infer',encoding='utf-8')
df_test.head(7)

## References

- Article from the World Health Organization website:
https://www.who.int/en/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
- Dataset source:
https://www.kaggle.com/dileep070/heart-disease-prediction-using-logistic-regression