# Cardiovascular Heart Disease

The goal of this probability and statistics project is to predict (using Python) whether an individual suffers from cardiovascular heart disease and to establish what the main risk factors for cardiovascular heart disease are.

Contents:
    
    1. Setup
    
    2. Data preprocessing and Analysis
       a. Outlier detection
       b. IQR-based filtering (correlation between different predictors)
       c. BMI filtering
    
    3. Machine learning models
       a. Creation of training/test splits
       b. Logistic Regression (Feature selection, Tuned Logistic Regression)
       c. Discriminant Analysis
       d. Decision Tree (Basic/Tuned)
       e. Boosting Classifiers
       f. KNN (Feature selection, Hyperparameter Tuning)
       g. Random Forest
       h. SVC
    
    4. Comparison of ML models and Conclusion

1. Setup
    A) Below are the libraries and functions needed to run our code and algorithms properly.
    B) Reading the raw data set. 
    

In [7]:
#A
#basic data handling and utility.
import numpy as np
import pandas as pd 
import math
import string
import warnings

#sklearn preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

#statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf

#sklearn logistic regression, decision tree and boosting
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

#sklearn split creation, cross-validation and grid search
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold

#sklearn metrics
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import ConfusionMatrixDisplay

#sklearn LDA
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

#sklearn random forest
from sklearn.ensemble import RandomForestClassifier

#sklearn KNN
from sklearn.neighbors import KNeighborsClassifier

#sklearn SVM
from sklearn import svm
from sklearn.svm import LinearSVC

# Seaborn
import seaborn as sns

warnings.filterwarnings("ignore")

In [9]:
#B
cardio = pd.read_csv("/Users/Acthach/Desktop/Pstat131/cardio_train.csv",delimiter=";")

cardio.head()

,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0


The dataset we are using today (cardio_train.csv) is a public dataset from kaggle.com consisting of 70,000 patient records with 11 features and one target variable, plus an id column.

The features from this dataset fall into three types:
    Objective: factual information
    Examination: results of a medical examination
    Subjective: information given by the patient

    Age | Objective Feature | age | int (days)

    Height | Objective Feature | height | int (cm) 

    Weight | Objective Feature | weight | float (kg) 

    Gender | Objective Feature | gender | categorical code 

    Systolic blood pressure | Examination Feature | ap_hi | int 

    Diastolic blood pressure | Examination Feature | ap_lo | int 

    Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal 

    Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal 

    Smoking | Subjective Feature | smoke | binary 

    Alcohol intake | Subjective Feature | alco | binary 

    Physical activity | Subjective Feature | active | binary 

    Presence or absence of cardiovascular disease | Target Variable | cardio | binary 
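
Because age is stored in days and the contents list a BMI filtering step, it can help to see both quantities in familiar units. The following is a minimal sketch using only the five rows shown by `cardio.head()`; the column names `age_years` and `bmi` are hypothetical, and the BMI formula (weight in kg divided by height in metres squared) is the standard definition, not necessarily the exact computation used later in this project.

```python
import pandas as pd

# The five rows printed by cardio.head(), restricted to the columns needed here.
sample = pd.DataFrame({
    "age": [18393, 20228, 18857, 17623, 17474],   # days
    "height": [168, 156, 165, 169, 156],          # cm
    "weight": [62.0, 85.0, 64.0, 82.0, 56.0],     # kg
})

# Hypothetical derived columns (names chosen for illustration only).
sample["age_years"] = sample["age"] / 365.25                       # days -> years
sample["bmi"] = sample["weight"] / (sample["height"] / 100) ** 2   # kg / m^2

print(sample.round(1))
```

Dividing by 365.25 is one common convention for converting days to years; dividing by 365 changes the result only slightly.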



In [10]:
cardio.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           70000 non-null  int64  
 1   age          70000 non-null  int64  
 2   gender       70000 non-null  int64  
 3   height       70000 non-null  int64  
 4   weight       70000 non-null  float64
 5   ap_hi        70000 non-null  int64  
 6   ap_lo        70000 non-null  int64  
 7   cholesterol  70000 non-null  int64  
 8   gluc         70000 non-null  int64  
 9   smoke        70000 non-null  int64  
 10  alco         70000 non-null  int64  
 11  active       70000 non-null  int64  
 12  cardio       70000 non-null  int64  
dtypes: float64(1), int64(12)
memory usage: 6.9 MB


The output above shows that every column has 70000 non-null entries and a numeric dtype (int64 or float64), so there are no missing values or data type problems, and we can continue with the exploratory analysis and preprocessing of our data.

2. Data preprocessing and continued Data Analysis

In [12]:
num_entries = cardio.shape[0]*cardio.shape[1]
print('Number of entries in the dataframe: ', num_entries)

num_missing_values = cardio.isna().sum().sum()
print('Missing values: ', num_missing_values, '\n')

# count rows that are exact copies of an earlier row (all columns compared)
cardio_dup = cardio.duplicated().sum()
if cardio_dup:
    print('Duplicate rows in dataset: {}'.format(cardio_dup))
else:
    print('Dataset contains no Duplicate Values')

Number of entries in the dataframe:  910000
Missing values:  0 

Dataset contains no Duplicate Values
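
One caveat about the duplicate check above: `DataFrame.duplicated()` compares entire rows, so if the `id` column is unique per record (an assumption; the output above does not confirm it), two records with identical measurements would never be flagged. A minimal sketch, using the first two rows of `cardio.head()` plus one hypothetical repeated record, shows how the result can change when `id` is excluded:

```python
import pandas as pd

# First two rows from cardio.head(), plus a HYPOTHETICAL third row that
# repeats the first record under a different id (for illustration only).
toy = pd.DataFrame({
    "id": [0, 1, 99],
    "age": [18393, 20228, 18393],
    "height": [168, 156, 168],
    "weight": [62.0, 85.0, 62.0],
    "cardio": [0, 1, 0],
})

# Comparing whole rows: the differing id hides the repeated record.
full_row_dups = toy.duplicated().sum()

# Comparing every column except id: the repeated record is found.
dups_without_id = toy.drop(columns="id").duplicated().sum()

print(full_row_dups, dups_without_id)
```

Whether repeated measurements should count as duplicates is a modelling decision; this sketch only shows how to look for them.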


In [13]:
cardio.describe()

,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
count,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0
mean,49972.4199,19468.865814,1.349571,164.359229,74.20569,128.817286,96.630414,1.366871,1.226457,0.088129,0.053771,0.803729,0.4997
std,28851.302323,2467.251667,0.476838,8.210126,14.395757,154.011419,188.47253,0.68025,0.57227,0.283484,0.225568,0.397179,0.500003
min,0.0,10798.0,1.0,55.0,10.0,-150.0,-70.0,1.0,1.0,0.0,0.0,0.0,0.0
25%,25006.75,17664.0,1.0,159.0,65.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,0.0
50%,50001.5,19703.0,1.0,165.0,72.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,0.0
75%,74889.25,21327.0,2.0,170.0,82.0,140.0,90.0,2.0,1.0,0.0,0.0,1.0,1.0
max,99999.0,23713.0,2.0,250.0,200.0,16020.0,11000.0,3.0,3.0,1.0,1.0,1.0,1.0


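The describe function (similar to the summary function in R) gives a per-column overview that exposes several problems in the data. Age is recorded in days: the mean of about 19,469 days is roughly 53 years. Blood pressure contains impossible values: ap_hi ranges from -150 to 16020 and ap_lo from -70 to 11000, although the quartiles of both lie between 80 and 140. Height (55 to 250 cm) and weight (10 to 200 kg) also have implausible extremes. The mean of cardio is 0.4997, so the two classes are almost perfectly balanced. These observations motivate the outlier detection listed in the contents.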
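
The `describe()` output gives quartiles of 120 and 140 for ap_hi and 80 and 90 for ap_lo, which is enough to sketch the IQR-based filtering listed in the contents. The example below assumes the conventional 1.5 × IQR fences; the multiplier and the helper names `iqr_bounds` and `iqr_filter` are illustrative and not necessarily what is used later in this project. The toy frame combines the five ap_hi values from `cardio.head()` with the minimum and maximum from `describe()`.

```python
import pandas as pd

def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_filter(df, column, k=1.5):
    """Keep only the rows whose value in `column` lies within the IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    low, high = iqr_bounds(q1, q3, k)
    return df[df[column].between(low, high)]

# Fences implied by the quartiles reported in the describe() table.
ap_hi_low, ap_hi_high = iqr_bounds(120.0, 140.0)
ap_lo_low, ap_lo_high = iqr_bounds(80.0, 90.0)
print("ap_hi fences:", ap_hi_low, ap_hi_high)
print("ap_lo fences:", ap_lo_low, ap_lo_high)

# Toy data: ap_hi from cardio.head() plus the min and max from describe().
toy = pd.DataFrame({"ap_hi": [110, 140, 130, 150, 100, -150, 16020]})
kept = iqr_filter(toy, "ap_hi")
print(kept["ap_hi"].tolist())
```

On the full dataset the quartiles would be computed from all 70000 rows, as `iqr_filter` does, so the fences would match those derived from the `describe()` table rather than the ones the seven toy values produce.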