## PROJECT

### Observation

`Age` - The values in the 'age' column are in days. They will be converted to years for easier interpretation.

`Gender` - The values in the 'gender' column are 1 and 2, where 1 indicates women and 2 indicates men. For easier computation, the column should be renamed from 'gender' to 'male' and the values recoded to 0 for women and 1 for men.

`Height` - Height is recorded in centimetres (cm). The dataset contains several implausible values that cannot be treated as human heights, so these outliers should be removed and only plausible heights retained. Patients with abnormally short stature (for example, achondroplasia or dwarfism) or very tall stature (associated with a risk of atrial fibrillation) may carry a cardiovascular risk from an early stage of life, which would affect the analysis of the parameters of interest.

`Weight` - The values in the 'weight' column are in kilograms (kg).

`BMI (Body-mass index)` - Two columns, 'bmi' and 'bmi_class', are needed to assess whether a patient's weight is healthy relative to their height. We use the thresholds provided by the Canadian Diabetes Association (https://www.diabetes.ca/resources/tools---resources/body-mass-index-(bmi)-calculator#:~:text=The%20formula%20is%20BMI%20%3D%20kg,most%20adults%2018%2D65%20years.)

`Blood pressure (ap_hi / ap_lo)` - Blood pressure can be assumed to be in mmHg. The minimum recorded value is negative, which is not physically possible; a normal reading is below 120/80 mmHg (systolic pressure below 120 mmHg and diastolic pressure below 80 mmHg). Blood pressure readings will be classified based on Medical News Today (https://www.medicalnewstoday.com/articles/327077).

`Cholesterol` - Cholesterol levels are coded as 1, 2 and 3, where 1 is normal, 2 is above normal and 3 is well above normal. For easier computation, they should be recoded to 0 for normal, 1 for above normal and 2 for well above normal.

`Glucose` - Glucose levels are coded as 1, 2 and 3, where 1 is normal, 2 is above normal and 3 is well above normal. For easier computation, they should be recoded to 0 for normal, 1 for above normal and 2 for well above normal.

`Smoking` - In the 'smoke' column, the value '0' means that the patient does not smoke and the value '1' means that the patient smokes.

`Alcohol consumption` - In the 'alco' column, the value '0' means that the patient does not drink alcohol and the value '1' means that the patient drinks alcohol. The column should be renamed to 'alcohol_consumption' for clarity.

`Physical activity` - In the 'active' column, the value '0' means that the patient is non-active and '1' means that the patient is active.

`Smoking, alcohol consumption and physical activity are subjective features reported by the patients themselves. These data may not be reliable, but they can give better insight into the patients' condition.`

`Cardiovascular disease occurrence` - This is the target variable, which records the presence or absence of cardiovascular disease. In the 'cardio' column, the value '0' means absence of cardiovascular disease and '1' means presence of cardiovascular disease. The 'cardio' column should be renamed to 'cardio_status' for clarity. About half of the subjects in this dataset have cardiovascular disease.
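The recodings described above can be sketched on a single record. This is a minimal pure-Python illustration, not the notebook's pandas code; the helper name `recode_record` is hypothetical, and the sample values are those of the first row of the dataset (id 0).

```python
def recode_record(raw):
    """Apply the recodings described in the observations to one raw record."""
    return {
        # age is stored in days; convert to whole years
        "age": round(raw["age"] / 365),
        # gender: 1 (women) -> 0, 2 (men) -> 1
        "gender": raw["gender"] - 1,
        # cholesterol and glucose: shift the 1..3 scale to 0..2
        "cholesterol": raw["cholesterol"] - 1,
        "glucose": raw["gluc"] - 1,
    }


# First row of the dataset (id 0)
sample = {"age": 18393, "gender": 2, "cholesterol": 1, "gluc": 1}
recoded = recode_record(sample)
print(recoded)
```

Shifting the codes by one keeps their order, so the ordinal meaning of cholesterol and glucose is preserved.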

In [1]:
#Import Libraries
import pandas as pd
import numpy as np

pd.set_option('display.max_rows', 800)
pd.set_option('display.max_columns', 500)

import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
sns.set()

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import classification_report 
from sklearn.metrics import accuracy_score 
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mutual_info_score
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import KFold
from tqdm.auto import tqdm
from sklearn.metrics import f1_score

from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_recall_curve

import re

import xgboost as xgb

In [2]:
df = pd.read_csv("C:/Users/Co/Desktop/ML Project DataKlub/cardio_train.csv", sep=';')

In [3]:
df.columns

Index(['id', 'age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo',
       'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio'],
      dtype='object')

In [4]:
df.head()

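| | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |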


In [5]:
df.dtypes

id               int64
age              int64
gender           int64
height           int64
weight         float64
ap_hi            int64
ap_lo            int64
cholesterol      int64
gluc             int64
smoke            int64
alco             int64
active           int64
cardio           int64
dtype: object

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           70000 non-null  int64  
 1   age          70000 non-null  int64  
 2   gender       70000 non-null  int64  
 3   height       70000 non-null  int64  
 4   weight       70000 non-null  float64
 5   ap_hi        70000 non-null  int64  
 6   ap_lo        70000 non-null  int64  
 7   cholesterol  70000 non-null  int64  
 8   gluc         70000 non-null  int64  
 9   smoke        70000 non-null  int64  
 10  alco         70000 non-null  int64  
 11  active       70000 non-null  int64  
 12  cardio       70000 non-null  int64  
dtypes: float64(1), int64(12)
memory usage: 6.9 MB


In [7]:
df.isnull().sum()

id             0
age            0
gender         0
height         0
weight         0
ap_hi          0
ap_lo          0
cholesterol    0
gluc           0
smoke          0
alco           0
active         0
cardio         0
dtype: int64

In [8]:
# df.describe()

In [9]:
df.columns

Index(['id', 'age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo',
       'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio'],
      dtype='object')

In [10]:
df.rename(columns={"gluc":"glucose", "alco":"alcohol_consumption", "cardio":"cardio_status"}, inplace=True)
df.head()

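| | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | glucose | smoke | alcohol_consumption | active | cardio_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |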


In [11]:
# Keep only readings with 50 < ap_hi < 370 and 20 < ap_lo < 360 (mmHg); everything else is discarded
df = df.loc[(df["ap_hi"] > 50) & (df["ap_hi"] < 370) & (df["ap_lo"] > 20) & (df["ap_lo"] < 360)]
df.reset_index(inplace=True)

In [12]:
df.shape

(68781, 14)

In [13]:
# function for blood pressure classification
def pressure_label(row):
    if row['ap_hi'] < 90 and row['ap_lo'] < 60:
        return 'Low Blood Pressure'
    elif row['ap_hi'] < 120 and row['ap_lo'] < 80:
        return "Normal Blood Pressure"
    elif row['ap_hi'] < 130 and row['ap_lo'] < 80:
        return "Elevated Blood Pressure"
    elif row['ap_hi'] < 140 and row['ap_lo'] < 90:
        return "High BP Stage 1"
    else:
        return "High BP Stage 2"
   

In [14]:
df['blood_pressure'] = df.apply(pressure_label, axis=1)

In [15]:
df['blood_pressure'].value_counts()

High BP Stage 1            32460
High BP Stage 2            23652
Normal Blood Pressure       9542
Elevated Blood Pressure     3113
Low Blood Pressure            14
Name: blood_pressure, dtype: int64
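To see how the thresholds behave on individual readings, the sketch below restates the same rules as a function of two numbers instead of a DataFrame row. The sample readings are made up for illustration.

```python
def pressure_label(ap_hi, ap_lo):
    """Same thresholds as the notebook's pressure_label, taking two numbers."""
    if ap_hi < 90 and ap_lo < 60:
        return "Low Blood Pressure"
    elif ap_hi < 120 and ap_lo < 80:
        return "Normal Blood Pressure"
    elif ap_hi < 130 and ap_lo < 80:
        return "Elevated Blood Pressure"
    elif ap_hi < 140 and ap_lo < 90:
        return "High BP Stage 1"
    return "High BP Stage 2"


# Illustrative readings (systolic, diastolic), not taken from the dataset
readings = [(85, 55), (110, 70), (125, 75), (120, 80), (150, 100)]
labels = [pressure_label(hi, lo) for hi, lo in readings]
for (hi, lo), label in zip(readings, labels):
    print(f"{hi}/{lo} -> {label}")
```

Because every comparison is strict, a reading of exactly 120/80 is labelled "High BP Stage 1", which may help explain why that class is the largest in the counts above.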

In [16]:
# df

In [17]:
## change days to years in age column
df["age"] = df["age"].apply(lambda x: round(x/365))

In [18]:
# According to the Canadian Diabetes Association:
# Body Mass Index is a simple calculation using a person's height and weight.
# The formula is BMI = kg/m^2, where kg is the person's weight in kilograms and m^2 is their height in metres squared.
# A BMI of 25.0 or more is overweight, while the healthy range is 18.5 to 24.9. BMI applies to most adults 18-65 years.

df["bmi"] = df["weight"] *10000 / ((df["height"])**2)
df['bmi_class'] = df['bmi'].apply(lambda x : x < 25)
df['bmi_class']

0         True
1        False
2         True
3        False
4         True
         ...  
68776    False
68777    False
68778    False
68779    False
68780     True
Name: bmi_class, Length: 68781, dtype: bool
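As a worked check of the formula, the sketch below computes the BMI for the first two rows of the dataset (62 kg at 168 cm, 85 kg at 156 cm). The function names are illustrative; the factor 10000 converts cm² to m², as in the cell above.

```python
def bmi(weight_kg, height_cm):
    """BMI = kg / m^2; multiplying by 10000 converts cm^2 to m^2."""
    return weight_kg * 10000 / height_cm ** 2


def is_below_overweight(weight_kg, height_cm, cutoff=25.0):
    """Mirror of bmi_class: True when the BMI is under the overweight cutoff."""
    return bmi(weight_kg, height_cm) < cutoff


# (weight in kg, height in cm) for the first two rows of the dataset
rows = [(62.0, 168), (85.0, 156)]
for weight, height in rows:
    print(round(bmi(weight, height), 2), is_below_overweight(weight, height))
```

Note that `bmi_class` is True for any BMI under 25, so underweight patients (below 18.5) fall into the same class as those in the healthy range.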

In [19]:
df.dtypes

index                    int64
id                       int64
age                      int64
gender                   int64
height                   int64
weight                 float64
ap_hi                    int64
ap_lo                    int64
cholesterol              int64
glucose                  int64
smoke                    int64
alcohol_consumption      int64
active                   int64
cardio_status            int64
blood_pressure          object
bmi                    float64
bmi_class                 bool
dtype: object

In [20]:
## drop the 'id' and 'index' columns; based on domain knowledge, they will not contribute to the model
df.drop(['id', 'index'], axis=1, inplace=True)

In [21]:
## convert weight and bmi_class to int64 (weight is truncated to whole kilograms)
df[['weight', 'bmi_class']] = df[['weight', 'bmi_class']].astype('int64')

In [22]:
# list the categorical (object) columns and the integer columns

cat_cols = list(df.dtypes[df.dtypes == 'object'].index)
print(f'Columns with categorical variables are: {cat_cols} \n')


int_cols = list(df.dtypes[df.dtypes == 'int64'].index)
print(f'Columns with integer values are: {int_cols}')

Columns with categorical variables are: ['blood_pressure'] 

Columns with integer values are: ['age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol', 'glucose', 'smoke', 'alcohol_consumption', 'active', 'cardio_status', 'bmi_class']


In [23]:
# int_cols = ['age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol', 'glucose', 'smoke', 'alcohol_consumption', 'active', 'bmi_class', 'cardio_status']
# df = df[int_cols]

In [24]:
## for better computation, replace 1 with 0, 2 with 1, 3 with 2
df[['gender', 'glucose', 'cholesterol']] = df[['gender', 'glucose', 'cholesterol']].apply(lambda x : x-1)

In [25]:
df.columns

Index(['age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol',
       'glucose', 'smoke', 'alcohol_consumption', 'active', 'cardio_status',
       'blood_pressure', 'bmi', 'bmi_class'],
      dtype='object')

In [26]:
select = ['gender', 'glucose', 'cholesterol', 'smoke', 'alcohol_consumption', 'active', 'bmi_class', 'blood_pressure', 'cardio_status']
df_select = df[select]

for col in df_select.columns:
    gra = df_select[col].value_counts()
    print(f'{col.upper()}: \n')
    print(gra)
    print('\n')
    

GENDER: 

0    44795
1    23986
Name: gender, dtype: int64


GLUCOSE: 

0    58472
2     5235
1     5074
Name: glucose, dtype: int64


CHOLESTEROL: 

0    51581
1     9314
2     7886
Name: cholesterol, dtype: int64


SMOKE: 

0    62728
1     6053
Name: smoke, dtype: int64


ALCOHOL_CONSUMPTION: 

0    65092
1     3689
Name: alcohol_consumption, dtype: int64


ACTIVE: 

1    55257
0    13524
Name: active, dtype: int64


BMI_CLASS: 

0    42805
1    25976
Name: bmi_class, dtype: int64


BLOOD_PRESSURE: 

High BP Stage 1            32460
High BP Stage 2            23652
Normal Blood Pressure       9542
Elevated Blood Pressure     3113
Low Blood Pressure            14
Name: blood_pressure, dtype: int64


CARDIO_STATUS: 

0    34741
1    34040
Name: cardio_status, dtype: int64




In [27]:
df.describe()

| | age | gender | height | weight | ap_hi | ap_lo | cholesterol | glucose | smoke | alcohol_consumption | active | cardio_status | bmi | bmi_class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 68781.0 | 68781.0 | 68781.0 | 68781.0 | 68781.0 | 68781.0 | 68781.0 | 68781.0 | 68781.0 | 68781.0 | 68781.0 | 68781.0 | 68781.0 | 68781.0 |
| mean | 53.32682 | 0.34873 | 164.361684 | 74.121662 | 126.615286 | 81.377561 | 0.364723 | 0.225993 | 0.088004 | 0.053634 | 0.803376 | 0.494904 | 27.523017 | 0.377662 |
| std | 6.767516 | 0.476572 | 8.185009 | 14.331393 | 16.76354 | 9.688359 | 0.67898 | 0.571968 | 0.283303 | 0.225296 | 0.397449 | 0.499978 | 6.050164 | 0.484806 |
| min | 30.0 | 0.0 | 55.0 | 11.0 | 60.0 | 30.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.471784 | 0.0 |
| 25% | 48.0 | 0.0 | 159.0 | 65.0 | 120.0 | 80.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 23.875115 | 0.0 |
| 50% | 54.0 | 0.0 | 165.0 | 72.0 | 120.0 | 80.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 26.346494 | 0.0 |
| 75% | 58.0 | 1.0 | 170.0 | 82.0 | 140.0 | 90.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 30.119376 | 1.0 |
| max | 65.0 | 1.0 | 250.0 | 200.0 | 240.0 | 190.0 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 298.666667 | 1.0 |


In [28]:
df = df[['age', 'gender', 'cholesterol','glucose', 'smoke', 'alcohol_consumption', 'active', 'blood_pressure', 'bmi_class', 'cardio_status']]

## Validation, Testing & Prediction

In [29]:
df.shape

(68781, 10)

In [30]:
df.columns

Index(['age', 'gender', 'cholesterol', 'glucose', 'smoke',
       'alcohol_consumption', 'active', 'blood_pressure', 'bmi_class',
       'cardio_status'],
      dtype='object')

In [31]:
# split data set to 60/20/20 for validation and testing
data_full_train, data_test = train_test_split(df, test_size=0.2, random_state=42)
data_train, data_val = train_test_split(data_full_train, test_size=0.25, random_state=42)

In [32]:
len(data_train), len(data_val), len(data_test)

(41268, 13756, 13757)
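The three sizes above follow from applying the two splits in sequence: 20% of all rows for testing, then 25% of the remainder (20% of the total) for validation. The sketch below reproduces the arithmetic, assuming the held-out part is rounded up to a whole row, which is consistent with the sizes shown. The helper `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_kept, n_held_out), assuming the held-out part is rounded up."""
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out


# 68781 rows remain after the blood pressure filter
n_full_train, n_test = split_sizes(68781, 0.2)
n_train, n_val = split_sizes(n_full_train, 0.25)
print(n_train, n_val, n_test)
```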

In [33]:
y_train = data_train.cardio_status.values
y_val = data_val.cardio_status.values
y_test = data_test.cardio_status.values

del data_train['cardio_status']
del data_val['cardio_status']
del data_test['cardio_status']

In [34]:
numerical_cols = ['age', 'gender', 'cholesterol','glucose', 'smoke', 'alcohol_consumption', 'active', 'bmi_class']

In [35]:
categorical_cols = ['blood_pressure']

#### Check AUC Score

In [36]:
# from sklearn.metrics import roc_auc_score

for cols in numerical_cols:
    rauc = roc_auc_score(y_train, data_train[cols])
    if rauc < 0.5: #if variable is negatively correlated
        rauc = roc_auc_score(y_train, -data_train[cols])
    print(f'{cols}, {rauc:.3}')

age, 0.637
gender, 0.504
cholesterol, 0.594
glucose, 0.535
smoke, 0.505
alcohol_consumption, 0.502
active, 0.515
bmi_class, 0.576
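For intuition about what the loop above measures, AUC can be read as the fraction of (positive, negative) pairs in which the positive example has the higher feature value. The sketch below is a small pure-Python version of that definition on toy data; it is not how scikit-learn computes `roc_auc_score` internally, and the function names are illustrative.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def feature_auc(y_true, feature):
    """Flip the sign of a negatively related feature, as in the loop above."""
    auc = pairwise_auc(y_true, feature)
    if auc < 0.5:
        auc = pairwise_auc(y_true, [-x for x in feature])
    return auc


y = [0, 0, 1, 1]
print(pairwise_auc(y, [0.1, 0.4, 0.35, 0.8]))
print(feature_auc(y, [0.8, 0.35, 0.4, 0.1]))
```

A value close to 0.5, as for `gender`, `smoke` and `alcohol_consumption` above, means the feature on its own barely separates the two classes.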


In [37]:
dv = DictVectorizer(sparse=False)

train_dict = data_train[categorical_cols + numerical_cols].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

val_dict = data_val[categorical_cols + numerical_cols].to_dict(orient='records')
X_val = dv.transform(val_dict)

In [38]:
X_train.shape

(41268, 13)
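The 13 columns are the 8 numeric features plus one indicator column for each of the 5 `blood_pressure` labels. The sketch below imitates that expansion in pure Python to show the idea; it is a simplified stand-in, not scikit-learn's `DictVectorizer`, and the two records are made up.

```python
def fit_feature_names(records):
    """Collect one column per numeric key and one per 'key=value' string pair."""
    names = set()
    for rec in records:
        for key, value in rec.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    return sorted(names)


def transform(records, feature_names):
    """Turn each record into a dense row aligned with feature_names."""
    index = {name: i for i, name in enumerate(feature_names)}
    rows = []
    for rec in records:
        row = [0.0] * len(feature_names)
        for key, value in rec.items():
            is_category = isinstance(value, str)
            name = f"{key}={value}" if is_category else key
            if name in index:  # categories unseen during fitting are ignored
                row[index[name]] = 1.0 if is_category else float(value)
        rows.append(row)
    return rows


# Illustrative records, not rows from the dataset
records = [
    {"blood_pressure": "High BP Stage 1", "age": 50, "smoke": 0},
    {"blood_pressure": "Normal Blood Pressure", "age": 62, "smoke": 1},
]
names = fit_feature_names(records)
X = transform(records, names)
print(names)
print(X)
```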

In [39]:
# models to test
classifiers = [
#     LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42),
    LogisticRegression(),
#     LinearRegression(),
#     Ridge(solver="sag", alpha=1.0, random_state=42),
    RandomForestClassifier(),
#     RandomForestClassifier(n_estimators=10, random_state=42, n_jobs=-1),
    DecisionTreeClassifier(),
    RidgeClassifier(),
#      DecisionTreeClassifier(max_depth=1, random_state=42),
#     RandomForestClassifier(n_estimators=10, random_state=42)
]
# get names of the objects in list (too lazy for c&p...)
names = [re.match(r"[^\(]+", name.__str__())[0] for name in classifiers]
print(f"Classifiers to test: {names}")

Classifiers to test: ['LogisticRegression', 'RandomForestClassifier', 'DecisionTreeClassifier', 'RidgeClassifier']


In [40]:
%%time
# fit every classifier and store its classification report on the validation data
results = {}
for name, clf in zip(names, classifiers):
    print(f"Training classifier: {name}")
    clf.fit(X_train, y_train)
    prediction = clf.predict(X_val)
    report = classification_report(y_val, prediction)
    results[name] = report

Training classifier: LogisticRegression
Training classifier: RandomForestClassifier
Training classifier: DecisionTreeClassifier
Training classifier: RidgeClassifier
Wall time: 5.01 s


In [41]:
# Prediction results
for k, v in results.items():
    print(f"Results for {k}:")
    print(f"{v}\n")

Results for LogisticRegression:
              precision    recall  f1-score   support

           0       0.70      0.79      0.74      6949
           1       0.75      0.66      0.70      6807

    accuracy                           0.72     13756
   macro avg       0.73      0.72      0.72     13756
weighted avg       0.73      0.72      0.72     13756


Results for RandomForestClassifier:
              precision    recall  f1-score   support

           0       0.70      0.73      0.72      6949
           1       0.72      0.69      0.70      6807

    accuracy                           0.71     13756
   macro avg       0.71      0.71      0.71     13756
weighted avg       0.71      0.71      0.71     13756


Results for DecisionTreeClassifier:
              precision    recall  f1-score   support

           0       0.69      0.75      0.72      6949
           1       0.72      0.65      0.69      6807

    accuracy                           0.70     13756
   macro avg       0.7

#### Logistic Regression

In [42]:
for i in range(100, 150, 10):
    print(f'For {i} \n')
    for n in range(1, 5, 1):
        lr = LogisticRegression(solver='liblinear', C= n, max_iter= i)

        lr.fit(X_train, y_train)

        lr_score = lr.score(X_val, y_val)

        print(f'C:{n} has an accuracy score: {lr_score} \n')

For 100 

C:1 has an accuracy score: 0.7239749927304449 

C:2 has an accuracy score: 0.7239022971794126 

C:3 has an accuracy score: 0.7239022971794126 

C:4 has an accuracy score: 0.7239022971794126 

For 110 

C:1 has an accuracy score: 0.7239749927304449 

C:2 has an accuracy score: 0.7239022971794126 

C:3 has an accuracy score: 0.7239022971794126 

C:4 has an accuracy score: 0.7239022971794126 

For 120 

C:1 has an accuracy score: 0.7239749927304449 

C:2 has an accuracy score: 0.7239022971794126 

C:3 has an accuracy score: 0.7239022971794126 

C:4 has an accuracy score: 0.7239022971794126 

For 130 

C:1 has an accuracy score: 0.7239749927304449 

C:2 has an accuracy score: 0.7239022971794126 

C:3 has an accuracy score: 0.7239022971794126 

C:4 has an accuracy score: 0.7239022971794126 

For 140 

C:1 has an accuracy score: 0.7239749927304449 

C:2 has an accuracy score: 0.7239022971794126 

C:3 has an accuracy score: 0.7239022971794126 

C:4 has an accuracy score: 0.723902297

##### Highest accuracy score of 0.7240 with `max_iter = 100 and C=1.0`

In [43]:
model = LogisticRegression(solver='liblinear', C= 1, max_iter= 100)
model.fit(X_train, y_train)
lr_pred = model.predict(X_val)
lr_class = classification_report(y_val, lr_pred)
print(lr_class)

              precision    recall  f1-score   support

           0       0.70      0.79      0.74      6949
           1       0.75      0.66      0.70      6807

    accuracy                           0.72     13756
   macro avg       0.73      0.72      0.72     13756
weighted avg       0.73      0.72      0.72     13756
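The f1-score column is the harmonic mean of precision and recall. As a quick consistency check, the sketch below recomputes it from the rounded precision and recall values printed in the report above; because the inputs are already rounded, the result is only approximate.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, copied from the report above
report = {0: (0.70, 0.79), 1: (0.75, 0.66)}
for label, (p, r) in report.items():
    print(label, round(f1_from(p, r), 2))
```

The lower recall for class 1 means the model misses about a third of the patients who do have cardiovascular disease, which the overall accuracy of 0.72 does not show on its own.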



#### Random Forest

In [44]:
for depth in [5,10,15]:
    print(f'For max-depth {depth}: \n')
    for n in range(150, 201, 10):
        model = RandomForestClassifier(n_estimators=n, random_state=42, n_jobs=-1, max_depth = depth)
        model.fit(X_train, y_train)
        rf_score = model.score(X_val, y_val)
        print(f'{n} estimator has an accuray_score: {rf_score}')
    print('--------------- \n')

For max-depth 5: 

150 estimator has an accuray_score: 0.7229572550159931
160 estimator has an accuray_score: 0.7229572550159931
170 estimator has an accuray_score: 0.7230299505670253
180 estimator has an accuray_score: 0.7233207327711544
190 estimator has an accuray_score: 0.7232480372201221
200 estimator has an accuray_score: 0.7231753416690898
--------------- 

For max-depth 10: 

150 estimator has an accuray_score: 0.7255015993021227
160 estimator has an accuray_score: 0.7255015993021227
170 estimator has an accuray_score: 0.7255015993021227
180 estimator has an accuray_score: 0.7253562082000582
190 estimator has an accuray_score: 0.7253562082000582
200 estimator has an accuray_score: 0.7255015993021227
--------------- 

For max-depth 15: 

150 estimator has an accuray_score: 0.714815353300378
160 estimator has an accuray_score: 0.7145972666472812
170 estimator has an accuray_score: 0.7147426577493458
180 estimator has an accuray_score: 0.7134341378307647
190 estimator has an accur

##### Accuracy score of 0.7233 with `max_depth = 5 and n_estimators=180` (the `max_depth = 10` runs above score higher, about 0.7255)

In [45]:
model = RandomForestClassifier(n_estimators=180, random_state=42, n_jobs=-1, max_depth = 5)
model.fit(X_train, y_train)
rf_pred = model.predict(X_val)
rf_class = classification_report(y_val, rf_pred)
print(rf_class)

              precision    recall  f1-score   support

           0       0.69      0.82      0.75      6949
           1       0.77      0.63      0.69      6807

    accuracy                           0.72     13756
   macro avg       0.73      0.72      0.72     13756
weighted avg       0.73      0.72      0.72     13756



#### Decision Tree

In [46]:
for depth in [5,10,15]:
    print(f'For max-depth {depth}: \n')
    for n in range(150, 201, 10):
        model = DecisionTreeClassifier(max_depth=depth, max_leaf_nodes=n, random_state=42)
        model.fit(X_train, y_train)
        rf_score = model.score(X_val, y_val)
        print(f'with max_leaf_nodes, {n} has an accuray_score: {rf_score}')
    print('--------------- \n')

For max-depth 5: 

with max_leaf_nodes, 150 has an accuray_score: 0.7244111660366386
with max_leaf_nodes, 160 has an accuray_score: 0.7244111660366386
with max_leaf_nodes, 170 has an accuray_score: 0.7244111660366386
with max_leaf_nodes, 180 has an accuray_score: 0.7244111660366386
with max_leaf_nodes, 190 has an accuray_score: 0.7244111660366386
with max_leaf_nodes, 200 has an accuray_score: 0.7244111660366386
--------------- 

For max-depth 10: 

with max_leaf_nodes, 150 has an accuray_score: 0.7249927304448968
with max_leaf_nodes, 160 has an accuray_score: 0.7249200348938645
with max_leaf_nodes, 170 has an accuray_score: 0.725065425995929
with max_leaf_nodes, 180 has an accuray_score: 0.7255742948531549
with max_leaf_nodes, 190 has an accuray_score: 0.7256469904041872
with max_leaf_nodes, 200 has an accuray_score: 0.7256469904041872
--------------- 

For max-depth 15: 

with max_leaf_nodes, 150 has an accuray_score: 0.7215033439953474
with max_leaf_nodes, 160 has an accuray_score: 0

##### Highest accuracy score of 0.7256 with `max_depth = 10 and max_leaf_nodes=200`

In [47]:
model = DecisionTreeClassifier(max_depth=10, max_leaf_nodes=200, random_state=42)
model.fit(X_train, y_train)
dt_pred = model.predict(X_val)
dt_class = classification_report(y_val, dt_pred)
print(dt_class)

              precision    recall  f1-score   support

           0       0.72      0.76      0.74      6949
           1       0.74      0.69      0.71      6807

    accuracy                           0.73     13756
   macro avg       0.73      0.73      0.73     13756
weighted avg       0.73      0.73      0.73     13756



#### Ridge Classifier

In [48]:
for i in [0.001, 0.01, 0.1, 1, 10]:
    # pass the loop value as alpha; the scores recorded below are from a run with alpha fixed at 0.01
    rcl = RidgeClassifier(solver="sag", alpha=i, random_state=42)

    rcl.fit(X_train, y_train)

    rcl_score = rcl.score(X_val, y_val)

    print(f'for alpha = {i}, accuracy score is {rcl_score}')

for alpha = 0.001, accuracy score is 0.7214306484443153
for alpha = 0.01, accuracy score is 0.7214306484443153
for alpha = 0.1, accuracy score is 0.7214306484443153
for alpha = 1, accuracy score is 0.7214306484443153
for alpha = 10, accuracy score is 0.7214306484443153


##### Highest accuracy score of 0.7214

In [49]:
model = RidgeClassifier(solver="sag", alpha=0.01, random_state=42)
model.fit(X_train, y_train)
rd_pred = model.predict(X_val)
rd_class = classification_report(y_val, rd_pred)
print(rd_class)

              precision    recall  f1-score   support

           0       0.69      0.80      0.74      6949
           1       0.76      0.64      0.69      6807

    accuracy                           0.72     13756
   macro avg       0.73      0.72      0.72     13756
weighted avg       0.73      0.72      0.72     13756



### Cross Validation

##### K-Fold cross-validation was run on the logistic regression model

In [50]:
def train(data_train, y_train, C=1.0):
    dicts = data_train[categorical_cols + numerical_cols].to_dict(orient='records')

    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(dicts)

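    # use the C argument; the cross-validation scores recorded in this notebook are from a run with C fixed at 1
    model = LogisticRegression(solver='liblinear', C=C, max_iter=100)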
    model.fit(X_train, y_train)
    
    return dv, model

In [51]:
def predict(data, dv, model):
    dicts = data[categorical_cols + numerical_cols].to_dict(orient='records')

    X = dv.transform(dicts)
    y_pred = model.predict_proba(X)[:, 1]

    return y_pred

In [52]:
n_splits = 5
pred_models = [lr_pred, rf_pred, dt_pred, rd_pred]

for C in tqdm([0.01, 0.1, 0.5, 10]):
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)

    scores = []

    for train_idx, val_idx in kfold.split(data_full_train):
        data_train = data_full_train.iloc[train_idx]
        data_val = data_full_train.iloc[val_idx]

        y_train = data_train.cardio_status.values
        y_val = data_val.cardio_status.values
        
        dv, model = train(data_train, y_train, C=C)
        y_pred = predict(data_val, dv, model)
        
        
        # rounding the predicted probabilities applies a 0.5 decision threshold
        f1 = f1_score(y_val, np.round(y_pred))
        scores.append(f1)
            

    print('C=%s %.3f +- %.3f' % (C, np.mean(scores), np.std(scores)))

C=0.01 0.697 +- 0.003
C=0.1 0.697 +- 0.003
C=0.5 0.697 +- 0.003
C=10 0.697 +- 0.003
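To make the fold mechanics concrete, the sketch below splits a small index range into shuffled train/validation folds in pure Python. It illustrates the idea behind K-fold splitting and is not scikit-learn's `KFold`; with the same seed it will not reproduce the same folds, and the function name is hypothetical.

```python
import random


def kfold_indices(n_samples, n_splits, seed=1):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # the first (n_samples % n_splits) folds receive one extra sample
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(kfold_indices(n_samples=10, n_splits=5))
for train_idx, val_idx in folds:
    print(len(train_idx), sorted(val_idx))
```

Each fold's score comes from a model that never saw that fold's rows, so the mean and standard deviation printed above summarise five independent validation runs.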



#### `The Logistic Regression model performed best on the mean F1 score of the cross-validated results`

### Prediction

In [53]:
dicts_full_train = data_full_train[numerical_cols + categorical_cols].to_dict(orient='records')

In [54]:
dv = DictVectorizer(sparse=False)
X_full_train = dv.fit_transform(dicts_full_train)

In [55]:
y_full_train = data_full_train.cardio_status.values

In [66]:
modellr = LogisticRegression(solver='liblinear', C= 1, max_iter= 100)
modellr.fit(X_full_train, y_full_train)

LogisticRegression(C=1, solver='liblinear')

In [67]:
dicts_test = data_test[numerical_cols + categorical_cols].to_dict(orient='records')
X_test = dv.transform(dicts_test)

In [68]:
lr_pred = modellr.predict(X_test)
lr_class = classification_report(y_test, lr_pred)
lr_score = accuracy_score(y_test, lr_pred)
print(lr_class)
print(f'\n Accuracy score of {lr_score} with Logistic Regression')

              precision    recall  f1-score   support

           0       0.70      0.79      0.74      6926
           1       0.76      0.65      0.70      6831

    accuracy                           0.72     13757
   macro avg       0.73      0.72      0.72     13757
weighted avg       0.73      0.72      0.72     13757


 Accuracy score of 0.7232681543941266 with Logistic Regression


In [70]:
import pickle

In [71]:
with open('modellr.pkl', 'wb') as f_out:
    pickle.dump((dv, modellr), f_out)

print('the model is saved to modellr.pkl')

the model is saved to modellr.pkl
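Loading the file back returns the same `(dv, modellr)` tuple that was dumped. The sketch below shows the round trip with small stand-in objects written to a temporary directory, so it runs without the trained model; the helper names `save_artifacts` and `load_artifacts` are hypothetical.

```python
import os
import pickle
import tempfile


def save_artifacts(path, vectorizer, model):
    """Store the vectorizer and model together, as in the cell above."""
    with open(path, "wb") as f_out:
        pickle.dump((vectorizer, model), f_out)


def load_artifacts(path):
    """Return the (vectorizer, model) tuple stored at path."""
    with open(path, "rb") as f_in:
        return pickle.load(f_in)


# Stand-in objects; in the notebook these would be dv and modellr
stand_in_dv = {"feature_names": ["age", "smoke"]}
stand_in_model = {"coef": [0.05, 0.2]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "modellr.pkl")
    save_artifacts(path, stand_in_dv, stand_in_model)
    loaded_dv, loaded_model = load_artifacts(path)

print(loaded_dv, loaded_model)
```

Saving the vectorizer alongside the model matters because new records must be encoded with the same column order that the model was trained on. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.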
