<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Heart-diseases-prediction" data-toc-modified-id="Heart-diseases-prediction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Heart diseases prediction</a></span><ul class="toc-item"><li><span><a href="#General-information-about-data" data-toc-modified-id="General-information-about-data-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>General information about data</a></span><ul class="toc-item"><li><span><a href="#Test-dataset-preview" data-toc-modified-id="Test-dataset-preview-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Test dataset preview</a></span></li><li><span><a href="#Train-dataset-preview" data-toc-modified-id="Train-dataset-preview-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Train dataset preview</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-1.1.3"><span class="toc-item-num">1.1.3&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></li><li><span><a href="#Data-preprocessing" data-toc-modified-id="Data-preprocessing-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Data preprocessing</a></span></li><li><span><a href="#Exploratory-data-analysis-and-Feature-engineering" data-toc-modified-id="Exploratory-data-analysis-and-Feature-engineering-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Exploratory data analysis and Feature engineering</a></span><ul class="toc-item"><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></li><li><span><a href="#Preprocessing-for-models" data-toc-modified-id="Preprocessing-for-models-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Preprocessing for models</a></span></li><li><span><a href="#Training" data-toc-modified-id="Training-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#LogRegression" 
data-toc-modified-id="LogRegression-1.5.1"><span class="toc-item-num">1.5.1&nbsp;&nbsp;</span>LogRegression</a></span></li><li><span><a href="#Catboost" data-toc-modified-id="Catboost-1.5.2"><span class="toc-item-num">1.5.2&nbsp;&nbsp;</span>Catboost</a></span></li><li><span><a href="#LightGBM" data-toc-modified-id="LightGBM-1.5.3"><span class="toc-item-num">1.5.3&nbsp;&nbsp;</span>LightGBM</a></span></li></ul></li></ul></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

# Heart diseases prediction

https://www.kaggle.com/competitions/yap8-heart-diseases-predictions/data

**Task:**  
Predict the risk of heart disease from patient lifestyle information.  
This is a **binary classification problem**.  
The evaluation metric for this competition is the **ROC AUC score**.
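
As a reminder of what the metric measures: ROC AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. The sketch below checks this on made-up toy labels and scores; `y_true`, `y_score` and `pairwise_auc` are illustrative names, not part of the competition data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (made up for illustration).
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70])


def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Matrix of score differences for every positive/negative pair.
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


auc_pairs = pairwise_auc(y_true, y_score)
auc_sklearn = roc_auc_score(y_true, y_score)
print(auc_pairs, auc_sklearn)
```

Because only the ranking of the scores matters, the models below are evaluated on `predict_proba` outputs rather than on hard class labels.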

In [1]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.metrics import roc_auc_score, make_scorer, accuracy_score, f1_score
from sklearn.utils import shuffle

from catboost import CatBoostClassifier, Pool, cv
from sklearn.linear_model import LogisticRegression
import lightgbm as lgb

import warnings
warnings.filterwarnings("ignore")
pd.options.mode.chained_assignment = None

## General information about data

In [2]:
df = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

### Test dataset preview

In [3]:
test.shape

(30000, 12)

In [4]:
test.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active
0,5,18888,1,154,85.0,130,80,1,1,0,0,1
1,6,19042,2,170,69.0,130,90,1,1,0,0,1
2,7,20432,1,160,70.0,120,75,1,1,0,0,0
3,10,18133,2,185,94.0,130,90,1,1,0,0,1
4,11,16093,2,170,76.0,120,80,1,1,0,0,1


In [5]:
test.isna().sum().sum()

0

In [6]:
test.duplicated().sum()

0

In [7]:
cat_cols = test.nunique()[test.nunique() < 4].keys().tolist()
cat_cols

['gender', 'cholesterol', 'gluc', 'smoke', 'alco', 'active']

In [8]:
num_cols = [i for i in test.columns if i not in cat_cols]
num_cols

['id', 'age', 'height', 'weight', 'ap_hi', 'ap_lo']

In [9]:
test.id.nunique()

30000

So, we need to predict the **cardio** values for the test dataset.

### Train dataset preview

In [10]:
df.shape

(70000, 13)

In [11]:
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0


In [12]:
df.cardio.value_counts()

0    35021
1    34979
Name: cardio, dtype: int64

The two classes are balanced.

In [13]:
df.isna().sum().sum()

0

In [14]:
df.duplicated().sum()

0

In [15]:
df.describe()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
count,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0
mean,49972.4199,19468.865814,1.349571,164.359229,74.20569,128.817286,96.630414,1.366871,1.226457,0.088129,0.053771,0.803729,0.4997
std,28851.302323,2467.251667,0.476838,8.210126,14.395757,154.011419,188.47253,0.68025,0.57227,0.283484,0.225568,0.397179,0.500003
min,0.0,10798.0,1.0,55.0,10.0,-150.0,-70.0,1.0,1.0,0.0,0.0,0.0,0.0
25%,25006.75,17664.0,1.0,159.0,65.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,0.0
50%,50001.5,19703.0,1.0,165.0,72.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,0.0
75%,74889.25,21327.0,2.0,170.0,82.0,140.0,90.0,2.0,1.0,0.0,0.0,1.0,1.0
max,99999.0,23713.0,2.0,250.0,200.0,16020.0,11000.0,3.0,3.0,1.0,1.0,1.0,1.0


In [16]:
df.nunique()

id             70000
age             8076
gender             2
height           109
weight           287
ap_hi            153
ap_lo            157
cholesterol        3
gluc               3
smoke              2
alco               2
active             2
cardio             2
dtype: int64

In [17]:
df.dtypes

id               int64
age              int64
gender           int64
height           int64
weight         float64
ap_hi            int64
ap_lo            int64
cholesterol      int64
gluc             int64
smoke            int64
alco             int64
active           int64
cardio           int64
dtype: object

In [18]:
for i in cat_cols:
    print(i, df[i].value_counts(), sep='\n', end='\n\n')

gender
1    45530
2    24470
Name: gender, dtype: int64

cholesterol
1    52385
2     9549
3     8066
Name: cholesterol, dtype: int64

gluc
1    59479
3     5331
2     5190
Name: gluc, dtype: int64

smoke
0    63831
1     6169
Name: smoke, dtype: int64

alco
0    66236
1     3764
Name: alco, dtype: int64

active
1    56261
0    13739
Name: active, dtype: int64



### Conclusion

- The **target classes are balanced** (35021 vs. 34979), which is a pleasant surprise.
- There are 6 categorical columns: 'gender', 'cholesterol', 'gluc', 'smoke', 'alco', 'active'. The other 5 feature columns ('age', 'height', 'weight', 'ap_hi', 'ap_lo') are numerical.
- The column types are integer and float.
- **Neither dataset has nulls or duplicates**, and the ids are unique, which is good news.
- The 'age' column is given in days, not years, which is inconvenient for visual analysis. We can convert days to whole years.
- The max height (250 cm) and min weight (10 kg) are presumably incorrect: weight should be checked against age, and height should be checked for outliers.
- The max values of ap_hi and ap_lo are incorrect (far too high for blood pressure).
- The columns ap_hi and ap_lo contain negative values, which are incorrect (blood pressure can't be negative).


## Data preprocessing

First, we'll convert days to years (column `age`).

In [19]:
df_copy = df.copy()
df_copy.set_index('id', inplace=True)
df_copy.head()

Unnamed: 0_level_0,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,17474,1,156,56.0,100,60,1,1,0,0,0,0


In [20]:
test_copy = test.copy()
test_copy.set_index('id', inplace=True)
test_copy.head()

Unnamed: 0_level_0,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
5,18888,1,154,85.0,130,80,1,1,0,0,1
6,19042,2,170,69.0,130,90,1,1,0,0,1
7,20432,1,160,70.0,120,75,1,1,0,0,0
10,18133,2,185,94.0,130,90,1,1,0,0,1
11,16093,2,170,76.0,120,80,1,1,0,0,1


In [21]:
df_copy['years'] = df_copy['age']//365
df_copy.years.sort_values().value_counts()

55    3927
53    3868
57    3686
56    3607
54    3605
59    3576
49    3417
58    3409
51    3368
52    3279
50    3216
60    3200
63    2736
61    2728
62    2199
47    2197
64    2187
45    2087
43    2031
41    1903
48    1811
39    1780
46    1625
40    1622
44    1514
42    1418
29       3
30       1
Name: years, dtype: int64

There are no children in the data: the derived `years` column spans 29 to 64, so all patients are adults.  
So, we can delete the row with the min weight of 10 kg and other rows with similar mistakes:

In [22]:
df_copy.query('weight < 35 and height > 145')

Unnamed: 0_level_0,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,years
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
24167,17272,2,170,31.0,150,90,2,2,0,0,1,1,47
26503,18140,1,160,30.0,120,80,1,1,0,0,1,1,49
31439,15359,1,146,32.0,100,70,1,1,0,0,0,0,42
38312,23284,1,157,23.0,110,80,1,1,0,0,1,0,63
42156,20408,2,177,22.0,120,80,1,1,1,1,1,0,55
47872,21081,1,153,34.0,110,70,3,3,0,0,1,1,57
48318,21582,2,178,11.0,130,90,1,1,0,0,1,1,59
50443,19802,1,146,32.0,130,80,1,2,0,0,0,0,54
54851,21809,1,154,32.0,110,60,1,1,0,0,1,0,59
79686,23370,1,152,34.0,140,90,1,1,0,0,1,1,64


It is hard to believe that adults with weight < 35 kg and height > 145 cm are real. So, we can delete these few rows:

In [23]:
df_copy = df_copy.drop(df_copy.query('weight < 35 and height > 145').index)
df_copy.shape

(69987, 13)

In [24]:
df_copy.query('weight < 35 and height < 145')

Unnamed: 0_level_0,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,years
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
5306,15400,1,120,30.0,110,70,1,1,0,0,1,0,42
21040,22663,1,143,34.0,100,70,1,1,0,0,1,0,62
48976,14664,2,128,28.0,120,80,1,1,0,0,1,0,40
59853,21334,1,143,30.0,103,61,2,1,0,0,1,0,58
68667,19255,1,143,33.0,100,60,1,1,0,0,1,0,52
73914,19817,2,139,34.0,120,70,1,1,0,0,1,0,54


The people with weight < 35 and height < 145 are short, so their height and weight values may be real. For now, we keep these rows.

Let's find rows with negative values in ap_hi and ap_lo:

In [25]:
df_copy.query('ap_hi < 0 or ap_lo < 0')

Unnamed: 0_level_0,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,years
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
6525,15281,1,165,78.0,-100,80,2,1,0,0,1,0,41
22881,22108,2,161,90.0,-115,70,1,1,0,0,1,0,60
29313,15581,1,153,54.0,-100,70,1,1,0,0,1,0,42
34295,18301,1,162,74.0,-140,90,1,1,0,0,1,1,50
36025,14711,2,168,50.0,-120,80,2,1,0,0,0,1,40
50055,23325,2,168,59.0,-150,80,1,1,0,0,1,1,63
66571,23646,2,160,59.0,-120,80,1,1,0,0,0,0,64
85816,22571,1,167,74.0,15,-70,1,1,0,0,1,1,61


We can fix them by taking the absolute value:

In [26]:
df_copy = abs(df_copy)
df_copy.describe()

Unnamed: 0,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,years
count,69987.0,69987.0,69987.0,69987.0,69987.0,69987.0,69987.0,69987.0,69987.0,69987.0,69987.0,69987.0,69987.0
mean,19468.74371,1.349579,164.359881,74.214601,128.842071,96.621215,1.366854,1.226428,0.088131,0.053767,0.803721,0.499679,52.840342
std,2467.268622,0.47684,8.209528,14.38183,154.00526,188.450657,0.680257,0.572256,0.283487,0.225559,0.397185,0.500003,6.766834
min,10798.0,1.0,55.0,28.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,29.0
25%,17664.0,1.0,159.0,65.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,0.0,48.0
50%,19703.0,1.0,165.0,72.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,0.0,53.0
75%,21327.0,2.0,170.0,82.0,140.0,90.0,2.0,1.0,0.0,0.0,1.0,1.0,58.0
max,23713.0,2.0,250.0,200.0,16020.0,11000.0,3.0,3.0,1.0,1.0,1.0,1.0,64.0
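
Note that `abs(df_copy)` takes the absolute value of every column, not only the pressure ones. A narrower variant, sketched here on a made-up frame, touches just the two pressure columns; `toy` and `pressure_cols` are illustrative names.

```python
import pandas as pd

# Made-up rows with the same kind of sign errors as in the data.
toy = pd.DataFrame({
    "ap_hi": [-100, 120, 15],
    "ap_lo": [80, 80, -70],
    "weight": [78.0, 62.0, 74.0],
})

# Apply abs() only to the blood pressure columns.
pressure_cols = ["ap_hi", "ap_lo"]
toy[pressure_cols] = toy[pressure_cols].abs()
print(toy)
```

On this dataset both approaches give the same result, since no other column contains negative values.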


The max value of ap_lo is far too large: 11000. For ap_hi it is 16020. These values aren't realistic.  

A plausible range for ap_hi is 50 to 250.  
A plausible range for ap_lo is 40 to 180.  
(Values at either end of these ranges are already very bad blood pressure readings.)  

In [27]:

print('ap_lo > 180:', df_copy.query('ap_lo > 180').ap_lo.count(), 
      'ap_lo >= 1000:', df_copy.query('ap_lo > 999').ap_lo.count(),
      'ap_lo < 40:', df_copy.query('ap_lo < 40').ap_lo.count(),
      sep='\n', end='\n\n'
     )
print('ap_hi > 250:', df_copy.query('ap_hi > 250').ap_hi.count(), 
      'ap_hi >= 1000:', df_copy.query('ap_hi > 999').ap_hi.count(),
      'ap_hi < 50:', df_copy.query('ap_hi < 50').ap_hi.count(),
      sep='\n'
     )

ap_lo > 180:
955
ap_lo >= 1000:
920
ap_lo < 40:
58

ap_hi > 250:
40
ap_hi >= 1000:
24
ap_hi < 50:
181


Presumably, each value >= 1000 contains extra digits. We can drop the last digit of each overly large value (repeating while it is still above 400) and look through the data again.

In [28]:
while df_copy.query('ap_lo > 400').ap_lo.count() != 0:
    df_copy.loc[(df_copy.ap_lo > 400), 'ap_lo'] = df_copy.ap_lo//10
    print(df_copy.query('ap_lo > 400').ap_lo.count())

24
0
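
The trimming loop above can be sketched as a small reusable helper. This is a minimal version on made-up values; `trim_extra_digits` and `toy_ap_lo` are hypothetical names, and the limit of 400 simply mirrors the threshold used in the cell above.

```python
import pandas as pd


def trim_extra_digits(values: pd.Series, limit: int = 400) -> pd.Series:
    """Integer-divide by 10 until no value exceeds `limit`."""
    result = values.copy()
    while (result > limit).any():
        mask = result > limit
        # Drop the last digit of every value that is still too large.
        result.loc[mask] = result.loc[mask] // 10
    return result


toy_ap_lo = pd.Series([80, 1000, 11000, 90, 8500])
fixed = trim_extra_digits(toy_ap_lo)
print(fixed.tolist())  # [80, 100, 110, 90, 85]
```

Values such as 11000 need two passes, which is why the notebook cell prints a non-zero count after the first iteration.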


In [29]:
df_copy.query('ap_lo < 40').ap_lo.count()

58

In [30]:
df_copy = df_copy.drop(df_copy.query('ap_lo < 40').index)
df_copy.shape

(69929, 13)

In [31]:
while df_copy.query('ap_hi > 400').ap_hi.count() != 0:
    df_copy.loc[(df_copy.ap_hi > 400), 'ap_hi'] = df_copy.ap_hi//10
    print(df_copy.query('ap_hi > 400').ap_hi.count())

9
0


In [32]:
df_copy.query('ap_hi < 50').ap_hi.count()

178

In [33]:
df_copy = df_copy.drop(df_copy.query('ap_hi < 50').index)
df_copy.shape

(69751, 13)

In [34]:
df_copy.describe()

Unnamed: 0,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,years
count,69751.0,69751.0,69751.0,69751.0,69751.0,69751.0,69751.0,69751.0,69751.0,69751.0,69751.0,69751.0,69751.0
mean,19469.464337,1.349601,164.361371,74.219886,127.000903,81.657123,1.367249,1.226592,0.088271,0.05382,0.803773,0.499735,52.842224
std,2467.174408,0.476847,8.190967,14.383973,17.09936,9.943735,0.680592,0.572434,0.283691,0.225664,0.397145,0.500004,6.766545
min,10798.0,1.0,55.0,28.0,60.0,40.0,1.0,1.0,0.0,0.0,0.0,0.0,29.0
25%,17665.0,1.0,159.0,65.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,0.0,48.0
50%,19703.0,1.0,165.0,72.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,0.0,53.0
75%,21327.0,2.0,170.0,82.0,140.0,90.0,2.0,1.0,0.0,0.0,1.0,1.0,58.0
max,23713.0,2.0,250.0,200.0,240.0,190.0,3.0,3.0,1.0,1.0,1.0,1.0,64.0


So, the problem is solved: the min and max values in columns `ap_hi` and `ap_lo` are now plausible.

Let's look at the min/max values of the columns `height` and `weight`:

In [35]:
df_copy.query('weight < 35').weight.count()

7

In [36]:
df_copy.query('height < 120').height.count()

49

In [37]:
df_copy = df_copy.drop(df_copy.query('height < 120 or weight < 35').index)
df_copy.shape

(69695, 13)

In [38]:
df_copy.query('height > 200')

Unnamed: 0_level_0,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,years
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
9223,21220,1,250,86.0,140,100,3,1,0,0,1,1,58
30894,19054,2,207,78.0,100,70,1,1,0,1,1,0,52


In [39]:
df_copy.query('weight > 180')

Unnamed: 0_level_0,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,years
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
618,16765,1,186,200.0,130,70,1,1,0,0,0,0,45
52564,19630,1,161,181.0,180,110,2,1,0,0,1,1,53
71945,15117,2,180,200.0,150,90,1,1,0,0,1,1,41
87498,20939,2,180,183.0,110,80,3,3,0,1,1,1,57


In [40]:
df_copy = df_copy.drop(df_copy.query('height > 200').index)
df_copy.shape

(69693, 13)

In [41]:
df_copy.isna().sum().sum()

0

## Exploratory data analysis and Feature engineering

1. age and gender / cardio

In [42]:
df_copy['age_group'] = pd.qcut(df_copy['age'], 4)

In [43]:
df_copy.groupby('age_group')['cardio'].agg(['count', 'mean']).sort_values(by='mean', ascending=False)

Unnamed: 0_level_0,count,mean
age_group,Unnamed: 1_level_1,Unnamed: 2_level_1
"(21327.0, 23713.0]",17423,0.651782
"(19704.0, 21327.0]",17412,0.540374
"(17665.0, 19704.0]",17434,0.452392
"(10797.999, 17665.0]",17424,0.354626


We see that the risk of heart disease increases steadily with age:  
- The highest risk of heart disease is in the oldest group (58-64 years old).  
- The lowest risk is in the youngest group (29-48 years old).
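
The age ranges in years come from converting the quartile edges, which `pd.qcut` reports in days. A quick sketch of that conversion, with the edges copied by hand from the output above (`edges_days` is an illustrative name):

```python
import numpy as np

# Quartile edges in days, copied from the qcut output above.
edges_days = np.array([10798, 17665, 19704, 21327, 23713])

# Same whole-year conversion as used for the `years` column.
edges_years = edges_days // 365
print(edges_years.tolist())  # [29, 48, 53, 58, 64]
```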

In [44]:
df_copy.groupby(['gender'])['cardio'].agg(['count', 'mean'])

Unnamed: 0_level_0,count,mean
gender,Unnamed: 1_level_1,Unnamed: 2_level_1
1,45327,0.496878
2,24366,0.505171


In [45]:
df_copy.gender.value_counts(normalize=True)

1    0.650381
2    0.349619
Name: gender, dtype: float64

The risk of heart disease is nearly equal between men and women in the available data: 0.497 vs. 0.505.

2. active / cardio

In [46]:
df_copy.groupby(['active'])['cardio'].agg(['count', 'mean'])

Unnamed: 0_level_0,count,mean
active,Unnamed: 1_level_1,Unnamed: 2_level_1
0,13678,0.536043
1,56015,0.490922


3. smoke / cardio

In [47]:
df_copy.groupby(['smoke'])['cardio'].agg(['count', 'mean'])

Unnamed: 0_level_0,count,mean
smoke,Unnamed: 1_level_1,Unnamed: 2_level_1
0,63540,0.502188
1,6153,0.47489


4. alco / cardio

In [48]:
df_copy.groupby(['alco'])['cardio'].agg(['count', 'mean'])

Unnamed: 0_level_0,count,mean
alco,Unnamed: 1_level_1,Unnamed: 2_level_1
0,65942,0.500637
1,3751,0.484671


5. cholesterol / cardio

In [49]:
df_copy.groupby(['cholesterol'])['cardio'].agg(['count', 'mean'])

Unnamed: 0_level_0,count,mean
cholesterol,Unnamed: 1_level_1,Unnamed: 2_level_1
1,52136,0.440157
2,9512,0.601871
3,8045,0.765444


6. gluc / cardio

In [50]:
df_copy.groupby(['gluc'])['cardio'].agg(['count', 'mean'])

Unnamed: 0_level_0,count,mean
gluc,Unnamed: 1_level_1,Unnamed: 2_level_1
1,59206,0.48061
2,5173,0.593079
3,5314,0.622507


7. ap_hi and ap_lo (upper and lower blood pressure)

In [51]:
df_copy['ap_hi_group'] = pd.cut(df_copy['ap_hi'], bins=[0, 120, 130, 140, 180])
df_copy['ap_lo_group'] = pd.cut(df_copy['ap_lo'], bins=[0, 80, 90, 120])

In [52]:
df_copy.groupby('ap_hi_group')['cardio'].agg(['count', 'mean']).sort_values(by='mean', ascending=False)

Unnamed: 0_level_0,count,mean
ap_hi_group,Unnamed: 1_level_1,Unnamed: 2_level_1
"(140, 180]",9516,0.85971
"(130, 140]",9813,0.811882
"(120, 130]",9538,0.582512
"(0, 120]",40497,0.317184


In [53]:
df_copy.groupby('ap_lo_group')['cardio'].agg(['count', 'mean']).sort_values(by='mean', ascending=False)

Unnamed: 0_level_0,count,mean
ap_lo_group,Unnamed: 1_level_1,Unnamed: 2_level_1
"(90, 120]",5944,0.834623
"(80, 90]",14880,0.741667
"(0, 80]",48764,0.384382


8. weight and height / cardio

**body mass index (BMI) = weight (kg) / height^2 (m^2)**

In [54]:
df_copy['bmi'] = round(df_copy['weight'] / ((df_copy['height']/100)**2))

**BMI:**
- (0.0, 18.5] - underweight
- (18.5, 25.0] - normal
- (25.0, 30.0] - overweight
- (30.0, 35.0] - obesity

In [55]:
df_copy['bmi_group'] = pd.cut(df_copy['bmi'], bins=[0, 18.5, 25.0, 30.0, 35.0])

In [56]:
df_copy.groupby('bmi_group')['cardio'].agg(['count', 'mean']).sort_values(by='mean', ascending=False)

Unnamed: 0_level_0,count,mean
bmi_group,Unnamed: 1_level_1,Unnamed: 2_level_1
"(30.0, 35.0]",10760,0.615892
"(25.0, 30.0]",24114,0.520693
"(18.5, 25.0]",28758,0.408686
"(0.0, 18.5]",627,0.274322


People who are overweight or obese get heart disease much more often than people with a normal or underweight BMI.
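
One detail of the binning: the last bin ends at 35, so any BMI above 35 falls outside all groups and becomes `NaN`. In the table above the four group counts add up to 64259 of the 69693 rows, so the remaining rows are not represented. The sketch below shows the effect on made-up values; `toy` is an illustrative frame, and unlike the notebook cell it does not round the BMI before binning.

```python
import pandas as pd

# Made-up heights (cm) and weights (kg).
toy = pd.DataFrame({
    "height": [170, 160, 180, 150],
    "weight": [50.0, 60.0, 90.0, 90.0],
})

# BMI = weight (kg) / height (m) squared.
toy["bmi"] = toy["weight"] / (toy["height"] / 100) ** 2
toy["bmi_group"] = pd.cut(toy["bmi"], bins=[0, 18.5, 25.0, 30.0, 35.0])

# Rows with BMI above 35 get no group at all.
outside = int(toy["bmi_group"].isna().sum())
print(toy)
print("rows outside all bins:", outside)
```

If those rows should be analysed too, one option would be an open-ended last bin, for example replacing the upper edge with `float("inf")`.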

### Conclusion

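Heart disease is more frequent in people:
- who are 58-64 years old
- who are not active
- with elevated cholesterol or glucose (cholesterol = 3, gluc = 3)
- with high blood pressure (ap_hi > 130 or ap_lo > 80)
- who are overweight or obese

Based on the provided data, smoke, alco and gender show almost no association with the presence of heart disease.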

## Preprocessing for models

In [57]:
df_copy.columns.to_list()

['age',
 'gender',
 'height',
 'weight',
 'ap_hi',
 'ap_lo',
 'cholesterol',
 'gluc',
 'smoke',
 'alco',
 'active',
 'cardio',
 'years',
 'age_group',
 'ap_hi_group',
 'ap_lo_group',
 'bmi',
 'bmi_group']

In [58]:
X, y = df_copy.drop(columns=['cardio', 'age_group', 'bmi', 'bmi_group', 'ap_hi_group', 'ap_lo_group', 'years'], axis=1), df_copy['cardio']
print('train features shape:', X.shape,
      '\ntrain target shape:', y.shape)

train features shape: (69693, 11) 
train target shape: (69693,)


In [61]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.15, random_state=42)

print('train shape:', X_train.shape, y_train.shape,
      '\nvalid shape:', X_valid.shape, y_valid.shape)

train shape: (59239, 11) (59239,) 
valid shape: (10454, 11) (10454,)


In [62]:
scaler = RobustScaler(quantile_range=(25, 75))

scaler.fit(X_train)

X_train1 = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)
X_valid1 = pd.DataFrame(scaler.transform(X_valid), columns=X_valid.columns)
test_copy1 = pd.DataFrame(scaler.transform(test_copy), columns=X_train.columns)

X_train1.head()

Unnamed: 0,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active
0,-0.00492,0.0,0.545455,-0.764706,-1.0,-1.0,0.0,0.0,0.0,0.0,0.0
1,-0.573459,0.0,-1.363636,0.058824,-0.5,-1.0,0.0,0.0,0.0,0.0,0.0
2,-0.172475,0.0,-0.909091,-1.470588,1.0,1.0,2.0,2.0,0.0,0.0,0.0
3,0.01886,0.0,-0.818182,-0.764706,0.0,0.0,1.0,0.0,0.0,0.0,-1.0
4,-1.247506,0.0,-0.272727,-0.235294,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [63]:
test_copy1.head()

Unnamed: 0,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active
0,-0.223862,0.0,-1.0,0.764706,0.5,0.0,0.0,0.0,0.0,0.0,0.0
1,-0.181768,1.0,0.454545,-0.176471,0.5,1.0,0.0,0.0,0.0,0.0,0.0
2,0.198169,0.0,-0.454545,-0.117647,0.0,-0.5,0.0,0.0,0.0,0.0,-1.0
3,-0.430231,1.0,1.818182,1.294118,0.5,1.0,0.0,0.0,0.0,0.0,0.0
4,-0.987837,1.0,0.454545,0.235294,0.0,0.0,0.0,0.0,0.0,0.0,0.0
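
For reference, `RobustScaler` subtracts the median of each column and divides by its interquartile range (IQR). When the IQR is zero, as for a mostly-constant binary flag, scikit-learn uses a scale of 1 and the column is only centred. That is consistent with the `smoke` and `alco` columns above staying at 0.0. A minimal sketch on made-up numbers (`X_toy` and `scaler_toy` are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up data: a continuous feature with an outlier and a mostly-zero flag.
X_toy = np.array([
    [50.0, 0.0],
    [60.0, 0.0],
    [70.0, 0.0],
    [80.0, 0.0],
    [200.0, 1.0],
])

scaler_toy = RobustScaler(quantile_range=(25, 75)).fit(X_toy)
scaled = scaler_toy.transform(X_toy)

# Manual version for the first column: (x - median) / IQR.
col = X_toy[:, 0]
q25, q50, q75 = np.percentile(col, [25, 50, 75])
manual = (col - q50) / (q75 - q25)

print(scaled[:, 0], manual)
print(scaler_toy.scale_)  # IQR per column; a zero IQR is replaced by 1
```

The outlier 200 still maps to a large value, but it does not distort the scale of the other rows the way a mean and standard deviation would.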


In [147]:
# scaler = StandardScaler()

# scaler.fit(X_train)

# X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)
# X_valid = pd.DataFrame(scaler.transform(X_valid), columns=X_valid.columns)
# new_test = pd.DataFrame(scaler.transform(test), columns=X_train.columns)

# X_train.head()

## Training

### LogRegression

In [65]:
lr =  LogisticRegression(random_state=42, solver='liblinear')

lr.fit(X_train1, y_train)
pred_lr = lr.predict_proba(X_valid1)
pred_lr_1_valid = pred_lr[:, 1]

auc_roc_lr = roc_auc_score(y_valid, pred_lr_1_valid).round(5)

print('AUC-ROC:', auc_roc_lr)

AUC-ROC: 0.78534


In [66]:
pred_lr_1_valid

array([0.85071137, 0.2112018 , 0.61420729, ..., 0.88994356, 0.05151809,
       0.72449777])

In [68]:
y_probas = lr.predict_proba(test_copy1)[:, 1]
y_probas

array([0.52418524, 0.51111358, 0.42945573, ..., 0.44336405, 0.38565199,
       0.4894006 ])

In [69]:
output = pd.DataFrame({'id': test.id,
                       'cardio': y_probas})

output.to_csv('submission.csv', index=False)

output.head()

Unnamed: 0,id,cardio
0,5,0.524185
1,6,0.511114
2,7,0.429456
3,10,0.534969
4,11,0.271192


In [70]:
output.shape

(30000, 2)

### Catboost
Let's try to choose the model's parameters via GridSearchCV.

In [43]:
cat_model = CatBoostClassifier()

params = {'iterations': [500],
          'depth': [4, 6, 8],
          'loss_function': ['Logloss', 'CrossEntropy'],
          'logging_level':['Silent'],
          'random_seed': [42],
          'custom_loss':['AUC']
         }

# scorer = make_scorer(accuracy_score)

clf_grid = GridSearchCV(estimator=cat_model, param_grid=params, scoring='roc_auc', cv=5)

Fit the grid search on the training set:

In [44]:
clf_grid.fit(X_train, y_train)
best_param = clf_grid.best_params_
best_param

{'custom_loss': 'AUC',
 'depth': 6,
 'iterations': 500,
 'logging_level': 'Silent',
 'loss_function': 'CrossEntropy',
 'random_seed': 42}

Create a model with the best parameters and cross-validate it on train_pool:

In [45]:
model = CatBoostClassifier(custom_loss='AUC',
                           depth= 6,
                           iterations = 500,
                           logging_level= 'Silent',
                           loss_function= 'CrossEntropy',
                           random_seed= 42)

In [46]:
# cat_features = [0]
train_pool = Pool(X_train, y_train)

params = {'custom_loss': 'AUC',
         'depth': 6,
         'iterations': 500,
         'logging_level': 'Silent',
         'loss_function': 'CrossEntropy',
         'random_seed': 42}

scores = cv(train_pool,
            params,
            fold_count=2, 
            plot=True)  # boolean flag instead of the string "True"

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

In [47]:
model.fit(train_pool, eval_set=(X_valid, y_valid))

<catboost.core.CatBoostClassifier at 0x20f8dbb9b80>

It is time for our test set:

In [48]:
pred_cb = model.predict_proba(test_copy)[:, 1]

In [49]:
output = pd.DataFrame({'id': test.id,
                       'cardio': pred_cb})

output.to_csv('submission_cb2.csv', index=False)

output.head()

Unnamed: 0,id,cardio
0,5,0.500617
1,6,0.534682
2,7,0.401653
3,10,0.536805
4,11,0.230042


### LightGBM

In [18]:
lgbm = lgb.LGBMClassifier(learning_rate=0.01, first_metric_only = True)

lgbm.fit(
        X_train, y_train, 
        eval_set =[(X_valid, y_valid)], 
        eval_metric=['auc'],
        early_stopping_rounds = 10,
        verbose = 2
       )

[2]	valid_0's auc: 0.796296	valid_0's binary_logloss: 0.687749
[4]	valid_0's auc: 0.796351	valid_0's binary_logloss: 0.682562
[6]	valid_0's auc: 0.796439	valid_0's binary_logloss: 0.677575
[8]	valid_0's auc: 0.796749	valid_0's binary_logloss: 0.672782
[10]	valid_0's auc: 0.7968	valid_0's binary_logloss: 0.668179
[12]	valid_0's auc: 0.796854	valid_0's binary_logloss: 0.663746
[14]	valid_0's auc: 0.797266	valid_0's binary_logloss: 0.659478
[16]	valid_0's auc: 0.797637	valid_0's binary_logloss: 0.655375
[18]	valid_0's auc: 0.797654	valid_0's binary_logloss: 0.651421
[20]	valid_0's auc: 0.797681	valid_0's binary_logloss: 0.64762
[22]	valid_0's auc: 0.797683	valid_0's binary_logloss: 0.64396
[24]	valid_0's auc: 0.797725	valid_0's binary_logloss: 0.640427
[26]	valid_0's auc: 0.798086	valid_0's binary_logloss: 0.637021
[28]	valid_0's auc: 0.798201	valid_0's binary_logloss: 0.63374
[30]	valid_0's auc: 0.798093	valid_0's binary_logloss: 0.630574
[32]	valid_0's auc: 0.79805	valid_0's binary_logl

LGBMClassifier(first_metric_only=True, learning_rate=0.01)

In [19]:
roc_auc_score(y_valid, lgbm.predict_proba(X_valid)[:, 1])

0.8004203420516175

In [20]:
pred_lgbm = lgbm.predict_proba(test_copy)[:, 1]

In [21]:
output = pd.DataFrame({'id': test.id,
                       'cardio': pred_lgbm})

output.to_csv('submission_lgbm.csv', index=False)

output.head()

Unnamed: 0,id,cardio
0,5,0.495841
1,6,0.55862
2,7,0.46148
3,10,0.541806
4,11,0.325914


# Conclusion

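The best result was given by the **CatBoost model: 0.80436**. For comparison, on the validation set LightGBM reached a ROC AUC of 0.80042 and logistic regression 0.78534.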