##### Imports

In [35]:
import pandas as pd
from IPython.display import display
import numpy as np
import random
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score, precision_score

##### Constants

In [34]:
DATASET_PATH = "scaled_myocardial_infarction_dataset.csv"
SEED = 0
MAX_ITER = 1000

In [13]:
np.random.seed(SEED)
random.seed(SEED)

##### Dataset Loading 

In [18]:
infarction_df = pd.read_csv(DATASET_PATH)
infarction_df.head(5)

Unnamed: 0,AGE,SEX,INF_ANAM,STENOK_AN,FK_STENOK,IBS_POST,GB,SIM_GIPERT,DLIT_AG,ZSN_A,...,JELUD_TAH,FIBR_JELUD,A_V_BLOK,OTEK_LANC,RAZRIV,DRESSLER,ZSN,REC_IM,P_IM_STEN,LET_IS
0,0.875,0.0,2.0,0.0,-0.5,1.0,0.5,0.0,0.857143,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,-0.5,0.0,1.0,-0.2,-1.0,-1.0,-1.0,0.0,-0.142857,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,-0.6875,0.0,0.0,-0.2,-1.0,1.0,0.0,0.0,0.142857,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.3125,-1.0,0.0,-0.2,-1.0,1.0,0.0,0.0,0.285714,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,-0.1875,0.0,0.0,-0.2,-1.0,1.0,0.5,0.0,0.857143,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
numeric_columns = [
    "AGE",
    "S_AD_KBRIG",
    "D_AD_KBRIG",
    "S_AD_ORIT",
    "D_AD_ORIT",
    "K_BLOOD",
    "NA_BLOOD",
    "ALT_BLOOD",
    "AST_BLOOD",
    "L_BLOOD",
    "ROE"
]

target_columns = [
    "FIBR_PREDS",
    "PREDS_TAH",
    "JELUD_TAH",
    "FIBR_JELUD",
    "A_V_BLOK",
    "OTEK_LANC",
    "RAZRIV",
    "DRESSLER",
    "ZSN",
    "REC_IM",
    "P_IM_STEN",
    "LET_IS"
]

categorical_columns = []
for col in infarction_df.columns:
    if col not in numeric_columns and col not in target_columns:
        categorical_columns.append(col)

We preprocessed the data in Checkpoint 3, so all that remains is to test various ML models for predicting the targets.

As the authors of the dataset point out, it can be used for several tasks. There are __four__ possible points in time for __complication prediction__, each based on the information known at:
- __the time of admission to hospital__: all input columns (2-112) except 93, 94, 95, 100, 101, 102, 103, 104, 105 can be used for prediction;
- __the end of the first day__ (24 hours after admission to the hospital): all input columns (2-112) except 94, 95, 101, 102, 104, 105 can be used for prediction;
- __the end of the second day__ (48 hours after admission to the hospital): all input columns (2-112) except 95, 102, 105 can be used for prediction;
- __the end of the third day__ (72 hours after admission to the hospital): all input columns (2-112) can be used for prediction.
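
The four horizons above can be sketched as a small lookup. This is only an illustration: the helper `columns_unavailable_at` and the dictionary are hypothetical, and we assume that the column numbers from the dataset description map onto the `*_n` column names in order (93-95 to `R_AB_*`, 100-102 to `NA_R_*`, 103-105 to `NOT_NA_*`).

```python
# Hypothetical lookup: columns that only become known during day 1, 2 or 3.
# Assumption: numbers 93-95, 100-102, 103-105 correspond to these names in order.
TIME_DEPENDENT_COLUMNS = {
    1: ["R_AB_1_n", "NA_R_1_n", "NOT_NA_1_n"],  # known by the end of day 1
    2: ["R_AB_2_n", "NA_R_2_n", "NOT_NA_2_n"],  # known by the end of day 2
    3: ["R_AB_3_n", "NA_R_3_n", "NOT_NA_3_n"],  # known by the end of day 3
}


def columns_unavailable_at(days_since_admission):
    """Return the columns to drop when predicting after 0-3 full days in hospital."""
    if days_since_admission not in (0, 1, 2, 3):
        raise ValueError("days_since_admission must be 0, 1, 2 or 3")
    unavailable = []
    for day, cols in TIME_DEPENDENT_COLUMNS.items():
        # Anything recorded on a later day is not yet known.
        if day > days_since_admission:
            unavailable.extend(cols)
    return unavailable


print(columns_unavailable_at(0))  # all nine columns: time of admission
print(columns_unavailable_at(2))  # only the day-3 columns
```

With `days_since_admission=0` the helper returns all nine columns, which matches the exclusion list for the time of admission.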

In this notebook we'll try to solve the task of predicting the targets at __the time of admission to hospital__.

Thus, we should get rid of columns numbered 93, 94, 95, 100, 101, 102, 103, 104, 105.

More information about the features can be found in the _Checkpoint 3 EDA_ notebook, as well as in the _Checkpoint 2_ Markdown file.

In [20]:
columns_to_drop = [
    "R_AB_1_n",
    "R_AB_2_n",
    "R_AB_3_n",
    "NA_R_1_n",
    "NA_R_2_n",
    "NA_R_3_n",
    "NOT_NA_1_n",
    "NOT_NA_2_n",
    "NOT_NA_3_n"
]

infarction_df.drop(columns=columns_to_drop, inplace=True)
for col_to_drop in columns_to_drop:
    if col_to_drop in categorical_columns:
        categorical_columns.remove(col_to_drop)

It's important to note that the task at hand is __multilabel__ classification.

In [21]:
for target in target_columns:
    display(infarction_df[target].value_counts().to_frame())

Unnamed: 0,FIBR_PREDS
0.0,1530
1.0,170


Unnamed: 0,PREDS_TAH
0.0,1680
1.0,20


Unnamed: 0,JELUD_TAH
0.0,1658
1.0,42


Unnamed: 0,FIBR_JELUD
0.0,1629
1.0,71


Unnamed: 0,A_V_BLOK
0.0,1643
1.0,57


Unnamed: 0,OTEK_LANC
0.0,1541
1.0,159


Unnamed: 0,RAZRIV
0.0,1646
1.0,54


Unnamed: 0,DRESSLER
0.0,1625
1.0,75


Unnamed: 0,ZSN
0.0,1306
1.0,394


Unnamed: 0,REC_IM
0.0,1541
1.0,159


Unnamed: 0,P_IM_STEN
0.0,1552
1.0,148


Unnamed: 0,LET_IS
0.0,1429
1.0,110
3.0,54
7.0,27
6.0,27
4.0,23
2.0,18
5.0,12


Most of the targets are binary; only the last one, `LET_IS`, is multiclass and needs special treatment.

##### Metric

The `F1-score` seems like a good choice for this task for two reasons. First, the targets are highly imbalanced, and we need a metric that accounts for this. Second, we have a multiclass target, `LET_IS`, for which we can use averaging methods and see how they react to the model's predictions.
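
To make the effect of averaging concrete, here is a minimal sketch on made-up labels (not taken from our dataset), where a "model" always predicts the majority class. It uses `f1_score` from scikit-learn with its `average` and `zero_division` parameters.

```python
from sklearn.metrics import f1_score

# Toy multiclass labels (made up for illustration): class 0 dominates,
# and every sample is predicted as the majority class.
y_true = [0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0]

# zero_division=0 scores never-predicted classes as 0 without raising a warning.
micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"micro={micro:.3f}, macro={macro:.3f}, weighted={weighted:.3f}")
# micro=0.750, macro=0.286, weighted=0.643
```

For a single-label multiclass target, micro-averaged F1 equals accuracy, so a majority-class predictor already scores high. Macro averaging weights every class equally and exposes the minority classes that were ignored.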

##### Linear Models

We'll use __Logistic Regression__, one of the most reliable algorithms in the linear model family.

The structure of the learning phase is quite simple: we train a bunch of Logistic Regressions on the dataset, one per target, and combine their results at the end.
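
The "one model per target" idea can be sketched on synthetic data as follows. The feature matrix, the target names `target_a` to `target_c` and the way predictions are stacked are all assumptions made for illustration, not the setup used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 200 samples, 5 features, 3 binary targets.
X = rng.normal(size=(200, 5))
targets = {
    "target_a": (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int),
    "target_b": (X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int),
    "target_c": (X[:, 2] + rng.normal(scale=0.5, size=200) > 0).astype(int),
}

# Fit one independent classifier per target and keep its predictions.
per_target_preds = {}
for name, y in targets.items():
    model = LogisticRegression(max_iter=1000, random_state=0)
    model.fit(X, y)
    per_target_preds[name] = model.predict(X)

# Stack the per-target predictions column-wise into one multilabel matrix.
multilabel_preds = np.column_stack([per_target_preds[name] for name in targets])
print(multilabel_preds.shape)  # (200, 3)
```

Each column of `multilabel_preds` comes from its own model, so the targets are treated as independent of each other.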

Let's start by simply training Logistic Regressions with standard parameters.

In [59]:
for target in target_columns:
    log = LogisticRegression(max_iter=MAX_ITER, random_state=SEED)
    log.fit(
        infarction_df[numeric_columns + categorical_columns], 
        infarction_df[target]
    )
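    # Binary targets are scored on the positive class only;
    # the multiclass target LET_IS is micro-averaged over all classes.
    if target != "LET_IS":
        average = "binary"
    else:
        average = "micro"

    # Predictions are made on the same data the model was fitted on.
    preds = log.predict(infarction_df[numeric_columns + categorical_columns])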
    
    print(f"Target = {target}")
    print(
        "f1-score: ",
        round(f1_score(infarction_df[target], preds, average=average), 3),
        end=", "
    )
    print(
        "Recall: ",
        round(recall_score(infarction_df[target], preds, average=average), 3),
        end=", "
    )
    print(
        "Precision: ",
        round(precision_score(infarction_df[target], preds, average=average), 3),
    )

Target = FIBR_PREDS
f1-score:  0.158, Recall:  0.094, Precision:  0.5
Target = PREDS_TAH
f1-score:  0.0, Recall:  0.0, Precision:  0.0
Target = JELUD_TAH
f1-score:  0.047, Recall:  0.024, Precision:  1.0
Target = FIBR_JELUD
f1-score:  0.2, Recall:  0.113, Precision:  0.889
Target = A_V_BLOK
f1-score:  0.152, Recall:  0.088, Precision:  0.556
Target = OTEK_LANC
f1-score:  0.255, Recall:  0.157, Precision:  0.676
Target = RAZRIV
f1-score:  0.167, Recall:  0.093, Precision:  0.833
Target = DRESSLER
f1-score:  0.052, Recall:  0.027, Precision:  1.0
Target = ZSN
f1-score:  0.256, Recall:  0.16, Precision:  0.636
Target = REC_IM
f1-score:  0.025, Recall:  0.013, Precision:  0.5
Target = P_IM_STEN
f1-score:  0.013, Recall:  0.007, Precision:  1.0
Target = LET_IS
f1-score:  0.889, Recall:  0.889, Precision:  0.889


We encounter one of the most typical problems of working with medical data: our __precision is quite high__ in most cases, yet __recall is exceptionally low__. This is mainly because each target has only a few positive samples, so the model can't really distinguish them from the rest of the samples.

Let's try predicting probabilities instead for the binary targets. We will assign the positive class whenever its predicted probability is $> 0.30$.
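
As a minimal sketch of what lowering the threshold does, consider the made-up probabilities below (they are not model outputs from this notebook). The helpers `classify` and `recall` are hypothetical names introduced only for this example.

```python
import numpy as np

# Made-up positive-class probabilities, shaped like predict_proba(...)[:, 1].
proba_positive = np.array([0.05, 0.10, 0.32, 0.45, 0.55, 0.20, 0.35, 0.70])
y_true = np.array([0, 0, 1, 1, 1, 0, 0, 1])


def classify(proba, threshold):
    """Label a sample positive when its probability strictly exceeds the threshold."""
    return (proba > threshold).astype(int)


def recall(y_true, y_pred):
    """Share of actual positives that were predicted positive."""
    true_positives = np.sum((y_true == 1) & (y_pred == 1))
    return true_positives / np.sum(y_true == 1)


default_preds = classify(proba_positive, 0.50)  # roughly what predict() does for a binary model
lowered_preds = classify(proba_positive, 0.30)

print(recall(y_true, default_preds))  # 0.5
print(recall(y_true, lowered_preds))  # 1.0
```

Positives with probabilities between 0.30 and 0.50 are missed at the default cut-off and recovered at the lower one, at the cost of also flagging one negative sample.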

In [61]:
for target in target_columns:
    log = LogisticRegression(max_iter=MAX_ITER, random_state=SEED)
    log.fit(
        infarction_df[numeric_columns + categorical_columns], 
        infarction_df[target]
    )
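    if target != "LET_IS":
        average = "binary"
        # Column 1 of predict_proba holds the positive-class probability;
        # predict positive when it exceeds the 0.30 threshold.
        preds = log.predict_proba(infarction_df[numeric_columns + categorical_columns])
        preds = np.where(preds[:, 1] > 0.30, 1, 0)
    else:
        # The multiclass target keeps the default prediction rule.
        average = "micro"
        preds = log.predict(infarction_df[numeric_columns + categorical_columns])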

    print(f"Target = {target}")
    print(
        "f1-score: ",
        round(f1_score(infarction_df[target], preds, average=average), 3),
        end=", "
    )
    print(
        "Recall: ",
        round(recall_score(infarction_df[target], preds, average=average), 3),
        end=", "
    )
    print(
        "Precision: ",
        round(precision_score(infarction_df[target], preds, average=average), 3),
    )

Target = FIBR_PREDS
f1-score:  0.358, Recall:  0.282, Precision:  0.49
Target = PREDS_TAH
f1-score:  0.095, Recall:  0.05, Precision:  1.0
Target = JELUD_TAH
f1-score:  0.286, Recall:  0.19, Precision:  0.571
Target = FIBR_JELUD
f1-score:  0.272, Recall:  0.197, Precision:  0.438
Target = A_V_BLOK
f1-score:  0.306, Recall:  0.228, Precision:  0.464
Target = OTEK_LANC
f1-score:  0.41, Recall:  0.358, Precision:  0.479
Target = RAZRIV
f1-score:  0.354, Recall:  0.259, Precision:  0.56
Target = DRESSLER
f1-score:  0.096, Recall:  0.053, Precision:  0.5
Target = ZSN
f1-score:  0.455, Recall:  0.472, Precision:  0.44
Target = REC_IM
f1-score:  0.272, Recall:  0.182, Precision:  0.537
Target = P_IM_STEN
f1-score:  0.281, Recall:  0.189, Precision:  0.549
Target = LET_IS
f1-score:  0.889, Recall:  0.889, Precision:  0.889


Precision dropped on most of the targets, although it remains quite high on some of them. At the same time, recall improved on every binary target: our baseline model can now classify more positive samples correctly.
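
The trade-off described above can be reproduced on a toy example. The scores and labels below are made up, and `precision_recall_at` is a hypothetical helper; the sketch only shows how the two metrics move as the threshold is lowered.

```python
# Made-up scores and labels, purely to illustrate the trade-off.
scores = [0.05, 0.10, 0.32, 0.45, 0.55, 0.20, 0.35, 0.70, 0.28, 0.15]
labels = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0]


def precision_recall_at(threshold):
    """Compute (precision, recall) when scores above the threshold count as positive."""
    preds = [int(score > threshold) for score in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


for threshold in (0.5, 0.3, 0.1):
    precision, recall = precision_recall_at(threshold)
    print(f"threshold={threshold}: precision={precision}, recall={recall}")
```

On this toy data, each lower threshold catches more positives and lets in more false positives, which is the same pattern we see between the two runs above.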