# This notebook finds the best scaling method and resampling method for the MIT-BIH dataset.

We train several baseline machine learning models and compare them by F1 score to pick the best option.

We run two experiments:
1. binary classification
2. multiclass classification

Conclusion: 
| | rescaling | resampling | 
| --- | --- | --- |
| Binary | minmax | oversampling |
| Multiple | minmax | oversampling  |

In [2]:
import pandas as pd
import numpy as np

def addColumnsToDataframe(df):
    """
    The dataset has 188 columns; the 188th column holds the category value.
    Name the last column 'target' and the others 'c_0' to 'c_186'.
    """
    num_columns = df.shape[1]
    feature_col_name = ['c_' + str(i) for i in range(0, num_columns - 1)]
    df_columns = feature_col_name + ['target']
    df.columns = df_columns
    return df

def convertColumnAsInt(df, column):
    """
    The category values are stored as floats; cast them to int so they can be used as class labels.
    """
    df[column] = df[column].astype(int)
    return df



In [3]:
mitbih_train_raw = pd.read_csv('../data/raw/mitbih_train.csv', header=None)
mitbih_test_raw = pd.read_csv('../data/raw/mitbih_test.csv', header=None)

mitbih_train_with_column = addColumnsToDataframe(mitbih_train_raw)
mitbih_test_with_column = addColumnsToDataframe(mitbih_test_raw)

mitbih_train_label_target = convertColumnAsInt(mitbih_train_with_column, 'target')
mitbih_test_label_target = convertColumnAsInt(mitbih_test_with_column, 'target')


# target value and meanings
all_class_mapping = {
    0: 'Normal',
    1: 'Supraventricular',
    2: 'Ventricular',
    3: 'Fusion',
    4: 'Unclassifiable'
}

mitbih_train_label_target['target'] = mitbih_train_label_target['target'].map(all_class_mapping)
mitbih_test_label_target['target'] = mitbih_test_label_target['target'].map(all_class_mapping)

# drop unclassifiable
mitbih_train_label_target = mitbih_train_label_target[mitbih_train_label_target['target'] != 'Unclassifiable']
mitbih_test_label_target = mitbih_test_label_target[mitbih_test_label_target['target'] != 'Unclassifiable']

In [4]:
## generate dataframe for binary classification

mit_binary_train = mitbih_train_label_target.copy()
mit_binary_test = mitbih_test_label_target.copy()


# convert to binary classification: combine the abnormal categories
mit_binary_train['target'] = mit_binary_train['target'].replace(['Supraventricular', 'Ventricular', 'Fusion'], 'abnormal')
mit_binary_test['target'] = mit_binary_test['target'].replace(['Supraventricular', 'Ventricular', 'Fusion'], 'abnormal')

# Encode the labels: normal as 0, abnormal as 1
mit_binary_train['target'] = mit_binary_train['target'].replace({'Normal': 0, 'abnormal': 1})
mit_binary_test['target'] = mit_binary_test['target'].replace({'Normal': 0, 'abnormal': 1})


mit_binary_train = mit_binary_train.dropna()
mit_binary_test = mit_binary_test.dropna()

mit_binary_train["target"].value_counts()


target
0    72471
1     8652
Name: count, dtype: int64
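
The class counts above suggest why F1 is used here rather than accuracy. As a rough illustration (a sketch that is not part of the original pipeline; `f1_binary` is a hypothetical helper written only for this example), a classifier that always predicts class 0 would reach about 89% accuracy on these counts while scoring an F1 of 0 on the abnormal class:

```python
# Class counts taken from the value_counts() output above.
n_normal, n_abnormal = 72471, 8652

# Hypothetical baseline: predict "normal" (0) for every sample.
y_true = [0] * n_normal + [1] * n_abnormal
y_pred = [0] * len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """F1 for the positive class from TP/FP/FN (returns 0.0 when undefined)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
baseline_f1 = f1_binary(y_true, y_pred)

print(f"accuracy = {accuracy:.4f}")  # about 0.8933
print(f"F1       = {baseline_f1:.4f}")  # 0.0000, no abnormal beat is found
```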

In [5]:
# generate dataframe for multi classification
mit_multi_train = mitbih_train_label_target.copy()
mit_multi_test = mitbih_test_label_target.copy()


mit_multi_train = mit_multi_train.dropna()
mit_multi_test = mit_multi_test.dropna()

mit_multi_train['target'].value_counts()

target
Normal              72471
Ventricular          5788
Supraventricular     2223
Fusion                641
Name: count, dtype: int64

## Binary Classification


In [22]:
X_train_binary = mit_binary_train.drop( columns=['target'])
y_train_binary = mit_binary_train['target']

X_test_binary = mit_binary_test.drop(columns=['target'])
y_test_binary = mit_binary_test['target']


Find the best scaling method

In [23]:
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler

# commonly used scalers ("None" means no scaling)
scalers = { 
    "StandardScaler": StandardScaler(),
    "MinMaxScaler": MinMaxScaler(),
    "RobustScaler": RobustScaler(),
    "None": None,
}

models = {
    "LogisticRegression": LogisticRegression(class_weight='balanced', max_iter=1000),
    "Tree": DecisionTreeClassifier(class_weight='balanced', random_state=42),
     # "SVM": SVC(class_weight='balanced', probability=True, random_state=42), # SVM is computationally too expensive, after 80 minutes, it was still running!!!
    "KNN": KNeighborsClassifier(n_neighbors=5, weights='distance', n_jobs=-1)
}

def evaluate_scalers(X, y, scalers, models):
    results = {}
    for scaler_name, scaler in scalers.items():
        print(f"Scaler: {scaler_name}", end="\n\n")

        if scaler is None: 
            X_scaled = X.to_numpy()
        else:
            X_scaled = scaler.fit_transform(X)

        skf = StratifiedKFold(n_splits=5)

        for model_name, model in models.items():
            f_score = []
            print(f"Model: {model_name}", end="\n\n")

            for train_index, test_index in skf.split(X_scaled, y):
                # skf.split yields positional indices, so index y by position
                X_train_, y_train_ = X_scaled[train_index], y.iloc[train_index]
                X_test_, y_test_ = X_scaled[test_index], y.iloc[test_index]

                model.fit(X_train_, y_train_)

                y_pred_ = model.predict(X_test_)

                f_score.append(f1_score(y_test_, y_pred_))

            mean_f1_score = np.mean(f_score)
            print("The scores: ", end="\n\n")
            print([round(f, 2) for f in f_score], end="\n\n")
            print('F1-Score mean=%.5f' % (mean_f1_score), end="\n\n")

            if scaler_name not in results:
                results[scaler_name] = {}
            results[scaler_name][model_name] = mean_f1_score
    return results

# Apply evaluation
results = evaluate_scalers(X_train_binary, y_train_binary, scalers, models)
# Summarize the results
for scaler_name, model_scores in results.items():
    print(f"Scaler: {scaler_name}")
    for model_name, score in model_scores.items():
        print(f"  Model: {model_name}, F1-Score mean: {score:.5f}")
    print("\n")


Scaler: StandardScaler

Model: LogisticRegression

The scores: 

[0.18, 0.46, 0.5, 0.48, 0.42]

F1-Score mean=0.40887

Model: Tree

The scores: 

[0.66, 0.83, 0.85, 0.85, 0.72]

F1-Score mean=0.78168

Model: KNN

The scores: 

[0.74, 0.89, 0.91, 0.92, 0.77]

F1-Score mean=0.84705

Scaler: MinMaxScaler

Model: LogisticRegression



KeyboardInterrupt: 

We get the following results (mean F1 over 5 folds):

| Model | StandardScaler | MinMaxScaler | RobustScaler | None |
| --- | --- | --- | --- | ---- | 
| LogisticRegression | 0.40887 | 0.40801 |  0.40920 | 0.40801 |
| Decision Tree |  0.78168 | 0.78168 | 0.78168 | 0.78168 | 
| KNN | 0.84705 | 0.85536 | 0.84558 | 0.85536 | 


From the table:  
RobustScaler wins for LogisticRegression by a narrow margin.     
Decision Tree scores are identical across scalers, which is expected because tree splits do not depend on feature scale.     
MinMaxScaler and no scaling tie for the best KNN score, slightly ahead of the other scalers.   

KNN is well suited to anomaly detection, which is close to our use case, so we give more weight to the KNN results.

To conclude: MinMaxScaler is the best choice for us. Where it is not the best for a given model, the gap is very narrow (under 0.002 in F1), which we can accept.
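
One caveat with `evaluate_scalers`: it fits each scaler on the full training set before splitting, so the validation folds influence the scaling statistics. A possible alternative, sketched here on synthetic data (the `make_classification` data and the `*_demo` names are assumptions for illustration, not the notebook's data), is to wrap the scaler and the model in a scikit-learn pipeline so the scaler is refit on the training part of each fold:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the real features (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=42
)

# The scaler is fitted inside each training fold only.
pipe_demo = make_pipeline(
    MinMaxScaler(),
    KNeighborsClassifier(n_neighbors=5, weights="distance"),
)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_demo = cross_val_score(pipe_demo, X_demo, y_demo, cv=cv_demo, scoring="f1")

print([round(s, 2) for s in scores_demo])
print(f"F1-Score mean={scores_demo.mean():.5f}")
```

For scalers such as MinMaxScaler the difference from the notebook's approach is likely to be small, but the pipeline form keeps the comparison free of leakage by construction.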



In [None]:
# apply minmax scaler 
scaler = MinMaxScaler()

X_train_binary_scaled = scaler.fit_transform(X_train_binary)
X_test_binary_scaled = scaler.transform(X_test_binary)

X_train_binary_scaled = pd.DataFrame(X_train_binary_scaled, columns=X_train_binary.columns)
X_test_binary_scaled = pd.DataFrame(X_test_binary_scaled, columns=X_test_binary.columns)

Find the best resampling method, using the best scaler found above (MinMaxScaler)

In [17]:
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

models = {
    "LogisticRegression": LogisticRegression(class_weight='balanced', max_iter=1000),
    "Tree": DecisionTreeClassifier(class_weight='balanced', random_state=42),
    #"SVM": SVC(class_weight='balanced', probability=True, random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5, weights='distance', n_jobs=-1)
}

def crossvalidation(X, y, models):
    resampling_methods = {
        "SMOTE": SMOTE(),
        "Oversampling": RandomOverSampler(sampling_strategy='not majority'),
        "Undersampling": RandomUnderSampler(sampling_strategy='majority'),
        "NONE": None
    }

    skf = StratifiedKFold(n_splits=5)
    results = {}

    for name, resample in resampling_methods.items():
        print(name, end="\n\n")
        results[name] = {}

        for model_name, model in models.items():
            f_score = []
            print(f"Model: {model_name}", end="\n\n")

            for train_index, test_index in skf.split(X, y):
                X_train_, y_train_ = X.loc[train_index], y.loc[train_index]
                X_test_, y_test_ = X.loc[test_index], y.loc[test_index]

                if resample is None:  # "NONE": fit on the original training fold
                    model.fit(X_train_, y_train_)
                else:
                    X_train_resampled, y_train_resampled = resample.fit_resample(X_train_, y_train_)
                    model.fit(X_train_resampled, y_train_resampled)

                y_pred_ = model.predict(X_test_)

                f_score.append(f1_score(y_test_, y_pred_))

            results[name][model_name] = np.mean(f_score)
            print("The scores: ", end="\n\n")
            print([round(f, 2) for f in f_score], end="\n\n")
            print('F1-Score mean=%.5f' % (np.mean(f_score)), end="\n\n")

    return results, resampling_methods



results_resample, resampling_methods = crossvalidation(X_train_binary_scaled, y_train_binary, models)


SMOTE

Model: LogisticRegression

The scores: 

[0.19, 0.46, 0.5, 0.47, 0.42]

F1-Score mean=0.40885

Model: Tree

The scores: 

[0.66, 0.79, 0.8, 0.8, 0.71]

F1-Score mean=0.75390

Model: KNN

The scores: 

[0.76, 0.85, 0.87, 0.87, 0.77]

F1-Score mean=0.82550

Oversampling

Model: LogisticRegression

The scores: 

[0.18, 0.46, 0.5, 0.48, 0.42]

F1-Score mean=0.40727

Model: Tree

The scores: 

[0.68, 0.82, 0.85, 0.85, 0.72]

F1-Score mean=0.78438

Model: KNN

The scores: 

[0.77, 0.87, 0.88, 0.89, 0.77]

F1-Score mean=0.83522

Undersampling

Model: LogisticRegression

The scores: 

[0.15, 0.46, 0.5, 0.49, 0.43]

F1-Score mean=0.40558

Model: Tree

The scores: 

[0.58, 0.64, 0.65, 0.65, 0.57]

F1-Score mean=0.62156

Model: KNN

The scores: 

[0.69, 0.78, 0.79, 0.8, 0.68]

F1-Score mean=0.74698

NONE

Model: LogisticRegression

The scores: 

[0.17, 0.47, 0.5, 0.48, 0.42]

F1-Score mean=0.40801

Model: Tree

The scores: 

[0.66, 0.83, 0.85, 0.85, 0.72]

F1-Score mean=0.78168

Model: KNN

We get the following results (mean F1 over 5 folds):

| Model | SMOTE | Oversampling | Undersampling | None |
| --- | --- | --- | --- | ---- | 
| LogisticRegression | 0.40885 | 0.40727 | 0.40558  | 0.40801 |
| Decision Tree |  0.75390 | 0.78438 | 0.62156 | 0.78168 | 
| KNN | 0.82550 | 0.83522 | 0.74698 | 0.85536 | 


From the table:   
SMOTE wins for LogisticRegression by a narrow margin.        
Oversampling wins for Decision Tree.   
No resampling wins for KNN.  


To conclude: oversampling is the best overall compromise for us. It is the top choice for Decision Tree and within 0.002 of the best score for LogisticRegression. For KNN it trails no resampling by about 0.02, which we accept.
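
To make the chosen method concrete, here is a minimal NumPy sketch of what random oversampling with a "not majority" strategy does conceptually: every class except the largest is resampled with replacement until it matches the majority count. The helper `random_oversample` and the toy arrays are hypothetical and written only for this illustration; this is not imbalanced-learn's implementation.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate random minority samples until every class matches the majority count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()

    # Start with every original sample, then append the extra draws.
    selected = [np.arange(len(y))]
    for cls, count in zip(classes, counts):
        if count < target:
            cls_idx = np.flatnonzero(y == cls)
            selected.append(rng.choice(cls_idx, size=target - count, replace=True))

    idx = np.concatenate(selected)
    return X[idx], y[idx]


# Toy imbalanced labels (assumption): 90 / 8 / 2 samples per class.
y_toy = np.array([0] * 90 + [1] * 8 + [2] * 2)
X_toy = np.arange(len(y_toy), dtype=float).reshape(-1, 1)

X_res, y_res = random_oversample(X_toy, y_toy)
res_classes, res_counts = np.unique(y_res, return_counts=True)
print(dict(zip(res_classes.tolist(), res_counts.tolist())))  # {0: 90, 1: 90, 2: 90}
```

As in `crossvalidation`, resampling like this should be applied to the training part of each fold only, so the validation fold keeps its original class distribution.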



## Multiple Classification 

In [6]:
X_train_multi = mit_multi_train.drop(columns=['target'])
y_train_multi = mit_multi_train['target']

X_test_multi = mit_multi_test.drop(columns=['target'])
y_test_multi = mit_multi_test['target']

Find the best scaling method for multiclass classification

In [8]:
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler

# commonly used scalers ("None" means no scaling)
scalers = { 
    "StandardScaler": StandardScaler(),
    "MinMaxScaler": MinMaxScaler(),
    "RobustScaler": RobustScaler(),
    "None": None,
}

models = {
    "LogisticRegression": LogisticRegression(class_weight='balanced', max_iter=1000),
    "Tree": DecisionTreeClassifier(class_weight='balanced', random_state=42),
     # "SVM": SVC(class_weight='balanced', probability=True, random_state=42), # SVM is computationally too expensive, after 80 minutes, it was still running!!!
    "KNN": KNeighborsClassifier(n_neighbors=5, weights='distance', n_jobs=-1)
}

def evaluate_scalers(X, y, scalers, models):
    results = {}
    for scaler_name, scaler in scalers.items():
        print(f"Scaler: {scaler_name}", end="\n\n")

        if scaler is None: 
            X_scaled = X.to_numpy()
        else:
            X_scaled = scaler.fit_transform(X)

        skf = StratifiedKFold(n_splits=5)

        for model_name, model in models.items():
            f_score = []
            print(f"Model: {model_name}", end="\n\n")

            for train_index, test_index in skf.split(X_scaled, y):
                # skf.split yields positional indices, so index y by position
                X_train_, y_train_ = X_scaled[train_index], y.iloc[train_index]
                X_test_, y_test_ = X_scaled[test_index], y.iloc[test_index]

                model.fit(X_train_, y_train_)

                y_pred_ = model.predict(X_test_)

                f_score.append(f1_score(y_test_, y_pred_, average='weighted'))

            mean_f1_score = np.mean(f_score)
            print("The scores: ", end="\n\n")
            print([round(f, 2) for f in f_score], end="\n\n")
            print('F1-Score mean=%.5f' % (mean_f1_score), end="\n\n")

            if scaler_name not in results:
                results[scaler_name] = {}
            results[scaler_name][model_name] = mean_f1_score
    return results

# Apply evaluation
results = evaluate_scalers(X_train_multi, y_train_multi, scalers, models)
# Summarize the results
for scaler_name, model_scores in results.items():
    print(f"Scaler: {scaler_name}")
    for model_name, score in model_scores.items():
        print(f"  Model: {model_name}, F1-Score mean: {score:.5f}")
    print("\n")


Scaler: StandardScaler

Model: LogisticRegression

The scores: 

[0.74, 0.74, 0.74, 0.74, 0.74]

F1-Score mean=0.74037

Model: Tree

The scores: 

[0.96, 0.96, 0.96, 0.96, 0.96]

F1-Score mean=0.95981

Model: KNN

The scores: 

[0.97, 0.98, 0.97, 0.97, 0.97]

F1-Score mean=0.97421

Scaler: MinMaxScaler

Model: LogisticRegression

The scores: 

[0.74, 0.74, 0.74, 0.74, 0.74]

F1-Score mean=0.73981

Model: Tree

The scores: 

[0.96, 0.96, 0.96, 0.96, 0.96]

F1-Score mean=0.95981

Model: KNN

The scores: 

[0.97, 0.98, 0.98, 0.97, 0.98]

F1-Score mean=0.97559

Scaler: RobustScaler

Model: LogisticRegression



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

The scores: 

[0.74, 0.74, 0.74, 0.74, 0.74]

F1-Score mean=0.74129

Model: Tree

The scores: 

[0.96, 0.96, 0.96, 0.96, 0.96]

F1-Score mean=0.95981

Model: KNN

The scores: 

[0.97, 0.97, 0.97, 0.97, 0.97]

F1-Score mean=0.97366

Scaler: None

Model: LogisticRegression

The scores: 

[0.74, 0.74, 0.74, 0.74, 0.74]

F1-Score mean=0.73981

Model: Tree

The scores: 

[0.96, 0.96, 0.96, 0.96, 0.96]

F1-Score mean=0.95981

Model: KNN

The scores: 

[0.97, 0.98, 0.98, 0.97, 0.98]

F1-Score mean=0.97559

Scaler: StandardScaler
  Model: LogisticRegression, F1-Score mean: 0.74037
  Model: Tree, F1-Score mean: 0.95981
  Model: KNN, F1-Score mean: 0.97421


Scaler: MinMaxScaler
  Model: LogisticRegression, F1-Score mean: 0.73981
  Model: Tree, F1-Score mean: 0.95981
  Model: KNN, F1-Score mean: 0.97559


Scaler: RobustScaler
  Model: LogisticRegression, F1-Score mean: 0.74129
  Model: Tree, F1-Score mean: 0.95981
  Model: KNN, F1-Score mean: 0.97366


Scaler: None
  Model: LogisticRegression, F

We get the following results (mean weighted F1 over 5 folds):

| Model | StandardScaler | MinMaxScaler | RobustScaler | None |
| --- | --- | --- | --- | ---- | 
| LogisticRegression | 0.74037 | 0.73981  | 0.74129  | 0.73981 |
| Decision Tree | 0.95981  | 0.95981 | 0.95981 |0.95981  | 
| KNN | 0.97421  | 0.97559 | 0.97366 | 0.97559 | 


From the table:   
RobustScaler wins for LogisticRegression by a narrow margin.        
Decision Tree scores are identical across scalers, which is expected because tree splits do not depend on feature scale.        
MinMaxScaler and no scaling tie for the best KNN score, slightly ahead of the other scalers.      

KNN is well suited to anomaly detection, which is close to our use case, so we give more weight to the KNN results.

To conclude: MinMaxScaler is the best choice for us. Where it is not the best for a given model, the gap is very narrow (under 0.002 in F1), which we can accept.
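
The identical Decision Tree scores can be checked with a small experiment. The sketch below uses synthetic data (an assumption, not the MIT-BIH features) and trains one tree on raw features and one on min-max scaled features; because the scaling preserves the ordering of values within each feature, both trees should make the same predictions on essentially all points.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), split into fit and evaluation parts.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_fit, X_eval = X_demo[:200], X_demo[200:]
y_fit = y_demo[:200]

scaler_demo = MinMaxScaler().fit(X_fit)

# Same hyperparameters and seed; only the feature scale differs.
tree_raw = DecisionTreeClassifier(random_state=42).fit(X_fit, y_fit)
tree_scaled = DecisionTreeClassifier(random_state=42).fit(
    scaler_demo.transform(X_fit), y_fit
)

pred_raw = tree_raw.predict(X_eval)
pred_scaled = tree_scaled.predict(scaler_demo.transform(X_eval))

agreement = np.mean(pred_raw == pred_scaled)
print(f"share of identical predictions: {agreement:.3f}")
```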



In [9]:
## apply scaler to multiple classification
scaler = MinMaxScaler()

X_train_multi_scaled = scaler.fit_transform(X_train_multi)
X_test_multi_scaled = scaler.transform(X_test_multi)

X_train_multi_scaled = pd.DataFrame(X_train_multi_scaled, columns=X_train_multi.columns)
X_test_multi_scaled = pd.DataFrame(X_test_multi_scaled, columns=X_test_multi.columns)

In [11]:
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

models = {
    "LogisticRegression": LogisticRegression(class_weight='balanced', max_iter=1000),
    "Tree": DecisionTreeClassifier(class_weight='balanced', random_state=42),
    #"SVM": SVC(class_weight='balanced', probability=True, random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5, weights='distance', n_jobs=-1)
}

def crossvalidation(X, y, models):
    resampling_methods = {
        "SMOTE": SMOTE(),
        "Oversampling": RandomOverSampler(sampling_strategy='not majority'),
        "Undersampling": RandomUnderSampler(sampling_strategy='majority'),
        "NONE": None
    }

    skf = StratifiedKFold(n_splits=5)
    results = {}

    for name, resample in resampling_methods.items():
        print(name, end="\n\n")
        results[name] = {}

        for model_name, model in models.items():
            f_score = []
            print(f"Model: {model_name}", end="\n\n")

            for train_index, test_index in skf.split(X, y):
                X_train_, y_train_ = X.loc[train_index], y.loc[train_index]
                X_test_, y_test_ = X.loc[test_index], y.loc[test_index]

                if resample is None:  # "NONE": fit on the original training fold
                    model.fit(X_train_, y_train_)
                else:
                    X_train_resampled, y_train_resampled = resample.fit_resample(X_train_, y_train_)
                    model.fit(X_train_resampled, y_train_resampled)

                y_pred_ = model.predict(X_test_)

                f_score.append(f1_score(y_test_, y_pred_, average='weighted'))

            results[name][model_name] = np.mean(f_score)
            print("The scores: ", end="\n\n")
            print([round(f, 2) for f in f_score], end="\n\n")
            print('F1-Score mean=%.5f' % (np.mean(f_score)), end="\n\n")

    return results, resampling_methods



results_resample, resampling_methods = crossvalidation(X_train_multi_scaled, y_train_multi, models)


SMOTE

Model: LogisticRegression

The scores: 

[0.74, 0.74, 0.74, 0.74, 0.75]

F1-Score mean=0.74190

Model: Tree

The scores: 

[0.95, 0.95, 0.94, 0.94, 0.95]

F1-Score mean=0.94544

Model: KNN

The scores: 

[0.96, 0.96, 0.96, 0.96, 0.96]

F1-Score mean=0.95845

Oversampling

Model: LogisticRegression

The scores: 

[0.74, 0.74, 0.74, 0.74, 0.74]

F1-Score mean=0.74028

Model: Tree

The scores: 

[0.96, 0.96, 0.96, 0.96, 0.96]

F1-Score mean=0.95991

Model: KNN

The scores: 

[0.97, 0.97, 0.97, 0.97, 0.97]

F1-Score mean=0.96854

Undersampling

Model: LogisticRegression

The scores: 

[0.7, 0.7, 0.69, 0.7, 0.71]

F1-Score mean=0.69925

Model: Tree

The scores: 

[0.66, 0.68, 0.67, 0.68, 0.67]

F1-Score mean=0.67426

Model: KNN

The scores: 

[0.67, 0.67, 0.68, 0.67, 0.66]

F1-Score mean=0.66985

NONE

Model: LogisticRegression

The scores: 

[0.74, 0.74, 0.74, 0.74, 0.74]

F1-Score mean=0.73981

Model: Tree

The scores: 

[0.96, 0.96, 0.96, 0.96, 0.96]

F1-Score mean=0.95981

Model:

We get the following results (mean weighted F1 over 5 folds):

| Model | SMOTE | Oversampling | Undersampling | None |
| --- | --- | --- | --- | ---- | 
| LogisticRegression | 0.74190 | 0.74028 | 0.69925  | 0.73981 |
| Decision Tree | 0.94544  | 0.95991 | 0.67426 | 0.95981 | 
| KNN |0.95845 | 0.96854 | 0.66985 | 0.97559 | 


From the table:  
SMOTE wins for LogisticRegression by a narrow margin over oversampling.       
Oversampling wins for Decision Tree.  
No resampling wins for KNN.    

To conclude: oversampling is the best overall compromise for us. It is the top choice for Decision Tree and within 0.002 of the best score for LogisticRegression. For KNN it trails no resampling by about 0.007, which we accept.
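
When reading the multiclass scores, it may help to keep in mind that they use `average='weighted'`, which weights each class's F1 by its support, so the majority class dominates the result. The hand-computed sketch below (toy counts and the helper `per_class_f1` are assumptions for illustration only) contrasts weighted and macro averaging on an imbalanced three-class example:

```python
# Toy labels (assumption): 90 / 8 / 2 true samples for classes 0 / 1 / 2.
y_true = [0] * 90 + [1] * 8 + [2] * 2
# Toy predictions: class 0 all correct, half of class 1 missed, class 2 never found.
y_pred = [0] * 90 + [1] * 4 + [0] * 4 + [0] * 2


def per_class_f1(y_true, y_pred, cls):
    """F1 of one class from its TP/FP/FN counts (0.0 when undefined)."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


classes = sorted(set(y_true))
f1_by_class = {c: per_class_f1(y_true, y_pred, c) for c in classes}
support = {c: y_true.count(c) for c in classes}

# Macro: plain mean over classes. Weighted: mean weighted by class support.
macro_f1 = sum(f1_by_class.values()) / len(classes)
weighted_f1 = sum(f1_by_class[c] * support[c] for c in classes) / len(y_true)

print({c: round(v, 4) for c, v in f1_by_class.items()})
print(f"macro F1    = {macro_f1:.4f}")
print(f"weighted F1 = {weighted_f1:.4f}")
```

In this toy case the weighted score stays high even though one class is never predicted, while the macro score drops sharply. Reporting macro F1 alongside weighted F1 would be one way to see how the rare classes are handled.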

