# Stroke Prediction Model Searching v2

Let's do some research to create stroke prediction model based on the healthcare-dataset-stroke-data dataset.<br>
<br>
Source: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset<br>

#### Acknowledgements
(Confidential Source) - Use only for educational purposes
If you use this dataset in your research, please credit the author.

It is a second searching notebook. The first one did not provide satisfactory results.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import cross_val_score

Let's get together the pipelines from Stroke_Prediction_Cleansing_and_Preprocessing notebook.

In [2]:
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]

class MyLabelBinarizer(LabelBinarizer):
    def fit_transform(self, X, y=None):
        return super(LabelBinarizer, self).fit_transform(X)

class CustomLimitedImputer(BaseEstimator, TransformerMixin):
    ''' Simple customized imputer to change the following:
        smoking_status to "never smoked" if age < 10 and smoking_status = "Unknown"
        work_type to "children" if age < 17 and swork_type = "Never_worked" '''
    def __init__(self, attribute_names):
        assert all(attr in ['smoking_status', 'work_type'] for attr in attribute_names), 'Only smoking_status and work_type are supported'
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()
        for attr in self.attribute_names:
            if attr == 'smoking_status':
                X.loc[(X.age < 10) & (X.smoking_status == 'Unknown'), 'smoking_status'] = 'never smoked'
            elif attr == 'work_type':
                X.loc[(X.age < 17) & (X.work_type == 'Never_worked'), 'work_type'] = 'children'
        X.drop(['age'], axis=1, inplace=True)
        return X.values

# pipelines
cat_yn_pipeline = Pipeline([
        ("select_bin", DataFrameSelector(['ever_married'])),
        ("bin_encoder", MyLabelBinarizer()),
    ])

cat_oh_pipeline = Pipeline([
        ("select_cat", DataFrameSelector(['smoking_status', 'work_type', 'age'])), # age is used as a parameter
        ("imputer", CustomLimitedImputer(['smoking_status', 'work_type'])),
        ("cat_encoder", OneHotEncoder(sparse=False)),
    ])

num_pipeline = Pipeline([
        ("select_numeric", DataFrameSelector(['age', 'avg_glucose_level'])),
        ("scale", StandardScaler()),
    ])

# final preprocessing pipeline
preprocess_pipeline = FeatureUnion(transformer_list=[
        ('cat_yn_pipeline', cat_yn_pipeline),
        ('cat_oh_pipeline', cat_oh_pipeline),
        ('num_pipeline', num_pipeline),
    ])

In [3]:
data = pd.read_csv('Data/healthcare-dataset-stroke-data.csv', index_col='id')

In [4]:
data

Unnamed: 0_level_0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...
18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0


In [5]:
X_train_val, X_test, y_train_val, y_test = train_test_split(data[data.columns[:-1]], data.stroke, test_size=0.2, random_state=24, stratify=data.stroke)

In [6]:
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.2, random_state=24, stratify=y_train_val)

In [7]:
X_train = preprocess_pipeline.fit_transform(X_train)
X_val = preprocess_pipeline.transform(X_val)
X_test = preprocess_pipeline.transform(X_test)

In [8]:
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)
print(y_train.shape)
print(y_val.shape)
print(y_test.shape)
print(y_train.value_counts())
print(y_val.value_counts())

(3270, 12)
(818, 12)
(1022, 12)
(3270,)
(818,)
(1022,)
0    3111
1     159
Name: stroke, dtype: int64
0    778
1     40
Name: stroke, dtype: int64


The train set will be used for training the models, the val set to validate the models and the test set will be locked until the final model and hyperparameters will be chosen.

Firstly, the metric needs to be selected.<br>
The choice is F2 score because is better for the models where positive class and preventing false negatives are more important.<br>
Alternatively, the PR AUC can be considered.<br>

In [8]:
from sklearn.metrics import fbeta_score, make_scorer

In [9]:
f2_scorer = make_scorer(fbeta_score, beta=2)

Let's start with a couple basic models.

Logistic Regression

In [18]:
from sklearn.linear_model import LogisticRegression

In [35]:
lreg = LogisticRegression(random_state=24)

lreg_scores = cross_val_score(lreg, X_train, y_train, cv=10, scoring=f2_scorer)
lreg_scores.mean()

0.0

Nice, good start.

Linear Discriminant Analysis

In [36]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [38]:
lda = LinearDiscriminantAnalysis()

lda_scores = cross_val_score(lda, X_train, y_train, cv=10, scoring=f2_scorer)
lda_scores.mean()

0.051719022855985106

Gaussian Naive Bayes

In [40]:
from sklearn.naive_bayes import GaussianNB

In [41]:
gnb = GaussianNB()

gnb_scores = cross_val_score(gnb, X_train, y_train, cv=10, scoring=f2_scorer)
gnb_scores.mean()

0.2633839674974345

K-Neighbors Classifier

In [44]:
from sklearn.neighbors import KNeighborsClassifier

In [49]:
neigh = KNeighborsClassifier()

neigh_scores = cross_val_score(neigh, X_train, y_train, cv=10, scoring=f2_scorer)
neigh_scores.mean()

0.06736892424757447

Support Vector Machines

In [51]:
from sklearn.svm import SVC 

In [52]:
# C-Support Vector Classification
svc = SVC()

svc_scores = cross_val_score(svc, X_train, y_train, cv=10, scoring=f2_scorer)
svc_scores.mean()

0.0

In [54]:
from sklearn.svm import LinearSVC

In [55]:
#Linear Support Vector Classification
lsvc = LinearSVC()

lsvc_scores = cross_val_score(lsvc, X_train, y_train, cv=10, scoring=f2_scorer)
lsvc_scores.mean()

0.0

In [57]:
from sklearn.linear_model import SGDClassifier

In [58]:
# SVM with stochastic gradient descent (SGD)
sgdc = SGDClassifier(random_state=24)

sgdc_scores = cross_val_score(sgdc, X_train, y_train, cv=10, scoring=f2_scorer)
sgdc_scores.mean()

0.0

Decision Tree Classifier

In [59]:
from sklearn.tree import DecisionTreeClassifier

In [60]:
dtc = DecisionTreeClassifier(random_state=24)

dtc_scores = cross_val_score(dtc, X_train, y_train, cv=10, scoring=f2_scorer)
dtc_scores.mean()

0.14150210093329052

Random Forest Classifier

In [11]:
from sklearn.ensemble import RandomForestClassifier

In [12]:
rfc = RandomForestClassifier(random_state=24)

rfc_scores = cross_val_score(rfc, X_train, y_train, cv=10, scoring=f2_scorer)
rfc_scores.mean()

0.04307001428863033

In [None]:
Extra Trees Classifier

In [13]:
from sklearn.ensemble import ExtraTreesClassifier

In [14]:
etc = ExtraTreesClassifier(random_state=24)

etc_scores = cross_val_score(etc, X_train, y_train, cv=10, scoring=f2_scorer)
etc_scores.mean()

0.06170446876071828

Bagging Classifier

In [15]:
from sklearn.ensemble import BaggingClassifier

In [16]:
bc = BaggingClassifier(random_state=24)

bc_scores = cross_val_score(bc, X_train, y_train, cv=10, scoring=f2_scorer)
bc_scores.mean()

0.0636905976924348

Gradient Boosting Classifier

In [11]:
from sklearn.ensemble import GradientBoostingClassifier

In [12]:
gbc = GradientBoostingClassifier(random_state=24)

gbc_scores = cross_val_score(gbc, X_train, y_train, cv=10, scoring=f2_scorer)
gbc_scores.mean()

0.01551226551226551

AdaBoost Classifier

In [19]:
from sklearn.ensemble import AdaBoostClassifier

In [25]:
abc = AdaBoostClassifier(random_state=24)

abc_scores = cross_val_score(abc, X_train, y_train, cv=10, scoring=f2_scorer)
abc_scores.mean()

0.008064516129032256

Hist Gradient Boosting Classifier

In [27]:
from sklearn.ensemble import HistGradientBoostingClassifier
# from sklearn.experimental import enable_hist_gradient_boosting

In [30]:
hgbc = HistGradientBoostingClassifier(random_state=24)

hgbc_scores = cross_val_score(hgbc, X_train, y_train, cv=10, scoring=f2_scorer)
hgbc_scores.mean()

0.06430613376193506

Let's look into some outlier detection and anomaly detection algorithms.

The new datasets are required to evaluate the model.

In [38]:
X_train_out, X_test_out, y_train_out, y_test_out = train_test_split(X_train, y_train, test_size=0.5, random_state=24, stratify=y_train)

In [None]:
One-Class SVM

In [31]:
from sklearn.svm import OneClassSVM

In [39]:
oc_svm = OneClassSVM()

X_train_out = X_train_out[y_train_out==0]
oc_svm.fit(X_train_out)

y_pred = oc_svm.predict(X_test_out)

# the predict output is -1 and 1 instead of 1 and 0, respectively
y_test_out[y_test_out == 1] = -1
y_test_out[y_test_out == 0] = 1

f2_score = fbeta_score(y_test_out, y_pred, pos_label=-1, beta=2)
f2_score

0.3093220338983051

Isolation Forest

In [40]:
from sklearn.ensemble import IsolationForest

In [41]:
ilf = IsolationForest(random_state=24)
ilf.fit(X_train_out)

y_pred = ilf.predict(X_test_out)


f2_score = fbeta_score(y_test_out, y_pred, pos_label=-1, beta=2)
f2_score

0.27702089009990916

Local Outlier Factor

In [42]:
from sklearn.neighbors import LocalOutlierFactor

In [43]:
lof = LocalOutlierFactor(n_neighbors=5, novelty=True)
lof.fit(X_train_out)

y_pred = lof.predict(X_test_out)


f2_score = fbeta_score(y_test_out, y_pred, pos_label=-1, beta=2)
f2_score

0.11930585683297178

First run with default hiperparameters is done. Let's check the results:<br>
Result  Classification Algorithm<br>
__________________________________
0.0&emsp;&emsp;LogisticRegression<br>
0.0517&emsp;LinearDiscriminantAnalysis<br>
0.2634&emsp;GaussianNB<br>
0.0674&emsp;KNeighborsClassifier<br>
0.0&emsp;&emsp;SVC<br>
0.0&emsp;&emsp;LinearSVC<br>
0.0&emsp;&emsp;SGDClassifier<br>
0.1415&emsp;DecisionTreeClassifier<br>
0.0431&emsp;RandomForestClassifier<br>
0.0617&emsp;ExtraTreesClassifier<br>
0.0637&emsp;BaggingClassifier<br>
0.0155&emsp;GradientBoostingClassifier<br>
0.0081&emsp;AdaBoostClassifier<br>
0.0643&emsp;HistGradientBoostingClassifier<br>
<br>
Result  Outlier Detection/Anomaly Detection Algorithm<br>
_____________________________________________________
0.3093&emsp;OneClassSVM<br>
0.2770&emsp;IsolationForest<br>
0.1193&emsp;LocalOutlierFactor<br>
<br>
Looks like the outlier/anomaly detection algorithms perform better. Anyway, before jumping to the conclusion let's check the classification algorithms with some data resampling. <br>

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC 
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import HistGradientBoostingClassifier

In [11]:
models = [
    LogisticRegression(random_state=24),
    LinearDiscriminantAnalysis(),
    GaussianNB(),
    KNeighborsClassifier(),
    SVC(),
    LinearSVC(),
    SGDClassifier(random_state=24),
    DecisionTreeClassifier(random_state=24),
    RandomForestClassifier(random_state=24),
    ExtraTreesClassifier(random_state=24),
    BaggingClassifier(random_state=24),
    GradientBoostingClassifier(random_state=24),
    AdaBoostClassifier(random_state=24),
    HistGradientBoostingClassifier(random_state=24)
]

In [12]:
def get_models_scores(models_in, X, y, cv):
    ''' Get scores for models based on cross_val_score.
        Parameters:
            models_in - models to evaluate
            X         - X train dataset
            y         - y train dataset
            cv        - cross-validation splitting strategy
        Result:
            Dataframe with the coss_val_score results'''
    results = pd.DataFrame(columns=['Model', cv])
    for model in models_in:
        results.loc[len(results)] = [str(model).split('(')[0], round(cross_val_scores(model, X, y, cv), 4)]
    results.index += 1
    return results

In [13]:
def cross_val_scores(model, X, y, cv):
    ''' Calculate the cross validation scores based on f2_scores metric and then the mean of them.
        Parameters:
            model    - model to be used for fitting data and predictions
            X        - X train dataset
            y        - y train dataset
            cv       - Cross validation splitting strategy
        Returns:
            The mean value of calculated f2 scores. '''
    scores = cross_val_score(model, X, y, cv=cv, scoring=f2_scorer)
    return scores.mean()

Before resampling let's check if RepeatedStratifiedKFold has any impact on the results.

In [14]:
from sklearn.model_selection import RepeatedStratifiedKFold
splitter = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

In [30]:
scores = get_models_scores(models, X_train, y_train, cv=splitter)

In [31]:
scores

Unnamed: 0,Model,"RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=1)"
1,LogisticRegression,0.0
2,LinearDiscriminantAnalysis,0.0402
3,GaussianNB,0.2632
4,KNeighborsClassifier,0.0717
5,SVC,0.0
6,LinearSVC,0.0
7,SGDClassifier,0.0
8,DecisionTreeClassifier,0.1564
9,RandomForestClassifier,0.0677
10,ExtraTreesClassifier,0.0646


Seems they are slightly better, generally speaking.

Resampling

Random Oversampling

In [15]:
from imblearn.over_sampling import RandomOverSampler

In [33]:
ros = RandomOverSampler(sampling_strategy=0.1, random_state=24)
X_ros, y_ros = ros.fit_resample(X_train, y_train)

In [34]:
'X', X_ros.shape, X_train.shape, 'y', y_ros.shape, y_train.shape

('X', (3422, 12), (3270, 12), 'y', (3422,), (3270,))

In [35]:
y_ros.value_counts(), y_train.value_counts()

(0    3111
 1     311
 Name: stroke, dtype: int64,
 0    3111
 1     159
 Name: stroke, dtype: int64)

In [36]:
ros_scores = get_models_scores(models, X_ros, y_ros, cv=splitter)
ros_scores

Unnamed: 0,Model,"RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=1)"
1,LogisticRegression,0.0582
2,LinearDiscriminantAnalysis,0.1631
3,GaussianNB,0.4076
4,KNeighborsClassifier,0.3514
5,SVC,0.0
6,LinearSVC,0.0
7,SGDClassifier,0.0
8,DecisionTreeClassifier,0.7731
9,RandomForestClassifier,0.7966
10,ExtraTreesClassifier,0.7896


In [44]:
scores.values[:, 1].mean()

0.05753571428571428

In [45]:
ros_scores.values[:, 1].mean()

0.3636928571428571

In [48]:
sample_vs = [0.2, 0.4, 0.6, 0.8, 1]
for sample_v in sample_vs:
    ros = RandomOverSampler(sampling_strategy=sample_v, random_state=24)
    print('==========================')
    print('Sampling strategy = ', sample_v)
    X_ros, y_ros = ros.fit_resample(X_train, y_train)
    ros_scores = get_models_scores(models, X_ros, y_ros, cv=splitter)
    display(ros_scores)
    print('Mean', ros_scores.values[:, 1].mean())

Sampling strategy =  0.2


Unnamed: 0,Model,"RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=1)"
1,LogisticRegression,0.2915
2,LinearDiscriminantAnalysis,0.3719
3,GaussianNB,0.5704
4,KNeighborsClassifier,0.7569
5,SVC,0.2851
6,LinearSVC,0.2492
7,SGDClassifier,0.2074
8,DecisionTreeClassifier,0.9414
9,RandomForestClassifier,0.9549
10,ExtraTreesClassifier,0.9609


Mean 0.59375
Sampling strategy =  0.4


Unnamed: 0,Model,"RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=1)"
1,LogisticRegression,0.6168
2,LinearDiscriminantAnalysis,0.6506
3,GaussianNB,0.719
4,KNeighborsClassifier,0.934
5,SVC,0.6761
6,LinearSVC,0.6208
7,SGDClassifier,0.6095
8,DecisionTreeClassifier,0.9751
9,RandomForestClassifier,0.9824
10,ExtraTreesClassifier,0.9875


Mean 0.7935142857142857
Sampling strategy =  0.6


Unnamed: 0,Model,"RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=1)"
1,LogisticRegression,0.7105
2,LinearDiscriminantAnalysis,0.7246
3,GaussianNB,0.7908
4,KNeighborsClassifier,0.9622
5,SVC,0.7698
6,LinearSVC,0.7194
7,SGDClassifier,0.7166
8,DecisionTreeClassifier,0.9848
9,RandomForestClassifier,0.9886
10,ExtraTreesClassifier,0.992


Mean 0.851057142857143
Sampling strategy =  0.8


Unnamed: 0,Model,"RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=1)"
1,LogisticRegression,0.7599
2,LinearDiscriminantAnalysis,0.7599
3,GaussianNB,0.8337
4,KNeighborsClassifier,0.9712
5,SVC,0.8036
6,LinearSVC,0.763
7,SGDClassifier,0.7517
8,DecisionTreeClassifier,0.9888
9,RandomForestClassifier,0.9911
10,ExtraTreesClassifier,0.9937


Mean 0.8783714285714287
Sampling strategy =  1


Unnamed: 0,Model,"RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=1)"
1,LogisticRegression,0.8009
2,LinearDiscriminantAnalysis,0.8045
3,GaussianNB,0.858
4,KNeighborsClassifier,0.9768
5,SVC,0.8363
6,LinearSVC,0.801
7,SGDClassifier,0.795
8,DecisionTreeClassifier,0.9914
9,RandomForestClassifier,0.9934
10,ExtraTreesClassifier,0.995


Mean 0.8999642857142857


In [16]:
def get_models_scores_vs_val_for_sampling(models_in, X, y, cv, X_v, y_v, sample_value):
    ''' Get scores for models based on cross_val_score on train dataset 
        and prediction scores for valudation dataset.
        Parameters:
            models_in    - models to evaluate
            X            - X train dataset
            y            - y train dataset
            cv           - cross-validation splitting strategy
            X_v          - X validation dataset
            y_v          - y validation dataset
            sample_value - used only for Dataframe columns name
        Result:
            Dataframe with the coss_val_score and validation results'''
    results = pd.DataFrame()
    results['Models'] = [str(model).split('(')[0] for model in models_in]
    results.set_index('Models', inplace=True)
    for model in models_in:
        results.loc[str(model).split('(')[0], [f'ss={sample_value} train']] = round(cross_val_scores(model, X, y, cv), 4)
        results.loc[str(model).split('(')[0], [f'ss={sample_value} val']] = round(get_val_pred_score(model, X, y, X_v, y_v), 4)
    return results

In [17]:
def get_val_pred_score(model, X, y, X_v, y_v):
    ''' Calculate the f2_score of prediction for given model. 
        Parameters:
            model    - model to be used for fitting data and predictions
            X        - X train dataset
            y        - y train dataset
            X_v      - X validation dataset (will work for test dataset as well)
            y_v      - y validation dataset (will work for test dataset as well)
        Result:
            f2_score of predicted values. '''
    model.fit(X, y)
    y_pred = model.predict(X_v)
    return fbeta_score(y_v, y_pred, beta=2)

In [87]:
sample_vs = [x/10 for x in range(1, 10, 2)]+[1]
ros_results = pd.DataFrame()
for sample_v in sample_vs:
    ros = RandomOverSampler(sampling_strategy=sample_v, random_state=24)
#     print('==========================')
#     print('Sampling strategy = ', sample_v)
    X_ros, y_ros = ros.fit_resample(X_train, y_train)
    ros_scores = get_models_scores_vs_val_for_sampling(models, X_ros, y_ros, splitter, X_val, y_val, sample_v)
    ros_results = pd.concat([ros_results, ros_scores], axis=1)
ros_results

Unnamed: 0_level_0,ss=0.1 train,ss=0.1 val,ss=0.3 train,ss=0.3 val,ss=0.5 train,ss=0.5 val,ss=0.7 train,ss=0.7 val,ss=0.9 train,ss=0.9 val,ss=1 train,ss=1 val
Models,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
LogisticRegression,0.0582,0.0,0.5009,0.3802,0.6704,0.3963,0.7443,0.4365,0.7863,0.4084,0.8009,0.4096
LinearDiscriminantAnalysis,0.1631,0.0565,0.5424,0.3788,0.6835,0.4154,0.7476,0.4319,0.7848,0.4237,0.8045,0.4118
GaussianNB,0.4076,0.271,0.6634,0.2621,0.7594,0.2567,0.8157,0.2564,0.8467,0.2538,0.858,0.2528
KNeighborsClassifier,0.3514,0.1768,0.8874,0.2281,0.9524,0.2281,0.9677,0.2281,0.9743,0.2281,0.9768,0.2281
SVC,0.0,0.0,0.5606,0.2917,0.7408,0.3583,0.7903,0.3239,0.8266,0.3476,0.8363,0.3759
LinearSVC,0.0,0.0,0.4734,0.377,0.6776,0.3892,0.7456,0.4331,0.7865,0.4177,0.801,0.4127
SGDClassifier,0.0,0.0,0.4311,0.4315,0.6406,0.4108,0.7377,0.4373,0.7806,0.3795,0.795,0.4146
DecisionTreeClassifier,0.7731,0.1733,0.9692,0.1302,0.9809,0.1202,0.9863,0.1036,0.9896,0.0781,0.9914,0.0777
RandomForestClassifier,0.7966,0.0843,0.9768,0.1064,0.9864,0.0785,0.9901,0.0798,0.9923,0.1036,0.9934,0.1042
ExtraTreesClassifier,0.7896,0.1124,0.9833,0.1143,0.9903,0.0852,0.9933,0.0872,0.9949,0.0857,0.995,0.0862


Let's run it for more oversampling and undersampling techniques.

In [18]:
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.over_sampling import SVMSMOTE
from imblearn.over_sampling import SMOTENC
from imblearn.over_sampling import ADASYN

from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import CondensedNearestNeighbour
from imblearn.under_sampling import TomekLinks
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.under_sampling import NeighbourhoodCleaningRule

In [19]:
# resampling techniques which use the sampling_strategy
resampling_tqs_ss = [
    SMOTE,
    BorderlineSMOTE,
    SVMSMOTE,
    SMOTENC,
    ADASYN,
    RandomUnderSampler,
]

# resampling techniques which not use the sampling_strategy
resampling_tqs_no_ss = [
    CondensedNearestNeighbour,
    TomekLinks, # no random
    EditedNearestNeighbours, # no random
    NeighbourhoodCleaningRule # no random
]

# techniques without random state
tqs_wo_random_state = [TomekLinks, EditedNearestNeighbours, NeighbourhoodCleaningRule]

In [20]:
def get_train_val_results_for_sampling(sampling_tq, tqs_wo_rs, sample_vs, models, X, y, splitter, X_val, y_val, ss=True):
    ''' Get scores for models based on cross_val_score on train dataset 
        and prediction scores for valudation dataset
        based on resampling technique and sampling strategies.
    Parameters:
        sampling_tq  - sampling technique
        tqs_wo_rs    - list of techniques which not use a random_state
        sample_vs    - list of ratios to be used for methods with sampling_strategy
        models       - models to evaluate
        X            - X train dataset
        y            - y train dataset
        splitter     - cross-validation splitting strategy
        X_val        - X validation dataset
        y_val        - y validation dataset
        ss           - indicate whether the models use sampling_strategy or not
    Result:
        Dataframe with the coss_val_score and validation results for given models and sampling technique. '''
    
    stq_results = pd.DataFrame()
    if ss:
        for sample_v in sample_vs:
            if sampling_tq == SMOTENC:
                stq = sampling_tq(sampling_strategy=sample_v, categorical_features=[x for x in range(10)], random_state=24)
            elif sampling_tq in tqs_wo_rs:
                stq = sampling_tq(sampling_strategy=sample_v)
            else:
                stq = sampling_tq(sampling_strategy=sample_v, random_state=24)
            X_stq, y_stq = stq.fit_resample(X, y)
            stq_scores = get_models_scores_vs_val_for_sampling(models, X_stq, y_stq, splitter, X_val, y_val, sample_v)
            stq_results = pd.concat([stq_results, stq_scores], axis=1)
    else:
        if sampling_tq in tqs_wo_rs:
            stq = sampling_tq()
        else:
            stq = sampling_tq(random_state=24)
        X_stq, y_stq = stq.fit_resample(X, y)
        stq_results = get_models_scores_vs_val_for_sampling(models, X_stq, y_stq, splitter, X_val, y_val, sample_v)
    return stq_results

In [21]:
sample_vs = [x/10 for x in range(1, 10, 2)]+[1]
for resampling_tq in resampling_tqs_ss:
    resampling_tq_str = str(resampling_tq).split('.')[-1][:-2]
    print(f'Train vs validation results for {resampling_tq_str} technique.')
    display(get_train_val_results_for_sampling(resampling_tq, 
                                               tqs_wo_random_state, 
                                               sample_vs, 
                                               models, 
                                               X_train, y_train, 
                                               splitter, 
                                               X_val, y_val))
    print()

Train vs validation results for SMOTE technique.


Unnamed: 0_level_0,ss=0.1 train,ss=0.1 val,ss=0.3 train,ss=0.3 val,ss=0.5 train,ss=0.5 val,ss=0.7 train,ss=0.7 val,ss=0.9 train,ss=0.9 val,ss=1 train,ss=1 val
Models,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
LogisticRegression,0.0959,0.0,0.5437,0.3745,0.6909,0.4066,0.7341,0.3846,0.7881,0.399,0.808,0.3865
LinearDiscriminantAnalysis,0.2396,0.0838,0.5802,0.3506,0.7039,0.4303,0.7445,0.4178,0.7974,0.4106,0.8201,0.4019
GaussianNB,0.4154,0.2759,0.6813,0.2778,0.777,0.2774,0.8282,0.2743,0.8568,0.2743,0.8713,0.2755
KNeighborsClassifier,0.3469,0.1823,0.7436,0.2222,0.8657,0.2226,0.8969,0.2358,0.9266,0.241,0.9355,0.236
SVC,0.0013,0.0,0.6032,0.3,0.7468,0.3642,0.8009,0.3478,0.8414,0.3342,0.8558,0.3316
LinearSVC,0.0,0.0,0.5438,0.3654,0.6979,0.4054,0.7403,0.3856,0.7955,0.4034,0.8106,0.3929
SGDClassifier,0.0226,0.0,0.4806,0.3052,0.686,0.4084,0.7471,0.4063,0.8109,0.3896,0.8075,0.4134
DecisionTreeClassifier,0.4386,0.25,0.7094,0.2058,0.8168,0.1953,0.857,0.1946,0.902,0.2236,0.9048,0.22
RandomForestClassifier,0.3916,0.0829,0.7652,0.1802,0.8657,0.1452,0.9031,0.1758,0.9254,0.1838,0.9396,0.1685
ExtraTreesClassifier,0.4846,0.0789,0.7731,0.1556,0.8703,0.1282,0.9069,0.1765,0.9292,0.1705,0.9361,0.1866



Train vs validation results for BorderlineSMOTE technique.


Unnamed: 0_level_0,ss=0.1 train,ss=0.1 val,ss=0.3 train,ss=0.3 val,ss=0.5 train,ss=0.5 val,ss=0.7 train,ss=0.7 val,ss=0.9 train,ss=0.9 val,ss=1 train,ss=1 val
Models,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
LogisticRegression,0.1543,0.0,0.6279,0.3933,0.7588,0.405,0.8161,0.3868,0.8606,0.4212,0.8725,0.4068
LinearDiscriminantAnalysis,0.2867,0.0811,0.6666,0.3717,0.7807,0.4328,0.8302,0.4201,0.8657,0.4113,0.8783,0.403
GaussianNB,0.4207,0.2789,0.6979,0.2806,0.7941,0.2802,0.8448,0.2806,0.875,0.2826,0.8855,0.2806
KNeighborsClassifier,0.451,0.1832,0.8058,0.2697,0.8941,0.2642,0.931,0.2491,0.9464,0.266,0.9527,0.2534
SVC,0.1373,0.0,0.7444,0.3529,0.8348,0.363,0.876,0.3549,0.8987,0.3433,0.8975,0.3529
LinearSVC,0.0132,0.0,0.6347,0.3788,0.7665,0.3963,0.8231,0.3955,0.8664,0.4122,0.8795,0.4061
SGDClassifier,0.0463,0.0,0.6289,0.4204,0.7508,0.4076,0.8355,0.3869,0.8576,0.4286,0.883,0.4201
DecisionTreeClassifier,0.465,0.1136,0.7659,0.1717,0.8492,0.1483,0.8898,0.1875,0.9142,0.2016,0.9198,0.2033
RandomForestClassifier,0.453,0.1093,0.8101,0.1914,0.8961,0.1786,0.9264,0.1923,0.9438,0.1883,0.95,0.1883
ExtraTreesClassifier,0.4886,0.1316,0.8284,0.1628,0.9021,0.1106,0.9292,0.1899,0.9435,0.1674,0.9541,0.1867



Train vs validation results for SVMSMOTE technique.


Unnamed: 0_level_0,ss=0.1 train,ss=0.1 val,ss=0.3 train,ss=0.3 val,ss=0.5 train,ss=0.5 val,ss=0.7 train,ss=0.7 val,ss=0.9 train,ss=0.9 val,ss=1 train,ss=1 val
Models,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
LogisticRegression,0.0874,0.0,0.4873,0.2597,0.6712,0.3763,0.744,0.405,0.8012,0.3959,0.8121,0.3868
LinearDiscriminantAnalysis,0.2083,0.0843,0.5501,0.2754,0.6991,0.3966,0.7692,0.4367,0.8163,0.4213,0.8264,0.4121
GaussianNB,0.3775,0.2797,0.6209,0.2814,0.7265,0.2865,0.7861,0.2802,0.8258,0.281,0.8389,0.2798
KNeighborsClassifier,0.3404,0.1613,0.7493,0.2889,0.8324,0.2419,0.8818,0.2471,0.9132,0.266,0.9214,0.2698
SVC,0.0079,0.0,0.5971,0.2336,0.7552,0.3612,0.8152,0.3691,0.8585,0.3616,0.8744,0.356
LinearSVC,0.0,0.0,0.4694,0.2093,0.6772,0.3873,0.7541,0.4012,0.8094,0.3924,0.8232,0.3835
SGDClassifier,0.0269,0.1087,0.4042,0.2212,0.6615,0.381,0.7499,0.4268,0.8246,0.4155,0.8292,0.4156
DecisionTreeClassifier,0.3976,0.0691,0.6977,0.0676,0.7922,0.1674,0.8314,0.2263,0.8757,0.1717,0.8876,0.1502
RandomForestClassifier,0.3689,0.1117,0.7458,0.1485,0.8411,0.1422,0.8856,0.2045,0.9131,0.1739,0.9173,0.1948
ExtraTreesClassifier,0.404,0.1337,0.7675,0.1699,0.8491,0.1422,0.888,0.2009,0.9164,0.1528,0.9222,0.1667



Train vs validation results for SMOTENC technique.


Unnamed: 0_level_0,ss=0.1 train,ss=0.1 val,ss=0.3 train,ss=0.3 val,ss=0.5 train,ss=0.5 val,ss=0.7 train,ss=0.7 val,ss=0.9 train,ss=0.9 val,ss=1 train,ss=1 val
Models,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
LogisticRegression,0.1373,0.0,0.5798,0.3252,0.7068,0.3746,0.7572,0.3561,0.7896,0.3476,0.8162,0.3488
LinearDiscriminantAnalysis,0.1782,0.0,0.5602,0.2532,0.7064,0.3746,0.7583,0.374,0.7966,0.3856,0.8176,0.3571
GaussianNB,0.4227,0.2669,0.687,0.2395,0.7871,0.2395,0.8373,0.2395,0.8687,0.2395,0.8808,0.2395
KNeighborsClassifier,0.3464,0.1309,0.7733,0.1969,0.8617,0.1812,0.8995,0.1993,0.9229,0.2063,0.9318,0.2012
SVC,0.0182,0.0,0.6424,0.2521,0.7626,0.3025,0.8007,0.3165,0.837,0.2967,0.8538,0.3056
LinearSVC,0.0664,0.0,0.5716,0.2754,0.7182,0.3698,0.7628,0.3561,0.7959,0.359,0.819,0.3479
SGDClassifier,0.093,0.0,0.552,0.2041,0.6692,0.355,0.7618,0.3467,0.8053,0.3417,0.8108,0.3365
DecisionTreeClassifier,0.4226,0.2064,0.7296,0.2579,0.8316,0.0772,0.8692,0.119,0.9005,0.2372,0.9099,0.1807
RandomForestClassifier,0.3776,0.0798,0.767,0.1345,0.8585,0.125,0.9012,0.1544,0.9216,0.2037,0.9343,0.1718
ExtraTreesClassifier,0.4291,0.1302,0.7703,0.1339,0.8541,0.2024,0.8989,0.1931,0.9186,0.1901,0.9277,0.2015



Train vs validation results for ADASYN technique.


Unnamed: 0_level_0,ss=0.1 train,ss=0.1 val,ss=0.3 train,ss=0.3 val,ss=0.5 train,ss=0.5 val,ss=0.7 train,ss=0.7 val,ss=0.9 train,ss=0.9 val,ss=1 train,ss=1 val
Models,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
LogisticRegression,0.0642,0.0,0.4926,0.3774,0.682,0.403,0.7468,0.3958,0.7879,0.3922,0.81,0.4048
LinearDiscriminantAnalysis,0.1485,0.0286,0.5354,0.3717,0.6986,0.4252,0.7587,0.4113,0.7955,0.4077,0.8171,0.406
GaussianNB,0.4129,0.2751,0.6741,0.274,0.7705,0.2743,0.8287,0.2743,0.855,0.2717,0.8656,0.2717
KNeighborsClassifier,0.299,0.2094,0.7544,0.2425,0.8538,0.2475,0.8959,0.2454,0.9218,0.2259,0.9311,0.2609
SVC,0.0013,0.0,0.5489,0.2964,0.7493,0.3448,0.8084,0.3492,0.8418,0.3351,0.8539,0.3426
LinearSVC,0.0,0.0,0.4743,0.3953,0.6886,0.4018,0.7549,0.4134,0.7937,0.4024,0.8136,0.4019
SGDClassifier,0.0078,0.0,0.3552,0.0,0.6463,0.4201,0.7366,0.4122,0.808,0.4125,0.7916,0.4057
DecisionTreeClassifier,0.4016,0.1382,0.7129,0.1489,0.8149,0.1498,0.8722,0.1793,0.902,0.1255,0.909,0.1562
RandomForestClassifier,0.348,0.082,0.7613,0.1754,0.8535,0.1464,0.8995,0.1538,0.9215,0.2007,0.9286,0.2206
ExtraTreesClassifier,0.4207,0.1323,0.7627,0.1549,0.8629,0.1417,0.9009,0.214,0.9248,0.203,0.9295,0.2022



Train vs validation results for RandomUnderSampler technique.


Unnamed: 0_level_0,ss=0.1 train,ss=0.1 val,ss=0.3 train,ss=0.3 val,ss=0.5 train,ss=0.5 val,ss=0.7 train,ss=0.7 val,ss=0.9 train,ss=0.9 val,ss=1 train,ss=1 val
Models,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
LogisticRegression,0.073,0.0,0.5432,0.371,0.6728,0.4094,0.7352,0.3916,0.7687,0.3947,0.7749,0.3837
LinearDiscriminantAnalysis,0.1874,0.0281,0.5856,0.396,0.7004,0.419,0.7388,0.4113,0.7716,0.4028,0.7808,0.3941
GaussianNB,0.4097,0.2703,0.6718,0.2581,0.7654,0.2548,0.8115,0.2494,0.8445,0.2484,0.855,0.2478
KNeighborsClassifier,0.1442,0.1685,0.4802,0.3036,0.6506,0.3324,0.717,0.3365,0.793,0.3354,0.8128,0.334
SVC,0.0,0.0,0.52,0.3636,0.6997,0.3944,0.7641,0.3771,0.7988,0.3625,0.8168,0.3622
LinearSVC,0.0,0.0,0.5533,0.3684,0.677,0.4143,0.7359,0.4005,0.7674,0.391,0.7776,0.3846
SGDClassifier,0.0705,0.0,0.5503,0.1732,0.6343,0.3872,0.6267,0.2648,0.703,0.3423,0.6854,0.3539
DecisionTreeClassifier,0.2843,0.1891,0.4748,0.1964,0.6079,0.271,0.6428,0.3125,0.6722,0.2758,0.6476,0.2975
RandomForestClassifier,0.1673,0.1058,0.4742,0.2669,0.6342,0.3343,0.6772,0.3741,0.7093,0.3524,0.7428,0.3433
ExtraTreesClassifier,0.1638,0.1232,0.453,0.2901,0.5794,0.2809,0.6222,0.3309,0.6723,0.3414,0.7069,0.332





In [93]:
sample_vs = [x/10 for x in range(1, 10, 2)]+[1]
for resampling_tq in resampling_tqs_no_ss:
    resampling_tq_str = str(resampling_tq).split('.')[-1][:-2]
    print(f'Train vs validation results for {resampling_tq_str} technique.')
    display(get_train_val_results_for_sampling(resampling_tq, 
                                               tqs_wo_random_state, 
                                               sample_vs, 
                                               models, 
                                               X_train, y_train, 
                                               splitter, 
                                               X_val, y_val, 
                                               ss=False))
    print()

Train vs validation results for CondensedNearestNeighbour technique.


Unnamed: 0_level_0,ss=0.1 train,ss=0.1 val,ss=0.3 train,ss=0.3 val,ss=0.5 train,ss=0.5 val,ss=0.7 train,ss=0.7 val,ss=0.9 train,ss=0.9 val,ss=1 train,ss=1 val
Models,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
LogisticRegression,0.0127,0.0,0.0127,0.0,0.0127,0.0,0.0127,0.0,0.0127,0.0,0.0127,0.0
LinearDiscriminantAnalysis,0.0202,0.0,0.0202,0.0,0.0202,0.0,0.0202,0.0,0.0202,0.0,0.0202,0.0
GaussianNB,0.4079,0.2972,0.4079,0.2972,0.4079,0.2972,0.4079,0.2972,0.4079,0.2972,0.4079,0.2972
KNeighborsClassifier,0.1163,0.1381,0.1163,0.1381,0.1163,0.1381,0.1163,0.1381,0.1163,0.1381,0.1163,0.1381
SVC,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
LinearSVC,0.0026,0.0,0.0026,0.0,0.0026,0.0,0.0026,0.0,0.0026,0.0,0.0026,0.0
SGDClassifier,0.3503,0.1892,0.3503,0.1892,0.3503,0.1892,0.3503,0.1892,0.3503,0.1892,0.3503,0.1892
DecisionTreeClassifier,0.2289,0.1661,0.2289,0.1661,0.2289,0.1661,0.2289,0.1661,0.2289,0.1661,0.2289,0.1661
RandomForestClassifier,0.1111,0.0802,0.1111,0.0802,0.1111,0.0802,0.1111,0.0802,0.1111,0.0802,0.1111,0.0802
ExtraTreesClassifier,0.1148,0.1244,0.1148,0.1244,0.1148,0.1244,0.1148,0.1244,0.1148,0.1244,0.1148,0.1244



Train vs validation results for TomekLinks technique.


Unnamed: 0_level_0,ss=0.1 train,ss=0.1 val,ss=0.3 train,ss=0.3 val,ss=0.5 train,ss=0.5 val,ss=0.7 train,ss=0.7 val,ss=0.9 train,ss=0.9 val,ss=1 train,ss=1 val
Models,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
LogisticRegression,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
LinearDiscriminantAnalysis,0.0818,0.0,0.0818,0.0,0.0818,0.0,0.0818,0.0,0.0818,0.0,0.0818,0.0
GaussianNB,0.2711,0.2725,0.2711,0.2725,0.2711,0.2725,0.2711,0.2725,0.2711,0.2725,0.2711,0.2725
KNeighborsClassifier,0.1084,0.0867,0.1084,0.0867,0.1084,0.0867,0.1084,0.0867,0.1084,0.0867,0.1084,0.0867
SVC,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
LinearSVC,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SGDClassifier,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DecisionTreeClassifier,0.2181,0.2036,0.2181,0.2036,0.2181,0.2036,0.2181,0.2036,0.2181,0.2036,0.2181,0.2036
RandomForestClassifier,0.131,0.0568,0.131,0.0568,0.131,0.0568,0.131,0.0568,0.131,0.0568,0.131,0.0568
ExtraTreesClassifier,0.1603,0.1351,0.1603,0.1351,0.1603,0.1351,0.1603,0.1351,0.1603,0.1351,0.1603,0.1351



Train vs validation results for EditedNearestNeighbours technique.


Unnamed: 0_level_0,ss=0.1 train,ss=0.1 val,ss=0.3 train,ss=0.3 val,ss=0.5 train,ss=0.5 val,ss=0.7 train,ss=0.7 val,ss=0.9 train,ss=0.9 val,ss=1 train,ss=1 val
Models,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
LogisticRegression,0.0922,0.0,0.0922,0.0,0.0922,0.0,0.0922,0.0,0.0922,0.0,0.0922,0.0
LinearDiscriminantAnalysis,0.2692,0.1309,0.2692,0.1309,0.2692,0.1309,0.2692,0.1309,0.2692,0.1309,0.2692,0.1309
GaussianNB,0.2965,0.2725,0.2965,0.2725,0.2965,0.2725,0.2965,0.2725,0.2965,0.2725,0.2965,0.2725
KNeighborsClassifier,0.2439,0.203,0.2439,0.203,0.2439,0.203,0.2439,0.203,0.2439,0.203,0.2439,0.203
SVC,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
LinearSVC,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SGDClassifier,0.0232,0.0,0.0232,0.0,0.0232,0.0,0.0232,0.0,0.0232,0.0,0.0232,0.0
DecisionTreeClassifier,0.3905,0.235,0.3905,0.235,0.3905,0.235,0.3905,0.235,0.3905,0.235,0.3905,0.235
RandomForestClassifier,0.3077,0.1741,0.3077,0.1741,0.3077,0.1741,0.3077,0.1741,0.3077,0.1741,0.3077,0.1741
ExtraTreesClassifier,0.3358,0.2027,0.3358,0.2027,0.3358,0.2027,0.3358,0.2027,0.3358,0.2027,0.3358,0.2027



Train vs validation results for NeighbourhoodCleaningRule technique.


Unnamed: 0_level_0,ss=0.1 train,ss=0.1 val,ss=0.3 train,ss=0.3 val,ss=0.5 train,ss=0.5 val,ss=0.7 train,ss=0.7 val,ss=0.9 train,ss=0.9 val,ss=1 train,ss=1 val
Models,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
LogisticRegression,0.0991,0.0,0.0991,0.0,0.0991,0.0,0.0991,0.0,0.0991,0.0,0.0991,0.0
LinearDiscriminantAnalysis,0.2759,0.1295,0.2759,0.1295,0.2759,0.1295,0.2759,0.1295,0.2759,0.1295,0.2759,0.1295
GaussianNB,0.2966,0.2725,0.2966,0.2725,0.2966,0.2725,0.2966,0.2725,0.2966,0.2725,0.2966,0.2725
KNeighborsClassifier,0.2465,0.1562,0.2465,0.1562,0.2465,0.1562,0.2465,0.1562,0.2465,0.1562,0.2465,0.1562
SVC,0.0203,0.0,0.0203,0.0,0.0203,0.0,0.0203,0.0,0.0203,0.0,0.0203,0.0
LinearSVC,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SGDClassifier,0.0225,0.0,0.0225,0.0,0.0225,0.0,0.0225,0.0,0.0225,0.0,0.0225,0.0
DecisionTreeClassifier,0.3717,0.1815,0.3717,0.1815,0.3717,0.1815,0.3717,0.1815,0.3717,0.1815,0.3717,0.1815
RandomForestClassifier,0.3182,0.1546,0.3182,0.1546,0.3182,0.1546,0.3182,0.1546,0.3182,0.1546,0.3182,0.1546
ExtraTreesClassifier,0.3475,0.1651,0.3475,0.1651,0.3475,0.1651,0.3475,0.1651,0.3475,0.1651,0.3475,0.1651





#### Let's look on the results.
    Resampling results were chosen based on val results.
    For validation dataset the constant strategy (dummy classification) which choose only 1 gives ~0.2045 f2_score so let's focus on higher ones.
    SMOTE
        ss=0.5 
            Model	train score / val score
            LinearDiscriminantAnalysis 0.7039	0.4303
            LinearSVC 0.6979	0.4054
            SGDClassifier 0.6860	0.4084
            AdaBoostClassifier 0.7325	0.3947
        
    BorderlineSMOTE 
        ss=0.3
            LogisticRegression 0.6279	0.3933
            SGDClassifier 0.6289	0.4204
    
        ss=0.5
            LinearDiscriminantAnalysis 0.7807	0.4328	
            LinearSVC 0.7665	0.3963
    
    SVMSMOTE 
        ss=0.7
            LinearDiscriminantAnalysis 0.7692	0.4367
            LinearSVC 0.7541	0.4012
            SGDClassifier 0.7499	0.4268	
    
    SMOTENC 
        ss=0.5
            LogisticRegression 0.7068	0.3746
            LinearDiscriminantAnalysis 0.7064	0.3746	
            LinearSVC 0.7626	0.3025	
    
    ADASYN 
        ss=0.5
            LogisticRegression 0.6820	0.4030
            LinearDiscriminantAnalysis 0.6986	0.4252	
            LinearSVC 0.6886	0.4018	
            SGDClassifier 0.6463	0.4201	
            AdaBoostClassifier 0.7167	0.3548
    
    RandomUnderSampler 
        ss=0.5
            LogisticRegression 0.6728	0.4094
            LinearDiscriminantAnalysis 0.7004	0.4190
            SVC 0.6997	0.3944
            LinearSVC 0.6770	0.4143	
            SGDClassifier 0.6343	0.3872
            RandomForestClassifier 0.6342	0.3343
            GradientBoostingClassifier 0.6616	0.3295
    
        ss=0.7
            BaggingClassifier 0.6708	0.3358
    
    
    CondensedNearestNeighbour 
        not sufficient
        
    TomekLinks 
        not sufficient
        
    EditedNearestNeighbours 
        not sufficient
        
    NeighbourhoodCleaningRule 
        not sufficient

Let's give a chance for some combinations:<br>
&emsp;&emsp;It seems SMOTE performs better than BorderlineSMOTE so there is no needs to check them both.<br>
&emsp;&emsp;Similar for SMOTE and SMOTENC.<br>
&emsp;&emsp;For undersampling only RandomUnderSampler showed some interesting results.<br>
Combinations to check:<br>
&emsp;&emsp;SMOTE + RandomUnderSampler<br>
&emsp;&emsp;SVMSMOTE + RandomUnderSampler<br>
&emsp;&emsp;ADASYN + RandomUnderSampler<br>

In [22]:
from imblearn.pipeline import Pipeline as imPipeline

In [23]:
# effective (over/under)sampling techniques
oversample_tqs = [SMOTE, SVMSMOTE, ADASYN]
undersample_tqs = [RandomUnderSampler]

# effective models
models_new = [
    LogisticRegression(random_state=24),
    LinearDiscriminantAnalysis(),
    SVC(),
    LinearSVC(),
    SGDClassifier(random_state=24),
    RandomForestClassifier(random_state=24),
    BaggingClassifier(random_state=24),
    GradientBoostingClassifier(random_state=24),
    AdaBoostClassifier(random_state=24)
]

In [24]:
def get_train_val_results_for_combined_resampling(oversampling_tq, undersampling_tq, oversample_vs, undersample_vs,
                                                  models, X, y, splitter, X_val, y_val):
    ''' Get scores for models based on cross_val_score on train dataset 
        and prediction scores for valudation dataset
        based on resampling technique and sampling strategies.
    Parameters:
        oversampling_tq   - oversampling technique
        undersampling_tq  - undersampling technique
        oversample_vs     - list of ratios to be used for oversampling
        undersample_vs    - list of ratios to be used for undersampling
        models            - models to evaluate
        X                 - X train dataset
        y                 - y train dataset
        splitter          - cross-validation splitting strategy
        X_val             - X validation dataset
        y_val             - y validation dataset
    Result:
        Dataframe with the coss_val_score and validation results for given models, all resampling techniques
        and ratios of sampling_strategy '''
    
    stq_results = pd.DataFrame()

    for oversample_v in oversample_vs:
        oversample_tq = oversampling_tq(sampling_strategy=oversample_v, random_state=24)
        for undersample_v in undersample_vs:
            # undersampling cannot be performed if len(minority_class)/undersample_v < oversample_v*len(majority_class)
            # because that will require to generate a new sample what is not the purpose of undersampling
            # ADASYN has a little different mechanism of sample generating so need to be separated 
            # actually there is no sense to run a undersample_v >= oversample_v because if they be eq there will
            # be no change in the data structure 
#             if oversampling_tq == ADASYN:
#                 if undersample_v > oversample_v:
#                     undersample_tq = undersampling_tq(sampling_strategy=undersample_v, random_state=24)
#                 else:
#                     continue
            if undersample_v > oversample_v:
                undersample_tq = undersampling_tq(sampling_strategy=undersample_v, random_state=24)
            else:
                continue
            resample_pipeline = imPipeline([('oversampling', oversample_tq), ('undersampling', undersample_tq)])
            X_stq, y_stq = resample_pipeline.fit_resample(X, y)
            sample_v = f'o{oversample_v}/u{undersample_v}'
            stq_scores = get_models_scores_vs_val_for_sampling(models, X_stq, y_stq, splitter, X_val, y_val, sample_v)
            stq_results = pd.concat([stq_results, stq_scores], axis=1)
    return stq_results

In [25]:
o_sample_vs = [0.3, 0.5, 0.7] # based on the previous research these were pretty efective
u_sample_vs = [0.3, 0.5, 0.7, 1.0] 
for oversampling_tq in oversample_tqs:
    for undersample_tq in undersample_tqs:
        oversampling_tq_str = str(oversampling_tq).split('.')[-1][:-2]
        undersampling_tq_str = str(undersample_tq).split('.')[-1][:-2]

        print(f'Train vs validation results for combined {oversampling_tq_str} and {undersampling_tq_str} techniques.')
        display(get_train_val_results_for_combined_resampling(oversampling_tq, 
                                                   undersample_tq, 
                                                   o_sample_vs,
                                                   u_sample_vs,
                                                   models_new, 
                                                   X_train, y_train, 
                                                   splitter, 
                                                   X_val, y_val))
        print()

Train vs validation results for combined SMOTE and RandomUnderSampler techniques.


Unnamed: 0_level_0,ss=o0.3/u0.5 train,ss=o0.3/u0.5 val,ss=o0.3/u0.7 train,ss=o0.3/u0.7 val,ss=o0.3/u1.0 train,ss=o0.3/u1.0 val,ss=o0.5/u0.7 train,ss=o0.5/u0.7 val,ss=o0.5/u1.0 train,ss=o0.5/u1.0 val,ss=o0.7/u1.0 train,ss=o0.7/u1.0 val
Models,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
LogisticRegression,0.6829,0.3881,0.7499,0.393,0.8125,0.4067,0.7467,0.3878,0.808,0.4005,0.8108,0.3957
LinearDiscriminantAnalysis,0.7014,0.4215,0.7592,0.4047,0.8231,0.4079,0.752,0.4188,0.8198,0.4009,0.8229,0.4098
SVC,0.7587,0.3448,0.806,0.3591,0.8492,0.3874,0.8013,0.3448,0.8418,0.3769,0.8404,0.3671
LinearSVC,0.6911,0.4006,0.7565,0.3968,0.8124,0.4028,0.7503,0.4021,0.8101,0.3947,0.815,0.4048
SGDClassifier,0.7261,0.3614,0.7622,0.4028,0.7959,0.2273,0.7576,0.4054,0.8142,0.4064,0.8124,0.3532
RandomForestClassifier,0.8358,0.2191,0.864,0.2055,0.898,0.2128,0.8908,0.1195,0.9174,0.2048,0.9274,0.1487
BaggingClassifier,0.788,0.2227,0.8264,0.1971,0.8658,0.216,0.8671,0.1464,0.8921,0.1654,0.9056,0.125
GradientBoostingClassifier,0.7541,0.2347,0.8172,0.3402,0.877,0.3867,0.8352,0.3188,0.8802,0.3533,0.8902,0.3
AdaBoostClassifier,0.706,0.3628,0.7894,0.3675,0.8412,0.3901,0.7994,0.3457,0.8489,0.3563,0.8471,0.3285



Train vs validation results for combined SVMSMOTE and RandomUnderSampler techniques.


Unnamed: 0_level_0,ss=o0.3/u0.5 train,ss=o0.3/u0.5 val,ss=o0.3/u0.7 train,ss=o0.3/u0.7 val,ss=o0.3/u1.0 train,ss=o0.3/u1.0 val,ss=o0.5/u0.7 train,ss=o0.5/u0.7 val,ss=o0.5/u1.0 train,ss=o0.5/u1.0 val,ss=o0.7/u1.0 train,ss=o0.7/u1.0 val
Models,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
LogisticRegression,0.7558,0.4012,0.8227,0.4062,0.8608,0.4135,0.8194,0.4178,0.8702,0.4188,0.868,0.3995
LinearDiscriminantAnalysis,0.7719,0.4252,0.8325,0.4255,0.8597,0.4136,0.8313,0.4178,0.8716,0.4136,0.8701,0.4136
SVC,0.7986,0.3607,0.836,0.3779,0.8745,0.3786,0.8588,0.3392,0.8827,0.3723,0.8882,0.3533
LinearSVC,0.7638,0.4042,0.8274,0.4121,0.8617,0.4104,0.8267,0.4132,0.874,0.4094,0.8711,0.3924
SGDClassifier,0.6694,0.4178,0.7783,0.3931,0.8511,0.3827,0.8211,0.3939,0.8657,0.4087,0.8523,0.3956
RandomForestClassifier,0.8393,0.2227,0.8643,0.2182,0.8983,0.2751,0.8886,0.2165,0.9169,0.2105,0.93,0.1986
BaggingClassifier,0.8076,0.1619,0.836,0.2612,0.8737,0.2508,0.8558,0.2124,0.8982,0.2273,0.913,0.1661
GradientBoostingClassifier,0.7774,0.2564,0.825,0.3182,0.8677,0.3482,0.853,0.3247,0.893,0.3803,0.8963,0.3372
AdaBoostClassifier,0.7547,0.3896,0.8142,0.3779,0.8622,0.3722,0.8252,0.379,0.8797,0.3444,0.8699,0.3562



Train vs validation results for combined ADASYN and RandomUnderSampler techniques.


Unnamed: 0_level_0,ss=o0.3/u0.5 train,ss=o0.3/u0.5 val,ss=o0.3/u0.7 train,ss=o0.3/u0.7 val,ss=o0.3/u1.0 train,ss=o0.3/u1.0 val,ss=o0.5/u0.7 train,ss=o0.5/u0.7 val,ss=o0.5/u1.0 train,ss=o0.5/u1.0 val,ss=o0.7/u1.0 train,ss=o0.7/u1.0 val
Models,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
LogisticRegression,0.6797,0.4018,0.7427,0.4058,0.8014,0.3953,0.7609,0.3979,0.8154,0.4019,0.803,0.4028
LinearDiscriminantAnalysis,0.6962,0.4179,0.7517,0.4061,0.8089,0.4005,0.7693,0.4134,0.8257,0.4032,0.8123,0.4042
SVC,0.7323,0.3538,0.7798,0.3794,0.8332,0.3738,0.8111,0.3581,0.8416,0.3511,0.8552,0.3571
LinearSVC,0.6885,0.4094,0.7467,0.4124,0.8045,0.4014,0.7676,0.4145,0.8194,0.406,0.8072,0.3963
SGDClassifier,0.6872,0.3654,0.6838,0.3715,0.7811,0.389,0.7621,0.4282,0.8,0.4177,0.8051,0.4015
RandomForestClassifier,0.8277,0.2115,0.8564,0.1923,0.8926,0.2322,0.8855,0.1556,0.9123,0.2249,0.9255,0.1805
BaggingClassifier,0.7794,0.2254,0.8138,0.1498,0.8467,0.2013,0.8529,0.1619,0.885,0.1667,0.909,0.1815
GradientBoostingClassifier,0.7499,0.3041,0.7926,0.3323,0.8452,0.3856,0.8293,0.2961,0.8814,0.3794,0.8862,0.3482
AdaBoostClassifier,0.7032,0.3692,0.7688,0.3453,0.8215,0.3488,0.7933,0.3388,0.8356,0.3614,0.8417,0.3519





#### The results comparing to the previous ones:
    SMOTE + RandomUnderSampler
        No better results than only SMOTE
    
    SVMSMOTE + RandomUnderSampler
        ss = o0.5/u0.7
        Model		train score / val score
        LinearSVC	0.8267	0.4132	
        
    ADASYN + RandomUnderSampler
        ss = o0.5/u0.7
        LinearSVC 0.7676	0.4145
        ss=o0.3/u0.5
        AdaBoostClassifier	0.7032	0.3692	
    
    ss=o0.3/u1.0
    GradientBoostingClassifier 0.8452	0.3856
    
    The ADASYN + RandomUnderSampler with ss = o0.5/u0.7 looks promising.
    The alternatives are only SMOTE or SVMSMOTE for ss=0.5 and 0.7, respectively.

ADASYN has a n_neighbors argument which define the number of nearest neighbors used for new samples generation. Let's run some test for chosen setup.

In [130]:
oversample_v = 0.5
undersample_v = 0.7
nbrs_results = pd.DataFrame()
for nn_no in range(1, 14):
    print('n_neighbors:', nn_no)
    oversample_tq = ADASYN(sampling_strategy=oversample_v, n_neighbors=nn_no, random_state=24)
    undersample_tq = RandomUnderSampler(sampling_strategy=undersample_v, random_state=24)
    resample_pipeline = imPipeline([('oversampling', oversample_tq), ('undersampling', undersample_tq)])
    X_ou, y_ou = resample_pipeline.fit_resample(X_train, y_train)
    sample_v = f'nbrs{nn_no}_o{oversample_v}/u{undersample_v}'
    scores = get_models_scores_vs_val_for_sampling(models, X_ou, y_ou, splitter, X_val, y_val, sample_v)
    nbrs_results = pd.concat([nbrs_results, scores], axis=1)

n_neighbors: 1
n_neighbors: 2
n_neighbors: 3
n_neighbors: 4
n_neighbors: 5
n_neighbors: 6
n_neighbors: 7
n_neighbors: 8
n_neighbors: 9
n_neighbors: 10
n_neighbors: 11
n_neighbors: 12
n_neighbors: 13


In [132]:
nbrs_results[nbrs_results.columns[:10]]

Unnamed: 0_level_0,ss=nbrs1_o0.5/u0.7 train,ss=nbrs1_o0.5/u0.7 val,ss=nbrs2_o0.5/u0.7 train,ss=nbrs2_o0.5/u0.7 val,ss=nbrs3_o0.5/u0.7 train,ss=nbrs3_o0.5/u0.7 val,ss=nbrs4_o0.5/u0.7 train,ss=nbrs4_o0.5/u0.7 val,ss=nbrs5_o0.5/u0.7 train,ss=nbrs5_o0.5/u0.7 val
Models,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
LogisticRegression,0.744,0.3958,0.7405,0.3989,0.7446,0.3989,0.744,0.4011,0.7609,0.3979
LinearDiscriminantAnalysis,0.7535,0.422,0.7508,0.4231,0.7532,0.4231,0.7578,0.4113,0.7693,0.4134
GaussianNB,0.8199,0.2677,0.8229,0.2695,0.8252,0.2736,0.8256,0.2729,0.825,0.274
KNeighborsClassifier,0.941,0.231,0.9131,0.228,0.903,0.2308,0.8949,0.2308,0.888,0.2687
SVC,0.8067,0.3649,0.7906,0.3397,0.7959,0.3672,0.7961,0.3463,0.8111,0.3581
LinearSVC,0.7532,0.4134,0.7491,0.4178,0.7548,0.4178,0.7532,0.4167,0.7676,0.4145
SGDClassifier,0.7419,0.3944,0.7463,0.4101,0.7547,0.4187,0.7383,0.4098,0.7621,0.4282
DecisionTreeClassifier,0.8843,0.3042,0.8684,0.16,0.8502,0.1373,0.8499,0.2583,0.8346,0.1444
RandomForestClassifier,0.9313,0.1556,0.9134,0.1195,0.8972,0.1373,0.8917,0.1383,0.8855,0.1556
ExtraTreesClassifier,0.9415,0.1515,0.9173,0.081,0.9031,0.1357,0.9015,0.1556,0.8883,0.1509


In [133]:
nbrs_results[nbrs_results.columns[10:20]]

Unnamed: 0_level_0,ss=nbrs6_o0.5/u0.7 train,ss=nbrs6_o0.5/u0.7 val,ss=nbrs7_o0.5/u0.7 train,ss=nbrs7_o0.5/u0.7 val,ss=nbrs8_o0.5/u0.7 train,ss=nbrs8_o0.5/u0.7 val,ss=nbrs9_o0.5/u0.7 train,ss=nbrs9_o0.5/u0.7 val,ss=nbrs10_o0.5/u0.7 train,ss=nbrs10_o0.5/u0.7 val
Models,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
LogisticRegression,0.7423,0.3887,0.7552,0.3887,0.7557,0.4065,0.7459,0.3919,0.7514,0.3898
LinearDiscriminantAnalysis,0.7546,0.4124,0.7647,0.4124,0.766,0.4156,0.7565,0.4145,0.7615,0.4145
GaussianNB,0.8249,0.2732,0.8261,0.2747,0.8248,0.2729,0.8267,0.2729,0.8277,0.2751
KNeighborsClassifier,0.8844,0.2727,0.8854,0.2827,0.8826,0.2932,0.8673,0.2591,0.8704,0.2308
SVC,0.8042,0.3482,0.8005,0.3601,0.812,0.3482,0.8045,0.3662,0.811,0.3662
LinearSVC,0.7515,0.4058,0.7633,0.4026,0.7622,0.409,0.7552,0.4079,0.7614,0.4079
SGDClassifier,0.7392,0.4308,0.7336,0.3895,0.7612,0.4155,0.7246,0.4177,0.7402,0.3812
DecisionTreeClassifier,0.844,0.2099,0.8452,0.1901,0.8266,0.1661,0.8182,0.188,0.8251,0.2069
RandomForestClassifier,0.8806,0.1894,0.8715,0.2052,0.8717,0.2107,0.8643,0.1758,0.8781,0.1779
ExtraTreesClassifier,0.8852,0.1186,0.8726,0.1487,0.875,0.1515,0.8704,0.1737,0.8759,0.1533


In [134]:
nbrs_results[nbrs_results.columns[20:]]

Unnamed: 0_level_0,ss=nbrs11_o0.5/u0.7 train,ss=nbrs11_o0.5/u0.7 val,ss=nbrs12_o0.5/u0.7 train,ss=nbrs12_o0.5/u0.7 val,ss=nbrs13_o0.5/u0.7 train,ss=nbrs13_o0.5/u0.7 val
Models,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
LogisticRegression,0.7481,0.393,0.7392,0.3804,0.7527,0.3784
LinearDiscriminantAnalysis,0.7583,0.4145,0.7492,0.4134,0.7623,0.4145
GaussianNB,0.8281,0.2762,0.827,0.274,0.8278,0.2743
KNeighborsClassifier,0.8762,0.295,0.8606,0.2673,0.8642,0.3096
SVC,0.8142,0.3541,0.8036,0.3501,0.8133,0.3551
LinearSVC,0.7562,0.4079,0.7463,0.3968,0.7593,0.3979
SGDClassifier,0.7243,0.3991,0.762,0.4192,0.7142,0.4187
DecisionTreeClassifier,0.8219,0.239,0.8362,0.2174,0.8292,0.2778
RandomForestClassifier,0.8671,0.1992,0.8599,0.1575,0.8714,0.1807
ExtraTreesClassifier,0.8752,0.1992,0.8666,0.1606,0.8704,0.2236


Looks like the default n_neighbors: 5 gives pretty good results for the validation dataset.

Let's check it for the SMOTE and SVMSMOTE as well.

In [144]:
sample_vs = [0.5]
knbrs_results = pd.DataFrame()
print('Train vs validation results for SMOTE technique.')
for kn_no in range(1, 14):
    print('k_neighbors:', kn_no)
    oversample_tq = SMOTE(sampling_strategy=oversample_v, k_neighbors=kn_no, random_state=24)
    X_o, y_o = oversample_tq.fit_resample(X_train, y_train)
    sample_v = f'nbrs{kn_no}_o{sample_vs[0]}'
    results_df = get_models_scores_vs_val_for_sampling(models, X_o, y_o, splitter, X_val, y_val, sample_v)
    knbrs_results = pd.concat([knbrs_results, results_df], axis=1)

Train vs validation results for SMOTE technique.
k_neighbors: 1
k_neighbors: 2
k_neighbors: 3
k_neighbors: 4
k_neighbors: 5
k_neighbors: 6
k_neighbors: 7
k_neighbors: 8
k_neighbors: 9
k_neighbors: 10
k_neighbors: 11
k_neighbors: 12
k_neighbors: 13


In [145]:
knbrs_results[knbrs_results.columns[:14]]

Unnamed: 0_level_0,ss=nbrs1_o0.5 train,ss=nbrs1_o0.5 val,ss=nbrs2_o0.5 train,ss=nbrs2_o0.5 val,ss=nbrs3_o0.5 train,ss=nbrs3_o0.5 val,ss=nbrs4_o0.5 train,ss=nbrs4_o0.5 val,ss=nbrs5_o0.5 train,ss=nbrs5_o0.5 val,ss=nbrs6_o0.5 train,ss=nbrs6_o0.5 val,ss=nbrs7_o0.5 train,ss=nbrs7_o0.5 val
Models,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
LogisticRegression,0.6861,0.3904,0.6853,0.3904,0.7018,0.3916,0.6869,0.4042,0.6909,0.4066,0.6895,0.403,0.7118,0.4079
LinearDiscriminantAnalysis,0.7056,0.4215,0.7077,0.4215,0.7185,0.4082,0.7008,0.429,0.7039,0.4303,0.7031,0.4203,0.7286,0.4265
GaussianNB,0.7675,0.2699,0.7719,0.2717,0.7708,0.2714,0.7758,0.2751,0.777,0.2774,0.7775,0.2766,0.7804,0.2716
KNeighborsClassifier,0.9173,0.2555,0.8889,0.2076,0.8739,0.2542,0.8669,0.2577,0.8657,0.2226,0.8551,0.264,0.8577,0.2759
SVC,0.7651,0.3517,0.752,0.3427,0.7546,0.3503,0.7527,0.3526,0.7468,0.3642,0.7499,0.3571,0.767,0.3454
LinearSVC,0.6959,0.4006,0.6938,0.3982,0.7084,0.4006,0.692,0.4006,0.6979,0.4054,0.6939,0.4006,0.718,0.4006
SGDClassifier,0.6371,0.4167,0.6652,0.4144,0.6887,0.4276,0.6744,0.3881,0.686,0.4084,0.642,0.3804,0.6868,0.3968
DecisionTreeClassifier,0.8687,0.234,0.8385,0.1562,0.8333,0.2066,0.8292,0.1101,0.8168,0.1953,0.8085,0.1562,0.8074,0.2321
RandomForestClassifier,0.9159,0.1429,0.8825,0.1717,0.8736,0.1502,0.8645,0.1471,0.8657,0.1452,0.8511,0.1464,0.8652,0.2083
ExtraTreesClassifier,0.9307,0.1415,0.8922,0.1087,0.8842,0.1496,0.8712,0.1446,0.8703,0.1282,0.8515,0.1606,0.8622,0.1923


In [146]:
knbrs_results[knbrs_results.columns[14:]]

Unnamed: 0_level_0,ss=nbrs8_o0.5 train,ss=nbrs8_o0.5 val,ss=nbrs9_o0.5 train,ss=nbrs9_o0.5 val,ss=nbrs10_o0.5 train,ss=nbrs10_o0.5 val,ss=nbrs11_o0.5 train,ss=nbrs11_o0.5 val,ss=nbrs12_o0.5 train,ss=nbrs12_o0.5 val,ss=nbrs13_o0.5 train,ss=nbrs13_o0.5 val
Models,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
LogisticRegression,0.7027,0.4066,0.7009,0.4,0.6876,0.4066,0.6852,0.4054,0.6897,0.4091,0.6941,0.418
LinearDiscriminantAnalysis,0.725,0.4412,0.7181,0.4167,0.7063,0.4179,0.7029,0.4277,0.7104,0.4425,0.7153,0.4421
GaussianNB,0.7785,0.2705,0.7735,0.2759,0.7756,0.2712,0.7764,0.2755,0.7779,0.277,0.7745,0.2732
KNeighborsClassifier,0.8539,0.2448,0.8418,0.256,0.8375,0.2827,0.8383,0.2737,0.8317,0.2321,0.8292,0.266
SVC,0.7579,0.3618,0.7601,0.3654,0.7431,0.3729,0.7609,0.3704,0.7508,0.3497,0.7287,0.3148
LinearSVC,0.7151,0.4042,0.7089,0.3939,0.6971,0.4042,0.6941,0.4066,0.7007,0.4066,0.702,0.4154
SGDClassifier,0.7094,0.4195,0.6672,0.3453,0.6117,0.4136,0.6721,0.425,0.6903,0.4006,0.6679,0.3972
DecisionTreeClassifier,0.8048,0.1992,0.8145,0.251,0.7945,0.1296,0.802,0.1496,0.79,0.214,0.7909,0.1446
RandomForestClassifier,0.8505,0.2092,0.848,0.1695,0.836,0.1717,0.8448,0.1923,0.8372,0.1739,0.841,0.1535
ExtraTreesClassifier,0.856,0.1674,0.8574,0.1496,0.8456,0.1899,0.8458,0.1709,0.8422,0.1293,0.8425,0.1288


In [148]:
sample_vs = [0.7]
knbrs_results = pd.DataFrame()
print('Train vs validation results for SVMSMOTE technique.')
for kn_no in range(1, 14):
    print('k_neighbors:', kn_no)
    oversample_tq = SVMSMOTE(sampling_strategy=oversample_v, k_neighbors=kn_no, random_state=24)
    X_o, y_o = oversample_tq.fit_resample(X_train, y_train)
    sample_v = f'nbrs{kn_no}_o{sample_vs[0]}'
    results_df = get_models_scores_vs_val_for_sampling(models, X_o, y_o, splitter, X_val, y_val, sample_v)
    knbrs_results = pd.concat([knbrs_results, results_df], axis=1)

Train vs validation results for SVMSMOTE technique.
k_neighbors: 1
k_neighbors: 2
k_neighbors: 3
k_neighbors: 4
k_neighbors: 5
k_neighbors: 6
k_neighbors: 7
k_neighbors: 8
k_neighbors: 9
k_neighbors: 10
k_neighbors: 11
k_neighbors: 12
k_neighbors: 13


In [149]:
knbrs_results[knbrs_results.columns[:14]]

Unnamed: 0_level_0,ss=nbrs1_o0.7 train,ss=nbrs1_o0.7 val,ss=nbrs2_o0.7 train,ss=nbrs2_o0.7 val,ss=nbrs3_o0.7 train,ss=nbrs3_o0.7 val,ss=nbrs4_o0.7 train,ss=nbrs4_o0.7 val,ss=nbrs5_o0.7 train,ss=nbrs5_o0.7 val,ss=nbrs6_o0.7 train,ss=nbrs6_o0.7 val,ss=nbrs7_o0.7 train,ss=nbrs7_o0.7 val
Models,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
LogisticRegression,0.6572,0.3623,0.6655,0.4225,0.662,0.4064,0.6714,0.4035,0.6712,0.3763,0.6571,0.3636,0.6707,0.3887
LinearDiscriminantAnalysis,0.6896,0.3873,0.7027,0.3819,0.7038,0.3833,0.7105,0.3767,0.6991,0.3966,0.6997,0.3915,0.7007,0.3767
GaussianNB,0.7271,0.2878,0.7302,0.2911,0.73,0.281,0.7306,0.2838,0.7265,0.2865,0.729,0.2806,0.7299,0.2859
KNeighborsClassifier,0.9164,0.2814,0.8906,0.2675,0.8827,0.2664,0.8564,0.2621,0.8324,0.2419,0.8442,0.2569,0.8284,0.26
SVC,0.7674,0.3612,0.7643,0.3791,0.7753,0.3731,0.7832,0.3802,0.7552,0.3612,0.7534,0.3053,0.7683,0.3745
LinearSVC,0.6646,0.3777,0.6782,0.3806,0.6745,0.3979,0.6817,0.3979,0.6772,0.3873,0.6753,0.371,0.6808,0.3659
SGDClassifier,0.6452,0.3987,0.6614,0.375,0.6868,0.3869,0.6461,0.4019,0.6615,0.381,0.6882,0.4059,0.6167,0.3906
DecisionTreeClassifier,0.8578,0.0746,0.8367,0.1826,0.8163,0.1982,0.7932,0.2381,0.7922,0.1674,0.8034,0.3057,0.7784,0.2033
RandomForestClassifier,0.9082,0.1309,0.8855,0.197,0.8752,0.1707,0.8508,0.1683,0.8411,0.1422,0.8343,0.1208,0.8445,0.2074
ExtraTreesClassifier,0.9229,0.1295,0.8982,0.1724,0.8838,0.1435,0.8656,0.1422,0.8491,0.1422,0.8417,0.186,0.8472,0.1835


In [150]:
knbrs_results[knbrs_results.columns[14:]]

Unnamed: 0_level_0,ss=nbrs8_o0.7 train,ss=nbrs8_o0.7 val,ss=nbrs9_o0.7 train,ss=nbrs9_o0.7 val,ss=nbrs10_o0.7 train,ss=nbrs10_o0.7 val,ss=nbrs11_o0.7 train,ss=nbrs11_o0.7 val,ss=nbrs12_o0.7 train,ss=nbrs12_o0.7 val,ss=nbrs13_o0.7 train,ss=nbrs13_o0.7 val
Models,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
LogisticRegression,0.6549,0.3915,0.6788,0.3804,0.6715,0.3777,0.6821,0.3791,0.6714,0.3943,0.6628,0.3833
LinearDiscriminantAnalysis,0.6836,0.3898,0.7112,0.3819,0.6995,0.3966,0.7111,0.4049,0.7028,0.3833,0.694,0.3925
GaussianNB,0.7314,0.2863,0.7338,0.2889,0.7322,0.2863,0.7333,0.288,0.7331,0.2859,0.7318,0.2863
KNeighborsClassifier,0.8338,0.2918,0.8187,0.2632,0.8119,0.2653,0.8044,0.2686,0.8111,0.2449,0.7951,0.2282
SVC,0.7412,0.3774,0.7806,0.3759,0.7438,0.3282,0.7498,0.3414,0.7448,0.336,0.7554,0.3422
LinearSVC,0.6608,0.3671,0.6896,0.3697,0.679,0.3873,0.6896,0.3929,0.6802,0.3833,0.6716,0.3952
SGDClassifier,0.6522,0.0781,0.6862,0.283,0.6642,0.2392,0.6751,0.403,0.6632,0.2672,0.6918,0.404
DecisionTreeClassifier,0.7814,0.1471,0.7857,0.2232,0.7793,0.2183,0.7743,0.1681,0.7693,0.25,0.7691,0.1245
RandomForestClassifier,0.8289,0.2294,0.8436,0.186,0.8266,0.1643,0.8219,0.119,0.8231,0.186,0.8161,0.2143
ExtraTreesClassifier,0.8393,0.1613,0.8476,0.1598,0.8239,0.1843,0.8285,0.1376,0.8291,0.1628,0.8248,0.1843


Looks similar to ADASYN case.

Before hyperparameter tuning let's check the scores for dummy classifier. That can be treat as a benchmark.

In [21]:
from sklearn.dummy import DummyClassifier

In [46]:
dc = DummyClassifier(strategy='constant', constant=1, random_state=24)
dc.fit(X_train, y_train)
dc_y_pred = dc.predict(X_train)

In [48]:
fbeta_score(y_train, dc_y_pred, beta=2)

0.2035330261136713

In case the model predicts only 1 the f2_score will be 0.2035.

In [27]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint

In [32]:
oversample_v = 0.5
undersample_v = 0.7
oversample_tq = ADASYN(sampling_strategy=oversample_v, random_state=24)
undersample_tq = RandomUnderSampler(sampling_strategy=undersample_v, random_state=24)
resample_pipeline = imPipeline([('oversampling', oversample_tq), ('undersampling', undersample_tq)])
X_ou, y_ou = resample_pipeline.fit_resample(X_train, y_train)

Logistic Regression

In [171]:
lr_rsc = LogisticRegression(random_state=24)
distributions = dict(C=uniform(loc=0, scale=4), 
                     penalty=['l2'], 
                     solver=['newton-cg', 'lbfgs', 'sag', 'saga', 'liblinear'], 
                     max_iter=randint(200, 1500),
#                      fit_intercept=[True, False],
#                      class_weight=[None, 'balanced']
                    ) #'saga' can get l1, rest only l2 
rscv = RandomizedSearchCV(lr_rsc, distributions, cv=splitter, scoring=f2_scorer, random_state=24, n_iter=2877)

In [172]:
%%time
lr_search = rscv.fit(X_ou, y_ou)

Wall time: 50min 30s


In [174]:
lr_search.best_params_

{'C': 0.008632525096375243,
 'max_iter': 973,
 'penalty': 'l2',
 'solver': 'liblinear'}

In [168]:
# lr_tng = LogisticRegression(C=0.008632525096375243, max_iter=361, penalty='l2', solver='newton-cg', 
#                             class_weight='balanced', random_state=24)
# lr_tng = LogisticRegression(C=0.05097413466747014, max_iter=604, penalty='l2', solver='sag', random_state=24)
lr_tng = LogisticRegression(C=0.008632525096375243, max_iter=973, penalty='l2', solver='liblinear', random_state=24)

In [169]:
cross_val_scores(lr_tng, X_ou, y_ou, splitter)

0.7802527931152647

In [170]:
get_val_pred_score(lr_tng, X_ou, y_ou, X_val, y_val)

0.42635658914728686

Linear Discriminant Analysis

In [206]:
lda_rsc = LinearDiscriminantAnalysis()
distributions = dict(solver=['svd', 'lsqr', 'eigen'], 
#                      shrinkage=[None, 'auto']+list(np.arange(0, 1, 0.01)),
                     shrinkage=np.arange(0, 1, 0.01),
                     tol=[1/10**x for x in range(1, 8)]
                    )
rscv = RandomizedSearchCV(lda_rsc, distributions, cv=splitter, scoring=f2_scorer, random_state=24, n_iter=1000)

In [216]:
%%time
lda_search = rscv.fit(X_ou, y_ou)

Wall time: 2min 20s


In [213]:
lda_search.best_params_

{'tol': 0.0001, 'solver': 'lsqr', 'shrinkage': 0.6900000000000001}

In [229]:
lda_tng = LinearDiscriminantAnalysis(tol=1.0e-4, solver='lsqr', shrinkage=0.69)

In [230]:
cross_val_scores(lda_tng, X_ou, y_ou, splitter)

0.7974659544885918

In [231]:
get_val_pred_score(lda_tng, X_ou, y_ou, X_val, y_val)

0.41463414634146345

Achieved results for LinearDiscriminantAnalysis are 0.7975/0.4146 for train/val dataset, respectively.
The validations is better by 0.3% (before 0.4134).

Linear SVC

In [349]:
lsvc_rsc = LinearSVC(random_state=24)
distributions = dict(C=uniform(loc=0, scale=4), 
                     penalty=['l1', 'l2'],
                     loss=['hinge', 'squared_hinge'],
#                      dual=[True, False], if n_samples > n_features then dual should be False
                     dual=[False],
                     tol=[1/10**x for x in range(1, 8)],
                     max_iter=randint(1000, 12000),
                     fit_intercept=[True, False]
#                      class_weight=[None, 'balanced']
                    ) 
rscv = RandomizedSearchCV(lsvc_rsc, distributions, cv=splitter, scoring=f2_scorer, random_state=24, n_iter=8888, error_score=np.NINF)

In [26]:
# %%time
lsvc_search = rscv.fit(X_ou, y_ou)

In [351]:
lsvc_search.best_params_

{'C': 0.005750984762251665,
 'dual': False,
 'fit_intercept': True,
 'loss': 'squared_hinge',
 'max_iter': 3235,
 'penalty': 'l1',
 'tol': 1e-06}

In [427]:
# lsvc_tng = LinearSVC(C=0.01619532623902531, dual=False, fit_intercept=False, loss='squared_hinge', penalty='l1', max_iter=6559,
#                     tol=1.0e-4, random_state=24)
lsvc_tng = LinearSVC(C=0.005750984762251665, dual=False, fit_intercept=True, loss='squared_hinge', penalty='l1', max_iter=3235,
                    tol=1.0e-6, random_state=24)

In [428]:
cross_val_scores(lsvc_tng, X_ou, y_ou, splitter)

0.783315184222561

In [429]:
get_val_pred_score(lsvc_tng, X_ou, y_ou, X_val, y_val)

0.414572864321608

Achieved results for LinearSVC are 0.7791/0.4188 for train/val dataset, respectively.
The validations is better by 1% (before 0.4145).

SGDClassifier

In [272]:
sgdc_rsc = SGDClassifier(random_state=24)
distributions = dict( 
#                      penalty=['l1', 'elasticnet'], 
                     penalty=['elasticnet'],
#                      loss=['squared_hinge', 'perceptron'],
                     alpha=np.arange(0.00001, 0.001, 0.0001),
                     l1_ratio=np.arange(0.05, 1, 0.05),
#                      dual=[True, False], # if n_samples > n_features then dual should be False
#                      tol=np.arange(0.00001, 0.001, 0.0001),
#                      max_iter=randint(750, 1500),
                     learning_rate=['optimal', 'invscaling', 'adaptive'],
                     eta0=np.arange(0.001, 2, 0.001),
                     power_t=[0.1, 0.25, 0.5, 0.75, 1, 2],
#                      fit_intercept=[True, False],
#                      class_weight=[None, 'balanced'],
#                      early_stopping=[True, False],
#                      validation_fraction=[0.1, 0.2, 0.3],
#                      n_iter_no_change=[3, 5, 10, 20],
                     random_state=[24]
                    ) 
rscv = RandomizedSearchCV(sgdc_rsc, distributions, cv=splitter, scoring=f2_scorer, random_state=24, n_iter=9876, error_score=np.NINF, n_jobs=-1)

In [273]:
%%time
sgdc_search = rscv.fit(X_ou, y_ou)

Wall time: 1.99 s


In [79]:
# sgdc_search.best_params_

In [27]:
def get_params_from_dict(dict_params):
    for key, value in dict_params.items():
        if type(value) == str:
            print(f"{key}='{value}',")
        else:
            print(f'{key}={value},')

In [170]:
get_params_from_dict(sgdc_search.best_params_)

tol=0.00041000000000000005,
random_state=24,
penalty='elasticnet',
learning_rate='optimal',
l1_ratio=0.1,
alpha=0.00051,


In [511]:
# sgdc_tng = SGDClassifier(alpha=0.26191, class_weight='balanced', early_stopping=True, epsilon=0.05, fit_intercept=True,
#                         l1_ratio=0.65, loss='epsilon_insensitive', max_iter=1797, penalty='l1', tol=1.0e-6)
# sgdc_tng = SGDClassifier(alpha=0.24441, early_stopping=True, epsilon=0.05, l1_ratio=0.1, penalty='elasticnet', random_state=24)
# sgdc_tng = SGDClassifier(alpha=0.0049, early_stopping=False, penalty='l1', loss='perceptron', 
#                          l1_ratio=0.15, tol=0.25, random_state=24)
# sgdc_tng = SGDClassifier(alpha=0.0005, tol=0.0001, penalty='elasticnet', l1_ratio=0.05, random_state=24)
sgdc_tng = SGDClassifier(alpha=0.0004, penalty='elasticnet', l1_ratio=0.05, random_state=24)
# sgdc_tng = SGDClassifier(alpha=0.0004, penalty='elasticnet', l1_ratio=0.05, random_state=24, learning_rate='invscaling', eta0=2, power_t=0.5)

In [512]:
cross_val_scores(sgdc_tng, X_ou, y_ou, splitter)

0.773184552660715

In [513]:
get_val_pred_score(sgdc_tng, X_ou, y_ou, X_val, y_val)

0.43814432989690727

0.4282

Achieved results for SGDClassifier are 0.7732/0.4381 for train/val dataset, respectively.
The validations is better by 2.3% (before 0.4282).

AdaBoost Classifier

#### The estimators supported by AdaBoost Classifier i.a. are: 
        BernoulliNB,
        DecisionTreeClassifier,
        ExtraTreeClassifier,
        ExtraTreesClassifier,
        MultinomialNB,
        NuSVC,
        Perceptron,
        RandomForestClassifier,
        RidgeClassifierCV,
        SGDClassifier,
        SVC
        
    Let's choose a couple and check the results.
    Ada Boost Classifier by default uses the DecisionTreeClassifier initialized with max_depth=1, so the good idea is to check other tree based algorithms with the same parameter. Based on that the RandomizedSearchCV will be performed.
    DecisionTreeClassifier, ExtraTreesClassifier, ExtraTreeClassifier and RandomForestClassifier.

In [62]:
from sklearn.tree import ExtraTreeClassifier

In [None]:
abc_tng = AdaBoostClassifier(base_estimator=RandomForestClassifier(max_depth=1), random_state=24)
cross_val_scores(abc_tng, X_ou, y_ou, splitter)

In [None]:
get_val_pred_score(abc_tng, X_ou, y_ou, X_val, y_val)

Two best are ExtraTreeClassifier and ExtraTreesClassifier.<br>
The Ada Boost was designed to work with weak learners so let's focus on the single Tree instead of ensembled algorithm.<br>

In [275]:
abc_rsc = AdaBoostClassifier(random_state=24)
distributions = dict(
#                      base_estimator=[DecisionTreeClassifier(), ExtraTreesClassifier(), ExtraTreeClassifier(), RandomForestClassifier()],
                     base_estimator=[ExtraTreeClassifier(random_state=24)],
#                      base_estimator__max_depth=[x for x in range(1, 11, 2)],
                     base_estimator__max_depth=[1, 2, 3],
                     base_estimator__min_samples_leaf=[x for x in range(1, 20)],
                     n_estimators=[x for x in range(25, 2001, 25)],
                     learning_rate=np.arange(0.05, 1, 0.05),
                     algorithm=['SAMME.R', 'SAMME']
                    ) 
rscv = RandomizedSearchCV(abc_rsc, distributions, cv=splitter, scoring=f2_scorer, random_state=24, n_iter=300, error_score=np.NINF)

In [276]:
%%time
abc_search = rscv.fit(X_ou, y_ou)

Wall time: 5h 15min 57s


In [277]:
get_params_from_dict(abc_search.best_params_)

algorithm='SAMME.R',
base_estimator=ExtraTreeClassifier(max_depth=7, min_samples_leaf=10, random_state=24),
base_estimator__max_depth=7,
base_estimator__min_samples_leaf=10,
learning_rate=0.05,
n_estimators=915,


In [505]:
# abc_tng = AdaBoostClassifier(base_estimator=ExtraTreeClassifier(max_depth=1), n_estimators=133, learning_rate=1.85, algorithm='SAMME', random_state=24)
# abc_tng = AdaBoostClassifier(base_estimator=RandomForestClassifier(max_depth=1), random_state=24)
# abc_tng = AdaBoostClassifier(random_state=24)
abc_tng = AdaBoostClassifier(base_estimator=ExtraTreeClassifier(max_depth=2, min_samples_leaf=10, random_state=24),
                             learning_rate=0.05,
                             n_estimators=900,
                             random_state=24)

In [506]:
cross_val_scores(abc_tng, X_ou, y_ou, splitter)

0.8111404412281389

In [507]:
get_val_pred_score(abc_tng, X_ou, y_ou, X_val, y_val)

0.41033434650455936

Some notes:<br>
if the max_depth is high the model become overfitted<br>
learning_rate learning rate over 1 decrease the score<br>

0.3963 AdaBoostClassifier(base_estimator=ExtraTreeClassifier(max_depth=2, min_samples_leaf=1, random_state=24),
                             learning_rate=0.05, n_estimators=915, random_state=24)<br>
0.3887 AdaBoostClassifier(base_estimator=ExtraTreeClassifier(max_depth=1, min_samples_leaf=1, random_state=24),
                             learning_rate=1, n_estimators=90, random_state=24)<br>
0.3939 AdaBoostClassifier(base_estimator=ExtraTreeClassifier(max_depth=2, min_samples_leaf=8, random_state=24),
                             learning_rate=0.05, n_estimators=900, random_state=24)<br>
0.4103 AdaBoostClassifier(base_estimator=ExtraTreeClassifier(max_depth=2, min_samples_leaf=10, random_state=24),
                             learning_rate=0.05, n_estimators=900, random_state=24)<br>

Achieved results for AdaBoostClassifier(ExtraTreeClassifier) are 0.8111/0.4103 for train/val dataset, respectively.<br>
The validations is better by 8.7% (before 0.3774).<br>

Let's take a look on one of the outlier/anomaly detection algorithm.

In [None]:
OneClass SVM

In [28]:
from sklearn.svm import OneClassSVM

In [29]:
X_train_out, X_test_out, y_train_out, y_test_out = train_test_split(X_train, y_train, test_size=0.5, random_state=24, stratify=y_train)

In [30]:
X_train_out = X_train_out[y_train_out==0]

# the predict output is -1 and 1 instead of 1 and 0, respectively
y_test_out[y_test_out == 1] = -1
y_test_out[y_test_out == 0] = 1

In [31]:
# oc_svm = OneClassSVM(kernel='sigmoid', gamma='scale', coef0=0.5) #‘rbf’, ‘sigmoid’ # gamma{‘scale’, ‘auto’} coef0
oc_svm = OneClassSVM(kernel='rbf', gamma='scale', tol=2, nu=0.37, shrinking=False)
oc_svm.fit(X_train_out)

y_pred = oc_svm.predict(X_test_out)

f2_score = fbeta_score(y_test_out, y_pred, pos_label=-1, beta=2)
f2_score

0.3509719222462203

Let's test the model

In [32]:
X_val_out = X_val.copy()
y_val_out = y_val.copy()

In [33]:
y_val_out[y_val_out == 1] = -1
y_val_out[y_val_out == 0] = 1

In [34]:
# oc_svm = OneClassSVM(kernel='rbf', gamma='scale', tol=2, nu=0.37, shrinking=False)
oc_svm = OneClassSVM()
oc_svm.fit(X_train_out)

y_pred = oc_svm.predict(X_val_out)

f2_score = fbeta_score(y_val_out, y_pred, pos_label=-1, beta=2)
f2_score

0.28735632183908044

0.3109 OneClassSVM(kernel='rbf', gamma='scale', tol=2)<br>
0.3510 OneClassSVM(kernel='rbf', gamma='scale', tol=2, nu=0.37)<br>

Achieved result for OneClassSVM is 0.3510 for test (new created for OCSVM) dataset, respectively.<br>
The validations is better by 13.5% (before 0.3093).<br>
For the validation dataset, which can be treated as test dataset (not used for evaluation and tuning) it is:<br>
&emsp;0.2874 for initial model<br>
&emsp;0.3298 for tuned model<br>
&emsp;That gives 14.8% of improvment.<br>

#### Let's look at the results of RandomizedSearchCV.

    Model                      gain    score
    LogisticRegression         7.2%    0.4264
    LinearDiscriminantAnalysis 0.3%    0.4146
    LinearSVC                  1%      0.4188
    SGDClassifier              2.3%    0.4381
    AdaBoostClassifier         8.7%    0.4103
    OneClassSVM                13.5%   0.3510
    
    Let's focus on the three algorithms with best score (1) and best improvement (2).
    SGDClassifier              2.3%    0.4381
    AdaBoostClassifier         8.7%    0.4103
    OneClassSVM                13.5%   0.3510
    
    Features for further GridSearchCV searching:
    SGDClassifier 
        alpha (low values e.g. 0.0004), 
        penalty,
        l1_ratio (low values e.g. 0.05), 
        random_state=24, 
        learning_rate,
        eta0 (values [0.01, 2])
    
    AdaBoostClassifier
        base_estimator ExtraTreeClassifier
            max_depth values [1, 3], 
            min_samples_leaf [8, 12]
            random_state=24,
        learning_rate values [0.01, 1],
        n_estimators [800, 1100] step 25,
        random_state=24
    
    OneClassSVM (kernel='rbf', gamma='scale', tol=2, nu=0.37)
        kernel,
        gamma,
        tol values [0.1, 3] step 0.1,
        nu values [0.2, 0.6] step 0.01

GridSearchCV

SGDClassifier

In [514]:
from sklearn.model_selection import GridSearchCV

In [515]:
param_grid = dict( 
                     penalty=['l2', 'l1', 'elasticnet'], 
                     alpha=np.arange(0.0001, 0.001, 0.0005),
                     l1_ratio=np.arange(0.05, 1, 0.05),
                     learning_rate=['optimal', 'invscaling', 'adaptive'],
                     eta0=np.arange(0.01, 2, 0.02),
                     random_state=[24]
                    ) 

In [516]:
sgdc_clf = SGDClassifier(random_state=24)
grid_search = GridSearchCV(sgdc_clf, param_grid, cv=splitter, scoring=f2_scorer)
grid_search.fit(X_ou, y_ou)

GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=1),
             estimator=SGDClassifier(random_state=24),
             param_grid={'alpha': array([0.0001, 0.0006]),
                         'eta0': array([0.01, 0.03, 0.05, 0.07, 0.09, 0.11, 0.13, 0.15, 0.17, 0.19, 0.21,
       0.23, 0.25, 0.27, 0.29, 0.31, 0.33, 0.35, 0.37, 0.39, 0.41, 0.43,
       0.45, 0.47, 0.49, 0.51, 0.53, 0.55, 0.57, 0.59, 0.61, 0.63, 0.65,
       0.67, 0....
       1.55, 1.57, 1.59, 1.61, 1.63, 1.65, 1.67, 1.69, 1.71, 1.73, 1.75,
       1.77, 1.79, 1.81, 1.83, 1.85, 1.87, 1.89, 1.91, 1.93, 1.95, 1.97,
       1.99]),
                         'l1_ratio': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
       0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95]),
                         'learning_rate': ['optimal', 'invscaling', 'adaptive'],
                         'penalty': ['l2', 'l1', 'elasticnet'],
                         'random_state': [24]},
           

In [517]:
grid_search.best_estimator_

SGDClassifier(eta0=0.01, l1_ratio=0.05, learning_rate='invscaling',
              random_state=24)

In [518]:
cross_val_scores(grid_search.best_estimator_, X_ou, y_ou, splitter)

0.8003889714657323

In [519]:
get_val_pred_score(grid_search.best_estimator_, X_ou, y_ou, X_val, y_val)

0.4057279236276849

Seems maximising the cross-validation score for train dataset does not improve the validation dataset score.

AdaBoostClassifier

In [523]:
param_grid = dict( 
                     base_estimator=[ExtraTreeClassifier(random_state=24)],
                     base_estimator__max_depth=[1, 2, 3],
                     base_estimator__min_samples_leaf=[x for x in range(8, 13)],
                     n_estimators=[x for x in range(800, 1101, 25)],
                     learning_rate=np.arange(0.05, 0.26, 0.05),
                     algorithm=['SAMME.R']
                    ) 

In [524]:
ab_clf = AdaBoostClassifier(random_state=24)
grid_search = GridSearchCV(ab_clf, param_grid, cv=splitter, scoring=f2_scorer)
grid_search.fit(X_ou, y_ou)

GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=1),
             estimator=AdaBoostClassifier(random_state=24),
             param_grid={'algorithm': ['SAMME.R'],
                         'base_estimator': [ExtraTreeClassifier(max_depth=3,
                                                                min_samples_leaf=8,
                                                                random_state=24)],
                         'base_estimator__max_depth': [1, 2, 3],
                         'base_estimator__min_samples_leaf': [8, 9, 10, 11, 12],
                         'learning_rate': array([0.05, 0.1 , 0.15, 0.2 , 0.25]),
                         'n_estimators': [800, 825, 850, 875, 900, 925, 950,
                                          975, 1000, 1025, 1050, 1075, 1100]},
             scoring=make_scorer(fbeta_score, beta=2))

In [525]:
grid_search.best_estimator_

AdaBoostClassifier(base_estimator=ExtraTreeClassifier(max_depth=3,
                                                      min_samples_leaf=8,
                                                      random_state=24),
                   learning_rate=0.25, n_estimators=1100, random_state=24)

In [526]:
cross_val_scores(grid_search.best_estimator_, X_ou, y_ou, splitter)

0.8734042619994079

In [527]:
get_val_pred_score(grid_search.best_estimator_, X_ou, y_ou, X_val, y_val)

0.26490066225165565

Seems maximising the cross-validation score for AdaBoostClassifier led to overfitting. The gap between the scores for cross validation and validation dataset prediction is pretty significant.

OneClassSVM

GridSearchCV does not work for novelty detection. Let's create simple loops stack to search the hyperparameters.

In [47]:
param_grid = dict( 
                     kernel=['rbf', 'sigmoid'],
                     gamma=['auto', 'scale'],
                     tol=np.arange(0.1, 3.1, 0.1),
                     nu=np.arange(0.2, 0.61, 0.01)
                    ) 

In [37]:
def get_f2_score_for_oc_svm(kernel, gamma, tol, nu):
    oc_svm = OneClassSVM(kernel=kernel, gamma=gamma, tol=tol, nu=nu)
    oc_svm.fit(X_train_out)

    y_pred = oc_svm.predict(X_test_out)

    f2_score = fbeta_score(y_test_out, y_pred, pos_label=-1, beta=2)
    return f2_score

In [48]:
best_score = 0
best_hp = []
for kernel in param_grid['kernel']:
    for gamma in param_grid['gamma']:
        for tol in param_grid['tol']:
            for nu in param_grid['nu']:
                score = get_f2_score_for_oc_svm(kernel, gamma, tol, round(nu, 2))
                if score > best_score:
                    best_hp = [kernel, gamma, tol, round(nu, 2)]
                    best_score = score

In [49]:
f'Best score {best_score} has been achieved for {best_hp}'

"Best score 0.3509719222462203 has been achieved for ['rbf', 'scale', 1.5000000000000002, 0.37]"

In [50]:
oc_svm = OneClassSVM(kernel='rbf', gamma='scale', tol=1.5, nu=0.37)
oc_svm.fit(X_train_out)

y_pred = oc_svm.predict(X_test_out)

f2_score = fbeta_score(y_test_out, y_pred, pos_label=-1, beta=2)
f2_score

0.3509719222462203

In [51]:
oc_svm = OneClassSVM(kernel='rbf', gamma='scale', tol=1.5, nu=0.37)
oc_svm.fit(X_train_out)

y_pred = oc_svm.predict(X_val_out)

f2_score = fbeta_score(y_val_out, y_pred, pos_label=-1, beta=2)
f2_score

0.32978723404255317

The previous and current OneClassSVM give the same result.

The grid searching have not led to validation dataset improvement.<br>
<br>
The hyperameters for these models remain the same:<br>
0.4381 SGDClassifier(alpha=0.0004, penalty='elasticnet', l1_ratio=0.05, random_state=24)<br>
<br>
0.4103 AdaBoostClassifier(base_estimator=ExtraTreeClassifier(max_depth=2, min_samples_leaf=10, random_state=24),
                             learning_rate=0.05, n_estimators=900, random_state=24)<br>
<br>
0.3510 OneClassSVM(kernel='rbf', gamma='scale', tol=1.5, nu=0.37)<br>

The last thing to consider is combining the outlier detection and classification algorithms.<br>
Firstly, the outliers will be removed from train dataset but only for non-stroke samples. Secondly, the model will be trained on the new dataset (without outliers). Let's see if that will improve the predictions performance.<br>
Can be done for orginal data and over/under sampled data.<br>

OneClassSVM + SGDClassifier

In [52]:
# split stroke and non-stroke samples from train dataset
non_stroke_X_train = X_train[y_train == 0].copy()
non_stroke_y_train = y_train[y_train == 0].copy()
stroke_X_train = X_train[y_train == 1].copy()
stroke_y_train = y_train[y_train == 1].copy()

# for under/oversampled data
# non_stroke_X_train = X_ou[y_ou == 0].copy()
# non_stroke_y_train = y_ou[y_ou == 0].copy()
# stroke_X_train = X_ou[y_ou == 1].copy()
# stroke_y_train = y_ou[y_ou == 1].copy()

# outliers detection in the non-stroke dataset
oc_svm = OneClassSVM(kernel='rbf', gamma='scale', tol=1.5, nu=0.37)
y_od = oc_svm.fit_predict(non_stroke_X_train)

# the predict output is -1 and 1, where 1 means inliers
# so let's keep only these labaled as 1
mask = y_od == 1
non_stroke_X_train, non_stroke_y_train = non_stroke_X_train[mask, :], non_stroke_y_train[mask]

# concat the new non_stroke with stroke datasets
new_X_train = np.concatenate((non_stroke_X_train, stroke_X_train))
new_y_train = np.concatenate((non_stroke_y_train, stroke_y_train))


sgdc = SGDClassifier(alpha=0.0004, penalty='elasticnet', l1_ratio=0.05, random_state=24)

In [56]:
new_X_train.shape, new_y_train.shape

((2083, 12), (2083,))

In [78]:
len(mask[mask == True]), len(mask[mask == False])

(1924, 1187)

In [59]:
cross_val_scores(sgdc, new_X_train, new_y_train, splitter)

0.5443090728814661

In [60]:
get_val_pred_score(sgdc, new_X_train, new_y_train, X_val, y_val)

0.311284046692607

OneClassSVM + AdaBoostClassifier

The data are already prepared so it is just matter of running cross-validation and validation test.

In [63]:
abc_tng = AdaBoostClassifier(base_estimator=ExtraTreeClassifier(max_depth=2, min_samples_leaf=10, random_state=24),
                             learning_rate=0.05, n_estimators=900, random_state=24)

In [64]:
cross_val_scores(abc_tng, new_X_train, new_y_train, splitter)

0.7138814894256138

In [65]:
get_val_pred_score(abc_tng, new_X_train, new_y_train, X_val, y_val)

0.39589442815249265

Combining models do not improve the results for validation dataset. Could be that after additional hyperparameters tuning the results would be better. OneClassSVM labeled 38% of non-stroke samples as the outliers. That shows the dataset is highly varied. It is too many to just ignore them. 

Let's summarise the research in the final notebook the "Stroke_Prediction_Model".