### Choosing a metric

I first looked at the class distribution. The labels are binary, and the dataset is highly imbalanced: the positive class makes up only ~0.19% of all labels.

I suspected that a simple logistic regression could reach 99.81% accuracy on this dataset, because the model can cheat by classifying every sample as 0. I ran logistic regression using sklearn with default params, and the result confirmed my hypothesis: the model predicted every sample as 0.

Obviously, accuracy is a bad metric for such an imbalanced dataset, especially in the RTB context, because finding the samples that are likely to result in a click-through matters more than overall correctness. If the model predicts the majority label for every sample, recall is 0 and precision is undefined (there are no predicted positives). If the model predicts the minority label for every sample, recall is perfect but precision drops to the positive rate, about 0.19%. One potential metric is the F1 score, but it treats precision and recall as equally important, which in reality may not be true. If capturing most of the click-throughs is more important than serving efficiency (that is, whether most of the served impressions result in click-throughs), for example because there is no infrastructure bottleneck, then recall is more important than precision. On the other hand, if serving efficiency is more important than capturing most of the click-throughs, which may be true when the scale is large, then precision is more important. In either case, the F1 score can be very low even if we are happy with the model.
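
To make the two extreme cases concrete, here is a minimal sketch that computes precision, recall, and F1 from confusion-matrix counts. The class counts are the test-set counts reported in this write-up; the function name and the convention of reporting a zero-denominator ratio as 0.0 are choices made for this sketch only.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1); a ratio with a zero denominator is reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Test-set class counts taken from the confusion matrices in this write-up.
N_NEG, N_POS = 199623, 377

# Predict 0 for everything: there are no true or false positives.
all_negative = precision_recall_f1(tp=0, fp=0, fn=N_POS)
# Predict 1 for everything: every negative becomes a false positive.
all_positive = precision_recall_f1(tp=N_POS, fp=N_NEG, fn=0)
# Accuracy of the "always 0" model is just the share of negatives.
accuracy_all_negative = N_NEG / (N_NEG + N_POS)

print("all negative (precision, recall, f1):", all_negative)
print("all positive (precision, recall, f1):", all_positive)
print("accuracy when predicting all negative: %.4f" % accuracy_all_negative)
```

Both extremes give an F1 close to 0, while accuracy for the all-negative model stays near 99.81%.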

Without assuming too much about what we would want in real life, ROC AUC is a better metric because it summarizes the tradeoff between sensitivity and specificity. A value of 0.5 is no better than random guessing, and the optimal value approaches 1.
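
One way to read ROC AUC, when it is computed from scores, is as the probability that a randomly chosen positive is scored higher than a randomly chosen negative. The sketch below computes it that way by brute force over all pairs. It is for illustration only (quadratic time), and the function name and toy data are made up for this example.

```python
def roc_auc_from_scores(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 0, 0, 1, 1]
# Every positive is scored above every negative.
perfect = roc_auc_from_scores(labels, [0.1, 0.2, 0.3, 0.4, 0.8, 0.9])
# A constant score carries no ranking information.
constant = roc_auc_from_scores(labels, [0.5] * 6)
# One negative (0.7) outranks one positive (0.6).
mixed = roc_auc_from_scores(labels, [0.1, 0.7, 0.3, 0.4, 0.6, 0.9])
print(perfect, constant, mixed)
```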

### Baseline models

#### Logistic Regression
I reran the default logistic regression model and printed ROC AUC, f1 score, and the confusion matrix. I passed the binary predicted labels to ROC AUC rather than the scores, because I didn't want ROC AUC to be skewed by precision, and ultimately I care more about the predicted labels than the probabilities.
```
================================
Confusion Matrix:
True Negative = 199623
False Negative = 377
True Positive = 0
False Positive = 0
================================
f1 score = 0.000
================================
ROC AUC Score = 0.500
```
F1 score is coerced to 0 because there are no predicted positives at all. ROC AUC score is low, no better than a random classifier. Just for fun, I set l2 regularization explicitly with `penalty='l2'`; the result is the same (l2 is already sklearn's default penalty). The imbalance is just too strong for the model to even try to predict positives.
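
Because ROC AUC is computed here from hard labels rather than scores, the ROC curve has a single interior point, and the area reduces to the mean of the true positive rate and the true negative rate. The sketch below checks that identity against two confusion matrices reported in this write-up (the logistic regression baseline and the class-weighted network in the model improvements section); the function name is hypothetical.

```python
def label_roc_auc(tn, fp, fn, tp):
    """ROC AUC of hard 0/1 predictions: the average of sensitivity and specificity."""
    tpr = tp / (tp + fn)  # sensitivity / recall
    tnr = tn / (tn + fp)  # specificity
    return (tpr + tnr) / 2.0


# Baseline logistic regression: every sample predicted negative.
baseline = label_roc_auc(tn=199623, fp=0, fn=377, tp=0)
# Class-weighted two-layer network reported later in this write-up.
weighted_net = label_roc_auc(tn=190136, fp=9487, fn=328, tp=49)

print("baseline ROC AUC = %0.3f" % baseline)
print("weighted net ROC AUC = %0.3f" % weighted_net)
```

Read this way, the metric moves only when the model trades some specificity for sensitivity, which is why an all-negative model sits at exactly 0.5.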

#### Neural Net
The baseline neural net has 1 hidden layer with 88 units (one per feature), and an output layer with 2 outputs, one probability for each class, with a softmax activation. Categorical crossentropy is used as the loss.

The model is fit for 3 epochs with a batch size of 32. The baseline neural net should be roughly equivalent to a logistic regression, except that I'm using the SGD optimizer, so I picked a learning rate of 0.01 (it usually worked well for me when training baseline models). I specifically set the initial weights to `Ones`, because Keras defaults the initial weights to `glorot_uniform`, which generally works really well (and too well for a baseline model).

I set `validation_split` to 0.2 so that validation runs after each epoch.
Both training and validation error went down, and validation error was slightly smaller than training error in all epochs, indicating there is no overfitting.

Unsurprisingly, the accuracy is still 99.81%, with the vast majority of samples predicted negative. ROC AUC score is still 0.5.

The result:
```
================================
Confusion Matrix:
True Negative = 199619
False Negative = 377
True Positive = 0
False Positive = 4
================================
f1 score = 0.000
================================
ROC AUC Score = 0.500
```

Because I set the initial weights to `Ones`, I wanted to rule out the possibility that my weights were exploding or vanishing. I checked the hidden layer's weights, and they are all between -1 and 1, so my baseline neural net is fine. Even if I remove my initializer so that the weights default to `glorot_uniform`, the result is similar, with one true positive and more false positives:
```
================================
Confusion Matrix:
True Negative = 199418
False Negative = 376
True Positive = 1
False Positive = 205
================================
f1 score = 0.003
================================
ROC AUC Score = 0.501
```

### Model improvements
I think the main obstacle to achieving higher ROC AUC is the severe data imbalance. There are a few potential mitigations I can try:

- Deepen the neural network model
  <p>
    If we add more layers, the model will have a chance to explore nonlinearity, which might cover the positive space.
  </p>
- Down sampling
  <p>
    The negative class can be overpowering. We could find a good class ratio so that the model can be trained for positive classes without underfitting the negative class. A side benefit is the model will take less time to train.
  </p>
- Over sampling
  <p>
    Like down sampling, the goal is to make the data more balanced. If we observe underfitting for the negative class, we can try oversampling the positive class and find a good class ratio. Oversampling can be done with random selection and duplication. This could cause overfitting for the positive class, and definitely slows down training and can increase model size.
  </p>
- SMOTE
  <p>
    We can also try a hybrid approach: down sampling the negative class while generating synthetic samples for the positive class with nearest neighbours. This could help prevent positive class overfitting. But just like oversampling, it causes the model to train slower and the size to increase.
  </p>
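
The synthetic-sample idea in the SMOTE bullet can be sketched as plain interpolation between a minority point and one of its nearest minority neighbours. This is a simplified illustration, not the imblearn implementation: the function name, the random choice of base points, and the toy data are all assumptions for this example.

```python
import random


def smote_like_samples(minority, n_new, k=1, seed=0):
    """Create n_new synthetic points by interpolating toward one of the k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Four minority points at the corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_samples(minority, n_new=6, k=2, seed=42)
for point in new_points:
    print(point)
```

Every synthetic point lies on a segment between two real minority points, so the new samples are not exact duplicates, which is the property that is hoped to reduce memorization.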
  
  
#### Deepen the neural network model
I added one more hidden layer with 64 outputs, without any regularization, and left the initial weights at `glorot_uniform`. Both training and validation losses are smaller (around 0.015 vs. 0.030 in the baseline neural net). However, there is no improvement in metrics. The vast majority of predicted classes are still negative, and ROC AUC remains 0.5. The validation error also failed to decrease while training error did. This indicates overfitting, so I added `l2(0.01)` to the first hidden layer. This time validation error decreased with training error, but there's no improvement in metrics; in fact, the model predicts all negatives again. In effect, the extra layer only enabled the model to explore some nonlinearity, and as a result the error declined, but the reduction mostly comes from the negative class, which severely outweighs the positive class.

I tried tuning the learning rate (0.1, 0.001, 0.0001) and batch size (64, 128, 256) on different scales, and tried different weight initializers (random normal, he_normal, etc.). There was no improvement. Adding yet another layer didn't help either because, as expected, the reduced loss benefits the negative class much more than the positive class.

Since the imbalance is preventing the model from learning the positives, I used a weighted loss to balance the classes by setting `class_weight` (roughly 1:526) in the `fit` function. The losses were huge, the validation error plateaued, and the model again predicted all negatives. I had probably overshot the class weight, so I adjusted it to 1:100 and added l2(0.01). There were more false positives, but there was still overfitting, so I changed l2 to l1 in the first layer in an attempt to reduce features. This helped: I got more true positives, along with more false positives, and the ROC AUC rose to 0.541!

```
================================
Confusion Matrix:
True Negative = 190136
False Negative = 328
True Positive = 49
False Positive = 9487
================================
f1 score = 0.010
================================
ROC AUC Score = 0.541
```
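
The class weighting used above can be sketched without any framework: inverse-frequency weights give both classes the same total weight, and a weighted cross-entropy makes an "always negative" prediction much more expensive. This is a minimal sketch; the training counts (1531 positives in 800000 rows) are an assumption inferred from the sampler settings in the notebook, and the function names are hypothetical.

```python
import math


def balanced_class_weights(counts):
    """Inverse-frequency weights n_samples / (n_classes * count): equal total weight per class."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


def weighted_log_loss(y_true, p_pos, weights):
    """Mean binary cross-entropy where each sample is scaled by the weight of its true class."""
    total = 0.0
    for y, p in zip(y_true, p_pos):
        p = min(max(p, 1e-12), 1 - 1e-12)  # avoid log(0)
        total += -weights[y] * (math.log(p) if y == 1 else math.log(1 - p))
    return total / len(y_true)


# Assumed training counts (not stated directly in the write-up).
counts = {0: 800000 - 1531, 1: 1531}
weights = balanced_class_weights(counts)
ratio = weights[1] / weights[0]  # how much heavier one positive is than one negative
print("positive:negative weight ratio = %.1f" % ratio)

# 99 negatives and 1 positive, with a model that is confident "no click" everywhere.
y_true = [0] * 99 + [1]
always_negative = [0.01] * 100
plain = weighted_log_loss(y_true, always_negative, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(y_true, always_negative, {0: 1.0, 1: 100.0})
print("unweighted loss = %.3f, weighted (1:100) loss = %.3f" % (plain, weighted))
```

Under these assumed counts the fully balancing ratio comes out in the same ballpark as the 1:526 quoted above, which also shows how large a fully balancing weight is compared with the 1:100 that ended up working better.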

There's still a little overfitting, so I added a `Dropout` layer between the two hidden layers, with rate = 0.1. However, this nudged the model back to all negatives again. I tried tuning the dropout rate, and although a rate of 0.05 decreased the losses, it also prevented the model from predicting positives, probably because most of the loss it regularized comes from negatives. I also tuned the learning rate and decay, and tried out RMSprop. Decreasing the loss did not seem to be the problem.

I figured I could keep adding layers and tuning the extra hyperparameters, but I decided to explore the sampling route, as my model would get much slower and harder to tune as the number of trained parameters and hyperparameters increases.

#### Undersampling/Oversampling
I used Logistic Regression to evaluate my sampling strategies.

Initially I tried just random undersampling with a negative-positive ratio of 10:1. I had tried 1:1, but this led to overfitting due to the extremely small positive size. 10:1 resulted in an ROC AUC score of 0.501. I then used `class_weight` to counterbalance the classes so that the total weight of the two classes was equal (i.e. setting it to 1:10). This time the ROC AUC score improved to 0.663!

```
================================
Confusion Matrix:
True Negative = 129499
False Negative = 122
True Positive = 255
False Positive = 70124
================================
f1 score = 0.007
================================
ROC AUC Score = 0.663
```
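
The undersampling step and the matching class weight can be sketched in a few lines. This is a stdlib-only illustration of the idea, not the imblearn `RandomUnderSampler` used in the notebook; the function name and the toy label vector are assumptions.

```python
import random


def random_undersample(labels, neg_per_pos=10, seed=42):
    """Return sorted row indices keeping every positive and neg_per_pos negatives per positive."""
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg_idx), neg_per_pos * len(pos_idx))
    kept_neg = rng.sample(neg_idx, n_keep)  # sample negatives without replacement
    return sorted(pos_idx + kept_neg)


# Toy labels: 20 positives and 5000 negatives.
labels = [1] * 20 + [0] * 5000
kept = random_undersample(labels, neg_per_pos=10)
n_pos = sum(labels[i] for i in kept)
n_neg = len(kept) - n_pos

# Weighting the positive class by the remaining imbalance equalizes the total class weight.
class_weight = {0: 1.0, 1: n_neg / n_pos}
print(n_pos, n_neg, class_weight)
```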

I tried random oversampling next. However, the sampling process was very slow this time, because I was upsampling the minority class to have an equal size with the majority class. So I put the undersampler before the oversampler, and built a pipeline.

```
Undersample majority to 10:1 -> Oversample minority to 1:1
```

Because of the lack of positive samples, I worried that simple oversampling would cause overfitting, as the classifier tends to memorize the duplicated samples. So I built another pipeline with SMOTE as the second step.

```
Undersample majority to 10:1 -> SMOTE 1:1
```
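
As a rough illustration of the first pipeline (undersample the majority to 10:1, then oversample the minority by duplication to 1:1), here is a stdlib-only sketch. It is not the imblearn `Pipeline` from the notebook; the function name and toy data are hypothetical.

```python
import random
from collections import Counter


def undersample_then_oversample(rows, labels, neg_per_pos=10, seed=42):
    """Two-step resampling: cut negatives to neg_per_pos per positive, then duplicate positives to 1:1."""
    rng = random.Random(seed)
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    # Step 1: keep at most neg_per_pos negatives per positive.
    neg = rng.sample(neg, min(len(neg), neg_per_pos * len(pos)))
    # Step 2: duplicate randomly chosen positives until both classes have the same size.
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    pos = pos + extra
    return neg + pos, [0] * len(neg) + [1] * len(pos)


# Toy data: rows are just ids; the first 10 rows are the positives.
rows = list(range(1100))
labels = [1] * 10 + [0] * 1090
out_rows, out_labels = undersample_then_oversample(rows, labels)
counts = Counter(out_labels)
print(counts)
```

Each positive ends up repeated about ten times as an exact copy, which is the memorization risk that motivates swapping the second step for SMOTE.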

The results are (random oversampling pipeline first, then the SMOTE pipeline):
```
================================
Confusion Matrix:
True Negative = 129713
False Negative = 123
True Positive = 254
False Positive = 69910
================================
f1 score = 0.007
================================
ROC AUC Score = 0.662
```

```
================================
Confusion Matrix:
True Negative = 130028
False Negative = 125
True Positive = 252
False Positive = 69595
================================
f1 score = 0.007
================================
ROC AUC Score = 0.660
```

There is little difference between oversampling vs SMOTE, and having an oversampling step doesn't yield better results than just undersampling.

Because of undersampling, the overall size is still small. I removed the random seed and repeated the tests above 10 times and the results are similar.

I tuned the initial undersampling ratio higher, to 20:1 and 40:1, and the results were not better but slightly worse. I think it's because a higher undersampling ratio leads to noisier oversampled data down the pipeline, as the minority class has to be sampled/synthesized to a larger size.

After a closer look at the sampling results above, SMOTE actually results in fewer false positives and more true negatives. I played around with the two params, `k_neighbors` and `m_neighbors`, and was able to reduce the false positives a little further. The best config is `k_neighbors=1`, `m_neighbors=3`.

```
================================
Confusion Matrix:
True Negative = 130268
False Negative = 125
True Positive = 252
False Positive = 69355
================================
f1 score = 0.007
================================
ROC AUC Score = 0.661
```

#### Ensembling
So far I hadn't spent much energy on improving the undersampling. I saw decent improvement in ROC AUC score from 0.5 to 0.66, but false positives were still high. I stumbled upon `BalanceCascade` and wondered if more sophisticated estimators with iterative undersampling could help reduce potential noise in the features and negative samples.

I still had to undersample as the first step so that ensembling doesn't take forever.

__EasyEnsemble__
No estimator, just iterative random undersampling.
Results are in line with random undersampling.
```
================================
Confusion Matrix:
True Negative = 126066
False Negative = 124
True Positive = 253
False Positive = 73557
================================
f1 score = 0.007
================================
ROC AUC Score = 0.651
```
__BalanceCascade__
Without an estimator, it uses KNN
```
================================
Confusion Matrix:
True Negative = 115446
False Negative = 149
True Positive = 228
False Positive = 84177
================================
f1 score = 0.005
================================
ROC AUC Score = 0.592
```

__BalanceCascade + DecisionTreeClassifier__
This is not bad as I left all params default.
```
================================
Confusion Matrix:
True Negative = 128417
False Negative = 152
True Positive = 225
False Positive = 71206
================================
f1 score = 0.006
================================
ROC AUC Score = 0.620
```
So far I had not limited the maximum number of subsets, so I tried 20, 10, and 5.
5 had the best result, with ROC AUC = 0.674, whereas 20 -> 0.636 and 10 -> 0.644.
```
================================
Confusion Matrix:
True Negative = 127183
False Negative = 109
True Positive = 268
False Positive = 72440
================================
f1 score = 0.007
================================
ROC AUC Score = 0.674
```
Then I tried to limit `max_features`, with the assumption that maybe there is still too much feature noise. Unfortunately, as the `max_features` ratio goes down from 0.8 to 0.2, the ROC AUC also declines.

I tried to tune `min_samples_leaf` as well but didn't see improvement.
Maybe it's because I had already severely limited max subsets? I increased max subsets to 10, and this time the best max feature ratio was 0.8. ROC AUC was better than with the same feature limit at max subsets of 5, but it still doesn't beat 0.674.

__BalanceCascade + RandomForestClassifier__
I could have tried to tune this. But since the baseline result below isn't better than the decision tree's, I just focused on the decision tree instead.
```
================================
Confusion Matrix:
True Negative = 132453
False Negative = 161
True Positive = 216
False Positive = 67170
================================
f1 score = 0.006
================================
ROC AUC Score = 0.618
```

__BalanceCascade + XGBClassifier__
I was going to try XGB, but random forest didn't perform well, and XGB was taking a very long time.

With the best sampling config (__BalanceCascade + DecisionTreeClassifier__ with a max subset of 5), I tried to tune logistic regression with different params on log scale: l2/l1, regularization strength (C from 0.001 to 2). Nothing beats l2 with C=1.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import numpy as np
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import NearMiss, RandomUnderSampler
from imblearn.combine import SMOTEENN,SMOTETomek
from imblearn.ensemble import BalanceCascade, EasyEnsemble
from imblearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.model_selection import GridSearchCV
import h5py
import keras
from sklearn.utils import class_weight
from keras.utils import to_categorical
from keras.optimizers import *
from keras.regularizers import *
from keras.models import Sequential
from keras.layers import Dense, Dropout
from sklearn.metrics import confusion_matrix, roc_auc_score, f1_score, precision_recall_fscore_support
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
import xgboost as xgb
from xgboost import XGBClassifier


%matplotlib inline

KFOLD_SEED = 42


def rtb_confusion_matrix(test_labels, y_preds):
    m = confusion_matrix(test_labels, y_preds)
    
    print("================================")
    print("Confusion Matrix:")
    print("True Negative = %d" % m[0][0])
    print("False Negative = %d" % m[1][0])
    print("True Positive = %d" % m[1][1])
    print("False Positive = %d" % m[0][1])


def rtb_f1_score(test_labels, y_preds):
    f = f1_score(test_labels, y_preds)
    print("================================")
    print("f1 score = %0.3f" % f)


def print_metrics(true_labels, y_preds, y_scores, is_train=True):
    if is_train:
        print("--------train---------")
    else:
        print("--------test---------")
    
    rtb_confusion_matrix(true_labels, y_preds)
    rtb_f1_score(true_labels, y_preds)
    print("================================")
    print("ROC AUC Score = %0.3f" % roc_auc_score(true_labels, y_scores.argmax(axis=-1)))
    
def keras_print_metrics(true_labels, y_scores, is_train=True):
    y_preds = y_scores.argmax(axis=-1)
    
    if is_train:
        print("--------train---------")
    else:
        print("--------test---------")
    
    rtb_confusion_matrix(true_labels, y_preds)
    rtb_f1_score(true_labels, y_preds)
    print("================================")
    print("ROC AUC Score = %0.3f" % roc_auc_score(true_labels, y_preds))



In [2]:
input_path = '~/data/biddings.csv'
data = pd.read_csv(input_path)
print(data.shape)

train = data[:800000]
test = data[800000:]

sample = train.sample(frac=1)
features = sample.drop('convert', axis=1).values
labels = sample.convert.ravel()
categorical_labels = to_categorical(labels, 2)

test_features = test.drop('convert', axis=1).values
test_labels = test.convert.ravel()
categorical_test_labels = to_categorical(test_labels, 2)

(1000000, 89)


### Baseline models

In [9]:
lr = LogisticRegression(penalty='l2', random_state=KFOLD_SEED, verbose=2)

model = lr.fit(features, labels)
predicted_scores = model.predict_proba(test_features)
predicted_labels = model.predict(test_features)
print(predicted_scores.shape, predicted_labels.shape)

print_metrics(test_labels, predicted_labels, predicted_scores, is_train=False)


[LibLinear](200000, 2) (200000,)
--------test---------
Confusion Matrix:
True Negative = 199623
False Negative = 377
True Positive = 0
False Positive = 0
f1 score = 0.000
ROC AUC Score = 0.500




In [21]:
model = Sequential()
model.add(Dense(88, input_shape=(88,)))
model.add(Dense(2, activation='softmax'))
model.summary()

model.compile(loss='categorical_crossentropy',
              optimizer=SGD(lr=0.01),
              metrics=['accuracy'])

history = model.fit(features, categorical_labels,
                    batch_size=32,
                    epochs=3,
                    callbacks=[keras.callbacks.EarlyStopping()],
                    validation_split=0.2,
                    verbose=1)

y_scores = model.predict(test_features, verbose=1)

keras_print_metrics(test_labels, y_scores)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_11 (Dense)             (None, 88)                7832      
_________________________________________________________________
dense_12 (Dense)             (None, 2)                 178       
Total params: 8,010
Trainable params: 8,010
Non-trainable params: 0
_________________________________________________________________
Train on 640000 samples, validate on 160000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
--------train---------
Confusion Matrix:
True Negative = 199418
False Negative = 376
True Positive = 1
False Positive = 205
f1 score = 0.003
ROC AUC Score = 0.501


#### Deeper neural nets

In [13]:
model = Sequential()
model.add(Dense(88, kernel_regularizer=l1(0.01), input_shape=(88,)))
model.add(Dense(64, kernel_regularizer=l2(0.01)))
model.add(Dense(2, activation='softmax'))
model.summary()

model.compile(loss='categorical_crossentropy',
              optimizer=SGD(lr=0.01),
              metrics=['accuracy'])

history = model.fit(features, categorical_labels,
                    batch_size=32,
                    class_weight={0:1, 1:100},
                    epochs=3,
                    callbacks=[keras.callbacks.EarlyStopping()],
                    validation_split=0.2,
                    verbose=1)

y_scores = model.predict(test_features, verbose=1)

keras_print_metrics(test_labels, y_scores)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_28 (Dense)             (None, 88)                7832      
_________________________________________________________________
dense_29 (Dense)             (None, 64)                5696      
_________________________________________________________________
dense_30 (Dense)             (None, 2)                 130       
Total params: 13,658
Trainable params: 13,658
Non-trainable params: 0
_________________________________________________________________
Train on 640000 samples, validate on 160000 samples
Epoch 1/3
Epoch 2/3
--------train---------
Confusion Matrix:
True Negative = 199437
False Negative = 375
True Positive = 2
False Positive = 186
f1 score = 0.007
ROC AUC Score = 0.502


In [20]:
w = model.get_layer(name='dense_10').get_weights()

In [6]:
def logistic_regression():
    lr = LogisticRegression(penalty='l2', random_state=KFOLD_SEED, verbose=2)
    return lr


def pipeline_test(pipeline, features, labels):
    pipeline.fit(features, labels)
    
    predicted_scores = pipeline.predict_proba(test_features)
    predicted_labels = pipeline.predict(test_features)
        
    print_metrics(test_labels, predicted_labels, predicted_scores, is_train=False)


### Undersampling / Oversampling

In [7]:
def sample_pipelines_test(pipeline_test_fn=pipeline_test):
    rus = RandomUnderSampler(ratio={0: 1531*10, 1: 1531}, random_state=KFOLD_SEED)
    ros = RandomOverSampler(random_state=KFOLD_SEED)
    smote = SMOTE(n_jobs=-1, k_neighbors=1, m_neighbors = 3, random_state=KFOLD_SEED)
    
    lr1 = LogisticRegression(penalty='l2', class_weight={0:1, 1:10}, random_state=KFOLD_SEED, verbose=2)
    p1 = Pipeline([('rus', rus), ('lr', lr1)])
    pipeline_test_fn(p1, features, labels)
    
    p2 = Pipeline([('rus', rus), ('ros', ros), ('lr', logistic_regression())])
    pipeline_test_fn(p2, features, labels)
    
    p3 = Pipeline([('rus', rus), ('smote', smote), ('lr', logistic_regression())])
    pipeline_test_fn(p3, features, labels)
    
    

sample_pipelines_test(pipeline_test_fn=pipeline_test)


[LibLinear]--------test---------
Confusion Matrix:
True Negative = 128714
False Negative = 123
True Positive = 254
False Positive = 70909
f1 score = 0.007
ROC AUC Score = 0.659
[LibLinear]--------test---------
Confusion Matrix:
True Negative = 129604
False Negative = 124
True Positive = 253
False Positive = 70019
f1 score = 0.007
ROC AUC Score = 0.660
[LibLinear]--------test---------
Confusion Matrix:
True Negative = 130225
False Negative = 121
True Positive = 256
False Positive = 69398
f1 score = 0.007
ROC AUC Score = 0.666


### Ensemble

In [8]:
def ensembler_test(classifier_fn, ensemblers):
    rus = RandomUnderSampler(ratio={0: 1531*10, 1: 1531}, random_state=KFOLD_SEED)
    X_us, y_us = rus.fit_sample(features, labels)
    
    for i, e in enumerate(ensemblers):
        print("fitting sample")
        X_res, y_res = e.fit_sample(X_us, y_us)
        print(X_res.shape, y_res.shape)
        clf = classifier_fn()
        print("training")
        
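        # NOTE: every fit() call retrains clf from scratch, so after this loop
        # `model` reflects only the last subset, not an ensemble over all subsets.
        for j, X_train in enumerate(X_res):
            model = clf.fit(X_train, y_res[j])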
        
        predicted_scores = model.predict_proba(test_features)
        predicted_labels = model.predict(test_features)
        
        print("Ensembler %d" % i)
        print_metrics(test_labels, predicted_labels, predicted_scores, is_train=False)
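
In `ensembler_test` above, the same classifier is refit on each subset in turn, so the final predictions come from the last fit only. One alternative, sketched below, is to keep one model per subset and average their predicted probabilities (soft voting). This is only an illustration of that idea under stated assumptions: the centroid-based scorer is a toy stand-in for `LogisticRegression`, and all function names and data are hypothetical.

```python
import math


def fit_centroid_scorer(X, y):
    """Toy 'classifier': score a point by how much closer it is to the positive centroid."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    pos_c = centroid([x for x, t in zip(X, y) if t == 1])
    neg_c = centroid([x for x, t in zip(X, y) if t == 0])

    def predict_proba(x):
        d_pos = math.dist(x, pos_c)
        d_neg = math.dist(x, neg_c)
        return 1.0 / (1.0 + math.exp(d_pos - d_neg))  # closer to positives gives a value above 0.5

    return predict_proba


def fit_ensemble(subsets):
    """Fit one independent model per (X, y) subset instead of refitting a single model."""
    return [fit_centroid_scorer(X, y) for X, y in subsets]


def ensemble_proba(models, x):
    """Soft voting: average the positive-class probability over all member models."""
    return sum(m(x) for m in models) / len(models)


# Two small balanced subsets, standing in for the resampled subsets.
subsets = [
    ([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]], [0, 0, 1, 1]),
    ([[0.1, 0.3], [0.0, 0.2], [1.2, 0.8], [1.0, 0.9]], [0, 0, 1, 1]),
]
models = fit_ensemble(subsets)
p_near_pos = ensemble_proba(models, [1.0, 1.0])
p_near_neg = ensemble_proba(models, [0.0, 0.0])
print(p_near_pos, p_near_neg)
```

Whether averaging would change the ROC AUC numbers reported here is not something this sketch can show; it only illustrates how every subset could contribute to the final prediction.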

EasyEnsemble and decision tree are consistently the best

In [9]:
ee = EasyEnsemble(random_state=KFOLD_SEED)
bc = BalanceCascade(random_state=KFOLD_SEED)

dt = DecisionTreeClassifier(class_weight='balanced', random_state=KFOLD_SEED)
bc_dt = BalanceCascade(estimator=dt, random_state=KFOLD_SEED)

rf = RandomForestClassifier(class_weight='balanced', n_jobs=-1, random_state=KFOLD_SEED, verbose=1)
bc_rf = BalanceCascade(estimator=rf, random_state=KFOLD_SEED)

# xgbc = XGBClassifier(n_jobs=-1, n_estimators=10, scale_pos_weight=10)
# bc_xgbc = BalanceCascade(estimator=xgbc, random_state=KFOLD_SEED)


ensemblers = [ee, bc, bc_dt, bc_rf]
ensembler_test(logistic_regression, ensemblers)

fitting sample
(10, 3062, 88) (10, 3062)
training
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]Ensembler 0
--------test---------
Confusion Matrix:
True Negative = 126771
False Negative = 119
True Positive = 258
False Positive = 72852
f1 score = 0.007
ROC AUC Score = 0.660
fitting sample
(18, 3062, 88) (18, 3062)
training
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]Ensembler 1
--------test---------
Confusion Matrix:
True Negative = 109272
False Negative = 146
True Positive = 231
False Positive = 90351
f1 score = 0.005
ROC AUC Score = 0.580
fitting sample
(17, 3062, 88) (17, 3062)
training
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear


(15, 3062, 88) (15, 3062)
training
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]Ensembler 3
--------test---------
Confusion Matrix:
True Negative = 143210
False Negative = 172
True Positive = 205
False Positive = 56413
f1 score = 0.007
ROC AUC Score = 0.631


Try to tune EasyEnsemble by adjusting the number of subsets. It does not affect f1 score or ROC AUC score.
```
ee = EasyEnsemble(n_subsets = 100, random_state=KFOLD_SEED)
ensembler_test(logistic_regression, [ee])

ee = EasyEnsemble(n_subsets = 4, random_state=KFOLD_SEED)
ensembler_test(logistic_regression, [ee])
```
Both result in:
```
================================
Confusion Matrix:
True Negative = 127470
False Negative = 132
True Positive = 245
False Positive = 72153
================================
f1 score = 0.007
================================
ROC AUC Score = 0.644
```

Tune max subset for BalanceCascade with DecisionTreeClassifier

5 is the best

In [11]:
dt_20 = DecisionTreeClassifier(random_state=KFOLD_SEED)
bc_dt_20 = BalanceCascade(estimator=dt_20, n_max_subset=20, random_state=KFOLD_SEED)

dt_10 = DecisionTreeClassifier(random_state=KFOLD_SEED)
bc_dt_10 = BalanceCascade(estimator=dt_10, n_max_subset=10, random_state=KFOLD_SEED)

dt_5 = DecisionTreeClassifier(random_state=KFOLD_SEED)
bc_dt_5 = BalanceCascade(estimator=dt_5, n_max_subset=5, random_state=KFOLD_SEED)

ensemblers = [bc_dt_20, bc_dt_10, bc_dt_5]
ensembler_test(logistic_regression, ensemblers)

fitting sample
(17, 3062, 88) (17, 3062)
training
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]Ensembler 0
--------test---------
Confusion Matrix:
True Negative = 134970
False Negative = 152
True Positive = 225
False Positive = 64653
f1 score = 0.007
ROC AUC Score = 0.636
fitting sample
(10, 3062, 88) (10, 3062)
training
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]Ensembler 1
--------test---------
Confusion Matrix:
True Negative = 130663
False Negative = 138
True Positive = 239
False Positive = 68960
f1 score = 0.007
ROC AUC Score = 0.644
fitting sample
(5, 3062, 88) (5, 3062)
training
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]Ensembler 2
--------test---------
Confusion Matrix:
True Negative = 127183
False Negative = 109
True Positive = 268
False Positive = 72440
f1 sc

Tune max features. In the output below, ROC AUC declines as the ratio goes down (0.658 at 0.8, 0.656 at 0.4, 0.642 at 0.2), so limiting features does not help.

In [12]:
# 70 features
dt_08 = DecisionTreeClassifier(max_features=0.8, random_state=KFOLD_SEED)
bc_dt_08 = BalanceCascade(estimator=dt_08, n_max_subset=5, random_state=KFOLD_SEED)

# 35 features
dt_04 = DecisionTreeClassifier(max_features=0.4, random_state=KFOLD_SEED)
bc_dt_04 = BalanceCascade(estimator=dt_04, n_max_subset=5, random_state=KFOLD_SEED)

# 17 features
dt_02 = DecisionTreeClassifier(max_features=0.2, random_state=KFOLD_SEED)
bc_dt_02 = BalanceCascade(estimator=dt_02, n_max_subset=5, random_state=KFOLD_SEED)

# Auto is sqrt(n_features) ~= 9
dt_auto = DecisionTreeClassifier(max_features='auto', random_state=KFOLD_SEED)
bc_dt_auto = BalanceCascade(estimator=dt_auto, n_max_subset=5, random_state=KFOLD_SEED)

ensemblers = [bc_dt_08, bc_dt_04, bc_dt_02, bc_dt_auto]
ensembler_test(logistic_regression, ensemblers)

fitting sample
(5, 3062, 88) (5, 3062)
training
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]Ensembler 0
--------test---------
Confusion Matrix:
True Negative = 127258
False Negative = 121
True Positive = 256
False Positive = 72365
f1 score = 0.007
ROC AUC Score = 0.658
fitting sample
(5, 3062, 88) (5, 3062)
training
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]Ensembler 1
--------test---------
Confusion Matrix:
True Negative = 128352
False Negative = 125
True Positive = 252
False Positive = 71271
f1 score = 0.007
ROC AUC Score = 0.656
fitting sample
(5, 3062, 88) (5, 3062)
training
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]Ensembler 2
--------test---------
Confusion Matrix:
True Negative = 131504
False Negative = 141
True Positive = 236
False Positive = 68119
f1 score = 0.007
ROC AUC Score = 0.642
fitting sample
(5, 3062, 88) (5, 3062)
training
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]Ensembler 3
--------test---------
Confusion Matrix:
Tr

No need to tune decision tree's class weight, since ensembler already ensures both classes have equal samples

Min samples at leaves do not seem to matter

In [13]:
dt_min_samples_50 = DecisionTreeClassifier(min_samples_leaf=50, random_state=KFOLD_SEED)
bc_dt_min_samples_50 = BalanceCascade(estimator=dt_min_samples_50, n_max_subset=5, random_state=KFOLD_SEED)

dt_min_samples_20 = DecisionTreeClassifier(min_samples_leaf=20, random_state=KFOLD_SEED)
bc_dt_min_samples_20 = BalanceCascade(estimator=dt_min_samples_20, n_max_subset=5, random_state=KFOLD_SEED)

dt_min_samples_10 = DecisionTreeClassifier(min_samples_leaf=10, random_state=KFOLD_SEED)
bc_dt_min_samples_10 = BalanceCascade(estimator=dt_min_samples_10, n_max_subset=5, random_state=KFOLD_SEED)

dt_min_samples_5 = DecisionTreeClassifier(min_samples_leaf=5, random_state=KFOLD_SEED)
bc_dt_min_samples_5 = BalanceCascade(estimator=dt_min_samples_5, n_max_subset=5, random_state=KFOLD_SEED)

ensemblers = [bc_dt_min_samples_50, bc_dt_min_samples_20, bc_dt_min_samples_10, bc_dt_min_samples_5]
ensembler_test(logistic_regression, ensemblers)

fitting sample
(5, 3062, 88) (5, 3062)
training
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]Ensembler 0
--------test---------
Confusion Matrix:
True Negative = 130197
False Negative = 134
True Positive = 243
False Positive = 69426
f1 score = 0.007
ROC AUC Score = 0.648
fitting sample
(5, 3062, 88) (5, 3062)
training
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]Ensembler 1
--------test---------
Confusion Matrix:
True Negative = 129656
False Negative = 138
True Positive = 239
False Positive = 69967
f1 score = 0.007
ROC AUC Score = 0.642
fitting sample
(5, 3062, 88) (5, 3062)
training
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]Ensembler 2
--------test---------
Confusion Matrix:
True Negative = 129531
False Negative = 140
True Positive = 237
False Positive = 70092
f1 score = 0.007
ROC AUC Score = 0.639
fitting sample
(5, 3062, 88) (5, 3062)
training
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]Ensembler 3
--------test---------
Confusion Matrix:
Tr

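The logistic regression class weight has to be balanced!
If the negative class is weighted more heavily, both TPR and FPR decrease, but the drop in TPR hurts ROC AUC more than the drop in FPR helps.
If the positive class is weighted more heavily, the opposite happens: both TPR and FPR increase, and the rise in FPR outweighs the gain in TPR.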

In [15]:
def create_lr_proxy(class_weight='balanced'):
    def create_lr():
        return LogisticRegression(penalty='l2', class_weight=class_weight)
    return create_lr

dt = DecisionTreeClassifier(random_state=KFOLD_SEED)
bc = BalanceCascade(estimator=dt, n_max_subset=5, random_state=KFOLD_SEED)

ensembler_test(create_lr_proxy({0: 2, 1: 1}), [bc])
ensembler_test(create_lr_proxy(), [bc])
ensembler_test(create_lr_proxy({0: 1, 1: 2}), [bc])
ensembler_test(create_lr_proxy({0: 1, 1: 4}), [bc])

fitting sample
(5, 3062, 88) (5, 3062)
training
Ensembler 0
--------test---------
Confusion Matrix:
True Negative = 181161
False Negative = 271
True Positive = 106
False Positive = 18462
f1 score = 0.011
ROC AUC Score = 0.594
fitting sample
(5, 3062, 88) (5, 3062)
training
Ensembler 0
--------test---------
Confusion Matrix:
True Negative = 127183
False Negative = 109
True Positive = 268
False Positive = 72440
f1 score = 0.007
ROC AUC Score = 0.674
fitting sample
(5, 3062, 88) (5, 3062)
training
Ensembler 0
--------test---------
Confusion Matrix:
True Negative = 50431
False Negative = 30
True Positive = 347
False Positive = 149192
f1 score = 0.005
ROC AUC Score = 0.587
fitting sample
(5, 3062, 88) (5, 3062)
training
Ensembler 0
--------test---------
Confusion Matrix:
True Negative = 14369
False Negative = 11
True Positive = 366
False Positive = 185254
f1 score = 0.004
ROC AUC Score = 0.521
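
The scores in the four runs above can be reproduced from the confusion matrices alone. Because ROC AUC is computed here from hard 0/1 predictions rather than probabilities, the ROC curve has a single operating point between (0, 0) and (1, 1), and the area under it reduces to (TPR + TNR) / 2. The sketch below recomputes the metrics from the printed counts; the helper name `hard_label_metrics` and the run labels are my own, not part of the notebook.

```python
def hard_label_metrics(tn, fn, tp, fp):
    """Return (TPR, FPR, ROC AUC, f1) for hard 0/1 predictions."""
    tpr = tp / (tp + fn)              # recall on the positive class
    fpr = fp / (fp + tn)              # share of negatives flagged as positive
    auc = (tpr + (1.0 - fpr)) / 2.0   # area under the one-point ROC curve
    f1 = 2 * tp / (2 * tp + fp + fn)
    return tpr, fpr, auc, f1

# Counts copied from the four outputs above: (TN, FN, TP, FP).
runs = {
    "{0: 2, 1: 1}": (181161, 271, 106, 18462),
    "balanced": (127183, 109, 268, 72440),
    "{0: 1, 1: 2}": (50431, 30, 347, 149192),
    "{0: 1, 1: 4}": (14369, 11, 366, 185254),
}

results = {name: hard_label_metrics(*counts) for name, counts in runs.items()}
for name, (tpr, fpr, auc, f1) in results.items():
    print(f"{name:>12}: TPR={tpr:.3f} FPR={fpr:.3f} AUC={auc:.3f} f1={f1:.3f}")
```

Read this way, the balanced setting wins because it is the only one where neither TPR nor TNR collapses: weighting negatives more heavily drops TPR to about 0.28, while weighting positives more heavily pushes FPR above 0.74.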


Trying L1 regularization: C=1.0 is just right. ROC AUC is 0.644 for C=2.0, 1.0, and 0.8, and drops to 0.638 at C=0.5.

In [14]:
def create_lr_proxy(C=1.0):
    def create_lr():
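        # liblinear supports the L1 penalty; newer scikit-learn defaults to lbfgs, which does not
        return LogisticRegression(penalty='l1', C=C, solver='liblinear', random_state=KFOLD_SEED)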
    return create_lr

dt = DecisionTreeClassifier(max_features=0.2, random_state=KFOLD_SEED)
bc = BalanceCascade(estimator=dt, n_max_subset=5, random_state=KFOLD_SEED)

ensembler_test(create_lr_proxy(2.0), [bc])
ensembler_test(create_lr_proxy(1.0), [bc])
ensembler_test(create_lr_proxy(0.8), [bc])
ensembler_test(create_lr_proxy(0.5), [bc])
ensembler_test(create_lr_proxy(0.2), [bc])

fitting sample
(5, 3062, 88) (5, 3062)
training
Ensembler 0
--------test---------
Confusion Matrix:
True Negative = 131468
False Negative = 140
True Positive = 237
False Positive = 68155
f1 score = 0.007
ROC AUC Score = 0.644
fitting sample
(5, 3062, 88) (5, 3062)
training
Ensembler 0
--------test---------
Confusion Matrix:
True Negative = 131521
False Negative = 140
True Positive = 237
False Positive = 68102
f1 score = 0.007
ROC AUC Score = 0.644
fitting sample
(5, 3062, 88) (5, 3062)
training
Ensembler 0
--------test---------
Confusion Matrix:
True Negative = 131493
False Negative = 140
True Positive = 237
False Positive = 68130
f1 score = 0.007
ROC AUC Score = 0.644
fitting sample
(5, 3062, 88) (5, 3062)
training
Ensembler 0
--------test---------
Confusion Matrix:
True Negative = 131491
False Negative = 144
True Positive = 233
False Positive = 68132
f1 score = 0.007
ROC AUC Score = 0.638
fitting sample
(5, 3062, 88) (5, 3062)
training
Ensembler 0
--------test---------
Confusion Matr