# Benchmarks

## Setup

We'll use the [credit card dataset](https://datahub.io/machine-learning/creditcard), which is commonly used as a benchmark for imbalanced binary classification.

In [2]:
!wget https://datahub.io/machine-learning/creditcard/r/creditcard.csv

--2020-08-12 18:20:10--  https://datahub.io/machine-learning/creditcard/r/creditcard.csv
Resolving datahub.io (datahub.io)... 104.18.48.253, 104.18.49.253, 172.67.157.38
Connecting to datahub.io (datahub.io)|104.18.48.253|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://pkgstore.datahub.io/machine-learning/creditcard/creditcard_csv/data/ebdc64b6837b3026238f3fcad3402337/creditcard_csv.csv [following]
--2020-08-12 18:20:11--  https://pkgstore.datahub.io/machine-learning/creditcard/creditcard_csv/data/ebdc64b6837b3026238f3fcad3402337/creditcard_csv.csv
Resolving pkgstore.datahub.io (pkgstore.datahub.io)... 104.18.49.253, 172.67.157.38, 104.18.48.253
Connecting to pkgstore.datahub.io (pkgstore.datahub.io)|104.18.49.253|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 151114991 (144M) [text/csv]
Saving to: ‘creditcard.csv’


2020-08-12 18:20:21 (16.7 MB/s) - ‘creditcard.csv’ saved [151114991/151114991]



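Preprocess the data and do a train/test split. We drop the `Time` column, standardize the features, convert the quoted string labels to 0/1, and hold out the last 40,000 rows as the test set.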

In [72]:
import pandas as pd
from sklearn import preprocessing

credit = pd.read_csv('creditcard.csv')
credit = credit.drop(columns=['Time'])
features = credit.columns.drop('Class')
credit.loc[:, features] = preprocessing.scale(credit.loc[:, features])
credit['Class'] = (credit['Class'] == "'1'").astype(int)

n_test = 40_000
credit[:-n_test].to_csv('train.csv', index=False)
credit[-n_test:].to_csv('test.csv', index=False)

While we're at it, let's look at the class distribution.

In [94]:
credit['Class'].value_counts(normalize=True)

0    0.998273
1    0.001727
Name: Class, dtype: float64
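Such a low positive rate has a practical consequence for mini-batch training. As a rough sketch, assuming labels are independent draws at the rate shown above and using the batch size of 16 that appears later in this notebook, the following computes how often a batch contains no positive example at all.

```python
# Share of positives reported by value_counts above
positive_rate = 0.001727
# Batch size used by the DataLoaders later in this notebook
batch_size = 16

# Probability that all 16 labels in a batch are negative,
# assuming labels are independent draws (a simplification)
p_no_positive = (1 - positive_rate) ** batch_size

# Average number of batches read before one contains a positive
batches_per_positive_batch = 1 / (1 - p_no_positive)

print(f'{p_no_positive:.3f}', f'{batches_per_positive_batch:.1f}')
```

Under these assumptions, about 97% of batches contain only negatives, so a positive shows up in roughly one batch out of 37.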

Define a helper function that trains a PyTorch model with a single pass over batches drawn from an `IterableDataset`.

In [73]:
import torch

def train(net, optimizer, criterion, train_batches):

    for x_batch, y_batch in train_batches:

        optimizer.zero_grad()
        y_pred = net(x_batch)
        loss = criterion(y_pred[:, 0], y_batch.float())
        loss.backward()
        optimizer.step()
        
    return net

Define a helper function to score a PyTorch model on a test set.

In [84]:
import numpy as np

def score(net, test_batches, metric):

    y_true = []
    y_pred = []

    # Gradients are not needed for evaluation
    with torch.no_grad():
        for x_batch, y_batch in test_batches:
            y_true.extend(y_batch.numpy())
            # Predictions are kept as raw logits, not probabilities
            y_pred.extend(net(x_batch).numpy()[:, 0])

    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    return metric(y_true, y_pred)
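Note that `score` passes the network's raw outputs (logits) to the metric without applying a sigmoid. For a ranking metric such as the ROC AUC used later in this notebook, this makes no difference, because the metric only depends on how the scores order the examples. The sketch below illustrates this with a hypothetical `pairwise_auc` helper and made-up scores; it is written for illustration only, and the notebook itself relies on `metrics.roc_auc_score`.

```python
import math


def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs where the positive scores higher.

    Ties count for half. Quadratic in the number of samples, so only
    suitable for small illustrations.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and logits
y_true = [0, 0, 1, 0, 1, 1]
logits = [-3.2, -0.4, -1.1, 0.3, 1.7, 2.5]
# The sigmoid is strictly increasing, so it preserves the ranking
probas = [1 / (1 + math.exp(-z)) for z in logits]

auc_logits = pairwise_auc(y_true, logits)
auc_probas = pairwise_auc(y_true, probas)
print(auc_logits == auc_probas)
```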

Let's also create an `IterableDataset` that reads from a CSV file. The following implementation is neither generic nor flexible, but it will do for this notebook.

In [78]:
import csv

class IterableCSV(torch.utils.data.IterableDataset):
    
    def __init__(self, path):
        self.path = path
        
    def __iter__(self):
        
        with open(self.path) as file:
            reader = csv.reader(file)
            next(reader)  # skip the header row
            for row in reader:
                # Every column but the last is a feature; the last one is the label
                x = [float(el) for el in row[:-1]]
                y = int(row[-1])
                yield torch.tensor(x), y

We can now define the training and test datasets.

In [79]:
train_set = IterableCSV(path='train.csv')
test_set = IterableCSV(path='test.csv')
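To make the shapes handled by the helpers concrete, here is a small sketch of what a `DataLoader` yields when it wraps a dataset of this kind. The `ToyCSV` class and the 5-row file are made up for illustration; `ToyCSV` simply repeats the logic of `IterableCSV` so that the example runs on its own.

```python
import csv
import os
import tempfile

import torch


class ToyCSV(torch.utils.data.IterableDataset):
    """Same idea as IterableCSV, repeated so this sketch is self-contained."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as file:
            reader = csv.reader(file)
            next(reader)  # skip the header row
            for row in reader:
                yield torch.tensor([float(el) for el in row[:-1]]), int(row[-1])


with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a made-up file with two features and a label column
    path = os.path.join(tmp_dir, 'toy.csv')
    with open(path, 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(['a', 'b', 'Class'])
        for i in range(5):
            writer.writerow([i, i / 2, i % 2])

    loader = torch.utils.data.DataLoader(ToyCSV(path), batch_size=2)
    # Record the feature shape, label shape and label dtype of each batch
    shapes = [(tuple(x.shape), tuple(y.shape), y.dtype) for x, y in loader]

print(shapes)
```

The features are stacked into a `(batch_size, n_features)` tensor and the integer labels are collated into an integer tensor, which is why `train` converts them with `y_batch.float()` before computing the loss. The last batch can be smaller than `batch_size`.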

## Vanilla

In [117]:
def make_net(n_features):
    
    torch.manual_seed(0)
    
    return torch.nn.Sequential(
        torch.nn.Linear(n_features, 30),
        torch.nn.Linear(30, 1)
    )

In [118]:
%%time

net = make_net(len(features))

net = train(
    net,
    optimizer=torch.optim.SGD(net.parameters(), lr=1e-2),
    criterion=torch.nn.BCEWithLogitsLoss(),
    train_batches=torch.utils.data.DataLoader(train_set, batch_size=16)
)

CPU times: user 10.6 s, sys: 240 ms, total: 10.8 s
Wall time: 10.8 s


In [119]:
from sklearn import metrics

score(
    net,
    test_batches=torch.utils.data.DataLoader(test_set, batch_size=16),
    metric=metrics.roc_auc_score
)

0.9557763595155662

## Under-sampling

In [120]:
import pytorch_resample

train_sample = pytorch_resample.UnderSampler(
    train_set,
    desired_dist={0: .8, 1: .2},
    seed=42
)
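Under-sampling discards examples from over-represented classes until the label distribution matches `desired_dist`. One way to do this on a stream is rejection sampling, sketched below in plain Python. This illustrates the general idea under simplifying assumptions and is not necessarily how `pytorch_resample.UnderSampler` works internally: the function name `under_sample` is hypothetical, the toy stream is made up, and the actual class distribution is passed in as `actual_dist` instead of being estimated from the data.

```python
import random


def under_sample(stream, actual_dist, desired_dist, seed=None):
    """Yield a subset of (x, y) pairs whose labels roughly follow desired_dist."""
    rng = random.Random(seed)
    # How much each class has to be favoured relative to its current share
    ratios = {y: desired_dist[y] / actual_dist[y] for y in desired_dist}
    top = max(ratios.values())
    # The class that needs the largest boost is always kept; others are thinned
    accept = {y: ratio / top for y, ratio in ratios.items()}
    for x, y in stream:
        if rng.random() < accept[y]:
            yield x, y


# Made-up stream with about 10% positives
rng = random.Random(0)
stream = [(i, int(rng.random() < 0.1)) for i in range(20_000)]

kept = list(under_sample(stream, {0: 0.9, 1: 0.1}, {0: 0.8, 1: 0.2}, seed=42))
positive_share = sum(y for _, y in kept) / len(kept)
print(len(kept), round(positive_share, 3))
```

Every positive is kept and a bit more than half of the negatives are dropped, so the output is shorter than the input. This is consistent with the shorter training time measured below.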

In [121]:
%%time

net = make_net(len(features))

net = train(
    net,
    optimizer=torch.optim.SGD(net.parameters(), lr=1e-2),
    criterion=torch.nn.BCEWithLogitsLoss(),
    train_batches=torch.utils.data.DataLoader(train_sample, batch_size=16)
)

CPU times: user 5.98 s, sys: 54.9 ms, total: 6.04 s
Wall time: 6.05 s


In [122]:
score(
    net,
    test_batches=torch.utils.data.DataLoader(test_set, batch_size=16),
    metric=metrics.roc_auc_score
)

0.9162552315799719

## Over-sampling

In [123]:
train_sample = pytorch_resample.OverSampler(
    train_set,
    desired_dist={0: .8, 1: .2},
    seed=42
)
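Over-sampling takes the opposite route: every example is kept, and examples from under-represented classes are repeated. The plain Python sketch below shows one simple way to do this on a stream. It is an assumption-laden illustration, not the implementation of `pytorch_resample.OverSampler`: the function name `over_sample` is hypothetical, the toy stream is made up, and the actual class distribution is supplied as `actual_dist` instead of being estimated on the fly.

```python
import random


def over_sample(stream, actual_dist, desired_dist, seed=None):
    """Yield every (x, y) pair at least once, repeating under-represented classes."""
    rng = random.Random(seed)
    ratios = {y: desired_dist[y] / actual_dist[y] for y in desired_dist}
    low = min(ratios.values())
    # Expected number of copies per class; the most over-represented class gets 1
    rates = {y: ratio / low for y, ratio in ratios.items()}
    for x, y in stream:
        # Emit floor(rate) copies, plus one more with probability frac(rate),
        # so that the expected number of copies equals the rate
        n_copies = int(rates[y]) + (rng.random() < rates[y] - int(rates[y]))
        for _ in range(n_copies):
            yield x, y


# Made-up stream with about 10% positives
rng = random.Random(0)
stream = [(i, int(rng.random() < 0.1)) for i in range(20_000)]

out = list(over_sample(stream, {0: 0.9, 1: 0.1}, {0: 0.8, 1: 0.2}, seed=42))
positive_share = sum(y for _, y in out) / len(out)
print(len(out), round(positive_share, 3))
```

The output is longer than the input because nothing is discarded, which fits the longer training time measured below.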

In [124]:
%%time

net = make_net(len(features))

net = train(
    net,
    optimizer=torch.optim.SGD(net.parameters(), lr=1e-2),
    criterion=torch.nn.BCEWithLogitsLoss(),
    train_batches=torch.utils.data.DataLoader(train_sample, batch_size=16)
)

CPU times: user 12.1 s, sys: 287 ms, total: 12.4 s
Wall time: 12.4 s


In [125]:
score(
    net,
    test_batches=torch.utils.data.DataLoader(test_set, batch_size=16),
    metric=metrics.roc_auc_score
)

0.9642164101280608

## Hybrid method

In [126]:
train_sample = pytorch_resample.HybridSampler(
    train_set,
    desired_dist={0: .8, 1: .2},
    sampling_rate=.5,
    seed=42
)
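A hybrid sampler sits between the two previous approaches: some classes are thinned while others are repeated. The sketch below assumes that `sampling_rate` controls the size of the output relative to the input, so that `sampling_rate=.5` yields roughly half as many examples as were read. This is a plain Python illustration under that assumption, not the implementation of `pytorch_resample.HybridSampler`; the function name `hybrid_sample`, the `actual_dist` argument, and the toy stream are all made up.

```python
import random


def hybrid_sample(stream, actual_dist, desired_dist, sampling_rate, seed=None):
    """Thin or repeat (x, y) pairs so that labels roughly follow desired_dist."""
    rng = random.Random(seed)
    # Expected number of copies per class; it may be below or above 1
    rates = {
        y: sampling_rate * desired_dist[y] / actual_dist[y]
        for y in desired_dist
    }
    for x, y in stream:
        # Emit floor(rate) copies, plus one more with probability frac(rate)
        n_copies = int(rates[y]) + (rng.random() < rates[y] - int(rates[y]))
        for _ in range(n_copies):
            yield x, y


# Made-up stream with about 10% positives
rng = random.Random(0)
stream = [(i, int(rng.random() < 0.1)) for i in range(20_000)]

out = list(hybrid_sample(
    stream, {0: 0.9, 1: 0.1}, {0: 0.8, 1: 0.2}, sampling_rate=0.5, seed=42
))
positive_share = sum(y for _, y in out) / len(out)
print(len(out), round(positive_share, 3))
```

With these toy numbers, negatives are kept a bit less than half of the time and positives are emitted once each, so the output is about half the size of the input.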

In [127]:
%%time

net = make_net(len(features))

net = train(
    net,
    optimizer=torch.optim.SGD(net.parameters(), lr=1e-2),
    criterion=torch.nn.BCEWithLogitsLoss(),
    train_batches=torch.utils.data.DataLoader(train_sample, batch_size=16)
)

CPU times: user 8.95 s, sys: 166 ms, total: 9.11 s
Wall time: 9.14 s


In [128]:
score(
    net,
    test_batches=torch.utils.data.DataLoader(test_set, batch_size=16),
    metric=metrics.roc_auc_score
)

0.9687554053866155