## Baseline: Logistic regression
Baseline model for Kaggle's Santander Customer Transaction Prediction challenge.

In [1]:
import pandas as pd
import numpy as np
from data_utils import *
from LogisticRegression import *
%load_ext autoreload
%autoreload 2

In [2]:
# load data; separate the target and drop the ID column, which is not a predictor
train = pd.read_csv("../data/train.csv")
y_train = train.pop("target")
train.pop("ID_code")
test = pd.read_csv("../data/test.csv")
ID_test = test.pop("ID_code")

# split validation set from training set
train_i, val_i, _ = split_data_indices(len(train), w=[0.8, 0.2, 0])
X_train = np.array(train.iloc[train_i,:].reset_index(drop=True))
X_val = np.array(train.iloc[val_i,:].reset_index(drop=True))
y_val = np.array(y_train[val_i].reset_index(drop=True))
y_train = np.array(y_train[train_i].reset_index(drop=True))

print("Subset sizes: Train = {}, Val = {}, Test = {}".format(len(X_train), len(X_val), len(test)))

Subset sizes: Train = 159858, Val = 40142, Test = 200000
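
The printed sizes are close to, but not exactly, an 80/20 split, which is what one would expect if each row were assigned to a subset independently at random. The source of `split_data_indices` is not shown here, so the following is only a minimal sketch of how such a helper might work; the function name `split_indices_sketch`, the `seed` argument, and the per-row random assignment are all assumptions.

```python
import numpy as np

def split_indices_sketch(n, w=(0.8, 0.2, 0.0), seed=0):
    """Hypothetical stand-in for split_data_indices (not the original helper).

    Each of the n rows is assigned independently to train/val/test with
    probabilities w, so subset sizes only approximate the requested ratios.
    """
    rng = np.random.default_rng(seed)
    # Draw a subset label per row: 0 = train, 1 = validation, 2 = test
    labels = rng.choice(3, size=n, p=w)
    idx = np.arange(n)
    return idx[labels == 0], idx[labels == 1], idx[labels == 2]

sketch_train_i, sketch_val_i, sketch_test_i = split_indices_sketch(200000)
print(len(sketch_train_i), len(sketch_val_i), len(sketch_test_i))
```

If exact subset sizes matter, shuffling the index array once and slicing it at fixed cut points would be an alternative design.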


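Grid search over learning rate, regularization strength, and polynomial degree, using the logistic regression implementation developed for week 1's assignment. The model with the highest validation AUC is kept.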
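
The `LogisticRegression` class itself lives in a separate module that is not reproduced in this notebook. As a rough illustration of the kind of model being tuned, here is a minimal sketch of L2-regularized logistic regression with polynomial features, trained by full-batch gradient descent. It is not the week 1 implementation: the names `poly_features` and `fit_logistic_gd`, the use of element-wise powers without cross terms, the feature standardization, and the fixed iteration count are all assumptions.

```python
import numpy as np

def poly_features(X, degree):
    # Stack element-wise powers X, X**2, ..., X**degree (no cross terms; an assumption)
    return np.hstack([X ** d for d in range(1, degree + 1)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, degree=1, lam=0.0, lr=0.1, n_iter=500):
    """Sketch: gradient descent on cross-entropy plus (lam / 2) * ||w||^2."""
    Phi = poly_features(X, degree)
    # Standardize features so one learning rate suits all columns
    mu, sd = Phi.mean(axis=0), Phi.std(axis=0) + 1e-12
    Phi = (Phi - mu) / sd
    n, d = Phi.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = sigmoid(Phi @ w + b)
        grad_w = Phi.T @ (p - y) / n + lam * w  # the bias term is not regularized
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b

    def predict_proba(X_new):
        Phi_new = (poly_features(X_new, degree) - mu) / sd
        return sigmoid(Phi_new @ w + b)

    return predict_proba

# Usage on synthetic data (not the Santander data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(float)
predict = fit_logistic_gd(X_demo, y_demo, degree=2, lam=0.001, lr=0.5)
demo_acc = np.mean((predict(X_demo) > 0.5) == y_demo)
print("training accuracy on synthetic data:", demo_acc)
```

Note that this sketch takes `y` as a 1-D array, whereas the notebook passes a column vector to `optimize`.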

In [3]:
from LogisticRegression import LogisticRegression

# hyperparameters for each model
learning_rates = [10**i for i in range(-3,3)]
regularization_values = [0, 0.001, 0.01, 0.02, 0.5]
polynomial_degrees = [1, 2, 3, 4]
n_models = len(learning_rates)*len(regularization_values)*len(polynomial_degrees)

best_val_auc = 0 # save the best performance
best_regression = None # save here the instantiated regression object with the best performance
trained = 0
for lr in learning_rates:
    for regularization in regularization_values:
        for M in polynomial_degrees:
            poly = LogisticRegression(poly_degree=M, n_predictors=X_train.shape[1], regularization=regularization,
                                      learning_rate=lr, batch_size=5000)
            poly.optimize(X=X_train, y=y_train.reshape((-1,1)), method='gd')
            val_auc = poly.performance(X_val, y_val)
            # is this the best validation AUC so far?
            if val_auc > best_val_auc:
                train_auc = poly.performance(X_train, y_train)
                print("New record! val_auc={}, train_auc={} with M={}, lambda={}, lr={},"
                      .format(val_auc, train_auc, M, regularization, lr))
                best_val_auc = val_auc
                best_regression = poly
            # report progress sometimes
            trained += 1
            if (trained % 10) == 1:
                print("{} out of {} models trained so far".format(trained, n_models))

print("Best hyperparameters: lr={}, M={}, regularization={}".format(
    best_regression.learning_rate, best_regression.M, best_regression.regularization))
print("Best validation AUC: {}".format(best_val_auc))
print("Its AUC on the train set: {}".format(best_regression.performance(X_train, y_train)))

New record! val_auc=0.5014770340792719, train_auc=0.5009215206360037 with M=1, lambda=0, lr=0.001,
1 out of 120 models trained so far
New record! val_auc=0.5635700316535979, train_auc=0.5664515703294013 with M=3, lambda=0, lr=0.001,
New record! val_auc=0.5708659860568422, train_auc=0.5673462935796597 with M=4, lambda=0, lr=0.001,
New record! val_auc=0.5708689214266266, train_auc=0.5726304156317806 with M=3, lambda=0.001, lr=0.001,
New record! val_auc=0.5712698640666688, train_auc=0.5659435871925738 with M=4, lambda=0.001, lr=0.001,
New record! val_auc=0.5721107065846159, train_auc=0.5696939882228952 with M=3, lambda=0.01, lr=0.001,
11 out of 120 models trained so far
21 out of 120 models trained so far
New record! val_auc=0.6018271597625944, train_auc=0.603015034804836 with M=3, lambda=0, lr=0.01,
New record! val_auc=0.606487812042182, train_auc=0.6077102994523034 with M=3, lambda=0.001, lr=0.01,
New record! val_auc=0.6084084787998109, train_auc=0.6072505304038759 with M=3, lambda=0.01

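The best validation AUC obtained was $0.666$. I tried several ranges for the hyperparameter search but could not raise the training AUC beyond this point either. Because the model fits the training set about as poorly as the validation set (the two AUC values stay close throughout the log above), this baseline suffers from high bias (underfitting) rather than from overfitting.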
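
For reference, the AUC values discussed here can be understood through the rank (Mann-Whitney) formulation: AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. How `performance` computes it internally is not shown in this notebook, so the sketch below is an assumption about an equivalent calculation, and `auc_rank` is a hypothetical name.

```python
import numpy as np

def auc_rank(y_true, scores):
    """Sketch: ROC AUC from average ranks (ties receive the mean rank)."""
    y_true = np.asarray(y_true).ravel()
    scores = np.asarray(scores, dtype=float).ravel()
    # Average 1-based rank of each distinct score value
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    cum = np.cumsum(counts)
    ranks = (cum - (counts - 1) / 2.0)[inverse]
    n_pos = int(np.sum(y_true == 1))
    n_neg = len(y_true) - n_pos
    # Mann-Whitney U statistic for the positive class, normalized to [0, 1]
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)

demo_auc = auc_rank([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(demo_auc)  # 0.75
```

An AUC of $0.5$ corresponds to random ranking, so a train AUC and a validation AUC that are both low and close together point to underfitting.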