## Model per feature
Several Kaggle participants have hypothesized that the competition data are the output of PCA applied to an original dataset, because:
1. The features seem independent.
2. There are exactly 200 features.

If so, it may be a good idea to build one predictor per feature and then aggregate their predictions; this keeps the model from fitting spurious interactions between features.
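
One quick way to probe the independence claim is to look at pairwise correlations between columns. The following is a minimal sketch on synthetic data; the helper name `max_abs_offdiag_corr` is hypothetical and not part of our utils. Keep in mind that a near-zero Pearson correlation only rules out linear dependence, so this is a sanity check rather than a proof of independence.

```python
import numpy as np

def max_abs_offdiag_corr(X):
    """Largest absolute Pearson correlation between two distinct columns of X."""
    corr = np.corrcoef(X, rowvar=False)
    np.fill_diagonal(corr, 0.0)  # ignore each column's correlation with itself
    return np.abs(corr).max()

rng = np.random.default_rng(0)
# Five independent Gaussian columns (synthetic stand-in for the real features)
independent = rng.normal(size=(5000, 5))
# Same data, but column 1 is made almost a copy of column 0
dependent = independent.copy()
dependent[:, 1] = dependent[:, 0] + 0.1 * rng.normal(size=5000)

print(max_abs_offdiag_corr(independent))  # small
print(max_abs_offdiag_corr(dependent))    # close to 1
```

On the real data the same function could be applied to `X.values`.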

In [1]:
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.metrics import roc_curve, auc
import time
import sys
sys.path.insert(0, '../utils')
# these are our own
from data_utils import *
from gbm_wrappers import *

In [2]:
# load and prepare the data
X = pd.read_csv("../data/train.csv")
X.pop("ID_code")
t = X.pop("target")
# split it into train (80%) and val (20%)
train_i, val_i, _ = split_data_indices(len(X), w=[0.8, 0.2, 0])
X_train = X[train_i]
t_train = t[train_i]
X_val = X[val_i]
t_val = t[val_i]

# test data
test = pd.read_csv('../data/test.csv')
test_id = test.pop("ID_code")
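
`split_data_indices` comes from our own `data_utils` and its source is not shown here. As a rough idea of what such a helper might do, here is a sketch that shuffles the row positions and cuts them by the given fractions; the name `split_indices_sketch`, the `seed` argument and the choice to return integer position arrays are all assumptions.

```python
import numpy as np

def split_indices_sketch(n, w=(0.8, 0.2, 0.0), seed=0):
    """Shuffle range(n) and cut it into train/val/test position arrays by the fractions in w."""
    w = np.asarray(w, dtype=float)
    perm = np.random.default_rng(seed).permutation(n)
    # cumulative cut points, e.g. n=10 and w=(0.8, 0.2, 0) give [8, 10]
    cuts = np.round(np.cumsum(w / w.sum())[:-1] * n).astype(int)
    return np.split(perm, cuts)

train_i, val_i, test_i = split_indices_sketch(10)
print(len(train_i), len(val_i), len(test_i))  # 8 2 0
```

With integer position arrays like these, DataFrame rows would be selected with `X.iloc[train_i]`; a boolean mask, by contrast, can be passed directly as `X[mask]`.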

Now train a GBM for each feature, following [this kernel](https://www.kaggle.com/ymatioun/santander-model-one-feature-at-a-time).

In [3]:
features = range(X.shape[1])
val_individual_preds = np.zeros(X_val.shape)
train_individual_preds = np.zeros(X_train.shape)
val_individual_auc = np.zeros(X.shape[1])
train_individual_auc = np.zeros(X.shape[1])

test_individual_preds = np.zeros(test.shape)

# Recommendations taken from the kernel (thank you Youri Matiounine):
# Use lightgbm for prediction
# Assume all features are independent, so fit a model to one feature at a time
# The final prediction is then the product of all single-feature predictions
# Since each model sees only one feature, do not use CV; just use a fixed number of iterations
params = {
    'task': 'train', 'max_depth': 1, 'boosting_type': 'gbdt',
    'objective': 'binary', 'num_leaves': 3, 'learning_rate': 0.1,
    'feature_fraction': 0.9, 'bagging_fraction': 0.8, 'bagging_freq': 5,
    'lambda_l1': 1, 'lambda_l2': 60, 'verbose': -99
}

for f in features:
    # train a model with just this feature
    lgb_train = lgb.Dataset(X_train.iloc[:,f:f+1], t_train)
    gbm = lgb.train(params, lgb_train, 45, verbose_eval=1000)
    
    # store individual validation and training predictions and AUC (not in kernel)
    # train
    train_individual_preds[:,f] = gbm.predict(X_train.iloc[:,f:f+1], num_iteration=gbm.best_iteration)
    fpr, tpr, _ = roc_curve(t_train, train_individual_preds[:,f])
    train_individual_auc[f] = auc(fpr, tpr)
    # val
    val_individual_preds[:,f] = gbm.predict(X_val.iloc[:,f:f+1], num_iteration=gbm.best_iteration)
    fpr, tpr, _ = roc_curve(t_val, val_individual_preds[:,f])
    val_individual_auc[f] = auc(fpr, tpr)
    # test
    test_individual_preds[:,f] = gbm.predict(test.iloc[:,f:f+1], num_iteration=gbm.best_iteration)

The same kernel then suggests aggregating the individual predictions by taking their product. Let's see how well that works on our validation set. Note that the validation set has played no part in training so far.
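
Why a product? One way to motivate it (my reading, not something the kernel states) is the naive Bayes argument. Assume the $n$ features are conditionally independent given the target, $p(x \mid t) = \prod_i p(x_i \mid t)$. Then Bayes' rule gives the posterior odds

$$\frac{P(t=1 \mid x)}{P(t=0 \mid x)} = \frac{P(t=1)}{P(t=0)} \prod_{i=1}^{n} \frac{p(x_i \mid t=1)}{p(x_i \mid t=0)}.$$

Applying Bayes' rule to each single feature, $p(x_i \mid t) = P(t \mid x_i)\, p(x_i) / P(t)$, the factors $p(x_i)$ cancel and we get

$$\frac{P(t=1 \mid x)}{P(t=0 \mid x)} = \left(\frac{P(t=1)}{P(t=0)}\right)^{1-n} \prod_{i=1}^{n} \frac{P(t=1 \mid x_i)}{P(t=0 \mid x_i)}.$$

Under this assumption the exact aggregate is therefore a product of per-feature odds, up to a constant that does not affect ranking. The product of the per-feature probabilities $P(t=1 \mid x_i)$ is an approximation of it that is reasonable when every $P(t=0 \mid x_i)$ stays close to a constant, for example when positives are rare.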

In [4]:
print("Train AUC={:.4f}".format(auc_wrapper(train_individual_preds, t_train)))
print("Val AUC={:.4f}".format(auc_wrapper(val_individual_preds, t_val)))
# also predict on test set for submission
test_preds = (10 * test_individual_preds).prod(axis=1)
submission = pd.DataFrame({'ID_code': test_id, 'target': test_preds.astype('float32')})
submission.to_csv('sub1f.csv', index=False)

Train AUC=0.9110
Val AUC=0.9032
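
The factor of 10 in `(10 * test_individual_preds).prod(axis=1)` presumably keeps the product of 200 small probabilities away from floating-point underflow. Since AUC depends only on the ordering of the scores, any positive rescaling leaves it unchanged. Below is a sketch of a scale-free alternative that sums logarithms instead; it uses synthetic predictions, and `log_product_score` is a hypothetical helper, not the `auc_wrapper` from our utils.

```python
import numpy as np

def log_product_score(preds, eps=1e-12):
    """Row-wise sum of log-predictions: a monotone transform of the row-wise product."""
    return np.log(np.clip(preds, eps, 1.0)).sum(axis=1)

rng = np.random.default_rng(0)
# 1000 rows of 200 synthetic per-feature probabilities between 0.05 and 0.15
preds = rng.uniform(0.05, 0.15, size=(1000, 200))

plain = preds.astype(np.float32).prod(axis=1)  # raw product in float32
scaled = (10 * preds).prod(axis=1)             # the kernel's rescaling, in float64
logsum = log_product_score(preds)

print((plain == 0).all())                                       # raw float32 product underflows
print(np.array_equal(np.argsort(scaled), np.argsort(logsum)))   # same ordering of rows
```

Because the two scores order the rows identically, they would give the same AUC; the log-sum simply does not depend on picking a suitable scaling constant.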


This submission file obtained a test AUC of $0.89885$. Now we deviate from the kernel. Let's start by trying different ways of aggregating the per-feature predictions. In particular, I tried weighting each feature by the $AUC$ of its individual predictions.

In [8]:
print("Casual mean")
print("Train AUC={:.4f}".format(auc_wrapper(train_individual_preds, t_train, "mean")))
print("Val AUC={:.4f}".format(auc_wrapper(val_individual_preds, t_val, "mean")))

print("\nCasual mean weights are train individual AUC")
print("Train AUC={:.4f}".format(
    auc_wrapper(train_individual_preds, t_train, "weighted", train_individual_auc)
    ))
print("Val AUC={:.4f}".format(
    auc_wrapper(val_individual_preds, t_val, "weighted", train_individual_auc)
    ))

print("\nProduct mean weights are train individual AUC")
print("Train AUC={:.4f}".format(
    auc_wrapper(train_individual_preds, t_train, "prod_weights", train_individual_auc)
    ))
print("Val AUC={:.4f}".format(
    auc_wrapper(val_individual_preds, t_val, "prod_weights", train_individual_auc)
    ))

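print("\nLet go of the worst 10 individual AUC features")
# NOTE: argsort is ascending, so [:-10] drops the 10 features with the HIGHEST
# train AUC; [10:] would drop the 10 worst. The output below comes from [:-10].
to_keep = np.argsort(train_individual_auc)[:-10]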
print("Train AUC={:.4f}".format(
    auc_wrapper(train_individual_preds, t_train, "keep", None, to_keep)
    ))
print("Val AUC={:.4f}".format(
    auc_wrapper(val_individual_preds, t_val, "keep", None, to_keep)
    ))

Casual mean
Train AUC=0.9106
Val AUC=0.8961

Casual mean weights are train individual AUC
Train AUC=0.9102
Val AUC=0.8960

Product mean weights are train individual AUC
Train AUC=0.9121
Val AUC=0.8972

Let go of the worst 10 individual AUC features
Train AUC=0.8891
Val AUC=0.8718
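
The aggregation modes of `auc_wrapper` live in our utils and are not shown here. As a guess at what the "weighted" and "prod_weights" modes compute, here is a sketch of a weighted mean and a weighted log-product; the function names, the made-up AUC values and the "excess over 0.5" weighting are all assumptions for illustration. If the per-feature AUCs all sit in a narrow band just above 0.5 (also an assumption, since the values are not printed above), normalising them gives almost uniform weights, which would explain why the weighted mean barely differs from the plain mean.

```python
import numpy as np

def weighted_mean(preds, weights):
    """Row-wise weighted arithmetic mean of the per-feature predictions."""
    return np.average(preds, axis=1, weights=weights)

def weighted_log_product(preds, weights, eps=1e-12):
    """Row-wise weighted sum of logs, i.e. the log of a weighted product."""
    return (np.log(np.clip(preds, eps, 1.0)) * weights).sum(axis=1)

rng = np.random.default_rng(0)
preds = rng.uniform(0.05, 0.15, size=(4, 3))  # 4 rows, 3 synthetic features
aucs = np.array([0.50, 0.52, 0.55])           # made-up per-feature AUCs

raw = aucs / aucs.sum()                       # nearly uniform weights
excess = (aucs - 0.5) / (aucs - 0.5).sum()    # weights by AUC above chance

print(raw)
print(excess)
print(weighted_mean(preds, raw))
print(weighted_log_product(preds, excess))
```

Weighting by the excess over 0.5 separates the features far more sharply than the raw AUC does, and a feature at chance level gets weight zero.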


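None of these beats the plain product (validation $AUC$ of $0.9032$). Dropping features looks like a bad idea, although `np.argsort` sorts in ascending order, so the `[:-10]` slice actually discards the ten features with the highest individual $AUC$ rather than the ten worst, which likely explains the size of the drop. Together with the small gap between training and validation $AUC$, this suggests that our model could use more complexity to reduce bias. We now try to increase the model complexity by retuning the parameters.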

In [8]:
features = range(X.shape[1])
val_individual_preds = np.zeros(X_val.shape)
train_individual_preds = np.zeros(X_train.shape)
val_individual_auc = np.zeros(X.shape[1])
train_individual_auc = np.zeros(X.shape[1])

# Recommendations taken from the kernel (thank you Youri Matiounine):
# Use lightgbm for prediction
# Assume all features are independent, so fit a model to one feature at a time
# The final prediction is then the product of all single-feature predictions
# Since each model sees only one feature, do not use CV; just use a fixed number of iterations
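# Only change from the first run: num_leaves 3 -> 5 (max_depth=1 still limits each tree to one level, so the effect may be small)
params = {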
    'task': 'train', 'max_depth': 1, 'boosting_type': 'gbdt',
    'objective': 'binary', 'num_leaves': 5, 'learning_rate': 0.1,
    'feature_fraction': 0.9, 'bagging_fraction': 0.8, 'bagging_freq': 5,
    'lambda_l1': 1, 'lambda_l2': 60, 'verbose': -99
}

for f in features:
    # train a model with just this feature
    lgb_train = lgb.Dataset(X_train.iloc[:,f:f+1], t_train)
    gbm = lgb.train(params, lgb_train, 45, verbose_eval=1000)
    
    # store individual validation and training predictions and AUC (not in kernel)
    # train
    train_individual_preds[:,f] = gbm.predict(X_train.iloc[:,f:f+1], num_iteration=gbm.best_iteration)
    fpr, tpr, _ = roc_curve(t_train, train_individual_preds[:,f])
    train_individual_auc[f] = auc(fpr, tpr)
    # val
    val_individual_preds[:,f] = gbm.predict(X_val.iloc[:,f:f+1], num_iteration=gbm.best_iteration)
    fpr, tpr, _ = roc_curve(t_val, val_individual_preds[:,f])
    val_individual_auc[f] = auc(fpr, tpr)

In [9]:
print("Train AUC={:.4f}".format(auc_wrapper(train_individual_preds, t_train)))
print("Val AUC={:.4f}".format(auc_wrapper(val_individual_preds, t_val)))

Train AUC=0.9115
Val AUC=0.9020
