# Adversarial validation approach

Here we use adversarial validation to check whether the usual cross-validation strategy can be trusted. Many thanks to this kernel:
* https://www.kaggle.com/konradb/adversarial-validation-and-other-scary-terms

Also thanks to FastML: 
* http://fastml.com/adversarial-validation-part-one/
* http://fastml.com/adversarial-validation-part-two/

## To summarize what adversarial validation is
***"The general idea is to check the degree of similarity between training and tests in terms of feature distribution: if they are difficult to distinguish, the distribution is probably similar and the usual validation techniques should work. It does not seem to be the case, so we can suspect they are quite different. This intuition can be quantified by combining train and test sets, assigning 0/1 labels (0 - train, 1-test) and evaluating a binary classification task."***

In [1]:
import gc
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
gc.enable()
gc.collect()

0

In [2]:
train_df = pd.read_csv("/Users/JoonH/dont-overfit-ii/train.csv")
test_df = pd.read_csv("/Users/JoonH/dont-overfit-ii/test.csv")

In [4]:
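# Separate the ids (and the target) from the features
x_train = train_df.drop(['target', 'id'], axis=1)
id_train = train_df['id']
x_test = test_df.drop(['id'], axis=1)
id_test = test_df['id']

# Adversarial label: 0 = row comes from train, 1 = row comes from test
x_train['is_test'] = 0
x_test['is_test'] = 1

# Stack both sets; ignore_index avoids duplicate row labels
x_combined = pd.concat([x_train, x_test], axis=0, ignore_index=True)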

In [5]:
y = x_combined['is_test']
x_combined = x_combined.drop(['is_test'], axis = 1)

In [9]:
n_fold = 4
folds = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=42)

In [13]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(class_weight='balanced', penalty='l1', C=0.1, solver='liblinear', max_iter = 1000)

In [14]:
for train_index, test_index in folds.split(x_combined, y):
    # Split the combined data into this fold's fit and hold-out parts
    x0, x1 = x_combined.iloc[train_index], x_combined.iloc[test_index]
    y0, y1 = y.iloc[train_index], y.iloc[test_index]
    print(x0.shape)
    clf.fit(x0, y0)

    # AUC near 0.5 means train and test rows cannot be told apart
    prval = clf.predict_proba(x1)[:, 1]
    print(roc_auc_score(y1, prval))

(14999, 300)
0.515091901483153
(14999, 300)
0.526715397918314
(15001, 300)
0.48282880422353924
(15001, 300)
0.5226662397825506


Interestingly, logistic regression, currently our best-working model, fails to distinguish between the training and test datasets: the per-fold AUC ranges from about 0.48 to 0.53, close to the 0.5 expected from random guessing. This tells us the two sets are not easy to tell apart, which suggests their distributions are very similar and our usual validation score should be reliable.
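To make the "close to 0.5" reading concrete, the four fold scores printed above can be averaged. This is only a small sketch: the `CHANCE_MARGIN` threshold is a hypothetical rule of thumb chosen for illustration, not something taken from the referenced kernels.

```python
from statistics import mean, stdev

# Fold AUCs copied from the output of the adversarial classifier above
fold_aucs = [0.515091901483153, 0.526715397918314,
             0.48282880422353924, 0.5226662397825506]

mean_auc = mean(fold_aucs)
std_auc = stdev(fold_aucs)  # sample standard deviation across the folds
print(f"mean AUC = {mean_auc:.3f}, std = {std_auc:.3f}")

# Hypothetical rule of thumb: within 0.05 of 0.5 counts as chance level
CHANCE_MARGIN = 0.05
looks_similar = abs(mean_auc - 0.5) < CHANCE_MARGIN
print("train and test look similar:", looks_similar)
```

The mean comes out at roughly 0.512, which sits well inside that margin.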

# How to utilize Adversarial CV

If the training and test sets had instead been easy to distinguish, we would do the following:
1. Train a model to predict whether a given row belongs to the test set.
2. Score the training data with that model and select the subset that looks most like the test set.
3. Use that subset as our main validation data.
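The three steps could be sketched roughly as follows. This is a minimal illustration rather than the method from the referenced kernels: the helper `select_test_like_rows`, its `frac` parameter, and the synthetic shifted data are all assumptions made for the example, and the classifier uses scikit-learn defaults instead of the L1 settings above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def select_test_like_rows(x_train, x_test, frac=0.2, seed=42):
    """Return positions of the training rows that look most like test rows."""
    x_all = np.vstack([x_train, x_test])
    is_test = np.r_[np.zeros(len(x_train), dtype=int),
                    np.ones(len(x_test), dtype=int)]

    # Step 1: model that predicts whether a row belongs to the test set
    clf = LogisticRegression(class_weight='balanced', max_iter=1000)
    folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=seed)

    # Step 2: out-of-fold probabilities, so each row is scored by a model
    # that was not fitted on it
    p_test = cross_val_predict(clf, x_all, is_test, cv=folds,
                               method='predict_proba')[:, 1]
    p_train = p_test[:len(x_train)]

    # Step 3: keep the training rows with the highest P(is_test)
    n_keep = max(1, int(frac * len(x_train)))
    return np.argsort(p_train)[::-1][:n_keep]


# Synthetic data (assumption): the test features are shifted upwards by 1
rng = np.random.default_rng(0)
x_train_demo = rng.normal(0.0, 1.0, size=(200, 5))
x_test_demo = rng.normal(1.0, 1.0, size=(300, 5))

val_idx = select_test_like_rows(x_train_demo, x_test_demo, frac=0.2)
print(len(val_idx), "training rows selected for validation")
```

The selected rows would then serve as the validation set, with the remaining training rows used for fitting. The choice of `frac` trades off how test-like the validation set is against how many rows it contains.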