# 1. Introduction

<div style="color:white;display:fill;
            background-color:#48AFFF;font-size:160%;
            font-family:Arial">
    <p style="padding: 4px;color:white;"><b>1.1 Context</b></p>
</div>

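This notebook applies adversarial validation to the Porto Seguro's Safe Driver Prediction competition to test whether the training and test data come from the same distribution. A classifier is trained to tell training rows from test rows; if it cannot do better than chance, the two sets are not easily distinguishable.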

<div style="color:white;display:fill;
            background-color:#48AFFF;font-size:160%;
            font-family:Arial">
    <p style="padding: 4px;color:white;"><b>1.2 Used code</b></p>
</div>

The code used is adapted from the chapter on adversarial validation in [The Kaggle Book](https://www.kaggle.com/general/320574) by [(Banachewicz & Massaron)](#3.-References).

# 2. Implementation

<div style="color:white;display:fill;
            background-color:#48AFFF;font-size:160%;
            font-family:Arial">
    <p style="padding: 4px;color:white;"><b>2.1 Libraries</b></p>
</div>


In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

<div style="color:white;display:fill;
            background-color:#48AFFF;font-size:160%;
            font-family:Arial">
    <p style="padding: 4px;color:white;"><b>2.2 Data</b></p>
</div>


In [2]:
train = pd.read_csv("../input/porto-seguro-safe-driver-prediction/train.csv")
test = pd.read_csv("../input/porto-seguro-safe-driver-prediction/test.csv")

In [3]:
train.head()

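| | id | target | ps_ind_01 | ps_ind_02_cat | ps_ind_03 | ps_ind_04_cat | ps_ind_05_cat | ps_ind_06_bin | ps_ind_07_bin | ps_ind_08_bin | ... | ps_calc_11 | ps_calc_12 | ps_calc_13 | ps_calc_14 | ps_calc_15_bin | ps_calc_16_bin | ps_calc_17_bin | ps_calc_18_bin | ps_calc_19_bin | ps_calc_20_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 0 | 2 | 2 | 5 | 1 | 0 | 0 | 1 | 0 | ... | 9 | 1 | 5 | 8 | 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | 9 | 0 | 1 | 1 | 7 | 0 | 0 | 0 | 0 | 1 | ... | 3 | 1 | 1 | 9 | 0 | 1 | 1 | 0 | 1 | 0 |
| 2 | 13 | 0 | 5 | 4 | 9 | 1 | 0 | 0 | 0 | 1 | ... | 4 | 2 | 7 | 7 | 0 | 1 | 1 | 0 | 1 | 0 |
| 3 | 16 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 0 | 0 | ... | 2 | 2 | 4 | 9 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 17 | 0 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 3 | 1 | 1 | 3 | 0 | 0 | 0 | 1 | 1 | 0 |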


In [4]:
test.head()

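| | id | ps_ind_01 | ps_ind_02_cat | ps_ind_03 | ps_ind_04_cat | ps_ind_05_cat | ps_ind_06_bin | ps_ind_07_bin | ps_ind_08_bin | ps_ind_09_bin | ... | ps_calc_11 | ps_calc_12 | ps_calc_13 | ps_calc_14 | ps_calc_15_bin | ps_calc_16_bin | ps_calc_17_bin | ps_calc_18_bin | ps_calc_19_bin | ps_calc_20_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 8 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 1 | 12 | 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 4 | 2 | 5 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 2 | 0 | 3 | 10 | 0 | 0 | 1 | 1 | 0 | 1 |
| 2 | 2 | 5 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 4 | 0 | 2 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 0 | 1 | 6 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 5 | 1 | 0 | 5 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | 4 | 5 | 1 | 7 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 4 | 0 | 0 | 4 | 0 | 1 | 1 | 0 | 0 | 1 |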


<div style="color:white;display:fill;
            background-color:#48AFFF;font-size:160%;
            font-family:Arial">
    <p style="padding: 4px;color:white;"><b>2.3 Data Preparation</b></p>
</div>

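Missing values are replaced with -1, and the target and id columns are dropped. The training and test sets are then stacked into one feature matrix, with label 0 for training rows and label 1 for test rows.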

In [5]:
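# Fill missing values with -1 and drop the columns that are not features.
train = train.fillna(-1).drop(["id", "target"], axis=1)
test = test.fillna(-1).drop(["id"], axis=1)
# Stack both sets; DataFrame.append was removed in pandas 2.0, so use pd.concat.
X = pd.concat([train, test])
# Adversarial label: 0 for training rows, 1 for test rows.
y = [0] * len(train) + [1] * len(test)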
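
The preparation step can be checked on a small scale before running it on the full files. The sketch below is illustrative only: `toy_train`, `toy_test`, and the `feat_a`/`feat_b` columns are hypothetical stand-ins, not part of the competition data. It shows why `target` has to go (the test set has no such column) and that the stacked frame contains no missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for train.csv and test.csv.
toy_train = pd.DataFrame({
    "id": [1, 2, 3],
    "target": [0, 1, 0],
    "feat_a": [0.5, np.nan, 1.5],
    "feat_b": [1, 0, 1],
})
toy_test = pd.DataFrame({
    "id": [4, 5],
    "feat_a": [2.0, np.nan],
    "feat_b": [0, 1],
})

# Same preparation as in the notebook: fill gaps, keep only shared feature columns.
toy_train_X = toy_train.fillna(-1).drop(columns=["id", "target"])
toy_test_X = toy_test.fillna(-1).drop(columns=["id"])

# Stack the rows and build the adversarial label (0 = train, 1 = test).
toy_X = pd.concat([toy_train_X, toy_test_X], ignore_index=True)
toy_y = np.r_[np.zeros(len(toy_train_X), dtype=int),
              np.ones(len(toy_test_X), dtype=int)]

print(toy_X)
print(toy_y)
```

Here `ignore_index=True` is an optional choice that gives the stacked frame a fresh 0..n-1 index; the model itself does not use the index.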

In [6]:
train.head()

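| | ps_ind_01 | ps_ind_02_cat | ps_ind_03 | ps_ind_04_cat | ps_ind_05_cat | ps_ind_06_bin | ps_ind_07_bin | ps_ind_08_bin | ps_ind_09_bin | ps_ind_10_bin | ... | ps_calc_11 | ps_calc_12 | ps_calc_13 | ps_calc_14 | ps_calc_15_bin | ps_calc_16_bin | ps_calc_17_bin | ps_calc_18_bin | ps_calc_19_bin | ps_calc_20_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | 5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 9 | 1 | 5 | 8 | 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 3 | 1 | 1 | 9 | 0 | 1 | 1 | 0 | 1 | 0 |
| 2 | 5 | 4 | 9 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 4 | 2 | 7 | 7 | 0 | 1 | 1 | 0 | 1 | 0 |
| 3 | 0 | 1 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 2 | 2 | 4 | 9 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 3 | 1 | 1 | 3 | 0 | 0 | 0 | 1 | 1 | 0 |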


In [7]:
test.head()

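| | ps_ind_01 | ps_ind_02_cat | ps_ind_03 | ps_ind_04_cat | ps_ind_05_cat | ps_ind_06_bin | ps_ind_07_bin | ps_ind_08_bin | ps_ind_09_bin | ps_ind_10_bin | ... | ps_calc_11 | ps_calc_12 | ps_calc_13 | ps_calc_14 | ps_calc_15_bin | ps_calc_16_bin | ps_calc_17_bin | ps_calc_18_bin | ps_calc_19_bin | ps_calc_20_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 8 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 12 | 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | 4 | 2 | 5 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 2 | 0 | 3 | 10 | 0 | 0 | 1 | 1 | 0 | 1 |
| 2 | 5 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 4 | 0 | 2 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 6 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 5 | 1 | 0 | 5 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | 5 | 1 | 7 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 4 | 0 | 0 | 4 | 0 | 1 | 1 | 0 | 0 | 1 |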


<div style="color:white;display:fill;
            background-color:#48AFFF;font-size:160%;
            font-family:Arial">
    <p style="padding: 4px;color:white;"><b>2.4 Predictions</b></p>
</div>

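Out-of-fold predictions are generated with a RandomForestClassifier via `cross_val_predict` using 5-fold cross-validation, so every row is scored by a model that did not see that row during training.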

In [8]:
model = RandomForestClassifier()
cv_preds = cross_val_predict(model, X, y, cv=5, n_jobs=-1, method='predict_proba', verbose=2)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed: 38.1min finished
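
Since the full run takes a long time, it can help to sanity-check the procedure on synthetic data first. The following is a minimal sketch under stated assumptions: the helper `adversarial_auc`, the Gaussian toy data, and the choice of 50 trees are all illustrative and not part of the original notebook. It runs the same `cross_val_predict` scheme once on two samples from the same distribution and once on a deliberately shifted test sample.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def adversarial_auc(train_X, test_X, seed=0):
    """Hypothetical helper: out-of-fold ROC-AUC for telling train rows from test rows."""
    X = np.vstack([train_X, test_X])
    y = np.r_[np.zeros(len(train_X), dtype=int), np.ones(len(test_X), dtype=int)]
    # Fewer trees than the default to keep the sketch fast; an illustrative choice.
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")
    # Column 1 holds the predicted probability of the "test" class.
    return roc_auc_score(y, proba[:, 1])


rng = np.random.default_rng(0)
same_train = rng.normal(size=(300, 4))
same_test = rng.normal(size=(300, 4))             # same distribution as same_train
shifted_test = rng.normal(loc=2.0, size=(300, 4))  # every feature shifted by 2

auc_same = adversarial_auc(same_train, same_test)
auc_shifted = adversarial_auc(same_train, shifted_test)
print(f"same distribution:    {auc_same:.3f}")
print(f"shifted distribution: {auc_shifted:.3f}")
```

On data like this, the first score should land near 0.5 and the second close to 1.0, though the exact values depend on the random seed and the installed scikit-learn version.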


"If your ROC-AUC is around 0.5, it means that the training and test data are not easily distinguishable and are apparently from the same distribution. ROC-AUC values higher than 0.5 and nearing 1.0 signal that it is easy for the algorithm to figure out what is from the training set and what is from the test set: in such a case, don’t expect to be able to easily generalize to the test set because it clearly comes from a different distribution." [(Banachewicz & Massaron)](#3.-References)

In [9]:
print(roc_auc_score(y_true=y, y_score=cv_preds[:,1]))

0.5005761821344328


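The ROC-AUC is approximately 0.5 (0.5006), so the classifier cannot distinguish training rows from test rows better than chance. Training and test data are therefore most probably from the same distribution.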
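
To make the reading of the score concrete: ROC-AUC equals the probability that a randomly chosen test row receives a higher score than a randomly chosen training row, with ties counting as one half. The dependency-free sketch below illustrates that definition; `pairwise_auc` and the toy scores are hypothetical names and values introduced here, not code from the book or from scikit-learn.

```python
import random


def pairwise_auc(y_true, y_score):
    """ROC-AUC as the share of (test, train) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]  # scores of test rows
    neg = [s for t, s in zip(y_true, y_score) if t == 0]  # scores of training rows
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Small hand-checkable case: 3 of the 4 (test, train) pairs are ordered correctly.
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75

rng = random.Random(0)
labels = [0] * 200 + [1] * 200

# Scores unrelated to the label: the classifier is guessing.
random_scores = [rng.random() for _ in labels]
auc_random = pairwise_auc(labels, random_scores)

# Scores that separate the sets completely: every test row outranks every train row.
perfect_scores = [rng.uniform(0.0, 0.4) for _ in range(200)] + \
                 [rng.uniform(0.6, 1.0) for _ in range(200)]
auc_perfect = pairwise_auc(labels, perfect_scores)

print(round(auc_random, 3), auc_perfect)
```

The double loop is quadratic in the number of rows, so this is meant only for illustration; for real data `roc_auc_score` is the practical choice.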

# 3. References

<div style="color:white;display:fill;
            background-color:#48AFFF;font-size:160%;
            font-family:Arial">
    <p style="padding: 4px;color:white;"><b>3.1 References</b></p>
</div>

* Banachewicz, Konrad; Massaron, Luca. [The Kaggle Book](https://www.kaggle.com/general/320574): Data analysis and machine learning for competitive data science. Packt Publishing
* [Adversarial validation and other scary terms](https://www.kaggle.com/code/konradb/adversarial-validation-and-other-scary-terms/notebook) by [Konrad Banachewicz](https://www.kaggle.com/konradb)