# Adversarial Validation

The adversarial validation workflow and feature correction add 0.001 to the public leaderboard score (0.132 $\rightarrow$ 0.133) for a simple LightGBM baseline.

When we build models, we want to ensure that the model generalizes well out-of-sample. A key issue (in real-world production deployment and in Kaggle) is distribution shift in the feature space. In other words, if the "shape" of a feature changes with time, the model could be overfitting to the training shape. We want to know which features can be used to distinguish between train and validation across folds -- this is **Adversarial Validation**. To do this, we build a model to predict not the target, but simply **to classify whether a sample row is in the train or validation set**. As per the graphic, we

- (1) Start with our traditional cross-validation split
- (2) Create a new target, which is an indicator of whether the sample is in the train or test set
- (3) We concatenate the train and test X and y into a single dataframe *and shuffle*
- (4) We then split that into (X' train, y' train) and (X' valid, y' valid).

![](https://i.imgur.com/qoz5jl0.png)

And then we `.fit(...)` the classifier on that!
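
The four steps above can be sketched on toy data as follows. This is only a minimal illustration, not the training function used later in this notebook: the helper name `make_adversarial_split`, the `holdout_frac` parameter, and the toy feature `f_0` are all made up for the example.

```python
import numpy as np
import pandas as pd

def make_adversarial_split(X_train, X_valid, holdout_frac=0.25, seed=42):
    """Hypothetical helper: build (X', y') for the adversarial classifier."""
    # Label rows by origin: 0 = original train fold, 1 = original validation fold.
    combined = pd.concat(
        [X_train.assign(is_valid=0), X_valid.assign(is_valid=1)],
        ignore_index=True,
    )
    # Shuffle so both classes appear on each side of the new split.
    combined = combined.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    y_adv = combined.pop("is_valid")

    # Hold out the first rows of the shuffled frame for scoring the classifier.
    n_holdout = int(len(combined) * holdout_frac)
    X_adv_valid, y_adv_valid = combined.iloc[:n_holdout], y_adv.iloc[:n_holdout]
    X_adv_train, y_adv_train = combined.iloc[n_holdout:], y_adv.iloc[n_holdout:]
    return X_adv_train, y_adv_train, X_adv_valid, y_adv_valid

# Toy folds: the validation feature is deliberately shifted.
rng = np.random.default_rng(0)
X_tr = pd.DataFrame({"f_0": rng.normal(0.0, 1.0, 300)})
X_va = pd.DataFrame({"f_0": rng.normal(2.0, 1.0, 100)})

Xa_tr, ya_tr, Xa_va, ya_va = make_adversarial_split(X_tr, X_va)
print(Xa_tr.shape, Xa_va.shape, ya_tr.mean(), ya_va.mean())
```

Any binary classifier can then be fit on `(Xa_tr, ya_tr)` and scored on `(Xa_va, ya_va)`.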

We then look at the **feature importance** of this model to see what the offending features are. To avoid overfitting, we need to do something with these offending features: drop them (if they are unimportant in the core model) or transform them. Here is what we will do:

   - (1) Train LGB regressor over CV as usual to predict the target; save the feature importances
   - (2) Train LGB classifier over CV to predict if a sample is in the train set or validation set
   - (3) Look for important features from 2; if we find any, it means they are non-stationary and/or leaky
   - (4) Compare the important features from the classifier to feature importance from the standard regressor
   - (5) If a feature is important in 1 and 2, we need to transform that feature
   - (6) If a feature is important in 2 but not in 1, then we drop it...
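
Steps 4 to 6 boil down to a small decision rule. Here is one way it could be written down, assuming we define "important" as "in the top k by importance". In this notebook the offenders are picked by eye from the plots, so the function `triage_features` and the cutoffs `top_k_reg` and `top_k_adv` are hypothetical.

```python
import pandas as pd

def triage_features(reg_importance, adv_importance, top_k_reg=50, top_k_adv=3):
    """Hypothetical rule: decide what to do with the adversarial offenders.

    Both inputs are Series indexed by feature name (higher = more important).
    """
    reg_top = set(reg_importance.sort_values(ascending=False).head(top_k_reg).index)
    adv_top = adv_importance.sort_values(ascending=False).head(top_k_adv).index
    # Important in both models -> transform; important only adversarially -> drop.
    return {feat: ("transform" if feat in reg_top else "drop") for feat in adv_top}

# Made-up importances for four made-up features.
reg_imp = pd.Series({"f_a": 90.0, "f_b": 5.0, "f_c": 60.0, "f_d": 1.0})
adv_imp = pd.Series({"f_a": 800.0, "f_b": 2.0, "f_c": 3.0, "f_d": 500.0})

decisions = triage_features(reg_imp, adv_imp, top_k_reg=2, top_k_adv=2)
print(decisions)
```

Features that are not flagged by the adversarial classifier are simply left alone.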

In [None]:
import gc, time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import lightgbm as lgb

from scipy.stats import pearsonr
from tqdm.notebook import tqdm
from sklearn.base import clone
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss, mean_squared_error

plt.rcParams['figure.figsize'] = (16, 5)

import warnings
warnings.filterwarnings("ignore")

# Load and Inspect the Data

Thanks [Rob Mulla](https://www.kaggle.com/robikscube) for the reduced memory version of the data.

In [None]:
%%time
train = (pd.read_parquet('../input/ubiquant-parquet/train_low_mem.parquet')
         .sort_values(['time_id', 'investment_id'])
         .drop(columns=['row_id'])
         .query('time_id > 599')
         .reset_index(drop=True));

gc.collect()

In [None]:
all_columns = train.columns
features = all_columns[train.columns.str.contains('f_')]

In [None]:
train.shape

In [None]:
train.head()

The CV scheme, `PurgedGroupTimeSeriesSplit` is from my notebook ["Purged Time Series CV, XGBoost, Optuna"](https://www.kaggle.com/marketneutral/purged-time-series-cv-xgboost-optuna). You can read details of this CV scheme there.
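
To get a feel for what the gap does before reading the full class, here is a stripped-down sketch that works on group labels only. It is a simplified illustration and not the class used below: the function `purged_group_splits` and its parameters `max_train` and `test_size` are made up, and it skips details that the real splitter handles (row indices, size caps derived from the data, validation of inputs).

```python
import numpy as np

def purged_group_splits(groups, n_splits=3, group_gap=2, max_train=4, test_size=2):
    """Simplified illustration: yield (train_groups, test_groups) per fold."""
    unique = np.unique(groups)  # sorted unique group labels, e.g. time_ids
    n_groups = len(unique)
    first_test_start = n_groups - n_splits * test_size
    for test_start in range(first_test_start, n_groups, test_size):
        # Train ends `group_gap` groups before test starts (the purge).
        train_end = test_start - group_gap
        train_start = max(0, train_end - max_train)
        yield unique[train_start:train_end], unique[test_start:test_start + test_size]

# 12 time_ids with 3 rows each.
groups = np.repeat(np.arange(12), 3)
splits = list(purged_group_splits(groups))
for train_groups, test_groups in splits:
    print("train:", train_groups.tolist(), "test:", test_groups.tolist())
```

Each test window starts at least `group_gap` groups after the last training group, which is what protects windowed or lagged features from leaking across the boundary.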

In [None]:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection._split import _BaseKFold, indexable, _num_samples
from sklearn.utils.validation import _deprecate_positional_args

# modified code for group gaps; source
# https://github.com/getgaurav2/scikit-learn/blob/d4a3af5cc9da3a76f0266932644b884c99724c57/sklearn/model_selection/_split.py#L2243
class PurgedGroupTimeSeriesSplit(_BaseKFold):
    """Time Series cross-validator variant with non-overlapping groups.
    Allows for a gap in groups to avoid potentially leaking info from
    train into test if the model has windowed or lag features.
    Provides train/test indices to split time series data samples
    that are observed at fixed time intervals according to a
    third-party provided group.
    In each split, test indices must be higher than before, and thus shuffling
    in cross validator is inappropriate.
    This cross-validation object is a variation of :class:`KFold`.
    In the kth split, it returns first k folds as train set and the
    (k+1)th fold as test set.
    The same group will not appear in two different folds (the number of
    distinct groups has to be at least equal to the number of folds).
    Note that unlike standard cross-validation methods, successive
    training sets are supersets of those that come before them.
    Read more in the :ref:`User Guide <cross_validation>`.
    Parameters
    ----------
    n_splits : int, default=5
        Number of splits. Must be at least 2.
    max_train_group_size : int, default=Inf
        Maximum group size for a single training set.
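    group_gap : int, default=None
        Number of groups discarded from the end of each train split,
        leaving a gap between train and test. Must be set to an int;
        the default of None is not usable in ``split``.
    max_test_group_size : int, default=Inf
        Maximum group size for a single test set.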
    """

    @_deprecate_positional_args
    def __init__(self,
                 n_splits=5,
                 *,
                 max_train_group_size=np.inf,
                 max_test_group_size=np.inf,
                 group_gap=None,
                 verbose=False
                 ):
        super().__init__(n_splits, shuffle=False, random_state=None)
        self.max_train_group_size = max_train_group_size
        self.group_gap = group_gap
        self.max_test_group_size = max_test_group_size
        self.verbose = verbose

    def split(self, X, y=None, groups=None):
        """Generate indices to split data into training and test set.
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like of shape (n_samples,)
            Always ignored, exists for compatibility.
        groups : array-like of shape (n_samples,)
            Group labels for the samples used while splitting the dataset into
            train/test set.
        Yields
        ------
        train : ndarray
            The training set indices for that split.
        test : ndarray
            The testing set indices for that split.
        """
        if groups is None:
            raise ValueError(
                "The 'groups' parameter should not be None")
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        group_gap = self.group_gap
        max_test_group_size = self.max_test_group_size
        max_train_group_size = self.max_train_group_size
        n_folds = n_splits + 1
        group_dict = {}
        u, ind = np.unique(groups, return_index=True)
        unique_groups = u[np.argsort(ind)]
        n_samples = _num_samples(X)
        n_groups = _num_samples(unique_groups)
        for idx in np.arange(n_samples):
            if (groups[idx] in group_dict):
                group_dict[groups[idx]].append(idx)
            else:
                group_dict[groups[idx]] = [idx]
        if n_folds > n_groups:
            raise ValueError(
                ("Cannot have number of folds={0} greater than"
                 " the number of groups={1}").format(n_folds,
                                                     n_groups))

        group_test_size = min(n_groups // n_folds, max_test_group_size)
        group_test_starts = range(n_groups - n_splits * group_test_size,
                                  n_groups, group_test_size)
        for group_test_start in group_test_starts:
            train_array = []
            test_array = []

            group_st = max(0, group_test_start - group_gap - max_train_group_size)
            for train_group_idx in unique_groups[group_st:(group_test_start - group_gap)]:
                train_array_tmp = group_dict[train_group_idx]
                
                train_array = np.sort(np.unique(
                                      np.concatenate((train_array,
                                                      train_array_tmp)),
                                      axis=None), axis=None)

            train_end = train_array.size
 
            for test_group_idx in unique_groups[group_test_start:
                                                group_test_start +
                                                group_test_size]:
                test_array_tmp = group_dict[test_group_idx]
                test_array = np.sort(np.unique(
                                              np.concatenate((test_array,
                                                              test_array_tmp)),
                                     axis=None), axis=None)

            test_array  = test_array[group_gap:]
            
            
            if self.verbose > 0:
                    pass
                    
            yield [int(i) for i in train_array], [int(i) for i in test_array]

# Train LGB Regressor and Store Feature Importance

This is just a baseline traditional LGB model as usual.

In [None]:
# based on source https://www.kaggle.com/artgor/dota-eda-fe-and-models

def train_model(
    X,
    y,
    params,
    cv,
    score_func,
    plot_feature_importance=False,
    cat_features=[],
    importance_type='gain',
    groups=None,
    clip=True,
    clip_bounds=(-15,15)
):

    oof = []
    scores = []
    feature_importance = pd.DataFrame()
    models = []
    
    for fold_n, (train_index, valid_index) in enumerate(cv.split(X, groups=groups)):
        print('Fold', fold_n+1, 'started at', time.ctime())
        X_train, X_valid = X.loc[train_index], X.loc[valid_index]
        y_train, y_valid = y[train_index], y[valid_index]
        
        if clip:
            X_train = X_train.clip(clip_bounds[0], clip_bounds[1])
        
        train_data = lgb.Dataset(X_train, label=y_train)
        valid_data = lgb.Dataset(X_valid, label=y_valid)

        model = lgb.train(
            params=params,
            train_set=train_data,
            num_boost_round=2000,
            valid_sets=[
                train_data,
                valid_data
            ],
            callbacks=[
                lgb.early_stopping(stopping_rounds=200),
                lgb.log_evaluation(period=200)
            ],
            categorical_feature = \
                cat_features if len(cat_features) > 0 else 'auto'
        )
        
        models.append(model)

        y_pred_valid = model.predict(X_valid)

        oof.append(pd.DataFrame(index=valid_index, data=y_pred_valid.reshape(-1,), columns=['pred']))
        scores.append(score_func(y_valid, y_pred_valid))

        fold_importance = pd.DataFrame()
        fold_importance["feature"] = X.columns
        fold_importance["importance"] = model.feature_importance(importance_type)
        fold_importance["fold"] = fold_n + 1
        feature_importance = \
            pd.concat([feature_importance, fold_importance], axis=0)

    print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores),
                                                         np.std(scores)))
    
    feature_importance["importance"] /= (fold_n + 1)

    if plot_feature_importance:
        cols = (feature_importance[["feature", "importance"]]
                  .groupby("feature")
                  .mean()
                  .sort_values(by="importance", ascending=False)[:50].index)

        best_features = feature_importance.loc[feature_importance.feature.isin(cols)]

        plt.figure(figsize=(16, 10));
        sns.barplot(
            x="importance",
            y="feature",
            data=best_features.sort_values(by="importance", ascending=False));
        plt.title(f'LGBM Feature Importances (average over folds): {importance_type}');

    return oof, scores, feature_importance, models

In [None]:
tscv = PurgedGroupTimeSeriesSplit(
    n_splits=3,
    max_train_group_size=120,
    group_gap=10,
    max_test_group_size=40
)

In [None]:
params = {
    'feature_fraction': 0.85,
    'bagging_fraction': 0.85,
    'learning_rate': 0.05,
    'max_depth': -1,
    'min_data_in_leaf': 500,
    'num_threads': -1,
    'verbosity': -1,
    'objective': "regression"
}

In [None]:
%%time
oof, scores, feature_imp, lgb_models = train_model(
    train.dropna()[features].reset_index(drop=True),
    train.dropna()['target'].reset_index(drop=True),
    params=params,
    cv=tscv,
    plot_feature_importance=True,
    score_func=mean_squared_error,
    groups=train.dropna().reset_index(drop=True)['time_id']
)

# Train LGB Classifier and Inspect Feature Importance

Here is the adversarial classifier.

In [None]:
# based on source https://www.kaggle.com/artgor/dota-eda-fe-and-models

def train_adversarial(
    X,
    y,
    params,
    cv,
    score_func,
    plot_feature_importance=False,
    cat_features=[],
    importance_type='gain',
    groups=None,
    clip=True,
    clip_bounds=(-15,15)
):

    oof = []
    scores = []
    feature_importance = pd.DataFrame()
    models = []
    
    for fold_n, (train_index, valid_index) in enumerate(cv.split(X, groups=groups)):
        print('Fold', fold_n+1, 'started at', time.ctime())
        
        X_train, X_valid = X.loc[train_index], X.loc[valid_index]
        X_train['is_valid'] = 0
        X_valid['is_valid'] = 1
        
        X_train = pd.concat([X_train, X_valid])
        y_train = X_train.pop('is_valid')
        
        X_train, X_valid, y_train, y_valid = train_test_split(
            X_train, y_train, test_size=0.25, random_state=42)
        
        if clip:
            X_train = X_train.clip(clip_bounds[0], clip_bounds[1])
        
        train_data = lgb.Dataset(X_train, label=y_train)
        valid_data = lgb.Dataset(X_valid, label=y_valid)

        model = lgb.train(
            params=params,
            train_set=train_data,
            num_boost_round=100,
            valid_sets=[
                train_data,
                valid_data
            ],
            callbacks=[
                lgb.log_evaluation(period=50)
            ],
            categorical_feature = \
                cat_features if len(cat_features) > 0 else 'auto'
        )
        
        models.append(model)

        y_pred_valid = model.predict(X_valid)

        #oof.append(pd.DataFrame(index=valid_index, data=y_pred_valid.reshape(-1,), columns=['pred']))
        scores.append(score_func(y_valid, y_pred_valid))

        fold_importance = pd.DataFrame()
        fold_importance["feature"] = X.columns
        fold_importance["importance"] = model.feature_importance(importance_type)
        fold_importance["fold"] = fold_n + 1
        feature_importance = \
            pd.concat([feature_importance, fold_importance], axis=0)

    print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores),
                                                         np.std(scores)))
    
    feature_importance["importance"] /= (fold_n + 1)

    if plot_feature_importance:
        cols = (feature_importance[["feature", "importance"]]
                  .groupby("feature")
                  .mean()
                  .sort_values(by="importance", ascending=False)[:50].index)

        best_features = feature_importance.loc[feature_importance.feature.isin(cols)]

        plt.figure(figsize=(16, 10));
        sns.barplot(
            x="importance",
            y="feature",
            data=best_features.sort_values(by="importance", ascending=False));
        plt.xlim(0, 900_000);
        plt.title(f'LGBM Adversarial Feature Importances (average over folds): {importance_type}');

        
    return scores, feature_importance, models



In [None]:
params = {
    'feature_fraction': 0.85,
    'bagging_fraction': 0.85,
    'learning_rate': 0.05,
    'max_depth': -1,
    'min_data_in_leaf': 500,
    'num_threads': -1,
    'verbosity': -1,
    'objective': 'binary', #Binary target feature
    'metric': 'binary_logloss' #metric for binary classification
}

In [None]:
tscv = PurgedGroupTimeSeriesSplit(
    n_splits=3,
    max_train_group_size=120,
    group_gap=10,
    max_test_group_size=40
)

# Adversarial Train First Pass

In [None]:
%%time
scores_adv_1, feature_imp_adv_1, lgb_models_adv_1 = train_adversarial(
    train.dropna()[features].reset_index(drop=True),
    train.dropna()['target'].reset_index(drop=True),
    params=params,
    cv=tscv,
    plot_feature_importance=True,
    score_func=log_loss,
    groups = train.dropna().reset_index(drop=True)['time_id']
)

# 🚩🚩🚩 We have a problem 🚩🚩🚩

Log loss of 0.693 is random guessing. That's what we want. We don't want the classifier to be able to distinguish between the train and valid set. We are getting a very low log loss... **the classifier is good, which is bad!**
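
Where does 0.693 come from? It is $\ln 2$, the log loss of predicting 0.5 for every row. The short check below shows that, and also shows a caveat worth keeping in mind: when the two classes are not the same size, a classifier that knows nothing but the class prior already scores below 0.693. The 75/25 label mix here is only an illustrative assumption, not the actual class balance of the folds in this notebook.

```python
import numpy as np

def binary_log_loss(y_true, p):
    """Plain NumPy binary log loss, with clipping to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), 1e-15, 1 - 1e-15)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# Illustrative labels: 75% "train" rows (0) and 25% "valid" rows (1).
y = np.array([0] * 75 + [1] * 25)

# Predicting 0.5 everywhere gives ln(2), regardless of class balance.
coin_flip = binary_log_loss(y, np.full(y.shape, 0.5))

# Predicting the class prior everywhere gives the entropy of the label mix.
base_rate = binary_log_loss(y, np.full(y.shape, y.mean()))

print(round(coin_flip, 4), round(base_rate, 4))
```

So with imbalanced adversarial labels, the prior-only score may be the fairer "no signal" reference point.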

Fortunately, we see that the offending features are small in number:
- `f_74`
- `f_142`
- `f_63`

In [None]:
sorted_regression_importance = feature_imp.groupby('feature')['importance'].mean().sort_values(ascending=False)
sorted_regression_importance.head()

Let's see how important these are in target prediction across the 300 features. Lower number is more important.
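
As a quick illustration of what `get_loc` returns on a sorted importance Series, here is a toy version with made-up feature names and values. Position 0 is the most important feature.

```python
import pandas as pd

# Made-up importances; the real ones come from the regressor above.
importance = pd.Series({"f_1": 10.0, "f_2": 40.0, "f_3": 25.0, "f_4": 5.0})
ranked = importance.sort_values(ascending=False)

# Zero-based position of each feature of interest in the sorted index.
positions = {feat: ranked.index.get_loc(feat) for feat in ["f_3", "f_4"]}
print(positions)
```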

In [None]:
sorted_regression_importance.index.get_loc('f_74')

In [None]:
sorted_regression_importance.index.get_loc('f_142')

In [None]:
sorted_regression_importance.index.get_loc('f_63')

# Transform Offenders

These offending features are indeed reasonably important in the regressor. We will do a rank transformation by `time_id` to make these features stationary.
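
On a toy frame, the per-`time_id` percentile rank looks like this. The column `f_x` and its values are made up: the second `time_id` has the same ordering as the first but sits at a much higher level, which is exactly the kind of drift we want to remove.

```python
import pandas as pd

toy = pd.DataFrame({
    "time_id": [0, 0, 0, 1, 1, 1],
    "f_x": [1.0, 2.0, 3.0, 101.0, 102.0, 103.0],  # level shifts between time_ids
})

# Replace each value with its percentile rank within its own time_id.
for feat in ["f_x"]:
    toy[feat] = toy.groupby("time_id")[feat].rank(pct=True)

print(toy)
```

After the transform both `time_id` groups carry identical values, so the level shift is gone while the within-day ordering is kept.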

In [None]:
offenders = ['f_74', 'f_142', 'f_63']

In [None]:
train['f_74']  = train[['time_id', 'f_74']].groupby('time_id').rank(pct=True)

In [None]:
train['f_142']  = train[['time_id', 'f_142']].groupby('time_id').rank(pct=True)

In [None]:
train['f_63']  = train[['time_id', 'f_63']].groupby('time_id').rank(pct=True)

# "Why does the rank transformation help here?"

This is a question I have gotten a few times, so I am making a section on it. Imagine we have 3 stocks with a single feature that evolves over time like the following.

In [None]:
np.random.seed(0)

feature_unstacked = pd.DataFrame(
    index = np.arange(100),
    data = {
        'A': np.cumsum(0.60 + np.random.normal(size=100)),
        'B': np.cumsum(0.30 + np.random.normal(size=100)),
        'C': np.cumsum(0.0 + np.random.normal(size=100))
})

In [None]:
feature = (
    feature_unstacked
    .stack()
    .to_frame()
    .reset_index()
    .rename(columns={'level_0': 'time_id', 'level_1': 'investment_id', 0: 'feature'})
)

feature.head()

In [None]:
target = (pd.DataFrame(
    index = np.arange(100),
    data = {
        'A': (0.60 + np.random.normal(size=100)),
        'B': (0.30 + np.random.normal(size=100)),
        'C': (0.0 + np.random.normal(size=100))
}).stack()
  .to_frame()
  .reset_index()
  .rename(columns={'level_0': 'time_id', 'level_1': 'investment_id', 0: 'target'}))

target.head()

In [None]:
plt.scatter(feature['feature'], target['target'])
plt.xlabel('feature value')
plt.ylabel('target');

This is a great feature! Pearson corr would be the highest in all the Ubiquant data.

In [None]:
pearsonr(feature['feature'], target['target'])[0]

So, what's the problem? If we plot the feature by stock, we see that the feature is highly non-stationary. We don't need an adversarial training run to see by eye that we could easily distinguish the earlier period from the later period. This is the key problem that adversarial validation is meant to uncover.

In [None]:
feature_unstacked.plot()

What can we do??? We want to retain this feature because it looks like a good predictor, but it is highly non-stationary so our GBDT model will overfit. We can **rank transform across stocks within each day** as follows.

In [None]:
feature_unstacked.rank(axis=1).plot()

We retain the primary relationship in the data: A > B > C. But now it's going to be very hard to tell across folds what time stage we are in. Problem solved!
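
We can also put a number on "hard to tell what time stage we are in". The sketch below rebuilds a similar toy feature (using its own random generator, so the values differ from the cells above) and compares the pooled mean of the first 50 time steps with the last 50, before and after ranking. The helper `period_gap` is made up for this check.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
drift = {"A": 0.60, "B": 0.30, "C": 0.0}

# Drifting random walks, one column per stock, one row per time step.
raw = pd.DataFrame({k: np.cumsum(mu + rng.normal(size=100)) for k, mu in drift.items()})
ranked = raw.rank(axis=1)  # rank across stocks within each time step

def period_gap(frame):
    """Absolute difference between the pooled means of the two halves."""
    early = frame.iloc[:50].to_numpy().mean()
    late = frame.iloc[50:].to_numpy().mean()
    return abs(late - early)

raw_gap = period_gap(raw)
rank_gap = period_gap(ranked)
print(raw_gap, rank_gap)
```

Each row of `ranked` is a permutation of 1, 2, 3, so its pooled mean is 2 in every period and the gap vanishes. The raw feature has no such guarantee.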

In [None]:
feature_fixed = (
    feature_unstacked
    .rank(axis=1)
    .stack()
    .to_frame()
    .reset_index()
    .rename(columns={'level_0': 'time_id', 'level_1': 'investment_id', 0: 'feature'})
)


In [None]:
plt.scatter(feature_fixed['feature'], target['target'])

In [None]:
pearsonr(feature_fixed['feature'], target['target'])[0]

And there it is... we turned a non-stationary feature with a correl of ~0.24 into a stationary feature with a correl of ~0.24. **This is the power of the rank transformation for financial problems like this.**

# Adversarial Second Pass

When we run the adversarial train again, we see that the classifier gets worse!

In [None]:
tscv = PurgedGroupTimeSeriesSplit(
    n_splits=3,
    max_train_group_size=120,
    group_gap=10,
    max_test_group_size=40
)

In [None]:
%%time
scores_adv_2, feature_imp_adv_2, lgb_models_adv_2 = train_adversarial(
    train.dropna()[features].reset_index(drop=True),
    train.dropna()['target'].reset_index(drop=True),
    params=params,
    cv=tscv,
    plot_feature_importance=True,
    score_func=log_loss,
    groups = train.dropna().reset_index(drop=True)['time_id']
)

After transforming the offending features, we see that our classifier is worse (which is good!).

![](https://64.media.tumblr.com/a8ac2ebbec3f7b9829502e1b7ad4ba5f/tumblr_mqrz7oEOYc1rt1ivno7_250.gifv)

# Train Final Model on Transformed Features

In [None]:
tscv = PurgedGroupTimeSeriesSplit(
    n_splits=3,
    max_train_group_size=120,
    group_gap=10,
    max_test_group_size=40
)

params = {
    'feature_fraction': 0.85,
    'bagging_fraction': 0.85,
    'learning_rate': 0.05,
    'max_depth': -1,
    'min_data_in_leaf': 500,
    'num_threads': -1,
    'verbosity': -1,
    'objective': "regression"
}

In [None]:
%%time
oof_t, scores_t, feature_imp_t, lgb_models_t = train_model(
    train.dropna()[features].reset_index(drop=True),
    train.dropna()['target'].reset_index(drop=True),
    params=params,
    cv=tscv,
    plot_feature_importance=False,
    score_func=mean_squared_error,
    groups=train.dropna().reset_index(drop=True)['time_id']
)

In [None]:
np.mean(scores), np.std(scores)

In [None]:
np.mean(scores_t), np.std(scores_t)

Nice! We get a small improvement in both mean CV score and standard deviation across folds.

# Prediction Time!

In [None]:
%%time

import ubiquant

env = ubiquant.make_env()   # initialize the environment
iter_test = env.iter_test()    # an iterator which loops over the test set and sample submission

for (test_df, sample_prediction_df) in iter_test:
    
    test_df['target'] = 0
    
    # we need to transform like we did in training!
    for feat in offenders:
        test_df[feat] = test_df[feat].rank(pct=True)
    
    # we are predicting with the models trained on the transformed offenders
    for i, mod in enumerate(lgb_models_t): 
        test_df['target'] += mod.predict(test_df[features])
    test_df['target'] /= len(lgb_models_t)

    env.predict(test_df[['row_id','target']])

[Hamel Husain](https://twitter.com/HamelHusain/status/1504516124632772628?s=20&t=uEiVIr9KO-aOpHPCB1VdRg) kindly shared this slide about AV to monitor production models in a response to one of my Tweets.

![](https://i.imgur.com/K1W4NAg.png)

Thank you for taking a look at this notebook. Please leave comments and suggestions.