
Add adversarial validation to sample_similarity #4

Open

Matgrb opened this issue Nov 12, 2020 · 8 comments
Labels
enhancement (New feature or request) · investigation needed (One to do research related to this issue, and share findings)

Comments

@Matgrb
Collaborator

Matgrb commented Nov 12, 2020

Once we find out that the distributions of the training and test data are different, we can use adversarial validation to identify a subset of the training data that is similar to the test data.
This subset can then be used as the validation set. That way we get a good idea of how our model will perform on the test set, which comes from a different distribution than our training set.
The pseudo-code for adversarial validation could look as follows:


    """
    Args:
        train   : training dataframe
        test    : testing dataframe
        clf     : classifier used to seperate train and test
        threshold : threshold , default = 0.5

    Returns:
       adv_val_set : validation set.

    """

    train['istest']=0
    test['istest']=1
    df = pd.concat([train,test],axis=0)
    y = df['istest']
    X = df.drop(columns=['istest'],axis=1)
    proba = cross_val_predict(clf,X,y,cv=3,method='predict_proba')
    df['test_proba'] = proba[:,1]
    adv_val_set = df.query('istest==0 and test_proba > {threshold}')
    return adv_val_set

More information can be found here and here

@Matgrb added the enhancement (New feature or request) label on Nov 12, 2020
@timvink
Collaborator

timvink commented Nov 27, 2020

I just watched a presentation by @nanne-aben on covariate shift that details a different approach:

  1. train a resemblance model (he calls it an adversarial model) between train and test
  2. determine the sample weight w as p(non-train | Xi) / p(train | Xi) for each train instance
  3. train your actual model using the sample weights

A benefit of that approach is that you do not have to subsample your training data, so you don't lose any information.

probatus already offers SHAPImportanceResemblance for 1). For 2), I think a helper method might actually be really useful. For 3), passing sample weights is straightforward enough already :)

Definition of done would be the helper method for sample weights (rough sketch below) + a tutorial on "dealing with covariate shift" in the probatus docs.
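
For illustration, a minimal sketch of such a sample-weight helper, assuming only a fitted binary resemblance classifier with predict_proba; the function name and the clipping range are made up here:

import numpy as np

def resemblance_sample_weights(resemblance_clf, X_train, clip=(0.1, 10)):
    # Probability that each training row looks like a non-train (test) row.
    p_non_train = resemblance_clf.predict_proba(X_train)[:, 1]
    # w = p(non-train | x) / p(train | x); clip so single rows cannot dominate.
    weights = p_non_train / (1 - p_non_train)
    return np.clip(weights, *clip)

The resulting weights would then simply be passed to the final model, e.g. clf.fit(X_train, y_train, sample_weight=weights).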

Thoughts?

@Matgrb
Collaborator Author

Matgrb commented Nov 27, 2020

This is definitely something that would be nice to have. A couple of thoughts:

  1. How do we see this feature being used further? In a way we would use quite a lot of information from the test set, even if we don't use its labels. Wouldn't this bias the OOT test score we measure?

  2. Implementing this would require some work on how we handle data.

Currently we do a train/test split within the resemblance model (here train and test are created from the combined X1 and X2, which unfortunately are also called train and test in your example). To calculate the sample weights, we would need predictions for all samples of X1, which would require cross-validation.

That is why this would either require building a completely separate feature, similar to SHAPImportanceResemblance but with the CV implemented correctly, or reworking the entire sample_similarity module to use CV instead of a train/test split. I would vote for the first option, because the resemblance model doesn't really need CV: it is a simple test and not about squeezing the most out of the model's performance.
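
To make the CV point concrete, a rough sketch of how out-of-fold probabilities for all samples of X1 could be obtained; the names here are illustrative, not an existing probatus API:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def out_of_fold_test_proba(X1, X2, clf=None, cv=5):
    clf = clf or RandomForestClassifier(class_weight='balanced')
    X = pd.concat([X1, X2], axis=0)
    y = np.concatenate([np.zeros(len(X1), dtype=int), np.ones(len(X2), dtype=int)])
    # Each sample is predicted by a model that never saw it during training.
    proba = cross_val_predict(clf, X, y, cv=cv, method='predict_proba')[:, 1]
    # Keep only the X1 rows; these probabilities can feed the weight formula above.
    return proba[: len(X1)]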

@nanne-aben

nanne-aben commented Nov 27, 2020 via email

@nanne-aben

nanne-aben commented Nov 27, 2020 via email

@Matgrb
Collaborator Author

Matgrb commented Nov 30, 2020

I like it, especially the second option you presented using CV; it is more data efficient. Another tweak that could be done there is using a model with class_weight='balanced'.

Could you share the experiments? I am interested in how this works in practice.

Regarding the bias, this is tricky. Imagine an OOT test set that covers the entire Covid-19 pandemic. During that time the dataset changed dramatically compared to the pre-pandemic train set. If you use the data distribution during the pandemic to make training on the pre-pandemic dataset better suited, you introduce a strong leakage of information from test to train. The model will definitely be better suited for the future, assuming the situation doesn't change much post-pandemic, but the estimated performance is less realistic, because the model "knew" about the upcoming data shift, even though in production it would not. This is of course an extreme example, but it illustrates where this could go wrong. In the end it is the user's choice whether this bias is an issue for a given problem.

A couple of usage scenarios I can think of that would decrease the possible impact of such bias:

  • Set the last month of train data aside as the validation set. In this case the older train data can be weighted to better represent the most recent period, and no bias is introduced by using information from the test set.
  • Split the test set into two parts and use only one of them for adversarial validation. The performance on the first and second part of the test set can then be compared to indicate whether any bias was introduced (if the performance on Test1 and Test2 differs); see the sketch after this list.
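
A minimal sketch of that second check, assuming Test1 was used for the adversarial reweighting, Test2 was kept fully unseen, and reweighted_model is the retrained model (all names hypothetical):

from sklearn.metrics import roc_auc_score

auc_test1 = roc_auc_score(y_test1, reweighted_model.predict_proba(X_test1)[:, 1])
auc_test2 = roc_auc_score(y_test2, reweighted_model.predict_proba(X_test2)[:, 1])
# A noticeably higher AUC on Test1 than on Test2 suggests the reweighting
# leaked information from the part of the test set it was allowed to see.
print(f"AUC Test1: {auc_test1:.3f}, AUC Test2: {auc_test2:.3f}")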

@nanne-aben

nanne-aben commented Dec 2, 2020 via email

@timvink
Collaborator

timvink commented Dec 22, 2020

Interesting discussion.

Framed slightly differently, you could use adversarial/resemblance modelling to calibrate your model as a last (retraining) step, in order to improve performance in production where there is a (known) distribution shift like covid-19.

To do that without leakage, you need to get X_train_adversarial by splitting your out-of-time test set in two, or by taking previously unused out-of-time data for which you don't have labels yet. You then train a resemblance model between X_train and X_train_adversarial, use that model to set instance weights for your original model, and retrain it one more time using those weights. You can then measure the performance difference between your original model and your calibrated model on the same out-of-time test dataset.

Back to probatus. I think there is an opportunity to build some tooling & documentation for this in a new probatus.calibration module. Some pseudo code:

# We have X_train, y_train, X_test, y_test, X_adversarial

# Normal model
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier().fit(X_train, y_train)

# Resemblance model
from sklearn.ensemble import RandomForestClassifier
from probatus.sample_similarity import SHAPImportanceResemblance
clf = RandomForestClassifier()
rm = SHAPImportanceResemblance(clf)
shap_resemblance_model = rm.fit_compute(X_train, X_adversarial)

# Model calibration
from lightgbm import LGBMClassifier
resemblance_model = shap_resemblance_model.model  # new method
probs = resemblance_model.predict_proba(X_train)[:, 1]
weights = calculate_weight(probs)  # new function
calibrated_model = LGBMClassifier().fit(X_train, y_train, sample_weight=weights)

# Compare performance
# get AUC from model.predict_proba(X_test) vs y_test
# get AUC from calibrated_model.predict_proba(X_test) vs y_test

The new parts are in the model calibration section. I think we can simplify that process a bit more, maybe something like:

ac = probatus.calibration.AdversarialCalibrator()
ac.fit_compute(model, resemblance_model, X_train, y_train, X_test, X_adversarial)  # returns pd.DataFrame comparing calibrated model with non-calibrated model

Thoughts?

@nanne-aben

nanne-aben commented Jan 4, 2021 via email

@timvink added the investigation needed (One to do research related to this issue, and share findings) label on Mar 8, 2021