
Add adversarial validation to sample_similarity #4

Open

Matgrb opened this issue Nov 12, 2020 · 8 comments
Labels
enhancement (New feature or request) · investigation needed (One to do research related to this issue, and share findings)

Comments

@Matgrb
Collaborator

Matgrb commented Nov 12, 2020

Once we find out that the distributions of the training and test data are different, we can use adversarial validation to identify a subset of the training data that is similar to the test data.
This subset can then be used as the validation set. That way we get a good idea of how our model will perform on the test set, which comes from a different distribution than our training set.
The pseudo-code for adversarial validation could look as follows:


    """
    Args:
        train   : training dataframe
        test    : testing dataframe
        clf     : classifier used to seperate train and test
        threshold : threshold , default = 0.5

    Returns:
       adv_val_set : validation set.

    """

    train['istest']=0
    test['istest']=1
    df = pd.concat([train,test],axis=0)
    y = df['istest']
    X = df.drop(columns=['istest'],axis=1)
    proba = cross_val_predict(clf,X,y,cv=3,method='predict_proba')
    df['test_proba'] = proba[:,1]
    adv_val_set = df.query('istest==0 and test_proba > {threshold}')
    return adv_val_set

More information can be found here and here

@Matgrb added the enhancement (New feature or request) label on Nov 12, 2020
@timvink
Collaborator

timvink commented Nov 27, 2020

I just watched a presentation by @nanne-aben on covariate shift that details a different approach:

  1. train a resemblance model (he calls it an adversarial model) between train and test
  2. determine the sample weight w as p(non-train | Xi) / p(train | Xi) for each train instance
  3. train your actual model using the sample weights

A benefit of that approach is that you do not have to subsample your training data, so you don't lose any information.

probatus already offers SHAPImportanceResemblance for 1). For 2), I think a helper method might actually be really useful. For 3), passing sample weights is straightforward enough already :)

Definition of done would be the helper method for sample weights (rough sketch below) + a tutorial on "dealing with covariate shift" in the probatus docs.
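
For illustration, a minimal sketch of such a sample-weight helper, assuming only a fitted binary resemblance classifier with predict_proba; the function name and the clipping range are made up here:

import numpy as np

def resemblance_sample_weights(resemblance_clf, X_train, clip=(0.1, 10)):
    # Probability that each training row looks like a non-train (test) row.
    p_non_train = resemblance_clf.predict_proba(X_train)[:, 1]
    # w = p(non-train | x) / p(train | x); clip so single rows cannot dominate.
    weights = p_non_train / (1 - p_non_train)
    return np.clip(weights, *clip)

The resulting weights would then simply be passed to the final model, e.g. clf.fit(X_train, y_train, sample_weight=weights).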

Thoughts?

@Matgrb
Collaborator Author

Matgrb commented Nov 27, 2020

This is definitely something that would be nice to have. A couple of thoughts:

  1. How do we see this feature being used further? In a way we would use quite a lot of information from the test set, even if we don't use its labels. Wouldn't this bias the OOT test score we measure?

  2. Implementing this would require some work on how we handle data.

Currently we do a train/test split within the resemblance model (here train and test are created from the combined X1 and X2, which unfortunately are also called train and test in your example). To calculate the sample weights, we would need predictions for all samples of X1, which would require cross-validation.

That is why this would either require building a completely separate feature, similar to SHAPImportanceResemblance but with the CV implemented correctly, or reworking the entire sample_similarity module to use CV instead of a train/test split. I would vote for the first option, because the resemblance model doesn't really need CV: it is a simple test and not about squeezing the most out of the model's performance.
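
To make the CV point concrete, a rough sketch of how out-of-fold probabilities for all samples of X1 could be obtained; the names here are illustrative, not an existing probatus API:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def out_of_fold_test_proba(X1, X2, clf=None, cv=5):
    clf = clf or RandomForestClassifier(class_weight='balanced')
    X = pd.concat([X1, X2], axis=0)
    y = np.concatenate([np.zeros(len(X1), dtype=int), np.ones(len(X2), dtype=int)])
    # Each sample is predicted by a model that never saw it during training.
    proba = cross_val_predict(clf, X, y, cv=cv, method='predict_proba')[:, 1]
    # Keep only the X1 rows; these probabilities can feed the weight formula above.
    return proba[: len(X1)]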

@nanne-aben

nanne-aben commented Nov 27, 2020 via email

@nanne-aben

nanne-aben commented Nov 27, 2020 via email

@Matgrb
Collaborator Author

Matgrb commented Nov 30, 2020

I like it, especially the second option you presented using CV; it is more data efficient. Another tweak that could be done there is using a model with class_weight='balanced'.

Could you share the experiments? I am interested in how this works in practice.

Regarding the bias, this is tricky. Imagine an OOT test set that covers the entire Covid-19 pandemic. During that time the dataset changed dramatically compared to the pre-pandemic train set. If you use the data distribution during the pandemic to make training on the pre-pandemic dataset better suited, you introduce a strong leakage of information from test to train. The model will definitely be better suited for the future, assuming the situation doesn't change much post-pandemic, but the estimated performance is less realistic, because the model "knew" about the upcoming data shift, even though in production it would not. This is of course an extreme example, but it illustrates where this could go wrong. In the end it is the user's choice whether this bias is an issue for a given problem.

A couple of usage scenarios I can think of that would decrease the possible impact of such bias:

  • Set the last month of train data aside as the validation set. In this case the older train data can be weighted to better represent the most recent period, and no bias is introduced by using information from the test set.
  • Split the test set into two parts and use only one of them for adversarial validation. The performance on the first and second part of the test set can then be compared to indicate whether any bias was introduced (if the performance on Test1 and Test2 differs); see the sketch after this list.
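
A minimal sketch of that second check, assuming Test1 was used for the adversarial reweighting, Test2 was kept fully unseen, and reweighted_model is the retrained model (all names hypothetical):

from sklearn.metrics import roc_auc_score

auc_test1 = roc_auc_score(y_test1, reweighted_model.predict_proba(X_test1)[:, 1])
auc_test2 = roc_auc_score(y_test2, reweighted_model.predict_proba(X_test2)[:, 1])
# A noticeably higher AUC on Test1 than on Test2 suggests the reweighting
# leaked information from the part of the test set it was allowed to see.
print(f"AUC Test1: {auc_test1:.3f}, AUC Test2: {auc_test2:.3f}")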

@nanne-aben

nanne-aben commented Dec 2, 2020 via email

@timvink
Collaborator

timvink commented Dec 22, 2020

Interesting discussion.

Framed slightly differently, you could use adversarial/resemblance modelling to calibrate your model as a last (retraining) step, in order to improve performance in production where there is a (known) distribution shift like covid-19.

To do that without leakage, you need to get X_train_adversarial by splitting your out-of-time test set in two, or by taking previously unused out-of-time data for which you don't have labels yet. You then train a resemblance model between X_train and X_train_adversarial, use that model to set instance weights for your original model, and retrain it one more time using those weights. You can then measure the performance difference between your original model and your calibrated model on the same out-of-time test dataset.

Back to probatus. I think there is an opportunity to build some tooling & documentation for this in a new probatus.calibration module. Some pseudo code:

# We have X_train, y_train, X_test, y_test, X_adversarial

# Normal model
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier().fit(X_train, y_train)

# Resemblance model
from sklearn.ensemble import RandomForestClassifier
from probatus.sample_similarity import SHAPImportanceResemblance
clf = RandomForestClassifier()
rm = SHAPImportanceResemblance(clf)
shap_resemblance_model = rm.fit_compute(X_train, X_adversarial)

# Model calibration
from lightgbm import LGBMClassifier
resemblance_model = shap_resemblance_model.model  # new method
probs = resemblance_model.predict_proba(X_train)[:, 1]
weights = calculate_weight(probs)  # new function
calibrated_model = LGBMClassifier().fit(X_train, y_train, sample_weight=weights)

# Compare performance
# get AUC from model.predict_proba(X_test) vs y_test
# get AUC from calibrated_model.predict_proba(X_test) vs y_test

The new parts are in the model calibration section. I think we can simplify that process a bit more, maybe something like:

ac = probatus.calibration.AdversarialCalibrator()
ac.fit_compute(model, resemblance_model, X_train, y_train, X_test, X_adversarial)  # returns pd.DataFrame comparing calibrated model with non-calibrated model

Thoughts?

@nanne-aben

nanne-aben commented Jan 4, 2021 via email

@timvink added the investigation needed (One to do research related to this issue, and share findings) label on Mar 8, 2021