Description
Once we find that the distributions of the training and test data differ, we can use adversarial validation to identify a subset of the training data that is similar to the test data.
This subset can then be used as the validation set. That way we get a good idea of how our model will perform on the test set, which comes from a different distribution than our training set.
The pseudo-code for adversarial validation can be as follows:
"""
Args:
train : training dataframe
test : testing dataframe
clf : classifier used to seperate train and test
threshold : threshold , default = 0.5
Returns:
adv_val_set : validation set.
"""
train['istest']=0
test['istest']=1
df = pd.concat([train,test],axis=0)
y = df['istest']
X = df.drop(columns=['istest'],axis=1)
proba = cross_val_predict(clf,X,y,cv=3,method='predict_proba')
df['test_proba'] = proba[:,1]
adv_val_set = df.query('istest==0 and test_proba > {threshold}')
return adv_val_set
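A minimal usage sketch, assuming the adversarial_validation function above; the toy DataFrames (train_df, test_df) and the choice of RandomForestClassifier are purely illustrative:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy data where the test feature f1 is shifted relative to train
rng = np.random.default_rng(0)
train_df = pd.DataFrame({'f1': rng.normal(0, 1, 500), 'f2': rng.normal(0, 1, 500)})
test_df = pd.DataFrame({'f1': rng.normal(1, 1, 200), 'f2': rng.normal(0, 1, 200)})

clf = RandomForestClassifier(n_estimators=100, random_state=0)
adv_val_set = adversarial_validation(train_df, test_df, clf, threshold=0.5)

# Drop the helper columns before using these rows for validation
adv_val_set = adv_val_set.drop(columns=['istest', 'test_proba'])
print(f"{len(adv_val_set)} training rows selected as test-like validation data")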
More information can be found here and here.