
Add adversarial validation to sample_similarity #4

Closed
@Matgrb

Description

Once we find out that the distributions of the training and test data are different, we can use adversarial validation to identify a subset of the training data that is similar to the test data.
This subset can then be used as the validation set. That way we get a good idea of how our model will perform on the test set, which comes from a different distribution than our training set.
The pseudo-code to include adversarial validation could be as follows:


    """
    Args:
        train   : training dataframe
        test    : testing dataframe
        clf     : classifier used to seperate train and test
        threshold : threshold , default = 0.5

    Returns:
       adv_val_set : validation set.

    """

    train['istest']=0
    test['istest']=1
    df = pd.concat([train,test],axis=0)
    y = df['istest']
    X = df.drop(columns=['istest'],axis=1)
    proba = cross_val_predict(clf,X,y,cv=3,method='predict_proba')
    df['test_proba'] = proba[:,1]
    adv_val_set = df.query('istest==0 and test_proba > {threshold}')
    return adv_val_set
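
A minimal usage sketch, assuming feature-only dataframes `X_train` and `X_test` and any scikit-learn classifier as the discriminator (the variable names and the choice of `RandomForestClassifier` are illustrative, not part of the proposal):

    # Hypothetical usage of the helper sketched above
    from sklearn.ensemble import RandomForestClassifier

    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    adv_val_set = get_adversarial_validation_set(X_train, X_test, clf, threshold=0.5)
    print(f"{len(adv_val_set)} training rows look similar to the test data")

The threshold controls the size of the result: a higher value keeps only the most test-like rows, at the cost of a smaller validation set.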

More information can be found here and here.

Metadata

    Labels

    enhancement (New feature or request), investigation needed (One to do research related to this issue, and share findings)
