# Introduction

This Notebook uses the TestSuite from mercury.robust in order to execute several tests and check the result afterwards. This is an alternative to running a test individually as shown in the notebook RobustTestingExample.ipynb. The TestSuite approach can be useful to guarantee some specific conditions of a trained model and used dataset 

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import clone as clone_sklearn_model
from sklearn.metrics import f1_score

pd.set_option('display.max_colwidth', None)

## Load Dataset

We will use the default credit card Dataset from the UCI machine learning repository. The dataset was used in [[2]](#[2])

In [2]:
path_dataset = "./data/credit/"
uci_credit = pd.read_csv(path_dataset + "UCI_Credit_card.csv")
uci_credit = uci_credit.drop('ID', axis=1)

uci_credit["SEX"] = uci_credit["SEX"].astype(str)
uci_credit["EDUCATION"] = uci_credit["EDUCATION"].astype(str)
uci_credit["MARRIAGE"] = uci_credit["MARRIAGE"].astype(str)

In [3]:
df_train, df_test = train_test_split(uci_credit, test_size=0.3, random_state=2000)
print(df_train.shape)
print(df_test.shape)

(21000, 24)
(9000, 24)


Let's specify which features will be numerical, which will be categorical, and what will be the target column

In [4]:
pay_feats = [c for c in df_train.columns if "PAY_" in c]
bill_feats = [c for c in df_train.columns if "BILL_" in c]
num_feats = ['LIMIT_BAL', 'AGE'] + pay_feats + bill_feats

cat_feats = ['SEX', 'EDUCATION', 'MARRIAGE']

label_col = "default.payment.next.month"

## Train model

We create a function to create the model. The model will be a sklearn pipeline, which will be composed by transformations of numerical and categorical features (separately) and a RandomForestClassifier.

We also define a function to train the model (we will use it to train the model now and later in our tests).

In [5]:
def create_model(num_feats=None, cat_feats=None, random_state=None, max_depth=None, min_samples_leaf=1):

    numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])
    categorical_transformer = OneHotEncoder(handle_unknown="ignore")

    preprocessor = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, num_feats),
            ("cat", categorical_transformer, cat_feats),
        ], remainder='drop'
    )

    pipeline = Pipeline(
        steps=[
            ("preprocessor", preprocessor), 
            ("classifier", RandomForestClassifier(
                random_state=random_state, 
                class_weight='balanced', 
                max_depth=None,
                min_samples_leaf=min_samples_leaf
            ))]
    )
    
    return pipeline

def train_model(model, X, y, train_params=None):
    return model.fit(X, y)
    

Let's now to train a model

In [6]:
pipeline = create_model(num_feats=num_feats, cat_feats=cat_feats)
unfitted_pipeline = clone_sklearn_model(pipeline)
fitted_pipeline = train_model(pipeline, df_train[num_feats + cat_feats], df_train[label_col])

print(f1_score(df_test[label_col], fitted_pipeline.predict(df_test)))

0.44204322200392926


## Create TestSuite

Now we will create a TestSuite with some tests that will check our data and our model. 

We need to import the tests that we are going to use and the `TestSuite` object from mercury.robust

In [7]:
from mercury.dataschema import DataSchema
from mercury.dataschema.feature import FeatType
from mercury.robust.data_tests import (
    SameSchemaTest, 
    LinearCombinationsTest,
    LabelLeakingTest,
    NoisyLabelsTest,
    SampleLeakingTest,
    NoDuplicatesTest
)
from mercury.robust.model_tests import ModelReproducibilityTest, TreeCoverageTest
from mercury.robust import TestSuite

Let's create a `DataSchema` object for our dataset. By default, it tries to infer the types of each variable. In this case, it identifies some payment features as categorical features because they have few unique values. However, we prefer them to be discrete, so we specify a custom_feature_mapping for this two columns:

In [8]:
custom_feature_mapping = {
    "PAY_0": FeatType.DISCRETE,
    "PAY_2": FeatType.DISCRETE,
    "PAY_3": FeatType.DISCRETE,
    "PAY_4": FeatType.DISCRETE,
    "PAY_5": FeatType.DISCRETE,
    "PAY_6": FeatType.DISCRETE,
}

schma_reference = DataSchema().generate(df_train, force_types=custom_feature_mapping).calculate_statistics()

Since our tests could fail, we might need to create the Suite several times, so let's define a function in order to be able to create the same several times easily. Our Test Suite will be composed by:
- `SampleLeakingTest`: Looks if there are samples in the test dataset that are identical to samples in the base/train dataset. If that is the case, the test fails.
- `NoDuplicatesTest`: Checks if no duplcated samples are present in a dataframe (training set in this case).
- `LinearCombinationsTest`: Ensures a certain dataset doesn't have any linear combination between its numerical columns and no categorical variable is redundant.
- `LabelLeakingTest`: Ensures that the target variable is not being leaked into the predictors.
- `NoisyLabelsTest`: Looks if the labels of a dataset contain a high level of noise.
- `ModelReproducibilityTest`: Checks if the training of a model is reproducible.
- `TreeCoverageTest`: Checks whether a given test_dataset covers a minimum percentage of all the branches of a tree (specific for tree-based models)

In [9]:
def create_suite(df_train, df_test, unfitted_model, fitted_model, label_name, num_feats, cat_feats, schma_reference):
    
    # Sample Leaking Test
    sample_leaking_test = SampleLeakingTest(
        base_dataset=df_train[num_feats + cat_feats + [label_name]],
        test_dataset=df_test[num_feats + cat_feats + [label_name]]
    )

    # NoDuplicatesTest
    no_dups_test = NoDuplicatesTest(
        dataset=df_train[cat_feats + num_feats + [label_name]]
    )

    # Linear Combinations Test
    linear_comb_test = LinearCombinationsTest(df_train, dataset_schema=schma_reference)

    # Label Leaking Test (we ignore str features since currently are not supported in this test)
    ignore_feats = [f for f in df_train.columns if f not in num_feats+cat_feats and f!=label_name]
    label_leak_test = LabelLeakingTest(
        df_train,
        label_name=label_name,
        ignore_feats=ignore_feats,
        threshold = 0.05,
        dataset_schema=schma_reference,
        handle_str_cols='transform'
    )

    # Noisy Labels Test
    noisy_labels_test = NoisyLabelsTest(
        df_train,
        label_name=label_name,
        threshold=0.30,
        preprocessor=unfitted_pipeline.steps[0][1],
        dataset_schema=schma_reference,
        #custom_feature_map = custom_feature_mapping,
        label_issues_args={"clf": LogisticRegression(solver='newton-cg')}
    )

    # Model Reproducibility Test
    def eval_model(model, X, y, eval_params=None):
        return f1_score(y, model.predict(X))

    def get_predictions(model, X, predict_params=None):
        return model.predict(X)

    reproducibility_test = ModelReproducibilityTest(
        model = unfitted_pipeline,
        train_dataset = df_train,
        target=label_name,
        train_fn = train_model,
        eval_fn = eval_model,
        train_params = None,
        eval_params = None,
        predict_fn = get_predictions,
        threshold_eval = 0,
        threshold_yhat = 0,
        test_dataset = df_test
    )

    # Tree Coverage Test
    tree_coverage_test = TreeCoverageTest(model=fitted_model, test_dataset=df_test[num_feats + cat_feats])
    
    # Create Suite
    test_suite = TestSuite(
        tests=[
            sample_leaking_test,
            no_dups_test,
            linear_comb_test,
            label_leak_test,
            noisy_labels_test,  
            reproducibility_test,
            tree_coverage_test
        ]
    )
    return test_suite

Now let's create a suite and run it. We can use the method `run()` to run the suite and the method `get_results_as_df()` to obtain a summary of the results once all the tests have run:

In [10]:
test_suite = create_suite(df_train, df_test, unfitted_pipeline, fitted_pipeline, label_col, num_feats, cat_feats, schma_reference)
test_results = test_suite.run()
test_suite.get_results_as_df()

Unnamed: 0,name,state,error,info
0,SampleLeakingTest,TestState.FAIL,Num of samples in test set that appear in train set is 17 (a proportion of 0.002 test samples )and the max allowed is 0,"{'num_duplicated': 17, 'percentage_duplicated': 0.001888888888888889}"
1,NoDuplicatesTest,TestState.FAIL,Your dataset has 17 duplicates. Drop or inspect them via the `info` method,"{'num_duplicated': 17, 'index_duplicates': [28779, 13556, 10374, 627, 19487, 21881, 22726, 13106, 18325, 11976, 6313, 6124, 1601, 13912, 21768, 8320, 27966]}"
2,LinearCombinationsTest,TestState.SUCCESS,,
3,LabelLeakingTest,TestState.SUCCESS,,"{'importances': {'LIMIT_BAL': 0.5, 'PAY_AMT4': 0.5, 'PAY_AMT3': 0.5, 'AGE': 0.5, 'PAY_AMT6': 0.5, 'SEX': 0.5, 'EDUCATION': 0.5, 'MARRIAGE': 0.5, 'PAY_AMT5': 0.5002164033758927, 'BILL_AMT6': 0.500324605063839, 'BILL_AMT4': 0.5004328067517854, 'PAY_AMT2': 0.5005104815455379, 'BILL_AMT5': 0.5005410084397317, 'PAY_AMT1': 0.5006963580272367, 'BILL_AMT3': 0.5007574118156243, 'BILL_AMT2': 0.5012872294830634, 'BILL_AMT1': 0.5015202538643211, 'PAY_6': 0.5907281954441032, 'PAY_5': 0.5970740834014303, 'PAY_4': 0.6097540805988488, 'PAY_3': 0.625762903155334, 'PAY_0': 0.642820612884643, 'PAY_2': 0.6474066164734326}}"
4,NoisyLabelsTest,TestState.SUCCESS,,{'rate_issues': 0.18576190476190477}
5,ModelReproducibilityTest,TestState.FAIL,Eval metric different in train dataset when training two times (0.9987 vs 0.9989). The max difference allowed is 0 The model is not reproducible.,
6,TreeCoverageTest,TestState.FAIL,Achieved a coverage of 0.6391809920839384 while the minimum required was 0.7,{'coverage': 0.6391809920839384}


We can see that there are four tests that failed:
- The `SampleLeakingTest` has found that some samples in the training set also appear in the test set
- The `NoDuplicatesTest` has found that the training dataset has some duplicates
- The `ModelReproducibilityTest` has failed because training 2 times has resulted in a different metric
- The `TreeCoverage` fails because when using the test in the resulting model the samples don't cover enough branches

NOTE: Before we have run all the tests and checked the results at the end. Alternatively, you can set the parameter `run_safe` of the suite to `False` in order to raise an Exception when the first test fails

## Trying to correct errors 

Looking at the failed tests, let's try to correct the errors.

First, the we remove the duplicate samples in the training set

In [11]:
df_train = df_train.drop_duplicates()

In [12]:
df_train.shape

(20983, 24)

Now we remove the samples from the test set that are identical to samples in the training set

In [13]:
# Remove test samples that also appear in training set
df_all = pd.concat([df_train, df_test])
is_duplicated = df_all.duplicated(keep=False)
print(df_test.shape)
df_test = df_test.loc[~is_duplicated.iloc[len(df_train):]]
print(df_test.shape)

(9000, 24)
(8981, 24)


When training the model, we will specify a random_state to make the training reproducible and we will set the paramaters max_depth and min_samples_leaf to limit the complexity of the model

In [14]:
pipeline = create_model(num_feats=num_feats, cat_feats=cat_feats, random_state=2020, max_depth=6, min_samples_leaf=5)
unfitted_pipeline = clone_sklearn_model(pipeline)
fitted_pipeline = train_model(pipeline, df_train[num_feats + cat_feats], df_train[label_col])

print(f1_score(df_test[label_col], fitted_pipeline.predict(df_test)))

0.541645633518425


<b> Running only specified tests </b>

In order iterate faster, we can only run some specified tests. This is useful when some of our test take some time to run and we are focused on fixing some failing tests. For that, we can use the `tests_to_run` argument in `run()` method. We can specify either the indices or the names of the test that we want to run. Let's see an example to execute only the test that failed previously. Note that we need to create the test suite each time we run the suite.

In [15]:
test_suite = create_suite(df_train, df_test, unfitted_pipeline, fitted_pipeline, label_col, num_feats, cat_feats, schma_reference)
test_results = test_suite.run(tests_to_run=[0,1,5,6])
test_suite.get_results_as_df()

Unnamed: 0,name,state,error,info
0,SampleLeakingTest,TestState.SUCCESS,,"{'num_duplicated': 0, 'percentage_duplicated': 0.0}"
1,NoDuplicatesTest,TestState.SUCCESS,,"{'num_duplicated': 0, 'index_duplicates': []}"
2,LinearCombinationsTest,TestState.NOT_EXECUTED,,
3,LabelLeakingTest,TestState.NOT_EXECUTED,,
4,NoisyLabelsTest,TestState.NOT_EXECUTED,,
5,ModelReproducibilityTest,TestState.SUCCESS,,
6,TreeCoverageTest,TestState.SUCCESS,,{'coverage': 0.9639066880256307}


Alternatively, in the previous cell you could have specified the name of the tests in `test_to_execute` argument, ie: `test_to_execute=['SampleLeakingTest', 'NoDuplicatesTest', 'ModelReproducibilityTest', 'TreeCoverageTest']`

Now that we see that the previously failing tests are passing, let's execute all the suite again. Again, we create the `TestSuite` object again.

In [16]:
test_suite = create_suite(df_train, df_test, unfitted_pipeline, fitted_pipeline, label_col, num_feats, cat_feats, schma_reference)
test_results = test_suite.run()
test_suite.get_results_as_df()

Unnamed: 0,name,state,error,info
0,SampleLeakingTest,TestState.SUCCESS,,"{'num_duplicated': 0, 'percentage_duplicated': 0.0}"
1,NoDuplicatesTest,TestState.SUCCESS,,"{'num_duplicated': 0, 'index_duplicates': []}"
2,LinearCombinationsTest,TestState.SUCCESS,,
3,LabelLeakingTest,TestState.SUCCESS,,"{'importances': {'LIMIT_BAL': 0.5, 'PAY_AMT4': 0.5, 'PAY_AMT3': 0.5, 'AGE': 0.5, 'PAY_AMT6': 0.5, 'SEX': 0.5, 'EDUCATION': 0.5, 'MARRIAGE': 0.5, 'PAY_AMT5': 0.5002164502164502, 'BILL_AMT6': 0.5003246753246753, 'BILL_AMT4': 0.5004329004329005, 'PAY_AMT2': 0.5005105687972522, 'BILL_AMT5': 0.5005411255411255, 'PAY_AMT1': 0.500696462269829, 'BILL_AMT3': 0.5007575757575757, 'BILL_AMT2': 0.5012872524407695, 'BILL_AMT1': 0.5015202575338247, 'PAY_6': 0.5907241498545049, 'PAY_5': 0.5970736891090739, 'PAY_4': 0.609751503563763, 'PAY_3': 0.6257576088276449, 'PAY_0': 0.6428361169071919, 'PAY_2': 0.6474076571231737}}"
4,NoisyLabelsTest,TestState.SUCCESS,,{'rate_issues': 0.18624600867368823}
5,ModelReproducibilityTest,TestState.SUCCESS,,
6,TreeCoverageTest,TestState.SUCCESS,,{'coverage': 0.9639066880256307}


We see now that all the tests have passed!
