# Dataset Generation

The workshop contains three different notebooks. Each one focuses on a different stage:
    
1. Dataset Generation. The first notebook (this one) focuses on generating a dataset for training the model. We will create a Robust Test Suite to check that the generated dataset meets certain conditions.
2. Model Training. The second notebook focuses on training the model. We will create a Robust Test Suite to check that the trained model meets certain conditions.
3. Model Inference. In the last notebook, we use `mercury.monitoring` to monitor data drift and estimate the performance of the model when the labels are not available.

## Setup

You can install mercury-robust by running:

```
!pip install mercury-robust
```

In [None]:
import pandas as pd
import numpy as np
import os
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(seed=2021)

pd.set_option('display.max_colwidth', None)

SEED = 42

## Load Dataset

We will use the Default of Credit Card Clients dataset from the UCI Machine Learning Repository. The dataset was used in [[1]](#[1]). Note that we will use a slightly modified version which contains a time column.

In [None]:
df = pd.read_csv("data/uci_credit_drifted_historic.csv")

In [None]:
df.head()

## Prepare Dataset For Training

Let's prepare the dataset for training the model. Our label will be the `default.payment.next.month` variable. We select the features that we want to use in our model:

In [None]:
label = 'default.payment.next.month'
features = [c for c in df.columns if c not in [label, 'time', 'id']]
#features = [c for c in df.columns if c not in [label, 'time', 'id', 'WARNING_SENT']]

In [None]:
features

Now, let's define the function that will generate the train and test datasets:

In [None]:
from sklearn.model_selection import train_test_split

def prepare_dataset(df, features, label, test_size=0.3, random_state=42):
    """Keep the selected features and the label, then split into train and test sets."""

    # Only keep features and label
    df = df[features + [label]]

    # Drop duplicates (left disabled here)
    #df = df.drop_duplicates()

    # Split train/test
    df_train, df_test = train_test_split(df, test_size=test_size, random_state=random_state)

    return df, df_train, df_test

In [None]:
df, df_train, df_test = prepare_dataset(df, features, label, test_size=0.3, random_state=SEED)

print(df_train.shape)
print(df_test.shape)
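
As a quick sanity check on the split, the sketch below runs the same `train_test_split` call on a small made-up dataframe (the `toy` data and its column names are assumptions for illustration, not part of the workshop dataset). It prints the resulting shapes and the share of positive labels in each split, which is a simple way to spot an unbalanced split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 20% positive labels
toy = pd.DataFrame({"x": range(100), "y": [0] * 80 + [1] * 20})

toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=42)

# Sizes of each split
print(toy_train.shape, toy_test.shape)

# Share of positive labels in each split
print(toy_train["y"].mean(), toy_test["y"].mean())
```

If the two label rates differ noticeably, passing `stratify=toy["y"]` to `train_test_split` is one option to keep them aligned.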

## Create Data Schema

We now use [mercury.dataschema](https://bbva.github.io/mercury-dataschema/) to create a `DataSchema` which contains the feature types of the dataset. This will be used later when creating the Robust Tests. 

The [`DataSchema`](https://bbva.github.io/mercury-dataschema/reference/dataschema/#mercury.dataschema.schemagen.DataSchema.generate) auto-infers the feature types, but we can also force specific feature types when the auto-inference does not produce the result we want.

In [None]:
from mercury.dataschema import DataSchema
from mercury.dataschema.feature import FeatType

custom_feature_mapping = {
    "PAY_0": FeatType.DISCRETE,
    "PAY_2": FeatType.DISCRETE,
    "PAY_3": FeatType.DISCRETE,
    "PAY_4": FeatType.DISCRETE,
    "PAY_5": FeatType.DISCRETE,
    "PAY_6": FeatType.DISCRETE,
}

schema = DataSchema().generate(df_train, force_types=custom_feature_mapping).calculate_statistics()

In [None]:
schema.feats
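
To decide which columns might be worth passing in `force_types`, one option is to shortlist integer columns with few distinct values. The sketch below uses plain pandas on made-up values; the helper `low_cardinality_int_columns` and the `max_unique` threshold are hypothetical, and this is not how `DataSchema` infers types internally.

```python
import pandas as pd

def low_cardinality_int_columns(frame, max_unique=15):
    """Return integer columns with at most `max_unique` distinct values."""
    return [
        col
        for col in frame.columns
        if pd.api.types.is_integer_dtype(frame[col])
        and frame[col].nunique() <= max_unique
    ]

# Made-up sample values, only for illustration
toy = pd.DataFrame({
    "PAY_0": [-1, 0, 2, 0, 1, -1],
    "LIMIT_BAL": [20000.0, 120000.0, 90000.0, 50000.0, 50000.0, 500000.0],
    "AGE": [24, 26, 34, 37, 57, 37],
})

# Candidates to review manually before forcing a discrete type
print(low_cardinality_int_columns(toy, max_unique=4))
```

The shortlist is only a starting point: whether a column is really discrete still depends on what its values mean.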

## Data Robust Tests

We now use [mercury.robust](https://bbva.github.io/mercury-robust/) to create tests to check that the generated dataset meets certain conditions.

More concretely, we will create the following tests:
1. [LinearCombinationsTest](https://bbva.github.io/mercury-robust/reference/data_tests/#mercury.robust.data_tests.LinearCombinationsTest): Ensures that the dataset doesn't have any linear combination between its numerical columns and that no categorical variable is redundant.
2. [LabelLeakingTest](https://bbva.github.io/mercury-robust/reference/data_tests/#mercury.robust.data_tests.LabelLeakingTest): Ensures the target variable is not being leaked into the predictors.
3. [NoisyLabelsTest](https://bbva.github.io/mercury-robust/reference/data_tests/#mercury.robust.data_tests.NoisyLabelsTest): Checks whether the labels of a dataset contain a high level of noise.
4. [SampleLeakingTest](https://bbva.github.io/mercury-robust/reference/data_tests/#mercury.robust.data_tests.SampleLeakingTest): Checks whether there are samples in the test dataset that are identical to samples in the base/train dataset.
5. [NoDuplicatesTest](https://bbva.github.io/mercury-robust/reference/data_tests/#mercury.robust.data_tests.NoDuplicatesTest): Checks that no duplicated samples are present in a dataframe.
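
To make the last two conditions concrete, the sketch below expresses them in plain pandas on made-up dataframes. It only illustrates what "duplicated samples" and "samples shared between train and test" mean; it is not the implementation used by mercury.robust, and the `toy_train` and `toy_test` names are assumptions for the example.

```python
import pandas as pd

# Made-up data: the row (2, 20) is duplicated in train,
# and the row (3, 30) appears in both train and test
toy_train = pd.DataFrame({"a": [1, 2, 2, 3], "b": [10, 20, 20, 30]})
toy_test = pd.DataFrame({"a": [3, 4], "b": [30, 40]})

# Duplicated rows inside a single dataframe
n_duplicates = int(toy_train.duplicated().sum())

# Rows of the test set that also appear in the train set
overlap = toy_test.merge(
    toy_train.drop_duplicates(), how="inner", on=list(toy_test.columns)
)
n_leaked = len(overlap)

print(n_duplicates, n_leaked)
```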

In [None]:
from mercury.robust.data_tests import (
    LinearCombinationsTest,
    LabelLeakingTest,
    NoisyLabelsTest,
    SampleLeakingTest,
    NoDuplicatesTest
)

We have two options to execute the tests: we can execute one test individually or, alternatively, run a group of tests in a `TestSuite`.

Let's start by running an individual test, the [`LinearCombinationsTest`](https://bbva.github.io/mercury-robust/reference/data_tests/#mercury.robust.data_tests.LinearCombinationsTest):

In [None]:
# LinearCombinationsTest
linear_combinations = LinearCombinationsTest(df[features], dataset_schema=schema)
linear_combinations.run()

If no exception is raised, the test has passed. Let's try another test:

In [None]:
# LabelLeakingTest
label_leaking = LabelLeakingTest(
    df[features + [label]], 
    label_name = label,
    task = "classification",
    dataset_schema=schema,
)
label_leaking.run()

This time the test fails, which indicates that at least one feature is leaking information about the label.
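
To build some intuition for what label leaking looks like, the sketch below creates a synthetic dataset in which one feature is almost a copy of the label and then ranks the features by their absolute correlation with it. The column names `honest` and `leaky` are hypothetical, and this correlation check is a simplification for illustration, not the method used by `LabelLeakingTest`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)

toy = pd.DataFrame({
    "honest": rng.normal(size=500),                 # unrelated to the label
    "leaky": y + rng.normal(scale=0.01, size=500),  # almost a copy of the label
    "label": y,
})

# Absolute correlation of each feature with the label
scores = toy.drop(columns="label").corrwith(toy["label"]).abs()
print(scores.sort_values(ascending=False))
```

A feature that predicts the label almost perfectly on its own, like `leaky` here, is usually worth inspecting before it is used for training.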

## Test Suite

Now we will group several tests in a `TestSuite` and execute them together.

In [None]:
from mercury.robust import TestSuite

def create_suite(df, df_train, df_test, schema, features, label):

    # LinearCombinationsTest
    linear_combinations = LinearCombinationsTest(df[features], dataset_schema=schema)
    
    # LabelLeakingTest
    label_leaking = LabelLeakingTest(
        df[features + [label]], 
        label_name = label,
        task = "classification",
        dataset_schema=schema,
    )
    
    # Noisy Labels
    noisy_labels = NoisyLabelsTest(
        base_dataset=df[features + [label]],
        label_name=label,
        calculate_idx_issues=True,
        threshold = 0.2,
        dataset_schema=schema,
        label_issues_args={"clf": LogisticRegression(solver='liblinear')}
    )
    
    # SampleLeaking
    sample_leaking = SampleLeakingTest(
        base_dataset=df_train[features + [label]], 
        test_dataset=df_test[features + [label]]
    )
    
    # NoDuplicates
    no_dups = NoDuplicatesTest(df_train)
    
    # Create Suite
    test_suite = TestSuite(
        tests=[
            linear_combinations,
            label_leaking,
            noisy_labels,
            sample_leaking,
            no_dups
        ]
    )
    
    return test_suite


In [None]:
test_suite = create_suite(df, df_train, df_test, schema, features, label)
test_results = test_suite.run()
test_suite.get_results_as_df()

## Save Dataset and Data Schema

Let's save our generated dataset and the `DataSchema`:

In [None]:
path_dataset = "./dataset/"

# Create the output directory if it does not exist yet
os.makedirs(path_dataset, exist_ok=True)

df.to_csv(path_dataset + "all.csv", index=False)
df_train.to_csv(path_dataset + "train.csv", index=False)
df_test.to_csv(path_dataset + "test.csv", index=False)

schema.save(path_dataset + "schema.json")
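
Since the next notebooks read these files back, it can be useful to confirm that a CSV round trip preserves the data. The sketch below does this for a small made-up dataframe written to a temporary directory; the `toy` data are an assumption for illustration, and the same comparison could be applied to the saved train and test files.

```python
import os
import tempfile

import pandas as pd

# Made-up data, only for illustration
toy = pd.DataFrame({"PAY_0": [0, 1, -1], "label": [0, 1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises an AssertionError if values, columns, or dtypes differ
pd.testing.assert_frame_equal(toy, reloaded)
print("round trip OK")
```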

## References

<a id="[1]">[1]</a>
Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480. https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients