# FinAI Summit 2025 - Creating and accelerating ML Solutions with Mercury

[Mercury](https://www.bbvaaifactory.com/mercury/) is a modular Python library for Machine Learning and Data Science, developed at **BBVA**. It provides a comprehensive suite of tools designed to streamline and accelerate the creation of ML models, saving valuable development time while offering advanced Data Science functionality. Initially developed as [Inner Source](https://www.bbvaaifactory.com/mercury-acelerando-la-reutilizacion-en-ciencia-de-datos-dentro-de-bbva/), several modules of Mercury have been released as open source.

This notebook demonstrates how easily you can use Mercury to enhance your ML workflows. Whether you're validating datasets, running robust tests, explaining models, or monitoring data drift, Mercury provides an effective solution for each task. In this notebook, we will explore the following modules:

- **mercury-dataschema**: Ensures consistency by validating whether different datasets conform to the same schema.
- **mercury-robust**: Provides robust testing for ML models and datasets, ensuring reliability.
- **mercury-explainability**: Offers tools to interpret ML models, helping you understand model decisions.
- **mercury-monitoring**: Monitors models and data in production environments, detecting issues such as data drift.

## Why Use Mercury?

By leveraging Mercury, you'll experience:

- **Ease of use**: Its intuitive modular design allows seamless integration into your projects.
- **Faster development**: Pre-built components let you focus on building models rather than on redundant tasks.
- **Advanced functionality**: Gain access to tools for [schema validation](https://bbva.github.io/mercury-dataschema), [model explainability](https://bbva.github.io/mercury-explainability), [event sequence analysis](https://bbva.github.io/mercury-reels), [subset querying](https://bbva.github.io/mercury-settrie), [monitoring](https://bbva.github.io/mercury-monitoring) of models and data, and [robust testing](https://bbva.github.io/mercury-robust).

## Try it Yourself!

Explore the code below and modify it to fit your specific use cases. Whether you're validating data or explaining models, you'll see how Mercury simplifies your workflow.

Let’s dive in!

## Install Mercury and Setup

The next cell installs the Mercury libraries that we will use. Additionally, we install alibi, which is used by mercury-explainability.

You may need to restart the kernel after the installation.

In [None]:
# Suppressing error messages as they are related to dependency warnings of packages preinstalled by Kaggle in the environment and not used in the notebook
!pip install -U mercury-dataschema mercury-robust mercury-monitoring mercury-explainability alibi 2>/dev/null
#!pip install -U ipywidgets

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score


random_state = 232
n_sample = 100000

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 150)

## Load Data

In [None]:
df_url = "https://raw.githubusercontent.com/BBVA/mercury/refs/heads/master/src/data/finai_summit_2025/dataset.csv"
df = pd.read_csv(df_url)

## Preprocessing

Now, we apply some basic preprocessing: we drop the features with a high fraction of nulls and the columns that are not features.

In [None]:
def apply_preprocessing(df):
    # Clean features with high number of nulls
    threshold_nulls = 0.25

    for f in df.columns:
        percent_nulls_f = df[f].isnull().sum() / len(df)
        if percent_nulls_f > threshold_nulls:
            df = df.drop(f, axis=1)


    return df

df = apply_preprocessing(df)

# Drop non-feature columns
df = df.drop(['id', 'time'], axis=1)

label_col = "target"
feature_cols = [c for c in df.columns if c!=label_col]
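
The column filter in `apply_preprocessing` can also be written without an explicit loop. The following is a minimal sketch on a toy frame; the column names `a`, `b`, `c` and their values are invented for illustration and are not part of the dataset. It relies on `isnull().mean()`, which gives the fraction of nulls per column.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],        # 0% nulls
    "b": [np.nan, np.nan, 3.0, 4.0],  # 50% nulls
    "c": [1.0, np.nan, 3.0, 4.0],     # 25% nulls
})

threshold_nulls = 0.25

# Fraction of nulls per column
null_fraction = toy.isnull().mean()

# Keep the columns whose null fraction does not exceed the threshold
# (same rule as the loop: a column is dropped only if its fraction is > threshold)
kept = toy.loc[:, null_fraction <= threshold_nulls]
kept_columns = list(kept.columns)
print(kept_columns)
```

Column `c` sits exactly on the threshold and is kept, because the rule drops only columns strictly above it.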

### Train / Test Split

We split the dataset into train and test datasets.

In [None]:
df = df.sample(n=n_sample, random_state=random_state)

df_train, df_test = train_test_split(df, test_size=0.25, random_state=random_state)

## Mercury-Dataschema

[Mercury-dataschema](https://bbva.github.io/mercury-dataschema/) is a utility tool that can auto-infer feature types and calculate different statistics. It is also used in the mercury-robust submodule. In this case, we will create a dataschema to use it later when creating the robust tests with mercury-robust.

Because we have a description of the features, we will specify the feature types manually instead of using auto-inference.


In [None]:
from mercury.dataschema.schemagen import DataSchema

categorical_feats = ['type_1']
discrete_feats = ['num_past_requests_1', 'num_past_requests_2', 'count_1', 'count_2', 'count_3', 'count_4', 'count_5', 'num_1', 'num_2', 'num_3', 'num_4']
binary_feats = ['type_2']
continuos_feats = ['payment_1', 'payment_2', 'quantity_1', 'pendent_quantity_1', 'pendent_quanitty_2', 'quantity_2', 'rate_1', 'rate_2', 'quantity_3', 'quantity_4', 'quantity_5', 'amount_1']

schema = DataSchema().generate_manual(
    df_train,
    categ_columns=categorical_feats,
    discrete_columns=discrete_feats,
    binary_columns=binary_feats
)

## Further preprocessing

The next cell transforms the categorical features. We will just use an `OrdinalEncoder` for simplicity, but other approaches might be more appropriate in this case.

In [None]:
encoders = {}
for c in df_train[categorical_feats + binary_feats]:
    encoders[c] = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan).fit(df_train[c].values.reshape(-1, 1))
    df_train[c] = encoders[c].transform(df_train[c].values.reshape(-1, 1))
    df_test[c] = encoders[c].transform(df_test[c].values.reshape(-1, 1))
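
With `handle_unknown='use_encoded_value'` and `unknown_value=np.nan`, a category that appears only at transform time is encoded as `NaN` instead of raising an error. A small sketch of that behaviour, using made-up category values (`"A"`, `"B"`, `"C"`, `"Z"`) that are not taken from the dataset:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Made-up categories for illustration
train_values = np.array([["A"], ["B"], ["C"]], dtype=object)
test_values = np.array([["B"], ["Z"]], dtype=object)  # "Z" was never seen during fit

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan)
encoder.fit(train_values)

# Known categories get their ordinal code; unseen ones become NaN
encoded = encoder.transform(test_values)
print(encoded)
```

This means the encoded test set can contain new nulls, so any null handling has to run after the encoding step.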

### Replace nulls

Now we fill in the remaining null values. Once again, we're using a simple method, but this could be replaced by a more sophisticated approach.

In [None]:
for f in schema.categorical_feats:
    df_train[f] = df_train[f].fillna(df_train[f].mode().values[0])
    df_test[f] = df_test[f].fillna(df_train[f].mode().values[0])

for f in schema.binary_feats:
    df_train[f] = df_train[f].fillna(df_train[f].mode().values[0])
    df_test[f] = df_test[f].fillna(df_train[f].mode().values[0])

for f in schema.discrete_feats:
    df_train[f] = df_train[f].fillna(df_train[f].median())
    df_test[f] = df_test[f].fillna(df_train[f].median())

for f in schema.continuous_feats:
    df_train[f] = df_train[f].fillna(df_train[f].mean())
    df_test[f] = df_test[f].fillna(df_train[f].mean())
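
Note that the fill values for the test set come from the training set, so no test statistics are used. One way to make this explicit is to compute the fill values once and reuse them for both splits. The sketch below uses toy frames with hypothetical columns `count` and `amount`:

```python
import numpy as np
import pandas as pd

# Toy frames with hypothetical columns, for illustration only
train = pd.DataFrame({"count": [1.0, 2.0, np.nan, 10.0], "amount": [1.0, np.nan, 3.0, 5.0]})
test = pd.DataFrame({"count": [np.nan, 4.0], "amount": [np.nan, 2.0]})

# Statistics are computed on the training split only
fill_values = {
    "count": train["count"].median(),  # median for a discrete feature
    "amount": train["amount"].mean(),  # mean for a continuous feature
}

# The same values are applied to both splits
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)
print(test_filled)
```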

## Robust Test Dataset

Let's now test the dataset we created using [mercury-robust](https://bbva.github.io/mercury-robust/). mercury-robust lets us create test suites that check different aspects of the dataset.
For example, you can create and run a `TestSuite` with a single test using the following code:

```python
from mercury.robust.data_tests import LinearCombinationsTest
from mercury.robust import TestSuite

linear_comb_test = LinearCombinationsTest(df_train, dataset_schema=schema)

test_suite = TestSuite(tests=[linear_comb_test])
test_suite.run()

test_suite.get_results_as_df()
```


In the next cell, create a `TestSuite` with the following tests:

- LinearCombinationsTest: Check if linear combinations exist in our dataset.
- LabelLeakingTest: Check that no feature leaks information about the target variable.
- NoisyLabelsTest: Guarantee that the labels in our dataset have a minimum quality.
- NoDuplicatesTest: Check that we do not have repeated samples in the dataset.
- SampleLeakingTest: Check that our test dataset does not contain samples that are already included in the training dataset.

Then run the test suite and check which tests are failing. You can check the [documentation](https://bbva.github.io/mercury-robust/site/reference/data_tests/) for details on each test.
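
To get an intuition for what the duplicate and sample-leaking checks look for, here is a rough plain-pandas approximation. It is not mercury-robust's implementation, and the toy columns `x1` and `x2` are invented for illustration.

```python
import pandas as pd

# Toy splits, invented for illustration
train = pd.DataFrame({"x1": [1, 2, 2, 3], "x2": [10, 20, 20, 30]})
test = pd.DataFrame({"x1": [3, 4], "x2": [30, 40]})

# Repeated rows inside the training set (each repeat after the first counts once)
n_duplicates = int(train.duplicated().sum())

# Test rows that also appear in the training set
leaked = test.merge(train.drop_duplicates(), how="inner", on=list(test.columns))
n_leaked = len(leaked)

print(n_duplicates, n_leaked)
```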

In [None]:
# WRITE YOUR CODE HERE

Which tests are failing?

## Train the model

Let's train a model. In this case, we'll train a Decision Tree, leaving all the parameters at their default values.

In [None]:
num_feats = discrete_feats + continuos_feats
cat_feats = categorical_feats + binary_feats

model_1 = DecisionTreeClassifier()

model_1 = model_1.fit(df_train[num_feats + cat_feats], df_train[label_col])

## Robust Model Tests

We can also create test suites to test our models. In the next cell, create a test suite with:

- ModelSimplicityChecker: To check if a simpler model gives similar or better performance.
- TreeCoverageTest: To check that when we apply a test set to the trained decision tree, we cover at least a percentage of the branches.

You can check the [documentation](https://bbva.github.io/mercury-robust/site/reference/model_tests/) for details.
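
The idea behind a model simplicity check can be sketched with scikit-learn alone: train a simple baseline and compare its score with the model under test. The snippet below runs on synthetic data from `make_classification`. The choice of logistic regression as baseline and ROC AUC as metric is an assumption made for this illustration, and `ModelSimplicityChecker` may use a different baseline, metric, or tolerance.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Model under test and a simpler baseline
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

auc_tree = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])
auc_baseline = roc_auc_score(y_te, baseline.predict_proba(X_te)[:, 1])

# If the baseline is at least as good, the more complex model is hard to justify
simpler_is_as_good = bool(auc_baseline >= auc_tree)
print(f"tree AUC: {auc_tree:.3f}, baseline AUC: {auc_baseline:.3f}")
```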

In [None]:
# WRITE YOUR CODE HERE TO CREATE AND RUN A TEST SUITE

Which tests are failing?

#### Train a second model

In the next cell, try to train a new model that passes the tests that failed before. Then run the test suite again.

In [None]:
# WRITE YOUR CODE TO TRAIN A NEW MODEL


In [None]:
# WRITE YOUR CODE TO CREATE AND RUN A TEST SUITE FOR THE NEW MODEL


## Mercury-Explainability

[Mercury-explainability](https://bbva.github.io/mercury-explainability/) provides several tools to help interpret ML models. Let's apply a couple of these components to gain deeper insights into how our model makes predictions and which features are most influential. Understanding the inner workings of the model will allow us to identify potential biases, assess the importance of features, and ensure the model's decisions align with expectations.

### CounterFactualExplainer

First, we will use the `CounterFactualExplainer`, which is a local explainability method, i.e., it tries to explain individual predictions. For a given instance, this method looks for the changes in the inputs that are needed for the model to predict an output that we define, instead of its actual prediction.
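
The following toy sketch illustrates the idea of a counterfactual search. It is not the algorithm used by mercury-explainability: the stand-in model `predict_proba_positive` and the helper `find_counterfactual` are hypothetical, and the search moves only one feature in fixed steps.

```python
import math

def predict_proba_positive(x):
    """Hypothetical stand-in model: logistic function of a weighted sum."""
    score = 1.5 * x[0] - 1.0 * x[1]
    return 1.0 / (1.0 + math.exp(-score))

def find_counterfactual(x, feature_idx, step, threshold, max_iter=1000):
    """Move one feature in fixed steps until P(class 1) reaches the threshold."""
    candidate = list(x)
    for _ in range(max_iter):
        if predict_proba_positive(candidate) >= threshold:
            return candidate
        candidate[feature_idx] += step
    return None  # no counterfactual found within max_iter steps

instance = [0.0, 1.0]
counterfactual = find_counterfactual(instance, feature_idx=0, step=0.1, threshold=0.7)

print("original probability:", round(predict_proba_positive(instance), 3))
print("counterfactual point:", counterfactual)
```

The difference between `instance` and `counterfactual` is the explanation: it tells us how far the first feature has to move for the prediction to reach the requested probability.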


To begin, we train a new model:

In [None]:
model_3 = DecisionTreeClassifier(class_weight='balanced', max_depth=6)
num_feats_short = num_feats[0:5]
cat_feats_short = cat_feats[0:5]


model_3 = model_3.fit(df_train[num_feats_short + cat_feats_short].to_numpy(), df_train[label_col])

In the previous cell, we limited the features to the first 5 of each list (`num_feats[0:5]` for numerical features and `cat_feats[0:5]` for categorical features; since `cat_feats` has only two entries, both are kept). This simplifies the model's graphical representation, making its structure easier to visualize and understand. Working with a smaller feature set also helps produce more interpretable visual outputs.

In the next cell, create a `CounterFactualExplainerBasic` object. You can do it as follows:
```python
from mercury.explainability import CounterFactualExplainerBasic

counterfactual_basic = CounterFactualExplainerBasic(
  your_training_dataset,
  your_predict_method
)
```
You can check the [documentation](https://bbva.github.io/mercury-explainability/site/reference/explainers/#mercury.explainability.explainers.counter_fact_basic.CounterFactualExplainerBasic) for further details.

In [None]:
# WRITE YOUR CODE HERE TO CREATE THE CounterFactualExplainerBasic

Let's check the prediction for the instance at position 2 of the test set (the third row):

In [None]:
model_3.predict_proba([df_test[num_feats_short + cat_feats_short].iloc[2].values])

Now, let's use our explainer on that same instance. We want to know which changes we need in its inputs in order to change the prediction:
```python
explanation = counterfactual_basic.explain(
  our_instance_to_explain,
  threshold=...,  # probability that we want our instance to reach
  class_idx=...,  # class to which the probability in `threshold` refers
  keep_explored_points=False
)
```

Then, you can show the explanation with `explanation.show()`.

In [None]:
# WRITE YOUR CODE HERE

Which variables need to change in order to change the class prediction?

Now, let's create another model for the next explainer:

In [None]:
model_4 = DecisionTreeClassifier(
    class_weight='balanced', max_depth=6
)

model_4 = model_4.fit(df_train[num_feats + cat_feats].to_numpy(), df_train[label_col])

### ALE Plots

ALE plots show how model inputs affect the prediction on average. ALE Plots tend to be more reliable than Partial Dependence Plots in cases with correlations between different inputs.
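
To make the idea concrete, here is a simplified first-order ALE computation for a single numeric feature in NumPy. It is a sketch and not the `ALEExplainer` implementation: the helper `first_order_ale` and the stand-in `linear_model` are hypothetical, and the centering is a plain unweighted mean rather than the data-weighted one used in the usual definition.

```python
import numpy as np

def first_order_ale(predict, X, feature_idx, n_bins=5):
    """Return bin edges and centered ALE values for one numeric feature."""
    x = X[:, feature_idx]
    # Quantile-based edges so that every bin holds a similar number of rows
    edges = np.unique(np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)))
    # Assign each row to a bin index in [0, len(edges) - 2]
    bin_idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(edges) - 2)

    local_effects = np.zeros(len(edges) - 1)
    for b in range(len(edges) - 1):
        rows = X[bin_idx == b]
        if len(rows) == 0:
            continue
        lower, upper = rows.copy(), rows.copy()
        lower[:, feature_idx] = edges[b]
        upper[:, feature_idx] = edges[b + 1]
        # Average prediction change when moving across the bin,
        # using only the rows that actually fall in the bin
        local_effects[b] = np.mean(predict(upper) - predict(lower))

    # Accumulate the local effects, then center them (unweighted, for simplicity)
    ale = np.concatenate([[0.0], np.cumsum(local_effects)])
    return edges, ale - ale.mean()

# Two strongly correlated synthetic features
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = x0 + rng.normal(scale=0.1, size=500)
X = np.column_stack([x0, x1])

def linear_model(data):
    """Hypothetical stand-in for a fitted model's prediction function."""
    return 2.0 * data[:, 0] + 1.0 * data[:, 1]

edges, ale = first_order_ale(linear_model, X, feature_idx=0)
print(np.round(edges, 2))
print(np.round(ale, 2))
```

Because each local effect is computed only from rows that fall in the bin, the other features keep realistic values. For this linear stand-in model, the ALE curve for the first feature has slope 2, its own coefficient, even though the two features are strongly correlated.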

In the next cell, use the `ALEExplainer` to create the ALE plots. Check the documentation [here](https://bbva.github.io/mercury-explainability/site/reference/explainers/#mercury.explainability.explainers.ale.ALEExplainer).


In [None]:
# WRITE YOUR CODE HERE TO CREATE THE ALEExplainer

Once the `ALEExplainer` is created, call its `explain` method to obtain the explanation.

In [None]:
# WRITE YOUR CODE HERE TO OBTAIN THE EXPLANATION

Now let's show the ALE plots. By selecting `targets=[1]`, we ask the plots to show how the probability of default changes depending on the feature values.

Finally, use the [`plot_ale`](https://bbva.github.io/mercury-explainability/site/reference/explainers/#mercury.explainability.explainers.ale.plot_ale) function to show the plots.

In [None]:
# WRITE YOUR CODE HERE TO SHOW THE PLOTS

How are the different features impacting the model output?

## Mercury-monitoring

[Mercury-monitoring](https://bbva.github.io/mercury-monitoring/) allows us to monitor data and model drift. In this case, we will use it to detect if there are changes in the input distribution.

Let's read the dataset again and apply the same preprocessing, this time keeping the `time` column:

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/BBVA/mercury/refs/heads/master/src/data/finai_summit_2025/dataset.csv")
df = apply_preprocessing(df)

for f in schema.categorical_feats:
    df[f] = df[f].fillna(df[f].mode().values[0])

for f in schema.binary_feats:
    df[f] = df[f].fillna(df[f].mode().values[0])

for f in schema.discrete_feats:
    df[f] = df[f].fillna(df[f].median())

for f in schema.continuous_feats:
    df[f] = df[f].fillna(df[f].mean())


In [None]:
df.head()

We will first check whether there is drift between time 21 and time 90:

In [None]:
df_time_21 = df[df["time"] == 21]
df_time_90 = df[df["time"] == 90]

### KS Drift

Let's check if we have data drift between time 21 and time 90. We will use [`KSDrift`](https://bbva.github.io/mercury-monitoring/site/reference/drift/#mercury.monitoring.drift.ks_drift_detector.KSDrift), which detects drift by calculating the Kolmogorov-Smirnov (KS) statistic. A KS test is performed for each feature in the datasets. The method `calculate_drift()` returns a dictionary with different metrics. The key `drift_detected` is a boolean indicating whether drift was detected. The key `score` contains the average of the KS statistics computed over all features, and it can be used as a measure to track drift.
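
For intuition, the two-sample KS statistic is the largest gap between the empirical cumulative distribution functions of the two samples. The pure-Python sketch below computes it per feature and averages the results. The helper `ks_statistic` and the toy features `f1` and `f2` are made up for illustration, and the sketch does not compute p-values, which a drift decision would normally also rely on.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The largest gap is always reached at one of the observed values
    for value in a + b:
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

# Toy source and target data: f1 is unchanged, f2 is completely shifted
source = {"f1": [1, 2, 3, 4, 5], "f2": [1, 2, 3, 4, 5]}
target = {"f1": [1, 2, 3, 4, 5], "f2": [6, 7, 8, 9, 10]}

per_feature = {name: ks_statistic(source[name], target[name]) for name in source}
score = sum(per_feature.values()) / len(per_feature)

print(per_feature)
print(score)
```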

Write the code in the next cell to calculate the data drift with the `KSDrift` detector. You can see more details in the [documentation](https://bbva.github.io/mercury-monitoring/site/reference/drift/#mercury.monitoring.drift.ks_drift_detector.KSDrift).

In [None]:
X_source = df_time_21[num_feats].to_numpy()
X_target = df_time_90[num_feats].to_numpy()

# WRITE YOUR CODE HERE TO CALCULATE THE DATA DRIFT WITH KSDrift

Do we have data drift? What is the drift score?

If we want to know which features have data drift, we can use the [`get_drifted_features`](https://bbva.github.io/mercury-monitoring/site/reference/drift/#mercury.monitoring.drift.base.BaseBatchDriftDetector.get_drifted_features) method. Use the next cell to obtain the features with drift.


In [None]:
# WRITE YOUR CODE HERE TO GET THE DRIFTED FEATURES

With the method [`plot_feature_drift_scores`](https://bbva.github.io/mercury-monitoring/site/reference/drift/#mercury.monitoring.drift.base.BaseBatchDriftDetector.plot_feature_drift_scores) we can easily create a plot with the drift score for each feature.

Use the next cell to plot the feature drift scores

In [None]:
# WRITE YOUR CODE HERE TO PLOT THE FEATURE DRIFT SCORES

Which features have the highest drift?

### DomainClassifier Drift

Now, we will use the [`DomainClassifierDrift`](https://bbva.github.io/mercury-monitoring/site/reference/drift/#mercury.monitoring.drift.domain_classifier_drift_detector). It works similarly to `KSDrift`, but in this case it trains a classifier (Random Forest) to distinguish between a source dataset and a target dataset. The better the classifier performs, the higher the data drift between both datasets.
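
The principle can be sketched with scikit-learn alone: label the rows by their dataset of origin and measure how well a classifier separates them. This is an illustration on synthetic data and not the `DomainClassifierDrift` implementation. The helper `domain_classifier_auc`, the use of cross-validated ROC AUC, and the forest size are assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def domain_classifier_auc(X_a, X_b):
    """Cross-validated AUC of a classifier that tells dataset A from dataset B."""
    X_all = np.vstack([X_a, X_b])
    # Domain label: 0 for rows from A, 1 for rows from B
    domain = np.concatenate([np.zeros(len(X_a), dtype=int), np.ones(len(X_b), dtype=int)])
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, X_all, domain, cv=5, scoring="roc_auc").mean()

rng = np.random.default_rng(0)
X_src = rng.normal(loc=0.0, size=(300, 3))
X_same = rng.normal(loc=0.0, size=(300, 3))     # same distribution as the source
X_shifted = rng.normal(loc=3.0, size=(300, 3))  # mean shifted in every feature

auc_no_drift = domain_classifier_auc(X_src, X_same)
auc_drift = domain_classifier_auc(X_src, X_shifted)

print(f"no drift: {auc_no_drift:.3f}, drift: {auc_drift:.3f}")
```

An AUC close to 0.5 means the classifier cannot tell the two datasets apart, while an AUC close to 1 indicates that they are easy to separate, which is a sign of drift.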

Write the code in the next cell to calculate the data drift with the `DomainClassifierDrift` detector.

In [None]:
X_source = df_time_21[num_feats].dropna().to_numpy()
X_target = df_time_90[num_feats].dropna().to_numpy()

# WRITE YOUR CODE HERE TO CALCULATE THE DATA DRIFT WITH DomainClassifierDrift

We can see that drift was detected.

### Drift over time

Now, let's use `KSDrift` to track drift over time. We will treat the data up to and including time 70 as our initial dataset, and calculate the drift for each of the following times.

In [None]:
df_train = df[df["time"] <= 70].copy()
df_inference = df[df["time"] > 70].copy()

print(len(df_train))
print(len(df_inference))

Then, you can calculate and save the drift score at each time `t` and plot the scores later.

In [None]:
drift_scores = []

for t in df_inference["time"].sort_values().unique():

    X_source = df_train[num_feats].to_numpy()
    X_target = df_inference[df_inference["time"] == t][num_feats].to_numpy()

    # WRITE THE CODE HERE TO CALCULATE THE DRIFT WITH THE KSDRIFT AND KEEP IT IN drift_scores


In [None]:
# WRITE YOUR CODE HERE TO PLOT THE DATA DRIFT OVER TIME

Is the data drift increasing over time?