# Getting started with TabFormerLite

In this tutorial, we will show how to use the library to accomplish any of the following tasks:

- Pre-processing input data to be compatible with the model.
- Pre-training the model through a masked language modeling (MLM) task.
- Extracting embeddings from pre-trained models using various pooling strategies. These embeddings can be used as input to simpler machine-learning models to perform downstream tasks.
- Fine-tuning the model on a binary classification task.

In this tutorial, we will use the transaction dataset, a synthetic corpus for credit card transactions that has been made available by the authors of the paper [Tabular Transformers for modeling multivariate time series](https://arxiv.org/abs/2011.01843). The card transaction dataset can be downloaded from this [link](https://ibm.ent.box.com/v/tabformer-data).

## 0. Data loading and cleaning

Let's start by loading the transaction dataset in the notebook.

In [1]:
# Load dependencies

import os
import numpy as np
import pandas as pd
from glob import glob

In [2]:
# Path to dataset
data_path = "./data/card_transaction.v1.csv"

# Load data
df = pd.read_csv(data_path)

print(f"Data shape: {df.shape}\n")

# Show a few rows
df.head()

Data shape: (16897222, 15)



Unnamed: 0,User,Card,Year,Month,Day,Time,Amount,Use Chip,Merchant Name,Merchant City,Merchant State,Zip,MCC,Errors?,Is Fraud?
0,0,0,2002,9,1.0,06:21,$134.09,Swipe Transaction,3.527213e+18,La Verne,CA,91750.0,5300.0,,No
1,0,0,2002,9,1.0,06:42,$38.48,Swipe Transaction,-7.276121e+17,Monterey Park,CA,91754.0,5411.0,,No
2,0,0,2002,9,2.0,06:22,$120.34,Swipe Transaction,-7.276121e+17,Monterey Park,CA,91754.0,5411.0,,No
3,0,0,2002,9,2.0,17:45,$128.95,Swipe Transaction,3.414527e+18,Monterey Park,CA,91754.0,5651.0,,No
4,0,0,2002,9,3.0,06:23,$104.71,Swipe Transaction,5.817218e+18,La Verne,CA,91750.0,5912.0,,No


In [3]:
# Count unique users
len(df["User"].unique())

1375

In [4]:
# Show year span in data
print(f"Earliest year in data: {df['Year'].min()}")
print(f"Latest year in data: {df['Year'].max()}")

Earliest year in data: 1991
Latest year in data: 2020


In [5]:
# Show class imbalance in target label
df["Is Fraud?"].value_counts()

No     16876654
Yes       20567
Name: Is Fraud?, dtype: int64

The transaction dataset has approximately 16.9M transactions from 1,375 users, spanning the years 1991 to 2020. Each transaction has 15 columns of categorical, continuous, and discrete variables, including the label column `Is Fraud?`, in which 20,567 samples are labeled fraudulent.

Before feeding our data through the script for encoding raw tabular data, we need to address a couple of issues in the data, such as missing values, duplicated values, and uninformative characters (e.g., the `$` in the `Amount` column).

The cleaning steps implemented below were inspired by the work in this [repository](https://github.com/IBM/TabFormer/blob/main/README.md) and are specific to the card transactions dataset. You may have to adapt these steps to your particular dataset.

In [6]:
# Drop duplicated entries

print(f"Data shape before dropping duplicates: {df.shape}")

df.drop_duplicates(inplace=True)
print(f"Data shape after dropping duplicates: {df.shape}")

Data shape before dropping duplicates: (16897222, 15)
Data shape after dropping duplicates: (16897180, 15)


In [7]:
# Check the data for missing values

df.isna().sum().sort_values(ascending=False)

Errors?           16626102
Zip                2064608
Merchant State     1952676
Day                      1
Time                     1
Amount                   1
Use Chip                 1
Merchant Name            1
Merchant City            1
MCC                      1
Is Fraud?                1
User                     0
Card                     0
Year                     0
Month                    0
dtype: int64

In [8]:
# We drop the row where the target label is missing

df = df.loc[~df["Is Fraud?"].isna()]

print(f"Data shape: {df.shape}")

Data shape: (16897179, 15)


In [9]:
# Address missing values in the other columns
# (assigning the result back avoids relying on in-place edits of a slice)

df = df.fillna(
    {"Errors?": "None", "Zip": 0, "Merchant State": "None", "Use Chip": "None"}
)

In [10]:
# Check again for missing values

df.isna().sum().sort_values(ascending=False)

User              0
Card              0
Year              0
Month             0
Day               0
Time              0
Amount            0
Use Chip          0
Merchant Name     0
Merchant City     0
Merchant State    0
Zip               0
MCC               0
Errors?           0
Is Fraud?         0
dtype: int64

In [11]:
# Apply integer encoding to target column

df["Is Fraud?"] = (df["Is Fraud?"] == "Yes").astype(int)

**The Amount column** 

Below, we apply the following cleaning steps to the column `Amount`:

- remove the prefix "$"
- convert to float data type
- set to zero any negative values in the Amount column
- apply log-transformation

In [12]:
# Show a few samples from the "Amount" column

df["Amount"][0:5]

0    $134.09
1     $38.48
2    $120.34
3    $128.95
4    $104.71
Name: Amount, dtype: object

In [13]:
# Remove the literal "$" prefix and convert to float
# (regex=False: as a regular expression, "$" would match the end of the string)
df["Amount"] = df["Amount"].str.replace("$", "", regex=False).astype("float")

# Set any negative amounts to zero
df["Amount"] = df["Amount"].clip(lower=0)

# Apply log-transformation
df["Amount"] = np.log1p(df["Amount"])

In [14]:
# Show a few samples from the "Amount" column after cleaning

df["Amount"][0:5]

0    4.905941
1    3.675794
2    4.798597
3    4.867150
4    4.660699
Name: Amount, dtype: float64

**Datetime columns**

Below, we combine all datetime columns (`Year`, `Month`, `Day` and `Time`) into a column that we will call "TimeStamp", representing the number of nanoseconds since the beginning of the Unix epoch on January 1, 1970.

In [15]:
# Unify datetime columns in a single "TimeStamp" column


def timeEncoder(X):
    # Split "HH:MM" strings into separate hour and minute columns
    X_hm = X["Time"].str.split(":", expand=True)
    d = pd.to_datetime(
        dict(
            year=X["Year"], month=X["Month"], day=X["Day"], hour=X_hm[0], minute=X_hm[1]
        )
    ).astype("int64")  # nanoseconds since the Unix epoch
    return d


df["TimeStamp"] = timeEncoder(df[["Year", "Month", "Day", "Time"]])
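
As a quick sanity check of the nanosecond encoding, the sketch below converts the date and time of the first transaction in the dataset (2002-09-01 at 06:21) by hand. It is only an illustration and does not replace the `timeEncoder` function above.

```python
import pandas as pd

# First transaction in the dataset: 2002-09-01 at 06:21
ts = pd.Timestamp(year=2002, month=9, day=1, hour=6, minute=21)

# .value is the number of nanoseconds since 1970-01-01 00:00:00
print(ts.value)  # 1030861260000000000

# Converting the integer back recovers the original datetime
assert pd.Timestamp(ts.value) == ts
```

The printed integer should match the `TimeStamp` value of the first row after encoding.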

We will also combine all datetime columns (`Year`, `Month`, `Day`, and `Time`) into a column that we will call "event_dt" to represent the date and time of each transaction.

Note that the columns `User` and `event_dt` will not be fed to the model. These columns will be used to sort the transactions by the user and the date.

In [16]:
# Create event_dt column

df["event_dt"] = pd.to_datetime(
    df["Year"].astype("str")
    + "-"
    + df["Month"].astype("str")
    + "-"
    + df["Day"].astype("int").astype("str")
    + " "
    + df["Time"],
    format="%Y-%m-%d %H:%M",
)

In [17]:
# Show a few rows
df.head(3)

Unnamed: 0,User,Card,Year,Month,Day,Time,Amount,Use Chip,Merchant Name,Merchant City,Merchant State,Zip,MCC,Errors?,Is Fraud?,TimeStamp,event_dt
0,0,0,2002,9,1.0,06:21,4.905941,Swipe Transaction,3.527213e+18,La Verne,CA,91750.0,5300.0,,0,1030861260000000000,2002-09-01 06:21:00
1,0,0,2002,9,1.0,06:42,3.675794,Swipe Transaction,-7.276121e+17,Monterey Park,CA,91754.0,5411.0,,0,1030862520000000000,2002-09-01 06:42:00
2,0,0,2002,9,2.0,06:22,4.798597,Swipe Transaction,-7.276121e+17,Monterey Park,CA,91754.0,5411.0,,0,1030947720000000000,2002-09-02 06:22:00


After creating the columns "TimeStamp" and "event_dt", the original datetime columns (`Year`, `Month`, `Day`, and `Time`) are not useful and can be removed. Below, we create a list of columns we want to keep for encoding the data.

In [18]:
# List of columns to keep

columns_to_keep = [
    "User",
    "event_dt",
    "Card",
    "TimeStamp",
    "Amount",
    "Use Chip",
    "Merchant Name",
    "Merchant City",
    "Merchant State",
    "Zip",
    "MCC",
    "Errors?",
    "Is Fraud?",
]

### Data splitting

So far, we addressed missing and duplicated values, cleaned the column 'Amount', and created two new columns ("TimeStamp" and "event_dt") from the DateTime features that were initially available in the data. 

We will now split the dataset into:
- data to use to pre-train the model (we call this data: "pretraining_data")
- data for extracting embeddings or fine-tuning the model (we call this data: "inference_data")

To prevent data leakage between the pre-training data and the inference data, we use all data before 2019 to pre-train the model and data from 2019 onward for inference (the choice of year is arbitrary).

In [19]:
# Split data

# Pre-training data: < 2019
# Inference data: from 2019 onward, up to 2019-10-27 (see the note below)

pretraining_data = df.loc[df["Year"] < 2019, columns_to_keep]

# df["event_dt"] <= '2019-10-27': No pos_labels in inference data after this date
inference_data = df.loc[
    (df["Year"] >= 2019) & (df["event_dt"] <= "2019-10-27"), columns_to_keep
]

print(f"Pre-training data shape: {pretraining_data.shape}")
print(f"Inference data shape: {inference_data.shape}")

Pre-training data shape: (15478667, 13)
Inference data shape: (973021, 13)


We will use the inference data to extract embeddings from pre-trained models and then will use these embeddings as input to simpler machine learning models to perform downstream tasks.

Therefore, we need to further split the inference data into train/validation/test sets to be ready for downstream tasks.

In [20]:
# Further split inference data into train/valid/test sets

splitting_dates = {"train_end_date": "2019-08-01", "valid_end_date": "2019-09-15"}

train_idx = inference_data["event_dt"] < splitting_dates["train_end_date"]
valid_idx = (inference_data["event_dt"] >= splitting_dates["train_end_date"]) & (
    inference_data["event_dt"] < splitting_dates["valid_end_date"]
)
test_idx = inference_data["event_dt"] >= splitting_dates["valid_end_date"]

train_data = inference_data.loc[train_idx]
validation_data = inference_data.loc[valid_idx]
test_data = inference_data.loc[test_idx]

print(f"Training data shape: {train_data.shape}")
print(f"Validation data shape: {validation_data.shape}")
print(f"Test data shape: {test_data.shape}")

Training data shape: (689732, 13)
Validation data shape: (147094, 13)
Test data shape: (136195, 13)


In [21]:
# Checking class imbalance in train_data
train_data["Is Fraud?"].value_counts()

0    688770
1       962
Name: Is Fraud?, dtype: int64

In [22]:
# Checking class imbalance in validation_data
validation_data["Is Fraud?"].value_counts()

0    146893
1       201
Name: Is Fraud?, dtype: int64

In [23]:
# Checking class imbalance in test_data
test_data["Is Fraud?"].value_counts()

0    135940
1       255
Name: Is Fraud?, dtype: int64

In [24]:
# Export training/test data for easy reloading

clean_data_dir = "./data/card_dataset_clean"
# If unavailable, create directory
os.makedirs(os.path.join(clean_data_dir, "inference"), mode=0o777, exist_ok=True)

# Pretraining data
pretraining_data.to_csv(
    os.path.join(clean_data_dir, "pretraining_data_clean.csv"), index=False
)

# Inference data
train_data.to_csv(
    os.path.join(clean_data_dir, "inference", "train_data_clean.csv"), index=False
)
validation_data.to_csv(
    os.path.join(clean_data_dir, "inference", "valid_data_clean.csv"), index=False
)
test_data.to_csv(
    os.path.join(clean_data_dir, "inference", "test_data_clean.csv"), index=False
)

## 1. Data pre-processing

The model requires the input data to be, first, tokenized and encoded as unique numerical identifiers and then shaped into fixed-length sequences. The data pre-processing step prepares tabular data to be compatible with the model by implementing the following:

- It discretizes tabular data into discrete information units, much like NLP tokens, and assigns a unique numerical identifier to each unit.
- It shapes the encoded tabular data into fixed-length sequences using a sliding window approach.

To pre-process the pre-training data, we use the following command:
```
$ python3 scripts/encode_dataset.py -cfg ./configs/example/data_encoding/config_card_dataset_encoding.json
```

Let's have a look at the configuration file `config_card_dataset_encoding.json`:

* ```data_dir```: directory path where the csv file with the raw tabular data is available
* ```data_name```: "pretraining_data_clean.csv"
* ```user_col```: "User"
* ```date_col```: "event_dt"
* ```target_col```: "Is Fraud?"
* ```seq_len```: 10
* ```stride```: 5
* ```num_bins```: 10
* ```n_max```: 85000
* ```encoded_data_dir```: directory path where the folder with the outputs of the pre-processing step will be saved
* ```encoded_data_folder```: "card_dataset_encoded_seq_len_10_stride_5_bins_10"
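
To build some intuition for `seq_len` and `stride`, here is a minimal sketch of a sliding window over the transactions of a single user. The helper `sliding_windows` is hypothetical and written only for this illustration; in particular, we assume that trailing rows that do not fill a complete window are dropped, which may differ from what `encode_dataset.py` actually does.

```python
def sliding_windows(rows, seq_len=10, stride=5):
    """Return fixed-length windows over `rows`, moving `stride` rows at a time.

    Hypothetical helper for illustration only. Trailing rows that do not
    fill a complete window are dropped here (an assumption).
    """
    return [
        rows[start : start + seq_len]
        for start in range(0, len(rows) - seq_len + 1, stride)
    ]


# 22 encoded transactions of one user, represented by their positions
transactions = list(range(22))
windows = sliding_windows(transactions, seq_len=10, stride=5)

for w in windows:
    print(w[0], "...", w[-1])
# 0 ... 9
# 5 ... 14
# 10 ... 19
```

With `seq_len` = 10 and `stride` = 5, consecutive windows overlap by five transactions, so most transactions appear in two sequences.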


You can find an explanation of the parameters in the configuration file for pre-processing tabular data in the `README file`. Below, we show how we select the `n_max` value for the card transactions dataset. 

We start by counting the unique values in each column in the `pretraining_data`.

In [25]:
# Count unique values in each column

unique_value_counts = []

for col in columns_to_keep:
    unique_value_counts.append(len(pretraining_data[col].unique()))

# Collect values in a dataframe
df_info = pd.DataFrame(
    {
        "column_name": columns_to_keep,
        "unique_value_counts": unique_value_counts,
        "data_type": pretraining_data[columns_to_keep].dtypes.values,
    }
)

# Sort values by unique_value_counts and reset index
df_info.sort_values(by="unique_value_counts", ascending=True, inplace=True)
df_info.reset_index(drop=True, inplace=True)

df_info

Unnamed: 0,column_name,unique_value_counts,data_type
0,Is Fraud?,2,int64
1,Use Chip,3,object
2,Card,9,int64
3,Errors?,24,object
4,MCC,109,float64
5,Merchant State,201,object
6,User,1106,int64
7,Merchant City,12724,object
8,Zip,25771,float64
9,Merchant Name,78388,float64


The data pre-processing step will discretize any non-categorical column with more than `n_max` unique values using quantile binning. We use `n_max` = 85,000, meaning only the columns `Amount` and `TimeStamp` will be discretized using quantile binning. Note that the columns `User` and `event_dt` are used to sort and group rows in the data by the user and date and won't be fed to the model. Therefore, these columns will not be transformed during the pre-processing step.
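
For readers unfamiliar with quantile binning, the sketch below shows the general idea with NumPy on synthetic values, using `num_bins` = 10 as in the configuration file. This is an illustration of the technique under our own assumptions, not the implementation used by the pre-processing script.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a continuous column such as "Amount"
amount = np.log1p(rng.exponential(scale=50.0, size=1_000))

num_bins = 10

# Interior quantile edges: 10%, 20%, ..., 90%
edges = np.quantile(amount, np.linspace(0, 1, num_bins + 1)[1:-1])

# Map each value to a bin index in [0, num_bins - 1]
bin_ids = np.searchsorted(edges, amount, side="right")

# Each bin receives roughly the same number of values
print(np.bincount(bin_ids, minlength=num_bins))
```

Because the bin edges follow the quantiles of the data, each bin holds a similar share of the samples even when the distribution is skewed.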

### Outputs

The pre-processing step generates and exports in the output directory all files that are necessary to pre-train the model:

```
└── encoded_data_dir/encoded_data_folder
    ├── binning.pickle                 <- Pickle file with all encoders and bins used in quantile binning
    ├── column_lists_by_dtype.json     <- Json file with lists of columns for each discretization strategy
    ├── config.json                    <- configuration file used in pre-processing step
    ├── processed_data_and_labels.h5   <- encoded data
    ├── vocab_tokenizer.nb             <- Tokenizer
    └── vocab.pickle                   <- Vocabulary
```

## 2. Pre-training

The model is pre-trained using a "masked language model" (MLM) pre-training objective.
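
As a rough illustration of the MLM objective, the sketch below masks about 15% of the tokens in a batch and builds the labels that the model would have to predict. The names `MASK_ID` and `IGNORE_INDEX` are assumptions made for this example, and every selected token is simply replaced by the mask token; the masking scheme used by the library may differ.

```python
import numpy as np

rng = np.random.default_rng(42)

MASK_ID = 0          # hypothetical id of the mask token
IGNORE_INDEX = -100  # label value commonly ignored by cross-entropy losses
mlm_probability = 0.15

# One batch of 4 sequences with 10 tokens each (ids are arbitrary)
input_ids = rng.integers(low=5, high=100, size=(4, 10))

# Select about 15% of the positions for masking
mask = rng.random(input_ids.shape) < mlm_probability

# Labels keep the original id at masked positions and are ignored elsewhere
labels = np.where(mask, input_ids, IGNORE_INDEX)

# The model sees the mask token at the selected positions
masked_inputs = np.where(mask, MASK_ID, input_ids)
```

The loss is then computed only at the masked positions, so the model learns to reconstruct hidden fields from their context.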

Once the data pre-processing step is complete, we can pre-train the model using the following command.

```
$ python3 scripts/run_mlm_pretraining.py -cfg ./configs/example/pretraining/config_card_dataset_size_300.json
```
Let's have a look at the `config_card_dataset_size_300.json` configuration file:

* "seed" <- random seed
* "encoded_data_dir" <- directory path where the outputs of the pre-processing step are stored
* "encoded_data_folder" <- folder name containing the outputs of the pre-processing step
* "output_dir" <- directory path where the outputs of the pre-training step are stored
* "output_folder" <- folder name containing the outputs of the pre-training step
* "add_date_suffix_to_output": true
* "field_hidden_size": 256
* "tab_embeddings_num_attention_heads": 8
* "tab_embedding_num_encoder_layers": 2
* "tab_embedding_dropout": 0.1
* "num_attention_heads": 12
* "num_hidden_layers": 12
* "hidden_size": 300
* "mlm_average_loss": true
* "mlm_probability": 0.15
* "batch_size": 256
* "grad_acc_steps": 16
* "logging_per_epoch": 15
* "num_epochs": 20
* "checkpoint_every_N_epochs": 1
* "lr_max": 5e-4
* "warmup_steps_in_epochs": 1
* "save_total_limit": 5
* "resume_from_checkpoint": false
* "checkpoint_dir": ""

You can find an explanation of the parameters in the configuration file for pre-training in the `README file`.
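
One quantity worth keeping in mind when adjusting `batch_size` and `grad_acc_steps` is the effective batch size. The arithmetic below assumes that `grad_acc_steps` is the number of gradient accumulation steps per optimizer update and that training runs on a single device; please check the README for the exact definition.

```python
batch_size = 256
grad_acc_steps = 16

# Number of sequences contributing to each optimizer update on one device
effective_batch_size = batch_size * grad_acc_steps
print(effective_batch_size)  # 4096
```

Under these assumptions, halving `batch_size` to fit in memory while doubling `grad_acc_steps` keeps the effective batch size unchanged.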

### Outputs

The pre-training step generates and exports several files in the output directory, for which we show one example below.

```
└── output_dir/output_folder
    ├── checkpoint-...     <- checkpoint directory
    ├── checkpoint-...     <- checkpoint directory
    ├── logging            <- directory with logging files for TensorBoard  
    ├── callback_log.json  <- dictionary with training/validation metrics
    ├── config.json        <- configuration file used to launch pre-training
    └── training_args.json <- dictionary with model configuration and training arguments used
```

### MLM metrics

Below, we show an example of MLM metrics computed during the pre-training phase.

![MLM_metrics](../docs/img/mlm_metrics.png)

## 3. Embedding extraction from pre-trained models

Once the pre-training phase is completed, we can extract embeddings for each time step within a time-series sequence using the following command:

```
$ python3 scripts/inference_main.py -cfg configs/example/inference/config_card_dataset_inference.json
```

Let's have a look at the `config_card_dataset_inference.json` configuration file:

* "seed" <- random seed
* "pretrained_model_config":
    * "model_directory" <- directory path where the outputs of the pre-training step were stored (same as output_dir in the pre-training config file)
    * "model_name" <- folder name inside model_directory containing one (or more) model checkpoints (same as output_folder in the pre-training config file)
    * "ckpt_dir" <- optional argument specifying which checkpoint to use when loading a pre-trained model
* "pretraining_data_config":
    * "path_to_config_file" <- path to the configuration file with information about the data used to pre-train the model
* "inference_data_config":
    * "data_directory" <- path to the directory containing the train/validation/test datasets we want to extract embeddings for
* "inference_config":
    * "batch_size" <- batch size
    * "pooling_on_time_axis":
        * "strategy": "mean_pooling"
        * "nbr_days": 3
    * "pooling_on_layer_axis":
        * "strategy": "single_layer_pooling"
        * "pooling_layer": -1
* "downstream_task_config":
    * "target_cols_to_include": ["Is Fraud?"]
    * "path_to_data_with_labels" <- path to the file(s) containing the user, date, and label columns that we want to merge with the extracted embeddings
    * "output_dir" <- path to the output directory where the embeddings and labels will be saved
    * "user_col": "User"
    * "date_col": "event_dt"
    

You can find a detailed explanation of the parameters in the configuration file for extracting embeddings in the `README file`.


### Outputs

The embedding extraction step extracts embeddings and exports them in parquet or npz files in the specified output directory. Below, we show one example for loading embeddings from parquet files.

#### Example
In this example, we pool embeddings from the last hidden layer and apply mean pooling over 3 days along the time axis.

```
"pooling_on_time_axis": {
    "strategy": "mean_pooling",
    "nbr_days": 3
    },
"pooling_on_layer_axis": {
    "strategy": "single_layer_pooling",
    "pooling_layer": -1
    }
```
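
To make the two pooling settings more concrete, here is a small sketch on toy data. Selecting the layer with index -1 keeps the last hidden layer, and the time-axis pooling is approximated with a trailing 3-day mean computed per user with pandas. The per-user trailing window is our assumption for illustration; the exact window definition used by the inference script is described in the README.

```python
import numpy as np
import pandas as pd

# Layer axis: keep a single layer (index -1 is the last hidden layer)
num_layers, seq_len, hidden_size = 12, 4, 3
hidden_states = np.random.default_rng(0).normal(
    size=(num_layers, seq_len, hidden_size)
)
last_layer = hidden_states[-1]  # shape: (seq_len, hidden_size)

# Time axis: mean over a trailing 3-day window, computed per user
emb = pd.DataFrame(
    {
        "User": [0, 0, 0, 1],
        "event_dt": pd.to_datetime(
            [
                "2019-01-01 10:00",
                "2019-01-02 12:00",
                "2019-01-06 09:00",
                "2019-01-01 09:00",
            ]
        ),
        "f0": [1.0, 3.0, 5.0, 7.0],
        "f1": [0.0, 2.0, 4.0, 6.0],
    }
).sort_values(["User", "event_dt"])

pooled = (
    emb.set_index("event_dt")
    .groupby("User")[["f0", "f1"]]
    .rolling("3D")  # time-based window ending at each transaction
    .mean()
    .reset_index()
)
print(pooled)
```

In this toy example, the second transaction of user 0 is averaged with the first one, while the third transaction falls outside the 3-day window of the earlier ones and keeps its own values.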

The embeddings were extracted in a parquet file and can be loaded in a dataframe, as shown below.

In [26]:
output_path = glob(
    (
        "inference/"
        "embeddings_with_labels/embedding_size_300/"
        "checkpoint-**_last_hidden_single_layer_pooling_10_days_mean_pooling.parquet"
    )
)[0]

df_embeddings = pd.read_parquet(output_path)

print(f"Data shape: {df_embeddings.shape}\n")

# Show a few samples
df_embeddings.head(3)

Data shape: (957301, 303)



Unnamed: 0,User,event_dt,Is Fraud?,f0,f1,f2,f3,f4,f5,f6,...,f290,f291,f292,f293,f294,f295,f296,f297,f298,f299
0,0,2019-01-04 13:15:00,0,-1.70523,-1.495667,-0.205145,-0.749355,1.149317,-1.173064,0.325741,...,-1.700586,-0.53514,1.226801,-1.317099,-0.056494,0.276329,0.856759,1.554424,-0.468213,-0.706564
1,0,2019-01-04 13:16:00,0,-1.759816,-1.421752,-0.251014,-0.832929,1.138421,-1.243008,0.310202,...,-1.688424,-0.460548,1.270176,-1.363025,-0.075365,0.291442,0.885366,1.593894,-0.383234,-0.660581
2,0,2019-01-04 13:17:00,0,-1.757202,-1.4211,-0.083118,-0.779717,1.104834,-1.295749,0.278178,...,-1.590062,-0.463647,1.356076,-1.289437,-0.115599,0.237998,0.837648,1.718791,-0.319569,-0.651324


The dataframe `df_embeddings` provides the user ID number, the transaction date, the target column, and the extracted embeddings. These embeddings can be used as input to a simple machine learning model, such as XGBoost, to predict fraudulent transactions.
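
The sketch below illustrates this downstream step on synthetic data. To keep the example light, we use a scikit-learn logistic regression as a stand-in for XGBoost, and random vectors as a stand-in for the embedding columns `f0` to `f299`; neither choice comes from the library itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the embedding columns f0 ... f299 and the label
X_train = rng.normal(size=(200, 300))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(50, 300))

# class_weight="balanced" is one simple way to account for class imbalance
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_train, y_train)

# Predicted probability of the positive class for each test sample
fraud_scores = clf.predict_proba(X_test)[:, 1]
```

With the real data, the features could be selected with `df_embeddings.filter(regex=r"^f\d+$")` and the labels taken from `df_embeddings["Is Fraud?"]`.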

## 4. Fine-tuning

To fine-tune the model, we can use the following command:

```
$ python3 scripts/finetuning.py -cfg configs/example/finetuning/config_card_finetuning.json
```

Let's have a look at the `config_card_finetuning.json` configuration file:

* "pretrained_model_config":
    * "model_directory" <- directory path where the outputs of the pre-training step were stored (same as output_dir in the pre-training config file)
    * "model_name" <- folder name inside model_directory containing one (or more) model checkpoints (same as output_folder in the pre-training config file)
    * "ckpt_dir" <- optional argument specifying which checkpoint to use when loading a pre-trained model
* "pretraining_data_config":
    * "path_to_config_file" <- path to the configuration file with information about the data used to pre-train the model
* "finetuning_data_config":
    * "data_directory" <- path to the directory containing the train/validation/test datasets we want to use for fine-tuning
* "training_config":
    * "seed" <- random seed
    * "batch_size" <- batch size
    * "output_dir" <- directory path where the outputs of the fine-tuning step are stored
    * "output_folder" <- folder name containing the outputs of the fine-tuning step
    * "add_date_suffix_to_output" <- if true, a date suffix is added at the end of the output_folder name
    * "grad_acc_steps": 16
    * "logs_per_epoch": 50
    * "num_epochs": 15
    * "checkpoint_every_N_epochs": 1
    * "lr_max": 1e-4
    * "warmup_steps_in_epochs": 2
    * "save_tot_lim": 1
    * "problem_type": "classification"
    * "compute_pos_weight": true
    * "pos_weight": 716
    * "load_weights_from_pretraining": true
    * "pos_label": 1
    * "metric_for_best_model": "f1_score_minority"
    
You can find a detailed explanation of the parameters in the configuration file for fine-tuning the model in the `README file`.
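
The `pos_weight` value of 716 in the example configuration is consistent with the ratio of negative to positive samples in the training split created in Section 0 (688,770 non-fraudulent and 962 fraudulent transactions). The sketch below reproduces this arithmetic; that the weight is defined as this ratio is our assumption, so please refer to the README for the exact definition.

```python
# Class counts of the training split from Section 0
n_negative = 688_770
n_positive = 962

# Ratio of negative to positive samples
pos_weight = n_negative / n_positive
print(round(pos_weight))  # 716
```

A weight of this kind makes each fraudulent sample count roughly as much as 716 non-fraudulent ones in the loss, which compensates for the strong class imbalance.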

### Metrics

Below, we show an example of the metrics computed on the validation data during fine-tuning.

![Finetuning_metrics](../docs/img/finetuning_metrics.png)