In [1]:
from jenkspy import JenksNaturalBreaks
import pandas as pd
import numpy as np
import time
import os
import pickle
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
import itertools
import sqlite3

import modules.feature_selection as fs
import modules.helper_functions as hf
import modules.filter_rows as fr

### Helper Functions

All helper functions are included in the module ```helper_functions.py```, so that they can also be used in other notebooks.

# Data Preparation

According to IBM corporation (2013) the data preparation process can be outlined as follows (p. 18):

The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection, data cleaning, construction of new attributes, and transformation of data for modeling tools.

Data preparation is one of the most important and often time-consuming aspects of data mining. In fact, it is estimated that data preparation usually takes 50-70% of a project's time and effort. Devoting adequate energy to the earlier business understanding and data understanding phases can minimize this overhead, but you still need to expend a good amount of effort preparing and packaging the data for mining. Depending on your organization and its goals, data preparation typically involves the following tasks: 
* Merging data sets and/or records 
* Selecting a sample subset of data
* Aggregating records 
* Deriving new attributes 
* Sorting the data for modeling 
* Removing or replacing blank or missing values 
* Splitting into training and test data sets

## Select Data

The data selection phase covers the selection of the rows and columns needed for the subsequent modeling process. For the case study it is first of all important to address the insights from the business side and keep only distinct purchases in the dataset. This means we have to remove all transactions in which customers tried several times to transfer the money. If two transactions occur within one minute, with the same amount of money and from the same country, it is (for a decent number of tries) safe to assume that they are payment attempts for the same purchase. We therefore remove those unsuccessful transactions that meet the equality criteria stated by the business side.
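The rule described above can be sketched with pandas as follows. This is only an illustration and not the implementation in ```filter_rows.py```: the function names, the column `attempt_block` and the toy data are hypothetical, the one-minute gap is measured against the previous attempt with the same country and amount, and keeping the last attempt of each purchase is an assumption that retries stop once a payment succeeds.

```python
import pandas as pd

def flag_purchase_attempts(df, max_gap="1min"):
    """Number the purchases within each (country, amount) group (hypothetical helper)."""
    keys = ["country", "amount"]
    out = df.sort_values("tmsp").copy()
    # Time since the previous transaction with the same country and amount
    gap = out.groupby(keys)["tmsp"].diff()
    # A new purchase starts at the first row of a group or after a gap above max_gap
    new_purchase = gap.isna() | (gap > pd.Timedelta(max_gap))
    out["attempt_block"] = new_purchase.astype(int).groupby([out[k] for k in keys]).cumsum()
    return out

def keep_final_attempt(df):
    """Keep only the last attempt of each purchase (assumed rule, see lead-in)."""
    flagged = flag_purchase_attempts(df)
    return flagged.groupby(["country", "amount", "attempt_block"], sort=False).tail(1)

# Toy data: three attempts for one German purchase, one Austrian purchase,
# and a later, separate German purchase with the same amount
demo = pd.DataFrame({
    "tmsp": pd.to_datetime([
        "2019-01-01 10:00:00", "2019-01-01 10:00:30", "2019-01-01 10:00:40",
        "2019-01-01 10:00:50", "2019-01-01 10:05:00",
    ]),
    "country": ["Germany", "Germany", "Austria", "Germany", "Germany"],
    "amount": [100, 100, 100, 100, 100],
    "success": [0, 0, 1, 1, 0],
})
purchases = keep_final_attempt(demo)
print(purchases[["tmsp", "country", "success"]])
```

In the toy data the first two German attempts are dropped, because they are followed by another attempt within one minute.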

### Rows

All functions needed to select the rows are stored in the module ```./modules/filter_rows.py```.

### Columns

All functions needed to drop columns are stored in the module ```./modules/helper_functions.py```.

### Feature selection
Because CRISP-DM is a cyclic process, this step refers ahead to a subsequent step, namely the formatting step. <br>
In this step we first have to load the required, already split data.

In terms of feature importance it is almost always important to remove redundant features which are highly correlated with each other. We have binary variables and continuous variables in the dataset. For two continuous variables the correlation is calculated with the Pearson correlation coefficient, for two binary variables it is the Phi coefficient and for a binary and a continuous variable it is the point-biserial correlation. The Phi coefficient and the point-biserial correlation are both special cases of the Pearson correlation. Hence, for all feature constellations in the dataset we can calculate the Pearson correlation coefficient in order to remove redundant features. Kuhn and Johnson (2013, p. 47) propose a pairwise between-predictor correlation of less than 0.75 for models that are particularly sensitive to multicollinearity. In order to find those predictors Kuhn and Johnson propose the following algorithm:

***
**Correlation based feature selection (Kuhn & Johnson, 2013)**
***
**Input**: training matrix, threshold
* Calculate the correlation matrix of the predictors
* Determine the two predictors associated with the largest absolute pairwise correlation (A and B)
* Determine the average correlation between A and the other variables. Do the same for predictor B.
* If A has a larger average correlation, remove it, otherwise remove B.
* Repeat the steps until no absolute correlations are above the threshold.

**Output**: List of features that have a higher between-predictor correlation than the specified threshold
***

The algorithm is applied in the function ```correlationFiltering(X_train, threshold = 0.75)```. There are no predictors in the dataset with a higher pairwise correlation than 0.75.

Based on the model a second round will be implemented in the modeling part. A model-based wrapper approach is planned as a second feature selection step.
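As a numerical check of the earlier statement that the Phi coefficient and the point-biserial correlation are special cases of the Pearson correlation, the following sketch computes both from their textbook formulas and compares them with `np.corrcoef`. The arrays are made-up toy values and are not taken from the dataset.

```python
import numpy as np

# Two binary toy variables and one continuous toy variable
a = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])
b = np.array([1, 1, 0, 0, 0, 1, 1, 0, 1, 0])
x = np.array([2.0, 3.5, 1.0, 0.5, 1.5, 2.5, 4.0, 0.0, 3.0, 1.0])

# Phi coefficient from the 2x2 contingency table
n11 = int(np.sum((a == 1) & (b == 1)))
n10 = int(np.sum((a == 1) & (b == 0)))
n01 = int(np.sum((a == 0) & (b == 1)))
n00 = int(np.sum((a == 0) & (b == 0)))
phi = (n11 * n00 - n10 * n01) / np.sqrt(
    (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
)

# Point-biserial correlation between the binary a and the continuous x
n1, n0, n = int(a.sum()), int((1 - a).sum()), len(a)
r_pb = (x[a == 1].mean() - x[a == 0].mean()) / x.std(ddof=0) * np.sqrt(n1 * n0 / n**2)

# Pearson correlation for the same pairs
pearson_binary = np.corrcoef(a, b)[0, 1]
pearson_mixed = np.corrcoef(a, x)[0, 1]

print(phi, pearson_binary)
print(r_pb, pearson_mixed)
```

Both pairs of values agree up to floating-point precision, which is why a single call to `X_train.corr()` is sufficient for all feature combinations.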

In [2]:
def correlationFiltering(X_train, threshold = 0.75, figsize = 10):
    plt.figure(figsize=(figsize, figsize))
    sns.heatmap(X_train.corr().round(2), annot=False)
    plt.show()
    
    # create correlation matrix
    corrMatrix = X_train.corr().abs()
    # get upper triangle
    upperCorrMatrix = corrMatrix.where(
        np.triu(np.ones(corrMatrix.shape), k=1).astype(np.bool_))
    uniqueCorrPairs = upperCorrMatrix.unstack().dropna()
    sortedCorrPairs = uniqueCorrPairs.sort_values(ascending = False)
    # identify all pairs with correlation greater than threshold
    pairsToFilter = sortedCorrPairs[sortedCorrPairs > threshold]
    toRemove = []
    for pair in pairsToFilter.index:
        # calculate average correlation between A and other variables and B with other variables
        a = pair[0]
        a_avg = corrMatrix[a].mean()
        b = pair[1]
        b_avg = corrMatrix[b].mean()
        # if A has a larger average correlation, remove it, otherwise remove B
        if a_avg > b_avg:
            toRemove.append(a)
        else:
            toRemove.append(b)

    return list(set(toRemove))

This function is also transferred to the module ```feature_selection.py```, such that it can be used in subsequent modeling steps.
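The function above evaluates all pairs against a correlation matrix that is computed once, whereas the steps listed by Kuhn and Johnson are repeated after each removal. A minimal sketch of such an iterative variant is given here for comparison; the function name `iterative_correlation_filter` and the toy data are hypothetical and not part of ```feature_selection.py```.

```python
import numpy as np
import pandas as pd

def iterative_correlation_filter(X, threshold=0.75):
    """Remove one predictor per iteration until no pair exceeds the threshold."""
    corr = X.corr().abs()
    removed = []
    while len(corr) > 1:
        values = corr.to_numpy(copy=True)
        np.fill_diagonal(values, np.nan)  # ignore self-correlations
        if not np.nanmax(values) > threshold:
            break
        # Pair (A, B) with the largest absolute pairwise correlation
        i, j = np.unravel_index(np.nanargmax(values), values.shape)
        # Average correlation of A and of B with the other remaining predictors
        mean_i = np.nanmean(values[i])
        mean_j = np.nanmean(values[j])
        drop = corr.columns[i] if mean_i > mean_j else corr.columns[j]
        removed.append(drop)
        # Recompute on the reduced matrix in the next iteration
        corr = corr.drop(index=drop, columns=drop)
    return removed

# Toy data: x2 is almost a copy of x1, x3 is only moderately related to both
toy = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [1.1, 2.0, 3.2, 3.9, 5.1, 6.0],
    "x3": [2.0, -1.0, 3.0, 0.0, 1.0, -2.0],
})
removed = iterative_correlation_filter(toy, threshold=0.75)
print(removed)
```

When no pair exceeds the threshold, as reported for this dataset, both variants return an empty list.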

## Clean Data
**Data Cleaning Report** <br>
Cleaning your data involves taking a closer look at the problems in the data that you've chosen to include for analysis. There are several ways to clean data using the Record and Field Operation nodes in IBM SPSS Modeler. Common data problems are:
* Missing data: Exclude rows or characteristics. Or, fill blanks with an estimated value.
* Data errors: Use logic to manually discover errors and replace. Or, exclude characteristics.
* Coding inconsistencies: Decide upon a single coding scheme, then convert and replace values.
* Missing or bad metadata: Manually examine suspect fields and track down correct meaning.

The Data Quality Report prepared during the data understanding phase contains details about the types of problems particular to your data. You can use it as a starting point for data manipulation in IBM SPSS Modeler.

## Construct Data
**Derived Attributes Generated Record** <br>
It is frequently the case that you'll need to construct new data. For example, it may be useful to create a new column flagging the purchase of an extended warranty for each transaction. This new field, purchased_warranty, can easily be generated using a Set to Flag node in IBM SPSS Modeler. There are two ways to construct new data: 
* Deriving attributes (columns or characteristics) 
* Generating records (rows) 

## Integrate Data
**Merged Data** <br>
It is not uncommon to have multiple data sources for the same set of business questions. For example, you may have access to mortgage loan data as well as purchased demographic data for the same set of clients. If these data sets contain the same unique identifier (such as social security number), you can merge them in IBM SPSS Modeler using this key field. There are two basic methods of integrating data:
* Merging data involves merging two data sets with similar records but different attributes. The data is merged using the same key identifier for each record (such as customer ID). The resulting data increases in columns or characteristics. 
* Appending data involves integrating two or more data sets with similar attributes but different records. The data is integrated based upon similar fields (such as product name or contract length).

In [3]:
def get_tmsp_information(data):
    out = data.copy()
    out["month"] = out["tmsp"].dt.strftime('%b')
    # .dt.day is platform independent ('%#d' is only supported on Windows)
    out["dayOfMonth"] = out["tmsp"].dt.day
    out["weekday"] = out["tmsp"].dt.day_name()
    out["weekend"] = np.where(out['weekday'].isin(['Saturday', 'Sunday']), 1, 0)
    out["holiday"] = np.where((out['month'] == 'Jan') & (out['dayOfMonth'] == 1), 1, 0)
    
    return out

def get_daytime(data):
    out = data.copy()
    
    out['time'] = out['tmsp'].dt.strftime('%H:%M')
    out['daytime'] = np.where((out['time'] >= '00:00') & (out['time'] < '06:00'), 'night', 
                        np.where((out['time'] >= '06:00') & (out['time'] < '12:00'), 'morning',
                        np.where((out['time'] >= '12:00') & (out['time'] < '18:00'), 'afternoon', 'evening')))
    out["minuteOfDay"] = (out["tmsp"].dt.hour * 60) + (out["tmsp"].dt.minute)
    
    return out

def get_amountgroup(data, train_length = 0.7, on_training = True):
    out = data.copy()
    out = out.sort_values(by = ["tmsp"], ascending = True)
    out_length = len(out)
    parameters = {}
    
    if on_training:
        train_split = int(np.round(out_length*train_length))
        parameters['train_split_iloc'] = train_split
    else:
        train_split = out_length
    
    amount = list(out.iloc[:train_split,:]['amount'])
    
    jnb = JenksNaturalBreaks(5)
    jnb.fit(amount)
    bins = jnb.breaks_
    bins[0] = 0
    bins[len(bins) - 1] = 10000
    parameters["jenks"] = bins
    hf.writePickle('./data/parameters.pkl', parameters)
    
    out = hf.get_amountgroups(out, bins = bins)
    
    print("= Jenks natural breaks are:")
    print(jnb.breaks_)
    
    return out

def getCard_3DSec_PSP_Amountgroup(data):
    y = hf.get_y(groups=["card", "3D_secured", "PSP", "amountgroup_word"], data = data.copy())
    y = y.sort_values(by=['success_rate'], ascending = False)
    y_mod = y.drop('PSP', axis=1)
    y_mod = y_mod.groupby(["card", "3D_secured", "amountgroup_word"]).aggregate({'success_rate': 'max'}).reset_index()
    y_mod = y_mod.sort_values(by=["success_rate"], ascending = False)
    y_mod = y_mod.merge(y, how="left", on = ["card", "3D_secured", "amountgroup_word", "success_rate"])

    return y_mod

The success rate, which is the most relevant optimization criterion, hasn't been explored yet as a source of features for the machine learning model. It is not possible to simply calculate the overall success rate for different attributes (e.g. PSP), because this would cause data leakage. Those features have to be calculated with a rolling-window approach based on previous transactions. In order not to throw away data, missing values are imputed, where no previous data exists, with the most probable outcome overall or for a given PSP. The overall success rate is 0.37. The success rate for Goldcard is 0.62, for UK Card 0.4, for Moneycard 0.37 and for Simplecard 0.26. The imputation, which is only relevant for observations at the very beginning of the dataset, therefore uses 0 by default and 1 for the PSP Goldcard. This case should be limited to just a few observations. To apply the rolling success rate calculation the dataset has to be ordered by timestamp, which has already been done.
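The leakage-free calculation described here can be sketched as follows. The sketch is an assumption-laden illustration rather than the notebook's implementation: the function name `rolling_success_rate`, the parameter `overrides`, the column `rolling_SR` and the toy data are hypothetical. The key point is the `shift()` before the rolling mean, which ensures that the outcome of the current transaction never enters its own feature.

```python
import pandas as pd

def rolling_success_rate(df, window=3, by="PSP", default=0.0, overrides=None):
    """Success rate of the previous `window` transactions per group (hypothetical helper)."""
    out = df.sort_values("tmsp").copy()
    # shift() excludes the current transaction, so only past outcomes are used
    past_rate = out.groupby(by)["success"].transform(
        lambda s: s.shift().rolling(window, min_periods=1).mean()
    )
    # Fallback for the first transaction of each group: 0 by default,
    # with group-specific overrides such as 1 for Goldcard
    fallback = out[by].map(overrides or {}).fillna(default)
    out["rolling_SR"] = past_rate.fillna(fallback)
    return out

# Toy data in chronological order
demo_tx = pd.DataFrame({
    "tmsp": pd.date_range("2019-01-01 10:00", periods=6, freq="min"),
    "PSP": ["Goldcard", "Simplecard", "Goldcard", "Simplecard", "Goldcard", "Goldcard"],
    "success": [1, 0, 0, 1, 1, 1],
})
demo_sr = rolling_success_rate(demo_tx, window=3, overrides={"Goldcard": 1.0})
print(demo_sr[["PSP", "success", "rolling_SR"]])
```

Only the first transaction of each PSP receives an imputed value; all later rows use observed outcomes of earlier transactions.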

In [4]:
def getOverallSR(data):
    out = data.copy()
    train_split = hf.loadPickle('./data/parameters.pkl')["train_split_iloc"]
    out["overallSR"] = out.iloc[:train_split, :].success.mean()
                
    return out

def combinatoric_SR(data, addColumns = ["PSP", "card", "3D_secured", "amountgroup_word"]):
    out = data.copy()

    train_split = hf.loadPickle('./data/parameters.pkl')["train_split_iloc"]
    combinations = {}
    colName = ""
    for col in addColumns:
        combinations[col] = list(out[col].unique())
        colName = colName + col + "_"

    colName = colName + "SR"
    print(colName)
    # build the output column list without mutating the caller's list (or the shared default)
    outCols = list(addColumns) + [colName]

    keys, values = zip(*combinations.items())
    permutations_dicts = [dict(zip(keys, v)) for v in itertools.product(*values)]

    joinFrame = pd.DataFrame()

    for permutation in permutations_dicts:
        subset = out.copy().iloc[:train_split, :]
        for key in permutation.keys():
            subset = subset[subset[key] == permutation[key]]
        subset[colName] = subset.success.mean()
        joinFrame = pd.concat([joinFrame, subset[outCols]])

    out = out.merge(joinFrame.drop_duplicates(), how = 'left', on = list(set(out.columns).intersection(set(joinFrame.columns))))
    
    return out

def combinatoric_event_window_SR(data, 
                                 addColumns = ["PSP", "card", "3D_secured", "amountgroup_word"], 
                                 event_windows = [5, 10, 100, 200],
                                 allowed_missing = 0.05
                                ):
    out = data.copy()

    for event_window in event_windows:
        print("= Event window size: " + str(event_window))
        combinations = {}
        colName = ""
        replaceCol = ""
        for col in addColumns:
            combinations[col] = list(out[col].unique())
            colName = colName + col + "_"
            replaceCol = replaceCol + col + "_"

        colName = colName + "e" + str(event_window) + "_SR"
        replaceCol = replaceCol + "SR"
        print(colName)
        outCols = addColumns.copy()
        outCols.append(colName)

        keys, values = zip(*combinations.items())
        permutations_dicts = [dict(zip(keys, v)) for v in itertools.product(*values)]

        joinFrame = pd.DataFrame()

        for permutation in permutations_dicts:
            subset = out.copy()
            for key in permutation.keys():
                subset = subset[subset[key] == permutation[key]]
            subset[colName] = subset.success.shift().rolling(
                event_window, min_periods=int(np.ceil(event_window/10))
            ).mean()
            joinFrame = pd.concat([joinFrame, subset[outCols]])
        
        missing_ratio = joinFrame.isna().sum().sum()/len(joinFrame)
        if missing_ratio <= allowed_missing:
            out = out.join(joinFrame[colName])
            out[colName] = out[colName].fillna(out[replaceCol])
        else:
            print("--- Number of missing values too large ---")
    
    return out

def combinatoric_time_window_SR(data,
                                addColumns = ["PSP", "card", "3D_secured", "amountgroup_word"], 
                                time_windows = [1, 6, 12, 24, 72],
                                allowed_missing = 0.15
                                ):
    out = data.copy()

    for time_window in time_windows:
        print("= Time window size: " + str(time_window) + "h")
        combinations = {}
        colName = ""
        replaceCol = ""
        for col in addColumns:
            combinations[col] = list(out[col].unique())
            colName = colName + col + "_"
            replaceCol = replaceCol + col + "_"

        colName = colName + "t" + str(time_window) + "h_SR"
        replaceCol = replaceCol + "SR"
        print(colName)
        outCols = addColumns.copy()
        outCols.append(colName)

        keys, values = zip(*combinations.items())
        permutations_dicts = [dict(zip(keys, v)) for v in itertools.product(*values)]

        joinFrame = pd.DataFrame()

        for permutation in permutations_dicts:
            subset = out.copy()
            for key in permutation.keys():
                subset = subset[subset[key] == permutation[key]]
            subset[colName] = subset[["tmsp", "success"]].rolling(
                            str(time_window) + "h", on = "tmsp", min_periods=int(np.min([np.round(time_window/10), 3]))
                        ).apply(hf.getMeanRollingEvent)["success"]
            joinFrame = pd.concat([joinFrame, subset[outCols]])

        missing_ratio = joinFrame.isna().sum().sum()/len(joinFrame)
        if missing_ratio <= allowed_missing:
            out = out.join(joinFrame[colName])
            out[colName] = out[colName].fillna(out[replaceCol])
        else:
            print("--- Number of missing values too large: " + str(missing_ratio) + " ---")
    
    return out

Bygari et al. (2021) developed a similar routing approach for an India-based payment service provider called Razorpay. Instead of different payment service providers, their use case involves several terminals. For the described "Smart Routing Solution" the authors also proposed calculating the success rates over different event and time windows, i.e. with a rolling-window approach. For time windows this means:
* The success rates are calculated from the transactions in the last t seconds. This is done for all transactions in the time window as well as for each payment service provider separately. Concretely, if the timestamp of a transaction is T, the success rates are calculated for the transactions in [T-t, T). Even though Bygari et al. (2021) described a greedy approach to find the best-suited time intervals over a wide range of values for t, t is limited to [300, 600, 3600] seconds for this case study. If a time interval does not contain any transactions, the missing values are imputed with the rolling overall success rate. Furthermore, the minimum number of transactions required to calculate the success rate is set to one.

For event windows this means:
* The rolling-event approach is calculated similarly, with t replaced by e: the success rate for a certain transaction is calculated from the e previous transactions. For this case study e is limited to [5, 10, 50] events. The minimum number of transactions is also set to one.

Furthermore all window-based feature engineering approaches can be computed overall and based on different features like ```PSP```, ```country``` or ```card```. This will be limited to the column ```PSP``` for this case-study.
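The two window types can be illustrated with plain pandas on a small toy series. This is a sketch under assumptions, not the code used in the pipeline: the toy timestamps and variable names are made up, and the half-open interval [T-t, T) is expressed through `closed="left"` of the time-based `rolling` window.

```python
import pandas as pd

tx = pd.DataFrame({
    "tmsp": pd.to_datetime([
        "2019-01-01 10:00:00", "2019-01-01 10:02:00",
        "2019-01-01 10:04:00", "2019-01-01 10:10:00",
    ]),
    "success": [1, 0, 1, 0],
}).set_index("tmsp")

# Time window: mean success of the transactions in [T - 300 s, T)
time_sr = tx["success"].rolling("300s", closed="left").mean()

# Event window: mean success of the e = 2 previous transactions
event_sr = tx["success"].shift().rolling(2, min_periods=1).mean()

# Expanding overall success rate of all previous transactions,
# used here to fill windows that contain no transaction
overall_sr = tx["success"].shift().expanding().mean()
time_sr_filled = time_sr.fillna(overall_sr)

print(pd.DataFrame({"time_sr": time_sr, "event_sr": event_sr, "filled": time_sr_filled}))
```

The last transaction has no predecessor within 300 seconds, so its time-window value is missing and is replaced by the expanding rate. The very first transaction has no history at all and stays missing in this sketch.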

In [5]:
def get_raw_data(dataPath = './data/PSP_Jan_Feb_2019.xlsx', pathDb = './data/PSP_Data.sqlite', table = "TB001_DATA_RAW"):
    
    if not hf.checkIfTableDbExists(pathDb, table):
        out = pd.read_excel(dataPath)
        out = out.drop(["Unnamed: 0"], axis = 1)
        out["tmsp"] = pd.to_datetime(out["tmsp"])
        out = out.sort_values(by = ["tmsp"], ascending = True)

        hf.writeDb(out, pathDb = pathDb, table_name = table)
    else:
        out = hf.readSqlTable(pathDb, table = table)
    
    return out

As seen in the data understanding phase and as described by Mu (2021) and Mao et al. (2023), the success rate for a combination of the payment service provider, the card type and the 3D identification could also be a valuable feature to integrate into the model. The time windows will be chosen wider in this case because the transaction density for those constellations is not very high. Mu (2021) proposed several days. In order to avoid spurious correlations only an overall expanding window calculation will be applied.

In [6]:
def applyDataCleaningFeatureEng(dataPath = './data/PSP_Jan_Feb_2019.xlsx', 
                                outPath = './data/data_prepared.csv', 
                                train_length = 0.7, 
                                pathDb = './data/PSP_Data.sqlite'
                               ):
    start_pipeline = time.time()
    
    if not hf.checkIfTableDbExists(pathDb, "TB003_DATA_PREPARED"):
        start_time = time.time()
        print('=== Start raw data loading ===')
        out = get_raw_data(dataPath = dataPath, pathDb = pathDb)
        print("=== Elapsed Time: " + str(time.time() - start_time) + " seconds ===")
        print("Shape of dataframe: " + str(out.shape))

        if not hf.checkIfTableDbExists(pathDb, "TB002_DATA_CLEANED"):
            print("")
            start_time = time.time()
            print('=== Start filter rows ===')
            out = fr.selectRows(out)
            print("=== Elapsed Time: " + str(time.time() - start_time) + " seconds ===")
            print("Shape of dataframe: " + str(out.shape))

            print("")
            start_time = time.time()
            print('=== Start get timestamp information ===')
            out = get_tmsp_information(out)
            out = get_daytime(out)
            print("=== Elapsed Time: " + str(time.time() - start_time) + " seconds ===")
            print("Shape of dataframe: " + str(out.shape))

            print("")
            start_time = time.time()
            print('=== Start get amountgroups by Jenks natural breaks ===')
            out = get_amountgroup(out)
            hf.writeDb(out, pathDb = pathDb, table_name = "TB002_DATA_CLEANED")
            print("=== Elapsed Time: " + str(time.time() - start_time) + " seconds ===")
            print("Shape of dataframe: " + str(out.shape))
        else:
            print("=== Cleaned Data Table already exists - reading from DB ===")
            out = hf.readSqlTable(pathDb, "TB002_DATA_CLEANED")
            print("Shape of dataframe: " + str(out.shape))
        
        print("")
        start_time = time.time()
        print("=== Start Feature Engineering ===")
        print("=== Get overall success rates ===")
        out = getOverallSR(out)
        print("=== Get overall success rates for columns and column combinations")
        print("= PSP")
        out = combinatoric_SR(out, addColumns = ["PSP"])
        print("= PSP x card")
        out = combinatoric_SR(out, addColumns = ["PSP", "card"])
        print("= PSP x card x 3D_secured")
        out = combinatoric_SR(out, addColumns = ["PSP", "card", "3D_secured"])
        print("= PSP x card x 3D_secured x amountgroup_word")
        out = combinatoric_SR(out, addColumns = ["PSP", "card", "3D_secured", "amountgroup_word"])
        print("=== Elapsed Time: " + str(time.time() - start_time) + " seconds ===")
        print("Shape of dataframe: " + str(out.shape))
        
        print("")
        start_time = time.time()
        print("=== Get event window success rates for columns and column combinations")
        print("= PSP")
        out = combinatoric_event_window_SR(out, addColumns = ["PSP"])
        print("= PSP x card")
        out = combinatoric_event_window_SR(out, addColumns = ["PSP", "card"])
        print("= PSP x card x 3D_secured")
        out = combinatoric_event_window_SR(out, addColumns = ["PSP", "card", "3D_secured"])
        print("= PSP x card x 3D_secured x amountgroup_word")
        out = combinatoric_event_window_SR(out, addColumns = ["PSP", "card", "3D_secured", "amountgroup_word"])
        print("=== Elapsed Time: " + str(time.time() - start_time) + " seconds ===")
        print("Shape of dataframe: " + str(out.shape))
        
        print("")
        start_time = time.time()
        print("=== Get time window success rates for columns and column combinations")
        print("= PSP")
        out = combinatoric_time_window_SR(out, addColumns = ["PSP"])
        print("= PSP x card")
        out = combinatoric_time_window_SR(out, addColumns = ["PSP", "card"])
        print("= PSP x card x 3D_secured")
        out = combinatoric_time_window_SR(out, addColumns = ["PSP", "card", "3D_secured"])
        print("= PSP x card x 3D_secured x amountgroup_word")
        out = combinatoric_time_window_SR(out, addColumns = ["PSP", "card", "3D_secured", "amountgroup_word"])
        print("=== Elapsed Time: " + str(time.time() - start_time) + " seconds ===")
        print("Shape of dataframe: " + str(out.shape))
        
        hf.writeDb(out, pathDb = pathDb, table_name = "TB003_DATA_PREPARED")
    
    else:
        print("=== Prepared Data already exists - reading from DB ===")
        out = hf.readSqlTable(pathDb, "TB003_DATA_PREPARED")
     
    print("")
    print("============================")
    print("= time for whole pipeline: " + str(time.time() - start_pipeline) + " seconds")
    print("============================")
    
    return out

In [7]:
data_clean = applyDataCleaningFeatureEng()

=== Table already exists ===
=== Start raw data loading ===
=== Table already exists ===
=== Table TB001_DATA_RAW created successful ===
=== Elapsed Time: 6.3144142627716064 seconds ===
Shape of dataframe: (50410, 7)
=== Table already exists ===

=== Start filter rows ===
= Half time: 171.1493136882782 seconds
= End Time: 300.4188566207886 seconds
=== Elapsed Time: 300.4188566207886 seconds ===
Shape of dataframe: (27491, 12)

=== Start get timestamp information ===
=== Elapsed Time: 0.38375306129455566 seconds ===
Shape of dataframe: (27491, 20)

=== Start get amountgroups by Jenks natural breaks ===
= Jenks natural breaks are:
[0, 99, 175, 247, 330, 10000]
=== Table TB002_DATA_CLEANED created successful ===
=== Elapsed Time: 1.2694461345672607 seconds ===
Shape of dataframe: (27491, 21)

=== Start Feature Engineering ===
=== Get overall success rates ===
=== Get overall success rates for columns and column combinations
= PSP
PSP_SR
= PSP x card
PSP_card_SR
= PSP x card x 3D_secured
P

The information in the column ```tmsp``` is contained in the additionally created columns ```month```, ```dayOfMonth```, ```weekday```, ```holiday```, ```daytime``` and ```minuteOfDay```, so this column can be deleted. Furthermore, the raw timestamp column cannot be used in any machine learning model.

The columns ```daytime``` and ```time``` were created for data exploration only. They can be completely reproduced from the column ```minuteOfDay```, so they can also be deleted.

Also the columns ```amountgroup``` and ```amountgroup_word``` can be completely derived from the column ```amount``` and are artifacts from the previous data understanding steps. Highly correlated features containing redundant information can cause problems in many ML settings, so both columns will be removed.

Also the feature ```failPrevious``` which is a dummy-variable to indicate if a transaction has failed previously or not.

The columns ```lower```, ```upper```, ```numLower``` and ```numUpper``` were created for row-selection reasons and can also be excluded as features for the modeling phase.

In [8]:
try:
    print(data_clean[data_clean['PSP'] != "Simplecard"].success.mean())
    data_clean_dropped_woTime = hf.dropColumns(data = data_clean.copy(), 
        columns = ['tmsp_hour', 'daytime', 'time', 'failedPSP', 'amountgroup_word', 'lower', 'upper', 'numUpper'])
except NameError:
    print("=== Object does not exist ===")

0.41679289104311823


## Format Data
**Reformatted Data** <br>
As a final step before model building, it is helpful to check whether certain techniques require a particular format or order to the data. For example, it is not uncommon that a sequence algorithm requires the data to be presorted before running the model. Even if the model can perform the sorting for you, it may save processing time to use a Sort node prior to modeling. Consider the following questions when formatting data: 
* Which models do you plan to use? 
* Do these models require a particular data format or order? If changes are recommended, the processing tools in IBM SPSS Modeler can help you apply the necessary data manipulation.

Tree-based models are particularly useful for dealing with categorical features and finding insightful breakpoints in continuous variables. From a modeling perspective it therefore seems reasonable to expect good results from tree-based models. Furthermore, the dataset is not very large and contains only structured data. From a performance perspective ```XGBoost``` models have proven to be particularly competitive in Kaggle competitions for such data settings. Because ```XGBoost``` models are harder to tune and tend to overfit, ```Random Forest``` models will also be considered in a first modeling iteration. So ```Random Forest``` will be the starting point for the feature selection phase. The data has to be formatted in such a way that an ```XGBoost``` implementation in Python can deal with the dataset. This means all categorical and ordinal features in the dataset have to be encoded as numerical features. This can be achieved by One-Hot or Label-Encoding.

One-Hot encoding is useful when the categorical variables do not have too many unique values. Label-Encoding can be useful when the categorical variables have many unique values. Label-Encoding is also used for ordinal variables. So the cardinality of the categorical variables in the dataset has to be inspected first. The categorical features in the dataset are:
* country
* success
* PSP
* 3D_secured
* card
* month
* dayOfMonth
* weekday
* weekend
* holiday

From those variables the following variables are already in a numeric format and can be used in ML models:
* success
* 3D_secured
* dayOfMonth
* weekend
* holiday

The cardinalities of the remaining variables are:
* PSP: 4
* card: 3
* month: 2
* weekday: 7

This shows that the variables still to be formatted have a low cardinality, so they will be One-Hot encoded for modeling. Some models assume that one dummy category is left out, because its value can be perfectly derived from the other dummy variables; to keep the data setup stable and robust for the subsequent steps, one category per variable is therefore dropped. In addition, for some modeling approaches such as linear models, continuous data should be normalized, so this is also done as part of data formatting. The normalization step is done after splitting the data into train and test data.
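The effect of leaving out one dummy category can be seen in a small sketch with `pd.get_dummies`. The toy frame is made up for illustration; the PSP names are the ones mentioned earlier in the text, and their exact spelling in the data is an assumption.

```python
import pandas as pd

toy_psp = pd.DataFrame({
    "PSP": ["Goldcard", "Moneycard", "Simplecard", "UK Card"],
    "amount": [50.0, 120.0, 80.0, 200.0],
})

# drop_first=True removes the first category in sorted order (here: Goldcard)
encoded = pd.get_dummies(toy_psp, columns=["PSP"], drop_first=True)
print(encoded)

# The dropped category is still recoverable: all remaining dummies are zero
is_goldcard = encoded.filter(like="PSP_").sum(axis=1) == 0
print(is_goldcard.tolist())
```

Four categories are therefore represented by three dummy columns without losing information.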

For the remaining continuous variable ```amount``` we don't have to normalize the data, because we are preparing for a tree-based modeling scenario.

In [9]:
def formatData(data, columns = ['PSP', 'card', 'month', 'weekday', 'country']):
    out = data.copy()
    
    out = pd.get_dummies(out, columns=columns, drop_first=True)
    print("=== Number of missing values ===")
    print(out.isna().sum().sum())
    
    return out

In [10]:
data_formatted_time = formatData(data_clean_dropped_woTime)

=== Number of missing values ===
0


In order to prepare the subsequent modeling steps we first have to split the data into a train-validate-test design. For the problem at hand, we can do that by applying a time-based splitting strategy or a random splitting strategy. A well-established splitting ratio in many ML settings is 70 % train, 15 % validate and 15 % test data. Because we have an imbalanced data set, it is better to choose a stratified splitting strategy for the random splitting approach, so that the ratios of success and no-success observations are the same in all datasets. Because we do not know at this time which model performs best for the given task, all numerical variables are brought to a unified scale using a min-max scaling approach. This approach brings a numerical feature into a specified range. Because there are many dummy variables in the dataset, the range was fixed between 0 and 1.

***
Min-Max Scaler<br>
***
$$x_{scaled} = \displaystyle \frac{x - x_{min}}{x_{max} - x_{min}}$$
***

As a first step we have to separate the X-matrix with all intended features and the y-vector. In this case the y-vector is the column ```success```.
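The time-based 70/15/15 split and the min-max formula can be combined in a short sketch. The helper names `time_split`, `min_max_fit` and `min_max_apply` and the toy data are hypothetical and independent of the notebook's own splitting functions. The minimum and maximum are taken from the training part only, so that no information from later data enters the scaling.

```python
import numpy as np
import pandas as pd

def time_split(df, train_size=0.7, validate_size=0.15, time_col="tmsp"):
    """Chronological split into train, validate and test (hypothetical helper)."""
    ordered = df.sort_values(time_col)
    n = len(ordered)
    train_end = int(round(n * train_size))
    validate_end = train_end + int(round(n * validate_size))
    return (ordered.iloc[:train_end],
            ordered.iloc[train_end:validate_end],
            ordered.iloc[validate_end:])

def min_max_fit(train, columns):
    """x_min and x_max per column, estimated on the training data only."""
    return train[columns].min(), train[columns].max()

def min_max_apply(df, columns, col_min, col_max):
    """x_scaled = (x - x_min) / (x_max - x_min)"""
    out = df.copy()
    out[columns] = (out[columns] - col_min) / (col_max - col_min)
    return out

# Toy data: 20 daily transactions with steadily increasing amounts
toy_data = pd.DataFrame({
    "tmsp": pd.date_range("2019-01-01", periods=20, freq="D"),
    "amount": np.arange(1, 21) * 10.0,
    "success": [0, 1] * 10,
})
train, validate, test = time_split(toy_data)
col_min, col_max = min_max_fit(train, ["amount"])
train_scaled = min_max_apply(train, ["amount"], col_min, col_max)
test_scaled = min_max_apply(test, ["amount"], col_min, col_max)
print(len(train), len(validate), len(test))
print(test_scaled["amount"].tolist())
```

Because the scaler parameters come from the training period, values in the validation or test period can fall outside the range between 0 and 1, as the toy amounts do here.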

In [11]:
def getColumnsToScale(data):
    from pandas.api.types import is_numeric_dtype
    out = []
    for column in data.columns:
        if is_numeric_dtype(data[column]):
            # columns with a maximum of at most 1 (e.g. dummies) are already in range
            if data[column].max() > 1:
                out.append(column)
        else:
            print("Column " + column + " is not numeric")
    
    return out

In [12]:
def applyRandomSplitting(data, train_size = 0.7, test_size = 0.15, validate_size = 0.15):
    # Note: the train share is implied as 1 - (test_size + validate_size); train_size is not used directly
    applyData = data.copy()
    y = applyData['success']
    X = hf.dropColumns(applyData, columns = ["success"])
    
    # Determine the columns to scale from the features only, not from the target
    scale_columns = getColumnsToScale(X)
    # First split: train vs. hold-out (validate + test), stratified by the target
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=test_size + validate_size, random_state=1977)
    # Second split: the part returned as "test" here becomes the validation set, so its share is validate_size
    X_test, X_validate, y_test, y_validate = train_test_split(X_test, y_test, stratify=y_test, 
                                                              test_size=validate_size/(test_size + validate_size), 
                                                              random_state=1977)
    # Fit the scaler on the training data only and reuse it for test and validate
    scaler = MinMaxScaler()
    X_train[scale_columns] = scaler.fit_transform(X_train[scale_columns])
    X_test[scale_columns] = scaler.transform(X_test[scale_columns])
    X_validate[scale_columns] = scaler.transform(X_validate[scale_columns])
    
    print("= Success rate in y_train: " + str(y_train.mean()))
    print("= Success rate in y_validate: " + str(y_validate.mean()))
    print("= Success rate in y_test: " + str(y_test.mean()))
    
    return (X, y, X_train, y_train, X_validate, y_validate, X_test, y_test)
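
The stratified 70/15/15 design described above can be checked on synthetic data. This is a minimal sketch with made-up labels (20 % positives) and an arbitrary `random_state`; it is not the case-study data and only illustrates that two chained calls to ```train_test_split``` with ```stratify``` keep the class ratio in all three parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (20 % positives); illustrative only.
y = np.array([1] * 200 + [0] * 800)
X = np.arange(len(y)).reshape(-1, 1)

# First split: 70 % train, 30 % hold-out.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# Second split: halve the hold-out into validate and test (15 % each overall).
X_validate, X_test, y_validate, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, stratify=y_hold, random_state=0)

# Report size and share of positives for each part.
for name, part in [("train", y_train), ("validate", y_validate), ("test", y_test)]:
    print(name, len(part), round(float(part.mean()), 3))
```

Without ```stratify```, the share of positives in the smaller validate and test parts could drift noticeably from the overall rate purely by chance.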

In [13]:
def applyTimeSplitting(data, train_size = 0.7, test_size = 0.15, validate_size = 0.15, time_col = "tmsp"):
    applyData = data.copy()
    applyData = applyData.sort_values(by = [time_col], ascending = True)
    length = len(applyData)
    
    train_length = int(np.round(length*train_size))
    test_length = int(length - train_length)
    validate_length = int(np.round(test_length * (validate_size/(validate_size + test_size))))
    test_length = int(test_length - validate_length)
    
    y = applyData['success']
    X = hf.dropColumns(applyData, columns = ["success", time_col])
    
    assert (test_length + validate_length + train_length) == length, f"number expected: {length}, got: {test_length + validate_length + train_length}"
    
    X_train = X.copy().iloc[:train_length, :]
    y_train = y.copy().iloc[:train_length]
    X_validate = X.copy().iloc[train_length:(train_length + validate_length), :]
    y_validate = y.copy().iloc[train_length:(train_length + validate_length)]
    X_test = X.copy().iloc[(train_length + validate_length):, :]
    y_test = y.copy().iloc[(train_length + validate_length):]
    
    assert (len(X_train) + len(X_validate) + len(X_test)) == length, f"number expected: {length}, got: {(len(X_train) + len(X_validate) + len(X_test))}"
    
    scale_columns = getColumnsToScale(X)
    scaler = MinMaxScaler()
    X_train[scale_columns] = scaler.fit_transform(X_train[scale_columns])
    parameters = hf.loadPickle('./data/parameters.pkl')
    parameters["scaler"] = scaler
    parameters["scale_columns"] = scale_columns
    hf.writePickle('./data/parameters.pkl', parameters)
    X_test[scale_columns] = scaler.transform(X_test[scale_columns])
    X_validate[scale_columns] = scaler.transform(X_validate[scale_columns])
    
    print("= Success rate in y_train: " + str(y_train.mean()))
    print("= Success rate in y_validate: " + str(y_validate.mean()))
    print("= Success rate in y_test: " + str(y_test.mean()))
    
    return (X, y, X_train, y_train, X_validate, y_validate, X_test, y_test)
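
The length arithmetic of the time-based split can be followed on a small example. The sketch below uses plain Python and a hypothetical helper `split_lengths` that mirrors the calculation in ```applyTimeSplitting```; the integer list is only a stand-in for sorted transaction timestamps.

```python
def split_lengths(n_rows, train_size=0.7, validate_size=0.15, test_size=0.15):
    """Return (train, validate, test) lengths that always sum to n_rows."""
    train_len = int(round(n_rows * train_size))
    rest = n_rows - train_len
    # The validate share is taken from the remainder; test gets what is left.
    validate_len = int(round(rest * (validate_size / (validate_size + test_size))))
    test_len = rest - validate_len
    return train_len, validate_len, test_len

# Stand-in for transactions already sorted by time (oldest first).
timestamps = list(range(20))

train_len, validate_len, test_len = split_lengths(len(timestamps))
train = timestamps[:train_len]
validate = timestamps[train_len:train_len + validate_len]
test = timestamps[train_len + validate_len:]

print(train_len, validate_len, test_len)  # 14 3 3
# Every training row is older than every validation row, which is older than every test row.
print(max(train) < min(validate) < max(validate) < min(test))  # True
```

Because the rows are sliced in time order, the model is always evaluated on transactions that occurred after those it was trained on. Unlike the stratified random split, this does not enforce equal success rates across the three parts.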

In [14]:
X, y, X_train, y_train, X_validate, y_validate, X_test, y_test = applyTimeSplitting(data_formatted_time)
if not hf.checkIfTableDbExists('./data/PSP_Data.sqlite', "X"):
    hf.writeDb(X, pathDb = './data/PSP_Data.sqlite', table_name = "X")
    hf.writeDb(y, pathDb = './data/PSP_Data.sqlite', table_name = "y")
if not hf.checkIfTableDbExists('./data/PSP_Data.sqlite', "X_train"):
    hf.writeDb(X_train, pathDb = './data/PSP_Data.sqlite', table_name = "X_train")
    hf.writeDb(y_train, pathDb = './data/PSP_Data.sqlite', table_name = "y_train")
if not hf.checkIfTableDbExists('./data/PSP_Data.sqlite', "X_validate"):
    hf.writeDb(X_validate, pathDb = './data/PSP_Data.sqlite', table_name = "X_validate")
    hf.writeDb(y_validate, pathDb = './data/PSP_Data.sqlite', table_name = "y_validate")
if not hf.checkIfTableDbExists('./data/PSP_Data.sqlite', "X_test"):
    hf.writeDb(X_test, pathDb = './data/PSP_Data.sqlite', table_name = "X_test")
    hf.writeDb(y_test, pathDb = './data/PSP_Data.sqlite', table_name = "y_test")

= Success rate in y_train: 0.3805861567241738
= Success rate in y_validate: 0.34990300678952474
= Success rate in y_test: 0.352413291292748
=== Table already exists ===
=== Table X created successful ===
=== Table y created successful ===
=== Table already exists ===
=== Table X_train created successful ===
=== Table y_train created successful ===
=== Table already exists ===
=== Table X_validate created successful ===
=== Table y_validate created successful ===
=== Table already exists ===
=== Table X_test created successful ===
=== Table y_test created successful ===


# References

<p>Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4. https://doi.org/10.1214/09-SS054</p>
<p>Athanasopoulos, G., & Hyndman, R. J. (2021). Forecasting: Principles and Practice (3rd ed.). OTexts. OTexts.com/fpp3</p>
<p>Bygari, R., Gupta, A., Raghuvanshi, S., Bapna, A., & Sahu, B. (2021). An AI-powered Smart Routing Solution for Payment Systems. 2026–2033. https://doi.org/10.1109/BigData52589.2021.9671961</p>
<p>Chetcuti, J. (2020). PhiCor: Calculating Phi coefficient of Association. (edsbas.C61D16BC). BASE. https://doi.org/10.5281/zenodo.3898308</p>
<p>IBM Corporation. (2021). IBM Documentation: IBM SPSS Modeler CRISP-DM Guide. https://www.ibm.com/docs/en/spss-modeler/18.1.1?topic=spss-modeler-crisp-dm-guide</p>
<p>Kornbrot, D. (2005). Point Biserial Correlation. https://doi.org/10.1002/0470013192.bsa485</p>
<p>Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling (1st ed. 2013, Corr. 2nd printing 2018 Edition). Springer.</p>
<p>Kuhn, M., & Johnson, K. (2019). Feature Engineering and Selection: A Practical Approach for Predictive Models. Taylor & Francis Ltd. http://www.feat.engineering/</p>
<p>Lakshmanan, V., Robinson, S., & Munn, M. (2020). Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps. O’Reilly Media.</p>
<p>Leonard, M., & Wolfe, B. (2005). Mining transactional and time series data. abstract, presentation and paper, SUGI, 10–13.</p>
<p>Mao, X., Xu, S., Kumar, R., R, V., Hong, X., & Menghani, D. (2023). Improving the customer’s experience via ML-driven payment routing. https://engineering.linkedin.com/blog/2023/improving-the-customer-s-experience-via-ml-driven-payment-routin</p>
<p>Mu, L. (2021). Using Machine Learning to Improve Payment Authorization Rate | The PayPal Technology Blog. https://medium.com/paypal-tech/using-machine-learning-to-improve-payment-authorization-rates-bc3b2cbf4999</p>
<p>Wirth, R., & Hipp, J. (2000). CRISP-DM: Towards a standard process model for data mining. Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining.</p>