# Baseline feature transformation

The simulated dataset generated in the previous section is simple. It only contains the essential features that characterize a payment card transaction. These are: a unique identifier for the transaction, the date and time of the transaction, the transaction amount, a unique identifier for the customer, a unique number for the merchant, and a binary variable that labels the transaction as legitimate or fraudulent (0 for legitimate or 1 for fraudulent). Fig. 1 provides the first three rows of the simulated dataset:
 
![First three transactions of the simulated dataset](images/tx_table.png)
<p style="text-align: center;">
Fig. 1. The first three transactions in the simulated dataset used in this chapter.
</p>



What each row essentially tells us is that, at 00:01:17, on the 1st of April 2018, a customer with the ID 757 made a payment of a value of 49.88 to a merchant with the ID 3548, and that the transaction was not fraudulent. Then, at 00:02:40, on the 1st of April 2018, a customer with the ID 1145 made a payment of a value of 8.55 to a merchant with the ID 1983, and that the transaction was not fraudulent. And so on. The simulated dataset is a long list of such transactions (1.8 million in total). The variable `transaction_ID` is a unique identifier for each transaction.  

While conceptually simple for a human, such a set of features is however not appropriate for a machine learning predictive model. Machine learning algorithms typically require *numerical* and *ordered* features. Numerical means that the type of variable must be an integer or a real number. Ordered means that the order of the values of a variable is meaningful. 

In this dataset, the only numerical and ordered features are the transaction amount and the fraud label. The date is a pandas Timestamp, and therefore not numerical. The identifiers for the transactions, customers, and terminals are numerical but not ordered: it would not make sense to assume, for example, that the terminal with ID 3548 is 'bigger' or 'larger' than the terminal with ID 1983. Rather, these identifiers represent distinct 'entities', which are referred to as *categorical* features. 
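As a small illustration of this distinction, the sketch below builds a two-row toy dataframe (the column names mirror those of the dataset; the dataframe itself is only an example) and checks which columns pandas treats as numerical. Passing this check says nothing about ordering: the identifier columns are numerical but remain categorical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy dataframe for illustration only (column names mirror the dataset)
toy_df = pd.DataFrame({
    "TX_DATETIME": pd.to_datetime(["2018-04-01 00:07:56", "2018-04-01 00:30:05"]),
    "CUSTOMER_ID": [2, 360],
    "TERMINAL_ID": [316, 584],
    "TX_AMOUNT": [146.0, 92.74],
    "TX_FRAUD": [0, 0],
})

# True for integer and float columns, False for the timestamp column
numeric_columns = {col: is_numeric_dtype(toy_df[col]) for col in toy_df.columns}
print(numeric_columns)
```

Only `TX_DATETIME` fails the numerical check, while `CUSTOMER_ID` and `TERMINAL_ID` pass it even though comparing two identifiers is meaningless.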

There is unfortunately no standard procedure to deal with non-numerical or categorical features. The topic is known in the machine learning literature as *feature transformation*, and will be covered later in this book. In essence, the goal of feature transformation is to design new features that are assumed to be relevant for a predictive problem. The design of these features is usually problem dependent, and involves domain knowledge.

In this section, we will implement three types of feature transformation that are known to be relevant for payment card fraud detection. 

The first type of transformation involves the date/time variable, and consists in creating binary features that characterize potentially relevant periods. We will create two such features. The first one will characterize whether a transaction occurs during a week day, or during the week-end. The second will characterize whether a transaction occurs during the day or the night. These features can be useful since it has been observed that fraudulent patterns differ between week days and week-ends, and between the day and night.  

The second type of transformation involves the customer ID, and consists in creating features that characterize the customer spending behaviors. We will follow the RFM (Recency, Frequency, Monetary value) framework proposed in {cite}`VANVLASSELAER201538`, and keep track of the average spending amount and number of transactions for each customer, for three window sizes. This will lead to the creation of six new features.

The third type of transformation involves the terminal ID, and consists in creating new features that characterize the 'risk' associated with the terminal. The risk will be defined as the average number of frauds that were observed on the terminal for three window sizes. Together with the number of transactions on the terminal over the same windows, this will lead to the creation of six new features. 

The table below summarizes the types of transformation that will be performed, and the new features that will be created. 

|Original feature name|Original feature type|Transformation|Number of new features|New feature(s) type|
|---|---|---|---|---|
|TX\_DATETIME | pandas Timestamp |0 if transaction during week day, 1 if transaction during week-end. The new feature is called TX_DURING_WEEKEND.|1|Integer (0/1)|
|TX\_DATETIME | pandas Timestamp |0 if transaction between 6am and midnight, 1 if transaction between midnight and 6am. The new feature is called TX_DURING_NIGHT.|1|Integer (0/1)|
|CUSTOMER\_ID | Categorical variable |Number of transactions by the customer in the last n day(s), for n in {1,7,30}. The new features are called CUSTOMER_ID_NB_TX_nDAY_WINDOW.|3|Integer|
|CUSTOMER\_ID | Categorical variable |Average spending amount in the last n day(s), for n in {1,7,30}. The new features are called CUSTOMER_ID_AVG_AMOUNT_nDAY_WINDOW.|3|Real|
|TERMINAL\_ID | Categorical variable |Number of transactions on the terminal in the last n+d day(s), for n in {1,7,30} and d=7. The parameter d is called delay and will be discussed later in this notebook. The new features are called TERMINAL_ID_NB_TX_nDAY_WINDOW.|3|Integer|
|TERMINAL\_ID | Categorical variable |Average number of frauds on the terminal in the last n+d day(s), for n in {1,7,30} and d=7. The parameter d is called delay and will be discussed later in this notebook. The new features are called TERMINAL_ID_RISK_nDAY_WINDOW.|3|Real|

The following sections provide the implementation for each of these three transformations. After the transformations, a set of 14 new features will be created.
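As a quick sanity check on the count, the feature names listed in the table can be generated programmatically. This is only an illustrative sketch: the list `new_features` and the loop structure are not part of the notebook's pipeline.

```python
# Window sizes (in days) used for the customer and terminal aggregates
windows = [1, 7, 30]

# The two binary date/time features
new_features = ["TX_DURING_WEEKEND", "TX_DURING_NIGHT"]

# Two aggregates per entity, each computed for every window size
for prefix, aggregates in [("CUSTOMER_ID", ["NB_TX", "AVG_AMOUNT"]),
                           ("TERMINAL_ID", ["NB_TX", "RISK"])]:
    for n in windows:
        for aggregate in aggregates:
            new_features.append(f"{prefix}_{aggregate}_{n}DAY_WINDOW")

print(len(new_features))  # 14
```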

In [1]:
# Necessary imports for this notebook

import os

import pandas as pd
import numpy as np
import datetime

# For Pandas parallelisation
#from pandarallel import pandarallel
#pandarallel.initialize()

# For plotting
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns


%run ../helper_functions/shared_functions_basic.ipynb

## Loading of dataset

Let us first load the transaction data simulated in the previous notebook. We will load the transaction files from April to September. Files can be loaded using the `read_from_files` function in the `shared_functions.ipynb` notebook. The function was put in this notebook since it will be used frequently throughout this book.

The function takes as input the folder where the data files are located, and the dates that define the period to load (`BEGIN_DATE` and `END_DATE`). It returns a dataframe of transactions. The transactions are sorted in chronological order. 


In [2]:
DIR_INPUT='./simulation_data/raw_data/' 

BEGIN_DATE = "2018-04-01"
END_DATE = "2018-09-30"

print("Load  files")
%time transactions_df=read_from_files(DIR_INPUT, BEGIN_DATE, END_DATE)
print("{0} transactions loaded, containing {1} fraudulent transactions".format(len(transactions_df),transactions_df.TX_FRAUD.sum()))


Load  files
Wall time: 3.6 s
173832 transactions loaded, containing 13678 fraudulent transactions


In [3]:
transactions_df.head()

Unnamed: 0,TRANSACTION_ID,TX_DATETIME,CUSTOMER_ID,TERMINAL_ID,TX_AMOUNT,TX_TIME_SECONDS,TX_TIME_DAYS,TX_FRAUD,TX_FRAUD_SCENARIO
0,0,2018-04-01 00:07:56,2,316,146.0,476,0,0,0
1,1,2018-04-01 00:30:05,360,584,92.74,1805,0,0,0
2,2,2018-04-01 00:32:35,183,992,39.3,1955,0,0,0
3,3,2018-04-01 00:43:59,382,283,15.35,2639,0,0,0
4,4,2018-04-01 00:45:51,381,799,23.15,2751,0,0,0


## Date and time transformations

We will create two new binary features from the transaction dates and times. The first will characterize whether a transaction occurs during a week day (value 0) or a week-end (1), and will be called TX_DURING_WEEKEND. The second will characterize whether a transaction occurs during the day (0) or during the night (1). The night is defined as the hours between midnight and 6am. It will be called TX_DURING_NIGHT. 

For the TX_DURING_WEEKEND feature, we define a function `is_weekend` that takes as input a pandas Timestamp, and returns 1 if the date is during a week-end, or 0 otherwise. The Timestamp object conveniently provides the `weekday` method to help in computing this value.

In [4]:
def is_weekend(tx_datetime):
    
    # Transform date into week day (0 is Monday, 6 is Sunday)
    weekday = tx_datetime.weekday()
    # Binary value: 0 if week day, 1 if week-end
    is_weekend = weekday>=5
    
    return int(is_weekend)


It is then straightforward to compute this feature for all transactions using the pandas `apply` function. 

In [5]:
%time transactions_df['TX_DURING_WEEKEND']=transactions_df.TX_DATETIME.apply(is_weekend)

Wall time: 1.1 s


We follow the same logic to implement the TX_DURING_NIGHT feature. First, we define a function `is_night` that takes as input a pandas Timestamp, and returns 1 if the time is during the night, or 0 otherwise. The Timestamp object conveniently provides the `hour` property to help in computing this value.

In [6]:
def is_night(tx_datetime):
    
    # Get the hour of the transaction
    tx_hour = tx_datetime.hour
    # Binary value: 1 if hour is strictly less than 6 (midnight to 5:59am), and 0 otherwise
    is_night = tx_hour<6
    
    return int(is_night)

In [7]:
%time transactions_df['TX_DURING_NIGHT']=transactions_df.TX_DATETIME.apply(is_night)

Wall time: 992 ms
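Since `apply` calls a Python function once per row, a column-wise computation with the pandas `.dt` accessor is a common alternative. The sketch below is not the notebook's implementation: it uses a three-row toy dataframe, and takes the night to be the hours strictly before 6am, following the definition given in the text.

```python
import pandas as pd

# Toy timestamps for illustration only
toy_df = pd.DataFrame({"TX_DATETIME": pd.to_datetime([
    "2018-04-01 07:19:05",  # Sunday morning
    "2018-04-02 00:06:33",  # Monday, shortly after midnight
    "2018-09-30 23:54:38",  # Sunday, shortly before midnight
])})

# weekday is 0 for Monday and 6 for Sunday, so values >= 5 are week-end days
toy_df["TX_DURING_WEEKEND"] = (toy_df["TX_DATETIME"].dt.weekday >= 5).astype(int)

# Night is taken here as hours 0 to 5 (assumption matching the text's definition)
toy_df["TX_DURING_NIGHT"] = (toy_df["TX_DATETIME"].dt.hour < 6).astype(int)

print(toy_df)
```

On this toy input the week-end flag is 1, 0, 1 and the night flag is 0, 1, 0.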


Let us check that these features were correctly computed.

In [8]:
transactions_df[transactions_df.TX_TIME_DAYS>=1]

Unnamed: 0,TRANSACTION_ID,TX_DATETIME,CUSTOMER_ID,TERMINAL_ID,TX_AMOUNT,TX_TIME_SECONDS,TX_TIME_DAYS,TX_FRAUD,TX_FRAUD_SCENARIO,TX_DURING_WEEKEND,TX_DURING_NIGHT
960,960,2018-04-02 00:06:33,95,995,106.67,86793,1,0,0,0,1
961,961,2018-04-02 00:07:31,321,603,20.44,86851,1,0,0,0,1
962,962,2018-04-02 00:16:22,308,666,5.56,87382,1,0,0,0,1
963,963,2018-04-02 00:22:24,273,130,6.49,87744,1,0,0,0,1
964,964,2018-04-02 00:29:46,437,26,73.55,88186,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
173827,173827,2018-09-30 23:32:59,140,359,97.55,15809579,182,0,0,1,0
173828,173828,2018-09-30 23:33:02,221,41,61.26,15809582,182,0,0,1,0
173829,173829,2018-09-30 23:46:15,101,777,58.80,15810375,182,0,0,1,0
173830,173830,2018-09-30 23:54:38,7,705,15.08,15810878,182,0,0,1,0


2018-04-02 was a Monday, and 2018-09-30 a Sunday. These dates are correctly flagged as week day and week-end, respectively. The day and night feature is also correctly set for the first transactions, which happen shortly after midnight, and for the last transactions, which happen shortly before midnight. 
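The day of the week for the first and last dates shown in the output above can be confirmed directly with the Timestamp `day_name` method. This is a small check added for illustration, with the dictionary `day_names` introduced only for this example.

```python
import pandas as pd

# First and last dates displayed in the output above
dates_to_check = ["2018-04-02", "2018-09-30"]

# Map each date to its English day name and its weekday index (0 is Monday)
day_names = {d: pd.Timestamp(d).day_name() for d in dates_to_check}
weekdays = {d: pd.Timestamp(d).weekday() for d in dates_to_check}

print(day_names)
print(weekdays)
```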

## Customer ID transformations

Let us now proceed with customer ID transformations. We will take inspiration from the RFM (Recency, Frequency, Monetary value) framework proposed in {cite}`VANVLASSELAER201538`, and compute two of these features over three time windows. The first feature will be the number of transactions that occur within a time window (Frequency). The second will be the average amount spent in these transactions (Monetary value). The time windows will be set to one, seven, and thirty days. This will generate six new features. 

Let us implement these transformations by writing a `get_customer_spending_behaviour_features` function. The function takes as inputs the set of transactions for a customer and a set of window sizes. It returns a dataframe with the six new features. Our implementation relies on the pandas `rolling` function, which makes it easy to compute aggregates over a time window.


In [9]:
def get_customer_spending_behaviour_features(customer_transactions, windows_size_in_days=[1,7,30]):
    
    # Let us first order transactions chronologically
    customer_transactions=customer_transactions.sort_values('TX_DATETIME')
    
    # The transaction date and time is set as the index, which will allow the use of the rolling function 
    customer_transactions.index=customer_transactions.TX_DATETIME
    
    # For each window size
    for window_size in windows_size_in_days:
        
        # Compute the sum of the transaction amounts and the number of transactions for the given window size
        SUM_AMOUNT_TX_WINDOW=customer_transactions['TX_AMOUNT'].rolling(str(window_size)+'d').sum()
        NB_TX_WINDOW=customer_transactions['TX_AMOUNT'].rolling(str(window_size)+'d').count()
    
        # Compute the average transaction amount for the given window size
        # NB_TX_WINDOW is always >0 since current transaction is always included
        AVG_AMOUNT_TX_WINDOW=SUM_AMOUNT_TX_WINDOW/NB_TX_WINDOW
    
        # Save the features in the dataframe
        customer_transactions['CUSTOMER_ID_NB_TX_'+str(window_size)+'DAY_WINDOW']=list(NB_TX_WINDOW)
        customer_transactions['CUSTOMER_ID_AVG_AMOUNT_'+str(window_size)+'DAY_WINDOW']=list(AVG_AMOUNT_TX_WINDOW)
    
    # Reindex according to transaction IDs
    customer_transactions.index=customer_transactions.TRANSACTION_ID
        
    # And return the dataframe with the new features
    return customer_transactions
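To see what a time-based rolling window computes, here is a minimal sketch on three made-up transactions of a single customer. The window string `"1D"` plays the same role as `str(window_size)+'d'` in the function above. With the default settings, each window covers the preceding 24 hours and includes the current transaction.

```python
import pandas as pd

# Three made-up transactions, indexed by their date and time
toy = pd.DataFrame({
    "TX_DATETIME": pd.to_datetime(["2018-04-01 07:00:00",
                                   "2018-04-01 18:00:00",
                                   "2018-04-02 08:00:00"]),
    "TX_AMOUNT": [100.0, 50.0, 30.0],
}).set_index("TX_DATETIME")

# Number of transactions and average amount over the last day
nb_tx_1d = toy["TX_AMOUNT"].rolling("1D").count()
avg_1d = toy["TX_AMOUNT"].rolling("1D").mean()

print(pd.DataFrame({"NB_TX_1DAY": nb_tx_1d, "AVG_AMOUNT_1DAY": avg_1d}))
```

The counts are 1, 2, 2 and the averages 100, 75, 40: the third transaction no longer sees the first one, which is more than 24 hours older.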


Let us compute these aggregates for the first customer.

In [10]:
spending_behaviour_customer_0=get_customer_spending_behaviour_features(transactions_df[transactions_df.CUSTOMER_ID==0])
spending_behaviour_customer_0

Unnamed: 0_level_0,TRANSACTION_ID,TX_DATETIME,CUSTOMER_ID,TERMINAL_ID,TX_AMOUNT,TX_TIME_SECONDS,TX_TIME_DAYS,TX_FRAUD,TX_FRAUD_SCENARIO,TX_DURING_WEEKEND,TX_DURING_NIGHT,CUSTOMER_ID_NB_TX_1DAY_WINDOW,CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW,CUSTOMER_ID_NB_TX_7DAY_WINDOW,CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW,CUSTOMER_ID_NB_TX_30DAY_WINDOW,CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW
TRANSACTION_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
195,195,2018-04-01 07:19:05,0,996,123.59,26345,0,0,0,1,0,1.0,123.590000,1.0,123.590000,1.0,123.590000
850,850,2018-04-01 18:00:16,0,996,77.34,64816,0,0,0,1,0,2.0,100.465000,2.0,100.465000,2.0,100.465000
887,887,2018-04-01 19:02:02,0,241,46.51,68522,0,0,0,1,0,3.0,82.480000,3.0,82.480000,3.0,82.480000
1254,1254,2018-04-02 08:51:06,0,330,54.72,118266,1,0,0,0,0,3.0,59.523333,4.0,75.540000,4.0,75.540000
1617,1617,2018-04-02 14:05:38,0,29,63.30,137138,1,0,0,0,0,4.0,60.467500,5.0,73.092000,5.0,73.092000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173450,173450,2018-09-30 13:38:41,0,241,38.23,15773921,182,0,0,1,0,5.0,64.388000,28.0,57.306429,89.0,63.097640
173485,173485,2018-09-30 14:10:21,0,996,43.60,15775821,182,0,0,1,0,6.0,60.923333,29.0,56.833793,89.0,62.433933
173516,173516,2018-09-30 14:34:30,0,29,69.69,15777270,182,0,0,1,0,7.0,62.175714,29.0,57.872414,90.0,62.514556
173539,173539,2018-09-30 14:54:59,0,144,91.26,15778499,182,0,0,1,0,8.0,65.811250,30.0,58.985333,90.0,61.882333


We can check that the new features are consistent with the customer profile (see previous notebook). For customer 0, the mean amount was mean_amount=62.26, and the transaction frequency was mean_nb_tx_per_day=2.18. These values are indeed closely matched by the features CUSTOMER_ID_NB_TX_30DAY_WINDOW and CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW.

Let us now generate these features for all customers. This is straightforward using the pandas groupby and apply methods.

In [11]:
%time transactions_df=transactions_df.groupby('CUSTOMER_ID').apply(lambda x: get_customer_spending_behaviour_features(x, windows_size_in_days=[1,7,30]))
transactions_df=transactions_df.sort_values('TX_DATETIME').reset_index(drop=True)


Wall time: 6.68 s


In [12]:
transactions_df

Unnamed: 0,TRANSACTION_ID,TX_DATETIME,CUSTOMER_ID,TERMINAL_ID,TX_AMOUNT,TX_TIME_SECONDS,TX_TIME_DAYS,TX_FRAUD,TX_FRAUD_SCENARIO,TX_DURING_WEEKEND,TX_DURING_NIGHT,CUSTOMER_ID_NB_TX_1DAY_WINDOW,CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW,CUSTOMER_ID_NB_TX_7DAY_WINDOW,CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW,CUSTOMER_ID_NB_TX_30DAY_WINDOW,CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW
0,0,2018-04-01 00:07:56,2,316,146.00,476,0,0,0,1,1,1.0,146.000000,1.0,146.000000,1.0,146.000000
1,1,2018-04-01 00:30:05,360,584,92.74,1805,0,0,0,1,1,1.0,92.740000,1.0,92.740000,1.0,92.740000
2,2,2018-04-01 00:32:35,183,992,39.30,1955,0,0,0,1,1,1.0,39.300000,1.0,39.300000,1.0,39.300000
3,3,2018-04-01 00:43:59,382,283,15.35,2639,0,0,0,1,1,1.0,15.350000,1.0,15.350000,1.0,15.350000
4,4,2018-04-01 00:45:51,381,799,23.15,2751,0,0,0,1,1,1.0,23.150000,1.0,23.150000,1.0,23.150000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173827,173827,2018-09-30 23:32:59,140,359,97.55,15809579,182,0,0,1,0,5.0,112.694000,29.0,95.726897,99.0,103.052121
173828,173828,2018-09-30 23:33:02,221,41,61.26,15809582,182,0,0,1,0,5.0,75.172000,20.0,80.281500,84.0,88.563214
173829,173829,2018-09-30 23:46:15,101,777,58.80,15810375,182,0,0,1,0,3.0,47.446667,9.0,49.744444,23.0,39.370435
173830,173830,2018-09-30 23:54:38,7,705,15.08,15810878,182,0,0,1,0,3.0,17.110000,24.0,29.880000,70.0,31.265000


## Terminal ID transformations

In [13]:
def get_count_risk_rolling_window(terminal_transactions, delta=7, windows_size_in_days=[1,7,30], feature="TERMINAL_ID"):
    
    # Order transactions chronologically
    terminal_transactions=terminal_transactions.sort_values('TX_DATETIME')
    
    # The transaction date and time is set as the index, which will allow the use of the rolling function 
    terminal_transactions.index=terminal_transactions.TX_DATETIME
    
    # Number of frauds and number of transactions over the delay period (last delta days)
    NB_FRAUD_DELTA=terminal_transactions['TX_FRAUD'].rolling(str(delta)+'d').sum()
    NB_TX_DELTA=terminal_transactions['TX_FRAUD'].rolling(str(delta)+'d').count()
    
    for window_size in windows_size_in_days:
    
        # Number of frauds and number of transactions over the delay period plus the window
        NB_FRAUD_DELTA_WINDOW=terminal_transactions['TX_FRAUD'].rolling(str(delta+window_size)+'d').sum()
        NB_TX_DELTA_WINDOW=terminal_transactions['TX_FRAUD'].rolling(str(delta+window_size)+'d').count()
    
        # Subtract the delay period, so that only the window preceding it is kept
        NB_FRAUD_WINDOW=NB_FRAUD_DELTA_WINDOW-NB_FRAUD_DELTA
        NB_TX_WINDOW=NB_TX_DELTA_WINDOW-NB_TX_DELTA
    
        # Risk: proportion of fraudulent transactions in the window
        # NB_TX_WINDOW can be 0 (no transaction in the window), in which case the risk is NaN
        RISK_WINDOW=NB_FRAUD_WINDOW/NB_TX_WINDOW
        
        # Save the features in the dataframe
        terminal_transactions[feature+'_NB_TX_'+str(window_size)+'DAY_WINDOW']=list(NB_TX_WINDOW)
        terminal_transactions[feature+'_RISK_'+str(window_size)+'DAY_WINDOW']=list(RISK_WINDOW)
        
    # Reindex according to transaction IDs
    terminal_transactions.index=terminal_transactions.TRANSACTION_ID
    # Replace NaN values (risk of empty windows) with 0
    terminal_transactions.fillna(0,inplace=True)
    
    return terminal_transactions
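The function above obtains the counts for a window shifted by the delay by subtracting two rolling aggregates: one over `delta+window_size` days and one over `delta` days. The sketch below reproduces this subtraction on four made-up transactions of a single terminal, with `delta=7` and a single window of seven days. The data and variable names are illustrative only.

```python
import pandas as pd

# Four made-up transactions on one terminal; only the first one is fraudulent
toy_terminal = pd.DataFrame({
    "TX_DATETIME": pd.to_datetime(["2018-04-01 12:00", "2018-04-05 12:00",
                                   "2018-04-09 12:00", "2018-04-10 12:00"]),
    "TX_FRAUD": [1, 0, 0, 0],
}).set_index("TX_DATETIME")

delta, window_size = 7, 7
fraud = toy_terminal["TX_FRAUD"]

# Aggregates over the delay period only
nb_fraud_delta = fraud.rolling(f"{delta}D").sum()
nb_tx_delta = fraud.rolling(f"{delta}D").count()

# Aggregates over the delay period plus the window
nb_fraud_delta_window = fraud.rolling(f"{delta + window_size}D").sum()
nb_tx_delta_window = fraud.rolling(f"{delta + window_size}D").count()

# Keep only what falls in the window that precedes the delay period
nb_tx_window = nb_tx_delta_window - nb_tx_delta
risk_window = ((nb_fraud_delta_window - nb_fraud_delta) / nb_tx_window).fillna(0)

print(pd.DataFrame({"NB_TX_7DAY": nb_tx_window, "RISK_7DAY": risk_window}))
```

For the first two transactions the shifted window is empty, so the count is 0 and the risk (0/0, hence NaN) is replaced by 0. For the last two, the only transaction older than seven days is the fraudulent one, so the count is 1 and the risk is 1.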


In [14]:
get_count_risk_rolling_window(transactions_df[transactions_df.TERMINAL_ID==0])

Unnamed: 0_level_0,TRANSACTION_ID,TX_DATETIME,CUSTOMER_ID,TERMINAL_ID,TX_AMOUNT,TX_TIME_SECONDS,TX_TIME_DAYS,TX_FRAUD,TX_FRAUD_SCENARIO,TX_DURING_WEEKEND,...,CUSTOMER_ID_NB_TX_7DAY_WINDOW,CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW,CUSTOMER_ID_NB_TX_30DAY_WINDOW,CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW,TERMINAL_ID_NB_TX_1DAY_WINDOW,TERMINAL_ID_RISK_1DAY_WINDOW,TERMINAL_ID_NB_TX_7DAY_WINDOW,TERMINAL_ID_RISK_7DAY_WINDOW,TERMINAL_ID_NB_TX_30DAY_WINDOW,TERMINAL_ID_RISK_30DAY_WINDOW
TRANSACTION_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1256,1256,2018-04-02 08:53:49,399,0,114.07,118429,1,0,0,0,...,1.0,114.07,1.0,114.07,0.0,0.0,0.0,0.0,0.0,0.0
3868,3868,2018-04-05 04:31:09,337,0,49.1,361869,4,0,0,0,...,5.0,30.19,5.0,30.19,0.0,0.0,0.0,0.0,0.0,0.0
4594,4594,2018-04-05 17:04:29,399,0,122.75,407069,4,0,0,0,...,4.0,136.375,4.0,136.375,0.0,0.0,0.0,0.0,0.0,0.0
9974,9974,2018-04-11 12:42:04,337,0,12.88,909724,10,0,0,0,...,9.0,33.992222,12.0,30.2525,0.0,0.0,1.0,0.0,1.0,0.0
13317,13317,2018-04-15 03:11:55,337,0,50.71,1221115,14,0,0,1,...,7.0,37.694286,14.0,34.741429,0.0,0.0,3.0,0.0,3.0,0.0
14449,14449,2018-04-16 07:49:29,399,0,57.58,1324169,15,0,0,0,...,4.0,67.9225,9.0,106.967778,0.0,0.0,3.0,0.0,3.0,0.0
14673,14673,2018-04-16 10:51:49,337,0,39.11,1335109,15,0,0,0,...,10.0,37.049,17.0,34.882941,0.0,0.0,2.0,0.0,3.0,0.0
14751,14751,2018-04-16 12:05:54,399,0,13.99,1339554,15,0,0,0,...,5.0,57.136,10.0,97.67,0.0,0.0,2.0,0.0,3.0,0.0
30378,30378,2018-05-03 06:14:33,399,0,72.95,2787273,32,0,0,0,...,3.0,82.36,14.0,92.935,0.0,0.0,0.0,0.0,8.0,0.0
41224,41224,2018-05-14 12:52:12,399,0,24.62,3761532,43,0,0,0,...,5.0,56.414,14.0,69.772857,0.0,0.0,1.0,0.0,6.0,0.0


In [15]:
%time transactions_df=transactions_df.groupby('TERMINAL_ID').apply(lambda x: get_count_risk_rolling_window(x, delta=7, windows_size_in_days=[1,7,30], feature="TERMINAL_ID"))
transactions_df=transactions_df.sort_values('TX_DATETIME').reset_index(drop=True)


Wall time: 14.8 s


In [16]:
transactions_df

Unnamed: 0,TRANSACTION_ID,TX_DATETIME,CUSTOMER_ID,TERMINAL_ID,TX_AMOUNT,TX_TIME_SECONDS,TX_TIME_DAYS,TX_FRAUD,TX_FRAUD_SCENARIO,TX_DURING_WEEKEND,...,CUSTOMER_ID_NB_TX_7DAY_WINDOW,CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW,CUSTOMER_ID_NB_TX_30DAY_WINDOW,CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW,TERMINAL_ID_NB_TX_1DAY_WINDOW,TERMINAL_ID_RISK_1DAY_WINDOW,TERMINAL_ID_NB_TX_7DAY_WINDOW,TERMINAL_ID_RISK_7DAY_WINDOW,TERMINAL_ID_NB_TX_30DAY_WINDOW,TERMINAL_ID_RISK_30DAY_WINDOW
0,0,2018-04-01 00:07:56,2,316,146.00,476,0,0,0,1,...,1.0,146.000000,1.0,146.000000,0.0,0.000000,0.0,0.000,0.0,0.000000
1,1,2018-04-01 00:30:05,360,584,92.74,1805,0,0,0,1,...,1.0,92.740000,1.0,92.740000,0.0,0.000000,0.0,0.000,0.0,0.000000
2,2,2018-04-01 00:32:35,183,992,39.30,1955,0,0,0,1,...,1.0,39.300000,1.0,39.300000,0.0,0.000000,0.0,0.000,0.0,0.000000
3,3,2018-04-01 00:43:59,382,283,15.35,2639,0,0,0,1,...,1.0,15.350000,1.0,15.350000,0.0,0.000000,0.0,0.000,0.0,0.000000
4,4,2018-04-01 00:45:51,381,799,23.15,2751,0,0,0,1,...,1.0,23.150000,1.0,23.150000,0.0,0.000000,0.0,0.000,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173827,173827,2018-09-30 23:32:59,140,359,97.55,15809579,182,0,0,1,...,29.0,95.726897,99.0,103.052121,3.0,0.333333,8.0,0.125,44.0,0.045455
173828,173828,2018-09-30 23:33:02,221,41,61.26,15809582,182,0,0,1,...,20.0,80.281500,84.0,88.563214,1.0,0.000000,6.0,0.000,28.0,0.000000
173829,173829,2018-09-30 23:46:15,101,777,58.80,15810375,182,0,0,1,...,9.0,49.744444,23.0,39.370435,3.0,0.000000,16.0,0.000,52.0,0.019231
173830,173830,2018-09-30 23:54:38,7,705,15.08,15810878,182,0,0,1,...,24.0,29.880000,70.0,31.265000,0.0,0.000000,12.0,0.000,35.0,0.057143


## Saving of dataset

Let us finally save the dataset, split in daily batches, using the pickle format. 

In [17]:
DIR_OUTPUT = "./simulation_data/transformed_features/"

if not os.path.exists(DIR_OUTPUT):
    os.makedirs(DIR_OUTPUT)

start_date = datetime.datetime.strptime("2018-04-01", "%Y-%m-%d")

for day in range(transactions_df.TX_TIME_DAYS.max()+1):
    
    transactions_day = transactions_df[transactions_df.TX_TIME_DAYS==day].sort_values('TX_TIME_SECONDS')
    
    date = start_date + datetime.timedelta(days=day)
    filename_output = date.strftime("%Y-%m-%d")+'.pkl'
    
    transactions_day.to_pickle(DIR_OUTPUT+filename_output)
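A saved batch can be reloaded with `pd.read_pickle` to check that it was written correctly. The sketch below is a self-contained round trip on a toy dataframe written to a temporary folder, rather than to `DIR_OUTPUT`. The file name follows the same `YYYY-MM-DD.pkl` pattern as above.

```python
import os
import tempfile

import pandas as pd

# Toy daily batch for illustration only
toy_day = pd.DataFrame({"TRANSACTION_ID": [0, 1], "TX_AMOUNT": [146.0, 92.74]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "2018-04-01.pkl")
    # Save the batch, then read it back
    toy_day.to_pickle(path)
    reloaded = pd.read_pickle(path)

# True if the reloaded dataframe has the same values and dtypes
print(reloaded.equals(toy_day))
```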