In [1]:
import numpy as np
import pandas as pd

import os

In [3]:
generating_data_dir = "generating_data"
transactions_df = pd.read_csv(os.path.join(generating_data_dir,'transactions.csv'))
transactions_df_fraud = pd.read_csv(os.path.join(generating_data_dir,'transactions_df_fraud.csv'))
transactions_df_total = pd.read_csv(os.path.join(generating_data_dir,'transaction_total.csv'))

# Transaction data simulator

## Features

| Feature name | Feature type | Description |
|-----------------|------------------|------------------------------------------------------------------|
| TRANSACTION_ID  | Numeric          | A unique identifier for the transaction. |
| TX_DATETIME     | Pandas timestamp | Date and time at which the transaction occurs. |
| CUSTOMER_ID     | Numeric          | The identifier for the customer. Each customer has a unique identifier. |
| SME_ID          | Numeric          | The identifier for the merchant (more precisely, the SME). Each SME has a unique identifier. |
| TX_AMOUNT       | Numeric          | The amount of the transaction. |
| TX_TIME_SECONDS | Numeric          | Time of the transaction in seconds, counted from 0 up to the last transaction day * 86400. |
| TX_TIME_DAYS    | Numeric          | Day of the transaction, counted from 0 up to the last transaction day. |
| CARD_NO         | Categorical      | The credit card number used by the customer in the transaction. |
| CARD_TYPE       | Categorical      | The type of the credit card being used. |
| EMAIL_DOMAIN    | Categorical      | The email domain that the customer used to register their account. |
| IP_ADDRESS      | Categorical      | The IP address of the customer when the transaction happened. It can also be interpreted as the geographic location the customer reports. |
| PHONE_NO        | Categorical      | The phone number the customer used for the transaction. |
| TX_FRAUD        | Binary           | Label: 0 for a legitimate transaction, 1 for a fraudulent transaction. |


### 1. Normal Transaction Table

In [11]:
display(transactions_df)

Unnamed: 0,TRANSACTION_ID,TX_DATETIME,CUSTOMER_ID,SME_ID,TX_AMOUNT,TX_TIME_SECONDS,TX_TIME_DAYS,CARD_NO,CARD_TYPE,EMAIL_DOMAIN,IP_ADDRESS,PHONE_NO,TX_FRAUD
0,0,2018-04-01 00:00:17,6160,1602,31.83,17,0,4807386282223131022,VISA 16 digit,smith.com,93.113.29.137,+1-684-043-2108x883,0
1,1,2018-04-01 00:01:59,3305,3563,66.21,119,0,501840408188,JCB 16 digit,yahoo.com,213.50.161.243,+1-435-580-2880x863,0
2,2,2018-04-01 00:02:21,6170,3552,25.17,141,0,2404180621851474,JCB 15 digit,yahoo.com,124.3.159.126,601-185-2008,0
3,3,2018-04-01 00:03:45,8347,344,3.09,225,0,38977833740025,JCB 15 digit,gmail.com,140.252.121.231,+1-287-916-9827x7383,0
4,4,2018-04-01 00:07:56,2,3958,146.00,476,0,4934421761046,Diners Club / Carte Blanche,wiggins.com,77.42.198.63,+1-485-590-9776x58236,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1773913,1773913,2018-09-30 23:56:56,1987,324,34.53,15811016,182,3528716375376223,JCB 16 digit,hughes.info,134.142.234.159,+1-436-476-9444x0036,0
1773914,1773914,2018-09-30 23:57:07,3133,3491,55.83,15811027,182,676151026702,American Express,yahoo.com,20.231.206.121,(405)305-0411x8725,0
1773915,1773915,2018-09-30 23:57:11,3007,3664,65.94,15811031,182,4264830832653,VISA 16 digit,torres.net,182.201.70.209,200-540-6960x0861,0
1773916,1773916,2018-09-30 23:57:44,6244,3137,16.99,15811064,182,213165348311865,Maestro,gmail.com,177.38.229.155,425-178-0631x3652,0


`transactions_df` is generated by the following steps:

1. **Generation of customer profiles:** Every customer is different in their spending habits. This will be simulated by defining some properties for each customer. The main properties will be their geographical location, their spending frequency, and their spending amounts. The customer properties will be represented as a table, referred to as the `customer_profile_table`.

2. **Generation of SME (terminal) profiles:** SME properties will simply consist of a product position. The SME properties will be represented as a table, referred to as the `sme_profile_table`.

3. **Association of customer profiles to SMEs:** We will assume that customers only make transactions with SMEs that lie within a radius given by their individual preference. This makes the simple assumption that a customer only transacts with SMEs that are virtually "geographically" close to their position. This step will consist of adding a feature `list_sme` to each customer profile, containing the set of SMEs that the customer can use.

4. **Generation of transactions:** The simulator will loop over the set of customer profiles, and generate transactions according to their properties (spending frequencies and amounts, and available SMEs). This will result in a table of transactions.

5. **Generation of fraud scenarios:** This last step will label the transactions as legitimate or fraudulent. This will be done by following three different fraud scenarios, which are described below.
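Step 3 above can be sketched as follows. This is only a minimal illustration, not the simulator's actual code: the 100 x 100 grid, the `x`, `y` and `radius` columns, and the table sizes are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_customers, n_smes = 5, 20

# Hypothetical profile tables: positions drawn uniformly on a 100 x 100 grid.
customer_profile_table = pd.DataFrame({
    "CUSTOMER_ID": np.arange(n_customers),
    "x": rng.uniform(0, 100, n_customers),
    "y": rng.uniform(0, 100, n_customers),
    "radius": rng.uniform(20, 50, n_customers),  # assumed individual preference
})
sme_profile_table = pd.DataFrame({
    "SME_ID": np.arange(n_smes),
    "x": rng.uniform(0, 100, n_smes),
    "y": rng.uniform(0, 100, n_smes),
})

# Pairwise Euclidean distances, shape (n_customers, n_smes).
customer_xy = customer_profile_table[["x", "y"]].to_numpy()
sme_xy = sme_profile_table[["x", "y"]].to_numpy()
dist = np.linalg.norm(customer_xy[:, None, :] - sme_xy[None, :, :], axis=2)

# For each customer, keep the SMEs closer than that customer's radius.
radius = customer_profile_table["radius"].to_numpy()
sme_ids = sme_profile_table["SME_ID"].to_numpy()
customer_profile_table["list_sme"] = [
    sme_ids[row < r].tolist() for row, r in zip(dist, radius)
]
```

Computing the full distance matrix is convenient for small tables; for many customers and SMEs a spatial index would likely be a better fit.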



### 2. Extra Transaction Table

In [12]:
display(transactions_df_fraud)

Unnamed: 0,TRANSACTION_ID,TX_DATETIME,CUSTOMER_ID,SME_ID,TX_AMOUNT,TX_TIME_SECONDS,TX_TIME_DAYS,CARD_NO,CARD_TYPE,EMAIL_DOMAIN,IP_ADDRESS,PHONE_NO,TX_FRAUD
0,1773918,2018-04-02 00:00:46,10049,1569,88.02,46,0,4924859058330576,VISA 19 digit,lee-steele.com,7.175.1.6,756-652-9630,1
1,1773919,2018-04-02 01:31:26,10097,389,30.30,5486,0,30583831423180,JCB 16 digit,phillips.com,38.185.247.112,(093)953-7775,1
2,1773920,2018-04-02 01:36:35,10097,3637,31.94,5795,0,6011458766910388,Discover,yahoo.com,38.185.247.112,(093)953-7775,1
3,1773921,2018-04-02 01:41:44,10097,3443,27.28,6104,0,30583831423180,JCB 16 digit,gmail.com,38.185.247.112,618-835-5637x282,1
4,1773922,2018-04-02 07:30:56,10032,778,46.44,27056,0,4073412403335864352,VISA 16 digit,yahoo.com,32.167.91.225,001-746-315-8525,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7319,1781237,2018-09-29 08:48:40,10048,17,14.84,1673320,19,349644752100441,Diners Club / Carte Blanche,dawson-kramer.biz,104.236.227.163,950.081.8325x62834,1
7320,1781238,2018-09-29 09:04:47,10048,2727,50.96,1674287,19,349644752100441,Diners Club / Carte Blanche,dawson-kramer.biz,160.169.210.177,523.097.4216x01531,1
7321,1781239,2018-09-29 11:32:34,10048,1794,80.19,1683154,19,349644752100441,Diners Club / Carte Blanche,dawson-kramer.biz,75.78.177.127,785-341-9944x301,1
7322,1781240,2018-09-29 12:04:38,10048,1413,13.92,1685078,19,5446101279055773,JCB 16 digit,yahoo.com,192.237.88.74,(772)356-3740,1


`transactions_df_fraud` is generated in the same way as the normal table, except that the simulated fraudsters show the following behaviors:
* Inconsistencies in customer details across multiple purchases (for example, using the same e-mail address but a different name for another payment).
* Many payments made with:
    * An individual preference radius that is very wide and unspecific (reaching beyond the "circle" of a normal customer).
    * The same card but different IP addresses, phone numbers, or accounts (email domains).
    * Many cards that share the same IP address, phone number, or account (email domain).
    * A much higher purchasing frequency (2 to 6 times) than that of normal customers.

Each fraudster is assumed to start committing fraud on a certain day during the period and to continue for about 10 days (the duration is normally distributed between 0 and 20 days).

Some fraudulent transactions are also added to the normal transactions table, using the following fraud scenarios:

**Scenario 1:** Any transaction whose amount is more than 300 has a 20% chance of being fraudulent. Amounts are drawn from a normal distribution whose mean lies between 5 and 100 and whose standard deviation is twice the mean.
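A minimal sketch of the labeling rule in Scenario 1, using a toy table. The random seed and the toy amounts are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy transactions; all start as legitimate (TX_FRAUD = 0).
toy_tx = pd.DataFrame({
    "TX_AMOUNT": [12.5, 310.0, 450.0, 299.99, 800.0],
    "TX_FRAUD": 0,
})

AMOUNT_THRESHOLD = 300
FRAUD_PROBABILITY = 0.2

# A transaction above the threshold is flagged with probability 0.2.
is_large = toy_tx["TX_AMOUNT"] > AMOUNT_THRESHOLD
coin_flip = rng.random(len(toy_tx)) < FRAUD_PROBABILITY
toy_tx.loc[is_large & coin_flip, "TX_FRAUD"] = 1
```

Transactions at or below the threshold are never flagged by this rule, whatever the random draw.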

**Scenario 2:** Every day, a list of 3 customers is drawn at random. In the next 14 days, 1/3 of their transactions have their amounts multiplied by 5 and marked as fraudulent. This scenario simulates a card-not-present fraud where the credentials of a customer have been leaked. The customer continues to make transactions, and transactions of higher values are made by the fraudster who tries to maximize their gains.
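Scenario 2 could be sketched like this on toy data. The table layout (10 customers, one transaction per customer per day, flat amount of 20) is an assumption, and sampling exactly one third of the candidate transactions is one possible reading of "1/3 of their transactions".

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n_customers, n_days = 10, 30

# Toy data: one transaction of amount 20 per customer per day.
toy_tx = pd.DataFrame({
    "CUSTOMER_ID": np.tile(np.arange(n_customers), n_days),
    "TX_TIME_DAYS": np.repeat(np.arange(n_days), n_customers),
    "TX_AMOUNT": 20.0,
    "TX_FRAUD": 0,
})

for day in range(n_days):
    # Draw 3 compromised customers for this day.
    compromised = rng.choice(n_customers, size=3, replace=False)
    in_window = (
        toy_tx["CUSTOMER_ID"].isin(compromised)
        & (toy_tx["TX_TIME_DAYS"] >= day)
        & (toy_tx["TX_TIME_DAYS"] < day + 14)
        & (toy_tx["TX_FRAUD"] == 0)  # never inflate the same transaction twice
    )
    candidates = toy_tx.index[in_window].to_numpy()
    # Mark one third of the candidate transactions as fraudulent.
    chosen = rng.choice(candidates, size=len(candidates) // 3, replace=False)
    toy_tx.loc[chosen, "TX_AMOUNT"] *= 5
    toy_tx.loc[chosen, "TX_FRAUD"] = 1
```

Because already-flagged transactions are excluded from later draws, every fraudulent amount in the toy table is exactly five times the original amount.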

**Scenario 3:** Some randomly selected feature values are defined as suspicious: an unusual email domain or credit card type, or a credit card number, IP address, or phone number ending with particular digits.

These two transaction datasets, the partly fraudulent `transactions_df` and the entirely fraudulent `transactions_df_fraud`, are merged into a single `transactions_df_total`, to which the baseline transformations are then applied.
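The merge itself could look like the sketch below. The toy frames stand in for the two tables, and sorting the result chronologically is an assumption rather than something the notebook confirms.

```python
import pandas as pd

# Toy stand-ins for the partly fraudulent and the entirely fraudulent tables.
normal_part = pd.DataFrame({
    "TRANSACTION_ID": [0, 1],
    "TX_DATETIME": pd.to_datetime(["2018-04-01 00:00:17", "2018-04-02 08:00:00"]),
    "TX_FRAUD": [0, 0],
})
fraud_part = pd.DataFrame({
    "TRANSACTION_ID": [2],
    "TX_DATETIME": pd.to_datetime(["2018-04-01 12:00:00"]),
    "TX_FRAUD": [1],
})

# Stack the two tables, then order the rows by transaction time.
merged = (
    pd.concat([normal_part, fraud_part], ignore_index=True)
    .sort_values("TX_DATETIME")
    .reset_index(drop=True)
)
```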

In [6]:
transactions_df_total.head(5)

Unnamed: 0,TRANSACTION_ID,TX_DATETIME,CUSTOMER_ID,SME_ID,TX_AMOUNT,TX_TIME_SECONDS,TX_TIME_DAYS,CARD_NO,CARD_TYPE,EMAIL_DOMAIN,...,TERMINAL_ID_RISK_1DAY_WINDOW,TERMINAL_ID_NB_TX_7DAY_WINDOW,TERMINAL_ID_RISK_7DAY_WINDOW,TERMINAL_ID_NB_TX_30DAY_WINDOW,TERMINAL_ID_RISK_30DAY_WINDOW,NB_CARD_NO,NB_CARD_TYPE,NB_EMAIL_DOMAIN,NB_IP_ADDRESS,NB_PHONE_NO
0,0,2018-04-01 00:00:17,6160,1602,31.83,17,0,4807386282223131022,VISA 16 digit,smith.com,...,0.0,0.0,0.0,0.0,0.0,10,6,15,3,3
1,1,2018-04-01 00:01:59,3305,3563,66.21,119,0,501840408188,JCB 16 digit,yahoo.com,...,0.0,0.0,0.0,0.0,0.0,5,3,17,2,7
2,2,2018-04-01 00:02:21,6170,3552,25.17,141,0,2404180621851474,JCB 15 digit,yahoo.com,...,0.0,0.0,0.0,0.0,0.0,12,7,19,6,7
3,3,2018-04-01 00:03:45,8347,344,3.09,225,0,38977833740025,JCB 15 digit,gmail.com,...,0.0,0.0,0.0,0.0,0.0,5,4,7,1,2
4,4,2018-04-01 00:07:56,2,3958,146.0,476,0,4934421761046,Diners Club / Carte Blanche,wiggins.com,...,0.0,0.0,0.0,0.0,0.0,6,4,12,4,2


# Baseline Transform

The first type of transformation involves the date/time variable, and consists in creating binary features that characterize potentially relevant periods. We will create two such features. The first one will characterize whether a transaction occurs during a weekday or during the weekend. The second will characterize whether a transaction occurs during the day or the night. These features can be useful since it has been observed in real-world datasets that fraudulent patterns differ between weekdays and weekends, and between the day and night.
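A minimal sketch of these two date/time features on a toy table. Treating the hours from midnight up to 6am as "night" is an assumption about the exact boundary.

```python
import pandas as pd

toy_tx = pd.DataFrame({"TX_DATETIME": pd.to_datetime([
    "2018-04-01 00:00:17",  # Sunday, night
    "2018-04-02 14:30:00",  # Monday, day
    "2018-04-07 05:59:59",  # Saturday, night
])})

# dt.weekday: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
toy_tx["TX_DURING_WEEKEND"] = (toy_tx["TX_DATETIME"].dt.weekday >= 5).astype(int)

# Night is taken here as hours 0 to 5 inclusive (assumed boundary).
toy_tx["TX_DURING_NIGHT"] = (toy_tx["TX_DATETIME"].dt.hour < 6).astype(int)
```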

The second type of transformation involves the customer ID and consists in creating features that characterize the customer spending behaviors. We will follow the RFM (Recency, Frequency, Monetary value) framework, and keep track of the average spending amount and number of transactions for each customer and for three window sizes. This will lead to the creation of six new features.
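The customer features can be sketched with time-based rolling windows in pandas. The helper name `add_customer_windows` is hypothetical, and in this sketch each window includes the current transaction.

```python
import pandas as pd

toy_tx = pd.DataFrame({
    "CUSTOMER_ID": [1, 1, 1, 2],
    "TX_DATETIME": pd.to_datetime([
        "2018-04-01 10:00", "2018-04-01 20:00",
        "2018-04-05 09:00", "2018-04-01 12:00",
    ]),
    "TX_AMOUNT": [10.0, 30.0, 20.0, 50.0],
})

def add_customer_windows(customer_tx, windows=(1, 7, 30)):
    """Add count and mean-amount features for one customer's transactions."""
    customer_tx = customer_tx.sort_values("TX_DATETIME").copy()
    amounts = customer_tx.set_index("TX_DATETIME")["TX_AMOUNT"]
    for n in windows:
        rolled = amounts.rolling(f"{n}D")  # window covers the last n days
        customer_tx[f"CUSTOMER_ID_NB_TX_{n}DAY_WINDOW"] = rolled.count().to_numpy()
        customer_tx[f"CUSTOMER_ID_AVG_AMOUNT_{n}DAY_WINDOW"] = rolled.mean().to_numpy()
    return customer_tx

# Process each customer separately, then stack the results.
toy_tx = pd.concat(
    [add_customer_windows(group) for _, group in toy_tx.groupby("CUSTOMER_ID")]
)
```

For customer 1, the second transaction has two transactions in its 1-day window with an average amount of 20, while the third has three transactions in its 7-day window.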

The third type of transformation involves the SME ID and consists in creating new features that characterize the ‘risk’ associated with the terminal. The risk will be defined as the average number of frauds that were observed on the terminal for three window sizes. This will lead to the creation of three new features.

The fourth type of transformation involves the card number, card type, email domain, phone number, and IP address. Each is replaced by the number of distinct values of that attribute that the customer used during the period.
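One way to sketch this count is a grouped `nunique`. The toy values are made up, and counting over the whole period per customer is an assumption about how the `NB_*` columns are computed.

```python
import pandas as pd

toy_tx = pd.DataFrame({
    "CUSTOMER_ID": [1, 1, 1, 2, 2],
    "CARD_NO": ["A", "A", "B", "C", "C"],
    "IP_ADDRESS": ["1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4", "4.4.4.4"],
})

# For each customer, count the distinct values and broadcast to every row.
for column in ["CARD_NO", "IP_ADDRESS"]:
    toy_tx[f"NB_{column}"] = (
        toy_tx.groupby("CUSTOMER_ID")[column].transform("nunique")
    )
```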

The last type of transformation involves the transaction amount, to which a logarithmic transformation is applied.
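A possible form of the logarithmic transformation is shown below. Using `np.log1p` (that is, log(1 + x)) and the column name `TX_AMOUNT_LOG` are assumptions; the notebook does not state which variant or name is used.

```python
import numpy as np
import pandas as pd

toy_tx = pd.DataFrame({"TX_AMOUNT": [0.0, 3.09, 31.83, 146.0]})

# log1p computes log(1 + x), so a zero amount maps to 0 instead of -inf.
toy_tx["TX_AMOUNT_LOG"] = np.log1p(toy_tx["TX_AMOUNT"])
```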

The table below summarizes the first three types of transformation that will be performed and the new features that will be created.

| Original feature name | Original feature type | Transformation | Number of new features | New feature(s) type |
|-----------------------|-----------------------|----------------|------------------------|---------------------|
| TX_DATETIME | Pandas timestamp | 0 if transaction during a weekday, 1 if transaction during a weekend. The new feature is called TX_DURING_WEEKEND. | 1 | Integer (0/1) |
| TX_DATETIME | Pandas timestamp | 0 if transaction between 6am and midnight, 1 if transaction between midnight and 6am. The new feature is called TX_DURING_NIGHT. | 1 | Integer (0/1) |
| CUSTOMER_ID | Categorical variable | Number of transactions by the customer in the last n day(s), for n in {1,7,30}. The new features are called CUSTOMER_ID_NB_TX_nDAY_WINDOW. | 3 | Integer |
| CUSTOMER_ID | Categorical variable | Average spending amount in the last n day(s), for n in {1,7,30}. The new features are called CUSTOMER_ID_AVG_AMOUNT_nDAY_WINDOW. | 3 | Real |
| SME_ID | Categorical variable | Number of transactions on the terminal in the last n+d day(s), for n in {1,7,30} and d=7. The parameter d is called delay. The new features are called TERMINAL_ID_NB_TX_nDAY_WINDOW. | 3 | Integer |
| SME_ID | Categorical variable | Average number of frauds on the terminal in the last n+d day(s), for n in {1,7,30} and d=7. The parameter d is called delay. The new features are called TERMINAL_ID_RISK_nDAY_WINDOW. | 3 | Real |