# <font face="Arial" color="green">General description</font>

<h1 class="dataset-header-v2__title">Synthetic Financial Datasets For Fraud Detection</h1>

<div class="markdown-converter__text--rendered"><h1>Context</h1>
<p>Publicly available datasets on financial services are scarce, especially in the emerging domain of mobile money transactions. Financial datasets are important to many researchers, and in particular to us, since we perform research on fraud detection. Part of the problem is the intrinsically private nature of financial transactions, which means that almost no such datasets are published.</p>
<p>We present a synthetic dataset generated using the simulator called PaySim as an approach to such a problem. PaySim uses aggregated data from the private dataset to generate a synthetic dataset that resembles the normal operation of transactions and injects malicious behaviour to later evaluate the performance of fraud detection methods.</p>
<h1>Content</h1>
<p>PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company that operates the mobile financial service, which currently runs in more than 14 countries around the world.</p>
<p>This synthetic dataset is scaled down to 1/4 of the size of the original dataset and was created specifically for Kaggle.</p>
<h1>Headers</h1>
<p>This is a sample row, followed by an explanation of each header:</p>
<p>1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0</p>
<p>step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps: 744 (a 31-day simulation, since 744 / 24 = 31).</p>
<p>type - CASH_IN, CASH_OUT, DEBIT, PAYMENT and TRANSFER.</p>
<p>amount - amount of the transaction in local currency.</p>
<p>nameOrig - customer who started the transaction</p>
<p>oldbalanceOrg - initial balance before the transaction</p>
<p>newbalanceOrig - new balance after the transaction</p>
<p>nameDest - customer who is the recipient of the transaction</p>
<p>oldbalanceDest - recipient's initial balance before the transaction. Note that there is no information for recipients whose name starts with M (Merchants).</p>
<p>newbalanceDest - recipient's new balance after the transaction. Note that there is no information for recipients whose name starts with M (Merchants).</p>
<p>isFraud - marks the transactions made by the fraudulent agents inside the simulation. In this specific dataset, the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty them by transferring the funds to another account and then cashing out of the system.</p>
<p>isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.</p>
<h1>Past Research</h1>
<p>There are 5 similar files that contain the runs of 5 different scenarios. These files are explained in more detail in chapter 7 of my PhD thesis (available at <a rel="noreferrer nofollow" href="http://urn.kb.se/resolve?urn=urn:nbn:se:bth-12932">http://urn.kb.se/resolve?urn=urn:nbn:se:bth-12932</a>).</p>
<p>We ran PaySim several times using random seeds for 744 steps, representing each hour of one month of real time, which matches the original logs. Each run took around 45 minutes on an Intel i7 processor with 16 GB of RAM. The final result of a run contains approximately 24 million financial records divided into the 5 transaction types: CASH_IN, CASH_OUT, DEBIT, PAYMENT and TRANSFER.</p>
<h1>Acknowledgements</h1>
<p>This work is part of the research project "Scalable resource-efficient systems for big data analytics", funded by the Knowledge Foundation (grant: 20140032) in Sweden.</p>
<p>Please refer to this dataset using the following citation:</p>
<p>First paper on the PaySim simulator:</p>
<p>E. A. Lopez-Rojas, A. Elmir, and S. Axelsson. "PaySim: A financial mobile money simulator for fraud detection". In: The 28th European Modeling and Simulation Symposium-EMSS, Larnaca, Cyprus. 2016</p></div>


https://www.kaggle.com/ealaxi/paysim1
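
As a quick illustration of the column layout described in the Headers section, the sample row can be paired with its headers. This is only a sketch: the column list is copied from the description, and `flag_rule` is a hypothetical helper that encodes one reading of the isFlaggedFraud rule (a TRANSFER above 200,000). It is not code from PaySim, and the real flag may not follow this rule exactly.

```python
import io
import pandas as pd

# Column names in the order given in the Headers section.
COLUMNS = [
    "step", "type", "amount", "nameOrig", "oldbalanceOrg", "newbalanceOrig",
    "nameDest", "oldbalanceDest", "newbalanceDest", "isFraud", "isFlaggedFraud",
]

# The sample row quoted in the dataset description.
SAMPLE_ROW = "1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0"

sample = pd.read_csv(io.StringIO(SAMPLE_ROW), header=None, names=COLUMNS)


def flag_rule(frame, threshold=200_000):
    """Hypothetical reading of the isFlaggedFraud rule: a TRANSFER above the threshold."""
    return ((frame["type"] == "TRANSFER") & (frame["amount"] > threshold)).astype(int)


sample["rule_flag"] = flag_rule(sample)
# Merchant recipients are identified by an 'M' prefix in nameDest.
sample["dest_is_merchant"] = sample["nameDest"].str.startswith("M")

print(sample.T)
```

For this row the recipient is a merchant, which is why both destination balances are 0.0, and the origin balance drops by exactly the payment amount.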

# <font face="Arial" color="green">Library Imports</font>

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from math import ceil
from math import floor

# 1 - Data import and basic inspection 

## 1.1 data loading and checking

In [3]:
# Data load

df = pd.read_csv("/home/gustavo/repos/fraud-detection/PS_20174392719_1491204439457_log.csv")

df.head()

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |
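
With several million rows, memory use can matter. One possible variant of the load step declares column dtypes up front. The sketch below runs on a tiny in-memory stand-in built from the first rows of the file, and the dtype choices are assumptions rather than something this notebook does.

```python
import io
import pandas as pd

# Tiny stand-in for the CSV file, built from the first rows of the dataset.
csv_text = """step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
"""

# Assumed dtypes: a categorical for the few transaction types and
# small integers for the step counter and the two 0/1 flags.
dtypes = {
    "step": "int16",
    "type": "category",
    "isFraud": "int8",
    "isFlaggedFraud": "int8",
}

small = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

print(small.dtypes)
print(small.memory_usage(deep=True).sum(), "bytes")
```

To try this on the real file, pass the CSV path to `pd.read_csv` in place of the `io.StringIO` object and compare `memory_usage` with and without the `dtype` argument.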


In [6]:
# Data dimensions

print('\nData dimensions:\n')
print('Number of rows: {}'.format(df.shape[0]))
print('Number of columns: {}'.format(df.shape[1]))


Data dimensions:

Number of rows: 6362620
Number of columns: 11


In [7]:
# Checking data types

df.dtypes

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

In [8]:
# Checking for missing values (NA)

df.isna().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

In [9]:
# Day of month (1 step = 1 hour, so 24 steps make one day)

df['day_of_month'] = df['step'].apply(lambda x: ceil(x/24))

df['day_of_month']

0           1
1           1
2           1
3           1
4           1
           ..
6362615    31
6362616    31
6362617    31
6362618    31
6362619    31
Name: day_of_month, Length: 6362620, dtype: int64

In [10]:
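# Week of month (0-based; 168 steps make one week)
# Subtracting 1 keeps step 168, the last hour of day 7, in week 0

df['week_of_month'] = df['step'].apply(lambda x: floor((x - 1)/168))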

df['week_of_month']

0          0
1          0
2          0
3          0
4          0
          ..
6362615    4
6362616    4
6362617    4
6362618    4
6362619    4
Name: week_of_month, Length: 6362620, dtype: int64

In [11]:
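# Creating aux column 'min_step_of_day': the first step observed on each day
# (the column is created as float64 because it is filled one day at a time)

for i in df['day_of_month'].unique():
    day_mask = df['day_of_month'] == i
    df.loc[day_mask, 'min_step_of_day'] = df.loc[day_mask, 'step'].min()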


In [12]:
# Creating 'hour_of_day' column

df['hour_of_day'] = df['step'] - df['min_step_of_day']

df['hour_of_day']

0           0.0
1           0.0
2           0.0
3           0.0
4           0.0
           ... 
6362615    22.0
6362616    22.0
6362617    22.0
6362618    22.0
6362619    22.0
Name: hour_of_day, Length: 6362620, dtype: float64

In [13]:
df['hour_of_day'].max()

23.0
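
The day, week and hour features can also be derived directly from `step` with integer arithmetic, which avoids the per-day loop. The sketch below assumes that steps are 1-based and that step 1 is the first hour of day 1. It runs on a small hypothetical frame rather than the full dataset. Unlike the minimum-step approach, it does not depend on every day having a transaction in its first hour, and week boundaries are aligned with day boundaries.

```python
import pandas as pd

# Hypothetical steps covering day and week boundaries and the last step in the data.
demo = pd.DataFrame({"step": [1, 24, 25, 168, 169, 743]})

zero_based = demo["step"] - 1                   # hours elapsed since the start
demo["day_of_month"] = zero_based // 24 + 1     # 1-based day
demo["week_of_month"] = zero_based // (24 * 7)  # 0-based week
demo["hour_of_day"] = zero_based % 24           # 0..23

print(demo)
```

All three columns stay integer-typed here, whereas the loop-based version yields a float `hour_of_day`.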

In [14]:
# Dropping aux column 'min_step_of_day'

df = df.drop(columns=['min_step_of_day'])

In [15]:
df

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud | day_of_month | week_of_month | hour_of_day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.00 | 160296.36 | M1979787155 | 0.00 | 0.00 | 0 | 0 | 1 | 0 | 0.0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.00 | 19384.72 | M2044282225 | 0.00 | 0.00 | 0 | 0 | 1 | 0 | 0.0 |
| 2 | 1 | TRANSFER | 181.00 | C1305486145 | 181.00 | 0.00 | C553264065 | 0.00 | 0.00 | 1 | 0 | 1 | 0 | 0.0 |
| 3 | 1 | CASH_OUT | 181.00 | C840083671 | 181.00 | 0.00 | C38997010 | 21182.00 | 0.00 | 1 | 0 | 1 | 0 | 0.0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.00 | 29885.86 | M1230701703 | 0.00 | 0.00 | 0 | 0 | 1 | 0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6362615 | 743 | CASH_OUT | 339682.13 | C786484425 | 339682.13 | 0.00 | C776919290 | 0.00 | 339682.13 | 1 | 0 | 31 | 4 | 22.0 |
| 6362616 | 743 | TRANSFER | 6311409.28 | C1529008245 | 6311409.28 | 0.00 | C1881841831 | 0.00 | 0.00 | 1 | 0 | 31 | 4 | 22.0 |
| 6362617 | 743 | CASH_OUT | 6311409.28 | C1162922333 | 6311409.28 | 0.00 | C1365125890 | 68488.84 | 6379898.11 | 1 | 0 | 31 | 4 | 22.0 |
| 6362618 | 743 | TRANSFER | 850002.52 | C1685995037 | 850002.52 | 0.00 | C2080388513 | 0.00 | 0.00 | 1 | 0 | 31 | 4 | 22.0 |


In [None]:
# Day of week

In [None]:
# Day of week (Monday, Tuesday, Wednesday, etc.)

In [None]:
# Weekend flag 'is_weekend' (0 or 1)
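
The dataset description does not say on which weekday the simulation starts, so a true day-of-week feature cannot be recovered from `step` alone. The following is a minimal sketch that assumes, hypothetically, that day 1 is a Monday. `START_WEEKDAY` is a placeholder to change if the real start day is known, and the small frame stands in for the `day_of_month` column built earlier.

```python
import pandas as pd

# Assumption: day 1 of the simulation is a Monday (0 = Monday ... 6 = Sunday).
START_WEEKDAY = 0
WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Hypothetical frame standing in for the 'day_of_month' column.
demo = pd.DataFrame({"day_of_month": [1, 5, 6, 7, 8, 31]})

# Position of each day within a 7-day cycle, shifted by the assumed start day.
weekday_idx = (demo["day_of_month"] - 1 + START_WEEKDAY) % 7
demo["day_of_week"] = weekday_idx.map(lambda i: WEEKDAY_NAMES[i])
demo["is_weekend"] = (weekday_idx >= 5).astype(int)

print(demo)
```

Even if the assumed start day is wrong, the 7-day cycle index is still usable as a periodic feature. Only the weekday labels and the weekend flag depend on the assumption.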

## 1.2 descriptive statistics