Download Example Data Set

hkanezashi edited this page Feb 5, 2020 · 12 revisions

For convenience, we provide sample data sets that you can use for purposes such as performance evaluation.

Disclaimer: None of the generated data reflects real data. To align the simulated data with real data, you need to tune all of the simulation parameters. Community contributions to such efforts would be helpful.

Examples of data sets can be downloaded from here


From the banks folder, you can download the following internal/external data sets with gzipped feature CSV files. Currently four versions of the data sets (sub-folders) are available:

  • v2.1 (added Feb. 4): Tuned transaction network and models for composite transaction pattern features
  • v1.3 (added Dec. 16): Improved version of v1.2 so that more graph features improve the model performance
  • v1.2 (added Dec. 13): Tuned transaction data set so that graph analytics can detect SAR accounts more accurately
  • v1.1 (added Nov. 27): Baseline of the transaction data sets

In v1.1, v1.2 and v1.3, each data set has 100K accounts and several types of the SAR transaction patterns.

  • bank_a (v1.1 only): 100 cycles (3 - 6 SAR accounts for each cycle)
  • bank_b (v1.1 only): 100 scatter-gather patterns (5 - 10 SAR accounts)
  • bank_c (v1.1 only): 100 gather-scatter patterns (7 - 11 SAR accounts)
  • bank_separated: Transaction network containing the three banks above, with internal-bank SAR transactions
    • bank_a: 20K accounts, 20 cycles
    • bank_b: 30K accounts, 30 scatter-gather patterns
    • bank_c: 50K accounts, 50 gather-scatter patterns
  • bank_mixed: Similar to the bank_separated network, but all SAR transactions are bank-to-bank

CSV files of the transaction graph feature lists are also available (features/bank_*_features.csv.gz). In the v1.3 datasets, images of histograms for all transaction and graph feature values are also available (see bank_mixed_dist and bank_separated_dist directories under v1.3/features).
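As a rough illustration of reading one of these files, the sketch below uses only the Python standard library. The helper name `load_features` is hypothetical; only the `.csv.gz` format and the `account` column name come from this page. It assumes the file has a header row and that every column other than `account` holds a numeric value.

```python
import csv
import gzip


def load_features(path):
    """Read a gzipped feature CSV into {account_id: {feature_name: float}}.

    Assumes a header row containing a column named "account" and that all
    remaining columns hold numeric values.
    """
    features = {}
    # Text mode ("rt") lets the csv module read the decompressed stream.
    with gzip.open(path, mode="rt", newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            account = row.pop("account")  # account IDs are kept as strings
            features[account] = {name: float(value) for name, value in row.items()}
    return features
```

A call such as `load_features("features/bank_a_features.csv.gz")` (an example path following the pattern above) would return one dictionary of feature values per account.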

The first column (name: account) is the account ID, and the other columns are transaction and graph feature values, as follows.

| Feature Name | Category | Description |
|---|---|---|
| count | transaction | Total number of transactions |
| sum_amount | transaction | Total transaction amount |
| in_count | transaction | Total number of received transactions |
| in_amount | transaction | Total received transaction amount |
| out_count | transaction | Total number of sent transactions |
| out_amount | transaction | Total sent transaction amount |
| ego1_num | 1-hop egonet | Number of SAR accounts in the 1-hop ego network |
| ego1_prop | 1-hop egonet | Proportion of SAR accounts in the 1-hop ego network |
| ego2_num | 2-hop egonet | Number of SAR accounts in the 1-hop and 2-hop ego network |
| ego2_prop | 2-hop egonet | Proportion of SAR accounts in the 1-hop and 2-hop ego network |
| pr | PageRank | PageRank score. Each vertex sends its score to its neighbors with damping factor 0.85, receives and accumulates scores, and then updates its own score. |
| lp | Label Propagation | Propagates labels (scores) to neighbors through edges. SAR accounts have a fixed score of 1.0, and other accounts initially have 0.0. Scores of non-SAR accounts are updated using the edge weight as a threshold. |
| u_sar | undirected cycle | Maximum proportion of SAR accounts in cycles |
| u_length_min | undirected cycle | Minimum length of the cycles involving the account |
| u_length_avg | undirected cycle | Average length of the cycles involving the account |
| u_length_max | undirected cycle | Maximum length of the cycles involving the account |
| u_amount_min | undirected cycle | Minimum total amount of the cycles involving the account |
| u_amount_avg | undirected cycle | Average total amount of the cycles involving the account |
| u_amount_max | undirected cycle | Maximum total amount of the cycles involving the account |
| u_period_min | undirected cycle | Minimum period (days) of the cycles involving the account |
| u_period_avg | undirected cycle | Average period (days) of the cycles involving the account |
| u_period_max | undirected cycle | Maximum period (days) of the cycles involving the account |
| u_date_order | undirected cycle | 1.0 if at least one of the cycles involving the account is in chronological order, 0.0 otherwise |
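The 1-hop ego network features (ego1_num and ego1_prop) can be sketched as follows. This is a minimal illustration, not the procedure used to produce the published files. The function name `ego1_features` is hypothetical, and two assumptions are made that this page does not confirm: transactions are treated as undirected links, and the ego account itself is excluded from its own neighborhood.

```python
from collections import defaultdict


def ego1_features(edges, sar_accounts):
    """Return {account: (ego1_num, ego1_prop)} for a transaction edge list.

    Assumptions (not confirmed by the data set description): edges are
    treated as undirected, and the ego account is not counted as its own
    neighbor.
    """
    sar = set(sar_accounts)
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    result = {}
    for account, adjacent in neighbors.items():
        adjacent = adjacent - {account}  # drop self-loops
        num = len(adjacent & sar)  # SAR accounts among direct neighbors
        prop = num / len(adjacent) if adjacent else 0.0
        result[account] = (num, prop)
    return result
```

With the edges a-b, a-c, and c-d and SAR accounts b and d, account a has two neighbors, one of which is a SAR account, so its features are (1, 0.5).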

In the v2.1 data set, a bank-to-bank transaction network (bank_mixed) and the internal-bank transaction networks (bank_a, bank_b and bank_c) extracted from bank_mixed are available.

  • bank_mixed: 20K accounts (753 SAR accounts), 885,744 transactions (807 SAR transactions) with the following SAR typologies
    • 20 cycles: 3 - 6 SAR accounts (members), 10 - 30 days (simulation steps), 2,700 - 3,000 transaction amount
    • 30 scatter-gather: 5 - 10 SAR accounts (3 - 8 intermediate accounts), 10 - 30 days, 2,700 - 3,000 transaction amount
    • 50 gather-scatter: 7 - 11 SAR accounts (3 - 5 originators and beneficiaries), 10 - 30 days, 2,700 - 3,000 transaction amount
  • The numbers of accounts and transactions in the internal-bank transaction networks are as follows:
    • bank_a: 4K accounts, 34,976 transactions
    • bank_b: 6K accounts, 80,502 transactions
    • bank_c: 10K accounts, 222,260 transactions
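The figures above imply a strong class imbalance, which matters when the data is used to evaluate detection models. The short calculation below derives the SAR ratios and the number of transactions outside the three internal-bank networks. Treating every such transaction as a bank-to-bank transaction is an assumption, not something this page states.

```python
# Figures copied from the v2.1 description above.
total_accounts = 20_000
sar_accounts = 753
total_txs = 885_744
sar_txs = 807
internal_txs = {"bank_a": 34_976, "bank_b": 80_502, "bank_c": 222_260}

sar_account_ratio = sar_accounts / total_accounts
sar_tx_ratio = sar_txs / total_txs

# Assumption: every transaction that is not part of an internal-bank
# network crosses a bank boundary.
internal_total = sum(internal_txs.values())
cross_bank_txs = total_txs - internal_total

print(f"SAR accounts:      {sar_account_ratio:.3%}")
print(f"SAR transactions:  {sar_tx_ratio:.3%}")
print(f"Internal-bank txs: {internal_total:,}")
print(f"Other txs:         {cross_bank_txs:,}")
```

Roughly 3.8% of the accounts but less than 0.1% of the transactions are labeled SAR, so transaction-level evaluation is far more imbalanced than account-level evaluation.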

The following additional features of composite patterns are also available in the v2.1 data set.

Note: In the feature sets of scatter-gather and gather-scatter patterns, the prefix "sg_" or "gs_", respectively, is added to the following feature names.

| Feature Name | Description |
|---|---|
| num_accts | Total number of accounts involved in the composite pattern |
| num_txs | Total number of transactions in the composite pattern |
| min_amount | Minimum transaction amount |
| max_amount | Maximum transaction amount |
| avg_amount | Average transaction amount |
| sum_amount | Total transaction amount |
| all_period | Overall transaction period |
| diameter | Diameter of the pattern subgraph (number of transaction hops from the first accounts to the last accounts) |
| all_amount_diff | Difference between the amount of transactions from the first accounts and the amount of transactions to the last accounts |
| all_amount_ratio | Total transaction amount sent by the first accounts divided by the total amount received by the last accounts |
| all_date_diff | Difference between the date on which the first account sent and the date on which the last account received |
| amount_ordered | 1 if transaction amounts decay for all intermediate accounts; 0 otherwise |
| date_ordered | 1 if the transactions of all intermediate accounts are chronologically ordered; 0 otherwise |
| in_count | Number of incoming transactions the account received |
| out_count | Number of outgoing transactions the account sent |
| in_amount | Total amount of incoming transactions |
| out_amount | Total amount of outgoing transactions |
| acct_amount_diff | Difference between the total amounts of incoming and outgoing transactions (in_amount - out_amount) |
| acct_amount_ratio | Ratio of the total amounts of incoming and outgoing transactions (in_amount / out_amount) |
| acct_period | Overall period of transactions |
| acct_date_diff | Difference between the dates of incoming and outgoing transactions |
| depth | Relative distance of this account from the first originator account (first originator: 0.0, last beneficiary: 1.0) |
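The per-account flow features and the prefix convention can be sketched as follows. The function names `account_flow_features` and `with_prefix` and the `(sender, receiver, amount)` tuple layout are hypothetical; the formulas for acct_amount_diff and acct_amount_ratio follow the table. How the published files encode the ratio for an account with no outgoing amount is not documented here, so this sketch returns `None` in that case.

```python
def account_flow_features(account, transactions):
    """Compute flow features for one account within a pattern.

    `transactions` is an iterable of (sender, receiver, amount) tuples.
    The ratio is None when the account sent nothing (an assumption; the
    encoding used in the published files is not documented).
    """
    transactions = list(transactions)
    in_amounts = [amt for src, dst, amt in transactions if dst == account]
    out_amounts = [amt for src, dst, amt in transactions if src == account]
    in_amount = sum(in_amounts)
    out_amount = sum(out_amounts)
    return {
        "in_count": len(in_amounts),
        "out_count": len(out_amounts),
        "in_amount": in_amount,
        "out_amount": out_amount,
        "acct_amount_diff": in_amount - out_amount,
        "acct_amount_ratio": in_amount / out_amount if out_amount else None,
    }


def with_prefix(features, prefix):
    """Add a pattern prefix such as "sg_" or "gs_" to every feature name."""
    return {prefix + name: value for name, value in features.items()}
```

For an intermediate account that receives 100 and 50 and then sends 120, the difference is 30 and the ratio is 1.25; `with_prefix(features, "sg_")` would rename the keys to `sg_in_count`, `sg_out_count`, and so on.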

Example Data Set

Here is a list of example data sets intended specifically for performance evaluation. For reference, we also list the elapsed time needed to generate each data set, which may help when you run AMLSim with your own parameters.

| Data Name | File Size (7-Zip) | Number of Vertices (Accounts) | Number of Edges (Transactions) | Number of Fraud Patterns (fan_in + fan_out + cycle) | AMLSim Execution Time (s) |
|---|---|---|---|---|---|
| 100vertices-10Kedges | 149 KB | 100 | 10,000 | 1 (1 + 0 + 0) | 0.848 |
| 1Kvertices-100Kedges | 507 KB | 1,000 | 100,000 | 10 (3 + 3 + 4) | 13.594 |
| 10Kvertices-1Medges | 4.7 MB | 10,000 | 1,000,000 | 100 (30 + 30 + 40) | 201.505 |
| 100Kvertices-10Medges | 71 MB | 100,000 | 10,000,000 | 1,000 (300 + 300 + 400) | 2100.421 |
| 1Mvertices-100Medges | 976 MB | 1,000,000 | 100,000,000 | 10,000 (3,000 + 3,000 + 4,000) | 18040.666 |
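To relate these timings to your own parameters, the generation throughput of each run can be computed from the table. The sketch below is only a rough guide: the helper names `throughput` and `estimate_seconds` are hypothetical, the linear extrapolation is an assumption, and the timings are specific to the execution environment listed on this page.

```python
# (data set name, number of transactions, AMLSim execution time in seconds),
# copied from the table above.
runs = [
    ("100vertices-10Kedges", 10_000, 0.848),
    ("1Kvertices-100Kedges", 100_000, 13.594),
    ("10Kvertices-1Medges", 1_000_000, 201.505),
    ("100Kvertices-10Medges", 10_000_000, 2100.421),
    ("1Mvertices-100Medges", 100_000_000, 18040.666),
]


def throughput(num_transactions, seconds):
    """Generated transactions per second of execution time."""
    return num_transactions / seconds


def estimate_seconds(num_transactions, reference=runs[-1]):
    """Linearly extrapolate execution time from one reference run.

    Assumption: execution time scales linearly with the number of
    transactions, which the table only roughly supports.
    """
    _, ref_txs, ref_seconds = reference
    return num_transactions * ref_seconds / ref_txs


for name, txs, seconds in runs:
    print(f"{name}: {throughput(txs, seconds):,.0f} transactions/s")
```

The throughput falls from roughly 12,000 transactions per second for the smallest data set to roughly 5,000 for the larger ones, so an estimate taken from a large reference run is more representative for large simulations.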

Statistical Distributions

For each of the following data sets, six plots are provided: Number of Transactions for Each Step (count), Account Degree Distribution (deg), WCC Size Distribution (wcc), Clustering Coefficient Transition (cc), Diameter and Average Path Length (diameter), and Number of Alert Patterns (alert).

  • 100vertices-10Kedges (100 accounts, 10K transactions)
  • 1Kvertices-100Kedges (1K accounts, 100K transactions)
  • 10Kvertices-1Medges (10K accounts, 1M transactions)
  • 100Kvertices-10Medges (100K accounts, 10M transactions)
  • 1Mvertices-100Medges (1M accounts, 100M transactions)

Execution Environment

  • Hardware
    • CPU: Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz, total 16 cores (8 cores, 2 sockets)
    • RAM: 378GB DDR3 1333MHz
  • Software
    • OS: CentOS release 6.7 (Final)
    • GCC: version 5.2.1 20150902 (Red Hat 5.2.1-2)
    • Python: 2.7.12
    • Java: 1.8.0_71 OpenJDK 64-Bit Server VM (build 25.71-b15, mixed mode)