# Definitions: Financial Transactions and Fraud

### What is Fraud?
* Fraud is an intentionally deceptive action designed to provide the perpetrator with an unlawful gain or to deny a right to a victim. In addition, it is a deliberate act (or failure to act) with the intention of obtaining an unauthorized benefit, either for oneself or for the institution, by using deception or false suggestions or suppression of truth or other unethical means, which are believed and relied upon by others. Depriving another person or the institution of a benefit to which he/she/it is entitled by using any of the means described above also constitutes fraud.

* Types of fraud include tax fraud, credit card fraud, wire fraud, securities fraud, and bankruptcy fraud. Fraudulent activity can be carried out by one individual, multiple individuals or a business firm as a whole.

* Both states and the federal government have laws that criminalize fraud, though fraudulent actions may not always result in a criminal trial. Government prosecutors often have substantial discretion in determining whether a case should go to trial and may pursue a settlement instead if this will result in a speedier and less costly resolution. If a fraud case goes to trial, the perpetrator may be convicted and sent to jail.

### Types of Transactions
* A financial transaction is an agreement, or communication, carried out between a buyer and a seller to exchange an asset for payment.

* It involves a change in the status of the finances of two or more businesses or individuals. The buyer and seller are separate entities or objects, often involving the exchange of items of value, such as information, goods, services, and money. It is still a transaction if the goods are exchanged at one time, and the money at another. This is known as a two-part transaction: part one is giving the money, part two is receiving the goods.

1) Cash Transactions (Cash-in and Cash-out): A cash transaction is a transaction that involves an immediate outflow of cash for the purchase of goods, services, or assets. Cash transactions can be consumer-oriented or business-oriented.

A cash transaction stands in contrast to other modes of payment, such as credit transactions in a business involving bills receivable. Similarly, a cash transaction is also different from credit card transactions.

Cash transactions are different from transactions which involve a delay in delivery of the goods or delay in payment. Such transactions include credit sale, forward contract, futures contract, and other margin transactions.

2) Debit: A debit card payment is the same as an immediate payment of cash as the amount gets instantly debited from your bank account.

Debit cards allow bank customers to spend money by drawing on existing funds they have already deposited at the bank, such as from a checking account. A debit transaction using your PIN (personal identification number), is an online transaction completed in real time. When you complete a debit transaction, you authorize the purchase with your PIN and the merchant communicates immediately with your bank or credit union, causing the funds to be transferred in real time.

The first debit card may have hit the market as early as 1966 when the Bank of Delaware piloted the idea.

3) Payment: An act initiated by the payer or payee, or on behalf of the payer, of placing, transferring or withdrawing funds, irrespective of any underlying obligations between the payer and payee.

4) Transfer: A transfer involves the movement of assets, monetary funds, and/or ownership rights from one account to another. A transfer may require an exchange of funds when it involves a change in ownership, such as when an investor sells a real estate holding. In this case, there is a transfer of title from the seller to the buyer and a simultaneous transfer of funds, equal to the negotiated price, from the buyer to the seller.

The term transfer may also refer to the movement of an account from one bank or brokerage to another.

### Key Facts 
* Fraud involves **deceit** with the intention to illegally or unethically gain at the expense of another.
* In **finance**, fraud can take on many forms including making false insurance claims, cooking the books, pump & dump schemes, and identity theft leading to unauthorized purchases.
* Fraud **costs the economy billions of dollars** each and every year, and those who are caught are subject to fines and jail time.
* **Consumer fraud** occurs when a person suffers from a financial loss involving the use of deceptive, unfair, or false business practices.
* With **identity theft**, thieves steal your personal information, assume your identity, open credit cards and bank accounts in your name, and charge purchases to them.
* **Mortgage scams** are aimed at distressed homeowners to get money from them.
* **Credit and debit card fraud** occurs when someone takes the information from your card and makes purchases with it, or contacts you with a bogus offer to lower your credit card interest rate.
* **Fake charities** and lotteries prey on people's sympathy or greed.
* **Debt collection fraud** tries to collect on unpaid bills whether they are yours or not.
* **COVID-19 scams** are a new type of fraud designed to prey on your fear or financial need.

### Legal Considerations
* While the government may decide that a case of fraud can be settled outside of criminal proceedings, non-governmental parties that claim injury may pursue a civil case. The victims of fraud may sue the perpetrator to have funds recovered, or, in a case where no monetary loss occurred, may sue to reestablish the victim’s rights.

* Proving that fraud has taken place requires showing that the perpetrator committed specific acts. First, the perpetrator must have presented a false statement as a material fact. Second, the perpetrator must have known that the statement was untrue. Third, the perpetrator must have intended to deceive the victim. Fourth, the victim must demonstrate that it relied on the false statement. Fifth, the victim must have suffered damages as a result of acting on the intentionally false statement.

### Consequences of Financial Fraud
* First, because it signals dishonesty, financial fraud makes customers and suppliers doubt a firm's commitments in the product market, which weakens their incentives to sign contracts with the company. Second, financial fraud directly affects a firm's financing abilities and financing costs, as well as adjustments of corporate governance (such as the departure of executives). This leads to great difficulties and uncertainties in a company's production and operation activities. Thus, it is impossible for fraudulent firms to fulfil their existing (or future) commitments (Cornell & Shapiro, 1987).

* According to Infosecurity Magazine, fraud cost the global economy £3.2 trillion in 2018. For some businesses, losses to fraud reach more than 10% of their total spending. Such massive losses push companies to search for new solutions to prevent, detect, and eliminate fraud.

* Fraud can have a devastating impact on a business. In 2001, a massive corporate fraud was uncovered at Enron, a U.S.-based energy company. Executives used a variety of techniques to disguise the company’s financial health, including the deliberate obfuscation of revenue and misrepresentation of earnings. After the fraud was uncovered, shareholders saw share prices plummet from around $90 to less than $1 in a little over a year. Company employees had their equity wiped out and lost their jobs after Enron declared bankruptcy. The Enron scandal was a major driver behind the regulations found in the Sarbanes-Oxley Act passed in 2002.

* Compared with the control firms, firms engaging in financial fraud exhibit a decline in sales revenue of 11.9–17.1% and a decrease in their gross profit margin on sales of 2.4–2.8% in the three years after punishment. Furthermore, sales revenue from the top five large customers falls 43.9–55.1% in the post-punishment period, while sales revenue from small customers does not decline significantly.

### References
https://www.investopedia.com/terms/f/fraud.asp

https://www.usi.edu/internalaudit/what-is-fraud/

https://en.wikipedia.org/wiki/Financial_transaction

https://cleartax.in/g/terms/cash-transaction

https://www.investopedia.com/terms/d/debit.asp

https://www.southpointfinancial.com/whats-difference-debit-credit/

https://www.handbook.fca.org.uk/handbook/glossary/G3490p.html

https://www.investopedia.com/terms/t/transfer.asp

https://www.investopedia.com/financial-edge/0512/the-most-common-types-of-consumer-fraud.aspx

https://www.intellias.com/how-to-use-machine-learning-in-fraud-detection/

https://www.infosecurity-magazine.com/news/global-fraud-hits-32-trillion/

https://www.tandfonline.com/doi/full/10.1080/21697213.2018.1480005

https://sejaumdatascientist.com/crie-uma-solucao-para-fraudes-em-transacoes-financeiras-usando-machine-learning/

# Kaggle

### Context
There is a lack of publicly available datasets on financial services, especially in the emerging mobile money transactions domain. Financial datasets are important to many researchers, and in particular to us, as we perform research in the domain of fraud detection. Part of the problem is the intrinsically private nature of financial transactions, which leads to no publicly available datasets.

We present a synthetic dataset generated using the simulator called PaySim as an approach to such a problem. PaySim uses aggregated data from the private dataset to generate a synthetic dataset that resembles the normal operation of transactions and injects malicious behaviour to later evaluate the performance of fraud detection methods.

### Content
PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company that provides the mobile financial service, which is currently running in more than 14 countries around the world.

This synthetic dataset is scaled down to 1/4 of the original dataset and was created just for Kaggle.

### Headers
A sample row, followed by an explanation of each column:

1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0

step - maps a unit of time in the real world. In this case, 1 step is 1 hour of time. There are 744 steps in total (744 hours, i.e. a 31-day simulation).

type - CASH_IN, CASH_OUT, DEBIT, PAYMENT and TRANSFER.

amount - amount of the transaction in local currency.

nameOrig - customer who started the transaction

oldbalanceOrg - initial balance before the transaction

newbalanceOrig - new balance after the transaction

nameDest - customer who is the recipient of the transaction

oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for customers whose names start with M (Merchants).

newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for customers whose names start with M (Merchants).

isFraud - marks the transactions made by the fraudulent agents inside the simulation. In this specific dataset, the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty them by transferring the funds to another account and then cashing out of the system.

isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
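
The column descriptions above can be sketched as a small typed record. This is only an illustration: the `Transaction` class, `parse_row`, `dest_is_merchant` and `exceeds_flag_threshold` are hypothetical helpers, not part of PaySim or the Kaggle dataset, and the flag rule is assumed to apply to TRANSFER rows only, following the wording of the `isFlaggedFraud` description.

```python
from dataclasses import dataclass, fields

# Threshold quoted in the isFlaggedFraud description (local currency).
FLAG_THRESHOLD = 200_000.0


@dataclass
class Transaction:
    # Field order follows the column order of the sample row.
    step: int
    type: str
    amount: float
    nameOrig: str
    oldbalanceOrg: float
    newbalanceOrig: float
    nameDest: str
    oldbalanceDest: float
    newbalanceDest: float
    isFraud: int
    isFlaggedFraud: int


def parse_row(line: str) -> Transaction:
    """Split one comma-separated row and cast each value to its field type."""
    values = line.strip().split(",")
    spec = fields(Transaction)
    if len(values) != len(spec):
        raise ValueError(f"expected {len(spec)} values, got {len(values)}")
    return Transaction(**{f.name: f.type(v) for f, v in zip(spec, values)})


def dest_is_merchant(tx: Transaction) -> bool:
    """Merchant recipients have names starting with 'M' and no balance information."""
    return tx.nameDest.startswith("M")


def exceeds_flag_threshold(tx: Transaction) -> bool:
    """Assumed reading of the rule: a single TRANSFER above the threshold."""
    return tx.type == "TRANSFER" and tx.amount > FLAG_THRESHOLD


sample = parse_row("1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0")
print(sample.type, sample.amount, dest_is_merchant(sample), exceeds_flag_threshold(sample))
```

For the sample row, the recipient is a merchant, which is consistent with both destination balances being 0.0.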

### Past Research
There are 5 similar files that contain the runs of 5 different scenarios. These files are explained in more detail in chapter 7 of my PhD thesis (available at http://urn.kb.se/resolve?urn=urn:nbn:se:bth-12932).

We ran PaySim several times using random seeds for 744 steps, representing each hour of one month of real time, which matches the original logs. Each run took around 45 minutes on an Intel i7 processor with 16GB of RAM. The final result of a run contains approximately 24 million financial records divided into the 5 transaction types: CASH_IN, CASH_OUT, DEBIT, PAYMENT and TRANSFER.
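
Since each step stands for one hour, a step number can be turned into a day and an hour of the day. The sketch below assumes that step 1 is the first hour of day 1; the dataset description does not state the starting hour, so treat the hour value as relative. `step_to_day_hour` is a hypothetical helper.

```python
HOURS_PER_DAY = 24
TOTAL_STEPS = 744  # from the dataset description


def step_to_day_hour(step: int) -> tuple:
    """Map a 1-based hourly step to (day, hour), with day starting at 1 and hour in 0..23."""
    if not 1 <= step <= TOTAL_STEPS:
        raise ValueError(f"step must be between 1 and {TOTAL_STEPS}, got {step}")
    day, hour = divmod(step - 1, HOURS_PER_DAY)
    return day + 1, hour


print(step_to_day_hour(1))            # first hour of the simulation
print(step_to_day_hour(TOTAL_STEPS))  # last hour of the simulation
print(TOTAL_STEPS / HOURS_PER_DAY)    # number of simulated days
```

A derived hour-of-day or day feature of this kind is one way to expose time patterns to a model.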

### Acknowledgements
This work is part of the research project "Scalable resource-efficient systems for big data analytics", funded by the Knowledge Foundation (grant: 20140032) in Sweden.

Please refer to this dataset using the following citations:

The first paper on the PaySim simulator:

E. A. Lopez-Rojas, A. Elmir, and S. Axelsson. "PaySim: A financial mobile money simulator for fraud detection". In: The 28th European Modeling and Simulation Symposium-EMSS, Larnaca, Cyprus. 2016

# Business Challenge, Output and Tasks

### The Blocker Fraud Company Expansion Strategy

- A company specialized in detecting fraud in financial transactions.
- The Blocker Fraud service guarantees that fraudulent transactions are blocked.
- Business model: the company monetizes the performance of its service, as follows.

1. The company receives 25% of the value of each transaction correctly detected as fraud.
2. The company receives 5% of the value of each transaction detected as fraud that is actually legitimate.
3. The company refunds the customer 100% of the value of each transaction detected as legitimate that is actually fraudulent.
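
The three rules above translate directly into a revenue and loss calculation. The sketch below is one possible reading of them; `blocker_fraud_profit` and the example amounts are hypothetical and only illustrate the arithmetic.

```python
# Rates taken from the three business-model rules.
FEE_TRUE_FRAUD = 0.25       # rule 1: flagged as fraud, actually fraud
FEE_FALSE_ALARM = 0.05      # rule 2: flagged as fraud, actually legitimate
REFUND_MISSED_FRAUD = 1.00  # rule 3: flagged as legitimate, actually fraud


def blocker_fraud_profit(transactions):
    """Sum revenue and loss over (amount, is_fraud, predicted_fraud) tuples."""
    revenue = 0.0
    loss = 0.0
    for amount, is_fraud, predicted_fraud in transactions:
        if predicted_fraud and is_fraud:
            revenue += FEE_TRUE_FRAUD * amount
        elif predicted_fraud and not is_fraud:
            revenue += FEE_FALSE_ALARM * amount
        elif is_fraud:
            # missed fraud: the full value goes back to the customer
            loss += REFUND_MISSED_FRAUD * amount
        # correctly passed legitimate transactions earn and cost nothing
    return {"revenue": revenue, "loss": loss, "profit": revenue - loss}


# Made-up example: one hit, one false alarm, one missed fraud, one clean pass.
example = [
    (1000.0, True, True),
    (200.0, False, True),
    (500.0, True, False),
    (300.0, False, False),
]
result = blocker_fraud_profit(example)
print(f"{result['revenue']:.2f} {result['loss']:.2f} {result['profit']:.2f}")
```

Because a missed fraud costs four times what a detected fraud earns on the same amount, this pay-off structure rewards recall on large transactions.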

### Goals and Tasks

- Create a model that detects fraudulent transactions with high accuracy and precision.

- What are the model's precision and accuracy?
- How reliable is the model at classifying transactions as legitimate or fraudulent?
- What is the company's forecasted revenue if the model classifies 100% of the transactions?
- What is the company's forecasted loss if the model fails?
- What is the Blocker Fraud Company's forecasted profit when using the model?
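
Precision and accuracy both follow from the four confusion-matrix counts. The sketch below shows the definitions; `classification_metrics` is a hypothetical helper, and the counts in the usage line are made up for illustration, not results from this dataset. Recall is included as well, since accuracy alone can look high when the positive class is rare.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall and accuracy from confusion-matrix counts."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("at least one prediction is required")
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


# Illustrative counts: 100 frauds among 10,000 transactions.
metrics = classification_metrics(tp=80, fp=20, fn=20, tn=9880)
print({name: round(value, 3) for name, value in metrics.items()})
```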

### Tasks and Deliveries

- A deployed model with API access. The API must return "Fraud" or "Legitimate" for each transaction submitted.
- A README explaining how to use the tool.
- A report on the model's performance and results with respect to profit and loss. The report must answer the following questions:
    - What are the model's precision and accuracy?
    - How reliable is the model at classifying transactions as legitimate or fraudulent?
    - What is the company's forecasted revenue if the model classifies 100% of the transactions?
    - What is the company's forecasted loss if the model fails?
    - What is the Blocker Fraud Company's forecasted profit when using the model?
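
The API deliverable only fixes the two labels the service must return. As a rough sketch of that contract, the function below turns a model score into a JSON response. Everything here is an assumption for illustration: the `label_from_score` and `build_response` helpers, the `prediction` key, and the 0.5 threshold are not specified anywhere in the task description.

```python
import json


def label_from_score(score: float, threshold: float = 0.5) -> str:
    """Map a fraud score in [0, 1] to one of the two required labels."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    return "Fraud" if score >= threshold else "Legitimate"


def build_response(transaction: dict, score: float) -> str:
    """Build the JSON body returned for one submitted transaction."""
    payload = {
        "name_orig": transaction.get("name_orig"),
        "prediction": label_from_score(score),
    }
    return json.dumps(payload)


print(build_response({"name_orig": "C1305486145", "amount": 181.0}, score=0.93))
print(build_response({"name_orig": "C1231006815", "amount": 9839.64}, score=0.02))
```

In a deployed service the score would come from the trained model; the threshold would normally be tuned against the business pay-off rather than left at a default.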

# 0. Imports

## 0.1. Libraries

In [1]:
# jupyter core
from IPython.display import display, HTML, Image

# data manipulation
import inflection
import datetime
import math
import random
import numpy as np
import pandas as pd
from scipy import stats as ss

# PySpark for data manipulation
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StringType, ShortType, DoubleType, StructType, IntegerType
from pyspark.sql.functions import format_number, skewness, kurtosis, col, when, isnan, count

# EDA
import seaborn as sns
import matplotlib.pyplot as plt

## 0.2. Functions

In [2]:
# jupyter setup
def jupyter_settings():
    
    # jupyter core settings
    display(HTML("<style>.container { width:100% !important; }</style>"))
    
    # pandas
    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.expand_frame_repr', False)
    
    # matplotlib: render figures inline
    # (an IPython line magic uses '%'; a leading '!' would run a shell command)
    %matplotlib inline
    plt.style.use('bmh')
    plt.rcParams['figure.figsize'] = [35, 12]
    plt.rcParams['font.size'] = 40
    
    # seaborn
    sns.set()

# descriptive analysis summary for numerical features
def num_analysis(num_attributes):
    # central tendency - mean, median
    ct1 = pd.DataFrame(num_attributes.apply(np.mean)).T
    ct2 = pd.DataFrame(num_attributes.apply(np.median)).T

    # dispersion - std, min, max, range, skew, kurtosis
    d1 = pd.DataFrame(num_attributes.apply(np.std)).T
    d2 = pd.DataFrame(num_attributes.apply(min)).T
    d3 = pd.DataFrame(num_attributes.apply(max)).T
    d4 = pd.DataFrame(num_attributes.apply(lambda x: x.max() - x.min())).T
    d5 = pd.DataFrame(num_attributes.apply(lambda x: x.skew())).T
    d6 = pd.DataFrame(num_attributes.apply(lambda x: x.kurtosis())).T

    # concatenate
    m = pd.concat([d2,d3,d4,ct1,ct2,d1,d5,d6]).T.reset_index()
    m.columns = ['attributes','min','max','range','mean','median','std','skew','kurtosis']
    
    # histogram (drawn as a side effect; the notebook displays the figure)
    num_attributes.hist(bins=30)
    
    # a second return statement would be unreachable, so only the summary is returned
    return m

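# bias-corrected Cramér's V between two categorical series
def cramer_v(x, y):
    cm = pd.crosstab(x, y).values
    n = cm.sum()
    r, k = cm.shape
    
    chi2 = ss.chi2_contingency(cm)[0]
    # the bias correction is subtracted from phi^2 = chi2/n, not from chi2 itself
    phi2corr = max(0, chi2/n - (k-1)*(r-1)/(n-1))
    
    kcorr = k - (k-1)**2/(n-1)
    rcorr = r - (r-1)**2/(n-1)
    
    return np.sqrt(phi2corr / min(kcorr-1, rcorr-1))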
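
For reference, the statistic that `cramer_v` is meant to compute is the bias-corrected Cramér's V (Bergsma's correction). With $n$ the total count, $r$ and $k$ the number of rows and columns of the contingency table, and $\chi^2$ the chi-squared statistic (`n`, `r`, `k` and `chi2` in the code), it can be written as:

```latex
\begin{aligned}
\varphi^2 &= \frac{\chi^2}{n}, &
\tilde{\varphi}^2 &= \max\!\left(0,\; \varphi^2 - \frac{(k-1)(r-1)}{n-1}\right), \\
\tilde{k} &= k - \frac{(k-1)^2}{n-1}, &
\tilde{r} &= r - \frac{(r-1)^2}{n-1}, \\
\tilde{V} &= \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\; \tilde{r}-1)}}
\end{aligned}
```

Note that the correction term is subtracted from $\varphi^2$, that is, after dividing $\chi^2$ by $n$.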


In [3]:
jupyter_settings()



## 0.3. Data (with PySpark)

In [28]:
# creates a SparkSession
spark = SparkSession.builder\
                    .master('local')\
                    .appName('Fraud')\
                    .getOrCreate()

# enable arrow-based columnar data transfers
spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', 'true')

In [29]:
# schema definition: field, type, nullable or not
data_schema = [StructField('step', ShortType(), True), 
              StructField('type', StringType(), True),
              StructField('amount', DoubleType(), True), 
              StructField('nameOrig', StringType(), True),
              StructField('oldbalanceOrg', DoubleType(), True), 
              StructField('newbalanceOrig', DoubleType(), True),
              StructField('nameDest', StringType(), True), 
              StructField('oldbalanceDest', DoubleType(), True),
              StructField('newbalanceDest', DoubleType(), True), 
              StructField('isFraud', ShortType(), True),
              StructField('isFlaggedFraud', ShortType(), True)]

# final structure
final_struct = StructType(fields=data_schema)

In [8]:
# load dataset in spark
df_spark = spark.read.csv('../data/raw/raw.csv', schema=final_struct, header=True)

# display the schema to check dtypes
df_spark.printSchema()

root
 |-- step: short (nullable = true)
 |-- type: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- nameOrig: string (nullable = true)
 |-- oldbalanceOrg: double (nullable = true)
 |-- newbalanceOrig: double (nullable = true)
 |-- nameDest: string (nullable = true)
 |-- oldbalanceDest: double (nullable = true)
 |-- newbalanceDest: double (nullable = true)
 |-- isFraud: short (nullable = true)
 |-- isFlaggedFraud: short (nullable = true)



In [9]:
df_spark.show(10)

+----+--------+--------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+
|step|    type|  amount|   nameOrig|oldbalanceOrg|newbalanceOrig|   nameDest|oldbalanceDest|newbalanceDest|isFraud|isFlaggedFraud|
+----+--------+--------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+
|   1| PAYMENT| 9839.64|C1231006815|     170136.0|     160296.36|M1979787155|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 1864.28|C1666544295|      21249.0|      19384.72|M2044282225|           0.0|           0.0|      0|             0|
|   1|TRANSFER|   181.0|C1305486145|        181.0|           0.0| C553264065|           0.0|           0.0|      1|             0|
|   1|CASH_OUT|   181.0| C840083671|        181.0|           0.0|  C38997010|       21182.0|           0.0|      1|             0|
|   1| PAYMENT|11668.14|C2048537720|      41554.0|      29885.86|M1230701703|      

In [10]:
df_spark = df_spark.withColumnRenamed('nameOrig', 'name_orig')\
                   .withColumnRenamed('oldbalanceOrg', 'oldbalance_org')\
                   .withColumnRenamed('newbalanceOrig', 'newbalance_orig')\
                   .withColumnRenamed('nameDest', 'name_dest')\
                   .withColumnRenamed('oldbalanceDest', 'oldbalance_dest')\
                   .withColumnRenamed('newbalanceDest', 'newbalance_dest')\
                   .withColumnRenamed('isFraud', 'is_fraud')\
                   .withColumnRenamed('isFlaggedFraud', 'is_flagged_fraud')

In [11]:
print(df_spark.columns)

['step', 'type', 'amount', 'name_orig', 'oldbalance_org', 'newbalance_orig', 'name_dest', 'oldbalance_dest', 'newbalance_dest', 'is_fraud', 'is_flagged_fraud']


In [12]:
# gets only the numerical columns
df_summary_statistics = df_spark.select(['amount', 'oldbalance_org', 'newbalance_orig', 'oldbalance_dest', 'newbalance_dest'])

In [13]:
df_summary_statistics.describe().show()

+-------+-----------------+-----------------+------------------+------------------+------------------+
|summary|           amount|   oldbalance_org|   newbalance_orig|   oldbalance_dest|   newbalance_dest|
+-------+-----------------+-----------------+------------------+------------------+------------------+
|  count|          6362620|          6362620|           6362620|           6362620|           6362620|
|   mean|179861.9035491287|833883.1040744764| 855113.6685785812|1100701.6665196533|1224996.3982019224|
| stddev|603858.2314629209|2888242.673037527|2924048.5029542595|3399180.1129944525|3674128.9421196915|
|    min|              0.0|              0.0|               0.0|               0.0|               0.0|
|    max|    9.244551664E7|    5.958504037E7|     4.958504037E7|    3.5601588935E8|    3.5617927892E8|
+-------+-----------------+-----------------+------------------+------------------+------------------+



In [27]:
# checks the Q1, Q2 (median) and Q3
df_spark.stat.approxQuantile('amount', [0.25, 0.50, 0.75], 0)

[13389.57, 74871.8, 208721.45]

In [15]:
# checks the Q1, Q2 (median) and Q3
df_spark.stat.approxQuantile('oldbalance_org', [0.25, 0.50, 0.75], 0)

[0.0, 14208.0, 107315.0]

In [17]:
# checks the Q1, Q2 (median) and Q3
df_spark.stat.approxQuantile('newbalance_orig', [0.25, 0.50, 0.75], 0)

[0.0, 0.0, 144258.41]

In [18]:
# checks the Q1, Q2 (median) and Q3
df_spark.stat.approxQuantile('oldbalance_dest', [0.25, 0.50, 0.75], 0)

[0.0, 132705.52, 943036.53]

In [19]:
# checks the Q1, Q2 (median) and Q3
df_spark.stat.approxQuantile('newbalance_dest', [0.25, 0.50, 0.75], 0)

[0.0, 214661.23, 1111909.16]
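
One common use of Q1 and Q3 is Tukey's IQR rule for marking unusually large or small values. The sketch below applies it to the `amount` quartiles printed above. The `iqr_fences` helper and the 1.5 multiplier are conventions assumed here for illustration, not something this notebook prescribes.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple:
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    if q3 < q1:
        raise ValueError("q3 must not be smaller than q1")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Q1 and Q3 of 'amount' as returned by approxQuantile above.
lower, upper = iqr_fences(13389.57, 208721.45)
print(f"lower fence: {lower:.2f}, upper fence: {upper:.2f}")
```

Since amounts cannot be negative, only the upper fence is informative for this column.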

In [20]:
# calculating the skewness for numerical features
df_spark.select([skewness(df_spark[column]).alias('skew: ' + column) for column in df_summary_statistics.columns]).show()

+------------------+--------------------+---------------------+---------------------+---------------------+
|      skew: amount|skew: oldbalance_org|skew: newbalance_orig|skew: oldbalance_dest|skew: newbalance_dest|
+------------------+--------------------+---------------------+---------------------+---------------------+
|30.993942175610577|   5.249135183109044|    5.176882780698867|   19.921753219197374|   19.352297495316165|
+------------------+--------------------+---------------------+---------------------+---------------------+



In [23]:
# calculating the kurtosis for numerical columns
df_spark.select([kurtosis(df_spark[column]).alias('kurt: ' + column) for column in df_summary_statistics.columns]).show()

+------------------+--------------------+---------------------+---------------------+---------------------+
|      kurt: amount|kurt: oldbalance_org|kurt: newbalance_orig|kurt: oldbalance_dest|kurt: newbalance_dest|
+------------------+--------------------+---------------------+---------------------+---------------------+
|1797.9552914635412|   32.96485169601807|    32.06695841776405|    948.6733789369696|    862.1558294725742|
+------------------+--------------------+---------------------+---------------------+---------------------+
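
To make the two statistics concrete, the sketch below computes them from central moments in plain Python. It assumes the population (biased) definitions, with kurtosis reported as excess kurtosis (0 for a normal distribution), which is how Spark's `skewness` and `kurtosis` are understood to behave; the helper names are hypothetical.

```python
def central_moment(values, order: int) -> float:
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values) -> float:
    """Third standardized moment: m3 / m2**1.5."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return central_moment(values, 3) / m2 ** 1.5


def excess_kurtosis(values) -> float:
    """Fourth standardized moment minus 3: m4 / m2**2 - 3."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        raise ValueError("kurtosis is undefined for constant data")
    return central_moment(values, 4) / m2 ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 1, 10]
print(round(skewness(symmetric), 6), round(excess_kurtosis(symmetric), 6))
print(round(skewness(right_tailed), 6))
```

Large positive values such as those in the tables above indicate long right tails, that is, a few very large amounts and balances.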



In [25]:
# checks for NaN values in each column (note: isnan does not detect nulls)
# count(CASE WHEN isnan(column) THEN column END) => counts the NaN values found
df_spark.select([count(when(isnan(column), column)).alias(column) for column in df_spark.columns]).show()

+----+----+------+---------+--------------+---------------+---------+---------------+---------------+--------+----------------+
|step|type|amount|name_orig|oldbalance_org|newbalance_orig|name_dest|oldbalance_dest|newbalance_dest|is_fraud|is_flagged_fraud|
+----+----+------+---------+--------------+---------------+---------+---------------+---------------+--------+----------------+
|   0|   0|     0|        0|             0|              0|        0|              0|              0|       0|               0|
+----+----+------+---------+--------------+---------------+---------+---------------+---------------+--------+----------------+
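
NaN and null are different kinds of missing value: a NaN test does not flag nulls, so a complete check counts both. The plain-Python sketch below illustrates the distinction with `None` standing in for null. `count_missing` and the sample rows are hypothetical. In PySpark the null side of the check would use `Column.isNull()`.

```python
import math


def count_missing(rows, column: str) -> dict:
    """Count null (None) and NaN entries of one column separately."""
    nulls = 0
    nans = 0
    for row in rows:
        value = row[column]
        if value is None:
            nulls += 1
        elif isinstance(value, float) and math.isnan(value):
            nans += 1
    return {"null": nulls, "nan": nans}


# Made-up rows: one valid amount, one null, one NaN.
rows = [
    {"amount": 181.0},
    {"amount": None},
    {"amount": float("nan")},
]
print(count_missing(rows, "amount"))
```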



## 0.4. Data (with Pandas)

In [31]:
df_raw = pd.read_csv('../data/raw/raw.csv')
df_raw.head()

   step      type    amount     nameOrig  oldbalanceOrg  newbalanceOrig     nameDest  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud
0     1   PAYMENT   9839.64  C1231006815       170136.0       160296.36  M1979787155             0.0             0.0        0               0
1     1   PAYMENT   1864.28  C1666544295        21249.0        19384.72  M2044282225             0.0             0.0        0               0
2     1  TRANSFER     181.0  C1305486145          181.0             0.0   C553264065             0.0             0.0        1               0
3     1  CASH_OUT     181.0   C840083671          181.0             0.0    C38997010         21182.0             0.0        1               0
4     1   PAYMENT  11668.14  C2048537720        41554.0        29885.86  M1230701703             0.0             0.0        0               0
