# Fraud detection: problem, solutions and tools

## 1.1 The nature of the problem

"[Fraud is a billion-dollar business 
and it is increasing every year](https://en.wikipedia.org/wiki/Data_analysis_techniques_for_fraud_detection)"

What is fraud? There are many formal definitions, but essentially fraud is the "art" and crime of deceiving and scamming people in their financial transactions. Fraud has existed throughout human history, but in this age of digital technology its strategies, extent and magnitude have become wide-ranging, from credit card transactions to health benefits to insurance claims. Fraudsters are also getting very creative. Who has never received an email from the widow of a Nigerian royal family who is looking for someone trustworthy to hand a large part of her inheritance to?

No wonder fraud is a big deal. It has been estimated that losses in business organizations due to fraudulent transactions can reach [4–5% of their revenues](https://wallethub.com/edu/cc/credit-debit-card-fraud-statistics/25725/). A 5% reduction in fraud may not sound like a lot, but in monetary terms it is non-trivial, and the savings can outweigh the cost of achieving it by a large margin. A [PwC survey](https://www.pwc.com/gx/en/forensics/global-economic-crime-and-fraud-survey-2018.pdf) found that 50% of the 7,200 companies surveyed had been victims of fraud of some kind. A recent [study by FICO](https://www.fico.com/blogs/real-time-payments-fraud?utm_source=social&utm_medium=social_platforms&utm_campaign=APAC_banks) found that 4 out of 5 banks in its survey had experienced an increase in fraud activity, and this is expected to rise in the future.

Although many organizations have taken measures to counter fraud, it can never be fully eradicated. The goal is really to minimize its impact, and the benefits of screening must be weighed against its costs, such as investment in fraud detection technology and potentially losing customers due to "false positive" alarms.

The purpose of this article is to highlight some tools, techniques and best practices in the field of fraud detection. Towards the end I'll provide a Python implementation using a publicly available dataset.

## 1.2 Use cases
Fraud occurs wherever a transaction is involved, but credit card fraud is probably the best-known case. It ranges from something as primitive as stealing a card or using a stolen one to more aggressive forms such as account takeover, counterfeiting [and more](https://www.experian.com/blogs/ask-experian/credit-education/preventing-fraud/credit-card-fraud-what-to-do-if-you-are-a-victim/). Credit card fraud has always existed, but its magnitude keeps growing with the increasing number of online credit card transactions taking place every day. According to a [Nilson Report](https://nilsonreport.com/upload/content_promo/The_Nilson_Report_10-17-2016.pdf), global fraud amounted to USD 7.6 billion in 2010 and is expected to exceed USD 31 billion in 2020. In the UK alone, losses from fraudulent transactions were estimated at more than [USD 1 billion in 2018](https://en.wikipedia.org/wiki/Credit_card_fraud).

The other major fraud cases occur in the insurance industry. Some estimates suggest that as much as 10% of health insurance claims in the US can be attributed to fraud, which amounts to a non-trivial USD 110 billion annually.

**Insurance fraud is so widespread that there is an entire organization called the [Coalition Against Insurance Fraud](http://www.insurancefraud.org/) and a scientific journal called the [Journal of Insurance Fraud](https://www.insurancefraud.org/jifa/jun-2016) dedicated to studying fraud in the insurance business.**

## 2.1 Data science solution
In the past (i.e. before machine learning became the trend), the standard practice was to use the so-called "rule-based approach". But every rule has an exception, so this technique could only partially mitigate the problem.
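
To make the idea concrete, here is a minimal sketch of what a rule-based check might look like. The field names, thresholds and rules are hypothetical and chosen only for illustration; they are not taken from any real fraud system.

```python
# Hypothetical rule-based screening: every field name and threshold below is an
# illustrative assumption, not taken from a real fraud system.
def flag_transaction(tx):
    """Return the list of rules that a transaction (a dict) triggers."""
    reasons = []
    if tx["amount"] > 5000:
        reasons.append("large amount")
    if tx["country"] != tx["home_country"]:
        reasons.append("foreign country")
    if tx["attempts_last_hour"] >= 5:
        reasons.append("many attempts in one hour")
    return reasons

# A legitimate purchase made while travelling still trips a rule (the "exception"),
# while a small domestic purchase passes every rule, whether it is genuine or not.
holiday_purchase = {"amount": 120, "country": "FR", "home_country": "US", "attempts_last_hour": 1}
small_domestic = {"amount": 40, "country": "US", "home_country": "US", "attempts_last_hour": 1}

print(flag_transaction(holiday_purchase))  # ['foreign country']
print(flag_transaction(small_domestic))    # []
```
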
With ever-increasing online transactions and the production of large volumes of customer data, machine learning is increasingly seen as an effective tool to detect and counter fraud. However, there is no single tool, no silver bullet, that works for all kinds of fraud detection problems in every industry. The nature of the problem differs from case to case and from industry to industry. Every solution is therefore carefully tailored to the industry's domain, and the methods depend on, among other things, the data and transaction types involved.

In machine learning parlance, fraud detection is generally treated as a supervised classification problem, where observations are classified as "fraud" or "non-fraud" based on their features. It is also an interesting problem in ML research because of imbalanced data, i.e. very few fraud cases among an extremely large number of transactions. How to deal with imbalanced classes is a subject for a separate discussion.

Fraud can also be isolated using outlier detection techniques. Outlier detection tools have their own ways of tackling the problem, such as [time series analysis](https://www.datasciencecentral.com/profiles/blogs/outlier-detection-with-time-series-data-mining), cluster analysis, real-time monitoring of transactions etc.
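
As a rough illustration of the outlier detection idea, the sketch below uses scikit-learn's `IsolationForest`. This detector is my own choice for the example and is not one of the methods listed in this article. The data is synthetic, and the `contamination` value of 1% is an arbitrary assumption about the share of outliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic "transactions": 500 ordinary points plus 5 planted extreme ones.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
planted = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -8.0], [-8.0, -9.0], [0.0, 12.0]])
X_demo = np.vstack([normal, planted])

# contamination is the assumed share of outliers; 1% is an arbitrary choice here.
detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(X_demo)  # -1 = outlier, 1 = inlier

flagged = labels == -1
print("flagged observations:", flagged.sum())
print("planted outliers caught:", flagged[-5:].sum(), "of 5")
```

Unlike the supervised approach, this needs no fraud labels, but it only finds observations that look unusual, and unusual is not the same as fraudulent.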

## 2.2 Techniques
**Statistical techniques:** average, quantiles, probability distribution, association rules  
**Supervised ML algorithms:** classification, logistic regression, neural net, time-series analysis  
**Unsupervised ML algorithms:** cluster analysis, Bayesian network, peer group analysis, break point analysis, Benford's law (law of anomalous numbers)
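
To give a flavour of one of these, Benford's law says that in many naturally occurring sets of numbers the leading digit d appears with probability log10(1 + 1/d), so about 30% of values start with 1. A strong deviation from this pattern can be a hint that figures were made up. The sketch below compares the expected shares with those observed in a toy sequence; the powers of 2 merely stand in for real amounts and are an assumption of the example.

```python
import math
from collections import Counter

# Benford's law: the expected share of leading digit d is log10(1 + 1/d).
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(x):
    """First significant digit of a non-zero number."""
    return int(f"{abs(x):e}"[0])

def leading_digit_shares(values):
    """Observed share of each leading digit 1-9 among the non-zero values."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

# Toy data: powers of 2 are a classic sequence that follows Benford's law closely.
amounts = [2 ** k for k in range(1, 101)]
observed = leading_digit_shares(amounts)

print("digit  expected  observed")
for d in range(1, 10):
    print(f"{d:>5}  {benford[d]:>8.3f}  {observed[d]:>8.3f}")
```
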

## 3. A simple Python implementation
For this simple demo I am using a popular [Kaggle dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud).

## 3.1 Data preparation

In [1]:
# import libraries
import pandas as pd
import numpy as np

# import data
df = pd.read_csv("C:\\Users\\DataS\\Google Drive\\DataScience\\Datasets\\creditcard.csv")

# view the column names
df.columns

**The dataset has 31 columns. The first column, "Time", is the transaction timestamp; the second-to-last column, "Amount", is the transaction amount; and the last column, "Class", designates whether the transaction is fraud or non-fraud (fraud = 1, non-fraud = 0).
The remaining columns, "V1" to "V28", are variables of undisclosed meaning that have already been transformed.**

In [3]:
# number of fraud and non-fraud observations in the dataset
frauds = len(df[df.Class == 1])
nonfrauds = len(df[df.Class == 0])

print("Frauds", frauds)
print("Non-frauds", nonfrauds)

Frauds 492
Non-frauds 284315


In [4]:
## scale the "Amount" and "Time" columns so they are on a scale comparable to the other variables

from sklearn.preprocessing import RobustScaler
rob_scaler = RobustScaler()

df['scaled_amount'] = rob_scaler.fit_transform(df['Amount'].values.reshape(-1,1))
df['scaled_time'] = rob_scaler.fit_transform(df['Time'].values.reshape(-1,1))

# now drop the original columns
df.drop(['Time','Amount'], axis=1, inplace=True)
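
For intuition, `RobustScaler` with its default settings subtracts the median and divides by the interquartile range, which makes it less sensitive to extreme values than scaling by mean and standard deviation. Here is a minimal NumPy sketch of that calculation on made-up numbers (not values from the dataset):

```python
import numpy as np

# Made-up amounts with one extreme value; not taken from the dataset.
amounts = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median = np.median(amounts)
q1, q3 = np.percentile(amounts, [25, 75])

# Robust scaling: centre on the median, divide by the interquartile range (IQR).
scaled = (amounts - median) / (q3 - q1)
print(scaled)  # values: -1, -0.5, 0, 0.5, 48.5
```

The extreme value stays extreme after scaling, but it does not distort the scale of the ordinary values the way it would if the mean and standard deviation were used.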

In [5]:
# define X and y variables
X = df.loc[:, df.columns != 'Class']
y = df.loc[:, df.columns == 'Class']

## 3.2 Taking subsamples
**This is an extremely imbalanced dataset, so we take a balanced subsample by undersampling the non-fraud class.**

In [6]:
# number of fraud cases
frauds = len(df[df.Class == 1])

# select the indices of the fraud and non-fraud observations
fraud_indices = df[df.Class == 1].index
nonfraud_indices = df[df.Class == 0].index

# from all non-fraud observations, randomly select as many as there are fraud observations
# (no random seed is set, so the selection differs from run to run)
random_nonfraud_indices = np.random.choice(nonfraud_indices, frauds, replace=False)

# append the two sets of indices
under_sample_indices = np.concatenate([fraud_indices, random_nonfraud_indices])

# undersampled dataset; the indices are row labels, so select with .loc rather than .iloc
under_sample_data = df.loc[under_sample_indices, :]

# Now split X, y variables from the under sample data
X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']
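
The same undersampling can be written more compactly with pandas' `DataFrame.sample`, which also accepts a `random_state` so that the subsample is reproducible. The sketch below runs on a small synthetic frame rather than on the real data; the frame `toy` and the values in its `V1` column are made up for the example.

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the real data: 1,000 rows, 20 of them fraud.
rng = np.random.default_rng(42)
toy = pd.DataFrame({"V1": rng.normal(size=1000), "Class": [1] * 20 + [0] * 980})

fraud_rows = toy[toy.Class == 1]
# Draw as many non-fraud rows as there are fraud rows; random_state fixes the draw.
nonfraud_rows = toy[toy.Class == 0].sample(n=len(fraud_rows), random_state=42)

# Combine and shuffle so the two classes are interleaved.
balanced = pd.concat([fraud_rows, nonfraud_rows]).sample(frac=1, random_state=42)

print(balanced.Class.value_counts())
```
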


## 3.3 Modeling

In [7]:
## split data into training and testing set
from sklearn.model_selection import train_test_split

# # The complete dataset
# X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)

# Split dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(
    X_undersample, y_undersample, test_size=0.3, random_state=0
)
## modeling with logistic regression

# import model
from sklearn.linear_model import LogisticRegression
# instantiate model
model = LogisticRegression()
# fit; ravel() turns the single-column y DataFrame into the 1-D array scikit-learn expects
model.fit(X_train_undersample, y_train_undersample.values.ravel())
# predict
y_pred = model.predict(X_test_undersample)


## 3.4 Model evaluation
**Note: Do not use accuracy as the metric. In a dataset where about 99.8% of observations are non-fraud, a model that labels every transaction as non-fraud is correct about 99.8% of the time while catching no fraud at all. The confusion matrix and precision/recall scores are better metrics.**
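
The arithmetic behind this note can be checked in a few lines, using the class counts printed earlier. The "model" here is a hypothetical one that labels every transaction as non-fraud.

```python
# Class counts as printed earlier in this article.
n_fraud, n_nonfraud = 492, 284315
total = n_fraud + n_nonfraud

# A hypothetical "model" that labels every transaction as non-fraud:
true_negatives = n_nonfraud  # every non-fraud is labelled correctly
true_positives = 0           # no fraud is ever caught

accuracy = (true_positives + true_negatives) / total
recall_fraud = true_positives / n_fraud

print(f"accuracy:     {accuracy:.4f}")      # 0.9983
print(f"fraud recall: {recall_fraud:.4f}")  # 0.0000
```
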

In [8]:
# import classification report and confusion matrix
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# use separate names for the results so the imported functions are not overwritten
report = classification_report(y_test_undersample, y_pred)
cm = confusion_matrix(y_test_undersample, y_pred)

print("CLASSIFICATION REPORT")
print(report)
print("CONFUSION MATRIX")
print(cm)

CLASSIFICATION REPORT
              precision    recall  f1-score   support

           0       0.93      0.94      0.94       149
           1       0.94      0.93      0.94       147

    accuracy                           0.94       296
   macro avg       0.94      0.94      0.94       296
weighted avg       0.94      0.94      0.94       296

CONFUSION MATRIX
[[140   9]
 [ 10 137]]
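
To see how the two outputs relate, the scores in the report can be recomputed by hand from the confusion matrix. In scikit-learn's layout, rows are the actual classes and columns the predicted ones. The numbers below are copied from the output above.

```python
import numpy as np

# Confusion matrix printed above: rows = actual class, columns = predicted class.
cm = np.array([[140, 9],
               [10, 137]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of the transactions flagged as fraud, how many are fraud
recall = tp / (tp + fn)     # of the actual frauds, how many are flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.94 recall=0.93 f1=0.94 accuracy=0.94
```

These match the row for class 1 in the classification report. Keep in mind that they describe the balanced test subsample, not the original, heavily imbalanced data.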


## End note:
Thanks for reading this through. A Jupyter notebook with the Python demo can be found in my GitHub repo.
I can be reached via [Twitter](https://twitter.com/DataEnthus) or [LinkedIn](https://www.linkedin.com/in/mab-alam/).