<span style="color:SteelBlue; font-size:42px;">Fraud Detection</span> 
<hr>
*Draft v2.0 Last updated April 6, 2018*

This notebook is a work in progress, adapted from the book *Machine Learning and Security* by David Freeman & Clarence Chio [[1](#citaion1)].

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-libraries" data-toc-modified-id="Import-libraries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import libraries</a></span></li><li><span><a href="#The-dataset" data-toc-modified-id="The-dataset-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>The dataset</a></span></li><li><span><a href="#Machine-learning-goal" data-toc-modified-id="Machine-learning-goal-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Machine learning goal</a></span></li><li><span><a href="#Citations" data-toc-modified-id="Citations-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Citations</a></span></li></ul></div>

## Import libraries

In [50]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

## The dataset
The dataset contains hypothetical transaction data from an e-commerce retailer. There are 39,221 transactions, five input features, and one binary `label` indicating whether the transaction is fraud (`1`) or not fraud (`0`).

In [51]:
# Read in the data from the CSV file
df = pd.read_csv('project_files/hypothetical_fraud.csv')

In [52]:
df.head()

Unnamed: 0,accountAgeDays,numItems,localTime,paymentMethod,paymentMethodAgeDays,label
0,29,1,4.745402,paypal,28.204861,0
1,725,1,4.742303,storecredit,0.0,0
2,845,1,4.921318,creditcard,0.0,0
3,503,1,4.886641,creditcard,0.0,0
4,2000,1,5.040929,creditcard,0.0,0


Let’s use the `df.sample()` method to retrieve five random rows from `df`:

In [None]:
# Pull five random samples from our data frame 'df'
df.sample(5)

**Data dictionary:**

- `accountAgeDays`:       Number of days since the account was opened
- `numItems`:             Number of items purchased
- `localTime`:            Local time of the transaction
- `paymentMethod`:        Payment type: creditcard, paypal, or storecredit
- `paymentMethodAgeDays`: Number of days before the transaction that the payment method was added
- `label`:                0=not fraud; 1=fraud

## Machine learning goal
The goal is for a machine learning algorithm to learn to identify fraudulent transactions from the five features in our dataset.

In [54]:
# Convert categorical feature into dummy variables with one-hot encoding
df = pd.get_dummies(df, columns=['paymentMethod'])
df.sample(3)

Unnamed: 0,accountAgeDays,numItems,localTime,paymentMethodAgeDays,label,paymentMethod_creditcard,paymentMethod_paypal,paymentMethod_storecredit
3590,2000,1,4.461622,153.490972,0,1,0,0
11037,2000,1,5.034622,357.005556,0,1,0,0
7982,37,1,4.921349,0.0,0,0,1,0
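
To make the encoding step concrete, here is a minimal sketch on a made-up three-row frame (`toy` and its values are hypothetical, not taken from the dataset). It assumes a pandas version in which `get_dummies` accepts `dtype`; passing `dtype=int` yields 0/1 indicator columns like those shown above, whereas recent pandas releases default to boolean columns.

```python
import pandas as pd

# Hypothetical three-row frame; the values are invented for illustration.
toy = pd.DataFrame({
    "numItems": [1, 2, 1],
    "paymentMethod": ["creditcard", "paypal", "storecredit"],
})

# Each category becomes its own indicator column and the original
# 'paymentMethod' column is dropped.
encoded = pd.get_dummies(toy, columns=["paymentMethod"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Each row ends up with exactly one of the three `paymentMethod_*` columns set to 1.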


In [55]:
# Split dataset up into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('label', axis=1), df['label'],
    test_size=0.33, random_state=17)

In [56]:
# Initialize and train classifier model
clf = LogisticRegression().fit(X_train, y_train)

In [57]:
# Make predictions on test set
y_pred = clf.predict(X_test)

In [58]:
# Compare test set predictions with ground truth labels
print(accuracy_score(y_test, y_pred))

0.99992273816


We get an accuracy of about 99.992%! Let's also look at the confusion matrix:

In [59]:
print(confusion_matrix(y_test, y_pred))

[[12753     0]
 [    1   189]]


There appears to be only a single misclassification in the entire test set: 189 transactions are correctly flagged as fraud, and there is 1 false negative, a fraudulent transaction that was not detected. There are no false positives. In a practical setting, we would have analysts review the false positives and false negatives to confirm.
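
Because only 190 of the 12,943 test transactions are fraud, accuracy alone can flatter a model. As a rough sketch, the snippet below derives precision, recall, and the accuracy of a trivial "always not fraud" baseline directly from the confusion matrix printed above. The variable names are illustrative, and the matrix is typed in by hand rather than recomputed.

```python
import numpy as np

# Confusion matrix reported above: rows are true labels, columns are predictions.
cm = np.array([[12753, 0],
               [1, 189]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)       # share of flagged transactions that are fraud
recall = tp / (tp + fn)          # share of fraudulent transactions that were flagged
accuracy = (tp + tn) / cm.sum()  # share of all transactions classified correctly
# Accuracy of a trivial baseline that labels every transaction "not fraud"
baseline = (tn + fp) / cm.sum()

print(f"precision={precision:.4f} recall={recall:.4f}")
print(f"accuracy={accuracy:.5f} baseline={baseline:.5f}")
```

With these counts, precision is 1.0 and recall is 189/190, while always predicting "not fraud" would already reach roughly 98.5% accuracy.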

We can apply this model to any incoming transaction and get a probability score for how likely it is to be fraudulent. But remember, any incoming transactions (e.g., `df_real` below) have to undergo the same data preprocessing that we did above, such as one-hot encoding.
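
One way to keep incoming data consistent with the training layout is to one-hot encode it and then reindex it to the training columns. The sketch below uses small invented frames (`train`, `incoming`, and `feature_columns` are hypothetical names). In this notebook, the reference column list would presumably come from `X_train.columns`.

```python
import pandas as pd

# Hypothetical training frame after one-hot encoding (values invented).
train = pd.get_dummies(
    pd.DataFrame({
        "accountAgeDays": [29, 725, 845],
        "paymentMethod": ["paypal", "storecredit", "creditcard"],
    }),
    columns=["paymentMethod"], dtype=int)
feature_columns = train.columns.tolist()

# A single incoming transaction contains only one payment method, so
# get_dummies alone would create one indicator column instead of three.
incoming = pd.DataFrame({"accountAgeDays": [100], "paymentMethod": ["paypal"]})
incoming_encoded = pd.get_dummies(incoming, columns=["paymentMethod"], dtype=int)

# Reindex to the training layout; indicator columns missing from the
# incoming batch are created and filled with 0.
aligned = incoming_encoded.reindex(columns=feature_columns, fill_value=0)
print(aligned)
```

Reindexing also fixes the column order, which matters because the classifier expects features in the same positions it saw during training.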

In [60]:
# Read in real data from the CSV file
df_real = pd.read_csv('project_files/hypothetical_real.csv')

In [61]:
clf.predict_proba(df_real)

array([[  1.00000000e+000,   9.72357795e-284]])

The array above shows the probability of the transaction not being fraudulent (first column) versus being fraudulent (second column). The first prediction therefore indicates that the transaction is almost certainly not fraudulent: its estimated fraud probability is effectively zero. This is a very elementary starting point for planning a fraud detection system. Much more experimentation is needed, as well as deploying the model to production and monitoring it. More on this in another notebook.
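
As a small illustration of how such probabilities might be turned into decisions, the sketch below applies a cutoff to the fraud column. The `proba` values and the `threshold` values are assumptions chosen for illustration, not outputs of the model above. With a fitted scikit-learn classifier, the column order follows `clf.classes_`.

```python
import numpy as np

# Hypothetical predict_proba-style output: one row per transaction,
# columns ordered as [P(not fraud), P(fraud)].
proba = np.array([[0.98, 0.02],
                  [0.35, 0.65],
                  [0.80, 0.20]])

fraud_score = proba[:, 1]   # probability of the positive (fraud) class

threshold = 0.5             # assumed cutoff
flagged = fraud_score >= threshold
print(flagged.tolist())

# A lower cutoff flags more transactions for review.
flagged_strict = fraud_score >= 0.1
print(flagged_strict.tolist())
```

Lowering the cutoff trades more false positives for fewer missed fraud cases, which is one of the choices further experimentation would need to settle.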

## Citations
<a id='citaion1'></a>
[1] *Machine Learning and Security* by David Freeman, Clarence Chio; Publisher: O'Reilly Media, Inc.; Release Date: February 2018; ISBN: 9781491979907