### Predict Credit Card Fraud

Credit card fraud is one of the leading causes of identity theft around the world. In 2018 alone, over $24 billion was stolen through fraudulent credit card transactions. Financial institutions employ a wide variety of techniques to prevent fraud, one of the most common being logistic regression.

In [3]:
# import lib
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Let’s begin by loading the data into a pandas DataFrame named data. Take a peek at the dataset using .head(), and use .info() to examine how many rows there are and what datatypes the columns have.
* How many transactions are fraudulent?

In [5]:
# load the data
data = pd.read_csv('transactions_modified.csv')
print(f"transactions data: {data.head(5)}")
print(data.info())

transactions data:    step      type      amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0   206  CASH_OUT    62927.08   C473782114           0.00            0.00   
1   380   PAYMENT    32851.57  C1915112886           0.00            0.00   
2   570  CASH_OUT  1131750.38  C1396198422     1131750.38            0.00   
3   184  CASH_OUT    60519.74   C982551468       60519.74            0.00   
4   162   CASH_IN    46716.01  C1759889425     7668050.60      7714766.61   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isPayment  \
0  C2096898696       649420.67       712347.75        0          0   
1   M916879292            0.00            0.00        0          1   
2  C1612235515       313070.53      1444820.92        1          0   
3  C1378644910        54295.32       182654.50        1          0   
4  C2059152908      2125468.75      2078752.75        0          0   

   isMovement  accountDiff  
0           1    649420.67  
1           0         0.00  
2         

In [6]:
# How many fraudulent transactions?
print(f"fraudulent transactions: {(data['isFraud'] == 1).sum()}")

fraudulent transactions: 282
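
A raw count is easier to interpret as a share of all rows. The following is a minimal sketch of how the class balance could be inspected; the ten-row frame named toy is hypothetical and only stands in for data.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; only the isFraud column matters here.
toy = pd.DataFrame({"isFraud": [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]})

# Number of rows in each class (0 = legitimate, 1 = fraudulent).
fraud_counts = toy["isFraud"].value_counts()

# Because isFraud is 0/1, its mean is the fraction of fraudulent rows.
fraud_rate = toy["isFraud"].mean()

print(fraud_counts)
print(f"fraud rate: {fraud_rate:.1%}")
```

Applied to data, the same two calls would report the real balance. If the dataset has the 1,000 rows that the amount summary suggests, 282 fraudulent rows would be roughly 28% of the total.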


#### Clean the Data

We can see that there are a few interesting columns to look at. We know that the amount of a given transaction is going to be important. Calculate summary statistics for this column.
* What does the distribution look like?

In [7]:
# Summary statistics on amount column
print(f"Summary statistics on amount column: {data['amount'].describe()}")

Summary statistics on amount column: count    1.000000e+03
mean     5.373080e+05
std      1.423692e+06
min      0.000000e+00
25%      2.933705e+04
50%      1.265305e+05
75%      3.010378e+05
max      1.000000e+07
Name: amount, dtype: float64
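
In the summary above, the mean (about 537,000) is several times the median (about 127,000) and the maximum is 10,000,000, which points to a right-skewed distribution with a long tail of large transactions. A small sketch of that check follows; the five amounts are hypothetical values chosen to mimic the shape, not rows from the dataset.

```python
import pandas as pd

# Hypothetical amounts with one very large value, mimicking a long right tail.
amounts = pd.Series([0.0, 29_000.0, 126_000.0, 301_000.0, 10_000_000.0], name="amount")

# A mean well above the median is a quick indicator of right skew.
mean_amount = amounts.mean()
median_amount = amounts.median()

# Sample skewness; a positive value means the tail extends to the right.
skewness = amounts.skew()

print(f"mean={mean_amount:,.0f}  median={median_amount:,.0f}  skew={skewness:.2f}")
```

The same three calls on data['amount'] would quantify the skew in the real column.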


Let’s create a new column called isPayment that assigns a 1 when type is “PAYMENT” or “DEBIT”, and a 0 otherwise.

In [8]:
# Create isPayment field
data['isPayment'] = data['type'].apply(lambda x: 1 if x in ['PAYMENT', 'DEBIT'] else 0)

Next, create a column called isMovement, which captures whether money moved out of the origin account. This column will have a value of 1 when type is either “CASH_OUT” or “TRANSFER”, and a 0 otherwise.

In [9]:
# Create isMovement field
data['isMovement'] = data['type'].apply(lambda x: 1 if x in ['CASH_OUT', 'TRANSFER'] else 0)
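
Both flags are membership tests on type, so they can also be written without a Python-level lambda. The sketch below shows one possible vectorized form using Series.isin and compares it with the apply version; the frame named toy is hypothetical and exists only for illustration.

```python
import pandas as pd

# Hypothetical rows covering the transaction types used by the two flags.
toy = pd.DataFrame({"type": ["CASH_OUT", "PAYMENT", "CASH_IN", "DEBIT", "TRANSFER"]})

# isin returns a boolean Series; astype(int) converts True/False to 1/0.
toy["isPayment"] = toy["type"].isin(["PAYMENT", "DEBIT"]).astype(int)
toy["isMovement"] = toy["type"].isin(["CASH_OUT", "TRANSFER"]).astype(int)

# The apply/lambda form, kept only to confirm that both approaches agree.
payment_via_apply = toy["type"].apply(lambda x: 1 if x in ["PAYMENT", "DEBIT"] else 0)

print(toy)
print("matches apply version:", (toy["isPayment"] == payment_via_apply).all())
```

Either form gives the same columns; the isin version tends to be faster on large frames because the comparison is not run row by row in Python.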

With financial fraud, another key factor to investigate is the difference in value between the origin and destination accounts. Our theory is that a transaction whose destination account balance differs significantly from the origin account balance could be fraudulent.

Let’s create a column called accountDiff with the absolute difference of the oldbalanceOrg and oldbalanceDest columns.

In [10]:
# Create accountDiff field: absolute difference between the origin and destination balances
data['accountDiff'] = (data['oldbalanceOrg'] - data['oldbalanceDest']).abs()
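
Since accountDiff is described as an absolute difference, it should never be negative. The sketch below illustrates this using the balances of the first four rows shown in the data preview; the frame name toy and the signedDiff column are hypothetical and used only for comparison.

```python
import pandas as pd

# Balances copied from the first four rows of the preview printed earlier.
toy = pd.DataFrame({
    "oldbalanceOrg": [0.00, 0.00, 1131750.38, 60519.74],
    "oldbalanceDest": [649420.67, 0.00, 313070.53, 54295.32],
})

# Plain subtraction keeps the sign, so the first row comes out negative.
toy["signedDiff"] = toy["oldbalanceOrg"] - toy["oldbalanceDest"]

# Taking the absolute value gives the size of the gap regardless of direction.
toy["accountDiff"] = toy["signedDiff"].abs()

print(toy)
```

For the first row, the absolute difference is 649420.67, which agrees with the accountDiff value shown in the preview, whereas the signed difference would be -649420.67.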