### Predict Credit Card Fraud

Credit card fraud is one of the leading causes of identity theft around the world. In 2018 alone, over $24 billion was stolen through fraudulent credit card transactions. Financial institutions employ a wide variety of techniques to prevent fraud, one of the most common being logistic regression.

In [3]:
# import lib
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Let’s begin by loading the data into a pandas DataFrame named data. Take a peek at the dataset using .head(), and use .info() to examine how many rows there are and what datatypes the columns have.
* How many transactions are fraudulent?

In [5]:
# Load the data
data = pd.read_csv('transactions_modified.csv')
print(f"transactions data: {data.head(5)}")
# .info() prints its report directly and returns None, so it needs no print()
data.info()

transactions data:    step      type      amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0   206  CASH_OUT    62927.08   C473782114           0.00            0.00   
1   380   PAYMENT    32851.57  C1915112886           0.00            0.00   
2   570  CASH_OUT  1131750.38  C1396198422     1131750.38            0.00   
3   184  CASH_OUT    60519.74   C982551468       60519.74            0.00   
4   162   CASH_IN    46716.01  C1759889425     7668050.60      7714766.61   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isPayment  \
0  C2096898696       649420.67       712347.75        0          0   
1   M916879292            0.00            0.00        0          1   
2  C1612235515       313070.53      1444820.92        1          0   
3  C1378644910        54295.32       182654.50        1          0   
4  C2059152908      2125468.75      2078752.75        0          0   

   isMovement  accountDiff  
0           1    649420.67  
1           0         0.00  
2         

In [6]:
# How many fraudulent transactions?
print(f"fraudulent transactions: {(data['isFraud'] == 1).sum()}")

fraudulent transactions: 282
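
A raw count is easier to interpret next to the share of rows it represents. The sketch below shows one way to get both, using a small made-up `toy` DataFrame (an assumption for illustration, not the real dataset) in place of `data`.

```python
import pandas as pd

# Hypothetical stand-in for the isFraud column: 3 fraudulent rows out of 10.
toy = pd.DataFrame({"isFraud": [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]})

fraud_count = int(toy["isFraud"].sum())       # number of rows flagged as fraud
fraud_rate = toy["isFraud"].mean()            # share of rows flagged as fraud
class_counts = toy["isFraud"].value_counts()  # rows per class, largest first

print(f"fraudulent: {fraud_count}, rate: {fraud_rate:.1%}")
print(class_counts)
```

The same three lines should work on `data['isFraud']`; the fraud rate is worth keeping in mind later, because it determines how well a model that never predicts fraud would score.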


#### Clean the Data

We can see that there are a few interesting columns to look at. The amount of a given transaction is likely to be important, so calculate summary statistics for this column.
* What does the distribution look like?

In [7]:
# Summary statistics on amount column
print(f"Summary statistics on amount column: {data['amount'].describe()}")

Summary statistics on amount column: count    1.000000e+03
mean     5.373080e+05
std      1.423692e+06
min      0.000000e+00
25%      2.933705e+04
50%      1.265305e+05
75%      3.010378e+05
max      1.000000e+07
Name: amount, dtype: float64
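
In the summary above the mean (about 537,000) is several times the median (about 127,000), which usually points to a right-skewed distribution with a few very large transactions. The sketch below illustrates that check on a small made-up series; the values in `amounts` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts: mostly small values plus one very large transaction.
amounts = pd.Series([50.0, 120.0, 300.0, 800.0, 1_000_000.0])

mean_amount = amounts.mean()
median_amount = amounts.median()
skewness = amounts.skew()        # positive values indicate a long right tail

# log1p compresses the long tail, which can make the shape easier to inspect.
log_amounts = np.log1p(amounts)

print(f"mean={mean_amount:.2f}, median={median_amount:.2f}, skew={skewness:.2f}")
print(log_amounts.round(2).tolist())
```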


Let’s create a new column called isPayment that assigns a 1 when type is “PAYMENT” or “DEBIT”, and a 0 otherwise.

In [8]:
# Create isPayment field
data['isPayment'] = data['type'].apply(lambda x: 1 if x in ['PAYMENT', 'DEBIT'] else 0)

Next, create a column called isMovement, which captures whether money moved out of the origin account. This column will have a value of 1 when type is either “CASH_OUT” or “TRANSFER”, and a 0 otherwise.

In [9]:
# Create isMovement field
data['isMovement'] = data['type'].apply(lambda x: 1 if x in ['CASH_OUT', 'TRANSFER'] else 0)
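
The two flag columns can also be built without a row-by-row lambda. This is a sketch of a vectorized alternative using `Series.isin`, run on a made-up `toy` DataFrame (an assumption for illustration); it should give the same 0/1 values as the `apply` version.

```python
import pandas as pd

# Hypothetical transaction types covering every case.
toy = pd.DataFrame({"type": ["PAYMENT", "CASH_OUT", "DEBIT", "TRANSFER", "CASH_IN"]})

# isin() returns a boolean Series; astype(int) converts True/False to 1/0.
toy["isPayment"] = toy["type"].isin(["PAYMENT", "DEBIT"]).astype(int)
toy["isMovement"] = toy["type"].isin(["CASH_OUT", "TRANSFER"]).astype(int)

print(toy)
```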

With financial fraud, another key factor to investigate is the difference in value between the origin and destination accounts. Our theory is that a transaction whose destination account balance differs significantly from the origin account balance may be suspected of fraud.

Let’s create a column called accountDiff with the absolute difference of the oldbalanceOrg and oldbalanceDest columns.

In [10]:
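# Create accountDiff field
# Note: this is the signed difference (values can be negative, as in the output below);
# wrap the subtraction in parentheses and call .abs() for the absolute difference.
data['accountDiff'] = data['oldbalanceOrg'] - data['oldbalanceDest']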
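
The text asks for the absolute difference, while a plain subtraction keeps the sign. The sketch below contrasts the two on a small `toy` DataFrame; the balances are copied from rows 0, 2, and 3 of the preview above, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Balances taken from rows 0, 2 and 3 of the data preview.
toy = pd.DataFrame({
    "oldbalanceOrg": [0.00, 1131750.38, 60519.74],
    "oldbalanceDest": [649420.67, 313070.53, 54295.32],
})

signed_diff = toy["oldbalanceOrg"] - toy["oldbalanceDest"]  # can be negative
abs_diff = signed_diff.abs()                                # always >= 0

print(pd.DataFrame({"signed": signed_diff, "absolute": abs_diff}))
```

Which version is the better feature is a modeling choice: the signed value keeps the direction of the gap, while the absolute value keeps only its size.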

#### Select and Split the Data

Before we can start training our model, we need to define our feature and label columns. Our label column in this dataset is the isFraud field. Create a variable called features containing the following columns:
* amount
* isPayment
* isMovement
* accountDiff

Also create a variable called label with the column isFraud.

In [11]:
# Create features and label variables
features = data[['amount', 'isPayment', 'isMovement', 'accountDiff']]
label = data[['isFraud']]
print(features.head())
print(label.head())

       amount  isPayment  isMovement  accountDiff
0    62927.08          0           1   -649420.67
1    32851.57          1           0         0.00
2  1131750.38          0           1    818679.85
3    60519.74          0           1      6224.42
4    46716.01          0           0   5542581.85
   isFraud
0        0
1        0
2        1
3        1
4        0


Split the data into training and test sets using sklearn’s train_test_split() function. We’ll use the training set to train the model and the test set to evaluate the model. Use a test_size value of 0.3.

In [12]:
# Split dataset
x_train, x_test, y_train, y_test = train_test_split(features, label, test_size = 0.3)
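
The split above is random, so scores and coefficients will vary from run to run. As a sketch of one optional refinement, `train_test_split` accepts `random_state` for reproducibility and `stratify` to keep the class ratio the same in both sets. The arrays `X` and `y` below are made-up stand-ins, and the seed value 42 is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 30% positive class.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.3,
    stratify=y,        # preserve the 30/70 class ratio in both splits
    random_state=42,   # arbitrary seed so the split is repeatable
)

print(len(X_tr), len(X_te), y_tr.mean(), y_te.mean())
```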

#### Normalize the Data


Since sklearn’s LogisticRegression applies regularization by default, we need to scale our feature data so that every feature is penalized on a comparable scale. Create a StandardScaler object, .fit_transform() it on the training features, and .transform() the test features.

In [13]:
# Normalize the features variables
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
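
To make the scaling step concrete, the sketch below checks on made-up arrays that StandardScaler subtracts each column's training mean and divides by its training standard deviation, and that the test set is transformed with those same training statistics. The arrays `train` and `test` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[4.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train only
test_scaled = scaler.transform(test)        # reuses the training mean/std

# Manual equivalent: z = (x - mean) / std, with the population std (ddof=0).
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)
manual_train = (train - train_mean) / train_std
manual_test = (test - train_mean) / train_std

print(np.allclose(train_scaled, manual_train), np.allclose(test_scaled, manual_test))
```

Fitting the scaler on the training set only keeps information about the test set from leaking into the model.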

#### Create and Evaluate the Model

Create a LogisticRegression model with sklearn and .fit() it on the training data.

Fitting the model finds the best coefficients for our selected features so it can more accurately predict our label. We will start with the default threshold of 0.5.

In [14]:
# Fit the model to the training data
# ravel() turns the single-column label DataFrame into the 1-D array sklearn expects
model = LogisticRegression()
model.fit(x_train_scaled, y_train.values.ravel())
best_coef = model.coef_
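
The 0.5 threshold mentioned above is applied to the probability the model assigns to the positive class. The sketch below, on made-up one-dimensional data, reproduces that probability from the fitted coefficient and intercept with the sigmoid function, then compares the default threshold with a lower one. The data and the alternative threshold of 0.3 are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical, already-scaled single feature with a clean class split.
X = np.array([[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# Probability of class 1 = sigmoid(coef * x + intercept).
z = (X @ model.coef_.T + model.intercept_).ravel()
manual_proba = 1.0 / (1.0 + np.exp(-z))
proba = model.predict_proba(X)[:, 1]

default_pred = (proba >= 0.5).astype(int)   # what model.predict() uses
lenient_pred = (proba >= 0.3).astype(int)   # flags more rows as positive

print(np.round(proba, 3))
print(default_pred, lenient_pred)
```

Lowering the threshold flags more transactions as fraud, trading extra false alarms for fewer missed cases.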


Scoring the model on the training data runs the training data through the trained model and predicts which transactions are fraudulent. The score returned is the fraction of correct classifications, or the accuracy.

In [15]:
# Score the model on the training data
model_train_score = model.score(x_train_scaled, y_train)
print(f"model train score: {model_train_score}")

model train score: 0.8585714285714285


Scoring the model on the test data runs the test data through the trained model and predicts which transactions are fraudulent. The score returned is the fraction of correct classifications, or the accuracy, and is an indicator of the success of your model.
* How did your model perform?

In [16]:
# Score the model on the test data
model_test_score = model.score(x_test_scaled, y_test)
print(f"model test score: {model_test_score}")

model test score: 0.8366666666666667
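
Accuracy alone can flatter a fraud model, because most transactions are not fraudulent. If about 28% of rows are fraud (282 out of roughly 1,000, going by the counts above), always predicting "not fraud" would already score about 0.72. The sketch below shows, on made-up labels and predictions, how a confusion matrix, precision, and recall give a fuller picture; `y_true` and `y_pred` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels (3 fraud cases in 10) and model predictions.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 0])

accuracy = accuracy_score(y_true, y_pred)
baseline = accuracy_score(y_true, np.zeros_like(y_true))  # always "not fraud"
cm = confusion_matrix(y_true, y_pred)       # rows: actual, columns: predicted
precision = precision_score(y_true, y_pred) # flagged rows that are truly fraud
recall = recall_score(y_true, y_pred)       # fraud rows that were caught

print(f"accuracy={accuracy:.2f}, baseline={baseline:.2f}")
print(cm)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this toy case the model matches the baseline accuracy while catching only one of three fraud cases, which accuracy by itself would not reveal.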


Print the coefficients for our model to see how important each feature column was for prediction. Because the features were standardized, the coefficient magnitudes can be compared directly.
* Which feature was most important?
* Least important?

In [17]:
# Print the model coefficients
print(f"best coef: {best_coef}")

best coef: [[ 2.18216988 -0.2753109   2.87298167  1.18993799]]
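
The coefficients come back in the same order as the feature columns, so pairing them with names makes the ranking easier to read. The sketch below uses the values printed above; since the train/test split is not seeded, a rerun would likely produce somewhat different numbers.

```python
import numpy as np

# Feature order matches the columns selected for `features`.
feature_names = ["amount", "isPayment", "isMovement", "accountDiff"]
# Coefficients copied from the printed output above.
coef = np.array([[2.18216988, -0.2753109, 2.87298167, 1.18993799]])

# Rank by absolute value: the sign gives direction, the magnitude gives weight.
ranked = sorted(zip(feature_names, coef[0]),
                key=lambda pair: abs(pair[1]), reverse=True)

for name, value in ranked:
    print(f"{name:12s} {value: .4f}")
```

For this run, isMovement carries the largest weight and isPayment the smallest, with a negative sign suggesting that payment-type transactions are less likely to be flagged as fraud.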
