# Predicting Credit Card Fraud

In [36]:
import seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [37]:
# Load the data and explore
transactions = pd.read_csv('transactions.csv')
print(transactions.head())
print(transactions.count())

   step      type     amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0     8  CASH_OUT  158007.12   C424875646           0.00            0.00   
1   236  CASH_OUT  457948.30  C1342616552           0.00            0.00   
2    37   CASH_IN  153602.99   C900876541    11160428.67     11314031.67   
3   331  CASH_OUT   49555.14   C177696810       10865.00            0.00   
4   250  CASH_OUT   29648.02   C788941490           0.00            0.00   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  
0  C1298177219       474016.32      1618631.97        0  
1  C1323169990      2720411.37      3178359.67        0  
2   C608741097      3274930.56      3121327.56        0  
3   C462716348            0.00        49555.14        0  
4  C1971700992        56933.09        86581.10        0  
step              199999
type              199999
amount            199999
nameOrig          199999
oldbalanceOrg     199999
newbalanceOrig    199999
nameDest          199999
oldbalanceDest    19

A few columns stand out. The `amount` of a transaction is an obvious candidate feature, so we start with its summary statistics.

In [38]:
# Summary statistics on amount column
transactions['amount'].describe()

count    1.999990e+05
mean     1.802425e+05
std      6.255482e+05
min      0.000000e+00
25%      1.338746e+04
50%      7.426695e+04
75%      2.086376e+05
max      5.204280e+07
Name: amount, dtype: float64

In [39]:
# Create isPayment field: 1 for PAYMENT or DEBIT transactions, 0 otherwise
# (assigning the whole column avoids chained indexing and its SettingWithCopyWarning)
transactions['isPayment'] = transactions['type'].isin(['PAYMENT', 'DEBIT']).astype(int)

# Create isMovement field: 1 for CASH_OUT or TRANSFER transactions, 0 otherwise
transactions['isMovement'] = transactions['type'].isin(['CASH_OUT', 'TRANSFER']).astype(int)

# Create accountDiff field: absolute gap between destination and origin balances
transactions['accountDiff'] = (transactions['oldbalanceDest'] - transactions['oldbalanceOrg']).abs()


With financial fraud, another key factor to investigate is the difference in balance between the origin and destination accounts, which the `accountDiff` field captures.  
Our theory is that <u>destination accounts whose balance differs significantly from the origin account's could be suspected of fraud</u>.
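One way to sanity-check this theory before modeling is to compare `accountDiff` between fraudulent and non-fraudulent rows. The sketch below uses a small hand-made DataFrame (the `toy` rows are hypothetical and not taken from `transactions.csv`); the same `groupby` call could be applied to `transactions` itself.

```python
import pandas as pd

# Hypothetical toy rows for illustration only, NOT real data from transactions.csv
toy = pd.DataFrame({
    'oldbalanceOrg':  [0.0, 10865.0, 500000.0, 250000.0],
    'oldbalanceDest': [474016.32, 0.0, 0.0, 1000.0],
    'isFraud':        [0, 0, 1, 1],
})

# Same definition as the accountDiff feature in the notebook
toy['accountDiff'] = (toy['oldbalanceDest'] - toy['oldbalanceOrg']).abs()

# Compare the typical balance gap for each class
diff_by_label = toy.groupby('isFraud')['accountDiff'].agg(['mean', 'median'])
print(diff_by_label)
```

If the two classes show similar values on the real data, `accountDiff` is unlikely to carry much signal on its own.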

In [40]:
# Create features and label variables
features = transactions[['amount','isPayment','isMovement','accountDiff']]
label = transactions['isFraud']

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.3)

In [41]:
# Standardize the feature variables (fit the scaler on the training split only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [42]:
# Fit the model to the training data
model = LogisticRegression()
model.fit(X_train, y_train)

# Score the model on the training data
print(model.score(X_train, y_train))

0.9985642754591104


In [43]:
# Score the model on the test data
print(model.score(X_test, y_test))

0.9986


In [44]:
# Print the model coefficients
print(model.coef_)

[[ 0.25301337 -0.74797178  2.25524186 -0.80314006]]
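The raw coefficient array is easier to read when paired with the feature names. This is a minimal sketch that copies the four printed values by hand and assumes they follow the column order used when `features` was created (`amount`, `isPayment`, `isMovement`, `accountDiff`); in the notebook itself, `model.coef_[0]` could be used in place of the literal list.

```python
import pandas as pd

# Feature order assumed to match the columns selected for `features`
feature_names = ['amount', 'isPayment', 'isMovement', 'accountDiff']
# Values copied from the printed model.coef_ output
coefficients = [0.25301337, -0.74797178, 2.25524186, -0.80314006]

coef_table = pd.Series(coefficients, index=feature_names, name='coefficient')

# Order by absolute size, largest first
order = coef_table.abs().sort_values(ascending=False).index
coef_table = coef_table.reindex(order)
print(coef_table)
```

Because the features were standardized, the magnitudes are roughly comparable: `isMovement` has the largest coefficient, and a negative sign means the predicted log-odds of fraud fall as that scaled feature increases.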


## Predict With the Model


In [50]:
# New transaction data
transaction1 = np.array([123456.78, 0.0, 1.0, 54670.1])
transaction2 = np.array([98765.43, 1.0, 0.0, 8524.75])
transaction3 = np.array([543678.31, 0.0, 1.0, 5625.5])
transaction4 = np.array([61472.54, 1.0, 0.0, 565901.23])

# Combine new transactions into a single array
sample_transactions = np.stack((transaction1,transaction2,transaction3,transaction4))

# Scale the new transactions with the scaler fitted on the training data
sample_transactions = scaler.transform(sample_transactions)



In [51]:
# Predict fraud on the new transactions
print(model.predict(sample_transactions))

[0 0 0 0]


In [52]:
# Show probabilities on the new transactions
print(model.predict_proba(sample_transactions))

[[9.96477842e-01 3.52215815e-03]
 [9.99992243e-01 7.75712215e-06]
 [9.95774663e-01 4.22533680e-03]
 [9.99993186e-01 6.81355072e-06]]


The first column is the predicted probability that a transaction is not fraudulent, and the second column is the predicted probability that it is fraudulent. The model uses these probabilities to make its final classification.
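The link between `predict_proba` and `predict` can be checked directly. The sketch below fits a throwaway model on synthetic data (the names `demo_model`, `X`, and `y` are hypothetical and unrelated to the fraud dataset) and shows that the probability columns follow `classes_`, that each row sums to 1, and that `predict` returns the class with the higher probability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, hypothetical data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

demo_model = LogisticRegression().fit(X, y)
proba = demo_model.predict_proba(X[:5])

# Column i of predict_proba corresponds to demo_model.classes_[i]
print(demo_model.classes_)

# The two probabilities in each row sum to 1
print(proba.sum(axis=1))

# predict() returns the class with the higher probability
manual = demo_model.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(manual, demo_model.predict(X[:5])))
```

For a binary model this amounts to a 0.5 cutoff on the second column; a stricter or looser cutoff could be applied to that column by hand if missed fraud is costlier than a false alarm.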

**The model assigns each of the four transactions a fraud probability below 0.5%, so none of them is flagged as fraudulent.**