# Predicting Credit Card Fraud

Credit card fraud is one of the leading causes of identity theft around the world. In 2018 alone, over $24 billion was stolen through fraudulent credit card transactions. Financial institutions employ a wide variety of techniques to prevent fraud, one of the most common being Logistic Regression.

In this project, I will use Logistic Regression to build a predictive model that classifies a transaction as fraudulent or not.

The file [dataset.csv](https://www.kaggle.com/datasets/ealaxi/paysim1) contains data on roughly 6.36 million simulated transactions.

In [13]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd 

transactions = pd.read_csv("dataset.csv")

transactions

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0,0
2,1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1,0
3,1,CASH_OUT,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,0,0
...,...,...,...,...,...,...,...,...,...,...,...
6362615,743,CASH_OUT,339682.13,C786484425,339682.13,0.00,C776919290,0.00,339682.13,1,0
6362616,743,TRANSFER,6311409.28,C1529008245,6311409.28,0.00,C1881841831,0.00,0.00,1,0
6362617,743,CASH_OUT,6311409.28,C1162922333,6311409.28,0.00,C1365125890,68488.84,6379898.11,1,0
6362618,743,TRANSFER,850002.52,C1685995037,850002.52,0.00,C2080388513,0.00,0.00,1,0


### Clean Data

Let's see some statistics on `amount`:

In [14]:
transactions["amount"].describe()

count    6.362620e+06
mean     1.798619e+05
std      6.038582e+05
min      0.000000e+00
25%      1.338957e+04
50%      7.487194e+04
75%      2.087215e+05
max      9.244552e+07
Name: amount, dtype: float64

We have a lot of information about the type of transaction we are looking at. Let’s create a new column `isPayment` that assigns a 1 when type is “PAYMENT” or “DEBIT”, and a 0 otherwise.

In [30]:
transactions["type"].unique()

array(['PAYMENT', 'TRANSFER', 'CASH_OUT', 'DEBIT', 'CASH_IN'],
      dtype=object)

In [71]:
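# Default to 0, then set 1 for PAYMENT and DEBIT rows.
# Using .loc avoids chained assignment, which may silently fail to update the frame.
transactions["isPayment"] = 0
transactions.loc[transactions["type"].isin(["PAYMENT", "DEBIT"]), "isPayment"] = 1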

transactions.sample(10)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,isPayment
630042,34,CASH_IN,34593.67,C1964308471,2959289.31,2993882.98,C260931560,1468151.42,1433557.74,0,0,0
5584417,394,CASH_OUT,545154.08,C1764382170,0.0,0.0,C736930164,1113802.14,1658956.23,0,0,0
3770379,280,CASH_OUT,35514.96,C1503669016,96055.0,60540.04,C1279480147,1813761.73,1849276.69,0,0,0
2767473,213,TRANSFER,286171.98,C1538884243,50142.62,0.0,C1556836011,789466.82,1075638.8,0,0,0
1740472,161,DEBIT,17642.01,C622784729,478.0,0.0,C1991915301,563809.21,581451.22,0,0,1
2449674,203,CASH_OUT,287569.05,C371262605,0.0,0.0,C867497050,1197648.75,1485217.8,0,0,0
5008300,353,CASH_IN,199372.7,C245789251,256025.39,455398.09,C813505718,303481.75,104109.06,0,0,0
5957210,405,CASH_OUT,42615.61,C379382292,10634.0,0.0,C1172772405,353183.74,395799.35,0,0,0
1240521,134,CASH_IN,135024.48,C276659013,74865.0,209889.48,C767497528,285106.44,150081.95,0,0,0
1498179,142,PAYMENT,13673.84,C390941092,1470.0,0.0,M1238995243,0.0,0.0,0,0,1


Similarly, let's create a column `isMovement`, which will capture if money moved out of the origin account. This column will have a value of 1 when type is either “CASH_OUT” or “TRANSFER”, and a 0 otherwise.

In [83]:
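# Default to 0, then set 1 for CASH_OUT and TRANSFER rows.
# Using .loc avoids chained assignment, which may silently fail to update the frame.
transactions["isMovement"] = 0
transactions.loc[transactions["type"].isin(["CASH_OUT", "TRANSFER"]), "isMovement"] = 1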

transactions.sample(5)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,isPayment,isMovement
1594673,156,TRANSFER,669721.86,C1137520086,525.0,0.0,C406209727,0.0,669721.86,0,0,0,1
4441223,323,CASH_IN,251156.32,C1936716212,52533.0,303689.32,C743639809,57210.79,0.0,0,0,0,0
810986,40,CASH_OUT,492776.27,C1960452738,0.0,0.0,C848982811,1215536.86,1708313.13,0,0,0,1
759079,38,CASH_OUT,143869.23,C278632423,0.0,0.0,C1453102946,2144425.22,2288294.45,0,0,0,1
2917008,229,CASH_IN,7839.33,C1656118325,30325.0,38164.33,C174161132,115591.77,107752.44,0,0,0,0


With financial fraud, another key factor to investigate is the difference in balance between the origin and destination accounts. My hypothesis is that transactions between accounts with very different balances are more likely to be fraudulent. Let’s create a column `accountDiff` holding the absolute difference between the `oldbalanceOrg` and `oldbalanceDest` columns.

In [85]:
transactions["accountDiff"] = abs(transactions["oldbalanceOrg"] - transactions["oldbalanceDest"])

transactions.sample(5)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,isPayment,isMovement,accountDiff
3747225,279,PAYMENT,3597.88,C1086384216,25580.0,21982.12,M943927369,0.0,0.0,0,0,1,0,25580.0
3812662,281,CASH_IN,11765.04,C1798814572,5085159.06,5096924.09,C813671948,573073.64,561308.6,0,0,0,0,4512085.42
132932,11,DEBIT,941.54,C581207770,20303.0,19361.46,C1526689241,110864.0,111805.54,0,0,1,0,90561.0
4931702,350,CASH_IN,122888.72,C348422043,38924.0,161812.72,C838343736,1146917.51,1024028.79,0,0,0,0,1107993.51
5869458,403,CASH_IN,67695.23,C290966166,24687.0,92382.23,C1701194538,2152142.15,2084446.92,0,0,0,0,2127455.15


### Select and Split Data 

I will use the following features for the model: 
- `amount`
- `isPayment`
- `isMovement`
- `accountDiff`

In [87]:
features = transactions[["amount", "isPayment", "isMovement", "accountDiff"]]

# Select the label as a 1-D Series; scikit-learn expects a 1-D target
label = transactions["isFraud"]

Split data:

In [89]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(features, label, test_size=0.3)

### Normalize Data 

In [91]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
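
`StandardScaler` rescales each feature column to zero mean and unit variance using statistics learned from the training split only; the test split is then transformed with those same statistics, so no information from the test data leaks into training. The following is a minimal pure-Python sketch of that idea on a made-up `amount` column. The values and the helper names `fit_standardizer` and `standardize` are illustrative assumptions, not part of the notebook.

```python
from statistics import fmean, pstdev

def fit_standardizer(values):
    """Return (mean, std) of the training values, using the population std."""
    return fmean(values), pstdev(values)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std with statistics learned elsewhere."""
    return [(v - mean) / std for v in values]

# Hypothetical training and test amounts, for illustration only.
train_amounts = [100.0, 200.0, 300.0, 400.0]
test_amounts = [250.0, 1000.0]

# Learn the statistics from the training data only.
mean, std = fit_standardizer(train_amounts)

train_scaled = standardize(train_amounts, mean, std)
# Reuse the training statistics for the test data.
test_scaled = standardize(test_amounts, mean, std)

print(train_scaled)
print(test_scaled)
```

Note that the scaled test values need not have zero mean or unit variance themselves; they are only expressed on the scale the model was trained on.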

### Create and Evaluate the Model

In [94]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(x_train, y_train)

LogisticRegression()

Find the accuracy on the training data:

In [97]:
model.score(x_train, y_train)

0.9986946078367537

Find the accuracy on the testing data:

In [99]:
model.score(x_test, y_test)

0.9986850280754365
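
Accuracy can look excellent on fraud data simply because fraud is rare: a model that labels every transaction as legitimate is right almost all the time while catching nothing. Precision and recall on the fraud class are more informative. The sketch below computes them in plain Python on a small made-up label set. The labels and the helper `precision_recall` are illustrative assumptions, not results from this dataset.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the positive (fraud) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels: 998 legitimate transactions and 2 fraudulent ones.
y_true = [0] * 998 + [1] * 2
# A "model" that never predicts fraud.
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)

print(accuracy)  # 0.998
print(recall)    # 0.0, every fraud case is missed
```

In the notebook, the same check could be run on `y_test` and `model.predict(x_test)` to see how many fraudulent transactions the model actually catches.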

View the model's coefficients to see how strongly each feature influenced the prediction (in the order `amount`, `isPayment`, `isMovement`, `accountDiff`):

In [100]:
model.coef_

array([[ 0.21942793, -1.03256274,  3.62222847, -0.65657416]])
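
Because the features were standardized, each coefficient is the change in the log-odds of fraud per one standard deviation of that feature (holding the others fixed), so signs and relative magnitudes are comparable across columns. The sketch below pairs the printed coefficients with their feature names and turns a standardized feature vector into a probability with the logistic function. The intercept is not shown in the notebook, so `intercept` here is a placeholder value, and `example` is a made-up standardized row.

```python
import math

feature_names = ["amount", "isPayment", "isMovement", "accountDiff"]
# Coefficients copied from the model.coef_ output.
coefficients = [0.21942793, -1.03256274, 3.62222847, -0.65657416]
# Placeholder: the notebook does not print model.intercept_.
intercept = 0.0

def sigmoid(z):
    """Logistic function mapping log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fraud_probability(standardized_row):
    """Weighted sum of standardized features plus intercept, passed through the sigmoid."""
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, standardized_row))
    return sigmoid(log_odds)

# Rank features by absolute coefficient size.
ranked = sorted(zip(feature_names, coefficients), key=lambda pair: abs(pair[1]), reverse=True)
for name, coef in ranked:
    print(f"{name:12s} {coef:+.3f}")

# Made-up standardized row: every feature one standard deviation above its mean.
example = [1.0, 1.0, 1.0, 1.0]
print(fraud_probability(example))
```

On this reading, `isMovement` has by far the largest positive weight, while `isPayment` and `accountDiff` push the prediction away from fraud.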

Let’s use our model to classify a few new sample transactions, each given as `[amount, isPayment, isMovement, accountDiff]`:

In [105]:
import numpy as np

transaction1 = np.array([123456.78, 0.0, 1.0, 54670.1])
transaction2 = np.array([98765.43, 1.0, 0.0, 8524.75])
transaction3 = np.array([543678.31, 1.0, 0.0, 510025.5])
transaction4 = np.array([6472.54, 1.0, 0.0, 55901.23])

In [106]:
sample_transactions = np.stack((transaction1, transaction2, transaction3, transaction4))

Normalize `sample_transactions` with the scaler fitted on the training data:

In [107]:
sample_transactions = scaler.transform(sample_transactions)

Predict fraud on `sample_transactions`:

In [108]:
model.predict(sample_transactions)

array([0, 1, 0, 0], dtype=int64)