# Transaction Fraud Detection:

## The idea behind this project:
- Since the advent of the internet, the digital revolution has reached nearly every aspect of our lives. One of its most important effects has been on the financial system, in particular the ability to send money digitally to anyone, anywhere in the world. Digital transactions are now part of daily life: purchasing a product online, sending money to friends, depositing cash in a bank account, investing, and so on. These benefits have also paved the way for fraudulent activity: people use digital money-transfer channels to launder money and make it look as if it comes from a legal source.

## What can it do?
- The objective of this notebook is to find patterns in the recorded transactions and train algorithms on those patterns so that they can identify and flag fraudulent transactions.

## Goals:

- Exploratory analysis of the data to extract patterns of fraudulent activity
- Build a machine learning model to classify fraud and non-fraud transactions
- Reduce the false negatives by tuning the model

In [1]:
#Basic libraries
import pandas as pd
import numpy as np

#Visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

In [2]:
df = pd.read_csv('/content/PS_20174392719_1491204439457_log.csv')

In [3]:
df.sample(10)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
27446,8,CASH_OUT,390573.89,C2024927839,0.0,0.0,C449775363,1423034.98,2052695.2,0.0,0.0
2472,1,PAYMENT,4209.87,C335466988,40973.0,36763.13,M1640244246,0.0,0.0,0.0,0.0
45019,9,CASH_IN,9563.84,C1603178007,411.0,9974.84,C990355670,0.0,0.0,0.0,0.0
52273,9,PAYMENT,22683.86,C1771042759,177986.69,155302.83,M166271178,0.0,0.0,0.0,0.0
130996,11,CASH_OUT,167896.84,C2101142971,42050.0,0.0,C1030564396,59984.0,828281.76,0.0,0.0
26109,8,PAYMENT,931.7,C1505291852,164314.0,163382.3,M1085173315,0.0,0.0,0.0,0.0
88206,10,PAYMENT,5642.5,C1580899996,199493.0,193850.5,M1894995430,0.0,0.0,0.0,0.0
22542,8,PAYMENT,7213.89,C950057110,0.0,0.0,M975500183,0.0,0.0,0.0,0.0
137530,11,PAYMENT,24706.14,C305840862,158448.0,133741.86,M770717524,0.0,0.0,0.0,0.0
27634,8,PAYMENT,6430.83,C1034399668,0.0,0.0,M1414344492,0.0,0.0,0.0,0.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138116 entries, 0 to 138115
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   step            138116 non-null  int64  
 1   type            138116 non-null  object 
 2   amount          138116 non-null  float64
 3   nameOrig        138116 non-null  object 
 4   oldbalanceOrg   138116 non-null  float64
 5   newbalanceOrig  138116 non-null  float64
 6   nameDest        138116 non-null  object 
 7   oldbalanceDest  138115 non-null  float64
 8   newbalanceDest  138115 non-null  float64
 9   isFraud         138115 non-null  float64
 10  isFlaggedFraud  138115 non-null  float64
dtypes: float64(7), int64(1), object(3)
memory usage: 11.6+ MB


In [5]:
df.isnull().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    1
newbalanceDest    1
isFraud           1
isFlaggedFraud    1
dtype: int64

In [6]:
df.dropna(inplace=True)

In [7]:
df.isnull().sum()
# now nothing is null in the data

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

In [8]:
df['type'].value_counts()

type
PAYMENT     52865
CASH_OUT    44255
CASH_IN     27997
TRANSFER    11712
DEBIT        1286
Name: count, dtype: int64
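
A natural follow-up to the counts above is how fraud is spread across the transaction types. The sketch below is a minimal example on a made-up toy frame that only borrows the notebook's column names (`type`, `isFraud`); the helper `fraud_rate_by_type` is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy frame with the notebook's column names; the values are invented for illustration.
toy = pd.DataFrame({
    "type": ["PAYMENT", "CASH_OUT", "TRANSFER", "CASH_OUT", "PAYMENT", "TRANSFER"],
    "isFraud": [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
})


def fraud_rate_by_type(frame: pd.DataFrame) -> pd.DataFrame:
    """Return transaction count, fraud count and fraud rate per transaction type."""
    summary = frame.groupby("type")["isFraud"].agg(
        n_transactions="count",  # rows per type
        n_fraud="sum",           # isFraud is 0/1, so the sum is the fraud count
    )
    summary["fraud_rate"] = summary["n_fraud"] / summary["n_transactions"]
    return summary.sort_values("fraud_rate", ascending=False)


print(fraud_rate_by_type(toy))
```

On the real data this would be called as `fraud_rate_by_type(df)` while `isFraud` is still numeric, that is, before the relabelling step later in the notebook.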

In [None]:
# Sunburst chart of transaction counts per type (px is already imported in the first cell)
fig = px.sunburst(df, path=['type'], title='Distribution of Transaction Types')
fig.show()

In [9]:
# correlation:
# Select only numeric columns before calculating correlations
numeric_df = df.select_dtypes(include=['number'])
numeric_df.corr()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
step,1.0,0.065728,5.8e-05,2.2e-05,0.016829,0.007568,-0.050135,
amount,0.065728,1.0,-0.016798,-0.022207,0.234142,0.361064,0.033093,
oldbalanceOrg,5.8e-05,-0.016798,1.0,0.998972,0.097134,0.066367,-0.003436,
newbalanceOrig,2.2e-05,-0.022207,0.998972,1.0,0.098629,0.065312,-0.00919,
oldbalanceDest,0.016829,0.234142,0.097134,0.098629,1.0,0.945485,-0.008594,
newbalanceDest,0.007568,0.361064,0.066367,0.065312,0.945485,1.0,-0.005799,
isFraud,-0.050135,0.033093,-0.003436,-0.00919,-0.008594,-0.005799,1.0,
isFlaggedFraud,,,,,,,,
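
The empty `isFlaggedFraud` row and column are most likely the result of that column being constant in this subset: a column with zero variance has no defined correlation, so pandas returns NaN. A minimal sketch of filtering such columns out before calling `corr()` follows; the helper name `corr_without_constant_columns` and the toy values are assumptions made for illustration.

```python
import pandas as pd

# Toy frame: 'isFlaggedFraud' is constant, 'nameOrig' is non-numeric (values are made up).
toy = pd.DataFrame({
    "amount": [100.0, 250.0, 80.0, 900.0],
    "oldbalanceOrg": [500.0, 250.0, 1000.0, 900.0],
    "isFlaggedFraud": [0.0, 0.0, 0.0, 0.0],
    "nameOrig": ["C1", "C2", "C3", "C4"],
})


def corr_without_constant_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Correlation matrix over numeric columns that take more than one value."""
    numeric = frame.select_dtypes(include="number")
    varying = numeric.loc[:, numeric.nunique() > 1]  # drop zero-variance columns
    return varying.corr()


print(corr_without_constant_columns(toy))
```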


In [10]:
# Integer-encode the type column
# The code order is a manual choice made after looking at the type-distribution chart
df['type'] = df['type'].map({'CASH_OUT': 1, 'PAYMENT': 2, 'CASH_IN': 3, 'TRANSFER': 4, 'DEBIT': 5})
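
Integer codes like the ones above imply an order and a distance between types (for example that DEBIT is "further" from CASH_OUT than PAYMENT is), which distance-based models such as KNN will pick up. One common alternative, shown here only as a sketch on made-up rows, is one-hot encoding with `pd.get_dummies`; this is not what the notebook does.

```python
import pandas as pd

# Toy rows using the five transaction types from the dataset; amounts are invented.
toy = pd.DataFrame({
    "type": ["PAYMENT", "CASH_OUT", "TRANSFER", "DEBIT", "CASH_IN"],
    "amount": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# One 0/1 indicator column per type; no ordering is implied between the types.
encoded = pd.get_dummies(toy, columns=["type"], prefix="type", dtype=int)

print(encoded)
```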

In [11]:
# Relabel isFraud: 0 becomes 'No_Fraud' and 1 becomes 'Fraud'
df['isFraud'] = df['isFraud'].replace({0: 'No_Fraud', 1: 'Fraud'})

In [12]:
df.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,2,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,No_Fraud,0.0
1,1,2,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,No_Fraud,0.0
2,1,4,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,Fraud,0.0
3,1,1,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,Fraud,0.0
4,1,2,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,No_Fraud,0.0


In [13]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [14]:
# Features: transaction type, amount, and the originator's balance before and after
X = df[['type', 'amount', 'oldbalanceOrg', 'newbalanceOrig']]
# Target: the isFraud label
y = df['isFraud']

In [15]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
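
Because fraud rows are rare, a purely random split can leave the test set with very few of them. A possible variation, sketched here on synthetic data, is to pass `stratify=` to `train_test_split` so that both splits keep the same class ratio. The arrays `X_toy` and `y_toy` are stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 1% labelled 'Fraud' (made-up data).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 4))
y_toy = np.array(["Fraud"] * 10 + ["No_Fraud"] * 990)

# stratify=y_toy keeps the Fraud/No_Fraud proportion identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("fraud rows in train:", (y_tr == "Fraud").sum())
print("fraud rows in test: ", (y_te == "Fraud").sum())
```

With the notebook's variables this would amount to adding `stratify=y` to the existing call.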

In [16]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
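
KNN relies on distances, so a feature with a large numeric range (such as `amount`) can dominate one with a small range (such as the encoded `type`). `StandardScaler` is imported above but not applied; the sketch below shows one way it could be combined with KNN in a pipeline. The data is synthetic and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (made-up values, not the real data).
rng = np.random.default_rng(0)
type_code = rng.integers(1, 6, size=300)        # small range, like the encoded 'type'
amount = rng.uniform(0, 1_000_000, size=300)    # large range, like 'amount'
X_toy = np.column_stack([type_code, amount]).astype(float)
y_toy = np.where(type_code >= 4, "Fraud", "No_Fraud")

# The scaler is fitted inside the pipeline, so it only sees the data passed to fit().
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_toy, y_toy)

print(scaled_knn.predict(X_toy[:5]))
```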

In [17]:
y_pred = knn.predict(X_test)

In [18]:
accuracy_score(y_test, y_pred)

0.9991311588169279
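
Accuracy alone can be misleading when one class is rare: a model that never predicts fraud still scores very high. Since one of the goals is to reduce false negatives, recall on the `'Fraud'` class and the confusion matrix are more informative. The toy labels below are invented to illustrate the effect; they are not results from this notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up example: 98 legitimate rows, 2 fraud rows, and a model that always says 'No_Fraud'.
y_true = ["No_Fraud"] * 98 + ["Fraud"] * 2
y_hat = ["No_Fraud"] * 100

acc = accuracy_score(y_true, y_hat)
rec = recall_score(y_true, y_hat, pos_label="Fraud")

# Rows are true labels, columns are predictions, in the order given by `labels`.
cm = confusion_matrix(y_true, y_hat, labels=["No_Fraud", "Fraud"])
tn, fp, fn, tp = cm.ravel()

print("accuracy:", acc)
print("fraud recall:", rec)
print("false negatives:", fn)
```

With the notebook's variables, the same calls would take `y_test` and `y_pred` in place of the toy lists.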

In [19]:
# Fit a logistic regression classifier on the same training split
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)

In [20]:
y_pred_lr = lr.predict(X_test)

In [21]:
accuracy_score(y_test, y_pred_lr)

0.9999275965680773
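
One way to approach the goal of reducing false negatives is to lower the decision threshold on the predicted fraud probability instead of relying on the default 0.5 used by `predict`. The following is a sketch on synthetic data; the helpers `predict_with_threshold` and `false_negatives` and the threshold 0.2 are illustrative assumptions, not tuned values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced two-class data (made up for illustration).
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(1.5, 1.0, (20, 2))])
y_toy = np.array(["No_Fraud"] * 200 + ["Fraud"] * 20)

clf = LogisticRegression().fit(X_toy, y_toy)

# predict_proba columns follow clf.classes_, so look up the 'Fraud' column explicitly.
fraud_col = list(clf.classes_).index("Fraud")
proba = clf.predict_proba(X_toy)[:, fraud_col]


def predict_with_threshold(p, threshold):
    """Label a row 'Fraud' when its fraud probability reaches the threshold."""
    return np.where(p >= threshold, "Fraud", "No_Fraud")


def false_negatives(y_true, y_pred):
    """Count fraud rows that were predicted as 'No_Fraud'."""
    return int(np.sum((y_true == "Fraud") & (y_pred == "No_Fraud")))


fn_default = false_negatives(y_toy, predict_with_threshold(proba, 0.5))
fn_lower = false_negatives(y_toy, predict_with_threshold(proba, 0.2))
print("false negatives at 0.5:", fn_default)
print("false negatives at 0.2:", fn_lower)
```

Lowering the threshold can only keep or reduce the number of false negatives, at the cost of more false positives, so the threshold would normally be chosen on held-out data.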

In [22]:
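# Score one hand-written transaction; passing a DataFrame with the training column names avoids sklearn's feature-name warning
lr.predict(pd.DataFrame([[4, 181.00, 181.0, 0.00]], columns=X.columns))
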
array(['Fraud'], dtype=object)

In [46]:
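import pickle

# Persist the cleaned DataFrame and the fitted logistic regression; the context managers close the files
with open('df.pkl', 'wb') as f:
    pickle.dump(df, f)
with open('lr.pkl', 'wb') as f:
    pickle.dump(lr, f)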
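
For completeness, here is a minimal sketch of the reverse step: reading a pickled model back and checking that it predicts the same labels as the original. It uses a small synthetic model and a temporary directory rather than the notebook's `lr.pkl`, so every name in it is illustrative. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Small synthetic model standing in for the notebook's `lr` (made-up data).
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(100, 4))
y_toy = np.where(X_toy[:, 0] > 0, "Fraud", "No_Fraud")
model = LogisticRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lr.pkl")
    # Write the fitted model to disk.
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    # Read it back as a new object.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

same = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print("restored model gives identical predictions:", same)
```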

In [26]:
fraud_data = df[df['isFraud'] == 'Fraud']

In [27]:
fraud_data

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
2,1,4,181.00,C1305486145,181.00,0.0,C553264065,0.00,0.00,Fraud,0.0
3,1,1,181.00,C840083671,181.00,0.0,C38997010,21182.00,0.00,Fraud,0.0
251,1,4,2806.00,C1420196421,2806.00,0.0,C972765878,0.00,0.00,Fraud,0.0
252,1,1,2806.00,C2101527076,2806.00,0.0,C1007251739,26202.00,0.00,Fraud,0.0
680,1,4,20128.00,C137533655,20128.00,0.0,C1848415041,0.00,0.00,Fraud,0.0
...,...,...,...,...,...,...,...,...,...,...,...
102181,10,1,2662734.59,C813115168,2662734.59,0.0,C401825929,14165.62,2930405.33,Fraud,0.0
102607,10,4,9217.19,C184586799,9217.19,0.0,C812377986,0.00,96795.60,Fraud,0.0
102608,10,1,9217.19,C1105700111,9217.19,0.0,C1767952032,0.00,9217.19,Fraud,0.0
136419,11,4,2100.00,C785601242,2100.00,0.0,C1576053316,0.00,0.00,Fraud,0.0
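
In the fraud rows displayed above, the amount equals the originator's old balance and the new balance is zero, and the type codes are 4 (TRANSFER) or 1 (CASH_OUT). Whether this holds for all fraud rows is not shown here, but it can be checked with a derived flag. The sketch below uses three fraud rows and one non-fraud row copied from the outputs above; the column name `drains_account` is a hypothetical addition.

```python
import numpy as np
import pandas as pd

# Rows copied from the notebook's displayed output (three fraud rows, one non-fraud row).
rows = pd.DataFrame({
    "type": [4, 1, 4, 2],
    "amount": [181.0, 181.0, 2806.0, 9839.64],
    "oldbalanceOrg": [181.0, 181.0, 2806.0, 170136.0],
    "newbalanceOrig": [0.0, 0.0, 0.0, 160296.36],
    "isFraud": ["Fraud", "Fraud", "Fraud", "No_Fraud"],
})

# Flag transactions that move the entire originating balance and leave it at zero.
rows["drains_account"] = (
    np.isclose(rows["amount"], rows["oldbalanceOrg"]) & (rows["newbalanceOrig"] == 0)
)

# Share of flagged rows within each class.
share = rows.groupby("isFraud")["drains_account"].mean()
print(share)
```

Applied to the full `df`, a flag like this could be inspected per class and, if it separates the classes well, added as a feature.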
