# Fraud Transaction Detection

Machine learning models for fraud transaction detection are designed to automatically identify and flag suspicious or fraudulent transactions in financial systems. They analyze transaction data such as amounts, timestamps, and account details to detect patterns indicative of fraudulent behavior. By learning from past fraud patterns in labeled transaction data, such models can help predict and prevent fraudulent activity in near real time, protecting businesses and consumers from financial losses and security breaches.

In [1]:
#import libraries
import numpy as np
import pandas as pd 

In [2]:
#Load the csv file
fraud = pd.read_csv("Fraud.csv")

In [3]:
#Show first five rows  
fraud.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


In [4]:
#Information of given data
fraud.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


In [5]:
#Size of given data
fraud.shape

(6362620, 11)

In [6]:
#To check the null values
fraud.isna().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

In [7]:
#To count the fraud cases 
fraud.isFraud.value_counts()

isFraud
0    6354407
1       8213
Name: count, dtype: int64
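
The counts above show a heavily imbalanced target. As a rough sketch, the snippet below uses only the two counts printed above to compute the fraud rate and the accuracy of a trivial model that always predicts "not fraud". The variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above
n_legit = 6354407
n_fraud = 8213
n_total = n_legit + n_fraud

# Share of transactions labeled as fraud
fraud_rate = n_fraud / n_total

# Accuracy of a trivial classifier that labels every transaction as non-fraud
baseline_accuracy = n_legit / n_total

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

The fraud rate is roughly 0.13%, so the trivial baseline already reaches an accuracy of about 0.9987. Any accuracy figure reported for this dataset is best read against that number.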

In [8]:
#To count the cases flagged as fraud
fraud.isFlaggedFraud.value_counts()

isFlaggedFraud
0    6362604
1         16
Name: count, dtype: int64

The nameOrig and nameDest columns are account identifiers that are not used as features for training or testing, so we remove those two columns.

In [9]:
# Data Preprocessing
fraud = fraud.drop(['nameOrig','nameDest'], axis=1) 

Label encoding is a process used in machine learning to convert categorical variables (text-based labels) into numerical representations. Many machine learning algorithms require input features to be in numerical format, and label encoding is one way to achieve that conversion.

In [10]:
#Import label encoder
from sklearn import preprocessing

#label encoder object knows how to understand word labels
label_encoder = preprocessing.LabelEncoder()

# fitting the encoder and transforming into its numerical representation
fraud['type']= label_encoder.fit_transform(fraud['type'])
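
To make the encoding concrete, here is a minimal sketch on a toy column built from type values visible in the head() output above. It assumes nothing about the full dataset. The `toy`, `mapping`, and `one_hot` names are illustrative. LabelEncoder assigns integer codes in sorted label order, which implies an ordering between categories that may not be meaningful; one-hot encoding is shown as a possible alternative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column using type values seen in the head() output above
toy = pd.DataFrame({"type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"]})

encoder = LabelEncoder()
toy["type_encoded"] = encoder.fit_transform(toy["type"])

# classes_ holds the sorted original labels; each label's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# Alternative: one indicator column per category, with no implied order
one_hot = pd.get_dummies(toy["type"], prefix="type")
print(one_hot.columns.tolist())
```

The scikit-learn documentation describes LabelEncoder as intended for target labels; for input features, OrdinalEncoder or OneHotEncoder may be the more conventional choice.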

The 'loc' accessor of a pandas DataFrame performs label-based indexing and also accepts boolean masks.
The first argument of 'loc' (before the comma) selects rows; the ':' notation selects all of them.
The second argument (after the comma) selects columns; here the boolean mask fraud.columns != 'isFraud' keeps every column except the target.

In [11]:
x, y = fraud.loc[:, fraud.columns != 'isFraud'], fraud['isFraud']
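
The same column selection can be tried on a small stand-in frame. This is only a sketch: the `toy` frame and its values are made up for illustration, and `drop(columns=...)` is shown as an equivalent alternative rather than as what the notebook uses.

```python
import pandas as pd

# Small stand-in frame with the target column 'isFraud'
toy = pd.DataFrame({
    "amount": [181.0, 9839.64],
    "type": [4, 3],
    "isFraud": [1, 0],
})

# Boolean mask: True for every column except the target
mask = toy.columns != "isFraud"
features = toy.loc[:, mask]
target = toy["isFraud"]
print(features.columns.tolist())

# Equivalent selection that some readers may find easier to scan
features_alt = toy.drop(columns="isFraud")
```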

In [12]:
#split the data into training and testing sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.40, random_state=42)
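
With a rare positive class, a purely random split can leave the training and test sets with slightly different fraud rates. One option, shown here as a sketch on synthetic data and not as the notebook's method, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. The arrays `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 3 features, 2% positives
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1

# stratify=y_toy preserves the 2% positive rate in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.40, random_state=42, stratify=y_toy
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```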

The StandardScaler class from the scikit-learn (sklearn) library standardizes the feature variables in the training and testing sets. Standardization is a common preprocessing technique that rescales each feature to have a mean of 0 and a standard deviation of 1. The scaler is fitted on the training set only and then applied to the test set, so no test-set statistics leak into training. Standardization helps the performance and convergence of many machine learning algorithms.

In [13]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
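
The arithmetic behind the scaler can be reproduced by hand. The sketch below uses made-up toy arrays and plain NumPy to show the idea: the mean and standard deviation come from the training data only and are then reused for the test data.

```python
import numpy as np

# Toy data: two features on very different scales
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

# Statistics are estimated from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0))  # approximately 0 per feature
print(train_scaled.std(axis=0))   # approximately 1 per feature
print(test_scaled)
```

Note that the scaled test values need not have mean 0 or standard deviation 1, because they are transformed with the training statistics.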

Gaussian Naive Bayes is a probabilistic algorithm used for classification tasks and assumes that the features follow a Gaussian (normal) distribution.

In [14]:
#import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics #import scikit-learn metrics for accuracy calculation

#Create a Gaussian classifier
gnb = GaussianNB()

#Train the model using the training sets
gnb.fit(x_train, y_train)
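
To illustrate what the classifier estimates, here is a simplified one-feature sketch in NumPy. It is not scikit-learn's implementation, which for example also adds a small variance smoothing term. The toy data and the helper `gaussian_log_pdf` are hypothetical.

```python
import numpy as np

def gaussian_log_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)

# Toy one-feature training data with two classes
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])

x_new = 4.0
log_posteriors = {}
for cls in (0, 1):
    x_cls = x[y == cls]
    prior = len(x_cls) / len(x)            # class prior from class frequency
    mean, var = x_cls.mean(), x_cls.var()  # per-class Gaussian parameters
    # Unnormalized log posterior: log prior + log likelihood
    log_posteriors[cls] = np.log(prior) + gaussian_log_pdf(x_new, mean, var)

# Predict the class with the highest log posterior
predicted = max(log_posteriors, key=log_posteriors.get)
print(predicted)
```

With several features, the "naive" independence assumption means the per-feature log likelihoods are simply summed for each class.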

In [15]:
#Predict the response for the test dataset
y_pred = gnb.predict(x_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.9955882953877491
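
Because fraud is rare in this dataset, accuracy alone can look high even for a model that misses every fraud case. The class counts earlier in this notebook put the accuracy of an always-non-fraud model near 0.9987 on the full data, although the test split may differ slightly. The sketch below uses synthetic labels and a hypothetical model that never predicts fraud to show how recall and precision expose what accuracy hides.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Synthetic labels: 1000 transactions, 10 of them fraud
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# Hypothetical model that never predicts fraud
y_never = np.zeros(1000, dtype=int)

acc = accuracy_score(y_true, y_never)
rec = recall_score(y_true, y_never)
# zero_division=0 avoids a warning when no positives are predicted
prec = precision_score(y_true, y_never, zero_division=0)
cm = confusion_matrix(y_true, y_never)  # rows: true class, columns: predicted class

print("accuracy:", acc)
print("recall:", rec)
print("precision:", prec)
print(cm)
```

The same functions can be applied to y_test and y_pred from the cells above to see how many fraud cases each model actually catches.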


Logistic Regression is commonly used for binary classification tasks where the target variable has two classes.

In [16]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train,y_train)
y_pred = classifier.predict(x_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.9992070876462841
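
As a rough illustration of how a fitted logistic regression turns features into a class label, the sketch below applies the sigmoid function to a linear score and thresholds the result at 0.5. The weights, bias, and samples are hypothetical values chosen for the example, not parameters learned from the fraud data.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for two standardized features
weights = np.array([1.5, -0.5])
bias = -2.0

samples = np.array([[0.0, 0.0], [3.0, 0.0]])

# Linear score, then probability of the positive class
scores = samples @ weights + bias
probabilities = sigmoid(scores)

# Default decision rule: predict class 1 when the probability is at least 0.5
predictions = (probabilities >= 0.5).astype(int)
print(probabilities, predictions)
```

For rare classes such as fraud, a threshold below 0.5 may be worth considering, at the cost of more false alarms.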
