# Fraud Detection
This project fits and evaluates a model that predicts whether a transaction is fraudulent.

**Data sources**

- Kaggle: [Synthetic Financial Datasets For Fraud Detection](https://www.kaggle.com/datasets/ealaxi/paysim1) 

For this project, a reduced dataset (`transactions_modified.csv`) with only 1,000 observations is used instead of the full file `transactions.csv`, which contains 200,000 observations.

## Import Python Modules

First, import the preliminary modules that will be used in this project:

In [1]:
import seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## Load Data
First, let's load the 1,000 transactions into `transactions`. The `isFraud` column indicates whether each transaction is fraudulent.

In [35]:
transactions = pd.read_csv('transactions_modified.csv')
transactions.head()

,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isPayment,isMovement,accountDiff
0,206,CASH_OUT,62927.08,C473782114,0.0,0.0,C2096898696,649420.67,712347.75,0,0,1,649420.67
1,380,PAYMENT,32851.57,C1915112886,0.0,0.0,M916879292,0.0,0.0,0,1,0,0.0
2,570,CASH_OUT,1131750.38,C1396198422,1131750.38,0.0,C1612235515,313070.53,1444820.92,1,0,1,818679.85
3,184,CASH_OUT,60519.74,C982551468,60519.74,0.0,C1378644910,54295.32,182654.5,1,0,1,6224.42
4,162,CASH_IN,46716.01,C1759889425,7668050.6,7714766.61,C2059152908,2125468.75,2078752.75,0,0,0,5542581.85


Now, let's count the fraudulent transactions in the sample and look at basic information about the data.

In [11]:
number_frauds = len(transactions[transactions['isFraud'] == 1])
print('Number of frauds: ', number_frauds, '\n')
transactions.info()

Number of frauds:  282 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   step            1000 non-null   int64  
 1   type            1000 non-null   object 
 2   amount          1000 non-null   float64
 3   nameOrig        1000 non-null   object 
 4   oldbalanceOrg   1000 non-null   float64
 5   newbalanceOrig  1000 non-null   float64
 6   nameDest        1000 non-null   object 
 7   oldbalanceDest  1000 non-null   float64
 8   newbalanceDest  1000 non-null   float64
 9   isFraud         1000 non-null   int64  
 10  isPayment       1000 non-null   int64  
 11  isMovement      1000 non-null   int64  
 12  accountDiff     1000 non-null   float64
dtypes: float64(6), int64(4), object(3)
memory usage: 101.7+ KB
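
The count above implies an imbalanced target: 282 of 1,000 transactions are fraudulent. As a minimal sketch of how the class shares and a majority-class baseline could be checked, the snippet below uses a toy series (`is_fraud`, a hypothetical stand-in for `transactions['isFraud']`) built from that count.

```python
import pandas as pd

# Toy stand-in for the `isFraud` column: 282 frauds out of 1,000 rows,
# matching the count printed above (an assumption for illustration).
is_fraud = pd.Series([1] * 282 + [0] * 718, name="isFraud")

# Absolute counts and relative shares of each class
counts = is_fraud.value_counts()
shares = is_fraud.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = shares.max()

print(counts)
print(shares)
print("Majority-class baseline accuracy:", baseline_accuracy)
```

Any accuracy reported for a model on this data is easier to judge against that baseline.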


## Prepare Data
Among the columns, `amount` is likely to be important. Let's use summary statistics to better understand it.

In [15]:
transactions['amount'].describe()

count    1.000000e+03
mean     5.373080e+05
std      1.423692e+06
min      0.000000e+00
25%      2.933705e+04
50%      1.265305e+05
75%      3.010378e+05
max      1.000000e+07
Name: amount, dtype: float64

Now, let's use one-hot encoding to create one indicator column per transaction type.

In [36]:
different_types = pd.get_dummies(transactions['type'], dtype=int)
transactions = transactions.join(different_types)
transactions.head()

,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isPayment,isMovement,accountDiff,CASH_IN,CASH_OUT,DEBIT,PAYMENT,TRANSFER
0,206,CASH_OUT,62927.08,C473782114,0.0,0.0,C2096898696,649420.67,712347.75,0,0,1,649420.67,0,1,0,0,0
1,380,PAYMENT,32851.57,C1915112886,0.0,0.0,M916879292,0.0,0.0,0,1,0,0.0,0,0,0,1,0
2,570,CASH_OUT,1131750.38,C1396198422,1131750.38,0.0,C1612235515,313070.53,1444820.92,1,0,1,818679.85,0,1,0,0,0
3,184,CASH_OUT,60519.74,C982551468,60519.74,0.0,C1378644910,54295.32,182654.5,1,0,1,6224.42,0,1,0,0,0
4,162,CASH_IN,46716.01,C1759889425,7668050.6,7714766.61,C2059152908,2125468.75,2078752.75,0,0,0,5542581.85,1,0,0,0,0


Another signal to consider is a destination balance that does not change as expected after a transaction, which could indicate fraud. As a simple proxy, the column `accountDiff` is used: it holds the absolute difference between the origin and destination account balances before the transaction (`oldbalanceOrg` and `oldbalanceDest`).

In [40]:
transactions['accountDiff'] = abs(transactions['oldbalanceOrg'] - transactions['oldbalanceDest'])
transactions.head()

,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isPayment,isMovement,accountDiff,CASH_IN,CASH_OUT,DEBIT,PAYMENT,TRANSFER
0,206,CASH_OUT,62927.08,C473782114,0.0,0.0,C2096898696,649420.67,712347.75,0,0,1,649420.67,0,1,0,0,0
1,380,PAYMENT,32851.57,C1915112886,0.0,0.0,M916879292,0.0,0.0,0,1,0,0.0,0,0,0,1,0
2,570,CASH_OUT,1131750.38,C1396198422,1131750.38,0.0,C1612235515,313070.53,1444820.92,1,0,1,818679.85,0,1,0,0,0
3,184,CASH_OUT,60519.74,C982551468,60519.74,0.0,C1378644910,54295.32,182654.5,1,0,1,6224.42,0,1,0,0,0
4,162,CASH_IN,46716.01,C1759889425,7668050.6,7714766.61,C2059152908,2125468.75,2078752.75,0,0,0,5542581.85,1,0,0,0,0


## Train and Test Model
Now, let's split the data into training and test sets. A 70/30 split will be used.

In [46]:
# Get the independent features to predict if there is Fraud
X = transactions[['amount',
                'isPayment',
                'isMovement',
                'accountDiff']]

y = transactions['isFraud']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=6)
print('Shapes of the train and test data')
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Shapes of the train and test data
(700, 4)
(300, 4)
(700,)
(300,)
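
With an imbalanced target, a plain random split can leave the training and test sets with slightly different fraud shares. One option, shown here only as a sketch on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical, not the project's data), is the `stratify` parameter of `train_test_split`, which keeps the class proportions close to equal in both parts. The split used in this project does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows, 4 features, 282 positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([1] * 282 + [0] * 718)

# stratify=y_demo asks for the same class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=6, stratify=y_demo
)

print("Shapes:", X_tr.shape, X_te.shape)
print("Fraud share in train:", y_tr.mean())
print("Fraud share in test: ", y_te.mean())
```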


### Scaling Data
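scikit-learn's `LogisticRegression` applies L2 regularization by default, which penalizes coefficients according to their size, so the features should be on a comparable scale. We therefore standardize the data with `StandardScaler`, fitting it on the training set only and reusing those statistics to transform the test set.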

In [47]:
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)
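
To illustrate what the scaling step does, here is a small sketch on synthetic data (the arrays `train` and `test` and the name `demo_scaler` are assumptions for illustration). It shows that `fit_transform` learns the mean and standard deviation from the training data, and that `transform` reuses those same values on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on very different scales, e.g. an amount and a 0/1 flag
train = np.column_stack([rng.uniform(0, 1e6, 700), rng.integers(0, 2, 700)])
test = np.column_stack([rng.uniform(0, 1e6, 300), rng.integers(0, 2, 300)])

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean/std from train only
test_scaled = demo_scaler.transform(test)        # reuses the training mean/std

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled.mean(axis=0))   # close to 0, but generally not exactly 0
```

Fitting the scaler on the training set only keeps information from the test set out of the preprocessing.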

### Train and Test the Model
Now, let's train the model on the scaled data and assess its performance.

In [48]:
model = LogisticRegression()
model.fit(X_train, y_train)

y_predicted = model.predict(X_test)
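
For a binary problem, `predict()` can be understood as thresholding the predicted probability of class 1 at 0.5, where that probability is the logistic (sigmoid) function of a linear score. The sketch below checks this on synthetic data; `X_demo`, `y_demo`, and `clf` are hypothetical names and are not the model fitted above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, partly separable binary data (assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score z = w . x + b, then the logistic (sigmoid) function
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Compare with the library's probability for class 1
library_proba = clf.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_proba, library_proba))

# predict() corresponds to thresholding that probability at 0.5
manual_labels = (manual_proba > 0.5).astype(int)
print(np.array_equal(manual_labels, clf.predict(X_demo)))
```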

Now, let's assess the model with the `score()` method, which returns the accuracy (the fraction of correctly classified transactions).

In [50]:
print("Train score:")
print(model.score(X_train, y_train))
print()
print("Test score:")
print(model.score(X_test, y_test))

Train score:
0.8414285714285714

Test score:
0.85
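
Because the classes are imbalanced, accuracy does not show how many frauds are actually caught. A confusion matrix, together with precision and recall, gives a fuller picture. The sketch below runs on synthetic stand-in data (`X_demo`, `y_demo`, and `clf` are hypothetical), so its numbers say nothing about the project's model. In the notebook, the same functions could be applied to `y_test` and `y_predicted`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the project's data): 282 positives out of
# 1,000, with positives shifted so the classes are partly separable.
rng = np.random.default_rng(0)
y_demo = np.array([1] * 282 + [0] * 718)
X_demo = rng.normal(size=(1000, 4)) + y_demo[:, None]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=6
)
clf = LogisticRegression().fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_te, y_pred)
precision = precision_score(y_te, y_pred)  # TP / (TP + FP)
recall = recall_score(y_te, y_pred)        # TP / (TP + FN)

print(cm)
print("Precision:", precision)
print("Recall:   ", recall)
```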


### Analysis
From the output below, `amount` has the largest coefficient in absolute value and therefore the greatest impact on the model, while `isPayment` has the smallest. Because the features were standardized, the coefficient magnitudes are directly comparable.

In [51]:
coefficients = model.coef_
intercept = model.intercept_
print('coefficients: ', coefficients)
print('intercept: ', intercept)

coefficients:  [[ 2.76728882 -0.61054026  2.06030391 -1.29953811]]
intercept:  [-2.12167799]
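
To make the comparison easier to read, the coefficients can be paired with their feature names and sorted by absolute value. This sketch copies the values printed above into a standalone array (`coefs`) and assumes they follow the column order used for `X`. In the notebook, `model.coef_[0]` could be used instead.

```python
import numpy as np

# Feature order as used when building X; values copied from the output above
feature_names = ['amount', 'isPayment', 'isMovement', 'accountDiff']
coefs = np.array([2.76728882, -0.61054026, 2.06030391, -1.29953811])

# Sort by absolute value, largest first; the sign gives the direction of the
# effect on the log-odds of fraud for a one-standard-deviation increase
ranked = sorted(zip(feature_names, coefs), key=lambda pair: abs(pair[1]), reverse=True)

for name, value in ranked:
    print(f"{name:12s} {value:+.4f}")
```

A positive coefficient raises the predicted log-odds of fraud as the feature increases, while a negative one lowers them.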
