# Fraud Detection System

### Modules in the Project

- Data Loading System
- Dataset Wrangling System
- Normalization and Encoding System
- Dataset Visualization System
- Model Training System
- Hyperparameter Tuning
- Model Metric Calculation System
- Database System
- View Manager
- Template Manager
- Model Integration

---

*Contributors*
1. [Apurva Jaiswal](https://github.com/ApurvaJaiswal3398/)
2. [Suyash Mihir](https://github.com/mihirsuyash7/)

In [31]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

Loading the Dataset and Counting the Values in the 'isFraud' Column

In [8]:
# forward slash works on every platform and avoids the invalid '\F' escape sequence
df = pd.read_csv('dataset/Fraud_Detection_Dataset.csv')
df['isFraud'].value_counts()

isFraud
0    999464
1       535
Name: count, dtype: int64
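
The two counts above imply a heavily imbalanced target. A quick check of the arithmetic, using only the numbers printed by `value_counts()` (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above
n_legit = 999464
n_fraud = 535

total = n_legit + n_fraud
fraud_rate = n_fraud / total          # fraction of transactions labelled fraud
imbalance_ratio = n_legit / n_fraud   # legitimate transactions per fraud case

print(f"total rows: {total}")
print(f"fraud rate: {fraud_rate:.4%}")
print(f"legitimate per fraud: {imbalance_ratio:.0f}")
```

With roughly one fraud per 1,870 transactions, plain accuracy says very little about how well the fraud class is detected.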

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999999 entries, 0 to 999998
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   step            999999 non-null  int64  
 1   type            999999 non-null  object 
 2   amount          999999 non-null  float64
 3   nameOrig        999999 non-null  object 
 4   oldbalanceOrg   999999 non-null  float64
 5   newbalanceOrig  999999 non-null  float64
 6   nameDest        999999 non-null  object 
 7   oldbalanceDest  999999 non-null  float64
 8   newbalanceDest  999999 non-null  float64
 9   isFraud         999999 non-null  int64  
 10  isFlaggedFraud  999999 non-null  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 83.9+ MB


Counting the NaN Values in the 'isFraud' Column

In [7]:
pd.isna(df['isFraud']).value_counts()

isFraud
False    999999
Name: count, dtype: int64

Creating a new DataFrame of Numerical Features by Dropping the Target ('isFraud') and the Non-numerical Columns

In [16]:
df_new = df.drop(columns=['isFraud', 'type', 'nameOrig', 'nameDest'])
df_new

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFlaggedFraud
0,1,9839.64,170136.00,160296.36,0.0,0.0,0
1,1,1864.28,21249.00,19384.72,0.0,0.0,0
2,1,181.00,181.00,0.00,0.0,0.0,0
3,1,181.00,181.00,0.00,21182.0,0.0,0
4,1,11668.14,41554.00,29885.86,0.0,0.0,0
...,...,...,...,...,...,...,...
999994,45,2987.49,579096.28,576108.80,0.0,0.0,0
999995,45,10913.42,576108.80,565195.38,0.0,0.0,0
999996,45,2014.46,565195.38,563180.92,0.0,0.0,0
999997,45,18839.45,563180.92,544341.47,0.0,0.0,0


In [19]:
df

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.00,C840083671,181.00,0.00,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.0,0.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
999994,45,PAYMENT,2987.49,C2072426611,579096.28,576108.80,M58668896,0.0,0.0,0,0
999995,45,PAYMENT,10913.42,C1384914558,576108.80,565195.38,M166797080,0.0,0.0,0,0
999996,45,PAYMENT,2014.46,C1207593845,565195.38,563180.92,M1027899613,0.0,0.0,0,0
999997,45,PAYMENT,18839.45,C260638437,563180.92,544341.47,M243388883,0.0,0.0,0,0
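
The full frame above still carries the `type` column that `df_new` drops. If the transaction type were to be kept as a feature, one option is one-hot encoding. The sketch below is only an illustration on a toy frame built from the first five rows shown above; it is not part of the notebook's pipeline, and `toy` and `encoded` are hypothetical names.

```python
import pandas as pd

# Toy rows copied from the first five transactions shown above
toy = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [9839.64, 1864.28, 181.00, 181.00, 11668.14],
})

# One indicator column per transaction type; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["type"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

The encoded frame stays purely numerical, so it could be passed to `LogisticRegression` in the same way as `df_new`.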


Splitting the Dataset into *Training* and *Testing Datasets*

In [17]:
from sklearn.model_selection import train_test_split
X = df_new
Y = df['isFraud']
# split into a 70:30 train/test ratio
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

# print the shapes of the train and test sets
print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)


Number transactions X_train dataset:  (699999, 7)
Number transactions y_train dataset:  (699999,)
Number transactions X_test dataset:  (300000, 7)
Number transactions y_test dataset:  (300000,)


In [20]:
X_train

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFlaggedFraud
823731,41,107360.28,12015.00,0.00,329685.09,437045.37,0
70066,9,17804.01,143816.24,126012.23,0.00,0.00,0
591943,33,171281.86,7543395.91,7714677.77,1582800.28,1411518.42,0
578645,33,1664.86,0.00,0.00,0.00,0.00,0
675460,36,441638.40,0.00,0.00,30000000.00,30500000.00,0
...,...,...,...,...,...,...,...
963395,44,478.34,70085.00,69606.66,0.00,0.00,0
117952,11,671493.76,0.00,0.00,11300000.00,12100000.00,0
435829,18,13819.33,0.00,0.00,1858720.70,2084278.02,0
305711,15,136847.95,298.00,0.00,183542.51,320390.46,0


In [28]:
tr = y_train == 1
tr.value_counts()

isFraud
False    699624
True        375
Name: count, dtype: int64
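
The split above is not stratified, so the number of frauds that land in each part depends on the random seed. A minimal sketch of a stratified split follows, assuming synthetic data (`X_toy` and `y_toy` are made-up stand-ins, not the project's dataset) and using the `stratify=` parameter of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 10 positives (1%)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:10] = 1

# stratify=y_toy keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)
print("positives in train:", y_tr.sum())
print("positives in test:", y_te.sum())
```

With so few positive rows, stratifying removes one source of run-to-run variation when comparing models.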

In [22]:
X_test

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFlaggedFraud
157105,12,11562.20,54709.00,43146.80,52907.99,64470.19,0
374554,17,97672.67,323928.00,226255.33,0.00,97672.67,0
973251,44,73829.19,49548.00,0.00,50243.63,124072.82,0
265381,15,205872.83,714471.00,920343.83,0.00,0.00,0
687074,36,69108.92,0.00,0.00,193425.32,262534.24,0
...,...,...,...,...,...,...,...
938531,43,487.32,0.00,0.00,0.00,0.00,0
732856,37,157371.13,381886.31,224515.18,1672195.90,1829567.03,0
30508,8,109162.22,6178160.39,6287322.61,284079.94,174917.71,0
740010,38,126714.51,0.00,0.00,133046.68,259761.19,0


In [29]:
(y_test == 1).value_counts()

isFraud
False    299840
True        160
Name: count, dtype: int64
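
With only 160 frauds among 300,000 test rows, it helps to know what a trivial model would score before reading any accuracy figure. The sketch below uses only the two test-set counts printed above; the "always predict not fraud" baseline is an illustration, not something the notebook trains.

```python
# Test-set class counts copied from the output above
n_legit_test = 299840
n_fraud_test = 160

# A model that predicts "not fraud" for every row gets every legitimate
# transaction right and every fraud wrong
baseline_accuracy = n_legit_test / (n_legit_test + n_fraud_test)
baseline_fraud_recall = 0 / n_fraud_test

print(f"baseline accuracy: {baseline_accuracy:.5f}")
print(f"baseline fraud recall: {baseline_fraud_recall:.2f}")
```

Any accuracy that is not clearly above this baseline carries no evidence of fraud detection, which is why the per-class precision and recall are the figures to watch.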

Applying Logistic Regression and Checking the Classification Report

In [32]:
# logistic regression object
lr = LogisticRegression()

# train the model on the train set
# (Series.ravel() is deprecated in recent pandas; to_numpy() returns the same 1-D array)
lr.fit(X_train, y_train.to_numpy())

predictions = lr.predict(X_test)

# print classification report
print(classification_report(y_test, predictions))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


              precision    recall  f1-score   support

           0       1.00      1.00      1.00    299840
           1       0.78      0.23      0.35       160

    accuracy                           1.00    300000
   macro avg       0.89      0.61      0.67    300000
weighted avg       1.00      1.00      1.00    300000
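
The solver warning above suggests raising `max_iter` or scaling the data. A minimal sketch of the scaling route is shown below, assuming a scikit-learn `Pipeline` with `StandardScaler` in front of the classifier. The data is synthetic (`X_toy` and `y_toy` only mimic features on very different scales), so the printed score says nothing about the fraud dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with features on very different scales, loosely
# mimicking 'step' (tens) next to balances (millions)
rng = np.random.default_rng(0)
n = 2000
step = rng.integers(1, 46, size=n).astype(float)
balance = rng.exponential(scale=1e6, size=n)
X_toy = np.column_stack([step, balance])
y_toy = (balance > np.median(balance)).astype(int)

# Standardise each feature to zero mean and unit variance before fitting
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_toy, y_toy)
print("training accuracy:", model.score(X_toy, y_toy))
print("solver iterations:", model[-1].n_iter_[0])
```

Because the scaler sits inside the pipeline, it is fitted on the training data only and then reused unchanged when `model.predict` is called on test data.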



Implementing SMOTE to balance the Dataset

In [33]:
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))

# import SMOTE from the imbalanced-learn library
# pip install imbalanced-learn (if it is not already installed; it is imported as imblearn)
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state = 2)
# resample the training set only; the test set keeps its original class balance
X_train_res, y_train_res = sm.fit_resample(X_train, y_train.to_numpy())

print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_res == 1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res == 0)))


Before OverSampling, counts of label '1': 375
Before OverSampling, counts of label '0': 699624 

After OverSampling, the shape of train_X: (1399248, 7)
After OverSampling, the shape of train_y: (1399248,) 

After OverSampling, counts of label '1': 699624
After OverSampling, counts of label '0': 699624


In [37]:
X_train_res

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFlaggedFraud
0,41,107360.280000,1.201500e+04,0.00,3.296851e+05,4.370454e+05,0
1,9,17804.010000,1.438162e+05,126012.23,0.000000e+00,0.000000e+00,0
2,33,171281.860000,7.543396e+06,7714677.77,1.582800e+06,1.411518e+06,0
3,33,1664.860000,0.000000e+00,0.00,0.000000e+00,0.000000e+00,0
4,36,441638.400000,0.000000e+00,0.00,3.000000e+07,3.050000e+07,0
...,...,...,...,...,...,...,...
1399243,42,42593.185914,4.259319e+04,0.00,7.717958e+03,0.000000e+00,0
1399244,1,24049.006815,2.404901e+04,0.00,1.000079e+04,3.404980e+04,0
1399245,1,1444.355545,1.444356e+03,0.00,2.359802e+04,0.000000e+00,0
1399246,11,117983.116337,1.179831e+05,0.00,4.007679e+06,4.125662e+06,0
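
The rows appended at the end of `X_train_res` are synthetic: SMOTE creates each one by interpolating between a minority sample and one of its nearest minority neighbours. The NumPy sketch below illustrates that interpolation step only; it is a simplified stand-in, not imbalanced-learn's implementation, and `smote_like_sample` and the four toy points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Tiny made-up minority class: 4 points in 2-D
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 3.0]])

def smote_like_sample(points, k, rng):
    """Create one synthetic point by interpolating between a random
    minority point and one of its k nearest minority neighbours."""
    i = rng.integers(len(points))
    base = points[i]
    # Euclidean distances from the base point to every minority point
    dists = np.linalg.norm(points - base, axis=1)
    # Indices of the k nearest neighbours, skipping the point itself
    neighbours = np.argsort(dists)[1:k + 1]
    neighbour = points[rng.choice(neighbours)]
    lam = rng.random()  # interpolation factor in [0, 1)
    return base + lam * (neighbour - base)

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Every synthetic point lies on a line segment between two real minority points, so the oversampled class fills in the region the frauds already occupy rather than adding new information.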


In [39]:
y_train_res

array([0, 0, 0, ..., 1, 1, 1], dtype=int64)

In [47]:
import plotly.graph_objects as go

# table figure holding only the dataset's column names as its header (no cell values yet)
columns = ['isFraud', 'type', 'nameOrig', 'nameDest', 'step', 'amount',
           'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']
fig = go.Figure(data=[go.Table(header=dict(values=columns))])

Applying Logistic Regression on the oversampled data

In [34]:
# logistic regression object
lr = LogisticRegression()

# train the model on train set
lr.fit(X_train_res, y_train_res.ravel())

predictions = lr.predict(X_test)

# print classification report
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       1.00      0.91      0.96    299840
           1       0.01      0.94      0.01       160

    accuracy                           0.91    300000
   macro avg       0.50      0.93      0.48    300000
weighted avg       1.00      0.91      0.95    300000
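
The fraud precision of 0.01 follows from the class sizes. The sketch below reconstructs approximate counts from the rounded figures in the report above, so every derived number is an estimate rather than an exact confusion-matrix entry.

```python
# Figures copied from the classification report above (rounded to 2 decimals,
# so everything derived from them is approximate)
n_legit_test, n_fraud_test = 299840, 160
recall_legit = 0.91   # share of legitimate transactions predicted as legitimate
recall_fraud = 0.94   # share of frauds predicted as fraud

true_pos = recall_fraud * n_fraud_test            # frauds flagged as fraud
false_pos = (1 - recall_legit) * n_legit_test     # legitimate rows flagged as fraud
precision_fraud = true_pos / (true_pos + false_pos)

print(f"approx. frauds caught: {true_pos:.0f}")
print(f"approx. false alarms: {false_pos:.0f}")
print(f"approx. fraud precision: {precision_fraud:.4f}")
```

About 150 caught frauds sit among roughly 27,000 false alarms, so the higher recall is paid for with a flood of wrongly flagged transactions.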



Implementing NearMiss to balance the Dataset

In [36]:
print("Before Undersampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before Undersampling, counts of label '0': {} \n".format(sum(y_train == 0)))

# apply near miss
from imblearn.under_sampling import NearMiss
nr = NearMiss()

# resample the training set only (to_numpy() replaces the deprecated Series.ravel())
X_train_miss, y_train_miss = nr.fit_resample(X_train, y_train.to_numpy())

print('After Undersampling, the shape of train_X: {}'.format(X_train_miss.shape))
print('After Undersampling, the shape of train_y: {} \n'.format(y_train_miss.shape))

print("After Undersampling, counts of label '1': {}".format(sum(y_train_miss == 1)))
print("After Undersampling, counts of label '0': {}".format(sum(y_train_miss == 0)))


Before Undersampling, counts of label '1': 375
Before Undersampling, counts of label '0': 699624 

After Undersampling, the shape of train_X: (750, 7)
After Undersampling, the shape of train_y: (750,) 

After Undersampling, counts of label '1': 375
After Undersampling, counts of label '0': 375
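
NearMiss does not pick the 375 retained majority rows at random. In its version 1 variant, it keeps the majority samples whose average distance to their closest minority samples is smallest. The NumPy sketch below illustrates that selection rule on synthetic points; `nearmiss1_like` is a hypothetical helper written for this example, not imbalanced-learn's code.

```python
import numpy as np

def nearmiss1_like(X_major, X_minor, n_keep, k=3):
    """Keep the n_keep majority rows whose mean distance to their
    k closest minority rows is smallest."""
    # Pairwise Euclidean distances, shape (n_major, n_minor)
    diffs = X_major[:, None, :] - X_minor[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # Mean of the k smallest distances for each majority row
    k_nearest = np.sort(dists, axis=1)[:, :k]
    score = k_nearest.mean(axis=1)
    keep = np.argsort(score)[:n_keep]
    return X_major[keep]

rng = np.random.default_rng(0)
X_minor = rng.normal(loc=0.0, scale=1.0, size=(5, 2))    # tight minority cluster
X_major = rng.normal(loc=0.0, scale=5.0, size=(200, 2))  # widely spread majority

X_major_kept = nearmiss1_like(X_major, X_minor, n_keep=len(X_minor))
print(X_major_kept.shape)
```

The kept majority rows are the ones that look most like the minority class, so the undersampled training set is not a representative sample of legitimate transactions.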


In [40]:
X_train_miss

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFlaggedFraud
0,12,176.50,164.00,0.00,0.00,0.00,0
1,7,160.87,174.00,13.13,0.00,0.00,0
2,2,157.91,172.00,14.09,0.00,0.00,0
3,35,158.43,150.00,0.00,0.00,0.00,0
4,11,195.34,140.00,0.00,0.00,0.00,0
...,...,...,...,...,...,...,...
745,3,22877.00,22877.00,0.00,0.00,0.00,0
746,27,496128.26,496128.26,0.00,0.00,0.00,0
747,9,2539898.07,2539898.07,0.00,0.00,261290.69,0
748,43,435166.65,435166.65,0.00,0.00,0.00,0


In [43]:
y_train_miss

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

Applying Logistic Regression on Undersampled Data

In [44]:
# train the model on train set
lr2 = LogisticRegression()
lr2.fit(X_train_miss, y_train_miss.ravel())
predictions = lr2.predict(X_test)

# print classification report
print(classification_report(y_test, predictions))


ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


              precision    recall  f1-score   support

           0       1.00      0.43      0.61    299840
           1       0.00      0.99      0.00       160

    accuracy                           0.43    300000
   macro avg       0.50      0.71      0.30    300000
weighted avg       1.00      0.43      0.60    300000
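
To compare the three runs side by side, the fraud-class figures from the classification reports above can be collected into one table. The sketch below only restates numbers already printed in this notebook; the row labels are shorthand chosen for this example.

```python
import pandas as pd

# Fraud-class (label 1) figures copied from the three classification reports
summary = pd.DataFrame(
    {
        "precision": [0.78, 0.01, 0.00],
        "recall": [0.23, 0.94, 0.99],
        "f1": [0.35, 0.01, 0.00],
        "accuracy": [1.00, 0.91, 0.43],
    },
    index=["original", "SMOTE", "NearMiss"],
)
print(summary)
print("best fraud F1:", summary["f1"].idxmax())
print("best fraud recall:", summary["recall"].idxmax())
```

Both resampling strategies raise fraud recall sharply but collapse precision, so on these reports the model trained on the original data keeps the highest fraud F1.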

