step - maps a unit of time in the real world. In this case 1 step is 1 hour. The full simulation has 744 steps; the description calls this a 30-day simulation, although 744 hours is 31 days.

type - transaction type: CASH_IN, CASH_OUT, DEBIT, PAYMENT or TRANSFER.

amount - amount of the transaction in local currency.

nameOrig - customer who started the transaction

oldbalanceOrg - initial balance before the transaction

newbalanceOrig - new balance after the transaction

nameDest - customer who is the recipient of the transaction

oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for recipients whose name starts with M (merchants).

newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for recipients whose name starts with M (merchants).

isFraud - marks the transactions made by the fraudulent agents inside the simulation. In this dataset, the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty them by transferring the funds to another account and then cashing out of the system.

isFlaggedFraud - the business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
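The flagging rule in the isFlaggedFraud description can be written as a small predicate. This is a minimal sketch based only on that description: the function name `is_flagged` and the constant `FLAG_THRESHOLD` are hypothetical, restricting the rule to the TRANSFER type is an assumption, and the simulator may apply extra conditions that the description does not mention.

```python
# Hypothetical helper illustrating the isFlaggedFraud rule as described:
# a transfer of more than 200,000 in a single transaction is flagged.
FLAG_THRESHOLD = 200_000  # assumption: value taken from the column description


def is_flagged(tx_type: str, amount: float) -> bool:
    """Return True when the described rule would flag the transaction."""
    return tx_type == "TRANSFER" and amount > FLAG_THRESHOLD


# Toy examples (type, amount); not rows from the dataset
examples = [
    ("TRANSFER", 181.00),
    ("TRANSFER", 250_000.00),
    ("CASH_OUT", 250_000.00),
]
flags = [is_flagged(tx_type, amount) for tx_type, amount in examples]
print(flags)
```

Only the large TRANSFER is flagged; the CASH_OUT of the same size is not, because the description talks about transfers between accounts.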


**Importing the necessary libraries**

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report,f1_score,precision_score,recall_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression

**Reading the data**

In [2]:
data=pd.read_csv("Fraud.csv")

**Retrieving the top 5 records from the data.**

In [3]:
data.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0.0,0.0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0.0,0.0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1.0,0.0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1.0,0.0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0.0,0.0


**Retrieving the bottom 5 records from the data**

In [4]:
data.tail()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
219760,14,PAYMENT,591.5,C1443113739,104879.68,104288.18,M936090293,0.0,0.0,0.0,0.0
219761,14,PAYMENT,14839.24,C554257502,104288.18,89448.94,M1152875694,0.0,0.0,0.0,0.0
219762,14,CASH_IN,144258.67,C865880896,298512.0,442770.67,C351716642,891449.26,747190.6,0.0,0.0
219763,14,CASH_OUT,348453.47,C1483305396,11894.0,0.0,C466835398,373004.93,721458.4,0.0,0.0
219764,14,PAYMENT,38981.58,C862563675,11.0,,,,,,


**All the columns in the dataframe**

In [5]:
data.columns

Index(['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
       'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud',
       'isFlaggedFraud'],
      dtype='object')

**Knowing the data types of all the features**

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 219765 entries, 0 to 219764
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   step            219765 non-null  int64  
 1   type            219765 non-null  object 
 2   amount          219765 non-null  float64
 3   nameOrig        219765 non-null  object 
 4   oldbalanceOrg   219765 non-null  float64
 5   newbalanceOrig  219764 non-null  float64
 6   nameDest        219764 non-null  object 
 7   oldbalanceDest  219764 non-null  float64
 8   newbalanceDest  219764 non-null  float64
 9   isFraud         219764 non-null  float64
 10  isFlaggedFraud  219764 non-null  float64
dtypes: float64(7), int64(1), object(3)
memory usage: 18.4+ MB


In [7]:
data

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.0,0.0,0.0
1,1,PAYMENT,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.0,0.0,0.0
2,1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.0,1.0,0.0
3,1,CASH_OUT,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.0,1.0,0.0
4,1,PAYMENT,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
219760,14,PAYMENT,591.50,C1443113739,104879.68,104288.18,M936090293,0.00,0.0,0.0,0.0
219761,14,PAYMENT,14839.24,C554257502,104288.18,89448.94,M1152875694,0.00,0.0,0.0,0.0
219762,14,CASH_IN,144258.67,C865880896,298512.00,442770.67,C351716642,891449.26,747190.6,0.0,0.0
219763,14,CASH_OUT,348453.47,C1483305396,11894.00,0.00,C466835398,373004.93,721458.4,0.0,0.0


**Getting the numeric features from the dataframe**

In [8]:
data_numeric=data.select_dtypes(include=['int64',"float64"])

**Finding the correlation between the features**

In [9]:
data_numeric.corr()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
step,1.0,0.042946,-0.006093,-0.005527,0.02639,0.005881,-0.03653,
amount,0.042946,1.0,-0.019907,-0.0242,0.217187,0.340119,0.045349,
oldbalanceOrg,-0.006093,-0.019907,1.0,0.99894,0.095025,0.064113,-0.001483,
newbalanceOrig,-0.005527,-0.0242,0.99894,1.0,0.096643,0.063158,-0.008244,
oldbalanceDest,0.02639,0.217187,0.095025,0.096643,1.0,0.95453,-0.008184,
newbalanceDest,0.005881,0.340119,0.064113,0.063158,0.95453,1.0,-0.004663,
isFraud,-0.03653,0.045349,-0.001483,-0.008244,-0.008184,-0.004663,1.0,
isFlaggedFraud,,,,,,,,
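The isFlaggedFraud row and column are empty (NaN) in the matrix above. That is what a Pearson correlation gives when a column has zero variance, which is consistent with every isFlaggedFraud value being 0.0 in the rows shown. A minimal, library-free sketch of the formula follows; the function name `pearson` is illustrative and the toy values are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return math.nan  # 0/0: correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


amount = [100.0, 200.0, 300.0, 400.0]
doubled = [200.0, 400.0, 600.0, 800.0]
all_zero = [0.0, 0.0, 0.0, 0.0]

r_linear = pearson(amount, doubled)     # perfectly linear relationship
r_constant = pearson(amount, all_zero)  # constant column, undefined correlation
print(r_linear, r_constant)
```

A column that never varies carries no information for a model either, so it is reasonable that isFlaggedFraud is not used as a feature later on.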


**Trying to determine if there is any relationship between transaction type and the occurrence of fraud.**

In [11]:
data[['type','isFraud']]

Unnamed: 0,type,isFraud
0,PAYMENT,0.0
1,PAYMENT,0.0
2,TRANSFER,1.0
3,CASH_OUT,1.0
4,PAYMENT,0.0
...,...,...
219760,PAYMENT,0.0
219761,PAYMENT,0.0
219762,CASH_IN,0.0
219763,CASH_OUT,0.0


In [12]:
data['type'].unique()

array(['PAYMENT', 'TRANSFER', 'CASH_OUT', 'DEBIT', 'CASH_IN'],
      dtype=object)
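Displaying the two columns side by side does not yet quantify the relationship. One way to do that is to count fraud labels per transaction type. A minimal sketch in plain Python, using only the five rows shown by data.head() as toy input (the variable names are illustrative; on the full DataFrame a pandas group-by or cross-tabulation would be the natural tool):

```python
from collections import Counter

# (type, isFraud) pairs copied from the five rows shown by data.head()
rows = [
    ("PAYMENT", 0.0),
    ("PAYMENT", 0.0),
    ("TRANSFER", 1.0),
    ("CASH_OUT", 1.0),
    ("PAYMENT", 0.0),
]

totals = Counter(tx_type for tx_type, _ in rows)
frauds = Counter(tx_type for tx_type, is_fraud in rows if is_fraud == 1.0)

# Fraction of transactions of each type that are labelled as fraud
fraud_rate = {tx_type: frauds[tx_type] / totals[tx_type] for tx_type in totals}
print(fraud_rate)
```

Five rows are far too few to draw conclusions; the point is only the shape of the computation.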

**Finding the rows with NaN**

In [13]:
rows_with_nan = data[data.isnull().any(axis=1)]

In [14]:
rows_with_nan

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
219764,14,PAYMENT,38981.58,C862563675,11.0,,,,,,


**Dropping the NaN records**

In [15]:
data=data.dropna()
data

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0.0,0.0
1,1,PAYMENT,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0.0,0.0
2,1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1.0,0.0
3,1,CASH_OUT,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1.0,0.0
4,1,PAYMENT,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
219759,14,CASH_IN,104070.68,C1369874366,809.00,104879.68,C1765636753,635479.45,531408.77,0.0,0.0
219760,14,PAYMENT,591.50,C1443113739,104879.68,104288.18,M936090293,0.00,0.00,0.0,0.0
219761,14,PAYMENT,14839.24,C554257502,104288.18,89448.94,M1152875694,0.00,0.00,0.0,0.0
219762,14,CASH_IN,144258.67,C865880896,298512.00,442770.67,C351716642,891449.26,747190.60,0.0,0.0


**All the records which are fraudulent.**

In [16]:
fraud_data=data[data['isFraud']==1].reset_index()

In [17]:
fraud_data

Unnamed: 0,index,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,2,1,TRANSFER,181.00,C1305486145,181.00,0.0,C553264065,0.00,0.00,1.0,0.0
1,3,1,CASH_OUT,181.00,C840083671,181.00,0.0,C38997010,21182.00,0.00,1.0,0.0
2,251,1,TRANSFER,2806.00,C1420196421,2806.00,0.0,C972765878,0.00,0.00,1.0,0.0
3,252,1,CASH_OUT,2806.00,C2101527076,2806.00,0.0,C1007251739,26202.00,0.00,1.0,0.0
4,680,1,TRANSFER,20128.00,C137533655,20128.00,0.0,C1848415041,0.00,0.00,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
150,204215,13,TRANSFER,2686564.04,C364029634,2686564.04,0.0,C2041291172,0.00,0.00,1.0,0.0
151,204216,13,CASH_OUT,2686564.04,C901331326,2686564.04,0.0,C1662761228,438199.37,3124763.41,1.0,0.0
152,217320,13,TRANSFER,6188514.81,C135832352,6188514.81,0.0,C2009346140,0.00,0.00,1.0,0.0
153,217321,13,CASH_OUT,6188514.81,C686187434,6188514.81,0.0,C1562904239,381607.21,6424681.56,1.0,0.0


**All the fraudulent transactions are of type TRANSFER or CASH_OUT.**

In [18]:
fraud_data['type'].unique()

array(['TRANSFER', 'CASH_OUT'], dtype=object)

**One-hot encoding the type feature in preparation for the ML models**

In [19]:
new_data=pd.get_dummies(data,columns=['type'],prefix='Transaction_type')
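pd.get_dummies replaces the type column with one 0/1 indicator column per category. A library-free sketch of the same idea, for illustration only: the helper name `one_hot` is hypothetical, and the categories are sorted so the columns come out in the same alphabetical order as in the notebook output.

```python
def one_hot(values, prefix):
    """Map each value to a dict of 0/1 indicator columns, one per category."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{cat}": int(value == cat) for cat in categories}
        for value in values
    ]


types = ["PAYMENT", "TRANSFER", "CASH_OUT"]
encoded = one_hot(types, prefix="Transaction_type")
print(encoded[1])  # indicators for the TRANSFER row
```

Exactly one indicator is 1 in each row, so the five indicator columns always sum to 1.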

In [20]:
new_data

Unnamed: 0,step,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,Transaction_type_CASH_IN,Transaction_type_CASH_OUT,Transaction_type_DEBIT,Transaction_type_PAYMENT,Transaction_type_TRANSFER
0,1,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0.0,0.0,0,0,0,1,0
1,1,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0.0,0.0,0,0,0,1,0
2,1,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1.0,0.0,0,0,0,0,1
3,1,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1.0,0.0,0,1,0,0,0
4,1,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,0.0,0.0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
219759,14,104070.68,C1369874366,809.00,104879.68,C1765636753,635479.45,531408.77,0.0,0.0,1,0,0,0,0
219760,14,591.50,C1443113739,104879.68,104288.18,M936090293,0.00,0.00,0.0,0.0,0,0,0,1,0
219761,14,14839.24,C554257502,104288.18,89448.94,M1152875694,0.00,0.00,0.0,0.0,0,0,0,1,0
219762,14,144258.67,C865880896,298512.00,442770.67,C351716642,891449.26,747190.60,0.0,0.0,1,0,0,0,0


**Finding Q1, Q3 and the IQR for detecting outliers.**

In [21]:
amount_Q1=new_data['amount'].quantile(0.25)
amount_Q3=new_data['amount'].quantile(0.75)
amount_IQR=amount_Q3-amount_Q1
amount_lower_bound=amount_Q1-1.5*amount_IQR
amount_upper_bound=amount_Q3+1.5*amount_IQR
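To make the 1.5 × IQR rule concrete, here is a small worked example on toy values (not from the dataset), using only the standard library. `method="inclusive"` interpolates linearly between data points, which should match the default behaviour of the pandas quantile call used above.

```python
from statistics import quantiles

amounts = [10.0, 20.0, 30.0, 40.0, 1000.0]  # toy values, not from the dataset

# Quartiles with linear interpolation between data points
q1, _, q3 = quantiles(amounts, n=4, method="inclusive")
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values outside [lower_bound, upper_bound] are treated as outliers
outliers = [a for a in amounts if a < lower_bound or a > upper_bound]
print(q1, q3, iqr, lower_bound, upper_bound, outliers)
```

Here Q1 = 20 and Q3 = 40, so the IQR is 20 and the bounds are -10 and 70; only 1000 falls outside them.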

**Rows whose transaction amount is an outlier.**

In [22]:
new_data[(new_data['amount']<amount_lower_bound )| (new_data['amount']>amount_upper_bound)]

Unnamed: 0,step,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,Transaction_type_CASH_IN,Transaction_type_CASH_OUT,Transaction_type_DEBIT,Transaction_type_PAYMENT,Transaction_type_TRANSFER
85,1,1505626.01,C926859124,0.0,0.0,C665576141,29031.00,5515763.34,0.0,0.0,0,0,0,0,1
88,1,761507.39,C412788346,0.0,0.0,C1590550415,1280036.23,19169204.93,0.0,0.0,0,0,0,0,1
89,1,1429051.47,C1520267010,0.0,0.0,C1590550415,2041543.62,19169204.93,0.0,0.0,0,0,0,0,1
93,1,583848.46,C1839168128,0.0,0.0,C1286084959,667778.00,2107778.11,0.0,0.0,0,0,0,0,1
94,1,1724887.05,C1495608502,0.0,0.0,C1590550415,3470595.10,19169204.93,0.0,0.0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
219574,13,749453.80,C203723601,30336.0,0.0,C711324303,429073.55,1178527.35,0.0,0.0,0,0,0,0,1
219583,13,623656.06,C405511670,959.0,0.0,C191751046,1407554.39,2031210.46,0.0,0.0,0,0,0,0,1
219640,14,753328.07,C161053663,31226.0,0.0,C425902101,173839.65,927167.72,0.0,0.0,0,0,0,0,1
219643,14,700139.63,C1666919110,22215.0,0.0,C582111174,2087167.94,2787307.57,0.0,0.0,0,1,0,0,0


**Data without the outliers in amount of transaction**

In [23]:
new_data[(new_data['amount']>amount_lower_bound) & (new_data['amount']<amount_upper_bound)]

Unnamed: 0,step,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,Transaction_type_CASH_IN,Transaction_type_CASH_OUT,Transaction_type_DEBIT,Transaction_type_PAYMENT,Transaction_type_TRANSFER
0,1,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0.0,0.0,0,0,0,1,0
1,1,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0.0,0.0,0,0,0,1,0
2,1,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1.0,0.0,0,0,0,0,1
3,1,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1.0,0.0,0,1,0,0,0
4,1,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,0.0,0.0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
219759,14,104070.68,C1369874366,809.00,104879.68,C1765636753,635479.45,531408.77,0.0,0.0,1,0,0,0,0
219760,14,591.50,C1443113739,104879.68,104288.18,M936090293,0.00,0.00,0.0,0.0,0,0,0,1,0
219761,14,14839.24,C554257502,104288.18,89448.94,M1152875694,0.00,0.00,0.0,0.0,0,0,0,1,0
219762,14,144258.67,C865880896,298512.00,442770.67,C351716642,891449.26,747190.60,0.0,0.0,1,0,0,0,0


Since this is a fraud detection problem, I chose to keep the outliers in the amount column.


**Creating a new feature: 1 if the transaction amount is 2,000,000 or more, else 0. These are treated as illegal transactions.**

In [24]:
new_data['High_Transcation']=new_data['amount'].apply(lambda x:0 if x<2000000 else 1)
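The lambda above marks amounts of 2,000,000 or more, while the dataset description quotes 200,000 as the flagging threshold, which is ten times lower. A small sketch with the threshold as an explicit parameter makes the difference easy to see; the function name `high_transaction` is hypothetical and the sample amounts are only illustrative.

```python
def high_transaction(amount: float, threshold: float = 2_000_000) -> int:
    """Return 1 when amount is at or above threshold, else 0 (same rule as the lambda)."""
    return 0 if amount < threshold else 1


amounts = [181.00, 348_453.47, 2_000_000.00, 2_421_578.09]

# Threshold used in the notebook
as_in_notebook = [high_transaction(a) for a in amounts]

# Threshold quoted in the isFlaggedFraud description
with_description_threshold = [high_transaction(a, threshold=200_000) for a in amounts]

print(as_in_notebook)
print(with_description_threshold)
```

The second amount is treated differently by the two thresholds, so it is worth deciding deliberately which one the feature should encode.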

In [25]:
new_data[(new_data['amount']>=2000000)]

Unnamed: 0,step,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,Transaction_type_CASH_IN,Transaction_type_CASH_OUT,Transaction_type_DEBIT,Transaction_type_PAYMENT,Transaction_type_TRANSFER,High_Transcation
359,1,2421578.09,C106297322,0.0,0.0,C1590550415,8515645.77,19169204.93,0.0,0.0,0,0,0,0,1,1
375,1,2545478.01,C1057507014,0.0,0.0,C1590550415,12394437.40,19169204.93,0.0,0.0,0,0,0,0,1,1
376,1,2061082.82,C2007599722,0.0,0.0,C1590550415,14939915.42,19169204.93,0.0,0.0,0,0,0,0,1,1
1153,1,3776389.09,C197491520,0.0,0.0,C1883840933,10138670.86,16874643.09,0.0,0.0,0,0,0,0,1,1
1202,1,2258388.15,C12139181,0.0,0.0,C1789550256,2784129.27,4619798.56,0.0,0.0,0,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
218736,13,2288940.53,C365018168,0.0,0.0,C326180938,2968089.50,5257030.03,0.0,0.0,0,0,0,0,1,1
218751,13,2013866.62,C1147705407,0.0,0.0,C2066536447,3704474.23,5718340.85,0.0,0.0,0,0,0,0,1,1
218816,13,3435868.32,C1250031179,40821.0,0.0,C1360101896,271507.60,3707375.91,0.0,0.0,0,0,0,0,1,1
218950,13,2076718.66,C1205530585,398289.0,0.0,C940951814,373018.04,2449736.70,0.0,0.0,0,0,0,0,1,1


**Creating a function that runs various Machine Learning models, which will help to work with different X and y values with ease.**

In [26]:
def ML_Predictions(X,y):
  # Balance the classes by randomly duplicating minority-class (fraud) rows.
  # Caveat: this happens before the train/test split, so copies of the same
  # fraud row can land in both sets and make the test scores look optimistic.
  oversampler=RandomOverSampler(random_state=42)
  X_resampled,y_resampled=oversampler.fit_resample(X,y)

  # Hold out 40% of the resampled rows as the test set
  X_train,X_test,y_train,y_test=train_test_split(X_resampled,y_resampled,test_size=0.4,random_state=42)

  # Fit the scaler on the training set only, then apply it to both sets
  scaler=StandardScaler()
  X_train_scaled=scaler.fit_transform(X_train)
  X_test_scaled=scaler.transform(X_test)

  models={
    "Logisitic Regression" :LogisticRegression(max_iter=20000),
    "Decision Tree" :DecisionTreeClassifier(),
    "Random Forest":RandomForestClassifier(),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=3)
  }
  model_outputs={}  # test-set metrics per model (collected, but not returned)

  for name,model in models.items():
    model.fit(X_train_scaled,y_train.values.ravel()) # Training each of the models

    # Make predictions
    y_train_pred = model.predict(X_train_scaled)
    y_test_pred = model.predict(X_test_scaled)

    # Performance on the test set (weighted average over both classes)
    model_test_accuracy = accuracy_score(y_test, y_test_pred)
    model_test_f1 = f1_score(y_test, y_test_pred, average='weighted', zero_division=1)
    model_test_precision = precision_score(y_test, y_test_pred, average='weighted', zero_division=1)
    model_test_recall = recall_score(y_test, y_test_pred, average='weighted', zero_division=1)

    # Performance on the training set (weighted average over both classes)
    model_train_accuracy = accuracy_score(y_train, y_train_pred)
    model_train_f1 = f1_score(y_train, y_train_pred, average='weighted', zero_division=1)
    model_train_precision = precision_score(y_train, y_train_pred, average='weighted', zero_division=1)
    model_train_recall = recall_score(y_train, y_train_pred, average='weighted', zero_division=1)

    print(name)
    model_outputs[name]=[model_test_accuracy,model_test_f1,model_test_precision,model_test_recall]

    print('Model performance for Training set')
    print("- Accuracy: {:.2f}%".format(model_train_accuracy*100))
    print('- F1 score: {:2f}%'.format(model_train_f1*100))
    print('- Precision: {:2f}%'.format(model_train_precision*100))
    print('- Recall: {:2f}%'.format(model_train_recall*100))

    print('----------------------------------')

    print('Model performance for Test set')
    print('- Accuracy: {:.2f}%'.format(model_test_accuracy*100) )
    print('- Fl score: {:.2f}%'.format(model_test_f1*100))
    print('- Precision: {:.2f}%'.format(model_test_precision*100))
    print('- Recall: {:.2f}%'.format(model_test_recall*100))

    print('='*30)
    print('\n')
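One caveat about the function above: it oversamples before splitting, so duplicates of the same fraud row can end up in both the training and the test set, which tends to inflate the test scores. A common alternative is to split first and oversample only the training part. The sketch below illustrates the difference with row IDs in plain Python; `oversample`, `split` and `shared_fraud_ids` are hypothetical stand-ins, not the imbalanced-learn or scikit-learn implementations, and the data is a toy set.

```python
import random


def oversample(rows, rng):
    """Duplicate random fraud rows until both classes have the same count."""
    majority = [r for r in rows if r[1] == 0]
    minority = [r for r in rows if r[1] == 1]
    if not minority:
        return list(rows)  # nothing to duplicate
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(rows) + extra


def split(rows, rng, test_size=0.4):
    """Shuffle a copy of rows and cut it into (train, test)."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * test_size)
    return shuffled[:cut], shuffled[cut:]


def shared_fraud_ids(train, test):
    """IDs of fraud rows that appear in both train and test."""
    train_ids = {row_id for row_id, label in train if label == 1}
    test_ids = {row_id for row_id, label in test if label == 1}
    return train_ids & test_ids


# Toy data: (row_id, isFraud) with 95 legitimate rows and 5 fraud rows
rows = [(i, 0) for i in range(95)] + [(i, 1) for i in range(95, 100)]

# Order used in ML_Predictions: oversample, then split
rng = random.Random(42)
train_a, test_a = split(oversample(rows, rng), rng)
leaked = shared_fraud_ids(train_a, test_a)

# Alternative order: split, then oversample the training part only
rng = random.Random(42)
train_b, test_b = split(rows, rng)
train_b = oversample(train_b, rng)
not_leaked = shared_fraud_ids(train_b, test_b)

print(len(leaked), len(not_leaked))
```

With the second order the overlap is empty by construction, because only training rows are ever duplicated. With the first order the overlap is typically non-empty, so a model can score well on the test set simply by memorising fraud rows it has already seen.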


**First running the model with outliers**

In [27]:
X_with_outliers=new_data.drop(columns=['nameOrig','nameDest','isFraud','isFlaggedFraud'],axis=1)

In [28]:
X_with_outliers

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,Transaction_type_CASH_IN,Transaction_type_CASH_OUT,Transaction_type_DEBIT,Transaction_type_PAYMENT,Transaction_type_TRANSFER,High_Transcation
0,1,9839.64,170136.00,160296.36,0.00,0.00,0,0,0,1,0,0
1,1,1864.28,21249.00,19384.72,0.00,0.00,0,0,0,1,0,0
2,1,181.00,181.00,0.00,0.00,0.00,0,0,0,0,1,0
3,1,181.00,181.00,0.00,21182.00,0.00,0,1,0,0,0,0
4,1,11668.14,41554.00,29885.86,0.00,0.00,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
219759,14,104070.68,809.00,104879.68,635479.45,531408.77,1,0,0,0,0,0
219760,14,591.50,104879.68,104288.18,0.00,0.00,0,0,0,1,0,0
219761,14,14839.24,104288.18,89448.94,0.00,0.00,0,0,0,1,0,0
219762,14,144258.67,298512.00,442770.67,891449.26,747190.60,1,0,0,0,0,0


In [29]:
y_with_outliers=new_data['isFraud']

In [30]:
ML_Predictions(X_with_outliers,y_with_outliers)

Logisitic Regression
Model performance for Training set
- Accuracy: 90.20%
- F1 score: 90.201861%
- Precision: 90.202325%
- Recall: 90.201875%
----------------------------------
Model performance for Test set
- Accuracy: 90.20%
- Fl score: 90.20%
- Precision: 90.20%
- Recall: 90.20%


Decision Tree
Model performance for Training set
- Accuracy: 100.00%
- F1 score: 100.000000%
- Precision: 100.000000%
- Recall: 100.000000%
----------------------------------
Model performance for Test set
- Accuracy: 99.98%
- Fl score: 99.98%
- Precision: 99.98%
- Recall: 99.98%


Random Forest
Model performance for Training set
- Accuracy: 100.00%
- F1 score: 100.000000%
- Precision: 100.000000%
- Recall: 100.000000%
----------------------------------
Model performance for Test set
- Accuracy: 100.00%
- Fl score: 100.00%
- Precision: 100.00%
- Recall: 100.00%


K-Nearest Neighbors
Model performance for Training set
- Accuracy: 99.97%
- F1 score: 99.973438%
- Precision: 99.973452%
- Recall: 99.973438%
--

**Finding the records with no mismatch in their balances**

In [31]:
data_with_balancing_accounts=new_data[((new_data['newbalanceDest'] - new_data['oldbalanceDest']) == new_data['amount']) & ((new_data['oldbalanceOrg'] - new_data['amount']) == new_data['newbalanceOrig'])]
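The filter above compares floating-point values with ==, so a row whose balances reconcile on paper can still be excluded because of binary rounding. A tolerance-based comparison is a common alternative. A minimal sketch with math.isclose follows; the helper name `balances_reconcile` and the 0.01 tolerance are assumptions, the first sample row is copied from data.head(), and the last one is a made-up counterexample.

```python
import math


def balances_reconcile(old_balance, amount, new_balance, tol=0.01):
    """True when old_balance - amount matches new_balance within tol."""
    return math.isclose(old_balance - amount, new_balance, abs_tol=tol)


# Exact equality can fail even for simple decimal values
exact = (0.1 + 0.2) == 0.3

# First row of data.head(): oldbalanceOrg, amount, newbalanceOrig
ok = balances_reconcile(170136.0, 9839.64, 160296.36)

# Made-up row where the money clearly does not add up
bad = balances_reconcile(181.0, 181.0, 50.0)

print(exact, ok, bad)
```

The same idea applies on the destination side by comparing oldbalanceDest + amount with newbalanceDest.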


**Even among records whose balances reconcile, some are still tagged as fraud**

In [32]:
data_with_balancing_accounts[data_with_balancing_accounts['isFraud']==1]

Unnamed: 0,step,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,Transaction_type_CASH_IN,Transaction_type_CASH_OUT,Transaction_type_DEBIT,Transaction_type_PAYMENT,Transaction_type_TRANSFER,High_Transcation
1870,1,25071.46,C1275464847,25071.46,0.0,C1364913072,9083.76,34155.22,1.0,0.0,0,1,0,0,0,0
2302,1,235238.66,C1499825229,235238.66,0.0,C2100440237,0.0,235238.66,1.0,0.0,0,1,0,0,0,0
3060,2,1096187.24,C77163673,1096187.24,0.0,C644345897,0.0,1096187.24,1.0,0.0,0,1,0,0,0,0
4104,3,10539.37,C1984954272,10539.37,0.0,C124540047,0.0,10539.37,1.0,0.0,0,1,0,0,0,0
4261,3,22877.0,C2126545173,22877.0,0.0,C573200870,0.0,22877.0,1.0,0.0,0,1,0,0,0,0
4668,4,169941.73,C2026325575,169941.73,0.0,C1394526584,0.0,169941.73,1.0,0.0,0,1,0,0,0,0
4694,4,13707.11,C556223230,13707.11,0.0,C2094777811,0.0,13707.11,1.0,0.0,0,1,0,0,0,0
4776,4,86070.17,C1699873763,86070.17,0.0,C560041895,0.0,86070.17,1.0,0.0,0,1,0,0,0,0
4858,5,120074.73,C1174000532,120074.73,0.0,C410033330,0.0,120074.73,1.0,0.0,0,1,0,0,0,0
5467,5,10119.47,C213063852,10119.47,0.0,C922511709,0.0,10119.47,1.0,0.0,0,1,0,0,0,0


**Data having no outliers inside amount column.**

In [33]:
data_without_outliers_in_amount=new_data[(new_data['amount']>amount_lower_bound) & (new_data['amount']<amount_upper_bound)]

**Getting the independent variables**

In [34]:
X_without_outliers=data_without_outliers_in_amount.drop(columns=['nameOrig','nameDest','isFraud','isFlaggedFraud'],axis=1)

**Getting the dependent variable**

In [35]:
y_without_outliers=data_without_outliers_in_amount['isFraud']

**Running the model on data having no outliers in the amount column**

In [36]:
ML_Predictions(X_without_outliers,y_without_outliers)

Logisitic Regression
Model performance for Training set
- Accuracy: 91.62%
- F1 score: 91.621831%
- Precision: 91.671289%
- Recall: 91.624339%
----------------------------------
Model performance for Test set
- Accuracy: 91.63%
- Fl score: 91.63%
- Precision: 91.67%
- Recall: 91.63%


Decision Tree
Model performance for Training set
- Accuracy: 100.00%
- F1 score: 100.000000%
- Precision: 100.000000%
- Recall: 100.000000%
----------------------------------
Model performance for Test set
- Accuracy: 99.98%
- Fl score: 99.98%
- Precision: 99.98%
- Recall: 99.98%


Random Forest
Model performance for Training set
- Accuracy: 100.00%
- F1 score: 100.000000%
- Precision: 100.000000%
- Recall: 100.000000%
----------------------------------
Model performance for Test set
- Accuracy: 100.00%
- Fl score: 100.00%
- Precision: 100.00%
- Recall: 100.00%


K-Nearest Neighbors
Model performance for Training set
- Accuracy: 99.96%
- F1 score: 99.964122%
- Precision: 99.964148%
- Recall: 99.964122%
--

Data cleaning: one record had missing values in its balance and label columns. Imputing them could affect the output and mislead the model, so I dropped that record.
I chose to keep the outliers in the amount column because this is a fraud detection problem.
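The correlation matrix shows that oldbalanceOrg and newbalanceOrig (0.999) and oldbalanceDest and newbalanceDest (0.955) are strongly correlated, so these balance pairs are collinear. I kept all of them as features.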
I ran the ML models on the data both with and without the amount outliers for comparison.


In any model, feature selection and scaling are important. I chose only the features that are relevant for predicting the target variable, and I also created a new column called "High_Transcation", which indicates whether the transaction amount is 2,000,000 or more.
After removing the NaN values, I one-hot encoded the type column to convert it from object type to numeric.
I used oversampling to make the data more balanced, since there are far more non-fraudulent transactions than fraudulent ones.
Then I split the dataset into a training set and a test set.
Then I scaled the data using StandardScaler.
I used several classification models for prediction.
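Because the data is oversampled before evaluation and the reported scores are weighted averages over both classes, it can also help to look at precision and recall for the fraud class alone. These follow directly from confusion-matrix counts. A minimal sketch; the function name `fraud_metrics` is hypothetical and the counts are made up for illustration, not results from this notebook.

```python
def fraud_metrics(tp: int, fp: int, fn: int):
    """Precision, recall and F1 for the positive (fraud) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 40 frauds caught, 10 false alarms, 40 frauds missed
precision, recall, f1 = fraud_metrics(tp=40, fp=10, fn=40)
print(precision, recall, f1)
```

In this made-up case 80% of the alarms are real frauds, but only half of the frauds are caught, a gap that an overall accuracy figure would hide.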


I selected the following independent variables: step, amount, the one-hot encoded type columns, oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest and High_Transcation. I omitted the object variables (nameOrig and nameDest) and isFlaggedFraud.

The key factors that predict fraudulent transactions are the amount, the transaction type, and the before and after balances of the origin and destination accounts.

These factors make sense for detecting fraud: we can check whether the amount actually arrives in the destination account, whether it is being transferred into multiple accounts, or whether it is not transferred into any account at all.
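The first of these checks, whether the amount actually arrives in the destination account, can be turned into a numeric feature: the difference between the expected and the recorded destination balance. A minimal sketch; the function name `dest_discrepancy` is hypothetical, and the two rows are copied from the data.head() and data.tail() outputs (index 2 and index 219763).

```python
def dest_discrepancy(old_balance_dest, amount, new_balance_dest):
    """Expected minus recorded destination balance; about 0 when the money arrived."""
    return old_balance_dest + amount - new_balance_dest


# index 2 (TRANSFER, isFraud = 1): the destination balance never changes
fraud_row = dest_discrepancy(0.0, 181.0, 0.0)

# index 219763 (CASH_OUT, isFraud = 0): the destination balance grows by the amount
normal_row = dest_discrepancy(373004.93, 348453.47, 721458.4)

print(fraud_row, abs(normal_row) < 0.01)
```

Two rows prove nothing on their own, and destination balances are also recorded as zero for merchants, so a feature like this would need to be checked against the full data before relying on it.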

The company could maintain a database of account holders' locations and IDs to see where fraud is happening, and could require two-step verification for large transfers, with limits on the number of attempts and on the amount.
We can test these measures by transferring money from a dummy account and checking how the model responds.

Thank You