## Columns of Dataset



**step** - a unit of simulated time that maps to real-world time. Here 1 step is 1 hour, and the full simulation runs for 744 steps (roughly one month).

**type** - transaction type: CASH_IN, CASH_OUT, DEBIT, PAYMENT, or TRANSFER.

**amount** - amount of the transaction in local currency.

**nameOrig** - customer who started the transaction

**oldbalanceOrg** - initial balance before the transaction

**newbalanceOrig** - new balance after the transaction

**nameDest** - customer who is the recipient of the transaction

**oldbalanceDest** - recipient's balance before the transaction. Note that there is no information for recipients whose names start with M (merchants).

**newbalanceDest** - recipient's balance after the transaction. Note that there is no information for recipients whose names start with M (merchants).

**isFraud** - marks transactions made by the fraudulent agents inside the simulation. In this dataset, the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty them by transferring the funds to another account and then cashing out of the system.

**isFlaggedFraud** - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
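
As a rough illustration of the flagging rule described for `isFlaggedFraud`, the sketch below recomputes a flag from `type` and `amount`. The function name, the `flag_recomputed` column, and the sample rows are hypothetical, and the published column may apply further conditions, so this is only an approximation of the stated rule.

```python
import pandas as pd

# Threshold taken from the column description; treated here as an assumption.
FLAG_THRESHOLD = 200_000

def flag_large_transfers(df: pd.DataFrame, threshold: float = FLAG_THRESHOLD) -> pd.Series:
    """Return 1 for TRANSFER rows whose amount exceeds the threshold, else 0."""
    is_transfer = df["type"] == "TRANSFER"
    is_large = df["amount"] > threshold
    return (is_transfer & is_large).astype(int)

# Tiny made-up sample, not rows from the real dataset.
sample = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [250_000.0, 181.0, 300_000.0, 9_839.64],
})
sample["flag_recomputed"] = flag_large_transfers(sample)
print(sample)
```

Comparing such a recomputed flag with the stored `isFlaggedFraud` column is one way to check how closely the data follows the documented rule.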


In [1]:
#importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report,f1_score,precision_score,recall_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression

In [3]:
#Getting the data from a csv file
data=pd.read_csv(r"Fraud.csv")

In [4]:
#first five records
data.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0.0,0.0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0.0,0.0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1.0,0.0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1.0,0.0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0.0,0.0


In [5]:
#Last five records
data.tail()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
42266,9,CASH_OUT,195364.06,C1096102092,0.0,0.0,C2021579766,506957.59,1343781.67,0.0,0.0
42267,9,CASH_OUT,546075.62,C1791035294,0.0,0.0,C1039162432,5075471.31,5621546.93,0.0,0.0
42268,9,CASH_OUT,111003.87,C1145755913,0.0,0.0,C743528393,2533159.94,2644163.81,0.0,0.0
42269,9,CASH_OUT,101025.44,C292739335,0.0,0.0,C299715257,156646.32,491301.04,0.0,0.0
42270,9,CASH_OUT,271441.28,C2034845877,0.0,0.0,C71127,,,,


In [6]:
data.columns

Index(['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
       'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud',
       'isFlaggedFraud'],
      dtype='object')

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42271 entries, 0 to 42270
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   step            42271 non-null  int64  
 1   type            42271 non-null  object 
 2   amount          42271 non-null  float64
 3   nameOrig        42271 non-null  object 
 4   oldbalanceOrg   42271 non-null  float64
 5   newbalanceOrig  42271 non-null  float64
 6   nameDest        42271 non-null  object 
 7   oldbalanceDest  42270 non-null  float64
 8   newbalanceDest  42270 non-null  float64
 9   isFraud         42270 non-null  float64
 10  isFlaggedFraud  42270 non-null  float64
dtypes: float64(7), int64(1), object(3)
memory usage: 3.5+ MB


In [8]:
data

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.00,0.00,0.0,0.0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.00,0.00,0.0,0.0
2,1,TRANSFER,181.00,C1305486145,181.0,0.00,C553264065,0.00,0.00,1.0,0.0
3,1,CASH_OUT,181.00,C840083671,181.0,0.00,C38997010,21182.00,0.00,1.0,0.0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.00,0.00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
42266,9,CASH_OUT,195364.06,C1096102092,0.0,0.00,C2021579766,506957.59,1343781.67,0.0,0.0
42267,9,CASH_OUT,546075.62,C1791035294,0.0,0.00,C1039162432,5075471.31,5621546.93,0.0,0.0
42268,9,CASH_OUT,111003.87,C1145755913,0.0,0.00,C743528393,2533159.94,2644163.81,0.0,0.0
42269,9,CASH_OUT,101025.44,C292739335,0.0,0.00,C299715257,156646.32,491301.04,0.0,0.0


In [9]:
#Getting only the numeric values
data_numeric=data.select_dtypes(include=['int64',"float64"])

In [10]:
#Finding the correlation between the different features
data_numeric.corr()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
step,1.0,0.079207,-0.037869,-0.038267,-0.00198,0.014045,-0.050289,
amount,0.079207,1.0,0.015872,0.004886,0.289998,0.404039,0.058899,
oldbalanceOrg,-0.037869,0.015872,1.0,0.998351,0.127877,0.094345,-0.004536,
newbalanceOrig,-0.038267,0.004886,0.998351,1.0,0.130743,0.094209,-0.015376,
oldbalanceDest,-0.00198,0.289998,0.127877,0.130743,1.0,0.929393,-0.012463,
newbalanceDest,0.014045,0.404039,0.094345,0.094209,0.929393,1.0,-0.008193,
isFraud,-0.050289,0.058899,-0.004536,-0.015376,-0.012463,-0.008193,1.0,
isFlaggedFraud,,,,,,,,
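
The empty `isFlaggedFraud` row and column in the correlation output are what pandas produces for a column with zero variance. That is most likely the case here, since every displayed row has `isFlaggedFraud` equal to 0. The sketch below reproduces the effect on a made-up frame (the `toy` data is an assumption, not taken from `Fraud.csv`) and shows one way to list constant columns before computing correlations.

```python
import pandas as pd

# Made-up data: the last column is constant, like isFlaggedFraud appears to be.
toy = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 40.0],
    "isFraud": [0.0, 0.0, 1.0, 1.0],
    "isFlaggedFraud": [0.0, 0.0, 0.0, 0.0],
})

# Pearson correlation is undefined when one column has zero variance,
# so pandas reports NaN for every pair involving that column.
corr = toy.corr()
print(corr)

# Columns with a single distinct value carry no information for a model.
constant_cols = [c for c in toy.columns if toy[c].nunique() <= 1]
print(constant_cols)
```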



In [12]:
#Checking whether there is any relation between transaction type and fraud
data[['type','isFraud']]

Unnamed: 0,type,isFraud
0,PAYMENT,0.0
1,PAYMENT,0.0
2,TRANSFER,1.0
3,CASH_OUT,1.0
4,PAYMENT,0.0
...,...,...
42266,CASH_OUT,0.0
42267,CASH_OUT,0.0
42268,CASH_OUT,0.0
42269,CASH_OUT,0.0


In [13]:
data['type'].unique()

array(['PAYMENT', 'TRANSFER', 'CASH_OUT', 'DEBIT', 'CASH_IN'],
      dtype=object)

In [14]:
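#Note: 'CAS' is not one of the transaction types listed above, so this filter leaves the data unchanged
data=data[data['type']!='CAS']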

In [15]:
#Dropping the rows that contain NaN values
data=data.dropna()
data

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.00,0.00,0.0,0.0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.00,0.00,0.0,0.0
2,1,TRANSFER,181.00,C1305486145,181.0,0.00,C553264065,0.00,0.00,1.0,0.0
3,1,CASH_OUT,181.00,C840083671,181.0,0.00,C38997010,21182.00,0.00,1.0,0.0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.00,0.00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
42265,9,PAYMENT,1011.16,C802172585,317.0,0.00,M1111680498,0.00,0.00,0.0,0.0
42266,9,CASH_OUT,195364.06,C1096102092,0.0,0.00,C2021579766,506957.59,1343781.67,0.0,0.0
42267,9,CASH_OUT,546075.62,C1791035294,0.0,0.00,C1039162432,5075471.31,5621546.93,0.0,0.0
42268,9,CASH_OUT,111003.87,C1145755913,0.0,0.00,C743528393,2533159.94,2644163.81,0.0,0.0


In [16]:
fraud_data=data[data['isFraud']==1].reset_index()

In [17]:
#All the fraud transactions
fraud_data

Unnamed: 0,index,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,2,1,TRANSFER,181.00,C1305486145,181.00,0.0,C553264065,0.00,0.00,1.0,0.0
1,3,1,CASH_OUT,181.00,C840083671,181.00,0.0,C38997010,21182.00,0.00,1.0,0.0
2,251,1,TRANSFER,2806.00,C1420196421,2806.00,0.0,C972765878,0.00,0.00,1.0,0.0
3,252,1,CASH_OUT,2806.00,C2101527076,2806.00,0.0,C1007251739,26202.00,0.00,1.0,0.0
4,680,1,TRANSFER,20128.00,C137533655,20128.00,0.0,C1848415041,0.00,0.00,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
89,34169,8,CASH_OUT,43092.00,C1395071924,43092.00,0.0,C79685693,660641.74,1158662.90,1.0,0.0
90,36026,9,TRANSFER,556218.01,C1769581889,556218.01,0.0,C217441120,0.00,0.00,1.0,0.0
91,36027,9,CASH_OUT,556218.01,C589143360,556218.01,0.0,C1992550266,0.00,582265.81,1.0,0.0
92,36691,9,TRANSFER,11308.00,C299933576,11308.00,0.0,C1060999444,0.00,0.00,1.0,0.0


In [18]:
#All the fraud transactions occurred in "TRANSFER" and "CASH_OUT"
fraud_data['type'].unique()

array(['TRANSFER', 'CASH_OUT'], dtype=object)

In [19]:
#One-hot encoding the 'type' feature
new_data=pd.get_dummies(data,columns=['type'],prefix='Transaction_type')

In [20]:
new_data

Unnamed: 0,step,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,Transaction_type_CASH_IN,Transaction_type_CASH_OUT,Transaction_type_DEBIT,Transaction_type_PAYMENT,Transaction_type_TRANSFER
0,1,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.00,0.00,0.0,0.0,0,0,0,1,0
1,1,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.00,0.00,0.0,0.0,0,0,0,1,0
2,1,181.00,C1305486145,181.0,0.00,C553264065,0.00,0.00,1.0,0.0,0,0,0,0,1
3,1,181.00,C840083671,181.0,0.00,C38997010,21182.00,0.00,1.0,0.0,0,1,0,0,0
4,1,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.00,0.00,0.0,0.0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42265,9,1011.16,C802172585,317.0,0.00,M1111680498,0.00,0.00,0.0,0.0,0,0,0,1,0
42266,9,195364.06,C1096102092,0.0,0.00,C2021579766,506957.59,1343781.67,0.0,0.0,0,1,0,0,0
42267,9,546075.62,C1791035294,0.0,0.00,C1039162432,5075471.31,5621546.93,0.0,0.0,0,1,0,0,0
42268,9,111003.87,C1145755913,0.0,0.00,C743528393,2533159.94,2644163.81,0.0,0.0,0,1,0,0,0


In [21]:
#Finding the IQR of amount to find the outliers
amount_Q1=new_data['amount'].quantile(0.25)
amount_Q3=new_data['amount'].quantile(0.75)
amount_IQR=amount_Q3-amount_Q1
amount_lower_bound=amount_Q1-1.5*amount_IQR
amount_upper_bound=amount_Q3+1.5*amount_IQR

In [22]:
#Data having no outliers in amount
new_data[(new_data['amount']>amount_lower_bound) & (new_data['amount']<amount_upper_bound)]

Unnamed: 0,step,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,Transaction_type_CASH_IN,Transaction_type_CASH_OUT,Transaction_type_DEBIT,Transaction_type_PAYMENT,Transaction_type_TRANSFER
0,1,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.00,0.00,0.0,0.0,0,0,0,1,0
1,1,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.00,0.00,0.0,0.0,0,0,0,1,0
2,1,181.00,C1305486145,181.0,0.00,C553264065,0.00,0.00,1.0,0.0,0,0,0,0,1
3,1,181.00,C840083671,181.0,0.00,C38997010,21182.00,0.00,1.0,0.0,0,1,0,0,0
4,1,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.00,0.00,0.0,0.0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42264,9,4044.24,C1251553727,112420.0,108375.76,M1295827064,0.00,0.00,0.0,0.0,0,0,0,1,0
42265,9,1011.16,C802172585,317.0,0.00,M1111680498,0.00,0.00,0.0,0.0,0,0,0,1,0
42266,9,195364.06,C1096102092,0.0,0.00,C2021579766,506957.59,1343781.67,0.0,0.0,0,1,0,0,0
42268,9,111003.87,C1145755913,0.0,0.00,C743528393,2533159.94,2644163.81,0.0,0.0,0,1,0,0,0


In [23]:
#The outlier data
new_data[(new_data['amount']<amount_lower_bound )| (new_data['amount']>amount_upper_bound)]

Unnamed: 0,step,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,Transaction_type_CASH_IN,Transaction_type_CASH_OUT,Transaction_type_DEBIT,Transaction_type_PAYMENT,Transaction_type_TRANSFER
85,1,1505626.01,C926859124,0.0,0.0,C665576141,29031.00,5515763.34,0.0,0.0,0,0,0,0,1
86,1,554026.99,C1603696865,0.0,0.0,C766572210,579285.56,0.00,0.0,0.0,0,0,0,0,1
88,1,761507.39,C412788346,0.0,0.0,C1590550415,1280036.23,19169204.93,0.0,0.0,0,0,0,0,1
89,1,1429051.47,C1520267010,0.0,0.0,C1590550415,2041543.62,19169204.93,0.0,0.0,0,0,0,0,1
93,1,583848.46,C1839168128,0.0,0.0,C1286084959,667778.00,2107778.11,0.0,0.0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42226,9,579808.58,C748918551,0.0,0.0,C830615916,626151.36,1289982.94,0.0,0.0,0,0,0,0,1
42228,9,826432.54,C1140016813,0.0,0.0,C545710008,1455423.09,2273045.85,0.0,0.0,0,0,0,0,1
42229,9,555107.98,C1290396207,0.0,0.0,C388286025,599246.01,1154353.99,0.0,0.0,0,0,0,0,1
42256,9,1314996.48,C1967884195,0.0,0.0,C787033725,4789247.93,6341880.67,0.0,0.0,0,0,0,0,1


In [24]:
new_data['High_Transcation']=new_data['amount'].apply(lambda x:0 if x<2000000 else 1)

In [25]:
#Data having high transactions
new_data[new_data['High_Transcation']==1]

Unnamed: 0,step,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,Transaction_type_CASH_IN,Transaction_type_CASH_OUT,Transaction_type_DEBIT,Transaction_type_PAYMENT,Transaction_type_TRANSFER,High_Transcation
359,1,2421578.09,C106297322,0.0,0.0,C1590550415,8515645.77,19169204.93,0.0,0.0,0,0,0,0,1,1
375,1,2545478.01,C1057507014,0.0,0.0,C1590550415,12394437.40,19169204.93,0.0,0.0,0,0,0,0,1,1
376,1,2061082.82,C2007599722,0.0,0.0,C1590550415,14939915.42,19169204.93,0.0,0.0,0,0,0,0,1,1
1153,1,3776389.09,C197491520,0.0,0.0,C1883840933,10138670.86,16874643.09,0.0,0.0,0,0,0,0,1,1
1202,1,2258388.15,C12139181,0.0,0.0,C1789550256,2784129.27,4619798.56,0.0,0.0,0,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41826,9,2271850.59,C686584438,175632.0,0.0,C1234612952,200359.00,2778471.64,0.0,0.0,0,0,0,0,1,1
41827,9,2254744.84,C556554940,138.0,0.0,C221954683,0.00,2254744.84,0.0,0.0,0,0,0,0,1,1
41882,9,2251322.58,C973892861,22050.0,0.0,C783487,9913.00,2929191.69,0.0,0.0,0,0,0,0,1,1
42035,9,3367917.95,C710839089,455.0,0.0,C1254274246,0.00,3367917.95,0.0,0.0,0,0,0,0,1,1


In [26]:
new_data[(new_data['amount']>=2000000)]

Unnamed: 0,step,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,Transaction_type_CASH_IN,Transaction_type_CASH_OUT,Transaction_type_DEBIT,Transaction_type_PAYMENT,Transaction_type_TRANSFER,High_Transcation
359,1,2421578.09,C106297322,0.0,0.0,C1590550415,8515645.77,19169204.93,0.0,0.0,0,0,0,0,1,1
375,1,2545478.01,C1057507014,0.0,0.0,C1590550415,12394437.40,19169204.93,0.0,0.0,0,0,0,0,1,1
376,1,2061082.82,C2007599722,0.0,0.0,C1590550415,14939915.42,19169204.93,0.0,0.0,0,0,0,0,1,1
1153,1,3776389.09,C197491520,0.0,0.0,C1883840933,10138670.86,16874643.09,0.0,0.0,0,0,0,0,1,1
1202,1,2258388.15,C12139181,0.0,0.0,C1789550256,2784129.27,4619798.56,0.0,0.0,0,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41826,9,2271850.59,C686584438,175632.0,0.0,C1234612952,200359.00,2778471.64,0.0,0.0,0,0,0,0,1,1
41827,9,2254744.84,C556554940,138.0,0.0,C221954683,0.00,2254744.84,0.0,0.0,0,0,0,0,1,1
41882,9,2251322.58,C973892861,22050.0,0.0,C783487,9913.00,2929191.69,0.0,0.0,0,0,0,0,1,1
42035,9,3367917.95,C710839089,455.0,0.0,C1254274246,0.00,3367917.95,0.0,0.0,0,0,0,0,1,1


In [27]:
X=new_data.drop(columns=['nameOrig','nameDest','isFraud','isFlaggedFraud','High_Transcation'],axis=1)

In [28]:
X

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,Transaction_type_CASH_IN,Transaction_type_CASH_OUT,Transaction_type_DEBIT,Transaction_type_PAYMENT,Transaction_type_TRANSFER
0,1,9839.64,170136.0,160296.36,0.00,0.00,0,0,0,1,0
1,1,1864.28,21249.0,19384.72,0.00,0.00,0,0,0,1,0
2,1,181.00,181.0,0.00,0.00,0.00,0,0,0,0,1
3,1,181.00,181.0,0.00,21182.00,0.00,0,1,0,0,0
4,1,11668.14,41554.0,29885.86,0.00,0.00,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...
42265,9,1011.16,317.0,0.00,0.00,0.00,0,0,0,1,0
42266,9,195364.06,0.0,0.00,506957.59,1343781.67,0,1,0,0,0
42267,9,546075.62,0.0,0.00,5075471.31,5621546.93,0,1,0,0,0
42268,9,111003.87,0.0,0.00,2533159.94,2644163.81,0,1,0,0,0


In [29]:
y=new_data['isFraud']

In [30]:
#Oversampling the minority (fraud) class to get a better balance in the data
oversampler=RandomOverSampler(random_state=42)

X_resampled,y_resampled=oversampler.fit_resample(X,y)

X_train,X_test,y_train,y_test=train_test_split(X_resampled,y_resampled,test_size=0.2,random_state=42)

scalar=StandardScaler()

X_train_scaled=scalar.fit_transform(X_train)

X_test_scaled=scalar.transform(X_test)

models={
  "Logisitic Regression" :LogisticRegression(max_iter=20000),
  "Decision Tree" :DecisionTreeClassifier(),
  "Random Forest":RandomForestClassifier(),
  "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=3)
}
model_outputs={}

for name, model in models.items():
    model.fit(X_train_scaled,y_train.values.ravel()) # Training each of the Models

    # Make predictions
    y_train_pred = model.predict(X_train_scaled)
    y_test_pred =  model.predict(X_test_scaled)

    # Performance of Test set
    model_test_accuracy = accuracy_score(y_test, y_test_pred)
    model_test_f1 = f1_score(y_test, y_test_pred, average='weighted', zero_division=1)
    model_test_precision = precision_score(y_test, y_test_pred , average='weighted', zero_division=1)
    model_test_recall  = recall_score(y_test, y_test_pred,average='weighted', zero_division=1)

    # Performance of Training set
    model_train_accuracy = accuracy_score(y_train, y_train_pred)
    model_train_f1 = f1_score(y_train, y_train_pred, average= 'weighted', zero_division=1)
    model_train_precision = precision_score(y_train, y_train_pred,average='weighted', zero_division=1)
    model_train_recall = recall_score(y_train, y_train_pred,average='weighted', zero_division=1)

    print(name)
    model_outputs[name]=[model_test_accuracy,model_test_f1,model_test_precision,model_test_recall]

    print('Model performance for Training set')
    print("- Accuracy: {:.2f}%".format(model_train_accuracy*100))
    print('- F1 score: {:2f}%'.format(model_train_f1*100))
    print('- Precision: {:2f}%'.format(model_train_precision*100))
    print('- Recall: {:2f}%'.format(model_train_recall*100))

    print('----------------------------------')

    print('Model performance for Test set')
    print('- Accuracy: {:.2f}%'.format(model_test_accuracy*100) )
    print('- Fl score: {:.2f}%'.format(model_test_f1*100))
    print('- Precision: {:.2f}%'.format(model_test_precision*100))
    print('- Recall: {:.2f}%'.format(model_test_recall*100))


    print('='*30)
    print('\n')


Logisitic Regression
Model performance for Training set
- Accuracy: 93.33%
- F1 score: 93.322880%
- Precision: 93.435401%
- Recall: 93.327011%
----------------------------------
Model performance for Test set
- Accuracy: 93.15%
- Fl score: 93.15%
- Precision: 93.29%
- Recall: 93.15%


Decision Tree
Model performance for Training set
- Accuracy: 100.00%
- F1 score: 100.000000%
- Precision: 100.000000%
- Recall: 100.000000%
----------------------------------
Model performance for Test set
- Accuracy: 99.97%
- Fl score: 99.97%
- Precision: 99.97%
- Recall: 99.97%


Random Forest
Model performance for Training set
- Accuracy: 100.00%
- F1 score: 100.000000%
- Precision: 100.000000%
- Recall: 100.000000%
----------------------------------
Model performance for Test set
- Accuracy: 99.99%
- Fl score: 99.99%
- Precision: 99.99%
- Recall: 99.99%


K-Nearest Neighbors
Model performance for Training set
- Accuracy: 99.95%
- F1 score: 99.945170%
- Precision: 99.945230%
- Recall: 99.945170%
------
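
One caveat about the scores above: the data is oversampled before `train_test_split`, so copies of the same fraud row can land in both the training and the test set, which may make the test scores look better than they would on unseen data. A minimal sketch of the alternative ordering follows: split first, then oversample only the training part. It runs on synthetic data from `make_classification`, and it uses `sklearn.utils.resample` in place of `RandomOverSampler` to keep dependencies small; calling `RandomOverSampler.fit_resample` on the training split would play the same role. All names ending in `_demo`, `_tr`, `_te`, and `_bal` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the transaction features (only a few percent positives).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=8, weights=[0.98, 0.02], random_state=42
)

# 1) Split first, stratifying so the rare class appears in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# 2) Oversample the minority class in the training part only.
minority = np.flatnonzero(y_tr == 1)
majority = np.flatnonzero(y_tr == 0)
extra = resample(
    minority, replace=True,
    n_samples=len(majority) - len(minority), random_state=42,
)
train_idx = np.concatenate([majority, minority, extra])
X_tr_bal, y_tr_bal = X_tr[train_idx], y_tr[train_idx]

# 3) Fit on the balanced training data, evaluate on the untouched test data.
clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_tr_bal, y_tr_bal)
print(classification_report(y_te, clf.predict(X_te), zero_division=0))
```

Because the test split keeps its original class ratio, the per-class rows of the report reflect how the model would behave on imbalanced data.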

**How did you select variables to be included in the model?**



I selected variables for the model based on:

**Relevance:** Variables should relate to predicting fraudulent customers.

**Data quality:** They should have high-quality data, with few missing values or outliers.

**Multicollinearity:** Variables shouldn't be highly correlated with one another, because strongly correlated predictors make the model less stable.
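
The multicollinearity criterion can be checked mechanically. In the correlation table earlier, `oldbalanceOrg` and `newbalanceOrig` correlate at about 0.998, so one of them is a candidate for removal. The sketch below is one possible way to find such columns; the helper name `highly_correlated_columns`, the 0.95 threshold, and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def highly_correlated_columns(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """List columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Made-up data that mimics two nearly identical balance columns.
rng = np.random.default_rng(0)
old_balance = rng.uniform(0, 1e5, size=200)
toy = pd.DataFrame({
    "oldbalanceOrg": old_balance,
    "newbalanceOrig": old_balance * 0.98 + rng.normal(0, 100, size=200),
    "amount": rng.uniform(0, 1e4, size=200),
})

to_drop = highly_correlated_columns(toy)
print(to_drop)
```

Whether to drop a flagged column is still a judgment call; tree-based models tolerate correlated inputs better than linear ones.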

To evaluate performance, I used:

**Confusion matrix:** Shows correct and incorrect predictions.

**ROC curve:** Illustrates true positive rate versus false positive rate.

**Precision and recall:** Precision is the share of predicted frauds that are actually fraud; recall is the share of actual frauds that the model catches.






**Demonstrate the performance of the model using the best set of tools.**

Confusion matrix: It visualizes correct and incorrect predictions, aiding in identifying areas for improvement.

ROC curve: Shows the balance between true positive and false positive rates, useful for model comparison.

Precision and recall: These metrics assess prediction accuracy and completeness, often used together.

F1 score: The harmonic mean of precision and recall, particularly valuable for imbalanced classes.

Cross-validation: Evaluates model performance across different data subsets, which helps detect overfitting and gives a more reliable estimate of performance.
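
The tools listed above can be sketched together in a few lines of scikit-learn. This example runs on synthetic data rather than `Fraud.csv`, and the model choice, the `_demo` names, and the five-fold setup are assumptions. Unlike the weighted averages printed in the training cell, the precision, recall, and F1 values here describe the positive (fraud) class only, which tends to be more informative when classes are imbalanced.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

# Synthetic, imbalanced stand-in for the fraud data.
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=8, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)
y_score = model.predict_proba(X_te)[:, 1]  # probability of the positive class

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Metrics for the positive class only.
print("precision:", precision_score(y_te, y_pred, zero_division=0))
print("recall:   ", recall_score(y_te, y_pred, zero_division=0))
print("f1:       ", f1_score(y_te, y_pred, zero_division=0))

# Area under the ROC curve, computed from scores rather than hard labels.
auc = roc_auc_score(y_te, y_score)
print("roc auc:  ", auc)

# Stratified 5-fold cross-validation of the F1 score.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_f1 = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1")
print("cv f1:    ", cv_f1.mean(), "+/-", cv_f1.std())
```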

**What are the key factors that predict a fraudulent customer?**


Multiple accounts: Customers with multiple accounts may engage in fraudulent activity.

High-value transactions: Large transactions may indicate fraudulent behavior.

Unusual spending patterns: Anomalies like purchases across multiple countries raise red flags.

Stolen credit cards: Use of stolen cards is a common indicator of fraud.

Proxy server usage: Customers masking their IP address via proxy servers may be fraudulent.

Fake contact information: Providing false contact details suggests potential fraud.

Previous fraud history: Customers with prior fraudulent activity are likely to repeat it.
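
For factors that are present as columns in a dataset, one way to check how much a fitted model relies on each is to inspect the feature importances of a tree ensemble. The sketch below uses made-up data and a made-up labelling rule (large amounts are labelled as fraud), so it shows only the mechanics and is not a finding about `Fraud.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up features that reuse three column names from the dataset.
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "step": rng.integers(1, 10, size=n),
    "amount": rng.uniform(0, 1e5, size=n),
    "oldbalanceOrg": rng.uniform(0, 1e5, size=n),
})
# Illustrative rule only: label large amounts as "fraud".
toy_label = (toy["amount"] > 8e4).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(toy, toy_label)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=toy.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be biased toward features with many distinct values, so `sklearn.inspection.permutation_importance` is a common cross-check.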


**Do these factors make sense? If yes, How? If not, How not?**

These factors make sense because each one relates to the risk of fraud. High transaction amounts matter because a fraudster who has taken over an account has an incentive to move as much money as possible; in this dataset, the fraudulent agents try to empty the account. Unusual spending patterns suggest that the customer may not be who they say they are. New customers are riskier because they have not yet established a history with the company, so there is little normal behavior to compare against. Shipping addresses that differ from billing addresses are risky because they can be used to hide the customer's identity.

**What kind of prevention should be adopted while the company updates its infrastructure?**


Data Encryption: Ensure that sensitive data is encrypted both in transit and at rest to protect it from unauthorized access.

Access Control: Implement robust access control mechanisms, such as multi-factor authentication and role-based access control, to restrict access to authorized personnel only.

Educate employees: The company should educate its employees about the risks of infrastructure updates and how to protect themselves. This will help to prevent employees from falling victim to phishing attacks or other social engineering scams.

Monitor the infrastructure closely: After the update is complete, the company should monitor the infrastructure closely for any signs of problems. This will help to identify and resolve any issues quickly.

**Assuming these actions have been implemented, how would you determine if they work?**

Monitor key performance indicators (KPIs): Companies should monitor KPIs such as uptime, availability, and response time. If these KPIs remain stable or improve after the update, it is a good indication that the prevention measures have been effective.

Get feedback from employees: Companies should get feedback from employees about their experience with the updated infrastructure. If employees report that they are experiencing fewer problems or security incidents, it is a good indication that the prevention measures have been effective.

Once relevant data is collected, organizations can analyze it using statistical techniques such as trend analysis, correlation analysis, and comparative analysis to identify patterns, trends, and statistically significant changes.

Based on data analysis and user feedback, organizations can make necessary adjustments to improve the effectiveness of implemented actions. This may involve updating security policies and procedures, enhancing security controls, or providing additional training and awareness programs.
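
As one example of the comparative analysis mentioned above, a two-proportion z-test can indicate whether the fraud rate after the changes differs from the rate before them by more than chance would explain. The function name and the counts below are hypothetical, and the test assumes large, independent samples.

```python
import math

def two_proportion_z_test(events_a: int, total_a: int,
                          events_b: int, total_b: int):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a = events_a / total_a
    p_b = events_b / total_b
    # Pooled proportion under the null hypothesis that both rates are equal.
    pooled = (events_a + events_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: fraud cases per 100,000 transactions before and after.
z, p = two_proportion_z_test(120, 100_000, 80, 100_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the change in fraud rate is unlikely to be noise, though it does not by itself show that the prevention measures caused the change.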