in this notebook, we will build a machine learning model to adapt on a large-scale dataset of credit card transactions regarding customers, the model has a main role of detecting potential fraud transactions for early prevention and response. the model will be deployed as the core of the credit card fraud detection system. being an optimal automated approach in the credit card industry that has expanded tremendously in the past few years.

# Initial Imports

In [3]:
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [6]:
# csv file resides in a directory called data
fraud_df = pd.read_csv('Data/Fraud.csv')

In [3]:
fraud_df.rename(columns={'oldbalanceOrg':'oldbalanceOrig'})
fraud_df.head(5)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


# Exploratory Data Analysis

## Missing Values

are there any missing values in the dataset?

In [4]:
def missing_values_counts(df):
    return df.isnull().sum()

missing_values_counts(fraud_df)

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

The dataset has no missing values since the count of NULL values for every feautre is zero

## Checking Unique values 

we will check unique values for feature that have 200 or less unique values, to avoid checking for numerical variables and variables describing names(E.g. names of people), which would result in wasted computation time

In [5]:
def check_unique_values_limited(df, limit=100):
    cols = []
    for col in df.columns:
        if df[col].nunique() <= limit:
            print('Column: ', col)
            print(df[col].value_counts())
            print('---------------------')

check_unique_values_limited(fraud_df)

Column:  type
CASH_OUT    2237500
PAYMENT     2151495
CASH_IN     1399284
TRANSFER     532909
DEBIT         41432
Name: type, dtype: int64
---------------------
Column:  isFraud
0    6354407
1       8213
Name: isFraud, dtype: int64
---------------------
Column:  isFlaggedFraud
0    6362604
1         16
Name: isFlaggedFraud, dtype: int64
---------------------


## Fradulant Transaction Analysis

- 1 -> checking isFraud class by checking its distribution on certain features
- 2 -> analyzing whether there is relation between features and class

In [6]:
def class_dist_by_var(df, class_, var_):
    return df.groupby(var_)[class_].value_counts()

### (1) Checking by types of transactions

In [7]:
class_dist_by_var(fraud_df, 'isFraud', 'type')

type      isFraud
CASH_IN   0          1399284
CASH_OUT  0          2233384
          1             4116
DEBIT     0            41432
PAYMENT   0          2151495
TRANSFER  0           528812
          1             4097
Name: isFraud, dtype: int64

**Note**: 0 stands for 'not fraud transaction' and 1 for 'fraud transaction'

we can see that only in CASH_OUT and TRANSFER transactions we see fraud transactions, and even those types, we find that the distribution is quite imbalanced, with a much higher transaction count for 0 than for 1.

**why is that?** fraud transactions typically involve making a TRANSFER to a (fraudulent) account, which in turn conducts a CASH_OUT operation (transacting with a merchant who pays out cash). it is a two-step process, in which the fraudulent account would be both, the destination in a TRANSFER and the originator in a CASH_OUT. 

### (1) Checking by isFlaggedFraud

In [8]:
class_dist_by_var(fraud_df, 'isFraud', 'isFlaggedFraud')

isFlaggedFraud  isFraud
0               0          6354407
                1             8197
1               1               16
Name: isFraud, dtype: int64

Here we can see that isFlagged provided negative results for fraud prediction. even though there are 16 cases where flagged transaction were in fact fraud, there were 8197 cases where transactions that were not flagged(isFlaggedFraud = 0) did turn out to be fraud, thus seemingly showcasing the feature isFlaggedFraud as being meaningless or irrelavant to the prediction of fraud transactions. and we will validate this later with the use of feature selection methods later on.

### (2) analyzing name features (nameOrig, nameDest)

here we analyze if there is a relatively frequent occurence of origin or destination names with fraud transactions

In [9]:
fraud_df_isFraud = fraud_df.loc[fraud_df.isFraud == 1]
fraud_df_isFraud

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
2,1,TRANSFER,181.00,C1305486145,181.00,0.0,C553264065,0.00,0.00,1,0
3,1,CASH_OUT,181.00,C840083671,181.00,0.0,C38997010,21182.00,0.00,1,0
251,1,TRANSFER,2806.00,C1420196421,2806.00,0.0,C972765878,0.00,0.00,1,0
252,1,CASH_OUT,2806.00,C2101527076,2806.00,0.0,C1007251739,26202.00,0.00,1,0
680,1,TRANSFER,20128.00,C137533655,20128.00,0.0,C1848415041,0.00,0.00,1,0
...,...,...,...,...,...,...,...,...,...,...,...
6362615,743,CASH_OUT,339682.13,C786484425,339682.13,0.0,C776919290,0.00,339682.13,1,0
6362616,743,TRANSFER,6311409.28,C1529008245,6311409.28,0.0,C1881841831,0.00,0.00,1,0
6362617,743,CASH_OUT,6311409.28,C1162922333,6311409.28,0.0,C1365125890,68488.84,6379898.11,1,0
6362618,743,TRANSFER,850002.52,C1685995037,850002.52,0.0,C2080388513,0.00,0.00,1,0


In [10]:
print('fraudulent transaction #:', fraud_df_isFraud.isFraud.count())
print('nameOrig occurence:', fraud_df_isFraud.nameOrig.nunique())
print('nameDest occurence:', fraud_df_isFraud.nameDest.nunique())
print('The customer with the highest number of fraudulent transaction starts:', fraud_df_isFraud.nameOrig.value_counts().max())
print('The recepient related to highest number of fraudulent transactions:', fraud_df_isFraud.nameDest.value_counts().max())

fraudulent transaction #: 8213
nameOrig occurence: 8213
nameDest occurence: 8169
The customer with the highest number of fraudulent transaction starts: 1
The recepient related to highest number of fraudulent transactions: 2


in general, we don't see a frequent occurence within origin or destination accounts, with the highest frequency being 2 occurences of destination accounts. which means either features on its own don't show any relation with the main class label. however, this does not mean that there isn't a general relation between both of the features and the class label.

for accounts in general, we want to check if they are destinations for TRANSFERS and origins for CASHOUTS (i.e. checking if an account is within the general fraud transaction process)

In [11]:
fraud_df_isFraud_transfer = fraud_df_isFraud.loc[(fraud_df.type == 'TRANSFER')]
fraud_df_isFraud_cashout = fraud_df_isFraud.loc[(fraud_df.type == 'CASH_OUT')]

print('accounts conducting the fraud process:', fraud_df_isFraud_transfer.nameDest.isin(fraud_df_isFraud_cashout.nameOrig).any())

# fraud_df_repeated_nameDest = fraud_df_isFraud.loc[fraud_df_isFraud.duplicated(subset=['nameDest'], keep=False)]
# destination_TRANSFER = fraud_df_repeated_nameDest.loc[fraud_df_repeated_nameDest.type == ]
# fraud_df_repeated_nameDest.nameDest.value_counts()
# df[df.duplicated(subset=['Song ID'],keep=False)]

accounts conducting the fraud process: False


nameOrig and nameDest, by not providing enough evidence both on their own or in combination, don't make for relevant features for the prediction of fraudulant transactions.

# Feature Engineering (upon analysis)

In [12]:
# flatten the subsetted dataframe of floats into an array of floats
relevant_cols = fraud_df[["amount","oldbalanceOrg","newbalanceOrig","oldbalanceDest","newbalanceDest"]].values.flatten()

print("negative numbers in any of the features: ", sum(n < 0 for n in relevant_cols))
print("# of transactions where: amount given > origin balance -->", sum(fraud_df["amount"] > fraud_df["oldbalanceOrg"]))
print("# of transactions where: amount received > receiver balance -->", sum(fraud_df["amount"] > fraud_df["newbalanceDest"]))

negative numbers in any of the features:  0
# of transactions where: amount given > origin balance --> 4079080
# of transactions where: amount received > receiver balance --> 2661141


From the above results, we can conclude that :
- There is erronous results in the new and old balance accounts for both sender and receiver
- Some of this erronous results are due to fraudulent transactions
- We cannot get rid of this features as well so we will let them be and add a new feature called 'errorbalanc

In [13]:
fraud_df["errorbalanceOrg"] = fraud_df.newbalanceOrig + fraud_df.amount - fraud_df.oldbalanceOrg
fraud_df["errorbalanceDest"] = fraud_df.oldbalanceDest + fraud_df.amount - fraud_df.newbalanceDest

# Data Cleaning

## Dropping columns based on analysis results

In [14]:
fraud_df = fraud_df.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis = 1)

In [15]:
fraud_df

Unnamed: 0,step,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,errorbalanceOrg,errorbalanceDest
0,1,PAYMENT,9839.64,170136.00,160296.36,0.00,0.00,0,0.0,9.839640e+03
1,1,PAYMENT,1864.28,21249.00,19384.72,0.00,0.00,0,0.0,1.864280e+03
2,1,TRANSFER,181.00,181.00,0.00,0.00,0.00,1,0.0,1.810000e+02
3,1,CASH_OUT,181.00,181.00,0.00,21182.00,0.00,1,0.0,2.136300e+04
4,1,PAYMENT,11668.14,41554.00,29885.86,0.00,0.00,0,0.0,1.166814e+04
...,...,...,...,...,...,...,...,...,...,...
6362615,743,CASH_OUT,339682.13,339682.13,0.00,0.00,339682.13,1,0.0,0.000000e+00
6362616,743,TRANSFER,6311409.28,6311409.28,0.00,0.00,0.00,1,0.0,6.311409e+06
6362617,743,CASH_OUT,6311409.28,6311409.28,0.00,68488.84,6379898.11,1,0.0,1.000000e-02
6362618,743,TRANSFER,850002.52,850002.52,0.00,0.00,0.00,1,0.0,8.500025e+05


## dropping rows not relevant to the task

fraudulent transations occur only in TRANSFER and CASHOUT types. so to keep data relevant to fraud prediction, we keep rows of these two operations only.

In [16]:
fraud_df = fraud_df.loc[(fraud_df.type == 'TRANSFER') | (fraud_df.type == 'CASH_OUT')]
fraud_df

Unnamed: 0,step,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,errorbalanceOrg,errorbalanceDest
2,1,TRANSFER,181.00,181.00,0.0,0.00,0.00,1,0.00,1.810000e+02
3,1,CASH_OUT,181.00,181.00,0.0,21182.00,0.00,1,0.00,2.136300e+04
15,1,CASH_OUT,229133.94,15325.00,0.0,5083.00,51513.44,0,213808.94,1.827035e+05
19,1,TRANSFER,215310.30,705.00,0.0,22425.00,0.00,0,214605.30,2.377353e+05
24,1,TRANSFER,311685.89,10835.00,0.0,6267.00,2719172.89,0,300850.89,-2.401220e+06
...,...,...,...,...,...,...,...,...,...,...
6362615,743,CASH_OUT,339682.13,339682.13,0.0,0.00,339682.13,1,0.00,0.000000e+00
6362616,743,TRANSFER,6311409.28,6311409.28,0.0,0.00,0.00,1,0.00,6.311409e+06
6362617,743,CASH_OUT,6311409.28,6311409.28,0.0,68488.84,6379898.11,1,0.00,1.000000e-02
6362618,743,TRANSFER,850002.52,850002.52,0.0,0.00,0.00,1,0.00,8.500025e+05


## Feature Selection

for feature selection, we will Compute the ANOVA F-value for numerical input features. In addition to that, For categorical variables where no ordinal relationship exists (type feature), we will encode using one-hot method 

In [17]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import OrdinalEncoder

In [18]:
oe = OrdinalEncoder()
type_data = fraud_df['type'].values.reshape(-1,1)
oe.fit(type_data)
type_enc = oe.transform(type_data)
fraud_df['type'] = type_enc
fraud_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  fraud_df['type'] = type_enc


Unnamed: 0,step,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,errorbalanceOrg,errorbalanceDest
2,1,1.0,181.00,181.00,0.0,0.00,0.00,1,0.00,1.810000e+02
3,1,0.0,181.00,181.00,0.0,21182.00,0.00,1,0.00,2.136300e+04
15,1,0.0,229133.94,15325.00,0.0,5083.00,51513.44,0,213808.94,1.827035e+05
19,1,1.0,215310.30,705.00,0.0,22425.00,0.00,0,214605.30,2.377353e+05
24,1,1.0,311685.89,10835.00,0.0,6267.00,2719172.89,0,300850.89,-2.401220e+06
...,...,...,...,...,...,...,...,...,...,...
6362615,743,0.0,339682.13,339682.13,0.0,0.00,339682.13,1,0.00,0.000000e+00
6362616,743,1.0,6311409.28,6311409.28,0.0,0.00,0.00,1,0.00,6.311409e+06
6362617,743,0.0,6311409.28,6311409.28,0.0,68488.84,6379898.11,1,0.00,1.000000e-02
6362618,743,1.0,850002.52,850002.52,0.0,0.00,0.00,1,0.00,8.500025e+05


In [19]:
# Output: categoircal
# Input: numerical(type, step, amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest)
# type feature will be encoded accordingly
y = fraud_df['isFraud']
X = fraud_df[['type', 'step', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest', 'errorbalanceOrg', 'errorbalanceDest']]

In [20]:
fs = SelectKBest(score_func=f_classif, k='all')
fs.fit(X, y)
# what are scores for the features
for i in range(len(fs.scores_)):
    print('Feature %s: %f' % (X.columns[i], fs.scores_[i]))

Feature type: 4989.588344
Feature step: 6578.263516
Feature amount: 13901.637857
Feature oldbalanceOrg: 380694.276438
Feature newbalanceOrig: 11236.549146
Feature oldbalanceDest: 620.175363
Feature newbalanceDest: 223.308306
Feature errorbalanceOrg: 815.025776
Feature errorbalanceDest: 13616.228725


oldbalanceDest and newbalanceDest are features with the lowest ANOVA F-score but since they directly related to the transation process, we will still keep those variables

**Note**: we didn't encode isFlaggedFraud or the class label because they are already encoded

## MultiCollinearity

In [21]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [22]:
def calc_vif(df):
    vif = pd.DataFrame()
    vif["features"] = df.columns
    vif["VIF"] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    return vif

In [23]:
calc_vif(X[['type', 'step', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']]) 
# note: we didn't the engineered features to prevent calculation errors

Unnamed: 0,features,VIF
0,type,1.338878
1,step,1.303365
2,amount,5.321682
3,oldbalanceOrg,2.783066
4,newbalanceOrig,2.631296
5,oldbalanceDest,65.281776
6,newbalanceDest,80.740306


We can see here that the **oldbalanceDest** and **newbalanceDest** have a high VIF value, meaning they can be predicted by other independent variables in the dataset. in addition, we saw before that they had the lowest ANOVA f-measure scores among the variables. so based on this , we will drop both variables from the dataset

In [24]:
X = X.drop(['oldbalanceDest', 'newbalanceDest'], axis = 1)
X

Unnamed: 0,type,step,amount,oldbalanceOrg,newbalanceOrig,errorbalanceOrg,errorbalanceDest
2,1.0,1,181.00,181.00,0.0,0.00,1.810000e+02
3,0.0,1,181.00,181.00,0.0,0.00,2.136300e+04
15,0.0,1,229133.94,15325.00,0.0,213808.94,1.827035e+05
19,1.0,1,215310.30,705.00,0.0,214605.30,2.377353e+05
24,1.0,1,311685.89,10835.00,0.0,300850.89,-2.401220e+06
...,...,...,...,...,...,...,...
6362615,0.0,743,339682.13,339682.13,0.0,0.00,0.000000e+00
6362616,1.0,743,6311409.28,6311409.28,0.0,0.00,6.311409e+06
6362617,0.0,743,6311409.28,6311409.28,0.0,0.00,1.000000e-02
6362618,1.0,743,850002.52,850002.52,0.0,0.00,8.500025e+05


## Outlier Detection

we will use the Winsorization method, which is similar to IQR method in that if a value exceeds the value of the 99th percentile and below the 1st percentile of given values are treated as outliers. outlier detection will be applied to amount, oldbalanceOrg, and newbalanceOrig.

In [25]:
from scipy.stats import stats

In [26]:
out = {}
def Winsorization_outliers(df, col):
    out[col] = []
    q1 = np.percentile(df , 1)
    q3 = np.percentile(df , 99)
    for i in df:
        if i > q3 or i < q1:
            out[col].append(i)
cols_outlier = ['amount', 'oldbalanceOrg', 'newbalanceOrig']

for col in cols_outlier:
    Winsorization_outliers(X[col], col)
    
print('Outliers detected in:', end=' ')
for k in out.keys():
    print(k, end=' ')
print()
    

Outliers detected in: amount oldbalanceOrg newbalanceOrig 


### Removing Outliers

In [27]:
X_drop = pd.DataFrame()
for col in out.keys():
    X_i = X[X[col].isin(out[col])]
    X_drop = pd.concat([X_i, X_drop]).drop_duplicates()

X = X.drop(X_drop.index)
y = y.drop(X_drop.index)
X.shape, y.shape

((2680699, 7), (2680699,))

# Model Building

Xgboos is a type of gradient-boosted Decision Trees algorithm, which was created for speed as well as maximizing the efficiency of computing time and memory resources

In [28]:
from xgboost import XGBClassifier 
from sklearn.model_selection import train_test_split

## Splitting Data

In [29]:
# splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 3)

## Model Definition

In [30]:
model = XGBClassifier(max_depth=3, n_jobs=-1, random_state=3, learning_rate=0.08)

## Parameter Tuning

In [36]:
# import packages for hyperparameters tuning
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

In [37]:
space={'max_depth': hp.quniform("max_depth", 3, 18, 1),
        'gamma': hp.uniform ('gamma', 1,9),
        'reg_alpha' : hp.quniform('reg_alpha', 40,180,1),
        'reg_lambda' : hp.uniform('reg_lambda', 0,1),
        'colsample_bytree' : hp.uniform('colsample_bytree', 0.5,1),
        'min_child_weight' : hp.quniform('min_child_weight', 0, 10, 1),
        'n_estimators': 180,
        'seed': 0
}

In [38]:
def objective(space):
    clf= XGBClassifier(
                    n_estimators =space['n_estimators'], max_depth = int(space['max_depth']), gamma = space['gamma'],
                    reg_alpha = int(space['reg_alpha']),min_child_weight=int(space['min_child_weight']),
                    colsample_bytree=int(space['colsample_bytree']))
    
    evaluation = [( X_train, y_train), ( X_test, y_test)]
    
    clf.fit(X_train, y_train,
            eval_set=evaluation, eval_metric="auc",
            early_stopping_rounds=10,verbose=False)
    

    pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, pred>0.5)
    print ("SCORE:", accuracy)
    return {'loss': -accuracy, 'status': STATUS_OK }

In [39]:
trials = Trials()

best_hyperparams = fmin(fn = objective,
                        space = space,
                        algo = tpe.suggest,
                        max_evals = 30,
                        trials = trials)

  0%|                                                                           | 0/30 [00:00<?, ?trial/s, best loss=?]




SCORE:                                                                                                                 
0.9984407057858022                                                                                                     
  3%|█▌                                              | 1/30 [00:17<08:28, 17.53s/trial, best loss: -0.9984407057858022]




SCORE:                                                                                                                 
0.9983034282090498                                                                                                     
  7%|███▏                                            | 2/30 [00:31<07:15, 15.55s/trial, best loss: -0.9984407057858022]




SCORE:                                                                                                                 
0.9983034282090498                                                                                                     
 10%|████▊                                           | 3/30 [00:45<06:35, 14.66s/trial, best loss: -0.9984407057858022]




SCORE:                                                                                                                 
0.9978960719215131                                                                                                     
 13%|██████▍                                         | 4/30 [01:03<07:00, 16.17s/trial, best loss: -0.9984407057858022]




SCORE:                                                                                                                 
0.9983123810944903                                                                                                     
 17%|████████                                        | 5/30 [01:20<06:44, 16.20s/trial, best loss: -0.9984407057858022]




SCORE:                                                                                                                 
0.9983034282090498                                                                                                     
 20%|█████████▌                                      | 6/30 [01:34<06:14, 15.62s/trial, best loss: -0.9984407057858022]




SCORE:                                                                                                                 
0.9984392136382289                                                                                                     
 23%|███████████▏                                    | 7/30 [01:50<06:03, 15.78s/trial, best loss: -0.9984407057858022]




SCORE:                                                                                                                 
0.9983034282090498                                                                                                     
 27%|████████████▊                                   | 8/30 [02:04<05:33, 15.17s/trial, best loss: -0.9984407057858022]




SCORE:                                                                                                                 
0.9983616219644122                                                                                                     
 30%|██████████████▍                                 | 9/30 [02:20<05:22, 15.34s/trial, best loss: -0.9984407057858022]




SCORE:                                                                                                                 
0.9983034282090498                                                                                                     
 33%|███████████████▋                               | 10/30 [02:34<04:58, 14.92s/trial, best loss: -0.9984407057858022]




SCORE:                                                                                                                 
0.9983034282090498                                                                                                     
 37%|█████████████████▏                             | 11/30 [02:48<04:40, 14.77s/trial, best loss: -0.9984407057858022]




SCORE:                                                                                                                 
0.9983034282090498                                                                                                     
 40%|██████████████████▊                            | 12/30 [03:03<04:27, 14.88s/trial, best loss: -0.9984407057858022]




SCORE:                                                                                                                 
0.9983034282090498                                                                                                     
 43%|████████████████████▎                          | 13/30 [03:17<04:07, 14.54s/trial, best loss: -0.9984407057858022]




SCORE:                                                                                                                 
0.9983034282090498                                                                                                     
 47%|█████████████████████▉                         | 14/30 [03:31<03:49, 14.37s/trial, best loss: -0.9984407057858022]




SCORE:                                                                                                                 
0.9983034282090498                                                                                                     
 50%|███████████████████████▌                       | 15/30 [03:46<03:36, 14.42s/trial, best loss: -0.9984407057858022]




SCORE:                                                                                                                 
0.9983034282090498                                                                                                     
 53%|█████████████████████████                      | 16/30 [04:00<03:21, 14.36s/trial, best loss: -0.9984407057858022]




SCORE:                                                                                                                 
0.9983571455216921                                                                                                     
 57%|██████████████████████████▋                    | 17/30 [04:16<03:14, 14.95s/trial, best loss: -0.9984407057858022]




SCORE:                                                                                                                 
0.9984392136382289                                                                                                     
 60%|████████████████████████████▏                  | 18/30 [04:32<03:04, 15.35s/trial, best loss: -0.9984407057858022]




SCORE:                                                                                                                 
0.9983034282090498                                                                                                     
 63%|█████████████████████████████▊                 | 19/30 [04:47<02:45, 15.07s/trial, best loss: -0.9984407057858022]




SCORE:                                                                                                                 
0.9983034282090498                                                                                                     
 67%|███████████████████████████████▎               | 20/30 [05:01<02:27, 14.79s/trial, best loss: -0.9984407057858022]




SCORE:                                                                                                                 
0.9983765434401463                                                                                                     
 70%|████████████████████████████████▉              | 21/30 [05:15<02:12, 14.69s/trial, best loss: -0.9984407057858022]




SCORE:                                                                                                                 
0.9984824859178573                                                                                                     
 73%|██████████████████████████████████▍            | 22/30 [05:31<01:59, 14.93s/trial, best loss: -0.9984824859178573]




SCORE:                                                                                                                 
0.9984392136382289                                                                                                     
 77%|████████████████████████████████████           | 23/30 [05:46<01:44, 14.86s/trial, best loss: -0.9984824859178573]




SCORE:                                                                                                                 
0.9984392136382289                                                                                                     
 80%|█████████████████████████████████████▌         | 24/30 [06:01<01:29, 14.99s/trial, best loss: -0.9984824859178573]




SCORE:                                                                                                                 
0.9984392136382289                                                                                                     
 83%|███████████████████████████████████████▏       | 25/30 [06:16<01:14, 14.89s/trial, best loss: -0.9984824859178573]




SCORE:                                                                                                                 
0.9984869623605774                                                                                                     
 87%|████████████████████████████████████████▋      | 26/30 [06:33<01:02, 15.56s/trial, best loss: -0.9984869623605774]




SCORE:                                                                                                                 
0.9984839780654307                                                                                                     
 90%|██████████████████████████████████████████▎    | 27/30 [06:50<00:48, 16.07s/trial, best loss: -0.9984869623605774]




SCORE:                                                                                                                 
0.9984168314246279                                                                                                     
 93%|███████████████████████████████████████████▊   | 28/30 [07:07<00:33, 16.53s/trial, best loss: -0.9984869623605774]




SCORE:                                                                                                                 
0.9984392136382289                                                                                                     
 97%|█████████████████████████████████████████████▍ | 29/30 [07:25<00:16, 16.71s/trial, best loss: -0.9984869623605774]




SCORE:                                                                                                                 
0.9984451822285224                                                                                                     
100%|███████████████████████████████████████████████| 30/30 [07:44<00:00, 15.49s/trial, best loss: -0.9984869623605774]


In [40]:
print("The best hyperparameters are : ","\n")
print(best_hyperparams)

The best hyperparameters are :  

{'colsample_bytree': 0.8577313631443405, 'gamma': 2.753622729981114, 'max_depth': 12.0, 'min_child_weight': 0.0, 'reg_alpha': 49.0, 'reg_lambda': 0.17899342296513426}


## Model Training

In [43]:
model =XGBClassifier(
                    n_estimators =space['n_estimators'], max_depth = int(best_hyperparams['max_depth']), 
                    gamma = best_hyperparams['gamma'], reg_alpha = best_hyperparams['reg_alpha'],
                    min_child_weight=best_hyperparams['min_child_weight'],
                    colsample_bytree=best_hyperparams['colsample_bytree'])

In [44]:
history = model.fit(X_train, y_train)
 
# Predict on testing set
predictions = model.predict(X_test)



## Prediction and Evaluation

In [45]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [46]:
# Predict on testing set
predictions = model.predict(X_test)

In [47]:
confusion_matrix(y_test, predictions)

array([[669038,      0],
       [    11,   1126]], dtype=int64)

generally great results with confusion metrics, showing a very low number of FNs and FPs

In [48]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00    669038
           1       1.00      0.99      1.00      1137

    accuracy                           1.00    670175
   macro avg       1.00      1.00      1.00    670175
weighted avg       1.00      1.00      1.00    670175



accuracy in general is good which indicates an overall accurate prediction. precision and recall measures are also near-perfect, also implying that adequate prediction with highly imbalanced distribution between positive and negative classes.

# Conclusion

The model we built we'll act as core part of an automated detection system for credit card fraud transactions. which we'll be implemented as the trigger for the prevention system, that will lock the accounts for the transaction temporarily for access control, and  notify the user accounts via email of the potential fraud transaction, asking from the user to verify his identity and details, and contacting the company for clarification if the case was not fraudulent related. the system will also have a mechanism where transations are not commited until passsing of fraud identification, to prevent effects taken and operations that occurs in losses and damages, thus overall giving a whole new layer of security for credit card transactions.
however, challenges to this approach include FN cases where the model will predict a non-fraud transaction when in fact it is, based on the small missclassification rate showcased. another problem is the continuous change in fraud attempts in credit card transactions, that might affect model performance over time, and will require retrain and redeployment process for the machine learning model.