# **Credit Card Fraud Detection**

***Context***  
It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.  
  
***Content Dataset***  
The dataset contains transactions made by credit cards in September 2013 by European cardholders.  
This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.  
  
It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.  
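
To make the idea of example-dependent cost concrete, here is a minimal sketch. It assumes a missed fraud costs the full transaction amount and every alert costs a fixed review fee; the `admin_cost` value, the labels and the predictions are made up for illustration and do not come from the dataset.

```python
def total_cost(y_true, y_pred, amounts, admin_cost=5.0):
    """Example-dependent cost: each error is priced individually."""
    cost = 0.0
    for truth, pred, amount in zip(y_true, y_pred, amounts):
        if truth == 1 and pred == 0:
            cost += amount        # missed fraud: the transaction amount is lost
        elif pred == 1:
            cost += admin_cost    # any alert (true or false) costs a manual review
    return cost


# Made-up labels and predictions for four transactions.
y_true = [0, 1, 1, 0]
y_pred = [0, 0, 1, 1]
amounts = [149.62, 378.66, 69.99, 2.69]

print(total_cost(y_true, y_pred, amounts))
```

Under these assumptions a missed fraud of 378.66 weighs far more than a false alarm, which a plain error count would not show.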
  
Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.  
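
As a rough illustration of what AUPRC rewards, the sketch below computes the step-wise area under the precision-recall curve (average precision) in plain Python. The toy labels and scores are made up; in practice a library routine such as scikit-learn's `average_precision_score` would normally be used instead of this hand-rolled version.

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    # Rank transactions from most to least suspicious.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at the rank where a fraud is found
    return total / n_pos


# Toy data (made up): 2 frauds among 10 transactions.
y_true = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
good_scores = [0.1, 0.2, 0.9, 0.1, 0.3, 0.2, 0.1, 0.8, 0.2, 0.1]
poor_scores = [0.9, 0.8, 0.3, 0.7, 0.1, 0.2, 0.1, 0.25, 0.6, 0.15]

print(average_precision(y_true, good_scores))
print(average_precision(y_true, poor_scores))
```

The first set of scores ranks both frauds at the top, the second buries them under legitimate transactions, and the metric separates the two clearly.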
  
A simulator for transaction data has been released as part of the practical handbook on Machine Learning for Credit Card Fraud Detection - https://fraud-detection-handbook.github.io/fraud-detection-handbook/Chapter_3_GettingStarted/SimulatedDataset.html. We invite all practitioners interested in fraud detection datasets to also check out this data simulator, and the methodologies for credit card fraud detection presented in the book.  

Credit card fraud is one of the biggest issues faced by the government, and the amount of money involved is generally enormous. Fraud may happen in the following ways:
1. First and most obviously, when your card details are seen by another person.
2. When your card is lost or stolen and the person possessing it knows how to use it.
3. A fake phone call that convinces you to share the details.
4. Lastly, and least likely, a high-level hack of the bank account details.

The main challenges involved in credit card fraud detection are:
1. Enormous amounts of data are processed every day, and the model built must be fast enough to respond to the scam in time.
2. Imbalanced data, i.e. most of the transactions (99.8%) are not fraudulent, which makes it really hard to detect the fraudulent ones.
3. Data availability, as the data is mostly private.
4. Misclassified data can be another major issue, as not every fraudulent transaction is caught and reported.
5. Last but not least, adaptive techniques used against the model by the scammers.  
  
How to tackle these challenges?  
1. The model used must be simple and fast enough to detect the anomaly and classify it as a fraudulent transaction as quickly as possible.
2. Imbalance can be dealt with by properly applying the methods discussed in the next section.
3. To protect the privacy of the user, the dimensionality of the data can be reduced.
4. A more trustworthy source that double-checks the data must be used, at least for training the model.
5. We can keep the model simple and interpretable so that when scammers adapt to it, we can get a new model up and running with just a few tweaks.  

Dealing with Imbalance  
We will see in the later parts of the article that the data we received is highly imbalanced, i.e. only 0.17% of all credit card transactions are fraudulent. Class imbalance is a very common problem in real life and needs to be handled before applying any algorithm.  
  
There are three common ways to deal with imbalanced data:  
  
- Undersampling, e.g. one-sided selection by Kubat and Matwin (ICML 1997)
- Oversampling, e.g. SMOTE (Synthetic Minority Oversampling Technique)
- Combining the above two.  
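
The two resampling ideas above can be sketched with NumPy as follows. This is a simplified illustration on made-up data: the undersampler picks majority rows at random (it is not the one-sided selection method of Kubat and Matwin), and `smote_like` follows the SMOTE idea of interpolating between a minority row and one of its nearest minority neighbours without reproducing any particular library's implementation. All function names are hypothetical.

```python
import numpy as np


def random_undersample(X, y, rng):
    """Keep every minority row and an equally sized random majority subset."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    kept = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, kept])
    return X[idx], y[idx]


def smote_like(X_min, n_new, k, rng):
    """Create n_new synthetic minority rows by interpolating between a
    minority row and one of its k nearest minority neighbours."""
    # Pairwise Euclidean distances between minority rows.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))               # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.zeros(200, dtype=int)
y[:10] = 1                                     # 5% minority, made-up data

X_under, y_under = random_undersample(X, y, rng)

X_new = smote_like(X[y == 1], n_new=180, k=3, rng=rng)
X_over = np.vstack([X, X_new])
y_over = np.concatenate([y, np.ones(len(X_new), dtype=int)])

print(X_under.shape, y_under.mean())
print(X_over.shape, y_over.mean())
```

Resampling of this kind should be applied to the training rows only, so that the evaluation data keeps the real class ratio.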

Handling the imbalance is outside the scope of this article; a separate article covers that problem specifically.  
  
If you are wondering why we should bother when fraudulent transactions are so rare, here is another fact: the amount of money involved in fraudulent transactions reaches billions of USD, so increasing the sensitivity (the share of frauds caught) by even 0.1% can save millions of USD, while higher specificity (the share of legitimate transactions left unflagged) means fewer customers harassed.  
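
For reference, sensitivity is the share of frauds that are caught and specificity is the share of legitimate transactions that are left alone. The sketch below computes both from confusion-matrix counts. The counts are hypothetical; they are only chosen to add up to the 492 frauds and 284,315 legitimate transactions in this dataset.

```python
def sensitivity(tp, fn):
    """Share of actual frauds that the model flags (also called recall)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of legitimate transactions that the model leaves unflagged."""
    return tn / (tn + fp)


# Hypothetical confusion counts, not results from this notebook.
tp, fn = 400, 92          # 492 frauds in total
tn, fp = 284_031, 284     # 284,315 legitimate transactions in total

print(sensitivity(tp, fn))
print(specificity(tn, fp))

# One tenth of a percentage point of specificity, expressed in customers.
print(0.001 * (tn + fp))
```

With these numbers, a 0.1 point change in specificity corresponds to roughly 284 legitimate cardholders being flagged or spared.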

Dataset : https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud  

# **Load dataset**

In [1]:
!pip install pycaret 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting markupsafe~=2.1.1
  Using cached MarkupSafe-2.1.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
Installing collected packages: markupsafe
  Attempting uninstall: markupsafe
    Found existing installation: MarkupSafe 2.0.1
    Uninstalling MarkupSafe-2.0.1:
      Successfully uninstalled MarkupSafe-2.0.1
Successfully installed markupsafe-2.1.1


In [2]:
!pip install markupsafe==2.0.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting markupsafe==2.0.1
  Using cached MarkupSafe-2.0.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (31 kB)
Installing collected packages: markupsafe
  Attempting uninstall: markupsafe
    Found existing installation: MarkupSafe 2.1.1
    Uninstalling MarkupSafe-2.1.1:
      Successfully uninstalled MarkupSafe-2.1.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pandas-profiling 3.2.0 requires markupsafe~=2.1.1, but you have markupsafe 2.0.1 which is incompatible.
Successfully installed markupsafe-2.0.1


In [3]:
import numpy as np
import pandas as pd
import jinja2
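# Import only the PyCaret functions used in this notebook
from pycaret.classification import (
    setup, compare_models, create_model, tune_model, save_model, load_model
)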



In [4]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [5]:
df = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/Fraud detection/creditcard.csv')
df.head()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [6]:
print(len(df[df['Class'] == 0]))  # Class 0: legitimate transactions
print(len(df[df['Class'] == 1]))  # Class 1: fraudulent transactions

284315
492
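
The two counts above are enough to quantify the imbalance. The short calculation below derives the fraud share, the legitimate-to-fraud ratio, and the accuracy of a trivial rule that labels every transaction as legitimate.

```python
n_legit, n_fraud = 284_315, 492          # counts printed above

total = n_legit + n_fraud
fraud_share = n_fraud / total            # fraction of transactions that are fraud
ratio = n_legit / n_fraud                # legitimate transactions per fraud
always_legit_accuracy = n_legit / total  # accuracy of never flagging anything

print(total)
print(f"{fraud_share:.4%}")
print(f"{ratio:.1f}")
print(f"{always_legit_accuracy:.4f}")
```

A rule that never flags anything is right for more than 99.8% of the rows, so accuracy alone says very little about how well frauds are detected.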


In [7]:
# initializing setup
clf1 = setup(data = df, target = 'Class')

,Description,Value
0,session_id,1415
1,Target,Class
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(284807, 31)"
5,Missing Values,False
6,Numeric Features,30
7,Categorical Features,0
8,Ordinal Features,False
9,High Cardinality Features,False
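
With so few frauds, it matters how rows are divided between training and hold-out data. The sketch below shows a stratified split in NumPy that keeps the class ratio the same in both parts. It is an illustration on made-up labels, not PyCaret's internal code; the 70/30 proportion is an assumption, and `stratified_split` is a hypothetical helper.

```python
import numpy as np


def stratified_split(y, train_size, rng):
    """Return train/test index arrays that preserve the class ratio of y."""
    train_parts, test_parts = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        cut = int(round(train_size * len(idx)))
        train_parts.append(idx[:cut])
        test_parts.append(idx[cut:])
    return np.concatenate(train_parts), np.concatenate(test_parts)


rng = np.random.default_rng(1415)
y = np.zeros(10_000, dtype=int)
y[:20] = 1                               # 0.2% positives, made-up labels

train_idx, test_idx = stratified_split(y, train_size=0.7, rng=rng)

print(len(train_idx), int(y[train_idx].sum()))
print(len(test_idx), int(y[test_idx].sum()))
```

A purely random split could by chance leave very few frauds in the hold-out part, which would make the recall and precision estimates noisy.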



In [8]:
compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.9996,0.9452,0.7988,0.955,0.8692,0.869,0.8729,149.024
et,Extra Trees Classifier,0.9996,0.9499,0.8084,0.9434,0.869,0.8688,0.8722,20.172
lda,Linear Discriminant Analysis,0.9994,0.9086,0.7766,0.8475,0.8096,0.8093,0.8105,1.247
dt,Decision Tree Classifier,0.9993,0.8882,0.7768,0.7907,0.7797,0.7793,0.7814,15.248
ada,Ada Boost Classifier,0.9993,0.9682,0.726,0.8303,0.7738,0.7735,0.7757,46.348
lr,Logistic Regression,0.9991,0.9427,0.6359,0.7938,0.6983,0.6979,0.7061,8.805
gbc,Gradient Boosting Classifier,0.9991,0.7305,0.6033,0.7738,0.6744,0.674,0.681,251.05
ridge,Ridge Classifier,0.9989,0.0,0.4183,0.8186,0.55,0.5496,0.5826,0.182
knn,K Neighbors Classifier,0.9984,0.5946,0.0282,0.6,0.0537,0.0536,0.1282,3.421
dummy,Dummy Classifier,0.9984,0.5,0.0,0.0,0.0,0.0,0.0,0.11
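
The table is ordered by accuracy, which barely separates the models (0.9984 to 0.9996, with the Dummy Classifier at the bottom of that narrow range). The sketch below re-ranks the same numbers, copied from the table, by other metrics. As far as I know, `compare_models` also accepts a `sort` argument for this purpose; the helper `best_by` here is hypothetical and works only on the copied values.

```python
# (id, accuracy, auc, recall, precision, f1) copied from the table above
results = [
    ("rf", 0.9996, 0.9452, 0.7988, 0.9550, 0.8692),
    ("et", 0.9996, 0.9499, 0.8084, 0.9434, 0.8690),
    ("lda", 0.9994, 0.9086, 0.7766, 0.8475, 0.8096),
    ("dt", 0.9993, 0.8882, 0.7768, 0.7907, 0.7797),
    ("ada", 0.9993, 0.9682, 0.7260, 0.8303, 0.7738),
    ("lr", 0.9991, 0.9427, 0.6359, 0.7938, 0.6983),
    ("gbc", 0.9991, 0.7305, 0.6033, 0.7738, 0.6744),
    ("ridge", 0.9989, 0.0, 0.4183, 0.8186, 0.5500),
    ("knn", 0.9984, 0.5946, 0.0282, 0.6000, 0.0537),
    ("dummy", 0.9984, 0.5, 0.0, 0.0, 0.0),
]
COLUMNS = {"accuracy": 1, "auc": 2, "recall": 3, "precision": 4, "f1": 5}


def best_by(metric):
    """Return the model id with the highest value for the given metric."""
    col = COLUMNS[metric]
    return max(results, key=lambda row: row[col])[0]


for metric in ("recall", "precision", "f1", "auc"):
    print(metric, best_by(metric))
```

Extra Trees has the highest recall and trains much faster than Random Forest in this run (20.172 s against 149.024 s), which is consistent with it being picked for the next step.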



RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=1415, verbose=0,
                       warm_start=False)

In [9]:
# Creating an Extra Trees classifier ('et')
ET = create_model('et')

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.9995,0.9042,0.75,0.96,0.8421,0.8419,0.8483
1,0.9996,0.936,0.7812,0.9615,0.8621,0.8619,0.8665
2,0.9997,0.9999,0.9062,0.9355,0.9206,0.9205,0.9206
3,0.9995,0.9199,0.75,0.96,0.8421,0.8419,0.8483
4,0.9996,0.9176,0.8387,0.8966,0.8667,0.8665,0.8669
5,0.9996,0.983,0.8387,0.9286,0.8814,0.8812,0.8823
6,0.9994,0.9671,0.75,0.8889,0.8136,0.8133,0.8162
7,0.9996,0.9517,0.75,1.0,0.8571,0.8569,0.8659
8,0.9996,0.9835,0.875,0.9032,0.8889,0.8887,0.8888
9,0.9997,0.9361,0.8438,1.0,0.9153,0.9151,0.9184
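
The F1 column is the harmonic mean of precision and recall, which can be checked against individual folds. The sketch below recomputes it for folds 0 and 7 from the Recall and Prec. values in the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


fold0 = f1_score(precision=0.96, recall=0.75)   # values from fold 0
fold7 = f1_score(precision=1.0, recall=0.75)    # values from fold 7

print(round(fold0, 4))
print(round(fold7, 4))
```

Both results agree with the F1 values reported for those folds (0.8421 and 0.8571).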



In [10]:
ET

ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
                     oob_score=False, random_state=1415, verbose=0,
                     warm_start=False)

In [13]:
# Hyperparameter tuning: tune_model returns a new estimator and leaves ET unchanged
model = tune_model(ET)

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.9994,0.9747,0.8125,0.8125,0.8125,0.8122,0.8122
1,0.9994,0.9374,0.7812,0.8621,0.8197,0.8194,0.8204
2,0.9995,0.998,0.9688,0.775,0.8611,0.8609,0.8662
3,0.9994,0.9753,0.7812,0.8333,0.8065,0.8062,0.8066
4,0.9993,0.9922,0.8065,0.7576,0.7812,0.7809,0.7813
5,0.9992,0.982,0.871,0.6923,0.7714,0.771,0.7761
6,0.9993,0.9807,0.875,0.7368,0.8,0.7997,0.8026
7,0.9996,0.9972,0.8438,0.9,0.871,0.8708,0.8712
8,0.9996,0.9877,0.9375,0.8571,0.8955,0.8953,0.8962
9,0.9993,0.9677,0.875,0.7368,0.8,0.7997,0.8026
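
To see what tuning changed, the sketch below averages the per-fold Recall, Prec. and F1 values from the two fold tables (the untuned Extra Trees model and the tuned one). The numbers are copied from the tables; nothing is retrained.

```python
from statistics import mean

untuned = {
    "recall": [0.75, 0.7812, 0.9062, 0.75, 0.8387, 0.8387, 0.75, 0.75, 0.875, 0.8438],
    "precision": [0.96, 0.9615, 0.9355, 0.96, 0.8966, 0.9286, 0.8889, 1.0, 0.9032, 1.0],
    "f1": [0.8421, 0.8621, 0.9206, 0.8421, 0.8667, 0.8814, 0.8136, 0.8571, 0.8889, 0.9153],
}
tuned = {
    "recall": [0.8125, 0.7812, 0.9688, 0.7812, 0.8065, 0.871, 0.875, 0.8438, 0.9375, 0.875],
    "precision": [0.8125, 0.8621, 0.775, 0.8333, 0.7576, 0.6923, 0.7368, 0.9, 0.8571, 0.7368],
    "f1": [0.8125, 0.8197, 0.8611, 0.8065, 0.7812, 0.7714, 0.8, 0.871, 0.8955, 0.8],
}

# Mean of each metric over the ten folds, before and after tuning.
summary = {
    metric: (mean(untuned[metric]), mean(tuned[metric]))
    for metric in ("recall", "precision", "f1")
}

for metric, (before, after) in summary.items():
    print(f"{metric}: {before:.3f} -> {after:.3f}")
```

In this run the tuned model catches more frauds (higher recall) but raises more false alarms (lower precision), and its mean F1 is lower than that of the untuned model.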


INFO:logs:create_model_container: 16
INFO:logs:master_model_container: 16
INFO:logs:display_container: 4
INFO:logs:ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0,
                     class_weight='balanced_subsample', criterion='entropy',
                     max_depth=9, max_features=1.0, max_leaf_nodes=None,
                     max_samples=None, min_impurity_decrease=0,
                     min_impurity_split=None, min_samples_leaf=2,
                     min_samples_split=10, min_weight_fraction_leaf=0.0,
                     n_estimators=60, n_jobs=-1, oob_score=False,
                     random_state=1415, verbose=0, warm_start=False)
INFO:logs:tune_model() succesfully completed......................................


In [14]:
ET

ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
                     oob_score=False, random_state=1415, verbose=0,
                     warm_start=False)

In [15]:
# Saving the untuned Extra Trees model (ET); the tuned estimator is in `model`
save_model(ET, 'ET_saved')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Class',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strate...
                  ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0,
                                       class_weight=None, criterion='gini',
                                       max_depth=None, max_features='auto',
                                       max_leaf_nod

In [16]:
# Loading the saved model
ET_saved = load_model('ET_saved')

INFO:logs:Initializing load_model()
INFO:logs:load_model(model_name=ET_saved, platform=None, authentication=None, verbose=True)


Transformation Pipeline and Model Successfully Loaded
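
A quick sanity check after loading a saved model is to confirm that it predicts exactly like the original. The sketch below shows that round trip with a plain scikit-learn `ExtraTreesClassifier` and `pickle` on made-up data. It is a stand-in for illustration only: it does not use PyCaret's saved pipeline, and the file name `et_demo.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(1415)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 1.5).astype(int)    # made-up, imbalanced labels

model = ExtraTreesClassifier(n_estimators=20, random_state=1415).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "et_demo.pkl")  # hypothetical file name
    with open(path, "wb") as fh:
        pickle.dump(model, fh)               # save the fitted model
    with open(path, "rb") as fh:
        restored = pickle.load(fh)           # load it back

# The restored model should reproduce the original predictions exactly.
same_predictions = bool((model.predict(X) == restored.predict(X)).all())
print(same_predictions)
```

The same kind of check can be run on the loaded `ET_saved` pipeline by comparing its predictions with those of `ET` on a few rows of `df`.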
