<a href="https://www.kaggle.com/code/aadarshkumarshah/credit-card-fraud-detection?scriptVersionId=187656975" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

---
<center><h1>Credit Card Fraud Detection</h1></center>

---

## 1) Understanding Problem Statement
---

Credit card fraud is a pervasive and costly issue in the financial industry, causing significant financial losses to both cardholders and financial institutions. Timely detection and prevention of fraudulent credit card transactions are essential for minimizing financial losses and protecting consumers. Machine learning models can be instrumental in identifying suspicious transactions and predicting the likelihood of credit card fraud based on transaction data and behavioral patterns.

This project addresses **Anomaly Detection in Financial Transactions using Machine Learning** and treats it as a **binary classification problem**: each transaction is labelled either legit or fraudulent. The goal is **to develop a credit card fraud detection model that can accurately identify potentially fraudulent transactions, safeguarding cardholders and financial institutions from financial losses**.

## 2) Understanding Data
---

The project uses the **Credit Card Fraud Data** (`creditcard.csv`), which contains 284,807 transactions and 31 columns: 30 independent variables (`Time`, `V1` to `V28`, and `Amount`) and the dependent variable `Class` (0 = legit, 1 = fraudulent).

## 3) Getting System Ready
---
Importing required libraries


In [1]:
import numpy as np
import pandas as pd

# for model building
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

## 4) Data Eyeballing
---

### Loading Data

In [2]:
credit_card_fraud_data = pd.read_csv('/kaggle/input/credit-card-data/creditcard.csv') 

In [3]:
credit_card_fraud_data

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00,0


In [4]:
print('The size of Dataframe is: ', credit_card_fraud_data.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
credit_card_fraud_data.info()
print('-'*100)

The size of Dataframe is:  (284807, 31)
----------------------------------------------------------------------------------------------------
The Column Name, Record Count and Data Types are as follows: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 1

In [5]:
# Defining numerical & categorical columns
numeric_features = [feature for feature in credit_card_fraud_data.columns if credit_card_fraud_data[feature].dtype != 'O']
categorical_features = [feature for feature in credit_card_fraud_data.columns if credit_card_fraud_data[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

We have 31 numerical features : ['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class']

We have 0 categorical features : []


In [6]:
print('Missing Value Presence in different columns of DataFrame are as follows : ')
print('-'*100)
total=credit_card_fraud_data.isnull().sum().sort_values(ascending=False)
percent=(credit_card_fraud_data.isnull().sum()/credit_card_fraud_data.isnull().count()*100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

Missing Value Presence in different columns of DataFrame are as follows : 
----------------------------------------------------------------------------------------------------


| Column | Total | Percent |
|---|---|---|
| Time | 0 | 0.0 |
| V16 | 0 | 0.0 |
| Amount | 0 | 0.0 |
| V28 | 0 | 0.0 |
| V27 | 0 | 0.0 |
| V26 | 0 | 0.0 |
| V25 | 0 | 0.0 |
| V24 | 0 | 0.0 |
| V23 | 0 | 0.0 |
| V22 | 0 | 0.0 |


In [7]:
print('Summary Statistics of numerical features for DataFrame are as follows:')
print('-'*100)
credit_card_fraud_data.describe()

Summary Statistics of numerical features for DataFrame are as follows:
----------------------------------------------------------------------------------------------------


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,1.168375e-15,3.416908e-16,-1.379537e-15,2.074095e-15,9.604066e-16,1.487313e-15,-5.556467e-16,1.213481e-16,-2.406331e-15,...,1.654067e-16,-3.568593e-16,2.578648e-16,4.473266e-15,5.340915e-16,1.683437e-15,-3.660091e-16,-1.22739e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


In [8]:
credit_card_fraud_data['Class'].value_counts()

Class
0    284315
1       492
Name: count, dtype: int64

#### `Class` is the target variable. The distribution of legit transactions (`Class = 0`, 284,315 rows) and fraudulent transactions (`Class = 1`, 492 rows) shows that the dataset is highly imbalanced.
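
To put the imbalance in numbers, the short sketch below uses only the two counts printed above. It computes the fraud share and the accuracy that a trivial classifier would reach by labelling every transaction as legit, which is why accuracy alone says little on the full dataset. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
n_legit = 284315
n_fraud = 492
n_total = n_legit + n_fraud

# Share of fraudulent transactions in the full dataset
fraud_share = n_fraud / n_total

# Accuracy of a baseline that always predicts Class = 0 (legit)
baseline_accuracy = n_legit / n_total

print(f"Fraud share: {fraud_share:.4%}")
print(f"Always-legit baseline accuracy: {baseline_accuracy:.4%}")
```

The fraud share matches the mean of the `Class` column reported by `describe()` (0.001727).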

## 4) Data Cleaning and Preprocessing
---

### Separating the data as per `Class` column for analysis

In [9]:
legit = credit_card_fraud_data[credit_card_fraud_data.Class == 0]
fraud = credit_card_fraud_data[credit_card_fraud_data.Class == 1]

In [10]:
legit.shape,fraud.shape

((284315, 31), (492, 31))

#### Statistical look at the `Amount` column for legit and fraudulent transactions

In [11]:
legit.Amount.describe()

count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64

In [12]:
fraud.Amount.describe()

count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64

In [13]:
credit_card_fraud_data.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,94838.202258,0.008258,-0.006271,0.012171,-0.00786,0.005453,0.002419,0.009637,-0.000987,0.004467,...,-0.000644,-0.001235,-2.4e-05,7e-05,0.000182,-7.2e-05,-8.9e-05,-0.000295,-0.000131,88.291022
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


#### Under-sampling: build a sample dataset with an equal number of legit and fraudulent transactions, here 492 of each to match the fraud count

In [14]:
legit_sample = legit.sample(n=492)

In [15]:
legit_sample

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
103751,68780.0,0.488552,-0.922367,1.673494,3.003203,-1.465758,0.749788,-0.700551,0.383919,0.511891,...,0.371627,0.606800,-0.222559,0.405858,0.080003,0.148637,0.022034,0.087126,282.98,0
52739,45618.0,1.176412,-0.005461,-0.339213,1.548106,1.869341,4.303396,-0.969804,1.092231,-0.128669,...,0.019474,0.065941,-0.135792,1.025085,0.689118,0.164118,0.030556,0.022710,11.82,0
125424,77668.0,-0.324319,1.124186,0.664101,1.024145,-0.388033,-0.812017,0.438155,0.270082,-0.657542,...,0.187456,0.390670,0.018725,0.382855,-0.133469,-0.327742,-0.056399,-0.016765,39.29,0
277522,167694.0,1.772505,-1.877118,-2.706048,-2.712425,-0.354475,-1.031250,0.281879,-0.476465,0.864019,...,-0.123163,-0.288057,-0.236765,0.024393,0.338704,-0.715044,-0.023066,-0.010586,288.58,0
10027,15108.0,-0.985587,-0.346430,0.632396,-0.845306,0.639123,-0.214537,1.055697,-0.062857,0.854995,...,0.131755,0.253732,0.435558,-0.287631,-0.499117,0.804087,-0.034822,0.161758,200.00,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
132117,79862.0,-0.434428,1.122242,1.721206,0.036081,-0.151561,-1.063572,0.773368,-0.138474,-0.553674,...,-0.196333,-0.460943,0.007162,0.717005,-0.207418,0.047965,0.271677,0.126207,2.69,0
234317,147921.0,-4.142018,3.494414,-1.785318,-3.085613,-0.954915,-1.545573,0.304685,0.309538,2.124917,...,-0.527983,0.045885,0.148285,-0.093564,0.090716,-0.632618,0.503353,-0.412685,0.99,0
166973,118414.0,-1.490427,-0.320934,-0.647754,-2.500237,1.293554,-0.645486,0.395299,0.622673,0.881324,...,0.599520,1.504982,-0.065270,-1.041042,-0.234518,-0.974644,0.386372,0.055643,56.23,0
229949,146122.0,-1.233040,0.488683,1.412712,-2.282076,-0.501826,-0.775993,-0.021647,0.461402,0.575706,...,-0.116109,-0.415755,-0.072357,0.067460,0.315300,-0.384547,-0.178274,-0.023378,3.50,0


#### Concatenating the under-sampled legit rows with the fraud rows

In [16]:
sampled_data = pd.concat([legit_sample, fraud], axis=0)

In [17]:
sampled_data

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
103751,68780.0,0.488552,-0.922367,1.673494,3.003203,-1.465758,0.749788,-0.700551,0.383919,0.511891,...,0.371627,0.606800,-0.222559,0.405858,0.080003,0.148637,0.022034,0.087126,282.98,0
52739,45618.0,1.176412,-0.005461,-0.339213,1.548106,1.869341,4.303396,-0.969804,1.092231,-0.128669,...,0.019474,0.065941,-0.135792,1.025085,0.689118,0.164118,0.030556,0.022710,11.82,0
125424,77668.0,-0.324319,1.124186,0.664101,1.024145,-0.388033,-0.812017,0.438155,0.270082,-0.657542,...,0.187456,0.390670,0.018725,0.382855,-0.133469,-0.327742,-0.056399,-0.016765,39.29,0
277522,167694.0,1.772505,-1.877118,-2.706048,-2.712425,-0.354475,-1.031250,0.281879,-0.476465,0.864019,...,-0.123163,-0.288057,-0.236765,0.024393,0.338704,-0.715044,-0.023066,-0.010586,288.58,0
10027,15108.0,-0.985587,-0.346430,0.632396,-0.845306,0.639123,-0.214537,1.055697,-0.062857,0.854995,...,0.131755,0.253732,0.435558,-0.287631,-0.499117,0.804087,-0.034822,0.161758,200.00,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
279863,169142.0,-1.927883,1.125653,-4.518331,1.749293,-1.566487,-2.010494,-0.882850,0.697211,-2.064945,...,0.778584,-0.319189,0.639419,-0.294885,0.537503,0.788395,0.292680,0.147968,390.00,1
280143,169347.0,1.378559,1.289381,-5.004247,1.411850,0.442581,-1.326536,-1.413170,0.248525,-1.127396,...,0.370612,0.028234,-0.145640,-0.081049,0.521875,0.739467,0.389152,0.186637,0.76,1
280149,169351.0,-0.676143,1.126366,-2.213700,0.468308,-1.120541,-0.003346,-2.234739,1.210158,-0.652250,...,0.751826,0.834108,0.190944,0.032070,-0.739695,0.471111,0.385107,0.194361,77.89,1
281144,169966.0,-3.113832,0.585864,-5.399730,1.817092,-0.840618,-2.943548,-2.208002,1.058733,-1.632333,...,0.583276,-0.269209,-0.456108,-0.183659,-0.328168,0.606116,0.884876,-0.253700,245.00,1


In [18]:
sampled_data['Class'].value_counts()

Class
0    492
1    492
Name: count, dtype: int64
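
The `legit.sample(n=492)` call above draws a different subset on every run, so the downstream numbers change between executions. A minimal sketch of a reproducible variant follows, assuming a fixed `random_state` is acceptable. The helper `undersample_majority` and the toy DataFrame are hypothetical and only illustrate the idea; they are not part of the notebook.

```python
import pandas as pd


def undersample_majority(df, label_col="Class", random_state=0):
    """Return a balanced frame by downsampling every class to the rarest class size."""
    n_minority = int(df[label_col].value_counts().min())
    parts = [
        group.sample(n=n_minority, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state)


# Toy stand-in for the real data: 95 legit rows and 5 fraud rows
toy = pd.DataFrame({"Amount": range(100), "Class": [0] * 95 + [1] * 5})
balanced = undersample_majority(toy, random_state=42)
print(balanced["Class"].value_counts())
```

With a fixed seed, repeated calls return the same rows, which makes the later metrics comparable across runs.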

In [19]:
sampled_data.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,92010.152439,-0.104453,-0.109346,-0.052041,0.020938,0.055982,0.006444,0.075827,-0.001372,-0.030582,...,0.045855,0.008813,-0.044995,-0.064337,-0.039338,-0.005871,-0.036368,0.004553,-0.000743,120.155264
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


## 5) Model Building
---

### Creating Feature Matrix (Independent Variables) & Target Variable (Dependent Variable)

In [20]:
# separating the data and labels
X = sampled_data.drop(columns=['Class'])  # Feature matrix (axis=1 is redundant with columns=)
y = sampled_data['Class'] # Target variable

In [21]:
X

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
103751,68780.0,0.488552,-0.922367,1.673494,3.003203,-1.465758,0.749788,-0.700551,0.383919,0.511891,...,0.309299,0.371627,0.606800,-0.222559,0.405858,0.080003,0.148637,0.022034,0.087126,282.98
52739,45618.0,1.176412,-0.005461,-0.339213,1.548106,1.869341,4.303396,-0.969804,1.092231,-0.128669,...,-0.022071,0.019474,0.065941,-0.135792,1.025085,0.689118,0.164118,0.030556,0.022710,11.82
125424,77668.0,-0.324319,1.124186,0.664101,1.024145,-0.388033,-0.812017,0.438155,0.270082,-0.657542,...,-0.095587,0.187456,0.390670,0.018725,0.382855,-0.133469,-0.327742,-0.056399,-0.016765,39.29
277522,167694.0,1.772505,-1.877118,-2.706048,-2.712425,-0.354475,-1.031250,0.281879,-0.476465,0.864019,...,-0.049223,-0.123163,-0.288057,-0.236765,0.024393,0.338704,-0.715044,-0.023066,-0.010586,288.58
10027,15108.0,-0.985587,-0.346430,0.632396,-0.845306,0.639123,-0.214537,1.055697,-0.062857,0.854995,...,0.364772,0.131755,0.253732,0.435558,-0.287631,-0.499117,0.804087,-0.034822,0.161758,200.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
279863,169142.0,-1.927883,1.125653,-4.518331,1.749293,-1.566487,-2.010494,-0.882850,0.697211,-2.064945,...,1.252967,0.778584,-0.319189,0.639419,-0.294885,0.537503,0.788395,0.292680,0.147968,390.00
280143,169347.0,1.378559,1.289381,-5.004247,1.411850,0.442581,-1.326536,-1.413170,0.248525,-1.127396,...,0.226138,0.370612,0.028234,-0.145640,-0.081049,0.521875,0.739467,0.389152,0.186637,0.76
280149,169351.0,-0.676143,1.126366,-2.213700,0.468308,-1.120541,-0.003346,-2.234739,1.210158,-0.652250,...,0.247968,0.751826,0.834108,0.190944,0.032070,-0.739695,0.471111,0.385107,0.194361,77.89
281144,169966.0,-3.113832,0.585864,-5.399730,1.817092,-0.840618,-2.943548,-2.208002,1.058733,-1.632333,...,0.306271,0.583276,-0.269209,-0.456108,-0.183659,-0.328168,0.606116,0.884876,-0.253700,245.00


In [22]:
y

103751    0
52739     0
125424    0
277522    0
10027     0
         ..
279863    1
280143    1
280149    1
281144    1
281674    1
Name: Class, Length: 984, dtype: int64

### Data Standardization

In [23]:
scaler = StandardScaler()

In [24]:
scaler.fit(X)

In [25]:
standardized_data = scaler.transform(X)

In [26]:
standardized_data

array([[-0.35812667,  0.53062837, -0.7054888 , ..., -0.06498426,
         0.1011572 ,  0.5130537 ],
       [-0.82947023,  0.65533925, -0.46408307, ..., -0.0565336 ,
        -0.03004746, -0.34678853],
       [-0.17725708,  0.38325262, -0.16666605, ..., -0.14276518,
        -0.1104508 , -0.25968178],
       ...,
       [ 1.68847952,  0.31946601, -0.16609211, ...,  0.29506773,
         0.31957793, -0.13728205],
       [ 1.70099469, -0.12249371, -0.30839708, ...,  0.79067721,
        -0.59304902,  0.39261998],
       [ 1.70876834,  0.8032033 , -0.42092117, ..., -0.08387249,
        -0.10748539, -0.24940781]])

In [27]:
X = standardized_data

In [28]:
X

array([[-0.35812667,  0.53062837, -0.7054888 , ..., -0.06498426,
         0.1011572 ,  0.5130537 ],
       [-0.82947023,  0.65533925, -0.46408307, ..., -0.0565336 ,
        -0.03004746, -0.34678853],
       [-0.17725708,  0.38325262, -0.16666605, ..., -0.14276518,
        -0.1104508 , -0.25968178],
       ...,
       [ 1.68847952,  0.31946601, -0.16609211, ...,  0.29506773,
         0.31957793, -0.13728205],
       [ 1.70099469, -0.12249371, -0.30839708, ...,  0.79067721,
        -0.59304902,  0.39261998],
       [ 1.70876834,  0.8032033 , -0.42092117, ..., -0.08387249,
        -0.10748539, -0.24940781]])

### Train-Test Split

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=45)

In [30]:
print(X.shape, X_train.shape, X_test.shape)

(984, 30) (787, 30) (197, 30)


In [31]:
print(y.shape, y_train.shape, y_test.shape)

(984,) (787,) (197,)
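
In the cells above, the scaler is fitted on all 984 rows before the split, so the test rows contribute to the mean and standard deviation used for scaling. One common alternative is to split first and let a scikit-learn `Pipeline` fit the scaler on the training rows only. The sketch below shows this on synthetic data from `make_classification`; the names `X_demo`, `y_demo` and `pipeline` are assumptions for illustration, and the printed score says nothing about the fraud data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the balanced sample (984 rows, 30 features)
X_demo, y_demo = make_classification(n_samples=984, n_features=30, random_state=45)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=45
)

# fit() learns the scaling statistics from X_tr only;
# score() reuses those statistics to transform X_te before predicting
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_tr, y_tr)

print("Test accuracy on synthetic data:", pipeline.score(X_te, y_te))
```

The same pipeline object can be passed to `RandomizedSearchCV`, so the scaler is refitted inside every cross-validation fold.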


### Model Comparison : Training & Evaluation

In [32]:
models = [LogisticRegression, SVC, DecisionTreeClassifier, RandomForestClassifier]
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

for model in models:
    classifier = model().fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    
    accuracy_scores.append(accuracy_score(y_test, y_pred))
    precision_scores.append(precision_score(y_test, y_pred))
    recall_scores.append(recall_score(y_test, y_pred))
    f1_scores.append(f1_score(y_test, y_pred))

In [33]:
classification_metrics_df = pd.DataFrame({
    "Model": ["Logistic Regression", "SVM", "Decision Tree", "Random Forest"],
    "Accuracy": accuracy_scores,
    "Precision": precision_scores,
    "Recall": recall_scores,
    "F1 Score": f1_scores
})

classification_metrics_df.set_index('Model', inplace=True)
classification_metrics_df

| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Logistic Regression | 0.939086 | 0.939394 | 0.939394 | 0.939394 |
| SVM | 0.93401 | 0.938776 | 0.929293 | 0.93401 |
| Decision Tree | 0.923858 | 0.928571 | 0.919192 | 0.923858 |
| Random Forest | 0.954315 | 0.978723 | 0.929293 | 0.953368 |
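
The notebook imports `confusion_matrix` but never prints it. The sketch below shows how the four metrics follow from confusion-matrix counts. The counts are not taken from the notebook: they are inferred from the table under the assumption that the 197 test rows contain 99 fraud and 98 legit transactions, and they are chosen because they reproduce the reported values. The helper `metrics_from_counts` is hypothetical.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Inferred counts (assumption: 99 fraud and 98 legit rows in the test split)
inferred_counts = {
    "Logistic Regression": dict(tp=93, fp=6, fn=6, tn=92),
    "SVM": dict(tp=92, fp=6, fn=7, tn=92),
    "Decision Tree": dict(tp=91, fp=7, fn=8, tn=91),
    "Random Forest": dict(tp=92, fp=2, fn=7, tn=96),
}

for name, counts in inferred_counts.items():
    acc, prec, rec, f1 = metrics_from_counts(**counts)
    print(f"{name:20s} acc={acc:.6f} prec={prec:.6f} rec={rec:.6f} f1={f1:.6f}")
```

Read this way, the gap between the best and worst model is about six test transactions, so small differences in the table should be interpreted with care.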


### Inference

In the context of credit card fraud detection, on the balanced test split of 197 transactions:
- Random Forest gives the strongest overall result, with the highest accuracy (0.954), precision (0.979), and F1 score (0.953).
- Logistic Regression has the highest recall (0.939) with equally high precision, so it misses the fewest fraudulent transactions.
- SVM (accuracy 0.934) and Decision Tree (accuracy 0.924) trail slightly, although both keep precision and recall above 0.91.
- In real-world credit card fraud detection, precision matters because false positives burden legitimate customers, while recall matters because missed fraud causes direct financial losses. If precision is the priority, Random Forest is the preferred model here; if catching as much fraud as possible matters more, Logistic Regression is a reasonable alternative, depending on specific operational requirements.
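
The precision values in the metrics table were measured on a balanced test split, whereas fraud makes up roughly 0.17% of the original data. The sketch below projects precision to that original fraud share. It rests on two assumptions: the Random Forest rates are inferred from the table as 92 of 99 fraud rows caught and 2 of 98 legit rows flagged, and those rates carry over unchanged to the full data. The helper `precision_at_prevalence` is hypothetical.

```python
def precision_at_prevalence(recall, false_positive_rate, prevalence):
    """Expected precision when the positive class makes up `prevalence` of the data."""
    true_pos = recall * prevalence
    false_pos = false_positive_rate * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Random Forest rates inferred from the table (assumption, see the text above)
rf_recall = 92 / 99
rf_fpr = 2 / 98

# Fraud share in the balanced test split and in the original dataset
precision_balanced = precision_at_prevalence(rf_recall, rf_fpr, prevalence=99 / 197)
precision_full = precision_at_prevalence(rf_recall, rf_fpr, prevalence=492 / 284807)

print(f"Precision on the balanced test split: {precision_balanced:.3f}")
print(f"Projected precision at the original fraud share: {precision_full:.3f}")
```

Under these assumptions the projected precision falls to about 7%, so a model chosen on the balanced sample should also be evaluated on data with the original class distribution before any operational decision.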