# **MID COURSE ASSESSMENT** #

## **Machine Learning (ML) Project** ##
### **Credit-Card Fraud Detection** ###

## **Problem Statement** ##
### The rapid increase in online financial transactions has led to a corresponding rise in fraudulent activities, posing significant risks to both financial institutions and their customers. Detecting fraudulent transactions in real-time is a challenging task due to the complexity and diversity of fraud patterns. Traditional rule-based systems are often inadequate, resulting in high false positive rates, missed frauds, and increased operational costs. ###

# **Objective** #
### The primary objective of this machine learning project is to develop a predictive model that can accurately classify incoming transactions as either fraudulent or genuine using supervised learning techniques. The model aims to enhance the security measures of financial systems by proactively identifying and mitigating fraudulent activities, thereby protecting both the organization and its customers from potential financial losses. ###

## **What is Credit Card Fraud?**
Credit card fraud is when someone uses another person's credit card or account information to make unauthorized purchases or access funds through cash advances. Credit card fraud doesn’t just happen online; it happens in brick-and-mortar stores, too. As a business owner, you can avoid serious headaches – and unwanted publicity – by recognizing potentially fraudulent use of credit cards in your payment environment.

## **Challenges surrounding credit card fraud**

**Sophisticated Fraud Techniques:** Criminals constantly develop new methods to evade detection and exploit vulnerabilities in payment systems, requiring continual adaptation of fraud prevention strategies.

**Data Breaches:** Breaches expose sensitive information, increasing the risk of identity theft and fraudulent transactions.

**Online Transactions:** The anonymity and convenience of online shopping make it a prime target for fraudsters, necessitating strong security measures for e-commerce platforms.

**Account Takeover (ATO):** Stolen credentials can grant fraudsters access to accounts, enabling unauthorized transactions and account manipulation.

**Lack of Consumer Awareness:** Many consumers are unaware of how to protect themselves from fraud, making them easy targets for scams.

# **Importing important Libraries** #

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score,confusion_matrix,roc_auc_score

In [3]:
# Display all columns of the dataset when printing a DataFrame:
pd.set_option('display.max_columns', None)

### **Loading Dataset:**

In [4]:
card_data=pd.read_csv(r"C:\Users\akash\OneDrive\Desktop\Python\AmlaBetter\Python_Practice\Module_6 Machine Learning\Mid Course\creditcard.csv")

# **First view for the Dataset:**

In [5]:
card_data

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.551600,-0.617801,-0.991390,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.524980,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.119670,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,4.356170,-1.593105,2.711941,-0.689256,4.626942,-0.924459,1.107641,1.991691,0.510632,-0.682920,1.475829,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,-0.975926,-0.150189,0.915802,1.214756,-0.675143,1.164931,-0.711757,-0.025693,-1.221179,-1.545556,0.059616,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,-0.484782,0.411614,0.063119,-0.183699,-0.510602,1.329284,0.140716,0.313502,0.395652,-0.577252,0.001396,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,-0.399126,-1.933849,-0.962886,-1.042082,0.449624,1.962563,-0.608577,0.509928,1.113981,2.897849,0.127434,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00,0


In [6]:
# Displaying the first five transactions from the dataset:
card_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [7]:
# Displaying the last five transactions from the dataset:
card_data.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,4.35617,-1.593105,2.711941,-0.689256,4.626942,-0.924459,1.107641,1.991691,0.510632,-0.68292,1.475829,0.213454,0.111864,1.01448,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.05508,2.03503,-0.738589,0.868229,1.058415,0.02433,0.294869,0.5848,-0.975926,-0.150189,0.915802,1.214756,-0.675143,1.164931,-0.711757,-0.025693,-1.221179,-1.545556,0.059616,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.24964,-0.557828,2.630515,3.03126,-0.296827,0.708417,0.432454,-0.484782,0.411614,0.063119,-0.183699,-0.510602,1.329284,0.140716,0.313502,0.395652,-0.577252,0.001396,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.24044,0.530483,0.70251,0.689799,-0.377961,0.623708,-0.68618,0.679145,0.392087,-0.399126,-1.933849,-0.962886,-1.042082,0.449624,1.962563,-0.608577,0.509928,1.113981,2.897849,0.127434,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.0,0
284806,172792.0,-0.533413,-0.189733,0.703337,-0.506271,-0.012546,-0.649617,1.577006,-0.41465,0.48618,-0.915427,-1.040458,-0.031513,-0.188093,-0.084316,0.041333,-0.30262,-0.660377,0.16743,-0.256117,0.382948,0.261057,0.643078,0.376777,0.008797,-0.473649,-0.818267,-0.002415,0.013649,217.0,0


In [8]:
# Checking the total number of rows and columns in the dataset:
print(f"The total number of Rows are: {card_data.shape[0]}\nTotal number of Columns are: {card_data.shape[1]}")

The total number of Rows are: 284807
Total number of Columns are: 31


In [9]:
# Checking for the missing Values in the Dataset:
card_data.isna().sum().sum()

0

### **There are no missing values in any column of the dataset.**

In [10]:
# Checking for the Duplicate Values in the Dataset:
print("The Total number of Duplicate data Present in the Dataset is:",card_data.duplicated().sum())

The Total number of Duplicate data Present in the Dataset is: 1081


### **The dataset contains 1,081 duplicate rows.**

In [11]:
# Displaying all the columns present in the dataset:
card_data.columns

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')

In [12]:
# Display a concise summary of the DataFrame, including the number of non-null values, data types, and memory usage.
card_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [13]:
# Summary statistics of the central tendency, dispersion, and shape of a dataset’s distribution for all the Transactions whether it is Fraud or Non-Fraud.
card_data[['Time','Amount','Class']].describe().round(2)

Unnamed: 0,Time,Amount,Class
count,284807.0,284807.0,284807.0
mean,94813.86,88.35,0.0
std,47488.15,250.12,0.04
min,0.0,0.0,0.0
25%,54201.5,5.6,0.0
50%,84692.0,22.0,0.0
75%,139320.5,77.16,0.0
max,172792.0,25691.16,1.0


# **Data Cleaning:**

In [14]:
# Removing all duplicate transactions from the dataset:
card_data.drop_duplicates(inplace=True)

In [15]:
# Checking the total number of rows and columns after removing the duplicate transactions:
print(f"The total number of Rows are: {card_data.shape[0]}\nTotal number of Columns are: {card_data.shape[1]}\n")
print("The Total number of Duplicate data Present in the Dataset is:",card_data.duplicated().sum())

The total number of Rows are: 283726
Total number of Columns are: 31

The Total number of Duplicate data Present in the Dataset is: 0


In [16]:
card_data['Class'].value_counts()

Class
0    283253
1       473
Name: count, dtype: int64

**The vast majority of transactions are non-fraudulent. Training directly on this imbalanced dataset would bias our models toward the majority class: a model can reach very high accuracy simply by "assuming" that every transaction is legitimate, while missing the patterns that indicate fraud. Our goal is to create a model that accurately detects signs of fraudulent activity, rather than simply assuming the majority of transactions are not fraud.**
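To make the risk concrete, the arithmetic below works out how a hypothetical "always predict non-fraud" rule would score on the class counts printed above. The rule itself is only an illustration and is not one of the models trained in this notebook.

```python
# Class counts taken from the value_counts() output above.
n_normal = 283_253
n_fraud = 473
total = n_normal + n_fraud

# Hypothetical rule: label every transaction as non-fraud.
true_negatives = n_normal   # every genuine transaction is labelled correctly
true_positives = 0          # no fraudulent transaction is ever flagged

baseline_accuracy = (true_negatives + true_positives) / total
baseline_recall = true_positives / n_fraud

print(f"Accuracy of always predicting non-fraud: {baseline_accuracy:.4%}")
print(f"Recall on the fraud class: {baseline_recall:.0%}")
```

The accuracy looks excellent even though no fraud is caught, which is why accuracy alone is a poor guide on this dataset.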

In [17]:
Fraud_trnsaction=card_data[card_data['Class']==1]       # 1->Fraud
normal_transaction=card_data[card_data['Class']==0]     # 0->Non-Fraud

In [18]:
# Displaying the first five fraudulent transactions:
Fraud_trnsaction.head() # class-> 1 > Fraud

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
541,406.0,-2.312227,1.951992,-1.609851,3.997906,-0.522188,-1.426545,-2.537387,1.391657,-2.770089,-2.772272,3.202033,-2.899907,-0.595222,-4.289254,0.389724,-1.140747,-2.830056,-0.016822,0.416956,0.126911,0.517232,-0.035049,-0.465211,0.320198,0.044519,0.17784,0.261145,-0.143276,0.0,1
623,472.0,-3.043541,-3.157307,1.088463,2.288644,1.359805,-1.064823,0.325574,-0.067794,-0.270953,-0.838587,-0.414575,-0.503141,0.676502,-1.692029,2.000635,0.66678,0.599717,1.725321,0.283345,2.102339,0.661696,0.435477,1.375966,-0.293803,0.279798,-0.145362,-0.252773,0.035764,529.0,1
4920,4462.0,-2.30335,1.759247,-0.359745,2.330243,-0.821628,-0.075788,0.56232,-0.399147,-0.238253,-1.525412,2.032912,-6.560124,0.022937,-1.470102,-0.698826,-2.282194,-4.781831,-2.615665,-1.334441,-0.430022,-0.294166,-0.932391,0.172726,-0.08733,-0.156114,-0.542628,0.039566,-0.153029,239.93,1
6108,6986.0,-4.397974,1.358367,-2.592844,2.679787,-1.128131,-1.706536,-3.496197,-0.248778,-0.247768,-4.801637,4.895844,-10.912819,0.184372,-6.771097,-0.007326,-7.358083,-12.598419,-5.131549,0.308334,-0.171608,0.573574,0.176968,-0.436207,-0.053502,0.252405,-0.657488,-0.827136,0.849573,59.0,1
6329,7519.0,1.234235,3.01974,-4.304597,4.732795,3.624201,-1.357746,1.713445,-0.496358,-1.282858,-2.447469,2.101344,-4.609628,1.464378,-6.079337,-0.339237,2.581851,6.739384,3.042493,-2.721853,0.009061,-0.379068,-0.704181,-0.656805,-1.632653,1.488901,0.566797,-0.010016,0.146793,1.0,1


In [19]:
# Displaying the first five non-fraudulent transactions:
normal_transaction.head() # class-> 0 > Non-Fraud

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [20]:
# Checking for the distribution of Fraud and non-Fraud Transaction:

print(f"> The Total number of Non-Fraud Transactions are '{normal_transaction.shape[0]}'.\nNon-Fraud Transaction in Percentage: {(len(normal_transaction)*100)/len(card_data)}\n")
print(f"> The Total number of Fraud Transactions are '{Fraud_trnsaction.shape[0]}'.\nFraud Transaction in Percentage: {(len(Fraud_trnsaction)*100)/len(card_data)}")

> The Total number of Non-Fraud Transactions are '283253'.
Non-Fraud Transaction in Percentage: 99.83328986416473

> The Total number of Fraud Transactions are '473'.
Fraud Transaction in Percentage: 0.1667101358352777


In [21]:
# Summary statistics of the central tendency, dispersion, and shape of a dataset’s distribution for Fraud Transaction as well as Non-Fraud Transactions.
pd.concat([normal_transaction['Amount'].describe(),Fraud_trnsaction['Amount'].describe()],axis=1).round(2)

Unnamed: 0,Amount,Amount.1
count,283253.0,473.0
mean,88.41,123.87
std,250.38,260.21
min,0.0,0.0
25%,5.67,1.0
50%,22.0,9.82
75%,77.46,105.89
max,25691.16,2125.87
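Both columns in the summary above carry the same name, `Amount`, so it is easy to confuse which one describes fraud. A minimal sketch of one way to label them is the `keys` argument of `pd.concat`. The two short Series below are stand-ins (built from the first five `Amount` values shown earlier for each class) for `normal_transaction['Amount']` and `Fraud_trnsaction['Amount']`, and the names `normal_amount`, `fraud_amount` and `amount_summary` are hypothetical.

```python
import pandas as pd

# Stand-ins for normal_transaction['Amount'] and Fraud_trnsaction['Amount'].
normal_amount = pd.Series([149.62, 2.69, 378.66, 123.50, 69.99], name="Amount")
fraud_amount = pd.Series([0.0, 529.0, 239.93, 59.0, 1.0], name="Amount")

# `keys` supplies one column label per Series, so the summaries stay distinguishable.
amount_summary = pd.concat(
    [normal_amount.describe(), fraud_amount.describe()],
    axis=1,
    keys=["Non-Fraud", "Fraud"],
).round(2)

print(amount_summary)
```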


# **Data Pre-Processing & Model Selection :**

**Due to the highly imbalanced distribution of the dataset, comprising 283,253 non-fraudulent transactions (Class 0) and just 473 fraudulent transactions (Class 1), it's vital to tackle this imbalance before constructing a machine learning model. Therefore, I plan to use the 'Undersampling Technique,' which involves reducing the number of non-fraudulent transactions by randomly removing examples.**

##### **Since the transaction classes are imbalanced, I will build a sample dataset that contains the same number of transactions for both classes:**

In [22]:
print(f"Total number of Fraud Trans: {Fraud_trnsaction.shape[0]}\nTotal number of Non-Fraud Trans: {normal_transaction.shape[0]}")

Total number of Fraud Trans: 473
Total number of Non-Fraud Trans: 283253


In [23]:
# Randomly sample 473 non-fraudulent transactions:
normal_transaction_sample = normal_transaction.sample(n=473, random_state=42)

# Combine the sampled non-fraudulent transactions with the fraudulent transactions:
balanced_trans_data = pd.concat([normal_transaction_sample, Fraud_trnsaction])

# Shuffle the combined dataset:
balanced_trans_data = balanced_trans_data.sample(frac=1, random_state=42).reset_index(drop=True)

# Separating the features and target variables:
X_balanced=balanced_trans_data.drop('Class',axis=1)
y_balanced=balanced_trans_data['Class']

In [24]:
# Checking the number of transactions in each class using value_counts():
balanced_trans_data.Class.value_counts()

Class
0    473
1    473
Name: count, dtype: int64
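The undersampling step above hard-codes the minority count (473). As a sketch only, the same idea can be wrapped in a small helper that reads the minority size from the data. The function name `undersample` and the toy frame are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def undersample(df, label_col="Class", random_state=42):
    """Return a shuffled frame where each class has as many rows as the rarest class."""
    n_minority = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_minority, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return (
        pd.concat(parts)
        .sample(frac=1, random_state=random_state)  # shuffle the combined rows
        .reset_index(drop=True)
    )

# Toy frame with a 9:3 imbalance, standing in for card_data.
toy = pd.DataFrame({"Amount": range(12), "Class": [0] * 9 + [1] * 3})
balanced = undersample(toy)
print(balanced["Class"].value_counts())
```

In the notebook, the call would be `undersample(card_data)`, which should give the same 473/473 split as the manual steps.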

# ***Splitting the Dataset into Training and Test Datasets:***

In [25]:
X_train,X_test,y_train,y_test=train_test_split(X_balanced,y_balanced,test_size=0.2,random_state=42)

In [26]:
# from sklearn.feature_selection import RFECV
# from sklearn.model_selection import StratifiedKFold
# from sklearn.ensemble import RandomForestClassifier
# import numpy as np


# # Initialize the model for feature selection
# rf = RandomForestClassifier()

# # Initialize RFECV
# rfecv = RFECV(estimator=rf, step=1, cv=StratifiedKFold(5), scoring='accuracy')

# # Fit RFECV
# rfecv.fit(X_train, y_train)

# # Get selected features
# selected_features = np.where(rfecv.support_)[0]

# # Print selected feature indices
# print("Selected Features:", selected_features)

# # Print ranking of features
# print("Feature Rankings:", rfecv.ranking_)

# **Model Selection :**

### ***1. Logistic Regression***

In [27]:
# Train the Logistic Regression model:
lr_model = LogisticRegression(max_iter=1000, class_weight='balanced')
lr_model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
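The warning above suggests scaling the data. A minimal sketch of that suggestion follows, assuming a scikit-learn `Pipeline` with `StandardScaler`. It runs on synthetic stand-in data rather than on `X_train`/`y_train`, and the names `X_demo`, `y_demo` and `scaled_lr` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_balanced / y_balanced (946 rows, like the balanced sample).
X_demo, y_demo = make_classification(n_samples=946, n_features=10, random_state=42)
# Inflate one column to mimic a large-scale feature such as 'Time'.
X_demo[:, 0] *= 50_000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fitted on the training fold only and then reused on the test fold.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
scaled_lr.fit(X_tr, y_tr)

n_iter = scaled_lr.named_steps["logisticregression"].n_iter_[0]
print("Solver iterations used:", n_iter)
print("Test accuracy:", scaled_lr.score(X_te, y_te))
```

Putting the scaler inside the pipeline keeps test-set statistics out of training. Whether it removes the warning on the real data would need to be checked by rerunning the cell.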


In [28]:
# Performing the model prediction on the test dataset:
y_pred_lr_model=lr_model.predict(X_test)
y_pred_proba_lr = lr_model.predict_proba(X_test)[:, 1]

### ***2. Random Forest***

In [29]:
# Train the Random Forest model:
rf_model = RandomForestClassifier(class_weight='balanced', random_state=42)
rf_model.fit(X_train, y_train)

In [30]:
# Performing the model prediction on the test dataset:
y_pred_rf_model=rf_model.predict(X_test)
y_pred_proba_rf = rf_model.predict_proba(X_test)[:, 1]

### ***3. Support Vector Machine (SVM)***

In [31]:
# Train the Support Vector Machine model:
svm_model = SVC(kernel='rbf', probability=True, class_weight='balanced', random_state=42)
svm_model.fit(X_train,y_train)

In [32]:
# Performing the model prediction on the test dataset:
y_pred_svm_model=svm_model.predict(X_test)
y_pred_proba_svm = svm_model.predict_proba(X_test)[:, 1]

### ***4. Gradient Boosting***

In [33]:
# Train the Gradient Boosting model:
gb_model=GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)

In [34]:
# Performing the model prediction on the test dataset:
y_pred_gb_model=gb_model.predict(X_test)
y_pred_proba_gb = gb_model.predict_proba(X_test)[:, 1]

### ***5. Neural Network***

In [35]:
# Train the Neural Network model:
nn_model=MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42)
nn_model.fit(X_train, y_train)

In [36]:
# Performing the model prediction on the test dataset:
y_pred_nn_model=nn_model.predict(X_test)
y_pred_proba_nn = nn_model.predict_proba(X_test)[:, 1]

### ***Evaluate all the models:***

#### **1. Logistic Regression:**

In [49]:
# Calculate metrics
print("Calculating metrics for Logistic Regression Model:\n")
accuracy = accuracy_score(y_test, y_pred_lr_model)
precision = precision_score(y_test, y_pred_lr_model)
recall = recall_score(y_test, y_pred_lr_model)
f1 = f1_score(y_test, y_pred_lr_model)
conf_matrix = confusion_matrix(y_test, y_pred_lr_model)
roc_auc = roc_auc_score(y_test, y_pred_proba_lr)

# Print metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("Confusion Matrix:\n", conf_matrix)
print("ROC-AUC Score:", roc_auc)

Calculating metrics for Logistic Regression Model:

Accuracy: 0.9421052631578948
Precision: 0.9743589743589743
Recall: 0.8941176470588236
F1-Score: 0.9325153374233128
Confusion Matrix:
 [[103   2]
 [  9  76]]
ROC-AUC Score: 0.9833053221288516


#### **2. Random Forest:**

In [48]:
# Calculate metrics
print("Calculating metrics for Random Forest Model:\n")
accuracy = accuracy_score(y_test, y_pred_rf_model)
precision = precision_score(y_test, y_pred_rf_model)
recall = recall_score(y_test, y_pred_rf_model)
f1 = f1_score(y_test, y_pred_rf_model)
conf_matrix = confusion_matrix(y_test, y_pred_rf_model)
roc_auc = roc_auc_score(y_test, y_pred_proba_rf)

# Print metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("Confusion Matrix:\n", conf_matrix)
print("ROC-AUC Score:", roc_auc)

Calculating metrics for Random Forest Model:

Accuracy: 0.9421052631578948
Precision: 0.9868421052631579
Recall: 0.8823529411764706
F1-Score: 0.9316770186335404
Confusion Matrix:
 [[104   1]
 [ 10  75]]
ROC-AUC Score: 0.9773109243697479


#### **3. Support Vector Machine (SVM):**

In [39]:
# Calculate metrics
print("Calculating metrics for SVM Model:\n")
accuracy = accuracy_score(y_test, y_pred_svm_model)
precision = precision_score(y_test, y_pred_svm_model)
recall = recall_score(y_test, y_pred_svm_model)
f1 = f1_score(y_test, y_pred_svm_model)
conf_matrix = confusion_matrix(y_test, y_pred_svm_model)
roc_auc = roc_auc_score(y_test, y_pred_proba_svm)

# Print metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("Confusion Matrix:\n", conf_matrix)
print("ROC-AUC Score:", roc_auc)

Calculating metrics for SVM Model:

Accuracy: 0.5210526315789473
Precision: 0.47115384615384615
Recall: 0.5764705882352941
F1-Score: 0.5185185185185185
Confusion Matrix:
 [[50 55]
 [36 49]]
ROC-AUC Score: 0.5560784313725491
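The SVM scores above are close to chance. One plausible cause, stated here as an assumption that has not been verified on this dataset, is that the RBF kernel is distance-based and the unscaled `Time` column (values up to 172792) dominates every distance. The sketch below reproduces that effect on synthetic data only. All names in it (`signal`, `time_like`, `raw_svm`, `scaled_svm`) are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(42)
n = 600
signal = rng.normal(size=(n, 2))                   # two informative, PCA-like features
time_like = rng.uniform(0, 172_792, size=(n, 1))   # uninformative, large-scale feature
X_demo = np.hstack([time_like, signal])
y_demo = (signal[:, 0] + signal[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same kernel, with and without standardising the features first.
raw_svm = SVC(kernel="rbf", random_state=42).fit(X_tr, y_tr)
scaled_svm = make_pipeline(
    StandardScaler(), SVC(kernel="rbf", random_state=42)
).fit(X_tr, y_tr)

raw_acc = raw_svm.score(X_te, y_te)
scaled_acc = scaled_svm.score(X_te, y_te)
print(f"RBF SVM without scaling: {raw_acc:.3f}")
print(f"RBF SVM with scaling:    {scaled_acc:.3f}")
```

If the same pattern holds on the real data, wrapping `svm_model` in a scaling pipeline would be a natural next experiment.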


#### **4. Gradient Boosting Model:**

In [40]:
# Calculate metrics
print("Calculating metrics for Gradient Boosting Model:\n")
accuracy = accuracy_score(y_test, y_pred_gb_model)
precision = precision_score(y_test, y_pred_gb_model)
recall = recall_score(y_test, y_pred_gb_model)
f1 = f1_score(y_test, y_pred_gb_model)
conf_matrix = confusion_matrix(y_test, y_pred_gb_model)
roc_auc = roc_auc_score(y_test, y_pred_proba_gb)

# Print metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("Confusion Matrix:\n", conf_matrix)
print("ROC-AUC Score:", roc_auc)

Calculating metrics for Gradient Boosting Model:

Accuracy: 0.9421052631578948
Precision: 0.9743589743589743
Recall: 0.8941176470588236
F1-Score: 0.9325153374233128
Confusion Matrix:
 [[103   2]
 [  9  76]]
ROC-AUC Score: 0.9817366946778712


#### **5. Neural Network Model:**

In [41]:
# Calculate metrics
print("Calculating metrics for Neural Network Model:\n")
accuracy = accuracy_score(y_test, y_pred_nn_model)
precision = precision_score(y_test, y_pred_nn_model)
recall = recall_score(y_test, y_pred_nn_model)
f1 = f1_score(y_test, y_pred_nn_model)
conf_matrix = confusion_matrix(y_test, y_pred_nn_model)
roc_auc = roc_auc_score(y_test, y_pred_proba_nn)

# Print metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("Confusion Matrix:\n", conf_matrix)
print("ROC-AUC Score:", roc_auc)

Calculating metrics for Neural Network Model:

Accuracy: 0.7157894736842105
Precision: 0.6371681415929203
Recall: 0.8470588235294118
F1-Score: 0.7272727272727273
Confusion Matrix:
 [[64 41]
 [13 72]]
ROC-AUC Score: 0.7787114845938375
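The five evaluation cells above repeat the same metric calls. As a sketch, they could be folded into one helper that is applied to each fitted model. The function name `evaluate_model` and the synthetic data are assumptions for illustration. In the notebook, the dictionary would instead hold `lr_model`, `rf_model`, `svm_model`, `gb_model` and `nn_model`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

def evaluate_model(model, X_test, y_test):
    """Return the metrics used above for one fitted binary classifier."""
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]  # probability of class 1 (fraud)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_proba),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }

# Synthetic stand-in data so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=400, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
}
results = {
    name: evaluate_model(model.fit(X_tr, y_tr), X_te, y_te)
    for name, model in models.items()
}

for name, metrics in results.items():
    print(f"\n{name}")
    for metric, value in metrics.items():
        print(f"  {metric}: {value}")
```

Collecting the results in one dictionary also makes it easy to build a side-by-side comparison table later.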


# **Observation:**

The dataset contains transactions made by credit cards in September 2013 by European cardholders.
This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions before the duplicate transactions were removed. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Due to confidentiality issues, the dataset providers do not release the original features or further background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount, which can be used, for example, for example-dependent cost-sensitive learning. The feature 'Class' is the response variable; it takes the value 1 in case of fraud and 0 otherwise.

Given the class imbalance ratio, the dataset providers recommend measuring performance with the Area Under the Precision-Recall Curve (AUPRC), because plain accuracy is not meaningful for heavily imbalanced classification.
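AUPRC is not computed anywhere in this notebook. A minimal sketch of one common estimate, scikit-learn's `average_precision_score`, is shown below on a tiny hand-made example. The labels and scores are invented purely for illustration. In the notebook, the call would take `y_test` and one of the `y_pred_proba_*` arrays.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Invented example: 1 = fraud; scores are predicted fraud probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.15, 0.30, 0.25, 0.40, 0.35, 0.60, 0.90])

# Average precision summarises the precision-recall curve in one number.
ap = average_precision_score(y_true, y_score)

# A classifier with no skill scores roughly the fraud prevalence.
prevalence = y_true.mean()

print(f"Average precision: {ap:.3f}")
print(f"No-skill baseline (fraud prevalence): {prevalence:.3f}")
```

Unlike ROC-AUC, this score depends on the fraud prevalence, so it is most informative when computed on a test set that keeps the original class ratio rather than on the balanced sample.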

# **Analysis:**
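### ***Logistic Regression and Gradient Boosting perform almost identically (accuracy 0.942, precision 0.974, recall 0.894, F1-score 0.933), with ROC-AUC scores of 0.983 and 0.982 respectively. Random Forest matches their accuracy with slightly higher precision (0.987) but slightly lower recall (0.882). The Neural Network is clearly weaker (accuracy 0.716, ROC-AUC 0.779), and SVM performs close to chance (accuracy 0.521, ROC-AUC 0.556). Both of these models are sensitive to feature scale, so the unscaled 'Time' and 'Amount' columns are a likely cause.***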

# **Conclusion:**

**Logistic Regression and Gradient Boosting appear to be the best models based on the provided metrics. Both models have high accuracy, precision, recall, F1-score, and ROC-AUC scores, indicating they are good at correctly identifying both positive and negative classes and discriminating between them.**

**Since the metrics are nearly identical for Logistic Regression and Gradient Boosting, we might consider other factors such as model complexity and interpretability. Logistic Regression is simpler and more interpretable, while Gradient Boosting might capture more complex patterns at the cost of being more complex and potentially requiring more tuning and computational resources.**