
# CreditCard Fraud Detection using Machine Learning
This dataset contains credit card transactions described by 30 features plus a class label used for fraud detection. The goal is to train machine learning models to identify potentially fraudulent transactions in real time.
*   Time: The number of seconds elapsed between each transaction and the first transaction in the dataset.
*   V1 to V28: Anonymized features derived from a Principal Component Analysis (PCA) transformation of the original data. They summarize aspects of the transactions (such as location, type, and behavior patterns) while preserving privacy and reducing dimensionality.
*   Amount: The transaction amount in USD.
*   Class: The target label, where 0 represents a legitimate transaction and 1 indicates a fraudulent one.

This dataset is commonly used to develop and evaluate fraud detection algorithms. The anonymized and normalized features allow models to learn subtle patterns and anomalies associated with fraudulent activity, enabling proactive and accurate transaction monitoring.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
# load data
data = pd.read_csv('/content/creditcard.csv')

In [None]:
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0.0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0.0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0.0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0.0


In [None]:
data.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
219897,141928.0,0.725118,-2.548904,-1.143057,0.884748,-0.690906,1.295878,0.021953,0.12965,1.087925,...,0.152074,-0.710809,-0.302099,-0.135988,-0.581003,-0.01138,-0.110487,0.064208,653.88,0.0
219898,141928.0,-1.536097,0.220033,-1.41482,-0.622342,0.551558,1.268065,0.043718,1.35567,-0.146195,...,-0.021327,0.041924,0.184604,-1.673858,-0.959674,0.299898,0.371547,-0.218262,150.8,0.0
219899,141928.0,1.993809,-0.054577,-1.779255,0.287526,0.535501,-0.672557,0.471301,-0.26489,0.232152,...,0.066127,0.331656,0.031827,0.806145,0.354862,-0.28107,-0.046084,-0.063754,29.99,0.0
219900,141928.0,-1.295406,0.537327,0.440644,-1.271928,2.358895,-0.379169,1.214365,-0.425961,0.241044,...,-0.57762,-1.342617,-0.519606,-0.488752,0.682605,0.215675,-0.415435,-0.21065,4.56,0.0
219901,14192.0,,,,,,,,,,...,,,,,,,,,,


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 219902 entries, 0 to 219901
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    219902 non-null  float64
 1   V1      219901 non-null  float64
 2   V2      219901 non-null  float64
 3   V3      219901 non-null  float64
 4   V4      219901 non-null  float64
 5   V5      219901 non-null  float64
 6   V6      219901 non-null  float64
 7   V7      219901 non-null  float64
 8   V8      219901 non-null  float64
 9   V9      219901 non-null  float64
 10  V10     219901 non-null  float64
 11  V11     219901 non-null  float64
 12  V12     219901 non-null  float64
 13  V13     219901 non-null  float64
 14  V14     219901 non-null  float64
 15  V15     219901 non-null  float64
 16  V16     219901 non-null  float64
 17  V17     219901 non-null  float64
 18  V18     219901 non-null  float64
 19  V19     219901 non-null  float64
 20  V20     219901 non-null  float64
 21  V21     21

In [None]:
data.isnull().sum()

Unnamed: 0,0
Time,0
V1,1
V2,1
V3,1
V4,1
V5,1
V6,1
V7,1
V8,1
V9,1


Replacing the null values with each column's median

In [None]:
data = data.fillna(data.median(numeric_only=True))

In [None]:
data.isnull().sum()

Unnamed: 0,0
Time,0
V1,0
V2,0
V3,0
V4,0
V5,0
V6,0
V7,0
V8,0
V9,0
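The median fill above also assigns a class label to the incomplete last row: the median of `Class` is 0, so that row ends up labeled legitimate. The sketch below contrasts this with dropping rows whose label is missing. The `toy` frame and its values are hypothetical, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the last row is incomplete, like the truncated row in the CSV
toy = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0, 3.0],
    "V1": [-1.2, 0.4, 0.9, np.nan],
    "Amount": [10.0, 25.0, 40.0, np.nan],
    "Class": [0.0, 1.0, 0.0, np.nan],
})

# Option 1 (used in this notebook): fill every missing value with the column median
filled = toy.fillna(toy.median(numeric_only=True))

# Option 2 (an alternative, not the notebook's method): drop rows with no label
dropped = toy.dropna(subset=["Class"])

print(filled.tail(1))
print(len(dropped))
```

With a single incomplete row among roughly 220,000, either choice has little effect on the model. Dropping simply avoids inventing a label.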


In [None]:
data['Class'].value_counts()

Unnamed: 0_level_0,count
Class,Unnamed: 1_level_1
0.0,219496
1.0,406
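To make the imbalance concrete, the short calculation below uses the two counts shown above to derive the fraud rate and the accuracy of a trivial rule that labels every transaction as legitimate. It is plain arithmetic on the reported counts, not part of the original notebook.

```python
# Class counts reported by value_counts() above
n_legit = 219_496
n_fraud = 406
n_total = n_legit + n_fraud

# Share of transactions that are fraudulent
fraud_rate = n_fraud / n_total

# A rule that always predicts "legitimate" is correct on every legitimate row
baseline_accuracy = n_legit / n_total

print(f"fraud rate: {fraud_rate:.4%}")
print(f"always-legitimate accuracy: {baseline_accuracy:.4%}")
```

Because the trivial rule already scores above 99.8%, accuracy on the raw data says little about fraud detection, which motivates balancing the classes before training.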


In [None]:
# separate legitimate and fraudulent transactions
legit = data[data.Class == 0]
fraud = data[data.Class == 1]

In [None]:
# a notebook cell displays only its last expression, so print both shapes
print(legit.shape, fraud.shape)

(219496, 31) (406, 31)

In [None]:
legit.Amount.describe()

Unnamed: 0,Amount
count,219496.0
mean,90.588663
std,250.576422
min,0.0
25%,6.0
50%,23.32
75%,79.9
max,19656.53


In [None]:
fraud.Amount.describe()

Unnamed: 0,Amount
count,406.0
mean,125.438079
std,258.067068
min,0.0
25%,1.0
50%,13.385
75%,106.41
max,2125.87


In [None]:
# compare the mean feature values of legitimate and fraudulent transactions
data.groupby('Class').mean()

Unnamed: 0_level_0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
Class,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0.0,76732.340202,-0.06537,-0.017539,0.233137,0.044138,-0.06705,0.037358,-0.019263,0.004319,0.009806,...,0.012013,-0.009628,-0.029191,-0.011313,0.001766,0.043911,0.003433,0.000174,0.002013,90.588663
1.0,65249.660099,-5.461637,4.021435,-7.534374,4.674517,-3.872341,-1.361593,-6.357473,0.662369,-2.703405,...,0.388887,0.779198,0.009403,-0.045996,-0.084434,0.053858,0.043356,0.192421,0.061715,125.438079


In [None]:
# undersample legitimate transactions to match the number of fraud cases (406)
legit_sample = legit.sample(n=len(fraud))

In [None]:
# concatenate the two DataFrames into one balanced dataset
new_data = pd.concat([legit_sample, fraud], axis=0)

In [None]:
new_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
119306,75402.0,-0.693997,1.258782,1.275151,0.12245,1.074558,-0.954595,1.255104,-0.343359,-0.609092,...,-0.065553,-0.071844,-0.510623,-0.089283,0.718981,-0.408651,-0.116388,-0.130706,1.0,0.0
19600,30404.0,-1.319096,-0.012312,0.784408,-1.196982,0.784468,1.101766,0.097044,0.229825,-1.846092,...,0.374451,1.388733,0.134226,-0.965992,0.038365,0.086655,-0.216163,-0.046991,15.0,0.0
169125,119543.0,-0.800161,0.029992,-0.965533,0.278063,2.53501,-1.283765,0.92846,-0.265118,-0.849835,...,0.225187,0.415368,-0.094285,0.261593,0.164731,0.832103,0.036356,0.196179,54.0,0.0
147718,88887.0,-0.059392,1.392108,-1.234085,-0.904778,1.879639,-1.150581,1.978255,-0.757425,-0.392672,...,0.129627,0.895018,-0.193029,0.734237,-0.787754,0.357525,0.057519,-0.061497,0.99,0.0
22837,32454.0,-0.924412,0.493494,0.629622,-2.022078,-0.496354,-0.908943,0.698856,0.35716,0.940285,...,-0.053368,-0.371666,0.015711,-0.211647,0.173087,-1.003768,-0.020661,0.066219,88.97,0.0


In [None]:
new_data.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
215953,140293.0,0.951025,3.252926,-5.039105,4.632411,3.014501,-1.34957,0.98094,-1.819539,-2.099049,...,1.404524,-0.760549,0.358292,-1.185942,-1.286177,0.000365,0.169662,0.108276,0.77,1.0
215984,140308.0,-4.861747,-2.72266,-4.656248,2.502005,-2.008346,0.615422,-3.48568,1.878856,-1.116268,...,1.138876,1.033664,-0.806199,-1.511046,-0.191731,0.080999,1.215152,-0.923142,592.9,1.0
218442,141320.0,-6.352337,-2.370335,-4.875397,2.335045,-0.809555,-0.413647,-4.082308,2.239089,-1.98636,...,1.325218,1.226745,-1.485217,-1.470732,-0.240053,0.112972,0.910591,-0.650944,195.66,1.0
219025,141565.0,0.114965,0.766762,-0.494132,0.116772,0.868169,-0.477982,0.438496,0.063073,-0.186207,...,-0.284413,-0.706865,0.131405,0.600742,-0.604264,0.262938,0.099145,0.01081,4.49,1.0
219892,141925.0,0.120301,1.974141,-0.434087,5.390793,1.289684,0.28059,0.221963,0.067827,-1.387054,...,-0.03869,0.204554,-0.167313,0.791547,-0.223675,0.473223,-0.160202,0.065039,0.76,1.0


In [None]:
new_data['Class'].value_counts()

Unnamed: 0_level_0,count
Class,Unnamed: 1_level_1
0.0,406
1.0,406
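The undersampling steps above can be sketched end to end on a small frame. The data here is hypothetical, and `random_state=42` is an assumption added so the sample is repeatable. The notebook's own call does not fix a seed, so its sampled rows change on every run.

```python
import pandas as pd

# Hypothetical imbalanced toy data: 16 legitimate rows, 4 fraudulent rows
toy = pd.DataFrame({
    "Amount": list(range(20)),
    "Class": [0] * 16 + [1] * 4,
})

legit = toy[toy.Class == 0]
fraud = toy[toy.Class == 1]

# Draw as many legitimate rows as there are fraud rows; the seed makes this repeatable
legit_sample = legit.sample(n=len(fraud), random_state=42)

# Stack the sampled legitimate rows on top of all fraud rows
balanced = pd.concat([legit_sample, fraud], axis=0)
counts = balanced["Class"].value_counts()
print(counts)
```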


In [None]:
new_data.groupby('Class').mean()

Unnamed: 0_level_0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
Class,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0.0,76862.608374,0.009333,-0.001768,0.261833,-0.025568,-0.120349,0.141386,0.002093,0.013535,0.116868,...,-0.049593,0.01617,0.035654,-0.014434,0.03986,0.079683,0.010447,-0.027728,0.016251,87.546355
1.0,65249.660099,-5.461637,4.021435,-7.534374,4.674517,-3.872341,-1.361593,-6.357473,0.662369,-2.703405,...,0.388887,0.779198,0.009403,-0.045996,-0.084434,0.053858,0.043356,0.192421,0.061715,125.438079


In [None]:
# build features and target from the balanced dataset (812 rows), not the full data
x = new_data.drop(columns="Class")
y = new_data["Class"]

In [None]:
x

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
63070,50541.0,-1.563646,-0.653530,0.056558,-1.246019,0.703834,-1.730407,-0.280172,0.357150,-1.613935,...,0.259049,0.481951,0.670470,-0.179435,0.025908,0.038546,-0.228866,-0.002628,-0.120497,29.90
190047,128665.0,1.989645,-0.278832,-1.328960,0.269425,0.071669,-0.571115,0.066824,-0.102793,0.475875,...,-0.216893,-0.213059,-0.563629,0.217298,-0.473496,-0.223999,0.286830,-0.084663,-0.075569,27.69
75014,55858.0,-1.869400,-1.500024,1.458852,0.462234,2.168078,-1.527987,-0.912499,0.262681,-0.763078,...,0.705053,0.205776,-0.068444,0.367796,0.060149,-0.347492,0.796994,-0.020654,0.145228,20.36
16069,27491.0,-2.659997,2.258158,0.039235,-1.344288,-1.713783,-1.157119,-1.014566,1.883069,-0.250930,...,-0.080514,0.051247,-0.220822,0.202609,0.527784,-0.181625,0.715695,0.062230,0.096042,0.77
135083,81100.0,-2.790134,0.996522,0.193384,-0.565463,0.576888,0.099757,0.984646,-0.216879,0.938228,...,0.526619,-0.713102,-0.700243,-0.164193,-1.023063,0.173332,0.012241,-0.340986,-1.001671,18.88
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
215953,140293.0,0.951025,3.252926,-5.039105,4.632411,3.014501,-1.349570,0.980940,-1.819539,-2.099049,...,-0.381444,1.404524,-0.760549,0.358292,-1.185942,-1.286177,0.000365,0.169662,0.108276,0.77
215984,140308.0,-4.861747,-2.722660,-4.656248,2.502005,-2.008346,0.615422,-3.485680,1.878856,-1.116268,...,0.285559,1.138876,1.033664,-0.806199,-1.511046,-0.191731,0.080999,1.215152,-0.923142,592.90
218442,141320.0,-6.352337,-2.370335,-4.875397,2.335045,-0.809555,-0.413647,-4.082308,2.239089,-1.986360,...,0.186898,1.325218,1.226745,-1.485217,-1.470732,-0.240053,0.112972,0.910591,-0.650944,195.66
219025,141565.0,0.114965,0.766762,-0.494132,0.116772,0.868169,-0.477982,0.438496,0.063073,-0.186207,...,0.062199,-0.284413,-0.706865,0.131405,0.600742,-0.604264,0.262938,0.099145,0.010810,4.49


In [None]:
y

Unnamed: 0,Class
63070,0.0
190047,0.0
75014,0.0
16069,0.0
135083,0.0
...,...
215953,1.0
215984,1.0
218442,1.0
219025,1.0


In [None]:
# split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=2)

In [None]:
print(x.shape,x_train.shape,x_test.shape)

(812, 30) (649, 30) (163, 30)
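The split sizes above follow from `test_size=0.2` applied to 812 rows, and `stratify=y` keeps the two classes evenly represented in both parts. The sketch below reproduces the shapes on synthetic data. The random feature matrix is a stand-in, not the real transactions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the balanced data: 812 rows, 30 features
rng = np.random.default_rng(0)
X = rng.normal(size=(812, 30))
y = np.array([0] * 406 + [1] * 406)

# Same arguments as the notebook's split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

print(X.shape, X_train.shape, X_test.shape)
# Number of fraud-labeled rows in each part; stratification keeps them near half
print(int(y_train.sum()), int(y_test.sum()))
```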


In [None]:
# plain logistic regression; superseded by the scaled pipeline in the next cell
model = LogisticRegression()

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Create a pipeline: scale the data first, then apply logistic regression
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000)  # Increase max_iter to ensure convergence
)

# Fit the model
model.fit(x_train, y_train)

# (Optional) Predict on test data
y_pred = model.predict(x_test)
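One reason to wrap the scaler and classifier in a pipeline is that the scaling statistics are learned from the training rows only and then reused on the test rows. The sketch below checks this on synthetic data from `make_classification`. The inflated first column is an assumption meant to mimic a large-valued feature such as `Time` or `Amount`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (not the credit card dataset)
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X[:, 0] *= 1000.0  # assumption: one column on a much larger scale

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# make_pipeline names each step after its lowercased class name
scaler = pipe.named_steps["standardscaler"]
print(np.allclose(scaler.mean_, X_tr.mean(axis=0)))

# Probability of the positive class for each test row
proba = pipe.predict_proba(X_te)[:, 1]
print(proba[:5])
```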

In [None]:
# accuracy on training data
x_train_prediction = model.predict(x_train)
# accuracy_score expects (y_true, y_pred)
training_data_accuracy = accuracy_score(y_train, x_train_prediction)

In [None]:
print('Accuracy on Training data : ', training_data_accuracy)

Accuracy on Training data :  0.9476117103235747


In [None]:
# accuracy on test data
x_test_prediction = model.predict(x_test)
# accuracy_score expects (y_true, y_pred)
test_data_accuracy = accuracy_score(y_test, x_test_prediction)

In [None]:
print('Accuracy score on Test Data : ', test_data_accuracy)

Accuracy score on Test Data :  0.9447852760736196


# Project Report
Credit card fraud poses a significant threat to both consumers and financial institutions, leading to substantial financial losses and potential reputational damage. To mitigate these risks, machine learning techniques have become a crucial tool in the detection of fraudulent transactions. In this project, we implement logistic regression, a widely used classification algorithm, to distinguish between legitimate and fraudulent credit card transactions based on various transaction features.
# Introduction
Credit card fraud is a major concern for both consumers and financial institutions. Fraudulent transactions can lead to financial losses and damage to the reputation of financial institutions. Machine learning techniques have been used extensively to detect fraudulent transactions. In this project, we use logistic regression to classify transactions as either legitimate or fraudulent based on their features.
# Data
The data used in this project is a CSV file containing credit card transaction data. The full dataset has 31 columns and 284,807 rows; the copy loaded in this notebook contains 219,902 rows, the last of which is incomplete. The "Class" column is the target variable, which indicates whether the transaction is legitimate (Class = 0) or fraudulent (Class = 1).
# Preprocessing
Before training the model, we replace missing values with each column's median and separate the legitimate and fraudulent transactions. Since the data is imbalanced, with far more legitimate transactions (219,496) than fraudulent ones (406), we undersample the legitimate transactions to 406 so that the classes are balanced. We then split the balanced data into training (80%) and testing (20%) sets using the train_test_split() function, stratified by class.
# Model
We use logistic regression to classify transactions as either legitimate or fraudulent based on their features. Logistic regression is a widely used classification algorithm that models the probability of an event occurring based on input features. The model is built as a scikit-learn pipeline that standardizes the features with StandardScaler and then fits the LogisticRegression estimator on the training data. The trained model is then used to predict the target variable for the testing data.
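As a minimal illustration of how logistic regression turns features into a probability, the sketch below applies the sigmoid function to a weighted sum of two inputs. The weights, bias, and inputs are hypothetical values chosen for the example, not coefficients from the trained model.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for two standardized features
weights = np.array([0.8, -1.5])
bias = -0.2

# Two hypothetical transactions, one per row
x = np.array([
    [0.0, 0.0],
    [1.0, -1.0],
])

# Linear score, then probability of the positive (fraud) class
prob = sigmoid(x @ weights + bias)

# Default decision rule: predict fraud when the probability is at least 0.5
labels = (prob >= 0.5).astype(int)
print(prob, labels)
```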
# Evaluation
The performance of the model is evaluated using the accuracy metric, which is the fraction of correctly classified transactions. The accuracy on the training and testing data is calculated using the accuracy_score() function from scikit-learn.
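To show what accuracy does and does not capture, the sketch below computes it next to precision and recall on a small set of hand-written labels. The labels are hypothetical and unrelated to the notebook's predictions. Precision and recall are shown only for comparison and are not metrics the project reports.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical ground truth and predictions (1 = fraud)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / all
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)

print(tn, fp, fn, tp)
print(accuracy, precision, recall)
```

Here accuracy is 0.7 while only half of the fraud cases are caught, which is why recall is often inspected alongside accuracy in fraud detection.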
# Conclusion
In this project, we used logistic regression to detect fraudulent credit card transactions. The model reached about 94.8% accuracy on the training data and 94.5% on the test data of the balanced, undersampled dataset. The close agreement between the two scores suggests that the model separates the two classes well on this sample without overfitting.
