# Credit Card Fraud Detection using Machine Learning
This dataset contains credit card transactions described by 30 features plus a class label used for fraud detection (31 columns in total). The goal is to train machine learning models that identify potentially fraudulent transactions.
*   Time: The number of seconds elapsed between each transaction and the first transaction in the dataset.
*   V1 to V28: Anonymized features produced by a Principal Component Analysis (PCA) transformation of the original data. Because they are principal components, the individual columns have no direct interpretation; the transformation keeps the original transaction attributes confidential.
*   Amount: The transaction amount.
*   Class: The target label, where 0 represents a legitimate transaction and 1 indicates a fraudulent one.

This dataset is commonly used to develop and evaluate fraud detection algorithms. Fraudulent transactions are rare in it, so a model has to learn the patterns associated with fraud from a heavily imbalanced class distribution.

In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [4]:
# load data
data = pd.read_csv('/content/creditcard.csv')

In [5]:
data.head()

index,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
1,0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0.0
2,1,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0.0
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0.0
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0.0


In [6]:
data.tail()

index,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
39697,39927,-1.466679,2.425732,0.877724,3.950765,0.762477,1.162748,0.575625,-0.288354,-1.35799,...,0.186051,0.57101,-0.107517,-0.754306,-0.752828,0.113179,-2.079421,-0.809173,1.5,0.0
39698,39927,-0.523165,-0.100021,0.892966,-1.900405,-0.15687,-0.783894,0.917683,-0.308345,-1.305284,...,-0.082504,-0.414677,-0.063392,-0.087455,-0.303383,-0.682889,-0.178417,-0.137169,100.92,0.0
39699,39928,-2.768425,-1.007072,2.151127,0.117797,1.283178,1.869731,-0.56224,0.820374,0.348797,...,-0.182963,0.77821,0.904077,-1.288631,0.212441,0.483975,-0.027614,-0.582813,11.99,0.0
39700,39928,1.201327,0.158614,-0.325263,0.471667,0.086446,-0.770357,0.422151,-0.205277,-0.451865,...,0.027664,-0.018485,-0.199382,0.053605,0.683829,0.428416,-0.077342,-0.006394,45.0,0.0
39701,39929,1.097669,-1.315782,0.659681,-0.683915,-1.342612,0.332629,-1.1109,0.194811,-0.248825,...,,,,,,,,,,


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39702 entries, 0 to 39701
Data columns (total 31 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Time    39702 non-null  int64  
 1   V1      39702 non-null  float64
 2   V2      39702 non-null  float64
 3   V3      39702 non-null  float64
 4   V4      39702 non-null  float64
 5   V5      39702 non-null  float64
 6   V6      39702 non-null  float64
 7   V7      39702 non-null  float64
 8   V8      39702 non-null  float64
 9   V9      39702 non-null  float64
 10  V10     39702 non-null  float64
 11  V11     39702 non-null  float64
 12  V12     39702 non-null  float64
 13  V13     39701 non-null  float64
 14  V14     39701 non-null  float64
 15  V15     39701 non-null  float64
 16  V16     39701 non-null  float64
 17  V17     39701 non-null  float64
 18  V18     39701 non-null  float64
 19  V19     39701 non-null  float64
 20  V20     39701 non-null  float64
 21  V21     39701 non-null  float64
 22

In [8]:
data.isnull().sum()

column,null_count
Time,0
V1,0
V2,0
V3,0
V4,0
V5,0
V6,0
V7,0
V8,0
V9,0


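Replace the null values with each column's median. This also assigns the median label (0) to the truncated last row, whose Class value is missing.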

In [9]:
data = data.fillna(data.median(numeric_only=True))
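
Median imputation also fills in a label for a row whose Class is missing. A possible alternative, sketched here on a small made-up frame (the `toy` data and the names `labelled`, `feature_cols` and `cleaned` are illustrative assumptions, not part of the notebook), is to drop unlabelled rows and impute only the feature columns:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the notebook's `data`; the last row is truncated.
toy = pd.DataFrame({
    "V1": [0.1, -1.2, 0.7, 1.1],
    "Amount": [10.0, np.nan, 3.0, np.nan],
    "Class": [0.0, 1.0, 0.0, np.nan],
})

# Drop rows whose label is missing instead of imputing a label for them.
labelled = toy.dropna(subset=["Class"])

# Impute the remaining gaps in the feature columns only, using column medians.
feature_cols = labelled.columns.drop("Class")
cleaned = labelled.copy()
cleaned[feature_cols] = cleaned[feature_cols].fillna(cleaned[feature_cols].median())

print(cleaned.shape)
print(cleaned.isnull().sum().sum())
```

Whether dropping or imputing is preferable depends on how many rows are affected; here only one row is incomplete, so either choice changes the data very little.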

In [10]:
data.isnull().sum()

column,null_count
Time,0
V1,0
V2,0
V3,0
V4,0
V5,0
V6,0
V7,0
V8,0
V9,0


In [11]:
data['Class'].value_counts()

Class,count
0.0,39598
1.0,104


In [12]:
# separate legitimate and fraudulent transactions
legit = data[data.Class == 0]
fraud = data[data.Class == 1]

In [13]:
# print both shapes; a bare expression only displays the last one in a cell
print(legit.shape, fraud.shape)

(39598, 31) (104, 31)

In [14]:
legit.Amount.describe()

statistic,Amount
count,39598.0
mean,87.419084
std,234.511642
min,0.0
25%,7.42
50%,23.5
75%,79.0
max,7879.42


In [15]:
fraud.Amount.describe()

statistic,Amount
count,104.0
mean,97.070769
std,255.01216
min,0.0
25%,1.0
50%,3.775
75%,99.99
max,1809.68


In [16]:
# compare the per-class feature means of legitimate and fraudulent transactions
data.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0.0,25515.089702,-0.194278,0.036933,0.74416,0.174523,-0.217442,0.107157,-0.093208,0.031787,0.222422,...,0.046045,-0.031388,-0.111706,-0.03947,0.007852,0.135896,0.022777,0.005987,0.00392,87.419084
1.0,20683.201923,-7.697309,5.766703,-10.853666,5.86585,-5.424996,-2.275717,-7.641867,3.827809,-2.924282,...,0.663205,0.626795,-0.345972,-0.342944,-0.230514,0.299242,0.176522,0.811431,0.099914,97.070769


In [17]:
# draw a random sample of 406 legitimate transactions (about four per fraud case)
legit_sample = legit.sample(n=406)

In [18]:
# concatenate the sampled legitimate rows with all fraud rows
new_data = pd.concat([legit_sample, fraud], axis=0)
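
The sample above is not seeded and leaves roughly four legitimate rows per fraud row. If a reproducible, fully balanced set is wanted, one option is sketched below on made-up data (the `toy` frame, the `_toy` names and the seed value are assumptions for illustration only):

```python
import pandas as pd

# Toy frame: 16 legitimate rows (Class 0) and 4 fraud rows (Class 1).
toy = pd.DataFrame({
    "Amount": [float(i) for i in range(20)],
    "Class": [1.0 if i % 5 == 0 else 0.0 for i in range(20)],
})
legit_toy = toy[toy.Class == 0]
fraud_toy = toy[toy.Class == 1]

# Sample as many legitimate rows as there are fraud rows; the seed makes
# the draw repeatable across runs.
legit_down = legit_toy.sample(n=len(fraud_toy), random_state=2)

# Stack the two groups into one balanced frame.
balanced = pd.concat([legit_down, fraud_toy], axis=0)
print(balanced["Class"].value_counts())
```

Undersampling discards most of the legitimate rows, so it trades information for balance; class weights are another common way to handle the imbalance without dropping data.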

In [19]:
new_data.head()

index,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
3274,2822,1.175695,0.114254,0.4946,0.459372,-0.283004,-0.210279,-0.157352,0.111676,-0.134934,...,-0.185103,-0.562422,0.160712,-0.002423,0.096507,0.108889,-0.019816,0.004074,0.89,0.0
26519,34113,1.069408,-0.071493,0.055796,1.006942,-0.283515,-0.467093,0.098525,0.010921,-0.005937,...,0.086298,-0.002796,-0.201672,-0.026916,0.614367,-0.328314,-0.016695,0.017068,84.0,0.0
5701,5987,-0.14812,2.521435,-2.350749,1.926179,0.475644,-1.650702,0.126441,0.56526,0.620028,...,-0.183431,-0.14265,0.271092,0.013054,-0.468302,-0.423918,0.108598,-0.053975,0.99,0.0
15232,26591,-3.140395,1.14887,-2.011436,1.154779,-1.006603,-0.693769,-0.452055,1.675695,-0.465208,...,0.10829,0.563073,0.346209,0.016963,-0.589063,-0.368209,0.38818,-0.024777,99.99,0.0
2983,2532,1.191047,0.129128,0.314928,0.94624,-0.31525,-0.421447,-0.092692,0.065947,0.054931,...,-0.035299,-0.209472,-0.06339,-0.058065,0.498218,-0.431416,0.005936,0.010553,14.95,0.0


In [20]:
new_data.tail()

index,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
30473,35942,-4.194074,4.382897,-5.118363,4.45523,-4.812621,-1.224645,-7.281328,3.33225,-3.679659,...,1.550473,0.614573,0.028521,0.013704,-0.149512,-0.131687,0.473934,0.473757,14.46,1.0
30496,35953,-4.844372,5.649439,-6.730396,5.252842,-4.409566,-1.740767,-6.311699,3.449167,-5.416284,...,1.194888,-0.845753,0.190674,-0.216443,-0.325033,-0.270328,0.210214,0.391855,111.7,1.0
31002,36170,-5.685013,5.776516,-7.064977,5.902715,-4.715564,-1.755633,-6.958679,3.877795,-5.541529,...,1.128641,-0.96296,-0.110045,-0.177733,-0.089175,-0.049447,0.303445,0.21938,111.7,1.0
33276,37167,-7.923891,-5.19836,-3.000024,4.420666,2.272194,-3.394483,-5.283435,0.131619,0.658176,...,-0.734308,-0.599926,-4.908301,0.41017,-1.16766,0.520508,1.937421,-1.552593,12.31,1.0
39183,39729,-0.964567,-1.643541,-0.187727,1.158253,-2.458336,0.852222,2.785163,-0.303609,0.940006,...,0.44718,0.536204,1.634061,0.203839,0.218749,-0.221886,-0.308555,-0.1645,776.83,1.0


In [21]:
new_data['Class'].value_counts()

Class,count
0.0,406
1.0,104


In [22]:
new_data.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0.0,26677.310345,-0.231524,-0.019583,0.775719,0.109968,-0.269062,0.079394,-0.052011,0.042344,0.179464,...,0.041258,-0.005943,-0.064401,-0.012452,0.037293,0.1428,0.001123,0.00141,0.000452,92.167118
1.0,20683.201923,-7.697309,5.766703,-10.853666,5.86585,-5.424996,-2.275717,-7.641867,3.827809,-2.924282,...,0.663205,0.626795,-0.345972,-0.342944,-0.230514,0.299242,0.176522,0.811431,0.099914,97.070769


In [23]:
# features and target; note that the full data is used here, not the undersampled new_data
x = data.drop(columns="Class")
y = data["Class"]

In [24]:
x

index,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62
1,0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69
2,1,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.524980,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.208038,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39697,39927,-1.466679,2.425732,0.877724,3.950765,0.762477,1.162748,0.575625,-0.288354,-1.357990,...,0.129962,0.186051,0.571010,-0.107517,-0.754306,-0.752828,0.113179,-2.079421,-0.809173,1.50
39698,39927,-0.523165,-0.100021,0.892966,-1.900405,-0.156870,-0.783894,0.917683,-0.308345,-1.305284,...,-0.071640,-0.082504,-0.414677,-0.063392,-0.087455,-0.303383,-0.682889,-0.178417,-0.137169,100.92
39699,39928,-2.768425,-1.007072,2.151127,0.117797,1.283178,1.869731,-0.562240,0.820374,0.348797,...,-0.491556,-0.182963,0.778210,0.904077,-1.288631,0.212441,0.483975,-0.027614,-0.582813,11.99
39700,39928,1.201327,0.158614,-0.325263,0.471667,0.086446,-0.770357,0.422151,-0.205277,-0.451865,...,-0.023074,0.027664,-0.018485,-0.199382,0.053605,0.683829,0.428416,-0.077342,-0.006394,45.00


In [25]:
y

index,Class
0,0.0
1,0.0
2,0.0
3,0.0
4,0.0
...,...
39697,0.0
39698,0.0
39699,0.0
39700,0.0


In [26]:
# split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=2)

In [27]:
print(x.shape,x_train.shape,x_test.shape)

(39702, 30) (31761, 30) (7941, 30)
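
With so few fraud cases, the stratify argument matters: it keeps the fraud rate the same in both parts of the split. A minimal check of that behaviour on synthetic data (the arrays `X_toy` and `y_toy` and the 5% positive rate are made up for this sketch) could look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 50 of them positive (5%).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:50] = 1

# Same split settings as the notebook cell above, applied to the toy arrays.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=2
)

# The positive rate should be close to 5% in the full data and in both parts.
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```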


In [28]:
# plain estimator; it is replaced by the scaled pipeline in the next cell
model = LogisticRegression()

In [29]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Create a pipeline: scale the data first, then apply logistic regression
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000)  # Increase max_iter to ensure convergence
)

# Fit the model
model.fit(x_train, y_train)

# (Optional) Predict on test data
y_pred = model.predict(x_test)

In [30]:
# accuracy on training data (accuracy_score expects y_true first, then y_pred)
x_train_prediction = model.predict(x_train)
training_data_accuracy = accuracy_score(y_train, x_train_prediction)

In [31]:
print('Accuracy on Training data : ', training_data_accuracy)

Accuracy on Training data :  0.9988035641195177


In [32]:
# accuracy on test data (accuracy_score expects y_true first, then y_pred)
x_test_prediction = model.predict(x_test)
test_data_accuracy = accuracy_score(y_test, x_test_prediction)

In [33]:
print('Accuracy score on Test Data : ', test_data_accuracy)

Accuracy score on Test Data :  0.9989925702052638
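
Accuracy is dominated by the legitimate class, so it helps to look at the fraud class directly. The sketch below does this with a confusion matrix and per-class precision and recall. It runs on synthetic data from make_classification rather than on the notebook's split, and every name in it (`X_syn`, `y_syn`, `clf`, `cm`) is an assumption for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced problem: roughly 2% of the rows are positive.
X_syn, y_syn = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=2
)

# Same kind of pipeline as in the notebook: scaling, then logistic regression.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_hat, labels=[0, 1])
print(cm)

# Precision and recall per class; the row for class 1 is the one of interest.
print(classification_report(y_te, y_hat, digits=3, zero_division=0))
```

On the notebook's data, the same two calls could be applied to y_test and x_test_prediction.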


In [35]:
from sklearn.ensemble import RandomForestClassifier

In [37]:
import pickle

In [38]:
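# Train a random forest on the same split; this overwrites the logistic
# regression pipeline previously stored in model
model = RandomForestClassifier()
model.fit(x_train, y_train)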

with open('credit.pkl', 'wb') as f:
    pickle.dump(model, f)
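
A saved model is only useful if it can be loaded back and gives the same predictions. The following round-trip sketch trains a small forest on synthetic data and writes to a temporary file; the file name `credit_demo.pkl` and all variable names are hypothetical and chosen so the example does not touch credit.pkl:

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem so the example runs on its own.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_syn, y_syn)

# Write the fitted model to a temporary location.
path = os.path.join(tempfile.mkdtemp(), "credit_demo.pkl")
with open(path, "wb") as f:
    pickle.dump(rf, f)

# Load it back and confirm that it predicts exactly like the original.
with open(path, "rb") as f:
    restored = pickle.load(f)

same = bool((restored.predict(X_syn) == rf.predict(X_syn)).all())
print(same)
```

Pickle files should only be loaded from trusted sources, and loading generally expects a scikit-learn version compatible with the one used for saving.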

In [39]:
from google.colab import files
files.download('credit.pkl')


# Project Report

---


Credit card fraud poses a significant threat to both consumers and financial institutions, leading to substantial financial losses and potential reputational damage. To mitigate these risks, machine learning techniques have become a crucial tool in the detection of fraudulent transactions. In this project, we implement logistic regression, a widely used classification algorithm, to distinguish between legitimate and fraudulent credit card transactions based on various transaction features. A random forest is also trained and saved to disk, but it is not evaluated here.
### Data
The data used in this project is a CSV file of credit card transactions with 31 columns. The full dataset has 284,807 rows; the copy loaded in this notebook contains 39,702 rows, and its last row is incomplete. The "Class" column is the target variable, which indicates whether the transaction is legitimate (Class = 0) or fraudulent (Class = 1). Only 104 of the loaded rows are fraudulent.
### Preprocessing
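Before training the model, we replace missing values with column medians and separate the legitimate and fraudulent transactions to compare them. The data is heavily imbalanced: 39,598 legitimate transactions against 104 fraudulent ones. We draw a random sample of 406 legitimate transactions and combine it with the fraud cases, but this undersampled set is only inspected; the model is trained on the full data. We split the full data into training (80%) and testing (20%) sets using the train_test_split() function, stratified on the class label so that both sets keep the same fraud rate.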
### Model
We use logistic regression to classify transactions as either legitimate or fraudulent based on their features. Logistic regression is a widely used classification algorithm that models the probability of an event occurring based on input features. The features are first standardized with StandardScaler and then passed to LogisticRegression (with max_iter=1000) from scikit-learn, both combined in a single pipeline. The pipeline is fitted on the training data and then used to predict the target variable for the testing data.
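
To make "models the probability of an event" concrete, the sketch below computes a logistic regression probability by hand: a weighted sum of the features is passed through the sigmoid function. The weights, bias and feature values are made-up numbers, not coefficients from the fitted model, and `fraud_probability` is a hypothetical helper:

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fraud_probability(weights, bias, features):
    """Probability of class 1 for one transaction under a logistic model."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Illustrative numbers only.
weights = [0.8, -1.5, 0.3]
bias = -4.0
features = [1.2, -2.0, 0.5]

p = fraud_probability(weights, bias, features)
label = 1 if p >= 0.5 else 0  # 0.5 is the default decision threshold
print(round(p, 4), label)
```

scikit-learn exposes the same quantity through predict_proba, and lowering the threshold below 0.5 is one way to catch more fraud at the cost of more false alarms.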
### Evaluation
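The performance of the model is evaluated using the accuracy metric, which is the fraction of correctly classified transactions. The accuracy on the training and testing data is calculated using the accuracy_score() function from scikit-learn and comes to about 0.9988 and 0.9990 respectively. Because only about 0.26% of the loaded transactions are fraudulent, a model that labels every transaction as legitimate would already reach about 99.74% accuracy, so accuracy alone says little about how many frauds are caught. Precision and recall on the fraud class are more informative.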
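
The class counts and the reported test accuracy allow a quick sanity check of what the accuracy figure means. The arithmetic below uses only numbers printed in the notebook (39,598 legitimate rows, 104 fraud rows, 7,941 test rows, test accuracy 0.9989925702052638); the variable names are illustrative:

```python
# Class counts from data['Class'].value_counts().
n_legit, n_fraud = 39598, 104

# Accuracy of a trivial model that always predicts "legitimate".
baseline = n_legit / (n_legit + n_fraud)
print(f"majority-class baseline: {baseline:.5f}")

# Number of misclassified test rows implied by the reported test accuracy.
test_rows = 7941
test_accuracy = 0.9989925702052638
errors = round((1 - test_accuracy) * test_rows)
print(f"misclassified test rows: {errors}")
```

The error count does not reveal which class the mistakes fall in; a confusion matrix is needed for that.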

---


## Conclusion
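In this project, we used logistic regression to detect fraudulent credit card transactions. The model reaches about 99.9% accuracy on both the training and testing data, which is only slightly above the 99.74% obtained by always predicting the majority class. Its ability to detect fraud should therefore be confirmed with class-specific metrics such as recall before it is considered effective.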
