# 💳 Credit Card Fraud Detection with Logistic Regression

In this project, we aim to identify fraudulent credit card transactions using a basic machine learning model. The dataset is highly imbalanced, and the detection of rare fraud cases is the central challenge. We use logistic regression to build a simple classifier and evaluate its performance.

# Import the Dependencies

In [15]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [16]:
#loading the dataset into pandas dataframe
credit_card_dataset=pd.read_csv("/content/creditcard.csv")

In [17]:
#first 5 rows of the dataset
credit_card_dataset.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [18]:
#last five rows of the dataset
credit_card_dataset.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.01448,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.05508,2.03503,-0.738589,0.868229,1.058415,0.02433,0.294869,0.5848,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.24964,-0.557828,2.630515,3.03126,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.24044,0.530483,0.70251,0.689799,-0.377961,0.623708,-0.68618,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.0,0
284806,172792.0,-0.533413,-0.189733,0.703337,-0.506271,-0.012546,-0.649617,1.577006,-0.41465,0.48618,...,0.261057,0.643078,0.376777,0.008797,-0.473649,-0.818267,-0.002415,0.013649,217.0,0


In [19]:
#number of rows and columns
credit_card_dataset.shape

(284807, 31)

In [20]:
#dataset information
credit_card_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [21]:
#checking the number of missing values in each column
credit_card_dataset.isnull().sum()

Unnamed: 0,0
Time,0
V1,0
V2,0
V3,0
V4,0
V5,0
V6,0
V7,0
V8,0
V9,0


In [24]:
#distribution of legitimate and fraudulent transactions
credit_card_dataset['Class'].value_counts()

Class,count
0,284315
1,492


This dataset is highly imbalanced.

Class labels:

0 = normal (legitimate) transaction

1 = fraudulent transaction
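To put the imbalance in numbers, the short calculation below reuses the two class counts printed above. It is only an illustrative sketch: the variable names are ours, and the "always predict legitimate" baseline is a hypothetical classifier, not something trained in this notebook.

```python
# Class counts taken from the value_counts() output above.
n_legit = 284315
n_fraud = 492
n_total = n_legit + n_fraud

# Share of fraudulent transactions in the full dataset.
fraud_share = n_fraud / n_total

# Accuracy of a hypothetical classifier that labels every transaction as legitimate.
baseline_accuracy = n_legit / n_total

print(f"Fraud share: {fraud_share:.4%}")
print(f"Always-legitimate baseline accuracy: {baseline_accuracy:.4%}")
```

A classifier could therefore score above 99.8% accuracy without catching a single fraud, which is why accuracy on the raw data says little about fraud detection.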




In [25]:
#Separating the data for analysis
legit=credit_card_dataset[credit_card_dataset.Class==0]
fraud=credit_card_dataset[credit_card_dataset.Class==1]

In [29]:
legit.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [30]:
fraud.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
541,406.0,-2.312227,1.951992,-1.609851,3.997906,-0.522188,-1.426545,-2.537387,1.391657,-2.770089,...,0.517232,-0.035049,-0.465211,0.320198,0.044519,0.17784,0.261145,-0.143276,0.0,1
623,472.0,-3.043541,-3.157307,1.088463,2.288644,1.359805,-1.064823,0.325574,-0.067794,-0.270953,...,0.661696,0.435477,1.375966,-0.293803,0.279798,-0.145362,-0.252773,0.035764,529.0,1
4920,4462.0,-2.30335,1.759247,-0.359745,2.330243,-0.821628,-0.075788,0.56232,-0.399147,-0.238253,...,-0.294166,-0.932391,0.172726,-0.08733,-0.156114,-0.542628,0.039566,-0.153029,239.93,1
6108,6986.0,-4.397974,1.358367,-2.592844,2.679787,-1.128131,-1.706536,-3.496197,-0.248778,-0.247768,...,0.573574,0.176968,-0.436207,-0.053502,0.252405,-0.657488,-0.827136,0.849573,59.0,1
6329,7519.0,1.234235,3.01974,-4.304597,4.732795,3.624201,-1.357746,1.713445,-0.496358,-1.282858,...,-0.379068,-0.704181,-0.656805,-1.632653,1.488901,0.566797,-0.010016,0.146793,1.0,1


In [26]:
print(legit.shape)
print(fraud.shape)

(284315, 31)
(492, 31)


In [27]:
#Statistical measures of the data
legit.Amount.describe()

Unnamed: 0,Amount
count,284315.0
mean,88.291022
std,250.105092
min,0.0
25%,5.65
50%,22.0
75%,77.05
max,25691.16


In [28]:
fraud.Amount.describe()

Unnamed: 0,Amount
count,492.0
mean,122.211321
std,256.683288
min,0.0
25%,1.0
50%,9.25
75%,105.89
max,2125.87


In [31]:
credit_card_dataset.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,94838.202258,0.008258,-0.006271,0.012171,-0.00786,0.005453,0.002419,0.009637,-0.000987,0.004467,...,-0.000644,-0.001235,-2.4e-05,7e-05,0.000182,-7.2e-05,-8.9e-05,-0.000295,-0.000131,88.291022
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


# Under-Sampling

We construct a balanced dataset by under-sampling the majority class.

The original dataset contains only 492 fraudulent transactions, so we randomly sample 492 legitimate transactions to match that count and obtain an even distribution between the two classes.

In [32]:
legit_sample=legit.sample(n=492)
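The under-sampling step, together with the concatenation that follows it in the notebook, can be tried on a small toy frame. The sketch below is not the notebook's code: the `toy` data and its values are made up, and `random_state=42` is an assumption added so the random draw is repeatable (the original call does not fix a seed, so its sample changes on every run).

```python
import pandas as pd

# Hypothetical stand-in for the real data: 20 legitimate rows and 4 fraudulent rows.
toy = pd.DataFrame({
    "Amount": [float(i) for i in range(24)],
    "Class": [0] * 20 + [1] * 4,
})

legit_toy = toy[toy["Class"] == 0]
fraud_toy = toy[toy["Class"] == 1]

# Draw as many legitimate rows as there are fraud rows.
# random_state is an added assumption that makes the draw reproducible.
legit_sample_toy = legit_toy.sample(n=len(fraud_toy), random_state=42)

# Stack the sampled legitimate rows on top of all fraud rows.
balanced = pd.concat([legit_sample_toy, fraud_toy], axis=0)

counts = balanced["Class"].value_counts().to_dict()
print(counts)
```

Using `len(fraud_toy)` instead of a hard-coded number keeps the two classes equal in size even if the fraud count changes.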

# Concatenating the two DataFrames

In [34]:
new_df=pd.concat([legit_sample,fraud],axis=0)
new_df

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
92977,64198.0,-2.090886,-1.929205,0.837809,-0.017618,-2.497695,1.483219,3.237254,-0.129763,-0.091751,...,0.658947,0.399502,1.751782,-0.504647,0.445545,0.441346,-0.240249,0.167685,900.00,0
19270,30133.0,0.994998,-1.106790,0.387466,-0.184460,-1.232018,-0.208375,-0.572321,0.000759,-0.908399,...,-0.056540,-0.093293,-0.226612,0.030446,0.271453,0.646391,-0.041273,0.036430,183.92,0
225671,144345.0,2.063272,-0.043923,-1.059015,0.417445,-0.143324,-1.218268,0.187017,-0.338051,0.533910,...,-0.288629,-0.693180,0.341464,0.047554,-0.297690,0.195311,-0.070262,-0.059886,0.99,0
166970,118413.0,2.031371,-0.194793,-1.105595,0.443824,-0.282068,-1.245298,0.114750,-0.263919,0.740667,...,-0.294783,-0.814469,0.343910,0.010648,-0.350184,0.198223,-0.082292,-0.060061,16.98,0
234175,147862.0,1.895773,-0.453364,-1.413134,0.405147,-0.019663,-0.556177,0.023221,-0.122036,0.429856,...,0.351126,0.959647,-0.125111,-0.410737,0.133931,0.585635,-0.080790,-0.070004,78.00,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
279863,169142.0,-1.927883,1.125653,-4.518331,1.749293,-1.566487,-2.010494,-0.882850,0.697211,-2.064945,...,0.778584,-0.319189,0.639419,-0.294885,0.537503,0.788395,0.292680,0.147968,390.00,1
280143,169347.0,1.378559,1.289381,-5.004247,1.411850,0.442581,-1.326536,-1.413170,0.248525,-1.127396,...,0.370612,0.028234,-0.145640,-0.081049,0.521875,0.739467,0.389152,0.186637,0.76,1
280149,169351.0,-0.676143,1.126366,-2.213700,0.468308,-1.120541,-0.003346,-2.234739,1.210158,-0.652250,...,0.751826,0.834108,0.190944,0.032070,-0.739695,0.471111,0.385107,0.194361,77.89,1
281144,169966.0,-3.113832,0.585864,-5.399730,1.817092,-0.840618,-2.943548,-2.208002,1.058733,-1.632333,...,0.583276,-0.269209,-0.456108,-0.183659,-0.328168,0.606116,0.884876,-0.253700,245.00,1


In [35]:
new_df['Class'].value_counts()

Class,count
0,492
1,492


In [36]:
new_df.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,96703.577236,-0.017112,-0.004152,-0.087436,0.002246,0.002986,-0.003286,0.027371,-0.005734,-0.041146,...,0.003091,-0.008575,-0.031776,-0.000923,-0.04205,0.007492,0.000743,0.002855,-0.018333,92.806159
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


# Splitting the data into features and targets

In [37]:
X=new_df.drop('Class',axis=1)
Y=new_df['Class']

In [39]:
X,Y

(            Time        V1        V2        V3        V4        V5        V6  \
 92977    64198.0 -2.090886 -1.929205  0.837809 -0.017618 -2.497695  1.483219   
 19270    30133.0  0.994998 -1.106790  0.387466 -0.184460 -1.232018 -0.208375   
 225671  144345.0  2.063272 -0.043923 -1.059015  0.417445 -0.143324 -1.218268   
 166970  118413.0  2.031371 -0.194793 -1.105595  0.443824 -0.282068 -1.245298   
 234175  147862.0  1.895773 -0.453364 -1.413134  0.405147 -0.019663 -0.556177   
 ...          ...       ...       ...       ...       ...       ...       ...   
 279863  169142.0 -1.927883  1.125653 -4.518331  1.749293 -1.566487 -2.010494   
 280143  169347.0  1.378559  1.289381 -5.004247  1.411850  0.442581 -1.326536   
 280149  169351.0 -0.676143  1.126366 -2.213700  0.468308 -1.120541 -0.003346   
 281144  169966.0 -3.113832  0.585864 -5.399730  1.817092 -0.840618 -2.943548   
 281674  170348.0  1.991976  0.158476 -2.583441  0.408670  1.151147 -0.096695   
 
               V7        V

# Splitting the data into training and test data

In [41]:
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.1,stratify=Y,random_state=2)

In [42]:
X.shape,X_train.shape,X_test.shape

((984, 30), (885, 30), (99, 30))
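The `stratify=Y` argument asks `train_test_split` to keep the class proportions roughly equal in both parts. The sketch below checks this on synthetic labels with the same size and 50/50 balance as the under-sampled data. The arrays `X_toy` and `y_toy` are placeholders we made up, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 984 rows, two dummy features, 492 labels per class.
X_toy = np.arange(984 * 2).reshape(984, 2)
y_toy = np.array([0] * 492 + [1] * 492)

# Same split settings as in the notebook cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, stratify=y_toy, random_state=2
)

# Count how many samples of each class landed in each part.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)

print(X_tr.shape, X_te.shape)
print("train class counts:", train_counts)
print("test class counts:", test_counts)
```

With an odd test size of 99 the two classes cannot be exactly equal, so a difference of one sample between them is expected.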

# Model Training

## Logistic Regression

In [46]:
model=LogisticRegression(max_iter=10000)

In [47]:
model.fit(X_train,Y_train)

# Model Evaluation

In [48]:
#accuracy score on training data
X_train_prediction=model.predict(X_train)
accuracy=accuracy_score(Y_train,X_train_prediction)
print("Accuracy score on training data: ",accuracy)

Accuracy score on training data:  0.9525423728813559


In [49]:
#accuracy score on test data
X_test_prediction=model.predict(X_test)
accuracy=accuracy_score(Y_test,X_test_prediction)
print("Accuracy score on test data: ",accuracy)

Accuracy score on test data:  0.9393939393939394


## ✅ Conclusion

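Our model reached an accuracy of about 95.3% on the training data and about 93.9% on the test data. The two scores are close, which suggests the model is not badly overfitting. Note, however, that these numbers were measured on the balanced, under-sampled subset (984 transactions, 99 of them in the test set), not on the original, highly imbalanced dataset, so accuracy alone may not reflect the true performance on real-world data. In future work, we should evaluate with metrics better suited to fraud detection, such as recall, precision, and the confusion matrix.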
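As a starting point for that future work, the sketch below shows how recall, precision, and the confusion matrix can be computed with `sklearn.metrics`. The labels in `y_true` and `y_pred` are hand-made for illustration only. They are not predictions from the model trained above, and in the notebook they would be replaced by `Y_test` and `X_test_prediction`.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 = fraud, 0 = legitimate.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, rows are true classes and columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("accuracy:", accuracy)
print("precision:", precision)
print("recall:", recall)
```

Recall is usually the number to watch in fraud detection, since each false negative is a fraudulent transaction that went undetected.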