**Credit Card Fraud Detection**

**Task 05 : Problem Statement**

1. Build a machine learning model to identify fraudulent credit card transactions.

2. Preprocess and normalize the transaction data, handle class imbalance issues, and split the dataset into training and testing sets.

3. Train a classification algorithm, such as logistic regression or random forests, to classify transactions as fraudulent or genuine.

4. Evaluate the model's performance using metrics like precision, recall, and F1-score, and consider techniques like oversampling or undersampling for improving results.

**Work Flow**

1. Data loading

2. Data pre-processing

3. Exploratory data analysis

4. Splitting training and test data

5. Model training - Logistic Regression

6. Model evaluation


In [1]:
# Importing the libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, recall_score, classification_report


In [2]:
# loading the dataset

cc_data = pd.read_csv('/content/creditcard.csv')

In [3]:
# Shows first 5 rows of dataframe
cc_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
1,0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0.0
2,1,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0.0
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0.0
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0.0


In [4]:
cc_data.info()  # to get information about dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128821 entries, 0 to 128820
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    128821 non-null  int64  
 1   V1      128821 non-null  float64
 2   V2      128821 non-null  float64
 3   V3      128821 non-null  float64
 4   V4      128821 non-null  float64
 5   V5      128821 non-null  float64
 6   V6      128821 non-null  float64
 7   V7      128821 non-null  float64
 8   V8      128821 non-null  float64
 9   V9      128821 non-null  float64
 10  V10     128821 non-null  float64
 11  V11     128821 non-null  float64
 12  V12     128821 non-null  float64
 13  V13     128821 non-null  float64
 14  V14     128821 non-null  float64
 15  V15     128821 non-null  float64
 16  V16     128820 non-null  float64
 17  V17     128820 non-null  float64
 18  V18     128820 non-null  float64
 19  V19     128820 non-null  float64
 20  V20     128820 non-null  float64
 21  V21     12

In [5]:
# Checking the missing values
cc_data.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       1
V17       1
V18       1
V19       1
V20       1
V21       1
V22       1
V23       1
V24       1
V25       1
V26       1
V27       1
V28       1
Amount    1
Class     1
dtype: int64
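
The counts above show exactly one missing value in each of V16 through Class, which suggests a single incomplete row (possibly a truncated last line in the CSV; that is an assumption, the notebook does not confirm it). A minimal sketch of dropping such rows, using a small made-up frame named `toy` rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for cc_data: the last row is incomplete.
toy = pd.DataFrame({
    "V1": [0.1, -1.2, 0.7],
    "Amount": [10.0, 25.5, np.nan],
    "Class": [0.0, 1.0, np.nan],
})

# Drop every row that has at least one missing value.
toy_clean = toy.dropna().copy()

# With no NaN left, the label can be stored as an integer.
toy_clean["Class"] = toy_clean["Class"].astype(int)

print(toy_clean.shape)
print(toy_clean.isnull().sum().sum())
```

In the notebook itself, the later filters `Class==0` and `Class==1` already exclude a row whose label is missing, so the model never sees it; an explicit `dropna()` just makes that step visible.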

In [6]:
# distribution of legitimate and fraudulent transactions

cc_data['Class'].value_counts()
# 0 ---> normal transaction
# 1 ---> fraudulent transaction

0.0    128559
1.0       261
Name: Class, dtype: int64
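
To make the imbalance concrete, the two counts above can be turned into a fraud share and a naive baseline. This is a small arithmetic sketch; the variable names are illustrative and not part of the notebook:

```python
# Class counts taken from the value_counts() output above.
legit_count = 128559
fraud_count = 261
total = legit_count + fraud_count

# Share of transactions labelled as fraud (roughly 0.2%).
fraud_share = fraud_count / total
print(f"Fraud share: {fraud_share:.4%}")

# A model that always predicts "legit" would already reach this accuracy,
# which is why accuracy alone is a poor metric on the full dataset.
baseline_accuracy = legit_count / total
print(f"Always-legit baseline accuracy: {baseline_accuracy:.4%}")
```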

In [7]:
# separating the data by class (a row with a missing Class matches neither filter)

legit = cc_data[cc_data.Class == 0]
fraud = cc_data[cc_data.Class == 1]

In [8]:
print(legit.shape)
print(fraud.shape)

(128559, 31)
(261, 31)


In [9]:
# Statistical info
legit.Amount.describe()

count    128559.000000
mean         92.909269
std         251.540738
min           0.000000
25%           6.500000
50%          24.950000
75%          83.190000
max       19656.530000
Name: Amount, dtype: float64

In [10]:
fraud.Amount.describe()

count     261.000000
mean      116.679693
std       246.300626
min         0.000000
25%         1.000000
50%        11.380000
75%        99.990000
max      1809.680000
Name: Amount, dtype: float64

In [17]:
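# Under-sampling: build a sample dataset with a similar number of normal and fraudulent transactions.

# number of fraudulent transactions = 261
# Note: the outputs further down (492 legit rows, 753 rows in total) do not match n=261;
# they appear to come from a run of this cell that sampled 492 legit rows.

legit_sample=legit.sample(n=261)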
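
The under-sampling step above draws a different random sample on every run, which makes results hard to reproduce. A minimal sketch of a reproducible variant on made-up data follows; the frame `toy`, the seed values, and the final shuffle are assumptions added for illustration and are not part of the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical imbalanced toy data: 20 fraud rows and 980 legit rows.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Amount": rng.uniform(0, 500, size=1000),
    "Class": [1] * 20 + [0] * 980,
})

fraud_toy = toy[toy["Class"] == 1]
legit_toy = toy[toy["Class"] == 0]

# Sample as many legit rows as there are fraud rows; random_state fixes the draw.
legit_sample_toy = legit_toy.sample(n=len(fraud_toy), random_state=42)

balanced = pd.concat([legit_sample_toy, fraud_toy], axis=0)

# Shuffle so the two classes are not stored as two contiguous blocks.
balanced = balanced.sample(frac=1, random_state=42).reset_index(drop=True)

print(balanced["Class"].value_counts())
```

Using `len(fraud_toy)` instead of a hard-coded number keeps the sample size in step with the fraud count if the data changes.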

In [13]:
# concatenating

new_data=pd.concat([legit_sample, fraud],axis=0)

In [14]:
new_data.tail() # checking the last five rows

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
124087,77171,1.11856,1.291858,-1.298805,2.135772,0.772204,-1.147291,0.390578,-0.107072,-0.038339,...,-0.346374,-0.663588,-0.102326,0.017911,0.650302,-0.332366,0.105949,0.128124,1.0,1.0
124115,77182,-1.410852,2.268271,-2.297554,1.871331,0.248957,-1.208799,-1.358648,1.102916,-1.317364,...,0.155381,-0.61488,-0.196126,-0.464376,0.118473,-0.484537,0.373596,0.187657,1.0,1.0
124176,77202,-0.356326,1.435305,-0.813564,1.993117,2.055878,-0.543579,0.487691,0.085449,-0.536352,...,-0.312863,-0.687874,-0.267003,-1.15848,0.27146,-0.155397,0.114328,0.101526,1.0,1.0
125342,77627,-7.13906,2.773082,-6.757845,4.446456,-5.464428,-1.713401,-6.485365,3.409395,-3.053493,...,1.30325,-0.016118,-0.87667,0.38223,-1.054624,-0.614606,-0.766848,0.409424,106.9,1.0
128479,78725,-4.312479,1.886476,-2.338634,-0.475243,-1.185444,-2.112079,-2.122793,0.272565,0.290273,...,0.550541,-0.06787,-1.114692,0.269069,-0.020572,-0.963489,-0.918888,0.001454,60.0,1.0


In [15]:
new_data['Class'].value_counts() # verifying concatenated data

0.0    492
1.0    261
Name: Class, dtype: int64

In [16]:
new_data.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0.0,49325.684959,-0.307618,-0.004727,0.670546,0.058531,-0.257857,0.07838,-0.059975,-0.026491,-0.139086,...,0.040877,-0.074115,-0.135589,-0.025607,-0.025068,0.110088,0.060757,-0.014945,-0.03476,98.761179
1.0,41975.996169,-5.665868,3.97347,-7.225782,4.526519,-4.014566,-1.501321,-5.982089,1.526847,-2.616088,...,0.238243,1.27186,-0.320923,-0.117196,-0.105943,0.199014,0.054738,0.492548,0.081347,116.679693


In [18]:
# Splitting the data into features and target

X=new_data.drop(columns='Class')
Y=new_data['Class']

In [19]:
print(X)

         Time        V1        V2        V3        V4        V5        V6  \
58759   48533 -1.321038  0.447223  0.534273 -0.509824  1.633673  0.212502   
4593     3945 -1.476588  0.208667 -0.556085 -0.107826 -1.722552  0.879064   
69673   53527  1.416024 -0.263850  0.011829 -0.641799 -0.673429 -0.900209   
110442  71826 -0.932935  0.470182  1.887888 -1.202398  0.332200 -0.949726   
80218   58384 -0.415545  1.168379  1.389886  0.573037 -0.436592  0.099706   
...       ...       ...       ...       ...       ...       ...       ...   
124087  77171  1.118560  1.291858 -1.298805  2.135772  0.772204 -1.147291   
124115  77182 -1.410852  2.268271 -2.297554  1.871331  0.248957 -1.208799   
124176  77202 -0.356326  1.435305 -0.813564  1.993117  2.055878 -0.543579   
125342  77627 -7.139060  2.773082 -6.757845  4.446456 -5.464428 -1.713401   
128479  78725 -4.312479  1.886476 -2.338634 -0.475243 -1.185444 -2.112079   

              V7        V8        V9  ...       V20       V21       V22  \


In [20]:
print(Y)

58759     0.0
4593      0.0
69673     0.0
110442    0.0
80218     0.0
         ... 
124087    1.0
124115    1.0
124176    1.0
125342    1.0
128479    1.0
Name: Class, Length: 753, dtype: float64


**Splitting data into train and test data**

In [21]:
# Split the data into train and test sets (80/20), stratified by class

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

In [22]:
print(X.shape, X_train.shape,X_test.shape)

(753, 30) (602, 30) (151, 30)
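
The `stratify=Y` argument is what keeps the class ratio roughly the same in both splits. A small sketch that checks this on labels alone, using the class counts reported earlier (492 legit, 261 fraud); the placeholder feature column `X_dummy` is an assumption made only so the call has something to split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels only: 492 legit (0) and 261 fraud (1), matching the counts above.
y = np.array([0] * 492 + [1] * 261)

# Placeholder feature column; its values are irrelevant to this check.
X_dummy = np.arange(len(y)).reshape(-1, 1)

_, _, y_train, y_test = train_test_split(
    X_dummy, y, test_size=0.2, stratify=y, random_state=2
)

# Count rows per class in each split and compare the fraud shares.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("train:", train_counts, "fraud share:", round(y_train.mean(), 3))
print("test: ", test_counts, "fraud share:", round(y_test.mean(), 3))
```

Both splits should show a fraud share close to 261/753, and the test counts should line up with the `support` column of the classification report further down.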


**Model Training - Logistic Regression**

In [23]:
# model training

model=LogisticRegression()

In [24]:
# training the model with training data

model.fit(X_train, Y_train)
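
The problem statement asks for normalized data, but the model above is fitted on raw features, where `Time` and `Amount` sit on a much larger scale than `V1` to `V28`. One possible way to add scaling is a scikit-learn pipeline. The sketch below runs on synthetic data from `make_classification`, not on the real transactions; the inflated first column and `max_iter=1000` are assumptions chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the under-sampled data (not the real transactions).
X_syn, y_syn = make_classification(
    n_samples=750, n_features=30, n_informative=10, random_state=0
)

# Inflate one column to mimic a large-scale feature such as Time.
X_syn[:, 0] *= 50_000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=2
)

# The pipeline fits the scaler on the training split only,
# then reuses those statistics when predicting on the test split.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

print("test accuracy:", round(scaled_model.score(X_te, y_te), 3))
```

Wrapping the scaler in the pipeline avoids leaking test-set statistics into training, which can happen when the whole dataset is scaled before the split.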

**Model Evaluation**

In [25]:
# accuracy on training data

X_train_prediction=model.predict(X_train)
# accuracy_score expects (y_true, y_pred)
training_data_accuracy=accuracy_score(Y_train, X_train_prediction)

In [27]:
print('Accuracy on training data :{:.2f}%'.format(training_data_accuracy*100))

Accuracy on training data :95.35%


In [28]:
# accuracy on test data

X_test_prediction=model.predict(X_test)
# accuracy_score expects (y_true, y_pred)
test_data_accuracy=accuracy_score(Y_test, X_test_prediction)

In [29]:
print('Accuracy on test data :{:.2f}%'.format(test_data_accuracy*100))

Accuracy on test data :92.05%


In [30]:
print(classification_report(Y_test, X_test_prediction))

              precision    recall  f1-score   support

         0.0       0.90      0.99      0.94        99
         1.0       0.98      0.79      0.87        52

    accuracy                           0.92       151
   macro avg       0.94      0.89      0.91       151
weighted avg       0.93      0.92      0.92       151
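
The notebook imports `confusion_matrix` but never prints one. The report above, together with the 92.05% test accuracy, is consistent with 98 true negatives, 1 false positive, 11 false negatives, and 41 true positives. These counts are inferred from the rounded figures, not output by the notebook. The sketch below rebuilds label arrays from those inferred counts and recomputes the fraud-class metrics:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Counts inferred from the report above (not printed by the notebook):
# 98 true negatives, 1 false positive, 11 false negatives, 41 true positives.
y_true = np.array([0] * 99 + [1] * 52)
y_pred = np.array([0] * 98 + [1] * 1 + [0] * 11 + [1] * 41)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Metrics for the fraud class (label 1).
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print("precision:", round(precision, 2))
print("recall:   ", round(recall, 2))
print("f1:       ", round(f1, 2))
```

Read this way, the model raises very few false alarms but misses 11 of the 52 fraudulent transactions in the test split, which is the number to watch in a fraud setting.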



**Conclusion**

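Logistic regression appears to be a reasonable baseline for classifying transactions as fraudulent or genuine. On the under-sampled dataset it reaches 95.35% accuracy on the training split and 92.05% on the test split. For the fraud class, precision is 0.98 and recall is 0.79 (F1-score 0.87), so roughly one in five fraudulent transactions in the test split is missed. Only random under-sampling of the legitimate class was applied here; oversampling, feature normalization, and other classifiers such as random forests were not tried and may improve recall.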