<a href="https://colab.research.google.com/github/DHATCHAYANI-CSE/21CS040/blob/main/CREDIT_CARD_FRAUD_DETECTION.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Work Flow

1. Credit Card Data
2. Data Preprocessing
3. Data Analysis
4. Train Test Split
5. Logistic Regression Model
6. Evaluation

1. What are V1, V2, ..., V28?
Anonymized Features: V1 to V28 are anonymized features in the dataset. Taken individually, they have no direct real-world interpretation.
Generated by PCA: These features are produced by Principal Component Analysis (PCA). PCA transforms the original data into a new set of uncorrelated variables (principal components), ordered by how much of the variance in the data each one captures.
2. Why Use PCA?
Dimensionality Reduction: Credit card transaction data can have many features. PCA can reduce the number of features (dimensions) while retaining as much of the variance in the data as possible.
Privacy: Because each component is a combination of the original features, the original values are not exposed directly, which helps protect sensitive transaction details.
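3. Example to Illustrate:
Imagine you have a dataset with original features about transactions, such as:

Transaction Amount
Merchant Category
Transaction Location
Time of Transaction
User's Purchase History
When PCA is applied to these features, it creates new variables, each a weighted combination of the original ones. In this dataset, those new variables are V1, V2, ..., V28. The five features above are only illustrative: PCA cannot produce more components than there are original features, so the real data must have had at least 28 of them.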

Example Breakdown:
Original Features:

Transaction Amount
Merchant Category
Time of Transaction
Apply PCA:

PCA transforms these features into principal components. For instance:
V1 might be a weighted combination dominated by Transaction Amount and Merchant Category.
V2 might be a different combination dominated by Time of Transaction and Transaction Amount.
And so on...
Resulting Features:

V1, V2, ..., V28 are abstract features that summarize the original data. PCA chooses them to capture as much variance as possible; it does not look at the fraud labels, so whether a given component helps detect fraud has to be learned by the model.
In Summary:
V1 to V28 are new variables created from the original features through PCA.
Purpose: They allow the data to be analyzed for patterns without exposing the original, sensitive features.
In essence, the V features give a compact, anonymized representation of each transaction that a model can still use to identify fraud.
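
The idea that each V feature is a weighted combination of original features can be sketched with NumPy. This is only an illustration on synthetic data: the three columns (`amount`, `hour`, `merchant_risk`) are hypothetical stand-ins, not the real features behind this dataset, and the PCA is computed by hand with an SVD rather than with whatever tool produced V1 to V28.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical original features (illustrative only, not the real dataset):
# amount, hour of day, and a merchant risk score correlated with amount.
n = 500
amount = rng.gamma(shape=2.0, scale=40.0, size=n)
hour = rng.uniform(0, 24, size=n)
merchant_risk = 0.01 * amount + rng.normal(0, 0.5, size=n)
X = np.column_stack([amount, hour, merchant_risk])

# Standardize so that no feature dominates purely because of its units.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via singular value decomposition of the standardized data.
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
components = Vt               # each row holds the weights of one component
V = X_std @ components.T      # columns play the role of V1, V2, V3

explained_variance_ratio = S**2 / np.sum(S**2)

print("weights of V1 on (amount, hour, merchant_risk):", components[0].round(3))
print("explained variance ratio:", explained_variance_ratio.round(3))
print("correlation between V1 and V2:",
      round(float(np.corrcoef(V[:, 0], V[:, 1])[0, 1]), 6))
```

The printed weights show how much each original feature contributes to the first component, and the near-zero correlation shows that the resulting components are uncorrelated with each other.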








In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
# loading the dataset to a Pandas DataFrame
credit_card_data = pd.read_csv('/content/creditcard.csv')

In [None]:
# first 5 rows of the dataset
credit_card_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
1,0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0.0
2,1,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0.0
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0.0
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0.0


In [None]:
credit_card_data.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
23853,32953,1.296815,-0.518508,0.3488,0.12975,-0.845934,-0.283287,-0.463165,-0.007344,-0.530097,...,-0.498652,-0.791893,0.058456,0.067753,0.367979,0.400899,0.005089,0.008846,14.0,0.0
23854,32954,1.295646,-0.699613,-1.129649,-2.505043,1.428275,3.010605,-0.892932,0.798386,1.309837,...,0.07154,0.120545,-0.144172,1.066577,0.615628,0.123208,0.013875,0.018547,37.07,0.0
23855,32954,-1.691394,-1.452403,1.671861,-1.76959,1.636843,-0.943701,0.193116,-0.712756,-0.987725,...,0.047745,0.539167,-0.69801,-0.470563,0.621828,-0.254508,-0.724865,-0.360165,53.8,0.0
23856,32954,1.112786,0.062772,1.481419,2.922471,-0.905121,0.366357,-0.670663,0.256586,0.599399,...,0.010125,0.34674,-0.117334,0.416103,0.586102,0.216021,0.04271,0.024984,0.0,0.0
23857,32954,-0.26422,1.107046,0.985544,0.346904,1.121035,0.069275,1.307409,-0.681733,0.34665,...,-0.159083,,,,,,,,,


In [None]:
# dataset information
credit_card_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23858 entries, 0 to 23857
Data columns (total 31 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Time    23858 non-null  int64  
 1   V1      23858 non-null  float64
 2   V2      23858 non-null  float64
 3   V3      23858 non-null  float64
 4   V4      23858 non-null  float64
 5   V5      23858 non-null  float64
 6   V6      23858 non-null  float64
 7   V7      23858 non-null  float64
 8   V8      23858 non-null  float64
 9   V9      23858 non-null  float64
 10  V10     23858 non-null  float64
 11  V11     23858 non-null  float64
 12  V12     23858 non-null  float64
 13  V13     23858 non-null  float64
 14  V14     23858 non-null  float64
 15  V15     23858 non-null  float64
 16  V16     23858 non-null  float64
 17  V17     23858 non-null  float64
 18  V18     23858 non-null  float64
 19  V19     23858 non-null  float64
 20  V20     23858 non-null  float64
 21  V21     23858 non-null  float64
 22

In [None]:
# checking the number of missing values in each column
credit_card_data.isnull().sum()

Unnamed: 0,0
Time,0
V1,0
V2,0
V3,0
V4,0
V5,0
V6,0
V7,0
V8,0
V9,0


In [None]:
# Remove rows with missing values
credit_card_data = credit_card_data.dropna()

# Verify if the missing values have been removed
credit_card_data.isnull().sum()


Unnamed: 0,0
Time,0
V1,0
V2,0
V3,0
V4,0
V5,0
V6,0
V7,0
V8,0
V9,0


In [None]:
# distribution of legit transactions & fraudulent transactions
credit_card_data['Class'].value_counts()

Class,count
0.0,23769
1.0,88


This dataset is highly unbalanced: 23,769 normal transactions versus 88 fraudulent ones (about 0.37% fraud).

0 --> Normal transaction

1 --> Fraudulent transaction

Imagine a Basket of Fruits:
Majority Class: Apples (lots of them)
Minority Class: Oranges (few of them)
Scenario:
You have a basket with 100 apples and 5 oranges.
Apples are the majority class (because there are a lot of them).
Oranges are the minority class (because there are only a few of them).
In a Dataset:
Majority Class (Non-Fraudulent Transactions): Most of the transactions are normal and not fraudulent.
Minority Class (Fraudulent Transactions): Only a few transactions are fraudulent.
Problem with Imbalance:
If You Only Focus on Apples: You’d be very good at identifying apples because there are so many. But if you’re trying to find oranges, it’s harder because there are so few.

In Fraud Detection: If a model sees mostly non-fraudulent transactions and only a few fraudulent ones, it might learn to ignore the fraudulent ones because they are so rare.

Why This Matters:
Bias Towards Majority: The model might become very good at predicting non-fraudulent transactions but poor at detecting the rare fraudulent ones.
Example with Fraud Detection:
Data: 1,000 transactions; 950 are non-fraudulent (majority), and 50 are fraudulent (minority).
Issue: If a model predicts every transaction as non-fraudulent, it will still be correct 95% of the time (950 / 1,000), but it won’t catch a single fraudulent transaction.
Handling Imbalance:
Balance Techniques: You might use methods to balance the data, like:
Oversampling: adding more fraudulent examples, by duplicating existing ones or generating synthetic ones.
Undersampling: keeping only a subset of the non-fraudulent transactions. This notebook uses undersampling.
In summary, in a highly unbalanced dataset, one type of transaction (fraudulent) is much less common than the other (non-fraudulent), which makes it hard for the model to learn to detect the rare type. High accuracy alone does not show that fraud is being caught.
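
The 95% example can be checked with a few lines of plain Python. This is a small sketch of the arithmetic only; the "model" here is a hypothetical rule that labels everything as non-fraudulent, not the model trained later in this notebook.

```python
# 1,000 transactions: 950 non-fraudulent (0) and 50 fraudulent (1).
y_true = [0] * 950 + [1] * 50

# A "model" that labels every transaction as non-fraudulent.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual fraud cases that the model flagged.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}")  # 0.95
print(f"recall   = {recall:.2f}")    # 0.00
```

Accuracy looks strong while recall on the fraud class is zero, which is why accuracy by itself is a weak measure on unbalanced data.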








In [None]:
# separating the data for analysis
legit = credit_card_data[credit_card_data.Class == 0]
fraud = credit_card_data[credit_card_data.Class == 1]

In [None]:
print(legit.shape)
print(fraud.shape)

(23769, 31)
(88, 31)


In [None]:
# statistical measures of the data
legit.Amount.describe()

# How to read the output for legit transactions:
# 1. count: 23,769
#    The number of legit transactions with a recorded Amount.
# 2. mean: 73.88
#    The average amount, i.e. the sum of all amounts divided by the count.
# 3. std: 212.54
#    The standard deviation, which measures the spread around the mean. It is almost three times the mean, so amounts vary widely.
# 4. min: 0.00
#    The smallest amount. A $0.00 transaction could be a no-charge transaction or a refund.
# 5. 25%: 6.00
#    The first quartile: 25% of transactions are $6.00 or less.
# 6. 50%: 18.11
#    The median: half of the transactions are $18.11 or less.
# 7. 75%: 65.85
#    The third quartile: 75% of transactions are $65.85 or less.
# 8. max: 7,879.42
#    The largest amount recorded.
# Summary:
# Amounts range from $0 to $7,879.42. Most transactions are small (median $18.11), but a few very large ones pull the mean up to $73.88 and produce the high standard deviation.



Unnamed: 0,Amount
count,23769.0
mean,73.880199
std,212.541174
min,0.0
25%,6.0
50%,18.11
75%,65.85
max,7879.42


In [None]:
fraud.Amount.describe()

Unnamed: 0,Amount
count,88.0
mean,100.01
std,265.845031
min,0.0
25%,1.0
50%,1.0
75%,99.99
max,1809.68


In [None]:
# compare the values for both transactions
credit_card_data.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0.0,18213.77782,-0.208135,0.175983,0.774952,0.226463,-0.166866,0.092304,-0.101868,0.006906,0.512306,...,0.038311,-0.043337,-0.136382,-0.037279,0.014613,0.127111,0.02694,0.010111,0.004578,73.880199
1.0,17935.875,-8.613716,6.376169,-12.221731,6.231847,-6.027247,-2.48708,-8.308784,4.351326,-2.987199,...,0.714069,0.539387,-0.381823,-0.350615,-0.25297,0.346695,0.17976,0.856336,0.100578,100.01


Under-Sampling

Build a sample dataset containing an equal number of normal and fraudulent transactions.

Number of Fraudulent Transactions --> 88

In [None]:
legit_sample = legit.sample(n=88)

Concatenating two DataFrames

In [None]:
new_dataset = pd.concat([legit_sample, fraud], axis=0)

In [None]:
new_dataset.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
11240,19510,0.975456,-0.229698,-0.205088,1.091675,0.366861,0.749329,-0.092772,0.145041,1.324778,...,-0.023755,-0.159234,-0.352941,-1.435432,0.623995,-0.232061,-0.034286,0.009128,144.0,0.0
2374,1925,-0.473115,0.996275,2.493954,3.11575,-0.480267,0.50749,-0.095556,0.360578,-0.610364,...,0.199908,0.900588,-0.065819,0.637144,-0.447663,0.344693,0.447278,0.222931,8.34,0.0
21874,31906,-0.345979,1.094568,1.282208,0.068241,-0.016965,-0.989567,0.693958,-0.050855,-0.333856,...,-0.266852,-0.723026,-0.003364,0.322646,-0.173496,0.073145,0.241997,0.09761,5.99,0.0
14060,25029,1.238249,0.038902,0.589443,0.174498,-0.621018,-0.859388,-0.262618,-0.136791,1.680149,...,-0.225245,-0.400549,0.142163,0.36854,0.063588,0.935525,-0.100558,-0.005641,0.85,0.0
7964,10979,-2.33456,1.861316,0.164384,-0.816969,-1.402592,-0.549793,-1.206648,1.690017,0.688121,...,0.239931,0.277885,-0.000503,0.008197,-0.360606,0.822897,-0.646044,-0.102,14.95,0.0


In [None]:
new_dataset.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
18773,29753,0.269614,3.549755,-5.810353,5.80937,1.538808,-2.269219,-0.824203,0.35107,-3.759059,...,0.371121,-0.32229,-0.549856,-0.520629,1.37821,0.564714,0.553255,0.4024,0.68,1.0
18809,29785,0.923764,0.344048,-2.880004,1.72168,-3.019565,-0.639736,-3.801325,1.299096,0.864065,...,0.899931,1.481271,0.725266,0.17696,-1.815638,-0.536517,0.489035,-0.049729,30.3,1.0
20198,30852,-2.830984,0.885657,1.19993,2.861292,0.321669,0.289966,1.76776,-2.45105,0.069736,...,0.546589,0.334971,0.172106,0.62359,-0.527114,-0.079215,-2.532445,0.311177,104.81,1.0
23308,32686,0.287953,1.728735,-1.652173,3.813544,-1.090927,-0.984745,-2.202318,0.555088,-2.033892,...,0.262202,-0.633528,0.092891,0.187613,0.368708,-0.132474,0.576561,0.309843,0.0,1.0
23422,32745,-2.179135,0.020218,-2.182733,2.572046,-3.663733,0.081568,0.268049,0.660437,-2.374027,...,1.026421,0.299614,1.6568,0.328433,0.106457,0.691775,0.196779,0.241085,717.15,1.0


In [None]:
new_dataset['Class'].value_counts()

Class,count
0.0,88
1.0,88
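
The under-sampling steps above draw a different random sample each time the notebook runs, so the rows and later accuracy figures will vary between runs. A minimal sketch of a reproducible variant, on a made-up toy DataFrame rather than `creditcard.csv`: the names `toy`, `legit_toy`, `fraud_toy`, and `balanced` are hypothetical, and the seed value 42 is an arbitrary choice.

```python
import pandas as pd

# Toy imbalanced data (made up for illustration): 20 legit rows, 4 fraud rows.
toy = pd.DataFrame({
    "Amount": [float(i) for i in range(24)],
    "Class": [0] * 20 + [1] * 4,
})

legit_toy = toy[toy["Class"] == 0]
fraud_toy = toy[toy["Class"] == 1]

# Draw as many legit rows as there are fraud rows; random_state fixes the draw.
legit_toy_sample = legit_toy.sample(n=len(fraud_toy), random_state=42)

# Stack both parts, then shuffle so the classes are not grouped together.
balanced = (
    pd.concat([legit_toy_sample, fraud_toy], axis=0)
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced["Class"].value_counts().sort_index())
```

Using `len(fraud_toy)` instead of a hard-coded count keeps the two classes equal in size even if the number of fraud rows changes.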


In [None]:
new_dataset.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0.0,17256.704545,-0.339554,0.349329,0.772506,0.145261,-0.246359,-0.114096,-0.186608,-0.304003,0.520652,...,-0.040408,0.279324,-0.278328,-0.004015,-0.04502,0.104057,0.074795,-0.010969,0.024045,56.041364
1.0,17935.875,-8.613716,6.376169,-12.221731,6.231847,-6.027247,-2.48708,-8.308784,4.351326,-2.987199,...,0.714069,0.539387,-0.381823,-0.350615,-0.25297,0.346695,0.17976,0.856336,0.100578,100.01


Splitting the data into Features & Targets



In [None]:
X = new_dataset.drop(columns='Class')
Y = new_dataset['Class']

In [None]:
print(X)

        Time        V1        V2        V3        V4        V5        V6  \
11240  19510  0.975456 -0.229698 -0.205088  1.091675  0.366861  0.749329   
2374    1925 -0.473115  0.996275  2.493954  3.115750 -0.480267  0.507490   
21874  31906 -0.345979  1.094568  1.282208  0.068241 -0.016965 -0.989567   
14060  25029  1.238249  0.038902  0.589443  0.174498 -0.621018 -0.859388   
7964   10979 -2.334560  1.861316  0.164384 -0.816969 -1.402592 -0.549793   
...      ...       ...       ...       ...       ...       ...       ...   
18773  29753  0.269614  3.549755 -5.810353  5.809370  1.538808 -2.269219   
18809  29785  0.923764  0.344048 -2.880004  1.721680 -3.019565 -0.639736   
20198  30852 -2.830984  0.885657  1.199930  2.861292  0.321669  0.289966   
23308  32686  0.287953  1.728735 -1.652173  3.813544 -1.090927 -0.984745   
23422  32745 -2.179135  0.020218 -2.182733  2.572046 -3.663733  0.081568   

             V7        V8        V9  ...       V20       V21       V22  \
11240 -0.0927

In [None]:
print(Y)

11240    0.0
2374     0.0
21874    0.0
14060    0.0
7964     0.0
        ... 
18773    1.0
18809    1.0
20198    1.0
23308    1.0
23422    1.0
Name: Class, Length: 176, dtype: float64


Split the data into Training data & Testing Data

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

In [None]:
print(X.shape, X_train.shape, X_test.shape)

(176, 30) (140, 30) (36, 30)
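
The `stratify=Y` argument in the split above asks `train_test_split` to keep the class proportions the same in the training and test parts. A small sketch of that behaviour on toy arrays (the names `X_toy` and `y_toy` and the 80/20 class mix are assumptions made for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative): 100 rows with 80 of class 0 and 20 of class 1.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=2
)

# With stratify, both parts keep the original 80/20 class ratio.
print("train class-1 share:", y_tr.mean())  # 16 of 80 rows
print("test class-1 share:", y_te.mean())   # 4 of 20 rows
```

Without `stratify`, a small test set could by chance contain very few rows of one class, which would make the test score less informative.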


Model Training

Logistic Regression

In [None]:
model = LogisticRegression()

In [None]:
# training the Logistic Regression Model with Training Data
model.fit(X_train, Y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
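
The warning above suggests either raising `max_iter` or scaling the data. One way to follow the scaling suggestion is a pipeline that standardizes the features before fitting. The sketch below runs on synthetic data, not on `creditcard.csv`; the names `features`, `labels`, and `scaled_model` are hypothetical, and `max_iter=1000` is an arbitrary choice rather than a tuned value.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the notebook's features (not the real data):
# one column on a large scale, like Time, and one on a small scale, like V1.
n = 200
labels = rng.integers(0, 2, size=n)
large_scale = rng.uniform(0, 30000, size=n)
small_scale = rng.normal(loc=2.0 * labels, scale=1.0, size=n)
features = np.column_stack([large_scale, small_scale])

# Standardize each column to zero mean and unit variance before fitting.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(features, labels)

print("training accuracy:", scaled_model.score(features, labels))
```

Putting the scaler inside the pipeline means it is fitted only on the data passed to `fit`, so applying the same approach to `X_train` would not leak test-set statistics into training.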


In [None]:
# Alternative (not run here): a different solver with a higher iteration limit,
# which may avoid the convergence warning shown above.
# model = LogisticRegression(solver='liblinear', max_iter=1000)
# model.fit(X_train, Y_train)

In [None]:
# accuracy on training data
# accuracy_score expects (y_true, y_pred); the value is the same either way
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [None]:
print('Accuracy on Training data : ', training_data_accuracy)

Accuracy on Training data :  0.9642857142857143


In [None]:
# accuracy on test data
# accuracy_score expects (y_true, y_pred); the value is the same either way
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [None]:
print('Accuracy score on Test Data : ', test_data_accuracy)

Accuracy score on Test Data :  0.9166666666666666
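
Accuracy does not say how many fraudulent transactions were caught or how many legit ones were wrongly flagged. A confusion matrix with precision and recall makes that visible. The sketch below uses made-up labels purely for illustration; they are not this notebook's predictions, and the same calls could be applied to `Y_test` and `X_test_prediction`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration (1 = fraud); not the notebook's predictions.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("precision:", precision)
print("recall:", recall)
```

Note that the test set here is balanced by under-sampling, so these scores describe performance on a 50/50 mix and may not carry over to the original, highly unbalanced data.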
