<a href="https://colab.research.google.com/github/Madhusudan3223/Credit_Card_Fraud_Detection_using_Machine_Learning/blob/main/Credit_Card_Fraud_Detection_using_Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Project Type - Credit Card Fraud Detection using Machine Learning**
## Contribution - Individual
### Name - Madhusudan Mandal

About Data:

This dataset contains credit card transactions in 31 columns: 30 features and a class label. The class label indicates whether the transaction was fraudulent (class 1) or not (class 0).

The first feature is "Time", the number of seconds elapsed between the transaction and the first transaction in the dataset. The next 28 features, V1 to V28, are anonymized variables produced by a principal component analysis (PCA) transformation of the original features. Because they are anonymized, they cannot be mapped back to specific transaction attributes such as location or type.

The second-to-last column is "Amount", the transaction amount in USD. The last column is the "Class" label described above.

Overall, this dataset is used to train machine learning models to detect fraudulent transactions. The model learns patterns from these features and applies them to classify future transactions.


# **Project Summary**

Based on the analysis and the Logistic Regression model trained on the balanced dataset:

- The dataset was highly imbalanced, with far fewer fraudulent transactions (492) than legitimate ones (284,315).
- A balanced dataset was created by undersampling the legitimate transactions to match the number of fraudulent transactions.
- A Logistic Regression model was trained on this balanced dataset.
- The model achieved an accuracy of approximately 94.79% on the training data and 91.88% on the test data.

This indicates that the model distinguishes legitimate from fraudulent transactions with a reasonable degree of accuracy on this balanced dataset. Further steps could include exploring other evaluation metrics (such as precision, recall, and F1-score) and trying different machine learning models to potentially improve performance.


To check whether a transaction is fraudulent using the trained model, pass the features of that transaction to the model's predict method. The model outputs a prediction: 0 for legitimate, 1 for fraudulent.

The final code cell of this notebook shows an example. Replace the example input_data with the actual features of the transaction you want to check.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
credit_card_data = pd.read_csv('/content/creditcard.csv.zip')

In [None]:
credit_card_data.head()

index,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [None]:
credit_card_data.sample()

index,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
210494,138006.0,1.85386,-0.225378,-1.759708,0.352318,0.140858,-0.783001,0.118242,-0.143414,0.779135,...,-0.169573,-0.426644,0.067708,-0.447235,-0.079322,-0.096437,-0.025135,-0.021392,85.66,0


In [None]:
# dataset information
credit_card_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [None]:
# checking the number of missing values in each column
credit_card_data.isnull().sum()

Column,Missing values
Time,0
V1,0
V2,0
V3,0
V4,0
V5,0
V6,0
V7,0
V8,0
V9,0


In [None]:
# distribution of legit transactions & fraudulent transactions
credit_card_data['Class'].value_counts()

Class,count
0,284315
1,492


This dataset is highly imbalanced: only 492 of the 284,807 transactions (about 0.17%) are fraudulent.

0 --> Normal transaction

1 --> Fraudulent transaction
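
The imbalance can be expressed as each class's share of all rows. The following is a minimal sketch that rebuilds only the Class column from the counts reported above; `class_labels` is a hypothetical stand-in, not the real dataframe.

```python
import pandas as pd

# Hypothetical stand-in for credit_card_data['Class'], built from the reported counts.
class_labels = pd.Series([0] * 284315 + [1] * 492, name="Class")

# normalize=True returns each class's share of all rows instead of raw counts.
class_share = class_labels.value_counts(normalize=True)
fraud_share = class_share.loc[1]

print(f"Fraud share: {fraud_share:.4%}")
```

On the real data, the same call would be `credit_card_data['Class'].value_counts(normalize=True)`.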

The first line of code creates a new dataframe called "legit" by selecting only the rows from the original "credit_card_data" dataframe where the "Class" label is equal to 0. In other words, it filters out all transactions labeled as fraudulent (Class == 1) and keeps only the legitimate transactions (Class == 0).

The second line of code creates a new dataframe called "fraud" by selecting only the rows from the original "credit_card_data" dataframe where the "Class" label is equal to 1. This filters out all legitimate transactions and keeps only the fraudulent transactions.

By separating the data into two dataframes, it becomes easier to analyze and compare the characteristics of legitimate and fraudulent transactions separately. This can be useful for identifying patterns or features that are more common in fraudulent transactions, which can then be used to develop models for fraud detection.

In [None]:
legit = credit_card_data[credit_card_data['Class'] == 0]
fraud = credit_card_data[credit_card_data['Class'] == 1]

In [None]:
fraud['Class']

index,Class
541,1
623,1
4920,1
6108,1
6329,1
...,...
279863,1
280143,1
280149,1
281144,1


In [None]:
# statistical measures of the data
legit.Amount.describe()


Statistic,Amount
count,284315.0
mean,88.291022
std,250.105092
min,0.0
25%,5.65
50%,22.0
75%,77.05
max,25691.16


These statistics describe the "Amount" column of the legitimate transactions only:

- "count" is the number of legitimate transactions that have a valid (non-missing) "Amount" value: 284,315.
- "mean" is the average transaction amount. Here it is 88.291022, so the average legitimate transaction is about \$88.
- "std" is the standard deviation, a measure of how far the amounts spread from the mean. Here it is 250.105092, almost three times the mean, so the amounts vary widely and some transactions are very large.
- "min" is the smallest transaction amount. Here it is 0.0, so at least one legitimate transaction has a zero amount.
- "25%" is the first quartile, the value at or below which 25% of the transactions fall: \$5.65.
- "50%" is the median, the value that separates the lower half of the transactions from the upper half: \$22.00.
- "75%" is the third quartile, the value at or below which 75% of the transactions fall: \$77.05.
- "max" is the largest transaction amount: \$25,691.16, far above the third quartile, which confirms a long right tail.
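
Each entry reported by describe() corresponds to a dedicated pandas method. The following sketch makes that mapping explicit; the `amounts` values are made up for illustration and stand in for `legit['Amount']`.

```python
import pandas as pd

# Hypothetical transaction amounts standing in for legit['Amount'].
amounts = pd.Series([0.0, 5.65, 22.0, 77.05, 250.0], name="Amount")

summary = amounts.describe()

# The same eight statistics computed one by one.
manual = pd.Series({
    "count": amounts.count(),        # number of non-missing values
    "mean": amounts.mean(),
    "std": amounts.std(),            # sample standard deviation (ddof=1)
    "min": amounts.min(),
    "25%": amounts.quantile(0.25),   # first quartile
    "50%": amounts.median(),         # median
    "75%": amounts.quantile(0.75),   # third quartile
    "max": amounts.max(),
})

print(pd.concat([summary, manual], axis=1, keys=["describe", "manual"]))
```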

In [None]:
fraud.Amount.describe()

Statistic,Amount
count,492.0
mean,122.211321
std,256.683288
min,0.0
25%,1.0
50%,9.25
75%,105.89
max,2125.87


In [None]:
# compare the mean feature values of legitimate and fraudulent transactions
credit_card_data.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,94838.202258,0.008258,-0.006271,0.012171,-0.00786,0.005453,0.002419,0.009637,-0.000987,0.004467,...,-0.000644,-0.001235,-2.4e-05,7e-05,0.000182,-7.2e-05,-8.9e-05,-0.000295,-0.000131,88.291022
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


Build a sample dataset containing an equal number of normal and fraudulent transactions

Number of Fraudulent Transactions --> 492

legit_sample = legit.sample(n=492) takes a random sample of 492 rows from the legit dataframe so that it matches the number of rows in fraud. The original dataset has far more legitimate than fraudulent transactions, so a model trained on it directly may be biased towards predicting that every transaction is legitimate. Training on a balanced dataset with equal numbers of each class helps the model learn the patterns that differentiate fraudulent transactions from legitimate ones.
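
Because sample() draws different rows on every run unless it is seeded, a reproducible variant of the undersampling step may be useful. The sketch below uses a small made-up dataframe; `data`, `balanced`, and the seed value 42 are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical imbalanced data: 1,000 legitimate rows and 20 fraudulent rows.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Amount": rng.uniform(0, 500, size=1020).round(2),
    "Class": [0] * 1000 + [1] * 20,
})

legit = data[data["Class"] == 0]
fraud = data[data["Class"] == 1]

# Draw as many legitimate rows as there are fraud rows; random_state makes the draw repeatable.
legit_sample = legit.sample(n=len(fraud), random_state=42)

# Stack both groups, then shuffle so the classes are not stored in two contiguous blocks.
balanced = pd.concat([legit_sample, fraud], axis=0).sample(frac=1, random_state=42)

print(balanced["Class"].value_counts())
```

Using `n=len(fraud)` instead of a hard-coded 492 keeps the two classes equal if the data changes.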

In [None]:
legit_sample = legit.sample(n=492)

In [None]:
new_df = pd.concat([legit_sample,fraud],axis=0)

In [None]:
new_df

index,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
137050,81995.0,1.276662,-0.434916,0.130787,-0.012057,-0.109492,0.917876,-0.647481,0.172635,-1.127163,...,-0.171493,0.023615,-0.237523,-1.343190,0.616359,-0.145780,0.073914,0.012496,37.51,0
206790,136372.0,-0.013977,0.484124,0.020728,-0.393635,-0.411396,0.186364,-0.321803,0.711808,0.262430,...,0.193341,0.396937,0.467446,0.615233,-1.248645,0.127690,-0.064171,-0.002171,73.35,0
250655,155008.0,-1.440668,0.041710,0.610007,-2.820097,-1.664394,-0.877558,-0.884012,0.786296,-1.961644,...,-0.319893,-0.610358,0.215085,-0.015094,-0.377717,-0.573045,0.309037,0.067352,42.46,0
237740,149371.0,2.008088,-0.417225,-3.117988,-0.649756,2.667963,3.139857,-0.102022,0.646951,0.230551,...,-0.016967,-0.034558,0.086659,0.763056,0.200858,0.578063,-0.084542,-0.077400,39.95,0
45045,42182.0,-0.572632,-0.112968,0.861649,-2.471848,0.435680,-0.099465,0.351239,-0.230553,1.745705,...,-0.032063,0.428610,-0.570310,-0.883251,0.255838,-0.647711,-0.189453,-0.096516,22.79,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
279863,169142.0,-1.927883,1.125653,-4.518331,1.749293,-1.566487,-2.010494,-0.882850,0.697211,-2.064945,...,0.778584,-0.319189,0.639419,-0.294885,0.537503,0.788395,0.292680,0.147968,390.00,1
280143,169347.0,1.378559,1.289381,-5.004247,1.411850,0.442581,-1.326536,-1.413170,0.248525,-1.127396,...,0.370612,0.028234,-0.145640,-0.081049,0.521875,0.739467,0.389152,0.186637,0.76,1
280149,169351.0,-0.676143,1.126366,-2.213700,0.468308,-1.120541,-0.003346,-2.234739,1.210158,-0.652250,...,0.751826,0.834108,0.190944,0.032070,-0.739695,0.471111,0.385107,0.194361,77.89,1
281144,169966.0,-3.113832,0.585864,-5.399730,1.817092,-0.840618,-2.943548,-2.208002,1.058733,-1.632333,...,0.583276,-0.269209,-0.456108,-0.183659,-0.328168,0.606116,0.884876,-0.253700,245.00,1


In [None]:
new_df['Class'].value_counts()

Class,count
0,492
1,492


In [None]:
new_df.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,93167.857724,-0.101606,0.066073,0.148858,-0.032233,-0.043371,-0.047827,0.060177,0.004154,0.027325,...,0.003138,0.000475,-0.050825,-0.013948,-0.013402,-0.040302,-0.013903,-0.01894,0.001568,87.635955
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


In [None]:
# features: every column except the label (columns= already implies axis=1)
X = new_df.drop(columns='Class')
# target: the class label
Y = new_df['Class']

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)
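
The stratify argument keeps the class proportions the same in the training and test subsets. A minimal sketch of that effect follows, using random stand-in features; `features`, `labels`, and the column names f1 to f3 are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced data mirroring the 492/492 class counts of new_df.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(984, 3)), columns=["f1", "f2", "f3"])
labels = pd.Series([0] * 492 + [1] * 492, name="Class")

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=2
)

# With stratify, the fraud share is nearly identical in both subsets.
print("train rows:", len(y_tr), "fraud share:", y_tr.mean())
print("test rows: ", len(y_te), "fraud share:", y_te.mean())
```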

# **Model Training**
Logistic Regression

In [34]:
model = LogisticRegression(max_iter=1000)

In [35]:
# training the Logistic Regression Model with Training Data
model.fit(X_train, Y_train)
# accuracy on training data
X_train_prediction = model.predict(X_train)
# accuracy_score expects the true labels first, then the predictions
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print('Accuracy on Training data : ', training_data_accuracy)

Accuracy on Training data :  0.9479034307496823


In [36]:
# accuracy on test data
X_test_prediction = model.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print('Accuracy score on Test Data : ', test_data_accuracy)

Accuracy score on Test Data :  0.9187817258883249
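
Accuracy alone does not show which kind of error the model makes. As a hedged sketch of the precision, recall, and F1-score follow-up mentioned in the project summary, the block below trains the same kind of model on synthetic data from make_classification; `X_demo`, `y_demo`, and `demo_model` are illustrative names, and the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for the balanced transaction set.
X_demo, y_demo = make_classification(n_samples=984, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

demo_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = demo_model.predict(X_te)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Precision, recall, and F1-score per class (label 1 plays the role of fraud).
print(classification_report(y_te, y_pred, target_names=["legitimate", "fraudulent"]))
```

In the notebook itself, the equivalent calls would take `Y_test` and `X_test_prediction`. For fraud detection, recall on class 1 (the share of fraudulent transactions that are caught) is usually the figure to watch.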


In [37]:
# Example of how to check if a transaction is fraudulent

# Replace this with the actual input data for the transaction you want to check
# The input data should be a NumPy array or a pandas DataFrame with the same columns as the training data (excluding the 'Class' column)
input_data = np.array([[0.0, -1.359807, -0.072781, 2.536347, 1.378155, -0.338321, 0.462388, 0.239599, 0.098698, 0.363787, 0.090794, -0.551600, -0.617801, -0.991390, -0.311169, 1.468177, -0.470401, 0.207971, 0.025791, 0.403993, 0.251412, -0.018307, 0.277838, -0.110474, 0.066928, 0.128539, -0.189115, 0.133558, -0.021053, 149.62]])

# Get the column names from the training data
input_data_df = pd.DataFrame(input_data, columns=X.columns)

# Make the prediction
prediction = model.predict(input_data_df)

if prediction[0] == 0:
    print('Legitimate transaction')
else:
    print('Fraudulent transaction')

Legitimate transaction
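
predict returns only the final label. When a confidence score is more useful, LogisticRegression also provides predict_proba. The following sketch runs on synthetic data; `X_demo`, `demo_model`, and the 0.3 threshold are assumptions chosen for illustration, not values tuned for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data; in the notebook this would be X_train / Y_train.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# predict_proba returns one column per class, ordered as in demo_model.classes_.
fraud_column = list(demo_model.classes_).index(1)
fraud_probability = demo_model.predict_proba(X_demo[:5])[:, fraud_column]

# Illustrative threshold (an assumption): lowering it flags more transactions as fraud.
threshold = 0.3
flagged = fraud_probability >= threshold

for p, is_flagged in zip(fraud_probability, flagged):
    label = "Fraudulent" if is_flagged else "Legitimate"
    print(f"P(fraud) = {p:.3f} -> {label} transaction")
```

A threshold below 0.5 trades more false alarms for fewer missed frauds; the right value depends on the cost of each kind of error.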

