# Credit Card Fraud Detection using Machine Learning

# About Data:
This dataset contains credit card transactions in 31 columns: 30 features and a class label. The features describe the transaction, and the class label indicates whether the transaction was fraudulent (class 1) or not (class 0).

The first feature is "Time", the number of seconds elapsed between each transaction and the first transaction in the dataset. The next 28 features, V1 to V28, are anonymized variables produced by a principal component analysis (PCA) transformation of the original features. Because of this transformation, an individual component cannot be read as a specific attribute of the transaction, such as its location or type.

The second-to-last column is "Amount", the transaction amount in USD. The last column is the "Class" label, which indicates whether the transaction is fraudulent (class 1) or not (class 0).

Overall, this dataset is used to train machine learning models that detect fraudulent transactions. A model learns patterns from the features and can then use them to flag likely fraud in new transactions.
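
The column layout described above can be checked before any modeling. The following is a minimal sketch, assuming the file follows exactly the Time, V1 to V28, Amount, Class order; `check_layout` and `EXPECTED_COLUMNS` are hypothetical names, and the small dataframe is synthetic stand-in data, not the real file.

```python
import numpy as np
import pandas as pd

# Expected layout as described above: Time, V1..V28, Amount, Class (31 columns).
EXPECTED_COLUMNS = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]


def check_layout(df: pd.DataFrame) -> None:
    """Hypothetical helper: raise ValueError if df does not match the layout."""
    if list(df.columns) != EXPECTED_COLUMNS:
        raise ValueError("unexpected column names or order")
    if not set(df["Class"].unique()) <= {0, 1}:
        raise ValueError("Class must contain only 0 and 1")


# Tiny synthetic stand-in for creditcard.csv, used only to exercise the check.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(5, 31)), columns=EXPECTED_COLUMNS)
demo["Class"] = [0, 0, 0, 0, 1]

check_layout(demo)
print(len(EXPECTED_COLUMNS), "columns")
```

In the notebook, the same call would be `check_layout(credit_card_data)` right after loading the CSV.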

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
credit_card_data = pd.read_csv('creditcard.csv')
credit_card_data.head()

In [None]:
credit_card_data.sample()

In [None]:
# dataset information (column dtypes and non-null counts)
credit_card_data.info()

In [None]:
# checking the number of missing values in each column
credit_card_data.isnull().sum()

In [None]:
# distribution of legit transactions & fraudulent transactions
credit_card_data['Class'].value_counts()

This dataset is highly unbalanced: legitimate transactions far outnumber fraudulent ones.

0 --> Normal transaction

1 --> Fraudulent transaction
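
To put a number on the imbalance, the class counts can be shown next to their shares. This is a small sketch on synthetic labels; `class_balance` is a hypothetical helper, and the 998-to-2 split is made up for illustration rather than taken from the dataset.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: count and share of each class label."""
    counts = labels.value_counts().sort_index()
    share = labels.value_counts(normalize=True).sort_index()
    return pd.DataFrame({"count": counts, "share": share})


# Synthetic labels: 998 normal (0) and 2 fraudulent (1) transactions.
demo_labels = pd.Series([0] * 998 + [1] * 2, name="Class")
balance = class_balance(demo_labels)
print(balance)
```

In the notebook, `class_balance(credit_card_data['Class'])` would give the real figures.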

The first line of the cell below creates a new dataframe called "legit" by selecting only the rows of "credit_card_data" where the "Class" label equals 0. In other words, it drops every transaction labeled as fraudulent (Class == 1) and keeps only the legitimate ones (Class == 0).

The second line creates a new dataframe called "fraud" by selecting only the rows of "credit_card_data" where the "Class" label equals 1. This drops every legitimate transaction and keeps only the fraudulent ones.

By separating the data into two dataframes, it becomes easier to analyze and compare the characteristics of legitimate and fraudulent transactions separately. This can be useful for identifying patterns or features that are more common in fraudulent transactions, which can then be used to develop models for fraud detection.

In [None]:
# split the data by class label using boolean masks
legit = credit_card_data[credit_card_data['Class'] == 0]
fraud = credit_card_data[credit_card_data['Class'] == 1]

In [None]:
fraud['Class']

In [None]:
# summary statistics of the legitimate transaction amounts
legit.Amount.describe()

The output above summarizes the "Amount" column for the legitimate transactions only:
"count" is the number of legitimate transactions with a valid (non-missing) "Amount" value.
"mean" is the average amount of the legitimate transactions. Here the mean is 88.291022, so the average legitimate transaction is about $88 USD.
"std" is the standard deviation of the amounts, a measure of how far they spread around the mean. Here it is 250.105092, almost three times the mean, so the amounts vary widely and some are very large.
"min" is the smallest amount. Here it is 0.0, so at least one transaction has a value of zero.
"25%" is the first quartile, the value at or below which 25% of the transactions fall. Here it is 5.65, so a quarter of the transactions are for $5.65 USD or less.
"50%" is the median, the value that separates the lower half of the transactions from the upper half. Here it is 22.0, so half of the transactions are for $22 USD or less. The median is well below the mean, which indicates a right-skewed distribution.
"75%" is the third quartile, the value at or below which 75% of the transactions fall. Here it is 77.05, so three quarters of the transactions are for $77.05 USD or less.
"max" is the largest amount. Here it is 25,691.16, far above the third quartile, which confirms that a few transactions are very large.
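
Each row of the `describe()` output corresponds to a single pandas call, which makes the definitions above easy to verify. The sketch below uses five synthetic amounts, not values from the dataset.

```python
import pandas as pd

# Synthetic amounts chosen only to make the arithmetic easy to follow.
amounts = pd.Series([0.0, 5.0, 10.0, 20.0, 100.0], name="Amount")
summary = amounts.describe()

# The same statistics computed one by one.
manual = {
    "count": amounts.count(),         # number of non-missing values
    "mean": amounts.mean(),
    "std": amounts.std(),             # sample standard deviation (ddof=1)
    "min": amounts.min(),
    "25%": amounts.quantile(0.25),    # first quartile
    "50%": amounts.median(),          # median
    "75%": amounts.quantile(0.75),    # third quartile
    "max": amounts.max(),
}

for name, value in manual.items():
    assert abs(summary[name] - value) < 1e-9, name
print(summary)
```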

In [None]:
fraud.Amount.describe()

In [None]:
# compare the mean of every feature for legitimate (0) and fraudulent (1) transactions
credit_card_data.groupby('Class').mean()

Build a balanced sample dataset containing equal numbers of normal and fraudulent transactions

Number of Fraudulent Transactions --> 492

The line legit_sample = legit.sample(n=492) draws a random sample of 492 rows from the legit dataframe, matching the 492 fraudulent transactions. This is random under-sampling, one simple way to handle the class imbalance. Because the original dataset contains far more legitimate transactions than fraudulent ones, a model trained on it directly can score well by predicting that almost every transaction is legitimate. Training on a balanced sample with equal numbers of both classes pushes the model to learn the patterns that separate fraudulent transactions from legitimate ones. Note that no random_state is passed to sample(), so a different set of legitimate rows is drawn on every run and the results will vary slightly.

In [None]:
legit_sample = legit.sample(n=492)

In [None]:
new_df = pd.concat([legit_sample,fraud],axis=0)
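
The sampling and concatenation steps above can be wrapped in one reusable function that reads the minority size from the data instead of hard-coding 492. This is a sketch, not the notebook's method: `undersample_majority` is a hypothetical helper, the fixed `random_state` is an added assumption for reproducibility, and the dataframe is synthetic.

```python
import pandas as pd


def undersample_majority(df, label_col="Class", random_state=0):
    """Hypothetical helper: sample every class down to the smallest class size."""
    minority_size = df[label_col].value_counts().min()
    parts = [
        group.sample(n=minority_size, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts, axis=0)


# Synthetic data: 95 normal rows and 5 fraudulent rows.
demo = pd.DataFrame({"Amount": range(100), "Class": [0] * 95 + [1] * 5})
balanced = undersample_majority(demo, random_state=2)
print(balanced["Class"].value_counts())
```

Passing the same `random_state` returns the same rows, so downstream metrics stay comparable between runs.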

In [None]:
new_df

In [None]:
new_df['Class'].value_counts()

In [None]:
new_df.groupby('Class').mean()

In [None]:
# split into features (X) and target (Y); columns= already implies axis=1
X = new_df.drop(columns='Class')
Y = new_df['Class']

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)
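
The `stratify=Y` argument in the split above keeps the class proportions the same in the training and test sets. A small sketch on synthetic arrays illustrates the effect; the 50/50 data below is made up and only mirrors the balanced sample.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic balanced data: 50 rows of class 0 and 50 rows of class 1.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 50 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# With stratification, both splits keep the 50/50 class ratio.
print("train fraud share:", y_tr.mean())
print("test fraud share:", y_te.mean())
```

Without `stratify`, a small test set can end up with noticeably more of one class by chance.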

# Model Training

Logistic Regression

In [None]:
model = LogisticRegression()

In [None]:
# training the Logistic Regression model with the training data
model.fit(X_train, Y_train)
# accuracy on training data (accuracy_score expects y_true first, then y_pred)
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print('Accuracy on Training data : ', training_data_accuracy)
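
"Time" and "Amount" are on much larger scales than the PCA components, so the default solver may reach its iteration limit on the raw features and emit a ConvergenceWarning. One common remedy, shown here as a sketch rather than as part of the original notebook, is to standardize the features inside a pipeline. The data comes from `make_classification` and is synthetic; `max_iter=1000` is an assumed value, not a tuned one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with one column blown up to mimic a large-scale feature.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_demo[:, 0] *= 10_000

# The scaler is fitted on the training data only, then applied before the model.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)
train_score = scaled_model.score(X_demo, y_demo)
print("Training accuracy:", train_score)
```

In the notebook, `scaled_model.fit(X_train, Y_train)` could stand in for `model.fit(X_train, Y_train)`, with `predict` used in the same way.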

In [None]:
# accuracy on test data (accuracy_score expects y_true first, then y_pred)
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print('Accuracy score on Test Data : ', test_data_accuracy)
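
Accuracy is a reasonable summary on the balanced sample, but for fraud detection it usually helps to also see how many fraudulent transactions were caught (recall) and how many fraud alerts were correct (precision). The sketch below uses hand-made labels, not the notebook's predictions, to show the calls.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made example: 4 legitimate (0) and 4 fraudulent (1) transactions.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

print(cm)
print("precision:", precision)
print("recall:", recall)
```

In the notebook, the same functions would be called with `Y_test` and `X_test_prediction`.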
