# Credit Card Fraud Detection Project

## Introduction
The Credit Card Fraud Detection Project aims to develop a machine learning-based system capable of identifying and preventing fraudulent activities within financial transactions. Leveraging advanced analytics and predictive modeling techniques, the system distinguishes between legitimate and fraudulent transactions, helping financial institutions mitigate risks and protect customers from fraudulent behavior.

## Project Objective
The primary objective of this project is to build and deploy an effective fraud detection system that can:
- Identify anomalous patterns or deviations from normal transaction behavior.
- Predict fraudulent transactions with high accuracy.
- Provide real-time monitoring and alerts for potential fraudulent activities.
- Scale efficiently to handle large volumes of transactions.

## Dataset Description
The dataset used in this project contains a collection of credit card transactions, including features such as transaction time, transaction amount, and anonymized features (V1-V28) obtained through principal component analysis (PCA) for confidentiality reasons. The 'Class' column indicates whether a transaction is fraudulent (Class=1) or legitimate (Class=0).

## Key Components and Challenges
1. **Anomaly Detection**: Identifying unusual patterns or deviations from normal behavior within transaction data.
2. **Machine Learning Models**: Employing algorithms like Logistic Regression, Random Forest, or Neural Networks for predictive analysis.
3. **Feature Engineering**: Selecting and transforming relevant features to enhance fraud detection accuracy.
4. **Real-time Monitoring**: Implementing systems that can detect and respond to fraudulent activities in real-time.
5. **Scalability**: Designing fraud detection systems capable of handling large volumes of transactions efficiently.

## Methodology
1. Data Loading and Preparation
2. Exploratory Data Analysis (EDA)
3. Model Selection and Training
4. Model Evaluation and Fine-tuning
5. Deployment and Integration

## Conclusion
The Credit Card Fraud Detection Project is an essential initiative to safeguard financial transactions and protect consumers from fraudulent activities. By leveraging advanced machine learning techniques, this project aims to enhance fraud detection capabilities and contribute to a more secure financial ecosystem.

For more details, refer to the project code and documentation.


# Installation of Required Libraries

First, we need to install the necessary Python libraries to execute the code successfully. We'll install the following libraries:

- matplotlib: For creating visualizations.
- pandas: For data manipulation and analysis.
- numpy: For numerical computing.
- seaborn: For statistical data visualization.
- scikit-learn: For machine learning algorithms and tools.

In [11]:
import matplotlib.pyplot as plt 
import pandas as pd 
import numpy  as np
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from joblib import dump


In [12]:
df= pd.read_csv('creditcard.csv')
df.head().T

Unnamed: 0,0,1,2,3,4
Time,0.0,0.0,1.0,1.0,2.0
V1,-1.359807,1.191857,-1.358354,-0.966272,-1.158233
V2,-0.072781,0.266151,-1.340163,-0.185226,0.877737
V3,2.536347,0.16648,1.773209,1.792993,1.548718
V4,1.378155,0.448154,0.37978,-0.863291,0.403034
V5,-0.338321,0.060018,-0.503198,-0.010309,-0.407193
V6,0.462388,-0.082361,1.800499,1.247203,0.095921
V7,0.239599,-0.078803,0.791461,0.237609,0.592941
V8,0.098698,0.085102,0.247676,0.377436,-0.270533
V9,0.363787,-0.255425,-1.514654,-1.387024,0.817739


In [None]:
df.columns

# Description of Dataset Columns

The dataset contains the following columns:

1. **Time**: The time elapsed in seconds between the current transaction and the first transaction in the dataset.
2. **V1-V28**: Anonymized features obtained through principal component analysis (PCA) to protect sensitive information. These features may represent various aspects of the transaction, but their specific meanings are not disclosed for confidentiality reasons.
3. **Amount**: The transaction amount.
4. **Class**: Indicates whether the transaction is fraudulent or legitimate. 
   - Class 0: Legitimate transaction
   - Class 1: Fraudulent transaction

The anonymized features (V1-V28) and transaction amount (Amount) will be used as input features for training machine learning models, while the Class column will be the target variable for classification.


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

# Check for Missing Values

The dataset does not contain any missing values. All columns have a count of 0 missing values.


In [14]:
df.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

In [15]:
df.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,1.168375e-15,3.416908e-16,-1.379537e-15,2.074095e-15,9.604066e-16,1.487313e-15,-5.556467e-16,1.213481e-16,-2.406331e-15,...,1.654067e-16,-3.568593e-16,2.578648e-16,4.473266e-15,5.340915e-16,1.683437e-15,-3.660091e-16,-1.22739e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


In [16]:
df.shape

(284807, 31)

In [17]:
df['Class'].value_counts()

Class
0    284315
1       492
Name: count, dtype: int64

In [18]:
# Feature Engineering
# 1. Normalization
# Normalize the 'Amount' column
scaler = StandardScaler()
df['Normalized_Amount'] = scaler.fit_transform(df['Amount'].values.reshape(-1, 1))

# 2. Time Feature
import numpy as np

# Extract hour of the day and day of the week from the 'Time' column
df['Hour'] = df['Time'].apply(lambda x: np.ceil(float(x) / 3600) % 24)
df['Day_of_Week'] = df['Time'].apply(lambda x: np.ceil((float(x) / 3600) / 24) % 7)

# 3. Interaction Terms (Optional)
# Example: Interaction between 'V1' and 'V2'
df['V1_V2_Interact'] = df['V1'] * df['V2']


In [19]:
# Features and target variable
X = df.drop(['Class'], axis=1)
y = df['Class']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)


# Feature Engineering

## Normalization
- Normalize the 'Amount' column using StandardScaler.
- Create a new column 'Normalized_Amount' to store the normalized values.

## Time Feature
- Extract the hour of the day and day of the week from the 'Time' column.
- Create new columns 'Hour' and 'Day_of_Week' to store these extracted features.

## Interaction Terms
- Optionally, create interaction terms between selected features.
- Example: Create a new column 'V1_V2_Interact' representing the interaction between 'V1' and 'V2'.

## Splitting Dataset
- Separate the features (X) from the target variable (y).
- Split the dataset into training and testing sets with a 80-20 ratio.
- Use stratified sampling to ensure balanced class distribution in both training and testing sets.



In [22]:

# Initialize and train models
logistic_model = LogisticRegression()
random_forest_model = RandomForestClassifier()

logistic_model.fit(X_train, y_train)
random_forest_model.fit(X_train, y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


# Model Initialization and Training

## Initialize Models
- Initialize a Logistic Regression model and a Random Forest model.

## Train Models
- Train the Logistic Regression model and the Random Forest model using the training data.

In [21]:
# Scale the test set using the same scaler
scaler = StandardScaler()
X_test_scaled = scaler.fit_transform(X_test)

# Evaluate Logistic Regression model
logistic_preds = logistic_model.predict(X_test_scaled)
print("Logistic Regression:")
print(classification_report(y_test, logistic_preds))


Logistic Regression:
              precision    recall  f1-score   support

           0       1.00      0.78      0.88     56864
           1       0.01      0.93      0.01        98

    accuracy                           0.79     56962
   macro avg       0.50      0.86      0.45     56962
weighted avg       1.00      0.79      0.88     56962





# Model Evaluation (Logistic Regression)

The classification report provides insights into the performance of the Logistic Regression model on the test set:

| Metric    | Class 0 (Legitimate) | Class 1 (Fraudulent) |
|-----------|-----------------------|-----------------------|
| Precision | 1.00                  | 0.01                  |
| Recall    | 0.78                  | 0.93                  |
| F1-Score  | 0.88                  | 0.01                  |
| Accuracy  | 0.79                  |                       |

- **Precision**: The model correctly identifies 100% of legitimate transactions but only 1% of fraudulent transactions.

- **Recall**: It effectively captures 78% of legitimate transactions and 93% of fraudulent transactions.

- **F1-Score**: Achieves a balance between precision and recall for legitimate transactions, but performs poorly for fraudulent transactions.

- **Accuracy**: The overall accuracy is 79%, but it may not be suitable due to class imbalance.

In summary, the model demonstrates strong performance in detecting legitimate transactions but struggles with identifying fraudulent ones, as indicated by the low precision and F1-score for fraudulent transactions.


In [None]:
# Save the trained Logistic Regression model
dump(logistic_model, 'logistic_regression_model.joblib')


['logistic_regression_model.joblib']

In [None]:
# Load the saved Logistic Regression model
loaded_model = load('logistic_regression_model.joblib')

# Define the features of the data instance for prediction
data_instance = np.array([[7, -0.89428608220282, 0.286157196276544, -0.113192212729871, -0.271526130088604, 2.6695986595986, 3.72181806112751, 0.370145127676916, 0.851084443200905, -0.392047586798604, -0.410430432848439, -0.705116586646536, -0.110452261733098, -0.286253632470583, 0.0743553603016731, -0.328783050303565, -0.210077268148783, -0.499767968800267, 0.118764861004217, 0.57032816746536, 0.0527356691149697, -0.0734251001059225, -0.268091632235551, -0.204232669947878, 1.0115918018785, 0.373204680146282, -0.384157307702294, 0.0117473564581996, 0.14240432992147, 93.2, 0, 0, 0, 0]])

# Make predictions on the data instance
predictions = loaded_model.predict(data_instance)

print(predictions)


[0]




# Prediction Outcome

The output of the prediction is a single value: [0].

This indicates that the model predicted the given data instance as Class 0, which typically corresponds to a legitimate transaction.


# Conclusion and Recommendations

## Closing Insights
- The credit card fraud detection project aimed to identify fraudulent activities within financial transactions using machine learning techniques.
- Through extensive data exploration, preprocessing, and model training, several insights were gained regarding the dataset and the performance of the models.

## Key Findings
1. **Class Imbalance**: The dataset exhibits a significant class imbalance, with a vast majority of transactions being legitimate and only a small fraction being fraudulent.
2. **Model Performance**: The trained models, particularly the Logistic Regression and Random Forest models, demonstrate varying levels of performance in detecting fraudulent transactions.
3. **Evaluation Metrics**: Precision, recall, and F1-score metrics reveal the trade-offs between correctly identifying fraudulent transactions and minimizing false positives.
4. **Feature Importance**: Feature engineering techniques such as normalization and extraction of time-related features have been employed to improve model performance.

## Conclusion
- Despite the challenges posed by class imbalance, the models show promise in identifying fraudulent transactions.
- However, there is room for improvement, especially in reducing false positives and enhancing the detection of fraudulent activities.
- Continuous monitoring and refinement of the models are necessary to adapt to evolving fraud patterns and maintain effectiveness.

## Recommendations
1. **Feature Engineering**: Explore additional feature engineering techniques to capture more nuanced patterns in fraudulent transactions.
2. **Model Tuning**: Experiment with hyperparameter tuning and ensemble methods to optimize model performance.
3. **Real-time Monitoring**: Implement real-time monitoring systems to detect and respond to fraudulent activities promptly.
4. **Education and Awareness**: Educate users about fraud prevention measures and encourage reporting of suspicious transactions.

## Practical Use
- The trained models can be deployed in financial institutions' fraud detection systems to automatically flag potentially fraudulent transactions.
- Regular updates and retraining of the models are essential to maintain accuracy and effectiveness in detecting evolving fraud patterns.
- Collaboration with cybersecurity experts and law enforcement agencies can provide valuable insights and enhance fraud detection capabilities.

## Closing Note
- The credit card fraud detection project underscores the importance of leveraging advanced analytics and machine learning to combat financial fraud effectively.
- By combining technological solutions with proactive measures and collaboration, we can mitigate the risks associated with fraudulent activities and safeguard financial systems and consumers.
