
# Final Exploratory Data Analysis for Credit Card Fraud Detection

## Overview
In this notebook, we will conduct an exploratory data analysis and train several machine learning models to detect fraudulent transactions using the Credit Card Fraud Detection dataset. 

The notebook is organized as follows:
1. **Data Loading & Initial Exploration**: We load the dataset and examine its structure.
2. **Data Cleaning & Feature Engineering**: We check for missing values, standardize the `Amount` and `Time` features, and create a stratified train/test split that preserves the class ratio.
3. **Model Training & Evaluation**: We train and evaluate multiple models including Logistic Regression, Random Forest, Gradient Boosting, and CatBoost.
4. **Model Interpretation**: We use SHAP analysis to interpret feature importance in the CatBoost model.
5. **Hypothesis Testing**: We state three hypotheses and run a formal significance test on one of them.
6. **Suggestions for Next Steps**: Recommendations for improving the model and future steps.



## 1. Data Loading & Initial Exploration

We begin by loading the dataset from OpenML and examining its structure.


In [None]:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset from OpenML (CSV format)
url = "https://www.openml.org/data/get_csv/1597/phpKo8OWT"
df = pd.read_csv(url)

# Display basic information and preview of the dataset
df.info()
df.head()
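
Since the later sections depend on how rare fraud is, it can help to look at the class balance right after loading. The sketch below uses a small hand-made frame in place of the real data; `class_balance` is a hypothetical helper, and only the `Class` column name comes from the dataset.

```python
import pandas as pd

def class_balance(frame, target="Class"):
    """Return per-class counts and fractions for the target column (hypothetical helper)."""
    counts = frame[target].value_counts().sort_index()
    return pd.DataFrame({"count": counts, "fraction": counts / counts.sum()})

# Toy stand-in for df: 8 legitimate rows and 2 fraudulent rows (not real data)
toy = pd.DataFrame({"Amount": range(10), "Class": [0] * 8 + [1] * 2})
balance = class_balance(toy)
print(balance)
```

On the loaded data, the same helper would be called as `class_balance(df)`.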



## 2. Data Cleaning & Feature Engineering

We will check for missing values, split the dataset into stratified training and test sets, and standardize the 'Amount' and 'Time' features.


In [None]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Check for missing values (printed so the result is shown even mid-cell)
print(df.isnull().sum())

# Separate features and target
X = df.drop('Class', axis=1)
y = df['Class']

# Stratified train-test split to preserve class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Standardize 'Amount' and 'Time'; fit on the training set only to avoid test-set leakage
X_train, X_test = X_train.copy(), X_test.copy()
scaler = StandardScaler()
X_train[['Amount', 'Time']] = scaler.fit_transform(X_train[['Amount', 'Time']])
X_test[['Amount', 'Time']] = scaler.transform(X_test[['Amount', 'Time']])
print("Train and test split completed.")
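
To illustrate what `stratify=y` does, the following sketch splits a small synthetic label vector and compares the positive rate in each part. The arrays `X_demo` and `y_demo` are made up for the illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with 2% positives (illustrative only, not the real dataset)
y_demo = np.array([1] * 20 + [0] * 980)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# With stratify set, the positive rate in each split tracks the full data
print(f"full: {y_demo.mean():.3f}, train: {y_tr.mean():.3f}, test: {y_te.mean():.3f}")
```

Without stratification, a rare class can end up over- or under-represented in the test set purely by chance, which makes the evaluation metrics noisier.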



## 3. Model Training & Evaluation

We will train and evaluate the following models:
1. **Logistic Regression**
2. **Random Forest**
3. **Gradient Boosting**
4. **CatBoost** (trained with default hyperparameters; Bayesian optimization is not run in this notebook)

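For each model, we report accuracy, precision, recall, and F1-score via `classification_report`. Because fraudulent transactions are a small minority, accuracy alone can be misleading, so the precision and recall of the fraud class matter most.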


In [None]:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import classification_report

# Train Logistic Regression
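# max_iter is raised from the default of 100 to reduce the chance of convergence warnings
logreg = LogisticRegression(max_iter=1000)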
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)
print("Logistic Regression Performance:")
print(classification_report(y_test, y_pred_logreg))

# Train Random Forest
rf = RandomForestClassifier(random_state=42)  # fixed seed for reproducible results
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest Performance:")
print(classification_report(y_test, y_pred_rf))

# Train Gradient Boosting
gb = GradientBoostingClassifier(random_state=42)  # fixed seed for reproducible results
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)
print("Gradient Boosting Performance:")
print(classification_report(y_test, y_pred_gb))

# Train CatBoost with default hyperparameters (no Bayesian optimization is run here)
catboost_model = CatBoostClassifier(verbose=0, random_seed=42)
catboost_model.fit(X_train, y_train)
y_pred_catboost = catboost_model.predict(X_test)
print("CatBoost Performance:")
print(classification_report(y_test, y_pred_catboost))
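
On imbalanced data, a model that never flags fraud can still score a high accuracy. The sketch below shows this on synthetic data generated with `make_classification` and adds average precision (the area under the precision-recall curve) as a threshold-free alternative. All data and variable names here are illustrative assumptions, and the numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, average_precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 2% positives); a stand-in for the real features
X_syn, y_syn = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=0
)

# Baseline that always predicts the majority class
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
dummy_acc = accuracy_score(y_te, dummy.predict(X_te))
dummy_recall = recall_score(y_te, dummy.predict(X_te))

# A fitted model scored on predicted probabilities rather than hard labels
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pr_auc = average_precision_score(y_te, model.predict_proba(X_te)[:, 1])

print(f"Majority baseline: accuracy={dummy_acc:.3f}, recall={dummy_recall:.3f}")
print(f"Logistic regression average precision: {pr_auc:.3f}")
```

The same `average_precision_score` call could be applied to any of the models above by passing `model.predict_proba(X_test)[:, 1]` as the score.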



## 4. SHAP Analysis for CatBoost

To interpret the results of the CatBoost model, we will use SHAP to identify the most important features and their contributions to the predictions.


In [None]:

import shap

# Initialize SHAP explainer
explainer = shap.TreeExplainer(catboost_model)
shap_values = explainer.shap_values(X_test)

# SHAP Summary Plot
shap.summary_plot(shap_values, X_test)

# SHAP Dependence Plot for feature V14
shap.dependence_plot('V14', shap_values, X_test)
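
The summary plot orders features by the magnitude of their absolute SHAP values, and the same ranking can be produced as a plain list. The sketch below assumes the SHAP values form a 2-D array of shape `(n_samples, n_features)`; `rank_by_mean_abs_shap` is a hypothetical helper, and the numbers are made up rather than taken from the trained model.

```python
import numpy as np

def rank_by_mean_abs_shap(shap_matrix, feature_names):
    """Rank features by mean |SHAP| (hypothetical helper; expects shape (n_samples, n_features))."""
    shap_matrix = np.asarray(shap_matrix)
    importance = np.abs(shap_matrix).mean(axis=0)
    order = np.argsort(importance)[::-1]  # largest contribution first
    return [(feature_names[i], float(importance[i])) for i in order]

# Made-up SHAP values for three features (not output from the real model)
demo_shap = np.array([[0.1, -2.0, 0.5],
                      [-0.2, 1.0, 0.4],
                      [0.0, -3.0, -0.6]])
ranking = rank_by_mean_abs_shap(demo_shap, ["V4", "V14", "Amount"])
print(ranking)
```

With the objects from the cell above, the call would be `rank_by_mean_abs_shap(shap_values, list(X_test.columns))`, provided `shap_values` has the assumed shape.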



## 5. Hypothesis Testing

Based on the initial exploratory analysis, we formulate the following hypotheses:

1. **Hypothesis 1**: Transactions with higher amounts are more likely to be fraudulent.
2. **Hypothesis 2**: The feature **V14** has a strong correlation with fraudulent transactions.
3. **Hypothesis 3**: The time of a transaction (feature **Time**) has a significant effect on the likelihood of fraud.

We will formally test **Hypothesis 2** with a two-sample t-test that compares the mean of **V14** in fraudulent and non-fraudulent transactions.


In [None]:

from scipy import stats

# Hypothesis 2: compare V14 between fraudulent and non-fraudulent transactions
fraud_transactions = df[df['Class'] == 1]['V14']
non_fraud_transactions = df[df['Class'] == 0]['V14']

# Welch's two-sample t-test: equal_var=False avoids assuming equal variances
# across two groups of very different size
t_stat, p_val = stats.ttest_ind(fraud_transactions, non_fraud_transactions, equal_var=False)

print(f"T-statistic: {t_stat:.4f}, P-value: {p_val:.4g}")

if p_val < 0.05:
    print("Reject the null hypothesis: mean V14 differs significantly between fraud and non-fraud transactions.")
else:
    print("Fail to reject the null hypothesis: no significant difference in mean V14 between fraud and non-fraud transactions.")
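
With a large sample, even a small difference in means can produce a very small p-value, so an effect size helps judge whether the relationship is "strong" in the sense of Hypothesis 2. The sketch below computes a point-biserial correlation and Cohen's d on synthetic data. The `labels` and `values` arrays and the `cohens_d` helper are assumptions for illustration, and the results are not from the real dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in: the "fraud" group is shifted down by 3 units (illustrative only)
labels = np.array([0] * 950 + [1] * 50)
values = np.where(labels == 1, rng.normal(-3.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000))

# Point-biserial correlation between the binary label and the continuous feature
r_pb, p_pb = stats.pointbiserialr(labels, values)

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

d = cohens_d(values[labels == 1], values[labels == 0])
print(f"point-biserial r = {r_pb:.3f} (p = {p_pb:.3g}), Cohen's d = {d:.2f}")
```

On the notebook's data, the equivalent calls would be `stats.pointbiserialr(df['Class'], df['V14'])` and `cohens_d(fraud_transactions, non_fraud_transactions)`.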



## 6. Suggestions for Next Steps

Based on the results of the model evaluations and the hypothesis testing, the following suggestions can be made for improving the fraud detection model:

1. **Model Refinement**: Tuning the CatBoost hyperparameters with a systematic search, such as Bayesian optimization, could further improve its performance over the default settings used here.
   
2. **Handling Class Imbalance**: Implementing techniques like **SMOTE** (Synthetic Minority Over-sampling Technique) to generate more fraudulent samples, applied to the training set only, or using **cost-sensitive learning** to penalize misclassification of fraud cases, can help improve the recall for fraudulent transactions.

3. **Feature Engineering**: Further feature engineering, such as combining time-based features (e.g., transaction rate per user) or creating interaction terms between key PCA features, could provide additional predictive power to the models.

4. **Real-Time Fraud Detection**: Integrating the model into a system that scores transactions as they arrive, using techniques like **streaming data** processing and **online learning**, could make the model more actionable in financial environments.

These steps will help create a more robust and production-ready fraud detection system.
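
As one concrete form of the cost-sensitive learning mentioned in item 2, many scikit-learn classifiers, including `LogisticRegression` and `RandomForestClassifier`, accept `class_weight="balanced"`, which weights errors inversely to class frequency. The sketch below compares minority-class recall with and without that setting on synthetic data; the data and variable names are assumptions, and the numbers do not describe the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (a stand-in for the real features)
X_syn, y_syn = make_classification(
    n_samples=5000, n_features=10, weights=[0.97, 0.03], class_sep=0.8, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=1
)

# Same model with and without class reweighting
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"recall without weights: {recall_plain:.3f}")
print(f"recall with class_weight='balanced': {recall_weighted:.3f}")
```

Higher recall from reweighting usually comes at the cost of precision, so both should be checked before adopting it.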
