In [None]:
# Credit Card Fraud Detection
# Author: Geisiana Maurício
# Objective:
# Build a complete data analysis and machine learning pipeline
# to detect fraudulent credit card transactions using Python.

# Credit Card Fraud Risk Analysis — Exploratory Data Analysis (EDA)

## Business Context
Credit card fraud represents a major financial and operational risk for financial institutions.
Before building predictive models, it is essential to understand fraud patterns, data limitations,
and the implications of extreme class imbalance.

This analysis focuses on understanding fraudulent transaction behavior and identifying
high-risk patterns to support fraud prevention decisions.

## Business Questions

1. How rare are fraudulent transactions?
2. What is the level of class imbalance in the dataset?
3. Are fraudulent transactions associated with specific transaction amounts?
4. Does transaction timing show suspicious patterns?
5. What risks arise from ignoring data imbalance and quality issues?

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

pd.set_option("display.float_format", "{:.4f}".format)


In [None]:

DATA_PATH = Path("../data/raw/creditcard.csv")
df = pd.read_csv(DATA_PATH)

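## Data Understanding
The dataset contains anonymized features produced by a PCA transformation, except for Time and Amount, which are kept in their original form. Class is the target label, with 1 marking a fraudulent transaction.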

In [None]:
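print(df.shape)  # (rows, columns); printed explicitly because only the last expression is rendered
df.info()        # column dtypes and non-null counts
df.describe()    # summary statistics, rendered as the cell output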

## Fraud vs Non-Fraud Distribution

In [None]:
class_counts = df["Class"].value_counts(normalize=True)

class_counts

In [None]:
sns.barplot(x=class_counts.index, y=class_counts.values)
plt.title("Class Distribution")
plt.ylabel("Proportion")
plt.show()

Fraudulent transactions represent approximately 0.17% of the dataset, indicating extreme class imbalance.
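The share quoted above can be read directly from the label column. The following is a minimal sketch: `class_balance` is a hypothetical helper (not part of the original notebook), and the toy labels are made up so the block runs without the CSV.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and percentage of rows for each class label."""
    counts = labels.value_counts().sort_index()
    percent = counts / counts.sum() * 100
    return pd.DataFrame({"count": counts, "percent": percent})


# Toy labels for illustration only (NOT the real dataset): 3 frauds in 1000 rows.
toy_labels = pd.Series([0] * 997 + [1] * 3, name="Class")
summary = class_balance(toy_labels)
print(summary)
```

In the notebook itself the call would be `class_balance(df["Class"])`, which reports absolute counts alongside the proportions shown in the bar plot.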

## Class Imbalance Risk

Extreme class imbalance can lead to misleading metrics such as accuracy.
A naive model predicting all transactions as legitimate would achieve high accuracy
while failing to detect fraud.
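To make the accuracy trap concrete, the sketch below scores a predictor that labels every transaction as legitimate. The labels are synthetic (17 frauds in 10,000 rows, chosen only to resemble the imbalance described here), and `accuracy_and_recall` is an illustrative helper rather than anything from the original analysis.

```python
import numpy as np


def accuracy_and_recall(y_true, y_pred):
    """Compute accuracy and fraud recall (class 1) for binary labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accuracy = float((y_true == y_pred).mean())
    fraud_mask = y_true == 1
    # Recall: share of actual frauds that were predicted as fraud.
    recall = float((y_pred[fraud_mask] == 1).mean()) if fraud_mask.any() else float("nan")
    return accuracy, recall


# Synthetic labels for illustration only: 17 frauds among 10,000 transactions.
y_true = np.array([0] * 9983 + [1] * 17)
y_pred = np.zeros_like(y_true)  # naive model: everything is legitimate

naive_accuracy, naive_recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={naive_accuracy:.4f}, recall={naive_recall:.4f}")
```

The naive model is right on every legitimate row and wrong on every fraud, so accuracy stays close to 1 while recall is 0.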

## Transaction Amount Analysis

In [None]:
sns.boxplot(x="Class", y="Amount", data=df)
plt.yscale("log")  # log axis for the skewed amounts; zero-amount rows cannot be drawn on it
plt.title("Transaction Amount by Class")
plt.show()

Fraudulent transactions tend to have a lower median amount, which may reflect card-testing behavior (small charges used to check whether a stolen card works).
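A boxplot on a log axis makes medians hard to read precisely, so a per-class summary table is a useful companion. This is a sketch under assumptions: `amount_summary_by_class` is a hypothetical helper, and the toy frame below is invented purely so the block runs on its own.

```python
import pandas as pd


def amount_summary_by_class(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize the Amount column separately for each Class value."""
    return frame.groupby("Class")["Amount"].agg(["count", "median", "mean", "max"])


# Toy data for illustration only (NOT the real dataset).
toy_amounts = pd.DataFrame(
    {
        "Class": [0, 0, 0, 0, 1, 1, 1],
        "Amount": [5.0, 20.0, 35.0, 400.0, 1.0, 2.0, 90.0],
    }
)
amount_stats = amount_summary_by_class(toy_amounts)
print(amount_stats)
```

Running `amount_summary_by_class(df)` in the notebook would show whether the median gap seen in the plot holds numerically.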


## Transaction Time Patterns

In [None]:
sns.histplot(df[df["Class"] == 1]["Time"], bins=50, kde=False)
plt.title("Fraud Transactions Over Time")
plt.show()


Fraudulent transactions show clustering patterns over time, which may indicate coordinated attacks.
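A raw histogram of fraud counts cannot separate "more fraud" from "more transactions overall", so comparing fraud rates per time bucket is a more careful check. The sketch below assumes that Time is measured in seconds elapsed since the first transaction in the file. The notebook does not state the unit, so verify this before relying on it. `fraud_rate_by_hour` is a hypothetical helper and the toy rows are invented.

```python
import pandas as pd


def fraud_rate_by_hour(frame: pd.DataFrame) -> pd.DataFrame:
    """Bucket transactions by hour offset (0-23) and compute the fraud rate per bucket."""
    # ASSUMPTION: Time is in seconds since the first transaction in the file.
    hour = (frame["Time"] // 3600 % 24).astype(int).rename("hour")
    grouped = frame.groupby(hour)["Class"].agg(transactions="count", frauds="sum")
    grouped["fraud_rate"] = grouped["frauds"] / grouped["transactions"]
    return grouped


# Toy data for illustration only (NOT the real dataset).
toy_times = pd.DataFrame(
    {
        "Time": [0, 1800, 3600, 7200, 7500, 90000],
        "Class": [0, 0, 0, 1, 0, 1],
    }
)
by_hour = fraud_rate_by_hour(toy_times)
print(by_hour)
```

Because the clock time of the first transaction is unknown, the buckets are hour offsets rather than true hours of the day. Clusters are still comparable across days, but they cannot be mapped to, say, "3 a.m." without that extra information.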

## Correlation Analysis

In [None]:
corr = df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Feature Correlation Matrix")
plt.show()

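Because the anonymized features come from a PCA transformation, they are largely uncorrelated with one another by construction, so the heatmap is mostly informative in the Time, Amount, and Class rows. Even there, correlations should be interpreted cautiously.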
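Since the row of interest in the heatmap is the one for Class, ranking features by their correlation with the label can be easier to read than the full matrix. This is a minimal sketch: `correlation_with_target` is a hypothetical helper, and the toy columns (`V_signal`, `V_noise`) are synthetic stand-ins, not features from the real dataset.

```python
import numpy as np
import pandas as pd


def correlation_with_target(frame: pd.DataFrame, target: str = "Class") -> pd.Series:
    """Pearson correlation of each feature with the target, sorted by absolute value."""
    corr = frame.drop(columns=target).corrwith(frame[target])
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]


# Synthetic data for illustration only (NOT the real dataset).
rng = np.random.default_rng(0)
labels = np.array([1] * 40 + [0] * 160)
toy_features = pd.DataFrame(
    {
        "V_signal": labels * 3 + rng.normal(size=labels.size),  # built to track the label
        "V_noise": rng.normal(size=labels.size),                # unrelated to the label
        "Class": labels,
    }
)
ranked = correlation_with_target(toy_features)
print(ranked)
```

Pearson correlation only captures linear association with a binary label, so a low value here does not rule a feature out as a useful predictor.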


## Data Quality and Risk Considerations

- Extreme class imbalance poses modeling challenges.
- Anonymized features limit interpretability.
- Fraud patterns may change over time (concept drift).
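Basic quality checks such as missing values and exact duplicate rows are cheap to run alongside these considerations. The sketch below uses a hypothetical helper, `data_quality_report`, and a small invented frame so that it runs without the CSV.

```python
import pandas as pd


def data_quality_report(frame: pd.DataFrame) -> dict:
    """Count rows, missing cells, and exact duplicate rows in a DataFrame."""
    return {
        "rows": len(frame),
        "missing_values": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Toy data for illustration only (NOT the real dataset):
# one missing Amount and one exact duplicate row.
toy_quality = pd.DataFrame(
    {
        "Time": [0.0, 1.0, 1.0, 2.0],
        "Amount": [10.0, 5.0, 5.0, None],
        "Class": [0, 0, 0, 1],
    }
)
quality = data_quality_report(toy_quality)
print(quality)
```

Calling `data_quality_report(df)` in the notebook gives a quick view of whether duplicates or gaps need handling before any modeling step. Whether duplicate rows are errors or genuine repeated transactions cannot be decided from the counts alone.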

## Key Insights

- Fraud transactions are extremely rare (~0.17%).
- Transaction amount and time show distinct behavior for fraud.
- Class imbalance significantly impacts risk analysis.

## Business Recommendations

- Prioritize recall over accuracy in fraud detection systems.
- Apply stricter monitoring to high-risk transaction windows.
- Combine automated detection with human review for borderline cases.

This analysis turns the raw transaction data into an understanding of fraud risk and a first set of recommendations.