# Fraud Detection – Exploratory Data Analysis (EDA)

This notebook performs **exploratory data analysis (EDA)** on two datasets:
1. **E-commerce fraud data**
2. **Credit card transaction data**

The goal is to:
- Understand feature distributions
- Identify potential fraud-related patterns
- Generate visual insights for model development

All figures are saved to `reports/figures`.


## Importing Required Libraries

We import libraries for:
- Data manipulation (`pandas`)
- Data visualization (`matplotlib`, `seaborn`)
- File and directory management (`pathlib`)


In [9]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# set_theme is the current name; sns.set is kept only as an alias
sns.set_theme(style="whitegrid")

# Directory setup

# Base directory = project root (one level up from notebooks/)
BASE_DIR = Path.cwd().parent
# Figures folder
FIGURES_DIR = BASE_DIR / 'reports' / 'figures'
FIGURES_DIR.mkdir(parents=True, exist_ok=True)

# Data folders
RAW_DIR = BASE_DIR / 'data' / 'raw'
PROCESSED_DIR = BASE_DIR / 'data' / 'processed'
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
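
`Path.cwd().parent` points at the project root only when the notebook is started from a direct subfolder such as `notebooks/`. A minimal sketch of a more defensive lookup follows. The helper name `find_project_root` and the use of a `data` folder as the marker are assumptions made for illustration; they are not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upward from `start` until a folder containing `marker` is found.

    Hypothetical helper: the marker name is an assumption about the layout.
    Falls back to `start.parent`, which mirrors the notebook's own behaviour.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return start.parent
```

With this in place, `BASE_DIR = find_project_root(Path.cwd())` would give the same result whether the working directory is the project root or a nested subfolder.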


## Plot Saving Utility

This helper function saves the current figure to `reports/figures` as a PNG for later use in reports,
then closes it so that open figures do not accumulate in memory.


In [10]:
def save_plot(name):
    """Save the current matplotlib figure as FIGURES_DIR/<name>.png, then close it."""
    path = FIGURES_DIR / f"{name}.png"
    plt.savefig(path, bbox_inches='tight')
    plt.close()
    print(f"Saved plot: {path}")
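
Because the file name is built from the column name, a column containing spaces or slashes would produce an awkward or invalid path. The sketch below shows one possible way to normalise the name first. `safe_filename` is a hypothetical helper, not something the notebook defines.

```python
import re


def safe_filename(name: str) -> str:
    """Replace characters that are awkward in file names with underscores.

    Hypothetical helper, not part of the original notebook.
    """
    # Keep letters, digits, dot, dash and underscore; collapse everything else.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", name)
    return cleaned.strip("_")
```

Inside `save_plot`, the path could then be built as `FIGURES_DIR / f"{safe_filename(name)}.png"`. The column names used in this notebook are already safe, so the helper would leave them unchanged.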


## Univariate Analysis

Univariate analysis examines **one feature at a time** to understand:
- Distribution of numerical variables
- Frequency of categorical variables

This helps detect skewness, outliers, and dominant categories.
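
Skewness and outliers can also be quantified alongside the plots. The following is a rough sketch using the common 1.5 × IQR rule. The function name `summarize_numeric` and the toy values are illustrative assumptions, not results from either dataset.

```python
import pandas as pd


def summarize_numeric(series: pd.Series) -> dict:
    """Return skewness and the share of values outside the 1.5 * IQR fences."""
    values = series.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outlier_mask = (values < lower) | (values > upper)
    return {
        "skew": float(values.skew()),
        "outlier_share": float(outlier_mask.mean()),
    }


# Toy values for illustration only, not taken from either dataset.
toy = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])
summary = summarize_numeric(toy)
print(summary)
```

A clearly positive `skew` together with a non-zero `outlier_share` would suggest looking at the histogram on a log scale.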


In [11]:
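def plot_univariate(df, col, name_prefix):
    """Plot a histogram (numeric column) or a top-10 count plot (other dtypes) and save it."""
    plt.figure(figsize=(10, 6))

    # is_numeric_dtype also covers int32, float32 and nullable numeric dtypes
    if pd.api.types.is_numeric_dtype(df[col]):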
        sns.histplot(df[col], kde=True, bins=30)
        plt.title(f'Distribution of {col}')
    else:
        sns.countplot(
            y=df[col],
            order=df[col].value_counts().index[:10]
        )
        plt.title(f'Count of {col}')
        
    save_plot(f"{name_prefix}_univariate_{col}")
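
The numeric-versus-categorical branch hinges on the dtype test. As a minimal sketch on a made-up toy frame, the comparison below shows that a name-based test against `['int64', 'float64']` skips `int32` and `float32` columns, whereas `pandas.api.types.is_numeric_dtype` accepts them.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame for illustration; the column names are made up.
toy = pd.DataFrame(
    {
        "small_int": pd.Series([1, 2, 3], dtype="int32"),
        "small_float": pd.Series([0.5, 1.5, 2.5], dtype="float32"),
        "label": ["a", "b", "c"],
    }
)

for col in toy.columns:
    by_name = toy[col].dtype in ["int64", "float64"]  # name-based test
    by_kind = is_numeric_dtype(toy[col])              # kind-based test
    print(f"{col}: name-based={by_name}, kind-based={by_kind}")
```

With the name-based test, the two numeric toy columns would fall through to the count-plot branch, which is rarely what is wanted for continuous values.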


## Bivariate Analysis

Bivariate analysis explores the relationship between:
- An independent feature
- The target variable (fraud or non-fraud)

This helps identify features that separate fraudulent from legitimate transactions.
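
For categorical features, the same question can be answered numerically by computing the fraud rate per category. This is a small sketch that assumes the target is coded 0/1, as in the `class` and `Class` columns. The helper name `fraud_rate_by` and the toy rows are illustrative only.

```python
import pandas as pd


def fraud_rate_by(df: pd.DataFrame, col: str, target: str) -> pd.DataFrame:
    """Count rows and compute the mean of a 0/1 target for each value of `col`."""
    return (
        df.groupby(col)[target]
        .agg(n="count", fraud_rate="mean")
        .sort_values("fraud_rate", ascending=False)
    )


# Toy rows for illustration only; they are not drawn from either dataset.
toy = pd.DataFrame(
    {
        "source": ["SEO", "SEO", "Ads", "Ads", "Direct", "Direct"],
        "class": [0, 1, 0, 0, 1, 1],
    }
)
rates = fraud_rate_by(toy, "source", "class")
print(rates)
```

Reporting `n` next to `fraud_rate` helps avoid over-reading a high rate that rests on only a handful of rows.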


In [12]:
def plot_bivariate_vs_target(df, col, target, name_prefix):
    """Plot `col` against `target`: box plot for numeric columns, grouped counts otherwise."""
    plt.figure(figsize=(10, 6))

    if pd.api.types.is_numeric_dtype(df[col]):
        sns.boxplot(x=target, y=col, data=df)
        plt.title(f'{col} by {target}')
    else:
        sns.countplot(x=col, hue=target, data=df)
        plt.title(f'{col} distribution by {target}')
        plt.xticks(rotation=45)
        
    save_plot(f"{name_prefix}_bivariate_{col}_vs_{target}")
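
In the categorical branch, raw counts can be hard to compare when one class is much larger than the other, which is typical for fraud labels. One possible complement, sketched here on made-up rows, is a column-normalised cross-tabulation, so that each class sums to 1.

```python
import pandas as pd

# Toy rows for illustration only; they are not drawn from either dataset.
toy = pd.DataFrame(
    {
        "browser": ["Chrome", "Chrome", "Chrome", "Safari", "Safari", "Opera"],
        "class": [0, 0, 1, 0, 0, 1],
    }
)

# Share of each browser within each class (every column sums to 1).
shares = pd.crosstab(toy["browser"], toy["class"], normalize="columns")
print(shares)
```

The resulting table could be plotted with `shares.plot(kind="bar")` if a figure is preferred over the numbers.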


## E-commerce Fraud Dataset

This section analyzes transaction data from an e-commerce platform.
We examine:
- Customer demographics
- Purchase behavior
- Fraud vs non-fraud patterns
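
One purchase-behavior feature that the timestamps make available is the delay between signup and purchase. The sketch below derives it from the five preview rows printed in this notebook's output for `Fraud_Data.csv`. The column name `seconds_since_signup` is a hypothetical addition, and five rows are far too few to support any conclusion.

```python
import pandas as pd

# The five preview rows from the notebook output, reduced to three columns.
sample = pd.DataFrame(
    {
        "signup_time": [
            "2015-02-24 22:55:49",
            "2015-06-07 20:39:50",
            "2015-01-01 18:52:44",
            "2015-04-28 21:13:25",
            "2015-07-21 07:09:52",
        ],
        "purchase_time": [
            "2015-04-18 02:47:11",
            "2015-06-08 01:38:54",
            "2015-01-01 18:52:45",
            "2015-05-04 13:54:50",
            "2015-09-09 18:40:53",
        ],
        "class": [0, 0, 1, 0, 0],
    }
)

for col in ("signup_time", "purchase_time"):
    sample[col] = pd.to_datetime(sample[col])

# Hypothetical derived feature: seconds elapsed between signup and purchase.
sample["seconds_since_signup"] = (
    sample["purchase_time"] - sample["signup_time"]
).dt.total_seconds()
print(sample[["seconds_since_signup", "class"]])
```

The derived column could be passed to `plot_bivariate_vs_target` like any other numeric feature to check whether the pattern holds on the full dataset.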


In [13]:
def run_eda_fraud_data():
    try:
        df = pd.read_csv(RAW_DIR / 'Fraud_Data.csv')
        display(df.head())
        
        numerical_cols = ['purchase_value', 'age']
        categorical_cols = ['source', 'browser', 'sex']
        
        # Univariate analysis
        for col in numerical_cols + categorical_cols:
            if col in df.columns:
                plot_univariate(df, col, 'Fraud')
        
        # Bivariate analysis
        if 'class' in df.columns:
            for col in numerical_cols:
                # Same guard as in the univariate loop: skip missing columns
                if col in df.columns:
                    plot_bivariate_vs_target(df, col, 'class', 'Fraud')
                
    except FileNotFoundError:
        print("❌ Fraud_Data.csv not found, skipping EDA.")


## Credit Card Transaction Dataset

This dataset contains anonymized credit card transactions.
The focus is on:
- Transaction amount behavior
- Temporal patterns
- Fraud vs legitimate transaction separation
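
Temporal patterns are easier to read when `Time` is folded into a repeating cycle. The sketch below assumes that `Time` holds seconds elapsed since the first transaction in the file. The notebook does not state this, so the assumption should be checked against the dataset's documentation. The helper `add_hour_bucket` and the toy values are illustrative.

```python
import pandas as pd


def add_hour_bucket(df: pd.DataFrame, time_col: str = "Time") -> pd.DataFrame:
    """Add an `hour_bucket` column (0-23), assuming `time_col` is elapsed seconds.

    Hypothetical helper. The bucket is relative to the first transaction,
    not to wall-clock time, because the start time of the data is unknown.
    """
    out = df.copy()
    out["hour_bucket"] = ((out[time_col] // 3600) % 24).astype(int)
    return out


# Toy values for illustration only.
toy = pd.DataFrame({"Time": [0.0, 3599.0, 3600.0, 90000.0]})
bucketed = add_hour_bucket(toy)
print(bucketed)
```

The resulting `hour_bucket` column could then be treated as a categorical feature when comparing fraudulent and legitimate transactions.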


In [14]:
def run_eda_creditcard():
    try:
        df = pd.read_csv(RAW_DIR / 'creditcard.csv')
        display(df.head())
        
        cols = ['Amount', 'Time']
        
        # Univariate analysis
        for col in cols:
            if col in df.columns:
                plot_univariate(df, col, 'CreditCard')
        
        # Bivariate analysis
        if 'Class' in df.columns:
            for col in cols:
                # Same guard as in the univariate loop: skip missing columns
                if col in df.columns:
                    plot_bivariate_vs_target(df, col, 'Class', 'CreditCard')
                
    except FileNotFoundError:
        print("❌ creditcard.csv not found, skipping EDA.")
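
Both dataset functions repeat the same missing-file handling. One way to factor it out, shown here as a sketch only, is a small loader that returns `None` when the file is absent. `load_csv_or_none` is a hypothetical name and is not defined in the original notebook.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def load_csv_or_none(path: Path) -> Optional[pd.DataFrame]:
    """Read a CSV file, or report that it is missing and return None."""
    if not path.is_file():
        print(f"{path.name} not found, skipping EDA.")
        return None
    return pd.read_csv(path)
```

A caller would then write `df = load_csv_or_none(RAW_DIR / 'creditcard.csv')` and return early when `df is None`, which also keeps unrelated errors inside the plotting code from being caught by a broad `try` block.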


## Execute EDA Pipeline

This cell runs the complete exploratory analysis for both datasets
and generates all visualizations used for reporting and modeling.


In [15]:
run_eda_fraud_data()
run_eda_creditcard()


|  | user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | age | ip_address | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22058 | 2015-02-24 22:55:49 | 2015-04-18 02:47:11 | 34 | QVPSPJUOCKZAR | SEO | Chrome | M | 39 | 732758400.0 | 0 |
| 1 | 333320 | 2015-06-07 20:39:50 | 2015-06-08 01:38:54 | 16 | EOGFQPIZPYXFZ | Ads | Chrome | F | 53 | 350311400.0 | 0 |
| 2 | 1359 | 2015-01-01 18:52:44 | 2015-01-01 18:52:45 | 15 | YSSKYOSJHPPLJ | SEO | Opera | M | 53 | 2621474000.0 | 1 |
| 3 | 150084 | 2015-04-28 21:13:25 | 2015-05-04 13:54:50 | 44 | ATGTXKYKUDUQN | SEO | Safari | M | 41 | 3840542000.0 | 0 |
| 4 | 221365 | 2015-07-21 07:09:52 | 2015-09-09 18:40:53 | 39 | NAUITBZFJKHWW | Ads | Safari | M | 45 | 415583100.0 | 0 |


Saved plot: c:\Users\yoga\code\10_Academy\week_5\reports\figures\Fraud_univariate_purchase_value.png
Saved plot: c:\Users\yoga\code\10_Academy\week_5\reports\figures\Fraud_univariate_age.png
Saved plot: c:\Users\yoga\code\10_Academy\week_5\reports\figures\Fraud_univariate_source.png
Saved plot: c:\Users\yoga\code\10_Academy\week_5\reports\figures\Fraud_univariate_browser.png
Saved plot: c:\Users\yoga\code\10_Academy\week_5\reports\figures\Fraud_univariate_sex.png
Saved plot: c:\Users\yoga\code\10_Academy\week_5\reports\figures\Fraud_bivariate_purchase_value_vs_class.png
Saved plot: c:\Users\yoga\code\10_Academy\week_5\reports\figures\Fraud_bivariate_age_vs_class.png


|  | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


Saved plot: c:\Users\yoga\code\10_Academy\week_5\reports\figures\CreditCard_univariate_Amount.png
Saved plot: c:\Users\yoga\code\10_Academy\week_5\reports\figures\CreditCard_univariate_Time.png
Saved plot: c:\Users\yoga\code\10_Academy\week_5\reports\figures\CreditCard_bivariate_Amount_vs_Class.png
Saved plot: c:\Users\yoga\code\10_Academy\week_5\reports\figures\CreditCard_bivariate_Time_vs_Class.png
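
The `Saved plot:` lines above name eleven files. A follow-up check that all of them exist on disk could look like the sketch below. `missing_figures` is a hypothetical helper, and the list of stems is copied from the log lines rather than derived from the code.

```python
from pathlib import Path

# File stems copied from the "Saved plot:" lines in the notebook output.
EXPECTED_STEMS = [
    "Fraud_univariate_purchase_value",
    "Fraud_univariate_age",
    "Fraud_univariate_source",
    "Fraud_univariate_browser",
    "Fraud_univariate_sex",
    "Fraud_bivariate_purchase_value_vs_class",
    "Fraud_bivariate_age_vs_class",
    "CreditCard_univariate_Amount",
    "CreditCard_univariate_Time",
    "CreditCard_bivariate_Amount_vs_Class",
    "CreditCard_bivariate_Time_vs_Class",
]


def missing_figures(figures_dir: Path, stems=EXPECTED_STEMS) -> list:
    """Return the stems whose PNG file is absent from `figures_dir`."""
    return [s for s in stems if not (figures_dir / f"{s}.png").is_file()]
```

Calling `missing_figures(FIGURES_DIR)` after the pipeline has run should return an empty list. A non-empty result would point to a skipped dataset or a missing column.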
