# Dataset Overview – Credit Card Fraud Detection

This notebook provides an overview of the Credit Card Fraud Detection dataset released by the Machine Learning Group (MLG) at Université Libre de Bruxelles (ULB).  
The objective is to understand the dataset structure, characteristics, limitations, and implications for building a real-time fraud detection system.


## Dataset Source

- Origin: Machine Learning Group (MLG), Université Libre de Bruxelles (ULB)
- Context: Real European cardholder transactions
- Time span: 2 consecutive days
- Transactions: 284,807
- Fraudulent transactions: 492 (≈ 0.172%)

The dataset was made publicly available for research on highly imbalanced classification problems and fraud detection.

In [1]:
# Load the dataset into a DataFrame

import pandas as pd 

DATA_PATH = '../data/raw/creditcard.csv'

df = pd.read_csv(DATA_PATH)
df.head()

|  | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


In [2]:
df.shape

(284807, 31)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

## Dataset Structure

- Rows: 284,807 transactions
- Columns: 31
  - 28 anonymized numerical features (V1–V28)
  - `Time`: Seconds elapsed since the first transaction
  - `Amount`: Transaction amount
  - `Class`: Target variable (1 = Fraud, 0 = Legitimate)

All features are numerical; there are no categorical variables.
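The structure described above can be turned into a lightweight schema check that runs before any modeling code. The sketch below is only one possible way to do this: `EXPECTED_COLUMNS` and `check_schema` are hypothetical names introduced for illustration and are not part of the original notebook.

```python
import pandas as pd

# Column layout as described in this section: Time, V1..V28, Amount, Class.
EXPECTED_COLUMNS = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]


def check_schema(frame: pd.DataFrame) -> list:
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []

    # Every expected column must be present.
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")

    # All columns are expected to be numerical.
    non_numeric = [
        c for c in frame.columns
        if not pd.api.types.is_numeric_dtype(frame[c])
    ]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")

    # The target must be binary (0 = legitimate, 1 = fraud).
    if "Class" in frame.columns and not set(frame["Class"].unique()) <= {0, 1}:
        problems.append("Class contains values other than 0 and 1")

    return problems
```

In the notebook this could be called as `check_schema(df)`, where an empty list indicates that the frame matches the layout listed above.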


In [4]:
df.isnull().sum().sum()

0

## Missing Values

The dataset contains **no missing values**, so no imputation step is required during preprocessing.

## Target Variable Distribution

In [5]:
df['Class'].value_counts()

Class
0    284315
1       492
Name: count, dtype: int64

In [6]:
df['Class'].value_counts(normalize=True) * 100

Class
0    99.827251
1     0.172749
Name: proportion, dtype: float64

Fraud accounts for only about 0.17% of transactions, roughly 1 in 579. This **extreme class imbalance** is the defining challenge of the dataset and must be addressed explicitly during both modeling and evaluation.
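To make the effect of the imbalance concrete, the sketch below rebuilds a label vector from the class counts reported above and scores a trivial baseline that predicts "legitimate" for every transaction. It also computes per-class weights with the common "balanced" heuristic, `n_samples / (n_classes * class_count)`. This weighting is shown as one possible option and is an assumption here, not necessarily the strategy this project adopts.

```python
import numpy as np

# Class counts taken from the value_counts() output above.
n_legit, n_fraud = 284_315, 492
y_true = np.concatenate([
    np.zeros(n_legit, dtype=int),
    np.ones(n_fraud, dtype=int),
])

# Trivial baseline: predict "legitimate" (0) for every transaction.
y_pred = np.zeros_like(y_true)

# Accuracy looks excellent, yet no fraud is detected at all.
accuracy = (y_pred == y_true).mean()
recall = (y_pred[y_true == 1] == 1).mean()

# "Balanced" class weights: n_samples / (n_classes * class_count).
counts = np.bincount(y_true)
class_weights = len(y_true) / (len(counts) * counts)

print(f"baseline accuracy:     {accuracy:.4%}")
print(f"baseline fraud recall: {recall:.0%}")
print({label: round(float(w), 3) for label, w in enumerate(class_weights)})
```

The baseline shows why plain accuracy is a poor metric for this problem; precision, recall, and the area under the precision-recall curve are usually more informative.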

## Feature Anonymization

Features `V1`–`V28` are the result of a PCA transformation applied by the data provider.

Implications:
- Original feature semantics are unknown
- Feature interpretability is limited
- Feature engineering opportunities are constrained
- Model explainability must rely on statistical behavior rather than domain meaning
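The following sketch illustrates what a PCA transformation does to a set of features, using synthetic data only. The provider's exact preprocessing is not documented here, so the mixing matrix and the three-column setup are assumptions chosen purely for demonstration. The point is that the resulting components are mutually uncorrelated and ordered by variance, while each one is a blend of all the original columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the unknown original features: 3 correlated columns.
mixing = np.array([
    [1.0, 0.8, 0.2],
    [0.0, 1.0, 0.5],
    [0.0, 0.0, 1.0],
])
X = rng.normal(size=(5_000, 3)) @ mixing

# PCA via SVD of the mean-centered data.
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the principal axes; these play the role of V1, V2, V3.
components = X_centered @ Vt.T

corr_original = np.corrcoef(X, rowvar=False)
corr_components = np.corrcoef(components, rowvar=False)
component_variances = components.var(axis=0)

print("original correlations:\n", corr_original.round(2))
print("component correlations:\n", corr_components.round(2))
print("component variances:", component_variances.round(2))
```

Because every component mixes all of the inputs, a large value in a single `V` column cannot be traced back to one original attribute, which is why explanations have to rest on statistical behavior.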


## Special Features

### Time
- Represents seconds elapsed since the first transaction in the dataset
- Does not correspond to real-world timestamps
- Can capture transaction bursts and temporal patterns

### Amount
- Raw transaction amount
- Not PCA-transformed
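- Should be scaled before fitting scale-sensitive models (e.g., logistic regression, SVMs, neural networks); tree-based models are largely unaffected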
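A minimal sketch of how these two columns might be prepared is shown below. The rows are illustrative values, not taken from the dataset, and the derived column names (`hour_offset`, `amount_log`, `amount_scaled`) are hypothetical. The log transform followed by standardization is one common choice among several, not a step the original notebook prescribes.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; in the notebook these columns would come from `df`.
sample = pd.DataFrame({
    "Time": [0.0, 3_600.0, 45_000.0, 90_000.0, 172_000.0],
    "Amount": [149.62, 2.69, 378.66, 123.50, 0.0],
})

# Hours elapsed since the first transaction, wrapped to a 24-hour cycle.
# This is an offset from the first record, not a wall-clock hour.
sample["hour_offset"] = (sample["Time"] // 3600 % 24).astype(int)

# log1p compresses the right tail and maps an amount of 0 to 0.
sample["amount_log"] = np.log1p(sample["Amount"])

# Standardize to zero mean and unit variance. In a real pipeline the mean
# and standard deviation would be estimated on the training split only.
mean = sample["amount_log"].mean()
std = sample["amount_log"].std(ddof=0)
sample["amount_scaled"] = (sample["amount_log"] - mean) / std

print(sample)
```

Wrapping `Time` to a daily cycle lets a model pick up recurring patterns across the two days even though the absolute timestamps are unknown.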


## Assumptions and Constraints

- Each transaction is treated independently
- No user or merchant identifiers are available
- Fraud labels are assumed to be accurate
- Dataset reflects historical behavior over a short time window

These constraints influence model selection, evaluation metrics, and deployment design.
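One way these constraints can shape evaluation is sketched below: because the data covers a short, ordered time window, a chronological split avoids training on transactions that occur after those being tested, and precision and recall replace accuracy as the metrics of interest. The frame `toy` is synthetic (it only reuses the `Time` and `Class` column names and assumes a two-day span of 172,800 seconds), and the 80/20 cutoff and the helper `precision_recall` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in with the same column names as the dataset.
n = 10_000
toy = pd.DataFrame({
    "Time": rng.uniform(0, 172_800, size=n),       # two days, in seconds
    "Class": (rng.random(n) < 0.01).astype(int),   # rare positive class
})

# Chronological split: train on the earliest 80%, test on the latest 20%.
toy = toy.sort_values("Time").reset_index(drop=True)
cutoff = int(len(toy) * 0.8)
train, test = toy.iloc[:cutoff], toy.iloc[cutoff:]


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (fraud) class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


print(len(train), len(test))
print(precision_recall([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))
```

A random stratified split is a reasonable alternative when transactions are treated as fully independent; the chronological variant is closer to how a deployed system would encounter new data.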


## Summary

- The dataset represents a realistic, highly imbalanced fraud detection problem
- Feature anonymization limits interpretability, but the all-numerical, complete feature set keeps preprocessing simple
- Class imbalance is the primary modeling challenge

Next steps:
- Perform exploratory data analysis to identify patterns
- Analyze fraud vs non-fraud behavior
- Design preprocessing and imbalance-handling strategies
