# Fraud Detection Analysis | Dataset Presentation & Data Wrangling

## Dataset Presentation
This project analyzes a credit card transaction dataset for fraud detection. The dataset is loaded from a CSV file (creditcard.csv) available on Kaggle (link: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud/) and contains transactions made with credit cards in September 2013 by European cardholders.
The transactions occurred over two days and include 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions. The feature 'Class' is the response variable; it takes the value 1 for fraud and 0 otherwise.
The dataset contains only numerical input variables. For confidentiality, features V1 to V28 are the result of a PCA transformation; the only non-transformed features are Time (the seconds elapsed between each transaction and the first transaction in the dataset) and Amount (the transaction amount).
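
As a quick sanity check, the fraud share quoted above can be recomputed from the two counts. This is only a sketch of the arithmetic; the variable names are illustrative and not part of the original notebook.

```python
# Counts quoted in the dataset description above.
n_transactions = 284_807
n_frauds = 492

# Share of the positive class (frauds) among all transactions.
fraud_share = n_frauds / n_transactions

print(f"Fraud share: {fraud_share:.4%}")
print(f"Roughly one fraud every {round(n_transactions / n_frauds)} transactions")
```

The result is about 0.17%, in line with the figure in the dataset description.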

Exploratory data analysis (EDA) techniques are then applied to understand feature distributions and relationships, and machine learning algorithms are used to classify and predict fraudulent credit card transactions.

## Data Wrangling

In this section, we perform the preprocessing needed to clean and prepare the dataset for analysis: checking for missing values, removing duplicate records, standardizing features, and assessing the class imbalance (the resampling itself is deferred until after the train/test split). No categorical encoding is needed because every feature is numeric. These steps leave the dataset in a suitable state for machine learning modeling.

### Data Loading & First Analysis

We read creditcard.csv with pandas and display summary statistics and the dataset structure.

In [27]:
import pandas as pd

file_name = 'DATA/creditcard.csv'
data = pd.read_csv(file_name)

print(data.describe())  # summary statistics of the dataset
data.info()  # dtypes and non-null counts (info() prints directly and returns None)


                Time            V1            V2            V3            V4  \
count  284807.000000  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
mean    94813.859575  1.168375e-15  3.416908e-16 -1.379537e-15  2.074095e-15   
std     47488.145955  1.958696e+00  1.651309e+00  1.516255e+00  1.415869e+00   
min         0.000000 -5.640751e+01 -7.271573e+01 -4.832559e+01 -5.683171e+00   
25%     54201.500000 -9.203734e-01 -5.985499e-01 -8.903648e-01 -8.486401e-01   
50%     84692.000000  1.810880e-02  6.548556e-02  1.798463e-01 -1.984653e-02   
75%    139320.500000  1.315642e+00  8.037239e-01  1.027196e+00  7.433413e-01   
max    172792.000000  2.454930e+00  2.205773e+01  9.382558e+00  1.687534e+01   

                 V5            V6            V7            V8            V9  \
count  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
mean   9.604066e-16  1.487313e-15 -5.556467e-16  1.213481e-16 -2.406331e-15   
std    1.380247e+00  1.332271e+00  1.23709

In [28]:
data.head(3) # show first 3 datapoints

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0


In [29]:
data.dtypes # show the datatypes of each feature

Time      float64
V1        float64
V2        float64
V3        float64
V4        float64
V5        float64
V6        float64
V7        float64
V8        float64
V9        float64
V10       float64
V11       float64
V12       float64
V13       float64
V14       float64
V15       float64
V16       float64
V17       float64
V18       float64
V19       float64
V20       float64
V21       float64
V22       float64
V23       float64
V24       float64
V25       float64
V26       float64
V27       float64
V28       float64
Amount    float64
Class       int64
dtype: object

We compute the percentage of missing values for each attribute. No values are missing, so no imputation is needed.

In [30]:
data.isnull().sum()/data.shape[0]*100

Time      0.0
V1        0.0
V2        0.0
V3        0.0
V4        0.0
V5        0.0
V6        0.0
V7        0.0
V8        0.0
V9        0.0
V10       0.0
V11       0.0
V12       0.0
V13       0.0
V14       0.0
V15       0.0
V16       0.0
V17       0.0
V18       0.0
V19       0.0
V20       0.0
V21       0.0
V22       0.0
V23       0.0
V24       0.0
V25       0.0
V26       0.0
V27       0.0
V28       0.0
Amount    0.0
Class     0.0
dtype: float64
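
To show what the same check looks like when gaps do exist, here is a minimal sketch on a toy frame. The values and the `toy` name are made up for illustration; `isna().mean()` is an equivalent way to obtain the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (illustrative values only).
toy = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0, 3.0],
    "Amount": [149.62, np.nan, 378.66, 2.69],
    "Class": [0, 0, 1, 0],
})

# Fraction of missing values per column, expressed as a percentage.
missing_pct = toy.isna().mean() * 100

# Columns that would need imputation or row removal.
columns_with_gaps = missing_pct[missing_pct > 0].index.tolist()

print(missing_pct)
print("Columns with missing values:", columns_with_gaps)
```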

We check for duplicate rows in the dataset and remove them.

In [31]:
# Number of datapoints before removing duplicates
num_datapoints_before = data.shape[0]

# Remove duplicates
data = data.drop_duplicates()

# Number of datapoints after removing duplicates
num_datapoints_after = data.shape[0]

print(f"Number of datapoints before removing duplicates: {num_datapoints_before}")
print(f"Number of datapoints after removing duplicates: {num_datapoints_after}")

Number of datapoints before removing duplicates: 284807
Number of datapoints after removing duplicates: 283726
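
The output above shows that 1,081 rows were dropped (284,807 - 283,726). Before removing duplicates it can be useful to see which class they belong to, since dropping them also changes the fraud count. The sketch below does this on a toy frame with made-up values; it is not part of the original notebook.

```python
import pandas as pd

# Toy frame with two exact duplicate rows (illustrative values only).
toy = pd.DataFrame({
    "Time": [0.0, 0.0, 1.0, 1.0, 2.0],
    "Amount": [10.0, 10.0, 5.0, 5.0, 7.5],
    "Class": [0, 0, 1, 1, 0],
})

# duplicated() flags every repeat of an earlier row; the first occurrence is not flagged.
is_dup = toy.duplicated()

# How many of the flagged rows fall in each class.
dups_per_class = toy.loc[is_dup, "Class"].value_counts().sort_index()

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates()

print(f"Duplicate rows: {int(is_dup.sum())}")
print(dups_per_class)
print(f"Rows after removal: {len(deduped)}")
```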


### Standardization of Time and Amount

In the fraud detection dataset (creditcard.csv), the Time and Amount features are on very different scales from the other features: V1 to V28 come from a PCA transformation, so they are centered around zero with standard deviations of order 1, whereas Time ranges from 0 to 172,792 seconds. Standardizing Time and Amount matters for scale-sensitive models (for example distance-based or gradient-based ones), which would otherwise be dominated by these attributes because of their larger magnitude.
Tree-based models do not need this step, but we apply it for the other models used in the next steps.
We will proceed with StandardScaler (z-score normalization).
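
For reference, z-score standardization maps each value x to (x - mean) / std. The short sketch below checks that StandardScaler matches this formula on a few illustrative amounts (the array is made up; note that StandardScaler uses the population standard deviation, which is also NumPy's default).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few illustrative transaction amounts, shaped as one feature column.
amounts = np.array([[149.62], [2.69], [378.66], [123.50]])

# Standardization with scikit-learn.
scaled = StandardScaler().fit_transform(amounts)

# The same computation by hand: subtract the mean, divide by the std (ddof=0).
manual = (amounts - amounts.mean(axis=0)) / amounts.std(axis=0)

assert np.allclose(scaled, manual)
print(scaled.ravel())
```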

In [32]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data[['Time', 'Amount']] = scaler.fit_transform(data[['Time', 'Amount']])

y = data['Class'] # class data
X = data.drop(columns=['Class']) # feature data

In [33]:
data.head(3) # show the standardized Time and Amount columns

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,-1.996823,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,0.2442,0
1,-1.996823,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,-0.342584,0
2,-1.996802,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,1.1589,0
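
The cell above fits the scaler on the full dataset, so the mean and standard deviation also reflect rows that will later be used for testing. A possible variant, sketched here on synthetic data, splits first and fits the scaler on the training rows only. The toy frame, the 75/25 split, and the random seeds are assumptions for illustration, not the notebook's actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (values are random, for illustration only).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Time": rng.uniform(0, 172_792, size=200),
    "Amount": rng.exponential(90, size=200),
    "Class": np.r_[np.zeros(190, dtype=int), np.ones(10, dtype=int)],
})

X_toy = toy.drop(columns=["Class"])
y_toy = toy["Class"]

# Split first, keeping the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

cols = ["Time", "Amount"]

# Learn the mean and std from the training rows only...
scaler = StandardScaler().fit(X_train[cols])

# ...then apply the same transformation to both parts.
X_train = X_train.copy()
X_test = X_test.copy()
X_train[cols] = scaler.transform(X_train[cols])
X_test[cols] = scaler.transform(X_test[cols])

print(X_train[cols].mean().round(6))
print(X_test[cols].mean().round(3))
```

With this ordering the training columns have mean 0 exactly, while the test columns are only approximately centered, which is expected.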


### Class Imbalance

We now examine the class imbalance of this dataset: only a small percentage of transactions are fraudulent.
In fraud detection, class imbalance is common because fraudulent transactions (positive class) are much rarer than legitimate ones (negative class). If it is not addressed, machine learning models tend to be biased toward the majority class, leading to:

- Poor fraud detection (high false negatives).
- High accuracy but low recall (misleading performance metrics).
- Unreliable model predictions (unable to generalize to real-world fraud cases).
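
The "high accuracy but low recall" point can be made concrete with a baseline that always predicts the majority class. The sketch below uses the class counts of this dataset after duplicate removal (283,253 legitimate and 473 fraudulent transactions); the always-legitimate "model" is a hypothetical baseline, not one of the models used in this project.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts of this dataset after duplicate removal.
n_legit, n_fraud = 283_253, 473
y_true = np.r_[np.zeros(n_legit, dtype=int), np.ones(n_fraud, dtype=int)]

# Hypothetical baseline that labels every transaction as legitimate.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)  # share of correct predictions
rec = recall_score(y_true, y_pred)    # share of frauds that were caught

print(f"Accuracy: {acc:.4%}")
print(f"Recall on frauds: {rec:.1%}")
```

The baseline reaches over 99.8% accuracy while catching no fraud at all, which is why recall and related metrics matter more than accuracy here.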

In [34]:
class_counts = y.value_counts()
print(class_counts) # the dataset is very imbalanced; rebalancing is needed to avoid biasing the models

class_percentage = class_counts / class_counts.sum() * 100
print(f"Class distribution:\n{class_percentage}")

Class
0    283253
1       473
Name: count, dtype: int64
Class distribution:
Class
0    99.83329
1     0.16671
Name: count, dtype: float64


The class imbalance is addressed in the next step. In particular, resampling (oversampling or undersampling) is performed after splitting into train and test sets, and only on the training set.

Here's why:

1. Prevent Data Leakage:
- If you apply oversampling (e.g. SMOTE) before splitting, synthetic or duplicated samples derived from the same original transactions may end up in both the train and test sets.
- This leads to an overly optimistic evaluation that hides overfitting, because the model is tested on points that closely resemble ones it was trained on.
2. Maintain a Realistic Test Distribution:
- The test set should represent the real-world distribution, where fraud cases are rare.
- Applying resampling before splitting would create an artificially balanced test set, which does not reflect real-world fraud detection scenarios.
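
The ordering described above can be sketched as follows. To keep the example dependent on scikit-learn only, it uses plain random oversampling via `sklearn.utils.resample` rather than SMOTE; the synthetic data, the 80/20 split, and the seeds are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data: 980 legitimate rows, 20 frauds (random values).
rng = np.random.default_rng(42)
n_legit, n_fraud = 980, 20
X_toy = pd.DataFrame(
    rng.normal(size=(n_legit + n_fraud, 3)), columns=["V1", "V2", "Amount"]
)
y_toy = pd.Series(
    np.r_[np.zeros(n_legit, dtype=int), np.ones(n_fraud, dtype=int)], name="Class"
)

# 1) Split first; stratify keeps the original fraud ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# 2) Resample the training part only; the test part is never touched.
train = X_train.assign(Class=y_train)
majority = train[train["Class"] == 0]
minority = train[train["Class"] == 1]

# Random oversampling: draw minority rows with replacement up to the majority size.
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_bal = pd.concat([majority, minority_up]).sample(frac=1, random_state=42)

X_train_bal = train_bal.drop(columns=["Class"])
y_train_bal = train_bal["Class"]

print("Train fraud share after resampling:", y_train_bal.mean())
print("Test fraud share (unchanged):", y_test.mean())
```

The training set ends up balanced, while the test set keeps the rare-fraud distribution and shares no rows with the resampled training data.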

The last step in this notebook is to save the transformed dataset so that it can be loaded in other notebooks (using the pickle library).

In [35]:
import pickle

with open("DATA/data.pkl", "wb") as f:
    pickle.dump((X, y), f)
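
In a later notebook the tuple can be read back with `pickle.load`. The round trip is sketched below on a toy frame written to a temporary directory, so that the example does not depend on `DATA/data.pkl` existing; the toy values and variable names are illustrative.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy stand-ins for the X and y saved above (illustrative values only).
X_toy = pd.DataFrame({"Time": [-1.0, 0.0, 1.0], "Amount": [0.2, -0.3, 1.1]})
y_toy = pd.Series([0, 0, 1], name="Class")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.pkl")

    # Save the (features, labels) tuple, as done in the cell above.
    with open(path, "wb") as f:
        pickle.dump((X_toy, y_toy), f)

    # Load it back, unpacking in the same order it was saved.
    with open(path, "rb") as f:
        X_loaded, y_loaded = pickle.load(f)

# The loaded objects are identical to the originals.
pd.testing.assert_frame_equal(X_toy, X_loaded)
pd.testing.assert_series_equal(y_toy, y_loaded)
```

Pickle files can execute arbitrary code when loaded, so only files produced by this project should be unpickled.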