## Fraud data columns

- user_id: A unique identifier for the user who made the transaction.
- signup_time: The timestamp when the user signed up.
- purchase_time: The timestamp when the purchase was made.
- purchase_value: The value of the purchase in dollars.
- device_id: A unique identifier for the device used to make the transaction.
- source: The source through which the user came to the site (e.g., SEO, Ads).
- browser: The browser used to make the transaction (e.g., Chrome, Safari).
- sex: The gender of the user (M for male, F for female).
- age: The age of the user.
- ip_address: The IP address from which the transaction was made.
- class: The target variable, where 1 indicates a fraudulent transaction and 0 indicates a non-fraudulent transaction.

## Critical Challenge: Class Imbalance

This dataset is highly imbalanced, with far fewer fraudulent transactions than legitimate ones. Plain accuracy is therefore a misleading score here, and the imbalance should drive the choice of evaluation metrics and modeling techniques.
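To make the effect of imbalance on metric choice concrete, here is a minimal sketch using synthetic labels. The 95/5 split is an assumption chosen for illustration, not the actual class distribution of Fraud_Data.csv. It shows a baseline that always predicts "not fraud" scoring high accuracy while catching no fraud at all.

```python
from collections import Counter

# Synthetic labels for illustration only: 95 legitimate (0), 5 fraudulent (1).
# This split is an assumption, not the real distribution in Fraud_Data.csv.
y_true = [0] * 95 + [1] * 5

# A trivial baseline that always predicts the majority class.
y_pred = [0] * len(y_true)

counts = Counter(y_true)
fraud_share = counts[1] / len(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the fraud class: share of actual fraud cases that were flagged.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / counts[1]

print(f"fraud share: {fraud_share:.2%}")
print(f"accuracy:    {accuracy:.2%}")
print(f"recall:      {recall:.2%}")
```

The gap between accuracy and recall is why metrics that focus on the minority class (precision, recall, F1, or the area under the precision-recall curve) are usually preferred for data like this.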

## Step 1: Basic Data Exploration

In [1]:
import sys
from pathlib import Path
import pandas as pd

# Add src to path so we can import our module
PROJECT_ROOT = Path.cwd().parent
if str(PROJECT_ROOT / "src") not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT / "src"))

from eda_fraud_data import Eda

In [2]:
# Load data
data = Eda.load_data("../data/raw/Fraud_Data.csv")

# Initialize EDA
if data is not None:
    eda = Eda(data)

Data loaded successfully from ../data/raw/Fraud_Data.csv


In [3]:

# Basic Exploration
eda.basic_exploration()

--- Data Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 151112 entries, 0 to 151111
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   user_id         151112 non-null  int64  
 1   signup_time     151112 non-null  object 
 2   purchase_time   151112 non-null  object 
 3   purchase_value  151112 non-null  int64  
 4   device_id       151112 non-null  object 
 5   source          151112 non-null  object 
 6   browser         151112 non-null  object 
 7   sex             151112 non-null  object 
 8   age             151112 non-null  int64  
 9   ip_address      151112 non-null  float64
 10  class           151112 non-null  int64  
dtypes: float64(1), int64(4), object(6)
memory usage: 12.7+ MB
None

--- Data Shape ---
(151112, 11)

--- Data Description ---
             user_id  purchase_value            age    ip_address  \
count  151112.000000   151112.000000  151112.000000  1.511120e+05   
mean   2

The data info shows that signup_time and purchase_time are stored as object (string) columns, so they need to be converted to datetime before any time-based analysis. The ip_address column is stored as float64 and should also be converted to an integer type.

In [4]:
# Convert data types
eda.convert_datetypes(['signup_time', 'purchase_time'])


--- Converting Data Types ---
Conversion successful.
user_id                    int64
signup_time       datetime64[ns]
purchase_time     datetime64[ns]
purchase_value             int64
device_id                 object
source                    object
browser                   object
sex                       object
age                        int64
ip_address               float64
class                      int64
dtype: object


Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class
0,22058,2015-02-24 22:55:49,2015-04-18 02:47:11,34,QVPSPJUOCKZAR,SEO,Chrome,M,39,7.327584e+08,0
1,333320,2015-06-07 20:39:50,2015-06-08 01:38:54,16,EOGFQPIZPYXFZ,Ads,Chrome,F,53,3.503114e+08,0
2,1359,2015-01-01 18:52:44,2015-01-01 18:52:45,15,YSSKYOSJHPPLJ,SEO,Opera,M,53,2.621474e+09,1
3,150084,2015-04-28 21:13:25,2015-05-04 13:54:50,44,ATGTXKYKUDUQN,SEO,Safari,M,41,3.840542e+09,0
4,221365,2015-07-21 07:09:52,2015-09-09 18:40:53,39,NAUITBZFJKHWW,Ads,Safari,M,45,4.155831e+08,0
...,...,...,...,...,...,...,...,...,...,...,...
151107,345170,2015-01-27 03:03:34,2015-03-29 00:30:47,43,XPSKTWGPWINLR,SEO,Chrome,M,28,3.451155e+09,1
151108,274471,2015-05-15 17:43:29,2015-05-26 12:24:39,35,LYSFABUCPCGBA,SEO,Safari,M,32,2.439047e+09,0
151109,368416,2015-03-03 23:07:31,2015-05-20 07:07:47,40,MEQHCSJUBRBFE,SEO,IE,F,26,2.748471e+09,0
151110,207709,2015-07-09 20:06:07,2015-09-07 09:34:46,46,CMCXFGRHYSTVJ,SEO,Chrome,M,37,3.601175e+09,0
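The output above shows the two timestamp columns converted, while ip_address is still float64. As a rough sketch of how both conversions could be done directly in pandas (this is not the implementation of the Eda helper, whose source is not shown here), the example below uses three illustrative rows. The ip_address values are the rounded figures from the display, so they are approximate, and the ip_dotted column is a hypothetical addition for readability.

```python
import ipaddress

import pandas as pd

# Three illustrative rows. The ip_address values are the rounded figures
# from the displayed output, so they are approximate, not exact source values.
df = pd.DataFrame({
    "signup_time": ["2015-02-24 22:55:49", "2015-06-07 20:39:50", "2015-01-01 18:52:44"],
    "purchase_time": ["2015-04-18 02:47:11", "2015-06-08 01:38:54", "2015-01-01 18:52:45"],
    "ip_address": [7.327584e08, 3.503114e08, 2.621474e09],
})

# Parse the timestamp strings into datetime columns.
for col in ["signup_time", "purchase_time"]:
    df[col] = pd.to_datetime(df[col])

# Cast the IP column from float64 to int64 (any fractional part is dropped).
df["ip_address"] = df["ip_address"].astype("int64")

# Hypothetical extra column: render each integer as a dotted IPv4 string.
df["ip_dotted"] = df["ip_address"].map(lambda n: str(ipaddress.IPv4Address(int(n))))

print(df.dtypes)
```

Casting to an integer matters mainly when the IP is later matched against integer IP ranges, where float comparisons would be an unnecessary source of rounding risk.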


In [5]:
# Check for missing values
eda.check_missing_values()


--- Missing Values ---
user_id           0
signup_time       0
purchase_time     0
purchase_value    0
device_id         0
source            0
browser           0
sex               0
age               0
ip_address        0
class             0
dtype: int64
No missing values found.


The missing-value count (isna/isnull) is zero for every column, so the dataset contains no missing values.
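For reference, a check like this can be written in a few lines of plain pandas. The sketch below is an assumption about what such a check might look like, not the code behind eda.check_missing_values(), and it uses a toy frame with deliberately missing entries so the counts are non-zero.

```python
import pandas as pd

# Toy frame with two deliberately missing entries, for illustration only.
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "age": [39, float("nan"), 53],
    "browser": ["Chrome", "Safari", None],
})

# isna() and isnull() are aliases in pandas, so both give the same counts.
missing_per_column = df.isna().sum()
has_missing = bool(missing_per_column.any())

print(missing_per_column)
print("Any missing values:", has_missing)
```

On the real dataset every count is zero, which corresponds to the "No missing values found." message above.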

In [6]:
# Check for unique identifiers
eda.check_unique_identifiers("user_id")


--- Checking Uniqueness for user_id ---
Unique count: 151112
Total rows: 151112
All values in 'user_id' are unique (no duplicates).


Every user_id value is unique, so each user appears in exactly one row and the dataset contains no duplicate rows.
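A unique key rules out fully duplicated rows, but the reverse does not hold: rows can all be distinct while an identifier still repeats. The sketch below, on a toy frame with made-up values, separates the two checks using standard pandas calls. It is an illustration, not the code behind eda.check_unique_identifiers().

```python
import pandas as pd

# Toy frame: user_id 2 appears twice, but the two rows differ in purchase_value.
df = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "purchase_value": [34, 16, 15, 44],
})

# Is the identifier column free of repeats?
ids_unique = df["user_id"].is_unique

# How many rows reuse an identifier that already appeared earlier?
repeated_ids = int(df["user_id"].duplicated().sum())

# How many rows are exact copies of an earlier row across all columns?
duplicate_rows = int(df.duplicated().sum())

print("user_id unique:", ids_unique)
print("repeated user_id values:", repeated_ids)
print("fully duplicated rows:", duplicate_rows)
```

In this toy frame the identifier repeats although no row is a full duplicate, which is why checking the key column directly is the stronger test.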