# Supervised Learning: Fraud Detection in E-Commerce

## Problem Definition
The objective of this study is to develop a binary classification model that detects fraudulent e-commerce transactions. The target variable indicates whether a transaction is fraudulent (`1`) or legitimate (`0`). The model's predictions are intended to support early identification of fraudulent activity and thereby improve the security of online transactions.

## Dataset Description
This project utilizes two publicly available datasets:
- `Dataset1.csv`: A detailed transactional dataset that includes demographic, behavioral, and transactional features.
- `Dataset2.csv`: A more compact dataset focused on user and transaction metadata.

Because the two datasets are complementary, they will be merged to build a richer feature space for model training.
However, they differ in structure and column naming, so they must be standardized before merging.

## Setup and Data Loading
We begin by importing the necessary libraries and loading the datasets.

In [1]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Load datasets
dataset1 = pd.read_csv('Datasets/Dataset1.csv')
dataset2 = pd.read_csv('Datasets/Dataset2.csv')

### Column Comparison
We first inspect the structure of both datasets to identify differences in column names and schema.

In [2]:
print("Dataset 1 Columns:\n", dataset1.columns, "\n")
print("Dataset 2 Columns:\n", dataset2.columns)

Dataset 1 Columns:
 Index(['Transaction ID', 'Customer ID', 'Transaction Amount',
       'Transaction Date', 'Payment Method', 'Product Category', 'Quantity',
       'Customer Age', 'Customer Location', 'Device Used', 'IP Address',
       'Shipping Address', 'Billing Address', 'Is Fraudulent',
       'Account Age Days', 'Transaction Hour'],
      dtype='object') 

Dataset 2 Columns:
 Index(['user_id', 'signup_time', 'purchase_time', 'purchase_value',
       'device_id', 'source', 'browser', 'sex', 'age', 'ip_address', 'class'],
      dtype='object')
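
The same comparison can be made programmatically instead of by eye. The sketch below uses toy frames that carry only a few of the column names printed above; `compare_columns` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def compare_columns(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Return column names shared by both frames and unique to each."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    return {
        "shared": sorted(left_cols & right_cols),
        "only_left": sorted(left_cols - right_cols),
        "only_right": sorted(right_cols - left_cols),
    }

# Toy frames with a subset of the real column names (no data needed).
toy1 = pd.DataFrame(columns=["Customer ID", "Transaction Amount", "Is Fraudulent"])
toy2 = pd.DataFrame(columns=["user_id", "purchase_value", "class"])

report = compare_columns(toy1, toy2)
print(report)
```

An empty `shared` list, as in this toy case, indicates that no columns can be combined until names are mapped onto each other.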


### Standardization
Several columns in `dataset2` describe the same concepts as columns in `dataset1` but under different names. We rename them so that the two schemas align.

In [3]:
dataset2_renamed = dataset2.rename(columns={
    'user_id': 'Customer ID',
    'purchase_time': 'Transaction Date',
    'purchase_value': 'Transaction Amount',
    'device_id': 'Device Used',
    'ip_address': 'IP Address',
    'age': 'Customer Age',
    'class': 'Is Fraudulent'
})
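
Renaming aligns labels, not contents, so it can be worth checking that a renamed column holds values comparable to its counterpart (for example, whether `device_id` contains device types in the way 'Device Used' does). The following is a minimal sketch of such a check; `value_overlap` and all values in it are hypothetical and only illustrate the idea.

```python
import pandas as pd

def value_overlap(reference: pd.Series, candidate: pd.Series) -> float:
    """Fraction of the candidate's distinct values that also occur in the reference."""
    ref_values = set(reference.dropna().unique())
    cand_values = set(candidate.dropna().unique())
    if not cand_values:
        return 0.0
    return len(cand_values & ref_values) / len(cand_values)

# Hypothetical values for illustration only.
device_used = pd.Series(["tablet", "desktop", "mobile", "tablet"])
device_id = pd.Series(["QX12", "ZK98", "QX12"])
same_vocab = pd.Series(["desktop", "tablet"])

overlap_ids = value_overlap(device_used, device_id)
overlap_vocab = value_overlap(device_used, same_vocab)
print(overlap_ids, overlap_vocab)
```

An overlap near zero for a categorical column would suggest that the two columns encode different things despite sharing a name.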

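Additionally, columns that exist in `dataset1` but not in `dataset2` are added to `dataset2` and filled with `NaN`, so that both datasets share the same schema. Selecting `dataset1.columns` afterwards also drops the `dataset2` columns that have no counterpart in `dataset1` (`signup_time`, `source`, `browser`, `sex`).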

In [4]:
missing_columns = set(dataset1.columns) - set(dataset2_renamed.columns)
for col in missing_columns:
    dataset2_renamed[col] = np.nan

# Align column order to match dataset1
dataset2_aligned = dataset2_renamed[dataset1.columns]
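
Two of the columns filled with `NaN` here, 'Account Age Days' and 'Transaction Hour', may be recoverable from `dataset2`'s own `signup_time` and `purchase_time` columns before those are dropped. The sketch below shows one way to do this on toy data. It assumes that account age means whole days between signup and purchase, which may not match how `Dataset1.csv` defines it; `derive_time_features` is a hypothetical helper.

```python
import pandas as pd

def derive_time_features(df, signup_col="signup_time", purchase_col="purchase_time"):
    """Add 'Account Age Days' and 'Transaction Hour' computed from two timestamp columns."""
    out = df.copy()
    signup = pd.to_datetime(out[signup_col])
    purchase = pd.to_datetime(out[purchase_col])
    # Assumption: account age = whole days elapsed between signup and purchase.
    out["Account Age Days"] = (purchase - signup).dt.days
    out["Transaction Hour"] = purchase.dt.hour
    return out

# Toy timestamps for illustration only.
toy = pd.DataFrame({
    "signup_time": ["2015-01-01 10:00:00", "2015-03-10 08:30:00"],
    "purchase_time": ["2015-01-31 22:15:00", "2015-03-10 09:00:00"],
})
derived = derive_time_features(toy)
print(derived[["Account Age Days", "Transaction Hour"]])
```

If applied, this step would have to run before the column alignment, since `signup_time` is not retained afterwards.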

### Merging the Datasets
With both datasets now aligned in schema, we concatenate them into a single unified dataset.

In [5]:
merged_data = pd.concat([dataset1, dataset2_aligned], ignore_index=True)

# Preview the merged dataset
merged_data.head()

| | Transaction ID | Customer ID | Transaction Amount | Transaction Date | Payment Method | Product Category | Quantity | Customer Age | Customer Location | Device Used | IP Address | Shipping Address | Billing Address | Is Fraudulent | Account Age Days | Transaction Hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15d2e414-8735-46fc-9e02-80b472b2580f | d1b87f62-51b2-493b-ad6a-77e0fe13e785 | 58.09 | 2024-02-20 05:58:41 | bank transfer | electronics | 1.0 | 17 | Amandaborough | tablet | 212.195.49.198 | Unit 8934 Box 0058\nDPO AA 05437 | Unit 8934 Box 0058\nDPO AA 05437 | 0 | 30.0 | 5.0 |
| 1 | 0bfee1a0-6d5e-40da-a446-d04e73b1b177 | 37de64d5-e901-4a56-9ea0-af0c24c069cf | 389.96 | 2024-02-25 08:09:45 | debit card | electronics | 2.0 | 40 | East Timothy | desktop | 208.106.249.121 | 634 May Keys\nPort Cherylview, NV 75063 | 634 May Keys\nPort Cherylview, NV 75063 | 0 | 72.0 | 8.0 |
| 2 | e588eef4-b754-468e-9d90-d0e0abfc1af0 | 1bac88d6-4b22-409a-a06b-425119c57225 | 134.19 | 2024-03-18 03:42:55 | PayPal | home & garden | 2.0 | 22 | Davismouth | tablet | 76.63.88.212 | 16282 Dana Falls Suite 790\nRothhaven, IL 15564 | 16282 Dana Falls Suite 790\nRothhaven, IL 15564 | 0 | 63.0 | 3.0 |
| 3 | 4de46e52-60c3-49d9-be39-636681009789 | 2357c76e-9253-4ceb-b44e-ef4b71cb7d4d | 226.17 | 2024-03-16 20:41:31 | bank transfer | clothing | 5.0 | 31 | Lynnberg | desktop | 207.208.171.73 | 828 Strong Loaf Apt. 646\nNew Joshua, UT 84798 | 828 Strong Loaf Apt. 646\nNew Joshua, UT 84798 | 0 | 124.0 | 20.0 |
| 4 | 074a76de-fe2d-443e-a00c-f044cdb68e21 | 45071bc5-9588-43ea-8093-023caec8ea1c | 121.53 | 2024-01-15 05:08:17 | bank transfer | clothing | 2.0 | 51 | South Nicole | tablet | 190.172.14.169 | 29799 Jason Hills Apt. 439\nWest Richardtown, ... | 29799 Jason Hills Apt. 439\nWest Richardtown, ... | 0 | 158.0 | 5.0 |
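
After concatenation, nothing in the merged frame records which source a row came from. One option is to tag each frame before concatenating, as in the sketch below. The column name 'Source Dataset' and the toy values are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the two aligned frames (values are illustrative).
first = pd.DataFrame({"Transaction Amount": [58.09, 389.96], "Quantity": [1.0, 2.0]})
second = pd.DataFrame({"Transaction Amount": [34.0], "Quantity": [float("nan")]})

# 'Source Dataset' is a hypothetical provenance column.
merged = pd.concat(
    [
        first.assign(**{"Source Dataset": "dataset1"}),
        second.assign(**{"Source Dataset": "dataset2"}),
    ],
    ignore_index=True,
)

# Count missing values per source to see which dataset contributes the gaps.
missing_by_source = (
    merged.drop(columns="Source Dataset")
    .isna()
    .groupby(merged["Source Dataset"])
    .sum()
)
print(missing_by_source)
```

A provenance column of this kind would need to be excluded from the model features, since it describes the data collection rather than the transaction.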


### Removal of Non-Analytical Columns
Some columns in the merged dataset, such as 'Transaction ID' and 'Customer ID', are identifiers that carry no analytical value, so we drop them.
The 'Shipping Address' and 'Billing Address' columns are not useful as raw text, but whether they agree may signal fraud: a mismatch between the two addresses could indicate suspicious behavior. We therefore replace both columns with a single binary column, 'Address Match', set to 1 when the two addresses are identical and 0 otherwise. Note that rows originating from `dataset2` have no address data; because `NaN == NaN` evaluates to `False`, those rows receive 0 rather than a missing value.

In [6]:
merged_data.drop(columns=['Transaction ID', 'Customer ID'], inplace=True)

merged_data['Address Match'] = (merged_data['Shipping Address'] == merged_data['Billing Address']).astype(int)

merged_data.drop(columns=['Shipping Address', 'Billing Address'], inplace=True)

# Preview the merged dataset
merged_data.head()
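
The equality comparison above treats rows with no address data as mismatches. If that is not the intended meaning, a variant that keeps those rows missing can be sketched as follows. `address_match` is a hypothetical helper and the addresses are made up.

```python
import numpy as np
import pandas as pd

def address_match(df, ship_col="Shipping Address", bill_col="Billing Address"):
    """Return 1.0 for equal addresses, 0.0 for different ones, NaN if either is missing."""
    both_present = df[ship_col].notna() & df[bill_col].notna()
    match = (df[ship_col] == df[bill_col]).astype(float)
    # Keep the comparison result only where both addresses exist.
    return match.where(both_present)

# Made-up addresses for illustration only.
toy = pd.DataFrame({
    "Shipping Address": ["1 Main St", "2 Oak Ave", np.nan],
    "Billing Address": ["1 Main St", "9 Elm Rd", np.nan],
})
toy["Address Match"] = address_match(toy)
print(toy["Address Match"])
```

With this variant the column becomes a float with missing values, so it would need the same imputation or filtering decisions as the other incomplete columns.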

### Missing Value Analysis
We evaluate the presence of missing data in the merged dataset, which will guide the preprocessing steps that follow.

In [7]:
print("Missing values per column:\n", merged_data.isnull().sum())


Missing values per column:
 Transaction Amount         0
Transaction Date           0
Payment Method        151112
Product Category      151112
Quantity              151112
Customer Age               0
Customer Location     151112
Device Used                0
IP Address                 0
Is Fraudulent              0
Account Age Days      151112
Transaction Hour      151112
Address Match              0
dtype: int64
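
Every column with missing values shows the same count (151112), which suggests the gaps come from the rows of one source dataset rather than from scattered omissions. Expressing the counts as fractions makes it easier to judge whether a column is worth imputing. The sketch below does this on toy data; `missing_summary` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and fraction of missing values, limited to columns that have any."""
    summary = pd.DataFrame({
        "missing": df.isna().sum(),
        "fraction": df.isna().mean(),
    })
    return summary[summary["missing"] > 0].sort_values("fraction", ascending=False)

# Toy data for illustration only.
toy = pd.DataFrame({
    "Transaction Amount": [58.09, 389.96, 34.0, 16.0],
    "Quantity": [1.0, 2.0, np.nan, np.nan],
    "Payment Method": ["PayPal", np.nan, np.nan, np.nan],
})
summary = missing_summary(toy)
print(summary)
```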
