# Fraud detection in online sales
We have a large dataset of online e-commerce sales, each labeled as fraudulent or non-fraudulent. Our objective is to develop a machine learning solution that classifies future sales from the transaction features provided.

## Viewing of Data
Initially, we inspect the shape of the data and the frequency of the classification label and of each categorical feature. This shows whether we need to fill in or exclude any data, or take other preprocessing steps to avoid over-representation, over-fitting or misleading data.

In [27]:
# data processing and view
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# machine learning tools
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# load dataset
dataset = pd.read_csv('dataset/merged_dataset.csv')

# shape, types and example
print("Dataset Shape:", dataset.shape)
print("\nData Types:\n", dataset.dtypes)
print("\nFirst Rows:\n", dataset.head())

# categorical feature distribution
print("\nTransaction Fraudulence Data Distribution:\n", dataset['Is.Fraudulent'].value_counts())
print("\nSource Data Distribution:\n", dataset['source'].value_counts())
print("\nBrowser Data Distribution:\n", dataset['browser'].value_counts())
print("\nSex Data Distribution:\n", dataset['sex'].value_counts())
print("\nPayment Method Data Distribution:\n", dataset['Payment.Method'].value_counts())
print("\nProduct Category Data Distribution:\n", dataset['Product.Category'].value_counts())
print("\nDevice Used Data Distribution:\n", dataset['Device.Used'].value_counts())
print("\nAddress Match Distribution:\n", dataset['Address.Match'].value_counts())

Dataset Shape: (300000, 14)

Data Types:
 Transaction.Date       object
Transaction.Amount    float64
Customer.Age            int64
Is.Fraudulent           int64
Account.Age.Days        int64
Transaction.Hour        int64
source                 object
browser                object
sex                    object
Payment.Method         object
Product.Category       object
Quantity                int64
Device.Used            object
Address.Match           int64
dtype: object

First Rows:
       Transaction.Date  Transaction.Amount  Customer.Age  Is.Fraudulent  \
0  2024-02-12 10:05:21              145.98            29              0   
1  2024-01-25 22:24:06              677.62            40              0   
2  2024-03-26 20:32:44              798.63            40              0   
3  2024-01-07 23:14:51              314.65            34              0   
4  2024-01-19 11:01:19              119.80            11              1   

   Account.Age.Days  Transaction.Hour  source  browser sex 

### Conclusions on Viewing
We have a large dataset of 300000 labeled cases with 13 features, mixed numeric and categorical, which means we will need to encode the categorical variables before modeling. <br>
We also have a 13 to 1 ratio of non-fraudulent to fraudulent cases, so the dataset is highly imbalanced. Accuracy will therefore be a misleading metric, and precision, recall and F1-score will be more expressive of the model's performance. <br>
We might need techniques for handling imbalance, like oversampling, undersampling or class weighting in algorithms. <br>
As for the distributions of categorical data, source, sex, payment method, product category and device used all follow fairly uniform distributions, while browser and address match over-represent Chrome and 1 respectively.
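
To make the point about accuracy concrete, here is a minimal sketch on synthetic labels. It assumes the 13 to 1 ratio stated above and does not use the real dataset; `y_true` and `y_pred` are illustrative names. It scores a trivial baseline that always predicts "non-fraudulent", and then derives balanced class weights with scikit-learn's `compute_class_weight`, one possible way to apply the class weighting mentioned above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with a 13:1 ratio (assumption: stand-in for the real label column)
y_true = np.array([0] * 13000 + [1] * 1000)

# Trivial baseline: always predict "non-fraudulent"
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred, zero_division=0)
baseline_f1 = f1_score(y_true, y_pred, zero_division=0)

# 'balanced' weights are n_samples / (n_classes * count_per_class)
class_weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_true
)

print(f"accuracy={baseline_accuracy:.3f} recall={baseline_recall:.3f} f1={baseline_f1:.3f}")
print("class weights:", dict(zip([0, 1], class_weights)))
```

The baseline reaches high accuracy while catching no fraud at all, which is why recall and F1-score are the more informative metrics here. The resulting weights could be passed to estimators that accept a `class_weight` argument.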

## Feature Preprocessing
We run simple checks on every feature (valid, non-null values), drop the rows that fail, and write the cleaned and encoded data to a new file. <br>
Is.Fraudulent and Address.Match are already stored as 0/1 and are kept as they are, and sex is label-encoded, because a single binary column suits the format and meaning of these categories. <br>
We perform one-hot encoding on source, browser, Payment.Method, Product.Category and Device.Used to avoid implying a false order among their values, at the cost of increasing the number of features. <br>
Finally, we split the raw date into numeric parts (year, month, day, weekday) and add sine/cosine encodings of weekday and hour, all fed as numbers into the model.
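
As a small illustration of the two encodings described above, the sketch below uses a toy frame with made-up rows (the `toy` name and its values are assumptions, not the real data). A binary category fits into one 0/1 column, while a multi-valued nominal category gets one indicator column per value.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows for illustration only (assumption: not taken from the dataset)
toy = pd.DataFrame({
    "sex": ["M", "F", "F"],
    "source": ["Ads", "SEO", "Direct"],
})

# Binary category: LabelEncoder sorts the classes, so F -> 0 and M -> 1
toy["sex"] = LabelEncoder().fit_transform(toy["sex"])

# Nominal category with three values: one indicator column per value
toy_encoded = pd.get_dummies(toy, columns=["source"])

print(toy_encoded.columns.tolist())
```

With label encoding on `source`, the integers 0, 1, 2 would suggest an ordering between Ads, Direct and SEO that does not exist, which is the false order that one-hot encoding avoids.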

In [28]:
def validate_row(row):
    errors = []

    try:
        pd.to_datetime(row['Transaction.Date'])
    except Exception:
        errors.append('Invalid Transaction.Date')

    if not isinstance(row['Transaction.Amount'], (int, float)) or row['Transaction.Amount'] < 0:
        errors.append('Invalid Transaction.Amount')
    

    if row['Is.Fraudulent'] not in [0, 1]:
        errors.append('Invalid Is.Fraudulent')

    # account age must be a number of days between 0 and 365
    if not isinstance(row['Account.Age.Days'], (int, float)) or not (0 <= row['Account.Age.Days'] <= 365):
        errors.append('Invalid Account.Age.Days')
    
    if not isinstance(row['Transaction.Hour'], (int, float)) or not (0 <= row['Transaction.Hour'] <= 23):
        errors.append('Invalid Transaction.Hour')
    
    if row['source'] not in ['Ads', 'SEO', 'Direct']:
        errors.append('Invalid source')
    
    if row['browser'] not in ['Chrome', 'Safari', 'IE', 'FireFox', 'Opera']:
        errors.append('Invalid browser')

    if row['sex'] not in ['M', 'F']:
        errors.append('Invalid sex')

    if row['Payment.Method'] not in ['credit card', 'debit card', 'PayPal', 'bank transfer']:
        errors.append('Invalid Payment.Method')

    if row['Product.Category'] not in ['clothing', 'health & beauty', 'home & garden', 'electronics', 'toys & games']:
        errors.append('Invalid Product.Category')

    if not isinstance(row['Quantity'], (int, float)) or not (1 <= row['Quantity'] <= 5):
        errors.append('Invalid Quantity')

    if row['Device.Used'] not in ['mobile', 'desktop', 'tablet']:
        errors.append('Invalid Device.Used')

    if row['Address.Match'] not in [0, 1]:
        errors.append('Invalid Address.Match')

    return errors

# apply validate_row to all rows
validation_results = dataset.apply(validate_row, axis=1)

# filter into only errors (compute the mask once and reuse it)
has_errors = validation_results.apply(lambda x: len(x) > 0)
invalid_rows = dataset[has_errors]
error_messages = validation_results[has_errors]

# show errors
for (row, errors) in zip(invalid_rows.iterrows(), error_messages):
    index, data = row
    print(f"Row {index} has errors: {errors}")
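
Applying a Python function row by row can be slow on 300000 rows. As a possible alternative, the sketch below expresses a few of the same rules as vectorized boolean masks. It uses a toy frame with a subset of the columns and made-up values (the `toy` and `rule_violations` names are assumptions), so it illustrates the idea rather than replacing the cell above.

```python
import pandas as pd

# Toy frame with a subset of the columns (assumption: illustrative values only)
toy = pd.DataFrame({
    "Transaction.Amount": [145.98, -5.0, 20.0],
    "Transaction.Hour": [10, 23, 24],
    "source": ["Ads", "SEO", "Email"],
})

# One boolean column per rule; True means the row violates that rule
rule_violations = pd.DataFrame({
    "Invalid Transaction.Amount": ~(toy["Transaction.Amount"] >= 0),
    "Invalid Transaction.Hour": ~toy["Transaction.Hour"].between(0, 23),
    "Invalid source": ~toy["source"].isin(["Ads", "SEO", "Direct"]),
})

# A row is valid when it violates no rule
is_valid = ~rule_violations.any(axis=1)

print(rule_violations.sum())  # number of violations per rule
print(toy[is_valid])          # rows that pass every check
```

Writing each rule as "not (condition holds)" also flags missing values, since comparisons against NaN evaluate to False.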


In [31]:
# remove invalid rows
valid_dataset = dataset[validation_results.apply(lambda x: len(x) == 0)].copy()

# label encode 'sex' category
le_sex = LabelEncoder()
valid_dataset['sex'] = le_sex.fit_transform(valid_dataset['sex'])

# select one hot columns
one_hot_columns = ['source', 'browser', 'Payment.Method', 'Product.Category', 'Device.Used']

# perform automatic one-hot encoding through get_dummies
valid_dataset = pd.get_dummies(valid_dataset, columns=one_hot_columns)


# extract transaction date values 
valid_dataset['Transaction.Date'] = pd.to_datetime(valid_dataset['Transaction.Date'], format='mixed')

# create columns for date
valid_dataset['Transaction.Year'] = valid_dataset['Transaction.Date'].dt.year
valid_dataset['Transaction.Month'] = valid_dataset['Transaction.Date'].dt.month
valid_dataset['Transaction.Day'] = valid_dataset['Transaction.Date'].dt.day
valid_dataset['Transaction.Weekday'] = valid_dataset['Transaction.Date'].dt.weekday
# valid_dataset['Transaction.Hour'] = valid_dataset['Transaction.Hour'] # already exists

# perform cyclic encoding so that the model understands weekday/hour proximity (e.g. hour 23 is close to hour 0)
# for days of week (0-6)
valid_dataset['Weekday_Sin'] = np.sin(2 * np.pi * valid_dataset['Transaction.Weekday'] / 7)
valid_dataset['Weekday_Cos'] = np.cos(2 * np.pi * valid_dataset['Transaction.Weekday'] / 7)

# for hour of day (0-23)
valid_dataset['Hour_Sin'] = np.sin(2 * np.pi * valid_dataset['Transaction.Hour'] / 24)
valid_dataset['Hour_Cos'] = np.cos(2 * np.pi * valid_dataset['Transaction.Hour'] / 24)

# drop redundant column
valid_dataset = valid_dataset.drop(columns=['Transaction.Date'])

# one-hot encode Transaction.Weekday as well (kept alongside the cyclic weekday columns)
valid_dataset = pd.get_dummies(valid_dataset, columns=['Transaction.Weekday'])

# save the prepared dataset
valid_dataset.to_csv('dataset/prepared_dataset.csv', index=False)
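
To see why the sine/cosine columns help, the following minimal sketch encodes a few hours and compares distances in the encoded space. The `cyclic_encode` helper is a hypothetical name introduced here for illustration; it applies the same formula as the cell above.

```python
import numpy as np

def cyclic_encode(values, period):
    """Map values on a cycle of the given period to (sin, cos) coordinates."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

# Encode midnight, noon and 23:00 on a 24-hour cycle
hours = cyclic_encode([0, 12, 23], period=24)

# Euclidean distances in the encoded space
dist_23_to_0 = np.linalg.norm(hours[2] - hours[0])
dist_12_to_0 = np.linalg.norm(hours[1] - hours[0])

print(f"distance 23h to 0h: {dist_23_to_0:.3f}")
print(f"distance 12h to 0h: {dist_12_to_0:.3f}")
```

On the raw 0-23 scale, hour 23 is the value farthest from hour 0. After encoding, it is one of the closest, while noon sits on the opposite side of the circle.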

##