# Fraud detection in online sales
We have a large dataset of online e-commerce sales, each labeled as fraudulent or non-fraudulent. Our objective is to develop a machine learning solution that classifies future sales from the information available for each transaction.

## Viewing of Data
First, we look at the shape of the data and the frequency of the classification label. This tells us whether we need to fill in or exclude any data, or take other preprocessing steps, to avoid over-representation, over-fitting or misleading data.

In [9]:
# data processing and view
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# machine learning tools
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# load dataset
dataset = pd.read_csv('dataset/merged_dataset.csv')

# shape, types and example
print("Dataset Shape:", dataset.shape)
print("\nData Types:\n", dataset.dtypes)
print("\nFirst Rows:\n", dataset.head())

# class distribution of the label
print("\nClass Distribution:\n", dataset['Is.Fraudulent'].value_counts())

Dataset Shape: (300000, 14)

Data Types:
 Transaction.Date       object
Transaction.Amount    float64
Customer.Age            int64
Is.Fraudulent           int64
Account.Age.Days        int64
Transaction.Hour        int64
source                 object
browser                object
sex                    object
Payment.Method         object
Product.Category       object
Quantity                int64
Device.Used            object
Address.Match           int64
dtype: object

First Rows:
       Transaction.Date  Transaction.Amount  Customer.Age  Is.Fraudulent  \
0  2024-02-12 10:05:21              145.98            29              0   
1  2024-01-25 22:24:06              677.62            40              0   
2  2024-03-26 20:32:44              798.63            40              0   
3  2024-01-07 23:14:51              314.65            34              0   
4  2024-01-19 11:01:19              119.80            11              1   

   Account.Age.Days  Transaction.Hour  source  browser sex 
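
The checks described at the start of this section (missing values and the frequency of the label) could be sketched as follows. This is a minimal illustration on a tiny synthetic frame, not the real file: only the column names are taken from the dataset, and the values, including the missing ones, are made up for the example.

```python
import pandas as pd

# Tiny synthetic stand-in for the real file (assumption: values are invented,
# only the column names come from the dataset shown above).
sample = pd.DataFrame({
    "Transaction.Amount": [145.98, 677.62, None, 314.65, 119.80],
    "Payment.Method": ["credit card", "PayPal", "debit card", None, "bank transfer"],
    "Is.Fraudulent": [0, 0, 0, 0, 1],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Absolute and relative frequency of each class label
class_counts = sample["Is.Fraudulent"].value_counts()
class_share = sample["Is.Fraudulent"].value_counts(normalize=True)
print(class_counts)
print(class_share)
```

On the real data, the same calls would be applied to `dataset` instead of `sample`.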

### Conclusions on Viewing
The dataset is large: 300000 individual labeled cases with 13 features, mixing numeric and categorical types, so we will need to encode the categorical variables before modeling. <br>
The ratio of non-fraudulent to fraudulent cases is 13 to 1, which makes the dataset highly imbalanced. Accuracy will therefore be a misleading metric, and precision, recall and F1-score will describe the model's performance better. <br>
We might need techniques for handling the imbalance, such as oversampling, undersampling or class weighting in the algorithms. <br>
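
The points above can be illustrated with a short sketch. It assumes synthetic data built to match the 13 to 1 ratio (1300 non-fraudulent and 100 fraudulent rows) and two illustrative columns. One-hot encoding with `pd.get_dummies` and "balanced" class weights are only one possible choice here, not necessarily what the final model uses.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with a 13:1 ratio (assumption: invented data, not the real file)
y = np.array([0] * 1300 + [1] * 100)
X = pd.DataFrame({
    "Transaction.Amount": np.linspace(10.0, 1000.0, len(y)),
    "Device.Used": np.resize(["mobile", "desktop", "tablet"], len(y)),
})

# One-hot encode the categorical column so a model receives numeric input
X_encoded = pd.get_dummies(X, columns=["Device.Used"])

# A "model" that always predicts non-fraudulent: high accuracy, zero recall
always_legit = np.zeros_like(y)
baseline_accuracy = accuracy_score(y, always_legit)  # 1300 / 1400
baseline_recall = recall_score(y, always_legit)      # no fraud case is caught
print(f"accuracy={baseline_accuracy:.3f}, recall={baseline_recall:.3f}")

# A stratified split keeps the class ratio identical in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, stratify=y, random_state=42
)

# "balanced" weights are n_samples / (n_classes * class_count)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_train
)
class_weight = dict(zip([0, 1], weights))
print(class_weight)
```

The resulting `class_weight` dictionary could be passed to estimators that accept a `class_weight` parameter, which penalizes mistakes on the rare fraudulent class more heavily.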

## Feature Preprocessing
We run simple checks on each feature (non-null and within the valid values), then write the cleaned data to a new file.

In [None]:
def validate_row(row):
    errors = []

    try:
        pd.to_datetime(row['Transaction.Date'])
    except Exception:
        errors.append('Invalid Transaction.Date')

    if not isinstance(row['Transaction.Amount'], (int, float)) or row['Transaction.Amount'] < 0:
        errors.append('Invalid Transaction.Amount')
    

    if row['Is.Fraudulent'] not in [0, 1]:
        errors.append('Invalid Is.Fraudulent')

    if not isinstance(row['Account.Age.Days'], (int, float)) or not (0 <= row['Account.Age.Days'] <= 365):
        errors.append('Invalid Account.Age.Days')
    
    if not isinstance(row['Transaction.Hour'], (int, float)) or not (0 <= row['Transaction.Hour'] <= 23):
        errors.append('Invalid Transaction.Hour')
    
    if row['source'] not in ['Ads', 'SEO', 'Direct']:
        errors.append('Invalid source')
    
    if row['browser'] not in ['Chrome', 'Safari', 'IE', 'FireFox', 'Edge']:
        errors.append('Invalid browser')

    if row['sex'] not in ['M', 'F']:
        errors.append('Invalid sex')

    if row['Payment.Method'] not in ['credit card', 'debit card', 'PayPal', 'bank transfer']:
        errors.append('Invalid Payment.Method')

    if row['Product.Category'] not in ['clothing', 'health & beauty', 'home & garden', 'sports & outdoors', 'toys & games']:
        errors.append('Invalid Product.Category')

    if not isinstance(row['Quantity'], (int, float)) or not (1 <= row['Quantity'] <= 5):
        errors.append('Invalid Quantity')

    if row['Device.Used'] not in ['mobile', 'desktop', 'tablet']:
        errors.append('Invalid Device.Used')

    if row['Address.Match'] not in [0, 1]:
        errors.append('Invalid Address.Match')

    return errors
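
With 300000 rows, calling a Python function once per row can be slow. As a possible alternative, the same kind of checks can be expressed column-wise with `isin` and `between`. The sketch below covers only a subset of the columns, uses a made-up three-row sample, and introduces hypothetical names (`ALLOWED`, `RANGES`, `valid_mask`) and a hypothetical output path. It is not a replacement for `validate_row`, which also reports which check failed.

```python
import pandas as pd

# Allowed values and ranges copied from validate_row (subset of the columns)
ALLOWED = {
    "source": ["Ads", "SEO", "Direct"],
    "browser": ["Chrome", "Safari", "IE", "FireFox", "Edge"],
    "Device.Used": ["mobile", "desktop", "tablet"],
}
RANGES = {
    "Transaction.Hour": (0, 23),  # inclusive bounds
    "Quantity": (1, 5),
}

def valid_mask(df):
    """Return a boolean Series that is True for rows passing every check."""
    mask = pd.Series(True, index=df.index)
    for column, allowed in ALLOWED.items():
        mask &= df[column].isin(allowed)        # missing values count as invalid
    for column, (low, high) in RANGES.items():
        mask &= df[column].between(low, high)   # inclusive on both ends
    return mask

# Made-up sample: row 0 is valid, row 1 has an unknown browser, row 2 has hour 24
sample = pd.DataFrame({
    "source": ["Ads", "SEO", "Direct"],
    "browser": ["Chrome", "Opera", "Edge"],
    "Device.Used": ["mobile", "desktop", "tablet"],
    "Transaction.Hour": [10, 22, 24],
    "Quantity": [1, 5, 3],
})

mask = valid_mask(sample)
cleaned = sample[mask]
print(cleaned)

# Hypothetical output path for the cleaned file:
# cleaned.to_csv("dataset/cleaned_dataset.csv", index=False)
```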