# Machine Learning Fraud Case Study


# Introduction


Fraud detection is a critical challenge in online transactions: accurately identifying fraudulent activity can save companies significant losses and strengthen customer trust. In 2019, Kaggle hosted a competition with a prize pool of €20,000 that challenged participants to develop machine learning models capable of predicting fraudulent transactions. The competition attracted many participants, and the top-performing model scored about 94% on the competition's ROC AUC metric.

In this case study, we aim to take on this challenge ourselves by independently developing machine learning models to predict whether a transaction is fraudulent. The main objectives of this analysis are:

- How accurately can we predict if a transaction is fraudulent?
- Do the features in the dataset significantly impact whether a transaction is fraudulent?

Through this study, we hope to learn from the process, test our skills, and evaluate how close we can come to the benchmark set during the original competition.


# Data set explanation

The dataset comprises approximately 600,000 transactions, of which about 20,000 are labeled as fraudulent (`isFraud = 1`). It is split into two primary tables:

**Transaction Data**

Contains details about individual transactions. Key categorical features include:

- `ProductCD`: The product code.
- `card1` - `card6`: Various card-related details.
- `addr1`, `addr2`: Address information.
- `P_emaildomain`, `R_emaildomain`: Sender and receiver email domains.
- `M1` - `M9`: Miscellaneous categorical flags.


**Identity Data**

Provides additional information about some transactions, such as device and ID details. Key categorical features include:

- `DeviceType`: Type of device used for the transaction.
- `DeviceInfo`: Specific device details.
- `id_12` - `id_38`: Various identity-related attributes.

Note: Not all transactions in the dataset have corresponding identity information.


**Files**

- `train_transaction.csv` and `train_identity.csv`: Training dataset.
- `test_transaction.csv` and `test_identity.csv`: Test dataset requiring predictions.
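
Because identity information exists for only some transactions, the two tables are most naturally combined with a left join that keeps every transaction. The following is a minimal sketch on toy data, not the exact code we ran. It assumes the shared key column is called `TransactionID`, which is an illustrative name, and it uses pandas' `indicator=True` option to record which rows found a match.

```python
import pandas as pd

# Toy stand-ins for train_transaction.csv and train_identity.csv.
# The key name "TransactionID" is an assumption for illustration.
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "ProductCD": ["W", "C", "W", "H"],
    "isFraud": [0, 1, 0, 0],
})
identity = pd.DataFrame({
    "TransactionID": [2, 4],
    "DeviceType": ["mobile", "desktop"],
})

# A left join keeps every transaction; rows without identity data get NaN.
# indicator=True adds a "_merge" column telling us where each row came from.
merged = transactions.merge(
    identity, on="TransactionID", how="left", indicator=True
)

# Share of transactions that have matching identity information.
has_identity = merged["_merge"] == "both"
identity_coverage = has_identity.mean()

print(merged)
print(f"Identity coverage: {identity_coverage:.0%}")
```

An inner join would silently discard every transaction without identity data, so checking the coverage figure first is a cheap safeguard.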

# Data Quality Assessment
The dataset used in this project was of high quality but presented some challenges:

1. **Dataset Size:**
   - The dataset contained approximately 600,000 transactions with about 350 columns of data.

2. **Anonymized Columns:**
   - Many of the columns were anonymized for privacy reasons.
   - This limited our understanding of what exactly each column represents, but the data was still sufficient for model training.

3. **Missing Values:**
   - A large portion of the columns contained NULL or empty values.
   - This required data cleaning and preprocessing, including:
     - Dropping columns with excessive missing values.
     - Imputing missing values in other columns.
   - Detailed steps for this process are documented in the Jupyter Notebook file `DataSetAnalysis.ipynb`.
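
As an illustration of the first cleaning step, the sketch below measures the fraction of missing values per column and drops the columns above a cut-off. The 50% threshold and the toy column names (other than `isFraud`) are assumptions made for this example; the notebook documents the choices actually made.

```python
import numpy as np
import pandas as pd

# Toy frame; column names other than isFraud are made up for illustration.
df = pd.DataFrame({
    "isFraud": [0, 1, 0, 0, 1],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "partly_missing": [10.0, np.nan, 30.0, 40.0, 50.0],
})

# Fraction of missing values per column, highest first.
missing_fraction = df.isna().mean().sort_values(ascending=False)

# Assumed cut-off: drop columns with more than 50% missing values.
MAX_MISSING = 0.5
to_drop = missing_fraction[missing_fraction > MAX_MISSING].index.tolist()
cleaned = df.drop(columns=to_drop)

print(missing_fraction)
print("Dropped:", to_drop)
```

Columns that stay below the threshold, such as `partly_missing` here, are the ones left for imputation.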

## Data Preprocessing

To handle the missing values and prepare the data for machine learning models, we performed the following steps:

- Filled NULL values in the dataset using appropriate strategies (e.g., mean or median for numeric features, mode for categorical features).
- Scaled all numeric features to a range of 0 to 1 using a Min-Max Scaler for consistency.
- Converted categorical features into numerical values using one-hot encoding.

Additionally, we merged the two datasets (`train_transaction.csv` and `train_identity.csv`) based on their shared identifiers to create a unified dataset for analysis and modeling.
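
The imputation, scaling, and encoding steps can be chained in a single scikit-learn transformer. The sketch below is one possible arrangement on toy data rather than our exact notebook code: it picks the median and the mode from the strategies listed above, and `TransactionAmt` is a hypothetical numeric column used only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data; "TransactionAmt" is a hypothetical numeric column.
X = pd.DataFrame({
    "TransactionAmt": [10.0, np.nan, 30.0, 50.0],
    "ProductCD": ["W", "C", np.nan, "W"],
})

numeric_cols = ["TransactionAmt"]
categorical_cols = ["ProductCD"]

# Numeric features: fill gaps with the median, then scale to [0, 1].
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])

# Categorical features: fill gaps with the mode, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

# Fit on the training data only; reuse the fitted object on the test set.
X_ready = preprocess.fit_transform(X)
print(X_ready.shape)
```

Fitting the transformer on the training set and calling only `transform` on the test set keeps the imputation values and scaling ranges from leaking test information into the model.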


# Main question


How accurately can we predict whether an online transaction is fraudulent using machine learning models?

This dataset originates from a Kaggle competition held in 2019, in which participants competed for a €20,000 prize pool by predicting fraudulent transactions. The challenge attracted a lot of data scientists and machine learning enthusiasts, and the top-performing model scored about 94% on the competition's ROC AUC metric.

Our main question mirrors this challenge, as we aim to explore how well we can predict fraudulent transactions independently. By creating our own algorithm without referencing other participants' solutions, we want to achieve the highest possible accuracy while applying our own strategies and techniques. This approach allows us to experience the challenge firsthand and evaluate our performance against the benchmark set by the competition.
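
When reading any accuracy figure on this dataset, it helps to keep the class balance in mind. The short calculation below uses the rounded counts from the data set explanation (600,000 transactions, about 20,000 fraudulent) to show what a trivial baseline would score. The variable names are illustrative only.

```python
# Rounded counts quoted in the data set explanation.
total_transactions = 600_000
fraud_transactions = 20_000

fraud_rate = fraud_transactions / total_transactions

# A "model" that labels every transaction as legitimate is correct
# on every non-fraud row and wrong on every fraud row.
baseline_accuracy = 1 - fraud_rate

print(f"Fraud rate:        {fraud_rate:.2%}")
print(f"Baseline accuracy: {baseline_accuracy:.2%}")
```

Because such a baseline scores close to 97% without detecting a single fraud, accuracy is best read next to a metric that is sensitive to the minority class, such as recall or ROC AUC.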

# Sub questions

This case study requires each of us to develop our own algorithm independently, so we formulated the sub-questions in a way that lets each of us approach the challenge individually. Both sub-questions follow the structure:

"Do these features impact whether or not a transaction is fraudulent?"

Before finalizing the specific sub-questions, we will perform a dataset analysis to identify features that may significantly influence whether a transaction is fraudulent. This analysis is detailed in our next Jupyter notebook, `DataSetAnalysis.ipynb`, and will guide us in selecting the features for each algorithm.
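
One simple way to get a first impression of whether a categorical feature matters is to compare the fraud rate within each category against the overall fraud rate. The sketch below does this on toy data for `ProductCD`. The values and the "lift" column are our own illustration, not results from the real dataset.

```python
import pandas as pd

# Toy data; the real analysis would use the merged training set.
df = pd.DataFrame({
    "ProductCD": ["W", "W", "W", "W", "C", "C", "C", "C"],
    "isFraud":   [0,   0,   0,   1,   1,   1,   0,   0],
})

overall_rate = df["isFraud"].mean()

# Fraud rate and sample count per category of the candidate feature.
by_category = (
    df.groupby("ProductCD")["isFraud"]
      .agg(fraud_rate="mean", n="count")
      .sort_values("fraud_rate", ascending=False)
)

# Lift > 1 means the category is more fraud-prone than average.
by_category["lift"] = by_category["fraud_rate"] / overall_rate

print(by_category)
```

A large spread in lift across categories suggests the feature is worth including. Categories with a small `n` should be treated with caution, since their rates are noisy.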