# Buy Now Pay Later Project Summary
## Industrial Project Group 9


## Table of Contents
1. [ETL Pipeline](#etl-pipeline)
2. [Preliminary Analysis](#preliminary-analysis)
3. [Geospatial Analysis](#geospatial-analysis)

## ETL Pipeline
The main data was provided by the BNPL firm, so we didn’t use any APIs or Python libraries to retrieve it. For external datasets, we used `urlretrieve` to download data from the **Australian Bureau of Statistics**, while postcode-to-LGA mapping data had to be downloaded manually from the [website](https://www.matthewproctor.com/Content/postcodes/australian_postcodes.csv).

### ABS Datasets
We used two datasets from the ABS: **Personal Income in Australia** and **Personal Fraud**. The first dataset contains information on the median and mean income, as well as the median age of earners in each LGA region. The second dataset reports the percentage of personal fraud, including card fraud, identity theft, and scams, for each state.

In the first dataset, some LGA codes in Western Australia lacked entries for median income, mean income, and median age. We imputed these missing values using the respective state's averages. We also found a discrepancy in the total number of earners. Western Australia’s total is listed as *1,585,093*, but summing the earners across all LGAs gives *1,581,061*, a difference of *32*. While the cause is unclear, we split this difference between two missing LGAs. Although not a perfect solution, we believe imputing these small numbers will not significantly impact the fraud probability prediction, especially since we cannot confirm whether consumers reside in these LGAs.

For the second dataset, minimal preprocessing was required aside from renaming columns.

### Postcode-LGA Mapping Data

For this dataset, we selected essential columns for mapping. When a postcode was missing an LGA code, we found the nearest neighboring postcode using coordinates. If the neighbor had a valid LGA code, we assigned it to the missing one. We used a simple K-Nearest Neighbour with `k=1` for this task.
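The nearest-neighbour fill described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the field names (`lat`, `long`, `lga_code`), the haversine distance, and the toy records are all assumptions, and the sketch simplifies by searching only among postcodes that already have an LGA code.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    radius = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))

def impute_lga(rows):
    """Fill a missing 'lga_code' with that of the single nearest row that has one (k=1)."""
    known = [r for r in rows if r["lga_code"] is not None]
    filled = []
    for row in rows:
        if row["lga_code"] is None and known:
            nearest = min(
                known,
                key=lambda k: haversine_km(row["lat"], row["long"], k["lat"], k["long"]),
            )
            row = {**row, "lga_code": nearest["lga_code"]}
        filled.append(row)
    return filled

# Toy records with made-up values (hypothetical, not from the real mapping file).
rows = [
    {"postcode": "A", "lat": -37.81, "long": 144.96, "lga_code": 100},
    {"postcode": "B", "lat": -33.87, "long": 151.21, "lga_code": 200},
    {"postcode": "C", "lat": -37.80, "long": 144.95, "lga_code": None},
]
result = impute_lga(rows)
```

Here postcode `C` sits closest to `A`, so it inherits `A`'s LGA code, while rows that already have a code are left unchanged.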

Next, we merged this dataset with ABS data. After the merge, two postcodes from the mapping data were not present in the income data, resulting in null values. We applied the same approach to impute the missing values.

### BNPL Data

Each customer has two unique IDs: `user_id` and `consumer_id`. We chose `consumer_id` as the primary identifier for consistency across all datasets.

For the *consumer's fraud probability* data, we checked for duplicates and removed *99 (0.28%)* entries. The same process was applied to the *merchant's fraud probability* and transactions data, with no duplicates found.

In the merchant's information data, we used regex to split the tags feature into three categories: category, revenue level, and take rate. No duplicates were found.
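As an illustration of this kind of regex split, the sketch below parses a tag string into the three fields. The exact tag layout is an assumption (three bracketed groups, with the take rate written as `take rate: <number>`), and the example string is made up, so the pattern would need adjusting to the real data.

```python
import re

# Hypothetical layout: ((category), (revenue level), (take rate: x.xx)),
# allowing either round or square brackets.
TAG_PATTERN = re.compile(
    r"[\(\[]{2}(?P<category>.+?)[\)\]],\s*"
    r"[\(\[](?P<revenue_level>\w+)[\)\]],\s*"
    r"[\(\[]take rate:\s*(?P<take_rate>\d+(?:\.\d+)?)[\)\]]{2}",
    re.IGNORECASE,
)

def split_tags(tags):
    """Return a dict of the three tag fields, or None if the string does not match."""
    match = TAG_PATTERN.search(tags)
    if match is None:
        return None
    return {
        "category": match.group("category").strip().lower(),
        "revenue_level": match.group("revenue_level").lower(),
        "take_rate": float(match.group("take_rate")),
    }

# Made-up example string following the assumed layout.
example = "((furniture, home furnishings), (e), (take rate: 0.18))"
parsed = split_tags(example)
```

Returning `None` for non-matching strings makes malformed tags easy to count and inspect rather than silently producing partial rows.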

The consumer's information data had one column encapsulating details like name, address (state and postcode), gender, and consumer ID. We used regex to split this data into separate columns and found no duplicates.

For the transactions data, covering purchases from February 28, 2021, to August 31, 2022, we ensured all transactions fell within this date range, resulting in the removal of *1,651,235 (11.63%)* rows. The same check was applied to the fraud probability data, with no change in the merchant data and *18 (0.05%)* entries removed from the consumer data.
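The date-range check amounts to an inclusive filter on the order date. A minimal sketch, assuming each record carries a date under a hypothetical `order_datetime` key (the toy records are made up):

```python
from datetime import date

# Date range stated in the text (inclusive on both ends).
START, END = date(2021, 2, 28), date(2022, 8, 31)

def within_range(rows, key="order_datetime"):
    """Keep rows whose date lies in [START, END]; also report how many were dropped."""
    kept = [r for r in rows if START <= r[key] <= END]
    return kept, len(rows) - len(kept)

# Toy transactions straddling both boundaries.
transactions = [
    {"order_id": 1, "order_datetime": date(2021, 2, 27)},
    {"order_id": 2, "order_datetime": date(2021, 2, 28)},
    {"order_id": 3, "order_datetime": date(2022, 8, 31)},
    {"order_id": 4, "order_datetime": date(2022, 9, 1)},
]
kept, removed = within_range(transactions)
```

Reporting the removed count alongside the filtered rows makes it straightforward to state the share of rows dropped.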

## Preliminary Analysis

## Geospatial Analysis
We observed that the number of customers varies across different states but remains similar within the same postcode. Therefore, we are interested in analyzing the average fraud probability at both the state and postcode levels to determine if people from different states or postcodes exhibit different scam rates.

We utilized ABS Digital Boundary shapefiles to merge our transaction records with geospatial information at both the postcode and state levels.

![Average fraud probability by postcode](../plots/average_fraud_prob_postcode.png)

![Average fraud probability by state](../plots/average_fraud_prob_state.png)

By examining the average fraud probability for each state and postcode, we concluded that using the average fraud probability at the postcode level would be a more effective feature for our fraud detection models. This is because the fraud probability varies significantly across postcodes, ranging from 8% to 53%, whereas at the state level, it only varies between 14.4% and 15.45%.

## Visualisation

## Consumer Fraud Probability Model

Since only some customers had predicted fraud probabilities, we used machine learning to estimate missing values. Two approaches were taken:

1. **Consumer-level**: Assigns the same fraud probability to all transactions by a consumer.
2. **Transaction-level**: Assigns different fraud probabilities for each transaction.

We initially expected the second approach to perform better due to more training data and additional features (e.g., order value) that could improve prediction accuracy. In contrast, only *20,128 (4%)* consumers out of 499,999 had fraud probabilities for the first approach.

We engineered features used in both approaches, including average fraud probability, order value, dollar value standard deviation, and transaction count. We also introduced a feature that calculated the percentage of a consumer's median or mean income spent on shopping, assuming those who spend a higher percentage might be more suspicious due to limited funds for necessities like rent and bills.

For the second approach, we added temporal features, such as the purchase month and day of the week. After feature engineering, we applied encoding, standardization, and log-transformation as needed.

For both approaches, we used Linear Regression (LR) as a baseline model and Random Forest Regression (RFR). We evaluated the models using RMSE and $R^2$ metrics.
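For reference, the two metrics can be computed as in the sketch below. The functions follow the standard definitions of RMSE and $R^2$; the sample values are made up and unrelated to the project's results.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up fraud probabilities (in percent) and predictions.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
error = rmse(y_true, y_pred)
fit = r_squared(y_true, y_pred)
```

A lower RMSE and a higher $R^2$ both indicate a better fit, which is how the models are compared.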


For the first approach:

|                   | RMSE  | $R^2$ |
|-------------------|-------|-------|
| Linear Regression | 8.062 | 0.285 |
| Random Forest     | 7.721 | 0.401 |

For the second approach:

|                   | RMSE  | $R^2$ |
|-------------------|-------|-------|
| Linear Regression | 7.830 | 0.241 |
| Random Forest     | 6.811 | 0.426 |

RFR outperforms LR in both approaches. This is expected, as a random forest is better at capturing nonlinear and complex relationships, whereas LR is too simple. As we anticipated, the RFR model for the second (transaction-level) approach is better than the RFR model for the first (consumer-level) approach, with a lower RMSE and a higher $R^2$. We therefore used the transaction-level RFR model for the rest of the project.

## Merchant Fraud Probability Model

## Ranking System

We adopted the perspective of a Finance Project Manager and treated each merchant as an investment generating revenue for the BNPL firm. We used Discounted Cash Flow (DCF) to estimate a merchant's total revenue by summing future revenue and adjusting for the time-value of money, as money today is worth more than in the future.

$$ \text{DCF} = \sum^{n}_{t=1}\frac{CF_t}{(1+r)^t}$$

Here, $CF_t$ is the forecast revenue (cash flow) in period $t$, $n$ is the number of forecast periods, and $r$ is the discount rate, assumed to be the same for all merchants and based on the Victoria State Government's guidelines. This makes ranking merchants intuitive: we simply select the top 100 with the highest estimated total revenue.

We calculated DCF using forecasted revenues for September, October, and November 2022. The DCF value was then multiplied by the take rate to determine the revenue for the BNPL firm.
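A minimal sketch of this calculation is shown below. The revenue figures, the per-period discount rate, and the take rate expressed as a fraction are all illustrative assumptions, not values from the project.

```python
def discounted_cash_flow(cash_flows, rate):
    """Sum of CF_t / (1 + r)^t for t = 1..n, matching the DCF formula above."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows, start=1))

# Made-up forecasts for three consecutive months.
forecast_revenue = [1000.0, 1100.0, 1210.0]
discount_rate = 0.10  # illustrative per-period rate (assumption)
take_rate = 0.05      # illustrative take rate as a fraction (assumption)

dcf = discounted_cash_flow(forecast_revenue, discount_rate)
bnpl_revenue = dcf * take_rate  # share of discounted revenue kept by the BNPL firm
```

In this toy case each month's revenue grows at exactly the discount rate, so every term discounts to the same present value.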

To forecast revenues, we used two approaches. The first was calculating the average monthly growth rate for all merchants, but we restricted the range to May 2021–August 2022 to eliminate merchants with no sales in certain months, ensuring balanced growth rates over 15 months.
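One way to turn an average monthly growth rate into a forecast is sketched below. It assumes the average is the arithmetic mean of month-over-month percentage changes and that the forecast compounds this rate forward from the last observed month; the project's exact method may differ. Sixteen monthly values (May 2021 to August 2022) yield 15 growth rates.

```python
def average_growth_rate(monthly_revenue):
    """Arithmetic mean of month-over-month growth rates (requires non-zero revenues)."""
    rates = [
        (cur - prev) / prev
        for prev, cur in zip(monthly_revenue, monthly_revenue[1:])
    ]
    return sum(rates) / len(rates)

def forecast(monthly_revenue, months_ahead=3):
    """Compound the average growth rate forward from the last observed month."""
    g = average_growth_rate(monthly_revenue)
    last = monthly_revenue[-1]
    return [last * (1 + g) ** k for k in range(1, months_ahead + 1)]

# Made-up history growing 10% per month.
history = [100.0, 110.0, 121.0]
growth = average_growth_rate(history)
next_three = forecast(history)
```

Because each rate divides by the previous month's revenue, a month with zero sales would break the calculation, which is consistent with restricting the window to months where merchants have sales.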

The second approach used a Long Short-Term Memory (LSTM) model, but its forecasts varied significantly from run to run. Fine-tuning and pre-training could reduce this variability, but doing so would have been computationally expensive, so we ultimately used the first method.

After calculating the revenue that goes to the firm, we adjusted it by multiplying it by the combined fraud probability of merchants and consumers, computed as a weighted average. We assigned weights of $\alpha = 0.65$ to the merchant's fraud probability and $\beta = 0.35$ to the consumer's. The formula for the combined fraud probability is:

$$\text{Combined Fraud Probability (CBF)} = \alpha \times \text{Merchant's FP} + \beta\times\text{Consumer's FP}$$
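As a small worked example of the weighted average, the sketch below uses the weights stated above; the two input probabilities are made up.

```python
ALPHA, BETA = 0.65, 0.35  # weights from the text; they sum to 1

def combined_fraud_probability(merchant_fp, consumer_fp, alpha=ALPHA, beta=BETA):
    """Weighted average of merchant and consumer fraud probabilities."""
    return alpha * merchant_fp + beta * consumer_fp

# Made-up probabilities expressed as fractions.
cfp = combined_fraud_probability(merchant_fp=0.30, consumer_fp=0.10)
```

With these inputs the result is 0.65 × 0.30 + 0.35 × 0.10 = 0.23, which sits closer to the merchant's value because of its larger weight.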

We identified merchants with unrealistically high average monthly revenue growth (marked as red points in the plot), with the highest being 5108%. This prompted us to apply Winsorizing, which caps extreme growth rates at a chosen percentile rather than dropping the merchants.

![avg_growth_rate](../plots/growth_rate_v2.png)
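Winsorizing replaces values beyond chosen percentiles with the values at those percentiles. A minimal sketch follows, assuming 10th and 90th percentile limits and a simple index-based percentile; the limits actually used in the project are not stated here, and the growth rates are made up apart from the 5108% maximum mentioned above.

```python
def winsorize(values, lower_pct=0.10, upper_pct=0.90):
    """Clip values to the range spanned by the lower and upper percentiles."""
    ordered = sorted(values)
    n = len(ordered)
    # Simple percentile lookup by truncated index (an assumption of this sketch).
    low = ordered[int(lower_pct * (n - 1))]
    high = ordered[int(upper_pct * (n - 1))]
    return [min(max(v, low), high) for v in values]

# Growth rates as fractions; 51.08 corresponds to 5108%.
growth_rates = [-0.5, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 51.08]
capped = winsorize(growth_rates)
```

The extreme value is pulled down to the upper limit while every merchant stays in the data set, which is the main difference from trimming.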


Some merchants have very few orders per month. From a BNPL perspective, we prefer merchants with higher order volumes, as this typically results in more revenue for the BNPL firm. We also found that low average monthly order volumes lead to unstable growth rates and unrealistic revenue forecasts. Therefore, we need a weight that penalizes merchants with low order volumes, which we calculate using a Sigmoid function.

$$ W_{\text{num orders}} = \frac{1}{1 + e^{-(\bar{x}_i - \bar{x}_{\cdot})}}$$

where $\bar{x}_i$ is the average number of orders of merchant $i$ and $\bar{x}_{\cdot}$ is the average number of orders across all merchants.
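The order-volume weight can be sketched as below. The merchant names and order counts are made up, and the sigmoid is written in a numerically stable form so that merchants far below the overall average do not cause an overflow.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def order_volume_weights(avg_orders):
    """Weight per merchant: sigmoid of (merchant average - average over all merchants)."""
    overall = sum(avg_orders.values()) / len(avg_orders)
    return {m: sigmoid(x - overall) for m, x in avg_orders.items()}

# Made-up average monthly order counts for three hypothetical merchants.
avg_orders = {"m1": 2.0, "m2": 10.0, "m3": 18.0}
weights = order_volume_weights(avg_orders)
```

A merchant exactly at the overall average receives a weight of 0.5, while merchants well below it are pushed towards 0 and those well above it towards 1.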

![order_volume](../plots/order_volume_v2.png)

The **coefficient of variation** (CV) is the ratio of the standard deviation to the mean. It measures relative stability, which helps us compare merchants with different average revenues. We therefore created a weight that favors merchants with higher stability (lower CV). The weight is calculated as

$$W_{\text{CV}} = \frac{1}{1 + CV}$$
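A minimal sketch of the stability weight follows. It assumes the CV is computed on a merchant's monthly revenue using the population standard deviation; a sample standard deviation would work equally well, and the revenue series are made up.

```python
import statistics

def cv_weight(monthly_revenue):
    """Return 1 / (1 + CV), where CV = standard deviation / mean."""
    mean = statistics.mean(monthly_revenue)
    std = statistics.pstdev(monthly_revenue)  # population std (assumption)
    cv = std / mean
    return 1.0 / (1.0 + cv)

# Two made-up merchants with the same mean revenue but different volatility.
stable = [100.0, 100.0, 100.0]
volatile = [50.0, 100.0, 150.0]
w_stable = cv_weight(stable)
w_volatile = cv_weight(volatile)
```

A perfectly stable merchant has a CV of 0 and a weight of 1, and the weight shrinks towards 0 as revenue becomes more volatile relative to its mean.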

## Segmenting Merchants