# PROJECT SUMMARY

Due to limited resources, our BNPL business aims to work with 100 merchants only. This project seeks to construct a highly interpretable and robust rating system, providing insights to help the organisation locate the appropriate merchants.

![Alt Text](../plots/bnpl.png)

### Questions for ranking

How much does each merchant make?

How large is their customer base?

Are their customers loyal?

Are our customers and merchants trustworthy?

How much are we going to earn?


## Methodology

1. Pre-process provided data and external ABS data

2. Model missing customer and merchant fraud data

3. Visualize and explore patterns

4. Model each merchant’s future performance

5. Cluster merchants into relevant segments

6. Rank merchants for usage by a BNPL firm

## Provided Data

14 million individual transactions

Consumer details (address, ...)

Merchant details (ABN, descriptive tags, …)

Fraud data for both consumers and merchants

## External data
We also use the following external datasets:

2021 Australia census data

Postcode to Statistical Area 2 (SA2) code mapping table

SA2 boundaries shapefile

## Preprocessing

### Explore transactions

![Alt Text](../plots/transaction_dollar.png)

We observe transactions with values close to 0 dollars. These are implausible: few products are sold in that price range, and a rational consumer would not take on credit for such a small purchase. Transactions above 10,000 dollars also appear. These are equally unlikely and are excluded, as the BNPL industry bans transactions of more than a few thousand dollars because of the risk involved. Both extremes are abnormal and may be signs of fraud.

#### We thus remove outliers.

Outliers are defined as data points that fall below $Q_1 - 1.5 \times \mathrm{IQR}$ or above $Q_3 + 1.5 \times \mathrm{IQR}$.
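
A minimal sketch of this IQR filter, using only Python's standard library. The function name `remove_iqr_outliers` and the sample values are illustrative assumptions, and the quartile method used here (`inclusive`) may differ from the one used in the project.

```python
from statistics import quantiles


def remove_iqr_outliers(values):
    """Keep only values inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if lower <= v <= upper]


# Hypothetical transaction values in dollars (not project data).
sample = [20.0, 35.0, 42.5, 58.0, 61.0, 75.0, 90.0, 120.0, 15000.0]
cleaned = remove_iqr_outliers(sample)
```

In this sample only the 15,000 dollar transaction falls outside the fences and is dropped.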

After removal, the observed distribution is as follows:

![Alt Text](../plots/clean_transaction.png)



## Explore and impute missing consumer fraud

![Alt Text](../plots/consumer_fraud.png)

Transactions above 2,000 dollars are not used to predict future merchant performance or to rank merchants, and we advise our BNPL firm to apply this cap strictly, in line with industry practice. For modelling fraud, however, we still use the full range of transactions. We observe a strong linear relationship between customer spending and fraud probability. The connection between fraud and the number of orders is less obvious.


#### We chose to impute consumer fraud data using a linear regression model with AIC stepwise selection.
The final model and its performance are as follows:

#### Predictors and their Coefficients:

total_spent: 2.0996552300408644e-05

total_spent_per_order: 6.228304562183401e-06

total_spent_squared: -1.8457767617199225e-10

num_orders_squared: 0.0006513709972329626

Intercept: 0.03455972479087561

Final RMSE: 0.03886892546091528

R-squared: 0.8526665350226852
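
To make the fitted model concrete, the sketch below applies the coefficients listed above to a single consumer. It assumes that `total_spent_per_order` is `total_spent / num_orders` and that the squared terms are plain squares of `total_spent` and `num_orders`. The function name and the example consumer are hypothetical.

```python
# Coefficients copied from the fitted model reported above.
COEFFICIENTS = {
    "total_spent": 2.0996552300408644e-05,
    "total_spent_per_order": 6.228304562183401e-06,
    "total_spent_squared": -1.8457767617199225e-10,
    "num_orders_squared": 0.0006513709972329626,
}
INTERCEPT = 0.03455972479087561


def predict_consumer_fraud(total_spent, num_orders):
    """Linear prediction: intercept plus the weighted sum of the features."""
    if num_orders <= 0:
        raise ValueError("num_orders must be positive")
    # Assumed feature definitions (not confirmed by the source).
    features = {
        "total_spent": total_spent,
        "total_spent_per_order": total_spent / num_orders,
        "total_spent_squared": total_spent ** 2,
        "num_orders_squared": num_orders ** 2,
    }
    return INTERCEPT + sum(COEFFICIENTS[k] * features[k] for k in COEFFICIENTS)


# Hypothetical consumer: 1,000 dollars spent over 10 orders.
example = predict_consumer_fraud(total_spent=1000.0, num_orders=10)
```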

## Explore and impute merchant fraud

Only 1.55% of merchants have fraud data, and even for these it is only available for several days, meaning that more than 99.9% of the data is missing.

![Alt Text](../plots/merchant_fraud1.png)
![Alt Text](../plots/merchant_fraud2.png)

Correlation between fraud_probability and total_money: -0.3411541778846223

Correlation between fraud_probability and num_transactions: -0.24645163406918985

Correlation between fraud_probability and avg_transaction_value: 0.2673609612684381

### We decided to use a very simple decision tree to impute merchant fraud.

However, because more than 99.9% of the records are missing, this fraud data could also be discarded entirely to preserve validity. Mean imputation is another option: our final objective is ranking, and mean imputation treats all unobserved merchants equally.

With a selected maximum depth of 2 and a 9:1 train-test split, the tree performs as follows:

RMSE: 6.046683793780371

R-squared: 0.8115129482946885

We must note, however, that with so many records missing, the validity of this imputation is questionable.

## External ABS data handling

We link our customers to their related information in the ABS data. However, we only know each customer's postcode, and their addresses are synthesized from USA addresses, which makes linking challenging. Because a postcode may be matched to multiple SA2 regions, we choose the region with the highest ratio (the percentage of that postcode's population) as its representative.
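
The postcode-to-SA2 choice can be sketched as follows. This is only an illustration: the function name, the row layout `(postcode, sa2_code, ratio)` and the SA2 codes are hypothetical, not the actual columns of the correspondence table.

```python
def pick_representative_sa2(correspondence):
    """Map each postcode to the SA2 code with the highest population ratio."""
    best = {}
    for postcode, sa2_code, ratio in correspondence:
        if postcode not in best or ratio > best[postcode][1]:
            best[postcode] = (sa2_code, ratio)
    return {postcode: sa2 for postcode, (sa2, _) in best.items()}


# Hypothetical rows: (postcode, SA2 code, share of the postcode's population).
rows = [
    ("3000", "SA2_A", 0.7),
    ("3000", "SA2_B", 0.3),
    ("3053", "SA2_C", 1.0),
]
mapping = pick_representative_sa2(rows)
```

Here postcode 3000 is linked to `SA2_A`, the region holding the larger share of its population.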

![Alt Text](../plots/outlier_abs.png)

We should remove instances that contain outlier values for some features. For example, the minimum values of many statistics are zero, which does not make sense. We should also remove any NaN values.

![Alt Text](../plots/abs.png)

We are interested in median income for our final ranking objective. We expect our customers to be able to repay their debt, and we hope to attract more spending from high-income groups through targeted advertising in the future.

![Alt Text](../plots/geo1.png)
![Alt Text](../plots/geo2.png)
![Alt Text](../plots/geo3.png)

We can see that our merchants attract orders from all across Australia. A large customer base spread across many regions is advantageous, as our brand is exposed to more people and can attract more customers.


# Model merchant future performance

In our business model, consider a transaction such as a bicycle costing 100 dollars. We pay the merchant 94 dollars right away (a 6% take rate). We expect an honest user to pay us back 100 dollars, so we gain 6 dollars. However, when the user is fraudulent, we lose 94 dollars. For simplicity, we assume that nothing can be recovered in that case. In reality, the recovery rate may be a%, and the loss calculation would be adjusted accordingly.

We do not incorporate merchant fraud in this calculation. For customers, some level of fraud is acceptable because there are many of them and most are honest. Merchants are a special case: there are only 100 of them and this business model requires a high level of trust. If we discover a single case of fraud, we stop cooperating with the merchant right away, take them to court and ask for compensation. These terms need to be highlighted in our contract.
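
The commission logic described above can be sketched as an expected value per transaction. Weighting the two outcomes by a customer fraud probability, and applying the recovery rate to the amount paid out to the merchant, are assumptions made for illustration. The function and parameter names are hypothetical.

```python
def expected_commission(amount, fraud_probability, take_rate=0.06, recovery_rate=0.0):
    """Expected gain on one transaction under the simplified business model."""
    payout = amount * (1 - take_rate)             # paid to the merchant up front
    gain_if_honest = amount - payout              # e.g. 6 dollars on a 100 dollar sale
    loss_if_fraud = payout * (1 - recovery_rate)  # e.g. 94 dollars with no recovery
    return (1 - fraud_probability) * gain_if_honest - fraud_probability * loss_if_fraud


honest = expected_commission(100.0, fraud_probability=0.0)      # about 6
fraud = expected_commission(100.0, fraud_probability=1.0)       # about -94
break_even = expected_commission(100.0, fraud_probability=0.06)
```

Under these assumptions and with no recovery, a transaction breaks even when the fraud probability equals the take rate, which shows how little customer fraud the margin can absorb.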

![Alt Text](../plots/commision.png)

### We will use past data to model each merchant's commission for the next 6 months. Our model of choice is SARIMAX.

# SARIMAX Model Overview

The **SARIMAX** model (Seasonal AutoRegressive Integrated Moving Average with eXogenous variables) is chosen for forecasting because it is designed for time series data. It incorporates several key components:

- **Auto-regression (AR)**: Uses past values of the target variable to predict future values.
- **Moving Average (MA)**: Utilizes past forecast errors to make predictions.
- **Integrated (I)**: Applies differencing to the series in order to make it stationary.
- **Seasonality**: Captures seasonal patterns in the data (e.g., weekly, monthly seasonality).

### Hyperparameters of the SARIMAX Model

The SARIMAX model requires several hyperparameters to be defined:

- **p**: The number of lag observations in the AR model (the number of previous terms used for prediction).
- **d**: The number of differences applied to make the time series stationary.
- **q**: The number of lagged forecast errors in the MA model.
- **P, D, Q, m**: These are the seasonal components:
  - **P**: Seasonal autoregressive order.
  - **D**: Seasonal differencing order.
  - **Q**: Seasonal moving average order.
  - **m**: The number of observations per season.

These parameters allow the model to capture both short-term and seasonal patterns in the data, making it ideal for time series forecasting.

## We first find the hyperparameters

### We take the average profit across all merchants on each date and treat it as a representative merchant for finding the best parameters and for visualization. We apply a grid search over the hyperparameters, selecting the best by AIC.

![Alt Text](../plots/forecast.png)

### We then apply this to nearly 4000 merchants. As every merchant has its own model, this takes significant time to train.

### In the end, MLE fails to converge for 30 of the 3930 merchants. These merchants likely have zero revenue for days at a time or highly fluctuating revenue. As they account for less than 1% of merchants, we remove them. Alternatively, we could impute their forecasts with the last 6 months' revenue.

# Segmenting business

We first clean the merchant tags (removing stop words, lemmatizing, one-hot encoding, ...). In the end we extract 84 words, displayed in the word cloud below, which hint at the products and services provided by our merchants.

![Alt Text](../plots/cloud.png)

Finally, we group the merchants into 9 clusters. K-means clustering is used, but the final clusters are decided based on domain knowledge of business products. For example, the group involving "computer", "system", "programming", ... is named "IT and tech gadgets".

![Alt Text](../plots/cluster.png)

# Groups and their overall performance

![Alt Text](../plots/cluster_stat1.png)

![Alt Text](../plots/cluster_stat2.png)

#### We find that Arts and Decors leads in revenue, orders and returned commission. Businesses like Housing Appliances rank 4th in revenue but 7th in total orders, indicating that their transactions are of high value. Businesses like souvenirs and gifts rank 8th in revenue but 4th in total orders, indicating that their transactions are of low value. These differences will be taken into account when we later calculate the ranking score.

#### We decided to take the 5 most numerous groups (Arts and Decors, IT and Tech Gadgets, Leisure and Hobbies, Housing Appliances and Furnitures, Accessories and Luxuries). These are also in the top 5 for revenue and profit.

# RANKING

![Alt Text](../plots/summary1.png)
![Alt Text](../plots/summary2.png)

### Candidate features for final ranking:

commission

merchant_fraud

user_number

region_reached

median_income

user_returned_rate

merchant_revenue

### We expect our merchants to have high revenue, a high number of users, a high number of regions reached and a high return rate among their customers. They need to bring us high commission and, most importantly, they need to have a low fraud rate.

![Alt Text](../plots/matrix.png)

Inspecting the correlation matrix, we find that some features are highly correlated with each other, such as merchant_revenue and commission (0.92) or region_reached and num_user (0.81). We decided to remove merchant_revenue and region_reached to reduce this intercorrelation, so that the remaining features contribute independent value to the ranking score. This ensures that a merchant with an advantage in one facet is not rewarded for it twice and does not gain an unfair advantage.

![Alt Text](../plots/score.png)

The ranking score is a simple linear combination of features. Since there are no inherent labels for our merchants for training or evaluation, we cannot employ more sophisticated ranking methods, and it is up to us to decide what is suitable. Each feature_i is min-max scaled. For fraud_rate, since lower is better, we use 1 - fraud_rate in the score. The weights sum to 1. Different businesses have different feature weights.

The general weighting is 0.5, 0.2, 0.1, 0.1, 0.1 for commission, fraud, num_user, median_income and returned_rate respectively. We put a high weight on commission as this is what our business runs on. Fraud rate also gets a high weight as we will not tolerate merchant fraud in our business. The weights are adjusted for different businesses. For example, since housing appliances and furnishing has high-value transactions, we put more weight on fraud and median income (to lower the risk) and less weight on user number. These weights can be adjusted further to match the nature and requirements of our BNPL firm's business.
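
The scoring scheme can be sketched as below, using the general weights stated above. The merchant names and feature values are hypothetical, and applying `1 - x` to the fraud feature after min-max scaling (rather than before) is an assumption.

```python
GENERAL_WEIGHTS = {
    "commission": 0.5,
    "fraud": 0.2,
    "num_user": 0.1,
    "median_income": 0.1,
    "returned_rate": 0.1,
}


def min_max_scale(values):
    """Scale values to [0, 1]; a constant feature maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def ranking_scores(merchants, weights=GENERAL_WEIGHTS):
    """Weighted sum of min-max scaled features, with fraud inverted."""
    names = list(merchants)
    scores = dict.fromkeys(names, 0.0)
    for feature, weight in weights.items():
        scaled = min_max_scale([merchants[n][feature] for n in names])
        if feature == "fraud":
            scaled = [1.0 - s for s in scaled]  # lower fraud scores higher
        for name, value in zip(names, scaled):
            scores[name] += weight * value
    return scores


# Hypothetical merchants (not project data).
merchants = {
    "merchant_a": {"commission": 12000, "fraud": 0.02, "num_user": 900,
                   "median_income": 60000, "returned_rate": 0.4},
    "merchant_b": {"commission": 8000, "fraud": 0.05, "num_user": 1200,
                   "median_income": 52000, "returned_rate": 0.3},
    "merchant_c": {"commission": 3000, "fraud": 0.10, "num_user": 300,
                   "median_income": 48000, "returned_rate": 0.1},
}
scores = ranking_scores(merchants)
ranked = sorted(scores, key=scores.get, reverse=True)
```

A business-specific weighting, such as the one for housing appliances, would be passed in through the `weights` argument.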

# TOP 10 per business

![Alt Text](../plots/top10.png)

Here we can observe a downward trend in num_user and commission from rank 1 to rank 10 in each business. The fraud rate shows some upward trend from rank 1 to rank 10.

# TOP 100 merchants

![Alt Text](../plots/top100.png)

Here we again observe a clear downward trend in commission from rank 1 to rank 100, as we put a very high weight on commission. The pattern for the number of users is less obvious, while we observe some upward trend in fraud rate.


![Alt Text](../plots/100cluster.png)

The proportions of the different businesses in the top 100 seem to closely reflect their initial proportions.

# Assumptions

No confounding factors other than the features included in our data

Consumer address is likely linked to the most populous SA2 district if a postcode is said to belong to two SA2 districts

Individual consumer’s income can be represented by the median of their SA2.

Consumer fraud is not insured or fully recovered, so the BNPL business loses all fraudulent transactions

# Limitations and difficulties

Missing Data: Our biggest obstacle was dealing with missing data, e.g. missing fraud rate data, ...

Small Dataset: Our transactions data only covers ~1.5 years. A larger range of data would be beneficial in determining consumer patterns over years.

Significant time to train SARIMAX for nearly 4000 merchants.

External Correspondence: mismatches when dealing with the external dataset, e.g. SA2 regions are not 1-to-1 with postcodes.

No ranking labels are available (similar to how QS ranks universities), so it is up to us to decide the direction.


# Future improvements and applications

Firstly, we can enhance our model for rating customer fraud and create a credit rating system that decides whether a transaction is accepted or not. This would require multi-layer authentication, integration with bank accounts, credit data from third parties….

Secondly, we can bring our ranking model into production. Interested merchants will apply through an online portal, and we accept them if they pass a certain score threshold. We will build an automatic pipeline for this.

Thirdly, we can build bots that scrape the internet and social media for potential merchants and feed them into our model. If a candidate passes, we will automatically send an invitation email. If things go well, our sales team will contact them to negotiate terms. However, we need to be mindful of privacy when doing this.
