# Buy Now, Pay Later Project

The Buy Now, Pay Later (BNPL) Firm has begun offering a new “Pay in 5 Installments” feature and is going to onboard 100 merchants every year. This project focuses on these tasks:


- Overview of consumer and transaction data
- Analysis to find the 100 best merchants
- Recommendations for BNPL

_________________

## Dataset Overview
The dataset provided for this project includes:
- Transaction Dataset
- Consumer Dataset
- Merchant Dataset

External datasets used to enrich the consumer analysis:
- [Australian postcode](https://www.matthewproctor.com/australian_postcodes)
- Income by SA2 Districts (ABS)
- SA2 shapefile (ABS)

### Provided Dataset

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

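# Create a Spark session (the entry point for running Spark jobs)
spark = (
    SparkSession.builder.appName("ADS project 2")
    # Render DataFrames eagerly when a cell evaluates to one
    .config("spark.sql.repl.eagerEval.enabled", True)
    # Cache Parquet schema metadata to speed up repeated reads
    .config("spark.sql.parquet.cacheMetadata", "true")
    # Interpret timestamps in UTC regardless of the machine's time zone
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .getOrCreate()
)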

In [None]:
transaction_sdf = spark.read.parquet('../data/tables/transactions_*/*')
print(f"Transaction dataset includes {transaction_sdf.count()} transaction records.")
print("Features included are: ")
transaction_sdf.printSchema()

In [None]:
consumer_sdf = spark.read.option("delimiter", "|").csv('../data/tables/tbl_consumer.csv', inferSchema=True, header=True)
print(f"Consumer dataset includes {consumer_sdf.count()} consumer records.")
print("Features included are: ")
consumer_sdf.printSchema()

In [None]:
merchant_sdf = spark.read.csv("../data/curated/merchant.csv", inferSchema=True, header=True)
print(f"Merchant dataset includes {merchant_sdf.count()} merchant records.")
print("Features included are: ")
merchant_sdf.printSchema()

### External Datasets Overview

1. **Australian Postcode**  
Used to convert the postcode of each region to its SA2 code for later geospatial plotting

2. **Income by SA2 Districts**  
Used to analyse the purchasing power of consumers from different regions, which may correlate with the final assessment of the merchants

In [None]:
postcode_SA2_sdf = spark.read.csv("../data/curated/processed_postcode.csv", inferSchema=True, header=True)
print("Features included are: ")
postcode_SA2_sdf.printSchema()

In [None]:
income_sdf = spark.read.csv("../data/curated/processed_income.csv", inferSchema=True, header=True)
print("Features included are: ")
income_sdf.printSchema()

## Visualisation
We inspect how each of the following three features varies with location:
- Mean total income
- Number of Consumers
- Number of Transactions

**Mean Total Income Map**:  
The only three areas coloured red in all of Australia are near Perth, meaning that consumers there have a relatively higher mean total income than consumers in the rest of Australia. Therefore, these areas may be more profitable for the BNPL company to target.

<img src="../plots/mean_total_income.png" width="400"/> 
<img src="../plots/mean_income_perth.png" width="400"/>


**Number of Consumers by SA2 Map:**  
WA and SA have relatively more consumers than other states.

**Number of Transactions Map:**  
Similar to the number of consumers map, which can be explained by the correlation between the number of consumers and the number of transactions (more consumers mean more potential purchases and therefore more transactions).

<img src="../plots/num%20consumers.png" width="400"/>
<img src="../plots/num%20transactions.png" width="400"/>

**Number of Transactions Per Day:**  
The volume of transactions during Christmas and the summer holidays is higher than in the rest of the year. The transaction volume also follows a similar trend from year to year. For example, the trends from March to November are similar in each year. Our model makes an assumption based on these trends, which is explained later.

<img src="../plots/Number of Transactions Each Day.png" width="800"/> 


**Number of Consumers in Each State:**  
NSW and Victoria have the most consumers. Hence, the company may earn more profits in these two states.

<img src="../plots/consumer distribution.png" width="600"/> 

## Assumptions
1. Transactions follow a similar pattern each year

2. Transactions on days not listed in the consumer and merchant delta files have a 1% fraud rate

## Limitations
1. Limited transaction data (from 2021-02-28 to 2022-10-26)

2. Missing values: 2 postcodes out of 3167 postcodes do not have a corresponding SA2 code


## Fraud Detection Model
A 5% fraud probability benchmark was set to label a transaction as fraudulent or not. The two given delta files (covering transactions from 2021-02 to 2022-02) were used to train a logistic model that classifies whether the remaining transactions are fraudulent. All transactions labelled as fraud were then removed from the full dataset before building the ranking system.
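
The labelling and removal step can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's pipeline: the record layout, the field name `fraud_probability`, the choice of percent as the unit, and treating a probability exactly at the benchmark as fraud are all hypothetical.

```python
# Assumption: each transaction carries a predicted fraud probability in percent.
FRAUD_THRESHOLD = 5.0  # the 5% benchmark from the text


def is_fraud(fraud_probability, threshold=FRAUD_THRESHOLD):
    """Label a transaction as fraud when its probability reaches the benchmark."""
    return fraud_probability >= threshold


def remove_fraud(transactions, threshold=FRAUD_THRESHOLD):
    """Keep only the transactions that are not labelled as fraud."""
    return [
        t for t in transactions
        if not is_fraud(t["fraud_probability"], threshold)
    ]


# Hypothetical records for illustration only
transactions = [
    {"order_id": "a1", "fraud_probability": 0.8},
    {"order_id": "b2", "fraud_probability": 12.5},
    {"order_id": "c3", "fraud_probability": 5.0},
]
clean = remove_fraud(transactions)
print([t["order_id"] for t in clean])
```

In the real pipeline the probabilities would come from the fitted logistic model rather than from hand-written records.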

## Ranking System
To select the best merchants to partner with, we first summarise each merchant’s data into several features:
1. Total number of consumers
2. Average transaction dollar value
3. Total number of transactions
4. Mean income of consumers
5. Revenue level
6. BNPL revenue = take rate × total transaction value
7. Number of distinct postcodes
8. Tag
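
A few of the per-merchant features listed above can be sketched as a simple aggregation. The field names (`merchant_abn`, `user_id`, `dollar_value`), the sample records and the reading of the take rate as a percentage of transaction value are assumptions made for illustration; the project itself aggregates with Spark.

```python
from collections import defaultdict


def summarise_merchants(transactions, take_rates):
    """Aggregate per-merchant features from raw transaction records."""
    grouped = defaultdict(list)
    for t in transactions:
        grouped[t["merchant_abn"]].append(t)

    summary = {}
    for abn, rows in grouped.items():
        total_value = sum(r["dollar_value"] for r in rows)
        summary[abn] = {
            "num_consumers": len({r["user_id"] for r in rows}),
            "num_transactions": len(rows),
            "avg_dollar_value": total_value / len(rows),
            # Assumption: take rate is a percentage of the transaction value
            "bnpl_revenue": take_rates[abn] / 100 * total_value,
        }
    return summary


# Hypothetical records for illustration only
transactions = [
    {"merchant_abn": 111, "user_id": 1, "dollar_value": 100.0},
    {"merchant_abn": 111, "user_id": 2, "dollar_value": 300.0},
    {"merchant_abn": 111, "user_id": 1, "dollar_value": 200.0},
    {"merchant_abn": 222, "user_id": 3, "dollar_value": 50.0},
]
take_rates = {111: 5.0, 222: 2.0}
features = summarise_merchants(transactions, take_rates)
print(features[111])
```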

We then use each merchant’s historical data to predict its future business prospects:
1. Predicted total number of consumers
2. Next year BNPL revenue
3. Predicted total number of transactions

### Feature Selection
- Categorical variables: tag, revenue level  
Use an ANOVA test to examine the significance of these variables in predicting each target variable

- Continuous variables  
Calculate the Pearson correlation of each pair of continuous variables
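
The check on continuous variables can be sketched with a hand-written Pearson coefficient. The feature values and the `cutoff` of 0.1 for "little correlation" are hypothetical; the source does not state which threshold was used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def weakly_correlated(features, target, cutoff=0.1):
    """Names of features whose |r| with the target falls below the cutoff."""
    return [
        name for name, values in features.items()
        if abs(pearson(values, target)) < cutoff
    ]


# Hypothetical values for illustration only
features = {
    "num_transactions": [1, 2, 3, 4, 5],
    "mean_income": [5, 1, 3, 1, 5],
}
target = [2, 4, 6, 8, 10]
print(weakly_correlated(features, target))
```

Features returned by `weakly_correlated` would be the candidates to drop from the models.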

In [None]:
train_df = pd.read_parquet('../data/curated/train_data/')
model = ols('y_total_num_consumer ~ C(tag) + C(revenue_level)', data=train_df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
anova_table

In [None]:
model = ols('y_total_revenue ~ C(tag) + C(revenue_level)', data=train_df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
anova_table

In [None]:
model = ols('y_total_num_transaction ~ C(tag) + C(revenue_level)', data=train_df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
anova_table

**Observation**:  
"tag" is significant in all models, whereas "revenue level" is not a significant feature in predicting total number of consumers and transactions.

<img src="../plots/Pearson Correlation Metric.png" width="600"/>

**Observation**:  
"Mean income" and "average dollar value" have little correlation with the target variables. Therefore, they can be excluded from the models.

### Ranking Criteria
**Modelling:**  
After summarising each merchant's data, we fit a machine learning model for each of the target variables below:
- BNPL revenue: Multi-layer Perceptron

- Number of consumers: Linear Regression

- Number of transactions: Linear Regression

<br>
For predicting the **number of consumers and transactions** next year, Linear Regression and the Neural Network produce similar results. **Linear Regression** is chosen as the final model since it is more interpretable and faster to run.

However, when predicting **total revenue**, we choose the **Neural Network** as it performs better, with a higher R² score and a lower mean absolute error.
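
The two metrics used for this comparison can be written out in a few lines. This is a minimal sketch of the standard definitions; the true values and the predictions of `model_a` and `model_b` are hypothetical and are not results from the project.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical values for illustration only
y_true = [10.0, 20.0, 30.0, 40.0]
model_a = [12.0, 18.0, 33.0, 37.0]
model_b = [15.0, 15.0, 35.0, 35.0]

for name, pred in [("model_a", model_a), ("model_b", model_b)]:
    print(name, r2_score(y_true, pred), mean_absolute_error(y_true, pred))
```

A model is preferred when its R² is higher and its mean absolute error is lower, as `model_a` is here.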


<br>

**Ranking Equation:**  
The ranking system uses the predicted number of consumers, the predicted number of transactions and the revenue the company could gain from each merchant next year. Each merchant receives a score between 0 and 100, calculated as follows:

1. Scale each attribute to the range 0-100 using min-max normalisation

2. Weight the predicted total number of consumers by 30%

3. Weight next year's BNPL revenue by 40%

4. Weight the predicted total number of transactions by 30%

5. Sum the three weighted attributes to obtain the score

Because the revenue that the BNPL firm could earn is our highest priority, this feature is assigned the largest weight.
The number of consumers and the transaction volume are included since they are considered to be positively related to a merchant’s stability and long-term revenue.
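
The scoring steps can be sketched as follows. The 30/40/30 weights come from the text; the function names, the multiplication by 100 in the scaling step and the raw values passed to `min_max_scale` are assumptions. The scaled inputs to `ranking_score` are those of the rank-1 merchant in the top-100 table under Result, and the weighted sum reproduces that merchant's reported score.

```python
WEIGHTS = {"consumers": 0.3, "revenue": 0.4, "transactions": 0.3}


def min_max_scale(values, upper=100.0):
    """Rescale values linearly so the minimum maps to 0 and the maximum to `upper`."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * upper for v in values]


def ranking_score(scaled_consumers, scaled_revenue, scaled_transactions):
    """Weighted sum of the three scaled attributes (weights from the text)."""
    return (
        WEIGHTS["consumers"] * scaled_consumers
        + WEIGHTS["revenue"] * scaled_revenue
        + WEIGHTS["transactions"] * scaled_transactions
    )


# Hypothetical raw predictions, only to show the scaling step
print(min_max_scale([10.0, 15.0, 20.0]))

# Scaled values of the rank-1 merchant (Leo In Consulting) from the results table
score = ranking_score(
    scaled_consumers=81.846275,
    scaled_revenue=97.593499,
    scaled_transactions=93.84799,
)
print(round(score, 6))
```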


### Split Merchants into 4 Segments
Based on [Merchant Category Groups by ANZ](https://www.anz.com/Documents/Business/CommercialCard/Merchant_cateogry_codes_control.pdf), we divide all merchants into 4 segments.

1. Health service: health, optician

2. Recreational good retailing: bicycle, books, stationery, hobby, tent, digital goods

3. Personal & household good retail: antique, watch, jewellery, music, artist supply, gift, art dealer, florists, furniture, shoe, garden supply

4. Technical & machinery service: cable, telecom, computer, equipment, motor
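
The segmentation above can be expressed as a keyword lookup. This is a sketch under stated assumptions: the keyword lists mirror the four segments listed here, while the substring matching rule and the function name `segment_of` are hypothetical and may differ from how the project assigned segments.

```python
SEGMENT_KEYWORDS = {
    "health service": ["health", "optician"],
    "recreational good retailing": [
        "bicycle", "books", "stationery", "hobby", "tent", "digital goods",
    ],
    "personal & household good retail": [
        "antique", "watch", "jewellery", "music", "artist supply", "gift",
        "art dealer", "florists", "furniture", "shoe", "garden supply",
    ],
    "technical & machinery service": [
        "cable", "telecom", "computer", "equipment", "motor",
    ],
}


def segment_of(tag):
    """Return the first segment with a keyword occurring in the tag, else None."""
    tag = tag.lower()
    for segment, keywords in SEGMENT_KEYWORDS.items():
        if any(keyword in tag for keyword in keywords):
            return segment
    return None


for tag in ["tent", "florists", "cable", "optician"]:
    print(tag, "->", segment_of(tag))
```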

### Result 
The top 100 merchants overall and top 10 merchants in each segment are displayed on the [website](https://rank-merchant.herokuapp.com/v1/top100). 

For each merchant on the website, we display its predicted future business features. For the top 10 merchants in each segment, clicking on a merchant shows its historical trends in revenue, number of consumers and number of transactions.

In [2]:
top100 = pd.read_csv("../data/curated/top100.csv")
top100.head(10)

| Unnamed: 0 | rank | merchant_abn | name | tags | revenue_level | take_rate | pred_total_num_consumer | pred_total_num_transaction | pred_total_revenue | scaled_pred_total_num_consumer | scaled_pred_total_num_transaction | scaled_pred_total_revenue | score | segment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 86578477987 | Leo In Consulting | watch | a | 6.43 | 18356.113566 | 186946.749066 | 39925150.0 | 81.846275 | 93.84799 | 97.593499 | 91.745679 | personal & household good retail |
| 1 | 2.0 | 45629217853 | Lacus Consulting | gift | a | 6.98 | 20191.435558 | 152028.388588 | 38193320.0 | 90.029612 | 76.318838 | 93.360191 | 87.248612 | personal & household good retail |
| 2 | 3.0 | 89726005175 | Est Nunc Consulting | tent | a | 6.01 | 20434.182107 | 148236.132238 | 35873460.0 | 91.111971 | 74.415111 | 87.689512 | 84.73393 | recreational good retailing |
| 3 | 4.0 | 49891706470 | Non Vestibulum Industries | tent | a | 5.8 | 19570.482775 | 169802.295225 | 30463440.0 | 87.260907 | 85.241408 | 74.465184 | 81.536768 | recreational good retailing |
| 4 | 5.0 | 21439773999 | Mauris Non Institute | cable | a | 6.1 | 22388.332445 | 81951.886245 | 37059750.0 | 99.825141 | 41.140163 | 90.589293 | 78.525308 | technical & machinery service |
| 5 | 6.0 | 32361057556 | Orci In Consequat Corporation | gift | a | 6.61 | 21544.352688 | 58447.433467 | 39867280.0 | 96.062002 | 29.340837 | 97.452059 | 76.601675 | personal & household good retail |
| 6 | 7.0 | 64403598239 | Lobortis Ultrices Company | music | a | 6.31 | 22427.549126 | 77913.665936 | 35278950.0 | 100.0 | 39.112961 | 86.236284 | 76.228402 | personal & household good retail |
| 7 | 8.0 | 43186523025 | Lorem Ipsum Sodales Industries | florists | b | 4.47 | 21166.87885 | 138185.373344 | 27543230.0 | 94.378921 | 69.369591 | 67.326996 | 76.055352 | personal & household good retail |
| 8 | 9.0 | 24852446429 | Erat Vitae LLP | florists | c | 2.94 | 18360.089252 | 199201.65464 | 19451030.0 | 81.864002 | 100.0 | 47.546338 | 73.577736 | personal & household good retail |
| 9 | 10.0 | 94493496784 | Dictum Phasellus In Institute | gift | a | 5.65 | 22215.379057 | 67344.125535 | 33613700.0 | 99.053976 | 33.807011 | 82.165735 | 72.72459 | personal & household good retail |


## Insights & Recommendation

<img src="../plots/Tag and Segments Distribution in TOP100.png" width="800"/>

**Insight (1):**  
Within the Top 100 merchants, the **Personal and Household retail** segment is the largest at 43%. Possible reasons for this observation are:
- It covers the most individual tags and merchants. 
- For individual consumers, purchases of personal and household goods are more frequent and more sustained over the long term.

Therefore, these merchants are more likely to run a low-risk business, as demand is consistently large. 

Among individual tags, the two most common are **‘tent’** and **‘computer’**, with 14 and 12 of the 100 merchants respectively. This may be because tent merchants often have high customer volume and flow, while computer merchants are associated with greater transaction values.

<br>

**Recommendation (1):**  
We recommend that the BNPL company invest in the Personal & Household retailers for low-risk returns. The final ranking can also serve as an assessment of whether a particular business behaviour is likely to lead to large revenues and, in turn, to future business opportunities. 

<img src="../plots/Average Total Revenue of Merchants for Each Segment.png" width="500"/>

**Insight (2):**  
Comparing the average total revenue of the Top 100 merchants with that of all other merchants shows that the BNPL company’s profits can be more than **10 times higher** if it focuses its resources on cooperating with the top merchants. This suggests that our final result is reliable in identifying which merchants are more likely to bring high returns in each segment. 

**Recommendation (2):**  
We strongly recommend that the company build a similar system, consisting of predictions of merchants’ future business prospects and a ranking system based on those predictions. This allows the company to concentrate its resources where the benefits are greatest. 

**Recommendation (3):**  
Based on our earlier geospatial analysis, we recommend that the company focus on merchants and businesses in Perth, NSW and VIC, since areas near Perth have the highest mean income and NSW and VIC have the most consumers.

## Reflection
Some difficulties we ran into: 
- **Limited computing resources and time constraints:** With millions of transaction records, building and running our code took more time

- **Limited data:** We had less than two years of data, which may show a yearly pattern but does not reveal long-term trends.

If we could address these issues, we could have built better models and obtained more accurate results.