# Summary of the Project

In [1]:
from pyspark.sql import SparkSession, functions as F
import pandas as pd

# Create a spark session
spark = (
    SparkSession.builder.appName("BNPL Project")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.driver.memory", "4g")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .getOrCreate()
)


## ETL Script

The ETL script carries out all preprocessing, category assignment, and incorporation of the external dataset. It outputs a single parquet file, "process_data.parquet".

In [2]:
# Read output file of ETL script
sdf = spark.read.parquet("../data/curated/process_data.parquet/")

                                                                                

A sample instance of the combined dataset is shown below:

In [3]:
# show an instance of the dataset
sdf.show(1, vertical=True)

-RECORD 0----------------------------------------------------
 merchant_abn                         | 31585975447          
 consumer_id                          | 1656                 
 user_id                              | 8913                 
 dollar_value                         | 51.28                
 order_id                             | 00001f53-b987-4b4... 
 order_datetime                       | 2021-07-24           
 state                                | NSW                  
 postcode                             | 1163                 
 gender                               | Male                 
 merchant_name                        | Dolor Dapibus Gra... 
 tag                                  | digital goods: bo... 
 revenue                              | b                    
 rate                                 | 0.0312               
 category                             | retail_and_wholes... 
 subcategory                          | household_goods_r... 
 merchan

                                                                                

## Categories of the merchants

The merchants are divided into 5 categories based on their tags:
1. Rental Hiring and Real Estate
2. Retail and Wholesale Trade
3. Agriculture
4. Arts and Recreation
5. Info, Media and Telecommunications

This allows merchants from different industries to be ranked separately.
The number of merchants and transactions for each category is shown below.

In [4]:
# calculate number of transactions per category
transactionpd = sdf.groupBy("category").count().toPandas()

# calculate number of distinct merchants per category
# (an explicit alias avoids relying on Spark's auto-generated column name)
merchantpd = (
    sdf.groupBy("category")
    .agg(F.countDistinct("merchant_abn").alias("number of merchants"))
    .toPandas()
)

# format and show table
df = transactionpd.merge(merchantpd, on="category")
df = df.rename(columns={"count": "number of transactions"})
df

                                                                                

| | category | number of transactions | number of merchants |
|---|---|---|---|
| 0 | retail_and_wholesale_trade | 11321695 | 2956 |
| 1 | rental_hiring_and_real_estate | 34040 | 134 |
| 2 | arts_and_recreation | 20101 | 112 |
| 3 | others | 244335 | 164 |
| 4 | info_media_and_telecommunications | 1886393 | 655 |


## Fraud Data

To identify fraudulent transactions, Gradient-Boosted Tree Regression is used. The provided user and merchant fraud probabilities are used to predict the fraud probabilities of the remaining transactions. After prediction, outliers in the user fraud probabilities and in the merchant fraud probabilities are identified. A transaction is considered fraudulent when both its user and merchant fraud probabilities are outliers. Such transactions are then removed from the dataset.

This fraud detection model is included in the ETL script.
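
The "both probabilities are outliers" rule can be sketched in plain Python. This is only an illustration, not the ETL script's code: the outlier definition (upper Tukey fence, Q3 + 1.5 × IQR), the `user_fraud_probability` field name, the helper names, and the sample records are all assumptions made for this example.

```python
from statistics import quantiles


def upper_fence(values, k=1.5):
    """Upper Tukey fence Q3 + k * IQR (assumed outlier rule, not taken from the ETL script)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def flag_fraud(transactions):
    """Return transactions whose user AND merchant fraud probabilities are both outliers."""
    user_cut = upper_fence([t["user_fraud_probability"] for t in transactions])
    merchant_cut = upper_fence([t["merchant_fraud_probability"] for t in transactions])
    return [
        t
        for t in transactions
        if t["user_fraud_probability"] > user_cut
        and t["merchant_fraud_probability"] > merchant_cut
    ]


# Hypothetical predicted probabilities, one (user, merchant) pair per transaction.
user_probs = [0.10, 0.12, 0.11, 0.13, 0.09, 0.10, 0.95, 0.90]
merchant_probs = [0.20, 0.22, 0.21, 0.19, 0.20, 0.23, 0.92, 0.21]
transactions = [
    {"order_id": f"t{i + 1}", "user_fraud_probability": u, "merchant_fraud_probability": m}
    for i, (u, m) in enumerate(zip(user_probs, merchant_probs))
]

flagged = flag_fraud(transactions)
flagged_ids = [t["order_id"] for t in flagged]
print(flagged_ids)
```

In this sample, transaction `t8` has an unusually high user probability but an ordinary merchant probability, so it is kept. Only `t7`, which is extreme on both measures, is flagged.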

## Outlier Analysis

Outlier analysis is done on the dollar value of each transaction.

To ensure that transactions of similar scale are compared, the largest merchant category, retail and wholesale trade, is further divided into 5 subcategories.

Visualisations of outliers can be seen in detail in "Transaction_Analysis.ipynb".

## Ranking

After outlier analysis and removal of fraudulent transactions, the dataset is ready for ranking.

To rank the merchants, the expected revenue of each merchant is calculated. This is done by summing the dollar value of all non-fraudulent transactions of a merchant and multiplying the total by the merchant's take rate.

In addition, the standard deviation of the dollar value of each merchant's transactions is calculated and then multiplied by the take rate. This measures the consistency of each merchant's transactions. A high standard deviation means that the merchant's sales are highly varied, so a partnership with that merchant may be risky.

The merchants are ranked by expected revenue in descending order, and standard deviation in ascending order.
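
The two ranking quantities described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's ranking code: it uses the sample standard deviation, treats the standard deviation as a tie-breaker after expected revenue, and gives merchants with a single transaction an infinite risk value because a standard deviation cannot be computed for them. The function name, input layout, and sample merchants are hypothetical.

```python
from statistics import stdev


def rank_merchants(merchants):
    """Rank merchants by expected revenue (descending), then by risk (ascending).

    merchants maps a merchant name to {"take_rate": float, "dollar_values": [...]}.
    """
    rows = []
    for name, info in merchants.items():
        values = info["dollar_values"]
        rate = info["take_rate"]
        # Expected revenue: take rate times total dollar value of the transactions.
        expected_revenue = rate * sum(values)
        # Risk: take rate times the sample standard deviation of dollar values.
        # Assumption: a single transaction gives no spread estimate, so risk is infinite.
        risk = rate * stdev(values) if len(values) > 1 else float("inf")
        rows.append((name, expected_revenue, risk))
    # Assumption: revenue is the primary key, risk breaks ties.
    return sorted(rows, key=lambda row: (-row[1], row[2]))


# Hypothetical merchants with non-fraudulent transaction values.
merchants = {
    "Merchant A": {"take_rate": 0.05, "dollar_values": [100, 100, 100]},
    "Merchant B": {"take_rate": 0.05, "dollar_values": [50, 150, 100]},
    "Merchant C": {"take_rate": 0.03, "dollar_values": [400, 600]},
}

ranking = rank_merchants(merchants)
ranked_names = [name for name, _, _ in ranking]
print(ranked_names)
```

Merchants A and B have the same expected revenue here, so the steadier Merchant A is placed ahead of Merchant B. If the firm wanted to weight risk more heavily, the two quantities could instead be combined into a single score.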

This ranking method is chosen because the main goal of the BNPL firm would be to maximise its profits while reducing risk. This can be done by choosing the highest ranking merchants.

The ranking system gives the following results:

## Additional Insights

There are some notable merchants that might be worth considering.

In [5]:
# number of transactions for each merchant
numtransaction = sdf.groupBy("merchant_name").count().toPandas()

# total dollar value for each merchant
dollarvalue = sdf.groupBy("merchant_name").sum("dollar_value").toPandas()

# number of different customers for each merchant
numconsumer = sdf.groupBy("merchant_name").agg(F.countDistinct("consumer_id")).toPandas()

# mean dollar value of transaction for each merchant
meandollarvalue = sdf.groupBy("merchant_name").mean("dollar_value").toPandas()

# mean merchant fraud probability of each merchant
meanfraud = sdf.groupBy("merchant_name").mean("merchant_fraud_probability").toPandas()
meanfraud["avg(merchant_fraud_probability)"] = meanfraud["avg(merchant_fraud_probability)"].fillna(0)

# join all dataframes
merchantdf = numtransaction.merge(dollarvalue, on="merchant_name")
merchantdf = merchantdf.merge(numconsumer, on="merchant_name")
merchantdf = merchantdf.merge(meandollarvalue, on="merchant_name")
merchantdf = merchantdf.merge(meanfraud, on="merchant_name")

                                                                                

1. There are some merchants with only a few transactions

The firm might want to exclude these merchants due to lack of information.
The table below shows merchants with only one transaction.

In [6]:
merchantdf.loc[merchantdf["count"] == 1][["merchant_name", "count"]]

| | merchant_name | count |
|---|---|---|
| 316 | Felis Ltd | 1 |
| 3923 | Iaculis Quis LLC | 1 |
| 3982 | Elit Dictum Eu Foundation | 1 |
| 4005 | Phasellus LLP | 1 |
| 4016 | Sem Corporation | 1 |
| 4018 | Curae Foundation | 1 |


2. There are some merchants with a very small consumer base

No matter how high the dollar value is, this is not ideal. Having only one or two customers suggests that the merchant is not doing well. Therefore, the firm might want to exclude these merchants from consideration. The table below shows merchants with only one customer.

In [7]:
merchantdf.loc[merchantdf["count(consumer_id)"] == 1][["merchant_name", "count(consumer_id)"]]

| | merchant_name | count(consumer_id) |
|---|---|---|
| 316 | Felis Ltd | 1 |
| 3923 | Iaculis Quis LLC | 1 |
| 3982 | Elit Dictum Eu Foundation | 1 |
| 4005 | Phasellus LLP | 1 |
| 4016 | Sem Corporation | 1 |
| 4018 | Curae Foundation | 1 |


3. There are some merchants with a high mean merchant fraud probability

The firm might want to remove these merchants from the ranking entirely, rather than removing only their fraudulent transactions, since such merchants are deemed untrustworthy. The table below shows the 5 merchants with the highest mean merchant fraud probability.

In [8]:
merchantdf.sort_values(by = "avg(merchant_fraud_probability)", inplace = True, ascending = False)
merchantdf.head(5)[["merchant_name", "avg(merchant_fraud_probability)"]]

| | merchant_name | avg(merchant_fraud_probability) |
|---|---|---|
| 218 | Ut Industries | 0.73 |
| 3127 | Mi Eleifend Egestas LLP | 0.71 |
| 1804 | Nullam Enim Sed Incorporated | 0.69 |
| 1412 | Tincidunt Nibh LLP | 0.68 |
| 403 | Nec Limited | 0.68 |


In conclusion, the firm should consider whether the merchants listed above should be excluded from consideration. The decision will depend on whether the firm is willing to take on riskier partnerships in exchange for the possibility of higher revenue.

## Conclusion and Recommendations

The firm should use the ranking of merchants as a guide when selecting merchants to partner with in the BNPL program. If the firm wants partners from a varied set of industries, it can use the rankings within the 5 categories to identify the best subset of merchants from each category.
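
One way to pick merchants from every category is to walk the ranked list once and keep the first `n` merchants seen in each category. The sketch below assumes the ranking is already available as an ordered list of (merchant, category) pairs. The function name and the sample merchants are hypothetical, and the category labels are borrowed from the table earlier in this notebook.

```python
from collections import defaultdict


def top_n_per_category(ranked, n):
    """Keep the first n merchants of each category, preserving rank order.

    ranked is a list of (merchant_name, category) pairs, best merchant first.
    """
    picks = defaultdict(list)
    for merchant, category in ranked:
        if len(picks[category]) < n:
            picks[category].append(merchant)
    return dict(picks)


# Hypothetical overall ranking, best merchant first.
ranked = [
    ("Merchant A", "retail_and_wholesale_trade"),
    ("Merchant B", "retail_and_wholesale_trade"),
    ("Merchant C", "arts_and_recreation"),
    ("Merchant D", "retail_and_wholesale_trade"),
    ("Merchant E", "arts_and_recreation"),
    ("Merchant F", "arts_and_recreation"),
]

selection = top_n_per_category(ranked, n=2)
print(selection)
```

Choosing the same `n` for every category is itself a simplification. The firm could instead set `n` per category, for example in proportion to the number of merchants in each.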