# Summary of the Project

In [3]:
from pyspark.sql import SparkSession, functions as F
import pandas as pd

# Create a spark session
spark = (
    SparkSession.builder.appName("BNPL Project")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.driver.memory", "4g")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .getOrCreate()
)

## ETL Script

The ETL script carries out all preprocessing, category assignment, and incorporation of the external dataset. It outputs a single Parquet file, "process_data.parquet".

In [4]:
# Read output file of ETL script
sdf = spark.read.parquet("../data/curated/process_data.parquet/")

A sample instance of the combined dataset is shown below:

In [5]:
# show an instance of the dataset
sdf.show(1, vertical=True)

-RECORD 0----------------------------------------------------
 merchant_abn                         | 27093785141          
 consumer_id                          | 1195503              
 user_id                              | 1                    
 dollar_value                         | 366.23               
 order_id                             | a8514aae-18fb-454... 
 order_datetime                       | 2021-11-17           
 state                                | WA                   
 postcode                             | 6935                 
 gender                               | Female               
 merchant_name                        | Placerat Orci Ins... 
 tag                                  | stationery, offic... 
 revenue                              | c                    
 rate                                 | 2.73                 
 category                             | retail_and_wholes... 
 subcategory                          | others_retailing     
 merchan

## Categories of the merchants

The merchants are divided into 5 categories based on their tags:
1. Rental Hiring and Real Estate
2. Retail and Wholesale Trade
3. Agriculture
4. Arts and Recreation
5. Info, Media and Telecommunications

This allows merchants from different industries to be ranked separately.
The number of merchants and transactions for each category is shown below.

In [11]:
# calculate number of transactions per category
transactionpd = sdf.groupBy("category").count().toPandas()

# calculate number of merchants per category
merchantpd = sdf.groupBy("category").agg(F.countDistinct("merchant_abn")).toPandas()

# format and show table
df = transactionpd.merge(merchantpd, on="category")
df = df.rename(columns = {"count": "number of transactions", "count(merchant_abn)": "number of merchants"})
df

| | category | number of transactions | number of merchants |
|---|---|---|---|
| 0 | retail_and_wholesale_trade | 11406032 | 2961 |
| 1 | rental_hiring_and_real_estate | 36695 | 134 |
| 2 | arts_and_recreation | 21218 | 112 |
| 3 | others | 245753 | 164 |
| 4 | info_media_and_telecommunications | 1904458 | 655 |


## Outlier Analysis

Outlier analysis is done on the dollar value of each transaction.

To ensure that transactions of similar scale are compared, the largest merchant category, retail and wholesale trade, is further divided into 5 subcategories.
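
The notebook does not state which rule is used to flag outliers. As one possible illustration, the sketch below assumes a 1.5 × IQR fence on the dollar value, computed separately for each group so that only transactions of similar scale are compared. The function name `flag_outliers`, the group labels `sub_a` and `sub_b`, and the sample values are all hypothetical, and plain Python is used so the example runs without a Spark session.

```python
from collections import defaultdict
from statistics import quantiles


def flag_outliers(transactions, k=1.5):
    """Flag (group, dollar_value) pairs lying outside the k * IQR fence of their group.

    ASSUMPTION: the IQR rule and k=1.5 are illustrative choices,
    not the rule used in the original analysis.
    """
    # Collect the dollar values of each group (e.g. each subcategory).
    by_group = defaultdict(list)
    for group, value in transactions:
        by_group[group].append(value)

    # Compute a lower and upper fence per group.
    fences = {}
    for group, values in by_group.items():
        if len(values) < 2:
            # quantiles() needs at least two points; treat tiny groups as unfenced.
            fences[group] = (float("-inf"), float("inf"))
            continue
        q1, _, q3 = quantiles(values, n=4)
        iqr = q3 - q1
        fences[group] = (q1 - k * iqr, q3 + k * iqr)

    # True where the value falls outside its own group's fence.
    return [not (fences[g][0] <= v <= fences[g][1]) for g, v in transactions]


# Hypothetical sample: small-ticket group "sub_a" and large-ticket group "sub_b".
sample = [("sub_a", v) for v in (20, 22, 25, 27, 30, 32, 35, 38, 40, 5000)]
sample += [("sub_b", v) for v in (900, 950, 1000, 1050, 1100)]

flags = flag_outliers(sample)
outliers = [row for row, is_outlier in zip(sample, flags) if is_outlier]
print(outliers)
```

In this made-up sample, a transaction of around 1000 is unremarkable within `sub_b` but would fall far outside the fence of `sub_a`, which is the motivation for analysing groups of similar scale separately.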

## Fraud Data

## Ranking

After outlier analysis and detection of fraudulent transactions, the dataset is ready for ranking.

To rank the merchants, the expected revenue of each merchant is calculated. This is done by summing up the dollar value of all non-fraud transactions of a merchant, and then multiplying it by the merchant's take rate.

This ranking method is chosen because the main goal of the BNPL firm would be to maximise its profit, which it can do by partnering with the highest-ranking merchants.
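
The calculation described above can be sketched as follows. This is a minimal illustration rather than the notebook's actual ranking code: the function name `rank_merchants`, the `is_fraud` flag, and the sample transactions are hypothetical, and the take rate is assumed to be stored as a percentage (as the sample `rate` value of 2.73 suggests), so it is divided by 100.

```python
from collections import defaultdict


def rank_merchants(transactions):
    """Rank merchants by expected revenue, highest first.

    Each transaction is a dict with the keys merchant_abn, dollar_value,
    rate and is_fraud. ASSUMPTIONS: `is_fraud` is a hypothetical flag, and
    `rate` is taken to be a percentage.
    """
    totals = defaultdict(float)  # non-fraud dollar value per merchant
    rates = {}                   # take rate per merchant
    for t in transactions:
        rates[t["merchant_abn"]] = t["rate"]
        if not t["is_fraud"]:
            totals[t["merchant_abn"]] += t["dollar_value"]

    # expected revenue = total non-fraud dollar value * take rate
    revenue = {abn: totals[abn] * rates[abn] / 100 for abn in rates}
    return sorted(revenue.items(), key=lambda item: item[1], reverse=True)


# Hypothetical transactions for two merchants, "A" and "B".
transactions = [
    {"merchant_abn": "A", "dollar_value": 100.0, "rate": 2.0, "is_fraud": False},
    {"merchant_abn": "A", "dollar_value": 200.0, "rate": 2.0, "is_fraud": False},
    {"merchant_abn": "A", "dollar_value": 1000.0, "rate": 2.0, "is_fraud": True},
    {"merchant_abn": "B", "dollar_value": 100.0, "rate": 5.0, "is_fraud": False},
]

ranking = rank_merchants(transactions)
print(ranking)
```

Because merchants are meant to be ranked separately by industry, the same function could be applied to the transactions of one category at a time.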

The ranking system gives the following results:

## Additional Insights

There are some notable merchants that might be worth considering.

In [26]:
# number of transactions for each merchant
numtransaction = sdf.groupBy("merchant_name").count().toPandas()

# total dollar value for each merchant
dollarvalue = sdf.groupBy("merchant_name").sum("dollar_value").toPandas()

# number of different customers for each merchant
numconsumer = sdf.groupBy("merchant_name").agg(F.countDistinct("consumer_id")).toPandas()

# mean dollar value of transaction for each merchant
meandollarvalue = sdf.groupBy("merchant_name").mean("dollar_value").toPandas()

# mean merchant fraud probability of each merchant
meanfraud = sdf.groupBy("merchant_name").mean("merchant_fraud_probability").toPandas()
meanfraud["avg(merchant_fraud_probability)"] = meanfraud["avg(merchant_fraud_probability)"].fillna(0)

# join all dataframes
merchantdf = numtransaction.merge(dollarvalue, on="merchant_name")
merchantdf = merchantdf.merge(numconsumer, on="merchant_name")
merchantdf = merchantdf.merge(meandollarvalue, on="merchant_name")
merchantdf = merchantdf.merge(meanfraud, on="merchant_name")

1. There are some merchants with only a few transactions

The firm might want to exclude these merchants due to lack of information.
The table below shows merchants with only one transaction.

In [27]:
# merchants with exactly one transaction
merchantdf.loc[merchantdf["count"] == 1][["merchant_name", "count"]]

| | merchant_name | count |
|---|---|---|
| 3987 | Aenean Gravida Institute | 1 |
| 3993 | Elit Dictum Eu Foundation | 1 |
| 4006 | Consequat Foundation | 1 |
| 4007 | Phasellus LLP | 1 |
| 4016 | Curae Foundation | 1 |
| 4021 | Aliquam Eu Institute | 1 |
| 4024 | Lobortis Nisi Associates | 1 |


2. There are some merchants with a very small consumer base

No matter how high the dollar value is, this is not ideal. Having only one or two customers suggests that the merchant has very little reach, so the firm might want to exclude these merchants from consideration. The table below shows merchants with only one customer; they are the same merchants as those listed above.

In [28]:
# merchants with exactly one distinct consumer
merchantdf.loc[merchantdf["count(consumer_id)"] == 1][["merchant_name", "count(consumer_id)"]]

| | merchant_name | count(consumer_id) |
|---|---|---|
| 3987 | Aenean Gravida Institute | 1 |
| 3993 | Elit Dictum Eu Foundation | 1 |
| 4006 | Consequat Foundation | 1 |
| 4007 | Phasellus LLP | 1 |
| 4016 | Curae Foundation | 1 |
| 4021 | Aliquam Eu Institute | 1 |
| 4024 | Lobortis Nisi Associates | 1 |


3. There are some merchants with a high mean merchant fraud probability

The firm might want to remove these merchants from the ranking entirely, rather than removing only their fraudulent transactions, since such merchants may be untrustworthy. The table below shows the 5 merchants with the highest mean fraud probability.

In [29]:
# sort merchants by mean fraud probability, highest first, and show the top 5
merchantdf = merchantdf.sort_values(by="avg(merchant_fraud_probability)", ascending=False)
merchantdf.head(5)[["merchant_name", "avg(merchant_fraud_probability)"]]

| | merchant_name | avg(merchant_fraud_probability) |
|---|---|---|
| 1361 | Tempus Mauris Ltd | 0.91 |
| 911 | Ut Corporation | 0.9 |
| 91 | Duis At Inc. | 0.81 |
| 222 | Ut Industries | 0.73 |
| 3145 | Mi Eleifend Egestas LLP | 0.71 |


In conclusion, the firm should consider whether the merchants described above should be excluded from consideration. The decision will depend on whether the firm is willing to take on riskier partnerships in exchange for the possibility of higher revenue.
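
If the firm does decide to exclude such merchants, the three checks could be combined as in the sketch below. The thresholds (`min_transactions`, `min_consumers`, `max_fraud_probability`), the function name `exclusion_reasons`, and the sample merchants are hypothetical and would need to reflect the firm's own risk appetite. The dictionary keys mirror the column names of `merchantdf`, and the same conditions could equally be written as boolean masks on that DataFrame.

```python
def exclusion_reasons(merchant, min_transactions=2, min_consumers=2,
                      max_fraud_probability=0.7):
    """Return the reasons, if any, why a merchant might be excluded.

    ASSUMPTION: all three thresholds are illustrative defaults,
    not values taken from the original analysis.
    """
    reasons = []
    if merchant["count"] < min_transactions:
        reasons.append("few transactions")
    if merchant["count(consumer_id)"] < min_consumers:
        reasons.append("small consumer base")
    if merchant["avg(merchant_fraud_probability)"] > max_fraud_probability:
        reasons.append("high fraud probability")
    return reasons


# Hypothetical merchant summaries (not taken from the dataset).
merchants = [
    {"merchant_name": "Merchant A", "count": 1, "count(consumer_id)": 1,
     "avg(merchant_fraud_probability)": 0.0},
    {"merchant_name": "Merchant B", "count": 500, "count(consumer_id)": 420,
     "avg(merchant_fraud_probability)": 0.9},
    {"merchant_name": "Merchant C", "count": 800, "count(consumer_id)": 650,
     "avg(merchant_fraud_probability)": 0.1},
]

# Keep only the merchants for which no exclusion reason applies.
kept = [m["merchant_name"] for m in merchants if not exclusion_reasons(m)]
print(kept)
```

Returning the list of reasons, rather than a single yes/no flag, lets the firm see why each merchant was flagged and relax individual thresholds if it is willing to accept more risk.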