In [1]:
import sys
sys.path.append("../")
from scripts.preliminary_analysis import *

In [2]:
spark = (
    SparkSession.builder.appName("Preliminary Analysis")
    .config("spark.sql.repl.eagerEval.enabled", True)
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config("spark.driver.memory", "4g")
    .config("spark.executor.memory", "2g")
    .getOrCreate()
)

24/09/09 12:39:56 WARN Utils: Your hostname, DESKTOP-H6V94HM resolves to a loopback address: 127.0.1.1; using 192.168.0.204 instead (on interface wifi0)
24/09/09 12:39:56 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/09/09 12:39:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# Preliminary Analysis

In this notebook, we will conduct a brief analysis of the data we cleaned. First, let's check how many merchants were given a fraud probability (fp) on their transactions.

## Merchants

In [3]:
path = "../data/curated"

In [7]:
merchant = spark.read.parquet(f"{path}/merchant_info.parquet")
merchant_fp = spark.read.parquet(f"{path}/merchant_fraud_prob.parquet")

In [11]:
print(f'Total number of merchants: {merchant.select("merchant_abn").distinct().count()}')
print(f'Number of merchants with a fraud probability in transactions: {merchant_fp.select("merchant_abn").distinct().count()}')

Total number of merchants: 4026
Number of merchants with a fraud probability in transactions: 61


From the numbers above, only 61 of the 4026 merchants have a fraud probability, which is about 1.5%. Thus, we need to build a model that estimates a fraud probability for each merchant, as that will help us determine which transactions are valid.
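
As a quick check of that share, the ratio can be computed directly from the two counts printed above. This is a plain-Python sketch; the variable names are illustrative and not part of the notebook.

```python
# Counts copied from the output above
total_merchants = 4026
merchants_with_fp = 61

# Share of merchants that have at least one fraud probability record
share_pct = merchants_with_fp / total_merchants * 100
print(f"{share_pct:.2f}% of merchants have a fraud probability")
```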

For now, for each merchant that has a fraud probability on a transaction, let's see how many of their transactions have one and, where there is more than one, what the average probability is.

In [13]:
merchant_fp.groupBy("merchant_abn").agg(
    F.count(F.col("merchant_abn")).alias("num_transaction_with_prob"),
    F.avg("fraud_probability").alias("avg_prob")
)

merchant_abn,num_transaction_with_prob,avg_prob
99989036621,1,18.21089142894488
90568944804,3,30.72298492113958
29674997261,1,44.43787807900268
27093785141,3,28.88064813052203
19492220327,8,31.958306675667547
76968105359,1,68.27843632543912
97884414539,1,89.79919971536573
82999039227,1,94.1347004808891
83199298021,6,31.93490297074105
93292821052,1,66.58725735032715


Though the table shows only the first rows, we can see that some merchants have more than one transaction with a fraud probability. This might be helpful when deciding which merchants to onboard.

Below are the summary statistics of the merchant fraud probability.

In [18]:
merchant_fp.select(F.col("fraud_probability")).describe().limit(5)

summary,fraud_probability
count,114.0
mean,40.419334695018094
stddev,17.187744795432526
min,18.21089142894488
max,94.1347004808891


## Consumer

In [19]:
consumer_pf = spark.read.parquet(f"{path}/consumer_fraud_prob.parquet")

Summary statistics of consumer fraud probability.

In [20]:
consumer_pf.select(F.col("fraud_probability")).describe().limit(5)

summary,fraud_probability
count,34864.0
mean,15.12009064415455
stddev,9.94608484957805
min,8.287143531552802
max,99.24738020302328


## Transactions

Summary statistics of the transaction dollar values.

In [21]:
transactions = spark.read.parquet(f"{path}/transactions.parquet")

In [22]:
transactions.select(F.col("dollar_value")).describe().limit(5)

summary,dollar_value
count,12561377.0
mean,166.33982036554548
stddev,520.3624254515674
min,9.756658099412162e-08
max,105193.88578925544


We can see that the minimum dollar value is about $0.0000001, which is far less than 1 cent. It may be appropriate to consider such values invalid since the amount is unreasonable. However, this does not necessarily mean we will remove them, as they might be an indicator of fraudulent transactions. We will find out later, once we join the transactions with the consumer and merchant fraud probabilities.
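
To make the size of that minimum concrete, the sketch below prints it in fixed-point notation and compares it with one cent. The `ONE_CENT` threshold and the variable names are assumptions for illustration; the notebook itself does not define a cutoff.

```python
# Minimum dollar_value copied from the summary table above
min_dollar_value = 9.756658099412162e-08

# Hypothetical threshold: one cent, used here only for comparison
ONE_CENT = 0.01

# Fixed-point formatting avoids the scientific notation shown by describe()
formatted = f"{min_dollar_value:.10f}"
is_sub_cent = min_dollar_value < ONE_CENT

print(f"minimum dollar value: ${formatted}")
print(f"below one cent: {is_sub_cent}")
```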

For now, let's see how much each merchant makes in total, the average value of an order, and the total number of orders. We will also calculate the commission amount (take rate $\times$ total revenue) that the BNPL firm would earn if it partnered with the merchant.

In [38]:
merchant_info = spark.read.parquet(f"{path}/merchant_info.parquet")

merchant_sales_info = transactions.groupBy("merchant_abn").agg(
    F.sum("dollar_value").alias("total_revenue"),
    F.avg("dollar_value").alias("average_order_value"),
    F.count("dollar_value").alias("total_orders")
)

merchant_sales_info = merchant_sales_info.join(merchant_info, on="merchant_abn", how="inner")
# take_rate is stored as a percentage, so divide by 100 before applying it to revenue
merchant_sales_info = merchant_sales_info.withColumn(
    "commission_amount",
    F.round(F.col("take_rate") / 100 * F.col("total_revenue"), 2),
)
merchant_sales_info.orderBy(F.col("commission_amount"), ascending=False).limit(10)

                                                                                

merchant_abn,total_revenue,average_order_value,total_orders,name,category,revenue_level,take_rate,commission_amount
79827781481,8657277.096810075,2036.5271928511115,4251,Amet Risus Inc.,"furniture, home f...",a,6.82,590426.3
48534649627,8316735.67184678,141.7182529069912,58685,Dignissim Maecena...,"opticians, optica...",a,6.64,552231.25
32361057556,8339994.520798449,109.94943536575283,75853,Orci In Consequat...,"gift, card, novel...",a,6.61,551273.64
86578477987,8443178.696731722,34.9851605095457,241336,Leo In Consulting,"watch, clock, and...",a,6.43,542896.39
38700038932,8482176.65570551,1337.6717640286247,6341,Etiam Bibendum In...,tent and awning s...,a,6.31,535225.35
45629217853,7436925.452881987,36.84747288748941,201830,Lacus Consulting,"gift, card, novel...",a,6.98,519097.4
96680767841,8679874.166938096,315.1619101317344,27541,Ornare Limited,motor vehicle sup...,a,5.91,512980.56
21439773999,8337853.955271486,78.1253884343867,106724,Mauris Non Institute,"cable, satellite,...",a,6.1,508609.09
63123845164,7570160.924957567,751.380736968493,10075,Odio Phasellus In...,artist supply and...,a,6.59,498873.6
64403598239,7842635.605858917,78.11856889713447,100394,Lobortis Ultrices...,music shops - mus...,a,6.31,494870.31
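
As a sanity check on the commission formula, the first row of the table can be reproduced by hand. The sketch below uses plain Python with values copied from that row; the helper name `commission_amount` is hypothetical and not part of the notebook's scripts.

```python
def commission_amount(take_rate_pct: float, total_revenue: float) -> float:
    """Commission earned on a merchant, with the take rate given as a percentage."""
    return round(take_rate_pct / 100 * total_revenue, 2)

# Values copied from the first row (merchant_abn 79827781481)
commission = commission_amount(6.82, 8657277.096810075)
print(commission)
```

The result agrees with the `commission_amount` column for that merchant to the nearest cent.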


It's also worth looking at how much commission each revenue level brings in on average.

In [51]:
avg_revenue_level = merchant_sales_info.groupBy("revenue_level").agg(
    F.count(F.col("commission_amount")).alias("num_merchant"),
    F.sum(F.col("total_orders")).alias("total_orders"),
    F.round(F.avg(F.col("commission_amount")),2).alias("avg_commission_amount"),
    F.round(F.sum(F.col("commission_amount")),2).alias("total_commission_amount"),
)
# Format the commission columns as fixed two-decimal strings for display
avg_revenue_level.withColumns(
    {
        "avg_commission_amount": F.format_number("avg_commission_amount", 2),
        "total_commission_amount": F.format_number("total_commission_amount", 2),
    }
)

                                                                                

revenue_level,num_merchant,total_orders,avg_commission_amount,total_commission_amount
e,53,106218,1175.96,62325.93
d,98,121037,3368.06,330069.51
c,922,2941588,9738.67,8979050.0
b,1351,3470381,20762.98,28050783.41
a,1602,5408093,29894.01,47890212.0
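
The revenue-level table can be cross-checked against the earlier outputs with a little arithmetic. The sketch below, in plain Python with figures copied from the table, sums the merchant and order counts and computes each level's share of total commission. The variable names are illustrative only, and the shares are approximate because the totals are taken as displayed.

```python
# (num_merchant, total_orders, total_commission_amount) per revenue level,
# copied from the table above
levels = {
    "a": (1602, 5408093, 47890212.00),
    "b": (1351, 3470381, 28050783.41),
    "c": (922, 2941588, 8979050.00),
    "d": (98, 121037, 330069.51),
    "e": (53, 106218, 62325.93),
}

num_merchants = sum(n for n, _, _ in levels.values())
total_orders = sum(o for _, o, _ in levels.values())
total_commission = sum(c for _, _, c in levels.values())

print(f"merchants: {num_merchants}, orders: {total_orders}")
for level, (_, _, commission) in levels.items():
    print(f"{level}: {commission / total_commission:.1%} of total commission")
```

The merchant counts add up to the 4026 merchants reported earlier. The order total (12,047,317) is lower than the 12,561,377 transactions counted earlier, which would be consistent with the inner join dropping transactions whose `merchant_abn` is absent from `merchant_info.parquet`, though that is not verified here.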
