# Summary notebook

Overall, our project has 6 stages: 
1. Data downloading, cleaning and merging
2. Imputation of transaction fraud probability using a predictive model
3. Building a predictive model of monthly revenue
4. Calculating consumer retention
5. Identifying 5 market segments
6. Ranking merchants

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import shapefile as shp
import pandas as pd
import numpy as np
import os

In [2]:
# Create a spark session (which will run spark jobs)
spark = (
    SparkSession.builder.appName("Summary Notebook")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.driver.memory", "9g") 
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .getOrCreate()
)
spark.sparkContext.setLogLevel("OFF")



## 1. Data Preprocessing

In this stage, we first downloaded the datasets from Canvas and the ABS website. We then checked for duplicates, missing values and erroneous values, and carried out a simple outlier analysis. Finally, we merged the datasets together.

To summarize, we found that:
- The merchant, consumer and transaction datasets were already quite clean, so we only preprocessed the merchant `tags` column and renamed columns for consistency.
- For the ABS Census data, we converted each categorical value of the type-of-measure column into its own column. We also created a new feature from the provided measures: weekly personal disposable income (`weekly_personal_disposable`).
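
The reshaping of the ABS measures can be illustrated with a small sketch. It uses plain Python rather than Spark so that it runs on its own, and the long-format layout and field names (`sa2_code`, `measure`, `value`) are assumptions for illustration, not the project's actual raw schema. The sample values are taken from the cleaned output shown in this section.

```python
from collections import defaultdict


def pivot_measures(rows):
    """Turn long-format (sa2_code, measure, value) rows into one record per SA2.

    The three-field layout is a hypothetical stand-in for the raw ABS table.
    """
    wide = defaultdict(dict)
    for sa2_code, measure, value in rows:
        # Each distinct measure name becomes its own column (dict key).
        wide[sa2_code][measure] = value
    return dict(wide)


long_rows = [
    (701011002, "median_age", 33.0),
    (701011002, "avg_household_size", 2.0),
    (701021021, "median_age", 30.0),
    (701021021, "avg_household_size", 3.3),
]
wide_rows = pivot_measures(long_rows)
print(wide_rows[701011002])
```

On a Spark DataFrame, the same reshaping is typically expressed with `groupBy(...).pivot(...).agg(...)`.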

In [22]:
# Merchant after cleaning
merchant = spark.read.parquet("../data/curated/part_1/tbl_merchants.parquet")
merchant.limit(5)

name,merchant_abn,goods,revenue_level,take_rate
Felis Limited,10023283211,"furniture, home f...",e,0.18
Arcu Ac Orci Corp...,10142254217,"cable, satellite,...",b,4.22
Nunc Sed Company,10165489824,"jewelry, watch, c...",b,4.4
Ultricies Digniss...,10187291046,"watch, clock, and...",b,3.29
Enim Condimentum PC,10192359162,music shops - mus...,a,6.33


In [4]:
# External dataset after cleaning
abs_df = spark.read.parquet("../data/curated/sa2_dataset/abs_medians.parquet")
abs_df.limit(5)

postcode,sa2_code,sa2_name,median_age,median_total_family_income,median_total_household_income,avg_household_size,weekly_personal_disposable
800,701011002,Darwin City,33.0,2403.0,2151.0,2.0,836.7727387474223
810,701021021,Lyons (NT),30.0,2981.0,2965.0,3.3,980.8810150716812
812,701021019,Karama,35.0,2021.0,1783.0,2.9,547.7236642363225
820,701011008,Stuart Park,34.0,2682.0,2278.0,2.3,890.3954228667557
822,702041063,East Arnhem,28.0,809.0,1552.0,5.1,226.95360332326135


In [None]:
# Final merged data
transactions = spark.read.parquet('../data/curated/all_details/')
transactions.limit(5)

## 2. Impute missing transaction fraud probability

After merging all the datasets, we found many missing values, because not every transaction was provided with a merchant or consumer fraud probability. We decided to impute them using a Random Forest Regression model.

What we did:
- Correlation analysis beforehand
- Feature selection and engineering, e.g. on the `tags` column from the merchant dataset
- Model evaluation
- Prediction of the missing fraud probabilities

In [6]:
# Predicted consumer fraud probability
consumer_fraud = spark.read.parquet('../data/curated/predicted_consumer_fraud')
consumer_fraud.limit(5)

order_id,consumer_fraud
6a84c3cf-612a-457...,13.772944558787202
a1ff2d13-c888-469...,13.772944558787202
ccdb41fa-a5ab-472...,13.772944558787202
4fbc20d0-1e21-4b5...,13.772944558787202
c7acb95f-2cf8-4ae...,13.772944558787202


In [7]:
# Predicted merchant fraud probability
merchant_fraud = spark.read.parquet('../data/curated/predicted_merchant_fraud')
merchant_fraud.limit(5)

order_id,merchant_fraud
256a4082-d42a-42c...,51.72532419078493
b11ffd53-6c62-48b...,51.72532419078493
dea9ccfc-5b3b-477...,30.53800388816566
9811400c-16ea-43f...,42.71433130544876
257b8278-0d36-4d7...,51.72532419078493


## 3. Prediction of future monthly revenue

- To derive each merchant's revenue growth rate, we predicted revenues for the year following the observed data.
- Separate models were run for different revenue levels.

In [8]:
future_revenue = spark.read.parquet('../data/curated/predicted_monthly_revenue.parquet')
future_revenue.limit(5)

merchant_abn,month_year,name,goods,revenue_level,take_rate,total_revenue,count(dollar_value),log_ratio,unscaled_earning,average_revenue,month,month_sin,month_cos,month_since_first_transaction,features,prediction
12516851436,2023-01,Mollis Corp.,"watch, clock, and...",a,6.71,32325.48434973805,210,5.036503656560932,2169.0399998674225,153.93087785589546,1,0.4997701026431024,0.866158094405463,24,"[24.0,32325.48434...",12082.179656019403
15613631617,2023-01,Ante Industries,motor vehicle sup...,e,0.35,543030.5313328261,1785,5.717747130253923,1900.6068596648915,304.2187850604068,1,0.4997701026431024,0.866158094405463,24,"[24.0,543030.5313...",35290.284627350215
19839532017,2023-01,Pellentesque Habi...,"cable, satellite,...",b,4.94,113982.0,726,5.056245805348308,5630.710800000001,157.0,1,0.4997701026431024,0.866158094405463,24,"[24.0,113982.0,72...",15760.503299179498
34440496342,2023-01,Mauris Nulla Inte...,"opticians, optica...",c,2.85,19425.358828709985,215,4.503696619458033,553.6227266182345,90.35050618004644,1,0.4997701026431024,0.866158094405463,24,"[24.0,19425.35882...",11490.46071449362
35344855546,2023-01,Quis Tristique Ac...,"watch, clock, and...",c,2.92,134737.25046268434,1522,4.483301329640816,3934.327713510383,88.52644577048906,1,0.4997701026431024,0.866158094405463,24,"[24.0,134737.2504...",16613.871636060874
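
The `month_sin` and `month_cos` columns in the output above suggest that the month was encoded cyclically, so that December and January sit next to each other. The following is a minimal sketch of that idea, assuming the standard sine/cosine formulation; the function name `encode_month` is hypothetical and this is not necessarily the exact feature code used in the project.

```python
import math


def encode_month(month, period=12):
    """Map a month number (1-12) to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * month / period
    return math.sin(angle), math.cos(angle)


jan_sin, jan_cos = encode_month(1)
dec_sin, dec_cos = encode_month(12)
print(jan_sin, jan_cos)
print(dec_sin, dec_cos)
```

The January values in the table (about 0.49977 and 0.86616) differ slightly from the exact sin(π/6) = 0.5 and cos(π/6) ≈ 0.86603. This would be consistent with π having been approximated as 3.14 in the original feature code, although that is an inference from the numbers rather than something the notebook states.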


## 4. Calculation of consumer retention

- We assumed that a consumer with more than one transaction with the same merchant was a returning customer of that merchant.
- For each merchant, the returning customer proportion is the number of returning customers divided by its total number of customers.
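
The two rules above can be sketched in plain Python as follows. The input shape (one `(merchant_abn, consumer_id)` pair per transaction), the function name and the sample identifiers are assumptions made for illustration; the project itself computed this on Spark DataFrames.

```python
from collections import Counter, defaultdict


def returning_customer_proportion(transactions):
    """transactions: iterable of (merchant_abn, consumer_id), one per order.

    Returns {merchant_abn: (number_of_customers, returning_proportion)}.
    """
    counts = defaultdict(Counter)
    for merchant_abn, consumer_id in transactions:
        counts[merchant_abn][consumer_id] += 1

    result = {}
    for merchant_abn, per_consumer in counts.items():
        total = len(per_consumer)
        # A returning customer has more than one transaction with this merchant.
        returning = sum(1 for n in per_consumer.values() if n > 1)
        result[merchant_abn] = (total, returning / total)
    return result


# Hypothetical sample: merchant "m1" has 4 customers, one of whom returned.
sample = [
    ("m1", "c1"), ("m1", "c1"), ("m1", "c2"), ("m1", "c3"), ("m1", "c4"),
    ("m2", "c1"), ("m2", "c2"),
]
retention = returning_customer_proportion(sample)
print(retention)
```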

In [9]:
consumer_retention = spark.read.parquet("../data/curated/customer_retention/")
consumer_retention.limit(5)

merchant_abn,number_of_customers,returning_customer_proportion
12516851436,210,0.0
15613631617,1714,0.0390898483080513
19839532017,714,0.0168067226890756
24406529929,3826,0.0865133298484056
28767881738,4,0.0


## 5. Identifying segments

Based on the types of goods listed in the merchant dataset, we decided on 5 market segments: Entertainment & Media, Technology, Beauty (or Fashion), Office & Home Supplies, Miscellaneous.

We used two distinct methods to assign merchants to segments:
1. Team members manually reviewed each merchant's type of goods and assigned it to one of the five segments.
2. We applied the K-means clustering algorithm.

In [10]:
# Result of manual assignment
merchant_segment = spark.read.parquet("../data/curated/merchant_segment.parquet")
merchant_segment.limit(5)

merchant_abn,segment
12516851436,beauty
15613631617,others
19839532017,technology
34440496342,others
35344855546,beauty


In [None]:
# Result of K-means clustering
merchant_segment = spark.read.parquet("../data/curated/clean_merchant_segmented.parquet")
merchant_segment.limit(5)

## 6. Calculation of final scores and Ranking

In this final stage, we averaged 5 metrics to obtain each merchant's final score, then selected the top 100 merchants:
- Consumer fraud probability
- Merchant fraud probability
- Merchant revenue
- Growth in Gross Earnings
- Consumer retention

Growth in Gross Earnings was calculated from merchant revenue.
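
The scoring step can be sketched roughly as below. The notebook only states that the five metrics were averaged; the min-max scaling and the inversion of the two fraud metrics (so that lower fraud scores higher) are assumptions added here to make the metrics comparable, and the function names and sample values are hypothetical.

```python
def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_merchants(metrics, lower_is_better=(), top_n=100):
    """metrics: {metric_name: {merchant_abn: value}} covering the same merchants.

    Assumption: each metric is min-max scaled, and metrics named in
    lower_is_better are inverted before averaging.
    """
    merchants = sorted(next(iter(metrics.values())))
    totals = {m: 0.0 for m in merchants}
    for name, by_merchant in metrics.items():
        scaled = min_max_scale([by_merchant[m] for m in merchants])
        for m, s in zip(merchants, scaled):
            totals[m] += (1.0 - s) if name in lower_is_better else s
    scores = {m: t / len(metrics) for m, t in totals.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical toy data for three merchants.
metrics = {
    "consumer_fraud": {"A": 10, "B": 20, "C": 30},
    "merchant_fraud": {"A": 30, "B": 40, "C": 50},
    "revenue": {"A": 100, "B": 300, "C": 200},
    "growth": {"A": 1, "B": 3, "C": 2},
    "retention": {"A": 0.0, "B": 0.25, "C": 0.5},
}
ranking = rank_merchants(
    metrics, lower_is_better=("consumer_fraud", "merchant_fraud"), top_n=2
)
print(ranking)
```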

In [11]:
top_100 = spark.read.parquet("../data/curated/merchant_ranking")
top_100.limit(5)

merchant_abn,final_score
10023283211,0.1900368901419965
10142254217,0.0902256232683956
10187291046,0.1344891278869491
10192359162,0.1070514216316028
10206519221,0.1252139319276203


# Limitations and Assumptions

1. External consumer data are generalised and aggregated over postal regions, which may not be accurate or granular enough.
2. Due to the limited data timeframe (2021-2022), the ranking results of this project may not apply well to future years, in which participating merchants may change and existing merchant data may be updated.
3. We used a simple Random Forest Regression model for imputation of missing transaction fraud probability, which may not be accurate.
4. Because of limited data, the revenue growth rate is generated from both existing and modelled future data, which may have compounded inaccuracies.
5. For simplicity, geographical data mapping assumed each postcode region mapped to exactly one SA2 region, which is not entirely correct.
6. Modelling assumed that the training data were informative of their targets (fraud probabilities, revenue), and that the dataset was large enough for effective training.
