## Buy Now, Pay Later Project

The Buy Now, Pay Later (BNPL) firm has begun offering a new “Pay in 5 Installments” feature and plans to onboard 100 merchants every year. This project covers the following tasks:


- Overview of consumer and transaction data
- Analysis to find the 100 best merchants
- Recommendations for BNPL


The datasets provided for this project are:
- Transaction Dataset
- Consumer Dataset
- Merchant Dataset

External datasets used to add insight to the consumer analysis:
- [Australian postcode](https://www.matthewproctor.com/australian_postcodes)
- Income by SA2 Districts (ABS)
- SA2 shapefile (ABS)

In [1]:
from pyspark.sql import SparkSession

# Create a spark session (which will run spark jobs)
spark = (
    SparkSession.builder.appName("ADS project 2")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .getOrCreate()
)


### Dataset Overview

In [14]:
transaction_sdf = spark.read.parquet('../data/tables/transactions_*/*')
print(f"Transaction dataset includes {transaction_sdf.count()} transaction records.")
print("Features included are: ")
transaction_sdf.printSchema()

Transaction dataset includes 14195505 transaction records.
Features included are: 
root
 |-- user_id: long (nullable = true)
 |-- merchant_abn: long (nullable = true)
 |-- dollar_value: double (nullable = true)
 |-- order_id: string (nullable = true)



In [15]:
consumer_sdf = spark.read.option("delimiter", "|").csv('../data/tables/tbl_consumer.csv', inferSchema=True, header=True)
print(f"Consumer dataset includes {consumer_sdf.count()} consumer records.")
print("Features included are: ")
consumer_sdf.printSchema()

Consumer dataset includes 499999 consumer records.
Features included are: 
root
 |-- name: string (nullable = true)
 |-- address: string (nullable = true)
 |-- state: string (nullable = true)
 |-- postcode: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- consumer_id: integer (nullable = true)



In [16]:
merchant_sdf = spark.read.csv("../data/curated/merchant.csv", inferSchema=True, header=True)
print(f"Merchant dataset includes {merchant_sdf.count()} merchant records.")
print("Features included are: ")
merchant_sdf.printSchema()

Merchant dataset includes 4026 merchant records.
Features included are: 
root
 |-- merchant_abn: long (nullable = true)
 |-- name: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- revenue_level: string (nullable = true)
 |-- take_rate: double (nullable = true)



#### External Datasets Overview

1. **Australian Postcode**  
Used to map each postcode to its SA2 code for later geospatial plotting.

2. **Income by SA2 Districts**  
Used to analyse the purchasing power of consumers in different regions, which may correlate with the final assessment of the merchants.

In [20]:
postcode_SA2_sdf = spark.read.csv("../data/curated/processed_postcode.csv", inferSchema=True, header=True)
print("Features included are: ")
postcode_SA2_sdf.printSchema()

Features included are: 
root
 |-- postcode: integer (nullable = true)
 |-- SA2_code: integer (nullable = true)



In [18]:
income_sdf = spark.read.csv("../data/curated/processed_income.csv", inferSchema=True, header=True)
print("Features included are: ")
income_sdf.printSchema()


Features included are: 
root
 |-- SA2_code: string (nullable = true)
 |-- mean_total_income: integer (nullable = true)
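
Note that the two schemas above report `SA2_code` with different types: integer in the postcode table and string in the income table. The snippet below is a minimal, dependency-free sketch of one way to line the keys up before linking consumers to mean income. It uses plain Python dictionaries rather than Spark, and the rows, SA2 codes, income values, and the helper `attach_income` are all hypothetical, not taken from the project data. Casting the integer code to a string, rather than the reverse, is a cautious choice in case some income codes are not purely numeric, which is an assumption.

```python
# Hypothetical rows shaped like the tables above; all values are made up.
consumers = [
    {"consumer_id": 1, "postcode": 3000},
    {"consumer_id": 2, "postcode": 2000},
    {"consumer_id": 3, "postcode": 9999},  # postcode with no SA2 match
]
postcode_to_sa2 = {3000: 111111111, 2000: 222222222}      # SA2_code as int
income_by_sa2 = {"111111111": 60000, "222222222": 75000}  # SA2_code as str


def attach_income(consumers, postcode_to_sa2, income_by_sa2):
    """Left-join consumers to mean income via postcode -> SA2_code."""
    enriched = []
    for row in consumers:
        sa2 = postcode_to_sa2.get(row["postcode"])
        # Cast the integer code to a string so it matches the income keys.
        key = None if sa2 is None else str(sa2)
        enriched.append(
            {
                **row,
                "SA2_code": key,
                # Unmatched consumers keep a None income instead of being dropped.
                "mean_total_income": income_by_sa2.get(key),
            }
        )
    return enriched


enriched = attach_income(consumers, postcode_to_sa2, income_by_sa2)
```

Keeping unmatched consumers with a `None` income mirrors a left join, so the number of consumers does not silently shrink when a postcode is missing from the lookup.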



### Geospatial Visualisation

We inspect how each of the following three features varies with location:
- Mean total income
- Number of Consumers
- Number of Transactions
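
The two count features can be derived by grouping records by region. The snippet below is a rough sketch in plain Python. It assumes every consumer and transaction row has already been tagged with an `SA2_code`, a step this notebook does not show, and the rows and the helper `count_by_region` are hypothetical.

```python
from collections import Counter

# Hypothetical rows; SA2 codes and identifiers are made up for illustration.
consumers = [
    {"consumer_id": 1, "SA2_code": "111111111"},
    {"consumer_id": 2, "SA2_code": "111111111"},
    {"consumer_id": 3, "SA2_code": "222222222"},
]
transactions = [
    {"order_id": "a", "SA2_code": "111111111"},
    {"order_id": "b", "SA2_code": "222222222"},
    {"order_id": "c", "SA2_code": "222222222"},
    {"order_id": "d", "SA2_code": None},  # region unknown, skipped below
]


def count_by_region(rows, key="SA2_code"):
    """Count rows per region, ignoring rows without a region code."""
    return Counter(row[key] for row in rows if row.get(key) is not None)


num_consumers = count_by_region(consumers)
num_transactions = count_by_region(transactions)
```

Each resulting counter maps a region code to a count, which is the shape needed to colour regions on a map.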

*(Analysis to be added here.)*


<img src="../plots/mean%20total%20income%20geo.png" width="500"/> 
<img src="../plots/large%20mean%20total%20income%20geo.png" width="500"/>


<img src="../plots/num%20consumers.png" width="500"/>
<img src="../plots/num%20transactions.png" width="500"/>