# *Summary*

## Introduction
This notebook summarizes the process of building this ranking system. The code consists of three parts:
1. Analysis (to identify a suitable ETL process and ranking approach)
2. ETL scripts (the downloading and preprocessing steps)
3. Ranking (performing the actual ranking)

### Analysis
The analysis notebooks are placed in notebooks/analysis. The related notebooks are:
1. api.ipynb (shows how we combined the given and the external datasets)
2. external.ipynb (explores the external dataset, such as its distribution)
3. external2.ipynb (an older version of external.ipynb)
4. fraud.ipynb (analysis of the special feature fraud, from which we proposed a way to handle this feature)
5. outier.ipynb (analysis of the outliers in the dataset, trying possible ways to remove the data we don't want)
6. segment.ipynb (decides which segments are worth ranking)

### ETL
1. read_data.ipynb (the notebook and earlier version of read_data.py)
2. download.py (first part of ETL: downloads the required external dataset)
3. unzip.py (second part of ETL: unzips the downloaded dataset)
4. read_data.py (final part of ETL: produces the actual dataset used for ranking)

### Ranking
1. rank_model_explain.md (explains the principle of our ranking model)
2. rank_algorithm.ipynb (the notebook version of the ranking, with no action taken on the fraud feature)
3. rank_algorithm_no_fraud.ipynb (the notebook version of the ranking, with fraud transactions removed for each merchant)
4. ranking.py (combines the rankings with and without fraud into one Python program)

## Assumptions
A few assumptions were made for this ranking system:
- There is no online purchasing, so each transaction record can be linked to an actual customer
- A merchant's business area is represented by the combination of its top 5 areas by number of orders

## Data preprocessing & analysis
The data preprocessing and analysis can be subdivided into the following areas:
 1. Download and extract data
 2. Generalized ETL
 3. Joining datasets
 4. External datasets
 5. Visualization and outliers

### NO.1 Download, extract and process data
 - First, we read the given datasets (i.e. `transactions`, `consumers`, `merchants`) and checked the data sizes and data types. We then joined the datasets using left outer joins and dropped the repeated columns.
 - We resolved `tags` by saving the merchant tags into separate columns `field`, `revenue_level` and `take_rate`, where `take_rate` is a float. All strings in `field` and `revenue_level` were transformed to lowercase.
 - We then stored the curated dataframes in both `csv` and `parquet` form (`full_data`).

### NO.2 External datasets
 - We selected SA2 (Statistical Area Level 2) data, both `income` and `population`, and found a way to link SA2 codes to postcodes
 - After joining these external data, removing rows with null values would lose a significant share of the data, nearly `20%`
 - After plotting the distributions, we decided to fill the null values (missing data) with the median to mitigate this issue
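
The median fill described above can be sketched in plain Python as follows. This is a minimal illustration only: the column name `median_income` and the sample rows are hypothetical, and the project's own implementation lives in the analysis notebooks and works on Spark dataframes.

```python
from statistics import median


def fill_with_median(rows, column):
    """Return copies of the rows with missing values in `column`
    replaced by the median of the observed (non-null) values."""
    observed = [row[column] for row in rows if row[column] is not None]
    if not observed:
        # Nothing to compute a median from; return unchanged copies.
        return [dict(row) for row in rows]
    fill_value = median(observed)
    return [
        {**row, column: fill_value if row[column] is None else row[column]}
        for row in rows
    ]


# Hypothetical sample: one postcode has no matching SA2 income value.
rows = [
    {"postcode": "3000", "median_income": 900.0},
    {"postcode": "3052", "median_income": None},
    {"postcode": "3121", "median_income": 1100.0},
    {"postcode": "3182", "median_income": 1000.0},
]
filled = fill_with_median(rows, "median_income")
```

The median is less sensitive to extreme values than the mean, which is one reason to prefer it when the plotted distribution is skewed.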

### NO.3 Outliers
- We check for and remove null values
- We remove unhelpful columns (e.g. `address`, `gender`)
- We check for invalid data
 - e.g. `user_id`, `consumer_id` and `postcode` need to be greater than `0`
- We check feature values (e.g. `dollar_value` must be greater than 0, `postcode` must be four digits long)


### NO.4 Generalized ETL script

### PySpark initialization and reading data

In [18]:
import argparse
import os
import numpy as np
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window
from pyspark.sql import SparkSession
import builtins
from cmath import nan


In [19]:
# Create a Spark session (which will run Spark jobs); SparkSession is already imported in the cell above
spark = (
    SparkSession.builder.appName("MAST30034 ass2 BNPL group 28")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .getOrCreate()
)

In [20]:
merchants = spark.read.parquet('../data/tables/tbl_merchants.parquet')
consumers = spark.read.parquet('../data/tables/consumer_user_details.parquet')
transactions = spark.read.parquet('../data/tables/transactions_20210228_20210827_snapshot')
consumers_csv = spark.read.options(header='True', inferSchema='True', delimiter='|').csv('../data/tables/tbl_consumer.csv')

### Joining data set

In [22]:
## left outer join transaction data with consumer user details by user_id
new_transaction = transactions.join(consumers, transactions.user_id == consumers.user_id, "leftouter").drop(consumers.user_id)
## left outer join merchant data by merchant_abn
new_transaction = new_transaction.join(merchants, new_transaction.merchant_abn == merchants.merchant_abn, "leftouter").drop(merchants.merchant_abn)
## left outer join consumer data (read from the csv file) by consumer_id
new_transaction = new_transaction.join(consumers_csv, new_transaction.consumer_id == consumers_csv.consumer_id, "leftouter").drop(consumers_csv.consumer_id)
new_transaction.limit(5)

user_id,merchant_abn,dollar_value,order_id,order_datetime,consumer_id,name,tags,name.1,address,state,postcode,gender
5630,60956456424,145.26081329000152,1e14adeb-8e13-44f...,2021-08-21,28242,Ultricies Digniss...,"([gift, card, Nov...",Philip Crawford,7487 Serrano Gard...,NT,841,Undisclosed
5630,48534649627,120.25889985200416,08476339-f383-4ab...,2021-08-15,28242,Dignissim Maecena...,"[[opticians, oPti...",Philip Crawford,7487 Serrano Gard...,NT,841,Undisclosed
5630,60956456424,135.5412540082104,aacfd47a-438b-47f...,2021-08-15,28242,Ultricies Digniss...,"([gift, card, Nov...",Philip Crawford,7487 Serrano Gard...,NT,841,Undisclosed
5630,89932674734,95.37693966478514,6d5790c9-0eef-453...,2021-08-16,28242,Nulla Vulputate C...,((aRtist supply a...,Philip Crawford,7487 Serrano Gard...,NT,841,Undisclosed
5630,14089706307,440.1209771148284,43d1361a-1101-41a...,2021-08-16,28242,Donec Institute,[(computer progra...,Philip Crawford,7487 Serrano Gard...,NT,841,Undisclosed


### Save tags into columns `field`, `revenue_level`, `take_rate`

In [23]:
## save merchant tags to different columns "field", "revenue_level" and "take_rate", while "take_rate" is float type
## transform all strings in "field" and "revenue_level" to lowercase
## raw strings keep the regex escapes intact; the decimal point in take_rate is matched literally
tag_sep = r"\], \[|\), \("
new_transaction = (
    new_transaction.withColumn('tags', expr("substring(tags, 3, length(tags)-4)"))
    .withColumn('field', split(col("tags"), tag_sep).getItem(0))
    .withColumn('revenue_level', split(col("tags"), tag_sep).getItem(1))
    .withColumn('take_rate', split(col("tags"), tag_sep).getItem(2))
    .withColumn('take_rate', regexp_extract(col("take_rate"), r'(\d+)\.(\d+)', 0))
    .withColumn("take_rate", col('take_rate').cast(FloatType()))
    .withColumn('field', lower(col('field')))
    .withColumn('revenue_level', lower(col('revenue_level')))
    .drop("tags")
)

In [24]:
new_transaction.limit(5)

user_id,merchant_abn,dollar_value,order_id,order_datetime,consumer_id,name,name.1,address,state,postcode,gender,field,revenue_level,take_rate
5630,60956456424,145.26081329000152,1e14adeb-8e13-44f...,2021-08-21,28242,Ultricies Digniss...,Philip Crawford,7487 Serrano Gard...,NT,841,Undisclosed,"gift, card, novel...",b,4.69
5630,48534649627,120.25889985200416,08476339-f383-4ab...,2021-08-15,28242,Dignissim Maecena...,Philip Crawford,7487 Serrano Gard...,NT,841,Undisclosed,"opticians, optica...",a,6.64
5630,60956456424,135.5412540082104,aacfd47a-438b-47f...,2021-08-15,28242,Ultricies Digniss...,Philip Crawford,7487 Serrano Gard...,NT,841,Undisclosed,"gift, card, novel...",b,4.69
5630,89932674734,95.37693966478514,6d5790c9-0eef-453...,2021-08-16,28242,Nulla Vulputate C...,Philip Crawford,7487 Serrano Gard...,NT,841,Undisclosed,artist supply and...,c,1.67
5630,14089706307,440.1209771148284,43d1361a-1101-41a...,2021-08-16,28242,Donec Institute,Philip Crawford,7487 Serrano Gard...,NT,841,Undisclosed,computer programm...,b,3.33
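
To make the tag handling easier to follow outside Spark, here is a small plain-Python sketch of the same steps on a single string: strip the two outer bracket characters on each side, split on `], [` or `), (`, lowercase the first two parts, and extract the decimal number from the third. The helper name `parse_tags` and the sample tag strings are made up for illustration (the real tags are truncated in the output above).

```python
import re

# Same separator pattern as in the Spark cell: "], [" or "), (".
TAG_SEPARATOR = re.compile(r"\], \[|\), \(")
DECIMAL = re.compile(r"(\d+)\.(\d+)")


def parse_tags(tags):
    """Split a raw merchant tag string into (field, revenue_level, take_rate)."""
    inner = tags[2:-2]  # drop the two leading and two trailing brackets
    field, revenue_level, take_rate_text = TAG_SEPARATOR.split(inner)[:3]
    match = DECIMAL.search(take_rate_text)
    take_rate = float(match.group(0)) if match else None
    return field.lower(), revenue_level.lower(), take_rate


# Hypothetical examples in the two bracket styles seen in the data.
round_style = parse_tags("((gift, card, Novelty), (b), (take rate: 4.69))")
square_style = parse_tags("[[opticians, oPtical goods], [a], [take rate: 6.64]]")
```

Like the Spark version, this sketch assumes each tag string uses a consistent separator between its three parts; a string mixing `], (` would not be split.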


### Outlier analysis
- Drop columns which are not helpful
- Remove null values
- Check feature values (dollar value, order date, postcode length)


In [None]:
## drop unhelpful columns
cols = ['address','gender', 'consumer_id', 'user_name', 'state']
new_transaction = new_transaction.drop(*cols)

## drop rows that have null values
new_transaction = new_transaction.dropna()

## drop transactions that have a dollar value less than or equal to 0
new_transaction = new_transaction.filter((col('dollar_value') > 0))

## keep only transactions whose order datetime is in the expected range
new_transaction = new_transaction.filter((col('order_datetime') >= '2021-02-28') & (col('order_datetime') <= '2022-08-28'))

## check the consistency of postcode (must be four characters long)
new_transaction = new_transaction.filter(length(col('postcode')) == 4)

### Combining external dataset (census data)
- We chose SA2 population and income as external data and joined them to the main dataset (details in `external.ipynb` & `external2.ipynb`)
- Filter outliers after joining the external data
    - Fill null values with the mean/median to avoid losing too much data
- Distribution plots (null values filled with mean/median)

## Fraud
- Fraud detection
 - Join the fraud data (`consumer_fraud_probability`, `merchant_fraud_probability`) to our main dataset
 - Treat a fraud probability of `30%` and above as fraud
    - Add a new feature `is_fraud` to classify whether a record is a fraud record
    - Used a NaiveBayes model (could be described in more detail)
        - High accuracy
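
The thresholding step above could look roughly like the sketch below. Several details are assumptions rather than facts from the project: that the probabilities are expressed in percent, that a record counts as fraud when either probability reaches the threshold, and that a missing probability (no match in the left join) is treated as 0. The helper name `add_is_fraud` is hypothetical.

```python
FRAUD_THRESHOLD = 30.0  # assumed to be in percent, i.e. "30% and above"


def add_is_fraud(record, threshold=FRAUD_THRESHOLD):
    """Return a copy of the record with a boolean `is_fraud` flag.

    Assumption: the record is flagged when either the consumer or the
    merchant fraud probability is at or above the threshold.
    """
    consumer_p = record.get("consumer_fraud_probability") or 0.0
    merchant_p = record.get("merchant_fraud_probability") or 0.0
    return {**record, "is_fraud": max(consumer_p, merchant_p) >= threshold}


# Hypothetical records for illustration.
flagged = add_is_fraud(
    {"order_id": "x1", "consumer_fraud_probability": 30.0, "merchant_fraud_probability": None}
)
clean = add_is_fraud(
    {"order_id": "x2", "consumer_fraud_probability": 9.8, "merchant_fraud_probability": 12.5}
)
```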


## Ranking system 
### Aim
 - The goal of the model is to recommend to the BNPL company the top N merchants to cooperate with that are in its long-term interest, according to some specific characteristics

### Merchant features we are considering 
- `transaction_count`: number of transactions made in a specified period.
- `take_rate`: the fee charged by the BNPL firm to a merchant on a transaction. That is, for each transaction made, a certain percentage is taken by the BNPL firm.
- `revenue_level`: `(a, b, c, d, e)` represents the level of revenue bands (unknown to groups). 'a' denotes the smallest band whilst 'e' denotes the highest revenue band.
- `total_revenue`: the total revenue made by a merchant in a specified period
- `mean_consumer_income`: the mean weekly income of each merchant's consumers (used to represent the purchasing power of the merchant's target audience)
- `fraud_count`: the number of transactions that are recognized as fraud
- `main_business_area_popu`: the sum of the number of consumers in the five postcode areas where each merchant has the most users
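
As one possible reading of `main_business_area_popu`, the sketch below counts a merchant's orders per postcode, keeps the five busiest postcodes, and sums a population figure for those areas. The function name, the `population_by_postcode` lookup, and the sample data are assumptions for illustration; the project's exact definition may differ.

```python
from collections import Counter


def main_business_area_population(order_postcodes, population_by_postcode, top_k=5):
    """Sum the population of the `top_k` postcodes with the most orders.

    `order_postcodes` holds one postcode per order for a single merchant.
    Postcodes without a known population contribute 0.
    """
    counts = Counter(order_postcodes)
    top_postcodes = [postcode for postcode, _ in counts.most_common(top_k)]
    return sum(population_by_postcode.get(postcode, 0) for postcode in top_postcodes)


# Hypothetical merchant with orders from three postcodes.
orders = ["3000", "3000", "3000", "3052", "3052", "3121"]
population = {"3000": 100, "3052": 50, "3121": 25}
top_two_total = main_business_area_population(orders, population, top_k=2)
all_areas_total = main_business_area_population(orders, population)
```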

### Ranking model
1. Model Theory (Implementation of Jeremy-Rudy Algorithm)
- Step 1: Set the arguments for the ranking system, especially (score_criteria, remove_rate, top_n). Note that score_criteria should be set very carefully, otherwise the model could be meaningless.
- Step 2: Convert all entries of each numeric column into categorical levels (a, b, c, d, e) according to the (80%, 60%, 40%, 20%) quantiles of the current data.
- Step 3: Mark all entries of each numeric column according to the given score_criteria, then sum the column marks of each merchant (storing the mark in a new column 'score').
- Step 4: Sort all merchants by their mark (descending order) and drop {len(merchant_info) * remove_rate} merchants from the tail.
- Step 5: Remove the column 'score' and run the algorithm again on the remaining merchants. (Stop when the next run would take the number of merchants below 100.)
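
The five steps can be sketched in plain Python as shown below. This is an illustrative reading of the steps, not the code in ranking.py: the function names, the dictionary layout of `score_criteria`, the linear-interpolation quantile, the use of `top_n` as the stopping size, and the sample merchants are all assumptions. Only numeric features are handled, and ties are left in input order.

```python
import math

LEVELS = ("a", "b", "c", "d", "e")


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted, non-empty list."""
    pos = (len(sorted_values) - 1) * q
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def to_level(value, cuts):
    """Map a value to 'a'..'e' using the (80%, 60%, 40%, 20%) cut points."""
    for level, cut in zip(LEVELS, cuts):
        if value >= cut:
            return level
    return "e"


def score_merchants(merchants, score_criteria):
    """Steps 2-3: bin each feature on the current data, then sum the marks."""
    if not merchants:
        return {}
    scores = {name: 0.0 for name in merchants}
    for feature, marks in score_criteria.items():
        values = sorted(m[feature] for m in merchants.values())
        cuts = [quantile(values, q) for q in (0.8, 0.6, 0.4, 0.2)]
        for name, m in merchants.items():
            scores[name] += marks[to_level(m[feature], cuts)]
    return scores


def rank(merchants, score_criteria, remove_rate=0.1, top_n=100):
    """Steps 4-5: drop the lowest-scoring tail and rescore the rest,
    stopping when another removal would leave fewer than top_n merchants."""
    current = dict(merchants)
    while True:
        scores = score_merchants(current, score_criteria)
        ordered = sorted(current, key=scores.get, reverse=True)
        n_drop = max(1, int(len(ordered) * remove_rate))
        if len(ordered) - n_drop < top_n:
            return ordered[:top_n]
        current = {name: current[name] for name in ordered[:-n_drop]}


# Hypothetical data: ten merchants, two numeric features, equal marks.
marks = {"a": 5, "b": 4, "c": 3, "d": 2, "e": 1}
criteria = {"total_revenue": marks, "take_rate": marks}
merchants = {
    f"m{i}": {"total_revenue": 100 * i, "take_rate": 0.5 * i} for i in range(1, 11)
}
top = rank(merchants, criteria, remove_rate=0.2, top_n=3)
```

Because the quantiles are recomputed on the surviving merchants at every run, a merchant's level can change between runs even though its raw feature values do not.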

2. Model Explanation:
- Basic Consideration:
	- We give each merchant a rating level (a, b, c, d, e) for each feature. The rating is obtained by finding the 80th, 60th, 40th and 20th percentiles of all the data for each feature.
	- Each level can be assigned different marks for different features. For all features except revenue level, level e is the worst level.
	- The algorithm is expected to run multiple times, and each time a specified percentage of tail (or unwanted) merchants is removed. The remaining merchants then have their scores reset and go into the next run, until we finally obtain the top n merchants.
	- In general, at each run there will always be groups with lower total marks, since the marking is based on the current merchants and we update the remaining merchants at each run.
	- Therefore, at each run, merchants whose features (except revenue level) are all at level 'e' are very likely to be removed, as their total marks could be very low and fall within the drop list.
	- In conclusion, each time we run this algorithm we do not try to identify the 'best merchants', as that result could be very unreliable. Instead, we aim to find the merchants that are considered the weakest and remove them to ensure accuracy.

3. score_criteria:
	- This model will recommend the top n merchants according to our business goal, provided the 'score_criteria' is set properly.
	- By default, the model weights all features equally, which means, for instance, that 'transaction_count', 'take_rate' and 'total_revenue' are equally important as criteria for choosing the best merchants for the BNPL company.
	- However, a 'fair model' is not always a good choice. A BNPL company may focus more on a merchant's total revenue and take rate than on its transaction count, as the BNPL company can earn more when both of the former are high.
	- Therefore, the BNPL company may want to give more weight to merchants with higher 'total_revenue' and 'take_rate', which can be achieved by manually increasing the 'score_criteria' for these two features.
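
The effect of weighting can be shown with a small worked example. The layout of `score_criteria` used here (a mapping from feature name to marks per level) and the two merchants are hypothetical, chosen only to show how doubling the marks for 'take_rate' and 'total_revenue' can change which merchant scores higher.

```python
EQUAL_MARKS = {"a": 5, "b": 4, "c": 3, "d": 2, "e": 1}


def scale(marks, factor):
    """Multiply every level's mark by a weight factor."""
    return {level: mark * factor for level, mark in marks.items()}


def total_score(levels, criteria):
    """Sum the marks of a merchant's feature levels under the given criteria."""
    return sum(criteria[feature][level] for feature, level in levels.items())


# Default: every feature uses the same marks.
default_criteria = {
    "transaction_count": EQUAL_MARKS,
    "take_rate": EQUAL_MARKS,
    "total_revenue": EQUAL_MARKS,
}
# Revenue-focused: take_rate and total_revenue count double.
revenue_focused = {
    "transaction_count": EQUAL_MARKS,
    "take_rate": scale(EQUAL_MARKS, 2),
    "total_revenue": scale(EQUAL_MARKS, 2),
}

# Hypothetical merchants described by their level in each feature.
high_volume = {"transaction_count": "a", "take_rate": "c", "total_revenue": "b"}
high_margin = {"transaction_count": "e", "take_rate": "a", "total_revenue": "a"}

default_scores = (
    total_score(high_volume, default_criteria),
    total_score(high_margin, default_criteria),
)
focused_scores = (
    total_score(high_volume, revenue_focused),
    total_score(high_margin, revenue_focused),
)
```

Under equal marks the high-volume merchant scores 12 against 11, while under the revenue-focused marks the high-margin merchant leads with 21 against 19.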

# Summary of results

- Most importantly, we ran two rankings, one with and one without fraud transactions
- Based on this, we found that fraud does not greatly affect a merchant's ranking unless its fraud transactions outnumber its normal ones.
- Before running the ranking system, we guessed that the feature 'revenue level' might be the most important. After seeing the actual results, we believe it is the most important feature.
- Different segments have different transaction numbers.
- Merchants that sell tents usually have an advantage in customer purchasing frequency and overall merchant revenue
- Investing in merchants that sell jewellery could maximise BNPL earnings per transaction, but also carries the highest chance of being defrauded


# Limitations
In this project, we met several limitations. We found ways to work around some of them, but we believe there are better solutions.
1. The lack of external datasets. In the data preprocessing and analysis part, we mentioned that we linked the external and the given datasets. This link is based on the 2016 correspondence between postcodes and SA2 codes. However, some postcodes have no matching SA2 code. We believe that using the latest version of the correspondence would increase the matching precision, and ideally remove the need to fill null values with the median.
2. In the external data download step, we did not find a way to safely extract the data we needed: we retrieve the data by URL instead of calling an API, which might pose a cybersecurity risk.