## Summary
#### This notebook is a high-level summary of our whole merchant ranking model.


#### High-level
1. **Download** all the external datasets by running `script/download.py`
2. **Pre-process**
- Translate postcodes to SA2 codes, done in `notebook/poa_to_sa2.ipynb`
- Join all datasets into the transaction dataset, done in `script/ETL.py`
- Clean the dataset, done in `script/ETL.py`
- Remove outliers by category, done in `notebooks/outlier_detection_removal.ipynb`
3. Fraud analysis and removal
4. Feature Engineering
5. Modelling
---

In [2]:
import io
import zipfile

import pandas as pd
import numpy as np
import seaborn as sns
import geopandas as gpd
from datetime import datetime
import matplotlib.pyplot as plt
%matplotlib inline

from pyspark.sql import SparkSession, Window, functions as F
from pyspark.sql.functions import countDistinct, col, date_format
import pyspark.sql.functions as func
from pyspark.sql.functions import sum, avg, count, lag, date_sub, split
from pyspark.sql.window import Window
from pyspark.sql.types import (
    StringType,
    LongType,
    DoubleType,
    StructField,
    StructType,
    FloatType
)
from scipy.stats import zscore

import warnings
warnings.filterwarnings("ignore")

### External datasets
In the earliest sprint of this project, we decided which external datasets could support our merchant ranking model. We looked for datasets with _income and age population_ by SA2 code (people within one SA2 area interact socially and economically). Initially we wanted a dataset on _household spending_, but we could only find it at state level, which is not specific enough. We therefore discarded the idea of using _household spending_.

From projects in a previous subject, we learnt that most Australian Bureau of Statistics data is available through the Australian Urban Research Infrastructure Network (AURIN), so we used the AURIN Data Provider (ADP, the AURIN API) to retrieve two datasets: _2017-2018 Personal Income by SA2_ and _2020 Age Population by SA2_.

In [3]:
age = gpd.read_file("../data/abs/sa2_age.gml")
income = gpd.read_file("../data/abs/sa2_income.gml")

In [4]:
age_col = ['sa2_main16', 'persons_age_20_24', 'persons_age_25_29', 
           'persons_age_30_34', 'persons_age_35_39', 'persons_age_40_44', 
           'persons_age_50_54']
income_col = ['sa2_code', 'median_age_of_earners_years', 'median_aud', 
              'gini_coefficient_coef']

display(age[age_col].head())
display(income[income_col].tail())

Unnamed: 0,sa2_main16,persons_age_20_24,persons_age_25_29,persons_age_30_34,persons_age_35_39,persons_age_40_44,persons_age_50_54
0,101021007,197,137,148,184,221,316
1,101021008,479,541,585,576,564,615
2,101031015,204,112,126,122,149,242
3,101031016,440,433,597,495,426,599
4,101041017,420,322,337,326,385,443


Unnamed: 0,sa2_code,median_age_of_earners_years,median_aud,gini_coefficient_coef
2283,801101137,,,
2284,801101138,,,
2285,801101139,34.0,76754.0,0.342
2286,801111140,39.0,61096.0,0.404
2287,801111141,28.0,58498.0,


Aside from these two datasets, we also selected _postcode to SA2 (from 2011)_, _postcode boundaries (from 2016)_ and _SA2 boundaries (from 2016)_ data. These datasets are used to translate postcodes in the consumer dataset to SA2 codes, so that the _income_ and _age population_ data can be joined.
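As a rough illustration of why the postcode-to-SA2 translation matters, the sketch below chains two left joins: consumer postcode to SA2 code, then SA2 code to income. The frames, the `postcode` and `consumer_id` column names, and all values are made up for the example; this is not the join performed in the ETL script.

```python
import pandas as pd

# Hypothetical consumers keyed by postcode (values are illustrative only).
consumers = pd.DataFrame({"consumer_id": [1, 2], "postcode": [800, 810]})

# Hypothetical postcode -> SA2 translation table.
poa_sa2 = pd.DataFrame({
    "postcode": [800, 810],
    "sa2_code": [701011002, 701021013],
})

# Hypothetical SA2-level income figures.
income = pd.DataFrame({
    "sa2_code": [701011002, 701021013],
    "median_aud": [60000.0, 55000.0],
})

# Left joins keep every consumer even if a lookup is missing.
enriched = (
    consumers
    .merge(poa_sa2, on="postcode", how="left")
    .merge(income, on="sa2_code", how="left")
)
```

Left joins are used here so that consumers without a matching SA2 stay in the result with missing income values instead of being silently dropped.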

In [8]:
unzip_poa_sa2 = zipfile.ZipFile('../data/abs/poa_sa2_lookup.zip')
poa_to_sa2 = pd.read_excel(
    unzip_poa_sa2.open('1270055006_CG_POSTCODE_2011_SA2_2011.xls'),
    sheet_name='Table 3',
    skiprows=5,
)
poa_to_sa2 = poa_to_sa2.dropna()

In [9]:
poa_to_sa2.head()

Unnamed: 0,POSTCODE,POSTCODE.1,SA2_MAINCODE_2011,SA2_NAME_2011,RATIO,PERCENTAGE
1,800,800.0,701011002.0,Darwin City,1.0,99.999998
2,810,810.0,701021010.0,Alawa,0.071997,7.199707
3,810,810.0,701021013.0,Brinkin - Nakara,0.096392,9.639178
4,810,810.0,701021016.0,Coconut Grove,0.096494,9.649355
5,810,810.0,701021018.0,Jingili,0.061562,6.156198


In [7]:
sa2_bound = gpd.read_file('../data/abs/sa2_boundaries.gml')
poa_bound = gpd.read_file('../data/abs/poa_boundaries.gml')

In [11]:
display(sa2_bound[['sa2_maincode_2016', 'geometry']].head())
display(poa_bound[['poa_code_2016', 'geometry']].head())

Unnamed: 0,sa2_maincode_2016,geometry
0,101021007,"POLYGON ((149.58420 -35.44430, 149.58440 -35.4..."
1,101021008,"POLYGON ((149.21900 -35.36740, 149.21800 -35.3..."
2,101031015,"POLYGON ((148.60440 -36.13520, 148.60450 -36.1..."
3,101031016,"POLYGON ((148.27030 -36.46410, 148.27060 -36.4..."
4,101041017,"POLYGON ((150.23540 -35.70390, 150.23530 -35.7..."


Unnamed: 0,poa_code_2016,geometry
0,800,"POLYGON ((130.83450 -12.45800, 130.83390 -12.4..."
1,810,"POLYGON ((130.84710 -12.37750, 130.84730 -12.3..."
2,812,"POLYGON ((130.89190 -12.36880, 130.89220 -12.3..."
3,815,"POLYGON ((130.87240 -12.37650, 130.87230 -12.3..."
4,820,"POLYGON ((130.83500 -12.43010, 130.83510 -12.4..."


---
## Preprocess
### Postcodes to SA2 codes

Preprocessing begins with translating postcodes to SA2 codes. This step is outlined and performed in `notebook/poa_to_sa2.ipynb`. We encountered two issues. First, one postcode may span two or more SA2 areas. We resolved this by taking the SA2 that covers the largest share of the postcode (highest percentage). Second, some postcodes are not assigned to any SA2. We resolved this by falling back to the 2011 postcode-to-SA2 list.
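The two fixes can be sketched in pandas as follows. This is a minimal illustration and not the code in the notebook: the `lookup` frame reuses the column names of the ABS correspondence table shown earlier, while its values and the `primary` series are invented for the example.

```python
import pandas as pd

# Toy correspondence table: one row per (postcode, SA2) overlap.
# Column names follow the ABS lookup; the values are illustrative only.
lookup = pd.DataFrame({
    "POSTCODE": [800, 810, 810, 810],
    "SA2_MAINCODE_2011": [701011002, 701021010, 701021013, 701021016],
    "PERCENTAGE": [100.0, 7.2, 9.6, 83.2],
})

# Issue 1: a postcode spans several SA2s, so keep the row with the
# highest overlap percentage for each postcode.
best_idx = lookup.groupby("POSTCODE")["PERCENTAGE"].idxmax()
best_sa2 = lookup.loc[best_idx].set_index("POSTCODE")["SA2_MAINCODE_2011"]

# Issue 2: a hypothetical primary mapping has a gap (postcode 810),
# so fill the missing entry from the fallback mapping, aligned on postcode.
primary = pd.Series({800: 701011002.0, 810: float("nan")}, name="sa2")
filled = primary.fillna(best_sa2)
```

`idxmax` returns the row label of the largest percentage per postcode, so ties resolve to the first matching row.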

Here is the final table we use to translate postcode to SA2:

In [12]:
poa_to_sa2 = pd.read_csv("../data/curated/poa_w_sa2.csv")
poa_to_sa2.head()

Unnamed: 0,poa_code_2016,poa_name_2016,sa2_maincode_2016,sa2_name_2016,geometry
0,800,800,701011002.0,Darwin City,POLYGON ((130.83450871037445 -12.4579861192223...
1,810,810,701021013.0,Brinkin - Nakara,POLYGON ((130.86380870963237 -12.3668861187314...
2,812,812,701021014.0,Buffalo Creek,POLYGON ((130.9010087079157 -12.36578611897383...
3,815,815,701021013.0,Brinkin - Nakara,POLYGON ((130.86380870963237 -12.3668861187314...
4,820,820,701011006.0,Ludmilla - The Narrows,POLYGON ((130.8449087101784 -12.41538611897068...


### ETL
The next preprocessing step is to extract, transform and load (ETL) all datasets and consolidate them into one final dataframe that we use in the following steps. First, we join merchant and consumer details to the transaction dataset. Second, we clean the merchant tags and separate them into `product_categories`, `revenue_level` and `take_rate`. Third, we apply the business rules we found on the internet:
- BNPL only allows transactions above `$35`
- BNPL charges a flat administration fee of `$0.3` for each transaction

The first rule is applied by filtering out transactions with `dollar_value` below 35 dollars. The second rule is applied when computing our new column, `revenue`, where `take_rate` is stored as a percentage:
- `revenue` = (`take_rate` / 100) * `dollar_value` + 0.3
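The two business rules can be sketched in pandas as below. This is an illustration and not the PySpark code in the ETL script. It assumes `take_rate` is stored as a percentage, as the sample rows in this notebook suggest, and the constant names and toy values are made up for the example.

```python
import pandas as pd

MIN_ORDER_VALUE = 35.0  # rule 1: minimum transaction value in dollars
FLAT_FEE = 0.3          # rule 2: flat administration fee per transaction

# Toy transactions; take_rate is assumed to be a percentage.
trx = pd.DataFrame({
    "dollar_value": [20.0, 50.0, 350.0],
    "take_rate": [3.0, 4.95, 6.96],
})

# Rule 1: drop transactions below the minimum order value.
eligible = trx[trx["dollar_value"] >= MIN_ORDER_VALUE].copy()

# Rule 2: revenue is the percentage cut of the order plus the flat fee.
eligible["revenue"] = (
    eligible["take_rate"] / 100 * eligible["dollar_value"] + FLAT_FEE
)
```

Keeping both thresholds as named constants makes it easy to rerun the pipeline if the assumed business rules turn out to differ.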

In the final ETL step, we discard the columns that are not used in later steps. This helps us avoid running into memory (RAM) issues.

This step also splits the data into train and test sets by taking the first 12 months of transactions as train and the following 6 months as test. Running `scripts/ETL.py` saves two transaction files: train and test.
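A time-based split of this kind can be sketched in pandas as follows. The cutoff is derived here from the earliest transaction date plus 12 months, which is an assumption; the ETL script may use fixed dates instead. The dates in the toy frame are illustrative only.

```python
import pandas as pd

# Toy transactions with illustrative dates.
trx = pd.DataFrame({
    "order_datetime": pd.to_datetime(
        ["2021-03-01", "2021-08-19", "2022-02-28", "2022-03-19"]
    ),
    "dollar_value": [40.0, 50.6, 75.0, 350.5],
})

# Assumption: the training window starts at the earliest transaction.
start = trx["order_datetime"].min()
cutoff = start + pd.DateOffset(months=12)

# Everything before the cutoff is train, everything from the cutoff on is test.
train = trx[trx["order_datetime"] < cutoff]
test = trx[trx["order_datetime"] >= cutoff]
```

Splitting on time instead of at random keeps future transactions out of the training data, which matters when the model is later evaluated on a subsequent period.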

In [13]:
# Start Spark Session
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder.appName("MAST30034 Project 2 BNPL")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config("spark.driver.memory", "4g")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)


In [14]:
# load data from ETL
df_trx_sa2 = spark.read.parquet('../data/curated/df_trx_sa2.parquet')
df_trx_sa2_test = spark.read.parquet('../data/curated/df_trx_sa2_test.parquet')

                                                                                

In [15]:
display(df_trx_sa2.limit(1))
display(df_trx_sa2_test.limit(1))

                                                                                

merchant_abn,user_id,consumer_id,gender,dollar_value,order_id,order_datetime,sa2_maincode_2016,categories,revenue_level,take_rate,revenue
79417999332,10,1058499,Female,50.62647394670802,30e3f217-9ba4-439...,2021-08-19,203021046.0,"gift, card, novel...",b,4.95,2.8060102


merchant_abn,user_id,consumer_id,gender,dollar_value,order_id,order_datetime,sa2_maincode_2016,categories,revenue_level,take_rate,revenue
82307164889,5,712975,Female,350.45937268758155,58548b62-b6a2-420...,2022-03-19,509031247.0,artist supply and...,a,6.96,24.691973


### Removing outliers
To remove outliers, we first checked the initial distribution of revenue.
<img src="../plots/train_initial_distribution.png">
<img src="../plots/test_initial_distribution.png">

We then checked the distribution of revenue for each product category by looking at its quantiles. Here is an example for one category (music shops - musical instruments, pianos, and sheet music):
- 0 Quantile:     0.43227726221084595
- 0.1 Quantile:   2.7491043329238893
- 0.2 Quantile:   3.7892702102661135
- 0.3 Quantile:   4.980803442001343
- 0.4 Quantile:   6.308183288574219
- 0.5 Quantile:   7.696613311767578
- 0.6 Quantile:   9.151762199401855
- 0.7 Quantile:   13.548162078857413
- 0.8 Quantile:   22.482472229003914
- 0.9 Quantile:   41.3664180755615
- 0.91 Quantile:  44.23343235015869
- 0.92 Quantile:  47.46060867309574
- 0.93 Quantile:  51.12157638549806
- 0.94 Quantile:  55.17259437561035
- 0.95 Quantile:  60.2568155288696
- 0.96 Quantile:  65.95178344726574
- 0.97 Quantile:  73.02463157653838
- 0.98 Quantile:  84.09613769531268
- 0.99 Quantile:  103.42505798339911
- 1.0 Quantile:   `341.98474121001163`
There is evidently a huge gap between the 0.99 quantile and the 1.0 quantile (the maximum value). This trend is also visible in the other product categories. Therefore, for each category we keep only the transactions with revenue less than or equal to that category's 0.99 quantile. Here are the revenue distribution plots for one category before and after outlier removal.
<img src="../plots/3_before_distribution.png">
<img src="../plots/3_after_distribution.png">

    - Train dataframe size before: 4368706
    - Test dataframe size before: 2361542

    - Train dataframe size after: 4325008
    - Test dataframe size after: 2337914
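The per-category quantile filter described above can be sketched in pandas as follows. This is a simplified illustration with made-up values, not the code in the outlier notebook, and pandas' default linear interpolation for quantiles may differ from the method used there.

```python
import pandas as pd

# Toy revenues for two categories, each with one extreme value.
trx = pd.DataFrame({
    "categories": ["music"] * 5 + ["books"] * 5,
    "revenue": [1.0, 2.0, 3.0, 4.0, 300.0, 10.0, 11.0, 12.0, 13.0, 900.0],
})

# Compute each category's 0.99 quantile and broadcast it back to every row.
cap = trx.groupby("categories")["revenue"].transform(
    lambda s: s.quantile(0.99)
)

# Keep only transactions at or below their own category's cap.
kept = trx[trx["revenue"] <= cap]
```

Using `transform` keeps the cap aligned with the original rows, so the filter is a single boolean comparison with no extra join.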

This step is performed in `notebooks/outlier_detection_removal.ipynb`. The notebook saves two dataframes: train and test.