This notebook summarises our approach to the project, including the issues we faced and the limitations and assumptions involved.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import math

data_directory = "../data/curated/"
merchants = pd.read_parquet(data_directory + "merchants.parquet")
transactions = pd.read_parquet(data_directory + "transactions.parquet")
consumers = pd.read_parquet(data_directory + "consumers.parquet")
census = pd.read_csv(data_directory + "census.csv")
#merged_transactions = pd.read_parquet(data_directory + "merged_transactions.parquet")

# Normalisation

First, we noticed an issue with the merchant dataset: the 'tags' column packed several pieces of information into a single field. It required normalisation, not only because doing so showcases technical competency, but also because the separated fields could be useful in ranking later on (for example, take-rate, revenue band and category of purchase may all be relevant when ranking and, ideally, should be in different columns).

Hence, this column was normalised and the corresponding ETL implementation was updated to match.
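
A minimal sketch of how such a split might look, assuming each 'tags' string holds three bracketed groups (category description, revenue band, and a `take rate: x` entry). The sample strings, the regular expression and the output column names (`category`, `revenue_band`, `take_rate`) are illustrative assumptions, not the project's actual ETL code.

```python
import re
import pandas as pd

# Hypothetical sample rows; the real 'tags' format may differ.
sample = pd.DataFrame({
    "merchant_abn": [1, 2],
    "tags": [
        "((furniture, home furnishings), (e), (take rate: 0.18))",
        "([music shops], [a], [take rate: 6.43])",
    ],
})

# Matches the three innermost bracketed groups: category, revenue band, take rate.
GROUP = re.compile(r"[\(\[]([^\(\)\[\]]+)[\)\]]")

def split_tags(tags: str) -> pd.Series:
    """Split one 'tags' string into category, revenue band and take rate."""
    parts = [p.strip() for p in GROUP.findall(tags)]
    if len(parts) != 3:
        # Leave malformed rows as missing rather than guessing.
        return pd.Series({"category": None, "revenue_band": None, "take_rate": float("nan")})
    category, band, rate = parts
    take_rate = float(rate.split(":")[1])
    return pd.Series({"category": category.lower(), "revenue_band": band, "take_rate": take_rate})

# Replace the single 'tags' column with the three separated columns.
normalised = pd.concat(
    [sample.drop(columns="tags"), sample["tags"].apply(split_tags)], axis=1
)
```

Rows that do not match the assumed three-group layout are left as missing here, so they can be inspected rather than silently mis-parsed.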

In [1]:
#### INSERT NORMALISED MERCHANT TABLE HERE - NOT NECESSARY, BUT NOTEBOOK MAY LOOK BETTER ####

# Preprocessing/Outlier Analysis

After downloading/extracting the relevant datasets that were provided to us and conducting some preliminary analysis, the first step was to examine outliers and determine a method to eliminate them from the dataset. 

We found that, after merging transactions with merchants, the nulls predominantly relate to the name of the merchant and what they sell ('tags'). At the preliminary stage, we determined that these are vital features when ranking merchants, and that lacking them would hinder the accuracy of our selection metrics. Hence, all rows containing nulls were removed from further analysis.

To conduct further outlier analysis, we utilised a box plot (per merchant) to visualise the anomalies:

In [None]:
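def get_outliers(column):
    """Count values lying more than one IQR below Q1 or above Q3."""
    q1 = np.percentile(column, 25)
    q3 = np.percentile(column, 75)
    iqr = q3 - q1
    return ((column < (q1 - iqr)) | (column > (q3 + iqr))).sum()

# Merge transactions with merchant details and drop rows containing nulls.
transactions_noNull = transactions.merge(merchants, on='merchant_abn').dropna()

# Number of outlying dollar values per merchant.
outlier_counts = transactions_noNull.groupby("merchant_abn").agg(outlier_count=("dollar_value", get_outliers))

plt.figure(figsize=(10,10))
#sns.boxplot(x='name', y='dollar_value', data = transactions_noNull)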


#### INSERT BOX-PLOT/VISUALISATION OF OUTLIERS HERE ####

To eliminate the observed outliers, an IQR-based rule was applied, which removed approximately 3% of all data (113982 rows).
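
A rough sketch of a per-merchant IQR filter is given here for illustration. The function name `remove_iqr_outliers` and the toy data are hypothetical, and the multiplier `k` defaults to 1.0 only to mirror the counting function above (1.5 is the more conventional choice); the exact rule used in the ETL script may differ.

```python
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, group_col: str, value_col: str, k: float = 1.0) -> pd.DataFrame:
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR] of their group."""
    grouped = df.groupby(group_col)[value_col]
    # transform() broadcasts each group's quartile back onto its rows.
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    keep = (df[value_col] >= q1 - k * iqr) & (df[value_col] <= q3 + k * iqr)
    return df[keep]

# Toy data: merchant 1 has one extreme transaction.
toy = pd.DataFrame({
    "merchant_abn": [1, 1, 1, 1, 1, 2, 2, 2],
    "dollar_value": [10.0, 12.0, 11.0, 13.0, 500.0, 40.0, 42.0, 41.0],
})
cleaned = remove_iqr_outliers(toy, "merchant_abn", "dollar_value")
```

Computing the bounds per merchant, rather than over the whole dataset, avoids flagging every transaction of a high-priced merchant as an outlier.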

Alongside these steps, a generic ETL script was kept up to date, with separate functions for obtaining and preprocessing the data.

### Using a Fraud Model to Further Remove Data

After the release of the fraud dataset, we wanted to implement a model that detects fraudulent transactions and also removes them from the dataset.

To identify outliers, we normalised the dollar values per customer using median and quantile scaling, and then trained a linear regression model on the transaction-fraud data. We then applied this model to the outliers (scaled dollar values above 2) in the rest of the dataset. Finally, we compared the predicted fraud probability with a randomly generated number to decide which records to remove (for example, a transaction with a 60% probability of fraud has a 60% chance of being removed from the dataset).
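
The scaling and the probabilistic removal can be sketched as follows. This is an illustration under stated assumptions: "median and quantile scaling" is taken to mean (value - median) / IQR per customer, the column names `user_id` and `fraud_probability` are hypothetical, and the probabilities are assumed to lie in [0, 1]. The fraud model itself is not reproduced here.

```python
import numpy as np
import pandas as pd

def robust_scale_per_user(df: pd.DataFrame) -> pd.Series:
    """Scale dollar_value within each customer as (x - median) / IQR."""
    grouped = df.groupby("user_id")["dollar_value"]
    median = grouped.transform("median")
    iqr = grouped.transform(lambda s: s.quantile(0.75) - s.quantile(0.25))
    # Customers with zero spread get NaN instead of a division by zero.
    return (df["dollar_value"] - median) / iqr.replace(0, np.nan)

def drop_by_fraud_probability(df: pd.DataFrame, prob_col: str, rng: np.random.Generator) -> pd.DataFrame:
    """Remove each row with probability equal to its predicted fraud probability."""
    draws = rng.random(len(df))  # uniform draws in [0, 1)
    return df[draws >= df[prob_col].to_numpy()]

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 1],
    "dollar_value": [10.0, 20.0, 30.0, 40.0, 500.0],
    "fraud_probability": [0.0, 0.0, 0.0, 0.0, 1.0],  # stand-in for model output
})
toy["scaled_value"] = robust_scale_per_user(toy)
kept = drop_by_fraud_probability(toy, "fraud_probability", rng)
```

Because a row is dropped only when its uniform draw falls below its fraud probability, a transaction scored at 0.6 is removed in roughly 60% of runs, which matches the example in the text.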

### External Dataset

The external dataset (census data) was retrieved from the ABS website. We believed that some of these features could prove useful in deriving a ranking model. Additionally, as the census data is linked to customer data by postcode and gender, we assume that the census average for a customer's postcode and gender is representative of that individual.

Due to the hundreds of features available, feature engineering was done in order to obtain the most predictive attributes of a consumer. For example, a house-repayment-to-income ratio was deemed important as it showcases how risky - or how likely to default - a consumer is, and was engineered by dividing median mortgage payment of a consumer by their median income.
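
As a minimal sketch of the ratio described above: the column names and values below are hypothetical placeholders, and the conversion to a common period is an assumption (monthly repayments and weekly income are assumed here), so the actual census fields should be checked before reuse.

```python
import numpy as np
import pandas as pd

# Hypothetical census extract; column names and values are placeholders.
census_toy = pd.DataFrame({
    "postcode": [3000, 3051, 3999],
    "median_mortgage_repay_monthly": [2000.0, 1800.0, 1500.0],
    "median_income_weekly": [1000.0, 900.0, 0.0],
})

WEEKS_PER_MONTH = 52 / 12  # put both figures on a monthly basis

monthly_income = census_toy["median_income_weekly"] * WEEKS_PER_MONTH
# A zero income would make the ratio undefined, so treat it as missing.
census_toy["mortgage_income_ratio"] = (
    census_toy["median_mortgage_repay_monthly"] / monthly_income.replace(0, np.nan)
)
```

A higher ratio means a larger share of income goes to repayments, which is the sense in which the feature reflects default risk.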

The data was then cleaned and merged into other datasets.

### Final (Merged) Dataset

After applying the preprocessing and outlier analysis detailed above to each dataset, we merged them to obtain a final dataset.

The ETL script was also finalised.

In [26]:
#merged_transactions = pd.read_parquet(data_directory + "merged_transactions.parquet")


####INCLUDE FINAL, MERGED DATASET HERE####


# The Ranking Model and Segmentations

Initially, we devised a metric using three models: Customer-Merchant Model (CN), Customer Number Model (CNM) and Customer Sampling Model (CS). The CN model would be used to predict the amount each customer-merchant pair would spend on a monthly basis; the CNM model in conjunction with CS would be used to predict revenue for a specific month (using a Monte-Carlo sampling method). Ultimately, we would use all of this information to predict the future revenue for any given month for a specific merchant. However, we faced a number of issues in building this system:
- Customer data was sparse: we found that each customer only had a few transactions per merchant, making it difficult to build a model with predictive power
- Memory/Technical Issues: since these datasets/operations were computationally heavy, we were met with several RAM issues across group members, as well as .env and environment errors which impeded the creation process significantly

As a result, we decided to change the model to the one we have now. Drawing on several reputable articles and studies, we selected and created features:
- Afterpay's annual report indicates that the majority of revenue comes from the take-rate, so this should be the most important aspect of our model
- Customer Retention served as a distinguishing factor amongst the top chosen merchants: some merchants with comparatively lower revenue scores were still pinpointed by our model due to a comparatively higher retention score, which is an interesting discovery
- Research also revealed characteristics of BNPL customers, which were accounted for in our Customer Quality metric:
    - Federal Reserve Bank of Philadelphia stated that the BNPL market is far more attractive to the younger audience
    - Federal Reserve Bank of Philadelphia also indicated that nil-income customers are likely to default, so these should be avoided
    - A low mortgage-to-income ratio is preferable, as mortgage stress is a strong indicator of financial position
    - Roy Morgan (a reputable market research company) suggested avoiding high-income customers, as they are unlikely to use a BNPL approach
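
One way features like these could be combined into a single ranking score is sketched here. The feature names, toy values and weights are illustrative assumptions only and are not the weights used in the final model.

```python
import pandas as pd

# Hypothetical per-merchant features (toy values).
features = pd.DataFrame({
    "merchant_abn": [101, 102, 103],
    "take_rate_revenue": [5000.0, 3000.0, 1000.0],
    "retention": [0.20, 0.80, 0.50],
    "customer_quality": [0.60, 0.60, 0.60],
}).set_index("merchant_abn")

# Placeholder weights, with take-rate revenue weighted most heavily.
weights = {"take_rate_revenue": 0.5, "retention": 0.3, "customer_quality": 0.2}

def min_max(col: pd.Series) -> pd.Series:
    """Rescale a column to [0, 1]; a constant column becomes all zeros."""
    span = col.max() - col.min()
    return (col - col.min()) / span if span > 0 else col * 0.0

scaled = features.apply(min_max)
features["score"] = sum(scaled[name] * w for name, w in weights.items())
ranking = features.sort_values("score", ascending=False)
```

In this toy example merchant 102 ranks above merchant 101 despite lower revenue, because its retention is much higher.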

This research prompted us to weight the features accordingly. Regarding the Customer Quality metric, we found it interesting that, amongst the top merchants, the mean quality of customers converges to the mean of the individual customer quality distribution. The implication is that each merchant's customers are effectively drawn at random from the overall customer distribution. With more realistic data, this metric would possibly be more useful.
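
This behaviour is what random assignment would produce, and a small simulation can illustrate it. The sketch below uses purely synthetic quality scores (a Beta distribution chosen arbitrarily), not project data, and the merchant sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic individual customer quality scores (not project data).
population = rng.beta(2.0, 5.0, size=100_000)
population_mean = population.mean()

def merchant_mean_quality(n_customers: int) -> float:
    """Mean quality of n customers drawn at random from the population."""
    return rng.choice(population, size=n_customers, replace=True).mean()

# Simulate 500 small merchants and 500 large merchants.
small = np.array([merchant_mean_quality(10) for _ in range(500)])
large = np.array([merchant_mean_quality(5_000) for _ in range(500)])

# Average distance between a merchant's mean quality and the population mean.
spread_small = np.abs(small - population_mean).mean()
spread_large = np.abs(large - population_mean).mean()
```

Under random assignment, the merchants with many customers sit much closer to the population mean, so the metric barely separates large merchants from one another.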

(INSERT A VISUAL THAT SHOWCASES THE CONVERGING CUSTOMER QUALITY STUFF)

Below is our final ranking of merchants: 


- rankings.csv

### Segmentations

### Interesting Merchants