# Buy Now, Pay Later Project - Group 8
# Applied Data Science (MAST30034)

## Table of contents
1. Problem overview  <br><br>
    1.1&nbsp;&nbsp; Overview of our insight in producing the solution <br><br>
2. Data  <br><br>
    2.1&nbsp;&nbsp; Synthetically generated data provided by the teaching team  
    2.2&nbsp;&nbsp; External dataset (obtained from the Australian Bureau of Statistics (ABS)) <br><br>
3. Understanding data and cleaning  <br><br>
    3.1&nbsp;&nbsp; General overview of the synthetic datasets  
    3.2&nbsp;&nbsp; Cleaning the synthetic datasets    
        &nbsp;&nbsp;&nbsp;&nbsp; 3.2.3&nbsp;&nbsp;&nbsp;&nbsp; Resolving the missing merchants' details  
        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.2.3.1&nbsp;&nbsp;&nbsp;&nbsp; Classification (not successful)    
        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.2.3.2&nbsp;&nbsp;&nbsp;&nbsp; Clustering and imputation  
    3.3&nbsp;&nbsp; Dealing with fraudulent data  
        &nbsp;&nbsp;&nbsp;&nbsp; 3.3.1&nbsp;&nbsp;&nbsp;&nbsp; Dealing with consumer fraud  
        &nbsp;&nbsp;&nbsp;&nbsp; 3.3.2&nbsp;&nbsp;&nbsp;&nbsp; Dealing with merchant fraud (not implemented)  
        &nbsp;&nbsp;&nbsp;&nbsp; 3.3.3&nbsp;&nbsp;&nbsp;&nbsp; Final removal  
    3.4&nbsp;&nbsp; General overview on external datasets  
    3.5&nbsp;&nbsp; Cleaning the external datasets  <br><br>
4. Determining segments of merchant<br><br>
5. Ranking model assumptions (usage of variables and its intuition)<br><br>
6. Result of model<br><br>

In [6]:
# import the necessary libraries and start Spark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import pandas as pd

spark = (
    SparkSession.builder.appName("BNPL merchant ranking")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config("spark.driver.memory", "15g")
    .getOrCreate()
)

## 1. Problem overview 

A generic Buy Now, Pay Later (BNPL) firm has begun offering a new “Pay in 5 Installments” feature. Merchants want to partner with the firm because it may grow their customer base, and in return the BNPL firm receives a small percentage of their revenue to cover its operating costs. However, limited resources mean the firm can onboard at most 100 merchants each year. We therefore design a ranking system to help the firm choose which merchants to accept.

### 1.1 Overview of our insight in producing the solution

As a **firm that has just begun its business**, we believe that its goal is to **earn a profit while establishing itself for the long term**. Hence, we believe that the monetary value of the cashflows a merchant can bring to the BNPL firm is not the sole criterion for choosing business partners.

High returns may come with high risks. A merchant could make 100,000 in one month and then be out of business for the rest of the year, which would disrupt the BNPL firm's cash flow. A newly established firm may not have a stable cash flow to begin with, so this kind of volatility creates financial risk and threatens the objective of staying in business for the long term. Hence, we aim to select merchants that provide a **stable return**.

Also, as a newly established firm, we believe that it **may not yet have a large and strong base of consumers using its services**. Hence, a young customer demographic and other merchant features (explained in the Ranking model section) are also considered when ranking the merchants.

Thus, our solution is based on our domain knowledge in Finance and Economics, which gives intuition about what goals the firm may have. It aims to **maintain a stable profit that keeps the firm in operation while also helping it expand for better revenue in the future**.

## 2. Data

### 2.1 Synthetically generated data provided by the teaching team
These datasets relate to the BNPL firm.
- Merchant data (merchants that are to be considered for selection)
    - Contains information about merchants including merchant ABN, merchant name, take rate, product description, and revenue level.
    - Each data entry is a unique merchant as defined by its merchant ABN. <br><br> 
- Consumer data (consumers that may have made a purchase at one of the merchants)
    - Contains information about consumers including consumer_id, name, address, state, postcode, and gender.
    - Each data entry is a unique consumer as defined by their consumer ID. <br><br> 
- Transaction data (raw transactions of merchants)
    - Contains information about merchant transactions including user_id, merchant_abn, dollar_value, order_id, and order_datetime. 
    - Each data entry is a unique order. <br><br> 
- Consumer details (a conversion table that allows transactions to be joined with consumers)
    - Two columns, user_id and consumer_id. <br><br> 
- Merchant fraud probability data
    - Contains merchant_abn, order_datetime, and fraud probability.
    - Each data entry represents the chance that the entire batch of transactions made with that merchant on that day is fraudulent. <br><br> 
- Consumer fraud probability data
    - Contains user_id, order_datetime, and fraud probability.
    - Each data entry represents the chance that the entire batch of transactions made by that consumer on that day is fraudulent.

***The third transaction snapshot is labelled as ending on 28 August 2022, yet it contains transactions dated up to October 2022. We removed all transactions after 28 August 2022 to stay in line with the spec, treating them as input errors.***

### 2.2 External dataset (obtained from the Australian Bureau of Statistics (ABS))
These external datasets provide demographic statistics for each postcode in Australia. We use them because the only real-life component of the provided datasets is the consumers' postcodes.
- 2016 Age data 
    - Contains the reported count of people at each age from 0 to 115, and the total reported population count.
    - Each data entry is a unique postcode location. <br><br> 
- 2016 Education data
    - Contains the reported counts of students by institution type (e.g. TAFE, university) and full-time or part-time status, plus the total number of students.
    - Each data entry is a unique postcode location. <br><br> 
- Income data
    - Contains counts of people by total weekly personal income band, from negative and nil income through $1-$149 up to $3,000 or more. The bands vary in width.
    - Each data entry is a unique postcode location. <br><br> 





## 3. Understanding data and cleaning
### 3.1 General overview of the synthetic datasets

In [14]:
# read the three transaction snapshots and stack them into one DataFrame
transact_data1 = spark.read.parquet("../data/tables/transactions_20210228_20210827_snapshot/")
transact_data2 = spark.read.parquet("../data/tables/transactions_20210828_20220227_snapshot/")
transact_data3 = spark.read.parquet("../data/tables/transactions_20220228_20220828_snapshot/")
transactions_sdf = transact_data1.union(transact_data2).union(transact_data3)
merchant_sdf = spark.read.parquet("../data/tables/tbl_merchants.parquet")
# the consumer table is a pipe-delimited CSV
consumer_sdf = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .option("delimiter", "|")
    .csv("../data/tables/tbl_consumer.csv")
)
# the fraud tables are read without schema inference, so every column is a string
consumer_fraud = spark.read.option("header", True).csv("../data/tables/consumer_fraud_probability.csv")
merchant_fraud = spark.read.option("header", True).csv("../data/tables/merchant_fraud_probability.csv")

                                                                                

Quick look at the dataset sizes and what they look like.

In [21]:
# printing the size and the first row of the datasets
# (show() prints the table itself and returns None, so it is not wrapped in print)
print(f'There are {transactions_sdf.count()} entries of the Transactions dataset')
transactions_sdf.show(1)
print(f'There are {merchant_sdf.count()} entries of the Merchants dataset')
merchant_sdf.show(1)
print(f'There are {consumer_sdf.count()} entries of the Consumers dataset')
consumer_sdf.show(1)

There are 14195505 entries of the Transactions dataset
+-------+------------+------------------+--------------------+--------------+
|user_id|merchant_abn|      dollar_value|            order_id|order_datetime|
+-------+------------+------------------+--------------------+--------------+
|  18478| 62191208634|63.255848959735246|949a63c8-29f7-4ab...|    2021-08-20|
+-------+------------+------------------+--------------------+--------------+
only showing top 1 row

There are 4026 entries of the Merchants dataset
+-------------+--------------------+------------+
|         name|                tags|merchant_abn|
+-------------+--------------------+------------+
|Felis Limited|((furniture, home...| 10023283211|
+-------------+--------------------+------------+
only showing top 1 row

There are 499999 entries of the Consumers dataset
+----------------+--------------------+-----+--------+------+-----------+
|            name|             address|state|postcode|gender|consumer_id|
+

In [22]:
# printing the size and the first row of the datasets
# (show() prints the table itself and returns None, so it is not wrapped in print)
print(f'There are {consumer_fraud.count()} entries of the Consumer fraud probability dataset')
consumer_fraud.show(1)
print(f'There are {merchant_fraud.count()} entries of the Merchant fraud probability dataset')
merchant_fraud.show(1)

There are 34864 entries of the Consumer fraud probability dataset
+-------+--------------+-----------------+
|user_id|order_datetime|fraud_probability|
+-------+--------------+-----------------+
|   6228|    2021-12-19| 97.6298077657765|
+-------+--------------+-----------------+
only showing top 1 row

There are 114 entries of the Merchant fraud probability dataset
+------------+--------------+------------------+
|merchant_abn|order_datetime| fraud_probability|
+------------+--------------+------------------+
| 19492220327|    2021-11-28|44.403658647495355|
+------------+--------------+------------------+
only showing top 1 row



### 3.2 Cleaning the synthetic datasets

#### Merchant dataset
From the merchant row shown above, the tags column bundles the product description, revenue level, and take rate. We extract each into its own column so the data is easier to read and to handle in later analysis and feature engineering.  
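The extraction of the tags column could look roughly like the sketch below. It is only an illustration: the function name `parse_tags` is hypothetical, and the assumed layout of the string (three bracketed groups in the order product description, revenue level, take rate) is inferred from the truncated row above rather than confirmed by it. The sample value is made up.

```python
import re


def parse_tags(tags: str) -> dict:
    """Split a raw tags string into product description, revenue level and take rate.

    ASSUMPTION: the string holds three bracketed groups in this order, e.g.
    "((furniture, home furnishings), (e), (take rate: 0.18))".
    """
    # Treat square and round brackets alike, then drop the outer pair.
    cleaned = tags.strip().replace("[", "(").replace("]", ")")
    inner = cleaned[1:-1]
    # Grab the contents of each innermost "(...)" group.
    groups = re.findall(r"\(([^()]*)\)", inner)
    if len(groups) != 3:
        raise ValueError(f"expected 3 groups, got {len(groups)}: {tags!r}")
    prod_desc, revenue_level, take_rate_raw = (g.strip() for g in groups)
    # "take rate: 0.18" -> 0.18
    take_rate = float(take_rate_raw.split(":")[1])
    return {
        "prod_desc": prod_desc,
        "revenue_level": revenue_level,
        "take_rate": take_rate,
    }


# Hypothetical sample value, not taken from the data.
parsed = parse_tags("((furniture, home furnishings), (e), (take rate: 0.18))")
```

Raising on an unexpected number of groups is a deliberate choice here, so that malformed tags surface early instead of silently producing wrong columns.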

#### Transaction dataset
The transactions are expected to fall within 2021-02-28 to 2022-08-28. However, the dataset contains transactions dated after this range, and these are removed.
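A minimal sketch of this date filter, written in plain Python so it runs on its own. The records and the helper name `within_range` are hypothetical, and the range is assumed to be inclusive at both ends.

```python
from datetime import date

# Range stated above; treated as inclusive on both ends (an assumption).
START, END = date(2021, 2, 28), date(2022, 8, 28)


def within_range(order_date: date, start: date = START, end: date = END) -> bool:
    """True if the order date falls inside the expected snapshot range."""
    return start <= order_date <= end


# Hypothetical records for illustration only.
transactions = [
    {"order_id": "a", "order_datetime": date(2021, 8, 20)},
    {"order_id": "b", "order_datetime": date(2022, 10, 1)},  # past the range
]
kept = [t for t in transactions if within_range(t["order_datetime"])]
```

On the Spark DataFrame itself, the same idea would be expressed as a filter on the `order_datetime` column.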

#### Joining the merchant, consumer, and transaction dataset to check if there are any missing information


In [25]:
# Showing if there are any missing values in the columns
transaction_20210228_20210827_missings_sdf = spark.read.parquet("../data/curated/transactions_20210228_20210827_all_details_missing_counts")
transaction_20210228_20210827_missings_sdf.show(1)

+-------+------------+------------+--------+-------------+---------+-------------+---------+-------------+-------+-----+--------+------+-----------+
|user_id|merchant_abn|dollar_value|order_id|merchant_name|prod_desc|revenue_level|take_rate|consumer_name|address|state|postcode|gender|consumer_id|
+-------+------------+------------+--------+-------------+---------+-------------+---------+-------------+-------+-----+--------+------+-----------+
|      0|           0|           0|       0|       149228|   149228|       149228|   149228|            0|      0|    0|       0|     0|          0|
+-------+------------+------------+--------+-------------+---------+-------------+---------+-------------+-------+-----+--------+------+-----------+



Some merchants appear in the transactions but have no details in the merchant table (the merchant columns above are missing for 149,228 transactions in this snapshot). We decided not to discard them, because their transaction and customer records may still make them worth selecting for the firm. 

#### 3.2.3 Resolving the missing merchants' details
We first tried to build a classification model to predict the missing merchant details, but its accuracy was too low. We therefore resorted to assigning the unknown merchants to clusters formed from the known merchants, and imputed take rate with the cluster mean, as take rate is the variable we need for further calculations in the ranking system.

##### 3.2.3.1 Classification (not successful)    

**Objective**: Predict product description, revenue level and take rate of the missing merchants 

Expand the markdown cell below to see the classification pipeline

### **Classification pipeline**:  
0. Preliminary Data Analysis
1. Data Engineering
    * Mostly done in ETL
        * Encode revenue level into integer values, e.g. 1, 2, 3, 4, 5
        * Clean the prod_desc (has been updated in ETL)
    * Need one curated dataset for modeling product description and one dataset for modeling revenue level and take rate
2. Feature Engineering
    * Aggregate data to produce more useful features for modeling revenue level and take rate
    * Recommended features for prod_desc: dollar value, user id and order datetime
    * Recommended features for revenue level and take rate: monthly average revenue, monthly average number of orders, monthly average number of distinct customers, average revenue per order, median revenue, variance of dollar amount
3. Data Modeling
    * Choice of classification model: XGBClassifier, RandomForest, Naive Bayes (last resort)
    * Choice of regression model: Linear regression, XGBRegressor
    * Fitting and tuning the model to achieve optimal performance
4. Model Validation
    * Metrics:
        * Categorical (prod_desc and revenue_level):
            * Accuracy
            * F1 score
        * Continuous (take_rate):
            * RMSE
    * Visualization:
        * Categorical:
            * Learning curve
            * ROC curve
            * Confusion matrix
        * Continuous:
            * RMSE vs. fitted value
5. Model deployment
    * Use the prediction to impute missing information


Since all the models performed poorly in terms of accuracy/RMSE, it is not feasible to use them for imputation. In our view, further tuning would only refine the models rather than improve them significantly. <u>Therefore, we resorted to clustering.</u> We believe that our engineered features do not capture enough information about the merchants' take rate, product description, and revenue level.

To see the classification models and their performance click [classify_missing_merchants](./A_Y_H_classify_missing_merchants.ipynb)

##### 3.2.3.2 Clustering and imputation 
**Objective**: Perform clustering on the merchants to obtain 3 to 5 clusters, which can be used for market segmentation and take rate imputation

We clustered the merchants based on their consumer base, average monthly orders, and average monthly revenue. Given that we only have transaction and consumer data for the unknown merchants, we believe these 3 features can group similar merchants by business size. We tried **K-means, MeanShift, DBSCAN, and the Gaussian Mixture Model**, and evaluated them with the **Silhouette Coefficient, Calinski-Harabasz Index, and Davies-Bouldin Index**. We found that the Gaussian Mixture Model performed best. The clusters are shown below for a quick visualization.  <br><br>
<img src="../plots/GMMclusters.png" width="1000" class="center"> <br><br>

After creating the clusters, **we assign the unknown merchants to the clusters and impute each one's take rate with the mean take rate of its cluster.**
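The cluster-mean imputation can be sketched as follows, assuming each merchant already carries a cluster label from the fitted model. The function name, record layout and sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def impute_take_rate(merchants):
    """Fill missing take rates with the mean take rate of the merchant's cluster.

    merchants: list of dicts with keys 'abn', 'cluster' and 'take_rate'
    (take_rate is None for merchants whose details are unknown).
    """
    known_by_cluster = defaultdict(list)
    for m in merchants:
        if m["take_rate"] is not None:
            known_by_cluster[m["cluster"]].append(m["take_rate"])
    cluster_mean = {c: mean(rates) for c, rates in known_by_cluster.items()}
    # Note: a cluster with no known merchant would raise KeyError here.
    return [
        {**m, "take_rate": cluster_mean[m["cluster"]] if m["take_rate"] is None else m["take_rate"]}
        for m in merchants
    ]


# Hypothetical merchants for illustration only.
merchants = [
    {"abn": 1, "cluster": 0, "take_rate": 2.0},
    {"abn": 2, "cluster": 0, "take_rate": 4.0},
    {"abn": 3, "cluster": 0, "take_rate": None},  # unknown merchant
    {"abn": 4, "cluster": 1, "take_rate": 6.0},
]
imputed = impute_take_rate(merchants)
```

Known take rates are left untouched, so only the unknown merchants are affected by the imputation.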

### 3.3 Dealing with Fraud

The BNPL business model can be understood as follows. Merchants gain extra revenue from impulse spending because of the low upfront cost, and are guaranteed to receive the full purchase amount (less the take rate) immediately. The BNPL firm receives the take rate; its costs are operating costs, interest forgone on later instalments, and the risk of consumers not paying back (default risk).

Two datasets were provided by the teaching staff: a small sample of merchant-days and a small sample of consumer-days, each labelled with a probability of fraud. We assumed that if, for example, a consumer-day was fraudulent, then all of that consumer's transactions on that day should be discarded. 

Not every transaction on such a day may be fraudulent. However, the cost of a fraudulent transaction to the BNPL firm is close to the full transaction amount: the firm has to cover (1 - take rate) * transaction amount, because under the business model it bears the transaction risk in return for the take rate. This dwarfs the take rate it earns on a genuine transaction. Hence, if a consumer-day was considered fraudulent, all transactions on that day were thrown out; any money that could have been earned on the genuine ones is treated as offsetting the cost of the fraudulent ones. 
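To make the asymmetry concrete, the sketch below works through the arithmetic for a single order. The take rate of 5% and the order size of 100 are hypothetical values chosen for illustration, and the take rate is assumed to be expressed as a fraction.

```python
def fraud_loss(amount: float, take_rate: float) -> float:
    """Amount the BNPL firm has to cover on a fraudulent order."""
    return (1 - take_rate) * amount


def genuine_earning(amount: float, take_rate: float) -> float:
    """Amount the BNPL firm earns on a genuine order."""
    return take_rate * amount


def break_even_orders(take_rate: float) -> float:
    """Genuine orders of equal size needed to offset one fraudulent order."""
    return (1 - take_rate) / take_rate


take_rate = 0.05  # hypothetical take rate, as a fraction
loss = fraud_loss(100, take_rate)          # about 95
earning = genuine_earning(100, take_rate)  # about 5
n_to_offset = break_even_orders(take_rate) # about 19
```

Under these assumed numbers, roughly 19 genuine orders are needed to pay for one fraudulent order of the same size, which is why discarding a whole flagged consumer-day is a defensible simplification.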

##### 3.3.1 Dealing with consumer fraud

We treated this as a supervised learning problem, with the provided fraud probability as the label. Because the label is continuous, we chose a random forest regressor.

We treated each consumer-day (i.e. each row in the fraud dataset) as an instance, and engineered variables such as the average transaction amount by that consumer on that day; the sd of their transaction amounts on that day; the number of transactions they made; and the number of distinct stores they shopped at. All of these were derived by joining the fraud dataset with the transaction dataset and then aggregating.

Four further variables were engineered as ratios. For example: the number of distinct stores the consumer shopped at on the labelled day, divided by the average number of distinct stores they shopped at on past days when they made at least one transaction. The same idea was applied to the other three variables. These ratios standardise each consumer-day by the consumer's historic behaviour, which makes different consumers more comparable. They are denoted by appending the word 'ratio' to the variable name.

Ultimately, 'transaction amount per order ratio', 'transaction amount per order', 'sd of transaction amounts on the day', 'sd of transaction amounts on the day ratio' were chosen as variables.
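The same-day features and their ratio counterparts described above could be computed along these lines. This is a sketch under assumptions: the function and feature names are hypothetical, the population sd is used so that single-transaction days get an sd of 0, and a ratio is left undefined (`None`) when the historic baseline is 0.

```python
from statistics import mean, pstdev


def day_features(amounts, merchant_abns):
    """Aggregate one consumer-day of transactions into four features."""
    return {
        "amount_per_order": mean(amounts),
        "sd_amount": pstdev(amounts),  # population sd: 0 for a single order
        "n_orders": len(amounts),
        "n_stores": len(set(merchant_abns)),
    }


def ratio_features(today, past_days):
    """Divide each same-day feature by its average over the consumer's past active days."""
    ratios = {}
    for name, value in today.items():
        baseline = mean(day[name] for day in past_days)
        ratios[name + "_ratio"] = value / baseline if baseline else None
    return ratios


# Hypothetical consumer: two quiet past days, then a busier labelled day.
past = [
    day_features([10], ["m1"]),
    day_features([10, 10], ["m1", "m1"]),
]
today = day_features([10, 30], ["m1", "m2"])
ratios = ratio_features(today, past)
```

Here the labelled day has twice the usual amount per order and twice the usual number of stores, which is the kind of departure from historic behaviour the ratio variables are meant to expose.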




In [4]:
rfr_tuning = pd.read_csv('../data/tuning/RFR_brute.csv')
rfr_tuning.sort_values(['validation_accuracy'], ascending=False).head(1)

Unnamed: 0.1,Unnamed: 0,n_estimators,max_depth,max_samples,max_features,ccp_alpha,training_accuracy,validation_accuracy,testing_accuracy
757,0,150,18,0.5,0.75,0.001,0.91441,0.798436,0.798915


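After tuning, the best combination of hyperparameters gave a validation R^2 of 0.798 and a testing R^2 of 0.799, against a training R^2 of 0.914. The close agreement between validation and testing suggests the model generalises reasonably well, although the gap to the training score points to some overfitting.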

However, upon inspecting the predicted values (on the whole dataset), we found that most consumer-days had a very low predicted fraud probability (around 9%). The predictions are therefore more useful for ranking consumer-days by relative risk than as literal fraud probabilities. 



In [23]:
predicted_fraud = pd.read_csv('../data/curated/fraud/final_fraud_prediction.csv')

In [27]:
predicted_fraud[['fraud rate']].describe()[1:]

Unnamed: 0,fraud rate
mean,9.394051
std,0.920289
min,9.287148
25%,9.287148
50%,9.287148
75%,9.411906
max,85.362188


Ultimately, we decided to remove only about 0.1% of the transactions, which corresponds to a threshold of 20% predicted fraud probability. We chose 0.1% because removing anything less would make the fraud detection task not worth doing, while removing much more would cost the BNPL firm money (because take rate margins are generally so slim).
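The link between a threshold and the share of transactions removed can be checked with a small helper like the one below. It is a sketch only: the function name and the sample consumer-days are hypothetical, and a consumer-day is assumed to be flagged when its predicted probability is at or above the threshold.

```python
def share_removed(consumer_days, threshold=20.0):
    """Share of all transactions that sit on consumer-days at or above the threshold.

    consumer_days: list of (predicted_fraud_probability_percent, n_transactions).
    """
    total = sum(n for _, n in consumer_days)
    flagged = sum(n for p, n in consumer_days if p >= threshold)
    return flagged / total


# Hypothetical consumer-days: most predictions sit near 9%, a few are high.
consumer_days = [(9.3, 5), (25.0, 2), (85.4, 1), (9.4, 992)]
removed = share_removed(consumer_days)  # 3 of 1000 transactions
```

Evaluating this helper over a range of thresholds gives a direct way to see how sensitive the removed share is to the cutoff.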

A more spread-out distribution of predictions would have allowed a higher cutoff, which would have been preferable. However, the threshold was always going to be set manually, so in practice this makes little difference.

<img src="../plots/KDE_sample_fraud_rate.jpg" width="250" class="center">

<img src="../plots/boxplot_sample_fraud_rate.jpg" width="250" class="center">

##### 3.3.2 Dealing with merchant fraud

Although the same process could be repeated for merchant fraud, we deemed it unviable and unnecessary because:

1. There were only 114 labelled rows, which is not nearly enough to train a model given our method of treating each merchant-day as an instance. An inaccurate model risks throwing out too many transactions that were not actually fraudulent.

2. We already have an adequate consumer fraud model that is run over all the data, so a merchant fraud model would be complementary rather than necessary.

##### 3.3.3 Final removal

Running only the consumer fraud model over the entire dataset removed approximately 16,000 of the 14.2 million transactions. 

14179422 transactions were left after the removal of fraud.

### 3.4 General overview of the external datasets

In [7]:
# reading in the data
income_sdf = spark.read.option("header",True).csv('../data/curated/income.csv')
age_sdf = spark.read.option("header",True).csv('../data/curated/2016_age.csv')
education_sdf = spark.read.option("header",True).csv('../data/curated/2016_education.csv')

In [14]:
# printing the size and the columns of the income dataset
print(f'There are {income_sdf.count()} entries of the income dataset')
income_sdf.printSchema()  # printSchema() prints directly and returns None

There are 2653 entries of the income dataset
root
 |-- INCP Total Personal Income (weekly): string (nullable = true)
 |-- Negative income: string (nullable = true)
 |-- Nil income: string (nullable = true)
 |-- $1-$149 ($1-$7,799): string (nullable = true)
 |-- $150-$299 ($7,800-$15,599): string (nullable = true)
 |-- $300-$399 ($15,600-$20,799): string (nullable = true)
 |-- $400-$499 ($20,800-$25,999): string (nullable = true)
 |-- $500-$649 ($26,000-$33,799): string (nullable = true)
 |-- $650-$799 ($33,800-$41,599): string (nullable = true)
 |-- $800-$999 ($41,600-$51,999): string (nullable = true)
 |-- $1,000-$1,249 ($52,000-$64,999): string (nullable = true)
 |-- $1,250-$1,499 ($65,000-$77,999): string (nullable = true)
 |-- $1,500-$1,749 ($78,000-$90,999): string (nullable = true)
 |-- $1,750-$1,999 ($91,000-$103,999): string (nullable = true)
 |-- $2,000-$2,999 ($104,000-$155,999): string (nullable = true)
 |-- $3,000 or more ($156,000 or more): string (nullable = true)
 |-- No

In [17]:
# printing the size and the columns of the age dataset
print(f'There are {age_sdf.count()} entries of the age dataset')
age_sdf.printSchema()  # printSchema() prints directly and returns None

There are 2653 entries of the age dataset
root
 |-- AGEP Age: string (nullable = true)
 |-- 0: string (nullable = true)
 |-- 1: string (nullable = true)
 |-- 2: string (nullable = true)
 |-- 3: string (nullable = true)
 |-- 4: string (nullable = true)
 |-- 5: string (nullable = true)
 |-- 6: string (nullable = true)
 |-- 7: string (nullable = true)
 |-- 8: string (nullable = true)
 |-- 9: string (nullable = true)
 |-- 10: string (nullable = true)
 |-- 11: string (nullable = true)
 |-- 12: string (nullable = true)
 |-- 13: string (nullable = true)
 |-- 14: string (nullable = true)
 |-- 15: string (nullable = true)
 |-- 16: string (nullable = true)
 |-- 17: string (nullable = true)
 |-- 18: string (nullable = true)
 |-- 19: string (nullable = true)
 |-- 20: string (nullable = true)
 |-- 21: string (nullable = true)
 |-- 22: string (nullable = true)
 |-- 23: string (nullable = true)
 |-- 24: string (nullable = true)
 |-- 25: string (nullable = true)
 |-- 26: string (nullable = true)
 |-- 

In [18]:
# printing the size and the columns of the education dataset
print(f'There are {education_sdf.count()} entries of the education dataset')
education_sdf.printSchema()  # printSchema() prints directly and returns None

There are 2668 entries of the education dataset
root
 |-- postcode: string (nullable = true)
 |-- Technical or Further Educ Inst (incl. TAFE Colleges): Full-time student: Aged 15-24 years: string (nullable = true)
 |-- Technical or Further Educ Inst (incl. TAFE Colleges): Full-time student: Aged 25 years and over: string (nullable = true)
 |-- Technical or Further Educ Inst (incl. TAFE Colleges): Part-time student: Aged 15-24 years: string (nullable = true)
 |-- Technical or Further Educ Inst (incl. TAFE Colleges): Part-time student: Aged 25 years and over: string (nullable = true)
 |-- Technical or Further Educ Inst (incl. TAFE Colleges): Full-time/Part-time student status not stated: string (nullable = true)
 |-- University or other Tertiary Institution: Full-time student: Aged 15-24 years: string (nullable = true)
 |-- University or other Tertiary Institution: Full-time student: Aged 25 years and over: string (nullable = true)
 |-- University or other Tertiary Institution: Part-time

### 3.5 Cleaning the external datasets

We checked for missing values and string formatting issues in the datasets and found none. Aggregation and feature engineering were done later, when calculating the persona feature for the ranking model.

## 5. Ranking model

Background and intuition: 

The goal of our ranking model is to return a portfolio of 100 merchants that maximises the Sharpe Ratio. The Sharpe Ratio is a finance concept that quantifies compensation for risk; for our problem it equals E(revenue * take rate of the whole portfolio) / sd(revenue * take rate of the whole portfolio). High incoming cashflow is of course good for a company, but if it comes with high variance it can put the business at financial risk, which brings implicit bankruptcy costs and so on. By evaluating the portfolio of 100 merchants with the Sharpe Ratio, we aim to maximise the incoming cashflow (merchant revenue * take rate) per unit of risk (standard deviation) taken on. 
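The portfolio-level ratio described above can be sketched as follows. The function name and sample numbers are hypothetical, each merchant's cashflow is assumed to already be revenue * take rate per period, and the sample standard deviation is used (the choice between sample and population sd is an assumption).

```python
from statistics import mean, stdev


def portfolio_sharpe(cashflows_by_merchant):
    """Mean over sd of the summed per-period cashflow of a portfolio.

    cashflows_by_merchant: dict of merchant -> list of per-period
    (revenue * take rate), with all lists covering the same periods.
    """
    series = list(cashflows_by_merchant.values())
    # Add the merchants together period by period to get the portfolio cashflow.
    portfolio = [sum(period) for period in zip(*series)]
    return mean(portfolio) / stdev(portfolio)


# Hypothetical two-merchant portfolio over three periods.
sharpe = portfolio_sharpe({"a": [10, 12, 14], "b": [5, 5, 5]})
```

Because the ratio is computed on the summed series, merchants whose cashflows move in opposite directions reduce the portfolio sd, which is why correlation with the market appears among the ranking variables.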

Our ranking model consists of a heuristic function: a linear function that returns a score for each merchant based on several variables. The 100 highest-scoring merchants are then selected as the final portfolio. 

The variables used in the linear heuristic function are: 
1. Historic mean of (revenue * take rate) of the merchant
2. Historic sd of (revenue * take rate) of the merchant / historic mean of (revenue * take rate) of the merchant
3. Historic correlation between (revenue * take rate) of the merchant and (revenue * take rate) of all merchants in the market
4. Loyalty rate  
    The loyalty variable is designed to measure the strength of a merchant's consumer base. We wanted merchants whose customers keep coming back, because when we partner with such a merchant it is likely that those consumers can be converted into our customers.
    If a buy now, pay later option is offered, they are likely to use it to purchase more frequently at their "favorite" store, and so be introduced to our firm's service and even use it at other merchants.
5. Persona score  
    The persona score is given to a merchant based on the demographic statistics of its consumers.
    It is composed of 4 components:
    - Age  
        Younger is better, as young consumers can promote the firm through social media and are more willing to adopt new payment methods.
    - Education  
        We wanted students because, by reference (), many BNPL users are students.
    - Income  
        We wanted high-income consumers, as they are more able to repay their buy now, pay later installments reliably in the future.
    - Total population  
        We wanted consumers coming from areas with high population.
    



Growth rate was attempted as a variable, but was found not to be useful.

*To calculate the sd of firms, company's transactions were grouped by fortnight*
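A sketch of the fortnightly grouping and the scaled sd (variable 2 above). Several details are assumptions rather than the project's confirmed method: fortnights are counted from the first day of the data, fortnights without transactions count as 0, the take rate is a fraction, and the sample sd is used. All names and values are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, stdev

# ASSUMPTION: fortnights are counted from the first day of the data.
START = date(2021, 2, 28)


def fortnight_index(order_date, start=START):
    """0 for the first 14 days, 1 for the next 14, and so on."""
    return (order_date - start).days // 14


def fortnightly_cashflow(transactions, take_rate):
    """Sum dollar_value * take_rate per fortnight for one merchant.

    transactions: non-empty list of (order_date, dollar_value).
    Fortnights with no transactions are filled with 0.
    """
    totals = defaultdict(float)
    for order_date, dollar_value in transactions:
        totals[fortnight_index(order_date)] += dollar_value * take_rate
    return [totals.get(i, 0.0) for i in range(max(totals) + 1)]


def scaled_sd(cashflows):
    """Standard deviation divided by the mean of the fortnightly cashflows."""
    return stdev(cashflows) / mean(cashflows)


# Hypothetical merchant with three orders across two fortnights.
transactions = [
    (date(2021, 2, 28), 100.0),
    (date(2021, 3, 5), 100.0),
    (date(2021, 3, 14), 300.0),
]
cashflows = fortnightly_cashflow(transactions, take_rate=0.05)
```

Dividing the sd by the mean puts large and small merchants on the same scale, so a big merchant is not penalised simply for having bigger absolute swings.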

Method of tuning coefficients for variables in linear function:

0. Split our data into 23 fortnights for training and 13 fortnights for validation
1. Iterate over different combinations of coefficients for each variable for the linear heuristic function
2. Use this linear function to calculate the score for each merchant, and rank it
3. Take the top 100 scoring merchants and form a portfolio, and then calculate its Sharpe Ratio in the validation set
4. The combination of coefficients with the highest validation Sharpe Ratio becomes the final set of coefficients
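The tuning steps above can be sketched as a brute-force search. Everything here is illustrative: the helper names, the tiny three-merchant dataset, the grid values and `k=2` are hypothetical stand-ins for the real features, grid and portfolio of 100. A portfolio with zero validation sd is skipped as undefined, which is also an assumption.

```python
from itertools import product
from statistics import mean, stdev


def score(features, coefs):
    """Linear heuristic: weighted sum of a merchant's features."""
    return sum(c * x for c, x in zip(coefs, features))


def top_k(train_features, coefs, k):
    """Merchants with the k highest scores under the given coefficients."""
    ranked = sorted(train_features, key=lambda m: score(train_features[m], coefs), reverse=True)
    return ranked[:k]


def sharpe(merchants, cashflows):
    """Mean over sd of the summed validation cashflow; None if the sd is 0."""
    portfolio = [sum(p) for p in zip(*(cashflows[m] for m in merchants))]
    sd = stdev(portfolio)
    return mean(portfolio) / sd if sd else None


def tune(train_features, valid_cashflows, grid, k):
    """Try every coefficient combination and keep the best validation Sharpe Ratio."""
    n_vars = len(next(iter(train_features.values())))
    best = None
    for coefs in product(grid, repeat=n_vars):
        chosen = top_k(train_features, coefs, k)
        ratio = sharpe(chosen, valid_cashflows)
        if ratio is not None and (best is None or ratio > best[0]):
            best = (ratio, coefs, chosen)
    return best


# Hypothetical data: features are (historic mean, scaled sd) from the training fortnights.
train_features = {"a": (10.0, 0.1), "b": (12.0, 0.9), "c": (9.0, 0.1)}
valid_cashflows = {"a": [10, 11, 10], "b": [2, 30, 4], "c": [9, 9, 10]}
best_ratio, best_coefs, best_portfolio = tune(
    train_features, valid_cashflows, grid=(10, 1, -1, -10), k=2
)
```

In this toy example the search settles on a positive weight for the mean and a strongly negative weight for the scaled sd, because the volatile merchant "b" drags down the validation Sharpe Ratio of any portfolio that contains it.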

Assumptions and alterations of financial theories

- The 'market' used for the correlation here is not the efficient portfolio that real stock market analysts use. They calculate the correlation of a share with the 'market portfolio', but we have neither a 'risk free rate' nor the resources to calculate the efficient portfolio, so we simply use the portfolio of all merchants together as our 'market'.

- As stated above, we do not have a risk free rate. We also do not need to put in an initial investment (it is like getting to invest in 100 stocks for free). So instead of using E(return rate) and sd(return rate) as finance analysts do, we use E(revenue * take rate) and sd(revenue * take rate).

- In real finance, a portfolio consists of differently weighted stocks (e.g. 1/2 of the money in stock A, 1/3 in stock B and 1/6 in stock C), and the weights must sum to 1. Here, however, our portfolio's 'weights' sum to 100, and each merchant's weight can only ever be 1. (Alternatively, each weight can be thought of as 1/100, but because we work with expected cashflow rather than E(return rate), it is simplest to use w = 1, sum(w_i) = 100.)

- In training our model, we also assume that transaction behaviour during the 'train' and 'test' periods is similar when a portfolio of multiple merchants is considered, and that behaviour over the whole period is similar to that in the train and test periods. 

- Using the same data (split into two segments) to tune the model, and then using the overall data to produce the final result, inherently causes overfitting. We judge this risk to be smaller than that of holding out a further section of unseen data for the final prediction: the data provided spans only 18 months, or 39 fortnights, and splitting it into three parts would likely hurt the model more than the overfitting does. 
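The simplified 'market' from the first assumption above can be sketched like this: the market series is the sum of all merchants' cashflows, and each merchant is correlated against it. The function names and sample series are hypothetical, and the merchant itself is included in the market total, which follows the description but is not confirmed in detail.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length series (fails if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def market_correlation(cashflows, merchant):
    """Correlation of one merchant's cashflow with the sum over all merchants."""
    market = [sum(period) for period in zip(*cashflows.values())]
    return pearson(cashflows[merchant], market)


# Hypothetical fortnightly cashflows (revenue * take rate) for three merchants.
cashflows = {"a": [1, 2, 3], "b": [3, 2, 1], "c": [1, 2, 3]}
corr_a = market_correlation(cashflows, "a")
corr_b = market_correlation(cashflows, "b")
```

In this toy market merchant "b" moves against the total, so its correlation is negative; such merchants dampen the swings of a portfolio.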

### 5.2 Analysis

We first see that the two fundamental characteristics related to our aim of stable growth, the fortnightly mean cashflow and its scaled sd, differ significantly between portfolio and non-portfolio merchants. Specifically, the median fortnightly mean cashflow within the portfolio, at around 260k, is much higher than that of the other merchants, at approximately 21k.

In contrast, the scaled sd of our portfolio merchants has a very narrow distribution with a median of 0.21, whilst the median for the remaining merchants is much higher at 0.46.

These cross-sectional glimpses at the statistics of the final portfolio indicate that the algorithm succeeded in its stability goal of high cashflow at low sd, which gives our portfolio a high future benefit score.


<img src="../plots/Boxplot_5V top100 vs 5V top100' Hist Mean(Revenue*Take Rate).png" width="500" class="center">

<img src="../plots/Boxplot_5V top100 vs 5V top100' Hist Std Stdev(Revenue*Take Rate).png" width="500" class="center">

We also analysed the effectiveness of the last two features by comparing our current top portfolio with a top portfolio ranked using a heuristic function of just the first three cashflow-based features. 

Compared to the 3-feature portfolio, the 5-feature portfolio has more merchants with a low per-order transaction amount, but also includes a selection of merchants with a high per-order amount. This is typical of a financial portfolio, where most holdings are stable and a few are risky, with the risk compensated by a fair reward. This suggests that including the repeated customer rate and persona score has enhanced our model. 

Overall, a portfolio most beneficial to stability and growth should contain a majority of merchants with small order quantities and a low per-order transaction amount, with around 20% of merchants having a high per-order transaction amount. 


<img src="../plots/Histogram_3V top100 vs 5V top100 Mean Transaction Amount.png" width="500" class="center">

### 6. Result of model

The final tuned coefficients of the heuristic linear function (after several rounds of tuning and refinement) were:

{60, -230000000, 105750000000, -6000000000, 602500000000}

So the final heuristic function is:

score of company = 60 * historic mean - 230000000 * standardised historic sd + 105750000000 * historic corr - 6000000000 * loyalty + 602500000000 * persona score
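
As a minimal sketch, the heuristic function can be written as a weighted sum over a feature table. The mapping from the terms of the formula to the column names (`mean`, `standardised stdev`, `corr`, `repeated_purchase_rate`, `persona_score`) is an assumption based on the feature table shown later in this notebook, and `score_merchants` is a hypothetical helper. Whether the features are rescaled before the coefficients are applied is not specified here, so the toy values below are purely illustrative.

```python
import pandas as pd

# Coefficients from the formula above; the column-name mapping is an assumption.
COEFFICIENTS = {
    "mean": 60,
    "standardised stdev": -230_000_000,
    "corr": 105_750_000_000,
    "repeated_purchase_rate": -6_000_000_000,
    "persona_score": 602_500_000_000,
}


def score_merchants(features, coefficients=COEFFICIENTS):
    """Return the linear heuristic score for each row (merchant) of `features`."""
    weights = pd.Series(coefficients, dtype="float64")
    # multiply each column by its coefficient, then sum across columns
    return features[list(coefficients)].mul(weights, axis=1).sum(axis=1)


# toy rows chosen so that each one isolates a single term of the formula
toy = pd.DataFrame({
    "mean": [1.0, 0.0, 0.0],
    "standardised stdev": [0.0, 1.0, 0.0],
    "corr": [0.0, 0.0, 0.0],
    "repeated_purchase_rate": [0.0, 0.0, 0.0],
    "persona_score": [0.0, 0.0, 1.0],
})
scores = score_merchants(toy)
ranking = scores.sort_values(ascending=False)  # highest score first
```

Sorting the scores in descending order and keeping the first 100 rows would then give a top-100 ranking.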

Recommendation:



How do we know that each variable was actually useful?

Because this is an unsupervised problem, we cannot use methods such as AIC stepwise selection or a feature-selection F test to check the usefulness of each variable.

Instead, we ran the training process (all combinations of each variable's coefficient taking values in {0.0001, 0.01, 1, 100, 10000, -0.0001, -0.01, -1, -100, -10000}) on just the first variable (historic mean of revenue * take rate), then on two variables (+ historic sd of revenue * take rate), then on three variables (+ historic corr), and finally on all 5 variables together. Each time we added variables, we reached a higher validation Sharpe ratio, meaning that the extra variables helped us pick out companies that made the final validation portfolio perform better.
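
The search described above can be sketched as an exhaustive grid search. This is an illustration under stated assumptions, not our exact training code: the Sharpe ratio is taken here to be the mean of the portfolio's per-period cashflow divided by its sample sd (no risk-free rate), and `sharpe_ratio`, `best_weights` and the toy arrays are hypothetical.

```python
from itertools import product

import numpy as np

# candidate coefficient values, as listed above
GRID = [0.0001, 0.01, 1, 100, 10000, -0.0001, -0.01, -1, -100, -10000]


def sharpe_ratio(cashflow):
    """Assumed definition: mean / sample sd of the portfolio's per-period cashflow."""
    sd = cashflow.std(ddof=1)
    return cashflow.mean() / sd if sd > 0 else float("-inf")


def best_weights(features, validation_cashflow, top_k, grid=GRID):
    """Try every coefficient combination and keep the one whose top-k portfolio
    has the highest Sharpe ratio on the validation periods.

    features: array of shape (n_merchants, n_features)
    validation_cashflow: array of shape (n_merchants, n_periods)
    """
    best_ratio, best_combo = float("-inf"), None
    for combo in product(grid, repeat=features.shape[1]):
        scores = features @ np.array(combo)
        top = np.argsort(scores)[-top_k:]  # indices of the k highest scores
        ratio = sharpe_ratio(validation_cashflow[top].sum(axis=0))
        if ratio > best_ratio:
            best_ratio, best_combo = ratio, combo
    return best_ratio, best_combo


# toy example: one feature, four merchants, four validation periods
features = np.array([[1.0], [2.0], [3.0], [4.0]])
validation_cashflow = np.array([
    [1.0, 9.0, 1.0, 9.0],
    [5.0, 1.0, 7.0, 2.0],
    [10.0, 11.0, 10.0, 11.0],
    [10.0, 10.0, 11.0, 11.0],
])
ratio, combo = best_weights(features, validation_cashflow, top_k=2)
```

Repeating the search with one, two, three and then five feature columns, and comparing the best validation ratio at each step, mirrors the check described above. With five features the grid has 10^5 combinations, so the loop is still cheap for a few thousand merchants.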

This is a glimpse of our full dataset (i.e. features used for prediction)

In [16]:
# fortnightly mean, sd and market correlation of cashflow per merchant
merchant_overall = pd.read_csv('../data/curated/final_model/input/agg_fortnightly_mean_sd_marketcorr_NOFRAUD.csv')

# 4th variable: repeated purchase rate (loyalty)
loyalty_overall = pd.read_csv('../data/curated/final_model/input/loyalty_full.csv')
loyalty_overall = loyalty_overall[['merchant_abn', 'repeated_purchase_rate']]

# 5th variable: persona score
persona_overall = pd.read_csv('../data/curated/final_model/input/persona_full.csv')

# merge dataset with the 4th and 5th variables
merchant_overall = merchant_overall.merge(loyalty_overall, on='merchant_abn', how='inner')
merchant_overall = merchant_overall.merge(persona_overall, on='merchant_abn', how='inner')

# scale the sd by the mean
merchant_overall['standardised stdev'] = merchant_overall['stdev']/merchant_overall['mean']

# rescale the features by constant factors
merchant_overall['corr'] = merchant_overall['corr'] * 100000
merchant_overall['repeated_purchase_rate'] = merchant_overall['repeated_purchase_rate'] * 1000000
merchant_overall['persona_score'] = merchant_overall['persona_score'] * 10000000
merchant_overall['standardised stdev'] = merchant_overall['standardised stdev'] * 100000

merchant_overall

Unnamed: 0,merchant_abn,mean,stdev,n_periods,corr,repeated_purchase_rate,persona_score,standardised stdev
0,10023283211,71944.954088,15268.757666,38,89865.793665,57655.593738,45527.819218,21222.833289
1,10142254217,22687.034837,6496.702667,38,80109.334092,56990.569906,47402.331867,28636.191172
2,10187291046,6831.716274,3379.999688,38,34995.557931,3484.320557,46658.738715,49475.118009
3,10192359162,40955.314495,17639.407542,38,47234.009458,3154.574132,60202.386586,43069.886679
4,10206519221,88038.400673,21292.527062,38,92172.605972,165604.172704,44677.257522,24185.499623
...,...,...,...,...,...,...,...,...
4362,99938978285,106551.591313,21699.148448,38,95903.180884,277847.759583,47348.625533,20364.921989
4363,99974311662,6326.735554,4783.350900,38,16927.146730,0.000000,68106.927357,75605.355381
4364,99976658299,871306.137224,186985.910229,38,97885.491058,354736.454110,47225.595328,21460.414686
4365,99987905597,16290.096561,10052.773545,38,25886.462839,13071.895425,32046.388945,61710.951234


The top 100 companies are:

In [13]:
data = pd.read_csv('../data/curated/final_model/output/final_top100.csv')

In [11]:
data.index = data.index+1

In [12]:
display(data[['merchant_abn']].head(50))
display(data[['merchant_abn']].tail(50))

Unnamed: 0,merchant_abn
1,57564805948
2,31400548982
3,49465266764
4,62789659343
5,99801770627
6,10881038707
7,49514806178
8,48549026640
9,29215623643
10,81548651453


Unnamed: 0,merchant_abn
51,99976658299
52,71961434094
53,30214222203
54,80779820715
55,22059270846
56,54550134954
57,15560455575
58,49322182190
59,49212265466
60,71350572766


The top 10 merchants in each segment are:

Cluster 0: Medium business

In [15]:
pd.read_csv('../data/curated/final_model/final_10_cluster0.csv')[['merchant_abn']]

Unnamed: 0,merchant_abn
0,57564805948
1,31400548982
2,49465266764
3,62789659343
4,99801770627
5,10881038707
6,49514806178
7,48549026640
8,41001282470
9,98671274602


Cluster 1: Large business

In [16]:
pd.read_csv('../data/curated/final_model/final_10_cluster1.csv')[['merchant_abn']]

Unnamed: 0,merchant_abn
0,89640578182
1,37459245212
2,91720867026
3,50866797623
4,30122382323
5,68559320474
6,61447419161
7,88547577701
8,94472466107
9,21359184622


Cluster 2: Small business

In [17]:
pd.read_csv('../data/curated/final_model/final_10_cluster2.csv')[['merchant_abn']]

Unnamed: 0,merchant_abn
0,20562405782
1,33604812025
2,29566626791
3,74648589246
4,75342681786
5,79953723663
6,67202032418
7,67330176930
8,99904689266
9,91848160033


These are the proportions of each segment in the final portfolio. Medium-sized merchants make up the majority of the portfolio.

<img src="../plots/pie.png" width="1000" class="center">
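
Proportions like those in the pie chart can be computed with `value_counts(normalize=True)`. In the sketch below, the `segment` column, the helper `segment_proportions` and the toy rows are hypothetical; the real portfolio file may store the cluster label under a different name.

```python
import pandas as pd


def segment_proportions(portfolio, segment_col="segment"):
    """Share of portfolio merchants that fall in each segment."""
    return portfolio[segment_col].value_counts(normalize=True).sort_index()


# toy portfolio, not the real top 100
toy_portfolio = pd.DataFrame({
    "merchant_abn": [1, 2, 3, 4],
    "segment": ["medium", "medium", "large", "small"],
})
proportions = segment_proportions(toy_portfolio)
print(proportions)
```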