# MAST30034 Applied Data Science Project 2
## Merchant Forecast Notebook

This notebook forecasts the future performance of each merchant. The forecasts are used as features in the final ranking model.

**Key Performance Indicators:**
1. Average Take Amount (Take Rate * Dollar Value)
2. Transaction Volume
3. Average Transaction Amount
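
As a minimal illustration of how the three indicators are defined, the sketch below computes them per merchant and day on a made-up pandas table. The sample values and the `take_rate` column name are assumptions for this example only; the notebook itself computes the same aggregates in Spark.

```python
import pandas as pd

# Toy transactions (made-up values, for illustration only)
transactions = pd.DataFrame({
    "merchant_abn": [1, 1, 1, 2],
    "order_datetime": ["2021-03-01", "2021-03-01", "2021-03-02", "2021-03-01"],
    "dollar_value": [100.0, 300.0, 50.0, 80.0],
    "take_rate": [0.05, 0.05, 0.05, 0.10],  # hypothetical column name
})
transactions["take_amt"] = transactions["take_rate"] * transactions["dollar_value"]

# One row per merchant and day, with the three KPIs as columns
kpis = (
    transactions.groupby(["merchant_abn", "order_datetime"])
    .agg(
        mean_take_amt=("take_amt", "mean"),
        transaction_count=("take_amt", "size"),
        avg_transaction_amt=("dollar_value", "mean"),
    )
    .reset_index()
)
print(kpis)
```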

In [None]:
# Import Libraries
# !pip3 install tqdm
# !pip3 install pmdarima

from pyspark.sql import SparkSession, functions as F, DataFrame
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from tqdm import tqdm
import warnings

import statsmodels.api as sm
from statsmodels.tools.sm_exceptions import ConvergenceWarning
import pmdarima as pm

# Filter out some warnings
warnings.simplefilter('ignore', ConvergenceWarning)
warnings.simplefilter('ignore', UserWarning)
warnings.simplefilter('ignore', RuntimeWarning)

# Create spark session
spark = (
    SparkSession.builder.appName("Merchant Forecast")
    .config("spark.sql.repl.eagerEval.enabled", True)
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config("spark.executor.memory", "2g")
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)

In [2]:
# Read in preprocessed data
cleaned_data = spark.read.parquet('../data/curated/cleaned_data.parquet/')
cleaned_data = cleaned_data.withColumn("take_amt", F.col("take rate")*F.col("dollar_value"))

# Group Data and transform into time series data
ts = cleaned_data.groupBy(['merchant_abn', 'order_datetime']).agg(
    F.mean("take_amt").alias("mean_take_amt"),
    F.count("*").alias("transaction_count"),
    F.avg("dollar_value").alias("avg_transaction_amt")
)
ts_pd = ts.toPandas()
ts_pd = ts_pd.sort_values(by=['order_datetime','merchant_abn'])
ts_pd['order_datetime'] = pd.to_datetime(ts_pd['order_datetime'])

# Get a sorted list of the unique merchant ABNs
merchants = sorted(set(ts_pd['merchant_abn']))

                                                                                

### **1. Determine Model for Forecasting**

Since the data is a daily time series, we use SARIMA (Seasonal Auto-Regressive Integrated Moving Average), a model that forecasts future values from past values. It combines autoregressive lags $(p)$, differencing $(d)$, and moving-average terms $(q)$ with their seasonal counterparts $(P,D,Q)$ at seasonal period $s$. `statsmodels` and `pmdarima` report the model as `SARIMAX`; the X stands for optional exogenous regressors, which we do not use.

The model takes the hyperparameters $SARIMA(p,d,q)\times(P,D,Q,s)$:
- We set $s=7$: the data is daily, and the shortest natural seasonal period is one week (7 days)
- We fix $D=1$, choose $d$ with an ADF test, and use a stepwise search (with AIC as the selection criterion) to determine the remaining orders

Code adapted from: https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/
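
To make the role of $s=7$ and seasonal differencing ($D=1$) concrete, the sketch below builds a synthetic daily series with a fixed weekly pattern plus a slow trend, then applies a lag-7 difference. The series is made up for illustration and is not taken from the merchant data.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: a repeating weekly pattern plus a slow upward trend
days = pd.date_range("2021-01-01", periods=28, freq="D")
weekly_pattern = np.array([5.0, 6.0, 7.0, 8.0, 9.0, 12.0, 11.0])
series = pd.Series(np.tile(weekly_pattern, 4) + 0.1 * np.arange(28), index=days)

# Seasonal differencing with period s=7: y_t - y_{t-7}
seasonal_diff = series.diff(7).dropna()

print(seasonal_diff.round(3).unique())
```

The weekly pattern cancels out in the difference, leaving only the constant weekly increase from the trend (7 days at 0.1 per day). A series that is flat after this step is much easier for the non-seasonal part of the model to handle.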


In [None]:
# Use one merchant as an example to determine model
test_data = ts_pd.loc[ts_pd['merchant_abn']==10142254217]
test_data = test_data.set_index('order_datetime')

#### 1.1 Average Take Amount (take rate * dollar value)

For the example merchant, the stepwise search selects `SARIMA(1,0,0)x(2,1,0,7)`, so we use this specification for the average take amount in the final forecasting model

In [5]:
# Seasonal - fit stepwise auto-ARIMA
smodel_take_amt = pm.auto_arima(test_data['mean_take_amt'], start_p=1, start_q=1,
                         test='adf',
                         max_p=3, max_q=3, m=7,
                         start_P=0, seasonal=True,
                         d=None, D=1,
                         error_action='ignore',  
                         suppress_warnings=True, 
                         stepwise=True)

smodel_take_amt.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | No. Observations: | 597 |
| Model: | SARIMAX(1, 0, 0)x(2, 1, 0, 7) | Log Likelihood | -3510.17 |
| Date: | Thu, 22 Sep 2022 | AIC | 7028.341 |
| Time: | 19:40:08 | BIC | 7045.861 |
| Sample: | 0 - 597 | HQIC | 7035.166 |
| Covariance Type: | opg | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| ar.L1 | -0.0867 | 0.047 | -1.857 | 0.063 | -0.178 | 0.005 |
| ar.S.L7 | -0.6994 | 0.032 | -22.026 | 0.000 | -0.762 | -0.637 |
| ar.S.L14 | -0.3866 | 0.033 | -11.792 | 0.000 | -0.451 | -0.322 |
| sigma2 | 8550.4787 | 405.070 | 21.109 | 0.000 | 7756.556 | 9344.402 |

| | | | |
|---|---|---|---|
| Ljung-Box (L1) (Q): | 0.0 | Jarque-Bera (JB): | 53.14 |
| Prob(Q): | 1.0 | Prob(JB): | 0.0 |
| Heteroskedasticity (H): | 0.76 | Skew: | 0.49 |
| Prob(H) (two-sided): | 0.06 | Kurtosis: | 4.09 |
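
Written out with the backshift operator $B$ (where $B^k y_t = y_{t-k}$), a `SARIMA(1,0,0)x(2,1,0,7)` model without an intercept has the form below. The second line substitutes the rounded coefficients from the summary for the example merchant, assuming the `statsmodels` sign convention in which the reported `ar` coefficients enter the lag polynomial with a minus sign. It is meant as a reading aid, not as an additional result.

$$(1-\phi_1 B)\,(1-\Phi_1 B^{7}-\Phi_2 B^{14})\,(1-B^{7})\,y_t=\varepsilon_t$$

$$(1+0.0867\,B)\,(1+0.6994\,B^{7}+0.3866\,B^{14})\,(1-B^{7})\,y_t=\varepsilon_t,\qquad \operatorname{Var}(\varepsilon_t)\approx 8550.48$$

Here $\phi_1$ corresponds to `ar.L1`, $\Phi_1$ and $\Phi_2$ to `ar.S.L7` and `ar.S.L14`, and the factor $(1-B^{7})$ is the seasonal difference implied by $D=1$.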


#### 1.2 Transaction Volume 

For the example merchant, the stepwise search selects `SARIMA(1,0,1)x(2,1,0,7)`, so we use this specification for the transaction volume in the final forecasting model

In [7]:
# Seasonal - fit stepwise auto-ARIMA
smodel_transaction_count = pm.auto_arima(test_data['transaction_count'], start_p=1, start_q=1,
                         test='adf',
                         max_p=3, max_q=3, m=7,
                         start_P=0, seasonal=True,
                         d=None, D=1,
                         error_action='ignore',  
                         suppress_warnings=True, 
                         stepwise=True)

smodel_transaction_count.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | No. Observations: | 597 |
| Model: | SARIMAX(1, 0, 1)x(2, 1, [], 7) | Log Likelihood | -1560.367 |
| Date: | Thu, 22 Sep 2022 | AIC | 3130.734 |
| Time: | 19:41:01 | BIC | 3152.635 |
| Sample: | 0 - 597 | HQIC | 3139.266 |
| Covariance Type: | opg | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| ar.L1 | 0.7162 | 0.137 | 5.234 | 0.000 | 0.448 | 0.984 |
| ma.L1 | -0.5923 | 0.154 | -3.839 | 0.000 | -0.895 | -0.290 |
| ar.S.L7 | -0.6458 | 0.038 | -17.046 | 0.000 | -0.720 | -0.572 |
| ar.S.L14 | -0.3111 | 0.036 | -8.746 | 0.000 | -0.381 | -0.241 |
| sigma2 | 11.5409 | 0.570 | 20.240 | 0.000 | 10.423 | 12.658 |

| | | | |
|---|---|---|---|
| Ljung-Box (L1) (Q): | 0.14 | Jarque-Bera (JB): | 41.78 |
| Prob(Q): | 0.71 | Prob(JB): | 0.0 |
| Heteroskedasticity (H): | 1.23 | Skew: | 0.51 |
| Prob(H) (two-sided): | 0.15 | Kurtosis: | 3.8 |


#### 1.3 Average Transaction Amount 

For the example merchant, the stepwise search selects `SARIMA(1,0,0)x(2,1,0,7)`, so we use this specification for the average transaction amount in the final forecasting model

In [9]:
# Seasonal - fit stepwise auto-ARIMA
smodel_avg_transaction_amt = pm.auto_arima(test_data['avg_transaction_amt'], start_p=1, start_q=1,
                         test='adf',
                         max_p=3, max_q=3, m=7,
                         start_P=0, seasonal=True,
                         d=None, D=1,
                         error_action='ignore',  
                         suppress_warnings=True, 
                         stepwise=True)

smodel_avg_transaction_amt.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | No. Observations: | 597 |
| Model: | SARIMAX(1, 0, 0)x(2, 1, 0, 7) | Log Likelihood | -2660.671 |
| Date: | Thu, 22 Sep 2022 | AIC | 5329.343 |
| Time: | 19:41:32 | BIC | 5346.863 |
| Sample: | 0 - 597 | HQIC | 5336.168 |
| Covariance Type: | opg | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| ar.L1 | -0.0868 | 0.047 | -1.858 | 0.063 | -0.178 | 0.005 |
| ar.S.L7 | -0.6994 | 0.032 | -22.029 | 0.000 | -0.762 | -0.637 |
| ar.S.L14 | -0.3865 | 0.033 | -11.792 | 0.000 | -0.451 | -0.322 |
| sigma2 | 480.1014 | 22.742 | 21.110 | 0.000 | 435.527 | 524.676 |

| | | | |
|---|---|---|---|
| Ljung-Box (L1) (Q): | 0.0 | Jarque-Bera (JB): | 53.14 |
| Prob(Q): | 1.0 | Prob(JB): | 0.0 |
| Heteroskedasticity (H): | 0.76 | Skew: | 0.49 |
| Prob(H) (two-sided): | 0.06 | Kurtosis: | 4.09 |


### **2. Deploy Model for Forecasting**

- For merchants with too little data, or too little variation in their data, a time series model cannot be fitted, so we use the average of their observed data as the forecast
- For all other merchants, we fit 3 `SARIMA` models per merchant, one for each key performance indicator
    - We forecast the next year (365 days, since we have at most about 2 years of history) and average the daily forecasts
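
The daily reindexing and the fallback to a historical mean can be sketched on a toy series as follows. The sample values, the helper `is_static`, and its `min_distinct` threshold are hypothetical and chosen only for illustration; the cells below apply their own thresholds per indicator.

```python
import pandas as pd

# Toy merchant with transactions on only three days (made-up values)
observed = pd.Series(
    [20.0, 35.0, 50.0],
    index=pd.to_datetime(["2021-03-01", "2021-03-02", "2021-03-05"]),
    name="transaction_count",
)

# Reindex to a regular daily frequency; days without sales become 0
daily = observed.asfreq("D").fillna(0)


def is_static(series: pd.Series, min_distinct: int = 2) -> bool:
    """Hypothetical helper: True if the series has too few distinct values to model."""
    return series.nunique() < min_distinct


# Fallback forecast for a static series: its historical daily mean
fallback = daily.mean()
print(daily.tolist(), is_static(daily), fallback)
```

Filling missing days with 0 treats a day without transactions as a genuine zero, which lowers the historical mean compared with averaging only the observed days.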

In [34]:
# Flag merchants whose series vary too little to fit a SARIMA model
static_merchant = set()
for merchant in tqdm(merchants):

    merchant_data = ts_pd.loc[ts_pd['merchant_abn'] == merchant]
    merchant_data = merchant_data.set_index('order_datetime')
    # Reindex to daily frequency; days without transactions become 0
    merchant_data = merchant_data.asfreq('D').fillna(0)

    if (merchant_data['mean_take_amt'].nunique() == 1
            or merchant_data['transaction_count'].nunique() == 1
            or merchant_data['avg_transaction_amt'].nunique() <= 4):
        static_merchant.add(merchant)

100%|██████████| 4422/4422 [00:27<00:00, 162.39it/s]


In [39]:
forecast = []

for merchant in tqdm(merchants):

    merchant_data = ts_pd.loc[ts_pd['merchant_abn']==merchant]
    merchant_data = merchant_data.set_index('order_datetime')
    merchant_data = merchant_data.asfreq('D').fillna(0)

    # Single observed day: spread that day's values over a 365-day year
    if(len(merchant_data) == 1):
        forecast.append([merchant, merchant_data['mean_take_amt'].iloc[0]/365,
                                    merchant_data['transaction_count'].iloc[0]/365,
                                    merchant_data['avg_transaction_amt'].iloc[0]/365])
        continue

    # Too little variation for SARIMA: use the historical daily mean
    if(merchant in static_merchant):
        forecast.append([merchant, np.mean(merchant_data['mean_take_amt']),
                                    np.mean(merchant_data['transaction_count']),
                                    np.mean(merchant_data['avg_transaction_amt'])])
        continue

    # Fit one SARIMA model per KPI, using the orders selected in Section 1
    take_amt = sm.tsa.ARIMA(merchant_data['mean_take_amt'], order=(1, 0, 0), seasonal_order=(2, 1, 0, 7))
    take_amt_fit = take_amt.fit()

    transaction_count = sm.tsa.ARIMA(merchant_data['transaction_count'], order=(1, 0, 1), seasonal_order=(2, 1, 0, 7))
    transaction_count_fit = transaction_count.fit()

    transaction_amt = sm.tsa.ARIMA(merchant_data['avg_transaction_amt'], order=(1, 0, 0), seasonal_order=(2, 1, 0, 7))
    transaction_amt_fit = transaction_amt.fit()

    # Average the 365 daily forecasts into a single value per KPI
    take_amt_forecast = np.mean(take_amt_fit.get_forecast(steps=365).predicted_mean)
    transaction_count_forecast = np.mean(transaction_count_fit.get_forecast(steps=365).predicted_mean)
    transaction_amt_forecast = np.mean(transaction_amt_fit.get_forecast(steps=365).predicted_mean)

    forecast.append([merchant, take_amt_forecast, transaction_count_forecast, transaction_amt_forecast])

100%|██████████| 4422/4422 [10:01:53<00:00,  8.17s/it]    


### **3. Output Forecast Results**

In [40]:
pd.DataFrame(forecast, columns = ['merchant_abn', 'mean_take_amt', 'transaction_count', 'avg_transaction_amt']).to_csv('../data/curated/future_predictions.csv')
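
Because `to_csv` is called without `index=False`, the file includes the DataFrame's row index as an unnamed first column. A downstream reader could handle it as sketched below; the in-memory buffer and the sample rows are made up for illustration and stand in for the real file.

```python
import io

import pandas as pd

# Made-up forecast rows with the same column layout as the notebook's output
forecast = [[10000000001, 1.5, 3.0, 42.0], [10000000002, 2.5, 1.0, 17.0]]
columns = ["merchant_abn", "mean_take_amt", "transaction_count", "avg_transaction_amt"]

# Write to an in-memory buffer; the row index becomes the first CSV column
buffer = io.StringIO()
pd.DataFrame(forecast, columns=columns).to_csv(buffer)
buffer.seek(0)

# index_col=0 consumes the unnamed index column when reading back
loaded = pd.read_csv(buffer, index_col=0)
print(loaded.columns.tolist())
```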