## Goal  ⛳️

* Use a Bayesian model to forecast daily sales and estimate the posterior interval

### Why a Bayesian model?

* In this competition we can draw on a lot of historical data; by updating the prior belief with it we obtain the posterior, which we use to make a forecast
* A Bayesian model lets us estimate intervals on both the parameters (posterior intervals) and the response variable (posterior predictive intervals)

### Limitations

* Here we only modeled the total sales and their uncertainty
* To make the full prediction we need to scale this up to the full hierarchy (aggregated by state, by store, by department, etc.)
* More features/dimensions could be added
* Initializing the parameters needs a lot of experimentation because of overdispersion
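
The overdispersion mentioned in the last point can be checked directly: a Poisson distribution has variance equal to its mean, so a variance-to-mean ratio far above 1 signals overdispersion. Below is a minimal sketch on synthetic counts rather than the competition data; the helper name `dispersion_index` is hypothetical.

```python
import numpy as np

def dispersion_index(counts):
    """Variance-to-mean ratio; close to 1 for Poisson-distributed counts."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()

rng = np.random.default_rng(0)
poisson_counts = rng.poisson(lam=50, size=5000)
# negative binomial with the same mean (50) but a much larger variance
overdispersed_counts = rng.negative_binomial(n=5, p=5 / 55, size=5000)

print(dispersion_index(poisson_counts))
print(dispersion_index(overdispersed_counts))
```

Applying the same ratio to the daily totals would give a rough idea of how far the data are from the Poisson assumption.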

### Referenced Notebook

https://www.kaggle.com/allunia/m5-uncertainty


## Problem formulation

We model product sales as a Poisson process: the expected number of sales $\lambda$ grows with an exposure variable $t$ (days of operation in this case) at rate $r$.


$$ y_i \mid \beta, X_i \sim \text{indep. Poisson}(\lambda_i) $$
where $$ \lambda_i = r_i t_i $$

$$ \log(\lambda_i) = \log(t_i) + \log(r_i) $$
where $$ \log(r_i) = X_i \beta $$

Here $\log(t_i)$ is called the offset, and ideally $X$ does not include information on the exposure variable $t$.
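
To make the role of the offset concrete, here is a minimal NumPy sketch with made-up numbers (not the competition data). It assumes the log link used in the PyMC3 model code further down, and shows that adding $\log(t)$ to the linear predictor is the same as multiplying the rate by the exposure $t$. The function name `poisson_mean` is hypothetical.

```python
import numpy as np

def poisson_mean(X, beta, t):
    """Poisson mean with a log link and log(t) as an offset."""
    log_r = X @ beta                      # linear predictor on the log scale
    return np.exp(np.log(t) + log_r)      # the offset enters with a fixed coefficient of 1

X = np.array([[1.0, 0.5], [1.0, -0.5], [1.0, 0.0]])  # first column is the intercept
beta = np.array([0.2, 0.3])                           # made-up coefficients
t = np.array([1.0, 10.0, 100.0])                      # exposure

lam = poisson_mean(X, beta, t)
print(lam)
```

Because the offset has no coefficient of its own, the model cannot rescale the effect of $t$; it can only adjust the rate through $X$.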



## Import Data

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt

from  datetime import datetime, timedelta
import gc
import numpy as np
#plt.style.use('ggplot')

In [None]:
%matplotlib inline
import sys
import re

plt.style.use('seaborn-darkgrid')
import seaborn as sns
import patsy as pt
import pymc3 as pm

plt.rcParams['figure.figsize'] = 14, 6
np.random.seed(0)
print('Running on PyMC3 v{}'.format(pm.__version__))

In [None]:
#os.listdir('../m5-forecasting-uncertainty/')

In [None]:
submission = pd.read_csv('../input/m5-forecasting-uncertainty/sample_submission.csv')

In [None]:
#submission.shape

In [None]:
#submission.head()

In [None]:
sale = pd.read_csv('../input/m5-forecasting-uncertainty/sales_train_validation.csv')

In [None]:
#sale.head()

In [None]:
sale.shape

In [None]:
total_historical = sale.iloc[:,6:].sum()

In [None]:
total_historical.shape

In [None]:
calendar = pd.read_csv('../input/m5-forecasting-uncertainty/calendar.csv')

In [None]:
calendar['event_true_1'] = calendar.event_name_1.notna()
calendar['event_true_2'] = calendar.event_name_2.notna()

# 1 if the day has at least one event, 0 otherwise
calendar['event_true_all'] = (calendar.event_true_1 | calendar.event_true_2).astype('int')
calendar['date'] = pd.to_datetime(calendar.date)

In [None]:
#calendar.dtypes

In [None]:
#calendar.columns

In [None]:
calendar['d_parse'] = calendar.d.apply(lambda x: int(x.split('_')[1]))

In [None]:
#calendar.head()

In [None]:
calendar_feature = calendar[['wm_yr_wk', 'wday', 'month', 'year', \
       'snap_CA', 'snap_TX', 'snap_WI', \
       'event_true_all', 'd_parse']]

In [None]:
calendar_feature.dtypes

## Build Bayesian model in Pymc3

In [None]:
# specify formula
fml = 'total ~ wday + month + year + snap_CA + snap_TX + snap_WI + event_true_all + d_parse'

#### Standardize data
To help with model convergence, it is better to standardize the data first

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer
scaler = StandardScaler()
#minmax = MinMaxScaler()

In [None]:
calendar_feature = calendar[['wm_yr_wk', 'wday', 'month', 'year', \
       'snap_CA', 'snap_TX', 'snap_WI', \
       'event_true_all', 'd_parse']]

scaled_feature = pd.DataFrame(scaler.fit_transform(calendar_feature))
scaled_feature.columns = calendar_feature.columns
scaled_feature.min()
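
For reference, `StandardScaler` with its default settings subtracts each column's mean and divides by its population standard deviation. A small NumPy sketch on toy values (not the calendar data) shows the same computation:

```python
import numpy as np

# toy stand-ins for two feature columns, e.g. wday and year
toy = np.array([[1.0, 2011.0], [4.0, 2013.0], [7.0, 2016.0]])

# z-score each column using the population standard deviation (ddof=0)
standardized = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(standardized.mean(axis=0))
print(standardized.std(axis=0))
```

Note that the scaler above is fitted on every calendar row, including the forecast horizon, so training and forecast rows share one scale.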

#### Correct outliers

Outliers seemed to heavily affect how the model converged, so days with total sales below 10,000 are replaced with the 2.5% quantile of the series

In [None]:
np.where(total_historical < 10000)[0]

In [None]:
total_historical.iloc[[ 330,  696, 1061, 1426, 1791]]=np.quantile(total_historical, 0.025)

In [None]:
np.min(total_historical)
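
The hard-coded positions above can also be derived from the threshold itself. Below is a minimal sketch on a synthetic series, assuming the same rule (values below a cutoff are replaced by the 2.5% quantile). `replace_low_outliers` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def replace_low_outliers(series, threshold, q=0.025):
    """Return a copy where values below `threshold` are set to the q-quantile."""
    cleaned = series.astype(float).copy()
    fill_value = np.quantile(cleaned, q)   # quantile is computed before any replacement
    cleaned[cleaned < threshold] = fill_value
    return cleaned

rng = np.random.default_rng(0)
toy_sales = pd.Series(rng.normal(30000, 2000, size=200).round())
toy_sales.iloc[[50, 150]] = [12, 15]       # two near-zero days

cleaned_sales = replace_low_outliers(toy_sales, threshold=10000)
print(cleaned_sales.min())
```

The rule only works when outliers make up less than the chosen quantile of the data; otherwise the fill value is itself an outlier.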

#### Create model and update it using MCMC

In [None]:
#minmax_feature.iloc[:1913,9]

In [None]:
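# create data frame (copy so the assignments below do not raise SettingWithCopyWarning)
df = scaled_feature.iloc[:1913,:].copy()
df.loc[:,'total'] = total_historical.values
# restore the unscaled day counter so that log(d_parse) is a valid offset
df.loc[:, 'd_parse'] = calendar_feature['d_parse'].iloc[:1913] - np.min(calendar_feature.d_parse) + 1
df.head()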

In [None]:
(mx_en, mx_ex) = pt.dmatrices(fml, df, return_type='dataframe', NA_action='raise')
pd.concat((mx_ex.head(3),mx_ex.tail(3)))


In [None]:
with pm.Model() as mdl_first:

    # define priors, weakly informative Normal
    # here we tried to remove all the time variable and 
    # treat all these as 'attributes' of data rather than the exposure
    b0 = pm.Normal('b0_intercept', mu=0, sigma=1)
    b2 = pm.Normal('b2_wday', mu=0, sigma=1)
    b3 = pm.Normal('b3_month', mu=0, sigma=1)
    b4 = pm.Normal('b4_year', mu=0, sigma=1)
    b5 = pm.Normal('b5_snapCA', mu=0, sigma=1)
    b6 = pm.Normal('b6_snapTX', mu=0, sigma=1)
    b7 = pm.Normal('b7_snapWI', mu=0, sigma=1)
    b8 = pm.Normal('b8_event_true_all', mu=-0.01, sigma=1)

    # define linear model and exp link function
    theta = (b0 +
             b2 * mx_ex['wday'] +
             b3 * mx_ex['month'] +
             b4 * mx_ex['year'] +
             b5 * mx_ex['snap_CA'] +
             b6 * mx_ex['snap_TX'] +
             b7 * mx_ex['snap_WI'] +
             b8 * mx_ex['event_true_all'] +
             np.log(mx_ex['d_parse']))  ## log(t) enters as an offset

    ## Define Poisson likelihood
    y = pm.Poisson('y', mu=np.exp(theta), observed=mx_en['total'].values)

In [None]:
with mdl_first:
    trace = pm.sample(1000, tune=2000, init='adapt_diag', target_accept =.8)

In [None]:
mdl_first.check_test_point()

In [None]:
## helper function from pymc documentation
def strip_derived_rvs(rvs):
    '''Convenience fn: remove PyMC3-generated RVs from a list'''
    ret_rvs = []
    for rv in rvs:
        if not (re.search('_log',rv.name) or re.search('_interval',rv.name)):
            ret_rvs.append(rv)
    return ret_rvs


def plot_traces_pymc(trcs, varnames=None):
    ''' Convenience fn: plot traces with overlaid means and values '''

    nrows = len(trcs.varnames)
    if varnames is not None:
        nrows = len(varnames)

    ax = pm.traceplot(trcs, var_names=varnames, figsize=(12,nrows*1.4),
                      lines=tuple([(k, {}, v['mean'])
                                   for k, v in pm.summary(trcs, var_names=varnames).iterrows()]))

    for i, mn in enumerate(pm.summary(trcs, var_names=varnames)['mean']):
        ax[i,0].annotate('{:.2f}'.format(mn), xy=(mn,0), xycoords='data',
                         xytext=(5,10), textcoords='offset points', rotation=90,
                         va='bottom', fontsize='large', color='#AA0022')

In [None]:
rvs_fish = [rv.name for rv in strip_derived_rvs(mdl_first.unobserved_RVs)]
pm.summary(trace, var_names=rvs_fish)

#### Results

The posterior parameters have been estimated with very little variance. r_hat is the [Gelman-Rubin statistic for convergence](https://www.stata.com/new-in-stata/gelman-rubin-convergence-diagnostic/); a value of r_hat close to 1 indicates that the simulated chains have converged. Judging by the effective sample size (ess), this is not yet a good estimate, but for now we will use it to estimate the **posterior interval**.
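
As a rough illustration of what r_hat measures, the classic Gelman-Rubin statistic compares between-chain and within-chain variance. The sketch below implements that textbook formula with NumPy on synthetic chains. It is a simplification: the value reported by `pm.summary` may differ, since recent ArviZ versions use a rank-normalized, split-chain variant. The function name `gelman_rubin` is hypothetical.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic R-hat for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_draws = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()            # W: mean within-chain variance
    between = n_draws * chains.mean(axis=1).var(ddof=1)   # B: scaled variance of chain means
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(pooled / within)

rng = np.random.default_rng(0)
mixed = rng.normal(0.0, 1.0, size=(4, 1000))              # all chains sample one distribution
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])    # two chains sit in another region

print(gelman_rubin(mixed))
print(gelman_rubin(stuck))
```

Chains that explore the same distribution give a value near 1, while chains stuck in different regions push it well above 1.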


In [None]:
pm.plot_trace(trace)

## Sample posterior predictive parameters

In [None]:
with mdl_first:
    pp_trace = pm.sample_posterior_predictive(trace, var_names=rvs_fish, samples=4000)

## Create submission data set

Since we only modeled total sales, we use only the submission rows for the total series (ids containing `Total`)

In [None]:
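# copy so that overwriting d_parse does not raise SettingWithCopyWarning
df_2 = scaled_feature.iloc[1913:,:].copy()
total_id = [i for i in submission.id if 'Total' in i]
# change d_parse back to the unscaled day number
df_2['d_parse']= calendar_feature.iloc[1913:,:].d_parse.values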


In [None]:
df_2.d_parse.max()

In [None]:
submission_validation = df_2.iloc[:28, :]
submission_evaluation = df_2.iloc[28:, :]
submission_validation.shape,submission_evaluation.shape

## Use posterior predictive parameters to estimate the posterior interval of Y (uncertainty)

In [None]:
pp_trace.keys()

In [None]:
pp_trace['b0_intercept']

In [None]:
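# map each sampled coefficient to the feature column it multiplies
coef_to_feature = {
    'b2_wday': 'wday', 'b3_month': 'month', 'b4_year': 'year',
    'b5_snapCA': 'snap_CA', 'b6_snapTX': 'snap_TX', 'b7_snapWI': 'snap_WI',
    'b8_event_true_all': 'event_true_all',
}

def return_y(row):
    '''Return the sampled Poisson rates for one calendar row'''
    log_rate = 1*pp_trace['b0_intercept']  # 1* creates a new array, pp_trace stays untouched
    for coef, feature in coef_to_feature.items():
        log_rate += pp_trace[coef]*row[feature]
    # add the log(t) offset and apply the exp link
    return np.exp(log_rate + np.log(row['d_parse']))

validation_y = np.zeros((28, 4000))
evaluation_y = np.zeros((28, 4000))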

In [None]:
submission_validation.iloc[0].index

In [None]:
#submission_evaluation.iloc[0]

In [None]:
for row in range(len(submission_validation)):
    validation_y[row, :] = return_y(submission_validation.iloc[row])
    evaluation_y[row, :] = return_y(submission_evaluation.iloc[row])
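
Note that `return_y` yields draws of the Poisson rate, so quantiles taken from it describe uncertainty in the mean only. One possible way to also include the observation noise is to draw one Poisson count per rate sample before taking quantiles. The sketch below does this on synthetic rates, not on `validation_y`:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-in for posterior draws of the rate: 28 days x 4000 draws
rate_draws = rng.normal(30000.0, 100.0, size=(28, 4000))

# one simulated count per rate draw adds Poisson noise on top of parameter uncertainty
count_draws = rng.poisson(rate_draws)

rate_interval = np.quantile(rate_draws, [0.025, 0.975], axis=1)
count_interval = np.quantile(count_draws, [0.025, 0.975], axis=1)

rate_width = rate_interval[1] - rate_interval[0]
count_width = count_interval[1] - count_interval[0]
print(rate_width.mean())
print(count_width.mean())
```

The interval on the simulated counts is wider than the interval on the rate, which matters when the submission is scored on quantiles of the observed sales.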

In [None]:
np.mean(validation_y)

In [None]:
np.mean(total_historical)

In [None]:
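## organize the data
# the quantile level is the third underscore-separated field of the id
total_qt = [float(i.split('_')[2]) for i in total_id]

# copy to avoid SettingWithCopyWarning when adding the 'qt' column
total_only_submission = submission[submission.id.isin(total_id)].copy()

total_only_submission['qt']=total_qt

# drop the old index so it is not written out as an extra column
total_only_submission.reset_index(drop=True, inplace=True)

total_only_submission.loc[:7]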

In [None]:
for i in range(1,29):
    col_name = 'F' + str(i)
    # rows 0-8 hold the validation quantiles, rows 9 onward the evaluation quantiles
    total_only_submission.loc[:8,col_name] = np.quantile(validation_y[i-1], total_qt[:9])
    total_only_submission.loc[9:,col_name] = np.quantile(evaluation_y[i-1], total_qt[:9])
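
The per-column loop can also be expressed as a single `np.quantile` call along the sample axis. Below is a minimal sketch on synthetic draws, assuming the nine quantile levels of the competition's submission file; the array `draws` and the name `quantile_table` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
draws = rng.gamma(shape=50.0, scale=600.0, size=(28, 4000))   # synthetic: 28 days x 4000 draws
levels = [0.005, 0.025, 0.165, 0.25, 0.5, 0.75, 0.835, 0.975, 0.995]  # assumed quantile levels

# axis=1 takes quantiles over the draws; the result has one row per level
quantile_table = pd.DataFrame(
    np.quantile(draws, levels, axis=1),
    index=levels,
    columns=['F' + str(i) for i in range(1, 29)],
)
print(quantile_table.shape)
```

Each row of `quantile_table` then corresponds to one quantile level and each column to one forecast day.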

In [None]:
total_only_submission

In [None]:
total_only_submission.to_csv('total_submission.csv', index=False)

## For improvement...

* Add more features
* Add the other levels of the hierarchy
* Better priors: getting the MCMC chains to converge was tricky because of overdispersion, and I had to tune the priors several times to reach good convergence
