### <center style="background-color:Gainsboro; width:70%;">Store Sales: Using the average of the last 16 days</center>

In the [Store Sales - Time Series Forecasting](https://www.kaggle.com/c/store-sales-time-series-forecasting) Getting Started competition we are tasked with building a model that predicts the unit sales for thousands of items sold at different Corporación Favorita stores. Specifically, we are asked to predict the sales for the 16 days from 2017-08-16 to 2017-08-31. In this short notebook we build a very simple model: for each family of product sold in each store, the prediction is the average of the sales over a 'look-back' window of the same length (16 days) immediately before the date range we will be predicting.

In [None]:
import numpy  as np
import pandas as pd

# read in the data
train = pd.read_csv("../input/store-sales-time-series-forecasting/train.csv")
test  = pd.read_csv("../input/store-sales-time-series-forecasting/test.csv")

# select the last 16 days of the training data (2017-07-31 to 2017-08-15)
train_16_days = train.query("date >= '2017-07-31' ")

def exp_mean_ln(df):
    # average the sales in log(1 + y) space, then transform back
    return np.expm1(np.mean(np.log1p(df['sales'])))

# calculate the average values
train_average = train_16_days.groupby(['store_nbr', 'family']).apply(exp_mean_ln).to_dict()
test['sales'] = test.set_index(['store_nbr', 'family']).index.map(train_average.get)

# create and write out the submission.csv file
submission = pd.DataFrame({'id': test.id, 'sales': test.sales})
submission.to_csv('submission.csv', index=False)

##### **Note regarding calculating the average**
From the competition [evaluation page](https://www.kaggle.com/c/store-sales-time-series-forecasting/overview/evaluation) we see that the metric we are using is the root mean squared logarithmic error (RMSLE), which is given by

$$ {\mathrm {RMSLE}}\,(y, \hat y) = \sqrt{ \frac{1}{n} \sum_{i=1}^n \left(\log (1 + \hat{y}_i) - \log (1 + y_i)\right)^2} $$

where $\hat{y}_i$ is the predicted value of the target for instance $i$, and $y_i$
is the actual value of the target for instance $i$.
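
As a minimal sketch, the formula above can be written directly with NumPy. The function name `rmsle` and the sample values are illustrative assumptions and are not part of the competition code.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error over all instances."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # np.log1p(x) computes log(1 + x), matching the terms in the formula
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# made-up values, for illustration only
y_true = np.array([0.0, 10.0, 1000.0])
y_pred = np.array([0.0, 12.0, 900.0])
print("RMSLE: %.4f" % rmsle(y_true, y_pred))
```

A perfect prediction gives a score of zero, and the score grows as the ratio $(1 + \hat{y}_i)/(1 + y_i)$ moves away from one.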

It is important to note that, unlike the RMSE, the RMSLE is asymmetric: it penalizes an underestimate much more heavily than an overestimate of the same absolute size. For example, if the correct value is $y_i = 1000$, then underestimating by 600 is almost twice as bad as overestimating by 600:

In [None]:
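def RSLE(y_hat, y):
    # for a single instance the RMSLE reduces to the absolute log error
    return np.sqrt((np.log1p(y_hat) - np.log1p(y))**2)

print("The RMSLE score is %.3f" % RSLE( 400, 1000))  # underestimate by 600
print("The RMSLE score is %.3f" % RSLE(1600, 1000))  # overestimate by 600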

The asymmetry arises because 

$$ \log (1 + \hat{y}_i) - \log (1 + y_i) =  \log \left( \frac{1 + \hat{y}_i}{1 + y_i} \right) $$

so we are essentially looking at ratios, rather than at differences as is the case for the RMSE. We can see the form that this asymmetry takes in the following plot, again using 1000 as our ground truth value:

In [None]:
import matplotlib.pyplot as plt
# apply the style first, so that it does not override the settings below
plt.style.use('fivethirtyeight')
plt.rcParams.update({'font.size': 16})
plt.rcParams["figure.figsize"] = (7, 4)
x = np.linspace(5,4000,100)
plt.plot(x, RSLE(x,1000))
plt.xlabel('prediction')
plt.ylabel('RMSLE')
plt.show()
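
To make the ratio argument concrete, the short sketch below shows that scaling both the prediction and the true value by the same factor leaves the error almost unchanged, whereas the same absolute error of 600 on a larger true value gives a much smaller score. The helper `log_error` is a hypothetical name introduced here for illustration; it mirrors the single-instance error used above.

```python
import numpy as np

def log_error(y_hat, y):
    # absolute log error for a single prediction
    return np.abs(np.log1p(y_hat) - np.log1p(y))

# same ratio (roughly 0.4), different scale
same_ratio_small = log_error(400, 1000)
same_ratio_large = log_error(4000, 10000)

# same absolute error (600), larger true value
same_diff_large = log_error(9400, 10000)

print("same ratio : %.3f vs %.3f" % (same_ratio_small, same_ratio_large))
print("same error : %.3f vs %.3f" % (same_ratio_small, same_diff_large))
```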

With this in mind, ideally one should not calculate the mean directly, but rather first take $\log_e(1+y)$ of the target (in this case the target is the `sales`) using the numpy [`log1p`](https://numpy.org/doc/stable/reference/generated/numpy.log1p.html) function. Once we have done this we can proceed to calculate the [`mean`](https://numpy.org/doc/stable/reference/generated/numpy.mean.html) of this transformed target, and finally invert the transform using the numpy [`expm1`](https://numpy.org/doc/stable/reference/generated/numpy.expm1.html) function to obtain our average value. This is what the `exp_mean_ln` function above does.
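
As a quick sanity check of this reasoning, the sketch below compares the arithmetic mean with the log-transformed mean when each is used as a constant prediction for a small, made-up sample. The sample values and the helper `rmsle` are illustrative assumptions. Because the mean of the `log1p` values minimizes the squared error in log space, the log-transformed mean should never score worse under the RMSLE.

```python
import numpy as np

def rmsle(y_true, y_pred):
    # root mean squared logarithmic error
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# made-up sales values for one store and family, for illustration only
sales = np.array([0.0, 2.0, 5.0, 40.0, 300.0])

arithmetic_mean = np.mean(sales)
log_mean = np.expm1(np.mean(np.log1p(sales)))

# score each candidate as a constant prediction for every day
score_arithmetic = rmsle(sales, np.full_like(sales, arithmetic_mean))
score_log = rmsle(sales, np.full_like(sales, log_mean))

print("arithmetic mean %.2f -> RMSLE %.3f" % (arithmetic_mean, score_arithmetic))
print("log mean        %.2f -> RMSLE %.3f" % (log_mean, score_log))
```

For skewed data such as this sample, the log-transformed mean is noticeably smaller than the arithmetic mean, since it is less sensitive to a few very large values.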

### <center style="background-color:Gainsboro; width:60%;">Related notebooks</center>
* [Store Sales: Naive one-day model](https://www.kaggle.com/carlmcbrideellis/store-sales-naive-one-day-model)

### <center style="background-color:Gainsboro; width:60%;">Recommended reading</center>
* [Rob J. Hyndman and George Athanasopoulos "*Forecasting: Principles and Practice*", (3rd Edition)](https://otexts.com/fpp3/)
* [Fotios Petropoulos, *et al.* "*Forecasting: Theory and Practice*", arXiv:2012.03854 (2020)](https://arxiv.org/pdf/2012.03854.pdf)