# Calculation of Evaluation Period Weights

This notebook uses the same approach as my earlier [kernel for HTS in Pandas with fast loading](https://www.kaggle.com/christoffer/pandas-multi-indices-for-hts-fast-loading-etc).  

See that notebook for more info and some sample models (seasonal naïve, PyTorch neural net, simple ensemble). (This is pretty much just a fork with most of the irrelevant bits cut out.)

The data for the competition consists primarily of 30490 time series of sales data for 3049 items sold in 10 different stores in 3 states.  The items are classified as being in one of 3 categories that are further subdivided into a total of 7 departments.

In this notebook, each individual time series is represented as a column in a data frame indexed by the day (`d`).

For the individual (level 12) series, we'll index the columns by `(state_id, store_id, cat_id, dept_id, item_id)`.
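
As a rough illustration of this layout, here is a minimal sketch with a toy frame. The three series and their sales figures are made up for the example; only the level names match the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 days, 3 series (ids are made up for illustration).
columns = pd.MultiIndex.from_tuples(
    [
        ("CA", "CA_1", "FOODS", "FOODS_1", "FOODS_1_001"),
        ("CA", "CA_1", "HOBBIES", "HOBBIES_1", "HOBBIES_1_001"),
        ("TX", "TX_1", "FOODS", "FOODS_1", "FOODS_1_001"),
    ],
    names=["state_id", "store_id", "cat_id", "dept_id", "item_id"],
)
toy = pd.DataFrame(
    np.array([[1, 0, 2], [3, 1, 0], [0, 4, 5]], dtype=float),
    index=pd.RangeIndex(1, 4, name="d"),
    columns=columns,
)

# Select every series in one state (the outermost column level).
ca = toy["CA"]

# Select every series in one category, regardless of state or store.
foods = toy.xs("FOODS", axis=1, level="cat_id")
```

Because every grouping is a column level, selecting or summing by state, store, category, department, or item needs no joins against a separate metadata table.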

In [None]:
import numpy as np
import pandas as pd
import csv
from collections import defaultdict

## Create dataset

Using Pandas directly to read the data and reshape it appears to be a bit slow and uses a significant amount of memory.  Instead we'll read the data line by line and store it in NumPy arrays (but we'll try and keep the rest of the code in the notebook nicely vectorized and high-level =).

In [None]:
SALES = "../input/m5-forecasting-accuracy/sales_train_evaluation.csv"
PRICES = "../input/m5-forecasting-accuracy/sell_prices.csv"
CALENDAR = "../input/m5-forecasting-accuracy/calendar.csv"

# SALES = "../data/raw/sales_train_validation.csv"
# PRICES = "../data/raw/sell_prices.csv"
# CALENDAR = "../data/raw/calendar.csv"

NUM_SERIES = 30490
NUM_TRAINING = 1941
NUM_TEST = NUM_TRAINING + 1 * 28

In [None]:
series_ids = np.empty(NUM_SERIES, dtype=object)
item_ids = np.empty(NUM_SERIES, dtype=object)
dept_ids = np.empty(NUM_SERIES, dtype=object)
cat_ids = np.empty(NUM_SERIES, dtype=object)
store_ids = np.empty(NUM_SERIES, dtype=object)
state_ids = np.empty(NUM_SERIES, dtype=object)

In [None]:
qties = np.zeros((NUM_TRAINING, NUM_SERIES), dtype=float)
sell_prices = np.zeros((NUM_TEST, NUM_SERIES), dtype=float)

### Importing and reshaping sales data

Each row in the sales data starts with six columns: an id for the series followed by the five levels item, department, category, store, and state. The remaining columns hold the daily sales quantities.

In [None]:
%%time
id_idx = {}
with open(SALES, "r", newline='') as f:
    is_header = True
    i = 0
    for row in csv.reader(f):
        if is_header:
            is_header = False
            continue
        series_id, item_id, dept_id, cat_id, store_id, state_id = row[0:6]
        # Remove '_validation/_evaluation' at end by regenerating series_id
        series_id = f"{item_id}_{store_id}"

        qty = np.array(row[6:], dtype=float)

        series_ids[i] = series_id

        item_ids[i] = item_id
        dept_ids[i] = dept_id
        cat_ids[i] = cat_id
        store_ids[i] = store_id
        state_ids[i] = state_id

        qties[:, i] = qty

        id_idx[series_id] = i

        i += 1

### Importing calendar data

The calendar data has information about which day of the week a given day is, if there are any special events, and most importantly for this notebook, which week (`wm_yr_wk`) the day is in.  We'll need this to get the prices of items, which in turn is necessary in order to calculate the weights we need for estimating our scores.

In [None]:
%%time
wm_yr_wk_idx = defaultdict(list)  # map each wm_yr_wk to its list of day numbers
with open(CALENDAR, "r", newline='') as f:
    for row in csv.DictReader(f):
        d = int(row['d'][2:])
        wm_yr_wk_idx[row['wm_yr_wk']].append(d)
        # TODO: Import the rest of the data

### Importing price data

The price data describes the weekly prices for each item in every store.

In [None]:
%%time
with open(PRICES, "r", newline='') as f:
    is_header = True
    for row in csv.reader(f):
        if is_header:
            is_header = False
            continue
        store_id, item_id, wm_yr_wk, sell_price = row
        series_id = f"{item_id}_{store_id}"
        series_idx = id_idx[series_id]
        for d in wm_yr_wk_idx[wm_yr_wk]:
            sell_prices[d - 1, series_idx] = float(sell_price)
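
To make the week-to-day expansion concrete, here is a small self-contained sketch of the same idea on in-memory data. The two CSV snippets, the week numbers, and the names `week_to_days`, `series_index`, and `daily_prices` are all hypothetical stand-ins for the real files and arrays.

```python
import csv
import io
from collections import defaultdict

import numpy as np

# Hypothetical miniature calendar: two weeks of two days each.
calendar_csv = io.StringIO(
    "d,wm_yr_wk\n"
    "d_1,11101\n"
    "d_2,11101\n"
    "d_3,11102\n"
    "d_4,11102\n"
)
# Hypothetical price file: one item in one store, one price per week.
prices_csv = io.StringIO(
    "store_id,item_id,wm_yr_wk,sell_price\n"
    "CA_1,FOODS_1_001,11101,2.0\n"
    "CA_1,FOODS_1_001,11102,2.5\n"
)

# Map each week to the day numbers it contains.
week_to_days = defaultdict(list)
for row in csv.DictReader(calendar_csv):
    week_to_days[row["wm_yr_wk"]].append(int(row["d"][2:]))

# Copy each weekly price to every day of that week (days are 1-based).
series_index = {"FOODS_1_001_CA_1": 0}
daily_prices = np.zeros((4, 1))
for row in csv.DictReader(prices_csv):
    column = series_index[f"{row['item_id']}_{row['store_id']}"]
    for d in week_to_days[row["wm_yr_wk"]]:
        daily_prices[d - 1, column] = float(row["sell_price"])
```

Since the price array starts out as zeros, any day without a price row for a series simply keeps a price of zero.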

### Building DataFrame

We'll store the dataset in two dataframes:

- **`qty_ts`:** sales data.
- **`price_ts`:** prices.

In [None]:
qty_ts = pd.DataFrame(qties,
                      index=range(1, NUM_TRAINING + 1),
                      columns=[state_ids, store_ids,
                               cat_ids, dept_ids, item_ids])

qty_ts.index.names = ['d']
qty_ts.columns.names = ['state_id', 'store_id',
                        'cat_id', 'dept_id', 'item_id']

price_ts = pd.DataFrame(sell_prices,
                        index=range(1, NUM_TEST + 1),
                        columns=[state_ids, store_ids,
                                 cat_ids, dept_ids, item_ids])
price_ts.index.names = ['d']
price_ts.columns.names = ['state_id', 'store_id',
                          'cat_id', 'dept_id', 'item_id']

And if we look at the data, we see how the series are organized into columns:

In [None]:
qty_ts

In [None]:
price_ts

## Aggregation

In this competition, our models are evaluated on 12 different levels defined by combinations of the groupings of the series.  

It is important that we can aggregate our time series, e.g., calculate the total sales in each state, so that we can evaluate a model's per-store item sales forecasts on every level.

The levels used in the competition are:

In [None]:
LEVELS = {
    1: [],
    2: ['state_id'],
    3: ['store_id'],
    4: ['cat_id'],
    5: ['dept_id'],
    6: ['state_id', 'cat_id'],
    7: ['state_id', 'dept_id'],
    8: ['store_id', 'cat_id'],
    9: ['store_id', 'dept_id'],
    10: ['item_id'],
    11: ['state_id', 'item_id'],
    12: ['item_id', 'store_id']
}

Pandas views all column levels as independent, but here they are not; all series with the same `dept_id` belong to the same `cat_id`, for example.  When grouping our columns, we'll also keep any coarser groupings.

In [None]:
COARSER = {
    'state_id': [],
    'store_id': ['state_id'],
    'cat_id': [],
    'dept_id': ['cat_id'],
    'item_id': ['cat_id', 'dept_id']
}
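
The effect of keeping the coarser groupings can be sketched on a toy frame. This is only an illustration: `grouping_keys` is a hypothetical helper, the series are made up, and the sketch groups the transposed frame rather than using `axis=1`, which recent pandas versions deprecate in `groupby`.

```python
import numpy as np
import pandas as pd

# Copy of the mapping above, repeated so the sketch is self-contained.
COARSER = {
    "state_id": [],
    "store_id": ["state_id"],
    "cat_id": [],
    "dept_id": ["cat_id"],
    "item_id": ["cat_id", "dept_id"],
}


def grouping_keys(*groupings):
    """Hypothetical helper: coarser levels first, then the requested groupings."""
    keys = []
    for grouping in groupings:
        keys += COARSER[grouping]
    return keys + list(groupings)


# Hypothetical toy data: 2 days, 4 series.
columns = pd.MultiIndex.from_tuples(
    [
        ("CA", "CA_1", "FOODS", "FOODS_1", "FOODS_1_001"),
        ("CA", "CA_1", "FOODS", "FOODS_2", "FOODS_2_001"),
        ("TX", "TX_1", "FOODS", "FOODS_1", "FOODS_1_001"),
        ("TX", "TX_1", "HOBBIES", "HOBBIES_1", "HOBBIES_1_001"),
    ],
    names=["state_id", "store_id", "cat_id", "dept_id", "item_id"],
)
toy = pd.DataFrame(
    np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=float),
    index=pd.RangeIndex(1, 3, name="d"),
    columns=columns,
)

# Grouping by department also keeps the category level.
keys = grouping_keys("dept_id")
by_dept = toy.T.groupby(level=keys).sum().T
```

Here `keys` is `['cat_id', 'dept_id']`, so each aggregated column still records which category its department belongs to.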

In [None]:
def aggregate_all_levels(df):
    levels = []
    for i in range(1, max(LEVELS.keys()) + 1):
        level = aggregate_groupings(df, i, *LEVELS[i])
        levels.append(level)
    return pd.concat(levels, axis=1)

def aggregate_groupings(df, level_id, grouping_a=None, grouping_b=None):
    """Aggregate time series by summing over optional levels

    New columns are named according to the m5 competition.

    :param df: Time series as columns
    :param level_id: Numeric ID of level
    :param grouping_a: Grouping to aggregate over, if any
    :param grouping_b: Additional grouping to aggregate over, if any
    :return: Aggregated DataFrame with columns as series id:s
    """
    if grouping_a is None and grouping_b is None:
        new_df = df.sum(axis=1).to_frame()
    elif grouping_b is None:
        new_df = df.groupby(COARSER[grouping_a] + [grouping_a], axis=1).sum()
    else:
        assert grouping_a is not None
        new_df = df.groupby(COARSER[grouping_a] + COARSER[grouping_b] +
                            [grouping_a, grouping_b], axis=1).sum()

    new_df.columns = _restore_columns(df.columns, new_df.columns, level_id,
                                      grouping_a, grouping_b)
    return new_df

A small complication is that Pandas doesn't align during column-wise concatenation, i.e., if two dataframes have some different column levels, `pd.concat` does not match levels that are the same between the frames.

For now, the easiest solution is to add back the levels we lost when grouping.

In [None]:
def _restore_columns(original_index, new_index, level_id, grouping_a, grouping_b):
    original_df = original_index.to_frame()
    new_df = new_index.to_frame()
    for column in original_df.columns:
        if column not in new_df.columns:
            new_df[column] = None

    # Set up `level` column
    new_df['level'] = level_id

    # Set up `id` column
    if grouping_a is None and grouping_b is None:
        new_df['id'] = 'Total_X'
    elif grouping_b is None:
        new_df['id'] = new_df[grouping_a] + '_X'
    else:
        assert grouping_a is not None
        new_df['id'] = new_df[grouping_a] + '_' + new_df[grouping_b]

    new_index = pd.MultiIndex.from_frame(new_df)
    # Remove "unnamed" level if no grouping
    if grouping_a is None and grouping_b is None:
        new_index = new_index.droplevel(0)
    new_levels = ['level'] + original_index.names + ['id']
    return new_index.reorder_levels(new_levels)

A quick peek at the aggregated sales data:

In [None]:
aggregate_all_levels(qty_ts)

## Evaluation

### Weights

The scoring weights each series, on every level, by its share of that level's total sales (quantity times sell price) over the final month of the training data.

In [None]:
def calculate_weights(totals):
    """Calculate weights from total sales.

    Uses all data in the dataframe, so remember to calculate total sales
    (quantity times sell price) and select the relevant period first.

    :param totals: Total sales
    :return: Series of weights with (level, *_id, id:) as multi-index
    """
    summed = aggregate_all_levels(totals).sum()
    
    return summed / summed.groupby(level='level').sum()
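
The normalisation can be checked by hand on a tiny example. The following is a minimal sketch with made-up sales figures for three series and only two grouping levels. It uses plain pandas rather than the functions in this notebook.

```python
import pandas as pd

# Hypothetical total sales (quantity * price), already summed over the period.
dollar_sales = pd.Series(
    [30.0, 10.0, 60.0],
    index=pd.MultiIndex.from_tuples(
        [("CA", "A"), ("CA", "B"), ("TX", "A")],
        names=["state_id", "item_id"],
    ),
)

# Bottom level: each series' share of all sales.
series_weights = dollar_sales / dollar_sales.sum()

# State level: aggregate first, then take shares of the same grand total.
state_totals = dollar_sales.groupby(level="state_id").sum()
state_weights = state_totals / state_totals.sum()
```

The weights sum to 1 within each level, and a series with no sales in the period gets a weight of zero on its own level.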

> **NB.** I'm writing this notebook while the public leaderboard is based on the actual final month (strictly speaking, the final 28-day period) of the training data, so the weights are actually calculated using the month before that.  A bit confusing, I know.

In [None]:
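# .loc slices by label and includes both endpoints, so this selects exactly
# the last 28 days of the training data.
final_month_totals = (qty_ts.loc[NUM_TRAINING - 28 + 1:NUM_TRAINING] *
                      price_ts.loc[NUM_TRAINING - 28 + 1:NUM_TRAINING])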

weights = calculate_weights(final_month_totals)
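
When selecting the final 28 days, keep in mind that `.loc` slices by label and includes both endpoints, unlike positional slicing. A minimal sketch with a hypothetical 40-day series:

```python
import pandas as pd

# Hypothetical series indexed by day number 1..40.
n = 40
days = pd.Series(range(1, n + 1), index=pd.RangeIndex(1, n + 1, name="d"))

# Label-based: both the start and the stop label are included.
last_28 = days.loc[n - 28 + 1:n]

# Position-based equivalent: the final 28 rows.
same = days.iloc[-28:]
```

Both selections cover days 13 through 40, which is 28 rows.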

We need to save the weights to a CSV, since that is the purpose of this notebook.

In [None]:
weights_export = weights.transpose()\
        .reset_index(level=['level', 'state_id', 'store_id', 'cat_id', 'dept_id', 'item_id'],
                    drop=True)
weights_export.to_csv("weights.csv")