# **Predict Future Sales**
## *Exploratory Data Analysis and Preprocessing*

This challenge serves as the final project for the ["How to win a data science competition"](https://www.coursera.org/learn/competitive-data-science/home/welcome) Coursera course.

In [this competition](https://www.kaggle.com/c/competitive-data-science-predict-future-sales) we will work with a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms, [1C Company](http://1c.ru/eng/title.htm). 

The task is to **predict total sales for every product and store in the next month**. I have previously translated and processed the datasets in my notebook [[Translation/Text Processing] Predict Future Sales](https://www.kaggle.com/tymecd/translation-text-processing-predict-future-sales?scriptVersionId=84610744). The organizers also tell us that submissions are evaluated by root mean squared error (RMSE) and that true **target values are clipped to the [0, 20] range**.
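
As a minimal sketch of the metric described above, the snippet below computes RMSE after clipping values to [0, 20]. The function name `clipped_rmse` and the toy numbers are illustrative assumptions, not part of the competition code.

```python
import numpy as np

def clipped_rmse(y_true, y_pred, lower=0, upper=20):
    """RMSE after clipping both targets and predictions to [lower, upper]."""
    y_true = np.clip(np.asarray(y_true, dtype=float), lower, upper)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), lower, upper)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# toy monthly counts, made up for illustration
y_true = [0, 3, 25, 1]
y_pred = [1, 3, 40, -2]
score = clipped_rmse(y_true, y_pred)
print(score)
```

Because of the clipping, a prediction of 40 for a true value of 25 costs nothing here: both end up at 20.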

In this notebook we will continue with **understanding and preparing the data** for the subsequent modeling part. The competition is marked as **_Playground_** type, and I can see that it is indeed a playground. There are **endless possibilities** with these datasets. We'll start with first approaches, and add more explorations and studies in following versions.

### Tasks covered
- [x] Process Russian texts and create new variables
- [x] Perform some EDA and data mining for deeper understanding, and prepare data
- [ ] Create model and evaluate results

### Content
* [1. Libraries](#1.-Libraries).
* [2. Datasets profiling and description](#2.-Datasets-profiling-and-description).
    + [2.1. Items, categories and shops catalogues](#2.1.-Items,-categories-and-shops-catalogues)
    + [2.2. Sales and test sets](#2.2.-Sales-and-test-sets)
* [3. Preprocessing](#3.-Preprocessing).
    + [3.1. Outliers](#3.1.-Outliers)
        + [3.1.1. Item price](#3.1.1.-Item-price)
        + [3.1.2. Item count](#3.1.2.-Item-count)
    + [3.2. Cartesian product](#3.2.-Cartesian-product)
        + [3.2.1. Reducing memory usage](#3.2.1.-Reducing-memory-usage)
        + [3.2.2. Imputing new items](#3.2.2.-Imputing-new-items)
    + [3.3. Feature engineering](#3.3.-Feature-engineering)
        + [3.3.1. Shops time-series clustering](#3.3.1.-Shops-time-series-clustering)
        + [3.3.2. Items RFM (Recency, Frequency and Monetary value)](#3.3.2.-Items-RFM-(Recency,-Frequency-and-Monetary-value))
        + [3.3.3. Mean-based, trends, and other features](#3.3.3.-Mean-based,-trends,-and-other-features)
    + [3.4. Quantitative variable transformations](#3.4.-Quantitative-variable-transformations)
    + [3.5. Encoding categorical variables](#3.5.-Encoding-categorical-variables)
    + [3.6. Dividing dataset](#3.6.-Dividing-dataset)
* [4. Visualization summary](#4.-Visualization-summary).
    
* [Saving files](#Saving-files).

## 1. Libraries

In [None]:
! pip install tslearn

In [None]:
# system
import os
import warnings

# preprocessing
import pandas as pd
import numpy as np
from itertools import product
from calendar import monthrange
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from sklearn import preprocessing

# data mining
import pandas_profiling as pf
from tslearn.clustering import TimeSeriesKMeans, silhouette_score
import scipy.stats as stats
import statsmodels.api as sm
from collections import Counter

# visualization
import ipywidgets as widgets
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('notebook', rc={"axes.titlesize":14, "axes.labelsize":13})
sns.set_style('white')
#alt_palette = ["#111425", "#554946", "#006DE4", "#438BD0", "#AD997A", "#00565C", 
#               "#EA9C39", "#AD6B3E"]
#sns.set_palette(alt_palette)
sns.palplot(sns.color_palette()) #default

## 2. Datasets profiling and description

Now, let's dive into the data! We start by loading the competition's datasets and inspecting their basic properties. The first step must always be to perform some EDA, not only to understand the data but also to start detecting possible problems.

#### File descriptions
* `sales_train.csv` - the training set. Daily historical data from January 2013 to October 2015.
* `test.csv` - the test set. We need to forecast the sales for these shops and products for November 2015.
* `items.csv` - supplemental information about the items/products.
* `item_categories.csv`  - supplemental information about the items categories.
* `shops.csv`- supplemental information about the shops.

#### Original data fields
* `ID` - an Id that represents a (Shop, Item) tuple within the test set
* `shop_id` - unique identifier of a shop
* `item_id` - unique identifier of a product
* `item_category_id` - unique identifier of item category
* `item_cnt_day` - number of products sold. We are predicting a monthly amount of this measure
* `item_price` - current price of an item
* `date` - date in format dd.mm.yyyy
* `date_block_num` - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
* `item_name` - name of item
* `shop_name` - name of shop
* `item_category_name` - name of item category
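
The `date_block_num` convention above (January 2013 is 0, October 2015 is 33) can be sketched as a simple conversion. The helper names below are hypothetical and only illustrate the arithmetic.

```python
def to_date_block_num(year, month, base_year=2013):
    """Consecutive month number, with January of base_year mapped to 0."""
    return (year - base_year) * 12 + (month - 1)

def from_date_block_num(block, base_year=2013):
    """Inverse mapping: block number back to (year, month)."""
    years, month0 = divmod(block, 12)
    return base_year + years, month0 + 1

print(to_date_block_num(2013, 1))
print(to_date_block_num(2015, 10))
print(to_date_block_num(2015, 11))  # the month to forecast
```

Under this convention the test month, November 2015, corresponds to block 34.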

In [None]:
folder = 'input/competitive-data-science-predict-future-sales'
path = f'../{folder}'
files = os.listdir(path)
print(files)

In [None]:
# original datasets (read by file name: the order returned by os.listdir is not guaranteed)
items_original = pd.read_csv(os.path.join(path, 'items.csv'))
categories_original = pd.read_csv(os.path.join(path, 'item_categories.csv'))
shops_original = pd.read_csv(os.path.join(path, 'shops.csv'))

# pretranslated datasets
items = pd.read_csv('../input/items-english/items_english.csv')
categories = pd.read_csv('../input/categories-english/categories_english.csv')
shops = pd.read_csv('../input/shops-english/shops_english.csv')

# sales and test
train = pd.read_csv(os.path.join(path, 'sales_train.csv'))
test = pd.read_csv(os.path.join(path, 'test.csv'))

### 2.1. Items, categories and shops catalogues

The three catalogues have neither missing data nor duplicated primary keys, so that's great. Nevertheless, a closer look at the shops dataset reveals three pairs of overlapping shops, with similar but slightly different names. Computing sales by shop ID corroborates this, so we will correct those IDs when preprocessing.

The main (potential) inconvenience is that all **names were given in Russian**, and, at least for now, that's not a language I'm fluent in. 😅 As I stated in a previous notebook, names may contain a lot of information about items and shops, and they will surely provide major insight towards creating a model for predicting sales. We've **already cleaned and translated all names to English**. We've also created new variables with provisional IDs, so we'll use those processed datasets. You can check the code for doing so in my notebook linked in the introduction above, [[Translation/Text Processing] Predict Future Sales](https://www.kaggle.com/tymecd/translation-text-processing-predict-future-sales?scriptVersionId=84610744).

In [None]:
for df in [items, categories, shops]:
    print('\nOriginal dataset:')
    if df is items:
        display(items_original.head(3))
    elif df is categories:
        display(categories_original.head(3))
    else:
        display(shops_original.head(3))
    print('Processed dataset:')
    display(df.head(3))
    print(df.shape)
    print('\nMissing values: \n', df.isna().sum())
    print('\nUnique values: \n', df.nunique())

We will now join items and categories, and continue by inspecting the sales and test sets before aggregating all the information. We also use this cell to group some categories that contain few items, which we identified in later inspections after merging with the sales data.

In [None]:
# grouping some categories with low quantity of items before computing new features
categories['category_name'] = (categories['category_name']
                               .apply(lambda x: 
                                      'other games' if x in ['android games', 'mac games'] else 
                                      ('blank media' if 'blank media' in x else 
                                       ('service' if 'tickets' in x else x))))
categories['subcategory_name'] = (categories['subcategory_name']
                                  .apply(lambda x: 
                                         'tickets' if 'tickets' in x else
                                         ('blank media' if 'blank media' in x else x)))

# overriding provisional IDs
categories['category_id'] = preprocessing.LabelEncoder().fit_transform(categories.category_name
                                                                           .values).astype('int8')
categories['subcategory_id'] = preprocessing.LabelEncoder().fit_transform(categories.subcategory_name
                                                                              .values).astype('int8')

items_categories = items.merge(categories, on='item_category_id', how='left')
display(items_categories.sample(5))  # explicit display: a bare expression is only shown when it is the last line of a cell
print('New number of categories: ', categories.category_id.nunique(), 
      '\nNew number of subcategories: ', categories.subcategory_id.nunique())

### 2.2. Sales and test sets

We can see that there are no missing data in either dataset, but the **test set** lacks the `item_price` column, as it merely consists of the **cartesian product of 42 shops and 5100 items** in November 2015, with 214200 total shop-item pairs. Also, there are more shops in the training set than in the test set, and the quantile ranges and standard deviations hint at the presence of outliers in the `item_price` and `item_cnt_day` variables. We'll see them in detail in a subsequent report. After doing some more research, we'll start by creating a **new dataset** adding all the supplemental information, and later **build train, validation and test sets from it**.
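
The claim that the test set is a full cartesian product can be checked with a few lines. The sketch below uses a toy frame and a hypothetical helper `is_full_grid`; on the real data one would pass `test` instead.

```python
from itertools import product

import pandas as pd

def is_full_grid(df, keys=('shop_id', 'item_id')):
    """True if df contains every combination of the unique values of `keys`."""
    keys = list(keys)
    n_expected = 1
    for key in keys:
        n_expected *= df[key].nunique()
    n_pairs = len(df[keys].drop_duplicates())
    return n_pairs == n_expected

# toy stand-in for the test set: 2 shops x 3 items
toy_test = pd.DataFrame(list(product([2, 5], [10, 11, 12])),
                        columns=['shop_id', 'item_id'])

print(is_full_grid(toy_test))            # complete grid
print(is_full_grid(toy_test.iloc[:-1]))  # one pair removed
print(42 * 5100)                         # pairs expected in the real test set
```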

In [None]:
it = 0
for df in [train, test]:
    if it == 0:
        print('----- Sales (Train) -----')
    else:
        print('----- Test -----')
    display(df.head());
    print(df.info(show_counts=True))
    display(df.describe());
    print('Unique values: \n', df.nunique())
    
    it +=1

This next profile report gives us quick and **further insight** into the variables. We see that we have 6 duplicate rows, and **histograms** tell us that there are some shops and items that have had much higher sales than others. Also, **outliers** are indeed present in both `item_price` and `item_cnt_day` variables, making them also highly skewed, with some of them looking like accounting or annotation errors. We'll have to take a closer look at those when preprocessing data.

In [None]:
# profile report with pandas profiling
pf.ProfileReport(train)

Neither the total number of shops nor the total number of items is constant over time, as can be seen in the following graph. To reflect the test data, we'll also create a cartesian product of shops and items (mostly active, as we'll see later) every month for the train set, so this variation over time won't be such a cumbersome problem. Notwithstanding, we'll eliminate products and shops that seem to no longer be active. Let's dig deeper and inspect those included in the test set (i.e. November 2015) vs those not included.

In [None]:
# plot number of unique shops and items over time
fig, ax1 = plt.subplots(figsize=(7,4))

ax1.set_xlabel('Month block')
ax1.set_ylabel('# of shops')
ax1.plot(train.set_index('date').groupby('date_block_num').nunique()[['shop_id']], 
         color='darkgreen', label='Shops')
ax1.legend(loc='upper left')

ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis

ax2.set_ylabel('# of items')  # we already handled the x-label with ax1
ax2.plot(train.set_index('date').groupby('date_block_num').nunique()[['item_id']], 
         color='lightgreen', alpha=.9, label='Items')
ax2.legend(loc='upper right')
ax2.set_title('Unique shops and items')

fig.tight_layout()  # otherwise the right y-label is slightly clipped
plt.show()

We can see that shops included in the test set behave differently from those not included. **In-test shops (ITS)**, as we'll call them, follow a **quasi-constant trend over time**, tilting upwards, with an oscillating 80%-100% remaining active, although the number of items sold is decreasing. **Out-of-test shops (OOTS)**, on the contrary, **behave very differently**, with a **decreasing trend** over time from 60% of active shops to a mere 10%. The number of items sold also decreases sharply. In the second graph, we can see that sales ratios also have different magnitudes. This confirms the need to remove these seemingly inactive shops from the sales dataset before building a model, as they could add unnecessary bias. However, we won't simply eliminate those not present in the test set, since this would be a case of incorporating information from the future. We'll use inference for this.

🔬 One **possible line of investigation** that is derived from here can be to divide the dataset into groups, maybe one for active shops and another for inactive shops -- either by means of a clustering algorithm or a rule-based segmentation incorporating their trends -- and develop separate models for each group. It could be a case of **stacking a classification algorithm with a regressor**. This could also help in the detection of shops that won't have very much success, or that are decreasing their revenue.

In [None]:
# plot percentage of unique shops and number of items over time
fig, (ax1, ax3) = plt.subplots(1, 2, figsize=(15,4))
ax1.set_xlabel('Month block')
ax1.set_ylabel('% of active shops')
ax1.plot(train[~train.shop_id.isin(test.shop_id.unique())]
         .set_index('date').groupby('date_block_num').nunique()[['shop_id']]*100/
         len(train[~train.shop_id.isin(test.shop_id.unique())].shop_id.unique()), 
         color='darkblue', label='Out-of-test shops (OOTS)')
ax1.legend(loc='lower left')

ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis

ax2.set_ylabel('# of items')  # we already handled the x-label with ax1
ax2.plot(train[~train.shop_id.isin(test.shop_id.unique())]
         .set_index('date').groupby('date_block_num').nunique()[['item_id']], 
         color='lightgrey', alpha=.8, label='OOTS Items')
ax2.legend(loc='upper right')

ax1.plot(train[train.shop_id.isin(test.shop_id.unique())]
         .set_index('date').groupby('date_block_num').nunique()[['shop_id']]*100/len(test.shop_id.unique()), 
         color='red', label='In-test shops (ITS)')
ax1.legend(loc='lower left')

ax2.plot(train[train.shop_id.isin(test.shop_id.unique())]
         .set_index('date').groupby('date_block_num').nunique()[['item_id']], 
         color='pink', alpha=.8, label='ITS Items')
ax2.legend(loc='upper right')
ax2.set_title('Active shops and unique items')

fig.tight_layout()  # otherwise the right y-label is slightly clipped

ax3.plot(train[~train.shop_id.isin(test.shop_id.unique())]
         .set_index('date').groupby('date_block_num').sum()[['item_cnt_day']]/
         len(train[~train.shop_id.isin(test.shop_id.unique())].shop_id.unique()), 
         color='blue', linestyle='dashed', label='OOTS Sales Ratio')

ax3.plot(train[train.shop_id.isin(test.shop_id.unique())]
         .set_index('date').groupby('date_block_num').sum()[['item_cnt_day']]/len(test.shop_id.unique()), 
         color='orange', linestyle='dashed', label='ITS Sales Ratio')
ax3.legend(loc='upper right')
ax3.set_xlabel('Month block')
ax3.set_ylabel('# of sales')
ax3.set_title('Total sales per shop')

fig.tight_layout()  # otherwise the right y-label is slightly clipped
plt.show()

The following figures tell us even more. There are **363 items for which we won't have any sales history**, so we will have to infer it from similar items (we'll see later how). Even though there are no new shops, when computing shop-item pairs, we're left with a total of 102796 new shop-item pairs. Of those, 15246 new pairs are due to new items, and the rest are simply due to the cartesian product. So, the only thing we have to do is to **infer the information for those items**, as we have everything we need for shops.

In [None]:
print('Number of shops not included in train (new shops): ', 
      test[~test.shop_id.isin(train.shop_id.unique())].shop_id.nunique())
print('Number of items not included in train (new items): ',
      test[~test.item_id.isin(train.item_id.unique())].item_id.nunique())
print('\nNumber of shop/item pairs to predict: ', 
      len(test.groupby(['shop_id', 'item_id']).count()))
merged_df = (train[['shop_id', 'item_id']].drop_duplicates()
             .merge(test, on=['shop_id', 'item_id'], how='right', indicator=True))
print('Number of new shop/item pairs: ', 
      len(merged_df[merged_df['_merge'] == 'right_only']))
print('    Number of new shop/item pairs due to new items: ', 
      len(merged_df[(merged_df['_merge'] == 'right_only') & (~merged_df.item_id.isin(train.item_id.unique()))]))
print('Number of shop/item pairs in both sets: ', 
      len(merged_df[merged_df['_merge'] == 'both']))

We will also pay attention to **items that seem to be outdated** when evaluating our model. Depending on the item, it would be anomalous to have a non-zero prediction for them. Besides, it depends on our definition of outdated, but there are items that have not even been sold for the last 2 years. If the model does not capture them, we could add some sort of "business rule". Also, there are 4 items in train that have not been sold in any month of the whole dataset and are not present in test, so we'll just delete them.

🔬 There's a simple **customer segmentation model** in retail sales called **RFM**, which stands for Recency, Frequency, and Monetary value, that could be helpful in this project. We'll explain it later when performing feature engineering, but it basically consists of executing a segmentation of customers based on the recency of their purchases, the frequency with which they buy, and the money they spend. We could make an **inverse analogy with items** to add interesting features for a model.
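
To make the item-level analogy concrete, here is a minimal sketch of RFM computed per item on toy data. The function `item_rfm`, the `current_block` parameter and the exact definitions (recency in months since the last sale, frequency as the number of months with sales, monetary value as total revenue) are assumptions for illustration; the features built later may be defined differently.

```python
import pandas as pd

def item_rfm(df, current_block):
    """Per-item Recency, Frequency and Monetary value from monthly sales rows."""
    grouped = df.groupby('item_id')
    return pd.DataFrame({
        # months elapsed since the item was last sold
        'recency': current_block - grouped['date_block_num'].max(),
        # number of distinct months in which the item was sold
        'frequency': grouped['date_block_num'].nunique(),
        # total revenue generated by the item
        'monetary': grouped['revenue'].sum(),
    })

# toy data, made up for illustration
toy = pd.DataFrame({
    'item_id':        [1, 1, 1, 2],
    'date_block_num': [30, 31, 33, 5],
    'revenue':        [100.0, 50.0, 25.0, 999.0],
})
rfm = item_rfm(toy, current_block=33)
print(rfm)
```

An item with high recency and low frequency, like item 2 here, is a natural candidate for the "outdated" group discussed above.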

In [None]:
def pivot_df(df, index, values, columns, agg=np.sum, fill=0):
    """Pivot dataframe and arrange levels.
    
    Parameters
    ----------
    df: pandas dataframe
    index: list
        Index for pivoted df.
    values: list
        Column to aggregate.
    columns: str
        Column with levels to expand as new columns.
    agg: numpy function
        Function for aggregation.
    fill: object
        Value for filling empty values.
    """
    pivoted = df.pivot_table(index=index, values=values, 
                             columns=columns, aggfunc=agg, fill_value=fill).reset_index()
    pivoted.columns = pivoted.columns.droplevel()
    pivoted = pivoted.set_index('').rename_axis(index[0], axis=1)
    
    return pivoted

sales_by_item_id = pivot_df(train, ['item_id'], ['item_cnt_day'], 'date_block_num')

# calculating "outdated items"
for blocknum in [0, 10, 22, 28]:
    
    outdated_items = sales_by_item_id[sales_by_item_id.iloc[:, blocknum:].sum(axis=1) == 0]

    print(f'Number of items that have not been sold for the past {len(outdated_items.columns) - blocknum} months (outdated items): ',
          len(outdated_items))
    print('Number of these outdated items included in test set: ',
          test[test.item_id.isin(outdated_items.index)].item_id.nunique(), '\n')

In the table below we can see five shops whose sales stay below the overall mean of monthly shop sales in every month. Some are very recent. Shop 36, in particular, was only added in October 2015, the month before the test period, which will be used for validation. For this kind of recent shop, we may have to include the prediction of some other model, possibly based on similarity scores, or use the help of clustering algorithms, as we said earlier. On the other hand, shops with trends decreasing to zero are not present in the test set (OOTS), as the test set only consists of active shops in November 2015.

In [None]:
# computing sales by shop
sales_by_shop_id = pivot_df(train, ['shop_id'], ['item_cnt_day'], 'date_block_num')

# sales lower than mean for month
sales_by_shop_id[sales_by_shop_id < sales_by_shop_id.mean(axis=1).mean()].dropna().tail()

The following graphs show sales by shop over time, grouped by the shop types we created when processing their names. For the first three types, we can see a similar behaviour, with major sales around the last quarter of each year and spikes in the months of December, coinciding with shopping for New Year's and Christmas Eve in Russia. We can see that shops deemed special and online behave very differently. The only sales in special shops occur during this high-demand season, as they are possibly set up for these events, and online shops show an increasing demand over time.

In [None]:
long_sales_by_shop_id = (sales_by_shop_id.reset_index()
                   .melt(id_vars='', var_name='date_block_month', value_name='sales')
                   .merge(shops[['shop_id', 'shop_type_name']], right_on='shop_id', left_on='', how='left'))
                          
g = sns.FacetGrid(long_sales_by_shop_id,
                   col = "shop_type_name", col_wrap = 5, aspect = 1.2, sharey = False, sharex = True, despine = False);
g.map_dataframe(sns.lineplot, x='date_block_month', 
                y='sales', hue='shop_id').set_axis_labels("Month number", "# of sales");
g.set_titles("Shop type: {col_name}");

All of this means that our shop-item sales time series are, as expected, **heteroskedastic and non-stationary, with strong seasonal components and varying trends**, being both trend-stationary and difference-stationary. As a further remark, it therefore follows that our series could be made stationary after de-trending and then taking differences between seasons, but we won't be testing for a definitive answer, as the methods we'll use to forecast do not assume this property of stationarity and the complexity of our data would hinder accurate computations. Nonetheless, we'll **engineer features to account for these components**. 
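
As a hedged illustration of the remark above, the sketch below builds a synthetic monthly series with a linear trend and a December spike, then applies seasonal differencing followed by a first difference. It uses differencing only (no explicit de-trending step), and the series is invented for the example rather than taken from the competition data.

```python
import numpy as np
import pandas as pd

months = np.arange(48)                      # four years of monthly observations
trend = 2.0 * months                        # linear upward trend
seasonal = np.tile([0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 10, 30], 4)  # year-end spike
series = pd.Series(trend + seasonal, index=months, dtype=float)

# difference between the same month of consecutive years:
# removes the period-12 pattern and turns the linear trend into a constant
seasonal_diff = series.diff(12)

# first difference of the result removes that remaining constant
stationary = seasonal_diff.diff()

print(seasonal_diff.dropna().unique())
print(stationary.dropna().unique())
```

Real shop-item series are far noisier and mostly zeros, which is one reason the notebook relies on engineered features instead of formal stationarity tests.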

💡 In addition, our data shows a **clear hierarchical structure**, in which the lower levels (items) are nested within the higher-level groups (shops). However, not only are our series hierarchical, but also **grouped**, as we can have multiple levels of detail: item category, price range, etc; thus, our disaggregating factors are both nested and crossed. As such, having this grouped structure in mind when performing feature engineering and modelling will be paramount. 
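
The grouped structure can be illustrated by aggregating the same toy sales at several levels. The column `item_cnt_month` and the values are hypothetical; the point is only that every level of aggregation adds up to the same total.

```python
import pandas as pd

# toy monthly sales, made up for illustration
toy = pd.DataFrame({
    'shop_id':        [1, 1, 1, 2, 2],
    'item_id':        [10, 11, 12, 10, 12],
    'category_name':  ['games', 'games', 'music', 'games', 'music'],
    'item_cnt_month': [3, 1, 2, 5, 4],
})

by_shop = toy.groupby('shop_id')['item_cnt_month'].sum()            # nested level
by_category = toy.groupby('category_name')['item_cnt_month'].sum()  # crossed level
by_shop_category = toy.groupby(['shop_id', 'category_name'])['item_cnt_month'].sum()
total = toy['item_cnt_month'].sum()

print(by_shop, by_category, by_shop_category, total, sep='\n\n')
```

Aggregates such as these are the kind of quantity that mean-based features at shop, category or shop-category level can be derived from.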

In this section we won't be modifying our datasets, as our aim was to merely introduce them and compute basic statistics. In the following section, we'll **continue to the core task of preprocessing our data**. Let's go!

## 3. Preprocessing

Hands-on work!

In order to make better visualizations and derive greater insights in this section, we'll merge all the information together in the next cell. Nonetheless, this first conglomerate won't be our final dataset. As we stated earlier, we'll make a cartesian product of items and shops for the training set to mirror the test set.

In [None]:
sales_pre = train.merge(items_categories, on='item_id', how='left').merge(shops, on='shop_id', how='left')

As we've seen earlier, our first step will be to update shop IDs, remove items with no sales at all and duplicates in the sales dataset. We have to act as having no information from the future, i.e. from the test set or validation set (which we'll create later with October 2015), so we cannot simply remove shops and items that are not present in test. For a simple mark, we'll compute activity based on the last information in September 2015.
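
One possible way to compute such an activity mark without looking into the future is sketched below. The helper `flag_active`, the three-month window and the toy data are assumptions for illustration; the cutoff `last_block=32` corresponds to September 2015.

```python
import pandas as pd

def flag_active(df, key, last_block, window=3):
    """Boolean Series indexed by `key`: True if the entity has at least one
    sale in the `window` months ending at `last_block` (inclusive)."""
    history = df[df['date_block_num'] <= last_block]   # never look past the cutoff
    recent = history[history['date_block_num'] > last_block - window]
    all_keys = pd.Index(history[key].unique(), name=key)
    return pd.Series(all_keys.isin(recent[key].unique()),
                     index=all_keys, name='is_active')

# toy sales rows, made up for illustration
toy = pd.DataFrame({
    'date_block_num': [10, 31, 32, 20, 33],
    'shop_id':        [1, 2, 2, 3, 4],
})
active = flag_active(toy, 'shop_id', last_block=32)
print(active)
```

Shop 4 only appears after the cutoff, so it is not even listed: nothing from the validation month leaks into the mark.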

In [None]:
sales = sales_pre.copy()

# assigning current shops to old ones
for df in [sales, test]:
    df['shop_id'] = np.where(df.shop_id == 0, 57, 
                             np.where(df.shop_id == 1, 58, 
                                      np.where(df.shop_id == 11, 10, df.shop_id)))

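# eliminating completely outdated items (no net sales over the whole training period)
never_sold_items = sales_by_item_id.index[sales_by_item_id.sum(axis=1) == 0]
sales = sales[~sales.item_id.isin(never_sold_items)]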

# eliminating duplicates
sales = sales.drop_duplicates()

# adding datetime format date for easier processing
sales['datetime_date'] = pd.to_datetime(sales.date, format="%d.%m.%Y")
del sales['date']

### 3.1. Outliers

In the previous report, we saw that variables `item_price` and `item_cnt_day` are heavily right-skewed. Let's visualize them better and inspect them. Most parametric statistics, such as means, are highly sensitive to outliers, and they can really distort an analysis if we don't deal with them well. However, outliers may be legitimate observations and hide lots of insights, so it's important to investigate their nature before deciding what to do with them. What you'll see here is a first approach on dealing with these seemingly anomalous observations.
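
Two common ways of flagging candidate outliers before inspecting them are an upper-quantile cut and a z-score threshold. The sketch below applies both to invented prices; the helper `flag_outliers` and its thresholds are illustrative assumptions, not the rules used in this notebook.

```python
import pandas as pd

def flag_outliers(values, upper_q=0.999, z_thresh=3.0):
    """Flag values above an upper quantile and values with a large |z-score|."""
    s = pd.Series(values, dtype=float)
    z = (s - s.mean()) / s.std(ddof=0)
    return pd.DataFrame({
        'value': s,
        'above_quantile': s > s.quantile(upper_q),
        'high_z': z.abs() > z_thresh,
    })

# toy prices: 99 ordinary items and one extreme value
prices = [100.0] * 50 + [120.0] * 49 + [300000.0]
flags = flag_outliers(prices, upper_q=0.99)
print(flags[flags.above_quantile | flags.high_z])
```

Flagging is only the first step: as argued above, each flagged observation still needs to be inspected before deciding whether to drop, cap or keep it.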

#### 3.1.1. Item price

The maximum item price is 2 orders of magnitude above the upper bound of the interquartile range for all sold items, as can be seen in the introduction. Taking a closer look at it, we can see that it belongs to a single item named _"radmin 3 522 persons"_ bought once. Radmin is a remote access software product that is sold through licenses. One can buy bundles of hundreds of licenses, and it looks like a bundle for 522 persons was bought here. If we look for more Radmin products in the dataset, we can find another item called _"radmin 3 1 person"_ for a price of RUB 1299, which corroborates our hypothesis. Bundles are also probably discounted with respect to the total price. We will eliminate this item from the dataset to yield more accurate predictions.

In [None]:
higher = sales[sales.item_price > sales.item_price.quantile(0.999)] # highest 0.1%
lower = sales[sales.item_price < sales.item_price.quantile(0.001)] # lowest 0.1%
# alternatively one can compute the z scores; depends on the necessity

display(sales[sales.item_name.str.contains('radmin')][['shop_city_name', 'shop_type_name',
                                               'item_name', 'item_id','item_price']].drop_duplicates(['item_name', 
                                                                                                      'item_price']))

# removing outlier with maximum price
sales = sales[sales.item_id != 6066]

In the next two subplots we can see that the distributions of item prices differ widely among categories, with _game consoles_ being the one that contains the largest number of expensive items and has the highest mean (marked as a little cyan triangle in the graphs). For now, we'll remove items above the maximum price in this category; they make up a negligible share of our data (below 0.0001%) and are mere marginal outliers. We'll see how the rest affect the results of the model, and then we'll decide upon different procedures, maybe deleting outliers by category. Nonetheless, we'll mark the top 0.1% for later usage.

🔬 As our variable spans several orders of magnitude, a logarithmic transformation when preparing data for model creation can surely be beneficial. For example, the _gifts_ category is the one with the most price variance, spanning 6 orders of magnitude, which is easier to see in the second graph.
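
A minimal sketch of such a transformation, assuming `np.log1p` as the transform (so that values close to zero stay finite) and `np.expm1` as its inverse. The prices are invented for the example.

```python
import numpy as np

# toy prices spanning several orders of magnitude
prices = np.array([0.5, 10.0, 1_000.0, 300_000.0])

log_prices = np.log1p(prices)   # log(1 + x) compresses the range
restored = np.expm1(log_prices) # exp(x) - 1 recovers the original scale

print(log_prices)
print(restored)
```

The transform is monotonic, so the ordering of prices is preserved while the extreme values stop dominating the scale.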

In [None]:
plt.figure(figsize=(15,4))
plt.subplot(121)
boxplot_sales = sales.reset_index()
ax = sns.boxplot(data = boxplot_sales, 
                 x='category_name', y='item_price', showfliers=True, showmeans=True)
ax.set_xticklabels(ax.get_xticklabels(), rotation=30, ha="right")
plt.xlabel('Item category'); plt.ylabel('Item price [RUB]');
plt.title('Price distribution amongst categories');

plt.subplot(122)
boxplot_sales = sales.reset_index()
ax = sns.boxplot(data = boxplot_sales, 
                 x='category_name', y=np.log10(abs(boxplot_sales.item_price)), showfliers=True, showmeans=True)
ax.set_xticklabels(ax.get_xticklabels(), rotation=30, ha="right")
plt.xlabel('Item category'); plt.ylabel('Log10(Item price)');
plt.title('Price magnitude orders amongst categories');
plt.tight_layout();
plt.show()

In [None]:
# deleting outliers above max price for game consoles category (the maximum itself is kept)
sales = sales[sales.item_price <= max(sales[sales.category_name == 'game consoles'].item_price)]

# marking the highest 0.1% (items that appear in `higher`, computed with quantile 0.999)
sales['highest_price'] = np.where(sales.item_id.isin(higher.item_id.unique()), 'highest', 'regular')

Regarding the lowest prices, we can see that there are items with values even below RUB 1, which is roughly one euro cent at the time of writing, and that do not make any sense. If we compare them to similar items, for example some of the _stuffed toys_ shown in the next table, we can observe that these low prices are undoubtedly due to errors. We'll impute them with the median of their subcategory.

In [None]:
# lowest 0.1% of prices
display(lower.drop_duplicates(['item_price'])[['item_id','item_name', 'category_name',
                                               'subcategory_name', 'item_price']].sort_values(by='item_price').head(3))

# sample for some stuffed toys
display(sales[(sales.item_name.str.contains('cm')) & 
              (sales.subcategory_name == 'stuffed toys')].sample(3)[['item_id','item_name', 'category_name',
                                                             'subcategory_name', 'item_price']])

In [None]:
# imputing with median of subcategory for prices at or below RUB 1
sales['item_price'] = np.where(sales.item_price <= 1, np.nan, sales.item_price)
sales['item_price'] = sales.groupby("subcategory_name")['item_price'].transform(lambda x: x.fillna(x.median()))

#### 3.1.2. Item count

Our first approach for the number of purchases of an item in a day will be to remove those higher than 10e3 counts in one order, as they look influential, and to leave the ones below zero untouched. We'll assume that negative counts are returned items, so that a net amount is obtained when aggregating; ideally, we would ask the provider about this (they could be systemic errors, for example). Analogously to `item_price`, we'll mark the highest 0.1%, but after aggregating by month.

When evaluating the model, we'll see how our approach behaves and maybe think about more elaborate ways of dealing with outliers. It's also important to note that this variable will be clipped to the range [0, 20] when making predictions, and that's why we're being cautious for now with its anomalous values.
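
The effect of treating negative counts as returns can be sketched on toy data: daily rows are summed into a net monthly amount, which is then clipped to [0, 20]. The column name `item_cnt_month` and the numbers are assumptions for illustration.

```python
import pandas as pd

# toy daily sales for one shop and one month; negative counts are returns
daily = pd.DataFrame({
    'date_block_num': [0, 0, 0, 0, 0],
    'shop_id':        [5, 5, 5, 5, 5],
    'item_id':        [1, 1, 2, 3, 3],
    'item_cnt_day':   [2, -1, -1, 15, 10],
})

# net monthly amount per shop-item pair
monthly = (daily
           .groupby(['date_block_num', 'shop_id', 'item_id'], as_index=False)['item_cnt_day']
           .sum()
           .rename(columns={'item_cnt_day': 'item_cnt_month'}))

# same range as the competition target
monthly['item_cnt_month'] = monthly['item_cnt_month'].clip(0, 20)
print(monthly)
```

Item 2 shows why a net amount can still be negative after aggregation: a return without a matching sale in the same month.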

In [None]:
plt.figure(figsize=(15,4))
plt.subplot(131)
boxplot_sales = sales.reset_index()
ax = sns.boxplot(data = boxplot_sales, 
                 x='category_name', y='item_cnt_day', showfliers=True, showmeans=True)
ax.set_xticklabels(ax.get_xticklabels(), rotation=30, ha="right")
plt.xlabel('Item category'); plt.ylabel('Number of sales');
plt.title('Sales amongst categories');

plt.subplot(132)
boxplot_sales = sales.reset_index()
ax = sns.boxplot(data = boxplot_sales, 
                 x='shop_type_name', y=boxplot_sales.item_cnt_day, showfliers=True, showmeans=True)
ax.set_xticklabels(ax.get_xticklabels(), rotation=30, ha="right")
plt.xlabel('Shop type'); ax.set(ylabel=None)
plt.title('Sales amongst shop types');

plt.subplot(133)
boxplot_sales = sales.reset_index()
ax = sns.boxplot(data = boxplot_sales, 
                 x='shop_city_name', y='item_cnt_day', showfliers=True, showmeans=True)
ax.set_xticklabels(ax.get_xticklabels(), rotation=30, ha="right")
plt.xlabel('City'); ax.set(ylabel=None)
plt.title('Sales amongst cities');
plt.tight_layout();
plt.show()

# removing daily counts above 10e3
sales = sales[sales.item_cnt_day <= 10e3]

Also, note the difference between these two lists. The first is not the number of sales, but the number of times the category is repeated across the dataset. _PC games_ is the greatest in total sales, and appearing third in the first list means that it has more bulk orders, and therefore, more outliers, as can be seen in the previous graphs.

In [None]:
print('Top 5 categories (value counts):\n', sales.category_name.value_counts()[:5], '\n')
print('Top 5 categories (sales):\n',
      sales.groupby('category_name')['item_cnt_day'].sum().sort_values(ascending=False)[:5])

### 3.2. Cartesian product

We will now create the cartesian product of shops and items for each month, so that every shop-item pair with no sales within that month gets a 0, and join it with our previous dataframes. This way the train data will be analogous to the test data. We'll also use this subsection to reduce memory usage.
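As a small illustration of the idea, the sketch below builds the monthly shop x item grid for a made-up `toy_sales` frame (all names and values here are hypothetical) and fills unsold pairs with 0.

```python
from itertools import product
import pandas as pd

# Hypothetical monthly sales: only pairs with at least one sale are present
toy_sales = pd.DataFrame({
    "date_block_num": [0, 0, 1],
    "shop_id":        [1, 2, 1],
    "item_id":        [10, 11, 10],
    "item_cnt_month": [3, 1, 2],
})

# For each month, combine every active shop with every active item
rows = []
for block, month in toy_sales.groupby("date_block_num"):
    shops = month.shop_id.unique()
    items = month.item_id.unique()
    rows.extend(product([block], shops, items))

keys = ["date_block_num", "shop_id", "item_id"]
grid = pd.DataFrame(rows, columns=keys)

# Left-join the observed sales; pairs without sales become 0
grid = grid.merge(toy_sales, on=keys, how="left")
grid["item_cnt_month"] = grid["item_cnt_month"].fillna(0).astype("int16")
print(grid)
```

Month 0 has two active shops and two active items, so it contributes four rows, two of which are zeros; month 1 contributes a single row.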

In [None]:
## aggregation
# calculating revenue before aggregation
sales['revenue'] = sales['item_cnt_day'] * sales['item_price']

# changing dates to first day of month before aggregation and adding year and month
sales['date'] = sales.datetime_date.apply(lambda x: x.replace(day=1))
sales['year'] = sales.datetime_date.apply(lambda x: x.year)
sales['month'] = sales.datetime_date.apply(lambda x: x.month)

# creating dict for aggregation
dic = {}
for key in sales.columns:
    vals = {key: 'sum' if key in ['item_cnt_day', 'revenue'] else 'last'}
    dic.update(vals)
del dic['date_block_num'], dic['shop_id'], dic['item_id']

# grouping by month, shop and item
grouped_sales = (sales
                 .groupby(['date_block_num', 'shop_id', 'item_id'])
                 .agg(dic)).reset_index()

# adjusting negative item counts even after aggregation
grouped_sales['item_cnt_day'] = np.where(grouped_sales.item_cnt_day < 0, 0, grouped_sales.item_cnt_day)
grouped_sales['revenue'] = np.where(grouped_sales.item_cnt_day == 0, 0, grouped_sales.revenue)

## product of shops x items within each month
# creating cartesian product for new dataset
df = [] 
for block_num in grouped_sales['date_block_num'].unique():
    active_shops = grouped_sales[grouped_sales['date_block_num'] == block_num].shop_id.unique()
    active_items = grouped_sales[grouped_sales['date_block_num'] == block_num].item_id.unique()
    df.append(np.array(list(product(*[active_shops, active_items, [block_num]]))))

data = pd.DataFrame(np.vstack(df), columns=['shop_id', 'item_id', 'date_block_num'])

# adding test product
test['date_block_num'] = 34
del test['ID']
data = pd.concat([data, test], ignore_index=True, sort=False)
data.count()

In [None]:
%%time
## adding information
# merging on item, shop and month together
data_second = (data.merge(grouped_sales[['shop_id', 'item_id', 'date_block_num',
                                         'item_price', 'item_cnt_day', 'revenue']].drop_duplicates(), 
                                on=['shop_id', 'item_id', 'date_block_num'], how='left'))

# imputing values for shop-item pairs not in sales data for corresponding month but with item information
# revenue and count will be simply 0, whereas item price will be a median of item prices for each item ID
values = {'item_cnt_day': 0, 'revenue': 0, 'item_price': np.nan}
data_second.fillna(value=values, inplace=True)
# imputing price
data_second = data_second.merge(sales.groupby(['date_block_num', 'item_id'])['item_price'].median().reset_index(),
                                on=['date_block_num', 'item_id'], how='left')
data_second['item_price'] = np.where(data_second.item_price_x.isna(), 
                                     data_second.item_price_y, data_second.item_price_x)
del data_second['item_price_x'], data_second['item_price_y']
# filling month 34
data_second['item_price'] = data_second['item_price'].fillna(data_second.groupby(['item_id'])['item_price'].ffill())

# merging on item, shop and month separately
items_cols = list(items_categories.columns)
items_cols.remove('item_category_id')
items_cols.append('highest_price')
shops_cols = list(shops.columns)

data_third = (data_second
              .merge(grouped_sales[items_cols].drop_duplicates(), on='item_id', how='left')
              .merge(grouped_sales[shops_cols].drop_duplicates(), on='shop_id', how='left')
              .merge(grouped_sales[['year', 'month', 'date', 'date_block_num']].drop_duplicates(), 
                     on='date_block_num', how='left')
              .rename(columns={'item_cnt_day': 'item_cnt_month',
                               'item_name_en_number_stopwords': 'item_stopwords',
                               'item_name_en_number_nouns': 'item_nouns',
                               'item_name_en_number_words': 'item_words'}))

# filling month 34
data_third['year'] = data_third['year'].fillna(2015).apply(int).astype('int16')
data_third['month'] = data_third['month'].fillna(11).apply(int).astype('int8')
data_third['date'].fillna(pd.to_datetime("01.11.2015", format="%d.%m.%Y"), inplace=True)

# marking highest 0.1% sales count for after use
data_third['highest_count'] = np.where(data_third.item_cnt_month >= 
                                       data_third.query('item_cnt_month > 0').item_cnt_month.quantile(0.999),
                                       'highest', 'regular')
# saving memory
del data_second, data, grouped_sales, sales

Great, the figures add up. Previously, we saw that the 363 new items incorporated in November 2015 produce 15246 new shop-item pairs and, after the previous operations, we expect exactly that number of observations without any item information: total observations (11127604) - non-null observations (11112358) = 15246. The proportion is significant (~7% of the samples in the test set), so we will have to **impute these values** before continuing. There are various modules that offer imputation alternatives but, as our case is special, we'll take a straightforward approach and fill them with the median or the mode of **similar groups based on our informed variables**.

🔬 When imputing multiple variables with frequent missing values, it is convenient to perform a **sensitivity analysis** to explore whether our imputation system is introducing bias in the data. As our missing values correspond to the test set, we'll have to take this into account when evaluating our model.

In [None]:
data_third.info(show_counts=True)  # `show_counts` replaces the deprecated `null_counts` (pandas >= 1.2)

#### 3.2.1. Reducing memory usage

We will be _downcasting_ our data types in this subsection in order to reduce memory usage and speed up processes.
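As a quick illustration of what downcasting buys us, here is a minimal sketch on a made-up frame using pandas' built-in `pd.to_numeric(..., downcast=...)`. This is not the helper defined in this notebook, only a small stand-in to show the effect on dtypes and memory.

```python
import numpy as np
import pandas as pd

# Hypothetical columns stored with pandas' default 64-bit types
toy = pd.DataFrame({
    "shop_id": np.array([0, 59, 12], dtype="int64"),
    "item_price": np.array([199.0, 2599.5, 59.9], dtype="float64"),
})
before = toy.memory_usage(deep=True).sum()

# Let pandas pick the smallest type that can hold the values
toy["shop_id"] = pd.to_numeric(toy["shop_id"], downcast="integer")
toy["item_price"] = pd.to_numeric(toy["item_price"], downcast="float")
after = toy.memory_usage(deep=True).sum()

print(toy.dtypes)
print(f"{before} bytes -> {after} bytes")
```

The built-in float downcast stops at `float32`; going down to `float16`, as our own function does, saves more memory at the cost of precision.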

In [None]:
def reduce_mem_usage(data):
    """ Iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.
    
    Parameters
    ----------
    data: Pandas dataframe
    """
    start_mem = data.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in data.columns:
        col_type = data[col].dtype
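        if str(col_type).startswith('int'):
            # thresholds are compared against absolute values so that negative entries cannot overflow
            col_max = data[col].abs().max()
            if col_max > 30000:
                new_type = 'int32'
            elif col_max > 100:
                new_type = 'int16'
            else:
                new_type = 'int8'
            data[col] = data[col].astype(new_type)
        elif str(col_type).startswith('float'):
            # float16 overflows above ~65504, so larger magnitudes are kept as float32
            col_max = data[col].abs().max()
            if col_max > 6e4:
                new_type = 'float32'
            else:
                new_type = 'float16'
            data[col] = data[col].astype(new_type)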
        elif str(col_type) == 'object':
            data[col] = data[col].astype('category')

    end_mem = data.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return data

In [None]:
%%time
opt_data = reduce_mem_usage(data_third)
del data_third

#### 3.2.2. Imputing new items

As stated previously, we'll impute values for new items based on the columns from which we can extract some information. We'll use buckets of approximately twenty items based on `item_id` and assign each bucket's median to its new items. Our assumption, based on what we have observed, is that similar items have neighbouring IDs. After some testing, I've decided not to group by other columns, such as shop type, because of neighbouring complexity: the more columns we add, the harder it is to find adjacent neighbours within ID buckets.
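The bucket idea can be sketched on a tiny made-up catalogue (hypothetical IDs and prices, two buckets instead of a thousand): items are binned by `item_id` with `pd.cut`, and each missing price takes the median of its bucket.

```python
import numpy as np
import pandas as pd

# Hypothetical catalogue: two groups of neighbouring IDs with one missing price each
toy = pd.DataFrame({
    "item_id":    [1, 2, 3, 4, 101, 102, 103, 104],
    "item_price": [100.0, 120.0, np.nan, 110.0, 900.0, np.nan, 950.0, 1000.0],
})

# Equal-width ID buckets (2 here, only to keep the toy example readable)
toy["neighbours"] = pd.cut(toy.item_id, 2)

# Fill each missing price with the median of its ID bucket
toy["item_price"] = (toy.groupby("neighbours", observed=True)["item_price"]
                        .transform(lambda s: s.fillna(s.median())))
print(toy)
```

Item 3 receives 110 (the median of its cheap neighbours) and item 102 receives 950, which is the behaviour we want if neighbouring IDs really are similar products.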

In [None]:
%%time

items_cols = [col for col in items_cols if col not in ['item_name_en_number_words',
                                                       'item_name_en_number_stopwords',
                                                       'item_name_en_number_nouns']]
items_cols.extend(['item_stopwords', 'item_words', 'item_nouns', 'item_price'])

def basicImputing(df):
    """Function for imputing new items based on median or mode.
    
    Parameters
    ----------
    df: pandas dataframe
    """
    # this will take the neighbours in 1000 buckets, approx. 22 neighbours in each bucket given total items
    df['neighbours'] = pd.cut(df.item_id, 1000)
    
    # imputing with median for neighbours
    # neighbour ids are similar
    for col in items_cols:
        if df[col].isnull().values.any():
            if str(df[col].dtype) == 'category': # if col is categorical, we take the mode of the group
                df[col] = (df.groupby(['neighbours'])[col]
                           .transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else np.nan)))
            else:
                df[col] = (df.groupby(['neighbours'])[col]
                           .transform(lambda x: x.fillna(x.median())))
        else:
            pass
    
    return df

opt_data = basicImputing(opt_data)
print('Missing values left: ', opt_data.isna().sum().sum())

### 3.3 Feature engineering

This is yet another important step in every data science problem. A model is only as good as the data it is fed, right? Since we're dealing with time series, things get trickier. When computing new features, we have to be careful not to add information to a past moment that wouldn't have been available at that time. Every computation that involves **calculations along the time axis** has to be **performed on a time cross-section basis or on expanding time windows** to **avoid _data leakage_**. Models with data leakage perform well on the training set, and even on the test set, but perform poorly in production, which, in the end, is all that matters.
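To illustrate the difference, the following sketch compares a leaky feature with a leakage-free one on a made-up single-item history (the `toy` frame and the column names `leaky_mean` and `past_mean` are hypothetical).

```python
import pandas as pd

# Hypothetical monthly history of one item
toy = pd.DataFrame({
    "item_id":        [1, 1, 1, 1],
    "date_block_num": [0, 1, 2, 3],
    "item_cnt_month": [4, 0, 2, 6],
}).sort_values(["item_id", "date_block_num"])

# Leaky: the mean over the whole history includes the current and future months
toy["leaky_mean"] = toy.groupby("item_id")["item_cnt_month"].transform("mean")

# Leakage-free: shift by one month, then take an expanding mean of the past only
toy["past_mean"] = (toy.groupby("item_id")["item_cnt_month"]
                       .transform(lambda s: s.shift(1).expanding().mean()))
print(toy)
```

The leaky column is 3.0 in every row, even in month 0, where nothing had been observed yet. The expanding version is missing in month 0 and only ever uses earlier months afterwards.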

#### 3.3.1. Shops time-series clustering

Clustering is the task of grouping together similar objects; hence, it heavily depends on the notion of similarity one relies on. **Time series clustering allows us to cluster series based on their shape along the time axis**, and we'll use it here to find possible groups of shops based on their sales behaviour. We'll measure similarity with the **Euclidean distance**, so it will take into account amplitudes (i.e., sales quantities), but will not be invariant to time shifts. There are other, more complex metrics specifically designed for time series, like DTW (Dynamic Time Warping), that do take the series phase into account, but I do not deem them necessary for our problem. As a first approach, we'll be using the simple **_k_-means algorithm adapted to time series**. As a further remark, since this configuration is very sensitive to outliers, we'll be removing the observations marked as _highest count_ before computing the clusters.
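A minimal NumPy sketch (toy sine series, not the competition data) of what the Euclidean distance does and does not ignore. The helper `znorm` and the three series are made up for illustration. On raw values, a rescaled copy of a series is far from the original; after standardizing, it becomes identical, while a time-shifted copy stays far away.

```python
import numpy as np

def znorm(x):
    """Standardize a series to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()

t = np.arange(12)
base = np.sin(2 * np.pi * t / 12)   # one yearly cycle
scaled = 10 * base + 5              # same shape, larger amplitude and level
shifted = np.roll(base, 3)          # same shape, three months later

d_raw_scaled = np.linalg.norm(base - scaled)
d_std_scaled = np.linalg.norm(znorm(base) - znorm(scaled))
d_std_shifted = np.linalg.norm(znorm(base) - znorm(shifted))

print(f"raw distance to rescaled copy:          {d_raw_scaled:.2f}")
print(f"standardized distance to rescaled copy: {d_std_scaled:.2f}")
print(f"standardized distance to shifted copy:  {d_std_shifted:.2f}")
```

A phase-aware metric such as DTW is designed to reduce the last distance, which is the trade-off discussed above.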

The resulting clusters are indeed interesting. For example, the algorithm managed to single out the special shops category in cluster 6 that we visualized in section 2; so, had we not arranged the _category_ feature, we could have more or less inferred its existence from here. The silhouette score is positive but not very high, and the number of shops per cluster (i.e., cardinality) is uneven, as expected, but the clusters are still insightful. Nonetheless, we won't be using this feature in our model to avoid data leakage, since we're using the whole time range to compute the groups. This is just for analysis purposes.

🔬 In a future version, after some feature engineering, we may try to cluster our whole dataset (not just the shops' behaviour) using a static snapshot in time (cross-section) of each available month, with a more robust technique such as PAM (Partitioning Around Medoids), which is similar to _k_-means but uses medoids instead of centroids, and visualize the result with t-SNE in a low-dimensional space.

In [None]:
# euclidean k-means
print("Euclidean time series k-means:")

# only train info
opt_train = opt_data.query('date_block_num < 33')

# removing highest outliers 
opt_data_clust = opt_train.query('highest_count == "regular"')

# computing new sales by shop
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    opt_data_clust['item_cnt_month'] = opt_data_clust.item_cnt_month.astype('int16')
    
sales_by_shop_id = pivot_df(opt_data_clust, ['shop_id'], ['item_cnt_month'], 'date_block_num')

# shifting one month so as to use previous sales data every month
shop_shifted_sales = sales_by_shop_id.shift(1, axis=1).dropna(axis=1)

# standardizing series (mean 0, std 1)
X_train = shop_shifted_sales.reset_index().drop([''], axis=1)
scaled = TimeSeriesScalerMeanVariance(mu=0., std=1.).fit_transform(X_train.values)
data_train = pd.DataFrame(np.squeeze(scaled), columns=X_train.columns, index=X_train.index)
df = data_train.values

# computing methods for cluster selection
sil_scores = {}
inertias = [] 
for n_clusters in range(2, 9):
    km = TimeSeriesKMeans(n_clusters=n_clusters, verbose=False, random_state=0, metric="euclidean")
    y_pred = km.fit_predict(df)
    sil_scores[n_clusters] = silhouette_score(df, y_pred, metric="euclidean")
    inertias.append(km.inertia_)

# plotting methods
fig, ax1 = plt.subplots()
ax1.set_xlabel('Number of clusters')
ax1.set_ylabel('Inertia')
ax1.plot(range(2,9), inertias, 
         color='b', label='Elbow method (inertia)')
ax1.legend(loc='upper left')
ax2 = ax1.twinx() 
ax2.set_xlabel('Number of clusters')
ax2.set_ylabel('Silhouette score')
ax2.plot(range(2,9), list(sil_scores.values()), 
         color='r', label='Silhouette score')
ax2.legend(loc='upper right')
fig.tight_layout()  # otherwise the right y-label is slightly clipped

# getting number of clusters that maximizes silhouette score
sil_scores.pop(2) # dropping the k=2 option from the candidates
n_clusters = max(sil_scores, key=sil_scores.get)

plt.axvline(x=n_clusters, color='k', ls='--', alpha=.3);
plt.text(0.67, 0.75,'Optimal $k$', transform=plt.gca().transAxes);
plt.show();

# performing definite clustering
# random seed is very important for k-means, as a bad one may lead to slower convergence and bias
km = TimeSeriesKMeans(n_clusters=n_clusters, verbose=True, random_state=0, metric="euclidean")
y_pred = km.fit_predict(df)

# plotting and computing tot ss
plt.figure(figsize=(15,10))
plt.title("Euclidean $k$-means");
tot_ss = {}
for yi in range(n_clusters):
    ss = []
    plt.subplot(3, 3, yi + 1)
    for xx in df[y_pred == yi]:
        ss.append((xx - km.cluster_centers_.squeeze()[yi])**2)
        tot_ss[yi] = ss
        plt.plot(xx.ravel(), "k-", alpha=.2)
    tot_ss[yi] = np.mean(sum(tot_ss[yi]))
    plt.plot(km.cluster_centers_[yi].ravel(), "c-")
    plt.text(0.55, 0.85,'Cluster %d' % (yi + 1), transform=plt.gca().transAxes);
    plt.xlabel('Month'); plt.ylabel('Standardized sales')

In [None]:
# adding the clusters as a new variable (but remember that they won't be used in the model)
shop_clusters = {}
it = 0
for shop in shop_shifted_sales.index:
    shop_clusters[shop] = y_pred[it] + 1
    it += 1

# mapping shops
# NAs will be due to new shops in the months 33 and 34, i.e., shop 36 that we saw earlier in month 33
# we'll simply fill them with 0
# if we were dealing with cross-sectional data, we would assign them to their nearest centroid
# but since we're dealing with growing time series data and euclidean distance, length wouldn't be the same and
# computations couldn't be performed. Whole new clusters would have to be calculated
opt_data['shop_group'] = opt_data.shop_id.map(shop_clusters).fillna(0).apply(int).astype('int8')

We're also computing a mean of total sum of squared distances in each cluster versus the number of shops in each cluster, i.e., magnitude versus cardinality, as another means of quantifying cluster quality. A higher cluster cardinality tends to result in a higher cluster magnitude, which intuitively makes sense. When cardinality doesn't correlate with magnitude relative to the other clusters, it may indicate that the cluster is anomalous. Our clusters are well in line with this dependency.
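For reference, this is a small sketch of the two quantities on made-up points (the `series` and `labels` arrays are hypothetical). Here the magnitude is the plain total of squared Euclidean distances to the centroid, a simplification of the averaged version used in this notebook.

```python
import numpy as np

# Hypothetical standardized series (rows) and their cluster assignments
series = np.array([[0.0, 1.0], [0.0, 3.0], [4.0, 4.0], [6.0, 4.0], [5.0, 7.0]])
labels = np.array([0, 0, 1, 1, 1])

cardinality, magnitude = {}, {}
for k in np.unique(labels):
    members = series[labels == k]
    centroid = members.mean(axis=0)
    # cardinality: number of members in the cluster
    cardinality[int(k)] = len(members)
    # magnitude: total squared Euclidean distance from members to their centroid
    magnitude[int(k)] = float(((members - centroid) ** 2).sum())

print(cardinality)
print(magnitude)
```

In this toy case the larger cluster also has the larger magnitude, which is the pattern we expect from well-behaved clusters.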

In [None]:
cluster_quality = pd.DataFrame()
cluster_quality['cardinality'] = opt_data.query('shop_group != 0').groupby(['shop_group'])['shop_id'].nunique()
cluster_quality['magnitude'] = list(tot_ss.values())

sns.regplot(data=cluster_quality, x='cardinality', y='magnitude');
plt.xlabel('Cardinality'); plt.ylabel('Magnitude');
plt.title('Cluster quality');

Finally, we wanted to see how our OOT shops are distributed among the shape clusters, and from the following piece of code, we can see that they mostly belong to clusters 2 and 5, the ones exhibiting a predominant downwards trend.

In [None]:
lists_shops = opt_data.groupby('shop_group')['shop_id'].apply(set).to_dict()
test_shops_clusters = []
for shop in list(set(train.shop_id) - set(test.shop_id)): # iterate over OOT
    for cluster in lists_shops.keys(): # iterate over clusters
        if shop in lists_shops[cluster]:
            test_shops_clusters.append(cluster)
print(Counter(np.sort(test_shops_clusters)))

#### 3.3.2. Items RFM (Recency, Frequency and Monetary value)

As stated in section 2, here we will make an **analogy to the RFM customer segmentation model, but for items**. The original model is used to segment customers into high-value, medium-value and low-value customers, and so on. We'll compute the RFM segments in an expanding time window: recency is the time span between the present in the window and the last time the item was sold; frequency is the ratio between the number of months in which the item was sold and the total window span in months; and revenue acts as the monetary value. Each segment concatenates the ordinal encodings of the three variables, so segment 111 is the one of highest value and 444 the lowest.
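Before working on the real data, here is a toy sketch of the segmentation for a single time window. Everything in it is hypothetical: the `active` matrix, the `revenue` values and the helper `quartile_rank` are made up, and the scoring follows the quartile thresholds described above rather than reproducing the exact code of this notebook.

```python
import pandas as pd

# Hypothetical binary activity per item (rows) and month (columns 0..3); 1 = sold that month
active = pd.DataFrame([[1, 1, 1, 1], [1, 0, 0, 0], [0, 1, 0, 1], [0, 0, 1, 0]],
                      index=pd.Index([10, 11, 12, 13], name="item_id"))
revenue = pd.Series([400.0, 50.0, 120.0, 80.0], index=active.index)
current_month = 4  # features for month 4 only use months 0..3

last_sale = active.apply(lambda row: row[row > 0].index.max(), axis=1)
rfm = pd.DataFrame({
    "recency": current_month - last_sale,
    "frequency": active.sum(axis=1) / active.shape[1],
    "monetary": revenue,
})

def quartile_rank(s):
    """1 for values up to the 25th percentile, ..., 4 for values above the 75th."""
    q = s.quantile([0.25, 0.5, 0.75])
    return (1 + (s > q.loc[0.25]).astype(int)
              + (s > q.loc[0.5]).astype(int)
              + (s > q.loc[0.75]).astype(int))

# Score 1 is best: low recency is good, high frequency and monetary value are good
rfm["r"] = quartile_rank(rfm.recency)
rfm["f"] = 5 - quartile_rank(rfm.frequency)
rfm["m"] = 5 - quartile_rank(rfm.monetary)
rfm["segment"] = rfm.r.astype(str) + rfm.f.astype(str) + rfm.m.astype(str)
print(rfm)
```

Item 10, sold every month with the highest revenue, lands in segment 111, whereas item 11, last sold in month 0, lands in 444.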

First, let's clear our dataset of inactive shops and items.

In [None]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # transforming type before pivoting
    opt_data['item_cnt_month'] = opt_data.item_cnt_month.astype('int16')

# pivoting df for computing new sales by item id
sales_by_item_id = pivot_df(opt_data, ['item_id'], ['item_cnt_month'], 'date_block_num')

# shifting one month so as to use previous sales data every month
item_shifted_sales = sales_by_item_id.shift(1, axis=1).dropna(axis=1)

# dropping inactive shops to reduce variance, as stated in section 1
# assumption is that if a shop does not have any sales during the previous month before val, 
# it will be inactive unless it's special. This way we won't use information from test, i.e. the future,
# by simply dropping the ones that are not included (OOTS). 
# We have to make sure, however, not to eliminate any shop nor item in test that falls in this assumption, 
# since we're deriving the new test set from this dataframe
prev_length = len(opt_data.query('date_block_num < 33'))
opt_data = opt_data[(opt_data.shop_id
                    .isin(shop_shifted_sales[shop_shifted_sales.loc[:,[32]].sum(axis=1) > 0].index)) |
                    (opt_data.shop_type_name == 'special') | 
                    (opt_data.shop_id.isin(test.shop_id.unique()) & (opt_data.date_block_num == 34))]

shops_not_included = list(set(train.shop_id.unique()) - 
                          set(opt_data.query('date_block_num < 33').shop_id.unique()))
print('\nNumber of shops not included in model: ', len(shops_not_included))

# dropping outdated items for more than 2 years, as stated in section 1 (unless they're on test set)
# assumption is that if an item of the type that we're dealing with in this dataset has no sales during 
# the last 2 years in the whole Russia and had had previous sales, it is therefore out of sales catalogue. 
# Ideally assumptions would be discussed with Business department
opt_data = opt_data[(opt_data.item_id
                    .isin(item_shifted_sales[item_shifted_sales.iloc[:,11:].sum(axis=1) > 0].index)) |
                    ((opt_data.item_id.isin(test.item_id.unique())) & (opt_data.date_block_num == 34))]

items_not_included = list(set(train.item_id.unique()) - 
                          set(opt_data.query('date_block_num < 33').item_id.unique()))
print('Number of items not included in model: ', len(items_not_included))

print('Number of train samples decreased by {}%'.format(
      round((prev_length - len(opt_data.query('date_block_num < 33')))*100 / prev_length)))

Now, we'll compute the RFM segments.

In [None]:
%%time

# pivoting df for computing new sales by item id after dropping inactive ones
sales_by_item_id = pivot_df(opt_data, ['item_id'], ['item_cnt_month'], 'date_block_num')

# shifting one month so as to use previous sales data every month
item_shifted_sales = sales_by_item_id.shift(1, axis=1).dropna(axis=1)
item_shifted_sales = item_shifted_sales.applymap(lambda x: 1 if x > 0 else 0) # binary sales

# functions for computing score or quartile belonging (ordinal encoding)
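def RScore(x,p,d):
    """Computes recency score (1 for the lowest recency quartile, 4 for the highest).
    
    Parameters
    ----------
    x: int or float
        Value.
    p: string
        Dictionary key (column name).
    d: dict
        Quantiles per column, as returned by `DataFrame.quantile(...).to_dict()`.
    """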
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]: 
        return 3
    else:
        return 4

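def FMScore(x,p,d):
    """Computes frequency and monetary scores (1 for the highest quartile, 4 for the lowest).
    
    Parameters
    ----------
    x: int or float
        Value.
    p: string
        Dictionary key (column name).
    d: dict
        Quantiles per column, as returned by `DataFrame.quantile(...).to_dict()`.
    """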
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]: 
        return 2
    else:
        return 1

    
# instantiating rfm_df
rfm_df = pd.DataFrame([])

# computing rfm segments in expanding date blocks
for date_block in range(1, len(item_shifted_sales.columns) + 1):

    # expanding temp dfs
    temp_shifted_sales = item_shifted_sales.loc[:,[col for col in item_shifted_sales.columns if col <= date_block]]
    pre_temp = opt_data.query(f'date_block_num <= {date_block}')

    # computing recency
    df_recency = (temp_shifted_sales.reset_index().melt(id_vars='', var_name='date_block_num', 
                  value_name='active_last_month').rename(columns={'': 'item_id'}).query('active_last_month > 0')
                  .groupby(['item_id'])['date_block_num'].max().reset_index())
    df_recency['recency'] = (pre_temp.date_block_num.max() - df_recency.date_block_num) + 1
    del pre_temp
    
    # computing frequency
    temp_shifted_sales['frequency'] = temp_shifted_sales.sum(axis=1)/len(temp_shifted_sales.columns)

    # adding columns to cross-section temp df
    temp = opt_data.query(f'date_block_num == {date_block}')
    temp = temp.merge(temp_shifted_sales.reset_index()[['', 'frequency']], 
                      left_on=['item_id'], right_on='', how='left')
    temp = temp.merge(df_recency[['item_id', 'recency']].drop_duplicates(), on=['item_id'], how='left')
    del temp_shifted_sales, temp[''] # '' is just the item_id when resetting index in shifted df

    # computing quartile df 
    quantiles = temp[['item_id', 'recency','frequency', 'item_price']].quantile(q=[0.25,0.5,0.75])
    quantiles = quantiles.to_dict()

    # segmenting with quartile df
    temp['r_quartile'] = temp['recency'].apply(RScore, args=('recency',quantiles,))
    temp['f_quartile'] = temp['frequency'].apply(FMScore, args=('frequency',quantiles,))
    temp['m_quartile'] = temp['item_price'].apply(FMScore, args=('item_price',quantiles,))

    # computing total score
    temp['rfm'] = (temp['r_quartile'].apply(str) + temp['f_quartile'].apply(str) + temp['m_quartile'].apply(str))

    rfm_df = pd.concat([rfm_df, temp])
    del temp

# adding rfm computation to our dataset
# first month (date block 0) will be empty, but we'll drop it after the next subsection
opt_data = pd.concat([opt_data.query('date_block_num == 0'), rfm_df])

# filling empty block with -1 to change data type
opt_data.loc[:, ['recency', 'rfm', 'frequency',
             'r_quartile', 'f_quartile', 'm_quartile']] = (opt_data.loc[:, ['recency', 'rfm', 'frequency',
                                                                        'r_quartile', 'f_quartile', 'm_quartile']]
                                                           .fillna(-1).applymap(int).astype('int16'))
# saving memory
del quantiles, rfm_df, sales_by_item_id

In [None]:
print('Samples of RFM segments and their mean:')
display(opt_data.groupby('rfm')[['recency', 'frequency', 'item_price']].mean().sample(3))

#### 3.3.3. Mean-based, trends, and other features

There are countless possibilities, so for this version we'll compute some sensible features that may be important based on what we've seen so far, and iterate on them later to improve results. One could also compute ACF and PACF values to study the autocorrelation of aggregated values of our target variable, or pair-wise correlations between the predictors and the dependent variable, to gain more insight and focus the calculations. We'll be **clipping the target variable before computations so as to diminish bias due to extreme values**.
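The sketch below, on a made-up shop-item history with a missing month, shows the clipping step and why lags are safer to build by merging on a shifted `date_block_num` than with a plain `shift`. The `toy` frame and the column `naive_lag1` are hypothetical.

```python
import pandas as pd

# Hypothetical shop-item history; month 2 is missing for this pair
toy = pd.DataFrame({
    "date_block_num": [0, 1, 3],
    "shop_id":        [5, 5, 5],
    "item_id":        [10, 10, 10],
    "item_cnt_month": [25, 3, 7],
})

# Clip the target to the evaluation range before deriving features from it
toy["item_cnt_month"] = toy["item_cnt_month"].clip(0, 20)

# Lag by merging: the value from month t becomes visible at month t + 1
keys = ["date_block_num", "shop_id", "item_id"]
lagged = toy[keys + ["item_cnt_month"]].rename(
    columns={"item_cnt_month": "item_cnt_month_lag1"})
lagged["date_block_num"] += 1
toy = toy.merge(lagged, on=keys, how="left")

# A plain groupby().shift(1) would instead pair month 3 with month 1
toy["naive_lag1"] = toy.groupby(["shop_id", "item_id"])["item_cnt_month"].shift(1)
print(toy)
```

For month 3 the merge-based lag is missing, because month 2 has no record, while the naive shift silently returns the value of month 1.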

In [None]:
%%capture
# optimizing df again
opt_data = reduce_mem_usage(opt_data)

In [None]:
%%time
# general function for computing any agg operation and generating a column
def generate_agg_features(df, grouping_cols, col, operation):
    """Generates new features with transform operations over a group.
    
    Parameters
    ----------
    df: pandas dataframe
    grouping_cols: list of str
    col: str
        Column to part from for creating new feature.
    operation: lambda function or str
    """
    return df.groupby(grouping_cols)[col].transform(operation)

# we need to use this function for lagging features, as months are not continuous within shop-item pairs
def generate_lag_features(df, lag, col):
    """Generates lagged features over date block num.
    
    Parameters
    ----------
    df: pandas dataframe
    lag: int
    col: str
    """
    temp = df[['date_block_num','shop_id','item_id', col]]
    shifted = temp.copy()
    shifted.columns = ['date_block_num','shop_id','item_id', col +'_lag'+ str(lag)]
    shifted['date_block_num'] += lag
    df = pd.merge(df, shifted, on=['date_block_num','shop_id','item_id'], how='left')
    if 'cnt_month' in col:
        df[col +'_lag'+ str(lag)] = df[col +'_lag'+ str(lag)].fillna(0).apply(int).astype('int16')
    return df     

# clipping target variable before executing operations
opt_data['unclipped_item_cnt_month'] = opt_data['item_cnt_month']
opt_data['item_cnt_month'] = opt_data['item_cnt_month'].clip(0,20)

# generating lags for item count
for lag in [1,2,3,6,12]:
    opt_data = generate_lag_features(opt_data, lag, 'item_cnt_month')

## time
# days in month to correct for monthly variation
opt_data['days_in_month'] = opt_data[['year', 'month']].apply(lambda x: monthrange(x['year'], x['month'])[1], axis=1)

# dummy to account for seasonality
opt_data['is_in_season'] = opt_data.month.apply(lambda x: 1 if x == 12 else 0)
opt_data['was_in_season'] = opt_data.month.apply(lambda x: 1 if x == 11 else 0)

## price and revenue magnitude order
opt_data['log_item_price'] = np.log10(opt_data.item_price)
opt_data['log_revenue'] = opt_data.revenue.apply(lambda x: np.log10(x) if x != 0 else 0).fillna(0)
opt_data['price_segment'] = opt_data.item_price.apply(lambda x: round(np.log10(x))).astype('int8')

# lagging revenue
opt_data = generate_lag_features(opt_data, 1, 'log_revenue')
del opt_data['log_revenue']
    
## unique objects within some groups
opt_data['shops_in_city'] = generate_agg_features(opt_data, ['shop_city_id', 'date_block_num'],'shop_id', 
                                                  'nunique').fillna(0).astype('int16')
opt_data['items_in_shop'] = generate_agg_features(opt_data, ['shop_id', 'date_block_num'], 'item_id', 
                                                  'nunique').fillna(0).astype('int16')
opt_data['items_in_cat'] = generate_agg_features(opt_data, ['category_id', 'date_block_num'], 'item_id', 
                                                 'nunique').fillna(0).astype('int16')
# averaged by superior hierarchy
opt_data['shops_in_city'] = (opt_data['shops_in_city'] / 
                             opt_data.groupby('date_block_num')['shop_city_id'].transform('nunique')).fillna(0).astype('int16')
opt_data['items_in_shop'] = (opt_data['items_in_shop'] / 
                             opt_data.groupby('date_block_num')['shop_id'].transform('nunique')).fillna(0).astype('int16')
opt_data['items_in_cat'] = (opt_data['items_in_cat'] / 
                            opt_data.groupby('date_block_num')['category_id'].transform('nunique')).fillna(0).astype('int16')

# lagging these features
for col in ['shops_in_city', 'items_in_cat', 'items_in_shop']:
    opt_data = generate_lag_features(opt_data, 1, col)
    opt_data[col + '_lag1'] = opt_data[col + '_lag1'].fillna(0).astype('int16')
    del opt_data[col]

## trend
# ideally we would use rolling functions over a date index, but I've found them to be extremely slow for this project
# row-wise mean of the lag-1 column and the lag-`lag` column
# (passing the second series as the DataFrame index would reindex the first one instead of averaging both)
for lag in [3,6,12]:
    opt_data[f'mean_item_cnt_month_lags1_{lag}'] = (opt_data[['item_cnt_month_lag1', f'item_cnt_month_lag{lag}']]
                                                    .mean(axis=1).fillna(0).astype('float16'))

In [None]:
%%time
# this function will help us on computing operations to avoid data leakage
# also tried with an expanding .rolling() window over datetime date, but computation was too expensive
def compute_expanding(df, grouping_cols, operation, col, new_col, drop_zeros=False):
    """Function to compute operations by looping over an expanding dataframe.
    
    Parameters
    ----------
    df: pandas dataframe
    grouping_cols: list of str
    operation: lambda function or str
    col: str
    new_col: str
    drop_zeros: bool
        Whether to drop rows where sales are zero.
    """
    new_df = pd.DataFrame([])
    for date_block in range(1, len(item_shifted_sales.columns) + 1):
        query_temp = df.query(f'date_block_num < {date_block}')
        cross = df.query(f'date_block_num == {date_block}')
        if drop_zeros: 
            query_temp = query_temp.query('item_cnt_month > 0')
        temp = query_temp.copy()
        temp.loc[:, new_col] = temp.groupby(grouping_cols)[col].transform(operation)
        temp.loc[:,'date_block_num'] = date_block
        cross = cross.merge(temp[[new_col, 'date_block_num'] + grouping_cols].drop_duplicates(), 
                            on=['date_block_num'] + grouping_cols, how='left')
        new_df = pd.concat([new_df, cross])  
    new_df[new_col] = new_df[new_col].fillna(0).apply(int).astype('int16')
    return new_df  

## time
# first month of sales
opt_data = compute_expanding(opt_data, ['item_id'], 'min', 'date_block_num', 'item_release', True)
opt_data = compute_expanding(opt_data, ['subcategory_id'], 'min', 'date_block_num', 'subcat_release', True)
opt_data = compute_expanding(opt_data, ['shop_id'], 'min', 'date_block_num', 'shop_opening', True)
opt_data['item_age'] = opt_data['date_block_num'] - opt_data['item_release']

# last month of sales
opt_data = compute_expanding(opt_data, ['item_id'], 'max', 'date_block_num', 'item_last_sale', True)
opt_data = compute_expanding(opt_data, ['shop_id'], 'max', 'date_block_num', 'shop_last_sale', True)
    
# part of the following code could be written more compactly with lambda functions inside transform, 
# but computation time then grows sharply. This way is much more efficient
## totals averaged by time span
opt_data = compute_expanding(opt_data, ['shop_id', 'item_id'], 'sum', 'item_cnt_month', 'total_avg_shop_item_sales')
opt_data = compute_expanding(opt_data, ['shop_id', 'category_id'], 'sum', 'item_cnt_month', 'total_avg_shop_cat_sales')
opt_data = compute_expanding(opt_data, ['shop_id', 'subcategory_id'], 'sum', 'item_cnt_month', 'total_avg_shop_subcat_sales')
opt_data['total_avg_shop_item_sales'] = opt_data['total_avg_shop_item_sales']/opt_data.date_block_num
opt_data['total_avg_shop_cat_sales'] = opt_data['total_avg_shop_cat_sales']/opt_data.date_block_num
opt_data['total_avg_shop_subcat_sales'] = opt_data['total_avg_shop_subcat_sales']/opt_data.date_block_num

## price
# price diff within shop-item life and amongst shops
opt_data = compute_expanding(opt_data, ['shop_id', 'item_id'], 'max', 'item_price', 'max_price')
opt_data = compute_expanding(opt_data, ['shop_id', 'item_id'], 'min', 'item_price', 'min_price')
opt_data['price_diff'] = (opt_data.max_price - opt_data.min_price) / opt_data.max_price
del opt_data['min_price'], opt_data['max_price']

Okay, so we've added quite a few variables. Let's continue.
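To make the expanding-window idea behind `compute_expanding` easier to follow, here is a minimal sketch on toy data. The dataframe `toy` and the helper `first_sale_before` are hypothetical names used only for illustration, not part of the pipeline above. For each month, the helper looks only at strictly earlier months, so the feature never sees the present or the future.

```python
import pandas as pd

# hypothetical toy data: two items observed over four date blocks
toy = pd.DataFrame({
    'date_block_num': [0, 1, 2, 3, 0, 1, 2, 3],
    'item_id':        [1, 1, 1, 1, 2, 2, 2, 2],
    'item_cnt_month': [0, 3, 0, 2, 0, 0, 5, 1],
})

def first_sale_before(df):
    """For every row, first month with sales among strictly earlier months (0 if none)."""
    parts = []
    for block in sorted(df['date_block_num'].unique()):
        # history available at this point in time: earlier blocks with positive sales
        past = df[(df['date_block_num'] < block) & (df['item_cnt_month'] > 0)]
        first = (past.groupby('item_id')['date_block_num'].min()
                     .rename('item_release').reset_index())
        current = df[df['date_block_num'] == block]
        parts.append(current.merge(first, on='item_id', how='left'))
    out = pd.concat(parts, ignore_index=True)
    out['item_release'] = out['item_release'].fillna(0).astype('int16')
    return out

result = first_sale_before(toy)
print(result)
```

Note that filling missing values with 0 makes "no previous sale" indistinguishable from "first sold in date block 0". This is usually acceptable for tree-based models, but worth keeping in mind when reading the feature.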

### 3.4. Quantitative variable transformations

Several parametric statistics and models, chiefly linear models, assume certain properties of the input data, such as similar variable scales or Gaussian (normal) distributions. These properties can sometimes be achieved with simple transformations of the original data, such as standardization (to make scales comparable) and Box-Cox or Yeo-Johnson transformations (to stabilize dispersion and bring the data closer to normality). However, given the nature of our data and the problem posed, we'll probably use tree-based methods for making predictions. These are invariant to monotonic transformations of the features, and therefore robust to differences in dispersion and scale, so we won't need many operations in this section. If we consider stacking other algorithms, we'll transform the data in the corresponding modelling notebook. Many other algorithms do benefit from standardization, but one also has to consider whether the original units are more suitable than the standardized ones, which is our case for now.
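The invariance of tree splits to monotonic transformations can be checked with a small sketch. The function `best_split_mask` below is a hypothetical, hand-written single split (a decision stump) that minimises squared error; it is not the implementation of any particular library, and the data are synthetic. Since a split depends only on the ordering of the feature values, the raw and the log-transformed feature should send exactly the same rows to the left node.

```python
import numpy as np

def best_split_mask(x, y):
    """Return which rows fall in the left node of the best squared-error split on x."""
    order = np.argsort(x, kind='stable')
    xs, ys = x[order], y[order]
    best_sse, best_k = np.inf, None
    for k in range(1, len(xs)):
        if xs[k] == xs[k - 1]:
            continue  # no threshold can separate equal values
        left, right = ys[:k], ys[k:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_k = sse, k
    mask = np.zeros(len(x), dtype=bool)
    mask[order[:best_k]] = True
    return mask

# synthetic, right-skewed "price" and a target that depends on it
rng = np.random.default_rng(0)
price = rng.lognormal(mean=6.0, sigma=1.0, size=200)
sales = np.where(price > np.median(price), 1.0, 3.0) + rng.normal(0, 0.1, size=200)

same_partition = np.array_equal(best_split_mask(price, sales),
                                best_split_mask(np.log10(price), sales))
print('Same partition for raw and log-transformed feature:', same_partition)
```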

We'll simply remove `item_price` and `revenue` and keep their log-transforms, since it is their order of magnitude that really interests us.
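As a rough numerical illustration of what a log-transform does to a right-skewed variable, the sketch below compares the sample skewness before and after the transformation. The data are synthetic (drawn from a lognormal distribution) and only stand in for a price-like variable; they are not the competition data.

```python
import numpy as np
import pandas as pd

# synthetic price-like variable: strictly positive and right-skewed (an assumption)
rng = np.random.default_rng(42)
price = pd.Series(rng.lognormal(mean=6.0, sigma=1.0, size=10_000))
log_price = np.log10(price)

skew_raw = price.skew()
skew_log = log_price.skew()
print(f'Skewness before: {skew_raw:.2f}, after log10: {skew_log:.2f}')
```

A skewness close to zero after the transformation is consistent with a more symmetric, closer-to-normal shape, which is what the Q-Q plot in the next cell shows graphically for the actual prices.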

In [None]:
# normalization of item price by log-transform, example with qq-plot
fig = plt.figure()
ax1 = fig.add_subplot(211)
x = opt_data['item_price'].dropna()
prob = stats.probplot(x, dist=stats.norm, plot=ax1)
ax1.set_xlabel('')
ax1.set_title('Probplot against normal distribution')

ax2 = fig.add_subplot(212)
prob = stats.probplot(np.log10(x), dist=stats.norm, plot=ax2)
ax2.set_title('Probplot after log-transformation')
plt.show()

In [None]:
del opt_data['item_price'], opt_data['revenue']

### 3.5. Encoding categorical variables

As said, we'll be using tree-based algorithms for our model. Some of them, such as CatBoost, can handle categorical and text variables internally, but others, such as XGBoost, have traditionally required numerical input. We'll prepare a numerical representation for every categorical feature so that we also have greater control over the outcome, and we'll compare results and performance in the next notebook. In addition, most of our categorical features have high cardinality, so we'll probably treat them as numerical inputs when modelling.

In [None]:
opt_data.select_dtypes('category').head(3)

For the `item_name` variable, we'll use a **bag of words** built with an expanding time window over date blocks and, as with the RFM segmentation, assign a quartile to every token based on its count. Each item name then receives the maximum quartile found among its tokens. We'll also keep the provisional ordinal ID created in the Text Processing notebook.
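Before the full implementation, here is a minimal sketch of the idea on a handful of made-up item names. The quartile assignment below uses `np.searchsorted` and gives higher quartiles to more frequent tokens; both choices are assumptions made for illustration and may differ from the `RScore` helper used in the actual cell. The names `past_names`, `word_quartile` and `name_quartile` are hypothetical.

```python
from collections import Counter

import numpy as np
import pandas as pd

# made-up item names seen in previous date blocks
past_names = pd.Series(['fifa 15 ps3', 'fifa 16 ps4', 'call of duty ps4', 'fifa 14 ps3'])

# bag of words: token counts, ignoring purely numeric tokens
counts = Counter(tok for name in past_names for tok in name.split() if not tok.isdigit())
bow = pd.Series(counts, name='count')

# quartile (1 to 4) of each token according to the 25/50/75% cut points of the counts
cuts = bow.quantile([0.25, 0.5, 0.75]).to_numpy()
word_quartile = pd.Series(np.searchsorted(cuts, bow.to_numpy(), side='left') + 1,
                          index=bow.index)

def name_quartile(name):
    """Maximum token quartile in a name; unseen or numeric tokens count as 0."""
    scores = [word_quartile.get(tok, 0) for tok in name.split() if not tok.isdigit()]
    return max(scores, default=0)

for name in ['fifa 17 ps4', 'call of duty xbox', '2048']:
    print(name, '->', name_quartile(name))
```

In the real feature the bag of words is rebuilt for every date block from earlier blocks only, so a name is always scored against vocabulary that was already known at that time.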

In [None]:
%%time
def apply_bag_of_words(row, bow):
    """Function for returning the max word quartile found in a sentence with a previously created BoW.
    
    Parameters
    ----------
    row: str
        Item name, i.e. one value of the item_name column.
    bow: pandas df
        Bag of words in dataframe form, indexed by token.
    """
    list_count = []
    for token in row.split():
        if not token.isdigit():
            try:
                list_count.append(bow.at[token, 'word_quartile'])
            except KeyError:
                # token not seen in previous date blocks
                list_count.append(0)
        else: 
            list_count.append(0)
    # default avoids a ValueError for empty names
    return max(list_count, default=0)

def expanding_bow_quartile(df, new_col='item_name_quartile'):
    """Function for creating a new feature with segments assigned from the quartiles of the 
    distribution of word counts in a BoW built from previous date blocks only. 
    Depends on the functions apply_bag_of_words() and RScore().
    
    Parameters
    ----------
    df: pandas dataframe
    new_col: str
        Name of the resulting feature.
    """
    new_df = pd.DataFrame([])
    for date_block in range(2, len(item_shifted_sales.columns) + 1):
        query_temp = df.query(f'date_block_num < {date_block}')
        cross = df.query(f'date_block_num == {date_block}')
        cross_copy = cross.copy()
        
        dict_names = Counter([token for x in np.array(query_temp.item_name).flatten() 
                              for token in x.split() if not token.isdigit()])
        bag_of_words = pd.DataFrame.from_dict(dict_names, orient='index').rename(columns={0:'count'})
        quantiles = bag_of_words[['count']].quantile(q=[0.25,0.5,0.75])
        bag_of_words['word_quartile'] = bag_of_words['count'].apply(RScore, args=('count',quantiles,))
        
        cross_copy.loc[:, new_col] = cross_copy.item_name.apply(lambda x: apply_bag_of_words(x, bag_of_words))
        new_df = pd.concat([new_df, cross_copy])
    
    del bag_of_words, query_temp, cross, cross_copy
    
    new_df[new_col] = new_df[new_col].fillna(0).apply(int).astype('int8')
    return new_df 

en_data = expanding_bow_quartile(opt_data)
del opt_data

display(en_data[['item_name', 'item_name_quartile']].sample(5))
display(en_data.groupby('item_name_quartile').nunique()[['item_id']])

For the `highest_price` and `highest_count` variables, we'll use one-hot encoding, since they have only two categories each. For the rest of the categorical variables, we have deemed that mean encoding from the first two lags of the target will work best. As said, most of our variables have many categories, so one-hot or binary encoding (which creates fewer new variables but still adds to them) would produce a very sparse matrix and increase computation time for the tree-based algorithms. A simpler alternative such as frequency encoding would not add much information, since our data for every month is a Cartesian product of shops and items, and the non-normality of our target variable complicates the use of some other encoding methods. You can check this useful Python library for [more types of encoding](https://contrib.scikit-learn.org/category_encoders/). As always, having time data adds to the complexity of the task.
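The following sketch shows lagged mean encoding on toy data. It does not use the notebook's `generate_agg_features` and `generate_lag_features` helpers; the groupby and merge steps here are a simplified stand-in, and the dataframe `toy` is hypothetical. The key point is that each row is encoded with the mean target of its category in the previous month, never the current one, which avoids leaking the target.

```python
import pandas as pd

# hypothetical toy data: two categories, two date blocks
toy = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1, 1, 1],
    'category_id':    [10, 10, 20, 10, 10, 20],
    'item_cnt_month': [1.0, 3.0, 5.0, 0.0, 2.0, 4.0],
})

# mean target per category and month
toy['mean_cat'] = (toy.groupby(['category_id', 'date_block_num'])['item_cnt_month']
                      .transform('mean'))

# lag 1: shift the monthly means one block forward and merge them back
lagged = toy[['date_block_num', 'category_id', 'mean_cat']].drop_duplicates().copy()
lagged['date_block_num'] += 1
lagged = lagged.rename(columns={'mean_cat': 'mean_cat_lag_1'})

encoded = (toy.merge(lagged, on=['date_block_num', 'category_id'], how='left')
              .drop(columns='mean_cat'))  # the same-month mean would leak the target
print(encoded)
```

Rows in the first date block have no previous month, so their lagged encoding is missing; this is why the earliest date blocks are dropped after the lags are created.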

In [None]:
%%time
# one hot
en_data['highest_price'] = en_data.highest_price.apply(lambda x: 1 if x=='highest' else 0)
en_data['highest_count'] = en_data.highest_count.apply(lambda x: 1 if x=='highest' else 0)

# mean-encoding and lagging
for col in ['item_id', 'shop_id', 'category_id', 'subcategory_id', 'shop_city_id', 'shop_type_id', 
            'rfm', 'price_segment', 'neighbours', 'month']:
    # item, category and rfm features also get a mean computed within each shop
    if any(string in col for string in ['category', 'rfm', 'item_id']):
        en_data['mean_' + col + '_sales'] = generate_agg_features(en_data, ['shop_id', col, 'date_block_num'], 
                                                                  'item_cnt_month', 
                                                                  'mean').fillna(0).astype('float16')
        en_data = generate_lag_features(en_data, 1, 'mean_' + col + '_sales')
        en_data = generate_lag_features(en_data, 2, 'mean_' + col + '_sales')
        del en_data['mean_' + col + '_sales']
    # every feature gets a mean computed across all shops
    en_data['mean_' + col + '_all_sales'] = generate_agg_features(en_data, [col, 'date_block_num'], 
                                                                  'item_cnt_month', 
                                                                  'mean').fillna(0).astype('float16')
    en_data = generate_lag_features(en_data, 1, 'mean_' + col + '_all_sales')
    en_data = generate_lag_features(en_data, 2, 'mean_' + col + '_all_sales')
    del en_data['mean_' + col + '_all_sales']
    

# ordinal ID for neighbours 
en_data['neighbours_id'] = preprocessing.LabelEncoder().fit_transform(en_data.neighbours.values).astype('int16')
del en_data['neighbours']

# keeping only months from date block 2 onwards to remove the first NaNs
en_data = en_data.query('date_block_num > 1')

display(en_data.sample(5))

### 3.6. Dividing dataset

It's time to divide our dataset into train, validation and test sets. Remember that this split should normally be performed before preprocessing, unless, as in our case, time-based features need to be created first. We'll use November 2015 (date block 34) as test, as set by the competition, and October 2015 (date block 33) as validation.

In [None]:
# we'll fill the remaining NaNs with 0 in every non-categorical column
non_cat_cols = en_data.select_dtypes(exclude='category').columns
en_data[non_cat_cols] = en_data[non_cat_cols].fillna(0)

train = en_data.query('date_block_num < 33')
val = en_data.query('date_block_num == 33')
test = en_data.query('date_block_num == 34')

print('Train, val and test percentages: {:.1f}%, {:.1f}%, {:.1f}%'.format(len(train)*100/len(en_data), 
                                                                          len(val)*100/len(en_data),
                                                                          len(test)*100/len(en_data)))
print('Total number of variables: ', len(train.columns)) # though not all of them will go to the final model

## 4. Visualization summary

As the name of this section suggests, here we've developed a small interactive app for visualizing all features as a final exploration of our engineered train dataset. It's filtered by city for performance purposes. Enjoy! 🔎

In [None]:
vis_data = en_data.copy()
del en_data

vis_data.drop(['item_name', 'date'], axis=1, inplace=True) # dropping for vis purposes due to extremely high cardinality
cat_cols = ['rfm', 'r_quartile', 'f_quartile', 'm_quartile',
            'shop_group', 'price_segment', 'highest_price', 'highest_count',
            'was_in_season', 'is_in_season', 'year', 'month', 'days_in_month']
vis_data[cat_cols] = vis_data[cat_cols].astype('category')

# vis buttons
predictors = widgets.Dropdown(options=sorted([col for col in vis_data.columns]),
                              value='log_item_price', description='Predictor:', disabled=False)
cities = widgets.Dropdown(options=sorted(list(vis_data.shop_city_name.unique())),
                          value='moscow', description='City:', disabled=False)
date_blocks = widgets.IntRangeSlider(value=[27, 32], min=0, max=34, step=1, description='Date block:')
clipped = widgets.RadioButtons(options=['item_cnt_month', 'unclipped_item_cnt_month'], value='item_cnt_month',
                               description='Target:', disabled=False)

def plot_variables(predictors, cities, date_blocks, clipped):
    """Plots the selected predictor against the target with interactive widgets.
    
    Parameters
    ----------
    predictors: str
        Independent variable selected in the dropdown.
    cities: str
        City selected in the dropdown.
    date_blocks: tuple of int
        First and last date block selected in the slider, both included.
    clipped: str
        Name of the target column, which lets us choose whether the target is clipped or not.
    """
    # both ends of the slider are included in the filter
    subset = vis_data[(vis_data.shop_city_name == cities) & 
                      (vis_data.date_block_num.between(date_blocks[0], date_blocks[1]))]
    plt.figure(figsize=(15,4))
    if str(vis_data[predictors].dtype) == 'category':
        ax = sns.boxplot(data=subset, x=predictors, y=clipped)
        ax.set_xticklabels(ax.get_xticklabels(), rotation=30, ha="right")
    else:
        # we don't care about the values of category id here, so we remove the legend, also for speed purposes
        sns.scatterplot(data=subset, x=predictors, y=clipped, hue='category_id', legend=False)
    plt.title(f'Sales vs {predictors}')
    plt.tight_layout()
    plt.show()
    
v1 = widgets.VBox([date_blocks, cities])  
v2 = widgets.VBox([predictors, clipped])
ui = widgets.HBox([v1, v2])
out = widgets.interactive_output(plot_variables, {'predictors':predictors, 'cities':cities, 
                                                  'date_blocks':date_blocks, 'clipped':clipped})
display(ui, out)

## Saving files

Awesome! ✨ Let's save the results.

In [None]:
#train.to_csv('train.csv', index=False)
#val.to_csv('val.csv', index=False)
#test.to_csv('test.csv', index=False)

---

## 💡 **Stay tuned for the next part: [XGBoost/Model/Performance] Predict Future Sales**

Also, if you have any questions or comments, please feel free to share them!