<a href="https://colab.research.google.com/github/Kerriea-star/time-series-forecasting-with-lstm-autoencoders/blob/main/Time_Series_Forecasting_with_LSTM_Autoencoders.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



*   The purpose of this work is to show one way time-series data can be effiently encoded to lower dimensions, to be used into non time-series models.
*   Here I'll encode a time-series of size 12 (12 months) to a single value and use it on a MLP deep learning model, instead of using the time-series on a LSTM model that could be the regular approach.

**Predict future sales**

The task is to forecast the total amount of products sold in every shop for the test set. Note that the list of shops and products slightly changes every month. Creating a robust model that can handle such situations is part of the challenge.

**Data fields description:**

*   ID - an Id that represents a (Shop, Item) tuple within the test set
*   shop_id - unique identifier of a shop
*   item_id - unique identifier of a product
*   item_category_id - unique identifier of item category
*   date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
*   date - date in format dd/mm/yyyy
*   item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
*   item_price - current price of an item
*   item_name - name of item
*   shop_name - name of shop
*   item_category_name - name of item category



In [2]:
# Dependencies
import os, warnings, random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

import tensorflow as tf
import keras.layers as L

from keras import optimizers, Sequential, Model



In [3]:
# Set seeds to make the experiment more reproducible
def seed_everything(seed=0):
  random.seed(seed)
  np.random.seed(seed)
  tf.random.set_seed(seed)
  os.environ['PYTHONHASHSEED'] = str(seed)
  os.environ["TF_DETERMINISTIC_OPS"] = '1'

seed = 0
seed_everything(seed)
warnings.filterwarnings("ignore")
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [6]:
# Loading data
test = pd.read_csv("drive/MyDrive/Colab Notebooks/time-series-forecasting-with-lstm-autoencoders/data/test.csv", dtype={"ID": 'int32', 'shop_id': 'int32', 'item_id': 'int32'})
item_categories = pd.read_csv("drive/MyDrive/Colab Notebooks/time-series-forecasting-with-lstm-autoencoders/data/test.csv", dtype={'item_category': 'str', 'item_category_id': 'int32'})
items = pd.read_csv("drive/MyDrive/Colab Notebooks/time-series-forecasting-with-lstm-autoencoders/data/items.csv", dtype={"item_name": 'str', 'item_id': 'int32'})
shops = pd.read_csv("drive/MyDrive/Colab Notebooks/time-series-forecasting-with-lstm-autoencoders/data/shops.csv", dtype={'shop_name': 'str', 'shop_id': 'int32'})
sales = pd.read_csv("drive/MyDrive/Colab Notebooks/time-series-forecasting-with-lstm-autoencoders/data/sales_train.csv", parse_dates=['date'], dtype={'date': 'str', 'date_block_num': 'int32', 'shop_id': 'int32',
                                                                                                                                                 'item_id': 'int32', 'item_price': 'float32', 'item_cnt_day': 'int32'})


In [7]:
# Join data sets
train = sales.join(items, on="item_id", rsuffix='_').join(shops, on="shop_id", rsuffix='_').join(item_categories, on='item_category_id', rsuffix='_').drop(["item_id_", "shop_id_", "item_category_id"], axis=1)


In [9]:
# Let's look at the raw data
print(f"Train rows: {train.shape[0]}")
print(f"Train columns: {train.shape[1]}")

display(train.head())
display(train.describe())

Train rows: 2935849
Train columns: 9


Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_name,shop_name,ID
0,2013-02-01,0,59,22154,999.0,1,ЯВЛЕНИЕ 2012 (BD),"Ярославль ТЦ ""Альтаир""",37
1,2013-03-01,0,25,2552,899.0,1,DEEP PURPLE The House Of Blue Light LP,"Москва ТРК ""Атриум""",58
2,2013-05-01,0,25,2552,899.0,-1,DEEP PURPLE The House Of Blue Light LP,"Москва ТРК ""Атриум""",58
3,2013-06-01,0,25,2554,1709.05,1,DEEP PURPLE Who Do You Think We Are LP,"Москва ТРК ""Атриум""",58
4,2013-01-15,0,25,2555,1099.0,1,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),"Москва ТРК ""Атриум""",56


Unnamed: 0,date_block_num,shop_id,item_id,item_price,item_cnt_day,ID
count,2935849.0,2935849.0,2935849.0,2935849.0,2935849.0,2935849.0
mean,14.57,33.0,10197.23,890.85,1.24,40.0
std,9.42,16.23,6324.3,1729.8,2.62,17.1
min,0.0,0.0,0.0,-1.0,-22.0,0.0
25%,7.0,22.0,4476.0,249.0,1.0,28.0
50%,14.0,31.0,9343.0,399.0,1.0,40.0
75%,23.0,47.0,15684.0,999.0,1.0,55.0
max,33.0,59.0,22169.0,307980.0,2169.0,83.0


In [10]:
# Time period of the dataset
print(f"Min date from train set: {train['date'].min().date()}")
print(f"Max date from train set: {train['date'].max().date()}")

Min date from train set: 2013-01-01
Max date from train set: 2015-12-10


In [11]:
# Leave only the "shop_id" and "item_id" that exist in the test set to have more accurate results
test_shop_ids = test["shop_id"].unique()
test_item_ids = test["item_id"].unique()

# Only shops that exist in test set.
train = train[train["shop_id"].isin(test_shop_ids)]

# Only items that exist in test set.
train = train[train["item_id"].isin(test_item_ids)]

**Data Preprocessing**

*   Drop all features but "item_cnt_day" because I'll be using only it as a univariate time-series.
*   We are asked to predict total sales for every product and store in the next month, and our data is given by day, so let's aggregate the data by month
*   Leave only monthly "item_cnt" >= 0 and <= 20, as this seems to be the distributions of the test set.



In [12]:
train_monthly = train[['date', 'date_block_num', 'shop_id', 'item_id', 'item_cnt_day']]
train_monthly = train_monthly.sort_values('date').groupby(['date_block_num', 'shop_id', 'item_id'], as_index=False)
train_monthly = train_monthly.agg({'item_cnt_day': ['sum']})
train_monthly.columns = ['date_block_num', 'shop_id', 'item_id', 'item_cnt']
train_monthly = train_monthly.query('item_cnt >= 0 and item_cnt <= 20')

# Label
train_monthly['item_cnt_month'] = train_monthly.sort_values('date_block_num').groupby(['shop_id', 'item_id'])['item_cnt'].shift(-1)

display(train_monthly.head(10))
display(train_monthly.describe())

Unnamed: 0,date_block_num,shop_id,item_id,item_cnt,item_cnt_month
0,0,2,33,1,2.0
1,0,2,482,1,1.0
2,0,2,491,1,1.0
3,0,2,839,1,1.0
4,0,2,1007,3,1.0
5,0,2,1010,1,1.0
6,0,2,1023,2,1.0
7,0,2,1204,1,
8,0,2,1224,1,
9,0,2,1247,1,


Unnamed: 0,date_block_num,shop_id,item_id,item_cnt,item_cnt_month
count,593829.0,593829.0,593829.0,593829.0,482536.0
mean,20.18,32.07,10015.02,2.1,2.07
std,9.14,16.9,6181.82,2.31,2.17
min,0.0,2.0,30.0,0.0,0.0
25%,13.0,19.0,4418.0,1.0,1.0
50%,22.0,31.0,9171.0,1.0,1.0
75%,28.0,47.0,15334.0,2.0,2.0
max,33.0,59.0,22167.0,20.0,20.0
