<a href="https://colab.research.google.com/github/Kerriea-star/time-series-forecasting-with-lstm-autoencoders/blob/main/Time_Series_Forecasting_with_LSTM_Autoencoders.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



*   The purpose of this work is to show one way time-series data can be effiently encoded to lower dimensions, to be used into non time-series models.
*   Here I'll encode a time-series of size 12 (12 months) to a single value and use it on a MLP deep learning model, instead of using the time-series on a LSTM model that could be the regular approach.

**Predict future sales**

The task is to forecast the total amount of products sold in every shop for the test set. Note that the list of shops and products slightly changes every month. Creating a robust model that can handle such situations is part of the challenge.

**Data fields description:**

*   ID - an Id that represents a (Shop, Item) tuple within the test set
*   shop_id - unique identifier of a shop
*   item_id - unique identifier of a product
*   item_category_id - unique identifier of item category
*   date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
*   date - date in format dd/mm/yyyy
*   item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
*   item_price - current price of an item
*   item_name - name of item
*   shop_name - name of shop
*   item_category_name - name of item category



In [1]:
# Dependencies
import os, warnings, random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

import tensorflow as tf
import keras.layers as L

from keras import optimizers, Sequential, Model



In [2]:
# Set seeds to make the experiment more reproducible
def seed_everything(seed=0):
  random.seed(seed)
  np.random.seed(seed)
  tf.random.set_seed(seed)
  os.environ['PYTHONHASHSEED'] = str(seed)
  os.environ["TF_DETERMINISTIC_OPS"] = '1'

seed = 0
seed_everything(seed)
warnings.filterwarnings("ignore")
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [3]:
# Loading data
test = pd.read_csv("drive/MyDrive/Colab Notebooks/time-series-forecasting-with-lstm-autoencoders/data/test.csv", dtype={"ID": 'int32', 'shop_id': 'int32', 'item_id': 'int32'})
item_categories = pd.read_csv("drive/MyDrive/Colab Notebooks/time-series-forecasting-with-lstm-autoencoders/data/test.csv", dtype={'item_category': 'str', 'item_category_id': 'int32'})
items = pd.read_csv("drive/MyDrive/Colab Notebooks/time-series-forecasting-with-lstm-autoencoders/data/items.csv", dtype={"item_name": 'str', 'item_id': 'int32'})
shops = pd.read_csv("drive/MyDrive/Colab Notebooks/time-series-forecasting-with-lstm-autoencoders/data/shops.csv", dtype={'shop_name': 'str', 'shop_id': 'int32'})
sales = pd.read_csv("drive/MyDrive/Colab Notebooks/time-series-forecasting-with-lstm-autoencoders/data/sales_train.csv", parse_dates=['date'], dtype={'date': 'str', 'date_block_num': 'int32', 'shop_id': 'int32',
                                                                                                                                                 'item_id': 'int32', 'item_price': 'float32', 'item_cnt_day': 'int32'})


In [4]:
# Join data sets
train = sales.join(items, on="item_id", rsuffix='_').join(shops, on="shop_id", rsuffix='_').join(item_categories, on='item_category_id', rsuffix='_').drop(["item_id_", "shop_id_", "item_category_id"], axis=1)


In [5]:
# Let's look at the raw data
print(f"Train rows: {train.shape[0]}")
print(f"Train columns: {train.shape[1]}")

display(train.head())
display(train.describe())

Train rows: 2935849
Train columns: 9


Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_name,shop_name,ID
0,2013-02-01,0,59,22154,999.0,1,ЯВЛЕНИЕ 2012 (BD),"Ярославль ТЦ ""Альтаир""",37
1,2013-03-01,0,25,2552,899.0,1,DEEP PURPLE The House Of Blue Light LP,"Москва ТРК ""Атриум""",58
2,2013-05-01,0,25,2552,899.0,-1,DEEP PURPLE The House Of Blue Light LP,"Москва ТРК ""Атриум""",58
3,2013-06-01,0,25,2554,1709.05,1,DEEP PURPLE Who Do You Think We Are LP,"Москва ТРК ""Атриум""",58
4,2013-01-15,0,25,2555,1099.0,1,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),"Москва ТРК ""Атриум""",56


Unnamed: 0,date_block_num,shop_id,item_id,item_price,item_cnt_day,ID
count,2935849.0,2935849.0,2935849.0,2935849.0,2935849.0,2935849.0
mean,14.57,33.0,10197.23,890.85,1.24,40.0
std,9.42,16.23,6324.3,1729.8,2.62,17.1
min,0.0,0.0,0.0,-1.0,-22.0,0.0
25%,7.0,22.0,4476.0,249.0,1.0,28.0
50%,14.0,31.0,9343.0,399.0,1.0,40.0
75%,23.0,47.0,15684.0,999.0,1.0,55.0
max,33.0,59.0,22169.0,307980.0,2169.0,83.0


In [6]:
# Time period of the dataset
print(f"Min date from train set: {train['date'].min().date()}")
print(f"Max date from train set: {train['date'].max().date()}")

Min date from train set: 2013-01-01
Max date from train set: 2015-12-10


In [7]:
# Leave only the "shop_id" and "item_id" that exist in the test set to have more accurate results
test_shop_ids = test["shop_id"].unique()
test_item_ids = test["item_id"].unique()

# Only shops that exist in test set.
train = train[train["shop_id"].isin(test_shop_ids)]

# Only items that exist in test set.
train = train[train["item_id"].isin(test_item_ids)]

**Data Preprocessing**

*   Drop all features but "item_cnt_day" because I'll be using only it as a univariate time-series.
*   We are asked to predict total sales for every product and store in the next month, and our data is given by day, so let's aggregate the data by month
*   Leave only monthly "item_cnt" >= 0 and <= 20, as this seems to be the distributions of the test set.



In [8]:
train_monthly = train[['date', 'date_block_num', 'shop_id', 'item_id', 'item_cnt_day']]
train_monthly = train_monthly.sort_values('date').groupby(['date_block_num', 'shop_id', 'item_id'], as_index=False)
train_monthly = train_monthly.agg({'item_cnt_day': ['sum']})
train_monthly.columns = ['date_block_num', 'shop_id', 'item_id', 'item_cnt']
train_monthly = train_monthly.query('item_cnt >= 0 and item_cnt <= 20')

# Label
train_monthly['item_cnt_month'] = train_monthly.sort_values('date_block_num').groupby(['shop_id', 'item_id'])['item_cnt'].shift(-1)

display(train_monthly.head(10))
display(train_monthly.describe())

Unnamed: 0,date_block_num,shop_id,item_id,item_cnt,item_cnt_month
0,0,2,33,1,2.0
1,0,2,482,1,1.0
2,0,2,491,1,1.0
3,0,2,839,1,1.0
4,0,2,1007,3,1.0
5,0,2,1010,1,1.0
6,0,2,1023,2,1.0
7,0,2,1204,1,
8,0,2,1224,1,
9,0,2,1247,1,


Unnamed: 0,date_block_num,shop_id,item_id,item_cnt,item_cnt_month
count,593829.0,593829.0,593829.0,593829.0,482536.0
mean,20.18,32.07,10015.02,2.1,2.07
std,9.14,16.9,6181.82,2.31,2.17
min,0.0,2.0,30.0,0.0,0.0
25%,13.0,19.0,4418.0,1.0,1.0
50%,22.0,31.0,9171.0,1.0,1.0
75%,28.0,47.0,15334.0,2.0,2.0
max,33.0,59.0,22167.0,20.0,20.0


**Time-series processing**

*   As I only need the "item_cnt" feature as a series, I can get that easily by just using a pivot operation.
*   This way I'll also get the missing months from each "shop_id" and "item_id", and then replace them with 0 (otherwise would be "nan").



In [9]:
monthly_series = train_monthly.pivot_table(index=['shop_id', 'item_id'], columns='date_block_num', values='item_cnt', fill_value=0).reset_index()
monthly_series.head()

date_block_num,shop_id,item_id,0,1,2,3,4,5,6,7,...,24,25,26,27,28,29,30,31,32,33
0,2,30,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,31,0,4,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,2,32,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0
3,2,33,1,0,0,0,0,0,0,0,...,0,1,0,1,1,0,1,0,1,0
4,2,53,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


Currently I have one series (33 months) for each unique pair of "shop_id" and "item_id", but probably would be better to have multiple smaller series for each unique pair, so I'm generating multiple series of size 12 (one year) for each unique pair.

In [10]:
first_month = 20
last_month = 33
serie_size = 12
data_series = []

for index, row in monthly_series.iterrows():
  for month1 in range((last_month - (first_month + serie_size)) + 1):
    serie = [row["shop_id"], row["item_id"]]
    for month2 in range(serie_size + 1):
      serie.append(row[month1 + first_month + month2])
    data_series.append(serie)

columns = ['shop_id', 'item_id']
[columns.append(i) for i in range(serie_size)]
columns.append('label')

data_series = pd.DataFrame(data_series, columns=columns)
data_series.head()

Unnamed: 0,shop_id,item_id,0,1,2,3,4,5,6,7,8,9,10,11,label
0,2,30,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2,30,0,0,0,0,0,0,0,0,0,0,0,0,0
2,2,31,0,0,0,0,0,0,0,0,0,0,0,0,0
3,2,31,0,0,0,0,0,0,0,0,0,0,0,0,1
4,2,32,2,2,0,2,0,0,1,0,0,0,0,1,0


In [11]:
# Dropping identifier columnms as we don't need them anymore
data_series = data_series.drop(["item_id", 'shop_id'], axis=1)

**Train and Validation sets.**

In [12]:
labels = data_series["label"]
data_series.drop("label", axis=1, inplace=True)
train, valid, Y_train, Y_valid = train_test_split(data_series, labels.values, test_size=0.10, random_state=0)


In [13]:
print("Train set", train.shape)
print("Validation set", valid.shape)
train.head()

Train set (200327, 12)
Validation set (22259, 12)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
207604,0,0,0,0,0,0,0,0,0,0,0,0
45150,0,0,0,0,0,0,0,0,0,0,0,0
143433,0,0,4,2,1,2,2,1,0,0,0,1
202144,0,0,0,0,0,0,0,0,0,0,0,0
136088,0,0,0,0,0,0,0,1,0,0,1,0


**Reshape data**

*   Time-series shape (data points, time-steps, features).



In [14]:
X_train = train.values.reshape((train.shape[0], train.shape[1], 1))
X_valid = valid.values.reshape((valid.shape[0], valid.shape[1], 1))

print("Train set reshaped", X_train.shape)
print("Validation set reshaped", X_valid.shape)

Train set reshaped (200327, 12, 1)
Validation set reshaped (22259, 12, 1)


Start with regular RNN time-series approach

**Regular LSTM model.**

In [15]:
serie_size = X_train.shape[1] # 12
n_features = X_train.shape[2] # 1

epochs = 20
batch = 128
lr = 0.0001

lstm_model = Sequential()
lstm_model.add(L.LSTM(10, input_shape=(serie_size, n_features), return_sequences=True))
lstm_model.add(L.LSTM(6, activation='relu', return_sequences=True))
lstm_model.add(L.LSTM(1, activation='relu'))
lstm_model.add(L.Dense(10, kernel_initializer='glorot_normal', activation='relu'))
lstm_model.add(L.Dense(10, kernel_initializer='glorot_normal', activation='relu'))
lstm_model.add(L.Dense(1))
lstm_model.summary()

# adam = optimizers.Adam(lr)
lstm_model.compile(loss='mse', optimizer="adam")

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 12, 10)            480       
                                                                 
 lstm_1 (LSTM)               (None, 12, 6)             408       
                                                                 
 lstm_2 (LSTM)               (None, 1)                 32        
                                                                 
 dense (Dense)               (None, 10)                20        
                                                                 
 dense_1 (Dense)             (None, 10)                110       
                                                                 
 dense_2 (Dense)             (None, 1)                 11        
                                                                 
Total params: 1,061
Trainable params: 1,061
Non-trainabl

In [16]:
lstm_history = lstm_model.fit(X_train, Y_train, validation_data=(X_valid, Y_valid),
                              batch_size=batch,
                              epochs=epochs,
                              verbose=2)

Epoch 1/20
1566/1566 - 38s - loss: 1.2715 - val_loss: 1.1618 - 38s/epoch - 24ms/step
Epoch 2/20
1566/1566 - 31s - loss: 1.1803 - val_loss: 1.1442 - 31s/epoch - 20ms/step
Epoch 3/20
1566/1566 - 29s - loss: 1.1768 - val_loss: 1.1457 - 29s/epoch - 19ms/step
Epoch 4/20
1566/1566 - 29s - loss: 1.1684 - val_loss: 1.1471 - 29s/epoch - 19ms/step
Epoch 5/20
1566/1566 - 29s - loss: 1.1676 - val_loss: 1.1637 - 29s/epoch - 18ms/step
Epoch 6/20
1566/1566 - 31s - loss: 1.1668 - val_loss: 1.1537 - 31s/epoch - 20ms/step
Epoch 7/20
1566/1566 - 29s - loss: 1.1653 - val_loss: 1.1423 - 29s/epoch - 19ms/step
Epoch 8/20
1566/1566 - 29s - loss: 1.1646 - val_loss: 1.1510 - 29s/epoch - 19ms/step
Epoch 9/20
1566/1566 - 29s - loss: 1.1634 - val_loss: 1.1427 - 29s/epoch - 18ms/step
Epoch 10/20
1566/1566 - 29s - loss: 1.1621 - val_loss: 1.1403 - 29s/epoch - 19ms/step
Epoch 11/20
1566/1566 - 30s - loss: 1.1605 - val_loss: 1.1398 - 30s/epoch - 19ms/step
Epoch 12/20
1566/1566 - 29s - loss: 1.1604 - val_loss: 1.1456 -

**Autoencoder**

*   Now we will build an autoencoder to learn how to reconstruct the input, this way it internally learns the best way to represent the input in lower dimensions.
*   The reconstruct model is composed of an encoder and a decoder, the encoder is responsible for learning how to represent the input into lower dimensions and the decoder learns how to rebuild the smaller representations into the input again.
*   After the models is trained we can keep only the encoder part and we'll have a model that is able to do what we want.

**LSTM Autoencoder**


In [17]:
encoder_decoder = Sequential()
encoder_decoder.add(L.LSTM(serie_size, activation='relu', input_shape=(serie_size, n_features), return_sequences=True))
encoder_decoder.add(L.LSTM(6, activation='relu', return_sequences=True))
encoder_decoder.add(L.LSTM(1, activation='relu'))
encoder_decoder.add(L.RepeatVector(serie_size))
encoder_decoder.add(L.LSTM(serie_size, activation='relu', return_sequences=True))
encoder_decoder.add(L.LSTM(6, activation='relu', return_sequences=True))
encoder_decoder.add(L.TimeDistributed(L.Dense(1)))
encoder_decoder.summary()

encoder_decoder.compile(loss='mse', optimizer='adam')

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_3 (LSTM)               (None, 12, 12)            672       
                                                                 
 lstm_4 (LSTM)               (None, 12, 6)             456       
                                                                 
 lstm_5 (LSTM)               (None, 1)                 32        
                                                                 
 repeat_vector (RepeatVector  (None, 12, 1)            0         
 )                                                               
                                                                 
 lstm_6 (LSTM)               (None, 12, 12)            672       
                                                                 
 lstm_7 (LSTM)               (None, 12, 6)             456       
                                                      

In [18]:
encoder_decoder_history = encoder_decoder.fit(X_train, X_train,
                                              batch_size=batch,
                                              epochs=epochs,
                                              verbose=2)

Epoch 1/20
1566/1566 - 56s - loss: 1.1259 - 56s/epoch - 36ms/step
Epoch 2/20
1566/1566 - 51s - loss: 1.0040 - 51s/epoch - 32ms/step
Epoch 3/20
1566/1566 - 48s - loss: 0.9687 - 48s/epoch - 31ms/step
Epoch 4/20
1566/1566 - 49s - loss: 0.9436 - 49s/epoch - 31ms/step
Epoch 5/20
1566/1566 - 50s - loss: 0.9124 - 50s/epoch - 32ms/step
Epoch 6/20
1566/1566 - 48s - loss: 0.8938 - 48s/epoch - 31ms/step
Epoch 7/20
1566/1566 - 50s - loss: 0.9285 - 50s/epoch - 32ms/step
Epoch 8/20
1566/1566 - 48s - loss: 0.8753 - 48s/epoch - 30ms/step
Epoch 9/20
1566/1566 - 49s - loss: 0.8574 - 49s/epoch - 32ms/step
Epoch 10/20
1566/1566 - 49s - loss: 0.8619 - 49s/epoch - 31ms/step
Epoch 11/20
1566/1566 - 48s - loss: 0.8790 - 48s/epoch - 31ms/step
Epoch 12/20
1566/1566 - 50s - loss: 0.8805 - 50s/epoch - 32ms/step
Epoch 13/20
1566/1566 - 51s - loss: 0.9021 - 51s/epoch - 32ms/step
Epoch 14/20
1566/1566 - 48s - loss: 0.8777 - 48s/epoch - 31ms/step
Epoch 15/20
1566/1566 - 49s - loss: 0.8367 - 49s/epoch - 31ms/step
Epoc

The better the autoencoder is able to reconstruct the input the better it internally encodes the input, in other words if we have a good autoencoder we probably will have an equally good encoder.

Let's take a look at the layers of the encoder_decoder model

In [19]:
rpt_vector_layer = Model(inputs=encoder_decoder.inputs, outputs=encoder_decoder.layers[3].output)
time_dist_layer = Model(inputs=encoder_decoder.inputs, outputs=encoder_decoder.layers[5].output)
encoder_decoder.layers

[<keras.layers.rnn.lstm.LSTM at 0x7e7d21fe3190>,
 <keras.layers.rnn.lstm.LSTM at 0x7e7d21ff81f0>,
 <keras.layers.rnn.lstm.LSTM at 0x7e7d1e046080>,
 <keras.layers.reshaping.repeat_vector.RepeatVector at 0x7e7d21fe2290>,
 <keras.layers.rnn.lstm.LSTM at 0x7e7d1e06c7c0>,
 <keras.layers.rnn.lstm.LSTM at 0x7e7d1e047040>,
 <keras.layers.rnn.time_distributed.TimeDistributed at 0x7e7d1e0b85b0>]

About the autoencoders layers

**LSTM**

*   This is just a regular LSTM layer, a layer that is able to receive sequence data and learn based on it, nothing much.

**RepeatVector layer**

*   Here is something we don't usually see, this layers basically repeats it's input "n" times, the reason to us it is because the last layers from the encoder part (the layer with one neuron) don't return sequences, so it does not output a sequenced data, this way we can't just add another LSTM layer after it, we need a way to turn this output into a sequence of the same time-steps of the model input, this is where "RepeatVector" layers come in.
*   Outputs of the RepeatVector layers



In [20]:
rpt_vector_layer_output = rpt_vector_layer.predict(X_train[:1])
print("Repeat vector output shape", rpt_vector_layer_output.shape)
print("Repeat vector output sample")
print(rpt_vector_layer_output[0])

Repeat vector output shape (1, 12, 1)
Repeat vector output sample
[[0.03313876]
 [0.03313876]
 [0.03313876]
 [0.03313876]
 [0.03313876]
 [0.03313876]
 [0.03313876]
 [0.03313876]
 [0.03313876]
 [0.03313876]
 [0.03313876]
 [0.03313876]]


This is just the same value repeated some times to match the same shape of the model input.

**Time Distributed layer**

*   Sometimes used when you want to mix RNN layers with other kind of layers.
*   We could output the model with another LSTM layer with one neuron and "return_sequences=True" parameter, but using a "TimeDistributed" layer wrapping a "Dense" layer we will have the same weights for each outputted time-step.



In [21]:
time_dist_layer_output = time_dist_layer.predict(X_train[:1])
print("Time distributed output shape", time_dist_layer_output.shape)
print("Time distributed output sample")
print(time_dist_layer_output[0])

Time distributed output shape (1, 12, 6)
Time distributed output sample
[[9.1343522e-02 8.8544004e-03 0.0000000e+00 3.8176257e-02 0.0000000e+00
  0.0000000e+00]
 [1.5611199e-01 1.6443709e-02 0.0000000e+00 7.8091383e-02 0.0000000e+00
  4.0493463e-03]
 [2.0408258e-01 4.2056203e-02 0.0000000e+00 1.1919017e-01 0.0000000e+00
  8.4021660e-03]
 [2.4340208e-01 5.3615946e-02 0.0000000e+00 1.5772015e-01 0.0000000e+00
  1.2606783e-02]
 [2.7206373e-01 4.0622629e-02 0.0000000e+00 1.8889965e-01 0.0000000e+00
  2.0060046e-02]
 [2.8817618e-01 1.9783989e-02 0.0000000e+00 2.1213411e-01 0.0000000e+00
  3.0110430e-02]
 [2.9531121e-01 9.3388334e-03 0.0000000e+00 2.3003334e-01 0.0000000e+00
  3.9602276e-02]
 [2.9899374e-01 4.3473314e-03 0.0000000e+00 2.4550396e-01 0.0000000e+00
  4.6405323e-02]
 [3.0113742e-01 2.0082574e-03 0.0000000e+00 2.5873640e-01 0.0000000e+00
  5.0617564e-02]
 [3.0243355e-01 9.2315400e-04 0.0000000e+00 2.6946071e-01 0.0000000e+00
  5.3063884e-02]
 [3.0317521e-01 4.2295206e-04 0.000000

**Defining the encoding model**

*   What I want is to encode the whole series into a single value, so I need the output from the layer with a single neuron (in this case it's the third LSTM layer)
*   I'll take only the encoding part of the model and define it as a new one.



In [22]:
encoder = Model(inputs=encoder_decoder.inputs, outputs=encoder_decoder.layers[2].output)


In [23]:
# Now encode the train and validation time-series
train_encoded = encoder.predict(X_train)
validation_encoded = encoder.predict(X_valid)
print("Encoded time-series shape", train_encoded.shape)
print("Encoded time-series sample", train_encoded[0])

Encoded time-series shape (200327, 1)
Encoded time-series sample [0.03313876]
