# Demand Forecasting using LSTM
---
## Problem Statement
>Every retailer must plan ahead to meet the demand for goods at each store. An accurate demand forecast lets retailers predict which goods each store location will need. It also helps keep availability high for customers while minimising stock risk, and it supports capacity management, store staff planning, and similar tasks. <br>
The project uses an LSTM, a recurrent neural network architecture that is well suited to time-series data and widely used for forecasting.

## Dataset
> The dataset for this project is available on Kaggle. 

**Link:** https://www.kaggle.com/c/demand-forecasting-kernels-only/data?select=train.csv

### Table of Contents
#### 1. Environment Setup
#### 2. Dataset Gathering
#### 3. Exploratory Data Analysis
#### 4. Data Preprocessing
#### 5. Model Evaluation
#### 6. Performance Measurement

# 1. Environment Setup:
---
> In this step, we have imported all the libraries needed to solve the problem stated above.

In [None]:
import math
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from tensorflow.keras.models import Sequential
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import LSTM, Dense, Dropout

# 2. Dataset Gathering
---
> In this step, we have loaded the training and test datasets downloaded from Kaggle.

In [None]:
train = pd.read_csv("../input/demand-forecasting-kernels-only/train.csv")
test =  pd.read_csv("../input/demand-forecasting-kernels-only/test.csv")

# 3. Exploratory Data Analysis
---
> In this step, we have taken a closer look at the data and checked that it was loaded correctly in the previous step.

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.describe()

In [None]:
test.describe()

In [None]:
# Date range covered by the training set
print('Min date from train set: %s' % train['date'].min())
print('Max date from train set: %s' % train['date'].max())
# The forecast horizon is the number of distinct dates in the test set
lag_size = len(test['date'].unique())
print('Forecast lag size: ', lag_size)

In [None]:
daily_sales = train.groupby('date', as_index=False)['sales'].sum()
print(daily_sales)

In [None]:
daily_sales=daily_sales.reset_index()['sales']
print(daily_sales)
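
> The cell above keeps only the `sales` column, so anything plotted from it is indexed by row position rather than by calendar date. As a minimal sketch of an alternative, the dates can be parsed and kept as the index of the aggregated series. The toy frame and the names `toy` and `daily` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical toy data with the same column names as train.csv.
toy = pd.DataFrame({
    'date': ['2017-01-01', '2017-01-01', '2017-01-02', '2017-01-02'],
    'store': [1, 2, 1, 2],
    'sales': [5, 7, 6, 9],
})

# Parse the date strings so that pandas and matplotlib treat them as dates.
toy['date'] = pd.to_datetime(toy['date'])

# Summing per date returns a Series whose index is the date itself.
daily = toy.groupby('date')['sales'].sum()
print(daily)
```

> Plotting such a series with `plt.plot(daily)` would place real dates on the x-axis.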

#### Overall Daily Sales
> In this step, we have summed the sales for each date and plotted the resulting daily totals.

In [None]:
plt.figure(figsize=(25,13))
plt.plot(daily_sales, linewidth=1)
plt.xlabel('Date')
plt.ylabel('Sales')
plt.show()

#### Daily Sales by Store
> In this step, we have taken a sub-table for each store and grouped its sales values by date. Finally, we have plotted the daily sales of each individual store.

In [None]:
plt.figure(figsize=(25,13))
legend = []
for i in sorted(train['store'].unique()):  # iterate over the store IDs actually present
    store_sales=train.loc[train['store'] == i]
    store_sales=store_sales.groupby('date', as_index=False)['sales'].sum()
    store_sales=store_sales.reset_index()['sales']
    plt.plot(store_sales, linewidth=1)
    legend.append(('Store '+str(i)))
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend(legend, loc='upper left', ncol=1, fancybox=True, shadow=True)
plt.show()

#### Daily Sales by Item
> In this step, we have taken a sub-table for each item and grouped its sales values by date. Finally, we have plotted the daily sales of each individual item.

In [None]:
plt.figure(figsize=(25,13))
legend = []
for i in sorted(train['item'].unique()):  # iterate over the item IDs actually present
    item_sales=train.loc[train['item'] == i]
    item_sales=item_sales.groupby('date', as_index=False)['sales'].sum()
    item_sales=item_sales.reset_index()['sales']
    plt.plot(item_sales, linewidth=1)
    legend.append(('Item '+str(i)))
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend(legend, loc='upper right', ncol=1, bbox_to_anchor=[1.005, 1.04], fancybox=True, shadow=True)
plt.show()

# 4. Data Preprocessing:
---
> In this step, we have cleaned the data obtained in the previous steps and split it into training and validation sets.

#### Sub-sampling training set to get only the last year of data and reduce training time
> In this step, we have sub-sampled the training set to keep only the final year of data (2017 onwards) and so reduce our training time.

In [None]:
train.head()

In [None]:
train = train[(train['date'] >= '2017-01-01')]
train_gp = train.sort_values('date').groupby(['item', 'store', 'date'], as_index=False)
train_gp = train_gp.agg({'sales':['mean']})
train_gp.columns = ['item', 'store', 'date', 'sales']
train_gp.head()

In [None]:
train_gp

#### Transforming the data into a time series problem

> In this step, we have transformed the data into a supervised time series problem: each row holds a window of past values as inputs and a future value as the target, so that the model can use the recent past to look into the future.

In [None]:
def series_to_supervised(data, window=1, lag=1, dropnan=True):
    cols, names = list(), list()
    # Input sequence (t-n, ... t-1)
    for i in range(window, 0, -1):
        cols.append(data.shift(i))
        names += [('%s(t-%d)' % (col, i)) for col in data.columns]
    # Current timestep (t=0)
    cols.append(data)
    names += [('%s(t)' % (col)) for col in data.columns]
    # Target timestep (t=lag)
    cols.append(data.shift(-lag))
    names += [('%s(t+%d)' % (col, lag)) for col in data.columns]

    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # Drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
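
> To make the framing concrete, here is a small sketch that applies the same logic to a toy series. The function body is repeated only so that the example runs on its own, and the six-value `toy` frame is hypothetical rather than taken from the Kaggle data.

```python
import pandas as pd

def series_to_supervised(data, window=1, lag=1, dropnan=True):
    """Same framing logic as the cell above, repeated so the sketch runs alone."""
    cols, names = [], []
    for i in range(window, 0, -1):          # inputs: t-window ... t-1
        cols.append(data.shift(i))
        names += ['%s(t-%d)' % (col, i) for col in data.columns]
    cols.append(data)                       # input: t
    names += ['%s(t)' % col for col in data.columns]
    cols.append(data.shift(-lag))           # target: t+lag
    names += ['%s(t+%d)' % (col, lag) for col in data.columns]
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    return agg.dropna() if dropnan else agg

# Hypothetical toy series, not taken from the Kaggle data.
toy = pd.DataFrame({'sales': [10, 11, 12, 13, 14, 15]})
framed = series_to_supervised(toy, window=2, lag=1)
print(framed)
```

> With `window=2` and `lag=1`, only the three rows that have two past values and one future value survive the `dropna` step. The first of them pairs the inputs 10, 11, 12 with the target 13.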

#### Using the current timestep and the last 29 days to forecast 90 days ahead

> In this step, we are utilising the current timestep and the last 29 days, to forecast 90 days into the future.

In [None]:
window = 29
lag = lag_size
series = series_to_supervised(train_gp.drop('date', axis=1), window=window, lag=lag)
series.head()

#### Dropping rows whose item/store values differ from the shifted columns
> In this step, we have dropped rows whose window spans more than one item or store, that is, rows where the item or store at the current timestep differs from the one at the oldest shifted timestep.

In [None]:
last_item = 'item(t-%d)' % window
last_store = 'store(t-%d)' % window
series = series[(series['store(t)'] == series[last_store])]
series = series[(series['item(t)'] == series[last_item])]
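
> The effect of this filter is easier to see on a toy frame. The sketch below stacks two hypothetical stores, builds a one-step window by hand, and applies the same kind of check. It also adds a second check on the target column, which the cell above does not perform; whether that extra check is wanted here is an assumption on our part, not something the original pipeline does.

```python
import pandas as pd

# Two hypothetical stores stacked one after the other, as in train_gp.
toy = pd.DataFrame({
    'store': [1, 1, 1, 2, 2, 2],
    'sales': [10, 11, 12, 50, 51, 52],
})
window, lag = 1, 1

# Build the shifted frame by hand: past step, current step, target step.
framed = pd.concat(
    [toy.shift(window).add_suffix('(t-%d)' % window),
     toy.add_suffix('(t)'),
     toy.shift(-lag).add_suffix('(t+%d)' % lag)],
    axis=1,
).dropna()

# Check used above: the oldest input step belongs to the current store.
same_input = framed['store(t)'] == framed['store(t-%d)' % window]
# Extra, assumed check: the target step belongs to the current store too.
same_target = framed['store(t)'] == framed['store(t+%d)' % lag]

input_only = framed[same_input]
both = framed[same_input & same_target]
print(input_only)
print(both)
```

> In this toy case the input-only check still keeps one row whose inputs come from store 1 while its target comes from store 2. Adding the target check removes it.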

In [None]:
# Removing unnecessary columns
columns_to_drop = [('%s(t+%d)' % (col, lag)) for col in ['item', 'store']]
for i in range(window, 0, -1):
    columns_to_drop += [('%s(t-%d)' % (col, i)) for col in ['item', 'store']]
series.drop(columns_to_drop, axis=1, inplace=True)
series.drop(['item(t)', 'store(t)'], axis=1, inplace=True)

#### Splitting the dataset into Training and Validation sets
> In this step, we have split the dataset into training and validation sets for further development.

In [None]:
labels_col = 'sales(t+%d)' % lag_size
labels = series[labels_col]
series = series.drop(labels_col, axis=1)

X_train, X_valid, Y_train, Y_valid = train_test_split(series, labels.values, test_size=0.4, random_state=42)
print('Train set shape', X_train.shape)
print('Validation set shape', X_valid.shape)
X_train.head()
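
> `train_test_split` shuffles rows by default, so windows that overlap in time can land on both sides of the split. One alternative is to hold out the most recent rows instead. The sketch below is a minimal version of that idea. It assumes the rows of a single series are already ordered by time; for the stacked item/store frame used here it would have to be applied per group or on a retained date column. The helper `chronological_split` is hypothetical.

```python
import numpy as np

def chronological_split(X, y, valid_fraction=0.4):
    """Hold out the last `valid_fraction` of rows without shuffling."""
    n_valid = int(round(len(X) * valid_fraction))
    split_at = len(X) - n_valid
    return X[:split_at], X[split_at:], y[:split_at], y[split_at:]

# Hypothetical toy data: 10 time-ordered samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_tr, X_va, y_tr, y_va = chronological_split(X, y, valid_fraction=0.4)
print(X_tr.shape, X_va.shape)
```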

In [None]:
X_train_series = X_train.values.reshape((X_train.shape[0], X_train.shape[1], 1))
X_valid_series = X_valid.values.reshape((X_valid.shape[0], X_valid.shape[1], 1))
print('Train set shape', X_train_series.shape)
print('Validation set shape', X_valid_series.shape)
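
> The reshape above turns each row into a sequence of shape `(timesteps, features)` with a single feature per timestep, which is the three-dimensional layout that Keras recurrent layers expect. A small sketch on a hypothetical array shows what happens to the values:

```python
import numpy as np

# Hypothetical flat input: 4 samples, each a row of 3 consecutive sales values.
flat = np.arange(12).reshape(4, 3)

# Add a trailing axis so each timestep carries exactly one feature.
seq = flat.reshape((flat.shape[0], flat.shape[1], 1))

print(seq.shape)
print(seq[1, :, 0])  # the second sample, read back as a plain sequence
```

> The values and their order are unchanged; only the shape gains a feature axis of length 1.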

# 5. Model Evaluation:
---
> In this step, we have chosen LSTM layers for our model because they tend to perform well on problems such as this one. An LSTM treats its input as an ordered sequence, so it can learn temporal patterns (assuming they exist) that models ignoring order would miss, including patterns that span long sequences.

In [None]:
model_lstm = Sequential()
model_lstm.add(LSTM(50, activation='relu', input_shape=(X_train_series.shape[1], X_train_series.shape[2])))
model_lstm.add(Dense(1))
model_lstm.compile(loss='mse', optimizer='adam')
model_lstm.summary()

In [None]:
model_lstm.compile(optimizer='adam', loss='mse')

In [None]:
lstm_history = model_lstm.fit(X_train_series, Y_train, validation_data=(X_valid_series, Y_valid), epochs=40, verbose=2)

# 6. Performance Measurement
---
> In this step, we have measured the performance of the model.

In [None]:
plt.plot(lstm_history.history['loss'])
plt.plot(lstm_history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train Loss', 'Validation Loss'], fancybox=True, shadow=True)
plt.show()

### Performance on Validation Data
> In this step, we have used the held-out validation data and made the model predict its sales values, to check the model's performance.

In [None]:
# Predicting sales for the validation set
predicted_sales = model_lstm.predict(X_valid_series)

# Flatten the 2-dimensional prediction array so it can be plotted with matplotlib
Y_valid = Y_valid.flatten()
predicted_sales = predicted_sales.flatten()

In [None]:
plt.plot(Y_valid, color='black', label="Actual Sales")
plt.plot(predicted_sales, color='green', label="Predicted Sales")
plt.title("Actual Sales vs Predicted Sales")
plt.xlabel("Validation sample")
plt.ylabel("Sales")
plt.legend(fancybox=True, shadow=True)
plt.show()
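
> The notebook imports `math` and `mean_squared_error` but never reports a numeric score. A single number such as the root mean squared error (RMSE) can complement the plots. The sketch below computes it with NumPy on hypothetical toy values; the helper `rmse` is our own illustration, not part of the original pipeline.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical toy values, not model output.
toy_actual = [10.0, 12.0, 14.0]
toy_predicted = [11.0, 12.0, 12.0]

score = rmse(toy_actual, toy_predicted)
print('RMSE: %.4f' % score)
```

> On the real data the same helper could be called as `rmse(Y_valid, predicted_sales)` after the flattening step. The result is in the same unit as the sales values.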