# Store Sales - Time Series Forecasting📈📉

### This notebook covers data cleaning, exploratory data analysis, and data visualization across different factors of the data.
### The main technique used is a recurrent neural network, specifically an LSTM.

# Importing Essential Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
sns.set_style('darkgrid')
import os

# Load the dataset

In [None]:
df_holi = pd.read_csv('../input/store-sales-time-series-forecasting/holidays_events.csv')
df_oil = pd.read_csv('../input/store-sales-time-series-forecasting/oil.csv')
df_stores = pd.read_csv('../input/store-sales-time-series-forecasting/stores.csv')
df_test = pd.read_csv('../input/store-sales-time-series-forecasting/test.csv')
df_train = pd.read_csv('../input/store-sales-time-series-forecasting/train.csv')
df_transactions = pd.read_csv('../input/store-sales-time-series-forecasting/transactions.csv')

# Exploratory Data Analysis (Analyzing and cleaning)

This task comes with several files, so instead of inspecting each dataset by hand, let's create a function that reports the key characteristics of a given dataset (file).

The characteristics are:
- Data (first 5 records)
- Shape of the data
- Essential information
- Columns in the data
- Description (statistical)
- Datatypes of columns
- Presence of null values
- N/A values from the dataset (in pandas `isna()` is an alias of `isnull()`, so this repeats the null count)

In [None]:
# helper that prints the basic EDA summary of a dataframe
def eda_basic(df):
    print("\n >> Data <<\n\n")
    print(df.head())
    print("\n======================================\n")
    print("\n >> Shape <<")
    print(df.shape)
    print("\n======================================\n")
    print("\n >> Info <<")
    df.info()  # info() prints its report directly and returns None
    print("\n======================================\n")
    print("\n >> Columns <<")
    print(df.columns)
    print("\n======================================\n")
    print("\n >> Description <<")
    print(df.describe())
    print("\n======================================\n")
    print("\n >> Datatypes <<")
    print(df.dtypes)
    print("\n======================================\n")
    print("\n >> Null values <<")
    print(df.isnull().sum())
    print("\n======================================\n")
    print("\n >> N/A values <<")
    print(df.isna().sum())
    print("\n======================================\n")
#     print(df.value_counts())

In [None]:
print("Basic EDA of holidays_events dataset\n")
eda_basic(df_holi)

In [None]:
print("Basic EDA of oil dataset\n")
eda_basic(df_oil)

In [None]:
print("Basic EDA of stores dataset\n")
eda_basic(df_stores)

In [None]:
print("Basic EDA of train dataset\n")
eda_basic(df_train)

In [None]:
print("Basic EDA of test dataset\n")
eda_basic(df_test)

In [None]:
print("Basic EDA of transactions dataset\n")
eda_basic(df_transactions)

**Date** is a very important factor in the dataset, so we parse it into a proper datetime type.

In [None]:
def date_form(df):
    df['date'] = pd.to_datetime(df['date'], format = "%Y-%m-%d")
#     df.head()
    

### Applying this function to every dataset that contains a **Date** column

Every dataset except **Stores** contains a **Date** column.

In [None]:
# Applying the date_form function to each dataset
date_form(df_holi)
date_form(df_oil)
date_form(df_train)
date_form(df_test)
date_form(df_transactions)

In [None]:
# df_holi.head()
# df_oil.head()
# df_train.head()
# df_test.head()
# df_transactions.head()

# Visualization 

Here we can look at some variables and see how they depend on each other. First, let's check **how the oil price depends on the date**.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(20,10))
df_oil.plot.line(x="date", y="dcoilwtico", color="b", ax=axes, rot=0)
plt.title("Dependency of the oil price on the date")
plt.show()

As we have so many rows in our dataset, it is easier to group the data, for example by week or month. The aggregation is done with the **mean**.

In [None]:
def grouped(df,key,freq,col):
    df_grouped = df.groupby([pd.Grouper(key=key, freq=freq)]).agg(mean = (col, 'mean'))
    df_grouped = df_grouped.reset_index()
    return df_grouped
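
To see what this kind of grouping returns, here is a minimal sketch on made-up data. The `toy` frame and its `value` column are hypothetical and not part of the competition files; the call pattern mirrors `grouped`, with `'W'` producing weekly bins that end on Sunday.

```python
import pandas as pd

# Hypothetical toy data: 14 consecutive days starting on a Monday,
# where the value is simply the day number (1..14).
toy = pd.DataFrame({
    "date": pd.date_range("2021-01-04", periods=14, freq="D"),
    "value": range(1, 15),
})

# Same pattern as `grouped`: bin by week, then take the mean of each bin.
weekly = (
    toy.groupby(pd.Grouper(key="date", freq="W"))
       .agg(mean=("value", "mean"))
       .reset_index()
)
print(weekly)
```

Each output row is labelled with the last day of its bin, so the 14 daily rows collapse into two weekly means.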

Grouped data on transactions dataset

In [None]:
df_grouped_trans_w = grouped(df_transactions, 'date', 'W', 'transactions')
df_grouped_trans_w

To model the trend, we'll also add a **time** column (a simple step counter 0, 1, 2, ...) to the grouped dataframe.

In [None]:
def add_time(df, key, freq, col):
    df_grouped = grouped(df, key,freq, col)
    df_grouped['time'] = np.arange(len(df_grouped.index))
    column_time = df_grouped.pop('time')
    df_grouped.insert(1, 'time', column_time)
    return df_grouped
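
The point of the `time` column is that a linear regression on it estimates the trend. A minimal sketch, assuming a made-up series with a perfectly linear trend and using `np.polyfit` in place of the seaborn regression line plotted later:

```python
import numpy as np

# Hypothetical series: starts at 5 and grows by 10 per time step.
time = np.arange(6)
sales_mean = 5.0 + 10.0 * time

# Degree-1 least-squares fit: sales_mean is approximated by slope * time + intercept.
slope, intercept = np.polyfit(time, sales_mean, deg=1)
print(slope, intercept)
```

On real data the fit will not be exact, but the slope still summarizes how much the mean changes per period.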

Now we can check the result of the grouping on **df_train (sales grouped by week and by month, then averaged).**

In [None]:
df_grouped_train_w = add_time(df_train, 'date', 'W', 'sales')
df_grouped_train_m = add_time(df_train, 'date', 'M', 'sales')

In [None]:
df_grouped_train_w.head()

In [None]:
df_grouped_train_m.head()

Trend plots with a fitted **linear regression** line

In [None]:
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(30,20))

# Transactions (weekly): use the grouped transactions, not the grouped sales
axes[0].plot('date', 'mean', data=df_grouped_trans_w, color='grey', marker='o')
axes[0].set_title("Transactions (grouped by week)", fontsize=20)

# Sales (weekly)
axes[1].plot('time', 'mean', data=df_grouped_train_w, color='0.75')
axes[1].set_title('Sales (grouped by week)', fontsize=20)

# Linear regression
axes[1] = sns.regplot(x='time',
                     y='mean',
                     data = df_grouped_train_w,
                     scatter_kws = dict(color='0.75'),
                     ax = axes[1])

# Sales (Monthly)
axes[2].plot('time', 'mean', data=df_grouped_train_m, color='0.75')
axes[2].set_title('Sales [grouped by Month]', fontsize=20)

# Linear Regression
axes[2] = sns.regplot(x='time',
                     y = 'mean',
                     data = df_grouped_train_m,
                     scatter_kws = dict(color='0.75'),
                     line_kws={"color": "red"},
                     ax = axes[2])

plt.show()

## Lag feature

Lag features are values at prior timesteps that are considered useful because they are created on the assumption that what happened in the past can influence or contain a sort of intrinsic information about the future. For example, it can be beneficial to generate features for sales that happened in previous days at 4:00 p.m. if you want to predict similar sales at 4:00 p.m. the next day.
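
As a small illustration of the idea, `Series.shift` moves every value down by the requested number of rows, so each row ends up next to the value from the previous period. The numbers below are made up.

```python
import pandas as pd

# Hypothetical weekly means.
s = pd.DataFrame({"mean": [10.0, 12.0, 15.0, 11.0]})

# Lag of 1: each row gets the previous row's value; the first row has no past.
s["Lag"] = s["mean"].shift(1)
print(s)
```

The first row has no predecessor, so its lag is `NaN`; such rows are usually dropped or ignored before fitting.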

In [None]:
def add_lag(df, key, freq, col, lag):
    df_grouped = grouped(df, key, freq, col)
    # shift the series down by `lag` rows so each row also holds the value from `lag` periods earlier
    df_grouped['Lag'] = df_grouped['mean'].shift(lag)
    return df_grouped

In [None]:
df_grouped_train_w_lag1 = add_lag(df_train, 'date', 'W', 'sales', 1)
# the monthly frame must be grouped with 'M', not 'W'
df_grouped_train_m_lag1 = add_lag(df_train, 'date', 'M', 'sales', 1)

df_grouped_train_w_lag1.head()

So lag features let us fit curves to lag plots where each observation in a series is plotted against the previous observation. Let's build same plots, but with 'lag' feature:

In [None]:
fig,axes = plt.subplots(nrows = 2, ncols=1, figsize=(30,20))
axes[0].plot('Lag', 'mean', data=df_grouped_train_w_lag1,color="0.75",linestyle=(0,(1,10)))
axes[0].set_title('Sales (grouped by week)', fontsize=20)
axes[0] = sns.regplot(x='Lag',
                     y='mean',
                     data = df_grouped_train_w_lag1,
                     scatter_kws= dict(color='0.75'),
                     ax = axes[0])

axes[1].plot('Lag', 'mean', data=df_grouped_train_m_lag1, color="0.75",linestyle=(0,(1,10)))
axes[1].set_title("Sales (grouped by month)", fontsize=20)
axes[1] = sns.regplot(x='Lag',
                     y='mean',
                     data = df_grouped_train_m_lag1,
                     scatter_kws = dict(color='0.75'),
                     line_kws={'color':'red'},
                     ax = axes[1])

plt.show()

Exploring and visualizing the data from a statistical perspective

In [None]:
def plot_stats(df, column, ax,color,angle):
    count_classes = df[column].value_counts()
    ax = sns.barplot(x=count_classes.index, y=count_classes, ax=ax, palette=color)
    ax.set_title(column.upper(), fontsize=20)
    for tick in ax.get_xticklabels():
        tick.set_rotation(angle)

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15,5))
fig.autofmt_xdate()
fig.suptitle("Stats of df_holidays".upper())
plot_stats(df_holi, "type", axes[0], "pastel", 45)
plot_stats(df_holi, "locale", axes[1], "rocket", 45)
plt.show()

Value counts of some columns of df_stores

In [None]:
fig, axes = plt.subplots(nrows = 4, ncols=1, figsize=(20,40))
plot_stats(df_stores, "city", axes[0], "mako_r", 45)
plot_stats(df_stores, "state", axes[1], "rocket_r", 45)
plot_stats(df_stores, "type", axes[2], "magma", 0)
plot_stats(df_stores, "cluster", axes[3], "viridis", 0)

Let's **plot pie** chart for **'family'** of **df_train**

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(20,10))
count_classes = df_train['family'].value_counts()
plt.title("Stats of df_train".upper())
colors = ['#ff9999','#66b3ff','#99ff99',
          '#ffcc99', '#ffccf9', '#ff99f8', 
          '#ff99af', '#ffe299', '#a8ff99',
          '#cc99ff', '#9e99ff', '#99c9ff',
          '#99f5ff', '#99ffe4', '#99ffaf']

plt.pie(count_classes, 
        labels = count_classes.index, 
        autopct='%1.1f%%',
        shadow=True, 
        startangle=90, 
        colors=colors)

plt.show()

# Building the forecasting model

Let's focus on the **family** factor

# Data Preprocessing

In [None]:
df_train["family"].nunique(dropna=True)

In [None]:
df_test.head()

In [None]:
# dropping the onpromotion column because it won't be used

train_data = df_train.copy().drop(['onpromotion'], axis=1)
test_data = df_test.copy().drop(['onpromotion'], axis=1)

#### Encoding the family feature

In [None]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler

In [None]:
ordinal_encoder = OrdinalEncoder(dtype=int)
train_data[['family']] = ordinal_encoder.fit_transform(train_data[['family']])
test_data[['family']] = ordinal_encoder.transform(test_data[['family']])

In [None]:
train_data

In [None]:
#counting number of days
n_o_days_train=train_data["date"].nunique(dropna = False) 
print('number of day train:',n_o_days_train)

# number of store
n_o_stores_train=train_data["store_nbr"].nunique(dropna = False) 
print('number of stores train:',n_o_stores_train)

# number of family
n_o_families_train=train_data["family"].nunique(dropna = False) 
print('number of family/type of prod train:',n_o_families_train)

In [None]:
##counting the number of days
n_o_days_test=test_data["date"].nunique(dropna = False) 
print('number of day test:',n_o_days_test)

# number of store
n_o_stores_test=test_data["store_nbr"].nunique(dropna = False) 
print('number of stores test:',n_o_stores_test)

# number of family
n_o_families_test=test_data["family"].nunique(dropna = False) 
print('number of family/type of prod test:',n_o_families_test)

The data needs to be reorganized as discrete-time data (one row per day): the date is the time-series index, each (store number, family) pair becomes a column, and sales is the numerical value the RNN will model.
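
A minimal sketch of this long-to-wide reshaping on a made-up frame with two days, two stores and two encoded families. The values are hypothetical; the `pivot` call has the same shape as the one used on `train_data`.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, store, family).
long_df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01"] * 4 + ["2021-01-02"] * 4),
    "store_nbr": [1, 1, 2, 2] * 2,
    "family": [0, 1, 0, 1] * 2,
    "sales": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})

# Wide format: one row per date, one column per (store_nbr, family) pair.
wide = long_df.pivot(index="date", columns=["store_nbr", "family"], values="sales")
print(wide)

# Selecting store 1, family 0 gives that pair's sales over time.
print(wide[1][0])
```

The columns form a two-level index, which is why a single series can be picked with two chained selections.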

In [None]:
pivoted_train = train_data.pivot(index=['date'], columns=['store_nbr', 'family'], values='sales')
pivoted_train.head()

Let's check store number 1 and product family 0 (the encoded family label)

In [None]:
pivoted_train[1][0]

## Splitting the data into train and validation

In [None]:
train_samples = int(n_o_days_train*0.95)
train_samples

In [None]:
train_samples_df = pivoted_train[:train_samples]
train_samples_df

In [None]:
valid_samples_df = pivoted_train[train_samples:]
valid_samples_df

### Scaling the data

In [None]:
minmax = MinMaxScaler()
minmax.fit(train_samples_df)

scaled_train_samples = minmax.transform(train_samples_df)
scaled_val_samples = minmax.transform(valid_samples_df)
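
Because the scaler is fitted on the training rows only, validation values can fall outside [0, 1]. The sketch below re-implements the default min-max formula `(x - min) / (max - min)` by hand in NumPy on made-up numbers. It is an illustration of the arithmetic rather than the scikit-learn code itself, and it assumes no column is constant.

```python
import numpy as np

# Hypothetical data: 3 training rows and 1 validation row, 2 features.
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
valid = np.array([[12.0, 5.0]])

# Statistics come from the training rows only.
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

scaled_train = (train - col_min) / col_range
scaled_valid = (valid - col_min) / col_range  # may leave the [0, 1] interval

# Inverting the scaling recovers the original values.
restored = scaled_valid * col_range + col_min
print(scaled_valid, restored)
```

Fitting on the training split alone avoids leaking information from the validation period into the model inputs.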

In [None]:
scaled_train_samples[10:]

In [None]:
scaled_val_samples[10:]


A sliding window converts the series into (past, future) samples that can be used with a supervised learning algorithm.

In [None]:
# n_past --> no. of past observations (model input)
# n_future --> no. of future observations (model target)

def split_series(series, n_past, n_future):
    X, y = list(), list()
    for window_start in range(len(series)):
        past_end = window_start + n_past
        future_end = past_end + n_future
        if future_end > len(series):
            break
            
        # slicing past and future
        past, future = series[window_start:past_end,:], series[past_end:future_end,:]
        X.append(past)
        y.append(future)
    
    return np.array(X), np.array(y)

n_past = 16
n_future = 16
n_features = n_o_stores_train * n_o_families_train # num of features
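
To make the resulting shapes concrete, here is a compact sketch of the same windowing on a made-up series of 10 time steps and 2 features. The helper `window_count` is hypothetical and only spells out how many windows fit.

```python
import numpy as np

def window_count(n_rows, n_past, n_future):
    # A window needs n_past + n_future consecutive rows.
    return max(0, n_rows - n_past - n_future + 1)

# Hypothetical series: 10 time steps, 2 features.
series = np.arange(20).reshape(10, 2)
n_past, n_future = 3, 2

n_windows = window_count(len(series), n_past, n_future)
X = np.array([series[i:i + n_past] for i in range(n_windows)])
y = np.array([series[i + n_past:i + n_past + n_future] for i in range(n_windows)])

print(X.shape)  # (windows, n_past, features)
print(y.shape)  # (windows, n_future, features)
```

With 16 past and 16 future days, a split therefore needs at least 32 rows to produce a single sample.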

Now converting the data via split_series function

In [None]:
X_train, y_train = split_series(scaled_train_samples, n_past, n_future)
X_val, y_val = split_series(scaled_val_samples, n_past, n_future)

In [None]:
print('X_train.shape',X_train.shape)
print('y_train.shape',y_train.shape)
print('X_val.shape',X_val.shape)
print('y_val.shape',y_val.shape)

# Training the model - LSTM

In [None]:
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.layers import Dropout, BatchNormalization, TimeDistributed
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

In [None]:
model = Sequential()

model.add(LSTM(units=256, return_sequences=True,input_shape=[n_past, n_features]))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(LSTM(units=128, return_sequences=True))
model.add(BatchNormalization())
model.add(Dropout(0.2))
#TimeDistributed layer
model.add(TimeDistributed(Dense(n_features)))

model.compile(loss="mae", optimizer=Adam(learning_rate=0.001), metrics=['mae'])

In [None]:
model.summary()

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_mae', 
                           min_delta=0.0001,
                           patience=100,
                           restore_best_weights=True)

epochs= 1000

model_history = model.fit(X_train, y_train, 
                          validation_data=(X_val, y_val),
                          epochs = epochs,
                          callbacks = [early_stop],
                          batch_size=512,
                          shuffle=True)

In [None]:
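# use the History object returned by fit and plot matching quantities (training vs validation loss)
plt.plot(model_history.history['loss'])
plt.plot(model_history.history['val_loss'])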
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend(['Train', 'Validation'])
plt.show()

From the graph above we can say that the model trained well, provided both curves decrease and level off close to each other.

In [None]:
X_test_pred = scaled_val_samples[-n_past:,:].reshape((1, n_past, n_features))
print(X_test_pred.shape)
scaled_test_predict = model.predict(X_test_pred)

In [None]:
scaled_test_predict.shape

In [None]:
X_train_pred = scaled_train_samples[-n_past:,:].reshape((1, n_past, n_features))
print(X_train_pred.shape)
scaled_train_predict = model.predict(X_train_pred)

In [None]:
scaled_train_predict.shape

In [None]:
# Inverse transform from the previous min max scaler
y_predict = pd.DataFrame(minmax.inverse_transform(scaled_test_predict.reshape((n_future, n_features))),columns=valid_samples_df.columns)

In [None]:
y_predict

In [None]:
pivoted_test = test_data.pivot(index=['date'], columns=['store_nbr', 'family'], values=None)
pivoted_test

In [None]:
pivoted_test.values

In [None]:
pivoted_train.values

# Submitting resulting csv file for Kaggle competition

In [None]:
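# index by id so that submission.at[sample_id, 'sales'] updates the matching row instead of appending a new one
submission = pd.read_csv('../input/store-sales-time-series-forecasting/sample_submission.csv', index_col='id')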

In [None]:
# submission

In [None]:
## mapping y_predict to the pivoted test data
for day_ith, day_ith_pred in y_predict.iterrows():
    # day_ith: day index, 16 days in total
    # day_ith_pred: predicted sales for every (store, family) column on that day
    # iterrows() iterates over DataFrame rows as (index, Series) pairs
    for n_samples_per_day in range(len(day_ith_pred)): # column position, from 0 to 1781, for each of the 16 days
        sample_id = pivoted_test.iloc[[day_ith], [n_samples_per_day]].values[0][0] # id of the matching test row
        values = max(0, day_ith_pred.values[n_samples_per_day]) # negative sales predictions are set to 0
        submission.at[sample_id, 'sales'] = values
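
The same mapping can also be done without Python loops. The sketch below is an alternative under one assumption: the id table and the prediction table share the same row and column order, which the positional loop above relies on as well. All names and numbers here are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical (store_nbr, family) columns shared by both tables.
cols = pd.MultiIndex.from_product([[1, 2], [0, 1]], names=["store_nbr", "family"])

# Two days of test ids and the matching predictions, in the same layout.
ids = pd.DataFrame([[100, 101, 102, 103], [104, 105, 106, 107]], columns=cols)
preds = pd.DataFrame([[1.5, -0.2, 3.0, 0.0], [2.5, 4.0, -1.0, 0.5]], columns=cols)

# Flatten both tables in the same order and clip negative predictions to 0.
sales = pd.Series(
    np.clip(preds.to_numpy().ravel(), 0, None),
    index=pd.Index(ids.to_numpy().ravel(), name="id"),
    name="sales",
)
print(sales)
```

Flattening row by row keeps every id aligned with its prediction, and `np.clip` plays the role of `max(0, ...)` in the loop.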

In [None]:
submission

In [None]:
submission.to_csv('submission.csv')

## Thank You!