> # Predict Future Sales

> ## Introduction

This is a Kaggle dataset of five CSV files with daily sales data, kindly provided by one of the largest Russian software firms, 1C Company. The sales data covers January 2013 to October 2015 and is split mainly by product sold and by shop.

My challenge is to predict future sales in a time series.

(The challenge of the Kaggle competition is to predict total sales for every product and store in the next month for the test set, and to build a robust model that can handle slight monthly changes in the list of shops and products.)

### Agenda

1. Libraries 

2. Exploratory Data Analysis

3. Predict Future Sales

    3.1. Analysing historical data
    
    3.2. Test Stationarity with Augmented Dickey-Fuller Test
    
    3.3. Forecast Time Series with ARIMA
    
    3.4. Fitting the SARIMAX model
    
    3.5. Validating forecasts
    
    3.6. Visualization of the forecast   
    
4. Focusing on certain shops

    4.1. Offline Shops: sold products per day
    
    4.2. Online Shop: sold products per day
    
        4.2.1. Analysing historical data
    
        4.2.2. Test Stationarity with Augmented Dickey-Fuller Test
    
        4.2.3. Forecast Time Series with ARIMA
    
        4.2.4. Fitting the SARIMAX model
    
        4.2.5. Validating forecasts
    
        4.2.6. Visualization of the forecast 
        
    4.3. Extra Analysis


### 1. Libraries 

In [None]:
import pandas as pd
import numpy as np
#from googletrans import Translator

from itertools import product
import itertools

import matplotlib.pyplot as plt
import statsmodels.api as sm
import matplotlib
import seaborn as sns

from sklearn.metrics import mean_squared_error
from sklearn import metrics
from statsmodels.tsa.stattools import adfuller,pacf
from statsmodels.tsa.arima.model import ARIMA  # the old statsmodels.tsa.arima_model module was removed in newer releases
from statsmodels.graphics.gofplots import qqplot
from statsmodels.graphics.tsaplots import plot_pacf, plot_acf

from pylab import rcParams
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'

import statsmodels.formula.api as smf
import statsmodels.tsa.api as smt
import statsmodels.api as sm
import scipy.stats as scs
from pandas.plotting import autocorrelation_plot

from statsmodels.tsa.seasonal import seasonal_decompose
from scipy import stats

### 2. Exploratory Data Analysis

#### Translate (item name, item category and) shop name into English

In [None]:
"""
import re
translator = Translator()
translate = ["item_name","item_category_name","shop_name"]
shops = pd.read_csv("shops.csv")
shops_lst = list(shops.shop_name.unique())
shops["shop_name_en"] = shops["shop_name"].apply(translator.translate, src = "ru", dest = "en").apply(getattr, args=('text',))

shops = shops.drop(columns = {"shop_name"})
shop_lst = list(shops.shop_name_en)
list_of_list_shops = [re.findall(r'[a-zA-Z]+', i) for i in shop_lst]
shops["City"] = [list_of_list_shops[i][0] +" "+ list_of_list_shops[i][1] 
                 if ((list_of_list_shops[i][0] == "St") |(list_of_list_shops[i][0] == "Itinerant") | 
                                                                             (list_of_list_shops[i][0] =="Digital"))
                 else list_of_list_shops[i][0] + " "+ list_of_list_shops[i][1] +" "+ list_of_list_shops[i][2] 
                 if (list_of_list_shops[i][0] == "Shop")
                 else list_of_list_shops[i][0] for i in range(len(list_of_list_shops))]

shops.to_csv("shops_new.csv", sep = ";")
"""

Unfortunately, the items and item categories tables are too big to translate; the translations are also not necessary for this analysis.

#### Import datasets

In [None]:
test = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/test.csv")
item_categories = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/item_categories.csv")
sales_train = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/sales_train.csv")
items = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/items.csv")
shops = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/shops.csv")

#### Transform date to datetime

In [None]:
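# dates are day-first (see "Data fields"); without dayfirst=True, ambiguous dates are read month-first
sales_train.date = pd.to_datetime(sales_train.date, dayfirst=True)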

In [None]:
sales_train_after11 = sales_train[(sales_train["date"] >= "2015-11-01")]
sales_train_after11.date.unique()

In [None]:
sales_train = sales_train[(sales_train["date"] < "2015-11-01")]
sales_train.shape

In [None]:
sales_train.head()

##### Data fields

ID - an Id that represents a (Shop, Item) tuple within the test set

shop_id - unique identifier of a shop

item_id - unique identifier of a product

item_category_id - unique identifier of item category

item_cnt_day - number of products sold. You are predicting a monthly amount of this measure

item_price - current price of an item

date - date in format dd/mm/yyyy

date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33

item_name - name of item

shop_name - name of shop

item_category_name - name of item category
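The relation between `date`, `date_block_num` and the monthly target can be sketched on a few toy rows. This is only an illustration: the rows are invented, the dot-separated day-first date strings are an assumption about the raw file, and the column name `item_cnt_month` is a hypothetical label for the monthly sum.

```python
import pandas as pd

# Toy rows, invented for illustration; the real rows come from sales_train.csv.
toy = pd.DataFrame({
    "date": ["02.01.2013", "15.01.2013", "03.02.2013", "31.10.2015"],
    "shop_id": [5, 5, 5, 5],
    "item_id": [100, 100, 100, 100],
    "item_cnt_day": [1.0, 2.0, 4.0, 3.0],
})

# An explicit day-first format keeps 02.01.2013 as 2 January, not 1 February.
toy["date"] = pd.to_datetime(toy["date"], format="%d.%m.%Y")

# date_block_num counts months since January 2013 (block 0).
toy["date_block_num"] = (toy["date"].dt.year - 2013) * 12 + (toy["date"].dt.month - 1)

# Monthly amount of item_cnt_day per (month, shop, item).
monthly = (
    toy.groupby(["date_block_num", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
       .sum()
       .rename(columns={"item_cnt_day": "item_cnt_month"})  # hypothetical column name
)
print(monthly)
```

The last toy row falls in October 2015, which is block 33, matching the description of `date_block_num` above.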

#### Merge Datasets


Merge the datasets into one table with more information (columns). The join keys are:

item_categories.item_category_id = items.item_category_id

items.item_id = sales_train.item_id

shops.shop_id = sales_train.shop_id
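A small sketch of these three joins on invented toy tables may help. It assumes each lookup table has one row per key, which pandas can check through the `validate` argument of `merge`; the `_toy` frames and their values are made up for illustration.

```python
import pandas as pd

# Invented miniature versions of the four tables.
cats_toy = pd.DataFrame({"item_category_id": [10, 20],
                         "item_category_name": ["Games", "Music"]})
items_toy = pd.DataFrame({"item_id": [1, 2], "item_category_id": [10, 20]})
shops_toy = pd.DataFrame({"shop_id": [5, 6], "shop_name": ["Shop A", "Shop B"]})
sales_toy = pd.DataFrame({"item_id": [1, 1, 2],
                          "shop_id": [5, 6, 5],
                          "item_cnt_day": [1.0, 2.0, 1.0]})

# validate="many_to_one" raises if a key is duplicated in the lookup table,
# which would otherwise silently multiply sales rows.
items_full = items_toy.merge(cats_toy, on="item_category_id",
                             how="left", validate="many_to_one")
merged = (
    sales_toy.merge(items_full, on="item_id", how="left", validate="many_to_one")
             .merge(shops_toy, on="shop_id", how="left", validate="many_to_one")
)

# A left join on unique keys keeps exactly one output row per sales row.
print(merged.shape)
```

Comparing the row count before and after each merge is a cheap way to confirm that no sales rows were lost or duplicated.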


In [None]:
items = pd.merge(items, item_categories, on = "item_category_id")
items.shape

In [None]:
sales_train = pd.merge(sales_train, items, on = "item_id")
sales_train.head()

In [None]:
sales_train = pd.merge(sales_train, shops, on = "shop_id")
sales_train.head()

In [None]:
sales_train.shape

In [None]:
sales_train = sales_train[(sales_train["date"] < "2015-11-01")]
sales_train.shape

#### Bestseller Products

In [None]:
sales_per_product = sales_train.groupby("item_name", as_index=False).agg({"item_cnt_day":"sum"}).sort_values(by = "item_cnt_day", ascending = False)[0:10]


In [None]:
sales_per_product

In [None]:
# create the figure first; calling plt.figure() after the plot opens a new, empty figure
plt.figure(figsize=(20,10))
ax = sns.barplot(x = "item_cnt_day", y = "item_name", data = sales_per_product)
#sns.set_style("whitegrid")
ax.set_title("Bestseller",y= 1.1, fontsize=18, weight = "semibold")
ax.set_xlabel("# of products", fontsize = 18, weight = "semibold")
ax.set_ylabel("Products", fontsize = 18, weight = "semibold")
plt.tight_layout()


#### Shop with the highest amount of sold products

In [None]:
sales_per_shop = sales_train.groupby(by = "shop_name", as_index=False).agg({"item_cnt_day":"sum"}).sort_values(by = "item_cnt_day",ascending = False)[0:10]

In [None]:
ax = sns.barplot(x = "item_cnt_day", y = "shop_name", data = sales_per_shop, palette="gist_heat")
sns.set_style("whitegrid")

ax.set_title("Shops sorted by amount of sold products",y= 1.1, fontsize=20, weight = "semibold")
ax.set_xlabel("Amount of products", fontsize = 18, weight = "semibold")
ax.set_ylabel("shops", fontsize = 18, weight = "semibold")


In [None]:
sales_train["revenue"] = sales_train["item_cnt_day"]*sales_train["item_price"]

### 3. Predict Future Sales

Analyse the data based on the sold products (item_cnt_day) per day.
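What the aggregation in the following cells produces can be sketched on toy numbers (invented for illustration): daily totals are built first, and `resample('MS').mean()` then averages those daily totals within each calendar month. The result is therefore the mean number of items sold per day in a month, not the monthly total.

```python
import pandas as pd

# Invented toy sales: two rows on 1 January, one on 2 January, one on 1 February.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-01", "2013-01-02", "2013-02-01"]),
    "item_cnt_day": [1.0, 2.0, 5.0, 4.0],
})

# Step 1: total items sold per day.
daily = toy.groupby("date")["item_cnt_day"].sum()

# Step 2: per month ("MS" = month start), average or sum the daily totals.
monthly_mean = daily.resample("MS").mean()   # mean over days that have sales
monthly_sum = daily.resample("MS").sum()     # total items per month

print(monthly_mean)
print(monthly_sum)
```

Note that the mean is taken over days that appear in the data; days without any sales row do not count as zeros unless they are added explicitly.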

In [None]:
sales_train.date.min(), sales_train.date.max()

In [None]:
sales = sales_train.groupby('date')['item_cnt_day'].sum().reset_index()

In [None]:
sales = sales.set_index('date')

In [None]:
sales.index

In [None]:
sales.dtypes

In [None]:
y = sales['item_cnt_day'].resample('MS').mean()

In [None]:
y["2015":]

In [None]:
# plot the historical data: monthly mean of all products sold per day
y.plot(figsize=(15, 6))
plt.show()

In [None]:
coefficients, residuals, _, _, _ = np.polyfit(range(len(y.index)),y,1,full=True)
mse = residuals[0]/(len(y.index))
nrmse = np.sqrt(mse)/(y.max() - y.min())
print('Slope ' + str(coefficients[0]))
print('NRMSE: ' + str(nrmse))

In [None]:
# relative change from the first month (position 0) to the last month (position 33)
(y.iloc[33]-y.iloc[0])/y.iloc[0]

#### 3.1. Analysing historical data

Analysis regarding observation, trend, seasonality and residuals

In [None]:
rcParams['figure.figsize'] = 18, 8
# newer statsmodels versions use `period` instead of the former `freq` argument
decomposition = sm.tsa.seasonal_decompose(y, period=12, model='additive')
fig = decomposition.plot()
plt.show()

#### Results

The data runs from 2013-01-01 to 2015-10-31. It shows a yearly seasonality with a peak at the end of each year, and the trend is downward.
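To make the idea of an additive decomposition (observed = trend + seasonal + residual) concrete, here is a hand-rolled sketch on a synthetic series. It is not the statsmodels implementation: the function `additive_decompose` is a hypothetical helper, it assumes an even period, and the toy series (a downward line plus a year-end peak) is invented.

```python
import numpy as np
import pandas as pd

def additive_decompose(series, period=12):
    """Classical additive decomposition for an even period (illustrative sketch)."""
    n = len(series)
    half = period // 2
    # Centered moving average over one full period, half weight on both ends.
    weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    trend = pd.Series(np.nan, index=series.index)
    trend.iloc[half:n - half] = np.convolve(series.to_numpy(), weights, mode="valid")
    # Average the detrended values per position in the cycle, centered at zero.
    detrended = series - trend
    phase = np.arange(n) % period
    seasonal_means = detrended.groupby(phase).mean()
    seasonal_means = seasonal_means - seasonal_means.mean()
    seasonal = pd.Series(seasonal_means.to_numpy()[phase], index=series.index)
    resid = series - trend - seasonal
    return trend, seasonal, resid

# Synthetic monthly series: linear downward trend plus a zero-mean year-end peak.
months = np.arange(34)
true_seasonal = np.array([-3.0] * 10 + [5.0, 25.0])
index = pd.date_range("2013-01-01", periods=34, freq="MS")
y_toy = pd.Series(120.0 - 1.5 * months + true_seasonal[months % 12], index=index)

trend, seasonal, resid = additive_decompose(y_toy)
print(seasonal.iloc[:12].round(2).tolist())
```

The moving average cannot be computed for the first and last six months, so the trend and residual are missing there; with only 34 observations that leaves fewer than two full years of trend estimates.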

#### 3.2. Test Stationarity with Augmented Dickey-Fuller Test

The Augmented Dickey-Fuller (ADF) test checks whether the series of sold items is stationary. Null hypothesis (H0): the time series has a unit root, meaning it is non-stationary and has some time-dependent structure. If H0 cannot be rejected, the series should be treated as non-stationary; a p-value below the chosen significance level (here 0.05) rejects H0 in favour of stationarity.

In [None]:
# ADF test on y, the monthly mean of the daily item_cnt_day sums
result = adfuller(y)
print("Monthly mean of daily sales:")
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
    
# If the p-value is smaller than 0.05 we can reject the null hypothesis;
# the test then indicates that the time series is stationary (no unit root)

The p-value is smaller than 0.05, so we can reject the null hypothesis: according to the test, the time series is stationary and has no unit root.

#### 3.3. Forecast Time Series with ARIMA

We search over the parameters of the seasonal ARIMA (Autoregressive Integrated Moving Average) model, where p is the autoregressive order, d the degree of differencing and q the moving-average order, to find the combination with the best AIC (Akaike’s Information Criterion).
AIC estimates the quality of a model relative to the other candidate models. The lower the AIC score, the better the model, so the model with the lowest AIC among the candidates is chosen.
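The selection rule "lowest AIC wins" can be sketched without fitting any time series model. To keep the example small, polynomial regression is swapped in for SARIMAX, and the AIC is computed for Gaussian least-squares fits as n·ln(RSS/n) + 2k (up to an additive constant). The data and the helper `gaussian_aic` are invented for illustration.

```python
import numpy as np

def gaussian_aic(rss, n_obs, n_params):
    """AIC of a least-squares fit with Gaussian errors, up to an additive constant."""
    return n_obs * np.log(rss / n_obs) + 2 * n_params

# Invented data: a straight line plus noise.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, 60)
y_toy = 2.0 + 3.0 * x + rng.normal(scale=0.2, size=x.size)

# Candidate models: polynomials of degree 0 to 3.
aic_by_degree = {}
for degree in range(0, 4):
    coeffs = np.polyfit(x, y_toy, degree)
    rss = float(np.sum((y_toy - np.polyval(coeffs, x)) ** 2))
    aic_by_degree[degree] = gaussian_aic(rss, x.size, degree + 1)

# The candidate with the lowest AIC is selected.
best_degree = min(aic_by_degree, key=aic_by_degree.get)
print(aic_by_degree)
print("selected degree:", best_degree)
```

The 2k term penalises extra parameters, so a more flexible model is only preferred if it reduces the residual error enough to pay for them. AIC values are only comparable between models fitted to the same data.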

Since the analysis above showed a yearly seasonality, we use the SARIMAX model, which allows us to set a seasonal period of 12 months.
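What the differencing orders do can be sketched with plain pandas. In a seasonal model with `order=(p, d, q)` and `seasonal_order=(P, D, Q, 12)`, D=1 corresponds to subtracting the value from 12 months earlier and d=1 to subtracting the previous value. The synthetic series below (a linear decline plus a repeating year-end peak) is invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented monthly series: linear decline plus the same year-end peak every year.
index = pd.date_range("2013-01-01", periods=36, freq="MS")
t = np.arange(36)
yearly_pattern = np.tile([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 20], 3)
series = pd.Series(100.0 - 2.0 * t + yearly_pattern, index=index)

# Seasonal differencing (D=1, period 12): the repeating pattern cancels out,
# leaving only the change caused by the trend over 12 months.
seasonal_diff = series.diff(12).dropna()

# Regular differencing on top (d=1): the remaining constant level cancels too.
both_diffs = seasonal_diff.diff().dropna()

print(seasonal_diff.unique())
print(both_diffs.unique())
```

Each differencing step costs observations (12 for the seasonal step, 1 for the regular step), which matters for a series as short as 34 months.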


In [None]:
p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
print('Examples of parameter combinations for Seasonal ARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))

In [None]:
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(y,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            results = mod.fit()
            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        except Exception:  # skip parameter combinations that fail to fit
            continue

#### 3.4. Fitting the SARIMAX model

In [None]:
# The best AIC is:
# ARIMA(1, 1, 1)x(1, 1, 0, 12)12 - AIC:115.62002802642752

mod = sm.tsa.statespace.SARIMAX(y,
                                order=(1, 1, 1),
                                seasonal_order=(1, 1, 0, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)
results = mod.fit()
print(results.summary().tables[1])

In [None]:
results.plot_diagnostics(figsize=(16, 8))
plt.show()

The diagnostic plots suggest that the model residuals are approximately normally distributed.
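The visual impression can be complemented with two simple numbers, skewness and excess kurtosis, which are both close to 0 for normally distributed data. This is only a sketch: `residual_shape` is a hypothetical helper, and the two samples are synthetic stand-ins rather than the residuals of the fitted model.

```python
import numpy as np
import pandas as pd

def residual_shape(resid):
    """Sample skewness and excess kurtosis of a residual series."""
    r = pd.Series(resid).dropna()
    return {"skew": float(r.skew()), "excess_kurtosis": float(r.kurt())}

rng = np.random.default_rng(1)
normal_like = residual_shape(rng.normal(size=5000))      # roughly symmetric
skewed = residual_shape(rng.exponential(size=5000))      # clearly right-skewed

print("normal sample:     ", normal_like)
print("exponential sample:", skewed)
```

For the model in this notebook one could call `residual_shape(results.resid)`. With only a few dozen residuals both numbers are noisy, so they support the plots rather than replace them.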

#### 3.5. Validating forecasts

In [None]:
pred = results.get_prediction(start=pd.to_datetime('2015-01-01'), dynamic=False)
pred_ci = pred.conf_int()
ax = y['2013':].plot(label='observed')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7, figsize=(14, 7))
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.2)
ax.set_xlabel('date')
ax.set_ylabel('item_cnt_day')
plt.legend()
plt.show()

In [None]:
y_forecasted = pred.predicted_mean
y_truth = y['2015-01-01':]
mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))

In [None]:
print('The Root Mean Squared Error of our forecasts is {}'.format(round(np.sqrt(mse), 2)))

#### 3.6. Visualization of the forecast

In [None]:
pred_uc = results.get_forecast(steps=100)
pred_ci = pred_uc.conf_int()
ax = y.plot(label='observed', figsize=(14, 7))
pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('date')
ax.set_ylabel('item_cnt_day')
plt.legend()
plt.show()

In [None]:
pred_uc = results.get_forecast(steps=3)
pred_ci = pred_uc.conf_int()
ax = y.plot(label='observed', figsize=(14, 7))
pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('date')
ax.set_ylabel('item_cnt_day')
plt.legend()
plt.show()

In [None]:
pred_uc = results.get_forecast(steps=24)
pred_ci = pred_uc.conf_int()
ax = y.plot(label='observed', figsize=(14, 7))
pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('Date',fontsize=18, weight = "bold")
ax.set_ylabel('Amount of sold items',fontsize=18, weight = "semibold")
plt.title("Forecast sold products next 2 years",y= 1.1, fontsize=18, weight = "semibold" )
plt.legend()
plt.show()


### 4. Focusing on certain shops

In [None]:
shops_lst = list(sales_train.shop_id.unique())

In [None]:
online_shop = sales_train[(sales_train["shop_id"] == 12)]

In [None]:
offline_shop = sales_train[(sales_train["shop_id"] != 12)]

#### 4.1. Offline Shops: sold products per day

In [None]:
offline_sales = offline_shop.groupby(['date',"shop_id"])['item_cnt_day'].sum().reset_index()

In [None]:
offline_sales = offline_sales.set_index('date')

In [None]:
offline_sales.dtypes

In [None]:
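# plot the monthly time series for each offline shop
for i in offline_sales["shop_id"].unique():
    sales_shop = offline_sales[(offline_sales["shop_id"] == i)]
    # use a separate name so the overall series y from section 3 is not overwritten
    y_shop = sales_shop["item_cnt_day"].resample("MS").mean()

    y_shop.plot()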


#### 4.2. Online Shop: sold products per day

In [None]:
sales_per_online_shop = online_shop.groupby(['date',"shop_id"])['item_cnt_day'].sum().reset_index()

In [None]:
sales_per_online_shop = sales_per_online_shop.set_index('date')

In [None]:
o = sales_per_online_shop["item_cnt_day"].resample("MS").mean()
#X = sales_per_online_shop.index

#plt.plot(X, coefficients[0]*X +residuals, color="red")
axo = o.plot()

plt.title("Online Sales", y= 1.1, fontsize=18, weight = "semibold")
plt.xlabel("Date", fontsize=14, weight = "semibold")
plt.ylabel("# sold products", fontsize=14, weight = "semibold")
plt.show()


In [None]:
coefficients, residuals, _, _, _ = np.polyfit(range(len(o.index)),o,1,full=True)
mse = residuals[0]/(len(o.index))
nrmse = np.sqrt(mse)/(o.max() - o.min())
print('Slope ' + str(coefficients[0]))
print('NRMSE: ' + str(nrmse))


In [None]:
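# percentage change from the first month (position 0) to the last month (position 33)
(o.iloc[33]-o.iloc[0])/o.iloc[0]*100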

##### 4.2.1 Analysing historical data

In [None]:
rcParams['figure.figsize'] = 18, 8
# newer statsmodels versions use `period` instead of the former `freq` argument
decomposition = sm.tsa.seasonal_decompose(o, period=12, model='additive')
fig = decomposition.plot()
plt.show()

##### 4.2.2. Test Stationarity with Augmented Dickey-Fuller Test

In [None]:
# ADF test on o, the monthly mean of the online shop's daily item_cnt_day sums
result = adfuller(o)
print("Monthly mean of daily online sales:")
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
    
# If the p-value is smaller than 0.05 we can reject the null hypothesis;
# the test then indicates that the time series is stationary (no unit root)

##### 4.2.3. Forecast Time Series with ARIMA

In [None]:
p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
print('Examples of parameter combinations for Seasonal ARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))

##### 4.2.4. Fitting the SARIMAX model

In [None]:
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(o,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            results = mod.fit()
            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        except Exception:  # skip parameter combinations that fail to fit
            continue

In [None]:
# The best AIC is:
# ARIMA(1, 1, 0)x(1, 1, 0, 12)12 - AIC:87.79639573691884

mod = sm.tsa.statespace.SARIMAX(o,
                                order=(1, 1, 0),
                                seasonal_order=(1, 1, 0, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)
results = mod.fit()
print(results.summary().tables[1])

##### 4.2.5. Validating forecasts

In [None]:
pred = results.get_prediction(start=pd.to_datetime('2015-01-01'), dynamic=False)
pred_ci = pred.conf_int()
ax = o['2013':].plot(label='observed')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7, figsize=(14, 7))
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.2)
ax.set_xlabel('date')
ax.set_ylabel('item_cnt_day')
plt.legend()
plt.show()

In [None]:
o_forecasted = pred.predicted_mean
o_truth = o['2015-01-01':]
mse = ((o_forecasted - o_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))

In [None]:
print('The Root Mean Squared Error of our forecasts is {}'.format(round(np.sqrt(mse), 2)))

##### 4.2.6. Visualization of the forecast

In [None]:
pred_uc = results.get_forecast(steps=100)
pred_ci = pred_uc.conf_int()
ax = o.plot(label='observed', figsize=(14, 7))
pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('date')
ax.set_ylabel('item_cnt_day')
plt.legend()
plt.show()

In [None]:
pred_uc = results.get_forecast(steps=15)
pred_ci = pred_uc.conf_int()
ax = o.plot(label='observed', figsize=(14, 7))
pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('date')
ax.set_ylabel('item_cnt_day')
plt.legend()
plt.show()

#### 4.3. Extra analysis

In [None]:
last_offline_sales = sales_train[(sales_train.date_block_num == 33) & (sales_train.shop_id != 12)]
w = last_offline_sales.item_cnt_day.sum()

In [None]:
last_online_sales = sales_train[(sales_train.date_block_num == 33) & (sales_train.shop_id == 12)]
z = last_online_sales.item_cnt_day.sum()

In [None]:
labels = ['Offline', 'Online']
sizes = [(w/(w+z)),(z/(w+z))]

In [None]:
explode = (0, 0.1)
plt.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',  startangle=60)
# plt.axis() does not take font arguments; 'equal' keeps the pie circular
plt.axis('equal')

plt.show()
