# Problem Statement:
    
Rossmann operates over 3,000 drug stores in 7 European countries. Currently, 
Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. 
Store sales are influenced by many factors, including promotions, competition, school and state holidays, 
seasonality, and locality.

 

With thousands of individual managers predicting sales based on their unique circumstances, 
the accuracy of results can be quite varied. You are provided with historical sales data for 1,115 Rossmann 
stores. The task is to forecast the "Sales" column. Note that some stores in the dataset were temporarily
closed for refurbishment.

# Data definition

Most of the fields are self-explanatory. The following are descriptions of the important variables.

Id - an Id that represents a (Store, Date) pair within the test set

Store - a unique Id for each store

Sales - the turnover for any given day (this is what you are predicting)

Customers - the number of customers on a given day

Open - an indicator for whether the store was open: 0 = closed, 1 = open

StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None

SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools

StoreType - differentiates between 4 different store models: a, b, c, d

Assortment - describes an assortment level: a = basic, b = extra, c = extended

CompetitionDistance - distance in meters to the nearest competitor store

CompetitionOpenSince[Month/Year] - gives the approximate month and year in which the nearest competitor opened

Promo - indicates whether a store is running a promo on that day

Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating

Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2

PromoInterval - describes the consecutive intervals at which Promo2 is started, naming the months in which the promotion starts anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August and November of any given year for that store
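
The Promo2 fields above combine into a per-day flag. The sketch below is one possible reading, not part of the dataset documentation: it assumes that Promo2Since[Year/Week] uses ISO calendar weeks and that participation begins on the Monday of that week. The helper names `promo2_start_date` and `promo2_round_starts` are hypothetical.

```python
from datetime import date

import pandas as pd

MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def promo2_start_date(since_year, since_week):
    """Monday of the given calendar week (assumption: ISO week numbering)."""
    return date.fromisocalendar(int(since_year), int(since_week), 1)


def promo2_round_starts(day, promo2, since_year, since_week, promo_interval):
    """True if the store takes part in Promo2 and a new round starts in day's month."""
    if promo2 != 1 or pd.isna(promo_interval):
        return False
    day = pd.Timestamp(day).date()
    if day < promo2_start_date(since_year, since_week):
        return False
    # Compare on the first three letters so longer month spellings still match.
    start_months = {name.strip()[:3] for name in promo_interval.split(",")}
    return MONTH_ABBR[day.month - 1] in start_months


# Values as in the store.csv row for Store 2: Promo2 = 1, week 13 of 2010.
print(promo2_start_date(2010, 13))
print(promo2_round_starts("2013-01-15", 1, 2010.0, 13.0, "Jan,Apr,Jul,Oct"))
print(promo2_round_starts("2013-02-15", 1, 2010.0, 13.0, "Jan,Apr,Jul,Oct"))
# A store with Promo2 = 0 has missing Promo2 fields and never starts a round.
print(promo2_round_starts("2013-01-15", 0, float("nan"), float("nan"), None))
```

Applied row by row after joining the store table to the daily data, a flag like this could serve as an extra regressor next to Promo.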

# Please keep the following questions in mind while forecasting sales:

Q1. Is the sales data non-stationary? If so, how do you detect it and correct for it?

Q2. Is the data co-integrated? Which variables are co-integrated, and how do you find out?

Q3. What is the impact of the number of customers on sales? How do you measure it?

Q4. What is the impact of the promo and promo2 variables on sales? How do you measure it?

Q5. Forecast sales for the next 6 weeks. Report the accuracy of the model using MAPE.

Q6. Let’s say promo2 is decided based on the sales on the previous day. How do you measure the impact of promo2 on sales and the impact of sales on promo2?

Q7. Which independent variables have a long-term impact on sales and which have a short-term impact? Describe the approach in detail.
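
Since the forecast accuracy is to be reported as MAPE, a minimal sketch of the metric follows. MAPE divides by the actual value, so it is undefined on days with zero sales (for example when a store is closed). Skipping those rows is an assumption made here, not something the problem statement prescribes. The numbers in the usage lines are illustrative only.

```python
import numpy as np


def mape(actual, predicted):
    """Mean absolute percentage error in percent, over rows with non-zero actuals."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = actual != 0  # assumption: zero-sales days are excluded
    if not mask.any():
        raise ValueError("MAPE is undefined when all actual values are zero")
    errors = np.abs((actual[mask] - predicted[mask]) / actual[mask])
    return float(errors.mean() * 100)


# Illustrative values; the third day is a closed day and is ignored.
actual = [5263, 6064, 0, 8314]
predicted = [5000, 6500, 0, 8000]
print(round(mape(actual, predicted), 2))
```

If the model is fitted on standardized sales, the predictions should be transformed back to the original scale before computing MAPE, because percentages of z-scores are not meaningful.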


# Please follow these steps while you explore the data and build the model.

Find outliers at the 99th percentile and remove them.

Standardize the sales and number-of-customers variables before modelling.

Test the sales series for non-stationarity using the augmented Dickey-Fuller (ADF) test.

Determine whether the data is stationary:

If stationary, specify a vector autoregression (VAR) model in levels.

If non-stationary, specify the model in differences.

Treat sales, promo2 and any other variables you consider relevant as dependent (endogenous) variables.

Check for cointegration using the Johansen test.

Compute the impulse response functions to answer questions Q3 to Q7.

Predict sales for the next 6 weeks.

In [19]:
from collections import Counter

import numpy as np
import pandas as pd

In [2]:
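# Pre-processed subset used for modelling, plus the raw store and training tables.
inputData = pd.read_csv("./inputDataProcessed.csv")
store = pd.read_csv("./data/store.csv")
# StateHoliday holds both 0 and the letters a/b/c; reading it as a string
# keeps the column to a single type.
train = pd.read_csv("./data/train.csv", dtype={"StateHoliday": str})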



In [3]:
inputData.head(5)

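|  | Store | Date | Sales | Customers | Promo | SchoolHoliday | CompetitionDistance | DayOfWeek_0 | DayOfWeek_1 | DayOfWeek_2 | DayOfWeek_3 | DayOfWeek_4 | DayOfWeek_5 | DayOfWeek_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 01-01-2013 | 5530 | 668 | 0 | 1 | 1270 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 02-01-2013 | 4327 | 578 | 0 | 1 | 1270 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 03-01-2013 | 4486 | 619 | 0 | 1 | 1270 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 04-01-2013 | 4997 | 635 | 0 | 1 | 1270 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 1 | 05-01-2013 | 7176 | 785 | 1 | 1 | 1270 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |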


In [4]:
store.head(5)

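|  | Store | StoreType | Assortment | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear | Promo2 | Promo2SinceWeek | Promo2SinceYear | PromoInterval |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | c | a | 1270.0 | 9.0 | 2008.0 | 0 | NaN | NaN | NaN |
| 1 | 2 | a | a | 570.0 | 11.0 | 2007.0 | 1 | 13.0 | 2010.0 | Jan,Apr,Jul,Oct |
| 2 | 3 | a | a | 14130.0 | 12.0 | 2006.0 | 1 | 14.0 | 2011.0 | Jan,Apr,Jul,Oct |
| 3 | 4 | c | c | 620.0 | 9.0 | 2009.0 | 0 | NaN | NaN | NaN |
| 4 | 5 | a | a | 29910.0 | 4.0 | 2015.0 | 0 | NaN | NaN | NaN |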


In [5]:
train.head(5)

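|  | Store | DayOfWeek | Date | Sales | Customers | Open | Promo | StateHoliday | SchoolHoliday |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 5 | 2015-07-31 | 5263 | 555 | 1 | 1 | 0 | 1 |
| 1 | 2 | 5 | 2015-07-31 | 6064 | 625 | 1 | 1 | 0 | 1 |
| 2 | 3 | 5 | 2015-07-31 | 8314 | 821 | 1 | 1 | 0 | 1 |
| 3 | 4 | 5 | 2015-07-31 | 13995 | 1498 | 1 | 1 | 0 | 1 |
| 4 | 5 | 5 | 2015-07-31 | 4822 | 559 | 1 | 1 | 0 | 1 |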


In [6]:
print("input  shape : ", inputData.shape)
print("store shape : ", store.shape)
print("train shape : ", train.shape)

input  shape :  (11729, 14)
store shape :  (1115, 10)
train shape :  (1017209, 9)


In [7]:
inputData.isna().sum()

Store                  0
Date                   0
Sales                  0
Customers              0
Promo                  0
SchoolHoliday          0
CompetitionDistance    0
DayOfWeek_0            0
DayOfWeek_1            0
DayOfWeek_2            0
DayOfWeek_3            0
DayOfWeek_4            0
DayOfWeek_5            0
DayOfWeek_6            0
dtype: int64

In [8]:
store.isna().sum()

Store                          0
StoreType                      0
Assortment                     0
CompetitionDistance            3
CompetitionOpenSinceMonth    354
CompetitionOpenSinceYear     354
Promo2                         0
Promo2SinceWeek              544
Promo2SinceYear              544
PromoInterval                544
dtype: int64
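
The three Promo2 detail columns are all missing for the same number of stores (544), which suggests that the gaps belong to stores that do not take part in Promo2. A quick way to check this, and one possible way to fill the gaps, is sketched below. It runs on the five rows shown by `store.head(5)` rather than the full table. The median fill for CompetitionDistance and the empty-string fill for PromoInterval are assumptions, not the only reasonable choices.

```python
import numpy as np
import pandas as pd

# The five rows printed by store.head(5), restricted to the relevant columns.
store_sample = pd.DataFrame({
    "Store": [1, 2, 3, 4, 5],
    "CompetitionDistance": [1270.0, 570.0, 14130.0, 620.0, 29910.0],
    "Promo2": [0, 1, 1, 0, 0],
    "Promo2SinceWeek": [np.nan, 13.0, 14.0, np.nan, np.nan],
    "Promo2SinceYear": [np.nan, 2010.0, 2011.0, np.nan, np.nan],
    "PromoInterval": [None, "Jan,Apr,Jul,Oct", "Jan,Apr,Jul,Oct", None, None],
})

# Check: the Promo2 detail columns are missing exactly where Promo2 == 0.
promo2_cols = ["Promo2SinceWeek", "Promo2SinceYear", "PromoInterval"]
details_missing = store_sample[promo2_cols].isna().all(axis=1)
consistent = bool((details_missing == (store_sample["Promo2"] == 0)).all())
print(consistent)

# Assumed fills: median distance where it is unknown, empty interval for
# stores without Promo2. Neither changes rows that already have values.
cleaned = store_sample.copy()
median_distance = cleaned["CompetitionDistance"].median()
cleaned["CompetitionDistance"] = cleaned["CompetitionDistance"].fillna(median_distance)
cleaned["PromoInterval"] = cleaned["PromoInterval"].fillna("")
print(cleaned.isna().sum())
```

Running the same check on the full `store` frame would confirm whether the pattern holds for all 1,115 stores. The Promo2Since columns are left as NaN here because no numeric placeholder is obviously right for a store that never joined.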

In [9]:
train.isna().sum()

Store            0
DayOfWeek        0
Date             0
Sales            0
Customers        0
Open             0
Promo            0
StateHoliday     0
SchoolHoliday    0
dtype: int64

In [10]:
inputData.columns

Index(['Store', 'Date', 'Sales', 'Customers', 'Promo', 'SchoolHoliday',
       'CompetitionDistance', 'DayOfWeek_0', 'DayOfWeek_1', 'DayOfWeek_2',
       'DayOfWeek_3', 'DayOfWeek_4', 'DayOfWeek_5', 'DayOfWeek_6'],
      dtype='object')

In [11]:
store.columns

Index(['Store', 'StoreType', 'Assortment', 'CompetitionDistance',
       'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2',
       'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval'],
      dtype='object')

In [12]:
train.columns

Index(['Store', 'DayOfWeek', 'Date', 'Sales', 'Customers', 'Open', 'Promo',
       'StateHoliday', 'SchoolHoliday'],
      dtype='object')

In [13]:
store.shape , train.shape

((1115, 10), (1017209, 9))

In [18]:
inputData["Store"].nunique()

15

In [20]:
Counter(inputData["Store"])

Counter({1: 781,
         2: 784,
         3: 779,
         4: 784,
         5: 779,
         6: 780,
         7: 786,
         8: 784,
         9: 779,
         10: 784,
         11: 784,
         12: 784,
         13: 777,
         14: 784,
         15: 780})
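
The row counts differ from store to store (777 to 786), so at least some stores are missing dates. A VAR model expects a regularly spaced series, so it helps to locate the gaps first. The sketch below uses a small hand-made sample in which one day is deliberately left out. It assumes the `Date` strings in `inputData` are day-first (`%d-%m-%Y`), which the consecutive rows in the preview above suggest. The helper `missing_days` is a hypothetical name.

```python
import pandas as pd

# Hand-made sample in the inputData layout; 03-01-2013 is omitted on purpose.
sample = pd.DataFrame({
    "Store": [1, 1, 1, 1],
    "Date": ["01-01-2013", "02-01-2013", "04-01-2013", "05-01-2013"],
    "Sales": [5530, 4327, 4997, 7176],
})

# Parse with an explicit day-first format so 02-01-2013 becomes 2 January.
sample["Date"] = pd.to_datetime(sample["Date"], format="%d-%m-%Y")


def missing_days(frame):
    """Calendar days between the first and last date that have no row."""
    full_range = pd.date_range(frame["Date"].min(), frame["Date"].max(), freq="D")
    return full_range.difference(pd.DatetimeIndex(frame["Date"]))


gaps = {store_id: missing_days(group) for store_id, group in sample.groupby("Store")}
for store_id, days in gaps.items():
    print(store_id, [d.strftime("%Y-%m-%d") for d in days])
```

How to treat the gaps (leave them out, fill with zero sales, or interpolate) depends on why the rows are missing, for example closed days that were filtered out during pre-processing.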