In [46]:
%%html

Page Links — <a href="index.html" target="_self">Project Overview Page</a>
— <a href="page1.html" target="_self">Dataset Page</a>
— <a href="page2.html" target="_self">Data Imputation Page</a>
— <a href="page3.html" target="_self">Data Exploration Page</a>
— <a href="page4.html" target="_self">Data Modeling Page</a>

In [39]:
# Toggle raw code cells on/off (snippet adapted from Stack Overflow)
from IPython.display import HTML

HTML('''<script>
// Tracks whether input cells are currently visible
var code_show = true;
function code_toggle() {
  if (code_show) {
    $('div.input').hide();
  } else {
    $('div.input').show();
  }
  code_show = !code_show;
}
// Hide the code cells once the page has loaded
$(document).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Toggle raw code on/off"></form>''')

---

![rossman](styles/p1.png)

# Project Overview: Rossmann Stores

Luka Kordic, Marko Kostich, and Christopher Read 

### Project Description

Rossmann is the largest drugstore chain in Germany, operating at least 1,115 drugstores within the country and over 3,000 across Europe. In its most recent report, the company announced 9.4% growth in sales volume, with annual sales revenue totaling €7.9 billion (approximately $8.5 billion).
 
In 2015, Rossmann sponsored a Kaggle competition to predict its German drugstores’ sales over six consecutive weeks, based on more than two years of historical data. Our undergraduate team was tasked with replicating the competition’s prediction task using what we learned in the course, existing literature, and published competition entries.

### Datasets

The training dataset covered 1,115 stores over roughly 2.5 years. Each row lists a store number (IDs 1 through 1,115) and a date (between January 1, 2013 and July 31, 2015), for 1,017,209 observations in total. The following attributes are associated with each store on each date:

![data1](styles/p2.png)

The test dataset was similarly structured, but it naturally excluded sales and customer volume. 

The most important finding from this brief exploration was the presence of missing data, indicated by NaN values in the training set. In the next phase, we identified and imputed these missing values.

### Data Imputation

The first challenge in imputing missing data was identifying which stores were missing values, and on which dates. We created a plot to identify those dates:

![](styles/p5.png)
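As a rough illustration of how the affected stores and dates can be located before plotting, here is a minimal sketch on a toy table. The column names `Store`, `Date`, and `Sales` are assumptions about the training file's layout, and the values are invented for the example; this is not necessarily the code we ran.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training set; column names are assumed, values are invented.
toy = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 2],
    "Date": pd.to_datetime(["2014-06-30", "2014-07-01", "2014-07-02"] * 2),
    "Sales": [5200.0, 4900.0, 5100.0, 6100.0, np.nan, np.nan],
})

# Rows where sales were not reported
missing_rows = toy[toy["Sales"].isna()]

# Number of distinct stores with a missing value on each date
missing_per_date = missing_rows.groupby("Date")["Store"].nunique()

# Stores that have at least one missing value
stores_with_gaps = sorted(missing_rows["Store"].unique().tolist())

print(missing_per_date)
print(stores_with_gaps)
```

Plotting `missing_per_date` against the date axis gives a quick view of when the gaps occur.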

The training data had two cohorts of missing data: New Year’s Day 2013, and an anomalous period of store refurbishments between July 1 and December 31, 2014.

From here, we used a separate strategy for each of the two cases.

For New Year’s Day 2013, Store 988 was missing data, so we based the imputation on its observation from January 1, 2014, when the store was closed. We assumed the store was likewise closed on New Year’s Day 2013, so we set customers, sales, promotion, and the relevant holiday indicator to 0.

For the 2014 store refurbishments, 180 stores were missing sales and customer data. We used a plot of each store’s sales to guide conditional imputation decisions. If the store was not open, we set sales and customers to 0. If the store was open, we took the median of that column over the store’s observations with the same day of the week and the same promotion status.

The imputed values are shown in blue:

![](styles/p6.png)
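The conditional rule for the refurbishment period can be sketched in pandas as follows. This is a minimal illustration under stated assumptions: the column names (`Store`, `DayOfWeek`, `Promo`, `Open`, `Sales`, `Customers`) and the helper name `impute_store_gaps` are hypothetical, and medians are taken only from days on which the store was open.

```python
import numpy as np
import pandas as pd


def impute_store_gaps(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing Sales/Customers: 0 when closed, group median when open."""
    out = df.copy()
    cols = ["Sales", "Customers"]
    keys = ["Store", "DayOfWeek", "Promo"]
    missing = out["Sales"].isna()

    # Closed days: no sales and no customers
    closed_missing = missing & (out["Open"] == 0)
    out.loc[closed_missing, cols] = 0

    # Open days: median over the same store, weekday, and promotion status,
    # computed from open days only (NaN values are skipped by the median)
    open_missing = missing & (out["Open"] == 1)
    for col in cols:
        open_values = out[col].where(out["Open"] == 1)
        medians = open_values.groupby([out[k] for k in keys]).transform("median")
        out.loc[open_missing, col] = medians[open_missing]
    return out


# Invented toy data for one store
toy = pd.DataFrame({
    "Store": [1, 1, 1, 1, 1],
    "DayOfWeek": [1, 1, 1, 1, 7],
    "Promo": [1, 1, 1, 1, 0],
    "Open": [1, 1, 1, 1, 0],
    "Sales": [100.0, 200.0, 300.0, np.nan, np.nan],
    "Customers": [10.0, 20.0, 30.0, np.nan, np.nan],
})
filled = impute_store_gaps(toy)
print(filled)
```

The median is used rather than the mean so that a few unusually high or low days do not dominate the imputed value.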

### Data Exploration

The clearly heteroscedastic trend raised our suspicion that the data might be grouped. Our analysis showed that store type drove the trend.

![](styles/p7.png)

![](styles/p8.png)

However, distance to competitors did not correlate with store sales as we had expected. This may imply some level of marketplace efficiency.

![](styles/p9.png)
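One simple way to check a relationship like this numerically is a per-store Pearson correlation between competitor distance and mean sales. The sketch below is illustrative only: the column names (`Store`, `Open`, `Sales`, `CompetitionDistance`), the helper name `sales_distance_correlation`, and the toy values are assumptions, not our actual analysis code.

```python
import pandas as pd


def sales_distance_correlation(train: pd.DataFrame, store: pd.DataFrame) -> float:
    """Pearson correlation between competitor distance and mean open-day sales."""
    open_days = train[train["Open"] == 1]
    mean_sales = open_days.groupby("Store")["Sales"].mean().rename("MeanSales")
    merged = store.set_index("Store").join(mean_sales, how="inner")
    # Series.corr drops pairs with missing values before computing the coefficient
    return merged["CompetitionDistance"].corr(merged["MeanSales"])


# Invented toy data: four stores, two open days each
toy_train = pd.DataFrame({
    "Store": [1, 1, 2, 2, 3, 3, 4, 4],
    "Open": [1, 1, 1, 1, 1, 1, 1, 1],
    "Sales": [5100, 4900, 6200, 5800, 4700, 5300, 5600, 6000],
})
toy_store = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "CompetitionDistance": [250.0, 4000.0, 900.0, 15000.0],
})
print(sales_distance_correlation(toy_train, toy_store))
```

A coefficient near zero on the real data would be consistent with the scatter plot above; a rank-based coefficient could be used instead if the distances are heavily skewed.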

Having touched on store types and the social and economic factors behind them, we next examined store sales and seasonality, where there is a clear relationship.

![](styles/p11.png)

We exploited this repeating pattern in the data when we built our models.

### Data Modeling

For final modeling (beyond several baseline approaches), we settled on presenting two quite different approaches. The first, a rather simple approach that yielded surprisingly strong results, was a moving average. It required remarkably little information: just continuous data for the same dates as those being tested, taken from a previous year with reported data.

![](styles/p13.png)

Ultimately, we saw no need for a more computationally intensive, longer-window moving average, as the root-mean-squared errors (RMSEs) on a random test set were nearly identical across window lengths.

Therefore, we explored the differences between 3-, 5-, and 7-day moving averages by plotting the predictions on top of actual sales. The following charts show that shorter windows fit more closely, but at the steep cost of overfitting, so the relatively “less fitted” 5- or 7-day averages may be preferred.

3-day:
![](styles/p14.png)
5-day:
![](styles/p15.png)
7-day:
![](styles/p16.png)
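To make the moving-average idea concrete, here is a minimal sketch that smooths a daily sales series and uses the smoothed value from roughly one year earlier as the prediction. The 364-day lag (52 weeks, so weekdays line up), the function names, and the toy series are all assumptions made for illustration; our actual alignment of dates may have differed.

```python
import numpy as np
import pandas as pd


def prior_year_moving_average(sales: pd.Series, window: int, lag_days: int = 364) -> pd.Series:
    """Trailing moving average of daily sales, shifted forward by lag_days."""
    smoothed = sales.rolling(window=window, min_periods=window).mean()
    return smoothed.shift(lag_days)


def rmse(actual: pd.Series, predicted: pd.Series) -> float:
    """Root-mean-squared error over the dates that have a prediction."""
    mask = predicted.notna()
    errors = actual[mask] - predicted[mask]
    return float(np.sqrt((errors ** 2).mean()))


# Invented toy series: two years of daily sales with a fixed weekly pattern
dates = pd.date_range("2013-01-01", "2014-12-31", freq="D")
weekly_pattern = np.array([5000, 4800, 4700, 4900, 5200, 6000, 0], dtype=float)
toy_sales = pd.Series(weekly_pattern[dates.dayofweek.to_numpy()], index=dates)

# Compare window lengths on the same series
scores = {w: rmse(toy_sales, prior_year_moving_average(toy_sales, w)) for w in (1, 3, 5, 7)}
print(scores)
```

On this noise-free toy series a wider window only blurs the weekly pattern; on real, noisy data the trade-off between smoothing and tracking is what the charts above compare.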

The second, a more information-intensive and rigorous approach, was vector autoregression (VAR). Our VAR training set was composed of all the original attributes of the training set, store characteristics appended from the store description data, and appropriate lags identified from the autocorrelation and partial autocorrelation plots, as seen below.

![](styles/p17.png)
![](styles/p18.png)

Our final VAR model displayed a low mean-squared error, without signs of overfitting.
![](styles/p19.png)

### Conclusions

After testing a variety of models, including decision trees, moving average, and VAR, we concluded that while the moving average model is simple and easy to understand, its error is considerably larger (by 0.09) than that of VAR, which we believe is our best model. The VAR error is very reasonable relative to the performance of the majority of Kaggle participants.

---
