Predicting the daily sales for Rossmann Drug Store
- Created a tool to predict the daily sales of any store in the Rossmann Drug Store chain, the 2nd-largest drug store chain in Germany. The data was taken from Kaggle.
- Created features capturing the active promos run by the store, the distance to the nearest competitor, when that competitor opened, markers for the start of the year/quarter, etc.
- Tried 3 main modelling approaches: Random Forest, XGBoost, and Neural Networks with entity embeddings. Selected XGBoost as the final model.
- Created a webapp using Streamlit and deployed it on an AWS EC2 instance.
Check out the webapp here (Currently Inactive)
- Python 3.6
- Pandas, Numpy
- Matplotlib, Seaborn
- Scikit Learn, XGBoost
- FastAi, Pytorch
- Streamlit
- AWS
- The data has been taken from Kaggle.
- It contains 3 CSV files: train, test and store.
- Train and test contain information for the days for which sales need to be predicted, and store contains features for each store.
- After merging the train and store dataframes, we get the following features
- Only days when the store was open have been used to train the model; hence rows where Sales == 0 (or Open != 1) have been dropped.
- All NaN values, in both categorical and continuous variables, have been imputed with 0.
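The two cleaning steps above can be sketched as follows; the frame here is a small synthetic stand-in, but the column names follow the Kaggle files:

```python
import pandas as pd

# Toy stand-in for the merged train/store dataframe.
df = pd.DataFrame({
    "Open": [1, 0, 1],
    "Sales": [5263, 0, 6064],
    "CompetitionDistance": [570.0, None, None],
})

# Keep only open days with non-zero sales.
df = df[(df["Open"] == 1) & (df["Sales"] > 0)]

# 0-impute the remaining NaNs (categorical and continuous alike).
df = df.fillna(0)
```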
- I have explored the distributions of the variables to identify important features and transformations. For the full EDA, check out EDA - RossMann Sales.ipynb
The metric used is the root mean squared percentage error (RMSPE).
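RMSPE can be implemented in a few lines of NumPy; the zero-sales mask matches the competition's convention of ignoring days with no sales:

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean squared percentage error, ignoring zero-sales rows."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    pct_err = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return np.sqrt(np.mean(pct_err ** 2))
```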
- Cross-validation scheme: the model has been trained on data from 2013-01-01 to 2015-06-30 and then evaluated on the next 6 weeks. The datapoints have been selected such that all the stores are included in both the training and validation sets. For the final submission, the model was trained on data from 2013-01-01 to 2015-07-31 and tested on the next 42 days' worth of data on Kaggle.
- I have tried 3 models:
- Random Forest - With the majority of features being categorical, random forests were a good starting point to get a strong baseline and to eliminate unimportant features using feature importances.
- XGBoost - Since the random forest gave decent performance, I tried XGBoost, as boosting techniques usually perform slightly better than bagging models.
- Neural Networks - I used entity embeddings to encode categorical variables; this reduces the dimensionality (which one-hot encoding usually inflates) and can also capture relationships between categorical levels.
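A minimal sketch of the random-forest baseline and its feature-importance-based pruning, using synthetic data in place of the real features (the hyperparameters are placeholders, not the tuned values):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: feature 0 drives the target, the rest are noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=300)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ ranks features; low-importance ones can be dropped.
ranking = rf.feature_importances_.argsort()[::-1]  # feature 0 ranks first here
```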
- The model gets an RMSPE of 0.11213 on the Kaggle testing set. Here are the results on the validation set:
- We can check the feature importance according to XGBoost
- I have also plotted the force plot, which is calculated using SHAP values. For a given prediction, it shows the positive or negative "force" each feature contributes in pushing the base value toward the predicted value. The base value is simply the expected value of the prediction, i.e. the mean prediction over the training data.
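The additive property a force plot visualizes can be shown exactly with a linear model, where each feature's SHAP "force" is coef * (value - mean) and the forces sum from the base value to the prediction (a worked illustration, not the app's actual SHAP computation):

```python
import numpy as np

# Linear model f(x) = w @ x + b over synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w, b = np.array([2.0, -1.0, 0.5]), 4.0
preds = X @ w + b

base_value = preds.mean()          # expected prediction over the data
x = X[0]
forces = w * (x - X.mean(axis=0))  # per-feature push away from the base value

# base value + sum of forces recovers the prediction exactly.
assert np.isclose(base_value + forces.sum(), x @ w + b)
```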
- Date - The date for which sales need to be predicted
- Store Id - The store for which we need the predictions
- Promo - Whether the store will be running a promo on that day
- State Holiday - Whether the day is a state holiday
- The rest of the features are derived from the store ID; a database in the form of a CSV contains this information for every Rossmann store.
- The webapp also generates interpretation plots (GAIN + SHAP Force Plot), as shown above.
- Finally, the app has been deployed on an AWS EC2 instance with an Elastic IP.
Check it out: https://cutt.ly/rossmann-app