Rossmann Sales Predict

This project is for academic purposes.

1. Business Problem.

Rossmann is a large European drugstore chain. Some store managers asked me to help them predict sales for the next six weeks. The request originated with the CFO, who raised it at their weekly meeting: he needs to plan store renovations, and the renovation budget must be aligned with each store's sales. The main stakeholder is therefore the CFO, but all store managers will benefit.

2. Business Assumptions.

All data was taken from the company's internal sales database and covers the last 30 months. Data from before this period would be seriously affected (biased) by external events. Several store details were provided, such as store type, product assortment and competitor proximity. Daily customers and sales, holidays and marketing promotions were also available.

However, some assumptions were necessary, as listed below.

Competition proximity: The distance was expressed in meters, but it was sometimes zero. I took 'Zero Competition Distance' to mean 'No Competition Proximity'. Left as zero, this value would bias the ML algorithms, since it reads as a competitor right next door. So I replaced it with a fixed value (100,000 m), higher than the highest value in the dataset.
Assortment: I assumed there is a hierarchy between the types, so stores with Assortment Type C also offer Types A and B.
Store Open: I removed all rows that indicate a closed store, since sales were also zero on those days. For ML purposes, this will be reviewed in a future CRISP cycle.
Sales Prediction: In agreement with the CFO, I assumed the prediction would be the total sales for each store at the end of the sixth week.
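The first three assumptions can be sketched in a few lines. This is only an illustration, not the code from the notebooks: the field names (`open`, `competition_distance`, `assortment`) and the sample rows are made up, and plain Python dictionaries are used instead of a DataFrame to keep the example dependency-free.

```python
# Hypothetical field names and values; the README does not list the dataset columns.
NO_COMPETITION_DISTANCE = 100_000  # meters, the fixed value chosen above
ASSORTMENT_LEVEL = {"a": 1, "b": 2, "c": 3}  # assumed hierarchy: c includes a and b


def apply_assumptions(rows):
    """Return cleaned copies of the rows that describe an open store."""
    cleaned = []
    for row in rows:
        if row["open"] == 0:
            continue  # closed days also have zero sales, so they are dropped
        new_row = dict(row)  # copy, so the input rows stay untouched
        if new_row["competition_distance"] == 0:
            # Zero means "no competitor nearby": move it beyond the real maximum.
            new_row["competition_distance"] = NO_COMPETITION_DISTANCE
        # Encode the assortment type as an ordered level.
        new_row["assortment_level"] = ASSORTMENT_LEVEL[new_row["assortment"]]
        cleaned.append(new_row)
    return cleaned


# Made-up sample rows, for illustration only.
sample = [
    {"store": 1, "open": 1, "competition_distance": 0, "assortment": "c", "sales": 5263},
    {"store": 2, "open": 0, "competition_distance": 570, "assortment": "a", "sales": 0},
    {"store": 3, "open": 1, "competition_distance": 14130, "assortment": "b", "sales": 8314},
]
cleaned_sample = apply_assumptions(sample)
```

Encoding the assortment as an ordered integer is one way to express the assumed hierarchy; other encodings would work too.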

My strategy to solve this challenge was based on the CRISP-DM cycle:

00. Understand the Problem: The most important step for planning the entire solution correctly.

01. Data Description: My goal is to use statistical metrics to identify data outside the scope of the business.

02. Data Filtering: Filter out rows and drop columns that do not contain information for modeling or that do not match the scope of the business.

03. Feature Engineering: Derive new attributes based on the original variables to better describe the phenomenon that will be modeled.

04. Exploratory Data Analysis: Explore the data to find insights and better understand the impact of variables on model learning.

05. Data Preparation: Prepare the data so that the Machine Learning models can learn the specific behavior.

06. Feature Selection: Selection of the most significant attributes for training the model.

07. Machine Learning Modelling: Train the Machine Learning models.

08. Hyperparameter Fine Tuning: Choose the best values for each of the parameters of the model selected in the previous step.

09. Convert Model Performance to Business Values: Convert the performance of the Machine Learning model into a business result.

10. Deploy Model to Production: Publish the model to a cloud environment so other people or services can use the results to make better business decisions. In this particular case, the model can be accessed through a Telegram Bot.

3. ML and Metrics

The ML step may be the most interesting step. This is where the 'magic' happens. I tested four (4) Machine Learning algorithms: Linear Regression, Lasso Regression, Random Forest Regressor and XGBoost Regressor. The metrics used to measure their performance were MAE, MAPE and RMSE.
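For reference, the three metrics can be written out in a few lines of plain Python. This is a minimal sketch rather than the code used in the notebooks. The `summarize_folds` helper is hypothetical and shows one way to produce a "mean +/- std" summary of cross-validated scores; whether the project used the population or the sample standard deviation is an assumption here.

```python
import math
import statistics


def mae(actual, predicted):
    """Mean Absolute Error, in the same unit as the target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent. Assumes no actual value is zero."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error; penalizes large errors more than MAE does."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def summarize_folds(fold_scores):
    """Hypothetical helper: format per-fold scores as 'mean +/- std'."""
    mean = statistics.mean(fold_scores)
    std = statistics.pstdev(fold_scores)  # population std; an assumption
    return f"{round(mean, 2)} +/- {round(std, 2)}"


# Made-up values, for illustration only.
actual = [100, 200, 400]
predicted = [110, 180, 400]
example_mae = mae(actual, predicted)
example_mape = mape(actual, predicted)
example_rmse = rmse(actual, predicted)
```

MAE and RMSE are in sales units, while MAPE is relative, which makes it easier to compare stores of different sizes.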

The results for these metrics can be seen in the table below:

Model Name	MAE CV	MAPE CV	RMSE CV
Linear Regression	2082.46 +/- 295.78	30.26 +/- 1.66	2950.11 +/- 468.88
Lasso Regression	2116.65 +/- 341.58	29.2 +/- 1.18	3056.56 +/- 504.44
Random Forest Regressor	837.7 +/- 219.23	11.61 +/- 2.32	1256.59 +/- 320.26
XGBoost Regressor	2889.54 +/- 343.25	34.54 +/- 1.39	3714.69 +/- 456.1

My final choice of model was XGBoost.

What!?!? Why did I choose XGBoost over Random Forest? See the RMSE above!

Well, I did it because the company's policy is not to go over the budget ($0.00).

All ML projects have to be aligned with the budget. Can the final model be sustained by the business infrastructure?

Do the results (directly or indirectly) affect, or are they affected by, any internal company policy?

In this case, the model will be hosted on a free cloud service (Heroku), where space is limited. Had I chosen Random Forest, the final model would have been 1 GB. With XGBoost, the model is much smaller.
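A quick way to check this kind of constraint before deploying is to measure the serialized size of the model. The sketch below assumes the model is saved with `pickle`; the 500 MB limit and the `toy_model` stand-in are hypothetical, since the text above only says that space is limited.

```python
import pickle

# Hypothetical limit, in megabytes; adjust to the actual hosting plan.
SIZE_LIMIT_MB = 500


def serialized_size_mb(model):
    """Size of the pickled object, in megabytes."""
    return len(pickle.dumps(model)) / (1024 ** 2)


def fits_deployment(model, limit_mb=SIZE_LIMIT_MB):
    """True if the pickled model fits within the size limit."""
    return serialized_size_mb(model) <= limit_mb


# Stand-in object; a trained regressor would be passed here instead.
toy_model = {"weights": [0.0] * 1000}
toy_fits = fits_deployment(toy_model)
```

Running the same check on each candidate model makes the size trade-off visible next to the error metrics.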

The metrics problem can be solved in the "Hyperparameter Fine Tuning" step. Look at the metrics after training the model with a better choice of parameters:

Model Name	MAE	MAPE%	RMSE
XGBoost Regressor	764.9756	11.4861	1,100.7251

(RMSE better than Random Forest!) This is the result of a lot of work... and a (very) little bit of experience.

4. Business Results.

It all looks good, beautiful... but without translating these metrics into business language, all the work can be ruined! Business people don't understand RMSE. Maybe they understand MAE and MAPE, but they certainly understand 'cold' numbers. Graphics also help.

For example, the table below shows the TOTAL of the predictions, considering the best and worst scenarios.

Scenario	Values
predictions	$ 286,435,616.00
worst_scenario	$ 285,579,535.55
best_scenario	$ 287,291,675.73

Below we have a scatter plot with all the predictions. Notice that most points are centered around a line parallel to the X axis (MAPE of 11% on the Y axis). However, some points are quite far from it. This is because the forecasts are not so accurate for some stores, while for others they are very accurate.

Ok, but what does this deviation really represent?

Check the table below for the 5 best cases.

store	predictions	worst_scenario	best_scenario	MAE	MAPE
1089	373,394.1875	372,825.1184	373,963.2566	569.0691	5.3232
667	315,185.8438	314,693.4028	315,678.2847	492.4410	5.5487
323	282,916.4688	282,488.0610	283,344.8765	428.4077	5.6277
742	301,657.5312	301,199.1451	302,115.9174	458.3861	5.6393
1097	450,342.1562	449,703.2118	450,981.1007	638.9445	5.7761

And the table below shows the 5 worst cases.

store	predictions	worst_scenario	best_scenario	MAE	MAPE
292	108,359.7891	104,977.6086	111,741.9695	3,382.1804	60.2768
909	220,300.0781	212,395.1411	228,205.0152	7,904.9371	51.8675
876	194,060.8125	189,924.5347	198,197.0903	4,136.2778	33.7730
170	201,541.6875	200,194.4216	202,888.9534	1,347.2659	33.2923
749	206,800.9531	205,789.1920	207,812.7142	1,011.7611	28.3049
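The per-store figures in both tables are consistent with a simple rule: the worst scenario is the prediction minus the MAE, and the best scenario is the prediction plus the MAE. This relationship is inferred from the numbers in the tables, not taken from the project code, and the `scenarios` function is a hypothetical helper. Using store 1089 as a worked example:

```python
def scenarios(prediction, mae):
    """Worst and best scenarios as prediction -/+ MAE (inferred from the tables)."""
    return {
        "worst_scenario": prediction - mae,
        "best_scenario": prediction + mae,
    }


# Values for store 1089, taken from the best-cases table.
store_1089 = scenarios(373_394.1875, 569.0691)
# worst_scenario is about 372,825.1184 and best_scenario about 373,963.2566,
# matching the row for store 1089.
```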

5. Sneak peek from Telegram Bot.

6. Lessons Learned.

Metrics are important, but they are not everything;
Make more graphics, make better graphics;
Keep your code clean;
Plan... and re-plan your work;
Git is your friend.

7. In the next cycle?

Better plots;
Try new ML algorithms;
Build a pipeline to retrain model.
