MSc thesis about sales forecasting in high-dimensional contexts taking into account product relationships and promotions

Blaieet/Master-thesis

Note: this is the public repository of the project, which does not contain the XGBoost implementation.

Accuracy comparison between Sparse Autoregressive and XGBoost models for high-dimensional product sales forecasting

This is my thesis for the Master in Fundamental Principles of Data Science at the Universitat de Barcelona, written in collaboration with Accenture's Applied Intelligence department and supervised by Jordi Vitrià, PhD.

Brief introduction

Sales forecasting is the process of estimating future revenue for resource allocation, budgeting, or simply making informed business decisions. When predicting a product's sales, we train our algorithms on its past data. Yet many external factors affect a product's revenue: sales of a similar product (cannibalization), applied promotions (such as discounts), or even the weather during the sales week.

Our work evaluates how much these inter-product relationships affect forecast accuracy. Using popular machine learning and data science tools, we designed a framework for building, training, and evaluating two models and comparing them through a detailed set of forecast metrics.

Given today's data abundance, companies face another issue when making these predictions: the high dimensionality of their datasets. These combine sales records, known as endogenous variables, with discounts, marketing campaigns, or advertising results, known as exogenous variables.

Therefore, we trained the following models:

  • Sparse Vector Autoregressive model (VAR): specialized in modeling these possible product associations, and modified to accept high-dimensional datasets as input.

  • XGBoost: a widely used, versatile, and flexible tool among Kaggle competitors that, in the past few years, has outperformed other kinds of machine learning algorithms on large-scale regression, classification, and ranking problems. Nonetheless, it is not specialized in detecting these product relationships.
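To make the idea of a sparse VAR concrete, here is a minimal sketch. It is not the thesis implementation: it assumes a single lag, no exogenous variables, and an L1 (lasso) penalty solved with proximal gradient descent, which is one common way to obtain sparsity. The function names, the penalty weight `lam`, and the synthetic three-product data are all illustrative assumptions.

```python
import numpy as np


def soft_threshold(x, t):
    """Shrink every entry of x toward zero by t (the lasso proximal step)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def fit_sparse_var1(Y, lam=0.1, n_iter=500):
    """Fit Y[t] ~ A @ Y[t-1] with an L1 penalty on A (hypothetical helper).

    Y is a (T, k) array: T time steps, k products.
    Returns a (k, k) matrix A, where A[i, j] is the estimated effect of
    product j's previous value on product i's current value.
    """
    X, Z = Y[:-1], Y[1:]                      # lagged predictors, targets
    n, k = X.shape
    # Step size from the Lipschitz constant of the least-squares gradient.
    step = n / (np.linalg.norm(X, 2) ** 2)
    B = np.zeros((k, k))                      # B = A.T, so that Z ~ X @ B
    for _ in range(n_iter):
        grad = X.T @ (X @ B - Z) / n          # gradient of the squared loss
        B = soft_threshold(B - step * grad, step * lam)
    return B.T


# Synthetic example (assumed data): product 0 is hurt by product 1's
# past sales (a cannibalization-like effect); product 2 is independent.
rng = np.random.default_rng(0)
A_true = np.array([[0.5, -0.4, 0.0],
                   [0.0, 0.6, 0.0],
                   [0.0, 0.0, 0.3]])
Y = np.zeros((400, 3))
for t in range(1, 400):
    Y[t] = A_true @ Y[t - 1] + rng.normal(scale=1.0, size=3)

A_hat = fit_sparse_var1(Y, lam=0.05)
print(np.round(A_hat, 2))
```

Off-diagonal entries of `A_hat` that remain clearly non-zero are candidate inter-product effects. Increasing `lam` pushes more coefficients to exactly zero, which is what keeps the model tractable when the number of products is large.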

We used a large dataset of sales of more than 2,600 products (divided into segments) from a home improvement, gardening, and workshop retailer, together with promotion discount and duration variables, spanning two years or more. With it, we forecasted sales one month ahead. We then tested both models using time series evaluation metrics to assess which model is better and how good our predictions were.
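This README does not name the evaluation metrics used. As an illustration only, the sketch below computes three metrics that are common in time series forecasting: MAE, RMSE, and MASE (MAE scaled by the error of a naive forecast on the training history). The choice of metrics, the function names, and the numbers are assumptions, not results from the thesis.

```python
import math


def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root mean squared error; penalizes large misses more than MAE."""
    squared = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    return math.sqrt(squared / len(actual))


def mase(actual, forecast, history, season=1):
    """MAE divided by the in-sample MAE of a (seasonal) naive forecast.

    Values below 1 mean the forecast beats the naive baseline that
    repeats the value observed `season` steps earlier.
    """
    naive_errors = [abs(history[t] - history[t - season])
                    for t in range(season, len(history))]
    scale = sum(naive_errors) / len(naive_errors)
    return mae(actual, forecast) / scale


# Made-up weekly sales for one product (illustrative values only).
history = [10, 12, 11, 13, 12]      # training period
actual = [14, 13, 15, 14]           # held-out month
forecast = [13, 13, 14, 16]         # model predictions for that month

print(mae(actual, forecast))
print(rmse(actual, forecast))
print(mase(actual, forecast, history))
```

Scaled metrics such as MASE are useful when comparing accuracy across thousands of products, because they do not depend on each product's sales volume.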

In conclusion, we found that the Sparse Vector Autoregressive model is more accurate than the XGBoost model, which shows that relationships between products affect forecast performance.

Structure

Notebooks

Contact

You can contact me for any kind of questions, discussions or comments about my work at:

Citation


@misc{VARComparisonBRJ,
title={Accuracy comparison between Sparse Autoregressive and XGBoost models for high-dimensional product sales forecasting},
url={https://bitbucket.org/blaiot/master_thesis_blai_ras/src/master/Master%20Thesis%20Report.pdf},
note={Master Thesis paper},
author={Blai Ras},
year={2021}
}
