<img src='../img/logo.png' alt='DS Market logo' height='150px'>

# Sales forecast

## Table of Contents

* [A. Introduction](#introduction)
* [B. Importing Libraries](#libraries)
* [C. Importing Data](#data)
* [D. Total Sales Predictions (by income)](#sales_pred_income)
* [E. Total Sales Predictions (by number of items sold)](#sales_pred_num_items)

## A. Introduction <a class="anchor" id="introduction"></a>

DS Market has so far relied on rudimentary approaches to forecast product sales. The current process obtains the aggregated sales per department / store / city and adds up the independent predictions.

The goal of this notebook is to forecast sales over a 28-day horizon (4 weeks).

## B. Importing Libraries <a class="anchor" id="libraries"></a>

In [73]:
# system and path management
import sys
sys.path.append('../scripts') # including helper functions inside the scripts folder

# removing system warnings
import warnings
warnings.filterwarnings('ignore')

# data manipulation
import pandas as pd
import numpy as np

# time series prediction libs
from prophet import Prophet

# plotting
import matplotlib.pyplot as plt
import plotly.express as px
from prophet.plot import plot_plotly, plot_components_plotly

# plotting options
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams["figure.figsize"] = (10, 7)

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.options.display.float_format = '{:,.2f}'.format

# helper functions
import file_management

# constants
PERIODS = 28

## C. Importing Data <a class="anchor" id="data"></a>

In [74]:
# downloading the processed data files from gdrive
directory = '../data/processed/'
urls = [
    {'filename': 'sales_processed.csv', 'url': 'https://drive.google.com/file/d/1JdeAgraKcaFQJrjG2HPVb5D0VD0iTlNB/view?usp=sharing'},
    {'filename': 'prices_processed.csv', 'url': 'https://drive.google.com/file/d/1pSEJAQfAU-owDjKmxcPrxf3CpGFivwa6/view?usp=sharing'},
    {'filename': 'calendar_processed.csv', 'url': 'https://drive.google.com/file/d/1Lnji96iBkTpFiWo-QXeW3TvESiNYWCML/view?usp=sharing'}
]
        
file_management.download_files_from_url(urls, directory)

sales = pd.read_csv(directory + 'sales_processed.csv', index_col = 0)
prices = pd.read_csv(directory + 'prices_processed.csv', index_col = 0)
calendar = pd.read_csv(directory + 'calendar_processed.csv', index_col = 0)

sales_processed.csv file already exists in ../data/processed/
prices_processed.csv file already exists in ../data/processed/
calendar_processed.csv file already exists in ../data/processed/


In [75]:
# downloading the feature files from gdrive
directory = '../data/features/'
urls = [
    {'filename': 'sales_by_date.csv', 'url': 'https://drive.google.com/file/d/1JMy2pJUp7DscjnY3_vhCNM7NZk9Th4i9/view?usp=sharing'},
    {'filename': 'sales_by_date_store_item.csv', 'url': 'https://drive.google.com/file/d/1e2elGUrr-8lR5qegHQvqVvOCj0jcQY-S/view?usp=sharing'}
]

file_management.download_files_from_url(urls, directory)

sales_by_date = pd.read_csv(directory + 'sales_by_date.csv', index_col = 0)
sales_by_date_store_item = pd.read_csv(directory + 'sales_by_date_store_item.csv', index_col = 0)

sales_by_date.csv file already exists in ../data/features/
sales_by_date_store_item.csv file already exists in ../data/features/


## D. Total Sales Predictions (by income)<a class="anchor" id="sales_pred_income"></a>

### Preparing the dataframes for predictions

Let's check the `sales_by_date` dataframe to prepare it for our forecast. In this section we use Prophet, a time series forecasting procedure based on an additive model in which non-linear trends are fit together with yearly, weekly and daily seasonality, plus the effects of holidays, which are available in our data set. Prophet works best with time series that have strong seasonal effects (weekly or monthly, for example) and enough historical data for that seasonality to be represented. Prophet is robust to missing data and trend changes, and it tends to handle outliers well, which is why we do not correct them before applying the model.
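
As a reminder of what "additive model" means here, the decomposition described in the Prophet paper can be written as follows, where $g(t)$ is the trend, $s(t)$ the seasonal component, $h(t)$ the holiday effect and $\varepsilon_t$ the error term. This is only the general form, not the exact parameterisation fitted below.

$$
y(t) = g(t) + s(t) + h(t) + \varepsilon_t
$$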

In [76]:
sales_by_date.head(5)

date,num_sales,total_income
2011-01-29,32631,100444.46
2011-01-30,31749,97073.53
2011-01-31,23783,70924.5
2011-02-01,25412,74568.19
2011-02-02,19146,57583.22


Since we want the forecast to reflect not only the number of items sold but also their price, we have decided to predict `total_income` rather than the number of sales alone.

In [77]:
# preparing the dataframe with the columns to be used by Prophet
df = sales_by_date.reset_index()
df['date'] = pd.to_datetime(df['date'])

df = df[['date', 'total_income']]
df.columns = ['ds', 'y']
df

,ds,y
0,2011-01-29,100444.46
1,2011-01-30,97073.53
2,2011-01-31,70924.50
3,2011-02-01,74568.19
4,2011-02-02,57583.22
...,...,...
1908,2016-04-20,139043.58
1909,2016-04-21,136022.84
1910,2016-04-22,156581.75
1911,2016-04-23,191485.16


In [78]:
fig = px.line(
    df, 
    x ='ds', 
    y = 'y', 
    hover_name = 'ds', 
    title = 'Daily Sales Evolution'
)
fig.update_xaxes(title = 'Date')
fig.update_yaxes(title = 'Total Income')
fig.show()

Including holidays in our analysis is also important: as the previous plot shows, sales drop almost to 0 at Christmas, so we want our model to take this effect into account.

In [79]:
# formatting holidays to what Prophet expects
# working on a copy so that dropping columns does not touch a slice of `calendar`
calendar_holidays = calendar[calendar.event != 'None'].copy()
calendar_holidays = calendar_holidays.drop(columns = ['weekday', 'weekday_int', 'd'])

holiday_dict = calendar_holidays.groupby('event')['date'].apply(list).to_dict()

holidays = pd.DataFrame()
for key, value in holiday_dict.items():
    result = {}
    result['holiday'] = key
    result['ds'] = pd.to_datetime(value)
    result['lower_window'] = 0
    result['upper_window'] = 1
    
    holidays = pd.concat([holidays, pd.DataFrame(result)])
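
For reference, Prophet expects the holidays dataframe to have a `holiday` and a `ds` column, with `lower_window` and `upper_window` as optional extras. The sketch below builds the same shape without a loop. The calendar rows and event names are made up for illustration, and only the `date` and `event` column names are taken from the cell above.

```python
import pandas as pd

# Toy calendar with made-up rows; `date` and `event` mirror the column
# names used in the notebook, everything else is hypothetical.
toy_calendar = pd.DataFrame({
    'date': ['2011-12-25', '2012-12-25', '2011-09-05', '2011-09-06'],
    'event': ['Christmas', 'Christmas', 'LaborDay', 'None'],
})

toy_holidays = (
    toy_calendar[toy_calendar['event'] != 'None']        # keep only days with an event
    .rename(columns={'event': 'holiday', 'date': 'ds'})  # names Prophet expects
    .assign(
        ds=lambda d: pd.to_datetime(d['ds']),
        lower_window=0,   # effect starts on the holiday itself
        upper_window=1,   # and extends to the following day
    )
    .loc[:, ['holiday', 'ds', 'lower_window', 'upper_window']]
    .reset_index(drop=True)
)

print(toy_holidays)
```

The result has one row per holiday occurrence, which is the same layout the loop above produces.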

### Using Prophet for time series predictions

Let's start by instantiating the model (including our holidays) and fitting the data we have in our dataframe.

In [80]:
model = Prophet(holidays = holidays)
model.fit(df)

INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.


Initial log joint probability = -29.3913


<prophet.forecaster.Prophet at 0x7fcffeaedb20>

    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      99       5022.55    0.00523484        265.46       6.088      0.6088      133   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     154       5023.29   0.000306208       198.274   4.592e-06       0.001      259  LS failed, Hessian reset 
     199       5023.39    0.00011562       40.8326           1           1      322   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     289       5023.93   3.99961e-05       61.3374   4.169e-07       0.001      482  LS failed, Hessian reset 
     299       5023.94   0.000211427       64.9529           1           1      492   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     360       5024.04   3.50225e-05       77.3736   5.684e-07       0.001      656  LS failed, Hessian reset 
     399       5024.09   0.000585827  

Once the model has been fit, we can generate the predictions. We start with a 28-day forecast horizon.

In [81]:
future = model.make_future_dataframe(periods = PERIODS)
future.tail()

,ds
1936,2016-05-18
1937,2016-05-19
1938,2016-05-20
1939,2016-05-21
1940,2016-05-22
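
By default, `make_future_dataframe` keeps the historical dates and appends `periods` new daily dates after the last one. The sketch below reproduces only the appended part with plain pandas. The last observed date is an assumption, inferred from the tail printed above rather than read from the data.

```python
import pandas as pd

PERIODS = 28

# Assumption: the history ends 28 days before the last date in `future`
# (2016-05-22), as implied by the tail printed above.
last_observed = pd.Timestamp('2016-05-22') - pd.Timedelta(days=PERIODS)

# The dates appended after the history: one per day, PERIODS in total.
future_only = pd.date_range(
    start=last_observed + pd.Timedelta(days=1),
    periods=PERIODS,
    freq='D',
)

print(last_observed.date(), future_only[0].date(), future_only[-1].date())
```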


In [82]:
forecast = model.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

,ds,yhat,yhat_lower,yhat_upper
1936,2016-05-18,144852.46,131860.5,156792.62
1937,2016-05-19,146208.49,133795.67,157730.02
1938,2016-05-20,160361.34,148756.96,173365.33
1939,2016-05-21,186191.14,173041.35,199394.1
1940,2016-05-22,187865.75,174945.89,200287.08
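
The `yhat_lower` and `yhat_upper` columns bound Prophet's uncertainty interval, which is 80% wide by default (`interval_width = 0.8`) and was not changed when the model was instantiated. As a rough check of how tight that interval is, the sketch below recomputes its width for the five rows printed above. The values are copied by hand from that output.

```python
import pandas as pd

# The five forecast rows printed above, copied by hand.
tail = pd.DataFrame({
    'ds': pd.to_datetime(['2016-05-18', '2016-05-19', '2016-05-20',
                          '2016-05-21', '2016-05-22']),
    'yhat': [144852.46, 146208.49, 160361.34, 186191.14, 187865.75],
    'yhat_lower': [131860.50, 133795.67, 148756.96, 173041.35, 174945.89],
    'yhat_upper': [156792.62, 157730.02, 173365.33, 199394.10, 200287.08],
})

# Absolute width of the interval and its size relative to the point forecast.
tail['interval_width'] = tail['yhat_upper'] - tail['yhat_lower']
tail['relative_width'] = tail['interval_width'] / tail['yhat']

print(tail[['ds', 'interval_width', 'relative_width']])
```

For these rows the interval spans roughly 13% to 18% of the point forecast.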


In [83]:
# plotting data and the results
# it looks really small because we only predicted 28 days, but the plot can be zoomed in
fig = plot_plotly(model, forecast)
fig.update_xaxes(title = 'Date')
fig.update_yaxes(title = 'Total Income')
fig.show()

Let's look at the components (trend, holidays and seasonality) that Prophet extracted from the data.

In [84]:
# evaluating trends
plot_components_plotly(model, forecast)

From these components we can see that:
- There is a positive trend in sales income. In addition, the upper and lower bounds of the trend stay close to the trend line, so the uncertainty of the overall trend estimate is low.
- Holidays have a considerable effect on the forecast. Thanksgiving, Christmas and New Year have a strongly negative effect on sales, while Labor Day has a positive one.
- Across the year, sales grow from mid-February to the end of April, when they reach their maximum. Towards the end of the year there is a marked decline (bearing in mind that this period includes the Thanksgiving and Christmas holidays). We believe DS Market should focus marketing campaigns or new product launches, together with additional offers, on attracting customers and increasing sales.
- Within the week, sales are highest at weekends.
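
The weekly pattern in the last point can be cross-checked without Prophet by averaging the series per day of the week. The sketch below does this on a synthetic series with a built-in weekend uplift. The numbers are made up, and the same grouping could be applied to the `ds` / `y` dataframe used above.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (made-up numbers): flat baseline with a weekend uplift.
dates = pd.date_range('2016-01-04', periods=28, freq='D')  # 2016-01-04 is a Monday
is_weekend = dates.dayofweek >= 5
toy = pd.DataFrame({'ds': dates, 'y': np.where(is_weekend, 150.0, 100.0)})

# Mean of y per day of the week (0 = Monday ... 6 = Sunday), with readable labels.
day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
weekday_mean = (
    toy.groupby(toy['ds'].dt.dayofweek)['y']
    .mean()
    .rename(index=dict(enumerate(day_names)))
)

print(weekday_mean)
```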

## E. Total Sales Predictions (by number of items sold)<a class="anchor" id="sales_pred_num_items"></a>

However, what we have done so far only helps us forecast the income DS Market would get. If we want to be more operational and later forecast how many products we will need to sell, we need to focus on products and stores separately rather than at an aggregated level.

In [112]:
def clean_df(df_in):
    df = df_in.copy()
    df.index.name = 'ds'
    df.columns = ['store', 'item', 'y']
    df.reset_index(inplace = True)
    df['ds'] = pd.to_datetime(df['ds'])
    return df


def filter_df(df, store = None, item = None):
    if store and item:
        return df[(df.store == store) & (df.item == item)].drop(columns = ['item', 'store'])
    elif store:
        return df[df.store == store].drop(columns = ['item', 'store']).groupby('ds').sum().reset_index()
    elif item:
        return df[df.item == item].drop(columns = ['item', 'store']).groupby('ds').sum().reset_index()
    else:
        print('A store or an item is mandatory')
        return None


def train_prophet(df, holidays_list = None):
    if holidays_list is None:  # check the function argument, not the global `holidays`
        model = Prophet()
    else:
        model = Prophet(holidays = holidays_list)
    model.fit(df)

    future = model.make_future_dataframe(periods = PERIODS)
    forecast = model.predict(future)

    return model, forecast  

def plot_prophet_time_series(df, styles, model = None, forecast = None):
    if model is None:
        fig = px.line(
            df,
            x = 'ds',
            y = 'y',
            hover_name = 'ds',
            title = styles['title']
        )
    else:
        fig = plot_plotly(model, forecast)
    
    if styles['xaxes_label']:
        fig.update_xaxes(title = styles['xaxes_label'])
    if styles['yaxes_label']:
        fig.update_yaxes(title = styles['yaxes_label'])
    
    fig.show()


def pipeline(df_in, store = None, item = None, holidays_list = None):
    print('Cleaning Data...')
    df = clean_df(df_in) 

    print('Selecting features...')
    df = filter_df(df, store = store, item = item)
    
    title = 'Daily number of sold items '
    if store:
        title += '(store = ' + store + ') '
    if item:
        title += '(item = ' + item + ')'

    plot_prophet_time_series(
        df,
        {
            'title': title,
            'xaxes_label': 'Date',
            'yaxes_label': 'Number of Sales'
        }
    )

    print('Training model...')
    model, forecast = train_prophet(df, holidays_list)

    plot_prophet_time_series(
        df,
        {
            'xaxes_label': 'Date',
            'yaxes_label': 'Number of Sales'
        },
        model,
        forecast
    )

    plot_components_plotly(model, forecast).show()

    return df

In [113]:
_ = pipeline(sales_by_date_store_item, item = 'ACCESORIES_1_001', store = 'BOS_1', holidays_list = holidays)

Cleaning Data...
Selecting features...


INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.


Training model...
Initial log joint probability = -45.3761
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      99       3870.79    0.00618674       114.253      0.3375           1      123   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     148       3873.32   0.000839246       203.025   1.131e-05       0.001      229  LS failed, Hessian reset 
     199       3873.73   2.06182e-05       75.6693           1           1      298   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     218       3873.75   3.49967e-05       75.0423   4.086e-07       0.001      362  LS failed, Hessian reset 
     244       3873.75   6.14052e-08       71.7686     0.03496      0.8514      400   
Optimization terminated normally: 
  Convergence detected: relative gradient magnitude is below tolerance


,ds,y
0,2011-01-29,0
30490,2011-01-30,0
60980,2011-01-31,0
91470,2011-02-01,0
121960,2011-02-02,0
...,...,...
58174920,2016-04-20,0
58205410,2016-04-21,1
58235900,2016-04-22,0
58266390,2016-04-23,0
