<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Demand Forecasting with In-Database Time Series </b>
</header>

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial'>Forecasting is the process of predicting the future from past and present data, most commonly by analyzing trends.</p>
 
<p style = 'font-size:16px;font-family:Arial'>For businesses, the ability to predict the future and make informed decisions is critical to their survival. Traditional methods of forecasting from time series data often struggle to produce accurate predictions, especially on large datasets with irregular trends.</p>
<p style = 'font-size:16px;font-family:Arial'>Accurate demand prediction is a critical need for retailers. They need to know how many units of each product to keep on hand at a given time. Too little inventory increases the risk of a stock-out, and too much inventory increases the cost of handling it.</p>
<br>
<p style = 'font-size:16px;font-family:Arial'>Unbounded Array Framework (UAF) is the Teradata framework for building end-to-end time series forecasting pipelines. It also provides functions for digital signal processing and 4D spatial analytics. The series can reside in any Teradata-supported or Teradata-accessible table, or in an analytic result table (ART).</p>

<p style = 'font-size:16px;font-family:Arial'>UAF provides data scientists with the tools for all phases of forecasting:</p>
<li style = 'font-size:16px;font-family:Arial'>Data preparation functions</li>
<li style = 'font-size:16px;font-family:Arial'>Data exploration functions</li>
<li style = 'font-size:16px;font-family:Arial'>Model coefficient estimation functions</li>
<li style = 'font-size:16px;font-family:Arial'>Model validation functions</li>
<li style = 'font-size:16px;font-family:Arial'>Model scoring functions</li>

<p></p>    
<br>  
<p style = 'font-size:16px;font-family:Arial'>In this notebook, we take the role of a data science consultant and walk through an end-to-end approach to predicting demand for each store. We demonstrate how to train models and use them for scoring on the ClearScape Analytics platform. The data is a sample dataset, so the results and predictions may not be entirely accurate.
</p>


<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>1. Start by connecting to the Vantage system.</b></p>

<p style = 'font-size:16px;font-family:Arial'>In this section, we import the required libraries and set environment variables and environment paths (if required).</p>

In [None]:
import time
import teradataml as tdml
from teradataml import * 

import getpass
import warnings
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from tdsense.plot import plotcurves
from tdsense.clustering import resample

display.max_rows=5
warnings.filterwarnings('ignore')

<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press the Enter key, then use down arrow to go to next cell. Begin running steps with Shift + Enter keys.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)
eng.execute('''SET query_band='DEMO=DemandForecasting.ipynb;' UPDATE FOR SESSION; ''')

<hr>
<b style = 'font-size:20px;font-family:Arial;color:#E37C4D'>2. Getting Data for This Demo </b>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may run somewhat faster but requires available storage. The following cell contains two statements, one of which is commented out. To switch modes, move the comment marker to the other statement.</p>


In [None]:
# %run -i ../run_procedure.py "call get_data('DEMO_DemandForecast_cloud');"
 # Takes about 45 seconds
%run -i ../run_procedure.py "call get_data('DEMO_DemandForecast_local');"
 # Takes about 70 seconds

<p style = 'font-size:16px;font-family:Arial'>Next is an optional step – if you want to see status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>3. Analyze Raw Data.</b></p>


In [None]:
df = DataFrame(in_schema('DEMO_DemandForecast','Demand_Data'))
df

<p style = 'font-size:18px;font-family:Arial'>The dataset is a retail dataset containing the timekey, the Product (MODELID), the Store (MARKET), and the DEMAND column that we will analyze.</p>
<p style = 'font-size:18px;font-family:Arial'>In this demo we will consider data for the 3 Products (MODELIDs) listed below.</p>

In [None]:
MODELID_1 = 'MARKET0301261'
MODELID_2 = 'MARKET0501264'
MODELID_3 = 'MARKET0200798'

<p style = 'font-size:18px;font-family:Arial'>We will check the Demand for these Products</p>

In [None]:
df_ts_1 = df.loc[df.MODELID == MODELID_1,['timeKey','DEMAND']].sort('timeKey').to_pandas().set_index('timeKey')
df_ts_2 = df.loc[df.MODELID == MODELID_2,['timeKey','DEMAND']].sort('timeKey').to_pandas().set_index('timeKey')
df_ts_3 = df.loc[df.MODELID == MODELID_3,['timeKey','DEMAND']].sort('timeKey').to_pandas().set_index('timeKey')

In [None]:
fig, ax = plt.subplots(1,3,figsize=(21,5))
df_ts_1.plot(ax=ax[0], title=MODELID_1)
df_ts_2.plot(ax=ax[1], title=MODELID_2)
df_ts_3.plot(ax=ax[2], title=MODELID_3)

<p style = 'font-size:16px;font-family:Arial'>For each Product, we count the rows where the Demand is zero, calculate the duration of the series from its timekey range, and compute the ratio of the two.</p>

In [None]:
dataset_metrics = df[['MODELID','timeKey','DEMAND']]. \
                    assign(demand_is_zero=tdml.sqlalchemy.literal_column('CASE WHEN DEMAND=0 THEN 1 ELSE 0 END')). \
                    groupby('MODELID'). \
                    agg({'timeKey' : ['min','max'], 'demand_is_zero':['sum']}). \
                    assign(duration=tdml.sqlalchemy.literal_column('max_timeKey - min_timeKey')). \
                    select(['MODELID','sum_demand_is_zero','duration']). \
                    assign(ratio = tdml.sqlalchemy.literal_column('CAST(sum_demand_is_zero AS FLOAT) / NULLIFZERO(duration)'))

dataset_metrics
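
<p style = 'font-size:16px;font-family:Arial'>To make the aggregation above easier to follow, here is a minimal local sketch of the same logic in pandas. The toy values are made up for illustration and are not taken from the demo table. The in-database version remains the one used in the rest of the notebook.</p>

```python
import pandas as pd

# Toy data (hypothetical values) mirroring the MODELID / timeKey / DEMAND columns.
toy = pd.DataFrame({
    "MODELID": ["A"] * 5 + ["B"] * 4,
    "timeKey": [10, 11, 12, 13, 14, 20, 21, 22, 23],
    "DEMAND":  [5, 0, 3, 0, 7, 2, 4, 0, 6],
})

# Flag zero-demand rows, then aggregate per series.
toy["demand_is_zero"] = (toy["DEMAND"] == 0).astype(int)
metrics = toy.groupby("MODELID").agg(
    min_timeKey=("timeKey", "min"),
    max_timeKey=("timeKey", "max"),
    sum_demand_is_zero=("demand_is_zero", "sum"),
)
metrics["duration"] = metrics["max_timeKey"] - metrics["min_timeKey"]

# Turn a zero duration into NaN (the role NULLIFZERO plays in SQL) before dividing.
safe_duration = metrics["duration"].where(metrics["duration"] != 0)
metrics["ratio"] = metrics["sum_demand_is_zero"] / safe_duration

print(metrics[["sum_demand_is_zero", "duration", "ratio"]])
```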

<p style = 'font-size:16px;font-family:Arial'>We will check only those Series where the duration is greater than 50.</p>

In [None]:
dataset = df.join(other=dataset_metrics, on='MODELID', how='inner', lsuffix='l', rsuffix='r').assign(MODELID=tdml.sqlalchemy.literal_column('l_MODELID')).drop(columns=['l_MODELID','r_MODELID'])
dataset = dataset[dataset.duration > 50]

In [None]:
%%time
plotcurves(dataset[dataset.MATERIAL < 300],field='DEMAND',row_axis='timeKey', series_id='MODELID',row_axis_type='sequence',plot_type='line')

<p style = 'font-size:16px;font-family:Arial'>The above graph shows the Demand for each Product (MODELID) along the timekey axis.</p>
<p style = 'font-size:16px;font-family:Arial'>We use a window function on the timekey column, partitioned by MODELID, to compute per-series features: the series length, the number and fraction of zero-demand rows, a train/test fold, and a time index that starts at zero.</p>

In [None]:
window_for_counting = dataset.timeKey.window(
                            partition_columns   = "MODELID",
                            order_columns       = 'timeKey'
)

In [None]:
dataset_new = dataset.assign(series_length = window_for_counting.count(),
                             nb_zeros = tdml.sqlalchemy.literal_column('SUM(CASE WHEN DEMAND = 0 THEN 1 ELSE 0 END) OVER (PARTITION BY MODELID)'),
                             frac_zeros = tdml.sqlalchemy.literal_column('CAST((SUM(CASE WHEN DEMAND = 0 THEN 1 ELSE 0 END) OVER (PARTITION BY MODELID)) AS FLOAT)/series_length'),
                             fold = tdml.sqlalchemy.literal_column("CASE WHEN timeKey < 0.67*series_length + (min(timeKey) OVER (PARTITION BY MODELID)) THEN 'train' ELSE 'test' END"),
                             time_no_unit = tdml.sqlalchemy.literal_column("timeKey-(min(timeKey) OVER (PARTITION BY MODELID))")
                            )
dataset_new
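
<p style = 'font-size:16px;font-family:Arial'>The windowed columns can be hard to read as SQL fragments, so the following pandas sketch reproduces them on a small made-up series. It assumes one row per time step. The 0.67 threshold is the one used in the cell above. Everything else (the toy values and the series name) is hypothetical.</p>

```python
import pandas as pd

# One hypothetical series with 10 consecutive time steps.
toy = pd.DataFrame({
    "MODELID": ["A"] * 10,
    "timeKey": list(range(100, 110)),
    "DEMAND":  [4, 0, 6, 5, 7, 0, 8, 9, 6, 7],
})

# Per-series counts, computed like window aggregates partitioned by MODELID.
toy["series_length"] = toy.groupby("MODELID")["timeKey"].transform("count")
toy["nb_zeros"] = toy.groupby("MODELID")["DEMAND"].transform(lambda s: (s == 0).sum())
toy["frac_zeros"] = toy["nb_zeros"] / toy["series_length"]

# Time index that starts at 0 for every series.
toy["time_no_unit"] = toy["timeKey"] - toy.groupby("MODELID")["timeKey"].transform("min")

# The first 67% of each series (by time offset) is 'train', the rest is 'test'.
is_train = toy["time_no_unit"] < 0.67 * toy["series_length"]
toy["fold"] = is_train.map({True: "train", False: "test"})

print(toy[["timeKey", "time_no_unit", "frac_zeros", "fold"]])
```

<p style = 'font-size:16px;font-family:Arial'>With 10 rows the threshold is 6.7, so offsets 0 to 6 fall in the train fold and offsets 7 to 9 in the test fold.</p>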

<p style = 'font-size:16px;font-family:Arial'>We use the subset of data where the series length is greater than 90 and the ratio of zero-demand rows to series length is less than 0.1.</p>

In [None]:
subset = dataset_new[(dataset_new.series_length > 90)&(dataset_new.frac_zeros < 0.1)]
subset

In [None]:
subset.shape

<p style = 'font-size:16px;font-family:Arial'>So the dataset we are using for our analysis has around 46k rows and 21 columns.</p>

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>4. Checking for Stationarity of Time Series using the Dickey Fuller Test</b></p>

<p style = 'font-size:16px;font-family:Arial'>Many time series models assume that the series is stationary. ARIMA models deal with non-stationary time series by differencing: the "d" parameter in ARIMA determines the number of differences needed to make a series stationary.</p>
<p style = 'font-size:16px;font-family:Arial'>Here we will check for stationarity of all time series using the Dickey-Fuller Test. For more info on the test, see <a href="https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-VantageTM-Unbounded-Array-Framework-Time-Series-Reference-17.20/Diagnostic-Statistical-Test-Functions/TD_DICKEY_FULLER/TD_DICKEY_FULLER-Example">here.</a></p>
<p style = 'font-size:16px;font-family:Arial'>The null hypothesis for the test is that the data is non-stationary. We want to REJECT the null hypothesis for this test, so we want a p-value of less than 0.05 and a negative coefficient value for the lag term in our regression model.</p>
<p style = 'font-size:16px;font-family:Arial'>The DickeyFuller function needs series data, so we use the TDSeries function to create a series and apply DickeyFuller to check the stationarity of the data.</p>

In [None]:
# Create teradataml TDSeries object.
data_series_df = tdml.TDSeries(data=subset,
                          id="MODELID",
                          row_index="time_no_unit",
                          row_index_style="SEQUENCE",
                          payload_field="DEMAND",
                          payload_content="REAL")

In [None]:
from teradataml import DickeyFuller
df_out = DickeyFuller(   data=data_series_df,
                           algorithm='NONE')

# Print the result DataFrame.
print(df_out.result)

<p></p>
<p style = 'font-size:16px;font-family:Arial'>In the above output, a p-value below 0.05 for the calculated test statistic means that the series is considered stationary. The output column NULL_HYP, which stands for NULL HYPOTHESIS, can have 2 values:
    <li style = 'font-size:16px;font-family:Arial'>ACCEPT means the null hypothesis is accepted. Unit roots are present, and the series is treated as non-stationary.</li>
<li style = 'font-size:16px;font-family:Arial'>REJECT means the null hypothesis is rejected. No unit roots are detected, and the series is treated as stationary.</li>
</p>
<p style = 'font-size:16px;font-family:Arial'>Since the P_VALUE is less than 0.05, we consider the series stationary.</p>
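
<p style = 'font-size:16px;font-family:Arial'>For intuition about what the test measures, the sketch below computes the basic Dickey-Fuller t-statistic by hand with NumPy, using the regression form without constant or trend. This is an illustration under our own assumptions, not the in-database implementation: the function name <code>df_tstat</code> and the simulated series are hypothetical, and no p-value is computed. For this form of the test, the 5% critical value in large samples is roughly -1.95, so values well below that point toward stationarity.</p>

```python
import numpy as np

def df_tstat(y):
    """t-statistic of rho in  diff(y)[t] = rho * y[t-1] + e[t]  (no constant, no trend)."""
    y = np.asarray(y, dtype=float)
    dy, ylag = np.diff(y), y[:-1]
    rho = (ylag @ dy) / (ylag @ ylag)           # OLS slope without intercept
    resid = dy - rho * ylag
    sigma2 = (resid @ resid) / (len(dy) - 1)    # residual variance (1 fitted parameter)
    return rho / np.sqrt(sigma2 / (ylag @ ylag))

rng = np.random.default_rng(0)
noise = rng.normal(size=500)

random_walk = np.cumsum(noise)                  # has a unit root (non-stationary)
stationary = np.zeros(500)                      # AR(1) with coefficient 0.5
for t in range(1, 500):
    stationary[t] = 0.5 * stationary[t - 1] + noise[t]

print("stationary AR(1):", df_tstat(stationary))
print("random walk     :", df_tstat(random_walk))
```

<p style = 'font-size:16px;font-family:Arial'>The stationary series produces a strongly negative statistic, while the random walk stays much closer to zero.</p>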

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>5. Autocorrelation and Partial Autocorrelation of the time series</b></p>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>5.1 Check for Autocorrelation of the time series</b></p>
<p style = 'font-size:16px;font-family:Arial'>ACF calculates the autocorrelation or autocovariance of a time series. The autocorrelation and autocovariance show how the time series correlates or covaries with itself when delayed by a lag in time or space. Here we check autocorrelation with a maximum lag of 10 time steps.</p>

<p style = 'font-size:16px;font-family:Arial'>First we use the Series created above to get the ACF and PACF.</p>

In [None]:
from teradataml import ACF, PACF
uaf_out = ACF(data=data_series_df,
              max_lags=10,
              alpha=0.05)


In [None]:
uaf_out.result
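
<p style = 'font-size:16px;font-family:Arial'>As a reference for reading the ACF output, here is a small NumPy sketch of the sample autocorrelation and the approximate 95% significance band of ±1.96/√n. It uses the common estimator that divides by the full-series sum of squares. Whether the in-database function uses exactly this normalization depends on its arguments, so treat the sketch as an illustration only. The helper name <code>sample_acf</code> is ours.</p>

```python
import numpy as np

def sample_acf(x, max_lags):
    """Sample autocorrelation for lags 0..max_lags (full-series normalization)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    denom = d @ d
    return np.array([1.0] + [(d[:-k] @ d[k:]) / denom for k in range(1, max_lags + 1)])

# A strictly alternating toy series: strong negative correlation at odd lags.
x = np.array([1.0, -1.0] * 5)
acf = sample_acf(x, max_lags=3)

# Approximate 95% band for "no autocorrelation".
band = 1.96 / np.sqrt(len(x))

print(acf)    # lags 0..3
print(band)
```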

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>5.2. Check for partial autocorrelation of the time series</b>
<p style = 'font-size:16px;font-family:Arial'>The PACF function provides insight as to whether the modelled function is stationary. The partial autocorrelations measure the degree of correlation between time series sample points. Here we check partial autocorrelation with a maximum lag of 10 time steps.</p>

In [None]:
PACF_out = PACF(data=data_series_df,
                    algorithm='LEVINSON_DURBIN',
                    max_lags=10)

In [None]:
PACF_out.result
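
<p style = 'font-size:16px;font-family:Arial'>The LEVINSON_DURBIN option refers to a recursion that derives partial autocorrelations from autocorrelations. The sketch below is a plain NumPy version of the textbook Durbin-Levinson recursion, written by us for illustration. It is not the in-database code, and the function name is hypothetical. As a check, an AR(1) process with coefficient 0.5 has autocorrelations 0.5^k, and its partial autocorrelation should be 0.5 at lag 1 and zero afterwards.</p>

```python
import numpy as np

def pacf_durbin_levinson(rho, max_lags):
    """Partial autocorrelations for lags 0..max_lags from autocorrelations rho (rho[0] == 1)."""
    rho = np.asarray(rho, dtype=float)
    pacf = np.zeros(max_lags + 1)
    pacf[0] = 1.0
    phi_prev = np.zeros(max_lags + 1)       # AR coefficients of the previous order
    for k in range(1, max_lags + 1):
        j = np.arange(1, k)                 # empty when k == 1
        num = rho[k] - phi_prev[j] @ rho[k - j]
        den = 1.0 - phi_prev[j] @ rho[j]
        phi_kk = num / den
        phi = phi_prev.copy()
        phi[j] = phi_prev[j] - phi_kk * phi_prev[k - j]
        phi[k] = phi_kk
        phi_prev = phi
        pacf[k] = phi_kk
    return pacf

# Theoretical autocorrelations of an AR(1) process with coefficient 0.5.
rho_ar1 = 0.5 ** np.arange(4)
pacf_ar1 = pacf_durbin_levinson(rho_ar1, max_lags=3)
print(pacf_ar1)
```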

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>5.3. Plot graphs for ACF and PACF of the time series</b>
<p style = 'font-size:16px;font-family:Arial'>We plot the ACF and PACF graphs for all the 3 series we are considering in our analysis.</p>

In [None]:
df_ts_1 = subset.loc[subset.MODELID == MODELID_1,['timeKey','DEMAND']].sort('timeKey').to_pandas().set_index('timeKey')
df_ts_2 = subset.loc[subset.MODELID == MODELID_2,['timeKey','DEMAND']].sort('timeKey').to_pandas().set_index('timeKey')
df_ts_3 = subset.loc[subset.MODELID == MODELID_3,['timeKey','DEMAND']].sort('timeKey').to_pandas().set_index('timeKey')

In [None]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
def plot_acf_pacf(df, m=12):
    # Plot ACF and PACF side by side for lags 1..m
    fig, (ax1, ax2) = plt.subplots(1,2, figsize=(20,6))
    # Make ACF plot
    plot_acf(df, lags=m, zero=False, ax=ax1)
    # Make PACF plot
    plot_pacf(df, lags=m, zero=False, ax=ax2)

In [None]:
plot_acf_pacf(df_ts_1)

<p style = 'font-size:16px;font-family:Arial'>To get the order of the Moving Average part, Q, we look for the last lag on the X axis of the ACF plot whose value still lies outside the significance band. In the graph, the autocorrelation at lag 2 is outside the confidence band, though close to it. Hence it is acceptable to take <b>Q = 2</b>.</p>
<p style = 'font-size:16px;font-family:Arial'>To get the number of Auto-Regressive lags, P, we look for the last lag of the PACF plot whose value still lies outside the significance band. In the graph, the partial autocorrelation at lag 1 falls outside the band. Hence we take <b>P = 1</b>.</p>
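
<p style = 'font-size:16px;font-family:Arial'>The visual rule above can be written down as a small helper. This is one simple heuristic among several, sketched under our own assumptions: the function name <code>cutoff_lag</code> is hypothetical, the band is the usual ±1.96/√n approximation, and the correlation values are made up rather than read from the plots.</p>

```python
import numpy as np

def cutoff_lag(corr, n_obs, z=1.96):
    """Largest lag k such that lags 1..k all lie outside the band +/- z/sqrt(n_obs)."""
    band = z / np.sqrt(n_obs)
    k = 0
    for value in corr[1:]:          # corr[0] is lag 0 and always equals 1
        if abs(value) <= band:
            break
        k += 1
    return k

# Hypothetical ACF and PACF values for a series with 100 observations.
acf_vals = [1.0, 0.62, 0.31, 0.08, -0.05]
pacf_vals = [1.0, 0.62, 0.04, 0.02, -0.01]

q = cutoff_lag(acf_vals, n_obs=100)    # candidate moving-average order
p = cutoff_lag(pacf_vals, n_obs=100)   # candidate auto-regressive order
print(p, q)
```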

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>6. Using ARIMA (AutoRegressive Integrated Moving Average) model to forecast Demand</b></p>
<p style = 'font-size:16px;font-family:Arial'>ARIMA functions on VANTAGE run in the following order:</p>

<li style = 'font-size:16px;font-family:Arial'>Run <b>ARIMAESTIMATE</b> function to get the coefficients for the ARIMA model.
</li>
<li style = 'font-size:16px;font-family:Arial'><i>[Optional]</i> Run <b>ARIMAVALIDATE</b> function to validate the "goodness of fit" of the ARIMA model, when FIT_PERCENTAGE is not 100 in ARIMAESTIMATE.
</li>
<li style = 'font-size:16px;font-family:Arial'>Run the <b>ARIMAFORECAST</b> function with input from step 1 or step 2 to forecast the future periods beyond the last observed period.</li>
</p>


<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>6.1 Estimation step using ARIMAESTIMATE</b></p>
<p style = 'font-size:16px;font-family:Arial'>The ARIMAESTIMATE function estimates the coefficients corresponding to an ARIMA model and fits a series with an existing ARIMA model. The function can also provide the "goodness of fit" and the residuals of the fitting operation. The function generates a model layer used as input for the ARIMAVALIDATE and ARIMAFORECAST functions. This function is for univariate series.</p>

<br>

<p style = 'font-size:16px;font-family:Arial'>Here, the orders P, d and Q are passed as the non-seasonal model order (P, d, Q), i.e. <b>(1, 1, 2)</b>, together with a seasonal period of 12 and a seasonal model order of (0, 1, 0). The output is stored in a dataframe. The fit percentage is 80, meaning the ARIMA model is trained on 80% of the data. The remaining 20% of the data will be used to validate the model.</p>

In [None]:
from teradataml import ArimaEstimate,ArimaValidate, ArimaForecast, TDAnalyticResult
arima_estimate_op = ArimaEstimate(data1=data_series_df,
                                       nonseasonal_model_order=[1,1,2],
                                       seasonal_period=12,
                                       seasonal_model_order=[0,1,0], 
                                       constant=False,
                                       algorithm="MLE",
                                       coeff_stats=True,
                                       fit_metrics=True,
                                       residuals=True,
                                       fit_percentage=80)
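
<p style = 'font-size:16px;font-family:Arial'>The "d = 1" and seasonal "D = 1, period 12" settings both mean differencing. The NumPy sketch below shows what that does to a made-up series with a linear trend and a repeating 12-step pattern: one regular difference removes the trend, and one seasonal difference removes the repeating pattern. The helper name and the toy series are our own and serve only as an illustration of the idea.</p>

```python
import numpy as np

def difference(y, lag=1):
    """Return y[t] - y[t - lag] for every t where both values exist."""
    y = np.asarray(y, dtype=float)
    return y[lag:] - y[:-lag]

# Toy series: linear trend plus a pattern that repeats every 12 steps.
t = np.arange(48)
season = np.array([0, 1, 3, 2, 0, -1, -3, -2, 0, 1, 2, 1], dtype=float)
y = 2.0 * t + season[t % 12]

d1 = difference(y, lag=1)         # regular differencing (d = 1)
d1_D1 = difference(d1, lag=12)    # seasonal differencing (D = 1, period 12)

print(len(y), len(d1), len(d1_D1))
print(np.abs(d1_D1).max())
```

<p style = 'font-size:16px;font-family:Arial'>Each differencing step shortens the series by its lag, so 48 points become 47 and then 35.</p>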

In [None]:
results_estimate = arima_estimate_op.fitresiduals
results_estimate

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>6.2 Validate using ArimaValidate</b></p>
<p style = 'font-size:16px;font-family:Arial'>The ArimaValidate() function performs an in-sample forecast for seasonal and non-seasonal auto-regressive (AR), moving-average (MA), ARIMA, and Box-Jenkins seasonal ARIMA models, followed by an analysis of the produced residuals. The aim is to provide a collection of metrics useful for model selection and to expose the residuals so that multiple model validation and statistical tests can be conducted.</p>
<p style = 'font-size:16px;font-family:Arial'>The TDAnalyticResult function retrieves auxiliary result sets stored in the output dataframe of the ArimaEstimate. Here we extract the residuals from the previous estimation step. Analytical Result Tables have multiple layers that store different data.</p>

In [None]:
data_art_df = tdml.TDAnalyticResult(data=arima_estimate_op.result)

In [None]:
arima_validate_op = ArimaValidate(data=data_art_df, fit_metrics=True, residuals=True)

In [None]:
arima_validate_op.result.sort('AIC')
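
<p style = 'font-size:16px;font-family:Arial'>Sorting by AIC puts the better-fitting series first, since a lower AIC indicates a better trade-off between fit and model size. As a rough illustration, the sketch below uses the common Gaussian form n·ln(RSS/n) + 2k, where k is the number of estimated coefficients. The exact formula used in the database may differ by constants or conventions, and the residual values and function name here are hypothetical.</p>

```python
import math

def gaussian_aic(residuals, n_params):
    """AIC up to an additive constant, assuming Gaussian residuals."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + 2 * n_params

# Two hypothetical residual sets; 3 parameters = 1 AR + 2 MA coefficients.
resid_a = [1.0, -1.0, 1.0, -1.0]
resid_b = [2.0, -2.0, 2.0, -2.0]

aic_a = gaussian_aic(resid_a, n_params=3)
aic_b = gaussian_aic(resid_b, n_params=3)
print(aic_a, aic_b)   # the smaller residuals give the lower AIC
```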

In [None]:
results_validate = arima_validate_op.fitresiduals

<p style = 'font-size:16px;font-family:Arial'>We plot the actual vs calculated values for the 3 different Products(MODELIDs) we are analyzing.</p>

In [None]:
def plot_results(MODELID):
    res1 = results_validate[results_validate.MODELID == MODELID].sort('ROW_I').to_pandas()
    res2 = results_estimate[results_estimate.MODELID == MODELID].sort('ROW_I').to_pandas()
    res1['ROW_I'] = res1['ROW_I']-res1['ROW_I'].values[0]+res2['ROW_I'].values[-1]+1
    res3 = subset[subset.MODELID == MODELID][['MODELID','time_no_unit','DEMAND']].sort('time_no_unit').to_pandas()
    fig, ax = plt.subplots(figsize=(10,4))
    res1.plot(x='ROW_I',y=['CALC_VALUE'],ax=ax, marker='o',xlabel='Time',ylabel='Demand',)
    res2.plot(x='ROW_I',y=['CALC_VALUE'],ax=ax, marker='s',xlabel='Time',ylabel='Demand')
    res3.plot(x='time_no_unit',y='DEMAND',ax=ax,xlabel='Time',ylabel='Demand')
    return

plot_results(MODELID_1)
plot_results(MODELID_2)
plot_results(MODELID_3)

<p style = 'font-size:16px;font-family:Arial'>The above graphs show the Actual Demand Values(Green) and the Calculated Values for the Demand using the ArimaEstimate(Orange) and ArimaValidate(Blue). The 3 graphs are for the 3 Products(MODELIDs).</p>

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>6.3 Forecast Demand using ArimaForecast</b></p>
<p style = 'font-size:16px;font-family:Arial'>The ArimaForecast() function is used to forecast a user-defined number of periods based on
    models fitted from the ArimaEstimate() function.</p>
<p style = 'font-size:16px;font-family:Arial'>ArimaForecast() takes the model produced by ArimaEstimate() or ArimaValidate() as input and forecasts periods beyond the last observed period. Here we pass the ArimaValidate() result and forecast 20 periods.</p>

In [None]:
data_art_df = TDAnalyticResult(data=arima_validate_op.result)

In [None]:
arima_forecast_op = ArimaForecast(data=data_art_df, forecast_periods=20)

In [None]:
results_forecast = arima_forecast_op.result
results_forecast
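
<p style = 'font-size:16px;font-family:Arial'>The forecast output is plotted below with 80% and 95% bands (the LO_80, HI_80, LO_95 and HI_95 columns). To show where such bands come from, the sketch below computes them for a much simpler model, a zero-mean AR(1) with known coefficient and noise level, rather than the ARIMA(1,1,2) model fitted above. The function name and all parameter values are hypothetical, and the multipliers 1.2816 and 1.96 are the usual normal quantiles.</p>

```python
import numpy as np

def ar1_forecast(y_last, phi, sigma, periods):
    """Point forecasts and 80% / 95% intervals for a zero-mean AR(1) model."""
    h = np.arange(1, periods + 1)
    mean = y_last * phi ** h
    # Forecast error variance after h steps: sigma^2 * (1 + phi^2 + ... + phi^(2(h-1)))
    var = sigma ** 2 * np.cumsum(phi ** (2 * (h - 1)))
    se = np.sqrt(var)
    out = {"mean": mean}
    for name, z in (("80", 1.2816), ("95", 1.96)):
        out["lo_" + name] = mean - z * se
        out["hi_" + name] = mean + z * se
    return out

fc = ar1_forecast(y_last=10.0, phi=0.5, sigma=1.0, periods=3)
print(fc["mean"])
print(fc["lo_95"], fc["hi_95"])
```

<p style = 'font-size:16px;font-family:Arial'>The intervals widen with the forecast horizon because the forecast error variance accumulates at each step, which is why the shaded bands in the plots fan out over time.</p>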

<p style = 'font-size:16px;font-family:Arial'>We plot the actual vs forecast values for the 3 different Products(MODELIDs) we are analyzing.</p>

In [None]:
def plot_forecast(MODELID):
    fig, ax = plt.subplots(figsize=(10,4))
    # Plot prediction
    mean_forecast = results_forecast[results_forecast.MODELID==MODELID].sort('ROW_I').to_pandas()
    res3 = subset[subset.MODELID == MODELID][['MODELID','time_no_unit','DEMAND']].sort('time_no_unit').to_pandas()
    res3['time_no_unit'] = res3['time_no_unit'] - res3.time_no_unit.values[-1]
    res3.plot(x='time_no_unit',y='DEMAND',label='actual',ax=ax)
    mean_forecast.plot(x='ROW_I',y='FORECAST_VALUE',label='forecast',color='red',ax=ax)
    # Shade uncertainty area
    plt.fill_between(mean_forecast.ROW_I, mean_forecast.LO_80, mean_forecast.HI_80, color='pink', alpha=0.5)
    plt.fill_between(mean_forecast.ROW_I, mean_forecast.LO_95, mean_forecast.HI_95, color='pink', alpha=0.2)
    plt.title(MODELID)
    plt.show()
    return
plot_forecast(MODELID_1)
plot_forecast(MODELID_2)
plot_forecast(MODELID_3)

<p style = 'font-size:16px;font-family:Arial'>The red line shows the forecasted values for each Product (MODELID) for the 20 periods specified in ArimaForecast. The pink shaded areas show the 80% (darker) and 95% (lighter) prediction intervals around the forecast.</p>

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>7. Conclusion:</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have trained and validated the ARIMA model on the demand dataset, and the calculated values follow the actual data in the Estimate and Validate graphs. The goodness-of-fit metrics calculated in the estimate and validate phases support the view that the model is adequately trained to forecast demand.</p>

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>8. Cleanup</b></p>


<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Databases and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_DemandForecast');" 
#Takes 45 seconds

In [None]:
remove_context()

<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">Copyright © Teradata Corporation - 2023. All Rights Reserved.</footer>