<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Demand Forecasting with In-Database Time Series
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Retail stores rely on sales and on an accurate amount of inventory to support those sales. However, demand is ever-changing, which can leave stores overstocked or out of stock. To drive sales and control costs, retailers need to adapt to changes fast, and that is challenging without the right data. Stay ahead of supply and demand fluctuations with VantageCloud and ClearScape Analytics. Our complete cloud analytics and data platform for AI enables you to harmonize and analyze sales and inventory data across all your stores while considering factors like holidays or sudden spikes in demand.</p> 


<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Business Value </b></p>
<li style = 'font-size:16px;font-family:Arial'>Accurately predict sales over specific time periods. </li>
<li style = 'font-size:16px;font-family:Arial'>Identify seasonal trends in sales and demand to improve inventory management. </li>
<li style = 'font-size:16px;font-family:Arial'>Better plan for non-seasonal sales spikes and dips. </li>
<li style = 'font-size:16px;font-family:Arial'>Drive customer satisfaction and strengthen customer loyalty.</li></p>  
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Why Vantage? </b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Unbounded Array Framework (UAF) is the Teradata framework for building end-to-end time series forecasting pipelines. It also provides functions for digital signal processing and 4D spatial analytics. The series can reside in any Teradata supported or Teradata accessible table or in an analytic result table (ART). </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>UAF provides data scientists with the tools for all phases of forecasting: </p>
<li style = 'font-size:16px;font-family:Arial'>Data preparation functions </li>
<li style = 'font-size:16px;font-family:Arial'>Data exploration functions </li>
<li style = 'font-size:16px;font-family:Arial'>Model coefficient estimation functions </li>
<li style = 'font-size:16px;font-family:Arial'>Model validation functions </li>
<li style = 'font-size:16px;font-family:Arial'>Model scoring functions </li>
    </p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Plus, with Teradata Vantage, users can perform these functions at scale and analyze and forecast hundreds or thousands of series at once. Time series analysis requires significant effort in analyzing, preparing, and testing forecast models. Traditional approaches require users to repeat these laborious tasks for each prediction, so scaling forecasting efforts beyond a small number of different forecasts becomes prohibitive. The UAF architecture provides a range of unique benefits including: </p>
<li style = 'font-size:16px;font-family:Arial'>Rapid data exploration, preparation, and testing functions that can analyze massive amounts of data across an unlimited number of forecasts in parallel; drastically reducing the development and testing times. </li>
<li style = 'font-size:16px;font-family:Arial'>The creation of a nearly unlimited number of forecasts in parallel, unlocking value in hyper-segmented (per-store-per-SKU inventory demand, per-household energy consumption) predictions, based on individualized models.</li>
<li style = 'font-size:16px;font-family:Arial'>The ability to deploy the preparation and forecasting functions into automated pipelines that can run in near-real-time, eliminating the gaps between preparation, development, and deployment. 
</li>


<p></p>    
 
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this notebook, we take the role of a data science consultant and walk through the complete approach to predicting demand for each store. We demonstrate how to train models and use them for scoring on the ClearScape Analytics platform. The data is a sample dataset, so the results and predictions may not be entirely accurate.
</p>


<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>1. Connect to Vantage.</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We start by importing the required libraries and setting environment variables and environment paths (if required).</p>

In [None]:
%%capture
# # '%%capture' suppresses the display of installation steps of the following packages
# !pip install tdsense==0.1.3.11
# !pip install  tdnpathviz==0.1.2.22

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: </b><i>The above statements may need to be uncommented if you run the notebooks on a platform other than ClearScape Analytics Experience that does not have the libraries installed. If you uncomment those installs, be sure to restart the kernel after executing those lines to bring the installed libraries into memory. The simplest way to restart the Kernel is by typing zero zero: <b> 0 0</b></i></p>
</div>

In [None]:
import time
import teradataml as tdml
from teradataml import * 
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

import getpass
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

from tdsense.plot import plotcurves
from tdsense.clustering import resample

display.max_rows=5

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell. Begin running steps with Shift + Enter keys.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Retail_Demand_Forecasting_Python.ipynb;' UPDATE FOR SESSION; ''')

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>2. Getting Data for This Demo </b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. We have the option of either running the demo using foreign tables to access the data without using any storage on our environment or downloading the data to local storage, which may yield somewhat faster execution. However, we need to consider available storage. There are two statements in the following cell, and one is commented out. We may switch which mode we choose by changing the comment string.</p>


In [None]:
# %run -i ../run_procedure.py "call get_data('DEMO_DemandForecast_cloud');"
 # Takes about 45 seconds
%run -i ../run_procedure.py "call get_data('DEMO_DemandForecast_local');"
 # Takes about 70 seconds

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Optional step – We should execute the below step only if we want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>3. Analyze Raw Data.</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let us start by creating a "Virtual DataFrame" that points directly to the dataset in Vantage.</p>

In [None]:
df = DataFrame(in_schema('DEMO_DemandForecast','Demand_Data'))
df

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The dataset is a retail dataset containing the timeKey (the lowest-granularity time column used in our analysis), the product (MODELID), the store (MARKET), and the DEMAND column that we analyze. The timeKey column is generated so that each series can be ordered and analyzed over time.</p>

In [None]:
df_count = df.select(['timeKey', 'MARKET', 'MODELID'])
df_count.count(distinct=True)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The dataset contains 6 different stores and 4106 different products, which we analyze over a generated series of 166 distinct timeKey values.</p>

In [None]:
df2=df_count.groupby('MARKET')
df_plot=df2.count(distinct=True).to_pandas()


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The aggregation is computed in Vantage through a teradataml DataFrame, and only the small result is brought to the client. Let's visualize this data to better understand the count of products in each market. Vantage's ClearScape Analytics integrates easily with third-party visualization tools like Tableau and Power BI, and with Python modules such as plotly and seaborn. We can do all the calculations and pre-processing in Vantage and pass only the necessary information to the visualization tools. This makes the calculation faster and also reduces the time spent moving data between tools. We transfer data for this and the subsequent visualizations only where necessary.</p>

In [None]:
sns.barplot(x = 'MARKET',y = 'count_MODELID',data = df_plot)
plt.xlabel('MARKET')
plt.ylabel('Count of Products')
plt.title('Count of Products Sold by each Market')
plt.show()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As seen in the above chart, MARKET01 and MARKET06 do not have much data, so we will not consider them in further analysis.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will check if the Demand is Zero, and also calculate the duration of the Demand based on the timekey for these Products.</p>

In [None]:
dataset_metrics = df[['MODELID','timeKey','DEMAND']]. \
                    assign(demand_is_zero=tdml.sqlalchemy.literal_column('CASE WHEN DEMAND=0 THEN 1 ELSE 0 END')). \
                    groupby('MODELID'). \
                    agg({'timeKey' : ['min','max'], 'demand_is_zero':['sum']}). \
                    assign(duration=tdml.sqlalchemy.literal_column('max_timeKey - min_timeKey')). \
                    select(['MODELID','sum_demand_is_zero','duration']). \
                    assign(ratio = tdml.sqlalchemy.literal_column('CAST(sum_demand_is_zero AS FLOAT) / NULLIFZERO(duration)'))

dataset_metrics
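
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the metrics above concrete, here is a minimal pure-Python sketch of the same per-product calculation on a few made-up rows. The function name <code>zero_demand_metrics</code> and the sample tuples are illustrative assumptions; the notebook itself runs the equivalent logic in-database.</p>

```python
from collections import defaultdict

def zero_demand_metrics(rows):
    """rows: iterable of (model_id, time_key, demand) tuples (hypothetical layout)."""
    stats = defaultdict(lambda: {"min": None, "max": None, "zeros": 0})
    for model_id, time_key, demand in rows:
        s = stats[model_id]
        s["min"] = time_key if s["min"] is None else min(s["min"], time_key)
        s["max"] = time_key if s["max"] is None else max(s["max"], time_key)
        s["zeros"] += 1 if demand == 0 else 0

    out = {}
    for model_id, s in stats.items():
        duration = s["max"] - s["min"]
        # Mirror NULLIFZERO: a zero duration gives no ratio instead of a division error.
        ratio = s["zeros"] / duration if duration != 0 else None
        out[model_id] = {
            "sum_demand_is_zero": s["zeros"],
            "duration": duration,
            "ratio": ratio,
        }
    return out

# Made-up sample: product "A" spans timeKey 1..5, product "B" has a single row.
sample_rows = [("A", 1, 5), ("A", 2, 0), ("A", 5, 3), ("B", 7, 0)]
metrics = zero_demand_metrics(sample_rows)
print(metrics)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Note that the duration is a difference of timeKey values, not a row count, so a series with gaps can have a duration larger than its number of observations.</p>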

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will keep only those series whose duration (the difference between the maximum and minimum timeKey for each product) is greater than 30. To fit a seasonal model, we want series that cover at least 2 years (24 months), hence the threshold of 30. For plotting, we additionally restrict the data to MATERIAL less than 300.</p>

<p style = 'font-size:14px;font-family:Arial'><i>Material less than 300 is considered only to show lesser data in plots for better understanding </i></p>

In [None]:
dataset = df.join(other=dataset_metrics, on='MODELID', how='inner', lsuffix='l', rsuffix='r').assign(MODELID=tdml.sqlalchemy.literal_column('MODELID_l')).drop(columns=['MODELID_l','MODELID_r'])
dataset = dataset[dataset.duration > 30]


In [None]:
from tdnpathviz.visualizations import plotcurves
plotcurves(dataset[dataset.MATERIAL < 300],field='DEMAND',row_axis='timeKey', series_id='MODELID',row_axis_type='sequence',plot_type='line',
           legend='best',width=1800,height=1000)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The above graph shows the Demand for each Product(MODELID) along the timekey axis.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>One of our strengths is that we can run thousands of models concurrently, however for the purposes of this demo, we will arbitrarily choose 3 products so you can follow the process. The Product selection is random. </p>
<p style = 'font-size:12px;font-family:Arial'><i>** You can change the products (MODELIDs) in the cell below if you want to see the output for other products.</i></p>

In [None]:
MODELID_1 = 'MARKET0412164'
MODELID_2 = 'MARKET0412341'
MODELID_3 = 'MARKET0205595'

<p style = 'font-size:18px;font-family:Arial'>We will check the Demand for these Products</p>

In [None]:
df_ts_1 = df.loc[df.MODELID == MODELID_1,['timeKey','DEMAND']].sort('timeKey')#.to_pandas().set_index('timeKey')
df_ts_2 = df.loc[df.MODELID == MODELID_2,['timeKey','DEMAND']].sort('timeKey')#.to_pandas().set_index('timeKey')
df_ts_3 = df.loc[df.MODELID == MODELID_3,['timeKey','DEMAND']].sort('timeKey')#.to_pandas().set_index('timeKey')

In [None]:
fig, axes = subplots(nrows=1, ncols=3)
fig.height,fig.width = 400,1200 
# Use the MODELID variables as titles so the plots stay correct if the products are changed.
plot = df_ts_1.plot(x=df_ts_1.timeKey, y=df_ts_1.DEMAND,
                          ax=axes[0],
                          figure=fig, kind="line",
                          title=MODELID_1, style="blue")
 
plot = df_ts_2.plot(x=df_ts_2.timeKey, y=df_ts_2.DEMAND,
                          ax=axes[1],
                          figure=fig, kind="line",
                          title=MODELID_2, style="blue")
 
plot = df_ts_3.plot(x=df_ts_3.timeKey, y=df_ts_3.DEMAND,
                          ax=axes[2], 
                          figure=fig, kind="line",
                          title=MODELID_3, style="blue")
 
# Display the plot.
plot.show()


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We use the window function on the timekey column to build a series for the Demands for different ModelIDs. We use this function to count the series length and filter out timeseries that are too short for ARIMA.</p>

In [None]:
window_for_counting = dataset.timeKey.window(
                            partition_columns   = "MODELID",
                            order_columns       = 'timeKey'
)

In [None]:
dataset_new = dataset.assign(series_length = window_for_counting.count(),
                             # Count the zero-demand points per series (DEMAND = 0).
                             nb_zeros = tdml.sqlalchemy.literal_column('SUM(CASE WHEN DEMAND = 0 THEN 1 ELSE 0 END) OVER (PARTITION BY MODELID)'),
                             frac_zeros = tdml.sqlalchemy.literal_column('CAST((SUM(CASE WHEN DEMAND = 0 THEN 1 ELSE 0 END) OVER (PARTITION BY MODELID)) AS FLOAT)/series_length'),
                             fold = tdml.sqlalchemy.literal_column("CASE WHEN timeKey < 0.67*series_length + (min(timeKey) OVER (PARTITION BY MODELID)) THEN 'train' ELSE 'test' END"),
                             time_no_unit = tdml.sqlalchemy.literal_column("timeKey-(min(timeKey) OVER (PARTITION BY MODELID))")
                            )
dataset_new

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We use the subset of data where the series length is greater than 90 and the ratio of zero-demand points to series length is less than 0.1. This filters out the markets that show almost zero demand (MARKET01 and MARKET06).</p>

In [None]:
subset = dataset_new[(dataset_new.series_length > 90)&(dataset_new.frac_zeros < 0.1)]
subset
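
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <code>fold</code> and <code>time_no_unit</code> expressions above can be hard to read in SQL form. The sketch below restates them in plain Python for one series. It assumes that <code>series_length</code> is the total number of points in the series; the helper name <code>assign_folds</code> and the sample timeKey values are made up for illustration.</p>

```python
def assign_folds(time_keys, train_fraction=0.67):
    """Label each time key 'train' or 'test' using the cutoff rule from the cell above."""
    start = min(time_keys)
    series_length = len(time_keys)  # assumption: total number of points in the series
    cutoff = train_fraction * series_length + start
    return {t: ("train" if t < cutoff else "test") for t in time_keys}

def time_no_unit(time_keys):
    """Shift time keys so that every series starts at 0."""
    start = min(time_keys)
    return [t - start for t in time_keys]

# Made-up series with ten consecutive time keys starting at 100.
keys = list(range(100, 110))
folds = assign_folds(keys)
shifted = time_no_unit(keys)

print(folds)
print(shifted)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>With ten points the cutoff is 0.67 * 10 + 100 = 106.7, so timeKey values up to 106 are labeled as training data and the remaining three as test data.</p>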

In [None]:
subset.shape

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>So, the dataset we are using for our analysis has around 46k rows and 21 columns.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. Checking for Stationarity of Time Series using the Dickey Fuller Test</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To be able to model a time series, it needs to be stationary. Stationarity is a property of a time series whose statistical properties do not change over time: a stationary series exhibits constant mean, constant variance, and constant autocovariance over different time periods. ARIMA models deal with non-stationary time series by differencing (the "d" parameter in ARIMA is the number of differences applied to make a series stationary).</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here we will check for stationarity of the time series using the Dickey-Fuller test. For more information on the test, see <a href="https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-VantageTM-Unbounded-Array-Framework-Time-Series-Reference-17.20/Diagnostic-Statistical-Test-Functions/TD_DICKEY_FULLER/TD_DICKEY_FULLER-Example">here.</a></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The null hypothesis for the test is that the data is non-stationary. We want to REJECT the null hypothesis for this test. So, we want a p-value of less than 0.05 and a negative coefficient value for the lag term in our regression model.</p> 
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The Dickey fuller function needs series data so we use the TDSeries function to create a series and apply DickeyFuller to check the stationarity of the data.</p>
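
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a rough illustration of what the test measures, the sketch below fits the simplest Dickey-Fuller regression (no constant and no trend) in pure Python and compares the t-statistic of the lag coefficient with the commonly tabulated large-sample 5% critical value of about -1.95. This is a simplified local sketch on synthetic data, not the in-database implementation; the function name and the generated series are assumptions made for the example.</p>

```python
import math
import random

def dickey_fuller_none(y):
    """Fit dy_t = rho * y_{t-1} + e_t by least squares (no constant, no trend).

    Returns the lag coefficient rho and its t-statistic.
    """
    lagged = y[:-1]
    diffs = [b - a for a, b in zip(y[:-1], y[1:])]
    sxx = sum(v * v for v in lagged)
    rho = sum(l * d for l, d in zip(lagged, diffs)) / sxx
    residuals = [d - rho * l for l, d in zip(lagged, diffs)]
    dof = len(diffs) - 1  # one estimated coefficient
    sigma2 = sum(r * r for r in residuals) / dof
    t_stat = rho / math.sqrt(sigma2 / sxx)
    return rho, t_stat

# Approximate 5% critical value for the no-constant case with a large sample.
CRITICAL_5PCT = -1.95

rng = random.Random(0)
noise = [rng.gauss(0, 1) for _ in range(300)]  # stationary white noise

walk, total = [], 0.0                          # random walk: has a unit root
for e in noise:
    total += e
    walk.append(total)

rho_noise, t_noise = dickey_fuller_none(noise)
rho_walk, t_walk = dickey_fuller_none(walk)

print("white noise:", rho_noise, t_noise, "reject" if t_noise < CRITICAL_5PCT else "accept")
print("random walk:", rho_walk, t_walk, "reject" if t_walk < CRITICAL_5PCT else "accept")
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For the white-noise series the lag coefficient is strongly negative and the t-statistic lies far below the critical value, so the null hypothesis of a unit root is rejected. A random walk typically produces a t-statistic much closer to zero.</p>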

In [None]:
# Create teradataml TDSeries object.
data_series_df = tdml.TDSeries(data=subset,
                          id="MODELID",
                          row_index="time_no_unit",
                          row_index_style="SEQUENCE",
                          payload_field="DEMAND",
                          payload_content="REAL")

In [None]:
from teradataml import DickeyFuller
df_out = DickeyFuller(   data=data_series_df,
                           algorithm='NONE')

# Print the result DataFrame.
print(df_out.result)

<p></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the above output, the p-value corresponding to the calculated test statistic is less than 0.05, so we reject the null hypothesis of non-stationarity. The output column NULL_HYP (null hypothesis) can take two values:
    <li style = 'font-size:16px;font-family:Arial'>ACCEPT means the null hypothesis is accepted. A unit root is present, and therefore the process is non-stationary.</li>
<li style = 'font-size:16px;font-family:Arial'>REJECT means the null hypothesis is rejected. No unit root is detected, and the process may or may not be stationary, depending on other factors.</li>
</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Since the P_VALUE is less than 0.05, we treat the series as stationary.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>5. Autocorrelation and Partial Autocorrelation of the time series</b></p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.1 Check for Autocorrelation of the time series</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>ACF calculates the autocorrelation or autocovariance of a time series. The autocorrelation and autocovariance show how the time series correlates or covaries with itself when delayed by a lag in time or space. Here we check autocorrelation with a maximum lag of 12 time steps.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>First we use the Series created above to get the ACF and PACF.</p>

In [None]:
from teradataml import ACF, PACF
uaf_out = ACF(data=data_series_df,
                  max_lags=12,
              demean=True,
              qstat = False,
             alpha=0.05)


In [None]:
uaf_out.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the above output: </p> 
    <li style = 'font-size:16px;font-family:Arial'>ROW_I :- index of the series.</li>
    <li style = 'font-size:16px;font-family:Arial'>CONF_OFF :- Confidence bands in accordance with Bartlett’s formula.</li>
    <li style = 'font-size:16px;font-family:Arial'>CONF_LOW :- Confidence bands in accordance with Bartlett’s formula (Lower limit).</li>
    <li style = 'font-size:16px;font-family:Arial'>CONF_HI :- Confidence bands in accordance with Bartlett’s formula (Higher limit).</li>
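
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For intuition, the following pure-Python sketch computes a sample autocorrelation with demeaning and a Bartlett-style confidence offset. It is a simplified illustration with hypothetical function names and a tiny made-up series; the exact normalization used in-database may differ.</p>

```python
import math

def acf(values, max_lags, demean=True):
    """Sample autocorrelation for lags 0..max_lags."""
    n = len(values)
    mean = sum(values) / n if demean else 0.0
    centered = [v - mean for v in values]
    c0 = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t - k] for t in range(k, n)) / c0
        for k in range(max_lags + 1)
    ]

def bartlett_offset(acf_values, n, z=1.96):
    """Half-width of the confidence band for lags 1..K using Bartlett's formula.

    Var(r_k) is approximated by (1 + 2 * sum_{j<k} r_j^2) / n; z=1.96 matches alpha=0.05.
    """
    offsets = []
    for k in range(1, len(acf_values)):
        var = (1 + 2 * sum(r * r for r in acf_values[1:k])) / n
        offsets.append(z * math.sqrt(var))
    return offsets

demo = [1, 2, 3, 4]  # made-up series
demo_acf = acf(demo, max_lags=2)
demo_offsets = bartlett_offset(demo_acf, n=len(demo))

print(demo_acf)
print(demo_offsets)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A lag is usually considered significant when the absolute autocorrelation exceeds the offset at that lag, which is the comparison the confidence columns above support.</p>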

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.2. Check for partial autocorrelation of the time series</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The PACF function provides insight as to whether the modelled function is stationary. The partial autocorrelations measure the degree of correlation between time series sample points. Here we check partial autocorrelation with a maximum lag of 12 time steps.</p>

In [None]:
PACF_out = PACF(data=data_series_df,
                    algorithm='LEVINSON_DURBIN',
                    max_lags=12,
             alpha=0.05)

In [None]:
PACF_out.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The PACF() function provides insight as to whether the function being modeled is stationary or not. The partial auto correlations are used to measure the degree of correlation between series sample points. The algorithm removes the effects of the previous lag.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the above output: </p> 
    <li style = 'font-size:16px;font-family:Arial'>ROW_I :- index of the series.</li>
    <li style = 'font-size:16px;font-family:Arial'>CONF_OFF :- Confidence bands in accordance with Bartlett’s formula.</li>
    <li style = 'font-size:16px;font-family:Arial'>CONF_LOW :- Confidence bands in accordance with Bartlett’s formula (Lower limit).</li>
    <li style = 'font-size:16px;font-family:Arial'>CONF_HI :- Confidence bands in accordance with Bartlett’s formula (Higher limit).</li>
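
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The cell above selects the LEVINSON_DURBIN algorithm. As a hedged illustration of the idea, the sketch below implements the textbook Durbin-Levinson recursion in pure Python, turning autocorrelations into partial autocorrelations. The function name is hypothetical and the in-database implementation may differ in details.</p>

```python
def pacf_levinson_durbin(rho, max_lags):
    """Partial autocorrelations for lags 1..max_lags.

    rho: list of autocorrelations with rho[0] == 1 and len(rho) > max_lags.
    """
    pacf = []
    phi_prev = []  # coefficients phi_{k-1,1} .. phi_{k-1,k-1} from the previous step
    for k in range(1, max_lags + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new partial autocorrelation.
        phi_curr = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
        phi_curr.append(phi_kk)
        pacf.append(phi_kk)
        phi_prev = phi_curr
    return pacf

# Theoretical autocorrelations of an AR(1) process with coefficient 0.5: rho_k = 0.5 ** k.
ar1_rho = [0.5 ** k for k in range(4)]
ar1_pacf = pacf_levinson_durbin(ar1_rho, max_lags=3)
print(ar1_pacf)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For an AR(1) process the partial autocorrelation cuts off after lag 1, which is why the PACF plot is used to suggest the auto-regressive order.</p>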

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.3. Plot graphs for ACF and PACF of the time series</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We plot the ACF and PACF graphs for one of the 3 series we are considering in our analysis (MODELID_3). Change the MODELID in the filter below to inspect the other series.</p>

In [None]:
df_acf=uaf_out.result
df_pacf=PACF_out.result

In [None]:
df_acf_plot = df_acf.loc[df_acf.MODELID == MODELID_3]
df_pacf_plot = df_pacf.loc[df_pacf.MODELID == MODELID_3]

In [None]:
# fig, axes = subplots(nrows=2, ncols=1)
 
plot = df_acf_plot.plot(x=df_acf_plot.ROW_I, 
        y=(df_acf_plot.OUT_DEMAND, df_acf_plot.CONF_OFF_DEMAND),
        kind='corr', figsize=(600,400),ylabel = " ",
        color="blue",title="Auto Correlation")
plot.show()
plot = df_pacf_plot.plot(x=df_pacf_plot.ROW_I, 
        y=(df_pacf_plot.OUT_DEMAND, df_pacf_plot.CONF_OFF_DEMAND),
        kind='corr',figsize=(600,400),ylabel = " ",
        color="blue",title="Partial Auto Correlation")

plot.show()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To get the value of the Moving Average order, or Q, we look for the lag (here, ROW_I on the X axis) where the value from the ACF plot is outside the significance limit above the zero line. Looking at the graph, the autocorrelation value at ROW_I = 3 is outside the confidence band but close to it. Hence it is acceptable to say that the value of the Moving Average or <b>Q = 2</b>.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To get the value of the Auto-Regressive lags, or P, we look for the lag (here, ROW_I) where the value from the PACF plot falls just outside the significance limit. Looking at the graph, the partial autocorrelation value at ROW_I = 1 falls far outside the significance limit of the confidence band, so here we will consider the value as zero. Hence we can say that the value of Auto-Regressive lags or <b>P = 0</b>.</p>
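
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Reading significant lags off a plot can be error-prone, so a small helper can list them instead. The sketch below is a hypothetical illustration: the function name and the numbers are made up, and in practice the inputs would come from the correlation and confidence-offset columns of the ACF or PACF result after transferring them to the client.</p>

```python
def significant_lags(corr_by_lag, offset_by_lag):
    """Return the lags whose absolute correlation exceeds the confidence offset."""
    return [
        lag
        for lag in sorted(corr_by_lag)
        if abs(corr_by_lag[lag]) > offset_by_lag[lag]
    ]

# Made-up values for illustration only (lag -> value).
acf_values = {1: 0.62, 2: 0.41, 3: 0.18, 4: 0.05}
acf_offsets = {1: 0.20, 2: 0.26, 3: 0.29, 4: 0.29}

acf_hits = significant_lags(acf_values, acf_offsets)
print(acf_hits)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The resulting list is only a starting point for choosing P and Q; candidate orders are normally compared with the validation metrics computed later.</p>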

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>6. Using ARIMA (AutoRegressive Integrated Moving Average) model to forecast Demand</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>ARIMA functions on VANTAGE run in the following order:</p>

<li style = 'font-size:16px;font-family:Arial'>Run <b>ARIMAESTIMATE</b> function to get the coefficients for the ARIMA model.
</li>
<li style = 'font-size:16px;font-family:Arial'><i>[Optional]</i> Run <b>ARIMAVALIDATE</b> function to validate the "goodness of fit" of the ARIMA model, when FIT_PERCENTAGE is not 100 in ARIMAESTIMATE.
</li>
<li style = 'font-size:16px;font-family:Arial'>Run the <b>ARIMAFORECAST</b> function with input from step 1 or step 2 to forecast the future periods beyond the last observed period.</li>
</p>


<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>6.1 Estimation step using ARIMAESTIMATE</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The ARIMAESTIMATE function estimates the coefficients corresponding to an ARIMA model and fits a series with an existing ARIMA model. The function can also provide the "goodness of fit" and the residuals of the fitting operation. The function generates a model layer used as input for the ARIMAVALIDATE and ARIMAFORECAST functions. This function is for univariate series.</p>

<br>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, the previously estimated parameters P, d and Q are passed as the non-seasonal model order (P, d, Q), i.e. <b>MODEL_ORDER(0, 1, 2)</b>. The call below also specifies a seasonal component with a seasonal period of 12 and a seasonal model order of (1, 1, 1). The output is stored in a dataframe. The fit percentage is 80, meaning the ARIMA model is being trained on 80% of the data. The remaining 20% of the data will be used to validate the model.</p>
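
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The middle value of each model order is a number of differences. As a small illustration, the pure-Python sketch below applies one seasonal difference (lag 12) and one regular difference (lag 1) to a synthetic series made of a linear trend plus a repeating 12-step pattern. The helper name and the data are assumptions for the example; the in-database function performs the differencing internally.</p>

```python
def difference(values, lag=1):
    """Return values[i] - values[i - lag] for every index where both exist."""
    return [values[i] - values[i - lag] for i in range(lag, len(values))]

# Synthetic series: linear trend (slope 2) plus a pattern repeating every 12 steps.
pattern = [0, 1, 3, 6, 8, 9, 9, 8, 6, 3, 1, 0]
series = [2 * t + pattern[t % 12] for t in range(36)]

seasonal_diff = difference(series, lag=12)  # removes the repeating pattern
both_diffs = difference(seasonal_diff, lag=1)  # removes the remaining constant level

print(seasonal_diff[:5])
print(both_diffs[:5])
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>After the seasonal difference only the trend contribution remains (a constant 24 here), and the regular difference reduces it to zero. Each difference shortens the series, which is one reason very short series are filtered out earlier.</p>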

In [None]:
from teradataml import ArimaEstimate,ArimaValidate, ArimaForecast, TDAnalyticResult
arima_estimate_op = ArimaEstimate(data1=data_series_df,
                                       nonseasonal_model_order=[0,1,2],
                                       seasonal_period=12,
                                       seasonal_model_order=[1,1,1], 
                                       constant=False,
                                       algorithm="MLE",
                                       coeff_stats=True,
                                       fit_metrics=True,
                                       residuals=True,
                                       fit_percentage=80)

In [None]:
results_estimate = arima_estimate_op.fitresiduals
results_estimate

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The ArimaEstimate() function estimates the coefficients of an ARIMA (AutoRegressive Integrated Moving Average) model and fits a series with an existing ARIMA model.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the above output: </p> 
    <li style = 'font-size:16px;font-family:Arial'>ROW_I :- Indexing column for the one dimensional multivariate output array containing the residuals. It is incremented by 1 for each row, starting from 1.</li>
    <li style = 'font-size:16px;font-family:Arial'>ACTUAL_VALUE :- The actual value of the response variable.</li>
    <li style = 'font-size:16px;font-family:Arial'>CALC_VALUE :- The calculated value of the response variable using the model.</li>
    <li style = 'font-size:16px;font-family:Arial'>RESIDUAL :- The difference between the calculated response value and the actual response value.</li>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>6.2 Validate using ArimaValidate</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The ArimaValidate() function performs an in-sample forecast for both seasonal and non-seasonal auto-regressive (AR), moving-average (MA), ARIMA models and Box-Jenkins seasonal ARIMA model formula followed by an analysis of the produced residuals. The aim is to provide a collection of metrics useful to select the model and expose the produced residuals such that multiple model validation and statistical tests can be conducted.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The TDAnalyticResult function retrieves auxiliary result sets stored in the output dataframe of the ArimaEstimate. Here we extract the residuals from the previous estimation step. Analytical Result Tables have multiple layers that store different data.</p>

In [None]:
data_art_df = tdml.TDAnalyticResult(data=arima_estimate_op.result)

In [None]:
arima_validate_op = ArimaValidate(data=data_art_df, fit_metrics=True, residuals=True)

In [None]:
arima_validate_op.result.sort('AIC')

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The ArimaValidate function produces a multilayer output and returns up to four result sets (layers).</p>
<li style = 'font-size:14px;font-family:Arial;color:#00233C'>Primary layer contains the model selection metrics.</li>
<li style = 'font-size:14px;font-family:Arial;color:#00233C'>Secondary layer contains the goodness-of-fit metrics.</li>
<li style = 'font-size:14px;font-family:Arial;color:#00233C'>Tertiary layer contains the residuals from the validation procedure.</li>
<li style = 'font-size:14px;font-family:Arial;color:#00233C'>Quaternary layer contains the model context, which can be used for forecasting with the model.</li>
<br>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the above output:</p>
    <li style = 'font-size:14px;font-family:Arial'>ROW_I :- index of the series.</li>
    <li style = 'font-size:14px;font-family:Arial'>NUM_SAMPLES :- Total number of sample points found in each of the original, calculated, and residual series.</li>
    <li style = 'font-size:14px;font-family:Arial'>VAR_COUNT :- Total number of parameters involved in the model. For an ARMA(p,q) model, the calculation of VAR_COUNT is p + q + 1.</li>
    <li style = 'font-size:14px;font-family:Arial;color:#00233C'>AIC :- The calculated Akaike Information Criterion value.</li>
    <li style = 'font-size:14px;font-family:Arial;color:#00233C'>SBIC :- The calculated Schwarz Bayesian Information Criterion value.</li>
    <li style = 'font-size:14px;font-family:Arial;color:#00233C'>HQIC :- The calculated Hannan-Quinn Information Criterion value.</li>
    <li style = 'font-size:14px;font-family:Arial;color:#00233C'>MLR :- The calculated Maximum Likelihood Rule value.</li>
    <li style = 'font-size:14px;font-family:Arial;color:#00233C'>MSE :- The calculated Mean Square Error value.</li>
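
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To show how these selection metrics relate to the residuals, the sketch below computes MSE and the textbook forms of AIC, SBIC and HQIC from a Gaussian log-likelihood in pure Python. The function name and residual values are made up, and the formulas are the standard definitions; the in-database values may differ by constant terms or conventions.</p>

```python
import math

def information_criteria(residuals, var_count):
    """Textbook selection metrics from residuals, assuming Gaussian errors.

    var_count: number of estimated parameters (VAR_COUNT in the output above).
    """
    n = len(residuals)
    mse = sum(r * r for r in residuals) / n
    # Gaussian log-likelihood evaluated at the maximum-likelihood variance (the MSE).
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(mse) + 1)
    return {
        "MSE": mse,
        "AIC": 2 * var_count - 2 * log_lik,
        "SBIC": var_count * math.log(n) - 2 * log_lik,
        "HQIC": 2 * var_count * math.log(math.log(n)) - 2 * log_lik,
    }

# Made-up residuals and parameter count for illustration only.
criteria = information_criteria([1.0, -1.0, 1.0, -1.0], var_count=3)
print(criteria)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>All three criteria reward a better fit and penalize additional parameters, with different penalty strengths. Lower values indicate the preferred model, which is why the result above is sorted by AIC.</p>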


In [None]:
results_validate = arima_validate_op.fitresiduals
results_validate

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We plot the actual vs calculated values for the 3 different Products(MODELIDs) we are analyzing.</p>

In [None]:
res1 = results_validate[results_validate.MODELID == MODELID_1].sort('ROW_I')#.to_pandas()
res2 = results_estimate[results_estimate.MODELID == MODELID_1].sort('ROW_I')#.to_pandas()
res3 = subset[subset.MODELID == MODELID_1][['MODELID','time_no_unit','DEMAND']].sort('time_no_unit')
res3 = res3.assign(drop_columns=True, MODELID = res3.MODELID, ROW_I = res3.time_no_unit, DEMAND = res3.DEMAND)
val1=res1.get("ROW_I").iloc[0].get_values()
val2 = res2.get("ROW_I").iloc[res2.shape[0]-1].get_values()
res1 = res1.assign(drop_columns=True, ROW_I = res1.ROW_I-val1[0][0]+val2[0][0]+1 ,CALC_VALUE = res1.CALC_VALUE)

In [None]:
res4 = results_validate[results_validate.MODELID == MODELID_2].sort('ROW_I')#.to_pandas()
res5 = results_estimate[results_estimate.MODELID == MODELID_2].sort('ROW_I')#.to_pandas()
res6 = subset[subset.MODELID == MODELID_2][['MODELID','time_no_unit','DEMAND']].sort('time_no_unit')
res6 = res6.assign(drop_columns=True, MODELID = res6.MODELID, ROW_I = res6.time_no_unit, DEMAND = res6.DEMAND)
val3=res4.get("ROW_I").iloc[0].get_values()
val4 = res5.get("ROW_I").iloc[res5.shape[0]-1].get_values()
res4 = res4.assign(drop_columns=True, ROW_I = res4.ROW_I-val3[0][0]+val4[0][0]+1 ,CALC_VALUE = res4.CALC_VALUE)

In [None]:
res7 = results_validate[results_validate.MODELID == MODELID_3].sort('ROW_I')#.to_pandas()
res8 = results_estimate[results_estimate.MODELID == MODELID_3].sort('ROW_I')#.to_pandas()
res9 = subset[subset.MODELID == MODELID_3][['MODELID','time_no_unit','DEMAND']].sort('time_no_unit')
res9 = res9.assign(drop_columns=True, MODELID = res9.MODELID, ROW_I = res9.time_no_unit, DEMAND = res9.DEMAND)
val5=res7.get("ROW_I").iloc[0].get_values()
val6 = res8.get("ROW_I").iloc[res8.shape[0]-1].get_values()
res7 = res7.assign(drop_columns=True, ROW_I = res7.ROW_I-val5[0][0]+val6[0][0]+1 ,CALC_VALUE = res7.CALC_VALUE)
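
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The three cells above all apply the same index shift, so that the validation values are drawn directly after the estimation values on a common time axis. The following is a minimal sketch of that arithmetic on plain Python lists. The function name and the sample indices are hypothetical and only mirror the expression <code>ROW_I - first_validate + last_estimate + 1</code> used above.</p>

```python
def realign_validation_index(estimate_rows, validate_rows):
    """Shift validation indices so they continue right after the last estimation index."""
    last_estimate = max(estimate_rows)    # plays the role of val2 / val4 / val6
    first_validate = min(validate_rows)   # plays the role of val1 / val3 / val5
    offset = last_estimate + 1 - first_validate
    return [row + offset for row in validate_rows]

# Hypothetical indices: 80 fitted points, and a validation index that restarts at 0
estimate_rows = list(range(0, 80))
validate_rows = list(range(0, 20))

shifted = realign_validation_index(estimate_rows, validate_rows)
print(shifted[0], shifted[-1])
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>After the shift, the first validation point sits one step after the last estimation point, which is what lets the two calculated series appear as one continuous line next to the actual demand.</p>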

In [None]:
# figure = Figure(width=800, height=400,  heading="Actual vs Predicted Demand")
fig, axes = subplots(nrows=3, ncols=1)
fig.height,fig.width = 800,1000
plot = res3.plot(
                x=res3.ROW_I,
                y=[res1.CALC_VALUE, res2.CALC_VALUE, res3.DEMAND],
                ax=axes[0],
                figure=fig,
                
                xlabel='Time',
                ylabel='Demand',
                grid_linestyle='--',
                grid_linewidth=0.5,
                marker=["o","s"]
)

plot = res6.plot(
                x=res6.ROW_I,
                y=[res4.CALC_VALUE, res5.CALC_VALUE, res6.DEMAND],
                ax=axes[1],
                figure=fig,
                xlabel='Time',
                ylabel='Demand',
                grid_linestyle='--',
                grid_linewidth=0.5,
                marker=["o","s"]
)

plot = res9.plot(
                x=res9.ROW_I,
                y=[res7.CALC_VALUE, res8.CALC_VALUE, res9.DEMAND],
                ax=axes[2],
                figure=fig,
                xlabel='Time',
                ylabel='Demand',
                grid_linestyle='--',
                grid_linewidth=0.5,
                marker=["o","s"]
)

plot.show()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The above graphs show the <b>Actual Demand Values (Green)</b> together with the calculated demand values from <b>ArimaEstimate (Orange)</b>, fitted on the training portion of the data (80%, as specified in ArimaEstimate), and from <b>ArimaValidate (Blue)</b>, computed on the test portion (the remaining 20%). The three graphs correspond to the three products (MODELIDs).</p>
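
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a rough illustration of the 80/20 split described above, the sketch below cuts a series at a fit percentage and scores a prediction on the held-out part with the mean squared error. The helper names, the sample data, the rounding rule for the cut point, and the naive last-value prediction are all assumptions made for this example. They are not what ArimaEstimate or ArimaValidate do internally.</p>

```python
def split_series(values, fit_percentage):
    """Split a series into a fitting part and a held-out part (cut point rounded down)."""
    cut = int(len(values) * fit_percentage / 100)
    return values[:cut], values[cut:]

def mean_squared_error(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical demand series with 10 observations
demand = [float(v) for v in range(1, 11)]
train, test = split_series(demand, 80)

# Placeholder prediction: repeat the last training value (a stand-in for a real model)
naive_prediction = [train[-1]] * len(test)
mse = mean_squared_error(test, naive_prediction)
print(len(train), len(test), mse)
```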

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>6.3 Forecast Demand using ArimaForecast</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The ArimaForecast() function is used to forecast a user-defined number of periods based on
    models fitted from the ArimaEstimate() function.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>ArimaForecast() takes the output of the estimation step (ArimaEstimate) or of the validation step (ArimaValidate) as input and forecasts the periods beyond the last observed period. Here we pass the ArimaValidate result and forecast 20 periods.</p>

In [None]:
data_art_df = TDAnalyticResult(data=arima_validate_op.result)

In [None]:
arima_forecast_op = ArimaForecast(data=data_art_df, forecast_periods=20)

In [None]:
results_forecast = arima_forecast_op.result
results_forecast

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This function outputs a result set that contains the forecasted values.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the above output:    
    <li style = 'font-size:14px;font-family:Arial;color:#00233C'>ROW_I :- index of the series.</li>
    <li style = 'font-size:14px;font-family:Arial;color:#00233C'>FORECAST_VALUE :- Forecasted values for the model.</li>
    <li style = 'font-size:14px;font-family:Arial;color:#00233C'>LO_80 :- Low end of the 80% prediction interval.</li>
    <li style = 'font-size:14px;font-family:Arial;color:#00233C'>HI_80 :- High end of the 80% prediction interval.</li>
    <li style = 'font-size:14px;font-family:Arial;color:#00233C'>LO_95 :- Low end of the 95% prediction interval.</li>
    <li style = 'font-size:14px;font-family:Arial;color:#00233C'>HI_95 :- High end of the 95% prediction interval.</li>
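
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To help read the LO/HI columns above, here is a minimal sketch of how prediction intervals at two confidence levels relate to a point forecast, assuming normally distributed forecast errors. The function name, the forecast value, and the standard error are hypothetical. How ArimaForecast derives its interval bounds is not shown in this notebook, so this is only a conceptual illustration.</p>

```python
from statistics import NormalDist

def prediction_interval(forecast, std_error, level):
    """Symmetric interval around a point forecast under a normal-error assumption."""
    # Two-sided quantile: e.g. level=0.80 uses the 90th percentile of the standard normal
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return forecast - z * std_error, forecast + z * std_error

# Hypothetical point forecast and forecast standard error
lo_80, hi_80 = prediction_interval(forecast=100.0, std_error=10.0, level=0.80)
lo_95, hi_95 = prediction_interval(forecast=100.0, std_error=10.0, level=0.95)
print(round(lo_80, 2), round(hi_80, 2), round(lo_95, 2), round(hi_95, 2))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The 95% interval always contains the 80% interval: a higher confidence level requires a wider band around the same forecast value.</p>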
    

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We can see that ARIMA estimation, validation, and forecasting are all done using Teradata functions on a teradataml DataFrame. Let's visualize this data to better understand the relationship between the actual demand and the forecasted demand values. Vantage's ClearScape Analytics integrates easily with third-party visualization tools such as Tableau and Power BI, and with Python modules such as plotly and seaborn. We can do all the calculations and pre-processing on Vantage and pass only the necessary information to the visualization tools. This keeps the computation in the database and reduces the time spent moving data between tools. We transfer data for this and the subsequent visualizations only where necessary.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We plot the actual versus forecast values for the three products (MODELIDs) we are analyzing.</p>

In [None]:
def plot_forecast(MODELID):
    """Plot actual demand, the forecast, and its prediction intervals for one MODELID."""
    fig, ax = plt.subplots(figsize=(10,4))
    # Pull the forecast and the actual demand for this product into pandas
    mean_forecast = results_forecast[results_forecast.MODELID==MODELID].sort('ROW_I').to_pandas()
    res3 = subset[subset.MODELID == MODELID][['MODELID','time_no_unit','DEMAND']].sort('time_no_unit').to_pandas()
    # Shift the actuals so that the last observation sits at 0 on the time axis
    res3['time_no_unit'] = res3['time_no_unit'] - res3.time_no_unit.values[-1]
    res3.plot(x='time_no_unit',y='DEMAND',label='actual',ax=ax)
    mean_forecast.plot(x='ROW_I',y='FORECAST_VALUE',label='forecast',color='red',ax=ax)
    # Shade the 80% (darker) and 95% (lighter) prediction intervals
    plt.fill_between(mean_forecast.ROW_I, mean_forecast.LO_80, mean_forecast.HI_80, color='pink', alpha=0.5)
    plt.fill_between(mean_forecast.ROW_I, mean_forecast.LO_95, mean_forecast.HI_95, color='pink', alpha=0.2)
    plt.title(MODELID)
    plt.show()
plot_forecast(MODELID_1)
plot_forecast(MODELID_2)
plot_forecast(MODELID_3)

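<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In each plot, the red line shows the forecasted values for the Product (MODELID) over the 20 periods we specified in ArimaForecast. The pink shaded bands show the prediction intervals: the darker band is the 80% interval and the lighter band is the 95% interval.</p>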

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Conclusion:</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have trained and validated the ARIMA model on the Weekly Sales dataset, and the calculated values closely match the actual data, as the Estimate and Validate graphs show. The goodness-of-fit metrics calculated in the estimate and validate phases are consistent with this, so we can say that the model is well trained to forecast the Weekly Sales.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Thus, with Teradata VantageCloud, we are able to build powerful end-to-end forecasting pipelines. Tools for each forecasting phase, from data preparation and exploration to model validation and scoring, empower you to forecast more efficiently and at scale with shorter development and testing times, and later to deploy forecasting functions into automated pipelines that run in near real-time.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>7. Cleanup</b></p>


<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Databases and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will use the following code to clean up tables and databases created for this demonstration.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_DemandForecast');" 
#Takes 45 seconds

In [None]:
remove_context()

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Required Materials</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let’s look at the elements we have available for reference for this notebook:</p>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Filters: </b>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Industry:</b> Retail </li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Functionality:</b> ARIMA Estimation and Forecasting </li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Use Case:</b> Retail Demand Forecast </li> </p>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Related Resources: </b>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><a href = 'https://www.teradata.com/Blogs/NPS-is-a-metric-not-the-goal'>In the fight to improve customer experience, NPS is a metric, not the goal</a></li> 
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><a href = 'https://www.teradata.com/Blogs/Hyper-scale-time-series-forecasting-done-right'>Hyper-scale time series forecasting done right </a></li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><a href = 'https://www.teradata.com/Blogs/Crystal-Ball-or-Black-Box-in-Retail-and-CPG'>Crystal Ball, Black Box or Advanced Forecasting and Demand Planning in Retail and CPG</a></li>
</p>

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023,2024 All Rights Reserved
        </div>
    </div>
</footer>