<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Sales Forecasting using Teradata AUTOARIMA
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style='font-size:20px;font-family:Arial;color:#00233C'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Predicting future sales is crucial for any company because it supports informed business decisions. Sales are affected by many factors, including seasonality, promotional sales events, and macro-economic conditions, which can push them significantly above or below average over the course of the year. If sales are not predicted accurately, future revenue can suffer.</p>

<p style='font-size:18px;font-family:Arial;color:#00233C'><b>Solution</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We forecast future sales by developing a time-series modelling pipeline on sales data. The demo shows the power of Vantage through its in-database time-series analytics, which provide a comprehensive suite of functions commonly used by data scientists in forecasting pipelines, including but not limited to the following standard activities:</p>

<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Data preparation: Resampling, convert irregular to regular</li>
    <li>Data exploration: Detect stationarity and periodicity</li>
    <li>Eliminate Non-stationarity: Seasonal normalizing</li>
    <li>Formulate models: AUTO ARIMA</li>
    <li>Model Forecasting: Use the results from AUTO ARIMA to do Forecasting</li>
</ul>


<p style='font-size:18px;font-family:Arial;color:#00233C'><b>Sales Forecasting Demo Data</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Features</b>: Other exogenous features related to store and environment for time-series analysis</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Train</b>: Weekly sales input data for time-series analysis</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Test</b>: Weekly sales test data for time-series model testing</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Store</b>: Anonymized information about the 45 stores, indicating the type and size of the store</p>

      
![Dataset%20Description-3.PNG](attachment:Dataset%20Description-3.PNG)
    


<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>1. Connect to Vantage</b>


In [None]:
%%capture
!pip install teradataml --upgrade

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: </b><i>The above statements may not need to be executed if you are on teradataml 20.0.0.2 or later. Otherwise, execute the pip install step and restart the kernel after the cell finishes so that the installed libraries are loaded into memory. The simplest way to restart the kernel is by typing zero zero: <b> 0 0</b></i></p>
</div>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
# Standard Python libraries
import csv
import getpass
import io

# Third-party libraries
import pandas as pd
import sqlalchemy
from sqlalchemy import event
from collections import OrderedDict
from PIL import Image

# Teradata related imports
from teradataml import (create_context, 
                        remove_context, 
                        execute_sql, 
                        copy_to_sql, 
                        configure, 
                        display,      # needed for display.max_rows below
                        DataFrame, 
                        in_schema)

from teradatasqlalchemy.types import *

from teradataml import to_numeric

# Modify the following to match the specific client environment settings
display.max_rows = 5
configure.val_install_location = 'val'

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=PP_Sales_Forecasting_AutoArima_Python.ipynb;' UPDATE FOR SESSION;''')

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Begin running steps with Shift + Enter keys. </p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>2. Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but consumes available storage. The following cell contains two statements, one of which is commented out. To switch modes, move the comment marker to the other statement.</p>

In [None]:
# %run -i ../run_procedure.py "call get_data('DEMO_SalesForecastingUAF_cloud');"        # Takes 1 minute
%run -i ../run_procedure.py "call get_data('DEMO_SalesForecastingUAF_local');"        # Takes 2 minutes

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Next is an optional step – if you want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"        # Takes 10 seconds

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>3. Data Preparation </b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
    Let us start by creating a "Virtual DataFrame" that points directly to the dataset in Vantage for Sales, Features, and Stores tables.</p>

In [None]:
sales_data = DataFrame(in_schema('DEMO_SalesForecastingUAF', 'Weekly_Sales'))
feature_data = DataFrame(in_schema('DEMO_SalesForecastingUAF', 'Features')).drop(['IsHoliday'], axis=1)
store_data = DataFrame(in_schema('DEMO_SalesForecastingUAF', 'Stores'))

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will join the datasets to create the Analytic Data Set, using teradataml for basic DataFrame manipulations.</p>

In [None]:
# Join store_data with sales_data
sales_data = (
    sales_data.join(store_data, on='Store', how='left', lprefix='t1', rprefix='t2')
    .drop(['t2_Store'], axis=1)
)
sales_data = sales_data.assign(Store=sales_data['t1_Store'])
sales_data = sales_data.drop(['t1_Store'], axis=1)

# Join feature_data with sales_data
sales_data = (
    sales_data.join(feature_data, on=['Store', 'Date'], how='left', lprefix='t1', rprefix='t3')
    .drop(['t3_Store'], axis=1)
    .drop(['t3_Date'], axis=1)
)
sales_data = sales_data.assign(Store=sales_data['t1_Store'])
sales_data = sales_data.assign(Date=sales_data['t1_Date'])
sales_data = sales_data.drop(['t1_Store'], axis=1)
sales_data = sales_data.drop(['t1_Date'], axis=1)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will create a time series identifier for partitioning the data. Concatenating the department ID and the store ID gives a unique ID for every time series, which we store in a new column used for partitioning.</p>

In [None]:
# Assign a new column 'idcols' based on string concatenation
sales_data = sales_data.assign(idcols=sales_data.Dept.str.strip() + '-' + sales_data.Store.str.strip())
sales_data = sales_data.assign(idcols=sales_data.idcols.cast(type_=VARCHAR(10)))

# Check the shape of the DataFrame
sales_data.shape

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The dataset we created contains more than 421k rows and 19 columns. This final dataset will be copied to the Vantage database.</p>

In [None]:
copy_to_sql(df = sales_data, table_name = "az_sf_joined", if_exists = "replace")

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b><i>*Please scroll down to the end of the notebook for detailed column descriptions of the dataset.</i></b></p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. Checking for Stationarity of Time Series using the Dickey Fuller Test</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A time series must be stationary before it can be modelled. ARIMA models handle non-stationary series by differencing: the "d" parameter in ARIMA sets the number of differences needed to make a series stationary.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here we will check for stationarity of all time series using the Dickey-Fuller Test. For more info on the test,  see <a href="https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-VantageTM-Unbounded-Array-Framework-Time-Series-Reference-17.20/Diagnostic-Statistical-Test-Functions/TD_DICKEY_FULLER/TD_DICKEY_FULLER-Example">here.</a></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The null hypothesis of the test is that the data is non-stationary, and we want to REJECT it. We therefore look for a p-value below 0.05 and a negative coefficient on the lag term in the regression model.</p> 
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The DickeyFuller function needs series data, so we use the TDSeries function to create a series and then apply DickeyFuller to check the stationarity of the data.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We first use the OutlierFilterFit and OutlierFilterTransform functions to remove outliers from the series, then resample the filtered data, and finally check its stationarity with the DickeyFuller function.</p>
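<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the role of differencing concrete, the following is a minimal local sketch in plain Python. It does not call Vantage; the helper name <code>difference</code> and the toy series are assumptions made for illustration only. It shows that one difference flattens a linear trend, while a quadratic trend needs two.</p>

```python
def difference(series, d=1):
    """Apply first-order differencing d times: y'[t] = y[t] - y[t-1]."""
    result = list(series)
    for _ in range(d):
        result = [curr - prev for prev, curr in zip(result[:-1], result[1:])]
    return result

# Hypothetical toy series, not taken from the demo data.
linear_trend = [3, 5, 7, 9, 11]        # grows by a constant amount
quadratic_trend = [1, 4, 9, 16, 25]    # grows by an increasing amount

once = difference(linear_trend, d=1)       # constant after one difference
twice = difference(quadratic_trend, d=2)   # constant only after two differences
print(once, twice)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Each round of differencing shortens the series by one observation, which is one reason to keep "d" as small as possible.</p>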


In [None]:
sales_df=DataFrame('az_sf_joined')
sales_df.shape

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>4.1 Remove outliers using OutlierFilter Fit and Transform</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The OutlierFilterFit() function calculates the lower percentile, upper percentile, row count, and median for every column named in "target_columns". These per-column metrics help the OutlierFilterTransform() function detect outliers in the input table. The function also stores its argument values in a FIT table that is used during transformation. The lower_percentile and upper_percentile arguments specify the lower and upper percentile bounds used to decide whether a value is an outlier.</p>


In [None]:
from teradataml import OutlierFilterFit, OutlierFilterTransform

OutlierFilterFit_out = OutlierFilterFit(
    data=sales_df,
    target_columns="Weekly_Sales",
)

out_df = OutlierFilterFit_out.output_data
out_df

<p></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'> OutlierFilterFit creates a fit table whose values are applied to the data to produce the transformed data.</p>
<p></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'> OutlierFilterTransform() function filters the outliers from the input teradataml DataFrame.</p> 
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>OutlierFilterTransform() uses the result DataFrame from OutlierFilterFit() function to get statistics like median, count of rows, lower percentile and upper percentile for every column specified in target columns argument and filters the outliers in the input data. </p>
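<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The fit and transform split can be illustrated locally with a small plain-Python sketch. This is not the Vantage implementation: the function names, the nearest-rank percentile rule, the 10th/90th percentile bounds, and the toy values are all assumptions chosen for illustration. Rows outside the fitted bounds are simply dropped.</p>

```python
import statistics

def percentile_nearest_rank(sorted_values, pct):
    """Nearest-rank percentile; pct is an integer between 1 and 100."""
    n = len(sorted_values)
    rank = max(1, -(-pct * n // 100))   # integer ceiling of pct * n / 100
    return sorted_values[rank - 1]

def outlier_fit(values, lower_pct=10, upper_pct=90):
    """Collect the statistics that the transform step needs (the 'fit table')."""
    ordered = sorted(values)
    return {
        "lower": percentile_nearest_rank(ordered, lower_pct),
        "upper": percentile_nearest_rank(ordered, upper_pct),
        "median": statistics.median(ordered),
        "count": len(ordered),
    }

def outlier_transform(values, fit):
    """Keep only the values inside the fitted percentile bounds."""
    return [v for v in values if fit["lower"] <= v <= fit["upper"]]

# Hypothetical weekly sales with two obvious outliers (500 and -300).
weekly_sales = [12, 11, 13, 10, 12, 500, 11, 12, 13, 10,
                11, 12, -300, 13, 12, 11, 10, 12, 13, 11]

fit = outlier_fit(weekly_sales)
filtered = outlier_transform(weekly_sales, fit)
print(fit)
print(len(weekly_sales), "->", len(filtered), "rows")
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Keeping the statistics in a separate fit result means the same bounds can be reapplied to new data without recomputing them.</p>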

In [None]:
obj = OutlierFilterTransform(
    data=sales_df,
    object=OutlierFilterFit_out.result
)

out_transform_df = obj.result
out_transform_df

<p></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'> The OutlierFilterTransform transforms the data and creates the output data after applying the Fit Table details on the data.</p>
<p></p>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>4.2 Convert into a regular timeseries using Resample</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The Resample() function transforms an irregular time series into a regular time series. It can also be used to alter the sampling interval of a time series. The Resample function requires a series as input, which we create with the TDSeries function.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A TDSeries object is created from a teradataml DataFrame and represents a SERIES, the basic input to the Unbounded Array Framework (UAF) time series functions. A series is a one-dimensional array that is identified by its series ID (the "id" argument) and indexed by the "row_index" argument. Series are passed to and returned from UAF functions as wavelets, which are collections of rows grouped by one or more fields and ordered on the "row_index" argument.</p>
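<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a rough local illustration of these two ideas, the sketch below groups rows by a series ID, orders them by date, and fills a regular weekly grid by linear interpolation. It is plain Python rather than the UAF implementation; the function name <code>resample_linear</code> and the sample rows are hypothetical.</p>

```python
from collections import defaultdict
from datetime import date, timedelta

def resample_linear(rows, start, step_days):
    """rows: (series_id, date, value) tuples. Returns {series_id: [(date, value), ...]}."""
    series = defaultdict(list)
    for series_id, day, value in rows:
        series[series_id].append((day, value))

    resampled = {}
    for series_id, points in series.items():
        points.sort()                        # order each series by its row index
        end = points[-1][0]
        output, tick = [], start
        while tick <= end:
            before = max((p for p in points if p[0] <= tick), default=None)
            after = min((p for p in points if p[0] >= tick), default=None)
            if before is None or after is None:
                value = None                 # tick lies outside the observed range
            elif before[0] == after[0]:
                value = before[1]            # an observation exists at this tick
            else:
                share = (tick - before[0]).days / (after[0] - before[0]).days
                value = before[1] + share * (after[1] - before[1])
            output.append((tick, value))
            tick += timedelta(days=step_days)
        resampled[series_id] = output
    return resampled

# Hypothetical series with the week of 2010-02-19 missing.
rows = [
    ("10-10", date(2010, 2, 26), 130.0),
    ("10-10", date(2010, 2, 5), 100.0),
    ("10-10", date(2010, 2, 12), 110.0),
]
weekly = resample_linear(rows, start=date(2010, 2, 5), step_days=7)
values = [value for _, value in weekly["10-10"]]
print(values)
```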

In [None]:
from teradataml import TDSeries, Resample

data_series_df = TDSeries(
    data=obj.result,
    id="idcols",
    row_index=("times"),
    row_index_style="TIMECODE",
    payload_field="Weekly_Sales",
    payload_content="REAL"
)

In [None]:
uaf_out1 = Resample(data=data_series_df,
                    interpolate='LINEAR',
                    timecode_start_value="TIMESTAMP '2010-02-05 00:00:00'",
                    timecode_duration="WEEKS(1)")

In [None]:
df=uaf_out1.result
df1=df.select(['idcols','ROW_I', 'Weekly_Sales']).assign(Sales_Date=df.ROW_I)
df1

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have 3331 unique Dept-Store combinations, which result in 3331 time series. We process the entire dataset, but to keep the visualizations readable we plot only three series: stores 10, 11, and 12 in department 10.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To visualize a different combination, change the values in the cell below; all subsequent steps use the same set of Dept-Store IDs for visualization.</p>

In [None]:
Dept_Store1='10-10'
Dept_Store2='10-11'
Dept_Store3='10-12'

In [None]:
df = df1.select(['idcols','Sales_Date','Weekly_Sales'])
res1 = df[df.idcols == Dept_Store1]
res2 = df[df.idcols == Dept_Store2]
res3 = df[df.idcols == Dept_Store3]

In [None]:
# Plot series data
from teradataml import plot, subplots
fig, axes = subplots(nrows=3, ncols=1)
fig.height,fig.width = 800,1000
plot = res1.plot(
                x=res1.Sales_Date,
                y=res1.Weekly_Sales,
                ax=axes[0],
                figure=fig,
                color='blue',
                xlabel='Sales Date',
                ylabel='Weekly Sales',
                legend=['Dept-10,Store-10'],
                grid_linestyle='--',
                grid_linewidth=0.5
)

plot = res2.plot(
                x=res2.Sales_Date,
                y=res2.Weekly_Sales,
                ax=axes[1],
                figure=fig,
                color='dark orange',
                xlabel='Sales Date',
                ylabel='Weekly Sales',
                legend=['Dept-10,Store-11'],
                grid_linestyle='--',
                grid_linewidth=0.5
)

plot = res3.plot(
                x=res3.Sales_Date,
                y=res3.Weekly_Sales,
                ax=axes[2],
                figure=fig,
                color='green',
                xlabel='Sales Date',
                ylabel='Weekly Sales',
                legend=['Dept-10,Store-12'],
                grid_linestyle='--',
                grid_linewidth=0.5
)
plot.show()

<p></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>From the above visualizations we can conclude that the data is seasonal. We will now check whether the series is stationary.</p>


<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>4.3 Use DickeyFuller to check Stationarity</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The DickeyFuller() function tests for the presence of one or more unit roots in a series to determine if the series is non-stationary. When a series contains unit roots, it is non-stationary. When a series contains no unit roots, whether the series is stationary is based on other factors.</p>
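<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For intuition, the sketch below computes the simplest form of the Dickey-Fuller statistic locally in plain Python: it regresses the first difference on the lagged value, with no constant and no trend, and reports the t-statistic of the lag coefficient. This is an illustrative assumption about the regression form, not the Vantage implementation, and it does not compute a p-value. The simulated series and the function name are hypothetical.</p>

```python
import math
import random

def dickey_fuller_stat(y):
    """t-statistic of rho in the regression  dy[t] = rho * y[t-1] + e[t]."""
    lagged = y[:-1]
    diffs = [curr - prev for prev, curr in zip(y[:-1], y[1:])]
    sxx = sum(x * x for x in lagged)
    rho = sum(x * d for x, d in zip(lagged, diffs)) / sxx
    residuals = [d - rho * x for x, d in zip(lagged, diffs)]
    sigma2 = sum(r * r for r in residuals) / (len(diffs) - 1)
    return rho / math.sqrt(sigma2 / sxx)

# Simulated data: white noise is stationary, its running sum is a random walk.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(300)]
walk, level = [], 0.0
for shock in noise:
    level += shock
    walk.append(level)

stat_noise = dickey_fuller_stat(noise)
stat_walk = dickey_fuller_stat(walk)
print(f"white noise: {stat_noise:.2f}, random walk: {stat_walk:.2f}")
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A strongly negative statistic is evidence against a unit root, so the white-noise series should score far below the random walk.</p>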


In [None]:
from teradataml import DickeyFuller
data_series_df_1 = TDSeries(data=df1,
                            id="idcols",
                            row_index=("Sales_Date"),
                            row_index_style= "TIMECODE",
                            payload_field="Weekly_Sales",
                            payload_content="REAL")

In [None]:
df_out = DickeyFuller(data=data_series_df_1,
                      algorithm='NONE')

# Print the result DataFrame.
df_out.result

<p></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the above output, the p-value corresponding to the calculated test statistic is more than 0.05, which means that the series is not stationary. The output column NULL_HYP (null hypothesis) can take one of two values:</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><b>ACCEPT</b> means the null hypothesis is accepted. Unit roots are present, and the process is non-stationary.</li>
    <li><b>REJECT</b> means the null hypothesis is rejected. Unit roots are not present, and the process may or may not be stationary, depending on other factors.</li>
</ul>


<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>4.4 Make the series stationary using SeasonalNormalize</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The function SeasonalNormalize() takes a non-stationary series and normalizes the series by removing the unit roots. The function can be used with any cyclic data that can be subdivided into a collection of logical periods, in which each period can be further subdivided into a collection of logical intervals.</p>
 
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following procedure is an example of how to use SeasonalNormalize():</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Detect the unit roots using DickeyFuller().</li>
    <li>Use DIFF() or SeasonalNormalize() to eliminate unit roots.</li>
    <li>Use Unnormalize() to undo the effects of SeasonalNormalize(), and compare it to the original series.</li>
</ul>
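<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The normalize and unnormalize round trip in this procedure can be sketched locally as follows. The sketch assumes that each value is standardized against the mean and standard deviation of the values at the same position in the cycle, and that the population standard deviation is used; both are assumptions for illustration, and the function names and data are hypothetical rather than the Vantage implementation.</p>

```python
import statistics

def seasonal_normalize(values, cycle_length):
    """Z-score each value against the values at the same position in other cycles."""
    stats = []
    for position in range(cycle_length):
        same_position = values[position::cycle_length]
        mean = statistics.fmean(same_position)
        sd = statistics.pstdev(same_position) or 1.0   # guard against a zero spread
        stats.append((mean, sd))
    normalized = [
        (value - stats[i % cycle_length][0]) / stats[i % cycle_length][1]
        for i, value in enumerate(values)
    ]
    return normalized, stats

def unnormalize(normalized, stats):
    """Invert seasonal_normalize using the stored (mean, sd) pairs."""
    cycle_length = len(stats)
    return [
        z * stats[i % cycle_length][1] + stats[i % cycle_length][0]
        for i, z in enumerate(normalized)
    ]

# Hypothetical series: three cycles of four intervals each.
original = [10, 20, 30, 40, 12, 22, 33, 41, 11, 24, 30, 45]
normalized, stats = seasonal_normalize(original, cycle_length=4)
restored = unnormalize(normalized, stats)
print([round(z, 2) for z in normalized])
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The stored mean and standard deviation pairs play the role of the metadata that has to be kept so that forecasts can later be mapped back to the original scale.</p>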

In [None]:
data_series_df_norm = TDSeries(data=df1,
                              id="idcols",
                              row_index="Sales_Date",
                              row_index_style="TIMECODE",
                              payload_field="Weekly_Sales",
                              payload_content="REAL",
                              interval="WEEKS(1)")
 

In [None]:
# Normalize the series by removing the unit roots.
from teradataml import SeasonalNormalize
uaf_out = SeasonalNormalize(data=data_series_df_norm,
                                season_cycle="WEEKS",
                                cycle_duration=1,
                                output_fmt_index_style = 'FLOW_THROUGH')
uaf_out.result

In [None]:
df=uaf_out.result
res1 = df[df.idcols == Dept_Store1]
res2 = df[df.idcols == Dept_Store2]
res3 = df[df.idcols == Dept_Store3]

In [None]:
fig, axes = subplots(nrows=3, ncols=1)
fig.height,fig.width = 800,1000
plot = res1.plot(
                x=res1.ROW_I,
                y=res1.Weekly_Sales,
                ax=axes[0],
                figure=fig,
                color='blue',
                xlabel='Sales Date',
                ylabel='Weekly Sales',
                legend=['Dept-10,Store-10'],
                grid_linestyle='--',
                grid_linewidth=0.5
)

plot = res2.plot(
                x=res2.ROW_I,
                y=res2.Weekly_Sales,
                ax=axes[1],
                figure=fig,
                color='dark orange',
                xlabel='Sales Date',
                ylabel='Weekly Sales',
                legend=['Dept-10,Store-11'],
                grid_linestyle='--',
                grid_linewidth=0.5
)

plot = res3.plot(
                x=res3.ROW_I,
                y=res3.Weekly_Sales,
                ax=axes[2],
                figure=fig,
                color='green',
                xlabel='Sales Date',
                ylabel='Weekly Sales',
                legend=['Dept-10,Store-12'],
                grid_linestyle='--',
                grid_linewidth=0.5
)
plot.show()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We now run DickeyFuller again to confirm that the series is stationary.</p>

In [None]:
data_series_df_2 = TDSeries(data=uaf_out.result,
                            id="idcols",
                            row_index=("ROW_I"),
                            row_index_style= "TIMECODE",
                            payload_field="Weekly_Sales",
                            payload_content="REAL")

In [None]:
df_out_norm = DickeyFuller(data=data_series_df_2,
                           algorithm='NONE')

# Print the result DataFrame.
df_out_norm.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As seen above, the P_VALUE is less than 0.05 for all the series and NULL_HYP is REJECT, so we consider the series to be stationary now.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>5. Use AUTOARIMA to Forecast</b></p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.1 Apply AUTOARIMA</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>AUTOARIMA fits the best ARIMA model to a univariate time series. The function searches the possible models within the order constraints given in the function parameters, and returns the best ARIMA model based on the criterion provided by the INFOR_CRITERIA parameter.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>AUTOARIMA is based on the best-information criterion selection process. The returned model is the best model and does not need more validation like ARIMAVALIDATE to compare the AIC and BIC with different model candidates.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>There are two ways to forecast from the result. Either run ARIMAFORECAST directly against the result generated by AUTOARIMA, or extract the ICANDORDER layer to get the model order, run ARIMAESTIMATE using that order, and then run ARIMAFORECAST on the result from ARIMAESTIMATE. This demo uses the first option.</p>


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Expect long run times when the input data is processed without any order constraints, especially for a long seasonal period or a large time series. Because we are creating ARIMA models for 3331 partitions, this step takes longer.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: Because the demo system is very small, the AutoArima step takes approximately 10-12 minutes.</b></p>
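<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The idea of searching candidate orders and keeping the one with the best information criterion can be sketched locally. The example below is a deliberately reduced stand-in, not the AUTOARIMA algorithm: it only considers pure autoregressive models AR(p), fits them by least squares, and ranks them by AIC. All function names and the simulated series are hypothetical.</p>

```python
import math
import random

def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ar_rss(y, p, max_p):
    """Residual sum of squares of a least-squares AR(p) fit on a common sample."""
    target = y[max_p:]                       # same targets for every p
    if p == 0:
        return sum(v * v for v in target)
    rows = [[y[t - k] for k in range(1, p + 1)] for t in range(max_p, len(y))]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * v for r, v in zip(rows, target)) for i in range(p)]
    coef = solve(xtx, xty)
    return sum((v - sum(c * x for c, x in zip(coef, r))) ** 2
               for r, v in zip(rows, target))

def select_order(y, max_p):
    """Return the AR order with the lowest AIC, plus all AIC scores."""
    n = len(y) - max_p
    scores = {p: n * math.log(ar_rss(y, p, max_p) / n) + 2 * (p + 1)
              for p in range(max_p + 1)}
    return min(scores, key=scores.get), scores

# Simulated AR(1) series with coefficient 0.8 (hypothetical data).
rng = random.Random(1)
series, value = [], 0.0
for _ in range(400):
    value = 0.8 * value + rng.gauss(0.0, 1.0)
    series.append(value)

best_p, scores = select_order(series, max_p=3)
print("selected order:", best_p)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The search cost grows with the number of candidate orders and is repeated for every series, which is why order constraints matter when thousands of partitions are fitted.</p>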

In [None]:
data_series_df_arima = TDSeries(data=uaf_out.result,
                                id="idcols",
                                row_index="ROW_I",
                                row_index_style="TIMECODE",
                                payload_field="Weekly_Sales",
                                payload_content="REAL")         

In [None]:
import time
from teradataml import AutoArima
starttime = time.time()
arima_out = AutoArima(data=data_series_df_arima,
                      max_pq_seasonal=[3, 3],
                      stationary=True,
                      stepwise=False,
                      arma_roots=True,
                      residuals=True,
                      output_fmt_index_style="FLOW_THROUGH")
endtime = time.time()
print('Time taken: %.2f mins' % round((endtime-starttime)/60,2)) 


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The AUTOARIMA function can create a multilayered output table. The function generates up to six analytical result sets:</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Primary result set containing the selected best model’s coefficients.</li>
    <li>Secondary result set containing “goodness of fit” metrics.</li>
    <li>Tertiary result set containing residuals from the fitting exercise.</li>
    <li>Quaternary result set containing the best model context, which is used during the forecasting process.</li>
    <li>Quinary result set containing the information criteria such as AIC and SBIC, and the order of the best model.</li>
    <li>Senary result set containing the roots information.</li>
</ul>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will check the residuals, which compare the ACTUAL_VALUE and CALC_VALUE of each time series, and the information criteria, which contain the AIC, the SBIC, and the order of the best model.</p>
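<p style = 'font-size:16px;font-family:Arial;color:#00233C'>One simple way to summarize such a comparison is a root mean squared error over the residuals. The sketch below works on a few hypothetical rows in plain Python and assumes that a residual is the actual value minus the calculated value; only the column names ACTUAL_VALUE and CALC_VALUE come from this notebook.</p>

```python
import math

# Hypothetical rows shaped like the residual output (values are made up).
rows = [
    {"ACTUAL_VALUE": 1.0, "CALC_VALUE": 1.0},
    {"ACTUAL_VALUE": 2.0, "CALC_VALUE": 2.0},
    {"ACTUAL_VALUE": 3.0, "CALC_VALUE": 5.0},
]

# Residual assumed to be actual minus calculated.
residuals = [row["ACTUAL_VALUE"] - row["CALC_VALUE"] for row in rows]

# Root mean squared error: one number describing the typical fitting error.
rmse = math.sqrt(sum(e * e for e in residuals) / len(residuals))
print(residuals, round(rmse, 4))
```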

In [None]:
df_val = arima_out.fitresiduals
df_val = df_val[df_val.idcols.isin([Dept_Store1,Dept_Store2,Dept_Store3])]
df_val

In [None]:
df_pdq = arima_out.icandorder
df_pdq = df_pdq[df_pdq.idcols.isin([Dept_Store1,Dept_Store2,Dept_Store3])]
df_pdq

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will visualize the comparison of the actual and the calculated values for the three series we selected.</p>

In [None]:
df_val_plot = df_val.merge(right=df_pdq, how='inner', on='idcols', lsuffix='arval', rsuffix='pdq')
df_val_plot=df_val_plot.assign(drop_columns=True,
                              idcols=df_val_plot.idcols_arval,
                              Sales_Date=df_val_plot.ROW_I_arval,
                              ACTUAL_VALUE=df_val_plot.ACTUAL_VALUE,
                              CALC_VALUE=df_val_plot.CALC_VALUE,
                              MODEL_ORDER=df_val_plot.MODEL_ORDER)

In [None]:
res1 = df_val_plot[df_val_plot.idcols == Dept_Store1]
res2 = df_val_plot[df_val_plot.idcols == Dept_Store2]
res3 = df_val_plot[df_val_plot.idcols == Dept_Store3]

In [None]:
fig, axes = subplots(nrows=3, ncols=1)
fig.height,fig.width = 800,1000
plot = res1.plot(
                x=res1.Sales_Date,
                y=[res1.ACTUAL_VALUE,res1.CALC_VALUE],
                ax=axes[0],
                figure=fig,
                color=['blue','cyan'],
                xlabel='Sales Date',
                ylabel='Weekly Sales',
                legend=['ACTUAL VALUE','CALC VALUE'],
                title=f"Actual vs Calculated value for Dept-Store {str(res1.select('idcols').get_values()[0][0])} using {str(res1.select('MODEL_ORDER').get_values()[0][0])}",
                grid_linestyle='--',
                grid_linewidth=0.5
)

plot = res2.plot(
                x=res2.Sales_Date,
                y=[res2.ACTUAL_VALUE,res2.CALC_VALUE],
                ax=axes[1], 
                figure=fig,
                color=['orange','purple'],
                xlabel='Sales Date',
                ylabel='Weekly Sales',
                legend=['ACTUAL VALUE','CALC VALUE'],
                title=f"Actual vs Calculated value for Dept-Store {str(res2.select('idcols').get_values()[0][0])} using {str(res2.select('MODEL_ORDER').get_values()[0][0])}",
                grid_linestyle='--',
                grid_linewidth=0.5
)

plot = res3.plot(
                x=res3.Sales_Date,
                y=[res3.ACTUAL_VALUE,res3.CALC_VALUE],
                ax=axes[2],
                figure=fig,
                color=['green','olive'],
                xlabel='Sales Date',
                ylabel='Weekly Sales',
                legend=['ACTUAL VALUE','CALC VALUE'],
                title=f"Actual vs Calculated value for Dept-Store {str(res3.select('idcols').get_values()[0][0])} using {str(res3.select('MODEL_ORDER').get_values()[0][0])}",
                grid_linestyle='--',
                grid_linewidth=0.5
)
plot.show()

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.2 ARIMA Forecast</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The ArimaForecast() function forecasts a user-defined number of periods based on models fitted by the ArimaEstimate() function or, as in this demo, selected by AutoArima().</p>
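<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To show what a multi-period forecast with prediction intervals involves, here is a minimal local sketch for the simplest case, a zero-mean AR(1) model with normally distributed errors. The model, its parameter values, and the function name are assumptions for illustration; the output keys mirror the FORECAST_VALUE, LO_80, HI_80, LO_95, and HI_95 columns used later in this notebook.</p>

```python
import math
from statistics import NormalDist

def ar1_forecast(last_value, phi, sigma, periods):
    """Point forecasts and 80%/95% intervals for y[t] = phi * y[t-1] + e[t]."""
    rows = []
    point, variance = last_value, 0.0
    for step in range(1, periods + 1):
        point = phi * point                               # forecast decays toward zero
        variance += sigma ** 2 * phi ** (2 * (step - 1))  # error variance accumulates
        sd = math.sqrt(variance)
        row = {"step": step, "FORECAST_VALUE": point}
        for level in (80, 95):
            z = NormalDist().inv_cdf(0.5 + level / 200)   # two-sided normal quantile
            row[f"LO_{level}"] = point - z * sd
            row[f"HI_{level}"] = point + z * sd
        rows.append(row)
    return rows

# Hypothetical model parameters and last observed (normalized) value.
forecast = ar1_forecast(last_value=4.0, phi=0.5, sigma=1.0, periods=3)
for row in forecast:
    print({key: round(val, 3) for key, val in row.items()})
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The intervals widen with the horizon because the uncertainty of every intermediate step carries forward into later steps.</p>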

In [None]:
from teradataml import TDAnalyticResult, ArimaForecast
data_art_df = TDAnalyticResult(data=arima_out.result)

arima_forecast_out=ArimaForecast(data=data_art_df,
                                    forecast_periods=10,
                                    output_fmt_index_style="FLOW_THROUGH")

df_forecast=arima_forecast_out.result
df_forecast

In [None]:
res1 = df_forecast[df_forecast.idcols == Dept_Store1]
res2 = df_forecast[df_forecast.idcols == Dept_Store2]
res3 = df_forecast[df_forecast.idcols == Dept_Store3]

In [None]:
fig, axes = subplots(nrows=3, ncols=1)
fig.height,fig.width = 800,1000
plot = res1.plot(
                x=res1.ROW_I,
                y=res1.FORECAST_VALUE,
                ax=axes[0],
                figure=fig,
                color=['blue'],
                xlabel='Sales Date',
                xtick_format='YYYY-MM-DD',
                ylabel='Forecast Sales',
                grid_linestyle='--',
                grid_linewidth=0.5
)

plot = res2.plot(
                x=res2.ROW_I,
                y=res2.FORECAST_VALUE,
                ax=axes[1], 
                figure=fig,
                color=['orange'],
                xlabel='Sales Date',
                xtick_format='YYYY-MM-DD',
                ylabel='Forecast Sales',
                grid_linestyle='--',
                grid_linewidth=0.5
)

plot = res3.plot(
                x=res3.ROW_I,
                y=res3.FORECAST_VALUE,
                ax=axes[2],
                figure=fig,
                color=['green'],
                xlabel='Sales Date',
                xtick_format='YYYY-MM-DD',
                ylabel='Forecast Sales',
                grid_linestyle='--',
                grid_linewidth=0.5
)
plot.show()

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>6. Unnormalize the Forecast Data</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The Unnormalize() function reconstructs a series created by SeasonalNormalize(). The function is usually used for the forecasting phase of modeling.</p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>6.1 Unnormalize Forecast output</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will unnormalize the forecast data using the ArimaForecast output and the metadata of the normalized output. Because we want to unnormalize every column of the forecast output, we have to create matching columns in the metadata.</p>

In [None]:
uaf_out.metadata2 = uaf_out.metadata.assign(drop_columns = False,
                                           mean2 = uaf_out.metadata.MEAN_Weekly_Sales,
                                           sd2 = uaf_out.metadata.SD_Weekly_Sales,
                                           mean3 = uaf_out.metadata.MEAN_Weekly_Sales,
                                           sd3 = uaf_out.metadata.SD_Weekly_Sales,
                                           mean4 = uaf_out.metadata.MEAN_Weekly_Sales,
                                           sd4 = uaf_out.metadata.SD_Weekly_Sales,
                                           mean5 = uaf_out.metadata.MEAN_Weekly_Sales,
                                           sd5 = uaf_out.metadata.SD_Weekly_Sales)
uaf_out.metadata2

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will create series objects for the forecast output and the changed metadata.</p>

In [None]:
#Create teradataml TDSeries objects.
td_series_forecast = TDSeries(data=arima_forecast_out.result,
                              id="idcols",
                              row_index="ROW_I",
                              row_index_style="TIMECODE",
                              payload_field=["FORECAST_VALUE","LO_80","HI_80","LO_95","HI_95"],
                              payload_content="MULTIVAR_REAL",
                              interval="WEEKS(1)"
                              )

In [None]:
td_series_metadata_forecast = TDSeries(data=uaf_out.metadata2, # metadata from the seasonally normalized series
                                       id="idcols",
                                       row_index="ROW_I",
                                       row_index_style="SEQUENCE",
                                       payload_field=["MEAN_Weekly_Sales", "SD_Weekly_Sales","mean2","sd2",
                                                      "mean3","sd3","mean4","sd4","mean5","sd5"],
                                       payload_content="MULTIVAR_REAL")

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Using the two series created above, we will unnormalize the forecast data.</p>

In [None]:
from teradataml import Unnormalize
forecast_unnormalize = Unnormalize(data1=td_series_forecast,
                                   data2=td_series_metadata_forecast,
                                   input_fmt_input_mode="MATCH",
                                   output_fmt_index_style="FLOW_THROUGH")
df_forecast_un = forecast_unnormalize.result
df_forecast_un
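<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the multi-column case concrete, here is a minimal pandas sketch of the same arithmetic applied to every payload column. All numbers are made up, the <code>MEAN</code> and <code>SD</code> column names are placeholders, and joining the two tables on <code>ROW_I</code> is an assumption made for the illustration; it is not a description of how the MATCH input mode aligns rows.</p>

```python
import pandas as pd

payload_columns = ["FORECAST_VALUE", "LO_80", "HI_80", "LO_95", "HI_95"]

# Hypothetical normalized forecast for one series (two forecast periods).
forecast = pd.DataFrame({
    "ROW_I": [0, 1],
    "FORECAST_VALUE": [0.5, -0.2],
    "LO_80": [0.1, -0.6],
    "HI_80": [0.9, 0.2],
    "LO_95": [-0.1, -0.8],
    "HI_95": [1.1, 0.4],
})

# Hypothetical seasonal metadata: one mean/sd pair per row index.
metadata = pd.DataFrame({
    "ROW_I": [0, 1],
    "MEAN": [100.0, 120.0],
    "SD": [10.0, 20.0],
})

# Align forecast rows with their metadata, then rescale every payload column
# with the same mean and sd, since they all live on the same normalized scale.
merged = forecast.merge(metadata, on="ROW_I")
unnormalized = merged[["ROW_I"]].copy()
for column in payload_columns:
    unnormalized[column] = merged[column] * merged["SD"] + merged["MEAN"]

print(unnormalized)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Because the standard deviation is positive, rescaling keeps the bounds in order around the forecast, which is why the same mean and standard deviation are repeated for each payload column in the metadata.</p>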

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To compare the actual values with the forecasted values, we will also unnormalize the fitresiduals from the ARIMA output, using the metadata of the normalized series.</p>

In [None]:
# Create teradataml TDSeries objects.
td_series_residuals = TDSeries(
                                data=arima_out.fitresiduals,
                                id="idcols",
                                row_index="ROW_I",
                                row_index_style="TIMECODE",
                                payload_field=["ACTUAL_VALUE"],
                                payload_content="REAL",
                                interval="WEEKS(1)"
                            )

In [None]:
td_series_metadata_residual = TDSeries(data=uaf_out.metadata, # metadata from the seasonally normalized series
                                       id="idcols",
                                       row_index="ROW_I",
                                       row_index_style="SEQUENCE",
                                       payload_field=["MEAN_Weekly_Sales", "SD_Weekly_Sales"],
                                       payload_content="MULTIVAR_REAL"                  
                                      )

In [None]:
residual_unnormalize = Unnormalize(data1=td_series_residuals,
                                   data2=td_series_metadata_residual,
                                   input_fmt_input_mode="MATCH",
                                   output_fmt_index_style="FLOW_THROUGH")
df_residual=residual_unnormalize.result
df_residual

In [None]:
import matplotlib.pyplot as plt

def plot_forecast(ID_COL):
    """Plot actual and forecasted values, with 80% and 95% intervals, for one series."""
    fig, ax = plt.subplots(figsize=(16,4))
    # Pull the unnormalized forecast and actual values for this series into pandas
    mean_forecast = df_forecast_un[df_forecast_un.idcols==ID_COL].sort('ROW_I').to_pandas()
    actuals = df_residual[df_residual.idcols == ID_COL][['idcols','ROW_I','ACTUAL_VALUE']].sort('ROW_I').to_pandas()
    # Plot actual values and the forecast on the same axes
    actuals.plot(x='ROW_I',y='ACTUAL_VALUE',label='actual',ax=ax)
    mean_forecast.plot(x='ROW_I',y='FORECAST_VALUE',label='forecast',color='red',ax=ax)
    # Shade the uncertainty area: 80% interval (darker) and 95% interval (lighter)
    ax.fill_between(mean_forecast.ROW_I, mean_forecast.LO_80, mean_forecast.HI_80, color='pink', alpha=0.5)
    ax.fill_between(mean_forecast.ROW_I, mean_forecast.LO_95, mean_forecast.HI_95, color='pink', alpha=0.2)
    ax.set_title("Actual and Forecasted values for Dept-Store " + str(ID_COL) + " for 7 periods")
    plt.show()


plot_forecast(Dept_Store1)
plot_forecast(Dept_Store2)
plot_forecast(Dept_Store3)

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Conclusion:</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have trained and validated the AUTOARIMA model on the Weekly Sales dataset, and the fitted values closely match the actual data. The goodness-of-fit metrics calculated in the estimate and validate phases, together with the Estimate and Validate function graphs, support the view that the model is well trained to forecast Weekly Sales.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Thus, with Teradata VantageCloud, we can build powerful end-to-end forecasting pipelines. Tools for each forecasting phase, from data preparation and exploration to model validation and scoring, let you forecast more efficiently and at scale with less development and testing time, and later deploy forecasting functions into automated pipelines that run in near real time.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>7. Cleanup</b></p>


<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Clean up the work tables to prevent errors the next time the notebook is run.</p>

In [None]:
try:
    db_drop_table(table_name = 'az_sf_joined')
except Exception:
    # Ignore the error if the table does not exist
    pass

<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_SalesForecastingUAF');"        # Takes 10 seconds

In [None]:
remove_context()

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>Dataset:</b>

- `Store`: Store number
- `Date`: Week
- `Temperature`: Average temperature in the region
- `Fuel_Price`: Cost of fuel in the region
- `MarkDown1`: Anonymized data related to promotional markdowns that Walmart is running.
- `MarkDown2`: Anonymized data related to promotional markdowns that Walmart is running.
- `MarkDown3`: Anonymized data related to promotional markdowns that Walmart is running.
- `MarkDown4`: Anonymized data related to promotional markdowns that Walmart is running.
- `MarkDown5`: Anonymized data related to promotional markdowns that Walmart is running.
- `CPI`: The consumer price index
- `Unemployment`: The unemployment rate
- `IsHoliday`: Whether the week is a special holiday week
- `Type`: Store type; there are 3 types: A, B and C.
- `Size`: Store size
- `Dept`: The department number
- `Weekly_Sales`: Sales for the given department in the given store

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
    Dataset source: <a href="https://www.kaggle.com/datasets/aslanahmedov/walmart-sales-forecast?select=stores.csv">Kaggle</a>
</p>

<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Required Materials</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let’s look at the elements we have available for reference for this notebook:</p>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Filters: </b>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Industry:</b> Retail </li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Functionality:</b> AUTOARIMA and Forecasting </li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Use Case:</b> Sales Forecast </li> </p>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Related Resources: </b>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><a href = 'https://www.teradata.com/Blogs/NPS-is-a-metric-not-the-goal'>In the fight to improve customer experience, NPS is a metric, not the goal</a></li> 
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><a href = 'https://www.teradata.com/Blogs/Hyper-scale-time-series-forecasting-done-right'>Hyper-scale time series forecasting done right </a></li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><a href = 'https://www.teradata.com/Blogs/Crystal-Ball-or-Black-Box-in-Retail-and-CPG'>Crystal Ball, Black Box or Advanced Forecasting and Demand Planning in Retail and CPG</a></li>
</p>

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2024 All Rights Reserved
        </div>
    </div>
</footer>