<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Time Series Forecasting on Number of Passengers for FlyHigh Airline</b>
</header>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Introduction:</b></p>

<p style = 'font-size:16px;font-family:Arial'>Consider a startup airline, <b>FlyHigh Airlines</b>, as our client. The client needs a forecast of the number of passengers who will fly with the airline. ARIMA models are widely used for short-term forecasting in many fields because they are well understood and efficient to fit. We will therefore build an ARIMA model to forecast demand (passenger traffic) for the airline.
<br>
<br>
ARIMA (AutoRegressive Integrated Moving Average) combines an autoregressive (AR) model and a moving-average (MA) model with a differencing step (I). It has three hyperparameters: P (number of autoregressive lags), d (order of differencing), and Q (number of moving-average lags), which come from the AR, I, and MA components respectively. The AR part models the current value as a function of its own previous values. The MA part models the current value as a function of previous forecast errors. The I part differences the series to remove trend, so that the AR and MA parts are fitted to a stationary series.</p>
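<p style = 'font-size:16px;font-family:Arial'>For reference, the three components can be written as one equation. This is the textbook form, shown as a sketch rather than the exact parameterization used by the Vantage functions. The symbols are assumptions introduced here: <i>B</i> is the backshift operator (B y<sub>t</sub> = y<sub>t-1</sub>), <i>c</i> is a constant, ε<sub>t</sub> is the error term, and sign conventions for the moving-average coefficients vary between references.</p>

```latex
\phi(B)\,(1 - B)^{d}\, y_t = c + \theta(B)\,\varepsilon_t,
\qquad
\phi(B) = 1 - \sum_{i=1}^{P} \phi_i B^{i},
\qquad
\theta(B) = 1 + \sum_{j=1}^{Q} \theta_j B^{j}
```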
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Data:</b></p>
<p style = 'font-size:16px;font-family:Arial'>The data for this demonstration resides on Vantage. It consists of monthly passenger counts from 2007 through 2018 (144 rows), a small dataset chosen so that the functionality is easy to follow. The same functionality can be applied concurrently to massive datasets containing multiple series.
</p>

<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b>Steps in the analysis:</b></p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Connect to Vantage</li>    
    <li>Explore the dataset</li>
    <li>Check for Stationarity using Dickey-Fuller Test</li>
    <li>Make series stationary using TD_DIFF (D)</li>
    <li>Check for autocorrelation of the time series (Q)</li>
    <li>Check for partial autocorrelation of the time series (P)</li>
    <li>Using ARIMA (AutoRegressive Integrated Moving Average) model to forecast number of passengers
        <ul>
            <li>7.1 Estimation step using TD_ARIMAESTIMATE</li>
            <li>7.2 Extract residuals</li>
            <li>7.3 Create table PLOT_ESTIMATE for plotting</li>
            <li>7.4 Validation step using TD_ARIMAVALIDATE</li>
            <li>7.5 Extract residuals</li>
            <li>7.6 Create table PLOT_VALIDATE for plotting</li>
            <li>7.7 Forecast step using TD_ARIMAFORECAST</li>
            <li>7.8 Create table PLOT_FORECAST for plotting</li>
        </ul>
    </li>
    <li>Cleanup</li>
</ol>

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>1. Connect to Vantage</b>
<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%connect local, hidewarnings=True

<p style = 'font-size:16px;font-family:Arial'>Setup for execution of notebook. Begin running steps with Shift + Enter keys.</p>


In [None]:
Set query_band='DEMO=AirPassengersTimeSeriesForecasting.ipynb;' update for session;

<b style = 'font-size:20px;font-family:Arial;color:#E37C4D'>Getting Data for This Demo</b>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but consumes available storage. The following cell contains two statements, one of which is commented out. To switch between the modes, change which statement is commented out.</p>

In [None]:
call get_data('DEMO_AirPassengers_cloud');           -- Takes 10 seconds
-- call get_data('DEMO_AirPassengers_local');           -- Takes 10 seconds

<p style = 'font-size:16px;font-family:Arial'>Optional step – if you want to see status of databases/tables created and space used.</p>


In [None]:
call space_report();          -- Takes 5 seconds

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>2. Explore the dataset</b>
<p style = 'font-size:16px;font-family:Arial'>The dataset consists of time-series data that is organized by date/time and the corresponding number of passengers traveling per month. It contains two columns: <b>Date</b>, which represents the temporal information, and <b>Passengers</b>, which represents the value to be forecasted.</p>

In [None]:
SELECT TOP 5 * FROM DEMO_AirPassengers.airpassengers ORDER BY "Date";

<p style = 'font-size:16px;font-family:Arial'>The dataset captures the monthly variation in passenger numbers over time, allowing for the analysis and prediction of passenger trends.
    <br>
    <br> 
The <b>TD_PLOT</b> function will return an image in the cell of the results showing the total passengers by month from 2007 to 2018.</p>

<i>* Please <b> right click on the cell under the IMAGE column </b> from the output and choose view image to see the plot generated. </i>

In [None]:
EXECUTE FUNCTION
TD_Plot
(
    SERIES_SPEC
    (
        TABLE_NAME(DEMO_AirPassengers.airpassengers),
        ROW_AXIS(TIMECODE("Date")),
        SERIES_ID(seriesID),
        PAYLOAD (FIELDS("Passengers"),CONTENT(REAL))
    ),
    FUNC_PARAMS
    (
        PLOTS[(
            TYPE('line'),
            LEGEND('upper left'),
            TITLE('Number of Passengers Travelling Monthly')
        )],
        IMAGE('png')
    )
);

<p style = 'font-size:16px;font-family:Arial'>If you followed the instructions above, you should see a graph that looks like the following:</p>
<img id="fig1" src="images/fig1.png" alt="Number of Passengers Travelling Monthly" width="400" />
<p style = 'font-size:16px;font-family:Arial'>This shows that the data has yearly cycles and an upward trend.</p>

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>3. Check for Stationarity using Dickey-Fuller Test</b>
<br>
<b style = 'font-size:16px;font-family:Arial'>What is Stationarity?</b>
<br>
<p style = 'font-size:16px;font-family:Arial'>Before fitting an ARIMA model to a time series, the series should be stationary, which means that, over different time periods:
<br>
a) It should have constant mean.
<br>
b) It should have constant variance or standard deviation.
<br>
c) Auto-covariance should not depend on time.
</p>
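<p style = 'font-size:16px;font-family:Arial'>As a rough illustration of conditions (a) and (b), the following Python sketch compares the mean and variance of the first and second halves of a series. It is not part of the Vantage workflow. The function name <b>half_stats</b> and both toy series are hypothetical and are not the FlyHigh data.</p>

```python
from statistics import mean, pvariance

def half_stats(series):
    """Return (mean, variance) for the first and second half of a series."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return (mean(first), pvariance(first)), (mean(second), pvariance(second))

# Hypothetical toy series with an upward trend (not the FlyHigh data).
trending = [10 + 2 * t for t in range(20)]
# Hypothetical toy series that oscillates around a fixed level.
level = [10 + (1 if t % 2 else -1) for t in range(20)]

trend_first, trend_second = half_stats(trending)
level_first, level_second = half_stats(level)

# A large gap between the two halves hints at a non-constant mean.
print("trending means:", trend_first[0], trend_second[0])
print("level means:   ", level_first[0], level_second[0])
```

<p style = 'font-size:16px;font-family:Arial'>A shift in the mean between the two halves, as in the trending series, is an informal sign of non-stationarity. A formal test is still needed to confirm it.</p>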
<p style = 'font-size:16px;font-family:Arial'>We can visually examine whether the mean and variance are constant over different periods of time. Alternatively, the Dickey-Fuller test is widely used to check for stationarity in time series data.</p>

<p style = 'font-size:16px;font-family:Arial'>The following query checks for stationarity using the Dickey-Fuller test. The null hypothesis of the test is that the series has a unit root. The NULL_HYP column of the result reports one of two outcomes:
<br>
• <b>ACCEPT</b> means the null hypothesis is not rejected. The data are consistent with a unit root, and the process is treated as non-stationary.
<br>
• <b>REJECT</b> means the null hypothesis is rejected. The test finds evidence against a unit root, so the process can be treated as stationary, although other factors may still need to be checked.</p>

In [None]:
EXECUTE FUNCTION
TD_DICKEY_FULLER(
    SERIES_SPEC(
        TABLE_NAME(DEMO_AirPassengers.airpassengers),
        ROW_AXIS(TIMECODE("Date")),
        SERIES_ID("seriesID"),
        PAYLOAD(
            FIELDS("Passengers"),
            CONTENT(REAL)
        )
    ),
    FUNC_PARAMS(
        ALGORITHM('NONE')
    )
);

<p style = 'font-size:16px;font-family:Arial'>Examining the column labeled "NULL_HYP," we can see that the NULL hypothesis is accepted. This acceptance indicates the presence of unit roots in the series, implying that the series is non-stationary. Therefore, it is necessary to take steps to transform the series and make it stationary. By making the series stationary, we aim to remove the unit roots and create a more suitable basis for analysis and modeling.</p>

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>4. Make series stationary using TD_DIFF</b>
<p style = 'font-size:16px;font-family:Arial'>TD_DIFF is a transformation method used to convert a time series into a differenced time series. This transformation can be applied to a variety of types of time series, including stationary, seasonal, or nonstationary series.
<br>
<br>
In order to assess stationarity, we will generate a new table that contains the differenced series. The differenced series helps remove trends or seasonal patterns, making the data potentially more stationary.
<br>
<br>
For this analysis, we have set the parameter LAG to 12, representing 12 units of time (e.g., 12 months or one year). This choice considers the cyclical nature of the data, enabling us to capture seasonal patterns.
<br>
<br>
Using the Dickey-Fuller test, we will examine whether the newly created table, which contains the differenced series, exhibits stationarity. If the test indicates non-stationarity, we will increase the DIFFERENCES parameter and repeat the test. This iterative process aims to identify the minimum number of differences required to achieve stationarity in the series. By progressively differencing the series until it becomes stationary, we can enhance the suitability of the data for further analysis and modeling.</p>
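<p style = 'font-size:16px;font-family:Arial'>The idea behind lagged differencing can be sketched in a few lines of Python. This is a generic illustration that assumes each pass computes y[t] - y[t - lag]. It does not reproduce how TD_DIFF combines its LAG and DIFFERENCES parameters internally. The function <b>difference</b> and the toy series are hypothetical.</p>

```python
def difference(series, lag=1, times=1):
    """Apply y[t] - y[t - lag] to the series `times` times in a row."""
    result = list(series)
    for _ in range(times):
        result = [result[t] - result[t - lag] for t in range(lag, len(result))]
    return result

# Hypothetical toy series: linear trend plus a pattern repeating every 4 steps.
season = [0, 5, 0, -5]
toy = [3 * t + season[t % 4] for t in range(16)]

once = difference(toy, lag=4, times=1)   # removes the repeating pattern
twice = difference(toy, lag=4, times=2)  # also removes the remaining constant

print(once)   # twelve values, all equal to 12
print(twice)  # eight values, all equal to 0
```

<p style = 'font-size:16px;font-family:Arial'>Each pass shortens the series by <i>lag</i> points, which is one reason to stop at the smallest number of differences that yields a stationary series.</p>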

In [None]:
EXECUTE FUNCTION
COLUMNS(OUT_Passengers AS Passengers)
INTO VOLATILE ART(diff1_air)
TD_DIFF(
    SERIES_SPEC(
            TABLE_NAME(DEMO_AirPassengers.airpassengers),
            ROW_AXIS(TIMECODE("Date")),
            SERIES_ID("seriesID"),
            PAYLOAD(
                FIELDS("Passengers"),
                CONTENT(REAL)
            )
    ),
    FUNC_PARAMS(
          LAG(12),
          DIFFERENCES(1),
          SEASONAL_MULTIPLIER(0)
    )
);

In [None]:
SELECT TOP 5 * FROM diff1_air ORDER BY ROW_I;

<p style = 'font-size:16px;font-family:Arial'>In the above result, <b>ROW_I</b> is the row identifier of the ordered result set and <b>Passengers</b> (the function's <b>OUT_Passengers</b> column, renamed by the COLUMNS clause) holds the transformed magnitudes of the differenced time series.
    <br>
    <br>
The following cell applies Dickey-Fuller test to check for stationarity.</p>

In [None]:
EXECUTE FUNCTION
TD_DICKEY_FULLER(
    SERIES_SPEC(
        TABLE_NAME(diff1_air),
        ROW_AXIS(SEQUENCE(ROW_I)),
        SERIES_ID("seriesID"),
        PAYLOAD(
            FIELDS("Passengers"),
            CONTENT(REAL)
        )
    ),
    FUNC_PARAMS(
        ALGORITHM('NONE')
    )
);

<p style = 'font-size:16px;font-family:Arial'>Based on the examination of the rightmost column labeled "NULL_HYP," it is evident that the NULL hypothesis has been accepted. This acceptance suggests the presence of unit roots in the series, indicating that the series is non-stationary.
<br>
To address this non-stationarity, we will perform differencing on the series. In this case, we will apply the differencing operation twice, as indicated by setting the DIFFERENCES parameter to 2. By differencing the series twice, we aim to further eliminate any remaining trends or patterns that contribute to non-stationarity.</p>

In [None]:
EXECUTE FUNCTION
COLUMNS(OUT_Passengers AS Passengers)
INTO VOLATILE ART(diff2_air)
TD_DIFF(
    SERIES_SPEC(
            TABLE_NAME(DEMO_AirPassengers.airpassengers),
            ROW_AXIS(TIMECODE("Date")),
            SERIES_ID("seriesID"),
            PAYLOAD(
                FIELDS("Passengers"),
                CONTENT(REAL)
            )
    ),
    FUNC_PARAMS(
          LAG(12),
          DIFFERENCES(2),
          SEASONAL_MULTIPLIER(0)
    )
);

In [None]:
SELECT TOP 5 * FROM diff2_air ORDER BY ROW_I;

<p style = 'font-size:16px;font-family:Arial'>In the above result, <b>ROW_I</b> is the row identifier of the ordered result set and <b>Passengers</b> (the function's <b>OUT_Passengers</b> column, renamed by the COLUMNS clause) holds the transformed magnitudes of the differenced time series.
    <br>
    <br>
The following cell applies Dickey-Fuller test to check for stationarity.</p>

In [None]:
EXECUTE FUNCTION
TD_DICKEY_FULLER(
    SERIES_SPEC(
        TABLE_NAME(diff2_air),
        ROW_AXIS(SEQUENCE(ROW_I)),
        SERIES_ID("seriesID"),
        PAYLOAD(
            FIELDS("Passengers"),
            CONTENT(REAL)
        )
    ),
    FUNC_PARAMS(
        ALGORITHM('NONE')
    )
);

<p style = 'font-size:16px;font-family:Arial'>Based on the rightmost column labeled "NULL_HYP," we can see that the null hypothesis has been rejected. This rejection means the test found evidence against the presence of unit roots, which suggests that the twice-differenced series is stationary.
<br>
<br>
Additionally, the p-value is less than 0.05. If the null hypothesis of non-stationarity were true, a test statistic at least this extreme would be expected less than 5% of the time. This small p-value supports treating the series as stationary.
<br>
<br>
Consequently, we will use a differencing order of <b>D = 2</b>, meaning that the series is differenced twice.</p>

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>5. Check for autocorrelation of the time series</b>
<p style = 'font-size:16px;font-family:Arial'>The TD_ACF method is used to compute the autocorrelation or autocovariance of a time series. Autocorrelation measures the correlation between a time series and its lagged versions, while autocovariance measures the covariance between a time series and its lagged versions. These metrics help us understand the relationship or dependency of the time series on its past values.
<br>
<br> 
In this analysis, we examine the autocorrelation using a maximum lag of 12 time steps. This means we calculate the autocorrelation or autocovariance at different lagged intervals up to 12 time steps. By considering a range of lags, we gain insights into how the time series correlates or covaries with itself over different time periods.</p>
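<p style = 'font-size:16px;font-family:Arial'>To make the definition concrete, here is a minimal Python sketch of the sample autocorrelation. It assumes the series is demeaned and that every lag is divided by the full series length n (a common biased estimator). It is not the TD_ACF implementation, and the function <b>acf</b> and the toy series are hypothetical.</p>

```python
def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (demeaned, divisor n)."""
    n = len(series)
    mu = sum(series) / n
    dev = [x - mu for x in series]
    c0 = sum(d * d for d in dev) / n  # lag-0 autocovariance (the variance)
    out = []
    for k in range(max_lag + 1):
        # Autocovariance at lag k, then normalized by the variance.
        ck = sum(dev[t] * dev[t - k] for t in range(k, n)) / n
        out.append(ck / c0)
    return out

# Hypothetical toy series that flips sign at every step.
toy = [1, -1] * 10
r = acf(toy, 3)
print(r)  # lag 0 is 1.0; odd lags are negative, even lags positive
```

<p style = 'font-size:16px;font-family:Arial'>Dividing each autocovariance by the lag-0 value is what turns autocovariance into autocorrelation, so lag 0 is always 1.</p>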

In [None]:
EXECUTE FUNCTION
COLUMNS(OUT_Passengers AS Auto_Correlation)
INTO VOLATILE ART(ACFDemo)
TD_ACF(
    SERIES_SPEC(
        TABLE_NAME(diff2_air),
        ROW_AXIS(SEQUENCE(ROW_I)),
        SERIES_ID("seriesID"),
        PAYLOAD(
            FIELDS("Passengers"),
            CONTENT(REAL)
        )
    ),
    FUNC_PARAMS(
        MAXLAGS(12),
        UNBIASED(0),    -- Use 0 for Jenkins-Watts formula, or 1 for BoxJenkins formula
        FUNC_TYPE(0),   -- Use 0 for autocorrelation, or 1 for autocovariance
        DEMEAN(1),
        QSTAT(0),
        ALPHA(0.05)
    )
);

In [None]:
SELECT TOP 5 * FROM ACFDemo ORDER BY ROW_I;

<p style = 'font-size:16px;font-family:Arial'>The <b>TD_PLOT</b> function will return an image in the cell of the results showing the Auto Correlation Plot.</p>
<i>* Please <b> right click on the cell under the IMAGE column </b> from the output and choose view image to see the plot generated. </i>

In [None]:
EXECUTE FUNCTION
TD_Plot
(
    SERIES_SPEC
    (
        TABLE_NAME(ACFDemo),
        ROW_AXIS(SEQUENCE(ROW_I)),
        SERIES_ID(seriesID),
        PAYLOAD (FIELDS(Auto_Correlation, CONF_OFF_Passengers),CONTENT(MULTIVAR_REAL))
    ),
    FUNC_PARAMS
    (
        PLOTS[(
            TYPE('corr')
           ,LEGEND('best') 
        )],
        IMAGE('png')
    )
);

<p style = 'font-size:16px;font-family:Arial'>If you followed the instructions above, you should see a graph that looks like the following:</p>
<img id="fig2" src="images/fig2.png" alt="Auto Correlation" width="400" />
<p style = 'font-size:16px;font-family:Arial'>To determine the moving-average order Q, we look for the lag (represented by ROW_I) at which the autocorrelation function (ACF) plot shows a value that falls just outside the confidence band.
<br>
<br>
In the graph, the autocorrelation value at ROW_I = 4 lies just outside the confidence band. Based on this observation, we choose a moving-average order of <b>Q = 4</b>.</p>
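<p style = 'font-size:16px;font-family:Arial'>Reading lags off the plot can be mimicked with a small Python sketch. It assumes the approximate white-noise band of ±1.96/√n, whereas the notebook plots the band from the CONF_OFF_Passengers column. The function <b>significant_lags</b> and the autocorrelation values below are hypothetical, not the notebook's output.</p>

```python
import math

def significant_lags(acf_values, n_obs, z=1.96):
    """Return lags >= 1 whose |autocorrelation| exceeds z / sqrt(n_obs)."""
    band = z / math.sqrt(n_obs)
    return [k for k, r in enumerate(acf_values) if k >= 1 and abs(r) > band]

# Hypothetical autocorrelations for lags 0..6 (not the notebook's output).
toy_acf = [1.0, 0.45, -0.30, 0.05, 0.21, -0.02, 0.01]

# With 100 observations the band is 0.196, so lags 1, 2 and 4 fall outside it.
print(significant_lags(toy_acf, n_obs=100))  # [1, 2, 4]
```

<p style = 'font-size:16px;font-family:Arial'>A list like this only narrows the candidates. The final choice of Q remains a judgment call that the validation step can confirm or refute.</p>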

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>6. Check for partial autocorrelation of the time series</b>
<p style = 'font-size:16px;font-family:Arial'>The TD_PACF function computes the partial autocorrelations of a time series, which also provide insight into whether the series is stationary. The partial autocorrelation at lag k measures the correlation between observations k time steps apart after removing the effect of the intermediate lags.
<br>
<br>
In this analysis, we examine the partial autocorrelation with a maximum lag of 12 time steps. By considering a range of lags, we can assess the correlations between the current value and past values at different time intervals.</p>
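<p style = 'font-size:16px;font-family:Arial'>The query below selects ALGORITHM(LEVINSON_DURBIN). As a sketch of the idea, the following Python code applies the textbook Durbin-Levinson recursion to a list of autocorrelations. It is not Teradata's implementation, and the function <b>pacf_from_acf</b> and the toy input are hypothetical.</p>

```python
def pacf_from_acf(r):
    """Durbin-Levinson recursion: autocorrelations r[0..K] -> PACF at lags 1..K."""
    max_lag = len(r) - 1
    pacf = []
    prev = []  # prev[j - 1] holds the order-(k-1) coefficient for lag j
    for k in range(1, max_lag + 1):
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den  # partial autocorrelation at lag k
        # Update the lower-order coefficients for the next iteration.
        cur = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
        cur.append(phi_kk)
        pacf.append(phi_kk)
        prev = cur
    return pacf

# Theoretical autocorrelations of an AR(1) process with coefficient 0.6.
toy_r = [0.6 ** k for k in range(5)]
toy_pacf = pacf_from_acf(toy_r)
print(toy_pacf)  # first value is 0.6, the rest are (numerically) zero
```

<p style = 'font-size:16px;font-family:Arial'>For a pure autoregressive process, the partial autocorrelations cut off after the true order, which is why the PACF plot is used to choose P.</p>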

In [None]:
EXECUTE FUNCTION
COLUMNS(OUT_Passengers AS Partial_Auto_Correlation)
INTO VOLATILE ART(PACFDemo)
TD_PACF (
    SERIES_SPEC(
        TABLE_NAME(diff2_air),
        ROW_AXIS(SEQUENCE(ROW_I)),
        SERIES_ID(seriesID),
        PAYLOAD(
            FIELDS("Passengers"),
            CONTENT(REAL)
        )
    ) ,
    FUNC_PARAMS(
        MAXLAGS(12),
        UNBIASED(0),    -- Use 0 for Jenkins-Watts formula, or 1 for BoxJenkins formula
        ALGORITHM(LEVINSON_DURBIN),
        ALPHA(0.05)
    )
);

In [None]:
SELECT TOP 5 * FROM PACFDemo ORDER BY ROW_I;

<p style = 'font-size:16px;font-family:Arial'>The <b>TD_PLOT</b> function will return an image in the cell of the results showing the Partial Auto Correlation Plot.</p>
<i>* Please <b> right click on the cell under the IMAGE column </b> from the output and choose view image to see the plot generated. </i>

In [None]:
EXECUTE FUNCTION
TD_Plot
(
    SERIES_SPEC
    (
        TABLE_NAME(PACFDemo),
        ROW_AXIS(SEQUENCE(ROW_I)),
        SERIES_ID(seriesID),
        PAYLOAD (FIELDS(Partial_Auto_Correlation, CONF_OFF_Passengers),CONTENT(MULTIVAR_REAL))
    ),
    FUNC_PARAMS
    (
        PLOTS[(
            TYPE('corr')
           ,LEGEND('best') 
        )],
        IMAGE('png')
    )
);

<p style = 'font-size:16px;font-family:Arial'>If you followed the instructions above, you should see a graph that looks like the following:</p>
<img id="fig3" src="images/fig3.png" alt="Partial Auto Correlation" width="400" />
<p style = 'font-size:16px;font-family:Arial'>To determine the number of autoregressive lags P, we look for the lag (represented by ROW_I) at which the partial autocorrelation function (PACF) plot shows a value that falls just outside the confidence band.
<br>
<br>
In the graph, the partial autocorrelation value at ROW_I = 2 falls outside the confidence band. Based on this observation, we choose <b>P = 2</b> autoregressive lags.</p>

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>7. Using ARIMA (AutoRegressive Integrated Moving Average) model to forecast number of passengers</b>
<p style = 'font-size:16px;font-family:Arial'>
ARIMA functions on VANTAGE run in the following order:
<br>
1. Run <b>TD_ARIMAESTIMATE</b> function to get the coefficients for the ARIMA model.
<br>
2. <i>[Optional]</i> Run the <b>TD_ARIMAVALIDATE</b> function to validate the "goodness of fit" of the ARIMA model when
FIT_PERCENTAGE is not 100 in TD_ARIMAESTIMATE.
<br>
3. Run the <b>TD_ARIMAFORECAST</b> function with input from step 1 or step 2 to forecast the future periods
beyond the last observed period.
</p>

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>7.1 Estimation step using TD_ARIMAESTIMATE</b></p>
<p style = 'font-size:16px;font-family:Arial'>The TD_ARIMAESTIMATE function estimates the coefficients corresponding to an ARIMA model and fits a series with an existing ARIMA model. The function can also provide the "goodness of fit" and the residuals of the fitting operation. The function generates a model layer used as input for the TD_ARIMAVALIDATE and TD_ARIMAFORECAST functions. This function is for univariate series.</p>

<br>

<p style = 'font-size:16px;font-family:Arial'>Here, the previously chosen values of P (autoregressive lags), d (differencing order), and Q (moving-average lags) are passed to the MODEL_ORDER parameter. In this notebook, the values are MODEL_ORDER(2, 2, 4).
<br>
<br>
The output of the analysis is stored in an ART (Analytical Result Table), which contains relevant information and results of the ARIMA modeling process.
<br>
<br>
Furthermore, the fit percentage is set to 80 with FIT_PERCENTAGE(80). This means that the ARIMA model is trained on 80% of the available data, and the remaining 20% is used to validate the model's performance.</p>
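<p style = 'font-size:16px;font-family:Arial'>The arithmetic of the split can be sketched as follows. This assumes the row count is truncated toward zero. The exact boundary that TD_ARIMAESTIMATE uses may differ, and the function <b>split_by_fit_percentage</b> is hypothetical.</p>

```python
def split_by_fit_percentage(n_rows, fit_percentage):
    """Return (rows used for fitting, rows held out), truncating toward zero."""
    n_fit = n_rows * fit_percentage // 100  # integer arithmetic avoids float rounding
    return n_fit, n_rows - n_fit

# 144 monthly rows with a fit percentage of 80.
print(split_by_fit_percentage(144, 80))  # (115, 29)
```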

In [None]:
EXECUTE FUNCTION INTO VOLATILE ART(ART_EST)
TD_ARIMAESTIMATE(
    SERIES_SPEC(
        TABLE_NAME(DEMO_AirPassengers.airpassengers),
        ROW_AXIS(TIMECODE("Date")),
        SERIES_ID(seriesID),
        PAYLOAD(
            FIELDS("Passengers"),
            CONTENT(REAL))),
     FUNC_PARAMS(
        NONSEASONAL(MODEL_ORDER(2, 2, 4)),
        CONSTANT(1), COEFF_STATS(1), FIT_METRICS(1),
        RESIDUALS(1), ALGORITHM(CSS_MLE),  FIT_PERCENTAGE(80)
    )
);

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>7.2 Extract residuals</b></p>
<p style = 'font-size:16px;font-family:Arial'>The TD_EXTRACT_RESULTS function serves the purpose of retrieving auxiliary result sets stored within an Analytical Result Table (ART). In this particular case, we focus on extracting the residuals from the ART obtained during the previous estimation step.
<br>
<br>
Analytical Result Tables consist of multiple layers that store various types of data. By default, the initial layer contains model information. However, we are interested in accessing the ARTFITRESIDUALS layer, which contains crucial information about the actual values, calculated values, and residuals of the model.
<br>
<br>
Additionally, the ARTFITMETADATA layer within the ART provides relevant performance metrics associated with the fitted model. These metrics help evaluate the accuracy and reliability of the model in capturing the underlying patterns and making predictions.
</p>

In [None]:
CREATE TABLE AR_RESIDUALS AS (
    EXECUTE FUNCTION
    TD_EXTRACT_RESULTS(
        ART_SPEC(
            TABLE_NAME(ART_EST),
            LAYER(ARTFITRESIDUALS)
        )
    )
) WITH DATA;

In [None]:
SELECT TOP 5 * FROM AR_RESIDUALS ORDER BY ROW_I;

<p style = 'font-size:16px;font-family:Arial'>The output displayed above provides insights into the ARIMA model's actual values, calculated values, and residuals. In this context, the actual value represents the observed number of passengers flying, reflecting the real-world data.
<br>
<br>
The calculated value corresponds to the values generated by the ARIMA model during the estimation phase. These calculated values are based on the model's learned patterns, relationships, and parameters derived from the training data.
<br>
<br>
The residual value represents the discrepancy or difference between the actual value and the calculated value. It quantifies the model's prediction error or the extent to which the model's estimates deviate from the actual observations.
<br>
<br>
In the following cell, we extract additional metrics from the estimate phase i.e. TD_ARIMAESTIMATE.
</p>

In [None]:
SELECT * FROM (
    EXECUTE FUNCTION
    TD_EXTRACT_RESULTS(
        ART_SPEC(
            TABLE_NAME(ART_EST),
            LAYER(ARTFITMETADATA)
        )
    )
) AS T;

<p style = 'font-size:16px;font-family:Arial'>The displayed output provides performance metrics that offer insights into the effectiveness of the trained ARIMA model. One such metric is the R-Squared value, which measures how well the model fits the data. In this instance, the R-Squared value is noted as 0.94, indicating a strong fit between the model and the data.
<br>
<br>
The R-Squared value ranges from 0 to 1, with higher values indicating a better fit. A value of 0 suggests that the model does not explain any of the variability in the data, while a value of 1 indicates that the model perfectly captures the observed data's variability.
<br>
<br>
In the context of the ARIMA model, an R-Squared value of 0.94 suggests that the model accounts for a significant proportion of the variability present in the data. This implies that the model's predictions closely align with the actual data points and that the model's learned patterns and parameters effectively capture the underlying dynamics of the time series.</p>
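<p style = 'font-size:16px;font-family:Arial'>For readers who want to see how such a value arises, the following Python sketch computes R-Squared as 1 minus the ratio of the residual sum of squares to the total sum of squares. This is the standard definition and is assumed here. The exact formula behind the ARTFITMETADATA layer is not confirmed by this notebook, and the function <b>r_squared</b> and the sample values are hypothetical.</p>

```python
def r_squared(actual, calculated):
    """R-Squared = 1 - SS_res / SS_tot for paired actual and calculated values."""
    residuals = [a - c for a, c in zip(actual, calculated)]
    mean_actual = sum(actual) / len(actual)
    ss_res = sum(e * e for e in residuals)                 # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total variation
    return 1.0 - ss_res / ss_tot

# Hypothetical values (not taken from AR_RESIDUALS).
actual = [100.0, 120.0, 140.0, 160.0]
calculated = [110.0, 115.0, 145.0, 150.0]

print(r_squared(actual, calculated))  # 0.875
```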

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>7.3 Create table PLOT_ESTIMATE for plotting</b></p>
<p style = 'font-size:16px;font-family:Arial'>Here, we'll create a table which will be used to plot the actual and estimated time series.</p>

In [None]:
CREATE TABLE PLOT_ESTIMATE (DatasetID VARCHAR(10), ROW_I BIGINT, FIT_MAGNITUDE FLOAT);

In [None]:
INSERT INTO PLOT_ESTIMATE SELECT 'FlyHigh', ROW_I, ACTUAL_VALUE FROM AR_RESIDUALS WHERE ROW_I>1; 
INSERT INTO PLOT_ESTIMATE SELECT 'ESTIMATED', ROW_I, CALC_VALUE FROM AR_RESIDUALS WHERE ROW_I>1; 

In [None]:
SELECT TOP 5 * FROM PLOT_ESTIMATE ORDER BY ROW_I;

<p style = 'font-size:16px;font-family:Arial'>The <b>TD_PLOT</b> function will return an image in the cell of the results showing the Actual and Estimated values by the fitted ARIMA model.</p>
<i>* Please <b> right click on the cell under the IMAGE column </b> from the output and choose view image to see the plot generated. </i>

In [None]:
EXECUTE FUNCTION
TD_Plot
(
    SERIES_SPEC(
        TABLE_NAME(PLOT_ESTIMATE),
        ROW_AXIS(SEQUENCE(ROW_I)),
        SERIES_ID(DataSetID),
        ID_SEQUENCE('[{"DatasetID":"FlyHigh"},{"DatasetID":"ESTIMATED"}]'),
        PAYLOAD(
            FIELDS(FIT_MAGNITUDE),
            CONTENT(REAL)
        )
    ),
    FUNC_PARAMS
    (
        WIDTH(1920),
        HEIGHT(1080),
        TITLE('ARIMA ESTIMATE'),
        PLOTS[
            (
                TITLE ('ORIGINAL and ESTIMATED SERIES'),
                GRID(FORMAT('-')),
                TYPE('line'),
                SERIES[
                       (
                        ID(1),
                        FORMAT('r--')
                       ),
                       (
                        ID(2),
                        FORMAT('b-')
                       )
                     ],
                MARKER('o'),
                LEGEND('best'),
                XLABEL('X SeqNo'),
                YLABEL('Y Magnitude')
            )
        ]
    )
);

<p style = 'font-size:16px;font-family:Arial'>If you followed the instructions above, you should see a graph that looks like the following:</p>
<img id="fig4" src="images/fig4.png" alt="ARIMA Estimate" width="400" />
<p style = 'font-size:16px;font-family:Arial'>The red line indicates the actual number of passengers who travelled, and the blue line indicates the estimated number of passengers who travelled. This graph shows how well the ARIMA model has learned on the training dataset.</p>

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>7.4 Validation step using TD_ARIMAVALIDATE</b></p>
<p style = 'font-size:16px;font-family:Arial'>The TD_ARIMAVALIDATE function provides data scientists with a metrics collection for model selection and the produced residuals, such that several model validation tests can be performed. The TD_ARIMAVALIDATE function performs in-sample forecasting for seasonal and non-seasonal auto-regressive (AR), moving-average (MA), and ARIMA models. It also supports the extended Box Jenkins seasonal ARIMA model formula.
</p>
<p style = 'font-size:16px;font-family:Arial'>We'll use the output of the previous estimation step to validate the model. The train-validate split here for the dataset is 80:20. Hence 20% of the data will be used to validate the estimated model.</p>

In [None]:
EXECUTE FUNCTION 
INTO VOLATILE ART(AR_VALIDATE)
TD_ARIMAVALIDATE(
    ART_SPEC(TABLE_NAME(ART_EST)),
    FUNC_PARAMS(
        FIT_METRICS(1),
        RESIDUALS(1)
    )
);

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>7.5 Extract residuals</b></p>
<p style = 'font-size:16px;font-family:Arial'>The TD_EXTRACT_RESULTS function retrieves auxiliary result sets stored in an ART. Here we extract the residuals from the ART table output of the previous validation step.
</p>

In [None]:
CREATE TABLE AR_VALIDATE_RESIDUALS AS (
    EXECUTE FUNCTION 
    TD_EXTRACT_RESULTS(
        ART_SPEC(
            TABLE_NAME(AR_VALIDATE),
            LAYER(ARTFITRESIDUALS)
        )
    )
) WITH DATA;

In [None]:
SELECT TOP 5 * FROM AR_VALIDATE_RESIDUALS ORDER BY ROW_I;

<p style = 'font-size:16px;font-family:Arial'>The provided output displays the actual value, calculated value, and residual of the ARIMA model during the validation phase. In this context, the actual value represents the number of passengers from the unseen or validation data. These values serve as the ground truth against which the model's performance will be evaluated.
<br>
<br>
The calculated value represents the predicted value generated by the ARIMA model on the unseen validation data. These predicted values are obtained by applying the trained model's learned patterns, relationships, and parameters to the new data points. The calculated values provide an estimation of what the model predicts the number of passengers to be based on the unseen data.
<br>
<br>
The residual represents the difference between the actual value and the calculated value. It quantifies the prediction error of the ARIMA model for each data point in the validation set. 
<br>
<br>
In the following step, we again pull the metadata from the ART, which is the output of the validation phase.
</p>
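<p style = 'font-size:16px;font-family:Arial'>The relationship between the three columns can be sketched in Python. The sign convention (actual minus calculated) follows the description above and is an assumption about the stored values. The functions <b>residuals</b> and <b>rmse</b> and the sample numbers are hypothetical and are not taken from AR_VALIDATE_RESIDUALS.</p>

```python
import math

def residuals(actual, calculated):
    """Residual for each point: actual value minus calculated value."""
    return [a - c for a, c in zip(actual, calculated)]

def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical validation values (not taken from AR_VALIDATE_RESIDUALS).
actual = [410.0, 395.0, 430.0, 450.0]
calculated = [400.0, 400.0, 420.0, 455.0]

errs = residuals(actual, calculated)
print(errs)        # [10.0, -5.0, 10.0, -5.0]
print(rmse(errs))  # typical size of the error, in passengers
```

<p style = 'font-size:16px;font-family:Arial'>Summarizing the residuals in a single number expressed in passengers makes it easier to compare candidate models on the same validation data.</p>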

In [None]:
SELECT * FROM (
    EXECUTE FUNCTION
    TD_EXTRACT_RESULTS(
        ART_SPEC(
            TABLE_NAME(AR_VALIDATE),
            LAYER(ARTFITMETADATA)
        )
    )
) AS T;

<p style = 'font-size:16px;font-family:Arial'>The displayed output presents performance metrics that allow us to assess the effectiveness of our model on the unseen dataset, specifically the validation dataset. These metrics offer valuable insights into how well our model performs in making predictions on previously unseen data.
<br>
<br>
One such metric is the R-Squared value, which is noted as 0.82 in the provided output. The R-Squared value is a widely used measure of how well the model fits the validation data. A higher R-Squared value indicates a stronger fit between the model's predictions and the actual values observed in the validation dataset.
<br>
<br>
The R-Squared value typically ranges from 0 to 1, where 0 indicates that the model does not explain any of the variability in the validation data, and 1 indicates a perfect fit where the model captures all the variability. On held-out data it can also fall below 0 when the model predicts worse than the mean of the validation data. In our case, an R-Squared value of 0.82 suggests that our model performs well, as it explains a large portion of the variability present in the validation dataset.</p>

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>7.6 Create table PLOT_VALIDATE for plotting</b></p>
<p style = 'font-size:16px;font-family:Arial'>Here, we'll create a table which will be used to plot the actual and validated time series.</p>

In [None]:
CREATE TABLE PLOT_VALIDATE (DatasetID VARCHAR(10), ROW_I BIGINT, FIT_MAGNITUDE FLOAT);

In [None]:
INSERT INTO PLOT_VALIDATE SELECT 'FlyHigh', ROW_I, ACTUAL_VALUE FROM AR_VALIDATE_RESIDUALS WHERE ROW_I>0; 
INSERT INTO PLOT_VALIDATE SELECT 'PREDICTED', ROW_I, CALC_VALUE FROM AR_VALIDATE_RESIDUALS WHERE ROW_I>0;

<p style = 'font-size:16px;font-family:Arial'>The <b>TD_PLOT</b> function returns an image in the results cell showing the actual values and the values predicted by the ARIMA model.</p>
<i>* Please <b>right-click the cell under the IMAGE column</b> in the output and choose View Image to see the generated plot.</i>

In [None]:
EXECUTE FUNCTION
TD_Plot
(
    SERIES_SPEC(
        TABLE_NAME(PLOT_VALIDATE),
        ROW_AXIS(SEQUENCE(ROW_I)),
        SERIES_ID(DataSetID),
        ID_SEQUENCE('[{"DatasetID":"FlyHigh"},{"DatasetID":"PREDICTED"}]'),
        PAYLOAD(
            FIELDS(FIT_MAGNITUDE),
            CONTENT(REAL)
        )
    ),
    FUNC_PARAMS
    (
        WIDTH(1920),
        HEIGHT(1080),
        TITLE('ARIMA VALIDATE'),
        PLOTS[
            (
                TITLE ('ORIGINAL and PREDICTED SERIES'),
                GRID(FORMAT('-')),
                TYPE('line'),
                SERIES[
                       (
                        ID(1),
                        FORMAT('r--')
                       ),
                       (
                        ID(2),
                        FORMAT('b-')
                       )
                     ],
                MARKER('o'),
                LEGEND('best'),
                XLABEL('X SeqNo'),
                YLABEL('Y Magnitude')
            )
        ]
    )
);

<p style = 'font-size:16px;font-family:Arial'>If you followed the instructions above, you should see a graph that looks like the following:</p>
<img id="fig5" src="images/fig5.png" alt="ARIMA Validate" width="400" />
<p style = 'font-size:16px;font-family:Arial'>The red line indicates the actual number of passengers who travelled, and the blue line indicates the predicted number of passengers who travelled. This graph shows how well the ARIMA model predicts on the validation data.</p>
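
<p style = 'font-size:16px;font-family:Arial'>To put a number on what the plot shows, the residuals can be summarized with two common error measures, the mean absolute error and the root mean squared error. This is a minimal sketch, assuming the ROW_I, ACTUAL_VALUE and CALC_VALUE columns of AR_VALIDATE_RESIDUALS that were used to build the plot; the aliases N_POINTS, MEAN_ABS_ERROR and ROOT_MEAN_SQ_ERROR are illustrative names, not part of any function output.</p>

In [None]:
-- Sketch: summarize the validation errors in the unit of the series (passengers)
SELECT COUNT(*) AS N_POINTS,
       AVG(ABS(ACTUAL_VALUE - CALC_VALUE)) AS MEAN_ABS_ERROR,
       SQRT(AVG((ACTUAL_VALUE - CALC_VALUE) * (ACTUAL_VALUE - CALC_VALUE))) AS ROOT_MEAN_SQ_ERROR
FROM AR_VALIDATE_RESIDUALS
WHERE ROW_I > 0;   -- same row filter as the plotting step

<p style = 'font-size:16px;font-family:Arial'>Both measures are expressed in the same unit as the series, and lower values are better. The root mean squared error penalizes large misses more heavily than the mean absolute error, so a large gap between the two hints at a few badly predicted months.</p>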

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>7.7 Forecast step using TD_ARIMAFORECAST</b></p>
<p style = 'font-size:16px;font-family:Arial'>The TD_ARIMAFORECAST function is used to forecast a user-defined number of periods based on models fitted from the TD_ARIMAESTIMATE function.</p>
<p style = 'font-size:16px;font-family:Arial'>In the next cell, we use the estimated and validated model to forecast the number of passengers for the next six periods, i.e. the next six months.</p>

In [None]:
EXECUTE FUNCTION INTO VOLATILE ART(ARMA_FORECAST)
TD_ARIMAFORECAST(
           ART_SPEC(TABLE_NAME(AR_VALIDATE)),
           FUNC_PARAMS(FORECAST_PERIODS(6)));

In [None]:
SELECT * FROM ARMA_FORECAST;

<p style = 'font-size:16px;font-family:Arial'>The above output shows the forecasted value for each of the next six months. It also includes lower and upper bounds for each forecast at the 80% and 95% confidence levels.</p>
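
<p style = 'font-size:16px;font-family:Arial'>The width of the interval around each forecast gives a quick sense of how uncertainty changes over the horizon. The query below is a minimal sketch that relies only on the ROW_I, FORECAST_VALUE, LO_80 and HI_80 columns of ARMA_FORECAST referenced in this notebook; WIDTH_80 is an illustrative alias.</p>

In [None]:
-- Sketch: width of the 80% interval for each forecasted period
SELECT ROW_I,
       FORECAST_VALUE,
       LO_80,
       HI_80,
       HI_80 - LO_80 AS WIDTH_80
FROM ARMA_FORECAST
ORDER BY ROW_I;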

<hr>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>7.8 Create table PLOT_FORECAST for plotting</b></p>
<p style = 'font-size:16px;font-family:Arial'>Here, we'll create a table which will be used to plot the forecasted number of passengers in the next 6 months.</p>

In [None]:
CREATE TABLE PLOT_FORECAST (DatasetID VARCHAR(16), ROW_I BIGINT, FORECAST_MAGNITUDE FLOAT);

In [None]:
INSERT INTO PLOT_FORECAST SELECT 'FORECASTED', ROW_I, FORECAST_VALUE FROM ARMA_FORECAST;
INSERT INTO PLOT_FORECAST SELECT 'UPPER_BOUND', ROW_I, HI_80 FROM ARMA_FORECAST;
INSERT INTO PLOT_FORECAST SELECT 'LOWER_BOUND', ROW_I, LO_80 FROM ARMA_FORECAST;

In [None]:
SELECT * FROM PLOT_FORECAST ORDER BY ROW_I;

<p style = 'font-size:16px;font-family:Arial'>The <b>TD_PLOT</b> function returns an image in the results cell showing the values forecasted by the ARIMA model.</p>
<i>* Please <b>right-click the cell under the IMAGE column</b> in the output and choose View Image to see the generated plot.</i>

In [None]:
EXECUTE FUNCTION
TD_Plot
(
    SERIES_SPEC(
        TABLE_NAME(PLOT_FORECAST),
        ROW_AXIS(SEQUENCE(ROW_I)),
        SERIES_ID(DataSetID),
        ID_SEQUENCE('[{"DatasetID":"FORECASTED"},{"DatasetID":"UPPER_BOUND"},{"DatasetID":"LOWER_BOUND"}]'),
        PAYLOAD(
            FIELDS(FORECAST_MAGNITUDE),
            CONTENT(REAL)
        )
    ),
    FUNC_PARAMS
    (
        WIDTH(1920),
        HEIGHT(1080),
        TITLE('ARIMA FORECAST'),
        PLOTS[
            (
                TITLE ('Forecast'),
                GRID(FORMAT('-')),
                TYPE('line'),
                SERIES[
                       (
                        ID(1),
                        FORMAT('r--')
                       ),
                       (
                        ID(2),
                        FORMAT('b-')
                       ),
                       (
                        ID(3),
                        FORMAT('b-')
                       )
                     ],
                MARKER('o'),
                LEGEND('best'),
                XLABEL('X SeqNo'),
                YLABEL('Y Magnitude')
            )
        ]
    )
);

<p style = 'font-size:16px;font-family:Arial'>If you followed the instructions above, you should see a graph that looks like the following:</p>
<img id="fig6" src="images/fig6.png" alt="ARIMA Forecast" width="400" />
<p style = 'font-size:16px;font-family:Arial'>The red line is the forecasted number of passengers for the next six months, and the blue lines are the upper and lower bounds of the 80% confidence interval.</p>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Conclusion:</b></p>
<p style = 'font-size:16px;font-family:Arial'>
After estimating and validating the ARIMA model on the air passengers dataset, we observe that the model's predictions closely follow the actual data, which suggests that the model has captured the main patterns in the series.
<br>
<br>
Given this close alignment and the favorable goodness-of-fit metrics (an R-Squared of 0.82 on the validation data), we conclude that the ARIMA model is a reasonable basis for short-term forecasts of passenger numbers for this dataset.</p>

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>8. Cleanup</b>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>Clean up the work tables to prevent errors the next time the notebook is run. This section drops all the tables created during the demonstration.</p>

In [None]:
DROP TABLE diff1_air;

In [None]:
DROP TABLE diff2_air;

In [None]:
DROP TABLE ACFDemo;

In [None]:
DROP TABLE PACFDemo;

In [None]:
DROP TABLE ART_EST;

In [None]:
DROP TABLE AR_RESIDUALS;

In [None]:
DROP TABLE PLOT_ESTIMATE;

In [None]:
DROP TABLE AR_VALIDATE;

In [None]:
DROP TABLE AR_VALIDATE_RESIDUALS;

In [None]:
DROP TABLE PLOT_VALIDATE;

In [None]:
DROP TABLE ARMA_FORECAST;

In [None]:
DROP TABLE PLOT_FORECAST;

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
call remove_data('DEMO_AirPassengers');          -- Takes 5 seconds

<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>UAF(Unbounded Array Framework) Documentation: <a href = 'https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-VantageTM-Unbounded-Array-Framework-Time-Series-Reference-17.20/Unbounded-Array-Framework'>https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-VantageTM-Unbounded-Array-Framework-Time-Series-Reference-17.20/Unbounded-Array-Framework</a></li>
</ul>

<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">©2023 Teradata. All Rights Reserved</footer>