<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       SelectionCriteria Function in Vantage
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>Introduction</b></p>
<p style = 'font-size:18px;font-family:Arial'><b>SelectionCriteria</b></p>
<p style = 'font-size:16px;font-family:Arial'>The SelectionCriteria() function computes a series of model selection metrics to help a data scientist choose the best model.</p>

<p style = 'font-size:16px;font-family:Arial'>Model selection is the process of choosing the best algorithm or model from a set of candidates for a given dataset. Candidates are compared on evaluation metrics computed under a common evaluation scheme. It is a critical step because it determines how accurately and effectively the chosen model predicts or classifies new data. The goal is to find a model that generalizes well to unseen data, rather than one that merely fits the training data well.</p>

<p style = 'font-size:16px;font-family:Arial'>The model selection metrics are:</p>

<li style = 'font-size:16px;font-family:Arial'><code>Akaike Information Criteria (AIC):</code> Tests how well a model fits the data set without over-fitting it. An AIC score is only meaningful when compared with the AIC score of a competing model. The model with the lower AIC score is expected to strike a better balance between fitting the data set and avoiding over-fitting.</li>
<li style = 'font-size:16px;font-family:Arial'><code>Schwarz Bayesian Information Criteria (SBIC):</code> Quantifies model complexity and helps select the least complex probability model among the options. This approach ignores the prior probability and instead compares how efficiently different models predict outcomes. That efficiency is measured by evaluating each model’s parameters with a likelihood function and then applying a penalty that grows with the number of parameters.</li>
<li style = 'font-size:16px;font-family:Arial'><code>Hannan Quinn Information Criteria (HQIC):</code> Measures the goodness-of-fit of a statistical model and is often used as a criterion for model selection among a finite set of models. It is related to Akaike's information criterion. Like AIC, the HQIC penalizes the number of parameters in the model, but for samples larger than about 15 observations the HQIC penalty is larger than the AIC penalty.</li>
<li style = 'font-size:16px;font-family:Arial'><code>Maximum Likelihood Rule (MLR):</code> Determine values for the model parameters. The parameter values are such that they maximize the likelihood that the process described by the model produced the data that was observed.</li>
<li style = 'font-size:16px;font-family:Arial'><code>Mean Squared Error (MSE):</code> Measure the amount of error in statistical models. It assesses the average squared difference between the observed and predicted values. When a model has no error, the MSE equals zero. As model error increases, its MSE value increases. The MSE is also known as Mean Squared Deviation (MSD).</li>
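
<p style = 'font-size:16px;font-family:Arial'>To make the trade-off between fit and complexity concrete, the following is a minimal sketch of the textbook least-squares forms of AIC, SBIC, and HQIC. It assumes Gaussian residuals, and the function name and the numbers are hypothetical. The exact formulas used by SelectionCriteria() may differ, so treat this as an illustration of the idea rather than a reproduction of the function.</p>

```python
import math


def gaussian_information_criteria(rss, n_obs, n_params):
    """Textbook least-squares forms of AIC, SBIC, and HQIC.

    Assumes Gaussian residuals, so that -2 * log-likelihood reduces
    (up to an additive constant) to n_obs * ln(rss / n_obs).
    """
    fit_term = n_obs * math.log(rss / n_obs)
    return {
        "AIC": fit_term + 2 * n_params,
        "SBIC": fit_term + n_params * math.log(n_obs),
        "HQIC": fit_term + 2 * n_params * math.log(math.log(n_obs)),
        "MSE": rss / n_obs,
    }


# Hypothetical candidates: the larger model fits slightly better
# (lower residual sum of squares) but uses three times as many parameters.
small = gaussian_information_criteria(rss=120.0, n_obs=100, n_params=2)
large = gaussian_information_criteria(rss=118.0, n_obs=100, n_params=6)

for name in ("AIC", "SBIC", "HQIC"):
    preferred = "small" if small[name] < large[name] else "large"
    print(f"{name}: small={small[name]:.2f} large={large[name]:.2f} -> prefer {preferred}")
```

<p style = 'font-size:16px;font-family:Arial'>In this made-up comparison the small gain in fit does not offset the extra parameters, so all three criteria favor the smaller model, even though the larger model has the lower MSE.</p>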



<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>1. Initiate a connection to Vantage</b>

<p style = 'font-size:16px;font-family:Arial'>In this section, we import the required libraries and set environment variables and environment paths (if required).</p>

In [None]:
from teradataml import (
    create_context,
    execute_sql,
    load_example_data,
    DataFrame,
    display,
    in_schema,
    TDSeries,
    TDAnalyticResult,
    ArimaEstimate,
    ArimaValidate,
    SelectionCriteria,
    Figure,
    plot,
    db_drop_table,
    db_drop_view,
    remove_context
    )

# Modify the following to match the specific client environment settings
display.max_rows = 5

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>1.1 Connect to Vantage</b></p>
<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../../UseCases/startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=PP_SelectionCriteria.ipynb;' UPDATE FOR SESSION; ''')

<p style = 'font-size:16px;font-family:Arial'>Begin running steps with Shift + Enter keys. </p>

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>1.2 Getting Data for This Demo</b></p>

<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but requires available storage space. The following cell uses the local option.</p>

In [None]:
%run -i ../../UseCases/run_procedure.py "call get_data('DEMO_SalesForecasting_local');"       # Takes 70 seconds

<p style = 'font-size:16px;font-family:Arial'>Next is an optional step – if you want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../../UseCases/run_procedure.py "call space_report();"        # Takes 10 seconds

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>2. Preparing Dataset</b>
<li style = 'font-size:16px;font-family:Arial'>Weekly_Sales is our variable of interest.</li>
<li style = 'font-size:16px;font-family:Arial'>Store_Type, Store_Size, Temperature, IsHoliday, Fuel_Price, MarkDown1, MarkDown2, MarkDown3, MarkDown4, MarkDown5 are exogenous variables.</li>
<p style = 'font-size:16px;font-family:Arial'>We prepare the dataset by creating a view that joins data from Weekly Sales, Stores, and Features. The view is created in SQL so that the joins and data preprocessing needed by the later steps happen in a single statement.</p>

In [None]:
query2='''REPLACE VIEW Weekly_Sales_Details AS
SELECT
    w.Sales_date AS times,
    CAST('2012-02-03' AS DATE) AS cutoff_date,
    w.Dept,
    w.Store,
    CAST(w.Sales_Date AS TIMESTAMP) AS Sales_Date,
    ZEROIFNULL(Weekly_Sales) AS Weekly_Sales,
    ZEROIFNULL(Store_Size) AS Store_Size,
    Store_Type AS Store_Type,
    w.IsHoliday,
    ZEROIFNULL(Temperature) AS Temperature,
    ZEROIFNULL(MarkDown1) AS MarkDown1,
    ZEROIFNULL(MarkDown2) AS MarkDown2,
    ZEROIFNULL(MarkDown3) AS MarkDown3,
    ZEROIFNULL(MarkDown4) AS MarkDown4,
    ZEROIFNULL(MarkDown5) AS MarkDown5,
    ZEROIFNULL(CPI) AS CPI,
    ZEROIFNULL(Unemployment) AS Unemployment,
    ZEROIFNULL(Fuel_Price) AS Fuel_Price,
    CAST(TRIM(w.Dept) || TRIM(w.Store) AS INT) AS idcols
FROM
    Demo_SalesForecasting.Weekly_Sales w
LEFT JOIN
    Demo_SalesForecasting.Stores s ON w.Store = s.Store
LEFT JOIN
    Demo_SalesForecasting.Features f ON w.Store = f.store AND w.Sales_Date = f.Sales_Date
WHERE
    w.Store IN (20, 4);
'''

execute_sql(query2)
modeldf=DataFrame.from_query('select * from Weekly_Sales_Details;')

In [None]:
dfacheck = modeldf.groupby(["idcols"])
dfacheck=dfacheck.count().select(["idcols","count_Sales_Date"])

dfa4=modeldf.join(dfacheck, on = 'idcols', how = "left", lsuffix = 't1', rsuffix = 't2').drop(['idcols_t2'],axis=1)
dfa4=dfa4.assign(idcols = dfa4['idcols_t1'])
dfa4=dfa4.drop(['idcols_t1'],axis=1)

# filter out incomplete time series 

modeldf1 = dfa4[dfa4.count_Sales_Date == 143]
modeldf1.shape
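
<p style = 'font-size:16px;font-family:Arial'>The filtering logic above counts the rows for each series and keeps only the series with the full number of observations. The following plain-Python sketch shows the same idea on a few hypothetical rows, with no Vantage connection needed. The row values and variable names are made up for illustration.</p>

```python
from collections import Counter

# Hypothetical (series_id, week) rows; series 102 is missing week 2.
rows = [
    (101, 1), (101, 2), (101, 3),
    (102, 1), (102, 3),
    (103, 1), (103, 2), (103, 3),
]

# The notebook expects 143 weekly observations per series; 3 is used here.
EXPECTED_POINTS = 3

# Count observations per series, then keep only the complete series.
points_per_series = Counter(series_id for series_id, _ in rows)
complete_ids = {
    series_id
    for series_id, n_points in points_per_series.items()
    if n_points == EXPECTED_POINTS
}
complete_rows = [row for row in rows if row[0] in complete_ids]

print(sorted(complete_ids))
print(len(complete_rows))
```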

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>3. SelectionCriteria</b>
<p style = 'font-size:16px;font-family:Arial'>The SelectionCriteria() function calculates metrics that help
    users choose a model for a forecast modeling project.</p>


<p></p>
<p style = 'font-size:16px;font-family:Arial'>Detailed help can be found by passing the function name to Python's built-in help() function.</p>

In [None]:
help(SelectionCriteria)

<p style = 'font-size:16px;font-family:Arial'>We first convert the DataFrame into a TDSeries and resample it onto a regular weekly grid with the Resample function. The resampled data is then used to build the input for the ArimaEstimate function.</p>

In [None]:
from teradataml import TDSeries, Resample

data_series_df = TDSeries(data=modeldf1,
                              id="idcols",
                              row_index=("Sales_Date"),
                              row_index_style= "TIMECODE",
                              payload_field="Weekly_Sales",
                              payload_content="REAL")

uaf_out1 = Resample(data=data_series_df,
                        interpolate='LINEAR',
                        timecode_start_value="TIMESTAMP '2010-02-05 00:00:00'",
                        timecode_duration="WEEKS(1)")

df=uaf_out1.result
df1=df.select(['idcols','ROW_I', 'Weekly_Sales']).assign(Sales_Date=df.ROW_I)
df1
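
<p style = 'font-size:16px;font-family:Arial'>Conceptually, resampling with linear interpolation places the observations on an evenly spaced grid and fills any missing grid point from its two neighbors. The following plain-Python sketch illustrates that idea on a hypothetical series. The helper name <code>resample_linear</code> and the data are assumptions for illustration, and this is not the implementation of Resample.</p>

```python
def resample_linear(times, values, start, step, count):
    """Linearly interpolate (times, values) onto start, start + step, ...

    Assumes times is sorted ascending and every grid point lies within
    the range covered by times.
    """
    out = []
    j = 0
    for k in range(count):
        t = start + k * step
        # Advance to the segment [times[j], times[j + 1]] that contains t.
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        t0, t1 = times[j], times[j + 1]
        v0, v1 = values[j], values[j + 1]
        out.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return out


# Hypothetical weekly series in which week 2 is missing.
weeks = [0, 1, 3, 4]
sales = [10.0, 12.0, 18.0, 17.0]

filled = resample_linear(weeks, sales, start=0, step=1, count=5)
print(filled)
```

<p style = 'font-size:16px;font-family:Arial'>Observed weeks keep their original values, and the missing week receives the midpoint of its neighbors because it lies halfway between them.</p>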

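<p style = 'font-size:16px;font-family:Arial'>Next, we fit an ARIMA(2,1,1) model with a constant term using the ArimaEstimate function and request the fit residuals for use in the next step.</p>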

In [None]:
# Execute ArimaEstimate function.
data_series_df_1 = TDSeries(data=df1,
                              id="Sales_Date",
                              row_index=("idcols"),
                              row_index_style= "SEQUENCE",
                              payload_field="Weekly_Sales",
                              payload_content="REAL")

arima_est_out = ArimaEstimate(data1=data_series_df_1,
                            nonseasonal_model_order=[2,1,1],
                            constant=True,
                            algorithm="MLE",
                            coeff_stats=True,
                            fit_metrics=True,
                            residuals=True,
                            fit_percentage=70)

<p style = 'font-size:16px;font-family:Arial'>We now build a series from the residual output of the ArimaEstimate() function and calculate the selection metrics on it.</p>

In [None]:
selectioncriteria_series = TDSeries(data=arima_est_out.fitresiduals,
                                    id="Sales_Date",
                                    row_index="ROW_I",
                                    row_index_style= "SEQUENCE",
                                    payload_field=["ACTUAL_VALUE",  "CALC_VALUE","RESIDUAL"],
                                    payload_content="MULTIVAR_REAL")
 
uaf_out=SelectionCriteria(data=selectioncriteria_series,
                          var_count=4,
                          constant=True,
                          use_likelihood=False)

# Print the result DataFrame.
uaf_out.result

<p style = 'font-size:16px;font-family:Arial'>The output of SelectionCriteria is a primary result set consisting of one row per series instance acted on by the function. </p>
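
<p style = 'font-size:16px;font-family:Arial'>As a rough sanity check on the inputs used above, the following sketch computes an MSE from hypothetical rows shaped like the three payload columns and counts the coefficients of an ARIMA(2,1,1) model with a constant. It rests on two assumptions: that the residual is the actual value minus the calculated value, and that <code>var_count</code> is meant to be the number of estimated coefficients. Check both against the function help before relying on them.</p>

```python
# Hypothetical rows shaped like the payload: (ACTUAL_VALUE, CALC_VALUE, RESIDUAL).
rows = [
    (10.0, 9.0, 1.0),
    (12.0, 12.5, -0.5),
    (11.0, 10.0, 1.0),
    (13.0, 13.5, -0.5),
]

# Assumption: residual = actual - calculated.
residuals_consistent = all(
    abs((actual - calc) - resid) < 1e-9 for actual, calc, resid in rows
)

# Mean squared error: the average of the squared residuals.
mse = sum(resid * resid for _, _, resid in rows) / len(rows)

# Assumption: the parameter count of ARIMA(p, d, q) with a constant is
# p autoregressive terms + q moving-average terms + 1 constant.
p, d, q = 2, 1, 1
n_params = p + q + 1

print(residuals_consistent, mse, n_params)
```

<p style = 'font-size:16px;font-family:Arial'>Under these assumptions the coefficient count agrees with the <code>var_count=4</code> value passed in the cell above.</p>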

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>4. Cleanup</b>

<p style = 'font-size:18px;font-family:Arial'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code drops the view created for this demo.</p>

In [None]:
db_drop_view('Weekly_Sales_Details')

<p style = 'font-size:18px;font-family:Arial'><b>Databases and Tables</b></p>

<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../../UseCases/run_procedure.py "call remove_data('DEMO_SalesForecasting');" 
#Takes 45 seconds

In [None]:
remove_context()

<hr style="height:1px;border:none;">

<p style = 'font-size:16px;font-family:Arial'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Teradataml Python reference: <a href = 'https://docs.teradata.com/search/all?query=Python+Package+User+Guide&content-lang=en-US'>here</a></li>
    <li>SelectionCriteria function reference: <a href = 'https://docs.teradata.com/search/all?query=SelectionCriteria&content-lang=en-US'>here</a></li>
</ul>

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2025. All Rights Reserved
        </div>
    </div>
</footer>