# Partitioned Time Series Modeling

This notebook can be used to train a time series forecasting model. 

It is especially useful when many series need to be modeled at once. For example, if a retailer needs a separate model for each store location, this code trains those models in parallel, which greatly reduces run time when there are many partitions. 

❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ 

__Prerequisites before running this notebook:__ 

- The time series data in Snowflake must have at least the following columns: __date or timestamp__ column, __target__ column, and __partition__ column(s) if multi-series. 
- The date column name MUST be in [unquoted identifier](https://docs.snowflake.com/en/sql-reference/identifiers-syntax#label-unquoted-identifier) format, i.e. contains only __upper case letters, underscores, and decimal digits__. It is __recommended__ that all other column names also be in that format so that [double-quoted identifiers](https://docs.snowflake.com/en/sql-reference/identifiers-syntax#label-delimited-identifier) are not needed.
- The target column (and any exogenous feature columns if they exist) should have values in a numeric format like FLOAT, DOUBLE, or INT. 
- Any null values in the data should already be imputed. 
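
As a quick pre-flight check for the column-name prerequisite, the sketch below tests whether a name fits the unquoted-identifier format described above. The helper name `is_unquoted_identifier` is hypothetical. The pattern also assumes the name must start with a letter or underscore, as Snowflake's identifier rules require, and it deliberately ignores `$`, which Snowflake permits but the prerequisite above does not mention.

```python
import re

# Upper-case letters, digits, and underscores only; must not start with a digit.
_UNQUOTED_IDENTIFIER = re.compile(r"[A-Z_][A-Z0-9_]*")


def is_unquoted_identifier(name: str) -> bool:
    """Return True if `name` needs no double quotes under the rule above."""
    return _UNQUOTED_IDENTIFIER.fullmatch(name) is not None


for candidate in ["ORDER_TIMESTAMP", "order date", "2024_SALES"]:
    print(candidate, is_unquoted_identifier(candidate))
```

Names that fail this check can still be used for the target, partition, and exogenous columns, but they then need the double-quoted form.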

## Instructions


1. Go to the ___set_global_variables___ cell in the __SETUP__ section below. 
    - Change the values of the user constants to match the specifications of the use case.
    - Descriptions of each value are written in that cell.
2. Click ___Run all___ in the upper right corner of the notebook to run the entire notebook. 
    - The notebook will perform feature engineering and will train models. 
    - If ___SAVE_MODEL_VERSION_THIS_RUN=True___, then the models will be saved to the model registry for later inference. 
    - If ___VALIDATION_DAYS___ are specified, interactive cells near the end of the notebook can be used to evaluate model performance.
    
❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ ❄️ 

In [None]:
# Imports
import importlib.metadata
import json
import math
import pickle
import pkgutil
import random
from datetime import datetime
from typing import Optional

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import streamlit as st
import xgboost as xgb
from snowflake.ml.model import custom_model
from snowflake.ml.model.model_signature import DataType, FeatureSpec, ModelSignature
from snowflake.ml.dataset import Dataset
from snowflake.ml.registry import registry
from snowflake.snowpark import Window
from snowflake.snowpark import functions as F
from snowflake.snowpark import types as T
from snowflake.ml.feature_store import (
    FeatureStore,
    FeatureView,
    Entity,
    CreationMode,
)

from forecast_model_builder.feature_engineering import (
    apply_functions_in_a_loop,
    expand_datetime,
    recent_rolling_avg,
    roll_up,
    verify_current_frequency,
    verify_valid_rollup_spec,
)
from forecast_model_builder.utils import (
    connect,
    version_featureview,
    version_data,
)



In [2]:
# Establish session
session = connect(connection_name="default")
session_db = session.connection.database
session_schema = session.connection.schema
session_db_schema = f"{session_db}.{session_schema}"
print(f"Session db.schema: {session_db_schema}")

# Query tag
query_tag = '{"origin":"sf_sit", "name":"sit_forecasting", "version":{"major":1, "minor":0}, "attributes":{"component":"modeling"}}'
session.query_tag = query_tag

# Get the current datetime  (This will be saved in the model storage table)
run_dttm = datetime.now()
print(f"Current Datetime: {run_dttm}")

Session db.schema: ML_DEV_DB.ML_DEV_SCHEMA
Current Datetime: 2025-09-29 15:47:14.533436


-----
# SETUP
-----

In [3]:
# SET GLOBAL VARIABLES FOR THIS RUN

# Name the model (if model already exists, a new version will be created)
MODEL_NAME = "TEST_MODEL_1"

# Boolean that is True if we want to save the model version in the current run.
# Users may want to set it to False while they experiment with different specifications, and then set it to True once they have the model they want to save.
SAVE_MODEL_VERSION_THIS_RUN = True

# --------------------------------
# Input Time Series Data
# --------------------------------
# Establish the Snowflake database, schema, and table containing the time series data
TS_DB = "FORECAST_MODEL_BUILDER"
TS_SCHEMA = "BASE"
TS_TABLE_NM = "DAILY_PARTITIONED_SAMPLE_DATA"

# --------------------------------
# Modeling setup
# --------------------------------
# Establish the Database and Schema that will be used to store the models
MODEL_DB = "FORECAST_MODEL_BUILDER"
MODEL_SCHEMA = "MODELING"

# --------------------------------
# Virtual Warehouse
# --------------------------------
# For modeling and inference, a larger warehouse may speed up execution time depending on the number of partitions.
# Scale up if there are a lot of partitions.
# NOTE: If set to None, then the session warehouse will be used.
MODELING_WH = "STANDARD_XL"

# --------------------------------
# Modeling
# --------------------------------
# From the time series data (TS_TABLE_NM), specify the name of column containing the datetime information
TIME_PERIOD_COLUMN = "ORDER_TIMESTAMP"

# NOTE: For the next 3 constants, if column names require a double-quoted identifier, include double quotes within the single quotes.
#       Examples: '"Target"', ['"store id"', '"product id"'], ['"Feature 1"'].

# Name of column containing the target variable (i.e. the value we are trying to predict)
TARGET_COLUMN = "TARGET"

# List of column names to use as partition columns. This is how you define each individual series to be modeled.
# If modeling a single series (i.e. no partitions) set this as an EMPTY LIST [].
PARTITION_COLUMNS = ["STORE_ID", "PRODUCT_ID"]

# List of column names in the time series table to use as EXOGENOUS FEATURES.
# Exogenous features are variables outside the main time series that can impact future values of the target variable.
#     Examples: weather features, promotions, holidays, economic indicators (like inflation), inventory on hand, etc.
# If there are no exogenous features in the data set, set this as an EMPTY LIST [].
# NOTE: This notebook will create several features (like YEAR, MONTH, DAY_OF_YEAR, etc). You do NOT need to list those.
#       Only list features that are already in the Snowflake table (TS_TABLE_NM).
EXOGENOUS_COLUMNS = ["FEATURE_1"]

# ALL_EXOG_COLS_HAVE_FUTURE_VALS is a boolean that is True if all exogenous features have future values present in the inference data.
#     For example, if you are predicting 56 days into the future (i.e. FORECAST_HORIZON=56),
#                  but you only know promotions for the next 4 weeks, you would set ALL_EXOG_COLS_HAVE_FUTURE_VALS = False.
# NOTE: There are two modeling patterns in this notebook:
#       1. Direct Multi-Step Forecasting - If ALL_EXOG_COLS_HAVE_FUTURE_VALS = False,
#                                           the code will create separate models for each lead/step (from step = 1 to step = FORECAST_HORIZON) within each partition.
#                                           In this pattern, inference is done using the most current date's information to predict each future step.
#       2. Global Modeling               - If ALL_EXOG_COLS_HAVE_FUTURE_VALS = True (or EXOGENOUS_COLUMNS is empty),
#                                           the code will train a single model within each partition.
#                                           In this pattern, inference is done using the information for each future step,
#                                               so the inference dataset will need a separate record for each future date to be predicted.
# This variable will determine which pattern is used.
ALL_EXOG_COLS_HAVE_FUTURE_VALS = True

# Specify if we should create lag features for the target variable (including avgs of previous periods). This will affect the lag_and_target_prep & recent_rolling_avg functions.
# NOTE: If we are using the Global Modeling pattern, we will not create recent rolling avg features.
CREATE_LAG_FEATURE = False

# Frequency of the data (choose from: "second", "minute", "hour", "day", "week", "month", "other")
# This is the frequency of the data as it currently exists in the Snowflake table (TS_TABLE_NM).
# If it is not a standard frequency, select "other"
CURRENT_FREQUENCY = "day"

# Frequency to roll up to (choose from: "second", "minute", "hour", "day", "week", "month", or None)
# If you do not wish to roll up to a higher level, set ROLLUP_FREQUENCY=None.
ROLLUP_FREQUENCY = None

# Specify how each column should be aggregated on roll-up (choose from: "sum", "avg", "min", or "max")
# NOTE: If ROLLUP_FREQUENCY is None, then this can be an empty dictionary {}.
#       Otherwise, you must specify an aggregation for the TARGET column AND for each of the EXOGENOUS_COLUMNS.
ROLLUP_AGGREGATIONS = {
    TARGET_COLUMN: "sum",
    "FEATURE_1": "sum",
}

# Forecast Horizon. Number of time periods to forecast into the future (UNITS will be that of the ROLLUP_FREQUENCY if specified, otherwise CURRENT_FREQUENCY).
# NOTE: Keep this number as small as possible if doing Direct Multi-Step Forecasting (in which a separate model gets built for each future time period).
FORECAST_HORIZON = 7

# Specify how many days to set aside for validation.
# NOTE: If this is set to 0, then the model will be trained on all historic data.
VALIDATION_DAYS = 90

# XGBRegressor hyperparameter selections.
# NOTE: This notebook does not perform hyperparameter tuning, so you can set these parameters here if you know which values you would like to use.
XGB_PARAMS = {
    "learning_rate": 0.05,
    "subsample": 0.80,
    "colsample_bytree": 0.80,
    "random_state": 42,
}

# --------------------------------
# Inference
# --------------------------------
# When distributing the inference records, we can set the batch size here.
# If the number is too high, inference on a large number of records might use up all available memory.
INFERENCE_APPROX_BATCH_SIZE = 200

# --------------------------------
# Calculated Constants
# --------------------------------
# Establish the name of the table that will hold model binaries.
# This will be a Snowflake table in your project schema.
MODEL_BINARY_STORAGE_TBL_NM = f"MODEL_STORAGE_{MODEL_NAME}"

-----
# Establish objects needed for this run
-----

In [4]:
# DERIVED OBJECTS

# -----------------------------------------------------------------------
# Notebook Warehouse
# -----------------------------------------------------------------------
SESSION_WH = session.connection.warehouse
print(f"Session warehouse:          {SESSION_WH}")

# -----------------------------------------------------------------------
# Check Modeling Warehouse
# -----------------------------------------------------------------------
# Check that the user specified an available warehouse as MODELING_WH. If not, use the session warehouse.
available_warehouses = [
    row["NAME"]
    for row in session.sql("SHOW WAREHOUSES")
    .select(F.col('"name"').alias("NAME"))
    .collect()
]

if MODELING_WH in available_warehouses:
    print(f"Modeling warehouse:         {MODELING_WH} \n")
else:
    print(
        f"WARNING: User does not have access to MODELING_WH = '{MODELING_WH}'. Model training will use '{SESSION_WH}' instead. \n"
    )
    MODELING_WH = SESSION_WH


# -----------------------------------------------------------------------
# Fully qualified MODEL NAME
# -----------------------------------------------------------------------
qualified_model_name = f"{MODEL_DB}.{MODEL_SCHEMA}.{MODEL_NAME}"

# -----------------------------------------------------------------------
# Create dictionary of user settings to log with the model
# -----------------------------------------------------------------------
user_settings_dict = {
    "MODEL_NAME": MODEL_NAME,
    "SAVE_MODEL_VERSION_THIS_RUN": SAVE_MODEL_VERSION_THIS_RUN,
    "TS_DB": TS_DB,
    "TS_SCHEMA": TS_SCHEMA,
    "TS_TABLE_NM": TS_TABLE_NM,
    "MODEL_DB": MODEL_DB,
    "MODEL_SCHEMA": MODEL_SCHEMA,
    "MODEL_BINARY_STORAGE_TBL_NM": MODEL_BINARY_STORAGE_TBL_NM,
    "SESSION_WH": SESSION_WH,
    "MODELING_WH": MODELING_WH,
    "TIME_PERIOD_COLUMN": TIME_PERIOD_COLUMN,
    "TARGET_COLUMN": TARGET_COLUMN,
    "PARTITION_COLUMNS": PARTITION_COLUMNS,
    "EXOGENOUS_COLUMNS": EXOGENOUS_COLUMNS,
    "ALL_EXOG_COLS_HAVE_FUTURE_VALS": ALL_EXOG_COLS_HAVE_FUTURE_VALS,
    "CREATE_LAG_FEATURE": CREATE_LAG_FEATURE,
    "CURRENT_FREQUENCY": CURRENT_FREQUENCY,
    "ROLLUP_FREQUENCY": ROLLUP_FREQUENCY,
    "ROLLUP_AGGREGATIONS": ROLLUP_AGGREGATIONS,
    "FORECAST_HORIZON": FORECAST_HORIZON,
    "VALIDATION_DAYS": VALIDATION_DAYS,
    "XGB_PARAMS": XGB_PARAMS,
    "INFERENCE_APPROX_BATCH_SIZE": INFERENCE_APPROX_BATCH_SIZE,
}

# -----------------------------------------------------------------------
# BACKEND SETUP: Create Model Schema
# -----------------------------------------------------------------------
# Create a schema to hold our models if it does not already exist
schema_exists = (
    session.table(f"{MODEL_DB}.INFORMATION_SCHEMA.SCHEMATA")
    .filter(F.upper(F.col("SCHEMA_NAME")) == F.upper(F.lit(MODEL_SCHEMA)))
    .count()
)

if schema_exists == 0:
    try:
        session.sql(f"create schema if not exists {MODEL_DB}.{MODEL_SCHEMA}").collect()
    except Exception as e:
        if "insufficient privileges" in str(e).lower():
            raise PermissionError(f"""Schema {MODEL_SCHEMA} does not already exist in {MODEL_DB}, and user does not have sufficient privileges to CREATE SCHEMA. 
            Please specify an existing schema for MODEL_SCHEMA constant.""") from e
        else:
            raise RuntimeError(
                f"An error occurred while attempting to create schema {MODEL_DB}.{MODEL_SCHEMA}: {e}"
            ) from e

# Reset the schema to the original session schema. (If we created a new schema, the session schema was set to the new schema)
session.use_schema(session_db_schema)

# -----------------------------------------------------------------------
# Create a window spec
# -----------------------------------------------------------------------
window_spec = Window.partitionBy(PARTITION_COLUMNS).orderBy(TIME_PERIOD_COLUMN)

# -----------------------------------------------------------------------
# Create a variable for the frequency at which we will be modeling
# -----------------------------------------------------------------------
CURRENT_FREQUENCY = CURRENT_FREQUENCY.lower()

if ROLLUP_FREQUENCY is not None:
    ROLLUP_FREQUENCY = ROLLUP_FREQUENCY.lower()
    if ROLLUP_FREQUENCY.lower() == "none":
        ROLLUP_FREQUENCY = None

modeling_frequency = CURRENT_FREQUENCY if ROLLUP_FREQUENCY is None else ROLLUP_FREQUENCY
print(f"Modeling Frequency:         {modeling_frequency}")

# -----------------------------------------------------------------------
# Variable for modeling pattern
# -----------------------------------------------------------------------
# Either (1) train_separate_lead_models = False : all features have future values in the inference data, so we don't need a separate model for each lead
# or (2) train_separate_lead_models = True : data contains exogenous variables that the inference data won't have future values for, requiring direct multi-step (lead) modeling
train_separate_lead_models = (
    False
    if ALL_EXOG_COLS_HAVE_FUTURE_VALS is True or len(EXOGENOUS_COLUMNS) == 0
    else True
)
print(f"Train Separate Lead Models: {train_separate_lead_models}")

# -----------------------------------------------------------------------
# Establish model registry object
# -----------------------------------------------------------------------
reg = registry.Registry(
    session=session, database_name=MODEL_DB, schema_name=MODEL_SCHEMA
)

# -----------------------------------------------------------------------
# BACKEND SETUP: Create Backend Tables
# -----------------------------------------------------------------------
# Create Model Storage table if it does not already exist
# It will be created in the schema associated with the notebook (which is the schema that was created for this project).
session.sql(
    f"""
create table if not exists {MODEL_BINARY_STORAGE_TBL_NM} (
    GROUP_IDENTIFIER VARIANT,
    GROUP_IDENTIFIER_STRING VARCHAR,
    MODEL_NAME VARCHAR(100),
    MODEL_VERSION VARCHAR(100),
    ALGORITHM VARCHAR(100),
    MODEL_TRAINED_DTTM TIMESTAMP,
    MODEL_BINARY BINARY,
    METADATA VARIANT,
    ENVIRONMENT_SPECS VARIANT
    )
comment = '{query_tag}'
"""
).collect()


# -----------------------------------------------------------------------
# Does model already exist in the registry?
# -----------------------------------------------------------------------
try:
    number_of_versions = len(reg.get_model(qualified_model_name).show_versions())
    if number_of_versions > 0:
        print(
            f"Model {qualified_model_name} already exists. This notebook will build a new version."
        )
except Exception:
    print(f"This will be the first version of model {qualified_model_name}.")

Session warehouse:          ML_DEV_XS_WH

Modeling Frequency:         day
Train Separate Lead Models: False
Model FORECAST_MODEL_BUILDER.MODELING.TEST_MODEL_1 already exists. This notebook will build a new version.
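
The `Train Separate Lead Models` value printed above decides how many models each partition gets. The sketch below mirrors the decision made in this cell; the function name `models_per_partition` is hypothetical and is not part of `forecast_model_builder`.

```python
def models_per_partition(
    all_exog_cols_have_future_vals: bool,
    exogenous_columns: list[str],
    forecast_horizon: int,
) -> int:
    """Number of models trained for each partition under the two patterns."""
    # Global modeling: every feature is known for future dates, so one model suffices.
    if all_exog_cols_have_future_vals or len(exogenous_columns) == 0:
        return 1
    # Direct multi-step forecasting: one model per lead (step 1 .. forecast_horizon).
    return forecast_horizon


print(models_per_partition(True, ["FEATURE_1"], 7))
print(models_per_partition(False, ["FEATURE_1"], 7))
```

With 250 partitions and `FORECAST_HORIZON = 7`, for example, the direct multi-step pattern would mean 1,750 models instead of 250, which is why the setup cell advises keeping the horizon small in that pattern.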


In [5]:
# Create Snowpark DataFrame from table in Snowflake
sdf = session.table(f"{TS_DB}.{TS_SCHEMA}.{TS_TABLE_NM}")

# Only keep the columns specified in the config
sdf = sdf.select(
    TIME_PERIOD_COLUMN, TARGET_COLUMN, *PARTITION_COLUMNS, *EXOGENOUS_COLUMNS
)

In [6]:
# -----------------------------------------
# Preliminary checks
# -----------------------------------------
# Verify valid rollup specification.
# Raise an error if the user specifies a rollup frequency that is finer grain than the current frequency
# Raise an error if the user does not specify a rollup aggregation for the target and all exogenous columns
verify_valid_rollup_spec(
    CURRENT_FREQUENCY, ROLLUP_FREQUENCY, ROLLUP_AGGREGATIONS, EXOGENOUS_COLUMNS
)

# Roughly verify the current frequency (datetime difference between consecutive records) of the time series data
verify_current_frequency(sdf, TIME_PERIOD_COLUMN, window_spec, CURRENT_FREQUENCY)

Most common time between consecutive records (frequency): 1.0 day(s)
    The current frequency appears to be in DAY granularity.
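
The internals of `verify_current_frequency` are not shown in this notebook. As a rough illustration of the idea behind the message above, the following sketch finds the most common gap between consecutive timestamps in a single series using only the standard library. `most_common_gap` is a hypothetical helper, not the library function.

```python
from collections import Counter
from datetime import datetime, timedelta


def most_common_gap(timestamps: list[datetime]) -> timedelta:
    """Return the most frequent difference between consecutive sorted timestamps."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("Need at least two timestamps to infer a frequency.")
    return Counter(gaps).most_common(1)[0][0]


# Daily series with one missing day (Jan 4): the one-day gap still dominates.
series = [datetime(2021, 1, d) for d in (1, 2, 3, 5, 6, 7)]
print(most_common_gap(series))
```

Using the most common gap rather than the minimum or mean keeps the check tolerant of occasional missing periods, which is presumably why the notebook describes its own check as rough.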
    


-----
# Feature Engineering
-----

In [7]:
# First Convert Decimal data types to Floats (because DecimalType doesn't work in modeling algorithms)
sdf_converted = sdf.select(
    [
        (
            F.col(field.name).cast(T.FloatType()).alias(field.name)
            if isinstance(field.datatype, T.DecimalType)
            else F.col(field.name)
        )
        for field in sdf.schema
    ]
)

# ------------------------------------------------------------------------
# ROLL UP to specified frequency
# ------------------------------------------------------------------------
sdf_rollup = roll_up(
    sdf_converted,
    TIME_PERIOD_COLUMN,
    PARTITION_COLUMNS,
    TARGET_COLUMN,
    EXOGENOUS_COLUMNS,
    ROLLUP_FREQUENCY,
    ROLLUP_AGGREGATIONS,
)

# ------------------------------------------------------------------------
# Create time-derived features
# ------------------------------------------------------------------------
sdf_engineered = expand_datetime(sdf_rollup, TIME_PERIOD_COLUMN, modeling_frequency)

# ------------------------------------------------------------------------
# Create rolling average of most recent time periods
# ------------------------------------------------------------------------
# NOTE: We can only generate recent rolling average features if we are training separate lead models (direct multi-step forecasting).
if CREATE_LAG_FEATURE and train_separate_lead_models:
    sdf_engineered = recent_rolling_avg(
        sdf_engineered, [TARGET_COLUMN], window_spec, modeling_frequency
    )

# ------------------------------------------------------------------------
# Create LAG features (and possibly LEAD feature) of the TARGET variable
# ------------------------------------------------------------------------
final_sdf = apply_functions_in_a_loop(
    train_separate_lead_models=train_separate_lead_models,
    partition_column_list=PARTITION_COLUMNS,
    input_sdf=sdf_engineered,
    target_column=TARGET_COLUMN,
    time_step_frequency=modeling_frequency,
    forecast_horizon=FORECAST_HORIZON,
    w_spec=window_spec,
    create_lag_feature=CREATE_LAG_FEATURE,
)

# Inspect data
print(f"Total record count after rolling up:   {sdf_rollup.count()}")
print(f"Total record count of final data:      {final_sdf.count()}")
final_sdf.show(2)

Total record count after rolling up:   367500
Total record count of final data:      367500
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"ORDER_TIMESTAMP"    |"TARGET"       |"FEATURE_1"    |"YEAR"  |"MONTH_SIN"         |"MONTH_COS"          |"WEEK_OF_YEAR_SIN"  |"WEEK_OF_YEAR_COS"   |"DAY_OF_WEEK_SUN"  |"DAY_OF_WEEK_MON"  |"DAY_OF_WEEK_TUE"  |"DAY_OF_WEEK_WED"  |"DAY_OF_WEEK_THU"  |"DAY_OF_WEEK_FRI"  |"DAY_OF_WEEK_SAT"  |"DAY_OF_YEAR_SIN"   |"DAY_OF_YEAR_COS"    |"DAYS_SINCE_JAN2020"  |"MODEL_TARGET"  |"GROUP_IDENTIFIER"  |"GROUP_IDENTIFIER_STRING"  |
--------------------------------------------------------------
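
Columns such as `MONTH_SIN` and `MONTH_COS` in the output above suggest a cyclical encoding of calendar fields. The exact formula used by `expand_datetime` is not shown here, so the sketch below assumes the common form sin(2π·v/period) and cos(2π·v/period). `cyclical_encode` is a hypothetical helper.

```python
import math


def cyclical_encode(value: int, period: int) -> tuple[float, float]:
    """Map a periodic value (e.g. month 1-12) onto the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


# December and January end up close together, unlike the raw values 12 and 1.
dec = cyclical_encode(12, 12)
jan = cyclical_encode(1, 12)
jun = cyclical_encode(6, 12)
print(math.dist(dec, jan), math.dist(jun, jan))
```

The point of the encoding is that adjacent periods stay adjacent across the year boundary, which a plain integer month column cannot express.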

-----
# Feature Store
-----

In [8]:
fs = FeatureStore(
    session,
    database=TS_DB,
    name=TS_SCHEMA,
    default_warehouse=session.get_current_warehouse(),
    creation_mode=CreationMode.CREATE_IF_NOT_EXIST,
)

entity = Entity(
    name="TS_PARTITION_ENTITY",
    join_keys=["GROUP_IDENTIFIER_STRING"],
)

fs.register_entity(entity)



Entity(name=TS_PARTITION_ENTITY, join_keys=['GROUP_IDENTIFIER_STRING'], owner=None, desc=)

In [9]:
fv = FeatureView(
    name="FORECAST_FEATURES",
    entities=[entity],
    feature_df=final_sdf,
    timestamp_col=TIME_PERIOD_COLUMN,
    refresh_freq="1 days",
    refresh_mode="INCREMENTAL",
)

version = version_featureview(fv)

fv_reg = fs.register_feature_view(fv, version=version)




-----
# TRAIN/TEST SPLIT
-----

In [10]:
# TRAIN/TEST SPLIT

sdf_fv = fs.read_feature_view(fv_reg).cache_result()

# TRAIN/VALIDATION SPLIT
if VALIDATION_DAYS == 0:
    sdf_train = sdf_fv
elif VALIDATION_DAYS > 0:
    # Get the last time period in the dataset
    last_time_period = sdf_fv.select(
        F.max(TIME_PERIOD_COLUMN).alias("MAX_DTTM")
    ).collect()[0]["MAX_DTTM"]
    # Remove the validation records from the training set
    sdf_train = sdf_fv.filter(
        F.date_trunc("day", TIME_PERIOD_COLUMN)
        < F.dateadd("day", F.lit(-VALIDATION_DAYS), F.lit(last_time_period))
    )
    sdf_test = sdf_fv.filter(
        F.date_trunc("day", TIME_PERIOD_COLUMN)
        >= F.dateadd("day", F.lit(-VALIDATION_DAYS), F.lit(last_time_period))
    )

# Inspect the data
training_dttm_boundaries = sdf_train.select(
    F.min(TIME_PERIOD_COLUMN).alias("MIN_DTTM"),
    F.max(TIME_PERIOD_COLUMN).alias("MAX_DTTM"),
).collect()[0]
print(f"Training set row count: {sdf_train.count()}")
print(f"First time period in training set: {training_dttm_boundaries['MIN_DTTM']}")
print(f"Last time period in training set:  {training_dttm_boundaries['MAX_DTTM']}")
if len(PARTITION_COLUMNS) > 0:
    print(
        f"Total Partition Count: {sdf_train.select(F.get(F.split('GROUP_IDENTIFIER_STRING',F.lit('_LEAD')),0)).distinct().count()}"
    )
else:
    print("No partitions specified.")
sdf_train.show(2)

Training set row count: 344750
First time period in training set: 2021-01-01 00:00:00
Last time period in training set:  2024-10-10 00:00:00
Total Partition Count: 250
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"ORDER_TIMESTAMP"    |"TARGET"      |"FEATURE_1"    |"YEAR"  |"MONTH_SIN"         |"MONTH_COS"          |"WEEK_OF_YEAR_SIN"  |"WEEK_OF_YEAR_COS"   |"DAY_OF_WEEK_SUN"  |"DAY_OF_WEEK_MON"  |"DAY_OF_WEEK_TUE"  |"DAY_OF_WEEK_WED"  |"DAY_OF_WEEK_THU"  |"DAY_OF_WEEK_FRI"  |"DAY_OF_WEEK_SAT"  |"DAY_OF_YEAR_SIN"   |"DAY_OF_YEAR_COS"    |"DAYS_SINCE_JAN2020"  |"MODEL_TARGET"  |"GROUP_IDENTIFIER"  |"GROUP_IDENTIFIER
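
The split above keeps every record strictly before the cutoff (last period minus `VALIDATION_DAYS`) for training and everything from the cutoff onward for validation. The same rule is sketched below on plain Python datetimes. `split_by_cutoff` is a hypothetical name, and the sketch omits the `date_trunc("day", ...)` step because its sample data is already daily.

```python
from datetime import datetime, timedelta


def split_by_cutoff(
    timestamps: list[datetime], validation_days: int
) -> tuple[list[datetime], list[datetime]]:
    """Split timestamps into (train, validation) around max - validation_days."""
    if validation_days == 0:
        return list(timestamps), []
    cutoff = max(timestamps) - timedelta(days=validation_days)
    train = [ts for ts in timestamps if ts < cutoff]
    validation = [ts for ts in timestamps if ts >= cutoff]
    return train, validation


# 1,470 consecutive days starting 2021-01-01, with a 90-day validation setting.
days = [datetime(2021, 1, 1) + timedelta(days=i) for i in range(1470)]
train, validation = split_by_cutoff(days, 90)
print(max(train).date(), len(validation))
```

Because the cutoff day itself falls in the validation set, the held-out window spans `VALIDATION_DAYS + 1` calendar days for daily data.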

In [11]:
dataset_name = "FORECAST_FEATURES"
dataset = Dataset.create(session, name=dataset_name+"_TRAIN", exist_ok=True)

ds_train_version = version_data(sdf_train)

if ds_train_version in dataset.list_versions():
    ds_train = dataset.select_version(ds_train_version)
else:
    ds_train = dataset.create_version(
        version=ds_train_version,
        input_dataframe=sdf_train,
        label_cols=[TARGET_COLUMN],
    )

if VALIDATION_DAYS > 0:
    dataset = Dataset.create(session, name=dataset_name+"_TEST", exist_ok=True)

    ds_test_version = version_data(sdf_test)
    
    if ds_test_version in dataset.list_versions():
        ds_test = dataset.select_version(ds_test_version)
    else:
        ds_test = dataset.create_version(
            version=ds_test_version,
            input_dataframe=sdf_test,
            label_cols=[TARGET_COLUMN],
        )



-----
# Model Training
-----

In [12]:
# --------------------------------------------------------
# Define and register a UDTF to perform model training
# --------------------------------------------------------

# Get all of the column names except the GROUP_IDENTIFIER columns (those are supplied via partition_by, not as UDTF inputs)
training_udtf_input_col_nms = [
    colnm
    for colnm in sdf_train.columns
    if colnm not in ["GROUP_IDENTIFIER", "GROUP_IDENTIFIER_STRING"]
]


def train_model(df: pd.DataFrame) -> pd.DataFrame:
    """Trains a forecasting model and returns the model binary and metadata.

    Parameters
    ----------
    df : pandas.DataFrame
        The input DataFrame.

    Returns
    -------
    pandas.DataFrame
        A DataFrame containing the model binary and metadata.

    """
    # NOTE: In a vectorized UDTF we need to RENAME the columns to match the input dataset
    df.columns = training_udtf_input_col_nms

    # Set the index
    df = df.set_index(pd.to_datetime(df.pop(TIME_PERIOD_COLUMN)))

    # Create X and y dataframes.
    X = df.drop(columns=[TARGET_COLUMN, "MODEL_TARGET"])
    y = df["MODEL_TARGET"]

    # train a model
    model = xgb.XGBRegressor(**XGB_PARAMS)
    model.fit(X, y)
    # Save the model binary
    model_binary = pickle.dumps(model)
    # Obtain feature importances
    feature_importance_dict = dict(
        zip(X.columns, [float(val) for val in model.feature_importances_])
    )
    metadata = {
        "feature_importance": feature_importance_dict,
    }

    # Save the environment specs
    module_dict = {}
    for finder, module_name, is_pkg in pkgutil.iter_modules():
        try:
            distribution = importlib.metadata.distribution(module_name)
            version = distribution.version
            module_dict[module_name] = version
        except importlib.metadata.PackageNotFoundError:
            continue
    model_df = pd.DataFrame(
        [[model.__class__.__name__, model_binary, module_dict, metadata]],
        columns=["ALGORITHM", "MODEL_BINARY", "ENVIRONMENT_SPECS", "METADATA"],
    )

    return model_df


# Define UDTF class
class ModelTrainingUDTF:
    """Class which is registered as a UDTF to train forecasting models."""

    def end_partition(self, df):
        """End partition method which utilizes the train model function."""
        forecast_df = train_model(df)
        yield forecast_df


# Get the data types for the input dataframe
vect_udtf_input_dtypes = [
    T.PandasDataFrameType(
        [
            field.datatype
            for field in sdf_train.schema.fields
            if field.name not in ["GROUP_IDENTIFIER", "GROUP_IDENTIFIER_STRING"]
        ]
    )
]

# Register the class as a temporary UDTF
# Give the UDTF a unique name so that it doesn't conflict with anyone else running the same notebook
udtf_name = f"MODEL_TRAINER_{MODEL_NAME}_{datetime.now().strftime('%Y_%m_%d_%H_%M_%S')}__{random.randint(1, 999)}"
session.udtf.register(
    ModelTrainingUDTF,
    name=udtf_name,
    input_types=vect_udtf_input_dtypes,
    output_schema=T.PandasDataFrameType(
        [T.StringType(), T.BinaryType(), T.VariantType(), T.VariantType()],
        ["ALGORITHM", "MODEL_BINARY", "ENVIRONMENT_SPECS", "METADATA"],
    ),
    packages=[
        "snowflake-snowpark-python",
        "pandas",
        "numpy",
        "xgboost",
        "scikit-learn",
    ],
    replace=True,
    is_permanent=False,
    comment=query_tag,
)

print("Registration complete")

Registration complete
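
When a UDTF like this one is called with a `partition_by` clause, `end_partition` runs once per partition, so each group gets its own independently trained model. A local, standard-library analogue of that pattern is sketched below. A trivial mean predictor stands in for `XGBRegressor` purely for illustration, and all function names are hypothetical.

```python
import pickle
from collections import defaultdict
from statistics import mean


def fit_mean_model(targets: list[float]) -> dict:
    """Toy stand-in for a real regressor: stores the training mean as its only parameter."""
    return {"algorithm": "mean", "prediction": mean(targets)}


def train_per_partition(rows: list[tuple[str, float]]) -> dict[str, bytes]:
    """Group (partition_key, target) rows and return one pickled model per key."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for key, target in rows:
        grouped[key].append(target)
    # One independent fit per partition, serialized in the spirit of MODEL_BINARY.
    return {key: pickle.dumps(fit_mean_model(vals)) for key, vals in grouped.items()}


rows = [("STORE_1", 10.0), ("STORE_1", 14.0), ("STORE_2", 3.0)]
binaries = train_per_partition(rows)
print(pickle.loads(binaries["STORE_1"]))
```

In the notebook the grouping and fan-out happen inside Snowflake, so the per-partition fits run in parallel on the warehouse rather than in a local loop.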


In [13]:
session.use_warehouse(MODELING_WH)

# Before model training, remove records where MODEL_TARGET is null
sdf_train = sdf_train.filter(F.col("MODEL_TARGET").isNotNull())

# Run the UDTF
udtf_models = sdf_train.select(
    "GROUP_IDENTIFIER",
    "GROUP_IDENTIFIER_STRING",
    F.call_table_function(udtf_name, *training_udtf_input_col_nms).over(
        partition_by=["GROUP_IDENTIFIER", "GROUP_IDENTIFIER_STRING"],
        order_by=TIME_PERIOD_COLUMN,
    ),
)

# Add additional columns to the output
total_leads_modeled_this_run = FORECAST_HORIZON if train_separate_lead_models else None

udtf_models = udtf_models.select(
    "GROUP_IDENTIFIER",
    "GROUP_IDENTIFIER_STRING",
    F.lit(MODEL_NAME).alias("MODEL_NAME"),
    "ALGORITHM",
    F.lit(run_dttm).alias("MODEL_TRAINED_DTTM"),
    "MODEL_BINARY",
    F.object_insert(
        F.col("METADATA"),
        F.lit("total_leads_modeled_this_run"),
        F.lit(total_leads_modeled_this_run),
    )
    .astype(T.VariantType())
    .alias("METADATA"),
    "ENVIRONMENT_SPECS",
)

# Cache results for faster downstream usage of the udtf_models DataFrame
udtf_models = udtf_models.cache_result()

# Switch back to the original warehouse
session.use_warehouse(SESSION_WH)

print("Model training complete.")

Model training complete.


-----
# Model Registry
-----

In [14]:
# --------------------------------------------------------
# Define the Partitioned Custom Model
# --------------------------------------------------------

sample_input = sdf_train.limit(100).drop("GROUP_IDENTIFIER").join(
    udtf_models.select("GROUP_IDENTIFIER_STRING","MODEL_BINARY"),
    on = "GROUP_IDENTIFIER_STRING",
)

model_input_predictor_features = [
    colnm
    for colnm in sdf_train.columns
    if colnm
    not in [
        "GROUP_IDENTIFIER",
        "GROUP_IDENTIFIER_STRING",
        TIME_PERIOD_COLUMN,
        TARGET_COLUMN,
        "MODEL_TARGET",
    ]
]


class ForecastingModelPickleInput(custom_model.CustomModel):
    """Custom model class."""

    def __init__(self, context: Optional[custom_model.ModelContext] = None) -> None:
        """Initialize object."""
        super().__init__(context)
        self.partition_id = None
        self.model = None

    @custom_model.partitioned_api
    def predict(self, input_df: pd.DataFrame) -> pd.DataFrame:
        """Make predictions using unpickled model."""
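        # Use positional indexing so this works even if the index does not start at 0
        current_partition_id = input_df["GROUP_IDENTIFIER_STRING"].iloc[0]
        if self.partition_id != current_partition_id:
            self.partition_id = current_partition_id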
            # Get the model binary from the first row of the input DataFrame where the column is not null
            self.model = pickle.loads(
                input_df.loc[
                    input_df["MODEL_BINARY"].first_valid_index(), "MODEL_BINARY"
                ]
            )

        model_output = self.model.predict(input_df[model_input_predictor_features])
        res = pd.DataFrame(model_output, columns=["_PRED_"])
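        # Use .to_numpy() so values are assigned by position, not aligned on input_df's index
        res["GROUP_IDENTIFIER_STRING_OUT_"] = input_df["GROUP_IDENTIFIER_STRING"].to_numpy()
        res[TIME_PERIOD_COLUMN + "_OUT_"] = input_df[TIME_PERIOD_COLUMN].to_numpy()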
        return res


m = ForecastingModelPickleInput()

# --------------------------------------------------------
# Log Model to Model Registry
# --------------------------------------------------------


# Log the model to the model registry
options = {"function_type": "TABLE_FUNCTION", "relax_version": False}
metrics_to_log = {
    "direct_multi_step_forecasting": train_separate_lead_models,
    "frequency": modeling_frequency,
    "training_data_start": training_dttm_boundaries["MIN_DTTM"].strftime(
        "%Y-%m-%d %H:%M:%S"
    ),
    "training_data_end": training_dttm_boundaries["MAX_DTTM"].strftime(
        "%Y-%m-%d %H:%M:%S"
    ),
    "user_settings": user_settings_dict,
    "train_dataset": {"name":ds_train.fully_qualified_name, "version":ds_train.selected_version.name},
    "test_dataset": {"name":ds_test.fully_qualified_name, "version":ds_test.selected_version.name},
}
mv = reg.log_model(
    m,
    model_name=qualified_model_name,
    options=options,
    metrics=metrics_to_log,
    conda_dependencies=["pandas", "xgboost"],
    sample_input_data=sample_input,
    #signatures={"predict": signature},
    comment=query_tag,
)

# In addition to setting the query tag for the model version, we also set it for the model itself
reg.get_model(qualified_model_name).comment = query_tag

print(f"Model version name: {mv.version_name}")

# Confirm that the new model/version is in the registry
reg.show_models()

Logging model: creating model manifest...:  33%|███▎      | 2/6 [00:02<00:05,  1.34s/it]  

configuration generated by an older version of XGBoost, please export the model by calling
`Booster.save_model` from that version first, then load it back in current version. See:

    https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html

for more details about differences between saving model and serializing.



Model logged successfully.: 100%|██████████| 6/6 [01:42<00:00, 17.01s/it]                          
Model version name: PLASTIC_FALCON_3


Unnamed: 0,created_on,name,model_type,database_name,schema_name,comment,owner,default_version_name,versions,aliases
0,2025-09-29 08:55:04.723000-07:00,TEST_MODEL_1,USER_MODEL,FORECAST_MODEL_BUILDER,MODELING,"{""origin"":""sf_sit"", ""name"":""sit_forecasting"", ...",ML_DEV_ROLE,CALM_DODO_4,"[""CALM_DODO_4"",""CURLY_CRAB_1"",""GOOD_EEL_3"",""MA...","{""DEFAULT"":""CALM_DODO_4"",""FIRST"":""RED_BEAR_1"",..."
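
The `predict` method in the cell above deserializes the pickled model only when the partition changes between calls. The following is a minimal sketch of that caching pattern, not the notebook's implementation: the class name `CachedPartitionPredictor`, the `loads` counter, and the pickled dict of "fitted parameters" are all hypothetical stand-ins for the real model binaries.

```python
import pickle


class CachedPartitionPredictor:
    """Unpickle a payload only when the partition ID changes between calls."""

    def __init__(self):
        self.partition_id = None
        self.params = None
        self.loads = 0  # counts deserializations, for illustration only

    def predict(self, partition_id, model_binary, n_rows):
        if self.partition_id != partition_id:
            self.partition_id = partition_id
            self.params = pickle.loads(model_binary)
            self.loads += 1
        # Hypothetical "model": predict the stored mean for every row
        return [self.params["mean"]] * n_rows


predictor = CachedPartitionPredictor()
binary_a = pickle.dumps({"mean": 10.0})
binary_b = pickle.dumps({"mean": 20.0})

preds_1 = predictor.predict("STORE_1", binary_a, 3)
preds_2 = predictor.predict("STORE_1", binary_a, 1)  # same partition: no unpickle
preds_3 = predictor.predict("STORE_2", binary_b, 2)  # new partition: unpickle again
print(predictor.loads)
```

Only unpickle data from a trusted source, since `pickle.loads` can execute arbitrary code during deserialization.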


In [15]:
# --------------------------------------------------------
# Save the model version to the model storage table
#   and set it as the default version in the registry
# --------------------------------------------------------

if SAVE_MODEL_VERSION_THIS_RUN:
    # Append model binaries and metadata to the model binary storage table in Snowflake
    udtf_models_w_version = udtf_models.with_column(
        "MODEL_VERSION", F.lit(mv.version_name)
    ).select(session.table(f"{MODEL_BINARY_STORAGE_TBL_NM}").columns)

    udtf_models_w_version.write.save_as_table(
        f"{MODEL_BINARY_STORAGE_TBL_NM}", mode="append"
    )

    # Set default version of the model to this version name
    reg.get_model(qualified_model_name).default = mv.version_name

    print(
        f"Model version '{mv.version_name}' saved to model storage table and set as the default version in the registry."
    )
else:
    print(
        f"""Model version '{mv.version_name}' was NOT saved to the model storage table and will be deleted from the registry at the end of this notebook.
    If you wish to save this version, set SAVE_MODEL_VERSION_THIS_RUN = True."""
    )

# Look at the most recent 3 versions of the model
reg.get_model(qualified_model_name).show_versions().tail(3)

Model version 'PLASTIC_FALCON_3' saved to model storage table and set as the default version in the registry.


Unnamed: 0,created_on,name,aliases,comment,database_name,schema_name,model_name,is_default_version,functions,metadata,user_data,model_attributes,size,environment,runnable_in,inference_services
7,2025-09-29 12:18:11.322000-07:00,STRANGE_CAT_4,[],"{""origin"":""sf_sit"", ""name"":""sit_forecasting"", ...",FORECAST_MODEL_BUILDER,MODELING,TEST_MODEL_1,False,"[""PREDICT""]","{""metrics"": {""direct_multi_step_forecasting"": ...",{},"{""framework"":""custom"",""client"":""snowflake-ml-p...",12336,"{""default"":{""python_version"":""3.12"",""snowflake...","[""WAREHOUSE""]",[]
8,2025-09-29 12:29:44.133000-07:00,CALM_DODO_4,[],"{""origin"":""sf_sit"", ""name"":""sit_forecasting"", ...",FORECAST_MODEL_BUILDER,MODELING,TEST_MODEL_1,False,"[""PREDICT""]","{""metrics"": {""direct_multi_step_forecasting"": ...",{},"{""framework"":""custom"",""client"":""snowflake-ml-p...",10119,"{""default"":{""python_version"":""3.12"",""snowflake...","[""WAREHOUSE""]",[]
9,2025-09-29 12:49:46.972000-07:00,PLASTIC_FALCON_3,"[""DEFAULT"",""LAST""]","{""origin"":""sf_sit"", ""name"":""sit_forecasting"", ...",FORECAST_MODEL_BUILDER,MODELING,TEST_MODEL_1,True,"[""PREDICT""]","{""metrics"": {""direct_multi_step_forecasting"": ...",{},"{""framework"":""custom"",""client"":""snowflake-ml-p...",4262709,"{""default"":{""python_version"":""3.12"",""snowflake...","[""WAREHOUSE""]",[]


-----
# MODEL EVALUATION
-----

In [16]:
# ------------------------------------------------------------------------
# TEST SET DATAFRAME
# ------------------------------------------------------------------------

# Partition Count
inference_partition_count = (
    sdf_test.select("GROUP_IDENTIFIER_STRING").distinct().count()
)

# Inspect the data
dttm_boundaries = sdf_test.select(
    F.min(TIME_PERIOD_COLUMN).alias("MIN_DTTM"),
    F.max(TIME_PERIOD_COLUMN).alias("MAX_DTTM"),
).collect()[0]

if VALIDATION_DAYS == 0:
    print(f"Eval row count: {sdf_test.count()}")
else:
    print(f"Test set row count: {sdf_test.count()}")
print(f"First time period in test set: {dttm_boundaries['MIN_DTTM']}")
print(f"Last time period in test set:  {dttm_boundaries['MAX_DTTM']}")
if len(PARTITION_COLUMNS) > 0:
    print(f"Total Partition Count: {inference_partition_count}")
else:
    print("No partitions specified.")
sdf_test.show(2)

Test set row count: 22750
First time period in test set: 2024-10-11 00:00:00
Last time period in test set:  2025-01-09 00:00:00
Total Partition Count: 250
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"ORDER_TIMESTAMP"    |"TARGET"       |"FEATURE_1"     |"YEAR"  |"MONTH_SIN"         |"MONTH_COS"          |"WEEK_OF_YEAR_SIN"   |"WEEK_OF_YEAR_COS"   |"DAY_OF_WEEK_SUN"  |"DAY_OF_WEEK_MON"  |"DAY_OF_WEEK_TUE"  |"DAY_OF_WEEK_WED"  |"DAY_OF_WEEK_THU"  |"DAY_OF_WEEK_FRI"  |"DAY_OF_WEEK_SAT"  |"DAY_OF_YEAR_SIN"    |"DAY_OF_YEAR_COS"    |"DAYS_SINCE_JAN2020"  |"MODEL_TARGET"  |"GROUP_IDENTIFIER"  |"GROUP_IDENTIFIER_STRI

In [18]:
# ------------------------------------------------------------------------
# TEST SET INFERENCE
# ------------------------------------------------------------------------
# Establish inference DataFrame
inference_input_df = sdf_test.drop("GROUP_IDENTIFIER")

# Establish model binary DataFrame
if "udtf_models" in globals():
    model_bytes_table = udtf_models.select("GROUP_IDENTIFIER_STRING", "MODEL_BINARY")
    print("Evaluation is using model binaries created in this run. \n")
else:
    model_bytes_table = (
        session.table(f"{MODEL_BINARY_STORAGE_TBL_NM}")
        .filter(F.col("MODEL_NAME") == MODEL_NAME)
        .filter(
            F.col("MODEL_VERSION")
            == reg.get_model(qualified_model_name).default.version_name
        )
        .select(
            "GROUP_IDENTIFIER_STRING",
            "MODEL_BINARY",
        )
    )
    print(
        "Evaluation is using model binaries from the model registry default model version (previously-created). \n"
    )

# Join model binary object to inference input data
inference_input_df = inference_input_df.join(
    # model_bytes_table, on=["GROUP_IDENTIFIER_STRING", "PARTITION_ROW_NUMBER"], how="left"
    model_bytes_table,
    on=["GROUP_IDENTIFIER_STRING"],
    how="inner",
)

# Add a column called BATCH_GROUP,
#   which has the property that for each unique value there are roughly the number of records specified in batch_size.
# Use that to create a PARTITION_ID column that will be used to run inference in batches.
# We do this to avoid running out of memory when performing inference on a large number of records.
largest_partition_record_count = (
    sdf_test.group_by("GROUP_IDENTIFIER_STRING")
    .agg(F.count("*").alias("PARTITION_RECORD_COUNT"))
    .agg(F.max("PARTITION_RECORD_COUNT").alias("MAX_PARTITION_RECORD_COUNT"))
    .collect()[0]["MAX_PARTITION_RECORD_COUNT"]
)
batch_size = INFERENCE_APPROX_BATCH_SIZE
number_of_batches = math.ceil(largest_partition_record_count / batch_size)
inference_input_df = (
    inference_input_df.with_column(
        "BATCH_GROUP", F.abs(F.random(123)) % F.lit(number_of_batches)
    )
    .with_column(
        "PARTITION_ID",
        F.concat_ws(
            F.lit("__"), F.col("GROUP_IDENTIFIER_STRING"), F.col("BATCH_GROUP")
        ),
    )
    .drop("RANDOM_NUMBER", "BATCH_GROUP")
)

# Stats related to inference dataset
print(f"Inference data row count: {sdf_test.count()}")
print(f"Largest partition record count: {largest_partition_record_count}")
print(f"Number of partitions:   {inference_partition_count}")
print(f"Approx. Batch Size:     {batch_size}")
print(f"Approx. Number of batches for largest partition: {number_of_batches}")

# Perform inference from the model registry
session.use_warehouse(MODELING_WH)

inference_result = mv.run(
    inference_input_df,
    partition_column="PARTITION_ID",
).select(
    "_PRED_",
    F.col("GROUP_IDENTIFIER_STRING_OUT_").alias("GROUP_IDENTIFIER_STRING"),
    F.col(f"{TIME_PERIOD_COLUMN}_OUT_").alias(TIME_PERIOD_COLUMN),
)

# Bring in the ACTUALS as well as the GROUP_IDENTIFIER variant column into the results for easier filtering during evaluation
inference_result = inference_result.join(
    sdf_test.select(
        "GROUP_IDENTIFIER",
        "GROUP_IDENTIFIER_STRING",
        TIME_PERIOD_COLUMN,
        "MODEL_TARGET",
    ),
    on=["GROUP_IDENTIFIER_STRING", TIME_PERIOD_COLUMN],
    how="left",
)

# Add a column for the date on which we are running inference and a column for the future date for which we are forecasting
if train_separate_lead_models:
    inference_result = (
        inference_result.with_column_renamed(
            TIME_PERIOD_COLUMN, "INFERENCE_PERFORMED_ON_DTTM"
        )
        .with_column(
            "FUTURE_DTTM",
            F.dateadd(
                modeling_frequency,
                F.col("GROUP_IDENTIFIER").getItem("LEAD"),
                F.col("INFERENCE_PERFORMED_ON_DTTM"),
            ),
        )
        .select(
            "GROUP_IDENTIFIER",
            "GROUP_IDENTIFIER_STRING",
            "INFERENCE_PERFORMED_ON_DTTM",
            "FUTURE_DTTM",
            F.col("MODEL_TARGET").alias("ACTUAL"),
            F.col("_PRED_").alias("PREDICTED"),
        )
    )
else:
    inference_dttm = datetime.now()
    inference_result = (
        inference_result.with_column(
            "INFERENCE_PERFORMED_ON_DTTM", F.lit(inference_dttm)
        )
        .with_column_renamed(TIME_PERIOD_COLUMN, "FUTURE_DTTM")
        .select(
            "GROUP_IDENTIFIER",
            "GROUP_IDENTIFIER_STRING",
            "INFERENCE_PERFORMED_ON_DTTM",
            "FUTURE_DTTM",
            F.col("MODEL_TARGET").alias("ACTUAL"),
            F.col("_PRED_").alias("PREDICTED"),
        )
    )

# Cache result
inference_result = inference_result.cache_result()

# Switch back to the original warehouse
session.use_warehouse(SESSION_WH)

Evaluation is using model binaries created in this run. 

Inference data row count: 22750
Largest partition record count: 91
Number of partitions:   250
Approx. Batch Size:     200
Approx. Number of batches for largest partition: 1
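
The batching logic in the cell above sizes the number of batches from the largest partition and then assigns each row a random batch within its partition. The following plain-Python sketch is only an analogue of that logic: the helper `assign_partition_ids` and the sample data are hypothetical, and Python's `random` module stands in for Snowflake's `RANDOM` function.

```python
import math
import random


def assign_partition_ids(group_ids, batch_size, seed=123):
    """Return (partition_ids, number_of_batches) for a list of per-row group IDs."""
    counts = {}
    for group_id in group_ids:
        counts[group_id] = counts.get(group_id, 0) + 1

    # Size the batch count from the largest partition, as the cell above does
    number_of_batches = math.ceil(max(counts.values()) / batch_size)

    # Random assignment, so batch sizes are only approximately batch_size
    rng = random.Random(seed)
    partition_ids = [
        f"{group_id}__{rng.randrange(number_of_batches)}" for group_id in group_ids
    ]
    return partition_ids, number_of_batches


# Hypothetical data: one 450-row partition and one 91-row partition
rows = ["STORE_A"] * 450 + ["STORE_B"] * 91
partition_ids, number_of_batches = assign_partition_ids(rows, batch_size=200)
print(number_of_batches)  # 3
```

Because the batch count is derived from the largest partition, smaller partitions are split into the same number of batches. With the values printed above (largest partition of 91 rows, batch size of 200), every partition stays in a single batch.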


In [19]:
# Filter to a single lead for evaluation
if train_separate_lead_models:
    inference_result = inference_result.filter(
        F.col("GROUP_IDENTIFIER_STRING").endswith("LEAD_1")
    )

# Write the validation scores to a Snowflake table in case the user just wants to re-run the model performance cells without re-running the model training cells.
pred_table_name = f"VALIDATION_PREDS_FROM_{MODEL_NAME}_{mv.version_name}"
inference_result.write.save_as_table(
    pred_table_name,
    mode="overwrite",
    comment='{"origin":"sf_sit", "name":"sit_forecasting", "version":{"major":1, "minor":0}, "attributes":{"component":"validation"}}',
)
print(
    f"Validation predictions saved to Snowflake table: {pred_table_name}"
)

Validation predictions saved to Snowflake table: VALIDATION_PREDS_FROM_TEST_MODEL_1_PLASTIC_FALCON_3
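
When separate lead models are trained, the inference cell derives `FUTURE_DTTM` by shifting the inference timestamp forward by the partition's `LEAD`, in units of the modeling frequency. The following is a minimal sketch of that arithmetic, assuming a fixed-length frequency; the `STEP` lookup and the `future_dttm` helper are hypothetical, and the notebook itself uses Snowflake's `DATEADD`.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from a frequency name to a fixed step size
STEP = {
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
}


def future_dttm(inference_dttm, lead, frequency="day"):
    """Return the timestamp that a model with the given lead forecasts."""
    return inference_dttm + lead * STEP[frequency]


inference_date = datetime(2024, 10, 11)
print(future_dttm(inference_date, 1))          # 2024-10-12 00:00:00
print(future_dttm(inference_date, 7))          # 2024-10-18 00:00:00
print(future_dttm(inference_date, 2, "week"))  # 2024-10-25 00:00:00
```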


If the __notebook session ends__ after the run completes (for example, because the notebook was left idle), users can return to the model evaluation section below at a later time. Because the __scored validation set__ was saved as a __Snowflake table__ by the cell above, users can __re-activate__ this notebook and __re-run the evaluation cells__ below __without re-running inference on the validation set__. 

If the notebook has become inactive and users wish to re-run the evaluation cells below, they should follow these steps: 
1. Click __Start__ in the upper right corner of the notebook to activate a new session
2. At the top of this notebook, __run all of the cells above__ the ___md_model_training___ markdown cell
3. Run the ___test_feature_eng___ and the ___test_split___ cells above. (No need to re-run the inference cell.)
4. Click the 3 dots in the upper right corner of the cell below (___validation_scores_sdf___) and select __"Run all below"__ to re-run all the evaluation cells.


In [20]:
# validation_scores_sdf cell

# Create a DataFrame from the saved table
validation_scores = session.table(pred_table_name)

# Look at a couple rows of predictions
print(f"Number of partitions:  {inference_partition_count}")
print(f"Validation row count:  {validation_scores.count()}")
validation_scores.show(2)

Number of partitions:  250
Validation row count:  22750
--------------------------------------------------------------------------------------------------------------------------------------------
|"GROUP_IDENTIFIER"  |"GROUP_IDENTIFIER_STRING"  |"INFERENCE_PERFORMED_ON_DTTM"  |"FUTURE_DTTM"        |"ACTUAL"       |"PREDICTED"        |
--------------------------------------------------------------------------------------------------------------------------------------------
|{                   |STORE_ID_3_PRODUCT_ID_7    |2025-09-29 15:50:09.897759     |2024-10-21 00:00:00  |351.21763762   |350.6927185058594  |
|  "PRODUCT_ID": 7,  |                           |                               |                     |               |                   |
|  "STORE_ID": 3     |                           |                               |                     |               |                   |
|}                   |                           |                               |                

# Overall Performance

In [53]:
# Row-level metrics
row_actual_v_fcst = (
    validation_scores.with_column(
        "ABS_ERROR", F.abs(F.col("ACTUAL") - F.col("PREDICTED"))
    )
    .with_column(
        "APE",
        F.when(F.col("ACTUAL") == 0, F.lit(None)).otherwise(
            F.abs(F.col("ABS_ERROR") / F.col("ACTUAL"))
        ),
    )
    .with_column("SQ_ERROR", F.pow(F.col("ACTUAL") - F.col("PREDICTED"), 2))
)

# Metrics per partition
partition_metrics = row_actual_v_fcst.group_by("GROUP_IDENTIFIER_STRING").agg(
    F.avg("APE").alias("MAPE"),
    F.avg("ABS_ERROR").alias("MAE"),
    F.sqrt(F.avg("SQ_ERROR")).alias("RMSE"),
)

# Overall modeling process across all partitions
overall_avg_metrics = partition_metrics.agg(
    F.avg("MAPE").alias("OVERALL_MAPE"),
    F.avg("MAE").alias("OVERALL_MAE"),
    F.avg("RMSE").alias("OVERALL_RMSE"),
).with_column("AGGREGATION", F.lit("AVG"))
overall_median_metrics = partition_metrics.agg(
    F.median("MAPE").alias("OVERALL_MAPE"),
    F.median("MAE").alias("OVERALL_MAE"),
    F.median("RMSE").alias("OVERALL_RMSE"),
).with_column("AGGREGATION", F.lit("MEDIAN"))
overall_metrics = overall_avg_metrics.union(overall_median_metrics).select(
    "AGGREGATION", "OVERALL_MAPE", "OVERALL_MAE", "OVERALL_RMSE"
)

# Show the metrics
if inference_partition_count == 1:
    st.write(
        "There is only 1 partition, so these values are the metrics for that single model:"
    )
    st.dataframe(
        overall_median_metrics.select("OVERALL_MAPE", "OVERALL_MAE", "OVERALL_RMSE"),
        use_container_width=True,
    )
else:
    st.write("Avg and Median of each metric over all the partitions:")
    st.dataframe(overall_metrics, use_container_width=True)

2025-09-29 15:38:01.662 
  command:

    streamlit run /opt/anaconda3/envs/forecast/lib/python3.12/site-packages/ipykernel_launcher.py [ARGUMENTS]
2025-09-29 15:38:01.663 Please replace `use_container_width` with `width`.

`use_container_width` will be removed after 2025-12-31.

For `use_container_width=True`, use `width='stretch'`. For `use_container_width=False`, use `width='content'`.
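
The metrics in the cell above are computed in two stages: MAPE, MAE, and RMSE per partition, then the average and median of each metric across partitions. The following plain-Python sketch works through the same arithmetic on a tiny hypothetical dataset. The helper `partition_metrics` and the sample values are assumptions for illustration. Rows with an actual value of 0 are excluded from MAPE, mirroring the null APE in the notebook.

```python
import math
from statistics import mean, median


def partition_metrics(rows):
    """Compute MAPE, MAE, and RMSE from (actual, predicted) pairs of one partition."""
    abs_errors = [abs(actual - pred) for actual, pred in rows]
    # APE is undefined when the actual is 0, so those rows are skipped
    apes = [abs(actual - pred) / abs(actual) for actual, pred in rows if actual != 0]
    sq_errors = [(actual - pred) ** 2 for actual, pred in rows]
    return {
        "MAPE": mean(apes) if apes else None,
        "MAE": mean(abs_errors),
        "RMSE": math.sqrt(mean(sq_errors)),
    }


# Hypothetical (actual, predicted) pairs for two partitions
scores = {
    "STORE_A": [(100.0, 90.0), (200.0, 220.0), (0.0, 5.0)],
    "STORE_B": [(50.0, 50.0), (80.0, 100.0)],
}
per_partition = {name: partition_metrics(rows) for name, rows in scores.items()}

# Aggregate each metric across partitions, ignoring undefined values
overall = {}
for metric in ("MAPE", "MAE", "RMSE"):
    values = [m[metric] for m in per_partition.values() if m[metric] is not None]
    overall[metric] = {"AVG": mean(values), "MEDIAN": median(values)}

print(per_partition)
print(overall)
```

Averaging per-partition metrics weights every partition equally, regardless of its row count, which differs from computing one metric over all rows pooled together.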


# Partition Performance

In [54]:
if len(PARTITION_COLUMNS) > 0 and inference_partition_count > 1:
    # Metric Distribution plot with dynamic filtering
    metric = st.selectbox("Metric", ["MAPE", "MAE", "RMSE"])
    st.subheader(f"{metric} Distribution")
    distribution_df = partition_metrics.to_pandas()

    # Add a slider to filter outliers
    value_min, value_max = st.slider(
        f"Filter {metric} range in plot:",
        float(distribution_df[metric].min()),
        float(distribution_df[metric].max()),
        (float(distribution_df[metric].min()), float(distribution_df[metric].max())),
    )

    # Filter the DataFrame based on the slider values
    filtered_df = distribution_df[
        (distribution_df[metric] >= value_min) & (distribution_df[metric] <= value_max)
    ]

    fig = px.box(
        filtered_df,
        x=metric,  # Horizontal orientation
        points="all",  # Show individual data points as dots
        title=f"{metric} Distribution ({value_min:.2f} - {value_max:.2f})",
        labels={metric: metric, "GROUP_IDENTIFIER_STRING": "Partition"},
        hover_data=["GROUP_IDENTIFIER_STRING"],  # Add this for hover info
    )

    fig.update_layout(template="plotly_white", showlegend=True)
    st.plotly_chart(fig, use_container_width=True)

    # Layout with two columns
    col1, col2 = st.columns(2)

    # Column 1: Tables
    with col1:
        # Look at the best performing partitions
        st.subheader("BEST Performing Partitions")
        st.dataframe(partition_metrics.sort(F.abs(metric)))
    with col2:
        # Look at the worst performing partitions
        st.subheader("WORST Performing Partitions")
        st.dataframe(partition_metrics.sort(F.abs(metric).desc()))

2025-09-29 15:38:07.661 Session state does not function when running a script without `streamlit run`


In [55]:
# ------------------------------------------------------------------------------
# Visualize individual partition actual vs pred on a time series line chart
# ------------------------------------------------------------------------------
# Select a single partition to visualize
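# Extract the string values from the collected Row objects so the selectbox returns a plain string
partitions = sorted(
    row["GROUP_IDENTIFIER_STRING"]
    for row in validation_scores.select("GROUP_IDENTIFIER_STRING").distinct().collect()
)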
partition_choice = st.selectbox("Partition", partitions)
# Create a pandas dataframe
partition_choice_df = (
    validation_scores.filter(F.col("GROUP_IDENTIFIER_STRING") == partition_choice)
    .sort("FUTURE_DTTM")
    .to_pandas()
)
partition_choice_df["FUTURE_DTTM"] = pd.to_datetime(partition_choice_df["FUTURE_DTTM"])
tabs = st.tabs(
    [
        "Line Plot: Validation Actual & Predicted",
        "Scatter Plot: Validation Actual vs. Predicted",
        "Line Plot: Training Actuals",
    ]
)
# Validation Actuals & Predictions Line Plot
tabs[0].line_chart(partition_choice_df, x="FUTURE_DTTM", y=["ACTUAL", "PREDICTED"])

# Validation Actuals vs. Predictions Scatter Plot

# Plot: Prediction vs. Actual
fig_scatter = px.scatter(
    partition_choice_df,
    x="ACTUAL",
    y="PREDICTED",
    title="Predicted vs. Actual",
    labels={"ACTUAL": "Actual", "PREDICTED": "Predicted"},
    opacity=0.6,
    trendline="ols",  # Add trendline (linear regression)
    hover_data=["PREDICTED", "ACTUAL", "FUTURE_DTTM"],  # Include date in hover
)

# Add expected trendline (y = x)
min_actual = min(partition_choice_df["ACTUAL"])
max_actual = max(partition_choice_df["ACTUAL"])

fig_scatter.add_trace(
    go.Scatter(
        x=[min_actual, max_actual],
        y=[min_actual, max_actual],
        mode="lines",
        line=dict(color="black", dash="dash"),
        name="Expected Trend (y = x)",  # Add to legend
        showlegend=True,
    )
)

# Render the plot in Streamlit
tabs[1].plotly_chart(fig_scatter, use_container_width=True)

# Look at the partition's actuals in the TRAINING set to assess the overall trends
partition_choice_df = (
    final_sdf.filter(F.col("GROUP_IDENTIFIER_STRING") == partition_choice)
    .sort(TIME_PERIOD_COLUMN)
    .to_pandas()
)
partition_choice_df[TIME_PERIOD_COLUMN] = pd.to_datetime(
    partition_choice_df[TIME_PERIOD_COLUMN]
)
# Plot the data
if (TARGET_COLUMN.startswith('"') and TARGET_COLUMN.endswith('"')) or (
    TARGET_COLUMN.startswith("'") and TARGET_COLUMN.endswith("'")
):
    y_name = TARGET_COLUMN[1:-1]
else:
    y_name = TARGET_COLUMN
tabs[2].line_chart(partition_choice_df, x=TIME_PERIOD_COLUMN, y=y_name)



DeltaGenerator()

# Feature Importance

In [56]:
# Establish model binary DataFrame
if "udtf_models" in globals():
    models_sdf = udtf_models
else:
    models_sdf = (
        session.table(f"{MODEL_BINARY_STORAGE_TBL_NM}")
        .filter(F.col("MODEL_NAME") == MODEL_NAME)
        .filter(
            F.col("MODEL_VERSION")
            == reg.get_model(qualified_model_name).default.version_name
        )
    )
    print(
        f"Feature Importances are for model version {reg.get_model(qualified_model_name).default.version_name} in table {MODEL_BINARY_STORAGE_TBL_NM}."
    )

model_df = models_sdf.select(
    "MODEL_NAME", "GROUP_IDENTIFIER_STRING", "METADATA"
).to_pandas()


def preprocess_model_data(df):
    """Preprocess model data by extracting feature importance from the METADATA column.

    This function performs the following steps:
    1. Extracts the "feature_importance" dictionary from the "METADATA" column.
    2. Converts the extracted feature importance data into a new DataFrame where each row
       represents a feature and its corresponding importance for a specific model.

    Args:
        df (pd.DataFrame): Input DataFrame containing model data with at least
                           the columns "MODEL_NAME", "GROUP_IDENTIFIER_STRING",
                           and "METADATA".

    Returns:
        tuple:
            - pd.DataFrame: The original DataFrame with an additional "FEATURE_IMPORTANCE" column.
            - pd.DataFrame: A new DataFrame containing the extracted features and their importance,
              with columns ["MODEL_NAME", "GROUP_IDENTIFIER_STRING", "FEATURE", "IMPORTANCE"].

    """
    # Extract feature importance from METADATA
    df["FEATURE_IMPORTANCE"] = df["METADATA"].apply(
        lambda x: (
            json.loads(x).get("feature_importance", {})
            if isinstance(x, str)
            else x.get("feature_importance", {})
        )
    )

    # Explode feature importance into rows
    feature_rows = []
    for _, row in df.iterrows():
        for feature, importance in row["FEATURE_IMPORTANCE"].items():
            feature_rows.append(
                {
                    "MODEL_NAME": row["MODEL_NAME"],
                    "GROUP_IDENTIFIER_STRING": row["GROUP_IDENTIFIER_STRING"],
                    "FEATURE": feature,
                    "IMPORTANCE": importance,
                }
            )

    feature_df = pd.DataFrame(feature_rows)
    return df, feature_df


def calculate_average_rank(feature_df):
    """Calculate the average rank and importance of features across different group partitions.

    This function:
    1. Computes the rank of each feature within its "GROUP_IDENTIFIER_STRING" based on
       feature importance in descending order.
    2. Aggregates the average rank and average importance for each feature across all groups.
    3. Returns the feature DataFrame with calculated ranks and a summarized DataFrame
       sorted by average rank.

    Args:
        feature_df (pd.DataFrame): Input DataFrame containing extracted feature importance
                                   with at least the columns ["GROUP_IDENTIFIER_STRING",
                                   "FEATURE", "IMPORTANCE"].

    Returns:
        tuple:
            - pd.DataFrame: The input DataFrame with an additional "RANK" column.
            - pd.DataFrame: A new DataFrame containing features and their average rank and
              importance, with columns ["FEATURE", "AVERAGE_RANK", "AVERAGE_IMPORTANCE"].

    """
    feature_df = feature_df.copy()
    feature_df.loc[:, "RANK"] = feature_df.groupby("GROUP_IDENTIFIER_STRING")[
        "IMPORTANCE"
    ].rank(ascending=False)

    avg_rank_df = (
        feature_df.groupby("FEATURE")
        .agg({"RANK": "mean", "IMPORTANCE": "mean"})
        .reset_index()
    )

    avg_rank_df.rename(
        columns={"RANK": "AVERAGE_RANK", "IMPORTANCE": "AVERAGE_IMPORTANCE"},
        inplace=True,
    )
    avg_rank_df = avg_rank_df.sort_values("AVERAGE_RANK", ascending=True)
    return feature_df, avg_rank_df


def plot_feature_importance(df, is_aggregated=True, top_n=20):
    """Create a horizontal bar plot to visualize feature importance.

    This function generates a feature importance plot based on whether the data
    is aggregated (showing average ranks across groups) or unaggregated (showing
    importance for a selected partition).

    Args:
        df (pd.DataFrame): DataFrame containing feature importance data.
                           Expected columns:
                           - If `is_aggregated=True`: ["FEATURE", "AVERAGE_RANK"]
                           - If `is_aggregated=False`: ["FEATURE", "IMPORTANCE"]
        is_aggregated (bool, optional): If True, plots average rank of features
                                        across groups. If False, plots raw importance
                                        for a single partition. Default is True.
        top_n (int, optional): Number of top features to display in the plot.
                               Default is 20.

    Returns:
        plotly.graph_objects.Figure: A bar plot visualizing the top feature importance.

    """
    if is_aggregated:
        df = df.sort_values("AVERAGE_RANK", ascending=True).head(top_n)
        x_col = "AVERAGE_RANK"
        title = "Top Feature Importance (Aggregated by Average Rank)"
        fig = px.bar(
            df,
            x=x_col,
            y="FEATURE",
            orientation="h",
            title=title,
            labels={"FEATURE": "Feature", x_col: "Average Rank"},
        )

        fig.update_layout(
            yaxis=dict(categoryorder="total descending"),
            xaxis_title="Average Rank",
            yaxis_title="Feature",
            margin=dict(l=50, r=50, t=50, b=50),
        )
    else:
        df = df.sort_values("IMPORTANCE", ascending=False).head(top_n)
        x_col = "IMPORTANCE"
        title = "Top Feature Importance for Selected Partition"

        fig = px.bar(
            df,
            x=x_col,
            y="FEATURE",
            orientation="h",
            title=title,
            labels={"FEATURE": "Feature", x_col: "Importance"},
        )

        fig.update_layout(
            yaxis=dict(categoryorder="total ascending"),
            xaxis_title="Importance",
            yaxis_title="Feature",
            margin=dict(l=50, r=50, t=50, b=50),
        )

    return fig


# Load and preprocess the data
model_df, feature_df = preprocess_model_data(model_df)

# Select Partition Model ID
partition_models = model_df["GROUP_IDENTIFIER_STRING"].unique()
selected_partition_model = st.selectbox(
    "Select Partition", [None] + sorted(partition_models)
)

# Filter data based on selections
filtered_feature_df = feature_df
if selected_partition_model:
    filtered_feature_df = filtered_feature_df[
        filtered_feature_df["GROUP_IDENTIFIER_STRING"] == selected_partition_model
    ]

# Select Top N Features
top_n = st.slider("Number of Top Features to Show", min_value=5, max_value=50, value=20)


# Display Feature Importance
st.subheader("Feature Importance")

if selected_partition_model:
    fig = plot_feature_importance(filtered_feature_df, is_aggregated=False, top_n=top_n)
else:
    filtered_feature_df, avg_rank_df = calculate_average_rank(filtered_feature_df)
    fig = plot_feature_importance(avg_rank_df, is_aggregated=True, top_n=top_n)

st.plotly_chart(fig, use_container_width=True)

# Expander for Underlying Data
with st.expander("Show Underlying Data"):
    if selected_partition_model:
        st.dataframe(filtered_feature_df.sort_values("IMPORTANCE", ascending=False))
    else:
        tabs = st.tabs(["Average Importance", "Individual Importance"])
        tabs[0].dataframe(avg_rank_df.sort_values("AVERAGE_RANK", ascending=True))
        tabs[1].dataframe(
            filtered_feature_df.sort_values("IMPORTANCE", ascending=False)
        )
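
The aggregated view above ranks features within each partition by importance (rank 1 is the most important) and then averages those ranks per feature across partitions. The following plain-Python sketch illustrates that calculation. The helper `average_rank` and the importance values are hypothetical. Unlike pandas `rank`, which averages tied ranks by default, this sketch does not treat ties specially.

```python
from collections import defaultdict


def average_rank(importances):
    """Map {partition: {feature: importance}} to [(feature, avg_rank)], best first."""
    ranks = defaultdict(list)
    for features in importances.values():
        # Rank 1 goes to the most important feature within the partition
        ordered = sorted(features, key=features.get, reverse=True)
        for position, feature in enumerate(ordered, start=1):
            ranks[feature].append(position)
    averages = {feature: sum(r) / len(r) for feature, r in ranks.items()}
    return sorted(averages.items(), key=lambda item: item[1])


# Hypothetical feature importances for three partitions
importances = {
    "STORE_A": {"FEATURE_1": 0.5, "MONTH_SIN": 0.3, "YEAR": 0.2},
    "STORE_B": {"FEATURE_1": 0.2, "MONTH_SIN": 0.7, "YEAR": 0.1},
    "STORE_C": {"FEATURE_1": 0.6, "MONTH_SIN": 0.3, "YEAR": 0.1},
}
ranking = average_rank(importances)
print([feature for feature, _ in ranking])  # ['FEATURE_1', 'MONTH_SIN', 'YEAR']
```

Averaging ranks instead of raw importances keeps partitions with unusually large importance values from dominating the summary.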



# Outlier Analysis

In [57]:
# Row-level metrics
n_outliers = st.slider("Number of Outliers", 5, 1000, 20)
outlier_sdf = (
    validation_scores.with_column(
        "ABS_ERROR", F.abs(F.col("ACTUAL") - F.col("PREDICTED"))
    )
    .with_column(
        "APE",
        F.when(F.col("ACTUAL") == 0, F.lit(None)).otherwise(
            F.abs(F.col("ABS_ERROR") / F.col("ACTUAL"))
        ),
    )
    .with_column("SQ_ERROR", F.pow(F.col("ACTUAL") - F.col("PREDICTED"), 2))
    .order_by(F.col("APE").desc())
    .limit(n_outliers)
    .select(
        "GROUP_IDENTIFIER_STRING",
        "FUTURE_DTTM",
        "ACTUAL",
        "PREDICTED",
        "ABS_ERROR",
        "APE",
    )
)

st.dataframe(outlier_sdf, use_container_width=True)
cols = st.columns(2)
# Select a single partition to visualize
outlier_partition = cols[0].selectbox(
    "Outlier Partition", outlier_sdf.select("GROUP_IDENTIFIER_STRING").distinct()
)
outlier_partition_df = outlier_sdf.filter(
    F.col("GROUP_IDENTIFIER_STRING") == outlier_partition
)
outlier_date = cols[1].selectbox(
    "Outlier Date",
    outlier_partition_df.select(F.col("FUTURE_DTTM").cast("STRING")).distinct(),
)

# Validation Actuals & Predictions Line Plot
partition_df = (
    validation_scores.filter(
        F.col("GROUP_IDENTIFIER_STRING") == outlier_partition
    ).sort("FUTURE_DTTM")
).to_pandas()
selected_date = pd.to_datetime(outlier_date)

# Create time series plot
fig = px.line(
    partition_df,
    x="FUTURE_DTTM",
    y=["ACTUAL", "PREDICTED"],
    markers=True,
    title=f"Actual vs Predicted: {outlier_partition}",
)
fig.add_scatter(
    x=[selected_date],
    y=partition_df.loc[partition_df["FUTURE_DTTM"] == selected_date, "ACTUAL"],
    mode="markers",
    marker=dict(size=12, color="red", symbol="star"),
    name="Selected Outlier",
)

st.plotly_chart(fig, use_container_width=True)


# View Features
outlier_feature_sdf = final_sdf_test.filter(
    (F.col("GROUP_IDENTIFIER_STRING") == outlier_partition)
    & (F.col(TIME_PERIOD_COLUMN).cast("STRING") == outlier_date)
)
st.subheader("Date Features")
st.dataframe(
    outlier_feature_sdf.drop(
        "GROUP_IDENTIFIER", "GROUP_IDENTIFIER_STRING", "MODEL_TARGET"
    )
)

2025-09-29 15:39:06.074 Please replace `use_container_width` with `width`.

`use_container_width` will be removed after 2025-12-31.

For `use_container_width=True`, use `width='stretch'`. For `use_container_width=False`, use `width='content'`.


DeltaGenerator()
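
The row-level metrics in the cell above can be checked on a small local sample. The following is a minimal pandas sketch of the same logic (absolute error, APE left undefined when the actual value is zero, then the top-N rows by APE). The function name `top_ape_outliers` and the toy data are hypothetical and are not part of the notebook; the notebook itself computes these columns in Snowpark.

```python
import pandas as pd


def top_ape_outliers(scores: pd.DataFrame, n_outliers: int = 20) -> pd.DataFrame:
    """Return the n rows with the largest absolute percentage error (APE)."""
    out = scores.copy()
    out["ABS_ERROR"] = (out["ACTUAL"] - out["PREDICTED"]).abs()
    # APE is undefined when ACTUAL is zero, so the denominator is masked to NaN there
    out["APE"] = (out["ABS_ERROR"] / out["ACTUAL"].where(out["ACTUAL"] != 0)).abs()
    # na_position="last" keeps rows with undefined APE out of the top of the ranking
    return out.sort_values("APE", ascending=False, na_position="last").head(n_outliers)


# Hypothetical toy data with the same column names as validation_scores
toy_scores = pd.DataFrame(
    {
        "GROUP_IDENTIFIER_STRING": ["A", "A", "B", "B"],
        "ACTUAL": [100.0, 0.0, 50.0, 10.0],
        "PREDICTED": [90.0, 5.0, 100.0, 10.0],
    }
)
toy_outliers = top_ape_outliers(toy_scores, n_outliers=2)
print(toy_outliers)
```

Because APE divides by the actual value, partitions with small actuals tend to dominate this ranking; sorting by `ABS_ERROR` instead is a reasonable alternative when that is not desired.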

-----
# Clean up
-----

In [58]:
# If SAVE_MODEL_VERSION_THIS_RUN is False, delete the model version built in this run from the registry.
# NOTE: Comment this code out if you do not want to delete the model version from the model registry.
current_model = reg.get_model(qualified_model_name)

if not SAVE_MODEL_VERSION_THIS_RUN:
    deletion_message = ""
    try:
        current_model.version(mv.version_name)
    except Exception:
        deletion_message = f"WARNING: Model version '{mv.version_name}' does not exist in the registry."
        print(deletion_message)

    if len(deletion_message) == 0:
        try:
            if len(current_model.versions()) == 0:
                print(
                    f"WARNING: There are no versions for model '{MODEL_NAME}' in the registry."
                )
            elif (
                len(current_model.versions()) == 1
                and current_model.default.version_name == mv.version_name
            ):
                reg.delete_model(MODEL_NAME)
                print(
                    f"Model '{MODEL_NAME}' (which only had one version: '{mv.version_name}') was deleted from the registry."
                )
            else:
                current_model.delete_version(mv.version_name)
                print(
                    f"Model version '{mv.version_name}' was deleted from the registry."
                )
        except Exception as e:
            # Include the underlying error so a failed deletion can be diagnosed
            print(
                f"WARNING: Model version '{mv.version_name}' could not be deleted from the registry: {e}"
            )

    reg.show_models()
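
The branching in the cleanup cell above can be summarized as a small decision function. This is a sketch only: `plan_cleanup` and its return labels are hypothetical names introduced for illustration, and the function works on plain strings rather than calling the model registry.

```python
from typing import Optional, Sequence


def plan_cleanup(
    version_names: Sequence[str],
    default_version: Optional[str],
    target_version: str,
) -> str:
    """Decide which cleanup action applies to the version built in this run."""
    # Nothing to delete if the target version is not registered (this covers an empty registry entry)
    if target_version not in version_names:
        return "skip"
    # The target is the only version and also the default: remove the whole model
    if len(version_names) == 1 and default_version == target_version:
        return "delete_model"
    # Otherwise remove just this version and leave the rest of the model intact
    return "delete_version"


print(plan_cleanup(["V1"], "V1", "V1"))
print(plan_cleanup(["V1", "V2"], "V1", "V2"))
```

Separating the decision from the registry calls makes the three outcomes easy to verify without touching a real model.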

In [60]:
mv.lineage(direction='upstream')


The current snowflake-ml-python version out of date, package upgrade recommended (current=1.11.0, recommended>=1.12.0)



[FeatureView(_name=FORECAST_FEATURES, _entities=[Entity(name=TS_PARTITION_ENTITY, join_keys=['GROUP_IDENTIFIER_STRING'], owner=None, desc=)], _feature_df=<snowflake.snowpark.dataframe.DataFrame object at 0x323a12bd0>, _timestamp_col=ORDER_TIMESTAMP, _desc=, _infer_schema_df=<snowflake.snowpark.dataframe.DataFrame object at 0x323a1f170>, _query=SELECT 
     "ORDER_TIMESTAMP", 
     "TARGET", 
     "FEATURE_1", 
     "YEAR", 
     sin(((6.283185307179586 * "MONTH") / 12)) AS "MONTH_SIN", 
     cos(((6.283185307179586 * "MONTH") / 12)) AS "MONTH_COS", 
     sin(((6.283185307179586 * "WEEK_OF_YEAR") / 52)) AS "WEEK_OF_YEAR_SIN", 
     cos(((6.283185307179586 * "WEEK_OF_YEAR") / 52)) AS "WEEK_OF_YEAR_COS", 
      CAST (("DAY_OF_WEEK" = 0) AS INT) AS "DAY_OF_WEEK_SUN", 
      CAST (("DAY_OF_WEEK" = 1) AS INT) AS "DAY_OF_WEEK_MON", 
      CAST (("DAY_OF_WEEK" = 2) AS INT) AS "DAY_OF_WEEK_TUE", 
      CAST (("DAY_OF_WEEK" = 3) AS INT) AS "DAY_OF_WEEK_WED", 
      CAST (("DAY_OF_WEEK" = 4) AS I