# Vectorized UDTFs: Building Forecasting Models in Parallel

This notebook provides an example of building multiple forecasting models in parallel using vectorized UDTFs. Unlike regular UDTFs, which process their input row by row, the vectorized variant operates on batches of rows in one pass, which makes processing the input data more efficient.

For more details, refer to the documentation: [Vectorized UDTFs](https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-tabular-vectorized). 

In this notebook, we will use a UDTF with the vectorized `end_partition` method. This is useful when you want to:

1. Process your data partition by partition instead of row by row. This fits our forecasting use case: we do not need to do any processing on each individual row for a particular SKU, country, or other logical entity; we want to process the whole partition in one go.
2. Return multiple rows or columns for each partition.
3. Use libraries that operate natively on pandas DataFrame objects.

## Imports: 

**Important**: Vectorized UDTFs require the Snowpark Library for Python version 1.14.0 or later!

In [1]:
import snowflake.snowpark
from snowflake.snowpark.session import Session
import snowflake.snowpark.types as T
import snowflake.snowpark.functions as F
from snowflake.snowpark.functions import col

from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import IntegerType, FloatType, StringType,StructType, StructField

import json
import pandas as pd
from datetime import date, timedelta

import warnings
warnings.filterwarnings("ignore")

In [2]:
#Check that the installed Snowpark version is 1.14.0 or later
print(snowflake.snowpark.__version__)

1.16.0
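
To fail fast rather than read the version off the printed output, the check can be made programmatic. This is a minimal sketch that assumes a plain `major.minor.patch` version string; `parse_version`, `meets_minimum`, and `MIN_SNOWPARK` are illustrative names and not part of Snowpark. Comparing tuples of integers avoids the trap of string comparison, where `"1.9.2"` would sort after `"1.14.0"`.

```python
# Hypothetical helpers (not part of Snowpark): compare dotted version strings.
MIN_SNOWPARK = (1, 14, 0)

def parse_version(version: str) -> tuple:
    """Turn '1.16.0' into (1, 16, 0); assumes purely numeric components."""
    return tuple(int(part) for part in version.split(".")[:3])

def meets_minimum(version: str, minimum: tuple = MIN_SNOWPARK) -> bool:
    """True when the version is at least the required minimum."""
    return parse_version(version) >= minimum

# In the notebook this would be: meets_minimum(snowflake.snowpark.__version__)
print(meets_minimum("1.16.0"))  # True
print(meets_minimum("1.9.2"))   # False
```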


In [4]:
#Connect to Snowflake:  
with open('/Users/hapatel/.config/creds.json') as f: #close the file handle once the credentials are read
    connection_parameters = json.load(f)
session = Session.builder.configs(connection_parameters).create()
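
The credentials file itself is not shown in this notebook. As a sketch, assuming it is a flat JSON object whose keys match Snowpark connection parameters such as `account`, `user`, and `password`, a small loader can report missing keys before a session is created. `load_connection_parameters` and `REQUIRED_KEYS` are hypothetical names, and the set of required keys is an assumption that depends on your authentication method.

```python
import json
import os
import tempfile

# Assumption: the minimum keys this notebook's setup needs.
REQUIRED_KEYS = ("account", "user")

def load_connection_parameters(path: str) -> dict:
    """Read a JSON credentials file and check that required keys exist."""
    with open(path) as f:  # the context manager closes the file handle
        params = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in params]
    if missing:
        raise KeyError(f"missing connection parameters: {missing}")
    return params

# Demo with a throwaway file holding placeholder values (not real credentials).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "creds.json")
    with open(demo_path, "w") as f:
        json.dump({"account": "<account>", "user": "<user>", "password": "<password>"}, f)
    demo_params = load_connection_parameters(demo_path)

print(sorted(demo_params))
```

The returned dictionary could then be passed to `Session.builder.configs(...)` exactly as in the cell above.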

## Environment Set Up: 

We will be using a dataset called `time_series_1k.csv`, which contains 1000 time series. We want to build a forecasting model for each of them.

We will read this data in locally and create a Snowflake table to work with.

In [3]:
import pandas as pd 

In [4]:
df = pd.read_csv('time_series_1k.csv')
df.head()

Unnamed: 0,DATE,SERIES_ID,TRAFFIC
0,2018-01-01,1,119
1,2018-01-02,1,138
2,2018-01-03,1,134
3,2018-01-04,1,124
4,2018-01-05,1,103


In [5]:
#Confirm the number of values for each series: 
df['SERIES_ID'].value_counts()

SERIES_ID
1       2046
672     2046
659     2046
660     2046
661     2046
        ... 
339     2046
340     2046
341     2046
342     2046
1000    2046
Name: count, Length: 1000, dtype: int64

We have a total of 1000 unique `SERIES_ID` values, each with 2046 observations. Next, we will write the table into Snowflake.
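
The same check can be asserted rather than read off the output. Below is a minimal sketch on a small synthetic frame; the toy data and the name `toy` are made up for illustration, and in the notebook the same expressions would run on `df`.

```python
import pandas as pd

# Toy stand-in for time_series_1k.csv: 3 series with 4 daily observations each.
dates = pd.date_range("2018-01-01", periods=4, freq="D")
toy = pd.DataFrame({
    "DATE": list(dates.strftime("%Y-%m-%d")) * 3,
    "SERIES_ID": [sid for sid in (1, 2, 3) for _ in range(4)],
    "TRAFFIC": range(12),
})

counts = toy["SERIES_ID"].value_counts()
n_series = counts.size                 # number of distinct series
same_length = counts.nunique() == 1    # True when every series has equal length
no_duplicate_days = not toy.duplicated(["SERIES_ID", "DATE"]).any()

print(n_series, same_length, no_duplicate_days)
```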

In [5]:
session.use_database('DEMO')
session.use_schema('PUBLIC')

In [None]:
session.write_pandas(df = df, table_name = 'TIME_SERIES_1K', auto_create_table = True)

In [6]:
#Read snowpark dataframe to confirm upload: 
sdf_raw = session.table('TIME_SERIES_1K')
sdf_raw.show()

----------------------------------------
|"DATE"      |"SERIES_ID"  |"TRAFFIC"  |
----------------------------------------
|2018-01-01  |1            |119        |
|2018-01-02  |1            |138        |
|2018-01-03  |1            |134        |
|2018-01-04  |1            |124        |
|2018-01-05  |1            |103        |
|2018-01-06  |1            |88         |
|2018-01-07  |1            |93         |
|2018-01-08  |1            |116        |
|2018-01-09  |1            |141        |
|2018-01-10  |1            |147        |
----------------------------------------



In [8]:
sdf_raw.schema

StructType([StructField('DATE', StringType(16777216), nullable=True), StructField('SERIES_ID', LongType(), nullable=True), StructField('TRAFFIC', LongType(), nullable=True)])

## Vectorized UDTF: 

In this section we will implement the body of our vectorized UDTF. We will use the Prophet library for forecasting and build an independent model for each partition. For API reference details, [see here](https://docs.snowflake.com/developer-guide/snowpark/reference/python/latest/snowpark/api/snowflake.snowpark.udtf.UDTFRegistration#snowflake.snowpark.udtf.UDTFRegistration).

### Defining the Input/Output Schema: 
The first step is to define the data types the function accepts and returns. In this case, we will pass in 2 columns from our input table (`DATE` and `TRAFFIC`) and return 6 columns. The `SERIES_ID` column is not passed to the function; it is only used to partition the data. Pro tip: use the `.schema` attribute of the Snowpark DataFrame to look up the input types (see above).

In [7]:
#Define the input/output schema to be used: 
input_types = [T.StringType(), T.LongType()]

output_schema = T.StructType([
    T.StructField("TIMESTAMP", T.DateType()), #Date of the observation/forecast
    T.StructField("FORECAST", T.FloatType()), #Model output for the forecast
    T.StructField("TRAIN_START", T.DateType()), #Date at which the training time period started
    T.StructField("TRAIN_END", T.DateType()), #Date at which the training time period ended
    T.StructField("FORECAST_HORIZON", T.IntegerType()), #Length in days we are forecasting into the future for
    T.StructField("LIBRARY_VERSION", T.StringType()) #Collect metadata for the Prophet library version we are using
])

### Defining the Function body: 

Unlike regular UDTFs, a vectorized UDTF does not have to process each row individually. In our use case, we skip the `process` method that is usually required and implement only the `end_partition` method. The `end_partition` method receives a pandas DataFrame as input and can return or yield either a pandas DataFrame or a tuple of pandas Series or pandas arrays, where each element is one output column.
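
Because `end_partition` is an ordinary method that takes and yields pandas DataFrames, the handler logic can be exercised locally before anything is registered. The sketch below is a stand-in under stated assumptions: `NaiveForecast` is a hypothetical handler that repeats the last observed value instead of fitting Prophet, and the `groupby` loop only imitates how each partition is handed to the handler.

```python
import pandas as pd

class NaiveForecast:
    """Hypothetical handler: forecasts the last observed value (not Prophet)."""
    horizon = 3  # days to forecast; illustrative value

    def end_partition(self, df: pd.DataFrame):
        df = df.assign(ds=pd.to_datetime(df["DATE"])).sort_values("ds")
        future = pd.date_range(df["ds"].max() + pd.Timedelta(days=1),
                               periods=self.horizon, freq="D")
        yield pd.DataFrame({"TIMESTAMP": future,
                            "FORECAST": float(df["TRAFFIC"].iloc[-1])})

# Toy input: two series with three daily observations each.
toy = pd.DataFrame({
    "DATE": ["2018-01-01", "2018-01-02", "2018-01-03"] * 2,
    "SERIES_ID": [1, 1, 1, 2, 2, 2],
    "TRAFFIC": [119, 138, 134, 10, 20, 30],
})

# Imitate PARTITION BY SERIES_ID: one handler instance and one call per group.
results = []
for series_id, part in toy.groupby("SERIES_ID"):
    for out in NaiveForecast().end_partition(part[["DATE", "TRAFFIC"]]):
        results.append(out.assign(SERIES_ID=series_id))

local_forecast = pd.concat(results, ignore_index=True)
print(local_forecast)
```

Running the handler this way makes it easier to debug column names and dtypes with a normal Python traceback before moving the logic into Snowflake.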

In [8]:
class forecast: 
    def end_partition(self, df: pd.DataFrame) -> pd.DataFrame: #Need to have type annotation to mark as a vectorized UDTF!
        """Reads in the data for the logical partition, and builds a forecasting model"""
        #Imports: 
        import prophet

        #Pre-process
        df['ds'] = pd.to_datetime(df['DATE'])
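        #Sum TRAFFIC per day; selecting the column explicitly avoids passing 'TRAFFIC' as sum()'s numeric_only argument
        df = df.groupby('ds', as_index = False)['TRAFFIC'].sum()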
        df = df.rename(columns = {'TRAFFIC':'y'})
        df = df[['ds', 'y']]
        df = df.sort_values(by=['ds']).reset_index(drop = True)

        #set training parameters: 
        train_length = 600
        forecast_horizon = 30
        train_end = max(df['ds'])
        train_start = train_end - pd.Timedelta(days = train_length)

        #get training data: 
        df = df.loc[(df['ds'] > train_start) & (df['ds'] <= train_end)]
        
        #train model and predict: 
        model = prophet.Prophet()
        model.fit(df)
        future = model.make_future_dataframe(periods = forecast_horizon)
        forecast = model.predict(future)

        #post-process forecast results; rename returns a new frame, which avoids SettingWithCopyWarning below
        forecast = forecast[['ds','yhat']].rename(columns = {'ds':'TIMESTAMP', 'yhat':'FORECAST'})
        forecast['TRAIN_START'] = train_start
        forecast['TRAIN_END'] = train_end
        forecast['FORECAST_HORIZON'] = forecast_horizon
        forecast['LIBRARY_VERSION'] = str(prophet.__version__)

        yield forecast

#Mark end_partition as vectorized so that it receives each partition as a pandas DataFrame
forecast.end_partition._sf_vectorized_input = pd.DataFrame
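
The training-window arithmetic in the handler can be checked without pandas or Prophet. The sketch below uses only the standard library and assumes, as the earlier output suggests, 2046 consecutive daily observations starting on 2018-01-01. The variable names mirror the handler.

```python
from datetime import date, timedelta

# Assumption: 2046 consecutive daily observations starting 2018-01-01.
dates = [date(2018, 1, 1) + timedelta(days=i) for i in range(2046)]

train_length = 600
forecast_horizon = 30
train_end = max(dates)
train_start = train_end - timedelta(days=train_length)

# Same filter as the handler: strictly after train_start, up to train_end.
window = [d for d in dates if train_start < d <= train_end]

print(train_start, train_end)          # 2021-12-16 2023-08-08
print(window[0], len(window))          # 2021-12-17 600
print(len(window) + forecast_horizon)  # 630
```

Because the filter is strict on the left, `TRAIN_START` is the day before the first training observation. Since `make_future_dataframe` includes the history by default, each series should return 630 rows: 600 fitted values plus 30 future days.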

### Register the UDTF with Snowflake

Having implemented the Python handler method that acts on each batch of rows, we now register this UDTF with Snowflake using the [UDTF registration method](https://docs.snowflake.com/developer-guide/snowpark/reference/python/latest/snowpark/api/snowflake.snowpark.udtf.UDTFRegistration.register):

In [9]:
forecast_udtf = session.udtf.register(
    handler = forecast, 
    output_schema = output_schema, 
    input_types = input_types, 
    name = 'VECTORIZED_UDTF_PROPHET', 
    stage_location = "@DEMO.PUBLIC.TMP_STAGE", 
    packages=['pandas==1.5.3','prophet', 'holidays==0.18', 'snowflake-snowpark-python','tqdm'], 
    replace = True,
    is_permanent= True, #We want to persist this
    input_names = ['DATE', 'TRAFFIC'] #pass in the column names that you want to make use of in the function body
)

Comments: 
1. Pass the `input_names` argument in the registration call to name the input columns as they will appear inside the Python handler. If it is omitted, the columns default to `ARG1, ARG2, ...`.
2. Make sure the handler method is annotated with a type signature that accepts a pandas DataFrame as input and returns one as output; otherwise, calling the function will throw an error.

### Call the UDTF on our Data: 

In [9]:
#take sample of our data
input_df = sdf_raw.filter(F.col('SERIES_ID').isin([1,2,3,4,5]))
input_df.count()

10230

In [10]:
#Call the UDTF
forecast_sdf = input_df.select(F.col('SERIES_ID'),
                               forecast_udtf("DATE", "TRAFFIC").over(partition_by = ['SERIES_ID']),
                               )

In [11]:
forecast_sdf.show()

-------------------------------------------------------------------------------------------------------------------------
|"SERIES_ID"  |"TIMESTAMP"  |"FORECAST"          |"TRAIN_START"  |"TRAIN_END"  |"FORECAST_HORIZON"  |"LIBRARY_VERSION"  |
-------------------------------------------------------------------------------------------------------------------------
|4            |2021-12-17   |139.35180714549912  |2021-12-16     |2023-08-08   |30                  |1.1.3              |
|4            |2021-12-18   |127.18513288834738  |2021-12-16     |2023-08-08   |30                  |1.1.3              |
|4            |2021-12-19   |132.0650153488563   |2021-12-16     |2023-08-08   |30                  |1.1.3              |
|4            |2021-12-20   |148.9449512200764   |2021-12-16     |2023-08-08   |30                  |1.1.3              |
|4            |2021-12-21   |165.7318834669096   |2021-12-16     |2023-08-08   |30                  |1.1.3              |
|4            |2021-12-2

In [51]:
complete_predictions = sdf_raw.select(F.col('SERIES_ID'),
                               forecast_udtf("DATE", "TRAFFIC").over(partition_by = ['SERIES_ID']),
                               )

In [52]:
complete_predictions.write.save_as_table('PROPHET_VECTORIZED_UDTF', mode = "overwrite")
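
Since the UDTF was registered with `is_permanent = True`, it should also be callable from SQL. This is a sketch, assuming the table and function names used above resolve in the current database and schema; `build_forecast_query` is a hypothetical helper that only assembles the statement.

```python
def build_forecast_query(table: str = "TIME_SERIES_1K",
                         udtf: str = "VECTORIZED_UDTF_PROPHET") -> str:
    """Assemble a partitioned table-function call (hypothetical helper)."""
    return (
        f"SELECT t.SERIES_ID, f.* "
        f"FROM {table} AS t, "
        f"TABLE({udtf}(t.DATE, t.TRAFFIC) OVER (PARTITION BY t.SERIES_ID)) AS f"
    )

query = build_forecast_query()
print(query)
# In the notebook: session.sql(query).show()
```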