# udf_transform examples

This notebook demonstrates how to use the **udf_transform** module.

The primary purpose of **udf_transform** is to make the encoders/scalers created by the **preprocessing** module usable where the Snowpark DataFrame API cannot be used, for example when the transformation has to be done using only SQL.

This notebook has two parts:
1) Showing how to use the different transform and inverse functions for UDFs
2) Showing how to use them in Python UDFs (scalar and tabular)

## Initial setup

In [None]:
from snowflake.snowpark import Session
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T

import cachetools

# Print the version of Snowpark we are using
from importlib.metadata import version
version('snowflake_snowpark_python')

In [None]:
import json

In [None]:
import preprocessing as pp
import udf_transform as ut

In [None]:
connection_parameters = {
    "account": "MY DEMO ACCOUNT",
    "user": "MY USER",
    "password": "MY PASSWORD",
    "warehouse": "MY COMPUTE WH",
    "database": "MY DATABASE",
    "schema": "MY SCHEMA"
}

In [None]:
session = Session.builder.configs(connection_parameters).create()
# An f-string avoids a TypeError if any of the session getters returns None
print(f"Current role: {session.get_current_role()}, Current schema: {session.get_fully_qualified_current_schema()}, Current WH: {session.get_current_warehouse()}")

Start by creating a dataset that can be used for both scaling and encoding.

By caching the result into a new DataFrame we avoid running the generation SQL every time the DataFrame is used.

In [None]:
state = '["AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL", "GA", "HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD", "ME", "MI", "MN", "MO", "MS", "MT", "NC", "ND", "NE", "NH", "NJ", "NM", "NV", "NY", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VA", "VT", "WA", "WI", "WV", "WY"]'
area_code = '[408, 415, 510]'
intl_plan =  '["no", "yes"]'

df_gen_data = session.range(1000).with_columns(["STATES", "AREA_CODES", "INTL_PLANS"], 
                                         [F.parse_json(F.lit(state)), F.parse_json(F.lit(area_code)), F.parse_json(F.lit(intl_plan))])\
                            .select(F.col("ID").as_("CUST_ID"), F.as_varchar(F.get(F.col("STATES"), (F.call_builtin("zipf", F.lit(1), F.lit(51), F.random()) -1))).as_("STATE"),\
                                    F.get(F.col("AREA_CODES"), (F.call_builtin("zipf", F.lit(1), F.lit(3), F.random())) -1).as_("AREA_CODE"),\
                                    F.as_varchar(F.get(F.col("INTL_PLANS"), (F.call_builtin("zipf", F.lit(1), F.lit(2), F.random()))-1)).as_("INTL_PLAN"),\
                                    F.uniform(0, 100, F.random()).as_("CALLS"), F.uniform(0, 100, F.random()).as_("MINS"),F.uniform(0, 100, F.random()).as_("DATA"),\
                                    F.uniform(0.5, 10.9, F.random()).as_("DAY_CHARGE"),F.uniform(5.5, 15.1, F.random()).as_("INTL_CHARGE"))

df_test = df_gen_data.cache_result()


In [None]:
df_test.show()

## Introduction to the udf_transform functions
The purpose of the **udf_transform** module is to make the encoders/scalers fitted on Snowpark DataFrames with the **preprocessing** module usable where Snowpark DataFrames cannot be used, for example when the transformation is to be done with only SQL.

The **udf_transform** module has a transformer function for every scaler/encoder and in many cases also a function to invert the scaling/encoding.

For each scaler and encoder in **preprocessing** there is a function in **udf_transform** that does the transformation based on the fitted values.
### Scalers
**udf_transform** has the following functions for Scalers:
* udf_minmax_transform
* udf_minmax_inverse_transform
* udf_standard_transform
* udf_standard_inverse_transform
* udf_maxabs_transform
* udf_maxabs_inverse_transform
* udf_robust_transform
* udf_robust_inverse_transform
* udf_normalizer_transform
* udf_binarizer_transform

Input data can be a list or a numpy array.

Start by generating data to use with the scalers.

In [None]:
array_test = [[86, 10.2665], [34, 2.2345], [13, 8.1465], [66, 7.45]]
scaler_input_cols=["CALLS", "DAY_CHARGE"]
scaler_output_cols = ["calls_scaled", "day_charge_scaled"]

#### udf_minmax_transform


Start by fitting a scaler using the **preprocessing** module. Once it is fitted, we can use the **get_udf_encoder** method to get a dictionary that can be used for the transformation.

In [None]:
mms = pp.MinMaxScaler(input_cols=scaler_input_cols, output_cols=scaler_output_cols)
mms.fit(df_test)
mms_udf = mms.get_udf_encoder()
mms_udf

**udf_minmax_transform** scales a list of lists (one list for each row) using the fitted values in **mms_udf** and returns a list of the same shape as the input list.

In [None]:
mms_encoded_data = ut.udf_minmax_transform(array_test, mms_udf)
mms_encoded_data

**udf_minmax_inverse_transform** converts the scaled data back to the original values.

In [None]:
ut.udf_minmax_inverse_transform(mms_encoded_data, mms_udf)
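
As a rough illustration of what the two calls above compute, the sketch below applies min-max scaling and its inverse in plain Python. The `fitted` dictionary and both helper functions are hypothetical stand-ins: the real layout of the dictionary returned by **get_udf_encoder** may differ, and the minimum and maximum values are made up rather than fitted on `df_test`.

```python
# Hypothetical fitted values; NOT the actual structure returned by get_udf_encoder()
fitted = {"min": [0.0, 0.5], "max": [100.0, 10.5]}

def minmax_transform_sketch(rows, fitted):
    """Scale each column to the range [0, 1] using the fitted min and max."""
    return [
        [(v - lo) / (hi - lo) for v, lo, hi in zip(row, fitted["min"], fitted["max"])]
        for row in rows
    ]

def minmax_inverse_sketch(rows, fitted):
    """Map scaled values back to the original range."""
    return [
        [v * (hi - lo) + lo for v, lo, hi in zip(row, fitted["min"], fitted["max"])]
        for row in rows
    ]

rows = [[86, 10.2665], [34, 2.2345]]
scaled = minmax_transform_sketch(rows, fitted)
restored = minmax_inverse_sketch(scaled, fitted)
```

The output keeps the shape of the input (one inner list per row), and applying the inverse to the scaled rows gives back the original values up to floating point rounding.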

#### udf_standard_transform

For more examples of how to use the **StandardScaler**, see the **preprocessing_demo** notebook.

In [None]:
sss = pp.StandardScaler(input_cols=scaler_input_cols, output_cols=scaler_output_cols)
sss.fit(df_test)
sss_udf = sss.get_udf_encoder()
sss_encoded_data = ut.udf_standard_transform(array_test, sss_udf)
sss_encoded_data

**udf_standard_inverse_transform** converts the scaled data back to the original values.

In [None]:
ut.udf_standard_inverse_transform(sss_encoded_data, sss_udf)
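
For intuition, standard scaling can be sketched in plain Python as subtracting a fitted mean and dividing by a fitted standard deviation. The `fitted` dictionary, its keys, and the helper names below are assumptions made for illustration; they are not the layout used by **get_udf_encoder**.

```python
# Hypothetical fitted values; NOT the actual structure returned by get_udf_encoder()
fitted = {"mean": [50.0, 5.0], "scale": [25.0, 2.5]}

def standard_transform_sketch(rows, fitted):
    """Subtract the fitted mean and divide by the fitted standard deviation, per column."""
    return [
        [(v - m) / s for v, m, s in zip(row, fitted["mean"], fitted["scale"])]
        for row in rows
    ]

def standard_inverse_sketch(rows, fitted):
    """Undo the scaling: multiply by the standard deviation and add the mean back."""
    return [
        [v * s + m for v, m, s in zip(row, fitted["mean"], fitted["scale"])]
        for row in rows
    ]

z_scores = standard_transform_sketch([[75, 7.5], [25, 2.5]], fitted)
originals = standard_inverse_sketch(z_scores, fitted)
```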

#### udf_maxabs_transform

For more examples of how to use the **MaxAbsScaler**, see the **preprocessing_demo** notebook.

In [None]:
mas = pp.MaxAbsScaler(input_cols=scaler_input_cols, output_cols=scaler_output_cols)
mas.fit(df_test)
mas_udf = mas.get_udf_encoder()
mas_encoded_data =  ut.udf_maxabs_transform(array_test, mas_udf)
mas_encoded_data

**udf_maxabs_inverse_transform** converts the scaled data back to the original values.

In [None]:
ut.udf_maxabs_inverse_transform(mas_encoded_data, mas_udf)

#### udf_robust_transform

For more examples of how to use the **RobustScaler**, see the **preprocessing_demo** notebook.

In [None]:
rs = pp.RobustScaler(input_cols=scaler_input_cols, output_cols=scaler_output_cols)
rs.fit(df_test)
rs_udf = rs.get_udf_encoder()
rs_encoded_data = ut.udf_robust_transform(array_test, rs_udf)
rs_encoded_data

**udf_robust_inverse_transform** converts the scaled data back to the original values.

In [None]:
ut.udf_robust_inverse_transform(rs_encoded_data, rs_udf)

#### udf_normalizer_transform

For more examples of how to use the **Normalizer**, see the **preprocessing_demo** notebook.

In [None]:
ns = pp.Normalizer(input_cols=scaler_input_cols, output_cols=scaler_output_cols)
ns.fit(df_test)
ns_udf = ns.get_udf_encoder()
ut.udf_normalizer_transform(array_test, ns_udf)
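
Unlike the column-wise scalers, a normalizer usually rescales each row on its own. The sketch below assumes the L2 (Euclidean) norm, which is the scikit-learn default; whether **preprocessing** uses the same default is an assumption here, and the helper name is hypothetical.

```python
import math

def l2_normalize_sketch(rows):
    """Divide every row by its Euclidean length; all-zero rows stay all zero."""
    result = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        result.append([v / norm if norm else 0.0 for v in row])
    return result

normalized = l2_normalize_sketch([[3.0, 4.0], [0.0, 0.0]])
```

After normalization each non-zero row has length 1, so no inverse function is available: the original row lengths are not stored anywhere.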

#### udf_binarizer_transform

For more examples of how to use the **Binarizer**, see the **preprocessing_demo** notebook.

In [None]:
bs = pp.Binarizer(input_cols=scaler_input_cols, output_cols=scaler_output_cols)
bs.fit(df_test)
bs_udf = bs.get_udf_encoder()

ut.udf_binarizer_transform(array_test, bs_udf)
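
A binarizer maps every value to 0 or 1 by comparing it with a threshold. The sketch below is modelled on the scikit-learn behaviour (values strictly greater than the threshold become 1); the `threshold` parameter name, its default, and the helper itself are assumptions and may not match the **preprocessing** implementation.

```python
def binarize_sketch(rows, threshold=0.0):
    """Return 1 for values strictly above the threshold, otherwise 0."""
    return [[1 if v > threshold else 0 for v in row] for row in rows]

binary = binarize_sketch([[86, 10.2665], [34, 2.2345]], threshold=50)
```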

### Encoders
**udf_transform** has the following functions for Encoders:
* udf_ordinal_transform
* udf_ordinal_inverse_transform
* udf_onehot_transform
* udf_onehot_inverse_transform
* udf_label_transform



In [None]:
encoder_input_cols = ["STATE", "AREA_CODE", "INTL_PLAN"]
array_encoder_test = [['KS', 415, 'no'], ['OH', 415, 'no']]
array_encoder_unk = [['XX', 415, 'yes'], ['ZZ', 351, 'XY'], ['WI', 351, 'XY']]

#### udf_onehot_transform

For more examples of how to use the **OneHotEncoder**, see the **preprocessing_demo** notebook.

In [None]:
ohe = pp.OneHotEncoder(input_cols=encoder_input_cols)
ohe.fit(df_test)
ohe_udf = ohe.get_udf_encoder()
ohe_encoded_data = ut.udf_onehot_transform(array_encoder_test, ohe_udf)
ohe_encoded_data

**udf_onehot_inverse_transform** converts the encoded values back to the original ones.

In [None]:
ut.udf_onehot_inverse_transform(ohe_encoded_data, ohe_udf)

Unknown data is handled in the same way as by the **OneHotEncoder** **transform** method: it is ignored by default.

In [None]:
ohe_unk_encoded_data = ut.udf_onehot_transform(array_encoder_unk, ohe_udf)
ohe_unk_encoded_data

**udf_onehot_inverse_transform** returns None for unknown values.

In [None]:
ut.udf_onehot_inverse_transform(ohe_unk_encoded_data, ohe_udf)

If we use **handle_unknown**='keep', there will be one extra element for each input column to flag unknown values.

In [None]:
ohe_keep_unk = pp.OneHotEncoder(input_cols=encoder_input_cols, handle_unknown='keep')
ohe_keep_unk.fit(df_test)
ohe_keep_unk_udf = ohe_keep_unk.get_udf_encoder()
ohe_keep_unk_encoded_data = ut.udf_onehot_transform(array_encoder_unk, ohe_keep_unk_udf)
ohe_keep_unk_encoded_data

Using **udf_onehot_inverse_transform** with **handle_unknown**='keep' will still return None for unknown values.

In [None]:
ut.udf_onehot_inverse_transform(ohe_keep_unk_encoded_data, ohe_keep_unk_udf)
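
The difference between ignoring and keeping unknown values can be sketched in plain Python. Everything below is hypothetical: the `categories` list stands in for the fitted values, the helpers are not part of **udf_transform**, and placing the extra "unknown" flag last in each column block is an assumption about the layout.

```python
# Hypothetical fitted categories per input column; not the real get_udf_encoder() layout
categories = [["KS", "OH", "WI"], [408, 415, 510], ["no", "yes"]]

def onehot_sketch(rows, categories, handle_unknown="ignore"):
    """One flag per fitted category; with 'keep', one extra flag per column for unknowns."""
    encoded = []
    for row in rows:
        out = []
        for value, cats in zip(row, categories):
            flags = [1 if value == c else 0 for c in cats]
            if handle_unknown == "keep":
                # Assumed position: the extra flag is appended after the known categories
                flags.append(0 if value in cats else 1)
            out.extend(flags)
        encoded.append(out)
    return encoded

def onehot_inverse_sketch(rows, categories):
    """Decode rows produced with handle_unknown='ignore'; unknown values become None."""
    decoded = []
    for row in rows:
        out, pos = [], 0
        for cats in categories:
            flags = row[pos:pos + len(cats)]
            out.append(cats[flags.index(1)] if 1 in flags else None)
            pos += len(cats)
        decoded.append(out)
    return decoded

ignored = onehot_sketch([["XX", 415, "yes"]], categories)
kept = onehot_sketch([["XX", 415, "yes"]], categories, handle_unknown="keep")
decoded = onehot_inverse_sketch(ignored, categories)
```

With 'ignore', an unknown value leaves all flags of its column at 0, which is why the inverse can only report None for it.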

#### udf_ordinal_transform

For more examples of how to use the **OrdinalEncoder**, see the **preprocessing_demo** notebook.

In [None]:
oe = pp.OrdinalEncoder(input_cols=encoder_input_cols)
oe.fit(df_test)
oe_udf = oe.get_udf_encoder()
oe_encoded_data = ut.udf_ordinal_transform(array_encoder_test, oe_udf)
oe_encoded_data

**udf_ordinal_inverse_transform** converts the encoded values back to the original ones.

In [None]:
ut.udf_ordinal_inverse_transform(oe_encoded_data, oe_udf)

Unknown data is handled in the same way as by the **OrdinalEncoder** **transform** method: it is ignored by default.

In [None]:
oe_unk_encoded_data = ut.udf_ordinal_transform(array_encoder_unk, oe_udf)
oe_unk_encoded_data

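The same applies to **udf_ordinal_inverse_transform**. As a reminder, this is the input data with unknown values: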

In [None]:
array_encoder_unk

In [None]:
ut.udf_ordinal_inverse_transform(oe_unk_encoded_data, oe_udf)

If **handle_unknown**="use_encoded_value", the **unknown_value** value will be used for unknown values.

In [None]:
oe_unk = pp.OrdinalEncoder(input_cols=encoder_input_cols, handle_unknown="use_encoded_value", unknown_value=999)
oe_unk.fit(df_test)
oe_unk_udf = oe_unk.get_udf_encoder()
ut.udf_ordinal_transform(array_encoder_unk, oe_unk_udf)
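
As a minimal sketch of the idea, ordinal encoding replaces each value with its position in the list of fitted categories and falls back to a chosen value for anything unseen. The `categories` list and the helper are hypothetical, and returning None when no `unknown_value` is given is an assumption about the default behaviour.

```python
# Hypothetical fitted categories per input column; not the real get_udf_encoder() layout
categories = [["KS", "OH", "WI"], [408, 415, 510], ["no", "yes"]]

def ordinal_sketch(rows, categories, unknown_value=None):
    """Replace each value by its position in the fitted category list."""
    return [
        [cats.index(v) if v in cats else unknown_value for v, cats in zip(row, categories)]
        for row in rows
    ]

codes = ordinal_sketch([["OH", 415, "no"], ["ZZ", 351, "XY"]], categories, unknown_value=999)
```

Choosing an `unknown_value` far outside the range of real codes, such as 999 above, makes unknown rows easy to spot downstream.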

#### udf_label_transform

The **udf_label_transform** function expects a list with one element for each row.

For more examples of how to use the **LabelEncoder**, see the **preprocessing_demo** notebook.

In [None]:
y_data = [['yes'], ['yes'], ['no']]
le = pp.LabelEncoder(input_col="INTL_PLAN", output_col="INTL_PLAN_ENCODED")
le.fit(df_test)
le_udf = le.get_udf_encoder()

ut.udf_label_transform(y_data, le_udf)

## Using the udf transform functions with Python UDF
The **udf_transform** functions return numpy arrays, so every UDF that uses them needs to include the numpy library as a package and convert the returned data to a Python list before returning it to Snowflake.

When using a UDF transformer in a Python UDF there are different ways to deploy it:
* By embedding the encoder, returned by the **get_udf_encoder** method, as a variable
* By providing the encoder, returned by the **get_udf_encoder** method, as a parameter to the UDF function
* By storing the encoder, returned by the **get_udf_encoder** method, as a file and loading it in the UDF

We can also use a scalar or tabular UDF, depending on how we want the values returned.

In [None]:
session.sql('CREATE OR REPLACE STAGE udf_transform_stage').collect()

Start by creating a scalar UDF function that uses the encoder as an embedded variable.

In [None]:
encoder = mms_udf
def minmax_transform(data: list):
    import udf_transform as ut
    # encoder variable needs to be set outside this function before deploying
    return ut.udf_minmax_transform(data, encoder).tolist()

In [None]:
udf_minmax = session.udf.register(minmax_transform, 
                                                 name="minmax_transform_udf",
                                                 is_permanent=True,
                                                 stage_location='@udf_transform_stage', 
                                                 imports=["udf_transform"],
                                                 packages=["numpy"],
                                                 input_types=[T.ArrayType()],
                                                 return_type=T.ArrayType(),
                                                 replace=True)

In [None]:
udf_test_df = session.create_dataframe(array_test, schema=scaler_input_cols)
udf_test_df.show()

In [None]:
udf_test_df.select(*scaler_input_cols, F.call_udf("minmax_transform_udf", F.array_construct(*scaler_input_cols))).show()

If we want the scaled values returned as columns we can use a Tabular UDF.

Checking **input_features** tells us how many parameters our function needs.

In [None]:
encoder['input_features']

In [None]:
class minmax_transform_udtf:
    def process(self, calls: int, day_charge:float):
        import udf_transform as ut
        data = [calls, day_charge]
        trans_vals = ut.udf_minmax_transform(data, encoder)
        yield tuple(trans_vals)
            

In [None]:
udtf_minmax = session.udtf.register(minmax_transform_udtf, 
                                                 name="minmax_transform_udtf",
                                                 is_permanent=True,
                                                 stage_location='@udf_transform_stage', 
                                                 imports=["udf_transform"],
                                                 packages=["numpy"],
                                                 output_schema=T.StructType([T.StructField("calls_scaled", T.FloatType()), T.StructField("day_charge_scaled", T.FloatType())]), 
                                                 input_types=[T.IntegerType(), T.FloatType()],
                                                 replace=True)


In [None]:
udf_test_df.join_table_function(udtf_minmax(F.col("CALLS"), F.col("DAY_CHARGE"))).show()
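
Because a UDTF handler is an ordinary Python class, its `process` method can be exercised locally before it is registered. The sketch below uses a hypothetical handler with made-up minimum and maximum values in place of the fitted encoder, so it runs without a Snowflake session or the **udf_transform** module.

```python
class MinMaxHandlerSketch:
    """Hypothetical handler with the same shape as a UDTF class: process() yields row tuples."""

    # Made-up fitted values standing in for the encoder dictionary
    col_min = [0.0, 0.5]
    col_max = [100.0, 10.5]

    def process(self, calls, day_charge):
        values = [calls, day_charge]
        # One tuple per output row, one element per output column
        yield tuple(
            (v - lo) / (hi - lo)
            for v, lo, hi in zip(values, self.col_min, self.col_max)
        )

# Calling process() directly returns a generator; list() collects the yielded rows
local_rows = list(MinMaxHandlerSketch().process(86, 10.5))
```

Each yielded tuple must have as many elements as the `output_schema` has fields, which is worth checking locally since a mismatch otherwise only shows up when the query runs.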

Passing the encoder as a parameter to the UDF function

In [None]:
def minmax_encoder_transform(data: list, udf_encoder: dict):
    import udf_transform as ut
    return ut.udf_minmax_transform(data, udf_encoder).tolist()

In [None]:
udf_encoder_minmax = session.udf.register(minmax_encoder_transform, 
                                                 name="minmax_encoder_transform_udf",
                                                 is_permanent=True,
                                                 stage_location='@udf_transform_stage', 
                                                 imports=["udf_transform"],
                                                 packages=["numpy"],
                                                 input_types=[T.ArrayType(), T.VariantType()],
                                                 return_type=T.ArrayType(),
                                                 replace=True)

Since the object returned from **get_udf_encoder** is a dictionary, we need to convert it to a JSON string first and then use the **parse_json** function to pass it into the UDF.

In [None]:
para_encoder = json.dumps(mms_udf)

In [None]:
udf_test_df.select(*scaler_input_cols, F.call_udf("minmax_encoder_transform_udf", F.array_construct(*scaler_input_cols), F.parse_json(F.lit(para_encoder)))).show()
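
The JSON round trip can be checked locally before the string is sent to Snowflake. In the sketch below `encoder_sketch` is a hypothetical dictionary, not the real output of **get_udf_encoder**, and `json.loads` stands in for what **parse_json** does on the Snowflake side.

```python
import json

# Hypothetical encoder dictionary; the real one comes from get_udf_encoder()
encoder_sketch = {
    "input_features": ["CALLS", "DAY_CHARGE"],
    "min": [0.0, 0.5],
    "max": [100.0, 10.9],
}

as_text = json.dumps(encoder_sketch)
round_tripped = json.loads(as_text)  # stand-in for PARSE_JSON on the Snowflake side

# JSON object keys are always strings, so non-string keys change type in the round trip
int_keys = json.loads(json.dumps({1: "a"}))
```

If a dictionary survives this round trip unchanged it is safe to pass as a parameter; keys that are not strings are the usual source of surprises.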

When reading the encoder from a stage we need to first store it as a file

In [None]:
with open('./mms_encoder.json', 'w') as f:
    json.dump(mms_udf, f)

The function for the UDF then needs to use the **snowflake_import_directory** setting to get the storage location before reading the file.

In [None]:
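@cachetools.cached(cache={})
def read_file(filename):
    import json
    import os
    import sys

    # Snowflake exposes the directory that holds the UDF imports through this option
    import_dir = sys._xoptions.get("snowflake_import_directory")
    if import_dir:
        # The with block closes the file once the encoder has been loaded
        with open(os.path.join(import_dir, filename)) as f:
            return json.load(f)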
        
def minmax_file_transform(data: list):
    import udf_transform as ut

    udf_encoder = read_file('mms_encoder.json')
    return ut.udf_minmax_transform(data, udf_encoder).tolist()

In [None]:
udf_file_minmax = session.udf.register(minmax_file_transform, 
                                                 name="minmax_file_transform",
                                                 is_permanent=True,
                                                 stage_location='@udf_transform_stage', 
                                                 imports=["udf_transform", "mms_encoder.json"],
                                                 packages=["numpy", "cachetools"],
                                                 input_types=[T.ArrayType()],
                                                 return_type=T.ArrayType(),
                                                 replace=True)

In [None]:
udf_test_df.select(*scaler_input_cols, F.call_udf("minmax_file_transform", F.array_construct(*scaler_input_cols))).show()
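
The reason for caching the file read is that the UDF function is called many times, while the encoder file only needs to be parsed once. The local sketch below illustrates this with `functools.lru_cache` from the standard library as a stand-in for `cachetools.cached`; the file name, the helper, and the counter are hypothetical and only there to make the effect visible.

```python
import functools
import json
import os
import tempfile

reads = {"count": 0}  # counts how often the file is actually opened

@functools.lru_cache(maxsize=None)  # stdlib stand-in for cachetools.cached(cache={})
def read_encoder_sketch(path):
    reads["count"] += 1
    with open(path) as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "encoder_sketch.json")
    with open(path, "w") as f:
        json.dump({"min": [0.0], "max": [1.0]}, f)
    first = read_encoder_sketch(path)
    second = read_encoder_sketch(path)  # served from the cache, the file is not read again
```

Both calls return the same dictionary object, so code that uses a cached encoder should treat it as read-only.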

Using encoders with Tabular UDFs

In [None]:
print(oe_udf['input_features'])
print(oe_udf['output_cols'])

In [None]:
class ordinal_encode_udtf:
    def process(self, state: str, area_code:str, intl_plan: str):
        import udf_transform as ut
        data = [state, area_code, intl_plan]
        trans_vals = ut.udf_ordinal_transform(data, oe_udf)
        yield tuple(trans_vals)

In [None]:
udtf_ordinal = session.udtf.register(ordinal_encode_udtf, 
                                                 name="ordinal_encode_udtf",
                                                 is_permanent=True,
                                                 stage_location='@udf_transform_stage', 
                                                 imports=["udf_transform"],
                                                 packages=["numpy"],
                                                 output_schema=T.StructType([T.StructField("state_ordinal", T.StringType()), T.StructField("area_code_ordinal", T.StringType()), T.StructField("intl_plan_ordinal", T.StringType())]), 
                                                 input_types=[T.StringType(), T.StringType(), T.StringType()],
                                                 replace=True)


In [None]:
udtf_test_df = session.create_dataframe(array_encoder_test, schema=encoder_input_cols)
udtf_test_df.show()

In [None]:
udtf_test_df.join_table_function(udtf_ordinal(F.col("STATE"), F.to_char(F.col("AREA_CODE")), F.col("INTL_PLAN"))).show()

The **OneHotEncoder** usually generates a lot of columns, since the output is based on the distinct values found for each column during fit. So instead of typing each column by hand, we can loop through the **output_cols** field of the UDF encoder to generate the output schema.

In [None]:
output_cols = ohe_udf['output_cols']
fields = []
for col in output_cols:
    for col_nm in output_cols[col]:
        fields.append(T.StructField(col_nm, T.StringType()))
output_schema = T.StructType(fields)

In [None]:
output_schema

Checking **input_features** tells us how many parameters our function needs.

In [None]:
ohe_udf['input_features']

In [None]:
class onehot_encode_udtf:
    def process(self, state: str, area_code:str, intl_plan: str):
        import udf_transform as ut
        data = [state, area_code, intl_plan]
        trans_vals = ut.udf_onehot_transform(data, ohe_udf)
        yield tuple(trans_vals)

In [None]:
udtf_onehot = session.udtf.register(onehot_encode_udtf, 
                                                 name="onehot_encode_udtf",
                                                 is_permanent=True,
                                                 stage_location='@udf_transform_stage', 
                                                 imports=["udf_transform"],
                                                 packages=["numpy"],
                                                 output_schema=output_schema, 
                                                 input_types=[T.StringType(), T.StringType(), T.StringType()],
                                                 replace=True)

In [None]:
udtf_test_df.join_table_function(udtf_onehot(F.col("STATE"), F.to_char(F.col("AREA_CODE")), F.col("INTL_PLAN"))).to_pandas()