# preprocessing examples


This notebook demonstrates how to use the **preprocessing** module of **sp4py_utilities**.

The purpose of the **preprocessing** module is to provide preprocessing functionality for Snowpark DataFrames similar to that of the [sklearn.preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) module. Having separate **fit** and **transform** methods makes it possible to use a fitted scaler/encoder in another pipeline by saving it as a file object. If you want to use a fitted scaler/encoder with Snowflake without using Snowpark DataFrames, you can use the functions in the **udf_transform** module; see the **udf_transform_demo** notebook for details.

Currently the following scalers and encoders are implemented:
* MinMaxScaler: Transform each column by scaling each feature to a given range.
* StandardScaler: Standardize features by removing the mean and scaling to unit variance.
* MaxAbsScaler: Scale each column by its maximum absolute value.
* RobustScaler: Scale features using statistics that are robust to outliers.
* Normalizer: Normalize each row individually to unit norm.
* Binarizer: Binarize data (set feature values to 0 or 1) according to a threshold.
* OneHotEncoder: Encode categorical features as one-hot columns.
* OrdinalEncoder: Encodes a string column of labels to a column of label indices. The indices are in [0, number of labels].
* LabelEncoder: A label indexer that maps a string column of labels to a column of label indices.

This notebook has the following sections:
* Scalers - examples of how to use the scalers
* Encoders - examples of how to use the encoders
* Using the scalers/encoders in a Python Stored Procedure

In [None]:
# Snowpark
from snowflake.snowpark import Session
import snowflake.snowpark.functions as F

import joblib
import io

# Print the version of Snowpark we are using
from importlib.metadata import version
version('snowflake_snowpark_python')

In [None]:
# The preprocessing module
import preprocessing as pp

Connect to Snowflake

In [None]:
connection_parameters = {
    "account": "MY DEMO ACCOUNT",
    "user": "MY USER",
    "password": "MY PASSWORD",
    "warehouse": "MY COMPUTE WH",
    "database": "MY DATABASE",
    "schema": "MY SCHEMA"
}

session = Session.builder.configs(connection_parameters).create()
print("Current role: " + session.get_current_role() + ", Current schema: " + session.get_fully_qualified_current_schema() + ", Current WH: " + session.get_current_warehouse())

Start by creating a dataset that can be used for both scaling and encoding.

By caching the result into a new DataFrame we avoid running the generation SQL every time the DataFrame is used.

In [None]:
state = '["AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL", "GA", "HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD", "ME", "MI", "MN", "MO", "MS", "MT", "NC", "ND", "NE", "NH", "NJ", "NM", "NV", "NY", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VA", "VT", "WA", "WI", "WV", "WY"]'
area_code = '[408, 415, 510]'
intl_plan =  '["no", "yes"]'

df_gen_data = session.range(1000).with_columns(["STATES", "AREA_CODES", "INTL_PLANS"], 
                                         [F.parse_json(F.lit(state)), F.parse_json(F.lit(area_code)), F.parse_json(F.lit(intl_plan))])\
                            .select(F.col("ID").as_("CUST_ID"), F.as_varchar(F.get(F.col("STATES"), (F.call_builtin("zipf", F.lit(1), F.lit(51), F.random()) -1))).as_("STATE"),\
                                    F.get(F.col("AREA_CODES"), (F.call_builtin("zipf", F.lit(1), F.lit(3), F.random())) -1).as_("AREA_CODE"),\
                                    F.as_varchar(F.get(F.col("INTL_PLANS"), (F.call_builtin("zipf", F.lit(1), F.lit(2), F.random()))-1)).as_("INTL_PLAN"),\
                                    F.uniform(0, 100, F.random()).as_("CALLS"), F.uniform(0, 100, F.random()).as_("MINS"),F.uniform(0, 100, F.random()).as_("DATA"),\
                                    F.uniform(0.5, 10.9, F.random()).as_("DAY_CHARGE"),F.uniform(5.5, 15.1, F.random()).as_("INTL_CHARGE"))

df_test = df_gen_data.cache_result()


In [None]:
df_test.show()

## Scalers
Since we are going to test different scalers, we can set variables for our input columns, i.e. which columns to scale, and our output columns, i.e. the names of the scaled columns.

If we do not provide input columns, all numeric columns in the Snowpark DataFrame will be used, and if we do not provide output columns, the scaled columns will replace the input columns.

In [None]:
scaler_input_cols=["CALLS", "DAY_CHARGE"]
scaler_output_cols = ["calls_scaled", "day_charge_scaled"]

### MinMaxScaler

The MinMaxScaler will transform each column by scaling each feature to a given range, default 0-1.
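
The arithmetic behind min-max scaling can be sketched in plain Python. This is an illustration only, not the sp4py_utilities implementation: the helper names (`fit_min_max`, `min_max_transform`, `min_max_inverse`) and the sample values are made up, and the formula is assumed to follow the same definition scikit-learn uses.

```python
# Plain-Python sketch of min-max scaling (hypothetical helpers, made-up data).
def fit_min_max(values):
    """Return the statistics learned during fit: the column min and max."""
    return {"min": min(values), "max": max(values)}


def min_max_transform(values, fitted, feature_range=(0.0, 1.0)):
    """Map each value from [min, max] onto feature_range."""
    lo, hi = feature_range
    span = fitted["max"] - fitted["min"]  # a constant column (span 0) is not handled here
    return [lo + (v - fitted["min"]) / span * (hi - lo) for v in values]


def min_max_inverse(scaled, fitted, feature_range=(0.0, 1.0)):
    """Undo min_max_transform using the same fitted statistics."""
    lo, hi = feature_range
    span = fitted["max"] - fitted["min"]
    return [fitted["min"] + (s - lo) / (hi - lo) * span for s in scaled]


calls = [10.0, 20.0, 40.0, 50.0]
fitted = fit_min_max(calls)
scaled = min_max_transform(calls, fitted)
shifted = min_max_transform(calls, fitted, feature_range=(1.0, 2.0))
restored = min_max_inverse(scaled, fitted)
```

Because the inverse only needs the fitted min and max, new data can be transformed and reversed with the same fitted object.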

After fitting it with a DataFrame (which needs to contain the columns given in **input_cols**), we can, for example, see what values were fitted with the **fitted_values_** attribute.

In [None]:
mms = pp.MinMaxScaler(input_cols=scaler_input_cols, output_cols=scaler_output_cols)
mms.fit(df_test)
mms.fitted_values_

Scale a DataFrame. Since we have set **input_cols** and **output_cols**, the returned DataFrame will have new columns for the scaled values.

In [None]:
mms_tr_df = mms.transform(df_test)
mms_tr_df.show()

We can reverse the scaling by using the **inverse_transform** method. The reversed values will be in the output columns.

In [None]:
mms.inverse_transform(mms_tr_df).show()

We can fit and transform in one go with **fit_transform**

In [None]:
mms.fit_transform(df_test).show()

If we want to save the fitted scaler so it can be used in another Python script, we can do that with pickle or joblib.

In [None]:
joblib.dump(mms, 'my_min_max_scaler.joblib') 

We can then load it into a new variable and use it

In [None]:
loaded_mms = joblib.load('my_min_max_scaler.joblib')
loaded_mms.transform(df_test).show()
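
The same round trip works with pickle from the standard library. The sketch below uses a plain dict of made-up fitted statistics as a stand-in for a fitted scaler, and the file name is hypothetical. When a real scaler object is unpickled, the module that defines its class must be importable in the loading script.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted scaler: only its learned statistics (made-up values).
fitted_state = {"CALLS": {"min": 0.0, "max": 99.0}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_min_max_scaler.pkl")

    # Save the fitted state to a file
    with open(path, "wb") as f:
        pickle.dump(fitted_state, f)

    # Load it back, as another script would do
    with open(path, "rb") as f:
        loaded_state = pickle.load(f)
```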

By default the feature range is 0-1, but that can be changed with the **feature_range** parameter

In [None]:
pp.MinMaxScaler(input_cols=scaler_input_cols, output_cols=scaler_output_cols, feature_range=(1,2)).fit_transform(df_test).show()

Transform new data with the previously fitted scaler

In [None]:
df_new_data = session.create_dataframe([[56, 1.987], [32, 9.689]], schema=scaler_input_cols)
df_new_data.show()

In [None]:
mms.transform(df_new_data).show()

### StandardScaler

Standardize features by removing the mean and scaling to unit variance.

By default it centers the data before scaling and scales the data to unit standard deviation.
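
As a rough illustration of the arithmetic, here is a plain-Python sketch of standardization. The function name `standardize` and the sample values are made up, and the use of the population standard deviation is an assumption borrowed from scikit-learn; the module itself may compute it differently.

```python
import statistics


def standardize(values, with_mean=True, with_std=True):
    """Center on the mean and/or divide by the standard deviation."""
    mean = statistics.fmean(values) if with_mean else 0.0
    # Assumption: population standard deviation, as scikit-learn uses
    std = statistics.pstdev(values) if with_std else 1.0
    return [(v - mean) / std for v in values]


day_charge = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5.0, std 2.0
z_scores = standardize(day_charge)
centered_only = standardize(day_charge, with_std=False)
unchanged = standardize(day_charge, with_mean=False, with_std=False)
```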

To save a fitted scaler for later use, see the MinMaxScaler examples above.

In [None]:
sss = pp.StandardScaler(input_cols=scaler_input_cols, output_cols=scaler_output_cols)
sss.fit(df_test)
sss.fitted_values_

In [None]:
sss_tr_df = sss.transform(df_test)
sss_tr_df.show()

We can reverse the scaling by using the **inverse_transform** method. The reversed values will be in the output columns.

In [None]:
sss.inverse_transform(sss_tr_df).show()

Setting **with_mean**=False will disable the centering of data before scaling. With **with_std**=True the data will be scaled to unit standard deviation.

In [None]:
pp.StandardScaler(input_cols=scaler_input_cols, output_cols=scaler_output_cols, with_mean=False, with_std=True).fit_transform(df_test).show()

Setting **with_std**=False will disable the scaling of the data to unit standard deviation. With **with_mean**=True the data will only be centered.

In [None]:
pp.StandardScaler(input_cols=scaler_input_cols, output_cols=scaler_output_cols, with_mean=True, with_std=False).fit_transform(df_test).show()

Setting both **with_mean** and **with_std** to False will return the same values as input.

In [None]:
pp.StandardScaler(input_cols=scaler_input_cols, output_cols=scaler_output_cols, with_mean=False, with_std=False).fit_transform(df_test).show()

### MaxAbsScaler

Scale each column by its maximum absolute value.
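
In plain Python the calculation looks roughly like this; `max_abs_scale` and the sample values are made up for illustration and are not part of the module.

```python
def max_abs_scale(values):
    """Divide every value by the largest absolute value in the column."""
    max_abs = max(abs(v) for v in values)  # an all-zero column is not handled here
    return [v / max_abs for v in values]


sample = [-4.0, 1.0, 2.0]
scaled = max_abs_scale(sample)
```

The result always lies in the range -1 to 1, and the sign of each value is preserved.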

To save a fitted scaler for later use, see the MinMaxScaler examples above.

In [None]:
mas = pp.MaxAbsScaler(input_cols=scaler_input_cols, output_cols=scaler_output_cols)
mas.fit(df_test)
mas.fitted_values_

In [None]:
mas_tr_df = mas.transform(df_test)
mas_tr_df.show()

We can reverse the scaling by using the **inverse_transform** method. The reversed values will be in the output columns.

In [None]:
mas.inverse_transform(mas_tr_df).show()

### RobustScaler
Scale columns using statistics that are robust to outliers.

This scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).

By default it centers the data before scaling and scales the data to the interquartile range.
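
A plain-Python sketch of the idea follows. The helpers `percentile` and `robust_scale` are hypothetical, the data is made up, and the linear-interpolation percentile is an assumption; the percentile method the module uses in Snowflake may give slightly different values.

```python
import statistics


def percentile(sorted_values, q):
    """Percentile q (0-100) with linear interpolation between neighbours."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * frac


def robust_scale(values, quantile_range=(25.0, 75.0),
                 with_centering=True, with_scaling=True):
    """Subtract the median and divide by the quantile range."""
    ordered = sorted(values)
    center = statistics.median(ordered) if with_centering else 0.0
    q_low, q_high = quantile_range
    scale = (percentile(ordered, q_high) - percentile(ordered, q_low)) if with_scaling else 1.0
    return [(v - center) / scale for v in values]


mins = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 is a deliberate outlier
scaled = robust_scale(mins)
centered_only = robust_scale(mins, with_scaling=False)
```

The outlier does not affect the median (3.0) or the interquartile range (2.0), so the other values are scaled as if it were not there.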

To save a fitted scaler for later use, see the MinMaxScaler examples above.

In [None]:
rs = pp.RobustScaler(input_cols=scaler_input_cols, output_cols=scaler_output_cols)
rs.fit(df_test)
rs.fitted_values_

In [None]:
rs_tr_df = rs.transform(df_test)
rs_tr_df.show()

We can reverse the scaling by using the **inverse_transform** method. The reversed values will be in the output columns.

In [None]:
rs.inverse_transform(rs_tr_df).show()

Setting **with_centering**=False will disable centering of data before scaling.

In [None]:
pp.RobustScaler(input_cols=scaler_input_cols, output_cols=scaler_output_cols, with_centering=False).fit_transform(df_test).show()

Setting **with_scaling**=False will disable scaling of the data to the interquartile range. With **with_centering**=True the data will only be centered.

In [None]:
pp.RobustScaler(input_cols=scaler_input_cols, output_cols=scaler_output_cols, with_centering=True, with_scaling=False).fit_transform(df_test).show()

Setting both **with_centering** and **with_scaling** to False will return unchanged data

In [None]:
pp.RobustScaler(input_cols=scaler_input_cols, output_cols=scaler_output_cols, with_centering=False, with_scaling=False).fit_transform(df_test).show()

Using the 10th and 90th percentiles as the quantile range

In [None]:
pp.RobustScaler(input_cols=scaler_input_cols, output_cols=scaler_output_cols,quantile_range=(10.0, 90.0)).fit_transform(df_test).show()

Setting **unit_variance**=True will scale data so that normally distributed features have a variance of 1

In [None]:
pp.RobustScaler(input_cols=scaler_input_cols, output_cols=scaler_output_cols,unit_variance=True).fit_transform(df_test).show()

### Normalizer

Normalize each row individually to unit norm. The Normalizer does not have an inverse transformation method since the transformation values are calculated row by row.

The **norm** parameter sets the norm used to normalize each non-zero row: l1, l2 or max; l2 is the default.  
The l1 norm is calculated as the sum of the absolute values of the input columns for each row.  
The l2 norm is calculated as the square root of the sum of the squared input column values for each row.  
The max norm is calculated as the maximum of the absolute values of the input columns for each row.
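
The three norms can be sketched for a single row in plain Python. `normalize_row` and the sample row are made up for illustration; leaving zero rows unchanged is an assumption about how such rows are handled.

```python
import math


def normalize_row(row, norm="l2"):
    """Divide every value in one row by that row's norm."""
    if norm == "l1":
        size = sum(abs(v) for v in row)
    elif norm == "l2":
        size = math.sqrt(sum(v * v for v in row))
    elif norm == "max":
        size = max(abs(v) for v in row)
    else:
        raise ValueError(f"unknown norm: {norm}")
    # Assumption: a row whose norm is zero is returned unchanged
    return list(row) if size == 0 else [v / size for v in row]


row = [3.0, 4.0]  # one row's values for two input columns
l2_row = normalize_row(row)
l1_row = normalize_row(row, "l1")
max_row = normalize_row(row, "max")
```

Since the divisor depends only on the row itself, nothing is learned from the data during fit, which is also why the original values cannot be recovered from the normalized ones.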

To save a fitted scaler for later use, see the MinMaxScaler examples above.

In [None]:
ns = pp.Normalizer(input_cols=scaler_input_cols, output_cols=scaler_output_cols)
ns.fit(df_test)
ns.fitted_values_

In [None]:
ns_tr_df = ns.transform(df_test)
ns_tr_df.show()

l1 norm

In [None]:
pp.Normalizer(input_cols=scaler_input_cols, output_cols=scaler_output_cols, norm="l1").fit_transform(df_test).show()

max norm

In [None]:
pp.Normalizer(input_cols=scaler_input_cols, output_cols=scaler_output_cols, norm="max").fit_transform(df_test).show()

### Binarizer

Binarize data (set feature values to 0 or 1) according to a threshold, default 0.0.

The Binarizer does not have an inverse transform method.
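
A plain-Python sketch of the thresholding follows. `binarize` and the sample values are made up, and the strict "greater than" comparison is an assumption taken from scikit-learn's Binarizer; the module may treat values equal to the threshold differently.

```python
def binarize(values, threshold=0.0):
    """Return 1 for values above the threshold, otherwise 0."""
    # Assumption: values equal to the threshold map to 0
    return [1 if v > threshold else 0 for v in values]


day_charge = [0.5, 9.5, 10.9]
flags_default = binarize(day_charge)
flags = binarize(day_charge, threshold=9.5)
```

Many different inputs map to the same 0 or 1, so the original values cannot be reconstructed from the output.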

To save a fitted scaler for later use, see the MinMaxScaler examples above.

In [None]:
bs = pp.Binarizer(input_cols=scaler_input_cols, output_cols=scaler_output_cols)
bs.fit(df_test)

In [None]:
bs_tr_df = bs.transform(df_test)
bs_tr_df.show()

Threshold 9.5

In [None]:
pp.Binarizer(input_cols=scaler_input_cols, output_cols=scaler_output_cols, threshold=9.5).fit_transform(df_test).show()

## Encoders

Start by setting what columns to use for encoding. If none are provided, all columns in a DataFrame will be used.

Output columns are created automatically if **categories**="auto"; otherwise a category column mapping needs to be provided with the **categories** parameter.

We are also generating a Snowpark DataFrame with unknown values to demo how those can be handled

In [None]:
encoder_input_cols = ["STATE", "AREA_CODE", "INTL_PLAN"]
df_unknown = session.create_dataframe([['XX', 415, 'yes'], ['ZZ', 351, 'XY']], schema=encoder_input_cols)

### OneHotEncoder
Encode categorical features as one-hot columns: for each input column, a new column is created for each category.
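
The encoding itself can be sketched in plain Python for a single column. `fit_categories` and `one_hot` are hypothetical helpers, not the module's API. Casting values to strings and returning all zeros for an unseen value are assumptions that mirror scikit-learn's "ignore" behaviour.

```python
def fit_categories(values):
    """Learn the distinct categories as sorted strings."""
    return sorted({str(v) for v in values})


def one_hot(value, categories):
    """Return one 0/1 flag per known category; unseen values give all zeros."""
    return [1 if str(value) == c else 0 for c in categories]


area_codes = [415, 408, 510, 415]
categories = fit_categories(area_codes)
encoded = [one_hot(v, categories) for v in area_codes]
unseen = one_hot(351, categories)  # 351 was not present during the fit
```

Each position in the encoded list corresponds to one output column in the DataFrame version.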

To save a fitted encoder for later use, see the MinMaxScaler examples above.

In [None]:
ohe = pp.OneHotEncoder(input_cols=encoder_input_cols)
ohe.fit(df_test)

By default the input columns are dropped from the returned DataFrame during transform

In [None]:
ohe_tr_df = ohe.transform(df_test)
ohe_tr_df.show()

**inverse_transform** will return the original columns and drop the output columns from the returned DataFrame

In [None]:
ohe.inverse_transform(ohe_tr_df).show()

Setting **drop_input_cols**=False will keep input columns in the returned DataFrame

In [None]:
ohe_keep_input = pp.OneHotEncoder(input_cols=encoder_input_cols, drop_input_cols=False)
ohe_keep_input_tr_df = ohe_keep_input.fit_transform(df_test)
ohe_keep_input_tr_df.show()

**inverse_transform** will behave the same, even with **drop_input_cols**=False

In [None]:
ohe_keep_input.inverse_transform(ohe_keep_input_tr_df).show()

By default, unknown values, i.e. values that were not present during the fit, are ignored

In [None]:
ohe_ignore_unk = pp.OneHotEncoder(input_cols=encoder_input_cols, drop_input_cols=False)
ohe_ignore_unk.fit(df_test)
ohe_keep_ignore_tr_df = ohe_ignore_unk.transform(df_unknown)
ohe_keep_ignore_tr_df.show()

With **inverse_transform**, unknown values will be NULL in the returned DataFrame

In [None]:
ohe_ignore_unk.inverse_transform(ohe_keep_ignore_tr_df).show()

Setting **handle_unknown**='keep' will create an unknown column for each feature, which is set to 1 for all values not seen during the fit

In [None]:
ohe_keep_unk = pp.OneHotEncoder(input_cols=encoder_input_cols, handle_unknown='keep', drop_input_cols=False)
ohe_keep_unk.fit(df_test)
ohe_keep_unk_tr_df = ohe_keep_unk.transform(df_unknown)
ohe_keep_unk_tr_df.show()

**inverse_transform** with unknown values and **handle_unknown**='keep' will return NULL for the unknown values

In [None]:
ohe_keep_unk.inverse_transform(ohe_keep_unk_tr_df).show()

The column category mapping can be set manually by providing a dictionary to the **categories** parameter.

In [None]:
my_categories = {"AREA_CODE": ['408', '415', '510'], "INTL_PLAN": ['no', 'yes']}

pp.OneHotEncoder(input_cols=['AREA_CODE', 'INTL_PLAN'], categories=my_categories).fit_transform(df_test).show()

Output columns can be set with the **output_cols** parameter. Since the categories are always sorted in alphabetical order, the output column names need to be given in the same order.

In [None]:
my_output_cols = {"AREA_CODE": ['AC_1', 'AC_2', 'AC_3'], "INTL_PLAN": ['NO_PLAN', 'HAS_PLAN']}

pp.OneHotEncoder(input_cols=['AREA_CODE', 'INTL_PLAN'], output_cols=my_output_cols, drop_input_cols=False).fit_transform(df_test).show()

### OrdinalEncoder

Encodes a string column of labels to a column of label indices. The indices are in [0, number of labels].

By default, the labels are sorted alphabetically and numeric columns are cast to string.
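
The mapping from labels to indices can be sketched in plain Python. All three helper names are hypothetical, and `None` stands in for SQL NULL. The `unknown_value` argument only illustrates replacing unseen labels with a fixed code.

```python
def fit_ordinal(values):
    """Map each distinct label (as a string, sorted) to its index."""
    labels = sorted({str(v) for v in values})
    return {label: index for index, label in enumerate(labels)}


def ordinal_transform(values, mapping, unknown_value=None):
    """Look up each label; labels not seen during fit get unknown_value."""
    return [mapping.get(str(v), unknown_value) for v in values]


def ordinal_inverse(codes, mapping):
    """Map indices back to labels; codes without a label become None."""
    reverse = {index: label for label, index in mapping.items()}
    return [reverse.get(code) for code in codes]


plans = ["yes", "no", "no", "yes"]
mapping = fit_ordinal(plans)
codes = ordinal_transform(["no", "yes", "XY"], mapping)
codes_999 = ordinal_transform(["no", "yes", "XY"], mapping, unknown_value=999)
restored = ordinal_inverse(codes_999, mapping)
```

The replacement code 999 has no label in the fitted mapping, so the inverse lookup cannot recover the original unseen value.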

To save a fitted encoder for later use, see the MinMaxScaler examples above.

In [None]:
oe = pp.OrdinalEncoder(input_cols=encoder_input_cols)
oe.fit(df_test)

If output_cols is not provided, the input_cols will be replaced by the encoded values

In [None]:
oe_tr_df = oe.transform(df_test)
oe_tr_df.show()

By setting output_cols the transformed DataFrame will also keep the input columns.

In [None]:
pp.OrdinalEncoder(input_cols=encoder_input_cols, output_cols=["STATE_ENCODED", "AREA_CODE_ENCODED", "INTL_PLAN_ENCODED"]).fit_transform(df_test).show()

By default, unknown values, i.e. values that were not present during the fit, will get NULL in the encoded columns

In [None]:
oe_ignore_unk = pp.OrdinalEncoder(input_cols=encoder_input_cols, output_cols=["STATE_ENCODED", "AREA_CODE_ENCODED", "INTL_PLAN_ENCODED"])
oe_ignore_unk.fit(df_test)
oe_keep_ignore_tr_df = oe_ignore_unk.transform(df_unknown)
oe_keep_ignore_tr_df.show()

Inverse transform on a transformed DataFrame with unknown values will return NULL values for those

In [None]:
oe_ignore_unk.inverse_transform(oe_keep_ignore_tr_df).show()

Setting handle_unknown='use_encoded_value' will replace unknown values with the value of unknown_value

In [None]:
oe_handle_unk = pp.OrdinalEncoder(input_cols=encoder_input_cols, output_cols=["STATE_ENCODED", "AREA_CODE_ENCODED", "INTL_PLAN_ENCODED"], handle_unknown='use_encoded_value', unknown_value=999)
oe_handle_unk.fit(df_test)
oe_handle_ignore_tr_df = oe_handle_unk.transform(df_unknown)
oe_handle_ignore_tr_df.show()

Inverse transform on a transformed DataFrame with unknown values will return NULL values for those

In [None]:
oe_handle_unk.inverse_transform(oe_handle_ignore_tr_df).show()

### LabelEncoder

A label indexer that maps a string column of labels to a column of label indices. The indices are in [0, number of labels].

The LabelEncoder is to be used with the target column, for features **OrdinalEncoder** should be used.

To save a fitted encoder for later use, see the MinMaxScaler examples above.

In [None]:
le = pp.LabelEncoder(input_col="INTL_PLAN", output_col="INTL_PLAN_ENCODED")
le.fit(df_test)
le_tr_df = le.transform(df_test)
le_tr_df.show()

**inverse_transform**

In [None]:
le.inverse_transform(le_tr_df).show()

## Using a scaler in a Python Stored Procedure

The following is an example of how a preprocessing scaler can be used in a Python Stored Procedure. The example depends on the test data generation part above having been run.

The stored procedure will fit and transform an input table with the MinMaxScaler using the **input_cols**, and then store the transformed data in the **output_table**. It will also store the fitted scaler as a joblib object on the stage SP_STAGE.

Start by creating the stage where we store the fitted scaler object.

In [None]:
session.sql('CREATE OR REPLACE STAGE SP_STAGE').collect()

Create a helper function for saving the fitted scaler and then the primary function for the stored procedure

In [None]:
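def save_file(session, model, path):
    # Serialize the fitted object into an in-memory stream instead of a local file
    input_stream = io.BytesIO()
    joblib.dump(model, input_stream)
    # Note: _conn._cursor is a private Snowpark attribute and may change between versions
    session._conn._cursor.upload_stream(input_stream, path)
    return "successfully created file: " + path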

def min_max_scaler(session: Session, input_table: str, input_cols: list, output_table: str, output_cols: list) -> str:
    import preprocessing as pp

    df_input = session.table(input_table)
    
    mms = pp.MinMaxScaler(input_cols=input_cols, output_cols=output_cols)
    mms.fit(df_input)
    
    mms_tr_df = mms.transform(df_input)
    
    save_file(session, mms, "@SP_STAGE/min_max_scaler.joblib")
    mms_tr_df.write.mode("overwrite").save_as_table(output_table)
    
    return "SUCCESS"
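
The `save_file` helper writes the fitted scaler to an in-memory stream rather than to a local file. That pattern can be sketched with only the standard library. Here pickle is used in place of joblib, and a dict of made-up statistics stands in for the scaler; both substitutions are for illustration only.

```python
import io
import pickle

# Hypothetical stand-in for a fitted scaler (made-up values)
fitted_state = {"CALLS": {"min": 0.0, "max": 99.0}}

buffer = io.BytesIO()
pickle.dump(fitted_state, buffer)
size_in_bytes = buffer.tell()  # after the dump the position is at the end of the stream

buffer.seek(0)  # rewind so a reader starts from the first byte
restored_state = pickle.load(buffer)
```

A reader that starts from the stream's current position sees no data until the stream has been rewound.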

Add the imports and deploy the temporary stored procedure function to Snowflake.

In [None]:
session.clear_imports()
session.clear_packages()
session.add_import("preprocessing")
session.add_packages('snowflake-snowpark-python', 'joblib', 'scipy', 'numpy')

min_max_scaler_sp = F.sproc(min_max_scaler, replace=True, is_permanent=False, session=session)

Store the test data as a table in Snowflake

In [None]:
df_test.write.mode("overwrite").save_as_table("scaler_input")

Call the stored procedure

In [None]:
min_max_scaler_sp("scaler_input", ["CALLS", "DAY_CHARGE"], "scaler_output", ["calls_scaled", "day_charge_scaled"])

Verify that the transformed data is in the **output_table**

In [None]:
session.table("scaler_output").show()

Verify that the fitted scaler is stored on the stage

In [None]:
session.sql("ls @SP_STAGE").show()

In [None]:
session.close()