# Custom Preprocessing in ML Pipelines using Snowflake Model Registry & Scikit-Learn: 

This notebook walks through how you can register ML pipelines with Snowflake's Model Registry by using scikit-learn's custom transformers. 

First, we import everything we need to connect to Snowflake and build a model: 

In [1]:
#imports: 
import json
import pandas as pd
import numpy as np

#Snowflake imports:
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F

#Model Registry: 
from snowflake.ml.registry import Registry

#ML Imports: 
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline


In [2]:
#Confirm version of model registry: 
from snowflake.ml import version
print(version.VERSION)

1.7.1


In [3]:
#Authenticate to Snowflake, here using a local json file with credentials:
with open('/Users/hapatel/.config/creds.json') as f:
    conn_params = json.load(f)
session = Session.builder.configs(conn_params).create()

#Use the appropriate database context (I have created my own Database/Schema ahead of time, this may look different compared to yours)
session.sql('USE ROLE ML_ENGINEER').collect()
session.sql('USE WAREHOUSE TEST').collect()
session.sql('USE DATABASE DEMO').collect()
session.sql('USE SCHEMA CUSTOMER_EXAMPLES').collect()

[Row(status='Statement executed successfully.')]
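
The notebook does not show what the credentials file contains. As a rough sketch, assuming it holds a flat mapping of standard Snowpark connection parameters, it could look like the dictionary below. The bracketed values are placeholders, and the role, warehouse, database, and schema values simply reuse the names from the `USE` statements above.

```python
import json

# Placeholder values only (hypothetical); replace them with your own account details
example_creds = {
    "account": "<account_identifier>",
    "user": "<user_name>",
    "password": "<password>",
    "role": "ML_ENGINEER",
    "warehouse": "TEST",
    "database": "DEMO",
    "schema": "CUSTOMER_EXAMPLES",
}

# Round-trip through JSON to mimic writing the file and loading it back
creds_json = json.dumps(example_creds, indent=2)
conn_params = json.loads(creds_json)
print(sorted(conn_params))
```

If the file already sets the role, warehouse, database, and schema, the explicit `USE` statements become optional, although keeping them makes the context visible in the notebook.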

## Feature Engineering & Modeling: 

We can now proceed to model-specific feature engineering and model building, using data stored in a Snowflake table. 

In a real-world example, this would be the starting point, where we would take the raw data in the Snowflake Table and perform additional model-specific feature engineering before building and deploying our model. 

In [4]:
taxi_sdf = session.table("nyc_yellow_trips")
taxi_df = taxi_sdf.to_pandas()

In [5]:
taxi_df.head()

Unnamed: 0,VENDORID,PASSENGER_COUNT,TRIP_DISTANCE,RATECODEID,STORE_AND_FWD_FLAG,PULOCATIONID,DOLOCATIONID,PAYMENT_TYPE,FARE_AMOUNT,EXTRA,MTA_TAX,TIP_AMOUNT,TOLLS_AMOUNT,IMPROVEMENT_SURCHARGE,TOTAL_AMOUNT,CONGESTION_SURCHARGE,AIRPORT_FEE,TPEP_PICKUP_DATETIME,TPEP_DROPOFF_DATETIME,TRIP_ID
0,1,1,3.2,1,N,48,262,1,14.0,0.5,0.5,3.06,0.0,0.3,18.36,,,2016-01-01 00:12:22,2016-01-01 00:29:14,0
1,1,2,1.0,1,N,162,48,2,9.5,0.5,0.5,0.0,0.0,0.3,10.8,,,2016-01-01 00:41:31,2016-01-01 00:55:10,1
2,1,1,0.9,1,N,246,90,2,6.0,0.5,0.5,0.0,0.0,0.3,7.3,,,2016-01-01 00:53:37,2016-01-01 00:59:57,2
3,1,1,0.8,1,N,170,162,2,5.0,0.5,0.5,0.0,0.0,0.3,6.3,,,2016-01-01 00:13:28,2016-01-01 00:18:07,3
4,1,1,1.8,1,N,161,140,2,11.0,0.5,0.5,0.0,0.0,0.3,12.3,,,2016-01-01 00:33:04,2016-01-01 00:47:14,4


In [6]:
taxi_df.dtypes

VENDORID                           int8
PASSENGER_COUNT                    int8
TRIP_DISTANCE                   float64
RATECODEID                         int8
STORE_AND_FWD_FLAG               object
PULOCATIONID                      int16
DOLOCATIONID                      int16
PAYMENT_TYPE                       int8
FARE_AMOUNT                     float64
EXTRA                           float64
MTA_TAX                         float64
TIP_AMOUNT                      float64
TOLLS_AMOUNT                    float64
IMPROVEMENT_SURCHARGE           float64
TOTAL_AMOUNT                    float64
CONGESTION_SURCHARGE            float64
AIRPORT_FEE                     float64
TPEP_PICKUP_DATETIME     datetime64[ns]
TPEP_DROPOFF_DATETIME    datetime64[ns]
TRIP_ID                           int32
dtype: object

For this dataset, we would like to predict `FARE_AMOUNT`, which according to the data dictionary is the total time-and-distance fare calculated by the meter. 

Based on the other variables provided, such as `TPEP_PICKUP_DATETIME` and `TPEP_DROPOFF_DATETIME`, which represent pickup and dropoff times respectively, we can engineer additional features like trip duration, which may be indicative of the fare amount. We could write a function to do this as follows: 



In [7]:
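def calculate_trip_duration(X):
    """Return the trip duration in minutes as an (n, 1) array; X is the input dataframe"""
    #Note: .dt.seconds holds only the seconds component (less than one day) of each timedelta
    return (X['TPEP_DROPOFF_DATETIME'] - X['TPEP_PICKUP_DATETIME']).dt.seconds.values.reshape(-1, 1) / 60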

This function could be applied to our input dataframe to provide us the new feature: 

In [8]:
calculate_trip_duration(taxi_df)

array([[16.86666667],
       [13.65      ],
       [ 6.33333333],
       ...,
       [10.53333333],
       [28.88333333],
       [12.31666667]])
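
One caveat about this calculation: pandas' `.dt.seconds` returns only the seconds component of each timedelta, so a trip longer than 24 hours would wrap around. The sketch below uses a made-up two-row frame (not the taxi table) to compare it with `.dt.total_seconds()`, which is one possible alternative if such trips can occur in your data.

```python
import pandas as pd

# Made-up rows: the first mirrors a short trip, the second lasts one day and ten minutes
toy = pd.DataFrame({
    "TPEP_PICKUP_DATETIME": pd.to_datetime(["2016-01-01 00:12:22", "2016-01-01 23:50:00"]),
    "TPEP_DROPOFF_DATETIME": pd.to_datetime(["2016-01-01 00:29:14", "2016-01-03 00:00:00"]),
})

delta = toy["TPEP_DROPOFF_DATETIME"] - toy["TPEP_PICKUP_DATETIME"]

# Seconds component only: the full day in the second row is dropped
minutes_component = delta.dt.seconds / 60
# Full duration: the day is included
minutes_total = delta.dt.total_seconds() / 60

print(pd.DataFrame({"dt.seconds": minutes_component, "total_seconds": minutes_total}))
```

For trips shorter than a day the two columns agree, so the choice only matters for outliers.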

In our dataset, we also have categorical variables like `STORE_AND_FWD_FLAG`, whose values `Y` and `N` we want to map to 1 and 0 respectively, as well as `PAYMENT_TYPE`, which we want to one-hot encode. Again, these transformations can be done with a few lines of code: 

In [9]:
def map_store_and_fwd_flag(X):
    return X['STORE_AND_FWD_FLAG'].map({'Y': 1, 'N': 0}).values.reshape(-1, 1)

In [10]:
map_store_and_fwd_flag(taxi_df)

array([[0],
       [0],
       [0],
       ...,
       [0],
       [0],
       [0]])
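
It is worth knowing what `Series.map` does with values that are not in the dictionary: they become `NaN`. The sketch below shows this on a made-up series and applies one possible fallback, treating unmapped values as 0. That fallback is an assumption for illustration, not something the notebook does.

```python
import pandas as pd

# Made-up values, including a missing entry and an unexpected lowercase 'y'
flags = pd.Series(["Y", "N", None, "y"], name="STORE_AND_FWD_FLAG")

mapped = flags.map({"Y": 1, "N": 0})
n_unmapped = int(mapped.isna().sum())

# Hypothetical policy: treat anything unmapped as 0 and store a compact integer type
mapped_filled = mapped.fillna(0).astype("int8")

print(n_unmapped)
print(mapped_filled.tolist())
```

Whether 0 is the right default depends on the data; the point is only that unmapped values need an explicit decision before they reach the model.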

In [11]:
#one hot encoder: 
payment_encoder = OneHotEncoder(sparse_output = False, handle_unknown= 'ignore')

In [12]:
payment_encoder.fit(taxi_df[['PAYMENT_TYPE']])

In [13]:
payment_encoder.transform(taxi_df[['PAYMENT_TYPE']])

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       ...,
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.]])

### Using Custom Transformers to Perform Model-Specific Feature Engineering Steps: 

We performed some simple feature engineering steps on categorical and date variables using the functions defined above. While we can apply them in sequence in a script, the disadvantage shows up at deployment: we have to apply exactly the same transformation steps at prediction time so that the data is fed to the model in the same form as during training. 

To handle this idiomatically, we can use Scikit-Learn's custom transformers. By extending the `BaseEstimator` and `TransformerMixin` classes, we can package the model-specific transformation logic into reusable components and register the pipeline object with the Model Registry. 
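
As a minimal sketch of what the two base classes provide, the toy transformer below (a hypothetical `ColumnRatio` class, not part of the notebook) implements only `fit` and `transform`. `TransformerMixin` then supplies `fit_transform`, and `BaseEstimator` supplies `get_params`, which scikit-learn uses for cloning and parameter search.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnRatio(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: adds numerator / denominator as a RATIO column."""

    def __init__(self, numerator, denominator):
        # Only store constructor arguments here, under the same names
        self.numerator = numerator
        self.denominator = denominator

    def fit(self, X, y=None):
        # Nothing to learn; returning self keeps the scikit-learn contract
        return self

    def transform(self, X):
        X_ = X.copy()  # avoid mutating the caller's dataframe
        X_["RATIO"] = X_[self.numerator] / X_[self.denominator]
        return X_


toy = pd.DataFrame({"FARE_AMOUNT": [14.0, 9.5], "TRIP_DISTANCE": [3.2, 1.0]})
ratio = ColumnRatio("FARE_AMOUNT", "TRIP_DISTANCE")

out = ratio.fit_transform(toy)  # provided by TransformerMixin
params = ratio.get_params()     # provided by BaseEstimator

print(out)
print(params)
```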

In [14]:
#Import necessary classes: 
from sklearn.base import BaseEstimator, TransformerMixin

In [15]:
#Define our custom preprocessing class: 
class PreProcessingPipeline(BaseEstimator, TransformerMixin):
    """Custom class that implements custom transformations prior to modeling the data"""

    def __init__(self, cols_ohe): 
        """Initializes the state of the object, cols_ohe is a list object that contains the columns that we want to 
        one hot encode"""

        self.cols_ohe = cols_ohe

        #Initialize a One Hot Encoder Object that can be referenced later:
        self.ohe = OneHotEncoder(sparse_output = False, handle_unknown = 'ignore')

    def fit(self, X, y = None): 
        """Learns the one hot encoder categories from the training data"""

        #Create copy of the data to prevent side effects: 
        X_ = X[self.cols_ohe].copy()
        #Handle null values with a catch-all 'missing' category
        X_.fillna('missing', inplace = True)
        #Fit the encoder to the input pandas dataframe object
        self.ohe.fit(X_)
        return self

    def transform(self, X, y = None): 
        """Applies all the transformations required to make the input dataset compatible
        with the model

        The output is a dataframe object with the final set of columns, which is passed
        to the model step of the pipeline during both fit and predict
        """

        X_ = X.copy()
        #Calculate the trip duration in minutes
        X_['TRIP_DURATION'] = (X_['TPEP_DROPOFF_DATETIME'] - X_['TPEP_PICKUP_DATETIME']).dt.seconds.values/60
        #Calculate the numerical flag: 
        X_['STORE_AND_FWD_FLAG'] = X_['STORE_AND_FWD_FLAG'].map({'Y': 1, 'N': 0})

        #Apply the same null handling as in fit before one hot encoding:
        X_ohe = self.ohe.transform(X_[self.cols_ohe].fillna('missing')) 
        X_ohe_df = pd.DataFrame(X_ohe, columns = self.ohe.get_feature_names_out(), index = X_.index)

        X_ = pd.concat([X_, X_ohe_df], axis = 1)
        X_.drop(self.cols_ohe, axis = 1, inplace = True)

        #Drop any unnecessary columns: 
        X_.drop(['TPEP_PICKUP_DATETIME' ,'TPEP_DROPOFF_DATETIME'], axis = 1, inplace = True)
        
        #X_ is the dataframe with the final set of columns that is passed to the model
        return X_

The custom class above handles all of the transformations that our model requires. We can now package it into a [sklearn pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) object to train our model: 

In [16]:
#Features we will use from our input dataframe
FEATURE_COLUMNS = ['TPEP_PICKUP_DATETIME', 'PAYMENT_TYPE', 'TPEP_DROPOFF_DATETIME', 'STORE_AND_FWD_FLAG', 'PASSENGER_COUNT', 
                  'TRIP_DISTANCE']

#Target variable: 
TARGET_LABEL = ['FARE_AMOUNT']

X = taxi_df[FEATURE_COLUMNS]
y = taxi_df[TARGET_LABEL].squeeze() #1-D target, which scikit-learn regressors expect

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating the RandomForestRegressor
rf = RandomForestRegressor(n_estimators = 100,
                           max_depth = 5, 
                           random_state=42,
                           n_jobs= -1)

In [17]:
#Instantiate our preprocessor class: 
preprocesser = PreProcessingPipeline(cols_ohe = ['PAYMENT_TYPE'])

#Create the pipeline object: 
rf_pipeline = Pipeline(steps = [
    ('preprocessing_pipeline', preprocesser),
    ('rf_model', rf)]
)

In [18]:
#Fit the model to the data: 
rf_pipeline.fit(X_train, y_train)



In [19]:
#Evaluate the model performance: 
y_preds = rf_pipeline.predict(X_test)
mse = mean_squared_error(y_test, y_preds)
r2 = r2_score(y_test, y_preds)

print(f"Mean Squared Error: {mse}")
print(f"R^2 Score: {r2}")

Mean Squared Error: 8.025179964296344
R^2 Score: 0.8935025133622962
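
The mean squared error is expressed in squared fare units, which makes it hard to read directly. Taking the square root gives the root mean squared error in the same units as `FARE_AMOUNT`. The short sketch below reuses the MSE value printed above; it is a convenience calculation rather than part of the original evaluation.

```python
import math

mse = 8.025179964296344  # value printed by the evaluation cell above
rmse = math.sqrt(mse)

print(f"Root Mean Squared Error: {rmse:.2f}")  # prints 2.83
```

On average, then, the predictions are off by roughly 2.83 in fare units, with larger errors weighted more heavily than smaller ones.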


### Model Deployment: 

The `rf_pipeline` object now encapsulates all the intermediate model-specific feature engineering steps that were scripted individually before. The advantage of this approach is that we can deploy it using Snowflake's Model Registry and get predictions against new data without writing separate scripts to transform the intermediate data. 
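
To make the encapsulation concrete without needing the taxi table, here is a self-contained sketch on synthetic data. `DurationOnly` is a hypothetical, stripped-down stand-in for the preprocessing class, and the fare formula is invented purely to have something to fit. The point is that `predict` accepts raw rows with datetime columns, because the transformation travels with the pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline


class DurationOnly(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: turns raw pickup/dropoff times into a duration feature."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        delta = X["TPEP_DROPOFF_DATETIME"] - X["TPEP_PICKUP_DATETIME"]
        return pd.DataFrame(
            {"TRIP_DURATION": delta.dt.total_seconds() / 60,
             "TRIP_DISTANCE": X["TRIP_DISTANCE"]},
            index=X.index,
        )


# Synthetic raw rows (made-up data, not the taxi table)
rng = np.random.default_rng(0)
n = 200
pickup = pd.Timestamp("2016-01-01") + pd.to_timedelta(rng.integers(0, 86_400, n), unit="s")
minutes = rng.uniform(2, 40, n)
raw = pd.DataFrame({
    "TPEP_PICKUP_DATETIME": pickup,
    "TPEP_DROPOFF_DATETIME": pickup + pd.to_timedelta(minutes, unit="min"),
    "TRIP_DISTANCE": rng.uniform(0.5, 10, n),
})
# Invented fare rule, used only to give the model a learnable target
fare = 2.5 + 0.5 * minutes + 2.0 * raw["TRIP_DISTANCE"].to_numpy()

pipe = Pipeline(steps=[
    ("prep", DurationOnly()),
    ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
])
pipe.fit(raw, fare)

# Raw rows go in; no separate preprocessing script is needed at prediction time
preds = pipe.predict(raw.head(5))
print(preds.shape)
```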

In [20]:
#Instantiate Model Registry: 
reg = Registry(
    session = session, 
    database_name = session.get_current_database(), 
    schema_name = session.get_current_schema()
)

In [21]:
#Register Model with the Registry: 
rf_wh_model = reg.log_model(
    model = rf_pipeline, #reference the serializable model object that we had just created
    model_name = "TRIP_FARE_MODEL", #name to identify the model within the registry
    version_name = "V1", #Version to identify iterations of the model we have created - think of it as experiments you have run
    sample_input_data = X_train.head(1000), #pass in sample input data for inferring type signatures
    comment = "Model iteration with the custom transformers"
)
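
Once a model is logged, a later session can look it up by name and version instead of retraining. The helper below is a sketch under the assumption that the session points at the same database and schema used for logging. `load_trip_fare_model` is a hypothetical name, and the import sits inside the function only so that the helper can be defined without a live connection.

```python
def load_trip_fare_model(session, model_name="TRIP_FARE_MODEL", version_name="V1"):
    """Sketch: retrieve a previously logged model version from an existing session."""
    # Imported lazily so this helper can be defined without snowflake-ml-python loaded
    from snowflake.ml.registry import Registry

    reg = Registry(
        session=session,
        database_name=session.get_current_database(),
        schema_name=session.get_current_schema(),
    )
    model_version = reg.get_model(model_name).version(version_name)
    # show_functions() lists the callable methods (for example predict) and their signatures
    return model_version, model_version.show_functions()
```

A call such as `model_version, functions = load_trip_fare_model(session)` would return an object that can be used in the same way as `rf_wh_model` in the cells that follow.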



### Predictions: 

Now that we have logged our model, we can make predictions using the raw records from the Snowflake table. Below, we take a sample of 1000 rows from our original `nyc_yellow_trips` table to create the predictions. 

In a real-world situation we would use a set of fresh data; the example below is representative of what the code would look like: 

In [22]:
#Sample 1000 rows from our input snowflake table: 
taxi_sdf_sample = session.table("nyc_yellow_trips").sample(n = 1000)

In [None]:
#Run model predictions against a sample of our original snowflake table: 
rf_wh_model.run(taxi_sdf_sample, function_name = 'predict').with_column('PREDICTIONS', F.col('"output_feature_0"'))\
           .select("TRIP_ID", "PREDICTIONS").show()