- Requires snowflake-ml-python version **1.5.5** or higher
- Last updated on: 7/22/2024

# End-to-End Snowflake ML workflow

This notebook demonstrates an end-to-end ML experiment cycle including feature creation, training data generation, model training and inference. The workflow touches on key Snowflake ML features including [Snowflake Feature Store](https://docs.snowflake.com/en/developer-guide/snowpark-ml/feature-store/overview), [Dataset](https://docs.snowflake.com/en/developer-guide/snowpark-ml/dataset), ML Lineage, [Snowpark ML Modeling](https://docs.snowflake.com/en/developer-guide/snowpark-ml/modeling) and [Snowflake Model Registry](https://docs.snowflake.com/en/developer-guide/snowpark-ml/model-registry/overview). 

**Table of contents**
- [Set up test environment](#setup-test-env)
  - [Connect to Snowflake](#connect-to-snowflake)
  - [Select your example](#select-your-example)
- [Create features with Feature Store](#create-features-with-feature-store)
  - [Initialize Feature Store](#initialize-feature-store)
  - [Register entities and feature views](#register-new-entities-and-feature-views)
- [Generate Training Data](#gen-training-data)
- [Train model with Snowpark ML](#train-with-snowpark-ml)
- [Log model in Model Registry](#log-models-in-model-registry)
- [Query lineage](#query-lineage)
- [Predict with model](#predict-with-model)
  - [Predict with local model](#predict-with-local-model)
  - [Predict with Model Registry](#predict-with-model-registry)
- [Clean up notebook](#cleanup)

<a id='setup-test-env'></a>
## Set up test environment

<a id='connect-to-snowflake'></a>
### Connect to Snowflake

Let's start by setting up our test environment. We will create a session and two schemas. The schema `FS_DEMO_SCHEMA` will be used as the Feature Store, and `MODEL_DEMO_SCHEMA` will hold the model. Both are cleaned up at the end of the demo. Fill in `connection_parameters` with your Snowflake connection information. See this **[guide](https://docs.snowflake.com/en/developer-guide/snowpark/python/creating-session)** for more details on how to connect to Snowflake.


In [1]:
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<your snowflake account>",
    "user": "<your snowflake user>",
    "password": "<your snowflake password>",
    "role": "<your snowflake role>",
    "warehouse": "<your snowflake warehouse>",
    "database": "<your snowflake database>",
    "schema": "<your snowflake schema>",
}

session = Session.builder.configs(connection_parameters).create()

assert session.get_current_database() is not None, "Session must have a database for the demo."
assert session.get_current_warehouse() is not None, "Session must have a warehouse for the demo."
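
If you would rather not hard-code credentials in the notebook, one option is to read them from environment variables. The snippet below is a minimal sketch under that assumption. The helper `connection_parameters_from_env` and the `SNOWFLAKE_*` variable names are hypothetical and not part of snowflake-ml-python or Snowpark.

```python
import os

# Hypothetical environment variable names; rename them to match your own setup.
ENV_KEYS = {
    "account": "SNOWFLAKE_ACCOUNT",
    "user": "SNOWFLAKE_USER",
    "password": "SNOWFLAKE_PASSWORD",
    "role": "SNOWFLAKE_ROLE",
    "warehouse": "SNOWFLAKE_WAREHOUSE",
    "database": "SNOWFLAKE_DATABASE",
    "schema": "SNOWFLAKE_SCHEMA",
}


def connection_parameters_from_env(env=None):
    """Build a connection_parameters dict from environment variables.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    missing = sorted(var for var in ENV_KEYS.values() if not env.get(var))
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in ENV_KEYS.items()}
```

The resulting dict has the same keys as `connection_parameters` above, so it could be passed to `Session.builder.configs(...)` in its place.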

In [2]:
# The schema where the Feature Store is initialized and the test datasets are stored.
FS_DEMO_SCHEMA = "SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO"
# The schema where the model lives.
MODEL_DEMO_SCHEMA = "SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO_MODEL"

# Make sure your role has CREATE SCHEMA privileges or USAGE privileges on the schema if it already exists.
session.sql(f"CREATE OR REPLACE SCHEMA {FS_DEMO_SCHEMA}").collect()
session.sql(f"CREATE OR REPLACE SCHEMA {MODEL_DEMO_SCHEMA}").collect()

[Row(status='Schema SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO_MODEL successfully created.')]

<a id='select-your-example'></a>
### Select your example

We have prepared some examples that you can find in our [open source repo](https://github.com/snowflakedb/snowflake-ml-python/tree/main/snowflake/ml/feature_store/examples). Each example contains the source dataset, feature view and entity definitions that will be used in this demo. `ExampleHelper` (included in snowflake-ml-python) sets everything up with simple APIs, so you don't have to worry about the details.

In [3]:
from snowflake.ml.feature_store.examples.example_helper import ExampleHelper

example_helper = ExampleHelper(session, session.get_current_database(), FS_DEMO_SCHEMA)
print(f"All examples: {example_helper.list_examples()}")

All examples: ['new_york_taxi_features', 'citibike_trip_features', 'wine_quality_features']


`load_example()` will load the source data into Snowflake tables. In the example below, we are using the `new_york_taxi_features` example. You can replace this with any example listed above. Execution of the cell below may take some time depending on the size of the dataset.

In [4]:
# replace the value with the example you want to run
source_tables = example_helper.load_example('new_york_taxi_features')
print(source_tables)

['"REGTEST_DB".SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO.nyc_yellow_trips']


In [5]:
# display as Pandas DataFrame
session.table(source_tables[0]).limit(5).to_pandas()

Unnamed: 0,VENDORID,PASSENGER_COUNT,TRIP_DISTANCE,RATECODEID,STORE_AND_FWD_FLAG,PULOCATIONID,DOLOCATIONID,PAYMENT_TYPE,FARE_AMOUNT,EXTRA,MTA_TAX,TIP_AMOUNT,TOLLS_AMOUNT,IMPROVEMENT_SURCHARGE,TOTAL_AMOUNT,CONGESTION_SURCHARGE,AIRPORT_FEE,TPEP_PICKUP_DATETIME,TPEP_DROPOFF_DATETIME
0,1,1,3.2,1,N,48,262,1,14.0,0.5,0.5,3.06,0.0,0.3,18.36,,,2016-01-01 00:12:22,2016-01-01 00:29:14
1,1,2,1.0,1,N,162,48,2,9.5,0.5,0.5,0.0,0.0,0.3,10.8,,,2016-01-01 00:41:31,2016-01-01 00:55:10
2,1,1,0.9,1,N,246,90,2,6.0,0.5,0.5,0.0,0.0,0.3,7.3,,,2016-01-01 00:53:37,2016-01-01 00:59:57
3,1,1,0.8,1,N,170,162,2,5.0,0.5,0.5,0.0,0.0,0.3,6.3,,,2016-01-01 00:13:28,2016-01-01 00:18:07
4,1,1,1.8,1,N,161,140,2,11.0,0.5,0.5,0.0,0.0,0.3,12.3,,,2016-01-01 00:33:04,2016-01-01 00:47:14


<a id='create-features-with-feature-store'></a>
## Create features with Feature Store

<a id='initialize-feature-store'></a>
### Initialize Feature Store

Let's first create a Feature Store client. In `CREATE_IF_NOT_EXIST` mode, it will try to create a new Feature Store schema and all necessary Feature Store metadata if they don't already exist. This mode is required the first time you set up a Feature Store. Afterwards, you can use `FAIL_IF_NOT_EXIST` mode to connect to an existing Feature Store.

Note that the database being used must already exist. Feature Store will **NOT** try to create the database even in `CREATE_IF_NOT_EXIST` mode.

In [6]:
from snowflake.ml.feature_store import (
    FeatureStore,
    FeatureView,
    Entity,
    CreationMode
)

fs = FeatureStore(
    session=session, 
    database=session.get_current_database(), 
    name=FS_DEMO_SCHEMA, 
    default_warehouse=session.get_current_warehouse(),
    creation_mode=CreationMode.CREATE_IF_NOT_EXIST,
)

<a id='register-new-entities-and-feature-views'></a>
### Register entities and feature views

Next, we register new entities and feature views in the Feature Store. Entities are the join keys used to generate training data. Feature views contain all the features you need for model training and inference. The entities and feature views for this example are defined in our [open source repo](https://github.com/snowflakedb/snowflake-ml-python/tree/main/snowflake/ml/feature_store/examples). We will load the definitions with `load_entities()` and `load_draft_feature_views()` for simplicity.

In [7]:
all_entities = []
for e in example_helper.load_entities():
    entity = fs.register_entity(e)
    all_entities.append(entity)
fs.list_entities().show()

-----------------------------------------------------------------------
|"NAME"        |"JOIN_KEYS"       |"DESC"                |"OWNER"     |
-----------------------------------------------------------------------
|TRIP_DROPOFF  |["DOLOCATIONID"]  |Trip dropoff entity.  |REGTEST_RL  |
|TRIP_PICKUP   |["PULOCATIONID"]  |Trip pickup entity.   |REGTEST_RL  |
-----------------------------------------------------------------------



In [8]:
all_feature_views = []
for fv in example_helper.load_draft_feature_views():
    rf = fs.register_feature_view(
        feature_view=fv,
        version='1.0'
    )
    all_feature_views.append(rf)

fs.list_feature_views().select('name', 'version', 'desc', 'refresh_freq').show()

----------------------------------------------------------------------------------------------------
|"NAME"          |"VERSION"  |"DESC"                                              |"REFRESH_FREQ"  |
----------------------------------------------------------------------------------------------------
|F_TRIP_DROPOFF  |1.0        |Managed feature view trip dropoff refreshed eve...  |12 hours        |
|F_TRIP_PICKUP   |1.0        |Managed feature view trip pickup refreshed ever...  |1 day           |
----------------------------------------------------------------------------------------------------



<a id='gen-training-data'></a>
## Generate Training Data

After our feature pipelines are fully set up, we can use them to generate a [Snowflake Dataset](https://docs.snowflake.com/en/developer-guide/snowpark-ml/dataset) and later train a model. Generating training data is easy since materialized feature views already carry most of the metadata, like join keys and the timestamp for point-in-time lookup. We just need to provide the spine data (it's called spine because it is the list of entity IDs that we are essentially enriching by joining features with it).

`generate_dataset()` returns a Snowflake Dataset object, which is best suited for distributed training with deep learning frameworks like TensorFlow or PyTorch, which require fine-grained file-level access. It creates a new Dataset object (versioned and immutable) in Snowflake that materializes the data in Parquet files. If you train models with classic ML libraries like Snowpark ML or scikit-learn, you can use `generate_training_set()`, which returns a classic Snowflake table. The cell below demonstrates `generate_dataset()`.
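
To make the point-in-time lookup mentioned above concrete, here is a minimal pure-Python sketch of the idea: for each spine row, pick the most recent feature value whose timestamp is not later than the spine timestamp. This only illustrates the concept and is not how Feature Store implements the join. The function `point_in_time_lookup` and the toy rows are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_lookup(feature_rows, key, as_of):
    """Return the latest feature value for `key` with timestamp <= as_of, or None."""
    # Keep only this key's history, ordered by timestamp.
    history = sorted((ts, value) for k, ts, value in feature_rows if k == key)
    timestamps = [ts for ts, _ in history]
    # Index of the first timestamp strictly greater than `as_of`.
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None


# Toy feature history: (join key, feature timestamp, feature value).
features = [
    (161, datetime(2016, 1, 8, 9, 0), 10.5),
    (161, datetime(2016, 1, 8, 11, 0), 12.0),
    (234, datetime(2016, 1, 9, 17, 0), 8.0),
]

# Toy spine: (join key, event timestamp).
spine = [
    (161, datetime(2016, 1, 8, 10, 47, 49)),
    (234, datetime(2016, 1, 9, 16, 0)),
]

enriched = [(key, ts, point_in_time_lookup(features, key, ts)) for key, ts in spine]
```

The first spine row gets the 09:00 value rather than the 11:00 one, because the 11:00 value did not exist yet at 10:47. The second row gets no value at all, because its only feature record is later than the event. This is what keeps future information from leaking into the training data.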

Retrieve some metadata columns that are essential when generating training data.

In [9]:
label_cols = example_helper.get_label_cols()
timestamp_col = example_helper.get_training_data_timestamp_col()
excluded_cols = example_helper.get_excluded_cols()
join_keys = [key for entity in all_entities for key in entity.join_keys]
print(f'timestamp col: {timestamp_col}')
print(f'excluded cols: {excluded_cols}')
print(f'label cols: {label_cols}')
print(f'join keys: {join_keys}')

timestamp col: TPEP_PICKUP_DATETIME
excluded cols: []
label cols: ['FARE_AMOUNT']
join keys: ['PULOCATIONID', 'DOLOCATIONID']


Create a spine DataFrame sampled from the source table.

In [10]:
sample_count = 512
source_df = session.sql(f"""
    select {','.join(label_cols)}, 
            {','.join(join_keys)} 
            {',' + timestamp_col if timestamp_col is not None else ''} 
    from {source_tables[0]}""")
spine_df = source_df.sample(n=sample_count)
# preview spine dataframe
spine_df.to_pandas()

Unnamed: 0,FARE_AMOUNT,PULOCATIONID,DOLOCATIONID,TPEP_PICKUP_DATETIME
0,8.5,161,68,2016-01-08 10:47:49
1,6.0,234,114,2016-01-09 17:14:42
2,5.0,87,231,2016-01-08 13:26:55
3,7.5,170,79,2016-01-09 10:45:00
4,28.0,97,143,2016-01-07 22:11:59
...,...,...,...,...
507,13.0,231,48,2016-01-04 18:00:32
508,52.0,132,244,2016-01-06 06:16:31
509,12.5,226,162,2016-01-05 08:29:48
510,7.0,79,107,2016-01-06 18:02:51
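
The f-string in the cell above appends the timestamp column only when the example defines one. A small helper can make that branching easier to read and test. This is a sketch only; `build_spine_query` is a hypothetical name and not part of `ExampleHelper`.

```python
def build_spine_query(table, label_cols, join_keys, timestamp_col=None):
    """Compose the SELECT statement for the spine: labels, join keys, optional timestamp."""
    columns = list(label_cols) + list(join_keys)
    if timestamp_col is not None:
        columns.append(timestamp_col)
    return f"select {', '.join(columns)} from {table}"


query_with_ts = build_spine_query(
    "nyc_yellow_trips",
    ["FARE_AMOUNT"],
    ["PULOCATIONID", "DOLOCATIONID"],
    "TPEP_PICKUP_DATETIME",
)
query_without_ts = build_spine_query(
    "nyc_yellow_trips", ["FARE_AMOUNT"], ["PULOCATIONID", "DOLOCATIONID"]
)
```

The resulting string could be passed to `session.sql(...)` in the same way as the inline query.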


Generate dataset object from spine dataframe and feature views.

In [11]:
my_dataset = fs.generate_dataset(
    name="my_cool_training_dataset",
    spine_df=spine_df, 
    features=all_feature_views,
    version="4.0",
    spine_timestamp_col=timestamp_col,
    spine_label_cols=label_cols,
    exclude_columns=excluded_cols,
    desc="This is the dataset joined spine dataframe with feature views",
)

Convert the dataset to a Snowpark DataFrame and examine all the features in it.

In [12]:
training_data_df = my_dataset.read.to_snowpark_dataframe()
assert training_data_df.count() == sample_count
# drop rows that have any nulls in value. 
training_data_df = training_data_df.dropna(how='any')
training_data_df.to_pandas()

Unnamed: 0,FARE_AMOUNT,PULOCATIONID,DOLOCATIONID,TPEP_PICKUP_DATETIME,TRIP_COUNT_1H,TRIP_COUNT_5H,MEAN_FARE_2H,MEAN_FARE_5H
0,15.0,125,52,2016-01-09 01:51:54,42,164,11.920779,1.0
1,5.0,125,231,2016-01-20 18:24:09,317,1532,11.256917,1.0
2,7.0,125,88,2016-01-22 00:49:05,65,317,14.458763,1.0
3,32.5,138,48,2016-01-05 21:56:43,591,2974,28.401669,1.0
4,41.0,138,45,2016-01-06 23:03:19,50,225,28.584057,1.0
...,...,...,...,...,...,...,...,...
507,6.5,162,230,2016-01-19 08:05:47,885,1946,9.193698,1.0
508,7.5,162,234,2016-01-20 00:37:15,165,1995,11.385578,1.0
509,10.0,162,107,2016-01-21 19:45:20,692,2307,11.164576,1.0
510,10.0,162,113,2016-01-22 18:56:25,460,1692,10.909946,1.0


<a id='train-with-snowpark-ml'></a>
## Train model with Snowpark ML

Now let's train a simple random forest model and evaluate its prediction accuracy. When you call `fit()` on a DataFrame that is created from a Dataset, the linkage between the trained model and the dataset is automatically wired up. Later, you can easily retrieve the training dataset from this model, or you can query the lineage of the dataset and model. This capability is a work in progress and will be available soon in an upcoming release.

In [13]:
from snowflake.ml.modeling.ensemble import RandomForestRegressor
from snowflake.ml.modeling import metrics as snowml_metrics

def train_model_using_snowpark_ml(training_data_df):
    train, test = training_data_df.random_split([0.8, 0.2], seed=42)
    feature_columns = list(set(training_data_df.columns) - set(label_cols) - set(join_keys) - set([timestamp_col]))
    print(f"feature cols: {feature_columns}")
    
    rf = RandomForestRegressor(
        input_cols=feature_columns, label_cols=label_cols, 
        max_depth=3, n_estimators=20, random_state=42
    )

    rf.fit(train)
    predictions = rf.predict(test)

    output_label_names = ['OUTPUT_' + col for col in label_cols]
    mse = snowml_metrics.mean_squared_error(
        df=predictions, 
        y_true_col_names=label_cols, 
        y_pred_col_names=output_label_names
    )

    accuracy = 100 - snowml_metrics.mean_absolute_percentage_error(
        df=predictions,
        y_true_col_names=label_cols,
        y_pred_col_names=output_label_names
    )

    print(f"MSE: {mse}, Accuracy: {accuracy}")
    return rf

random_forest_model = train_model_using_snowpark_ml(training_data_df) 

feature cols: ['MEAN_FARE_2H', 'MEAN_FARE_5H', 'TRIP_COUNT_5H', 'TRIP_COUNT_1H']
MSE: 101.03885521242017, Accuracy: 99.50167389473202
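
A note on reading the "Accuracy" figure above. If `mean_absolute_percentage_error` follows the scikit-learn convention of returning a fraction rather than a percentage (an assumption worth checking against the API reference for your version), then `100 - MAPE` will sit close to 100 even for a weak model. The sketch below uses a hand-written MAPE and made-up numbers to show the difference between the two scalings. It does not call the Snowflake metric.

```python
def mape_fraction(y_true, y_pred):
    """Mean absolute percentage error as a fraction (scikit-learn style), not in [0, 100]."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels and predictions, for illustration only.
y_true = [10.0, 20.0, 40.0]
y_pred = [12.0, 15.0, 40.0]

mape = mape_fraction(y_true, y_pred)   # roughly 0.15, i.e. 15% average error
accuracy_as_written = 100 - mape       # stays near 100 regardless of model quality
accuracy_percent = 100 * (1 - mape)    # roughly 85 when the fraction is scaled to a percentage
```

Under that assumption, scaling the fraction by 100 before subtracting gives a number that is easier to reconcile with the MSE reported by the same cell.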


<a id="log-models-in-model-registry"></a>
## Log model in Model Registry

After the model is trained, we can save it to Model Registry so that we can manage the model and its metadata (including metrics and versions) and use it later for inference. ML lineage is also built automatically between the model, dataset and feature views.

In [14]:
from snowflake.ml.registry import Registry

registry = Registry(
    session=session, 
    database_name=session.get_current_database(), 
    schema_name=MODEL_DEMO_SCHEMA,
)

Log model into Model Registry.

In [15]:
model_name = "MY_RANDOM_FOREST_REGRESSOR_MODEL"

registry.log_model(
    model_name=model_name,
    version_name="v1",
    model=random_forest_model,
    comment="My model trained with feature views, dataset",
)



ModelVersion(
  name='MY_RANDOM_FOREST_REGRESSOR_MODEL',
  version='V1',
)

<a id="query-lineage"></a>
## Query lineage
We can now query the lineage of an object. You can call `lineage()` on any object, and it returns the objects it has a dependency relationship with in the requested direction (`upstream` or `downstream`).

In [16]:
model = registry.get_model(model_name).version("v1")
model.lineage(direction="upstream")



[Dataset(
   name='REGTEST_DB.SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO.MY_COOL_TRAINING_DATASET',
   version='4.0',
 )]

In [17]:
my_dataset.lineage(direction="downstream")

[ModelVersion(
   name='MY_RANDOM_FOREST_REGRESSOR_MODEL',
   version='V1',
 )]
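
Conceptually, the two calls above walk one hop along a dependency graph in opposite directions. The toy model below illustrates that idea with plain dictionaries. It is only a sketch and does not reflect how Snowflake stores or resolves lineage. The node labels and the function `toy_lineage` are hypothetical.

```python
# Toy lineage graph: each key is an upstream object, each value lists its direct consumers.
EDGES = {
    "feature_view:F_TRIP_PICKUP/1.0": ["dataset:MY_COOL_TRAINING_DATASET/4.0"],
    "feature_view:F_TRIP_DROPOFF/1.0": ["dataset:MY_COOL_TRAINING_DATASET/4.0"],
    "dataset:MY_COOL_TRAINING_DATASET/4.0": ["model:MY_RANDOM_FOREST_REGRESSOR_MODEL/V1"],
}


def toy_lineage(node, direction="downstream"):
    """Return the objects one hop away from `node` in the given direction."""
    if direction == "downstream":
        return sorted(EDGES.get(node, []))
    if direction == "upstream":
        return sorted(src for src, consumers in EDGES.items() if node in consumers)
    raise ValueError("direction must be 'upstream' or 'downstream'")


model_upstream = toy_lineage("model:MY_RANDOM_FOREST_REGRESSOR_MODEL/V1", "upstream")
dataset_downstream = toy_lineage("dataset:MY_COOL_TRAINING_DATASET/4.0", "downstream")
dataset_upstream = toy_lineage("dataset:MY_COOL_TRAINING_DATASET/4.0", "upstream")
```

In this toy graph the model's upstream is the dataset and the dataset's downstream is the model, mirroring the two outputs above. The dataset's upstream is the pair of feature views it was generated from.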

A known bug currently prevents the cell below from returning the Dataset as a downstream lineage object of the feature views. We are working on a fix.

In [18]:
for fv in all_feature_views:
    fv.lineage(direction='downstream')

<a id='predict-with-model'></a>
## Predict with model

Finally, we are almost ready for prediction! First, we look up the feature values from the Feature Store for the specific data records that we want to run predictions on. One of the key benefits of the Feature Store is that it automatically serves point-in-time correct feature values at prediction time. `load_feature_views_from_dataset()` gets the same feature views used in training, then `retrieve_feature_values()` looks up the latest feature values.

In [19]:
test_df = source_df.sample(n=3)

# load back feature views from dataset
fvs = fs.load_feature_views_from_dataset(my_dataset)
enriched_df = fs.retrieve_feature_values(
    test_df, 
    features=fvs,
    exclude_columns=join_keys,
    spine_timestamp_col=timestamp_col
)
enriched_df = enriched_df.drop(join_keys)
enriched_df.show()

--------------------------------------------------------------------------------------------------------------------
|"FARE_AMOUNT"  |"TPEP_PICKUP_DATETIME"  |"TRIP_COUNT_1H"  |"TRIP_COUNT_5H"  |"MEAN_FARE_2H"      |"MEAN_FARE_5H"  |
--------------------------------------------------------------------------------------------------------------------
|33.0           |2016-01-10 20:11:11     |27               |112              |11.381987577639752  |1.000000        |
|15.5           |2016-01-28 03:18:50     |35               |907              |12.427487352445194  |1.000000        |
|6.5            |2016-01-07 16:49:59     |495              |3035             |10.373705179282869  |1.000000        |
--------------------------------------------------------------------------------------------------------------------



<a id='predict-with-local-model'></a>
### [Option 1] Predict with local model
Now we can predict with the local model and the feature values retrieved from the Feature Store.

In [20]:
pred = random_forest_model.predict(enriched_df.to_pandas())
print(pred)

   FARE_AMOUNT TPEP_PICKUP_DATETIME  TRIP_COUNT_1H  TRIP_COUNT_5H  \
0         10.0  2016-01-28 08:57:11           1559           4636   
1          6.5  2016-01-29 07:02:50            137            239   
2          7.0  2016-01-09 10:06:25            225            521   

   MEAN_FARE_2H  MEAN_FARE_5H  OUTPUT_FARE_AMOUNT  
0      9.863181           1.0            8.751825  
1     23.330000           1.0           21.714971  
2      9.970138           1.0           10.083706  


<a id='predict-with-model-registry'></a>
### [Option 2] Predict with Model Registry

We can also retrieve the model from Model Registry and run predictions with it using the latest feature values.

In [21]:
# model is retrieved from Model Registry in earlier step.
restored_prediction = model.run(
    enriched_df.to_pandas(), function_name="predict")

print(restored_prediction)

   MEAN_FARE_2H  MEAN_FARE_5H  TRIP_COUNT_5H  TRIP_COUNT_1H  \
0     12.713693           1.0           4544            739   
1     13.123772           1.0           3463            245   
2      9.186856           1.0           4495            878   

   OUTPUT_FARE_AMOUNT  
0           10.083857  
1           10.548088  
2            8.751825  


<a id='cleanup'></a>
## Clean up notebook

This cell drops the schemas created at the beginning of this notebook, along with all objects that live in them, including source data tables, feature views, datasets, and models.

In [22]:
session.sql(f"DROP SCHEMA IF EXISTS {FS_DEMO_SCHEMA}").collect()
session.sql(f"DROP SCHEMA IF EXISTS {MODEL_DEMO_SCHEMA}").collect()

[Row(status='SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO_MODEL successfully dropped.')]