- Last updated on: 8/6/2024
- Required snowflake-ml-python version: >=**1.6.1**

# Feature Store API Overview

This notebook provides an overview of the Feature Store APIs. It demonstrates how to manage a Feature Store, feature views, and entities, and how to retrieve features and generate training datasets. The goal is a quick walkthrough of the most common APIs. For the full list of APIs, refer to the [API Reference page](https://docs.snowflake.com/en/developer-guide/snowpark-ml/reference/latest/feature_store).


Note: The newest snowflake-ml-python package may take some time to become available in the Snowflake Conda channel. To install the latest snowflake-ml-python package, which includes all the components used in this notebook, follow the install instructions [here](https://docs.snowflake.com/LIMITEDACCESS/snowpark-ml-library-update).

**Table of contents**:
- [Set up connection and test dataset](#setup-test-environment)
- [Manage features in Feature Store](#manage-features-in-feature-store)
  - [Initialize a Feature Store](#initialize-a-feature-store)
  - [Create entities](#create-entities)
  - [Create feature views](#create-feature-views)
  - [Add feature view versions](#add-feature-view-versions)
  - [Update feature views](#update-feature-views)
  - [Operate feature views](#operate-feature-views)
  - [Retrieve values from a feature view](#read-values-from-a-feature-view)
  - [Generate training data](#generate-training-data)
  - [Delete feature views](#delete-feature-views)
  - [Delete entities](#delete-entities)
  - [Cleanup Feature Store](#cleanup-feature-store)
- [Clean up notebook](#cleanup-notebook)

<a id='setup-test-environment'></a>
## Set up connection and test dataset

Let's start by setting up our test environment. We will create a session and a schema. The schema `FS_DEMO_SCHEMA` will be used as the Feature Store and will be cleaned up at the end of the demo. Fill in `connection_parameters` with your Snowflake connection information. See this **[guide](https://docs.snowflake.com/en/developer-guide/snowpark/python/creating-session)** for more details about how to connect to Snowflake.

In [1]:
from snowflake.snowpark import Session, context, exceptions

try:
    # Retrieve active session if in Snowpark Notebook
    session = context.get_active_session()
except exceptions.SnowparkSessionException:
    # ACTION REQUIRED: Need to manually configure Snowflake connection if using Jupyter
    connection_parameters = {
        "account": "<your snowflake account>",
        "user": "<your snowflake user>",
        "password": "<your snowflake password>",
        "role": "<your snowflake role>",
        "warehouse": "<your snowflake warehouse>",
        "database": "<your snowflake database>",
        "schema": "<your snowflake schema>",
    }
    session = Session.builder.configs(connection_parameters).create()

assert session.get_current_database() is not None, "Session must have a database for the demo."
assert session.get_current_warehouse() is not None, "Session must have a warehouse for the demo."

In [2]:
# The schema where Feature Store will be initialized and test datasets stored.
FS_DEMO_SCHEMA = "SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO"

# Make sure your role has CREATE SCHEMA privileges or USAGE privileges on the schema if it already exists.
session.sql(f"CREATE OR REPLACE SCHEMA {FS_DEMO_SCHEMA}").collect()

[Row(status='Schema SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO successfully created.')]

We have prepared some examples, which you can find in our [open source repo](https://github.com/snowflakedb/snowflake-ml-python/tree/main/snowflake/ml/feature_store/examples). Each example contains the source dataset, feature view definitions, and entity definitions used in this demo. `ExampleHelper` (included in snowflake-ml-python) sets everything up through simple APIs, so you don't have to worry about the details.

In [3]:
from snowflake.ml.feature_store.examples.example_helper import ExampleHelper

example_helper = ExampleHelper(session, session.get_current_database(), FS_DEMO_SCHEMA)
example_helper.list_examples().to_pandas()

Unnamed: 0,NAME,DESC,LABEL_COLS
0,new_york_taxi_features,Features using taxi trip data trying to predic...,TOTAL_AMOUNT
1,airline_features,Features using synthetic airline data to predi...,DEPARTING_DELAY
2,wine_quality_features,Features using wine quality data trying to pre...,quality
3,citibike_trip_features,Features using citibike trip data trying to pr...,tripduration


We can quickly look at the newly generated source tables.

In [4]:
# replace the value with the example you want to run
source_tables = example_helper.load_example('citibike_trip_features')
# display as Pandas DataFrame
for table in source_tables:
    print(f"{table}:")
    df = session.table(table).limit(5).to_pandas()
    display(df)

"REGTEST_DB".SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO.citibike_trips:


Unnamed: 0,TRIP_ID,TRIPDURATION,STARTTIME,STOPTIME,START_STATION_ID,START_STATION_NAME,START_STATION_LATITUDE,START_STATION_LONGITUDE,END_STATION_ID,END_STATION_NAME,END_STATION_LATITUDE,END_STATION_LONGITUDE,BIKEID,MEMBERSHIP_TYPE,USERTYPE,BIRTH_YEAR,GENDER
0,1,327,2013-12-05 13:09:50,2013-12-05 13:15:17,523,W 38 St & 8 Ave,40.754666,-73.991382,505,6 Ave & W 33 St,40.749013,-73.988484,15852,,Subscriber,1980,1
1,2,478,2013-12-05 13:09:52,2013-12-05 13:17:50,473,Rivington St & Chrystie St,40.721101,-73.991925,161,LaGuardia Pl & W 3 St,40.72917,-73.998102,17952,,Subscriber,1983,2
2,3,288,2013-12-05 13:09:54,2013-12-05 13:14:42,167,E 39 St & 3 Ave,40.748901,-73.976049,524,W 43 St & 6 Ave,40.755273,-73.983169,19033,,Subscriber,1988,1
3,4,1163,2013-12-05 13:10:00,2013-12-05 13:29:23,229,Great Jones St,40.727434,-73.99379,347,W Houston St & Hudson St,40.728739,-74.007488,17488,,Subscriber,1988,1
4,5,247,2013-12-05 13:10:04,2013-12-05 13:14:11,505,6 Ave & W 33 St,40.749013,-73.988484,466,W 25 St & 6 Ave,40.743954,-73.991449,15838,,Subscriber,1965,1


<a id='manage-features-in-feature-store'></a>
## Manage features in Feature Store

Now we're ready to create a Feature Store. The sections below show how to create a Feature Store, entities, and feature views, and how to work with them.

<a id='initialize-a-feature-store'></a>
### Initialize a Feature Store

Firstly, we create a new (or connect to an existing) Feature Store.

In [5]:
from snowflake.ml.feature_store import (
    FeatureStore,
    FeatureView,
    Entity,
    CreationMode,
    FeatureViewStatus,
)

fs = FeatureStore(
    session=session, 
    database=session.get_current_database(), 
    name=FS_DEMO_SCHEMA, 
    default_warehouse=session.get_current_warehouse(),
    creation_mode=CreationMode.CREATE_IF_NOT_EXIST,
)

<a id='create-entities'></a>
### Create entities

Before we can create feature views, we need to create entities. The cell below registers the entities that are pre-defined for this example and loaded by `example_helper.load_entities()`.

In [6]:
for e in example_helper.load_entities():
    fs.register_entity(e)
all_entities_df = fs.list_entities()
all_entities_df.show()

--------------------------------------------------------------------------------
|"NAME"          |"JOIN_KEYS"         |"DESC"                     |"OWNER"     |
--------------------------------------------------------------------------------
|END_STATION_ID  |["END_STATION_ID"]  |The id of an end station.  |REGTEST_RL  |
|TRIP_ID         |["TRIP_ID"]         |The id of a trip.          |REGTEST_RL  |
--------------------------------------------------------------------------------
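
For readers who want to define entities themselves rather than load them from the example, the sketch below builds the two entities listed in the output above. It assumes the `Entity` constructor accepts `name`, `join_keys`, and `desc` keyword arguments (check the API reference for your installed version). `build_example_entities` is a hypothetical helper, and the class is passed in as an argument so that the function can be exercised without a Snowflake session.

```python
from typing import Any, Callable, List


def build_example_entities(entity_cls: Callable[..., Any]) -> List[Any]:
    """Create the two citibike entities shown in the listing above.

    `entity_cls` is expected to be `snowflake.ml.feature_store.Entity`
    (assumed to take `name`, `join_keys`, and `desc`).
    """
    return [
        entity_cls(
            name="END_STATION_ID",
            join_keys=["END_STATION_ID"],
            desc="The id of an end station.",
        ),
        entity_cls(
            name="TRIP_ID",
            join_keys=["TRIP_ID"],
            desc="The id of a trip.",
        ),
    ]
```

With a live session, this could be called as `build_example_entities(Entity)`, with each result passed to `fs.register_entity()`.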



You can get registered entities by name from Feature Store.

In [7]:
# if you are running with other examples besides citibike_trip_features, replace with other entity name.
entity_name = 'end_station_id'
my_entity = fs.get_entity(entity_name)

<a id='create-feature-views'></a>
### Create feature views

Next, we can register feature views. Feature views are also pre-defined in our repository. You can find the definitions [here](https://github.com/snowflakedb/snowflake-ml-python/tree/main/snowflake/ml/feature_store/examples).

In [8]:
for fv in example_helper.load_draft_feature_views():
    fs.register_feature_view(
        feature_view=fv,
        version='1.0'
    )

all_fvs_df = fs.list_feature_views().select('name', 'version', 'desc', 'refresh_freq')
all_fvs_df.show()

----------------------------------------------------------------------------------
|"NAME"     |"VERSION"  |"DESC"                                 |"REFRESH_FREQ"  |
----------------------------------------------------------------------------------
|F_STATION  |1.0        |Station features refreshed every day.  |1 day           |
|F_TRIP     |1.0        |Static trip features                   |NULL            |
----------------------------------------------------------------------------------



Note that you can specify feature view versions and attach descriptive comments in the “DESC” field to make search and discovery of features easier. 

<a id='add-feature-view-versions'></a>
### Add feature view versions

We can also add a new version of a feature view by registering it under the same name as an existing feature view but with a different version.

In [9]:
for fv in example_helper.load_draft_feature_views():
    fv.desc = f'{fv.name}/2.0 with new desc.'
    fs.register_feature_view(
        feature_view=fv,
        version='2.0'
    )

all_fvs_df = fs.list_feature_views().select('name', 'version', 'desc', 'refresh_freq')
all_fvs_df.show()


----------------------------------------------------------------------------------
|"NAME"     |"VERSION"  |"DESC"                                 |"REFRESH_FREQ"  |
----------------------------------------------------------------------------------
|F_STATION  |1.0        |Station features refreshed every day.  |1 day           |
|F_STATION  |2.0        |F_STATION/2.0 with new desc.           |1 day           |
|F_TRIP     |1.0        |Static trip features                   |NULL            |
|F_TRIP     |2.0        |F_TRIP/2.0 with new desc.              |NULL            |
----------------------------------------------------------------------------------
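
Once several versions exist, it can help to see them grouped by feature view name. The sketch below is a plain-Python helper (the name `versions_by_name` is hypothetical) that works on rows shaped like the listing above, that is, mappings with `NAME` and `VERSION` keys.

```python
from typing import Dict, Iterable, List, Mapping


def versions_by_name(rows: Iterable[Mapping[str, str]]) -> Dict[str, List[str]]:
    """Group version strings by feature view name.

    Versions are sorted as plain strings, so "10.0" sorts before "2.0".
    """
    grouped: Dict[str, List[str]] = {}
    for row in rows:
        grouped.setdefault(row["NAME"], []).append(row["VERSION"])
    return {name: sorted(versions) for name, versions in grouped.items()}
```

With a live session, the input could be produced with `[r.as_dict() for r in fs.list_feature_views().collect()]`.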



<a id='update-feature-views'></a>
### Update feature views

After a feature view is registered, it is materialized in the Snowflake backend. You can still update some metadata of a registered feature view with `update_feature_view`. The cell below updates the `desc` of a managed feature view. See our [API reference](https://docs.snowflake.com/en/developer-guide/snowpark-ml/reference/latest/api/feature_store/snowflake.ml.feature_store.FeatureStore) page for the full list of metadata that can be updated.

In [10]:
# if you are running other examples besides citibike_trip_features, replace with other feature view name.
target_feature_view = 'f_station'
updated_fv = fs.update_feature_view(
    name=target_feature_view,
    version='1.0',
    desc=f'Updated desc for {target_feature_view}.', 
)

assert updated_fv.desc == f'Updated desc for {target_feature_view}.'
fs.list_feature_views(feature_view_name=target_feature_view) \
    .select('name', 'version', 'desc', 'refresh_freq', 'scheduling_state').show()

----------------------------------------------------------------------------------------------
|"NAME"     |"VERSION"  |"DESC"                        |"REFRESH_FREQ"  |"SCHEDULING_STATE"  |
----------------------------------------------------------------------------------------------
|F_STATION  |1.0        |Updated desc for f_station.   |1 day           |ACTIVE              |
|F_STATION  |2.0        |F_STATION/2.0 with new desc.  |1 day           |ACTIVE              |
----------------------------------------------------------------------------------------------



<a id='operate-feature-views'></a>
### Operate feature views

For **managed feature views**, you can suspend, resume, or manually refresh the backend pipelines. A managed feature view is an automated feature pipeline that computes the features on a given schedule. You create a managed feature view by setting the `refresh_freq`. In contrast, a **static feature view** is created when `refresh_freq` is set to None.

In [11]:
registered_fv = fs.get_feature_view(target_feature_view, '1.0')
suspended_fv = fs.suspend_feature_view(registered_fv)
assert suspended_fv.status == FeatureViewStatus.SUSPENDED
fs.list_feature_views().select('name', 'version', 'desc', 'refresh_freq', 'scheduling_state').show()

----------------------------------------------------------------------------------------------
|"NAME"     |"VERSION"  |"DESC"                        |"REFRESH_FREQ"  |"SCHEDULING_STATE"  |
----------------------------------------------------------------------------------------------
|F_STATION  |1.0        |Updated desc for f_station.   |1 day           |SUSPENDED           |
|F_STATION  |2.0        |F_STATION/2.0 with new desc.  |1 day           |ACTIVE              |
|F_TRIP     |1.0        |Static trip features          |NULL            |NULL                |
|F_TRIP     |2.0        |F_TRIP/2.0 with new desc.     |NULL            |NULL                |
----------------------------------------------------------------------------------------------



In [12]:
resumed_fv = fs.resume_feature_view(suspended_fv)
assert resumed_fv.status == FeatureViewStatus.ACTIVE
fs.list_feature_views().select('name', 'version', 'desc', 'refresh_freq', 'scheduling_state').show()

----------------------------------------------------------------------------------------------
|"NAME"     |"VERSION"  |"DESC"                        |"REFRESH_FREQ"  |"SCHEDULING_STATE"  |
----------------------------------------------------------------------------------------------
|F_STATION  |1.0        |Updated desc for f_station.   |1 day           |ACTIVE              |
|F_STATION  |2.0        |F_STATION/2.0 with new desc.  |1 day           |ACTIVE              |
|F_TRIP     |1.0        |Static trip features          |NULL            |NULL                |
|F_TRIP     |2.0        |F_TRIP/2.0 with new desc.     |NULL            |NULL                |
----------------------------------------------------------------------------------------------
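
The distinction between managed and static feature views is visible in the listing: static feature views show `NULL` for `REFRESH_FREQ`. As a small illustration, the hypothetical helper below splits rows shaped like that listing into the two kinds. It only inspects the `REFRESH_FREQ` value and does not call any Feature Store API.

```python
from typing import Any, Iterable, List, Mapping, Tuple

NameVersion = Tuple[str, str]


def split_managed_and_static(
    rows: Iterable[Mapping[str, Any]],
) -> Tuple[List[NameVersion], List[NameVersion]]:
    """Return (managed, static) lists of (name, version) pairs.

    A row counts as static when its REFRESH_FREQ is None (shown as NULL
    in the listing), and as managed otherwise.
    """
    managed: List[NameVersion] = []
    static: List[NameVersion] = []
    for row in rows:
        key = (row["NAME"], row["VERSION"])
        if row["REFRESH_FREQ"] is None:
            static.append(key)
        else:
            managed.append(key)
    return managed, static
```

Only the managed pairs are meaningful targets for `suspend_feature_view()`, `resume_feature_view()`, and `refresh_feature_view()`.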



In [13]:
history_df_before = fs.get_refresh_history(resumed_fv).order_by('REFRESH_START_TIME')
history_df_before.show()

----------------------------------------------------------------------------------------------------------------------
|"NAME"         |"STATE"    |"REFRESH_START_TIME"              |"REFRESH_END_TIME"                |"REFRESH_ACTION"  |
----------------------------------------------------------------------------------------------------------------------
|F_STATION$1.0  |SUCCEEDED  |2024-08-06 09:41:17.171000-07:00  |2024-08-06 09:41:17.547000-07:00  |INCREMENTAL       |
|F_STATION$1.0  |SUCCEEDED  |2024-08-06 09:42:56.835000-07:00  |2024-08-06 09:42:57.612000-07:00  |INCREMENTAL       |
|F_STATION$1.0  |SUCCEEDED  |2024-08-06 09:43:34.390000-07:00  |2024-08-06 09:43:34.884000-07:00  |INCREMENTAL       |
|F_STATION$1.0  |SUCCEEDED  |2024-08-06 09:44:10.294000-07:00  |2024-08-06 09:44:10.860000-07:00  |INCREMENTAL       |
----------------------------------------------------------------------------------------------------------------------



The cell below manually refreshes a feature view, which triggers feature computation on the latest source data. You can check the refresh history with `get_refresh_history()`; compared with the previous `get_refresh_history()` output, it now contains an additional entry.

In [14]:
fs.refresh_feature_view(resumed_fv)
history_df_after = fs.get_refresh_history(resumed_fv).order_by('REFRESH_START_TIME')
history_df_after.show()

----------------------------------------------------------------------------------------------------------------------
|"NAME"         |"STATE"    |"REFRESH_START_TIME"              |"REFRESH_END_TIME"                |"REFRESH_ACTION"  |
----------------------------------------------------------------------------------------------------------------------
|F_STATION$1.0  |SUCCEEDED  |2024-08-06 09:41:17.171000-07:00  |2024-08-06 09:41:17.547000-07:00  |INCREMENTAL       |
|F_STATION$1.0  |SUCCEEDED  |2024-08-06 09:42:56.835000-07:00  |2024-08-06 09:42:57.612000-07:00  |INCREMENTAL       |
|F_STATION$1.0  |SUCCEEDED  |2024-08-06 09:43:34.390000-07:00  |2024-08-06 09:43:34.884000-07:00  |INCREMENTAL       |
|F_STATION$1.0  |SUCCEEDED  |2024-08-06 09:44:10.294000-07:00  |2024-08-06 09:44:10.860000-07:00  |INCREMENTAL       |
|F_STATION$1.0  |SUCCEEDED  |2024-08-06 09:44:48.016000-07:00  |2024-08-06 09:44:48.449000-07:00  |INCREMENTAL       |
----------------------------------------------------------------------------------------------------------------------
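
To pick out which entries a manual refresh added, the two history listings can be compared. The sketch below is a hypothetical plain-Python helper that works on already-collected rows (mappings with a `REFRESH_START_TIME` key) and treats the start time as the identity of a refresh.

```python
from typing import Any, List, Mapping, Sequence


def new_refreshes(
    before: Sequence[Mapping[str, Any]],
    after: Sequence[Mapping[str, Any]],
    key: str = "REFRESH_START_TIME",
) -> List[Mapping[str, Any]]:
    """Return the rows of `after` whose start time does not appear in `before`."""
    seen = {row[key] for row in before}
    return [row for row in after if row[key] not in seen]
```

Because Snowpark DataFrames are evaluated lazily, the "before" rows would need to be collected (for example with `[r.as_dict() for r in history_df_before.collect()]`) before `refresh_feature_view()` is called; collecting them afterwards would already include the new entry.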

<a id='read-values-from-a-feature-view'></a>
### Retrieve values from a feature view 

You can read the feature values of a registered feature view with `read_feature_view()`.

In [15]:
feature_value_df = fs.read_feature_view(resumed_fv)
feature_value_df.show()

------------------------------------------------------------------------
|"END_STATION_ID"  |"F_COUNT"  |"F_AVG_LATITUDE"  |"F_AVG_LONGTITUDE"  |
------------------------------------------------------------------------
|505               |483        |40.74901271       |-73.98848395        |
|161               |429        |40.72917025       |-73.99810231        |
|347               |440        |40.72873888       |-74.00748842        |
|466               |425        |40.74395411       |-73.99144871        |
|459               |456        |40.746745         |-74.007756          |
|247               |241        |40.73535398       |-74.00483090999998  |
|127               |481        |40.73172428       |-74.00674436        |
|2000              |121        |40.70255088       |-73.98940236        |
|514               |272        |40.76087502       |-74.00277668        |
|195               |219        |40.70905623       |-74.01043382        |
------------------------------------------------------------------------

<a id='generate-training-data'></a>
### Generate training data

We can easily generate training data from Feature Store and output it either as a [Dataset object](https://docs.snowflake.com/en/developer-guide/snowpark-ml/dataset) or as a Snowpark DataFrame.
The cell below creates a spine DataFrame by randomly sampling some entity keys from the source table. `generate_dataset()` then creates a Dataset object by populating `spine_df` with the corresponding feature values from the selected feature views.

In [16]:
entity_key_names = ','.join(my_entity.join_keys)
spine_df = session.sql(f"select {entity_key_names} from {source_tables[0]}").sample(n=1000)

Use `generate_dataset()` to output a Dataset object.

In [17]:
training_fv = fs.get_feature_view(target_feature_view, '1.0')

my_dataset = fs.generate_dataset(
    name='my_cool_dataset',
    version='first',
    spine_df=spine_df,
    features=[training_fv],
    desc='This is my dataset joined with feature views',
)

Convert dataset to Pandas DataFrame and look at the first 10 rows.

In [18]:
my_dataset.read.to_pandas().head(10)

Unnamed: 0,END_STATION_ID,F_COUNT,F_AVG_LATITUDE,F_AVG_LONGTITUDE
0,253,391,40.735439,-73.994537
1,368,486,40.730385,-74.002151
2,345,451,40.736492,-73.997047
3,477,759,40.756405,-73.990028
4,157,147,40.690891,-73.996124
5,521,956,40.75045,-73.994812
6,375,497,40.726795,-73.996948
7,350,258,40.715595,-73.98703
8,337,112,40.7038,-74.008385
9,497,911,40.737049,-73.990089


A Dataset object materializes data in Parquet files on internal stages. Alternatively, you can use `generate_training_set()` to output training data as a DataFrame.

In [19]:
training_data_df = fs.generate_training_set(
    spine_df=spine_df,
    features=[training_fv]
)

training_data_df.show()

------------------------------------------------------------------------
|"END_STATION_ID"  |"F_COUNT"  |"F_AVG_LATITUDE"  |"F_AVG_LONGTITUDE"  |
------------------------------------------------------------------------
|195               |219        |40.70905623       |-74.01043382        |
|398               |69         |40.69165183       |-73.9999786         |
|329               |361        |40.72043411       |-74.01020609        |
|498               |368        |40.74854862       |-73.98808416        |
|319               |252        |40.71336124       |-74.00937622        |
|369               |265        |40.73224119       |-74.00026394        |
|459               |456        |40.746745         |-74.007756          |
|311               |228        |40.7172274        |-73.98802084        |
|480               |242        |40.76669671       |-73.99061728        |
|127               |481        |40.73172428       |-74.00674436        |
------------------------------------------------------------------------
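
As a mental model, both `generate_dataset()` and `generate_training_set()` enrich each spine row with the feature values that share its entity join keys. The plain-Python sketch below illustrates that idea on lists of dictionaries. It is a simplification under stated assumptions, not Snowflake's implementation: it assumes left-join semantics (spine rows without a match keep `None` for the feature columns) and ignores timestamps entirely. The function name is hypothetical.

```python
from typing import Any, Dict, List, Mapping, Sequence


def join_spine_with_features(
    spine_rows: Sequence[Mapping[str, Any]],
    feature_rows: Sequence[Mapping[str, Any]],
    join_keys: Sequence[str],
) -> List[Dict[str, Any]]:
    """Left-join feature values onto spine rows by the entity join keys."""
    # Feature columns are every column of the feature rows except the join keys.
    first = feature_rows[0] if feature_rows else {}
    feature_cols = [col for col in first if col not in join_keys]
    # Index the feature rows by their join-key values for direct lookup.
    lookup = {tuple(row[k] for k in join_keys): row for row in feature_rows}

    joined: List[Dict[str, Any]] = []
    for spine_row in spine_rows:
        match = lookup.get(tuple(spine_row[k] for k in join_keys))
        enriched = dict(spine_row)
        for col in feature_cols:
            enriched[col] = match[col] if match is not None else None
        joined.append(enriched)
    return joined
```

In the notebook, the spine holds only `END_STATION_ID`, so every output row is that key plus the columns of the `f_station` feature view.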

<a id='delete-feature-views'></a>
### Delete feature views

Feature views can be deleted via `delete_feature_view()`.

Warning: Deleting a feature view may break downstream dependencies for other feature views or models that depend on the feature view being deleted.

In [20]:
for row in fs.list_feature_views().collect():
    fv = fs.get_feature_view(row['NAME'], row['VERSION'])
    fs.delete_feature_view(fv)

all_fvs_df = fs.list_feature_views().select('name', 'version') 
assert all_fvs_df.count() == 0, "Expected 0 feature views after deletion."
all_fvs_df.show()

----------------------
|"NAME"  |"VERSION"  |
----------------------
|        |           |
----------------------



<a id='delete-entities'></a>
### Delete entities

You can delete an entity with `delete_entity()`. Note that it first checks whether any feature views are still registered on the entity; if there are, the deletion fails.

In [21]:
for row in fs.list_entities().collect():
    fs.delete_entity(row['NAME'])

all_entities_df = fs.list_entities()
assert all_entities_df.count() == 0, "Expected 0 entities after deletion."
all_entities_df.show()

-------------------------------------------
|"NAME"  |"JOIN_KEYS"  |"DESC"  |"OWNER"  |
-------------------------------------------
|        |             |        |         |
-------------------------------------------



<a id='cleanup-feature-store'></a>
### Cleanup Feature Store (experimental) 

Currently we provide an experimental API to delete all entities and feature views in a Feature Store for easy cleanup. If "dryrun" is set to True (the default) then `fs._clear()` only prints the objects that will be deleted. If "dryrun" is set to False, it performs the deletion.

In [22]:
fs._clear(dryrun=False)

assert fs.list_feature_views().count() == 0, "Expected 0 feature views after deletion."
assert fs.list_entities().count() == 0, "Expected 0 entities after deletion."
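
Because the deletion cannot be undone, one cautious pattern is to run the dry run first and delete only after an explicit confirmation. The sketch below wraps the `fs._clear(dryrun=...)` call from the cell above; the wrapper name `clear_feature_store` and its `confirmed` flag are hypothetical, and `_clear` remains an experimental API.

```python
from typing import Any


def clear_feature_store(fs: Any, confirmed: bool = False) -> bool:
    """Run a dry run, then delete only if `confirmed` is True.

    Returns True if the deletion was performed, False otherwise.
    """
    # The dry run only prints the objects that would be deleted.
    fs._clear(dryrun=True)
    if not confirmed:
        return False
    fs._clear(dryrun=False)
    return True
```

Calling `clear_feature_store(fs)` only previews the objects; `clear_feature_store(fs, confirmed=True)` previews them and then deletes.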


<a id='cleanup-notebook'></a>
## Clean up notebook

In [23]:
session.sql(f"DROP SCHEMA IF EXISTS {FS_DEMO_SCHEMA}").collect()

[Row(status='SNOWFLAKE_FEATURE_STORE_NOTEBOOK_DEMO successfully dropped.')]