# Tutorial: Develop a feature set and register with managed feature store

Azure ML managed feature store lets you discover, create, and operationalize features. Features are the connective tissue of the ML lifecycle, from prototyping, where you experiment with various features, to operationalization, where models are deployed and feature data is looked up during inference. For the basic concepts of feature store, see [feature store concepts](fs-concepts).

In this tutorial series, you will see how features integrate all the phases of the ML lifecycle:
Part 1 (this tutorial): Create

This tutorial is the first part of a three-part series. In this tutorial, you will:
- Create a new minimal feature store resource
- Develop and test featureset locally with feature transformation capability
- Register a feature-store entity with the feature store
- Register the featureset that you developed with the feature store
- Generate a sample training dataframe using the features you created
- Enable offline materialization on the feature set, and backfill the feature data


#### Important

This feature is currently in public preview. This preview version is provided without a service-level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).

## Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

* An Azure Machine Learning workspace. If you don't have one, use the steps in the [Quickstart: Create workspace resources](https://learn.microsoft.com/en-us/azure/machine-learning/quickstart-create-resources?view=azureml-api-2) article to create one.
* To perform the steps in this article, your user account must be assigned the Owner or Contributor role on the resource group where the feature store will be created.

## Setup 

#### Prepare the notebook environment for development
Note: This tutorial uses an Azure ML Spark notebook for development.

1. Clone the examples repository to your local machine: To run the tutorial, first clone the [examples repository (azureml-examples)](https://github.com/azure/azureml-examples)

```bash
git clone --depth 1 https://github.com/Azure/azureml-examples
```

Alternatively you can download a zip file from the [examples repository (azureml-examples)](https://github.com/azure/azureml-examples): click on the `code` dropdown and click `Download ZIP`. Then unzip the contents into a folder in your local machine.

2. Upload the feature store samples directory to project workspace: Open Azure ML studio UI of your Azure ML workspace -> click on "Notebooks" in left nav -> right click on your user name in the directory listing -> click "upload folder" -> select the feature store samples folder from the cloned directory path: `azureml-examples/sdk/python/featurestore-sample`

3. Either create a new notebook and run the instructions in this document step by step, or open the existing notebook titled `1.Develop a feature set and register with managed feature store.ipynb` and run it step by step. Keep this document open and refer to it for detailed explanations of the steps. The notebooks are available in the folder `featurestore_sample/notebooks`. Select either the `sdk_only` folder or the `sdk_and_cli` folder. The latter mixes CLI commands with the Python SDK, which is useful in CI/CD scenarios.

4. In the "Compute" dropdown in the top nav, select "Serverless Spark Compute". This can take 1-2 minutes. Wait for the status bar at the top to display `configure session`.

5. Click on "configure session" -> click on "Python packages" -> click on "upload conda file" -> select the file `azureml-examples/sdk/python/featurestore-sample/project/env/conda.yml` from your local machine. Also increase the session timeout (idle time) if you want to avoid frequent serverless Spark cluster startups.

__Important:__ Except for this step, you need to rerun all the other steps every time you start a new Spark session or the session times out.


#### Start spark session

In [None]:
# Run this cell to start the Spark session (any code block will start the session). This can take around 10 minutes.
print("start spark session")

#### Setup root directory for the samples

In [None]:
import os

# Please update <your_user_alias> below (or any custom directory you uploaded the samples to).
# You can find the name from the directory structure in the left nav
root_dir = "./Users/<your_user_alias>/featurestore_sample"

if os.path.isdir(root_dir):
    print("The folder exists.")
else:
    print("The folder does not exist. Please create or fix the path")

## Note
Feature store vs. project workspace: You use a feature store to reuse features across projects. You use a project workspace (the current workspace) to train models and run inference, leveraging features from feature stores. Many project workspaces can share and reuse the same feature store.

## Note
In this tutorial, you will use two SDKs:

1. Feature store CRUD SDK: You use the same SDK, MLClient (package name `azure-ai-ml`), that you use with an Azure ML workspace. It is used for feature store CRUD operations (create, read, update, and delete) on the feature store, feature sets, and feature store entities. This works because a feature store is implemented as a type of workspace.
2. Feature store core SDK: This SDK (`azureml-featurestore`) is meant for feature set development and consumption (you will learn more about these operations later):
- List/Get registered feature set
- Generate/resolve feature retrieval spec
- Execute featureset definition to generate Spark dataframe
- Generate training data using a point-in-time join

For this tutorial, you do not need to install either of these explicitly, because the conda YAML file from the setup step above includes them.

## Step 1: Create a minimal feature store

#### Step 1a: Set feature store parameters
Set the name, location, and other values for the feature store.

In [None]:
# We use the subscription, resource group, region of this active project workspace.
# You can optionally replace them to create the resources in a different subscription/resource group, or use existing resources
import os

featurestore_name = "my-featurestore"
featurestore_location = "eastus"
featurestore_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
featurestore_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]
version = "<VERSION>"

#### Step 1b: Create the feature store

In [None]:
from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    FeatureStore,
    FeatureStoreEntity,
    FeatureSet,
)
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

ml_client = MLClient(
    AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
)


fs = FeatureStore(name=featurestore_name, location=featurestore_location)
# wait for featurestore creation
fs_poller = ml_client.feature_stores.begin_create(fs, update_dependent_resources=True)
print(fs_poller.result())

#### Step 1c: Initialize AzureML feature store core SDK client
As explained above, this client is used to develop and consume features.

In [None]:
# feature store client
from azureml.featurestore import FeatureStoreClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

featurestore = FeatureStoreClient(
    credential=AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
    name=featurestore_name,
)

## Step 2: Prototype and develop a transaction rolling aggregation featureset in this notebook

#### Step 2a: Explore the transactions source data

#### Note
The sample data used in this notebook is hosted in a publicly accessible blob container. It can only be read in Spark via the `wasbs` driver. When you create feature sets using your own source data, host the data in an ADLS Gen2 account and use the `abfss` driver in the data path.
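
The two path schemes differ in the scheme name and the storage endpoint suffix. As a minimal sketch, the helpers below compose each kind of path; `wasbs_uri` and `abfss_uri` are hypothetical names used only for illustration (they are not part of any SDK), and `mycontainer` and `mystorageaccount` are placeholders.

```python
def wasbs_uri(container: str, account: str, path: str) -> str:
    """Blob storage path, as used by the sample data in this tutorial."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def abfss_uri(container: str, account: str, path: str) -> str:
    """ADLS Gen2 path, for your own source data."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# The sample transactions data used in this tutorial.
sample_path = wasbs_uri(
    "data",
    "azuremlexampledata",
    "feature-store-prp/datasources/transactions-source/*.parquet",
)
print(sample_path)

# Placeholder container and account names for your own data.
own_path = abfss_uri("mycontainer", "mystorageaccount", "datasources/transactions/*.parquet")
print(own_path)
```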

In [None]:
# read the sample transactions source data from the public blob container
transactions_source_data_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/datasources/transactions-source/*.parquet"
transactions_src_df = spark.read.parquet(transactions_source_data_path)

display(transactions_src_df.head(5))
# Note: display(transactions_src_df.head(5)) displays the timestamp column in a different format. You can call transactions_src_df.show() to see the correctly formatted values

#### Step 2b: Develop a transactions featureset locally

A feature set specification is a self-contained definition of a feature set that can be developed and tested locally.

Let's create the following rolling window aggregate features:
- transactions 3-day count
- transactions amount 3-day sum
- transactions amount 3-day avg
- transactions 7-day count
- transactions amount 7-day sum
- transactions amount 7-day avg

__Action__:
- Inspect the feature transformation code file: `featurestore/featuresets/transactions/transformation_code/transaction_transform.py`. You will see how the rolling aggregation is defined for the features. This is a Spark transformer.

To understand the feature set and transformations in more detail, see [feature store concepts](fs-concepts-url-todo) and [transformation concepts](fs-transformation-concepts-todo).
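
To make the rolling aggregates concrete, here is a plain-Python sketch of the same idea. It is not the Spark transformer from the sample. The column name `transactionAmount`, the helper `rolling_features`, and the choice of a window that includes the current row are assumptions made for illustration, and the sample rows are made up.

```python
from datetime import datetime, timedelta


def rolling_features(rows, days):
    """For each row, aggregate the same account's rows in the trailing window.

    The window covers (timestamp - days, timestamp], so it includes the current row.
    """
    window = timedelta(days=days)
    out = []
    for row in rows:
        amounts = [
            r["transactionAmount"]
            for r in rows
            if r["accountID"] == row["accountID"]
            and row["timestamp"] - window < r["timestamp"] <= row["timestamp"]
        ]
        out.append(
            {
                "accountID": row["accountID"],
                "timestamp": row["timestamp"],
                f"transaction_{days}d_count": len(amounts),
                f"transaction_amount_{days}d_sum": sum(amounts),
                f"transaction_amount_{days}d_avg": sum(amounts) / len(amounts),
            }
        )
    return out


# Made-up sample rows for a single account.
rows = [
    {"accountID": "A1", "timestamp": datetime(2023, 1, 1), "transactionAmount": 10.0},
    {"accountID": "A1", "timestamp": datetime(2023, 1, 2), "transactionAmount": 20.0},
    {"accountID": "A1", "timestamp": datetime(2023, 1, 6), "transactionAmount": 30.0},
]

three_day = rolling_features(rows, days=3)
seven_day = rolling_features(rows, days=7)
print(three_day[-1])
print(seven_day[-1])
```

For the last row, the 3-day window contains only that row, while the 7-day window contains all three rows, so the two sets of features diverge.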

In [None]:
from azureml.featurestore import create_feature_set_spec, FeatureSetSpec
from azureml.featurestore.contracts import (
    DateTimeOffset,
    FeatureSource,
    TransformationCode,
    Column,
    ColumnType,
    SourceType,
    TimestampColumn,
)


transactions_featureset_code_path = (
    root_dir + "/featurestore/featuresets/transactions/transformation_code"
)

transactions_featureset_spec = create_feature_set_spec(
    source=FeatureSource(
        type=SourceType.parquet,
        path="wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/datasources/transactions-source/*.parquet",
        timestamp_column=TimestampColumn(name="timestamp"),
        source_delay=DateTimeOffset(days=0, hours=0, minutes=20),
    ),
    transformation_code=TransformationCode(
        path=transactions_featureset_code_path,
        transformer_class="transaction_transform.TransactionFeatureTransformer",
    ),
    index_columns=[Column(name="accountID", type=ColumnType.string)],
    source_lookback=DateTimeOffset(days=7, hours=0, minutes=0),
    temporal_join_lookback=DateTimeOffset(days=1, hours=0, minutes=0),
    infer_schema=True,
)
# Generate a spark dataframe from the feature set specification
transactions_fset_df = transactions_featureset_spec.to_spark_dataframe()
# display few records
display(transactions_fset_df.head(5))
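
A note on `source_lookback`: the 7-day aggregates for the first day of a feature window depend on source rows from the preceding seven days. The sketch below assumes that the lookback simply moves the start of the source read range earlier. `source_read_window` is a hypothetical helper, not part of the SDK, and `source_delay` is not modeled here.

```python
from datetime import datetime, timedelta


def source_read_window(feature_window_start, feature_window_end, source_lookback):
    """Return the (start, end) range of source data needed for a feature window.

    Assumption: the lookback only moves the start earlier; the end is unchanged.
    """
    return feature_window_start - source_lookback, feature_window_end


read_start, read_end = source_read_window(
    feature_window_start=datetime(2023, 1, 1),
    feature_window_end=datetime(2023, 4, 1),
    source_lookback=timedelta(days=7),
)
print(read_start, read_end)
```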

#### Step 2c:  Export as feature set spec
To register the feature set spec with the feature store, it must be saved in a specific format.
Action: After you run the cell below, inspect the generated `transactions` feature set spec. Open this file from the file tree: `featurestore/featuresets/transactions/spec/FeaturesetSpec.yaml`

Spec contains these important elements:

1. `source`: a reference to storage. In this case, a parquet file in blob storage.
1. `features`: list of features and their datatypes. If you provide transformation code (see Day 2 section), the code has to return a dataframe that maps to the features and datatypes.
1. `index_columns`: the join keys required to access values from the feature set

Learn more about it in the [top level feature store entities document](fs-concepts-todo) and the [feature set spec yaml reference](reference-yaml-featureset-spec.md).

The additional benefit of persisting it is that it can be source controlled.

In [None]:
import os

# create a new folder to dump the feature set spec
transactions_featureset_spec_folder = (
    root_dir + "/featurestore/featuresets/transactions/spec"
)

# check if the folder exists, create one if not
if not os.path.exists(transactions_featureset_spec_folder):
    os.makedirs(transactions_featureset_spec_folder)

transactions_featureset_spec.dump(transactions_featureset_spec_folder, overwrite=True)

## Step 3: Register a feature-store entity
An entity helps enforce the best practice that the same join key definitions are used across feature sets that use the same logical entities. Examples of entities are account and customer. Entities are typically created once and reused across feature sets. For the basic concepts of feature store, see [feature store concepts](fs-concepts).
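
As an illustration of what a shared join key definition buys you, the sketch below flags feature sets whose index columns differ from those of the entity they reference. The `Entity` and `FeatureSetDef` classes and the `mismatched_feature_sets` helper are hypothetical stand-ins written for this example. They are not the SDK classes used later in this tutorial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    index_columns: tuple  # (column name, column type) pairs


@dataclass(frozen=True)
class FeatureSetDef:
    name: str
    entity: Entity
    index_columns: tuple


def mismatched_feature_sets(feature_sets):
    """Return names of feature sets whose join keys differ from their entity's."""
    return [fs.name for fs in feature_sets if fs.index_columns != fs.entity.index_columns]


account = Entity(name="account", index_columns=(("accountID", "string"),))

feature_sets = [
    FeatureSetDef("transactions", account, (("accountID", "string"),)),
    # Hypothetical feature set that declares the key with a different type.
    FeatureSetDef("accounts", account, (("accountID", "integer"),)),
]
print(mismatched_feature_sets(feature_sets))
```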

#### Step 3a: Initialize the Feature Store CRUD client

As explained at the beginning of this tutorial, MLClient is used for CRUD operations on feature store assets. The code below looks up the feature store created in an earlier step. You cannot reuse the `ml_client` from above, because it is scoped at the resource group level, which is a prerequisite for creating a feature store. The client below is scoped at the feature store level.
 

In [None]:
# mlclient on feature store
fs_client = MLClient(
    AzureMLOnBehalfOfCredential(),
    featurestore_subscription_id,
    featurestore_resource_group_name,
    featurestore_name,
)

#### Step 3b: Register `account` entity with the feature store
Create an account entity that has the join key `accountID` of type `string`.

In [None]:
from azure.ai.ml.entities import DataColumn, DataColumnType

account_entity_config = FeatureStoreEntity(
    name="account",
    version=version,
    index_columns=[DataColumn(name="accountID", type=DataColumnType.STRING)],
    stage="Development",
    description="This entity represents user account index key accountID.",
    tags={"data_type": "nonPII"},
)

poller = fs_client.feature_store_entities.begin_create_or_update(account_entity_config)
print(poller.result())

## Step 4: Register the transaction featureset with the featurestore
You register a feature set asset with the feature store so that you can share it with others and reuse it. You also get managed capabilities like versioning and materialization (covered later in this tutorial series).

The feature set asset references the feature set spec that you created earlier, plus additional properties such as version and materialization settings.

In [None]:
from azure.ai.ml.entities import FeatureSetSpecification

transaction_fset_config = FeatureSet(
    name="transactions",
    version=version,
    description="7-day and 3-day rolling aggregation of transactions featureset",
    entities=[f"azureml:account:{version}"],
    stage="Development",
    specification=FeatureSetSpecification(path=transactions_featureset_spec_folder),
    tags={"data_type": "nonPII"},
)

poller = fs_client.feature_sets.begin_create_or_update(transaction_fset_config)
print(poller.result())

#### Explore the FeatureStore UI
* Go to the [Azure ML global landing page](https://ml.azure.com/home?flight=FeatureStores).
* Click on `Feature stores` in the left nav
* You will see the list of feature stores that you have access to. Click on the feature store that you created above.

You can see the feature set and entity that you created.

Note: Creating and updating feature store assets is possible only through the SDK and CLI. You can use the UI to search and browse the feature store.

## Step 5: Generate a training data dataframe using the registered features

#### Step 5a: Load observation data

We start by exploring the observation data. Observation data is typically the core data used in training and inference. It is then joined with feature data to create the full training data. Observation data is captured at the time of the event: in this case, it contains core transaction data including transaction ID, account ID, and transaction amount. Since it is used for training here, it also has the target variable (`is_fraud`) appended.

To learn more about core concepts, including observation data, refer to the docs.

In [None]:
observation_data_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/observation_data/train/*.parquet"
observation_data_df = spark.read.parquet(observation_data_path)
obs_data_timestamp_column = "timestamp"

display(observation_data_df)
# Note: the timestamp column is displayed in a different format. Optionally, you can call observation_data_df.show() to see the correctly formatted values

#### Step 5b: Get the registered featureset and list its features

In [None]:
# look up the featureset by providing name and version
transactions_featureset = featurestore.feature_sets.get("transactions", version)
# list its features
transactions_featureset.features

In [None]:
# print sample values
display(transactions_featureset.to_spark_dataframe().head(5))

#### Step 5c: Select features and generate training data
In this step, you select the features to include in the training data and use the feature store SDK to generate the training data.
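
The next cell references features both as objects and as strings made of a feature set name, a version, and a feature name, separated by colons. As a small sketch of that string layout, the hypothetical parser below (`parse_feature_ref` and `FeatureRef` are illustration-only names, not SDK APIs) splits such a string into its three parts. The version `"1"` is a placeholder.

```python
from typing import NamedTuple


class FeatureRef(NamedTuple):
    feature_set: str
    version: str
    feature: str


def parse_feature_ref(ref: str) -> FeatureRef:
    """Split a 'featureset:version:feature' string into its three parts."""
    parts = ref.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'featureset:version:feature', got {ref!r}")
    return FeatureRef(*parts)


example_version = "1"  # placeholder; use the version you registered
ref = parse_feature_ref(f"transactions:{example_version}:transaction_3d_count")
print(ref)
```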

In [None]:
from azureml.featurestore import get_offline_features

# you can select features in pythonic way
features = [
    transactions_featureset.get_feature("transaction_amount_7d_sum"),
    transactions_featureset.get_feature("transaction_amount_7d_avg"),
]

# you can also specify features in string form: featureset:version:feature
more_features = [
    f"transactions:{version}:transaction_3d_count",
    f"transactions:{version}:transaction_amount_3d_avg",
]

more_features = featurestore.resolve_feature_uri(more_features)
features.extend(more_features)

# generate training dataframe by using feature data and observation data
training_df = get_offline_features(
    features=features,
    observation_data=observation_data_df,
    timestamp_column=obs_data_timestamp_column,
)

# Ignore the message that says the feature set is not materialized (materialization is optional). We will enable materialization later in this tutorial.
display(training_df)
# Note: the timestamp column is displayed in a different format. Optionally, you can call training_df.show() to see the correctly formatted values

You can see how the features are appended to the training data using a point-in-time join.
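
The idea behind a point-in-time join can be sketched in plain Python: for each observation row, take the latest feature row for the same account that is not newer than the observation and not older than a lookback window. This is a simplified illustration, not the SDK's implementation. The `point_in_time_join` helper, the boundary handling, and the sample rows are assumptions, and `source_delay` is not modeled.

```python
from datetime import datetime, timedelta


def point_in_time_join(observations, feature_rows, lookback):
    """Attach to each observation the latest feature row that is not from the future."""
    joined = []
    for obs in observations:
        candidates = [
            f
            for f in feature_rows
            if f["accountID"] == obs["accountID"]
            and obs["timestamp"] - lookback <= f["timestamp"] <= obs["timestamp"]
        ]
        latest = max(candidates, key=lambda f: f["timestamp"], default=None)
        value = latest["transaction_amount_7d_sum"] if latest else None
        joined.append({**obs, "transaction_amount_7d_sum": value})
    return joined


# Made-up feature values and observations.
feature_rows = [
    {"accountID": "A1", "timestamp": datetime(2023, 1, 1, 9), "transaction_amount_7d_sum": 100.0},
    {"accountID": "A1", "timestamp": datetime(2023, 1, 1, 15), "transaction_amount_7d_sum": 250.0},
]
observations = [
    {"accountID": "A1", "timestamp": datetime(2023, 1, 1, 12), "is_fraud": 0},
    {"accountID": "A1", "timestamp": datetime(2023, 1, 3, 12), "is_fraud": 1},
]

training_rows = point_in_time_join(observations, feature_rows, lookback=timedelta(days=1))
for row in training_rows:
    print(row["timestamp"], row["transaction_amount_7d_sum"])
```

The first observation gets the 09:00 feature value rather than the 15:00 one, because the later value was not yet known at observation time. The second observation gets no value, because no feature row falls inside its one-day lookback.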

You now have training data that uses features from the feature store. You can either save it to storage for later use, or run model training on it directly.

## Step 6: Enable offline materialization on transactions featureset
Once materialization is enabled on a feature set, you can perform a backfill (this tutorial) or schedule recurrent materialization jobs (next part of the tutorial).

In [None]:
from azure.ai.ml.entities import (
    MaterializationSettings,
    MaterializationComputeResource,
)

transactions_fset_config = fs_client.feature_sets.get(
    name="transactions", version=version
)

transactions_fset_config.materialization_settings = MaterializationSettings(
    offline_enabled=True,
    resource=MaterializationComputeResource(instance_type="standard_e8s_v3"),
    spark_configuration={
        "spark.driver.cores": 4,
        "spark.driver.memory": "36g",
        "spark.executor.cores": 4,
        "spark.executor.memory": "36g",
        "spark.executor.instances": 2,
    },
    schedule=None,
)

fs_poller = fs_client.feature_sets.begin_create_or_update(transactions_fset_config)
print(fs_poller.result())

Optionally, you can save the above feature set asset as YAML.

In [None]:
# optional: save the feature set asset as YAML
transactions_fset_config.dump(
    root_dir
    + "/featurestore/featuresets/transactions/featureset_asset_offline_enabled.yaml"
)

## Step 7: Backfill data for transactions featureset
Materialization is the process of computing the feature values for a given feature window and storing them in a materialization store. Materializing the features increases their reliability and availability. All feature queries then use the values from the materialization store. In this step, you perform a one-time backfill for a feature window of __three months__.

#### Note
How do you determine the backfill window? It has to match the window of your training data. For example, if you want to train with two years of data, you need to retrieve features for the same window, so you backfill a two-year window.
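
One way to derive such a window from your training data is sketched below: take the earliest and latest observation timestamps and align them to day boundaries. The helper `backfill_window` is hypothetical, the timestamps are made up, and treating the end of the window as exclusive is an assumption.

```python
from datetime import datetime, timedelta


def backfill_window(observation_timestamps):
    """Return a day-aligned (start, end) window that covers all observation timestamps."""
    earliest = min(observation_timestamps)
    latest = max(observation_timestamps)
    start = earliest.replace(hour=0, minute=0, second=0, microsecond=0)
    # End is treated as exclusive (assumed), so move to midnight after the latest observation.
    end = latest.replace(hour=0, minute=0, second=0, microsecond=0) + timedelta(days=1)
    return start, end


timestamps = [
    datetime(2023, 1, 1, 8, 30),
    datetime(2023, 2, 14, 17, 5),
    datetime(2023, 3, 31, 23, 59),
]
window_start, window_end = backfill_window(timestamps)
print(window_start, window_end)
```

With these sample timestamps, the result is the same three-month window that the cell below passes to the backfill call.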

In [None]:
from datetime import datetime

st = datetime(2023, 1, 1, 0, 0, 0, 0)
ed = datetime(2023, 4, 1, 0, 0, 0, 0)

poller = fs_client.feature_sets.begin_backfill(
    name="transactions",
    version=version,
    feature_window_start_time=st,
    feature_window_end_time=ed,
)
print(poller.result().job_id)

In [None]:
# get the job URL, and stream the job logs
fs_client.jobs.stream(poller.result().job_id)

Let's print sample data from the feature set. The output information shows that the data was retrieved from the materialization store. The `get_offline_features()` method, which is used to retrieve training/inference data, also uses the materialization store by default.

In [None]:
# look up the featureset by providing name and version
transactions_featureset = featurestore.feature_sets.get("transactions", version)
display(transactions_featureset.to_spark_dataframe().head(5))

## Cleanup

The tutorial "Enable recurrent materialization and run batch inference" has instructions for deleting the resources.

## Next steps
* Experiment and train models using features