# Tutorial: Network isolation with feature store (preview)

#### Important

This feature is currently in Public Preview. This preview version is provided without a service-level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).

Azure Machine Learning managed feature store lets you discover, create, and operationalize features. Features are the connective tissue in the machine learning lifecycle, from the prototyping phase, where you experiment with various features, to the operationalization phase, where models are deployed and feature data is looked up during inference. For information about the basic concepts of feature store, see [feature store concepts](https://learn.microsoft.com/azure/machine-learning/concept-what-is-managed-feature-store).

In this tutorial, you learn how to configure secure ingress through a private endpoint and secure egress through a managed virtual network.

This tutorial contains the information needed to set up network isolation for a managed feature store. In this tutorial, you will:
- Set up the resources required for network isolation of a managed feature store.
- Create a new feature store resource.
- Set up your feature store to support network isolation scenarios.
- Update your project workspace (current workspace) to support network isolation scenarios.


## Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

* Project workspace: **You must have an Azure Machine Learning workspace enabled with managed virtual network** for **serverless Spark jobs**. To configure your project workspace:
  1. Create a `YAML` file `network.yml` with the following content:

      ```yaml
      managed_network:
        isolation_mode: allow_internet_outbound
      ```
  2. Execute the following commands to update the workspace and provision managed virtual network for serverless Spark jobs:
      ```cli
      az ml workspace update --file network.yml --resource-group my_resource_group --name my_workspace_name
      az ml workspace provision-network --resource-group my_resource_group --name my_workspace_name --include-spark
      ```
  You can refer to the document [Configure for serverless spark job](https://learn.microsoft.com/azure/machine-learning/how-to-managed-network?view=azureml-api-2&tabs=azure-cli#configure-for-serverless-spark-jobs) for more details.

  __Important__:  When following the above instructions, set the `isolation_mode` as `allow_internet_outbound`. This is the only supported network isolation mode. As you will see in this notebook, you can connect to sources, materialization store and observation data securely through private endpoints.

* To perform the steps in this article, your user account must be assigned the `Owner` or `Contributor` role on the resource group where the feature store will be created. Your account also needs the `User Access Administrator` role.

## Setup 

### Prepare the notebook environment for development
Note: This tutorial uses Azure Machine Learning notebook with **Serverless Spark Compute**.

1. Clone the examples repository to your local machine: To run the tutorial, first clone the [examples repository - (azureml-examples)](https://github.com/azure/azureml-examples) with this command:

   `git clone --depth 1 https://github.com/Azure/azureml-examples`

   You can also download a zip file from the [examples repository (azureml-examples)](https://github.com/azure/azureml-examples). At this page, first select the `code` dropdown, and then select `Download ZIP`. Then, unzip the contents into a folder on your local device.

2. Upload the feature store samples directory to project workspace.
      * Open Azure Machine Learning studio UI of your Azure Machine Learning workspace.
      * Select **Notebooks** in left navigation panel.
      * Select your user name in the directory listing.
      * Select ellipses (**...**) and then select **Upload folder**.
      * Select the feature store samples folder from the cloned directory path: `azureml-examples/sdk/python/featurestore-sample`.

3. Running the tutorial:
* Option 1: Create a new notebook, and execute the instructions in this document step by step. 
* Option 2: Open the existing notebook `featurestore_sample/notebooks/sdk_and_cli/network_isolation/Network Isolation for Feature store.ipynb`. You may keep this document open and refer to it for additional explanation and documentation links.

  1. Select **Serverless Spark Compute** in the top navigation **Compute** dropdown. This operation might take one to two minutes. Wait for a status bar in the top to display **Configure session**.
  2. Select **Configure session** in the top status bar.
  3. Select **Python packages**.
  4. Select **Upload conda file**.
  5. Select file `azureml-examples/sdk/python/featurestore-sample/project/env/conda.yml` located on your local device.
  6. (Optional) Increase the session time-out (idle time in minutes) to reduce the serverless spark cluster startup time.

#### Start Spark session
Execute the following code cell to start the Spark session. It will take approximately 10 minutes to install all dependencies and start the Spark session.

In [None]:
# Run this cell to start the Spark session (any code cell will start the session). This can take around 10 minutes.
print("start spark session")

#### Set up root directory for the samples

In [None]:
import os

# Please update your alias below (or any custom directory you have uploaded the samples to).
# You can find the name from the directory structure in the left navigation.
root_dir = "./Users/<your_user_alias>/featurestore_sample"

if os.path.isdir(root_dir):
    print("The folder exists.")
else:
    print("The folder does not exist. Please create or fix the path")

#### Set up CLI
Set up the Azure Machine Learning CLI in the following three steps:
1. Install the Azure Machine Learning CLI extension

In [None]:
# install azure ml cli extension
!az extension add --name ml

2. Authenticate

In [None]:
# authenticate
!az login

3. Set the default subscription

In [None]:
# Set default subscription
import os

subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]

!az account set -s $subscription_id
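
The cells in this tutorial read `AZUREML_ARM_SUBSCRIPTION` and related variables directly from `os.environ`, which raises a bare `KeyError` if a variable is missing. The following is a minimal sketch of a lookup that fails with a clearer message. The helper name `require_env` is hypothetical and not part of any Azure SDK, and the hint in the error message assumes the variables are only populated on Azure Machine Learning compute.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        # Assumption: these variables are set automatically only on Azure ML compute.
        raise RuntimeError(
            f"Environment variable {name!r} is not set. "
            "Set it manually or run this notebook on Azure Machine Learning compute."
        )
    return value
```

With this helper, a cell could use `subscription_id = require_env("AZUREML_ARM_SUBSCRIPTION")` instead of indexing `os.environ` directly.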

## Note
__Feature store vs Project workspace__: You will use a feature store workspace to reuse features across projects. You will use a project workspace (the current workspace) to train and inference models, by leveraging features from the feature stores. Many project workspaces can share and reuse the same feature store.

In this tutorial, you use the CLI and the feature store core SDK. This combination is useful in CI/CD or GitOps scenarios where CLI/YAML is preferred.

1. CLI: You use the CLI for CRUD operations (create, read, update, and delete) on the feature store, feature sets, and feature store entities.
2. Feature store core SDK: This SDK (`azureml-featurestore`) is meant for feature set development and consumption only (you will learn more about these operations later):
- List/Get registered feature set
- Generate/resolve feature retrieval spec
- Execute feature set definition to generate Spark dataframe
- Generate training data using a point-in-time join

For this tutorial, you do not need to install any of these explicitly, because the conda YAML file uploaded in the earlier step already includes them.

## Step 1. Set up necessary resources

You can create a new Azure Data Lake Storage (ADLS) Gen2 storage account and containers, or reuse existing ones, for the feature store.

The feature store needs three separate storage containers, one for each of the following:
- Source data
- Offline store
- Observation data

For this tutorial, you create three containers in the same ADLS Gen2 storage account.

In real-life scenarios, these containers can be in the same storage account or in different storage accounts, depending on your organizational needs.

#### Step 1a. Create an ADLS Gen2 storage account to store source, offline store, and observation data.
Provide the name of an Azure Data Lake Storage Gen2 storage account in the following code sample. Other than that, you can execute the following code cell with the provided default settings. Optionally, you can override the default settings. 

In [None]:
## Default Setting
# We use the subscription, resource group, region of this active project workspace,
# We hard-coded default resource names for creating new resources

## Overwrite
# You can replace them if you want to create the resources in a different subscription/resourceGroup, or use existing resources
# At the minimum, provide an ADLS Gen2 storage account name for `storage_account_name`

storage_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
storage_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]
storage_account_name = "<STORAGE_ACCOUNT_NAME>"

storage_location = "eastus"
storage_file_system_name_offline_store = "offline-store"
storage_file_system_name_source_data = "source-data"
storage_file_system_name_observation_data = "observation-data"
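
Azure rejects storage account names that are not 3 to 24 lowercase letters and digits, and container names that are not 3 to 63 characters of lowercase letters, digits, and single hyphens (starting and ending with a letter or digit). The following is a minimal sketch of a local check you could run before creating resources. The function names are hypothetical helpers, not part of the Azure CLI or SDK.

```python
import re

# 3-24 characters, lowercase letters and digits only
_ACCOUNT_RE = re.compile(r"[a-z0-9]{3,24}")
# Starts and ends with a letter or digit; hyphens never appear twice in a row
_CONTAINER_RE = re.compile(r"[a-z0-9](?:-?[a-z0-9])*")


def is_valid_storage_account_name(name: str) -> bool:
    """Check a candidate storage account name against the naming rules."""
    return _ACCOUNT_RE.fullmatch(name) is not None


def is_valid_container_name(name: str) -> bool:
    """Check a candidate container (file system) name against the naming rules."""
    return 3 <= len(name) <= 63 and _CONTAINER_RE.fullmatch(name) is not None
```

For example, the default container names used in this tutorial (`offline-store`, `source-data`, `observation-data`) pass the check, while the unmodified placeholder `<STORAGE_ACCOUNT_NAME>` does not.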

Execute the following code cell to create the ADLS Gen2 storage account, using `storage_account_name` and the related settings defined above.

In [None]:
# Create new storage account
!az storage account create --name $storage_account_name --enable-hierarchical-namespace true --resource-group $storage_resource_group_name --location $storage_location --subscription $storage_subscription_id

Execute the following code cell to create a new storage container for offline store.

In [None]:
# Create a new storage container for offline store
!az storage fs create --name $storage_file_system_name_offline_store --account-name $storage_account_name --subscription $storage_subscription_id

Execute the following code cell to create a new storage container for source data.

In [None]:
# Create a new storage container for source data
!az storage fs create --name $storage_file_system_name_source_data --account-name $storage_account_name --subscription $storage_subscription_id

Execute the following code cell to create a new storage container for observation data.

In [None]:
# Create a new storage container for observation data
!az storage fs create --name $storage_file_system_name_observation_data --account-name $storage_account_name --subscription $storage_subscription_id

#### Step 1b. Copy the sample data required for this tutorial series to the newly created storage containers.

To write data to the storage containers, please ensure that **Contributor** and **Storage Blob Data Contributor** roles are assigned to the user identity on the created ADLS Gen2 storage account in the Azure portal [following these steps](https://learn.microsoft.com/azure/role-based-access-control/role-assignments-portal). 

__Important__: Once you have ensured that the **Contributor** and **Storage Blob Data Contributor** roles are assigned to the user identity, wait for a few minutes after role assignment to let permissions propagate before proceeding with the next steps. To learn more about access control, see [role-based access control (RBAC) for Azure storage accounts](https://learn.microsoft.com/azure/storage/blobs/data-lake-storage-access-control-model#role-based-access-control-azure-rbac)
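
Because role assignments take time to propagate, the first write attempt can fail even when the roles are assigned correctly. The following is a minimal sketch of a generic retry loop that waits between attempts. The helper name `retry` and its parameters are hypothetical, and the sketch deliberately does not assume which exception type Spark raises on an authorization failure.

```python
import time


def retry(action, attempts=5, wait_seconds=60.0, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out, waiting between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # broad on purpose: no specific exception type is assumed
            last_error = error
            if attempt < attempts:
                sleep(wait_seconds)
    raise last_error
```

A write could then be wrapped as `retry(lambda: some_df.write.parquet(target_path))`, where `some_df` and `target_path` stand for the dataframe and destination used in the copy cells.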

The following code cells copy sample source data for transactions feature set used in this tutorial from a public storage account to the newly created storage account.

In [None]:
# Copy sample source data for transactions feature set used in this tutorial series from the public storage account to the newly created storage account
transactions_source_data_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/datasources/transactions-source/*.parquet"
transactions_src_df = spark.read.parquet(transactions_source_data_path)

transactions_src_df.write.parquet(
    f"abfss://{storage_file_system_name_source_data}@{storage_account_name}.dfs.core.windows.net/transactions-source/"
)

Copy sample source data for account feature set used in this tutorial from a public storage account to the newly created storage account.

In [None]:
# Copy sample source data for account feature set used in this tutorial series from the public storage account to the newly created storage account
accounts_data_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/datasources/accounts-precalculated/*.parquet"
accounts_data_df = spark.read.parquet(accounts_data_path)

accounts_data_df.write.parquet(
    f"abfss://{storage_file_system_name_source_data}@{storage_account_name}.dfs.core.windows.net/accounts-precalculated/"
)

Copy sample observation data used for training from a public storage account to the newly created storage account.

In [None]:
# Copy sample observation data used for training from the public storage account to the newly created storage account
observation_data_train_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/observation_data/train/*.parquet"
observation_data_train_df = spark.read.parquet(observation_data_train_path)

observation_data_train_df.write.parquet(
    f"abfss://{storage_file_system_name_observation_data}@{storage_account_name}.dfs.core.windows.net/train/"
)

Copy sample observation data used for batch inference from a public storage account to the newly created storage account.

In [None]:
# Copy sample observation data used for batch inference from a public storage account to the newly created storage account
observation_data_inference_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/observation_data/batch_inference/*.parquet"
observation_data_inference_df = spark.read.parquet(observation_data_inference_path)

observation_data_inference_df.write.parquet(
    f"abfss://{storage_file_system_name_observation_data}@{storage_account_name}.dfs.core.windows.net/batch_inference/"
)
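
The copy cells above repeat the same `abfss://` URI pattern for every destination. As a sketch, the pattern can be captured in one small function; `abfss_uri` is a hypothetical helper name, not an Azure or Spark API.

```python
def abfss_uri(file_system: str, account: str, path: str = "") -> str:
    """Build an abfss:// URI for a path inside an ADLS Gen2 file system (container)."""
    return f"abfss://{file_system}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Example with made-up names
print(abfss_uri("source-data", "mystorageaccount", "transactions-source/"))
# abfss://source-data@mystorageaccount.dfs.core.windows.net/transactions-source/
```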

#### Step 1c. Disable the public network access on the newly created storage account.

Execute the following code cell to disable public network access for the ADLS Gen2 storage account created above.

In [None]:
# Disable the public network access for the above created ADLS Gen2 storage account
!az storage account update --name $storage_account_name --resource-group $storage_resource_group_name --subscription $storage_subscription_id --public-network-access disabled

Set ARM IDs for the offline store, source data, and observation data containers.

In [None]:
# set the container arm id
offline_store_gen2_container_arm_id = "/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{account}/blobServices/default/containers/{container}".format(
    sub_id=storage_subscription_id,
    rg=storage_resource_group_name,
    account=storage_account_name,
    container=storage_file_system_name_offline_store,
)

print(offline_store_gen2_container_arm_id)

source_data_gen2_container_arm_id = "/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{account}/blobServices/default/containers/{container}".format(
    sub_id=storage_subscription_id,
    rg=storage_resource_group_name,
    account=storage_account_name,
    container=storage_file_system_name_source_data,
)

print(source_data_gen2_container_arm_id)

observation_data_gen2_container_arm_id = "/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{account}/blobServices/default/containers/{container}".format(
    sub_id=storage_subscription_id,
    rg=storage_resource_group_name,
    account=storage_account_name,
    container=storage_file_system_name_observation_data,
)

print(observation_data_gen2_container_arm_id)
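
The three ARM IDs above differ only in the container name, so the format string can be factored into a function. The following is a sketch of that refactoring; `container_arm_id` is a hypothetical helper name, and the subscription, resource group, and account values in the example are made up.

```python
def container_arm_id(subscription_id: str, resource_group: str, account: str, container: str) -> str:
    """Build the ARM ID of a blob container in a storage account."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account}"
        f"/blobServices/default/containers/{container}"
    )


# Example with made-up values
for name in ("offline-store", "source-data", "observation-data"):
    print(container_arm_id("0000-1111", "my-rg", "mystorageaccount", name))
```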

## Step 2. Create a feature store with materialization enabled

#### Step 2a. Set feature store parameters
Set name, location and other values for the feature store.

In [None]:
# We use the subscription, resource group, region of this active project workspace.
# Optionally, you can replace them to create the resources in a different subscription/resourceGroup, or use existing resources
import os

# At the minimum, define a name for the feature store
featurestore_name = "<FEATURESTORE_NAME>"
# It is recommended to create featurestore in the same location as the storage
featurestore_location = storage_location
featurestore_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
featurestore_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]

feature_store_arm_id = "/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.MachineLearningServices/workspaces/{ws_name}".format(
    sub_id=featurestore_subscription_id,
    rg=featurestore_resource_group_name,
    ws_name=featurestore_name,
)

The following code cell generates a YAML specification file for a feature store with materialization enabled.

In [None]:
# The below code creates a feature store with enabled materialization
import yaml

config = {
    "$schema": "http://azureml/sdk-2-0/FeatureStore.json",
    "name": featurestore_name,
    "location": featurestore_location,
    "compute_runtime": {"spark_runtime_version": "3.2"},
    "offline_store": {
        "type": "azure_data_lake_gen2",
        "target": offline_store_gen2_container_arm_id,
    },
}

feature_store_yaml = root_dir + "/featurestore/featurestore_with_offline_setting.yaml"

with open(feature_store_yaml, "w") as outfile:
    yaml.dump(config, outfile, default_flow_style=False)

#### Step 2b. Create the feature store

Execute the following code cell to create a feature store with materialization enabled by using the YAML specification file generated in the previous step.

In [None]:
!az ml feature-store create --file $feature_store_yaml --subscription $featurestore_subscription_id --resource-group $featurestore_resource_group_name

#### Step 2c. Initialize Azure Machine Learning feature store core SDK client
As explained above, this SDK client is used to develop and consume features.

In [None]:
# feature store client
from azureml.featurestore import FeatureStoreClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

featurestore = FeatureStoreClient(
    credential=AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
    name=featurestore_name,
)

#### Step 2d. Grant user identity access to the feature store

Follow these instructions to [get your Azure Active Directory (AAD) Object ID for your user identity](https://learn.microsoft.com/partner-center/find-ids-and-domain-names#find-the-user-object-id). Then use your AAD Object ID in the following command to assign the **AzureML Data Scientist** role to your user identity on the created feature store.

In [None]:
your_aad_objectid = "<YOUR_AAD_OBJECT_ID>"

!az role assignment create --role "AzureML Data Scientist" --assignee-object-id $your_aad_objectid --assignee-principal-type User --scope $feature_store_arm_id

#### Step 2e. Get default storage account and key vault for the feature store and disable the public network access to the corresponding resources.

The following code cell gets the feature store object for the next steps.

In [None]:
fs = featurestore.feature_stores.get()

This code cell gets the names of the default storage account and key vault for the feature store.

In [None]:
# Copy the properties storage_account and key_vault from the response returned in feature store show command respectively
default_fs_storage_account_name = fs.storage_account.rsplit("/", 1)[-1]
default_key_vault_name = fs.key_vault.rsplit("/", 1)[-1]
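
The cell above relies on the resource name being the last segment of an ARM ID. The following sketch shows that idea on a made-up ID, together with a simple split of the ID into key/value pairs. Both function names are hypothetical helpers, and the pairing logic assumes a plain ID whose segments alternate between a key and a value.

```python
def arm_resource_name(arm_id: str) -> str:
    """Return the last path segment of an ARM ID, which is the resource name."""
    return arm_id.rstrip("/").rsplit("/", 1)[-1]


def parse_arm_id(arm_id: str) -> dict:
    """Split an ARM ID into key/value pairs (assumes alternating key/value segments)."""
    parts = arm_id.strip("/").split("/")
    return dict(zip(parts[0::2], parts[1::2]))


# Made-up example ID
example_id = "/subscriptions/0000-1111/resourceGroups/my-rg/providers/Microsoft.KeyVault/vaults/my-vault"
print(arm_resource_name(example_id))                # my-vault
print(parse_arm_id(example_id)["resourceGroups"])  # my-rg
```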

Execute the following code cell to disable public network access to the default storage account for the feature store.

In [None]:
# Disable the public network access for the above created default ADLS Gen2 storage account for the feature store
!az storage account update --name $default_fs_storage_account_name --resource-group $featurestore_resource_group_name --subscription $featurestore_subscription_id --public-network-access disabled

The following cell prints the name of the default key vault for the feature store.

In [None]:
print(default_key_vault_name)

#### Disable the public network access for the default key vault for the feature store

- Open the default key vault from the previous cell in the Azure portal.
- Go to the **Networking** tab.
- Select **Disable public access** and click on **Apply** on the bottom left of the page.

## Step 3. Enable managed virtual network for the feature store

#### Step 3a. Update the feature store with the necessary outbound rules.

The following code cell creates a YAML specification file for outbound rules that are defined for the feature store.

In [None]:
# The below code creates a configuration for managed virtual network for the feature store
import yaml

config = {
    "public_network_access": "disabled",
    "managed_network": {
        "isolation_mode": "allow_internet_outbound",
        "outbound_rules": [
            # You need to add multiple rules here if you have separate storage account for source, observation data and offline store.
            {
                "name": "sourcerulefs",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "dfs",
                    "service_resource_id": f"/subscriptions/{storage_subscription_id}/resourcegroups/{storage_resource_group_name}/providers/Microsoft.Storage/storageAccounts/{storage_account_name}",
                },
                "type": "private_endpoint",
            },
            # This rule is added currently because serverless Spark doesn't automatically create a private endpoint to default key vault.
            {
                "name": "defaultkeyvault",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "vault",
                    "service_resource_id": f"/subscriptions/{featurestore_subscription_id}/resourcegroups/{featurestore_resource_group_name}/providers/Microsoft.Keyvault/vaults/{default_key_vault_name}",
                },
                "type": "private_endpoint",
            },
        ],
    },
}

feature_store_managed_vnet_yaml = (
    root_dir + "/featurestore/feature_store_managed_vnet_config.yaml"
)

with open(feature_store_managed_vnet_yaml, "w") as outfile:
    yaml.dump(config, outfile, default_flow_style=False)
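
Each private endpoint outbound rule in the configuration above has the same shape, which matters if you need to add one rule per storage account. The following is a sketch of a small builder for that shape. `private_endpoint_rule` is a hypothetical helper, and the dictionary keys simply mirror the ones used in the cell above.

```python
def private_endpoint_rule(name, service_resource_id, subresource_target, spark_enabled=True):
    """Build one private endpoint outbound rule in the shape used by the tutorial's config."""
    return {
        "name": name,
        "type": "private_endpoint",
        "destination": {
            "service_resource_id": service_resource_id,
            "subresource_target": subresource_target,
            # The tutorial's cells pass this flag as the string "true"; the same form is kept here.
            "spark_enabled": "true" if spark_enabled else "false",
        },
    }


# Example with a made-up storage account resource ID
rule = private_endpoint_rule(
    "sourcerulefs",
    "/subscriptions/0000-1111/resourcegroups/my-rg/providers/Microsoft.Storage/storageAccounts/mystorageaccount",
    "dfs",
)
print(rule["destination"]["subresource_target"])  # dfs
```

With separate storage accounts for source data, observation data, and the offline store, the `outbound_rules` list could be built by calling this function once per account.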

Execute the following code cell to update the feature store using the generated YAML specification file with the outbound rules.

In [None]:
!az ml feature-store update --file $feature_store_managed_vnet_yaml --name $featurestore_name --resource-group $featurestore_resource_group_name

#### Step 3b. Execute provision network commands to actually create private endpoints for the above mentioned rules.

The provision network command creates private endpoints from the managed virtual network, where the materialization job will execute, to the source, offline store, observation data, default storage account, and default key vault for the feature store. This command can take approximately 20 minutes to execute.

In [None]:
#### Provision network to create necessary private endpoints (it may take approximately 20 minutes)
!az ml feature-store provision-network --name $featurestore_name --resource-group $featurestore_resource_group_name --include-spark

Execute the following code cell to confirm that private endpoints defined by the outbound rules have been created.

In [None]:
### Check that managed virtual network is correctly enabled
### After provisioning the network, all the outbound rules should become active
### For this tutorial, you will see 6 outbound rules
!az ml feature-store show --name $featurestore_name --resource-group $featurestore_resource_group_name
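
Scanning the `show` output by eye for inactive rules is error-prone. The following sketch filters a JSON document for rules that are not yet active. It assumes, without confirmation from this tutorial, that the output is JSON with a `managed_network.outbound_rules` list whose items carry `name` and `status` fields, and that an active rule reports the status `Active`. The sample document is made up. Adjust the keys to match the output you actually see.

```python
import json


def inactive_rules(show_output_json: str) -> list:
    """Return the names of outbound rules whose status is not 'Active' (assumed schema)."""
    data = json.loads(show_output_json)
    rules = data.get("managed_network", {}).get("outbound_rules", [])
    return [
        rule.get("name")
        for rule in rules
        if str(rule.get("status", "")).lower() != "active"
    ]


# Made-up sample in the assumed shape
sample = json.dumps(
    {
        "managed_network": {
            "outbound_rules": [
                {"name": "sourcerulefs", "status": "Active"},
                {"name": "defaultkeyvault", "status": "Inactive"},
            ]
        }
    }
)
print(inactive_rules(sample))  # ['defaultkeyvault']
```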

## Step 4. Update managed virtual network for the project workspace (current workspace)


Next, we update the managed virtual network for the project workspace. First, we get the subscription ID, resource group, and workspace name for the project workspace.

In [None]:
# lookup the subscription id, resource group and workspace name of the current workspace
project_ws_sub_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
project_ws_rg = os.environ["AZUREML_ARM_RESOURCEGROUP"]
project_ws_name = os.environ["AZUREML_ARM_WORKSPACE_NAME"]

#### Step 4a. Update project workspace with necessary outbound rules.

The project workspace needs access to the following resources:
- Source data
- Offline store
- Observation data
- Feature store
- Default storage account of the feature store

Execute the following code cell to create a YAML specification file with the required outbound rules for the project workspace.

In [None]:
# The below code creates a configuration for managed virtual network for the project workspace
import yaml

config = {
    "managed_network": {
        "isolation_mode": "allow_internet_outbound",
        "outbound_rules": [
            # In case you have separate storage accounts for source, observation data, and offline store, you need to add multiple rules here. No action is needed otherwise.
            {
                "name": "projectsourcerule",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "dfs",
                    "service_resource_id": f"/subscriptions/{storage_subscription_id}/resourcegroups/{storage_resource_group_name}/providers/Microsoft.Storage/storageAccounts/{storage_account_name}",
                },
                "type": "private_endpoint",
            },
            # Rule to create private endpoint to default storage of feature store
            {
                "name": "defaultfsstoragerule",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "blob",
                    "service_resource_id": f"/subscriptions/{featurestore_subscription_id}/resourcegroups/{featurestore_resource_group_name}/providers/Microsoft.Storage/storageAccounts/{default_fs_storage_account_name}",
                },
                "type": "private_endpoint",
            },
            # Rule to create private endpoint to default key vault of feature store
            {
                "name": "defaultfskeyvaultrule",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "vault",
                    "service_resource_id": f"/subscriptions/{featurestore_subscription_id}/resourcegroups/{featurestore_resource_group_name}/providers/Microsoft.Keyvault/vaults/{default_key_vault_name}",
                },
                "type": "private_endpoint",
            },
            # Rule to create private endpoint to feature store
            {
                "name": "featurestorerule",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "amlworkspace",
                    "service_resource_id": f"/subscriptions/{featurestore_subscription_id}/resourcegroups/{featurestore_resource_group_name}/providers/Microsoft.MachineLearningServices/workspaces/{featurestore_name}",
                },
                "type": "private_endpoint",
            },
        ],
    }
}

project_ws_managed_vnet_yaml = (
    root_dir + "/featurestore/project_ws_managed_vnet_config.yaml"
)

with open(project_ws_managed_vnet_yaml, "w") as outfile:
    yaml.dump(config, outfile, default_flow_style=False)

Execute the following code cell to update the project workspace using the generated YAML specification file with the outbound rules.

In [None]:
#### Update project workspace to create private endpoints for the defined outbound rules (it may take approximately 15 minutes)
!az ml workspace update --file $project_ws_managed_vnet_yaml --name $project_ws_name --resource-group $project_ws_rg

Execute the following code cell to confirm that private endpoints defined by the outbound rules have been created.

In [None]:
!az ml workspace show --name $project_ws_name --resource-group $project_ws_rg

You can also verify the outbound rules from the Azure portal by navigating to **Networking** from the left navigation panel for the project workspace and then opening the **Workspace managed outbound access** tab.

![OUTBOUND_RULES](./images/Project_Workspace_Outbound_Rules.png)

## Step 5. Prototype and develop a transactions rolling aggregation feature set in this notebook

#### Step 5a: Explore the transactions source data

#### Note
The sample data used in this notebook is hosted in a publicly accessible blob container. It can only be read in Spark via the `wasbs` driver. When you create feature sets using your own source data, host the data in an ADLS Gen2 account and use the `abfss` driver in the data path.

In [None]:
# Read the transactions source data from the source data container created in Step 1
transactions_source_data_path = f"abfss://{storage_file_system_name_source_data}@{storage_account_name}.dfs.core.windows.net/transactions-source/*.parquet"
transactions_src_df = spark.read.parquet(transactions_source_data_path)

display(transactions_src_df.head(5))
# Note: display(transactions_src_df.head(5)) displays the timestamp column in a different format. You can call transactions_src_df.show() to see the correctly formatted value

#### Step 5b: Develop a transactions feature set locally

A feature set specification is a self-contained definition of a feature set that can be developed and tested locally.

Let's create the following rolling window aggregate features:
- transactions 3-day count
- transactions amount 3-day sum
- transactions amount 3-day avg
- transactions 7-day count
- transactions amount 7-day sum
- transactions amount 7-day avg
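
To make the meaning of these aggregates concrete, the following sketch computes a rolling count, sum, and average per account in plain Python on a few made-up rows. It is an illustration only, not the Spark transformer used by the sample. The function name, the output column names, and the choice of a window that includes both the row's own timestamp and the timestamp exactly `days` earlier are all assumptions.

```python
from datetime import datetime, timedelta


def rolling_aggregates(rows, days):
    """rows: list of (account_id, timestamp, amount) tuples. Returns one dict per input row."""
    window = timedelta(days=days)
    results = []
    for account_id, ts, _ in rows:
        # Amounts of the same account within [ts - window, ts]; always includes the row itself
        amounts = [
            amount
            for other_id, other_ts, amount in rows
            if other_id == account_id and ts - window <= other_ts <= ts
        ]
        results.append(
            {
                "accountID": account_id,
                "timestamp": ts,
                f"transaction_{days}d_count": len(amounts),
                f"transaction_amount_{days}d_sum": sum(amounts),
                f"transaction_amount_{days}d_avg": sum(amounts) / len(amounts),
            }
        )
    return results


# Made-up sample rows
rows = [
    ("A", datetime(2023, 1, 1), 10.0),
    ("A", datetime(2023, 1, 3), 20.0),
    ("A", datetime(2023, 1, 6), 30.0),
    ("B", datetime(2023, 1, 3), 5.0),
]
for record in rolling_aggregates(rows, days=3):
    print(record)
```

For the third row (account `A` on January 6), the 3-day window covers January 3 through January 6, so the count is 2, the sum is 50.0, and the average is 25.0.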

__Action__:
- Inspect the feature transformation code file: `featurestore/featuresets/transactions/transformation_code/transaction_transform.py`. Note how the rolling aggregation is defined for the features. This is a Spark transformer.

To understand the feature set and transformations in more detail, see [feature store concepts](https://learn.microsoft.com/azure/machine-learning/concept-what-is-managed-feature-store) and [transformation concepts](https://learn.microsoft.com/azure/machine-learning/feature-set-specification-transformation-concepts).

In [None]:
from azureml.featurestore import create_feature_set_spec, FeatureSetSpec
from azureml.featurestore.contracts import (
    DateTimeOffset,
    FeatureSource,
    TransformationCode,
    Column,
    ColumnType,
    SourceType,
    TimestampColumn,
)


transactions_featureset_code_path = (
    root_dir + "/featurestore/featuresets/transactions/transformation_code"
)

transactions_featureset_spec = create_feature_set_spec(
    source=FeatureSource(
        type=SourceType.parquet,
        path=f"abfss://{storage_file_system_name_source_data}@{storage_account_name}.dfs.core.windows.net/transactions-source/*.parquet",
        timestamp_column=TimestampColumn(name="timestamp"),
        source_delay=DateTimeOffset(days=0, hours=0, minutes=20),
    ),
    transformation_code=TransformationCode(
        path=transactions_featureset_code_path,
        transformer_class="transaction_transform.TransactionFeatureTransformer",
    ),
    index_columns=[Column(name="accountID", type=ColumnType.string)],
    source_lookback=DateTimeOffset(days=7, hours=0, minutes=0),
    temporal_join_lookback=DateTimeOffset(days=1, hours=0, minutes=0),
    infer_schema=True,
)
# Generate a spark dataframe from the feature set specification
transactions_fset_df = transactions_featureset_spec.to_spark_dataframe()
# display few records
display(transactions_fset_df.head(5))

#### Step 5c: Export as feature set specification
To register the feature set specification with the feature store, you must first save it in a specific format.
Action: Inspect the generated `transactions` `FeaturesetSpec`. Open the specification file from the file tree: `featurestore/featuresets/transactions/spec/FeaturesetSpec.yaml`

Specification file contains these important elements:

1. `source`: a reference to a storage resource. In this case, it is a parquet file in blob storage.
2. `features`: list of features and their data types. If you provide transformation code, the code has to return a data frame that maps to the features and data types.
3. `index_columns`: the join keys required to access values from the feature set.

Learn more about it in the [top level feature store entities document](https://learn.microsoft.com/azure/machine-learning/concept-top-level-entities-in-managed-feature-store) and the [feature set specification yaml reference](https://learn.microsoft.com/azure/machine-learning/reference-yaml-featureset-spec).

The additional benefit of persisting it is that it can be source controlled.
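
As a quick sanity check on the three elements listed above, the following sketch verifies that a parsed specification contains them. The dictionary stands in for the parsed YAML (for example, the result of `yaml.safe_load()` on the specification file). The helper name `missing_spec_keys`, the placeholder path, and the `double` feature types are assumptions for illustration only; they are not taken from the generated file.

```python
# Top-level elements described in the list above.
REQUIRED_KEYS = ("source", "features", "index_columns")


def missing_spec_keys(spec: dict) -> list:
    """Return the required top-level keys that are absent from a parsed spec."""
    return [key for key in REQUIRED_KEYS if key not in spec]


# Hypothetical parsed content; values are placeholders, not the real spec.
example_spec = {
    "source": {
        "type": "parquet",
        "path": "abfss://<container>@<account>.dfs.core.windows.net/transactions-source/*.parquet",
    },
    "features": [
        {"name": "transaction_amount_7d_sum", "type": "double"},
        {"name": "transaction_amount_7d_avg", "type": "double"},
    ],
    "index_columns": [{"name": "accountID", "type": "string"}],
}

print(missing_spec_keys(example_spec))
print(missing_spec_keys({"source": {}}))
```

An empty list means all three elements are present; any returned names point at what the specification still lacks.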

In [None]:
import os

# create a new folder to dump the feature set specification
transactions_featureset_spec_folder = (
    root_dir + "/featurestore/featuresets/transactions/spec"
)

# check if the folder exists, create one if not
if not os.path.exists(transactions_featureset_spec_folder):
    os.makedirs(transactions_featureset_spec_folder)

transactions_featureset_spec.dump(transactions_featureset_spec_folder, overwrite=True)

## Step 6: Register a feature store entity
An entity helps enforce the best practice of using the same join key definitions across feature sets that share the same logical entities. Examples of entities include an account entity and a customer entity. Entities are typically created once and reused across feature sets. For information on the basic concepts of feature store, see [feature store concepts](https://learn.microsoft.com/azure/machine-learning/concept-what-is-managed-feature-store).

In [None]:
account_entity_path = root_dir + "/featurestore/entities/account.yaml"
!az ml feature-store-entity create --file $account_entity_path --resource-group $featurestore_resource_group_name --workspace-name $featurestore_name

## Step 7: Register the transactions feature set with the feature store and submit a materialization job
You register a feature set asset with the feature store so that you can share it with others and reuse it. You also get managed capabilities such as versioning and materialization, which are covered later in this tutorial series.

The feature set asset has reference to the feature set specification that you created earlier and additional properties like version and materialization settings.

#### Step 7a: Create a feature set

The following code cell creates a feature set by using a predefined YAML specification file.

In [None]:
transactions_featureset_path = (
    root_dir
    + "/featurestore/featuresets/transactions/featureset_asset_offline_enabled.yaml"
)
!az ml feature-set create --file $transactions_featureset_path --resource-group $featurestore_resource_group_name --workspace-name $featurestore_name

Execute the following code cell to preview the newly created feature set.

In [None]:
# Preview the newly created feature set

!az ml feature-set show --resource-group $featurestore_resource_group_name --workspace-name $featurestore_name -n transactions -v 1

#### Step 7b: Submit a backfill materialization job

The following code cell defines the start and end times of the feature materialization window and submits a backfill materialization job.

In [None]:
feature_window_start_time = "2023-02-01T00:00.000Z"
feature_window_end_time = "2023-03-01T00:00.000Z"

!az ml feature-set backfill --name transactions --version 1 --by-data-status "['None']" --workspace-name $featurestore_name --resource-group $featurestore_resource_group_name --feature-window-start-time $feature_window_start_time --feature-window-end-time $feature_window_end_time

Execute the following code cell to check the status of the backfill materialization job. Replace `<JOB_ID_FROM_PREVIOUS_COMMAND>` with the job ID returned by the previous command.

In [None]:
### Check the job status

!az ml job show --name <JOB_ID_FROM_PREVIOUS_COMMAND> -g $featurestore_resource_group_name -w $featurestore_name

Next, execute the following code cell to list all the materialization jobs for the current feature set. 

In [None]:
### List all the materialization jobs for the current feature set

!az ml feature-set list-materialization-operation --name transactions --version 1 -g $featurestore_resource_group_name -w $featurestore_name

## Step 8: Enable online store materialization

#### Step 8a: Attach online store to feature store

In the following code cell, define the name of the Azure Cache for Redis that you want to create or reuse. Optionally, you can override other default settings.

In [None]:
redis_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
redis_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]
redis_name = "my-redis"
redis_location = storage_location

You can select the Redis cache tier (basic, standard, or premium). Choose a SKU family that is available for the selected cache tier. To learn more, see [how selecting different tiers may affect cache performance](https://learn.microsoft.com/azure/azure-cache-for-redis/cache-best-practices-performance) and [pricing for different SKU tiers and families of Azure Cache for Redis](https://azure.microsoft.com/en-us/pricing/details/cache/).
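
The following sketch shows one way to check a tier, family, and capacity combination before creating the cache. The mapping reflects my understanding of the Azure Cache for Redis SKU documentation (family `C` for basic and standard with capacities 0-6, family `P` for premium with capacities 1-5); verify it against the current documentation before relying on it. The helper name `check_redis_sku` is hypothetical.

```python
# Assumed mapping of cache tier to SKU family and valid capacities.
FAMILY_FOR_TIER = {"basic": "C", "standard": "C", "premium": "P"}
CAPACITIES_FOR_FAMILY = {"C": range(0, 7), "P": range(1, 6)}


def check_redis_sku(tier: str, family: str, capacity: int) -> None:
    """Raise ValueError if the tier/family/capacity combination is inconsistent."""
    expected_family = FAMILY_FOR_TIER.get(tier.lower())
    if expected_family is None:
        raise ValueError(f"Unknown tier: {tier!r}")
    if family.upper() != expected_family:
        raise ValueError(f"Tier {tier!r} requires SKU family {expected_family!r}")
    if capacity not in CAPACITIES_FOR_FAMILY[expected_family]:
        raise ValueError(
            f"Capacity {capacity} is not valid for family {expected_family!r}"
        )


# The combination used in this tutorial passes silently.
check_redis_sku("premium", "P", 2)
```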

Execute the following code cell to create an Azure Cache for Redis with premium tier, SKU family `P` and cache capacity 2. It may take approximately 5-10 minutes to provision the Redis instance.

In [None]:
# Create a new Redis cache
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential
from azure.mgmt.redis import RedisManagementClient
from azure.mgmt.redis.models import RedisCreateParameters, Sku, SkuFamily, SkuName

management_client = RedisManagementClient(
    AzureMLOnBehalfOfCredential(), redis_subscription_id
)

# It usually takes about 5-10 minutes to provision the Redis instance.
# If the following begin_create() call hangs for longer than that,
# check the status of the Redis instance in the Azure portal and cancel the cell if provisioning has completed.
# This sample uses a PREMIUM tier Redis SKU from family P, which may cost more than a STANDARD tier SKU from family C.
# Please choose the SKU tier and family according to your performance and pricing requirements.

redis_arm_id = (
    management_client.redis.begin_create(
        resource_group_name=redis_resource_group_name,
        name=redis_name,
        parameters=RedisCreateParameters(
            location=redis_location,
            sku=Sku(name=SkuName.PREMIUM, family=SkuFamily.P, capacity=2),
            public_network_access="Disabled",  # can only disable PNA to redis cache during creation
        ),
    )
    .result()
    .id
)
print(redis_arm_id)

#### Step 8b: Update feature store with online store
Create a YAML file that attaches the Azure Cache for Redis to the feature store as the online materialization store and defines the necessary outbound rules.

In [None]:
# The following code cell creates a YAML specification file for outbound rules that are defined for the feature store.
## rule 1: PE to online store (redis cache): this is optional if online store is not used

import yaml

config = {
    "public_network_access": "disabled",
    "managed_network": {
        "isolation_mode": "allow_internet_outbound",
        "outbound_rules": [
            {
                "name": "sourceruleredis",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "redisCache",
                    # Use the subscription and resource group in which the Redis cache was created
                    "service_resource_id": f"/subscriptions/{redis_subscription_id}/resourcegroups/{redis_resource_group_name}/providers/Microsoft.Cache/Redis/{redis_name}",
                },
                "type": "private_endpoint",
            },
        ],
    },
    "online_store": {"target": f"{redis_arm_id}", "type": "redis"},
}

feature_store_managed_vnet_yaml = (
    root_dir + "/featurestore/feature_store_managed_vnet_config.yaml"
)

with open(feature_store_managed_vnet_yaml, "w") as outfile:
    yaml.dump(config, outfile, default_flow_style=False)
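
The outbound rule in the cell above follows a fixed shape, and a similar rule is needed again for the project workspace. As a minimal sketch, a small helper can build the rule dictionary from its parts so that the resource ID is assembled in one place. The helper name `redis_private_endpoint_rule` and the placeholder values are hypothetical; the keys and the resource ID layout mirror the configuration shown above.

```python
def redis_private_endpoint_rule(rule_name, subscription_id, resource_group, redis_name):
    """Build one private-endpoint outbound rule targeting an Azure Cache for Redis instance."""
    resource_id = (
        f"/subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/providers/Microsoft.Cache/Redis/{redis_name}"
    )
    return {
        "name": rule_name,
        "destination": {
            "spark_enabled": "true",
            "subresource_target": "redisCache",
            "service_resource_id": resource_id,
        },
        "type": "private_endpoint",
    }


# Placeholder values for illustration only.
rule = redis_private_endpoint_rule(
    "sourceruleredis", "<subscription-id>", "<resource-group>", "my-redis"
)
print(rule["destination"]["service_resource_id"])
```

The returned dictionary can be placed directly in the `outbound_rules` list of a configuration like the one above.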

Next, execute the following code cell to update the feature store using the generated YAML specification file with the outbound rules for the online store.

In [None]:
!az ml feature-store update --file $feature_store_managed_vnet_yaml --name $featurestore_name --resource-group $featurestore_resource_group_name

#### Step 8c: Update project workspace outbound rules
The project workspace needs access to the online store. The following code cell creates a YAML specification file with the required outbound rule for the project workspace.

In [None]:
import yaml

config = {
    "managed_network": {
        "isolation_mode": "allow_internet_outbound",
        "outbound_rules": [
            {
                "name": "onlineruleredis",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "redisCache",
                    # Use the subscription and resource group in which the Redis cache was created
                    "service_resource_id": f"/subscriptions/{redis_subscription_id}/resourcegroups/{redis_resource_group_name}/providers/Microsoft.Cache/Redis/{redis_name}",
                },
                "type": "private_endpoint",
            },
        ],
    }
}

project_ws_managed_vnet_yaml = (
    root_dir + "/featurestore/project_ws_managed_vnet_config.yaml"
)

with open(project_ws_managed_vnet_yaml, "w") as outfile:
    yaml.dump(config, outfile, default_flow_style=False)

Next, execute the following code cell to update the project workspace using the generated YAML specification file with the outbound rules for the online store.

In [None]:
#### Update project workspace to create private endpoints for the defined outbound rules (it may take approximately 15 minutes)
!az ml workspace update --file $project_ws_managed_vnet_yaml --name $project_ws_name --resource-group $project_ws_rg

#### Step 8d: Materialize transactions feature set to online store
The following code cell enables online materialization for the `transactions` feature set.

In [None]:
# Update featureset to enable online materialization
transactions_featureset_path = (
    root_dir
    + "/featurestore/featuresets/transactions/featureset_asset_online_enabled.yaml"
)
!az ml feature-set update --file $transactions_featureset_path --resource-group $featurestore_resource_group_name --workspace-name $featurestore_name

The following code cell defines start and end time for feature materialization window and submits a backfill materialization job.

In [None]:
feature_window_start_time = "2024-01-24T00:00.000Z"
feature_window_end_time = "2024-01-25T00:00.000Z"

!az ml feature-set backfill --name transactions --version 1 --by-data-status "['None']" --feature-window-start-time $feature_window_start_time --feature-window-end-time $feature_window_end_time --feature-store-name $featurestore_name --resource-group $featurestore_resource_group_name

## Step 9: Generate a training data frame using the registered features

#### Step 9a: Load observation data

We start by exploring the observation data. Observation data is typically the core data used for training and inference. It is joined with feature data to create the full training data. Observation data is captured at the time of the event: in this case, it contains core transaction data, including transaction ID, account ID, and transaction amount. Because this data is used for training, it also has the target variable appended (`is_fraud`).

To learn more about core concepts, including observation data, see [feature retrieval concepts](https://learn.microsoft.com/azure/machine-learning/feature-retrieval-concepts).

In [None]:
observation_data_path = f"abfss://{storage_file_system_name_observation_data}@{storage_account_name}.dfs.core.windows.net/train/*.parquet"
observation_data_df = spark.read.parquet(observation_data_path)
obs_data_timestamp_column = "timestamp"

display(observation_data_df)
# Note: the timestamp column is displayed in a different format. Optionally, you can call observation_data_df.show() to see the correctly formatted value

#### Step 9b: Get the registered feature set and list its features

Next, we get a feature set by providing its name and version, and then we list features in this feature set. Also, we print some sample feature values.  

In [None]:
# look up the featureset by providing name and version
transactions_featureset = featurestore.feature_sets.get("transactions", "1")
# list its features
transactions_featureset.features

In [None]:
# print sample values
display(transactions_featureset.to_spark_dataframe().head(5))

#### Step 9c: Select features and generate training data
In this step we will select features that we would like to be part of training data and use the feature store SDK to generate the training data.

In [None]:
from azureml.featurestore import get_offline_features

# you can select features in pythonic way
features = [
    transactions_featureset.get_feature("transaction_amount_7d_sum"),
    transactions_featureset.get_feature("transaction_amount_7d_avg"),
]

# you can also specify features in string form: featureset:version:feature
more_features = [
    "transactions:1:transaction_3d_count",
    "transactions:1:transaction_amount_3d_avg",
]

more_features = featurestore.resolve_feature_uri(more_features)
features.extend(more_features)

# generate training dataframe by using feature data and observation data
training_df = get_offline_features(
    features=features,
    observation_data=observation_data_df,
    timestamp_column=obs_data_timestamp_column,
)

# Ignore the message that says feature set is not materialized (materialization is optional). We will enable materialization in the next part of the tutorial.
display(training_df)
# Note: the timestamp column is displayed in a different format. Optionally, you can call training_df.show() to see the correctly formatted value

You can see how the features are appended to the training data using a point-in-time join.
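
To illustrate what a point-in-time join does, the following pure-Python sketch attaches to each observation the most recent feature value recorded at or before the observation timestamp, so that no future data leaks into training. It is a simplified illustration, not the implementation behind `get_offline_features`, and it ignores settings such as `source_delay` and `temporal_join_lookback`. The function name `point_in_time_join` and the sample rows are hypothetical.

```python
from datetime import datetime


def point_in_time_join(observations, feature_rows):
    """Attach the latest feature value at or before each observation timestamp.

    observations: list of (account_id, timestamp)
    feature_rows: list of (account_id, timestamp, value)
    """
    joined = []
    for account_id, obs_ts in observations:
        # Feature rows for the same account that are not in the observation's future.
        candidates = [
            (ts, value)
            for acc, ts, value in feature_rows
            if acc == account_id and ts <= obs_ts
        ]
        # Pick the most recent candidate by timestamp; None if there is no match.
        latest = max(candidates, key=lambda c: c[0])[1] if candidates else None
        joined.append((account_id, obs_ts, latest))
    return joined


# Hypothetical feature values and observations.
feature_rows = [
    ("A1", datetime(2023, 1, 1), 10.0),
    ("A1", datetime(2023, 1, 5), 40.0),
]
observations = [
    ("A1", datetime(2023, 1, 3)),  # only the Jan 1 value is visible
    ("A1", datetime(2023, 1, 6)),  # the Jan 5 value is the most recent
    ("A2", datetime(2023, 1, 6)),  # no feature rows exist for this account
]
result = point_in_time_join(observations, feature_rows)
for row in result:
    print(row)
```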

We have reached the end of the tutorial. Now you have your training data using features from feature store. You can either save it to storage for later use, or run model training on it directly.

## Step 10: Optional next steps

You have now created a secure feature store and submitted a materialization run. To learn more about the feature store, go through the tutorial notebook series. This notebook combines steps from the first and second tutorials in the series. In the other notebooks, replace the public storage containers with the ones created in this notebook for network isolation.
- [Tutorial 2: Experiment and train models using features](https://learn.microsoft.com/en-us/azure/machine-learning/tutorial-experiment-train-models-using-features)
- [Tutorial 3: Enable recurrent materialization and run batch inference](https://learn.microsoft.com/en-us/azure/machine-learning/tutorial-enable-recurrent-materialization-run-batch-inference)