# Tutorial #2: Enable materialization and backfill feature data

In this tutorial series you will experience how features seamlessly integrate all the phases of the ML lifecycle: prototyping features, training, and operationalizing.

In part 1 of the tutorial you learned how to create a feature set and use it to generate training data. When you query the feature set, the transformations are applied to the source on the fly to compute the features before the values are returned. This is fine for prototyping. However, when you run training and inference in a production environment, it is recommended that you materialize the features for higher reliability and availability. Materialization is the process of computing the feature values for a given feature window and storing them in a materialization store. All feature queries then use the values from the materialization store.
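
To make the difference between on-the-fly computation and materialization concrete, here is a minimal, self-contained sketch in plain Python. It does not use the feature store SDK: `ToyFeatureSet`, `compute_features`, and the sample rows are hypothetical names and data invented for illustration only.

```python
from datetime import datetime

# Hypothetical raw source rows: (account_id, timestamp, amount).
SOURCE = [
    ("A1", datetime(2023, 1, 1, 9), 20.0),
    ("A1", datetime(2023, 1, 2, 9), 30.0),
    ("A2", datetime(2023, 1, 2, 10), 5.0),
]


def compute_features(rows, window_start, window_end):
    """Toy transformation: total spend per account inside [window_start, window_end)."""
    totals = {}
    for account_id, ts, amount in rows:
        if window_start <= ts < window_end:
            totals[account_id] = totals.get(account_id, 0.0) + amount
    return totals


class ToyFeatureSet:
    """Illustrative stand-in for a feature set with an optional materialization store."""

    def __init__(self, rows):
        self.rows = rows
        self.store = {}  # (window_start, window_end) -> precomputed feature values

    def materialize(self, window_start, window_end):
        # Compute once and persist the result for the given feature window.
        self.store[(window_start, window_end)] = compute_features(
            self.rows, window_start, window_end
        )

    def query(self, window_start, window_end):
        key = (window_start, window_end)
        if key in self.store:
            return "materialized", self.store[key]
        # No stored values: apply the transformation to the source on the fly.
        return "on-the-fly", compute_features(self.rows, window_start, window_end)


fset = ToyFeatureSet(SOURCE)
start, end = datetime(2023, 1, 1), datetime(2023, 1, 3)
print(fset.query(start, end))  # computed on the fly
fset.materialize(start, end)
print(fset.query(start, end))  # served from the toy store
```

Both queries return the same feature values; only the place they come from changes, which is the property the real materialization store provides at scale.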

In this tutorial (part 2 of the series) you will:
- Enable the offline store on the feature store by creating and attaching an ADLS Gen2 container and a user-assigned managed identity
- Enable offline materialization on the feature sets, and backfill the feature data

#### Important

This feature is currently in public preview. This preview version is provided without a service-level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).

# Prerequisites
1. Ensure you have completed part 1 of the tutorial.
1. An Azure resource group in which you (or the service principal you use) have the `User Access Administrator` and `Contributor` roles.

# Setup
Summary of setup steps you will execute:
- In your project workspace, create Azure ML compute to run the training pipeline
- In your feature store workspace, create an offline materialization store: create an Azure Data Lake Storage Gen2 account with a container in it, and attach it to the feature store. Optionally, you can use an existing storage container.
- Create a user-assigned managed identity and assign it to the feature store. Optionally, you can use an existing one. It is used by the system-managed materialization jobs, that is, the recurrent job used in part 3 of the tutorial
- Grant the required RBAC permissions to the user-assigned managed identity
- Grant the required RBAC permissions to your AAD identity. Users (like you) need read access to (a) the sources and (b) the materialization store

#### Configure Azure ML spark notebook

1. In the "Compute" dropdown in the top nav, select "Serverless Spark Compute".
1. Click "configure session" in the top status bar, then "Python packages", then "upload conda file", and select the file `azureml-examples/sdk/python/featurestore-sample/project/env/conda.yml` from your local machine. Also increase the session timeout (idle time) if you want to avoid rerunning the prerequisites frequently.




In [None]:
print("started spark session")

#### Setup root directory for the samples

In [None]:
import os

# Update the dir to ./Users/{your-alias} (or any custom directory you uploaded the samples to).
# You can find the name in the directory structure in the left nav.
root_dir = "./Users/<your_user_alias>/featurestore_sample"

if os.path.isdir(root_dir):
    print("The folder exists.")
else:
    print("The folder does not exist. Please create or fix the path")

#### Initialize the project workspace CRUD client
This is the current workspace, from which you run the tutorial notebook.

In [None]:
### Initialize the MLClient of this project workspace
import os
from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

project_ws_sub_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
project_ws_rg = os.environ["AZUREML_ARM_RESOURCEGROUP"]
project_ws_name = os.environ["AZUREML_ARM_WORKSPACE_NAME"]

# connect to the project workspace
ws_client = MLClient(
    AzureMLOnBehalfOfCredential(), project_ws_sub_id, project_ws_rg, project_ws_name
)

#### Initialize the feature store CRUD client
Ensure you update the `featurestore_name` to reflect what you created in part 1 of this tutorial

In [None]:
from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

# feature store
featurestore_name = "my-featurestore"  # use the same name from part #1 of the tutorial
featurestore_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
featurestore_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]

# feature store ml client
fs_client = MLClient(
    AzureMLOnBehalfOfCredential(),
    featurestore_subscription_id,
    featurestore_resource_group_name,
    featurestore_name,
)

#### Initialize the feature store core sdk client

In [None]:
# feature store client
from azureml.featurestore import FeatureStoreClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

featurestore = FeatureStoreClient(
    credential=AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
    name=featurestore_name,
)

#### Setup offline materialization store
You can create a new Gen2 storage account and a container, or reuse an existing one, as the offline materialization store for the feature store.

##### Setup utility functions
Note: The code below sets up utility functions that create the storage and the user-assigned identity. These utility functions use standard Azure SDKs and are provided to keep the tutorial concise. However, do not use them for production purposes, as they might not implement best practices.

In [None]:
import sys

sys.path.insert(0, root_dir + "/featurestore/setup")
from setup_storage_uai import (
    create_gen2_storage_container,
    create_user_assigned_managed_identity,
    grant_rbac_permissions,
    grant_user_aad_storage_data_reader_role,
)

##### Set values for the ADLS Gen2 storage that will be used as the materialization store
You can optionally override the default settings

In [None]:
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

## Default settings
# We use the subscription, resource group, and region of the active project workspace.
# Resource names are hard-coded for creating new resources.

## Override
# Replace these values if you want to create the resources in a different subscription/resource group, or use existing resources.

ws_location = ws_client.workspaces.get(ws_client.workspace_name).location

# storage
storage_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
storage_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]
storage_account_name = "<FEATURE_STORAGE_ACCOUNT_NAME>"
storage_location = ws_location
storage_file_system_name = "offlinestore"
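
Before creating the storage, it can help to check the chosen names against the Azure naming rules: storage account names are 3 to 24 characters of lowercase letters and digits, and container names are 3 to 63 characters of lowercase letters, digits, and single hyphens, starting and ending with a letter or digit. The sketch below is a local pre-check only; the helper names are hypothetical and are not part of the tutorial utilities.

```python
import re


def is_valid_storage_account_name(name: str) -> bool:
    """3-24 characters, lowercase letters and digits only."""
    return re.fullmatch(r"[a-z0-9]{3,24}", name) is not None


def is_valid_container_name(name: str) -> bool:
    """3-63 characters; lowercase letters, digits, and non-consecutive hyphens."""
    if not 3 <= len(name) <= 63:
        return False
    if "--" in name:
        return False
    # Must start and end with a letter or digit.
    return re.fullmatch(r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?", name) is not None


# The unreplaced placeholder fails the check, which is a useful early warning.
print(is_valid_storage_account_name("<FEATURE_STORAGE_ACCOUNT_NAME>"))  # False
print(is_valid_container_name("offlinestore"))  # True
```

A passing check does not guarantee that the account name is available, since storage account names must also be globally unique.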

##### Storage container (option 1): create new storage container

In [None]:
gen2_container_arm_id = create_gen2_storage_container(
    AzureMLOnBehalfOfCredential(),
    storage_subscription_id=storage_subscription_id,
    storage_resource_group_name=storage_resource_group_name,
    storage_account_name=storage_account_name,
    storage_location=storage_location,
    storage_file_system_name=storage_file_system_name,
)

print(gen2_container_arm_id)

##### Storage container (option 2): If you have an existing storage container that you want to reuse

In [None]:
gen2_container_arm_id = "/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{account}/blobServices/default/containers/{container}".format(
    sub_id=storage_subscription_id,
    rg=storage_resource_group_name,
    account=storage_account_name,
    container=storage_file_system_name,
)

print(gen2_container_arm_id)
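
If you build the container ARM ID by hand, a quick structural check can catch a misplaced segment before the ID is attached to the feature store. The sketch below parses an ID of the shape used above back into its parts; `parse_container_arm_id` and the example values are hypothetical and only illustrate the format.

```python
import re

# Mirrors the format string used above for the storage container ARM ID.
CONTAINER_ID_PATTERN = re.compile(
    r"^/subscriptions/(?P<sub_id>[^/]+)"
    r"/resourceGroups/(?P<rg>[^/]+)"
    r"/providers/Microsoft\.Storage/storageAccounts/(?P<account>[^/]+)"
    r"/blobServices/default/containers/(?P<container>[^/]+)$"
)


def parse_container_arm_id(arm_id: str) -> dict:
    """Split a storage container ARM ID into its named segments."""
    match = CONTAINER_ID_PATTERN.match(arm_id)
    if match is None:
        raise ValueError(f"Not a storage container ARM ID: {arm_id!r}")
    return match.groupdict()


# Made-up example values, for illustration only.
example_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/my-rg/providers/Microsoft.Storage"
    "/storageAccounts/myaccount/blobServices/default/containers/offlinestore"
)
parts = parse_container_arm_id(example_id)
print(parts["account"], parts["container"])  # myaccount offlinestore
```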

### Setup user assigned managed identity (UAI)
The UAI is used by the system-managed materialization jobs, that is, the recurrent job used in part 3 of the tutorial.

##### Set values for UAI

In [None]:
# User assigned managed identity values. Optionally you may change the values.
uai_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
uai_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]
uai_name = "fstoreuai"
uai_location = ws_location

##### User-assigned managed identity (option 1): create new one

In [None]:
uai_principal_id, uai_client_id, uai_arm_id = create_user_assigned_managed_identity(
    AzureMLOnBehalfOfCredential(),
    uai_subscription_id=uai_subscription_id,
    uai_resource_group_name=uai_resource_group_name,
    uai_name=uai_name,
    uai_location=uai_location,
)

print("uai_principal_id:" + uai_principal_id)
print("uai_client_id:" + uai_client_id)
print("uai_arm_id:" + uai_arm_id)

##### User-assigned managed identity (option 2): If you have an existing one that you want to reuse

In [None]:
from azure.mgmt.msi import ManagedServiceIdentityClient

msi_client = ManagedServiceIdentityClient(
    AzureMLOnBehalfOfCredential(), uai_subscription_id
)

managed_identity = msi_client.user_assigned_identities.get(
    uai_resource_group_name, uai_name
)

uai_principal_id = managed_identity.principal_id
uai_client_id = managed_identity.client_id
uai_arm_id = managed_identity.id

print("uai_principal_id:" + uai_principal_id)
print("uai_client_id:" + uai_client_id)
print("uai_arm_id:" + uai_arm_id)

##### Grant RBAC permission to the user assigned managed identity (UAI)

This UAI will be assigned to the feature store shortly. It requires the following permissions:

|Scope|Action/Role|
|--|--|
|Feature store|AzureML Data Scientist role|
|Storage account of the feature store offline store|Storage Blob Data Contributor role|
|Storage accounts of source data|Storage Blob Data Reader role|

The utility function below assigns the first two roles to the UAI. In this example, "Storage accounts of source data" is not applicable because the sample data is read from a publicly accessible blob storage. If you have your own data sources, assign the required roles to the UAI. To learn more about access control, see the access control document in the docs.
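
As a rough illustration of the first two rows of the table, the sketch below lists the (scope, role) pairs such a utility might assign. It assumes the scopes are the ARM resource IDs of the feature store and the storage account and uses the Azure built-in role names; `build_uai_role_plan` is a hypothetical helper, and the real `grant_rbac_permissions` utility may be implemented differently.

```python
def build_uai_role_plan(fs_sub, fs_rg, fs_name, st_sub, st_rg, st_account):
    """Return the (scope, role name) pairs for the first two rows of the table."""
    # A feature store is addressed as a Machine Learning workspace resource.
    feature_store_scope = (
        f"/subscriptions/{fs_sub}/resourceGroups/{fs_rg}"
        f"/providers/Microsoft.MachineLearningServices/workspaces/{fs_name}"
    )
    storage_scope = (
        f"/subscriptions/{st_sub}/resourceGroups/{st_rg}"
        f"/providers/Microsoft.Storage/storageAccounts/{st_account}"
    )
    return [
        (feature_store_scope, "AzureML Data Scientist"),
        (storage_scope, "Storage Blob Data Contributor"),
    ]


# Made-up identifiers, for illustration only.
plan = build_uai_role_plan("sub-1", "rg-1", "my-featurestore", "sub-1", "rg-1", "myaccount")
for scope, role in plan:
    print(f"{role} -> {scope}")
```

Printing the plan before granting anything makes it easy to review which identity receives which role on which resource.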

In [None]:
# This utility function is provided for ease of use in the docs tutorials. It uses standard Azure APIs. You can optionally inspect it in `featurestore/setup/setup_storage_uai.py`
grant_rbac_permissions(
    AzureMLOnBehalfOfCredential(),
    uai_principal_id,
    storage_subscription_id=storage_subscription_id,
    storage_resource_group_name=storage_resource_group_name,
    storage_account_name=storage_account_name,
    featurestore_subscription_id=featurestore_subscription_id,
    featurestore_resource_group_name=featurestore_resource_group_name,
    featurestore_name=featurestore_name,
)

#### Grant your user account the "Storage Blob Data Reader" role on the offline store
Once feature data is materialized, you need this role to read feature data from the offline materialization store.

Get your AAD object ID from the Azure portal by following these instructions: https://learn.microsoft.com/en-us/partner-center/find-ids-and-domain-names#find-the-user-object-id

To learn more about access control, see the access control document in the docs.

In [None]:
# This utility function is provided for ease of use in the docs tutorials. It uses standard Azure APIs. You can optionally inspect it in `featurestore/setup/setup_storage_uai.py`
your_aad_objectid = "<USER_AAD_OBJECTID>"

grant_user_aad_storage_data_reader_role(
    AzureMLOnBehalfOfCredential(),
    your_aad_objectid,
    storage_subscription_id,
    storage_resource_group_name,
    storage_account_name,
)
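
An AAD object ID is a GUID, so a forgotten placeholder such as `<USER_AAD_OBJECTID>` can be caught locally before the role assignment call is made. The following sketch uses only the standard library; `is_guid` is a hypothetical helper, not part of the tutorial utilities.

```python
import uuid


def is_guid(value: str) -> bool:
    """Return True if value is a GUID in canonical hyphenated form."""
    try:
        # Comparing with the canonical form rejects braces and other variants.
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


print(is_guid("<USER_AAD_OBJECTID>"))  # False
print(is_guid("12345678-1234-1234-1234-123456789abc"))  # True
```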

## Step 1: Enable offline store on the feature store by attaching offline materialization store and UAI

In [None]:
from azure.ai.ml.entities import (
    ManagedIdentityConfiguration,
    FeatureStore,
    MaterializationStore,
)

offline_store = MaterializationStore(
    type="azure_data_lake_gen2",
    target=gen2_container_arm_id,
)

materialization_identity1 = ManagedIdentityConfiguration(
    client_id=uai_client_id, principal_id=uai_principal_id, resource_id=uai_arm_id
)

fs = FeatureStore(
    name=featurestore_name,
    offline_store=offline_store,
    materialization_identity=materialization_identity1,
)

fs_poller = fs_client.feature_stores.begin_update(fs, update_dependent_resources=True)

print(fs_poller.result())

## Step 2: Enable offline materialization on transactions featureset
Once materialization is enabled on a feature set, you can perform a backfill (this tutorial) or schedule recurrent materialization jobs (next part of the tutorial).

In [None]:
from azure.ai.ml.entities import (
    MaterializationSettings,
    MaterializationComputeResource,
)

# use the public feature_sets operations, consistent with the update call below
transactions_fset_config = fs_client.feature_sets.get(name="transactions", version="1")

transactions_fset_config.materialization_settings = MaterializationSettings(
    offline_enabled=True,
    resource=MaterializationComputeResource(instance_type="standard_e8s_v3"),
    spark_configuration={
        "spark.driver.cores": 4,
        "spark.driver.memory": "36g",
        "spark.executor.cores": 4,
        "spark.executor.memory": "36g",
        "spark.executor.instances": 2,
    },
    schedule=None,
)

fs_poller = fs_client.feature_sets.begin_create_or_update(transactions_fset_config)
print(fs_poller.result())
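
To get a feel for how much compute the `spark_configuration` above requests, the sketch below totals the driver and executor cores and memory. It is plain arithmetic over the dictionary values; `parse_memory_gb` and `summarize_spark_request` are hypothetical helpers, and the totals say nothing about how the service actually schedules the job.

```python
def parse_memory_gb(value: str) -> int:
    """Parse a memory string such as '36g' into whole gigabytes."""
    if not value.endswith("g"):
        raise ValueError(f"Expected a value like '36g', got {value!r}")
    return int(value[:-1])


def summarize_spark_request(conf: dict) -> dict:
    """Total the cores and memory requested by the driver plus all executors."""
    executors = int(conf["spark.executor.instances"])
    total_cores = int(conf["spark.driver.cores"]) + executors * int(
        conf["spark.executor.cores"]
    )
    total_memory_gb = parse_memory_gb(
        conf["spark.driver.memory"]
    ) + executors * parse_memory_gb(conf["spark.executor.memory"])
    return {"total_cores": total_cores, "total_memory_gb": total_memory_gb}


# Same values as the materialization settings in this step.
spark_configuration = {
    "spark.driver.cores": 4,
    "spark.driver.memory": "36g",
    "spark.executor.cores": 4,
    "spark.executor.memory": "36g",
    "spark.executor.instances": 2,
}
print(summarize_spark_request(spark_configuration))
```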

Optionally, you can save the above feature set asset as YAML.

In [None]:
## uncomment to run
# transactions_fset_config.dump(root_dir + "/featurestore/featuresets/transactions/featureset_asset_offline_enabled.yaml")

## Step 3: Backfill data for transactions featureset
As explained at the beginning of this tutorial, materialization is the process of computing the feature values for a given feature window and storing them in a materialization store. Materializing the features increases their reliability and availability. All feature queries then use the values from the materialization store. In this step you perform a one-time backfill for a feature window of __three months__.

#### Note
How do you determine the window of backfill data needed? It has to match the window of your training data. For example, if you want to train with two years of data, you will want to retrieve features for the same window, so you backfill for a two-year window.
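
The rule in the note can be expressed as a simple containment check: the backfill window should fully cover the training window. The sketch below uses only `datetime`; `window_covers` and the training windows are hypothetical and serve as illustration, while the backfill dates match the three-month window used in this step.

```python
from datetime import datetime


def window_covers(backfill_start, backfill_end, training_start, training_end):
    """Return True when the backfill window fully contains the training window."""
    return backfill_start <= training_start and training_end <= backfill_end


backfill_start = datetime(2023, 1, 1)
backfill_end = datetime(2023, 4, 1)

# Hypothetical training windows, for illustration only.
print(window_covers(backfill_start, backfill_end, datetime(2023, 2, 1), datetime(2023, 3, 1)))  # True
print(window_covers(backfill_start, backfill_end, datetime(2022, 12, 1), datetime(2023, 3, 1)))  # False
print((backfill_end - backfill_start).days)  # 90
```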

In [None]:
from datetime import datetime

st = datetime(2023, 1, 1, 0, 0, 0, 0)
ed = datetime(2023, 4, 1, 0, 0, 0, 0)

poller = fs_client.feature_sets.begin_backfill(
    name="transactions",
    version="1",
    feature_window_start_time=st,
    feature_window_end_time=ed,
)
print(poller.result().job_id)

In [None]:
# get the job URL, and stream the job logs
fs_client.jobs.stream(poller.result().job_id)

Let's print sample data from the feature set. The output information shows that the data was retrieved from the materialization store. The `get_offline_features()` method that is used to retrieve training/inference data also uses the materialization store by default.

In [None]:
# look up the featureset by providing name and version
transactions_featureset = featurestore.feature_sets.get("transactions", "1")
display(transactions_featureset.to_spark_dataframe().head(5))

## Cleanup
Part 4 of the tutorial has instructions for deleting the resources.

## Next steps
* Part 3 of the tutorial: Experiment and train models using features