# Tutorial #2: Enable materialization and backfill feature data

In this tutorial series you will see how features seamlessly integrate all phases of the ML lifecycle: prototyping features, training, and operationalizing.

In part 1 of the tutorial you learned how to create a feature set and use it to generate training data. When you query the feature set, the transformations are applied to the source on the fly to compute the features before the values are returned. This is fine for prototyping. However, when you run training and inference in a production environment, it is recommended that you materialize the features for higher reliability and availability. Materialization is the process of computing the feature values for a given feature window and storing them in a materialization store. All feature queries will then use the values from the materialization store.
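The difference between on-the-fly computation and materialization can be illustrated with a toy, in-memory sketch. Everything here is hypothetical (`compute_feature`, `ToyMaterializationStore`); it only mirrors the idea described above and says nothing about how the managed feature store is implemented.

```python
from datetime import date, timedelta


def compute_feature(day: date) -> int:
    """Stand-in for an expensive transformation over source data (toy value)."""
    return day.toordinal() % 7


class ToyMaterializationStore:
    """In-memory stand-in for a materialization store (illustration only)."""

    def __init__(self) -> None:
        self._values = {}
        self.compute_calls = 0  # counts how often the transformation runs

    def _compute(self, day: date) -> int:
        self.compute_calls += 1
        return compute_feature(day)

    def materialize(self, start: date, end: date) -> None:
        """Precompute and store values for every day in [start, end)."""
        day = start
        while day < end:
            self._values[day] = self._compute(day)
            day += timedelta(days=1)

    def query(self, day: date) -> int:
        """Serve stored values when available; otherwise compute on the fly."""
        if day in self._values:
            return self._values[day]
        return self._compute(day)


store = ToyMaterializationStore()
store.materialize(date(2023, 1, 1), date(2023, 1, 4))  # backfill three days
store.query(date(2023, 1, 2))   # served from the store, no recomputation
store.query(date(2023, 1, 10))  # outside the window, computed on the fly
```

In this sketch, queries inside the materialized window never re-run the transformation, which is the property the tutorial relies on for production workloads.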

In this tutorial (part 2 of the series) you will:
- Enable the offline store on the feature store by creating and attaching an ADLS Gen2 container and a user-assigned managed identity
- Enable offline materialization on the feature sets, and backfill the feature data

#### Important

This feature is currently in public preview. This preview version is provided without a service-level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).

# Prerequisites
1. Ensure you have completed part 1 of the tutorial.
1. An Azure resource group in which you (or the service principal you use) have the `User Access Administrator` and `Contributor` roles.

# Setup
Summary of setup steps you will execute:
- In your project workspace, create Azure ML compute to run the training pipeline
- In your feature store workspace, create an offline materialization store: create an Azure Data Lake Storage Gen2 account and a container in it, and attach it to the feature store. Optionally, you can use an existing storage container.
- Create a user-assigned managed identity and assign it to the feature store. Optionally, you can use an existing one. It is used by the system-managed materialization jobs, i.e. the recurrent job used in part 3 of the tutorial
- Grant the required RBAC permissions to the user-assigned managed identity
- Grant the required RBAC permissions to your AAD identity. Users (like you) need read access to (a) the sources and (b) the materialization store

#### Configure Azure ML spark notebook

1. Running the tutorial: You can either create a new notebook and execute the instructions in this document step by step, or open the existing notebook named `2. Enable materialization and backfill feature data.ipynb` and run it. The notebooks are available in the `featurestore_sample/notebooks` directory. You can select from `sdk_only` or `sdk_and_cli`. You may keep this document open and refer to it for additional explanation and documentation links.
1. In the "Compute" dropdown in the top nav, select "Serverless Spark Compute".
1. Click "configure session" in the top status bar -> click "Python packages" -> click "upload conda file" -> select the file `azureml-examples/sdk/python/featurestore-sample/project/env/conda.yml` from your local machine. Also increase the session timeout (idle time) if you want to avoid rerunning the prerequisites frequently.


In [None]:
print("started spark session")

#### Setup root directory for the samples

In [None]:
import os

# Please update the dir to ./Users/{your-alias} (or any custom directory you uploaded the samples to).
# You can find the name from the directory structure in the left nav
root_dir = "./Users/<your user alias>/featurestore_sample"

if os.path.isdir(root_dir):
    print("The folder exists.")
else:
    print("The folder does not exist. Please create or fix the path")

#### (new for sdk/cli track) Setup CLI

1. Install the Azure ML CLI extension
1. Authenticate
1. Set the default subscription

In [None]:
# Install the Azure ML CLI extension
!az extension add --name ml

In [None]:
# Authenticate
!az login

In [None]:
# Set the default subscription
import os

subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]

!az account set -s $subscription_id

#### (new for sdk/cli track) Initialize the project workspace properties
This is the current workspace where you will be running the tutorial notebook from.

In [None]:
# lookup the subscription id, resource group and workspace name of the current workspace
project_ws_sub_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
project_ws_rg = os.environ["AZUREML_ARM_RESOURCEGROUP"]
project_ws_name = os.environ["AZUREML_ARM_WORKSPACE_NAME"]
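The lookups above raise a bare `KeyError` when a variable is absent, for example when the notebook runs outside an Azure ML compute session. A minimal sketch of a helper that reports all missing variables at once is shown below; `read_workspace_env` and `REQUIRED_VARS` are hypothetical names introduced for illustration, while the variable names are the ones used in this tutorial.

```python
import os

# Environment variables this tutorial reads from the notebook session
REQUIRED_VARS = (
    "AZUREML_ARM_SUBSCRIPTION",
    "AZUREML_ARM_RESOURCEGROUP",
    "AZUREML_ARM_WORKSPACE_NAME",
)


def read_workspace_env(environ=None) -> dict:
    """Return the required variables, or raise one error naming all missing ones."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}


# Example with a stand-in mapping; pass nothing to read the real environment
example = read_workspace_env(
    {
        "AZUREML_ARM_SUBSCRIPTION": "00000000-0000-0000-0000-000000000000",
        "AZUREML_ARM_RESOURCEGROUP": "my-rg",
        "AZUREML_ARM_WORKSPACE_NAME": "my-workspace",
    }
)
```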

#### (new for sdk/cli track) Initialize the feature store properties
Ensure you update the `featurestore_name` and `featurestore_location` to reflect what you created in part 1 of this tutorial

In [None]:
# use the same name from part #1 of the tutorial
featurestore_name = "my-featurestore"
# use the same location from part #1 of the tutorial
featurestore_location = "eastus"
# use the subscription of the project workspace by default
featurestore_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
# use the resource group of the project workspace by default
featurestore_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]

feature_store_arm_id = "/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.MachineLearningServices/workspaces/{ws_name}".format(
    sub_id=featurestore_subscription_id,
    rg=featurestore_resource_group_name,
    ws_name=featurestore_name,
)

print(feature_store_arm_id)
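Because this ARM ID is later passed as an RBAC scope, a malformed value (for example, one built from an empty variable) only surfaces when a CLI command fails. The sketch below checks the shape of the string produced above; `parse_workspace_arm_id` is a hypothetical helper that validates only the segment layout used in this tutorial, not whether the resource exists.

```python
def parse_workspace_arm_id(arm_id: str) -> dict:
    """Split a workspace ARM ID into its parts, raising ValueError if malformed."""
    parts = arm_id.strip("/").split("/")
    # Expected layout:
    # subscriptions/<sub>/resourceGroups/<rg>/providers/
    #   Microsoft.MachineLearningServices/workspaces/<name>
    fixed_segments = {
        0: "subscriptions",
        2: "resourcegroups",
        4: "providers",
        5: "microsoft.machinelearningservices",
        6: "workspaces",
    }
    if len(parts) != 8:
        raise ValueError(f"Unexpected number of segments in ARM ID: {arm_id!r}")
    for index, literal in fixed_segments.items():
        if parts[index].lower() != literal:
            raise ValueError(f"Unexpected segment {parts[index]!r} in {arm_id!r}")
    if not all(parts[i] for i in (1, 3, 7)):
        raise ValueError(f"Empty subscription, resource group, or name in {arm_id!r}")
    return {
        "subscription_id": parts[1],
        "resource_group": parts[3],
        "workspace_name": parts[7],
    }


sample_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/my-rg"
    "/providers/Microsoft.MachineLearningServices/workspaces/my-featurestore"
)
parsed = parse_workspace_arm_id(sample_id)
```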

#### Initialize the feature store core sdk client

In [None]:
# feature store client
from azureml.featurestore import FeatureStoreClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

featurestore = FeatureStoreClient(
    credential=AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
    name=featurestore_name,
)

#### Setup offline materialization store
You can create a new Gen2 storage account and a container, or reuse an existing one, as the offline materialization store for the feature store.

##### Note to docs team: 
The SDK only track has: `Setup utility functions` and the note below it ("This code ..."). This is not applicable in the CLI + SDK track, you can remove it in this track

##### Set values for the ADLS Gen2 storage that will be used as the materialization store
You can optionally override the default settings

In [None]:
## Default settings
# We use the subscription, resource group, and region of the active project workspace.
# Default names for the new resources are hard-coded below.

## Override
# Replace them if you want to create the resources in a different subscription/resource group, or use existing resources

storage_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
storage_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]
storage_account_name = "fstorestorage"
# feature store location is used by default. You can change it.
storage_location = featurestore_location
storage_file_system_name = "offlinestore"

##### Storage container (option 1): create new storage and container

In [None]:
# create new storage account
!az storage account create --name $storage_account_name --enable-hierarchical-namespace true --resource-group $storage_resource_group_name --location $storage_location --subscription $storage_subscription_id

In [None]:
# create new storage container
!az storage fs create --name $storage_file_system_name --account-name $storage_account_name --subscription $storage_subscription_id --auth-mode login

In [None]:
# set the container arm id
gen2_container_arm_id = "/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{account}/blobServices/default/containers/{container}".format(
    sub_id=storage_subscription_id,
    rg=storage_resource_group_name,
    account=storage_account_name,
    container=storage_file_system_name,
)

print(gen2_container_arm_id)

##### Storage container (option 2): If you have an existing storage container that you want to reuse

In [None]:
# set the container arm id
gen2_container_arm_id = "/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{account}/blobServices/default/containers/{container}".format(
    sub_id=storage_subscription_id,
    rg=storage_resource_group_name,
    account=storage_account_name,
    container=storage_file_system_name,
)

print(gen2_container_arm_id)
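Both options build the same container ARM ID, so the formatting can be factored into one function that also rejects empty values. This is a minimal sketch; `build_container_arm_id` is a hypothetical helper, and the template string is the one used in the cells above.

```python
def build_container_arm_id(
    subscription_id: str, resource_group: str, account: str, container: str
) -> str:
    """Format the ARM ID of a storage container, rejecting empty parts."""
    values = {
        "sub_id": subscription_id,
        "rg": resource_group,
        "account": account,
        "container": container,
    }
    empty = [key for key, value in values.items() if not value]
    if empty:
        raise ValueError("Empty ARM ID parts: " + ", ".join(empty))
    template = (
        "/subscriptions/{sub_id}/resourceGroups/{rg}"
        "/providers/Microsoft.Storage/storageAccounts/{account}"
        "/blobServices/default/containers/{container}"
    )
    return template.format(**values)


example_container_id = build_container_arm_id(
    "00000000-0000-0000-0000-000000000000", "my-rg", "fstorestorage", "offlinestore"
)
```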

### Setup user assigned managed identity (UAI)
This identity is used by the system-managed materialization jobs, i.e. the recurrent job used in part 3 of the tutorial.

##### Set values for UAI

In [None]:
# User assigned managed identity values. Optionally you may change the values.
uai_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
uai_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]
uai_name = "fstoreuai"
# feature store location is used by default. You can change it.
uai_location = featurestore_location

##### User-assigned managed identity (option 1): create new one

In [None]:
!az identity create --subscription $uai_subscription_id --resource-group $uai_resource_group_name --location $uai_location --name $uai_name

##### User-assigned managed identity (option 2): If you have an existing one that you want to reuse
Run `az identity show` to get the UAI information

In [None]:
!az identity show --resource-group $uai_resource_group_name --subscription $uai_subscription_id --name $uai_name

##### Retrieve UAI properties

In [None]:
from azure.mgmt.msi import ManagedServiceIdentityClient
from azure.mgmt.msi.models import Identity

msi_client = ManagedServiceIdentityClient(
    AzureMLOnBehalfOfCredential(), uai_subscription_id
)
managed_identity = msi_client.user_assigned_identities.get(
    resource_name=uai_name, resource_group_name=uai_resource_group_name
)

uai_principal_id = managed_identity.principal_id
uai_client_id = managed_identity.client_id
uai_arm_id = managed_identity.id

##### Grant RBAC permission to the user assigned managed identity (UAI)

This UAI will be assigned to the feature store shortly. It requires the following permissions:

|Scope|Action/Role|
|--|--|
|Feature store|AzureML Data Scientist role|
|Storage account of the feature store offline store|Storage Blob Data Contributor role|
|Storage accounts of source data|Storage Blob Data Reader role|

The CLI commands below assign the first two roles to the UAI. In this example, "Storage accounts of source data" is not applicable because the sample data is read from a publicly accessible blob storage. If you have your own data sources, assign the required roles on them to the UAI. To learn more about access control, see the access control document in the docs.

In [None]:
!az role assignment create --role "AzureML Data Scientist" --assignee-object-id $uai_principal_id --assignee-principal-type ServicePrincipal --scope $feature_store_arm_id

In [None]:
!az role assignment create --role "Storage Blob Data Contributor" --assignee-object-id $uai_principal_id --assignee-principal-type ServicePrincipal --scope $gen2_container_arm_id
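The two assignments above can also be kept as data and rendered into command strings, which makes it easier to add a third entry for your own source storage accounts. This is a sketch only: `role_assignment_command` and `uai_role_commands` are hypothetical helpers, the flags are the ones used in the cells above, and the commands are built as strings rather than executed.

```python
import shlex


def role_assignment_command(
    role: str, principal_id: str, principal_type: str, scope: str
) -> str:
    """Render one `az role assignment create` command with shell-safe quoting."""
    args = [
        "az", "role", "assignment", "create",
        "--role", role,
        "--assignee-object-id", principal_id,
        "--assignee-principal-type", principal_type,
        "--scope", scope,
    ]
    return " ".join(shlex.quote(arg) for arg in args)


def uai_role_commands(
    principal_id: str, feature_store_scope: str, offline_store_scope: str
) -> list:
    """Commands for the two roles the tutorial grants to the UAI."""
    assignments = [
        ("AzureML Data Scientist", feature_store_scope),
        ("Storage Blob Data Contributor", offline_store_scope),
    ]
    return [
        role_assignment_command(role, principal_id, "ServicePrincipal", scope)
        for role, scope in assignments
    ]


# Placeholder scopes for illustration; use the ARM IDs built earlier in practice
commands = uai_role_commands("0000", "/featurestore-scope", "/container-scope")
```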

#### Grant your user account the "Storage Blob Data Reader" role on the offline store
If feature data is materialized, you need this role to read feature data from the offline materialization store.

Get your AAD object id from the Azure portal by following these instructions: https://learn.microsoft.com/en-us/partner-center/find-ids-and-domain-names#find-the-user-object-id

To learn more about access control, see access control document in the docs.

In [None]:
# Replace the placeholder with your AAD object id, then grant yourself read access on the offline store container
your_aad_objectid = "<your_aad_objectId>"

!az role assignment create --role "Storage Blob Data Reader" --assignee-object-id $your_aad_objectid --assignee-principal-type User --scope $gen2_container_arm_id
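An AAD object id is a GUID, so a quick local check can catch the case where the placeholder above was left unchanged before the role assignment runs. `is_guid` below is a hypothetical helper built on the standard library; it verifies the format only, not that the object exists in your tenant.

```python
import uuid


def is_guid(value: str) -> bool:
    """Return True only for the canonical hyphenated GUID form (any letter case)."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # uuid.UUID also accepts braces and unhyphenated hex; require the canonical form
    return str(parsed) == value.lower()


placeholder_ok = is_guid("<your_aad_objectId>")
```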

## Step 1: Enable offline store on the feature store by attaching offline materialization store and UAI

__(todo) (new for sdk+cli track)__ Action: inspect the file `xxxx`. The below command will update the feature store by attaching the offline store and UAI.

In [None]:
# The code below writes a feature store YAML spec that attaches the offline store and the UAI
import yaml

config = {
    "$schema": "http://azureml/sdk-2-0/FeatureStore.json",
    "name": featurestore_name,
    "location": featurestore_location,
    "compute_runtime": {"spark_runtime_version": "3.2"},
    "offline_store": {"type": "azure_data_lake_gen2", "target": gen2_container_arm_id},
    "materialization_identity": {"client_id": uai_client_id, "resource_id": uai_arm_id},
}

feature_store_yaml = root_dir + "/featurestore/featurestore_with_offline_setting.yaml"

with open(feature_store_yaml, "w") as outfile:
    yaml.dump(config, outfile, default_flow_style=False)

In [None]:
!az ml feature-store update --file $feature_store_yaml --resource-group $featurestore_resource_group_name --name $featurestore_name
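Before running the update, it can help to confirm that the values computed in earlier cells actually made it into the config, since an empty `target` or `client_id` would otherwise only fail at the service. The sketch below is an assumption-laden local check: `REQUIRED_FIELDS` is derived from the dict built above, not from the service schema, and `missing_config_fields` is a hypothetical helper.

```python
# Sections and fields this tutorial sets to enable the offline store
# (taken from the config dict above; not an authoritative schema)
REQUIRED_FIELDS = {
    "offline_store": ("type", "target"),
    "materialization_identity": ("client_id", "resource_id"),
}


def missing_config_fields(config: dict) -> list:
    """List required sections or fields that are absent or empty."""
    missing = []
    for section, fields in REQUIRED_FIELDS.items():
        block = config.get(section)
        if not isinstance(block, dict):
            missing.append(section)
            continue
        for field in fields:
            if not block.get(field):
                missing.append(f"{section}.{field}")
    return missing


sample_config = {
    "name": "my-featurestore",
    "offline_store": {"type": "azure_data_lake_gen2", "target": ""},
    "materialization_identity": {"client_id": "abc", "resource_id": "/some/id"},
}
problems = missing_config_fields(sample_config)
```

An empty list means every field the tutorial sets is present; anything else names the section or field to fix before calling `az ml feature-store update`.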

## Step 2: Enable offline materialization on transactions featureset
Once materialization is enabled on a feature set, you can perform a backfill (this tutorial) or schedule recurrent materialization jobs (next part of the tutorial).

__(todo) (new for sdk+cli track)__ Action: inspect the file `xxxx`. The command below updates the transactions feature set to enable offline materialization.

In [None]:
transaction_asset_mat_yaml = (
    root_dir
    + "/featurestore/featuresets/transactions/featureset_asset_offline_enabled.yaml"
)

!az ml feature-set update --file $transaction_asset_mat_yaml --resource-group $featurestore_resource_group_name --workspace-name $featurestore_name

## Step 3: Backfill data for transactions featureset
As explained at the beginning of this tutorial, materialization is the process of computing the feature values for a given feature window and storing them in a materialization store. Materializing the features increases their reliability and availability. All feature queries will then use the values from the materialization store. In this step you perform a one-time backfill for a feature window of __three months__.

#### Note
How do you determine the window of backfill data needed? It has to match the window of your training data. For example, if you want to train with two years of data, you need to be able to retrieve features for the same window, so you backfill for a two-year window.
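The rule in the note can be turned into a small calculation: given the end of the window and the length of the training period in months, derive the start. This is a sketch under stated assumptions: `months_before` and `format_window_time` are hypothetical helpers, the end date is assumed to fall on a day that exists in the start month (such as the 1st), and the string layout simply mirrors the values used in this tutorial's backfill cell.

```python
from datetime import date


def months_before(end: date, months: int) -> date:
    """Return the date `months` calendar months before `end`.

    Assumes end.day exists in the target month (e.g. the 1st);
    otherwise date.replace raises ValueError.
    """
    total = end.year * 12 + (end.month - 1) - months
    year, month_index = divmod(total, 12)
    return end.replace(year=year, month=month_index + 1)


def format_window_time(day: date) -> str:
    """Format midnight UTC in the same layout as the tutorial's window strings."""
    return f"{day:%Y-%m-%d}T00:00.000Z"


window_end = date(2023, 4, 1)
window_start = months_before(window_end, 3)       # three-month backfill
two_year_start = months_before(window_end, 24)    # window for two years of training data

start_str = format_window_time(window_start)
end_str = format_window_time(window_end)
```

With a three-month window ending on 2023-04-01, this reproduces the start and end strings used in the backfill command of this tutorial.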

In [None]:
feature_window_start_time = "2023-01-01T00:00.000Z"
feature_window_end_time = "2023-04-01T00:00.000Z"

!az ml feature-set backfill --name transactions --version 1 --workspace-name $featurestore_name --resource-group $featurestore_resource_group_name --feature-window-start-time $feature_window_start_time --feature-window-end-time $feature_window_end_time

Let's print sample data from the feature set. The output information shows that the data was retrieved from the materialization store. The `get_offline_features()` method, which is used to retrieve training/inference data, also uses the materialization store by default.

In [None]:
# look up the featureset by providing name and version
transactions_featureset = featurestore.feature_sets.get("transactions", "1")
display(transactions_featureset.to_spark_dataframe().head(5))

## Cleanup
Part 4 of the tutorial has instructions for deleting the resources

## Next steps
* Part 3 of tutorial: Experiment and train models using features