# Setup Environment

Follow this notebook to set up the environment for running the project. Run it in any local environment or Azure service that allows you to:
- run jupyter notebooks
- authenticate with credential of the target tenant

## Step 1. Create Azure Resources

In [None]:
# Prerequisites:
subscription_id = ""
tenant_id = ""

In [None]:
# Decide Prefix for the name of the environment. Try to keep it short and UNIQUE.
#   the prefix is used to identify/name the resources
#   i.e., resource group will be named <prefix>rg
#   i.e., storage account will be named <prefix>sa
#   i.e., purview will be named <prefix>pv
# etc.
prefix = ""

# Decide the location of the resources.
location = "westeurope"

In [None]:
resource_group_name = f"{prefix}rg" # name of the resource group
featurestore_name = f"{prefix}fs" # name of feature store
storage_account_name = f"{prefix}sa" # name of the storage account
purview_name = f"{prefix}pv" # The purview name. !It must be globally unique!
sp_name=f"{prefix}sp" # name of the service principal
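Because every resource name is derived from `prefix`, it can help to validate the prefix before creating anything. Azure storage account names must be 3 to 24 characters long and contain only lowercase letters and digits, so `<prefix>sa` has to satisfy that rule. The sketch below is an illustration only: `check_prefix` is a hypothetical helper, not part of this project, and it checks only the storage account rule.

```python
import re
from typing import List

# Storage account names: 3-24 characters, lowercase letters and digits only.
_STORAGE_NAME_RE = re.compile(r"[a-z0-9]{3,24}")


def check_prefix(prefix: str) -> List[str]:
    """Return a list of problems with the prefix (an empty list means OK)."""
    if not prefix:
        return ["prefix is empty"]

    problems = []
    # Mirrors the naming scheme used in this notebook: <prefix>sa
    storage_account_name = f"{prefix}sa"
    if not _STORAGE_NAME_RE.fullmatch(storage_account_name):
        problems.append(
            f"'{storage_account_name}' is not a valid storage account name "
            "(3-24 characters, lowercase letters and digits only)"
        )
    return problems
```

Calling `check_prefix(prefix)` right after setting `prefix` and stopping when the list is non-empty avoids a failed deployment later in the notebook.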

In [None]:
# install necessary packages. skip those you have already installed.
!pip install azure-cli
!pip install azure-identity
!pip install azure-mgmt-purview
!pip install azureml-featurestore
!pip install azure-mgmt-resource
!pip install azure-mgmt-storage

In [None]:
# obtain credential to create the resources
from azure.identity import InteractiveBrowserCredential
default_credential = InteractiveBrowserCredential(tenant_id=tenant_id)

### Create a Resource Group

In [None]:
from azure.core.exceptions import ResourceNotFoundError
from azure.mgmt.resource import ResourceManagementClient

def check_or_create_resource_group(subscription_id, resource_group_name, location):
    """Create the resource group unless it already exists."""

    # Initialize the ResourceManagementClient (uses default_credential defined above)
    resource_client = ResourceManagementClient(default_credential, subscription_id)

    # Check if the resource group already exists
    try:
        resource_group = resource_client.resource_groups.get(resource_group_name)
        print(f"Resource group '{resource_group_name}' already exists.")
    except ResourceNotFoundError:
        # Only a "not found" error means the group is missing; other errors
        # (authentication, network, permissions) are raised to the caller.
        print(f"Resource group '{resource_group_name}' does not exist. Creating...")
        resource_group_params = {'location': location}
        resource_group = resource_client.resource_groups.create_or_update(
            resource_group_name, resource_group_params
        )
        print(f"Resource group '{resource_group_name}' created.")
    return resource_group

In [None]:
# create the resource group
# ! this action may open your browser to login to azure portal. Follow the instruction to login.
check_or_create_resource_group(subscription_id, resource_group_name, location)

### Create a Purview Account

In [None]:
from azure.mgmt.purview import PurviewManagementClient
from azure.mgmt.purview.models import Account, AccountSku, Identity
import time

purview_client = PurviewManagementClient(default_credential, subscription_id)

# create a purview account
# notice: if you hit error 2005, which indicates a quota limit, try a different location.
identity = Identity(type="SystemAssigned")
sku = AccountSku(name="Standard", capacity=4)
purview_resource = Account(identity=identity, sku=sku, location=location)

try:
    pa = purview_client.accounts.begin_create_or_update(
        resource_group_name, purview_name, purview_resource
    ).result()
    print("location:", pa.location, " Microsoft Purview Account Name: ", purview_name, " Id: ", pa.id, " tags: ", pa.tags)
except Exception as e:
    print(f"Error in submitting job to create account: {e}")
    raise  # stop here; `pa` is undefined, so the polling loop below cannot run

# poll until provisioning has finished
while pa.provisioning_state != "Succeeded":
    if pa.provisioning_state == "Failed":
        print("Error in creating Microsoft Purview account")
        break
    time.sleep(30)
    pa = purview_client.accounts.get(resource_group_name, purview_name)
    print(pa.provisioning_state)
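The polling loop above has no upper bound on how long it waits. A minimal sketch of a bounded variant follows, assuming the state can be fetched through a zero-argument callable. `wait_for_state` and its parameters are hypothetical names introduced for illustration, and the 30-minute default timeout is an arbitrary assumption rather than a documented provisioning limit.

```python
import time


def wait_for_state(fetch_state, success="Succeeded", failure="Failed",
                   interval_s=30.0, timeout_s=1800.0, sleep=time.sleep):
    """Poll fetch_state() until it returns `success`.

    Raises RuntimeError if `failure` is returned and TimeoutError if the
    state is still pending after roughly `timeout_s` seconds of waiting.
    """
    waited = 0.0
    while True:
        state = fetch_state()
        if state == success:
            return state
        if state == failure:
            raise RuntimeError(f"Provisioning ended in state '{state}'")
        if waited >= timeout_s:
            raise TimeoutError(f"Still in state '{state}' after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s


# Possible use with the client from the cell above (not executed here):
# wait_for_state(
#     lambda: purview_client.accounts.get(resource_group_name, purview_name).provisioning_state
# )
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays.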

### Create an Azure ML Managed Feature Store

In [None]:
from azure.ai.ml import MLClient
from azure.ai.ml.entities import FeatureStore

fs_client = MLClient(
    default_credential,
    subscription_id,
    resource_group_name,
    featurestore_name,
)

fs = FeatureStore(name=featurestore_name, location=location)
# wait for featurestore creation
fs_poller = fs_client.feature_stores.begin_create(fs, update_dependent_resources=True)
print(fs_poller.result())

### Create a Service Principal

In [None]:
sp_name=f"{prefix}sp"

In [None]:
# create the service principal
sp_creation_output = !az ad sp create-for-rbac --name $sp_name

**Notice**: Make a note of the following cell output. The `password` here is the `client_secret` of the Service Principal. You will need it when setting up the data pipeline parameters in the Fabric workspace.

In [None]:
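# parse the CLI output to get the service principal information
import json
import re

# the CLI output is captured as a list of lines; join them into one string
sp_creation_output_str = ''.join(sp_creation_output)

# the JSON object may be preceded by warning text, so search for it
match = re.search(r'\{.*\}', sp_creation_output_str)

if match:
    sp_dict = json.loads(match.group())
    print(sp_dict)  # includes the password (client secret); handle with care
else:
    # fail here rather than with a NameError on sp_dict in later cells
    raise RuntimeError(f"No JSON found in the CLI output: {sp_creation_output_str}")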
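The regular expression above takes everything between the first `{` and the last `}`, which breaks if a warning line happens to contain a brace. As a hedged alternative, the sketch below uses `json.JSONDecoder.raw_decode` from the standard library to try each `{` in turn until one parses. `extract_first_json_object` is a hypothetical helper, and the sample text in the usage comment is made up for illustration, not real CLI output.

```python
import json


def extract_first_json_object(text):
    """Return the first JSON object embedded in `text`, or raise ValueError."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON at this brace; try the next one.
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in the text")


# Illustrative input only (placeholder values):
# text = 'WARNING: some note\n{"appId": "app-123", "password": "...", "tenant": "tenant-456"}'
# sp_dict = extract_first_json_object(text)
```

With the captured cell output, joining the lines with `"\n".join(sp_creation_output)` before calling the helper keeps the text well-formed for parsing.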

In [None]:
# app_id/client_id of the service principal
app_id = sp_dict['appId']

### Assign Roles

Allow the Service Principal to access the feature store. Assign it the `AzureML Data Scientist` role so that it can register feature sets to the store and retrieve them.

In [None]:
featurestore_arm_id = f"/subscriptions/{subscription_id}/resourceGroups/{resource_group_name}/providers/Microsoft.MachineLearningServices/workspaces/{featurestore_name}"

In [None]:
!az role assignment create \
    --assignee $app_id  \
    --role "AzureML Data Scientist" \
    --scope $featurestore_arm_id

Allow the Service Principal to access Purview. The cells below assign it the `Contributor` role on the Purview account resource so that it can register and scan the data assets.

In [None]:
purview_arm_id = f"/subscriptions/{subscription_id}/resourceGroups/{resource_group_name}/providers/Microsoft.Purview/accounts/{purview_name}"
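Both scope strings used for the role assignments follow the same ARM resource ID pattern. Since `subscription_id` and `prefix` start out as empty strings in this notebook, a small builder that rejects empty parts can catch a malformed scope early. This is a sketch only: `arm_resource_id` is a hypothetical helper, not an Azure SDK function.

```python
def arm_resource_id(subscription_id, resource_group_name, provider, resource_type, name):
    """Build the ARM ID of a top-level resource inside a resource group."""
    parts = {
        "subscription_id": subscription_id,
        "resource_group_name": resource_group_name,
        "provider": provider,
        "resource_type": resource_type,
        "name": name,
    }
    # An empty part would silently produce a scope that matches nothing.
    for key, value in parts.items():
        if not value:
            raise ValueError(f"{key} must not be empty")
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group_name}"
        f"/providers/{provider}/{resource_type}/{name}"
    )


# Equivalent to the two IDs built in this notebook (not executed here):
# arm_resource_id(subscription_id, resource_group_name,
#                 "Microsoft.MachineLearningServices", "workspaces", featurestore_name)
# arm_resource_id(subscription_id, resource_group_name,
#                 "Microsoft.Purview", "accounts", purview_name)
```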

In [None]:
!az role assignment create \
    --assignee $app_id  \
    --role "Contributor" \
    --scope $purview_arm_id

You will also need to manually assign the Service Principal the `Data Curator` role in the Purview root collection. For details, see [README.md](./README.md) and search for "Data Lineage Setup".

## Step 2. Create Fabric Workspace

Follow the steps in the [README.md](./README.md) (search for "Microsoft Fabric Setup Steps") to create a new Fabric workspace.

After the workspace is created, follow the steps in the [README.md](./README.md) (search for "Fabric Environment Setup") to create the Fabric Environment.

You will need to edit the Spark properties. The YAML template for the properties list is in this repo at `./src/environment/sparkProperties.yaml`. Replace its values with those of the Azure Managed Feature Store and Service Principal you created. Run the following cell to get most of the information. However, you need to provide `<fabric-tenant-name>`, `<fabric-workspace>`, and `<fabric-lakehouse>` manually.

In [None]:
env_props = f"""
runtime_version: '1.1'
spark_conf:
  - spark.fsd.client_id: {sp_dict['appId']}
  - spark.fsd.tenant_id: {sp_dict['tenant']}
  - spark.fsd.subscription_id: {subscription_id}
  - spark.fsd.rg_name: {resource_group_name}
  - spark.fsd.name: {featurestore_name}
  - spark.fsd.fabric.tenant: <fabric-tenant-name> # Fetch from Fabric base URL, like https://<fabric-tenant-name>.powerbi.com/
  - spark.fsd.purview.account: {purview_name}
"""

print(env_props)
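Since some values still have to be filled in by hand, a quick check for leftover `<placeholder>` tokens can help before pasting the properties into Fabric. The sketch below is an assumption-laden illustration: `find_unfilled_placeholders` is a hypothetical helper, and it treats everything after ` #` on a line as a comment, which is a simplification of YAML's comment rules.

```python
import re

# Matches tokens such as <fabric-tenant-name>
_PLACEHOLDER_RE = re.compile(r"<[A-Za-z][A-Za-z0-9_-]*>")


def find_unfilled_placeholders(text):
    """Return distinct <placeholder> tokens outside comments, in order of appearance."""
    found = []
    for line in text.splitlines():
        # Drop the trailing comment so hints like https://<name>.powerbi.com/ are ignored.
        value_part = line.split(" #", 1)[0]
        for token in _PLACEHOLDER_RE.findall(value_part):
            if token not in found:
                found.append(token)
    return found
```

Running `find_unfilled_placeholders(env_props)` on the text built above would be expected to report `<fabric-tenant-name>` until that value is replaced manually.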

Save and Publish the Environment. To apply the environment, set the newly created environment as 'default' in the Fabric workspace settings page. This will take several minutes to complete.

## Step 3. Set up the Data Pipeline

Follow the steps in the [README.md](./README.md) (search for "Data Pipeline Setup") to set up the pipeline.