# Azure Machine Learning Serverless Spark with Managed Virtual Network
This notebook provides sample code for running a Spark job using [Azure Machine Learning serverless Spark compute with a managed virtual network](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-managed-network#configure-for-serverless-spark-jobs). In this sample notebook you will:
- Create an Azure Machine Learning Workspace with Public Network Access _disabled_.
- Configure outbound rules for the Azure Machine Learning workspace that allow storage account data access.
- Provision the managed network for the workspace.
- View created outbound rules.
- Submit a Spark job using serverless Spark compute.

### Getting MLClient instance

In [None]:
# import required libraries
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Enter the details of your subscription
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"

# get an instance of MLClient
ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group)

### Create a Workspace
Define a managed VNet with isolation mode `IsolationMode.ALLOW_INTERNET_OUTBOUND` and a user-defined outbound rule for an Azure Blob storage account. In this example, Public Network Access to the workspace is _disabled_. The code in the following cell creates the workspace, but the managed VNet and the Private Endpoints corresponding to the outbound rules are provisioned in a later step.

> [!NOTE]
> If you want to allow only approved outbound traffic to enable data exfiltration protection (DEP), use `IsolationMode.ALLOW_ONLY_APPROVED_OUTBOUND`.

If the Azure Blob storage account needs to have Public Network Access _disabled_, then access should be disabled before adding the outbound rule and provisioning the managed VNet for the workspace. 
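
The outbound rule in the next cell identifies the storage account by its Azure resource ID, which is assembled from the subscription, resource group, and account name. The sketch below builds that ID and rejects values that are still `<PLACEHOLDER>` text. The helper name `storage_account_resource_id` is hypothetical and not part of the Azure SDK.

```python
def storage_account_resource_id(
    subscription_id: str, resource_group: str, account_name: str
) -> str:
    """Build the resource ID of a storage account for an outbound rule."""
    fields = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "account_name": account_name,
    }
    for label, value in fields.items():
        # Catch empty values and unreplaced "<...>" placeholders early.
        if not value or value.startswith("<"):
            raise ValueError(f"{label} is empty or still a placeholder: {value!r}")
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account_name}"
    )
```

The returned string can be passed as `service_resource_id` when constructing `PrivateEndpointDestination`.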

In [None]:
# Creating a workspace with unique name
from azure.ai.ml.entities import Workspace, ManagedNetwork, PrivateEndpointDestination
from azure.ai.ml.constants._workspace import IsolationMode

# Enter workspace name and the region where the
# workspace will be created
ws_name = "<AML_WORKSPACE_NAME>"
region = "<AZURE_REGION_NAME>"
# Enter Azure Blob storage account name for the outbound rule
blob_storage_account = "<STORAGE_ACCOUNT_NAME>"

ws_mvnet = Workspace(
    name=ws_name,
    location=region,
    hbi_workspace=False,
    public_network_access="Disabled",  # Comment this out to enable Public Network Access
    tags=dict(purpose="demo"),
)

ws_mvnet.managed_network = ManagedNetwork(
    isolation_mode=IsolationMode.ALLOW_INTERNET_OUTBOUND
)

rule_name = "<OUTBOUND_RULE_NAME>"
service_resource_id = f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}/providers/Microsoft.Storage/storageAccounts/{blob_storage_account}"
subresource_target = "blob"
spark_enabled = True

ws_mvnet.managed_network.outbound_rules = [
    PrivateEndpointDestination(
        name=rule_name,
        service_resource_id=service_resource_id,
        subresource_target=subresource_target,
        spark_enabled=spark_enabled,
    )
]

print(f"Initiating creation request for workspace with name: {ws_name}")
ws_mvnet = ml_client.workspaces.begin_create(ws_mvnet).result()
print("workspace created!")

### Provision the Managed VNet
Provision the managed VNet for the workspace that we created in the previous step. The method `begin_provision_network`:
- Provisions a managed VNet for the workspace.
- Creates Private Endpoints defined by outbound rules.
- Creates system-defined Private Endpoints.
- Enables Spark support when `include_spark=True` is passed.
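
Provisioning is a long-running operation, so calling `.result()` directly blocks without any feedback. The sketch below is one way to log progress while waiting. It assumes the object returned by `begin_provision_network` exposes `done()`, `status()`, `wait(timeout)` and `result()`, as `azure.core.polling.LROPoller` does. The helper name `wait_for_poller` is hypothetical.

```python
import time


def wait_for_poller(poller, poll_seconds: float = 30.0, log=print):
    """Wait on a long-running operation, logging its status between polls."""
    started = time.monotonic()
    while not poller.done():
        elapsed = time.monotonic() - started
        log(f"status={poller.status()} elapsed={elapsed:.0f}s")
        # Block for at most poll_seconds, then re-check completion.
        poller.wait(poll_seconds)
    return poller.result()
```

Under those assumptions, `wait_for_poller(ml_client.workspaces.begin_provision_network(workspace_name=ws_name, include_spark=True))` would replace the bare `.result()` call.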


In [None]:
# Provisioning managed VNet with Spark support
include_spark = True
provision_network_result = ml_client.workspaces.begin_provision_network(
    workspace_name=ws_name, include_spark=include_spark
).result()

### Get workspace details
Retrieve the workspace and print its details.

In [None]:
print(f"Getting workspace with name: {ws_name}")
ws_mvnet = ml_client.workspaces.get(ws_name)
print(ws_mvnet)

### List outbound rules for the workspace
List the outbound rules for the workspace. The list shows the Private Endpoints created for the user-defined outbound rules as well as the system-defined Private Endpoints.
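
The listing cell prints the rules as a single Python list of dicts, which can be hard to scan. A minimal sketch of a more readable rendering follows. It assumes each rule has already been converted to a plain dict (for example via `_to_dict()`), and the helper name `format_rules` is hypothetical.

```python
import json


def format_rules(rule_dicts):
    """Return outbound rules as indented JSON, sorted by name when present."""
    ordered = sorted(rule_dicts, key=lambda rule: str(rule.get("name", "")))
    # default=str keeps the dump from failing on non-JSON values such as enums.
    return json.dumps(ordered, indent=2, default=str)
```

For example, `print(format_rules([r._to_dict() for r in rule_list]))` would print one indented entry per rule.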

In [None]:
# List outbound rules for a workspace
rule_list = ml_client._workspace_outbound_rules.list(ws_name)
print([r._to_dict() for r in rule_list])

### Display details of an outbound rule
Get a single outbound rule by name and print its details.

In [None]:
# Get details of an outbound rule by name
rule = ml_client._workspace_outbound_rules.get(ws_name, rule_name)
print(rule)
print(rule._to_dict())

### Submit a Spark job
In the following sections:
- A directory named `src` is created to keep Python scripts used in a standalone Spark job.
- A Python script to wrangle Titanic data is written to the `src` directory.
- A standalone job is submitted using the data stored on the Azure Blob storage account and serverless Spark compute with managed VNet.
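
Before submitting a job, it can help to check the wrangling logic locally on a few rows. The sketch below reproduces, in plain Python, the three steps named in the script's comments: fill missing `Age` with the mean, fill missing `Cabin` with `"None"`, then drop rows that still have gaps. The sample rows and the function name `wrangle` are made up for illustration, and this is not the Spark code itself.

```python
from statistics import mean


def wrangle(rows):
    """Mean-fill Age, fill missing Cabin with "None", drop rows with other gaps."""
    ages = [row["Age"] for row in rows if row["Age"] is not None]
    mean_age = mean(ages) if ages else None
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the caller's data is not modified
        if row["Age"] is None:
            row["Age"] = mean_age
        if row["Cabin"] is None:
            row["Cabin"] = "None"
        # Keep only rows with no remaining missing values.
        if all(value is not None for value in row.values()):
            cleaned.append(row)
    return cleaned


sample = [
    {"PassengerId": 1, "Age": 22.0, "Cabin": None, "Embarked": "S"},
    {"PassengerId": 2, "Age": None, "Cabin": "C85", "Embarked": "C"},
    {"PassengerId": 3, "Age": 26.0, "Cabin": None, "Embarked": None},
]
result = wrangle(sample)
print([row["PassengerId"] for row in result])  # [1, 2]
```

Passenger 3 is dropped because `Embarked` is still missing after the two fill steps.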

#### Ensure code path exists
Create a directory named `src` in the current directory, if it does not exist.

In [None]:
import os

os.makedirs("src", exist_ok=True)

#### Write a script file
This script file uses an Azure Blob storage access key stored in an Azure Key Vault for credential-based data access. Identity-based data access is not supported for Spark jobs using data in Azure Blob storage accounts.

If you want to use a SAS token instead of `access_key`, you can use the following to get a `sas_token` and set it in the configuration:
```python
sas_token = token_library.getSecret("<KEY_VAULT_NAME>", "<SAS_TOKEN_SECRET_NAME>")
sc._jsc.hadoopConfiguration().set(
    "fs.azure.sas.<BLOB_CONTAINER_NAME>.<STORAGE_ACCOUNT_NAME>.blob.core.windows.net",
    sas_token,
)
```
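
The two Hadoop configuration keys differ only in their prefix and in whether the container name is included. A small sketch that builds both key names from the patterns used in this notebook (the helper names are hypothetical):

```python
def account_key_conf(storage_account: str) -> str:
    """Configuration key under which a storage account access key is set."""
    return f"fs.azure.account.key.{storage_account}.blob.core.windows.net"


def sas_conf(container: str, storage_account: str) -> str:
    """Configuration key under which a container-level SAS token is set."""
    return f"fs.azure.sas.{container}.{storage_account}.blob.core.windows.net"
```

Either result can be passed as the first argument of `sc._jsc.hadoopConfiguration().set(...)` in the script.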

In [None]:
%%writefile src/titanic.py
import argparse
from operator import add
import pyspark.pandas as pd
from pyspark.ml.feature import Imputer
from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--titanic_data")
parser.add_argument("--wrangled_data")

args = parser.parse_args()
print("Input path: " + args.titanic_data)
print("Output path: " + args.wrangled_data)

sc = SparkSession.builder.getOrCreate()
token_library = sc._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
access_key = token_library.getSecret("<KEY_VAULT_NAME>", "<ACCESS_KEY_SECRET_NAME>")
sc._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.<STORAGE_ACCOUNT_NAME>.blob.core.windows.net", access_key
)

df = pd.read_csv(args.titanic_data, index_col="PassengerId")
# Replace missing values in the Age column with the column mean
df.fillna(value={"Age": df["Age"].mean()}, inplace=True)
df.fillna(
    value={"Cabin": "None"}, inplace=True
)  # Fill Cabin column with value "None" if missing
df.dropna(inplace=True)  # Drop the rows which still have any missing value
df.to_csv(args.wrangled_data, index_col="PassengerId")


#### Get MLClient for the workspace with managed VNet
Get an instance of `MLClient` scoped to the workspace that has the managed VNet provisioned.

In [None]:
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace_name=ws_name
)

#### Submit the job
Submit a standalone Spark job to wrangle Titanic data stored in an Azure Blob storage account. To learn more about submitting a standalone Spark job, [see this documentation page](https://learn.microsoft.com/azure/machine-learning/how-to-submit-spark-jobs?view=azureml-api-2&tabs=sdk#submit-a-standalone-spark-job).
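
The job inputs and outputs in this notebook are addressed with `wasbs://` URIs for Azure Blob storage and `abfss://` URIs for ADLS Gen2. The sketch below builds both forms from their parts, following the patterns used in the job definitions. The helper names are hypothetical.

```python
def blob_uri(container: str, account: str, path: str) -> str:
    """wasbs:// URI for a path in an Azure Blob storage container."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def adls_gen2_uri(container: str, account: str, path: str) -> str:
    """abfss:// URI for a path in an ADLS Gen2 container."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"
```

For instance, `blob_uri(container_name, blob_storage_account, "data/titanic.csv")` yields the same input path as the f-string in the job definition.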

In [None]:
from azure.ai.ml import spark, Input, Output

# Enter the Azure Blob storage account name and container name.
# The file `titanic.csv` should be placed inside folder `data`
# created in the Azure Blob storage container.
blob_storage_account = "<STORAGE_ACCOUNT_NAME>"
container_name = "<BLOB_CONTAINER_NAME>"

spark_job = spark(
    display_name="Job from serverless Spark with VNet using Azure Blob storage",
    code="./src",
    entry={"file": "titanic.py"},
    driver_cores=1,
    driver_memory="2g",
    executor_cores=2,
    executor_memory="2g",
    executor_instances=2,
    resources={
        "instance_type": "Standard_E8S_V3",
        "runtime_version": "3.2.0",
    },
    inputs={
        "titanic_data": Input(
            type="uri_file",
            path=f"wasbs://{container_name}@{blob_storage_account}.blob.core.windows.net/data/titanic.csv",
            mode="direct",
        ),
    },
    outputs={
        "wrangled_data": Output(
            type="uri_folder",
            path=f"wasbs://{container_name}@{blob_storage_account}.blob.core.windows.net/data/wrangled",
            mode="direct",
        ),
    },
    args="--titanic_data ${{inputs.titanic_data}} --wrangled_data ${{outputs.wrangled_data}}",
)

returned_spark_job = ml_client.jobs.create_or_update(spark_job)
print(returned_spark_job.id)

### Add another outbound rule
Add another outbound rule to create a Private Endpoint to an Azure Data Lake Storage (ADLS) Gen2 account. A separate call to provision the Private Endpoint is _not_ required, because updating the workspace creates it.

If the Azure Data Lake Storage (ADLS) Gen2 account needs to have Public Network Access _disabled_, then access should be disabled before adding the outbound rule and updating the workspace to create the Private Endpoint. The [private endpoints for the workspace and the storage accounts should be in the same VNet](https://learn.microsoft.com/azure/machine-learning/how-to-secure-workspace-vnet#limitations).

Note that `subresource_target` value `dfs` is used here for the Azure Data Lake Storage (ADLS) Gen2 account.
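
The only difference between the two outbound rules in this notebook is the `subresource_target`: `blob` for Azure Blob storage and `dfs` for ADLS Gen2. A minimal sketch that captures this mapping and returns the keyword arguments used for `PrivateEndpointDestination` follows. The `storage_kind` labels and the helper name are assumptions made for illustration.

```python
# Hypothetical labels mapped to the subresource targets used in this notebook.
SUBRESOURCE_BY_STORAGE_KIND = {"blob": "blob", "adls_gen2": "dfs"}


def private_endpoint_rule_kwargs(
    name: str, service_resource_id: str, storage_kind: str, spark_enabled: bool = True
) -> dict:
    """Build keyword arguments for a private endpoint outbound rule."""
    if storage_kind not in SUBRESOURCE_BY_STORAGE_KIND:
        known = ", ".join(sorted(SUBRESOURCE_BY_STORAGE_KIND))
        raise ValueError(f"Unknown storage kind {storage_kind!r}; expected one of: {known}")
    return {
        "name": name,
        "service_resource_id": service_resource_id,
        "subresource_target": SUBRESOURCE_BY_STORAGE_KIND[storage_kind],
        "spark_enabled": spark_enabled,
    }
```

The result can be unpacked into the constructor, as in `PrivateEndpointDestination(**private_endpoint_rule_kwargs(rule_name, service_resource_id, "adls_gen2"))`.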

In [None]:
from azure.ai.ml.entities import PrivateEndpointDestination

# This will add a new outbound rule to existing rules
rule_name = "<OUTBOUND_RULE_NAME_GEN2>"  # This name should be unique
adls_storage_account = "<GEN2_STORAGE_ACCOUNT_NAME>"
service_resource_id = f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}/providers/Microsoft.Storage/storageAccounts/{adls_storage_account}"
subresource_target = "dfs"
spark_enabled = True

ws = ml_client.workspaces.get()

ws.managed_network.outbound_rules = [
    PrivateEndpointDestination(
        name=rule_name,
        service_resource_id=service_resource_id,
        subresource_target=subresource_target,
        spark_enabled=spark_enabled,
    )
]

ml_client.workspaces.begin_update(ws).result()

### List outbound rules for the workspace
List the outbound rules for the workspace again. The list shows the Private Endpoints created for the user-defined outbound rules, now including the ADLS Gen2 rule, as well as the system-defined Private Endpoints.

In [None]:
# List outbound rules for a workspace
rule_list = ml_client._workspace_outbound_rules.list(ws_name)
print([r._to_dict() for r in rule_list])

### Write Python script for a job with ADLS Gen2
This script file assumes the data is stored in an Azure Data Lake Storage (ADLS) Gen2 account and is accessed using identity-based data access. The user identity should have the **Contributor** and **Storage Blob Data Contributor** roles assigned to ensure data access.

In [None]:
%%writefile src/titanic-adlsg2.py
import argparse
from operator import add
import pyspark.pandas as pd
from pyspark.ml.feature import Imputer
from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--titanic_data")
parser.add_argument("--wrangled_data")

args = parser.parse_args()
print("Input path: " + args.titanic_data)
print("Output path: " + args.wrangled_data)

df = pd.read_csv(args.titanic_data, index_col="PassengerId")
# Replace missing values in the Age column with the column mean
df.fillna(value={"Age": df["Age"].mean()}, inplace=True)
df.fillna(
    value={"Cabin": "None"}, inplace=True
)  # Fill Cabin column with value "None" if missing
df.dropna(inplace=True)  # Drop the rows which still have any missing value
df.to_csv(args.wrangled_data, index_col="PassengerId")

### Submit the job
Submit a standalone Spark job to wrangle Titanic data stored in an Azure Data Lake Storage (ADLS) Gen2 account. To learn more about submitting a standalone Spark job, [see this documentation page](https://learn.microsoft.com/azure/machine-learning/how-to-submit-spark-jobs?view=azureml-api-2&tabs=sdk#submit-a-standalone-spark-job).
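
The `args` string of the job uses `${{inputs.titanic_data}}` and `${{outputs.wrangled_data}}` placeholders, which the script then reads through `argparse`. The sketch below shows how the script would see those arguments, assuming the placeholders have already been replaced with the URIs from `inputs` and `outputs`. The container and account names are made up, and `parse_job_args` is a hypothetical helper.

```python
import argparse
import shlex


def parse_job_args(rendered_args: str) -> argparse.Namespace:
    """Parse a rendered job argument string the way the entry script does."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--titanic_data")
    parser.add_argument("--wrangled_data")
    # shlex.split mimics how a shell would split the string into tokens.
    return parser.parse_args(shlex.split(rendered_args))


args = parse_job_args(
    "--titanic_data abfss://demo@demoaccount.dfs.core.windows.net/data/titanic.csv "
    "--wrangled_data abfss://demo@demoaccount.dfs.core.windows.net/data/wrangled"
)
print("Input path: " + args.titanic_data)
print("Output path: " + args.wrangled_data)
```

The keys in `inputs` and `outputs` must match the names used in the placeholders, and the flag names must match those registered with the parser.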

In [None]:
from azure.ai.ml import spark, Input, Output
from azure.ai.ml.entities import UserIdentityConfiguration

# Enter the Azure Data Lake Storage (ADLS) Gen2 account name and container name.
# The file `titanic.csv` should be placed inside folder `data`
# created in the Azure Data Lake Storage (ADLS) Gen2 container.
adls_storage_account = "<GEN2_STORAGE_ACCOUNT_NAME>"
container_name = "<ADLS_CONTAINER_NAME>"

spark_job = spark(
    display_name="Job from serverless Spark with VNet using data in ADLS Gen2",
    code="./src",
    entry={"file": "titanic-adlsg2.py"},
    driver_cores=1,
    driver_memory="2g",
    executor_cores=2,
    executor_memory="2g",
    executor_instances=2,
    resources={
        "instance_type": "Standard_E8S_V3",
        "runtime_version": "3.2.0",
    },
    inputs={
        "titanic_data": Input(
            type="uri_file",
            path=f"abfss://{container_name}@{adls_storage_account}.dfs.core.windows.net/data/titanic.csv",
            mode="direct",
        ),
    },
    outputs={
        "wrangled_data": Output(
            type="uri_folder",
            path=f"abfss://{container_name}@{adls_storage_account}.dfs.core.windows.net/data/wrangled",
            mode="direct",
        ),
    },
    args="--titanic_data ${{inputs.titanic_data}} --wrangled_data ${{outputs.wrangled_data}}",
    identity=UserIdentityConfiguration(),
)

returned_spark_job = ml_client.jobs.create_or_update(spark_job)
print(returned_spark_job.id)

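### Delete an outbound rule
Remove an outbound rule by name, then list the remaining rules. At this point `rule_name` refers to the ADLS Gen2 rule added earlier.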

In [None]:
# Delete an outbound rule from a workspace
ml_client._workspace_outbound_rules.begin_remove(ws_name, rule_name).result()

# List outbound rules for the workspace
rule_list = ml_client._workspace_outbound_rules.list(ws_name)
print([r._to_dict() for r in rule_list])

### Clean-up
Deleting the workspace also deletes the corresponding managed VNet and Private Endpoints.

In [None]:
ml_client.workspaces.begin_delete(
    name=ws_name, permanently_delete=True, delete_dependent_resources=True
)