# Running a Spark Job on AzureML


## Create a client connection to the AzureML workspace

In [1]:
from azure.ai.ml import MLClient, spark, Input, Output
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import UserIdentityConfiguration

azureml_client = MLClient.from_config(
    credential=DefaultAzureCredential()
)

Found the config file in: /config.json
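
`MLClient.from_config` locates the workspace through a `config.json` file, as the output above shows. As a rough sketch of what such a file contains, the snippet below writes one with the three keys the SDK expects. The values are placeholders, and both the helper name `write_workspace_config` and the file name `config.example.json` are illustrative choices made here, not part of the original notebook.

```python
import json
from pathlib import Path

# Placeholder values: replace them with the identifiers of your own workspace.
workspace_config = {
    "subscription_id": "<SUBSCRIPTION-ID>",
    "resource_group": "<RESOURCE-GROUP>",
    "workspace_name": "<WORKSPACE-NAME>",
}


def write_workspace_config(path="config.example.json"):
    """Write the workspace settings as JSON and return the file path."""
    target = Path(path)
    target.write_text(json.dumps(workspace_config, indent=2))
    return target
```

The file can also be downloaded from the workspace overview page in the Azure portal, which avoids typing the identifiers by hand.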


## Define the Job

The following cell defines the job as an instance of the [Spark class](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.entities.spark?view=azure-python), which holds the information required to run it:

- The cluster size
- The script to run
- The parameters for the script

The example below uses the parametrized `pyspark-script-job.py` script, which takes the following parameters:

- `platform`: either `azureml` or `sagemaker`. The script is designed to run on both platforms.
- `input_object_store_base_url`: the base URL of the object store to read from. Use the `s3://<BUCKETNAME>/` form for SageMaker, or `wasbs://<CONTAINER-NAME>@<STORAGE-ACCOUNT>.blob.core.windows.net/` or `azureml://datastores/workspaceblobstore/paths/` for AzureML. **Don't forget the trailing slash `/`.**
- `input_path`: the path to read from, relative to the input base URL.
- `output_object_store_base_url`: the base URL of the object store to write to. Use the `s3://<BUCKETNAME>/` form for SageMaker, or `wasbs://<CONTAINER-NAME>@<STORAGE-ACCOUNT>.blob.core.windows.net/` or `azureml://datastores/workspaceblobstore/paths/` for AzureML. **Don't forget the trailing slash `/`.**
- `output_path`: the path to write to, relative to the output base URL.
- `subreddits`: a comma-separated string of subreddit names to keep.

The job receives the object store location of the raw data (in the example below, a glob over the `year=*/month=*/` partitions), keeps only the rows that belong to the listed subreddits, and writes the filtered data out. A single run is meant to process either submissions or comments, not both.

For more information about the parameters used in the job definition, [read the documentation](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-submit-spark-jobs?view=azureml-api-2&tabs=sdk#submit-a-standalone-spark-job).
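
The trailing slash matters because the script builds each full location by concatenating the base URL and the path directly. A minimal sketch of a more forgiving join is shown below; `join_object_store_path` is a hypothetical helper that is not part of the original script, and `my-bucket` is a placeholder name.

```python
def join_object_store_path(base_url: str, path: str) -> str:
    """Join a base URL and a relative path with exactly one slash between them."""
    return f"{base_url.rstrip('/')}/{path.lstrip('/')}"


# Plain concatenation silently produces a wrong location when the slash is missing.
naive = "s3://my-bucket" + "filtered-submissions"
print(naive)  # s3://my-bucketfiltered-submissions

# The helper gives the same result with or without the trailing slash.
print(join_object_store_path("s3://my-bucket", "filtered-submissions"))
print(join_object_store_path("s3://my-bucket/", "filtered-submissions"))
# both print: s3://my-bucket/filtered-submissions
```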



In [2]:
spark_job = spark(
    name="reddit-filter-submissions",
    code="./",
    entry={"file": "pyspark-script-job.py"},
    driver_cores=1,
    driver_memory="5g",
    executor_cores=4,
    executor_memory="10g",
    executor_instances=6,
    resources={
        "instance_type": "Standard_E4S_V3",
        "runtime_version": "3.2.0",
    },
    inputs={
        "platform": "azureml",
        "input_object_store_base_url": "wasbs://bigdata@marckvnonprodblob.blob.core.windows.net/",
        "input_path": "reddit-parquet/submissions/year=*/month=*/",
        "output_object_store_base_url": "azureml://datastores/workspaceblobstore/paths/",
        "output_path": "filtered-submissions",
        "subreddits": "movies,television,anime,MovieSuggestions,televisionsuggestions,Animesuggest"
    },
    identity=UserIdentityConfiguration(),
    args="--platform ${{inputs.platform}} --input_object_store_base_url ${{inputs.input_object_store_base_url}} --input_path ${{inputs.input_path}} --output_object_store_base_url ${{inputs.output_object_store_base_url}} --output_path ${{inputs.output_path}} --subreddits ${{inputs.subreddits}}"
)
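
The `args` string in the job definition repeats every input name twice, which makes it easy to mistype. As a sketch, it could be generated from the keys of the `inputs` dictionary instead; `build_args` is a hypothetical helper introduced here for illustration, not something the SDK provides.

```python
def build_args(input_names):
    """Build the args string, one '--name ${{inputs.name}}' pair per input."""
    return " ".join(
        "--" + name + " ${{inputs." + name + "}}" for name in input_names
    )


# Iterating over a dict yields its keys in insertion order.
example_inputs = {"platform": "azureml", "output_path": "filtered-submissions"}
print(build_args(example_inputs))
# --platform ${{inputs.platform}} --output_path ${{inputs.output_path}}
```

With this helper, the job definition could pass `args=build_args(inputs)` after storing the inputs dictionary in a variable named `inputs`.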

## The `pyspark-script-job.py` script file

In [None]:
%%writefile pyspark-script-job.py

import argparse
import logging

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

logging.basicConfig(level=logging.INFO)

# Parse Inputs
parser = argparse.ArgumentParser()
parser.add_argument("--platform")
parser.add_argument("--input_object_store_base_url")
parser.add_argument("--input_path")
parser.add_argument("--output_object_store_base_url")
parser.add_argument("--output_path")
parser.add_argument("--subreddits")
args = parser.parse_args()

logging.info(args.platform)
logging.info(args.input_object_store_base_url)
logging.info(args.input_path)
logging.info(args.output_object_store_base_url)
logging.info(args.output_path)
logging.info(args.subreddits)

input_complete_path = f"{args.input_object_store_base_url}{args.input_path}"
output_complete_path = f"{args.output_object_store_base_url}{args.output_path}"

logging.info(input_complete_path)
logging.info(output_complete_path)

spark = SparkSession.builder.appName("PySparkApp").getOrCreate()
logging.info(f"spark version = {spark.version}")

if args.platform == "azureml":
    # Add the Blob SAS credentials
    blob_account_name = "marckvnonprodblob"
    blob_container_name = "bigdata"
    blob_sas_token = "sv=2021-10-04&st=2023-10-04T20%3A02%3A01Z&se=2023-12-31T21%3A02%3A00Z&sr=c&sp=racwdxltf&sig=l%2BbUjYGp1p2IDeyanWtXpDjssBCdW%2B4CJlO4SfPnCEk%3D"
    spark.conf.set(
        f"fs.azure.sas.{blob_container_name}.{blob_account_name}.blob.core.windows.net",
        blob_sas_token,
    )


# Read data from object store
logging.info(f"going to read {input_complete_path}")
df_in = spark.read.parquet(input_complete_path)
df_in_ct = df_in.count()
logging.info("finished reading files...")

# filter the dataframe to only keep the subreddits of interest
subreddits = [s.strip() for s in args.subreddits.split(",")]
filtered = df_in.where(col("subreddit").isin(subreddits))
filtered_ct = filtered.count()

# save the filtered dataframe so that these files can be used for future analysis
logging.info(f"going to write {output_complete_path}")
filtered.write.mode("overwrite").parquet(output_complete_path, compression="zstd")

# log the record counts only after the write has actually finished
logging.info(f"Read in {df_in_ct} records, wrote out {filtered_ct} records.")

spark.stop()
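
The parser in the script treats every argument as optional, so a forgotten flag only shows up later as `None` inside a path. The sketch below shows a stricter variant, assuming all six arguments are mandatory; `build_parser` and `parse_subreddits` are hypothetical names that do not appear in the original script.

```python
import argparse


def build_parser():
    """Return a parser that rejects missing arguments and unknown platforms."""
    parser = argparse.ArgumentParser(description="Filter Reddit data by subreddit")
    parser.add_argument("--platform", required=True, choices=["azureml", "sagemaker"])
    for name in (
        "input_object_store_base_url",
        "input_path",
        "output_object_store_base_url",
        "output_path",
        "subreddits",
    ):
        parser.add_argument(f"--{name}", required=True)
    return parser


def parse_subreddits(raw):
    """Split a comma-separated string, dropping whitespace and empty entries."""
    return [s.strip() for s in raw.split(",") if s.strip()]


print(parse_subreddits("movies, television,anime,"))
# ['movies', 'television', 'anime']
```

With `required=True`, argparse stops with a usage message before Spark starts, which is cheaper than failing after the cluster has been provisioned.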


## Submit the job

In [3]:
job_object = azureml_client.jobs.create_or_update(spark_job)

Uploading code (0.52 MBs): 100%

## Get the Job Studio URL

In [4]:
studio_url = job_object.studio_url
print(studio_url)

https://ml.azure.com/runs/reddit-filter-submissions?wsid=/subscriptions/eccec620-0403-4f08-9535-0f93c5a0e1dc/resourcegroups/project-rg/workspaces/group-34-aml&tid=fd571193-38cb-437b-bb55-60f28d67b643
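
Besides opening the Studio URL, the job can be followed from the notebook itself. The sketch below wraps the submission and monitoring steps in one function, assuming the `jobs.stream` and `jobs.get` operations of the `azure-ai-ml` SDK; `submit_and_stream` is a hypothetical helper name, and the client is passed in as a parameter so that the function does not depend on notebook state.

```python
def submit_and_stream(client, job):
    """Submit a job, print its Studio URL, and block until log streaming ends."""
    submitted = client.jobs.create_or_update(job)
    print(f"Studio URL: {submitted.studio_url}")

    # Streams the job logs to stdout and returns once the job has finished.
    client.jobs.stream(submitted.name)

    # Fetch the job again so that the returned object carries the final status.
    return client.jobs.get(submitted.name)
```

In this notebook it would be called as `final_job = submit_and_stream(azureml_client, spark_job)`, after which `final_job.status` holds the final state of the run.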
