# Orange juice sales prediction example \[Parallel job\] \[SDK example\]
## Key notes for this example
- How to use a **parallel job** for the **many model training** scenario.
- How to use the parallel job **run_function** task with a predefined **entry_script**.
- How to preprocess data into an **mltable with partition settings**.
- How to use an **mltable** with **tabular data** as the **input of a parallel job**.
- How to use **partition_keys** in a parallel job to consume partitioned data.
- How to use **error_threshold** with **empty returns** to skip the failed-item check on mini-batches.
- How to use other parallel job settings:
  - mini_batch_error_threshold
  - environment_variables

For the same example using the CLI and YAML, see: [link](../../../../../cli/jobs/parallel/1a_oj_sales_prediction/README.md)
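
The two thresholds listed above count different things: `error_threshold` looks at items (the gap between a mini-batch's input count and what `run()` returns), while `mini_batch_error_threshold` looks at whole mini-batches that failed. The sketch below restates that behaviour, as described in the comments of the job declaration in section 2.2, in plain Python. It is an illustration only: `count_failed_items` and `exceeds_threshold` are hypothetical names, not part of the Azure ML SDK, and the real runtime may apply further rules.

```python
def count_failed_items(input_count: int, returned_count: int) -> int:
    """Items treated as failed: input count minus the number of returned results."""
    return max(input_count - returned_count, 0)


def exceeds_threshold(failed_count: int, threshold: int) -> bool:
    """True when the failure count should fail the job; -1 disables the check."""
    if threshold == -1:
        return False
    return failed_count > threshold


# A training run() that returns an empty list for a 100-row mini-batch looks
# like 100 failed items, which is why this example sets error_threshold=-1.
gap = count_failed_items(input_count=100, returned_count=0)
print(exceeds_threshold(gap, threshold=-1))  # False: check disabled
print(exceeds_threshold(gap, threshold=10))  # True: 100 > 10
print(exceeds_threshold(3, threshold=5))     # False: 3 failed mini-batches <= 5
```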

# 1. Connect to Azure Machine Learning Workspace
## 1.1 Import the required libraries

In [None]:
# import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient, Input, Output, load_component
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.entities import Environment, ResourceConfiguration
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.ai.ml.parallel import parallel_run_function, RunFunction

## 1.2 Configure credential
`DefaultAzureCredential` should handle most Azure SDK authentication scenarios.

If it does not work for you, see the other available credentials: [configure credential example](../../configuration.ipynb), [azure-identity reference doc](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python).

In [None]:
try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

## 1.3 Get a handle to the workspace

We use a config file to connect to the workspace. The Azure ML workspace should be configured with a compute cluster. [See this notebook to configure a workspace](../../configuration.ipynb).

In [None]:
# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential)

# Retrieve an already attached Azure Machine Learning Compute.
cpu_compute_target = "cpu-cluster"
print(ml_client.compute.get(cpu_compute_target))

# 2. Define components and jobs in pipeline

## 2.1 Load existing command component

In [None]:
# Load the existing command component that partitions the single CSV file into an mltable.
partition_data = load_component(source="./src/partition_data/partition_data.yml")
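
The `partition_data` component is loaded from YAML, so its logic is not visible in this notebook. As a rough illustration of what partitioning a single CSV by `Store` and `Brand` means, the sketch below groups rows by those two keys using only the standard library. The sample rows and the `partition_rows` helper are assumptions made for illustration; the real component also writes the partitioned data and an MLTable definition, which this sketch does not attempt.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the shape of the OJ sales data (values are made up).
SAMPLE_CSV = """WeekStarting,Store,Brand,Quantity
1990-06-14,1000,dominicks,12003
1990-06-21,1000,dominicks,10239
1990-06-14,1000,tropicana,9721
1990-06-14,1001,dominicks,11003
"""


def partition_rows(csv_text, keys):
    """Group CSV rows by the values of the given key columns."""
    partitions = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        partitions[tuple(row[k] for k in keys)].append(row)
    return dict(partitions)


partitions = partition_rows(SAMPLE_CSV, ["Store", "Brand"])
for key, rows in partitions.items():
    print(key, len(rows))
```

With this sample, the four rows fall into three `(Store, Brand)` groups. In the many-models scenario, each such group is the data used to train one model.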

## 2.2 Declare parallel job by `parallel_run_function`


In [None]:
# Declare parallel job with run_function task
many_model_training_with_partition_keys = parallel_run_function(
    name="train_many_models_with_partition_keys",
    display_name="Train Many Models With Partition Keys",
    description="parallel job to train many models with partition_keys on mltable input",
    tags={
        "azureml_parallel_example": "oj_many-model_sdk",
    },
    inputs=dict(
        data_source=Input(
            type=AssetTypes.MLTABLE,
            description="Input mltable with predefined partition format.",
            mode=InputOutputModes.DIRECT,  # [Important] To use 'partition_keys', input MLTable is required to use 'direct' mode.
        ),
        drop_cols=Input(
            type="string",
            description="Columns need to be dropped before training. Split by comma.",
        ),
        target_col=Input(
            type="string",
            description="The column name for label of the input data.",
        ),
        date_col=Input(
            type="string",
            description="The column name for datetime. This will be used for generating time-series lagging data.",
        ),
        lagging_orders=Input(
            type="string",
            description="Comma-separated list of integers that specify the lag orders used to generate lagging data for time-series input.",
        ),
    ),
    outputs=dict(
        model_folder=Output(
            type=AssetTypes.URI_FOLDER,
            mode=InputOutputModes.RW_MOUNT,
        ),
    ),
    input_data="${{inputs.data_source}}",  # Define which input data will be split into mini-batches
    partition_keys=[
        "Store",
        "Brand",
    ],  # Use 'partition_keys' as the data division method. This method requires MLTable input with partition setting pre-defined in MLTable artifact.
    instance_count=2,  # Use 2 nodes from compute cluster to run this parallel job.
    max_concurrency_per_instance=1,  # Create 1 worker process on each compute node to execute mini-batches.
    error_threshold=-1,  # Track failed items as the gap between the mini-batch input count and the number of returned results. The 'many model training' scenario does not fit this check, so '-1' disables counting failed items from mini-batch returns.
    mini_batch_error_threshold=5,  # Track failed mini-batches (exception, timeout, or null return). When the failed mini-batch count exceeds this value, the parallel job is marked as 'failed'.
    retry_settings=dict(
        max_retries=2,  # Number of retries when a mini-batch fails by exception, timeout, or null return.
        timeout=60,  # Timeout in seconds for each mini-batch execution.
    ),
    logging_level="DEBUG",
    environment_variables={
        "AZUREML_PARALLEL_EXAMPLE": "1a_sdk",
    },
    task=RunFunction(
        code="./src/parallel_train/",
        entry_script="parallel_train.py",
        environment=Environment(
            image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04",
            conda_file="./src/parallel_train/conda.yml",
        ),
        program_arguments="--drop_cols ${{inputs.drop_cols}} "  # Pass input parameters through to the parallel_train script.
        "--target_col ${{inputs.target_col}} "
        "--date_col ${{inputs.date_col}} "
        "--lagging_orders ${{inputs.lagging_orders}} "
        "--model_folder ${{outputs.model_folder}} ",
    ),
)

# 3. Build pipeline

In [None]:
# Declare the overall input of the job.
input_oj_data = Input(
    path="./oj_sales_data/oj_sales_data.csv",
    type=AssetTypes.URI_FILE,
    mode=InputOutputModes.RO_MOUNT,
)

# Declare pipeline structure.
@pipeline(
    display_name="parallel job for oj many model training",
)
def partition_job_in_pipeline(
    pipeline_input_data,
):
    # Declare 1st data partition command job.
    partition_job = partition_data(
        data_source=pipeline_input_data,
        partition_keys="Store,Brand",
    )

    # Declare 2nd parallel model training job.
    parallel_train = many_model_training_with_partition_keys(
        data_source=partition_job.outputs.tabular_output_data,
        drop_cols="Revenue,Advert,Store,Brand",
        target_col="Quantity",
        date_col="WeekStarting",
        lagging_orders="1,2,3,4,5,6",
    )

    # Run-level properties of the parallel job can be overridden when the job/component is invoked in a pipeline.
    parallel_train.resources.instance_count = 3
    parallel_train.max_concurrency_per_instance = 2
    parallel_train.mini_batch_error_threshold = 10


# Create pipeline instance
my_job = partition_job_in_pipeline(
    pipeline_input_data=input_oj_data,
)

# Set pipeline level compute
my_job.settings.default_compute = "cpu-cluster"
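
The `lagging_orders` value `"1,2,3,4,5,6"` passed above is a comma-separated list of lag steps. The sketch below shows one common way to turn such a list into lag features for a single series, using plain Python lists. How `parallel_train.py` actually builds them is not shown in this notebook, so treat the helper name `add_lags` and the made-up quantities as assumptions.

```python
def add_lags(values, lagging_orders):
    """Return one row per time step that has a full lag history.

    Each row is (lag features ordered as in lagging_orders, target value).
    """
    orders = [int(o) for o in lagging_orders.split(",") if o.strip()]
    max_lag = max(orders)
    rows = []
    for t in range(max_lag, len(values)):
        features = [values[t - lag] for lag in orders]
        rows.append((features, values[t]))
    return rows


weekly_quantity = [10, 12, 9, 14, 11]  # made-up values, oldest first
for features, target in add_lags(weekly_quantity, "1,2"):
    print(features, "->", target)
# [12, 10] -> 9
# [9, 12] -> 14
# [14, 9] -> 11
```

The first `max(orders)` time steps are dropped because they lack a complete lag history, so a partition needs more rows than its largest lag order to yield any training rows.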

In [None]:
print(my_job)

# 4. Submit pipeline job

In [None]:
pipeline_job = ml_client.jobs.create_or_update(
    my_job,
    experiment_name="hello-world-parallel-job",
)
pipeline_job

In [None]:
# wait until the job completes
ml_client.jobs.stream(pipeline_job.name)