# Build a simple ML pipeline with a parallel component

**Requirements** - In order to benefit from this tutorial, you will need:
- A basic understanding of Machine Learning
- An Azure account with an active subscription - [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace with a compute cluster - [Configure workspace](../../configuration.ipynb)
- A Python environment
- The Azure Machine Learning Python SDK v2 installed - [install instructions](../../../README.md) - check the getting started section

**Learning Objectives** - By the end of this tutorial, you should be able to:
- Connect to your AML workspace from the Python SDK
- Create a `Pipeline` with parallel nodes
- Process file/tabular data using a parallel node

**Motivations** - This example explains how to create a parallel node and use it in a pipeline. A parallel node automatically splits one main data input into several mini-batches, creates a parallel task for each mini-batch, distributes the tasks across a compute cluster, and executes them in parallel. It monitors task execution progress, automatically retries a task that fails because of a data, code, or process error, and stores the outputs in a user-configured location.
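
The behavior described above can be illustrated with a small local sketch in plain Python. This is not how Azure ML implements the parallel node: the helper names `split_into_mini_batches`, `run_with_retries`, and `run_parallel` are hypothetical, and a thread pool stands in for the compute cluster. It only shows the split, distribute, retry, and collect steps under those assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_mini_batches(items, mini_batch_size):
    """Split a list of items into consecutive mini-batches."""
    return [
        items[i:i + mini_batch_size]
        for i in range(0, len(items), mini_batch_size)
    ]


def run_with_retries(task, mini_batch, max_retries):
    """Run `task` on one mini-batch, retrying on failure.

    Returns (True, result) on success or (False, None) once all
    attempts (the first try plus `max_retries` retries) have failed.
    """
    for _ in range(max_retries + 1):
        try:
            return True, task(mini_batch)
        except Exception:
            continue
    return False, None


def run_parallel(items, task, mini_batch_size=2, max_workers=2, max_retries=2):
    """Process all mini-batches concurrently and count failed ones."""
    batches = split_into_mini_batches(items, mini_batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the outcomes in the same order as the batches
        outcomes = list(
            pool.map(lambda b: run_with_retries(task, b, max_retries), batches)
        )
    results = [result for ok, result in outcomes if ok]
    failed = sum(1 for ok, _ in outcomes if not ok)
    return results, failed


# Hypothetical task: double every item in the mini-batch
results, failed = run_parallel(
    list(range(7)), lambda batch: [x * 2 for x in batch], mini_batch_size=3
)
```

In this sketch, the `failed` count plays a role loosely similar to the one `mini_batch_error_threshold` plays in the real node: a caller could compare it against a threshold to decide whether the whole job should be treated as failed.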


# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

## 1.1 Import the required libraries

In [None]:
# import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient, Input, Output, load_component
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.entities import ParallelTask, Environment
from azure.ai.ml.constants import AssetTypes, InputOutputModes

# parallel function is currently a private preview feature
from azure.ai.ml.entities._builders.parallel_func import parallel

## 1.2 Configure credential
We use `DefaultAzureCredential` to get access to the workspace.

`DefaultAzureCredential` should be capable of handling most Azure SDK authentication scenarios.

If it does not work for you, see the [configure credential example](../../configuration.ipynb) and the [azure-identity reference doc](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python) for other available credentials.

In [None]:
try:
    credential = DefaultAzureCredential()
    # Check whether the credential can obtain a token successfully.
    credential.get_token('https://management.azure.com/.default')
except Exception:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work.
    credential = InteractiveBrowserCredential()

## 1.3 Get a handle to the workspace

We use a config file to connect to the workspace. The Azure ML workspace should be configured with a compute cluster. [Check this notebook to configure a workspace](../../configuration.ipynb)

In [None]:
# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential)

# Retrieve an already attached Azure Machine Learning Compute.
cpu_compute_target = 'cpu-cluster'
print(ml_client.compute.get(cpu_compute_target))
gpu_compute_target = 'gpu-cluster'
print(ml_client.compute.get(gpu_compute_target))

# 2. Define components

Use `parallel` to create a parallel node. Use `load_component` to load command components defined in YAML.

In [None]:
# load component
prepare_data = load_component(path='./src/prepare_data.yml')

# parallel task to process file data
file_batch_inference = parallel(
  name = 'file_batch_score',
  display_name = 'Batch Score with File Dataset',
  description = 'parallel component for batch score',
  inputs = dict(job_data_path=Input(type=AssetTypes.MLTABLE, description='The data to be split and scored in parallel')),
  outputs = dict(job_output_path=Output(type=AssetTypes.MLTABLE)),
  input_data = '${{inputs.job_data_path}}',
  instance_count = 2,
  mini_batch_size = '1',
  mini_batch_error_threshold = 1,
  max_concurrency_per_instance = 1,
  task = ParallelTask(
    type = 'function',
    code = './src',
    entry_script = 'file_batch_inference.py',
    args = '--job_output_path ${{outputs.job_output_path}}',
    environment = 'azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:1'
  ),
)

# parallel task to process tabular data
tabular_batch_inference = parallel(
  name = 'batch_score_with_tabular_input',
  display_name = 'Batch Score with Tabular Dataset',
  description = 'parallel component for batch score',
  inputs = dict(
    job_data_path=Input(type=AssetTypes.MLTABLE, description='The data to be split and scored in parallel'),
    score_model=Input(type=AssetTypes.URI_FOLDER, description='The model for batch score.')
  ),
  outputs = dict(job_output_path=Output(type=AssetTypes.MLTABLE)),
  input_data = '${{inputs.job_data_path}}',
  max_concurrency_per_instance = 2,
  mini_batch_size = '100',
  mini_batch_error_threshold = 5,
  logging_level = 'DEBUG',
  retry_settings = dict(max_retries=2, timeout=60),
  task = ParallelTask(
    type = 'function',
    code = './src',
    entry_script = 'tabular_batch_inference.py',
    environment = Environment(
      image= 'mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04',
      conda_file='./src/environment_parallel.yml'),
    args = '--model ${{inputs.score_model}}',
    append_row_to = '${{outputs.job_output_path}}',
  ),
)

# 3. Build pipeline

We define a pipeline containing 3 nodes:
- `prepare_file_tabular_data` loads the file and tabular input data and the trained model for batch inference.
- `batch_inference_with_file_data` is a parallel component that processes a large number of files.
- `batch_inference_with_tabular_data` is a parallel component that batch scores the model using tabular input data.

In [None]:
@pipeline(default_compute='cpu-cluster')
def parallel_in_pipeline(pipeline_job_data_path, pipeline_score_model):

    prepare_file_tabular_data = prepare_data(input_data=pipeline_job_data_path)
    # output of file & tabular data should be type MLTable
    prepare_file_tabular_data.outputs.file_output_data.type = AssetTypes.MLTABLE
    prepare_file_tabular_data.outputs.tabular_output_data.type = AssetTypes.MLTABLE

    batch_inference_with_file_data = file_batch_inference(
        job_data_path=prepare_file_tabular_data.outputs.file_output_data)
    # use eval_mount mode to handle file data
    batch_inference_with_file_data.inputs.job_data_path.mode = InputOutputModes.EVAL_MOUNT
    batch_inference_with_file_data.outputs.job_output_path.type = AssetTypes.MLTABLE

    batch_inference_with_tabular_data = tabular_batch_inference(
        job_data_path=prepare_file_tabular_data.outputs.tabular_output_data,
        score_model=pipeline_score_model
    )
    # use direct mode to handle tabular data
    batch_inference_with_tabular_data.inputs.job_data_path.mode = InputOutputModes.DIRECT

    return {
        'pipeline_job_out_file': batch_inference_with_file_data.outputs.job_output_path,
        'pipeline_job_out_tabular': batch_inference_with_tabular_data.outputs.job_output_path,
    }

pipeline_job_data_path=Input(
    path='./dataset/', type=AssetTypes.MLTABLE, mode=InputOutputModes.RO_MOUNT
)
pipeline_score_model=Input(
    path='./model/', type=AssetTypes.URI_FOLDER, mode=InputOutputModes.DOWNLOAD
)
# create a pipeline
pipeline_job = parallel_in_pipeline(pipeline_job_data_path=pipeline_job_data_path, pipeline_score_model=pipeline_score_model)
pipeline_job.outputs.pipeline_job_out_tabular.type = AssetTypes.URI_FILE

In [None]:
print(pipeline_job)

# 4. Submit pipeline job

In [None]:
# parallel in pipeline is a preview feature; enable it through an environment variable
import os

os.environ["AZURE_ML_CLI_PRIVATE_FEATURES_ENABLED"] = "True"

pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name='pipeline_samples'
)
pipeline_job

In [None]:
# wait until the job completes
ml_client.jobs.stream(pipeline_job.name)

# Next Steps
You can find further examples of running a pipeline job [here](../).