# Build a simple ML pipeline with parallel components

**Requirements** - In order to benefit from this tutorial, you will need:
- A basic understanding of Machine Learning
- An Azure account with an active subscription - [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace with a compute cluster - [Configure workspace](../../configuration.ipynb)
- A Python environment
- Installed Azure Machine Learning Python SDK v2 - [install instructions](../../../README.md) - check the getting started section

**Learning Objectives** - By the end of this tutorial, you should be able to:
- Connect to your AML workspace from the Python SDK
- Create a `Pipeline` from components defined in YAML

**Motivations** - This notebook covers the scenario where a user defines parallel components in YAML and then uses those components to build a pipeline.


# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

## 1.1 Import the required libraries

In [None]:
# import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

from azure.ml import MLClient, dsl
from azure.ml.entities import load_component
from azure.ml.entities import Component as ComponentEntity
from azure.ml.constants import AssetTypes, InputOutputModes

## 1.2 Configure credential
We use `DefaultAzureCredential` to get access to the workspace.

`DefaultAzureCredential` should be capable of handling most Azure SDK authentication scenarios.

If it does not work for you, see these references for other available credentials: [configure credential example](../../configuration.ipynb), [azure-identity reference doc](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python).

In [None]:
try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

## 1.3 Get a handle to the workspace

We use a config file to connect to the workspace. The Azure ML workspace should be configured with a compute cluster. [See this notebook to configure a workspace](../../configuration.ipynb).

In [None]:
# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential)

# Retrieve an already attached Azure Machine Learning Compute.
cpu_compute_target = "cpu-cluster"
print(ml_client.compute.get(cpu_compute_target))
gpu_compute_target = "gpu-cluster"
print(ml_client.compute.get(gpu_compute_target))
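
`MLClient.from_config` reads the workspace details from a `config.json` file. The sketch below shows the shape that file is assumed to have (the keys `subscription_id`, `resource_group`, and `workspace_name`) and checks that none of them is missing. The helper names and placeholder values are hypothetical and are not part of the SDK.

```python
import json
import tempfile
from pathlib import Path

# Keys assumed to be expected in a workspace config.json.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write a config.json holding the three workspace identifiers; return its path."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(config, indent=4))
    return path


def missing_keys(path):
    """Return the required keys that are absent or empty in the given config file."""
    config = json.loads(Path(path).read_text())
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Placeholder values only; substitute your own identifiers.
with tempfile.TemporaryDirectory() as tmp:
    config_path = write_workspace_config(
        tmp, "<SUBSCRIPTION_ID>", "<RESOURCE_GROUP>", "<AML_WORKSPACE_NAME>"
    )
    problems = missing_keys(config_path)
    print(problems)
```

Running a check like this before creating the client can make a missing or misspelled key easier to spot than the error raised during connection.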

## 1.4 Prepare Job Input
By defining `Input`, you create a reference to the data source location. The data remains in its existing location, so no extra storage cost is incurred.

In [None]:
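from azure.ml import Input

# Note: no trailing comma after Input(...); a trailing comma would
# turn each assignment into a one-element tuple instead of an Input.
pipeline_job_data_path = Input(
    type=AssetTypes.MLTABLE, path="./dataset/", mode=InputOutputModes.RW_MOUNT
)
pipeline_score_model = Input(
    path="./model/", type=AssetTypes.URI_FOLDER, mode=InputOutputModes.DOWNLOAD
)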

In [None]:
pipeline_job_data_path

# 2. Define parallel component via YAML
Below is a basic example of loading parallel components that are defined in YAML.

In [None]:

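# Component that prepares the data for the downstream nodes
get_data = ComponentEntity.load(path="./src/get_data.yml")

# Two parallel components built from the same YAML definition;
# the second overrides the component name so the two stay distinct
file_batch_inference1 = ComponentEntity.load(path="./src/file_batch_inference.yml")
file_batch_inference2 = ComponentEntity.load(
    path="./src/file_batch_inference.yml",
    params_override=[{"name": "file_batch_score_duplicate"}],
)

# Parallel component that scores tabular input data
tabular_batch_inference = ComponentEntity.load(path="./src/tabular_input_e2e.yml")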
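
The YAML files and the scripts under `./src` are not shown in this notebook. As a rough illustration only, the sketch below follows the usual convention for an Azure ML parallel entry script: an `init()` function that runs once per worker and a `run(mini_batch)` function that is called for each mini-batch and returns one result per input item. The file names, the `_state` dictionary, and the processing logic are hypothetical and stand in for whatever the real scripts do.

```python
import os

# Hypothetical shared state that init() prepares once per worker process.
_state = {}


def init():
    """Run once per worker before any mini-batch; load models or shared state here."""
    _state["suffix"] = ".processed"


def run(mini_batch):
    """Process one mini-batch of file paths and return one result per input item."""
    results = []
    for file_path in mini_batch:
        name = os.path.basename(file_path)
        results.append(f"{name}{_state['suffix']}")
    return results


# Local dry run with made-up paths; no files are actually read.
init()
print(run(["/data/a.csv", "/data/b.csv"]))
```

For file inputs the mini-batch is assumed to be a list of file paths, as here; a tabular component would typically receive a chunk of rows instead.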

# 3. Build pipeline

We define a pipeline containing 4 nodes:
- `get_data` loads the file and tabular input data and the trained model for batch inference.
- `file_batch_inference1` and `file_batch_inference2` are dummy parallel components that process a large number of files.
- `tabular_batch_inference` runs batch scoring with the model on tabular input data.

In [None]:
@dsl.pipeline(default_compute="cpu-cluster")
def parallel_in_pipeline(pipeline_job_data_path, pipeline_score_model):
    get_data_node = get_data(input_data=pipeline_job_data_path)
    get_data_node.outputs.file_output_data.type = AssetTypes.MLTABLE
    get_data_node.outputs.tabular_output_data.type = AssetTypes.MLTABLE

    file_batch_inference_node1 = file_batch_inference1(job_data_path=get_data_node.outputs.file_output_data)
    file_batch_inference_node1.inputs.job_data_path.mode = InputOutputModes.EVAL_MOUNT
    file_batch_inference_node1.outputs.job_output_path.type = AssetTypes.MLTABLE

    file_batch_inference_node2 = file_batch_inference2(job_data_path=file_batch_inference_node1.outputs.job_output_path)
    file_batch_inference_node2.inputs.job_data_path.mode = InputOutputModes.EVAL_MOUNT

    tabular_batch_inference_node = tabular_batch_inference(
        job_data_path=get_data_node.outputs.tabular_output_data,
        score_model=pipeline_score_model
    )
    tabular_batch_inference_node.inputs.job_data_path.mode = InputOutputModes.DIRECT

    return {
        "pipeline_job_out_file": file_batch_inference_node2.outputs.job_output_path,
        "pipeline_job_out_tabular": tabular_batch_inference_node.outputs.job_out_path,
    }

# create a pipeline
pipeline = parallel_in_pipeline(pipeline_job_data_path=pipeline_job_data_path, pipeline_score_model=pipeline_score_model)
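
The way outputs are wired to inputs in the pipeline function above implies an ordering among the four nodes. The sketch below reproduces only those data dependencies with the standard-library `graphlib` module (Python 3.9+) and groups the nodes into stages that could run once their upstream outputs exist. It illustrates the dependency graph and makes no claim about how the service actually schedules the job.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes whose outputs it consumes,
# mirroring the wiring in parallel_in_pipeline.
dependencies = {
    "get_data": set(),
    "file_batch_inference1": {"get_data"},
    "file_batch_inference2": {"file_batch_inference1"},
    "tabular_batch_inference": {"get_data"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

# Collect nodes stage by stage: a node becomes ready once all of its
# upstream nodes have been marked done.
stages = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    stages.append(ready)
    sorter.done(*ready)

for index, stage in enumerate(stages, start=1):
    print(index, stage)
```

In this graph, `file_batch_inference1` and `tabular_batch_inference` both depend only on `get_data`, so neither has to wait for the other.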

In [None]:
print(pipeline)

# 4. Submit pipeline job

In [None]:
pipeline_job = ml_client.jobs.create_or_update(
    pipeline, experiment_name="pipeline_samples"
)
pipeline_job

In [None]:
# wait until the job completes
ml_client.jobs.stream(pipeline_job.name)

# Next Steps
You can find further examples of running a pipeline job [here](../).