Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Build an ML Pipeline

In this notebook, you learn how to create a machine learning training pipeline by using Azure Machine Learning components.

1. Prepare components and create them in the workspace.
2. Use the component and pipeline SDK to create a pipeline from the registered components.

## Prerequisites
* Install the azure-ai-ml SDK by following the [instructions here](../../README.md).
* Initialize credentials and create compute clusters by following the [instructions here](../../configuration.ipynb).

# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

## 1.1 Import the required libraries

In [1]:
# Import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import load_component, Input, Output
from azure.ai.ml.dsl import pipeline
from azure.ai.ml import MLClient
from azure.ai.ml.constants import AssetTypes, InputOutputModes

import os

# enable internal components in v2
os.environ["AZURE_ML_INTERNAL_COMPONENTS_ENABLED"] = "True"

## 1.2 Configure credential

We use `DefaultAzureCredential` to get access to the workspace.
`DefaultAzureCredential` should be capable of handling most Azure SDK authentication scenarios.

If it does not work for you, see these references for other available credentials: [configure credential example](../../configuration.ipynb), [azure-identity reference doc](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python).

In [2]:
try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

DefaultAzureCredential failed to retrieve a token from the included credentials.
Attempted credentials:
	EnvironmentCredential: EnvironmentCredential authentication unavailable. Environment variables are not fully configured.
Visit https://aka.ms/azsdk/python/identity/environmentcredential/troubleshoot to troubleshoot.this issue.
	ManagedIdentityCredential: ManagedIdentityCredential authentication unavailable, no response from the IMDS endpoint.
	AzureDeveloperCliCredential: Azure Developer CLI could not be found. Please visit https://aka.ms/azure-dev for installation instructions and then,once installed, authenticate to your Azure account using 'azd login'.
	SharedTokenCacheCredential: Azure Active Directory error '(invalid_grant) AADSTS700082: The refresh token has expired due to inactivity. The token was issued on 2022-11-21T03:07:07.6802296Z and was inactive for 90.00:00:00.
Trace ID: 0cda5ffb-5767-46ca-bdc1-3580bfd21a00
Correlation ID: 48dce46f-d821-4d36-97b9-808a60a87adf
Timestam

## 1.3 Get a handle to the workspace

We use a config file to connect to the workspace. The Azure ML workspace should be configured with a compute cluster. [See this notebook to configure a workspace](../../configuration.ipynb).
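
For reference, the config file that `MLClient.from_config` reads is a small JSON document named `config.json` with three keys. The sketch below writes one with placeholder values. The values are hypothetical and must be replaced with your own, the helper `write_workspace_config` is illustrative only, and the file is written to a temporary folder so that running the cell has no side effects.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values (assumptions): replace with your own workspace details.
workspace_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}


def write_workspace_config(root: Path, config: dict) -> Path:
    """Write config.json under <root>/.azureml and return its path."""
    config_dir = root / ".azureml"
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "config.json"
    config_path.write_text(json.dumps(config, indent=2))
    return config_path


# Use a temporary folder for the demonstration instead of the project folder.
demo_root = Path(tempfile.mkdtemp())
config_path = write_workspace_config(demo_root, workspace_config)
print(config_path.read_text())
```

In a real project you would place the `.azureml` folder next to (or above) the notebook, which matches the location reported in the cell output below.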

In [4]:
# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential)

# Retrieve an already attached Azure Machine Learning Compute.
cluster_name = "cpu-cluster"
print(ml_client.compute.get(cluster_name))

Found the config file in: D:\programs\azureml-examples\sdk\.azureml\config.json


enable_node_public_ip: true
id: /subscriptions/96aede12-2f73-41cb-b983-6d11a904839b/resourceGroups/hod-eastus2/providers/Microsoft.MachineLearningServices/workspaces/sdk_vnext_cli/computes/cpu-cluster
idle_time_before_scale_down: 120
location: eastus2
max_instances: 4
min_instances: 0
name: cpu-cluster
provisioning_state: Succeeded
size: STANDARD_DS2_V2
ssh_public_access_enabled: true
tier: dedicated
type: amlcompute



In [4]:
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.ai.ml.dsl import pipeline

input_asset = Input(
    type="uri_file",
    path="azureml://datastores/workspaceblobstore/paths/miguTestMLTable/relative_path_in_content.txt",
)

In [5]:
input_asset.datastore

In [6]:
input_asset.path

'azureml://datastores/workspaceblobstore/paths/miguTestMLTable/relative_path_in_content.txt'
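
The path above uses the short form `azureml://datastores/<datastore_name>/paths/<relative_path>`. As a minimal sketch, the standard library is enough to split such a URI into its datastore name and relative path. `parse_datastore_uri` is a hypothetical helper written for this illustration, not part of azure-ai-ml, and it only handles the short form.

```python
import re
from typing import NamedTuple


class DatastoreUri(NamedTuple):
    datastore: str
    relative_path: str


# Short-form datastore URI: azureml://datastores/<name>/paths/<relative path>
_SHORT_FORM = re.compile(
    r"^azureml://datastores/(?P<datastore>[^/]+)/paths/(?P<path>.+)$"
)


def parse_datastore_uri(uri: str) -> DatastoreUri:
    """Split a short-form datastore URI into (datastore, relative_path)."""
    match = _SHORT_FORM.match(uri)
    if match is None:
        raise ValueError(f"Not a short-form datastore URI: {uri!r}")
    return DatastoreUri(match.group("datastore"), match.group("path"))


parsed = parse_datastore_uri(
    "azureml://datastores/workspaceblobstore/paths/miguTestMLTable/relative_path_in_content.txt"
)
print(parsed.datastore)
print(parsed.relative_path)
```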

# 2. Define and create components into workspace
## 2.1 Load components from YAML

Anonymous Component:
* The component SDK lets you load and validate a component as an anonymous component first: `load_component()`.

Created Component:
* Create the components to be used in the pipeline in the workspace using the Azure CLI: `az ml component create`.
* Load components with the component SDK: `ml_client.components.get()`.

In [7]:
output_mltable_func = load_component("./output-mltable/output_mltable.yml")
pass_through_func = load_component("./pass-through-uri-file/pass_through.yml")
read_ml_table_func = load_component("./read-mltable/read_mltable.yml")
read_ml_table_download_func = load_component("./read-mltable-download/read_mltable_download.yml")



In [8]:
ml_client.components.create_or_update(output_mltable_func)
ml_client.components.create_or_update(pass_through_func)
ml_client.components.create_or_update(read_ml_table_func)

Uploading read-mltable (0.0 MBs): 100%|##########| 1487/1487 [00:00<00:00, 1513.21it/s]



CommandComponent({'auto_increment_version': False, 'source': 'REMOTE.WORKSPACE.COMPONENT', 'is_anonymous': False, 'name': 'read_mltable', 'description': 'Read MLTable', 'tags': {'author': 'azureml-sdk-team'}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/b8c23406-f9b5-4ccb-8a65-a8cb5dcd6a5a/resourceGroups/rge2etests/providers/Microsoft.MachineLearningServices/workspaces/wse2etests/components/read_mltable/versions/3', 'Resource__source_path': None, 'base_path': WindowsPath('.'), 'creation_context': <azure.ai.ml._restclient.v2022_10_01.models._models_py3.SystemData object at 0x000002BF45E154F0>, 'serialize': <msrest.serialization.Serializer object at 0x000002BF45E15C10>, 'command': 'python read_mltable.py ${{inputs.input_mltable}}', 'code': '/subscriptions/b8c23406-f9b5-4ccb-8a65-a8cb5dcd6a5a/resourceGroups/rge2etests/providers/Microsoft.MachineLearningServices/workspaces/wse2etests/codes/a63b2c27-1418-45c8-88c9-cb1c8c9bdf5b/versions/1', 'environment_variables': None, 'e

## 2.2 IntelliSense and docstring support

Each loaded component function has a dynamically generated signature, so its inputs are available to IntelliSense and appear in its docstring.
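
To make the idea of a dynamically generated signature concrete, here is a minimal sketch of the general Python mechanism using `inspect.Signature`. It is an illustration only, not the SDK's actual implementation. `make_component_func` and `read_mltable_demo` are hypothetical names, and the input name `input_mltable` is borrowed from the `read_mltable` component used later in this notebook.

```python
import inspect


def make_component_func(name, input_names):
    """Return a callable whose signature lists the given inputs as keyword-only parameters."""

    def component_func(**kwargs):
        # Reject inputs that the (hypothetical) component does not declare.
        unknown = set(kwargs) - set(input_names)
        if unknown:
            raise TypeError(f"{name}() got unexpected inputs: {sorted(unknown)}")
        return {"component": name, "inputs": kwargs}

    component_func.__name__ = name
    # Attach a signature built at runtime; editors and help() read this attribute.
    component_func.__signature__ = inspect.Signature(
        [
            inspect.Parameter(n, inspect.Parameter.KEYWORD_ONLY, default=None)
            for n in input_names
        ]
    )
    return component_func


read_mltable_demo = make_component_func("read_mltable", ["input_mltable"])
print(inspect.signature(read_mltable_demo))
```

Because the signature is attached at runtime, tools that call `inspect.signature` see the declared input names rather than a generic `**kwargs`.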

# 3. Sample pipeline job
## 3.1 Build pipeline
You can build a pipeline with the SDK, or by drag-and-drop in the [Azure Machine Learning designer (preview)](https://docs.microsoft.com/en-us/azure/machine-learning/concept-designer).

With the component SDK, you benefit from:
* Simple syntax that gives an experience consistent with drag-and-drop.
* The ability to create a pipeline with unpublished components for debugging and testing.

In [9]:
# define a pipeline
@pipeline()
def training_pipeline_mltable(input_data):
    output_mltable = output_mltable_func(
        input_containing_relative_path=input_data
    )
    # read_mltable_download = read_ml_table_download_func(input_mltable=output_mltable.outputs.output_mltable)
    read_mltable = read_ml_table_func(input_mltable=output_mltable.outputs.output_mltable)
    # pass_through = pass_through_func(
    #    input_file = output_mltable.outputs.output_mltable
    # )
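
In the pipeline above, passing `output_mltable.outputs.output_mltable` into `read_ml_table_func` is what makes the second step depend on the first. The sketch below illustrates that idea with the standard library's `graphlib` (Python 3.9+). The `dependencies` mapping is a hand-written stand-in for the graph, not something the SDK exposes in this form.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it consumes.
# Step names mirror the variables in the pipeline function above.
dependencies = {
    "output_mltable": set(),
    "read_mltable": {"output_mltable"},
}

# A topological order lists every step after the steps it depends on.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```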

In [10]:
# create a pipeline
pipeline_job = training_pipeline_mltable(
    input_asset
)
pipeline_job.settings.default_compute = cluster_name
pipeline_job.settings.default_datastore = "workspaceblobstore"
# pipeline_job.settings._dataset_access_mode = "DatasetInDpv2"

# Validating the pipeline
ml_client.jobs.validate(pipeline_job)

Method validate: This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


{
  "result": "Succeeded"
}

## 3.2 Run pipeline on remote compute

In [11]:
# Submit the pipeline job to the workspace under the given experiment name.
created_pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="migu_test_mltable"
)


# Show detailed information about the run
created_pipeline_job

| Experiment | Name | Type | Status | Details Page |
| --- | --- | --- | --- | --- |
| migu_test_mltable | honest_guava_k31m6t94yl | pipeline | Preparing | Link to Azure Machine Learning studio |


In [None]:
# Wait until the job completes
ml_client.jobs.stream(created_pipeline_job.name)

RunId: honest_guava_k31m6t94yl
Web View: https://ml.azure.com/runs/honest_guava_k31m6t94yl?wsid=/subscriptions/b8c23406-f9b5-4ccb-8a65-a8cb5dcd6a5a/resourcegroups/rge2etests/workspaces/wse2etests

Streaming logs/azureml/executionlogs.txt

[2023-02-22 02:50:34Z] Completing processing run id f2422fcb-d001-47e3-94ad-02fe9005fc1f.
[2023-02-22 02:50:36Z] Submitting 1 runs, first five are: 82931b88:f6f0ff80-817b-4919-b572-49f84d749efa


## Next steps

In this notebook, you built a basic pipeline using the Azure ML component SDK. Check the following examples for more advanced topics:
* [Create a pipeline with sub-pipeline](create-pipeline-with-subpipeline.ipynb)

In [5]:
import mltable

# help(mltable.load)

# Build an MLTable from a local folder and save its definition to ./hod
files = {"folder": "./asdfsrc"}
tbl = mltable.from_paths(paths=[files])

tbl.save("./hod")

In [7]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

my_path = "./src"

my_data = Data(
    path=my_path,
    type=AssetTypes.URI_FOLDER,
    name="mltable_test_files",
    # version="1",
)

registered_data = ml_client.data.create_or_update(my_data)

In [13]:
registered_data.path

'azureml://subscriptions/96aede12-2f73-41cb-b983-6d11a904839b/resourcegroups/hod-eastus2/workspaces/sdk_vnext_cli/datastores/workspaceblobstore/paths/LocalUpload/cc355250825d6284521e9ae14f3db123/src/'

In [14]:
registered_data.id

'/subscriptions/96aede12-2f73-41cb-b983-6d11a904839b/resourceGroups/hod-eastus2/providers/Microsoft.MachineLearningServices/workspaces/sdk_vnext_cli/data/mltable_test_files/versions/2'

In [19]:
got_data = ml_client.data.get(name="mltable_test_files", version="2")

In [23]:
got_data.base_path

'd:\\programs\\azureml-examples\\sdk\\python\\jobs\\pipelines\\hod_test_mltable\\test_create_and_link'