# Pipeline job using data transfer to copy data

**Requirements** - In order to benefit from this tutorial, you will need:
- A basic understanding of Machine Learning
- An Azure account with an active subscription - [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace with a compute cluster - [Configure workspace](../../configuration.ipynb)
- A Python environment
- The Azure Machine Learning Python SDK v2 installed - [install instructions](../../../README.md) - check the getting started section

**Learning Objectives** - By the end of this tutorial, you should be able to:
- Connect to your AML workspace from the Python SDK
- Create a data transfer component using the `load_component` function.
- Create a `PipelineJob` that uses the data transfer component.

**Motivations** - This notebook covers the scenario in which a user defines a pipeline with the dsl `@pipeline` decorator and uses a data transfer component in it to copy data.

# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

## 1.1 Import the required libraries

In [1]:
# Import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

from azure.ai.ml import MLClient, command, Input, Output
from azure.ai.ml.dsl import pipeline
from azure.ai.ml import load_component

In [2]:
from azure.ai.ml._version import VERSION

print(VERSION)

1.7.2


## 1.2 Configure credential

We use `DefaultAzureCredential` to get access to the workspace. 
`DefaultAzureCredential` should be capable of handling most Azure SDK authentication scenarios. 

If it does not work for you, see these references for other available credentials: [configure credential example](../../configuration.ipynb), [azure-identity reference doc](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python).

In [3]:
try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

DefaultAzureCredential failed to retrieve a token from the included credentials.
Attempted credentials:
	EnvironmentCredential: EnvironmentCredential authentication unavailable. Environment variables are not fully configured.
Visit https://aka.ms/azsdk/python/identity/environmentcredential/troubleshoot to troubleshoot.this issue.
	ManagedIdentityCredential: ManagedIdentityCredential authentication unavailable, no response from the IMDS endpoint.
	SharedTokenCacheCredential: Shared token cache unavailable
	VisualStudioCodeCredential: Azure Active Directory error '(invalid_grant) AADSTS700082: The refresh token has expired due to inactivity. The token was issued on 2022-01-12T00:05:41.8429245Z and was inactive for 90.00:00:00.
Trace ID: f5a721ce-874a-4bcc-b035-842ca5ff0800
Correlation ID: 9d8a09fe-44d9-4fe5-8847-657016773a65
Timestamp: 2023-05-31 09:26:45Z'
Content: {"error":"invalid_grant","error_description":"AADSTS700082: The refresh token has expired due to inactivity. The token was 

## 1.3 Get a handle to the workspace

We use a config file to connect to the workspace. The Azure ML workspace should be configured with a compute cluster. [See this notebook to configure a workspace](../../configuration.ipynb).

In [4]:
# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential)

# Retrieve an already attached Azure Machine Learning Compute.
cluster_name = "cpu-cluster"
print(ml_client.compute.get(cluster_name))

Found the config file in: D:\azureml-examples\config.json


enable_node_public_ip: true
id: /subscriptions/d128f140-94e6-4175-87a7-954b9d27db16/resourceGroups/lochen/providers/Microsoft.MachineLearningServices/workspaces/lochen-eastus/computes/cpu-cluster
idle_time_before_scale_down: 120
location: eastus
max_instances: 1
min_instances: 0
name: cpu-cluster
provisioning_state: Succeeded
size: STANDARD_DS3_V2
ssh_public_access_enabled: false
tier: dedicated
type: amlcompute
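For reference, the config file that `MLClient.from_config` reads is a small JSON document naming the subscription, resource group, and workspace. The sketch below writes one with placeholder values. The file name `config.example.json` is hypothetical and is chosen only so that an existing `config.json` is not overwritten.

```python
import json
from pathlib import Path

# Placeholder values: replace them with your own subscription, resource group, and workspace.
workspace_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Hypothetical file name, used so that a real config.json is left untouched.
config_path = Path("config.example.json")
config_path.write_text(json.dumps(workspace_config, indent=2))

print(config_path.read_text())
```

Once the placeholders are filled in, saving the file as `config.json` in the notebook directory or one of its parent directories should let `MLClient.from_config` find it, as in the output above.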



# 2. Define and create components in the workspace
## 2.1 Load data transfer components from YAML

In [5]:
components_dir = "."
data_merge = load_component(source=f"{components_dir}/data_merge.yml")

Class DataTransferCopyComponent: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
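The contents of `data_merge.yml` are not shown in this notebook. As a rough sketch, a data transfer copy component with the input and output names used later in the pipeline (`folder1`, `folder2`, `output_folder`) might look like the YAML below. The `name`, `type: data_transfer`, and `task: copy_data` fields are assumptions inferred from the `DataTransferCopyComponent` class named in the warning above, and the real file may differ. The sketch is written to a hypothetical `data_merge_sketch.yml` so that the real component file is not touched.

```python
from pathlib import Path

# Assumed contents: the actual data_merge.yml shipped with this notebook may differ.
DATA_MERGE_YAML = """\
name: data_merge_sketch
type: data_transfer
task: copy_data
inputs:
  folder1:
    type: uri_folder
  folder2:
    type: uri_folder
outputs:
  output_folder:
    type: uri_folder
"""

# Hypothetical file name, used so that the real data_merge.yml is not overwritten.
sketch_path = Path("data_merge_sketch.yml")
sketch_path.write_text(DATA_MERGE_YAML)

print(sketch_path.read_text())
```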


## 2.2 (Optional) Register the data transfer component in the workspace

In [6]:
registered_data_pipeline_component = ml_client.components.create_or_update(
    data_merge
)

## 2.3 Define command component via command function
Use the `command` function to create a `Command` object that can be used in a `@pipeline` function.

In [13]:
environment = "AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1"
# Create a dummy command component to generate output
hello_component = command(
    name="hello_component",
    display_name="hello component",
    description="generate dummy folder as output",
    tags=dict(),
    command="echo 'hello' > ${{outputs.hello_output}}/hello.txt",
    environment=environment,
    outputs=dict(
        hello_output=Output(type="uri_folder")
    ),
)

world_component = command(
    name="world_component",
    display_name="world component",
    description="generate dummy folder as output",
    tags=dict(),
    command="echo 'world' > ${{outputs.world_output}}/world.txt",
    environment=environment,
    outputs=dict(
        world_output=Output(type="uri_folder")
    ),
)

# 3. Build pipeline job to merge data
## 3.1 Build pipeline

In [25]:
# Construct pipeline
@pipeline
def pipeline_copy_data():
    # Define the steps in the pipeline job
    hello_step = hello_component()
    world_step = world_component()
    # Merge the two generated folders with the data transfer component
    data_merge_step = data_merge(
        folder1=hello_step.outputs.hello_output,
        folder2=world_step.outputs.world_output,
    )
    data_merge_step.outputs.output_folder.type = "uri_folder"
    # Run the data transfer step on serverless compute
    data_merge_step.compute = "serverless"
    # Return the pipeline outputs
    return {
        "output_folder": data_merge_step.outputs.output_folder,
    }


pipeline_job = pipeline_copy_data()
# Set the pipeline-level default compute
pipeline_job.settings.default_compute = "cpu-cluster"
pipeline_job.outputs.output_folder.type = "uri_folder"

In [26]:
# Inspect built pipeline
print(pipeline_job)

display_name: pipeline_copy_data
type: pipeline
outputs:
  output_folder:
    type: uri_folder
jobs:
  hello_step:
    type: command
    outputs:
      hello_output:
        type: uri_folder
    environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1
    component:
      name: hello_component
      display_name: hello component
      description: generate dummy folder as output
      type: command
      outputs:
        hello_output:
          type: uri_folder
      command: echo 'hello' > ${{outputs.hello_output}}/hello.txt
      environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1
      is_deterministic: true
      id: /subscriptions/d128f140-94e6-4175-87a7-954b9d27db16/resourceGroups/lochen/providers/Microsoft.MachineLearningServices/workspaces/lochen-eastus/components/azureml_anonymous/versions/d3fdc987-f543-4223-96ad-96253a6e8074
  world_step:
    type: command
    outputs:
      world_output:
        type: uri_folder
    environment: azureml:AzureML-sklearn-1.0
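To make the intended result of the merge step concrete, the following local stand-in copies two folders into one using only the standard library. It is not how the service implements the data transfer, and the overwrite-on-conflict behavior is an assumption. The helper name `merge_folders` is hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def merge_folders(folder1, folder2, output_folder):
    """Copy the contents of both folders into output_folder.

    Files from folder2 overwrite files with the same name from folder1.
    """
    for source in (folder1, folder2):
        # dirs_exist_ok (Python 3.8+) lets the second copy reuse the existing target
        shutil.copytree(source, output_folder, dirs_exist_ok=True)
    return sorted(p.name for p in Path(output_folder).iterdir())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    hello_dir = root / "hello"
    world_dir = root / "world"
    merged_dir = root / "merged"
    hello_dir.mkdir()
    world_dir.mkdir()
    # Mimic the outputs of hello_component and world_component
    (hello_dir / "hello.txt").write_text("hello\n")
    (world_dir / "world.txt").write_text("world\n")

    merged_files = merge_folders(hello_dir, world_dir, merged_dir)
    print(merged_files)  # ['hello.txt', 'world.txt']
```

In the pipeline, `hello_step` and `world_step` play the role of the two source folders and `data_merge_step.outputs.output_folder` plays the role of the merged folder.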

## 3.2 Submit pipeline job

In [24]:
# Submit pipeline job to workspace
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="data_transfer_in_pipeline"
)
pipeline_job

Experiment,Name,Type,Status,Details Page
data_transfer_in_pipeline,sad_pocket_1ykf265tbx,pipeline,Preparing,Link to Azure Machine Learning studio


In [None]:
# Wait until the job completes
ml_client.jobs.stream(pipeline_job.name)
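
Once streaming returns, the merged folder can be fetched locally. The helper below is a minimal sketch, assuming the job reports the status `Completed` on success and that the pipeline output is named `output_folder` as defined above. The function name `download_pipeline_output` and the default download path are hypothetical.

```python
def download_pipeline_output(
    ml_client,
    job_name,
    output_name="output_folder",
    download_path="./pipeline_output",
):
    """Download a named pipeline output if the job completed; return True if downloaded."""
    job = ml_client.jobs.get(job_name)
    if job.status != "Completed":
        print(f"Job {job_name} is in state {job.status}; nothing downloaded.")
        return False
    # Fetch only the requested named output into download_path
    ml_client.jobs.download(
        name=job_name,
        download_path=download_path,
        output_name=output_name,
    )
    return True
```

A typical call would be `download_pipeline_output(ml_client, pipeline_job.name)`, after which the download directory should contain both `hello.txt` and `world.txt` if the merge worked as intended.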

# Next Steps
You can find further examples of running a pipeline job [here](../README.md).