# Build AML Pipeline with azureml modules

In this tutorial you will learn how to work with Azure ML modules:

1. Set up the environment: install the module CLI and the module/pipeline SDK
2. Register a few sample modules into your AML workspace using the CLI
3. Use the module/pipeline SDK to create a pipeline with the modules registered in step 2

## Prerequisites
* Install the Azure CLI by following [the Azure CLI installation instructions](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest).
* Install Docker Desktop from [the Docker Desktop product page](https://www.docker.com/products/docker-desktop).

## Set up the environment
* Install the Azure CLI AML extension, which includes the _module_ command group
* Install the Azure ML SDK, including the APIs for working with _module_ and _pipeline_

In [None]:
CLI_SDK_VERSION=19657780

In [None]:
!az extension remove -n azure-cli-ml 

# Install local version of azure-cli-ml (which includes `az ml module` commands)
!az extension add --source https://azuremlsdktestpypi.azureedge.net/CLI-SDK-Runners-Validation/$CLI_SDK_VERSION/azure_cli_ml-0.1.0.$CLI_SDK_VERSION-py3-none-any.whl --pip-extra-index-urls https://azuremlsdktestpypi.azureedge.net/CLI-SDK-Runners-Validation/$CLI_SDK_VERSION --yes 

In [None]:
# Verify the availability of `az ml module` commands
#!az ml pipeline -h
!az ml module -h

In [None]:
# Install the Azure ML SDK with Pipeline and Module support
# Important: restart the kernel after the installation succeeds

%config IPCompleter.greedy=True 
!pip install azureml-pipeline-wrapper[notebooks]==0.1.0.$CLI_SDK_VERSION --extra-index-url https://azuremlsdktestpypi.azureedge.net/CLI-SDK-Runners-Validation/$CLI_SDK_VERSION --user --upgrade 
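
After restarting the kernel, it can help to confirm that the pinned build is the one actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the expected version string simply mirrors the pin used in the install command above.

```python
from importlib import metadata

CLI_SDK_VERSION = 19657780  # same value as the cell at the top of the notebook


def installed_version(distribution_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


expected = f'0.1.0.{CLI_SDK_VERSION}'
found = installed_version('azureml-pipeline-wrapper')

if found is None:
    print('azureml-pipeline-wrapper is not installed in this kernel')
elif found != expected:
    print(f'Found version {found}, expected {expected}; restart the kernel or reinstall')
else:
    print(f'azureml-pipeline-wrapper {found} is installed')
```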

## Register azureml module

You can manage AML modules through [azure-cli-ml](https://aka.ms/moduledoc) or [ml.azure.com](https://ml.azure.com/).

Modules can be registered from:
- a local path
- a public GitHub URL
- Azure DevOps build artifacts

Azure ML supports multiple module types:
- Basic Python module
- MPI module
- Parallel run module
- HDI module (pending backend support)

In [None]:
# Configure your workspace information here

subscription_id = '74eccef0-4b8d-4f83-b5f9-fa100d155b22'
workspace_name = 'lisal-amlservice'
resource_group = 'lisal-dev'

# Specify an available AML compute target in the workspace
pipeline_compute = 'always-on-ds2v2'
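
Before passing these values to `az account set`, a quick format check can catch copy-paste mistakes early. This is only a sketch: the helper `is_valid_subscription_id` is hypothetical, and it checks the GUID format only, not whether the subscription exists or is accessible.

```python
import uuid


def is_valid_subscription_id(value):
    """Return True if value is a GUID in canonical hyphenated form."""
    try:
        # Round-trip through uuid.UUID so that only the canonical form is accepted
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


subscription_id = '74eccef0-4b8d-4f83-b5f9-fa100d155b22'
print(is_valid_subscription_id(subscription_id))
```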

In [None]:
# Configure your aml workspace 

!az login 
!az account set -s $subscription_id 
!az ml folder attach -w $workspace_name -g $resource_group 

# Configure a global .amlignore; it is intended for registering modules from a local development environment
# !az configure --defaults module_amlignore_file=./.amlignore

In [None]:
# Register Azure ML modules from GitHub URLs

!az ml module register --spec-file=https://github.com/lisagreenview/hello-aml-modules/blob/master/train-score-eval/mpi_train.yaml --set-as-default-version
!az ml module register --spec-file=https://github.com/lisagreenview/hello-aml-modules/blob/master/train-score-eval/score.yaml --set-as-default-version
!az ml module register --spec-file=https://github.com/lisagreenview/hello-aml-modules/blob/master/train-score-eval/eval.yaml --set-as-default-version
!az ml module register --spec-file=https://github.com/lisagreenview/hello-aml-modules/blob/master/train-score-eval/compare2.yaml --set-as-default-version
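
The four registration commands differ only in the spec file name, so they can also be generated from a list. The sketch below only builds the command strings from the URL and flags used above; the helper name `build_register_commands` is hypothetical.

```python
# Base URL and spec file names are taken from the registration cell above
BASE_URL = 'https://github.com/lisagreenview/hello-aml-modules/blob/master/train-score-eval'
SPEC_FILES = ['mpi_train.yaml', 'score.yaml', 'eval.yaml', 'compare2.yaml']


def build_register_commands(base_url, spec_files):
    """Build one `az ml module register` command string per spec file."""
    return [
        f'az ml module register --spec-file={base_url}/{name} --set-as-default-version'
        for name in spec_files
    ]


commands = build_register_commands(BASE_URL, SPEC_FILES)
for command in commands:
    print(command)
```

In a notebook cell, each string could then be executed with IPython's `!{command}` syntax inside the loop instead of being printed.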

In [None]:
# List the custom modules available in the AML workspace
!az ml module list -o table 

## Create pipeline
You can build a pipeline with the SDK, or by drag and drop in the [Designer](https://ml.azure.com/visualinterface?wsid=/subscriptions/74eccef0-4b8d-4f83-b5f9-fa100d155b22/resourcegroups/kubeflow-demo/workspaces/kubeflow_ws_1&flight=cm,nml,newGraphDetail,newGraphAuthoring,all&tid=72f988bf-86f1-41af-91ab-2d7cd011db47) in the workspace portal.

The new SDK:
* Simplifies the syntax to provide an experience consistent with drag and drop
* Supports IntelliSense and docstrings, so you do not have to work with dicts all the time
* Supports creating a pipeline with unpublished modules

In [None]:
from azureml.core import Workspace, Run, Dataset
from azureml.pipeline.wrapper import Pipeline, Module, dsl

ws = Workspace.get(name=workspace_name, subscription_id=subscription_id, resource_group=resource_group)

# get modules
# load module from ws registered modules
train_module_func = Module.load(ws, namespace='microsoft.com/aml/samples', name='MPI Train')
score_module_func = Module.load(ws, namespace='microsoft.com/aml/samples', name='Score')
eval_module_func = Module.load(ws, namespace='microsoft.com/aml/samples', name='Evaluate')
compare_module_func = Module.load(ws, namespace='microsoft.com/aml/samples', name='Compare 2 Models')

"""
# load module from local unregistered module
train_module_func = Module.from_yaml(ws, yaml_file='./train-score-eval/mpi_train.yaml')
score_module_func = Module.from_yaml(ws, yaml_file='./train-score-eval/score.yaml')
eval_module_func = Module.from_yaml(ws, yaml_file='./train-score-eval/eval.yaml')
compare_module_func = Module.from_yaml(ws, yaml_file='./train-score-eval/compare2.yaml')
"""

# get dataset
training_data_name = 'training_data'
test_data_name = 'test_data'

if training_data_name not in ws.datasets:
    print('Registering a training dataset for sample pipeline ...')
    train_data = Dataset.File.from_files(path=['https://dprepdata.blob.core.windows.net/demo/Titanic.csv'])
    train_data.register(workspace = ws, 
                              name = training_data_name, 
                              description = 'Training data (just for illustrative purpose)')
    print('Registered')
else:
    train_data = ws.datasets[training_data_name]
    print('Training dataset found in workspace')

if test_data_name not in ws.datasets:
    print('Registering a test dataset for sample pipeline ...')
    test_data = Dataset.File.from_files(path=['https://dprepdata.blob.core.windows.net/demo/Titanic.csv'])
    test_data.register(workspace = ws, 
                          name = test_data_name, 
                          description = 'Test data (just for illustrative purpose)')
    print('Registered')
else:
    test_data = ws.datasets[test_data_name]    
    print('Test dataset found in workspace')


### DSL pipeline
* A pipeline parameter is exposed as an input parameter of the pipeline function
* The pipeline output is the return value of the pipeline function
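
To make these two points concrete, here is a toy, pure-Python sketch of a decorator that treats function arguments as pipeline parameters and the returned dict as pipeline outputs. It is not the azureml implementation: `toy_pipeline` and `toy_training` are hypothetical names used for illustration only.

```python
import inspect


def toy_pipeline(name):
    """Hypothetical stand-in for a pipeline decorator; NOT the azureml implementation."""
    def decorator(func):
        def build(*args, **kwargs):
            # Bind the call arguments to the function signature: these play
            # the role of pipeline parameters
            bound = inspect.signature(func).bind(*args, **kwargs)
            bound.apply_defaults()
            # The dict returned by the function plays the role of pipeline outputs
            outputs = func(*args, **kwargs)
            return {'name': name, 'parameters': dict(bound.arguments), 'outputs': outputs}
        return build
    return decorator


@toy_pipeline(name='toy training pipeline')
def toy_training(input_data, learning_rate=0.01):
    return {'model_output': f'model({input_data}, lr={learning_rate})'}


result = toy_training('titanic.csv', learning_rate=0.02)
print(result['parameters'])
print(result['outputs'])
```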

### Module function
* Module inputs can be set through set_inputs() or the module initialization function
* Module parameters can be set through set_parameters() or the module initialization function
* Module run settings, including compute, datastore, data mode and other runtime parameters, are set through runsettings.configure()
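
The chained setter style used with modules works because each setter returns the object itself. The toy class below sketches that pattern in plain Python; `ToyModule` is a hypothetical stand-in and does not reproduce the azureml `Module` class or its validation.

```python
class ToyModule:
    """Hypothetical stand-in for a module node; NOT the azureml Module class."""

    def __init__(self, **inputs):
        # Inputs may be passed at initialization time...
        self.inputs = dict(inputs)
        self.parameters = {}

    def set_inputs(self, **inputs):
        # ...or set later; returning self allows chaining
        self.inputs.update(inputs)
        return self

    def set_parameters(self, **parameters):
        self.parameters.update(parameters)
        return self


# Two equivalent ways of providing the same input
node_a = ToyModule(training_data='train.csv').set_parameters(learning_rate=0.01, max_epochs=5)
node_b = ToyModule().set_inputs(training_data='train.csv').set_parameters(learning_rate=0.01, max_epochs=5)

print(node_a.inputs == node_b.inputs, node_a.parameters == node_b.parameters)
```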


In [None]:
# define a sub pipeline
@dsl.pipeline(name = 'A sub pipeline including train-score-eval', 
              description = 'train model and evaluate model perf')
def training_pipeline(input_data, test_data, learning_rate):
    train = train_module_func()

    train.set_inputs(training_data=input_data).set_parameters(learning_rate=learning_rate, max_epochs=5)
    train.runsettings.configure(process_count_per_node = 1, node_count = 1)

    score = score_module_func(
        model_input=train.outputs.model_output, 
        test_data=test_data)

    # named eval_node to avoid shadowing the built-in eval()
    eval_node = eval_module_func(scoring_result=score.outputs.score_output)
    
    return {'eval_output': eval_node.outputs.eval_output, 'model_output': train.outputs.model_output}

In [None]:
# define pipeline with sub pipeline
@dsl.pipeline(name = 'A dummy pipeline that trains multiple models and output the best one', 
              description = 'select best model trained with different learning rate',
              default_compute_target = pipeline_compute)
def dummy_automl_pipeline(input_data, test_data):
    train_and_evaluate_model1 = training_pipeline(input_data, test_data, 0.01)
    train_and_evaluate_model2 = training_pipeline(input_data, test_data, 0.02)
    
    compare = compare_module_func(
        model1=train_and_evaluate_model1.outputs.model_output, 
        eval_result1=train_and_evaluate_model1.outputs.eval_output,
        model2=train_and_evaluate_model2.outputs.model_output,
        eval_result2=train_and_evaluate_model2.outputs.eval_output
    )

    return {'best_model': compare.outputs.best_model, 'best_result': compare.outputs.best_result}

# create a pipeline
pipeline = dummy_automl_pipeline(input_data = train_data, test_data = test_data)

In [None]:
# validate pipeline and visualize the graph
pipeline.validate()

In [None]:
# Save as a draft; you can then continue to modify the pipeline on the AML Studio Designer page
pipeline.save(experiment_name = 'pipeline-with-azureml-module')

In [None]:
# Pipeline parameters can be overridden when submitting the pipeline
run = pipeline.submit(experiment_name='pipeline-with-azureml-module', tags={'mode':'module-SDK','SDK-version':f'{CLI_SDK_VERSION}'}, pipeline_parameters={'input_data':train_data,'test_data':test_data})
run.wait_for_completion()

In [None]:
pipeline.run(experiment_name='pipeline-local_run', show_output = True, show_graph = True)

### Load an unregistered module and test it locally
* Modules can be loaded from a local path or from GitHub
* A module can be used without registering it to the AML workspace
* Use module.run() to test the module locally

In [None]:
# load unregistered module from github
copy_file_func = Module.from_yaml(ws, yaml_file='https://github.com/lisagreenview/hello-aml-modules/blob/master/parallel_copy_file/copy_files.yaml')
help(copy_file_func)

In [None]:
# create a module
# you need to prepare local test data to initialize the module
copy_file = copy_file_func(input_folder='./dummy_metrics')

copy_file.run(use_docker=True, track_run_history=True)

### Create a new pipeline with unregistered module

new pipeline = dummy_automl_pipeline + copy_file_func

In [None]:
@dsl.pipeline(name='pipeline-with-azureml-module',default_compute_target = pipeline_compute)
def add_copy_file():
    compare_pipeline = dummy_automl_pipeline(input_data=train_data, test_data=test_data)
    copy_file_node = copy_file_func(input_folder=compare_pipeline.outputs.best_result)
    copy_file_node.runsettings.configure(node_count=1)

    return {**copy_file_node.outputs}

new_pipeline = add_copy_file()

### Run pipeline locally

* pipeline.run() supports BasicPythonModule, ParallelRunModule and MpiModule
* Local run logs and module node logs are uploaded and recorded in the AML workspace, and the run status is also synced back to the workspace
* For MPI modules, local run only supports images with OpenMPI; Intel MPI is not supported

In [None]:
import os
# Raw string keeps the backslashes in the Windows path literal
os.chdir(r'D:\work\code\hello-aml-modules')
run = new_pipeline.run(experiment_name = 'pipeline-local_run', show_output = True, show_graph = True)
run
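
Changing the working directory affects every later cell in the notebook. One option, sketched below with the standard library only, is a small context manager that switches directory for the duration of a block and then restores the previous one. The helper name `working_directory` is hypothetical, and a temporary directory stands in for the local checkout of the module repository.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily change the current working directory, then restore it."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)


start = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    with working_directory(tmp):
        # A local pipeline run would go here
        inside = os.getcwd()
after = os.getcwd()

print(after == start)
```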