# Train a scikit-learn SVM on the Iris dataset: two-step pipeline

**Requirements** - In order to benefit from this tutorial, you will need:
- A basic understanding of Machine Learning
- An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace with a compute cluster - [Configure workspace](../../../configuration.ipynb)
- A Python environment
- The Azure Machine Learning Python SDK v2 installed - [install instructions](../../../../README.md) - check the getting started section

**Learning Objectives** - By the end of this tutorial, you should be able to:
- Connect to your AML workspace from the Python SDK
- Create a two-step pipeline that builds on the [single step job: iris-scikit-learn](../../../single-step/scikit-learn/iris/iris-scikit-learn.ipynb)

**Motivations** - This notebook explains how to build a two-step pipeline job.

# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

## 1.1 Import the required libraries

In [None]:
# import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ml import MLClient, Input, Output, command, dsl
from azure.ml.entities import load_component

## 1.2 Configure credential

We use `DefaultAzureCredential` to get access to the workspace.
`DefaultAzureCredential` should be capable of handling most Azure SDK authentication scenarios.

If it does not work for you, see the [configure credential example](../../configuration.ipynb) and the [azure-identity reference doc](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python) for other available credentials.

In [None]:
try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

## 1.3 Get a handle to the workspace

We use a config file to connect to the workspace. The Azure ML workspace should be configured with a compute cluster. [See this notebook to configure a workspace](../../configuration.ipynb).

In [None]:
# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential)

# Retrieve an already attached Azure Machine Learning Compute.
cluster_name = "cpu-cluster"
print(ml_client.compute.get(cluster_name))
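
`MLClient.from_config` reads the workspace details from a `config.json` file. As a minimal sketch, the file can be generated as shown here. The three keys are the ones the SDK reads; the placeholder values and the helper name `write_workspace_config` are assumptions made for illustration, and the demo writes into a temporary directory so that it does not overwrite an existing config file.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write a config.json holding the three workspace identifiers (hypothetical helper)."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(config, indent=4))
    return path


# Demo in a temporary directory; the values are placeholders to replace with your own.
with tempfile.TemporaryDirectory() as tmp:
    config_path = write_workspace_config(
        tmp, "<SUBSCRIPTION_ID>", "<RESOURCE_GROUP>", "<AML_WORKSPACE_NAME>"
    )
    written = json.loads(config_path.read_text())
    print(written)
```

In practice the file would sit in the notebook's working directory (or a parent directory) so that `from_config` can find it.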

# 2. Build two step pipeline
In this section we build a two-step pipeline: the first step trains a model with scikit-learn, and the second step uses the trained model to predict results on the test data.

## 2.1 Build train step
### 2.1.1 Configure the Command
The `command` function lets you configure the following key aspects of a job:
- `code` - The path to the folder that contains the code to run.
- `command` - The command to run.
- `inputs` - A dictionary of inputs to the command, as name-value pairs. The key is the name of the input within the context of the job and the value is the input value. Inputs can be referenced in the `command` using the `${{inputs.<input_name>}}` expression. To use files or folders as inputs, use the `Input` class, which supports three parameters:
    - `type` - The type of input. This can be `uri_file` or `uri_folder`. The default is `uri_folder`.
    - `path` - The path to the file or folder. It can be local or remote. For remote files, http/https and wasb are supported.
        - Azure ML `data`/`dataset` or `datastore` are of type `uri_folder`. To use a `data`/`dataset` as input, reference a dataset registered in the workspace using the format `<data_name>:<version>`, for example `Input(type='uri_folder', path='my_dataset:1')`.
    - `mode` - How the data is delivered to the compute target. Allowed values are `ro_mount`, `rw_mount` and `download`. The default is `ro_mount`.
- `outputs` - A dictionary of outputs of the command, declared with the `Output` class. Outputs can be referenced in the `command` using the `${{outputs.<output_name>}}` expression.
- `environment` - The environment needed for the command to run. You can use a curated or custom environment from the workspace, or create a new custom environment. Check out the [environment](../../../../assets/environment/environment.ipynb) notebook for more examples.
- `compute` - The compute on which the command will run. In this example we use a compute called `cpu-cluster` that is present in the workspace. You can replace it with any other compute in the workspace. You can also run the command on the local machine by using `local` for the compute; the run details and the output of the job are then uploaded to the Azure ML workspace.
- `distribution` - Distribution configuration for distributed training scenarios. Azure Machine Learning supports PyTorch, TensorFlow, and MPI-based distributed training. The allowed values are `PyTorch`, `TensorFlow` or `Mpi`.
- `display_name` - The display name of the job.
- `description` - The description of the experiment.
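
To make the `${{inputs.<input_name>}}` expressions more concrete, here is a small, purely illustrative sketch of the substitution idea in plain Python. It is not how Azure ML implements the resolution; the function `render_command` and the output path are hypothetical, and at run time the service replaces file and folder references with paths on the compute target.

```python
import re

# Matches expressions such as ${{inputs.C}} or ${{outputs.model_path}}.
PLACEHOLDER = re.compile(r"\$\{\{\s*(inputs|outputs)\.(\w+)\s*\}\}")


def render_command(template, inputs, outputs):
    """Substitute ${{inputs.x}} / ${{outputs.y}} with concrete values (illustration only)."""
    values = {"inputs": inputs, "outputs": outputs}

    def substitute(match):
        group, name = match.group(1), match.group(2)
        return str(values[group][name])

    return PLACEHOLDER.sub(substitute, template)


template = (
    "python train.py --C ${{inputs.C}} --kernel ${{inputs.kernel}} "
    "--model_path ${{outputs.model_path}}"
)
rendered = render_command(
    template,
    inputs={"C": 0.8, "kernel": "rbf"},
    outputs={"model_path": "/tmp/outputs/model_path"},  # hypothetical path
)
print(rendered)
```

Each placeholder is looked up by name, which is why the keys of the `inputs` and `outputs` dictionaries must match the names used in the command string.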

In [None]:
# create the command
command_func = command(
    code="./train-src",  # local path where the code is stored
    command="python train.py --iris-csv ${{inputs.data_csv}} --C ${{inputs.C}} --kernel ${{inputs.kernel}} --coef0 ${{inputs.coef0}} --model_path ${{outputs.model_path}} --test_data ${{outputs.test_data}}",
    inputs={
        "data_csv": Input(
            type="uri_file",
        ),
        "C": 0.8,
        "kernel": "rbf",
        "coef0": 0.1,
    },
    outputs={
        "model_path": Output(
            type="uri_folder",
        ),
        "test_data": Output(
            type="uri_folder",
        ),
    },
    environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest",
    compute="cpu-cluster",
)
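
The command above passes six flags to `train.py`, so the script has to accept them. The contents of `./train-src` are not shown in this notebook; the following is only a sketch of an argument parser that would match the command line, with default values chosen as assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags used in the train command (sketch; defaults are assumptions)."""
    parser = argparse.ArgumentParser(description="Train an SVM on the Iris dataset")
    parser.add_argument("--iris-csv", type=str, required=True, help="path to the input CSV file")
    parser.add_argument("--C", type=float, default=1.0, help="regularization parameter")
    parser.add_argument("--kernel", type=str, default="rbf", help="SVM kernel type")
    parser.add_argument("--coef0", type=float, default=0.0, help="independent term of the kernel")
    parser.add_argument("--model_path", type=str, required=True, help="output folder for the model")
    parser.add_argument("--test_data", type=str, required=True, help="output folder for the test split")
    return parser.parse_args(argv)


# Example invocation with made-up paths, mirroring the values set in the command.
args = parse_args(
    [
        "--iris-csv", "iris.csv",
        "--C", "0.8",
        "--kernel", "rbf",
        "--coef0", "0.1",
        "--model_path", "outputs/model",
        "--test_data", "outputs/test",
    ]
)
print(args.iris_csv, args.C, args.kernel, args.coef0)
```

Note that argparse exposes `--iris-csv` as `args.iris_csv`, replacing the hyphen with an underscore.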

## 2.2 Build predict step using command
This step takes the trained model and the test data as inputs and returns the prediction result.


In [None]:
from azure.ml import command, Input, Output
# define the command
predict = command(
    code="./predict-src",
    command="python predict.py --model ${{inputs.model}} --test_data ${{inputs.test_data}} --predict_result ${{outputs.predict_result}}",
    environment="AzureML-lightgbm-3.2-ubuntu18.04-py37-cpu@latest",
    inputs={
        "model": Input(
            type="uri_folder",
        ),
        "test_data": Input(
            type="uri_folder",
        ),
    },
    outputs={
        "predict_result": Output(
            type="uri_folder",
        ),
    },
    compute="cpu-cluster",
)
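
Because `predict_result` is declared as a `uri_folder` output, the script receives a directory path through `--predict_result` and is expected to write its files there. `predict.py` is not shown in this notebook, so the sketch below is an assumption about how such a script might save its results; the helper `write_predictions`, the file name and the column header are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def write_predictions(output_dir, predictions, filename="predictions.csv"):
    """Write one prediction per row into a file inside the output folder."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # the folder may not exist yet
    out_file = out_dir / filename
    with out_file.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["prediction"])  # header row
        writer.writerows([p] for p in predictions)
    return out_file


# Demo with made-up labels, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    result_file = write_predictions(Path(tmp) / "predict_result", ["setosa", "virginica"])
    lines = result_file.read_text().splitlines()
    print(lines)
```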

# 3. Build pipeline

We define a pipeline containing 2 nodes:
- `train_step` trains a model on the Iris data and returns the model and the test data.
- `predict_step` loads the trained model and predicts on the test data.

In [None]:
# define a pipeline containing 2 nodes: train node, predict node
@dsl.pipeline(
    description="two step pipeline",
    default_compute='cpu-cluster',
)
def two_step_pipeline():
    # Reuse the command_func defined earlier; calling it like a function lets us bind its inputs
    train_step = command_func(
        data_csv=Input(type='uri_file', path='https://azuremlexamples.blob.core.windows.net/datasets/iris.csv'),
    )
    # Feed the outputs of the train step into the predict step
    predict_step = predict(
        model=train_step.outputs.model_path,
        test_data=train_step.outputs.test_data,
    )

# create a pipeline
pipeline = two_step_pipeline()
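
Binding `train_step.outputs` to the inputs of `predict_step` is what makes the second node depend on the first. The following standard-library sketch illustrates that idea with a plain dependency graph; it is not part of the Azure ML SDK, and the dictionary is a hand-written stand-in for what the pipeline derives from the bindings.

```python
from graphlib import TopologicalSorter

# Map each step to the steps whose outputs it consumes.
dependencies = {
    "train_step": set(),
    "predict_step": {"train_step"},  # uses model_path and test_data from train_step
}

# A topological order lists every step after the steps it depends on.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)  # ['train_step', 'predict_step']
```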

# 4. Submit pipeline job

In [None]:
pipeline_job = ml_client.jobs.create_or_update(
    pipeline, experiment_name="sklearn_iris_pipeline_samples"
)
pipeline_job