# AutoML in pipeline

**Requirements** - In order to benefit from this tutorial, you will need:
- A basic understanding of Machine Learning
- An Azure account with an active subscription - [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace with a compute cluster - [Configure workspace](../../../configuration.ipynb)
- A Python environment
- The Azure Machine Learning Python SDK v2 installed - [install instructions](../../../../README.md) - check the getting started section

**Learning Objectives** - By the end of this tutorial, you should be able to:
- Create a pipeline with a Regression AutoML task.

**Motivations** - This notebook explains how to use a Regression AutoML task inside a pipeline.

# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

## 1.1 Import the required libraries

In [None]:
# import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

from azure.ai.ml import MLClient, Input, command, Output
from azure.ai.ml.dsl import pipeline
# TabularFeaturizationSettings is exposed by the public automl namespace
from azure.ai.ml.automl import regression, TabularFeaturizationSettings
from azure.ai.ml.entities import Environment

## 1.2 Configure credential

We use `DefaultAzureCredential` to get access to the workspace. 
`DefaultAzureCredential` should be capable of handling most Azure SDK authentication scenarios. 

If it does not work for you, see the other available credentials: [configure credential example](../../../configuration.ipynb), [azure-identity reference doc](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python).

In [None]:
try:
    credential = DefaultAzureCredential()
    # Check whether the credential can obtain a token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

## 1.3 Get a handle to the workspace

We use a config file to connect to the workspace. The Azure ML workspace should be configured with a compute cluster. [Check this notebook to configure a workspace](../../../configuration.ipynb)

In [None]:
# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential)

# Retrieve an already attached Azure Machine Learning Compute.
cluster_name = "cpu-cluster"
print(ml_client.compute.get(cluster_name))
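
`MLClient.from_config` reads the workspace identifiers from a `config.json` file. The following is a minimal sketch of what such a file contains, written with the standard library only. The helper name `write_workspace_config`, the temporary folder, and the placeholder values are assumptions for illustration; replace the placeholders with your own subscription, resource group, and workspace names.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(folder, subscription_id, resource_group, workspace_name):
    """Write a config.json with the three workspace identifiers and return its path."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    config_path = Path(folder) / "config.json"
    config_path.write_text(json.dumps(config, indent=4))
    return config_path


# Placeholder values (assumptions): replace them with your own identifiers.
demo_folder = tempfile.mkdtemp()
config_path = write_workspace_config(
    demo_folder,
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)
print(config_path.read_text())
```

In practice the file would live next to the notebook (or in a parent directory) rather than in a temporary folder, so that `from_config` can find it.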

# 3. Basic pipeline job with regression task

## 3.1 Build pipeline
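
The pipeline in this section takes three inputs of type `mltable`, each pointing at a local folder. Such a folder is expected to hold a file named `MLTable` that describes where the data is and how to read it. The sketch below writes a minimal `MLTable` file for a single comma-delimited CSV. The helper `write_mltable`, the temporary root folder, and the file name `train.csv` are assumptions for illustration; the actual data files for this notebook are not shown here.

```python
import tempfile
from pathlib import Path

# Minimal MLTable definition: one CSV file parsed as comma-delimited text.
MLTABLE_TEMPLATE = """\
paths:
  - file: ./{data_file}
transformations:
  - read_delimited:
      delimiter: ','
      encoding: 'ascii'
"""


def write_mltable(folder, data_file):
    """Create `folder` if needed and write an MLTable file pointing at `data_file`."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    mltable_path = folder / "MLTable"
    mltable_path.write_text(MLTABLE_TEMPLATE.format(data_file=data_file))
    return mltable_path


# Hypothetical layout: the CSV name is an assumption, and the CSV itself
# must be copied into the folder separately.
root = Path(tempfile.mkdtemp())
mltable_path = write_mltable(root / "training-mltable-folder", "train.csv")
print(mltable_path.read_text())
```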

In [None]:
env_docker_conda = Environment(
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file="./environment/preprocessing_env.yaml",
    name="pipeline-custom-environment",
    description="Environment created from a Docker image plus Conda environment.",
)
ml_client.environments.create_or_update(env_docker_conda)

In [None]:
# Define pipeline
@pipeline(
    description="AutoML Regression Pipeline",
)
def automl_regression(
    regression_train_data, regression_validation_data, regression_test_data
):
    # define command function for preprocessing the data
    preprocessing_command_func = command(
        inputs=dict(
            train_data=Input(type="mltable"),
            validation_data=Input(type="mltable"),
            test_data=Input(type="mltable"),
        ),
        outputs=dict(
            preprocessed_train_data=Output(type="mltable"),
            preprocessed_validation_data=Output(type="mltable"),
            preprocessed_test_data=Output(type="mltable"),
        ),
        code="./preprocess.py",
        command="python preprocess.py "
        + "--train_data ${{inputs.train_data}} "
        + "--validation_data ${{inputs.validation_data}} "
        + "--test_data ${{inputs.test_data}} "
        + "--preprocessed_train_data ${{outputs.preprocessed_train_data}} "
        + "--preprocessed_validation_data ${{outputs.preprocessed_validation_data}} "
        + "--preprocessed_test_data ${{outputs.preprocessed_test_data}}",
        environment="pipeline-custom-environment@latest",
    )
    preprocess_node = preprocessing_command_func(
        train_data=regression_train_data,
        validation_data=regression_validation_data,
        test_data=regression_test_data,
    )

    # define the AutoML regression task with AutoML function
    regression_node = regression(
        primary_metric="r2_score",
        target_column_name="SalePrice",
        training_data=preprocess_node.outputs.preprocessed_train_data,
        validation_data=preprocess_node.outputs.preprocessed_validation_data,
        test_data=preprocess_node.outputs.preprocessed_test_data,
        featurization=TabularFeaturizationSettings(mode="off"),
        # currently the "mlflow_model" output must be specified explicitly so that following nodes can reference it
        outputs={"best_model": Output(type="mlflow_model")},
    )
    # set limits & training
    regression_node.set_limits(max_trials=1, max_concurrent_trials=1)
    regression_node.set_training(
        enable_stack_ensemble=False, enable_vote_ensemble=False
    )

    # define command function for registering the model
    command_func = command(
        inputs=dict(
            model_input_path=Input(type="mlflow_model"),
            model_base_name="regression_example_model",
        ),
        code="./register.py",
        command="python register.py "
        + "--model_input_path ${{inputs.model_input_path}} "
        + "--model_base_name ${{inputs.model_base_name}}",
        environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1",
    )
    register_model = command_func(model_input_path=regression_node.outputs.best_model)


pipeline_regression = automl_regression(
    regression_train_data=Input(path="./training-mltable-folder/", type="mltable"),
    regression_validation_data=Input(
        path="./validation-mltable-folder/", type="mltable"
    ),
    regression_test_data=Input(path="./test-mltable-folder/", type="mltable"),
)

# set pipeline level compute
pipeline_regression.settings.default_compute = "cpu-cluster"
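
The preprocessing command above passes six paths to `preprocess.py` as command-line flags, with Azure ML substituting real locations for the `${{inputs.*}}` and `${{outputs.*}}` placeholders at run time. The contents of `preprocess.py` are not shown in this notebook, so the following is only a sketch of how such a script might parse those flags with `argparse`. The function name `parse_args` and the example paths are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the six path arguments that the pipeline command passes in."""
    parser = argparse.ArgumentParser(description="Hypothetical preprocess.py entry point")
    for name in ("train_data", "validation_data", "test_data"):
        # One flag for the input table and one for its preprocessed output.
        parser.add_argument(f"--{name}", type=str, required=True)
        parser.add_argument(f"--preprocessed_{name}", type=str, required=True)
    return parser.parse_args(argv)


# Simulated command line with made-up paths, standing in for the values
# that replace the ${{...}} placeholders when the job runs.
args = parse_args(
    [
        "--train_data", "/mnt/in/train",
        "--validation_data", "/mnt/in/validation",
        "--test_data", "/mnt/in/test",
        "--preprocessed_train_data", "/mnt/out/train",
        "--preprocessed_validation_data", "/mnt/out/validation",
        "--preprocessed_test_data", "/mnt/out/test",
    ]
)
print(args.train_data, "->", args.preprocessed_train_data)
```

The flag names must match the ones in the `command` string exactly, otherwise the script fails at argument parsing before any data is read.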

## 3.2 Submit pipeline job

In [None]:
# submit the pipeline job
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_regression, experiment_name="pipeline_samples"
)
pipeline_job

In [None]:
# Wait until the job completes
ml_client.jobs.stream(pipeline_job.name)
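
When a script needs an explicit check of the final state, for example for a job that was submitted earlier and is not being streamed, the job can be fetched again with `ml_client.jobs.get` and its `status` inspected. The sketch below assumes that a successful job reports the status `"Completed"`. The helper `check_job_succeeded` is hypothetical, and the stand-in client exists only so the example runs without a workspace.

```python
from types import SimpleNamespace


def check_job_succeeded(ml_client, job_name):
    """Fetch the job by name and raise if it did not end in the "Completed" state."""
    job = ml_client.jobs.get(job_name)
    if job.status != "Completed":
        raise RuntimeError(f"Job {job_name} ended with status {job.status!r}")
    return job.status


# Stand-in client (not part of the SDK) that mimics `ml_client.jobs.get`.
fake_client = SimpleNamespace(
    jobs=SimpleNamespace(
        get=lambda name: SimpleNamespace(name=name, status="Completed")
    )
)
print(check_job_succeeded(fake_client, "demo_job"))
```

With the real client from section 1.3, the call would be `check_job_succeeded(ml_client, pipeline_job.name)`.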

# Next Steps
You can find further examples of running a pipeline job [here](../).