# Build RAI pipeline

**Requirements** - To benefit from this tutorial, you will need:
- A basic understanding of Machine Learning
- An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace. [Check this notebook for creating a workspace](/sdk/resources/workspace/workspace.ipynb) 
- A Compute Cluster. [Check this notebook to create a compute cluster](/sdk/resources/compute/compute.ipynb)
- A Python environment
- The Azure Machine Learning Python SDK v2, installed - [install instructions](/sdk/README.md#getting-started)

**Learning Objectives** - By the end of this tutorial, you should be able to:
- Connect to your AML workspace from the Python SDK
- Create a `Pipeline` using components defined in YAML

**Motivations** - This notebook demonstrates how to build a complex Responsible AI (RAI) sample pipeline.

# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

## 1.1. Import the required libraries

In [None]:
# Import the required libraries
import time

from azure.ml import MLClient, dsl
from azure.ml.entities import JobInput, load_component

## 1.2. Configure credential

We use `DefaultAzureCredential` to access the workspace. When an access token is needed, it tries multiple identities (`EnvironmentCredential`, `ManagedIdentityCredential`, `SharedTokenCacheCredential`, `VisualStudioCodeCredential`, `AzureCliCredential`, `AzurePowerShellCredential`) in turn, stopping when one provides a token.
See the [`DefaultAzureCredential` reference](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python) for more information.

`DefaultAzureCredential` should handle most Azure SDK authentication scenarios.
If it does not work for you, see the [`azure.identity` reference](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python) for all available credentials.
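
The "try each identity in turn, stop at the first token" behavior described above can be illustrated with a small, library-free sketch. This is not how `azure.identity` is implemented; `first_available_token`, `CredentialUnavailable`, and the two provider functions are hypothetical names used only to show the pattern.

```python
from typing import Callable, Sequence


class CredentialUnavailable(Exception):
    """Raised by a hypothetical provider that cannot supply a token."""


def first_available_token(providers: Sequence[Callable[[], str]]) -> str:
    """Try each provider in order and return the first token obtained."""
    errors = []
    for provider in providers:
        try:
            return provider()
        except CredentialUnavailable as exc:
            # Remember why this provider failed, then move on to the next one.
            errors.append(f"{provider.__name__}: {exc}")
    raise CredentialUnavailable("no provider returned a token; " + "; ".join(errors))


# Hypothetical stand-ins for real identity sources.
def environment_provider() -> str:
    raise CredentialUnavailable("environment variables not set")


def cli_provider() -> str:
    return "token-from-cli"


token = first_available_token([environment_provider, cli_provider])
print(token)
```

Excluding a failing credential, as the cell below suggests, corresponds to removing one provider from the list in this sketch.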

In [None]:
from azure.identity import DefaultAzureCredential

try:
    credential = DefaultAzureCredential()
    # Check whether the credential can obtain a token.
    credential.get_token('https://management.azure.com/.default')
except Exception as ex:
    # If retrieving a token fails, exclude the failing credential and try again.
    # For example, to exclude the VS Code credential:
    # credential = DefaultAzureCredential(exclude_visual_studio_code_credential=True)
    raise Exception("Failed to retrieve a token from the included credentials. Try adding `exclude_xxx_credential=True` to `DefaultAzureCredential` and try again.") from ex

## 1.3. Configure workspace details and get a handle to the workspace

To connect to a workspace, we need its identifying parameters: a subscription ID, a resource group, and a workspace name. We pass these details to `MLClient` from `azure.ml` to get a handle to the required Azure Machine Learning workspace.
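
The three identifiers are stored as a small JSON file, which the cell below writes to `.azureml/config.json` when no configuration is found. As a rough sketch of that file's shape, the following writes the same three keys to a temporary directory and reads them back. The `unfilled_keys` helper is a hypothetical addition for illustration, not part of the SDK.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values; replace them with your own workspace details.
workspace_config = {
    "subscription_id": "<SUBSCRIPTION_ID>",
    "resource_group": "<RESOURCE_GROUP>",
    "workspace_name": "<WORKSPACE_NAME>",
}


def unfilled_keys(config: dict) -> list:
    """Return the keys whose values are still `<PLACEHOLDER>` strings."""
    return [key for key, value in config.items() if value.startswith("<")]


# Write the config to a temporary directory (not your real project) and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / ".azureml" / "config.json"
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(workspace_config, indent=2))
    loaded_config = json.loads(config_path.read_text())

print(unfilled_keys(loaded_config))
```

Checking every key, rather than only `subscription_id`, catches a partially edited configuration before it reaches `MLClient.from_config`.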

In [None]:
try:
    ml_client = MLClient.from_config(credential=credential)
except Exception as ex:
    # NOTE: Update the following workspace information if it has not been configured correctly
    client_config = {
        "subscription_id": "<SUBSCRIPTION_ID>",
        "resource_group": "<RESOURCE_GROUP>",
        "workspace_name": "<WORKSPACE_NAME>"
    }

    if client_config["subscription_id"].startswith('<'):
        print("Please update <SUBSCRIPTION_ID>, <RESOURCE_GROUP>, and <WORKSPACE_NAME> in this notebook cell")
        raise ex
    else:  # write and reload from config file
        import json, os
        config_path = "../../.azureml/config.json"
        os.makedirs(os.path.dirname(config_path), exist_ok=True)
        with open(config_path, "w") as fo:
            fo.write(json.dumps(client_config))
        ml_client = MLClient.from_config(credential=credential, path=config_path)
print(ml_client)

## 1.4. Retrieve or create an Azure Machine Learning compute target

In [None]:
# Retrieve an already attached Azure Machine Learning Compute.
cluster_name = "cpu-cluster"
try:
    ml_client.compute.get(name=cluster_name)
except Exception:
    print('Creating a new compute target...')
    from azure.ml.entities import AmlCompute
    compute = AmlCompute(
        name=cluster_name,
        size="Standard_D2_v2",
        max_instances=2
    )
    ml_client.compute.begin_create_or_update(compute)

# 2. Define command component via YAML
The following example loads command components that are defined in YAML files.

In [None]:

register_model = load_component(yaml_file="./rai_components/component_register_model.yaml")
train_logistic_regression_for_rai = load_component(yaml_file="./rai_components/component_train_logreg.yaml")
fetch_registered_model = load_component(yaml_file="./rai_components/component_fetch_registered_model.yaml")
rai_insights_constructor = load_component(yaml_file="./rai_components/component_rai_insights.yaml")
rai_insights_causal = load_component(yaml_file="./rai_components/component_causal.yaml")
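
Because each component is loaded from a file path, a missing or misnamed YAML file only shows up when `load_component` runs. A small pre-check along the following lines may help. It uses only the standard library, the file names are taken from the cell above, and `missing_component_files` is a hypothetical helper. The demonstration runs against a temporary directory rather than the real `./rai_components` folder.

```python
import tempfile
from pathlib import Path

COMPONENT_FILES = [
    "component_register_model.yaml",
    "component_train_logreg.yaml",
    "component_fetch_registered_model.yaml",
    "component_rai_insights.yaml",
    "component_causal.yaml",
]


def missing_component_files(component_dir, file_names):
    """Return the file names that do not exist under component_dir."""
    base = Path(component_dir)
    return [name for name in file_names if not (base / name).is_file()]


# Demonstration against a temporary directory with only one file present.
with tempfile.TemporaryDirectory() as tmp_dir:
    (Path(tmp_dir) / "component_causal.yaml").write_text("name: placeholder\n")
    missing = missing_component_files(tmp_dir, COMPONENT_FILES)

print(missing)
```

In the notebook, the equivalent call would be `missing_component_files("./rai_components", COMPONENT_FILES)`, which should return an empty list before the components are loaded.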

# 3. Sample pipeline job

## 3.1. Build pipeline

In [None]:

cluster_name = "cpu-cluster"

def submit_and_wait(
        client: MLClient, pipeline_job, expected_state: str = "Completed"
):
    created_job = client.jobs.create_or_update(pipeline_job)
    terminal_states = ["Completed", "Failed", "Canceled", "NotResponding"]
    assert created_job is not None
    assert expected_state in terminal_states

    while created_job.status not in terminal_states:
        time.sleep(30)
        created_job = client.jobs.get(created_job.name)
        print("Latest status : {0}".format(created_job.status))
    if created_job.status != expected_state:
        print(f"Job ended in an unexpected state: {created_job}")
    assert created_job.status == expected_state
    return created_job



version_string = "1"

model_name_suffix = int(time.time())
model_name = "common_fetch_model_adult"
adult_train_pq = JobInput(path=f"adult_train_pq_file:{version_string}", mode="download")

@dsl.pipeline(
    compute=cluster_name,
    description="Register Common Model for Adult",
    experiment_name="pipeline_samples",
)
def register_common_model_for_adult(target_column_name, training_data):
    trained_model = train_logistic_regression_for_rai(
        target_column_name=target_column_name, training_data=training_data
    )

    _ = register_model(
        model_input_path=trained_model.outputs.model_output,
        model_base_name=model_name,
        model_name_suffix=model_name_suffix,
    )

    return {}

training_pipeline = register_common_model_for_adult("income", adult_train_pq)

training_pipeline_job = submit_and_wait(ml_client, training_pipeline)
assert training_pipeline_job is not None

expected_model_id = f"{model_name}_{model_name_suffix}:1"

# ==============================================================

@dsl.pipeline(
    compute=cluster_name,
    description="Causal component with all arguments",
    experiment_name="pipeline_samples",
    default_compute=cluster_name,
)
def rai_causal_classification(
        target_column_name,
        train_data,
        test_data,
):
    fetch_model_job = fetch_registered_model(
        model_id=expected_model_id
    )

    construct_job = rai_insights_constructor(
        title="Run built from DSL",
        task_type="classification",
        model_info_path=fetch_model_job.outputs.model_info_output_path,
        train_dataset=train_data,
        test_dataset=test_data,
        target_column_name=target_column_name,
        categorical_column_names='["Race", "Sex", "Workclass", "Marital Status", "Country", "Occupation"]',
        maximum_rows_for_test_dataset=5000,  # Should be default
        classes="[]",  # Should be default
    )

    causal_job = rai_insights_causal(
        rai_insights_dashboard=construct_job.outputs.rai_insights_dashboard,
        treatment_features='["Age", "Sex"]',
        heterogeneity_features='["Marital Status"]',
        nuisance_model="automl",
        heterogeneity_model="forest",
        alpha=0.06,
        upper_bound_on_cat_expansion=51,
        treatment_cost="[0.1, 0.2]",
        min_tree_leaf_samples=3,
        max_tree_depth=3,
        skip_cat_limit_checks=True,
        categories="auto",
        n_jobs=2,
        verbose=0,
        random_state=10,
    )

    return {}

adult_train_pq = JobInput(path=f"adult_train_pq_file:{version_string}", mode="download")
adult_test_pq = JobInput(path=f"adult_test_pq_file:{version_string}", mode="download")
rai_pipeline = rai_causal_classification(
    target_column_name="income",
    train_data=adult_train_pq,
    test_data=adult_test_pq,
)

rai_pipeline_job = submit_and_wait(ml_client, rai_pipeline)
assert rai_pipeline_job is not None
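
The `submit_and_wait` helper above polls every 30 seconds until the job reaches a terminal state, with no upper limit on how long it waits. One possible variation, sketched here under stated assumptions, adds a timeout and separates the polling logic from the client so it can be exercised without a workspace. `wait_for_terminal_state` and the fake status sequence are hypothetical, and the non-terminal status strings are illustrative only. The terminal states are the ones listed in the cell above.

```python
import time

TERMINAL_STATES = {"Completed", "Failed", "Canceled", "NotResponding"}


def wait_for_terminal_state(get_status, poll_seconds=30.0, timeout_seconds=3600.0, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or the timeout elapses."""
    waited = 0.0
    status = get_status()
    while status not in TERMINAL_STATES:
        if waited >= timeout_seconds:
            raise TimeoutError(f"still '{status}' after {waited:.0f} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    return status


# Fake status source standing in for `client.jobs.get(name).status`.
fake_statuses = iter(["NotStarted", "Running", "Running", "Completed"])
final_status = wait_for_terminal_state(
    lambda: next(fake_statuses),
    poll_seconds=0.0,
    sleep=lambda seconds: None,  # skip real sleeping in this demonstration
)
print(final_status)
```

Against a real workspace, the status callable could be `lambda: ml_client.jobs.get(job.name).status`, where `job` is the object returned by `ml_client.jobs.create_or_update`.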

## 3.2. Submit pipeline job

In [None]:
# submit job to workspace
# pipeline_job = ml_client.jobs.create_or_update(pipeline, experiment_name="pipeline_samples")
# print(f'Job link: {pipeline_job.services["Studio"].endpoint}')
# pipeline_job

In [None]:
# Wait until the job completes
# ml_client.jobs.stream(pipeline_job.name)

# Next Steps
For further examples of running a pipeline job, see [here](/sdk/jobs/pipelines/).