# NYC taxi data regression

**Requirements** - In order to benefit from this tutorial, you will need:
- A basic understanding of Machine Learning
- An Azure account with an active subscription - [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace with a compute cluster - [Configure workspace](../../configuration.ipynb)
- A Python environment
- The Azure Machine Learning Python SDK v2 installed - [install instructions](../../../README.md) (see the Getting Started section)

**Learning Objectives** - By the end of this tutorial, you should be able to:
- Connect to your AML workspace from the Python SDK
- Define several `CommandComponent` objects using YAML
- Create a `Pipeline` from components loaded from YAML

**Motivations** - This notebook explains how to load components via the SDK and use them to build a pipeline. Using the NYC taxi dataset, we build a pipeline with five steps: prepare data, transform data, train a model, predict results, and evaluate model performance.

# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

## 1.1 Import the required libraries

In [1]:
!pip install azure-identity 
!pip install azure-ai-ml


In [2]:
# import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

from azure.ai.ml import MLClient, Input
from azure.ai.ml.dsl import pipeline
from azure.ai.ml import load_component

## 1.2 Configure credential

We use `DefaultAzureCredential` to access the workspace. `DefaultAzureCredential` should be able to handle most Azure SDK authentication scenarios.

If it does not work for you, see these references for other available credentials: [configure credential example](../../configuration.ipynb), [azure-identity reference doc](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python).

In [5]:
try:
    credential = DefaultAzureCredential()
    # Check that the credential can obtain a token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work.
    credential = InteractiveBrowserCredential()

## 1.3 Get a handle to the workspace

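We connect to the workspace by passing the subscription ID, resource group, and workspace name to `MLClient`. Alternatively, `MLClient.from_config` can read these details from a config file. The Azure ML workspace should be configured with a compute cluster. [See this notebook to configure a workspace](../../configuration.ipynb)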

In [6]:
# Enter details of your AML workspace
subscription_id = "096b3461-7e4d-4cc6-a17c-6f3d723c6277"
resource_group = "testrgajuaza01"
workspace = "Testworkspace"

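# Get a handle to the workspace, reusing the credential configured in section 1.2
# Alternative: read the workspace details from a config file
# ml_client = MLClient.from_config(credential=credential)
ml_client = MLClient(credential, subscription_id, resource_group, workspace)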

# # Retrieve an already attached Azure Machine Learning Compute.
# cluster_name = "Redi"
# print(ml_client.compute.get(cluster_name))
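The commented-out `MLClient.from_config` call in the cell above expects a `config.json` file holding the workspace details. The following is a minimal sketch of creating such a file, using only the standard library. The helper name `write_workspace_config` and the placeholder values are assumptions for illustration; the three key names are the ones the SDK reads.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write a config.json containing the three workspace keys (hypothetical helper)."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path


# Placeholder values (assumptions); substitute your own workspace details.
config_path = write_workspace_config(
    tempfile.mkdtemp(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group="<RESOURCE_GROUP>",
    workspace_name="<WORKSPACE_NAME>",
)
print(config_path.read_text(encoding="utf-8"))

# With the file in place, the config-based call could look like this (not run here):
# ml_client = MLClient.from_config(credential=credential, path=str(config_path))
```

Keeping the workspace details in a file rather than in the notebook makes it easier to share the notebook without sharing subscription identifiers.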

# 2. Build pipeline
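
The cell below loads five component definitions from YAML files that this notebook does not display. As a rough illustration of what a command component spec can look like, the sketch below writes a hypothetical `prep.yml` to a temporary folder. Only the input name `raw_data` and the output name `prep_data` come from the pipeline code in this notebook; the component name, script name, source folder, and environment placeholder are assumptions, not the contents of the real file.

```python
import tempfile
from pathlib import Path

# Hypothetical component spec. Everything except the raw_data / prep_data
# port names is an illustrative assumption, not the real prep.yml.
PREP_COMPONENT_YAML = """\
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
type: command
name: prep_taxi_data
display_name: PrepTaxiData
version: 1
inputs:
  raw_data:
    type: uri_folder
outputs:
  prep_data:
    type: uri_folder
code: ./prep_src
environment: azureml:<your-environment-name>@latest
command: >-
  python prep.py
  --raw_data ${{inputs.raw_data}}
  --prep_data ${{outputs.prep_data}}
"""


def write_component_spec(directory: Path, filename: str = "prep.yml") -> Path:
    """Write the sketch spec to disk, where load_component(source=...) could read it."""
    path = directory / filename
    path.write_text(PREP_COMPONENT_YAML, encoding="utf-8")
    return path


spec_path = write_component_spec(Path(tempfile.mkdtemp()))
print(spec_path.read_text(encoding="utf-8"))
```

The `inputs` and `outputs` keys declared in the spec become the keyword arguments and the `.outputs.<name>` attributes used when the component is called inside the pipeline function.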

In [7]:
parent_dir = ""

# 1. Load components
prepare_data = load_component(source=parent_dir + "./prep.yml")
transform_data = load_component(source=parent_dir + "./transform.yml")
train_model = load_component(source=parent_dir + "./train.yml")
predict_result = load_component(source=parent_dir + "./predict.yml")
score_data = load_component(source=parent_dir + "./score.yml")

# 2. Construct pipeline
@pipeline()
def nyc_taxi_data_regression(pipeline_job_input):
    """NYC taxi data regression example."""
    prepare_sample_data = prepare_data(raw_data=pipeline_job_input)
    transform_sample_data = transform_data(
        clean_data=prepare_sample_data.outputs.prep_data
    )
    train_with_sample_data = train_model(
        training_data=transform_sample_data.outputs.transformed_data
    )
    predict_with_sample_data = predict_result(
        model_input=train_with_sample_data.outputs.model_output,
        test_data=train_with_sample_data.outputs.test_data,
    )
    score_with_sample_data = score_data(
        predictions=predict_with_sample_data.outputs.predictions,
        model=train_with_sample_data.outputs.model_output,
    )
    return {
        "pipeline_job_prepped_data": prepare_sample_data.outputs.prep_data,
        "pipeline_job_transformed_data": transform_sample_data.outputs.transformed_data,
        "pipeline_job_trained_model": train_with_sample_data.outputs.model_output,
        "pipeline_job_test_data": train_with_sample_data.outputs.test_data,
        "pipeline_job_predictions": predict_with_sample_data.outputs.predictions,
        "pipeline_job_score_report": score_with_sample_data.outputs.score_report,
    }


pipeline_job = nyc_taxi_data_regression(
    Input(type="uri_folder", path=parent_dir + "./data/")
)
# Example: change a pipeline output setting (here, the output mode)
pipeline_job.outputs.pipeline_job_prepped_data.mode = "rw_mount"

# set pipeline level compute
pipeline_job.settings.default_compute = "Redi"
# set pipeline level datastore
pipeline_job.settings.default_datastore = "workspaceblobstore"
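
The pipeline function above never states an execution order; the order follows from which step consumes which output. The sketch below restates that wiring as a plain dependency graph and sorts it with the standard library, as an illustration of the idea only. It is not how the SDK schedules jobs internally.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps whose outputs it consumes,
# mirroring the wiring inside nyc_taxi_data_regression.
dependencies = {
    "prepare_data": set(),
    "transform_data": {"prepare_data"},
    "train_model": {"transform_data"},
    "predict_result": {"train_model"},
    "score_data": {"predict_result", "train_model"},
}

# A topological sort yields an order in which every step runs after its inputs exist.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Because `score_data` needs both the predictions and the trained model, it can only run last, even though the function body does not say so explicitly.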

# 3. Submit pipeline job

In [8]:
# submit job to workspace
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="pipeline_samples"
)
pipeline_job


HttpResponseError: (AuthorizationFailed) The client 'georet@dtu.dk' with object id '75ae6087-e4ff-4199-a47d-0ad68ddb47fd' does not have authorization to perform action 'Microsoft.MachineLearningServices/workspaces/codes/versions/write' over scope '/subscriptions/096b3461-7e4d-4cc6-a17c-6f3d723c6277/resourceGroups/testrgajuaza01/providers/Microsoft.MachineLearningServices/workspaces/Testworkspace/codes/353f49a1-5c7b-4966-a3af-760f4d072c0d/versions/1' or the scope is invalid. If access was recently granted, please refresh your credentials.
Code: AuthorizationFailed
Message: The client 'georet@dtu.dk' with object id '75ae6087-e4ff-4199-a47d-0ad68ddb47fd' does not have authorization to perform action 'Microsoft.MachineLearningServices/workspaces/codes/versions/write' over scope '/subscriptions/096b3461-7e4d-4cc6-a17c-6f3d723c6277/resourceGroups/testrgajuaza01/providers/Microsoft.MachineLearningServices/workspaces/Testworkspace/codes/353f49a1-5c7b-4966-a3af-760f4d072c0d/versions/1' or the scope is invalid. If access was recently granted, please refresh your credentials.
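
The `AuthorizationFailed` response above names the denied action and the scope it was attempted on, which is the information a workspace administrator needs in order to grant access. The sketch below pulls those two fields out of a message of the same shape. The helper name `parse_authorization_failure` is hypothetical, and the sample message uses placeholders instead of real identifiers.

```python
import re


def parse_authorization_failure(message: str) -> dict:
    """Extract the denied action and scope from an AuthorizationFailed message."""
    action = re.search(r"perform action '([^']+)'", message)
    scope = re.search(r"over scope '([^']+)'", message)
    return {
        "action": action.group(1) if action else None,
        "scope": scope.group(1) if scope else None,
    }


# Placeholder message in the same shape as the output above (identifiers elided on purpose).
sample = (
    "The client '<user>' with object id '<object-id>' does not have authorization "
    "to perform action 'Microsoft.MachineLearningServices/workspaces/codes/versions/write' "
    "over scope '/subscriptions/<subscription-id>/resourceGroups/<resource-group>' "
    "or the scope is invalid."
)
details = parse_authorization_failure(sample)
print(details["action"])
print(details["scope"])
```

In the notebook, the same helper could be applied to `str(err)` inside an `except` block around the `create_or_update` call. Here the denied action is the write to `codes/versions`, so the signed-in identity would need a role on the workspace that includes that permission.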

In [None]:
# Wait until the job completes
ml_client.jobs.stream(pipeline_job.name)
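
`jobs.stream` blocks while printing logs. When only the final state matters, polling the job status is an alternative. The sketch below is a generic polling loop, written under the assumption that `"Completed"`, `"Failed"`, and `"Canceled"` are the terminal states of interest; the helper name and its parameters are hypothetical. It takes a callable so it can be demonstrated without a workspace.

```python
import time

# Assumed terminal states; check the SDK documentation for the full list.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_terminal_state(get_status, poll_seconds=30.0, timeout_seconds=3600.0,
                            sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or the timeout passes."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still in state {status!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage with the notebook's client (not executed here):
# final = wait_for_terminal_state(lambda: ml_client.jobs.get(pipeline_job.name).status)

# Self-contained demonstration with a fake status source and no real sleeping.
fake_statuses = iter(["Running", "Running", "Completed"])
final_status = wait_for_terminal_state(lambda: next(fake_statuses), sleep=lambda s: None)
print(final_status)
```

Passing the status lookup as a callable keeps the loop independent of the SDK, so the timeout and polling interval can be tested without submitting a job.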

# Next Steps
You can see further examples of running a pipeline job [here](../)