## Set up the pipeline resources

The Azure Machine Learning framework can be used from the CLI, the Python SDK, or the studio interface. In this example, you use the Azure Machine Learning Python SDK v2 to create a pipeline.

Before creating the pipeline, you need the following resources:

* The data asset for training
* The software environment to run the pipeline
* A compute resource where the job runs

## Create handle to workspace

Before diving into the code, you need a way to reference your workspace. You'll create `ml_client` as a handle to the workspace, then use it to manage resources and jobs.

In the next cell, enter your Subscription ID, Resource Group name and Workspace name. To find these values:

1. In the upper right Azure Machine Learning studio toolbar, select your workspace name.
1. Copy the values for workspace, resource group, and subscription ID into the code.
1. Copy one value at a time: copy it, close the panel, paste it, then return for the next one.


In [1]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# authenticate
credential = DefaultAzureCredential()
# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id="<subscription_id>",
    resource_group_name="<resource_group_name>",
    workspace_name="<workspace_name>",
)

# Fetch the registered data asset used as pipeline input
credit_data = ml_client.data.get("credit_fraud_detection", version="1")
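
Rather than pasting the three identifiers into the notebook, you could keep them in a small JSON file. The following is a minimal sketch, not part of the original notebook: the helper name `load_workspace_settings` and the demo values are hypothetical, and the key names (`subscription_id`, `resource_group`, `workspace_name`) assume the layout of the `config.json` file that Azure Machine Learning studio offers for download.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_settings(path):
    """Read workspace identifiers from a JSON file and reject unfilled placeholders."""
    settings = json.loads(Path(path).read_text())
    for key in REQUIRED_KEYS:
        value = settings.get(key, "")
        if not value or value.startswith("<"):
            raise ValueError(f"'{key}' is missing or still a placeholder in {path}")
    # Map the file's keys onto the keyword names used in the MLClient call above
    return {
        "subscription_id": settings["subscription_id"],
        "resource_group_name": settings["resource_group"],
        "workspace_name": settings["workspace_name"],
    }


# Demonstration with a throwaway file and made-up values
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(json.dumps({
        "subscription_id": "00000000-0000-0000-0000-000000000000",
        "resource_group": "my-resource-group",
        "workspace_name": "my-workspace",
    }))
    client_kwargs = load_workspace_settings(demo_path)

print(client_kwargs)
# In the notebook, these could then be passed on as:
# ml_client = MLClient(credential=credential, **client_kwargs)
```

The placeholder check is there so that a forgotten `<workspace_name>` fails early with a readable message instead of surfacing later as an authentication or lookup error.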

## Create a compute resource to run your pipeline

You can **skip this step** if you want to use **serverless compute (preview)** to run the training job. Through serverless compute, Azure Machine Learning takes care of creating, scaling, deleting, patching and managing compute, along with providing managed network isolation, reducing the burden on you. 

Each step of an Azure Machine Learning pipeline can use a different compute resource for running the specific job of that step. It can be single or multi-node machines with Linux or Windows OS, or a specific compute fabric like Spark.

In this section, you provision a Linux [compute cluster](https://docs.microsoft.com/azure/machine-learning/how-to-create-attach-compute-cluster?tabs=python). See the [full list of VM sizes and prices](https://azure.microsoft.com/en-ca/pricing/details/machine-learning/).

For this tutorial, you only need a basic cluster, so use a Standard_DS3_v2 model with 4 vCPU cores and 14 GB of RAM to create an Azure Machine Learning compute cluster.
> [!TIP]
> If you already have a compute cluster, replace "ML-Pipeline-Cluster" in the next code block with the name of your cluster. This keeps you from creating another one.


In [2]:
from azure.ai.ml.entities import AmlCompute

# Name assigned to the compute cluster
cpu_compute_target = "ML-Pipeline-Cluster"

try:
    # let's see if the compute target already exists
    cpu_cluster = ml_client.compute.get(cpu_compute_target)
    print(f"You already have a cluster named {cpu_compute_target}, we'll reuse it as is.")

except Exception:
    print("Creating a new cpu compute target...")

    # Let's create the Azure Machine Learning compute object with the intended parameters
    # if you run into an out of quota error, change the size to a comparable VM that is available.
    # Learn more on https://azure.microsoft.com/en-us/pricing/details/machine-learning/.
    cpu_cluster = AmlCompute(
        name=cpu_compute_target,
        # Azure Machine Learning Compute is the on-demand VM service
        type="amlcompute",
        # VM Family
        size="STANDARD_DS3_V2",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Nodes in cluster
        max_instances=4,
        # Seconds a node stays idle after a job finishes before it is scaled down
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
        tier="Dedicated",
    )
    print(f"AMLCompute with name {cpu_cluster.name} will be created, with compute size {cpu_cluster.size}")
    # Pass the object to MLClient's begin_create_or_update method
    # and wait for provisioning to finish
    cpu_cluster = ml_client.compute.begin_create_or_update(cpu_cluster).result()

You already have a cluster named ML-Pipeline-Cluster, we'll reuse it as is.
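
The cell above catches any `Exception`, so an authentication or network failure would also lead to a creation attempt. The sketch below shows a narrower get-or-create pattern. So that it runs without Azure, `ResourceNotFoundError` and `FakeComputeOperations` are hypothetical stand-ins defined in the example itself; in the notebook, the corresponding exception would presumably be `azure.core.exceptions.ResourceNotFoundError`, which is an assumption to verify against your SDK version.

```python
class ResourceNotFoundError(Exception):
    """Stand-in for azure.core.exceptions.ResourceNotFoundError."""


class FakeComputeOperations:
    """Hypothetical in-memory stand-in for ml_client.compute."""

    def __init__(self):
        self._clusters = {}

    def get(self, name):
        if name not in self._clusters:
            raise ResourceNotFoundError(name)
        return self._clusters[name]

    def create(self, name, spec):
        self._clusters[name] = spec
        return spec


def get_or_create(compute_ops, name, spec):
    """Reuse the cluster if it exists; create it only on a not-found error."""
    try:
        return compute_ops.get(name), False
    except ResourceNotFoundError:
        # Any other exception type propagates to the caller unchanged
        return compute_ops.create(name, spec), True


ops = FakeComputeOperations()
spec = {"size": "STANDARD_DS3_V2", "min_instances": 0, "max_instances": 4}
_, created_first = get_or_create(ops, "ML-Pipeline-Cluster", spec)
_, created_second = get_or_create(ops, "ML-Pipeline-Cluster", spec)
print(created_first, created_second)  # True False
```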


## Create a job environment for pipeline steps

So far, you've created a development environment on the compute instance, your development machine. You also need an environment to use for each step of the pipeline. Each step can have its own environment, or you can use some common environments for multiple steps.

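In this example, the jobs run in a conda environment defined by a conda YAML file. The environment `aml-scikit-learn` is already registered in the workspace, so the cell below retrieves it by name and version. The commented-out lines show how it could be created from a `conda.yaml` file stored in a `./dependencies` directory.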

In [3]:
from azure.ai.ml.entities import Environment

custom_env_name = "aml-scikit-learn"
ver = "0.1.1"

# Uncomment to create the environment from a conda file (also needs `import os`)
# dependencies_dir = './dependencies'
# pipeline_job_env = Environment( name=custom_env_name,
#                                 description="Custom environment for Credit Card Defaults pipeline",
#                                 tags={"scikit-learn": "0.24.2"},
#                                 conda_file=os.path.join(dependencies_dir, "conda.yaml"),
#                                 image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
#                                 version=ver,
#                               )
# pipeline_job_env = ml_client.environments.create_or_update(pipeline_job_env)

# GET ENVIRONMENT
pipeline_job_env = ml_client.environments.get(name=custom_env_name, version=ver)


print(f"Environment with name {pipeline_job_env.name} is registered to workspace, the environment version is {pipeline_job_env.version}")

Environment with name aml-scikit-learn is registered to workspace, the environment version is 0.1.1
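
The commented-out code above refers to a `conda.yaml` file in a `./dependencies` directory. As a rough sketch of how such a file could be produced, the example below creates the directory and writes a conda specification into it. The package list is an assumption for illustration only (the source hints at scikit-learn 0.24.2 through the environment tag, nothing more), and a temporary directory stands in for `./dependencies`.

```python
import os
import tempfile

# Hypothetical conda specification; adjust packages and versions to your project
CONDA_YAML = """\
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip
  - scikit-learn=0.24.2
  - pandas
  - pip:
      - mlflow
      - azureml-mlflow
"""


def write_conda_file(dependencies_dir):
    """Create the directory if needed and write conda.yaml into it."""
    os.makedirs(dependencies_dir, exist_ok=True)
    conda_path = os.path.join(dependencies_dir, "conda.yaml")
    with open(conda_path, "w") as handle:
        handle.write(CONDA_YAML)
    return conda_path


# A temporary directory stands in for './dependencies' in this demonstration
with tempfile.TemporaryDirectory() as tmp:
    conda_path = write_conda_file(os.path.join(tmp, "dependencies"))
    file_exists = os.path.isfile(conda_path)
    with open(conda_path) as handle:
        written = handle.read()

print(os.path.basename(conda_path), file_exists)
```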


## Build the training pipeline

Now that you have all assets required to run your pipeline, it's time to build the pipeline itself.

Azure Machine Learning pipelines are reusable ML workflows that usually consist of several components. The typical life of a component is:

- Write the YAML specification of the component, or create it programmatically with `command()`.
- Optionally, register the component with a name and version in your workspace, to make it reusable and shareable.
- Load that component from the pipeline code.
- Implement the pipeline using the component's inputs, outputs and parameters.
- Submit the pipeline.

There are two ways to create a component: programmatic definition and YAML definition. The next section creates both components of this pipeline using the programmatic definition.
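
For comparison, the YAML route could look roughly like the sketch below, which writes a component specification file from Python. The inputs, outputs, and command mirror the training component defined later in this notebook, but the component name `train_credit_fraud_model` and the file name `train.yml` are hypothetical, and the spec has not been validated against the service.

```python
import os
import tempfile

# Hypothetical YAML spec for a command component with two inputs and one output
TRAIN_COMPONENT_YAML = """\
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_credit_fraud_model
display_name: Training Model
type: command
inputs:
  processed_data:
    type: uri_folder
  registered_model_name:
    type: string
outputs:
  model:
    type: uri_folder
code: .
environment: azureml:aml-scikit-learn:0.1.1
command: >-
  python train.py
  --input_data ${{inputs.processed_data}}
  --registered_model_name ${{inputs.registered_model_name}}
  --model ${{outputs.model}}
"""


def write_component_spec(folder, filename="train.yml"):
    """Write the YAML spec into folder and return the file path."""
    os.makedirs(folder, exist_ok=True)
    spec_path = os.path.join(folder, filename)
    with open(spec_path, "w") as handle:
        handle.write(TRAIN_COMPONENT_YAML)
    return spec_path


# A temporary directory stands in for the component's source folder
with tempfile.TemporaryDirectory() as tmp:
    spec_path = write_component_spec(os.path.join(tmp, "src"))
    with open(spec_path) as handle:
        spec_text = handle.read()

print(os.path.basename(spec_path))  # train.yml
```

A file like this could then be loaded in the pipeline code with `load_component(source=...)` from `azure.ai.ml`, in place of the `command()` call.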

> [!NOTE]
> For simplicity, this tutorial uses the same compute for all components. However, you can set a different compute for each component, for example by adding a line like `train_step.compute = "cpu-cluster"`. To view an example of building a pipeline with different computes for each component, see the [Basic pipeline job section in the cifar-10 pipeline tutorial](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/pipelines/2b_train_cifar_10_with_pytorch/train_cifar_10_with_pytorch.ipynb).

### Create components: data preparation and training (using programmatic definition)

Let's start by creating the components. The first component handles the preprocessing of the data; this task is performed in the *data_preparation.py* Python file. The second component trains the model using *train.py*.

Both scripts live in the `./src` source folder, which the cell below passes to each component as `code`:

In [47]:
from azure.ai.ml import command
from azure.ai.ml import Input, Output

scripts_dir = "./src"
data_prep_component = command( name="Data prep CreditFraud Detection",
                               display_name ="Data preparation for training",
                               description  ="reads input data & preprocesses it",
                               inputs= { "data": Input(type="uri_folder") },
                               outputs=dict( processed_data=Output(type="uri_folder", mode="rw_mount")),
                               # The source folder of the component
                               code=scripts_dir,
                               command="""python data_preparation.py \
                                        --data ${{inputs.data}} \
                                        --processed_data ${{outputs.processed_data}} \
                                        """,
                               environment=f"{pipeline_job_env.name}:{pipeline_job_env.version}",
                            )

train_component = command( name="Training  Model",
                            display_name ="Training Model",
                            # description  ="trains the model on the preprocessed data",
                            inputs= { "processed_data": Input(type="uri_folder"),
                                      # NOTE: declared as an input, but not forwarded to train.py in the command below
                                      "test_train_ratio": Input(type='number'),
                                      "registered_model_name":Input(type='string'),
                                    },
                            outputs=dict(model=Output(type="uri_folder", mode="rw_mount")),
                            # The source folder of the component
                            code=scripts_dir,
                            command="""python train.py \
                                    --input_data ${{inputs.processed_data}} \
                                    --registered_model_name ${{inputs.registered_model_name}} \
                                    --model ${{outputs.model}} \
                                    """,
                            environment=f"{pipeline_job_env.name}:{pipeline_job_env.version}",
                            )
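
The `command` string of the first component passes `--data` and `--processed_data` to *data_preparation.py*, which is not shown here. As a minimal sketch of what the argument handling in such a script might look like, the example below parses the two options and writes into the output folder. The preprocessing step itself is a placeholder (files are copied unchanged), so treat everything inside `main` as an assumption rather than the actual script.

```python
import argparse
import os
import shutil
import tempfile


def parse_args(argv=None):
    """Parse the two options named in the component's command string."""
    parser = argparse.ArgumentParser(description="Prepare data for training")
    parser.add_argument("--data", type=str, required=True, help="input folder")
    parser.add_argument("--processed_data", type=str, required=True, help="output folder")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    os.makedirs(args.processed_data, exist_ok=True)
    # Placeholder step: copy every input file unchanged to the output folder
    copied = []
    for name in sorted(os.listdir(args.data)):
        source = os.path.join(args.data, name)
        if os.path.isfile(source):
            shutil.copy(source, os.path.join(args.processed_data, name))
            copied.append(name)
    return copied


# Local demonstration with temporary folders in place of the mounted paths
with tempfile.TemporaryDirectory() as tmp:
    in_dir = os.path.join(tmp, "in")
    out_dir = os.path.join(tmp, "out")
    os.makedirs(in_dir)
    with open(os.path.join(in_dir, "sample.csv"), "w") as handle:
        handle.write("a,b\n1,2\n")
    copied_files = main(["--data", in_dir, "--processed_data", out_dir])

print(copied_files)  # ['sample.csv']
```

When the component runs, Azure Machine Learning replaces `${{inputs.data}}` and `${{outputs.processed_data}}` with concrete paths, so the script only ever sees ordinary folder paths.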

### Create the pipeline using components
To code the pipeline, you use the `@dsl.pipeline` decorator, which identifies Azure Machine Learning pipelines. In the decorator, you can specify the pipeline description and default resources like compute and storage. Like a Python function, a pipeline can have inputs, so you can create multiple instances of a single pipeline with different inputs.

Here, *input data*, *split ratio* and *registered model name* are the input variables. The pipeline function calls the components and connects them through their input and output identifiers. The outputs of each step can be accessed via the `.outputs` property.


In [48]:
# the dsl decorator tells the sdk that we are defining an Azure Machine Learning pipeline
from azure.ai.ml import dsl, Input, Output


@dsl.pipeline(compute=cpu_compute_target, description="E2E data_prep-train pipeline")
def credit_fraud_detection_pipeline(input_data, test_train_ratio, registered_model_name):
    # call data_prep_component like a Python function with its own inputs
    data_prep_job = data_prep_component(data=input_data)

    # call train_component the same way, feeding it the output of the previous step
    train_job = train_component(
        processed_data=data_prep_job.outputs.processed_data,
        test_train_ratio=test_train_ratio,
        registered_model_name=registered_model_name,
    )

    # a pipeline can return a dictionary of outputs,
    # keyed by the pipeline output identifier, for example:
    # return {"processed_data": data_prep_job.outputs.processed_data}

#### Instantiate the pipeline

In [49]:
registered_model_name = "FraudDetectionModel"

# Instantiate the pipeline with the parameters of our choice
pipeline = credit_fraud_detection_pipeline(
    input_data=Input(type="uri_file", path=credit_data.path),
    test_train_ratio=0.25,
    registered_model_name=registered_model_name,
)

## Submit the job 

It's now time to submit the job to run in Azure Machine Learning. To do so, you use `create_or_update` on `ml_client.jobs`.

Here you also pass an experiment name. An experiment is a container for all the iterations you run on a given project. All jobs submitted under the same experiment name are listed next to each other in Azure Machine Learning studio.

Once completed, the pipeline registers a model in your workspace as a result of training.

In [50]:
# submit the pipeline job
pipeline_job = ml_client.jobs.create_or_update(
    pipeline,
    experiment_name="e2e_registered_components",
)
# stream the job logs until the run finishes
ml_client.jobs.stream(pipeline_job.name)

RunId: keen_arm_rd3wjr89rm
Web View: https://ml.azure.com/runs/keen_arm_rd3wjr89rm?wsid=/subscriptions/4c3b2838-71d0-44f4-9f40-4539213bfcf4/resourcegroups/rg-dev-allerganconnect/workspaces/Clustering_Analysis

Streaming logs/azureml/executionlogs.txt

[2023-07-10 09:32:14Z] Completing processing run id d56fefbb-fe18-4e29-9141-33fa4db6b163.
[2023-07-10 09:32:15Z] Submitting 1 runs, first five are: 1538f8dd:7edb7d7d-85c9-469d-bfe5-2f33c6a104c9
[2023-07-10 09:40:53Z] Completing processing run id 7edb7d7d-85c9-469d-bfe5-2f33c6a104c9.

Execution Summary
RunId: keen_arm_rd3wjr89rm
Web View: https://ml.azure.com/runs/keen_arm_rd3wjr89rm?wsid=/subscriptions/4c3b2838-71d0-44f4-9f40-4539213bfcf4/resourcegroups/rg-dev-allerganconnect/workspaces/Clustering_Analysis
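
`ml_client.jobs.stream` blocks and prints logs until the run ends. If you would rather poll for the outcome, a loop like the one sketched below is one option. `wait_for_job` is a hypothetical helper, the status getter is injected so the example runs without Azure, and the set of terminal status names is an assumption to check against your SDK version. In the notebook, the getter might be `lambda: ml_client.jobs.get(pipeline_job.name).status`.

```python
import time

# Assumed names of terminal job states; verify against the SDK you use
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_job(get_status, poll_seconds=30.0, max_polls=120):
    """Call get_status() until it returns a terminal state or polls run out."""
    status = None
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"job still in state {status!r} after {max_polls} polls")


# Demonstration with a scripted sequence instead of a real job
scripted = iter(["Queued", "Running", "Running", "Completed"])
final_status = wait_for_job(lambda: next(scripted), poll_seconds=0.0)
print(final_status)  # Completed
```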



In [16]:
import os

# Show the notebook's current working directory
current_dir = os.getcwd()
print(current_dir)
print('Dir:', current_dir)

/mnt/batch/tasks/shared/LS_root/mounts/clusters/computeinstance-m/code/Users/musthaq.mohammed/AzureML/CreditFraudDetection
Dir: /mnt/batch/tasks/shared/LS_root/mounts/clusters/computeinstance-m/code/Users/musthaq.mohammed/AzureML/CreditFraudDetection
