## Using Azure Machine Learning Studio with Python

We will go through some relevant components of AzureML and how to manage them from the Python SDK. The goal is to use AzureML for flexible model training and management, and to build useful, flexible pipeline components.

We will discuss:
- SDK
- compute
- environments
- models
- jobs
- pipeline components
- pipelines

### Dependencies

- Azure CLI
```python
#libraries for azure machine learning
azureml-core
mlflow
azure-ai-ml

#libraries for azure identity and access management
azure-identity
```

### Connect from the SDK

All interactions with AzureML through the SDK go through the `MLClient` object. It authenticates with a credential from `azure-identity`: either a `DefaultAzureCredential` or `AzureCliCredential`, which can pick up your logged-in Azure CLI session, or an `InteractiveBrowserCredential`.

To create the `MLClient`, download the config.json file from AzureML and store it in the root of your project:

<img src="./images/configjson.png" alt="Config json download location in Azure ML Studio"/>



In [1]:
#create the MLClient object
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
ml_client = MLClient.from_config(credential=credential)

print("MLClient created successfully.")

Found the config file in: /home/daniel/repos/aml_demo/config.json


MLClient created successfully.
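
`MLClient.from_config` reads the workspace coordinates from config.json. As a minimal sketch of what that file holds, the snippet below writes a sample file with placeholder values and checks that the three expected keys are present. The helper `load_workspace_config` and the placeholder values are illustrative assumptions, not part of the SDK.

```python
import json
import tempfile
from pathlib import Path

# Keys found in the config.json downloaded from the studio.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_config(path):
    """Hypothetical helper: load config.json and fail early if a key is missing."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"config.json is missing: {', '.join(missing)}")
    return config


# Placeholder values; the real file contains your own identifiers.
sample = {
    "subscription_id": "<subscription-id>",
    "resource_group": "<resource-group>",
    "workspace_name": "<workspace-name>",
}

with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps(sample, indent=2), encoding="utf-8")
    config = load_workspace_config(config_path)

print(sorted(config))
```

A check like this is only a convenience for catching a misplaced or truncated file before authentication is attempted.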


## Compute
We need to provision compute to do our work for us. We can create:
- compute instances which are aimed at running notebooks and development
- compute clusters which are for our heavy workloads

We can also run jobs and commands on `serverless` compute these days, but let's create a cluster. You can do this manually in the studio or with the SDK:

In [2]:
from azure.ai.ml.entities import AmlCompute

compute_name = "defaultcompute"
helloworld_compute= AmlCompute(name=compute_name, 
                               size="STANDARD_DS3_v2", 
                               min_instances=0, 
                               max_instances=1, 
                               idle_time_before_scale_down=300)

## We already have this in the workspace, so this is just to show how to create a new one. No need to run it.
# ml_client.compute.begin_create_or_update(helloworld_compute)

for c in ml_client.compute.list():
    print(c.name)

bigcompute
defaultcompute


## Environments

Environments are basically Docker images stored in the container registry associated with your AzureML workspace. These Docker images are used to run your code on the compute resources. You can create environments quite simply from a conda specification, for example:

```python
%%writefile ../environments/helloworld.yaml
name: catsanddogsenv
channels:
  - conda-forge
dependencies:
  - python=3.8
  - numpy=1.21.2
  - pip=21.2.4
  - scikit-learn=0.24.2
  - scipy=1.7.1
  - pandas>=1.1,<1.2
```

In [3]:
from azure.ai.ml.entities import Environment

#give our environment a name
custom_env_name = "helloworldenv"

helloworldenv = Environment(
    name=custom_env_name,
    description="Custom environment for our hello world example",
    tags={"scikit-learn": "0.24.2"}, #you can add tags to your environment
    conda_file= "../environments/helloworld.yaml",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest", #you have to supply a base docker image. This one is from AzureML
    )

# We already have this in the workspace, so this is just to show how to create a new one. No need to run it.
# ml_client.environments.create_or_update(helloworldenv)

# List all environments in the workspace along with their latest versions. 
for env in ml_client.environments.list():
    print(f"{env.name}:{env.latest_version}")

yolofromdocker:3
catsanddogsenv:1
helloworldenv:1
AzureML-ACPT-pytorch-1.13-py38-cuda11.7-gpu:10


## Environments from docker specifications

If you want to use a custom Docker specification to build an environment, you can do that. This is useful when your dependencies require very specific OS tweaks. For example:

In [4]:
from azure.ai.ml.entities import Environment, BuildContext

env_docker_context = Environment(
    build=BuildContext(path="azureml-environment"), # Path to the directory containing your Dockerfile
    name="yolofromdocker",
    description="Environment created from a Docker context.",
)
## Uncomment to create the environment
# ml_client.environments.create_or_update(env_docker_context)

So then we have our environments in the studio:

![environments](./images/environments.png)


## Commands and Jobs
We can now bring stuff together to actually do some work!
We can define `commands`, and submit those to our compute as `jobs`.
A `command` requires information on:
- what environment to use
- what compute to use
- which code context to use
- what command to execute.

You can think of it as simply spinning up a docker container with a command. 

In [5]:
from azure.ai.ml import command
#a job that runs a simple hello world command
helloworldjob = command(name="helloworldjob",
                        compute="defaultcompute", #supply the compute name we created earlier
                        environment="helloworldenv:1", #specify the environment we created earlier, and the version
                        command="echo 'Hello World!'") #the command we want to run, this can be a call to a python script, or a bash script or whatever.

job = ml_client.jobs.create_or_update(helloworldjob) #submit the job to the workspace
print(f"Job {job.name} submitted to the workspace. Monitor it at {job.studio_url}")


Class AutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteConditionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseAutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class IntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ProtectionLevelSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseIntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


Job helloworldjob submitted to the workspace. Monitor it at https://ml.azure.com/runs/helloworldjob?wsid=/subscriptions/cad4de40-836d-424a-bdd6-3296c5c25179/resourcegroups/rgdaniel/workspaces/amldaniel&tid=8b87af7d-8647-4dc7-8df4-5f69a2011bb5


![helloworldjob](./images/helloworldjob.png)

## Data

We need data to work with. There are a couple of ways to get access to it:
- register datasets
    - these are versioned and directly accessible through AML
- connect to a storage account directly.

In [6]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

cats_and_dogs_data = Data(
                          name="catsanddogs_flat", 
                          path="../data/catsanddogs_flat",
                          type=AssetTypes.URI_FOLDER,
                          description="A dataset containing images of cats and dogs"
                          )

# ml_client.data.create_or_update(cats_and_dogs_data) #submit the data to the workspace

datasets = ml_client.data.list() #list all datasets in the workspace
for dataset in datasets:
    print(f"{dataset.name}:{dataset.latest_version}")

catsanddogs_flat:1
coco128:2


To connect containers in your storage account, you need to find `connections` in the portal and navigate to the container:
![container connections](./images/connectstorage.png)

In [7]:
#these are available through datastores:

datastores = ml_client.datastores.list() #list all datastores in the workspace
for datastore in datastores:
    print(f"{datastore.name}:{datastore.type}")

azureml_globaldatasets:AzureBlob
container_processed:AzureBlob
container_raw:AzureBlob
workspacefilestore:AzureFile
workspaceblobstore:AzureBlob
workspaceworkingdirectory:AzureFile
workspaceartifactstore:AzureBlob
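
Jobs address locations inside a datastore through URIs of the form `azureml://datastores/<datastore>/paths/<path>`, which is the form used for the output paths later in this document. A small sketch of building such a URI from its parts follows; `datastore_uri` is a hypothetical helper, not an SDK function.

```python
def datastore_uri(datastore_name, *path_parts):
    """Hypothetical helper: build an azureml:// URI for a path inside a datastore."""
    # Strip stray slashes so the parts join cleanly, and skip empty parts.
    cleaned = [part.strip("/") for part in path_parts if part.strip("/")]
    return f"azureml://datastores/{datastore_name}/paths/{'/'.join(cleaned)}"


uri = datastore_uri("container_processed", "presentation_examples/", "hello_world")
print(uri)
```

Building the URI in one place keeps the datastore name and folder layout from being retyped in every job definition.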


## Pipelines!

We have all the pieces we need to start utilizing the power of AzureML pipelines! You can think of a pipeline as a bunch of jobs cobbled together through their `Inputs` and `Outputs`. 

#### Inputs
- `str`, `int`, `float` etc.
- `AzureML Datasets`
- `uri folders` --> paths to locations in connected blobstorage
- `uri files` --> paths to specific files in connected blobstorage
- `models` --> from the registered models

#### Outputs
- `uri folder`
- `uri file`
- `custom_model`
- `mlflow_model`

### Pipeline components

If we take a simple `command job` like our Hello World example above and give it `inputs` and `outputs`, we have the beginning of a usable pipeline component. Let's say we want the component to take our name and write a file somewhere noting that it did so:

In [10]:
from azure.ai.ml import Input, Output

inputs = {'name': 'Daniel'}
outputs = {'output_path': Output(type="uri_folder", path="azureml://datastores/container_processed/paths/presentation_examples/hello_world")}

job_component = command(
    name="helloworldjobcomponent4",
    inputs=inputs,
    outputs=outputs,
    compute="defaultcompute",
    environment="helloworldenv:1",
    command="echo 'Hello ${{inputs.name}}' >> ${{outputs.output_path}}/hello.txt"
)
# Submit the job to the workspace (uncomment to run it). The print is commented out as well,
# because without the submission `job` would still refer to the earlier helloworldjob.
# job = ml_client.jobs.create_or_update(job_component)
# print(f"Job {job.name} submitted to the workspace. Monitor it at {job.studio_url}")


![alt text](./images/helloworkdjobcomponent4.png)

![](./images/helloworldjobcomponentoutputpath.png)
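
The `${{inputs.name}}` and `${{outputs.output_path}}` expressions in the command string are placeholders that AzureML resolves when the job runs. To make the idea concrete, here is a rough sketch of that substitution in plain Python. It is an illustration only and not how the service implements it; `render_command` and the mount path are assumptions.

```python
import re

# Matches ${{inputs.<key>}} and ${{outputs.<key>}}.
PLACEHOLDER = re.compile(r"\$\{\{\s*(inputs|outputs)\.(\w+)\s*\}\}")


def render_command(template, inputs, outputs):
    """Illustrative only: fill command placeholders with concrete values."""
    bindings = {"inputs": inputs, "outputs": outputs}

    def substitute(match):
        group, key = match.group(1), match.group(2)
        return str(bindings[group][key])

    return PLACEHOLDER.sub(substitute, template)


rendered = render_command(
    "echo 'Hello ${{inputs.name}}' >> ${{outputs.output_path}}/hello.txt",
    inputs={"name": "Daniel"},
    outputs={"output_path": "/mnt/outputs/output_path"},  # hypothetical path on the compute
)
print(rendered)
```

The point is that the command itself never hard-codes a name or a storage location; both arrive through the job's inputs and outputs.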

### The pipeline decorator

You can tell by the job name that jobs cannot be easily rerun, and specifying the inputs and outputs in code like this every time is not optimal either. This is where the `pipeline` decorator comes in. We can create a function that packages this job into a pipeline that is rerunnable and deployable. We redefine the job to put the name parameter in the filename and to append a timestamped line to the file if it already exists. That will illustrate a pipeline component doing its thing:

In [14]:
from azure.ai.ml import dsl

write_name_component = command(
    name="write_name_component",
    description="Write name to file",
    inputs={"name": Input(type="string")},
    outputs={"output_path": Output(type="uri_folder", path="azureml://datastores/container_processed/paths/presentation_examples/hello_world_presentation")},
    compute="defaultcompute",
    environment="helloworldenv:1",
    command="echo $(date '+%Y-%m-%d %H:%M:%S') Hello ${{inputs.name}} >> ${{outputs.output_path}}/hello_${{inputs.name}}.txt"
)
@dsl.pipeline(experiment_name="helloworld_pipeline_presentation")
def helloworld_pipeline(name: Input):
    write_name_component_instance = write_name_component(name=name)
    return {"output_path": write_name_component_instance.outputs.output_path}

pipeline_job = helloworld_pipeline(name="Analytics_Bergen!") #create the pipeline job
ml_client.jobs.create_or_update(pipeline_job) #submit the pipeline job to the workspace

pathOnCompute is not a known attribute of class <class 'azure.ai.ml._restclient.v2023_04_01_preview.models._models_py3.UriFolderJobOutput'> and will be ignored


Experiment,Name,Type,Status,Details Page
helloworld_pipeline_presentation,green_net_fhtn7d3jxf,pipeline,NotStarted,Link to Azure Machine Learning studio


![](./images/write_name_component_overview.png)

![](./images/write_name_component_output.png)
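
The component above does its work with a shell one-liner. Since a command can just as well call a Python script, here is a hedged sketch of the same behaviour in Python: append a timestamped greeting to `hello_<name>.txt`, creating the file on first use. The function `write_name` and the temporary directory standing in for the mounted output folder are assumptions for illustration.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def write_name(name, output_dir):
    """Append a timestamped greeting to hello_<name>.txt inside output_dir."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / f"hello_{name}.txt"
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    # Mode "a" creates the file if needed and appends otherwise, like `>>` in the shell.
    with target.open("a", encoding="utf-8") as handle:
        handle.write(f"{stamp} Hello {name}\n")
    return target


# A temporary directory stands in for the output folder mounted on the compute.
with tempfile.TemporaryDirectory() as tmp:
    write_name("Analytics_Bergen", tmp)
    target = write_name("Analytics_Bergen", tmp)
    lines = target.read_text(encoding="utf-8").splitlines()

print(len(lines))
```

Running the function twice against the same folder leaves two lines in the file, which mirrors what rerunning the pipeline with the same name does.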

### Chaining components
So far we have shown one component with inputs and outputs, but we can chain components together and 'orchestrate powerful pipelines', as they would have us say. Let's add a simple component that just reads the file we wrote back to us. We will redefine `write_name_component` for this pipeline since output paths are experiment-specific, meaning that a new experiment will fail if you point it at a non-empty directory.

In [15]:
write_name_component = command(
    name="write_name_component",
    description="Write name to file",
    inputs={"name": Input(type="string")},
    outputs={"output_path": Output(type="uri_folder", mode = "rw_mount")},
    compute="defaultcompute",
    environment="helloworldenv:1",
    command="echo $(date '+%Y-%m-%d %H:%M:%S') Hello ${{inputs.name}} >> ${{outputs.output_path}}/hello_${{inputs.name}}.txt"
)

cat_name_component = command(
    name="cat_name_component",
    description="write the contents of the name file to the console",
    inputs={"filepath" : Input(type="uri_folder"), "name": Input(type="string")},
    compute="defaultcompute",
    environment="helloworldenv:1",
    command="cat ${{inputs.filepath}}/hello_${{inputs.name}}.txt"
)

@dsl.pipeline(experiment_name="echocat_pipeline_presentation")
def echocat_pipeline(name: Input): 
    write_name_component_instance = write_name_component(name=name)
    cat_name_component_instance = cat_name_component(filepath=write_name_component_instance.outputs.output_path, name=name)

pipeline_job = echocat_pipeline(name="analytics_bergen!") #create the pipeline job
ml_client.jobs.create_or_update(pipeline_job) #submit the pipeline job to the workspace


pathOnCompute is not a known attribute of class <class 'azure.ai.ml._restclient.v2023_04_01_preview.models._models_py3.UriFolderJobOutput'> and will be ignored


Experiment,Name,Type,Status,Details Page
echocat_pipeline_presentation,cool_parcel_kr5nydl1qq,pipeline,NotStarted,Link to Azure Machine Learning studio


![](./images/echocat_output.png)
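
In the pipeline function, the order of execution is not written down anywhere; it follows from which component consumes which output. As a rough sketch of that idea (not the service's actual scheduler), the steps can be ordered with a topological sort over their data dependencies. The `bindings` dictionary below is a hand-written stand-in for the wiring in `echocat_pipeline`.

```python
from graphlib import TopologicalSorter

# For each step, the inputs that are wired to another step's output.
bindings = {
    "write_name_component": {},
    "cat_name_component": {"filepath": "write_name_component.outputs.output_path"},
}

# A step depends on every step whose output it consumes.
dependencies = {
    step: {source.split(".")[0] for source in wired_inputs.values()}
    for step, wired_inputs in bindings.items()
}

execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Steps without a data dependency between them have no fixed relative order, which is what allows independent branches of a pipeline to run in parallel.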

Pipelines can be published (the older SDK v1 route) or deployed in the portal. This makes them reusable from endpoints and from the UI in the studio, so others can run the pipelines on new data and collect results. That is how you make your processing and analysis available to others in a scalable way.

For more complex examples with better real-world applicability, you can have a look at the various notebooks in this repo.

thanks.

In [13]:
for endpoint in ml_client.batch_endpoints.list():
    print(f"Endpoint name: {endpoint.name}, ID: {endpoint.id}")


Endpoint name: detectandsegment2, ID: /subscriptions/cad4de40-836d-424a-bdd6-3296c5c25179/resourceGroups/rgdaniel/providers/Microsoft.MachineLearningServices/workspaces/amldaniel/batchEndpoints/detectandsegment2


In [None]:
endpoint_job = ml_client.batch_endpoints.invoke(endpoint_name="detectandsegment2", 
                                                inputs={"name": Input(type="string", default="Daniel_from_endpoint_invocation!")})

In [31]:
from datetime import datetime

In [32]:


endpoint_job = ml_client.batch_endpoints.invoke(
    endpoint_name="echocat",
    inputs={"name": Input(type="string", default="Daniel_from_endpoint_invocation_again!")},
    job_name=f"echocat_job_{datetime.now().strftime('%Y%m%d_%H%M%S')}",  # timestamp without spaces or colons, to keep the name safe
    description="Job created from endpoint invocation",
)

In [39]:
# Create a unique path with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_path = f"azureml://datastores/container_processed/paths/presentation_examples_{timestamp}/hello_world_presentation_{timestamp}"
endpoint_job2 = ml_client.batch_endpoints.invoke(
    endpoint_name="echocat",
    experiment_name="helloworld_pipeline_presentation_from_endpoint",
    job_name="helloworld_pipeline_presentation_from_endpoint_job",
    inputs={"name": Input(type="string", default="Daniel_from_endpoint_invocation_again")},
    outputs={"output_path": Output(type="uri_folder", path=output_path) } # Use the unique path
)
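
The invocations above rely on timestamps to keep job names and output paths unique. A small helper can keep that consistent. The sketch below assumes that names should stick to letters, digits, hyphens and underscores; `unique_name` is a hypothetical helper, and the fixed `now` argument exists only to make the example reproducible.

```python
import re
from datetime import datetime


def unique_name(prefix, now=None):
    """Hypothetical helper: sanitized prefix plus a second-resolution timestamp."""
    now = now or datetime.now()
    # Replace anything outside the assumed-safe character set with underscores.
    safe_prefix = re.sub(r"[^A-Za-z0-9_-]+", "_", prefix).strip("_")
    return f"{safe_prefix}_{now.strftime('%Y%m%d_%H%M%S')}"


name = unique_name("echocat job", now=datetime(2024, 1, 2, 3, 4, 5))
print(name)
```

Two calls within the same second produce the same name, so for rapid repeated submissions a random suffix would be a sensible addition.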