# Pipeline for ORCA-CLEAN Deep Denoising Network

**Requirements** - To run this notebook, you will need:
- A basic understanding of machine learning
- An Azure account with an active subscription - [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace with a compute cluster - [Configure workspace](../../configuration.ipynb)
- A Python environment
- The Azure Machine Learning Python SDK v2 installed - [install instructions](../../../README.md) (see the "Getting started" section)

# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

## 1.1 Import the required libraries and settings

In [1]:
import datetime

from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential, EnvironmentCredential, AzureCliCredential

from azure.ai.ml import MLClient, Input, Output
from azure.ai.ml.dsl import pipeline
from azure.ai.ml import load_component
from azure.ai.ml.entities import AmlCompute, Environment, Model, Data
from azure.ai.ml.constants import AssetTypes
import os

In [16]:
run_training_pipeline = False  # Set to False to only register the existing model and create an inference endpoint; set to True to also run training

## 1.2 Configure credential

We use `DefaultAzureCredential` to get access to the workspace.
`DefaultAzureCredential` should be capable of handling most Azure SDK authentication scenarios.

If it does not work for you, see the other available credentials: [configure credential example](../../configuration.ipynb), [azure-identity reference doc](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python).

In [2]:
try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential(tenant_id="<YOUR TENANT ID>")
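
The try/except fallback above can be generalized to an ordered list of credential types. The following is a minimal sketch of that idea, not part of the original notebook: `first_working_credential` and `_FakeCredential` are hypothetical names, and the fake class stands in for real `azure.identity` credentials so the snippet runs without Azure access.

```python
from typing import Callable, Iterable

MANAGEMENT_SCOPE = "https://management.azure.com/.default"


def first_working_credential(factories: Iterable[Callable[[], object]],
                             scope: str = MANAGEMENT_SCOPE):
    """Return the first credential whose get_token(scope) call succeeds."""
    errors = []
    for factory in factories:
        try:
            credential = factory()
            credential.get_token(scope)  # raises if authentication fails
            return credential
        except Exception as ex:  # error types differ between credential classes
            errors.append(f"{getattr(factory, '__name__', factory)}: {ex}")
    raise RuntimeError("No credential could obtain a token:\n" + "\n".join(errors))


class _FakeCredential:
    """Hypothetical stand-in for an azure.identity credential (illustration only)."""

    def __init__(self, works: bool):
        self.works = works

    def get_token(self, scope: str) -> str:
        if not self.works:
            raise ValueError("authentication failed")
        return "fake-token"


# The first factory fails, so the helper falls through to the second one.
chosen = first_working_credential(
    [lambda: _FakeCredential(works=False), lambda: _FakeCredential(works=True)]
)
print(chosen.works)
```

In this notebook the list could be, for example, `[DefaultAzureCredential, AzureCliCredential, lambda: InteractiveBrowserCredential(tenant_id="<YOUR TENANT ID>")]`, since all three classes are already imported in 1.1.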

## 1.3 Get a handle to the workspace

We use a config file to connect to the workspace. The Azure ML workspace should be configured with a compute cluster. [See this notebook to configure a workspace](../../configuration.ipynb).

In [3]:
# Get a handle to the workspace
ml_client = MLClient.from_config(
    credential=credential,
    # tenant_id="<YOUR TENANT ID>",  # needed to specify the tenant in some cases
)

compute_instance_name = "whalesdenoisinggpu2"
try:
    print(ml_client.compute.get(compute_instance_name))
except Exception:
    # The compute cluster could not be retrieved, so create it
    compute_instance = AmlCompute(
        name=compute_instance_name,
        type="amlcompute",
        size="Standard_NC24ads_A100_v4",
        # Minimum number of nodes kept running when no job is running
        min_instances=0,
        # Maximum number of nodes in the cluster
        max_instances=4,
        # Seconds a node stays idle after a job finishes before it is scaled down
        idle_time_before_scale_down=180,
        # "Dedicated" or "LowPriority"; the latter is cheaper, but jobs may be terminated
        tier="Dedicated",
    )
    ml_client.begin_create_or_update(compute_instance).result()
    print(f"Created new compute cluster: {ml_client.compute.get(compute_instance_name)}")

Found the config file in: /config.json


enable_node_public_ip: true
id: /subscriptions/03fd01f6-6051-4545-a78e-ceaace399b96/resourceGroups/lianatests/providers/Microsoft.MachineLearningServices/workspaces/humpbackwhales-aml/computes/whalesdenoisinggpu2
idle_time_before_scale_down: 180
location: westeurope
max_instances: 4
min_instances: 0
name: whalesdenoisinggpu2
network_settings: {}
provisioning_state: Succeeded
size: STANDARD_NC24ADS_A100_V4
ssh_public_access_enabled: true
tier: dedicated
type: amlcompute



## 1.4 Create a custom environment

In [25]:
custom_env_name = "whales-denoising-env"
env_version = "1.8"
dependencies_dir = "./dependencies"

try:
    pipeline_job_env = ml_client.environments.get(name=custom_env_name, version=env_version)
    print(
        f"Environment with name {pipeline_job_env.name} is registered to workspace, the environment version is {pipeline_job_env.version}"
    )
except Exception:
    # The environment version is not registered yet, so create it
    pipeline_job_env = Environment(
        name=custom_env_name,
        description="Custom environment for running ORCA-CLEAN pipeline",
        conda_file=os.path.join(dependencies_dir, "conda.yaml"),
        # Alternative base images:
        #   mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04
        #   mcr.microsoft.com/azureml/curated/acpt-pytorch-1.13-cuda11.7:18
        image="mcr.microsoft.com/azureml/curated/acpt-pytorch-2.0-cuda11.7:17",
        version=env_version,
    )

    pipeline_job_env = ml_client.environments.create_or_update(pipeline_job_env)
    print(
        f"Environment with name {pipeline_job_env.name} is registered to workspace, the environment version is {pipeline_job_env.version}"
    )

Environment with name whales-denoising-env is registered to workspace, the environment version is 1.8


## 1.5 Create the Data asset objects

In [5]:
whales_data_name = "whales_data"
noise_train_data_name = "noise_train_data"
noise_test_data_name = "noise_test_data"
noise_val_data_name = "noise_val_data"
dataset_version = "3"

In [6]:
path = "/mnt/humpbackwhales/data/denoising_data/whales_data_v2"
whales_data_name = "whales_data"

try:
    whales_data = ml_client.data.get(whales_data_name, version=dataset_version)
except Exception:  # the asset version is not registered yet
    whales_data = Data(
        path=path,
        type=AssetTypes.URI_FOLDER,
        description="whales vocalizations of all types: moans, barks, creaks, etc.",
        name=whales_data_name,
        version=dataset_version,
    )
    ml_client.data.create_or_update(whales_data)

In [7]:
path = "/mnt/humpbackwhales/data/denoising_data/noise_train_v2"
noise_train_data_name = "noise_train_data"

try:
    noise_train_data = ml_client.data.get(noise_train_data_name, version=dataset_version)
except Exception:  # the asset version is not registered yet
    noise_train_data = Data(
        path=path,
        type=AssetTypes.URI_FOLDER,
        description="5-sec segments of only-noise retrieved from raw hydrophone recordings",
        name=noise_train_data_name,
        version=dataset_version,
    )
    ml_client.data.create_or_update(noise_train_data)

In [8]:
path = "/mnt/humpbackwhales/data/denoising_data/noise_test_v2"
noise_test_data_name = "noise_test_data"

try:
    noise_test_data = ml_client.data.get(noise_test_data_name, version=dataset_version)
except Exception:  # the asset version is not registered yet
    noise_test_data = Data(
        path=path,
        type=AssetTypes.URI_FOLDER,
        description="5-sec segments of only-noise retrieved from raw hydrophone recordings",
        name=noise_test_data_name,
        version=dataset_version,
    )
    ml_client.data.create_or_update(noise_test_data)

In [9]:
path = "/mnt/humpbackwhales/data/denoising_data/noise_val_v2"
noise_val_data_name = "noise_val_data"

try:
    noise_val_data = ml_client.data.get(noise_val_data_name, version=dataset_version)
except Exception:  # the asset version is not registered yet
    noise_val_data = Data(
        path=path,
        type=AssetTypes.URI_FOLDER,
        description="5-sec segments of only-noise retrieved from raw hydrophone recordings",
        name=noise_val_data_name,
        version=dataset_version,
    )
    ml_client.data.create_or_update(noise_val_data)

In [10]:
path = "/mnt/humpbackwhales/denoising/outputs"
denoising_training_outputs_name = "denoising_training_outputs"

try:
    denoising_training_outputs = ml_client.data.get(denoising_training_outputs_name, version=dataset_version)
except Exception:  # the asset version is not registered yet
    denoising_training_outputs = Data(
        path=path,
        type=AssetTypes.URI_FOLDER,
        description="results of deep denoising network's training",
        name=denoising_training_outputs_name,
        version=dataset_version,
    )
    ml_client.data.create_or_update(denoising_training_outputs)
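
The five cells above repeat the same get-or-register pattern. As a sketch under stated assumptions, the pattern can be written once and driven by a table of names and paths. `ensure_assets` and `_FakeDataOps` are hypothetical names introduced here, and the in-memory fake replaces `ml_client.data` so the snippet runs without a workspace.

```python
from types import SimpleNamespace


def ensure_assets(data_ops, asset_factory, specs, version):
    """Fetch each named asset at `version`, registering it first if the lookup fails."""
    assets = {}
    for name, (path, description) in specs.items():
        try:
            assets[name] = data_ops.get(name, version=version)
        except Exception:  # the asset version is not registered yet
            asset = asset_factory(name=name, path=path,
                                  description=description, version=version)
            assets[name] = data_ops.create_or_update(asset)
    return assets


class _FakeDataOps:
    """Hypothetical in-memory stand-in for ml_client.data (illustration only)."""

    def __init__(self):
        self.store = {}

    def get(self, name, version):
        return self.store[(name, version)]  # KeyError when the asset is missing

    def create_or_update(self, asset):
        self.store[(asset.name, asset.version)] = asset
        return asset


noise_description = "5-sec segments of only-noise retrieved from raw hydrophone recordings"
specs = {
    "noise_train_data": ("/mnt/humpbackwhales/data/denoising_data/noise_train_v2", noise_description),
    "noise_val_data": ("/mnt/humpbackwhales/data/denoising_data/noise_val_v2", noise_description),
}
ops = _FakeDataOps()
assets = ensure_assets(ops, SimpleNamespace, specs, version="3")
print(sorted(assets))
```

With the real SDK, `data_ops` would be `ml_client.data` and `asset_factory` could be `lambda **kw: Data(type=AssetTypes.URI_FOLDER, **kw)`. Unlike the cells above, this sketch keeps the object returned by `create_or_update` rather than the local definition.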

In [11]:
default_datastore_name = "workspaceblobstore"  # workspaceblobstore, workspacefilestore

# 2. Define and create components in the workspace

## 2.1 Load components from YAML

In [12]:
parent_dir = "."
train_model = load_component(source=parent_dir + "/train_model.yml")
#score_data = load_component(source=parent_dir + "/score_data.yml")
#eval_model = load_component(source=parent_dir + "/eval_model.yml")

## 2.2 Inspect loaded component

In [13]:
print(train_model)

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_model
version: 0.0.1
display_name: Train Deep Denoising Model
description: The training component
type: command
inputs:
  whales_data:
    type: uri_folder
  noise_train_data:
    type: uri_folder
  noise_val_data:
    type: uri_folder
  noise_test_data:
    type: uri_folder
  max_train_epochs:
    type: integer
    default: '100'
  learning_rate:
    type: number
    default: '0.001'
  batch_size:
    type: integer
    default: '16'
  num_workers:
    type: integer
    default: '1'
  augmentation:
    type: integer
    default: '1'
  n_fft:
    type: integer
    default: '4096'
  hop_length:
    type: integer
    default: '441'
  sequence_len:
    type: integer
    default: '1280'
  freq_compression:
    type: string
    default: linear
  model_dir:
    type: string
  log_dir:
    type: string
  checkpoint_dir:
    type: string
  cache_dir:
    type: string
  summary_dir:
    type: string
out
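
The `n_fft` and `hop_length` defaults printed above set the time-frequency resolution of the spectrograms, assuming they are used as short-time Fourier transform parameters in the usual sense. The sketch below works through that arithmetic. The 44.1 kHz sample rate is an assumption (it is not stated in this notebook), and `stft_frame_info` is a hypothetical helper, not part of the training component.

```python
def stft_frame_info(n_fft: int, hop_length: int, sample_rate: int) -> dict:
    """Derive basic STFT resolution figures from window size, hop size and sample rate."""
    return {
        # A one-sided STFT of a real signal has n_fft // 2 + 1 frequency bins
        "frequency_bins": n_fft // 2 + 1,
        # Duration covered by one analysis window, in milliseconds
        "window_ms": 1000 * n_fft / sample_rate,
        # Time step between consecutive frames, in milliseconds
        "hop_ms": 1000 * hop_length / sample_rate,
        # Fraction of each window shared with the next one
        "overlap_fraction": 1 - hop_length / n_fft,
    }


# sample_rate=44100 is ASSUMED for illustration; use the rate of your recordings.
info = stft_frame_info(n_fft=4096, hop_length=441, sample_rate=44100)
for key, value in info.items():
    print(f"{key}: {value}")
```

Under that assumption each frame advances by 10 ms, so a different sample rate changes the time step proportionally.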

# 3. Pipeline job

## 3.1 Build pipeline

In [17]:
if run_training_pipeline:
    # Construct pipeline
    @pipeline()
    def pipeline_with_components_from_yaml(
        whales_data_input,
        noise_train_data_input,
        noise_val_data_input,
        noise_test_data_input,
        training_outputs,
    ):
        """Pipeline with components defined via YAML."""
        train_component = train_model(
            whales_data=whales_data_input,
            noise_train_data=noise_train_data_input,
            noise_val_data=noise_val_data_input,
            noise_test_data=noise_test_data_input,
            # Directory names passed to the training component as string inputs
            model_dir="models",
            log_dir="logs",
            checkpoint_dir="model_checkpoints",
            cache_dir="cache",
            summary_dir="summary",
        )

        # Mount the training output folder in read-write mode
        train_component.outputs.training_output = Output(type="uri_folder", path=training_outputs, mode="rw_mount")

    pipeline_job = pipeline_with_components_from_yaml(
        whales_data_input=Input(type="uri_folder", path=whales_data.path),
        noise_train_data_input=Input(type="uri_folder", path=noise_train_data.path),
        noise_val_data_input=Input(type="uri_folder", path=noise_val_data.path),
        noise_test_data_input=Input(type="uri_folder", path=noise_test_data.path),
        training_outputs=denoising_training_outputs.path,
    )

    # Set pipeline-level compute
    pipeline_job.settings.default_compute = compute_instance_name

In [18]:
if run_training_pipeline:
    # Inspect built pipeline
    print(pipeline_job.outputs)

## 3.2 Submit pipeline job

In [16]:
if run_training_pipeline:
    # Submit pipeline job to workspace
    pipeline_job = ml_client.jobs.create_or_update(
        pipeline_job, experiment_name="whales_denoising"
    )
    pipeline_job



| Experiment | Name | Type | Status | Details Page |
| --- | --- | --- | --- | --- |
| whales_denoising | helpful_brick_nbfdgs01jt | pipeline | Preparing | Link to Azure Machine Learning studio |


In [17]:
if run_training_pipeline:
    # Wait until the job completes
    ml_client.jobs.stream(pipeline_job.name)

RunId: helpful_brick_nbfdgs01jt
Web View: https://ml.azure.com/runs/helpful_brick_nbfdgs01jt?wsid=/subscriptions/03fd01f6-6051-4545-a78e-ceaace399b96/resourcegroups/lianatests/workspaces/humpbackwhales-aml

Streaming logs/azureml/executionlogs.txt

[2023-11-13 20:45:20Z] Submitting 1 runs, first five are: 941b546b:c248e52b-cf09-491f-b78f-50980627f849
[2023-11-14 17:34:14Z] Completing processing run id c248e52b-cf09-491f-b78f-50980627f849.

Execution Summary
RunId: helpful_brick_nbfdgs01jt
Web View: https://ml.azure.com/runs/helpful_brick_nbfdgs01jt?wsid=/subscriptions/03fd01f6-6051-4545-a78e-ceaace399b96/resourcegroups/lianatests/workspaces/humpbackwhales-aml
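
According to the timestamps in the log above, the run took roughly 21 hours, and `ml_client.jobs.stream` blocks the notebook for that whole time. A possible alternative is to poll the job status at intervals. The sketch below is an illustration only: `wait_for_job` is a hypothetical helper, the set of terminal status strings is an assumption to verify against your SDK version, and a fake status sequence replaces the real job so the snippet runs offline.

```python
import time
from typing import Callable

# ASSUMED terminal states; verify against the job statuses your SDK version reports.
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_job(get_status: Callable[[], str], poll_seconds: float = 30.0,
                 timeout_seconds: float = 24 * 3600) -> str:
    """Poll get_status() until it returns a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still in state {status!r} after {timeout_seconds} s")
        time.sleep(poll_seconds)


# Fake status sequence standing in for a real job (illustration only).
_fake_statuses = iter(["Preparing", "Running", "Running", "Completed"])
final_status = wait_for_job(lambda: next(_fake_statuses), poll_seconds=0.0)
print(final_status)
```

In the notebook, the status callable could be `lambda: ml_client.jobs.get(pipeline_job.name).status`.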



## 3.3 Register a model

In [14]:
model_name = "orca-clean-model-v1"
model_version = "1"

# The model ORCA-CLEAN.pk, trained for 100 epochs, is available in the file share location humpbackwhales/models.
# Alternatively, train the model with the pipeline above by setting run_training_pipeline to True.
run_model = Model(
    # Strip any trailing slash so the joined path has exactly one separator
    path=f"{denoising_training_outputs.path.rstrip('/')}/models/ORCA-CLEAN.pk",
    name=model_name,
    description="Model trained with 100 epochs, whales_data_v2 and noise_train/test/val_v2.",
    type=AssetTypes.CUSTOM_MODEL,
)

ml_client.models.create_or_update(run_model)

Model({'job_name': None, 'intellectual_property': None, 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'orca-clean-model-v1', 'description': 'Model trained with 100 epochs, whales_data_v2 and noise_train/test/val_v2.', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/03fd01f6-6051-4545-a78e-ceaace399b96/resourceGroups/lianatests/providers/Microsoft.MachineLearningServices/workspaces/humpbackwhales-aml/models/orca-clean-model-v1/versions/1', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/linapalk2/code/Users/linapalk/GitHub/humpback-whales-vocalizations-classification/HumpbackWhales/03_Denoising/ML_pipeline', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7f4ec0b05a80>, 'serialize': <msrest.serialization.Serializer object at 0x7f4ec0b05f60>, 'version': '1', 'latest_version': None, 'path': 'azureml://subscriptions/03fd01f6-6051-4545-a78e-cea

## 3.4 Create a real-time endpoint

In [19]:
from azure.ai.ml.entities import ManagedOnlineEndpoint

endpoint_name = "whales-denoising-real-time"

endpoint = ManagedOnlineEndpoint(
    name=endpoint_name,
    description="Endpoint for real-time inference on the deep denoising model",
    auth_mode="key",
)

## 3.5 Deploy the endpoint

In [28]:
from azure.ai.ml.entities import (
    ManagedOnlineDeployment,
    Model,
    Environment,
    CodeConfiguration,
)
          
# Load the existing custom environment that was created in 1.4
env = ml_client.environments.get(name=custom_env_name, version=env_version)

# Load the existing model that was registered in 3.3
model = ml_client.models.get(name=model_name, version=model_version)

In [None]:
# =========================================
# STARTING FROM THIS CELL THE CODE IS W.I.P
# =========================================
orca_clean_deployment = ManagedOnlineDeployment(
    name="OrcaClean",
    endpoint_name=endpoint_name,
    model=model,
    environment=env,
    code_configuration=CodeConfiguration(
        code="orca_clean", scoring_script="predict_inference.py"
    ),
    instance_type="Standard_DS3_v2",  # TODO: check whether inference also needs a GPU instance
    instance_count=1,
)
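
Constructing `ManagedOnlineEndpoint` and `ManagedOnlineDeployment` objects only builds local definitions; nothing is provisioned until they are submitted to the workspace. The sketch below shows one possible submission order (endpoint, then deployment, then traffic routing). It is not the notebook's final code: `deploy_with_full_traffic` and `_FakeOperations` are hypothetical names, and the fake client records calls so the snippet runs without Azure.

```python
from types import SimpleNamespace


def deploy_with_full_traffic(ml_client, endpoint, deployment):
    """Create the endpoint, create the deployment, then route 100% of traffic to it."""
    ml_client.online_endpoints.begin_create_or_update(endpoint).result()
    ml_client.online_deployments.begin_create_or_update(deployment).result()
    # Traffic is a mapping of deployment name to percentage
    endpoint.traffic = {deployment.name: 100}
    return ml_client.online_endpoints.begin_create_or_update(endpoint).result()


class _FakeOperations:
    """Hypothetical stand-in that records submitted entities (illustration only)."""

    def __init__(self, log, label):
        self.log = log
        self.label = label

    def begin_create_or_update(self, entity):
        self.log.append((self.label, entity.name))
        # Mimic a poller object whose result() returns the submitted entity
        return SimpleNamespace(result=lambda: entity)


calls = []
fake_client = SimpleNamespace(
    online_endpoints=_FakeOperations(calls, "endpoint"),
    online_deployments=_FakeOperations(calls, "deployment"),
)
fake_endpoint = SimpleNamespace(name="whales-denoising-real-time", traffic={})
fake_deployment = SimpleNamespace(name="OrcaClean")

result = deploy_with_full_traffic(fake_client, fake_endpoint, fake_deployment)
print(calls)
print(result.traffic)
```

With the real SDK the call would be `deploy_with_full_traffic(ml_client, endpoint, orca_clean_deployment)`. Each `begin_create_or_update` call returns a poller, and `.result()` blocks until provisioning finishes, which can take several minutes.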

## 3.6 Test the deployment