# Debug & monitor your training jobs

**Requirements** - In order to benefit from this tutorial, you will need:
- A basic understanding of Machine Learning
- An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace with a compute cluster - [Configure workspace](../../../configuration.ipynb)
- A Python environment
- The Azure Machine Learning Python SDK v2 installed - [install instructions](../../../../README.md) (see the getting started section)

**Learning Objectives** - By the end of this tutorial, you should be able to:
- Connect to your AML workspace from the Python SDK
- Create and run a `Command` which executes a Python command
- Use a local file as an `input` to the Command
- Enable live debugging & monitoring by specifying `services` to the Command job

**Motivations** - This notebook explains how to set up and run a Command. The Command is a fundamental construct of Azure Machine Learning. It can be used to run a task on a specified compute (either local or in the cloud). The Command accepts `environment` and `compute` to set up the required infrastructure. You can define a `command` to run on this infrastructure with `inputs`.

# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

## 1.1. Import the required libraries

In [50]:
# import required libraries
from azure.ai.ml import MLClient
from azure.ai.ml import command
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import JobService

In [51]:
# Enter details of your AML workspace
subscription_id = "bacbbc19-b8a9-44f3-acbc-a7b61e818bb3"
resource_group = "shoja-rg"
workspace = "shoja-west"

## 1.2. Configure workspace details and get a handle to the workspace

To connect to a workspace, we need identifier parameters: a subscription ID, a resource group, and a workspace name. We pass these details to the `MLClient` from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace. This tutorial uses the [default Azure authentication](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python). Check the [configuration notebook](../../../configuration.ipynb) for more details on how to configure credentials and connect to a workspace.

In [52]:
# get a handle to the workspace
ml_client = MLClient(
    DefaultAzureCredential(exclude_shared_token_cache_credential=True), subscription_id, resource_group, workspace
)
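
Hard-coding workspace identifiers in a notebook makes it awkward to share. As a minimal sketch, assuming the three values are exported as environment variables, they could be read and validated before the client is created. The variable names `AZUREML_SUBSCRIPTION_ID`, `AZUREML_RESOURCE_GROUP`, and `AZUREML_WORKSPACE_NAME` and the helper `load_workspace_details` are hypothetical choices for this example and are not part of the SDK.

```python
import os


def load_workspace_details(env=None):
    """Read workspace identifiers from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    names = {
        "subscription_id": "AZUREML_SUBSCRIPTION_ID",
        "resource_group": "AZUREML_RESOURCE_GROUP",
        "workspace": "AZUREML_WORKSPACE_NAME",
    }
    # Treat empty or whitespace-only values as missing
    details = {key: env.get(var, "").strip() for key, var in names.items()}
    missing = [names[key] for key, value in details.items() if not value]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return details


# Example with an explicit mapping instead of the real environment
example = load_workspace_details(
    {
        "AZUREML_SUBSCRIPTION_ID": "<subscription-id>",
        "AZUREML_RESOURCE_GROUP": "<resource-group>",
        "AZUREML_WORKSPACE_NAME": "<workspace-name>",
    }
)
print(example["workspace"])
```

The returned values can then be passed to `MLClient` in the same order as in the cell above: subscription, resource group, workspace name.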

# 2. Configure and run the Command
In this section we will configure and run a standalone job using the `command` function. The `command` function can be used to create standalone jobs, and the result can also be used as a step inside pipelines.

## 2.1 Configure the Command
The `command` function lets you configure the following key aspects.
- `code` - This is the path where the code to run the command is located
- `command` - This is the command that needs to be run. If you need to reserve your cluster for debugging or monitoring purposes, you can use the `sleep` command.
- `environment` - This is the environment needed for the command to run. You can use a curated or custom environment from the workspace, or create a new custom environment. Check out the [environment](../../../../assets/environment/environment.ipynb) notebook for more examples.
- `compute` - The compute on which the command will run. In this example we are using a compute called `cpu-cluster` present in the workspace. You can replace it with any other compute in the workspace. You can also run it on the local machine by using `local` for the compute. In that case the command runs locally, and all the run details and output of the job are uploaded to the Azure ML workspace.

- `display_name` - The display name of the Job
- `description` - The description of the experiment
- `services` - Specify the applications (or SSH) that you want to use to interact with the running job. You can specify `vs_code`, `tensor_board` (needs a log directory), `jupyter_lab` or `ssh` (needs a public key). For distributed jobs, you can specify the index of the compute node you would like to interact with using `nodes`. If `nodes` is not specified, interactive services are enabled only on the head node by default.
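
The per-service requirements in the list above (TensorBoard needs a log directory, SSH needs a public key) can be checked before a job is submitted. The following is a minimal sketch that works on plain dictionaries rather than SDK objects. The helper `check_services` and the `{"type": ..., "properties": ...}` layout are hypothetical and exist only for illustration; the property keys `logDir` and `sshPublicKeys` are the ones used in this notebook's job definition.

```python
# Required properties per service type, following the list above
REQUIRED_PROPERTIES = {
    "jupyter_lab": [],
    "vs_code": [],
    "tensor_board": ["logDir"],
    "ssh": ["sshPublicKeys"],
}


def check_services(spec):
    """Return a list of problems found in a {name: {"type": ..., "properties": ...}} spec."""
    problems = []
    for name, service in spec.items():
        service_type = service.get("type")
        if service_type not in REQUIRED_PROPERTIES:
            problems.append(f"{name}: unknown service type {service_type!r}")
            continue
        properties = service.get("properties") or {}
        for key in REQUIRED_PROPERTIES[service_type]:
            if not properties.get(key):
                problems.append(f"{name}: {service_type} requires property {key!r}")
    return problems


spec = {
    "My_jupyterlab": {"type": "jupyter_lab"},
    "My_tensorboard": {"type": "tensor_board", "properties": {"logDir": "outputs/tblogs"}},
    "My_ssh": {"type": "ssh"},  # public key intentionally left out
}
for problem in check_services(spec):
    print(problem)
```

A check like this catches a missing log directory or key locally, before compute is allocated for the job.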

In [53]:
# create the command
job = command(
    code="./src",  # local path where the code is stored
    command="python tfevents.py && sleep 2000",  # the sleep command allows you to reserve your compute -- recommended if you are using interactive services
    environment="AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu@latest",
    compute="cpu-cluster",
    display_name="debug-and-monitor-example",
    services={
        "My_jupyterlab": JobService(
            job_service_type="jupyter_lab",
        ),
        "My_vscode": JobService(
            job_service_type="vs_code",
        ),
        "My_tensorboard": JobService(
            job_service_type="tensor_board",
            properties={
                "logDir": "outputs/tblogs"  # relative path of Tensorboard logs (same as in your training script)
            },
        ),
        # "My_ssh": JobService(
        #     job_service_type="ssh",
        #     nodes="all",  # For distributed jobs, use the `nodes` property to pick which node you want to enable interactive services on. If `nodes` are not selected, by default, interactive applications are only enabled on the head node. Values are "all", or compute node index (for ex. "0", "1" etc.)
        #     properties={"sshPublicKeys": "<add-public-key>"},
        # ),
    },
    # experiment_name="debug-and-monitor-example",
    # description="Enable live debugging & monitoring for your training jobs",
)
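
Note that with `&&`, the shell runs `sleep` only if the script exits successfully, so a failing script releases the compute immediately. When the goal is to debug a failure interactively, separating the commands with `;` keeps the node reserved either way. The sketch below builds both variants of the command string; the helper `build_command` and its parameters are hypothetical and not part of the SDK.

```python
import shlex


def build_command(script, reserve_seconds=0, keep_alive_on_failure=False):
    """Build a shell command that runs a script and optionally sleeps afterwards."""
    cmd = f"python {shlex.quote(script)}"
    if reserve_seconds > 0:
        # "&&" sleeps only after success; ";" sleeps regardless of the exit code
        separator = ";" if keep_alive_on_failure else "&&"
        cmd = f"{cmd} {separator} sleep {int(reserve_seconds)}"
    return cmd


print(build_command("tfevents.py", 2000))
print(build_command("tfevents.py", 2000, keep_alive_on_failure=True))
```

The first call reproduces the command string used in the job above; the second is the variant that keeps the compute reserved after a failed run.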

## 2.2 Show running applications
Using the `MLClient` created earlier, we will now run this Command as a job in the workspace and view the endpoints for the interactive services specified during job submission.

In [54]:
# submit the command
returned_job = ml_client.create_or_update(job)

# view interactive service endpoints
ml_client.jobs.show_services(returned_job.name, 0)

Readonly attribute status will be ignored in class <class 'azure.ai.ml._restclient.v2022_10_01_preview.models._models_py3.JobService'>


TypeError: '_AttrDict' object is not callable
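
Service endpoints may not be populated until the job is actually running. As a minimal sketch, assuming each service object exposes an `endpoint` attribute (as the `JobService` entries in `returned_job.services` are expected to, though this may vary by SDK version), the endpoints could be listed defensively. The helper `summarize_endpoints` is hypothetical, and the example uses stand-in objects rather than a real job.

```python
from types import SimpleNamespace


def summarize_endpoints(services):
    """Map each service name to its endpoint, or None when it is not yet assigned."""
    return {name: getattr(svc, "endpoint", None) for name, svc in services.items()}


# Stand-in objects; with the SDK you would pass `returned_job.services` instead (assumption)
fake_services = {
    "My_jupyterlab": SimpleNamespace(endpoint="https://example.invalid/lab"),
    "My_vscode": SimpleNamespace(endpoint=None),
}
for name, endpoint in summarize_endpoints(fake_services).items():
    print(f"{name}: {endpoint or 'not available yet'}")
```

Using `getattr` with a default avoids an exception when a service entry has no endpoint yet.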

# Next Steps
You can see further examples of running a job [here](../../../single-step/). To learn more about debugging & monitoring your jobs, check out our [documentation](https://aka.ms/interactive_debugging).