# Tutorial: Debugging the Cluster Environment

Real-world machine learning pipelines are often complicated: they require multiple dependencies and are sensitive to environment conditions such as library versions, operating system type, and other runtime settings. The Azure Machine Learning platform provides a number of ways to standardize the environment, including:

- Azure Container Registry (ACR)
- Conda requirements management
- GitHub integrations

Even so, it is sometimes necessary to investigate issues in the later stages of a production pipeline.

In this tutorial, you will learn how to:
- Navigate pipeline Job components
- Access individual Job component Compute nodes
- Install Job-related Docker images using ACR
- Update the Job environment and re-submit failed jobs

This tutorial applies to all stages of the AML Pipeline, but will likely be most useful in the Outer Loop stages.

The next image shows a simple pipeline as it appears in Azure Machine Learning studio once submitted.

![Screenshot that shows the AML Pipeline](https://github.com/azeltov/aigbb-aml-bootcamp/assets/5873303/e0575c72-2c32-4c83-9660-82c34993027b "Overview of the pipeline")


## Navigate pipeline Job components 

Whether executing an individual ML Component or running a complex pipeline, the Azure Machine Learning platform uses the Job abstraction to encapsulate Azure Compute resources.

When debugging a failed job, the first step is to identify the Compute resource responsible for the failure.

Navigate to the Jobs section of the Azure Machine Learning Studio workspace:

<div>
<img src="./media/select_job_1.png" height="400"/>
</div>

The Jobs section of the workspace contains the list of failed and completed jobs:
![Start compute](./media/inspect_jobs.png)
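The same list can also be filtered programmatically. The following is a minimal sketch, assuming job objects that expose `name` and `status` attributes, as the objects returned by `ml_client.jobs.list()` in the `azure-ai-ml` SDK are expected to. The helper name `failed_jobs`, the sample job names, and the status string `"Failed"` used for comparison are assumptions made for illustration.

```python
from types import SimpleNamespace


def failed_jobs(jobs, failed_status="Failed"):
    """Return the names of jobs whose status equals failed_status."""
    return [job.name for job in jobs if getattr(job, "status", None) == failed_status]


# Stand-in objects for illustration only; in a workspace the iterable
# would come from ml_client.jobs.list() instead.
sample_jobs = [
    SimpleNamespace(name="job-a", status="Completed"),
    SimpleNamespace(name="job-b", status="Failed"),
    SimpleNamespace(name="job-c", status="Failed"),
]
print(failed_jobs(sample_jobs))
```

Keeping the filter separate from the SDK call makes it easy to try out without a workspace connection.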

To understand why a job failed, it is first necessary to understand what type of job was executed.  Recall from a previous exercise (03b_ReTrain_Model.ipynb) that a Job may be started via the following Python API snippet.

<div>
<img src="./media/cluster_job_invoke.png" height="400"/>
</div>


The highlighted area draws attention to the type of command for this particular Job as well as the type of Compute resource that is required.  In the example above, the job is cluster-bound and targets the "cpu-cluster" Compute resource.

## Access individual Job component Compute nodes

Azure Machine Learning cluster environments are transient: they do not retain their runtime context once a job finishes, whether it succeeded or failed.  To access an individual cluster node for debugging, the job must therefore be started with an artificial shutdown delay that keeps the runtime context alive.

This may be done by adding a "sleep" snippet (highlighted in red) to the job command line.

<div>
<img src="./media/cluster_job_invoke.png" height="400"/>
</div>
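As a text companion to the screenshot, the sketch below shows one way such a delay could be appended to a command string. It is not necessarily the exact form used in the screenshot. It assumes the job command is interpreted by a POSIX shell, and the helper name `with_debug_sleep` is hypothetical. The exit code of the original command is saved and returned after the sleep, so a failing job is still reported as failed.

```python
def with_debug_sleep(command: str, seconds: int = 3600) -> str:
    """Wrap a shell command so the node stays up after the command exits.

    Assumes a POSIX shell: `;` runs the sleep whether or not the command
    succeeded, and `exit $rc` restores the command's original exit code.
    """
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return f"{command}; rc=$?; sleep {seconds}; exit $rc"


# The ${{inputs.data}} placeholder is passed through unchanged.
debug_command = with_debug_sleep("python main.py --data ${{inputs.data}}")
print(debug_command)
```

Using `;` rather than `&&` matters here: with `&&` the sleep would be skipped exactly when the job fails, which is the case that needs debugging.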


Once the augmented job is started, the cluster environment persists for an additional 3600 seconds (one hour) after the job fails or completes.

It is now possible to inspect the cluster by navigating to the cluster nodes.  To do so, from the Azure Machine Learning studio, click on Compute and then select the Clusters tab:

<div>
<img src="./media/navigate_to_cluster.png" height="400"/>
</div>

Then select the target cluster (for example, "cpu-cluster"):

<div>
<img src="./media/drilldown_cluster.png" height="400"/>
</div>

Next, click on the Nodes tab:

<div>
<img src="./media/select_nodes.png" height="400"/>
</div>

Cluster nodes should be in the running state.  Select a cluster node:

<div>
<img src="./media/click_node_id.png" height="400"/>
</div>


Select Output+Logs to view job output and error messages:

<div>
<img src="./media/select_logs.png" height="400"/>
</div>

The resulting screen will display job output logs:

<div>
<img src="./media/view_logs.png" height="400"/>
</div>
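When the logs are long, it can be quicker to download them and search for error markers. The sketch below assumes the logs have already been saved to a local directory as plain-text `.txt` files (for example with `ml_client.jobs.download(name=..., download_path=..., all=True)` from the `azure-ai-ml` SDK). The helper name `find_error_lines` and the default marker strings are assumptions chosen for illustration.

```python
from pathlib import Path


def find_error_lines(log_dir, markers=("Traceback", "Error", "Exception")):
    """Return (file name, line number, text) for log lines containing a marker."""
    hits = []
    # Search every .txt file below log_dir, in a stable order.
    for path in sorted(Path(log_dir).rglob("*.txt")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(marker in line for marker in markers):
                    hits.append((path.name, number, line.rstrip()))
    return hits
```

Calling `find_error_lines("./downloaded_logs")` (a hypothetical path) returns a list that can be printed or filtered further, for example to locate the first `ModuleNotFoundError` caused by a missing dependency.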

## Install Job-related Docker images using ACR

To reproduce a failing environment locally or on a compute instance, it is necessary to pull the Docker image that was used when the job was instantiated.

To identify the Docker image used during a job execution, navigate to the Jobs section of the Azure ML Workspace and click on the Environment link:

<div>
<img src="./media/docker_step_1.png" height="400"/>
</div>

The resulting screen will contain the Docker repository and image id information necessary to download and install the Docker environment locally:

<div>
<img src="./media/docker_overview.png" height="400"/>
</div>


1. Azure Container Registry (ACR) URI 
2. ACR repository-specific Docker image id
3. Parent Docker image id

### Pulling Docker Image

Log in to the ACR registry using the Azure Container Registry URI (item 1 above):

In [None]:
!docker login 228a5fafb10849f1aca62709336183e6.azurecr.io

Next, pull the image using the full Azure Container Registry URI, that is, the registry URI followed by the repository-specific image id (items 1 and 2 above):

In [None]:
!docker pull 228a5fafb10849f1aca62709336183e6.azurecr.io/azureml/azureml_56f11dd12f51eee236626b86f8bec2eb

Verify that the image was pulled:

In [None]:
!docker images
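Because the registry URI and image id differ for every workspace and environment, it can be convenient to assemble these three commands from the values shown on the Environment screen. The following sketch does only that and does not run anything. The helper name `acr_pull_commands` and the sample registry and repository values are hypothetical placeholders.

```python
import shlex


def acr_pull_commands(registry: str, repository: str, tag: str = "") -> list[str]:
    """Build the docker login, pull and images commands for an ACR-hosted image."""
    image = f"{registry}/{repository}" + (f":{tag}" if tag else "")
    return [
        shlex.join(["docker", "login", registry]),
        shlex.join(["docker", "pull", image]),
        # Listing by repository name shows only the image that was just pulled.
        shlex.join(["docker", "images", image]),
    ]


# Placeholder values; substitute items 1 and 2 from the Environment screen.
for line in acr_pull_commands("myregistry.azurecr.io", "azureml/azureml_0123abcd"):
    print(line)
```

The generated lines can be pasted into a terminal, or prefixed with `!` in a notebook cell as in the cells above.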

The pulled image can now be started from a command-line terminal using the following command:

<div>
<img src="./media/execute_docker_commands.png" height="400"/>
</div>
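For readers who cannot see the screenshot, the sketch below builds one plausible form of such a command: an interactive shell in a throwaway container. It is an assumption about a typical invocation, not a transcription of the screenshot. It also assumes the image contains `/bin/bash`; the helper name `docker_debug_shell_command`, the `/workspace` mount point, and the sample image name are hypothetical.

```python
import shlex


def docker_debug_shell_command(image: str, workdir: str = "") -> str:
    """Build a command that opens an interactive bash shell in the image."""
    # --rm removes the container on exit; -it attaches an interactive terminal.
    args = ["docker", "run", "--rm", "-it"]
    if workdir:
        # Optionally mount a local directory (for example the job's source
        # code) and start the shell inside it.
        args += ["-v", f"{workdir}:/workspace", "-w", "/workspace"]
    args += [image, "/bin/bash"]
    return shlex.join(args)


print(docker_debug_shell_command("myregistry.azurecr.io/azureml/azureml_0123abcd"))
```

Mounting the job's source directory makes it possible to re-run the failing script inside the same image the cluster used.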

## Update Job environment and re-submit failed jobs

Environment issues commonly result from inconsistent dependency versions.  To keep versions consistent, the Azure Machine Learning platform uses Conda virtual environments for dependency management.

The Azure Machine Learning API accepts environment configuration through a customizable conda.yml file of the following form:

name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - numpy=1.21.2
  - pip=21.2.4
  - scikit-learn=0.24.2
  - scipy=1.7.1
  - pandas>=1.1,<1.2
  - pip:
    - inference-schema[numpy-support]==1.3.0
    - mlflow==1.26.1
    - azureml-mlflow==1.42.0
    - psutil>=5.8,<5.9
    - tqdm>=4.59,<4.60
    - ipykernel~=6.0
    - matplotlib
    - pyarrow


Users of the platform may provide custom versions of the Conda configuration by specifying a custom file path during environment configuration.  
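Before swapping in a custom file, it can help to see exactly which pins differ from the configuration that failed. The sketch below compares two lists of dependency strings, written as they appear under `dependencies:` in a conda.yml file. The helper names `parse_pins` and `differing_pins` are hypothetical, the sample lists are illustrative values rather than output from a real job, and version specifiers are compared as plain strings with no semantic version handling.

```python
import re

# Package name, optional [extras], then whatever version specifier follows.
_PIN = re.compile(r"^\s*([A-Za-z0-9_.\-]+)(?:\[[^\]]*\])?\s*(.*)$")


def parse_pins(requirements):
    """Map lower-cased package name to its version specifier ('' if unpinned)."""
    pins = {}
    for requirement in requirements:
        match = _PIN.match(requirement)
        if match:
            pins[match.group(1).lower()] = match.group(2).replace(" ", "")
    return pins


def differing_pins(base, custom):
    """Return {name: (base_spec, custom_spec)} for packages whose pins differ.

    A package present in only one list appears with None on the other side.
    """
    a, b = parse_pins(base), parse_pins(custom)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }


# Illustrative lists only.
base = ["python=3.8", "numpy=1.21.2", "scikit-learn=0.24.2"]
custom = ["python=3.8", "numpy=1.22.0"]
print(differing_pins(base, custom))
```

A package that shows up with `None` on the custom side has been dropped entirely, which is worth double-checking if the training script imports it.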

In [None]:
dependencies_dir = "dependencies"
!mkdir -p {dependencies_dir}  # %%writefile fails if the directory does not exist

In [None]:
%%writefile {dependencies_dir}/my_custom_conda.yml
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - numpy=1.21.2
  - pip:
    - inference-schema[numpy-support]==1.3.0
    - azureml-mlflow==1.42.0

In [None]:
import os
from dotenv import load_dotenv

# load the environment variables from .env
load_dotenv()

In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Authenticate with the default Azure credential chain
credential = DefaultAzureCredential()
# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id=os.environ.get("SUBSCRIPTION_ID"),
    resource_group_name=os.environ.get("RESOURCE_GROUP_NAME"),
    workspace_name=os.environ.get("WORKSPACE_NAME"),
)

In [None]:
from azure.ai.ml.entities import Environment

my_custom_env_name = "my-custom-credit-card-scikit-38"

# Describe the environment: a base Docker image plus the custom Conda file
my_custom_job_env = Environment(
    name=my_custom_env_name,
    description="Custom environment for Credit Card Defaults job",
    tags={"scikit-learn": "0.24.2"},
    conda_file=os.path.join(dependencies_dir, "my_custom_conda.yml"),
    image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest",
)
# Register the environment (or a new version of it) in the workspace
my_custom_job_env = ml_client.environments.create_or_update(my_custom_job_env)

print(
    f"Environment with name {my_custom_job_env.name} is registered to workspace, the environment version is {my_custom_job_env.version}"
)

Failed jobs may be re-submitted by rebuilding them with the command function, this time pointing at the custom environment:

In [None]:
from azure.ai.ml import command
from azure.ai.ml import Input

registered_model_name = "credit_defaults_model"
environment_name = f"{my_custom_job_env.name}@latest"  # latest registered version
job = command(
    inputs=dict(
        data=Input(
            type="uri_file",
            #path="azureml://subscriptions/f1ea6ed8-82f3-416d-881b-8b376218bc85/resourcegroups/rg_aml/workspaces/aml-default/datastores/workspaceblobstore/paths/LocalUpload/4b1dfc4d12429b46389cabdf25b886a2/default_of_credit_card_clients.csv",
            #path="https://azuremlexamples.blob.core.windows.net/datasets/credit_card/default_of_credit_card_clients.csv",
            path="azureml:credit-card_csv:2023.10.05.154542",
        ),
        test_train_ratio=0.2,
        learning_rate=0.25,
        registered_model_name=registered_model_name,
    ),
    code="./src/",  # location of source code
    command="python main.py \
        --data ${{inputs.data}} \
        --test_train_ratio ${{inputs.test_train_ratio}} \
        --learning_rate ${{inputs.learning_rate}} \
        --registered_model_name \
        ${{inputs.registered_model_name}}",
    environment=environment_name,  # inject the custom environment
    compute="cpu-cluster",
    display_name="03a_train_model_credit_default_prediction",
)
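The `command(...)` call above only builds the job definition; the job still has to be submitted to the workspace. A minimal sketch of that step follows, assuming the `ml_client` and `job` objects created in the earlier cells and the `jobs.create_or_update` method of the `azure-ai-ml` SDK. The helper name `resubmit` is hypothetical.

```python
def resubmit(ml_client, job):
    """Submit a job built with command() and report where to follow it."""
    returned_job = ml_client.jobs.create_or_update(job)
    print(f"Submitted job {returned_job.name}")
    print(f"Studio URL: {returned_job.studio_url}")
    return returned_job
```

In the notebook this would be called as `returned_job = resubmit(ml_client, job)`. To follow the logs from the notebook rather than from the studio, `ml_client.jobs.stream(returned_job.name)` can be called afterwards.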