## Distributed training of YOLOv5 models on Custom Data using Azure ML Service

### Requirements/Prerequisites
- An Azure account with an active subscription: [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure Machine Learning workspace: [Configure workspace](../../../configuration.ipynb)
- Python Environment
- Install Azure ML Python SDK Version 2
### Learning Objectives
- Connect to workspace using Python SDK v2
- Use YOLOv5-format data `.yaml` files, or data from the local system arranged in YOLO format
- Run distributed training of a YOLOv5 model

## 1. Connect to Azure Machine Learning Workspace

### 1.1 Import Libraries and connect to workspace using Default Credential

In [17]:
from azure.ai.ml import MLClient
from azure.ai.ml import command
from azure.ai.ml.entities import Data
from azure.ai.ml import Input
from azure.ai.ml import Output
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# enter details of your AML workspace and get a handle to the workspace

ml_client = MLClient(
    credential=credential,
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)

## 2. Launch the distributed training job

### 2.1 Create the job 

In this section we configure and run a standalone job:

- `command` for the distributed training job.


The `command` function allows the user to configure the following key aspects:
- `code` - This is the path where the code to run the command is located
- `command` - This is the command that needs to be run
- `inputs` - This is the dictionary of inputs using name value pairs to the command. The key is a name for the input within the context of the job and the value is the input value. Inputs can be referenced in the `command` using the `${{inputs.<input_name>}}` expression. To use files or folders as inputs, we can use the `Input` class. The `Input` class supports three parameters:
    - `type` - The type of input. This can be a `uri_file` or `uri_folder`. The default is `uri_folder`.         
    - `path` - The path to the file or folder. This can be a local or remote file or folder. For remote files, `http/https` and `wasb` are supported.
        - Azure ML `data`/`dataset` or `datastore` are of type `uri_folder`. To use `data`/`dataset` as input, reference a dataset registered in the workspace using the format `<data_name>:<version>`, for example `Input(type='uri_folder', path='my_dataset:1')`.
    - `mode` - How the data should be delivered to the compute target. Allowed values are `ro_mount`, `rw_mount` and `download`. The default is `ro_mount`.
- `environment` - This is the environment needed for the command to run. You can use a curated or custom environment from the workspace, or create a new custom environment. Check out the [environment](../../../../assets/environment/environment.ipynb) notebook for more examples.
- `compute` - The compute on which the command will run. In this example we are using a compute called `gpu-cluster` present in the workspace. You can replace it with any other compute in the workspace. You can also run it on the local machine by using `local` for the compute. This runs the command on the local machine, and all the run details and output of the job are uploaded to the Azure ML workspace.
- `distribution` - Distribution configuration for distributed training scenarios. Azure Machine Learning supports PyTorch, TensorFlow, and MPI-based distributed training.
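To make the `${{inputs.<input_name>}}` expression more concrete, here is a minimal sketch of the kind of substitution it implies. This is an illustration only, not Azure ML's actual implementation: `render_command` is a hypothetical helper, and it handles only literal values. For `Input` objects, the service substitutes the path where the data is mounted or downloaded on the compute target.

```python
import re


def render_command(template: str, inputs: dict) -> str:
    """Replace each ${{inputs.<name>}} placeholder with the matching input value.

    Hypothetical helper for illustration; Azure ML performs the real
    substitution on the service side when the job runs.
    """
    pattern = re.compile(r"\$\{\{\s*inputs\.(\w+)\s*\}\}")

    def substitute(match):
        name = match.group(1)
        if name not in inputs:
            raise KeyError(f"command references undefined input '{name}'")
        return str(inputs[name])

    return pattern.sub(substitute, template)


template = "python train.py --epochs ${{inputs.epoch}} --batch-size ${{inputs.batchsize}}"
rendered = render_command(template, {"epoch": 5, "batchsize": 32})
print(rendered)  # python train.py --epochs 5 --batch-size 32
```

The key used in the `inputs` dictionary must match the name used in the placeholder, which is why the sketch raises an error for an undefined name.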

In [18]:
job = command(
    code="./yolov5",
    command="apt-get -y install libgl1 \
                              && pip install -r requirements.txt \
                              && python train.py --data ${{inputs.data}} \
                               --epochs ${{inputs.epoch}} \
                               --batch-size ${{inputs.batchsize}} \
                               --patience ${{inputs.patience}} \
                               --weights ${{inputs.weights}}",
    # --nproc_per_node ${{inputs.nproc_per_node}}',
    # In this example, the data is stored in a public Azure Blob container.
    # The data .yaml files are at https://azuremlexamples.blob.core.windows.net/datasets/yolov5/data/
    # The Fridge Objects data is at https://azuremlexamples.blob.core.windows.net/datasets/yolov5/datasets/
    inputs={
        "data": Input(
            type="uri_file",
            # path="./yolov5/data/fridge.yaml",  # If stored locally.
            path="https://azuremlexamples.blob.core.windows.net/datasets/yolov5/data/fridge.yaml",
        ),
        "patience": 25,
        "batchsize": 32,
        "epoch": 5,
        "weights": "yolov5n.pt",
    },
    environment="AzureML-pytorch-1.8-ubuntu18.04-py37-cuda11-gpu@latest",
    compute="gpu-cluster",  # name of your cluster
    instance_count=2,  # number of nodes to use; this example assumes a 2-node cluster
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 1,  # number of GPUs per node
    },
)
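As a rough sanity check on the settings above, the total number of training processes is the product of `instance_count` and `process_count_per_instance`. The sketch below computes it. The per-process batch size shown rests on the assumption that YOLOv5's `train.py` treats `--batch-size` as the total batch size and divides it evenly across processes; verify this against the YOLOv5 version you are using. The helper name `total_worker_processes` is hypothetical.

```python
def total_worker_processes(instance_count: int, process_count_per_instance: int) -> int:
    """Return the number of training processes launched across all nodes."""
    if instance_count < 1 or process_count_per_instance < 1:
        raise ValueError("both counts must be positive integers")
    return instance_count * process_count_per_instance


# Values taken from the job definition above.
world_size = total_worker_processes(instance_count=2, process_count_per_instance=1)

# Assumption: the total batch size is split evenly across processes,
# so it should be divisible by the number of processes.
total_batch_size = 32
if total_batch_size % world_size != 0:
    raise ValueError("batch size should be a multiple of the number of processes")
per_process_batch = total_batch_size // world_size

print(world_size, per_process_batch)  # 2 16
```

If you change the number of nodes or GPUs per node, adjust the `batchsize` input so that it stays divisible by the resulting process count.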

### 2.2 Run the job

In [19]:
returned_job = ml_client.create_or_update(job)

Uploading yolov5 (63.01 MBs): 100%|██████████| 63005064/63005064 [00:06<00:00, 10438670.72it/s]
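Once the job is submitted, it is often useful to print a link to it and follow its logs. The sketch below wraps the submission step shown above in a hypothetical helper, `submit_and_stream`. It assumes the returned job exposes `name` and `studio_url`, and that `ml_client.jobs.stream` is available, as in Azure ML Python SDK v2. Streaming blocks until the job finishes.

```python
def submit_and_stream(ml_client, job):
    """Submit a job, print where to find it, and stream its logs.

    Hypothetical helper: `ml_client` is expected to be an
    azure.ai.ml.MLClient and `job` the command job defined earlier.
    """
    returned_job = ml_client.create_or_update(job)
    print(f"Job name: {returned_job.name}")
    print(f"Studio URL: {returned_job.studio_url}")
    # Blocks and prints log output until the job reaches a terminal state.
    ml_client.jobs.stream(returned_job.name)
    return returned_job
```

In the notebook this would be called as `returned_job = submit_and_stream(ml_client, job)` in place of the plain `create_or_update` call.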

