### TAO remote client - Auto Labeling

Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique in which you take a model trained on one task and retrain it for a different task. Train Adapt Optimize (TAO) Toolkit is a simple and easy-to-use Python-based AI toolkit for taking purpose-built AI models and customizing them with users' own data.

![image](https://d29g4g2dyqv443.cloudfront.net/sites/default/files/akamai/TAO/tlt-tao-toolkit-bring-your-own-model-diagram.png)


### The workflow in a nutshell

- Pulling datasets from cloud
- Getting a PTM from NGC
- Model Actions
    - Train (Normal/AutoML)
    - Evaluate
    - Inference
    - Delete experiments/datasets

### Table of contents

1. [Install TAO remote client](#head-1)
1. [FIXME's](#head-2)
1. [Login](#head-3)
1. [Create a cloud workspace](#head-2)
1. [Set dataset formats](#head-4)
1. [Create and pull train dataset](#head-5)
1. [Create and pull val dataset](#head-6)
1. [List the created datasets](#head-7)
1. [Create an experiment](#head-8)
1. [List experiments](#head-9)
1. [Assign train, eval datasets](#head-10)
1. [Assign PTM](#head-11)
1. [View hyperparameters that are enabled by default](#head-12)
1. [Train](#head-13)
1. [Set AutoML related configurations](#head-13.1)
1. [Evaluate](#head-14)
1. [TAO inference](#head-15)
1. [Delete experiment](#head-16)
1. [Delete dataset](#head-17)

### Requirements
Please find the server requirements [here](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#)

### Install TAO remote client <a class="anchor" id="head-1"></a>

In [None]:
# SKIP this step IF you have already installed the TAO-Client wheel.
! pip3 install nvidia-tao-client

In [None]:
# View the version of the TAO-Client
! tao-client --version

### Import python packages required for notebook

In [None]:
import os
import subprocess
import ast
import json
import time
from IPython.display import clear_output
from remove_corrupted_images import remove_corrupted_images_workflow

In [None]:
namespace = 'default'
job_map = {}

### To see the dataset folder structure required for the models supported in this notebook, see the notebooks under dataset_prepare, for example [this notebook](../dataset_prepare/auto_labeling.ipynb)

### FIXME's <a class="anchor" id="head-2"></a>

1. (Optional) Enable AutoML if needed in FIXME 1
1. (Optional) Choose between bayesian and hyperband automl_algorithm in FIXME 2 (If automl was enabled in FIXME1)
1. Assign a workdir in FIXME 3 for log file download
1. Assign the ip_address and port_number in FIXME 4 ([info](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_rest_api.html))
1. Assign the ngc_key variable in FIXME 5
1. Assign the ngc_org_name variable in FIXME 6
1. Set cloud storage details in FIXME 7
1. Assign path of datasets relative to the bucket in FIXME 8
1. Database backup/restore archive filename in FIXME 9

In [None]:
model_name = "mal"

#### Toggle AutoML params
[AutoML documentation](https://docs.nvidia.com/tao/tao-toolkit/text/automl/automl.html#getting-started)

In [None]:
automl_enabled = False # FIXME1 set to True if you want to run automl for the model chosen in the previous cell
automl_algorithm = "bayesian" # FIXME2 example: bayesian/hyperband

#### Toggle downloading jobs onto local

In [None]:
workdir = "workdir_auto_labeling" # FIXME3
# Creating workdir
os.makedirs(workdir, exist_ok=True)

#### Set API service's host information

In [None]:
host_url = "http://<ip_address>:<port_number>" # FIXME4 example: https://10.137.149.22:32334
# In host machine, node ip_address and port number can be obtained as follows,
# ip_address: hostname -i
# port_number: kubectl get service tao-api-ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'

#### Set NGC Personal key for authentication and NGC org to access API services

In [None]:
ngc_key = "<ngc_key>" # FIXME5 example: (Add NGC Personal key)

In [None]:
ngc_org_name = "ea-tlt" # FIXME6 your NGC ORG

### Login <a class="anchor" id="head-3"></a>

In [None]:
%env BASE_URL={host_url}/{namespace}/api/v1

# Exchange NGC_API_KEY for JWT
identity = json.loads(subprocess.getoutput(f'tao login --ngc-key {ngc_key} --ngc-org-name {ngc_org_name} --enable-telemetry'))

%env USER={identity['user_id']}
%env TOKEN={identity['token']}
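
The login cell above assumes the CLI printed a JSON object containing `user_id` and `token`. If the host is unreachable or the key is rejected, `json.loads` or the key lookup fails with a message that says little about the cause. The following is a minimal sketch of a more defensive parse; `parse_identity` is a hypothetical helper (not part of the TAO client), and the sample response is made up for illustration.

```python
import json


def parse_identity(raw_output):
    """Parse login output and check for the fields this notebook uses.

    Hypothetical helper: it assumes only that a successful login prints a
    JSON object with "user_id" and "token" keys, as the login cell does.
    """
    try:
        identity = json.loads(raw_output)
    except json.JSONDecodeError as err:
        # Surface the raw CLI output, which usually contains the real error
        raise ValueError(f"Login did not return JSON: {raw_output!r}") from err
    if not isinstance(identity, dict):
        raise ValueError(f"Login returned unexpected data: {identity!r}")
    missing = [key for key in ("user_id", "token") if key not in identity]
    if missing:
        raise ValueError(f"Login response is missing keys: {missing}")
    return identity


# Made-up response for illustration, not real credentials
sample_output = '{"user_id": "example-user", "token": "example-token"}'
identity = parse_identity(sample_output)
print(identity["user_id"])
```

In the notebook, the string passed to `parse_identity` would be the output of the login command rather than a literal.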

### Get NVCF gpu details <a class="anchor" id="head-2"></a>

One of the keys of the response JSON is to be used as platform_id when you run each job.

In [None]:
# # Valid only for NVCF backend during TAO-API helm deployment currently
# response = json.loads(subprocess.getoutput(f'tao get-gpu-types'))
# print((json.dumps(response, indent=4)))

### Create cloud workspace
This workspace will be the place where your datasets reside and your results of TAO API jobs will be pushed to.

If you want to have different workspaces for datasets and experiments, duplicate the workspace creation part and adjust the metadata accordingly.

In [None]:
#FIXME7 Dataset Cloud bucket details to download dataset for experiments (Can be read only)
workspace_name = "AWS workspace info"  # A Representative name for this cloud info
cloud_type = "aws"  # If it's AWS, HuggingFace or Azure

cloud_metadata = {}
cloud_metadata["cloud_region"] = "us-west-1"  # Bucket region
cloud_metadata["cloud_bucket_name"] = ""  # Bucket name
# Access and Secret for AWS
cloud_metadata["access_key"] = ""
cloud_metadata["secret_key"] = ""

In [None]:
workspace_id = subprocess.getoutput(f"tao-client {model_name} workspace-create --name '{workspace_name}' --cloud_type {cloud_type} --cloud_details '{json.dumps(cloud_metadata)}'")
print(workspace_id)
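
The command above wraps the workspace name and the JSON metadata in single quotes by hand, which breaks if either value itself contains a single quote. Because `subprocess.getoutput` runs the string through a shell, one option is to quote every argument with `shlex`. This is a sketch only: `build_workspace_command` is a hypothetical helper, and it reuses the flags already shown in the cell above without adding any.

```python
import json
import shlex


def build_workspace_command(model_name, workspace_name, cloud_type, cloud_metadata):
    """Build the workspace-create command with every argument shell-quoted."""
    args = [
        "tao-client", model_name, "workspace-create",
        "--name", workspace_name,
        "--cloud_type", cloud_type,
        "--cloud_details", json.dumps(cloud_metadata),
    ]
    # shlex.join quotes each argument so the shell sees it as one token
    return shlex.join(args)


# Illustrative values; the apostrophe would break hand-written quoting
command = build_workspace_command(
    "mal", "Anna's AWS workspace", "aws", {"cloud_region": "us-west-1"}
)
print(command)
```

The resulting string can be passed to `subprocess.getoutput` in the same way as the hand-built f-string.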

In [None]:
# #Optional: Restore database with a mongodump file saved in workspace dump/archive/{backup_filename}
# backup_file_name = "mongodump.tar.gz" # FIXME 9
# response = subprocess.getoutput(f"tao-client {model_name} workspace-restore --workspace {workspace_id} --backup_file_name {backup_file_name}")
# print(response)

#### Set dataset path (path within cloud bucket)

In [None]:
# FIXME8 : Set paths relative to cloud bucket
train_dataset_path = "/data/auto_label_train"
eval_dataset_path = "/data/auto_label_val"

### Function to parse logs

In [None]:
def my_tail(model_name_cli, id, job_id, job_type, workdir):
    """Poll the job status and print its logs until the job reaches a terminal state."""
    status = None
    while True:
        time.sleep(10)
        clear_output(wait=True)
        response = subprocess.getoutput(f"tao-client {model_name_cli} get-action-status --job_type {job_type} --id {id} --job {job_id}")
        response = json.loads(response)
        if response and "status" in response.keys() and response.get("status") in ("Done", "Error", "Canceled", "Paused"):
            print(json.dumps(response, indent=4))
            status = response.get("status")
            break

        logs = subprocess.getoutput(f"tao-client {model_name_cli} get-job-logs --id {id} --job {job_id} --job_type {job_type} --workdir {workdir}")
        if not logs:
            continue
        log_content_lines = logs.split("\n")
        for line in log_content_lines:
            print(line.strip())
            # A terminal marker line in the logs also ends the polling loop
            if line.strip() == "Error EOF":
                status = "Error"
                break
            elif line.strip() == "Done EOF":
                status = "Done"
                break
        if status is not None:
            break
    return status
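
The log-scanning part of `my_tail` can be read in isolation: it looks for a line that is exactly `Error EOF` or `Done EOF` (ignoring surrounding whitespace) and stops at the first one found. A small sketch of that logic as a pure function, which makes it easy to try without a running job; `status_from_logs` is a hypothetical name and the sample log text is made up.

```python
def status_from_logs(logs):
    """Return "Error" or "Done" for the first terminal marker line, else None."""
    markers = {"Error EOF": "Error", "Done EOF": "Done"}
    for line in logs.split("\n"):
        status = markers.get(line.strip())
        if status is not None:
            return status
    return None


# Made-up log text for illustration
sample_logs = "epoch 1 finished\nepoch 2 finished\n  Done EOF  "
print(status_from_logs(sample_logs))
```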

### Set dataset formats <a class="anchor" id="head-4"></a>

In [None]:
ds_type = "segmentation"
ds_format = "default"

### Create and pull train dataset <a class="anchor" id="head-5"></a>

In [None]:
train_dataset_id = subprocess.getoutput(f"tao-client {model_name} dataset-create --dataset_type {ds_type} --dataset_format {ds_format} --workspace {workspace_id} --cloud_file_path {train_dataset_path} --use_for '{json.dumps(['training'])}'")
print(train_dataset_id)

In [None]:
# Check progress
while True:
    clear_output(wait=True)
    response = subprocess.getoutput(f"tao-client {model_name} get-metadata --id {train_dataset_id} --job_type dataset")
    response = json.loads(response)
    print(json.dumps(response, sort_keys=True, indent=4))
    if response.get("status") == "invalid_pull":
        raise ValueError("Dataset pull failed")
    if response.get("status") == "pull_complete":
        break
    time.sleep(5)
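
The progress loop above runs until the dataset reports `pull_complete` or `invalid_pull`, so a pull that stalls in some other state would keep the cell running indefinitely. One way to bound it is a polling helper with a timeout. The sketch below is an assumption-laden illustration: `wait_for_status` is a hypothetical helper, the 600-second default is arbitrary, and the intermediate status names in the example are invented; only `pull_complete` and `invalid_pull` come from the cell above.

```python
import time


def wait_for_status(fetch_status, done, failed, interval=5.0, timeout=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it returns a status in `done`.

    Raises ValueError on a status in `failed` and TimeoutError once
    `timeout` seconds have elapsed without reaching a terminal status.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in failed:
            raise ValueError(f"Polling stopped on failure status: {status}")
        if status in done:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Still at status {status!r} after {timeout} seconds")
        sleep(interval)


# Simulated status sequence; the first two names are invented for the demo
states = iter(["starting", "in_progress", "pull_complete"])
final = wait_for_status(lambda: next(states), done={"pull_complete"},
                        failed={"invalid_pull"}, sleep=lambda _: None)
print(final)
```

In the notebook, `fetch_status` would wrap the `get-metadata` call and return `response.get("status")`.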

#### Uncomment if you want to remove corrupted images in your dataset

In [None]:
# # This wraps creating a data-services experiment and running the job that removes corrupted images
# try:
#     from remove_corrupted_images import remove_corrupted_images_workflow
#     train_dataset_id = remove_corrupted_images_workflow(workspace_id, train_dataset_id)
# except Exception as e:
#     raise e

### Create and pull val dataset <a class="anchor" id="head-6"></a>

In [None]:
eval_dataset_id = subprocess.getoutput(f"tao-client {model_name} dataset-create --dataset_type {ds_type} --dataset_format {ds_format} --workspace {workspace_id} --cloud_file_path {eval_dataset_path} --use_for '{json.dumps(['evaluation'])}'")
print(eval_dataset_id)

In [None]:
# Check progress
while True:
    clear_output(wait=True)
    response = subprocess.getoutput(f"tao-client {model_name} get-metadata --id {eval_dataset_id} --job_type dataset")
    response = json.loads(response)
    print(json.dumps(response, sort_keys=True, indent=4))
    if response.get("status") == "invalid_pull":
        raise ValueError("Dataset pull failed")
    if response.get("status") == "pull_complete":
        break
    time.sleep(5)

#### Uncomment if you want to remove corrupted images in your dataset

In [None]:
# # This wraps creating a data-services experiment and running the job that removes corrupted images
# try:
#     from remove_corrupted_images import remove_corrupted_images_workflow
#     eval_dataset_id = remove_corrupted_images_workflow(workspace_id, eval_dataset_id)
# except Exception as e:
#     raise e

### List datasets <a class="anchor" id="head-7"></a>

In [None]:
message = subprocess.getoutput(f"tao-client {model_name} list-datasets")
message = ast.literal_eval(message)
for rsp in message:
    rsp_keys = rsp.keys()
    assert "id" in rsp_keys
    assert "type" in rsp_keys
    assert "format" in rsp_keys
    assert "name" in rsp_keys
    print(rsp["id"],"\t",rsp["type"],"\t",rsp["format"],"\t\t",rsp["name"])
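
The listing cell parses the CLI output with `ast.literal_eval`, which accepts Python literals such as `None` and `True` but rejects JSON spellings such as `null` and `true`. If the output format is not known in advance, a parser can try JSON first and fall back to Python literals. This is a sketch under that assumption; `parse_cli_list` is a hypothetical helper and the sample strings are made up.

```python
import ast
import json


def parse_cli_list(raw_output):
    """Parse CLI output that holds a list, as JSON or as a Python literal."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        # Fall back to Python literal syntax (single quotes, None, True)
        parsed = ast.literal_eval(raw_output)
    if not isinstance(parsed, list):
        raise ValueError(f"Expected a list, got {type(parsed).__name__}")
    return parsed


# Made-up outputs in both styles
python_style = "[{'id': 'abc', 'name': None}]"
json_style = '[{"id": "abc", "name": null}]'
print(parse_cli_list(python_style) == parse_cli_list(json_style))
```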

### Create an experiment <a class="anchor" id="head-8"></a>

In [None]:
network_arch = model_name
experiment_id = subprocess.getoutput(f"tao-client {model_name} experiment-create --network_arch {network_arch} --encryption_key tlt_encode  --workspace {workspace_id}")
print(experiment_id)

### List experiments <a class="anchor" id="head-9"></a>

In [None]:
# List all user created experiments for the chosen network architecture
filter_params = {"network_arch": network_arch}
message = subprocess.getoutput(f"tao-client {model_name} list-experiments --filter_params '{json.dumps(filter_params)}'")
message = ast.literal_eval(message)
for rsp in message:
    rsp_keys = rsp.keys()
    if "encryption_key" not in rsp.keys():
        assert "name" in rsp_keys and "version" in rsp_keys and "ngc_path" in rsp_keys
        print(rsp["name"],rsp["id"],rsp["network_arch"])

### Assign train, eval datasets <a class="anchor" id="head-10"></a>

In [None]:
docker_env_vars = {} # Add any variables to pass to the Docker run-time when a job is triggered, such as MLOps variables
dataset_information = {"train_datasets":[train_dataset_id],
                       "eval_dataset":eval_dataset_id,
                       "inference_dataset":eval_dataset_id,
                       "docker_env_vars": docker_env_vars,
                       "metric": "train_loss"}
patched_model = subprocess.getoutput(f"tao-client {model_name} patch-artifact-metadata --id {experiment_id} --job_type experiment --update_info '{json.dumps(dataset_information)}' ")
print(patched_model)

### Assign PTM <a class="anchor" id="head-11"></a>

In [None]:
# List all pretrained models for the chosen network architecture
filter_params = {"network_arch": network_arch}
message = subprocess.getoutput(f"tao-client {model_name} list-base-experiments --filter_params '{json.dumps(filter_params)}'")
message = ast.literal_eval(message)
for rsp in message:
    rsp_keys = rsp.keys()
    if "encryption_key" not in rsp.keys():
        assert "name" in rsp_keys and "version" in rsp_keys and "ngc_path" in rsp_keys
        print(f'PTM Name: {rsp["name"]}; PTM version: {rsp["version"]}; NGC PATH: {rsp["ngc_path"]}')

In [None]:
# Assigning pretrained models
# From the output of the previous cell, update this map if you want to change the default PTM backbone.
# Changing the default backbone also requires changing the default spec/config for train, evaluate, etc.
# For example, if you change the PTM to resnet34, you must manually set the config key num_layers (if it exists) to 34.
pretrained_map = {"mal" : "mask_auto_label:trainable_v1.0"}
no_ptm_models = set([])

In [None]:
if network_arch not in no_ptm_models:
    filter_params = {"network_arch": network_arch}
    message = subprocess.getoutput(f"tao-client {model_name} list-base-experiments --filter_params '{json.dumps(filter_params)}'")
    message = ast.literal_eval(message)
    ptm = []
    for rsp in message:
        rsp_keys = rsp.keys()
        assert "ngc_path" in rsp_keys
        if rsp["ngc_path"].endswith(pretrained_map[network_arch]):
            assert "id" in rsp_keys
            ptm_id = rsp["id"]
            ptm = [ptm_id]
            print("Metadata for model with requested NGC Path")
            print(rsp)
            break
    print(ptm)
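
The selection loop above picks the first base experiment whose `ngc_path` ends with the requested suffix. If nothing matches, `ptm` stays an empty list and the next cell patches the experiment with it. A sketch of the same lookup as a function that returns `None` on no match, so the caller can stop early; `find_base_experiment_id` is a hypothetical helper and the entries below are invented sample data, not real NGC paths.

```python
def find_base_experiment_id(experiments, ngc_path_suffix):
    """Return the id of the first entry whose ngc_path ends with the suffix."""
    for entry in experiments:
        if entry.get("ngc_path", "").endswith(ngc_path_suffix):
            return entry["id"]
    return None


# Invented entries for illustration only
sample_experiments = [
    {"id": "id-1", "ngc_path": "example_org/example_team/other_model:v1"},
    {"id": "id-2", "ngc_path": "example_org/example_team/mask_auto_label:trainable_v1.0"},
]
ptm_id = find_base_experiment_id(sample_experiments, "mask_auto_label:trainable_v1.0")
if ptm_id is None:
    raise ValueError("No base experiment matches the requested NGC path")
print([ptm_id])
```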

In [None]:
if network_arch not in no_ptm_models:
    ptm_information = {"base_experiment":ptm}
    patched_model = subprocess.getoutput(f"tao-client {model_name} patch-artifact-metadata --id {experiment_id} --job_type experiment --update_info '{json.dumps(ptm_information)}' ")
    print(patched_model)

### Train <a class="anchor" id="head-13"></a>

#### View hyperparameters that are enabled for AutoML by default <a class="anchor" id="head-12"></a>

In [None]:
if automl_enabled:
    # View default automl params enabled
    automl_params = subprocess.getoutput(f"tao-client {model_name} model-automl-defaults --id {experiment_id}")

#### Set AutoML related configurations <a class="anchor" id="head-13.1"></a>
Refer to these hyperlinks to see the parameters supported by each network, and add parameters if necessary beyond those enabled for AutoML by default:

In [None]:
if automl_enabled:
    # Choose any metric that is present in the kpi dictionary present in the model's status.json. 
    # Example status.json for each model can be found in the respective section in NVIDIA TAO DOCS here: https://docs.nvidia.com/tao/tao-toolkit/text/model_zoo/cv_models/index.html
    metric="kpi"

    #Refer to parameter list mentioned in the above links and add/remove any extra parameter in addition to the default enabled ones in automl_specs

    automl_information = {"automl_enabled":automl_enabled,
                          "automl_algorithm":automl_algorithm,
                          "automl_max_recommendations": 20, # Only for bayesian
                          "automl_R": 27, # Only for hyperband
                          "automl_nu": 3, # Only for hyperband
                          "epoch_multiplier": 1, # Only for hyperband
                          # Warning: The parameters that are disabled are not tested by TAO, so there might be unexpected behaviour in overriding this
                          "override_automl_disabled_params": False,
                          "automl_hyperparameters":json.loads(automl_params)
                         }
    patch_metadata = {"metric": metric, "automl_settings": automl_information}
    patched_model = subprocess.getoutput(f"tao-client {model_name} patch-artifact-metadata --id {experiment_id} --job_type experiment --update_info '{json.dumps(patch_metadata)}' ")
    patched_model = json.loads(patched_model)
    print(json.dumps(patched_model, indent=4))
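
A typo in `automl_algorithm` (FIXME 2) would otherwise only surface after the settings have been sent to the service. The sketch below builds the same settings dictionary as the cell above, with the same keys and default values, and rejects an unknown algorithm name up front. `build_automl_settings` is a hypothetical helper, and it assumes that `bayesian` and `hyperband` are the only valid values, as the FIXME comment suggests.

```python
def build_automl_settings(algorithm, hyperparameters, max_recommendations=20,
                          R=27, nu=3, epoch_multiplier=1):
    """Build the automl_settings dictionary, rejecting unknown algorithms."""
    if algorithm not in ("bayesian", "hyperband"):
        raise ValueError(f"Unknown automl_algorithm: {algorithm!r}")
    return {
        "automl_enabled": True,
        "automl_algorithm": algorithm,
        "automl_max_recommendations": max_recommendations,  # Only for bayesian
        "automl_R": R,  # Only for hyperband
        "automl_nu": nu,  # Only for hyperband
        "epoch_multiplier": epoch_multiplier,  # Only for hyperband
        "override_automl_disabled_params": False,
        "automl_hyperparameters": hyperparameters,
    }


# Empty hyperparameter list used as a placeholder for illustration
settings = build_automl_settings("bayesian", hyperparameters=[])
print(sorted(settings))
```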

#### Provide train specs

In [None]:
# Default train model specs
train_specs = subprocess.getoutput(f"tao-client {model_name} get-spec --action train --job_type experiment --id {experiment_id}")
train_specs = json.loads(train_specs)
print(json.dumps(train_specs, indent=4))

In [None]:
# Customize train model specs
train_specs["train"]["num_gpus"] = 1
train_specs["train"]["gpu_ids"] = [0]
train_specs["train"]["num_epochs"] = 5
train_specs["train"]["checkpoint_interval"] = 5
train_specs["train"]["validation_interval"] = 5
print(json.dumps(train_specs, indent=4))
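
Printing the full spec after customization makes it hard to see which values actually changed. One option is to keep a deep copy of the default spec and print only the differences. The following is a sketch, not part of the TAO client: `diff_specs` is a hypothetical helper, and the default values in the example are invented. Note that it treats a missing key and a key set to `None` as the same thing.

```python
import copy


def diff_specs(before, after, prefix=""):
    """Return {dotted_key: (old, new)} for every value that differs."""
    changes = {}
    for key in sorted(set(before) | set(after)):
        path = f"{prefix}{key}"
        old, new = before.get(key), after.get(key)
        if isinstance(old, dict) and isinstance(new, dict):
            # Recurse into nested sections such as "train"
            changes.update(diff_specs(old, new, prefix=f"{path}."))
        elif old != new:
            changes[path] = (old, new)
    return changes


# Invented default values for illustration
default_specs = {"train": {"num_epochs": 10, "num_gpus": 1}}
custom_specs = copy.deepcopy(default_specs)
custom_specs["train"]["num_epochs"] = 5
custom_specs["train"]["checkpoint_interval"] = 5
for name, (old, new) in diff_specs(default_specs, custom_specs).items():
    print(f"{name}: {old} -> {new}")
```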

#### Run train action

In [None]:
# Add --platform_id uuid for NVCF backend, where the uuid is a key from output of tao-client gpu-types
job_id = subprocess.getoutput(f"tao-client {model_name} experiment-run-action --action train --id {experiment_id} --specs '{json.dumps(train_specs)}'")
job_map["train_" + model_name] = job_id
print(job_id)

In [None]:
# Monitor job status
if automl_enabled:
    while True:
        clear_output(wait=True)
        response = subprocess.getoutput(f"tao-client {model_name} get-action-status --job_type experiment --id {experiment_id} --job {job_id}")
        response = json.loads(response)
        if "error_desc" in response.keys() and response["error_desc"] in ("Job trying to retrieve not found", "No AutoML run found"):
            print("Job is being created")
            time.sleep(5)
            continue
        print(json.dumps(response, sort_keys=True, indent=4))
        assert "status" in response.keys() and response.get("status") != "Error"
        if response.get("status") in ["Done","Error"]:
            break
        time.sleep(15)
else:
    # Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
    status = my_tail(model_name, experiment_id, job_id, "experiment", workdir)

In [None]:
## To Stop an AutoML JOB
#    1. Stop the 'Monitor job status' cell (the cell right before this cell) manually
#    2. Uncomment the snippet in the next cell and run the cell

In [None]:
# if automl_enabled:
#     job_id = job_map["train_" + model_name]
#     job_id = subprocess.getoutput(f"tao-client {model_name} job-pause --job_type experiment --id {experiment_id} --job {job_id}")
#     job_map["canceled_" + model_name] = job_id
#     print(job_id)

In [None]:
## Resume AutoML

In [None]:
# Uncomment the below snippet if you want to resume an already stopped AutoML job and then run the 'Monitor job status' cell above (4th cell above from this cell)
# if automl_enabled:
#     job_id = job_map["train_" + model_name]
#     job_id = subprocess.getoutput(f"tao-client {model_name} job-resume --id {experiment_id} --job {job_id}  --parent_job_id {parent}  --specs '{json.dumps(train_specs)}'")
#     job_map["resumed_" + model_name] = job_id
#     print(job_id)

### Publish model

#### Edit the method of choosing checkpoint from list of train checkpoint files

In [None]:
# Print model handler parameters
model_parameters = subprocess.getoutput(f"tao-client {model_name} get-metadata --id {experiment_id} --job_type experiment")
model_parameters = json.loads(model_parameters)
update_checkpoint_choosing = {}
update_checkpoint_choosing["checkpoint_choose_method"] = model_parameters["checkpoint_choose_method"]
update_checkpoint_choosing["checkpoint_epoch_number"] = model_parameters["checkpoint_epoch_number"]
print(json.dumps(update_checkpoint_choosing, indent=4))

In [None]:
# Change the method by which checkpoint from the parent action is chosen, when parent action is a train/retrain action.
# Example for evaluate action below, can be applied in the same way for other actions too
update_checkpoint_choosing["checkpoint_choose_method"] = "latest_model" # Choose between best_model/latest_model/from_epoch_number
# If from_epoch_number is chosen then assign the epoch number to the dictionary key in the format 'from_epoch_number_{train_job_id}', as in the example below
# update_checkpoint_choosing["checkpoint_epoch_number"]["from_epoch_number_c2f76eb7-2a75-4197-9a84-c1547f20c17d"] = 6

patched_model = subprocess.getoutput(f"tao-client {model_name} patch-artifact-metadata --id {experiment_id} --job_type experiment --update_info '{json.dumps(update_checkpoint_choosing)}'")
patched_model = json.loads(patched_model)
print(json.dumps(patched_model, indent=4))

#### Push model to private ngc team registry

In [None]:
display_name = f"TAO {model_name}"  # Display name for the model to be published on the model card
description = f"Train {model_name}"  # Short description for the model to be published on the model card
team = "tao_ea"  # Team within org for the model to be published to

job_id = job_map["train_" + model_name]
message = subprocess.getoutput(f"tao-client {model_name} publish-model --id {experiment_id} --job {job_id} --job_type experiment --display_name='{display_name}' --description='{description}' --team {team}")
print(message)

#### Remove model from private ngc team registry

In [None]:
# message = subprocess.getoutput(f"tao-client {model_name} remove-published-model --id {experiment_id} --job {job_id} --job_type experiment --team {team}")
# print(message)

### Evaluate <a class="anchor" id="head-14"></a>

#### Provide evaluate specs

In [None]:
# Default evaluate model specs
eval_specs = subprocess.getoutput(f"tao-client {model_name} get-spec --action evaluate --job_type experiment --id {experiment_id}")
eval_specs = json.loads(eval_specs)
print(json.dumps(eval_specs, indent=4))

In [None]:
# Customize evaluate model specs
# Change any spec if you wish
print(json.dumps(eval_specs, indent=4))

#### Run evaluate

In [None]:
# Add --platform_id uuid for NVCF backend, where the uuid is a key from output of tao-client gpu-types
parent = job_map["train_" + model_name]
job_id = subprocess.getoutput(f"tao-client {model_name} experiment-run-action --action evaluate --id {experiment_id} --parent_job_id {parent} --specs '{json.dumps(eval_specs)}'")
job_map["eval_" + model_name] = job_id
print(job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
status = my_tail(model_name, experiment_id, job_id, "experiment", workdir)

### TAO inference <a class="anchor" id="head-15"></a>

#### Provide TAO inference specs

In [None]:
# Default inference model specs
tao_inference_specs = subprocess.getoutput(f"tao-client {model_name} get-spec --id {experiment_id} --action inference --job_type experiment")
tao_inference_specs = json.loads(tao_inference_specs)
print(json.dumps(tao_inference_specs, indent=4))

In [None]:
# Customize TAO inference specs
#Apply changes to the specs dictionary here if required
print(json.dumps(tao_inference_specs, indent=4))

#### Run TAO inference

In [None]:
# Add --platform_id uuid for NVCF backend, where the uuid is a key from output of tao-client gpu-types
parent = job_map["train_" + model_name]
job_id = subprocess.getoutput(f"tao-client {model_name} experiment-run-action --action inference --id {experiment_id} --parent_job_id {parent} --specs '{json.dumps(tao_inference_specs)}'")
job_map["tao_inference_" + model_name] = job_id
print(job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
status = my_tail(model_name, experiment_id, job_id, "experiment", workdir)

In [None]:
# # Optional: Backup database with a mongodump file saved in workspace dump/archive/{backup_filename}
# backup_file_name = "mongodump.tar.gz" # FIXME 9
# subprocess.getoutput(f"tao-client {model_name} workspace-backup --workspace {workspace_id} --backup_file_name {backup_file_name}")

### Delete experiment <a class="anchor" id="head-16"></a>

In [None]:
subprocess.getoutput(f"tao-client {model_name} experiment-delete --id {experiment_id}")

### Delete dataset <a class="anchor" id="head-17"></a>

#### Delete train dataset

In [None]:
subprocess.getoutput(f"tao-client {model_name} dataset-delete --id {train_dataset_id}")

#### Delete val dataset

In [None]:
subprocess.getoutput(f"tao-client {model_name} dataset-delete --id {eval_dataset_id}")