### Notebook to demonstrate TAO Object Detection workflow

Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique in which a model trained on one task is retrained for use on a different task. Train Adapt Optimize (TAO) Toolkit is a simple, easy-to-use, Python-based AI toolkit for taking purpose-built AI models and customizing them with users' own data.

![image](https://d29g4g2dyqv443.cloudfront.net/sites/default/files/akamai/TAO/tlt-tao-toolkit-bring-your-own-model-diagram.png)

### Sample prediction for an Object Detection model
<img align="center" src="../example_images/sample_object_detection.jpg" width="960">

### The workflow in a nutshell

- Pulling datasets from cloud
- Running dataset convert
- Getting a PTM from NGC
- Model Actions
    - Train (Normal/AutoML)
    - Evaluate
    - Prune, retrain
    - Export
    - TAO-Deploy
    - Inference on TAO, TRT
    - Delete experiments/dataset
    
### Table of contents

1. [FIXME's](#head-1)
1. [Login](#head-2)
1. [Create a cloud workspace](#head-2)
1. [Set dataset formats](#head-3)
1. [Create and pull train dataset](#head-4)
1. [Create and pull val dataset](#head-5)
1. [Create and pull test dataset](#head-6)
1. [List the created datasets](#head-7)
1. [Train Dataset convert action](#head-8) (for specific models)
1. [Val dataset convert action](#head-9) (for specific models)
1. [Create an experiment](#head-10)
1. [List experiments](#head-11)
1. [Assign train, eval datasets](#head-12)
1. [Assign PTM](#head-13)
1. [Set AutoML related configurations](#head-16.1)
1. [Actions](#head-15)
1. [Train](#head-16)
1. [View hyperparameters that are enabled by default](#head-14)
1. [Evaluate](#head-17)
1. [Optimize: Prune, retrain and evaluate](#head-18)
1. [Export](#head-19)
1. [TRT Engine generation using TAO-Deploy](#head-20)
1. [TAO inference](#head-21)
1. [TRT inference](#head-22)
1. [Delete experiment](#head-23)
1. [Delete dataset](#head-24)

### Requirements
Please find the server requirements [here](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#)

In [None]:
import json
import os
import requests
import time
from IPython.display import clear_output
import glob

### To see the dataset folder structure required by the models supported in this notebook, see the notebooks under dataset_prepare, for example [this notebook](../dataset_prepare/object_detection.ipynb)

### FIXME's <a class="anchor" id="head-1"></a>

1. Assign a model_name in FIXME 1
1. (Optional) Enable AutoML if needed in FIXME 2
1. (Optional) Choose between bayesian and hyperband for automl_algorithm in FIXME 3 (if AutoML was enabled in FIXME 2)
1. Assign the ip_address and port_number in FIXME 4 ([info](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_rest_api.html))
1. Assign the ngc_key variable in FIXME 5
1. Assign the ngc_org_name variable in FIXME 6
1. Set cloud storage details in FIXME 7
1. Assign path of datasets relative to the bucket in FIXME 8
1. (Optional) Enable Tensorboard session for experiment in FIXME 9

#### Choose an object detection model

In [None]:
# Define model_name, workspaces, and other variables
# Available models (#FIXME 1):
# 1. deformable_detr - https://docs.nvidia.com/tao/tao-toolkit/text/object_detection/deformable_detr.html
# 2. dino - https://docs.nvidia.com/tao/tao-toolkit/text/object_detection/dino.html
# 3. grounding_dino - https://docs.nvidia.com/tao/tao-toolkit/text/object_detection/grounding_dino.html
# 4. rtdetr - https://docs.nvidia.com/tao/tao-toolkit/text/object_detection/rtdetr.html

model_name = "dino" # FIXME1 (Add the model name from the above mentioned list)

#### Toggle AutoML params
[AutoML documentation](https://docs.nvidia.com/tao/tao-toolkit/text/automl/automl.html#getting-started)

In [None]:
automl_enabled = False # FIXME2 set to True if you want to run automl for the model chosen in the previous cell
automl_algorithm = "bayesian" # FIXME3 example: bayesian/hyperband

#### Set API service's host information

In [None]:
host_url = "http://<ip_address>:<port_number>" # FIXME4 example: https://10.137.149.22:32334
# In host machine, node ip_address and port number can be obtained as follows,
# ip_address: hostname -i
# port_number: kubectl get service tao-api-ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'

#### Set NGC Personal key for authentication and NGC org to access API services

In [None]:
ngc_key = "<ngc_key>" # FIXME5 example: (Add NGC Personal key)

In [None]:
ngc_org_name = "nvstaging" # FIXME6 your NGC ORG

### Login <a class="anchor" id="head-2"></a>

In [None]:
# Validate NGC_PERSONAL_KEY
data = json.dumps({"ngc_org_name": ngc_org_name,
                   "ngc_key": ngc_key,
                   "enable_telemetry": True})
response = requests.post(f"{host_url}/api/v1/login", data=data)
assert response.ok, response.text
assert "token" in response.json().keys()
token = response.json()["token"]
print("JWT",token)

# Set base URL
base_url = f"{host_url}/api/v1/orgs/{ngc_org_name}"
print("API Calls will be forwarded to",base_url)

headers = {"Authorization": f"Bearer {token}"}
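
Every later cell depends on two values derived in the login cell: the org-scoped base URL and the Authorization header. The following is a minimal sketch of grouping that derivation. `build_api_context` is a hypothetical helper name used only for illustration; it is not part of the TAO API or of `requests`, and it only mirrors the string formatting shown above.

```python
def build_api_context(host_url, ngc_org_name, token):
    """Return the org-scoped base URL and auth headers used by later cells.

    Hypothetical helper for illustration; it sends no request and only
    repeats the formatting done in the login cell.
    """
    # Drop a trailing slash so the joined URL never contains "//api"
    base_url = f"{host_url.rstrip('/')}/api/v1/orgs/{ngc_org_name}"
    headers = {"Authorization": f"Bearer {token}"}
    return base_url, headers


# Example with placeholder values (no network access needed)
example_base_url, example_headers = build_api_context(
    "http://10.0.0.1:32080/", "my_org", "abc123"
)
print(example_base_url)
```

Keeping the formatting in one place makes it easier to re-run the login step after a token expires without editing several cells.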

### Get NVCF gpu details <a class="anchor" id="head-2"></a>

Use one of the keys of the response JSON as the platform_id when you run each job.

In [None]:
# # Valid only for NVCF backend during TAO-API helm deployment currently
# endpoint = f"{base_url}:gpu_types"
# response = requests.get(endpoint, headers=headers)

# assert response.ok, response.text
# print(response)
# print((json.dumps(response.json(), indent=4)))

### Create cloud workspace
This workspace is where your datasets reside and where the results of TAO API jobs are pushed.

If you want different workspaces for datasets and experiments, duplicate the workspace creation part and adjust the metadata accordingly.

In [None]:
#FIXME7 Dataset Cloud bucket details to download dataset for experiments (Can be read only)
cloud_metadata = {}
cloud_metadata["name"] = "AWS workspace info"  # A Representative name for this cloud info
cloud_metadata["cloud_type"] = "aws"  # If it's AWS, HuggingFace or Azure
cloud_metadata["cloud_specific_details"] = {}
cloud_metadata["cloud_specific_details"]["cloud_region"] = "us-west-1"  # Bucket region
cloud_metadata["cloud_specific_details"]["cloud_bucket_name"] = ""  # Bucket name
# Access and Secret for AWS
cloud_metadata["cloud_specific_details"]["access_key"] = ""
cloud_metadata["cloud_specific_details"]["secret_key"] = ""
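
Because the bucket name and credentials above are left empty until FIXME 7 is filled in, it can help to check the dictionary before posting it. The following is a sketch only: `missing_cloud_fields` is a hypothetical helper, and the list of required fields is an assumption taken from the AWS example in this cell; other cloud types may need different fields.

```python
def missing_cloud_fields(metadata):
    """Return dotted names of workspace fields that are absent or empty.

    The required-field lists are assumptions based on the AWS example above,
    not a schema published by the API.
    """
    required_top = ("name", "cloud_type")
    required_details = ("cloud_region", "cloud_bucket_name", "access_key", "secret_key")

    missing = [key for key in required_top if not metadata.get(key)]
    details = metadata.get("cloud_specific_details") or {}
    missing += [
        f"cloud_specific_details.{key}"
        for key in required_details
        if not details.get(key)
    ]
    return missing


# Example: the template from this cell, before FIXME 7 is filled in
example_metadata = {
    "name": "AWS workspace info",
    "cloud_type": "aws",
    "cloud_specific_details": {
        "cloud_region": "us-west-1",
        "cloud_bucket_name": "",
        "access_key": "",
        "secret_key": "",
    },
}
print(missing_cloud_fields(example_metadata))
```

An empty list means every field in the assumed set has a value; it does not confirm that the credentials are valid.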

In [None]:
# Create cloud workspace
data = json.dumps(cloud_metadata)

endpoint = f"{base_url}/workspaces"

response = requests.post(endpoint,data=data,headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), indent=4))

assert "id" in response.json().keys()
workspace_id = response.json()["id"]

#### Set dataset path (path within cloud bucket)

In [None]:
# FIXME8 : Set paths relative to cloud bucket
train_dataset_path =  "/data/object_detection_train"
eval_dataset_path = "/data/object_detection_val"
test_dataset_path = "/data/object_detection_test"

num_classes = 4  # Change this to number of classes of the dataset in cloud

### Set dataset formats <a class="anchor" id="head-3"></a>

In [None]:
# Set the dataset type and format
ds_type = "object_detection"
ds_format = "coco"
if model_name == "grounding_dino":
    ds_format = "odvg"

### Create and pull train dataset <a class="anchor" id="head-4"></a>

In [None]:
# Create train dataset
train_dataset_metadata = {"type": ds_type,
                          "format": ds_format,
                          "workspace":workspace_id,
                          "cloud_file_path": train_dataset_path,
                          "use_for": ["training"]
                          }
data = json.dumps(train_dataset_metadata)

endpoint = f"{base_url}/datasets"

response = requests.post(endpoint,data=data,headers=headers)
assert response.ok, response.text
assert "id" in response.json().keys()

print(response)
print(json.dumps(response.json(), indent=4))
train_dataset_id = response.json()["id"]

In [None]:
# Check progress
endpoint = f"{base_url}/datasets/{train_dataset_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)

    assert response.ok, response.text  # Fail early on HTTP errors, before parsing the body
    print(response)
    if response.json().get("status") == "invalid_pull":
        print(json.dumps(response.json().get("validation_details"), indent=4))
        raise ValueError("Dataset pull failed")
    if response.json().get("status") == "pull_complete":
        print(json.dumps(response.json(), indent=4))
        break
    time.sleep(5)
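
The same polling logic is needed for each dataset, so it could be factored into a helper. The following is a sketch under stated assumptions: `wait_for_dataset_pull` is a hypothetical name, the `invalid_pull` and `pull_complete` status strings are the ones used in the cell above, and the `max_polls` cap is an addition that the notebook loop does not have.

```python
import time


def wait_for_dataset_pull(fetch_dataset, poll_seconds=5, max_polls=None, sleep=time.sleep):
    """Poll until the dataset reports pull_complete; raise on invalid_pull.

    fetch_dataset: zero-argument callable returning the dataset metadata dict,
        for example lambda: requests.get(endpoint, headers=headers).json()
    max_polls: optional cap on the number of polls (None polls indefinitely).
    sleep: injectable so the helper can be exercised without waiting.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        metadata = fetch_dataset()
        status = metadata.get("status")
        if status == "invalid_pull":
            raise ValueError(f"Dataset pull failed: {metadata.get('validation_details')}")
        if status == "pull_complete":
            return metadata
        polls += 1
        sleep(poll_seconds)
    raise TimeoutError(f"Dataset pull not complete after {max_polls} polls")
```

Passing the fetch step as a callable keeps the helper independent of the HTTP client, so it can be checked with canned responses before it is pointed at the service.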

#### Uncomment if you want to remove corrupted images in your dataset

In [None]:
# # This helper creates a data-services experiment and runs the job that removes corrupted images
# try:
#     from remove_corrupted_images import remove_corrupted_images_workflow
#     train_dataset_id = remove_corrupted_images_workflow(base_url, headers, workspace_id, train_dataset_id)
# except Exception as e:
#     raise e

### Create and pull val dataset <a class="anchor" id="head-5"></a>

In [None]:
# Create eval dataset
eval_dataset_metadata = {"type": ds_type,
                         "format": "coco",
                         "workspace":workspace_id,
                         "cloud_file_path": eval_dataset_path,
                         "use_for": ["evaluation"]
                         }
data = json.dumps(eval_dataset_metadata)

endpoint = f"{base_url}/datasets"

response = requests.post(endpoint,data=data,headers=headers)
assert response.ok, response.text
assert "id" in response.json().keys()

print(response)
print(json.dumps(response.json(), indent=4))
eval_dataset_id = response.json()["id"]

In [None]:
# Check progress
endpoint = f"{base_url}/datasets/{eval_dataset_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)

    assert response.ok, response.text  # Fail early on HTTP errors, before parsing the body
    print(response)
    if response.json().get("status") == "invalid_pull":
        print(json.dumps(response.json().get("validation_details"), indent=4))
        raise ValueError("Dataset pull failed")
    if response.json().get("status") == "pull_complete":
        print(json.dumps(response.json(), indent=4))
        break
    time.sleep(5)

#### Uncomment if you want to remove corrupted images in your dataset

In [None]:
# # This helper creates a data-services experiment and runs the job that removes corrupted images
# try:
#     from remove_corrupted_images import remove_corrupted_images_workflow
#     eval_dataset_id = remove_corrupted_images_workflow(base_url, headers, workspace_id, eval_dataset_id)
# except Exception as e:
#     raise e

### Create and pull test dataset <a class="anchor" id="head-6"></a>

In [None]:
# Create testing dataset for inference
test_dataset_metadata = {"type": ds_type,
                         "format":"raw",
                         "workspace":workspace_id,
                         "cloud_file_path": test_dataset_path,
                         "use_for": ["testing"]
                         }
data = json.dumps(test_dataset_metadata)

endpoint = f"{base_url}/datasets"

response = requests.post(endpoint,data=data,headers=headers)
assert response.ok, response.text
assert "id" in response.json().keys()

print(response)
print(json.dumps(response.json(), indent=4))
test_dataset_id = response.json()["id"]

In [None]:
# Check progress
endpoint = f"{base_url}/datasets/{test_dataset_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)

    assert response.ok, response.text  # Fail early on HTTP errors, before parsing the body
    print(response)
    if response.json().get("status") == "invalid_pull":
        print(json.dumps(response.json().get("validation_details"), indent=4))
        raise ValueError("Dataset pull failed")
    if response.json().get("status") == "pull_complete":
        print(json.dumps(response.json(), indent=4))
        break
    time.sleep(5)

#### Uncomment if you want to remove corrupted images in your dataset

In [None]:
# # This helper creates a data-services experiment and runs the job that removes corrupted images
# try:
#     from remove_corrupted_images import remove_corrupted_images_workflow
#     test_dataset_id = remove_corrupted_images_workflow(base_url, headers, workspace_id, test_dataset_id)
# except Exception as e:
#     raise e

### List the created datasets <a class="anchor" id="head-7"></a>

In [None]:
endpoint = f"{base_url}/datasets"

response = requests.get(endpoint, headers=headers)
assert response.ok, response.text

datasets = response.json()["datasets"]
for rsp in datasets:
    rsp_keys = rsp.keys()
    assert "id" in rsp_keys
    assert "type" in rsp_keys
    assert "format" in rsp_keys
    assert "name" in rsp_keys

print(response)
# print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose list output
print("id\t\t\t\t\t type\t\t\t format\t\t name")
for rsp in datasets:
    print(rsp["id"],"\t",rsp["type"],"\t",rsp["format"],"\t\t",rsp["name"])
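
Tab-separated output drifts out of alignment when ids or names differ in length. As an alternative, the sketch below pads each column to its widest value. `format_dataset_table` is a hypothetical helper; it assumes only the `id`, `type`, `format`, and `name` keys that the cell above already asserts on.

```python
def format_dataset_table(datasets, columns=("id", "type", "format", "name")):
    """Return an aligned text table for a list of dataset dicts."""
    # Convert every cell to text; a missing key becomes an empty cell
    rows = [[str(ds.get(col, "")) for col in columns] for ds in datasets]
    # Each column is as wide as its header or its longest value
    widths = [
        max([len(col)] + [len(row[i]) for row in rows])
        for i, col in enumerate(columns)
    ]
    lines = ["  ".join(col.ljust(w) for col, w in zip(columns, widths))]
    lines += ["  ".join(cell.ljust(w) for cell, w in zip(row, widths)) for row in rows]
    return "\n".join(line.rstrip() for line in lines)


# Example with placeholder records
print(format_dataset_table([
    {"id": "a1", "type": "object_detection", "format": "coco", "name": "train"},
    {"id": "b22", "type": "object_detection", "format": "raw", "name": "test"},
]))
```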

In [None]:
tensorboard_enabled = False # FIXME9 set to True if you want to create a Tensorboard session for the experiment

### Create an experiment <a class="anchor" id="head-10"></a>

In [None]:
encode_key = "tlt_encode"
checkpoint_choose_method = "best_model"
data = json.dumps({"network_arch":model_name,
                   "encryption_key":encode_key,
                   "checkpoint_choose_method":checkpoint_choose_method,
                   "tensorboard_enabled": tensorboard_enabled,
                   "workspace": workspace_id})

endpoint = f"{base_url}/experiments"

response = requests.post(endpoint,data=data,headers=headers)
assert response.ok, response.text
assert "id" in response.json()

print(response)
print(json.dumps(response.json(), indent=4))
experiment_id = response.json()["id"]

### List experiments <a class="anchor" id="head-11"></a>

In [None]:
endpoint = f"{base_url}/experiments"
params = {"network_arch": model_name}
response = requests.get(endpoint, params=params, headers=headers)
assert response.ok, response.text

print(response)
# print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose list output
print("name\t model id\t\t\t     network architecture")
for rsp in response.json()["experiments"]:
    rsp_keys = rsp.keys()
    assert "id" in rsp_keys and "network_arch" in rsp_keys
    print(rsp["name"], rsp["id"],rsp["network_arch"])


### Assign train, eval datasets <a class="anchor" id="head-12"></a>

In [None]:
docker_env_vars = {} # Add any variables to pass to the Docker runtime when jobs are triggered, such as MLOps variables
dataset_information = {"train_datasets":[train_dataset_id],
                       "eval_dataset":eval_dataset_id,
                       "inference_dataset":test_dataset_id,
                       "calibration_dataset":train_dataset_id,
                       "docker_env_vars": docker_env_vars}
data = json.dumps(dataset_information)

endpoint = f"{base_url}/experiments/{experiment_id}"

response = requests.patch(endpoint, data=data, headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), indent=4))

### Assign PTM <a class="anchor" id="head-13"></a>

Search NGC for the pretrained models (PTMs) available for the chosen object detection model.

In [None]:
# List all pretrained models for the chosen network architecture
endpoint = f"{base_url}/experiments:base"
params = {"network_arch": model_name}
response = requests.get(endpoint, params=params, headers=headers)
assert response.ok, response.text

response_json = response.json()["experiments"]

for rsp in response_json:
    rsp_keys = rsp.keys()
    if "encryption_key" not in rsp.keys():
        assert "name" in rsp_keys and "version" in rsp_keys and "ngc_path" in rsp_keys
        print(f'PTM Name: {rsp["name"]}; PTM version: {rsp["version"]}; NGC PATH: {rsp["ngc_path"]}')

In [None]:
# Assign pretrained models to the different object detection model versions
# Based on the output of the previous cell, update this map if you want to change the default PTM backbone.
# Changing the default backbone also requires changing the default spec/config for train, eval, etc. For example,
# if you change the PTM to resnet34, you must manually set the config key num_layers (if it exists) to 34.
pretrained_map = {"deformable_detr": "pretrained_deformable_detr_nvimagenet:resnet50",
                  "dino": "pretrained_dino_nvimagenet:resnet50",
                  "grounding_dino": "grounding_dino:grounding_dino_swin_tiny_commercial_trainable_v1.0",
                  "rtdetr": "pretrained_rtdetr_nvimagenet:resnet50"
                 }
no_ptm_models = set()

In [None]:
# Get pretrained model
if model_name not in no_ptm_models:
    endpoint = f"{base_url}/experiments:base"
    params = {"network_arch": model_name}
    response = requests.get(endpoint, params=params, headers=headers)
    assert response.ok, response.text

    response_json = response.json()["experiments"]

    # Search for ptm with given ngc path
    ptm = []
    for rsp in response_json:
        rsp_keys = rsp.keys()
        assert "ngc_path" in rsp_keys
        if rsp["ngc_path"].endswith(pretrained_map[model_name]):
            assert "id" in rsp_keys
            ptm_id = rsp["id"]
            ptm = [ptm_id]
            print("Metadata for model with requested NGC Path")
            print(rsp)
            break
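
Note that the loop above leaves `ptm` as an empty list when no entry matches, and the next cell would then patch the experiment with no base experiment. A stricter variant is sketched below. `find_ptm_id` is a hypothetical helper that applies the same `ngc_path` suffix match but raises when nothing matches.

```python
def find_ptm_id(base_experiments, ngc_path_suffix):
    """Return the id of the first base experiment whose ngc_path ends with the suffix.

    Raises LookupError instead of silently returning nothing, so a typo in
    pretrained_map is caught before the experiment is patched.
    """
    for experiment in base_experiments:
        if experiment.get("ngc_path", "").endswith(ngc_path_suffix):
            return experiment["id"]
    raise LookupError(f"No base experiment with ngc_path ending in {ngc_path_suffix!r}")


# Example with placeholder records shaped like the response entries used above
example_experiments = [
    {"id": "id-1", "ngc_path": "org/team/other_model:v1"},
    {"id": "id-2", "ngc_path": "org/team/pretrained_dino_nvimagenet:resnet50"},
]
print(find_ptm_id(example_experiments, "pretrained_dino_nvimagenet:resnet50"))
```

With the notebook's variables, the call would look like `ptm = [find_ptm_id(response_json, pretrained_map[model_name])]`.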

In [None]:
if model_name not in no_ptm_models:
    ptm_information = {"base_experiment":ptm}
    data = json.dumps(ptm_information)

    endpoint = f"{base_url}/experiments/{experiment_id}"

    response = requests.patch(endpoint, data=data, headers=headers)
    assert response.ok, response.text

    print(response)
    print(json.dumps(response.json(), indent=4))

### Actions <a class="anchor" id="head-15"></a>

For all actions:
1. Get default spec schema and derive the default values
2. Modify defaults if needed
3. Post spec dictionary to the service
4. Run model action
5. Monitor job using retrieve
6. Download results using job download endpoint (if needed)
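
Steps 3 and 4 above amount to serializing a small payload and posting it to the experiment's jobs endpoint. The sketch below shows one way to build that payload. `build_job_payload` is a hypothetical helper; the field names (`parent_job_id`, `action`, `specs`, `platform_id`) are the ones used by the run cells in this notebook, and omitting `platform_id` when it is not set is an assumption based on the commented-out lines in those cells.

```python
import json


def build_job_payload(action, specs, parent_job_id=None, platform_id=None):
    """Serialize the body posted to {base_url}/experiments/{experiment_id}/jobs."""
    payload = {"parent_job_id": parent_job_id, "action": action, "specs": specs}
    # platform_id is only included when set (NVCF backend, per the notebook comments)
    if platform_id is not None:
        payload["platform_id"] = platform_id
    return json.dumps(payload)


# Example: a train job has no parent; later actions pass the train job id
print(build_job_payload("train", {"train": {"num_epochs": 10}}))
```

The returned string can be passed as `data=` to `requests.post`, as the run cells do.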

In [None]:
job_map = {}

### Train <a class="anchor" id="head-16"></a>

#### View hyperparameters that are enabled for AutoML by default <a class="anchor" id="head-14"></a>

In [None]:
if automl_enabled:
    # Get default spec schema
    endpoint = f"{base_url}/experiments/{experiment_id}/specs/train/schema"
    while True:
        response = requests.get(endpoint, headers=headers)
        if response.status_code == 404:
            if "Base spec file download state is " in response.json()["error_desc"]:
                print("Base experiment spec file is being downloaded")
                time.sleep(2)
                continue
            else:
                break
        else:
            break
    assert response.ok, response.text
    assert "automl_default_parameters" in response.json().keys()
    automl_params = response.json()["automl_default_parameters"]
    print(json.dumps(automl_params, sort_keys=True, indent=4))

#### Set AutoML related configurations <a class="anchor" id="head-16.1"></a>
Refer to these links to see the parameters supported by each network, and add parameters beyond the default AutoML-enabled ones if necessary:

[Deformable Detr](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/deformable_detr/deformable_detr%20-%20train.csv),
[DINO](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/dino/dino%20-%20train.csv)

In [None]:
if automl_enabled:
    # Choose any metric from the kpi dictionary in the model's status.json.
    # Example status.json for each model can be found in the respective section in NVIDIA TAO DOCS here: https://docs.nvidia.com/tao/tao-toolkit/text/model_zoo/cv_models/index.html
    metric = "kpi"

    # Refer to the parameter lists linked above to add or remove parameters beyond the default-enabled ones in automl_params

    automl_information = {"automl_enabled": True,
                          "automl_algorithm": automl_algorithm,
                          "automl_max_recommendations": 20, # Only for bayesian
                          "automl_R": 27, # Only for hyperband
                          "automl_nu": 3, # Only for hyperband
                          "epoch_multiplier": 1, # Only for hyperband
                          # Warning: The parameters that are disabled are not tested by TAO, so there might be unexpected behaviour in overriding this
                          "override_automl_disabled_params": False,
                          "automl_hyperparameters": str(automl_params)}
    data = json.dumps({"metric":metric, "automl_settings": automl_information})

    endpoint = f"{base_url}/experiments/{experiment_id}"

    response = requests.patch(endpoint, data=data, headers=headers)
    assert response.ok, response.text
    
    print(response)
    print(json.dumps(response.json(), sort_keys=True, indent=4))

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/train/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.ok, response.text
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
train_specs = response.json()["default"]
print(json.dumps(train_specs, sort_keys=True, indent=4))
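
The schema cells for each action repeat one retry rule: keep requesting while the service answers 404 with a "Base spec file download state is " message, then require a successful response. The sketch below captures that rule. `fetch_default_specs` is a hypothetical helper, and the `max_retries` cap is an addition that the notebook loop does not have.

```python
import time

DOWNLOADING_MARKER = "Base spec file download state is "


def fetch_default_specs(get_schema, retry_seconds=2, max_retries=None, sleep=time.sleep):
    """Return the "default" specs once the base spec file has been downloaded.

    get_schema: zero-argument callable returning (status_code, body_dict), e.g.
        def get_schema():
            r = requests.get(endpoint, headers=headers)
            return r.status_code, r.json()
    """
    retries = 0
    while True:
        status_code, body = get_schema()
        downloading = (
            status_code == 404
            and DOWNLOADING_MARKER in body.get("error_desc", "")
        )
        if not downloading:
            break
        if max_retries is not None and retries >= max_retries:
            raise TimeoutError("Base experiment spec file is still being downloaded")
        retries += 1
        sleep(retry_seconds)
    if status_code >= 400:
        raise RuntimeError(f"Schema request failed with {status_code}: {body}")
    return body["default"]
```

Only the endpoint changes between actions (`specs/train/schema`, `specs/evaluate/schema`, and so on), so one helper could serve every action.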

In [None]:
# Customize train model specs
train_specs["train"]["num_epochs"] = 10
train_specs["train"]["checkpoint_interval"] = 10
train_specs["train"]["validation_interval"] = 10
train_specs["train"]["num_gpus"] = 1
if model_name != "grounding_dino":
    train_specs["dataset"]["num_classes"] = int(num_classes) + 1
print(json.dumps(train_specs, sort_keys=True, indent=4))

In [None]:
# Run action
parent = None
action = "train"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":train_specs,
                  #  "platform_id": "9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Pick a platform_id from the output of {base_url}:gpu_types depending on GPU_type and instance_type
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.ok, response.text
assert response.json()

print(response)
print(json.dumps(response.json(), indent=4))

job_map["train_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
# For automl: Training times for different models benchmarked on 1 GPU V100 machine can be found here: https://docs.nvidia.com/tao/tao-toolkit/text/automl/automl.html#results-of-automl-experiments

job_id = job_map["train_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    if "error_desc" in response.json().keys() and response.json()["error_desc"] in ("Job trying to retrieve not found", "No AutoML run found"):
        print("Job is being created")
        time.sleep(5)
        continue
    print(response)
    print(json.dumps(response.json(), sort_keys=True, indent=4))
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)
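
The monitoring loop above can also be written as a function that returns the final job metadata. The following is a sketch only: `wait_for_job` is a hypothetical helper, the terminal states and "not ready" messages are copied from the cell above, and raising on an unexpected HTTP status is a design choice of this sketch (the notebook loop simply stops).

```python
import time

TERMINAL_STATES = ("Done", "Error", "Canceled", "Paused")
NOT_READY_MESSAGES = ("Job trying to retrieve not found", "No AutoML run found")


def wait_for_job(fetch_job, poll_seconds=15, create_wait_seconds=5, sleep=time.sleep):
    """Poll a job until it reaches a terminal state and return its metadata.

    fetch_job: zero-argument callable returning (status_code, body_dict).
    sleep: injectable so the helper can be exercised without waiting.
    """
    while True:
        status_code, body = fetch_job()
        # The job record may not exist yet right after submission
        if body.get("error_desc") in NOT_READY_MESSAGES:
            sleep(create_wait_seconds)
            continue
        if status_code not in (200, 201):
            raise RuntimeError(f"Job query failed with {status_code}: {body}")
        if body.get("status") in TERMINAL_STATES:
            return body
        sleep(poll_seconds)
```

Because `Error`, `Canceled`, and `Paused` are returned rather than raised, the caller still has to inspect the `status` field of the result.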

In [None]:
if tensorboard_enabled:
    print(f"Tensorboard available at {host_url}/tensorboard/v1/orgs/{ngc_org_name}/experiments/{experiment_id}")

In [None]:
## To stop an AutoML job
#    1. Stop the 'Monitor job status by repeatedly running this cell' cell (the cell right before this cell) manually
#    2. Uncomment the snippet in the next cell and run the cell

In [None]:
# if automl_enabled:
#     job_id = job_map["train_" + model_name]
#     endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}:pause"

#     response = requests.post(endpoint, headers=headers)

#     print(response)
#     print(json.dumps(response.json(), indent=4))

In [None]:
## Resume AutoML

In [None]:
# # Uncomment the below snippet if you want to resume an already stopped AutoML job and then run the 'Monitor job status by repeatedly running this cell' cell above (4th cell above from this cell)
# if automl_enabled:
#     job_id = job_map["train_" + model_name]
#     endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}:resume"

#     data = json.dumps({"parent_job_id":parent,"specs":train_specs,
#                    "platform_id": "9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Pick a platform_id from the output of {base_url}:gpu_types depending on GPU_type and instance_type
#                    })
#     response = requests.post(endpoint, data=data, headers=headers)

#     print(response)
#     print(json.dumps(response.json(), indent=4))

### Publish model

#### Edit the method for choosing a checkpoint from the list of train checkpoint files

In [None]:
# Get model handler parameters
endpoint = f"{base_url}/experiments/{experiment_id}"
response = requests.get(endpoint, headers=headers)
assert response.ok, response.text
assert response.json()

model_parameters = response.json()
update_checkpoint_choosing = {}
update_checkpoint_choosing["checkpoint_choose_method"] = model_parameters["checkpoint_choose_method"]
update_checkpoint_choosing["checkpoint_epoch_number"] = model_parameters["checkpoint_epoch_number"]
print(update_checkpoint_choosing)

In [None]:
# Change the method by which checkpoint from the parent action is chosen, when parent action is a train/retrain action.
# Example for evaluate action below, can be applied in the same way for other actions too
update_checkpoint_choosing["checkpoint_choose_method"] = "latest_model" # Choose between best_model/latest_model/from_epoch_number
# If from_epoch_number is chosen, assign the epoch number to the dictionary key in the format 'from_epoch_number_{train_job_id}'
# update_checkpoint_choosing["checkpoint_epoch_number"]["from_epoch_number_28a2754e-50ef-43a8-9733-98913776dd90"] = 3
data = json.dumps(update_checkpoint_choosing)

endpoint = f"{base_url}/experiments/{experiment_id}"

response = requests.patch(endpoint, data=data, headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), sort_keys=True, indent=4))
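
The three checkpoint selection methods and the per-job epoch key can be captured in a small builder. The following is a sketch under stated assumptions: `checkpoint_choice` is a hypothetical helper, and the key format `from_epoch_number_<train_job_id>` is inferred from the commented example in the cell above rather than from a published schema.

```python
def checkpoint_choice(method, train_job_id=None, epoch=None, epoch_numbers=None):
    """Build the update dict for checkpoint_choose_method / checkpoint_epoch_number.

    epoch_numbers: existing checkpoint_epoch_number mapping to preserve, if any.
    """
    valid_methods = ("best_model", "latest_model", "from_epoch_number")
    if method not in valid_methods:
        raise ValueError(f"method must be one of {valid_methods}, got {method!r}")

    update = {
        "checkpoint_choose_method": method,
        "checkpoint_epoch_number": dict(epoch_numbers or {}),
    }
    if method == "from_epoch_number":
        if train_job_id is None or epoch is None:
            raise ValueError("from_epoch_number requires train_job_id and epoch")
        # Key format inferred from the commented example in the notebook cell
        update["checkpoint_epoch_number"][f"from_epoch_number_{train_job_id}"] = epoch
    return update


# Example: select epoch 3 of a placeholder train job id
print(checkpoint_choice("from_epoch_number", train_job_id="job-1", epoch=3))
```

The result can be serialized with `json.dumps` and sent with `requests.patch` to the experiment endpoint, as the cell above does.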

#### Push model to private NGC team registry

In [None]:
job_id = job_map["train_" + model_name]
data = json.dumps({"display_name": f"TAO {model_name}",
                   "description": f"Train {model_name}",
                   "team_name":"tao"})

endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}:publish_model"

response = requests.post(endpoint, data=data, headers=headers)
assert response.ok, response.text
assert response.json()

print(response)
print(json.dumps(response.json(), indent=4))

#### Remove model from private NGC team registry

In [None]:
# endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}:remove_published_model"
# params = {"team_name": "tao"}
# response = requests.delete(endpoint, params=params, headers=headers)
# assert response.ok, response.text
# assert response.json()

# print(response)
# print(json.dumps(response.json(), indent=4))

### Evaluate <a class="anchor" id="head-17"></a>

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/evaluate/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.ok, response.text
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
eval_specs = response.json()["default"]
print(json.dumps(eval_specs, sort_keys=True, indent=4))

In [None]:
# Apply changes
if model_name in ("deformable_detr", "dino", "rtdetr"):
    eval_specs["dataset"]["num_classes"] = int(num_classes) + 1
print(json.dumps(eval_specs, sort_keys=True, indent=4))

In [None]:
# Run action
parent = job_map["train_" + model_name]
action = "evaluate"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":eval_specs,
                  #  "platform_id": "9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Pick a platform_id from the output of {base_url}:gpu_types depending on GPU_type and instance_type
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), indent=4))

job_map["evaluate_" + model_name] = response.json()
print(job_map)

### Export <a class="anchor" id="head-19"></a>

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/export/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.ok, response.text
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
export_specs = response.json()["default"]
print(json.dumps(export_specs, sort_keys=True, indent=4))

In [None]:
# Apply changes
if model_name in ("deformable_detr", "dino", "rtdetr"):
    export_specs["dataset"]["num_classes"] = int(num_classes) + 1
print(json.dumps(export_specs, sort_keys=True, indent=4))

In [None]:
# Run action
parent = job_map["train_" + model_name]
action = "export"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":export_specs,
                  #  "platform_id": "9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Pick a platform_id from the output of {base_url}:gpu_types depending on GPU_type and instance_type
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), indent=4))

job_map["export_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map['export_' + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.ok, response.text
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    print(response)
    print(json.dumps(response.json(), indent=4))
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

### TRT Engine generation using TAO-Deploy <a class="anchor" id="head-20"></a>

- Here, we use the exported model to generate a TensorRT (TRT) engine

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/gen_trt_engine/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.ok, response.text
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
tao_deploy_specs = response.json()["default"]
print(json.dumps(tao_deploy_specs, sort_keys=True, indent=4))
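
The schema cell above retries while the service reports that the base spec file is still downloading, with no upper bound on the number of attempts. A bounded version of the same logic is sketched below. It assumes the request is wrapped in a zero-argument callable returning `(status_code, body_dict)`; the names `fetch_default_specs`, `max_attempts` and `delay_seconds` are illustrative.

```python
import time


def fetch_default_specs(get, max_attempts=30, delay_seconds=2, sleep=time.sleep):
    """Call get() until the schema is available, then return its "default" specs."""
    for _ in range(max_attempts):
        status_code, body = get()
        still_downloading = (
            status_code == 404
            and "Base spec file download state is " in body.get("error_desc", "")
        )
        if not still_downloading:
            break
        sleep(delay_seconds)
    else:
        # Every attempt reported that the spec file was still downloading.
        raise TimeoutError("Base experiment spec file was not ready in time")
    if status_code >= 400 or "default" not in body:
        raise RuntimeError(f"Schema request failed with status {status_code}: {body}")
    return body["default"]
```

With `requests`, the callable could be written as a small function that performs `requests.get(endpoint, headers=headers)` and returns `(response.status_code, response.json())`.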

In [None]:
# Apply changes
if model_name in ("deformable_detr", "dino", "grounding_dino", "rtdetr"):
    tao_deploy_specs["gen_trt_engine"]["tensorrt"]["data_type"] = "FP16"
    if model_name in ("deformable_detr", "dino"):
        tao_deploy_specs["dataset"]["num_classes"] = int(num_classes) + 1
print(json.dumps(tao_deploy_specs, sort_keys=True, indent=4))

In [None]:
# Run action
parent = job_map["export_" + model_name]
action = "gen_trt_engine"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":tao_deploy_specs,
                  #  "platform_id": "9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Pick a platform from the output of {base_url}:gpu_types depending on GPU_type and instance_type
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), indent=4))

job_map["model_gen_trt_engine_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status: this cell polls every 15 seconds until the job reaches a terminal state
job_id = job_map['model_gen_trt_engine_' + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.ok, response.text
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    print(response)
    print(json.dumps(response.json(), indent=4))
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

### TAO inference <a class="anchor" id="head-21"></a>

- Run inference on a set of images using the .tlt model created in the train step

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/inference/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.ok, response.text
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
tao_inference_specs = response.json()["default"]
print(json.dumps(tao_inference_specs, sort_keys=True, indent=4))

In [None]:
# Apply changes to the specs dictionary if necessary
if model_name in ("deformable_detr", "dino", "rtdetr"):
    tao_inference_specs["dataset"]["num_classes"] = int(num_classes) + 1
elif model_name == "grounding_dino":
    tao_inference_specs["dataset"]["infer_data_sources"]["captions"] = ["person"] # Classes you want to get inference results on

print(json.dumps(tao_inference_specs, sort_keys=True, indent=4))
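
The spec edits above write directly into nested dictionaries. As one possible alternative, the overrides could be listed as dotted paths and applied to a copy, which leaves the default specs untouched for comparison. The helper `with_overrides` is a hypothetical name. Unlike the direct indexing used in the notebook, it creates intermediate dictionaries when a path is missing.

```python
import copy


def with_overrides(specs, overrides):
    """Return a deep copy of specs with dotted-path overrides applied."""
    updated = copy.deepcopy(specs)
    for dotted_path, value in overrides.items():
        *parents, leaf = dotted_path.split(".")
        node = updated
        for key in parents:
            # Create the intermediate dictionary if the path does not exist yet.
            node = node.setdefault(key, {})
        node[leaf] = value
    return updated
```

For example, `with_overrides(tao_inference_specs, {"dataset.num_classes": int(num_classes) + 1})` would mirror the first branch of the cell above.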

In [None]:
# Run action
parent = job_map["train_" + model_name]
action = "inference"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":tao_inference_specs,
                  #  "platform_id": "9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Pick a platform from the output of {base_url}:gpu_types depending on GPU_type and instance_type
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), indent=4))

job_map["inference_tao_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status: this cell polls every 15 seconds until the job reaches a terminal state
job_id = job_map['inference_tao_' + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.ok, response.text
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    print(response)
    print(json.dumps(response.json(), indent=4))
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

### TRT inference <a class="anchor" id="head-22"></a>

- Run inference using the TRT engine generated in the TAO-Deploy step. The specs follow those of the TAO inference step, with the batch size set to 1 for the models handled below

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/inference/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.ok, response.text
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
trt_inference_specs = response.json()["default"]
print(json.dumps(trt_inference_specs, sort_keys=True, indent=4))

In [None]:
# Apply changes to the specs dictionary if necessary
if model_name in ("deformable_detr", "dino", "rtdetr"):
    trt_inference_specs["dataset"]["num_classes"] = int(num_classes) + 1
    trt_inference_specs["dataset"]["batch_size"] = 1
elif model_name == "grounding_dino":
    trt_inference_specs["dataset"]["infer_data_sources"]["captions"] = ["person"] # Classes you want to get inference results on
    trt_inference_specs["dataset"]["batch_size"] = 1

print(json.dumps(trt_inference_specs, sort_keys=True, indent=4))

In [None]:
# Run action
parent = job_map["model_gen_trt_engine_" + model_name]
action = "inference"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":trt_inference_specs,
                  #  "platform_id": "9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Pick a platform from the output of {base_url}:gpu_types depending on GPU_type and instance_type
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), indent=4))

job_map["inference_trt_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status: this cell polls every 15 seconds until the job reaches a terminal state
job_id = job_map['inference_trt_' + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.ok, response.text
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    print(response)
    print(json.dumps(response.json(), indent=4))
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

### Delete experiment <a class="anchor" id="head-23"></a>

In [None]:
endpoint = f"{base_url}/experiments/{experiment_id}"

response = requests.delete(endpoint,headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), indent=4))

### Delete dataset <a class="anchor" id="head-24"></a>

#### Delete train dataset

In [None]:
endpoint = f"{base_url}/datasets/{train_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), indent=4))

#### Delete val dataset

In [None]:
endpoint = f"{base_url}/datasets/{eval_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), indent=4))

#### Delete test dataset

In [None]:
endpoint = f"{base_url}/datasets/{test_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), indent=4))
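
The three dataset deletions above issue the same request against different IDs. A minimal sketch of doing this in one pass is shown below, assuming the DELETE call is wrapped in a callable that takes the endpoint URL and returns a truthy value on success. The helper `delete_datasets` is a hypothetical name.

```python
def delete_datasets(delete, base_url, dataset_ids):
    """Delete each dataset and return a {dataset_id: succeeded} mapping."""
    results = {}
    for dataset_id in dataset_ids:
        endpoint = f"{base_url}/datasets/{dataset_id}"
        results[dataset_id] = bool(delete(endpoint))
    return results
```

In this notebook the call could look like `delete_datasets(lambda url: requests.delete(url, headers=headers).ok, base_url, [train_dataset_id, eval_dataset_id, test_dataset_id])`. Collecting the results instead of asserting on each response lets the remaining datasets be deleted even if one request fails.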