### Notebook to demonstrate FTMS DINO AutoML Training with Local Storage and Airgapped Workflow

Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique where you use a model trained on one task and re-train to use it on a different task. Train Adapt Optimize (TAO) Toolkit is a simple and easy-to-use Python based AI toolkit for taking purpose-built AI models and customizing them with users' own data.

---

### FTMS AutoML with DINO (DEtection TRansformer with Improved deNoising anchOr boxes)

This notebook demonstrates how to use the DINO object detection model with AutoML capabilities in a local storage environment designed for airgapped workflows.

[DINO](https://arxiv.org/abs/2203.03605) is a state-of-the-art transformer-based object detection model. Like Deformable DETR, DINO does not rely on heuristics-based methods such as NMS or IoU assignment found in convolution-based object detection models like Faster R-CNN. Compared to Deformable DETR, DINO uses de-noising during training, which can help training converge faster.


![image](https://raw.githubusercontent.com/vpraveen-nv/model_card_images/main/api/automl_workflow.png)

---

### Sample prediction of a trained DINO model

<img align="center" src="sample_images/detection_sample.jpg" width="640">

---

### Expected outcome
The expected output of this notebook is:
- A fully trained DINO object detection model optimized through AutoML
- Performance metrics comparing different hyperparameter configurations
- Exported model files ready for deployment in production environments
- Complete workflow documentation for airgapped deployments

Detailed documentation about DINO object detection is available in the TAO [documentation](https://docs.nvidia.com/tao/tao-toolkit/text/object_detection/dino.html)

---

### The workflow in a nutshell
This notebook demonstrates how to train DINO object detection models using AutoML in a completely local environment suitable for airgapped deployments:

1. **Setup Local Environment** - Configure local storage paths and verify prerequisites
2. **Connect to Local TAO API** - Connect to locally deployed TAO API service
3. **Create Local Workspace** - Setup local workspace without cloud dependencies
4. **Prepare Local Dataset** - Load and register datasets from local storage
5. **Configure DINO Model** - Setup DINO architecture and pretrained models
6. **Setup AutoML** - Configure automated hyperparameter optimization
7. **Train with AutoML** - Execute automated training with multiple trials
8. **Model Evaluation** - Evaluate best performing model
9. **Model Export** - Export optimized model for deployment
10. **Local Inference** - Run inference using local model files


---

### Requirements
Prior to running this notebook you must have:
1. A locally deployed TAO API server [(Local Setup Guide)](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#local-deployment)
2. Local object detection dataset in KITTI or COCO format stored in `~/data/` directory
3. Pretrained DINO models downloaded locally to `~/airgapped-models`
4. Set the `<>` enclosed variables with values in the Configuration section
5. Sufficient local storage space (>50GB recommended for datasets and models)

### Airgapped Environment Setup
For completely airgapped environments, ensure the following are pre-downloaded:
- TAO API Docker images
- DINO pretrained model weights
- Sample datasets for testing


---

### Debugging Finetuning Microservice and Jobs

When working with the TAO API, you may encounter issues at different stages. Use the following guidance to debug effectively:

#### 1. Dataset, Experiment, or Workspace CRUD Operation Errors

If you encounter errors related to creating, reading, updating, or deleting datasets, experiments, or workspaces **and the error messages are not clear**, check the logs of the TAO API service pods:

```bash
kubectl logs -f <pod name starting with tao-api-app-pod>
```

#### 2. Errors During Job Launch

For issues that occur **while launching a job**, check both the app and workflow pods:

```bash
kubectl logs -f <pod name starting with tao-api-app-pod>
kubectl logs -f <pod name starting with tao-api-workflow-pod>
```

#### 3. Errors After Job Launch

If errors occur **after a job has been launched**, inspect the job pod logs:

```bash
kubectl logs -f tao-api-sts-<job_id>-0
```

> **Note:**  
> Run these `kubectl` commands on the machine where your Kubernetes service is deployed.

#### Additional Debugging Tips

- **Job logs are automatically uploaded** to your cloud workspace at:  
  `/results/<job_id>/microservices_log.txt`

- **You can also view logs via the Jobs API endpoint:**  
  ```
  /api/v1/orgs/<org_name>/<experiments|datasets>/<experiment_id|dataset_id>/jobs/<job_id>/logs
  ```

**Summary Table**

| Error Type                | Where to Check Logs                                      |
|-------------------------- |---------------------------------------------------------|
| CRUD operation errors     | `tao-api-app-pod`                                       |
| Job launch errors         | `tao-api-app-pod`, `tao-api-workflow-pod`               |
| Post-launch job errors    | `tao-api-sts-<job_id>-0`                                |
| All job logs (cloud)      | `/results/<job_id>/microservices_log.txt`               |
| All job logs (API)        | `/api/v1/orgs/<org_name>/<experiments|datasets>/<experiment_id|dataset_id>/jobs/<job_id>/logs` |
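
The logs endpoint in the last row can be assembled from its parts. The sketch below is illustrative only: `build_job_logs_url` and its parameter names are hypothetical and not part of any TAO client library, and the path layout is copied from the pattern in the table. The host, organization, and IDs in the example call are placeholders.

```python
def build_job_logs_url(base_url, org_name, kind, entity_id, job_id):
    """Assemble the Jobs API logs URL from the path pattern shown above.

    `kind` must be "experiments" or "datasets". `base_url` is the scheme,
    host, and port of your FTMS server (an assumption of this sketch).
    """
    if kind not in ("experiments", "datasets"):
        raise ValueError(f"kind must be 'experiments' or 'datasets', got {kind!r}")
    return (
        f"{base_url.rstrip('/')}/api/v1/orgs/{org_name}"
        f"/{kind}/{entity_id}/jobs/{job_id}/logs"
    )


# Placeholder values, for illustration only.
print(build_job_logs_url("http://localhost:8090", "my-org", "experiments", "exp-1", "job-1"))
```

Rejecting an unknown `kind` early avoids a confusing 404 from a malformed path.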

---
### Performance Benchmarks

#### Execution Time Breakdown

The following table shows the approximate time required for each stage of the DINO AutoML FTMS workflow:

| **Stage** | **Duration** | **Description** |
|-----------|--------------|-----------------|
| Train Dataset Pull | **5s** | Train dataset verification and preprocessing |
| Val/Test Dataset Pull | **5s** | Validation and test dataset verification |
| AutoML Model Training | **1.5hrs** | 10 minutes per experiment |
| ONNX + TensorRT Export | **3min + 7min** | Model optimization and engine generation |
| TensorRT Inference | **20min** | High-performance inference testing |
| **Total Time** | **9hr 45min** | **Complete end-to-end workflow** |

#### Test Environment Specifications

| **Component** | **Specification** |
|---------------|-------------------|
| **GPU** | 1x NVIDIA A40 |
| **Training Dataset** | 50 images (5 MB total) |
| **Validation Dataset** | 50 images (5 MB total) |
| **Training Epochs** | 10 epochs per experiment |
| **Model Architecture** | DINO with FAN Tiny |
| **Task** | Warehouse object detection (4 classes) |

#### Performance Factors

> **Important Note:**  
> Actual execution times may vary significantly based on:
> 
> - **Hardware Configuration**: GPU type, memory, and compute capability
> - **Storage Performance**: Local disk I/O speed and cloud storage latency
> - **Network Conditions**: Bandwidth and latency to cloud storage
> - **System Load**: Other concurrent processes and resource utilization
> - **Dataset Size**: Number of images and total data volume
> - **Batch Size**: Larger batch sizes can improve training stability and speed
> - **Model Configuration**: Backbone architecture, epochs, AutoML configurations, and hyperparameters

## Transfer Datasets and Pre-trained models into local S3 bucket

### Local Dataset Structure
DINO object detection expects the following dataset folder structure in local storage. Make sure your data follows this pattern:

```
~/data/dino_train_dataset/
├── images.tar.gz
├── annotations.json
├── label_map.txt

~/data/dino_val_dataset/
├── images.tar.gz
├── annotations.json
├── label_map.txt
```
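
Before uploading, it can help to confirm that each dataset directory contains the three files listed above. This is a minimal sketch, not part of the TAO tooling: `missing_dataset_files` is a hypothetical helper, and it only checks that the files exist, not that their contents are valid.

```python
import tempfile
from pathlib import Path

# File names taken from the folder layout shown above.
EXPECTED_FILES = ("images.tar.gz", "annotations.json", "label_map.txt")


def missing_dataset_files(dataset_dir):
    """Return the expected file names that are absent from dataset_dir."""
    root = Path(dataset_dir).expanduser()
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Demonstration on a throwaway directory that lacks label_map.txt.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "images.tar.gz").touch()
    (Path(tmp) / "annotations.json").touch()
    print(missing_dataset_files(tmp))
```

For real data, call it with `~/data/dino_train_dataset` and `~/data/dino_val_dataset`; an empty list means all three files are present.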

#### Setup AWS credentials as environment variables

In [None]:
import os
# For a minikube environment, uncomment the lines from here down to the docker compose section
# import subprocess

# # Get the Minikube IP using subprocess
# minikube_ip = subprocess.check_output(["minikube", "ip"]).decode("utf-8").strip()

# # Set environment variables
# os.environ["CLUSTER_IP"] = minikube_ip
# os.environ["SEAWEED_ENDPOINT"] = f"http://{os.environ['CLUSTER_IP']}:32333"

# For docker compose environment, uncomment this section
# %env CLUSTER_IP=<MACHINE IP>
# %env SEAWEED_ENDPOINT=http://$CLUSTER_IP:8333
# os.environ["SEAWEED_ENDPOINT"] = f"http://{os.environ['CLUSTER_IP']}:8333"


# Common for both
# Set AWS CLI credentials for SeaweedFS
%env AWS_ACCESS_KEY_ID=seaweedfs
%env AWS_SECRET_ACCESS_KEY=seaweedfs123
%env AWS_DEFAULT_REGION=us-east-1


#### Create a bucket if not present already

In [None]:
%%bash
pip install awscli

# Create the main storage bucket if not already created
if ! aws s3 ls --endpoint-url "$SEAWEED_ENDPOINT" | grep -q "tao-storage"; then
    aws s3 mb --endpoint-url "$SEAWEED_ENDPOINT" s3://tao-storage
else
    echo "Bucket already exists, skipping creation."
fi

#### Copy data from local disk to local S3 bucket

In [None]:
%%bash
aws s3 cp --endpoint-url $SEAWEED_ENDPOINT ~/data/dino_train_dataset s3://tao-storage/data/dino_train_dataset --recursive
aws s3 cp --endpoint-url $SEAWEED_ENDPOINT ~/data/dino_val_dataset s3://tao-storage/data/dino_val_dataset --recursive

# The S3 destination directory used below must match the LOCAL_MODEL_REGISTRY value in values.yaml
aws s3 cp --endpoint-url $SEAWEED_ENDPOINT ~/airgapped-models/ s3://tao-storage/shared-storage/models/ --recursive

#### Verify that the uploads were successful

In [None]:
%%bash
aws s3 ls --endpoint-url $SEAWEED_ENDPOINT s3://tao-storage/shared-storage/models/
aws s3 ls --endpoint-url $SEAWEED_ENDPOINT s3://tao-storage/data/


## Configuration 

Fill in all `<>` enclosed variables with relevant values under this section. 
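
A quick way to catch a forgotten placeholder is to scan the configured values for leftover angle-bracket tokens. The sketch below is a hypothetical helper, not part of the notebook's API. It assumes that any `<...>` token left in a value is an unfilled placeholder.

```python
import re

# Matches angle-bracket placeholders such as <NGC_PERSONAL_KEY> or <ip_address>.
PLACEHOLDER = re.compile(r"<[^<>]+>")


def unfilled_placeholders(settings):
    """Map each setting name to the placeholders still present in its value."""
    found = {}
    for name, value in settings.items():
        hits = PLACEHOLDER.findall(str(value))
        if hits:
            found[name] = hits
    return found


# Example values mirroring the placeholders used in this notebook.
example = {
    "ngc_key": "<NGC_PERSONAL_KEY>",
    "host_url": "http://<ip_address>:<port_number>/api/v1/orgs/ea-tlt",
    "model_name": "dino",
}
print(unfilled_placeholders(example))
```

An empty dictionary means every value has been filled in.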

In [None]:
import json
import requests
import time
from IPython.display import clear_output, display

### TAO FTMS Host & Credentials 
**Kubernetes Environment**

* **IP Address**: Machine’s IP address
* **Port**: `32080`

**Docker Compose Environment**

* **IP Address**: `localhost`
* **Port**: Value set in `config.env` (`8090` by default)
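
The two environments differ only in address and port, so the base URL can be built from those values. This is a minimal sketch: `build_host_url` is a hypothetical helper, the ports are the defaults listed above, and `192.0.2.10` is a reserved documentation address standing in for your machine's IP.

```python
def build_host_url(ip_address, port, org_name):
    """Build the FTMS base URL in the form used throughout this notebook."""
    port = int(port)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"http://{ip_address}:{port}/api/v1/orgs/{org_name}"


# Kubernetes: the machine's IP address with port 32080.
print(build_host_url("192.0.2.10", 32080, "ea-tlt"))
# Docker Compose: localhost with the port from config.env (8090 by default).
print(build_host_url("localhost", 8090, "ea-tlt"))
```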


In [None]:
# Configure for your TAO FTMS server 
ngc_key = "<NGC_PERSONAL_KEY>"
ngc_org_name = "ea-tlt"
host_url = f"http://<ip_address>:<port_number>/api/v1/orgs/{ngc_org_name}"

docker_env_vars = {}

# If you're using a PTM from a private organization like NVAIE, uncomment the following line and add your legacy NGC API Key.
# docker_env_vars['TAO_API_KEY'] = '<NGC_LEGACY_API_KEY>' #Set to NGC Legacy API Key 

### Cloud Storage Setup & Credentials 

In [None]:
# Cloud bucket details to access datasets and store experiment results
cloud_metadata = {
    "name": "tao_workspace",
    "cloud_type": "seaweedfs",
    "cloud_specific_details": {
        "cloud_region": "us-east-1",
        "cloud_bucket_name": "tao-storage",
        "access_key": "seaweedfs",
        "secret_key": "seaweedfs123",
        "endpoint_url": "http://seaweedfs-s3:8333"
    }
}

### Dataset Paths in Cloud Storage 

In [None]:
train_dataset_path =  "/data/dino_train_dataset"
eval_dataset_path = "/data/dino_val_dataset"

### Model Configuration

In [None]:
# DINO Model Configuration
# Documentation: https://docs.nvidia.com/tao/tao-toolkit/text/object_detection/dino.html
model_name = "dino"  # Fixed for this tutorial

#### Configure AutoML Parameters
[AutoML documentation](https://docs.nvidia.com/tao/tao-toolkit/text/automl/automl.html#getting-started)


In [None]:
# AutoML Configuration
automl_algorithm = "bayesian"
automl_max_recommendations = 2  # Number of AutoML experiments to run

## Create cloud workspace
This workspace is where your datasets reside and where the results of TAO FTMS jobs are pushed.

In [None]:
# Create cloud workspace
data = json.dumps(cloud_metadata)

endpoint = f"{host_url}/workspaces"

response = requests.post(endpoint,data=data)

print(response)
print(json.dumps(response.json(), indent=4))
assert response.status_code in (200, 201)
assert "id" in response.json().keys()

workspace_id = response.json()["id"]

## Load Base experiments from local storage into your FTMS Database


In [None]:
# Load base experiments from local storage into the FTMS database via the endpoint below
endpoint = f"{host_url}/experiments:load_airgapped"
data = {
    "workspace_id": workspace_id,
}
response = requests.post(endpoint, json=data)
print(response)
print(json.dumps(response.json(), indent=4))
assert response.status_code in (200, 201)

## Register datasets with FTMS 

TAO FTMS requires datasets in your cloud storage to be registered to produce a unique ID that can be attached to training jobs. This step only needs to be done once and then you can use the dataset across any experiments that support the dataset format.
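
The registration request body has the same shape for every dataset in this notebook, so it can be produced by one function. This is a sketch only: `dataset_payload` is a hypothetical helper, the field names are copied from the registration cells in this notebook, and the workspace ID in the example is a placeholder.

```python
import json


def dataset_payload(workspace_id, cloud_file_path, use_for, name,
                    dataset_type="object_detection", dataset_format="coco"):
    """Return the JSON body for registering a dataset, mirroring this notebook's fields."""
    return json.dumps({
        "type": dataset_type,
        "format": dataset_format,
        "workspace": workspace_id,
        "cloud_file_path": cloud_file_path,
        "use_for": list(use_for),
        "name": name,
    })


# "ws-123" is a placeholder workspace ID.
body = dataset_payload("ws-123", "/data/dino_train_dataset", ["training"], "hardhat_detection_train")
print(body)
```

Building both the training and validation bodies through one function keeps the two requests from drifting apart.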

### List Registered Datasets

In [None]:
endpoint = f"{host_url}/datasets"

response = requests.get(endpoint)
assert response.status_code in (200, 201)

datasets = response.json()["datasets"]
for rsp in datasets:
    rsp_keys = rsp.keys()
    assert "id" in rsp_keys
    assert "type" in rsp_keys
    assert "format" in rsp_keys
    assert "name" in rsp_keys

print(response)
# print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose list output
print("id\t\t\t\t\t type\t\t\t format\t\t name")
for rsp in datasets:
    print(rsp["id"],"\t",rsp["type"],"\t",rsp["format"],"\t\t",rsp["name"])

If you have already registered your datasets, then you can directly set their IDs in the following cell to avoid creating duplicate datasets. 
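
One way to recover an existing ID is to search the listing by name. The sketch below assumes dataset names are distinctive enough in your deployment to identify a dataset, which the API may not guarantee. `find_dataset_id` is a hypothetical helper, and the records are made-up examples in the shape the list cell prints.

```python
def find_dataset_id(datasets, name):
    """Return the id of the first dataset whose name matches, or None."""
    for entry in datasets:
        if entry.get("name") == name:
            return entry.get("id")
    return None


# Illustrative records; the ids are placeholders.
listed = [
    {"id": "a1", "type": "object_detection", "format": "coco", "name": "hardhat_detection_train"},
    {"id": "b2", "type": "object_detection", "format": "coco", "name": "hardhat_detection_val"},
]
print(find_dataset_id(listed, "hardhat_detection_val"))
print(find_dataset_id(listed, "not_registered"))
```

A `None` result matches the default in the cell below, so the registration cells create the dataset only when no match is found.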

In [None]:
train_dataset_id = None 
eval_dataset_id = None 
test_dataset_id = None 

### Train Dataset 

In [None]:
# Create train dataset
if train_dataset_id is None: 
    train_dataset_metadata = {"type": "object_detection",
                              "format": "coco",
                              "workspace":workspace_id,
                              "cloud_file_path": train_dataset_path,
                              "use_for": ["training"],
                              "name": "hardhat_detection_train"
                              }
    data = json.dumps(train_dataset_metadata)
    
    endpoint = f"{host_url}/datasets"
    
    response = requests.post(endpoint,data=data)
    assert response.status_code in (200, 201)
    assert "id" in response.json().keys()
    
    print(response)
    print(json.dumps(response.json(), indent=4))
    train_dataset_id = response.json()["id"]

In [None]:
# Check progress
endpoint = f"{host_url}/datasets/{train_dataset_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)

    if response.json().get("status") == "invalid_pull":
        raise ValueError("Dataset pull failed")
    if response.json().get("status") == "pull_complete":
        break
    time.sleep(5)

### Validation Dataset

In [None]:
# Create eval dataset
if eval_dataset_id is None: 
    eval_dataset_metadata = {"type": "object_detection",
                             "format": "coco",
                             "workspace":workspace_id,
                             "cloud_file_path": eval_dataset_path,
                             "use_for": ["evaluation"],
                             "name" : "hardhat_detection_val" 
                             }
    data = json.dumps(eval_dataset_metadata)
    
    endpoint = f"{host_url}/datasets"
    
    response = requests.post(endpoint,data=data)
    assert response.status_code in (200, 201)
    assert "id" in response.json().keys()
    
    print(response)
    print(json.dumps(response.json(), indent=4))
    eval_dataset_id = response.json()["id"]

In [None]:
# Check progress
endpoint = f"{host_url}/datasets/{eval_dataset_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint)
    assert response.status_code in (200, 201)

    print(response)
    print(json.dumps(response.json(), indent=4))
    if response.json().get("status") == "invalid_pull":
        raise ValueError("Dataset pull failed")
    if response.json().get("status") == "pull_complete":
        break
    time.sleep(5)

## Create Experiment 

Before we can run any jobs such as training, evaluation, or inference, we must create an experiment to set up the network architecture and associated datasets. Then we can chain several jobs together to create our trained models. 

In [None]:
checkpoint_choose_method = "best_model"
data = json.dumps({"network_arch":model_name,
                   "checkpoint_choose_method":checkpoint_choose_method,
                   "workspace": workspace_id,
                   "train_datasets":[train_dataset_id],
                   "eval_dataset":eval_dataset_id,
                   "inference_dataset":eval_dataset_id,
                   "calibration_dataset":train_dataset_id})

endpoint = f"{host_url}/experiments"

response = requests.post(endpoint,data=data)
assert response.status_code in (200, 201)
assert "id" in response.json()

print(response)
print(json.dumps(response.json(), indent=4))
experiment_id = response.json()["id"]

When a job is submitted, we will receive a unique ID to reference back to it. We will store these IDs in the following ```job_map``` variable. 

In [None]:
job_map = {}

#### View hyperparameters that are enabled for AutoML by default <a class="anchor" id="head-14"></a>

In [None]:
# Get default spec schema
endpoint = f"{host_url}/experiments/{experiment_id}/specs/train/schema"
while True:
    response = requests.get(endpoint)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.status_code in (200, 201)
assert "automl_default_parameters" in response.json().keys()
automl_params = response.json()["automl_default_parameters"]
print(json.dumps(automl_params, sort_keys=True, indent=4))

#### Update the experiment with automl parameters to run experiments on <a class="anchor" id="head-14"></a>

In [None]:
automl_information = {
    "automl_enabled": True,
    "automl_algorithm": automl_algorithm,
    "automl_max_recommendations": automl_max_recommendations,
    "automl_hyperparameters": str(automl_params)
}
data = json.dumps({"metric":"kpi", "automl_settings": automl_information})

endpoint = f"{host_url}/experiments/{experiment_id}"

response = requests.patch(endpoint, data=data)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), sort_keys=True, indent=4))

## Assign Pretrained Model

To help bootstrap the model, we can start the model with pre-trained weights that have already seen a large number of images. This will help reduce the data and time required to finetune your model. Several pretrained models are available from NGC. TAO FTMS will automatically pull PTMs available to use. 
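
The cells in this section select a pretrained model by matching the tail of its NGC path. That lookup can be sketched as a standalone function. `find_base_experiment_id` is a hypothetical helper, and the records below are made up for illustration; real entries come from the `experiments:base` endpoint.

```python
def find_base_experiment_id(experiments, ngc_path_suffix):
    """Return the id of the first base experiment whose ngc_path ends with the suffix."""
    for entry in experiments:
        if entry.get("ngc_path", "").endswith(ngc_path_suffix):
            return entry["id"]
    return None


# Made-up records; only the path suffix is taken from this notebook.
available = [
    {"id": "ptm-1", "ngc_path": "example/path/some_other_model:v1"},
    {"id": "ptm-2", "ngc_path": "example/path/pretrained_dino_imagenet:fan_hybrid_small"},
]
print(find_base_experiment_id(available, "pretrained_dino_imagenet:fan_hybrid_small"))
```

Returning `None` when nothing matches lets you stop early, before the experiment is patched with an empty base experiment list.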

In [None]:
# List all pretrained models for the chosen network architecture
endpoint = f"{host_url}/experiments:base"
params = {"network_arch": model_name}
response = requests.get(endpoint, params=params)
assert response.status_code in (200, 201)

response_json = response.json()["experiments"]

for rsp in response_json:
    rsp_keys = rsp.keys()
    if "encryption_key" not in rsp.keys():
        assert "name" in rsp_keys and "version" in rsp_keys and "ngc_path" in rsp_keys
        print(f'PTM Name: {rsp["name"]}; PTM version: {rsp["version"]}; NGC PATH: {rsp["ngc_path"]}')

In [None]:
pretrained_map = {"dino" : "pretrained_dino_imagenet:fan_hybrid_small"}   
endpoint = f"{host_url}/experiments:base"
params = {"network_arch": model_name}
response = requests.get(endpoint, params=params)
assert response.status_code in (200, 201)

response_json = response.json()["experiments"]

# Search for ptm with given ngc path
ptm = []
for rsp in response_json:
    rsp_keys = rsp.keys()
    assert "ngc_path" in rsp_keys
    if rsp["ngc_path"].endswith(pretrained_map[model_name]):
        assert "id" in rsp_keys
        ptm_id = rsp["id"]
        ptm = [ptm_id]
        print("Metadata for model with requested NGC Path")
        print(rsp)
        break

In [None]:
ptm_information = {"base_experiment":ptm}
data = json.dumps(ptm_information)

endpoint = f"{host_url}/experiments/{experiment_id}"

response = requests.patch(endpoint, data=data)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

## Start AutoML

### Retrieve default training spec 

Before launching a training job, we can retrieve the default training spec for our network architecture to use as a starting point to configure the training parameters. 
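
The schema cells in this notebook retry while the server reports that the base spec file is still downloading. The sketch below isolates that retry logic and adds an attempt limit, which the notebook cells do not have. `fetch_schema` is a hypothetical helper, `get_schema` stands in for the HTTP call, and the `in_progress` state in the demo is an invented value.

```python
import time


def fetch_schema(get_schema, retry_delay=2.0, max_attempts=30, sleep=time.sleep):
    """Call get_schema() until the base spec file is ready or attempts run out.

    get_schema is any callable returning (status_code, body_dict).
    """
    for _ in range(max_attempts):
        status_code, body = get_schema()
        still_downloading = (
            status_code == 404
            and "Base spec file download state is " in body.get("error_desc", "")
        )
        if not still_downloading:
            return status_code, body
        sleep(retry_delay)
    raise TimeoutError("Base spec file was not ready after the allowed attempts")


# Simulated server: one "still downloading" reply, then the schema.
responses = iter([
    (404, {"error_desc": "Base spec file download state is in_progress"}),
    (200, {"default": {"train": {"num_gpus": 1}}}),
])
status, body = fetch_schema(lambda: next(responses), sleep=lambda _: None)
print(status, body["default"])
```

With `requests`, `get_schema` would be a small function that calls `requests.get(endpoint)` and returns `(response.status_code, response.json())`.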

In [None]:
# Get default spec schema
endpoint = f"{host_url}/experiments/{experiment_id}/specs/train/schema"

while True:
    response = requests.get(endpoint)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.status_code in (200, 201)
assert "default" in response.json().keys()

print(response)
# full_schema = response.json()
# print(json.dumps(full_schema, indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))

### Configure Training Spec 

Now we can customize the json spec object and set our dataset, training and model parameters. No changes are needed unless you are using a custom dataset or model architecture.

#### Understanding the Spec Schema Structure

Before customizing the training spec, it's helpful to understand the available configuration options and their valid values. The `get_bounds_of_field` utility function helps you explore the schema structure and find valid parameter values if they are defined on the backend (for some fields it might not be defined).

**How to use the schema exploration utility:**

```python
from get_bounds import get_bounds_of_field

# Example: Get valid values for backbone type
# Note the conversion from dictionary access notation to list notation:
# From: ["model"]["backbone"]
# To:   ["model", "backbone"]
print(get_bounds_of_field(full_schema, ["model", "backbone"]))
```

**Key differences in notation:**
- **Dictionary access format**: `["model"]["backbone"]` - This is how you would access nested values in Python
- **Schema path format**: `["model", "backbone"]` - This is the format expected by the `get_bounds_of_field` function
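
To make the two notations concrete, the sketch below follows a list of keys through a nested dictionary. It is not the implementation of `get_bounds_of_field`, whose internals are not shown in this notebook. `get_by_path` is a hypothetical helper, and the toy spec value is illustrative only.

```python
from functools import reduce


def get_by_path(nested, path):
    """Follow a list of keys through nested dictionaries."""
    return reduce(lambda node, key: node[key], path, nested)


# Toy spec; the backbone value is illustrative only.
spec = {"model": {"backbone": "fan_tiny"}}

# Dictionary access and list-path access reach the same value.
assert spec["model"]["backbone"] == get_by_path(spec, ["model", "backbone"])
print(get_by_path(spec, ["model", "backbone"]))
```

A list of keys is convenient because it can be built, stored, and passed around as data, whereas chained subscripts are fixed in the source code.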

---

#### Schema Navigation Tips

1. **Nested Structure**: The schema follows a hierarchical structure where each level represents a configuration category
2. **Path Format**: Always use a list of strings `["level1", "level2", "level3"]` when calling `get_bounds_of_field`
3. **Validation**: This helps ensure you're using valid parameter values before submitting your training job
4. **Documentation**: Use these explorations to understand what options are available for your specific model architecture


No changes to the default spec are needed unless you are using a custom dataset format or want to modify the model architecture or training parameters.
For more details about the configurable parameters, refer to the TAO documentation on [DINO](https://docs.nvidia.com/tao/tao-toolkit/text/object_detection/dino.html).

In [None]:
# Customize train model specs
specs["train"]["num_gpus"] = 1
specs["dataset"]["num_classes"] = 5

print(json.dumps(specs, sort_keys=True, indent=4))

### Submit Training Job  

With our training spec configured, we can now submit a training job. All jobs follow the same flow: retrieve the default spec, customize it, then submit the job. 

In [None]:
# Run action
action = "train"
data = json.dumps({"parent_job_id":None,"action":action,"specs":specs,
                   })

endpoint = f"{host_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data)
assert response.status_code in (200, 201)
assert response.json()

print(response)
print(json.dumps(response.json(), indent=4))

job_map["train_" + model_name] = response.json()
print(job_map)

After submitting the training job, an ID is returned that we can use to monitor the job progress. The following cell will continuously print the latest status until the job is complete. This notebook will track all of the job IDs in the ```job_map``` variable. 
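
The monitoring cells loop until the job reaches a terminal status, with no upper bound on the wait. The sketch below shows the same polling pattern with an optional timeout. `wait_for_job` is a hypothetical helper, and the terminal statuses are the ones checked in this notebook. The non-terminal names in the demo, `Pending` and `Running`, are placeholders.

```python
import time

# Statuses that end the monitoring loops in this notebook.
TERMINAL_STATUSES = ("Done", "Error", "Canceled", "Paused")


def wait_for_job(get_status, interval=15.0, timeout=None,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns a terminal status; return that status.

    get_status is any callable returning the job's current status string.
    timeout is in seconds; None means wait indefinitely, as the notebook cells do.
    """
    deadline = None if timeout is None else clock() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if deadline is not None and clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        sleep(interval)


# Simulated job that finishes on the third poll.
statuses = iter(["Pending", "Running", "Done"])
print(wait_for_job(lambda: next(statuses), sleep=lambda _: None))
```

Passing `sleep` and `clock` as arguments keeps the function testable without real waiting. In the notebook, `get_status` would wrap the `requests.get` call and return `response.json().get("status")`.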

In [None]:
job_id = job_map["train_" + model_name]
endpoint = f"{host_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint)
    if "error_desc" in response.json().keys() and response.json()["error_desc"] in ("Job trying to retrieve not found", "No AutoML run found"):
        print("Job is being created")
        time.sleep(5)
        continue
    print(response)
    print(json.dumps(response.json(), sort_keys=True, indent=4))
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

If you need to cancel the job for any reason, you can uncomment and run the following cell. You can also configure the endpoint to end with ```:pause``` or ```:resume``` instead of ```:cancel``` to temporarily stop and start the job. 
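
The three control endpoints differ only in their suffix, so they can be generated by one function. This is a minimal sketch: `job_action_url` is a hypothetical helper, and the host URL and IDs in the example are placeholders.

```python
# Suffixes named in the paragraph above.
JOB_ACTIONS = ("cancel", "pause", "resume")


def job_action_url(host_url, experiment_id, job_id, action):
    """Return the endpoint that cancels, pauses, or resumes a job."""
    if action not in JOB_ACTIONS:
        raise ValueError(f"action must be one of {JOB_ACTIONS}, got {action!r}")
    return f"{host_url}/experiments/{experiment_id}/jobs/{job_id}:{action}"


# Placeholder host URL and IDs.
print(job_action_url("http://localhost:8090/api/v1/orgs/ea-tlt", "exp-1", "job-1", "pause"))
```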

In [None]:
# job_id = job_map["train_" + model_name]
# endpoint = f"{host_url}/experiments/{experiment_id}/jobs/{job_id}:cancel"

# response = requests.post(endpoint)
# print(response)
# print(json.dumps(response.json(), indent=4))

If the job runs into errors, or you want to inspect its logs, uncomment and run the following cell. Alternatively, you can view the job logs in your cloud workspace under the path `/results/<job_id>/microservices_log.txt`.

In [None]:
# job_id = job_map["train_" + model_name]
# endpoint = f"{host_url}/experiments/{experiment_id}/jobs/{job_id}/logs"

# response = requests.get(endpoint)
# print(response.text)

## Evaluate AutoML Model 

Once our AutoML model has been trained, we can evaluate it on the test dataset to get detection KPIs. 

### Retrieve default evaluation spec 

In [None]:
# Get default spec schema
endpoint = f"{host_url}/experiments/{experiment_id}/specs/evaluate/schema"

while True:
    response = requests.get(endpoint)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.status_code in (200, 201)
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))

### Customize Evaluation Spec 

In [None]:
specs["dataset"]["num_classes"] = 5

print(json.dumps(specs, sort_keys=True, indent=4))

### Submit Evaluation Job 

Note that for this job we will set the ```parent_job_id``` parameter in the body of the request to the completed training job. This is required to pass the trained model from the training job into our evaluation job. 
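
Each job in this notebook takes its input from one earlier job, and the chain can be written down as a lookup table. The parent relationships below are the ones used in this notebook's cells. `job_request` is a hypothetical helper, and it keys job IDs by plain action name rather than by the `job_map` keys used elsewhere in the notebook.

```python
# Parent action for each job type, as used in this notebook's cells.
PARENT_ACTION = {
    "train": None,
    "evaluate": "train",
    "export": "train",
    "gen_trt_engine": "export",
    "inference": "gen_trt_engine",
}


def job_request(action, specs, job_ids):
    """Build the request body for a job, filling parent_job_id from earlier job ids.

    job_ids maps an action name to the id returned when that job was submitted.
    """
    parent_action = PARENT_ACTION[action]
    parent_id = None if parent_action is None else job_ids[parent_action]
    return {"parent_job_id": parent_id, "action": action, "specs": specs}


# Placeholder job ids.
submitted = {"train": "job-train", "export": "job-export"}
print(job_request("evaluate", {"dataset": {"num_classes": 5}}, submitted))
print(job_request("gen_trt_engine", {}, submitted))
```

A `KeyError` from `job_ids` signals that the parent job has not been submitted yet, which surfaces an ordering mistake before any request is sent.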

In [None]:
# Run action
parent = job_map["train_" + model_name]
action = "evaluate"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":specs,
                   })

endpoint = f"{host_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

job_map["evaluate_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by running this cell
job_id = job_map["evaluate_" + model_name]
endpoint = f"{host_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

## TRT Engine Generation 

Now that we have an AutoML-trained model, it can be exported to ONNX format and then converted into an optimized TensorRT engine for deployment. 

### Export Model to ONNX 

In [None]:
# Get default spec schema
endpoint = f"{host_url}/experiments/{experiment_id}/specs/export/schema"

while True:
    response = requests.get(endpoint)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.status_code in (200, 201)
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))

In [None]:
specs["dataset"]["num_classes"] = 5

print(json.dumps(specs, sort_keys=True, indent=4))

In [None]:
# Run action
parent = job_map["train_" + model_name] #parent is trained model
action = "export"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":specs,
                   })

endpoint = f"{host_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

job_map["export_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map["export_" + model_name]
endpoint = f"{host_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

### Convert ONNX to TRT Engine 

In [None]:
# Get default spec schema
endpoint = f"{host_url}/experiments/{experiment_id}/specs/gen_trt_engine/schema"

while True:
    response = requests.get(endpoint)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.status_code in (200, 201)
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))

In [None]:
specs["dataset"]["num_classes"] = 5
specs["gen_trt_engine"]["tensorrt"]["data_type"] = "FP16"

In [None]:
# Run action
parent = job_map["export_" + model_name]
action = "gen_trt_engine"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":specs,
                   })

endpoint = f"{host_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

job_map["model_gen_trt_engine_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map['model_gen_trt_engine_' + model_name]
endpoint = f"{host_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint)
    assert response.status_code in (200, 201)
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    print(response)
    print(json.dumps(response.json(), indent=4))
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

### Inference TRT Engine 

Finally, we can use our optimized AutoML model to run inference on our test set and receive the annotated results. 

In [None]:
# Get default spec schema
endpoint = f"{host_url}/experiments/{experiment_id}/specs/inference/schema"

while True:
    response = requests.get(endpoint)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.status_code in (200, 201)
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))

In [None]:
specs["dataset"]["num_classes"] = 5
specs["dataset"]["batch_size"] = 1

In [None]:
# Run action
parent = job_map["model_gen_trt_engine_" + model_name]
action = "inference"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":specs,
                   })

endpoint = f"{host_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

job_map["inference_trt_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map["inference_trt_" + model_name]
endpoint = f"{host_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

## Clean Up 

You can optionally run this section to delete the datasets and experiment results. 

### Delete experiment <a class="anchor" id="head-23"></a>

In [None]:
endpoint = f"{host_url}/experiments/{experiment_id}"

response = requests.delete(endpoint)

print(response)
print(json.dumps(response.json(), indent=4))
assert response.status_code in (200, 201)

### Delete train dataset <a class="anchor" id="head-24"></a>

In [None]:
endpoint = f"{host_url}/datasets/{train_dataset_id}"

response = requests.delete(endpoint)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

### Delete val dataset <a class="anchor" id="head-24"></a>

In [None]:
endpoint = f"{host_url}/datasets/{eval_dataset_id}"

response = requests.delete(endpoint)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))