### Notebook to demonstrate TAO RT-DETR Object Detection and Distillation through FTMS 

Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique in which a model trained on one task is retrained for a different task. Train Adapt Optimize (TAO) Toolkit is a simple, easy-to-use, Python-based AI toolkit for taking purpose-built AI models and customizing them with your own data.

---

### TAO Distillation 

Knowledge distillation is a powerful new feature in NVIDIA TAO that enables the training of smaller, faster, and more efficient models without sacrificing accuracy. This technique works by leveraging a high-performing, pre-trained teacher model to guide the training of a smaller student model. During training, the student learns not only from the ground truth labels but also from the richer outputs of the teacher, which often capture subtle patterns and generalizations that raw labels alone cannot provide.

By mimicking the behavior of the teacher, the student model can achieve higher accuracy than it would through conventional training methods alone. This is especially valuable when training lightweight models intended for edge or real-time applications, where computational resources are limited but high accuracy is still critical.
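
To make the idea of learning from both labels and teacher outputs concrete, here is a minimal sketch of a generic logit-based distillation loss. It is an illustration only, not the loss TAO uses for RT-DETR (this notebook configures distillation through module bindings instead). The function names and the default values of `alpha` and `temperature` are assumptions chosen for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Weighted sum of hard-label cross-entropy and teacher-to-student KL divergence."""
    # Hard-label term: cross-entropy against the ground-truth class index.
    student_probs = softmax(student_logits)
    hard_loss = -math.log(student_probs[label])

    # Soft-label term: KL(teacher || student) on temperature-softened distributions.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_loss = sum(t * math.log(t / s) for t, s in zip(teacher_soft, student_soft))

    # Multiplying by temperature**2 is a common rescaling of the soft term.
    return alpha * hard_loss + (1.0 - alpha) * (temperature ** 2) * soft_loss


# Illustrative logits for a 3-class problem (made-up numbers).
print(distillation_loss([2.0, 1.0, 0.1], [3.0, 0.5, 0.2], label=0))
```

With `alpha=1.0` the student trains on labels only; lowering `alpha` shifts weight toward matching the teacher's softened outputs.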

Knowledge distillation is ideal in two key scenarios: (1) when your small model isn't reaching the desired accuracy, and (2) when your larger model performs well but is too slow or resource-intensive for deployment. In both cases, distillation can bridge the gap between speed and accuracy, helping you deploy high-performing models that meet real-world constraints.

<img src="assets/distillation_rtdetr_workflow_diagram.png" width="500"/>

---
The following table summarizes a study we ran with convnext_large and convnext_tiny on the COCO dataset. The distilled student reached 51.2 mAP, 1.2 points higher than the same convnext_tiny model fine-tuned directly on COCO (50.0 mAP).

| RT-DETR models            | Pretrained weights | mAP  |
|---------------------------|--------------------|------|
| Baseline (convnext_tiny)  | ImageNet 22K       | 50.0 |
| Teacher (convnext_large)  | ImageNet 22K       | 53.0 |
| Distilled (convnext_tiny) | ImageNet 22K       | 51.2 |

---

### Sample prediction of a trained RT-DETR model

<img align="center" src="sample_images/detection_sample.jpg" width="640">

### Expected Output

This notebook trains a larger object detection model as a teacher, then distills it into a lightweight student object detection model that performs better than the same student trained without distillation.

Detailed instructions about the E2E workflow are covered in the TAO [documentation](https://docs.nvidia.com/tao/tao-toolkit/text/cv_finetuning/pytorch/object_detection/rt_detr.html).

---

### The workflow in a nutshell

This notebook shows how to take an RT-DETR hard hat detection model with a ConvNext Large backbone and distill it into a model roughly 1/4 of the size with similar accuracy, using TAO Fine-Tuning Microservices (FTMS).

1) Configure Connection to TAO FTMS 
2) Login & Create Cloud Workspace
3) Register Train, Validation and Test datasets
4) Train Teacher model
5) Distill Student model
6) Export student model and test inference

---
### Requirements
Prior to running this notebook you must have: 
1) A TAO FTMS server.  [(Setup Guide here)](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#)
2) The sample detection dataset from the [hardhat_detection_coco_dataset_format.ipynb](https://github.com/NVIDIA/tao_tutorials/tree/main/notebooks/tao_api_starter_kit/dataset_prepare/hardhat_detection_coco/hardhat_detection_coco_dataset_format.ipynb) notebook uploaded to your cloud storage.
3) Set the `<>` enclosed variables with values in the Configuration section of the notebook.

---

### Debugging Finetuning Microservice and Jobs

When working with the TAO API, you may encounter issues at different stages. Use the following guidance to debug effectively:

#### 1. Dataset, Experiment, or Workspace CRUD Operation Errors

If you encounter errors related to creating, reading, updating, or deleting datasets, experiments, or workspaces **and the error messages are not clear**, check the logs of the TAO API service pods:

```bash
kubectl logs -f <pod name starting with tao-api-app-pod>
```

#### 2. Errors During Job Launch

For issues that occur **while launching a job**, check both the app and workflow pods:

```bash
kubectl logs -f <pod name starting with tao-api-app-pod>
kubectl logs -f <pod name starting with tao-api-workflow-pod>
```

#### 3. Errors After Job Launch

If errors occur **after a job has been launched**, inspect the job pod logs:

```bash
kubectl logs -f tao-api-sts-<job_id>-0
```

> **Note:**  
> Run these `kubectl` commands on the machine where your Kubernetes service is deployed.

#### Additional Debugging Tips

- **Job logs are automatically uploaded** to your cloud workspace at:  
  `/results/<job_id>/microservices_log.txt`

- **You can also view logs via the Jobs API endpoint:**  
  ```
  /api/v1/orgs/<org_name>/<experiments|datasets>/<experiment_id|dataset_id>/jobs/<job_id>/logs
  ```
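
The logs path pattern above can be filled in programmatically. The following is a small sketch; the helper name `job_logs_endpoint` and the example host, organization, and IDs are hypothetical and only illustrate how the placeholders map to values.

```python
def job_logs_endpoint(host_url, org_name, kind, parent_id, job_id):
    """Build the Jobs API logs URL for a job under an experiment or a dataset."""
    if kind not in ("experiments", "datasets"):
        raise ValueError(f"kind must be 'experiments' or 'datasets', got {kind!r}")
    # Strip a trailing slash so the URL never contains a double slash.
    root = host_url.rstrip("/")
    return f"{root}/api/v1/orgs/{org_name}/{kind}/{parent_id}/jobs/{job_id}/logs"


# Hypothetical values, for illustration only.
print(job_logs_endpoint("https://tao.example.com/", "my_org", "experiments", "exp-1", "job-9"))
```

The resulting URL can then be passed to `requests.get` together with the `Authorization` header created in the Login section.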

**Summary Table**

| Error Type                | Where to Check Logs                                      |
|-------------------------- |---------------------------------------------------------|
| CRUD operation errors     | `tao-api-app-pod`                                       |
| Job launch errors         | `tao-api-app-pod`, `tao-api-workflow-pod`               |
| Post-launch job errors    | `tao-api-sts-<job_id>-0`                                |
| All job logs (cloud)      | `/results/<job_id>/microservices_log.txt`               |
| All job logs (API)        | `/api/v1/orgs/<org_name>/<experiments\|datasets>/<experiment_id\|dataset_id>/jobs/<job_id>/logs` |

---
### Performance Benchmarks

#### Execution Time Breakdown

The following table shows the approximate time required for each stage of the RT-DETR distillation workflow:

| **Stage** | **Duration** | **Description** |
|-----------|--------------|-----------------|
| Train Dataset Pull | **15min** | Train dataset verification and preprocessing |
| Val/Test Dataset Pull | **3min 30s** | Validation and test dataset verification |
| Teacher Model Training | **6hr** | Teacher model training (ConvNext-Large backbone, 80 epochs) |
| Teacher Model Evaluation | **4min 45s** | Performance assessment of teacher model |
| Student Model Distillation | **3hr** | Knowledge distillation training (80 epochs) |
| Student Model Evaluation | **3min 45s** | Performance assessment of distilled model |
| ONNX + TensorRT Export | **3min + 7min** | Model optimization and engine generation |
| TensorRT Inference | **7min** | High-performance inference testing |
| **Total Time** | **9hr 45min** | **Complete end-to-end workflow** |
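
As a quick check of the table above, the stage durations can be summed directly. This sketch copies the values from the table; the exact sum is 9 hr 44 min, which the table rounds to 9 hr 45 min.

```python
from datetime import timedelta

# Stage durations copied from the Execution Time Breakdown table.
stage_durations = {
    "Train Dataset Pull": timedelta(minutes=15),
    "Val/Test Dataset Pull": timedelta(minutes=3, seconds=30),
    "Teacher Model Training": timedelta(hours=6),
    "Teacher Model Evaluation": timedelta(minutes=4, seconds=45),
    "Student Model Distillation": timedelta(hours=3),
    "Student Model Evaluation": timedelta(minutes=3, seconds=45),
    "ONNX + TensorRT Export": timedelta(minutes=3 + 7),
    "TensorRT Inference": timedelta(minutes=7),
}

total = sum(stage_durations.values(), timedelta())

# Fraction of the workflow spent in the two training stages.
training = stage_durations["Teacher Model Training"] + stage_durations["Student Model Distillation"]
training_share = training / total

print(total)                     # 9:44:00
print(f"{training_share:.0%}")   # 92%
```

Teacher training and distillation dominate the runtime, so reducing epochs or using a faster GPU affects the total far more than any other stage.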

#### Test Environment Specifications

| **Component** | **Specification** |
|---------------|-------------------|
| **GPU** | 1x NVIDIA A40 |
| **Training Dataset** | 4,000 images (1,010 MB total) |
| **Validation Dataset** | 500 images (125 MB total) |
| **Test Dataset** | 500 images (125 MB total) |
| **Training Epochs** | 80 (both teacher and student) |
| **Model Architecture** | RT-DETR with ConvNext-Large → ConvNext-Tiny |
| **Task** | Hard hat detection (2 classes) |

#### Performance Factors

> **Important Note:**  
> Actual execution times may vary significantly based on:
> 
> - **Hardware Configuration**: GPU type, memory, and compute capability
> - **Storage Performance**: Local disk I/O speed and cloud storage latency
> - **Network Conditions**: Bandwidth and latency to cloud storage
> - **System Load**: Other concurrent processes and resource utilization
> - **Dataset Size**: Number of images and total data volume
> - **Batch Size**: Larger batch sizes can improve training stability and speed
> - **Model Configuration**: Backbone architecture, epochs, and hyperparameters
> - **Distillation Settings**: Temperature, alpha, and loss function weights

In [None]:
import json
import os
import requests
import time
from IPython.display import clear_output
import glob

## Configuration 

Fill in all `<>` enclosed variables with relevant values under this section. 

### TAO FTMS Host & Credentials 

In [None]:
# Configure for your TAO FTMS server 
host_url = "<HOST_URL>"
ngc_key = "<NGC_API_KEY>"
ngc_org_name = "<NGC_ORG_NAME>" 

### Cloud Storage Setup & Credentials 

In [None]:
# Cloud bucket details to access datasets and store experiment results
cloud_metadata = {}
cloud_metadata["name"] = "tao_workspace"  # A Representative name for this cloud info
cloud_metadata["cloud_type"] = "aws"  # If it's AWS, HuggingFace or Azure
cloud_metadata["cloud_specific_details"] = {}
cloud_metadata["cloud_specific_details"]["cloud_region"] = "<BUCKET_REGION>"  # Bucket region
cloud_metadata["cloud_specific_details"]["cloud_bucket_name"] = "<BUCKET_NAME>"  # Bucket name
# Access and Secret for AWS
cloud_metadata["cloud_specific_details"]["access_key"] = "<ACCESS_KEY>"
cloud_metadata["cloud_specific_details"]["secret_key"] = "<SECRET_KEY>"

### Dataset Paths in Cloud Storage 

In [None]:
#FIX ME - adjust paths to point to datasets in your cloud storage. If using S3 Bucket do not include the bucket name in the path. 
train_dataset_path =  "/<PATH_TO_DATASET_IN_CLOUD_STORAGE>/hard_hat_detection_coco/train"
eval_dataset_path = "/<PATH_TO_DATASET_IN_CLOUD_STORAGE>/hard_hat_detection_coco/val"
test_dataset_path = "/<PATH_TO_DATASET_IN_CLOUD_STORAGE>/hard_hat_detection_coco/test"

### Model Configuration 

In [None]:
#No changes needed 
model_name = "rtdetr"
ds_type = "object_detection"
ds_format = "coco"
num_classes = 2
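
Before moving on, it can help to confirm that no `<>` placeholder was left unfilled. The following is an optional sketch; the helper name `find_unfilled` and the example dictionary are hypothetical and not part of TAO.

```python
import re

# Matches placeholders such as <HOST_URL> or <BUCKET_NAME>.
PLACEHOLDER = re.compile(r"<[A-Z_]+>")


def find_unfilled(value, path=""):
    """Return the paths of all string values that still contain a <PLACEHOLDER>."""
    if isinstance(value, dict):
        found = []
        for key, item in value.items():
            child_path = f"{path}.{key}" if path else str(key)
            found.extend(find_unfilled(item, child_path))
        return found
    if isinstance(value, str) and PLACEHOLDER.search(value):
        return [path]
    return []


# Example configuration with two placeholders still unfilled.
example = {
    "host_url": "<HOST_URL>",
    "cloud_metadata": {
        "cloud_type": "aws",
        "cloud_specific_details": {"cloud_region": "<BUCKET_REGION>"},
    },
    "train_dataset_path": "/data/hard_hat_detection_coco/train",
}
print(find_unfilled(example))
```

In the notebook, you could pass a dictionary built from `host_url`, `ngc_key`, `ngc_org_name`, `cloud_metadata`, and the dataset paths, and stop if the returned list is not empty.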

## Login

In [None]:
#Use NGC Key to login to FTMS 
data = json.dumps({"ngc_org_name": ngc_org_name,
                   "ngc_key": ngc_key,
                   "enable_telemetry": True})
response = requests.post(f"{host_url}/api/v1/login", data=data)
assert response.status_code in (200, 201)
assert "token" in response.json().keys()
token = response.json()["token"]
print("JWT",token)

# Set base URL
base_url = f"{host_url}/api/v1/orgs/{ngc_org_name}"
print("API Calls will be forwarded to",base_url)

headers = {"Authorization": f"Bearer {token}"}

## Create cloud workspace
This workspace is where your datasets reside and where TAO FTMS jobs push their results.

In [None]:
# Create cloud workspace
data = json.dumps(cloud_metadata)

endpoint = f"{base_url}/workspaces"

response = requests.post(endpoint,data=data,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

assert "id" in response.json().keys()
workspace_id = response.json()["id"]

## Register datasets with FTMS 

TAO FTMS requires datasets in your cloud storage to be registered to produce a unique ID that can be attached to training jobs. This step only needs to be done once and then you can use the dataset across any experiments that support the dataset format.

### List Registered Datasets

In [None]:
endpoint = f"{base_url}/datasets"

response = requests.get(endpoint, headers=headers)
assert response.status_code in (200, 201)

datasets = response.json()["datasets"]
for rsp in datasets:
    rsp_keys = rsp.keys()
    assert "id" in rsp_keys
    assert "type" in rsp_keys
    assert "format" in rsp_keys
    assert "name" in rsp_keys

print(response)
# print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose list output
print("id\t\t\t\t\t type\t\t\t format\t\t name")
for rsp in datasets:
    print(rsp["id"],"\t",rsp["type"],"\t",rsp["format"],"\t\t",rsp["name"])

If you have already registered your datasets, then you can directly set their IDs in the following cell to avoid creating duplicate datasets. 

In [None]:
train_dataset_id = None 
eval_dataset_id = None 
test_dataset_id = None 
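
Instead of copying IDs by hand, you can look them up in the dataset list returned by the previous cell. This is a sketch under the assumption that each entry carries the `id`, `type`, `format`, and `name` keys asserted above; the helper name `find_dataset_id` and the sample entries are hypothetical.

```python
def find_dataset_id(datasets, name, ds_type=None, ds_format=None):
    """Return the ID of the first dataset matching name (and optionally type/format), or None."""
    for entry in datasets:
        if entry.get("name") != name:
            continue
        if ds_type is not None and entry.get("type") != ds_type:
            continue
        if ds_format is not None and entry.get("format") != ds_format:
            continue
        return entry.get("id")
    return None


# Made-up entries shaped like the items in response.json()["datasets"].
sample_datasets = [
    {"id": "id-001", "type": "object_detection", "format": "coco", "name": "hardhat_detection_train"},
    {"id": "id-002", "type": "object_detection", "format": "coco", "name": "hardhat_detection_val"},
]
print(find_dataset_id(sample_datasets, "hardhat_detection_val"))
print(find_dataset_id(sample_datasets, "hardhat_detection_test"))
```

A result of `None` means the dataset has not been registered yet, so the creation cells that follow will register it.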

### Train Dataset 

In [None]:
# Create train dataset
if train_dataset_id is None: 
    train_dataset_metadata = {"type": ds_type,
                              "format": ds_format,
                              "workspace":workspace_id,
                              "cloud_file_path": train_dataset_path,
                              "use_for": ["training"],
                              "name": "hardhat_detection_train"
                              }
    data = json.dumps(train_dataset_metadata)
    
    endpoint = f"{base_url}/datasets"
    
    response = requests.post(endpoint,data=data,headers=headers)
    assert response.status_code in (200, 201)
    assert "id" in response.json().keys()
    
    print(response)
    print(json.dumps(response.json(), indent=4))
    train_dataset_id = response.json()["id"]

In [None]:
# Check progress
endpoint = f"{base_url}/datasets/{train_dataset_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)

    if response.json().get("status") == "invalid_pull":
        raise ValueError("Dataset pull failed")
    if response.json().get("status") == "pull_complete":
        break
    time.sleep(5)

### Validation Dataset

In [None]:
# Create eval dataset
if eval_dataset_id is None: 
    eval_dataset_metadata = {"type": ds_type,
                             "format": ds_format,
                             "workspace":workspace_id,
                             "cloud_file_path": eval_dataset_path,
                             "use_for": ["evaluation"],
                             "name" : "hardhat_detection_val" 
                             }
    data = json.dumps(eval_dataset_metadata)
    
    endpoint = f"{base_url}/datasets"
    
    response = requests.post(endpoint,data=data,headers=headers)
    assert response.status_code in (200, 201)
    assert "id" in response.json().keys()
    
    print(response)
    print(json.dumps(response.json(), indent=4))
    eval_dataset_id = response.json()["id"]

In [None]:
# Check progress
endpoint = f"{base_url}/datasets/{eval_dataset_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)

    print(response)
    print(json.dumps(response.json(), indent=4))
    if response.json().get("status") == "invalid_pull":
        raise ValueError("Dataset pull failed")
    if response.json().get("status") == "pull_complete":
        break
    time.sleep(5)

### Test Dataset

In [None]:
# Create testing dataset for inference
if test_dataset_id is None: 
    test_dataset_metadata = {"type": ds_type,
                             "format":ds_format,
                             "workspace":workspace_id,
                             "cloud_file_path": test_dataset_path,
                             "use_for": ["testing"],
                             "name": "hardhat_detection_test"
                             }
    data = json.dumps(test_dataset_metadata)
    
    endpoint = f"{base_url}/datasets"
    
    response = requests.post(endpoint,data=data,headers=headers)
    assert response.status_code in (200, 201)
    assert "id" in response.json().keys()
    
    print(response)
    print(json.dumps(response.json(), indent=4))
    test_dataset_id = response.json()["id"]

In [None]:
# Check progress
endpoint = f"{base_url}/datasets/{test_dataset_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)

    print(response)
    print(json.dumps(response.json(), indent=4))
    if response.json().get("status") == "invalid_pull":
        raise ValueError("Dataset pull failed")
    if response.json().get("status") == "pull_complete":
        break
    time.sleep(5)
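
The three progress-check cells above repeat the same polling loop. One way to factor it out, sketched here, is a helper that takes a status-fetching callable and adds a timeout. The helper name `wait_for_pull` and the one-hour default timeout are assumptions; the status strings are the ones used in the cells above.

```python
import time


def wait_for_pull(get_status, poll_seconds=5, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until the dataset pull completes, fails, or times out."""
    waited = 0
    while True:
        status = get_status()
        if status == "pull_complete":
            return status
        if status == "invalid_pull":
            raise ValueError("Dataset pull failed")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Dataset pull still {status!r} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated status sequence; no network access or real waiting is involved.
simulated = iter([None, None, "pull_complete"])
print(wait_for_pull(lambda: next(simulated), sleep=lambda seconds: None))
```

In the notebook, `get_status` could be `lambda: requests.get(endpoint, headers=headers).json().get("status")`.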

## Create Experiment 

Before we can run any jobs such as training, evaluation, distillation or inference, we must create an experiment to set up the network architecture and associated datasets. Then we can chain several jobs together to create our trained models. 

In [None]:
encode_key = "tlt_encode"
checkpoint_choose_method = "best_model"
data = json.dumps({"network_arch":model_name,
                   "encryption_key":encode_key,
                   "checkpoint_choose_method":checkpoint_choose_method,
                   "workspace": workspace_id,
                   "train_datasets":[train_dataset_id],
                   "eval_dataset":eval_dataset_id,
                   "inference_dataset":test_dataset_id,
                   "calibration_dataset":train_dataset_id})

endpoint = f"{base_url}/experiments"

response = requests.post(endpoint,data=data,headers=headers)
assert response.status_code in (200, 201)
assert "id" in response.json()

print(response)
print(json.dumps(response.json(), indent=4))
experiment_id = response.json()["id"]

When a job is submitted, we will receive a unique ID to reference back to it. We will store these IDs in the following ```job_map``` variable. 

In [None]:
job_map = {}

## Train Teacher Model 

Distillation first requires a large, high accuracy teacher model. To get this teacher model, we will train RT-DETR with a ConvNext Large backbone on our hard hat detection dataset. The size of this model will be 252 million parameters. 

The following table lists all the supported teacher and student backbones. For example, you can use `RT-DETR+convnextv2_huge` as the teacher and distill a `RT-DETR+resnet18` student.

| Supported teacher backbones           | Supported student backbones |
|---------------------------------------|-----------------------------|
| convnext_tiny/small/base/large/xlarge | convnext_tiny/small/large   |
| convnextv2_nano/tiny/base/large/huge  | efficientvit_b0/b1          |
| resnet18/34/50/101                    | resnet34/50                 |

### Retrieve default training spec 

Before launching a training job, we can retrieve the default training spec for our network architecture to use as a starting point to configure the training parameters. 

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/train/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.status_code in (200, 201)
assert "default" in response.json().keys()

print(response)
# full_schema = response.json()
# print(json.dumps(full_schema, indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))

### Configure Training Spec 

Now we can customize the json spec object and set our dataset, training and model parameters. No changes are needed unless you are using a custom dataset or model architecture.

#### Understanding the Spec Schema Structure

Before customizing the training spec, it's helpful to understand the available configuration options and their valid values. The `get_bounds_of_field` utility function helps you explore the schema structure and find valid parameter values if they are defined on the backend (for some fields it might not be defined).

**How to use the schema exploration utility:**

```python
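from get_bounds import get_bounds_of_field

# full_schema is the complete schema response, i.e. full_schema = response.json()
# from the "Get default spec schema" cell (uncomment that line there first).

# Example: Get valid values for backbone type
# Note the conversion from dictionary access notation to list notation:
# From: ["model"]["backbone"]
# To:   ["model", "backbone"]
print(get_bounds_of_field(full_schema, ["model", "backbone"]))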
```

**Key differences in notation:**
- **Dictionary access format**: `["model"]["backbone"]` - This is how you would access nested values in Python
- **Schema path format**: `["model", "backbone"]` - This is the format expected by the `get_bounds_of_field` function

---

#### Schema Navigation Tips

1. **Nested Structure**: The schema follows a hierarchical structure where each level represents a configuration category
2. **Path Format**: Always use a list of strings `["level1", "level2", "level3"]` when calling `get_bounds_of_field`
3. **Validation**: This helps ensure you're using valid parameter values before submitting your training job
4. **Documentation**: Use these explorations to understand what options are available for your specific model architecture
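
To illustrate the difference between the two notations, the sketch below walks a nested dictionary with a list of keys. `lookup_path` is a hypothetical helper written for this example; it does not reproduce `get_bounds_of_field`, and the `example_specs` dictionary is a made-up fragment of a spec.

```python
from functools import reduce


def lookup_path(nested, path):
    """Follow a list of keys through nested dictionaries and return the value found."""
    return reduce(lambda node, key: node[key], path, nested)


# Made-up fragment shaped like the "model" section of a training spec.
example_specs = {"model": {"backbone": "convnext_large", "num_queries": 300}}

# Dictionary access and path-list lookup reach the same value.
direct = example_specs["model"]["backbone"]
by_path = lookup_path(example_specs, ["model", "backbone"])
print(direct, by_path)
```

A path list is convenient when the field to inspect is chosen at run time, because the same function call works for any nesting depth.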


No changes to the default spec are needed unless you are using a custom dataset format or want to modify the model architecture and training parameters.
For more details about the configurable parameters, refer to the TAO documentation on [RT-DETR](https://docs.nvidia.com/tao/tao-toolkit/text/cv_finetuning/pytorch/object_detection/rt_detr.html#creating-an-experiment-spec-file).

In [None]:
# Customize train model specs
specs["train"]["num_epochs"] = 80
specs["train"]["checkpoint_interval"] = 20
specs["train"]["validation_interval"] = 5
specs["train"]["optim"]["lr_backbone"] = 0.0005
specs["train"]["optim"]["lr"] = 0.001
specs["train"]["optim"]["lr_steps"] = [1000]
specs["train"]["optim"]["momentum"] = 0.9
specs["train"]["precision"] = "bf16"
specs["train"]["activation_checkpoint"] = True 
specs["train"]["ema"] = False 
specs["train"]["num_gpus"] = 1

specs["dataset"]["batch_size"] = 8 #If OOM error, decrease batch size
specs["dataset"]["workers"] = 4
specs["dataset"]["remap_mscoco_category"] = False 
specs["dataset"]["num_classes"] = num_classes + 1 #RT-DETR requires + 1 to number  of classes 
specs["dataset"]["augmentation"]["eval_spatial_size"] = [416, 416] #set to match image size in dataset 
specs["dataset"]["augmentation"]["train_spatial_size"] = [416, 416] #set to match image size in dataset 
specs["dataset"]["augmentation"]["distortion_prob"] = 0.3
specs["dataset"]["augmentation"]["iou_crop_prob"] = 0.3

specs["model"]["backbone"] = "convnext_large"
specs["model"]["train_backbone"] = True 
specs["model"]["return_interm_indices"] = [1,2,3] 
specs["model"]["dec_layers"] = 6
specs["model"]["enc_layers"] = 1 
specs["model"]["num_queries"] = 300

print(json.dumps(specs, sort_keys=True, indent=4))

### Submit Training Job  

With our training spec configured, we can now submit a training job. All jobs follow the same flow of retrieving the default spec, customizing it, then submitting the job. 

In [None]:
# Run action
action = "train"
data = json.dumps({"parent_job_id":None,"action":action,"specs":specs,
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)
assert response.json()

print(response)
print(json.dumps(response.json(), indent=4))

job_map["train_" + model_name] = response.json()
print(job_map)

After submitting the training job, an ID is returned that we can use to monitor the job progress. The following cell will continuously print the latest status until the job is complete. This notebook will track all of the job IDs in the ```job_map``` variable. 

In [None]:
job_id = job_map["train_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    if "error_desc" in response.json().keys() and response.json()["error_desc"] in ("Job trying to retrieve not found", "No AutoML run found"):
        print("Job is being created")
        time.sleep(5)
        continue
    print(response)
    print(json.dumps(response.json(), sort_keys=True, indent=4))
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

If you need to cancel the job for any reason, you can uncomment and run the following cell. You can also configure the endpoint to end with ```:pause``` or ```:resume``` instead of ```:cancel``` to temporarily stop and start the job. 

In [None]:
# job_id = job_map["train_" + model_name]
# endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}:cancel"

# response = requests.post(endpoint, headers=headers)
# print(response)
# print(json.dumps(response.json(), indent=4))
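
The `:cancel`, `:pause`, and `:resume` endpoints described above differ only in their suffix, so the URL can be built by a small helper. This is a sketch; the helper name `job_action_endpoint` and the example IDs are hypothetical.

```python
JOB_ACTIONS = ("cancel", "pause", "resume")


def job_action_endpoint(base_url, experiment_id, job_id, action):
    """Build the endpoint that cancels, pauses, or resumes a job."""
    if action not in JOB_ACTIONS:
        raise ValueError(f"action must be one of {JOB_ACTIONS}, got {action!r}")
    return f"{base_url}/experiments/{experiment_id}/jobs/{job_id}:{action}"


# Hypothetical base URL and IDs, for illustration only.
example_base_url = "https://tao.example.com/api/v1/orgs/my_org"
for action in JOB_ACTIONS:
    print(job_action_endpoint(example_base_url, "exp-1", "job-9", action))
```

The returned URL is used with `requests.post(endpoint, headers=headers)`, as in the commented-out cell above.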

If the job runs into any errors or if you want to check the job logs, you can uncomment and run the following cell to view the job logs. Alternatively, you can view the job logs in your cloud workspace under the path `/results/<job_id>/microservices_log.txt`.

In [None]:
# job_id = job_map["train_" + model_name]
# endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}/logs"

# response = requests.get(endpoint, headers=headers)
# print(response.text)

## Evaluate Teacher Model 

Once our teacher model has been trained, we can evaluate it on the test dataset to get detection KPIs. 

### Retrieve default evaluation spec 

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/evaluate/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.status_code in (200, 201)
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))

### Customize Evaluation Spec 

In [None]:
specs["dataset"]["batch_size"] = 8 #If OOM error, decrease batch size
specs["dataset"]["workers"] = 4
specs["dataset"]["remap_mscoco_category"] = False
specs["dataset"]["num_classes"] = num_classes + 1
specs["dataset"]["augmentation"]["eval_spatial_size"] = [416, 416]

specs["model"]["backbone"] = "convnext_large"
specs["model"]["train_backbone"] = True 
specs["model"]["return_interm_indices"] = [1,2,3] 
specs["model"]["dec_layers"] = 6
specs["model"]["enc_layers"] = 1 
specs["model"]["num_queries"] = 300

print(json.dumps(specs, sort_keys=True, indent=4))

### Submit Evaluation Job 

Note that for this job we will set the ```parent_job_id``` parameter in the body of the request to the completed training job. This is required to pass the trained model from the training job into our evaluation job. 

In [None]:
# Run action
parent = job_map["train_" + model_name]
action = "evaluate"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":specs,
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

job_map["evaluate_" + model_name] = response.json()
print(job_map)
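
Every job submission in this notebook sends the same three fields, so the request body can be built by one function. The sketch below shows how `parent_job_id` chains an evaluation to a training job; the helper name `build_job_request` and the job ID are hypothetical.

```python
import json


def build_job_request(action, specs, parent_job_id=None):
    """Serialize the request body for a job submission."""
    return json.dumps({"parent_job_id": parent_job_id, "action": action, "specs": specs})


# A training job has no parent; an evaluation job points at the training job.
example_job_map = {"train_rtdetr": "job-train-123"}  # made-up job ID
train_body = build_job_request("train", {"train": {"num_epochs": 80}})
evaluate_body = build_job_request(
    "evaluate",
    {"dataset": {"batch_size": 8}},
    parent_job_id=example_job_map["train_rtdetr"],
)
print(train_body)
print(evaluate_body)
```

Passing the wrong parent is an easy mistake when several jobs are chained, so keeping the lookup in `job_map` next to the call makes the dependency explicit.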

In [None]:
# Monitor job status by running this cell
job_id = job_map["evaluate_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

## Distill Model

With our teacher model trained, we can run a distill job and specify the smaller ConvNext Tiny backbone as the student model. This reduces the model from 252 million parameters to 65.7 million while achieving similar accuracy, which should make inference substantially faster.
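
The size reduction follows directly from the two parameter counts quoted above:

```python
# Parameter counts in millions, as stated in this notebook.
teacher_params_m = 252.0
student_params_m = 65.7

size_ratio = student_params_m / teacher_params_m
reduction = 1.0 - size_ratio

print(f"Student is {size_ratio:.1%} of the teacher's size")  # 26.1%
print(f"Parameter reduction: {reduction:.1%}")               # 73.9%
```

The student therefore has roughly one quarter of the teacher's parameters.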

### Configure Distillation 

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/train/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.status_code in (200, 201)
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))

In [None]:
# Customize train model specs
specs["train"]["num_epochs"] = 80
specs["train"]["checkpoint_interval"] = 20
specs["train"]["validation_interval"] = 5
specs["train"]["optim"]["lr_backbone"] = 0.0005
specs["train"]["optim"]["lr"] = 0.001
specs["train"]["optim"]["lr_steps"] = [1000]
specs["train"]["optim"]["momentum"] = 0.9
specs["train"]["precision"] = "bf16"
specs["train"]["activation_checkpoint"] = True 
specs["train"]["ema"] = False 
specs["train"]["num_gpus"] = 1

specs["dataset"]["batch_size"] = 8 #If OOM error, decrease batch size
specs["dataset"]["workers"] = 4
specs["dataset"]["remap_mscoco_category"] = False 
specs["dataset"]["num_classes"] = num_classes + 1 
specs["dataset"]["augmentation"]["eval_spatial_size"] = [416, 416]
specs["dataset"]["augmentation"]["train_spatial_size"] = [416, 416]
specs["dataset"]["augmentation"]["distortion_prob"] = 0.3
specs["dataset"]["augmentation"]["iou_crop_prob"] = 0.3

#Configure student model parameters 
specs["model"]["backbone"] = "convnext_tiny"
specs["model"]["train_backbone"] = True 
specs["model"]["return_interm_indices"] = [1,2,3] 
specs["model"]["dec_layers"] = 6
specs["model"]["enc_layers"] = 1 
specs["model"]["num_queries"] = 300

#Configure teacher model parameters from previous training job 
specs["distill"] = {}
specs["distill"]["teacher"] = {}
specs["distill"]["teacher"]["backbone"] = "convnext_large"
specs["distill"]["teacher"]["return_interm_indices"] = [1,2,3] 
specs["distill"]["teacher"]["dec_layers"] = 6 
specs["distill"]["teacher"]["enc_layers"] = 1 
specs["distill"]["teacher"]["num_queries"] = 300 
specs["distill"]["bindings"] = [{"teacher_module_name": "srcs", "student_module_name": "srcs", "criterion": "IOU", "weight": 20}]


print(json.dumps(specs, sort_keys=True, indent=4))
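
The teacher and student specs share most training and dataset settings and differ mainly in the backbone and the `distill` section. One way to avoid repeating the shared values, sketched here, is to merge nested override dictionaries into the default spec. `apply_overrides` is a hypothetical helper, and the `default_specs` values below are made up for the example rather than taken from the TAO defaults.

```python
import copy


def apply_overrides(specs, overrides):
    """Return a copy of specs with overrides merged in; nested dicts merge recursively."""
    merged = copy.deepcopy(specs)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Made-up defaults standing in for response.json()["default"].
default_specs = {
    "train": {"num_epochs": 10, "optim": {"lr": 0.0001, "momentum": 0.9}},
    "model": {"backbone": "resnet50"},
}
shared = {"train": {"num_epochs": 80, "optim": {"lr": 0.001}}}

teacher_specs = apply_overrides(apply_overrides(default_specs, shared),
                                {"model": {"backbone": "convnext_large"}})
student_specs = apply_overrides(apply_overrides(default_specs, shared),
                                {"model": {"backbone": "convnext_tiny"}})
print(teacher_specs["model"], student_specs["model"])
```

Because the helper copies its input, the default spec stays untouched and can be reused for each job.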

### Submit Distillation Job 

In [None]:
# Run action
action = "distill"
parent = job_map["train_" + model_name]
data = json.dumps({"parent_job_id":parent,"action":action,"specs":specs,
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)

print(response)
print(json.dumps(response.json(), indent=4))

assert response.status_code in (200, 201)
assert response.json()

job_map["distill_" + model_name] = response.json()
print(job_map)

In [None]:
job_id = job_map["distill_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    if "error_desc" in response.json().keys() and response.json()["error_desc"] in ("Job trying to retrieve not found", "No AutoML run found"):
        print("Job is being created")
        time.sleep(5)
        continue
    print(response)
    print(json.dumps(response.json(), sort_keys=True, indent=4))
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

## Evaluate Distilled Model

### Configure Evaluation Job 

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/evaluate/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.status_code in (200, 201)
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))
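
The schema request above retries while the base spec file is still downloading, and the same loop appears for each action. The sketch below wraps that logic in a hypothetical helper, `fetch_default_specs`, and adds an attempt limit. It assumes a caller-supplied `get_response` function that returns a `(status_code, body)` pair. The canned responses are illustrative.

```python
import time


def fetch_default_specs(get_response, max_attempts=30, wait_seconds=2, sleep=time.sleep):
    """Return the "default" specs, retrying while the base spec file downloads."""
    for _ in range(max_attempts):
        status_code, body = get_response()
        downloading = (
            status_code == 404
            and "Base spec file download state is " in body.get("error_desc", "")
        )
        if not downloading:
            break
        sleep(wait_seconds)
    else:
        # The loop never hit "break": every attempt reported a download in progress
        raise TimeoutError("Base spec file was still downloading after the last attempt")
    if status_code not in (200, 201) or "default" not in body:
        raise RuntimeError(f"Schema request failed with status {status_code}: {body}")
    return body["default"]


# Demo: one "still downloading" reply followed by a successful one
_fake_replies = iter([
    (404, {"error_desc": "Base spec file download state is pending"}),
    (200, {"default": {"dataset": {"batch_size": 8}}}),
])
demo_specs = fetch_default_specs(lambda: next(_fake_replies), sleep=lambda seconds: None)
print(demo_specs)  # {'dataset': {'batch_size': 8}}
```

In the notebook, `get_response` could wrap `requests.get(endpoint, headers=headers)` and return `(response.status_code, response.json())`.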

In [None]:
specs["dataset"]["batch_size"] = 8 #If OOM error, decrease batch size
specs["dataset"]["workers"] = 4
specs["dataset"]["remap_mscoco_category"] = False 
specs["dataset"]["num_classes"] = num_classes + 1
specs["dataset"]["augmentation"]["eval_spatial_size"] = [416, 416]

specs["model"]["backbone"] = "convnext_tiny"
specs["model"]["train_backbone"] = True 
specs["model"]["return_interm_indices"] = [1,2,3] 
specs["model"]["dec_layers"] = 6
specs["model"]["enc_layers"] = 1 
specs["model"]["num_queries"] = 300

print(json.dumps(specs, sort_keys=True, indent=4))

### Submit Evaluation Job 

In [None]:
# Run action
parent = job_map["distill_" + model_name]  # The parent is now the distillation job, so this evaluation uses the distilled model
action = "evaluate"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":specs,
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

job_map["evaluate_distilled_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map["evaluate_distilled_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

## TRT Engine Generation 

Now that we have a small, accurate distilled model, we can export it to ONNX format and then convert it into an optimized TensorRT engine for deployment.
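
Each job in this part of the notebook posts the same three-field request body. Only the action and the parent job change: export takes the distillation job as its parent, engine generation takes the export job, and inference takes the engine job. The sketch below shows that chain with a hypothetical `build_job_payload` helper and placeholder job IDs. No request is sent.

```python
import json


def build_job_payload(action, parent_job_id, specs):
    """Build the JSON body posted to the experiment's jobs endpoint."""
    return json.dumps({"parent_job_id": parent_job_id, "action": action, "specs": specs})


# (action, parent action) pairs in the order they run in this section
PIPELINE = [
    ("export", "distill"),
    ("gen_trt_engine", "export"),
    ("inference", "gen_trt_engine"),
]

demo_job_ids = {"distill": "job-distill-000"}  # placeholder ID, not a real job
payloads = []
for action, parent_action in PIPELINE:
    payloads.append(build_job_payload(action, demo_job_ids[parent_action], specs={}))
    # In the notebook this ID would come from the POST response
    demo_job_ids[action] = f"job-{action}-000"

for payload in payloads:
    print(payload)
```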

### Export Model to ONNX 

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/export/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.status_code in (200, 201)
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))

In [None]:
specs["dataset"]["batch_size"] = 8 #If OOM error, decrease batch size
specs["dataset"]["workers"] = 4
specs["dataset"]["remap_mscoco_category"] = False
specs["dataset"]["num_classes"] = num_classes + 1 
specs["dataset"]["augmentation"]["eval_spatial_size"] = [416, 416]
specs["dataset"]["augmentation"]["train_spatial_size"] = [416, 416]

specs["model"]["backbone"] = "convnext_tiny"
specs["model"]["train_backbone"] = True 
specs["model"]["return_interm_indices"] = [1,2,3] 
specs["model"]["dec_layers"] = 6
specs["model"]["enc_layers"] = 1 
specs["model"]["num_queries"] = 300

specs["export"]["input_height"] = 416
specs["export"]["input_width"] = 416

print(json.dumps(specs, sort_keys=True, indent=4))
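
The dataset and model settings in this cell are repeated almost unchanged for the evaluation, export, engine-generation and inference specs. One way to keep them in sync is a small helper, sketched below. `apply_student_settings` is a hypothetical name. The sketch assumes the default specs contain nested dictionaries (not `None`) under `dataset` and `model`, and `num_classes=2` in the demo is only an example value.

```python
def apply_student_settings(specs, num_classes, image_size=416, batch_size=8):
    """Apply the shared student dataset/model settings in place and return specs."""
    dataset = specs.setdefault("dataset", {})
    dataset["batch_size"] = batch_size  # decrease on out-of-memory errors
    dataset["workers"] = 4
    dataset["remap_mscoco_category"] = False
    dataset["num_classes"] = num_classes + 1
    augmentation = dataset.setdefault("augmentation", {})
    augmentation["eval_spatial_size"] = [image_size, image_size]
    augmentation["train_spatial_size"] = [image_size, image_size]

    model = specs.setdefault("model", {})
    model.update({
        "backbone": "convnext_tiny",
        "train_backbone": True,
        "return_interm_indices": [1, 2, 3],
        "dec_layers": 6,
        "enc_layers": 1,
        "num_queries": 300,
    })
    return specs


demo_export_specs = apply_student_settings({"export": {}}, num_classes=2)
demo_export_specs["export"]["input_height"] = 416
demo_export_specs["export"]["input_width"] = 416
print(demo_export_specs["dataset"]["num_classes"])  # 3
```

Action-specific keys, such as the `export` input size, are still set separately after the shared settings are applied.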

In [None]:
# Run action
parent = job_map["distill_" + model_name]  # The parent is now the distillation job, so this exports the distilled model
action = "export"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":specs,
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

job_map["export_distilled_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map["export_distilled_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

### Convert ONNX to TRT Engine 

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/gen_trt_engine/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.status_code in (200, 201)
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))

In [None]:
specs["dataset"]["batch_size"] = 8 #If OOM error, decrease batch size
specs["dataset"]["workers"] = 4
specs["dataset"]["remap_mscoco_category"] = False
specs["dataset"]["num_classes"] = num_classes + 1 
specs["dataset"]["augmentation"]["eval_spatial_size"] = [416, 416]
specs["dataset"]["augmentation"]["train_spatial_size"] = [416, 416]

specs["model"]["backbone"] = "convnext_tiny"
specs["model"]["train_backbone"] = True 
specs["model"]["return_interm_indices"] = [1,2,3] 
specs["model"]["dec_layers"] = 6
specs["model"]["enc_layers"] = 1 
specs["model"]["num_queries"] = 300

specs["gen_trt_engine"]["tensorrt"]["data_type"] = "FP16"

In [None]:
# Run action
parent = job_map["export_distilled_" + model_name]
action = "gen_trt_engine"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":specs,
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

job_map["model_gen_trt_engine_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map['model_gen_trt_engine_' + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    print(response)
    print(json.dumps(response.json(), indent=4))
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

### Inference TRT Engine 

Finally, we can run inference on the test set with our distilled and optimized student model and get the annotated results.

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/inference/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.status_code in (200, 201)
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))

In [None]:
specs["dataset"]["batch_size"] = 1  # One image per batch for inference
specs["dataset"]["workers"] = 4
specs["dataset"]["remap_mscoco_category"] = False 
specs["dataset"]["num_classes"] = num_classes + 1 
specs["dataset"]["augmentation"]["eval_spatial_size"] = [416, 416]
specs["dataset"]["augmentation"]["train_spatial_size"] = [416, 416]

specs["model"]["backbone"] = "convnext_tiny"
specs["model"]["train_backbone"] = True 
specs["model"]["return_interm_indices"] = [1,2,3] 
specs["model"]["dec_layers"] = 6
specs["model"]["enc_layers"] = 1 
specs["model"]["num_queries"] = 300

specs["inference"]["input_height"] = 416 
specs["inference"]["input_width"] = 416
specs["inference"]["outline_width"] = 5
specs["inference"]["color_map"] = {}
specs["inference"]["color_map"]["helmet"] = "red"
specs["inference"]["color_map"]["head"] = "blue"
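
For datasets with more classes, the `color_map` can be built from a list of class names rather than written out by hand. The sketch below uses a hypothetical `build_color_map` helper that cycles through a palette. Only `red` and `blue` appear in this notebook. The other color names are assumptions and may need to be adjusted to whatever the inference renderer accepts.

```python
from itertools import cycle


def build_color_map(class_names, palette=("red", "blue", "green", "yellow")):
    """Assign one palette color per class, reusing colors if classes outnumber them."""
    return dict(zip(class_names, cycle(palette)))


demo_color_map = build_color_map(["helmet", "head"])
print(demo_color_map)  # {'helmet': 'red', 'head': 'blue'}
```

The result can be assigned directly to `specs["inference"]["color_map"]`.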

In [None]:
# Run action
parent = job_map["model_gen_trt_engine_" + model_name]
action = "inference"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":specs,
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

job_map["inference_trt_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map["inference_trt_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

## Clean Up 

You can optionally run this section to delete the datasets and experiment results. 

### Delete experiment <a class="anchor" id="head-23"></a>

In [None]:
endpoint = f"{base_url}/experiments/{experiment_id}"

response = requests.delete(endpoint,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

### Delete train dataset <a class="anchor" id="head-24"></a>

In [None]:
endpoint = f"{base_url}/datasets/{train_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

### Delete val dataset <a class="anchor" id="head-25"></a>

In [None]:
endpoint = f"{base_url}/datasets/{eval_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

### Delete test dataset <a class="anchor" id="head-26"></a>

In [None]:
endpoint = f"{base_url}/datasets/{test_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))