### Notebook to demonstrate TAO Multi-Golden ChangeNet Classification

Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique in which a model trained on one task is retrained for a different task. Train Adapt Optimize (TAO) Toolkit is a simple, easy-to-use Python-based AI toolkit for customizing purpose-built AI models with your own data.

### TAO Multi-Golden ChangeNet Classification with C-RADIO 

This notebook walks through using Multi-Golden ChangeNet Classification with a pretrained C-RADIO backbone for defect classification.

Multi-Golden ChangeNet is a model architecture that uses multiple reference ('golden') samples when detecting whether an object has a defect. It is often used in manufacturing and quality assurance to automatically determine whether parts are manufactured correctly.

C-RADIO is a new foundation model that can be used as a backbone for a variety of tasks. In this example, we show how a pretrained C-RADIO backbone can be used with Multi-Golden ChangeNet.

For more information about the foundation models supported in TAO and the downstream tasks integrated with them, refer to the TAO [Documentation](https://docs.nvidia.com/tao/tao-toolkit/text//foundation_models/overview.html#foundation-models).

<img src="assets/finetuning_workflow_diagram.png" width="600">

| **Foundation Model Architecture** | **Task Head** |
| :-- | :-- |
| NvDINOv2 | Visual ChangeNet |
| C-RADIOv2 | Visual ChangeNet |
| RADIOv2.5 | Visual ChangeNet |

---

### Sample inference from the Visual ChangeNet-Classification model

| **Pass** | **Fail** |
| :--: | :--: |
|<img align="center" title="no defects" src="https://github.com/vpraveen-nv/model_card_images/blob/main/cv/notebook/optical_inspection/pass.png?raw=true" width="400" > no defects | <img align="center" title="missing component" src="https://github.com/vpraveen-nv/model_card_images/blob/main/cv/notebook/optical_inspection/defect.png?raw=true" width="400" >  missing component |


To show this model in action, we will use the MVTec-AD dataset and format it into a classification task that identifies whether a given image is defective or not.

---
### The workflow in a nutshell

1) Configure Connection to TAO FTMS 
2) Login & Create Cloud Workspace
3) Register Train, Validation and Test datasets
4) Train ChangeNet model 
5) Evaluate ChangeNet model 
6) Export model and test inference

---

### Requirements
Prior to running this notebook you must have: 
1) A TAO FTMS server.  [(Setup Guide here)](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#)
2) The sample mvtec-ad dataset from the [mvtec_ad_mgcn_dataset_format.ipynb](https://github.com/NVIDIA/tao_tutorials/blob/main/notebooks/tao_api_starter_kit/dataset_prepare/mvtec_ad_mg_changenet/mvtec_ad_mgcn_dataset_format.ipynb) notebook uploaded to your cloud storage.
3) The `<>`-enclosed placeholders in the Configuration section of the notebook replaced with your own values.


---

### Debugging Finetuning Microservice and Jobs

When working with the TAO API, you may encounter issues at different stages. Use the following guidance to debug effectively:

#### 1. Dataset, Experiment, or Workspace CRUD Operation Errors

If you encounter errors related to creating, reading, updating, or deleting datasets, experiments, or workspaces **and the error messages are not clear**, check the logs of the TAO API service pods:

```bash
kubectl logs -f <pod name starting with tao-api-app-pod>
```

#### 2. Errors During Job Launch

For issues that occur **while launching a job**, check both the app and workflow pods:

```bash
kubectl logs -f <pod name starting with tao-api-app-pod>
kubectl logs -f <pod name starting with tao-api-workflow-pod>
```

#### 3. Errors After Job Launch

If errors occur **after a job has been launched**, inspect the job pod logs:

```bash
kubectl logs -f tao-api-sts-<job_id>-0
```

> **Note:**  
> Run these `kubectl` commands on the machine where your Kubernetes service is deployed.

#### Additional Debugging Tips

- **Job logs are automatically uploaded** to your cloud workspace at:  
  `/results/<job_id>/microservices_log.txt`

- **You can also view logs via the Jobs API endpoint:**  
  ```
  /api/v1/orgs/<org_name>/<experiments|datasets>/<experiment_id|dataset_id>/jobs/<job_id>/logs
  ```

**Summary Table**

| Error Type                | Where to Check Logs                                      |
|-------------------------- |---------------------------------------------------------|
| CRUD operation errors     | `tao-api-app-pod`                                       |
| Job launch errors         | `tao-api-app-pod`, `tao-api-workflow-pod`               |
| Post-launch job errors    | `tao-api-sts-<job_id>-0`                                |
| All job logs (cloud)      | `/results/<job_id>/microservices_log.txt`               |
| All job logs (API)        | `/api/v1/orgs/<org_name>/<experiments|datasets>/<experiment_id|dataset_id>/jobs/<job_id>/logs` |
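
As a convenience, the Jobs API log path from the table can be assembled with a small helper. This is a minimal sketch: `job_logs_url` is a hypothetical helper name (not part of TAO), the identifiers in the example are made up, and the function only formats the path pattern shown above, prefixed with your host URL.

```python
def job_logs_url(host_url, org_name, kind, resource_id, job_id):
    """Format the Jobs API log endpoint from the documented path pattern.

    `kind` must be "experiments" or "datasets". This helper is illustrative
    only and is not part of any TAO client library.
    """
    if kind not in ("experiments", "datasets"):
        raise ValueError(f"kind must be 'experiments' or 'datasets', got {kind!r}")
    return (
        f"{host_url.rstrip('/')}/api/v1/orgs/{org_name}"
        f"/{kind}/{resource_id}/jobs/{job_id}/logs"
    )


# Example with made-up identifiers
example_url = job_logs_url(
    "https://tao.example.com/", "my_org", "experiments", "exp123", "job456"
)
print(example_url)
```

The resulting URL can be passed to `requests.get` together with the `Authorization` header created in the Login section.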

---

### Performance Benchmarks

#### Execution Time Breakdown

The following table shows the approximate time required for each stage of the Multi-Golden ChangeNet classification workflow:

| **Stage** | **Duration** | **Description** |
|-----------|--------------|-----------------|
| Train Dataset Pull | **15min** | Train dataset verification and preprocessing |
| Val/Test Dataset Pull | **3min 30s** | Validation and test dataset verification |
| Model Finetuning | **18hr** | ChangeNet training with C-RADIO backbone (80 epochs) |
| Model Evaluation | **8min 45s** | Performance assessment and metrics calculation |
| ONNX Export | **4min** | Model conversion to ONNX format |
| TensorRT Export | **10min** | TensorRT engine optimization and generation |
| TensorRT Inference | **7min** | High-performance inference testing |
| **Total Time** | **18hr 50min** | **Complete end-to-end workflow** |
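
The total can be cross-checked by summing the stage durations. The sketch below uses only the figures from the table; the sum comes to roughly 18 hr 48 min, which the table rounds to 18 hr 50 min.

```python
# Stage durations from the table above, expressed in minutes
stage_minutes = {
    "Train Dataset Pull": 15,
    "Val/Test Dataset Pull": 3.5,
    "Model Finetuning": 18 * 60,
    "Model Evaluation": 8.75,
    "ONNX Export": 4,
    "TensorRT Export": 10,
    "TensorRT Inference": 7,
}

total_minutes = sum(stage_minutes.values())
hours, minutes = divmod(total_minutes, 60)
print(f"Total: {int(hours)} hr {minutes:.0f} min")

# Fraction of the end-to-end time spent in finetuning
training_share = stage_minutes["Model Finetuning"] / total_minutes
print(f"Finetuning share of total time: {training_share:.1%}")
```

Finetuning accounts for the large majority of the run, so the number of epochs is the main lever for shortening the workflow.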

#### Test Environment Specifications

| **Component** | **Specification** |
|---------------|-------------------|
| **GPU** | 1x NVIDIA A40 |
| **Training Dataset** | 4,000 images (1,010 MB total) |
| **Validation Dataset** | 500 images (125 MB total) |
| **Test Dataset** | 500 images (125 MB total) |
| **Training Epochs** | 80 |
| **Model Architecture** | Multi-Golden ChangeNet with C-RADIO-v2b backbone |
| **Task** | Manufacturing defect detection (2 classes: defective/non-defective) |
| **Dataset** | MVTec-AD (Anomaly Detection) |


#### Performance Factors

> **Important Note:**  
> Actual execution times may vary significantly based on:
> 
> - **Hardware Configuration**: GPU type, memory, and compute capability
> - **Storage Performance**: Local disk I/O speed and cloud storage latency
> - **Network Conditions**: Bandwidth and latency to cloud storage
> - **System Load**: Other concurrent processes and resource utilization
> - **Dataset Size**: Number of images and total data volume
> - **Batch Size**: Larger batch sizes can improve training stability and speed
> - **Model Configuration**: Backbone architecture, epochs, and hyperparameters


In [None]:
import json
import os
import requests
import time
from IPython.display import clear_output

## Configuration 

Fill in all `<>` enclosed variables with relevant values under this section. 
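
Before sending any requests, it can help to confirm that no placeholder was left unfilled. The following is a minimal sketch, not part of the TAO tooling: `find_placeholders` is a hypothetical helper that walks a nested structure and reports string values still containing an uppercase `<NAME>` token, and the sample values are invented.

```python
import re

# Matches tokens such as <HOST_URL> or <BUCKET_NAME>
PLACEHOLDER = re.compile(r"<[A-Z_]+>")


def find_placeholders(value, path=""):
    """Return the paths of string values that still contain a <PLACEHOLDER>."""
    found = []
    if isinstance(value, dict):
        for key, item in value.items():
            child = f"{path}.{key}" if path else str(key)
            found.extend(find_placeholders(item, child))
    elif isinstance(value, (list, tuple)):
        for index, item in enumerate(value):
            found.extend(find_placeholders(item, f"{path}[{index}]"))
    elif isinstance(value, str) and PLACEHOLDER.search(value):
        found.append(path)
    return found


# Example configuration: two values filled in, two left as placeholders
sample_config = {
    "host_url": "https://tao.example.com",
    "ngc_key": "<NGC_API_KEY>",
    "cloud": {"cloud_bucket_name": "<BUCKET_NAME>", "cloud_region": "us-west-2"},
}
unfilled = find_placeholders(sample_config)
print(unfilled)
```

In the notebook, the same check could be applied to a dictionary that collects `host_url`, `ngc_key`, `cloud_metadata`, and the dataset paths once the cells in this section have been run.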

### TAO FTMS Host & Credentials 

In [None]:
# Configure for your TAO FTMS server 
host_url = "<HOST_URL>"
ngc_key = "<NGC_API_KEY>"
ngc_org_name = "<NGC_ORG_NAME>"

docker_env_vars = {}

# If you're using a PTM from a private organization like NVAIE, uncomment the following line and add your legacy NGC API Key.
# docker_env_vars['TAO_API_KEY'] = '<NGC_LEGACY_API_KEY>' #Set to NGC Legacy API Key 

### Cloud Storage Setup & Credentials 

In [None]:
# Cloud bucket details to access datasets and store experiment results
cloud_metadata = {}
cloud_metadata["name"] = "tao_workspace"  # A representative name for this workspace
cloud_metadata["cloud_type"] = "aws"  # Cloud provider type: AWS, HuggingFace, or Azure
cloud_metadata["cloud_specific_details"] = {}
cloud_metadata["cloud_specific_details"]["cloud_region"] ="<BUCKET_REGION>"  # Bucket region
cloud_metadata["cloud_specific_details"]["cloud_bucket_name"] = "<BUCKET_NAME>"  # Bucket name
# Access and Secret for AWS
cloud_metadata["cloud_specific_details"]["access_key"] = "<ACCESS_KEY>"
cloud_metadata["cloud_specific_details"]["secret_key"] = "<SECRET_KEY>"

### Dataset Paths in Cloud Storage 

In [None]:
#FIX ME - adjust paths to point to datasets in your cloud storage. If using S3 Bucket do not include the bucket name in the path. 
train_dataset_path =  "/<PATH_TO_DATASET_IN_CLOUD_STORAGE>/mvtec_ad_mgcn/train"
eval_dataset_path = "/<PATH_TO_DATASET_IN_CLOUD_STORAGE>/mvtec_ad_mgcn/val"
test_dataset_path = "/<PATH_TO_DATASET_IN_CLOUD_STORAGE>/mvtec_ad_mgcn/test"

### Model Configuration

In [None]:
model_name = "visual_changenet_classify"
ds_type = "visual_changenet_classify"
ds_format = "default"
num_classes = 2

## Login <a class="anchor" id="head-2"></a>

In [None]:
# Validate NGC_PERSONAL_KEY
data = json.dumps({"ngc_org_name": ngc_org_name,
                   "ngc_key": ngc_key,
                   "enable_telemetry": True})
response = requests.post(f"{host_url}/api/v1/login", data=data)
assert response.ok, response.text
assert "token" in response.json().keys()
token = response.json()["token"]
print("JWT",token)

# Set base URL
base_url = f"{host_url}/api/v1/orgs/{ngc_org_name}"
print("API Calls will be forwarded to",base_url)

headers = {"Authorization": f"Bearer {token}"}

## Create cloud workspace
This workspace will be the place where your datasets reside and results of TAO FTMS jobs will be pushed to.

In [None]:
# Create cloud workspace
data = json.dumps(cloud_metadata)

endpoint = f"{base_url}/workspaces"

response = requests.post(endpoint,data=data,headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), indent=4))

assert "id" in response.json().keys()
workspace_id = response.json()["id"]

## Register datasets with FTMS 

TAO FTMS requires datasets in your cloud storage to be registered to produce a unique ID that can be attached to training jobs. This step only needs to be done once and then you can use the dataset across any experiments that support the dataset format.
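
The three registration cells in this section send the same request body with different paths, purposes, and names. A minimal sketch of that body as a reusable function is shown here; `make_dataset_payload` is a hypothetical helper name, and the workspace ID and path in the example are made up.

```python
import json


def make_dataset_payload(ds_type, ds_format, workspace_id, cloud_file_path, use_for, name):
    """Build the JSON body used to register one dataset with FTMS."""
    return json.dumps({
        "type": ds_type,
        "format": ds_format,
        "workspace": workspace_id,
        "cloud_file_path": cloud_file_path,
        "use_for": [use_for],  # the API expects a list, e.g. ["training"]
        "name": name,
    })


# Example with a made-up workspace ID and dataset path
payload = make_dataset_payload(
    "visual_changenet_classify", "default", "ws-123",
    "/data/mvtec_ad_mgcn/train", "training", "mvtec_mgcn_train",
)
print(payload)
```

The returned string can be posted to the `/datasets` endpoint in the same way as the cells below.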

### List Registered Datasets

In [None]:
endpoint = f"{base_url}/datasets"

response = requests.get(endpoint, headers=headers)
assert response.ok, response.text

datasets = response.json()["datasets"]
for rsp in datasets:
    rsp_keys = rsp.keys()
    assert "id" in rsp_keys
    assert "type" in rsp_keys
    assert "format" in rsp_keys
    assert "name" in rsp_keys

print(response)
# print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose list output
print("id\t\t\t\t\t type\t\t\t format\t\t name")
for rsp in datasets:
    print(rsp["id"],"\t",rsp["type"],"\t",rsp["format"],"\t\t",rsp["name"])

If you have already registered the datasets with TAO FTMS, then fill in the following cell with the dataset IDs to avoid creating duplicate datasets. 
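
Instead of copying IDs by hand, you could look them up by name in the list returned above. This is a sketch under two assumptions: `find_dataset_id` is a hypothetical helper, and dataset names are treated as unique, which FTMS may not enforce. The sample entries are invented and only mirror the `id`, `type`, `format`, and `name` keys checked in the listing cell.

```python
def find_dataset_id(datasets, name):
    """Return the id of the first registered dataset with the given name, or None."""
    for entry in datasets:
        if entry.get("name") == name:
            return entry.get("id")
    return None


# Invented entries shaped like the items in response.json()["datasets"]
registered = [
    {"id": "aaa-111", "type": "visual_changenet_classify", "format": "default", "name": "mvtec_mgcn_train"},
    {"id": "bbb-222", "type": "visual_changenet_classify", "format": "default", "name": "mvtec_mgcn_val"},
]

train_id = find_dataset_id(registered, "mvtec_mgcn_train")  # found
test_id = find_dataset_id(registered, "mvtec_mgcn_test")    # not registered yet
print(train_id, test_id)
```

A result of `None` leaves the corresponding variable unset, so the creation cell for that dataset still runs.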

In [None]:
train_dataset_id = None 
eval_dataset_id = None 
test_dataset_id = None 

### Train Dataset 

In [None]:
# Create train dataset
if train_dataset_id is None: 
    train_dataset_metadata = {"type": ds_type,
                              "format": ds_format,
                              "workspace":workspace_id,
                              "cloud_file_path": train_dataset_path,
                              "use_for": ["training"],
                              "name": "mvtec_mgcn_train"
                              }
    data = json.dumps(train_dataset_metadata)
    
    endpoint = f"{base_url}/datasets"
    
    response = requests.post(endpoint,data=data,headers=headers)
    assert response.ok, response.text
    assert "id" in response.json().keys()
    
    print(response)
    print(json.dumps(response.json(), indent=4))
    train_dataset_id = response.json()["id"]

In [None]:
# Check progress
endpoint = f"{base_url}/datasets/{train_dataset_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.ok, response.text

    if response.json().get("status") == "invalid_pull":
        raise ValueError("Dataset pull failed")
    if response.json().get("status") == "pull_complete":
        break
    time.sleep(5)

### Validation Dataset

In [None]:
# Create eval dataset
if eval_dataset_id is None: 
    eval_dataset_metadata = {"type": ds_type,
                             "format": ds_format,
                             "workspace":workspace_id,
                             "cloud_file_path": eval_dataset_path,
                             "use_for": ["evaluation"],
                             "name" : "mvtec_mgcn_val"
                             }
    data = json.dumps(eval_dataset_metadata)
    
    endpoint = f"{base_url}/datasets"
    
    response = requests.post(endpoint,data=data,headers=headers)
    assert response.ok, response.text
    assert "id" in response.json().keys()
    
    print(response)
    print(json.dumps(response.json(), indent=4))
    eval_dataset_id = response.json()["id"]

In [None]:
# Check progress
endpoint = f"{base_url}/datasets/{eval_dataset_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.ok, response.text

    print(response)
    print(json.dumps(response.json(), indent=4))
    if response.json().get("status") == "invalid_pull":
        raise ValueError("Dataset pull failed")
    if response.json().get("status") == "pull_complete":
        break
    time.sleep(5)

### Test Dataset

In [None]:
# Create testing dataset for inference
if test_dataset_id is None: 
    test_dataset_metadata = {"type": ds_type,
                             "format":ds_format,
                             "workspace":workspace_id,
                             "cloud_file_path": test_dataset_path,
                             "use_for": ["testing"],
                             "name": "mvtec_mgcn_test"
                             }
    data = json.dumps(test_dataset_metadata)
    
    endpoint = f"{base_url}/datasets"
    
    response = requests.post(endpoint,data=data,headers=headers)
    assert response.ok, response.text
    assert "id" in response.json().keys()
    
    print(response)
    print(json.dumps(response.json(), indent=4))
    test_dataset_id = response.json()["id"]

In [None]:
# Check progress
endpoint = f"{base_url}/datasets/{test_dataset_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.ok, response.text

    print(response)
    print(json.dumps(response.json(), indent=4))
    if response.json().get("status") == "invalid_pull":
        raise ValueError("Dataset pull failed")
    if response.json().get("status") == "pull_complete":
        break
    time.sleep(5)

## Create an experiment <a class="anchor" id="head-9"></a>

Before we can run any jobs such as training, evaluation, or inference, we must create an experiment to set up the network architecture and associated datasets. Then we can chain several jobs together to create our trained models.

In [None]:
encode_key = "tlt_encode"
checkpoint_choose_method = "best_model"
data = json.dumps({"network_arch":model_name,
                   "encryption_key":encode_key,
                   "checkpoint_choose_method":checkpoint_choose_method,
                   "workspace": workspace_id,
                   "train_datasets":[train_dataset_id],
                   "eval_dataset":eval_dataset_id,
                   "inference_dataset":test_dataset_id,
                   "calibration_dataset":train_dataset_id,
                   "docker_env_vars": docker_env_vars})

endpoint = f"{base_url}/experiments"

response = requests.post(endpoint,data=data,headers=headers)
assert response.ok, response.text
assert "id" in response.json()

print(response)
print(json.dumps(response.json(), indent=4))
experiment_id = response.json()["id"]

When a job is submitted, we will receive a unique ID to reference back to it. We will store these IDs in the following ```job_map``` variable. 

In [None]:
job_map = {}

## Assign Pretrained Model

To help bootstrap the model, we can initialize it with pretrained weights that have already seen a large number of images. This helps reduce the data and time required to finetune your model. Several pretrained models (PTMs) are available from NGC, and TAO FTMS automatically pulls the PTMs that are available to use.
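
The cells in this section select a PTM by matching the end of its `ngc_path`. The sketch below shows the same idea with an explicit failure when nothing matches, so that an empty selection is not silently attached to the experiment. `select_ptm` is a hypothetical helper and the NGC paths in the example are invented.

```python
def select_ptm(experiments, ngc_path_suffix):
    """Return the id of the first base experiment whose ngc_path ends with the suffix."""
    for entry in experiments:
        if entry.get("ngc_path", "").endswith(ngc_path_suffix):
            return entry["id"]
    raise LookupError(f"No pretrained model with ngc_path ending in {ngc_path_suffix!r}")


# Invented entries shaped like the items in response.json()["experiments"]
available = [
    {"id": "ptm-1", "name": "backbone_a", "ngc_path": "example_org/example_team/backbone_a:v1"},
    {"id": "ptm-2", "name": "backbone_b", "ngc_path": "example_org/example_team/backbone_b:v2"},
]

ptm_id_demo = select_ptm(available, "backbone_b:v2")
print(ptm_id_demo)
```

Raising on a missing match surfaces a mistyped suffix immediately rather than at training time.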

In [None]:
# List all pretrained models for the chosen network architecture
endpoint = f"{base_url}/experiments:base"
params = {"network_arch": model_name}
response = requests.get(endpoint, params=params, headers=headers)
assert response.ok, response.text

response_json = response.json()["experiments"]

for rsp in response_json:
    rsp_keys = rsp.keys()
    if "encryption_key" not in rsp.keys():
        assert "name" in rsp_keys and "version" in rsp_keys and "ngc_path" in rsp_keys
        print(f'PTM Name: {rsp["name"]}; PTM version: {rsp["version"]}; NGC PATH: {rsp["ngc_path"]}')

In [None]:
pretrained_map = {"visual_changenet_classify" : "crdaiov2_vit:c_radio_v2_base"}   
endpoint = f"{base_url}/experiments:base"
params = {"network_arch": model_name}
response = requests.get(endpoint, params=params, headers=headers)
assert response.ok, response.text

response_json = response.json()["experiments"]

# Search for ptm with given ngc path
ptm = []
for rsp in response_json:
    rsp_keys = rsp.keys()
    assert "ngc_path" in rsp_keys
    if rsp["ngc_path"].endswith(pretrained_map[model_name]):
        assert "id" in rsp_keys
        ptm_id = rsp["id"]
        ptm = [ptm_id]
        print("Metadata for model with requested NGC Path")
        print(rsp)
        break

In [None]:
ptm_information = {"base_experiment":ptm}
data = json.dumps(ptm_information)

endpoint = f"{base_url}/experiments/{experiment_id}"

response = requests.patch(endpoint, data=data, headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), indent=4))

## Train Model

### Retrieve default training spec

Before launching a training job, we can retrieve the default training spec for our network architecture and use it as a starting point for configuring the training parameters.
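
The schema request can return a 404 while the base experiment spec file is still being downloaded, so the notebook retries until that message disappears. The sketch below captures this retry pattern in one function. It is an illustration rather than the notebook's own code: `fetch_default_spec` is a hypothetical name, the `max_attempts` limit is an addition, and `FakeResponse` with its message text stands in for a real `requests` response so the example runs offline.

```python
import time


def fetch_default_spec(get_response, poll_seconds=2, max_attempts=30):
    """Poll until the schema endpoint stops reporting that the base spec is downloading.

    `get_response` is any zero-argument callable returning an object with
    `status_code` and `json()`, for example
    `lambda: requests.get(endpoint, headers=headers)`.
    """
    for _ in range(max_attempts):
        response = get_response()
        body = response.json()
        still_downloading = (
            response.status_code == 404
            and "Base spec file download state is " in body.get("error_desc", "")
        )
        if not still_downloading:
            if not 200 <= response.status_code < 300:
                raise RuntimeError(f"Schema request failed: {body}")
            return body["default"]
        time.sleep(poll_seconds)
    raise TimeoutError("Base spec file was still downloading after max_attempts polls")


class FakeResponse:
    """Offline stand-in for a requests.Response (demo only)."""

    def __init__(self, status_code, body):
        self.status_code = status_code
        self._body = body

    def json(self):
        return self._body


# Simulated server: first reply says the spec is downloading, second returns it
replies = iter([
    FakeResponse(404, {"error_desc": "Base spec file download state is in_progress"}),
    FakeResponse(200, {"default": {"train": {"num_epochs": 10}}}),
])
specs_demo = fetch_default_spec(lambda: next(replies), poll_seconds=0)
print(specs_demo)
```

The same function would apply to the `evaluate`, `export`, and `gen_trt_engine` schema endpoints, which use an identical retry loop.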

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/train/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.ok, response.text
assert "default" in response.json().keys()

print(response)
full_schema = response.json()  # Kept so the schema can be explored with get_bounds_of_field
# print(json.dumps(full_schema, indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))

### Configure Training Spec

Now we can customize the JSON spec object and set our dataset, training and model parameters. The training specification defines all the hyperparameters, model architecture settings, and training configurations that will be used during the training process.

#### Understanding the Spec Schema Structure

Before customizing the training spec, it's helpful to understand the available configuration options and their valid values. The `get_bounds_of_field` utility function helps you explore the schema structure and find valid parameter values when they are defined on the backend (some fields may not define any).

**How to use the schema exploration utility:**

```python
from get_bounds import get_bounds_of_field

# Example: Get valid values for backbone type
# Note the conversion from dictionary access notation to list notation:
# From: ["model"]["backbone"]["type"] 
# To:   ["model", "backbone", "type"]
print(get_bounds_of_field(full_schema, ["model", "backbone", "type"]))
```

**Key differences in notation:**
- **Dictionary access format**: `["model"]["backbone"]["type"]` - This is how you would access nested values in Python
- **Schema path format**: `["model", "backbone", "type"]` - This is the format expected by the `get_bounds_of_field` function

---

#### Schema Navigation Tips

1. **Nested Structure**: The schema follows a hierarchical structure where each level represents a configuration category
2. **Path Format**: Always use a list of strings `["level1", "level2", "level3"]` when calling `get_bounds_of_field`
3. **Validation**: This helps ensure you're using valid parameter values before submitting your training job
4. **Documentation**: Use these explorations to understand what options are available for your specific model architecture
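
To make the relationship between the two notations concrete, the sketch below reads and writes a nested dictionary using a list-of-keys path. It is not the implementation of `get_bounds_of_field`; `get_by_path` and `set_by_path` are hypothetical helpers, and `demo_specs` is a tiny stand-in for the real spec.

```python
from functools import reduce


def get_by_path(mapping, path):
    """Follow a list of keys into a nested dictionary and return the value."""
    return reduce(lambda node, key: node[key], path, mapping)


def set_by_path(mapping, path, value):
    """Set the value at a list-of-keys path; all parent keys must already exist."""
    parent = get_by_path(mapping, path[:-1])
    parent[path[-1]] = value


# Tiny stand-in for the spec dictionary
demo_specs = {"model": {"backbone": {"type": "placeholder_backbone"}}}
path = ["model", "backbone", "type"]

# The list path addresses the same value as chained dictionary access
assert get_by_path(demo_specs, path) == demo_specs["model"]["backbone"]["type"]

set_by_path(demo_specs, path, "c_radio_v2_vit_base_patch16_224")
print(demo_specs)
```

Because a missing key raises `KeyError`, a misspelled path is caught before the spec is submitted.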


No changes to the default spec are needed unless you are using a custom dataset format or want to modify the model architecture and training parameters.
For more details about the configurable parameters, refer to the [VisualChangeNet](https://docs.nvidia.com/tao/tao-toolkit/text/cv_finetuning/pytorch/visual_changenet/visual_changenet_classify.html#creating-a-training-experiment-spec-file) documentation in TAO.

In [None]:
# Customize train model specs
specs["train"]["num_epochs"] = 80
specs["train"]["checkpoint_interval"] = 20
specs["train"]["validation_interval"] = 5
specs["dataset"]["classify"]["batch_size"]= 8
specs["model"]["classify"]["eval_margin"]= 0.3
specs["model"]["classify"]["embed_dec"] = 30
specs["model"]["classify"]["difference_module"] = "learnable"
specs["model"]["classify"]["learnable_difference_modules"] = 4
specs["model"]["decode_head"]["use_summary_token"] = True
specs["model"]["decode_head"]["feature_strides"] = [4,8,16,32]
specs["model"]["backbone"]["type"] = "c_radio_v2_vit_base_patch16_224"
specs["train"]["classify"]["loss"] = "ce"
specs["train"]["optim"]["lr"] = 0.001
specs["dataset"]["classify"]["input_map"] = {}
specs["dataset"]["classify"]["image_ext"] = ".png"
specs["dataset"]["classify"]["num_input"] = 1
specs["dataset"]["classify"]["num_golden"] = 4
specs["dataset"]["classify"]["augmentation_config"]["augment"] = True
specs["dataset"]["classify"]["augmentation_config"]["random_rotate"]["enable"] = False 
specs["task"] = "classify"
print(json.dumps(specs, sort_keys=True, indent=4))

### Submit Training Job  

With our training spec configured, we can now submit a training job. All jobs follow the same flow: retrieve the default spec, customize it, then submit the job.

In [None]:
# Run action
action = "train"
data = json.dumps({"parent_job_id":None,"action":action,"specs":specs,
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.ok, response.text
assert response.json()

print(response)
print(json.dumps(response.json(), indent=4))

job_map["train_" + model_name] = response.json()
print(job_map)

After submitting the training job, an ID is returned that we can use to monitor the job progress. The following cell will continuously print the latest status until the job is complete. This notebook will track all of the job IDs in the ```job_map``` variable. 
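
The monitoring cells all stop on the same set of terminal statuses. A minimal sketch of that loop as a reusable function follows. `wait_for_job` is a hypothetical helper, the `max_polls` limit is an addition, and the intermediate statuses `"Pending"` and `"Running"` in the simulation are assumptions used only to drive the example.

```python
import time

# Terminal statuses checked by the monitoring cells in this notebook
TERMINAL_STATUSES = ("Done", "Error", "Canceled", "Paused")


def wait_for_job(fetch_status, poll_seconds=15, max_polls=None):
    """Call `fetch_status()` until it returns a terminal status, then return it.

    `fetch_status` is any zero-argument callable returning the status string,
    for example a function that GETs the job endpoint and reads "status".
    """
    polls = 0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"Job still in status {status!r} after {polls} polls")
        time.sleep(poll_seconds)


# Simulated status sequence (intermediate status names are assumed)
statuses = iter(["Pending", "Running", "Running", "Done"])
final_status = wait_for_job(lambda: next(statuses), poll_seconds=0)
print(final_status)
```

Passing the status source as a callable keeps the loop independent of `requests`, which also makes it easy to test without a server.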

In [None]:
job_id = job_map["train_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    if "error_desc" in response.json().keys() and response.json()["error_desc"] in ("Job trying to retrieve not found", "No AutoML run found"):
        print("Job is being created")
        time.sleep(5)
        continue
    print(response)
    print(json.dumps(response.json(), sort_keys=True, indent=4))
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

If you need to cancel the job for any reason, you can uncomment and run the following cell. You can also configure the endpoint to end with ```:pause``` or ```:resume``` instead of ```:cancel``` to temporarily stop and start the job. 
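
The three control endpoints differ only in their suffix, so they can be generated from one function. This is a small sketch with a hypothetical helper name, `job_control_url`, and made-up IDs; it restricts the action to the three suffixes mentioned above.

```python
def job_control_url(base_url, experiment_id, job_id, action):
    """Build the endpoint for cancelling, pausing, or resuming a job."""
    if action not in ("cancel", "pause", "resume"):
        raise ValueError(f"Unsupported action: {action!r}")
    return f"{base_url}/experiments/{experiment_id}/jobs/{job_id}:{action}"


# Example with made-up IDs
pause_url = job_control_url(
    "https://tao.example.com/api/v1/orgs/my_org", "exp123", "job456", "pause"
)
print(pause_url)
```

The URL is then used with `requests.post(..., headers=headers)`, as in the commented cell below.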

In [None]:
# job_id = job_map["train_" + model_name]
# endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}:cancel"

# response = requests.post(endpoint, headers=headers)

# print(response)
# print(json.dumps(response.json(), indent=4))

If the job runs into any errors or if you want to check the job logs, you can uncomment and run the following cell to view them. Alternatively, you can view the job logs in your cloud workspace at `/results/<job_id>/microservices_log.txt`.

In [None]:
# job_id = job_map["train_" + model_name]
# endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}/logs"

# response = requests.get(endpoint, headers=headers)
# print(response.text)

## Evaluate

Once our model has been trained, we can evaluate it on the test dataset to get accuracy KPIs.

### Retrieve default evaluation spec

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/evaluate/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.ok, response.text
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))

### Customize Evaluation Spec 

In [None]:
specs["dataset"]["classify"]["batch_size"]= 8
specs["model"]["classify"]["eval_margin"]= 0.3
specs["model"]["classify"]["embed_dec"] = 30
specs["model"]["classify"]["difference_module"] = "learnable"
specs["model"]["classify"]["learnable_difference_modules"] = 4
specs["model"]["decode_head"]["use_summary_token"] = True
specs["model"]["decode_head"]["feature_strides"] = [4,8,16,32]
specs["model"]["backbone"]["type"] = "c_radio_v2_vit_base_patch16_224"
specs["train"]["classify"]["loss"] = "ce"

specs["dataset"]["classify"]["input_map"] = {}
specs["dataset"]["classify"]["image_ext"] = ".png"
specs["dataset"]["classify"]["num_input"] = 1
specs["dataset"]["classify"]["num_golden"] = 4
specs["dataset"]["classify"]["augmentation_config"]["augment"] = False 
specs["task"] = "classify"
print(json.dumps(specs, sort_keys=True, indent=4))

### Submit Evaluation Job 

Note that for this job we will set the ```parent_job_id``` parameter in the body of the request to the completed training job. This is required to pass the trained model from the training job into our evaluation job. 
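
The parent relationships used in this notebook can be written down once and reused when building request bodies. The sketch below is illustrative: `PARENT_KEY_PREFIX` and `make_job_payload` are hypothetical names, the mapping reflects the `job_map` keys used in this notebook, and the job ID is made up.

```python
import json

# action -> prefix of the job_map key that holds its parent job ID
PARENT_KEY_PREFIX = {
    "train": None,                        # training has no parent job
    "evaluate": "train_",                 # evaluates the trained model
    "export": "train_",                   # exports the trained model to ONNX
    "gen_trt_engine": "export_trained_",  # builds the engine from the ONNX export
}


def make_job_payload(action, specs, job_map, model_name):
    """Build the JSON body for a job, resolving its parent from job_map."""
    prefix = PARENT_KEY_PREFIX[action]
    parent = None if prefix is None else job_map[prefix + model_name]
    return json.dumps({"parent_job_id": parent, "action": action, "specs": specs})


# Example with a made-up training job ID
demo_job_map = {"train_visual_changenet_classify": "job-train-001"}
eval_payload = make_job_payload(
    "evaluate", {"task": "classify"}, demo_job_map, "visual_changenet_classify"
)
print(eval_payload)
```

Looking up the parent through `job_map` raises `KeyError` if the prerequisite job has not been submitted yet, which catches cells run out of order.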

In [None]:
# Run action
parent = job_map["train_" + model_name]
action = "evaluate"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":specs,
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), indent=4))

job_map["evaluate_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by running this cell
job_id = job_map["evaluate_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.ok, response.text
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    print(response)
    print(json.dumps(response.json(), indent=4))
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

## TRT Engine Generation 

Now that we have a finetuned model, it can be exported to ONNX format and then converted into an optimized TensorRT engine for deployment.

### Export Model to ONNX 

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/export/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.ok, response.text
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))

In [None]:
specs["dataset"]["classify"]["batch_size"]= 8
specs["model"]["classify"]["eval_margin"]= 0.3
specs["model"]["classify"]["embed_dec"] = 30
specs["model"]["classify"]["difference_module"] = "learnable"
specs["model"]["classify"]["learnable_difference_modules"] = 4
specs["model"]["decode_head"]["use_summary_token"] = True
specs["model"]["decode_head"]["feature_strides"] = [4,8,16,32]
specs["model"]["backbone"]["type"] = "c_radio_v2_vit_base_patch16_224"
specs["train"]["classify"]["loss"] = "ce"

specs["dataset"]["classify"]["input_map"] = {}
specs["dataset"]["classify"]["image_ext"] = ".png"
specs["dataset"]["classify"]["num_input"] = 1
specs["dataset"]["classify"]["num_golden"] = 4
specs["dataset"]["classify"]["augmentation_config"]["augment"] = False 
specs["task"] = "classify"

specs["export"]["input_channel"] = 3 
specs["export"]["input_width"] = 224
specs["export"]["input_height"] = 224 
specs["export"]["batch_size"] = -1 

print(json.dumps(specs, sort_keys=True, indent=4))

In [None]:
# Run action
parent = job_map["train_" + model_name] #parent is now training job. This will export the trained model.  
action = "export"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":specs,
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), indent=4))

job_map["export_trained_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map["export_trained_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.ok, response.text
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

### Convert ONNX to TRT Engine 

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/gen_trt_engine/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.ok, response.text
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))
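
The schema request above retries without limit while the base spec file is downloading. A bounded variant might look like the following sketch. `fetch_default_spec` and `get_schema` are hypothetical names, and `get_schema` is assumed to return a `(status_code, body_dict)` pair; only the 404 check and the `error_desc` message come from the cell above.

```python
import time


def fetch_default_spec(get_schema, max_attempts=30, delay_seconds=2, sleep=time.sleep):
    """Return the "default" spec, retrying while the base spec file is downloading.

    get_schema is a hypothetical callable returning (status_code, body_dict).
    """
    for _ in range(max_attempts):
        status_code, body = get_schema()
        downloading = (
            status_code == 404
            and "Base spec file download state is " in body.get("error_desc", "")
        )
        if not downloading:
            break
        sleep(delay_seconds)
    else:
        # The loop finished without a break: every attempt saw the download message.
        raise TimeoutError("base spec file was still downloading after all attempts")
    if status_code >= 400 or "default" not in body:
        raise RuntimeError(f"schema request failed: {status_code} {body}")
    return body["default"]
```

In this notebook, `get_schema` could wrap the existing call, for example a small function that runs `requests.get(endpoint, headers=headers)` and returns `response.status_code, response.json()`.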

In [None]:
specs["dataset"]["classify"]["batch_size"] = 8
specs["model"]["classify"]["eval_margin"] = 0.3
specs["model"]["classify"]["embed_dec"] = 30
specs["model"]["classify"]["difference_module"] = "learnable"
specs["model"]["classify"]["learnable_difference_modules"] = 4
specs["model"]["decode_head"]["use_summary_token"] = True
specs["model"]["decode_head"]["feature_strides"] = [4,8,16,32]
specs["model"]["backbone"]["type"] = "c_radio_v2_vit_base_patch16_224"
specs["train"]["classify"]["loss"] = "ce"

specs["dataset"]["classify"]["input_map"] = {}
specs["dataset"]["classify"]["image_ext"] = ".png"
specs["dataset"]["classify"]["num_input"] = 1
specs["dataset"]["classify"]["num_golden"] = 4
specs["dataset"]["classify"]["augmentation_config"]["augment"] = False 
specs["task"] = "classify"

specs["gen_trt_engine"]["tensorrt"]["data_type"] = "FP16"
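
The export, TensorRT engine, and inference cells in this notebook set the same model and dataset overrides. One way to keep them in sync is to hold the shared values in a single mapping and apply it to each default spec. This is only a sketch: `apply_overrides` and `COMMON_OVERRIDES` are hypothetical names, and unlike the direct assignments above, the helper creates missing intermediate keys instead of raising `KeyError`.

```python
def apply_overrides(specs, overrides):
    """Set nested keys in specs from a {dotted.path: value} mapping, in place."""
    for dotted_path, value in overrides.items():
        *parents, leaf = dotted_path.split(".")
        node = specs
        for key in parents:
            # Assumption: create missing levels rather than fail on them.
            node = node.setdefault(key, {})
        node[leaf] = value
    return specs


# Values copied from the spec cells in this notebook.
COMMON_OVERRIDES = {
    "model.classify.eval_margin": 0.3,
    "model.classify.embed_dec": 30,
    "model.classify.difference_module": "learnable",
    "model.classify.learnable_difference_modules": 4,
    "model.decode_head.use_summary_token": True,
    "model.decode_head.feature_strides": [4, 8, 16, 32],
    "model.backbone.type": "c_radio_v2_vit_base_patch16_224",
    "train.classify.loss": "ce",
    "dataset.classify.input_map": {},
    "dataset.classify.image_ext": ".png",
    "dataset.classify.num_input": 1,
    "dataset.classify.num_golden": 4,
    "dataset.classify.augmentation_config.augment": False,
    "task": "classify",
}
```

Each action would then call `apply_overrides(specs, COMMON_OVERRIDES)` and set only its own keys afterwards, such as the batch size or the TensorRT data type.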

In [None]:
# Run action
parent = job_map["export_trained_" + model_name]
action = "gen_trt_engine"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":specs,
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), indent=4))

job_map["model_gen_trt_engine_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status: this cell polls every 15 seconds until the job reaches a terminal state
job_id = job_map['model_gen_trt_engine_' + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.ok, response.text
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    print(response)
    print(json.dumps(response.json(), indent=4))
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

### Inference TRT Engine

Finally, we can use the optimized TensorRT engine to run inference on the test set and retrieve the annotated results.

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/inference/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.ok, response.text
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))

In [None]:
specs["dataset"]["classify"]["batch_size"] = 1
specs["model"]["classify"]["eval_margin"] = 0.3
specs["model"]["classify"]["embed_dec"] = 30
specs["model"]["classify"]["difference_module"] = "learnable"
specs["model"]["classify"]["learnable_difference_modules"] = 4
specs["model"]["decode_head"]["use_summary_token"] = True
specs["model"]["decode_head"]["feature_strides"] = [4,8,16,32]
specs["model"]["backbone"]["type"] = "c_radio_v2_vit_base_patch16_224"
specs["train"]["classify"]["loss"] = "ce"

specs["dataset"]["classify"]["input_map"] = {}
specs["dataset"]["classify"]["image_ext"] = ".png"
specs["dataset"]["classify"]["num_input"] = 1
specs["dataset"]["classify"]["num_golden"] = 4
specs["dataset"]["classify"]["augmentation_config"]["augment"] = False 
specs["task"] = "classify"

specs["inference"]["batch_size"] = 1 

In [None]:
# Run action
parent = job_map["model_gen_trt_engine_" + model_name]
action = "inference"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":specs,
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), indent=4))

job_map["inference_trt_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status: this cell polls every 15 seconds until the job reaches a terminal state
job_id = job_map["inference_trt_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.ok, response.text
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

## Clean Up 

You can optionally run this section to delete the datasets and experiment results. 

### Delete experiment <a class="anchor" id="head-23"></a>

In [None]:
endpoint = f"{base_url}/experiments/{experiment_id}"

response = requests.delete(endpoint, headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), indent=4))

### Delete train dataset <a class="anchor" id="head-24"></a>

In [None]:
endpoint = f"{base_url}/datasets/{train_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), indent=4))

### Delete val dataset <a class="anchor" id="head-24"></a>

In [None]:
endpoint = f"{base_url}/datasets/{eval_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), indent=4))

### Delete test dataset <a class="anchor" id="head-24"></a>

In [None]:
endpoint = f"{base_url}/datasets/{test_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.ok, response.text

print(response)
print(json.dumps(response.json(), indent=4))
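
The four delete cells above differ only in the resource path, so they could also be run as one loop that records failures instead of stopping at the first failed assertion. The sketch below assumes a hypothetical `delete` callable that takes a path relative to `base_url` and returns an `(ok, detail)` pair; `delete_resources` is likewise an illustrative name, not a TAO API function.

```python
def delete_resources(delete, paths):
    """Call delete(path) for every path and return a {path: detail} map of failures."""
    failures = {}
    for path in paths:
        ok, detail = delete(path)
        if not ok:
            failures[path] = detail
    return failures
```

In this notebook, `delete` could wrap `requests.delete(f"{base_url}/{path}", headers=headers)` and return `response.ok, response.text`, with `paths` built from `experiment_id`, `train_dataset_id`, `eval_dataset_id`, and `test_dataset_id`. An empty result means every resource was deleted.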