### Notebook to demonstrate TAO SSL MAE Pretraining and Finetuning through FTMS 

Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique in which a model trained on one task is retrained for a different task. Train Adapt Optimize (TAO) Toolkit is a simple, easy-to-use, Python-based AI toolkit for customizing purpose-built AI models with your own data.

### TAO Self-Supervised Learning with Masked Autoencoders

This notebook demonstrates how to use Self-Supervised Learning (SSL) with Masked Autoencoders (MAE) for model pretraining, followed by finetuning on downstream classification tasks. This two-stage approach helps improve model performance, particularly in scenarios where labeled data is limited but a large volume of unlabeled images is available.

Self-Supervised Learning enables a model to learn useful representations from unlabeled data, making it highly effective for domain adaptation. By pretraining on tens or even hundreds of thousands of domain-specific, unlabeled images, the model develops an internal understanding of visual patterns relevant to your use case. This self-learned knowledge becomes a valuable foundation for improving performance on downstream tasks like classification.

<img src="assets/ssl_mae_workflow_diagram.png" width="800"/>

#### Architecture diagram for MAE

<img align="center" src="sample_images/mask_auto_encoder.png" width="640">

The process consists of two main stages:

**Pretraining Stage**: In this stage, the model is trained using a Masked Autoencoder, which hides (or "masks") parts of each input image and trains the model to reconstruct the missing regions. This forces the model to learn meaningful visual features from the data without requiring any labels.

**Finetuning Stage**: After pretraining, the model can be finetuned using a smaller set of labeled data. Since the model has already learned generalized visual features, it requires fewer labeled examples to adapt to specific tasks such as classification.
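
The masking step of the pretraining stage can be sketched in a few lines. This is only an illustration of the idea, not TAO's implementation: the 14x14 patch grid and the 0.75 mask ratio are assumed values chosen for the example, and `random_patch_mask` is a hypothetical helper name.

```python
import random


def random_patch_mask(grid_h, grid_w, mask_ratio=0.75, seed=None):
    """Return a grid_h x grid_w grid of booleans; True marks a hidden patch."""
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(num_patches * mask_ratio)
    # Pick which patch indices are hidden from the encoder.
    masked = set(rng.sample(range(num_patches), num_masked))
    return [[(row * grid_w + col) in masked for col in range(grid_w)]
            for row in range(grid_h)]


# Illustrative values only: a 14x14 grid of patches with 75% of them hidden.
mask = random_patch_mask(14, 14, mask_ratio=0.75, seed=0)
hidden = sum(cell for row in mask for cell in row)
print(f"{hidden} of {14 * 14} patches hidden")
```

During pretraining the model only sees the visible patches and is trained to reconstruct the hidden ones, which is why no labels are needed.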

SSL with MAE is especially useful if your model isn’t reaching the desired accuracy when trained on labeled data alone, and you have access to a large repository of unlabeled images. It’s a powerful tool to boost performance through domain adaptation, without the costly effort of manual annotation.

---
The sample image below shows how self-supervised learning on unlabeled images helps the model focus on more domain-specific features.

<img align="center" title="missing component" src="sample_images/sample_domain_adaptation.png" width="400" >

### The workflow in a nutshell
The notebook walks through the following steps:

1) Configure the connection to TAO FTMS
2) Log in and create a cloud workspace
3) Register train, validation and test datasets
4) Pretrain the Masked Autoencoder backbone through SSL
5) Finetune the model on a classification task using the pretrained backbone
6) Evaluate the finetuned model
7) Export the model to ONNX and generate a TensorRT engine

---

### Requirements
Before running this notebook, you need:
1) A TAO FTMS server. [(Setup Guide here)](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#)
2) The sample MVTec dataset from the [mvtec_ad_classification_dataset_format.ipynb](https://github.com/NVIDIA/tao_tutorials/tree/main/notebooks/tao_api_starter_kit/dataset_prepare/mvtec_ad_classification/mvtec_ad_classification_dataset_format.ipynb) notebook, uploaded to your cloud storage.
3) Values for all `<>` enclosed variables in the Configuration section of the notebook.

---

### Expected outcome
The expected output of this notebook is a backbone adapted to the anomaly detection domain using MAE, and a finetuned classification model initialized from that domain-adapted backbone.

Detailed documentation about Self-Supervised Learning via Masked Autoencoders is available in the TAO [documentation](https://docs.nvidia.com/tao/tao-toolkit/text/cv_finetuning/pytorch/self_supervised_learning/mae.html).

---
### Debugging Finetuning Microservice and Jobs

When working with the TAO API, you may encounter issues at different stages. Use the following guidance to debug effectively:

#### 1. Dataset, Experiment, or Workspace CRUD Operation Errors

If you encounter errors related to creating, reading, updating, or deleting datasets, experiments, or workspaces **and the error messages are not clear**, check the logs of the TAO API service pods:

```bash
kubectl logs -f <pod name starting with tao-api-app-pod>
```

#### 2. Errors During Job Launch

For issues that occur **while launching a job**, check both the app and workflow pods:

```bash
kubectl logs -f <pod name starting with tao-api-app-pod>
kubectl logs -f <pod name starting with tao-api-workflow-pod>
```

#### 3. Errors After Job Launch

If errors occur **after a job has been launched**, inspect the job pod logs:

```bash
kubectl logs -f tao-api-sts-<job_id>-0
```

> **Note:**  
> Run these `kubectl` commands on the machine where your Kubernetes service is deployed.

#### Additional Debugging Tips

- **Job logs are automatically uploaded** to your cloud workspace at:
  `/results/<job_id>/microservices_log.txt`

- **You can also view logs via the Jobs API endpoint:**  
  ```
  /api/v1/orgs/<org_name>/<experiments|datasets>/<experiment_id|dataset_id>/jobs/<job_id>/logs
  ```
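
The endpoint pattern above can be filled in programmatically. The following is a minimal sketch: `job_logs_url` is a hypothetical helper name, and the host and IDs in the example are made-up values.

```python
def job_logs_url(host_url, org_name, kind, parent_id, job_id):
    """Build the Jobs API logs URL for an experiment job or a dataset job."""
    if kind not in ("experiments", "datasets"):
        raise ValueError("kind must be 'experiments' or 'datasets'")
    # Strip a trailing slash so the result never contains '//' after the host.
    return (f"{host_url.rstrip('/')}/api/v1/orgs/{org_name}/"
            f"{kind}/{parent_id}/jobs/{job_id}/logs")


# Made-up example values, for illustration only.
print(job_logs_url("https://tao.example.com", "my_org", "experiments", "exp-123", "job-456"))
```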

**Summary Table**

| Error Type                | Where to Check Logs                                      |
|-------------------------- |---------------------------------------------------------|
| CRUD operation errors     | `tao-api-app-pod`                                       |
| Job launch errors         | `tao-api-app-pod`, `tao-api-workflow-pod`               |
| Post-launch job errors    | `tao-api-sts-<job_id>-0`                                |
| All job logs (cloud)      | `/results/<job_id>/microservices_log.txt`               |
| All job logs (API)        | `/api/v1/orgs/<org_name>/<experiments|datasets>/<experiment_id|dataset_id>/jobs/<job_id>/logs` |

---

### Performance Benchmarks

#### Execution Time Breakdown

The following table shows the approximate time required for each stage of the MAE SSL workflow:

| **Stage** | **Duration** | **Description** |
|-----------|--------------|-----------------|
| Train Dataset Pull | **2min** | Train dataset verification and preprocessing |
| Val/Test Dataset Pull | **1min** | Validation and test dataset verification |
| SSL Pre-Training | **80min** | SSL using MAE on a ConvNext Tiny backbone |
| Classification Model Finetuning | **70min** | Finetuning stage to attach a classification task head and train the full model on classifying defects |
| Evaluate Finetuned Model | **4min** | Performance assessment of finetuned model |
| Export Model to ONNX | **4min** | Export finetuned model to ONNX format |
| TRT Engine Generation | **4min** | Generate optimized TensorRT engine |

The end-to-end workflow is estimated to take 165 minutes on a single NVIDIA A30 GPU.
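
As a quick check of that estimate, the stage durations from the table can be summed directly. The dictionary below simply copies the table values; the variable names are arbitrary.

```python
# Stage durations in minutes, copied from the table above.
stage_minutes = {
    "Train Dataset Pull": 2,
    "Val/Test Dataset Pull": 1,
    "SSL Pre-Training": 80,
    "Classification Model Finetuning": 70,
    "Evaluate Finetuned Model": 4,
    "Export Model to ONNX": 4,
    "TRT Engine Generation": 4,
}

total = sum(stage_minutes.values())
training = (stage_minutes["SSL Pre-Training"]
            + stage_minutes["Classification Model Finetuning"])
print(f"Total: {total} min; training stages: {training / total:.0%} of the run")
```

The two training stages dominate the run, so changes to epochs, batch size, or GPU type affect the total far more than the dataset pull or export steps.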

### Test Environment Specifications

| **Component** | **Specification** |
|---------------|-------------------|
| **GPU** | 1x NVIDIA A30 |
| **Training Dataset** | 3747 images (3.4GB total) |
| **Validation Dataset** | 803 images (800MB total) |
| **Test Dataset** | 804 images (800MB total) |
| **Training Epochs** | 80 (both pre-training and finetuning) |
| **Model Architecture** | MAE with ConvNext-Tiny backbone |
| **Task** | Anomaly defect classification |

### Performance Factors

> **Important Note:**
> Actual execution times may vary significantly based on:
>
> - **Hardware Configuration**: GPU type, memory, and compute capability
> - **Storage Performance**: Local disk I/O speed and cloud storage latency
> - **Network Conditions**: Bandwidth and latency to cloud storage
> - **System Load**: Other concurrent processes and resource utilization
> - **Dataset Size**: Number of images and total data volume
> - **Batch Size**: Number of images processed in parallel. Higher batch sizes require more GPU memory
> - **Model Configuration**: Backbone architecture, epochs, and hyperparameters

In [1]:
import json
import os
import requests
import time
from IPython.display import clear_output
import glob

## Configuration 

Fill in all `<>` enclosed variables with relevant values under this section. 
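
Before running the rest of the notebook, it can help to check that no placeholder was left unfilled. The sketch below assumes placeholders always look like `<UPPER_CASE_NAME>`, as they do in this notebook; `find_unfilled` is a hypothetical helper, not part of TAO.

```python
import re

# Matches template placeholders such as <HOST_URL> or <BUCKET_NAME>.
PLACEHOLDER = re.compile(r"<[A-Z_]+>")


def find_unfilled(config, prefix=""):
    """Recursively collect key paths whose string values still hold a placeholder."""
    unfilled = []
    for key, value in config.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            unfilled.extend(find_unfilled(value, prefix=f"{path}."))
        elif isinstance(value, str) and PLACEHOLDER.search(value):
            unfilled.append(path)
    return unfilled


# Example input: one filled value and two placeholders that were never replaced.
example = {
    "host_url": "<HOST_URL>",
    "cloud_specific_details": {
        "cloud_region": "us-west-2",
        "access_key": "<ACCESS_KEY>",
    },
}
print(find_unfilled(example))
```

In the notebook, the same check could be applied to `cloud_metadata` or to a dictionary holding `host_url`, `ngc_key` and `ngc_org_name`.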

### TAO FTMS Host & Credentials 

In [None]:
#FIX ME - Configure for your TAO FTMS server 
host_url = "<HOST_URL>"
ngc_key = "<NGC_API_KEY>"
ngc_org_name = "<NGC_ORG_NAME>"

### Cloud Storage Setup & Credentials 

In [None]:
# Cloud bucket details to access datasets and store experiment results
cloud_metadata = {}
cloud_metadata["name"] = "tao_workspace"  # A Representative name for this cloud info
cloud_metadata["cloud_type"] = "aws"  # If it's AWS, HuggingFace or Azure
cloud_metadata["cloud_specific_details"] = {}
cloud_metadata["cloud_specific_details"]["cloud_region"] ="<BUCKET_REGION>"  # Bucket region
cloud_metadata["cloud_specific_details"]["cloud_bucket_name"] = "<BUCKET_NAME>"  # Bucket name
# Access and Secret for AWS
cloud_metadata["cloud_specific_details"]["access_key"] = "<ACCESS_KEY>"
cloud_metadata["cloud_specific_details"]["secret_key"] = "<SECRET_KEY>"

### Dataset Paths in Cloud Storage 

In [None]:
# Adjust paths to point to datasets in your cloud storage. If using S3 Bucket do not include the bucket name in the path. 
train_dataset_path =  "/<PATH_TO_DATASET_IN_CLOUD_STORAGE>/mvtec_ad_classification/train"
eval_dataset_path = "/<PATH_TO_DATASET_IN_CLOUD_STORAGE>/mvtec_ad_classification/val"
test_dataset_path = "/<PATH_TO_DATASET_IN_CLOUD_STORAGE>/mvtec_ad_classification/test"

### Model Configuration 

In [None]:
model_name = "mae"
ds_type = "image_classification"
ds_format = "ssl"
num_classes = 2 

## Login

In [None]:
# Validate NGC_PERSONAL_KEY
data = json.dumps({"ngc_org_name": ngc_org_name,
                   "ngc_key": ngc_key,
                   "enable_telemetry": True})
response = requests.post(f"{host_url}/api/v1/login", data=data)
assert response.status_code in (200, 201)
assert "token" in response.json().keys()
token = response.json()["token"]
print("JWT",token)

# Set base URL
base_url = f"{host_url}/api/v1/orgs/{ngc_org_name}"
print("API Calls will be forwarded to",base_url)

headers = {"Authorization": f"Bearer {token}"}

## Create cloud workspace
This workspace will be the place where your datasets reside and results of TAO FTMS jobs will be pushed to.

In [None]:
# Create cloud workspace
data = json.dumps(cloud_metadata)

endpoint = f"{base_url}/workspaces"

response = requests.post(endpoint,data=data,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

assert "id" in response.json().keys()
workspace_id = response.json()["id"]

## Register datasets with FTMS 

TAO FTMS requires datasets in your cloud storage to be registered to produce a unique ID that can be attached to training jobs. This step only needs to be done once and then you can use the dataset across any experiments that support the dataset format.

### List Registered Datasets

In [None]:
endpoint = f"{base_url}/datasets"

response = requests.get(endpoint, headers=headers)
assert response.status_code in (200, 201)

datasets = response.json()["datasets"]
for rsp in datasets:
    rsp_keys = rsp.keys()
    assert "id" in rsp_keys
    assert "type" in rsp_keys
    assert "format" in rsp_keys
    assert "name" in rsp_keys

print(response)
# print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose list output
print("id\t\t\t\t\t type\t\t\t format\t\t name")
for rsp in datasets:
    print(rsp["id"],"\t",rsp["type"],"\t",rsp["format"],"\t\t",rsp["name"])

If you have already registered your datasets, then you can directly set their IDs in the following cell to avoid creating duplicate datasets. 
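
One way to find an existing ID is to search the list returned by the previous cell. This is a minimal sketch that relies only on the `id`, `type`, `format` and `name` keys asserted above; `find_dataset_id` is a hypothetical helper and the sample entries are made up.

```python
def find_dataset_id(datasets, name, ds_type=None, ds_format=None):
    """Return the ID of the first registered dataset matching the filters, else None."""
    for entry in datasets:
        if entry.get("name") != name:
            continue
        if ds_type is not None and entry.get("type") != ds_type:
            continue
        if ds_format is not None and entry.get("format") != ds_format:
            continue
        return entry.get("id")
    return None


# Made-up sample of what the dataset listing might contain.
sample = [
    {"id": "aaa", "type": "image_classification", "format": "ssl",
     "name": "mvtec_classification_train"},
    {"id": "bbb", "type": "image_classification", "format": "ssl",
     "name": "mvtec_classification_val"},
]
print(find_dataset_id(sample, "mvtec_classification_val"))
```

With the real listing, `train_dataset_id = find_dataset_id(datasets, "mvtec_classification_train", ds_type, ds_format)` would leave the ID as `None` when nothing matches, so the creation cells below still run.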

In [None]:
train_dataset_id = None 
eval_dataset_id = None 
test_dataset_id = None 

### Train Dataset 

In [None]:
# Create train dataset
if train_dataset_id is None: 
    train_dataset_metadata = {"type": ds_type,
                              "format": ds_format,
                              "workspace":workspace_id,
                              "cloud_file_path": train_dataset_path,
                              "use_for": ["training"],
                              "name": "mvtec_classification_train"
                              }
    data = json.dumps(train_dataset_metadata)
    
    endpoint = f"{base_url}/datasets"
    
    response = requests.post(endpoint,data=data,headers=headers)
    assert response.status_code in (200, 201)
    assert "id" in response.json().keys()
    
    print(response)
    print(json.dumps(response.json(), indent=4))
    train_dataset_id = response.json()["id"]

In [None]:
# Check progress
endpoint = f"{base_url}/datasets/{train_dataset_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)

    if response.json().get("status") == "invalid_pull":
        raise ValueError("Dataset pull failed")
    if response.json().get("status") == "pull_complete":
        break
    time.sleep(5)

### Validation Dataset

In [None]:
# Create eval dataset
if eval_dataset_id is None: 
    eval_dataset_metadata = {"type": ds_type,
                             "format": ds_format,
                             "workspace":workspace_id,
                             "cloud_file_path": eval_dataset_path,
                             "use_for": ["evaluation"],
                             "name" : "mvtec_classification_val"
                             }
    data = json.dumps(eval_dataset_metadata)
    
    endpoint = f"{base_url}/datasets"
    
    response = requests.post(endpoint,data=data,headers=headers)
    assert response.status_code in (200, 201)
    assert "id" in response.json().keys()
    
    print(response)
    print(json.dumps(response.json(), indent=4))
    eval_dataset_id = response.json()["id"]

In [None]:
# Check progress
endpoint = f"{base_url}/datasets/{eval_dataset_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)

    print(response)
    print(json.dumps(response.json(), indent=4))
    if response.json().get("status") == "invalid_pull":
        raise ValueError("Dataset pull failed")
    if response.json().get("status") == "pull_complete":
        break
    time.sleep(5)

### Test Dataset

In [None]:
# Create testing dataset for inference
if test_dataset_id is None: 
    test_dataset_metadata = {"type": ds_type,
                             "format":ds_format,
                             "workspace":workspace_id,
                             "cloud_file_path": test_dataset_path,
                             "use_for": ["testing"],
                             "name": "mvtec_classification_test"
                             }
    data = json.dumps(test_dataset_metadata)
    
    endpoint = f"{base_url}/datasets"
    
    response = requests.post(endpoint,data=data,headers=headers)
    assert response.status_code in (200, 201)
    assert "id" in response.json().keys()
    
    print(response)
    print(json.dumps(response.json(), indent=4))
    test_dataset_id = response.json()["id"]

In [None]:
# Check progress
endpoint = f"{base_url}/datasets/{test_dataset_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)

    print(response)
    print(json.dumps(response.json(), indent=4))
    if response.json().get("status") == "invalid_pull":
        raise ValueError("Dataset pull failed")
    if response.json().get("status") == "pull_complete":
        break
    time.sleep(5)

## Create Experiment 

Before we can run any jobs such as training, evaluation, distillation or inference, we must create an experiment to set up the network architecture and associated datasets. Then we can chain several jobs together to create our trained models.

In [None]:
encode_key = "tlt_encode"
checkpoint_choose_method = "best_model"
data = json.dumps({"network_arch":model_name,
                   "encryption_key":encode_key,
                   "checkpoint_choose_method":checkpoint_choose_method,
                   "workspace": workspace_id,
                   "train_datasets":[train_dataset_id],
                   "eval_dataset":eval_dataset_id,
                   "inference_dataset":test_dataset_id,
                   "calibration_dataset":train_dataset_id})

endpoint = f"{base_url}/experiments"

response = requests.post(endpoint,data=data,headers=headers)
assert response.status_code in (200, 201)
assert "id" in response.json()

print(response)
print(json.dumps(response.json(), indent=4))
experiment_id = response.json()["id"]

When a job is submitted, we will receive a unique ID to reference back to it. We will store these IDs in the following ```job_map``` variable. 

In [None]:
job_map = {}

## SSL Pre-Training 

We will first run self-supervised learning using masked autoencoding on a ConvNext Tiny backbone. For this notebook, we will use the training set as the input for this stage. When adapting to your own datasets, it is recommended to have tens or hundreds of thousands of unlabelled images in your pre-training dataset to get measurable results.

### Retrieve default training spec 

Before launching a training job, we can retrieve the default training spec for our network architecture to use as a starting point to configure the training parameters.

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/train/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.status_code in (200, 201)
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))

### Configure Training Spec 

Now we can customize the JSON spec object and set our dataset, training and model parameters. To train using SSL, we specify the stage as ```pretrain```. No other changes are needed unless you are using a custom dataset or model architecture.

Detailed information about configuring the training spec and the available hyperparameters is available [here](https://docs.nvidia.com/tao/tao-toolkit/text/cv_finetuning/pytorch/self_supervised_learning/mae.html#creating-an-experiment-spec-file).
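
The spec cells in this notebook note that `checkpoint_interval` and `validation_interval` must be less than or equal to `num_epochs`. A small client-side check, sketched below, can catch a mismatch before the job is submitted. `check_train_intervals` is a hypothetical helper; the server may apply its own validation as well.

```python
def check_train_intervals(specs):
    """Return a list of problems where an interval exceeds num_epochs."""
    train = specs["train"]
    num_epochs = train["num_epochs"]
    problems = []
    for key in ("checkpoint_interval", "validation_interval"):
        interval = train.get(key)  # Not every spec defines both keys.
        if interval is not None and interval > num_epochs:
            problems.append(f"{key}={interval} exceeds num_epochs={num_epochs}")
    return problems


# A spec fragment with a checkpoint interval that would never be reached.
bad_specs = {"train": {"num_epochs": 10, "checkpoint_interval": 20}}
print(check_train_intervals(bad_specs))
```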

In [None]:
# Override any of the parameters listed in the previous cell as required
specs["train"]["stage"] = "pretrain" 
specs["train"]["num_gpus"] = 1
specs["train"]["num_epochs"] = 10
specs["train"]["checkpoint_interval"] = 10 # must be less than or equal to num_epochs
specs["dataset"]["batch_size"] = 32
specs["dataset"]["augmentation"]["min_scale"] = 1.0 
specs["dataset"]["augmentation"]["max_scale"] = 1.0 
specs["dataset"]["augmentation"]["color_jitter"] = 0.0 
specs["dataset"]["augmentation"]["auto_aug"] = "" 
specs["dataset"]["augmentation"]["mixup"] = 0
specs["dataset"]["augmentation"]["cutmix"] = 0 
specs["dataset"]["augmentation"]["cutmix_minmax"] = None 
specs["dataset"]["augmentation"]["mixup_prob"] = 0.0
specs["dataset"]["augmentation"]["mixup_switch_prob"] = 0.0 
specs["dataset"]["augmentation"]["mixup_mode"] = "batch" 
specs["model"]["arch"] = "convnextv2_base"
print(json.dumps(specs, sort_keys=True, indent=4))

### Submit SSL Pre-Training Job 

With our training spec configured, we can now submit a training job. All jobs follow the same flow of retrieving the default spec, customizing it, then submitting the job.
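
Because every job submission sends the same three fields, the request body can be built by one small function. This is a sketch that mirrors the payload used in this notebook; `build_job_request` is a hypothetical helper name and the example job ID is made up.

```python
import json


def build_job_request(action, specs, parent_job_id=None):
    """Serialize the body sent to the experiment's jobs endpoint."""
    return json.dumps({
        "parent_job_id": parent_job_id,  # None for the first job in a chain
        "action": action,                # e.g. "train", "evaluate", "export"
        "specs": specs,
    })


# Example: a finetuning job chained to a made-up pretraining job ID.
body = build_job_request("train", {"train": {"stage": "finetune"}},
                         parent_job_id="pretrain-job-id")
print(body)
```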

In [None]:
# Run action
action = "train"
data = json.dumps({"parent_job_id":None,"action":action,"specs":specs,
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)
assert response.json()

print(response)
print(json.dumps(response.json(), indent=4))

job_map["pretrain_" + model_name] = response.json()
print(job_map)

After submitting the training job, an ID is returned that we can use to monitor the job progress. The following cell will continuously print the latest status until the job is complete. This notebook will track all of the job IDs in the ```job_map``` variable. 
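
The polling pattern used throughout this notebook can also be written as a reusable function with an optional timeout. The sketch below is not part of TAO: `wait_for_job` is a hypothetical helper that takes any callable returning the current status, so it can be tried here with a fake status sequence instead of a live server.

```python
import time

# Terminal states checked by the monitoring cells in this notebook.
TERMINAL_STATES = ("Done", "Error", "Canceled", "Paused")


def wait_for_job(fetch_status, poll_seconds=15, timeout_seconds=None, sleep=time.sleep):
    """Poll fetch_status() until it returns a terminal state, then return that state."""
    start = time.monotonic()
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if timeout_seconds is not None and time.monotonic() - start >= timeout_seconds:
            raise TimeoutError(f"job still '{status}' after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Demo with a fake status source, so no server is needed.
fake_statuses = iter(["Pending", "Running", "Running", "Done"])
final_state = wait_for_job(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_state)
```

Against a real server, `fetch_status` could be `lambda: requests.get(endpoint, headers=headers).json().get("status")`, using the `endpoint` and `headers` defined in the notebook.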

In [None]:
job_id = job_map["pretrain_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    if "error_desc" in response.json().keys() and response.json()["error_desc"] in ("Job trying to retrieve not found", "No AutoML run found"):
        print("Job is being created")
        time.sleep(5)
        continue
    print(response)
    print(json.dumps(response.json(), sort_keys=True, indent=4))
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

If you need to cancel the job for any reason, you can uncomment and run the following cell. You can also configure the endpoint to end with ```:pause``` or ```:resume``` instead of ```:cancel``` to temporarily stop and start the job. 
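
The three control endpoints differ only in their suffix, so they can be generated from one function. This is a sketch: `job_action_url` is a hypothetical helper and the IDs in the example are made up.

```python
def job_action_url(base_url, experiment_id, job_id, action):
    """Build the cancel, pause, or resume endpoint for a job."""
    if action not in ("cancel", "pause", "resume"):
        raise ValueError("action must be 'cancel', 'pause' or 'resume'")
    return f"{base_url}/experiments/{experiment_id}/jobs/{job_id}:{action}"


# Made-up example values, for illustration only.
print(job_action_url("https://tao.example.com/api/v1/orgs/my_org", "exp-123", "job-456", "pause"))
```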

In [None]:
# job_id = job_map["pretrain_" + model_name]
# endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}:cancel"

# response = requests.post(endpoint, headers=headers)
# print(response)
# print(json.dumps(response.json(), indent=4))

If the job runs into any errors or you want to check the job logs, you can uncomment and run the following cell. Alternatively, you can view the job logs in your cloud workspace under the path `/results/<job_id>/microservices_log.txt`.

In [None]:
# job_id = job_map["pretrain_" + model_name]
# endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}/logs"

# response = requests.get(endpoint, headers=headers)
# print(response.text)

## Classification Model Finetuning 

With the backbone pretrained, we can now do a finetuning stage to attach a classification task head and train the full model on classifying defects. 

### Retrieve default training spec 

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/train/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.status_code in (200, 201)
assert "default" in response.json().keys()

print(response)
# full_schema = response.json()
# print(json.dumps(full_schema, indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))

### Configure Training Spec

#### Understanding the Spec Schema Structure

Before customizing the training spec, it's helpful to understand the available configuration options and their valid values. The `get_bounds_of_field` utility function helps you explore the schema structure and find valid parameter values if they are defined on the backend (for some fields it might not be defined).

**How to use the schema exploration utility:**

```python
from get_bounds import get_bounds_of_field

# Example: Get valid values for backbone type
# Note the conversion from dictionary access notation to list notation:
# From: ["model"]["arch"]
# To:   ["model", "arch"]
print(get_bounds_of_field(full_schema, ["model", "arch"]))
```

**Key differences in notation:**
- **Dictionary access format**: `["model"]["arch"]` - This is how you would access nested values in Python
- **Schema path format**: `["model", "arch"]` - This is the format expected by the `get_bounds_of_field` function

---

#### Schema Navigation Tips

1. **Nested Structure**: The schema follows a hierarchical structure where each level represents a configuration category
2. **Path Format**: Always use a list of strings `["level1", "level2", "level3"]` when calling `get_bounds_of_field`
3. **Validation**: This helps ensure you're using valid parameter values before submitting your training job
4. **Documentation**: Use these explorations to understand what options are available for your specific model architecture


No changes to the default spec are needed unless you are using a custom dataset format or want to modify the model architecture and training parameters.

In [None]:
specs["train"]["stage"] = "finetune" 
specs["train"]["num_gpus"] = 1
specs["train"]["num_epochs"] = 80
specs["train"]["checkpoint_interval"] = 20 # must be less than or equal to num_epochs
specs["train"]["validation_interval"] = 5 # must be less than or equal to num_epochs
specs["train"]["optim"]["type"] = "AdamW" 
specs["train"]["optim"]["lr"] = 0.001 
specs["train"]["optim"]["backbone_multiplier"] = 1 
specs["train"]["optim"]["momentum"] = 0.9 
specs["train"]["optim"]["weight_decay"] = 0.0
specs["train"]["optim"]["layer_decay"] = 0.0
specs["train"]["optim"]["lr_scheduler"] = "multistep" 
specs["train"]["optim"]["milestones"] = [10, 20] 
specs["train"]["optim"]["gamma"] = 0.1 
specs["train"]["optim"]["warmup_epochs"] = 1 

specs["dataset"]["batch_size"] = 8
specs["dataset"]["num_workers_per_gpu"] = 8

specs["dataset"]["augmentation"]["min_scale"] = 1.0 
specs["dataset"]["augmentation"]["max_scale"] = 1.0 
specs["dataset"]["augmentation"]["color_jitter"] = 0.0 
specs["dataset"]["augmentation"]["auto_aug"] = "" 
specs["dataset"]["augmentation"]["mixup"] = 0
specs["dataset"]["augmentation"]["cutmix"] = 0 
specs["dataset"]["augmentation"]["cutmix_minmax"] = None 
specs["dataset"]["augmentation"]["mixup_prob"] = 0.0
specs["dataset"]["augmentation"]["mixup_switch_prob"] = 0.0 
specs["dataset"]["augmentation"]["mixup_mode"] = "batch" 

specs["model"]["arch"] = "convnextv2_base"
specs["model"]["num_classes"] = 2

print(json.dumps(specs, sort_keys=True, indent=4))

### Submit Finetune Training Job 

To use the backbone from the pretraining stage, it needs to be set as the parent job. 

In [None]:
# Run action
parent = job_map["pretrain_" + model_name]
action = "train"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":specs,})

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)
assert response.json()

print(response)
print(json.dumps(response.json(), indent=4))

job_map["finetune_" + model_name] = response.json()
print(job_map)

In [None]:
job_id = job_map["finetune_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    if "error_desc" in response.json().keys() and response.json()["error_desc"] in ("Job trying to retrieve not found", "No AutoML run found"):
        print("Job is being created")
        time.sleep(5)
        continue
    print(response)
    print(json.dumps(response.json(), sort_keys=True, indent=4))
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

## Evaluate Finetuned Model 

### Retrieve default evaluation spec

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/evaluate/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.status_code in (200, 201)
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))

### Customize Evaluation Spec

In [None]:
specs["train"]["stage"] = "finetune" 

specs["dataset"]["batch_size"] = 8
specs["dataset"]["num_workers_per_gpu"] = 8

specs["model"]["arch"] = "convnextv2_base"
specs["model"]["num_classes"] = 2

print(json.dumps(specs, sort_keys=True, indent=4))

### Submit Evaluation Job 

In [None]:
# Run action
parent = job_map["finetune_" + model_name]  # Parent is the finetuning job, so this evaluation uses the finetuned model.
action = "evaluate"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":specs,
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

job_map["evaluate_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status until the job reaches a terminal state
job_id = job_map["evaluate_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

## TRT Engine Generation 

Now that we have a finetuned model, it can be exported to ONNX format and then turned into an optimized TensorRT engine for deployment.
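
Each job in this notebook names an earlier job as its parent, which is how artifacts flow from one stage to the next. The sketch below records those parent links as they appear in the submission cells and derives the chain leading to a given stage. The short stage labels are chosen for illustration and are not the `job_map` keys.

```python
# Parent of each stage, as set through parent_job_id in this notebook.
PARENT_STAGE = {
    "pretrain": None,
    "finetune": "pretrain",
    "evaluate": "finetune",
    "export": "finetune",
    "gen_trt_engine": "export",
}


def lineage(stage):
    """Return the chain of stages from the first job down to the given stage."""
    chain = []
    while stage is not None:
        chain.append(stage)
        stage = PARENT_STAGE[stage]
    return list(reversed(chain))


print(" -> ".join(lineage("gen_trt_engine")))
```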

### Export Model to ONNX 

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/export/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.status_code in (200, 201)
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))

In [None]:
specs["train"]["stage"] = "finetune" 
specs["model"]["arch"] = "convnextv2_base"
specs["model"]["num_classes"] = 2
specs["export"]["input_channel"] = 3 
specs["export"]["input_width"] = 224
specs["export"]["input_height"] = 224 
specs["export"]["batch_size"] = -1 

In [None]:
# Run action
parent = job_map["finetune_" + model_name]  # Parent is the finetuning job, so the finetuned model is exported.
action = "export"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":specs,
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

job_map["export_finetuned_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map["export_finetuned_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

### Convert ONNX to TRT Engine 

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/gen_trt_engine/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.status_code in (200, 201)
assert "default" in response.json().keys()

print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
specs = response.json()["default"]
print(json.dumps(specs, sort_keys=True, indent=4))

In [None]:
specs["train"]["stage"] = "finetune" 
specs["model"]["arch"] = "convnextv2_base"
specs["model"]["num_classes"] = 2
specs["gen_trt_engine"]["tensorrt"]["data_type"] = "FP16"

In [None]:
# Run action
parent = job_map["export_finetuned_" + model_name]
action = "gen_trt_engine"
data = json.dumps({"parent_job_id": parent, "action": action, "specs": specs})

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

job_map["model_gen_trt_engine_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map['model_gen_trt_engine_' + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    print(response)
    print(json.dumps(response.json(), indent=4))
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

## Clean Up 

You can optionally run this section to delete the datasets and experiment results. 
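The cells in this section each issue one DELETE request. As a rough sketch, the same clean-up can be written as a loop over the endpoints. The helper names `cleanup_endpoints` and `delete_all`, and the `delete` callable, are hypothetical; only the URL patterns come from this notebook.

```python
def cleanup_endpoints(base_url, experiment_id, dataset_ids):
    """Return the DELETE endpoints for the experiment and its datasets."""
    endpoints = [f"{base_url}/experiments/{experiment_id}"]
    endpoints += [f"{base_url}/datasets/{dataset_id}" for dataset_id in dataset_ids]
    return endpoints


def delete_all(endpoints, delete):
    """Call delete(endpoint) for each endpoint and collect the results.

    delete is a hypothetical callable that returns an HTTP status code
    for the given URL.
    """
    results = {}
    for endpoint in endpoints:
        results[endpoint] = delete(endpoint)
    return results
```

In the notebook, `delete` could be `lambda url: requests.delete(url, headers=headers).status_code`, and `dataset_ids` would be `[train_dataset_id, eval_dataset_id, test_dataset_id]`. Collecting the status codes instead of asserting on each one lets every deletion be attempted even if an earlier one fails.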

### Delete experiment <a class="anchor" id="head-23"></a>

In [None]:
endpoint = f"{base_url}/experiments/{experiment_id}"

response = requests.delete(endpoint,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

### Delete train dataset <a class="anchor" id="head-24"></a>

In [None]:
endpoint = f"{base_url}/datasets/{train_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

### Delete val dataset <a class="anchor" id="head-24"></a>

In [None]:
endpoint = f"{base_url}/datasets/{eval_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

### Delete test dataset <a class="anchor" id="head-24"></a>

In [None]:
endpoint = f"{base_url}/datasets/{test_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))