### Notebook to demonstrate Image Segmentation workflow

Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique in which a model trained on one task is retrained for a different task. The Train Adapt Optimize (TAO) Toolkit is a simple, easy-to-use Python-based AI toolkit for taking purpose-built AI models and customizing them with your own data.

![image](https://d29g4g2dyqv443.cloudfront.net/sites/default/files/akamai/TAO/tlt-tao-toolkit-bring-your-own-model-diagram.png)

### Sample prediction for a Semantic Segmentation model - Unet, Segformer
<img align="center" src="../example_images/sample_semantic_segmentation.jpg">

### Sample prediction for an Instance Segmentation model - Mask-RCNN
<img align="center" width="800" src="https://d29g4g2dyqv443.cloudfront.net/sites/default/files/akamai/TLT/test.jpg">

### The workflow in a nutshell

- Creating a dataset
- Uploading the dataset to the service
- Running dataset convert (for Mask-RCNN)
- Getting a PTM from NGC
- Model Actions
    - Train (Normal/AutoML)
    - Evaluate
    - Prune, retrain
    - Export
    - TAO-Deploy
    - Inference on TAO
    - Inference on TRT

### Table of contents

1. [Create datasets](#head-1)
1. [List the created datasets](#head-2)
1. [Dataset convert Action](#head-3)
1. [Create an experiment](#head-4)
1. [List experiments](#head-5)
1. [Assign train, eval datasets](#head-6)
1. [Assign PTM](#head-7)
1. [View hyperparameters that are enabled by default](#head-8)
1. [Set AutoML related configurations](#head-9)
1. [Actions](#head-10)
1. [Train](#head-11)
1. [Evaluate](#head-12)
1. [Optimize: Apply specs for prune](#head-14)
1. [Optimize: Apply specs for retrain](#head-15)
1. [Optimize: Run actions](#head-16)
1. [Export](#head-17)
1. [TRT Engine generation using TAO-Deploy](#head-19)
1. [TAO inference](#head-20)
1. [TRT inference](#head-21)

### Requirements
Please find the server requirements [here](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#)

In [None]:
import json
import os
import requests
import uuid
import time
from IPython.display import clear_output
import subprocess
import glob
from IPython.display import Image

### FIXME

1. Assign a model_name in FIXME 1
1. Assign a workdir in FIXME 2
1. Assign the ip_address and port_number in FIXME 3 ([info](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_rest_api.html))
1. Assign the ngc_api_key variable in FIXME 4
1. (Optional) Enable AutoML if needed in FIXME 5
1. (Optional) Choose between bayesian and hyperband automl_algorithm in FIXME 6 (if AutoML was enabled in FIXME 5)
1. Choose to download jobs or not in FIXME 7
1. Choose between default and custom dataset in FIXME 8
1. Assign path of DATA_DIR in FIXME 9
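
Before running the remaining cells, it can help to confirm that no placeholder value was left unchanged. The following is a minimal sketch: `find_unset_placeholders` is a hypothetical helper (not part of the TAO API), and it assumes placeholders use the `<...>` form that appears in the configuration cells of this notebook.

```python
import re

# Hypothetical helper (not part of the TAO API): report settings that still
# contain an angle-bracket placeholder such as "<ip_address>".
PLACEHOLDER = re.compile(r"<[A-Za-z_]+>")

def find_unset_placeholders(settings):
    """Return the sorted names of settings whose value still holds a placeholder."""
    return sorted(
        name for name, value in settings.items()
        if isinstance(value, str) and PLACEHOLDER.search(value)
    )

# Example values copied from the unedited notebook cells
example = {
    "host_url": "http://<ip_address>:<port_number>",
    "ngc_api_key": "<ngc_api_key>",
    "workdir": "workdir_segmentation",
}
print(find_unset_placeholders(example))
```

In the notebook, the dictionary would be built from the actual `host_url`, `ngc_api_key`, and `workdir` variables once they are defined.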

In [None]:
# Define model_name, workspaces, and other variables
# Available models (#FIXME 1):
# 1. mask_rcnn - https://docs.nvidia.com/tao/tao-toolkit/text/instance_segmentation/mask_rcnn.html
# 2. segformer - https://docs.nvidia.com/tao/tao-toolkit/text/semantic_segmentation/segformer.html
# 3. unet - https://docs.nvidia.com/tao/tao-toolkit/text/semantic_segmentation/unet.html

model_name = "mask_rcnn" # FIXME1 (Add the model name from the above mentioned list)

In [None]:
workdir = "workdir_segmentation" # FIXME2
host_url = "http://<ip_address>:<port_number>" # FIXME3 example: https://10.137.149.22:32334
# In host machine, node ip_address and port number can be obtained as follows,
# ip_address: hostname -i
# port_number: kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'
ngc_api_key = "<ngc_api_key>" # FIXME4 example: (Add NGC API key)

In [None]:
automl_enabled = False # FIXME5 set to True if you want to run automl for the model chosen in the previous cell
automl_algorithm = "bayesian" # FIXME6 example: bayesian/hyperband
# FIXME7 Defaulted to False as downloading jobs from service to your machine takes time
# Set to True if you want to download jobs where examples have been provided like for train, export, inference.
download_jobs = False

In [None]:
# Exchange NGC_API_KEY for JWT
data = json.dumps({"ngc_api_key": ngc_api_key})
response = requests.post(f"{host_url}/api/v1/login", data=data)
assert response.status_code in (200, 201)
assert "user_id" in response.json().keys()
user_id = response.json()["user_id"]
print("User ID",user_id)
assert "token" in response.json().keys()
token = response.json()["token"]
print("JWT",token)

# Set base URL
base_url = f"{host_url}/api/v1/users/{user_id}"
print("API Calls will be forwarded to",base_url)

headers = {"Authorization": f"Bearer {token}"}

In [None]:
# Creating workdir
if not os.path.isdir(workdir):
    os.makedirs(workdir)

### Function to split tar files <a class="anchor" id="head-1.1"></a>

In [None]:
import os
import tarfile

def split_tar_file(input_tar_path, output_dir, max_split_size=0.2*1024*1024*1024):
    """Split a tar archive into smaller tar files holding at most max_split_size bytes of member data."""
    os.makedirs(output_dir, exist_ok=True)

    def open_split(number):
        # Open a new split tar archive for writing
        return tarfile.open(os.path.join(output_dir, f'smaller_file_{number}.tar'), 'w')

    with tarfile.open(input_tar_path, 'r') as original_tar:
        current_split_size = 0
        current_split_number = 0
        split_tar = open_split(current_split_number)
        try:
            for member in original_tar.getmembers():
                # Start a new split when this member would exceed the limit,
                # unless the current split is still empty
                if current_split_size > 0 and current_split_size + member.size > max_split_size:
                    split_tar.close()
                    current_split_number += 1
                    current_split_size = 0
                    split_tar = open_split(current_split_number)
                split_tar.addfile(member, original_tar.extractfile(member))
                current_split_size += member.size
        finally:
            # Always close the last split so its end-of-archive blocks are written
            split_tar.close()
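
One way to sanity-check the splits is to confirm that, taken together, they contain the same member names as the original archive. The sketch below uses only the standard `tarfile` module. `collect_member_names` is an illustrative helper name, and the tiny archives it builds are made-up demo data.

```python
import io
import os
import tarfile
import tempfile

def collect_member_names(tar_paths):
    """Return the sorted member names found across the given tar archives."""
    names = []
    for path in tar_paths:
        with tarfile.open(path, "r") as tar:
            names.extend(tar.getnames())
    return sorted(names)

def _write_demo_tar(path, files):
    """Write a small tar archive from a {name: bytes} mapping (demo data only)."""
    with tarfile.open(path, "w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.tar")
    part_0 = os.path.join(tmp, "smaller_file_0.tar")
    part_1 = os.path.join(tmp, "smaller_file_1.tar")
    _write_demo_tar(original, {"a.txt": b"aaa", "b.txt": b"bbb"})
    _write_demo_tar(part_0, {"a.txt": b"aaa"})
    _write_demo_tar(part_1, {"b.txt": b"bbb"})
    # The splits are consistent if their combined names equal the original's names
    splits_match = collect_member_names([part_0, part_1]) == collect_member_names([original])
    print("Splits match original:", splits_match)
```

In the notebook, the split paths would be the files written to `output_dir`, for example `glob.glob(os.path.join(output_dir, "*.tar"))`.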

### Set dataset type, format <a class="anchor" id="head-1.1"></a>

**Instance Segmentation:**
We will use the `COCO dataset` for instance segmentation with Mask-RCNN. The `download_coco.sh` script from `dataset_prepare` downloads and unzips the coco2017 dataset from [here](https://cocodataset.org/#download).


**If you use a custom dataset, it must follow this structure**
```
DATA_DIR
├── annotations.json
└── images
    ├── image_name_1.jpg
    ├── image_name_2.jpg
    └── ...
```
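
For a custom dataset, a quick consistency check between `annotations.json` and the `images` folder can catch missing files before upload. This is a minimal sketch that assumes the standard COCO layout, in which `annotations.json` has an `images` list whose entries carry a `file_name`. `missing_coco_images` is a hypothetical helper name, not a TAO function.

```python
import json
import os
import tempfile

def missing_coco_images(data_dir):
    """Return file names listed in annotations.json that are absent from images/."""
    with open(os.path.join(data_dir, "annotations.json")) as f:
        annotations = json.load(f)
    images_dir = os.path.join(data_dir, "images")
    return sorted(
        entry["file_name"]
        for entry in annotations.get("images", [])
        if not os.path.isfile(os.path.join(images_dir, entry["file_name"]))
    )

# Demo on a throwaway directory with one image present and one missing
with tempfile.TemporaryDirectory() as demo_dir:
    os.makedirs(os.path.join(demo_dir, "images"))
    open(os.path.join(demo_dir, "images", "image_name_1.jpg"), "wb").close()
    with open(os.path.join(demo_dir, "annotations.json"), "w") as f:
        json.dump({"images": [{"id": 1, "file_name": "image_name_1.jpg"},
                              {"id": 2, "file_name": "image_name_2.jpg"}]}, f)
    missing = missing_coco_images(demo_dir)
    print("Missing images:", missing)
```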

**Semantic Segmentation:**
We will be using the `ISBI Challenge: Segmentation of neuronal structures in EM stacks dataset` for the binary segmentation tutorial (Unet and Segformer). Please access the open source repo [here](https://github.com/alexklibisz/isbi-2012/tree/master/data) to download the data. The data is in .tif format. Copy the train-labels.tif, train-volume.tif, test-volume.tif files to `DATA_DIR`.

**If you use a custom dataset, it must follow this structure**
```
DATA_DIR
├── images
│   ├── test
│   │   ├── image_0.png
│   │   ├── image_1.png
│   │   └── ...
│   ├── train
│   │   ├── image_2.png
│   │   ├── image_3.png
│   │   └── ...
│   └── val
│       ├── image_4.png
│       ├── image_5.png
│       └── ...
└── masks
    ├── train
    │   ├── image_2.png
    │   ├── image_3.png
    │   └── ...
    └── val
        ├── image_4.png
        ├── image_5.png
        └── ...
```
Each image must have a mask with the same filename.
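
The filename requirement can be checked before packaging the data. The sketch below is one possible check, assuming the `images/<split>` and `masks/<split>` layout shown above. `unmatched_masks` is a hypothetical helper name.

```python
import os
import tempfile

def unmatched_masks(data_dir, split):
    """Return image file names in images/<split> that have no same-named mask."""
    images = set(os.listdir(os.path.join(data_dir, "images", split)))
    masks = set(os.listdir(os.path.join(data_dir, "masks", split)))
    return sorted(images - masks)

# Demo on a throwaway directory: two training images, only one mask
with tempfile.TemporaryDirectory() as demo_dir:
    for kind in ("images", "masks"):
        os.makedirs(os.path.join(demo_dir, kind, "train"))
    for name in ("image_2.png", "image_3.png"):
        open(os.path.join(demo_dir, "images", "train", name), "wb").close()
    open(os.path.join(demo_dir, "masks", "train", "image_2.png"), "wb").close()
    unmatched = unmatched_masks(demo_dir, "train")
    print("Images without masks:", unmatched)
```

Running the check for both `train` and `val` covers every split that has masks. The `test` split has images only.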

In [None]:
# Set the dataset type and format for the chosen model
if model_name in ("unet","segformer"):
    ds_type = "semantic_segmentation"
    ds_format = "unet"
elif model_name == "mask_rcnn":
    ds_type = "instance_segmentation"
    ds_format = "coco"

In [None]:
dataset_to_be_used = "default" #FIXME8 #default/custom; default for the dataset used in this tutorial notebook; custom for a different dataset
DATA_DIR = model_name # FIXME9
os.environ['DATA_DIR']= DATA_DIR
!mkdir -p $DATA_DIR
job_map = {}

### Dataset download and pre-processing <a class="anchor" id="head-1"></a>

In [None]:
if model_name == "mask_rcnn" and dataset_to_be_used == "default":
    !bash ../dataset_prepare/coco/download_coco.sh $DATA_DIR
    # Remove existing data
    !rm -rf $DATA_DIR/train2017/images
    !rm -rf $DATA_DIR/val2017/images
    # Rearrange data in the required format
    !mkdir -p $DATA_DIR/train2017/
    !mkdir -p $DATA_DIR/val2017/
    !mv $DATA_DIR/raw-data/train2017 $DATA_DIR/train2017/images
    !mv $DATA_DIR/raw-data/annotations/instances_train2017.json $DATA_DIR/train2017/annotations.json
    !mv $DATA_DIR/raw-data/val2017 $DATA_DIR/val2017/images
    !mv $DATA_DIR/raw-data/annotations/instances_val2017.json $DATA_DIR/val2017/annotations.json
    !cp ../dataset_prepare/coco/label_map.txt $DATA_DIR/train2017/
    !cp ../dataset_prepare/coco/label_map.txt $DATA_DIR/val2017/
    
# For unet/segformer you have to manually download from the github link https://github.com/alexklibisz/isbi-2012/tree/master/data and place it in $DATA_DIR

In [None]:
# Verify the downloaded dataset
if model_name == "mask_rcnn":
    !if [ ! -d $DATA_DIR/train2017/images ]; then echo 'Images folder not found'; else echo 'Found images folder';fi
    !if [ ! -f $DATA_DIR/train2017/annotations.json ]; then echo 'annotations file not found'; else echo 'Found annotations file';fi
    !if [ ! -d $DATA_DIR/val2017/images ]; then echo 'Images folder not found'; else echo 'Found images folder';fi
    !if [ ! -f $DATA_DIR/val2017/annotations.json ]; then echo 'annotations file not found'; else echo 'Found annotations file';fi
if model_name in ("unet","segformer") and dataset_to_be_used == "default":
    assert (os.path.exists(f"{DATA_DIR}/train-volume.tif"))
    assert (os.path.exists(f"{DATA_DIR}/train-labels.tif"))
    assert (os.path.exists(f"{DATA_DIR}/test-volume.tif"))

In [None]:
if model_name in ("unet","segformer"):
    if dataset_to_be_used == "default":
        !python3 -m pip install Pillow opencv-python numpy
        # create images and masks from the tif files
        !bash ../dataset_prepare/unet/prepare_data.sh $DATA_DIR
        assert (os.path.exists(f"{DATA_DIR}/images/train"))
        assert (os.path.exists(f"{DATA_DIR}/images/val"))
        assert (os.path.exists(f"{DATA_DIR}/images/test"))
        assert (os.path.exists(f"{DATA_DIR}/masks/train"))
        assert (os.path.exists(f"{DATA_DIR}/masks/val"))
    !tar -czf isbi_data.tar.gz -C $DATA_DIR .
elif model_name == "mask_rcnn":
    !tar -C $DATA_DIR/train2017 -czf coco_train.tar.gz images annotations.json label_map.txt
    !tar -C $DATA_DIR/val2017 -czf coco_val.tar.gz images annotations.json label_map.txt

### Set dataset path <a class="anchor" id="head-1.1"></a>

In [None]:
if model_name in ("unet","segformer"):
    train_dataset_path = "isbi_data.tar.gz"
    eval_dataset_path = "isbi_data.tar.gz"
elif model_name == "mask_rcnn":
    train_dataset_path = "coco_train.tar.gz"
    eval_dataset_path = "coco_val.tar.gz"

### Create and upload train dataset <a class="anchor" id="head-1.2"></a>

In [None]:
# Create train dataset
data = json.dumps({"type":ds_type,"format":ds_format})

endpoint = f"{base_url}/datasets"

response = requests.post(endpoint,data=data,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(response.json())

assert "id" in response.json().keys()
train_dataset_id = response.json()["id"]

In [None]:
# Update
docker_env_vars = {} # Add any variables to pass to the Docker run-time, such as MLOps variables
dataset_information = {"name":"Train dataset",
                       "description":"My train dataset",
                       "docker_env_vars": docker_env_vars}
data = json.dumps(dataset_information)

endpoint = f"{base_url}/datasets/{train_dataset_id}"

response = requests.patch(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)

print(response)
print(response.json())

In [None]:
# Upload
output_dir = os.path.join(os.path.dirname(os.path.abspath(train_dataset_path)), model_name, "train")
split_tar_file(train_dataset_path, output_dir)
endpoint = f"{base_url}/datasets/{train_dataset_id}:upload"
tar_splits = sorted(os.listdir(output_dir))
for idx, tar_dataset_path in enumerate(tar_splits):
    print(f"Uploading {idx+1}/{len(tar_splits)} tar split")
    # Open each split in a context manager so the file handle is closed after the upload
    with open(os.path.join(output_dir, tar_dataset_path), "rb") as tar_file:
        response = requests.post(endpoint, files=[("file", tar_file)], headers=headers)

    print(response)
    print(response.json())
    assert response.status_code in (200, 201)
    assert "message" in response.json().keys() and response.json()["message"] == "Server recieved file and upload process started"

### Create and upload val dataset <a class="anchor" id="head-1.3"></a>

In [None]:
# Create eval dataset
data = json.dumps({"type":ds_type,"format":ds_format})

endpoint = f"{base_url}/datasets"

response = requests.post(endpoint,data=data,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(response.json())

assert "id" in response.json().keys()
eval_dataset_id = response.json()["id"]

In [None]:
# Update
docker_env_vars = {} # Add any variables to pass to the Docker run-time, such as MLOps variables
dataset_information = {"name":"Evaluation dataset",
                       "description":"My eval dataset",
                       "docker_env_vars": docker_env_vars}
data = json.dumps(dataset_information)

endpoint = f"{base_url}/datasets/{eval_dataset_id}"

response = requests.patch(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)

print(response)
print(response.json())

In [None]:
# Upload
output_dir = os.path.join(os.path.dirname(os.path.abspath(eval_dataset_path)), model_name, "eval")
split_tar_file(eval_dataset_path, output_dir)
endpoint = f"{base_url}/datasets/{eval_dataset_id}:upload"
tar_splits = sorted(os.listdir(output_dir))
for idx, tar_dataset_path in enumerate(tar_splits):
    print(f"Uploading {idx+1}/{len(tar_splits)} tar split")
    # Open each split in a context manager so the file handle is closed after the upload
    with open(os.path.join(output_dir, tar_dataset_path), "rb") as tar_file:
        response = requests.post(endpoint, files=[("file", tar_file)], headers=headers)

    print(response)
    print(response.json())
    assert response.status_code in (200, 201)
    assert "message" in response.json().keys() and response.json()["message"] == "Server recieved file and upload process started"

### List the created datasets <a class="anchor" id="head-2"></a>

In [None]:
endpoint = f"{base_url}/datasets"

response = requests.get(endpoint, headers=headers)
assert response.status_code in (200, 201)

print(response)
# print(response.json()) ## Uncomment for verbose list output
print("id\t\t\t\t\t type\t\t\t format\t\t name")
for rsp in response.json():
    rsp_keys = rsp.keys()
    assert "id" in rsp_keys
    assert "type" in rsp_keys
    assert "format" in rsp_keys
    assert "name" in rsp_keys
    print(rsp["id"],"\t",rsp["type"],"\t",rsp["format"],"\t\t",rsp["name"])

### Train Dataset convert Action <a class="anchor" id="head-3"></a>
#### Run dataset convert only for the coco data format; skip to Create an experiment for the unet data format

In [None]:
if ds_format == "coco" and model_name != "segformer":
    # Get default spec schema
    endpoint = f"{base_url}/datasets/{train_dataset_id}/specs/convert/schema"

    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)

    #print(response)
    #print(response.json()) ## Uncomment for verbose schema

    assert "default" in response.json().keys()
    train_ds_convert_specs = response.json()["default"]
    print(json.dumps(train_ds_convert_specs, sort_keys=True, indent=4))

In [None]:
if ds_format == "coco" and model_name != "segformer":
    # Apply changes
    train_ds_convert_specs["dataset_convert"]["num_shards"] = 256
    train_ds_convert_specs["dataset_convert"]["tag"] = "train"
    print(json.dumps(train_ds_convert_specs, sort_keys=True, indent=4))

In [None]:
if ds_format == "coco" and model_name != "segformer":
    # Run action
    parent = None
    action = "convert"
    data = json.dumps({"parent_job_id":parent,"action":action, "specs":train_ds_convert_specs})

    endpoint = f"{base_url}/datasets/{train_dataset_id}/jobs"

    response = requests.post(endpoint, data=data, headers=headers)
    assert response.status_code in (200, 201)
    assert response.json()

    print(response)
    print(response.json())

    ds_convert_id = response.json()
    job_map["train_dataset_convert_"+model_name] = ds_convert_id

In [None]:
if ds_format == "coco" and model_name != "segformer":
    # Monitor job status by repeatedly running this cell
    job_id = ds_convert_id
    endpoint = f"{base_url}/datasets/{train_dataset_id}/jobs/{job_id}"

    while True:    
        clear_output(wait=True)
        response = requests.get(endpoint, headers=headers)
        assert response.status_code in (200, 201)
        print(response)
        print(response.json())
        assert "status" in response.json().keys() and response.json().get("status") != "Error"
        if response.json().get("status") in ["Done","Error", "Canceled"] or response.status_code not in (200,201):
            break
        time.sleep(15)

### Eval Dataset convert Action <a class="anchor" id="head-3"></a>

In [None]:
if ds_format == "coco" and model_name != "segformer":
    # Now, repeat the same for the eval dataset
    # Get default spec schema
    endpoint = f"{base_url}/datasets/{eval_dataset_id}/specs/convert/schema"

    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)

    print(response)
    #print(response.json()) ## Uncomment for verbose schema
    assert "default" in response.json().keys()
    eval_ds_convert_specs = response.json()["default"]
    print(json.dumps(eval_ds_convert_specs, sort_keys=True, indent=4))

In [None]:
if ds_format == "coco" and model_name != "segformer":
    ## Apply changes
    eval_ds_convert_specs["dataset_convert"]["num_shards"] = 256
    eval_ds_convert_specs["dataset_convert"]["tag"] = "val"
    print(json.dumps(eval_ds_convert_specs, sort_keys=True, indent=4))

In [None]:
if ds_format == "coco" and model_name != "segformer":
    # Run action
    parent = job_map["train_dataset_convert_"+model_name]
    action = "convert"
    data = json.dumps({"parent_job_id":parent,"action":action,"specs":eval_ds_convert_specs})

    endpoint = f"{base_url}/datasets/{eval_dataset_id}/jobs"

    response = requests.post(endpoint, data=data, headers=headers)
    assert response.status_code in (200, 201)
    assert response.json()

    print(response)
    print(response.json())

    eval_ds_convert_id = response.json()
    job_map["eval_dataset_convert_"+model_name] = eval_ds_convert_id

In [None]:
if ds_format == "coco" and model_name != "segformer":
    # Monitor job status by repeatedly running this cell
    job_id = eval_ds_convert_id
    endpoint = f"{base_url}/datasets/{eval_dataset_id}/jobs/{job_id}"

    while True:    
        clear_output(wait=True)
        response = requests.get(endpoint, headers=headers)
        assert response.status_code in (200, 201)
        print(response)
        print(response.json())
        assert "status" in response.json().keys() and response.json().get("status") != "Error"
        if response.json().get("status") in ["Done","Error", "Canceled"] or response.status_code not in (200,201):
            break
        time.sleep(15)

### Create an experiment <a class="anchor" id="head-4"></a>

In [None]:
if model_name == "segformer":
    encode_key = "nvidia_tao"
else:
    encode_key = "tlt_encode"
checkpoint_choose_method = "best_model"
data = json.dumps({"network_arch":model_name,"encryption_key":encode_key,"checkpoint_choose_method":checkpoint_choose_method})

endpoint = f"{base_url}/experiments"

response = requests.post(endpoint,data=data,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(response.json())
assert "id" in response.json().keys()
experiment_id = response.json()["id"]

### List experiments <a class="anchor" id="head-5"></a>

In [None]:
endpoint = f"{base_url}/experiments"
params = {"network_arch": model_name}
response = requests.get(endpoint, params=params, headers=headers)
assert response.status_code in (200, 201)

print(response)
# print(response.json()) ## Uncomment for verbose list output
print("name\t\t\t model id\t\t\t     network architecture")
for rsp in response.json():
    rsp_keys = rsp.keys()
    assert "id" in rsp_keys and "network_arch" in rsp_keys
    # "name" is not asserted above, so read it with .get() to avoid a KeyError
    print(rsp.get("name"),rsp["id"],rsp["network_arch"])

### Assign train, eval datasets <a class="anchor" id="head-6"></a>

- Note: make sure the order for train_datasets is [source ID, target ID]
- eval_dataset is kept same as target for demo purposes

In [None]:
docker_env_vars = {} # Add any variables to pass to the Docker run-time, such as MLOps variables
dataset_information = {"train_datasets":[train_dataset_id],
                       "eval_dataset":eval_dataset_id,
                       "inference_dataset":eval_dataset_id,
                       "calibration_dataset":train_dataset_id,
                       "docker_env_vars": docker_env_vars}
data = json.dumps(dataset_information)

endpoint = f"{base_url}/experiments/{experiment_id}"

response = requests.patch(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)

print(response)
print(response.json())

### Assign PTM <a class="anchor" id="head-7"></a>

Search for the PTM on NGC for the Segmentation model chosen

In [None]:
# List all pretrained models for the chosen network architecture
endpoint = f"{base_url}/experiments"
params = {"network_arch": model_name}
response = requests.get(endpoint, params=params, headers=headers)
assert response.status_code in (200, 201)

response_json = response.json()

for rsp in response_json:
    rsp_keys = rsp.keys()
    if "encryption_key" not in rsp.keys():
        assert "name" in rsp_keys and "version" in rsp_keys and "ngc_path" in rsp_keys and "additional_id_info" in rsp_keys
        print(f'PTM Name: {rsp["name"]}; PTM version: {rsp["version"]}; NGC PATH: {rsp["ngc_path"]}; Additional info: {rsp["additional_id_info"]}')

In [None]:
# Assigning pretrained models to different networks
# Using the output of the previous cell, edit this map if you want to change the default PTM backbone.
# Changing the backbone also requires matching changes to the default spec/config for train, eval, etc.
# For example, if you change the PTM to resnet34, manually set the config key num_layers (if it exists) to 34.
pretrained_map = {"mask_rcnn" : "pretrained_instance_segmentation:resnet18",
                  "segformer" : "pretrained_segformer_imagenet:fan_hybrid_tiny",
                  "unet" : "pretrained_semantic_segmentation:resnet18"}
no_ptm_models = set([])

In [None]:
# Get pretrained model
if model_name not in no_ptm_models:
    endpoint = f"{base_url}/experiments"
    params = {"network_arch": model_name}
    response = requests.get(endpoint, params=params, headers=headers)
    assert response.status_code in (200, 201)

    response_json = response.json()

    # Search for ptm with given ngc path
    ptm = []
    for rsp in response_json:
        rsp_keys = rsp.keys()
        assert "ngc_path" in rsp_keys
        if rsp["ngc_path"].endswith(pretrained_map[model_name]):
            assert "id" in rsp_keys
            ptm_id = rsp["id"]
            ptm = [ptm_id]
            print("Metadata for model with requested NGC Path")
            print(rsp)
            break

In [None]:
if model_name not in no_ptm_models:
    ptm_information = {"base_experiment":ptm}
    data = json.dumps(ptm_information)

    endpoint = f"{base_url}/experiments/{experiment_id}"

    response = requests.patch(endpoint, data=data, headers=headers)
    assert response.status_code in (200, 201)

    print(response)
    print(response.json())

### View hyperparameters that are enabled for AutoML by default <a class="anchor" id="head-8"></a>

In [None]:
if automl_enabled:
    # Get default spec schema
    endpoint = f"{base_url}/experiments/{experiment_id}/specs/train/schema"
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)
    assert "automl_default_parameters" in response.json().keys()
    automl_specs = response.json()["automl_default_parameters"]
    print(json.dumps(automl_specs, sort_keys=True, indent=4))

### Actions <a class="anchor" id="head-10"></a>

For all actions:
1. Get default spec schema and derive the default values
2. Modify defaults if needed
3. Post spec dictionary to the service
4. Run model action
5. Monitor job using retrieve
6. Download results using job download endpoint (if needed)
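
Step 5 follows the same polling pattern for every action. As a rough sketch, the loop can be factored into a helper that accepts any callable returning the job metadata as a dict. `wait_for_job` and `fetch_job` are illustrative names rather than TAO API functions, and the terminal statuses are the ones checked in this notebook's monitoring cells.

```python
import time

# Statuses at which the monitoring cells in this notebook stop polling
TERMINAL_STATUSES = ("Done", "Error", "Canceled")

def wait_for_job(fetch_job, poll_interval=15, max_polls=None):
    """Poll fetch_job() until the job reaches a terminal status and return it.

    fetch_job: callable returning the job metadata as a dict with a "status" key.
    max_polls: optional cap on the number of polls; None means poll indefinitely.
    """
    polls = 0
    while True:
        status = fetch_job().get("status")
        polls += 1
        if status in TERMINAL_STATUSES:
            return status
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"Job still '{status}' after {polls} polls")
        time.sleep(poll_interval)

# Demo with a fake job that finishes on the third poll; the non-terminal
# status names used here are made up for the example.
fake_responses = iter([{"status": "Pending"}, {"status": "Running"}, {"status": "Done"}])
final_status = wait_for_job(lambda: next(fake_responses), poll_interval=0)
print(final_status)
```

With `requests`, `fetch_job` could be something like `lambda: requests.get(endpoint, headers=headers).json()`, where `endpoint` is the job URL used in the monitoring cells.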

### Train <a class="anchor" id="head-11"></a>

#### Set AutoML related configurations <a class="anchor" id="head-9"></a>
Refer to these links for the parameters supported by each network, and add parameters beyond the default AutoML-enabled ones if necessary:

[Mask RCNN](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/mask_rcnn/mask_rcnn%20-%20train.csv),
[Segformer](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/segformer/segformer%20-%20train.csv),
[Unet](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/unet/unet%20-%20train.csv)

In [None]:
if automl_enabled:
    # Choose any metric that is present in the kpi dictionary present in the model's status.json. 
    # Example status.json for each model can be found in the respective section in NVIDIA TAO DOCS here: https://docs.nvidia.com/tao/tao-toolkit/text/model_zoo/cv_models/index.html
    metric="kpi"

    additional_automl_parameters = [] #Refer to parameter list mentioned in the above links and add any extra parameter in addition to the default enabled ones
    remove_default_automl_parameters = [] #Remove any hyperparameters that are enabled by default for AutoML

    automl_information = {"automl_enabled":automl_enabled,
                          "automl_algorithm":automl_algorithm,
                          "automl_max_recommendations": 20, # Only for bayesian
                          "automl_R": 27, # Only for hyperband
                          "automl_nu": 3, # Only for hyperband
                          "epoch_multiplier": 1, # Only for hyperband
                          # Enable this if you want to add parameters to automl_add_hyperparameters below that are disabled by TAO in the automl_enabled column of the spec csv.
                          # Warning: The parameters that are disabled are not tested by TAO, so there might be unexpected behaviour in overriding this
                          "override_automl_disabled_params": False,
                          "metric":metric,
                          "automl_add_hyperparameters":str(additional_automl_parameters),
                          "automl_remove_hyperparameters":str(remove_default_automl_parameters)
                         }
    data = json.dumps(automl_information)

    endpoint = f"{base_url}/experiments/{experiment_id}"

    response = requests.patch(endpoint, data=data, headers=headers)
    assert response.status_code in (200, 201)

    print(response)
    print(json.dumps(response.json(), sort_keys=True, indent=4))

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/train/schema"

response = requests.get(endpoint, headers=headers)
assert response.status_code in (200, 201)

print(response)
#print(response.json()) ## Uncomment for verbose schema
assert "default" in response.json().keys()
train_specs = response.json()["default"]
print(json.dumps(train_specs, sort_keys=True, indent=4))

In [None]:
# Override any of the parameters listed in the previous cell as required
if model_name == "mask_rcnn":
    # The parameter key differs per network; for example, in mask_rcnn the training duration is set by num_epochs or total_steps
    train_specs["num_epochs"] = 5
    train_specs["gpus"] = 1
    train_specs["num_examples_per_epoch"] = 5000 # For mask-rcnn, set this to the number of images in your dataset divided by the number of GPUs
    train_specs["learning_rate_steps"] = "[100000,150000,200000]" # Keep each step below the total number of steps
elif model_name == "unet":
    train_specs["training_config"]["epochs"] = 50
    train_specs["gpus"] = 1
elif model_name == "segformer":
    train_specs["dataset"]["batch_size"] = 4
    train_specs["train"]["max_iters"] = 1000
    train_specs["train"]["num_gpus"] = 1
    train_specs["gpus"] = 1
print(json.dumps(train_specs, sort_keys=True, indent=4))

In [None]:
# Run action
parent = job_map.get("eval_dataset_convert_"+model_name, job_map.get("train_dataset_convert_"+model_name, None))
parent_id = eval_dataset_id
action = "train"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":train_specs})

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)
assert response.json()

print(response)
print(response.json())

job_map["train_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
# For automl: Training times for different models benchmarked on 1 GPU V100 machine can be found here: https://docs.nvidia.com/tao/tao-toolkit/text/automl/automl.html#results-of-automl-experiments

job_id = job_map["train_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    if "error_desc" in response.json().keys() and response.json()["error_desc"] in ("Job trying to retrieve not found", "No AutoML run found"):
        print("Job is being created")
        time.sleep(5)
        continue
    assert response.status_code in (200, 201)
    print(response)
    print(json.dumps(response.json(), sort_keys=True, indent=4))
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled"] or response.status_code not in (200,201):
        break
    time.sleep(15)

In [None]:
## To Stop an AutoML JOB
#    1. Stop the 'Monitor job status by repeatedly running this cell' cell (the cell right before this cell) manually
#    2. Uncomment the snippet in the next cell and run the cell

In [None]:
# if automl_enabled:
#     job_id = job_map["train_" + model_name]
#     endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}:cancel"

#     response = requests.post(endpoint, headers=headers)
#     assert response.status_code in (200, 201)

#     print(response)
#     print(response.json())

In [None]:
## Resume AutoML

In [None]:
# Uncomment the below snippet if you want to resume an already stopped AutoML job and then run the 'Monitor job status by repeatedly running this cell' cell above (4th cell above from this cell)
# if automl_enabled:
#     job_id = job_map["train_" + model_name]
#     endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}:resume"

#     data = json.dumps({"parent_job_id":parent,"specs":train_specs})
#     response = requests.post(endpoint, data=data, headers=headers)
#     assert response.status_code in (200, 201)

#     print(response)
#     print(response.json())

### Download train job artifacts <a class="anchor" id="head-12"></a>

In [None]:
# Example to list the files of the executed train job
job_id = job_map["train_" + model_name]
endpoint = f'{base_url}/experiments/{experiment_id}/jobs/{job_id}:list_files'

response = requests.get(endpoint, headers=headers)
print(json.dumps(response.json(), sort_keys=True, indent=4))

In [None]:
## Patch the model with the proper metric before training in order to run this cell. By default, loss is used, but some models don't log the parameter under the name 'loss'.

# # Download selective job contents once the above job shows "Done" status
# # Example to download selective files of train job (Note: will take time)
# endpoint = f'{base_url}/experiments/{experiment_id}/jobs/{job_id}:download_selective_files'

# file_lists = [] # Choose file names from the previous cell where all the files for this job were listed
# best_model = False # Enable this to download the checkpoint of the best performing model w.r.t. the metric chosen before starting training
# latest_model = True # Enable this to download the latest checkpoint of the training job; Disable best_model to use latest_model

# params = {"file_lists": file_lists, "best_model": best_model, "latest_model": latest_model}

# # Save
# temptar = f'{job_id}.tar.gz'
# with requests.get(endpoint, headers=headers, params=params, stream=True) as r:
#     r.raise_for_status()
#     with open(temptar, 'wb') as f:
#         for chunk in r.iter_content(chunk_size=8192):
#             f.write(chunk)

# print("Untarring")
# # Untar to destination
# tar_command = f'tar -xvf {temptar} -C {workdir}/'
# os.system(tar_command)
# os.remove(temptar)
# print(f"Results at {workdir}/{job_id}")
# model_downloaded_path = f"{workdir}/{job_id}"

In [None]:
# Downloading the train job artifacts can take a long time; this cell runs only if download_jobs is True
if download_jobs:
    endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)
    expected_file_size = response.json().get("job_tar_stats", {}).get("file_size")
    print("expected_file_size: ", expected_file_size)

    !python3 -m pip install tqdm
    from tqdm import tqdm

    endpoint = f'{base_url}/experiments/{experiment_id}/jobs/{job_id}:download'
    temptar = f'{job_id}.tar.gz'

    with tqdm(total=expected_file_size, unit='B', unit_scale=True) as progress_bar:
        while True:
            # Check if the file already exists
            headers_download_job = dict(headers)
            if os.path.exists(temptar):
                # Get the current file size
                file_size = os.path.getsize(temptar)
                print(f"File size of downloaded content until now is {file_size}")

                # If the file size matches the expected size, break out of the loop
                if file_size >= (expected_file_size-1):
                    print("Download completed successfully.")
                    print("Untarring")
                    # Untar to destination
                    tar_command = f'tar -xf {temptar} -C {workdir}/'
                    os.system(tar_command)
                    os.remove(temptar)
                    print(f"Results at {workdir}/{job_id}")
                    model_downloaded_path = f"{workdir}/{job_id}"
                    break

                # Set the headers to resume the download from where it left off
                headers_download_job['Range'] = f'bytes={file_size}-'
            # Open the file for writing in binary mode
            with open(temptar, 'ab') as f:
                try:
                    response = requests.get(endpoint, headers=headers_download_job, stream=True)
                    print(response)
                    # Check if the request was successful
                    if response.status_code in [200, 206]:
                        # Iterate over the content in chunks
                        for chunk in response.iter_content(chunk_size=1024):
                            if chunk:
                                # Write the chunk to the file
                                f.write(chunk)
                                # Flush and sync the file to disk
                                f.flush()
                                os.fsync(f.fileno())
                            progress_bar.update(len(chunk))
                    else:
                        print(f"Failed to download file. Status code: {response.status_code}")
                except requests.exceptions.RequestException as e:
                    print("Connection interrupted during download, resuming download from breaking point")
                    time.sleep(5)  # Sleep for a while before retrying the request
                    continue  # Continue the loop to retry the request

In [None]:
# View the checkpoints generated by the training job. For AutoML jobs, also view the best-performing model's config and the results of all AutoML experiments
if download_jobs:
    if automl_enabled:
        !python3 -m pip install pandas==1.5.1
        import pandas as pd
        model_downloaded_path = f"{model_downloaded_path}/best_model"
        assert glob.glob(f"{model_downloaded_path}/*.protobuf") or glob.glob(f"{model_downloaded_path}/*.yaml")

    assert os.path.exists(model_downloaded_path)
    assert (glob.glob(model_downloaded_path + "/**/*.tlt", recursive=True) + glob.glob(model_downloaded_path + "/**/*.hdf5", recursive=True) + glob.glob(model_downloaded_path + "/**/*.pth", recursive=True))

    if os.path.exists(model_downloaded_path):        
        #List the binary model file
        print("\nCheckpoints for the training experiment")
        if os.path.exists(model_downloaded_path+"/train/weights") and len(os.listdir(model_downloaded_path+"/train/weights")) > 0:
            print(f"Folder: {model_downloaded_path}/train/weights")
            print("Files:", os.listdir(model_downloaded_path+"/train/weights"))
        elif os.path.exists(model_downloaded_path+"/weights") and len(os.listdir(model_downloaded_path+"/weights")) > 0:
            print(f"Folder: {model_downloaded_path}/weights")
            print("Files:", os.listdir(model_downloaded_path+"/weights"))
        else:
            print(f"Folder: {model_downloaded_path}")
            print("Files:", os.listdir(model_downloaded_path))

        if automl_enabled:
            assert glob.glob(f"{model_downloaded_path}/*.protobuf") or glob.glob(f"{model_downloaded_path}/*.yaml")
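            # Use a context manager so the file handle is closed after reading
            with open(f"{model_downloaded_path}/controller.json", "r") as controller_file:
                experiment_artifacts = json.load(controller_file)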
            data_frame = pd.DataFrame(experiment_artifacts)
            # Print experiment id/number and the corresponding result
            print("\nResults of all experiments")
            with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.max_colwidth', None):
                print(data_frame[["id","result"]])

### Evaluate <a class="anchor" id="head-12"></a>
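
Before evaluating, the notebook chooses which checkpoint of the parent train job to use: `best_model`, `latest_model`, or `from_epoch_number`. A small sketch of building that PATCH body; `checkpoint_choice` is a hypothetical helper, and the `from_epoch_number_<train_job_id>` key format is taken from the commented example in this section rather than from the API reference:

```python
import json

VALID_METHODS = ("best_model", "latest_model", "from_epoch_number")


def checkpoint_choice(method, train_job_id=None, epoch=None, current=None):
    """Build the body for PATCH /experiments/{experiment_id} (hypothetical helper)."""
    if method not in VALID_METHODS:
        raise ValueError(f"method must be one of {VALID_METHODS}")
    body = {
        "checkpoint_choose_method": method,
        # Keep epoch selections already stored on the experiment
        "checkpoint_epoch_number": dict(current or {}),
    }
    if method == "from_epoch_number":
        if train_job_id is None or epoch is None:
            raise ValueError("from_epoch_number needs train_job_id and epoch")
        # Key format assumed from the notebook's commented example
        body["checkpoint_epoch_number"][f"from_epoch_number_{train_job_id}"] = epoch
    return body


latest = checkpoint_choice("latest_model")
pinned = checkpoint_choice("from_epoch_number", train_job_id="<train_job_id>", epoch=3)
print(json.dumps(pinned, indent=4))
```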

In [None]:
# Get model handler parameters
endpoint = f"{base_url}/experiments/{experiment_id}"
response = requests.get(endpoint, headers=headers)
assert response.status_code in (200, 201)
assert response.json()

model_parameters = response.json()
update_checkpoint_choosing = {}
update_checkpoint_choosing["checkpoint_choose_method"] = model_parameters["checkpoint_choose_method"]
update_checkpoint_choosing["checkpoint_epoch_number"] = model_parameters["checkpoint_epoch_number"]
print(update_checkpoint_choosing)

In [None]:
# Change the method by which checkpoint from the parent action is chosen, when parent action is a train/retrain action.
# Example for evaluate action below, can be applied in the same way for other actions too
update_checkpoint_choosing["checkpoint_choose_method"] = "latest_model" # Choose between best_model/latest_model/from_epoch_number
# If from_epoch_number is chosen then assign the epoch number to the dictionary key in the format 'from_epoch_number_{train_job_id}'
# update_checkpoint_choosing["checkpoint_epoch_number"]["from_epoch_number_28a2754e-50ef-43a8-9733-98913776dd90"] = 3
data = json.dumps(update_checkpoint_choosing)

endpoint = f"{base_url}/experiments/{experiment_id}"

response = requests.patch(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)
print(response)
print(json.dumps(response.json(), sort_keys=True, indent=4))

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/evaluate/schema"

response = requests.get(endpoint, headers=headers)
assert response.status_code in (200, 201)

print(response)
#print(response.json()) ## Uncomment for verbose schema
assert "default" in response.json().keys()
eval_specs = response.json()["default"]
print(json.dumps(eval_specs, sort_keys=True, indent=4))

In [None]:
# Apply changes to the specs if required
if model_name == "segformer":
    eval_specs["dataset"]["batch_size"] = 4
print(json.dumps(eval_specs, sort_keys=True, indent=4))

In [None]:
# Run action
parent = job_map["train_" + model_name]
action = "evaluate"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":eval_specs})

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)
assert response.json()

print(response)
print(response.json())

job_map["evaluate_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map["evaluate_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:    
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)
    print(response)
    print(response.json())
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled"] or response.status_code not in (200,201):
        break
    time.sleep(15)

### Prune, Retrain and Evaluation <a class="anchor" id="head-13"></a>

- We optimize the trained model by pruning and retraining it in the following cells. These cells are skipped for Segformer (each one is guarded by `model_name != "segformer"`).
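
Each action is submitted as a job whose `parent_job_id` points at the job that produced its input, so the optimize path forms a chain: train, prune, retrain, evaluate. A sketch of that chain as data, using the `job_map` key naming from this notebook; `OPTIMIZE_CHAIN` and `build_payload` are illustrative names, and the job IDs are placeholders:

```python
import json

# (action, job_map key prefix of the parent, job_map key prefix for the result)
OPTIMIZE_CHAIN = [
    ("prune", "train_", "prune_"),
    ("retrain", "prune_", "retrain_"),
    ("evaluate", "retrain_", "eval_retrain_"),
]


def build_payload(job_map, model_name, action, parent_prefix, specs):
    """Return the JSON body for POST /experiments/{experiment_id}/jobs."""
    parent = job_map[parent_prefix + model_name]  # KeyError if the parent never ran
    return json.dumps({"parent_job_id": parent, "action": action, "specs": specs})


# Placeholder IDs; in the notebook each one comes from the previous POST response
job_map = {"train_unet": "train-job-id"}
payloads = []
for action, parent_prefix, result_prefix in OPTIMIZE_CHAIN:
    payloads.append(json.loads(build_payload(job_map, "unet", action, parent_prefix, {})))
    job_map[result_prefix + "unet"] = f"{action}-job-id"  # stand-in for response.json()

for payload in payloads:
    print(payload["action"], "<-", payload["parent_job_id"])
```

Keeping the chain in one table makes it harder to submit a job against the wrong parent, for example retraining from the train job instead of the pruned one.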

#### Prune <a class="anchor" id="head-14"></a>

In [None]:
if model_name != "segformer":
    # Get default spec schema
    endpoint = f"{base_url}/experiments/{experiment_id}/specs/prune/schema"

    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)

    print(response)
    #print(response.json()) ## Uncomment for verbose schema
    assert "default" in response.json().keys()
    prune_specs = response.json()["default"]
    print(json.dumps(prune_specs, sort_keys=True, indent=4))

In [None]:
if model_name != "segformer":
    # Apply changes to specs if necessary, for example:
    prune_specs["pruning_threshold"] = 0.7
    print(json.dumps(prune_specs, sort_keys=True, indent=4))

In [None]:
if model_name != "segformer":
    # Run actions
    parent = job_map["train_" + model_name]
    action = "prune"
    data = json.dumps({"parent_job_id":parent,"action":action,"specs":prune_specs})

    endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

    response = requests.post(endpoint, data=data, headers=headers)
    assert response.status_code in (200, 201)
    assert response.json()

    print(response)
    print(response.json())

    job_map["prune_" + model_name] = response.json()
    print(job_map)

In [None]:
if model_name != "segformer":
    # Monitor job status by repeatedly running this cell (prune)
    job_id = job_map["prune_" + model_name]
    endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

    while True:    
        clear_output(wait=True)
        response = requests.get(endpoint, headers=headers)
        assert response.status_code in (200, 201)
        print(response)
        print(response.json())
        assert "status" in response.json().keys() and response.json().get("status") != "Error"
        if response.json().get("status") in ["Done","Error", "Canceled"] or response.status_code not in (200,201):
            break
        time.sleep(15)

#### Retrain <a class="anchor" id="head-15"></a>

In [None]:
if model_name != "segformer":
    # Get default spec schema
    endpoint = f"{base_url}/experiments/{experiment_id}/specs/retrain/schema"

    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)

    print(response)
    #print(response.json()) ## Uncomment for verbose schema
    assert "default" in response.json().keys()
    retrain_specs = response.json()["default"]

    print(json.dumps(retrain_specs, sort_keys=True, indent=4))

In [None]:
# Override any of the parameters listed in the previous cell as required
if model_name != "segformer":
    if model_name == "mask_rcnn":
        # The parameter keys differ per network; for mask_rcnn, training duration is determined by num_epochs or total_steps
        retrain_specs["num_epochs"] = 5
        retrain_specs["gpus"] = 1
        retrain_specs["num_examples_per_epoch"] = 5000 # For mask_rcnn, set to (number of images in your dataset) / (number of GPUs)
        retrain_specs["learning_rate_steps"] = "[100000,150000,200000]" # Set it less than the total number of steps
    elif model_name == "unet":
        retrain_specs["training_config"]["epochs"] = 50
        retrain_specs["gpus"] = 1
    print(json.dumps(retrain_specs, sort_keys=True, indent=4))

In [None]:
if model_name != "segformer":
    # Run actions
    parent = job_map["prune_" + model_name]
    action = "retrain"
    data = json.dumps({"parent_job_id":parent,"action":action,"specs":retrain_specs})

    endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

    response = requests.post(endpoint, data=data, headers=headers)
    assert response.status_code in (200, 201)
    assert response.json()

    print(response)
    print(response.json())

    job_map["retrain_" + model_name] = response.json()
    print(job_map)

In [None]:
if model_name != "segformer":
    # Monitor job status by repeatedly running this cell (retrain)
    job_id = job_map["retrain_" + model_name]
    endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

    while True:
        clear_output(wait=True)
        response = requests.get(endpoint, headers=headers)
        assert response.status_code in (200, 201)
        print(response)
        print(response.json())
        assert "status" in response.json().keys() and response.json().get("status") != "Error"
        if response.json().get("status") in ["Done","Error", "Canceled"] or response.status_code not in (200,201):
            break
        time.sleep(15)

In [None]:
# Optional cancel job - for jobs that are pending/running (retrain)

# job_id = job_map["retrain_" + model_name]
# endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}:cancel"

# response = requests.post(endpoint, headers=headers)
# assert response.status_code in (200, 201)

# print(response)
# print(response.json())

In [None]:
# Optional delete job - for jobs that are error/done (retrain)

# job_id = job_map["retrain_" + model_name]
# endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

# response = requests.delete(endpoint, headers=headers)
# assert response.status_code in (200, 201)

# print(response)
# print(response.json())

#### Evaluate after retrain <a class="anchor" id="head-15"></a>

In [None]:
# Get default spec schema
if model_name != "segformer":
    endpoint = f"{base_url}/experiments/{experiment_id}/specs/evaluate/schema"

    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)

    print(response)
    #print(response.json()) ## Uncomment for verbose schema
    assert "default" in response.json().keys()
    eval_retrain_specs = response.json()["default"]
    print(json.dumps(eval_retrain_specs, sort_keys=True, indent=4))

In [None]:
# Apply changes to the specs if required
if model_name != "segformer":
    print(json.dumps(eval_retrain_specs, sort_keys=True, indent=4))

In [None]:
if model_name != "segformer":
    # Run actions
    parent = job_map["retrain_" + model_name]
    action = "evaluate"
    data = json.dumps({"parent_job_id":parent,"action":action,"specs":eval_retrain_specs})

    endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

    response = requests.post(endpoint, data=data, headers=headers)
    assert response.status_code in (200, 201)
    assert response.json()

    print(response)
    print(response.json())

    job_map["eval_retrain_" + model_name] = response.json()
    print(job_map)

In [None]:
if model_name != "segformer":
    # Monitor job status by repeatedly running this cell (evaluate)
    job_id = job_map["eval_retrain_" + model_name]
    endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

    while True:    
        clear_output(wait=True)
        response = requests.get(endpoint, headers=headers)
        assert response.status_code in (200, 201)
        print(response)
        print(response.json())
        assert "status" in response.json().keys() and response.json().get("status") != "Error"
        if response.json().get("status") in ["Done","Error", "Canceled"] or response.status_code not in (200,201):
            break
        time.sleep(15)

### Export <a class="anchor" id="head-17"></a>

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/export/schema"

response = requests.get(endpoint, headers=headers)
assert response.status_code in (200, 201)

print(response)
#print(response.json()) ## Uncomment for verbose schema
assert "default" in response.json().keys()
export_specs = response.json()["default"]
print(json.dumps(export_specs, sort_keys=True, indent=4))

In [None]:
# Apply changes
print(json.dumps(export_specs, sort_keys=True, indent=4))

In [None]:
# Run action
parent = job_map["train_" + model_name]
action = "export"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":export_specs})

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)
assert response.json()

print(response)
print(response.json())

job_map["export_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map["export_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:    
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)
    print(response)
    print(response.json())
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled"] or response.status_code not in (200,201):
        break
    time.sleep(15)

### TRT Engine generation using TAO-Deploy <a class="anchor" id="head-19"></a>

- Here, we use the exported model to generate a TensorRT (TRT) engine on the target platform
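
Engine generation is submitted and monitored like every other action in this notebook: post the job, then poll its status until it reaches `Done`, `Error`, or `Canceled`. The repeated monitoring cells can be condensed into one helper. A sketch, assuming a caller-supplied `fetch_status` callable (hypothetical) that returns the job's JSON as a dict; a fake status source is used here so the example runs without a server:

```python
import time

TERMINAL_STATES = ("Done", "Error", "Canceled")


def wait_for_job(fetch_status, poll_seconds=15, max_polls=None, sleep=time.sleep):
    """Poll fetch_status() until the job reaches a terminal state.

    Returns the final status string, or raises TimeoutError after max_polls
    non-terminal polls.
    """
    polls = 0
    while True:
        status = fetch_status().get("status")
        if status in TERMINAL_STATES:
            return status
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"job still {status!r} after {polls} polls")
        sleep(poll_seconds)


# Fake status source: Pending, Running, then Done
responses = iter([{"status": "Pending"}, {"status": "Running"}, {"status": "Done"}])
final_status = wait_for_job(lambda: next(responses), sleep=lambda seconds: None)
print(final_status)
```

In the notebook, `fetch_status` could be `lambda: requests.get(endpoint, headers=headers).json()`, with `endpoint` and `headers` as defined in the monitoring cells.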

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/gen_trt_engine/schema"

response = requests.get(endpoint, headers=headers)
assert response.status_code in (200, 201)

print(response)
#print(response.json()) ## Uncomment for verbose schema
assert "default" in response.json().keys()
tao_deploy_specs = response.json()["default"]
print(json.dumps(tao_deploy_specs, sort_keys=True, indent=4))

In [None]:
# Apply changes to specs dictionary in this cell if required
if model_name == "segformer":
    tao_deploy_specs["gen_trt_engine"]["tensorrt"]["data_type"] = "fp16"
else:
    tao_deploy_specs["data_type"] = "int8"
print(json.dumps(tao_deploy_specs, sort_keys=True, indent=4))

In [None]:
# Run action
parent = job_map["export_" + model_name]
action = "gen_trt_engine"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":tao_deploy_specs})

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)
assert response.json()

print(response)
print(response.json())

job_map["model_gen_trt_engine_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map["model_gen_trt_engine_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:    
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)
    print(response)
    print(response.json())
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled"] or response.status_code not in (200,201):
        break
    time.sleep(15)

### TAO inference <a class="anchor" id="head-20"></a>

- Run inference on a set of images using the .tlt model created in the train step
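
After the inference job's output has been downloaded and untarred, the notebook displays the first `.jpg` or `.png` it finds. A sketch of that lookup which returns `None` instead of raising when no image is present; `first_image` is a hypothetical helper, and the directory layout used here is made up for the demonstration:

```python
import tempfile
from pathlib import Path


def first_image(root, extensions=(".jpg", ".png")):
    """Return the first image under root (sorted by path), or None if there is none."""
    matches = sorted(
        path for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in extensions
    )
    return matches[0] if matches else None


with tempfile.TemporaryDirectory() as tmp_dir:
    empty_result = first_image(tmp_dir)
    # Invented layout standing in for an untarred inference job
    out_dir = Path(tmp_dir) / "inference" / "images"
    out_dir.mkdir(parents=True)
    (out_dir / "b.png").write_bytes(b"")
    (out_dir / "a.jpg").write_bytes(b"")
    sample_name = first_image(tmp_dir).name

print(empty_result, sample_name)
```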

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/inference/schema"

response = requests.get(endpoint, headers=headers)
assert response.status_code in (200, 201)

print(response)
#print(response.json()) ## Uncomment for verbose schema
assert "default" in response.json().keys()
tao_inference_specs = response.json()["default"]
print(json.dumps(tao_inference_specs, sort_keys=True, indent=4))

In [None]:
# Apply changes to specs if necessary
if model_name == "segformer":
    tao_inference_specs["dataset"]["batch_size"] = 4
print(json.dumps(tao_inference_specs, sort_keys=True, indent=4))

In [None]:
# Run action
parent = job_map["train_" + model_name]
action = "inference"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":tao_inference_specs})

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)
assert response.json()

print(response)
print(response.json())

job_map["inference_tlt_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map["inference_tlt_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:    
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)
    print(response)
    print(response.json())
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled"] or response.status_code not in (200,201):
        break
    time.sleep(15)

In [None]:
# Download job contents once the above job shows "Done" status
if download_jobs:
    job_id = job_map["inference_tlt_" + model_name]
    endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)
    expected_file_size = response.json().get("job_tar_stats", {}).get("file_size")
    print("expected_file_size: ", expected_file_size)

    !python3 -m pip install tqdm
    from tqdm import tqdm

    endpoint = f'{base_url}/experiments/{experiment_id}/jobs/{job_id}:download'
    temptar = f'{job_id}.tar.gz'

    with tqdm(total=expected_file_size, unit='B', unit_scale=True) as progress_bar:
        while True:
            # Check if the file already exists
            headers_download_job = dict(headers)
            if os.path.exists(temptar):
                # Get the current file size
                file_size = os.path.getsize(temptar)
                print(f"File size of downloaded content until now is {file_size}")

                # If the file size matches the expected size, break out of the loop
                if file_size >= (expected_file_size-1):
                    print("Download completed successfully.")
                    print("Untarring")
                    # Untar to destination
                    tar_command = f'tar -xf {temptar} -C {workdir}/'
                    os.system(tar_command)
                    os.remove(temptar)
                    print(f"Results at {workdir}/{job_id}")
                    inference_out_path = f"{workdir}/{job_id}"
                    break

                # Set the headers to resume the download from where it left off
                headers_download_job['Range'] = f'bytes={file_size}-'
            # Open the file for writing in binary mode
            with open(temptar, 'ab') as f:
                try:
                    response = requests.get(endpoint, headers=headers_download_job, stream=True)
                    print(response)
                    # Check if the request was successful
                    if response.status_code in [200, 206]:
                        # Iterate over the content in chunks
                        for chunk in response.iter_content(chunk_size=1024):
                            if chunk:
                                # Write the chunk to the file
                                f.write(chunk)
                                # Flush and sync the file to disk
                                f.flush()
                                os.fsync(f.fileno())
                            progress_bar.update(len(chunk))
                    else:
                        print(f"Failed to download file. Status code: {response.status_code}")
                except requests.exceptions.RequestException as e:
                    print("Connection interrupted during download, resuming download from breaking point")
                    time.sleep(5)  # Sleep for a while before retrying the request
                    continue  # Continue the loop to retry the request

In [None]:
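if download_jobs:
    from IPython.display import display
    sample_image = (glob.glob(f"{inference_out_path}/**/*.jpg", recursive=True) + glob.glob(f"{inference_out_path}/**/*.png", recursive=True))[0]
    # A bare expression inside an `if` block is not rendered, so display the image explicitly
    display(Image(filename=sample_image))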

### TRT inference <a class="anchor" id="head-21"></a>

- Set the batch size to the value used during TRT engine generation
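
One way to check this before submitting the job is to compare every `batch_size` entry in the two spec dictionaries. The sketch here walks a nested spec generically because the exact key paths differ per network; the spec contents are invented examples, not real TAO defaults:

```python
def find_values(spec, key):
    """Yield (path, value) for every occurrence of key in a nested dict/list spec."""
    stack = [((), spec)]
    while stack:
        path, node = stack.pop()
        if isinstance(node, dict):
            for name, value in node.items():
                if name == key:
                    yield path + (name,), value
                stack.append((path + (name,), value))
        elif isinstance(node, list):
            for index, value in enumerate(node):
                stack.append((path + (index,), value))


def batch_sizes(spec):
    """Return the set of distinct batch_size values found anywhere in spec."""
    return {value for _, value in find_values(spec, "batch_size")}


# Invented example specs
engine_specs = {"gen_trt_engine": {"tensorrt": {"data_type": "fp16", "batch_size": 1}}}
inference_specs = {"dataset": {"batch_size": 4}}

mismatch = batch_sizes(engine_specs) != batch_sizes(inference_specs)
inference_specs["dataset"]["batch_size"] = 1
aligned = batch_sizes(engine_specs) == batch_sizes(inference_specs)
print(mismatch, aligned)
```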

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/inference/schema"

response = requests.get(endpoint, headers=headers)
assert response.status_code in (200, 201)

print(response)
#print(response.json()) ## Uncomment for verbose schema
assert "default" in response.json().keys()
trt_inference_specs = response.json()["default"]
print(json.dumps(trt_inference_specs, sort_keys=True, indent=4))

In [None]:
# Apply changes to specs if necessary
if model_name == "segformer":
    trt_inference_specs["dataset"]["batch_size"] = 1
print(json.dumps(trt_inference_specs, sort_keys=True, indent=4))

In [None]:
# Run action
parent = job_map["model_gen_trt_engine_" + model_name]
action = "inference"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":trt_inference_specs})

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)
assert response.json()

print(response)
print(response.json())

job_map["inference_trt_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map["inference_trt_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:    
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)
    print(response)
    print(response.json())
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled"] or response.status_code not in (200,201):
        break
    time.sleep(15)

In [None]:
# Download job contents once the above job shows "Done" status
if download_jobs:
    job_id = job_map["inference_trt_" + model_name]
    endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)
    expected_file_size = response.json().get("job_tar_stats", {}).get("file_size")
    print("expected_file_size: ", expected_file_size)

    !python3 -m pip install tqdm
    from tqdm import tqdm

    endpoint = f'{base_url}/experiments/{experiment_id}/jobs/{job_id}:download'
    temptar = f'{job_id}.tar.gz'

    with tqdm(total=expected_file_size, unit='B', unit_scale=True) as progress_bar:
        while True:
            # Check if the file already exists
            headers_download_job = dict(headers)
            if os.path.exists(temptar):
                # Get the current file size
                file_size = os.path.getsize(temptar)
                print(f"File size of downloaded content until now is {file_size}")

                # If the file size matches the expected size, break out of the loop
                if file_size >= (expected_file_size-1):
                    print("Download completed successfully.")
                    print("Untarring")
                    # Untar to destination
                    tar_command = f'tar -xf {temptar} -C {workdir}/'
                    os.system(tar_command)
                    os.remove(temptar)
                    print(f"Results at {workdir}/{job_id}")
                    inference_out_path = f"{workdir}/{job_id}"
                    break

                # Set the headers to resume the download from where it left off
                headers_download_job['Range'] = f'bytes={file_size}-'
            # Open the file for writing in binary mode
            with open(temptar, 'ab') as f:
                try:
                    response = requests.get(endpoint, headers=headers_download_job, stream=True)
                    print(response)
                    # Check if the request was successful
                    if response.status_code in [200, 206]:
                        # Iterate over the content in chunks
                        for chunk in response.iter_content(chunk_size=1024):
                            if chunk:
                                # Write the chunk to the file
                                f.write(chunk)
                                # Flush and sync the file to disk
                                f.flush()
                                os.fsync(f.fileno())
                            progress_bar.update(len(chunk))
                    else:
                        print(f"Failed to download file. Status code: {response.status_code}")
                except requests.exceptions.RequestException as e:
                    print("Connection interrupted during download, resuming download from breaking point")
                    time.sleep(5)  # Sleep for a while before retrying the request
                    continue  # Continue the loop to retry the request

In [None]:
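if download_jobs:
    from IPython.display import display
    sample_image = (glob.glob(f"{inference_out_path}/**/*.jpg", recursive=True) + glob.glob(f"{inference_out_path}/**/*.png", recursive=True))[0]
    # A bare expression inside an `if` block is not rendered, so display the image explicitly
    display(Image(filename=sample_image))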

### Delete experiment <a class="anchor" id="head-21"></a>

In [None]:
endpoint = f"{base_url}/experiments/{experiment_id}"

response = requests.delete(endpoint,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(response.json())

### Delete dataset <a class="anchor" id="head-21"></a>

#### Delete train dataset <a class="anchor" id="head-21"></a>

In [None]:
endpoint = f"{base_url}/datasets/{train_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(response.json())

#### Delete val dataset <a class="anchor" id="head-21"></a>

In [None]:
endpoint = f"{base_url}/datasets/{eval_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(response.json())