### TAO remote client - Classification

Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique in which a model trained on one task is retrained for use on a different task. The Train Adapt Optimize (TAO) Toolkit is an easy-to-use, Python-based AI toolkit for taking purpose-built AI models and customizing them with your own data.

![image](https://d29g4g2dyqv443.cloudfront.net/sites/default/files/akamai/TAO/tlt-tao-toolkit-bring-your-own-model-diagram.png)

### Sample prediction for an Image Classification model
<img align="center" src="../example_images/sample_image_classification.jpg">

### The workflow in a nutshell

- Create a dataset
- Upload the dataset to the service
- Get a pretrained model (PTM) from NGC
- Model actions
    - Train (normal/AutoML)
    - Evaluate
    - Prune, retrain
    - Export
    - TAO Deploy
    - Inference on TAO

### Table of contents

1. [Install TAO remote client](#head-1)
1. [Set the remote service base URL](#head-2)
1. [Access the shared volume](#head-3)
1. [Create the datasets](#head-4)
1. [List datasets](#head-5)
1. [Create an experiment](#head-8)
1. [Find pretrained model](#head-9)
1. [Customize model metadata](#head-10)
1. [View hyperparameters that are enabled for AutoML by default](#head-11)
1. [Set AutoML related configurations](#head-12)
1. [Provide train specs](#head-13)
1. [Run train](#head-14)
1. [View checkpoint files](#head-15)
1. [Provide evaluate specs](#head-16)
1. [Run evaluate](#head-17)
1. [Provide prune specs](#head-18)
1. [Run prune](#head-19)
1. [Provide retrain specs](#head-20)
1. [Run retrain](#head-21)
1. [Run evaluate on retrain](#head-21-1)
1. [Provide export specs](#head-22)
1. [Run export](#head-23)
1. [Provide trt engine generation specs](#head-26)
1. [Run TRT Engine generation using TAO-Deploy](#head-27)
1. [Provide TAO inference specs](#head-28)
1. [Run TAO inference](#head-29)
1. [Delete experiment](#head-30)
1. [Delete datasets](#head-31)
1. [Unmount shared volume](#head-32)
1. [Uninstall TAO Remote Client](#head-33)

### Requirements
The server requirements are listed [here](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#).

In [None]:
import os
import glob
import subprocess
import json
import time
import ast
from IPython.display import clear_output

In [None]:
namespace = 'default'

### FIXME

1. Assign a model_name in FIXME 1
1. Assign a workdir in FIXME 2
1. Assign the ip_address and port_number in FIXME 3 ([info](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_rest_api.html))
1. Assign the ngc_api_key variable in FIXME 4
1. (Optional) Enable AutoML if needed in FIXME 5
1. (Optional) Choose between bayesian and hyperband for automl_algorithm in FIXME 6 (only if AutoML was enabled in FIXME 5)
1. Choose to download jobs or not in FIXME 7
1. Choose between default and custom dataset in FIXME 8
1. Assign path of DATA_DIR in FIXME 9
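
Before running the remaining cells, it can help to confirm that none of the template values listed above was left unedited. The sketch below is not part of the TAO client: `unresolved_placeholders` is a hypothetical helper that flags any setting whose value still contains an angle-bracket placeholder such as `<ip_address>`.

```python
import re

# Matches angle-bracket placeholders such as <ip_address> or <ngc_api_key>
PLACEHOLDER = re.compile(r"<[A-Za-z_]+>")

def unresolved_placeholders(settings):
    """Return the names of settings whose value still contains a <placeholder>."""
    return sorted(
        name for name, value in settings.items()
        if isinstance(value, str) and PLACEHOLDER.search(value)
    )

# Example with the template values used later in this notebook
example = {
    "host_url": "http://<ip_address>:<port_number>",
    "ngc_api_key": "<ngc_api_key>",
    "workdir": "workdir_classification",
}
print(unresolved_placeholders(example))
```

An empty list means every checked value has been filled in; the check says nothing about whether the values are actually valid.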

In [None]:
# Available models (#FIXME 1):
# 1. classification_pyt - https://docs.nvidia.com/tao/tao-toolkit/text/image_classification.html
# 2. classification_tf1 - https://docs.nvidia.com/tao/tao-toolkit/text/image_classification.html
# 3. classification_tf2 - https://docs.nvidia.com/tao/tao-toolkit/text/image_classification_tf2.html
# 4. multitask_classification - https://docs.nvidia.com/tao/tao-toolkit/text/multitask_image_classification.html
# Here, "classification" means multi-class classification

model_name = "multitask_classification"  # FIXME1 (Add the model name from the above mentioned list)

### Install TAO remote client <a class="anchor" id="head-1"></a>

In [None]:
# SKIP this step IF you have already installed the TAO-Client wheel.
! pip3 install nvidia-transfer-learning-client

In [None]:
# View the version of the TAO-Client
! nvtl --version

### Set the remote service base URL <a class="anchor" id="head-2"></a>

In [None]:
# Define the node_addr and port number
workdir = "workdir_classification" # FIXME2
host_url = "http://<ip_address>:<port_number>" # FIXME3 example: https://10.137.149.22:32334
# On the host machine, the node ip_address and port number can be obtained as follows:
# ip_address: hostname -i
# port_number: kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'

ngc_api_key = "<ngc_api_key>" # FIXME4 example: (Add NGC API key)

In [None]:
automl_enabled = False # FIXME5 set to True if you want to run automl for the model chosen in the previous cell
automl_algorithm="bayesian" # FIXME6 example: bayesian/hyperband
# FIXME7 Defaults to False because downloading jobs from the service to your machine takes time.
# Set to True to download jobs in the cells that provide download examples (train, export, inference).
download_jobs = False

In [None]:
%env BASE_URL={host_url}/{namespace}/api/v1

# Exchange NGC_API_KEY for JWT
identity = json.loads(subprocess.getoutput(f"nvtl login --ngc-api-key {ngc_api_key}"))

%env USER={identity['user_id']}
%env TOKEN={identity['token']}

In [None]:
# Creating workdir
workdir = os.path.abspath(workdir)
if not os.path.isdir(workdir):
    os.makedirs(workdir)

### Function to parse logs <a class="anchor" id="head-1.1"></a>

In [None]:
def my_tail(model_name_cli, id, job_id, job_type, workdir):
	status = None
	while True:
		time.sleep(10)
		clear_output(wait=True)
		log_file_path = subprocess.getoutput(f"nvtl {model_name_cli} get-log-file --id {id} --job {job_id} --job_type {job_type} --workdir {workdir}")
		if not os.path.exists(log_file_path):
			continue
		with open(log_file_path, 'rb') as log_file:
			log_contents = log_file.read()
		log_content_lines = log_contents.decode("utf-8").split("\n")        
		for line in log_content_lines:
			print(line.strip())
			if line.strip() == "Error EOF":
				status = "Error"
				break
			elif line.strip() == "Done EOF":
				status = "Done"
				break
		if status is not None:
			break
	return status
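
`my_tail` decides that a job has finished by looking for the sentinel lines `Done EOF` and `Error EOF` in the downloaded log. As a minimal sketch, the same check can be isolated in a pure function and tested without a running service; `status_from_log` is a hypothetical name, not part of the TAO client.

```python
def status_from_log(log_text):
    """Return "Done" or "Error" for the first sentinel line found, else None."""
    sentinels = {"Done EOF": "Done", "Error EOF": "Error"}
    for line in log_text.split("\n"):
        status = sentinels.get(line.strip())
        if status is not None:
            return status
    return None

print(status_from_log("epoch 1\nepoch 2\nDone EOF\n"))
print(status_from_log("epoch 1\n"))
```

A return value of `None` corresponds to the case in which `my_tail` keeps polling.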

### Function to split tar files <a class="anchor" id="head-1.1"></a>

In [None]:
import os
import tarfile

def split_tar_file(input_tar_path, output_dir, max_split_size=0.2*1024*1024*1024):
	os.makedirs(output_dir, exist_ok=True)
	
	with tarfile.open(input_tar_path, 'r') as original_tar:
		members = original_tar.getmembers()
		current_split_size = 0
		current_split_number = 0
		current_split_name = os.path.join(output_dir, f'smaller_file_{current_split_number}.tar')

		# Manage the split archive by hand: a `with` block would only close the first split
		split_tar = tarfile.open(current_split_name, 'w')
		try:
			for member in members:
				if current_split_size + member.size > max_split_size:
					# Size limit reached: finish this split and start the next one
					split_tar.close()
					current_split_number += 1
					current_split_name = os.path.join(output_dir, f'smaller_file_{current_split_number}.tar')
					current_split_size = 0
					split_tar = tarfile.open(current_split_name, 'w')
				split_tar.addfile(member, original_tar.extractfile(member))
				current_split_size += member.size
		finally:
			# Close the last split so that its end-of-archive marker is written
			split_tar.close()
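
After splitting, one way to check that nothing was dropped is to compare the member names across all splits with those of the original archive. The sketch below assumes only the `smaller_file_<n>.tar` naming used by `split_tar_file`; `members_in_splits` and `_write_tar` are hypothetical helpers, and two hand-made archives stand in for real splits so that the example runs on its own.

```python
import glob
import io
import os
import tarfile
import tempfile

def members_in_splits(output_dir):
    """Collect the member names stored across all smaller_file_*.tar splits."""
    names = []
    for split_path in sorted(glob.glob(os.path.join(output_dir, "smaller_file_*.tar"))):
        with tarfile.open(split_path, "r") as split_tar:
            names.extend(split_tar.getnames())
    return sorted(names)

def _write_tar(path, files):
    """Write a small tar archive from a {name: bytes} mapping (demo helper)."""
    with tarfile.open(path, "w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

# Demo: two hand-made splits stand in for the output of split_tar_file
with tempfile.TemporaryDirectory() as demo_dir:
    _write_tar(os.path.join(demo_dir, "smaller_file_0.tar"), {"a.jpg": b"123"})
    _write_tar(os.path.join(demo_dir, "smaller_file_1.tar"), {"b.jpg": b"4567"})
    demo_names = members_in_splits(demo_dir)
print(demo_names)
```

In the notebook, the result for a real `output_dir` could be compared with `sorted(tarfile.open(input_tar_path).getnames())`.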

### Set dataset type, format <a class="anchor" id="head-1.1"></a>

**For multi-class classification:**

We will use the Pascal VOC dataset for this tutorial. For more details, see the [VOC2012 page](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html#devkit). Download the dataset archive from [here](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar) into the directory that `DATA_DIR` points to.

**If you use a custom dataset, it should follow the structure below, and you should skip the "Split dataset into train and val sets" step.**
```
DATA_DIR
├── classes.txt
├── images_test
│   ├── class_name_1
│   │   ├── image_name_1.jpg
│   │   ├── image_name_2.jpg
│   │   ├── ...
│   │   ...
│   └── class_name_n
│       ├── image_name_3.jpg
│       ├── image_name_4.jpg
│       ├── ...
├── images_train
│   ├── class_name_1
│   │   ├── image_name_5.jpg
│   │   ├── image_name_6.jpg
│   │   ...
│   └── class_name_n
│       ├── image_name_7.jpg
│       ├── image_name_8.jpg
│       ├── ...
│
└── images_val
    ├── class_name_1
    │   ├── image_name_9.jpg
    │   ├── image_name_10.jpg
    │   ├── ...
    │   ...
    └── class_name_n
        ├── image_name_11.jpg
        ├── image_name_12.jpg
        ├── ...
```
- Each class name folder should contain the images corresponding to that class
- Same class name folders should be present across images_test, images_train and images_val
- classes.txt is a file that contains the names of all classes (one name per line)
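
The three rules above can be checked mechanically before uploading a custom dataset. The following is a minimal sketch, not part of the TAO tooling: `check_classification_layout` is a hypothetical helper that compares the class folders in each split with the names in `classes.txt`.

```python
import os
import tempfile

SPLITS = ("images_train", "images_val", "images_test")

def check_classification_layout(data_dir):
    """Return a list of problems found in a multi-class dataset directory."""
    problems = []
    classes_file = os.path.join(data_dir, "classes.txt")
    if not os.path.isfile(classes_file):
        return ["classes.txt is missing"]
    with open(classes_file) as handle:
        expected = {line.strip() for line in handle if line.strip()}
    for split in SPLITS:
        split_dir = os.path.join(data_dir, split)
        if not os.path.isdir(split_dir):
            problems.append(f"{split} is missing")
            continue
        found = {d for d in os.listdir(split_dir)
                 if os.path.isdir(os.path.join(split_dir, d))}
        if found != expected:
            problems.append(f"{split} class folders differ from classes.txt")
    return problems

# Demo on a throwaway directory with two classes
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "classes.txt"), "w") as handle:
        handle.write("cat\ndog\n")
    for split in SPLITS:
        for name in ("cat", "dog"):
            os.makedirs(os.path.join(demo_dir, split, name))
    ok_report = check_classification_layout(demo_dir)
    # Remove one class folder to show how a mismatch is reported
    os.rmdir(os.path.join(demo_dir, "images_val", "dog"))
    bad_report = check_classification_layout(demo_dir)
print(ok_report, bad_report)
```

The helper only inspects folder names; it does not verify that each folder actually contains images.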

**For multi-task classification:**

We will use the Fashion Product Images (Small) dataset for this tutorial. The dataset is available on Kaggle. In this tutorial, the trained classification network performs three tasks: article category classification, base color classification, and target season classification.

To download the dataset, you need a Kaggle account. After logging in, you can download the dataset zip file [here](https://www.kaggle.com/paramaggarwal/fashion-product-images-small). The downloaded file is `archive.zip`, which contains a subfolder called `myntradataset`. Unzip the contents of this subfolder to your workdir created in the cell above; you should then have a folder called `images` and a CSV file called `styles.csv`.

**If you use a custom dataset, it should follow this structure:**
```
DATA_DIR
├── images
│   ├── image_name_1.jpg
│   ├── image_name_2.jpg
│   ├── ...
└── styles.csv
```

In [None]:
if model_name == "classification_pyt":
    ds_format = "classification_pyt"
elif "classification_" in model_name:
    ds_format = "default"
elif model_name == "multitask_classification":
    ds_format = "custom"
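
The cell above leaves `ds_format` undefined when `model_name` is not one of the four supported names, which only surfaces later as a `NameError`. As a minimal sketch, the same mapping can be written as a function that fails early; `dataset_format_for` is a hypothetical name.

```python
def dataset_format_for(model_name):
    """Map a model name to the dataset format used in this notebook."""
    if model_name == "classification_pyt":
        return "classification_pyt"
    if model_name in ("classification_tf1", "classification_tf2"):
        return "default"
    if model_name == "multitask_classification":
        return "custom"
    # Fail immediately instead of leaving the format undefined
    raise ValueError(f"Unsupported model_name: {model_name!r}")

print(dataset_format_for("multitask_classification"))
```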

In [None]:
dataset_to_be_used = "default" # FIXME8 example: default/custom; default for the dataset used in this tutorial notebook; custom for a different dataset
DATA_DIR = model_name # FIXME9
os.environ['DATA_DIR']= DATA_DIR
!mkdir -p $DATA_DIR
job_map = {}

### Dataset download and pre-processing <a class="anchor" id="head-1"></a>

In [None]:
if dataset_to_be_used == "default":
    if "classification_" in model_name:
        assert os.path.exists(os.path.join(DATA_DIR,"VOCtrainval_11-May-2012.tar"))
        !tar -xf $DATA_DIR/VOCtrainval_11-May-2012.tar -C $DATA_DIR
        assert (os.path.exists(f"{DATA_DIR}/VOCdevkit/"))
        !rm -rf $DATA_DIR/split
    elif model_name == "multitask_classification":
        assert os.path.exists(os.path.join(DATA_DIR,"archive.zip"))
        !unzip -uq $DATA_DIR/archive.zip -d $DATA_DIR/
        assert (os.path.exists(f"{DATA_DIR}/images"))
        assert (os.path.exists(f"{DATA_DIR}/styles.csv"))
        # Create subdirectories and remove existing files in them
        !mkdir -p $DATA_DIR/images_train && rm -rf $DATA_DIR/images_train/*
        !mkdir -p $DATA_DIR/images_val && rm -rf $DATA_DIR/images_val/*
        !mkdir -p $DATA_DIR/images_test && rm -rf $DATA_DIR/images_test/*            

#### Split dataset into train and val sets

In [None]:
!python3 -m pip install numpy pandas==1.5.1 tqdm
if "classification_" in model_name and dataset_to_be_used == "default":
    !python3 ../dataset_prepare/classification/dataset_split.py
    assert (os.path.exists(f"{DATA_DIR}/split/images_train/"))
    assert (os.path.exists(f"{DATA_DIR}/split/images_val/"))
    assert (os.path.exists(f"{DATA_DIR}/split/images_test/"))
elif model_name == "multitask_classification" and dataset_to_be_used == "default":
    !python3 ../dataset_prepare/multitask_classification/dataset_split.py
    assert (os.path.exists(f"{DATA_DIR}/images_train/"))
    assert (os.path.exists(f"{DATA_DIR}/images_val/"))
    assert (os.path.exists(f"{DATA_DIR}/images_test/"))
    assert (os.path.exists(f"{DATA_DIR}/train.csv"))
    assert (os.path.exists(f"{DATA_DIR}/val.csv"))

### Create Tar files to upload

In [None]:
if "classification_" in model_name:
    !tar -C $DATA_DIR/split/ -czf classification_train.tar.gz images_train classes.txt
    !tar -C $DATA_DIR/split/ -czf classification_val.tar.gz images_val classes.txt
    !tar -C $DATA_DIR/split/ -czf classification_test.tar.gz images_test classes.txt
elif model_name == "multitask_classification":
    !tar -C $DATA_DIR/ -czf mt_classification_train.tar.gz images_train train.csv val.csv
    !tar -C $DATA_DIR/ -czf mt_classification_val.tar.gz images_val val.csv
    !tar -C $DATA_DIR/ -czf mt_classification_test.tar.gz images_test

In [None]:
if "classification_" in model_name:
    train_dataset_path =  "classification_train.tar.gz"
    eval_dataset_path = "classification_val.tar.gz"
    test_dataset_path = "classification_test.tar.gz"
elif model_name == "multitask_classification":
    train_dataset_path =  "mt_classification_train.tar.gz"
    eval_dataset_path = "mt_classification_val.tar.gz"
    test_dataset_path = "mt_classification_test.tar.gz"

### Create and upload train dataset <a class="anchor" id="head-1.2"></a>

In [None]:
train_dataset_id = subprocess.getoutput(f"nvtl {model_name} dataset-create --dataset_type image_classification --dataset_format {ds_format}")
print(train_dataset_id)

In [None]:
output_dir = os.path.join(os.path.dirname(os.path.abspath(train_dataset_path)), model_name, "train")
split_tar_file(train_dataset_path, output_dir)
for idx, tar_dataset_path in enumerate(os.listdir(output_dir)):
    print(f"Uploading {idx+1}/{len(os.listdir(output_dir))} tar split")
    upload_train_dataset_message = subprocess.getoutput(f"nvtl {model_name} dataset-upload --id {train_dataset_id} --path {os.path.join(output_dir,tar_dataset_path)}")
    print(upload_train_dataset_message)

### Create and upload val dataset <a class="anchor" id="head-1.3"></a>

In [None]:
eval_dataset_id = subprocess.getoutput(f"nvtl {model_name} dataset-create --dataset_type image_classification --dataset_format {ds_format}")
print(eval_dataset_id)

In [None]:
output_dir = os.path.join(os.path.dirname(os.path.abspath(eval_dataset_path)), model_name, "val")
split_tar_file(eval_dataset_path, output_dir)
for idx, tar_dataset_path in enumerate(os.listdir(output_dir)):
    print(f"Uploading {idx+1}/{len(os.listdir(output_dir))} tar split")
    upload_val_dataset_message = subprocess.getoutput(f"nvtl {model_name} dataset-upload --id {eval_dataset_id} --path {os.path.join(output_dir,tar_dataset_path)}")
    print(upload_val_dataset_message)

### Create and upload test dataset <a class="anchor" id="head-1.4"></a>

In [None]:
test_dataset_id = subprocess.getoutput(f"nvtl {model_name} dataset-create --dataset_type image_classification --dataset_format raw")
print(test_dataset_id)

In [None]:
output_dir = os.path.join(os.path.dirname(os.path.abspath(test_dataset_path)), model_name, "test")
split_tar_file(test_dataset_path, output_dir)
for idx, tar_dataset_path in enumerate(os.listdir(output_dir)):
    print(f"Uploading {idx+1}/{len(os.listdir(output_dir))} tar split")
    upload_test_dataset_message = subprocess.getoutput(f"nvtl {model_name} dataset-upload --id {test_dataset_id} --path {os.path.join(output_dir,tar_dataset_path)}")
    print(upload_test_dataset_message)

### List the created datasets <a class="anchor" id="head-5"></a>

In [None]:
message = subprocess.getoutput(f"nvtl {model_name} list-datasets")
message = ast.literal_eval(message)
for rsp in message:
    rsp_keys = rsp.keys()
    assert "id" in rsp_keys
    assert "type" in rsp_keys
    assert "format" in rsp_keys
    assert "name" in rsp_keys
    print(rsp["id"],"\t",rsp["type"],"\t",rsp["format"],"\t\t",rsp["name"])
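
The listing cell parses the CLI output with `ast.literal_eval`, while other cells in this notebook use `json.loads`. If the command prints an error message instead of a list, either parser raises an exception that does not show the offending text. A minimal sketch of a more forgiving parser follows; `parse_cli_output` is a hypothetical helper, and the exact output format of `nvtl` is not assumed beyond what the cells here already rely on.

```python
import ast
import json

def parse_cli_output(text):
    """Parse CLI output as JSON, falling back to a Python literal."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError) as err:
        # Include the start of the raw output so the real error is visible
        raise ValueError(f"Unparseable CLI output: {text[:200]!r}") from err

print(parse_cli_output('[{"id": "a"}]'))
print(parse_cli_output("[{'id': 'a', 'ok': True}]"))
```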

### Create an experiment <a class="anchor" id="head-8"></a>

In [None]:
network_arch = model_name
if "classification_" in network_arch:
    encode_key = "nvidia_tlt"
else:
    encode_key = "tlt_encode"
experiment_id = subprocess.getoutput(f"nvtl {model_name} experiment-create --network_arch {network_arch} --encryption_key {encode_key} ")
print(experiment_id)

### Assign train, eval datasets <a class="anchor" id="head-10"></a>

In [None]:
dataset_information = {"train_datasets":[train_dataset_id],
                       "eval_dataset":eval_dataset_id,
                       "inference_dataset":test_dataset_id,
                       "calibration_dataset":train_dataset_id}
patched_model = subprocess.getoutput(f"nvtl {model_name} patch-artifact-metadata --id {experiment_id} --job_type experiment --update_info '{json.dumps(dataset_information)}' ")
print(patched_model)
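
The command above wraps the JSON payload in single quotes by hand. That works for the IDs used here, but it breaks if any value contains a single quote, because `subprocess.getoutput` runs the command through a shell. A minimal sketch using the standard library's `shlex.quote` follows; `update_info_argument` is a hypothetical helper.

```python
import json
import shlex

def update_info_argument(info):
    """Serialize a dict to JSON and quote it for safe use in a shell command."""
    return shlex.quote(json.dumps(info))

payload = {"name": "Bob's experiment"}
quoted = update_info_argument(payload)
# shlex.split undoes the quoting the same way a POSIX shell would
round_trip = json.loads(shlex.split(f"echo {quoted}")[1])
print(round_trip == payload)
```

With this helper, the argument would be written as `--update_info {update_info_argument(dataset_information)}`, without the surrounding single quotes.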

### List experiments <a class="anchor" id="head-5"></a>

In [None]:
# List all pretrained models for the chosen network architecture
filter_params = {"network_arch": network_arch}
message = subprocess.getoutput(f"nvtl {model_name} list-experiments --filter_params '{json.dumps(filter_params)}'")
message = ast.literal_eval(message)
for rsp in message:
    rsp_keys = rsp.keys()
    if "encryption_key" not in rsp.keys():
        assert "name" in rsp_keys and "version" in rsp_keys and "ngc_path" in rsp_keys and "additional_id_info" in rsp_keys
        print(f'PTM Name: {rsp["name"]}; PTM version: {rsp["version"]}; NGC PATH: {rsp["ngc_path"]}; Additional info: {rsp["additional_id_info"]}')

### Assign PTM <a class="anchor" id="head-7"></a>

Search NGC for a PTM that matches the chosen classification model.

In [None]:
# Assigning pretrained models to different classification models
# Using the output of the previous cell, edit this map if you want to change the default PTM backbone.
# Changing the default backbone also requires changing the default spec/config for train, evaluate, etc.
# For example, if you change the PTM to resnet34, you must manually set the config key num_layers (if it exists) to 34.
pretrained_map = {"classification_tf1" : "pretrained_classification:resnet18",
                  "classification_tf2" : "pretrained_classification_tf2:efficientnet_b0",
                  "classification_pyt" : "pretrained_fan_classification_imagenet:fan_hybrid_tiny",
                  "multitask_classification" : "pretrained_classification:resnet10"}
no_ptm_models = set()

In [None]:
if network_arch not in no_ptm_models:
    filter_params = {"network_arch": network_arch}
    message = subprocess.getoutput(f"nvtl {model_name} list-experiments --filter_params '{json.dumps(filter_params)}'")
    message = ast.literal_eval(message)
    ptm = []
    for rsp in message:
        rsp_keys = rsp.keys()
        assert "ngc_path" in rsp_keys
        if rsp["ngc_path"].endswith(pretrained_map[network_arch]):
            assert "id" in rsp_keys
            ptm_id = rsp["id"]
            ptm = [ptm_id]
            print("Metadata for model with requested NGC Path")
            print(rsp)
            break
    print(ptm)
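
The matching rule used above (pick the first entry whose `ngc_path` ends with the string from `pretrained_map`) can be sketched as a pure function. `find_ptm_id` is a hypothetical helper, and the sample records are made up to mirror only the `id` and `ngc_path` keys that the cell relies on.

```python
def find_ptm_id(experiments, ngc_suffix):
    """Return the id of the first experiment whose ngc_path ends with ngc_suffix."""
    for rsp in experiments:
        if rsp.get("ngc_path", "").endswith(ngc_suffix):
            return rsp["id"]
    return None

# Hypothetical records shaped like the list-experiments output used above
sample = [
    {"id": "ptm-1", "ngc_path": "example/org/pretrained_classification:resnet18"},
    {"id": "ptm-2", "ngc_path": "example/org/pretrained_classification:resnet10"},
]
print(find_ptm_id(sample, "pretrained_classification:resnet10"))
```

Returning `None` makes the no-match case explicit, whereas the cell above silently leaves `ptm` as an empty list.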

In [None]:
if network_arch not in no_ptm_models:
    ptm_information = {"base_experiment":ptm}
    patched_model = subprocess.getoutput(f"nvtl {model_name} patch-artifact-metadata --id {experiment_id} --job_type experiment --update_info '{json.dumps(ptm_information)}' ")
    print(patched_model)

### View hyperparameters that are enabled for AutoML by default <a class="anchor" id="head-11"></a>

In [None]:
if automl_enabled:
    # View default automl specs enabled
    ! nvtl {model_name} model-automl-defaults --id {experiment_id}

### Train <a class="anchor" id="head-11"></a>

#### Set AutoML related configurations <a class="anchor" id="head-12"></a>
Refer to these links to see the parameters supported by each network, and add parameters beyond the default AutoML-enabled ones if necessary:

[Classification TF1](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/classification_tf1/classification_tf1%20-%20train.csv), 
[Classification TF2](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/classification_tf2/classification_tf2%20-%20train.csv), 
[Classification Pytorch](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/classification_pyt/classification_pyt%20-%20train.csv), 
[Multitask classification](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/multitask_classification/multitask_classification%20-%20train.csv)

In [None]:
if automl_enabled:
    # Choose any metric present in the kpi dictionary of the model's status.json.
    # An example status.json for each model can be found in the respective section of the NVIDIA TAO docs: https://docs.nvidia.com/tao/tao-toolkit/text/model_zoo/cv_models/index.html
    if model_name == "classification_pyt":
        metric = "loss"
    else:
        metric = "kpi" 

    additional_automl_parameters = [] #Refer to parameter list mentioned in the above links and add any extra parameter in addition to the default enabled ones
    remove_default_automl_parameters = [] #Remove any hyperparameters that are enabled by default for AutoML

    automl_information = {"automl_enabled":automl_enabled,
                          "automl_algorithm":automl_algorithm,
                          "automl_max_recommendations": 20, # Only for bayesian
                          "automl_R": 27, # Only for hyperband
                          "automl_nu": 3, # Only for hyperband
                          "epoch_multiplier": 1, # Only for hyperband
                          # Enable this if you want to add parameters to automl_add_hyperparameters below that are disabled by TAO in the automl_enabled column of the spec csv.
                          # Warning: The parameters that are disabled are not tested by TAO, so there might be unexpected behaviour in overriding this
                          "override_automl_disabled_params": False,
                          "metric":metric,
                          "automl_add_hyperparameters":str(additional_automl_parameters),
                          "automl_remove_hyperparameters":str(remove_default_automl_parameters)
                         }
    patched_model = subprocess.getoutput(f"nvtl {model_name} patch-artifact-metadata --id {experiment_id} --job_type experiment --update_info '{json.dumps(automl_information)}' ")
    patched_model = json.loads(patched_model)
    print(json.dumps(patched_model, indent=4))
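
To get a feel for the Hyperband settings `automl_R` and `automl_nu`, the sketch below computes the bracket schedule of the standard Hyperband formulation, reading `R` as the maximum number of epochs per configuration and `nu` as the reduction factor. This is an illustration of the published algorithm only; whether the TAO service follows exactly this schedule is an assumption that this notebook does not confirm. `hyperband_brackets` is a hypothetical helper.

```python
def hyperband_brackets(R, nu):
    """Return (s, n_configs, initial_epochs) per bracket in standard Hyperband."""
    # s_max is the largest s with nu**s <= R; integers avoid floating-point log error
    s_max = 0
    while nu ** (s_max + 1) <= R:
        s_max += 1
    brackets = []
    for s in range(s_max, -1, -1):
        # n = ceil((s_max + 1) * nu**s / (s + 1)), via integer ceiling division
        numerator = (s_max + 1) * nu ** s
        n_configs = -(-numerator // (s + 1))
        # Each configuration in bracket s starts with R / nu**s epochs
        initial_epochs = R // (nu ** s)
        brackets.append((s, n_configs, initial_epochs))
    return brackets

for bracket in hyperband_brackets(R=27, nu=3):
    print(bracket)
```

Under these assumptions, the values `R=27` and `nu=3` used above give four brackets, from many short runs to a few full-length runs.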

#### Provide train specs <a class="anchor" id="head-13"></a>

In [None]:
# Default train model specs
train_specs = subprocess.getoutput(f"nvtl {model_name} get-spec --action train --job_type experiment --id {experiment_id}")
train_specs = json.loads(train_specs)
print(json.dumps(train_specs, indent=4))

In [None]:
# Customize train model specs
# Example for multitask_classification (for each network the parameter key might be different)
if model_name == "multitask_classification":
    train_specs["training_config"]["num_epochs"] = 10
    train_specs["gpus"] = 1
# Example for classification_pyt
elif model_name == "classification_pyt":
    train_specs["train"]["train_config"]["runner"]["max_epochs"] = 40
    train_specs["train"]["num_gpus"] = 1
    train_specs["gpus"] = 1
# Example for classification_tf1
elif model_name == "classification_tf1":
    train_specs["train_config"]["n_epochs"] = 80
    train_specs["gpus"] = 1
# Example for classification_tf2
elif model_name == "classification_tf2":
    train_specs["train"]["num_epochs"] = 80
    train_specs["gpus"] = 1
print(json.dumps(train_specs, indent=4))

#### Run train <a class="anchor" id="head-14"></a>

In [None]:
job_id = subprocess.getoutput(f"nvtl {model_name} experiment-run-action --action train --id {experiment_id} --specs '{json.dumps(train_specs)}'")
job_map["train_" + model_name] = job_id
print(job_id)

In [None]:
# Monitor job status
if automl_enabled:    
    while True:
        clear_output(wait=True)
        response = subprocess.getoutput(f"nvtl {model_name} get-action-status --job_type experiment --id {experiment_id} --job {job_id}")
        response = json.loads(response)
        if "error_desc" in response.keys() and response["error_desc"] in ("Job trying to retrieve not found", "No AutoML run found"):
            print("Job is being created")
            time.sleep(5)
            continue
        print(json.dumps(response, sort_keys=True, indent=4))
        assert "status" in response.keys() and response.get("status") != "Error"
        if response.get("status") in ["Done","Error"]:
            break
        time.sleep(15)
else:
    # Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
    status = my_tail(model_name, experiment_id, job_id, "experiment", workdir)
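
Both monitoring paths above loop until the job reports `Done` or `Error`, with no upper bound on the waiting time. A minimal sketch of the same polling pattern with a timeout follows. `poll_until_finished` is a hypothetical helper, and `fetch_status` stands for any callable that returns the current status string, for example a wrapper around the `get-action-status` call.

```python
import time

def poll_until_finished(fetch_status, interval=15, timeout=3600, sleep=time.sleep):
    """Call fetch_status() until it returns "Done" or "Error", or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in ("Done", "Error"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still {status!r} after {timeout} seconds")
        sleep(interval)

# Demo with a fake status source; sleep is stubbed out so the example runs instantly
fake_statuses = iter(["Pending", "Running", "Done"])
result = poll_until_finished(lambda: next(fake_statuses), sleep=lambda _: None)
print(result)
```

The `sleep` parameter exists only to make the helper testable; in the notebook the default `time.sleep` would be used.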

In [None]:
## To Stop an AutoML JOB
#    1. Stop the 'Monitor job status' cell (the cell right before this cell) manually
#    2. Uncomment the snippet in the next cell and run the cell

In [None]:
# if automl_enabled:
#     job_id = job_map["train_" + model_name]
#     job_id = subprocess.getoutput(f"nvtl {model_name} job-cancel --job_type experiment --id {experiment_id} --job {job_id}")
#     job_map["canceled_" + model_name] = job_id
#     print(job_id)

In [None]:
## Resume AutoML

In [None]:
# Uncomment the below snippet if you want to resume an already stopped AutoML job and then run the 'Monitor job status' cell above (4th cell above from this cell)
# if automl_enabled:
#     job_id = job_map["train_" + model_name]
#     job_id = subprocess.getoutput(f"nvtl {model_name} job-resume --id {experiment_id} --job {job_id} --specs '{json.dumps(train_specs)}'")
#     job_map["resumed_" + model_name] = job_id
#     print(job_id)

### Download train job artifacts <a class="anchor" id="head-15"></a>

In [None]:
job_id = job_map["train_" + model_name]
file_list = subprocess.getoutput(f"nvtl {model_name} list-job-files --id {experiment_id} --job {job_id} --job_type experiment --retrieve_logs True --retrieve_specs False")
print(file_list)

In [None]:
## To run this cell, patch the model with the proper metric before training. By default 'loss' is used, but some models don't log the metric under the name 'loss'.
# file_lists = []
# temptar = subprocess.getoutput(f"nvtl {model_name} download-selective-files --id {experiment_id} --job {job_id} --job_type experiment --workdir {workdir} --file_lists '{file_lists}' --best_model False --latest_model True --tar_files True")
# tar_command = f'tar -xvf {temptar} -C {workdir}/'
# os.system(tar_command)
# os.remove(temptar)
# print(f"Results at {workdir}/{job_id}")
# model_downloaded_path = f"{workdir}/{job_id}"

In [None]:
# Downloading the train job takes a long time; this cell runs only if download_jobs was set to True in FIXME 7
if download_jobs:
    temptar = subprocess.getoutput(f"nvtl {model_name} download-entire-job --id {experiment_id} --job {job_id} --job_type experiment --workdir {workdir}")
    tar_command = f'tar -xvf {temptar} -C {workdir}/'
    os.system(tar_command)
    os.remove(temptar)
    print(f"Results at {workdir}/{job_id}")
    model_downloaded_path = f"{workdir}/{job_id}"
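
The cell above extracts the downloaded archive by shelling out to `tar`. As an alternative sketch, the extraction can be done with Python's `tarfile` module, which also allows rejecting members that would be written outside the target directory. `extract_tar` is a hypothetical helper, and the demo archive and file names are made up.

```python
import io
import os
import tarfile
import tempfile

def extract_tar(tar_path, dest_dir):
    """Extract tar_path into dest_dir, refusing members that would land outside it."""
    os.makedirs(dest_dir, exist_ok=True)
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(tar_path, "r") as archive:
        for member in archive.getmembers():
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        # Newer Python versions offer a built-in "data" filter; use it when available
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        archive.extractall(dest_root, **kwargs)
    return dest_root

# Demo: build a one-file archive and extract it into a temporary directory
with tempfile.TemporaryDirectory() as demo_dir:
    tar_path = os.path.join(demo_dir, "job.tar")
    payload = b"checkpoint"
    with tarfile.open(tar_path, "w") as archive:
        info = tarfile.TarInfo(name="job/weights/model.pth")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    out_dir = extract_tar(tar_path, os.path.join(demo_dir, "out"))
    extracted = os.path.isfile(os.path.join(out_dir, "job", "weights", "model.pth"))
print(extracted)
```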

In [None]:
# View the checkpoints generated for the training job. For AutoML jobs, also view the best performing model's config and the results of all AutoML experiments.
if download_jobs:
    if automl_enabled:
        !python3 -m pip install pandas==1.5.1
        import pandas as pd
        model_downloaded_path = f"{model_downloaded_path}/best_model"
        assert glob.glob(f"{model_downloaded_path}/*.protobuf") or glob.glob(f"{model_downloaded_path}/*.yaml")

    assert os.path.exists(model_downloaded_path)
    assert (glob.glob(model_downloaded_path + "/**/*.tlt", recursive=True) + glob.glob(model_downloaded_path + "/**/*.hdf5", recursive=True) + glob.glob(model_downloaded_path + "/**/*.pth", recursive=True))

    if os.path.exists(model_downloaded_path):        
        #List the binary model file
        print("\nCheckpoints for the training experiment")
        if os.path.exists(model_downloaded_path+"/train/weights") and len(os.listdir(model_downloaded_path+"/train/weights")) > 0:
            print(f"Folder: {model_downloaded_path}/train/weights")
            print("Files:", os.listdir(model_downloaded_path+"/train/weights"))
        elif os.path.exists(model_downloaded_path+"/weights") and len(os.listdir(model_downloaded_path+"/weights")) > 0:
            print(f"Folder: {model_downloaded_path}/weights")
            print("Files:", os.listdir(model_downloaded_path+"/weights"))
        else:
            print(f"Folder: {model_downloaded_path}")
            print("Files:", os.listdir(model_downloaded_path))

        if automl_enabled:
            assert glob.glob(f"{model_downloaded_path}/*.protobuf") or glob.glob(f"{model_downloaded_path}/*.yaml")
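            # Use a context manager so the file handle is closed after reading
            with open(f"{model_downloaded_path}/controller.json", "r") as controller_file:
                experiment_artifacts = json.load(controller_file)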
            data_frame = pd.DataFrame(experiment_artifacts)
            # Print experiment id/number and the corresponding result
            print("\nResults of all experiments")
            with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.max_colwidth', None):
                print(data_frame[["id","result"]])
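
The table printed above lists every AutoML experiment with its result. To pick out the best one programmatically, something like the following sketch could be used. It assumes only the `id` and `result` fields shown above; `best_experiment` is a hypothetical helper, and the sample values are made up. Whether a higher or lower result is better depends on the metric chosen earlier (for example, lower is better for `loss`).

```python
def best_experiment(records, higher_is_better=True):
    """Return the record with the best numeric "result", ignoring unfinished ones."""
    scored = [r for r in records if isinstance(r.get("result"), (int, float))]
    if not scored:
        return None
    pick = max if higher_is_better else min
    return pick(scored, key=lambda r: r["result"])

# Made-up records with the two fields printed in the cell above
sample_records = [
    {"id": 0, "result": 0.71},
    {"id": 1, "result": 0.83},
    {"id": 2, "result": None},  # an experiment without a result yet
]
print(best_experiment(sample_records))
print(best_experiment(sample_records, higher_is_better=False))
```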

### Evaluate <a class="anchor" id="head-16"></a>

#### Provide evaluate specs <a class="anchor" id="head-16"></a>

In [None]:
# Default evaluate model specs
eval_specs = subprocess.getoutput(f"nvtl {model_name} get-spec --action evaluate --job_type experiment --id {experiment_id}")
eval_specs = json.loads(eval_specs)
print(json.dumps(eval_specs, indent=4))

In [None]:
# Customize evaluate model specs
# Change any spec if you wish
print(json.dumps(eval_specs, indent=4))

#### Run evaluate <a class="anchor" id="head-17"></a>

In [None]:
# Print model handler parameters
model_parameters = subprocess.getoutput(f"nvtl {model_name} get-metadata --id {experiment_id} --job_type experiment")
model_parameters = json.loads(model_parameters)
update_checkpoint_choosing = {}
update_checkpoint_choosing["checkpoint_choose_method"] = model_parameters["checkpoint_choose_method"]
update_checkpoint_choosing["checkpoint_epoch_number"] = model_parameters["checkpoint_epoch_number"]
print(json.dumps(update_checkpoint_choosing, indent=4))

In [None]:
# Change the method by which checkpoint from the parent action is chosen, when parent action is a train/retrain action.
# Example for evaluate action below, can be applied in the same way for other actions too
update_checkpoint_choosing["checkpoint_choose_method"] = "latest_model" # Choose between best_model/latest_model/from_epoch_number
# If from_epoch_number is chosen, assign the epoch number to the dictionary key in the format 'from_epoch_number_{train_job_id}', as in the example below
# update_checkpoint_choosing["checkpoint_epoch_number"]["from_epoch_number_c2f76eb7-2a75-4197-9a84-c1547f20c17d"] = 6

patched_model = subprocess.getoutput(f"nvtl {model_name} patch-artifact-metadata --id {experiment_id} --job_type experiment --update_info '{json.dumps(update_checkpoint_choosing)}'")
patched_model = json.loads(patched_model)
print(json.dumps(patched_model, indent=4))

In [None]:
parent = job_map["train_" + model_name]
job_id = subprocess.getoutput(f"nvtl {model_name} experiment-run-action --action evaluate --id {experiment_id} --parent_job_id {parent} --specs '{json.dumps(eval_specs)}'")
job_map["eval_" + model_name] = job_id
print(job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
status = my_tail(model_name, experiment_id, job_id, "experiment", workdir)

### Prune, Retrain and Evaluation <a class="anchor" id="head-13"></a>

- We optimize the trained model by pruning and retraining in the following cells

#### Prune <a class="anchor" id="head-18"></a>

In [None]:
if model_name != "classification_pyt":
    # Default prune model specs
    prune_specs = subprocess.getoutput(f"nvtl {model_name} get-spec --id {experiment_id} --action prune --job_type experiment")
    prune_specs = json.loads(prune_specs)
    print(json.dumps(prune_specs, indent=4))

In [None]:
if model_name != "classification_pyt":
    # Customize prune model specs
    # Apply changes to specs dictionary if required here
    if model_name == "classification_tf2":
        prune_specs["prune"]["byom_model_path"] = ""
    print(json.dumps(prune_specs, indent=4))

In [None]:
if model_name != "classification_pyt":
    parent = job_map["train_" + model_name]
    job_id = subprocess.getoutput(f"nvtl {model_name} experiment-run-action --action prune --id {experiment_id} --parent_job_id {parent} --specs '{json.dumps(prune_specs)}'")
    job_map["prune_" + model_name] = job_id

In [None]:
# Check status of pruning job (the file won't exist until the backend Toolkit container is running -- can take several minutes)
if model_name != "classification_pyt":
    status = my_tail(model_name, experiment_id, job_id, "experiment", workdir)

#### Retrain <a class="anchor" id="head-20"></a>

In [None]:
if model_name != "classification_pyt":
    # Default retrain model specs
    retrain_specs = subprocess.getoutput(f"nvtl {model_name} get-spec --id {experiment_id} --action retrain --job_type experiment")
    retrain_specs = json.loads(retrain_specs)
    print(json.dumps(retrain_specs, indent=4))

In [None]:
if model_name != "classification_pyt":
    # Override any of the parameters listed in the previous cell as required
    # Example for multitask_classification (for each network the parameter key might be different)
    if model_name == "multitask_classification":
        retrain_specs["training_config"]["num_epochs"] = 10
        retrain_specs["gpus"] = 1
    # Example for classification_tf1
    elif model_name == "classification_tf1":
        retrain_specs["train_config"]["n_epochs"] = 80
        retrain_specs["gpus"] = 1
    # Example for classification_tf2
    elif model_name == "classification_tf2":
        retrain_specs["train"]["num_epochs"] = 80
        retrain_specs["gpus"] = 1
    print(json.dumps(retrain_specs, indent=4))
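
Because each network keeps the epoch count under a different key, the branches above can also be written as a lookup table. The following is a minimal sketch; `EPOCH_KEY_PATHS`, `set_nested` and `set_epochs` are hypothetical names, and the key paths are copied from the cell above.

```python
# Key path to the epoch count for each network (copied from the cell above).
EPOCH_KEY_PATHS = {
    "multitask_classification": ("training_config", "num_epochs"),
    "classification_tf1": ("train_config", "n_epochs"),
    "classification_tf2": ("train", "num_epochs"),
}


def set_nested(specs, path, value):
    """Set a nested key, failing loudly if a parent section is missing."""
    node = specs
    for key in path[:-1]:
        if key not in node:
            raise KeyError(f"spec has no section {key!r}")
        node = node[key]
    node[path[-1]] = value
    return specs


def set_epochs(specs, model_name, epochs):
    """Set the epoch count under the key this network uses."""
    return set_nested(specs, EPOCH_KEY_PATHS[model_name], epochs)


# Stand-in spec dictionary; real specs come from the get-spec call.
demo_specs = {"train": {"num_epochs": 1}}
set_epochs(demo_specs, "classification_tf2", 80)
print(demo_specs)
```

Failing on a missing parent section catches the case where a spec for one network is edited with another network's key names.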

In [None]:
if model_name != "classification_pyt":
    parent = job_map["prune_" + model_name]
    job_id = subprocess.getoutput(f"nvtl {model_name} experiment-run-action --action retrain --id {experiment_id} --parent_job_id {parent} --specs '{json.dumps(retrain_specs)}'")
    job_map["retrain_" + model_name] = job_id

In [None]:
# Check status of retrain job (the file won't exist until the backend Toolkit container is running -- can take several minutes)
if model_name != "classification_pyt":
    status = my_tail(model_name, experiment_id, job_id, "experiment", workdir)

#### Evaluate after retrain <a class="anchor" id="head-21-1"></a>

In [None]:
# Default evaluate model specs
if model_name != "classification_pyt":
    eval_retrain_specs = subprocess.getoutput(f"nvtl {model_name} get-spec --action evaluate --job_type experiment --id {experiment_id}")
    eval_retrain_specs = json.loads(eval_retrain_specs)
    print(json.dumps(eval_retrain_specs, indent=4))

In [None]:
# Customize evaluate model specs
# Change any spec if you wish
if model_name != "classification_pyt":
    print(json.dumps(eval_retrain_specs, indent=4))

In [None]:
if model_name != "classification_pyt":
    parent = job_map["retrain_" + model_name]
    job_id = subprocess.getoutput(f"nvtl {model_name} experiment-run-action --action evaluate --id {experiment_id} --parent_job_id {parent} --specs '{json.dumps(eval_retrain_specs)}'")
    job_map["eval2_" + model_name] = job_id
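
Each action in this section takes an earlier job as its parent, and a skipped cell leaves the parent missing from `job_map`. The following is a minimal sketch of a lookup that reports which job has to run first. `PARENT_PREFIX` and `parent_job_id` are hypothetical names; the prefixes mirror the `job_map` keys used in the cells above.

```python
# job_map key prefix -> prefix of the job it uses as parent (from the cells above).
PARENT_PREFIX = {
    "eval_": "train_",
    "prune_": "train_",
    "retrain_": "prune_",
    "eval2_": "retrain_",
}


def parent_job_id(job_map, prefix, model_name):
    """Return the parent job ID for the job stored under prefix + model_name."""
    parent_prefix = PARENT_PREFIX[prefix]
    parent_key = parent_prefix + model_name
    if parent_key not in job_map:
        needed = parent_prefix.rstrip("_")
        raise RuntimeError(f"run the {needed} job before {prefix.rstrip('_')}")
    return job_map[parent_key]


# Placeholder job IDs for illustration only.
demo_map = {
    "train_classification_tf2": "job-a",
    "prune_classification_tf2": "job-b",
}
print(parent_job_id(demo_map, "retrain_", "classification_tf2"))
```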

In [None]:
# Check status of evaluate after retrain job (the file won't exist until the backend Toolkit container is running -- can take several minutes)
if model_name != "classification_pyt":
    status = my_tail(model_name, experiment_id, job_id, "experiment", workdir)

### Export <a class="anchor" id="head-22"></a>

#### Provide export specs <a class="anchor" id="head-22"></a>

In [None]:
# Default export model specs
export_specs = subprocess.getoutput(f"nvtl {model_name} get-spec --id {experiment_id} --action export --job_type experiment")
export_specs = json.loads(export_specs)
print(json.dumps(export_specs, indent=4))

In [None]:
# Customize export model specs
# Apply changes to the specs dictionary here if required
print(json.dumps(export_specs, indent=4))

#### Run export <a class="anchor" id="head-23"></a>

In [None]:
parent = job_map["train_" + model_name]
job_id = subprocess.getoutput(f"nvtl {model_name} experiment-run-action --action export --id {experiment_id} --parent_job_id {parent} --specs '{json.dumps(export_specs)}'")
job_map["export_" + model_name] = job_id
print(job_id)
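
The run cells wrap the JSON specs in single quotes, which breaks if any spec value itself contains a single quote. The following is a minimal sketch that builds the same command with shell-safe quoting. `build_run_action_command` is a hypothetical helper; the flags are the ones used in the cells of this notebook, and the IDs in the demonstration are placeholders.

```python
import json
import shlex


def build_run_action_command(model_name, action, experiment_id, parent_job_id, specs):
    """Return an experiment-run-action command line with every argument quoted."""
    args = [
        "nvtl", model_name, "experiment-run-action",
        "--action", action,
        "--id", experiment_id,
        "--parent_job_id", parent_job_id,
        "--specs", json.dumps(specs),
    ]
    # shlex.join quotes each argument so the shell passes it through unchanged.
    return shlex.join(args)


command = build_run_action_command(
    "classification_tf2", "export", "exp-1", "job-1", {"note": "it's"}
)
print(command)
```

The resulting string can be passed to `subprocess.getoutput` in place of the f-string used above.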

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
status = my_tail(model_name, experiment_id, job_id, "experiment", workdir)

### TRT Engine generation using TAO-Deploy <a class="anchor" id="head-26"></a>

#### Provide trt engine generation specs <a class="anchor" id="head-26"></a>

In [None]:
# Default gen_trt_engine model specs
tao_deploy_specs = subprocess.getoutput(f"nvtl {model_name} get-spec --id {experiment_id} --action gen_trt_engine --job_type experiment")
tao_deploy_specs = json.loads(tao_deploy_specs)
print(json.dumps(tao_deploy_specs, indent=4))

In [None]:
# Customize gen_trt_engine model specs
if model_name == "classification_tf2":
    tao_deploy_specs["gen_trt_engine"]["tensorrt"]["data_type"] = "int8"
elif model_name == "classification_pyt":
    tao_deploy_specs["gen_trt_engine"]["tensorrt"]["data_type"] = "fp16"
else:
    tao_deploy_specs["data_type"] = "int8"
print(json.dumps(tao_deploy_specs, indent=4))
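
The location of `data_type` depends on the network, so a typo in either the key path or the value is easy to miss. The following is a minimal sketch that mirrors the branches above and rejects unexpected values. `set_trt_data_type` is a hypothetical helper, and the set of allowed precisions is an assumption that should be checked against the default specs printed earlier.

```python
# Assumed set of precisions; verify against the default specs for your network.
ALLOWED_DATA_TYPES = {"fp32", "fp16", "int8"}

# Networks whose spec nests the setting under gen_trt_engine.tensorrt (from the cell above).
NESTED_SPEC_MODELS = {"classification_tf2", "classification_pyt"}


def set_trt_data_type(specs, model_name, data_type):
    """Write data_type at the location this network's spec uses."""
    if data_type not in ALLOWED_DATA_TYPES:
        raise ValueError(f"unsupported data_type: {data_type!r}")
    if model_name in NESTED_SPEC_MODELS:
        specs["gen_trt_engine"]["tensorrt"]["data_type"] = data_type
    else:
        specs["data_type"] = data_type
    return specs


# Stand-in spec dictionaries for illustration.
nested = set_trt_data_type({"gen_trt_engine": {"tensorrt": {}}}, "classification_pyt", "fp16")
flat = set_trt_data_type({}, "classification_tf1", "int8")
print(nested, flat)
```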

#### Run TRT Engine generation using TAO-Deploy <a class="anchor" id="head-27"></a>

In [None]:
parent = job_map["export_" + model_name]
job_id = subprocess.getoutput(f"nvtl {model_name} experiment-run-action --action gen_trt_engine --id {experiment_id} --parent_job_id {parent} --specs '{json.dumps(tao_deploy_specs)}'")
job_map["gen_trt_engine_" + model_name] = job_id
print(job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
status = my_tail(model_name, experiment_id, job_id, "experiment", workdir)

### TAO inference <a class="anchor" id="head-28"></a>

#### Provide TAO inference specs <a class="anchor" id="head-28"></a>

In [None]:
# Default inference model specs
tao_inference_specs = subprocess.getoutput(f"nvtl {model_name} get-spec --id {experiment_id} --action inference --job_type experiment")
tao_inference_specs = json.loads(tao_inference_specs)
print(json.dumps(tao_inference_specs, indent=4))

In [None]:
# Customize TAO inference specs
# Apply changes to the specs dictionary here if required
print(json.dumps(tao_inference_specs, indent=4))

#### Run TAO inference <a class="anchor" id="head-29"></a>

In [None]:
parent = job_map["train_" + model_name]
job_id = subprocess.getoutput(f"nvtl {model_name} experiment-run-action --action inference --id {experiment_id} --parent_job_id {parent} --specs '{json.dumps(tao_inference_specs)}'")
job_map["tlt_inference_" + model_name] = job_id
print(job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
status = my_tail(model_name, experiment_id, job_id, "experiment", workdir)

In [None]:
if download_jobs:
    temptar = subprocess.getoutput(f"nvtl {model_name} download-entire-job --id {experiment_id} --job {job_id} --job_type experiment --workdir {workdir}")
    tar_command = f'tar -xvf {temptar} -C {workdir}/'
    os.system(tar_command)
    os.remove(temptar)
    print(f"Results at {workdir}/{job_id}")
    inference_out_path = f"{workdir}/{job_id}"
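
The unpack step can also be done without shelling out to `tar`. The following is a minimal sketch using the standard-library `tarfile` module. `extract_job_archive` is a hypothetical helper, and the demonstration builds a throwaway archive instead of downloading a real job.

```python
import os
import tarfile
import tempfile


def extract_job_archive(archive_path, dest_dir, remove_archive=True):
    """Extract archive_path into dest_dir and return the member names."""
    # Mode "r" (the default) detects the compression automatically.
    with tarfile.open(archive_path) as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    if remove_archive:
        os.remove(archive_path)
    return names


# Demonstration with a throwaway archive (file names are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "job-123")
    os.makedirs(src)
    with open(os.path.join(src, "result.csv"), "w") as handle:
        handle.write("image.jpg,cat\n")
    archive_path = os.path.join(tmp, "job-123.tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(src, arcname="job-123")
    out_dir = os.path.join(tmp, "out")
    os.makedirs(out_dir)
    extracted = extract_job_archive(archive_path, out_dir)
    extracted_ok = os.path.exists(os.path.join(out_dir, "job-123", "result.csv"))
    archive_removed = not os.path.exists(archive_path)
print(extracted)
```

Unlike `os.system`, this raises an exception when extraction fails, so the archive is not deleted after a failed unpack.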

In [None]:
if download_jobs:
    # Print Classification results
    if model_name == "classification_tf1":
        assert os.path.exists(f'{inference_out_path}/result.csv')
        !cat {inference_out_path}/result.csv
    elif "classification_" in model_name:
        assert os.path.exists(f'{inference_out_path}/inference/result.csv')
        !cat {inference_out_path}/inference/result.csv
    elif model_name == "multitask_classification":
        assert os.path.exists(f'{inference_out_path}/result.txt')
        !cat {inference_out_path}/result.txt
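
To summarize the predictions instead of printing the raw file, the CSV can be tallied in Python. The following is a minimal sketch, assuming each row holds an image path followed by a predicted label. That column layout is an assumption and should be checked against the actual `result.csv`. `count_predictions` is a hypothetical helper, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_predictions(csv_file, label_column=1):
    """Count how often each label appears in the given column."""
    counts = Counter()
    for row in csv.reader(csv_file):
        # Skip blank or short rows instead of failing on them.
        if len(row) > label_column:
            counts[row[label_column].strip()] += 1
    return counts


# Made-up rows standing in for a real result.csv.
sample = io.StringIO("a.jpg,cat,0.91\nb.jpg,dog,0.77\nc.jpg,cat,0.64\n")
summary = count_predictions(sample)
print(summary.most_common())
```

For a real run, pass an open file handle for the downloaded `result.csv` instead of the in-memory sample.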

### TRT inference <a class="anchor" id="head-30"></a>

#### Provide TRT inference specs <a class="anchor" id="head-30"></a>

In [None]:
# Default inference model specs
trt_inference_specs = subprocess.getoutput(f"nvtl {model_name} get-spec --id {experiment_id} --action inference --job_type experiment")
trt_inference_specs = json.loads(trt_inference_specs)
print(json.dumps(trt_inference_specs, indent=4))

In [None]:
# Customize TRT inference specs
# Apply changes to the specs dictionary here if required
print(json.dumps(trt_inference_specs, indent=4))

#### Run TRT inference <a class="anchor" id="head-31"></a>

In [None]:
# Run inference with the TensorRT engine generated by the gen_trt_engine job
parent = job_map["gen_trt_engine_" + model_name]
job_id = subprocess.getoutput(f"nvtl {model_name} experiment-run-action --action inference --id {experiment_id} --parent_job_id {parent} --specs '{json.dumps(trt_inference_specs)}'")
job_map["trt_inference_" + model_name] = job_id
print(job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
status = my_tail(model_name, experiment_id, job_id, "experiment", workdir)

In [None]:
if download_jobs:
    temptar = subprocess.getoutput(f"nvtl {model_name} download-entire-job --id {experiment_id} --job {job_id} --job_type experiment --workdir {workdir}")
    tar_command = f'tar -xvf {temptar} -C {workdir}/'
    os.system(tar_command)
    os.remove(temptar)
    print(f"Results at {workdir}/{job_id}")
    inference_out_path = f"{workdir}/{job_id}"
    assert glob.glob(f"{inference_out_path}/**/*result.csv", recursive=True)

In [None]:
# Print Classification results
if download_jobs:
    if model_name in ("classification_tf1", "multitask_classification"):
        !cat {inference_out_path}/result.csv
    elif "classification_" in model_name:
        !cat {inference_out_path}/inference/result.csv
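
The per-network branch above can be avoided by searching the job directory for the result file, in the same way as the `glob` assertion in the previous cell. The following is a minimal sketch with `pathlib`; `find_result_files` is a hypothetical helper, and the demonstration uses a temporary directory with placeholder files.

```python
import tempfile
from pathlib import Path


def find_result_files(root):
    """Return every *result.csv below root, sorted for stable output."""
    return sorted(Path(root).rglob("*result.csv"))


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "inference"
    nested.mkdir()
    (nested / "result.csv").write_text("a.jpg,cat\n")
    (Path(tmp) / "status.json").write_text("{}")
    found = [path.relative_to(tmp).as_posix() for path in find_result_files(tmp)]
print(found)
```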

### Delete experiment <a class="anchor" id="head-21"></a>

In [None]:
subprocess.getoutput(f"nvtl {model_name} experiment-delete --id {experiment_id}")

### Delete dataset <a class="anchor" id="head-21"></a>

#### Delete train dataset <a class="anchor" id="head-21"></a>

In [None]:
subprocess.getoutput(f"nvtl {model_name} dataset-delete --id {train_dataset_id}")

#### Delete val dataset <a class="anchor" id="head-21"></a>

In [None]:
subprocess.getoutput(f"nvtl {model_name} dataset-delete --id {eval_dataset_id}")

#### Delete test dataset <a class="anchor" id="head-21"></a>

In [None]:
subprocess.getoutput(f"nvtl {model_name} dataset-delete --id {test_dataset_id}")
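
The three delete cells can be collapsed into one loop. The following is a minimal sketch that only builds the command strings, so it can be reviewed before anything is deleted. `build_dataset_delete_commands` is a hypothetical helper, the `dataset-delete --id` form is taken from the cells above, and the IDs in the demonstration are placeholders.

```python
import shlex


def build_dataset_delete_commands(model_name, dataset_ids):
    """Return one dataset-delete command per non-empty dataset ID."""
    return [
        shlex.join(["nvtl", model_name, "dataset-delete", "--id", dataset_id])
        for dataset_id in dataset_ids
        if dataset_id  # skip IDs that were never created
    ]


commands = build_dataset_delete_commands(
    "classification_tf2", ["train-id", "eval-id", None, "test-id"]
)
for command in commands:
    print(command)
```

Each string can then be passed to `subprocess.getoutput`, as in the cells above.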