### TAO remote client - Auto Labeling

Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique in which a model trained on one task is retrained for use on a different task. Train Adapt Optimize (TAO) Toolkit is an easy-to-use, Python-based AI toolkit for taking purpose-built AI models and customizing them with your own data.

![image](https://developer.nvidia.com/sites/default/files/akamai/TAO/tlt-tao-toolkit-bring-your-own-model-diagram.png)


### The workflow in a nutshell

- Create a dataset
- Upload the dataset to the service
- Get a pretrained model (PTM) from NGC
- Model actions
    - Train (normal or AutoML)
    - Evaluate
    - Run inference with TAO

### Table of contents

1. [Install TAO remote client](#head-1)
1. [Set the remote service base URL](#head-2)
1. [Access the shared volume](#head-3)
1. [Create the datasets](#head-4)
1. [List datasets](#head-5)
1. [Create a model experiment](#head-8)
1. [Find pretrained model](#head-9)
1. [Customize model metadata](#head-10)
1. [View hyperparameters that are enabled for AutoML by default](#head-11)
1. [Set AutoML related configurations](#head-12)
1. [Provide train specs](#head-13)
1. [Run train](#head-14)
1. [View checkpoint files](#head-15)
1. [Provide evaluate specs](#head-16)
1. [Run evaluate](#head-17)
1. [Provide TAO inference specs](#head-28)
1. [Run TAO inference](#head-29)
1. [Delete experiment](#head-32)
1. [Delete datasets](#head-33)
1. [Unmount shared volume](#head-34)
1. [Uninstall TAO Remote Client](#head-35)

### Requirements
Please find the server requirements [here](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#)

In [None]:
import os
import glob
import subprocess
import getpass
import uuid
import json

In [None]:
namespace = 'default'

### FIXME

1. Assign the ip_address and port_number in FIXME 1 and FIXME 2 ([info](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_rest_api.html))
1. Assign the ngc_api_key variable in FIXME 3
1. (Optional) Enable AutoML if needed in FIXME 4
1. Choose between default and custom dataset in FIXME 5
1. Assign path of DATA_DIR in FIXME 6
1. Choose between the Bayesian and HyperBand automl_algorithm in FIXME 7 (if AutoML was enabled in FIXME 4)
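
Before running the later cells, it can help to confirm that none of the FIXME placeholders were left untouched. The following is a minimal sketch: `find_unset_placeholders` is a hypothetical helper, not part of the TAO client, and it assumes placeholders keep the `<...>` form used in this notebook.

```python
import re

# Hypothetical helper (not part of tao-client): report settings that still
# hold a "<placeholder>" value such as "<ip_address>".
_PLACEHOLDER = re.compile(r"^<[^<>]+>$")


def find_unset_placeholders(settings):
    """Return the sorted names of settings whose value is still a placeholder."""
    return sorted(
        name
        for name, value in settings.items()
        if isinstance(value, str) and _PLACEHOLDER.match(value.strip())
    )


# Illustrative values only; in the notebook these come from FIXME 1 to 3.
example_settings = {
    "node_addr": "<ip_address>",      # still unset
    "node_port": "32334",
    "ngc_api_key": "<ngc_api_key>",   # still unset
}
print(find_unset_placeholders(example_settings))
```

An empty list means every checked setting has been filled in.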

In [None]:
model_name = "mal"

### Install TAO remote client <a class="anchor" id="head-1"></a>

In [None]:
# SKIP this step IF you have already installed the TAO-Client wheel.
! pip3 install nvidia-tao-client

In [None]:
# View the version of the TAO-Client
! tao-client --version

### Set the remote service base URL <a class="anchor" id="head-2"></a>

In [None]:
# Define the node_addr and port number
node_addr = "<ip_address>" # FIXME1 example: 10.137.149.22
node_port = "<port_number>" # FIXME2 example: 32334
# On the host machine, the node IP address and port number can be obtained as follows:
# ip_address: hostname -i
# port_number: kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'

ngc_api_key = "<ngc_api_key>" # FIXME3 example: (Add NGC API key)

In [None]:
automl_enabled = False # FIXME4 set to True if you want to run automl for the model chosen in the previous cell

In [None]:
%env BASE_URL=http://{node_addr}:{node_port}/{namespace}/api/v1

# Exchange NGC_API_KEY for JWT
identity = json.loads(subprocess.getoutput(f'tao-client login --ngc-api-key {ngc_api_key}'))

%env USER={identity['user_id']}
%env TOKEN={identity['token']}
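
`subprocess.getoutput` returns stderr together with stdout, so a failed login produces text that `json.loads` cannot parse and the resulting traceback hides the real cause. The sketch below shows one way to surface a clearer error. `parse_identity` is a hypothetical helper, and it assumes only that the login output is a JSON object with the `user_id` and `token` keys used in the cell above.

```python
import json


def parse_identity(raw_output):
    """Parse login output into (user_id, token) or raise a readable ValueError."""
    try:
        identity = json.loads(raw_output)
    except json.JSONDecodeError as err:
        # Show the start of the raw text, which usually contains the CLI error
        raise ValueError(f"Login did not return JSON: {raw_output[:200]!r}") from err
    if not isinstance(identity, dict):
        raise ValueError(f"Login returned unexpected JSON: {identity!r}")
    missing = [key for key in ("user_id", "token") if key not in identity]
    if missing:
        raise ValueError(f"Login response is missing {missing}")
    return identity["user_id"], identity["token"]


# Made-up sample values, for illustration only
user_id, token = parse_identity('{"user_id": "u-1", "token": "t-1"}')
print(user_id)
```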

### Access the shared volume <a class="anchor" id="head-3"></a>

In [None]:
# Get PVC ID
pvc_id = subprocess.getoutput(f'kubectl get pvc tao-toolkit-api-pvc -n {namespace} -o jsonpath="{{.spec.volumeName}}"')
print(pvc_id)

In [None]:
# Get NFS server info
provisioner = json.loads(subprocess.getoutput(f'helm get values nfs-subdir-external-provisioner -o json'))
nfs_server = provisioner['nfs']['server']
nfs_path = provisioner['nfs']['path']
print(nfs_server, nfs_path)

In [None]:
user = getpass.getuser()
home = os.path.expanduser('~')

! echo "Password for {user}"
password = getpass.getpass()

In [None]:
# Mount shared volume 
! mkdir -p ~/shared

command = "apt-get -y install nfs-common >> /dev/null"
! echo {password} | sudo -S -k {command}

command = f"mount -t nfs {nfs_server}:{nfs_path}/{namespace}-tao-toolkit-api-pvc-{pvc_id} ~/shared"
! echo {password} | sudo -S -k {command} && echo DONE

### Create the datasets <a class="anchor" id="head-4"></a>

We will be using the [COCO 2017 dataset](https://cocodataset.org/#download). The `download_coco.sh` script in `dataset_prepare/coco` downloads and unzips it.

**If you use a custom dataset, it must follow this structure:**
```
DATA_DIR/train2017
├── annotations.json
└── images
    ├── image_name_1.jpg
    ├── image_name_2.jpg
    └── ...
```
```
DATA_DIR/val2017
├── annotations.json
└── images
    ├── image_name_1.jpg
    ├── image_name_2.jpg
    └── ...
```
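
The layout above can also be checked programmatically before uploading a custom dataset. This is a minimal sketch, not part of the TAO client: `check_split_layout` is a hypothetical helper that looks only for the `annotations.json` file and a non-empty `images` folder in each split.

```python
from pathlib import Path


def check_split_layout(data_dir, splits=("train2017", "val2017")):
    """Return a list of problems; an empty list means the layout matches."""
    problems = []
    for split in splits:
        split_dir = Path(data_dir) / split
        if not (split_dir / "annotations.json").is_file():
            problems.append(f"{split}: annotations.json not found")
        images_dir = split_dir / "images"
        if not images_dir.is_dir():
            problems.append(f"{split}: images folder not found")
        elif not any(images_dir.iterdir()):
            problems.append(f"{split}: images folder is empty")
    return problems


# "mal" is the default DATA_DIR used in this notebook
for problem in check_split_layout("mal"):
    print(problem)
```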

In [None]:
dataset_to_be_used = "default" #FIXME5 example: default/custom; default for the dataset used in this tutorial notebook; custom for a different dataset
DATA_DIR = model_name # FIXME6
os.environ['DATA_DIR']= DATA_DIR
!mkdir -p $DATA_DIR

### Download dataset

In [None]:
if dataset_to_be_used == "default":
    !bash dataset_prepare/coco/download_coco.sh $DATA_DIR
    # Remove existing data
    !rm -rf $DATA_DIR/train2017/images
    !rm -rf $DATA_DIR/val2017/images
    # Rearrange data in the required format
    !mkdir -p $DATA_DIR/train2017/
    !mkdir -p $DATA_DIR/val2017/
    !mv $DATA_DIR/raw-data/train2017 $DATA_DIR/train2017/images
    !mv $DATA_DIR/raw-data/annotations/instances_train2017.json $DATA_DIR/train2017/annotations.json
    !mv $DATA_DIR/raw-data/val2017 $DATA_DIR/val2017/images
    !mv $DATA_DIR/raw-data/annotations/instances_val2017.json $DATA_DIR/val2017/annotations.json

### Verify the downloaded dataset

In [None]:
!if [ ! -d $DATA_DIR/train2017/images ]; then echo 'Train Images folder not found'; else echo 'Found Train images folder';fi
!if [ ! -f $DATA_DIR/train2017/annotations.json ]; then echo 'Train annotations file not found'; else echo 'Found Train annotations file';fi
!if [ ! -d $DATA_DIR/val2017/images ]; then echo 'Val Images folder not found'; else echo 'Found Val images folder';fi
!if [ ! -f $DATA_DIR/val2017/annotations.json ]; then echo 'Val annotations file not found'; else echo 'Found Val annotations file';fi
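
Beyond checking that the files exist, a quick look inside `annotations.json` can catch an empty or mismatched annotation file early. The sketch below assumes the standard COCO keys (`images`, `annotations`, `categories`, and `image_id` on each annotation); `summarize_coco` is a hypothetical helper, not part of the TAO client.

```python
import json


def summarize_coco(annotation_path):
    """Count images, annotations, and categories in a COCO-format JSON file."""
    with open(annotation_path, "r") as annotation_file:
        coco = json.load(annotation_file)
    image_ids = {image["id"] for image in coco.get("images", [])}
    annotated_ids = {ann["image_id"] for ann in coco.get("annotations", [])}
    return {
        "images": len(image_ids),
        "annotations": len(coco.get("annotations", [])),
        "categories": len(coco.get("categories", [])),
        # Images that no annotation refers to
        "images_without_annotations": len(image_ids - annotated_ids),
    }
```

For example, `summarize_coco(os.path.join(DATA_DIR, "val2017", "annotations.json"))` returns the counts for the validation split.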

In [None]:
ds_type = "instance_segmentation"
ds_format = "coco"

In [None]:
train_dataset_id = subprocess.getoutput(f"tao-client {model_name} dataset-create --dataset_type {ds_type} --dataset_format {ds_format}")
print(train_dataset_id)

In [None]:
! rsync -ah --info=progress2 {DATA_DIR}/train2017/images ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
! rsync -ah --info=progress2 {DATA_DIR}/train2017/annotations.json ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/annotations.json
! echo DONE

In [None]:
eval_dataset_id = subprocess.getoutput(f"tao-client {model_name} dataset-create --dataset_type {ds_type} --dataset_format {ds_format}")
print(eval_dataset_id)

In [None]:
! rsync -ah --info=progress2 {DATA_DIR}/val2017/images ~/shared/users/{os.environ['USER']}/datasets/{eval_dataset_id}/
! rsync -ah --info=progress2 {DATA_DIR}/val2017/annotations.json ~/shared/users/{os.environ['USER']}/datasets/{eval_dataset_id}/annotations.json
! echo DONE

### List datasets <a class="anchor" id="head-5"></a>

In [None]:
pattern = os.path.join(home, 'shared', 'users', os.environ['USER'], 'datasets', '*', 'metadata.json')

datasets = []
for metadata_path in glob.glob(pattern):
    with open(metadata_path, 'r') as metadata_file:
        datasets.append(json.load(metadata_file))

print(json.dumps(datasets, indent=2))

In [None]:
# Utility function used by the upcoming cells: stream a log file until a line equal to "EOF" appears
def my_tail(logs_dir, log_file):
    %env LOG_FILE={logs_dir}/{log_file}
    ! mkdir -p {logs_dir}
    ! [ ! -f "$LOG_FILE" ] && touch $LOG_FILE && chmod 666 $LOG_FILE
    ! tail -f -n +1 $LOG_FILE | while read LINE; do echo "$LINE"; [[ "$LINE" == "EOF" ]] && pkill -P $$ tail; done
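
`my_tail` relies on shell tools (`tail`, `pkill`). Where those are unavailable, the same idea can be sketched in plain Python. `follow_until_sentinel` is a hypothetical helper, and it assumes the log file already exists and that the job writes a line containing only `EOF` when it finishes, as the shell loop above expects.

```python
import time


def follow_until_sentinel(path, sentinel="EOF", poll_seconds=0.5, timeout=None):
    """Yield lines from a growing file until a line equal to `sentinel` appears."""
    deadline = None if timeout is None else time.monotonic() + timeout
    with open(path, "r") as log:
        pending = ""
        while True:
            chunk = log.readline()
            if not chunk:
                # Nothing new yet: give up on timeout, otherwise wait and retry
                if deadline is not None and time.monotonic() > deadline:
                    raise TimeoutError(f"No '{sentinel}' line seen in {path}")
                time.sleep(poll_seconds)
                continue
            pending += chunk
            if not pending.endswith("\n"):
                continue  # partial line; wait for the rest to be written
            line, pending = pending[:-1], ""
            yield line
            if line.strip() == sentinel:
                return
```

A cell could then print the log with `for line in follow_until_sentinel(os.path.join(logs_dir, log_file)): print(line)`.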

### Create a model experiment <a class="anchor" id="head-8"></a>

In [None]:
network_arch = model_name.replace("-","_")
model_id = subprocess.getoutput(f"tao-client {model_name} model-create --network_arch {network_arch} --encryption_key tlt_encode ")
print(model_id)

### Find pretrained model <a class="anchor" id="head-9"></a>

In [None]:
# List all pretrained models for the chosen network architecture
pattern = os.path.join(home, 'shared', 'users', '*', 'models', '*', 'metadata.json')

for ptm_metadata_path in glob.glob(pattern):
  with open(ptm_metadata_path, 'r') as metadata_file:
    ptm_metadata = json.load(metadata_file)
    metadata_network_arch = ptm_metadata.get("network_arch")
    if metadata_network_arch == network_arch:
      if "encryption_key" not in ptm_metadata.keys():
        print(f'PTM Name: {ptm_metadata["name"]}; PTM version: {ptm_metadata["version"]}; NGC PATH: {ptm_metadata["ngc_path"]}; Additional info: {ptm_metadata["additional_id_info"]}')

In [None]:
# Map each model to its default pretrained model (PTM)
# Using the output of the previous cell, edit this map if you want to change the default PTM backbone.
# Changing the default backbone also requires matching changes to the default specs/configs for train, evaluate, etc.
# For example, if you change the PTM to resnet34, manually set the config key num_layers (if it exists) to 34
pretrained_map = {"mal" : "mask_auto_label:trainable_v1.0"}
no_ptm_models = set()

In [None]:
if model_name not in no_ptm_models:
    pattern = os.path.join(home, 'shared', 'users', '*', 'models', '*', 'metadata.json')

    ptm_id = []
    for ptm_metadata_path in glob.glob(pattern):
      with open(ptm_metadata_path, 'r') as metadata_file:
        ptm_metadata = json.load(metadata_file)
        ngc_path = ptm_metadata.get("ngc_path") or ""  # entries without an NGC path cannot match
        metadata_network_arch = ptm_metadata.get("network_arch")
        if metadata_network_arch == network_arch and ngc_path.endswith(pretrained_map[model_name]):
          ptm_id = [ptm_metadata["id"]]
          break

    print(ptm_id)
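
The matching rule used above (same `network_arch`, and an `ngc_path` ending with the mapped PTM name) can be isolated into a small function that is easy to test without the shared volume. This is a sketch only: `select_ptm_id` is a hypothetical helper and the metadata entries below are made up.

```python
def select_ptm_id(metadata_entries, network_arch, ngc_suffix):
    """Return [id] of the first entry matching the architecture and NGC suffix, else []."""
    for entry in metadata_entries:
        # Treat a missing or null ngc_path as an empty string so it never matches
        ngc_path = entry.get("ngc_path") or ""
        if entry.get("network_arch") == network_arch and ngc_path.endswith(ngc_suffix):
            return [entry["id"]]
    return []


# Made-up metadata entries, for illustration only
entries = [
    {"id": "model-a", "network_arch": "mal", "ngc_path": None},
    {"id": "model-b", "network_arch": "other", "ngc_path": "x/mask_auto_label:trainable_v1.0"},
    {"id": "model-c", "network_arch": "mal", "ngc_path": "x/mask_auto_label:trainable_v1.0"},
]
print(select_ptm_id(entries, "mal", "mask_auto_label:trainable_v1.0"))
```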

### Customize model metadata <a class="anchor" id="head-10"></a>

In [None]:
metadata_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'metadata.json')

with open(metadata_path , "r") as metadata_file:
    metadata = json.load(metadata_file)

metadata["train_datasets"] = [train_dataset_id]
metadata["eval_dataset"] = eval_dataset_id
metadata["inference_dataset"] = eval_dataset_id
if model_name not in no_ptm_models:
    metadata["ptm"] = ptm_id

with open(metadata_path, "w") as metadata_file:
    json.dump(metadata, metadata_file, indent=2)

print(json.dumps(metadata, indent=2))

### View hyperparameters that are enabled for AutoML by default <a class="anchor" id="head-11"></a>

In [None]:
if automl_enabled:
    # View default automl specs enabled
    ! tao-client {model_name} model-automl-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/automl_defaults.json

### Set AutoML related configurations <a class="anchor" id="head-12"></a>
Refer to these links for the parameters each network supports, and add parameters beyond the AutoML defaults if needed: [Mask RCNN](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_action_specs.html#id39), [Unet](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_action_specs.html#id69)

In [None]:
if automl_enabled:
    # Choose the AutoML algorithm: "Bayesian" or "HyperBand".
    automl_algorithm="Bayesian" # FIXME7 example: Bayesian/HyperBand

    metric = "kpi" # Don't change this; multiple metrics will be supported in the future
    additional_automl_parameters = [] # Refer to the parameter lists linked above and add any extra parameters beyond the defaults
    remove_default_automl_parameters = [] # List any hyperparameters enabled by default for AutoML that you want to remove

    metadata["automl_algorithm"] = automl_algorithm
    metadata["automl_enabled"] = automl_enabled
    metadata["metric"] = metric
    metadata["automl_add_hyperparameters"] = str(additional_automl_parameters)
    metadata["automl_remove_hyperparameters"] = str(remove_default_automl_parameters)

    with open(metadata_path, "w") as metadata_file:
        json.dump(metadata, metadata_file, indent=2)

    print(json.dumps(metadata, indent=2))
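
The AutoML fields written above can be assembled and validated in one place before they are merged into the model metadata. This is a minimal sketch: `build_automl_metadata` is a hypothetical helper, the algorithm names are taken from the comment in the cell above, and `hypothetical_param` is a placeholder rather than a real hyperparameter name.

```python
# Algorithm names as spelled in this notebook's AutoML cell
SUPPORTED_ALGORITHMS = ("Bayesian", "HyperBand")


def build_automl_metadata(algorithm, add_params=(), remove_params=(), metric="kpi"):
    """Build the AutoML metadata fields, rejecting unknown algorithm names."""
    if algorithm not in SUPPORTED_ALGORITHMS:
        raise ValueError(f"automl_algorithm must be one of {SUPPORTED_ALGORITHMS}")
    return {
        "automl_enabled": True,
        "automl_algorithm": algorithm,
        "metric": metric,
        # The notebook stores these lists as strings, so do the same here
        "automl_add_hyperparameters": str(list(add_params)),
        "automl_remove_hyperparameters": str(list(remove_params)),
    }


automl_fields = build_automl_metadata("Bayesian", add_params=["hypothetical_param"])
print(automl_fields)
```

The result can be merged with `metadata.update(automl_fields)` before the metadata file is written.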

### Provide train specs <a class="anchor" id="head-13"></a>

In [None]:
# Default train model specs
! tao-client {model_name} model-train-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/train.json

In [None]:
# Customize train model specs
specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', 'train.json')

with open(specs_path , "r") as specs_file:
    specs = json.load(specs_file)

# Apply changes for any of the parameters listed in the previous cell as required
specs["gpu_ids"] = [0]

with open(specs_path, "w") as specs_file:
    json.dump(specs, specs_file, indent=2)

print(json.dumps(specs, indent=2))
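
When several spec values need to change, a small helper that applies dotted-key overrides and rejects misspelled keys can prevent silent typos. This is a sketch under stated assumptions: `apply_overrides` is a hypothetical helper, and apart from `gpu_ids` the key names below are invented for illustration; the real keys are those printed by the defaults cell.

```python
import copy


def apply_overrides(specs, overrides):
    """Return a copy of `specs` with dotted-key overrides applied.

    Raises KeyError when a key does not already exist, which catches typos.
    """
    updated = copy.deepcopy(specs)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = updated
        for part in parents:
            node = node[part]  # KeyError if an intermediate key is missing
        if leaf not in node:
            raise KeyError(dotted_key)
        node[leaf] = value
    return updated


# Invented spec layout, for illustration only
example_specs = {"gpu_ids": [0, 1], "train": {"num_epochs": 5}}
print(apply_overrides(example_specs, {"gpu_ids": [0], "train.num_epochs": 10}))
```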

### Run train <a class="anchor" id="head-14"></a>

In [None]:
train_job_id = subprocess.getoutput(f"tao-client {model_name} model-train --id " + model_id)
print(train_job_id)

In [None]:
# Monitor job status
if automl_enabled:    
    # Set poll_automl_stats to True to see only progress stats, such as the time left and the number of epochs remaining.
    # Set poll_automl_stats to False to skip the stats and view the training logs instead. Viewing training logs is supported only for Bayesian.

    # For AutoML: training times for different models, benchmarked on a machine with one V100 GPU, can be found here: https://docs.nvidia.com/tao/tao-toolkit/text/automl/automl.html#results-of-automl-experiments
    
    poll_automl_stats = True
    if poll_automl_stats:
        import time
        from IPython.display import clear_output
        stats_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, train_job_id, "automl_metadata.json")
        controller_json_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, train_job_id, "controller.json")
        while True:
            time.sleep(15)
            clear_output(wait=True)
            if os.path.exists(stats_path):
                try:
                    with open(stats_path , "r") as stats_file:
                        stats_dict = json.load(stats_file)
                    print(json.dumps(stats_dict, indent=2))
                    if float(stats_dict["Number of epochs yet to start"]) == 0.0:
                        break
                except (json.JSONDecodeError):
                    print("Stats computed are being written to file. Stats will be visible on screen in a few seconds")
    else:
        # Print the log file - supported only for bayesian (the file won't exist until the backend Toolkit container is running -- can take several minutes)
        if automl_algorithm == "Bayesian":
            logs_dir = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id)
            max_recommendations = metadata.get("automl_max_recommendations",20)
            for experiment_num in range(max_recommendations):
                log_file = f"{train_job_id}/experiment_{experiment_num}/log.txt"
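                # Wait for this experiment's log to appear, sleeping between checks instead of busy-waiting
                import time
                while not os.path.exists(os.path.join(logs_dir, log_file)):
                    time.sleep(5)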
                print(f"\n\nViewing experiment {experiment_num}\n\n")
                my_tail(logs_dir, log_file)
    
else:
    # Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
    logs_dir = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'logs')
    log_file = f"{train_job_id}.txt"

    my_tail(logs_dir, log_file)

In [None]:
## To Stop an AutoML JOB
#    1. Stop the 'Monitor job status' cell (the cell right before this cell) manually
#    2. Uncomment the snippet in the next cell and run the cell

In [None]:
# if automl_enabled:
#     canceled_job_id = subprocess.getoutput(f"tao-client {model_name} model-job-cancel --id {model_id} --job {train_job_id}")
#     print(canceled_job_id)

In [None]:
## Resume AutoML

In [None]:
# To resume a stopped AutoML job, uncomment the snippet below, run it, and then rerun the 'Monitor job status' cell (four cells above this one)
# if automl_enabled:
#     resumed_job_id = subprocess.getoutput(f"tao-client {model_name} model-job-resume --id {model_id} --job {train_job_id}")
#     print(resumed_job_id)

### View checkpoint files <a class="anchor" id="head-15"></a>

In [None]:
# View the checkpoints generated by the training job. For AutoML jobs, also view the best-performing model's config and the results of all AutoML experiments.

job_dir = f"{home}/shared/users/{os.environ['USER']}/models/{model_id}/{train_job_id}"
model_path = job_dir

if automl_enabled:
    !python3 -m pip install pandas==1.5.1
    import pandas as pd
    import glob
    model_path =  f"{job_dir}/best_model"

import time
from IPython.display import clear_output

while True:
    time.sleep(5)  # pause between checks instead of busy-waiting
    clear_output(wait=True)
    if os.path.exists(model_path) and len(os.listdir(model_path)) > 0:
        #List the binary model file
        print("\nCheckpoints for the training experiment")
        if os.path.exists(model_path+"/weights") and len(os.listdir(model_path+"/weights")) > 0:
            print(f"Folder: {model_path}/weights")
            print("Files:", os.listdir(model_path+"/weights"))
        else:
            print(f"Folder: {model_path}")
            print("Files:", os.listdir(model_path))

        if automl_enabled:
            if os.path.exists(f"{model_path}/controller.json") and len(glob.glob(os.path.join(model_path,"*.kitti"))) > 0:
                with open(f"{model_path}/controller.json", "r") as controller_file:
                    experiment_artifacts = json.load(controller_file)
                data_frame = pd.DataFrame(experiment_artifacts)
                # Print experiment id/number and the corresponding result
                print("\nResults of all experiments")
                with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.max_colwidth', None):
                    print(data_frame[["id","result"]])

                print("\nConfig/Spec file for the best performing experiment (recommendation_id.kitti with the maximum result value in the dataframe)")
                # List the recommendation config file of the best performing checkpoint(recommendation_id.kitti with the maximum result value in the dataframe)
                !ls {model_path}/*.kitti 
                break
        else:
            break
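
The cell above prints every experiment's result and describes the best one as the recommendation with the maximum result value. Picking that entry can be sketched in plain Python. `best_experiment` is a hypothetical helper, and it assumes `controller.json` is a list of records with `id` and `result` fields, as the DataFrame columns above suggest; the sample values are made up.

```python
def best_experiment(records):
    """Return the record with the highest numeric `result`, or None if there is none."""
    scored = [
        record for record in records
        if isinstance(record.get("result"), (int, float))
    ]
    if not scored:
        return None
    return max(scored, key=lambda record: record["result"])


# Made-up records, for illustration only
sample_records = [
    {"id": 0, "result": 0.41},
    {"id": 1, "result": 0.57},
    {"id": 2, "result": None},  # for example, an experiment that has not finished
]
print(best_experiment(sample_records))
```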

### Provide evaluate specs <a class="anchor" id="head-16"></a>

In [None]:
# Default evaluate model specs
! tao-client {model_name} model-evaluate-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/evaluate.json

In [None]:
# Customize evaluate model specs
specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', 'evaluate.json')

with open(specs_path , "r") as specs_file:
    specs = json.load(specs_file)

# Change any spec if you wish

with open(specs_path, "w") as specs_file:
    json.dump(specs, specs_file, indent=2)

print(json.dumps(specs, indent=2))

### Run evaluate <a class="anchor" id="head-17"></a>

In [None]:
eval_job_id = subprocess.getoutput(f"tao-client {model_name} model-evaluate --id {model_id} --job {train_job_id}")
print(eval_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
log_file = f"{eval_job_id}.txt"
logs_dir = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'logs')
my_tail(logs_dir, log_file)

### Provide TAO inference specs <a class="anchor" id="head-28"></a>

In [None]:
# Default inference model specs
! tao-client {model_name} model-inference-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/inference.json

In [None]:
# Customize TAO inference specs
specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', 'inference.json')

with open(specs_path , "r") as specs_file:
    specs = json.load(specs_file)

#Apply changes to the specs dictionary here if required

with open(specs_path, "w") as specs_file:
    json.dump(specs, specs_file, indent=2)

print(json.dumps(specs, indent=2))

### Run TAO inference <a class="anchor" id="head-29"></a>

In [None]:
tlt_inference_job_id = subprocess.getoutput(f"tao-client {model_name} model-inference --id {model_id} --job {train_job_id}")
print(tlt_inference_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
log_file = f"{tlt_inference_job_id}.txt"
logs_dir = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'logs')
my_tail(logs_dir, log_file)

In [None]:
job_dir = f"{home}/shared/users/{os.environ['USER']}/models/{model_id}/{tlt_inference_job_id}"
!ls $job_dir/inference.json

### Delete experiment <a class="anchor" id="head-32"></a>

In [None]:
! rm -rf ~/shared/users/{os.environ['USER']}/models/{model_id}
! echo DONE

### Delete datasets <a class="anchor" id="head-33"></a>

In [None]:
# Inference reused the evaluation dataset, so only the train and eval datasets need to be removed
! rm -rf ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}
! rm -rf ~/shared/users/{os.environ['USER']}/datasets/{eval_dataset_id}
! echo DONE

### Unmount shared volume <a class="anchor" id="head-34"></a>

In [None]:
command = "umount ~/shared"
! echo {password} | sudo -S -k {command} && echo DONE

### Uninstall TAO Remote Client <a class="anchor" id="head-35"></a>

In [None]:
! pip3 uninstall -y nvidia-tao-client