## TAO remote client - Data-Services
### The workflow in a nutshell
TAO Data Services include 4 key pipelines:
1. Offline data augmentation using DALI
2. Auto labeling using TAO Mask Auto-labeler (MAL)
3. Annotation conversion
4. Groundtruth analytics

## Learning Objectives

In this notebook, you will learn how to leverage the simplicity and convenience of TAO to:

* Convert KITTI dataset to COCO format
* Run auto-labeling to generate pseudo masks for KITTI bounding boxes
* Apply data augmentation to the KITTI dataset with bounding box refinement
* Run data analytics to collect useful statistics on the original and augmented KITTI dataset

### Table of contents

1. [Create a cloud workspace](#head-2)
1. [Convert KITTI data to COCO format](#head-1)
1. [Generate pseudo-masks with the auto-labeler](#head-2)
1. [Apply data augmentation](#head-3)
1. [Perform data analytics](#head-4)
1. [Perform data validation](#head-5)


### Requirements
Please find the server requirements [here](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#)

### Install TAO remote client

In [None]:
# # SKIP this step IF you have already installed the TAO-Client wheel.
! pip3 install nvidia-tao-client

In [None]:
# # View the version of the TAO-Client
! tao-client --version

In [None]:
import os
import glob
import subprocess
import json
import ast
import time
from IPython.display import clear_output

In [None]:
namespace = 'default'

### FIXME's <a class="anchor" id="head-2"></a>

1. Assign a workdir in FIXME 1
1. Assign the ip_address and port_number in FIXME 2 ([info](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_rest_api.html))
1. Assign the ngc_key variable in FIXME 3
1. Assign the ngc_org_name variable in FIXME 4
1. Set cloud storage details in FIXME 5
1. Assign path of kitti dataset relative to the bucket in FIXME 6
1. (Optional) Set the database backup/restore archive filename in FIXME 7

In [None]:
workdir = "workdir_data_services" # FIXME1
# Creating workdir
if not os.path.isdir(workdir):
    os.makedirs(workdir)

#### Set API service's host information

In [None]:
host_url = "http://<ip_address>:<port_number>" # FIXME2 example: https://10.137.149.22:32334
# In host machine, node ip_address and port number can be obtained as follows,
# ip_address: hostname -i
# port_number: kubectl get service tao-api-ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'

#### Set NGC Personal key for authentication and NGC org to access API services

In [None]:
ngc_key = "<ngc_key>" # FIXME3 example: (Add NGC Personal key)

In [None]:
ngc_org_name = "ea-tlt" # FIXME4 your NGC ORG

### Login <a class="anchor" id="head-3"></a>

In [None]:
%env BASE_URL={host_url}/{namespace}/api/v1

# Exchange NGC_API_KEY for JWT
identity = json.loads(subprocess.getoutput(f"tao-client login --ngc-key {ngc_key} --ngc-org-name {ngc_org_name} --enable-telemetry"))

%env USER={identity['user_id']}
%env TOKEN={identity['token']}
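
If the login call fails, `subprocess.getoutput` returns an error message instead of JSON, and `json.loads` then raises an exception that hides the real cause. The sketch below shows one way to parse the response more defensively. `parse_login_response` is a hypothetical helper, not part of the TAO client, and it assumes only that a successful response is a JSON object with `user_id` and `token` keys, as the cell above implies.

```python
import json


def parse_login_response(raw):
    """Return (user_id, token) from the raw login output, or raise ValueError."""
    try:
        identity = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Show the start of the raw output so the underlying error is visible
        raise ValueError(f"Login did not return JSON: {raw[:200]!r}") from exc
    if not isinstance(identity, dict):
        raise ValueError(f"Login returned unexpected JSON: {identity!r}")
    missing = [key for key in ("user_id", "token") if key not in identity]
    if missing:
        raise ValueError(f"Login response is missing {missing}")
    return identity["user_id"], identity["token"]


# Made-up response for illustration; real values come from `tao-client login`
sample = '{"user_id": "1234", "token": "abc.def.ghi"}'
user_id, token = parse_login_response(sample)
print(user_id)
```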

### Get NVCF gpu details <a class="anchor" id="head-2"></a>

Use one of the keys of the response JSON as the `platform_id` when you run each job.

In [None]:
# # Valid only for NVCF backend during TAO-API helm deployment currently
# response = json.loads(subprocess.getoutput(f'tao get-gpu-types'))
# print((json.dumps(response, indent=4)))

### Create cloud workspace
This workspace is where your datasets reside and where the results of TAO API jobs are pushed.

If you want separate workspaces for datasets and experiments, duplicate the workspace creation cells and adjust the metadata accordingly.

In [None]:
#FIXME5 Dataset Cloud bucket details to download dataset for experiments (Can be read only)
workspace_name = "AWS workspace info"  # A Representative name for this cloud info
cloud_type = "aws"  # If it's AWS, HuggingFace or Azure

cloud_metadata = {}
cloud_metadata["cloud_region"] = "us-west-1"  # Bucket region
cloud_metadata["cloud_bucket_name"] = ""  # Bucket name
# Access and Secret for AWS
cloud_metadata["access_key"] = ""
cloud_metadata["secret_key"] = ""
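
Before creating the workspace, it can help to confirm that the placeholders above have been filled in, since empty credentials only surface later as a failed dataset pull. The following is a minimal sketch: `missing_cloud_fields` is a hypothetical helper, and the list of required keys is an assumption based on the AWS-style fields used in this notebook (other cloud types may need different fields).

```python
def missing_cloud_fields(
    metadata,
    required=("cloud_region", "cloud_bucket_name", "access_key", "secret_key"),
):
    """Return the required keys that are absent or empty in `metadata`."""
    return [key for key in required if not str(metadata.get(key, "")).strip()]


example_metadata = {
    "cloud_region": "us-west-1",
    "cloud_bucket_name": "",  # left empty on purpose for the example
    "access_key": "",
    "secret_key": "",
}
print(missing_cloud_fields(example_metadata))
# ['cloud_bucket_name', 'access_key', 'secret_key']
```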

In [None]:
workspace_id = subprocess.getoutput(f"tao-client annotations workspace-create --name '{workspace_name}' --cloud_type {cloud_type} --cloud_details '{json.dumps(cloud_metadata)}'")
print(workspace_id)

In [None]:
# #Optional: Restore database with a mongodump file saved in workspace dump/archive/{backup_filename}
# backup_file_name = "mongodump.tar.gz" # FIXME 7
# response = subprocess.getoutput(f"tao-client annotations workspace-restore --workspace {workspace_id} --backup_file_name {backup_file_name}")
# print(response)

### Function to parse logs <a class="anchor" id="head-1.1"></a>

In [None]:
def my_tail(model_name_cli, id, job_id, job_type, workdir):
    """Poll a job every 10 seconds, print its logs, and return its final status."""
    terminal_statuses = ("Done", "Error", "Canceled", "Paused")
    status = None
    while True:
        time.sleep(10)
        clear_output(wait=True)
        # Stop as soon as the API reports a terminal status
        response = subprocess.getoutput(f"tao-client {model_name_cli} get-action-status --job_type {job_type} --id {id} --job {job_id}")
        response = json.loads(response)
        if response and response.get("status") in terminal_statuses:
            print(json.dumps(response, indent=4))
            status = response.get("status")
            break

        # Otherwise print the latest logs and look for an end-of-file marker
        logs = subprocess.getoutput(f"tao-client {model_name_cli} get-job-logs --id {id} --job {job_id} --job_type {job_type} --workdir {workdir}")
        if not logs:
            continue
        for line in logs.split("\n"):
            print(line.strip())
            if line.strip() == "Error EOF":
                status = "Error"
                break
            elif line.strip() == "Done EOF":
                status = "Done"
                break
        if status is not None:
            break
    return status
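
The log-scanning part of `my_tail` can be tested without a running job if it is pulled out into a pure function. The sketch below does that; `status_from_logs` is a hypothetical helper, and it relies only on the `Done EOF` and `Error EOF` markers that the function above already checks for.

```python
def status_from_logs(logs):
    """Return "Done" or "Error" if the logs contain an EOF marker, else None."""
    for line in logs.splitlines():
        marker = line.strip()
        if marker == "Error EOF":
            return "Error"
        if marker == "Done EOF":
            return "Done"
    return None


print(status_from_logs("step 1\nstep 2\nDone EOF"))  # Done
print(status_from_logs("still running"))             # None
```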

## 1. Convert KITTI data to COCO format <a class="anchor" id="head-1"></a>
First, convert the dataset from KITTI to COCO format.

### Define the task and action

### Create dataset
Both KITTI and COCO data formats are supported.

A KITTI dataset follows the directory structure shown below:
```
$DATA_DIR/dataset
├── images
│   ├── image_name_1.jpg
│   ├── image_name_2.jpg
│   ├── ...
└── labels
    ├── image_name_1.txt
    ├── image_name_2.txt
    ├── ...
```

A COCO dataset follows the directory structure shown below:
```
$DATA_DIR/dataset
├── images
│   ├── image_name_1.jpg
│   ├── image_name_2.jpg
│   ├── ...
└── annotations.json
```
This notebook uses the KITTI object detection dataset. For more details, see the [KITTI 2D object detection benchmark](http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=2d).
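
To make the difference between the two label formats concrete: a KITTI label line stores each box as left, top, right, bottom pixel coordinates (fields 5 to 8), while COCO stores `[x, y, width, height]`. The sketch below converts a single line. It illustrates the coordinate change only and is not the conversion code that the TAO action runs; `kitti_line_to_coco_bbox` and the sample line are made up for the example.

```python
def kitti_line_to_coco_bbox(line):
    """Parse one KITTI label line and return (class_name, [x, y, width, height])."""
    fields = line.split()
    if len(fields) < 8:
        raise ValueError(f"Expected at least 8 fields, got {len(fields)}")
    class_name = fields[0]
    # KITTI fields 4..7 (0-based) are left, top, right, bottom in pixels
    left, top, right, bottom = (float(value) for value in fields[4:8])
    return class_name, [left, top, right - left, bottom - top]


sample = "car 0.00 0 0.00 100.00 120.00 180.00 200.00 0 0 0 0 0 0 0"
print(kitti_line_to_coco_bbox(sample))
# ('car', [100.0, 120.0, 80.0, 80.0])
```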

### Create a kitti Dataset

In [None]:
# FIXME6 : Set path relative to cloud bucket
kitti_dataset_path =  "/data/tao_od_synthetic_subset_train_convert_cleaned/"

In [None]:
# Create dataset
model_name = "annotations"
kitti_dataset_id = subprocess.getoutput(f"tao-client {model_name} dataset-create --dataset_type object_detection --dataset_format kitti --workspace {workspace_id} --cloud_file_path {kitti_dataset_path} --use_for '{json.dumps(['testing'])}'")
print(kitti_dataset_id)

In [None]:
# Check progress
while True:
    clear_output(wait=True)
    response = subprocess.getoutput(f"tao-client {model_name} get-metadata --id {kitti_dataset_id} --job_type dataset")
    response = json.loads(response)
    print(json.dumps(response, sort_keys=True, indent=4))
    if response.get("status") == "invalid_pull":
        raise ValueError("Dataset pull failed")
    if response.get("status") == "pull_complete":
        break
    time.sleep(5)
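
The loop above runs until the pull succeeds or fails, so a stuck pull blocks the notebook indefinitely. One option is to add a timeout, as in the sketch below. `wait_for_status` is a hypothetical helper; the caller supplies a function that returns the current status string (for example, by running `get-metadata` and reading `response.get("status")`). The intermediate status names in the example are invented.

```python
import time


def wait_for_status(fetch_status, done="pull_complete", failed="invalid_pull",
                    interval=5.0, timeout=1800.0):
    """Poll `fetch_status()` until it returns `done`; raise on failure or timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status == failed:
            raise ValueError("Dataset pull failed")
        if status == done:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Status is still {status!r} after {timeout} seconds")
        time.sleep(interval)


# Stand-in for a real status lookup; these intermediate names are made up
fake_statuses = iter(["starting", "in_progress", "pull_complete"])
print(wait_for_status(lambda: next(fake_statuses), interval=0.01))
```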

### Dataset format conversion action 


#### Get specs


In [None]:
# Default model specs
annotation_conversion_specs = subprocess.getoutput(f"tao-client {model_name} get-spec --action annotation_format_convert --job_type dataset --id {kitti_dataset_id}")
annotation_conversion_specs = json.loads(annotation_conversion_specs)
print(json.dumps(annotation_conversion_specs, indent=4))

In [None]:
# Set specs
annotation_conversion_specs["data"]["input_format"] = "KITTI"
annotation_conversion_specs["data"]["output_format"] = "COCO"
print(json.dumps(annotation_conversion_specs, indent=4))
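
Assigning to `specs["data"][...]` raises a `KeyError` if the default spec has no `data` section. If you edit several nested keys, a small helper that creates missing sections can make the cell more robust. This is a sketch; `set_spec_value` is a hypothetical helper and the dotted-key syntax is a convention chosen for this example.

```python
def set_spec_value(specs, dotted_key, value):
    """Set specs["a"]["b"] = value for dotted_key "a.b", creating dicts as needed."""
    node = specs
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        child = node.setdefault(key, {})
        if not isinstance(child, dict):
            raise TypeError(f"Cannot descend into non-dict value at {key!r}")
        node = child
    node[leaf] = value
    return specs


example_specs = {}
set_spec_value(example_specs, "data.input_format", "KITTI")
set_spec_value(example_specs, "data.output_format", "COCO")
print(example_specs)
# {'data': {'input_format': 'KITTI', 'output_format': 'COCO'}}
```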

#### Run action 


In [None]:
# Add --platform_id uuid for NVCF backend, where the uuid is a key from output of tao-client gpu-types
# Run action
coco_dataset_id = kitti_dataset_id
convert_job_id = subprocess.getoutput(f"tao-client {model_name} dataset-run-action --id {kitti_dataset_id} --action annotation_format_convert --specs '{json.dumps(annotation_conversion_specs)}'")
print(convert_job_id)
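
The command above wraps the JSON specs in single quotes, which breaks if any spec value contains an apostrophe. A more robust way to build the same command string is to quote each argument with `shlex.quote`, as sketched below. `build_run_action_cmd` is a hypothetical helper, and it uses only the flags that already appear in this notebook.

```python
import json
import shlex


def build_run_action_cmd(model_name, dataset_id, action, specs, parent_job_id=None):
    """Build a shell-safe `tao-client ... dataset-run-action` command string."""
    parts = ["tao-client", model_name, "dataset-run-action",
             "--id", dataset_id, "--action", action]
    if parent_job_id:
        parts += ["--parent_job_id", parent_job_id]
    parts += ["--specs", json.dumps(specs)]
    # Quote every argument so spaces and apostrophes survive the shell
    return " ".join(shlex.quote(part) for part in parts)


specs = {"data": {"input_format": "KITTI", "note": "it's quoted"}}
cmd = build_run_action_cmd("annotations", "dataset-123", "annotation_format_convert", specs)
print(cmd)
```

The resulting string can be passed to `subprocess.getoutput` in place of the hand-written f-string.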

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
status = my_tail(model_name, kitti_dataset_id, convert_job_id, "dataset", workdir)

In [None]:
# After the action completes, the dataset format changes from KITTI to COCO
print(subprocess.getoutput(f"tao-client {model_name} get-metadata --id {kitti_dataset_id} --job_type dataset"))

## 2. Generate pseudo-masks with the auto-labeler <a class="anchor" id="head-2"></a>
Here we will use a pretrained MAL model to generate pseudo-masks for the converted KITTI data. 

### Define the task and action

### Create a COCO dataset - if you already have data in COCO detection format (without masks) and skipped step 1

In [None]:
# # Create dataset
# model_name = "annotations"
# coco_dataset_id = subprocess.getoutput(f"tao-client {model_name} dataset-create --dataset_type object_detection --dataset_format coco --workspace {workspace_id} --cloud_file_path {coco_dataset_path} --use_for '{json.dumps(['testing'])}'")
# print(coco_dataset_id)

In [None]:
# # Check progress
# while True:
#     clear_output(wait=True)
#     response = subprocess.getoutput(f"tao-client {model_name} get-metadata --id {coco_dataset_id} --job_type dataset")
#     response = json.loads(response)
#     print(json.dumps(response, sort_keys=True, indent=4))
#     if response.get("status") == "invalid_pull":
#         raise ValueError("Dataset pull failed")
#     if response.get("status") == "pull_complete":
#         break
#     time.sleep(5)

### Assign PTM

In [None]:
# List all pretrained models for the chosen network architecture
model_name = "auto_label"
filter_params = {"network_arch": "mal"}
message = subprocess.getoutput(f"tao-client {model_name} list-base-experiments --filter_params '{json.dumps(filter_params)}'")
message = ast.literal_eval(message)
for rsp in message:
    rsp_keys = rsp.keys()
    if "encryption_key" not in rsp.keys():
        assert "name" in rsp_keys and "version" in rsp_keys and "ngc_path" in rsp_keys
        print(f'PTM Name: {rsp["name"]}; PTM version: {rsp["version"]}; NGC PATH: {rsp["ngc_path"]}')

In [None]:
pretrained_map = {"auto_label" : "mask_auto_label:trainable_v1.0"}

In [None]:
filter_params = {"network_arch": "mal"}
message = subprocess.getoutput(f"tao-client {model_name} list-base-experiments --filter_params '{json.dumps(filter_params)}'")
message = ast.literal_eval(message)
ptm = []
for rsp in message:
    rsp_keys = rsp.keys()
    assert "ngc_path" in rsp_keys
    if rsp["ngc_path"].endswith(pretrained_map[model_name]):
        assert "id" in rsp_keys
        ptm_id = rsp["id"]
        ptm = [ptm_id]
        print("Metadata for model with requested NGC Path")
        print(rsp)
        break
print(ptm)
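
The lookup above can be written as a pure function, which makes it easy to check without calling the API. In the sketch below, `find_ptm_id` is a hypothetical helper and the experiment entries are made up; they contain only the `id` and `ngc_path` fields that the cell above relies on.

```python
def find_ptm_id(experiments, ngc_path_suffix):
    """Return the id of the first experiment whose ngc_path ends with the suffix."""
    for experiment in experiments:
        if experiment.get("ngc_path", "").endswith(ngc_path_suffix):
            return experiment.get("id")
    return None


# Made-up entries for illustration only
fake_experiments = [
    {"id": "aaa", "ngc_path": "org/team/other_model:v1"},
    {"id": "bbb", "ngc_path": "org/team/mask_auto_label:trainable_v1.0"},
]
print(find_ptm_id(fake_experiments, "mask_auto_label:trainable_v1.0"))  # bbb
```

Returning `None` when nothing matches lets the caller stop early instead of patching the dataset with an empty base experiment list.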

In [None]:
ptm_information = {"base_experiment":ptm}
patched_model = subprocess.getoutput(f"tao-client {model_name} patch-artifact-metadata --id {coco_dataset_id} --job_type dataset --update_info '{json.dumps(ptm_information)}' ")
print(patched_model)

### Auto labeling action

#### Get specs

In [None]:
# Default model specs
auto_label_generate_specs = subprocess.getoutput(f"tao-client {model_name} get-spec --action auto_label --job_type dataset --id {coco_dataset_id}")
auto_label_generate_specs = json.loads(auto_label_generate_specs)
print(json.dumps(auto_label_generate_specs, indent=4))

In [None]:
# Set specs
auto_label_generate_specs["gpu_ids"] = [0]
print(json.dumps(auto_label_generate_specs, indent=4))

### Run action

In [None]:
# Add --platform_id uuid for NVCF backend, where the uuid is a key from output of tao-client gpu-types
# Run action
coco_mask_dataset_id = kitti_dataset_id
parent = convert_job_id
auto_labeling_job_id = subprocess.getoutput(f"tao-client {model_name} dataset-run-action --id {coco_dataset_id} --parent_job_id {parent} --action auto_label --specs '{json.dumps(auto_label_generate_specs)}'")
print(auto_labeling_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
status = my_tail(model_name, coco_dataset_id, auto_labeling_job_id, "dataset", workdir)

## 3. Apply data augmentation <a class="anchor" id="head-3"></a>
In this section, we run offline augmentation with the original dataset. During the augmentation process, we can use the pseudo-masks generated from the last step to refine the distorted or rotated bounding boxes.
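
To see why masks help here: when an image is rotated, the axis-aligned box that encloses the rotated original box is usually looser than the object itself, whereas a mask lets you recompute a tight box from the object's pixels. The sketch below shows that last step on a tiny binary mask. It illustrates the general idea only and is not TAO's implementation; `bbox_from_mask` is a hypothetical helper.

```python
def bbox_from_mask(mask):
    """Return the tight [x, y, width, height] box around nonzero pixels, or None."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None  # empty mask, so there is no box
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    x_min, y_min = cols[0], rows[0]
    return [x_min, y_min, cols[-1] - x_min + 1, rows[-1] - y_min + 1]


mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
print(bbox_from_mask(mask))  # [1, 1, 3, 3]
```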

### Define the task and action

### Create a coco mask Dataset - If you already have data in coco segmentation format and skipped step 1 and 2

In [None]:
# # Create dataset
# model_name = "annotations"
# coco_mask_dataset_id = subprocess.getoutput(f"tao-client {model_name} dataset-create --dataset_type object_detection --dataset_format coco  --workspace {workspace_id} --cloud_file_path {coco_mask_dataset_path} --use_for '{json.dumps(['testing'])}'")
# print(coco_mask_dataset_id)

In [None]:
# # Check progress
# while True:
#     clear_output(wait=True)
#     response = subprocess.getoutput(f"tao-client {model_name} get-metadata --id {coco_mask_dataset_id} --job_type dataset")
#     response = json.loads(response)
#     print(json.dumps(response, sort_keys=True, indent=4))
#     if response.get("status") == "invalid_pull":
#         raise ValueError("Dataset pull failed")
#     if response.get("status") == "pull_complete":
#         break
#     time.sleep(5)

### Run data augmentation action


#### Get specs


In [None]:
# Default model specs
augmentation_generate_specs = subprocess.getoutput(f"tao-client {model_name} get-spec --action augment --job_type dataset --id {coco_mask_dataset_id}")
augmentation_generate_specs = json.loads(augmentation_generate_specs)
print(json.dumps(augmentation_generate_specs, indent=4))

In [None]:
# Change any spec key if required
print(json.dumps(augmentation_generate_specs, indent=4))

#### Run action


In [None]:
# Add --platform_id uuid for NVCF backend, where the uuid is a key from output of tao-client gpu-types
# Run action
parent = auto_labeling_job_id
coco_mask_augmented_dataset_id = subprocess.getoutput(f"tao-client {model_name} dataset-run-action --id {coco_mask_dataset_id} --parent_job_id {parent} --action augment --specs '{json.dumps(augmentation_generate_specs)}'")
print(coco_mask_augmented_dataset_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
status = my_tail(model_name, coco_mask_dataset_id, coco_mask_augmented_dataset_id, "dataset", workdir)

In [None]:
# After the augment action you'll get a new dataset
print(subprocess.getoutput(f"tao-client {model_name} get-metadata --id {coco_mask_augmented_dataset_id} --job_type dataset"))

## 4. Perform data analytics  <a class="anchor" id="head-4"></a>
Next, we perform analytics with the KITTI dataset.
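
As a rough idea of the kind of statistic such an analysis can report, the sketch below counts annotations per category in a COCO-style dictionary. It is an illustration only and does not reproduce the output of the TAO analytics action; `category_counts` and the sample data are made up.

```python
from collections import Counter


def category_counts(coco):
    """Count annotations per category name in a COCO-style dictionary."""
    names = {category["id"]: category["name"] for category in coco["categories"]}
    return Counter(names[annotation["category_id"]] for annotation in coco["annotations"])


sample_coco = {
    "categories": [{"id": 1, "name": "car"}, {"id": 2, "name": "person"}],
    "annotations": [
        {"id": 10, "category_id": 1},
        {"id": 11, "category_id": 1},
        {"id": 12, "category_id": 2},
    ],
}
print(dict(category_counts(sample_coco)))  # {'car': 2, 'person': 1}
```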

### Run the annotation analytics action


#### Get specs


In [None]:
# Default model specs
model_name = "analytics"
analytics_analyze_specs = subprocess.getoutput(f"tao-client {model_name} get-spec --action analyze --job_type dataset --id {coco_dataset_id}")
analytics_analyze_specs = json.loads(analytics_analyze_specs)
print(json.dumps(analytics_analyze_specs, indent=4))

In [None]:
# Set specs
analytics_analyze_specs["data"]["input_format"] = "COCO"
print(json.dumps(analytics_analyze_specs, indent=4))

#### Run action


In [None]:
# Add --platform_id uuid for NVCF backend, where the uuid is a key from output of tao-client gpu-types
# Run action
parent = convert_job_id
analyze_job_id = subprocess.getoutput(f"tao-client {model_name} dataset-run-action --id {coco_dataset_id} --action analyze --parent_job_id {parent} --specs '{json.dumps(analytics_analyze_specs)}'")
print(analyze_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
status = my_tail(model_name, coco_dataset_id, analyze_job_id, "dataset", workdir)

## 5. Perform data validation  <a class="anchor" id="head-5"></a>
Next, we validate the annotations and images.
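
For intuition, one typical annotation check is that every box has a positive size and lies inside its image. The sketch below applies that check to a COCO-style dictionary. It is an assumption about what a reasonable check looks like, not a description of what the TAO validation action does; `find_invalid_boxes` and the sample data are made up.

```python
def find_invalid_boxes(coco):
    """Return ids of annotations whose bbox is empty or falls outside its image."""
    sizes = {image["id"]: (image["width"], image["height"]) for image in coco["images"]}
    invalid = []
    for annotation in coco["annotations"]:
        x, y, width, height = annotation["bbox"]
        image_size = sizes.get(annotation["image_id"])
        if image_size is None or width <= 0 or height <= 0:
            invalid.append(annotation["id"])
            continue
        image_width, image_height = image_size
        if x < 0 or y < 0 or x + width > image_width or y + height > image_height:
            invalid.append(annotation["id"])
    return invalid


sample_coco = {
    "images": [{"id": 1, "width": 100, "height": 50}],
    "annotations": [
        {"id": 10, "image_id": 1, "bbox": [10, 10, 20, 20]},  # valid
        {"id": 11, "image_id": 1, "bbox": [90, 10, 20, 20]},  # crosses right edge
        {"id": 12, "image_id": 1, "bbox": [5, 5, 0, 10]},     # zero width
        {"id": 13, "image_id": 2, "bbox": [0, 0, 5, 5]},      # unknown image
    ],
}
print(find_invalid_boxes(sample_coco))  # [11, 12, 13]
```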

### Run the annotation validation action

#### Get specs


In [None]:
# Default model specs
model_name = "analytics"
validate_annotations_specs = subprocess.getoutput(f"tao-client {model_name} get-spec --action validate_annotations --job_type dataset --id {coco_dataset_id}")
validate_annotations_specs = json.loads(validate_annotations_specs)
print(json.dumps(validate_annotations_specs, indent=4))

In [None]:
# Set specs
validate_annotations_specs["data"]["input_format"] = "COCO"
print(json.dumps(validate_annotations_specs, indent=4))

#### Run action


In [None]:
# Add --platform_id uuid for NVCF backend, where the uuid is a key from output of tao-client gpu-types
# Run action
parent = convert_job_id
validate_annotations_job_id = subprocess.getoutput(f"tao-client {model_name} dataset-run-action --id {coco_dataset_id} --action validate_annotations --parent_job_id {parent} --specs '{json.dumps(validate_annotations_specs)}'")
print(validate_annotations_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
status = my_tail(model_name, coco_dataset_id, validate_annotations_job_id, "dataset", workdir)

### Run Data image validation action - removes corrupted images and creates a new dataset

#### Get specs


In [None]:
# Default model specs
model_name = "image"
validate_images_specs = subprocess.getoutput(f"tao-client {model_name} get-spec --action validate_images --job_type dataset --id {kitti_dataset_id}")
validate_images_specs = json.loads(validate_images_specs)
print(json.dumps(validate_images_specs, indent=4))

In [None]:
# Make changes to the specs if necessary

#### Run action


In [None]:
# Add --platform_id uuid for NVCF backend, where the uuid is a key from output of tao-client gpu-types
# Run action
validate_images_job_id = subprocess.getoutput(f"tao-client {model_name} dataset-run-action --id {kitti_dataset_id} --action validate_images --specs '{json.dumps(validate_images_specs)}'")
print(validate_images_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
status = my_tail(model_name, kitti_dataset_id, validate_images_job_id, "dataset", workdir)

In [None]:
# # Optional: Backup database with a mongodump file saved in workspace dump/archive/{backup_filename}
# backup_file_name = "mongodump.tar.gz" # FIXME 7
# subprocess.getoutput(f"tao-client {model_name} workspace-backup --workspace {workspace_id} --backup_file_name {backup_file_name}")

### Delete dataset <a class="anchor" id="head-21"></a>

#### Delete original kitti dataset <a class="anchor" id="head-21"></a>

In [None]:
subprocess.getoutput(f"tao-client {model_name} dataset-delete --id {kitti_dataset_id}")

#### Delete coco augment dataset <a class="anchor" id="head-21"></a>

In [None]:
subprocess.getoutput(f"tao-client {model_name} dataset-delete --id {coco_mask_augmented_dataset_id}")