## TAO remote client - Data-Services
### The workflow in a nutshell
TAO Data Services include 4 key pipelines:
1. Offline data augmentation using DALI
2. Auto labeling using TAO Mask Auto-labeler (MAL)
3. Annotation conversion
4. Groundtruth analytics

## Learning Objectives

In this notebook, you will learn how to leverage the simplicity and convenience of TAO to:

* Convert KITTI dataset to COCO format
* Run auto-labeling to generate pseudo masks for KITTI bounding boxes
* Apply data augmentation to the KITTI dataset with bounding box refinement
* Run data analytics to collect useful statistics on the original and augmented KITTI dataset

### Table of contents

1. [Convert KITTI data to COCO format](#head-1)
2. [Generate pseudo-masks with the auto-labeler](#head-2)
3. [Apply data augmentation](#head-3)
4. [Perform data analytics](#head-4)


### Requirements
Please find the server requirements [here](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#)

In [None]:
import os
import glob
import subprocess
import json
import ast
import time
from IPython.display import clear_output

In [None]:
namespace = 'default'

### Install TAO remote client

In [None]:
# # SKIP this step IF you have already installed the TAO-Client wheel.
! pip3 install nvidia-transfer-learning-client

In [None]:
# # View the version of the TAO-Client
! nvtl --version

### Set the remote service base URL and Token

### FIXME

1. Assign the working directory name (workdir) in FIXME 1
2. Assign the ip_address and port_number in FIXME 2 ([info](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_rest_api.html))
3. Assign the ngc_api_key variable in FIXME 3
4. Assign the path of DATA_DIR in FIXME 4

In [None]:
# Define the node_addr and port number
workdir = "workdir_data_services" # FIXME1
host_url = "http://<ip_address>:<port_number>" # FIXME2 example: https://10.137.149.22:32334
# In host machine, node IP address and port number can be obtained as follows,
# node_addr: hostname -I
# node_port: kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'
ngc_api_key = "<ngc_api_key>"  # FIXME 3 example: (Add NGC API key)
data_dir = "<DATA_DIR>" # FIXME4
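
Before exporting `BASE_URL`, it can help to confirm that the placeholders above were actually replaced. The following is a minimal sketch and not part of the TAO client: `build_base_url` is a hypothetical helper name, the address in the example is made up, and the `/{namespace}/api/v1` suffix simply mirrors the `%env BASE_URL=...` line used in this notebook.

```python
def build_base_url(host_url, namespace="default"):
    """Compose the API base URL, refusing unreplaced <...> placeholders."""
    if "<" in host_url or ">" in host_url:
        raise ValueError(f"host_url still contains a placeholder: {host_url}")
    if not host_url.startswith(("http://", "https://")):
        raise ValueError("host_url should start with http:// or https://")
    # Drop a trailing slash so the joined URL has no double slash
    return f"{host_url.rstrip('/')}/{namespace}/api/v1"


# Example with a made-up address (replace with your own node IP and port)
print(build_base_url("http://10.0.0.1:32080"))
```

Calling the helper with the untouched template string raises a `ValueError`, which surfaces a forgotten FIXME earlier than a failed login would.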

In [None]:
%env BASE_URL={host_url}/{namespace}/api/v1

# Exchange NGC_API_KEY for JWT
identity = json.loads(subprocess.getoutput(f'nvtl login --ngc-api-key {ngc_api_key}'))

%env USER={identity['user_id']}
%env TOKEN={identity['token']}

In [None]:
# Creating workdir
workdir = os.path.abspath(workdir)
if not os.path.isdir(workdir):
    os.makedirs(workdir)

### Function to parse logs <a class="anchor" id="head-1.1"></a>

In [None]:
def my_tail(model_name_cli, id, job_id, job_type, workdir):
	status = None
	while True:
		time.sleep(10)
		clear_output(wait=True)
		log_file_path = subprocess.getoutput(f"nvtl {model_name_cli} get-log-file --id {id} --job {job_id} --job_type {job_type} --workdir {workdir}")
		if not os.path.exists(log_file_path):
			continue
		with open(log_file_path, 'rb') as log_file:
			log_contents = log_file.read()
		log_content_lines = log_contents.decode("utf-8").split("\n")        
		for line in log_content_lines:
			print(line.strip())
			if line.strip() == "Error EOF":
				status = "Error"
				break
			elif line.strip() == "Done EOF":
				status = "Done"
				break
		if status is not None:
			break
	return status
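
The polling loop above stops when the log contains a line equal to `Done EOF` or `Error EOF`. As a minimal sketch of just that check, separated from the CLI call and the sleep loop, the hypothetical helper `scan_log_status` below (a name introduced here for illustration) applies the same rule to a string of log text.

```python
def scan_log_status(log_text):
    """Return "Done", "Error", or None based on the first end-of-log marker found."""
    for line in log_text.split("\n"):
        stripped = line.strip()
        if stripped == "Error EOF":
            return "Error"
        if stripped == "Done EOF":
            return "Done"
    return None  # no marker yet: the job is presumably still running


print(scan_log_status("epoch 1\nepoch 2\n"))
print(scan_log_status("epoch 1\nDone EOF\n"))
```

Keeping the marker check in a pure function makes it easy to test against saved log files without contacting the service.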

### Function to split tar files <a class="anchor" id="head-1.1"></a>

In [None]:
import os
import tarfile

def split_tar_file(input_tar_path, output_dir, max_split_size=0.2*1024*1024*1024):
	"""Split a tar archive into smaller tar files, starting a new file when max_split_size bytes of member data would be exceeded."""
	os.makedirs(output_dir, exist_ok=True)

	with tarfile.open(input_tar_path, 'r') as original_tar:
		current_split_size = 0
		current_split_number = 0
		split_tar = tarfile.open(os.path.join(output_dir, f'smaller_file_{current_split_number}.tar'), 'w')
		try:
			for member in original_tar.getmembers():
				if current_split_size + member.size > max_split_size:
					# Close the full split and open a new split tar archive
					split_tar.close()
					current_split_number += 1
					split_tar = tarfile.open(os.path.join(output_dir, f'smaller_file_{current_split_number}.tar'), 'w')
					current_split_size = 0
				split_tar.addfile(member, original_tar.extractfile(member))
				current_split_size += member.size
		finally:
			# Always close the last split so its end-of-archive blocks are written
			split_tar.close()
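
To see how members end up grouped, the sketch below applies the same greedy rule as `split_tar_file` to a plain list of sizes. `plan_splits` is a hypothetical helper introduced only for illustration; it does not read or write any archive.

```python
def plan_splits(member_sizes, max_split_size):
    """Group member sizes into consecutive splits, greedily, in archive order."""
    splits = [[]]
    current_size = 0
    for size in member_sizes:
        # Start a new split when this member would push the current one over the limit
        if current_size + size > max_split_size:
            splits.append([])
            current_size = 0
        splits[-1].append(size)
        current_size += size
    return splits


print(plan_splits([100, 50, 80, 30], max_split_size=150))
```

Note that a member larger than the limit is not divided; it is placed at the start of a new split, so a split can exceed `max_split_size` in that case.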

## 1. Convert KITTI data to COCO format <a class="anchor" id="head-1"></a>
First, we convert the dataset from KITTI to COCO format.
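
One visible difference between the two formats is the bounding-box convention: a KITTI label line stores the box as left, top, right, bottom (fields 5 to 8 of the line), while COCO stores `[x, y, width, height]`. The sketch below illustrates only that arithmetic on a made-up label line. `kitti_line_to_coco_bbox` is a hypothetical helper, and the actual TAO conversion also builds image and category entries that are not shown here.

```python
def kitti_line_to_coco_bbox(label_line):
    """Parse one KITTI label line; return (class_name, [x, y, width, height])."""
    fields = label_line.split()
    class_name = fields[0]
    # KITTI fields 4..7 (0-based) are the 2D box: left, top, right, bottom
    left, top, right, bottom = (float(v) for v in fields[4:8])
    return class_name, [left, top, right - left, bottom - top]


# Made-up label line in KITTI layout (15 whitespace-separated fields)
line = "Car 0.00 0 -1.57 100.0 150.0 300.0 250.0 1.5 1.6 4.0 1.0 1.7 13.0 1.6"
print(kitti_line_to_coco_bbox(line))
```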

### Define the task and action

In [None]:
model_name = "annotations"
action = "convert"

### Create dataset
We support both KITTI and COCO data formats.

A KITTI dataset follows the directory structure displayed below:
```
$DATA_DIR/dataset
├── images
│   ├── image_name_1.jpg
│   ├── image_name_2.jpg
│   └── ...
└── labels
    ├── image_name_1.txt
    ├── image_name_2.txt
    └── ...
```

A COCO dataset follows the directory structure displayed below:
```
$DATA_DIR/dataset
├── images
│   ├── image_name_1.jpg
│   ├── image_name_2.jpg
│   └── ...
└── annotations.json
```
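
For the KITTI layout, each image is expected to have a label file with the same base name. A minimal sketch of a pre-upload check is shown below, assuming the `images/` and `labels/` folder names from the tree above. `find_unlabeled_images` is a hypothetical helper, and the demo runs on a throwaway temporary directory rather than on `DATA_DIR`.

```python
import os
import tempfile


def find_unlabeled_images(dataset_root):
    """Return sorted image base names in images/ without a matching .txt in labels/."""
    images_dir = os.path.join(dataset_root, "images")
    labels_dir = os.path.join(dataset_root, "labels")
    image_stems = {os.path.splitext(name)[0] for name in os.listdir(images_dir)}
    label_stems = {os.path.splitext(name)[0]
                   for name in os.listdir(labels_dir) if name.endswith(".txt")}
    return sorted(image_stems - label_stems)


# Demo: two images, only one of them labeled
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "images"))
    os.makedirs(os.path.join(root, "labels"))
    for name in ("a.jpg", "b.jpg"):
        open(os.path.join(root, "images", name), "w").close()
    open(os.path.join(root, "labels", "a.txt"), "w").close()
    missing = find_unlabeled_images(root)
    print(missing)
```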
This notebook uses the KITTI object detection dataset. For more details, see the [KITTI 2D object detection benchmark page](http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=2d).

In [None]:
# Dataset Links
images_url = "https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_image_2.zip"
labels_url = "https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_label_2.zip"

In [None]:
# Download the dataset
!wget -O images.zip {images_url}
!wget -O labels.zip {labels_url}

In [None]:
!unzip -q images.zip -d {data_dir}/
!unzip -q labels.zip -d {data_dir}/
!mkdir -p {data_dir}/images {data_dir}/labels
!mv {data_dir}/training/image_2/000* {data_dir}/images/
!mv {data_dir}/training/label_2/000* {data_dir}/labels/
!cd {data_dir} && tar -cf kitti_dataset.tar images labels
!rm -rf images.zip labels.zip {data_dir}/training/ {data_dir}/testing/

In [None]:
dataset_path = f"{data_dir}/kitti_dataset.tar"

In [None]:
# Create dataset
kitti_dataset_id = subprocess.getoutput(f"nvtl {model_name} dataset-create --dataset_type object_detection --dataset_format kitti")
print(kitti_dataset_id)

In [None]:
output_dir = os.path.join(os.path.dirname(os.path.abspath(dataset_path)), model_name, "kitti_to_coco")
split_tar_file(dataset_path, output_dir)
for idx, tar_dataset_path in enumerate(os.listdir(output_dir)):
    print(f"Uploading {idx+1}/{len(os.listdir(output_dir))} tar split")
    upload_dataset_message = subprocess.getoutput(f"nvtl {model_name} dataset-upload --id {kitti_dataset_id} --path {os.path.join(output_dir,tar_dataset_path)}")
    print(upload_dataset_message)

### List the created datasets

In [None]:
message = subprocess.getoutput(f"nvtl {model_name} list-datasets")
message = ast.literal_eval(message)
for rsp in message:
    rsp_keys = rsp.keys()
    assert "id" in rsp_keys
    assert "type" in rsp_keys
    assert "format" in rsp_keys
    assert "name" in rsp_keys
    print(rsp["id"],"\t",rsp["type"],"\t",rsp["format"],"\t\t",rsp["name"])
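
The cell above assumes the CLI printed a Python-style list of dictionaries. If the command prints an error message instead, `ast.literal_eval` fails with a message that hides the real cause. The sketch below wraps the parse so the raw output is shown; `parse_dataset_list` is a hypothetical helper and the sample string is made up for illustration.

```python
import ast


def parse_dataset_list(cli_output):
    """Parse the printed list of dataset dicts; raise a readable error otherwise."""
    try:
        return ast.literal_eval(cli_output)
    except (ValueError, SyntaxError) as err:
        # Show the start of the raw output, which usually contains the real error
        raise RuntimeError(f"Unexpected CLI output: {cli_output[:200]!r}") from err


# Made-up sample in the shape the notebook expects
sample = "[{'id': 'abc', 'type': 'object_detection', 'format': 'kitti', 'name': 'My Dataset'}]"
for rsp in parse_dataset_list(sample):
    print(rsp["id"], rsp["type"], rsp["format"], rsp["name"], sep="\t")
```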

### Create an experiment

In [None]:
# Create an experiment
annotation_conversion_experiment_id = subprocess.getoutput(f"nvtl {model_name} experiment-create --network_arch {model_name} --encryption_key nvidia_tlt")
print(annotation_conversion_experiment_id)

### Assign the dataset

In [None]:
# Assign dataset
docker_env_vars = {} # Update any variables to be included while triggering Docker run-time like MLOPs variables 
dataset_information = {"inference_dataset":kitti_dataset_id,"docker_env_vars": docker_env_vars}
patched_model = subprocess.getoutput(f"nvtl {model_name} patch-artifact-metadata --id {annotation_conversion_experiment_id} --job_type experiment --update_info '{json.dumps(dataset_information)}' ")
print(patched_model)

### Set action specs

In [None]:
# Default model specs
annotation_conversion_specs = subprocess.getoutput(f"nvtl {model_name} get-spec --action {action} --job_type experiment --id {annotation_conversion_experiment_id}")
annotation_conversion_specs = json.loads(annotation_conversion_specs)
print(json.dumps(annotation_conversion_specs, indent=4))

In [None]:
# Set specs
annotation_conversion_specs["data"]["input_format"] = "KITTI"
annotation_conversion_specs["data"]["output_format"] = "COCO"
print(json.dumps(annotation_conversion_specs, indent=4))

### Execute the data format conversion action

In [None]:
# Run action
convert_job_id = subprocess.getoutput(f"nvtl {model_name} experiment-run-action --id {annotation_conversion_experiment_id} --action {action} --specs '{json.dumps(annotation_conversion_specs)}'")
print(convert_job_id)
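
The command above wraps the JSON specs in single quotes, which breaks if any spec value itself contains a single quote. One hedged alternative is to quote each argument with `shlex.quote` before joining. The sketch below only builds the command string using the same flags as the cell above; `build_run_action_command` is a hypothetical helper, and the experiment ID and the `note` key are made up.

```python
import json
import shlex


def build_run_action_command(model_name, experiment_id, action, specs):
    """Build the experiment-run-action command line with shell-safe quoting."""
    parts = ["nvtl", model_name, "experiment-run-action",
             "--id", experiment_id, "--action", action,
             "--specs", json.dumps(specs)]
    return " ".join(shlex.quote(part) for part in parts)


# Made-up ID and a spec value containing a single quote
specs = {"data": {"input_format": "KITTI", "note": "it's quoted safely"}}
command = build_run_action_command("annotations", "1234", "convert", specs)
print(command)
```

The resulting string can be passed to `subprocess.getoutput` in place of the hand-written f-string.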

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
status = my_tail(model_name, annotation_conversion_experiment_id, convert_job_id, "experiment", workdir)

### Download the COCO annotations

In [None]:
file_list = subprocess.getoutput(f"nvtl {model_name} list-job-files --id {annotation_conversion_experiment_id} --job {convert_job_id} --job_type experiment --retrieve_logs True --retrieve_specs False")
print(file_list)

In [None]:
temptar = subprocess.getoutput(f"nvtl {model_name} download-entire-job --id {annotation_conversion_experiment_id} --job {convert_job_id} --job_type experiment --workdir {workdir}")
tar_command = f'tar -xvf {temptar} -C {workdir}/'
os.system(tar_command)
os.remove(temptar)
print(f"Results at {workdir}/{convert_job_id}")
convert_out_path = f"{workdir}/{convert_job_id}"
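
The extraction step above shells out to `tar`. Where a pure-Python route is preferred, for example on a machine without the `tar` binary, the standard `tarfile` module can do the same job. This is a minimal sketch, assuming the downloaded archive is trusted; `extract_job_tar` is a hypothetical helper, and the demo builds its own small archive in a temporary directory.

```python
import os
import tarfile
import tempfile


def extract_job_tar(tar_path, dest_dir):
    """Extract tar_path into dest_dir and return the sorted member names."""
    with tarfile.open(tar_path, "r") as archive:
        names = sorted(archive.getnames())
        archive.extractall(dest_dir)
    return names


# Demo: pack a small folder, then extract it somewhere else
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "job")
    os.makedirs(src)
    with open(os.path.join(src, "result.json"), "w") as f:
        f.write("{}")
    tar_path = os.path.join(tmp, "job.tar")
    with tarfile.open(tar_path, "w") as archive:
        archive.add(src, arcname="job")
    out_dir = os.path.join(tmp, "out")
    os.makedirs(out_dir)
    names = extract_job_tar(tar_path, out_dir)
    extracted_ok = os.path.isfile(os.path.join(out_dir, "job", "result.json"))
    print(names, extracted_ok)
```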

In [None]:
# Copy annotations to the dataset
!cp {convert_out_path}/{kitti_dataset_id}.json {data_dir}/annotations.json

## 2. Generate pseudo-masks with the auto-labeler <a class="anchor" id="head-2"></a>
Here we will use a pretrained MAL model to generate pseudo-masks for the converted KITTI data. 

### Define the task and action

In [None]:
model_name = "auto_label"
action = "generate"

### Create the dataset
We reformat the original dataset to include the generated COCO annotations.

In [None]:
# Reformatting the dataset
# Untar to destination
tar_command = f'mkdir -p {workdir}/{convert_job_id}_coco/ && tar -xf {dataset_path} -C {workdir}/{convert_job_id}_coco/'
os.system(tar_command)

# Copy the annotations
copy_command = f'cp {convert_out_path}/{kitti_dataset_id}.json {workdir}/{convert_job_id}_coco/annotations.json'
os.system(copy_command)

# Tar the dataset
tar_command = f'cd {workdir} && tar -cf {convert_job_id}_coco.tar {convert_job_id}_coco'
os.system(tar_command)
coco_data_path = f'{workdir}/{convert_job_id}_coco.tar'

In [None]:
# Create Dataset
coco_dataset_id = subprocess.getoutput(f"nvtl {model_name} dataset-create --dataset_type object_detection --dataset_format coco")
print(coco_dataset_id)

In [None]:
output_dir = os.path.join(os.path.dirname(os.path.abspath(coco_data_path)), model_name, "coco_pseudo")
split_tar_file(coco_data_path, output_dir)
for idx, tar_dataset_path in enumerate(os.listdir(output_dir)):
    print(f"Uploading {idx+1}/{len(os.listdir(output_dir))} tar split")
    upload_dataset_message = subprocess.getoutput(f"nvtl {model_name} dataset-upload --id {coco_dataset_id} --path {os.path.join(output_dir,tar_dataset_path)}")
    print(upload_dataset_message)

### List the datasets

In [None]:
message = subprocess.getoutput(f"nvtl {model_name} list-datasets")
message = ast.literal_eval(message)
for rsp in message:
    rsp_keys = rsp.keys()
    assert "id" in rsp_keys
    assert "type" in rsp_keys
    assert "format" in rsp_keys
    assert "name" in rsp_keys
    print(rsp["id"],"\t",rsp["type"],"\t",rsp["format"],"\t\t",rsp["name"])

### Create an experiment

In [None]:
# Create an experiment
network_arch = model_name
pseudo_mask_experiment_id = subprocess.getoutput(f"nvtl {model_name} experiment-create --network_arch {network_arch} --encryption_key nvidia_tlt")
print(pseudo_mask_experiment_id)

### Find the pretrained model (PTM)

In [None]:
# List all pretrained models for the chosen network architecture
filter_params = {"network_arch": network_arch}
message = subprocess.getoutput(f"nvtl {model_name} list-experiments --filter_params '{json.dumps(filter_params)}'")
message = ast.literal_eval(message)
for rsp in message:
    rsp_keys = rsp.keys()
    if "encryption_key" not in rsp.keys():
        assert "name" in rsp_keys and "version" in rsp_keys and "ngc_path" in rsp_keys and "additional_id_info" in rsp_keys
        print(f'PTM Name: {rsp["name"]}; PTM version: {rsp["version"]}; NGC PATH: {rsp["ngc_path"]}; Additional info: {rsp["additional_id_info"]}')

In [None]:
pretrained_map = {"auto_label" : "mask_auto_label:trainable_v1.0"}

In [None]:
filter_params = {"network_arch": network_arch}
message = subprocess.getoutput(f"nvtl {model_name} list-experiments --filter_params '{json.dumps(filter_params)}'")
message = ast.literal_eval(message)
ptm = []
for rsp in message:
    rsp_keys = rsp.keys()
    assert "ngc_path" in rsp_keys
    if rsp["ngc_path"].endswith(pretrained_map[network_arch]):
        assert "id" in rsp_keys
        ptm_id = rsp["id"]
        ptm = [ptm_id]
        print("Metadata for model with requested NGC Path")
        print(rsp)
        break
print(ptm)
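
The selection rule in the cell above is a suffix match on `ngc_path`. The sketch below isolates that rule so it can be tried on sample data; `find_ptm_ids` is a hypothetical helper, and the records (including their IDs and paths) are made up for illustration.

```python
def find_ptm_ids(experiments, ngc_suffix):
    """Return a one-element list with the first matching experiment id, else []."""
    for rsp in experiments:
        if rsp.get("ngc_path", "").endswith(ngc_suffix):
            return [rsp["id"]]
    return []


# Made-up records for illustration only
experiments = [
    {"id": "id-1", "ngc_path": "example/org/other_model:v1.0"},
    {"id": "id-2", "ngc_path": "example/org/mask_auto_label:trainable_v1.0"},
]
print(find_ptm_ids(experiments, "mask_auto_label:trainable_v1.0"))
```

An empty list signals that no pretrained model matched, which is worth checking before the list is patched into the experiment metadata.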

### Assign the PTM

In [None]:
ptm_information = {"base_experiment":ptm}
patched_model = subprocess.getoutput(f"nvtl {model_name} patch-artifact-metadata --id {pseudo_mask_experiment_id} --job_type experiment --update_info '{json.dumps(ptm_information)}' ")
print(patched_model)

### Assign the dataset

In [None]:
# Assign dataset
docker_env_vars = {} # Update any variables to be included while triggering Docker run-time like MLOPs variables 
dataset_information = {"inference_dataset":coco_dataset_id,"docker_env_vars": docker_env_vars}
patched_model = subprocess.getoutput(f"nvtl {model_name} patch-artifact-metadata --id {pseudo_mask_experiment_id} --job_type experiment --update_info '{json.dumps(dataset_information)}' ")
print(patched_model)

### Set action specs

In [None]:
# Default model specs
auto_label_generate_specs = subprocess.getoutput(f"nvtl {model_name} get-spec --action {action} --job_type experiment --id {pseudo_mask_experiment_id}")
auto_label_generate_specs = json.loads(auto_label_generate_specs)
print(json.dumps(auto_label_generate_specs, indent=4))

In [None]:
# Set specs
auto_label_generate_specs["gpu_ids"] = [0]
print(json.dumps(auto_label_generate_specs, indent=4))

### Run action

In [None]:
# Run action
label_job_id = subprocess.getoutput(f"nvtl {model_name} experiment-run-action --id {pseudo_mask_experiment_id} --action {action} --specs '{json.dumps(auto_label_generate_specs)}'")
print(label_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
status = my_tail(model_name, pseudo_mask_experiment_id, label_job_id, "experiment", workdir)

### Download the label masks

In [None]:
file_list = subprocess.getoutput(f"nvtl {model_name} list-job-files --id {pseudo_mask_experiment_id} --job {label_job_id} --job_type experiment --retrieve_logs True --retrieve_specs False")
print(file_list)

In [None]:
temptar = subprocess.getoutput(f"nvtl {model_name} download-entire-job --id {pseudo_mask_experiment_id} --job {label_job_id} --job_type experiment --workdir {workdir}")
tar_command = f'tar -xvf {temptar} -C {workdir}/'
os.system(tar_command)
os.remove(temptar)
print(f"Results at {workdir}/{label_job_id}")
model_downloaded_path = f"{workdir}/{label_job_id}"

In [None]:
# Copy annotations to the dataset
!cp {model_downloaded_path}/label.json {data_dir}/label.json

## 3. Apply data augmentation <a class="anchor" id="head-3"></a>
In this section, we run offline augmentation on the original dataset. During augmentation, the pseudo-masks generated in the previous step can be used to refine the distorted or rotated bounding boxes.
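
To illustrate why a mask helps, consider rotating an image by 30 degrees. If only the box is known, the new axis-aligned box has to enclose the four rotated corners. If the object outline is known, the new box only has to enclose the rotated outline, which is usually tighter. The sketch below is a geometric illustration under simple assumptions (an elliptical object touching all four sides of a 100x40 box) and is not a description of how TAO implements refinement internally.

```python
import math


def rotate(point, angle_deg, center):
    """Rotate a 2D point around center by angle_deg degrees."""
    angle = math.radians(angle_deg)
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (center[0] + dx * math.cos(angle) - dy * math.sin(angle),
            center[1] + dx * math.sin(angle) + dy * math.cos(angle))


def enclosing_box(points):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around a set of points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def box_area(box):
    """Area of an (x_min, y_min, x_max, y_max) box."""
    return (box[2] - box[0]) * (box[3] - box[1])


center = (50.0, 20.0)
corners = [(0, 0), (100, 0), (100, 40), (0, 40)]
# Outline of an ellipse inscribed in the box, sampled every degree
mask_outline = [(50 + 50 * math.cos(math.radians(d)), 20 + 20 * math.sin(math.radians(d)))
                for d in range(360)]

box_from_corners = enclosing_box([rotate(p, 30, center) for p in corners])
box_from_mask = enclosing_box([rotate(p, 30, center) for p in mask_outline])
print(round(box_area(box_from_corners)), round(box_area(box_from_mask)))
```

The second printed area is smaller than the first, which is the gap that mask-based refinement aims to close.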

### Define the task and action

In [None]:
model_name = "augmentation"
action = "generate"

### Create the dataset
We reformat the dataset to include the generated mask information.

In [None]:
# Format the dataset
copy_command = f'cp {workdir}/{label_job_id}/label.json {workdir}/{convert_job_id}_coco'
os.system(copy_command)

# Tar the dataset
tar_command = f'cd {workdir} && tar -cvf {label_job_id}_coco.tar {convert_job_id}_coco'
os.system(tar_command)
coco_auto_label_data_path = f'{workdir}/{label_job_id}_coco.tar'

In [None]:
# Create dataset
coco_aug_dataset_id = subprocess.getoutput(f"nvtl {model_name} dataset-create --dataset_type object_detection --dataset_format coco")
print(coco_aug_dataset_id)

In [None]:
output_dir = os.path.join(os.path.dirname(os.path.abspath(coco_auto_label_data_path)), model_name, "coco_auto_label")
split_tar_file(coco_auto_label_data_path, output_dir)
for idx, tar_dataset_path in enumerate(os.listdir(output_dir)):
    print(f"Uploading {idx+1}/{len(os.listdir(output_dir))} tar split")
    upload_dataset_message = subprocess.getoutput(f"nvtl {model_name} dataset-upload --id {coco_aug_dataset_id} --path {os.path.join(output_dir,tar_dataset_path)}")
    print(upload_dataset_message)

### List the datasets

In [None]:
message = subprocess.getoutput(f"nvtl {model_name} list-datasets")
message = ast.literal_eval(message)
for rsp in message:
    rsp_keys = rsp.keys()
    assert "id" in rsp_keys
    assert "type" in rsp_keys
    assert "format" in rsp_keys
    assert "name" in rsp_keys
    print(rsp["id"],"\t",rsp["type"],"\t",rsp["format"],"\t\t",rsp["name"])

### Create an experiment

In [None]:
# Create an experiment
data_aug_experiment_id = subprocess.getoutput(f"nvtl {model_name} experiment-create --network_arch {model_name} --encryption_key nvidia_tlt")
print(data_aug_experiment_id)

### Assign the dataset

In [None]:
# Assign dataset
docker_env_vars = {} # Update any variables to be included while triggering Docker run-time like MLOPs variables 
dataset_information = {"inference_dataset":coco_aug_dataset_id,"docker_env_vars": docker_env_vars}
patched_model = subprocess.getoutput(f"nvtl {model_name} patch-artifact-metadata --id {data_aug_experiment_id} --job_type experiment --update_info '{json.dumps(dataset_information)}' ")
print(patched_model)

### Set action specs

In [None]:
# Default model specs
augmentation_generate_specs = subprocess.getoutput(f"nvtl {model_name} get-spec --action {action} --job_type experiment --id {data_aug_experiment_id}")
augmentation_generate_specs = json.loads(augmentation_generate_specs)
print(json.dumps(augmentation_generate_specs, indent=4))

In [None]:
# Change any spec key if required
print(json.dumps(augmentation_generate_specs, indent=4))

### Execute the data augmentation action

In [None]:
# Run action
augment_job_id = subprocess.getoutput(f"nvtl {model_name} experiment-run-action --id {data_aug_experiment_id} --action {action} --specs '{json.dumps(augmentation_generate_specs)}'")
print(augment_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
status = my_tail(model_name, data_aug_experiment_id, augment_job_id, "experiment", workdir)

## 4. Perform data analytics  <a class="anchor" id="head-4"></a>
Next, we perform analytics with the KITTI dataset.
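
As a rough idea of the kind of statistic involved, the sketch below counts objects per class from KITTI label text, where the class name is the first field of each line. This is only an illustration on made-up label contents; `count_kitti_classes` is a hypothetical helper, and the statistics and output format produced by the TAO analytics job may differ.

```python
from collections import Counter


def count_kitti_classes(label_texts):
    """Count objects per class across the contents of KITTI label files."""
    counts = Counter()
    for text in label_texts:
        for line in text.splitlines():
            if line.strip():
                counts[line.split()[0]] += 1
    return counts


# Made-up contents of two label files
labels = [
    "Car 0.00 0 -1.57 100.0 150.0 300.0 250.0 1.5 1.6 4.0 1.0 1.7 13.0 1.6\n"
    "Pedestrian 0.00 0 0.2 400.0 160.0 430.0 240.0 1.8 0.6 0.8 2.0 1.6 9.0 0.3",
    "Car 0.00 1 1.55 50.0 170.0 120.0 220.0 1.4 1.6 3.9 -6.0 1.8 20.0 1.3",
]
print(count_kitti_classes(labels))
```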

### Define the task and action

In [None]:
model_name = "analytics"
action = "analyze"

### Create an experiment

In [None]:
# Create an experiment
data_analytics_experiment_id = subprocess.getoutput(f"nvtl {model_name} experiment-create --network_arch {model_name} --encryption_key nvidia_tlt")
print(data_analytics_experiment_id)

### Assign the dataset

In [None]:
# Assign dataset
docker_env_vars = {} # Update any variables to be included while triggering Docker run-time like MLOPs variables 
dataset_information = {"inference_dataset":kitti_dataset_id,"docker_env_vars": docker_env_vars}
patched_model = subprocess.getoutput(f"nvtl {model_name} patch-artifact-metadata --id {data_analytics_experiment_id} --job_type experiment --update_info '{json.dumps(dataset_information)}' ")
print(patched_model)

### Set action specs

In [None]:
# Default model specs
analytics_analyze_specs = subprocess.getoutput(f"nvtl {model_name} get-spec --action {action} --job_type experiment --id {data_analytics_experiment_id}")
analytics_analyze_specs = json.loads(analytics_analyze_specs)
print(json.dumps(analytics_analyze_specs, indent=4))

In [None]:
# Change any spec key if required
print(json.dumps(analytics_analyze_specs, indent=4))

### Execute the data analytics action

In [None]:
# Run action
analyze_job_id = subprocess.getoutput(f"nvtl {model_name} experiment-run-action --id {data_analytics_experiment_id} --action {action} --specs '{json.dumps(analytics_analyze_specs)}'")
print(analyze_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
status = my_tail(model_name, data_analytics_experiment_id, analyze_job_id, "experiment", workdir)

### Delete experiments <a class="anchor" id="head-21"></a>

In [None]:
# model_name is "analytics" at this point, so name the task each experiment was created under
subprocess.getoutput(f"nvtl annotations experiment-delete --id {annotation_conversion_experiment_id}")

In [None]:
subprocess.getoutput(f"nvtl auto_label experiment-delete --id {pseudo_mask_experiment_id}")

In [None]:
subprocess.getoutput(f"nvtl augmentation experiment-delete --id {data_aug_experiment_id}")

In [None]:
subprocess.getoutput(f"nvtl analytics experiment-delete --id {data_analytics_experiment_id}")

### Delete dataset <a class="anchor" id="head-21"></a>

#### Delete KITTI dataset <a class="anchor" id="head-21"></a>

In [None]:
subprocess.getoutput(f"nvtl annotations dataset-delete --id {kitti_dataset_id}")

#### Delete COCO dataset <a class="anchor" id="head-21"></a>

In [None]:
subprocess.getoutput(f"nvtl auto_label dataset-delete --id {coco_dataset_id}")

#### Delete COCO augmentation dataset <a class="anchor" id="head-21"></a>

In [None]:
subprocess.getoutput(f"nvtl augmentation dataset-delete --id {coco_aug_dataset_id}")