## TAO remote client - Data-Services
### The workflow in a nutshell
TAO Data Services include 4 key pipelines:
1. Offline data augmentation using DALI
2. Auto labeling using TAO Mask Auto-labeler (MAL)
3. Annotation conversion
4. Groundtruth analytics

## Learning Objectives

In this notebook, you will learn how to leverage the simplicity and convenience of TAO to:

* Convert a KITTI dataset to COCO format
* Run auto-labeling to generate pseudo-masks for KITTI bounding boxes
* Apply data augmentation to the KITTI dataset with bounding box refinement
* Run data analytics to collect useful statistics on the original and augmented KITTI datasets

### Table of contents

1. [Convert KITTI data to COCO format](#head-1)
2. [Generate pseudo-masks with the auto-labeler](#head-2)
3. [Apply data augmentation](#head-3)
4. [Perform data analytics](#head-4)


### Requirements
The server requirements are listed in the [TAO Toolkit API setup documentation](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#).

In [None]:
import os
import glob
import subprocess
import getpass
import json

In [None]:
namespace = 'default'

### Install TAO remote client

In [None]:
# Skip this step if you have already installed the TAO-Client wheel.
! pip3 install nvidia-tao-client

In [None]:
# View the version of the TAO-Client
! tao-client --version

### Set the remote service base URL and Token

### FIXME

1. Assign the ip_address and port_number in FIXME 1 and FIXME 2 ([info](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_rest_api.html))
2. Assign the ngc_api_key variable in FIXME 3
3. Assign the path of DATA_DIR in FIXME 4

In [None]:
# Define the node_addr and port number
node_addr = "<ip_address>"  # FIXME 1 example: 10.137.149.22
node_port = "<port_number>"  # FIXME 2 example: 32334
# On the host machine, the node IP address and port number can be obtained as follows:
# node_addr: hostname -I
# node_port: kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'
ngc_api_key = "<ngc_api_key>"  # FIXME 3 example: (Add NGC API key)
data_dir = "<DATA_DIR>"  # FIXME 4

In [None]:
%env BASE_URL=http://{node_addr}:{node_port}/api/v1

# Exchange NGC_API_KEY for JWT
identity = json.loads(subprocess.getoutput(f'tao-client login --ngc-api-key {ngc_api_key}'))

%env USER={identity['user_id']}
%env TOKEN={identity['token']}
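
If a FIXME placeholder is left unedited, the login cell above fails with an error that does not point back to the cause. As a minimal sketch, a pre-flight check could look like the following. The helper name `build_base_url` is our own and is not part of the TAO client; it only mirrors the `http://{node_addr}:{node_port}/api/v1` pattern used above.

```python
def build_base_url(node_addr: str, node_port: str) -> str:
    """Return the API base URL, rejecting unedited FIXME placeholders.

    Hypothetical helper for illustration; not part of tao-client.
    """
    for name, value in (("node_addr", node_addr), ("node_port", node_port)):
        text = str(value).strip()
        # Placeholders in this notebook look like "<ip_address>".
        if not text or (text.startswith("<") and text.endswith(">")):
            raise ValueError(f"{name} still holds a placeholder: {value!r}")

    port = int(str(node_port).strip())  # raises ValueError for non-numeric ports
    if not 0 < port < 65536:
        raise ValueError(f"node_port out of range: {port}")

    return f"http://{str(node_addr).strip()}:{port}/api/v1"
```

Calling `build_base_url(node_addr, node_port)` before setting `BASE_URL` surfaces a forgotten FIXME immediately.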

### Access the shared volume

In [None]:
# Get PVC ID
pvc_id = subprocess.getoutput(f'kubectl get pvc tao-toolkit-api-pvc -n {namespace} -o jsonpath="{{.spec.volumeName}}"')
print(pvc_id)

In [None]:
# Get NFS server info
provisioner = json.loads(subprocess.getoutput('helm get values nfs-subdir-external-provisioner -o json'))
nfs_server = provisioner['nfs']['server']
nfs_path = provisioner['nfs']['path']
print(nfs_server, nfs_path)

In [None]:
user = getpass.getuser()
home = os.path.expanduser('~')

! echo "Password for {user}"
password = getpass.getpass()

In [None]:
# Mount shared volume 
! mkdir -p ~/shared

command = "apt-get -y install nfs-common >> /dev/null"
! echo {password} | sudo -S -k {command}

command = f"mount -t nfs {nfs_server}:{nfs_path}/{namespace}-tao-toolkit-api-pvc-{pvc_id} ~/shared"
! echo {password} | sudo -S -k {command} && echo DONE
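
The mount source in the cell above follows the pattern `{nfs_server}:{nfs_path}/{namespace}-tao-toolkit-api-pvc-{pvc_id}`. As a minimal sketch, a hypothetical helper `nfs_mount_source` (our own name, not part of any TAO tooling) assembles that string in one place so it can be printed and checked before running `mount`. The values in the usage note are made up.

```python
import posixpath


def nfs_mount_source(nfs_server, nfs_path, namespace, pvc_id,
                     pvc_name="tao-toolkit-api-pvc"):
    """Build the '<server>:<export path>' string passed to `mount -t nfs`.

    Hypothetical helper; the directory naming mirrors the cell above.
    """
    export_dir = f"{namespace}-{pvc_name}-{pvc_id}"
    # posixpath.join avoids a double slash when nfs_path ends with '/'.
    return f"{nfs_server}:{posixpath.join(nfs_path, export_dir)}"
```

For example, `nfs_mount_source("10.0.0.5", "/mnt/nfs_share", "default", "pvc-1234")` gives the source for a PVC in the `default` namespace.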

## 1. Convert KITTI data to COCO format <a class="anchor" id="head-1"></a>
We first convert the dataset from KITTI to COCO format.
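
To make the conversion concrete: a KITTI label line stores each box as `left top right bottom` in fields 5 to 8, whereas COCO stores `[x, y, width, height]`. The sketch below shows only that coordinate relationship for a single line. It is our own illustration, not the converter that the TAO service runs, and the function name `kitti_line_to_coco_bbox` is hypothetical.

```python
def kitti_line_to_coco_bbox(line: str):
    """Map one KITTI label line to (category, COCO-style [x, y, w, h]).

    KITTI fields: type, truncated, occluded, alpha, left, top, right, bottom, ...
    """
    fields = line.split()
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 fields, got {len(fields)}")

    category = fields[0]
    left, top, right, bottom = (float(value) for value in fields[4:8])
    # COCO keeps the top-left corner and replaces the other corner with a size.
    return category, [left, top, right - left, bottom - top]
```

A box with corners (100, 120) and (300, 220) therefore becomes `[100.0, 120.0, 200.0, 100.0]`.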

### Define the task and action

In [None]:
model_name = "annotations"
action = "convert"

### Create dataset
We support both KITTI and COCO data formats.

A KITTI dataset follows the directory structure shown below:
```
$DATA_DIR/dataset
├── images
│   ├── image_name_1.jpg
│   ├── image_name_2.jpg
│   ├── ...
└── labels
    ├── image_name_1.txt
    ├── image_name_2.txt
    ├── ...
```

A COCO dataset follows the directory structure shown below:
```
$DATA_DIR/dataset
├── images
│   ├── image_name_1.jpg
│   ├── image_name_2.jpg
│   ├── ...
└── annotations.json
```
This notebook uses the KITTI object detection dataset. For more details, see the [KITTI 2D object detection benchmark page](http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=2d).
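
Before uploading, it can help to confirm that a local folder matches the KITTI layout described above, with one label file per image. The following is a minimal sketch under that assumption. The function name `unmatched_kitti_files` and the accepted image extensions are our own choices, not something the TAO service prescribes.

```python
from pathlib import Path


def unmatched_kitti_files(dataset_dir, image_exts=(".jpg", ".jpeg", ".png")):
    """Return (images without a label, labels without an image) by file stem."""
    root = Path(dataset_dir)
    images = {
        path.stem
        for path in (root / "images").iterdir()
        if path.suffix.lower() in image_exts
    }
    labels = {path.stem for path in (root / "labels").glob("*.txt")}
    return sorted(images - labels), sorted(labels - images)
```

Two empty lists mean every image has a matching label file and vice versa.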

In [None]:
# Dataset Links
images_url = "https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_image_2.zip"
labels_url = "https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_label_2.zip"

In [None]:
# Download the dataset
!wget -O images.zip {images_url}
!wget -O labels.zip {labels_url}

In [None]:
!unzip -q images.zip -d {data_dir}/
!unzip -q labels.zip -d {data_dir}/
!mkdir -p {data_dir}/images {data_dir}/labels
!mv {data_dir}/training/image_2/000* {data_dir}/images/
!mv {data_dir}/training/label_2/000* {data_dir}/labels/
!cd {data_dir} && tar -cf kitti_dataset.tar images labels
!rm -rf images.zip labels.zip {data_dir}/training/ {data_dir}/testing/

In [None]:
# Create dataset
dataset_id = subprocess.getoutput(f"tao-client {model_name} dataset-create --dataset_type object_detection --dataset_format raw")
print(dataset_id)

In [None]:
!rsync -ah --info=progress2 {data_dir}/images ~/shared/users/{identity['user_id']}/datasets/{dataset_id}/
!rsync -ah --info=progress2 {data_dir}/labels ~/shared/users/{identity['user_id']}/datasets/{dataset_id}/

### List the created datasets

In [None]:
pattern = os.path.join(home, 'shared', 'users', os.environ['USER'], 'datasets', '*', 'metadata.json')

datasets = []
for metadata_path in glob.glob(pattern):
    with open(metadata_path, 'r') as metadata_file:
        datasets.append(json.load(metadata_file))

print(json.dumps(datasets, indent=2))

### Create a model experiment

In [None]:
# Create model
model_id = subprocess.getoutput(f"tao-client {model_name} model-create --network_arch {model_name}")
print(model_id)

### Assign the dataset

In [None]:
# Assign dataset
metadata_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'metadata.json')

with open(metadata_path , "r") as metadata_file:
    metadata = json.load(metadata_file)

metadata["inference_dataset"] = dataset_id

with open(metadata_path, "w") as metadata_file:
    json.dump(metadata, metadata_file, indent=2)

print(json.dumps(metadata, indent=2))
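
The same read, modify, write pattern recurs throughout this notebook for `metadata.json` and the spec files. As a minimal sketch, it can be folded into one small helper. The name `update_json_file` is hypothetical and is not part of the TAO client.

```python
import json


def update_json_file(path, updates):
    """Load a JSON object from `path`, apply top-level `updates`, write it back."""
    with open(path, "r") as json_file:
        data = json.load(json_file)

    data.update(updates)  # top-level keys only; nested keys need explicit handling

    with open(path, "w") as json_file:
        json.dump(data, json_file, indent=2)
    return data
```

With it, the cell above reduces to `update_json_file(metadata_path, {"inference_dataset": dataset_id})`.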

### Set action specs

In [None]:
# Default model specs
! tao-client {model_name} model-action-defaults --id {model_id} --action {action} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/{action}.json

In [None]:
# Set specs
specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', f'{action}.json')

with open(specs_path , "r") as specs_file:
    specs = json.load(specs_file)

# Updating specs
specs["data"]["input_format"] = "KITTI"
specs["data"]["output_format"] = "COCO"

with open(specs_path, "w") as specs_file:
    json.dump(specs, specs_file, indent=2)

print(json.dumps(specs, indent=2))

### Execute the data format conversion action

In [None]:
# Run action
convert_job_id = subprocess.getoutput(f"tao-client {model_name} execute-action --id {model_id} --action {action}")
print(convert_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
logs_dir = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'logs')
log_file = f"{convert_job_id}.txt"

def my_tail(logs_dir, log_file):
    %env LOG_FILE={logs_dir}/{log_file}
    ! mkdir -p {logs_dir}
    ! [ ! -f "$LOG_FILE" ] && touch $LOG_FILE && chmod 666 $LOG_FILE
    ! tail -f -n +1 $LOG_FILE | while read LINE; do echo "$LINE"; [[ "$LINE" == "EOF" ]] && pkill -P $$ tail; done
    

my_tail(logs_dir, log_file)
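
`my_tail` relies on shell tools and IPython magics. Where those are not available, the same idea (stream the log until a line reading `EOF` appears) can be sketched in plain Python. The generator below is our own illustration with hypothetical parameter names. It assumes the log file already exists, which the `touch` in `my_tail` guarantees.

```python
import time


def follow_until_eof(path, sentinel="EOF", poll_seconds=1.0, timeout_seconds=None):
    """Yield lines from a growing text file until a line equal to `sentinel`."""
    last_activity = time.monotonic()
    pending = ""
    with open(path, "r") as log:
        while True:
            chunk = log.readline()
            if chunk:
                last_activity = time.monotonic()
                pending += chunk
                if not pending.endswith("\n"):
                    continue  # partial line; wait for the writer to finish it
                line, pending = pending.rstrip("\n"), ""
                yield line
                if line.strip() == sentinel:
                    return
            else:
                idle = time.monotonic() - last_activity
                if timeout_seconds is not None and idle > timeout_seconds:
                    raise TimeoutError(f"no new log output for {idle:.0f} s")
                time.sleep(poll_seconds)
```

Usage: `for line in follow_until_eof(os.path.join(logs_dir, log_file)): print(line)`. The optional timeout is an addition of this sketch; `my_tail` itself waits indefinitely.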

### Download the COCO annotations

In [None]:
# Copy annotations to the dataset
!rsync -ah --info=progress2 ~/shared/users/{identity['user_id']}/models/{model_id}/{convert_job_id}/{dataset_id}.json {data_dir}/annotations.json
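
A quick sanity check on the downloaded file is to count its images, annotations, and per-category boxes. The sketch below assumes `annotations.json` uses the standard COCO top-level keys `images`, `annotations`, and `categories`. The function name `summarize_coco` is our own.

```python
import json
from collections import Counter


def summarize_coco(path):
    """Return image, annotation, and per-category counts for a COCO JSON file."""
    with open(path, "r") as coco_file:
        coco = json.load(coco_file)

    annotations = coco.get("annotations", [])
    id_to_name = {cat["id"]: cat["name"] for cat in coco.get("categories", [])}
    # Fall back to the raw id when a category is missing from the lookup.
    per_category = Counter(
        id_to_name.get(ann["category_id"], str(ann["category_id"]))
        for ann in annotations
    )
    return {
        "images": len(coco.get("images", [])),
        "annotations": len(annotations),
        "per_category": dict(per_category),
    }
```

For example, `print(summarize_coco(os.path.join(data_dir, "annotations.json")))`.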

### Delete the experiment

In [None]:
! rm -rf ~/shared/users/{os.environ['USER']}/models/{model_id}
! echo DONE

### Delete the dataset

In [None]:
! rm -rf ~/shared/users/{os.environ['USER']}/datasets/{dataset_id}
! echo DONE

## 2. Generate pseudo-masks with the auto-labeler <a class="anchor" id="head-2"></a>
Here we will use a pretrained MAL model to generate pseudo-masks for the converted KITTI data. 

### Define the task and action

In [None]:
model_name = "auto-label"
action = "generate"

### Create the dataset
We format the original dataset to include the generated COCO annotations.

In [None]:
# Create Dataset
dataset_id = subprocess.getoutput(f"tao-client {model_name} dataset-create --dataset_type object_detection --dataset_format raw")
print(dataset_id)

In [None]:
!rsync -ah --info=progress2 {data_dir}/images ~/shared/users/{identity['user_id']}/datasets/{dataset_id}/
!rsync -ah --info=progress2 {data_dir}/annotations.json ~/shared/users/{identity['user_id']}/datasets/{dataset_id}/

### List the datasets

In [None]:
pattern = os.path.join(home, 'shared', 'users', os.environ['USER'], 'datasets', '*', 'metadata.json')

datasets = []
for metadata_path in glob.glob(pattern):
    with open(metadata_path, 'r') as metadata_file:
        datasets.append(json.load(metadata_file))

print(json.dumps(datasets, indent=2))

### Create a model experiment

In [None]:
# Create model
network_arch = model_name.replace("-","_")
model_id = subprocess.getoutput(f"tao-client {model_name} model-create --network_arch {network_arch}")
print(model_id)

### Find the pretrained model (PTM)

In [None]:
# List all pretrained models for the chosen network architecture
pattern = os.path.join(home, 'shared', 'users', '*', 'models', '*', 'metadata.json')

for ptm_metadata_path in glob.glob(pattern):
    with open(ptm_metadata_path, 'r') as metadata_file:
        ptm_metadata = json.load(metadata_file)
        metadata_network_arch = ptm_metadata.get("network_arch")
        if metadata_network_arch == network_arch:
            print(f'PTM Name: {ptm_metadata["name"]}; PTM version: {ptm_metadata["version"]}; NGC PATH: {ptm_metadata["ngc_path"]}; Additional info: {ptm_metadata["additional_id_info"]}')

In [None]:
pretrained_map = {"auto_label" : "mask_auto_label:trainable_v1.0"}

In [None]:
# Get pretrained model
ptm = []
for ptm_metadata_path in glob.glob(pattern):
    with open(ptm_metadata_path, 'r') as metadata_file:
        ptm_metadata = json.load(metadata_file)
        ngc_path = ptm_metadata.get("ngc_path") or ""  # treat a missing or null ngc_path as empty
        metadata_network_arch = ptm_metadata.get("network_arch")
        if metadata_network_arch == network_arch and ngc_path.endswith(pretrained_map[network_arch]):
            ptm = [ptm_metadata["id"]]
            break

print(ptm)
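
The selection rule above (same `network_arch`, and an `ngc_path` ending in the mapped suffix) is easier to test when separated from file access. The following is a minimal sketch of that rule as a pure function over already-loaded metadata dictionaries. The name `select_ptm_ids` is hypothetical.

```python
def select_ptm_ids(metadata_records, network_arch, ngc_suffix):
    """Return [id] of the first record matching the architecture and NGC suffix."""
    for record in metadata_records:
        ngc_path = record.get("ngc_path") or ""  # tolerate a missing or null path
        if record.get("network_arch") == network_arch and ngc_path.endswith(ngc_suffix):
            return [record["id"]]
    return []  # no match, same as the initial `ptm = []` above
```

An empty result means no matching PTM was found on the shared volume, which is worth checking before assigning `metadata["ptm"]`.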

### Assign the PTM

In [None]:
# Assign PTM
metadata_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'metadata.json')

with open(metadata_path , "r") as metadata_file:
    metadata = json.load(metadata_file)

metadata["ptm"] = ptm

with open(metadata_path, "w") as metadata_file:
    json.dump(metadata, metadata_file, indent=2)

print(json.dumps(metadata, indent=2))

### Assign the dataset

In [None]:
# Assign dataset
metadata_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'metadata.json')

with open(metadata_path , "r") as metadata_file:
    metadata = json.load(metadata_file)

metadata["inference_dataset"] = dataset_id

with open(metadata_path, "w") as metadata_file:
    json.dump(metadata, metadata_file, indent=2)

print(json.dumps(metadata, indent=2))

### Set action specs

In [None]:
# Default model specs
! tao-client {model_name} model-action-defaults --id {model_id} --action {action} | tee ~/shared/users/{identity['user_id']}/models/{model_id}/specs/{action}.json

In [None]:
# Set specs
specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', f'{action}.json')

with open(specs_path , "r") as specs_file:
    specs = json.load(specs_file)

# Override any of the parameters listed in the previous cell as required
specs["gpu_ids"] = [0]
    
with open(specs_path, "w") as specs_file:
    json.dump(specs, specs_file, indent=2)

print(json.dumps(specs, indent=2))

### Run action

In [None]:
# Run action
label_job_id = subprocess.getoutput(f"tao-client {model_name} execute-action --id {model_id} --action {action}")
print(label_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
logs_dir = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'logs')
log_file = f"{label_job_id}.txt"

my_tail(logs_dir, log_file)

### Download the label masks

In [None]:
# Copy annotations to the dataset
!rsync -ah --info=progress2 ~/shared/users/{identity['user_id']}/models/{model_id}/{label_job_id}/label.json {data_dir}/label.json

### Delete the experiment

In [None]:
! rm -rf ~/shared/users/{os.environ['USER']}/models/{model_id}
! echo DONE

### Delete the dataset

In [None]:
! rm -rf ~/shared/users/{os.environ['USER']}/datasets/{dataset_id}
! echo DONE

## 3. Apply data augmentation <a class="anchor" id="head-3"></a>
In this section, we run offline augmentation on the original dataset. During augmentation, we can use the pseudo-masks generated in the previous step to refine the distorted or rotated bounding boxes.
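
To see why masks help, consider rotation: rotating the four corners of a box and re-enclosing them yields a box larger than the object, while enclosing the rotated mask points yields a tight one. The sketch below illustrates this geometry with a toy diamond-shaped mask. It is our own example and does not reproduce the TAO augmentation code.

```python
import math


def rotate_points(points, angle_deg, center):
    """Rotate (x, y) points by angle_deg around center."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    return [
        (cx + (x - cx) * cos_t - (y - cy) * sin_t,
         cy + (x - cx) * sin_t + (y - cy) * cos_t)
        for x, y in points
    ]


def enclosing_box(points):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


# Toy data: a 10 x 10 box and a diamond-shaped mask inscribed in it.
box_corners = [(0, 0), (10, 0), (10, 10), (0, 10)]
mask_polygon = [(5, 0), (10, 5), (5, 10), (0, 5)]
center = (5, 5)

loose_box = enclosing_box(rotate_points(box_corners, 45, center))
tight_box = enclosing_box(rotate_points(mask_polygon, 45, center))

print("box from rotated corners:", loose_box)
print("box from rotated mask:   ", tight_box)
```

Here the box derived from the mask is half as wide as the one derived from the rotated corners, which is the kind of slack that mask-based refinement removes.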

### Define the task and action

In [None]:
model_name = "augmentation"
action = "generate"

### Create the dataset
We format the dataset to include the generated mask information.

In [None]:
# Create dataset
dataset_id = subprocess.getoutput(f"tao-client {model_name} dataset-create --dataset_type object_detection --dataset_format raw")
print(dataset_id)

In [None]:
!rsync -ah --info=progress2 {data_dir}/images ~/shared/users/{identity['user_id']}/datasets/{dataset_id}/
!rsync -ah --info=progress2 {data_dir}/labels ~/shared/users/{identity['user_id']}/datasets/{dataset_id}/
!rsync -ah --info=progress2 {data_dir}/label.json ~/shared/users/{identity['user_id']}/datasets/{dataset_id}/

### List the datasets

In [None]:
pattern = os.path.join(home, 'shared', 'users', os.environ['USER'], 'datasets', '*', 'metadata.json')

datasets = []
for metadata_path in glob.glob(pattern):
    with open(metadata_path, 'r') as metadata_file:
        datasets.append(json.load(metadata_file))

print(json.dumps(datasets, indent=2))

### Create a model experiment

In [None]:
# Create model
model_id = subprocess.getoutput(f"tao-client {model_name} model-create --network_arch {model_name}")
print(model_id)

### Assign the dataset

In [None]:
# Assign dataset
metadata_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'metadata.json')

with open(metadata_path , "r") as metadata_file:
    metadata = json.load(metadata_file)

metadata["inference_dataset"] = dataset_id

with open(metadata_path, "w") as metadata_file:
    json.dump(metadata, metadata_file, indent=2)

print(json.dumps(metadata, indent=2))

### Set action specs

In [None]:
# Default model specs
! tao-client {model_name} model-action-defaults --id {model_id} --action {action} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/{action}.json

In [None]:
# Set specs
specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', f'{action}.json')

with open(specs_path , "r") as specs_file:
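    specs = json.load(specs_file)

# Override any of the parameters listed in the previous cell as required (the defaults are kept unchanged here)
with open(specs_path, "w") as specs_file: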
    json.dump(specs, specs_file, indent=2)

print(json.dumps(specs, indent=2))

### Execute the data augmentation action

In [None]:
# Run action
augment_job_id = subprocess.getoutput(f"tao-client {model_name} execute-action --id {model_id} --action {action}")
print(augment_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
logs_dir = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'logs')
log_file = f"{augment_job_id}.txt"

my_tail(logs_dir, log_file)

### Delete the experiment

In [None]:
! rm -rf ~/shared/users/{os.environ['USER']}/models/{model_id}
! echo DONE

### Delete the dataset

In [None]:
! rm -rf ~/shared/users/{os.environ['USER']}/datasets/{dataset_id}
! echo DONE

## 4. Perform data analytics  <a class="anchor" id="head-4"></a>
Next, we perform analytics with the KITTI dataset.
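
As a point of comparison for the service output, one basic statistic can be computed locally: the number of objects per class in the KITTI label files. The following is a minimal sketch of that single statistic, assuming each non-empty label line starts with the class name. The function name `kitti_class_counts` is our own, and the sketch is not the analytics implementation that TAO runs.

```python
from collections import Counter
from pathlib import Path


def kitti_class_counts(labels_dir):
    """Count objects per class across all KITTI .txt label files in a folder."""
    counts = Counter()
    for label_path in sorted(Path(labels_dir).glob("*.txt")):
        for line in label_path.read_text().splitlines():
            fields = line.split()
            if fields:  # skip blank lines
                counts[fields[0]] += 1  # the first field is the object class
    return counts
```

For example, `kitti_class_counts(os.path.join(data_dir, "labels")).most_common()` lists the classes from most to least frequent.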

### Define the task and action

In [None]:
model_name = "analytics"
action = "analyze"

### Create the dataset

In [None]:
# Create dataset
dataset_id = subprocess.getoutput(f"tao-client {model_name} dataset-create --dataset_type object_detection --dataset_format raw")
print(dataset_id)

In [None]:
!rsync -ah --info=progress2 {data_dir}/images ~/shared/users/{identity['user_id']}/datasets/{dataset_id}/
!rsync -ah --info=progress2 {data_dir}/labels ~/shared/users/{identity['user_id']}/datasets/{dataset_id}/

### List the datasets

In [None]:
pattern = os.path.join(home, 'shared', 'users', os.environ['USER'], 'datasets', '*', 'metadata.json')

datasets = []
for metadata_path in glob.glob(pattern):
    with open(metadata_path, 'r') as metadata_file:
        datasets.append(json.load(metadata_file))

print(json.dumps(datasets, indent=2))

### Create a model experiment

In [None]:
# Create model
model_id = subprocess.getoutput(f"tao-client {model_name} model-create --network_arch {model_name}")
print(model_id)

### Assign the dataset

In [None]:
# Assign dataset
metadata_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'metadata.json')

with open(metadata_path , "r") as metadata_file:
    metadata = json.load(metadata_file)

metadata["inference_dataset"] = dataset_id

with open(metadata_path, "w") as metadata_file:
    json.dump(metadata, metadata_file, indent=2)

print(json.dumps(metadata, indent=2))

### Set action specs

In [None]:
# Default model specs
! tao-client {model_name} model-action-defaults --id {model_id} --action {action} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/{action}.json

In [None]:
# Set specs
specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', f'{action}.json')

with open(specs_path , "r") as specs_file:
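    specs = json.load(specs_file)

# Override any of the parameters listed in the previous cell as required (the defaults are kept unchanged here)
with open(specs_path, "w") as specs_file: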
    json.dump(specs, specs_file, indent=2)

print(json.dumps(specs, indent=2))

### Execute the data analytics action

In [None]:
# Run action
analyze_job_id = subprocess.getoutput(f"tao-client {model_name} execute-action --id {model_id} --action {action}")
print(analyze_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
logs_dir = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'logs')
log_file = f"{analyze_job_id}.txt"

my_tail(logs_dir, log_file)

### Delete the experiment

In [None]:
! rm -rf ~/shared/users/{os.environ['USER']}/models/{model_id}
! echo DONE

### Delete the dataset

In [None]:
! rm -rf ~/shared/users/{os.environ['USER']}/datasets/{dataset_id}
! echo DONE