## Notebook to demonstrate Data-Services workflow
### The workflow in a nutshell
TAO Data Services include 4 key pipelines:
1. Offline data augmentation using DALI
2. Auto labeling using TAO Mask Auto-labeler (MAL)
3. Annotation conversion
4. Groundtruth analytics

## Learning Objectives

In this notebook, you will learn how to leverage the simplicity and convenience of TAO to:

* Convert KITTI dataset to COCO format
* Run auto-labeling to generate pseudo masks for KITTI bounding boxes
* Apply data augmentation to the KITTI dataset with bounding box refinement
* Run data analytics to collect useful statistics on the original and augmented KITTI dataset

### Table of contents

1. [Create a cloud workspace](#head-2)
1. [Convert KITTI data to COCO format](#head-1)
1. [Generate pseudo-masks with the auto-labeler](#head-2)
1. [Apply data augmentation](#head-3)
1. [Perform data analytics](#head-4)
1. [Perform data validation](#head-5)

### Requirements
Please find the server requirements [here](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#)

In [None]:
# imports
import json
import os
import requests
import time
from IPython.display import clear_output

### FIXME's <a class="anchor" id="head-1"></a>

1. Assign the ip_address and port_number in FIXME 1 ([info](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_rest_api.html))
1. Assign the ngc_key variable in FIXME 2
1. Assign the ngc_org_name variable in FIXME 3
1. Set cloud storage details in FIXME 4
1. Assign path of kitti dataset relative to the bucket in FIXME 5

#### Set API service's host information

In [None]:
host_url = "http://<ip_address>:<port_number>" # FIXME1 example: https://10.137.149.22:32334
# In host machine, node ip_address and port number can be obtained as follows,
# ip_address: hostname -i
# port_number: kubectl get service tao-api-ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'

#### Set NGC Personal key for authentication and NGC org to access API services

In [None]:
ngc_key = "<ngc_key>" # FIXME2 example: (Add NGC Personal key)

In [None]:
ngc_org_name = "ea-tlt" # FIXME3 your NGC ORG

### Login <a class="anchor" id="head-2"></a>

In [None]:
# Validate NGC_PERSONAL_KEY
data = json.dumps({"ngc_org_name": ngc_org_name,
                   "ngc_key": ngc_key,
                   "enable_telemetry": True})
response = requests.post(f"{host_url}/api/v1/login", data=data)
assert response.status_code in (200, 201)
assert "token" in response.json().keys()
token = response.json()["token"]
print("JWT",token)

# Set base URL
base_url = f"{host_url}/api/v1/orgs/{ngc_org_name}"
print("API Calls will be forwarded to",base_url)

headers = {"Authorization": f"Bearer {token}"}

### Get NVCF gpu details <a class="anchor" id="head-2"></a>

One of the keys of the response JSON is to be used as `platform_id` when you run each job.

In [None]:
# # Valid only for NVCF backend during TAO-API helm deployment currently
# endpoint = f"{base_url}:gpu_types"
# response = requests.get(endpoint, headers=headers)

# assert response.ok
# print(response)
# print((json.dumps(response.json(), indent=4)))

## 1. Convert KITTI data to COCO format <a class="anchor" id="head-1"></a>
First, we convert the dataset from KITTI to COCO format.
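
To make the conversion concrete: a KITTI label line stores the 2D box as left, top, right, bottom pixel coordinates (fields 5 to 8 of the 15 space-separated fields), while COCO stores `[x, y, width, height]`. The following is a minimal sketch of that coordinate change only. `kitti_line_to_coco_bbox` is a hypothetical helper for illustration, not the converter that TAO runs, which also has to handle categories, image metadata, and annotation IDs.

```python
def kitti_line_to_coco_bbox(line):
    """Return (class_name, [x, y, width, height]) for one KITTI label line."""
    fields = line.split()
    if len(fields) < 8:
        raise ValueError(f"Expected at least 8 fields, got {len(fields)}")
    class_name = fields[0]
    # KITTI fields 4..7 (0-based) hold the 2D box: left, top, right, bottom.
    left, top, right, bottom = (float(value) for value in fields[4:8])
    # COCO boxes are [top-left x, top-left y, width, height].
    return class_name, [left, top, right - left, bottom - top]


# Made-up label line used only to exercise the helper.
sample = "Car 0.00 0 -1.58 100.0 120.0 300.0 220.0 1.5 1.6 3.7 1.0 1.7 20.0 -1.5"
print(kitti_line_to_coco_bbox(sample))
```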

### Create the dataset
We support both KITTI and COCO data formats.

A KITTI dataset follows the directory structure displayed below:
```
$DATA_DIR/dataset
├── images
│   ├── image_name_1.jpg
│   ├── image_name_2.jpg
│   ├── ...
└── labels
    ├── image_name_1.txt
    ├── image_name_2.txt
    ├── ...
```

A COCO dataset follows the directory structure displayed below:
```
$DATA_DIR/dataset
├── images
│   ├── image_name_1.jpg
│   ├── image_name_2.jpg
│   ├── ...
└── annotations.json
```
In this notebook, we use the KITTI object detection dataset. For more details, see the [KITTI benchmark page](http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=2d).
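
Before uploading data to the bucket, it can help to confirm that a local copy matches the KITTI layout shown above. The sketch below assumes `.jpg` images and `.txt` labels that share a file stem, as in the tree. `check_kitti_layout` is a hypothetical helper introduced here, not part of the TAO API.

```python
import tempfile
from pathlib import Path


def check_kitti_layout(dataset_dir):
    """Return (images_without_label, labels_without_image) as sorted stem lists."""
    root = Path(dataset_dir)
    images_dir, labels_dir = root / "images", root / "labels"
    for required in (images_dir, labels_dir):
        if not required.is_dir():
            raise FileNotFoundError(f"Missing directory: {required}")
    image_stems = {path.stem for path in images_dir.glob("*.jpg")}
    label_stems = {path.stem for path in labels_dir.glob("*.txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)


# Build a throwaway dataset with one unlabeled image to exercise the check.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "images").mkdir()
    (Path(tmp) / "labels").mkdir()
    (Path(tmp) / "images" / "image_name_1.jpg").touch()
    (Path(tmp) / "images" / "image_name_2.jpg").touch()
    (Path(tmp) / "labels" / "image_name_1.txt").touch()
    missing_labels, missing_images = check_kitti_layout(tmp)
print(missing_labels, missing_images)
```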

In [None]:
job_map = {}

#### Set cloud details 

In [None]:
#FIXME4 Dataset Cloud bucket details to download dataset for experiments (Can be read only)
cloud_metadata = {}
cloud_metadata["name"] = "AWS workspace info"  # A Representative name for this cloud info
cloud_metadata["cloud_type"] = "aws"  # If it's AWS, HuggingFace or Azure
cloud_metadata["cloud_specific_details"] = {}
cloud_metadata["cloud_specific_details"]["cloud_region"] = "us-west-1"  # Bucket region
cloud_metadata["cloud_specific_details"]["cloud_bucket_name"] = ""  # Bucket name
# Access and Secret for AWS
cloud_metadata["cloud_specific_details"]["access_key"] = ""
cloud_metadata["cloud_specific_details"]["secret_key"] = ""
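
Rather than typing credentials into the notebook, the same dictionary can be filled from environment variables. This is a minimal sketch: `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the conventional AWS variable names, while `build_aws_cloud_metadata` and its arguments are hypothetical names introduced here for illustration.

```python
import os


def build_aws_cloud_metadata(bucket_name, region, name="AWS workspace info", env=os.environ):
    """Build the workspace payload shown above, reading AWS keys from `env`."""
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise KeyError(f"Set these environment variables first: {', '.join(missing)}")
    return {
        "name": name,
        "cloud_type": "aws",
        "cloud_specific_details": {
            "cloud_region": region,
            "cloud_bucket_name": bucket_name,
            "access_key": env["AWS_ACCESS_KEY_ID"],
            "secret_key": env["AWS_SECRET_ACCESS_KEY"],
        },
    }
```

With the variables exported before the notebook server starts, `cloud_metadata = build_aws_cloud_metadata("<bucket>", "us-west-1")` would produce the same structure as the cell above.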

In [None]:
# Create cloud workspace
data = json.dumps(cloud_metadata)

endpoint = f"{base_url}/workspaces"

response = requests.post(endpoint,data=data,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

assert "id" in response.json().keys()
workspace_id = response.json()["id"]

### Create a kitti Dataset

In [None]:
# FIXME5 : Set path relative to cloud bucket
kitti_dataset_path =  "/data/tao_od_synthetic_subset_train_convert_cleaned/"

In [None]:
# Create Dataset
dataset_metadata = {"type": "object_detection",
                    "format": "kitti",
                    "workspace": workspace_id,
                    "cloud_file_path": kitti_dataset_path,
                    "use_for": ["testing"]
                   }
data = json.dumps(dataset_metadata)

endpoint = f"{base_url}/datasets"

response = requests.post(endpoint, data=data, headers=headers)
assert response.status_code in (200, 201)
assert "id" in response.json().keys()

print(response)
print(json.dumps(response.json(), indent=4))
kitti_dataset_id = response.json()["id"]

In [None]:
# Check progress
endpoint = f"{base_url}/datasets/{kitti_dataset_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    assert response.status_code in (200, 201)

    print(response)
    print(json.dumps(response.json(), indent=4))
    if response.json().get("status") == "invalid_pull":
        raise ValueError("Dataset pull failed")
    if response.json().get("status") == "pull_complete":
        break
    time.sleep(5)
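
The loop above runs until the pull succeeds or fails, so a stalled pull would block the notebook indefinitely. As a hedged alternative, the sketch below wraps the same pattern in a helper with a timeout. `wait_for_status` and its parameters are hypothetical names, and the status source is injected as a callable so the helper can be tried without a running API.

```python
import time


def wait_for_status(fetch_status, done, failed, interval=5.0, timeout=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it returns a value in `done`.

    Raises ValueError on a status in `failed` and TimeoutError after `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in failed:
            raise ValueError(f"Operation failed with status {status!r}")
        if status in done:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout} seconds")
        sleep(interval)


# Stand-in for the real GET request; the first two statuses are placeholders.
fake_statuses = iter(["placeholder_a", "placeholder_b", "pull_complete"])
result = wait_for_status(lambda: next(fake_statuses),
                         done={"pull_complete"}, failed={"invalid_pull"},
                         sleep=lambda seconds: None)
print(result)
```

In the notebook, `fetch_status` could be `lambda: requests.get(endpoint, headers=headers).json().get("status")`.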

### Dataset format conversion action 


#### Get specs


In [None]:
# Get default spec schema
endpoint = f"{base_url}/datasets/{kitti_dataset_id}/specs/annotation_format_convert/schema"
 
response = requests.get(endpoint, headers=headers)

print(response)
annotations_conversion_specs = response.json()["default"]
print(json.dumps(annotations_conversion_specs, sort_keys=True, indent=4))

In [None]:
# Updating spec file
annotations_conversion_specs["data"]["input_format"] = "KITTI"
annotations_conversion_specs["data"]["output_format"] = "COCO"
print(json.dumps(annotations_conversion_specs, sort_keys=True, indent=4))
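
When several nested spec fields need to change, a small override helper keeps the edits in one place and leaves the default spec untouched. The following is a sketch only: `apply_overrides` and the dotted-path convention are assumptions made for this example, and the default values shown are made up rather than taken from the API.

```python
import copy


def apply_overrides(specs, overrides):
    """Return a copy of `specs` with dotted-path keys in `overrides` applied."""
    updated = copy.deepcopy(specs)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = updated
        for key in parents:
            # Create intermediate dictionaries when the default spec lacks them.
            node = node.setdefault(key, {})
        node[leaf] = value
    return updated


# Made-up defaults standing in for the schema's "default" entry.
default_specs = {"data": {"input_format": "COCO", "output_format": "KITTI"}}
converted = apply_overrides(default_specs, {"data.input_format": "KITTI",
                                            "data.output_format": "COCO"})
print(converted)
```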

#### Run action 


In [None]:
# Run action
parent = None
data = json.dumps({"parent_job_id":parent, "action":"annotation_format_convert", "specs":annotations_conversion_specs,
                  #  "platform_id": "9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Pick a platform_from output of {base_url}:gpu_types depending on GPU_type and instance_type
                   })

endpoint = f"{base_url}/datasets/{kitti_dataset_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)

assert response.status_code in (200, 201)
assert response.json()

print(response)
print(json.dumps(response.json(), indent=4))

job_map["convert"] = response.json()
print(job_map)
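
Every action in this notebook posts the same three keys, with `platform_id` added only for the NVCF backend. A minimal sketch of a payload builder that captures this is shown below. `build_job_payload` is a hypothetical helper; the key names are the ones used in the cell above.

```python
import json


def build_job_payload(action, specs, parent_job_id=None, platform_id=None):
    """Serialize a job request; include platform_id only when one is given."""
    payload = {"parent_job_id": parent_job_id, "action": action, "specs": specs}
    if platform_id is not None:
        payload["platform_id"] = platform_id
    return json.dumps(payload)


body = build_job_payload("annotation_format_convert",
                         {"data": {"input_format": "KITTI", "output_format": "COCO"}})
print(body)
```

The returned string can be passed as `data=` to `requests.post`, in place of the inline `json.dumps` call.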

In [None]:
# Monitor job status by repeatedly running this cell
coco_dataset_id = kitti_dataset_id
convert_job_id = job_map["convert"]
endpoint = f"{base_url}/datasets/{kitti_dataset_id}/jobs/{convert_job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    
    if "error_desc" in response.json().keys() and response.json()["error_desc"] in ("Job trying to retrieve not found", "No AutoML run found"):
        print("Job is being created")
        time.sleep(5)
        continue        
    assert response.status_code in (200, 201)
    print(response)
    print(json.dumps(response.json(), sort_keys=True, indent=4))
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)
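
The monitoring loop mixes three decisions: whether the job record exists yet, whether the job failed, and whether it reached a terminal state. As an illustration, the sketch below separates that logic into one function that can be checked without a server. `classify_job_response` and its return labels are hypothetical; the message and status strings are the ones used in the loop above.

```python
CREATING_MESSAGES = ("Job trying to retrieve not found", "No AutoML run found")
TERMINAL_STATUSES = ("Done", "Error", "Canceled", "Paused")


def classify_job_response(status_code, body):
    """Map one polling response to 'creating', 'failed', 'finished', or 'running'."""
    if body.get("error_desc") in CREATING_MESSAGES:
        return "creating"
    # Mirrors the assertions in the loop: bad HTTP code, missing status, or "Error".
    if status_code not in (200, 201) or body.get("status") in (None, "Error"):
        return "failed"
    if body.get("status") in TERMINAL_STATUSES:
        return "finished"
    return "running"


print(classify_job_response(200, {"status": "Done"}))
```

A polling loop would then sleep 5 seconds on `creating`, sleep 15 seconds on `running`, stop on `finished`, and raise on `failed`.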

In [None]:
# After the action completes, the dataset format changes from KITTI to COCO
endpoint = f"{base_url}/datasets/{coco_dataset_id}"

response = requests.get(endpoint, headers=headers)

print(response)
print(json.dumps(response.json(), sort_keys=True, indent=4))

## 2. Generate pseudo-masks with the auto-labeler <a class="anchor" id="head-2"></a>
Here we will use a pretrained MAL model to generate pseudo-masks for the converted KITTI data. 

### Create a COCO dataset - if you already have data in COCO detection format (without masks) and skipped step 1

In [None]:
# # Create Dataset
# dataset_metadata = {"type": "object_detection",
#                     "format": "coco",
#                     "workspace": workspace_id,
#                     "cloud_file_path": coco_dataset_path,
#                     "use_for": ["testing"]
#                     }
# data = json.dumps(dataset_metadata)

# endpoint = f"{base_url}/datasets"

# response = requests.post(endpoint, data=data, headers=headers)
# assert response.status_code in (200, 201)
# assert "id" in response.json().keys()

# print(response)
# print(json.dumps(response.json(), indent=4))
# coco_dataset_id = response.json()["id"]

In [None]:
# # Check progress
# endpoint = f"{base_url}/datasets/{coco_dataset_id}"

# while True:
#     clear_output(wait=True)
#     response = requests.get(endpoint, headers=headers)
#     assert response.status_code in (200, 201)

#     print(response)
#     print(json.dumps(response.json(), indent=4))
#     if response.json().get("status") == "invalid_pull":
#         raise ValueError("Dataset pull failed")
#     if response.json().get("status") == "pull_complete":
#         break
#     time.sleep(5)

### Assign PTM

In [None]:
# List models
endpoint = f"{base_url}/experiments:base"

response = requests.get(endpoint, headers=headers)

print(response)
print("Available Mask Auto Label base experiments:")
for rsp in response.json()["experiments"]:
    if rsp["name"] == "Mask Auto Label":
        print(f'PTM Name: {rsp["name"]}; PTM version: {rsp["version"]}; NGC PATH: {rsp["ngc_path"]}')

In [None]:
pretrained_map = {"auto_label" : "mask_auto_label:trainable_v1.0"}

In [None]:
# Get pretrained model
model_list = f"{base_url}/experiments:base"
response = requests.get(model_list, headers=headers)

response_json = response.json()["experiments"]

# Search for ptm with given ngc path
ptm = []
for rsp in response_json:
    if rsp["ngc_path"].endswith(pretrained_map["auto_label"]):
        ptm_id = rsp["id"]
        ptm = [ptm_id]
        print("Metadata for model with requested NGC Path")
        print(rsp)
        break
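
Note that the search above leaves `ptm` as an empty list when no entry matches, and the failure only surfaces later. A hedged variant that fails immediately is sketched below. `find_ptm_id` is a hypothetical helper, and the entries in `fake_experiments` are made up to stand in for `response.json()["experiments"]`.

```python
def find_ptm_id(experiments, ngc_path_suffix):
    """Return the id of the first experiment whose ngc_path ends with the suffix."""
    for experiment in experiments:
        if experiment.get("ngc_path", "").endswith(ngc_path_suffix):
            return experiment["id"]
    raise LookupError(f"No base experiment with ngc_path ending in {ngc_path_suffix!r}")


# Made-up entries; real ones come from the experiments:base endpoint.
fake_experiments = [
    {"id": "id-a", "ngc_path": "example/org/other_model:v1"},
    {"id": "id-b", "ngc_path": "example/org/mask_auto_label:trainable_v1.0"},
]
ptm_id = find_ptm_id(fake_experiments, "mask_auto_label:trainable_v1.0")
print(ptm_id)
```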

In [None]:
# Assign PTM
dataset_information = {"base_experiment": ptm}

data = json.dumps(dataset_information)

endpoint = f"{base_url}/datasets/{coco_dataset_id}"

response = requests.patch(endpoint, data=data, headers=headers)

print(response)
print(json.dumps(response.json(), indent=4))

### Auto labeling action

#### Get specs

In [None]:
# Get default spec schema
endpoint = f"{base_url}/datasets/{coco_dataset_id}/specs/auto_label/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break

print(response)
auto_label_generate_specs = response.json()["default"]
print(json.dumps(auto_label_generate_specs, sort_keys=True, indent=4))

In [None]:
# Override any of the parameters listed in the previous cell as required
auto_label_generate_specs["gpu_ids"] = [0]
print(json.dumps(auto_label_generate_specs, sort_keys=True, indent=4))

#### Run action

In [None]:
# Run action
parent = convert_job_id

data = json.dumps({"parent_job_id": parent, "action":"auto_label", "specs":auto_label_generate_specs,
                  #  "platform_id": "9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Pick a platform_from output of {base_url}:gpu_types depending on GPU_type and instance_type
                   })

endpoint = f"{base_url}/datasets/{coco_dataset_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)

print(response)
print(json.dumps(response.json(), indent=4))

job_map["auto_labeling"] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
coco_mask_dataset_id = kitti_dataset_id
auto_labeling_job_id = job_map["auto_labeling"]
endpoint = f"{base_url}/datasets/{coco_dataset_id}/jobs/{auto_labeling_job_id}"

while True: 
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    
    if "error_desc" in response.json().keys() and response.json()["error_desc"] in ("Job trying to retrieve not found", "No AutoML run found"):
        print("Job is being created")
        time.sleep(5)
        continue
    assert response.status_code in (200, 201)
    print(response)
    print(json.dumps(response.json(), sort_keys=True, indent=4))
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

## 3. Apply data augmentation <a class="anchor" id="head-3"></a>
In this section, we run offline augmentation on the original dataset. During augmentation, we can use the pseudo-masks generated in the previous step to refine the distorted or rotated bounding boxes.

### Create a coco mask Dataset - If you already have data in coco segmentation format and skipped step 1 and 2

In [None]:
# # Create Dataset
# dataset_metadata = {"type": "object_detection",
#                     "format": "coco",
#                     "workspace": workspace_id,
#                     "cloud_file_path": coco_mask_dataset_path,
#                     "use_for": ["testing"]
#                     }
# data = json.dumps(dataset_metadata)

# endpoint = f"{base_url}/datasets"

# response = requests.post(endpoint, data=data, headers=headers)
# assert response.status_code in (200, 201)
# assert "id" in response.json().keys()

# print(response)
# print(json.dumps(response.json(), indent=4))
# coco_mask_dataset_id = response.json()["id"]

In [None]:
# # Check progress
# endpoint = f"{base_url}/datasets/{coco_mask_dataset_id}"

# while True:
#     clear_output(wait=True)
#     response = requests.get(endpoint, headers=headers)
#     assert response.status_code in (200, 201)

#     print(response)
#     print(json.dumps(response.json(), indent=4))
#     if response.json().get("status") == "invalid_pull":
#         raise ValueError("Dataset pull failed")
#     if response.json().get("status") == "pull_complete":
#         break
#     time.sleep(5)

### Run data augmentation action


#### Get specs


In [None]:
# Get default spec schema
endpoint = f"{base_url}/datasets/{coco_mask_dataset_id}/specs/augment/schema"

response = requests.get(endpoint, headers=headers)

print(response)
augmentation_generate_specs = response.json()["default"]
print(json.dumps(augmentation_generate_specs, sort_keys=True, indent=4))

In [None]:
# Make changes to the specs if necessary
print(json.dumps(augmentation_generate_specs, sort_keys=True, indent=4))

#### Run action


In [None]:
# Run action
parent = auto_labeling_job_id

data = json.dumps({"parent_job_id":parent, "action":"augment", "specs":augmentation_generate_specs,
                  #  "platform_id": "9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Pick a platform_from output of {base_url}:gpu_types depending on GPU_type and instance_type
                   })

endpoint = f"{base_url}/datasets/{coco_mask_dataset_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)

print(response)
print(json.dumps(response.json(), indent=4))

job_map["augmentation"] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
coco_mask_augmented_dataset_id = job_map["augmentation"]
endpoint = f"{base_url}/datasets/{coco_mask_dataset_id}/jobs/{coco_mask_augmented_dataset_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    
    if "error_desc" in response.json().keys() and response.json()["error_desc"] in ("Job trying to retrieve not found", "No AutoML run found"):
        print("Job is being created")
        time.sleep(5)
        continue
    assert response.status_code in (200, 201)
    print(response)
    print(json.dumps(response.json(), sort_keys=True, indent=4))
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

In [None]:
# After the augment action you'll get a new dataset
endpoint = f"{base_url}/datasets/{coco_mask_augmented_dataset_id}"

response = requests.get(endpoint, headers=headers)

print(response)
print(json.dumps(response.json(), sort_keys=True, indent=4))

## 4. Perform data analytics  <a class="anchor" id="head-4"></a>
Next, we perform analytics with the KITTI dataset.

### Run data analytics action


#### Get specs


In [None]:
# Get default spec schema
endpoint = f"{base_url}/datasets/{coco_dataset_id}/specs/analyze/schema"
 
response = requests.get(endpoint, headers=headers)

print(response)
analytics_analyze_specs = response.json()["default"]
print(json.dumps(analytics_analyze_specs, sort_keys=True, indent=4))

In [None]:
# Make changes to the specs if necessary
analytics_analyze_specs["data"]["input_format"] = "COCO"
print(json.dumps(analytics_analyze_specs, sort_keys=True, indent=4))

#### Run action


In [None]:
# Run action
parent = convert_job_id

data = json.dumps({"parent_job_id":parent, "action":"analyze", "specs":analytics_analyze_specs,
                  #  "platform_id": "9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Pick a platform_from output of {base_url}:gpu_types depending on GPU_type and instance_type
                   })

endpoint = f"{base_url}/datasets/{kitti_dataset_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)

print(response)
print(json.dumps(response.json(), indent=4))

job_map["analytics"] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map["analytics"]
endpoint = f"{base_url}/datasets/{kitti_dataset_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    
    if "error_desc" in response.json().keys() and response.json()["error_desc"] in ("Job trying to retrieve not found", "No AutoML run found"):
        print("Job is being created")
        time.sleep(5)
        continue
    assert response.status_code in (200, 201)
    print(response)
    print(json.dumps(response.json(), sort_keys=True, indent=4))
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

## 5. Perform data validation  <a class="anchor" id="head-5"></a>
Next, we validate the annotations and images.

### Run Data annotation validation action


#### Get specs


In [None]:
# Get default spec schema
endpoint = f"{base_url}/datasets/{coco_dataset_id}/specs/validate_annotations/schema"
 
response = requests.get(endpoint, headers=headers)

print(response)
validate_annotation_specs = response.json()["default"]
print(json.dumps(validate_annotation_specs, sort_keys=True, indent=4))

In [None]:
# Make changes to the specs if necessary
validate_annotation_specs["data"]["input_format"] = "COCO"
print(json.dumps(validate_annotation_specs, sort_keys=True, indent=4))

#### Run action


In [None]:
# Run action
parent = convert_job_id

data = json.dumps({"parent_job_id":parent, "action":"validate_annotations", "specs":validate_annotation_specs,
                  #  "platform_id": "9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Pick a platform_from output of {base_url}:gpu_types depending on GPU_type and instance_type
                   })

endpoint = f"{base_url}/datasets/{kitti_dataset_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)

print(response)
print(json.dumps(response.json(), indent=4))

job_map["validate_annotations"] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map["validate_annotations"]
endpoint = f"{base_url}/datasets/{kitti_dataset_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    
    if "error_desc" in response.json().keys() and response.json()["error_desc"] in ("Job trying to retrieve not found", "No AutoML run found"):
        print("Job is being created")
        time.sleep(5)
        continue
    assert response.status_code in (200, 201)
    print(response)
    print(json.dumps(response.json(), sort_keys=True, indent=4))
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

### Run Data image validation action - removes corrupted images and creates a new dataset

#### Get specs


In [None]:
# Get default spec schema
endpoint = f"{base_url}/datasets/{kitti_dataset_id}/specs/validate_images/schema"
 
response = requests.get(endpoint, headers=headers)

print(response)
validate_images_specs = response.json()["default"]
print(json.dumps(validate_images_specs, sort_keys=True, indent=4))

In [None]:
# Make changes to the specs if necessary

#### Run action


In [None]:
# Run action
parent = None

data = json.dumps({"parent_job_id":parent, "action":"validate_images", "specs":validate_images_specs,
                  #  "platform_id": "9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Pick a platform_from output of {base_url}:gpu_types depending on GPU_type and instance_type
                   })

endpoint = f"{base_url}/datasets/{kitti_dataset_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)

print(response)
print(json.dumps(response.json(), indent=4))

job_map["validate_images"] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map["validate_images"]
endpoint = f"{base_url}/datasets/{kitti_dataset_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    
    if "error_desc" in response.json().keys() and response.json()["error_desc"] in ("Job trying to retrieve not found", "No AutoML run found"):
        print("Job is being created")
        time.sleep(5)
        continue
    assert response.status_code in (200, 201)
    print(response)
    print(json.dumps(response.json(), sort_keys=True, indent=4))
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

### Delete dataset <a class="anchor" id="head-21"></a>

#### Delete original kitti dataset <a class="anchor" id="head-21"></a>

In [None]:
endpoint = f"{base_url}/datasets/{kitti_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

#### Delete coco augment dataset <a class="anchor" id="head-21"></a>

In [None]:
endpoint = f"{base_url}/datasets/{coco_mask_augmented_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))