# TAO Data Services

TAO Data Services include 4 key pipelines:
1. offline data augmentation using DALI
2. auto labeling using TAO Mask Auto-labeler (MAL)
3. annotation conversion service
4. data analytics service

The offline data augmentation service enables users to enrich their data with GPU-accelerated spatial, color, and kernel augmentation routines, with more control than the random augmentation often applied online during training.

Annotating an image-based dataset can be tedious and time-consuming. Labeling an instance mask by drawing a good polygon around an object can take 10 times longer than drawing a bounding box. The Auto-Label service automatically generates instance segmentation masks from ground-truth bounding boxes, which greatly reduces the labeling effort.

The annotation conversion service provides an easy way to convert ground-truth annotations between the COCO and KITTI formats, which are widely used in TAO models.

The Data Analytics service analyzes object-detection annotation files and image files, calculates insights,
and generates graphs and a summary based on metrics such as object size and object count.


<img align="center" src="https://github.com/vpraveen-nv/model_card_images/blob/main/cv/notebook/common/mal_sample.jpg?raw=true" width="960">

## Learning Objectives

In this notebook, you will learn how to leverage the simplicity and convenience of TAO to:

* Convert KITTI dataset to COCO format
* Run auto-labeling to generate pseudo masks for KITTI bounding boxes
* Apply data augmentation to the KITTI dataset with bounding box refinement
* Run data analytics to collect useful statistics on the original and augmented KITTI dataset

## Table of Contents

This notebook shows an example use case of MAL using the Train Adapt Optimize (TAO) Toolkit.

0. [Set up env variables and map drives](#head-0)
1. [Installing the TAO launcher](#head-1)
2. [Prepare dataset](#head-2)
3. [Convert KITTI dataset to COCO format](#head-3)
4. [Generate pseudo-masks with the auto-labeler](#head-4)
5. [Apply data augmentation](#head-5)
6. [Run data analytics](#head-6)

## 0. Set up env variables and map drives <a class="anchor" id="head-0"></a>

When using the purpose-built pretrained models from NGC, please make sure to set the `$KEY` environment variable to the key mentioned in the model overview. Failing to do so can lead to errors when loading them as pretrained models.

This notebook requires you to set the environment variable `$LOCAL_PROJECT_DIR` to the path of your workspace. The dataset for this notebook is expected to reside in `$HOST_DATA_DIR`, while the collateral generated by the TAO experiments is written to `$HOST_RESULTS_DIR` (set below to `$LOCAL_PROJECT_DIR/data_services`). More information on how to set up the dataset and on the supported steps in the TAO workflow is provided in the subsequent cells.

The TAO launcher uses Docker containers under the hood, and **for our data and results directories to be visible to the container, they need to be mapped**. The launcher can be configured using the config file `~/.tao_mounts.json`. Apart from the mounts, you can also configure additional options such as environment variables and the amount of shared memory available to the TAO launcher.

`IMPORTANT NOTE:` The code below creates a sample `~/.tao_mounts.json` file. Here, we map the directories in which we save the data, specs, results, and cache. You should configure it for your specific case so that these directories are correctly visible to the Docker container.


In [None]:
import os

# Please define this local project directory that needs to be mapped to the TAO docker session.
%env LOCAL_PROJECT_DIR=/path/to/local/tao-experiments
os.environ["HOST_RESULTS_DIR"] = os.path.join(os.getenv("LOCAL_PROJECT_DIR", os.getcwd()), "data_services")

# Set this path if you don't run the notebook from the samples directory.
# %env NOTEBOOK_ROOT=~/tao-samples/data_services

# The sample spec files are present in the same path as the downloaded samples.
os.environ["HOST_SPECS_DIR"] = os.path.join(os.getenv("NOTEBOOK_ROOT", os.getcwd()), "specs")
os.environ["HOST_DATA_DIR"] = os.path.join(os.getenv("NOTEBOOK_ROOT", os.getcwd()), "data")

In [None]:
!mkdir -p $HOST_RESULTS_DIR

In [None]:
# Mapping up the local directories to the TAO docker.
import json
import os
mounts_file = os.path.expanduser("~/.tao_mounts.json")
tlt_configs = {
   "Mounts":[
         # Mapping the Local project directory
        {
            "source": os.environ["LOCAL_PROJECT_DIR"],
            "destination": "/workspace/tao-experiments"
        },
       {
           "source": os.environ["HOST_DATA_DIR"],
           "destination": "/data"
       },
       {
           "source": os.environ["HOST_SPECS_DIR"],
           "destination": "/specs"
       },
       {
           "source": os.environ["HOST_RESULTS_DIR"],
           "destination": "/results"
       }
   ],
   "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
         },
        "network": "host"
   }
}
# Writing the mounts file.
with open(mounts_file, "w") as mfile:
    json.dump(tlt_configs, mfile, indent=4)

In [None]:
!cat ~/.tao_mounts.json
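
Before launching a TAO task, it can help to confirm that every `source` path in the mounts file exists on the host, since a missing directory may otherwise only show up later as a missing file inside the container. The helper below is a minimal sketch of such a check; `missing_mount_sources` is a hypothetical name and is not part of the TAO launcher.

```python
import json
import os

def missing_mount_sources(mounts_file):
    """Return the "source" paths in a mounts file that are not directories on the host."""
    with open(os.path.expanduser(mounts_file)) as f:
        config = json.load(f)
    return [
        mount["source"]
        for mount in config.get("Mounts", [])
        if not os.path.isdir(mount["source"])
    ]

# Example usage (hypothetical):
# print(missing_mount_sources("~/.tao_mounts.json"))
```

An empty list means every mapped source directory was found.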

## 1. Installing the TAO launcher <a class="anchor" id="head-1"></a>
The TAO launcher is a python package distributed as a python wheel listed in the `nvidia-pyindex` python index. You may install the launcher by executing the following cell.

Please note that TAO Toolkit recommends running the TAO launcher in a virtual environment with a Python version that meets the software requirements listed below. You may follow the instructions on this [page](https://virtualenvwrapper.readthedocs.io/en/latest/install.html) to set up a Python virtual environment using the `virtualenv` and `virtualenvwrapper` packages. Once you have set up virtualenvwrapper, please set the version of Python to be used in the virtual environment with the `VIRTUALENVWRAPPER_PYTHON` variable. You may do so by running

```sh
export VIRTUALENVWRAPPER_PYTHON=/path/to/bin/python3.x
```
where `python3.x` is a Python version that satisfies the requirements below.

We recommend performing this step first and then launching the notebook from the virtual environment. In addition to installing the TAO Python package, please make sure the following software requirements are met:
* python >=3.7, <=3.10.x
* docker-ce > 19.03.5
* docker-API 1.40
* nvidia-container-toolkit > 1.3.0-1
* nvidia-container-runtime > 3.4.0-1
* nvidia-docker2 > 2.5.0-1
* nvidia-driver > 455+
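
As a quick sanity check, the Python requirement from the list above can be tested from within the notebook. This is a minimal sketch that only covers the Python bound (`>=3.7, <=3.10.x`); the helper name `python_supported` is hypothetical, and the Docker and driver requirements still need to be checked separately.

```python
import sys

# Bounds taken from the requirement list above: python >=3.7, <=3.10.x
MIN_PYTHON = (3, 7)
MAX_PYTHON = (3, 10)

def python_supported(version=None):
    """Return True if the (major, minor) version lies within the listed range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PYTHON <= major_minor <= MAX_PYTHON

print("Python", sys.version.split()[0], "supported:", python_supported())
```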

Once you have installed the prerequisites, please log in to the Docker registry nvcr.io by running the command below:

```sh
docker login nvcr.io
```

You will be prompted to enter a username and password. The username is `$oauthtoken` and the password is the API key generated from `ngc.nvidia.com`. Please follow the instructions in the [NGC setup guide](https://docs.nvidia.com/ngc/ngc-overview/index.html#generating-api-key) to generate your own API key.


In [None]:
# SKIP this step IF you have already installed the TAO launcher.
!pip3 install nvidia-pyindex
!pip3 install nvidia-tao

In [None]:
# View the versions of the TAO launcher
!tao info

## 2. Prepare dataset <a class="anchor" id="head-2"></a>

### 2.1 Download KITTI dataset

We will be using the KITTI object detection dataset for this example. To find more details, please visit http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=2d. Please download both the left color images of the object dataset from [here](http://www.cvlibs.net/download.php?file=data_object_image_2.zip) and the training labels for the object dataset from [here](http://www.cvlibs.net/download.php?file=data_object_label_2.zip), and place the zip files in `$HOST_DATA_DIR`.

Once unzipped, the dataset should have the following structure:
* training images in `$HOST_DATA_DIR/training/image_2`
* training labels in `$HOST_DATA_DIR/training/label_2`
* testing images in `$HOST_DATA_DIR/testing/image_2`

You may use this notebook with your own dataset as well. To use this example with your own dataset, please follow the same directory structure as mentioned above.
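
When using your own data, one simple check is that every image has a label file with the same file stem (for example `000015.png` and `000015.txt`), which is the pairing this notebook relies on. The sketch below assumes `.png` images and `.txt` labels; `unpaired_samples` is a hypothetical helper, not a TAO utility.

```python
import os

def unpaired_samples(image_dir, label_dir):
    """Return (images without labels, labels without images) as sorted lists of file stems."""
    image_stems = {os.path.splitext(n)[0] for n in os.listdir(image_dir) if n.endswith(".png")}
    label_stems = {os.path.splitext(n)[0] for n in os.listdir(label_dir) if n.endswith(".txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)

# Example usage (paths follow the structure described above):
# data_dir = os.environ["HOST_DATA_DIR"]
# print(unpaired_samples(os.path.join(data_dir, "training/image_2"),
#                        os.path.join(data_dir, "training/label_2")))
```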

### 2.2 Using the sample KITTI dataset

In the rest of this notebook, we will use a sample KITTI dataset that consists of 10 image/label pairs (the files matching `00001*`) taken from the original KITTI dataset.

In [None]:
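# Create the target directory first so that cp does not fail if it is missing.
!mkdir -p $HOST_DATA_DIR/images
!cp $HOST_DATA_DIR/training/image_2/00001*.png $HOST_DATA_DIR/images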
!ls -l $HOST_DATA_DIR/images

In [None]:
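# Create the target directory first so that cp does not fail if it is missing.
!mkdir -p $HOST_DATA_DIR/labels
!cp $HOST_DATA_DIR/training/label_2/00001*.txt $HOST_DATA_DIR/labels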
!ls -l $HOST_DATA_DIR/labels/

In [None]:
# NOTE: The following paths are set from the perspective of the TAO Docker.

# The data is saved here
%env DATA_DIR = /data
%env SPECS_DIR = /specs
%env RESULTS_DIR = /results

## 3. Convert KITTI data to COCO format <a class="anchor" id="head-3"></a>



Most TAO object detection models take KITTI or COCO annotations as input. Here we demonstrate how to convert KITTI data to the COCO [format](https://cocodataset.org/#format-data). The converted COCO JSON file will be used to generate pseudo-masks in the next section.

COCO format:
```
    annotation{
    "id": int, 
    "image_id": int, 
    "category_id": int, 
    "segmentation": RLE or [polygon], 
    "area": float, 
    "bbox": [x,y,width,height], 
    "iscrowd": 0 or 1,
    }

    image{
    "id": int,
    "width": int,
    "height": int,
    "file_name": str,
    "license": int,
    "flickr_url": str,
    "coco_url": str,
    "date_captured": datetime,
    }

    categories[{
    "id": int, 
    "name": str, 
    "supercategory": str,
    }]
```
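
To make the difference between the two formats concrete: a KITTI label line stores the 2D box as `left top right bottom` in fields 4 to 7, while COCO stores `[x, y, width, height]`. The sketch below shows that single step only; it is not the TAO converter, `kitti_line_to_coco_bbox` is a hypothetical helper, and the sample label line is made up for illustration.

```python
def kitti_line_to_coco_bbox(line):
    """Convert one KITTI label line to (class name, COCO [x, y, width, height] box)."""
    fields = line.split()
    name = fields[0]
    # KITTI fields 4-7: left, top, right, bottom in pixel coordinates.
    left, top, right, bottom = (float(v) for v in fields[4:8])
    return name, [left, top, right - left, bottom - top]

# Made-up label line, used only to illustrate the field layout.
sample = "Car 0.00 0 0.00 100.0 150.0 300.0 250.0 1.5 1.6 3.7 0.0 1.0 10.0 0.0"
name, bbox = kitti_line_to_coco_bbox(sample)
print(name, bbox)
```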

In [None]:
!cat $HOST_SPECS_DIR/convert.yaml

In [None]:
# Convert KITTI to COCO
!tao dataset annotations convert -e $SPECS_DIR/convert.yaml

In [None]:
# The converted json file is saved at:
!ls -l $HOST_RESULTS_DIR
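
One way to sanity-check the converted file is to count the annotations per category. The sketch below relies only on the `categories` and `annotations` fields from the COCO format shown earlier; the function names are hypothetical and the file path in the usage comment is a placeholder, since the output file name depends on your `convert.yaml`.

```python
import json
from collections import Counter

def coco_category_counts(coco):
    """Count annotations per category name in a COCO-style dict."""
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    return Counter(id_to_name[a["category_id"]] for a in coco["annotations"])

def coco_category_counts_from_file(path):
    """Load a COCO JSON file and count its annotations per category."""
    with open(path) as f:
        return coco_category_counts(json.load(f))

# Example usage with a placeholder path:
# print(coco_category_counts_from_file("/path/to/converted.json"))
```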

## 4. Generate pseudo-masks with the auto-labeler <a class="anchor" id="head-4"></a>
Here we will use a pretrained MAL model to generate pseudo-masks for the converted KITTI data.

In [None]:
!cat $HOST_SPECS_DIR/autolabel.yaml

### 4.1 Download auto-labeler pretrained model from NGC

In [None]:
# Installing NGC CLI on the local machine.
## Download and install
%env CLI=ngccli_cat_linux.zip
!mkdir -p $LOCAL_PROJECT_DIR/ngccli

# Remove any previously existing CLI installations
!rm -rf $LOCAL_PROJECT_DIR/ngccli/*
!wget "https://ngc.nvidia.com/downloads/$CLI" -P $LOCAL_PROJECT_DIR/ngccli
!unzip -u "$LOCAL_PROJECT_DIR/ngccli/$CLI" -d $LOCAL_PROJECT_DIR/ngccli/
!rm -f $LOCAL_PROJECT_DIR/ngccli/*.zip 
os.environ["PATH"]="{}/ngccli/ngc-cli:{}".format(os.getenv("LOCAL_PROJECT_DIR", ""), os.getenv("PATH", ""))

In [None]:
# List available models
!ngc registry model list nvidia/tao/mask_auto_label:*

In [None]:
# Download the model
!ngc registry model download-version nvidia/tao/mask_auto_label:trainable_v1.0 --dest $HOST_RESULTS_DIR

In [None]:
print("Check that model is downloaded into dir.")
!ls -l $HOST_RESULTS_DIR/mask_auto_label_vtrainable_v1.0

### 4.2 Generate pseudo-labels

In [None]:
print("For multi-GPU, change `gpus` in autolabel.yaml based on your machine.")
!tao dataset auto_label generate -e $SPECS_DIR/autolabel.yaml

In [None]:
print("Check the pseudo label:")
!ls -l $HOST_RESULTS_DIR

### Let's visualize the pseudo-masks

In [None]:
# install deps
!pip3 install Cython==0.29.36
!pip3 install numpy
!pip3 install matplotlib
!pip3 install pillow
!pip3 install pycocotools

In [None]:
import os
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from PIL import Image
from pycocotools.coco import COCO
%matplotlib inline

image_dir = os.path.join(os.environ["HOST_DATA_DIR"], 'images')
json_path = os.path.join(os.environ["HOST_RESULTS_DIR"], 'data_mal.json')
coco_mal = COCO(annotation_file=json_path)

for i in coco_mal.getImgIds()[-2:-1]:
    img_info = coco_mal.loadImgs(i)[0]
    img_file_name = img_info["file_name"]
    print(img_file_name)
    ann_ids = coco_mal.getAnnIds(imgIds=[i], iscrowd=None)
    anns = coco_mal.loadAnns(ann_ids)
    # raw image
    im = Image.open(os.path.join(image_dir, img_file_name))
    # plots
    fig = plt.figure(figsize = (10,10))
    ax1 = fig.add_subplot(211)
    ax1.imshow(np.asarray(im))
    ax2 = fig.add_subplot(212)
    ax2.imshow(np.asarray(im), aspect='auto')
    coco_mal.showAnns(anns, draw_bbox=True)

## 5. Apply data augmentation <a class="anchor" id="head-5"></a>

In this section, we run offline augmentation with the KITTI data. During the augmentation process, we can use the pseudo-masks generated in the previous step to refine the distorted or rotated bounding boxes.
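
The geometric idea behind mask-based refinement can be illustrated with a small example. If only the four corners of a box are rotated, the axis-aligned box that encloses them is usually much larger than the object. Rotating the mask outline instead and taking its extent gives a tighter box. This is a sketch of the idea under simplified assumptions (a polygon mask and a pure rotation), not TAO's implementation; all names are hypothetical.

```python
import math

def rotate_points(points, angle_deg, center):
    """Rotate (x, y) points by angle_deg around center."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    return [
        (cx + (x - cx) * cos_t - (y - cy) * sin_t,
         cy + (x - cx) * sin_t + (y - cy) * cos_t)
        for x, y in points
    ]

def enclosing_box(points):
    """Axis-aligned box (xmin, ymin, xmax, ymax) around a set of points."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)

# A diamond-shaped object mask and the corners of its original bounding box.
mask = [(50, 0), (100, 50), (50, 100), (0, 50)]
box_corners = [(0, 0), (100, 0), (100, 100), (0, 100)]

loose = enclosing_box(rotate_points(box_corners, 45, (50, 50)))
tight = enclosing_box(rotate_points(mask, 45, (50, 50)))
print("box from rotated corners:", loose)
print("box from rotated mask:   ", tight)
```

In this example the box derived from the mask is half as wide as the one derived from the rotated corners.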

In [None]:
!cat $HOST_SPECS_DIR/augment.yaml

### 5.a Augmentation without bounding box refinement

In [None]:
print("For multi-GPU, change `num_gpus` in augment.yaml based on your machine.")
!tao dataset augmentation generate -e $SPECS_DIR/augment.yaml data.output_dataset=$RESULTS_DIR/augmented

**The augmented KITTI data is saved in the `$HOST_RESULTS_DIR/augmented` directory.**

In [None]:
!ls -l $HOST_RESULTS_DIR/augmented/

### 5.b Augmentation with bounding box refinement

In [None]:
print("For multi-GPU, change `num_gpus` in augment.yaml based on your machine.")
!tao dataset augmentation generate -e $SPECS_DIR/augment.yaml \
    data.output_dataset=$RESULTS_DIR/refined \
    spatial_aug.rotation.refine_box.enabled=True \
    spatial_aug.rotation.refine_box.gt_cache=$RESULTS_DIR/data_mal.json

**The augmented images and refined labels are saved in the `$HOST_RESULTS_DIR/refined` directory.**

In [None]:
!ls -l $HOST_RESULTS_DIR/refined/

### 5.c Let's visualize the augmented and refined data

In [None]:
from PIL import Image, ImageDraw

def draw_bbox(image_path, label_path):
    """Return the image with the KITTI bounding boxes from label_path drawn on it."""
    img = Image.open(image_path)
    draw = ImageDraw.Draw(img)
    # Use a context manager so the label file is closed after reading.
    with open(label_path, 'r') as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            # KITTI fields 4-7 hold the box as left, top, right, bottom.
            po = [float(x) for x in line.split(' ')[4:8]]
            draw.rectangle(po, outline="Yellow")
    return img

image_dir = os.path.join(os.environ["HOST_DATA_DIR"], 'images')
label_dir = os.path.join(os.environ["HOST_DATA_DIR"], 'labels')
# Create a single figure so that the three images are stacked as subplots.
fig = plt.figure(figsize=(10, 10))
# original image
image_path = os.path.join(image_dir, '000015.png')
label_path = os.path.join(label_dir, '000015.txt')
# plot
im = draw_bbox(image_path, label_path)
ax1 = fig.add_subplot(311)
ax1.imshow(np.asarray(im))
# augmented image
image_path = os.path.join(os.environ["HOST_RESULTS_DIR"], 'augmented/images/000015.png')
label_path = os.path.join(os.environ["HOST_RESULTS_DIR"], 'augmented/labels/000015.txt')
# plot
im = draw_bbox(image_path, label_path)
ax2 = fig.add_subplot(312)
ax2.imshow(np.asarray(im))
# refined image
image_path = os.path.join(os.environ["HOST_RESULTS_DIR"], 'refined/images/000015.png')
label_path = os.path.join(os.environ["HOST_RESULTS_DIR"], 'refined/labels/000015.txt')
# plot
im = draw_bbox(image_path, label_path)
ax3 = fig.add_subplot(313)
ax3.imshow(np.asarray(im))

## 6. Run data analytics <a class="anchor" id="head-6"></a>

In this section, we run data analytics with the KITTI data. This service supports the following tasks:

- analyze - This task analyzes the input files and generates graphs for the calculated statistics. It can also generate images with bounding boxes. The graphs can be generated locally or on wandb, depending on the user input.

- validate - This task validates the input files by counting invalid coordinates and checking for imbalanced data, and suggests whether the data needs to be revised.
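
To give a feel for the kind of statistics involved, the sketch below computes two simple metrics from KITTI label lines: the number of objects per class and the mean box area. It is an illustration only and does not reproduce the analytics service; `kitti_object_stats` is a hypothetical helper.

```python
from collections import Counter

def kitti_object_stats(label_lines):
    """Return (objects per class, mean box area in pixels) for KITTI label lines."""
    counts = Counter()
    areas = []
    for line in label_lines:
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        # KITTI fields 4-7: left, top, right, bottom in pixel coordinates.
        left, top, right, bottom = (float(v) for v in fields[4:8])
        counts[fields[0]] += 1
        areas.append((right - left) * (bottom - top))
    mean_area = sum(areas) / len(areas) if areas else 0.0
    return counts, mean_area

# Example usage (hypothetical label file):
# with open("/path/to/labels/000015.txt") as f:
#     print(kitti_object_stats(f))
```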

### 6.a Provide data analyze specification 
We provide a specification file to configure the data analysis parameters, including:

- data:
    - input_format: "KITTI"
    - output_dir: "/results/analytics"
    - image_dir: "/data/images"
    - ann_path: "/data/labels"
- workers: 2
- image:
    - generate_image_with_bounding_box: False
    - sample_size: 100
- graph:
    - generate_summary_and_graph: True
    - height: 15
    - width: 15
    - show_all: False
- wandb:
    - visualize: False
    - project: "tao data analytics"
 
  
- The `image` section configures whether to generate images with bounding boxes.
- The `graph` section configures the height and width of the generated graphs. The `graph.show_all` parameter decides whether to visualize all the annotation data on the generated graphs.
- The `wandb` section configures whether to generate the graphs on wandb. By default, all the graphs and images are generated locally inside the `data.output_dir` folder.

Please refer to the TAO documentation on Data Analytics for all the configurable parameters.

In [None]:
!cat $HOST_SPECS_DIR/analytics.yaml

### 6.b Analyze data with Local Data visualization
In the cell below, the annotation files are analyzed and the insight graphs are generated locally.

In [None]:
!tao dataset analytics analyze -e $SPECS_DIR/analytics.yaml

In [None]:
print('Generated graphs:')
print('---------------------')
!ls -l $HOST_RESULTS_DIR/analytics/

The cell below generates graphs as well as images with bounding boxes locally by setting `image.generate_image_with_bounding_box=True`.

In [None]:
!tao dataset analytics analyze -e $SPECS_DIR/analytics.yaml image.generate_image_with_bounding_box=True
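
The command-line override above addresses a nested key of the spec file by its dotted path. The sketch below is a simplified illustration of how such a `dotted.key=value` string maps onto a nested spec; it is not how the TAO launcher parses overrides, and `apply_override` is a hypothetical helper that only understands the literals `True` and `False`.

```python
import copy

def apply_override(spec, override):
    """Return a copy of spec with a "dotted.key=value" override applied."""
    key, _, raw = override.partition("=")
    # Interpret the two boolean literals; leave everything else as a string.
    value = {"True": True, "False": False}.get(raw, raw)
    result = copy.deepcopy(spec)
    node = result
    *parents, leaf = key.split(".")
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value
    return result

spec = {"image": {"generate_image_with_bounding_box": False, "sample_size": 100}}
updated = apply_override(spec, "image.generate_image_with_bounding_box=True")
print(updated["image"])
```

The original `spec` is left unchanged, which mirrors the fact that the override does not modify the YAML file on disk.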

In [None]:
print('Generated images with bounding boxes:')
print('---------------------')
!ls -l $HOST_RESULTS_DIR/analytics/image_with_bounding_boxes

Let's visualize an image.

In [None]:
from PIL import Image, ImageDraw
import matplotlib.pyplot as plt
import numpy as np
img = Image.open(os.path.join(os.environ["HOST_RESULTS_DIR"], "analytics/image_with_bounding_boxes/000015.png"))

# plot 
fig = plt.figure(figsize = (10,10))
ax1 = fig.add_subplot(311)
ax1.imshow(np.asarray(img))

### 6.c Analyze data with Wandb visualization
Here we use wandb to visualize the data and uncover insights. Please refer to the wandb integration documentation to set up wandb: __[TAO Toolkit WandB Integration](https://tlt.gitlab-master-pages.nvidia.com/tlt-docs/text/mlops/wandb.html)__

Steps to generate a wandb login key:
1. Create a wandb account and log in.
2. Go to the user settings and copy the API key.

In [None]:
# Please define wandb api key to enable wandb login.
%env WANDB_API_KEY=WANDB_LOGIN_API_KEY

In [None]:
# set docker env variable.
import json
import os
mounts_file = os.path.expanduser("~/.tao_mounts.json")
tlt_configs["Envs"]= [
        {
            "variable": "WANDB_API_KEY",
            "value": os.environ["WANDB_API_KEY"]
        }
    ]
# Writing the mounts file.
with open(mounts_file, "w") as mfile:
    json.dump(tlt_configs, mfile, indent=4)

In [None]:
!cat ~/.tao_mounts.json

In the cell below, the annotation files are analyzed and the insight graphs are generated on wandb by setting `wandb.visualize=True`. The generated graphs can be found in wandb under the project name specified by the `wandb.project` parameter.

In [None]:
!tao dataset analytics analyze -e $SPECS_DIR/analytics.yaml wandb.visualize=True

The cell below generates graphs as well as annotated images with bounding boxes in wandb by setting `image.generate_image_with_bounding_box=True` and `wandb.visualize=True`.

In [None]:
!tao dataset analytics analyze -e $SPECS_DIR/analytics.yaml wandb.visualize=True image.generate_image_with_bounding_box=True

### 6.d Provide data validate specification 
We provide a specification file to configure the data validation parameters, including:
- apply_correction: True
- data:
    - output_dir: "/results/analytics"
    - input_format: "KITTI"
    - image_dir: "/data/images"
    - ann_path: "/data/labels"
- workers: 2

The `apply_correction` parameter decides whether to correct the invalid bounding box coordinates in the annotation files. The corrected files are saved under `data.output_dir`.


In [None]:
!cat $HOST_SPECS_DIR/validate.yaml

### 6.e Data Validation
The data validation task counts the inverted and out-of-bounds bounding box coordinates and suggests whether to proceed with training on the given data or to apply corrections. The validation summary also prints the count of each object tag so that you can check whether the data is imbalanced. Invalid coordinates can be corrected by setting `apply_correction=True`. The corrected files are saved in `data.output_dir`. The correction conditions for bounding box coordinates are listed below.

- Set negative coordinates to 0.

- Swap the inverted coordinates.

- If xmax is greater than image_width, then set xmax = image_width.

- If ymax is greater than image_height, then set ymax = image_height.
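
The four rules above can be sketched as a small function. This is an illustration under the assumption that the rules are applied in the order listed; it is not the TAO implementation, and `correct_box` is a hypothetical name.

```python
def correct_box(xmin, ymin, xmax, ymax, image_width, image_height):
    """Apply the four correction rules to one bounding box."""
    # Set negative coordinates to 0.
    xmin, ymin, xmax, ymax = (max(0.0, v) for v in (xmin, ymin, xmax, ymax))
    # Swap inverted coordinates.
    if xmin > xmax:
        xmin, xmax = xmax, xmin
    if ymin > ymax:
        ymin, ymax = ymax, ymin
    # Clamp the maxima to the image size.
    xmax = min(xmax, image_width)
    ymax = min(ymax, image_height)
    return xmin, ymin, xmax, ymax

# A box with a negative xmin, inverted y coordinates, and an xmax beyond the image.
print(correct_box(-5, 40, 1300, 20, image_width=1242, image_height=375))
```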

In the cell below, the resulting number of invalid coordinates is 0 because there are no inverted or out-of-bounds coordinates.

In [None]:
!tao dataset analytics validate -e $SPECS_DIR/validate.yaml

This notebook has come to an end.