### TAO remote client - Purpose built models

Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique in which a model trained on one task is retrained for use on a different task. The Train Adapt Optimize (TAO) Toolkit is an easy-to-use, Python-based AI toolkit for taking purpose-built AI models and customizing them with your own data.

![image](https://developer.nvidia.com/sites/default/files/akamai/TAO/tlt-tao-toolkit-bring-your-own-model-diagram.png)


### The workflow in a nutshell

- Create a dataset
- Upload the dataset to the service
- Run dataset convert (for specific models)
- Get a pretrained model (PTM) from NGC
- Model Actions
    - Train (Normal/AutoML)
    - Evaluate
    - Prune, retrain (for specific models)
    - Export
    - TAO-Deploy (for specific models)
    - Inference on TAO
    - Inference on TRT (for specific models)

### Table of contents

1. [Install TAO remote client](#head-1)
1. [Set the remote service base URL and Token](#head-2)
1. [Access the shared volume](#head-3)
1. [Create the datasets](#head-4)
1. [List datasets](#head-5)
1. [Provide and customize dataset convert specs for train dataset](#head-6)
1. [Run dataset convert for train dataset](#head-7)
1. [Provide and customize dataset convert specs for val dataset](#head-8)
1. [Run dataset convert for val dataset](#head-9)
1. [Create a model experiment](#head-10)
1. [Find pretrained model](#head-11)
1. [Customize model metadata](#head-12)
1. [View hyperparameters that are enabled for AutoML by default](#head-13)
1. [Set AutoML related configurations](#head-14)
1. [Provide train specs](#head-15)
1. [Run train](#head-16)
1. [View checkpoint files](#head-17)
1. [Provide evaluate specs](#head-18)
1. [Run evaluate](#head-19)
1. [Provide prune specs](#head-20) (for specific models)
1. [Run prune](#head-21) (for specific models)
1. [Provide retrain specs](#head-22) (for specific models)
1. [Run retrain](#head-23) (for specific models)
1. [Run evaluate on retrain](#head-24) (for specific models)
1. [Provide export specs](#head-25)
1. [Run export](#head-26)
1. [Provide trt engine generation specs](#head-27) (for specific models)
1. [Run TRT Engine generation using TAO-Deploy](#head-28) (for specific models)
1. [Provide TAO inference specs](#head-29)
1. [Run TAO inference](#head-30)
1. [Provide TRT inference specs](#head-31) (for specific models)
1. [Run TRT inference](#head-32) (for specific models)
1. [Delete experiment](#head-33)
1. [Delete datasets](#head-34)
1. [Unmount shared volume](#head-35)
1. [Uninstall TAO Remote Client](#head-36)

### Requirements
Please find the server requirements [here](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#)

In [None]:
import os
import glob
import subprocess
import getpass
import uuid
import json

In [None]:
namespace = 'default'

### FIXME

1. Assign a model_name in FIXME 1

    1.1 Assign model type for action_recognition/fpenet/lprnet/pose_classification in FIXME 1.1

    1.2 Assign platform for action_recognition in FIXME 1.2
    
    1.3 Assign model input type for action_recognition in FIXME 1.3
1. Assign the ip_address and port_number in FIXME 2 and FIXME 3 ([info](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_rest_api.html))
1. Assign the ngc_api_key variable in FIXME 4
1. (Optional) Enable AutoML if needed in FIXME 5
1. Choose between default and custom dataset in FIXME 6
1. Assign path of DATA_DIR in FIXME 7
1. Choose between Bayesian and Hyperband automl_algorithm in FIXME 8 (if AutoML was enabled in FIXME 5)

In [None]:
# Define model_name workspaces and other variables
# Available models (#FIXME 1):
# 1. action-recognition - https://docs.nvidia.com/tao/tao-toolkit/text/action_recognition_net.html
# 2. bpnet - https://docs.nvidia.com/tao/tao-toolkit/text/bodypose_estimation/bodyposenet.html
# 3. fpenet - https://docs.nvidia.com/tao/tao-toolkit/text/facial_landmarks_estimation/facial_landmarks_estimation.html
# 4. lprnet - https://docs.nvidia.com/tao/tao-toolkit/text/character_recognition/index.html
# 5. ml-recog - https://docs.nvidia.com/tao/tao-toolkit/text/ml_recog/index.html
# 6. ocdnet - https://docs.nvidia.com/tao/tao-toolkit/text/ocdnet/index.html
# 7. ocrnet - https://docs.nvidia.com/tao/tao-toolkit/text/ocrnet/index.html
# 8. optical-inspection - https://docs.nvidia.com/tao/tao-toolkit/text/optical_inspection/index.html
# 9. pose-classification - https://docs.nvidia.com/tao/tao-toolkit/text/pose_classification/index.html
# 10. pointpillars - https://docs.nvidia.com/tao/tao-toolkit/text/point_cloud/pointpillars.html
# 11. re-identification - https://docs.nvidia.com/tao/tao-toolkit/text/re_identification/index.html

model_name = "action-recognition" # FIXME1 (Add the model name from the above mentioned list)

In [None]:
if model_name in ("action-recognition","fpenet","lprnet","pose-classification"):
    # FIXME1.1 - model_type - string
        # action-recognition: rgb/of/joint;
        # fpenet: 10/80 (value represents the number of keypoints)
        # lprnet: us/ch (us for United States, ch for China)
        # pose-classification: kinetics/nvidia
    model_type = "rgb" # FIXME1.1 choose one of the values listed above for the selected model_name

    if model_name == "action-recognition":
        if model_type not in ("rgb","of","joint"):
            raise Exception("Choose one of rgb/of/joint for action recognition model_type")
    elif model_name == "fpenet":
        if model_type not in ("10","80"):
            raise Exception("Choose one of 10/80 for FPENET model_type")
    elif model_name == "lprnet":
        if model_type not in ("us","ch"):
            raise Exception("Choose one of us/ch for LPRNET model_type")
    elif model_name == "pose-classification":
        if model_type not in ("kinetics","nvidia"):
            raise Exception("Choose one of kinetics/nvidia for pose classification model_type")

    if model_name == "action-recognition":
        platform = "a100" # FIXME1.2 a100/xavier - valid only for model_type that is not rgb
        model_input_type = "3d" # FIXME1.3 3d/2d
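
The chain of `if`/`elif` checks above can also be written as a lookup table. The following is a minimal sketch and not part of the TAO client: `VALID_MODEL_TYPES` and `validate_model_type` are hypothetical names, and the allowed values are copied from the cell above. It raises `ValueError` instead of a bare `Exception`, which is an illustrative choice.

```python
# Allowed model_type values per model, copied from the checks above.
VALID_MODEL_TYPES = {
    "action-recognition": ("rgb", "of", "joint"),
    "fpenet": ("10", "80"),
    "lprnet": ("us", "ch"),
    "pose-classification": ("kinetics", "nvidia"),
}


def validate_model_type(model_name, model_type):
    """Return model_type if it is valid for model_name, else raise ValueError.

    Models that have no model_type setting return None.
    """
    allowed = VALID_MODEL_TYPES.get(model_name)
    if allowed is None:
        return None  # this model does not use model_type
    if model_type not in allowed:
        options = "/".join(allowed)
        raise ValueError(
            f"Choose one of {options} for {model_name} model_type, got {model_type!r}"
        )
    return model_type
```

Adding a model with its own `model_type` values would then only need a new dictionary entry.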

### Install TAO remote client <a class="anchor" id="head-1"></a>

In [None]:
# SKIP this step IF you have already installed the TAO-Client wheel.
! pip3 install nvidia-tao-client

In [None]:
# View the version of the TAO-Client
! tao-client --version

### Set the remote service base URL and Token <a class="anchor" id="head-2"></a>

In [None]:
# Define the node_addr and port number
node_addr = "<ip_address>" # FIXME2 example: 10.137.149.22
node_port = "<port_number>" # FIXME3 example: 32334
# On the host machine, the node ip_address and port number can be obtained as follows:
# ip_address: hostname -i
# port_number: kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'

ngc_api_key = "<ngc_api_key>" # FIXME4 example: (Add NGC API key)

In [None]:
automl_enabled = False # FIXME5 set to True if you want to run automl for the model chosen in the previous cell

In [None]:
%env BASE_URL=http://{node_addr}:{node_port}/{namespace}/api/v1

# Exchange NGC_API_KEY for JWT
identity = json.loads(subprocess.getoutput(f'tao-client login --ngc-api-key {ngc_api_key}'))

%env USER={identity['user_id']}
%env TOKEN={identity['token']}
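
`subprocess.getoutput` returns whatever the command printed, including error text, so a failed login makes `json.loads` fail with a message that is hard to interpret. The sketch below shows one way to build the base URL and check the login response before using it. `build_base_url` and `parse_identity` are hypothetical helper names; the `user_id` and `token` keys are the ones the cell above reads.

```python
import json


def build_base_url(node_addr, node_port, namespace="default"):
    """Compose the service base URL in the same shape as the %env line above."""
    return f"http://{node_addr}:{node_port}/{namespace}/api/v1"


def parse_identity(raw_output):
    """Parse the login output and verify it carries the keys used later on."""
    try:
        identity = json.loads(raw_output)
    except json.JSONDecodeError as err:
        # Show only the start of the output to keep the message short.
        raise RuntimeError(f"Login did not return JSON: {raw_output[:200]!r}") from err
    if not isinstance(identity, dict):
        raise RuntimeError("Login response is not a JSON object")
    missing = [key for key in ("user_id", "token") if key not in identity]
    if missing:
        raise RuntimeError(f"Login response is missing keys: {missing}")
    return identity
```

In the notebook, `parse_identity` would wrap the `subprocess.getoutput(...)` call in place of the bare `json.loads`.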

### Access the shared volume <a class="anchor" id="head-3"></a>

In [None]:
# Get PVC ID
pvc_id = subprocess.getoutput(f'kubectl get pvc tao-toolkit-api-pvc -n {namespace} -o jsonpath="{{.spec.volumeName}}"')
print(pvc_id)

In [None]:
# Get NFS server info
provisioner = json.loads(subprocess.getoutput('helm get values nfs-subdir-external-provisioner -o json'))
nfs_server = provisioner['nfs']['server']
nfs_path = provisioner['nfs']['path']
print(nfs_server, nfs_path)

In [None]:
user = getpass.getuser()
home = os.path.expanduser('~')

! echo "Password for {user}"
password = getpass.getpass()

In [None]:
# Mount shared volume 
! mkdir -p ~/shared

command = "apt-get -y install nfs-common >> /dev/null"
! echo {password} | sudo -S -k {command}

command = f"mount -t nfs {nfs_server}:{nfs_path}/{namespace}-tao-toolkit-api-pvc-{pvc_id} ~/shared"
! echo {password} | sudo -S -k {command} && echo DONE
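
Re-running the mount cell when `~/shared` is already mounted can produce a confusing error. As a hedged sketch, the helpers below compose the NFS source string the same way as the cell above and use `os.path.ismount` to tell whether a mount is still needed. `nfs_mount_source` and `needs_mount` are hypothetical names introduced for illustration.

```python
import os


def nfs_mount_source(nfs_server, nfs_path, namespace, pvc_id):
    """Build the <server>:<export path> string used by the mount command above."""
    return f"{nfs_server}:{nfs_path}/{namespace}-tao-toolkit-api-pvc-{pvc_id}"


def needs_mount(mount_point):
    """Return True when mount_point is not currently a mount point."""
    return not os.path.ismount(os.path.expanduser(mount_point))
```

The `sudo mount` line could then be guarded with `if needs_mount("~/shared"):`.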

### Create the datasets <a class="anchor" id="head-4"></a>

**Action Recognition:** We will be using the HMDB51 [dataset](https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/) for the tutorial. (We choose catch/smile for this tutorial):

**BPNET:** We will be using the `COCO dataset` for body pose estimation. The `download_coco.sh` script from `dataset_prepare` will be used to download and unzip the coco2017 dataset from [here](https://cocodataset.org/#download)

**FPENET:** We will be using the `AFW dataset`. Download it from [here](https://ibug.doc.ic.ac.uk/download/annotations/afw.zip/) and place it in $DATA_DIR.

**LPRNET**: We will be using the `OpenALPR benchmark dataset` for the tutorial. The following script will download the dataset automatically and convert it to the format used by TAO.  

**MLRecogNet:** We will be using the `Retail Product Checkout Dataset` for the tutorial. Download the dataset from [here](https://www.kaggle.com/datasets/diyer22/retail-product-checkout-dataset) and place it under $DATA_DIR/metric_learning_recognition

**OCDNET**: We will be using the ICDAR2015 dataset for the ocdnet tutorial. Register [here](https://rrc.cvc.uab.es/?ch=4&com=tasks) and download the data from Task 4.1: Text Localization. Unzip the files to $DATA_DIR.

**OCRNET**: We will be using the ICDAR15 word recognition dataset for the tutorial. For more details, please visit [here](https://rrc.cvc.uab.es/?ch=4&com=tasks). Please download the ICDAR15 word recognition train dataset and test_dataset [here](https://rrc.cvc.uab.es/?ch=4&com=downloads) to $DATA_DIR.

**Pointpillars:** We will be using the `KITTI object detection dataset` for this example. To find more details, please visit [here](http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=2d)

**Pose Classification:** We will be using either the Kinetics dataset from [Deepmind](https://deepmind.com/research/open-source/kinetics) or an NVIDIA-created dataset. For the Kinetics-based dataset, set model_type to `kinetics`; for the NVIDIA-based dataset, set model_type to `nvidia`.

**Re-Identification:** We will be using the [Market-1501](https://zheng-lab.cecs.anu.edu.au/Project/project_reid.html) dataset. Download the dataset [here](https://drive.google.com/file/d/1TwkgQcIa_EgRjVMPSbyEKtcfljqURrzi/view?usp=sharing) and extract it.

In [None]:
dataset_to_be_used = "default" #FIXME6 #default/custom; default for the dataset used in this tutorial notebook; custom for a different dataset
DATA_DIR = os.path.abspath(model_name) # FIXME7 (set absolute path of the data_directory)
os.environ['DATA_DIR']= DATA_DIR
!mkdir -p $DATA_DIR

In [None]:
if dataset_to_be_used == "default":
    if model_name == "action-recognition":
        !sudo apt-get update -y && sudo apt-get install unrar-free -y
        !wget -P $DATA_DIR http://serre-lab.clps.brown.edu/wp-content/uploads/2013/10/hmdb51_org.rar
        !mkdir -p $DATA_DIR/videos && unrar x -o+ $DATA_DIR/hmdb51_org.rar $DATA_DIR/videos
        !mkdir -p $DATA_DIR/raw_data
        !unrar x -o+ $DATA_DIR/videos/catch.rar $DATA_DIR/raw_data
        !unrar x -o+ $DATA_DIR/videos/smile.rar $DATA_DIR/raw_data
    elif model_name == "bpnet":
        !bash dataset_prepare/coco/download_coco.sh $DATA_DIR
        # Remove existing data
        !rm -rf $DATA_DIR/train2017/images
        !rm -rf $DATA_DIR/val2017/images
        # Rearrange data in the required format
        !mv $DATA_DIR/raw-data/* $DATA_DIR/
        !cp dataset_prepare/bpnet/* $DATA_DIR/
    elif model_name == "fpenet":
        !if [ ! -f $DATA_DIR/afw.zip ]; then echo 'afw zip file not found, please download.'; else echo 'Found afw zip file.';fi
        !mkdir -p $DATA_DIR/data
        !unzip -uq $DATA_DIR/afw.zip -d $DATA_DIR/data/afw
        !cp dataset_prepare/fpenet/data.json $DATA_DIR/
    elif model_name == "lprnet":
        !python3 -m pip install --upgrade pip
        !python3 -m pip install "opencv-python>=3.4.0.12,<=4.5.5.64"
        !bash dataset_prepare/lprnet/download_and_prepare_data.sh $DATA_DIR
    elif model_name == "ml-recog":
        !if [ ! -f $DATA_DIR/metric_learning_recognition/retail-product-checkout-dataset.zip ]; then echo 'retail-product-checkout-dataset.zip file not found, please download.'; else echo 'Found retail product dataset zip file.';fi
        !unzip -uq $DATA_DIR/metric_learning_recognition/retail-product-checkout-dataset.zip -d $DATA_DIR/metric_learning_recognition
    elif model_name == "ocdnet":
        !if [ ! -d $DATA_DIR/train/img ]; then echo 'Train image folder not found, please download.'; else echo 'Found Train image folder.';fi
        !if [ ! -d $DATA_DIR/train/gt ]; then echo 'Train ground truth folder not found, please download.'; else echo 'Found Train ground truth folder.';fi
        !if [ ! -d $DATA_DIR/test/img ]; then echo 'Val image folder not found, please download.'; else echo 'Found Val image folder.';fi
        !if [ ! -d $DATA_DIR/test/gt ]; then echo 'Val ground truth folder not found, please download.'; else echo 'Found Val ground truth folder.';fi
    elif model_name == "ocrnet":
        !mkdir -p $DATA_DIR/train && rm -rf $DATA_DIR/train/*
        !mkdir -p $DATA_DIR/test && rm -rf $DATA_DIR/test/*
        !if [ ! -f $DATA_DIR/ch4_test_word_images_gt.zip ]; then echo 'Test Image zip file not found, please download.'; else echo 'Found Test Image zip file.';fi
        !if [ ! -f $DATA_DIR/Challenge4_Test_Task3_GT.txt ]; then echo 'Test Label file not found, please download.'; else echo 'Found Test Labels file.';fi
        !if [ ! -f $DATA_DIR/ch4_training_word_images_gt.zip ]; then echo 'Train zip file not found, please download.'; else echo 'Found Train zip file.';fi
        !unzip -u $DATA_DIR/ch4_test_word_images_gt.zip -d $DATA_DIR/test
        !cp $DATA_DIR/Challenge4_Test_Task3_GT.txt $DATA_DIR/test
        !unzip -u $DATA_DIR/ch4_training_word_images_gt.zip -d $DATA_DIR/train    
    elif model_name == "optical-inspection":
        !if [ ! -d $DATA_DIR/train/images ]; then echo 'Train image folder not found'; else echo 'Found train image folder';fi
        !if [ ! -f $DATA_DIR/train/dataset.csv ]; then echo 'Train label file not found'; else echo 'Found train label file';fi
        !if [ ! -d $DATA_DIR/val/images ]; then echo 'Val image folder not found'; else echo 'Found val image folder';fi
        !if [ ! -f $DATA_DIR/val/dataset.csv ]; then echo 'Val label file not found'; else echo 'Found val label file';fi
        !if [ ! -d $DATA_DIR/test/images ]; then echo 'Test image folder not found'; else echo 'Found test image folder';fi
        !if [ ! -f $DATA_DIR/test/dataset.csv ]; then echo 'Test label file not found'; else echo 'Found test label file';fi
    elif model_name == "pointpillars":
        !if [ ! -f $DATA_DIR/data_object_image_2.zip ]; then echo 'Image zip file not found, please download.'; else echo 'Found Image zip file.';fi
        !if [ ! -f $DATA_DIR/data_object_label_2.zip ]; then echo 'Label zip file not found, please download.'; else echo 'Found Labels zip file.';fi
        !if [ ! -f $DATA_DIR/data_object_velodyne.zip ]; then echo 'Velodyne zip file not found, please download.'; else echo 'Found Velodyne zip file.';fi
        !if [ ! -f $DATA_DIR/data_object_calib.zip ]; then echo 'Calib zip file not found, please download.'; else echo 'Found Calib zip file.';fi
        # Unpack the archives
        !unzip -u $DATA_DIR/data_object_image_2.zip -d $DATA_DIR
        !unzip -u $DATA_DIR/data_object_label_2.zip -d $DATA_DIR
        !unzip -u $DATA_DIR/data_object_velodyne.zip -d $DATA_DIR
        !unzip -u $DATA_DIR/data_object_calib.zip -d $DATA_DIR
    elif model_name == "pose-classification":
        !pip3 install -U gdown
        if model_type == "kinetics":
            !gdown https://drive.google.com/uc?id=1dmzCRQsFXJ18BlXj1G9sbDnsclXIdDdR -O $DATA_DIR/st-gcn-processed-data.zip
            !unzip $DATA_DIR/st-gcn-processed-data.zip -d $DATA_DIR
            !mv $DATA_DIR/data/Kinetics/kinetics-skeleton $DATA_DIR/kinetics
            !rm -r $DATA_DIR/data
            !rm $DATA_DIR/st-gcn-processed-data.zip
        elif model_type == "nvidia":
            !gdown https://drive.google.com/uc?id=1GhSt53-7MlFfauEZ2YkuzOaZVNIGo_c- -O $DATA_DIR/data_3dbp_nvidia.zip
            !mkdir -p $DATA_DIR/nvidia
            !unzip $DATA_DIR/data_3dbp_nvidia.zip -d $DATA_DIR/nvidia
            !rm $DATA_DIR/data_3dbp_nvidia.zip
    elif model_name == "re-identification":
        !pip3 install -U gdown
        !gdown https://drive.google.com/uc?id=0B8-rUzbwVRk0c054eEozWG9COHM -O $DATA_DIR/market1501.zip
        !unzip -u $DATA_DIR/market1501.zip -d $DATA_DIR
        !rm -rf $DATA_DIR/market1501
        !mv $DATA_DIR/Market-1501-v15.09.15 $DATA_DIR/market1501
        !rm $DATA_DIR/market1501.zip
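
The shell checks in the cell above only print a message when a manually downloaded file is missing, and the following commands still run. A small Python check can stop the cell early instead. This is a sketch under assumptions: `missing_files` and `require_files` are hypothetical helpers, and `KITTI_ARCHIVES` simply lists the archive names the pointpillars branch looks for.

```python
import os

# Archive names taken from the pointpillars checks above.
KITTI_ARCHIVES = (
    "data_object_image_2.zip",
    "data_object_label_2.zip",
    "data_object_velodyne.zip",
    "data_object_calib.zip",
)


def missing_files(data_dir, filenames):
    """Return the names in filenames that are not regular files under data_dir."""
    return [name for name in filenames
            if not os.path.isfile(os.path.join(data_dir, name))]


def require_files(data_dir, filenames):
    """Raise FileNotFoundError naming every missing file."""
    missing = missing_files(data_dir, filenames)
    if missing:
        joined = ", ".join(missing)
        raise FileNotFoundError(f"Missing in {data_dir}: {joined}")
```

For example, calling `require_files(DATA_DIR, KITTI_ARCHIVES)` before the `unzip` commands would fail with one message listing all absent archives.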

In [None]:
if model_name == "lprnet":
    ds_type = "character_recognition"
    ds_format = "lprnet"
else:
    ds_type = model_name.replace("-","_")
    ds_format = "default"

In [None]:
train_dataset_id = subprocess.getoutput(f"tao-client {model_name} dataset-create --dataset_type {ds_type} --dataset_format {ds_format}")
print(train_dataset_id)
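
Because `subprocess.getoutput` also captures error text, `train_dataset_id` may contain an error message rather than an ID. The sketch below assumes the service returns dataset IDs in UUID format, which should be verified against your deployment. `is_uuid` is a hypothetical helper built on the standard `uuid` module.

```python
import uuid


def is_uuid(text):
    """Return True if text (ignoring surrounding whitespace) parses as a UUID."""
    try:
        uuid.UUID(text.strip())
    except ValueError:
        return False
    return True
```

A guard such as `assert is_uuid(train_dataset_id), train_dataset_id` right after the create call would surface a failed request before any data is copied.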

In [None]:
if dataset_to_be_used == "default":
    USER_EXPERIMENT_DIR = os.path.join("/shared/users",identity['user_id'],"datasets",train_dataset_id)
    if model_name == "action-recognition":
        !python3 -m pip install opencv-python numpy
        # For rgb action recognition
        !if [ -d tao_toolkit_recipes ]; then rm -rf tao_toolkit_recipes; fi
        !git clone https://github.com/NVIDIA-AI-IOT/tao_toolkit_recipes
        !cd tao_toolkit_recipes/tao_action_recognition/data_generation/ && bash ./preprocess_HMDB_RGB.sh $DATA_DIR/raw_data $DATA_DIR/processed_data 

        # For optical flow, comment out the 3 lines above and uncomment the lines below (Note: generating optical flow requires a Turing, Ampere, or newer GPU.)
        #!echo <passwd> | sudo -S apt install -y libfreeimage-dev
        #!cp dataset_prepare/action_recognition/AppOFCuda tao_toolkit_recipes/tao_action_recognition/data_generation/
        #!cd tao_toolkit_recipes/tao_action_recognition/data_generation/ && bash ./preprocess_HMDB.sh $DATA_DIR/raw_data $DATA_DIR/processed_data

        # download the split files and unrar
        !wget -P $DATA_DIR http://serre-lab.clps.brown.edu/wp-content/uploads/2013/10/test_train_splits.rar
        !mkdir -p $DATA_DIR/splits && unrar x -o+ $DATA_DIR/test_train_splits.rar $DATA_DIR/splits
        # run split_HMDB to generate training split
        !if [ -d $DATA_DIR/train ]; then rm -rf $DATA_DIR/train $DATA_DIR/test; fi
        !cd tao_toolkit_recipes/tao_action_recognition/data_generation/ && python3 ./split_dataset.py $DATA_DIR/processed_data $DATA_DIR/splits/testTrainMulti_7030_splits $DATA_DIR/train  $DATA_DIR/test

    elif model_name == "fpenet":
        !pip3 install numpy opencv-python
        if model_type == "80":
            output_json_path = os.path.join(os.environ['DATA_DIR'], 'data/afw/afw.json')
        elif model_type == "10":
            output_json_path = os.path.join(os.environ['DATA_DIR'], 'data/afw_10/afw_10.json')
        !python3 dataset_prepare/fpenet/data_utils.py --afw_data_path $DATA_DIR/data/afw --output_json_path $output_json_path --afw_image_save_path $DATA_DIR/data/afw --num_key_points $model_type --container_root_path $USER_EXPERIMENT_DIR

    elif model_name == "lprnet":
        character_file_link = "https://api.ngc.nvidia.com/v2/models/nvidia/tao/lprnet/versions/trainable_v1.0/files/{}_lp_characters.txt".format(model_type)
        !wget -q -O $DATA_DIR/train/characters.txt $character_file_link
        !cp $DATA_DIR/train/characters.txt $DATA_DIR/val/characters.txt

    elif model_name == "ocrnet":
        !python3 -m pip install tqdm
        orig_train_gt_file=os.path.join(os.getenv("DATA_DIR"), "train", "gt.txt")
        processed_train_gt_file=os.path.join(os.getenv("DATA_DIR"), "train", "gt_new.txt")
        orig_test_gt_file=os.path.join(os.getenv("DATA_DIR"), "test", "Challenge4_Test_Task3_GT.txt")
        processed_test_gt_file=os.path.join(os.getenv("DATA_DIR"), "test", "gt_new.txt")
        !python3 dataset_prepare/ocrnet/preprocess_label.py $orig_train_gt_file $processed_train_gt_file
        !python3 dataset_prepare/ocrnet/preprocess_label.py $orig_test_gt_file $processed_test_gt_file
    
    elif model_name == "pointpillars":
        !python3 -m pip install scikit-image numpy
        !mkdir -p $DATA_DIR/train/lidar $DATA_DIR/train/label $DATA_DIR/val/lidar $DATA_DIR/val/label

        !python3 dataset_prepare/pointpillars/gen_lidar_points.py -p $DATA_DIR/training/velodyne \
                                               -c $DATA_DIR/training/calib    \
                                               -i $DATA_DIR/training/image_2  \
                                               -o $DATA_DIR/train/lidar
        # Convert labels from Camera coordinate system to LIDAR coordinate system, etc
        !python3 dataset_prepare/pointpillars/gen_lidar_labels.py -l $DATA_DIR/training/label_2 \
                                               -c $DATA_DIR/training/calib \
                                               -o $DATA_DIR/train/label
        # Drop DontCare class
        !python3 dataset_prepare/pointpillars/drop_class.py $DATA_DIR/train/label DontCare
        # train/val split
        # Change the val set id's if you need a different set of validation images
        !python3 dataset_prepare/pointpillars/kitti_split.py dataset_prepare/pointpillars/val.txt \
                                          $DATA_DIR/train/lidar \
                                          $DATA_DIR/train/label \
                                          $DATA_DIR/val/lidar \
                                          $DATA_DIR/val/label

    elif model_name == "pose-classification" and model_type == "kinetics":
        !pip3 install numpy
        # select actions
        !python3 dataset_prepare/pose_classification/select_subset_actions.py

    elif model_name == "re-identification":
        #100 is the number of samples to be present in the subset data - you can choose any number <= total samples in the dataset
        !python3 dataset_prepare/re_identification/obtain_subset_data.py 100
    
    elif model_name == "ml-recog":
        !sudo apt-get install gcc -y
        !python3 -m pip install opencv-python numpy pycocotools tqdm
        !python3 dataset_prepare/metric_learning_recognition/process_retail_product_checkout_dataset.py

In [None]:
if model_name == "action-recognition":
    ! rsync -ah --info=progress2 $DATA_DIR/train ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
    ! rsync -ah --info=progress2 $DATA_DIR/test ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
elif model_name == "bpnet":
    ! rsync -ah --info=progress2 $DATA_DIR/train2017 ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
    ! rsync -ah --info=progress2 $DATA_DIR/val2017 ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
    ! rsync -ah --info=progress2 $DATA_DIR/annotations ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
    ! rsync -ah --info=progress2 $DATA_DIR/bpnet_18joints.json ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
    ! rsync -ah --info=progress2 $DATA_DIR/coco_spec.json ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
    !chmod -R 777 ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/coco_spec.json
    ! rsync -ah --info=progress2 $DATA_DIR/infer_spec.yaml ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
elif model_name == "fpenet":
    ! rsync -ah --info=progress2 $DATA_DIR/data ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
    !chmod -R 777 ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/data
    ! rsync -ah --info=progress2 $DATA_DIR/data.json ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
    !chmod -R 777 ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/data.json
elif model_name == "lprnet":
    ! rsync -ah --info=progress2 {DATA_DIR}/train/image ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
    ! rsync -ah --info=progress2 {DATA_DIR}/train/label ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
    ! rsync -ah --info=progress2 {DATA_DIR}/train/characters.txt ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
elif model_name == "ml-recog":
    !mkdir -p ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/metric_learning_recognition/
    ! rsync -ah --info=progress2 {DATA_DIR}/metric_learning_recognition/retail-product-checkout-dataset_classification_demo ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/metric_learning_recognition/
elif model_name == "ocdnet":
    ! rsync -ah --info=progress2 {DATA_DIR}/train ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
    !chmod -R 777 ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/train
elif model_name == "ocrnet":
    ! rsync -ah --info=progress2 {DATA_DIR}/train ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
    !chmod -R 777 ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/train
    ! rsync -ah --info=progress2 {DATA_DIR}/character_list ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
elif model_name == "optical-inspection":
    ! rsync -ah --info=progress2 {DATA_DIR}/train/* ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
elif model_name == "pointpillars":
    ! rsync -ah --info=progress2 $DATA_DIR/train ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
    ! rsync -ah --info=progress2 $DATA_DIR/val ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
elif model_name == "pose-classification":
    ! rsync -ah --info=progress2 $DATA_DIR/{model_type} ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
elif model_name == "re-identification":
    ! rsync -ah --info=progress2 $DATA_DIR/market1501/sample_train ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
    ! rsync -ah --info=progress2 $DATA_DIR/market1501/sample_test ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
    ! rsync -ah --info=progress2 $DATA_DIR/market1501/sample_query ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/

! echo DONE

In [None]:
if model_name in ("lprnet","ocdnet","ocrnet","optical-inspection"):
    eval_dataset_id = subprocess.getoutput(f"tao-client {model_name} dataset-create --dataset_type {ds_type} --dataset_format {ds_format}")
    print(eval_dataset_id)

In [None]:
if model_name == "lprnet":
    ! rsync -ah --info=progress2 {DATA_DIR}/val/image ~/shared/users/{os.environ['USER']}/datasets/{eval_dataset_id}/
    ! rsync -ah --info=progress2 {DATA_DIR}/val/label ~/shared/users/{os.environ['USER']}/datasets/{eval_dataset_id}/
    ! rsync -ah --info=progress2 {DATA_DIR}/val/characters.txt ~/shared/users/{os.environ['USER']}/datasets/{eval_dataset_id}/
elif model_name == "ocdnet":
    ! rsync -ah --info=progress2 {DATA_DIR}/test ~/shared/users/{os.environ['USER']}/datasets/{eval_dataset_id}/
    !chmod -R 777 ~/shared/users/{os.environ['USER']}/datasets/{eval_dataset_id}/test
elif model_name == "ocrnet":
    ! rsync -ah --info=progress2 {DATA_DIR}/test ~/shared/users/{os.environ['USER']}/datasets/{eval_dataset_id}/
    !chmod -R 777 ~/shared/users/{os.environ['USER']}/datasets/{eval_dataset_id}/test
    ! rsync -ah --info=progress2 {DATA_DIR}/character_list ~/shared/users/{os.environ['USER']}/datasets/{eval_dataset_id}/
elif model_name == "optical-inspection":
    ! rsync -ah --info=progress2 {DATA_DIR}/val/* ~/shared/users/{os.environ['USER']}/datasets/{eval_dataset_id}/
! echo DONE

In [None]:
if model_name in ("lprnet", "optical-inspection"):
    if model_name == "lprnet":
        ds_type = "character_recognition"
        ds_format = "raw"
    else:
        ds_type = model_name.replace("-","_")
        ds_format = "default"
    infer_dataset_id = subprocess.getoutput(f"tao-client {model_name} dataset-create --dataset_type {ds_type} --dataset_format {ds_format}")
    print(infer_dataset_id)
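
The dataset type and format are chosen in two separate cells that differ only for lprnet inference data. A single helper can express both cases. This is an illustrative sketch: `dataset_type_and_format` and its `purpose` parameter are hypothetical and not part of the TAO client.

```python
def dataset_type_and_format(model_name, purpose="train"):
    """Return (ds_type, ds_format) following the rules used in the cells above.

    purpose is "train", "eval" or "infer"; only lprnet treats "infer" differently.
    """
    if model_name == "lprnet":
        ds_format = "raw" if purpose == "infer" else "lprnet"
        return "character_recognition", ds_format
    # All other models use their own name with underscores and the default format.
    return model_name.replace("-", "_"), "default"
```

With it, each `dataset-create` call could unpack `ds_type, ds_format = dataset_type_and_format(model_name, "infer")` instead of reassigning the globals.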

In [None]:
if model_name == "lprnet":
    ! rsync -ah --info=progress2 {DATA_DIR}/val/image ~/shared/users/{os.environ['USER']}/datasets/{infer_dataset_id}/
elif model_name == "optical-inspection":
    ! rsync -ah --info=progress2 {DATA_DIR}/test/* ~/shared/users/{os.environ['USER']}/datasets/{infer_dataset_id}/
! echo DONE

### List datasets <a class="anchor" id="head-5"></a>

In [None]:
pattern = os.path.join(home, 'shared', 'users', os.environ['USER'], 'datasets', '*', 'metadata.json')

datasets = []
for metadata_path in glob.glob(pattern):
    with open(metadata_path, 'r') as metadata_file:
        datasets.append(json.load(metadata_file))

print(json.dumps(datasets, indent=2))
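
If one `metadata.json` is unreadable or only partly written, the loop above stops at the first bad file. The variant below is a sketch that collects the readable entries and reports the paths it skipped. `load_dataset_metadata` is a hypothetical helper; the directory layout is the one used by the glob pattern above.

```python
import glob
import json
import os


def load_dataset_metadata(datasets_root):
    """Load every <datasets_root>/*/metadata.json.

    Returns (datasets, skipped): the parsed metadata dictionaries and the
    paths that could not be read or parsed.
    """
    pattern = os.path.join(datasets_root, "*", "metadata.json")
    datasets, skipped = [], []
    for metadata_path in sorted(glob.glob(pattern)):
        try:
            with open(metadata_path, "r") as metadata_file:
                datasets.append(json.load(metadata_file))
        except (OSError, json.JSONDecodeError):
            skipped.append(metadata_path)
    return datasets, skipped
```

In the notebook, `datasets_root` would be `os.path.join(home, 'shared', 'users', os.environ['USER'], 'datasets')`.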

### Provide and customize dataset convert specs for train dataset <a class="anchor" id="head-6"></a>

In [None]:
# Choose dataset convert action
if model_name in ("bpnet", "fpenet", "ocrnet", "pointpillars"):
    convert_action = "dataset_convert"

In [None]:
# Default train dataset specs
if model_name in ("bpnet", "fpenet", "ocrnet", "pointpillars"):
    ! tao-client {model_name} dataset-convert-defaults --id {train_dataset_id} --action {convert_action} | tee ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/specs/{convert_action}.json

In [None]:
# Customize train dataset specs
if model_name in ("bpnet", "fpenet", "ocrnet", "pointpillars"):
    specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'datasets', train_dataset_id, 'specs', f'{convert_action}.json')

    with open(specs_path , "r") as specs_file:
        specs = json.load(specs_file)

    if model_name == "bpnet":
        specs["mode"] = "train"
    elif model_name == "fpenet":
        specs["num_keypoints"] = int(model_type)
    
    with open(specs_path, "w") as specs_file:
        json.dump(specs, specs_file, indent=2)

    print(json.dumps(specs, indent=2))
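
The read, modify, write pattern above recurs for every spec file in this notebook. A minimal sketch of the same pattern as a reusable function follows. `update_json_spec` is a hypothetical name, and it only handles top-level keys such as `mode` and `num_keypoints`.

```python
import json


def update_json_spec(specs_path, overrides):
    """Load a JSON spec file, apply top-level overrides, write it back.

    Returns the updated specs dictionary.
    """
    with open(specs_path, "r") as specs_file:
        specs = json.load(specs_file)
    specs.update(overrides)
    with open(specs_path, "w") as specs_file:
        json.dump(specs, specs_file, indent=2)
    return specs
```

For bpnet, the customization above would become `update_json_spec(specs_path, {"mode": "train"})`.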

### Run dataset convert for train dataset <a class="anchor" id="head-7"></a>

In [None]:
if model_name in ("bpnet", "fpenet", "ocrnet", "pointpillars"):
    train_convert_job_id = subprocess.getoutput(f"tao-client {model_name} dataset-convert --id {train_dataset_id}  --action {convert_action} ")
    print(train_convert_job_id)

In [None]:
def my_tail(logs_dir, log_file):
    %env LOG_FILE={logs_dir}/{log_file}
    ! mkdir -p {logs_dir}
    ! [ ! -f "$LOG_FILE" ] && touch $LOG_FILE && chmod 666 $LOG_FILE
    ! tail -f -n +1 $LOG_FILE | while read LINE; do echo "$LINE"; [[ "$LINE" == "EOF" ]] && pkill -P $$ tail; done
    
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
if model_name in ("bpnet", "fpenet", "ocrnet", "pointpillars"):
    logs_dir = os.path.join(home, 'shared', 'users', os.environ['USER'], 'datasets', train_dataset_id, 'logs')
    log_file = f"{train_convert_job_id}.txt"

    my_tail(logs_dir, log_file)
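
`my_tail` relies on shell utilities (`tail`, `pkill`) and blocks until an `EOF` line appears. For environments where those are unavailable, a pure-Python follower is sketched below. It is an assumption-laden alternative, not the notebook's method: `follow_until_eof`, `poll_interval` and `timeout` are hypothetical, and unlike `my_tail` it waits for the log file to appear instead of creating it.

```python
import time


def follow_until_eof(log_path, poll_interval=1.0, timeout=None):
    """Yield log lines as they are written, stopping at a line equal to 'EOF'.

    Raises TimeoutError if timeout (seconds) elapses before 'EOF' is seen.
    """
    deadline = None if timeout is None else time.monotonic() + timeout

    def check_deadline():
        if deadline is not None and time.monotonic() > deadline:
            raise TimeoutError(f"No EOF marker in {log_path} before timeout")

    # Wait until the backend has created the log file.
    while True:
        try:
            log_file = open(log_path, "r")
            break
        except FileNotFoundError:
            check_deadline()
            time.sleep(poll_interval)

    with log_file:
        pending = ""  # holds a partially written line
        while True:
            chunk = log_file.readline()
            if chunk:
                pending += chunk
                if not pending.endswith("\n"):
                    continue  # line is incomplete; read more
                line, pending = pending.rstrip("\n"), ""
                if line == "EOF":
                    return
                yield line
            else:
                if pending.strip() == "EOF":
                    return  # marker written without a trailing newline
                check_deadline()
                time.sleep(poll_interval)
```

A typical use would be `for line in follow_until_eof(os.path.join(logs_dir, log_file)): print(line)`.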

### Provide and customize dataset convert specs for val dataset <a class="anchor" id="head-8"></a>

In [None]:
# Default val dataset specs
if model_name == "bpnet":
    ! tao-client {model_name} dataset-convert-defaults --id {train_dataset_id} --action {convert_action} | tee ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/specs/{convert_action}.json
elif model_name == "ocrnet":
    ! tao-client {model_name} dataset-convert-defaults --id {eval_dataset_id} --action {convert_action} | tee ~/shared/users/{os.environ['USER']}/datasets/{eval_dataset_id}/specs/{convert_action}.json

In [None]:
# Customize val dataset specs
if model_name in ("bpnet", "ocrnet"):
    if model_name == "bpnet":
        specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'datasets', train_dataset_id, 'specs', f'{convert_action}.json')
    elif model_name == "ocrnet":
        specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'datasets', eval_dataset_id, 'specs', f'{convert_action}.json')

    with open(specs_path , "r") as specs_file:
        specs = json.load(specs_file)

    # Apply changes to the specs dictionary if necessary
    if model_name == "bpnet":
        specs["mode"] = "test"

    with open(specs_path, "w") as specs_file:
        json.dump(specs, specs_file, indent=2)

    print(json.dumps(specs, indent=2))
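
The load, modify, write-back pattern in the cell above recurs for every spec file in this notebook. One possible way to factor it out is sketched below. `update_json_spec` is a hypothetical helper, and the demo file contents are made up.

```python
import json
import os
import tempfile


def update_json_spec(specs_path, updates):
    """Load a JSON spec, apply top-level `updates`, write it back, and return it."""
    with open(specs_path, "r") as specs_file:
        specs = json.load(specs_file)
    specs.update(updates)
    with open(specs_path, "w") as specs_file:
        json.dump(specs, specs_file, indent=2)
    return specs


# Demo on a temporary file standing in for a dataset-convert spec.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "convert.json")
    with open(demo_path, "w") as f:
        json.dump({"mode": "train", "other": 1}, f)
    print(json.dumps(update_json_spec(demo_path, {"mode": "test"}), indent=2))
```

This sketch only handles top-level keys; nested spec sections would still need to be edited on the returned dictionary and written back.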

### Run dataset convert for val dataset <a class="anchor" id="head-9"></a>

In [None]:
if model_name in ("bpnet", "ocrnet"):
    if model_name == "bpnet":
        val_convert_job_id = subprocess.getoutput(f"tao-client {model_name} dataset-convert --id {train_dataset_id}  --action {convert_action} ")
    elif model_name == "ocrnet":
        val_convert_job_id = subprocess.getoutput(f"tao-client {model_name} dataset-convert --id {eval_dataset_id}  --action {convert_action} ")
    print(val_convert_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
if model_name in ("bpnet", "ocrnet"):
    if model_name == "bpnet":
        logs_dir = os.path.join(home, 'shared', 'users', os.environ['USER'], 'datasets', train_dataset_id, 'logs')
    elif model_name == "ocrnet":
        logs_dir = os.path.join(home, 'shared', 'users', os.environ['USER'], 'datasets', eval_dataset_id, 'logs')
    log_file = f"{val_convert_job_id}.txt"

    my_tail(logs_dir, log_file)

### Create a model experiment <a class="anchor" id="head-10"></a>

In [None]:
if model_name in ("action-recognition", "pose-classification", "ml-recog", "ocrnet", "ocdnet", "optical-inspection", "re-identification"):
    encode_key = "nvidia_tao"
elif model_name == "pointpillars":
    encode_key = "tlt_encode"
else:
    encode_key = "nvidia_tlt"

network_arch = model_name.replace("-","_")
model_id = subprocess.getoutput(f"tao-client {model_name} model-create --network_arch {network_arch} --encryption_key {encode_key} ")
print(model_id)
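
The key selection and the name conversion in the cell above can be expressed as two small functions, which makes the mapping easy to check in isolation. This is a sketch only: the function names are assumptions, while the key values and model lists are copied from the cell.

```python
# Models that use the "nvidia_tao" key in the cell above.
TAO_KEY_MODELS = frozenset({
    "action-recognition", "pose-classification", "ml-recog", "ocrnet",
    "ocdnet", "optical-inspection", "re-identification",
})


def select_encode_key(model_name):
    """Return the encryption key the notebook uses for `model_name`."""
    if model_name in TAO_KEY_MODELS:
        return "nvidia_tao"
    if model_name == "pointpillars":
        return "tlt_encode"
    return "nvidia_tlt"


def to_network_arch(model_name):
    """CLI model names use hyphens; network_arch values use underscores."""
    return model_name.replace("-", "_")


for name in ("ocrnet", "pointpillars", "bpnet", "action-recognition"):
    print(name, select_encode_key(name), to_network_arch(name))
```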

### Find pretrained model <a class="anchor" id="head-11"></a>

In [None]:
# List all pretrained models for the chosen network architecture
pattern = os.path.join(home, 'shared', 'users', '*', 'models', '*', 'metadata.json')

for ptm_metadata_path in glob.glob(pattern):
  with open(ptm_metadata_path, 'r') as metadata_file:
    ptm_metadata = json.load(metadata_file)
    metadata_network_arch = ptm_metadata.get("network_arch")
    if metadata_network_arch == network_arch:
      if "encryption_key" not in ptm_metadata.keys():
        print(f'PTM Name: {ptm_metadata["name"]}; PTM version: {ptm_metadata["version"]}; NGC PATH: {ptm_metadata["ngc_path"]}; Additional info: {ptm_metadata["additional_id_info"]}')

In [None]:
# Map each purpose-built model to its default pretrained model (PTM) version.
# To use a different PTM backbone, edit this map using the output of the previous cell.
# Changing the backbone also requires matching changes to the default specs used for train, evaluate, etc.
# For example, if you switch the PTM to resnet34, manually set the num_layers config key (if it exists) to 34.
pretrained_map = {"action_recognition":"actionrecognitionnet:trainable_v1.0",
                  "bpnet" : "bodyposenet:trainable_v1.0",
                  "fpenet" : "fpenet:trainable_v1.0",
                  "lprnet": "lprnet:trainable_v1.0",
                  "ml_recog": "retail_object_recognition:trainable_v1.0",
                  "ocdnet": "ocdnet:trainable_resnet18_v1.0",
                  "ocrnet": "ocrnet:trainable_v1.0",
                  "optical_inspection": "optical_inspection:trainable_v1.0",
                  "pointpillars":"pointpillarnet:trainable_v1.0",
                  "pose_classification":"poseclassificationnet:trainable_v1.0",
                  "re_identification":"reidentificationnet:trainable_v1.1"}

if model_name == "action-recognition":
    if model_type == "of":
        pretrained_map["action_recognition"] = "actionrecognitionnet:trainable_v2.0"
    elif model_type == "joint":
        pretrained_map["action_recognition"] = "actionrecognitionnet:trainable_v1.0,actionrecognitionnet:trainable_v2.0"

no_ptm_models = set()

In [None]:
if network_arch not in no_ptm_models:
    pattern = os.path.join(home, 'shared', 'users', '*', 'models', '*', 'metadata.json')

    ptm_model_names = pretrained_map[network_arch].split(",")
    ptm = []

    for ptm_model_name in ptm_model_names:
        ptm_id = None
        for ptm_metadata_path in glob.glob(pattern):
          with open(ptm_metadata_path, 'r') as metadata_file:
            ptm_metadata = json.load(metadata_file)
            ngc_path = ptm_metadata.get("ngc_path") or ""  # Avoid AttributeError when the key is missing
            metadata_network_arch = ptm_metadata.get("network_arch")
            if metadata_network_arch == network_arch and ngc_path.endswith(ptm_model_name):
                additional_id_info = []
                if ptm_metadata.get("additional_id_info"):
                    additional_id_info = ptm_metadata["additional_id_info"].split(",")
                if (len(additional_id_info) == 0) or \
                    (model_name == "lprnet" and len(additional_id_info) == 1 and additional_id_info[0] == model_type) or \
                    (model_name == "action-recognition" and len(additional_id_info) == 1 and additional_id_info[0] == model_input_type) or \
                    (model_name == "action-recognition" and len(additional_id_info) == 2 and additional_id_info[0] == platform and additional_id_info[1] == model_input_type):
                    ptm_id = ptm_metadata["id"]
                    print("Metadata for model with requested NGC Path")
                    print(ptm_metadata)
                    break
        ptm.append(ptm_id)
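
The core of the lookup above is a match on `network_arch` plus a suffix match on `ngc_path`. The sketch below isolates that step on in-memory records. It deliberately leaves out the `additional_id_info` filtering, and the function name, record IDs, and NGC paths are made up for illustration.

```python
def find_ptm_id(metadata_records, network_arch, ptm_model_name):
    """Return the id of the first record matching the architecture and NGC path suffix."""
    for record in metadata_records:
        ngc_path = record.get("ngc_path") or ""
        if record.get("network_arch") == network_arch and ngc_path.endswith(ptm_model_name):
            return record.get("id")
    return None


# Made-up records standing in for the parsed metadata.json files.
records = [
    {"id": "a1", "network_arch": "lprnet", "ngc_path": "example/org/lprnet:trainable_v1.0"},
    {"id": "b2", "network_arch": "ocrnet", "ngc_path": "example/org/ocrnet:trainable_v1.0"},
]

print(find_ptm_id(records, "ocrnet", "ocrnet:trainable_v1.0"))
print(find_ptm_id(records, "ocrnet", "ocrnet:trainable_v9.9"))
```

Returning `None` for a missing match mirrors the cell above, where `ptm_id` stays `None` and is still appended to `ptm`.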

### Customize model metadata <a class="anchor" id="head-12"></a>

In [None]:
metadata_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'metadata.json')

with open(metadata_path , "r") as metadata_file:
    metadata = json.load(metadata_file)

metadata["train_datasets"] = [train_dataset_id]
if model_name in ("bpnet","fpenet","lprnet","ml-recog","ocdnet","ocrnet","optical-inspection"):
    metadata["calibration_dataset"] = train_dataset_id
if model_name in ("lprnet", "ocdnet", "ocrnet", "optical-inspection"):
    metadata["eval_dataset"] = eval_dataset_id
if model_name in ("lprnet","optical-inspection"):
    metadata["inference_dataset"] = infer_dataset_id

if network_arch not in no_ptm_models:
    metadata["ptm"] = ptm

with open(metadata_path, "w") as metadata_file:
    json.dump(metadata, metadata_file, indent=2)

print(json.dumps(metadata, indent=2))

### View hyperparameters that are enabled for AutoML by default <a class="anchor" id="head-13"></a>

In [None]:
if automl_enabled:
    # View default automl specs enabled
    ! tao-client {model_name} model-automl-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/automl_defaults.json

### Set AutoML related configurations <a class="anchor" id="head-14"></a>
Refer to these hyperlinks for the parameters each network supports. You can add any of them to the AutoML search in addition to the parameters that are enabled by default:

[ActionRecognitionNet](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/action_recognition/action_recognition%20-%20train.csv), 
[BPNET](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/bpnet/bpnet%20-%20train.csv), 
[FPENET](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/fpenet/fpenet%20-%20train.csv), 
[LPRNET](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/lprnet/lprnet%20-%20train.csv), 
[MetricLearningRecognition](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/ml_recog/ml_recog%20-%20train.csv), 
[OCDNET](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/ocdnet/ocdnet%20-%20train.csv), 
[OCRNET](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/ocrnet/ocrnet%20-%20train.csv), 
[OpticalInspection](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/optical_inspection/optical_inspection%20-%20train.csv), 
[Pointpillars](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/pointpillars/pointpillars%20-%20train.csv), 
[PoseClassificationNet](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/pose_classification/pose_classification%20-%20train.csv), 
[ReIdentificationNet](https://github.com/NVIDIA/tao_front_end_services/tree/main/api/specs_utils/specs/re_identification/re_identification%20-%20train.csv)

In [None]:
if automl_enabled:
    # Choose the AutoML algorithm: "Bayesian" or "HyperBand".
    automl_algorithm = "Bayesian" # FIXME8 example: Bayesian/HyperBand

    # Don't change this. Multiple metrics will be supported in the future.
    metric = "kpi"

    additional_automl_parameters = [] # Parameters from the lists linked above to tune in addition to the defaults
    remove_default_automl_parameters = [] # Default AutoML hyperparameters to exclude from tuning

    metadata["automl_algorithm"] = automl_algorithm
    metadata["automl_enabled"] = automl_enabled
    metadata["metric"] = metric
    metadata["epoch_multiplier"] = 1 # Will be considered for Hyperband only
    metadata["automl_add_hyperparameters"] = str(additional_automl_parameters)
    metadata["automl_remove_hyperparameters"] = str(remove_default_automl_parameters)

    with open(metadata_path, "w") as metadata_file:
        json.dump(metadata, metadata_file, indent=2)

    print(json.dumps(metadata, indent=2))
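
Note that the cell above stores the two parameter lists as strings (via `str(...)`), not as JSON arrays. The sketch below builds the same metadata keys with a check on the algorithm name and shows what the stringified lists look like. `build_automl_settings` is a hypothetical helper and `example.param` is a placeholder, not a real hyperparameter name. How the service parses these strings is not shown in this notebook; `ast.literal_eval` is used here only to show that the string round-trips.

```python
import ast

ALLOWED_ALGORITHMS = ("Bayesian", "HyperBand")


def build_automl_settings(algorithm, add_params=(), remove_params=()):
    """Return the AutoML metadata entries, with the lists stringified as in the notebook."""
    if algorithm not in ALLOWED_ALGORITHMS:
        raise ValueError(f"automl_algorithm must be one of {ALLOWED_ALGORITHMS}")
    return {
        "automl_algorithm": algorithm,
        "automl_enabled": True,
        "metric": "kpi",
        "epoch_multiplier": 1,  # considered for HyperBand only
        "automl_add_hyperparameters": str(list(add_params)),
        "automl_remove_hyperparameters": str(list(remove_params)),
    }


settings = build_automl_settings("Bayesian", add_params=["example.param"])
print(settings["automl_add_hyperparameters"])
print(ast.literal_eval(settings["automl_add_hyperparameters"]))
```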

### Provide train specs <a class="anchor" id="head-15"></a>

In [None]:
# Default train model specs
! tao-client {model_name} model-train-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/train.json

In [None]:
# Customize train model specs
specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', 'train.json')

with open(specs_path , "r") as specs_file:
    specs = json.load(specs_file)

# Apply changes for any of the parameters listed in the previous cell as required
if model_name == "action-recognition":
    specs["model"]["model_type"] = model_type
    specs["model"]["input_type"] = model_input_type
    specs["train"]["num_epochs"] = 20
    specs["train"]["gpu_ids"] = [0]
elif model_name == "bpnet":
    specs["num_epoch"] = 20
    specs["finetuning_config"]["checkpoint_path"] = None
    specs["gpus"] = 1
elif model_name == "fpenet":
    specs["dataloader"]["dataset_info"]["root_path"] = None
    specs["num_keypoints"] = int(model_type)
    specs["dataloader"]["num_keypoints"] = int(model_type)
    specs["gpus"] = 1
elif model_name == "lprnet":
    specs["training_config"]["num_epochs"] = 24
    specs["gpus"] = 1
elif model_name == "ml-recog":
    specs["train"]["num_epochs"] = 30
    specs["train"]["gpu_ids"] = [0]
elif model_name == "ocdnet":
    specs["train"]["num_epochs"] = 30
    specs["train"]["gpu_id"] = [0]
    specs["num_gpus"] = 1
elif model_name == "ocrnet":
    specs["train"]["num_epochs"] = 20
    specs["train"]["gpu_ids"] = [0]
elif model_name == "optical-inspection":
    specs["train"]["num_epochs"] = 30
    specs["train"]["gpu_ids"] = [0]
elif model_name == "pose-classification":
    specs["train"]["num_epochs"] = 50
    specs["train"]["gpu_ids"] = [0]
    if model_type == "nvidia":
        specs["dataset"]["num_classes"] = 6
        specs["model"]["graph_layout"] = "nvidia"
    elif model_type == "kinetics":
        specs["dataset"]["num_classes"] = 5
        specs["model"]["graph_layout"] = "openpose"
elif model_name == "pointpillars":
    specs["train"]["num_epochs"] = 80
    specs["gpus"] = 1
elif model_name == "re-identification":
    specs["train"]["num_epochs"] = 120
    specs["train"]["gpu_ids"] = [0]
    specs["dataset"]["num_classes"] = 100 #The number set in obtain_subset script
    specs["dataset"]["num_workers"] = 4 #Modify the num_workers according to your hardware setup
    specs["dataset"]["batch_size"] = 16 #Modify the batch_size according to your hardware setup

with open(specs_path, "w") as specs_file:
    json.dump(specs, specs_file, indent=2)

print(json.dumps(specs, indent=2))

### Run train <a class="anchor" id="head-16"></a>

In [None]:
train_job_id = subprocess.getoutput(f"tao-client {model_name} model-train --id " + model_id)
print(train_job_id)

In [None]:
# Monitor job status
if automl_enabled:    
    # Set poll_automl_stats to True to see summary stats such as the time left and the number of epochs remaining.
    # Set poll_automl_stats to False to skip the stats and view the training logs instead. Viewing training logs is supported only for Bayesian.

    # For AutoML, training times for different models benchmarked on a machine with 1 V100 GPU can be found here: https://docs.nvidia.com/tao/tao-toolkit/text/automl/automl.html#results-of-automl-experiments
    
    poll_automl_stats = True
    if poll_automl_stats:
        import time
        from IPython.display import clear_output
        stats_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, train_job_id, "automl_metadata.json")
        controller_json_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, train_job_id, "controller.json")
        while True:
            time.sleep(15)
            clear_output(wait=True)
            if os.path.exists(stats_path):
                try:
                    with open(stats_path , "r") as stats_file:
                        stats_dict = json.load(stats_file)
                    print(json.dumps(stats_dict, indent=2))
                    if float(stats_dict.get("Number of epochs yet to start",-1)) == 0.0 or float(stats_dict.get("Number of iters yet to start",-1)) == 0.0:
                        break
                except (json.JSONDecodeError):
                    print("Stats computed are being written to file. Stats will be visible on screen in a few seconds")
    else:
        # Print the log files. Supported only for Bayesian (a file won't exist until the backend Toolkit container is running, which can take several minutes)
        if automl_algorithm == "Bayesian":
            import time
            logs_dir = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id)
            max_recommendations = metadata.get("automl_max_recommendations",20)
            for experiment_num in range(max_recommendations):
                log_file = f"{train_job_id}/experiment_{experiment_num}/log.txt"
                # Wait for the experiment's log file to appear without spinning the CPU
                while not os.path.exists(os.path.join(logs_dir, log_file)):
                    time.sleep(5)
                print(f"\n\nViewing experiment {experiment_num}\n\n")
                my_tail(logs_dir, log_file)
    
else:
    # Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
    logs_dir = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'logs')
    log_file = f"{train_job_id}.txt"

    my_tail(logs_dir, log_file)
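
The stats-polling loop above exits when either remaining-work counter in `automl_metadata.json` reaches zero. That exit condition is pulled out below as a small function so it can be checked on sample dictionaries. The function name is an assumption, while the two keys are the ones read in the cell above.

```python
def automl_finished(stats):
    """Return True when no epochs (or iterations) are left to start."""
    # A missing key defaults to -1, which never counts as finished.
    epochs_left = float(stats.get("Number of epochs yet to start", -1))
    iters_left = float(stats.get("Number of iters yet to start", -1))
    return epochs_left == 0.0 or iters_left == 0.0


print(automl_finished({"Number of epochs yet to start": 0}))
print(automl_finished({"Number of epochs yet to start": "3"}))
print(automl_finished({}))
```

Because a missing key never counts as finished, the loop keeps polling until the stats file reports one of the two counters as zero, or until the cell is interrupted manually.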

In [None]:
## To stop an AutoML job
#    1. Manually stop the 'Monitor job status' cell (the cell right before this one)
#    2. Uncomment the snippet in the next cell and run that cell

In [None]:
# if automl_enabled:
#     canceled_job_id = subprocess.getoutput(f"tao-client {model_name} model-job-cancel --id {model_id} --job {train_job_id}")
#     print(canceled_job_id)

In [None]:
## Resume AutoML

In [None]:
# To resume a stopped AutoML job, uncomment the snippet below, run this cell, and then rerun the 'Monitor job status' cell (four cells above this one)
# if automl_enabled:
#     resumed_job_id = subprocess.getoutput(f"tao-client {model_name} model-job-resume --id {model_id} --job {train_job_id}")
#     print(resumed_job_id)

### View checkpoint files <a class="anchor" id="head-17"></a>

In [None]:
# View the checkpoints generated by the training job. For AutoML jobs, also view the best-performing model's config and the results of all AutoML experiments.

job_dir = f"{home}/shared/users/{os.environ['USER']}/models/{model_id}/{train_job_id}"
model_path = job_dir

if automl_enabled:
    !python3 -m pip install pandas==1.5.1
    import pandas as pd
    import glob
    model_path =  f"{job_dir}/best_model"

import time
from IPython.display import clear_output

while True:
    time.sleep(5)  # Poll every few seconds instead of spinning the CPU
    clear_output(wait=True)
    if os.path.exists(model_path) and len(os.listdir(model_path)) > 0:
        #List the binary model file
        print("\nCheckpoints for the training experiment")
        if os.path.exists(model_path+"/train/weights") and len(os.listdir(model_path+"/train/weights")) > 0:
            print(f"Folder: {model_path}/train/weights")
            print("Files:", os.listdir(model_path+"/train/weights"))
        elif os.path.exists(model_path+"/weights") and len(os.listdir(model_path+"/weights")) > 0:
            print(f"Folder: {model_path}/weights")
            print("Files:", os.listdir(model_path+"/weights"))
        else:
            print(f"Folder: {model_path}")
            print("Files:", os.listdir(model_path))

        if automl_enabled:
            if os.path.exists(f"{model_path}/controller.json") and (len(glob.glob(os.path.join(model_path,"*.protobuf"))) > 0 or len(glob.glob(os.path.join(model_path,"*.yaml"))) > 0):
                with open(f"{model_path}/controller.json", "r") as controller_file:
                    experiment_artifacts = json.load(controller_file)
                data_frame = pd.DataFrame(experiment_artifacts)
                # Print experiment id/number and the corresponding result
                print("\nResults of all experiments")
                with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.max_colwidth', None):
                    print(data_frame[["id","result"]])
                break
        else:
            break
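
The cell above looks for checkpoints in `train/weights`, then `weights`, and falls back to the job folder itself. A compact sketch of that precedence is shown below. `find_checkpoint_dir` is a hypothetical name, and the file created in the demo is a placeholder rather than a real checkpoint.

```python
import os
import tempfile


def find_checkpoint_dir(model_path):
    """Return the first non-empty candidate folder, else `model_path` itself."""
    for parts in (("train", "weights"), ("weights",)):
        candidate = os.path.join(model_path, *parts)
        if os.path.isdir(candidate) and os.listdir(candidate):
            return candidate
    return model_path


# Demo on a temporary folder standing in for a training job directory.
with tempfile.TemporaryDirectory() as job_dir_demo:
    print(find_checkpoint_dir(job_dir_demo) == job_dir_demo)
    weights_dir = os.path.join(job_dir_demo, "weights")
    os.makedirs(weights_dir)
    open(os.path.join(weights_dir, "placeholder.ckpt"), "w").close()
    print(os.path.relpath(find_checkpoint_dir(job_dir_demo), job_dir_demo))
```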

### Provide evaluate specs <a class="anchor" id="head-18"></a>

In [None]:
# Default evaluate model specs
! tao-client {model_name} model-evaluate-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/evaluate.json

In [None]:
# Customize evaluate model specs
specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', 'evaluate.json')

with open(specs_path , "r") as specs_file:
    specs = json.load(specs_file)

if model_name == "action-recognition":
    specs["model"]["model_type"] = model_type
    specs["model"]["input_type"] = model_input_type
elif model_name == "fpenet":
    specs["dataloader"]["dataset_info"]["root_path"] = None
    specs["num_keypoints"] = int(model_type)
    specs["dataloader"]["num_keypoints"] = int(model_type)
elif model_name == "pose-classification":
    if model_type == "nvidia":
        specs["dataset"]["num_classes"] = 6
        specs["model"]["graph_layout"] = "nvidia"
    elif model_type == "kinetics":
        specs["dataset"]["num_classes"] = 5
        specs["model"]["graph_layout"] = "openpose"
elif model_name == "re-identification":
    specs["dataset"]["num_classes"] = 100 #The number set in obtain_subset script

with open(specs_path, "w") as specs_file:
    json.dump(specs, specs_file, indent=2)

print(json.dumps(specs, indent=2))

### Run evaluate <a class="anchor" id="head-19"></a>

In [None]:
eval_job_id = subprocess.getoutput(f"tao-client {model_name} model-evaluate --id {model_id} --job {train_job_id}")
print(eval_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
log_file = f"{eval_job_id}.txt"
logs_dir = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'logs')
my_tail(logs_dir, log_file)

### Provide prune specs <a class="anchor" id="head-20"></a>

In [None]:
# Default prune model specs
if model_name in ("bpnet", "ocdnet", "ocrnet", "pointpillars"):
    ! tao-client {model_name} model-prune-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/prune.json

### Run prune <a class="anchor" id="head-21"></a>

In [None]:
if model_name in ("bpnet", "ocdnet", "ocrnet", "pointpillars"):
    prune_job_id = subprocess.getoutput(f"tao-client {model_name} model-prune --id {model_id} --job {train_job_id}")
    print(prune_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
if model_name in ("bpnet", "ocdnet", "ocrnet", "pointpillars"):
    log_file = f"{prune_job_id}.txt"
    my_tail(logs_dir, log_file)

### Provide retrain specs <a class="anchor" id="head-22"></a>

In [None]:
# Default retrain model specs
if model_name in ("bpnet", "ocdnet", "ocrnet", "pointpillars"):
    ! tao-client {model_name} model-retrain-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/retrain.json

In [None]:
# Customize retrain model specs
if model_name in ("bpnet", "ocdnet", "ocrnet", "pointpillars"):
    specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', 'retrain.json')

    with open(specs_path , "r") as specs_file:
        specs = json.load(specs_file)

    if model_name == "bpnet":
        specs["num_epoch"] = 20
        specs["finetuning_config"]["checkpoint_path"] = None
        specs["gpus"] = 1
    elif model_name == "ocdnet":
        specs["train"]["num_epochs"] = 30
        specs["train"]["gpu_id"] = [0]
        specs["num_gpus"] = 1
    elif model_name == "ocrnet":
        specs["train"]["num_epochs"] = 20
        specs["train"]["gpu_ids"] = [0]
    elif model_name == "pointpillars":
        specs["train"]["num_epochs"] = 80
        specs["gpus"] = 1

    with open(specs_path, "w") as specs_file:
        json.dump(specs, specs_file, indent=2)

    print(json.dumps(specs, indent=2))

### Run retrain <a class="anchor" id="head-23"></a>

In [None]:
if model_name in ("bpnet", "ocdnet", "ocrnet", "pointpillars"):
    retrain_job_id = subprocess.getoutput(f"tao-client {model_name} model-retrain --id {model_id} --job {prune_job_id}")
    print(retrain_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
if model_name in ("bpnet", "ocdnet", "ocrnet", "pointpillars"):
    log_file = f"{retrain_job_id}.txt"
    my_tail(logs_dir, log_file)

### Run evaluate on retrained model <a class="anchor" id="head-24"></a>

In [None]:
if model_name in ("bpnet","pointpillars"):
    eval2_job_id = subprocess.getoutput(f"tao-client {model_name} model-evaluate --id {model_id} --job {retrain_job_id}")
    print(eval2_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
if model_name in ("bpnet","pointpillars"):
    log_file = f"{eval2_job_id}.txt"
    my_tail(logs_dir, log_file)

### Provide export specs <a class="anchor" id="head-25"></a>

In [None]:
# Default export model specs
! tao-client {model_name} model-export-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/export.json

In [None]:
# Customize export model specs
specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', 'export.json')

with open(specs_path , "r") as specs_file:
    specs = json.load(specs_file)

if model_name == "action-recognition":
    specs["model"]["model_type"] = model_type
    specs["model"]["input_type"] = model_input_type
elif model_name == "lprnet":
    specs["data_type"] = "fp32"
elif model_name == "bpnet":
    specs["data_type"] = "int8"
elif model_name == "pose-classification":
    if model_type == "nvidia":
        specs["dataset"]["num_classes"] = 6
        specs["model"]["graph_layout"] = "nvidia"
    elif model_type == "kinetics":
        specs["dataset"]["num_classes"] = 5
        specs["model"]["graph_layout"] = "openpose"
elif model_name == "re-identification":
    specs["dataset"]["num_classes"] = 100 #The number set in obtain_subset script

with open(specs_path, "w") as specs_file:
    json.dump(specs, specs_file, indent=2)

print(json.dumps(specs, indent=2))

### Run export <a class="anchor" id="head-26"></a>

In [None]:
export_job_id = subprocess.getoutput(f"tao-client {model_name} model-export --id {model_id} --job {train_job_id}")
print(export_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
log_file = f"{export_job_id}.txt"
my_tail(logs_dir, log_file)

### Provide TRT engine generation specs <a class="anchor" id="head-27"></a>

In [None]:
# Default convert model specs
if model_name in ("lprnet", "ocdnet", "ocrnet", "ml-recog","optical-inspection"):
    ! tao-client {model_name} model-gen-trt-engine-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/gen_trt_engine.json
elif model_name == "bpnet":
    ! tao-client {model_name} model-trtexec-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/trtexec.json

In [None]:
# Customize convert model specs
if model_name in ("bpnet", "lprnet", "ocdnet", "ocrnet", "ml-recog","optical-inspection"):
    if model_name == "bpnet":
        specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', 'trtexec.json')
    else:
        specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', 'gen_trt_engine.json')

    with open(specs_path , "r") as specs_file:
        specs = json.load(specs_file)

    # Make changes to the specs dictionary if required here
    if model_name == "lprnet":
        specs["data_type"] = "fp32"
    elif model_name in ("ml-recog", "ocdnet"):
        specs["gen_trt_engine"]["tensorrt"]["data_type"] = "int8"
    elif model_name in ("ocrnet", "optical-inspection"):
        specs["gen_trt_engine"]["tensorrt"]["data_type"] = "fp16"    

    with open(specs_path, "w") as specs_file:
        json.dump(specs, specs_file, indent=2)

    print(json.dumps(specs, indent=2))
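
The precision overrides in the cell above differ in both value and key location: `lprnet` uses a top-level `data_type`, while the other models nest it under `gen_trt_engine.tensorrt`. A table-driven sketch of the same overrides follows. The table and function names are assumptions, and the values are copied from the cell.

```python
# model name -> (key path inside the spec, precision value), copied from the cell above
TRT_PRECISION = {
    "lprnet": (("data_type",), "fp32"),
    "ml-recog": (("gen_trt_engine", "tensorrt", "data_type"), "int8"),
    "ocdnet": (("gen_trt_engine", "tensorrt", "data_type"), "int8"),
    "ocrnet": (("gen_trt_engine", "tensorrt", "data_type"), "fp16"),
    "optical-inspection": (("gen_trt_engine", "tensorrt", "data_type"), "fp16"),
}


def apply_trt_precision(specs, model_name):
    """Set the precision for `model_name` in place; leave other models untouched."""
    if model_name not in TRT_PRECISION:
        return specs
    key_path, value = TRT_PRECISION[model_name]
    node = specs
    for key in key_path[:-1]:
        node = node[key]  # raises KeyError if the section is missing, like the original
    node[key_path[-1]] = value
    return specs


demo_specs = {"gen_trt_engine": {"tensorrt": {"data_type": "fp32"}}}
print(apply_trt_precision(demo_specs, "ocdnet"))
```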

### Run TRT engine generation using TAO-Deploy <a class="anchor" id="head-28"></a>

In [None]:
if model_name in ("bpnet", "lprnet", "ocdnet", "ocrnet", "ml-recog","optical-inspection"):
    if model_name == "bpnet":
        gen_trt_engine_job_id = subprocess.getoutput(f"tao-client {model_name} model-trtexec --id {model_id} --job {export_job_id}")
    else:
        gen_trt_engine_job_id = subprocess.getoutput(f"tao-client {model_name} model-gen-trt-engine --id {model_id} --job {export_job_id}")
    print(gen_trt_engine_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
if model_name in ("bpnet", "lprnet", "ocdnet", "ocrnet", "ml-recog","optical-inspection"):
    log_file = f"{gen_trt_engine_job_id}.txt"
    my_tail(logs_dir, log_file)

### Provide TAO inference specs <a class="anchor" id="head-29"></a>

In [None]:
# Default inference model specs
! tao-client {model_name} model-inference-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/inference.json

In [None]:
# Customize TAO inference specs
specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', 'inference.json')

with open(specs_path , "r") as specs_file:
    specs = json.load(specs_file)

#Apply changes to the specs dictionary here if required
if model_name == "action-recognition":
    specs["model"]["model_type"] = model_type
    specs["model"]["input_type"] = model_input_type
elif model_name == "fpenet":
    specs["num_keypoints"] = int(model_type)
    specs["dataloader"]["num_keypoints"] = int(model_type)
elif model_name == "pose-classification":
    if model_type == "nvidia":
        specs["dataset"]["num_classes"] = 6
        specs["model"]["graph_layout"] = "nvidia"
    elif model_type == "kinetics":
        specs["dataset"]["num_classes"] = 5
        specs["model"]["graph_layout"] = "openpose"
elif model_name == "re-identification":
    specs["dataset"]["num_classes"] = 100 #The number set in obtain_subset script

with open(specs_path, "w") as specs_file:
    json.dump(specs, specs_file, indent=2)

print(json.dumps(specs, indent=2))

### Run TAO inference <a class="anchor" id="head-30"></a>

In [None]:
tlt_inference_job_id = subprocess.getoutput(f"tao-client {model_name} model-inference --id {model_id} --job {train_job_id}")
print(tlt_inference_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
log_file = f"{tlt_inference_job_id}.txt"
my_tail(logs_dir, log_file)

In [None]:
# The inference output should be available here
job_dir = f"{home}/shared/users/{os.environ['USER']}/models/{model_id}/{tlt_inference_job_id}"
if model_name in ("action-recognition","lprnet","ocrnet"):
    !cat {logs_dir}/{log_file}
elif model_name == "fpenet":
    !cat {job_dir}/result.txt
elif model_name == "ml-recog":
    !cat {job_dir}/inference/result.csv
elif model_name == "optical-inspection":
    !cat {job_dir}/inference/inference.csv
elif model_name == "pose-classification":
    !cat {job_dir}/results.txt
elif model_name in ("bpnet","pointpillars"):
    !python3 -m pip install matplotlib
    import glob
    import matplotlib.pyplot as plt
    import matplotlib.image as mpimg
    if model_name == "bpnet":
        sample_image = glob.glob(f"{job_dir}/images_annotated/*.png")[0]
    elif model_name == "pointpillars":
        sample_image = glob.glob(f"{job_dir}/infer/detected_boxes/*.png")[0]
    def display_photo(path):
        img = mpimg.imread(path)
        # figsize is (width, height) in inches, while img.shape is (height, width, ...)
        plt.figure(figsize=(max(1, int(img.shape[1]/100)*2), max(1, int(img.shape[0]/100)*2)))
        plt.axis('off')
        imgplot = plt.imshow(img, aspect='auto')
        plt.show()
    display_photo(sample_image)
elif model_name == "re-identification":
    !cat {job_dir}/inference.json
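
For the models whose results are plain files, the branch above amounts to a lookup from model name to a path inside the inference job folder. A sketch of that lookup is below. `inference_result_path` is a hypothetical helper, and the relative paths are the ones used in the cell.

```python
import os

# Result file per model, relative to the inference job folder (from the cell above).
RESULT_FILES = {
    "fpenet": "result.txt",
    "ml-recog": "inference/result.csv",
    "optical-inspection": "inference/inference.csv",
    "pose-classification": "results.txt",
    "re-identification": "inference.json",
}


def inference_result_path(model_name, job_dir):
    """Return the result file path for `model_name`, or None if it has no single result file."""
    relative = RESULT_FILES.get(model_name)
    if relative is None:
        return None
    return os.path.join(job_dir, *relative.split("/"))


print(inference_result_path("ml-recog", "demo_job_dir"))
print(inference_result_path("lprnet", "demo_job_dir"))
```

Models that are not in the table either print results to the job log (`action-recognition`, `lprnet`, `ocrnet`) or produce images (`bpnet`, `pointpillars`), so they return `None` here.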

### Provide TRT inference specs <a class="anchor" id="head-31"></a>

In [None]:
# Default inference model specs
if model_name in ("bpnet", "lprnet", "ocdnet", "ocrnet", "ml-recog","optical-inspection"):
    ! tao-client {model_name} model-inference-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/inference.json

In [None]:
# Customize TRT inference specs
if model_name in ("bpnet", "lprnet", "ocdnet", "ocrnet", "ml-recog","optical-inspection"):
    specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', 'inference.json')

    with open(specs_path , "r") as specs_file:
        specs = json.load(specs_file)

    # Change any spec if you wish

    with open(specs_path, "w") as specs_file:
        json.dump(specs, specs_file, indent=2)

    print(json.dumps(specs, indent=2))

### Run TRT inference <a class="anchor" id="head-32"></a>

In [None]:
if model_name in ("bpnet", "lprnet", "ocdnet", "ocrnet", "ml-recog","optical-inspection"):
    trt_inference_job_id = subprocess.getoutput(f"tao-client {model_name} model-inference --id {model_id} --job {gen_trt_engine_job_id}")
    print(trt_inference_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
if model_name in ("bpnet", "lprnet", "ocdnet", "ocrnet", "ml-recog","optical-inspection"):
    log_file = f"{trt_inference_job_id}.txt"
    my_tail(logs_dir, log_file)

In [None]:
if model_name in ("bpnet", "lprnet", "ml-recog", "ocdnet", "ocrnet","optical-inspection"):
    job_dir = f"{home}/shared/users/{os.environ['USER']}/models/{model_id}/{trt_inference_job_id}"
    # You can view the predictions here or in the subdirectories
    !ls {job_dir}/

### Delete experiment <a class="anchor" id="head-33"></a>

In [None]:
! rm -rf ~/shared/users/{os.environ['USER']}/models/{model_id}
! echo DONE

### Delete datasets <a class="anchor" id="head-34"></a>

In [None]:
! rm -rf ~/shared/users/{os.environ['USER']}/datasets/*
! echo DONE

### Unmount shared volume <a class="anchor" id="head-35"></a>

In [None]:
command = "umount ~/shared"
! echo {password} | sudo -S -k {command} && echo DONE

### Uninstall TAO Remote Client <a class="anchor" id="head-36"></a>

In [None]:
! pip3 uninstall -y nvidia-tao-client