## Get the TensorRT tar file before running this Notebook

1. Visit https://developer.nvidia.com/tensorrt
2. Click `Download now`. This takes you to https://developer.nvidia.com/nvidia-tensorrt-download, where you must log in to (or join) the NVIDIA Developer Program
3. On the download page, choose TensorRT 8 from the available versions
4. Agree to the Terms and Conditions
5. Click TensorRT 8.6 GA to expand the available options
6. Click 'TensorRT 8.6 GA for Linux x86_64 and CUDA 12.0 and 12.1 TAR Package' to download the TAR file
7. Upload the TAR file to your Google Drive

## Connect to GPU Instance

1. Change the runtime type to GPU: Runtime (top-left tab) -> Change runtime type -> GPU (hardware accelerator)
1. Then click Connect (top right)
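
Once connected, it can be worth confirming that a GPU is actually visible before going further. The following is a minimal sketch, assuming the `nvidia-smi` tool is installed on the instance (it normally ships with the NVIDIA driver); the helper name `gpu_visible` is ours and not part of TAO or Colab.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi` is on PATH and lists GPUs without error."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA driver tools found, so most likely no GPU runtime.
        return False
    try:
        # `nvidia-smi -L` prints one line per GPU.
        result = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("GPU visible:", gpu_visible())
```

If this prints `False`, re-check the runtime type before running the cells below.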


## Mounting Google Drive
Mount your Google Drive storage to this Colab instance.

In [2]:
import sys
if 'google.colab' in sys.modules:
    %env GOOGLE_COLAB=1
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
else:
    %env GOOGLE_COLAB=0
    print("Warning: Not a Colab Environment")

env: GOOGLE_COLAB=1
Mounted at /content/drive


# TAO Image Classification

Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique where you use a model trained on one task and re-train to use it on a different task.

Train Adapt Optimize (TAO) Toolkit is a simple, easy-to-use, Python-based AI toolkit for taking purpose-built AI models and customizing them with your own data.

<img align="center" src="https://developer.nvidia.com/sites/default/files/akamai/TAO/tlt-tao-toolkit-bring-your-own-model-diagram.png" width="1080">

## Learning Objectives
In this notebook, you will learn how to leverage the simplicity and convenience of TAO to:

* Take a pretrained ResNet-18 model and fine-tune it on a sample dataset converted from Pascal VOC
* Prune the fine-tuned model
* Retrain the pruned model to recover lost accuracy
* Export the pruned model
* Run Inference on the trained model
* Export the pruned and retrained model to a .etlt file for deployment to DeepStream

### Table of Contents
This notebook shows an example use case for classification using the Train Adapt Optimize (TAO) Toolkit.

0. [Set up env variables](#head-0)
1. [Prepare dataset and pretrained model](#head-1)
    1. [Split the dataset into train/test/val](#head-1-1)
    2. [Download pre-trained model](#head-1-2)
2. [Setup GPU environment](#head-2) <br>
    2.1 [Setup Python environment](#head-2-1) <br>
3. [Provide training specification](#head-3)
4. [Run TAO training](#head-4)
5. [Evaluate trained models](#head-5)
6. [Prune trained models](#head-6)
7. [Retrain pruned models](#head-7)
8. [Testing the model](#head-8)
9. [Visualize inferences](#head-9)


#### Note
1. By default, this notebook runs training on 1 GPU. To use more GPUs, update the env variable `$NUM_GPUS` accordingly.
2. This notebook uses the VOC dataset by default, which is about 3 GB.
3. With the default config/spec file provided in this notebook, each classification weight file created during training is about 88 MB.
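
To get a rough idea of how much storage the checkpoints need, multiply the per-file size by the number of checkpoints. The sketch below assumes one weight file is saved per epoch (an assumption, not something this notebook states) and uses the 88 MB figure from the note above; the function name is hypothetical.

```python
def checkpoint_storage_mb(n_epochs: int, mb_per_checkpoint: float = 88.0) -> float:
    """Estimate total checkpoint storage, assuming one weight file per epoch."""
    if n_epochs < 0:
        raise ValueError("n_epochs must be non-negative")
    return n_epochs * mb_per_checkpoint


# Example: a 10-epoch run with ~88 MB per weight file.
print(checkpoint_storage_mb(10))  # 880.0
```

This matters on Google Drive, where the experiment directory counts against your storage quota.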

## 0. Set up env variables and set FIXME parameters <a class="anchor" id="head-0"></a>

*Note: By default, this notebook runs training on 1 GPU. To use more GPUs, update the env variable `$NUM_GPUS` accordingly.*

#### FIXME
1. NUM_GPUS - set this to <= the number of GPUs available on the instance
1. COLAB_NOTEBOOKS_PATH - in a Google Colab environment, set this to the path where you want the repo cloned; on a local system, set this to the path of the already cloned repo
1. EXPERIMENT_DIR - set this to the folder where pretrained models, checkpoints and log files from the different model actions will be saved
1. delete_existing_experiments - set to True to remove the pretrained models, checkpoints and log files of a previous experiment
1. DATA_DIR - set this to the folder where you want the dataset to be stored
1. delete_existing_data - set this to True to remove existing preprocessed and original data
1. trt_tar_path - set this to the path of the TensorRT tar.gz file you uploaded after downloading it in the browser
1. trt_untar_folder_path - set this to the folder the TensorRT tar.gz file should be extracted into
1. trt_version - set this to the version of TensorRT you downloaded
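
Because `trt_version` has to match the tar file you downloaded, one option is to derive it from the file name rather than typing it twice. This is a minimal sketch, assuming the file name follows the pattern `TensorRT-<version>.<platform>...tar.gz` (as the file used later in this notebook does); the helper `trt_version_from_tar` is hypothetical.

```python
import os
import re


def trt_version_from_tar(tar_path: str) -> str:
    """Extract the dotted version from a file name like 'TensorRT-8.6.1.6.Linux...tar.gz'."""
    name = os.path.basename(tar_path)
    # The version is the run of dot-separated numbers right after 'TensorRT-'.
    match = re.match(r"TensorRT-(\d+(?:\.\d+)+)\.", name)
    if match is None:
        raise ValueError(f"Cannot find a TensorRT version in file name: {name}")
    return match.group(1)


print(trt_version_from_tar(
    "/content/drive/MyDrive/TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-12.0.tar.gz"
))  # 8.6.1.6
```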

## 1. Prepare datasets and pre-trained model <a class="anchor" id="head-1"></a>

In [3]:
# Setting up env variables for cleaner command line commands.
import os

%env TAO_DOCKER_DISABLE=1

%env KEY=nvidia_tlt
#FIXME1
%env NUM_GPUS=1

#FIXME2
%env COLAB_NOTEBOOKS_PATH=/localhome/local-rarunachalam/colab_notebooks
if os.environ["GOOGLE_COLAB"] == "1":
    # On Colab, clone the notebooks repo if it is not already present.
    if not os.path.exists(os.environ["COLAB_NOTEBOOKS_PATH"]):
        !git clone https://github.com/NVIDIA-AI-IOT/nvidia-tao.git $COLAB_NOTEBOOKS_PATH
else:
    # On a local system, the repo must already be cloned at this path.
    if not os.path.exists(os.environ["COLAB_NOTEBOOKS_PATH"]):
        raise Exception("Error, enter the path of the colab notebooks repo correctly")

#FIXME3
%env EXPERIMENT_DIR=/content/drive/MyDrive/TAO_Hiwi/results/classification
#FIXME4
delete_existing_experiments = True
#FIXME5
%env DATA_DIR=/content/drive/MyDrive/TAO_Hiwi/data_images
#FIXME6
delete_existing_data = False

if delete_existing_experiments:
    !sudo rm -rf $EXPERIMENT_DIR
if delete_existing_data:
    !sudo rm -rf $DATA_DIR

SPECS_DIR=f"{os.environ['COLAB_NOTEBOOKS_PATH']}/tensorflow/classification/specs"
%env SPECS_DIR={SPECS_DIR}
# Showing list of specification files.
!ls -rlt $SPECS_DIR

!sudo mkdir -p $DATA_DIR && sudo chmod -R 777 $DATA_DIR
!sudo mkdir -p $EXPERIMENT_DIR && sudo chmod -R 777 $EXPERIMENT_DIR

env: TAO_DOCKER_DISABLE=1
env: KEY=nvidia_tlt
env: NUM_GPUS=1
env: COLAB_NOTEBOOKS_PATH=/localhome/local-rarunachalam/colab_notebooks
Cloning into '/localhome/local-rarunachalam/colab_notebooks'...
remote: Enumerating objects: 2657, done.
remote: Counting objects: 100% (350/350), done.
remote: Compressing objects: 100% (193/193), done.
remote: Total 2657 (delta 241), reused 251 (delta 157), pack-reused 2307 (from 1)
Receiving objects: 100% (2657/2657), 4.05 MiB | 16.21 MiB/s, done.
Resolving deltas: 100% (1735/1735), done.
env: EXPERIMENT_DIR=/content/drive/MyDrive/TAO_Hiwi/results/classification
env: DATA_DIR=/content/drive/MyDrive/TAO_Hiwi/data_images
env: SPECS_DIR=/localhome/local-rarunachalam/colab_notebooks/tensorflow/classification/specs
total 8
-rw-r--r-- 1 root root 1175 Nov 30 15:58 classification_spec.cfg
-rw-r--r-- 1 root root 1046 Nov 30 15:58 classification_retrain_spec.cfg


We will use the Pascal VOC dataset for this tutorial. For more details, visit
http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html#devkit. Download the dataset from http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar to `$DATA_DIR`.
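
If you prefer to fetch the archive from within the notebook, the sketch below shows one way to do it with the standard library. The helper `download_if_missing` is hypothetical (not part of TAO); it skips the download when the file already exists and writes to a temporary `.part` file first, so an interrupted download is not mistaken for a complete one.

```python
import os
import urllib.request

VOC_URL = "http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar"


def download_if_missing(url: str, dest_dir: str) -> str:
    """Download `url` into `dest_dir` unless the file is already there; return its path."""
    os.makedirs(dest_dir, exist_ok=True)
    dest_path = os.path.join(dest_dir, os.path.basename(url))
    if not os.path.isfile(dest_path):
        tmp_path = dest_path + ".part"
        urllib.request.urlretrieve(url, tmp_path)
        # Rename only after the download finished successfully.
        os.replace(tmp_path, dest_path)
    return dest_path


# In the notebook (downloads the full archive, so it takes a while):
# download_if_missing(VOC_URL, os.environ["DATA_DIR"])
```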

In [4]:
# Check that file is present
import os
DATA_DIR = os.environ.get('DATA_DIR')
if not os.path.isfile(os.path.join(DATA_DIR , 'VOCtrainval_11-May-2012.tar')):
    print('tar file for dataset not found. Please download.')
else:
    print('Found dataset.')

Found dataset.


In [None]:
# unpack
!tar -xvf $DATA_DIR/VOCtrainval_11-May-2012.tar -C $DATA_DIR

In [6]:
# verify
!ls $DATA_DIR/VOCdevkit/VOC2012

Annotations  ImageSets	JPEGImages  SegmentationClass  SegmentationObject


### A. Split the dataset into train/val/test <a class="anchor" id="head-1-1"></a>

The Pascal VOC dataset is first converted to the classification format (one folder per class) and then split into train/val/test in the next two cells.
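
The conversion relies on the per-class `<class>_trainval.txt` files under `ImageSets/Main`, where each line holds an image ID followed by a flag and only images flagged `1` are copied into the class folder. The sketch below illustrates that filtering on a few made-up lines; the sample IDs and the helper name `positive_image_ids` are illustrative only.

```python
def positive_image_ids(lines):
    """Return the image IDs whose flag is '1' (the image contains the class)."""
    ids = []
    for line in lines:
        tokens = line.split()  # splits on any run of whitespace
        if len(tokens) == 2 and tokens[1] == "1":
            ids.append(tokens[0])
    return ids


# Hypothetical excerpt of a <class>_trainval.txt file.
sample = ["2008_000008 -1\n", "2008_000060  1\n", "2008_000112  0\n"]
print(positive_image_ids(sample))  # ['2008_000060']
```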

In [None]:
from os.path import join as join_path
import os
import glob
import re
import shutil

DATA_DIR=os.environ.get('DATA_DIR')
source_dir = join_path(DATA_DIR, "VOCdevkit/VOC2012")
target_dir = join_path(DATA_DIR, "formatted")


suffix = '_trainval.txt'
classes_dir = join_path(source_dir, "ImageSets", "Main")
images_dir = join_path(source_dir, "JPEGImages")
classes_files = glob.glob(classes_dir+"/*"+suffix)
for file in classes_files:
    # get the filename and make output class folder
    classname = os.path.basename(file)
    if classname.endswith(suffix):
        classname = classname[:-len(suffix)]
        target_dir_path = join_path(target_dir, classname)
        if not os.path.exists(target_dir_path):
            os.makedirs(target_dir_path)
    else:
        continue
    print(classname)


    with open(file) as f:
        content = f.readlines()


    for line in content:
        tokens = re.split(r'\s+', line)
        if tokens[1] == '1':
            # copy this image into target dir_path
            target_file_path = join_path(target_dir_path, tokens[0] + '.jpg')
            src_file_path = join_path(images_dir, tokens[0] + '.jpg')
            shutil.copyfile(src_file_path, target_file_path)

In [None]:
import os
import glob
import shutil
from random import shuffle
from tqdm import tqdm

DATA_DIR=os.environ.get('DATA_DIR')
SOURCE_DIR=os.path.join(DATA_DIR, 'formatted')
TARGET_DIR=os.path.join(DATA_DIR,'split')
# List the class directories created in the previous cell
dir_list = next(os.walk(SOURCE_DIR))[1]
# For each class, create matching train/val/test directories under split/
for dir_i in tqdm(dir_list):
    newdir_train = os.path.join(TARGET_DIR, 'train', dir_i)
    newdir_val = os.path.join(TARGET_DIR, 'val', dir_i)
    newdir_test = os.path.join(TARGET_DIR, 'test', dir_i)
    for newdir in (newdir_train, newdir_val, newdir_test):
        os.makedirs(newdir, exist_ok=True)

    img_list = glob.glob(os.path.join(SOURCE_DIR, dir_i, '*.jpg'))
    # Shuffle so that the 70/10/20 split below is random
    shuffle(img_list)

    n_train = int(len(img_list) * 0.7)
    n_val = int(len(img_list) * 0.8)
    for img in img_list[:n_train]:
        shutil.copy2(img, newdir_train)
    for img in img_list[n_train:n_val]:
        shutil.copy2(img, newdir_val)
    for img in img_list[n_val:]:
        shutil.copy2(img, newdir_test)

print('Done splitting dataset.')

In [7]:
!ls $DATA_DIR/split/test/cat

2008_000060.jpg  2008_005853.jpg  2009_002104.jpg  2010_001712.jpg  2010_004144.jpg
2008_000112.jpg  2008_005977.jpg  2009_002141.jpg  2010_001863.jpg  2010_004244.jpg
2008_000536.jpg  2008_006113.jpg  2009_002352.jpg  2010_001885.jpg  2010_004335.jpg
2008_000641.jpg  2008_006218.jpg  2009_002561.jpg  2010_001934.jpg  2010_004346.jpg
2008_000724.jpg  2008_006280.jpg  2009_002704.jpg  2010_001939.jpg  2010_004365.jpg
2008_000824.jpg  2008_006377.jpg  2009_002837.jpg  2010_002000.jpg  2010_004402.jpg
2008_001004.jpg  2008_006384.jpg  2009_002972.jpg  2010_002025.jpg  2010_004479.jpg
2008_001071.jpg  2008_006512.jpg  2009_003013.jpg  2010_002040.jpg  2010_004553.jpg
2008_001357.jpg  2008_006576.jpg  2009_003415.jpg  2010_002086.jpg  2010_004584.jpg
2008_001414.jpg  2008_006656.jpg  2009_003528.jpg  2010_002143.jpg  2010_004717.jpg
2008_001433.jpg  2008_006746.jpg  2009_003601.jpg  2010_002333.jpg  2010_004816.jpg
2008_001592.jpg  2008_006753.jpg  2009_003605.jpg  2010_002348.jpg  2010_004

### B. Download pretrained models <a class="anchor" id="head-1-2"></a>

We will use the NGC CLI to get the pre-trained models. For more details, go to ngc.nvidia.com and click SETUP on the navigation bar.

In [8]:
# Installing NGC CLI on the local machine.
## Download and install
%env LOCAL_PROJECT_DIR=/ngc_content/
%env CLI=ngccli_cat_linux.zip
!sudo mkdir -p $LOCAL_PROJECT_DIR/ngccli && sudo chmod -R 777 $LOCAL_PROJECT_DIR

# Remove any previously existing CLI installations
!sudo rm -rf $LOCAL_PROJECT_DIR/ngccli/*
!wget --content-disposition 'https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.23.0/files/ngccli_linux.zip' -P $LOCAL_PROJECT_DIR/ngccli -O $LOCAL_PROJECT_DIR/ngccli/$CLI
!unzip -u -q "$LOCAL_PROJECT_DIR/ngccli/$CLI" -d $LOCAL_PROJECT_DIR/ngccli/
!rm $LOCAL_PROJECT_DIR/ngccli/*.zip
os.environ["PATH"]="{}/ngccli/ngc-cli:{}".format(os.getenv("LOCAL_PROJECT_DIR", ""), os.getenv("PATH", ""))
!cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6 $LOCAL_PROJECT_DIR/ngccli/ngc-cli/libstdc++.so.6

env: LOCAL_PROJECT_DIR=/ngc_content/
env: CLI=ngccli_cat_linux.zip
--2024-11-30 16:06:52--  https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.23.0/files/ngccli_linux.zip
Resolving api.ngc.nvidia.com (api.ngc.nvidia.com)... 52.88.116.192, 44.239.44.161
Connecting to api.ngc.nvidia.com (api.ngc.nvidia.com)|52.88.116.192|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://xfiles.ngc.nvidia.com/org/nvidia/team/ngc-apps/recipes/ngc_cli/versions/3.23.0/files/ngccli_linux.zip?versionId=Bpzrduq29jxiO6V_pwHtxB_RuGz7cqzb&Expires=1733069213&Signature=xzmXwvkNiw4Qxdz8GL3Aec0wP2iJP0EojFab8LCN7TYeK6XVxyl0fwbR69yKVPWupf8DWvxQVSGR4GCSFYerdvwkl0yfoJ-W9wXk7VQpcX~lLzpYqHno155fxcNqcdRuJFbo4r4yA7DKdNrtN4TmDY4ms5h5mSJEKg~FoQgKifqyAofLQSf1F70OEOZULLIGDOJEbmcCraQEUqXAr-8ZbC5b101ec7TB1Ha2WOQbE0lDKkWa5BSTQxeSToHthAOnVo0HjGf8Oh8P7eHgkrvQ3R6X2Oop6jFFuaWzas~t8eJfMXl7DPfqEx41DoXW0t-ow1PRijYU4xG9Lf3tZGpASw__&Key-Pair-Id=KCX06E8E9L60W [following]
--2024-11-3

In [9]:
!ngc registry model list nvidia/tao/pretrained_classification:*

CLI_VERSION: Latest - 3.55.0 available (current: 3.23.0). Please update by using the command 'ngc version upgrade' 

+----------+----------+--------+-------+-------+----------+----------+----------+---------+
| Version  | Accuracy | Epochs | Batch | GPU   | Memory F | File     | Status   | Created |
|          |          |        | Size  | Model | ootprint | Size     |          | Date    |
+----------+----------+--------+-------+-------+----------+----------+----------+---------+
| vgg19    | 77.56    | 80     | 1     | V100  | 153.7    | 153.72   | UPLOAD_C | Aug 18, |
|          |          |        |       |       |          | MB       | OMPLETE  | 2021    |
| vgg16    | 77.17    | 80     | 1     | V100  | 113.2    | 113.16   | UPLOAD_C | Aug 18, |
|          |          |        |       |       |          | MB       | OMPLETE  | 2021    |
| squeezen | 65.13    | 80     | 1     | V100  | 6.5      | 6.46 MB  | UPLOAD_C | Aug 18, |
| et       |          |        |       |       |       

In [11]:
!mkdir -p $EXPERIMENT_DIR/pretrained_resnet18/

In [12]:
# Pull pretrained model from NGC
!ngc registry model download-version nvidia/tao/pretrained_classification:resnet18 --dest $EXPERIMENT_DIR/pretrained_resnet18

Getting files to download...

In [13]:
print("Check that model is downloaded into dir.")
!ls -l $EXPERIMENT_DIR/pretrained_resnet18/pretrained_classification_vresnet18

Check that model is downloaded into dir.
total 91093
-rw------- 1 root root 93278448 Nov 30 16:07 resnet_18.hdf5


## 2. Setup GPU environment <a class="anchor" id="head-2"></a>


### 2.1 Setup Python environment <a class="anchor" id="head-2-1"></a>
Set up the environment needed to run the TAO networks by running the bash script.

In [20]:

###prev code
# FIXME 7: set this path of the uploaded TensorRT tar.gz file after browser download
# trt_tar_path="/content/drive/MyDrive/TAO_Hiwi/linux_tensor_rt/TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-12.0.tar.gz"

# import os
# if not os.path.exists(trt_tar_path):
#   raise Exception("TAR file not found in the provided path")

# # Ensure directory exists
# if not os.path.exists(trt_untar_folder_path):
#     print(f"Creating directory {trt_untar_folder_path}")
#     os.makedirs(trt_untar_folder_path, exist_ok=True)

# # Verify that you have write permissions
# if os.access(trt_untar_folder_path, os.W_OK):
#     print(f"Write access to {trt_untar_folder_path} confirmed.")
# else:
#     print(f"No write access to {trt_untar_folder_path}. Please check your permissions.")

# # FIXME 8: set to path of the folder where the TensoRT tar.gz file has to be untarred into
# #//Commented due to issues
# # %env trt_untar_folder_path=/content/drive/MyDrive/TAO_Hiwi/linux_tensor_rt/trt_untar
# trt_untar_folder_path = "/content/drive/MyDrive/TAO_Hiwi/linux_tensor_rt/trt_untar"


# # FIXME 9: set this to the version of TRT you have downloaded
# %env trt_version=8.6.1.6

# !sudo mkdir -p $trt_untar_folder_path && sudo chmod -R 777 $trt_untar_folder_path/

# import os

# untar = True
# for fname in os.listdir(os.environ.get("trt_untar_folder_path", None)):
#   if fname.startswith("TensorRT-"+os.environ.get("trt_version")) and not fname.endswith(".tar.gz"):
#     untar = False

# if untar:
#   !tar -xzf $trt_tar_path -C /content/trt_untar

# if os.environ.get("LD_LIBRARY_PATH","") == "":
#   os.environ["LD_LIBRARY_PATH"] = ""
# trt_lib_path = f':{os.environ.get("trt_untar_folder_path")}/TensorRT-{os.environ.get("trt_version")}/lib'
# os.environ["LD_LIBRARY_PATH"]+=trt_lib_path
######## PREVIOUS CODE####

import os

# FIXME 7: path of the TensorRT tar.gz file uploaded to Google Drive
trt_tar_path = "/content/drive/MyDrive/TAO_Hiwi/linux_tensor_rt/TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-12.0.tar.gz"
# FIXME 8: folder the TensorRT tar.gz file is extracted into
trt_untar_folder_path = "/content/drive/MyDrive/TAO_Hiwi/linux_tensor_rt/trt_untar"
# FIXME 9: version of TensorRT you downloaded
trt_version = "8.6.1.6"

# Export both values as environment variables so that the `sed` commands
# in the next cell can substitute them into the setup script.
os.environ["trt_untar_folder_path"] = trt_untar_folder_path
os.environ["trt_version"] = trt_version

# Check that the tarball exists
if not os.path.exists(trt_tar_path):
    raise FileNotFoundError(f"TAR file not found at {trt_tar_path}")

print(f"TensorRT tarball found at: {trt_tar_path}")

# Ensure the extraction folder exists
if not os.path.exists(trt_untar_folder_path):
    print(f"Creating directory {trt_untar_folder_path}")
    os.makedirs(trt_untar_folder_path, exist_ok=True)

# Verify write permissions
if os.access(trt_untar_folder_path, os.W_OK):
    print(f"Write access to {trt_untar_folder_path} confirmed.")
else:
    raise PermissionError(f"No write access to {trt_untar_folder_path}. Please check permissions.")

# Extract only if no TensorRT-<version> folder is present yet
untar = True
for fname in os.listdir(trt_untar_folder_path):
    if fname.startswith("TensorRT-" + trt_version) and not fname.endswith(".tar.gz"):
        untar = False

if untar:
    print(f"Extracting TensorRT to: {trt_untar_folder_path}")
    # Use the Python variables directly in the tar command
    !sudo tar -xzf {trt_tar_path} -C {trt_untar_folder_path}

# Append the TensorRT libraries to LD_LIBRARY_PATH (only once, even if the cell is re-run)
trt_lib_path = f"{trt_untar_folder_path}/TensorRT-{trt_version}/lib"
ld_paths = [p for p in os.environ.get("LD_LIBRARY_PATH", "").split(":") if p]
if trt_lib_path not in ld_paths:
    ld_paths.append(trt_lib_path)
os.environ["LD_LIBRARY_PATH"] = ":".join(ld_paths)

print("TensorRT extraction complete and library path updated.")


TensorRT tarball found at: /content/drive/MyDrive/TAO_Hiwi/linux_tensor_rt/TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-12.0.tar.gz
Write access to /content/drive/MyDrive/TAO_Hiwi/linux_tensor_rt/trt_untar confirmed.
Extracting TensorRT to: /content/drive/MyDrive/TAO_Hiwi/linux_tensor_rt/trt_untar
TensorRT extraction complete and library path updated.


In [21]:
import os
if os.environ["GOOGLE_COLAB"] == "1":
    os.environ["bash_script"] = "setup_env.sh"
else:
    os.environ["bash_script"] = "setup_env_desktop.sh"

os.environ["NV_TAO_TF_TOP"] = "/tmp/tao_tensorflow1_backend/"

!sed -i "s|PATH_TO_TRT|$trt_untar_folder_path|g" $COLAB_NOTEBOOKS_PATH/tensorflow/$bash_script
!sed -i "s|TRT_VERSION|$trt_version|g" $COLAB_NOTEBOOKS_PATH/tensorflow/$bash_script
!sed -i "s|PATH_TO_COLAB_NOTEBOOKS|$COLAB_NOTEBOOKS_PATH|g" $COLAB_NOTEBOOKS_PATH/tensorflow/$bash_script

!sh $COLAB_NOTEBOOKS_PATH/tensorflow/$bash_script

--2024-11-30 16:19:54--  https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.199.39.144
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.199.39.144|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4328 (4.2K) [application/x-deb]
Saving to: ‘cuda-keyring_1.0-1_all.deb’


2024-11-30 16:19:54 (267 MB/s) - ‘cuda-keyring_1.0-1_all.deb’ saved [4328/4328]

(Reading database ... 123630 files and directories currently installed.)
Preparing to unpack cuda-keyring_1.0-1_all.deb ...
Unpacking cuda-keyring (1.0-1) over (1.0-1) ...
Setting up cuda-keyring (1.0-1) ...
Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease [1,581 B]
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubunt

## 3. Provide training specification <a class="anchor" id="head-3"></a>
* Training dataset
* Validation dataset
* Pre-trained models
* Other training (hyper-)parameters such as batch size, number of epochs, learning rate etc.
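
The spec file shipped with the notebooks contains the placeholders `TAO_DATA_PATH` and `EXPERIMENT_DIR_PATH`, which this notebook fills in with `sed`. As a rough illustration of what that substitution does, here is a pure-Python sketch. The two-line template is a made-up excerpt, and `fill_spec` is a hypothetical helper. Unlike the `sed` commands, it strips a trailing slash from each directory so the resulting paths do not contain `//` (which is harmless on Linux but harder to read).

```python
def fill_spec(template: str, data_dir: str, experiment_dir: str) -> str:
    """Replace the spec placeholders with concrete directories."""
    replacements = {
        "TAO_DATA_PATH": data_dir.rstrip("/"),
        "EXPERIMENT_DIR_PATH": experiment_dir.rstrip("/"),
    }
    for placeholder, value in replacements.items():
        template = template.replace(placeholder, value)
    return template


# Hypothetical excerpt of a spec template.
template = (
    'train_dataset_path: "TAO_DATA_PATH/split/train"\n'
    'pretrained_model_path: "EXPERIMENT_DIR_PATH/pretrained_resnet18/resnet_18.hdf5"\n'
)
print(fill_spec(template, "/data/", "/results/classification/"))
```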

In [None]:
!pip install tensorboard


    PyYAML (>=5.1.*)
            ~~~~~~^

In [None]:
!sed -i "s|TAO_DATA_PATH|$DATA_DIR/|g" $SPECS_DIR/classification_spec.cfg
!sed -i "s|EXPERIMENT_DIR_PATH|$EXPERIMENT_DIR/|g" $SPECS_DIR/classification_spec.cfg
!cat $SPECS_DIR/classification_spec.cfg

model_config {
  arch: "resnet",
  n_layers: 18
  # Setting these parameters to true to match the template downloaded from NGC.
  use_batch_norm: true
  all_projections: true
  freeze_blocks: 0
  freeze_blocks: 1
  input_image_size: "3,224,224"
}
train_config {
  train_dataset_path: "/content/drive/MyDrive/TAO_Hiwi/data_images//split/train"
  val_dataset_path: "/content/drive/MyDrive/TAO_Hiwi/data_images//split/val"
  pretrained_model_path: "/content/drive/MyDrive/TAO_Hiwi/results/classification//pretrained_resnet18/pretrained_classification_vresnet18/resnet_18.hdf5"
  optimizer {
    sgd {
    lr: 0.01
    decay: 0.0
    momentum: 0.9
    nesterov: False
  }
}
  batch_size_per_gpu: 64
  n_epochs: 10
  n_workers: 16
  preprocess_mode: "caffe"
  enable_random_crop: True
  enable_center_crop: True
  label_smoothing: 0.0
  mixup_alpha: 0.1
  # regularizer
  reg_config {
    type: "L2"
    scope: "Conv2D,Dense"
    weight_decay: 0.00005
  }

  # learning_rate
  lr_config {
    step {
     

## 4. Run TAO training <a class="anchor" id="head-4"></a>
* Provide the sample spec file and the output directory location for models
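
Before launching a long training run, it can save time to check that every path referenced in the spec file actually exists. The sketch below assumes that path fields in the spec look like `some_name_path: "..."`, as in the spec printed above; `missing_spec_paths` is a hypothetical helper, not a TAO command.

```python
import os
import re

# Matches lines such as:  train_dataset_path: "/some/dir"
PATH_FIELD = re.compile(r'^\s*(\w+_path)\s*:\s*"([^"]*)"', re.MULTILINE)


def missing_spec_paths(spec_text: str):
    """Return (field, path) pairs from the spec whose path does not exist on disk."""
    return [
        (field, path)
        for field, path in PATH_FIELD.findall(spec_text)
        if not os.path.exists(path)
    ]


# In the notebook:
# with open(os.path.join(os.environ["SPECS_DIR"], "classification_spec.cfg")) as f:
#     print(missing_spec_paths(f.read()))
```

An empty list means all referenced datasets and model files were found.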

In [None]:
!tao model classification_tf1 train -e $SPECS_DIR/classification_spec.cfg -r $EXPERIMENT_DIR/output -k $KEY

Using TensorFlow backend.
2024-11-30 13:56:43.037783: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
Traceback (most recent call last):
  File "/usr/local/bin/classification_tf1", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/makenet/entrypoint/makenet.py", line 12, in main
    launch_job(nvidia_tao_tf1.cv.makenet.scripts, "classification", sys.argv[1:])
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/entrypoint/entrypoint.py", line 276, in launch_job
    modules = get_modules(package)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/entrypoint/entrypoint.py", line 47, in get_modules
    module = importlib.import_module(module_name)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", l

In [None]:
print("To run this training with data parallelism on multiple GPUs, uncomment the line below and "
      "update the --gpus parameter to the number of GPUs you wish to use.")
# !tao model classification_tf1 train -e $SPECS_DIR/classification_spec.cfg \
#                       -r $EXPERIMENT_DIR/output \
#                       -k $KEY --gpus 2

In [None]:
print("""
      To run this training with model parallelism on multiple GPUs, uncomment the line below and update the
      --gpus parameter to the number of GPUs you wish to use. Also add the related parameters in train_config to
      enable model parallelism. E.g.,

             model_parallelism: 50
             model_parallelism: 50

""")

#!tao model classification_tf1 train -e $SPECS_DIR/classification_spec.cfg \
#                       -r $EXPERIMENT_DIR/output \
#                       -k $KEY --gpus 2 \
#                       -np 1

In [None]:
print("To resume from a checkpoint, use --init_epoch along with your checkpoint configured in the spec file.")
print("Make sure that the model_path in the spec file points to the checkpoint file of the "
      "epoch you wish to resume from. You can choose from the files in the '$EXPERIMENT_DIR/output/weights' folder.")
# !tao model classification_tf1 train -e $SPECS_DIR/classification_spec.cfg \
#                        -r $EXPERIMENT_DIR/output \
#                        -k $KEY --gpus 2 \
#                        --init_epoch N

## 5. Evaluate trained models <a class="anchor" id="head-5"></a>

In this step, we assume that training is complete and the model from the final epoch (`resnet_010.hdf5`, the file name used in the pruning cell of this notebook) is available. To evaluate an earlier model, edit the spec file at `$SPECS_DIR/classification_spec.cfg` to point to that model.
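
If training was interrupted, the final epoch's file may not exist. The sketch below finds the most recent checkpoint, assuming the weight files are named `resnet_<epoch>.hdf5` under `$EXPERIMENT_DIR/output/weights` (the naming used in the pruning cell of this notebook); `latest_epoch` is a hypothetical helper.

```python
import os
import re

CHECKPOINT_NAME = re.compile(r"^resnet_(\d+)\.hdf5$")


def latest_epoch(weights_dir: str):
    """Return the zero-padded epoch string of the newest checkpoint, or None if there is none."""
    epochs = []
    for name in os.listdir(weights_dir):
        match = CHECKPOINT_NAME.match(name)
        if match:
            epochs.append(match.group(1))
    if not epochs:
        return None
    # Compare numerically but keep the zero-padded string, e.g. '010'.
    return max(epochs, key=int)


# In the notebook:
# print(latest_epoch(os.path.join(os.environ["EXPERIMENT_DIR"], "output", "weights")))
```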

In [None]:
!tao model classification_tf1 evaluate -e $SPECS_DIR/classification_spec.cfg -k $KEY

## 6. Prune trained models <a class="anchor" id="head-6"></a>
* Specify pre-trained model
* Equalization criterion
* Threshold for pruning
* Layers to exclude from pruning (e.g. predictions)

Usually, you only need to adjust `-pth` (the pruning threshold) to trade accuracy against model size. A higher `-pth` gives a smaller model (and thus faster inference) but lower accuracy. The right threshold depends on the dataset. The value of 0.6 used in the cell below is just a starting point: if the accuracy after retraining is good, you can increase it to get a smaller model; otherwise, lower it to recover accuracy.
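
One simple way to judge the size side of this trade-off is to compare the file sizes of the unpruned and pruned models. This is a minimal sketch; `size_reduction` is a hypothetical helper, and the paths in the commented usage are the ones used by the pruning cell of this notebook.

```python
import os


def size_reduction(original_path: str, pruned_path: str) -> float:
    """Return the fraction by which the pruned file is smaller (0.75 means 75% smaller)."""
    original = os.path.getsize(original_path)
    pruned = os.path.getsize(pruned_path)
    if original == 0:
        raise ValueError("original model file is empty")
    return 1.0 - pruned / original


# In the notebook, after pruning:
# exp = os.environ["EXPERIMENT_DIR"]
# print(size_reduction(
#     os.path.join(exp, "output", "weights", "resnet_010.hdf5"),
#     os.path.join(exp, "output", "resnet_pruned", "resnet18_nopool_bn_pruned.hdf5"),
# ))
```

File size is only a proxy; the accuracy after retraining is what decides whether a threshold is acceptable.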

In [None]:
# Define the checkpoint epoch number of the model to prune.
# If training was interrupted early, this must not exceed the number of epochs that actually completed.
# By default, the final model is at epoch 010.
%env EPOCH=010
!mkdir -p $EXPERIMENT_DIR/output/resnet_pruned
!tao model classification_tf1 prune -m $EXPERIMENT_DIR/output/weights/resnet_$EPOCH.hdf5 \
           -o $EXPERIMENT_DIR/output/resnet_pruned/resnet18_nopool_bn_pruned.hdf5 \
           -eq union \
           -pth 0.6 \
           -k $KEY \
           --results_dir $EXPERIMENT_DIR/logs

In [None]:
print('Pruned model:')
print('------------')
!ls -rlt $EXPERIMENT_DIR/output/resnet_pruned

## 7. Retrain pruned models <a class="anchor" id="head-7"></a>
* Model needs to be re-trained to bring back accuracy after pruning
* Specify re-training specification

In [None]:
!sed -i "s|TAO_DATA_PATH|$DATA_DIR/|g" $SPECS_DIR/classification_retrain_spec.cfg
!sed -i "s|EXPERIMENT_DIR_PATH|$EXPERIMENT_DIR/|g" $SPECS_DIR/classification_retrain_spec.cfg
!cat $SPECS_DIR/classification_retrain_spec.cfg

In [None]:
!tao model classification_tf1 train -e $SPECS_DIR/classification_retrain_spec.cfg \
                      -r $EXPERIMENT_DIR/output_retrain \
                      -k $KEY

## 8. Testing the model! <a class="anchor" id="head-8"></a>

In this step, we assume that retraining is complete and the model from the final epoch (`resnet_010.hdf5`, the file name used in the inference cell of this notebook) is available. To evaluate an earlier model, edit the spec file at `$SPECS_DIR/classification_retrain_spec.cfg` to point to that model.

In [None]:
!tao model classification_tf1 evaluate -e $SPECS_DIR/classification_retrain_spec.cfg -k $KEY

## 9. Visualize Inferences <a class="anchor" id="head-9"></a>

To see the model's predictions on test images, we can use the `inference` subcommand (`tao model classification_tf1 inference`). Note that models trained for more epochs usually give better results. We'll run inference in directory mode. You can also use single-image mode.

In [None]:
# Define the checkpoint epoch number to use for the subsequent steps.
# If training was interrupted early, this must not exceed the number of epochs that actually completed.
# By default, the final model is at epoch 010.
%env EPOCH=010

In [None]:
!tao model classification_tf1 inference -e $SPECS_DIR/classification_retrain_spec.cfg \
                          -m $EXPERIMENT_DIR/output_retrain/weights/resnet_$EPOCH.hdf5 \
                          -k $KEY -b 32 -d $DATA_DIR/split/test/aeroplane \
                          -cm $EXPERIMENT_DIR/output_retrain/classmap.json \
                          --results_dir $EXPERIMENT_DIR/classification_inference

As explained in the Getting Started Guide, this writes a CSV file of predictions to the results directory (the cell below reads it as `result.csv`). We can use a simple Python program to visualize its contents.
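
Besides plotting a few images, a quick tally of the predicted labels gives a feel for how the model did on the directory. The sketch below assumes each CSV row starts with the image path followed by the predicted label, which is how the visualization cell reads the file; `count_predictions` is a hypothetical helper.

```python
import csv
from collections import Counter


def count_predictions(csv_path: str) -> Counter:
    """Count how often each label (second column) appears in the results CSV."""
    counts = Counter()
    with open(csv_path, newline="") as csv_file:
        for row in csv.reader(csv_file):
            if len(row) >= 2:
                counts[row[1]] += 1
    return counts


# In the notebook:
# import os
# csv_path = os.path.join(os.environ["EXPERIMENT_DIR"], "classification_inference", "result.csv")
# print(count_predictions(csv_path).most_common())
```

Since inference was run on the `aeroplane` test folder, the share of rows labelled `aeroplane` is a rough per-class accuracy.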

In [None]:
import matplotlib.pyplot as plt
from PIL import Image
import os
import csv
from math import ceil

DATA_DIR = os.environ.get('DATA_DIR')
csv_path = os.path.join(os.getenv("EXPERIMENT_DIR","/"), 'classification_inference', 'result.csv')
results = []
with open(csv_path) as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        results.append((row[0], row[1]))

w,h = 200,200
fig = plt.figure(figsize=(30,30))
columns = 5
rows = 1
for i in range(1, columns*rows + 1):
    ax = fig.add_subplot(rows, columns,i)
    print(results[i][0])
    img = Image.open(results[i][0])
    # Image.ANTIALIAS was removed in Pillow 10; Image.LANCZOS is the same filter.
    img = img.resize((w,h), Image.LANCZOS)
    plt.imshow(img)
    ax.set_title(results[i][1], fontsize=40)