# TensorRT-Pose Optimization

First, let's load the JSON file that describes the human pose task.  It is in COCO format: it is the category descriptor pulled from the annotations file, modified slightly to add a neck keypoint.  We will use this task description to create a topology tensor, an intermediate data structure that describes the part linkages and which channels in the part affinity field each linkage corresponds to.

In [1]:
import json
import trt_pose.coco

with open('human_pose.json', 'r') as f:
    human_pose = json.load(f)

topology = trt_pose.coco.coco_category_to_topology(human_pose)

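To make the topology tensor concrete, here is a minimal sketch of the idea in plain Python. It assumes that each row describes one link as `[paf_x_channel, paf_y_channel, part_a, part_b]` and that the COCO `skeleton` uses 1-based keypoint indices. The toy category and the function name `category_to_topology_rows` are illustrative and are not part of `trt_pose`, whose own function returns a tensor.

```python
def category_to_topology_rows(category):
    """Build one [paf_x, paf_y, part_a, part_b] row per skeleton link."""
    rows = []
    for k, (a, b) in enumerate(category["skeleton"]):
        # Link k is assumed to own PAF channels 2k (x) and 2k + 1 (y).
        # COCO skeleton entries are 1-based, so shift them to 0-based.
        rows.append([2 * k, 2 * k + 1, a - 1, b - 1])
    return rows


# Toy category with three keypoints and two links (not the real human_pose.json).
toy_category = {
    "keypoints": ["nose", "neck", "left_shoulder"],
    "skeleton": [[1, 2], [2, 3]],
}

toy_topology = category_to_topology_rows(toy_category)
print(toy_topology)  # prints [[0, 1, 0, 1], [2, 3, 1, 2]]
```

Two links produce four part affinity field channels (0 to 3), which is why the model below is created with `2 * num_links` PAF channels.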


In [2]:
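import time
import torch

def benchmark(model, iterations=100):
    """Return the average throughput of `model` in frames per second."""
    duration = 0.
    for i in range(iterations):
        # Wait for pending GPU work so it is not charged to this iteration.
        torch.cuda.current_stream().synchronize()
        t0 = time.time()
        y = model(data)  # `data` is the example input tensor created in a later cell
        # Wait for inference to finish before stopping the timer.
        torch.cuda.current_stream().synchronize()
        duration += time.time() - t0

    return iterations / duration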
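
The first few calls to a model often include one-time costs such as memory allocation, so it can help to discard some warm-up iterations before timing. The following is a sketch of that variation, not the notebook's method. `measure_fps`, `warmup`, and the `sync` callback are hypothetical names; on a GPU, `sync` would be something like `torch.cuda.synchronize`.

```python
import time


def measure_fps(fn, iterations=100, warmup=10, sync=lambda: None):
    """Call fn() repeatedly and return the average calls per second."""
    for _ in range(warmup):
        fn()  # untimed warm-up calls
    sync()  # make sure warm-up work has finished before starting the clock
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    sync()  # wait for outstanding work before stopping the clock
    elapsed = time.perf_counter() - start
    return iterations / elapsed


# Stand-in workload so the sketch runs without a GPU.
calls = []
fps = measure_fps(lambda: calls.append(sum(range(1000))), iterations=50, warmup=5)
print(f"{fps:.1f} calls per second")
```

With the notebook's objects this might be called as `measure_fps(lambda: model_trt(data), sync=torch.cuda.synchronize)`.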

## Import pre-trained model (PyTorch)

Next, we'll load our model.  Each model takes at least two parameters, *cmap_channels* and *paf_channels*, corresponding to the number of heatmap channels
and part affinity field channels.  The number of part affinity field channels is 2x the number of links, because each link has one channel for the x component
and one for the y component of its vector field.

In [3]:
import trt_pose.models

num_parts = len(human_pose['keypoints'])
num_links = len(human_pose['skeleton'])

model = trt_pose.models.resnet18_baseline_att(num_parts, 2 * num_links).cuda().eval()

Next, let's load the model weights.  You will need to download these according to the table in the README.

In [4]:
import torch

MODEL_WEIGHTS = 'resnet18_baseline_att_224x224_A_epoch_249.pth'

model.load_state_dict(torch.load(MODEL_WEIGHTS))

<All keys matched successfully>

To optimize with TensorRT using the Python library *torch2trt*, we'll also need to create some example data.  The dimensions
of this data should match the dimensions that the network was trained with.  Since we're using the resnet18 variant that was trained on
an input resolution of 224x224, we set the width and height to these dimensions.

In [5]:
WIDTH = 224
HEIGHT = 224

data = torch.zeros((1, 3, HEIGHT, WIDTH)).cuda()

In [6]:
import time

# Baseline: throughput (frames per second) of the unoptimized PyTorch model.
torch.cuda.current_stream().synchronize()
t0 = time.time()
for i in range(50):
    y = model(data)
torch.cuda.current_stream().synchronize()
t1 = time.time()

print(50.0 / (t1 - t0))

2.202997735908961


## Optimization

Next, we'll use [torch2trt](https://github.com/NVIDIA-AI-IOT/torch2trt) to optimize the model.  We'll build the engine both with fp16_mode disabled (FP32) and with fp16_mode enabled, which allows the optimizations to use reduced (half) precision, and with two values of max_workspace_size.

The optimized model can be saved so that we do not need to perform the optimization again; we can just load it.  Please note that TensorRT applies device-specific optimizations, so an optimized model can only be used on similar platforms.
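
The save-then-reload pattern described above can be wrapped in a small helper that only rebuilds when no saved file exists. This is a sketch with hypothetical names (`load_or_build`, `build`, `load`). In this notebook, `build` would wrap the `torch2trt.torch2trt(...)` call plus `torch.save`, and `load` would wrap `TRTModule.load_state_dict`. A plain text file stands in for the optimized model so the example runs anywhere.

```python
import os
import tempfile


def load_or_build(path, build, load):
    """Return load(path) if the file exists, otherwise build(path) first."""
    if not os.path.exists(path):
        build(path)  # expensive step: runs only when nothing is cached
    return load(path)


build_calls = []


def build(path):
    # Stand-in for optimizing the model and saving it to `path`.
    build_calls.append(path)
    with open(path, "w") as f:
        f.write("optimized-model")


def load(path):
    # Stand-in for loading the saved, optimized model.
    with open(path) as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_trt.pth")
    first = load_or_build(path, build, load)   # builds, then loads
    second = load_or_build(path, build, load)  # loads only

print(len(build_calls))  # prints 1
```

Because a saved model is tied to the platform it was built on, a cached file copied from a different device should be rebuilt rather than loaded.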

In [7]:
import torch2trt
import time

### FP32 -- 1GB RAM

In [9]:
OPTIMIZED_MODEL = 'resnet18_baseline_att_224x224_trt_FP32_1GB.pth'

model_trt = torch2trt.torch2trt(model, [data], fp16_mode=False, max_workspace_size=1<<30)
torch.save(model_trt.state_dict(), OPTIMIZED_MODEL)
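
The `max_workspace_size` argument is a size in bytes, so the shift expressions used in this notebook correspond to the section labels as shown in the sketch below. This is simple arithmetic rather than anything specific to torch2trt; the variable names are illustrative.

```python
GIB = 1 << 30  # bytes in one gibibyte
MIB = 1 << 20  # bytes in one mebibyte

large_workspace = 1 << 30
small_workspace = 1 << 25

print(large_workspace, large_workspace / GIB)  # prints 1073741824 1.0
print(small_workspace, small_workspace / MIB)  # prints 33554432 32.0
print(round(small_workspace / 1e6, 1))         # prints 33.6
```

So the "1GB" label is exactly 1 GiB, and the "33MB" label is 32 MiB, or about 33.6 million bytes.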

In [11]:
model_trt = torch2trt.TRTModule()
model_trt.load_state_dict(torch.load(OPTIMIZED_MODEL))

<All keys matched successfully>

In [12]:
benchmark(model_trt)

11.246368489326686

### FP32 -- 33MB RAM

In [8]:
OPTIMIZED_MODEL = 'resnet18_baseline_att_224x224_trt_FP32_33MB.pth'

model_trt = torch2trt.torch2trt(model, [data], fp16_mode=False, max_workspace_size=1<<25)
torch.save(model_trt.state_dict(), OPTIMIZED_MODEL)

In [9]:
model_trt = torch2trt.TRTModule()
model_trt.load_state_dict(torch.load(OPTIMIZED_MODEL))

<All keys matched successfully>

In [12]:
benchmark(model_trt)

11.557302527115603

### FP16 -- 1GB RAM

In [8]:
OPTIMIZED_MODEL = 'resnet18_baseline_att_224x224_trt_FP16_1GB.pth'

model_trt = torch2trt.torch2trt(model, [data], fp16_mode=True, max_workspace_size=1<<30)
torch.save(model_trt.state_dict(), OPTIMIZED_MODEL)

In [9]:
model_trt = torch2trt.TRTModule()
model_trt.load_state_dict(torch.load(OPTIMIZED_MODEL))

<All keys matched successfully>

In [11]:
benchmark(model_trt)

14.916258781248613

### FP16 -- 33MB RAM

In [12]:
OPTIMIZED_MODEL = 'resnet18_baseline_att_224x224_trt_FP16_33MB.pth'

model_trt = torch2trt.torch2trt(model, [data], fp16_mode=True, max_workspace_size=1<<25)
torch.save(model_trt.state_dict(), OPTIMIZED_MODEL)

In [13]:
model_trt = torch2trt.TRTModule()
model_trt.load_state_dict(torch.load(OPTIMIZED_MODEL))

<All keys matched successfully>

In [16]:
benchmark(model_trt)

14.929419114343837
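
To compare the runs, the throughput figures printed above can be turned into speedups over the PyTorch baseline. The values below are copied from the outputs in this notebook, rounded to two decimals, and the dictionary keys are just labels. Note that the baseline cell timed 50 iterations in one block while `benchmark` timed 100 iterations individually, so the ratios are approximate.

```python
baseline_fps = 2.20  # unoptimized PyTorch model

trt_fps = {
    "FP32, 1GB": 11.25,
    "FP32, 33MB": 11.56,
    "FP16, 1GB": 14.92,
    "FP16, 33MB": 14.93,
}

speedups = {name: fps / baseline_fps for name, fps in trt_fps.items()}
for name, speedup in speedups.items():
    print(f"{name}: {speedup:.1f}x faster than the baseline")
```

In these measurements the workspace size makes little difference, while enabling FP16 raises throughput from about 11 to about 15 frames per second.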