# SWS3009 Lab 5B - Pose Estimation with YOLOv7

## 1. Introduction

In this part of Lab 5 we will look at how to do pose estimation with YOLOv7. This part of the lab is just an introduction to using the pose estimation head in YOLOv7.

## 2. Building the Pose Estimator

We start off by importing our libraries:


In [1]:
import sys
# Make the YOLOv7 repository importable before importing its utils modules
sys.path.insert(0, './yolov7')

import torch
from torchvision import transforms
from utils.datasets import letterbox
from utils.general import non_max_suppression_kpt
from utils.plots import output_to_keypoint, plot_skeleton_kpts
import matplotlib.pyplot as plt
import cv2
import numpy as np

Next we select a PyTorch device. Here we use the CPU, but if you have a CUDA-compatible GPU, you can instead use device=torch.device("cuda").

The load_model function simply calls torch.load to load the yolov7-w6-pose.pt weights file. The model.float().eval() call converts the weights to 32-bit floats and puts the model in evaluation mode, so that layers such as dropout and batch normalization behave consistently during inference.

<b>Important:</b> You must have the yolov7-w6-pose.pt weights file, which is included in your SWS3009Lab5.zip file.

In [2]:
device=torch.device("cpu")

def load_model():
    global device
    model = torch.load('yolov7-w6-pose.pt', map_location=device)['model']
    # Turn the model into a float model
    model.float().eval()
    
    return model

model = load_model()



Our run_inference function takes an image, resizes and pads it to a form suitable for YOLOv7, converts it to an image tensor, then calls the model. Note that we call the model inside "torch.no_grad", which disables gradient tracking: we are only running inference, so no gradients are needed and skipping them saves memory and time.

run_inference returns the raw model output together with the preprocessed image tensor. The raw output contains many overlapping candidate detections, so in the next function, draw_keypoints, we use non-maximum suppression to remove most of the duplicates.
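
The resize-and-pad step in run_inference can be illustrated with a small, pure-Python sketch of the geometry. It assumes that letterbox follows the usual YOLO convention: scale the image so it fits inside a 960 x 960 square while keeping the aspect ratio, then pad only up to the next multiple of the stride. The helper name letterbox_shape is hypothetical and is not part of the YOLOv7 code.

```python
def letterbox_shape(height, width, new_size=960, stride=64):
    """Return the (height, width) after aspect-preserving resize and minimal padding.

    Hypothetical helper for illustration; it only computes shapes, not pixels.
    """
    # Scale so that the whole image fits inside new_size x new_size
    ratio = min(new_size / height, new_size / width)
    resized_h = round(height * ratio)
    resized_w = round(width * ratio)
    # Pad each dimension only up to the next multiple of the stride
    pad_h = (new_size - resized_h) % stride
    pad_w = (new_size - resized_w) % stride
    return resized_h + pad_h, resized_w + pad_w


print(letterbox_shape(1080, 1920))  # (576, 960)
print(letterbox_shape(480, 640))    # (768, 960)
```

Under these assumptions both output dimensions are always multiples of the stride, which is what the network's downsampling layers require.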

In [3]:
def run_inference(image):
    # Resize and pad image. First return value is the resized image
    # second is ratio, then dw and dh
    # Resize to [567, 960, 3]
    image = letterbox(image, 960, stride=64, auto = True)[0]
    
    # torch.Size([3, 567, 960]). Converts the HWC image array to a
    # CHW float tensor with values in [0, 1]
    image = transforms.ToTensor()(image)
    # Adds an additional dimension of 1 at indicated position
    # Turns it into a batch
    image = image.unsqueeze(0)
    
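    # Move the tensor to the selected device. Tensor.to() returns a new
    # tensor, so the result must be assigned back
    image = image.to(device)
    # no_grad disables gradient tracking during inference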
    with torch.no_grad():
        output, _ = model(image)
    return output, image

# To display images on the web
from IPython.display import Image

Our next step is to draw the keypoints. After non-maximum suppression and the call to output_to_keypoint, the keypoints are an $n \times 58$ matrix, i.e. there are $n$ rows of $58$ elements, where $n$ is the number of people detected. Each row of 58 numbers consists of:

1. 7 numbers that represent the batch ID, class ID, x, y, width, height and confidence score of the object detected.
2. 17 "keypoints" consisting of x, y and confidence values (total is $17 \times 3 = 51$ values). A "keypoint" is a point on the "skeleton".  See this diagram for details:

![](https://i.stack.imgur.com/HG8dB.png)


You can use output\[idx, 7:\] to access the keypoints of person idx. Here we just call plot_skeleton_kpts to draw the skeleton. The number of people detected is given by output.shape\[0\].
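
To make the row layout concrete, here is a minimal sketch that splits one 58-element row into its 7 box fields and 17 named keypoints. The keypoint names follow the COCO ordering, which is assumed (not confirmed by this lab) to match the ordering used by the pose model; parse_detection and KEYPOINT_NAMES are hypothetical names, and the row used below is synthetic.

```python
# COCO keypoint order, assumed to match the 17 keypoints of the pose model
KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


def parse_detection(row):
    """Split one 58-element row into its box fields and named keypoints."""
    if len(row) != 58:
        raise ValueError("expected 58 values per detection")
    batch_id, class_id, x, y, w, h, conf = row[:7]
    box = {"batch_id": int(batch_id), "class_id": int(class_id),
           "x": x, "y": y, "width": w, "height": h, "confidence": conf}
    kpts = row[7:]
    # Each keypoint is an (x, y, confidence) triple
    keypoints = {
        name: (kpts[3 * i], kpts[3 * i + 1], kpts[3 * i + 2])
        for i, name in enumerate(KEYPOINT_NAMES)
    }
    return box, keypoints


# Synthetic row for illustration only (not real model output)
row = [0, 0, 480.0, 288.0, 100.0, 200.0, 0.9] + [float(v) for v in range(51)]
box, keypoints = parse_detection(row)
print(box["confidence"])  # 0.9
print(keypoints["nose"])  # (0.0, 1.0, 2.0)
```

The same function works on a row of the NumPy array, e.g. parse_detection(output\[idx\]), since it only relies on slicing and indexing.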


In [4]:
# The output of run_inference contains many candidate detections
# We run non-maximum suppression to pick the best

def draw_keypoints(output, image):
    # 0.25 confidence threshold, 0.65 IoU threshold
    # nc = number of classes
    output = non_max_suppression_kpt(output, 0.25, 0.65,
                                     nc = model.yaml['nc'],
                                     nkpt = model.yaml['nkpt'],
                                     kpt_label = True)
    with torch.no_grad():
        output = output_to_keypoint(output)
    
    # Permute dimensions of tensor
    nimg = image[0].permute(1, 2, 0) * 255
    
    # tensor.cpu() returns copy of tensor in cpu memory
    nimg = nimg.cpu().numpy().astype(np.uint8)
    
    # Convert colorspace from standard RGB to 
    # CV2 BGR
    nimg = cv2.cvtColor(nimg, cv2.COLOR_RGB2BGR)
    
    # Plot the skeleton for each person detected. The number
    # of persons detected is in output.shape[0]. The keypoints
    # are from index 7 onwards for each person. Index 0 to 6
    # are the batch ID, class ID, x, y, width, height and
    # confidence score for the object identified.
    
    for idx in range(output.shape[0]):
        plot_skeleton_kpts(nimg, output[idx, 7:].T, 3)
    
    return nimg
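
The two thresholds passed to non_max_suppression_kpt are easier to understand with a generic sketch of greedy non-maximum suppression. This is not the implementation used by YOLOv7; it is a plain-Python illustration that assumes boxes in (x1, y1, x2, y2) form, and the names iou and greedy_nms are hypothetical. Detections below the confidence threshold are dropped, and a detection is suppressed when it overlaps an already kept, higher-scoring box by more than the IoU threshold.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, conf_thres=0.25, iou_thres=0.65):
    """Return the indices of the boxes that are kept, highest score first."""
    # Drop low-confidence boxes, then visit the rest in descending score order
    order = sorted((i for i, s in enumerate(scores) if s >= conf_thres),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too strongly
        if all(iou(boxes[i], boxes[k]) <= iou_thres for k in keep):
            keep.append(i)
    return keep


# Synthetic boxes: 0 and 1 are near-duplicates, 3 has low confidence
boxes = [(10, 10, 110, 210), (12, 8, 112, 208), (300, 50, 380, 220), (0, 0, 20, 20)]
scores = [0.90, 0.80, 0.70, 0.10]
print(greedy_nms(boxes, scores))  # [0, 2]
```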


The next function captures video from the camera or a video file (pose estimation is more useful on video than on still pictures) by calling cv2.VideoCapture. We then read each video frame and call run_inference and draw_keypoints to draw the pose skeleton.

Note that CV2 by default uses a BGR color space instead of RGB, hence we need to call cvtColor to convert from BGR to RGB.
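
The color and layout conversions that a frame goes through can be sketched with NumPy alone. This is only an illustration on a synthetic 2 x 2 frame: it treats the BGR to RGB conversion as a reversal of the channel axis, and mimics the conversion to a tensor as a move from height x width x channel bytes to channel x height x width floats in the range 0 to 1. The real code uses cv2.cvtColor and torchvision's ToTensor instead.

```python
import numpy as np

# Synthetic 2x2 frame in BGR order, as OpenCV would deliver it: pure blue
frame_bgr = np.zeros((2, 2, 3), dtype=np.uint8)
frame_bgr[..., 0] = 255

# BGR -> RGB is a reversal of the channel axis
frame_rgb = frame_bgr[..., ::-1]
print(frame_rgb[0, 0])  # [  0   0 255]

# HWC uint8 in [0, 255] -> CHW float in [0, 1], plus a batch dimension
chw = frame_rgb.transpose(2, 0, 1).astype(np.float32) / 255.0
batch = chw[np.newaxis, ...]
print(batch.shape)  # (1, 3, 2, 2)

# Reverse the steps, as draw_keypoints does before drawing the skeleton
restored_rgb = (batch[0].transpose(1, 2, 0) * 255).astype(np.uint8)
restored_bgr = restored_rgb[..., ::-1]
print(np.array_equal(restored_bgr, frame_bgr))  # True
```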

In [5]:
# Use a filename of 0 to capture the camera.
def pose_estimation_video(filename, outfilename = None):
    cap = cv2.VideoCapture(filename)
    
    # Filename, fourcc code, fps, frame dimensions. fourcc code
    # specifies the codec
    
    if outfilename is not None:
        # Video writer to capture to MP4
        fourcc = cv2.VideoWriter_fourcc(*'mp4v')
        out = cv2.VideoWriter(outfilename, fourcc, 30.0, (int(cap.get(3)), 
                             int(cap.get(4))))
    else:
        out = None
    
    while cap.isOpened():
        (ret, frame) = cap.read()
        if ret:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            output, frame = run_inference(frame)
            frame = draw_keypoints(output, frame)
            frame = cv2.resize(frame, (int(cap.get(3)), int(cap.get(4))))
            if out is not None:
                out.write(frame)
            cv2.imshow('Pose Estimation', frame)
        else:
            break
        
        if cv2.waitKey(15) &  0xFF == ord('q'):
            break
            
    cap.release()
    
    if out is not None:
        out.release()
    cv2.destroyAllWindows()

Finally, we call pose_estimation_video to estimate the pose of each person in the video.

In [7]:
#pose_estimation_video("./ice-skating.mp4")
pose_estimation_video(0, outfilename="camera.mp4")

## 3. Conclusion

This very short lab shows you how to perform pose estimation with YOLOv7.