# Litepose 
[Litepose][1] proposes an efficient way to perform multi-person pose estimation, providing a scale-invariant and reliable architecture with a low computational cost. It follows a bottom-up approach, namely it uses a single network for both keypoint estimation and grouping. In this implementation I have also relied on two other papers (both cited by Litepose): [HigherHRNet][2], which proposes the main architecture, and [Associative Embedding][3], which introduces a way to assign identity-free keypoints to the person they belong to.

Litepose modifies the HigherHRNet architecture, going from a multi-branch design to a single-branch one through gradual shrinking. The purpose is to speed up inference so that the model can also run on devices with low computational power.

The architecture uses a MobileNet backbone with **Large Kernel Convolutions**, which have shown great results empirically. The backbone output is passed to multiple deconvolutional (transposed convolutional) blocks that implement the main feature of the Litepose paper, the **Fusion Deconv Head**. This head produces scale-aware results by merging intermediate backbone features with refined features. In this way the network can exploit high-resolution features, which help to detect close joints and small people, without requiring a multi-branch architecture.

Let $t$ be the number of convolutional blocks in the backbone and $n$ the index of the current deconvolutional block. Feature fusion is implemented by summing the features of the deconvolutional block in position $n$ with the features of the backbone block in position $t-n-1$, refined by an additional convolutional layer. The merged features of each deconvolutional layer are then passed to a final block that produces the output. The results of the network are provided at several scales, one for each deconvolutional layer. Hence the output has shape $(m,j,s_i,s_i)$, where $m$ is the number of scales (i.e. deconv blocks), $j$ is the number of joints to detect and $s_i$, for $i\in\{1,2,...,m\}$, is the size of scale $i$. The image below clarifies the network structure.

![Network Architecture!](../assets/structure.png)
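
The fusion rule described above can be sketched with NumPy. This is only an illustration, not the code of this repository: `fusion_pairs` and `fuse` are hypothetical names, the refining layer is assumed to be a 1x1 convolution applied to the backbone features (the text only mentions "an additional convolutional layer"), and deconvolutional blocks are assumed to be counted from 1 while backbone blocks are counted from 0.

```python
import numpy as np

def fusion_pairs(t, num_deconv):
    """Pair each deconv block n (1-based, assumed) with backbone block t - n - 1."""
    return [(n, t - n - 1) for n in range(1, num_deconv + 1)]

def fuse(deconv_feat, backbone_feat, weight):
    """Sum deconv features with backbone features refined by a 1x1 convolution.

    deconv_feat:   (C_out, H, W) output of the deconvolutional block
    backbone_feat: (C_in, H, W) intermediate backbone features, same resolution
    weight:        (C_out, C_in) kernel of the assumed 1x1 convolution
    """
    # A 1x1 convolution is a per-pixel linear mix of the input channels.
    refined = np.einsum("oc,chw->ohw", weight, backbone_feat)
    return deconv_feat + refined

# With t = 5 backbone blocks and 3 deconv blocks: [(1, 3), (2, 2), (3, 1)]
pairs = fusion_pairs(t=5, num_deconv=3)
fused = fuse(np.ones((4, 8, 8)), np.ones((6, 8, 8)), np.full((4, 6), 0.5))
```

The later the deconvolutional block, the earlier (and higher resolution) the backbone block it is paired with, which is what lets a single branch reuse fine-grained features.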

The output of the network is a list of multi-channel results, one per scale. Each result has $2k$ channels, where $k$ is the number of joints of each person. The first $k$ channels should be interpreted as a 2D probability distribution for each joint. Since the model also handles multi-person detection, each distribution can have multiple peaks (ideally one per person). The last $k$ channels, called **tags**, are used to assign identity-free keypoints to each person, i.e. to connect the keypoints together. The main idea behind tags is that two keypoints are connected to each other if their tags are the closest among every possible pair.
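
As a small illustration of this channel layout, the sketch below splits one per-scale result into its heatmap and tag halves. `split_output` is a hypothetical helper, and the joint count and spatial size are example values rather than values read from the project configuration.

```python
import numpy as np

def split_output(scale_output, num_joints):
    """Split one (2k, H, W) result into k heatmap channels and k tag channels."""
    heatmaps = scale_output[:num_joints]
    tags = scale_output[num_joints:2 * num_joints]
    return heatmaps, tags

k = 14  # example joint count, not read from the project config
result = np.random.rand(2 * k, 64, 64)  # stand-in for one scale of the output
heatmaps, tags = split_output(result, k)
```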


[1]:https://openaccess.thecvf.com/content/CVPR2022/papers/Wang_Lite_Pose_Efficient_Architecture_Design_for_2D_Human_Pose_Estimation_CVPR_2022_paper.pdf
[2]:https://arxiv.org/pdf/1908.10357.pdf
[3]:https://papers.nips.cc/paper/2017/file/8edd72158ccd2a879f79cb2538568fdc-Paper.pdf

This file should be seen only as an entry point that calls wrapper functions; their implementations can be found in the subdirectories of the repository.   
Every hyperparameter can be edited in `src/lp_config`.  
This project is highly scalable and customizable: you can adapt the model to your problem by modifying several parameters (number of joints, maximum number of persons, confidence thresholds...), and you can also change the network architecture to manage the performance-latency tradeoff.  
`lp_common_config.py` contains the general configuration for dataset loading, training and inference, while `lp_model_config.py` contains the parameters that encode the model structure.

In [3]:
from lp_coco_utils.lp_getDataset import getDatasetProcessed
from lp_training.lp_trainer import train
from lp_model.lp_litepose import LitePose
from lp_inference.lp_inference import inference, assocEmbedding
from lp_utils.lp_image_processing import drawHeatmap, drawKeypoints, drawSkeleton
from lp_testing.lp_evaluate import evaluateModel

import lp_config.lp_common_config as cc
import torch
import cv2

# Dataset 
The model has been trained and validated on CrowdPose. This dataset contains high-quality multi-person images together with keypoint annotations. As preprocessing, the keypoints have been turned into heatmaps at different scales, which makes it possible to define a loss function based on the mean squared error between heatmaps. To reduce overfitting, several data augmentation transformations have been applied, such as image rotation, scaling, translation and flipping.
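
The keypoint-to-heatmap preprocessing can be sketched as follows. This is not the code taken from the official repository: the function names are hypothetical, and both the Gaussian width `sigma=2.0` and the use of an element-wise maximum where Gaussians overlap are assumptions made for illustration.

```python
import numpy as np

def keypoints_to_heatmap(keypoints, size, sigma=2.0):
    """Render (x, y) keypoints of one joint type as a (size, size) heatmap."""
    ys, xs = np.mgrid[0:size, 0:size]
    heatmap = np.zeros((size, size))
    for x, y in keypoints:
        gaussian = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, gaussian)  # keep one peak per person
    return heatmap

def multi_scale_heatmaps(keypoints, image_size, scales, sigma=2.0):
    """Rescale image-space keypoints to each output scale and render them."""
    heatmaps = []
    for s in scales:
        ratio = s / image_size
        scaled = [(x * ratio, y * ratio) for x, y in keypoints]
        heatmaps.append(keypoints_to_heatmap(scaled, s, sigma))
    return heatmaps

# One joint annotated at pixel (x=64, y=128) of a 256x256 image.
hms = multi_scale_heatmaps([(64, 128)], image_size=256, scales=[64, 128])
```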

# Training

During training, the Adam optimizer has been used, as it provided better results than SGD.  

The applied loss is made up of multiple components:
- Heatmap mean squared error: the first $k$ channels are compared to the ground truth heatmaps generated during dataset preprocessing. The channel-wise squared difference, which represents the estimation error, is then summed along the scale dimensions.
- Tag aggregate component: it aims to reduce the variance among the tags of the joints that belong to a single person. To define this component, it is useful to compute the mean tag of each person $n$ over its $K$ joints as $$ \overline{h_n} = \frac{1}{K} \sum_{k} h_k(x_{nk}) $$ where $h_k(x)$ is the tag value of joint $k$ at pixel location $x$ and $x_{nk}$ is the ground truth position of joint $k$ of person $n$. The loss component is then calculated as $$L_{aggr}=\frac{1}{NK}\sum_{n}\sum_{k}(\overline{h_n}-h_k(x_{nk}))^2$$ where $N$ is the number of people.
- Tag push component: its purpose is to maximize the difference between the mean tags of different people, so that they can be told apart. This component is formalized as $$L_{push}=\frac{1}{N^2}\sum_{n}\sum_{n'}\exp\left(-\frac{1}{2}(\overline{h_n} - \overline{h_{n'}})^2\right) $$

The total loss optimized by the model is $L = L_{MSE}+L_{aggr}+L_{push}$
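
The three components can be written out in NumPy to make the formulas concrete. This is a sketch under stated assumptions, not the training code: the function names are hypothetical, every joint of every person is assumed to be annotated, the tags are assumed to be already sampled at the ground truth positions, and averaging the squared heatmap error within each scale before summing over scales is my own choice.

```python
import numpy as np

def heatmap_mse(pred_scales, gt_scales):
    """Sum over scales of the mean squared heatmap error (per-scale mean assumed)."""
    return sum(float(np.mean((p - g) ** 2)) for p, g in zip(pred_scales, gt_scales))

def tag_losses(tags):
    """Compute (L_aggr, L_push) from an (N, K) array of tags h_k(x_nk)."""
    tags = np.asarray(tags, dtype=float)
    n_people, n_joints = tags.shape
    means = tags.mean(axis=1)  # mean tag of each person
    l_aggr = ((means[:, None] - tags) ** 2).sum() / (n_people * n_joints)
    diff = means[:, None] - means[None, :]  # all pairs (n, n'), including n = n'
    l_push = np.exp(-0.5 * diff ** 2).sum() / n_people ** 2
    return float(l_aggr), float(l_push)

tags = np.array([[1.0, 1.2, 0.8], [3.0, 3.1, 2.9]])  # 2 people, 3 joints
l_aggr, l_push = tag_losses(tags)
l_mse = heatmap_mse([np.zeros((3, 4, 4))], [np.zeros((3, 4, 4))])
total = l_mse + l_aggr + l_push
```

Note that, as the formula is written, the push term also sums the pairs with $n = n'$, each of which contributes a constant 1 and therefore does not affect the gradient.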

Running the training again may take a while. I suggest skipping the cell below.

In [None]:
train(cc.config["batch_size"])

# Inference
Since the model returns $2k$ channels, the output has to be processed to obtain interpretable results. First of all, the different output scales are resized to a common scale through bilinear interpolation, so that the keypoints can be extracted at the right scale.  
Keypoint positions are obtained from the predicted heatmaps by selecting $n$ peaks for each joint, where $n$ is the maximum number of people that the image may contain. The obtained keypoints are then filtered according to a confidence threshold.  
At this point the identity-free keypoints are grouped into persons by using tags. Since the connection sequence is known (i.e. head with neck, neck with shoulders...), two nodes can be connected by exploiting this knowledge and checking their tag distance. Two nodes have an edge between them iff they belong to adjacent joint classes, their tag distance is the minimum among all candidate pairs of nodes, and their tag distance is below an additional confidence threshold.  
The overall results depend on these confidence thresholds, which can be tuned by considering domain-dependent information such as the image perspective and the noise.
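
The peak selection and tag-based grouping steps can be sketched as below. This is one possible reading of the procedure, not the code behind `inference` or `assocEmbedding`: the function names are hypothetical, the 3x3 local-maximum filter is an addition of mine so that neighbouring pixels of the same peak are not selected twice, and the greedy closest-pair matching is an assumed interpretation of the "minimum tag distance" rule.

```python
import numpy as np

def top_peaks(heatmap, max_people, conf_threshold):
    """Return up to max_people (x, y, score) local maxima above the threshold."""
    heatmap = np.asarray(heatmap, dtype=float)
    h, w = heatmap.shape
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    neighbourhood = np.stack(
        [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    )
    # Keep a pixel only if it is the maximum of its 3x3 neighbourhood.
    scores = np.where(heatmap >= neighbourhood.max(axis=0), heatmap, -np.inf)
    flat = scores.ravel()
    peaks = []
    for idx in np.argsort(flat)[::-1][:max_people]:
        if flat[idx] < conf_threshold:
            break
        y, x = divmod(int(idx), w)
        peaks.append((x, y, float(flat[idx])))
    return peaks

def link_joints(tags_a, tags_b, tag_threshold):
    """Greedily pair keypoints of two adjacent joint classes by tag distance."""
    candidates = sorted(
        (abs(ta - tb), i, j)
        for i, ta in enumerate(tags_a)
        for j, tb in enumerate(tags_b)
    )
    used_a, used_b, edges = set(), set(), []
    for dist, i, j in candidates:
        if dist >= tag_threshold:
            break  # all remaining pairs are even farther apart
        if i in used_a or j in used_b:
            continue  # each keypoint joins at most one edge of this type
        used_a.add(i)
        used_b.add(j)
        edges.append((i, j))
    return edges

hm = np.zeros((8, 8))
hm[2, 3], hm[2, 4], hm[5, 6] = 0.9, 0.5, 0.6
peaks = top_peaks(hm, max_people=3, conf_threshold=0.3)
edges = link_joints([0.0, 2.0], [2.25, 0.5, 5.0], tag_threshold=1.0)
```

Here the pixel with score 0.5 is discarded because it sits next to the stronger 0.9 peak, and the keypoint with tag 5.0 stays unconnected because no candidate partner lies within the tag threshold.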

Unfortunately, the OpenCV function `imshow()` has a well-known issue with Python notebooks: please close each image window by pressing any key, not by clicking the x icon.

In [4]:
model = LitePose().to(cc.config["device"])
model.load_state_dict(torch.load("lp_trained_models/bigarch", map_location=cc.config["device"]))

<All keys matched successfully>

In [5]:
ds = getDatasetProcessed("validation")

data_loader = torch.utils.data.DataLoader(
    ds,
    batch_size=8
)

it = iter(data_loader)
row = next(it)
images = row[0].to(cc.config["device"])
gthm = row[1]

loading annotations into memory...
Done (t=0.11s)
creating index...
index created!


In [6]:
output, keypoints = inference(model, images)

In [8]:
embedding = assocEmbedding(keypoints)

The images below compare the ground truth heatmaps with the heatmaps predicted by the model.

In [7]:
jointsHeatmap = output[1][2][:cc.config["num_joints"]]

img, finalHm, superimposed = drawHeatmap(images[2], jointsHeatmap)
img, gtfinalHm, gtsuperimposed = drawHeatmap(images[2], gthm[1][2])
cv2.imshow("Image", img)
cv2.imshow("Final heatmap", finalHm)
cv2.imshow("Superimposed", superimposed)

cv2.imshow("Ground Truth heatmap", gtfinalHm)
cv2.imshow("Ground Truth Superimposed", gtsuperimposed)
cv2.waitKey()
cv2.destroyAllWindows()

![Heatmaps](../assets/hms_report.png)

Finally, both keypoints and skeletons can be visualized. This visualization is also useful for tuning the confidence thresholds.

In [10]:
idx = 0
img = drawKeypoints(images[idx], keypoints[idx])
cv2.imshow("Image Keypoints", img)
cv2.waitKey()
cv2.destroyAllWindows()

![Keypoints](../assets/kps_report.png)

In [11]:
idx = 7
img = drawSkeleton(images[idx], embedding[idx])
cv2.imshow("Image Keypoints", img)
cv2.waitKey()
cv2.destroyAllWindows()

![Pose](../assets/pose_report.png)

Following the original LitePose paper, Object Keypoint Similarity (OKS) is used as the performance evaluation metric. It takes into account only the keypoints, disregarding the connections between them. Despite this, OKS is still a good metric because keypoint prediction and tag estimation performance are related to each other, as shown by the similar decrease of their losses during training. The metric (slightly modified) is defined as: $$ OKS = \frac{\sum_{i}\exp\left(-\frac{d_i^2}{2k^2}\right)\delta(v_i>0)}{\sum_{i}\delta(v_i>0)} $$
where $d_i$ is the Euclidean distance between detected keypoint $i$ and its ground truth position, $k$ is a constant (not the number of joints) and $\delta(v_i>0)$ is a function that is 1 if the keypoint is valid and 0 otherwise.
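
The formula translates almost directly into NumPy. The sketch below is not the code behind `evaluateModel`: the function name `oks` is hypothetical, the constant $k$ is left as a parameter because its value is not stated here, and returning 0 when no keypoint is valid is an assumption.

```python
import numpy as np

def oks(pred, gt, visibility, k):
    """Object Keypoint Similarity between predicted and ground truth keypoints.

    pred, gt:   (num_keypoints, 2) arrays of (x, y) positions
    visibility: (num_keypoints,) array, a keypoint is valid when its value is > 0
    k:          constant that controls how fast the similarity decays
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = np.asarray(visibility) > 0
    if not valid.any():
        return 0.0  # assumption: no valid keypoint means no similarity
    d2 = ((pred - gt) ** 2).sum(axis=1)  # squared Euclidean distances d_i^2
    similarity = np.exp(-d2 / (2 * k ** 2))
    return float(similarity[valid].sum() / valid.sum())

gt = [[0, 0], [10, 0], [5, 5]]
pred = [[0, 0], [10, 2], [100, 100]]  # last keypoint is far off but not valid
score = oks(pred, gt, visibility=[1, 1, 0], k=2)
```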

In [12]:
res = evaluateModel(model)
print(f"Object Keypoint Similarity (OKS) score: {res*100}%")

loading annotations into memory...
Done (t=0.08s)
creating index...
index created!


100%|██████████| 250/250 [08:47<00:00,  2.11s/it]

Object Keypoint Similarity (OKS) score: 56.11389455608193%





In [14]:
from thop import profile
row = next(it)
images = row[0].to(cc.config["device"])
macs, parameters = profile(model, inputs=(images,))

print(f"Model MACs: {macs}\nModel Parameters: {parameters}")

[INFO] Register count_convNd() for <class 'torch.nn.modules.conv.Conv2d'>.
[INFO] Register count_normalization() for <class 'torch.nn.modules.batchnorm.BatchNorm2d'>.
[INFO] Register zero_ops() for <class 'torch.nn.modules.activation.ReLU6'>.
[INFO] Register zero_ops() for <class 'torch.nn.modules.container.Sequential'>.
[INFO] Register count_convNd() for <class 'torch.nn.modules.conv.ConvTranspose2d'>.
Model MACs: 97467580416.0
Model Parameters: 25008536.0


The obtained results are summarized as follows: 

| Network Parameters | MACs       | OKS   |
|--------------------|------------|-------|
| $25*10^6$          | $584*10^9$ | 56.0% |
| $4*10^6$           | $92*10^9$  | 43.7% |

The experiments show that performance keeps improving as the network size grows. This property makes it possible to manage the tradeoff between inference time and network performance.  
The inference time of the first network (56% OKS) has been tested by processing a video frame by frame through the network. The network processed that video at a mean of 20 FPS on an Nvidia GeForce RTX 3050Ti. 
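
A mean FPS figure of this kind can be obtained with a simple timing loop. The sketch below is only one plausible way to measure it, since the exact procedure used for the number above is not described: `mean_fps` and `process_frame` are hypothetical names, and a dummy workload stands in for the network.

```python
import time

def mean_fps(process_frame, frames):
    """Run process_frame on every frame and return the mean frames per second."""
    start = time.perf_counter()
    count = 0
    for frame in frames:
        process_frame(frame)
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")

# Dummy workload standing in for a call to the model on one video frame.
fps = mean_fps(lambda frame: sum(frame), [[1, 2, 3]] * 100)
```

When timing a model that runs on a GPU, the device should be synchronized (for example with `torch.cuda.synchronize()`) before reading the clock, otherwise asynchronous kernel launches make the measured time too optimistic.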

Code taken from the [official paper repository](https://github.com/mit-han-lab/litepose):
- The classes `CrowdPoseDataset` and `CrowdPoseKeypoints`, which load the dataset and preprocess the joints, turning them into heatmaps.
- The code contained in `lp_generators.py` and `lp_transforms.py`, since they are dependencies of `CrowdPoseKeypoints`.