In [None]:
#@title Mounting drive
# Install google-drive-ocamlfuse, authenticate, and mount Google Drive at ./drive
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null
!apt-get -y install -qq google-drive-ocamlfuse fuse
from google.colab import auth
auth.authenticate_user()
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}
!mkdir -p drive
!google-drive-ocamlfuse drive

# **Dance Recognition Project**
## Comparison of Pose Estimation Libraries

Recently, I was playing around with libraries for pose estimation. In this report, I would like to show their performance on samples from my dataset and compare them to each other. 

I came across several libraries that might be relevant for our purposes: *wrnchAI*, *OpenPose*, *AlphaPose*, *HRNet*, *DensePose*, *DarkPose*, *Megvii*, *DeeperCut*, *PoseNet*... *wrnchAI* seems to be similar in accuracy to *OpenPose*, but usually faster. It is, however, a commercial product, so I suppose it is out of the question. Therefore, I will only deal with the other libraries.

### Example video

I will demonstrate the performance of some of the above-mentioned tools on the following video (5 seconds of cha-cha from my dataset, no audio, 30 fps):

In [None]:
#@title Importing Example Video
from IPython.display import HTML
from base64 import b64encode

mp4 = open('drive/TanecProjekt/chachacha_1_seg_4.mp4','rb').read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""
<video width=600 controls>
      <source src="%s" type="video/mp4">
</video>
""" % data_url)

## OpenPose

The most widespread open-source library for pose estimation, developed at Carnegie Mellon University. It has a well-documented GitHub page: https://github.com/CMU-Perceptual-Computing-Lab/openpose . Installation is easy and it can be run from a binary, without using its API. Computation on my CPU was very slow (the 5 s video took 50 minutes to process), so using a GPU is a must. Unfortunately, mine supports neither CUDA nor OpenCL, so I had to move to Google Colab for the GPUs.

Below, I test how well OpenPose recognizes the major body landmarks in our example video. It also has additional support for detailed tracking of the face and hands, but that is probably overkill for our purposes.
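
As a rough sketch of what running the binary looks like, the snippet below only assembles a command line for the OpenPose demo; it does not execute anything. It assumes the flags `--video`, `--write_video`, `--write_json`, `--display` and `--disable_blending` behave as described in the OpenPose documentation. The binary location and the output names are placeholders made up for illustration, not the exact ones used for the videos in this report.

```python
import shlex


def openpose_command(video_path, out_video, out_json_dir, skeleton_only=False,
                     binary="./build/examples/openpose/openpose.bin"):
    """Build the argument list for one OpenPose demo run (nothing is executed)."""
    cmd = [
        binary,                        # assumed default location of the Linux demo binary
        "--video", video_path,         # input video
        "--write_video", out_video,    # rendered output video (hypothetical name)
        "--write_json", out_json_dir,  # one JSON file with keypoints per frame
        "--display", "0",              # no GUI window, needed on a headless machine
    ]
    if skeleton_only:
        # Render the skeletons on a black background instead of the original frames.
        cmd.append("--disable_blending")
    return cmd


with_background = openpose_command(
    "drive/TanecProjekt/chachacha_1_seg_4.mp4", "poses.avi", "poses_json/")
no_background = openpose_command(
    "drive/TanecProjekt/chachacha_1_seg_4.mp4", "poses_noBgr.avi", "poses_json/",
    skeleton_only=True)

print(shlex.join(with_background))
print(shlex.join(no_background))
```

The printed lines could be pasted after a `!` in a Colab cell, provided OpenPose was built in the current directory.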

Here is how OpenPose performed on our example video:

In [None]:
#@title Result of OpenPose
mp4 = open('drive/TanecProjekt/chachacha_1_seg_4_Poses_Collab.mp4','rb').read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""
<video width=600 controls>
      <source src="%s" type="video/mp4">
</video>
""" % data_url)

Here is the same result again, this time showing only the skeletons, without the original video in the background:

In [None]:
#@title Result of OpenPose, no background 
mp4 = open('drive/TanecProjekt/chachacha_1_seg_4_Poses_noBgr.mp4','rb').read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""
<video width=600 controls>
      <source src="%s" type="video/mp4">
</video>
""" % data_url)

### Comments on speed

OpenPose seems quite promising. Using 1 GPU on Google Colab, it took OpenPose 16 seconds to process the example video and return the video without the background (with the background it took 22 seconds). The authors claim it works in real time, and that seems plausible: if the production of demonstration videos were omitted, there should be some speedup, and in the worst case the resolution or frame rate of the input video can be reduced as well. A frequently mentioned advantage of OpenPose is that its speed does not depend on the number of detected people. This is an advantage for us if we want to analyze videos of multiple pairs dancing, but it might be a disadvantage if we were interested in just a single pair (for that purpose, a faster framework could probably be found).
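
To make the gap to real time concrete, here is a small back-of-the-envelope calculation based on the timings above (150 frames in 16 s and 22 s). It assumes the cost per frame is roughly constant, so that skipping frames scales the runtime linearly; the helper names are mine and purely illustrative. Since the measured times include rendering and writing the demonstration video, the real throughput of the pose estimation itself is probably somewhat better.

```python
import math

VIDEO_SECONDS = 5                     # length of the example clip
VIDEO_FPS = 30                        # frame rate of the example clip
N_FRAMES = VIDEO_SECONDS * VIDEO_FPS  # 150 frames in total


def throughput_fps(n_frames, processing_seconds):
    """Average number of frames processed per second of wall-clock time."""
    return n_frames / processing_seconds


def stride_for_real_time(input_fps, achieved_fps):
    """Smallest k such that processing only every k-th frame keeps up with playback."""
    return max(1, math.ceil(input_fps / achieved_fps))


no_background = throughput_fps(N_FRAMES, 16)    # skeleton-only run
with_background = throughput_fps(N_FRAMES, 22)  # run with the original video blended in

stride = stride_for_real_time(VIDEO_FPS, no_background)
effective_fps = VIDEO_FPS / stride

print(f"no background:   {no_background:.2f} fps")
print(f"with background: {with_background:.2f} fps")
print(f"keep every {stride}th frame -> {effective_fps:.1f} fps input")
```

With these numbers, keeping every 4th frame (7.5 fps input) would already be enough for the skeleton-only run to keep up with playback.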

They also suggest a setup for maximal precision (at the cost of speed), which could be used when training our NN. For actual usage (for example on a mobile device or a device without a GPU), a lightweight version of OpenPose or some other framework should be used. Alternatively, the computation could be offloaded from the device to a server with GPUs.

This project ( https://github.com/ildoonet/tf-pose-estimation ) claims to reimplement OpenPose so that it runs in real time on a mobile device or a device without a GPU. I have not tested it yet, but the last commit is 1 year old, which seems a little suspicious.

This project claims to have implemented a lightweight PyTorch version of OpenPose: https://github.com/Daniil-Osokin/lightweight-human-pose-estimation.pytorch .

## AlphaPose

Another tool, this one developed in China at Shanghai Jiao Tong University. Their GitHub is: https://github.com/MVIG-SJTU/AlphaPose . It is probably not as widespread internationally as OpenPose: when googling some issues, I found forum threads that were in Chinese, which might make this library a bit more difficult to use. However, the professor behind it came from Stanford and seems to want to make it international.

They claim it is "freely available for free non-commercial use, and may be redistributed under 'these' conditions", where 'these' conditions mean a yearly licence fee of 2000 USD. Commercial use is likely to be priced individually, so I am not sure what kind of "redistribution" other than commercial is meant. Somewhere they call themselves open-source, so I hope there will be no problem.

Here is how AlphaPose performed on our example video:

In [None]:
#@title Result of AlphaPose (v0.3.0)
mp4 = open('drive/TanecProjekt/AlphaPose_chachacha_1_seg_4.mp4','rb').read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""
<video width=600 controls>
      <source src="%s" type="video/mp4">
</video>
""" % data_url)

Unfortunately, it does not offer the option of outputting a skeleton-only video. However, as for the actually useful outputs, it is quite convenient, since it can output the .json data either in its own format or in the same format as OpenPose.
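
To illustrate why a shared output format is convenient, here is a minimal sketch of reading one frame in the OpenPose-style JSON layout. It assumes the layout written by recent OpenPose versions: a `people` list in which each person has a flat `pose_keypoints_2d` array of `x, y, confidence` values. I have not checked that the AlphaPose export matches this key for key, and the sample frame below is made up.

```python
import json


def people_keypoints(frame_json_text):
    """Return, for each detected person, a list of (x, y, confidence) tuples."""
    frame = json.loads(frame_json_text)
    result = []
    for person in frame.get("people", []):
        flat = person["pose_keypoints_2d"]  # [x0, y0, c0, x1, y1, c1, ...]
        triplets = [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]
        result.append(triplets)
    return result


# Made-up frame with one person and only two keypoints, for illustration.
sample = json.dumps({
    "version": 1.3,
    "people": [{"pose_keypoints_2d": [100.0, 200.0, 0.9, 110.0, 250.0, 0.8]}],
})

poses = people_keypoints(sample)
print(poses)
```

The same function could then be applied to the per-frame files of either library, as long as both follow this layout.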

### Comments on speed

According to their profiler, it took 9 seconds to produce the above output in Google Colab with 1 GPU. However, it may not count some processing of the output video that the OpenPose profiler probably counts, so it is hard to say whether a comparison with the 16 s of OpenPose is fair.

For their PyTorch implementation they claim real-time performance at 20 fps. Somewhere they mention a version capable of 23 fps, but that might come at the cost of accuracy.

As for a lightweight version, they are behind this project, which I have not tested: https://github.com/YuliangXiu/MobilePose-pytorch .

## HRNet

Another framework developed in China. GitHub: https://github.com/leoxiaobin/deep-high-resolution-net.pytorch , paper: https://jingdongwang2017.github.io/Projects/HRNet/PoseEstimation.html . It is under the MIT licence.

This model and its derivatives currently rank at the top of competitions on datasets such as COCO and MPII (Human Pose Dataset). A one-year-old Medium article about its achievements: https://medium.com/syncedreview/human-pose-estimation-model-hrnet-breaks-three-coco-records-cvpr-accepts-paper-74e57fabdeb6 .

Here is how this model (actually some particular demo https://github.com/lxy5513/hrnet) performed on our example video:


In [None]:
#@title Result of HRNet
mp4 = open('drive/TanecProjekt/HRNet_chachacha_1_seg_4.mp4','rb').read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""
<video width=600 controls>
      <source src="%s" type="video/mp4">
</video>
""" % data_url)

### Comments on speed

The version I ran in Google Colab with 1 GPU worked at around 13 fps. There is a lightweight version, which is claimed to run at 17 fps on a CPU-only machine (Intel i7) with good accuracy; see the paper (https://arxiv.org/pdf/1911.10346.pdf) and GitHub (https://github.com/zhang943/lpn-pytorch).
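
To put the three measurements from this report side by side, the sketch below converts the OpenPose and AlphaPose timings for the 150-frame clip into frames per second and lists them next to the HRNet figure. The numbers are not strictly comparable: the 16 s of OpenPose include producing the output video, the 9 s of AlphaPose come from its own profiler, and the 13 fps of the HRNet demo is the rate it reported directly.

```python
N_FRAMES = 5 * 30  # 5-second clip at 30 fps
INPUT_FPS = 30

# Throughput in frames per second, derived from the timings quoted in this report.
measured = {
    "OpenPose": N_FRAMES / 16,   # 16 s, skeleton-only output video
    "AlphaPose": N_FRAMES / 9,   # 9 s according to its profiler
    "HRNet": 13.0,               # reported directly by the demo
}

# Fastest first.
ranking = sorted(measured, key=measured.get, reverse=True)

for name in ranking:
    fps = measured[name]
    slowdown = INPUT_FPS / fps  # how many times slower than real-time playback
    print(f"{name:<10} {fps:5.1f} fps  ({slowdown:.1f}x slower than real time)")
```

On this single clip, all three stay within a factor of about two of each other, so speed alone does not single out one library.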

## Other frameworks

There are plenty of other frameworks for pose estimation; the state-of-the-art results on benchmark datasets can be found here: https://paperswithcode.com/task/pose-estimation .

Short comments on some of them:

*   *DensePose* - Also does 3D mapping, most likely overkill for our purposes (and probably slow). Developed by Facebook.
*   *DarkPose* - Currently at the very top of the charts, paper: https://arxiv.org/abs/1910.06278 , GitHub: https://github.com/ilovepose/DarkPose . It is very recent and the code is not yet public, but it should be open-sourced on GitHub under the Apache Licence.
*   *Fast Human Pose Estimation* - From the authors of DarkPose. Under the MIT Licence, GitHub: https://github.com/ilovepose/fast-human-pose-estimation.pytorch , paper: http://openaccess.thecvf.com/content_CVPR_2019/papers/Zhang_Fast_Human_Pose_Estimation_CVPR_2019_paper.pdf . There is also an unofficial PyTorch implementation: https://github.com/yuanyuanli85/Fast_Human_Pose_Estimation_Pytorch , which is claimed to run at 43 fps on a CPU-only machine (Intel i7).
*   *Megvii - pose estimation* - Topping the charts, developed by the Chinese company Megvii. Paper: https://arxiv.org/abs/1901.00148 , GitHub: https://github.com/megvii-detection/MSPN . I did not find details on the licence.
*   *Human pose estimation* - Leading the MPII chart, paper: https://www.adrianbulat.com/downloads/FG20/fast_human_pose.pdf , no code made available yet, probably developed by Samsung.
*   *DeeperCut* - Paper: https://arxiv.org/pdf/1605.03170v3.pdf , TensorFlow implementation: https://github.com/eldar/pose-tensorflow , open source. Used to be at the top, but it is already 3 years old...
*   *PoseNet / PersonLab* - Developed by Google, implemented with TensorFlow. Lightweight versions for browsers or mobile devices. PoseNet GitHub: https://github.com/tensorflow/tfjs-models/tree/master/posenet , webcam demo: https://storage.googleapis.com/tfjs-models/demos/posenet/camera.html , Medium: https://medium.com/tensorflow/real-time-human-pose-estimation-in-the-browser-with-tensorflow-js-7dd0bc881cd5 . PersonLab paper: https://arxiv.org/abs/1803.08225








## Conclusion

I tried to get a basic orientation in the pose estimation tools that are out there. Those that were easiest to get running I tried on an example dance video.

From these videos I did not see a major difference in accuracy on my sample. OpenPose might be convenient for dancing, since it also tracks feet (such an extension may exist for the other libraries as well, but it is not the default). Another advantage of OpenPose is the community around it, which would make it much simpler to work with.

Because of this, I would suggest starting with OpenPose and, if it turns out to be insufficient in accuracy or speed, looking at other frameworks. Currently HRNet would be my second choice, but later, once it is made public, DarkPose deserves to be examined. If an app were to be made from this project, we could consider using a lightweight framework for mobile devices, such as PoseNet or a lightweight version of the above-mentioned robust frameworks.
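
As an illustration of the foot tracking mentioned above, here is a minimal sketch of picking the foot keypoints out of one detected person. It assumes the default BODY_25 model of OpenPose, in which indices 19 to 24 are the toes and heels, and treats keypoints with very low confidence as undetected. The threshold of 0.1 and the sample person are made up for the example.

```python
# Foot keypoint indices in the BODY_25 layout (assumed from the OpenPose output docs).
FOOT_KEYPOINTS = {
    "LBigToe": 19, "LSmallToe": 20, "LHeel": 21,
    "RBigToe": 22, "RSmallToe": 23, "RHeel": 24,
}


def foot_points(keypoints, min_confidence=0.1):
    """Select the detected foot keypoints from a list of 25 (x, y, confidence) triplets."""
    feet = {}
    for name, index in FOOT_KEYPOINTS.items():
        x, y, confidence = keypoints[index]
        if confidence >= min_confidence:  # hypothetical threshold for "detected"
            feet[name] = (x, y)
    return feet


# Made-up person: keypoint i sits at (i, 2 * i); the right heel is not detected.
person = [(float(i), float(2 * i), 0.9) for i in range(25)]
person[24] = (0.0, 0.0, 0.0)

feet = foot_points(person)
print(feet)
```

For dance recognition, such heel and toe positions could serve as extra input features on top of the ankle keypoints that the other libraries provide by default.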