Action recognition using OpenPose and TDA - developed during an Internship at Fujitsu Centre of Excellence in Paris

Action recognition based on OpenPose and TDA (using Gudhi and sklearn_tda)


This project was developed in the summer of 2018 during an internship at Fujitsu Centre of Excellence in Paris. Thanks for all of the help throughout the project!


The ensemble classifier achieves an accuracy of 0.912 on custom data. However, there is still a need to capture more data to see how well it would generalise over different actors and scenes. The TDA classifier on its own achieves an accuracy of 0.823 on the same data.

From video to action detection

The pipeline can be run in its entirety using the following scripts (see --help for each script for options and parameters). The first step is to generate data and train a classifier:

python3.6 --video test-0.mp4 --out-directory output
python3.6 --videos test-0.mp4 --tracks output/test-0-tracks.npz
python3.6 --dataset dataset/dataset --point-clouds
python3.6 --dataset dataset/dataset --tda

The last step creates a trained classifier (in a .pkl file). This classifier can then be used to generate predictions of actions live by running the script:

python3.6 --classifier classifier.pkl --video test-0.mp4

The script will output the identified actions, a video with the predictions overlaid on the original video, and one video per predicted action.

Most of the scripts simply wrap input and output for the different modules in action_recognition. The final script is the exception, as it also aggregates some of the predictions and decides which parts of each track should be used to predict actions. Below is a description of what each of these scripts does:

  • The script creates tracks of people identified in a video for dataset creation.
  • Creates a tracker.Tracker object with either detector.CaffeOpenpose (CMU's original implementation) or detector.TFOpenpose (which is faster, but did not deliver the same level of accuracy for me). It also requires an output directory in which it places the processed videos and tracks.
  • Produces two files: a video file with the identified keypoints overlaid on the original video, and a file called {path_to_video}-tracks.npz, which contains two numpy arrays: tracks (i.e. the keypoints of each identified person in the video) and frames (i.e. the corresponding frame numbers for each identified person, primarily useful for later visualisation of the keypoints).
  • Each track is a [n_frames, n_keypoints, 3] numpy.ndarray of keypoints predicted to belong to a single person across several frames, making the final output array of shape [n_tracks, n_frames, n_keypoints, 3], where the values are (x, y, confidence). The frames array, correspondingly, has the shape [n_tracks, n_frames, 1]. Note, however, that both arrays will be ndarrays with dtype=object, since n_frames per track will differ.
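Because both arrays are stored with dtype=object, they have to be loaded with pickling enabled. The following is a minimal sketch of reading such a file back, assuming only the two array names given above (tracks and frames); the helper names load_tracks and make_object_array are hypothetical and not part of this project. The demo at the end writes a small synthetic file so that the example is self-contained.

```python
import os
import tempfile

import numpy as np


def make_object_array(items):
    """Build a 1-D object array without NumPy trying to stack the items."""
    out = np.empty(len(items), dtype=object)
    for i, item in enumerate(items):
        out[i] = item
    return out


def load_tracks(path):
    """Load the tracks and frames arrays from a *-tracks.npz file."""
    # allow_pickle=True is needed because both arrays have dtype=object.
    with np.load(path, allow_pickle=True) as data:
        return data["tracks"], data["frames"]


# Round-trip demo with two synthetic tracks of different lengths.
demo_tracks = make_object_array([np.zeros((5, 18, 3)), np.zeros((8, 18, 3))])
demo_frames = make_object_array(
    [np.arange(5).reshape(-1, 1), np.arange(8).reshape(-1, 1)]
)
demo_path = os.path.join(tempfile.mkdtemp(), "demo-tracks.npz")
np.savez(demo_path, tracks=demo_tracks, frames=demo_frames)

tracks, frames = load_tracks(demo_path)
for track, frame_numbers in zip(tracks, frames):
    # Each track has its own n_frames, so shapes differ between tracks.
    print(track.shape, frame_numbers.shape)
```

Only load files you created yourself this way, since unpickling untrusted data can execute arbitrary code.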

  • Run the script to create a dataset from tracks. If you have not previously labelled the data, the script will prompt the user for labels. The labelling process will either give you the option to look through the videos and discard bad chunks (if there are timestamps for the videos with corresponding labels) or manually label the data by displaying each chunk and requiring input on which label to attach to which chunk.
  • The script outputs {name}-train.npz and {name}-test.npz files containing the corresponding chunks, frames, labels, and videos of the train and test sets. Note that the frames and videos are only used for visualisation of the data.
  • The labelling process only needs to be done once, after which a .json file is created per tracks file, which can be manually edited and will be parsed for labels subsequent times.
  • During the creation of the dataset, before dividing tracks into chunks, there are also a couple of post-processing steps:
    1. Merge tracks that are very close to each other at their ends, or throughout a longer period.
    2. Remove tracks that are too short, under 15 frames long.
    3. Fill in missing keypoints (as OpenPose sometimes does not output every keypoint) by maintaining the same distance to a connected keypoint (e.g. wrist-elbow) as when the keypoint was last seen. This increases the accuracy of the classifier later on.
    4. Fill-in missing frames by interpolating positions of every keypoint. This is done to normalise the data in case OpenPose lost track of a person for a couple of frames. With this normalisation, every chunk of the same length will also have the same length in video-time.
  • If there are multiple datasets that you wish to combine, you can run the script which allows you to do exactly that.
  • A possible improvement here is to allow a user to label different chunks in the same video with different lengths.
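Two of the post-processing steps above (removing short tracks and filling in missing frames) can be sketched with plain NumPy. This is an illustration under stated assumptions, not the project's implementation: the function names are hypothetical, linear interpolation is assumed for the missing frames, and the confidence value is interpolated along with x and y for simplicity.

```python
import numpy as np

MIN_TRACK_LENGTH = 15  # tracks under 15 frames are removed, as described above


def remove_short_tracks(tracks, frames, min_length=MIN_TRACK_LENGTH):
    """Keep only the tracks (and their frame numbers) that are long enough."""
    keep = [i for i, track in enumerate(tracks) if len(track) >= min_length]
    return [tracks[i] for i in keep], [frames[i] for i in keep]


def fill_missing_frames(track, frames):
    """Linearly interpolate every keypoint over gaps in the frame numbers.

    track:  [n_frames, n_keypoints, 3] array of (x, y, confidence)
    frames: increasing frame numbers for the track, possibly with gaps
    """
    frames = np.asarray(frames).reshape(-1)
    full = np.arange(frames[0], frames[-1] + 1)
    flat = track.reshape(len(track), -1)
    # Interpolate each flattened column (one coordinate of one keypoint).
    filled = np.stack(
        [np.interp(full, frames, flat[:, j]) for j in range(flat.shape[1])],
        axis=1,
    )
    return filled.reshape(len(full), *track.shape[1:]), full


# Usage: the person was lost at frame 2, so that frame is reconstructed.
example = np.zeros((3, 2, 3))
example[:, 0, 0] = [0.0, 1.0, 3.0]
filled, full_frames = fill_missing_frames(example, [0, 1, 3])
print(filled.shape, full_frames)
```

After this step every chunk of a given length spans the same amount of video time, which is the normalisation the list above describes.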

  • If you wish, you can now run to get an idea about, for instance, how the point clouds look or how well the features from the feature engineering seem to separate the different classes.

  • The script trains a classifier on the data. It accepts a dataset as input (without the -test and -train suffixes) and an option to run either --feature-engineering, --tda, or --ensemble. Each option produces a confusion matrix for the classifier on the test set, the classifier saved in .pkl format, and optionally a visualisation of the incorrect classifications. The --feature-engineering option trains a classifier on hand-selected features. The --tda option runs a SlicedWasserstein kernel on the persistence diagrams of the point clouds generated from the data. The --ensemble option combines the SlicedWasserstein kernel with the feature engineering using a voting classifier.
  • The pipeline for the TDA calculation has seven steps (recall that the data has already been split into chunks during dataset creation):
    1. Extract certain keypoints (the neck, ankles, and wrists have worked the best for me), which both speeds up the computation and increases accuracy.
    2. Smooth the path of each keypoint. This is mainly done since OpenPose sometimes produces jittery output, and this helps to remove that (and increases accuracy as a result).
    3. Normalise every chunk so that it is centered around (0, 0).
    4. Flatten the chunks from shape [n_frames, n_keypoints, 2] to [n_keypoints * n_frames, 3]. The third dimension corresponds to the index of the frame (i.e. ranging from 0 to n_frames), not actual time.
    5. Calculate persistence using Gudhi's AlphaComplex (with max_alpha_square set to 2).
    6. Calculate the SlicedWasserstein kernel from sklearn_tda.
    7. Train a scikit-learn SVC classifier.
  • The training can take a couple of minutes, naturally longer for the TDA calculations than for the pure feature engineering. The SlicedWasserstein kernel is the computation that takes the longest (but thankfully prints its progress), roughly 1.6 times longer than the next most time-consuming operation, which is the persistence calculation of the AlphaComplex which takes place just before.
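Steps 1 to 4 of the TDA pipeline are pure array manipulation and can be sketched with NumPy alone. The sketch below is an illustration, not the project's code: the keypoint indices assume OpenPose's 18-keypoint COCO layout (neck = 1, wrists = 4 and 7, ankles = 10 and 13), the smoothing is assumed to be a simple moving average, "centred around (0, 0)" is interpreted as subtracting the mean position, and all function names are hypothetical. Steps 5 to 7 (Gudhi, sklearn_tda, and the SVC) are left out.

```python
import numpy as np

# Assumed indices of neck, wrists and ankles in the 18-keypoint COCO layout.
SELECTED_KEYPOINTS = [1, 4, 7, 10, 13]


def extract_keypoints(chunk, indices=SELECTED_KEYPOINTS):
    """Step 1: keep the selected keypoints and drop the confidence column."""
    return chunk[:, indices, :2]


def smooth(chunk, window=3):
    """Step 2: moving average over frames for every keypoint coordinate."""
    if window % 2 == 0:
        raise ValueError("window must be odd so the chunk length is preserved")
    pad = window // 2
    # Repeat the first and last frame so the output keeps n_frames rows.
    padded = np.pad(chunk, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    return np.apply_along_axis(
        lambda path: np.convolve(path, kernel, mode="valid"), 0, padded
    )


def centre(chunk):
    """Step 3: translate the chunk so that its mean position is (0, 0)."""
    return chunk - chunk.reshape(-1, 2).mean(axis=0)


def flatten(chunk):
    """Step 4: [n_frames, n_keypoints, 2] -> [n_frames * n_keypoints, 3]."""
    n_frames, n_keypoints, _ = chunk.shape
    # The third coordinate is the frame index within the chunk, not time.
    frame_index = np.repeat(np.arange(n_frames), n_keypoints)
    return np.column_stack([chunk.reshape(-1, 2), frame_index])


def chunk_to_point_cloud(chunk):
    """Run steps 1-4 on one [n_frames, 18, 3] chunk."""
    return flatten(centre(smooth(extract_keypoints(chunk))))


rng = np.random.default_rng(0)
cloud = chunk_to_point_cloud(rng.random((20, 18, 3)))
print(cloud.shape)
```

The resulting point cloud is what would be passed on to the persistence calculation in step 5.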

  • The script takes a trained classifier and uses tracker.Tracker to yield identified tracks from the tracking of people in the video.
  • On each such track (every 20th frame), it does post-processing (using analysis.PostProcessor) and then (arbitrarily) takes the latest 50, 30, 25, and 20 frames as chunks for which actions are predicted. The most likely action (highest probability/confidence from the classifier) from all chunks is selected as the action for the person.
  • If the confidence for a classification falls below a user-specified threshold, the prediction is discarded.
  • It also tries to predict if a person moves through e.g. a checkout-area without stopping by identifying if a person moves during several consecutive frames.
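The chunk selection and thresholding described above can be sketched as follows. This is a simplified illustration rather than the project's code: predict_proba is a hypothetical callable standing in for the trained classifier, the class names and the default threshold are made up for the demo, and the post-processing step is omitted.

```python
import numpy as np

CHUNK_LENGTHS = (50, 30, 25, 20)  # trailing windows named in the text


def latest_chunks(track, lengths=CHUNK_LENGTHS):
    """Return the trailing windows of the track that are fully available."""
    return [track[-n:] for n in lengths if len(track) >= n]


def predict_action(track, predict_proba, classes, threshold=0.5):
    """Pick the most confident prediction over all chunks.

    Returns (label, confidence); label is None when no chunk reaches the
    threshold or the track is too short for any chunk.
    """
    best_label, best_confidence = None, 0.0
    for chunk in latest_chunks(track):
        probabilities = np.asarray(predict_proba(chunk))
        index = int(np.argmax(probabilities))
        if probabilities[index] > best_confidence:
            best_label = classes[index]
            best_confidence = float(probabilities[index])
    if best_confidence < threshold:
        return None, best_confidence
    return best_label, best_confidence


def stub_predict_proba(chunk):
    """Stand-in classifier that is most confident on 30-frame chunks."""
    return {50: [0.6, 0.4], 30: [0.1, 0.9]}.get(len(chunk), [0.5, 0.5])


demo_track = np.zeros((60, 18, 3))
demo_classes = ["still", "lying"]  # hypothetical labels for the demo
print(predict_action(demo_track, stub_predict_proba, demo_classes))
```

Taking the maximum over several window lengths lets actions of different durations be picked up, at the cost of one classifier call per window.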


There are currently four dockerfiles, corresponding to three natural divisions of dependencies, and one with every dependency:

  • dockerfiles/Dockerfile-openpose-gpu: the GPU version of OpenPose, which allows the OpenPose parts of this project to be run.
  • dockerfiles/Dockerfile-openpose-cpu: the CPU version of OpenPose.
  • dockerfiles/Dockerfile-tda: contains Gudhi and sklearn_tda for the classification part of the project.
  • Dockerfile: installs both OpenPose (assuming a GPU) and the TDA libraries. This file could do with some cleanup using build stages.

After building the Dockerfiles, there is a script which runs the container and mounts the source directory as well as the expected locations of the data. It is provided more out of convenience than anything else and may need some modification depending on your configuration.

Recording videos

There is a helper script for producing timestamps for labels while recording videos. It requires a video name, a path to the camera device, and a video size. It prompts the user in multiple steps: first, it asks whether to record video or stop recording; second, it prompts the user for a label for the timestamp. These steps repeat until the user quits the program. The produced timestamps are read during dataset creation to help reduce labelling time.

Issues with the approach

  • A bit slow - OpenPose takes about 0.5 s per frame, and the TDA classifier takes about 3 s per person and prediction. Most of this time is spent in the kernel calculation from sklearn_tda and the persistence calculation from the Gudhi library. Both of these have parameters that can be tuned (see TDAClassifier._pre_validated_pipeline()), at the expense of accuracy.
  • We are restricted to 2D positions - a limitation from OpenPose, which makes classification harder.
  • OpenPose can be quite jittery, especially when using lower resolutions.
  • TDA does not have any way of distinguishing still from lying. Since neither action involves any movement, they do not form any shapes that TDA can recognise.
  • TDA also does not have a concept of direction, only the shape of the point cloud. Therefore, a vertical action can easily be confused with a horizontal one.
  • While the final confusion matrix/accuracy looks good, I am worried that the data/actions are too easy since the feature engineering works so well. The TDA kernel might generalise better?

Ideas that might improve the program

  • Main issue (for real usage) is that it is too slow. As mentioned above, there are some parameters that can be tuned to help speed up the classification, but it is still too slow for real usage. Under the assumption that every action takes the same amount of time, we can also base every live prediction on only one chunk. Not sure how to actually make the persistence/kernel calculations faster.
  • There is also a need for more data to test the approach on. As previously mentioned, the current data may be too easy, and I've seen improvements on test video by adding more data.
  • It might be possible to increase the dimensionality of the point clouds with some hand-crafted feature. I experimented with adding the speed of every keypoint to the point cloud, but did not get an increase in performance. It also, naturally, makes the persistence/kernel computations slower.
  • The third dimension in the point clouds could be changed from the frame number (from the start of the chunk) to the actual change in time.
  • There are also three possible improvements to the tracker, if necessary:
    1. Predict the position of people in the next frame using their speed, to help increase the accuracy of the tracks.
    2. Remove assignments of keypoints where, e.g., an arm suddenly becomes much longer (which does sometimes happen).
    3. Remove predicted people who have too few keypoints, e.g. only an arm or part of a face.
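The three tracker ideas are simple enough to sketch. None of this exists in the project: the function names, the constant-velocity assumption, and every threshold below are hypothetical and only meant to make the ideas concrete.

```python
import numpy as np


def predict_next_position(previous, current):
    """Constant-velocity guess: next = current + (current - previous)."""
    return current + (current - previous)


def limb_is_plausible(length, previous_length, max_ratio=1.5):
    """Reject a limb that suddenly became much longer than it was before."""
    return length <= max_ratio * previous_length


def has_enough_keypoints(person, min_keypoints=5, min_confidence=0.1):
    """person: [n_keypoints, 3] array of (x, y, confidence)."""
    return int(np.sum(person[:, 2] > min_confidence)) >= min_keypoints


# Usage: a detection with only three confident keypoints is dropped.
partial_person = np.zeros((18, 3))
partial_person[:3, 2] = 0.9
print(has_enough_keypoints(partial_person))
print(predict_next_position(np.array([0.0, 0.0]), np.array([1.0, 2.0])))
```

The predicted position could then be matched against the detections in the next frame instead of the last known position.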