# [1 &mdash; Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition (2015)](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Du_Hierarchical_Recurrent_Neural_2015_CVPR_paper.pdf)
---

The presented model is hierarchical: the human skeleton is divided into five parts (two arms, two legs, and the trunk/torso) according to the body's physical structure, and these parts form the leaves of the hierarchy. The five parts are fed into five separate bidirectional recurrent neural networks (BRNNs). Moving towards higher layers, the part representations are fused step by step until they form a single complete skeleton, which passes through a single-layer perceptron with a softmax at the end.

## Main idea - hierarchical RNN structure

Previous work used only a single-layer RNN as a sequence classifier **without** fragmenting the skeleton into smaller elements. This work uses a tree-like hierarchical structure and fuses the segments in higher layers. The authors consider motions of the human body as composite movements of discrete body parts and derive their hierarchical approach accordingly.

Previous work also considers the movement of subjects only within a predetermined time window, without capturing the overall evolution of the movement throughout the sequence/video.

A Hidden Markov Model (HMM) can describe the temporal evolution of actions, but it requires segmentation and alignment of the input sequences, which is a difficult task.

## Network architecture

![alt text](./images/hierarchical_BRNN.png)

#### 9 layers in total:
- 4 bidirectional layers (starting with 5 subnetworks for the 5 body parts; their outputs are fused together, so the number of subnetworks decreases towards higher layers)
- 3 fusion layers (one fusion layer after each of the first three bidirectional layers, to integrate the body part movements)
- 1 fully connected layer
- 1 softmax layer at the output
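
The fusion idea can be sketched in a few lines. This is only an illustration under stated assumptions: fusion is modelled as plain concatenation of per-frame feature vectors, the grouping 5 -> 4 -> 2 -> 1 (limbs with trunk, then upper/lower body, then whole body) is my reading of the figure rather than something these notes spell out, and the BRNN layers that sit between the fusion steps are left out entirely. All feature values and names are made up.

```python
# Toy per-frame feature vectors for the five body parts (values are made up).
parts = {
    "left_arm": [0.1, 0.2],
    "right_arm": [0.3, 0.4],
    "trunk": [0.5, 0.6],
    "left_leg": [0.7, 0.8],
    "right_leg": [0.9, 1.0],
}


def fuse(*features):
    """Fuse part representations by concatenating their feature vectors."""
    fused = []
    for f in features:
        fused.extend(f)
    return fused


# Fusion 1 (assumed grouping): each limb is combined with the trunk -> 4 streams.
level1 = {
    limb: fuse(parts[limb], parts["trunk"])
    for limb in ("left_arm", "right_arm", "left_leg", "right_leg")
}

# Fusion 2 (assumed grouping): upper body and lower body -> 2 streams.
level2 = {
    "upper_body": fuse(level1["left_arm"], level1["right_arm"]),
    "lower_body": fuse(level1["left_leg"], level1["right_leg"]),
}

# Fusion 3: the whole body -> 1 stream for the last bidirectional layer.
whole_body = fuse(level2["upper_body"], level2["lower_body"])

# Number of parallel streams at each level of the hierarchy.
streams_per_level = [len(parts), len(level1), len(level2), 1]
```

In the real network every stream would be processed by its own BRNN before the next fusion, so the fused vectors would contain learned features rather than raw copies of the inputs.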

LSTM neurons are only used in the last (fourth) bidirectional recurrent layer, while the other layers all use the tanh activation function. LSTMs are much more expressive models than the tanh BRNNs and tend to overfit with limited training data, so they're used sparingly.

The model is trained by minimizing the maximum-likelihood loss, i.e. the negative log-likelihood of the correct class.
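
As a small worked example of that loss, here is a sketch that assumes the usual softmax classifier setup: the loss for one sequence is minus the log-probability assigned to the correct class. The scores below are hypothetical stand-ins for the output of the fully connected layer.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(logits, true_class):
    """Loss for one sequence: minus the log-probability of the correct class."""
    return -math.log(softmax(logits)[true_class])


scores = [2.0, 0.5, -1.0]   # hypothetical outputs of the fully connected layer
probs = softmax(scores)
loss_if_class_0 = negative_log_likelihood(scores, true_class=0)
loss_if_class_2 = negative_log_likelihood(scores, true_class=2)
```

The loss is small when the highest score belongs to the true class (class 0 here) and grows when the true class received a low score (class 2).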

### Interesting detail - dropout doesn't work here! Adding weight noise helps!
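
A minimal sketch of what weight noise can look like, assuming zero-mean Gaussian noise added to a copy of the weights on each training step. The noise distribution, the standard deviation and the helper name `add_weight_noise` are assumptions for illustration; the notes above do not specify them.

```python
import random


def add_weight_noise(weights, std, rng):
    """Return a noisy copy of the weights; the originals are left untouched."""
    return [w + rng.gauss(0.0, std) for w in weights]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
weights = [0.5, -0.3, 0.8]        # hypothetical layer weights
noisy = add_weight_noise(weights, std=0.01, rng=rng)

# During training, the forward/backward pass would use `noisy`,
# while the gradient update is applied to the clean `weights`.
```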

## Datasets

3 benchmark datasets:
- MSR Action3D Dataset
- Berkeley Multimodal Human Action Dataset (Berkeley MHAD)
- Motion Capture Dataset HDM05

# [2 &mdash; Differential Recurrent Neural Networks for Action Recognition (2015)](https://arxiv.org/abs/1504.06678)
---

## Main Idea &mdash; Not Every Bit of Video Data is Useful

LSTMs are great for gesture recognition, but in their basic ("vanilla") form they integrate over the entire sequence that's fed into them and learn everything depicted in that sequence, even parts that contain little information about the gesture at hand. The authors propose an LSTM model that learns to notice the important dynamic characteristics ("salient dynamic patterns") of performed gestures and to recognize them.

The proposed differential RNN (dRNN) uses the Derivative of States (DoS), which is sensitive to the spatio-temporal structure of actions and thus enables the network to model the dynamics of movements.

#### Important: truncated backpropagation is used in order to evade the exploding/vanishing gradient problem. Before the errors are propagated into the DoS nodes, they're truncated.

#### Past techniques: HOG3D, 3D-SIFT, actionlet ensemble

#### Important feat:
![alt text](./images/rnn_vs_lstm.png)

## Derivative of States

![alt text](./images/dRNN.png)

Internal states $s_t$ of the LSTM memory cells depend on the characteristics of the input signal - LSTMs remember the signal that passed through them. The Derivative of (internal) States expresses the magnitude of the information change that occurred between two moments in time, enabling us to assess the dynamics of the movement that occurred (large, abrupt movement -> large DoS -> important to incorporate into the model (open the gate) and vice versa).

With DoS we have a parameter that tells us whether the incoming sequence is worth remembering or not and enables the network to "focus" on parts of the sequence that contain much of the important (movement) information.

Important to note: the derivative of the inputs *is not used* as the measure of movement dynamics that controls the gate units, because it would amplify input noise! DoS differentiates the states, which have accumulated over time and have already filtered out the raw details of the input signal.

In this paper, only two orders of DoS are considered (for assessing velocity and acceleration).
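
Since the states are only available at discrete time steps, the derivatives have to be approximated. The sketch below assumes simple backward differences: first order is $s_t - s_{t-1}$ and second order is $s_t - 2s_{t-1} + s_{t-2}$. The state values are made up, and the function name is mine.

```python
def derivative_of_states(states, order):
    """Discrete n-th order difference of a sequence of state vectors."""
    current = states
    for _ in range(order):
        current = [
            [b - a for a, b in zip(prev, nxt)]
            for prev, nxt in zip(current, current[1:])
        ]
    return current


# Toy internal states s_t (2 memory cells, 4 time steps); values are made up.
states = [[0.0, 1.0], [1.0, 1.0], [3.0, 1.0], [6.0, 1.0]]

velocity = derivative_of_states(states, order=1)       # s_t - s_{t-1}
acceleration = derivative_of_states(states, order=2)   # s_t - 2*s_{t-1} + s_{t-2}
```

The first memory cell changes faster and faster, so its first-order DoS grows over time, while the second cell is constant and yields zero at both orders. In the dRNN these values feed the gates, so a static cell would contribute nothing to opening them.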

## Datasets

- KTH Dataset (*de facto* benchmark for evaluating action recognition algorithms?)
- MSR Action3D

## Details

The authors used HOG3D to extract features from the KTH video sequences (a 56k-dimensional feature vector per frame, reduced through PCA to a 450-dimensional vector that serves as the dRNN input). The approach is therefore a hybrid: traditional feature engineering (HOG3D + PCA) combined with machine learning (a modified LSTM).


# [3 &mdash; Co-occurrence Feature Learning for Skeleton based Action Recognition using Regularized Deep LSTM Networks (2016)](https://arxiv.org/abs/1603.07772)
---

The authors propose a deep LSTM network for skeleton-based action recognition. The idea is that the co-occurrence of joints characterizes the actions the skeleton/human is making. The complete skeleton is taken as the input at each time step, and the authors propose a variant of dropout in order to train the LSTM network effectively. This dropout operates simultaneously on the gates, cells and output responses.

### Interesting fact: SDK for Kinect V2 can directly generate accurate skeletons in real-time.

## Main Idea - two aspects of the motion recognition problem

Human body structure can be simplified and easily represented by a skeleton. Since we already have systems that can generate skeletons for moving bodies in real-time, we just have to figure out how to use these skeletons to identify the movement a person behind the skeleton is making. 

In this work, the authors have identified two aspects of the skeleton-based motion recognition problem:
1. intra-frame aspect - get (robust and discriminative) features from the skeleton contained in a single frame,
2. inter-frame aspect - model action dynamics by analysing temporal dependencies between frames.

As their framework, the authors opted for a deep LSTM network with strong regularization.

### Interesting detail: actionlet ensemble model

An actionlet is a subset of joints that are actively used while performing an action. The catch is how to make the network learn relevant features that distinguish the joint groups across movements.

## Network Architecture

![alt text](./images/deep_LSTM.png)

## Dropout treatment and co-occurrence

Dropout is very effective in deep CNNs, but it's of limited use in RNNs since it can lead to deletion of information that's been retained through time (the whole point of RNNs - networks that remember what happened).

The authors propose a "custom" dropout and use it only on the last (third) LSTM layer of their network. Since dropout can hurt an LSTM's performance by deleting retained information, the authors allow dropout to act along the network's layers, but not along the time axis. In short: yes, drop out certain cells on the way between layers, but don't forget anything that has happened in time.
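
The "along the layers, not along time" idea can be illustrated with a deliberately simplified sketch. It uses a one-unit tanh RNN instead of an LSTM, and it applies dropout only to the value passed up to the next layer, which is an assumption made to keep the example short; the paper's variant acts inside the LSTM unit on gates, cells and outputs. Weights and inputs are made up.

```python
import math
import random


def run_layer(inputs, w_in, w_rec, drop_prob, rng):
    """Run a one-unit tanh RNN over a sequence.

    Returns (values passed up to the next layer, hidden states kept through time).
    """
    h = 0.0
    upward, hidden = [], []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)    # recurrent path: never dropped
        hidden.append(h)
        keep = rng.random() >= drop_prob        # dropout only on the upward path
        upward.append(h / (1.0 - drop_prob) if keep else 0.0)
    return upward, hidden


sequence = [0.5, -1.0, 0.25, 0.8]               # made-up input sequence
_, hidden_clean = run_layer(sequence, 0.7, 0.9, drop_prob=0.0, rng=random.Random(1))
up_drop, hidden_drop = run_layer(sequence, 0.7, 0.9, drop_prob=0.5, rng=random.Random(1))

# The memory carried through time is identical with and without dropout.
memory_preserved = hidden_clean == hidden_drop
```

Because the dropout mask never touches `h` itself, the state that flows from one time step to the next is unaffected; only what the next layer sees is randomly zeroed (and rescaled when kept).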

Also, the authors use a special co-occurrence regularization in order to enforce the learning of certain interdependencies. This regularization is not specified as a special (handcrafted) feature, but as a part of the learning mechanism. The authors claim that co-occurrence regularization helps the network learn the human body parts automatically, and they point out the importance of a fully connected architecture.

![alt text](./images/deep_LSTM_dropout.png)

## Deep LSTM VS Hierarchical RNNs

The authors criticize the hierarchical RNN model as inferior (measured on the SBU Kinect dataset), since their "vanilla" deep LSTM model performs 5.6% better from the start. They attribute this difference to the fact that their network learned everything by itself, without an imposed hierarchical structure that could inhibit the learning of certain correlations.

## Datasets

Deep LSTM model is validated on ... datasets:

- SBU kinect interaction dataset
- HDM05 dataset
- CMU dataset
- Berkeley MHAD



# [4 &mdash; Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition (2016)](https://arxiv.org/abs/1607.07043)
---

## Main Idea - learn from spatial AND temporal data

The authors propose a spatio-temporal LSTM (ST-LSTM) that encodes spatial relations between joints within a single frame (spatial aspect) and exploits the temporal character of the movements that evolve through multiple frames (temporal aspect). It is important to note that every ST-LSTM unit corresponds to a single skeleton joint and the assumption is that joints are arranged in a chain-like sequence.

![alt text](./images/ST-LSTM.png)

## Human Skeleton - Tree-like Graph

![alt text](./images/skeleton_tree.png)

At each step of the ST-LSTM, a single joint from a single frame is fed to the network (hence, the number of steps should correspond to the number of joints in a frame). This results in a much smaller number of model parameters (the authors consider this a weight-sharing regularization feature) and leads to better generalization when training examples are limited.
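
The resulting two-dimensional recurrence can be sketched as follows. This is a toy version under stated assumptions: a plain tanh unit replaces the LSTM cell, each joint is a single scalar instead of 3D coordinates, and the three weights are arbitrary. What it keeps is the step order (one joint of one frame at a time) and the two sources of context (previous joint, previous frame).

```python
import math


def st_recurrence(frames, w_x=0.5, w_space=0.3, w_time=0.3):
    """Toy 2-D recurrence over (frame t, joint j) with scalar joint features.

    The state at (t, j) depends on the previous joint in the same frame
    and on the same joint in the previous frame. The same three weights
    are reused at every step (weight sharing).
    """
    num_joints = len(frames[0])
    h = [[0.0] * num_joints for _ in frames]
    for t, frame in enumerate(frames):
        for j, x in enumerate(frame):
            h_space = h[t][j - 1] if j > 0 else 0.0   # spatial context
            h_time = h[t - 1][j] if t > 0 else 0.0    # temporal context
            h[t][j] = math.tanh(w_x * x + w_space * h_space + w_time * h_time)
    return h


# 2 frames x 3 joints of made-up scalar features.
frames = [[0.1, 0.2, 0.3], [0.2, 0.1, 0.4]]
hidden = st_recurrence(frames)

# Changing the first joint of the first frame affects the last state as well.
hidden_changed = st_recurrence([[0.9, 0.2, 0.3], [0.2, 0.1, 0.4]])
```

Note that the parameter count does not grow with the number of joints or frames, which is the weight-sharing effect mentioned above.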

The proposed architecture can also gain depth if needed.

![alt text](./images/deep_st_lstm.png)


## Trust Gate

Since the depth estimation of sensors is susceptible to noise and occlusion, the authors propose an additional *trust* gate for the ST-LSTM that analyzes the reliability of the sensor input. The idea is inspired by work in NLP that predicts the next word from the network's representation of the previous words. The approach works here because skeleton joints often move together in an articulated (predictable) manner.
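
One way to turn "compare the input with what the context predicts" into a gate value is sketched below. The Gaussian-shaped function, the parameter `lam` and all numbers are assumptions chosen for illustration; in the actual model both the prediction and the input representation would be learned, not given.

```python
import math


def trust_gate(observed, predicted, lam=0.5):
    """Per-dimension trust in (0, 1]: equal to 1 when input matches the prediction."""
    return [math.exp(-lam * (o - p) ** 2) for o, p in zip(observed, predicted)]


predicted = [0.2, 0.1, 0.4]        # what the context suggests the joint looks like
clean_input = [0.22, 0.1, 0.38]    # close to the prediction -> high trust
noisy_input = [0.2, 2.5, 0.4]      # one corrupted coordinate -> low trust there

trust_clean = trust_gate(clean_input, predicted)
trust_noisy = trust_gate(noisy_input, predicted)
```

A low trust value would then reduce how much the new input is written into the memory cell, so the network leans on its spatial and temporal context instead.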

The trust gate concept is general and can be used in other applications to deal with unreliable data!

## Datasets

The proposed model was evaluated on four datasets:
- NTU RGB+D (as of this writing, the largest depth-based action dataset - 4 million frames)
- SBU Interaction
- UT-Kinect
- Berkeley MHAD

## Implementation Details

Each video sequence is divided into $T$ sub-sequences of equal length, and one frame is randomly selected from each sub-sequence. This adds randomness to the data generation and improves generalization, and it works better than uniformly sampled frames. $T = 20$ was found to be the optimal value.
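
The sampling scheme is easy to sketch. The helper name `sample_frames` and the sequence length of 137 frames are hypothetical; when the length is not divisible by the number of sub-sequences, the chunks here are only approximately equal, which is an assumption about a detail the notes do not cover.

```python
import random


def sample_frames(num_frames, num_segments, rng):
    """Pick one random frame index from each of `num_segments` equal chunks."""
    if num_frames < num_segments:
        raise ValueError("sequence is shorter than the number of segments")
    indices = []
    for k in range(num_segments):
        start = k * num_frames // num_segments
        end = (k + 1) * num_frames // num_segments   # exclusive
        indices.append(rng.randrange(start, end))
    return indices


rng = random.Random(0)
picked = sample_frames(num_frames=137, num_segments=20, rng=rng)
```

Calling `sample_frames` again on the same video gives a different set of 20 frames, so each epoch sees a slightly different version of every sequence while the temporal order is preserved.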

## Comments on the Results

The obtained results suggest that how the data is fed into the network (in the semantic sense) really matters: there is a 3.5% jump in accuracy on NTU RGB+D (the difference between Joint Chain and Tree Traversal on Cross Subject, Table 1) just because of the Tree Traversal of the skeleton!
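
To see why the visiting order matters, compare a plain joint chain with a depth-first walk over the skeleton tree. The sketch below uses a tiny hypothetical skeleton with made-up joint names, and the exact traversal rule (record the parent again after returning from each child) is my assumption about how such a walk keeps neighbouring steps physically connected.

```python
# Toy skeleton tree (hypothetical joint names, far fewer joints than a real skeleton).
skeleton = {
    "spine": ["head", "left_hand", "right_hand"],
    "head": [],
    "left_hand": [],
    "right_hand": [],
}


def traverse(tree, joint):
    """Depth-first walk that records a joint again each time it returns to it."""
    order = [joint]
    for child in tree[joint]:
        order.extend(traverse(tree, child))
        order.append(joint)       # step back to the parent before the next branch
    return order


chain_order = list(skeleton)                 # a plain joint chain
tree_order = traverse(skeleton, "spine")     # consecutive entries are connected joints
```

In `chain_order`, "head" is followed directly by "left_hand" although the two are not connected by a bone. In `tree_order` every consecutive pair is a parent and its child, at the cost of a longer sequence (2n - 1 steps for n joints).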

Also, this network does not use the Savitzky-Golay filter to handle the noise in the joint data!

# [5 &mdash; NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis (2016)](https://arxiv.org/abs/1604.02808)
---

## Main Idea - Large Dataset for RGB+D Human Action Recognition + Part-Aware LSTM

![alt text](./images/gesture_recognition_datasets.png)

![alt text](./images/p_lstm.png)