# [Action bank](http://www.cse.buffalo.edu/~jcorso/r/actionbank/) evaluation on the [UCF101](http://crcv.ucf.edu/data/UCF101.php) dataset

In [1]:
import sys
import os
import time
# Add code dir to PYTHONPATH          
sys.path.append("../../3rdparty/action_bank/base")
sys.path.append("../scripts")

In [2]:
%pylab inline
import time
import pandas
import numpy
import seaborn

from pprint import pprint
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.grid_search import GridSearchCV
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.cross_validation import cross_val_score, StratifiedKFold
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

seaborn.set_palette("deep", desat=0.6)
seaborn.set_context(rc={'figure.figsize': (15, 12)})

from load_data import read_cross_data, banked_suffix
from conf_plot import plot_confusion_mat

Populating the interactive namespace from numpy and matplotlib


In [3]:
%%bash
rm -rf ../data/ucf101_bank/
tar -xf ../../3rdparty/action_bank/ucf101_bank.tar.gz -C ../data/

In [2]:
%%bash
rm -rf ../data/ucf101-splits/
wget http://crcv.ucf.edu/data/UCF101/UCF101TrainTestSplits-RecognitionTask.zip -P ../data/
unzip ../data/UCF101TrainTestSplits-RecognitionTask.zip -d ../data/
mv ../data/ucfTrainTestlist/ ../data/ucf101-splits/
rm -f ../data/UCF101TrainTestSplits-RecognitionTask.zip

In [5]:
import shutil
x = %pwd
abs_path = x[:x.rfind("/")]
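def split_ucf101_data(txt_file, train_test='test'):
    """Move the banked feature files listed in a UCF101 split file.

    Parameters
    ----------
    txt_file : str
        Path to a UCF101 split list (e.g. ``testlist01.txt``).
    train_test : str
        Name of the sub-directory of ``ucf101_bank`` that the files are
        moved into, either ``'train'`` or ``'test'``.

    Returns
    -------
    None
        Files are moved on disk; missing files are reported and skipped.
    """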
    # Read the split file line by line
    with open(txt_file) as f:
        for line in f:
            # Drop the trailing class index (if any) and the line ending,
            # then append the banked-feature suffix
            temp = line.rsplit(" ", 1)[0]
            filename = temp.rsplit("\n", 1)[0].rsplit("\r", 1)[0] + banked_suffix

            out_dir = abs_path + "/data/ucf101_bank/"+train_test+"/"+filename.split("/")[0]
            if not os.path.exists(out_dir):
                os.makedirs(out_dir)

            if os.path.isfile(abs_path + "/data/ucf101_bank/"+filename):
                shutil.move(abs_path + "/data/ucf101_bank/"+filename,
                            abs_path + "/data/ucf101_bank/"+train_test+"/"+filename)
            else:
                print "file: {} -- does not exist, file not moved".format(filename)

In [6]:
# Move test files
print "Moving test-set files ..."
split_ucf101_data("../data/ucf101-splits/testlist01.txt", 'test')

# Move train files
print "Moving train-set files ..."
split_ucf101_data("../data/ucf101-splits/trainlist01.txt", 'train')

Moving test-set files ...
Moving train-set files ...


## Purpose
The purpose of this notebook is to evaluate [Action bank](http://www.cse.buffalo.edu/~jcorso/r/actionbank/) features for `action recognition` using the [UCF101](http://crcv.ucf.edu/data/UCF101.php) dataset.

### Action Bank 
The [Action bank](http://www.cse.buffalo.edu/~jcorso/r/actionbank/) algorithm generates spatio-temporal features that capture characteristic shape and motion from video data. A video is represented as the collected output of many action detectors, each of which produces a correlation volume. Template-based action detectors are the primary element of action bank recognition.

### Action recognition pipeline 
The standard framework for evaluating spatio-temporal features for action recognition involves constructing a **BoW** histogram of spatio-temporal features before applying a classifier. The spatio-temporal features are first quantized into visual words, and a video is then represented as the frequency histogram over the visual words.
<img src="../images/vid_bow.png" alt="action_pipeline" style="width: 600px;"/>
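The quantisation step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `squared_distance` and `bow_histogram`, the 3-word vocabulary, and the toy descriptors are all hypothetical and are not part of this notebook or of the Action bank code.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bow_histogram(descriptors, vocabulary):
    """Assign each descriptor to its nearest visual word and count the words.

    Returns a frequency histogram normalised to sum to 1.
    """
    counts = Counter(
        min(range(len(vocabulary)),
            key=lambda k: squared_distance(d, vocabulary[k]))
        for d in descriptors
    )
    total = float(len(descriptors))
    return [counts[k] / total for k in range(len(vocabulary))]


# Hypothetical toy data: a 3-word vocabulary and 4 local descriptors of one video.
vocabulary = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
descriptors = [(1.0, 0.5), (9.0, 11.0), (0.2, -0.3), (8.5, 9.0)]
histogram = bow_histogram(descriptors, vocabulary)
```

Two descriptors fall closest to the first word and two to the second, so the histogram puts half of its mass on each of those words and none on the third.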

[Wang2009](http://www.irisa.fr/vista/Papers/2009_bmvc_wang.pdf) constructed vocabularies using *k*-means clustering and set the number of visual words $V$ to $4000$. Features are assigned to their closest vocabulary word using Euclidean distance, and the resulting histograms of visual word occurrences are used as video sequence representations. To increase precision, [Wang2009](http://www.irisa.fr/vista/Papers/2009_bmvc_wang.pdf) initialised *k*-means 8 times and kept the result with the lowest error. To classify the features, a non-linear support vector machine with a $\chi^2$-kernel is used.
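For reference, a $\chi^2$-kernel between two histograms can be sketched as below. The form shown is the exponential one documented for `sklearn.metrics.pairwise.chi2_kernel`, which this notebook imports; whether [Wang2009](http://www.irisa.fr/vista/Papers/2009_bmvc_wang.pdf) used exactly this normalisation is not asserted here, and the function name `chi2_kernel_value` and the `gamma` default are illustrative assumptions.

```python
import math


def chi2_kernel_value(h1, h2, gamma=1.0):
    """Exponential chi-squared kernel between two non-negative histograms.

    k(h1, h2) = exp(-gamma * sum_i (h1_i - h2_i)^2 / (h1_i + h2_i)),
    skipping bins where both histograms are zero.
    """
    distance = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:
            distance += (a - b) ** 2 / float(a + b)
    return math.exp(-gamma * distance)


# Identical histograms give the maximum similarity of 1.
same = chi2_kernel_value([0.5, 0.5, 0.0], [0.5, 0.5, 0.0])
# Histograms with no overlapping mass give a much smaller value.
different = chi2_kernel_value([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

The kernel only makes sense for non-negative inputs such as histograms, which is one reason it pairs naturally with a **BoW** representation.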

**Date: 06-02-2016**

## Experimental Evaluation
### Building [Action bank](http://www.cse.buffalo.edu/~jcorso/r/actionbank/) templates
The [action bank](http://www.cse.buffalo.edu/~jcorso/r/actionbank/) templates used for all experiments consist of 205 template actions, and six action templates (e.g. `clap4`, `violin6`, `soccer3`, `jog right4`, `polevault4`, `ski4`, `basketball2`, and `hula4`) extracted from the UCF51 dataset. The action templates have an average spatial resolution of approximately $50\times 120$ pixels and a temporal length of $40$ to $50$ frames; each template is cropped spatially to cover the extent of the human motion within it.

### Evaluation
The [UCF101](http://crcv.ucf.edu/data/UCF101.php) dataset contains 3 train-test splits. This experiment uses `split1`. Instead of the $\chi^2$-kernel, a `LinearSVM` is used, as in the paper. The evaluation consists of four experimental pipelines:

1. Features - `LinearSVM`
2. Features - Standard normalisation - `LinearSVM`
3. Features - Standard normalisation - KMeans **BoW** - `LinearSVM`
4. Features - Standard normalisation - PCA - `LinearSVM`

An `accuracy` score is used as the performance measure.
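As a reminder of what this measure computes, here is a minimal sketch of accuracy as the fraction of correctly labelled clips. The notebook itself relies on `sklearn.metrics.accuracy_score`; the `accuracy` helper and the five toy labels below are hypothetical and only illustrate the definition.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))


# Hypothetical labels for five clips; four of the five predictions are right.
score = accuracy([0, 1, 2, 2, 1], [0, 1, 2, 1, 1])
```

Because the measure is averaged over clips rather than over classes, classes with more test clips weigh more heavily in the final figure.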

## Experimental Results

### Examining the features
The [Action bank](http://www.cse.buffalo.edu/~jcorso/r/actionbank/) features on the [UCF101](http://crcv.ucf.edu/data/UCF101.php) dataset were extracted beforehand using the code provided [here](http://www.cse.buffalo.edu/~jcorso/r/actionbank/). The [Action bank](http://www.cse.buffalo.edu/~jcorso/r/actionbank/) features lie in the range $(0, 255)$.

In [7]:
sys.stdout.write("\rLoading UCF101 action bank train data")
X, Y, folders = read_cross_data('../data/ucf101_bank/train/')

Loading UCF101 action bank train datavector length is 14965


In [8]:
sys.stdout.write("Max summary statistics for the UCF101 train data")
pandas.DataFrame(X).describe().max(axis=1)

Max summary statistics for the UCF101 train data

count    9537.00000
mean      224.98228
std        86.65980
min        75.00000
25%       218.00000
50%       238.00000
75%       246.00000
max       255.00000
dtype: float64

In [9]:
sys.stdout.write("\rThe UCF101 action bank train data has:\nSamples\t\t{0}"\
                 "\nFeatures\t{1}\nLabels\t\t{2}".format(X.shape[0], X.shape[1], folders))

The UCF101 action bank train data has:
Samples		9537
Features	14965
Labels		['ApplyEyeMakeup', 'ApplyLipstick', 'Archery', 'BabyCrawling', 'BalanceBeam', 'BandMarching', 'BaseballPitch', 'Basketball', 'BasketballDunk', 'BenchPress', 'Biking', 'Billiards', 'BlowDryHair', 'BlowingCandles', 'BodyWeightSquats', 'Bowling', 'BoxingPunchingBag', 'BoxingSpeedBag', 'BrushingTeeth', 'CleanAndJerk', 'CliffDiving', 'CricketBowling', 'CricketShot', 'CuttingInKitchen', 'Diving', 'Drumming', 'Fencing', 'FieldHockeyPenalty', 'FloorGymnastics', 'FrisbeeCatch', 'FrontCrawl', 'GolfSwing', 'Haircut', 'Hammering', 'HammerThrow', 'HandstandPushups', 'HeadMassage', 'HighJump', 'HorseRace', 'HorseRiding', 'HulaHoop', 'IceDancing', 'JavelinThrow', 'JugglingBalls', 'JumpingJack', 'JumpRope', 'Kayaking', 'Knitting', 'LongJump', 'Lunges', 'MilitaryParade', 'Mixing', 'MoppingFloor', 'Nunchucks', 'PizzaTossing', 'PlayingCello', 'PlayingDaf', 'PlayingDhol', 'PlayingFlute', 'PlayingGuitar', 'PlayingPiano', 'PlayingSi

In [10]:
sys.stdout.write("\rLoading UCF101 action bank test data")
X_test, Y_test, _ = read_cross_data('../data/ucf101_bank/test/')

Loading UCF101 action bank test datavector length is 14965


In [11]:
sys.stdout.write("\rThe UCF101 action bank test data has:\nSamples\t\t{0}"\
                 "\nFeatures\t{1}".format(X_test.shape[0], X_test.shape[1]))

The UCF101 action bank test data has:
Samples		3783
Features	14965

### Features - LinearSVM pipeline
The raw extracted features are passed to the `LinearSVM` without any preprocessing.

In [12]:
print "Training model ..."    

## Model train and test
pl = Pipeline(steps=[
        ('SVM', SVC(kernel='linear'))
    ])

print "pipeline:", [name for name, _ in pl.steps]
    
start = time.time()
pl.fit(X=X, y=Y)
print "Time taken {0}".format(time.strftime("%H:%M:%S", time.gmtime(time.time() - start)))
                     
cls_01 = pl
start = time.time()
pred_01= pl.predict(X=X_test)
print "Time taken {0}".format(time.strftime("%H:%M:%S", time.gmtime(time.time() - start))) 

Training model ...
pipeline: ['SVM']
Time taken 00:11:41
Time taken 00:07:29


In [13]:
print classification_report(y_pred=pred_01, y_true=Y_test)

             precision    recall  f1-score   support

        0.0       0.18      0.20      0.19        44
        1.0       0.25      0.38      0.30        32
        2.0       0.00      0.00      0.00        41
        3.0       0.06      0.11      0.08        35
        4.0       0.69      0.58      0.63        31
        5.0       0.24      0.28      0.26        43
        6.0       0.54      0.49      0.51        43
        7.0       0.25      0.34      0.29        35
        8.0       0.50      0.73      0.59        37
        9.0       0.65      0.69      0.67        48
       10.0       0.32      0.42      0.36        38
       11.0       0.95      0.88      0.91        40
       12.0       0.14      0.13      0.14        38
       13.0       0.20      0.39      0.27        33
       14.0       0.49      0.63      0.55        30
       15.0       0.57      0.58      0.57        43
       16.0       0.22      0.29      0.25        49
       17.0       0.46      0.35      0.40   

In [14]:
print "Average accuracy on the UCF101 dataset: {0:.4f}".format(accuracy_score(y_true=Y_test, y_pred=pred_01) * 100.)

Average accuracy on the UCF101 dataset: 38.6730


#### Observations
The accuracy of $38.67\%$ is below the baseline accuracy of $43.9\%$ achieved by [Soomro2012](http://crcv.ucf.edu/papers/cvpr2009_liu1.pdf), and well below the best $76.95\%$ `flow` model accuracy achieved by [Donahue2014](http://arxiv.org/pdf/1411.4389v3.pdf). [Donahue2014](http://arxiv.org/pdf/1411.4389v3.pdf) were able to increase their performance to $86.4\%$ by weighting their `RGB` and `flow` models. The difference in performance highlights the difference between hard-wired feature extraction and learned feature extraction.

### Features - Standard normalisation - LinearSVM pipeline
The features are first standardised: each feature is centred on its training-set `mean` and scaled to unit `variance`.
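The key point of this step is that the statistics are estimated on the training data only and then reused unchanged on the test data, which is what `StandardScaler` does inside a `Pipeline`. The plain-Python sketch below illustrates that idea; the helper names `fit_standardiser` and `standardise` and the toy values are hypothetical.

```python
def fit_standardiser(rows):
    """Return per-feature (means, stds) estimated from the training rows."""
    n = float(len(rows))
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = []
    for col, mu in zip(columns, means):
        variance = sum((v - mu) ** 2 for v in col) / n
        # Constant features get a scale of 1 to avoid dividing by zero.
        stds.append(variance ** 0.5 or 1.0)
    return means, stds


def standardise(rows, means, stds):
    """Apply training-set statistics to any set of rows."""
    return [[(v - mu) / sd for v, mu, sd in zip(row, means, stds)]
            for row in rows]


# Hypothetical toy features in the 0-255 range (2 features, 3 training clips).
train = [[200.0, 100.0], [220.0, 100.0], [240.0, 100.0]]
test = [[220.0, 130.0]]

means, stds = fit_standardiser(train)
train_std = standardise(train, means, stds)
test_std = standardise(test, means, stds)
```

Note that the test clip is shifted and scaled with the training means and deviations, so its standardised values need not have zero mean or unit variance.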

In [15]:
print "Training model ..."       

## Model train and test
pl = Pipeline(steps=[
        ('StdScaler', StandardScaler(with_mean=True, with_std=True)),
        ('SVM', SVC(kernel='linear'))
    ])

print "pipeline:", [name for name, _ in pl.steps]
    
start = time.time()
pl.fit(X=X.astype('float32'), y=Y)
print "Time taken {0}".format(time.strftime("%H:%M:%S", time.gmtime(time.time() - start)))
                     
cls_02 = pl
start = time.time()
pred_02= pl.predict(X=X_test.astype('float32'))
print "Time taken {0}".format(time.strftime("%H:%M:%S", time.gmtime(time.time() - start)))

Training model ...
pipeline: ['StdScaler', 'SVM']
Time taken 00:12:01
Time taken 00:07:32


In [16]:
print classification_report(y_pred=pred_02, y_true=Y_test)

             precision    recall  f1-score   support

        0.0       0.15      0.20      0.17        44
        1.0       0.24      0.38      0.29        32
        2.0       0.04      0.02      0.03        41
        3.0       0.12      0.17      0.14        35
        4.0       0.59      0.55      0.57        31
        5.0       0.20      0.26      0.22        43
        6.0       0.56      0.47      0.51        43
        7.0       0.30      0.40      0.35        35
        8.0       0.57      0.76      0.65        37
        9.0       0.74      0.73      0.74        48
       10.0       0.33      0.39      0.36        38
       11.0       0.91      0.80      0.85        40
       12.0       0.17      0.18      0.18        38
       13.0       0.17      0.36      0.23        33
       14.0       0.55      0.70      0.62        30
       15.0       0.52      0.53      0.53        43
       16.0       0.19      0.20      0.20        49
       17.0       0.41      0.32      0.36   

In [17]:
print "Average accuracy on the UCF101 dataset: {0:.4f}".format(accuracy_score(y_true=Y_test, y_pred=pred_02) * 100.)

Average accuracy on the UCF101 dataset: 39.9154


#### Observations
Standardising the features increases the accuracy to $39.92\%$, which is still below the baseline accuracy of $43.9\%$ achieved by [Soomro2012](http://crcv.ucf.edu/papers/cvpr2009_liu1.pdf). Looking at the classification report, `label` $44$ (`JugglingBalls`) and $11$ (`Biking`) have a very high `f1-score`.
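For readers comparing rows of the classification report, the `f1-score` column is the harmonic mean of the precision $P$ and recall $R$ of a class:

$$F_1 = \frac{2PR}{P + R}$$

As a rough check using the rounded values printed for label $4$ in the report above ($P = 0.59$, $R = 0.55$):

$$F_1 \approx \frac{2 \times 0.59 \times 0.55}{0.59 + 0.55} = \frac{0.649}{1.14} \approx 0.57$$

which agrees with the reported value. Because the inputs are already rounded to two decimals, such a check is only approximate.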

### Features - standard normalisation - KMeans BoW - LinearSVM pipeline
The features are first standardised using the training-set `mean` and `variance`, then a `kmeans` **BoW** model with a dictionary of $4000$ visual words is created before a `LinearSVM` is applied.

In [18]:
print "Training model ..."   

## Model train and test
pl = Pipeline(steps=[
        ('StdScaler', StandardScaler(with_mean=True)),
        ('KMeansCluster', KMeans(n_clusters=4000, n_init=8, n_jobs=8)),
        ('SVM', SVC(kernel='linear'))
    ])

print "pipeline:", [name for name, _ in pl.steps]
    
start = time.time()
pl.fit(X=X.astype('float32'), y=Y)
print "Time taken {0}".format(time.strftime("%H:%M:%S", time.gmtime(time.time() - start)))
                     
cls_03 = pl
start = time.time()
pred_03 = pl.predict(X=X_test.astype('float32'))
print "Time taken {0}".format(time.strftime("%H:%M:%S", time.gmtime(time.time() - start)))

Training model ...
pipeline: ['StdScaler', 'KMeansCluster', 'SVM']
Time taken 04:08:47
Time taken 00:02:05


In [19]:
print classification_report(y_pred=pred_03, y_true=Y_test)

             precision    recall  f1-score   support

        0.0       0.13      0.25      0.17        44
        1.0       0.11      0.19      0.14        32
        2.0       0.04      0.02      0.03        41
        3.0       0.08      0.14      0.10        35
        4.0       0.42      0.35      0.39        31
        5.0       0.16      0.26      0.19        43
        6.0       0.34      0.33      0.33        43
        7.0       0.11      0.11      0.11        35
        8.0       0.34      0.59      0.44        37
        9.0       0.59      0.67      0.63        48
       10.0       0.22      0.32      0.26        38
       11.0       0.78      0.70      0.74        40
       12.0       0.06      0.08      0.07        38
       13.0       0.21      0.39      0.27        33
       14.0       0.33      0.33      0.33        30
       15.0       0.38      0.35      0.36        43
       16.0       0.07      0.08      0.07        49
       17.0       0.37      0.35      0.36   

In [20]:
print "Average accuracy on the UCF101 dataset: {0:.4f}".format(accuracy_score(y_true=Y_test, y_pred=pred_03) * 100.)

Average accuracy on the UCF101 dataset: 28.9717


#### Observations
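Constructing a **BoW** model dramatically reduces the accuracy to $28.97\%$. One possible reason is that the `kmeans` initialisation was poor and the algorithm converged to a poor solution. Another is that each video is represented here by a single $14965$-dimensional vector, so the `KMeans` pipeline step replaces it with its distances to the $4000$ cluster centres rather than with a histogram over many local descriptors.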

### Features - standard normalisation - PCA - LinearSVM pipeline
The features are first standardised using the training-set `mean` and `variance`, then `PCA` is applied to reduce the feature space from $14965$ to $8000$ dimensions before applying a `LinearSVM` classifier.
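To make the projection step concrete, the sketch below implements PCA through a singular value decomposition with NumPy. It is an illustration only, not the `sklearn.decomposition.PCA` implementation used in the cell that follows; the helper names `pca_fit` and `pca_transform` and the random toy matrix are hypothetical.

```python
import numpy as np


def pca_fit(X):
    """Return the feature mean, principal axes and explained-variance ratios."""
    mean = X.mean(axis=0)
    # SVD of the centred data; rows of vt are the principal axes.
    _, singular_values, vt = np.linalg.svd(X - mean, full_matrices=False)
    variance = singular_values ** 2
    return mean, vt, variance / variance.sum()


def pca_transform(X, mean, axes, n_components):
    """Project X onto the first n_components principal axes."""
    return (X - mean).dot(axes[:n_components].T)


# Hypothetical toy data: 6 samples, 4 features, with feature 3 a copy of feature 0.
rng = np.random.RandomState(0)
X = rng.rand(6, 4)
X[:, 3] = X[:, 0]

mean, axes, ratio = pca_fit(X)
X_reduced = pca_transform(X, mean, axes, n_components=2)
```

The duplicated feature carries no extra information, so the last explained-variance ratio is numerically zero. More generally, the number of available components is at most the smaller of the number of samples and the number of features, which for the $9537 \times 14965$ training matrix here means at most $9537$.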

In [21]:
print "Training model ..."     

## Model train and test
pl = Pipeline(steps=[
        ('StdScaler', StandardScaler(with_mean=True, with_std=True)),
        ('PCA', PCA(n_components=8000)),
        ('SVM', SVC(kernel='linear'))
    ])

print "pipeline:", [name for name, _ in pl.steps]
    
start = time.time()
pl.fit(X=X.astype('float32'), y=Y)
print "Time taken {0}".format(time.strftime("%H:%M:%S", time.gmtime(time.time() - start)))
                     
cls_04 = pl
start = time.time()
pred_04 = pl.predict(X=X_test.astype('float32'))
print "Time taken {0}".format(time.strftime("%H:%M:%S", time.gmtime(time.time() - start)))

Training model ...
pipeline: ['StdScaler', 'PCA', 'SVM']
Time taken 00:19:19
Time taken 00:04:11


In [22]:
print classification_report(y_pred=pred_04, y_true=Y_test)

             precision    recall  f1-score   support

        0.0       0.15      0.20      0.17        44
        1.0       0.24      0.38      0.29        32
        2.0       0.04      0.02      0.03        41
        3.0       0.12      0.17      0.14        35
        4.0       0.59      0.55      0.57        31
        5.0       0.20      0.26      0.22        43
        6.0       0.56      0.47      0.51        43
        7.0       0.30      0.40      0.35        35
        8.0       0.57      0.76      0.65        37
        9.0       0.74      0.73      0.74        48
       10.0       0.33      0.39      0.36        38
       11.0       0.91      0.80      0.85        40
       12.0       0.17      0.18      0.18        38
       13.0       0.17      0.36      0.23        33
       14.0       0.55      0.70      0.62        30
       15.0       0.52      0.53      0.53        43
       16.0       0.19      0.20      0.20        49
       17.0       0.41      0.32      0.36   

In [23]:
print "Average accuracy on the UCF101 dataset: {0:.4f}".format(accuracy_score(y_true=Y_test, y_pred=pred_04) * 100.)

Average accuracy on the UCF101 dataset: 39.9154


#### Observations
Adding `PCA` to the standard normalisation pipeline resulted in an `accuracy` score of $39.92\%$, identical to the score achieved with the standard normalisation pipeline alone.

## Conclusion

The `notebook` examined [Action bank](http://www.cse.buffalo.edu/~jcorso/r/actionbank/) features for `action recognition` on the [UCF101](http://crcv.ucf.edu/data/UCF101.php) dataset using 4 different pipelines. `Action bank` features achieved their best `accuracy` score of $39.92\%$ using the `Features - Standard normalisation - LinearSVM` pipeline, a score matched by the `PCA` pipeline.

The accuracy of $39.92\%$ is below the baseline accuracy of $43.9\%$ achieved by [Soomro2012](http://crcv.ucf.edu/papers/cvpr2009_liu1.pdf), and well below the best $76.95\%$ `flow` model accuracy achieved by [Donahue2014](http://arxiv.org/pdf/1411.4389v3.pdf). [Donahue2014](http://arxiv.org/pdf/1411.4389v3.pdf) were able to increase their performance to $86.4\%$ by weighting their `RGB` and `flow` models. The difference in performance highlights the difference between hard-wired feature extraction and learned feature extraction.

[Action bank](http://www.cse.buffalo.edu/~jcorso/r/actionbank/) features don't scale well to `action recognition` datasets such as [UCF101](http://crcv.ucf.edu/data/UCF101.php), where the videos are recorded in unconstrained environments and some clips include camera motion, varying light conditions, and occlusions. This effect can also be seen in [Action bank](http://www.cse.buffalo.edu/~jcorso/r/actionbank/)'s performance on the [HMDB51](http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/) dataset, where an accuracy of $26.9\%$ was achieved.