#### Action Recognition @ UCF101  
**Due date: 11:59 pm on Dec. 11, 2018 (Tuesday)**

## Description
---
In this homework, you will perform action recognition using a Recurrent Neural Network (RNN), specifically a Long Short-Term Memory (LSTM) network. You will be given a dataset called UCF101, which consists of 101 different actions/classes, with 145 samples per action. We tagged each sample as either training or testing. Each sample is supposed to be a short video, but we sampled 25 frames from each video to reduce the amount of data. Consequently, a training sample is a tuple of a 3D volume, with one dimension encoding the *temporal correlation* between frames, and a label indicating which action it is.

To tackle this problem, we aim to build a neural network that can not only capture spatial information of each frame but also temporal information between frames. Fortunately, you don't have to do this on your own. RNN — a type of neural network designed to deal with time-series data — is right here for you to use. In particular, you will be using LSTM for this task.

Instead of training an end-to-end neural network from scratch, which is prohibitively expensive on CPUs, we divide the task into two steps: feature extraction and modelling. Below are the things you need to implement for this homework:
- **{35 pts} Feature extraction**. Use the pretrained VGG network to extract features from each frame. Specifically, we recommend using the activations of the first fully connected layer of `torchvision.models.vgg16` (4096 dim) as the features of each video frame. This will result in a 4096x25 matrix for each video.
    **Hints**: 
    - Use `scipy.io.savemat()` to save features to a '.mat' file and `scipy.io.loadmat()` to load them.
    - Normalize your images using `torchvision.transforms`:
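    ```
    from torchvision import transforms

    # ImageNet channel statistics expected by the pretrained VGG16 weights
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    prep = transforms.Compose([ transforms.ToTensor(), normalize ])
    prep(img)  # img: a PIL image or an HxWxC uint8 array
    ```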
    More details on image preprocessing in PyTorch can be found at http://pytorch.org/tutorials/beginner/data_loading_tutorial.html
    
- **{35 pts} Modelling**. With the extracted features, build an LSTM network which takes a 4096x25 sample as input, and outputs the action label of that sample.
- **{20 pts} Evaluation**. After training your network, you need to evaluate your model on the testing data by computing the prediction accuracy. Moreover, you need to compare the result of your network with that of a support vector machine (SVM), trained on the 4096x25 feature matrix stacked into one long vector.
- **{10 pts} Report**. Details regarding the report can be found in the submission section below.
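
The Modelling and Evaluation items use the same per-video feature matrix in two different ways: the LSTM reads it one frame at a time, while the SVM sees it as one long vector. The sketch below illustrates both views on a toy matrix, assuming the matrix is stored with one row per feature dimension and one column per frame. The helper names `to_time_major` and `flatten` are hypothetical and plain lists stand in for tensors. For reference, `torch.nn.LSTM` by default expects input shaped `(seq_len, batch, input_size)`, so the time-major view is the one that maps to `seq_len = 25` and `input_size = 4096`.

```python
def to_time_major(feature):
    """Turn a D x T matrix (one column per frame) into T rows of D values."""
    return [list(frame) for frame in zip(*feature)]


def flatten(feature):
    """Stack a D x T matrix row by row into one vector of length D*T."""
    return [value for row in feature for value in row]


# Toy matrix with D=3 feature dimensions and T=4 frames (stands in for 4096 x 25)
feature = [
    [0, 1, 2, 3],
    [10, 11, 12, 13],
    [20, 21, 22, 23],
]

sequence = to_time_major(feature)  # 4 time steps, each a 3-dim frame vector (LSTM view)
vector = flatten(feature)          # a single 12-dim vector (SVM view)

print(len(sequence), len(sequence[0]))  # 4 3
print(sequence[1])                      # [1, 11, 21]
print(len(vector))                      # 12
```

If the stored matrix turns out to be frames x features instead, the transpose is unnecessary and only the flattening order changes.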

Notice that the size of the raw images is 256x340, whereas VGG16 takes 224x224 images as inputs. Instead of resizing the images, which distorts the aspect ratio, we use a better solution: crop five 224x224 images at the image center and four corners, compute the 4096-dim VGG16 features for each of them, and average these five 4096-dim features to get the final feature representation for the raw image.
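
As a rough illustration of the five-crop scheme, the sketch below computes the crop boxes for a 256x340 image and averages one feature vector per crop. It is not the provided extraction script: `five_crop_boxes` and `average_features` are hypothetical helpers, and short toy vectors stand in for the 4096-dim VGG16 features. torchvision also ships a `transforms.FiveCrop` transform that produces the same five regions from an image.

```python
def five_crop_boxes(height, width, size):
    """Return (top, left, bottom, right) boxes for the four corners and the center."""
    if size > height or size > width:
        raise ValueError("crop size must not exceed the image size")
    top_c = (height - size) // 2
    left_c = (width - size) // 2
    origins = [
        (0, 0),                         # top-left
        (0, width - size),              # top-right
        (height - size, 0),             # bottom-left
        (height - size, width - size),  # bottom-right
        (top_c, left_c),                # center
    ]
    return [(t, l, t + size, l + size) for t, l in origins]


def average_features(vectors):
    """Element-wise mean of equally long feature vectors (one per crop)."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


boxes = five_crop_boxes(256, 340, 224)
print(boxes[4])  # center crop: (16, 58, 240, 282)

# Toy 3-dim "features" standing in for the five 4096-dim VGG16 vectors
print(average_features([[1.0, 2.0, 3.0]] * 4 + [[6.0, 7.0, 8.0]]))  # [2.0, 3.0, 4.0]
```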

To save you computation time, we did the feature extraction of most samples for you, except for class 1. For class 1, we provide you with the raw images, and you need to write code to extract the features of the samples in class 1. Instead of training on the whole dataset on CPUs, which may cost you several days, **use the first 15** classes of the whole dataset. The same applies to those who have access to GPUs.


## Dataset
Download the dataset at [UCF101](http://vision.cs.stonybrook.edu/~yangwang/public/UCF101_dimitris_course.zip). 

The dataset consists of the following two parts: video images and extracted features.

### 1. Video Images  

UCF101 dataset contains 101 actions and 13,320 videos in total.  

+ `annos/actions.txt`  
  + lists all the actions (`ApplyEyeMakeup`, .., `YoYo`)   
  
+ `annos/videos_labels_subsets.txt`  
  + lists all the videos (`v_000001`, .., `v_013320`)  
  + labels (`1`, .., `101`)  
  + subsets (`1` for train, `2` for test)  

+ `images_class1/`  
  + contains videos belonging to class 1 (`ApplyEyeMakeup`)  
  + each video folder contains 25 frames  


### 2. Video Features

+ `extract_vgg16_relu6.py`  
  + used to extract video features  
     + Given an image (size: 256x340), we get 5 crops (size: 224x224) at the image center and four corners. The `vgg16-relu6` features are extracted for all 5 crops and subsequently averaged to form a single feature vector (size: 4096).  
     + Given a video, we process its 25 images sequentially. In the end, each video is represented as a feature sequence (size: 4096 x 25).  
  + written in PyTorch; supports both CPU and GPU.  

+ `vgg16_relu6/`  
   + contains all the video features, EXCEPT those belonging to class 1 (`ApplyEyeMakeup`)  
   + you need to run the script `extract_vgg16_relu6.py` to complete the feature extraction process   


## Some Tutorials
- Good materials for understanding RNN and LSTM
    - http://blog.echen.me
    - http://karpathy.github.io/2015/05/21/rnn-effectiveness/
    - http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- Implementing RNN and LSTM with PyTorch
    - [LSTM with PyTorch](http://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#sphx-glr-beginner-nlp-sequence-models-tutorial-py)
    - [RNN with PyTorch](http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html)

In [32]:
# write your code here
import copy
from random import shuffle

import cv2
import numpy as np
import scipy.io
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

In [4]:
##Function to load the data from text files
def loadTextData(path, num_samples=2010):
    feature_train, feature_test = [], []
    label_train, label_test = [], []
    with open(path, 'r') as file:
        # Each line holds tab-separated fields: video name, label, subset
        all_data = [line.split("\t") for line in file.read().split("\n")]

    # Keep only the first num_samples entries (the subset used in this notebook), then shuffle
    data = all_data[:num_samples]
    shuffle(data)

    for row in data:
        name = row[0]
        label = int(row[1]) - 1   # shift labels from 1-based to 0-based
        subset = int(row[2])      # 1 = train, 2 = test
        if subset == 1:
            feature_train.append(name)
            label_train.append(label)
        elif subset == 2:
            feature_test.append(name)
            label_test.append(label)
    return feature_train, feature_test, label_train, label_test, data

path = 'UCF101_dimitris_course/UCF101_release/annos/videos_labels_subsets.txt'
feature_train1, feature_test1, label_train1, label_test1, data = loadTextData(path)
#print(label_train1)

In [18]:
##Function to get the data in minibatches
def getIndexPairs(features, step):
    ##Index pairs [start, end) for mini-batches of exactly `step` samples.
    ##A trailing partial batch (fewer than `step` samples) is dropped.
    indices = list(range(0, len(features) + 1, step))
    index_pairs = []
    for i in range(len(indices) - 1):
        index_pairs.append([indices[i], indices[i + 1]])
    return index_pairs
step = 7    
index_pairs = getIndexPairs(feature_train1, step)
index_pairs_test = getIndexPairs(feature_test1, step)
##print(index_pairs_test)
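
Because `getIndexPairs` only emits full batches, any samples left over after the last full batch are never loaded. The sketch below is one possible alternative, not part of the submitted solution: a hypothetical helper `batch_ranges` that can either drop or keep the shorter final batch. The 568 used in the example is the test-set size printed later in this notebook. Keeping the tail would also require a model that accepts a variable batch size, which the `LSTM` class here does not, since it reshapes with a fixed `batch_size`.

```python
def batch_ranges(num_samples, batch_size, keep_remainder=True):
    """Return (start, end) index pairs, end exclusive, covering the samples in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    ranges = []
    for start in range(0, num_samples, batch_size):
        end = min(start + batch_size, num_samples)
        if end - start < batch_size and not keep_remainder:
            break  # skip the shorter final batch
        ranges.append((start, end))
    return ranges


full_only = batch_ranges(568, 7, keep_remainder=False)
with_tail = batch_ranges(568, 7)

print(len(full_only), full_only[-1])  # 81 (560, 567)
print(len(with_tail), with_tail[-1])  # 82 (567, 568)
```

With a batch size of 7, dropping the remainder means 567 of the 568 test samples are evaluated.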

In [20]:
##Build mini-batches for the LSTM: each batch is a tensor of shape (batch_num, 1, 25*4096)
##with a matching LongTensor of batch_num labels
def getData(index_pairs, feature_train1, label_train1, path, batch_num):
    dim_1 = 25
    dim_2 = 4096
    channel_num = 1
    total_data = []
    total_label = []
    for start, end in index_pairs:
        mini_batch = []
        batch_label = []
        for j in range(start, end):
            # Each .mat file stores one video's features under the key 'Feature'
            file = scipy.io.loadmat(path + feature_train1[j] + '.mat')
            mini_batch.append(file['Feature'])
            batch_label.append(label_train1[j])
        mini_batch = np.array(mini_batch)
        mini_batch = torch.from_numpy(np.reshape(mini_batch, (batch_num, channel_num, dim_1*dim_2)))
        batch_label = np.reshape(np.array(batch_label), (batch_num))
        total_data.append(mini_batch)
        total_label.append(torch.LongTensor(batch_label))
    return total_data, total_label

batch_num = 7
path =  'UCF101_dimitris_course/UCF101_release/vgg16_relu6/'
input_data, label = getData(index_pairs, feature_train1, label_train1, path, batch_num)
input_data_test, label_test = getData(index_pairs_test, feature_test1, label_test1, path, batch_num)
print(input_data[0].shape)

torch.Size([7, 1, 102400])


In [7]:
##LSTM network defined using one hidden layer 
class LSTM(nn.Module):
    
    def __init__(self, hidden_dim, batch_size, label_size):
        super(LSTM, self).__init__()
        self.hidden_dim = hidden_dim
        self.batch_size= batch_size
        #self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(102400, hidden_dim)
        self.hidden2label = nn.Linear(hidden_dim, label_size)
        self.hidden = self.init_hidden()
    
    def init_hidden(self):
        return (torch.zeros(1, self.batch_size, self.hidden_dim),
                torch.zeros(1, self.batch_size, self.hidden_dim))
    
    def forward(self, input_features):
        # Flatten each sample and feed the batch as a sequence of length 1: (1, batch, 102400)
        inputs = input_features.view(self.batch_size,-1)
        lstm_out, self.hidden = self.lstm(inputs.view(1,self.batch_size,-1), self.hidden)
        outputs = self.hidden2label(lstm_out.view(self.batch_size,-1))
        # Log-probabilities, to be paired with nn.NLLLoss
        output_labels = F.log_softmax(outputs, dim=-1)
        return output_labels

In [10]:
##LSTM experiments with different hyperparameters; case 1: batch size 7, hidden dimension 256
Batch_size = 7
Hidden_dim = 256
Num_classes =15

model_1 = LSTM(Hidden_dim,Batch_size,Num_classes)
print(model_1)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model_1.parameters(), lr=0.1)

#Training
initial_loss = 0
for epoch in range(10):
    loss_count = 0.0
    for i in range(len(input_data)):
        feats = input_data[i]
        labels = label[i]
        model_1.zero_grad()
        model_1.hidden= model_1.init_hidden()
        outputs = model_1(feats)
        loss = loss_function(outputs, labels)
        loss.backward()
        optimizer.step()
        loss_count += loss.item()
    print('Epoch : %d Loss: %.3f' %(epoch+1, loss_count/len(input_data)))

LSTM(
  (lstm): LSTM(102400, 256)
  (hidden2label): Linear(in_features=256, out_features=15, bias=True)
)
Epoch : 1 Loss: 0.875
Epoch : 2 Loss: 0.094
Epoch : 3 Loss: 0.026
Epoch : 4 Loss: 0.012
Epoch : 5 Loss: 0.008
Epoch : 6 Loss: 0.006
Epoch : 7 Loss: 0.005
Epoch : 8 Loss: 0.004
Epoch : 9 Loss: 0.004
Epoch : 10 Loss: 0.003


In [11]:
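##Function for testing
def Testing(input_data_test, label_test, model):
    # Note: the model's hidden state is not re-initialised here, so the recurrent
    # state carries over from the previous forward pass into each test batch.
    with torch.no_grad():
        correct = 0
        total = 0
        for i in range(len(input_data_test)):
            feats = input_data_test[i]
            labels = label_test[i]
            outputs = model(feats)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    # Accuracy in percent over all evaluated samples
    accuracy = (correct/total)*100
    return accuracy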

accuracy = Testing(input_data_test, label_test, model_1)
print("Accuracy on the test dataset is:", accuracy)


Accuracy on the test dataset is: 92.7689594356261


In [12]:
Batch_size = 7
Hidden_dim = 512
Num_classes =15

model_2 = LSTM(Hidden_dim,Batch_size,Num_classes)
print(model_2)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model_2.parameters(), lr=0.1)

#Training
initial_loss = 0
for epoch in range(10):
    loss_count = 0.0
    for i in range(len(input_data)):
        feats = input_data[i]
        labels = label[i]
        model_2.zero_grad()
        model_2.hidden= model_2.init_hidden()
        outputs = model_2(feats)
        loss = loss_function(outputs, labels)
        loss.backward()
        optimizer.step()
        loss_count += loss.item()
    print('Epoch : %d Loss: %.3f' %(epoch+1, loss_count/len(input_data)))

LSTM(
  (lstm): LSTM(102400, 512)
  (hidden2label): Linear(in_features=512, out_features=15, bias=True)
)
Epoch : 1 Loss: 0.634
Epoch : 2 Loss: 0.040
Epoch : 3 Loss: 0.010
Epoch : 4 Loss: 0.005
Epoch : 5 Loss: 0.004
Epoch : 6 Loss: 0.003
Epoch : 7 Loss: 0.002
Epoch : 8 Loss: 0.002
Epoch : 9 Loss: 0.002
Epoch : 10 Loss: 0.002


In [13]:
accuracy = Testing(input_data_test, label_test, model_2)
print("Accuracy on the test dataset is:", accuracy)

Accuracy on the test dataset is: 92.41622574955908


In [21]:
step = 14  
index_pairs = getIndexPairs(feature_train1, step)
index_pairs_test = getIndexPairs(feature_test1, step)
batch_num = 14
path =  'UCF101_dimitris_course/UCF101_release/vgg16_relu6/'
input_data, label = getData(index_pairs, feature_train1, label_train1, path, batch_num)
input_data_test, label_test = getData(index_pairs_test, feature_test1, label_test1, path, batch_num)
print(input_data[0].shape)

torch.Size([14, 1, 102400])


In [22]:
Batch_size = 14
Hidden_dim = 256
Num_classes =15

model_1 = LSTM(Hidden_dim,Batch_size,Num_classes)
print(model_1)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model_1.parameters(), lr=0.1)

#Training
initial_loss = 0
for epoch in range(10):
    loss_count = 0.0
    for i in range(len(input_data)):
        feats = input_data[i]
        labels = label[i]
        model_1.zero_grad()
        model_1.hidden= model_1.init_hidden()
        outputs = model_1(feats)
        loss = loss_function(outputs, labels)
        loss.backward()
        optimizer.step()
        loss_count += loss.item()
    print('Epoch : %d Loss: %.3f' %(epoch+1, loss_count/len(input_data)))

LSTM(
  (lstm): LSTM(102400, 256)
  (hidden2label): Linear(in_features=256, out_features=15, bias=True)
)
Epoch : 1 Loss: 1.062
Epoch : 2 Loss: 0.140
Epoch : 3 Loss: 0.045
Epoch : 4 Loss: 0.020
Epoch : 5 Loss: 0.012
Epoch : 6 Loss: 0.009
Epoch : 7 Loss: 0.007
Epoch : 8 Loss: 0.006
Epoch : 9 Loss: 0.005
Epoch : 10 Loss: 0.004


In [23]:
accuracy = Testing(input_data_test, label_test, model_1)
print("Accuracy on the test dataset is:", accuracy)

Accuracy on the test dataset is: 92.32142857142858


In [24]:
##LSTM with two nn.LSTM layers
class LSTM_2(nn.Module):

    def __init__(self, hidden_dim_1, hidden_dim_2, batch_size, label_size):
        super(LSTM_2, self).__init__()
        self.hidden_dim_1 = hidden_dim_1
        self.hidden_dim_2 = hidden_dim_2
        self.batch_size = batch_size

        #self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm_1 = nn.LSTM(102400, hidden_dim_1)
        self.lstm_2 = nn.LSTM(hidden_dim_1, hidden_dim_2)
        
        self.hidden2label = nn.Linear(hidden_dim_2, label_size)
        
        self.hidden1 = self.init_hidden1()
        self.hidden2 = self.init_hidden2()
    
    def init_hidden1(self):
        return (torch.zeros(1, self.batch_size, self.hidden_dim_1),
                torch.zeros(1, self.batch_size, self.hidden_dim_1))
    def init_hidden2(self):
        return (torch.zeros(1, self.batch_size, self.hidden_dim_2),
                torch.zeros(1, self.batch_size, self.hidden_dim_2))

    def forward(self, input_features):
        # Use the configured batch size instead of a hard-coded 7
        inputs = input_features.view(self.batch_size,-1)
        lstm_out_1, self.hidden1 = self.lstm_1(inputs.view(1,self.batch_size,-1), self.hidden1)
        lstm_out_2, self.hidden2 = self.lstm_2(lstm_out_1.view(1,self.batch_size,-1), self.hidden2)
        outputs = self.hidden2label(lstm_out_2.view(self.batch_size,-1))
        output_labels = F.log_softmax(outputs, dim=-1)
        return output_labels

In [26]:
step = 7
index_pairs = getIndexPairs(feature_train1, step)
index_pairs_test = getIndexPairs(feature_test1, step)
batch_num = 7
path =  'UCF101_dimitris_course/UCF101_release/vgg16_relu6/'
input_data, label = getData(index_pairs, feature_train1, label_train1, path, batch_num)
input_data_test, label_test = getData(index_pairs_test, feature_test1, label_test1, path, batch_num)
print(input_data[0].shape)

torch.Size([7, 1, 102400])


In [27]:
##Two-layer model with hidden dimensions 256 and 512
HIDDEN_DIM_1 = 256
HIDDEN_DIM_2 = 512
model_3 = LSTM_2(HIDDEN_DIM_1, HIDDEN_DIM_2, 7, 15)
print(model_3)
loss_function = nn.NLLLoss()
optimizer=optim.SGD(model_3.parameters(), lr=0.1)
loss_list = []

initial_loss = 0
for epoch in range(10):
    loss_count = 0.0
    for i in range(len(input_data)):
        model_3.zero_grad()
        model_3.hidden1 = model_3.init_hidden1()
        model_3.hidden2 = model_3.init_hidden2()
        output_scores = model_3(input_data[i])
        loss = loss_function(output_scores, label[i])
        loss.backward()
        optimizer.step()
        loss_count += loss.item()
    print('Epoch : %d Loss: %.3f' %(epoch+1, loss_count/len(input_data)))

LSTM_2(
  (lstm_1): LSTM(102400, 256)
  (lstm_2): LSTM(256, 512)
  (hidden2label): Linear(in_features=512, out_features=15, bias=True)
)
Epoch : 1 Loss: 2.243
Epoch : 2 Loss: 0.655
Epoch : 3 Loss: 0.160
Epoch : 4 Loss: 0.058
Epoch : 5 Loss: 0.019
Epoch : 6 Loss: 0.009
Epoch : 7 Loss: 0.006
Epoch : 8 Loss: 0.004
Epoch : 9 Loss: 0.003
Epoch : 10 Loss: 0.003


In [28]:
accuracy = Testing(input_data_test, label_test, model_3)
print("Accuracy on the test dataset is:", accuracy)

Accuracy on the test dataset is: 61.552028218694886


In [30]:
###SVM
##get data for SVM:
def getDataSVM(path, feature_set):
    total_data_SVM=[]
    dim_1 = 25
    dim_2 = 4096
    for i in range(len(feature_set)):
        file_path = feature_set[i] + '.mat'
        file = scipy.io.loadmat(path + file_path)
        feature = file['Feature']
        feature = np.reshape(feature,(dim_1*dim_2))
        total_data_SVM.append(feature)
    return total_data_SVM

path =  'UCF101_dimitris_course/UCF101_release/vgg16_relu6/'
total_data_SVM_train = getDataSVM(path, feature_train1)
total_data_SVM_test = getDataSVM(path, feature_test1)
    
print(len(total_data_SVM_train))
print(len(total_data_SVM_test)) 

1442
568


In [31]:
clf =  OneVsRestClassifier(LinearSVC(random_state=None ,tol=1e-4, loss='squared_hinge', C=0.012))
clf.fit(total_data_SVM_train, label_train1)



OneVsRestClassifier(estimator=LinearSVC(C=0.012, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
          n_jobs=None)

In [33]:
labels_predicted = clf.predict(total_data_SVM_test)
accuracy = accuracy_score(label_test1, labels_predicted)
print("The predicted accuracy is {:.2f}%".format(accuracy*100))

The predicted accuracy is 95.42%


## Submission
---
**Runnable source code in ipynb file and a pdf report are required**.

The report should be 3 to 4 pages long, describing what you have done and learned in this homework and reporting the performance of your model. If you have tried multiple methods, please compare your results. If you are using any external code, please cite it in your report. Note that this homework is designed to help you explore and get familiar with the techniques. The final grading will be largely based on your prediction accuracy and the different methods you tried (different architectures and parameters).

Please indicate clearly in your report which models you tried and which techniques you applied to improve performance, and report their accuracies. The report should be concise and include the highlights of your efforts.