# Action Recognition @ UCF101  
**Due date: 11:59 pm on Dec. 11, 2018 (Tuesday)**

## Description
---
In this homework, you will perform action recognition using a Recurrent Neural Network (RNN), specifically a Long Short-Term Memory (LSTM) network. You will be given a dataset called UCF101, which consists of 101 different actions/classes, with 145 samples for each action. Each sample is tagged as either training or testing. Each sample is originally a short video, but we sampled 25 frames from each video to reduce the amount of data. Consequently, a training sample is a tuple of a 3D volume, with one dimension encoding the *temporal correlation* between frames, and a label indicating which action it shows.

To tackle this problem, we aim to build a neural network that captures not only the spatial information of each frame but also the temporal information between frames. Fortunately, you don't have to design this from scratch: the RNN, a type of neural network designed to deal with time-series data, is readily available. In particular, you will be using an LSTM for this task.

Training an end-to-end neural network from scratch is prohibitively expensive on CPUs, so we divide the task into two steps: feature extraction and modelling. Below are the things you need to implement for this homework:
- **{35 pts} Feature extraction**. Use the pretrained VGG network to extract features from each frame. Specifically, we recommend using the activations of the first fully connected layer of `torchvision.models.vgg16` (4096-dim) as the features of each video frame. This results in a 4096x25 matrix for each video. 
    **hints**: 
    - use `scipy.io.savemat()` to save features to a '.mat' file and `scipy.io.loadmat()` to load them.
    - normalize your images using `torchvision.transforms`
    ```
    import torchvision.transforms as transforms

    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    prep = transforms.Compose([ transforms.ToTensor(), normalize ])
    prep(img)  # img is a PIL image or an HxWxC uint8 array
    ```
    More details on image preprocessing in PyTorch can be found at http://pytorch.org/tutorials/beginner/data_loading_tutorial.html
    
- **{35 pts} Modelling**. With the extracted features, build an LSTM network which takes a 4096x25 sample as input, and outputs the action label of that sample.
- **{20 pts} Evaluation**. After training your network, evaluate your model on the testing data by computing the prediction accuracy. Moreover, compare the result of your network with that of a support vector machine (SVM): stack the 4096x25 feature matrix into one long vector and train an SVM on it.
- **{10 pts} Report**. Details regarding the report can be found in the submission section below.
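
For the modelling step, one possible starting point is sketched below. It is a minimal illustration rather than a reference solution: the class name `FrameSequenceClassifier`, the hidden size of 64, and the random input batch are assumptions made for the example, and it presumes the labels have been shifted to the range 0..14 for the 15 classes used.

```python
import torch
import torch.nn as nn


class FrameSequenceClassifier(nn.Module):
    """Hypothetical LSTM classifier over per-frame features (illustrative only)."""

    def __init__(self, feature_size=4096, hidden_size=64, num_classes=15):
        super().__init__()
        # batch_first=True makes the LSTM expect (batch, time, feature)
        self.lstm = nn.LSTM(feature_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, 25, 4096) -> out: (batch, 25, hidden_size)
        out, _ = self.lstm(x)
        # Classify from the LSTM output at the last time step
        return self.fc(out[:, -1, :])


torch.manual_seed(0)
model = FrameSequenceClassifier()
# Stored features are 4096 x 25 per video, so swap the axes to (time, feature)
batch = torch.randn(8, 4096, 25).permute(0, 2, 1).contiguous()
logits = model(batch)
print(logits.shape)  # torch.Size([8, 15])
```

With `batch_first=True` the LSTM reads each video as a 25-step sequence of 4096-dim vectors, which is why the stored 4096x25 matrices have their two axes swapped before being fed in.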

Note that the raw images are 256x340, whereas VGG16 takes 224x224 images as input. Resizing the images would distort their aspect ratio, so we instead crop five 224x224 patches, one at the image center and one at each of the four corners, compute the 4096-dim VGG16 features for each crop, and average the five 4096-dim features to obtain the final feature representation of the raw image.
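
The crop geometry can be sketched in plain Python as follows. This is only an illustration: the helper name `five_crop_boxes` is hypothetical, and it assumes that 256x340 means height x width. In practice, `torchvision.transforms.FiveCrop` produces the same five crops directly from an image.

```python
def five_crop_boxes(height=256, width=340, size=224):
    """Return (top, left, bottom, right) boxes: four corners, then the center."""
    if size > height or size > width:
        raise ValueError("crop size must not exceed the image size")
    max_top, max_left = height - size, width - size
    origins = [
        (0, 0),                          # top-left corner
        (0, max_left),                   # top-right corner
        (max_top, 0),                    # bottom-left corner
        (max_top, max_left),             # bottom-right corner
        (max_top // 2, max_left // 2),   # center
    ]
    return [(top, left, top + size, left + size) for top, left in origins]


boxes = five_crop_boxes()
for box in boxes:
    print(box)
# (0, 0, 224, 224)
# (0, 116, 224, 340)
# (32, 0, 256, 224)
# (32, 116, 256, 340)
# (16, 58, 240, 282)
```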

To save you computation time, we have already extracted the features for most samples; the exception is class 1. For class 1, we provide the raw images, and you need to write code to extract the features of its samples. Instead of training on the whole dataset on CPUs, which may take several days, **use the first 15** classes of the dataset. The same applies to those who have access to GPUs.


## Dataset
Download the dataset from [UCF101](http://vision.cs.stonybrook.edu/~yangwang/public/UCF101_dimitris_course.zip). 

The dataset consists of the following two parts: video images and extracted features.

### 1. Video Images  

UCF101 dataset contains 101 actions and 13,320 videos in total.  

+ `annos/actions.txt`  
  + lists all the actions (`ApplyEyeMakeup`, .., `YoYo`)   
  
+ `annos/videos_labels_subsets.txt`  
  + lists all the videos (`v_000001`, .., `v_013320`)  
  + labels (`1`, .., `101`)  
  + subsets (`1` for train, `2` for test)  

+ `images_class1/`  
  + contains videos belonging to class 1 (`ApplyEyeMakeup`)  
  + each video folder contains 25 frames  
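
As a rough sketch of how the annotation file can be consumed, the snippet below splits rows into training and testing lists. It assumes each row holds three tab-separated fields in the order video name, label, subset, which is how the loading cell later in this notebook reads the file; the helper name `split_annotations` and the sample rows are made up for illustration.

```python
def split_annotations(lines, max_label=15):
    """Split tab-separated (video, label, subset) rows into train/test lists."""
    train, test = [], []
    for line in lines:
        fields = line.rstrip().split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        video, label, subset = fields[0], int(fields[1]), fields[2]
        if label > max_label:
            continue  # keep only the first `max_label` classes
        # Subset "1" is training; anything else is treated as testing
        (train if subset == "1" else test).append((video, label))
    return train, test


# Made-up rows in the assumed format
sample = ["v_000001\t1\t1", "v_000002\t1\t2", "v_000003\t16\t1"]
train, test = split_annotations(sample)
print(train)  # [('v_000001', 1)]
print(test)   # [('v_000002', 1)]
```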


### 2. Video Features

+ `extract_vgg16_relu6.py`  
  + used to extract video features  
     + Given an image (size: 256x340), we get 5 crops (size: 224x224) at the image center and four corners. The `vgg16-relu6` features are extracted for all 5 crops and subsequently averaged to form a single feature vector (size: 4096).  
     + Given a video, we process its 25 images sequentially. In the end, each video is represented as a feature sequence (size: 4096 x 25).  
  + written in PyTorch; supports both CPU and GPU.  

+ `vgg16_relu6/`  
   + contains all the video features, EXCEPT those belonging to class 1 (`ApplyEyeMakeup`)  
   + you need to run the script `extract_vgg16_relu6.py` to complete the feature extraction process   
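
The averaging and stacking described above amount to the following array operations. This is a sketch with random numbers standing in for the real VGG16 activations; the variable names are illustrative and not taken from `extract_vgg16_relu6.py`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for fc6 activations: 25 frames x 5 crops x 4096 dims (random values)
crop_features = rng.standard_normal((25, 5, 4096))

# Average the 5 crops of every frame -> one 4096-dim vector per frame
frame_features = crop_features.mean(axis=1)   # shape (25, 4096)
# Arrange as features x frames to match the 4096 x 25 layout
video_feature = frame_features.T              # shape (4096, 25)
print(video_feature.shape)  # (4096, 25)
```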


## Some Tutorials
- Good materials for understanding RNN and LSTM
    - http://blog.echen.me
    - http://karpathy.github.io/2015/05/21/rnn-effectiveness/
    - http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- Implementing RNN and LSTM with PyTorch
    - [LSTM with PyTorch](http://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#sphx-glr-beginner-nlp-sequence-models-tutorial-py)
    - [RNN with PyTorch](http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html)

In [1]:
# write your codes here
import torch
import torchvision
import torchvision.transforms as transforms
import torchvision.models as models
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam, RMSprop
from torch.utils import data
from torch.utils.data.dataset import Dataset
import torchvision.models.inception as inception

import scipy
from scipy import io
import numpy as np
import csv
import pandas as pd

In [2]:
path = "./UCF101_dimitris_course/UCF101_release/vgg16_relu6/"
train_data = []
test_data = []
train_label = []
test_label = []
with open("./UCF101_dimitris_course/UCF101_release/annos/videos_labels_subsets.txt", "r") as f:
    for line in f:
        # Each row is tab-separated: video name, class label, subset (1 = train, 2 = test)
        line_list = line.rstrip().split("\t")
        # Stop at class 16; this relies on the rows being ordered by class
        if line_list[1] == "16":
            break
        feature = scipy.io.loadmat(path+line_list[0]+".mat")["Feature"]
        if line_list[-1] == "1":
            train_data.append(feature)
            train_label.append(int(line_list[1]))
        else:
            test_data.append(feature)
            test_label.append(int(line_list[1]))

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(train_data, train_label, test_size=0.10, random_state=42)


In [4]:
class MyDataset(Dataset):
    def __init__(self, X, y):
        self.data = X
        self.target = y

    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]

        return x, y

    def __len__(self):
        return len(self.data)

In [5]:
train_dataset = MyDataset(X_train, y_train)
validation_dataset = MyDataset(X_val, y_val)
test_dataset = MyDataset(test_data, test_label)

In [6]:
train_dataset_loader = torch.utils.data.DataLoader(dataset = train_dataset, batch_size = 25, shuffle = True, drop_last=True)
val_dataset_loader = torch.utils.data.DataLoader(dataset = validation_dataset, batch_size = 25, shuffle = True, drop_last=True)
test_dataset_loader = torch.utils.data.DataLoader(dataset = test_dataset, batch_size = 25, shuffle = True, drop_last=True)

In [7]:
"""
#following model gives accuracy of 80.63 with 21 epochs
class LSTM_Classify(nn.Module):
    def __init__(self,feature_size, output_size):
        super(LSTM_Classify, self).__init__()
        
        self.recurrent_layer = nn.LSTM(feature_size, 25, 2)
        self.prediction_layer = nn.Linear(25, output_size)
        
    def forward(self, input, h_t_1=None, c_t_1=None):
        rnn_output, (hn, cn) = self.recurrent_layer(input)
        prediction_output = self.prediction_layer(rnn_output[:,-1])
        return prediction_output
"""
"""
#following model was run on GPU getting GPU memory full

class LSTM_Classify(nn.Module):
    def __init__(self,feature_size, output_size):
        super(LSTM_Classify, self).__init__()
        self.inception = nn.Sequential(
                                       nn.Conv2d(1, 15, stride = 1, kernel_size = 2),
                                       nn.LeakyReLU(),
                                       nn.BatchNorm2d(15),
                                       inception.InceptionB(15),
                                       nn.MaxPool2d(2, stride = 2),
                                       nn.Conv2d(495, 1, stride = 1, kernel_size = 2),
                                       nn.MaxPool2d(2, stride = 2),
                                       nn.LeakyReLU(),
                                       nn.BatchNorm2d(1))

        self.linear = nn.Linear(1022, 500)
        self.recurrent_layer = nn.LSTM(500, 25, 2)
        self.prediction_layer = nn.Linear(25, output_size)
    def forward(self, input, h_t_1=None, c_t_1=None):
        inception_output = self.inception(input.view((25, 1, 25, 4096)))
        N, C, H, W = inception_output.size()
        linear_input = inception_output.view(N, -1)
        linear_output = self.linear(linear_input)
        recurrent_input = linear_output.view(25, 1, 500)
        rnn_output, (hn, cn) = self.recurrent_layer(recurrent_input)
        prediction_output = self.prediction_layer(rnn_output[:,-1])
        #return inception_output
        return prediction_output
"""
class LSTM_Classify(nn.Module):
    def __init__(self,feature_size, output_size):
        super(LSTM_Classify, self).__init__()
        self.inception = nn.Sequential(
                                       nn.Conv2d(1, 15, stride = 1, kernel_size = 2),
                                       nn.MaxPool2d(2, stride = 2),
                                       nn.LeakyReLU(),
                                       nn.BatchNorm2d(15),
                                       nn.Conv2d(15, 1, stride = 1, kernel_size = 2),
                                       nn.MaxPool2d(2, stride = 2),
                                       nn.LeakyReLU(),
                                       nn.BatchNorm2d(1))
                                       #nn.Conv2d(1, 15, stride = 1, kernel_size = 2))
                                       
        self.linear = nn.Linear(5115, 500)
        self.recurrent_layer = nn.LSTM(500, 25, 2)
        self.prediction_layer = nn.Linear(25, output_size)
        
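    def forward(self, input, h_t_1=None, c_t_1=None):
        # Reshape the batch of 25 samples to (N=25, C=1, 25, 4096) for the conv stack
        inception_output = self.inception(input.view((25, 1, 25, 4096)))
        N, C, H, W = inception_output.size()
        # Flatten each sample to a 5115-dim vector and project it to 500 dims
        linear_input = inception_output.view(N, -1)
        linear_output = self.linear(linear_input)
        # nn.LSTM defaults to (seq_len, batch, feature), so this view feeds the
        # 25 samples of the batch as one length-25 sequence with batch size 1
        recurrent_input = linear_output.view(25, 1, 500)
        rnn_output, (hn, cn) = self.recurrent_layer(recurrent_input)
        prediction_output = self.prediction_layer(rnn_output[:,-1])
        return prediction_output
        #return linear_input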

In [8]:
#batch_size = 50
feature_size = 4096
output_size  = 16
model = LSTM_Classify(feature_size, output_size)


In [9]:
#optimizer = torch.optim.SGD(model.parameters(), lr = 1e-2)
#optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
optimizer = torch.optim.Adadelta(model.parameters(), lr = 1e-1)
loss_function = nn.CrossEntropyLoss()


In [10]:
# Reports validation accuracy and loss after an epoch, to help decide at which epoch to stop.
def validate(model, loss_function, optimizer, dataloader, epoch):
    val_loss = 0.0
    val_acc = 0.0
    model.eval()
    with torch.no_grad():  # no gradients are needed for validation
        for step_num, tup in enumerate(dataloader):
            x_var = tup[0]
            y_var = tup[1].long()
            output_scores = model(x_var)
            loss = loss_function(output_scores, y_var)

            val_loss += loss.item()
            _, prediction = torch.max(output_scores, 1)
            val_acc += torch.sum(prediction == y_var).item()
    # Both divisors assume 25 samples per batch (batch_size=25 with drop_last=True)
    val_acc_epoch = (float(val_acc)/(len(dataloader)*25))*100
    val_loss_epoch = (float(val_loss/(len(dataloader)*25)))
    print("Epoch {}, Validation Accuracy: {}, Validation Loss: {}".format(epoch+1, val_acc_epoch, val_loss_epoch))
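
The normalization by `len(dataloader)*25` above relies on every batch containing exactly 25 samples, which `drop_last=True` guarantees here. Because `nn.CrossEntropyLoss` returns the mean over the batch by default, dividing the summed batch losses by the sample count also scales the reported loss down by a further factor of 25. A batch-size-weighted alternative could look like the sketch below; the helper name `epoch_metrics` and the numbers are made up for illustration.

```python
def epoch_metrics(batches):
    """batches: iterable of (mean_loss, num_correct, batch_size), one per batch."""
    total_loss, total_correct, total_seen = 0.0, 0, 0
    for mean_loss, num_correct, batch_size in batches:
        total_loss += mean_loss * batch_size   # undo the per-batch mean
        total_correct += num_correct
        total_seen += batch_size
    # Per-sample average loss and accuracy in percent
    return total_loss / total_seen, 100.0 * total_correct / total_seen


# Made-up numbers: two full batches of 25 and a final partial batch of 10
loss, acc = epoch_metrics([(2.0, 10, 25), (1.0, 20, 25), (0.5, 9, 10)])
print(round(loss, 4), acc)  # 1.3333 65.0
```

Weighting by the actual batch size keeps the metrics correct even when the last batch is smaller, so `drop_last=True` is no longer needed for evaluation.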

In [11]:
def train(model, loss_function, optimizer, train_dataloader, val_dataloader, num_epochs):
    for epoch in range(num_epochs):
        print('Starting epoch number %d' % (epoch + 1))
        train_loss = 0.0
        train_acc = 0.0
        model.train()
        for step_num, tup in enumerate(train_dataloader):
            x_var = tup[0]
            y_var = tup[1].long()
            optimizer.zero_grad()
            output_scores = model(x_var)
            loss = loss_function(output_scores, y_var)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
            _, prediction = torch.max(output_scores, 1)
            train_acc += torch.sum(prediction == y_var).item()
        # Both divisors assume 25 samples per batch (batch_size=25 with drop_last=True)
        train_acc_epoch = (float(train_acc)/(len(train_dataloader)*25))*100
        train_loss_epoch = (float(train_loss/(len(train_dataloader)*25)))
        print("Epoch {}, Train Accuracy: {}, Train Loss: {}".format(epoch+1, train_acc_epoch, train_loss_epoch))
        # Validate after every epoch
        validate(model, loss_function, optimizer, val_dataloader, epoch)

In [12]:
train(model, loss_function, optimizer, train_dataset_loader,val_dataset_loader, 47)

Starting epoch number 1
Epoch 1, Train Accuracy: 13.254901960784313, Train Loss: 0.10828352535472197
Epoch 1, Validation Accuracy: 22.400000000000002, Validation Loss: 0.10709198379516602
Starting epoch number 2
Epoch 2, Train Accuracy: 32.15686274509804, Train Loss: 0.10376270275489957
Epoch 2, Validation Accuracy: 34.4, Validation Loss: 0.10250077056884765
Starting epoch number 3
Epoch 3, Train Accuracy: 48.392156862745104, Train Loss: 0.09975511663100299
Epoch 3, Validation Accuracy: 54.400000000000006, Validation Loss: 0.09840564155578613
Starting epoch number 4
Epoch 4, Train Accuracy: 61.01960784313726, Train Loss: 0.09540253770117667
Epoch 4, Validation Accuracy: 64.8, Validation Loss: 0.09392206954956055
Starting epoch number 5
Epoch 5, Train Accuracy: 69.25490196078431, Train Loss: 0.0906360964681588
Epoch 5, Validation Accuracy: 68.8, Validation Loss: 0.08891266250610351
Starting epoch number 6
Epoch 6, Train Accuracy: 76.86274509803923, Train Loss: 0.0855659948610792
Epoch 6

In [13]:
def evaluate(model, testloader):
    model.eval()
    test_acc = 0.0
    with torch.no_grad():  # no gradients are needed at test time
        for step_num, tup in enumerate(testloader):
            test_output = model(tup[0])
            _, prediction = torch.max(test_output, 1)
            test_acc += torch.sum(prediction == tup[1]).item()

    # Assumes 25 samples per batch (batch_size=25 with drop_last=True)
    test_acc = (float(test_acc) / (len(testloader)*25))*100
    print("Got Test Accuracy of : {}".format(test_acc))

In [14]:
evaluate(model, test_dataset_loader)

Got Test Accuracy of : 87.45454545454545


## SVM Implementation

In [15]:
from sklearn.svm import LinearSVC
svm_train_data = [x.reshape(-1) for x in train_data ]
svm_test_data = [x.reshape(-1) for x in test_data]
svm = LinearSVC(C=10)
svm.fit(svm_train_data, train_label)

predictions = svm.predict(svm_test_data)
accuracy = sum(np.array(predictions) == test_label) / float(len(test_data))
print("The accuracy of Classifier is %.2f " % ((accuracy*100)))


The accuracy of Classifier is 90.14 


## Submission
---
**Runnable source code in ipynb file and a pdf report are required**.

The report should be 3 to 4 pages long, describing what you have done and learned in this homework and reporting the performance of your model. If you have tried multiple methods, please compare your results. If you are using any external code, please cite it in your report. Note that this homework is designed to help you explore and get familiar with the techniques. The final grading will be largely based on your prediction accuracy and the different methods you tried (different architectures and parameters).

Please indicate clearly in your report which models you tried and which techniques you applied to improve the performance, and report their accuracies. The report should be concise and include the highlights of your efforts.