# Workshop on Real time video analytics (action recognition)

Course: Real time audio visual sensing and sense making

Website: https://www.iss.nus.edu.sg/executive-education/course/detail/real-time-audio-visual-sensing-and-sense--making/artificial-intelligence

Contact: Tian Jing

Email: tianjing@nus.edu.sg

# Objective
In this workshop, we will perform the following three tasks:

- Exercise 1: Perform action recognition using a histogram of optical flow
- Exercise 2: Perform action recognition using the C3D deep learning approach
- Exercise 3: Perform action recognition using C3D + classifier

# Submission guideline

Once you finish the workshop, rename your .ipynb file to your name and submit it to LumiNUS.

In [None]:
# Load library
import os
import cv2
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
from sklearn import svm
from sklearn.metrics import confusion_matrix

print("PyTorch version is", torch.__version__)
# Use GPU if available else revert to CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("Device being used:", device)


PyTorch version is 1.8.1+cu101
Device being used: cuda:0


In [None]:
# Grant access to Google Drive.
# Run this cell, then click the link that appears and allow access.
# Copy the code that pops up, paste it in the box, and press Enter.

from google.colab import drive
drive.mount('/content/gdrive')
# Change working directory to be current folder
import os
os.chdir('/content/gdrive/My Drive/RTAVS/action')
!ls

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
2_wk_action.zip		    model_c3d_v0815.h5
archive			    model_c3d_v0815.pt
data			    wk_action_keras_v202107_reference.ipynb
data_test_hof_feature.npz   wk_action_pytorch_v202107_reference.ipynb
data_train_hof_feature.npz


# Explore the dataset

- UCF11 Dataset: https://www.crcv.ucf.edu/data/UCF_YouTube_Action.php.

It contains 11 action categories: basketball shooting, biking/cycling, diving, golf swinging, horseback riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, and walking with a dog.


In [None]:
def load_groups(input_folder):
    '''
    Load the list of sub-folders into a python list with their
    corresponding label.
    '''
    groups         = []
    label_folders  = os.listdir(input_folder)
    index          = 0
    for label_folder in sorted(label_folders):
        label_folder_path = os.path.join(input_folder, label_folder)
        if os.path.isdir(label_folder_path):
            group_folders = os.listdir(label_folder_path)
            for group_folder in group_folders:
                if group_folder != 'Annotation':
                    groups.append([os.path.join(label_folder_path, group_folder), index])
            index += 1

    return groups

#Reference: https://github.com/microsoft/CNTK/blob/master/Examples/Video/DataSets/UCF11/split_ucf11.py
def ucf_split_data(groups, file_ext):
    '''
    Randomly split the groups into a training set (80%) and a test set (20%).
    The split is done per group, so all videos of a group end up in the same set.
    '''
    group_count = len(groups)
    indices = np.arange(group_count)

    np.random.seed(0) # Make it deterministic.
    np.random.shuffle(indices)

    # 80% training and 20% test.
    train_count = int(0.8 * group_count)
    test_count  = group_count - train_count

    train = []
    test  = []

    for i in range(train_count):
        group = groups[indices[i]]
        video_files = os.listdir(group[0])
        for video_file in video_files:
            video_file_path = os.path.join(group[0], video_file)
            if os.path.isfile(video_file_path):
                video_file_path = os.path.abspath(video_file_path)
                ext = os.path.splitext(video_file_path)[1]
                if (ext == file_ext):
                    train.append([video_file_path, group[1]])

    for i in range(train_count, train_count + test_count):
        group = groups[indices[i]]
        video_files = os.listdir(group[0])
        for video_file in video_files:
            video_file_path = os.path.join(group[0], video_file)
            if os.path.isfile(video_file_path):
                video_file_path = os.path.abspath(video_file_path)
                ext = os.path.splitext(video_file_path)[1]
                if (ext == file_ext):
                    test.append([video_file_path, group[1]])

    return train, test


In [None]:
# Prepare the dataset
ucf_groups = load_groups("data")

# Sort the folder names so that list position i matches the label index i
# assigned by load_groups(), which also walks the folders in sorted order
ucf_action_labels  = sorted(f for f in os.listdir("data") if os.path.isdir(os.path.join("data", f)))
print("action labels: ", ucf_action_labels)

ucf_train, ucf_test = ucf_split_data(ucf_groups, '.avi')
print("Total %d categories, Training data %d sequences, test data %d sequences" % (len(ucf_action_labels), len(ucf_train), len(ucf_test)))


action labels:  ['basketball', 'biking', 'diving', 'golf_swing', 'horse_riding', 'soccer_juggling', 'swing', 'tennis_swing', 'trampoline_jumping', 'volleyball_spiking', 'walking']
Total 11 categories, Training data 1295 sequences, test data 305 sequences
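
The split is made per group folder rather than per video, which is why the sequence counts (1295 vs 305) are only approximately 80/20. A minimal sketch of the same idea on made-up data; the group names and the helper `split_by_group` are hypothetical and only illustrate the principle:

```python
import random

def split_by_group(groups, train_fraction=0.8, seed=0):
    """Shuffle group ids and cut once, so a group is never divided."""
    shuffled = list(groups)
    random.Random(seed).shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data: 10 groups with 3 clips each.
toy_groups = {"g%02d" % i: ["g%02d_clip%d.avi" % (i, j) for j in range(3)]
              for i in range(10)}
train_groups, test_groups = split_by_group(sorted(toy_groups))

train_clips = [c for g in train_groups for c in toy_groups[g]]
test_clips = [c for g in test_groups for c in toy_groups[g]]
print(len(train_groups), len(test_groups), len(train_clips), len(test_clips))
```

Because clips from one group never appear on both sides, the test set measures how well a model handles groups it has not seen during training.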


# Exercise 1: Action recognition using histogram of optical flow

- Reference: Histogram of optical flow,  https://github.com/colincsl/pyKinectTools/blob/master/pyKinectTools/algs/HistogramOfOpticalFlow.py
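
Before the full implementation, the core idea can be shown on a toy flow field: each moving pixel votes into a direction bin with its flow magnitude, and slow pixels go into a separate no-motion bin. This is a simplified sketch, not the `hof` function used in this workshop: the helper `flow_direction_histogram` is hypothetical, it uses one global histogram over 360 degrees, and it counts static pixels, whereas the workshop code works per cell and per block.

```python
import numpy as np

def flow_direction_histogram(flow, n_bins=8, motion_threshold=1.0):
    """Magnitude-weighted direction histogram plus one no-motion bin."""
    fx, fy = flow[..., 0], flow[..., 1]
    magnitude = np.sqrt(fx ** 2 + fy ** 2)
    angle = np.degrees(np.arctan2(fy, fx)) % 360.0   # direction in [0, 360)
    moving = magnitude > motion_threshold

    hist = np.zeros(n_bins + 1)
    bin_idx = np.minimum((angle / (360.0 / n_bins)).astype(int), n_bins - 1)
    # Each moving pixel votes into its direction bin with its magnitude.
    np.add.at(hist, bin_idx[moving], magnitude[moving])
    # Pixels at or below the threshold are counted in the last bin.
    hist[-1] = np.count_nonzero(~moving)
    return hist

# Synthetic flow: the left half moves right by 2 px, the right half is static.
flow = np.zeros((4, 8, 2))
flow[:, :4, 0] = 2.0
print(flow_direction_histogram(flow))
```

For this input, all motion falls into the first direction bin and the static half of the image fills the no-motion bin.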

In [None]:
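# Reference: https://github.com/colincsl/pyKinectTools/blob/master/pyKinectTools/algs/HistogramOfOpticalFlow.py
# Adapted from the reference above, with a few bug fixes
from numpy import arctan2, pi, sqrt          # used unqualified in hof() below
from scipy.ndimage import uniform_filter     # box filter used to sum votes per cell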
def hof(flow, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2), normalise=False, motion_threshold=1.):

    """Extract Histogram of Optical Flow (HOF) for a given image.
    Key difference between this and HOG is that flow is MxNx2 instead of MxN
    Compute a Histogram of Optical Flow (HOF) by
        1. (optional) global normalisation
        2. taking the x and y components of the dense optical flow
        3. computing flow histograms
        4. normalising across blocks
        5. flattening into a feature vector
    Parameters
    ----------
    flow : (M, N, 2) ndarray
        Dense optical flow field (x and y flow components).
    orientations : int
        Number of histogram bins: (orientations - 1) direction bins
        plus one no-motion bin.
    pixels_per_cell : 2 tuple (int, int)
        Size (in pixels) of a cell.
    cells_per_block  : 2 tuple (int,int)
        Number of cells in each block.
    normalise : bool, optional
        Apply power law compression to normalise the flow before
        processing.
    motion_threshold : float
        Flow magnitudes at or below this value are treated as no motion.
    Returns
    -------
    newarr : ndarray
        hof for the flow field as a 1D (flattened) array.
    References
    ----------
    * http://en.wikipedia.org/wiki/Histogram_of_oriented_gradients
    * Dalal, N and Triggs, B, Histograms of Oriented Gradients for
      Human Detection, IEEE Computer Society Conference on Computer
      Vision and Pattern Recognition 2005 San Diego, CA, USA
    """
    flow = np.atleast_2d(flow)

    """ 
    -1-
    The first stage applies an optional global image normalisation
    equalisation that is designed to reduce the influence of illumination
    effects. In practice we use gamma (power law) compression, either
    computing the square root or the log of each colour channel.
    Image texture strength is typically proportional to the local surface
    illumination so this compression helps to reduce the effects of local
    shadowing and illumination variations.
    """

    if flow.ndim < 3:
        raise ValueError("Requires dense flow in both directions")

    if normalise:
        flow = sqrt(flow)

    """ 
    -2-
    The second stage computes first order image gradients. These capture
    contour, silhouette and some texture information, while providing
    further resistance to illumination variations. The locally dominant
    colour channel is used, which provides colour invariance to a large
    extent. Variant methods may also include second order image derivatives,
    which act as primitive bar detectors - a useful feature for capturing,
    e.g. bar like structures in bicycles and limbs in humans.
    """

    if flow.dtype.kind == 'u':
        # convert uint image to float
        # to avoid problems with subtracting unsigned numbers in np.diff()
        flow = flow.astype('float')

    gx = np.zeros(flow.shape[:2])
    gy = np.zeros(flow.shape[:2])
    # gx[:, :-1] = np.diff(flow[:,:,1], n=1, axis=1)
    # gy[:-1, :] = np.diff(flow[:,:,0], n=1, axis=0)

    gx = flow[:,:,1]
    gy = flow[:,:,0]


    """ 
    -3-
    The third stage aims to produce an encoding that is sensitive to
    local image content while remaining resistant to small changes in
    pose or appearance. The adopted method pools gradient orientation
    information locally in the same way as the SIFT [Lowe 2004]
    feature. The image window is divided into small spatial regions,
    called "cells". For each cell we accumulate a local 1-D histogram
    of gradient or edge orientations over all the pixels in the
    cell. This combined cell-level 1-D histogram forms the basic
    "orientation histogram" representation. Each orientation histogram
    divides the gradient angle range into a fixed number of
    predetermined bins. The gradient magnitudes of the pixels in the
    cell are used to vote into the orientation histogram.
    """

    magnitude = sqrt(gx**2 + gy**2)
    orientation = arctan2(gy, gx) * (180 / pi) % 180

    sy, sx = flow.shape[:2]
    cx, cy = pixels_per_cell
    bx, by = cells_per_block

    n_cellsx = int(np.floor(sx // cx))  # number of cells in x
    n_cellsy = int(np.floor(sy // cy))  # number of cells in y

    # compute orientations integral images
    orientation_histogram = np.zeros((n_cellsy, n_cellsx, orientations))
    subsample = np.index_exp[int(cy / 2):cy * n_cellsy:cy, int(cx / 2):cx * n_cellsx:cx]
    # There are (orientations-1) bins for optical flow and 1 bin for no-motion
    for i in range(orientations-1):
        #create new integral image for this orientation
        # isolate orientations in this range

        # temp_ori = np.where(orientation < 180 / orientations * (i + 1), orientation, -1)
        # temp_ori = np.where(orientation >= 180 / orientations * i, temp_ori, -1)
        # fixed the bug in the original Github code
        temp_ori = np.where(orientation < 180 / (orientations-1) * (i + 1), orientation, -1)
        temp_ori = np.where(orientation >= 180 / (orientations-1) * i, temp_ori, -1)
        # select magnitudes for those orientations
        cond2 = (temp_ori > -1) * (magnitude > motion_threshold)
        temp_mag = np.where(cond2, magnitude, 0)

        temp_filt = uniform_filter(temp_mag, size=(cy, cx))
        orientation_histogram[:, :, i] = temp_filt[subsample]

    ''' Calculate the no-motion bin '''
    temp_mag = np.where(magnitude <= motion_threshold, magnitude, 0)

    temp_filt = uniform_filter(temp_mag, size=(cy, cx))
    orientation_histogram[:, :, -1] = temp_filt[subsample]

    """
    The fourth stage computes normalisation, which takes local groups of
    cells and contrast normalises their overall responses before passing
    to next stage. Normalisation introduces better invariance to illumination,
    shadowing, and edge contrast. It is performed by accumulating a measure
    of local histogram "energy" over local groups of cells that we call
    "blocks". The result is used to normalise each cell in the block.
    Typically each individual cell is shared between several blocks, but
    its normalisations are block dependent and thus different. The cell
    thus appears several times in the final output vector with different
    normalisations. This may seem redundant but it improves the performance.
    We refer to the normalised block descriptors as Histogram of Oriented
    Gradient (hog) descriptors.
    """

    n_blocksx = (n_cellsx - bx) + 1
    n_blocksy = (n_cellsy - by) + 1
    normalised_blocks = np.zeros((n_blocksy, n_blocksx,
                                  by, bx, orientations))

    for x in range(n_blocksx):
        for y in range(n_blocksy):
            block = orientation_histogram[y:y+by, x:x+bx, :]
            eps = 1e-5
            normalised_blocks[y, x, :] = block / sqrt(block.sum()**2 + eps)

    return normalised_blocks.ravel()

In [None]:
# Define the HoF feature extraction function
def extract_hof_feature(video_list):
    feature_hof = []
    label_list = []
    img_width = 128
    img_height = 64
    for idx, value in enumerate(video_list):
        # Display the progress
        if (idx % 100) == 0:
            print("process sequence %d/%d" % (idx, len(video_list)))
        filename = value[0]
        label = value[1]
        hof_feature_all = []

        cap = cv2.VideoCapture(filename)
        ret, frame = cap.read()
        if ret:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            gray = cv2.resize(gray, (img_width, img_height)) # Resize frames to reduce feature dimensions
        
            while True:
                previousGray = gray
                ret, frame = cap.read()

                if ret:
                    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
                    gray = cv2.resize(gray, (img_width, img_height))
                    flow = cv2.calcOpticalFlowFarneback(previousGray, gray, flow=None, pyr_scale=0.5, levels=5, winsize=11, iterations=10, poly_n=5, poly_sigma=1.1, flags=0)
                    hof_feature_one = hof(flow, orientations=9, pixels_per_cell=(8, 8),cells_per_block=(2, 2))
                    if (len(hof_feature_all) == 0):
                        hof_feature_all = hof_feature_one
                    else:
                        hof_feature_all = np.vstack((hof_feature_all, hof_feature_one))
                else:
                    break
    
        cap.release()
        if (len(hof_feature_all) != 0):
            hof_feature_mean = np.mean(hof_feature_all, axis=0)
            feature_hof.append(hof_feature_mean)
            label_list.append(label)      
        
    return np.array(feature_hof), np.array(label_list)


In [None]:
# It takes around half an hour to build this feature dataset.
# Uncomment the lines below to rebuild it; otherwise, load the features from the saved data files.
# Note: basketball\v_shooting_24\v_shooting_24_01.avi contains a single frame, so no optical flow
# can be computed for it and it is left out of the training features (1294 rows instead of 1295).

# print("Prepare training feature dataset")
# train_feature_hof, train_label_hof = extract_hof_feature(ucf_train)
# np.savez("data_train_hof_feature.npz", X=train_feature_hof, Y=train_label_hof)
# print(train_feature_hof.shape, train_label_hof.shape)

# print("Prepare test feature dataset")
# test_feature_hof, test_label_hof = extract_hof_feature(ucf_test)
# np.savez("data_test_hof_feature.npz", X=test_feature_hof, Y=test_label_hof)
# print(test_feature_hof.shape, test_label_hof.shape)

In [None]:
# Load HoF features from pre-prepared data file and perform SVM classification
with np.load("data_train_hof_feature.npz") as npzfile:
    x_train_hof = npzfile["X"]
    x_train_hof_label = npzfile["Y"]
    
with np.load("data_test_hof_feature.npz") as npzfile:
    x_test_hof = npzfile["X"]
    x_test_hof_label = npzfile["Y"]
    
print("Training data", x_train_hof.shape, x_train_hof_label.shape)
print("Test data", x_test_hof.shape, x_test_hof_label.shape)

Training data (1294, 3780) (1294,)
Test data (305, 3780) (305,)
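
The feature length of 3780 follows from the settings used in `extract_hof_feature`: frames are resized to 128 x 64, cells are 8 x 8 pixels, blocks are 2 x 2 cells, and each cell has 9 bins. A short check of that arithmetic (the helper name `hof_feature_length` is introduced here for illustration only):

```python
def hof_feature_length(img_width, img_height, orientations=9,
                       pixels_per_cell=(8, 8), cells_per_block=(2, 2)):
    """Length of the flattened block-normalised descriptor."""
    cx, cy = pixels_per_cell
    bx, by = cells_per_block
    n_cells_x = img_width // cx
    n_cells_y = img_height // cy
    n_blocks_x = n_cells_x - bx + 1   # blocks slide one cell at a time
    n_blocks_y = n_cells_y - by + 1
    return n_blocks_y * n_blocks_x * by * bx * orientations

print(hof_feature_length(128, 64))  # 7 * 15 * 2 * 2 * 9 = 3780
```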


In [None]:
hof_svm_model = svm.SVC(kernel = 'linear', C = 10).fit(x_train_hof, x_train_hof_label)

x_test_hof_pred = hof_svm_model.predict(x_test_hof)

print(confusion_matrix(x_test_hof_label, x_test_hof_pred))


[[ 0  1  2  0  0  0  0  0  0  0  2]
 [ 7 12  1  0  7  0  9  0  0  0  7]
 [ 2  0 10  0  0  0  0  0  0  0  2]
 [ 0  0  6 21  1  2  1  0  1  2  0]
 [ 5  2  3  0  9  0  0  0  0  0  1]
 [ 5  0  9  2  0  6  1  5  5  0  0]
 [ 0  2  5  0  0  3 10  0  3  1  5]
 [ 4  2  1 12  0  9  0 20  1  3  0]
 [ 5  0  0  0  0  0  1  0 14  2  1]
 [ 3  0  4  1  0  0  0  1  0  5  4]
 [ 2  5  2  2  7  0  1  0  0  0 15]]
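
A confusion matrix is easier to compare across models once it is reduced to a few numbers. The sketch below copies the matrix printed above and computes the overall accuracy and the per-class recall; rows are true labels and columns are predictions, following the `confusion_matrix` convention. The helper name `summarise_confusion` is an assumption made for this example.

```python
import numpy as np

hof_conf_mat = np.array([
    [0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 2],
    [7, 12, 1, 0, 7, 0, 9, 0, 0, 0, 7],
    [2, 0, 10, 0, 0, 0, 0, 0, 0, 0, 2],
    [0, 0, 6, 21, 1, 2, 1, 0, 1, 2, 0],
    [5, 2, 3, 0, 9, 0, 0, 0, 0, 0, 1],
    [5, 0, 9, 2, 0, 6, 1, 5, 5, 0, 0],
    [0, 2, 5, 0, 0, 3, 10, 0, 3, 1, 5],
    [4, 2, 1, 12, 0, 9, 0, 20, 1, 3, 0],
    [5, 0, 0, 0, 0, 0, 1, 0, 14, 2, 1],
    [3, 0, 4, 1, 0, 0, 0, 1, 0, 5, 4],
    [2, 5, 2, 2, 7, 0, 1, 0, 0, 0, 15],
])

def summarise_confusion(conf_mat):
    """Return overall accuracy and per-class recall."""
    conf_mat = np.asarray(conf_mat, dtype=float)
    accuracy = np.trace(conf_mat) / conf_mat.sum()
    recall = np.diag(conf_mat) / conf_mat.sum(axis=1)   # correct / true count
    return accuracy, recall

accuracy, recall = summarise_confusion(hof_conf_mat)
print("overall accuracy: %.3f" % accuracy)   # 122 / 305 = 0.400
print("per-class recall:", np.round(recall, 2))
```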


# Exercise 2: Action recognition using C3D model

A modified C3D model is used in this workshop to reduce training time for demonstration purposes.

- Reference: D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning Spatiotemporal Features with 3D Convolutional Networks”, ICCV 2015, https://arxiv.org/abs/1412.0767

In [None]:
class Videoto3D:

    def __init__(self, width, height, depth):
        self.width = width
        self.height = height
        self.depth = depth

    def get_data(self, filename, skip=True):
        cap = cv2.VideoCapture(filename)
        nframe = cap.get(cv2.CAP_PROP_FRAME_COUNT)
        bAppend = False
        if (nframe>=self.depth):
            if skip:
                frames = [x * nframe / self.depth for x in range(self.depth)]
            else:
                frames = [x for x in range(self.depth)]
        else:
            print("Insufficient %d frames in video %s, set bAppend as True" % (nframe, filename))
            bAppend = True
            frames = [x for x in range(int(nframe))] # nframe is a float

        framearray = []

        for i in range(len(frames)):#self.depth):
            cap.set(cv2.CAP_PROP_POS_FRAMES, frames[i])
            ret, frame = cap.read()
            frame = cv2.resize(frame, (self.height, self.width))
            framearray.append(frame)

        cap.release()
        
        if bAppend:
            while len(framearray) < self.depth:
                framearray.append(frame)
            print("Append more frames in the framearray to have %d frames" % len(framearray))
                
        return np.array(framearray)

def loaddata(video_list, vid3d, skip=True):
    X = []
    Y = []
    for idx, value in enumerate(video_list):
        # Display the progress
        if (idx % 100) == 0:
            print("process data %d/%d" % (idx, len(video_list)))
        filename = value[0]
        label = value[1]
        Y.append(label)
        X.append(vid3d.get_data(filename, skip=skip))

    return np.array(X), np.array(Y)


In [None]:
# Define parameter setting
class Args:
    nclass = len(ucf_action_labels) # 11 action categories
    depth = 10
    rows = 32
    cols = 32
    skip = True # True: sample frames evenly across the whole video; False: take the first `depth` frames
    color_channel = 3

param_setting = Args()
img_rows = param_setting.rows
img_cols = param_setting.cols
frames = param_setting.depth
vid3d = Videoto3D(img_rows, img_cols, frames)
nb_classes = param_setting.nclass
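
The effect of `skip` is easiest to see from the frame positions that `Videoto3D.get_data` requests. The sketch below reproduces only that index logic, without opening any video; the function name `sample_frame_positions` is introduced here for illustration.

```python
def sample_frame_positions(nframe, depth, skip=True):
    """Frame positions chosen the same way as in Videoto3D.get_data."""
    if nframe >= depth:
        if skip:
            # Evenly spaced positions over the whole video.
            return [x * nframe / depth for x in range(depth)]
        # The first `depth` frames.
        return list(range(depth))
    # Shorter than `depth`: take every frame; get_data pads with the last one.
    return list(range(int(nframe)))

print(sample_frame_positions(50, 10))          # 0.0, 5.0, ..., 45.0
print(sample_frame_positions(50, 10, False))   # 0, 1, ..., 9
print(sample_frame_positions(1, 10))           # [0]
```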


In [None]:
# Prepare training data
x_train, y_train = loaddata(ucf_train, vid3d, param_setting.skip)

# Prepare test data
x_test, y_test = loaddata(ucf_test, vid3d, param_setting.skip)

print("Training data", x_train.shape, y_train.shape)
print("Test data", x_test.shape, y_test.shape)

process data 0/1295
process data 100/1295
process data 200/1295
process data 300/1295
process data 400/1295
process data 500/1295
process data 600/1295
process data 700/1295
process data 800/1295
process data 900/1295
process data 1000/1295
Insufficient 1 frames in video /content/gdrive/My Drive/RTAVS/action/data/basketball/v_shooting_24/v_shooting_24_01.avi, set bAppend as True
Append more frames in the framearray to have 10 frames
process data 1100/1295
process data 1200/1295
process data 0/305
process data 100/305
process data 200/305
process data 300/305
Training data (1295, 10, 32, 32, 3) (1295,)
Test data (305, 10, 32, 32, 3) (305,)


In [None]:

class myDataSet(torch.utils.data.Dataset):
    def __init__(self, data_X, data_Y, nb_classes):
        self.X = data_X.astype('float32')/255.0
        self.X = self.X.transpose(0, 4, 1, 2, 3) # (N, D, H, W, C) -> (N, C, D, H, W), the layout nn.Conv3d expects
        self.Y = torch.from_numpy(data_Y)
        self.num_samples = self.Y.shape[0]

    def __getitem__(self, index):
        return self.X[index], self.Y[index]   

    def __len__(self):
        return self.num_samples


In [None]:
class C3D(nn.Module):
    # A simplified C3D model

    def __init__(self, num_classes):
        super(C3D, self).__init__()

        self.conv1 = nn.Conv3d(3, 32, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.conv2 = nn.Conv3d(32, 32, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.pool2 = nn.MaxPool3d(kernel_size=(3, 3, 3), stride=(2, 2, 2))

        self.conv3a = nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.conv3b = nn.Conv3d(64, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.pool3 = nn.MaxPool3d(kernel_size=(3, 3, 3), stride=(2, 2, 2))

        self.fc4 = nn.Linear(3136, 512)
        self.fc5 = nn.Linear(512, num_classes)

        self.dropout = nn.Dropout(p=0.2)
        self.relu = nn.ReLU()
        self.__init_weight()

    def forward(self, x):

        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        x = self.pool2(x)
        x = self.dropout(x)

        x = self.relu(self.conv3a(x))
        x = self.relu(self.conv3b(x))
        x = self.pool3(x)
        x = self.dropout(x)

        x = x.view(x.size()[0], -1)
        x = self.relu(self.fc4(x))
        x = self.dropout(x)
        logits = self.fc5(x)
        
        return x, logits # the 'x' will be used as features later, the logits will be used as model output

    def __init_weight(self):
        for m in self.modules():
            if isinstance(m, nn.Conv3d):
                torch.nn.init.kaiming_normal_(m.weight)
            elif isinstance(m, nn.BatchNorm3d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
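
The input size of `fc4` (3136) is not arbitrary: it is the number of values left after the two pooling stages for a clip of 10 frames at 32 x 32 pixels. The following sketch traces the tensor shape through the layers with the standard output-size formula (floor division, as used by default in PyTorch pooling); the helper `conv_out` is a name chosen for this example.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one axis for a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

depth, height, width = 10, 32, 32      # input clip: 10 frames of 32 x 32
channels = 3
layers = [
    # (name, output channels or None for pooling, kernel, stride, padding)
    ("conv1", 32, 3, 1, 1), ("conv2", 32, 3, 1, 1), ("pool2", None, 3, 2, 0),
    ("conv3a", 64, 3, 1, 1), ("conv3b", 64, 3, 1, 1), ("pool3", None, 3, 2, 0),
]
for name, out_channels, kernel, stride, padding in layers:
    depth, height, width = (conv_out(s, kernel, stride, padding)
                            for s in (depth, height, width))
    channels = out_channels or channels
    print(name, (channels, depth, height, width))

flattened = channels * depth * height * width
print("fc4 input size:", flattened)    # 64 * 1 * 7 * 7 = 3136
```

If `depth`, `rows`, or `cols` in `Args` are changed, the first argument of `fc4` has to be updated to the value computed this way.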


In [None]:

def train_model(model, batch_size, lr, num_epochs):
      
    criterion = nn.CrossEntropyLoss()  # standard crossentropy loss for classification
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
    train_dataloader = torch.utils.data.DataLoader(myDataSet(x_train, y_train, nb_classes), shuffle=True, batch_size=batch_size)

    model.to(device)
    for epoch in range(num_epochs):
        running_loss = 0.0
        model.train()

        for idx, (inputs, labels) in enumerate(train_dataloader):
            if epoch == 0 and idx == 0:
                print(inputs.shape, labels.shape)        #  Check dimension of a batch of training data
                                        
            inputs = inputs.to(device)
            labels = labels.to(device, dtype=torch.int64)
            optimizer.zero_grad()
            _, outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

        epoch_loss = running_loss / len(train_dataloader)

        if ((epoch % 20) == 0):
            print("Epoch: %s/%s, Loss: %.4f" % (epoch+1, num_epochs, epoch_loss))


In [None]:
num_epochs = 400  # Number of epochs for training
lr = 1e-4 # Learning rate
batch_size = 32

model = C3D(num_classes=nb_classes)
# train_model(model=model, batch_size=batch_size, lr=lr, num_epochs=num_epochs)
# torch.save(model.state_dict(), "model_c3d_v0815.pt")

# Load the saved weights (map_location=device also allows loading on a CPU-only machine)
loaded_model = C3D(num_classes=nb_classes)
loaded_model.load_state_dict(torch.load("model_c3d_v0815.pt", map_location=device))


<All keys matched successfully>

In [None]:
batch_size = 10
loaded_model.to(device)

test_dataloader  = torch.utils.data.DataLoader(myDataSet(x_test, y_test, nb_classes), batch_size=batch_size)
test_size = len(test_dataloader.dataset)
conf_mat = np.zeros([nb_classes, nb_classes])

loaded_model.eval()
with torch.no_grad():
    for idx, (inputs, labels) in enumerate(test_dataloader):

        inputs = inputs.to(device)
        labels = labels.to(device, dtype=torch.int64)
        _, outputs = loaded_model(inputs)

        probs = nn.Softmax(dim=1)(outputs)
        preds = torch.max(probs, 1)[1]

        for idx1 in range(preds.shape[0]):
            ii = labels[idx1].item()
            jj = preds[idx1].item()
            conf_mat[ii, jj] += 1.0

print(conf_mat)


[[ 2.  0.  0.  0.  1.  0.  0.  0.  0.  1.  1.]
 [ 2. 16.  0.  2.  6.  0.  5.  1.  2.  1.  8.]
 [ 0.  0. 14.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0. 33.  0.  0.  0.  0.  0.  0.  1.]
 [ 0.  1.  0.  0. 16.  0.  2.  1.  0.  0.  0.]
 [ 0.  2.  0. 20.  1.  5.  0.  1.  0.  2.  2.]
 [ 0.  3.  1.  1.  2.  0.  9.  0.  5.  5.  3.]
 [ 0.  0.  3.  9.  0.  7.  0. 24.  0.  5.  4.]
 [ 0.  0.  1.  0.  0.  0.  3.  6. 12.  0.  1.]
 [ 2.  0.  0.  0.  1.  0.  0.  0.  1. 14.  0.]
 [ 0. 11.  0.  2.  5.  0.  3.  1.  0.  0. 12.]]
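
Beyond the diagonal, the largest off-diagonal cells show which classes the model mixes up most often. A small sketch that ranks them for the matrix printed above (copied here as integers); the helper `most_confused` is hypothetical, and class indices follow the label indices assigned by `load_groups`.

```python
import numpy as np

c3d_conf_mat = np.array([
    [2, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1],
    [2, 16, 0, 2, 6, 0, 5, 1, 2, 1, 8],
    [0, 0, 14, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 33, 0, 0, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 16, 0, 2, 1, 0, 0, 0],
    [0, 2, 0, 20, 1, 5, 0, 1, 0, 2, 2],
    [0, 3, 1, 1, 2, 0, 9, 0, 5, 5, 3],
    [0, 0, 3, 9, 0, 7, 0, 24, 0, 5, 4],
    [0, 0, 1, 0, 0, 0, 3, 6, 12, 0, 1],
    [2, 0, 0, 0, 1, 0, 0, 0, 1, 14, 0],
    [0, 11, 0, 2, 5, 0, 3, 1, 0, 0, 12],
])

def most_confused(conf_mat, top_k=3):
    """Return (true index, predicted index, count) for the largest off-diagonal cells."""
    off_diag = np.array(conf_mat, dtype=int)   # copy, so the input is untouched
    np.fill_diagonal(off_diag, 0)
    order = np.argsort(off_diag, axis=None)[::-1][:top_k]
    rows, cols = np.unravel_index(order, off_diag.shape)
    return [(int(r), int(c), int(off_diag[r, c])) for r, c in zip(rows, cols)]

top_pairs = most_confused(c3d_conf_mat)
for true_idx, pred_idx, count in top_pairs:
    print("true class %d predicted as class %d: %d times" % (true_idx, pred_idx, count))
```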


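# Exercise 3: Action recognition using C3D + classifier

$\color{red}{\text{Q1: Complete the code to perform action recognition using C3D + classifier}}$

Tasks
- Extract the fully connected layer response (the first value returned by `loaded_model`, i.e. the 512-dimensional output of `fc4`) as features for the training data `x_train` and the test data `x_test`
- Labels are available in `y_train` and `y_test` (taken from `ucf_train` and `ucf_test`)
- Build a classification model, such as an SVM, on the training features
- Perform classification on the `x_test` features and display the confusion matrix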


In [None]:
# Provide your solution to Q1 here
#
# 

$\color{red}{\text{Q2: Propose how to apply the model developed in this workshop to a live video stream.}}$

The model developed in this workshop works on a short video clip. In practice, given a live video stream, how would you apply such a model for action recognition?

In [None]:
# Provide your solution to Q2 here (no programming needed)
#
#

**Once you finish the workshop, rename your .ipynb file to your name and submit it to LumiNUS.**

Have a nice day!