<a href="https://colab.research.google.com/github/DevavratSinghBisht/neural-networks/blob/main/VideoData(CNN)/Video_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Video Classification

* @author: Devavrat Singh Bisht
* Dataset: YouTube DataSet Annotated
* Click [here](https://www.crcv.ucf.edu/data/YouTube_DataSet_Annotated.zip) to download the whole YouTube DataSet.
* Click [here](https://www.crcv.ucf.edu/data/UCF_YouTube_Action.php) to visit the website where you can download this and many other similar datasets.
* Note: I have reduced the dataset to the 3 classes listed below to cut the dataset size and the computation needed to fit it. The original dataset contains 11 classes.

In this session we will do video classification.
There are 3 classes/types of videos:
* Walking
* Horse Riding
* Biking

As a video is made up of frames, we will take multiple frames from each video and build a convolutional network using Conv3D layers to predict the class of the video.

The videos in our dataset are short, about 10 seconds on average. So taking 5 frames from each video seems good enough for our learning purpose, as we do not have access to much compute.

Also, building a model that fits video data well needs a huge dataset and a lot of computation. Thus, understanding the concept is our main aim in this notebook; nonetheless, we will also try to optimize the model a little.

## Importing Libraries

In [1]:
# you can ignore this
# connecting to drive
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [2]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv3D, Flatten, Dense, Dropout, BatchNormalization, Input, ConvLSTM2D

import os
import cv2
import random
import numpy as np

## Data Loading and Preprocessing

We will create a data generator (also called a data loader). This class reads videos directly from disk and takes exactly 5 frames from each one. Each frame is resized to (32, 32, 3) regardless of its original shape, so the output has shape (batch size, 5, 32, 32, 3).

One 4D array represents one video. Stacking these 4D arrays into a batch gives a 5D array.
As we only have 3 classes, our target variable has shape (batch size, 3), a 2D array as usual. The only change is in the independent variable, i.e. X.
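
The shapes described above can be checked with a small NumPy sketch. The batch size of 4, the zero-filled frames, and the label values are placeholders chosen for illustration; they are not taken from the dataset.

```python
import numpy as np

batch_size, n_frames, height, width, channels = 4, 5, 32, 32, 3
n_classes = 3

# One video is a 4D array of frames (zeros stand in for real pixels here).
videos = [np.zeros((n_frames, height, width, channels)) for _ in range(batch_size)]

# Stacking the 4D videos along a new leading axis gives the 5D batch X.
X = np.stack(videos, axis=0)

# Hypothetical class indices for the four videos, one-hot encoded into y.
labels = np.array([0, 2, 1, 2])
y = np.eye(n_classes, dtype=int)[labels]

print(X.shape)  # (4, 5, 32, 32, 3)
print(y.shape)  # (4, 3)
```

Each row of `y` contains a single 1 in the column of the video's class, which is the same encoding the generator below produces.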

In [3]:
class DataGenerator(tf.keras.utils.Sequence):
  'Generates data for Keras'
  def __init__(self, dataset_path, batch_size=32, dim=(5, 32, 32, 3), vid_per_class = 21*3):
    self.dataset_path = dataset_path
    self.dir_list = os.listdir(dataset_path)
    self.n_classes = len(self.dir_list)
    self.batch_size = batch_size
    self.dim = dim
    self.frame_per_vid, self.height, self.width, self.channels = self.dim
    self.vid_per_class = vid_per_class
    # nominal number of samples per epoch; the batches themselves are drawn at random
    self.dataset_len = self.n_classes * self.vid_per_class
      

  def __len__(self):
    'Denotes the number of batches per epoch'
    return int(np.floor(self.dataset_len / self.batch_size))

  def __getitem__(self, index):
    'Generate one batch of data'
    # Generate data (index is ignored: every batch is sampled at random)
    X, y = self.__data_generation()

    return X, y

  def __data_generation(self):
    'Generates data containing batch_size samples' # X : (batch_size, *dim), y : (batch_size, n_classes)
    # Initialization
    X = np.zeros((self.batch_size, *self.dim))
    y = np.zeros((self.batch_size, self.n_classes), dtype=int)

    # Generate data
    for i in range(self.batch_size):
      #print(i)

      frame_list = []

      # random.randint includes both endpoints
      class_no = random.randint(0, self.n_classes-1)

      # pick a random video from the chosen class directory
      vid_dir_path = os.path.join(self.dataset_path, self.dir_list[class_no])
      vid_path = os.path.join(vid_dir_path, random.choice(os.listdir(vid_dir_path)))

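      cam = cv2.VideoCapture(vid_path)

      # read every frame of the video, resizing each one as it is decoded
      while True:
        ret, frame = cam.read()

        if ret:
          # cv2.resize expects dsize as (width, height)
          frame = cv2.resize(frame, (self.width, self.height), interpolation=cv2.INTER_NEAREST)
          frame_list.append(frame)
        else:
          break

      # free the video handle once all frames are read
      cam.release()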
      
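      # step between sampled frames so the picks span the whole video
      # (max(..., 1) avoids a division by zero when frame_per_vid is 1)
      multiplier = (len(frame_list)-1)//max(self.frame_per_vid-1, 1)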

      for j in range(self.frame_per_vid):
        #print(j, multiplier, len(frame_list), frame_list[j*multiplier].shape)
        X[i, j, :, :, :] = frame_list[j*multiplier]

      y[i, class_no] = 1 

    X = X/255

    return X, y
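
To see which frames the generator keeps, the `j * multiplier` rule from `__data_generation` can be reproduced in isolation. This is a minimal sketch: the helper name `sampled_frame_indices` is hypothetical, and the 300-frame video is an assumed example, not a measurement from the dataset.

```python
def sampled_frame_indices(n_frames_in_video, frames_per_vid=5):
    """Indices picked by the generator's `j * multiplier` rule."""
    multiplier = (n_frames_in_video - 1) // (frames_per_vid - 1)
    return [j * multiplier for j in range(frames_per_vid)]

print(sampled_frame_indices(300))  # [0, 74, 148, 222, 296]
print(sampled_frame_indices(5))    # [0, 1, 2, 3, 4]
print(sampled_frame_indices(3))    # [0, 0, 0, 0, 0]
```

The picks are evenly spaced from the first frame to near the last one. The last call shows an edge case: a video with fewer frames than `frames_per_vid` gets a multiplier of 0, so the first frame is repeated five times.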

## Model1
Here we will build a model using Conv3D layers. A Conv3D layer is similar to a Conv2D layer, but it convolves over three dimensions rather than the usual two dimensions of an image.

What we are going to do is stack the frames of the video one over the other, making a 3D cube of arrays (the channel dimension is ignored in this picture). After forming the 3D cube, we will apply some Conv3D layers followed by a flatten layer and some dense layers.

The model may not be good enough, as the dataset is very small considering that we are working with videos, but it is sufficient for our learning purposes. The model overfits, but I believe that given more data and computing resources it would reach higher accuracy.

### Model Building

In [4]:
model1 = Sequential([
                     Conv3D(4, kernel_size=(2, 8, 8), input_shape=(5, 32, 32, 3), activation='relu'),
                     Conv3D(16, kernel_size=(2, 8, 8), activation='relu'),
                     Conv3D(32, kernel_size=(1, 16, 16), activation='relu'),
                     Flatten(),
                     Dense(256, activation='relu'),
                     Dense(3, activation='softmax')
])

In [5]:
model1.compile(optimizer='adam', loss='categorical_crossentropy', metrics=["accuracy"])

In [6]:
model1.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv3d (Conv3D)              (None, 4, 25, 25, 4)      1540      
_________________________________________________________________
conv3d_1 (Conv3D)            (None, 3, 18, 18, 16)     8208      
_________________________________________________________________
conv3d_2 (Conv3D)            (None, 3, 3, 3, 32)       131104    
_________________________________________________________________
flatten (Flatten)            (None, 864)               0         
_________________________________________________________________
dense (Dense)                (None, 256)               221440    
_________________________________________________________________
dense_1 (Dense)              (None, 3)                 771       
Total params: 363,063
Trainable params: 363,063
Non-trainable params: 0
__________________________________________________
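
The output shapes and parameter counts in this summary can be reproduced by hand. The sketch below assumes what the model code uses implicitly: stride 1 and the Keras default `'valid'` padding. The helper `conv3d_valid` is a hypothetical name introduced for illustration.

```python
def conv3d_valid(input_shape, kernel_size, filters):
    """Output shape and parameter count of a stride-1, 'valid' Conv3D layer.

    input_shape is (frames, height, width, channels), without the batch axis.
    """
    *spatial, in_channels = input_shape
    # With 'valid' padding each axis shrinks by (kernel size - 1).
    out_spatial = [s - k + 1 for s, k in zip(spatial, kernel_size)]
    kernel_volume = 1
    for k in kernel_size:
        kernel_volume *= k
    params = kernel_volume * in_channels * filters + filters  # weights + biases
    return (*out_spatial, filters), params

shape = (5, 32, 32, 3)
for kernel_size, filters in [((2, 8, 8), 4), ((2, 8, 8), 16), ((1, 16, 16), 32)]:
    shape, params = conv3d_valid(shape, kernel_size, filters)
    print(shape, params)
# (4, 25, 25, 4) 1540
# (3, 18, 18, 16) 8208
# (3, 3, 3, 32) 131104
```

Flattening the last shape gives 3 * 3 * 3 * 32 = 864 values, which matches the `flatten` row.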

### Model Training

In [7]:
train_datagen = DataGenerator('/content/drive/MyDrive/Study/DL/Video Classification/Datset/Train')
val_datagen = DataGenerator('/content/drive/MyDrive/Study/DL/Video Classification/Datset/Val', batch_size=4, vid_per_class= 2*3)

In [8]:
model1.fit(train_datagen, validation_data=val_datagen, epochs=50)

Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7f31489dcda0>
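
The `History` object returned by `fit` carries a `history` dictionary of per-epoch metrics, which is one way to check for overfitting. The sketch below assumes the keys are `"accuracy"` and `"val_accuracy"` (older Keras versions used `"acc"` and `"val_acc"`). The helper `overfitting_gap` is hypothetical, and the numbers are placeholders, not results from this notebook.

```python
def overfitting_gap(history_dict):
    """Per-epoch gap between training and validation accuracy."""
    pairs = zip(history_dict["accuracy"], history_dict["val_accuracy"])
    return [round(train - val, 4) for train, val in pairs]

# Placeholder numbers for illustration only; these are NOT results from this notebook.
example_history = {
    "accuracy": [0.40, 0.65, 0.90],
    "val_accuracy": [0.38, 0.50, 0.55],
}
print(overfitting_gap(example_history))  # [0.02, 0.15, 0.35]
```

With a real run, one would keep the return value, e.g. `history = model1.fit(...)`, and pass `history.history` to the helper. A gap that keeps growing across epochs is the usual sign of overfitting.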

## Model2
Here we will build a model using ConvLSTM2D layers. As you might have guessed, a ConvLSTM2D layer combines a Conv2D layer and an LSTM layer. As a video is a sequence, it is better to use layers suited to sequence data, but since each element of that sequence is a frame, we need a layer that can treat the frames as a sequence of images.

We will again stack the frames of the video one over the other, making a 3D cube of arrays (the channel dimension is ignored in this picture). After forming the 3D cube, we will apply some ConvLSTM2D layers followed by a flatten layer and some dense layers.

A major advantage of ConvLSTM2D over Conv3D is that it treats the data as a sequence, which is what a video originally is. Also, we don't need to fix the number of frames as before. We will define the model for any number of frames, although the data fed to the model will contain 5 frames per video, as defined earlier for convenience.

The model may not be good enough, as the dataset is very small considering that we are working with videos, but it is sufficient for our learning purposes. The model overfits, and I believe that given more data and computing resources it would reach higher accuracy.

### Model Building

In [9]:
model2 = Sequential([
                     Input( shape=(None, 32, 32, 3) ),  # Variable-length sequence of 32x32x3 frames
                     ConvLSTM2D( filters=40, kernel_size=(3, 3), return_sequences=True ),
                     ConvLSTM2D( filters=40, kernel_size=(3, 3), return_sequences=True ),
                     ConvLSTM2D( filters=40, kernel_size=(3, 3), return_sequences=False ),
                     Flatten(),
                     Dense(256, activation='relu'),
                     Dense(3, activation='softmax'),
                     ])
model2.compile(loss="categorical_crossentropy", optimizer="adam", metrics=['accuracy'])

In [10]:
model2.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv_lst_m2d (ConvLSTM2D)    (None, None, 30, 30, 40)  62080     
_________________________________________________________________
conv_lst_m2d_1 (ConvLSTM2D)  (None, None, 28, 28, 40)  115360    
_________________________________________________________________
conv_lst_m2d_2 (ConvLSTM2D)  (None, 26, 26, 40)        115360    
_________________________________________________________________
flatten_1 (Flatten)          (None, 27040)             0         
_________________________________________________________________
dense_2 (Dense)              (None, 256)               6922496   
_________________________________________________________________
dense_3 (Dense)              (None, 3)                 771       
Total params: 7,216,067
Trainable params: 7,216,067
Non-trainable params: 0
____________________________________________
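
The ConvLSTM2D rows of this summary can be reproduced the same way. The sketch assumes stride 1 and the default `'valid'` padding, and counts four gates that each convolve both the input and the previous hidden state and have one bias per filter. The helper `convlstm2d_valid` is a hypothetical name.

```python
def convlstm2d_valid(input_shape, kernel_size, filters):
    """Per-frame output shape and parameter count of a stride-1, 'valid' ConvLSTM2D.

    input_shape is (height, width, channels) of a single frame.
    """
    height, width, in_channels = input_shape
    kh, kw = kernel_size
    out_shape = (height - kh + 1, width - kw + 1, filters)
    # Four gates, each convolving the input and the previous hidden state, plus a bias.
    params = 4 * filters * (kh * kw * (in_channels + filters) + 1)
    return out_shape, params

shape = (32, 32, 3)
for _ in range(3):
    shape, params = convlstm2d_valid(shape, (3, 3), 40)
    print(shape, params)
# (30, 30, 40) 62080
# (28, 28, 40) 115360
# (26, 26, 40) 115360
```

The number of frames never enters the calculation, which is why the model can be declared with `None` as the sequence length. Flattening the last shape gives 26 * 26 * 40 = 27040 values, matching the `flatten_1` row.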

### Model Training

In [11]:
model2.fit(train_datagen, validation_data=val_datagen, epochs=50)

Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7f30f02e3080>