## This notebook demonstrates how to build a simple video classification training pipeline using `PyTorchVideo` models, datasets and transforms.


### We will be using a `3D ResNet` for the model and the `Kinetics` dataset and a standard video transform augmentation recipe.

***

## 1. Prepare the dataset.
Setup the PyTorchVideo Kinetics data loader using a `pytorch_lightning.LightningDataModule` This is a wrapper that defines the train, validation and test data partitions.

The list of videos found on the Kinetics website can be found [here.](https://deepmind.com/research/open-source/kinetics)

You will then need the official [download script](https://github.com/activitynet/ActivityNet/tree/master/Crawler/Kinetics) to download the videos.

Once the videos are downloaded, point the `pytorchvideo.data.Kinetics data path` argument to the folder of classes.

There are a few arguments that are more specific to PyTorch datasets:

- `video_sampler`: It defines the order to sample a video at each iteration.
- `clip_sampler`: Defines how to sample a clip from the chosen video at each iteration.
-`transform`: This provides a way to apply user-defined data preprocessing or augmentation before batch collating by the PyTorch data loader.

In [1]:
!pip install pytorch-lightning pytorchvideo torchvision torchrec torchaudio



In [3]:
import os
import pytorch_lightning
import pytorchvideo.data
import torch.utils.data

from pytorchvideo.transforms import (
    ApplyTransformToKey, 
    Normalize, 
    RandomShortSideScale,
    RemoveKey,
    ShortSideScale,
    UniformTemporalSubsample
)


import torchvision
from torchvision.transforms import (
    Compose, 
    Lambda, 
    RandomCrop,
    RandomHorizontalFlip
)

#2. Transforms.
`PyTorchVideo` datasets take a `transform` callable argument that defines custom processing, e.g. `augmentations, normalization` that is applied to each clip.

`pytorchvideo.data.Kinetics` clips have the following dictionary format:


```
{
  'video': <video_tensor>     #Shape: (C, T, H, W)
  'audio': <audio_tensor>     #Shape: (S)
  'label' : <action_label>    #Integer defining class annotation.
  'video_name': <video_path>  #Video file path stem.
  'video_index': <video_id>   #Index of video used by the sampler.
  'clip_index': <clip_id>     #Index of the clip sampled within the video.
}
```



In [15]:
class KineticsDataModule(pytorch_lightning.LightningDataModule):
    #Dataset configuration.
    _DATA_PATH = '~/.torch/datasets/'
    _CLIP_DURATION = 2 #The duration of sampled clip for each video.
    _BATCH_SIZE = 8
    _NUM_WORKERS = 7 #The number of parallel processes fetcing data.

    def train_dataloader(self):
        global _DATA_PATH
        '''
        Creates the Kinetics train partition from the list of video labels.
        Add transform that subsamples and normalizes the video before applying
        the scale, crop and flip augmentations.
        '''
        train_transform = Compose(
            [
                ApplyTransformToKey(
                    key = 'video',
                    transform = Compose(
                        [
                         UniformTemporalSubsample(8),
                         Lambda(lambda x: x / 255.0),
                         Normalize((0.45, 0.45, 0.45), (0.225, 0.225, 0.225)),
                         RandomShortSideScale(min_size = 256, max_size = 320),
                         RandomCrop(244),
                         RandomHorizontalFlip(p = 0.5)
                        ]
                    ),
                ),
            ]
        )

        train_dataset = torchvision.datasets.Kinetics(root = _DATA_PATH, frames_per_clip = 10, 
                                                  num_classes = '101', split = 'train',
                                                  step_between_clips = 1, transform = train_transform, 
                                                  download = False, num_download_workers = 4, num_workers = 4)

        train_dataset = pytorchvideo.data.Kinetics(
            clip_sampler = pytorchvideo.data.make_clip_sampler('random', self._CLIP_DURATION),
            decode_audio = False)

        # train_dataset = pytorchvideo.data.Kinetics(
        #     data_path = os.path.join(self._DATA_PATH, 'train.csv'),
        #     clip_sampler = pytorchvideo.data.make_clip_sampler('random', self._CLIP_DURATION),
        #     transform = train_transform
            # )

        return torch.utils.data.DataLoader(
                train_dataset, 
        batch_size = self._BATCH_SIZE, 
        num_workers = self._NUM_WORKERS
    )
    
    
    
    def val_dataloader(self):        
        '''Creates the Kinetics validation partition from the list of video labels.'''
        val_dataset = torchvision.datasets.Kinetics(root = './', frames_per_clip = 10, 
                                                      num_classes = '101', split = 'test',
                                                      step_between_clips = 1, 
                                                      download = False, num_download_workers = 4, num_workers = 4)

        val_dataset = pytorchvideo.data.Kinetics(
            clip_sampler = pytorchvideo.data.make_clip_sampler('uniform', self._CLIP_DURATION),
            decode_audio = False
        )
        # val_dataset = pytorchvideo.data.Kinetics(
        #     data_path = os.path.join(self._DATA_PATH, 'val'),
        #     clip_sampler = pytorchvideo.data.make_clip_sampler('uniform', self._CLIP_DURATION),
        #     decode_audio = False
        # )

        return torch.utils.data.DataLoader(
            val_dataset, 
            batch_size = self._BATCH_SIZE,
            num_workers = self._NUM_WORKERS
        )

#3. Model.
All PytorchVideo models and layers can be built with simple, reproducible factory functions.

We call this a `flat` model interface since the arguments do not require hierarchical configurations to be used.

In [16]:
import pytorchvideo.models.resnet

def make_kinetics_resnet():
    return pytorchvideo.models.resnet.create_resnet(
      input_channel = 3, #RGB input for kinetics.
      model_depth = 50, #ResNet 50
      model_num_class = 101, #Kinetics has 400 classes, so we need the final head to align
      norm = nn.BatchNorm3d,
      activation = nn.ReLU
    )

#4. Putting it all together,

We create a `pytorch_lightning.LightningModule` that defines the train and validation step code and the optimizer.

In [17]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoClassificationLightningModule(pytorch_lightning.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = make_kinetics_resnet()

    def forward(self, x):
        return self.model(x)


    def training_step(self, batch, batch_idx):
        #The model expects a video tensor of shape (B, C, T, H, W), which is 
        #the format provided by the dataset.
        y_hat = self.model(batch['video'])

        #Compute cross-entropy loss, loss.backkwards will be called under the hood by
        #pytorchlightning after being returned from this method.
        loss = F.cross_entropy(y_hat, batch['label'])

        #Log the train loss to TensorBoard.
        self.log('train_loss', loss.item())

        return loss

    def validation_step(self, batch, batch_idx):
        y_hat = self.model(batch['video'])
        loss = F.cross_entropy(y_hat, batch['label'])
        self.log('val_loss', loss)

        return loss

    def configure_optimizers(self):
        '''
        Setup the Adam optimizer. This function also can return a learning_rate scheduler 
        which is usually for training video models.
        '''
        return torch.optim.Adam(self.parameters(), lr = 1e-1)

In [18]:
def train():
    classification_module = VideoClassificationLightningModule()
    data_module = KineticsDataModule()
    trainer = pytorch_lightning.Trainer()
    trainer.fit(classification_module, data_module)

train()

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
  rank_zero_warn(
Missing logger folder: /home/debonair/Documents/Jupyter/Computer Vision/Projects/Video-recognition-for-ASL/notebooks/lightning_logs

  | Name  | Type | Params
-------------------------------
0 | model | Net  | 31.9 M
-------------------------------
31.9 M    Trainable params
0         Non-trainable params
31.9 M    Total params
127.441   Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]

ValueError: Unknown value '101' for argument num_classes. Valid values are {'400', '600', '700'}.