<a href="https://colab.research.google.com/github/ReemAlsharabi/KAUST-Academy/blob/main/summer-program/week5/CV/Day1/Day1_ComputerVision_part_3_unsolved.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part 3: Video Classification (2.5 points)

There are many video datasets [available](https://datasetsearch.research.google.com/) for free online for research purposes, such as [YouTube-8M](https://research.google.com/youtube8m/) and [ActivityNet](http://activity-net.org/). We will use the [YouTubeVideoGame](https://ai.googleblog.com/2013/11/released-data-set-features-extracted.html) dataset, which contains multiview (multi-modal) hand-crafted video features (vision, text, and audio) extracted from 120K YouTube videos of people playing 30+ popular video games. The authors of this dataset hid the original YouTube video links, so all we have are the extracted vision, text, and audio features for each video (inputs) and the class label of the featured game (output). We are only interested in the vision features. You can get the dataset and learn more about it [here](https://code.google.com/archive/p/multiview-video-features-data/wikis/InfoOnData.wiki).

The vision and audio modalities each have 5 feature families, while the text modality has only 3, for a total of 13 feature families. Each feature family is an N-dimensional feature vector collected using some [feature extraction method](https://deepai.org/machine-learning-glossary-and-terms/feature-extraction) (e.g.,
[HOG](https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients) features). Every feature family was saved into a separate compressed text file `*.txt.gz`. The authors also split the data into training, validation, and testing sets. Thus, we will create a dataset class for every feature family (called `YTVGFeatureFamily`) and another dataset class that combines the feature families of the same split (train, validation, test) and modality (called `YouTubeVideoGame`).

Skim through the following code cell. It defines `YTVGFeatureFamily`. The most relevant parts are `__getitem__` and `__len__`: the former returns one sample given its index, and the latter returns the total number of samples in the dataset. The class also has a flag, `by_video_id`, to look items up by video id instead of index. Most of these features are highly sparse (most values are zero), so the class stores them in the sparse [COO(rdinate) format](https://pytorch.org/docs/1.7.1/sparse.html). There is no need to fret, though, as a sparse tensor can easily be converted to the regular dense format by calling `Tensor.to_dense()`.
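To make the sparse COO idea concrete, here is a minimal sketch on a made-up 8-dimensional feature vector (the indices and values are illustrative, not taken from the dataset). It builds a sparse tensor from index/value pairs and converts it to dense.

```python
import torch

# Toy feature vector of length 8 with three non-zero entries (made-up values).
indices = torch.tensor([[1, 4, 6]])      # shape [sparse_dims, nnz] = [1, 3]
values = torch.tensor([0.5, 2.0, 1.5])   # one stored value per index
sparse = torch.sparse_coo_tensor(indices, values, size=(8,))

# Convert to a regular tensor; positions without a stored value become zeros.
dense = sparse.to_dense()
print(sparse.is_sparse, dense)
```

Only the three index/value pairs are stored in `sparse`; the zeros appear only after `to_dense()`.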

In [1]:
import torch
import gzip
from pathlib import Path
from torch.utils.data import Dataset


class YTVGFeatureFamily(Dataset):
    """YouTube Video Game Dataset (YTVG).

    http://ai.googleblog.com/2013/11/released-data-set-features-extracted.html
    http://code.google.com/archive/p/multiview-video-features-data/

    The YTVG dataset has 13 feature families divided into three modalities:
    vision (5 families), audio (5 families), and text (3 families).

    Args:
        file_path: path to the *.txt.gz feature family file.
        by_video_id: accessing items is by video_id instead of index.

    """

    modalities = {
        'vision': (
            'cuboids_histogram',
            'hs_hist_stream',
            'hist_motion_estimate',
            'misc',
            'hog_features',
        ),
        'audio': (
            'mfcc',
            'sai_intervalgrams',
            'sai_boxes',
            'volume_stream',
            'spectrogram_stream',
        ),
        'text': (
            'description_unigrams',
            'tag_unigrams',
            'game_lda_1000',
        ),
    }

    def __init__(self, file_path, by_video_id=False):
        self.file_path = Path(file_path)

        # whether to select items by video id or index
        self.by_video_id = bool(by_video_id)

        # placeholders for the data
        self.index = {}  # key: video_id, value: index
        self.video_id = []
        self.class_label = []
        self.indices = []
        self.values = []

        # read all the instances (videos) in the *.txt.gz file
        with gzip.open(self.file_path, 'rt') as txt_gz:
            instances = iter(txt_gz.read().split('#I'))
            next(instances)

        # parse each instance
        for index, instance in enumerate(instances):
            video_id, class_label, *features = instance.split()
            if features:
                indices, values = zip(*map(lambda x: x.split(':'), features))
            else:
                indices, values = [], []
            indices = torch.LongTensor(tuple(map(int, indices))) - 1
            values = torch.FloatTensor(tuple(map(float, values)))

            self.index[int(video_id)] = index
            self.video_id.append(int(video_id))
            self.class_label.append(int(class_label) - 1)
            self.indices.append(indices)
            self.values.append(values)

        # get the size of this feature (maximum among all instances)
        self.size = max(i.max().item() for i in self.indices if len(i)) + 1

    @property
    def name(self):
        return self.file_path.name.split('.')[0]

    @property
    def modality(self):
        return self.file_path.name.split('_')[0]

    def __getitem__(self, index):
        if self.by_video_id:
            index = self.index[index]
        i = self.indices[index].unsqueeze(0)
        v = self.values[index]
        # torch.sparse.FloatTensor is deprecated; sparse_coo_tensor is the current constructor
        feature = torch.sparse_coo_tensor(i, v, (self.size,))
        return {
            'index': index,
            'video_id': self.video_id[index],
            'class_label': self.class_label[index],
            'feature': feature,
        }

    def __len__(self):
        return len(self.video_id)

    def __contains__(self, video_id):
        return video_id in self.index

    def __repr__(self):
        return f'{type(self).__name__}({self.name}_{self.size})'

You cannot test the previous class before downloading and extracting its files. The following dataset class (`YouTubeVideoGame`) should be able to download these for you. If the automatic download fails, you can download the files yourself using the links printed in the messages. These are large files; they could take around 2GB of RAM once loaded.
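Although the real class needs the downloaded files, its parsing steps can be tried on a tiny synthetic file. The sketch below is not part of the assignment code: `parse_instances` is a hypothetical helper that mirrors the logic in `YTVGFeatureFamily.__init__`, and the two instances in `sample` are made up to match the layout that parser expects (`#I`, then a video id, a 1-based class label, and 1-based `index:value` pairs).

```python
import gzip
import tempfile
from pathlib import Path

# Two made-up instances in the layout the parser expects.
sample = '#I 101 3 1:0.5 4:2.0\n#I 102 1 2:1.0\n'


def parse_instances(text):
    """Hypothetical helper mirroring the parsing in YTVGFeatureFamily.__init__."""
    parsed = []
    for instance in text.split('#I')[1:]:  # skip whatever precedes the first '#I'
        video_id, class_label, *features = instance.split()
        pairs = [f.split(':') for f in features]
        parsed.append({
            'video_id': int(video_id),
            'class_label': int(class_label) - 1,         # shift labels to start at 0
            'indices': [int(i) - 1 for i, _ in pairs],   # shift indices to start at 0
            'values': [float(v) for _, v in pairs],
        })
    return parsed


# Write the sample to a temporary *.txt.gz file and read it back in text mode.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'vision_toy.txt.gz'
    with gzip.open(path, 'wt') as f:
        f.write(sample)
    with gzip.open(path, 'rt') as f:
        records = parse_instances(f.read())

print(records)
```

Note the two `- 1` shifts: both class labels and feature indices are converted from 1-based in the file to 0-based in memory, as the class above does.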

In [None]:
from torchvision.datasets.utils import extract_archive
from torchvision.datasets.utils import download_file_from_google_drive


class YouTubeVideoGame(Dataset):
    """YouTube Video Game Dataset (YTVG).

    http://ai.googleblog.com/2013/11/released-data-set-features-extracted.html
    http://code.google.com/archive/p/multiview-video-features-data/

    Args:
        data_dir: root directory for data files.
        modality: must be in {'vision', 'audio', 'text'}.
        split: must be in {'train', 'validation', 'test'}.

    """

    file_id = {
        # Google Drive file ids
        'dir_vision': '0B4ZwSjYLbUK3Qmp0YWR2ckVKc2c',
        'dir_audio1': '0B4ZwSjYLbUK3dFVTa0hvMGJxdXc',
        'dir_audio2': '0B4ZwSjYLbUK3NTUyU0pQUDFpc3c',
        'dir_text': '0B4ZwSjYLbUK3WnMwTW93a1Bmc28',
        'validation': '0B4ZwSjYLbUK3ZDlZRG9pazZ6eGc',
        'test': '0B4ZwSjYLbUK3SjJJX1UtSThBZnM',
    }
    md5 = {
        # MD5 values for each file
        'dir_vision': 'b8c5bc715405d526716008ee792589c0',
        'dir_audio1': '5f10d11c2c601ff80e775c1a2ef361d3',
        'dir_audio2': '7d3ca965c9f430451799b3d68ac51498',
        'dir_text': '1e21c10b38df59b8c488f99607ff814e',
        'validation': 'e975214da0ff36702b9e649c0ee32035',
        'test': 'ae11b3769eb46c62cd72ddc8252694bf',
    }

    def __init__(self, data_dir=None, modality='vision', split='train'):
        self.data_dir = data_dir
        self.modality = modality
        self.split = split

        # select the correct data files
        files = []
        if self.split == 'train':
            if self.modality == 'vision':
                files.append('dir_vision')
            elif self.modality == 'audio':
                files.append('dir_audio1')
                files.append('dir_audio2')
            elif self.modality == 'text':
                files.append('dir_text')
        else:
            files.append(self.split)

        # download the files and load feature families
        self.features = []
        for tar in files:
            self.download_and_extract(tar)
            for name in YTVGFeatureFamily.modalities[self.modality]:
                gz = self.data_dir / f'{tar}/{self.modality}_{name}.txt.gz'
                print(f'loading {gz} ...')
                feature = YTVGFeatureFamily(gz, by_video_id=True)
                self.features.append(feature)

    @classmethod
    def default_dir(cls):
        """Get the default dataset files directory."""
        return Path(torch.hub.get_dir()) / f'datasets/{cls.__name__}'

    @property
    def data_dir(self):
        """Get dataset files directory."""
        return self._data_dir

    @data_dir.setter
    def data_dir(self, path):
        if path is None:
            path = self.default_dir()
        self._data_dir = Path(path)

    def download_and_extract(self, file_name):
        """Download and extract a dataset file."""
        directory = self.data_dir / file_name
        path = self.data_dir / (file_name + '.tar')
        if not directory.exists():
            file_id = self.file_id[file_name]
            print(f'did not find {str(directory)}')
            print(f'downloading https://drive.google.com/file/d/{file_id}')
            print(f'if it is taking more than expected, download it yourself')
            print(f'you should then rename it and place it here {path}')
            download_file_from_google_drive(
                file_id=file_id,
                root=self.data_dir,
                filename=file_name + '.tar',
                md5=self.md5[file_name],
            )
            extract_archive(str(path))

    @property
    def size(self):
        return sum(f.size for f in self.features)

    def __getitem__(self, index):
        video_id = self.features[0].video_id[index]
        output = self.features[0][video_id]
        output['feature'] = []
        for feature in self.features:
            if video_id in feature:
                vector = feature[video_id]['feature']
            else:
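                # empty sparse vector (no stored entries) for a missing family;
                # sparse_coo_tensor replaces the deprecated torch.sparse.FloatTensor
                vector = torch.sparse_coo_tensor(
                    torch.empty((1, 0), dtype=torch.long),
                    torch.empty(0),
                    (feature.size,),
                )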
            output['feature'].append(vector)
        output['feature'] = torch.cat(output['feature'])
        return output

    def __len__(self):
        return len(self.features[0])

    def __repr__(self):
        return f'{type(self).__name__}({self.modality}_{self.split})'


# this may take a while to download and a while to load
test_set = YouTubeVideoGame(modality='vision', split='test')
val_set = YouTubeVideoGame(modality='vision', split='validation')
train_set = YouTubeVideoGame(modality='vision', split='train')

The dataset has 31 classes labeled in $[0, 30]$ (30 games + 1 unspecified). The unspecified class (background class) includes all videos of games that are not among the 30 named classes. Unfortunately, `__getitem__()` returns a `dict`, while our implementation of `gradient_descent()` expects a `tuple` of two values: an input dense feature vector and an output class label. `YouTubeVideoGame` also offers no way to apply transforms to each item, so we need to create a wrapper class for `YouTubeVideoGame` (called `YTVG` for short).
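As a rough illustration of the conversion the wrapper has to perform, the sketch below turns one dict-shaped item into a `(features, label)` pair and optionally adds Gaussian noise. The item is a made-up stand-in for what `YouTubeVideoGame.__getitem__` returns, and both the helper name `to_pair` and the noise scale `0.01` are assumptions chosen for the example, not values prescribed by the assignment.

```python
import torch

torch.manual_seed(0)

# Made-up stand-in for one item returned by YouTubeVideoGame.__getitem__.
item = {
    'index': 0,
    'video_id': 101,
    'class_label': 2,
    'feature': torch.sparse_coo_tensor([[0, 3]], [0.5, 2.0], size=(6,)),
}


def to_pair(item, noise_std=0.0):
    """Hypothetical helper: return (dense_features, class_label)."""
    features = item['feature'].to_dense()
    if noise_std > 0:
        # zero-mean Gaussian noise with standard deviation noise_std
        features = features + noise_std * torch.randn_like(features)
    return features, item['class_label']


clean, label = to_pair(item)
noisy, _ = to_pair(item, noise_std=0.01)
print(clean, label)
print(noisy)
```

The noise is added after densifying, so it perturbs every entry, including the zeros. Whether that helps generalization is something to check empirically.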

In [None]:
# TODO: vvvvvvvvvvv
# complete the wrapper class
# add small Gaussian noise to features if self.augment is True
# comment the noise out if it hurts the generalization
class YTVG(Dataset):
    def __init__(self, dataset, augment=False):
        self.dataset = dataset
        self.augment = augment

    def __getitem__(self, index):
        output = self.dataset[index]
        class_label = ...
        features = ...
        if self.augment:
            # TODO:
            pass
        return features, class_label

    def __len__(self):
        return len(self.dataset)

    def __repr__(self):
        return repr(self.dataset)
# ^^^^^^^^^^^^^^^^^

# wrap the datasets
if isinstance(train_set, YouTubeVideoGame):
    train_set = YTVG(train_set, augment=True)
if isinstance(val_set, YouTubeVideoGame):
    val_set = YTVG(val_set)
if isinstance(test_set, YouTubeVideoGame):
    test_set = YTVG(test_set)

print(train_set, len(train_set))
print(val_set, len(val_set))
print(test_set, len(test_set))

In [None]:
# TODO: vvvvvvvvvvv
# create the data loaders

# ^^^^^^^^^^^^^^^^^

In [None]:
# TODO: vvvvvvvvvvv
# train a deep model on the vision modality of YouTubeVideoGame dataset
# you must plot training and validation loss and accuracy per epoch
# and report the final testing accuracy as we did in the previous parts
# there is no required target accuracy, so feel free to experiment ;)
torch.manual_seed(0)

# ^^^^^^^^^^^^^^^^^

## On Vision Transformers

If there's a word that you have heard too many times, it's transformers. In this project we are interested in exploring Vision Transformers (ViTs). Since their introduction in 2020, ViTs have emerged as a competitive alternative to convolutional neural networks (CNNs). The transformer architecture was first introduced in natural language processing (NLP) and was later extended to image classification (and other vision tasks); see [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/pdf/2010.11929v2.pdf). Let's jump directly into the steps of a ViT pipeline.

1. Split an image into patches (fixed sizes).

2. Flatten the image patches.

3. Create lower-dimensional linear embeddings from these flattened image patches.

4. Include positional embeddings.

5. Feed the sequence as an input to a state-of-the-art transformer encoder.

6. Pre-train the ViT model with image labels, in a fully supervised manner, on a large dataset.

7. Fine-tune on the downstream dataset for image classification.
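
To make the shapes in steps 1 and 2 concrete, assume an image of height $H$, width $W$, and $C$ channels, cut into square patches of side $P$, where $P$ divides both $H$ and $W$ (these symbols are introduced here for illustration). The number of patches $N$ and the length of each flattened patch are then

$$N = \frac{H \cdot W}{P^2}, \qquad \text{flattened patch length} = P^2 \cdot C$$

For example, a $224 \times 224$ RGB image with $P = 16$ gives $N = 14 \cdot 14 = 196$ patches, each flattened to $16 \cdot 16 \cdot 3 = 768$ values.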

![Vision Transformers](https://viso.ai/wp-content/uploads/2021/09/vision-transformer-vit.png)

Please read the following posts to get a better understanding of ViTs before proceeding to the TODOs below!

* https://lilianweng.github.io/posts/2020-04-07-the-transformer-family/

* https://jalammar.github.io/illustrated-transformer/


In each assignment we will have a ViT-related task. Today's task is to implement steps 1 and 2: first patchify the images, then flatten the patches so that we can later feed them into an embedding layer.
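One way to sanity-check a patchify implementation is to compare it against `torch.nn.functional.unfold`, which extracts and flattens non-overlapping blocks in a single call when the stride equals the kernel size. The sketch below uses made-up toy sizes and is an alternative route for cross-checking, not the reshape/permute approach the exercise asks for.

```python
import torch
import torch.nn.functional as F

B, C, H, W, P = 2, 3, 8, 8, 4   # toy sizes; P must divide H and W
images = torch.arange(B * C * H * W, dtype=torch.float32).reshape(B, C, H, W)

# F.unfold extracts P x P blocks; stride=P makes them non-overlapping.
# Output shape: [B, C * P * P, N] with N = (H // P) * (W // P).
cols = F.unfold(images, kernel_size=P, stride=P)

flat_patches = cols.transpose(1, 2)              # [B, N, C * P * P] (steps 1 and 2)
patches = flat_patches.reshape(B, -1, C, P, P)   # [B, N, C, P, P]  (step 1 only)

print(flat_patches.shape, patches.shape)

# Patches are ordered row by row; the first one is the top-left corner.
assert torch.equal(patches[0, 0], images[0, :, :P, :P])
```

If your own `convert_image_to_patches` orders patches row by row, its output should match `patches` exactly on the same input.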

In [None]:
# patchify the image

def convert_image_to_patches(image, patch_size):
    """
    As discussed above, Vision Transformers require patchification of the input image.
    This function takes a batch of images and returns a tensor of patches.

    image - PyTorch tensor containing the images of a batch, with shape [B, C, H, W]
    patch_size - number of pixels per dimension of the patches (integer)

    expected output: PyTorch tensor of shape [B, N, C, patch_size, patch_size], where N is the number of patches obtained from each image after patchification.
    """

    # TODO: fill in the blanks to patchify the image

    B, C, H, W = image.shape

    x = image.reshape()
    x = x.permute()
    x = x.flatten()

    return x



In [None]:
# TODO: Visualize the patches in an image grid using torchvision.utils.make_grid or matplotlib subplots or a tool of your choice.

In [None]:
# TODO: Now we need to flatten each patch into a vector. A simple way to do that is PyTorch's .flatten() method.
# The result should have shape Batch x Patches x ??, where ?? is the vector dimension obtained after flattening.