---
---


## Small Intro To Multi-Label Video Classification and the main problem

Multi-label classification is the task of assigning one or more labels to each data instance, in this case a video. Unlike traditional single-label classification, a given input may belong to several labels at once. Video classification is one of the harder challenges in computer vision: a video is a sequence of temporally related images, and that temporal dimension adds a layer of complexity on top of the image classification problem. There are various ways to approach it, for instance by using the [**PyTorch Lightning Flash**](https://lightning-flash.readthedocs.io/en/latest/quickstart.html) API. You can refer to our previous blog, where we created a video classification pipeline using one of the larger [X3D](https://pytorch.org/hub/facebookresearch_pytorchvideo_x3d/) models. However, a current drawback of Lightning Flash is that we could not find a tutorial on training a multi-label video classification model with it, and few other APIs cover this task either. So, how do we achieve this? It involves a deep dive into the behind-the-scenes code of Flash, some customization, and then bringing it all together!

The repository for this blog can be found [**here**](https://github.com/RafayF1/MultiLabelVidFlash.git).

---
---


## The Root of the problem

Before we embark on our journey to explore the code and address the problem, we need to understand precisely what is hindering us from performing **Multi-Label Video classification** in Lightning Flash. As mentioned earlier, the Flash documentation lacks a tutorial for this specific task. Interestingly, there is a tutorial for [Multi-Label Image Classification](https://lightning-flash.readthedocs.io/en/latest/reference/image_classification_multi_label.html), which raises the question: why can't we similarly create a _Multi-label video classification model_? Let's compare some code to find out.

For image classification, Flash employs its [**ImageClassifier**](https://lightning-flash.readthedocs.io/en/latest/api/generated/flash.image.classification.model.ImageClassifier.html#flash.image.classification.model.ImageClassifier) class, which supports multi-label classification through its _multi_label_ argument. However, if you examine the arguments of the [**VideoClassifier**](https://lightning-flash.readthedocs.io/en/latest/api/generated/flash.video.classification.model.VideoClassifier.html#flash.video.classification.model.VideoClassifier) used for video classification, you'll notice that it has no _multi_label_ argument. This is the primary issue: Flash does not support multi-label video classification out of the box. So, can we simply resolve the problem by adding an argument? Well, not quite. After exploring the behind-the-scenes code, we'll discover that we need to write some custom code. Nevertheless, we have identified the root of the problem, which lets us take a step-by-step approach to our goal.


## The Steps

0. All the desired **imports** and **helper functions**

1. **Data Pre-processing**: The details will differ depending on the dataset you work with, but the common goal is to build a multi-label dataset that VideoClassificationData can use to create the DataModule.

2. **Creating our Custom Transform**: This will allow us to use the **x3d_m** model while also integrating the multi-label dataset into the DataModule.

3. **Creating a Custom Classifier class**: The real magic will happen here. We will also add a loss function suitable for a multi-label approach.

4. **Define our DataModule**

5. **Training**


## 0. Imports and Helper Functions

This step covers all of the libraries, classes and functions we need for data pre-processing. **imports.py** contains all the necessary imports.


In [1]:
from imports import *


In [2]:
from flash.core.classification import ClassificationAdapterTask
from types import FunctionType
from typing import Any, Dict, Iterable, List, Optional, Union

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DistributedSampler
from torchmetrics import Accuracy

import flash
from flash.core.classification import ClassificationTask
from flash.core.data.io.input import DataKeys
from flash.core.registry import FlashRegistry
from flash.core.utilities.compatibility import accelerator_connector
from flash.core.utilities.imports import _PYTORCHVIDEO_AVAILABLE
from flash.core.utilities.providers import _PYTORCHVIDEO
from flash.core.utilities.types import (
    LOSS_FN_TYPE,
    LR_SCHEDULER_TYPE,
    METRICS_TYPE,
    OPTIMIZER_TYPE,
)


**Helper Functions**

These functions will help us pre-process the data. All of them live in **utils.py**.


In [3]:
from utils import *


## 1. Data Pre-Processing

One of the most integral aspects of machine learning is data pre-processing. It lets us understand the data and its structure, and manipulate it to suit our needs. Since this tutorial primarily aims to demonstrate how we can achieve **Multi-Label Video Classification** in Flash, I won't be using a large dataset, nor a perfectly suitable one for that matter. Instead, we will use the same [**Kinetics**](https://paperswithcode.com/dataset/kinetics-400-1) dataset featured in the [Video Classification Tutorial](https://lightning-flash.readthedocs.io/en/latest/reference/video_classification.html) on the documentation website, but with a slight twist: we'll convert this data into multi-label data by adding some labels of our own. Let's break it down.


First, we will download the data using the **download_data** function provided by Lightning Flash.


In [7]:
download_data("https://pl-flash-data.s3.amazonaws.com/kinetics.zip", "./data")

Once the data is downloaded, a folder named "**data**" will be created in your current working directory (the `./data` path passed above). Here's an outline of the folder structure:


In [None]:
video_dataset
├── train
│   ├── archery
│   │   ├── -1q7jA3DXQM_000005_000015.mp4
│   │   ├── -5NN5hdIwTc_000036_000046.mp4
│   │   ...
│   ├── bowling
│   │   ├── -5ExwuF5IUI_000030_000040.mp4
│   │   ├── -7sTNNI1Bcg_000075_000085.mp4
│   ... ...
└── val
    ├── archery
    │   ├── 0S-P4lr_c7s_000022_000032.mp4
    │   ├── 2x1lIrgKxYo_000589_000599.mp4
    │   ...
    ├── bowling
    │   ├── 1W7HNDBA4pA_000002_000012.mp4
    │   ├── 4JxH3S5JwMs_000003_000013.mp4
    ... ...

It appears quite straightforward, easily readable, and accessible. However, as we need to manipulate the data, we must obtain the path names of the video files stored in each folder. To retrieve all of these paths, we will employ one of our helper functions, **get_files**, which will store all of the paths of the video files in a single list variable. Let's access the train files.


In [4]:
train_data_path = Path("./data/kinetics/train/")
train_vids = get_files(train_data_path, extensions=[".mp4"])


**train_vids** is a list that contains the path of every .mp4 (i.e. video) file in the train folder. How about a look at one of these paths?


In [5]:
train_vids[1]

PosixPath('data/kinetics/train/bowling/-N4vEATi9Mk_000003_000013.mp4')

**Path**, from Python's **pathlib** library, allows us to break down file paths and manipulate them. For instance, this is how you get the name of the file.


In [6]:
train_vids[1].name

'-N4vEATi9Mk_000003_000013.mp4'

As shown in the folder structure above, this is the name of a video file. We can also get just the name of the folder (the label, in our case) that this file belongs to.


In [7]:
train_vids[1].parent.name

'bowling'
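
The same `pathlib` attributes can be combined to pull the label, the split and the file name out of any path in this layout. A minimal sketch, using a made-up path rather than a real file from the dataset (`PurePosixPath` never touches the disk, so the file does not need to exist):

```python
from pathlib import PurePosixPath

# Hypothetical path that follows the <split>/<label>/<file> layout shown above.
video = PurePosixPath("data/kinetics/train/bowling/example_000003_000013.mp4")

label = video.parent.name          # folder the file sits in, i.e. the class label
split = video.parent.parent.name   # one level higher: "train" or "val"
stem = video.stem                  # file name without the extension
suffix = video.suffix              # extension, including the leading dot

print(label, split, stem, suffix)
```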

Let's take a look at the labels in the data. Currently, the dataset has 5 labels: **archery**, **bowling**, **flying_kite**, **high_jump**, **marching**. As it stands, this is single-label classification data. To make it multi-label, we are going to add two new labels: _indoor_ and _outdoor_. The former marks actions that are usually done indoors, while the latter marks actions that are usually performed outdoors. We will consider **archery** and **bowling** as _indoor_ actions, while **flying_kite**, **high_jump** and **marching** will be categorized as _outdoor_ actions. That is the theoretical part; now, how do we go about achieving this?

We usually _one-hot_ encode our labels for multi-class classification problems. In one-hot encoding, we represent categorical variables as binary vectors: we first map each categorical value to an integer, and then represent each integer as a binary vector in which every value is zero except the one at the index of that integer, which is set to 1. In a multi-label problem, however, each instance can have any number of classes associated with it. Because the labels are not mutually exclusive, one-hot encoding no longer fits, so we use **multi-label binarization** instead. Here the label set of an instance is transformed into a binary vector in which every value is zero except the indexes of the classes in that set, which are set to 1.
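
To make the difference concrete, here is a small sketch in plain Python. The helper names `one_hot` and `multi_hot` are made up for this illustration and are not part of Flash or of the repository:

```python
LABELS = ["archery", "bowling", "flying_kite", "high_jump", "marching", "indoor", "outdoor"]


def one_hot(label, labels=LABELS):
    """Single-label encoding: exactly one position is set to 1."""
    return [1 if name == label else 0 for name in labels]


def multi_hot(active, labels=LABELS):
    """Multi-label binarization: every label in `active` gets a 1."""
    active = set(active)
    unknown = active - set(labels)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in active else 0 for name in labels]


single = one_hot("archery")
multi = multi_hot(["archery", "indoor"])
print(single)  # [1, 0, 0, 0, 0, 0, 0]
print(multi)   # [1, 0, 0, 0, 0, 1, 0]
```

A one-hot vector always sums to 1, while a multi-hot vector can contain as many ones as the instance has labels.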

Let's have a look at our 7 labels. Imagine them as a list like this:


In [12]:
# | echo: False

labels = [
    "archery",
    "bowling",
    "flying_kite",
    "high_jump",
    "marching",
    "indoor",
    "outdoor",
]
labels


['archery',
 'bowling',
 'flying_kite',
 'high_jump',
 'marching',
 'indoor',
 'outdoor']

Looking at the list above, "archery" is at position list[0], "bowling" at list[1], and so on. Using this logic, we will apply **multi-label binarization**. We already have the paths of all the train video files, and we can get the label name of each video file through **parent.name**. Combining all of it together:


In [8]:
l = []
for v in train_vids:
    n = v.parent.name
    if n == "archery":
        lab = [1, 0, 0, 0, 0, 1, 0]
    elif n == "bowling":
        lab = [0, 1, 0, 0, 0, 1, 0]
    elif n == "flying_kite":
        lab = [0, 0, 1, 0, 0, 0, 1]
    elif n == "high_jump":
        lab = [0, 0, 0, 1, 0, 0, 1]
    elif n == "marching":
        lab = [0, 0, 0, 0, 1, 0, 1]
    l.append(lab)

l = np.array(l)
l = torch.from_numpy(l)


Here is what our tensor of multi-label targets looks like (viewing the first 5 rows):


In [9]:
l[:5]

tensor([[0, 1, 0, 0, 0, 1, 0],
        [0, 1, 0, 0, 0, 1, 0],
        [0, 1, 0, 0, 0, 1, 0],
        [0, 1, 0, 0, 0, 1, 0],
        [0, 1, 0, 0, 0, 1, 0]])

_Voila!_ This is exactly what we needed. _1_ states that the video belongs to these labels while _0_ states the opposite. Since this is multi-label classification, a video can belong to more than one class, which is precisely what we wanted. With that accomplished, let's create a DataFrame to view our processed data in all of its glory.


In [10]:
train_vids = [str(vid).replace(str(train_data_path) + "/", "") for vid in train_vids]


To create the DataFrame, we slice our 2-D tensor column by column, following the order of the label names. For example, as stated before, "archery" is the 0th element, which means l[:, 0] extracts column 0 (the first column) from all of the rows.


In [11]:
train_df = pd.DataFrame(
    {
        "video": train_vids,
        "archery": l[:, 0],
        "bowling": l[:, 1],
        "flying_kite": l[:, 2],
        "high_jump": l[:, 3],
        "marching": l[:, 4],
        "indoor": l[:, 5],
        "outdoor": l[:, 6],
    }
)


In [13]:
train_df = train_df.sample(frac=1)

In [14]:
train_df.head()

| | video | archery | bowling | flying_kite | high_jump | marching | indoor | outdoor |
|---|---|---|---|---|---|---|---|---|
| 18 | high_jump/-ZEThexrAe0_000002_000012.mp4 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 16 | high_jump/-v6Dj_-drts_000003_000013.mp4 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 38 | marching/-534IANO-AM_000120_000130.mp4 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 33 | marching/-4c4r9YeS6s_000098_000108.mp4 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 25 | flying_kite/-cMsP8DzCls_000019_000029.mp4 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |


Removing the **.head()** call lets you view the whole DataFrame, but even this shows what our processed data looks like now. We will save the DataFrame to a CSV file in the train folder of our dataset, which will be used in our DataModule:


In [43]:
train_df.to_csv("./data/kinetics/train/train.csv", index=False)

These were all integral steps to view our data, understand its structure, and manipulate the file path names, ultimately enabling us to apply **multi-label binarization** and convert the data to a multi-label format. That being said, clean, organized, and reusable code is a vital aspect of coding: it saves time and avoids repeating ourselves. The dataset also contains a **val** folder, which needs its own CSV file for use in our DataModule. Rather than going through the entire process again, what if we create a function that takes the desired _data path_ as input and generates the CSV file accordingly? In the **utils.py** file, there is a function named **createMultiLabelDf**, and this is what it looks like:


In [None]:
def createMultiLabelDf(data_path):
    data_path = Path(data_path)
    vids = get_files(data_path, extensions=[".mp4"])
    l = multiBinary(vids)
    vids = [str(vid).replace(str(data_path) + "/", "") for vid in vids]

    df = pd.DataFrame(
        {
            "video": vids,
            "archery": l[:, 0],
            "bowling": l[:, 1],
            "flying_kite": l[:, 2],
            "high_jump": l[:, 3],
            "marching": l[:, 4],
            "indoor": l[:, 5],
            "outdoor": l[:, 6],
        }
    )

    df.to_csv(str(data_path) + "/" + str(data_path.name) + ".csv", index=False)


This function contains everything we did above, in the same order. Now see the magic happen: first, we create two variables that hold the string paths of the train and val folders. Then we call the function twice, once with each path.


In [15]:
train_data_path = "./data/kinetics/train/"
val_data_path = "./data/kinetics/val/"

In [16]:
train_csv = createMultiLabelDf(train_data_path)
val_csv = createMultiLabelDf(val_data_path)
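
**createMultiLabelDf** calls a **multiBinary** helper from **utils.py** whose body is not shown in this post. The following is only a guess at its shape: a hypothetical `multi_binary_sketch` that replaces the `if`/`elif` chain from earlier with a lookup table and returns plain nested lists (the real helper presumably returns a tensor, since the caller slices it with `l[:, 0]`).

```python
from pathlib import PurePosixPath

LABELS = ["archery", "bowling", "flying_kite", "high_jump", "marching", "indoor", "outdoor"]

# Scene label assumed for each action, following the split chosen earlier in this post.
SCENE = {
    "archery": "indoor",
    "bowling": "indoor",
    "flying_kite": "outdoor",
    "high_jump": "outdoor",
    "marching": "outdoor",
}


def multi_binary_sketch(video_paths):
    """Return one 7-element 0/1 row per video, derived from its parent folder name."""
    rows = []
    for path in video_paths:
        action = PurePosixPath(path).parent.name
        if action not in SCENE:
            # Fail loudly instead of silently reusing the previous row.
            raise ValueError(f"unexpected folder name: {action!r}")
        active = {action, SCENE[action]}
        rows.append([1 if name in active else 0 for name in LABELS])
    return rows


rows = multi_binary_sketch(["bowling/a.mp4", "marching/b.mp4"])
print(rows)  # [[0, 1, 0, 0, 0, 1, 0], [0, 0, 0, 0, 1, 0, 1]]
```

Wrapping the result in `torch.tensor(rows)` would give a 2-D tensor that supports the column slicing used in **createMultiLabelDf**.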

## 2. Creating our Custom Transform

We create our own **Custom Transform** for two reasons. First, we want to use the **x3d_m** model, which generally gives more accurate results than the smaller **x3d_xs** model; I would suggest giving this [tutorial](https://medium.com/@dreamai/video-classification-using-pytorch-lightning-flash-and-the-x3d-family-of-models-ec6361969073) a read for the in-depth concept. Second, and more important for the multi-label case, the transform that Flash currently ships is written statically. If we explore the code behind the scenes, this is what it looks like:


In [None]:
class VideoClassificationInputTransform(InputTransform):
    image_size: int = 244
    temporal_sub_sample: int = 8
    mean: Tensor = torch.tensor([0.45, 0.45, 0.45])
    std: Tensor = torch.tensor([0.225, 0.225, 0.225])
    data_format: str = "BCTHW"
    same_on_frame: bool = False

    def per_sample_transform(self) -> Callable:
        per_sample_transform = [CenterCrop(self.image_size)]

        return ApplyToKeys(
            DataKeys.INPUT,
            Compose(
                [UniformTemporalSubsample(self.temporal_sub_sample), normalize]
                + per_sample_transform
            ),
        )

    def train_per_sample_transform(self) -> Callable:
        per_sample_transform = [RandomCrop(self.image_size, pad_if_needed=True)]

        return ApplyToKeys(
            DataKeys.INPUT,
            Compose(
                [UniformTemporalSubsample(self.temporal_sub_sample), normalize]
                + per_sample_transform
            ),
        )

    def per_batch_transform_on_device(self) -> Callable:
        return ApplyToKeys(
            DataKeys.INPUT,
            K.VideoSequential(
                K.Normalize(self.mean, self.std),
                data_format=self.data_format,
                same_on_frame=self.same_on_frame,
            ),
        )


To use the **x3d_m** model, the main change is the **temporal_sub_sample** value, which goes from 8 to 16 in the transform below (the **image_size** changes as well). In addition, our processed data now stores a multi-label vector for every video, indicating which classes, or _TARGETs_, it belongs to. So another vital adjustment is to make our DataModule convert these input targets to tensors; otherwise, it will generate errors. For this, we will use the **[ApplyToKeys](https://lightning-flash.readthedocs.io/en/latest/api/generated/flash.core.data.io.input_transform.InputTransform.html)** transform that Lightning Flash provides.


In [None]:
def normalize(x: Tensor) -> Tensor:
    return x / 255.0

In [8]:
class TransformDataModule(InputTransform):
    image_size: int = 256
    temporal_sub_sample: int = 16
    mean: Tensor = torch.tensor([0.45, 0.45, 0.45])
    std: Tensor = torch.tensor([0.225, 0.225, 0.225])
    data_format: str = "BCTHW"
    same_on_frame: bool = False

    def per_sample_transform(self) -> Callable:
        per_sample_transform = [CenterCrop(self.image_size)]

        return Compose(
            [
                ApplyToKeys(
                    DataKeys.INPUT,
                    Compose(
                        [UniformTemporalSubsample(self.temporal_sub_sample), normalize]
                        + per_sample_transform
                    ),
                ),
                ApplyToKeys(DataKeys.TARGET, torch.as_tensor),
            ]
        )

    def train_per_sample_transform(self) -> Callable:
        per_sample_transform = [RandomCrop(self.image_size, pad_if_needed=True)]

        return Compose(
            [
                ApplyToKeys(
                    DataKeys.INPUT,
                    Compose(
                        [UniformTemporalSubsample(self.temporal_sub_sample), normalize]
                        + per_sample_transform
                    ),
                ),
                ApplyToKeys(DataKeys.TARGET, torch.as_tensor),
            ]
        )

    def per_batch_transform_on_device(self) -> Callable:
        return ApplyToKeys(
            DataKeys.INPUT,
            K.VideoSequential(
                K.Normalize(self.mean, self.std),
                data_format=self.data_format,
                same_on_frame=self.same_on_frame,
            ),
        )
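
To get a feel for what the two numeric steps in this transform do, here is a rough plain-Python sketch. It mirrors the idea behind uniform temporal subsampling (pick evenly spaced frames) and the two-stage normalization (divide by 255, then standardize with **mean** and **std**); it is not the PyTorchVideo or Kornia implementation, and the helper names are made up for this illustration.

```python
def uniform_frame_indices(num_frames, num_samples):
    """Pick `num_samples` evenly spaced frame indices from a clip of `num_frames` frames."""
    if num_frames < 1 or num_samples < 1:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    last = num_frames - 1
    # Spread the samples from the first to the last frame and round down to integers.
    return [(i * last) // (num_samples - 1) for i in range(num_samples)]


def normalize_pixel(value, mean=0.45, std=0.225):
    """Scale a 0-255 pixel to 0-1, then standardize it with the transform's mean and std."""
    return (value / 255.0 - mean) / std


eight = uniform_frame_indices(30, 8)      # temporal_sub_sample = 8
sixteen = uniform_frame_indices(30, 16)   # temporal_sub_sample = 16
print(eight)  # [0, 4, 8, 12, 16, 20, 24, 29]
print(len(sixteen), sixteen[0], sixteen[-1])
print(normalize_pixel(0), normalize_pixel(255))
```

Doubling **temporal_sub_sample** keeps the same first and last frames but samples the clip twice as densely in between.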


## 3. Creating a Custom Classifier Class

Since PyTorch Lightning Flash does not directly support _Multi-Label Video Classification_, we need to write some custom code that inherits from Flash's classes and builds on its concepts. The first step is to determine which loss function to use. Since we are not dealing with single-label classification, we need a different one. PyTorch provides the function [**binary_cross_entropy_with_logits**](https://pytorch.org/docs/stable/generated/torch.nn.functional.binary_cross_entropy_with_logits.html), which measures the binary cross-entropy between the targets and the input logits. This fits our case perfectly.


In [None]:
def binary_cross_entropy_with_logits(x: Tensor, y: Tensor) -> Tensor:
    """Calls BCE with logits and cast the target one_hot (y) encoding to floating point precision."""
    return F.binary_cross_entropy_with_logits(x, y.float())
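
To see what this loss computes, here is a plain-Python sketch of binary cross-entropy on logits, averaged over all labels (matching the default mean reduction). It is an illustration of the formula only, written with the standard library rather than PyTorch, and the function names are made up for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed directly from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def bce_from_probs(probs, targets):
    """Textbook binary cross-entropy on probabilities, used here as a reference."""
    total = 0.0
    for p, y in zip(probs, targets):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


logits = [2.0, -1.0, 0.5]   # raw scores for three labels of one video
targets = [1, 0, 1]

loss = bce_with_logits(logits, targets)
reference = bce_from_probs([sigmoid(z) for z in logits], targets)
squashed_twice = bce_with_logits([sigmoid(z) for z in logits], targets)

print(loss, reference, squashed_twice)
```

The first two values agree, which shows that the sigmoid is applied inside the loss. The third value is larger: feeding already-squashed probabilities into a logits-based loss applies the sigmoid twice, so the model's outputs should be raw logits when this loss is used.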

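The next cell builds the registry of backbones: it walks over every function exposed by PyTorchVideo's model **hub** and registers it in a **FlashRegistry**, so that a backbone such as **x3d_m** can later be looked up by name.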


In [9]:
_VIDEO_CLASSIFIER_BACKBONES = FlashRegistry("backbones")

if _PYTORCHVIDEO_AVAILABLE:
    from pytorchvideo.models import hub

    for fn_name in dir(hub):
        if "__" not in fn_name:
            fn = getattr(hub, fn_name)
            if isinstance(fn, FunctionType):
                _VIDEO_CLASSIFIER_BACKBONES(fn=fn, providers=_PYTORCHVIDEO)
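
The `dir` / `getattr` / `isinstance` pattern used above is ordinary Python introspection. As a rough illustration that runs without Flash or PyTorchVideo installed, the sketch below applies the same loop to the standard-library `textwrap` module, with a plain dict standing in for the **FlashRegistry**:

```python
import textwrap
from types import FunctionType

# A plain dict stands in for FlashRegistry here; it is only an illustration.
registry = {}

for name in dir(textwrap):
    if "__" in name:
        continue  # skip dunder attributes such as __name__ or __doc__
    obj = getattr(textwrap, name)
    if isinstance(obj, FunctionType):
        registry[name] = obj  # keep plain functions, drop classes and constants

print(sorted(registry))

# Looking a function up by name, similar in spirit to fetching a backbone by its string name.
dedent = registry["dedent"]
print(dedent("    hello"))
```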

Now, we write our custom classifier class, which will be used to load our desired model for training.


In [10]:
class VC(ClassificationTask):
    backbones: FlashRegistry = _VIDEO_CLASSIFIER_BACKBONES

    required_extras = "video"

    def __init__(
        self,
        num_classes: Optional[int] = None,
        multi_label: bool = False,
        labels: Optional[List[str]] = None,
        backbone: Union[str, nn.Module] = "x3d_xs",
        backbone_kwargs: Optional[Dict] = None,
        pretrained: bool = True,
        loss_fn: LOSS_FN_TYPE = binary_cross_entropy_with_logits,
        optimizer: OPTIMIZER_TYPE = "Adam",
        lr_scheduler: LR_SCHEDULER_TYPE = None,
        metrics: METRICS_TYPE = Accuracy(),
        learning_rate: Optional[float] = None,
        head: Optional[Union[FunctionType, nn.Module]] = None,
    ):
        self.save_hyperparameters()

        if labels is not None and num_classes is None:
            num_classes = len(labels)

        super().__init__(
            model=None,
            loss_fn=loss_fn,
            optimizer=optimizer,
            lr_scheduler=lr_scheduler,
            metrics=metrics,
            learning_rate=learning_rate,
            num_classes=num_classes,
            labels=labels,
            multi_label=multi_label,
        )

        if not backbone_kwargs:
            backbone_kwargs = {}

        backbone_kwargs["pretrained"] = (
            True if (flash._IS_TESTING and torch.cuda.is_available()) else pretrained
        )
        backbone_kwargs["head_activation"] = None

        if isinstance(backbone, nn.Module):
            self.backbone = backbone
        elif isinstance(backbone, str):
            self.backbone = self.backbones.get(backbone)(**backbone_kwargs)
            num_features = self.backbone.blocks[-1].proj.out_features
        else:
            raise ValueError(
                f"backbone should be either a string or a nn.Module. Found: {backbone}"
            )

        self.head = head or nn.Sequential(
            nn.Flatten(),
            nn.Linear(num_features, num_classes),
        )

    def on_train_start(self) -> None:
        if accelerator_connector(self.trainer).is_distributed:
            encoded_dataset = self.trainer.train_dataloader.loaders.dataset.data
            encoded_dataset._video_sampler = DistributedSampler(
                encoded_dataset._labeled_videos
            )
        super().on_train_start()

    def on_train_epoch_start(self) -> None:
        if accelerator_connector(self.trainer).is_distributed:
            encoded_dataset = self.trainer.train_dataloader.loaders.dataset.data
            encoded_dataset._video_sampler.set_epoch(self.trainer.current_epoch)
        super().on_train_epoch_start()

    def step(self, batch: Any, batch_idx: int, metrics) -> Any:
        return super().step(
            (batch[DataKeys.INPUT], batch[DataKeys.TARGET]), batch_idx, metrics
        )

    def forward(self, x: Any) -> Any:
        x = self.backbone(x)
        if self.head is not None:
            x = self.head(x)
        return x

    def predict_step(self, batch: Any, batch_idx: int, dataloader_idx: int = 0) -> Any:
        predictions = self(batch[DataKeys.INPUT])
        batch[DataKeys.PREDS] = predictions
        return batch

    def modules_to_freeze(
        self,
    ) -> Union[nn.Module, Iterable[Union[nn.Module, Iterable]]]:
        """Return the module attributes of the model to be frozen."""
        return list(self.backbone.children())


## 4. Define the DataModule

We will use the _.from_csv_ method, passing the name of the input column, **video**, along with the list of **label** columns.


In [11]:
# | output : False

datamodule = VideoClassificationData.from_csv(
    "video",
    ["archery", "bowling", "flying_kite", "high_jump", "marching", "indoor", "outdoor"],
    train_file="./data/kinetics/train/train.csv",
    train_videos_root="./data/kinetics/train/",
    val_file="./data/kinetics/val/val.csv",
    val_videos_root="./data/kinetics/val/",
    transform_kwargs=dict(image_size=(244, 244)),
    clip_sampler="uniform",
    clip_duration=1,
    decode_audio=False,
    batch_size=8,
    num_workers=2,
    transform=TransformDataModule(),
    persistent_workers=True,
)


Let's check whether our datamodule is exactly what we need, starting with the labels.


In [12]:
datamodule.labels

['archery',
 'bowling',
 'flying_kite',
 'high_jump',
 'marching',
 'indoor',
 'outdoor']

Looks good: it shows all 7 labels. The datamodule also has a .multi_label attribute, which is a boolean value.


In [14]:
datamodule.multi_label

True

Great! Exactly what we wanted.


## 5. Training

We have almost reached the summit! Before we start to train our model, we need to create it first. Let's define the evaluation metrics our model will follow.


In [15]:
# metrics = (F1Score(num_labels=datamodule.num_classes, task="multilabel", top_k=1))
metrics = MultilabelAccuracy(
    num_labels=datamodule.num_classes, threshold=0.5, average=None
)
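
With `average=None`, the metric reports one accuracy value per label instead of a single number. The sketch below shows that idea in plain Python on made-up probabilities; it is a simplified illustration, not the torchmetrics implementation, which may differ in details such as how it handles raw logits.

```python
def per_label_accuracy(probs, targets, threshold=0.5):
    """Fraction of samples classified correctly, computed separately for each label."""
    num_labels = len(targets[0])
    correct = [0] * num_labels
    for sample_probs, sample_targets in zip(probs, targets):
        for j in range(num_labels):
            predicted = 1 if sample_probs[j] > threshold else 0
            correct[j] += int(predicted == sample_targets[j])
    return [c / len(targets) for c in correct]


# Two hypothetical samples with three labels each.
probs = [
    [0.9, 0.2, 0.7],
    [0.4, 0.8, 0.6],
]
targets = [
    [1, 0, 1],
    [1, 1, 0],
]

scores = per_label_accuracy(probs, targets)
print(scores)  # [0.5, 1.0, 0.5]
```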


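Another crucial aspect is knowing which activation function to use. For a multi-class (single-label) problem, we use the softmax activation function, because we want to maximize the probability of a single class and softmax ensures that the probabilities sum to one. In the multi-label setting, we instead use the sigmoid activation function on the output layer. Sigmoid scores each class independently, so the model can assign a high probability to all of the classes, to some of them, or to none of them. One caveat: **binary_cross_entropy_with_logits** already applies a sigmoid internally, so when it is used as the loss, a head that ends in **nn.Sigmoid()** passes its values through the sigmoid twice during training. If the loss plateaus, try dropping the final **nn.Sigmoid()** from the head and applying **torch.sigmoid** to the outputs at prediction time instead.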


In [21]:
head = nn.Sequential(
    nn.Flatten(start_dim=1, end_dim=-1),
    nn.Linear(in_features=400, out_features=7, bias=True),
    nn.Sigmoid(),
)
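
The softmax versus sigmoid contrast is easy to check numerically. A minimal sketch in plain Python, on made-up logits for three labels:

```python
import math


def softmax(logits):
    """Turn logits into probabilities that compete with each other and sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Turn a single logit into an independent probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


logits = [3.0, 2.5, -4.0]   # two labels clearly present, one clearly absent

soft = softmax(logits)
sig = [sigmoid(z) for z in logits]

print([round(p, 3) for p in soft])
print([round(p, 3) for p in sig])
```

Softmax forces the two strong labels to share the probability mass, so only one of them ends up above 0.5. Sigmoid scores each label on its own, so both can be close to 1 at the same time.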


Now, let's combine all of it to create a model for training, using our custom **VC** class with an **x3d_m** backbone.


In [22]:
# | output: False

model = VC(
    backbone="x3d_m",
    labels=datamodule.labels,
    metrics=metrics,
    loss_fn=binary_cross_entropy_with_logits,
    head=head,
    multi_label=datamodule.multi_label,
    pretrained=True,
)

Using 'x3d_m' provided by Facebook Research/PyTorchVideo (https://github.com/facebookresearch/pytorchvideo).


Let's run a few checks to confirm that our model meets our requirements.


In [23]:
model.labels

['archery',
 'bowling',
 'flying_kite',
 'high_jump',
 'marching',
 'indoor',
 'outdoor']

In [24]:
model.multi_label


True

In [25]:
model.head

Sequential(
  (0): Flatten(start_dim=1, end_dim=-1)
  (1): Linear(in_features=400, out_features=7, bias=True)
  (2): Sigmoid()
)

Great! Time for the training process to commence.


In [26]:
# | output: False


trainer = flash.Trainer(
    max_epochs=2,
    accelerator="gpu",
    devices=1,
)


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
  "Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning`"


In [27]:
# | output: False

trainer.finetune(model, datamodule=datamodule, strategy="freeze")

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name          | Type       | Params
---------------------------------------------
0 | train_metrics | ModuleDict | 0     
1 | val_metrics   | ModuleDict | 0     
2 | test_metrics  | ModuleDict | 0     
3 | backbone      | Net        | 3.8 M 
4 | head          | Sequential | 2.8 K 
---------------------------------------------
34.2 K    Trainable params
3.8 M     Non-trainable params
3.8 M     Total params
15.188    Total estimated model params size (MB)



`Trainer.fit` stopped: `max_epochs=2` reached.
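
Once the model is trained, its per-label outputs still have to be turned into label names. A minimal sketch of that last step, assuming one row of per-label probabilities such as a sigmoid output would produce. The numbers below are invented for illustration and are not results from the training run above, and `decode_labels` is a hypothetical helper, not a Flash function.

```python
LABELS = ["archery", "bowling", "flying_kite", "high_jump", "marching", "indoor", "outdoor"]


def decode_labels(probs, labels=LABELS, threshold=0.5):
    """Return the names of all labels whose probability exceeds the threshold."""
    if len(probs) != len(labels):
        raise ValueError("expected one probability per label")
    return [name for name, p in zip(labels, probs) if p > threshold]


# Made-up probabilities for a single video.
example = [0.03, 0.91, 0.02, 0.05, 0.10, 0.88, 0.07]

predicted = decode_labels(example)
print(predicted)  # ['bowling', 'indoor']
```

The threshold of 0.5 matches the one passed to the metric earlier; raising it trades recall for precision.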
