# Short Intro Detection in TV Series

## Problem definition


We represent a video as a sequence of frames sampled at 1 FPS and apply binary classification to each frame: $$\lbrace 0, 0, 0, 1, 1, 1, 1, \dots, 0, 0, 0 \rbrace$$
where $1$ marks an intro frame and $0$ marks the main video.

This approach keeps the model independent of video duration.
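
To make the labeling scheme concrete, here is a minimal sketch that turns a per-frame label sequence into time intervals. The helper name `labels_to_segments` and the `fps` argument are illustrative assumptions and not part of the notebook's code.

```python
from typing import List, Tuple


def labels_to_segments(labels: List[int], fps: float = 1.0) -> List[Tuple[float, float]]:
    """Convert per-frame 0/1 labels into (start_sec, end_sec) intro segments.

    Hypothetical helper: the end of each segment is exclusive.
    """
    segments = []
    start = None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i  # a run of intro frames begins here
        elif label == 0 and start is not None:
            segments.append((start / fps, i / fps))  # run ended before frame i
            start = None
    if start is not None:  # the run extends to the last frame
        segments.append((start / fps, len(labels) / fps))
    return segments


labels = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
print(labels_to_segments(labels))  # [(3.0, 7.0)]
```

At 1 FPS the frame index equals the timestamp in seconds, so the intro in this toy sequence spans seconds 3 to 7.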

## Model Architecture

### Input


The model processes video content as sliding windows of 60 consecutive
frames, sampled at a rate of 1 FPS. Each frame is resized to 224×224 pixels
and normalized using standard ImageNet statistics. The resulting input tensor
has the shape (B, T, C, H, W), where B is the batch size, T = 60 is the temporal
window length, C = 3 is the number of color channels, and H, W = 224 are the
spatial dimensions.
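
One way to cut a video into such windows is sketched below. The text does not state the stride between windows, so `stride=30` is purely an assumption, as is the helper name `window_starts`.

```python
from typing import List


def window_starts(n_frames: int, window: int = 60, stride: int = 30) -> List[int]:
    """Start indices of fixed-length windows covering frames [0, n_frames).

    Hypothetical helper: the stride value is an assumption, not taken from the text.
    """
    if n_frames <= window:
        return [0]  # a single window; shorter videos would need padding
    starts = list(range(0, n_frames - window + 1, stride))
    if starts[-1] + window < n_frames:
        starts.append(n_frames - window)  # extra window aligned to the video end
    return starts


print(window_starts(150))  # [0, 30, 60, 90]
print(window_starts(100))  # [0, 30, 40]
```

Each start index `s` selects frames `s` to `s + 59`, which are then stacked into one `(T, C, H, W)` sample.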


### Feature Extractor


Each frame is passed through the CLIP image encoder, producing a **512-dimensional embedding**:
$$f_t = \mathrm{CLIP}(I_t) \quad \forall t \in \left[1, 60\right]$$
where $I_t$ is the frame at time step $t$.

The output feature sequence has shape:
$$\left( B, T, D\right)$$
where $B$ is the batch size, $T = 60$, and $D = 512$.

### Positional Encoding

To preserve temporal structure, we add positional encodings to the CLIP embeddings.
The final embedding matrix is obtained as:
$$E = \left[f_1 + P_1, \dots, f_{60} + P_{60} \right]$$
where $P_t$ is the learnable positional encoding at timestep $t$. This gives the model
access to each frame's position in the window, so it can learn temporal dependencies
within the input sequence.


### Multi-Head Attention for Temporal Context

To capture long-range dependencies between frames, we employ a **multi-head
attention mechanism**. The temporal module is a transformer encoder with **16
layers and 16 attention heads per layer**, allowing the model to:
* Learn contextual dependencies between frames.
* Recognize patterns in intros and credits that span multiple frames.
* Differentiate between fast and slow transitions, improving robustness
across different editing styles.

Each attention head computes a weighted sum of value vectors:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

where $Q$, $K$, $V$ are the query, key, and value matrices derived from the input embeddings, and $d_k$ is the per-head key dimension.
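
The formula can be traced by hand with a tiny dependency-free sketch. This is a single-head illustration on plain Python lists, not the `nn.TransformerEncoder` used in the implementation. The function names are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out


# Two identical keys receive equal weight, so the output is the mean of the values.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]
V = [[0.0, 2.0], [4.0, 6.0]]
print(attention(Q, K, V))  # [[2.0, 4.0]]
```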

### Frame-wise Classification

The final classification layer applies a linear classifier at each of the 60 frame
positions and predicts whether that frame belongs to an intro/credit or to the
main content. The output is represented as:
$$ \hat y_t = \sigma(W_t E_t + b_t) \quad \forall t \in \left[ 1, 60 \right]$$
where $W_t$ and $b_t$ are the parameters of the classifier at timestep $t$, $E_t$ here
denotes the representation of frame $t$ after the attention layers, and $\sigma$ denotes
the sigmoid activation function. The implementation in this notebook shares a single
linear layer across all timesteps, i.e. $W_t = W$ and $b_t = b$.

The predictions from all 60 classifiers are concatenated to form the final
sequence output:
$$ \hat Y = \left[ \hat y_1, ..., \hat y_{60}\right]$$
which is then used for sequence labeling.
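
As a toy illustration of the classification step, the sketch below applies one shared weight vector to each frame representation and thresholds the sigmoid output at 0.5, the threshold used in the evaluation code. The names `frame_probabilities` and `predict_labels`, and the 2-dimensional toy vectors, are assumptions for illustration only.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def frame_probabilities(E: List[List[float]], w: List[float], b: float) -> List[float]:
    """y_t = sigmoid(w . e_t + b), with w and b shared across timesteps."""
    return [sigmoid(sum(wi * ei for wi, ei in zip(w, e)) + b) for e in E]


def predict_labels(probs: List[float], threshold: float = 0.5) -> List[int]:
    """Binarize per-frame probabilities into 0/1 labels."""
    return [1 if p > threshold else 0 for p in probs]


# Toy 2-dimensional frame representations (the real model uses D = 512).
E = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
probs = frame_probabilities(E, w=[2.0, -2.0], b=0.0)
print(predict_labels(probs))  # [1, 0, 0]
```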

## Model implementation

In [None]:
import cv2
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms
import open_clip
from tqdm import tqdm

class IntroDetector(nn.Module):
    def __init__(
        self,
        clip_name="ViT-B-32",
        window_size=60,
        transformer_layers=16,
        n_heads=16,
        dropout=0.1,
        unfreeze_clip=False,
    ):
        super().__init__()
        # CLIP
        self.clip, _, self.clip_preprocess = open_clip.create_model_and_transforms(clip_name, pretrained="laion2b_s34b_b79k")
        self.clip.eval()
        if not unfreeze_clip:
            for p in self.clip.parameters():
                p.requires_grad_(False)
        self.embed_dim = self.clip.visual.output_dim  # 512 for ViT‑B/32
        # Learnable positional embeddings, one vector per window position
        self.pos_embed = nn.Parameter(torch.randn(1, window_size, self.embed_dim) * 0.02)
        # Transformer
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=self.embed_dim,
            nhead=n_heads,
            dim_feedforward=self.embed_dim * 4,
            dropout=dropout,
            batch_first=True,
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=transformer_layers)
        # Classification head (shared across all timesteps)
        self.head = nn.Linear(self.embed_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C, H, W = x.shape
        x = x.flatten(0, 1)  # (B*T,3,224,224)
        with torch.set_grad_enabled(self.training and any(p.requires_grad for p in self.clip.parameters())):
            feats = self.clip.encode_image(x)  # (B*T, D)
        feats = feats.view(B, T, -1)
        feats = feats + self.pos_embed[:, :T]
        feats = self.transformer(feats)
        logits = self.head(feats).squeeze(-1)  # (B,T)
        return logits

## Evaluation metrics


To assess model performance, we report accuracy, precision, recall, and F1-score,
the most commonly used metrics in binary classification tasks. Accuracy reflects the
overall proportion of correctly labeled frames. Precision indicates how many of the
predicted intro/credit frames are actually correct, while recall measures how many of
the true intro/credit frames were successfully identified. The F1-score summarizes
both by computing their harmonic mean.
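
The definitions can be checked on a small hand-made example. The sketch below computes the four metrics from raw counts. It is a plain-Python illustration of the formulas, with `binary_metrics` as a hypothetical helper name, and returns 0 for undefined ratios.

```python
from typing import Dict, List


def binary_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for 0/1 frame labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # intro predicted as intro
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # content predicted as intro
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # intro predicted as content
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # content predicted as content
    acc = (tp + tn) / len(pairs) if pairs else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"acc": acc, "prec": prec, "rec": rec, "f1": f1}


y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
# TP = 2, FP = 1, FN = 1, TN = 2, so all four metrics equal 2/3 here.
print({k: round(v, 3) for k, v in binary_metrics(y_true, y_pred).items()})
```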

In [None]:
@torch.no_grad()
def evaluate(model, loader, device):
    model.eval()
    y_true, y_pred = [], []
    for frames, labels in loader:
        frames, labels = frames.to(device), labels.to(device)
        logits = model(frames)
        probs = torch.sigmoid(logits)
        preds = (probs > 0.5).float()
        y_true.append(labels.cpu().numpy())
        y_pred.append(preds.cpu().numpy())
    y_true = np.concatenate(y_true).ravel()
    y_pred = np.concatenate(y_pred).ravel()
    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary", zero_division=0)
    acc = accuracy_score(y_true, y_pred)
    return {"acc": acc, "prec": precision, "rec": recall, "f1": f1}