<a href="https://colab.research.google.com/github/Jaseelkt007/Generative_AI_basics/blob/main/Ctrl_V_paper_explanation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ctrl-V Paper Explanation
**Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion (Dec 2024)**

https://arxiv.org/abs/2406.05630

Ctrl-V uses a two-part approach:
1. It first generates trajectories for object bounding boxes over time (BBox Generator using a diffusion model).
2. Then, it generates the actual video (Box2Video) conditioned on these bounding box trajectories using a specialized video diffusion model.
    *   In the **Box2Video** step, Ctrl-V directly adapts the ControlNet framework for video generation:
        *   The ControlNet takes “bounding box frames” as control signals and injects them into the video diffusion backbone (Stable Video Diffusion, SVD).
        *   The SVD’s original weights are frozen, and only the ControlNet-adapter module is trained. This mirrors the original image ControlNet design, extended to video synthesis and spatiotemporal control.



# BBox Generator

## 1. Purpose

The BBox Generator predicts bounding boxes for objects across all video frames, using a Stable Video Diffusion (SVD) backbone.

## 2. Inputs to the Model

There are four main inputs:

### $\hat{b}^t$
- The encoded "video" of bounding boxes, with noise at level $t$ added
- In training, this is what the model tries to denoise

### $b^{(0)}$
- The encoded initial bounding box/frame(s)
- Represents the object position(s) at the start of the video

### $b^{(N-1)}$
- The encoded final bounding box/frame
- The object position(s) at the end of the video

### $z^{(0)}$
- The encoded initial video frame itself (not the bounding box, but the frame's latent or pixel-space encoding)

## 3. Training Objective

### Goal
The model learns to predict the noise that was added to $\hat{b}^t$ (the noisy boxes) using the UNet, trained under the EDM noise schedule (EDM refers to the framework from "Elucidating the Design Space of Diffusion-Based Generative Models").

### Process
It recovers the original bounding boxes $b$ from the noisy version $\hat{b}^t$ by:
- Predicting the noise component (using UNet outputs)
- "Eliminating" this noise by combining the rescaled UNet output with the rescaled noisy input, where the scaling functions depend on the noise level (standard in EDM-style diffusion models)

*Note: The model diagram abstracts away this denoising detail for simplicity.*
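
To make the "scaling functions" step concrete, here is a minimal NumPy sketch of EDM-style preconditioning. It is an illustration under stated assumptions, not the Ctrl-V implementation: `SIGMA_DATA`, `edm_scalings`, `denoise`, and the `oracle` network are hypothetical names, and the arrays are random stand-ins for encoded bounding-box frames.

```python
import numpy as np

SIGMA_DATA = 0.5  # assumed data standard deviation (the default used in the EDM paper)

def edm_scalings(sigma, sigma_data=SIGMA_DATA):
    """Return (c_skip, c_out, c_in) as defined by the EDM preconditioning."""
    total = sigma**2 + sigma_data**2
    c_skip = sigma_data**2 / total
    c_out = sigma * sigma_data / np.sqrt(total)
    c_in = 1.0 / np.sqrt(total)
    return c_skip, c_out, c_in

def denoise(b_noisy, sigma, network):
    """Combine the noisy input and the network output into a clean estimate."""
    c_skip, c_out, c_in = edm_scalings(sigma)
    return c_skip * b_noisy + c_out * network(c_in * b_noisy, sigma)

rng = np.random.default_rng(0)
b_clean = rng.normal(size=(4, 8))   # stand-in for encoded box frames
sigma = 1.5                         # noise level
b_noisy = b_clean + sigma * rng.normal(size=b_clean.shape)

# Hypothetical "perfect" network: it returns exactly the training target,
# so the denoiser should reproduce the clean input.
c_skip, c_out, _ = edm_scalings(sigma)
oracle = lambda x_in, s: (b_clean - c_skip * b_noisy) / c_out

b_recovered = denoise(b_noisy, sigma, oracle)
```

With a real UNet in place of `oracle`, the recovery is only approximate; training pushes the network output toward that target.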

## 4. Input Vector Formation

All four inputs are transformed and concatenated into the expected format for the UNet adapter inside the SVD backbone.

### Key Step – Constructing $z_{pad}^{(0)}$

- $z^{(0)}$ has a shape $1 \times C' \times H' \times W'$ (one frame, with feature channels and spatial dimensions)
- It is replicated to get shape $N \times C' \times H' \times W'$ (for N frames)

**Crucially:**
The very first (0-th) and last ($(N-1)$-th) elements along the first (frame) dimension are replaced:
- The first becomes $b^{(0)}$ (initial bounding box)
- The last becomes $b^{(N-1)}$ (final bounding box)

This process creates a padded tensor:

$$z_{pad}^{(0)} = \text{concat}(b^{(0)}, z^{(0)}, \ldots, z^{(0)}, b^{(N-1)})$$

**Shape:** $N \times C' \times H' \times W'$
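
The construction of $z_{pad}^{(0)}$ can be sketched in a few lines of NumPy. This is a minimal sketch, assuming all encodings share the shape $C' \times H' \times W'$; `build_z_pad` and the constant-valued arrays are hypothetical, chosen only so each slot is easy to identify.

```python
import numpy as np

def build_z_pad(z0, b_first, b_last, num_frames):
    """Replicate z0 over num_frames, then overwrite the first and last slots."""
    # z0 has shape (1, C, H, W); b_first and b_last have shape (C, H, W).
    z_pad = np.repeat(z0, num_frames, axis=0)  # new array of shape (N, C, H, W)
    z_pad[0] = b_first                         # slot 0 becomes b^(0)
    z_pad[-1] = b_last                         # slot N-1 becomes b^(N-1)
    return z_pad

N, C, H, W = 5, 4, 8, 8
z0 = np.full((1, C, H, W), 1.0)   # stand-in for the encoded initial frame
b0 = np.full((C, H, W), 2.0)      # stand-in for the encoded initial box frame
bN = np.full((C, H, W), 3.0)      # stand-in for the encoded final box frame

z_pad = build_z_pad(z0, b0, bN, N)
```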

## 5. Final Model Input

The noisy bounding box encoding $\hat{b}^t$ is concatenated with $z_{pad}^{(0)}$.

This combined tensor is fed to the UNet adapter.

## 6. Additional Conditioning

The model also uses extra info for conditioning at each UNet layer/block:

### $c^{(0)}$ - CLIP-encoded embedding of the initial frame
- CLIP features (semantic info) of the starting frame

### $t$ - Noise-level embedding
- Indicates the noise intensity (typical in diffusion models)

These embeddings are integrated into every sub-block of the UNet using a self-attention mechanism (so every part of the network is aware of them throughout the process).

## Summary Table

| Component | Role |
|-----------|------|
| $\hat{b}^t$ | Noisy video of bounding box encodings (to be denoised by model) |
| $b^{(0)}$ | Encoded initial bounding box/frame |
| $b^{(N-1)}$ | Encoded final bounding box/frame |
| $z^{(0)}$ | Encoded initial video frame |
| $z_{pad}^{(0)}$ | N-frame tensor, start and end replaced by $b^{(0)}$, $b^{(N-1)}$, rest are $z^{(0)}$ |
| UNet Adapter Input | Concatenation of $\hat{b}^t$ and $z_{pad}^{(0)}$ |
| Conditioning Inputs | CLIP embedding of initial frame ($c^{(0)}$), noise level embedding ($t$) |
| Integration | Conditionings are fed into every UNet sub-block via self-attention |

## In Short

The BBox Generator uses noisy and clean cues from bounding boxes and video frames.

It "pads" a stack of latent encodings with the initial and final bounding box states.

It concatenates this with the noisy box encoding for UNet-based denoising.

Semantic and noise-level information are injected via self-attention throughout the network.

# Representing Bounding Box in Pixel space

## 1. Ctrl-V Design Choice: Pixel-Space Rendering

**Ctrl-V** renders **bounding boxes as actual images/frames in pixel space**, not just as numbers (coordinates).

## 2. Why is this important?

**How** bounding box information is encoded and injected as a control signal for the video generator **matters a lot** for generation performance and flexibility.

### Prior approach example (Boximator):
- Bounding boxes are converted into a **vector format** (by applying a Fourier embedding, i.e. sinusoidal features, to the raw box coordinates, combined with ID and metadata)
- This is a compact, vectorized (not image) representation used as input
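
For contrast with the pixel-space rendering that follows, here is a generic sketch of a Fourier-style encoding of box coordinates. It is not Boximator's actual code: `fourier_embed`, the number of frequencies, and the example box are all assumptions made for illustration.

```python
import numpy as np

def fourier_embed(coords, num_freqs=4):
    """Map normalized coordinates to sin/cos features at doubling frequencies."""
    coords = np.asarray(coords, dtype=np.float64)       # shape (..., D)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi       # shape (F,)
    angles = coords[..., None] * freqs                  # shape (..., D, F)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (..., D, 2F)
    return feats.reshape(*coords.shape[:-1], -1)        # shape (..., D * 2F)

# Hypothetical normalized box: (x_min, y_min, x_max, y_max)
box = [0.25, 0.40, 0.60, 0.90]
embedding = fourier_embed(box)   # one flat vector per box
```

The result is a flat vector per box, which is compact but no longer has any spatial layout.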

## 3. Ctrl-V's Contrasting Approach

Instead of vectorizing, **Ctrl-V renders bounding boxes into images (frames)**, intentionally preserving spatial and meta information visually.

### What does this mean?
- The bounding box for each object is literally drawn into a "canvas" (image) as it would appear in the original frame
- This is done for every frame in the sequence

## 4. Encoding Extra Information Visually

The following information is encoded visually in the rendered bounding boxes:

### Track ID
- Uniquely identifies each object throughout the video

### Object Type
- Tells what kind of object it is (e.g., car, pedestrian)

### Orientation
- Indicates which direction the object is facing

### How are these encoded?

| Visual Element | Encodes |
|----------------|---------|
| **Border color** | Object's ID |
| **Fill color** | Type of object |
| **Markings** | Orientation (could be an arrow, angled line, etc.) |
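
The encoding in the table can be sketched as a small rasterizer. This is a minimal sketch, assuming integer pixel coordinates and made-up color palettes; `render_boxes`, `FILL_BY_TYPE`, and `BORDER_BY_TRACK` are hypothetical, and the orientation marking is left out for brevity.

```python
import numpy as np

# Hypothetical palettes; the actual colors used by Ctrl-V are not specified here.
FILL_BY_TYPE = {"car": (0, 0, 255), "pedestrian": (0, 255, 0)}
BORDER_BY_TRACK = {0: (255, 0, 0), 1: (255, 255, 0)}

def render_boxes(boxes, height, width, border=2):
    """Draw each box as a filled rectangle with a colored border on a black canvas."""
    frame = np.zeros((height, width, 3), dtype=np.uint8)
    for box in boxes:
        x0, y0, x1, y1 = box["xyxy"]
        # Paint the full rectangle in the border color (encodes the track ID).
        frame[y0:y1, x0:x1] = BORDER_BY_TRACK[box["track_id"]]
        # Paint the interior in the fill color (encodes the object type).
        frame[y0 + border:y1 - border, x0 + border:x1 - border] = FILL_BY_TYPE[box["type"]]
    return frame

boxes = [
    {"xyxy": (4, 4, 20, 16), "track_id": 0, "type": "car"},
    {"xyxy": (30, 10, 40, 30), "track_id": 1, "type": "pedestrian"},
]
frame = render_boxes(boxes, height=48, width=64)
```

Calling `render_boxes` once per time step gives a stack of box frames that can be treated like a video.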

## 5. Advantages of This Approach

### Minimal Loss of Meta-information
- By rendering everything visually in the frame, you don't "throw away" information that a coordinate-only or vectorized representation might lose (e.g., relative positions, shapes, types)

### Pixel-Level Guidance using ControlNet
- Since the box is rendered into a pixel-level frame, it is possible to use **ControlNet** (a diffusion model guidance method) to impose *precise* control over what parts of the image or video the diffusion process should focus on

# Box2Video

## Purpose of Box2Video

**Goal:** To generate *high-fidelity videos* that are **controlled** by bounding box frames.
- Bounding boxes may come from the BBox Generator

## Architecture at a Glance

### Components:
- **SVD Backbone:** Handles video diffusion/generation (SVD = Stable Video Diffusion)
- **Adapted ControlNet Module:** Processes the bounding box control signal and injects it into the video diffusion process

## ControlNet

- A network for **controlling image (now video) generation** using explicit signals (e.g., bounding boxes in image/pixel space)
- **Modification:** Ctrl-V adapts ControlNet so it integrates with a video diffusion model instead of just images

## Training Efficiency

### Single-stage training
- Ctrl-V's Box2Video is trained end-to-end (single stage), *without* extra losses/criteria or staged pretraining

### Prior work comparison:
- Previous methods (Boximator, TrackDiffusion) **required extra multi-stage training** and extra losses
- Box2Video is architected for simpler, more efficient training

## Inputs and Pre-processing

### Two main video-related inputs to SVD:
- $z^{(0)}$: Encoded initial video frame
- $z^t$: Encoded entire video, with noise at level $t$ added

### Preparation:
- $z^{(0)}$ is **padded (repeated along the time/frame dimension)** so shapes match for concatenation
- These are concatenated to form input to the SVD's UNet adapter (video diffusion entry)
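
The preparation step can be sketched as follows. The text above does not say which axis is used for the concatenation, so joining along the channel axis is an assumption here; the shapes and random arrays are illustrative stand-ins.

```python
import numpy as np

N, C, H, W = 6, 4, 8, 8
rng = np.random.default_rng(0)
z0 = rng.normal(size=(1, C, H, W))    # stand-in for the encoded initial frame
z_t = rng.normal(size=(N, C, H, W))   # stand-in for the noisy encoding of the whole clip

# Pad z0 by repeating it along the frame axis so it matches z_t.
z0_rep = np.repeat(z0, N, axis=0)     # shape (N, C, H, W)

# Assumption: the two tensors are joined along the channel axis (axis=1).
svd_input = np.concatenate([z0_rep, z_t], axis=1)   # shape (N, 2C, H, W)
```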

## ControlNet Processing Path

### Parallel Inputs:
- The **same input** (concatenated [$z^{(0)}$, $z^t$]) is also sent into ControlNet via its own UNet adapter layers
- **Bounding box frames $b$** (rendered in pixel space, as discussed above) are also encoded and sent into ControlNet through dedicated adapter layers ("ControlNet adapter layers")

### Merging:
- The two ControlNet input streams (**video** and **bounding box frames**) are *added together*
- This combined information is processed by ControlNet, allowing the network to use **both structure (boxes) and context (video init/content)** to guide generation

## Output Flow & Control Signal Injection

- The processed signal from ControlNet goes through a **zero-convolution**: a learnable 1×1 convolution whose weights and bias are initialized to zero, so it contributes nothing at the start of training and gradually learns to influence the output

### Residual Path:
- This "control" signal is injected (residually) into the **SVD UNet decoder layers**, influencing each stage of the video generation process according to the control input
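
The effect of zero-initialization on the residual path can be checked with a toy NumPy example. This is a sketch on a single feature map, not the Box2Video code: `conv1x1` is a hypothetical helper, and the "trained" weights are random values used only to show a non-zero contribution.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution over channels: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

C, H, W = 4, 8, 8
rng = np.random.default_rng(0)
backbone_feat = rng.normal(size=(C, H, W))   # feature map from the frozen backbone
control_feat = rng.normal(size=(C, H, W))    # feature map from the ControlNet branch

# Zero-initialized 1x1 convolution: the residual adds nothing before training.
zero_w = np.zeros((C, C))
zero_b = np.zeros(C)
out_at_init = backbone_feat + conv1x1(control_feat, zero_w, zero_b)

# After training the weights are non-zero (random here, purely illustrative).
trained_w = 0.1 * rng.normal(size=(C, C))
out_after = backbone_feat + conv1x1(control_feat, trained_w, zero_b)
```

At initialization the backbone output is unchanged, which is why adding the control branch does not disturb the pretrained model at the start of training.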

## Training Details

| Component | Training Status | Description |
|-----------|----------------|-------------|
| **SVD weights ($\theta$)** | **Frozen** | The main video diffusion model doesn't learn further |
| **ControlNet weights ($\xi$)** | **Trained** | Only these weights are updated during training |

This allows the model to efficiently *adapt* or *guide* the base diffusion model using bounding boxes, without retraining the whole video generator from scratch.
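
The frozen/trained split can be illustrated with a deliberately tiny example: one scalar stands in for the frozen backbone weights $\theta$ and one for the trainable ControlNet weights $\xi$. Everything here (the linear model, the target, the learning rate) is a made-up toy, not the Ctrl-V training setup.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)    # stand-in for the backbone input
c = rng.normal(size=100)    # stand-in for the control signal
y = 2.0 * x + 0.7 * c       # toy target that depends on the control signal

theta = 2.0   # "backbone" weight: frozen
xi = 0.0      # "ControlNet" weight: trainable, starts at zero

def loss(theta, xi):
    """Mean squared error of the toy model theta * x + xi * c."""
    return np.mean((theta * x + xi * c - y) ** 2)

loss_before = loss(theta, xi)
theta_before = theta
for _ in range(200):
    residual = theta * x + xi * c - y
    grad_xi = np.mean(2.0 * residual * c)   # gradient for the trainable weight only
    xi -= 0.1 * grad_xi                     # theta is never updated
loss_after = loss(theta, xi)
```

Only `xi` moves during the loop, yet the loss drops because the control path learns the part of the target that the frozen weight cannot express.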