TimeSformer assumes a fixed number of frames in its layers even though it interpolates temporal embeddings based on the input #38027

Open · 1 of 4 tasks · kamila-chay opened this issue May 8, 2025 · 1 comment

kamila-chay commented May 8, 2025

System Info

  • transformers version: 4.52.0.dev0
  • Platform: Linux-6.11.0-24-generic-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • Huggingface_hub version: 0.30.2
  • Safetensors version: 0.5.3
  • Accelerate version: 1.6.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (GPU?): 2.5.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: yes
  • GPU type: NVIDIA GeForce RTX 4070 Laptop GPU

Who can help?

@amyeroberts @qubvel

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I created a short Jupyter notebook: Colab
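
A minimal inline sketch along the same lines (assuming the facebook/timesformer-base-finetuned-k400 checkpoint, whose config sets num_frames=8; the 16-frame input is what triggers the temporal-embedding interpolation):

```python
import torch
from transformers import TimesformerForVideoClassification

# Assumed checkpoint: facebook/timesformer-base-finetuned-k400 (config.num_frames == 8)
model = TimesformerForVideoClassification.from_pretrained(
    "facebook/timesformer-base-finetuned-k400"
)
print(model.config.num_frames)  # 8

# (batch, num_frames, channels, height, width) with twice the configured frame count
pixel_values = torch.randn(1, 16, 3, 224, 224)

with torch.no_grad():
    # Per this issue: the temporal embeddings are interpolated to 16 frames, but the
    # layers still reshape with config.num_frames, so this either errors out or
    # silently attends over the wrong temporal layout.
    outputs = model(pixel_values=pixel_values)
```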

Expected behavior

TimeSformer should infer the number of frames dynamically from the input shape instead of relying on the config, while continuing to take the image size from the config values.

While it's reasonable for the spatial dimensions (like image_size) to be fixed and defined in the config — since they don't typically vary within a single video — temporal length often does, especially in setups using staggered windows to process long videos at high temporal density.

In these cases, it's more practical if the model dynamically infers the number of frames from the input shape, rather than requiring every chunk to match the fixed num_frames set in the config. This would simplify usage and reduce unnecessary edge-case handling downstream.
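
For instance, splitting a long clip into overlapping windows routinely produces chunks whose frame counts differ from config.num_frames (illustration only; staggered_windows is a hypothetical helper, not part of the library):

```python
import torch

def staggered_windows(video: torch.Tensor, window: int = 8, stride: int = 4):
    """Yield overlapping frame windows from a (num_frames, channels, height, width)
    clip; the trailing windows come out shorter than `window`."""
    for start in range(0, video.shape[0], stride):
        chunk = video[start:start + window]
        if chunk.shape[0] > 0:
            yield chunk

video = torch.randn(21, 3, 224, 224)  # 21 frames of a longer clip
print([w.shape[0] for w in staggered_windows(video)])
# [8, 8, 8, 8, 5, 1] -> the trailing chunks no longer match config.num_frames == 8
```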

[EDIT] This issue also arises when using pretrained temporal embeddings that need to be interpolated to accommodate longer sequences. While the interpolation itself proceeds without error, the attention modules can either fail outright or—more subtly—rely on incorrect attention patterns. Notably, this isn't limited to staggered window configurations; it can occur more broadly whenever temporal dimensions are extended beyond the pretrained length.
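
For context, the embedding layer already stretches the pretrained temporal embeddings to match the input; roughly along these lines (a paraphrased sketch, not the exact library code):

```python
import torch
import torch.nn.functional as F

def adapt_time_embeddings(time_embeddings: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Paraphrase of how the (1, pretrained_frames, hidden) temporal embeddings are
    resized when the input's frame count differs from the pretrained length."""
    if num_frames != time_embeddings.size(1):
        time_embeddings = F.interpolate(
            time_embeddings.transpose(1, 2),  # (1, hidden, pretrained_frames)
            size=num_frames,
            mode="nearest",
        ).transpose(1, 2)  # back to (1, num_frames, hidden)
    return time_embeddings

print(adapt_time_embeddings(torch.zeros(1, 8, 768), 16).shape)  # torch.Size([1, 16, 768])
```

This step succeeds for any frame count, which is exactly why the downstream reshape shown below is the part that breaks.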

The forward pass in question, however, still derives the temporal/spatial split from the config:

```python
def forward(self, hidden_states: torch.Tensor, output_attentions: bool = False):
    # Frame count is taken from the config, not from the input shape
    num_frames = self.config.num_frames
    num_patch_width = self.config.image_size // self.config.patch_size
    batch_size = hidden_states.shape[0]
    # Sequence is 1 CLS token + num_frames * num_spatial_tokens patch tokens,
    # so a mismatched frame count silently skews this split
    num_spatial_tokens = (hidden_states.size(1) - 1) // num_frames
    num_patch_height = num_spatial_tokens // num_patch_width
```
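
One possible direction, sketched below (infer_num_frames is a hypothetical helper, not an existing Transformers function): keep the spatial grid fixed by the config and derive the frame count from the sequence length instead.

```python
import torch

def infer_num_frames(hidden_states: torch.Tensor, image_size: int, patch_size: int) -> int:
    """hidden_states: (batch, 1 + num_frames * num_spatial_tokens, hidden).
    Assumes square inputs, as implied by a single image_size in the config."""
    num_patches_per_side = image_size // patch_size
    num_spatial_tokens = num_patches_per_side ** 2
    return (hidden_states.size(1) - 1) // num_spatial_tokens

# e.g. 16 frames of 224x224 input with patch size 16 -> 1 + 16 * 196 = 3137 tokens
hidden_states = torch.randn(1, 1 + 16 * (224 // 16) ** 2, 768)
print(infer_num_frames(hidden_states, image_size=224, patch_size=16))  # 16
```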

I believe this would make the model more flexible and robust in practice.

Let me know if this direction seems reasonable — I’m happy to open a PR with a proposed fix (also open to creating a custom flag in the config in order to not break BC).

kamila-chay added the bug label May 8, 2025
Rocketknight1 (Member) commented

cc @fcakyon as the original PR author as well!
