TimeSformer assumes a fixed number of frames in its layers even though it interpolates temporal embeddings based on the input #38027

Open · 1 of 4 tasks · kamila-chay opened this issue May 8, 2025 · 1 comment

kamila-chay commented May 8, 2025

System Info

  • transformers version: 4.52.0.dev0
  • Platform: Linux-6.11.0-24-generic-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • Huggingface_hub version: 0.30.2
  • Safetensors version: 0.5.3
  • Accelerate version: 1.6.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (GPU?): 2.5.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: yes
  • GPU type: NVIDIA GeForce RTX 4070 Laptop GPU

Who can help?

@amyeroberts @qubvel

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I created a short Jupyter notebook: Colab
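
A minimal inline sketch along the same lines (assuming the facebook/timesformer-base-finetuned-k400 checkpoint, whose config sets num_frames=8; the 16-frame input is what triggers the temporal-embedding interpolation):

```python
import torch
from transformers import TimesformerForVideoClassification

# Assumed checkpoint: facebook/timesformer-base-finetuned-k400 (config.num_frames == 8)
model = TimesformerForVideoClassification.from_pretrained(
    "facebook/timesformer-base-finetuned-k400"
)
print(model.config.num_frames)  # 8

# (batch, num_frames, channels, height, width) with twice the configured frame count
pixel_values = torch.randn(1, 16, 3, 224, 224)

with torch.no_grad():
    # Per this issue: the temporal embeddings are interpolated to 16 frames, but the
    # layers still reshape with config.num_frames, so this either errors out or
    # silently attends over the wrong temporal layout.
    outputs = model(pixel_values=pixel_values)
```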

Expected behavior

TimeSformer should infer the number of frames dynamically from the input shape instead of relying on the config, while continuing to take the image size from the config values.

While it's reasonable for the spatial dimensions (like image_size) to be fixed and defined in the config — since they don't typically vary within a single video — temporal length often does, especially in setups using staggered windows to process long videos at high temporal density.

In these cases, it's more practical if the model dynamically infers the number of frames from the input shape, rather than requiring every chunk to match the fixed num_frames set in the config. This would simplify usage and reduce unnecessary edge-case handling downstream.
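
For instance, splitting a long clip into overlapping windows routinely produces chunks whose frame counts differ from config.num_frames (illustration only; staggered_windows is a hypothetical helper, not part of the library):

```python
import torch

def staggered_windows(video: torch.Tensor, window: int = 8, stride: int = 4):
    """Yield overlapping frame windows from a (num_frames, channels, height, width)
    clip; the trailing windows come out shorter than `window`."""
    for start in range(0, video.shape[0], stride):
        chunk = video[start:start + window]
        if chunk.shape[0] > 0:
            yield chunk

video = torch.randn(21, 3, 224, 224)  # 21 frames of a longer clip
print([w.shape[0] for w in staggered_windows(video)])
# [8, 8, 8, 8, 5, 1] -> the trailing chunks no longer match config.num_frames == 8
```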

[EDIT] This issue also arises when using pretrained temporal embeddings that need to be interpolated to accommodate longer sequences. While the interpolation itself proceeds without error, the attention modules can either fail outright or—more subtly—rely on incorrect attention patterns. Notably, this isn't limited to staggered window configurations; it can occur more broadly whenever temporal dimensions are extended beyond the pretrained length.
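
For context, the embedding layer already stretches the pretrained temporal embeddings to match the input; roughly along these lines (a paraphrased sketch, not the exact library code):

```python
import torch
import torch.nn.functional as F

def adapt_time_embeddings(time_embeddings: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Paraphrase of how the (1, pretrained_frames, hidden) temporal embeddings are
    resized when the input's frame count differs from the pretrained length."""
    if num_frames != time_embeddings.size(1):
        time_embeddings = F.interpolate(
            time_embeddings.transpose(1, 2),  # (1, hidden, pretrained_frames)
            size=num_frames,
            mode="nearest",
        ).transpose(1, 2)  # back to (1, num_frames, hidden)
    return time_embeddings

print(adapt_time_embeddings(torch.zeros(1, 8, 768), 16).shape)  # torch.Size([1, 16, 768])
```

This step succeeds for any frame count, which is exactly why the downstream reshape shown below is the part that breaks.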

The forward pass in question, however, still derives the temporal/spatial split from the config:

```python
def forward(self, hidden_states: torch.Tensor, output_attentions: bool = False):
    # Frame count is taken from the config, not from the input shape
    num_frames = self.config.num_frames
    num_patch_width = self.config.image_size // self.config.patch_size
    batch_size = hidden_states.shape[0]
    # Sequence is 1 CLS token + num_frames * num_spatial_tokens patch tokens,
    # so a mismatched frame count silently skews this split
    num_spatial_tokens = (hidden_states.size(1) - 1) // num_frames
    num_patch_height = num_spatial_tokens // num_patch_width
```
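
One possible direction, sketched below (infer_num_frames is a hypothetical helper, not an existing Transformers function): keep the spatial grid fixed by the config and derive the frame count from the sequence length instead.

```python
import torch

def infer_num_frames(hidden_states: torch.Tensor, image_size: int, patch_size: int) -> int:
    """hidden_states: (batch, 1 + num_frames * num_spatial_tokens, hidden).
    Assumes square inputs, as implied by a single image_size in the config."""
    num_patches_per_side = image_size // patch_size
    num_spatial_tokens = num_patches_per_side ** 2
    return (hidden_states.size(1) - 1) // num_spatial_tokens

# e.g. 16 frames of 224x224 input with patch size 16 -> 1 + 16 * 196 = 3137 tokens
hidden_states = torch.randn(1, 1 + 16 * (224 // 16) ** 2, 768)
print(infer_num_frames(hidden_states, image_size=224, patch_size=16))  # 16
```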

I believe this would make the model more flexible and robust in practice.

Let me know if this direction seems reasonable — I’m happy to open a PR with a proposed fix (also open to creating a custom flag in the config in order to not break BC).

kamila-chay added the bug label May 8, 2025
Rocketknight1 (Member) commented

cc @fcakyon as the original PR author as well!
