System Info
`transformers` version: 4.52.0.dev0
Who can help?
@amyeroberts @qubvel
Information
Tasks
An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
Reproduction
I created a short Jupyter notebook: Colab
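In case the notebook link is unavailable, the gist of the reproduction is just passing a clip whose frame count differs from `config.num_frames`. A minimal sketch (randomly initialized model, made-up shapes):

```python
import torch
from transformers import TimesformerConfig, TimesformerModel

# Config fixes num_frames=8, but the input clip has 16 frames.
config = TimesformerConfig(num_frames=8, image_size=224, patch_size=16)
model = TimesformerModel(config)

pixel_values = torch.randn(1, 16, 3, 224, 224)  # (batch, frames, channels, height, width)

# The temporal embeddings get interpolated to 16 frames without complaint, but
# downstream the encoder layers still rely on config.num_frames=8, so this
# either errors out or silently attends over the wrong temporal grouping.
outputs = model(pixel_values)
```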
Expected behavior
Timesformer should infer the number of frames dynamically instead of relying on the config, while continuing to infer the image size from config values.
While it's reasonable for the spatial dimensions (like image_size) to be fixed and defined in the config — since they don't typically vary within a single video — temporal length often does, especially in setups using staggered windows to process long videos at high temporal density.
In these cases, it's more practical if the model dynamically infers the number of frames from the input shape, rather than requiring every chunk to match the fixed num_frames set in the config. This would simplify usage and reduce unnecessary edge-case handling downstream.
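To make the staggered-window scenario concrete, here is a rough sketch of the kind of chunking I have in mind (window/stride values and variable names are purely illustrative):

```python
import torch

# A long video processed with overlapping (staggered) windows; the final
# window(s) end up shorter than the rest, so their frame count no longer
# matches config.num_frames unless they are padded or dropped.
video = torch.randn(100, 3, 224, 224)  # (frames, channels, height, width)
window, stride = 16, 8                  # illustrative values

chunks = [video[start:start + window] for start in range(0, video.shape[0], stride)]
# With dynamic frame inference, each chunk could be passed to the model as-is,
# e.g. model(chunk.unsqueeze(0)), regardless of chunk.shape[0].
```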
[EDIT] This issue also arises when using pretrained temporal embeddings that need to be interpolated to accommodate longer sequences. While the interpolation itself proceeds without error, the attention modules can either fail outright or—more subtly—rely on incorrect attention patterns. Notably, this isn't limited to staggered window configurations; it can occur more broadly whenever temporal dimensions are extended beyond the pretrained length.
```python
def forward(self, hidden_states: torch.Tensor, output_attentions: bool = False):
```
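To illustrate the direction I mean, here is a minimal sketch (not a proposed final implementation; `infer_num_frames` is a made-up helper, and it assumes the usual layout of one CLS token plus `num_frames * patches_per_frame` tokens): instead of reading `self.config.num_frames` inside this forward, the temporal length could be recovered from the sequence length itself.

```python
import torch

def infer_num_frames(hidden_states: torch.Tensor, image_size: int, patch_size: int) -> int:
    # hidden_states: (batch_size, 1 + num_frames * patches_per_frame, hidden_size),
    # where the leading token is the CLS token.
    patches_per_frame = (image_size // patch_size) ** 2
    return (hidden_states.shape[1] - 1) // patches_per_frame
```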
I believe this would make the model more flexible and robust in practice.
Let me know if this direction seems reasonable — I’m happy to open a PR with a proposed fix (also open to creating a custom flag in the config in order to not break BC).
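If a config flag is the preferred way to keep backward compatibility, I'm picturing something as simple as the following (the flag name is hypothetical, not an existing TimesformerConfig attribute):

```python
from transformers import TimesformerConfig

config = TimesformerConfig(num_frames=8)
config.infer_num_frames_from_input = True  # hypothetical opt-in flag; default False keeps current behavior
```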