Conversation

@CodersAcademy006
Contributor

This PR adds a 3D patch embedding module for video inputs, enabling
Diffusion Transformer (DiT)–style video models in Megatron-LM.

The module converts video tensors [B, C, T, H, W] into a sequence of
tokens [B, N, D] using a single Conv3D projection, consistent with ViT
and DiT architectures.

  • Introduces VideoPatchEmbed under megatron/core/vision
  • Uses Conv3D for efficient linear projection over spatiotemporal patches
  • Produces Transformer-ready token sequences
  • Includes unit test validating shape correctness
  • No impact on existing models or training paths

This PR is intentionally minimal and self-contained.
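For reference, the described behavior can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the class name `VideoPatchEmbed` comes from the description above, but the constructor arguments, defaults, and internal layout are assumptions and may differ from the implementation in `megatron/core/vision`.

```python
# Hypothetical sketch of a 3D patch embedding as described in this PR;
# actual signatures in megatron/core/vision may differ.
import torch
import torch.nn as nn

class VideoPatchEmbed(nn.Module):
    """3D patch embedding: [B, C, T, H, W] -> [B, N, D] via a single Conv3d."""

    def __init__(self, in_channels=3, embed_dim=768, patch_size=(2, 16, 16)):
        super().__init__()
        # A Conv3d whose kernel equals its stride applies one linear
        # projection per non-overlapping spatiotemporal patch, the 3D
        # analogue of the ViT/DiT patchify step.
        self.proj = nn.Conv3d(
            in_channels, embed_dim,
            kernel_size=patch_size, stride=patch_size,
        )

    def forward(self, x):
        # x: [B, C, T, H, W]
        x = self.proj(x)                  # [B, D, T/pt, H/ph, W/pw]
        x = x.flatten(2).transpose(1, 2)  # [B, N, D] with N = (T/pt)*(H/ph)*(W/pw)
        return x

# Shape check in the spirit of the unit test mentioned above:
embed = VideoPatchEmbed()
video = torch.randn(2, 3, 8, 32, 32)  # [B, C, T, H, W]
tokens = embed(video)
print(tokens.shape)                   # torch.Size([2, 16, 768])
```

With patch size (2, 16, 16), an 8-frame 32x32 clip yields 4 x 2 x 2 = 16 tokens of width 768, giving a Transformer-ready sequence.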

Fixes #2796 (Part 3: DiT reference wiring)

@CodersAcademy006 CodersAcademy006 requested review from a team as code owners January 5, 2026 09:40
@copy-pr-bot

copy-pr-bot bot commented Jan 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.




Development

Successfully merging this pull request may close these issues.

Feature Request: Video Generation Model Support (Wan, DiT architectures)
