# Multimodal Transformers (Text–Image–Audio–Video)

This notebook introduces multimodal learning with transformers across three model families: contrastively trained dual encoders (CLIP, ALIGN), contrastive-captioning hybrids (CoCa), and general-purpose multimodal language models (PaLM-E, GPT-4V).

## Learning objectives
- Understand cross-modal alignment and contrastive objectives
- Learn fusion strategies (early/late/cross-attention)
- Run a minimal contrastive similarity demo with synthetic features
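
As a rough illustration of the three fusion strategies named above, the sketch below applies each one to random stand-in features. The tensor sizes, variable names, and the choice of a single PyTorch layer per strategy are assumptions made for this example, not the architecture of any model listed in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, N_img, N_txt, C = 2, 6, 4, 16  # hypothetical batch, sequence, and feature sizes
img = torch.randn(B, N_img, C)    # stand-in for image patch embeddings
txt = torch.randn(B, N_txt, C)    # stand-in for text token embeddings

# Early fusion: concatenate both token sequences and run one shared encoder layer.
shared_layer = nn.TransformerEncoderLayer(
    d_model=C, nhead=4, dim_feedforward=32, batch_first=True
)
early = shared_layer(torch.cat([img, txt], dim=1))  # (B, N_img + N_txt, C)

# Late fusion: pool each modality on its own, then combine the pooled vectors.
late = torch.cat([img.mean(dim=1), txt.mean(dim=1)], dim=-1)  # (B, 2 * C)

# Cross-attention fusion: text tokens (queries) attend to image tokens (keys/values).
cross_attn = nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)
cross, attn_weights = cross_attn(query=txt, key=img, value=img)  # cross: (B, N_txt, C)

print(early.shape, late.shape, cross.shape, attn_weights.shape)
```

Early fusion lets every token attend to every other token but costs the most compute. Late fusion keeps the encoders independent, as in dual-encoder models. Cross-attention sits in between, injecting one modality into the other at chosen layers.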

## References (Papers)
- Radford et al., 2021: "Learning Transferable Visual Models From Natural Language Supervision" (CLIP, arXiv:2103.00020)
- Jia et al., 2021: "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision" (ALIGN, arXiv:2102.05918)
- Yu et al., 2022: "CoCa: Contrastive Captioners are Image-Text Foundation Models" (arXiv:2205.01917)
- Driess et al., 2023: "PaLM-E: An Embodied Multimodal Language Model" (arXiv:2303.03378)
- OpenAI, 2023: "GPT-4V(ision) System Card" and "GPT-4 Technical Report" (arXiv:2303.08774)


In [None]:
# Contrastive similarity + rearrange demo
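import torch
import torch.nn.functional as F
from einops import rearrange

B, N, C = 4, 8, 32  # batch size, tokens per sample, embedding dim
# Simulate per-patch image embeddings and per-token text embeddings
img_tokens = torch.randn(B, N, C)
text_tokens = torch.randn(B, N, C)

# Mean-pool to a single vector per sample, then L2-normalize so that
# the dot products below are cosine similarities
img_emb = F.normalize(img_tokens.mean(dim=1), dim=-1)  # (B, C)
text_emb = F.normalize(text_tokens.mean(dim=1), dim=-1)  # (B, C)

# Contrastive logits: entry (i, j) compares text i with image j
logits = text_emb @ img_emb.T  # (B, B)
print('Contrastive logits:', logits.shape)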

# Flatten the batch and token axes into one, then restore the original shape
flat = rearrange(img_tokens, 'B N C -> (B N) C')
restored = rearrange(flat, '(B N) C -> B N C', B=B, N=N)
print('Flat:', flat.shape, 'Restored:', restored.shape)
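
The demo above stops at the similarity matrix. A minimal sketch of how such logits are commonly turned into a CLIP-style symmetric contrastive loss follows, assuming that matched text-image pairs share the same batch index. The fixed `temperature` of 0.07 is an assumption for illustration (CLIP initializes its temperature to that value but learns it during training), and the embeddings are random, so the loss value carries no meaning.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, C = 4, 32
# Random stand-ins for pooled, L2-normalized embeddings
img_emb = F.normalize(torch.randn(B, C), dim=-1)
text_emb = F.normalize(torch.randn(B, C), dim=-1)

temperature = 0.07  # assumed fixed value for this sketch
logits = text_emb @ img_emb.T / temperature  # (B, B), rows = texts, columns = images

# The matching pair for sample i sits on the diagonal, so the target class is i.
targets = torch.arange(B)
loss_text_to_image = F.cross_entropy(logits, targets)
loss_image_to_text = F.cross_entropy(logits.T, targets)
loss = 0.5 * (loss_text_to_image + loss_image_to_text)
print(f"Symmetric contrastive loss: {loss.item():.4f}")
```

Averaging the two directions treats text-to-image and image-to-text retrieval symmetrically. Every off-diagonal entry in the batch acts as a negative example.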
