How long of an embedding vector is common for a vision transformer?
Vision Transformers commonly use embedding vectors of size:

768 (e.g. ViT-Base)
1024 (e.g. ViT-Large)
1280 (e.g. ViT-Huge)
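
These sizes are easy to check empirically. A minimal sketch, assuming the timm library is installed (the model names and the `embed_dim` attribute follow timm's conventions):

```python
import timm

# Instantiate each variant without downloading weights and read off
# its embedding dimension.
for name in ["vit_base_patch16_224", "vit_large_patch16_224", "vit_huge_patch14_224"]:
    model = timm.create_model(name, pretrained=False)
    print(name, model.embed_dim)
# Expected output: 768, 1024, 1280
```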

Andrei-Cristian Rad explains in his Medium article that the ViT architecture uses a trainable embedding tensor of shape (p²·c, d), which learns to linearly project each flattened p×p patch (with c channels) to dimension d. This dimension d is held constant throughout the architecture and is used in most of its components.
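
To make that concrete, here is a minimal sketch of the patch-embedding step in PyTorch; the 224×224 image size, patch size p=16, and d=768 are illustrative values matching ViT-Base:

```python
import torch
import torch.nn as nn

# Illustrative values: 224x224 RGB image, patch size p=16, embed dim d=768.
p, c, d = 16, 3, 768
img = torch.randn(1, c, 224, 224)  # (batch, channels, height, width)

# Split the image into non-overlapping p x p patches and flatten each
# patch to a vector of length p*p*c.
patches = img.unfold(2, p, p).unfold(3, p, p)  # (1, c, 14, 14, p, p)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, p * p * c)  # (1, 196, 768)

# The trainable (p^2 * c, d) projection: a single linear layer mapping
# each flattened patch to a d-dimensional embedding.
proj = nn.Linear(p * p * c, d)
tokens = proj(patches)
print(tokens.shape)  # torch.Size([1, 196, 768])
```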

So anywhere from 768 to 1280 dimensions is common and has been explored in research.
In general, larger embedding sizes let the model capture more fine-grained relationships and representations, but they also increase the parameter count and the risk of overfitting. It's a tradeoff, and the right size depends on the specific use case and the data available.
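
As a rough illustration of how the parameter count scales with d, here's a back-of-the-envelope sketch assuming the standard ViT block layout (attention weights ≈ 4d² for the Q, K, V, and output projections, plus ≈ 8d² for an MLP with hidden size 4d; biases and LayerNorms ignored):

```python
# Approximate parameter count per transformer block as a function of the
# embedding dimension d, under the assumptions stated above.
for d in (768, 1024, 1280):
    per_block = 4 * d**2 + 8 * d**2  # ~12 * d^2
    print(f"d={d}: ~{per_block / 1e6:.1f}M params per block")
# d=768:  ~7.1M  x 12 blocks ≈ 85M  (close to ViT-Base's ~86M)
# d=1024: ~12.6M x 24 blocks ≈ 302M (close to ViT-Large's ~307M)
# d=1280: ~19.7M x 32 blocks ≈ 629M (close to ViT-Huge's ~632M)
```

Since the per-block cost grows quadratically in d, doubling the embedding size roughly quadruples the model's parameters, which is why the larger variants also need much more training data to avoid overfitting.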