Let me preface this discussion by saying the following:

The all-reduce that I referenced in the vLLM codebase is a feature inherent in tensor parallelism. As models grow into the tens or hundreds of billions of parameters, a full copy of the model no longer fits on each GPU, so Megatron-LM addresses this by sharding the model’s layers themselves across GPUs.

Frameworks implement this differently. vLLM uses specialized parallel linear layer classes to partition weight matrices and distribute the work. Each GPU computes a partial sum, then all GPUs share (and add together) their partial sums so that each one knows the full sum. That final sharing step is called an all-reduce. The **all-reduce** is not the issue directly: it is simply how the **feed-forward network** and the **multi-head attention projections** are parallelized, with **inter-GPU communication using NCCL** under the hood for synchronization.

Are the All-Reduce Operations in Embeddings and FFN/MHA Connected?

- The embedding layer's all-reduce happens only once per forward pass, at the start, ensuring every GPU has the full token embeddings before the transformer blocks start processing them.
- The all-reduce in FFN and MHA happens repeatedly in every transformer block during both training and inference.

In large models, the **embedding matrix** is also handled in parallel, although the overhead of synchronizing embeddings is often negligible compared to the cost of multi-head attention and feed-forward layers.

**Although the multi-modal projector uses tensor parallel building blocks internally, it suffers from the same limitation as the embedding layer: ultimately, the full set of projected features must be available on all GPUs before the next stage (the Transformer’s self-attention) can proceed.**

The argument I am making is that the **vision encoder was not fully sharded across GPUs**. The vLLM team mentioned that the “image encoder is not fully sharded” and that only the standard Transformer layers had tensor parallel support, inviting contributions to improve this (source: github.com). This means that in a multi-GPU deployment, one GPU might handle all the vision processing and projection, then distribute the results to others. That approach can create a performance bottleneck: while one GPU is busy encoding the image, the others may sit idle until the projector output is broadcast.

Potential bottlenecks: Because an all-reduce is required, the embedding layer can become a communication hotspot if a very large number of tokens are embedded at once. However, in most transformer inference scenarios the embedding lookup is a small fraction of total compute. The overhead of synchronizing embeddings is often negligible compared to the cost of multi-head attention and feed-forward layers. 
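
To put rough numbers on this, here is a back-of-the-envelope sketch comparing the single embedding all-reduce with the all-reduces inside the transformer blocks for one forward pass. It assumes the Megatron-LM pattern of two all-reduces per block (one after the attention output projection, one after the second FFN layer), each over a `tokens × hidden_size` activation tensor. The function names and the model sizes are hypothetical and chosen only for illustration; they are not taken from vLLM.

```python
def allreduce_payload_bytes(num_tokens: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """Size of one all-reduced activation tensor (tokens x hidden), in bytes."""
    return num_tokens * hidden_size * bytes_per_value


def forward_pass_allreduces(num_layers: int) -> dict:
    """Count all-reduces in one tensor-parallel forward pass.

    Assumption: one all-reduce after the embedding lookup and two per
    transformer block (attention output projection + second FFN layer).
    """
    return {"embedding": 1, "transformer_blocks": 2 * num_layers}


# Hypothetical model and batch, for illustration only.
num_tokens, hidden_size, num_layers = 2048, 4096, 32

payload = allreduce_payload_bytes(num_tokens, hidden_size)
counts = forward_pass_allreduces(num_layers)
total = counts["embedding"] + counts["transformer_blocks"]
embedding_share = counts["embedding"] / total

print(f"payload per all-reduce: {payload / 2**20:.1f} MiB")
print(f"all-reduces per forward pass: {total}")
print(f"embedding share of all-reduce traffic: {embedding_share:.1%}")
```

Because every all-reduce here moves a tensor of the same shape, the embedding's share of the traffic under these assumptions is simply one out of `1 + 2 * num_layers` operations.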

- As models grow into the tens or hundreds of billions of parameters, a full copy of the model no longer fits on each GPU, so Megatron-LM addresses this by sharding the model’s layers themselves across GPUs.
- This is what tensor parallelism is: it splits the model layers across multiple GPUs on a single node, following the Megatron-LM approach.
- vLLM uses specialized parallel linear layer classes (e.g. ColumnParallelLinear and RowParallelLinear) to partition weight matrices and distribute the work. Think of a weight matrix as a big table of numbers (parameters) that the network “learns” during training. Then, during inference (the forward pass), the input data is multiplied by these learned parameters to produce the layer’s output. They’re called “weights” because during training, the network adjusts (or “learns”) the entries in the weight matrix W and the bias vector b to minimize a loss function.
- RowParallelLinear splits a weight matrix by rows (i.e. splits the input features), so each GPU gets a slice of the input vector and a corresponding slice of weights.
- An all-reduce works like a group of people who each add up part of a list of numbers: at the end, everyone shares (and adds together) their partial sums so that each person knows the full sum. That final sharing step is called an all-reduce: it makes sure all GPUs end up with the same complete answer. In neural networks, a RowParallelLinear layer uses the same idea. Instead of storing the entire weight matrix on one GPU, each GPU gets a slice (a few rows). **Each GPU multiplies its slice by the matching slice of the input**, getting a partial result. Then they do an all-reduce to sum up those partial results. Afterward, every GPU holds the full output, even though each GPU started out holding just part of the weights. This method cuts down on memory use per GPU and lets them work together efficiently.
- In a Transformer model, each block consists of two main components:
  1) Multi-head self-attention
  2) **Feed-forward network** (FFN): a first linear layer followed by a second linear layer

  Step 1 (first linear layer, ColumnParallelLinear): Each worker (GPU) gets the same input but works on a different portion of the job, namely its own slice of the output columns. The workers do not need to talk to each other afterward; each does its own task.

  Step 2 (second linear layer, RowParallelLinear): Each worker feeds its own slice of the intermediate result into its slice of the second weight matrix, so it now only has a partial sum of the full output. To complete the job, the workers must combine their results (**all-reduce**) to ensure every worker has the full answer.

  The same concept applies to **multi-head attention projections**, where **GPUs split the work** in the query-key-value matrices by attention head, compute attention for their own heads locally, and combine results with an all-reduce after the output projection.
- vLLM manages the **inter-GPU communication using NCCL** under the hood, and it benefits from high-bandwidth links like NVLink to reduce overhead.
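
The column-parallel then row-parallel pattern described in the list above can be checked numerically. The following is a minimal sketch that simulates two GPUs with a plain Python loop and NumPy arrays; it is not vLLM's implementation and involves no NCCL. It assumes weights are stored as `(in_features, out_features)` so that the layer computes `x @ W`, uses ReLU as a stand-in activation, and omits biases. All names and sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
world_size = 2                     # number of simulated GPUs (assumption)
tokens, d_model, d_ff = 3, 4, 8    # toy sizes, chosen for illustration

x = rng.standard_normal((tokens, d_model))
w1 = rng.standard_normal((d_model, d_ff))   # first linear layer
w2 = rng.standard_normal((d_ff, d_model))   # second linear layer

# Reference: the unsharded FFN on a single device.
reference = np.maximum(x @ w1, 0.0) @ w2

# Column-parallel: split w1 along its output columns. Every rank sees all of x.
w1_shards = np.split(w1, world_size, axis=1)
# Row-parallel: split w2 along its input rows, matching the column split above.
w2_shards = np.split(w2, world_size, axis=0)

partial_outputs = []
for rank in range(world_size):
    # Local slice of the hidden activations; no communication needed here.
    hidden = np.maximum(x @ w1_shards[rank], 0.0)
    # Each rank produces a partial sum of the full output.
    partial_outputs.append(hidden @ w2_shards[rank])

# Simulated all-reduce (sum): every rank would end up with this same tensor.
output = np.sum(partial_outputs, axis=0)

print(np.allclose(output, reference))
```

Pairing the two splits this way is what lets the first layer skip communication entirely: the activation is elementwise, so each rank can apply it to its own columns, and only one all-reduce is needed at the end of the FFN.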

What is an Embedding Layer?

- The mapping of words to vectors is done through a lookup table (or a matrix) called an embedding matrix. If a model has a vocabulary of 50,000 words, and each word is represented as a 4,096-dimensional vector, then the embedding matrix has shape 50,000 × 4,096, which is about 204.8 million parameters (roughly 0.4 GB at 16-bit precision).
In large models, the **embedding matrix** is large: with bigger vocabularies and hidden sizes it can reach several gigabytes.
Keeping a full copy on every GPU uses memory that is already scarce for large-scale models.
To solve this, we use vocabulary parallelism, where we split the embedding matrix across multiple GPUs.
- vLLM supports splitting the embedding matrix across GPUs in the vocabulary dimension, an approach known as vocabulary parallelism. The VocabParallelEmbedding module is provided for this purpose.
- At runtime (during inference or training), each GPU only stores a portion of the embedding matrix. This means that not every GPU has access to all words; each GPU only knows about some words. Imagine we give the model a sentence to process:
  1) Each word in the sentence corresponds to an index in the embedding table.
  2) vLLM creates a "mask", a kind of checklist that tells each GPU: "These words belong to you" ✅ or "These words belong to another GPU" ❌. This mask helps each GPU determine which words it should process and which it should ignore.
  3) Each GPU retrieves the embeddings only for the words it owns. For words that belong to another GPU, it outputs zeros instead.
  4) Every GPU therefore ends up with a partial result: some words have valid embeddings, while others are just zeros.
  5) **An all-reduce operation is used to sum up the partial results.** This allows every GPU to end up with the full set of embeddings, just as if it had the entire embedding table.
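
The mask, zero-fill and all-reduce steps above can be sketched in a few lines. This is a toy NumPy simulation under stated assumptions, not the code of `VocabParallelEmbedding`: the "GPUs" are iterations of a Python loop, the all-reduce is a plain sum, the vocabulary is split into equal contiguous ranges, and all sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
world_size = 2                    # number of simulated GPUs (assumption)
vocab_size, hidden_size = 8, 4    # toy sizes, chosen for illustration

embedding = rng.standard_normal((vocab_size, hidden_size))
token_ids = np.array([1, 6, 3, 7, 0])

shard_size = vocab_size // world_size
partial_outputs = []
for rank in range(world_size):
    start, end = rank * shard_size, (rank + 1) * shard_size
    local_table = embedding[start:end]                  # the rows this rank owns

    owned = (token_ids >= start) & (token_ids < end)    # the "mask"
    local_ids = np.where(owned, token_ids - start, 0)   # safe index for foreign tokens
    local_out = local_table[local_ids]                  # fancy indexing returns a copy
    local_out[~owned] = 0.0                             # zeros for tokens owned elsewhere
    partial_outputs.append(local_out)

# Simulated all-reduce (sum): each token is non-zero on exactly one rank.
output = np.sum(partial_outputs, axis=0)

print(np.allclose(output, embedding[token_ids]))
```

Summation works as the combine step because every token row is zero on all ranks except the one that owns it, so adding the partial results reconstructs the full lookup.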

**Embedding parallelism and the parallelism used in feed-forward networks (FFN) and multi-head attention (MHA) projections are separate but related. They all use tensor parallelism but operate at different points in the model’s computation, and the all-reduce operations happen independently in each case.**

Some users observed that “the embedding layers... are not being parallelized” as effectively as other layers.
