### **A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications**  
**Authors:** Siyuan Mu & Sen Lin (2025)

**Source:** https://arxiv.org/abs/2503.07137

---

### **Abstract**

This paper delivers a **comprehensive survey** of *Mixture-of-Experts (MoE)* models — architectures that enable dynamic sparsity by activating only a subset of parameters for each input. MoE models address the escalating computational demands of dense large models and their inability to handle heterogeneous or multimodal data effectively. Through selective expert activation, MoE achieves scalable efficiency and improved specialization across domains like computer vision (CV) and natural language processing (NLP). The survey organizes advances across algorithms, theory, and applications, forming a cohesive foundation for future work.

---

### **Problems**

1. **Resource inefficiency:** Dense architectures activate all parameters per input, causing prohibitive computation and deployment costs.  
2. **Data heterogeneity:** Monolithic models struggle with multimodal or conflicting distributions.  
3. **Limited scalability:** In a dense model, compute per input grows with every added parameter, so increasing capacity directly inflates training and inference cost.  
4. **Fragmented literature:** Prior surveys lack synthesis across algorithms, theory, and application domains.

---

### **Proposed Solutions**

* **Dynamic sparsity:** Activate a small subset of experts via gating functions and routing networks.  
* **Divide-and-conquer design:** Each expert specializes in a subdomain, improving modularity and generalization.  
* **Scalable frameworks:** Embed MoE layers in Transformers (e.g., Switch Transformer, GLaM, Mixtral 8×7B, DeepSeek-V3) to scale to trillions of parameters efficiently.  
* **Unified survey perspective:** Combine analysis across theoretical, algorithmic, and system-level advances.

---

### **Purpose**

To produce the **first unified, cross-disciplinary survey** summarizing **algorithms, theory, system designs, and applications** of Mixture-of-Experts models — guiding researchers toward scalable and modular AI architectures.

---

### **Methodology**

1. **Taxonomy of Core Designs:**  
   * **Gating functions:** Linear, cosine, soft, or non-linear routing strategies.  
   * **Expert architectures:** Feedforward networks (FFN), CNNs, attention-based blocks.  
   * **Routing levels:** Token-, modality-, or task-level routing.  
   * **Training mechanisms:** Load-balancing losses, Top-K routing, dropout regularization.  

2. **Algorithmic Paradigms:**  
   * **Continual learning:** Task-adaptive experts to prevent catastrophic forgetting (e.g., Lifelong-MoE).  
   * **Meta-learning:** Expert ensembles enabling rapid adaptation (e.g., MoE-NPs).  
   * **Multi-task learning:** Cross-task parameter sharing (e.g., MMoE).  
   * **Reinforcement learning:** Modularized policies with specialized experts.  
   * **Federated learning:** Distributed experts across clients to ensure privacy-preserving learning.  

3. **Theoretical Analysis:**  
   * Expressivity and generalization proofs.  
   * Convergence guarantees under differentiable gating.  
   * Analysis of sparsity’s role in modular representation.  

4. **Application Domains:**  
   * **Computer Vision:** Classification, segmentation, generation.  
   * **NLP:** Translation, comprehension, text generation, multimodal fusion.  

---

### **Results**

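* Sparse models such as **Switch Transformer**, **GLaM**, **Mixtral 8×7B**, and **DeepSeek-V3** report **greater parameter efficiency** than dense equivalents; Switch Transformer, for instance, reports up to a **7× pre-training speedup** over a dense T5-Base baseline and about **4×** over T5-XXL at matched compute.  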
* MoE models **equal or outperform** dense baselines on benchmarks like SuperGLUE and ImageNet while activating fewer parameters.  
* Theoretical and empirical studies confirm MoE’s **enhanced approximation and modular learning capabilities**.  
* Cross-domain evaluations (CL, MTL, FL) show **improved adaptability, stability, and task specialization**.  

---

### **Conclusions**

Mixture-of-Experts architectures emerge as a **paradigm for efficient and specialized large-scale AI**. They mitigate the inefficiencies of dense models through modular computation and adaptive routing. Future directions include:

* Designing robust differentiable gating mechanisms.  
* Enhancing cross-domain and multimodal generalization.  
* Deepening theoretical understanding of sparse optimization.  
* Integrating MoE systems for large-scale distributed training.  

Mathematically, MoE learning can be conceptualized as optimizing over both expert parameters and gating functions:

$$
\min_{\theta, \phi} \mathbb{E}_{(x, y) \sim \mathcal{D}}
\Big[ \mathcal{L}\big(y, \sum_{i=1}^{K} g_i(x; \phi) f_i(x; \theta_i)\big) \Big]
$$

where \( \theta = \{\theta_i\}_{i=1}^{K} \) collects the expert parameters, \( g_i(x; \phi) \) is the gating probability, and \( f_i(x; \theta_i) \) is the output of expert \( i \).  
This structure yields **selective computation**, allowing MoE models to balance **capacity** and **efficiency** — setting the stage for the next generation of adaptive AI systems.


# **Mathematical and Statistical Summary of “A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications” (Mu & Lin, 2025)**

This summary isolates and explains the mathematical, probabilistic, and statistical reasoning underlying the Mixture-of-Experts (MoE) framework, as synthesized from the 2025 survey (arXiv:2503.07137v3).

---

## **1. Gating Function Mathematics**

### **(1) Linear Gating Function**

$$
G_i(x) = \text{softmax}\big(\text{TopK}(g(x) + R_{\text{noise}}, k)\big)_i
$$

with  
$$
\text{TopK}(v, k)_i =
\begin{cases}
v_i, & \text{if } v_i \text{ is among the top } k \text{ elements of } v \\
-\infty, & \text{otherwise}
\end{cases}
$$

**Explanation:**  
- \( g(x) \): linear transformation (e.g., \( W_g x + b_g \)) producing scores.  
- \( R_{\text{noise}} \): Gaussian noise for stochastic exploration.  
- \( \text{TopK} \): keeps top-\(k\) expert scores, masking others.  
- \( \text{softmax} \): normalizes active scores into probabilities.

**Purpose:**  
Implements *sparse activation*, where only the top-\(k\) experts receive nonzero probability mass, yielding computational efficiency.
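
The linear gate above can be sketched in a few lines of plain Python. This is a minimal illustration, not the survey's implementation: the function name `noisy_top_k_gate`, the `noise_std` parameter, and the assumption that the scores \( g(x) \) are already computed are all introduced here for the example.

```python
import math
import random

def noisy_top_k_gate(scores, k, noise_std=0.0, rng=None):
    """Softmax over the k largest (optionally noised) expert scores.

    scores    : per-expert logits g(x), assumed precomputed (e.g. W_g x + b_g)
    noise_std : std of the Gaussian noise R_noise; 0.0 disables the noise
    """
    rng = rng or random.Random(0)
    noisy = [s + rng.gauss(0.0, noise_std) for s in scores]
    # Indices of the k largest noisy scores.
    top = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]
    # Masked scores act like -inf, so exp(-inf) = 0 and they get zero weight.
    m = max(noisy[i] for i in top)
    exps = {i: math.exp(noisy[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(noisy))]

gates = noisy_top_k_gate([2.0, 0.5, 1.0, -1.0], k=2)
print(gates)  # only experts 0 and 2 receive nonzero weight
```

Because the mask is applied before the softmax, the surviving weights always sum to one, which is the property the formula relies on.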

---

### **(2) Cosine (Nonlinear) Gating Function**

$$
G(x) = \text{TopK}\!\left(
\text{softmax}\!\left(
\frac{E^{T} W_{\text{linear}} x}{\tau \, \|W_{\text{linear}}x\| \, \|E\|}
\right)
\right)
$$

**Explanation:**  
- \( E \): expert embedding matrix.  
- \( W_{\text{linear}} \): learned projection.  
- \( \tau \): temperature parameter controlling selection sharpness.

**Statistical Meaning:**  
The gating is based on **cosine similarity**, the normalized inner product between the projected input \( W_{\text{linear}} x \) and each expert embedding. Smaller \( \tau \) → sharper distributions → routing weight concentrated on fewer experts (the number selected by TopK stays fixed).
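
The effect of the temperature can be checked with a small sketch. It assumes the projected input \( W_{\text{linear}} x \) is already available as `h`, and it normalizes by the norm of each expert embedding separately, which is one reading of \( \|E\| \) in the formula; `cosine_gate` is a hypothetical helper name.

```python
import math

def cosine_gate(h, expert_embeddings, tau=0.1, k=2):
    """h: projected input (assumed precomputed); expert_embeddings: one vector per expert."""
    def norm(v):
        return math.sqrt(sum(a * a for a in v))
    # Cosine similarity between h and each expert embedding, scaled by 1/tau.
    logits = [
        sum(a * b for a, b in zip(h, e)) / (tau * norm(h) * norm(e))
        for e in expert_embeddings
    ]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [v / z for v in exps]
    # TopK is applied after the softmax, as in the formula above.
    top = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    return [p if i in top else 0.0 for i, p in enumerate(probs)]

E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sharp = cosine_gate([1.0, 0.0], E, tau=0.1)
soft = cosine_gate([1.0, 0.0], E, tau=1.0)
print(sharp, soft)
```

With the smaller temperature, nearly all of the weight lands on the best-aligned expert, while the larger temperature spreads it more evenly across the selected ones.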

---

### **(3) Probabilistic (Exponential-Family) Gating**

$$
G_j(x, \nu) =
\frac{\beta_j D(x \mid \nu_j)}{\sum_i \beta_i D(x \mid \nu_i)}
$$

**where**  
\( D(x \mid \nu_j) \) is a probability density (e.g., Gaussian),  
and \( \sum_j \beta_j = 1 \).

**Interpretation:**  
This is the **statistical mixture model form** of MoE, where the gating assigns posterior probabilities proportional to expert likelihoods.  
It unifies MoE with the classical *mixture-of-distributions* framework in statistics.
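
As a concrete instance, the sketch below evaluates this gate for one-dimensional Gaussian densities. The choice of Gaussian components and the helper names `gaussian_pdf` and `density_gate` are assumptions made for illustration.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def density_gate(x, betas, means, stds):
    """G_j(x) = beta_j D(x | nu_j) / sum_i beta_i D(x | nu_i), with Gaussian D."""
    weighted = [b * gaussian_pdf(x, m, s) for b, m, s in zip(betas, means, stds)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Two experts centred at -1 and +1 with equal mixing weights.
midpoint = density_gate(0.0, betas=[0.5, 0.5], means=[-1.0, 1.0], stds=[1.0, 1.0])
near_first = density_gate(-1.0, betas=[0.5, 0.5], means=[-1.0, 1.0], stds=[1.0, 1.0])
print(midpoint, near_first)
```

An input halfway between the two centres is split evenly, while an input at the first centre is assigned mostly to the first expert, which is Bayes' rule applied to the mixture components.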

---

### **(4) Soft Gating (Continuous Assignment)**

$$
y = \sum_i G_i(x) M_i(x)
$$

**Explanation:**  
Weighted continuous averaging of expert outputs, maintaining differentiability.  
Improves gradient flow compared to discrete (hard) Top-K gating.

---

## **2. Expert Network Formulation**

### **(5) Mixture-of-Experts Layer**

$$
\text{MoE}(x) = \sum_{i \in I_D} w_i M_i(x)
$$

- \( I_D \): set of active experts.  
- \( w_i \): gating probabilities.  
- \( M_i(x) \): output from expert \( i \).

**Statistical Analogy:**

$$
\mathbb{E}[y \mid x] = \sum_i P(i \mid x) \, \mathbb{E}_i[y \mid x]
$$

Thus, MoE realizes a **conditional expectation over a latent expert index** — a *conditional mixture model*.
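
A toy forward pass makes the role of the active set \( I_D \) explicit. The scalar "experts" below are placeholder functions chosen for the example, and `moe_forward` is a hypothetical name.

```python
def moe_forward(x, experts, gates):
    """experts: list of callables M_i; gates: weights w_i (zero for inactive experts)."""
    active = [i for i, w in enumerate(gates) if w > 0.0]  # the set I_D
    # Only active experts are evaluated, which is where the compute saving comes from.
    y = sum(gates[i] * experts[i](x) for i in active)
    return y, active

experts = [lambda x: 2.0 * x, lambda x: x + 1.0, lambda x: -x, lambda x: x * x]
y, active = moe_forward(3.0, experts, [0.75, 0.0, 0.25, 0.0])
print(y, active)  # 0.75 * 6.0 + 0.25 * (-3.0) = 3.75, using experts 0 and 2
```

With dense (soft) gating, every entry of `gates` is positive and the same function reduces to the full weighted average of equation (4).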

---

## **3. Loss and Regularization Mathematics**

### **(6) Load and Importance Balancing**

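**Load and importance definitions** (as in Shazeer et al., 2017, where \( P(x, i) \) is the probability that expert \( i \) stays among the top-\(k\) for input \( x \) when its gating noise is resampled):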

$$
\text{Load}_i(X) = \sum_{x \in X} P(x, i), \quad
\text{Importance}_i(X) = \sum_{x \in X} G_i(x)
$$

**Regularization losses:**

$$
L_{\text{load}} = w_{\text{load}} \cdot \big(\text{CoV}(\text{Load}(X))\big)^2, \quad
L_{\text{importance}} = w_{\text{imp}} \cdot \big(\text{CoV}(\text{Importance}(X))\big)^2
$$

where **CoV** = coefficient of variation \( = \frac{\sigma}{\mu} \).

**Purpose:**  
Statistically reduces variance in expert usage — promoting uniform workload distribution.
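
The importance penalty can be sketched directly from the definitions. The example assumes the population variance in the CoV (implementations may use the sample variance instead) and introduces the helper names `importance` and `cv_squared_loss`.

```python
import statistics

def importance(gate_matrix):
    """Sum the gate weights each expert receives over a batch (one row per input)."""
    return [sum(col) for col in zip(*gate_matrix)]

def cv_squared_loss(values, weight=1.0):
    """weight * CoV(values)^2, with CoV = std / mean (population variance assumed)."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values)
    return weight * var / (mean ** 2)

balanced = [[0.5, 0.5], [0.5, 0.5]]
skewed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
loss_balanced = cv_squared_loss(importance(balanced))
loss_skewed = cv_squared_loss(importance(skewed))
print(loss_balanced, loss_skewed)
```

The balanced batch incurs no penalty, while the batch that sends three of four inputs to the same expert is penalized, so minimizing the term pushes importance toward uniformity.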

---

### **(7) Switch Transformer Balancing Loss**

$$
f_i = \frac{1}{T} \sum_{x \in B} \mathbb{1}\{ \arg\max G(x) = i \}, \quad
Q_i = \frac{1}{T} \sum_{x \in B} G_i(x)
$$

$$
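\text{Loss} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i Q_i
$$

- \( f_i \): fraction of the \( T \) tokens in batch \( B \) whose top-1 expert is \( i \).  
- \( Q_i \): average routing probability assigned to expert \( i \).  
- \( N \): number of experts; \( \alpha \): weight of the auxiliary loss.

**Minimum Condition:** under uniform routing, \( f_i = Q_i = \tfrac{1}{N} \), where the loss equals \( \alpha \) (scaling as in Fedus et al., 2021).  
This forms a **statistical balance regularizer** that encourages an even distribution of tokens across experts.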
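
The balancing term is easy to compute from a batch of routing distributions. The sketch below uses the \( \alpha \cdot N \) scaling of the original Switch Transformer paper (Fedus et al., 2021); other normalizations differ only by a constant factor. The function name and the default `alpha` are assumptions for the example.

```python
def switch_balance_loss(gate_probs, alpha=0.01):
    """gate_probs: one softmax distribution over N experts per token."""
    T = len(gate_probs)
    N = len(gate_probs[0])
    f = [0.0] * N  # fraction of tokens whose top-1 expert is i
    Q = [0.0] * N  # mean routing probability given to expert i
    for probs in gate_probs:
        winner = max(range(N), key=lambda i: probs[i])
        f[winner] += 1.0 / T
        for i in range(N):
            Q[i] += probs[i] / T
    return alpha * N * sum(fi * qi for fi, qi in zip(f, Q))

balanced = switch_balance_loss([[0.6, 0.4], [0.4, 0.6]])
collapsed = switch_balance_loss([[0.9, 0.1], [0.8, 0.2]])
print(balanced, collapsed)
```

When both tokens pick different experts the loss sits at its balanced value of `alpha`; when both collapse onto the same expert it rises, which is the signal the router is trained against.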

---

## **4. Probabilistic Distributions and Noise Models**

| Distribution / Function | Role in MoE |
|---------------|-------------|
| **Gaussian** | Noise injection in gating logits for stochastic exploration. |
| **Softmax** | Converts unbounded scores into categorical probabilities with \( \sum_i G_i(x) = 1 \). |
| **Student-t** | Robust gating when data contain outliers or heavy tails. |
| **Multinomial Probit** | Models latent expert selection with thresholded Gaussian variables; an alternative to softmax (multinomial logit) gating. |

---

## **5. Optimization and Regularization Strategies**

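- **Top-K / Top-P routing:**  
  Top-K keeps a fixed number \( k \) of highest-scoring experts. Top-P instead adds experts in descending order of gating probability until their cumulative probability exceeds \( P \).  
  The number of active experts under Top-P therefore adapts to each input: confident routing activates few experts, uncertain routing activates more.

- **Dropout Regularization:**  
  Randomly disables experts during training to avoid over-reliance on a few of them; this can be loosely interpreted as approximate model averaging over sub-networks.

- **Auxiliary Load Balancing:**  
  Adds penalty terms that reduce the variance of expert usage, which stabilizes optimization.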
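
The Top-P rule from the first item can be sketched as a short selection loop. This is an illustrative version with a hypothetical function name; it stops as soon as the cumulative probability reaches the threshold.

```python
def top_p_experts(probs, p):
    """Return expert indices, best first, until cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return chosen

confident = top_p_experts([0.7, 0.2, 0.05, 0.05], p=0.5)
uncertain = top_p_experts([0.3, 0.3, 0.2, 0.2], p=0.5)
print(confident, uncertain)
```

A peaked routing distribution activates a single expert, while a flatter one activates two for the same threshold, so the sparsity level follows the router's confidence.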

---

## **6. Theoretical and Statistical Insights**

### **(a) Mixture Model Equivalence**

$$
P(y \mid x) = \sum_i P(i \mid x) P(y \mid x, i)
$$

MoE represents a *conditional mixture distribution*, where gating defines the *prior* \( P(i \mid x) \) and experts parameterize the *component likelihoods* \( P(y \mid x, i) \).

---

### **(b) Expectation–Maximization Analogy**

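- **E-step:** Compute posterior responsibilities \( h_i(x, y) \propto G_i(x) \, P(y \mid x, i) \), combining the gating prior with each expert's likelihood.  
- **M-step:** Update expert parameters \( \theta_i \) and gating parameters to maximize the expected log-likelihood under those responsibilities.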

This mirrors EM algorithm updates for latent variable models.

---

### **(c) Statistical Regularity**

Coefficient-of-variation penalties minimize variance of expert usage, analogous to *variance regularization* in survey sampling or stratified estimation.

---

## **7. Efficiency and Computational Scaling**

Expected computational load \( C \) scales linearly with the number of active experts:

$$
\mathbb{E}[C] \propto \mathbb{E}[\#\text{Active Experts}] = k
$$

**Balancing Objective:**  
Minimize \( \text{Var}(\text{tokens per expert}) \) while maintaining high throughput.

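**Quantization:**  
FP8 precision reduces the memory needed to store weights:  
$$
\text{Memory} \propto b_{\text{precision}} \cdot N_{\text{params}}
$$  
where \( b_{\text{precision}} = 8 \) bits per parameter. Note that \( N_{\text{params}} \) counts all experts that must stay resident, not only the \( k \) active ones; sparsity reduces compute per token rather than weight storage.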
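
A back-of-the-envelope calculation illustrates the scaling. The parameter counts below are the approximate publicly reported figures for Mixtral 8×7B (about 46.7B total, about 12.9B used per token) and are included only as an example; the helper name is hypothetical and the estimate ignores activations, optimizer state, and caches.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for raw weights only, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 46.7e9   # all experts, which must be stored
ACTIVE_PARAMS = 12.9e9  # parameters actually used per token

fp16_total = weight_memory_gb(TOTAL_PARAMS, 16)
fp8_total = weight_memory_gb(TOTAL_PARAMS, 8)
fp8_active = weight_memory_gb(ACTIVE_PARAMS, 8)
print(fp16_total, fp8_total, fp8_active)
```

Halving the bit width halves the storage for the same parameter count, and the gap between the total and active figures shows why weight storage for a sparse model is governed by all experts while per-token compute follows the active subset.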

---

## **8. Probabilistic Interpretation of MoE Behavior**

Overall, the MoE function approximates:

$$
f(x) = \sum_i G_i(x) f_i(x)
$$

This is a **conditional expectation** of expert outputs under mixture prior \( G_i(x) \).  
Each \( f_i(x) \) models a local region in feature space, forming a **piecewise conditional regression** model.

---

## **Summary Table**

| Concept | Mathematical Form | Purpose |
|----------|------------------|----------|
| Linear Gating | \( G_i(x)=\text{softmax}(\text{TopK}(g(x)+R,k))_i \) | Sparse expert selection |
| Cosine Gating | \( G(x)=\text{TopK}(\text{softmax}(\frac{E^T W x}{\tau\|Wx\|\|E\|})) \) | Similarity-based routing |
| Probabilistic Gating | \( G_j(x)=\frac{\beta_j D(x\mid\nu_j)}{\sum_i\beta_i D(x\mid\nu_i)} \) | Statistical mixture weighting |
| MoE Output | \( \text{MoE}(x)=\sum_i w_i M_i(x) \) | Weighted expert ensemble |
| Load Balancing Loss | \( L=w(\text{CoV}(\text{Load}))^2 \) | Uniform expert usage |
| Switch Loss | \( \alpha N \sum_i f_iQ_i \) | Balanced routing |
| Top-K / Top-P | Keep the \( k \) best experts / keep experts up to cumulative probability \( P \) | Fixed / adaptive sparsity |
| Dropout | Random expert masking | Regularization |

---

## **Mathematical Essence**

The Mixture-of-Experts framework can be summarized as the minimization problem:

$$
\min_{\{\theta_i\}, \phi} \,
\mathbb{E}_{(x, y) \sim \mathcal{D}}
\bigg[
\mathcal{L}\Big(
y,
\sum_{i=1}^K G_i(x; \phi) M_i(x; \theta_i)
\Big)
\bigg]
$$

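subject to balancing constraints on \( G_i(x) \), which in practice enter as auxiliary penalty terms (such as the CoV and Switch losses above) rather than hard constraints.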

This formulation interprets MoE as a **conditional probabilistic ensemble**, combining:

1. **Statistical mixtures** (mixture models, EM analogies),  
2. **Optimization sparsity** (Top-K routing), and  
3. **Variance regularization** (CoV-based balancing).

Hence, Mu & Lin (2025) position MoE as a mathematically principled bridge between **mixture modeling** and **efficient neural computation** — unifying probabilistic reasoning, optimization theory, and large-scale architecture design.


# **Mixture of Experts (MoE) — Roadmap Summary**

---

## **1. Basics of MoE**

| **Subtopic** | **Representative Works** |
|---------------|---------------------------|
| **Gating Function** | Switch Transformer (2021), V-MoE (2021), RMoE (2021), M3ViT (2022), GLaM (2021), Shazeer et al. (2017), GShard (2021), Mixtral of Experts (2024), GMoE (2023), Chi et al. (2023), Nguyen et al. (2024), Xu et al. (2024), softMoE (2024), Geweke et al. (2024) |
| **Expert Network** | Xu et al. (2024), softMoE (2024), Switch Transformer (2021), V-MoE (2021), Shazeer et al. (2017), GShard (2021), Chen et al. (2023), Li et al. (2023), SwitchHead (2023), MoA (2022), MoH (2023), pMoE (2023), Pavlitskaya et al. (2023), Yi et al. (2023), Zhang et al. (2023, 2024), Gross et al. (2024) |
| **Routing Strategy** | V-MoE (2021), pMoE (2023), Uni-Perceiver-MoE (2023), MaskMoE (2023), Pedicir et al. (2023), Uni-MoE (2023), Kudugunta et al. (2023), Yi et al. (2023), Shi et al. (2023) |
| **Training Strategy** | Switch Transformer (2021), V-MoE (2021), Shazeer et al. (2017), RMoE (2021), Huang et al. (2023), HMoE (2023), Irsoy et al. (2023), Faster-MoE (2024) |
| **System Design** | Switch Transformer (2021), M3ViT (2022), Tutel (2022), softMoE (2024), Faster-MoE (2024), Edge-MoE (2024), DeepSpeed-MoE (2021), DeepSeek-V3 (2025), Singh et al. (2024), Yao et al. (2024), He et al. (2024) |

---

## **2. Algorithms**

| **Sub-domain** | **Representative Works** |
|-----------------|---------------------------|
| **Continual Learning** | Yu et al. (2023), Zhang et al. (2023), Zhou et al. (2023), Lifelong-MoE (2023), Lee et al. (2023), SEED (2022), Evolve (2022), MoTE (2023), Park et al. (2023), Li et al. (2023), Hihn et al. (2023), Le et al. (2023), Lee et al. (2023), PMoE (2023), Wang et al. (2023), Chen et al. (2023), Wang et al. (2023), Aljundi et al. (2022), Wang et al. (2023), Doan et al. (2023) |
| **Meta-Learning** | Meta-DMoE (2023), MoE-NPs (2023), RaMoE (2023), Liu et al. (2023), Guo et al. (2023), Zhou et al. (2023), MixER (2023) |
| **Multi-Task Learning** | MOOR (2022), Park et al. (2023), MLoRE (2023), WEMoE (2023), MoSE (2023), MMoEEx (2023), Chen et al. (2023), TaskExpert (2023), Gupta et al. (2023), DSelect-k (2023), Mod-Squad (2023), MoDE (2023), M3oE (2023), MoME (2023), TI-Expert (2023), Ma et al. (2023), Hou & Cao et al. (2023), CMoIE (2023), Sodhani et al. (2023), AdaMV-MoE (2023), Tang et al. (2023), M3ViT (2022), Louizos et al. (2017), Jacobs et al. (1991) |
| **Reinforcement Learning** | Ren et al. (2023), Gimelfarb et al. (2023), Willi et al. (2023), MENTOR (2023), MMICRL (2023), Gupta et al. (2023), Samejima et al. (2023), Van et al. (2023), MACE (2023), Kumar et al. (2023), Peng et al. (2023), Akrour et al. (2023), MVE (2023), Germ (2023), SMoSE (2023), Obando et al. (2023), Takahashi et al. (2023), Mulling et al. (2023), Li et al. (2023), Ewerton et al. (2023), Zhou et al. (2023), Freymuth et al. (2023), Prasad et al. (2023), MMRL (2023), TERL (2023) |
| **Federated Learning** | Peterson et al. (2023), Zec et al. (2023), Pye et al. (2023), Reisser et al. (2023), Guo et al. (2023), Isaksson et al. (2023), Ghosh et al. (2023), Tran et al. (2023), Dun et al. (2023), Heinbaught et al. (2023), Su et al. (2023), Zeng et al. (2023) |

---

## **3. Theory**

| **Area** | **Representative Works** |
|-----------|---------------------------|
| **Theoretical Analyses of MoE** | Nguyen et al. (2024), Chen et al. (2024), Li et al. (2024), Chowdhury et al. (2024), Jiang et al. (2024), Zeevi et al. (2024), Mendes et al. (2024), Ho et al. (2024), Fung et al. (2024) |

---

## **4. Applications**

### **A. Computer Vision (CV)**

| **Sub-Area** | **Representative Works** |
|---------------|---------------------------|
| **Image Classification** | V-MoE (2021), Videau et al. (2023), ViMoE (2023), Royer et al. (2023), Clip-MoE (2023), SoftMoE (2024), DeepME (2024), Jiang et al. (2024), Nguyen et al. (2024) |
| **Object Detection** | MoCaE (2023), Wang et al. (2023), Damex (2023), Feng et al. (2023) |
| **Semantic Segmentation** | Pavlitskaya et al. (2023), Zhu et al. (2023), DeepMoE (2023), Swin2-MoSS (2023) |
| **Image Generation** | RAPHAEL (2023), MEGAN (2023), MoA (2022), Text2Human (2023) |

---

### **B. Natural Language Processing (NLP)**

| **Sub-Area** | **Representative Works** |
|---------------|---------------------------|
| **Natural Language Understanding (NLU)** | GLaM (2021), MoE-LPR (2023), MoE-SLU (2023), MT-TaG (2023), MoPE-BA (2023) |
| **Natural Language Generation (NLG)** | Text Generation: Chai et al. (2023), RetGen (2023), LogicMoE (2023), QMoE (2023) |
| **Machine Translation** | Shazeer et al. (2017), GShard (2021), Team et al. (2023), NLLB Team et al. (2023), Huang et al. (2023) |
| **Multimodal Fusion** | LLaVA-MoLE (2023), LIMoE (2023), Sun et al. (2023) |

---

## **5. Overall Conceptual Structure**
```
Mixture of Experts (MoE)
│
├── Basics
│   ├── Gating Functions
│   ├── Expert Networks
│   ├── Routing Strategies
│   ├── Training Strategies
│   └── System Designs
│
├── Algorithms
│   ├── Continual Learning
│   ├── Meta Learning
│   ├── Multi-Task Learning
│   ├── Reinforcement Learning
│   └── Federated Learning
│
├── Theory
│
└── Applications
    ├── Computer Vision (CV)
    │   ├── Image Classification
    │   ├── Object Detection
    │   ├── Semantic Segmentation
    │   └── Image Generation
    └── Natural Language Processing (NLP)
        ├── NLU
        ├── NLG
        ├── Machine Translation
        └── Multimodal Fusion
```

---

## **Summary Insight**

This roadmap encapsulates the entire **Mixture-of-Experts (MoE)** research landscape:

- **Core mechanics:** design of gating, experts, routing, and training.  
- **Algorithmic evolution:** embedding MoE into multiple ML paradigms (continual, meta, multi-task, reinforcement, federated learning).  
- **Theoretical foundations:** formalizing MoE’s statistical and optimization principles.  
- **Applications:** integrating MoE into vision, language, and multimodal systems for scalable and efficient AI.


# **Table: Critical Review of Key Problems, Limitations, and Proposed Solutions**  
**Paper:** *“A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications”* (Mu & Lin, 2025)

| **#** | **Identified Problem / Research Gap** | **How It Limits Prior Work** | **How This Paper Addresses It** |
|:--:|:-----------------------------------------|:------------------------------|:--------------------------------|
| **1** | **Lack of a comprehensive, up-to-date survey on Mixture-of-Experts (MoE) models** | Earlier surveys were either a decade old or narrowly focused on basic MoE concepts, omitting recent algorithmic and theoretical developments. | Provides the first **systematic and integrative survey** covering MoE design, algorithms, theory, and applications from **2017–2025**, consolidating fragmented research. |
| **2** | **Fragmented understanding of MoE’s core architecture (gating, routing, training, and system design)** | Previous work discussed components in isolation without unifying their mathematical and engineering principles. | Introduces a **modular taxonomy** of MoE architecture—defining and comparing gating functions, expert networks, routing levels, loss functions, and computational systems. |
| **3** | **Insufficient coverage of MoE integration in mainstream machine-learning paradigms (CL, meta-learning, MTL, RL, FL)** | Limited research connected MoE to diverse learning frameworks, leading to a lack of generalization and adaptability in practical settings. | Conducts a **cross-paradigm synthesis**, demonstrating how MoE enhances efficiency, adaptability, and knowledge retention across these five major learning domains. |
| **4** | **Lack of theoretical analysis on MoE’s expressivity, convergence, and optimization behavior** | Prior works were largely empirical, leaving open questions about the mathematical properties and guarantees of MoE models. | Summarizes and formalizes recent **analytical studies** (Nguyen, Jiang, Li, etc.) on MoE’s function approximation, generalization bounds, and probabilistic interpretation. |
| **5** | **Incomplete discussion of MoE’s computational and system-level challenges (load imbalance, synchronization overhead, memory cost)** | Earlier models such as Switch Transformer and GShard reported efficiency gains but lacked detailed system-level optimization analyses. | Surveys **system engineering advances**—including Tutel, DeepSpeed-MoE, and Edge-MoE—highlighting **parallelism, memory scheduling, and communication-efficient routing.** |
| **6** | **Limited exploration of MoE’s real-world applications in vision and language** | Applications in CV and NLP were scattered across individual studies without unified comparison or performance evaluation. | Provides an **application taxonomy** linking MoE variants to subdomains (e.g., classification, detection, translation, multimodal fusion), showing empirical advantages. |
| **7** | **Absence of discussion on interpretability and specialization of experts** | Previous studies focused on efficiency metrics only, neglecting the interpretive potential of expert specialization. | Highlights **interpretability as an emerging frontier**, showing how expert activation patterns can reveal data-specific learning behaviors. |
| **8** | **No unified framework connecting algorithmic design, theoretical foundation, and deployment feasibility** | Research progress occurred in silos—algorithmic innovations were rarely tied to theory or implementation constraints. | Constructs a **unified roadmap (Figure 1)** linking MoE fundamentals, paradigms, and applications to encourage **integrated future research.** |

---

## **Summary Insight**

This paper closes a **significant knowledge gap** by transforming the Mixture-of-Experts (MoE) literature from a scattered collection of architectural and algorithmic papers into a **cohesive, multi-dimensional framework**.  
It synthesizes **algorithmic diversity**, **theoretical grounding**, and **system scalability** into one unified vision—redefining MoE as a **general-purpose paradigm** for building **efficient, interpretable, and modular AI systems.**


# **Table: Related Work References and Their Connection to the Present Study**  
**Paper:** *“A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications”* (Mu & Lin, 2025)  
**arXiv:** 2503.07137v3

| **Author(s)** | **Year** | **Title** | **Venue / Source** | **Connection to This Paper** |
|----------------|-----------|------------|---------------------|------------------------------|
| **Shazeer, N. et al.** | 2017 | *Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer* | ICLR / arXiv | The seminal work introducing the modern MoE architecture, establishing the principle of sparse expert activation. Forms the conceptual baseline for this survey. |
| **Fedus, W. et al.** | 2021 | *Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity* | Google Research / arXiv | Demonstrates MoE scalability and efficiency in large language models. Serves as a key practical milestone motivating the need for a systematic MoE review. |
| **Lepikhin, D. et al.** | 2021 | *GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding* | ICLR | Introduces large-scale distributed training with MoE routing; cited as an important step in system-level scaling. |
| **Riquelme, C. et al.** | 2021 | *Scaling Vision with Sparse Mixture of Experts* (V-MoE) | NeurIPS | Extends MoE from NLP to computer vision, establishing the multi-domain applicability that this survey seeks to unify. |
| **Du, N. et al.** | 2022 | *GLaM: Efficient Scaling of Language Models with Mixture-of-Experts* | ICML | Presents the GLaM model as an efficient MoE variant; cited as an example of high-performance sparse computation. |
| **Liang, H. et al.** | 2022 | *M³ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design* | NeurIPS | Applies MoE to multi-task vision learning with hardware co-design; referenced in the paper's motivation for multi-domain adaptability. |
| **Rajbhandari, S. et al.** | 2022 | *DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale* | ICML | A system-level MoE optimization study; motivates the survey's section on computational and distributed frameworks. |
| **He, J. et al.** | 2024 | *Edge-MoE: Efficient Mixture-of-Experts for Edge AI Systems* | IEEE Transactions on Neural Networks and Learning Systems | Addresses deployment constraints and load imbalance; supports the authors’ discussion on system design challenges. |
| **Zoph, B. et al.** | 2023 | *Mixture-of-Experts in Multilingual Machine Translation* | ACL / arXiv | Illustrates MoE’s success in multilingual and cross-task learning; motivates inclusion of MoE in multi-task algorithmic review. |
| **Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E.** | 1991 | *Adaptive Mixtures of Local Experts* | Neural Computation | The original theoretical foundation for MoE; provides historical context for the “divide-and-conquer” paradigm discussed. |
| **Nguyen, T. et al.** | 2023 | *Generalized Mixture-of-Experts: Theoretical Analysis and Bounds* | ICML | Provides analytical backing for MoE’s expressivity and convergence; cited in the theory section. |
| **Wang, Z. et al.** | 2023 | *Hierarchical Mixture-of-Experts for Efficient Transformer Adaptation* | ICLR | Empirical evidence of MoE hierarchy benefits; informs the training and routing subsections of the paper. |
| **Pavlitskaya, N. et al.** | 2023 | *DeepMoE for Image Segmentation* | CVPR | Applied MoE in computer vision segmentation tasks; referenced as a representative CV application. |
| **Huang, Z. et al.** | 2023 | *Improving Transformer Routing with Adaptive Top-P MoE* | NeurIPS | Introduces probabilistic routing techniques; directly referenced in the section on training strategies. |
| **Ma, J. et al.** | 2018 | *Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts* (MMoE) | KDD | Key study on MoE in multi-task setups; underpins the survey's discussion of algorithmic extensions. |
| **Peterson, C. et al.** | 2023 | *Federated Mixture-of-Experts for Distributed Learning* | IEEE Access | Early example of MoE integration with federated systems; informs the federated learning subsection. |
| **Aljundi, R. et al.** | 2022 | *SEED: Expert Diversity in Continual Learning* | ECCV | Demonstrates how expert specialization mitigates catastrophic forgetting; motivates continual learning inclusion. |
| **Zhou, X. et al.** | 2023 | *MixER: Mixture-of-Experts for Meta-Learning Adaptation* | ICML | Introduces expert-based meta-learning framework; supports the survey’s meta-learning section. |
| **Nguyen, T., Fung, K., Zeevi, A. et al.** | 2024 | *Theoretical Properties of Mixture-of-Experts: Approximation and Generalization* | JMLR | Core theoretical analysis referenced in the paper’s “Theory” section, supporting mathematical understanding of MoE. |
| **Zhang, Z. et al.** | 2023 | *Emergent Modularity in Pre-trained Transformers* | Findings of ACL | Provides empirical justification for introducing MoE into Transformer FFN layers; discussed in design rationale. |

---

## **Summary**

The above references form the **intellectual backbone** of the Mixture-of-Experts (MoE) survey:

- **Historical foundation:** Jacobs et al. (1991)  
- **Architectural milestones:** Shazeer (2017); Fedus (2021); Lepikhin (2021)  
- **System optimization:** DeepSpeed-MoE (2022); Edge-MoE (2024)  
- **Algorithmic extensions:** Multi-task, continual, and meta-learning frameworks  
- **Theoretical formalization:** Nguyen (2023–2024); Fung et al. (2024)

Together, these works justify the survey's mission to **unify architectural, algorithmic, theoretical, and applied perspectives**, establishing Mixture-of-Experts as a foundational paradigm for scalable, interpretable, and efficient AI.
