Scaling Laws for Fine-Grained Mixture of Experts, Jakub Krajewski+, N/A, arXiv'24 #1230

AkihikoWatanabe · 2024-02-15T10:53:29Z

URL

Mixture of Experts (MoE) models have emerged as a primary solution for reducing the computational cost of Large Language Models. In this work, we analyze their scaling properties, incorporating an expanded range of variables. Specifically, we introduce a new hyperparameter, granularity, whose adjustment enables precise control over the size of the experts. Building on this, we establish scaling laws for fine-grained MoE, taking into account the number of training tokens, model size, and granularity. Leveraging these laws, we derive the optimal training configuration for a given computational budget. Our findings not only show that MoE models consistently outperform dense Transformers but also highlight that the efficiency gap between dense and MoE models widens as we scale up the model size and training budget. Furthermore, we demonstrate that the common practice of setting the size of experts in MoE to mirror the feed-forward layer is not optimal at almost any computational budget.
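To make the granularity hyperparameter concrete, here is a minimal sketch of how it could change an MoE layer's shape. The abstract only says that granularity controls expert size; the sketch assumes granularity G is the ratio of the dense feed-forward hidden size to the expert hidden size, and that the number of experts and the number of experts routed per token both scale with G so that total and active parameters stay fixed. The names `MoEConfig` and `with_granularity` are hypothetical, not taken from the paper's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical description of one MoE layer at granularity 1."""
    d_model: int    # model (residual stream) width
    d_ff: int       # hidden size of the dense feed-forward layer
    n_experts: int  # number of experts at granularity 1
    top_k: int      # experts activated per token at granularity 1


def with_granularity(base: MoEConfig, g: int) -> dict:
    """Shape of the layer at granularity g, under the assumptions stated above."""
    if g < 1 or base.d_ff % g != 0:
        raise ValueError("g must be a positive divisor of d_ff")
    d_expert = base.d_ff // g           # each expert becomes g times smaller
    n_experts = base.n_experts * g      # assumption: g times more experts
    top_k = base.top_k * g              # assumption: g times more experts per token
    # Two weight matrices per expert (d_model x d_expert and back); biases ignored.
    params_per_expert = 2 * base.d_model * d_expert
    return {
        "d_expert": d_expert,
        "n_experts": n_experts,
        "top_k": top_k,
        "total_params": n_experts * params_per_expert,
        "active_params": top_k * params_per_expert,
    }


base = MoEConfig(d_model=512, d_ff=2048, n_experts=8, top_k=1)
coarse = with_granularity(base, 1)  # experts mirror the feed-forward layer
fine = with_granularity(base, 4)    # 4x smaller experts, 4x as many of them
```

Under these assumptions, `coarse` and `fine` have the same total and active parameter counts and differ only in how those parameters are split across experts. The setting that the abstract reports as suboptimal corresponds to `g = 1`.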

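Translation: Mixture of Experts (MoE) models have emerged as a leading approach to reducing the computational cost of large language models. This study analyzes their scaling properties over a broader set of variables. Specifically, it introduces "granularity", a new hyperparameter that gives precise control over expert size. On that basis, it establishes scaling laws for fine-grained MoE that account for the number of training tokens, model size, and granularity. Using these laws, it derives the optimal training configuration for a given compute budget. The results show that MoE models consistently outperform dense Transformers and that the efficiency gap between dense and MoE models widens as model size and training budget are scaled up. They also show that the common practice of setting the expert size equal to the feed-forward layer is not optimal at almost any compute budget.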

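Summary: This study analyzes the scaling properties of Mixture of Experts (MoE) models and introduces a new hyperparameter, granularity, as a way to reduce computational cost. It shows that MoE models outperform dense models and that the gap widens as model size and training budget are scaled up. It also shows that the common practice of setting the expert size equal to the feed-forward layer is not optimal.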
