Data Selection for Language Models via Importance Resampling, Sang Michael Xie+, N/A, arXiv'23 #1189

AkihikoWatanabe · 2023-12-16T22:57:48Z

URL

https://arxiv.org/abs/2302.03169

Affiliations

Sang Michael Xie, N/A
Shibani Santurkar, N/A
Tengyu Ma, N/A
Percy Liang, N/A

Abstract

Selecting a suitable pretraining dataset is crucial for both general-domain (e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We formalize this problem as selecting a subset of a large raw unlabeled dataset to match a desired target distribution given unlabeled target samples. Due to the scale and dimensionality of the raw text data, existing methods use simple heuristics or require human experts to manually curate data. Instead, we extend the classic importance resampling approach used in low-dimensions for LM data selection. We propose Data Selection with Importance Resampling (DSIR), an efficient and scalable framework that estimates importance weights in a reduced feature space for tractability and selects data with importance resampling according to these weights. We instantiate the DSIR framework with hashed n-gram features for efficiency, enabling the selection of 100M documents from the full Pile dataset in 4.5 hours. To measure whether hashed n-gram features preserve the aspects of the data that are relevant to the target, we define KL reduction, a data metric that measures the proximity between the selected pretraining data and the target on some feature space. Across 8 data selection methods (including expert selection), KL reduction on hashed n-gram features highly correlates with average downstream accuracy (r=0.82). When selecting data for continued pretraining on a specific domain, DSIR performs comparably to expert curation across 8 target distributions. When pretraining general-domain models (target is Wikipedia and books), DSIR improves over random selection and heuristic filtering baselines by 2-2.5% on the GLUE benchmark. Code is available at https://github.com/p-lambda/dsir.
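
The selection procedure described in the abstract (hashed n-gram features, importance weights in that feature space, then resampling by those weights) can be sketched roughly as follows. This is a minimal illustration, not the code from the p-lambda/dsir repository. The bucket count, the unigram-plus-bigram choice, the add-one smoothing, the CRC32 hash, and the use of the Gumbel top-k trick for sampling without replacement are all assumptions made for this sketch, and every function name here is hypothetical.

```python
import zlib
import numpy as np

NUM_BUCKETS = 4096  # hypothetical bucket count chosen for this sketch


def hashed_ngram_counts(text, num_buckets=NUM_BUCKETS):
    """Count unigrams and bigrams after hashing each into a fixed set of buckets."""
    tokens = text.lower().split()
    ngrams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    counts = np.zeros(num_buckets)
    for gram in ngrams:
        # zlib.crc32 is deterministic across runs, unlike Python's salted hash().
        counts[zlib.crc32(gram.encode("utf-8")) % num_buckets] += 1
    return counts


def fit_bucket_distribution(docs, num_buckets=NUM_BUCKETS, smoothing=1.0):
    """Fit a smoothed categorical distribution over hash buckets (assumed model)."""
    total = np.full(num_buckets, smoothing)
    for doc in docs:
        total += hashed_ngram_counts(doc, num_buckets)
    return total / total.sum()


def log_importance_weights(raw_docs, target_docs, num_buckets=NUM_BUCKETS):
    """Per-document log(p_target(features) / p_raw(features)) under the bucket models."""
    log_ratio = np.log(fit_bucket_distribution(target_docs, num_buckets)) - np.log(
        fit_bucket_distribution(raw_docs, num_buckets)
    )
    return np.array(
        [hashed_ngram_counts(doc, num_buckets) @ log_ratio for doc in raw_docs]
    )


def importance_resample(log_weights, k, seed=0):
    """Draw k distinct indices with probability driven by the weights (Gumbel top-k)."""
    rng = np.random.default_rng(seed)
    perturbed = log_weights + rng.gumbel(size=len(log_weights))
    return np.argsort(-perturbed)[:k]


# Toy usage: pick 2 of 4 raw documents to resemble the target samples.
raw = [
    "the cat sat on the mat",
    "stock prices fell sharply today",
    "the dog sat on the rug",
    "markets rallied after the report",
]
target = ["the cat and the dog sat", "the dog sat on the mat"]
weights = log_importance_weights(raw, target)
print([raw[i] for i in importance_resample(weights, k=2)])
```

Working in log space keeps the weights numerically stable for long documents, and adding Gumbel noise before taking the top k is one common way to sample without replacement in proportion to the weights.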

Translation (by gpt-3.5-turbo)

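Selecting an appropriate pretraining dataset is important for both general-domain (e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs).
We formalize this problem as selecting, from a large raw unlabeled dataset, a subset that matches a desired target distribution.
Because of the scale and dimensionality of raw text data, existing methods either use simple heuristics or require human experts to curate the data manually.
Instead, we extend the classic importance resampling approach, which is used in low dimensions, to LM data selection.
We propose Data Selection with Importance Resampling (DSIR), an efficient and scalable framework that estimates importance weights in a reduced feature space and selects data by importance resampling according to those weights.
For efficiency, we instantiate the DSIR framework with hashed n-gram features, which makes it possible to select 100M documents from the full Pile dataset in 4.5 hours.
To measure whether hashed n-gram features preserve the aspects of the data that are relevant to the target, we define KL reduction, a data metric that measures the proximity between the selected pretraining data and the target.
Across 8 data selection methods, including expert selection, KL reduction on hashed n-gram features correlates highly with average downstream accuracy (r=0.82).
When selecting data for continued pretraining on a specific domain, DSIR performs comparably to expert curation across 8 target distributions.
When pretraining general-domain models (the target is Wikipedia and books), DSIR improves over random selection and heuristic filtering baselines by 2-2.5% on the GLUE benchmark.
Code is available at https://github.com/p-lambda/dsir.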

Summary (by gpt-3.5-turbo)

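Selecting an appropriate pretraining dataset is important for language model performance. Existing methods rely on heuristics or manual curation, whereas this work proposes DSIR, a data selection framework based on importance resampling. DSIR is efficient and scalable, and the paper also defines KL reduction, a data metric that measures the proximity between the selected data and the target. In the experiments, DSIR performs comparably to expert curation when selecting data for continued pretraining on a specific domain, and improves over random selection and heuristic filtering baselines by 2-2.5% on GLUE when pretraining general-domain models.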
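
The abstract describes KL reduction only as a metric of proximity between the selected data and the target on some feature space and does not give the formula. One plausible reading, assumed here purely for illustration, is the drop in KL divergence from the target when moving from the raw data to the selected data, computed on hashed n-gram bucket distributions. The function names, bucket count, and smoothing below are hypothetical.

```python
import zlib
import numpy as np


def bucket_distribution(docs, num_buckets=4096, smoothing=1.0):
    """Smoothed distribution over hashed unigram and bigram buckets (assumed features)."""
    counts = np.full(num_buckets, smoothing)
    for doc in docs:
        tokens = doc.lower().split()
        for gram in tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]:
            counts[zlib.crc32(gram.encode("utf-8")) % num_buckets] += 1
    return counts / counts.sum()


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive distributions of equal length."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


def kl_reduction(target_docs, raw_docs, selected_docs, num_buckets=4096):
    """Assumed definition: KL(target || raw) - KL(target || selected)."""
    p_target = bucket_distribution(target_docs, num_buckets)
    p_raw = bucket_distribution(raw_docs, num_buckets)
    p_selected = bucket_distribution(selected_docs, num_buckets)
    return kl_divergence(p_target, p_raw) - kl_divergence(p_target, p_selected)


# Toy usage: a selection that resembles the target should score higher.
raw = [
    "the cat sat on the mat",
    "stock prices fell sharply today",
    "the dog sat on the rug",
    "markets rallied after the report",
]
target = ["the cat and the dog sat", "the dog sat on the mat"]
print(kl_reduction(target, raw, [raw[0], raw[2]]))
print(kl_reduction(target, raw, [raw[1], raw[3]]))
```

Under this reading, a larger value means the selected subset sits closer to the target than the raw pool does, and selecting the raw pool unchanged gives zero.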

AkihikoWatanabe added the Pocket label Dec 16, 2023

AkihikoWatanabe changed the title from "あ" to "Data Selection for Language Models via Importance Resampling, Sang Michael Xie+, N/A, arXiv'23" Dec 16, 2023
