Chain of thought prompting elicits reasoning in large language models, Wei+, Google Research, arXiv'22 #551

AkihikoWatanabe · 2023-04-27T01:49:40Z

https://arxiv.org/abs/2201.11903

AkihikoWatanabe · 2023-04-27T01:50:21Z

The paper that proposed Chain-of-Thought (CoT). Worth keeping in mind that CoT yields little benefit for models with fewer than roughly 100B parameters.

AkihikoWatanabe · 2023-05-05T06:52:28Z

Prior work addressed poor performance on reasoning tasks by explicitly creating intermediate steps and finetuning a pre-trained model on them. However, preparing a large-scale dataset with high-quality rationales for finetuning is very costly.
Few-shot prompting is a natural way around this cost, but it performs poorly on tasks that require reasoning. Chain-of-thought prompting was proposed as a method that combines the strengths of both approaches.
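The difference between standard few-shot prompting and chain-of-thought prompting can be sketched as a difference in how each exemplar is rendered. The following is a minimal sketch, not the authors' code: the `Exemplar` class, the helper names, the `Q:`/`A:` layout, and the apple problem are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str
    rationale: str  # intermediate reasoning steps (the "chain of thought")
    answer: str


def format_exemplar(ex: Exemplar, use_cot: bool) -> str:
    """Render one few-shot exemplar, with or without its rationale."""
    if use_cot:
        return f"Q: {ex.question}\nA: {ex.rationale} The answer is {ex.answer}."
    return f"Q: {ex.question}\nA: The answer is {ex.answer}."


def build_prompt(exemplars: list[Exemplar], question: str, use_cot: bool) -> str:
    """Concatenate the exemplars and append the unanswered test question."""
    blocks = [format_exemplar(ex, use_cot) for ex in exemplars]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Hypothetical exemplar and test question, written for this sketch only.
demo = Exemplar(
    question="Tom has 3 apples and buys 2 bags with 4 apples each. "
             "How many apples does he have?",
    rationale="Tom starts with 3 apples. 2 bags of 4 apples is 8 apples. 3 + 8 = 11.",
    answer="11",
)
test_question = "A shelf holds 6 books. Mia adds 3 boxes of 5 books each. How many books are there?"

standard_prompt = build_prompt([demo], test_question, use_cot=False)
cot_prompt = build_prompt([demo], test_question, use_cot=True)
print(cot_prompt)
```

Both prompts need no finetuning data; the only change is that the CoT exemplar shows the intermediate steps before the final answer.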

AkihikoWatanabe · 2023-05-05T07:30:07Z

Experimental results with CoT

The following benchmarks were used:

math word problems: GSM8K, SVAMP, ASDiv, AQuA, MAWPS
commonsense reasoning: CSQA, StrategyQA, BIG-bench (Date, Sports), SayCan
symbolic reasoning: Last Letter Concatenation, Coin Flip
- Last Letter Concatenation: concatenate the last letter of each word in a name ("Amy Brown" -> "yn")
- Coin Flip: given a description of people flipping or not flipping a coin, the model is asked whether the coin is heads up
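The ground-truth answers for the two symbolic tasks can be computed with a few lines of code. This is a minimal sketch; the function names and the boolean encoding of flips are assumptions made for illustration, and the coin is assumed to start heads up.

```python
def last_letter_concat(name: str) -> str:
    """Concatenate the last letter of each whitespace-separated word."""
    return "".join(word[-1] for word in name.split())


def coin_is_heads_up(flips: list[bool], starts_heads_up: bool = True) -> bool:
    """Track the coin state; each True entry means that person flipped the coin."""
    heads_up = starts_heads_up
    for flipped in flips:
        if flipped:
            heads_up = not heads_up
    return heads_up


print(last_letter_concat("Amy Brown"))    # yn
print(coin_is_heads_up([True, False]))    # False: one flip turns it tails up
```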

math word problem benchmark

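Performance can improve sharply as model size increases (an emergent ability).
- Put differently, CoT has no positive impact on performance for models below roughly 100B parameters
- This is because small models tend to generate incorrect chains of thought
The more complex the problem, the larger the benefit from CoT.
- On GSM8K, which had the lowest baseline performance, performance roughly doubled, whereas on problems solvable with a single reasoning step, such as SingleOp in MAWPS, the gain was small
Compared with the previous SoTA obtained by finetuning task-specific models, CoT is comparable or better.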

Ablation Study

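To answer the question of whether other types of prompting could yield the same gains as CoT, three prompting variants were run and compared with CoT:

Equation only: a prompt in which the model writes an equation before giving the answer
- Tests the hypothesis that performance improves simply because equations appear in the prompt
- => On GSM8K, equation only performed poorly. This is attributed to the fact that the reasoning steps cannot be expressed with equations alone
Variable compute only: a prompt in which the model outputs only a sequence of dots (...)
- Tests the hypothesis that CoT helps because it lets the model spend more computation (intermediate tokens) on harder problems
- To separate the effect of variable computation from that of CoT, the model is prompted to output only a sequence of dots (...)
- => Performance was about the same as the baseline, which indicates that variable computation by itself does not drive the improvement.
Chain of thought after answer: prompting in which the CoT is output after the answer
- Tests the hypothesis that CoT merely makes relevant knowledge acquired during pretraining easier to access
- => Performance was about the same as the baseline, so merely activating knowledge does not improve performance.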
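The ablation variants differ only in what the exemplar's answer text contains. The sketch below renders one exemplar in each format; the function name, the dictionary keys, and the example strings are hypothetical, and setting the number of dots to the character length of the equation is an assumption of this sketch rather than something stated in these notes.

```python
def ablation_targets(equation: str, rationale: str, answer: str) -> dict[str, str]:
    """Render the answer text of one exemplar under each prompting variant."""
    final = f"The answer is {answer}."
    return {
        "standard": final,
        "chain_of_thought": f"{rationale} {final}",
        # Only the equation precedes the answer, no natural-language steps.
        "equation_only": f"{equation} {final}",
        # Assumption: one dot per character of the equation.
        "variable_compute_only": f"{'.' * len(equation)} {final}",
        # The rationale comes after the answer, so it cannot inform it.
        "cot_after_answer": f"{final} {rationale}",
    }


# Hypothetical exemplar content, written for this sketch only.
equation = "3 + 2 * 4 = 11"
targets = ablation_targets(
    equation=equation,
    rationale="Tom starts with 3 apples. 2 bags of 4 apples is 8 apples. 3 + 8 = 11.",
    answer="11",
)
for name, text in targets.items():
    print(f"{name}: {text}")
```

Placing the variants side by side shows what each one isolates: the presence of an equation, the number of intermediate tokens, or the position of the rationale relative to the answer.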

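Robustness of CoT

Different human annotators wrote the CoTs, and the study measured how much performance varies across these annotator-written CoT prompts and across different sets of exemplars. CoT outperformed standard prompting in every case, which indicates that the benefit of CoT does not depend on a particular linguistic style.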

commonsense reasoning

CoT outperformed standard prompting on all datasets.

Symbolic Reasoning

Two test sets were prepared: in-domain and out-of-domain (OOD). In the former, the number of reasoning steps required is the same as in the few-shot exemplars; in the latter, more steps are required than in the exemplars.
CoT outperforms standard prompting. In particular, on the OOD test set standard prompting did not improve even as the model was scaled up, whereas CoT achieved larger gains. This suggests that CoT also generalizes to longer reasoning chains.
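The in-domain versus out-of-domain split can be illustrated by generating Coin Flip problems with different numbers of steps. This is a minimal sketch under stated assumptions: the names, the sentence templates, the yes/no answer format, and the step counts (2 for in-domain, 4 for OOD) are hypothetical choices, not values taken from these notes.

```python
import random

# Hypothetical names used only to fill the sentence template.
NAMES = ["Ana", "Ben", "Chloe", "Dev", "Emi", "Finn"]


def make_coin_flip_example(num_steps: int, rng: random.Random) -> tuple[str, str]:
    """Build one (question, answer) pair with `num_steps` flip/no-flip actions."""
    heads_up = True
    sentences = ["A coin is heads up."]
    for person in rng.sample(NAMES, num_steps):
        flips = rng.random() < 0.5
        verb = "flips" if flips else "does not flip"
        sentences.append(f"{person} {verb} the coin.")
        if flips:
            heads_up = not heads_up
    sentences.append("Is the coin still heads up?")
    return " ".join(sentences), "yes" if heads_up else "no"


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# In-domain: same number of steps as the (assumed) few-shot exemplars.
in_domain = [make_coin_flip_example(2, rng) for _ in range(5)]
# OOD: more steps than any exemplar shows.
ood = [make_coin_flip_example(4, rng) for _ in range(5)]

print(in_domain[0])
print(ood[0])
```

A model that only pattern-matches the exemplar length would be expected to fail on the `ood` set, which is the kind of length generalization the notes describe.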

AkihikoWatanabe added Neural NLP Zero/Few-shot CoT labels Apr 27, 2023

AkihikoWatanabe changed the title ~~Chain of thought prompting elicits reasoning in large language models, Wei+, arXiv'22~~ Chain of thought prompting elicits reasoning in large language models, Wei+, Google Research, arXiv'22 Apr 27, 2023

AkihikoWatanabe mentioned this issue Apr 27, 2023

Enhancing LLM Chain-of-Thought w/ Iterative Bootstrapping, Sun+, Xiamen University (w/ MSRA et al.), arXiv'23 #532

Open

AkihikoWatanabe mentioned this issue May 5, 2023

Large Language Models are Zero-Shot Reasoners, Kojima+, University of Tokyo, NeurIPS'22 #553

Open

AkihikoWatanabe mentioned this issue Aug 22, 2023

Graph of Thoughts: Solving Elaborate Problems with Large Language Models, Maciej Besta+, N/A, arXiv'23 #1012

Open

AkihikoWatanabe mentioned this issue Oct 22, 2023

SCOTT: Self-Consistent Chain-of-Thought Distillation, ACL'23 #829

Open

AkihikoWatanabe added the Prompting label Nov 19, 2023
