Explaining grokking through circuit efficiency, Vikrant Varma+, N/A, arXiv'23 #1051

AkihikoWatanabe · 2023-09-30T09:44:46Z

URL

https://arxiv.org/abs/2309.02390

Affiliations

Vikrant Varma, N/A
Rohin Shah, N/A
Zachary Kenton, N/A
János Kramár, N/A
Ramana Kumar, N/A

Abstract

One of the most surprising puzzles in neural network generalisation is grokking: a network with perfect training accuracy but poor generalisation will, upon further training, transition to perfect generalisation. We propose that grokking occurs when the task admits a generalising solution and a memorising solution, where the generalising solution is slower to learn but more efficient, producing larger logits with the same parameter norm. We hypothesise that memorising circuits become more inefficient with larger training datasets while generalising circuits do not, suggesting there is a critical dataset size at which memorisation and generalisation are equally efficient. We make and confirm four novel predictions about grokking, providing significant evidence in favour of our explanation. Most strikingly, we demonstrate two novel and surprising behaviours: ungrokking, in which a network regresses from perfect to low test accuracy, and semi-grokking, in which a network shows delayed generalisation to partial rather than perfect test accuracy.

Translation (by gpt-3.5-turbo)

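One of the most surprising puzzles in neural network generalisation is grokking: a network that has perfect training accuracy but generalises poorly will, with further training, transition to perfect generalisation. We propose that grokking occurs when the task admits both a generalising solution and a memorising solution. The generalising solution is slower to learn but more efficient, producing larger logits with the same parameter norm. We hypothesise that memorising circuits become less efficient as the training dataset grows, while generalising circuits do not. This suggests that there is a critical dataset size at which memorisation and generalisation are equally efficient. We make and confirm four novel predictions about grokking, providing significant evidence in support of our explanation. Most strikingly, we demonstrate two novel and surprising behaviours: ungrokking, in which a network regresses from perfect to low test accuracy, and semi-grokking, in which a network shows delayed generalisation to partial rather than perfect test accuracy.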

Summary (by gpt-3.5-turbo)

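Grokking is the phenomenon in which a network with perfect training accuracy but poor generalisation transitions to perfect generalisation after further training. It is thought to occur when the task admits both a generalising solution and a memorising solution. The generalising solution is slower to learn but more efficient, producing larger logits with the same parameter norm. Memorising circuits are hypothesised to become less efficient as the training dataset grows, while generalising circuits do not, which suggests a critical dataset size at which memorisation and generalisation are equally efficient. Four novel predictions about grokking are made and confirmed, providing significant evidence for the explanation. Two new phenomena besides grokking are also demonstrated: ungrokking, in which a network regresses from perfect to low test accuracy, and semi-grokking, in which a network shows delayed generalisation to partial rather than perfect test accuracy.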
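The critical dataset size argument can be made concrete with a minimal sketch. The efficiency curves here are assumptions chosen purely for illustration and are not taken from the paper: memorisation efficiency is assumed to decay as `c / sqrt(D)` with dataset size `D`, and generalisation efficiency is assumed to be a constant `e_gen`. All function names and parameter values are hypothetical.

```python
import math

def mem_efficiency(dataset_size, c=10.0):
    # Assumed decay law (illustrative only): memorising gets less
    # efficient as the number of training examples grows.
    return c / math.sqrt(dataset_size)

def gen_efficiency(dataset_size, e_gen=0.5):
    # Assumed constant: the generalising circuit does not depend on D.
    return e_gen

def critical_dataset_size(c=10.0, e_gen=0.5, lo=1, hi=10**6):
    # Bisection for the smallest D at which generalisation is at least
    # as efficient as memorisation. Valid because mem_efficiency is
    # monotonically decreasing in D under the assumed law.
    while lo < hi:
        mid = (lo + hi) // 2
        if gen_efficiency(mid, e_gen) >= mem_efficiency(mid, c):
            hi = mid
        else:
            lo = mid + 1
    return lo

print(critical_dataset_size())  # 400 for the default toy parameters
```

Under these toy assumptions the crossover sits at `D = (c / e_gen) ** 2`. Below it the memorising circuit is the more efficient one, and above it the generalising circuit is.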

AkihikoWatanabe · 2023-09-30T09:45:33Z

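This study presents a theory that explains when and why grokking occurs.
The explanation is that the network first learns memorization, and then at some point switches to the generalising circuit (Gen). According to the paper, the switch happens because Gen achieves a lower loss than memorization, since it produces larger logits with the same parameter norm. The effect is more pronounced on larger datasets.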
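The efficiency argument can be illustrated with a small toy model. This is a sketch under stated assumptions, not the paper's setup: each circuit is a single scalar weight, its logit contribution is `efficiency * weight`, and training is plain gradient descent on a logistic loss with L2 weight decay. The function name and all parameter values are hypothetical.

```python
import numpy as np

def train_two_circuits(e_mem=1.0, e_gen=2.0, weight_decay=0.01,
                       lr=0.1, steps=20000):
    # w[0] is the memorising weight, w[1] the generalising weight.
    w = np.array([0.1, 0.1])
    e = np.array([e_mem, e_gen])
    for _ in range(steps):
        margin = float(e @ w)  # total logit margin from both circuits
        # Gradient of log(1 + exp(-margin)) with respect to w.
        grad_loss = -e / (1.0 + np.exp(margin))
        # Gradient step on loss plus the L2 penalty (weight decay).
        w -= lr * (grad_loss + weight_decay * w)
    return w

w_mem, w_gen = train_two_circuits()
print(w_gen / w_mem)  # approaches e_gen / e_mem
```

At the stationary point of this toy model the weights are proportional to the efficiencies, so weight decay gives more of the parameter norm to the more efficient circuit. The sketch only shows that final allocation. It does not model the generalising circuit being slower to learn, so it does not reproduce the delayed switch itself.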

AkihikoWatanabe · 2023-10-27T10:44:48Z

The study that first reported grokking is #524

AkihikoWatanabe added the Pocket label Sep 30, 2023

AkihikoWatanabe changed the title from "あ" to "Explaining grokking through circuit efficiency, Vikrant Varma+, N/A, arXiv'23" Sep 30, 2023

AkihikoWatanabe added MachineLearning Grokking LanguageModel Neural and removed LanguageModel labels Oct 27, 2023
