42 changes: 42 additions & 0 deletions _pubs/monkeyspower.md
@@ -0,0 +1,42 @@
---
title: 'How Do Large Language Monkeys Get Their Power (Laws)?'
authors:
- key: rylanschaeffer
  affiliation: Stanford University
- key: joshuakazdan
affiliation: Stanford University
- key: johnhughes
affiliation: University of Cambridge
- name: Jordan Juravsky
  affiliation: Stanford University
- name: Sara Price
affiliation: University of Cambridge
- name: Aengus Lynch
affiliation: University of Cambridge
- name: Erik Jones
affiliation: University of Toronto
- name: Robert Kirk
affiliation: University of Cambridge
- key: azaliamirhoseini
  affiliation: Stanford University
- name: Sanmi Koyejo
affiliation: Stanford University
venue: arXiv preprint
year: 2025
month: February
day: 24
has_pdf: true
doi: 10.48550/arXiv.2502.17578
tags:
- machine learning
- scaling laws
- generative AI
- inference compute
teaser:
Recent research documents a surprising finding: increasing inference compute through repeated sampling reveals power-law scaling in average success rates. This occurs even though each individual problem's failure rate falls exponentially with the number of attempts; the aggregate power law is explained by heavy-tailed distributions of single-attempt success rates. Our analysis unifies these observations under a mathematical framework, offering new insights into inference-time scaling laws and more efficient performance forecasting for language models.
materials:
- name: Paper
url: https://arxiv.org/abs/2502.17578
type: file-pdf
---
Recent research across mathematical problem solving, proof assistant programming, and multimodal jailbreaking documents a striking finding: when (multimodal) language models tackle a suite of tasks with multiple attempts per task -- succeeding if any attempt is correct -- then the negative log of the average success rate scales as a power law in the number of attempts. In this work, we identify an apparent puzzle: a simple mathematical calculation predicts that on each problem, the failure rate should fall exponentially with the number of attempts. We confirm this prediction empirically, raising a question: from where does aggregate polynomial scaling emerge? We then answer this question by demonstrating that per-problem exponential scaling can be made consistent with aggregate polynomial scaling if the distribution of single-attempt success probabilities is heavy tailed, such that a small fraction of tasks with extremely low success probabilities collectively warp the aggregate success trend into a power law -- even as each problem scales exponentially on its own. We further demonstrate that this distributional perspective explains previously observed deviations from power-law scaling and provides a simple method for forecasting the power-law exponent with an order of magnitude lower relative error, or equivalently, ~2-4 orders of magnitude less inference compute. Overall, our work contributes to a better understanding of how neural language model performance improves with scaling inference compute, and to the development of scaling-predictable evaluations of (multimodal) language models.
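
To make the heavy-tail mechanism above concrete, here is a minimal simulation sketch, not taken from the paper: single-attempt success probabilities are drawn from an illustrative Beta(0.3, 3) distribution (the distribution choice, sample count, and attempt counts are all assumptions for illustration), each problem's failure rate falls exponentially in the number of attempts k, and yet the negative log of the average success rate traces an approximate power law in k.

```python
# Illustrative sketch (not from the paper): per-problem exponential scaling
# aggregating into approximate power-law scaling under a heavy-tailed
# distribution of single-attempt success probabilities.
import numpy as np

rng = np.random.default_rng(0)

# Heavy-tailed single-attempt success probabilities: mass piles up near zero,
# so a small fraction of problems is extremely hard. Beta(0.3, 3) is an
# assumed, illustrative choice.
p = rng.beta(a=0.3, b=3.0, size=100_000)

for k in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]:
    per_problem_success = 1.0 - (1.0 - p) ** k   # exponential in k for each problem
    avg_success = per_problem_success.mean()     # aggregate pass@k over the suite
    print(f"k={k:5d}   -log(avg success) = {-np.log(avg_success):.4f}")

# Plotting -log(avg success) against k on log-log axes gives an approximately
# straight line, even though each individual problem's failure probability
# (1 - p_i)**k decays exponentially in k.
```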
44 changes: 44 additions & 0 deletions _pubs/sdgrl.md
@@ -0,0 +1,44 @@
---
title: 'Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use'
authors:
- name: Anna Goldie
affiliation: Stanford University
email: agoldie@cs.stanford.edu
equal: true
- name: Azalia Mirhoseini
affiliation: Stanford University
email: azalia@cs.stanford.edu
equal: true
- name: Hao Zhou
affiliation: Google DeepMind
- name: Irene Cai
affiliation: Google DeepMind
- name: Christopher D. Manning
affiliation: Stanford University
venue: arXiv preprint
year: 2025
month: April
day: 28
has_pdf: true
doi: 10.48550/arXiv.2504.04736
tags:
- reinforcement learning
- language models
- reasoning
- tool use
- synthetic data
- generative AI
teaser:
This paper introduces Step-Wise Reinforcement Learning (SWiRL), a framework for improving multi-step reasoning and tool use in language models through synthetic data generation and offline RL. SWiRL decomposes reasoning trajectories into sub-trajectories, enabling fine-grained feedback and significant accuracy improvements across challenging tasks like HotPotQA, GSM8K, MuSiQue, CofCA, and BeerQA. Notably, SWiRL-trained models outperform larger proprietary models in multi-step reasoning while demonstrating strong task generalization and improved cost efficiency.
materials:
- name: Paper
url: https://arxiv.org/abs/2504.04736
type: file-pdf
- name: Code Repository
url: https://github.com/ScalingIntelligence/swirl_rl
type: code
- name: Synthetic Multi-Step Reasoning Dataset
url: https://huggingface.co/datasets/ScalingIntelligence/swirl_synthetic_data
type: database
---
Reinforcement learning has been shown to improve the performance of large language models. However, traditional approaches like RLHF or RLAIF treat the problem as single-step. As focus shifts toward more complex reasoning and agentic tasks, language models must take multiple steps of text generation, reasoning, and environment interaction before generating a solution. We propose a synthetic data generation and RL methodology targeting multi-step optimization scenarios. This approach, called Step-Wise Reinforcement Learning (SWiRL), iteratively generates multi-step reasoning and tool use data, and then learns from that data. It employs a simple step-wise decomposition that breaks each multi-step trajectory into multiple sub-trajectories corresponding to each action by the original model. It then applies synthetic data filtering and RL optimization on these sub-trajectories. We evaluate SWiRL on a number of multi-step tool use, question answering, and mathematical reasoning tasks. Our experiments show that SWiRL outperforms baseline approaches by 21.5%, 12.3%, 14.8%, 11.1%, and 15.3% in relative accuracy on GSM8K, HotPotQA, CofCA, MuSiQue, and BeerQA, respectively. Excitingly, the approach exhibits generalization across tasks: for example, training only on HotPotQA (a text question-answering dataset) improves zero-shot performance on GSM8K (a math dataset) by a relative 16.9%.
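
As a rough illustration of the step-wise decomposition described above, the sketch below (not the authors' implementation) splits each multi-step trajectory into one training example per action and applies a pluggable filtering rule before flattening the result into a training set. The `Step` and `Trajectory` structures and the `keep` predicate are illustrative assumptions, not SWiRL's actual API.

```python
# Illustrative sketch of step-wise trajectory decomposition (assumed data
# structures; not the SWiRL codebase).
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class Step:
    context: str   # prompt plus all reasoning, tool calls, and observations so far
    action: str    # the model's next output at this step

@dataclass
class Trajectory:
    steps: List[Step]
    final_answer_correct: Optional[bool] = None  # may be unknown without labels

def decompose(trajectory: Trajectory) -> List[Dict[str, str]]:
    """Break one multi-step trajectory into per-action sub-trajectories."""
    return [{"input": step.context, "target": step.action} for step in trajectory.steps]

def build_training_set(
    trajectories: List[Trajectory],
    keep: Callable[[Trajectory], bool] = lambda t: True,  # e.g. outcome- or process-based filter
) -> List[Dict[str, str]]:
    """Filter trajectories, then flatten the survivors into training examples."""
    examples: List[Dict[str, str]] = []
    for traj in trajectories:
        if keep(traj):
            examples.extend(decompose(traj))
    return examples
```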
32 changes: 32 additions & 0 deletions _pubs/tpt.md
@@ -0,0 +1,32 @@
---
title: 'Think, Prune, Train, Improve: Scaling Reasoning Without Scaling Models'
authors:
- name: Caia Costello
affiliation: Stanford University / Ceramic AI
email: caia@stanford.edu
- name: Simon Guo
affiliation: Stanford University
- name: Anna Goldie
affiliation: Stanford University
- key: azaliamirhoseini
  affiliation: Stanford University
venue: arXiv preprint
year: 2025
month: April
day: 25
has_pdf: true
doi: 10.48550/arXiv.2504.18116
tags:
- machine learning
- language models
- reasoning
- self-improvement
- generative AI
teaser:
This paper introduces the Think, Prune, Train (TPT) framework, a scalable method for improving language model reasoning without increasing model size. By iteratively fine-tuning models on their own reasoning traces and applying correctness-based pruning, TPT enables smaller models to rival or exceed larger ones. Experimental results on GSM8K and CodeContests show that Gemma2-2B substantially improves its Pass@1 accuracy, Gemma2-9B matches LLaMA-3.1-70B, and LLaMA-3.1-70B surpasses even GPT-4o through recursive self-improvement.
materials:
- name: Paper
url: https://arxiv.org/abs/2504.18116
type: file-pdf
---
Large language models (LLMs) have demonstrated strong capabilities in programming and mathematical reasoning tasks, but are constrained by limited high-quality training data. Synthetic data can be leveraged to enhance fine-tuning outcomes, but several factors influence this process, including model size, synthetic data volume, pruning strategy, and number of fine-tuning rounds. We explore these axes and investigate which conditions enable model self-improvement. We introduce the Think, Prune, Train process, a scalable framework that iteratively fine-tunes models on their own reasoning traces, using ground-truth pruning to ensure high-quality training data. This approach yields improved performance: on GSM8K, Gemma2-2B achieves a Pass@1 of 57.6% (from 41.9%); Gemma2-9B reaches 82%, matching LLaMA-3.1-70B; and LLaMA-3.1-70B attains 91%, even surpassing GPT-4o. These results demonstrate the effectiveness of self-generated reasoning and systematic data selection for improving LLM capabilities.
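
The iterative loop can be sketched as below; this is a hedged outline under assumed interfaces, where `generate`, `is_correct`, and `finetune` are hypothetical stand-ins for model sampling, ground-truth answer checking, and supervised fine-tuning rather than functions from the paper's code.

```python
# Illustrative sketch of a Think, Prune, Train style loop (assumed interfaces).
from typing import Callable, List, Tuple

def think_prune_train(
    model,
    problems: List[Tuple[str, str]],   # (prompt, ground-truth answer) pairs
    generate: Callable,                # generate(model, prompt) -> reasoning trace (str)
    is_correct: Callable,              # is_correct(trace, answer) -> bool
    finetune: Callable,                # finetune(model, examples) -> updated model
    rounds: int = 3,
    samples_per_problem: int = 4,
):
    for _ in range(rounds):
        # Think: sample candidate reasoning traces from the current model.
        candidates = [
            (prompt, answer, generate(model, prompt))
            for prompt, answer in problems
            for _ in range(samples_per_problem)
        ]
        # Prune: keep only traces whose final answer matches the ground truth.
        kept = [
            (prompt, trace)
            for prompt, answer, trace in candidates
            if is_correct(trace, answer)
        ]
        # Train: fine-tune on the model's own correct traces, then repeat.
        model = finetune(model, kept)
    return model
```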
Binary file added imgs/teasers/monkeyspower.png
Binary file added imgs/teasers/sdgrl.png
Binary file added imgs/teasers/tpt.png
Binary file added imgs/thumbs/monkeyspower.png
Binary file added imgs/thumbs/sdgrl.png
Binary file added imgs/thumbs/tpt.png
Binary file added pubs/monkeyspower.pdf
Binary file added pubs/sdgrl.pdf
Binary file added pubs/tpt.pdf