42 changes: 42 additions & 0 deletions _pubs/monkeyspower.md
@@ -0,0 +1,42 @@
---
title: 'How Do Large Language Monkeys Get Their Power (Laws)?'
authors:
- key: rylanschaeffer
  affiliation: Stanford University
- key: joshuakazdan
affiliation: Stanford University
- key: johnhughes
affiliation: University of Cambridge
- name: Jordan Juravsky
  affiliation: Stanford University
- name: Sara Price
affiliation: University of Cambridge
- name: Aengus Lynch
affiliation: University of Cambridge
- name: Erik Jones
affiliation: University of Toronto
- name: Robert Kirk
affiliation: University of Cambridge
- key: azaliamirhoseini
  affiliation: Stanford University
- name: Sanmi Koyejo
affiliation: Stanford University
venue: arXiv preprint
year: 2025
month: February
day: 24
has_pdf: true
doi: 10.48550/arXiv.2502.17578
tags:
- machine learning
- scaling laws
- generative AI
- inference compute
teaser:
Recent research documents a surprising finding: increasing inference compute through repeated sampling reveals power-law scaling in average success rates. This occurs even though each individual problem's failure rate falls exponentially with the number of attempts; the aggregate power law is explained by heavy-tailed distributions of single-attempt success rates. Our analysis unifies these observations under a mathematical framework, offering new insights into inference-time scaling laws and more efficient performance forecasting for language models.
materials:
- name: Paper
url: https://arxiv.org/abs/2502.17578
type: file-pdf
---
Recent research across mathematical problem solving, proof assistant programming, and multimodal jailbreaking documents a striking finding: when (multimodal) language models tackle a suite of tasks with multiple attempts per task -- succeeding if any attempt is correct -- then the negative log of the average success rate scales as a power law in the number of attempts. In this work, we identify an apparent puzzle: a simple mathematical calculation predicts that on each problem, the failure rate should fall exponentially with the number of attempts. We confirm this prediction empirically, raising a question: from where does aggregate polynomial scaling emerge? We then answer this question by demonstrating that per-problem exponential scaling can be made consistent with aggregate polynomial scaling if the distribution of single-attempt success probabilities is heavy tailed, such that a small fraction of tasks with extremely low success probabilities collectively warp the aggregate success trend into a power law -- even as each problem scales exponentially on its own. We further demonstrate that this distributional perspective explains previously observed deviations from power-law scaling and provides a simple method for forecasting the power-law exponent with an order of magnitude lower relative error, or equivalently, ~2-4 orders of magnitude less inference compute. Overall, our work contributes to a better understanding of how neural language model performance improves with scaling inference compute, and to the development of scaling-predictable evaluations of (multimodal) language models.
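
To make the heavy-tail mechanism above concrete, here is a minimal simulation sketch, not taken from the paper: single-attempt success probabilities are drawn from an illustrative Beta(0.3, 3) distribution (the distribution choice, sample count, and attempt counts are all assumptions for illustration), each problem's failure rate falls exponentially in the number of attempts k, and yet the negative log of the average success rate traces an approximate power law in k.

```python
# Illustrative sketch (not from the paper): per-problem exponential scaling
# aggregating into approximate power-law scaling under a heavy-tailed
# distribution of single-attempt success probabilities.
import numpy as np

rng = np.random.default_rng(0)

# Heavy-tailed single-attempt success probabilities: mass piles up near zero,
# so a small fraction of problems is extremely hard. Beta(0.3, 3) is an
# assumed, illustrative choice.
p = rng.beta(a=0.3, b=3.0, size=100_000)

for k in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]:
    per_problem_success = 1.0 - (1.0 - p) ** k   # exponential in k for each problem
    avg_success = per_problem_success.mean()     # aggregate pass@k over the suite
    print(f"k={k:5d}   -log(avg success) = {-np.log(avg_success):.4f}")

# Plotting -log(avg success) against k on log-log axes gives an approximately
# straight line, even though each individual problem's failure probability
# (1 - p_i)**k decays exponentially in k.
```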
44 changes: 44 additions & 0 deletions _pubs/sdgrl.md
@@ -0,0 +1,44 @@
---
title: 'Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use'
authors:
- name: Anna Goldie
affiliation: Stanford University
email: agoldie@cs.stanford.edu
equal: true
- name: Azalia Mirhoseini
affiliation: Stanford University
email: azalia@cs.stanford.edu
equal: true
- name: Hao Zhou
affiliation: Google DeepMind
- name: Irene Cai
affiliation: Google DeepMind
- name: Christopher D. Manning
affiliation: Stanford University
venue: arXiv preprint
year: 2025
month: April
day: 28
has_pdf: true
doi: 10.48550/arXiv.2504.04736
tags:
- reinforcement learning
- language models
- reasoning
- tool use
- synthetic data
- generative AI
teaser:
This paper introduces Step-Wise Reinforcement Learning (SWiRL), a framework for improving multi-step reasoning and tool use in language models through synthetic data generation and offline RL. SWiRL decomposes reasoning trajectories into sub-trajectories, enabling fine-grained feedback and significant accuracy improvements across challenging tasks like HotPotQA, GSM8K, MuSiQue, CofCA, and BeerQA. Notably, SWiRL-trained models outperform larger proprietary models in multi-step reasoning while demonstrating strong task generalization and improved cost efficiency.
materials:
- name: Paper
url: https://arxiv.org/abs/2504.04736
type: file-pdf
- name: Code Repository
url: https://github.com/ScalingIntelligence/swirl_rl
type: code
- name: Synthetic Multi-Step Reasoning Dataset
url: https://huggingface.co/datasets/ScalingIntelligence/swirl_synthetic_data
type: database
---
Reinforcement learning has been shown to improve the performance of large language models. However, traditional approaches like RLHF or RLAIF treat the problem as single-step. As focus shifts toward more complex reasoning and agentic tasks, language models must take multiple steps of text generation, reasoning, and environment interaction before generating a solution. We propose a synthetic data generation and RL methodology targeting multi-step optimization scenarios. This approach, called Step-Wise Reinforcement Learning (SWiRL), iteratively generates multi-step reasoning and tool use data, and then learns from that data. It employs a simple step-wise decomposition that breaks each multi-step trajectory into multiple sub-trajectories corresponding to each action by the original model. It then applies synthetic data filtering and RL optimization on these sub-trajectories. We evaluate SWiRL on a number of multi-step tool use, question answering, and mathematical reasoning tasks. Our experiments show that SWiRL outperforms baseline approaches by 21.5%, 12.3%, 14.8%, 11.1%, and 15.3% in relative accuracy on GSM8K, HotPotQA, CofCA, MuSiQue, and BeerQA, respectively. Excitingly, the approach exhibits generalization across tasks: for example, training only on HotPotQA (a text question-answering dataset) improves zero-shot performance on GSM8K (a math dataset) by a relative 16.9%.
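
As a rough illustration of the step-wise decomposition described above, the sketch below (not the authors' implementation) splits each multi-step trajectory into one training example per action and applies a pluggable filtering rule before flattening the result into a training set. The `Step` and `Trajectory` structures and the `keep` predicate are illustrative assumptions, not SWiRL's actual API.

```python
# Illustrative sketch of step-wise trajectory decomposition (assumed data
# structures; not the SWiRL codebase).
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class Step:
    context: str   # prompt plus all reasoning, tool calls, and observations so far
    action: str    # the model's next output at this step

@dataclass
class Trajectory:
    steps: List[Step]
    final_answer_correct: Optional[bool] = None  # may be unknown without labels

def decompose(trajectory: Trajectory) -> List[Dict[str, str]]:
    """Break one multi-step trajectory into per-action sub-trajectories."""
    return [{"input": step.context, "target": step.action} for step in trajectory.steps]

def build_training_set(
    trajectories: List[Trajectory],
    keep: Callable[[Trajectory], bool] = lambda t: True,  # e.g. outcome- or process-based filter
) -> List[Dict[str, str]]:
    """Filter trajectories, then flatten the survivors into training examples."""
    examples: List[Dict[str, str]] = []
    for traj in trajectories:
        if keep(traj):
            examples.extend(decompose(traj))
    return examples
```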
32 changes: 32 additions & 0 deletions _pubs/tpt.md
@@ -0,0 +1,32 @@
---
title: 'Think, Prune, Train, Improve: Scaling Reasoning Without Scaling Models'
authors:
- name: Caia Costello
affiliation: Stanford University / Ceramic AI
email: caia@stanford.edu
- name: Simon Guo
affiliation: Stanford University
- name: Anna Goldie
affiliation: Stanford University
- key: azaliamirhoseini
  affiliation: Stanford University
venue: arXiv preprint
year: 2025
month: April
day: 25
has_pdf: true
doi: 10.48550/arXiv.2504.18116
tags:
- machine learning
- language models
- reasoning
- self-improvement
- generative AI
teaser:
This paper introduces the Think, Prune, Train (TPT) framework, a scalable method for improving language model reasoning without increasing model size. By iteratively fine-tuning models on their own reasoning traces and applying correctness-based pruning, TPT enables smaller models to rival or exceed larger ones. Experimental results on GSM8K and CodeContests show that Gemma2-2B substantially improves its Pass@1 accuracy, Gemma2-9B matches LLaMA-3.1-70B, and LLaMA-3.1-70B surpasses even GPT-4o through recursive self-improvement.
materials:
- name: Paper
url: https://arxiv.org/abs/2504.18116
type: file-pdf
---
Large language models (LLMs) have demonstrated strong capabilities in programming and mathematical reasoning tasks, but are constrained by limited high-quality training data. Synthetic data can be leveraged to enhance fine-tuning outcomes, but several factors influence this process, including model size, synthetic data volume, pruning strategy, and number of fine-tuning rounds. We explore these axes and investigate which conditions enable model self-improvement. We introduce the Think, Prune, Train process, a scalable framework that iteratively fine-tunes models on their own reasoning traces, using ground-truth pruning to ensure high-quality training data. This approach yields improved performance: on GSM8K, Gemma2-2B achieves a Pass@1 of 57.6% (from 41.9%); Gemma2-9B reaches 82%, matching LLaMA-3.1-70B; and LLaMA-3.1-70B attains 91%, even surpassing GPT-4o. These results demonstrate the effectiveness of self-generated reasoning and systematic data selection for improving LLM capabilities.
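
The iterative loop can be sketched as below; this is a hedged outline under assumed interfaces, where `generate`, `is_correct`, and `finetune` are hypothetical stand-ins for model sampling, ground-truth answer checking, and supervised fine-tuning rather than functions from the paper's code.

```python
# Illustrative sketch of a Think, Prune, Train style loop (assumed interfaces).
from typing import Callable, List, Tuple

def think_prune_train(
    model,
    problems: List[Tuple[str, str]],   # (prompt, ground-truth answer) pairs
    generate: Callable,                # generate(model, prompt) -> reasoning trace (str)
    is_correct: Callable,              # is_correct(trace, answer) -> bool
    finetune: Callable,                # finetune(model, examples) -> updated model
    rounds: int = 3,
    samples_per_problem: int = 4,
):
    for _ in range(rounds):
        # Think: sample candidate reasoning traces from the current model.
        candidates = [
            (prompt, answer, generate(model, prompt))
            for prompt, answer in problems
            for _ in range(samples_per_problem)
        ]
        # Prune: keep only traces whose final answer matches the ground truth.
        kept = [
            (prompt, trace)
            for prompt, answer, trace in candidates
            if is_correct(trace, answer)
        ]
        # Train: fine-tune on the model's own correct traces, then repeat.
        model = finetune(model, kept)
    return model
```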
Binary file added imgs/teasers/monkeyspower.png
Binary file added imgs/teasers/sdgrl.png
Binary file added imgs/teasers/tpt.png
Binary file added imgs/thumbs/monkeyspower.png
Binary file added imgs/thumbs/sdgrl.png
Binary file added imgs/thumbs/tpt.png
Binary file added pubs/monkeyspower.pdf
Binary file added pubs/sdgrl.pdf
Binary file added pubs/tpt.pdf