12 changes: 7 additions & 5 deletions _pubs/monkeyspower.md
@@ -1,11 +1,11 @@
 ---
 title: 'How Do Large Language Monkeys Get Their Power (Laws)?'
 authors:
-- name: rylanschaeffer
+- name: Rylan Schaeffer
   affiliation: University of California, Berkeley
-- name: joshuakazdan
+- name: Joshua Kazdan
   affiliation: Stanford University
-- name: johnhughes
+- name: John Hughes
   affiliation: University of Cambridge
 - name: Jordan Juravsky
   affiliation: University of California, Berkeley
@@ -20,16 +20,18 @@ authors:
 - key: azaliamirhoseini
 - name: Sanmi Koyejo
   affiliation: Stanford University
-venue: arXiv preprint
+venue: preprint
 year: 2025
+date: 2025-02-24
 month: February
 day: 24
 has_pdf: true
 doi: 10.48550/arXiv.2502.17578
 tags:
 - machine learning
 - generative ai
-teaser: Recent research documents a surprising finding: increasing inference compute through repeated sampling reveals power-law scaling in average success rates. This occurs despite per-problem exponential scaling, explained by heavy-tailed distributions of single-attempt success rates. Our analysis unifies these observations under a mathematical framework, offering new insights into inference-time scaling laws and more efficient performance forecasting for language models.
+slug: monkeyspower
+teaser: We explain how language models exhibit power-law scaling in success rates despite per-problem exponential scaling, revealing that heavy-tailed distributions of success probabilities drive this phenomenon and enabling more efficient performance forecasting.
 materials:
 - name: Paper
   url: https://arxiv.org/abs/2502.17578
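The revised monkeyspower teaser compresses a mathematical argument: each problem's failure rate falls exponentially with the number of samples, yet the failure rate averaged over problems follows a power law because single-attempt success rates are heavy-tailed. The sketch below illustrates that mechanism numerically; the Beta(0.3, 3) distribution is an illustrative assumption, not the paper's fitted model.

```python
# Minimal sketch of the mechanism the teaser describes, not the paper's code.
# Assumption: single-attempt success rates p_i follow a heavy-tailed Beta(0.3, 3)
# distribution, placing substantial probability mass near zero.
import numpy as np

rng = np.random.default_rng(0)
p = rng.beta(0.3, 3.0, size=200_000)  # per-problem single-attempt success rates

for k in [1, 10, 100, 1_000, 10_000]:
    # Per problem, failure after k independent attempts is (1 - p_i)**k: exponential in k.
    # Averaged over heavy-tailed p_i, the aggregate failure rate decays roughly as a power of k.
    avg_fail = np.mean((1.0 - p) ** k)
    print(f"k={k:>6}  average failure rate={avg_fail:.5f}")
```

Plotted on log-log axes, the printed failure rates fall on an approximately straight line, which is the power-law scaling the teaser refers to.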
11 changes: 3 additions & 8 deletions _pubs/sdgrl.md
@@ -12,25 +12,20 @@ authors:
   affiliation: Google DeepMind
 - name: Christopher D. Manning
   affiliation: Stanford University
-venue: arXiv preprint
+venue: preprint
 year: 2025
+date: 2025-04-28
 month: April
 day: 28
 has_pdf: true
 doi: 10.48550/arXiv.2504.04736
 tags:
 - model
 - generative ai
-teaser: This paper introduces Step-Wise Reinforcement Learning (SWiRL), a framework for improving multi-step reasoning and tool use in language models through synthetic data generation and offline RL. SWiRL decomposes reasoning trajectories into sub-trajectories, enabling fine-grained feedback and significant accuracy improvements across challenging tasks like HotPotQA, GSM8K, MuSiQue, CofCA, and BeerQA. Notably, SWiRL-trained models outperform larger proprietary models in multi-step reasoning while demonstrating strong task generalization and improved cost efficiency.
+teaser: SWiRL is a framework that improves language model reasoning through synthetic data generation and step-wise reinforcement learning, enabling models to outperform larger proprietary models across diverse reasoning tasks while demonstrating strong generalization capabilities.
 materials:
 - name: Paper
   url: https://arxiv.org/abs/2504.04736
   type: file-pdf
-- name: Code Repository
-  url: https://github.com/ScalingIntelligence/swirl_rl
-  type: code
-- name: Synthetic Multi-Step Reasoning Dataset
-  url: https://huggingface.co/datasets/ScalingIntelligence/swirl_synthetic_data
-  type: database
 ---
 Reinforcement learning has been shown to improve the performance of large language models. However, traditional approaches like RLHF or RLAIF treat the problem as single-step. As focus shifts toward more complex reasoning and agentic tasks, language models must take multiple steps of text generation, reasoning and environment interaction before generating a solution. We propose a synthetic data generation and RL methodology targeting multi-step optimization scenarios. This approach, called Step-Wise Reinforcement Learning (SWiRL), iteratively generates multi-step reasoning and tool use data, and then learns from that data. It employs a simple step-wise decomposition that breaks each multi-step trajectory into multiple sub-trajectories corresponding to each action by the original model. It then applies synthetic data filtering and RL optimization on these sub-trajectories. We evaluated SWiRL on a number of multi-step tool use, question answering, and mathematical reasoning tasks. Our experiments show that SWiRL outperforms baseline approaches by 21.5%, 12.3%, 14.8%, 11.1%, and 15.3% in relative accuracy on GSM8K, HotPotQA, CofCA, MuSiQue, and BeerQA, respectively. Excitingly, the approach exhibits generalization across tasks: for example, training only on HotPotQA (text question-answering) improves zero-shot performance on GSM8K (a math dataset) by a relative 16.9%.
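The sdgrl abstract hinges on one structural idea: each multi-step trajectory is broken into sub-trajectories, one per model action, before filtering and RL optimization. A minimal sketch of that decomposition follows; the Step dataclass and function names are assumptions for illustration, not SWiRL's actual API.

```python
# Minimal sketch of the step-wise decomposition described in the abstract.
# The Step dataclass and the decompose signature are illustrative assumptions,
# not SWiRL's implementation.
from dataclasses import dataclass

@dataclass
class Step:
    observation: str  # new context seen at this step (question, tool output, etc.)
    action: str       # the model's generation (reasoning, tool call, or final answer)

def decompose(trajectory: list[Step]) -> list[tuple[str, str]]:
    """Turn one multi-step trajectory into per-action (context, action) sub-trajectories."""
    sub_trajectories = []
    history = ""
    for step in trajectory:
        history += step.observation
        sub_trajectories.append((history, step.action))  # one training example per action
        history += step.action
    return sub_trajectories
```

Each (context, action) pair can then be filtered and optimized on its own, which is what provides the fine-grained, per-step feedback the teaser mentions.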
5 changes: 3 additions & 2 deletions _pubs/tpt.md
@@ -6,8 +6,9 @@ authors:
 - name: Anna Goldie
   affiliation: Stanford University
 - key: azaliamirhoseini
-venue: arXiv preprint
+venue: preprint
 year: 2025
+date: 2025-04-25
 month: April
 day: 25
 has_pdf: true
@@ -16,7 +17,7 @@ tags:
 - machine learning
 - model
 - generative ai
-teaser: This paper introduces the Think, Prune, Train (TPT) framework, a scalable method for improving language model reasoning without increasing model size. By iteratively fine-tuning models on their own reasoning traces and applying correctness-based pruning, TPT enables smaller models to achieve performance rivaling or exceeding larger ones. Experimental results on GSM8K and CodeContests show that models like Gemma-2B and LLaMA-70B-Instruct can surpass even GPT-4o on Pass@1 accuracy through recursive self-improvement.
+teaser: Think, Prune, Train (TPT) is a scalable framework that enables smaller language models to achieve performance rivaling larger ones through iterative self-improvement on their own reasoning traces, with experimental results showing models like Gemma-2B and LLaMA-70B-Instruct surpassing GPT-4o on reasoning tasks.
 materials:
 - name: Paper
   url: https://arxiv.org/abs/2504.18116
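The tpt teasers describe a loop: sample reasoning traces from the current model, prune them by answer correctness, and fine-tune on what survives. A minimal sketch of one round is below; sample_traces, is_correct, and finetune are assumed placeholders rather than the paper's implementation.

```python
# Minimal sketch of one Think-Prune-Train round as described in the teaser.
# sample_traces, is_correct, and finetune are assumed placeholders, not the paper's code.
def tpt_round(model, problems, sample_traces, is_correct, finetune, n_samples=4):
    kept = []
    for problem in problems:
        for trace in sample_traces(model, problem, n_samples):  # "Think": generate candidate traces
            if is_correct(problem, trace):                       # "Prune": keep only correct traces
                kept.append((problem, trace))
    return finetune(model, kept)                                 # "Train": fine-tune on survivors
```

Repeating tpt_round with the returned model gives the recursive self-improvement the teaser describes; the model size never changes, only the data it is trained on.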