diff --git a/_pubs/monkeyspower.md b/_pubs/monkeyspower.md
index 3b77e77d..43597219 100644
--- a/_pubs/monkeyspower.md
+++ b/_pubs/monkeyspower.md
@@ -1,11 +1,11 @@
 ---
 title: 'How Do Large Language Monkeys Get Their Power (Laws)?'
 authors:
-  - name: rylanschaeffer
+  - name: Rylan Schaeffer
     affiliation: University of California, Berkeley
-  - name: joshuakazdan
+  - name: Joshua Kazdan
     affiliation: Stanford University
-  - name: johnhughes
+  - name: John Hughes
     affiliation: University of Cambridge
   - name: Jordan Juravsky
     affiliation: University of California, Berkeley
@@ -20,8 +20,9 @@ authors:
   - key: azaliamirhoseini
   - name: Sanmi Koyejo
     affiliation: Stanford University
-venue: arXiv preprint
+venue: preprint
 year: 2025
+date: 2025-02-24
 month: February
 day: 24
 has_pdf: true
@@ -29,7 +30,8 @@ doi: 10.48550/arXiv.2502.17578
 tags:
   - machine learning
   - generative ai
-teaser: Recent research documents a surprising finding: increasing inference compute through repeated sampling reveals power-law scaling in average success rates. This occurs despite per-problem exponential scaling, explained by heavy-tailed distributions of single-attempt success rates. Our analysis unifies these observations under a mathematical framework, offering new insights into inference-time scaling laws and more efficient performance forecasting for language models.
+slug: monkeyspower
+teaser: We explain how language models exhibit power-law scaling in success rates despite per-problem exponential scaling, revealing that heavy-tailed distributions of success probabilities drive this phenomenon and enabling more efficient performance forecasting.
 materials:
   - name: Paper
     url: https://arxiv.org/abs/2502.17578
diff --git a/_pubs/sdgrl.md b/_pubs/sdgrl.md
index a8a329fd..0bfe62fa 100644
--- a/_pubs/sdgrl.md
+++ b/_pubs/sdgrl.md
@@ -12,8 +12,9 @@ authors:
     affiliation: Google DeepMind
   - name: Christopher D. Manning
     affiliation: Stanford University
-venue: arXiv preprint
+venue: preprint
 year: 2025
+date: 2025-04-28
 month: April
 day: 28
 has_pdf: true
@@ -21,16 +22,10 @@ doi: 10.48550/arXiv.2504.04736
 tags:
   - model
  - generative ai
-teaser: This paper introduces Step-Wise Reinforcement Learning (SWiRL), a framework for improving multi-step reasoning and tool use in language models through synthetic data generation and offline RL. SWiRL decomposes reasoning trajectories into sub-trajectories, enabling fine-grained feedback and significant accuracy improvements across challenging tasks like HotPotQA, GSM8K, MuSiQue, CofCA, and BeerQA. Notably, SWiRL-trained models outperform larger proprietary models in multi-step reasoning while demonstrating strong task generalization and improved cost efficiency.
+teaser: SWiRL is a framework that improves language model reasoning through synthetic data generation and step-wise reinforcement learning, enabling models to outperform larger proprietary models across diverse reasoning tasks while demonstrating strong generalization capabilities.
 materials:
   - name: Paper
     url: https://arxiv.org/abs/2504.04736
     type: file-pdf
-  - name: Code Repository
-    url: https://github.com/ScalingIntelligence/swirl_rl
-    type: code
-  - name: Synthetic Multi-Step Reasoning Dataset
-    url: https://huggingface.co/datasets/ScalingIntelligence/swirl_synthetic_data
-    type: database
 ---
 Reinforcement learning has been shown to improve the performance of large language models. However, traditional approaches like RLHF or RLAIF treat the problem as single-step. As focus shifts toward more complex reasoning and agentic tasks, language models must take multiple steps of text generation, reasoning and environment interaction before generating a solution. We propose a synthetic data generation and RL methodology targeting multi-step optimization scenarios. This approach, called Step-Wise Reinforcement Learning (SWiRL), iteratively generates multi-step reasoning and tool use data, and then learns from that data. It employs a simple step-wise decomposition that breaks each multi-step trajectory into multiple sub-trajectories corresponding to each action by the original model. It then applies synthetic data filtering and RL optimization on these sub-trajectories. We evaluated SWiRL on a number of multi-step tool use, question answering, and mathematical reasoning tasks. Our experiments show that SWiRL outperforms baseline approaches by 21.5%, 12.3%, 14.8%, 11.1%, and 15.3% in relative accuracy on GSM8K, HotPotQA, CofCA, MuSiQue, and BeerQA, respectively. Excitingly, the approach exhibits generalization across tasks: for example, training only on HotPotQA (text question-answering) improves zero-shot performance on GSM8K (a math dataset) by a relative 16.9%.
diff --git a/_pubs/tpt.md b/_pubs/tpt.md
index c135ded7..49c51bb0 100644
--- a/_pubs/tpt.md
+++ b/_pubs/tpt.md
@@ -6,8 +6,9 @@ authors:
   - name: Anna Goldie
     affiliation: Stanford University
   - key: azaliamirhoseini
-venue: arXiv preprint
+venue: preprint
 year: 2025
+date: 2025-04-25
 month: April
 day: 25
 has_pdf: true
@@ -16,7 +17,7 @@ tags:
   - machine learning
   - model
   - generative ai
-teaser: This paper introduces the Think, Prune, Train (TPT) framework, a scalable method for improving language model reasoning without increasing model size. By iteratively fine-tuning models on their own reasoning traces and applying correctness-based pruning, TPT enables smaller models to achieve performance rivaling or exceeding larger ones. Experimental results on GSM8K and CodeContests show that models like Gemma-2B and LLaMA-70B-Instruct can surpass even GPT-4o on Pass@1 accuracy through recursive self-improvement.
+teaser: Think, Prune, Train (TPT) is a scalable framework that enables smaller language models to achieve performance rivaling larger ones through iterative self-improvement on their own reasoning traces, with experimental results showing models like Gemma-2B and LLaMA-70B-Instruct surpassing GPT-4o on reasoning tasks.
 materials:
   - name: Paper
     url: https://arxiv.org/abs/2504.18116
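The monkeyspower teaser in the patch above rests on a mathematical point: per-problem success under repeated sampling improves exponentially in the number of attempts, yet the average success rate scales as a power law because single-attempt success probabilities are heavy-tailed near zero. The snippet below is a minimal sketch of that mechanism, not code from the paper; the Beta(a, 5) distribution is an assumption chosen purely because its density behaves like p^(a-1) near zero.

```python
# Minimal simulation (illustrative, not from the paper): when single-attempt
# success probabilities are heavy-tailed near zero (density ~ p^(a-1)), the
# *average* failure rate after k attempts decays roughly like k^(-a), a power
# law, even though each individual problem's failure rate (1 - p_i)^k decays
# exponentially in k.
import numpy as np

rng = np.random.default_rng(0)
a = 0.3                                    # assumed left-tail exponent, for illustration
p = rng.beta(a, 5.0, size=200_000)         # per-problem single-attempt success rates
ks = np.logspace(2, 5, num=16)             # numbers of independent attempts per problem

avg_failure = np.array([np.mean((1.0 - p) ** k) for k in ks])

# On log-log axes the curve is nearly a straight line with slope close to -a.
slope, _ = np.polyfit(np.log(ks), np.log(avg_failure), deg=1)
print(f"fitted log-log slope: {slope:.2f} (tail exponent a = {a})")
```

The same point follows analytically: for p ~ Beta(a, b), the average failure rate is E[(1-p)^k] = B(a, b+k) / B(a, b), which decays like k^(-a) up to a constant for large k.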
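The sdgrl abstract above describes a step-wise decomposition that turns each multi-step trajectory into one sub-trajectory per model action. The sketch below illustrates that idea only; it is not the SWiRL implementation, and the `Step` dataclass, field names, and example trajectory are placeholders invented for illustration.

```python
# Illustrative sketch (not the SWiRL code): split one multi-step trajectory
# into per-action sub-trajectories, each pairing the context seen so far with
# the model's next action.
from dataclasses import dataclass

@dataclass
class Step:
    context: str   # prompt plus everything generated or observed before this action
    action: str    # the model's output at this step (reasoning, tool call, or final answer)

def decompose(prompt: str, actions: list[str], observations: list[str]) -> list[Step]:
    """Split one trajectory into one (context, action) example per model action."""
    steps, context = [], prompt
    for i, action in enumerate(actions):
        steps.append(Step(context=context, action=action))
        # Fold the action and any environment feedback (e.g., tool output)
        # into the context for the next step.
        context += "\n" + action
        if i < len(observations):
            context += "\n" + observations[i]
    return steps

# Example: a two-action trajectory (a search call, then a final answer) yields
# two sub-trajectories, each ending at one model action.
example = decompose(
    "Question: Who wrote The Selfish Gene?",
    ["search('The Selfish Gene author')", "Final answer: Richard Dawkins"],
    ["Search result: The Selfish Gene is a 1976 book by Richard Dawkins."],
)
assert len(example) == 2
```

Filtering and RL optimization would then operate on these per-step examples rather than on whole trajectories, which is what enables the fine-grained feedback the abstract mentions.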
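The tpt teaser above describes iterative fine-tuning on a model's own reasoning traces with correctness-based pruning. The outline below is an assumed sketch of that loop, not the paper's code; `generate`, `answer_of`, and `finetune` are caller-supplied stand-ins for a real sampling and training stack.

```python
# Assumed outline of a generate -> prune-by-correctness -> fine-tune loop,
# in the spirit of the TPT teaser. Not the paper's implementation.
from typing import Callable, Sequence

def think_prune_train(
    model,
    problems: Sequence[tuple[str, str]],   # (question, gold_answer) pairs
    generate: Callable,                    # (model, question) -> reasoning trace
    answer_of: Callable[[str], str],       # trace -> extracted final answer
    finetune: Callable,                    # (model, kept_examples) -> updated model
    rounds: int = 3,
    samples_per_problem: int = 4,
):
    """Iteratively fine-tune a model on its own correct reasoning traces."""
    for _ in range(rounds):
        kept = []
        for question, gold in problems:
            for _ in range(samples_per_problem):
                trace = generate(model, question)   # "think": sample a reasoning trace
                if answer_of(trace) == gold:        # "prune": keep only correct traces
                    kept.append((question, trace))
        model = finetune(model, kept)               # "train": fine-tune on kept traces
    return model
```

Keeping the correctness check outside the training step is what lets the same loop run unchanged across datasets with verifiable answers, such as the math and code benchmarks named in the teaser.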