12 changes: 7 additions & 5 deletions _pubs/monkeyspower.md
@@ -1,11 +1,11 @@
 ---
 title: 'How Do Large Language Monkeys Get Their Power (Laws)?'
 authors:
-- name: rylanschaeffer
+- name: Rylan Schaeffer
   affiliation: University of California, Berkeley
-- name: joshuakazdan
+- name: Joshua Kazdan
   affiliation: Stanford University
-- name: johnhughes
+- name: John Hughes
   affiliation: University of Cambridge
 - name: Jordan Juravsky
   affiliation: University of California, Berkeley
@@ -20,16 +20,18 @@ authors:
 - key: azaliamirhoseini
 - name: Sanmi Koyejo
   affiliation: Stanford University
-venue: arXiv preprint
+venue: preprint
 year: 2025
+date: 2025-02-24
 month: February
 day: 24
 has_pdf: true
 doi: 10.48550/arXiv.2502.17578
 tags:
 - machine learning
 - generative ai
-teaser: Recent research documents a surprising finding: increasing inference compute through repeated sampling reveals power-law scaling in average success rates. This occurs despite per-problem exponential scaling, explained by heavy-tailed distributions of single-attempt success rates. Our analysis unifies these observations under a mathematical framework, offering new insights into inference-time scaling laws and more efficient performance forecasting for language models.
+slug: monkeyspower
+teaser: We explain how language models exhibit power-law scaling in success rates despite per-problem exponential scaling, revealing that heavy-tailed distributions of success probabilities drive this phenomenon and enabling more efficient performance forecasting.
 materials:
 - name: Paper
   url: https://arxiv.org/abs/2502.17578
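The revised monkeyspower teaser compresses a mathematical argument: each problem's failure rate falls exponentially with the number of samples, yet the failure rate averaged over problems follows a power law because single-attempt success rates are heavy-tailed. The sketch below illustrates that mechanism numerically; the Beta(0.3, 3) distribution is an illustrative assumption, not the paper's fitted model.

```python
# Minimal sketch of the mechanism the teaser describes, not the paper's code.
# Assumption: single-attempt success rates p_i follow a heavy-tailed Beta(0.3, 3)
# distribution, placing substantial probability mass near zero.
import numpy as np

rng = np.random.default_rng(0)
p = rng.beta(0.3, 3.0, size=200_000)  # per-problem single-attempt success rates

for k in [1, 10, 100, 1_000, 10_000]:
    # Per problem, failure after k independent attempts is (1 - p_i)**k: exponential in k.
    # Averaged over heavy-tailed p_i, the aggregate failure rate decays roughly as a power of k.
    avg_fail = np.mean((1.0 - p) ** k)
    print(f"k={k:>6}  average failure rate={avg_fail:.5f}")
```

Plotted on log-log axes, the printed failure rates fall on an approximately straight line, which is the power-law scaling the teaser refers to.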
11 changes: 3 additions & 8 deletions _pubs/sdgrl.md
@@ -12,25 +12,20 @@ authors:
   affiliation: Google DeepMind
 - name: Christopher D. Manning
   affiliation: Stanford University
-venue: arXiv preprint
+venue: preprint
 year: 2025
+date: 2025-04-28
 month: April
 day: 28
 has_pdf: true
 doi: 10.48550/arXiv.2504.04736
 tags:
 - model
 - generative ai
-teaser: This paper introduces Step-Wise Reinforcement Learning (SWiRL), a framework for improving multi-step reasoning and tool use in language models through synthetic data generation and offline RL. SWiRL decomposes reasoning trajectories into sub-trajectories, enabling fine-grained feedback and significant accuracy improvements across challenging tasks like HotPotQA, GSM8K, MuSiQue, CofCA, and BeerQA. Notably, SWiRL-trained models outperform larger proprietary models in multi-step reasoning while demonstrating strong task generalization and improved cost efficiency.
+teaser: SWiRL is a framework that improves language model reasoning through synthetic data generation and step-wise reinforcement learning, enabling models to outperform larger proprietary models across diverse reasoning tasks while demonstrating strong generalization capabilities.
 materials:
 - name: Paper
   url: https://arxiv.org/abs/2504.04736
   type: file-pdf
-- name: Code Repository
-  url: https://github.com/ScalingIntelligence/swirl_rl
-  type: code
-- name: Synthetic Multi-Step Reasoning Dataset
-  url: https://huggingface.co/datasets/ScalingIntelligence/swirl_synthetic_data
-  type: database
 ---
 Reinforcement learning has been shown to improve the performance of large language models. However, traditional approaches like RLHF or RLAIF treat the problem as single-step. As focus shifts toward more complex reasoning and agentic tasks, language models must take multiple steps of text generation, reasoning and environment interaction before generating a solution. We propose a synthetic data generation and RL methodology targeting multi-step optimization scenarios. This approach, called Step-Wise Reinforcement Learning (SWiRL), iteratively generates multi-step reasoning and tool use data, and then learns from that data. It employs a simple step-wise decomposition that breaks each multi-step trajectory into multiple sub-trajectories corresponding to each action by the original model. It then applies synthetic data filtering and RL optimization on these sub-trajectories. We evaluated SWiRL on a number of multi-step tool use, question answering, and mathematical reasoning tasks. Our experiments show that SWiRL outperforms baseline approaches by 21.5%, 12.3%, 14.8%, 11.1%, and 15.3% in relative accuracy on GSM8K, HotPotQA, CofCA, MuSiQue, and BeerQA, respectively. Excitingly, the approach exhibits generalization across tasks: for example, training only on HotPotQA (text question-answering) improves zero-shot performance on GSM8K (a math dataset) by a relative 16.9%.
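The sdgrl abstract hinges on one structural idea: each multi-step trajectory is broken into sub-trajectories, one per model action, before filtering and RL optimization. A minimal sketch of that decomposition follows; the Step dataclass and function names are assumptions for illustration, not SWiRL's actual API.

```python
# Minimal sketch of the step-wise decomposition described in the abstract.
# The Step dataclass and the decompose signature are illustrative assumptions,
# not SWiRL's implementation.
from dataclasses import dataclass

@dataclass
class Step:
    observation: str  # new context seen at this step (question, tool output, etc.)
    action: str       # the model's generation (reasoning, tool call, or final answer)

def decompose(trajectory: list[Step]) -> list[tuple[str, str]]:
    """Turn one multi-step trajectory into per-action (context, action) sub-trajectories."""
    sub_trajectories = []
    history = ""
    for step in trajectory:
        history += step.observation
        sub_trajectories.append((history, step.action))  # one training example per action
        history += step.action
    return sub_trajectories
```

Each (context, action) pair can then be filtered and optimized on its own, which is what provides the fine-grained, per-step feedback the teaser mentions.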
5 changes: 3 additions & 2 deletions _pubs/tpt.md
@@ -6,8 +6,9 @@ authors:
 - name: Anna Goldie
   affiliation: Stanford University
 - key: azaliamirhoseini
-venue: arXiv preprint
+venue: preprint
 year: 2025
+date: 2025-04-25
 month: April
 day: 25
 has_pdf: true
@@ -16,7 +17,7 @@ tags:
 - machine learning
 - model
 - generative ai
-teaser: This paper introduces the Think, Prune, Train (TPT) framework, a scalable method for improving language model reasoning without increasing model size. By iteratively fine-tuning models on their own reasoning traces and applying correctness-based pruning, TPT enables smaller models to achieve performance rivaling or exceeding larger ones. Experimental results on GSM8K and CodeContests show that models like Gemma-2B and LLaMA-70B-Instruct can surpass even GPT-4o on Pass@1 accuracy through recursive self-improvement.
+teaser: Think, Prune, Train (TPT) is a scalable framework that enables smaller language models to achieve performance rivaling larger ones through iterative self-improvement on their own reasoning traces, with experimental results showing models like Gemma-2B and LLaMA-70B-Instruct surpassing GPT-4o on reasoning tasks.
 materials:
 - name: Paper
   url: https://arxiv.org/abs/2504.18116
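The tpt teasers describe a loop: sample reasoning traces from the current model, prune them by answer correctness, and fine-tune on what survives. A minimal sketch of one round is below; sample_traces, is_correct, and finetune are assumed placeholders rather than the paper's implementation.

```python
# Minimal sketch of one Think-Prune-Train round as described in the teaser.
# sample_traces, is_correct, and finetune are assumed placeholders, not the paper's code.
def tpt_round(model, problems, sample_traces, is_correct, finetune, n_samples=4):
    kept = []
    for problem in problems:
        for trace in sample_traces(model, problem, n_samples):  # "Think": generate candidate traces
            if is_correct(problem, trace):                       # "Prune": keep only correct traces
                kept.append((problem, trace))
    return finetune(model, kept)                                 # "Train": fine-tune on survivors
```

Repeating tpt_round with the returned model gives the recursive self-improvement the teaser describes; the model size never changes, only the data it is trained on.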