Enough Thinking

Efficient Reasoning via GRPO + SEAL + MCP

Teaching Large Reasoning Models when to think — and when not to.


1. Abstract

Large Reasoning Models (LRMs) frequently over-generate chain-of-thought, even for simple problems, leading to unnecessary latency and cost. In this project, we study reasoning efficiency as a first-class optimization objective.

We present a two-stage reinforcement learning framework:

  1. Phase-1 (GRPO): induces structured reasoning behavior.
  2. Phase-2 (SEAL): internalizes recurring reasoning patterns to reduce token usage without sacrificing correctness.

Finally, we demonstrate that the optimized model can act over real-world infrastructure via the Model Context Protocol (MCP).


2. Why This Project (Motivation)

Recent work shows that:

  • Chain-of-Thought improves accuracy
  • Long reasoning traces are not always necessary

In production settings, excess reasoning:

  • Increases inference cost
  • Hurts latency
  • Limits agent scalability

Key question:

Can a model learn when detailed reasoning is necessary — and compress it when it is not?


3. Related Work & Positioning

This project is inspired by and positioned relative to:

Reasoning & RL

  • DeepSeek-R1 — Group Relative Policy Optimization (GRPO)
  • RLHF / RLAIF — reward-guided behavior shaping

Self-Adaptation

  • SEAL (Self-Adapting LLMs) — inner-loop weight updates
  • Meta-learning & continual learning literature

Tool-Augmented Agents

  • Model Context Protocol (MCP) — standardized tool access
  • Tool-use constrained generation

📌 Key distinction: Most prior work improves accuracy. This project optimizes the accuracy–efficiency tradeoff.


4. Model Choice (Why this model?)

Base Model

We use a small instruction-tuned causal LM (~0.5–1B params).

Why not a larger model?

  • Efficiency effects are easier to observe
  • Faster iteration on limited compute
  • Demonstrates that gains come from training strategy, not scale

Why LoRA?

  • Enables inner-loop adaptation (SEAL)
  • Lightweight, reversible updates
  • Mirrors real deployment constraints (a minimal configuration sketch follows this list)
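
A minimal sketch of such a setup with HuggingFace PEFT (the base model name and hyperparameters below are illustrative, not the project's exact configuration):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Small instruction-tuned causal LM in the ~0.5–1B range (name is illustrative)
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Lightweight, reversible adapter: only the LoRA weights are trained, which is
# what makes SEAL-style inner-loop updates cheap to apply and easy to roll back.
lora_cfg = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()         # a small fraction of the base weights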

5. Method Overview

Phase-0: Baseline

  • Instruction-tuned model
  • Standard prompting
  • No explicit reasoning optimization

Phase-1: “Thinking” — GRPO

Goal: Encourage verification, reflection, and correction behavior.

Mechanism (GRPO):

  • Sample multiple reasoning trajectories per problem
  • Rank trajectories by correctness
  • Optimize policy relative to other samples, not an absolute reward (see the sketch after this list)
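
A minimal sketch of the group-relative advantage at the core of this step (simplified: plain correctness rewards, and none of the clipping or KL-penalty terms of the full GRPO objective):

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Score each trajectory relative to the group sampled for the same problem.

    rewards: shape (G,), e.g. 1.0 for a correct final answer, 0.0 otherwise.
    Normalizing within the group removes the need for a learned value function.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_policy_loss(logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Weight each trajectory's summed token log-probability by its advantage."""
    return -(advantages * logprobs).mean()

# Usage: four sampled reasoning trajectories for one problem, two of them correct.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
logprobs = torch.tensor([-42.1, -57.3, -39.8, -61.0], requires_grad=True)
loss = grpo_policy_loss(logprobs, group_relative_advantages(rewards))
loss.backward()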

Why GRPO (vs PPO)?

  • More stable under small batch sizes
  • Avoids value-function collapse
  • Used successfully in DeepSeek-R1

Outcome:

  • Accuracy ↑
  • Reasoning quality ↑
  • Token usage ↑ moderately

Phase-2: “Enough” — SEAL

Goal: Reduce unnecessary reasoning while preserving correctness.

Key idea:

If the model has already learned a reasoning pattern, it should not regenerate it every time.

Mechanism:

  1. Model proposes a SELF_EDIT rule
  2. Inner-loop LoRA update internalizes the rule
  3. Reward penalizes excess tokens only if accuracy is preserved (see the sketch after this list)
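
A minimal sketch of the accuracy-gated length penalty in step 3 (the token budget and weighting below are illustrative, not the project's exact reward):

def efficiency_reward(is_correct: bool, num_tokens: int,
                      token_budget: int = 128, length_weight: float = 0.5) -> float:
    """Penalize tokens beyond the budget only when the answer is correct.

    Incorrect answers earn nothing regardless of length, so the model is never
    pushed to trade accuracy for brevity.
    """
    if not is_correct:
        return 0.0
    excess = max(0, num_tokens - token_budget)
    return 1.0 - length_weight * (excess / token_budget)

# Usage: a concise correct answer beats a verbose correct one;
# a short wrong answer still earns nothing.
print(efficiency_reward(True, 110))   # 1.0
print(efficiency_reward(True, 220))   # ~0.64
print(efficiency_reward(False, 40))   # 0.0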

Why SEAL?

  • Enables self-modification
  • Bridges reasoning → weights
  • Avoids external distillation pipelines

Phase-3: “Professional” — MCP Integration

Goal: Demonstrate that optimized reasoning transfers to real systems.

Mechanism:

  • Connect model to a local database via MCP
  • Model reasons over structured tool outputs
  • Strict format + tool discipline enforced (see the sketch after this list)
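
A minimal sketch of the "strict format + tool discipline" check (the tool names and argument schemas are hypothetical, not the project's actual MCP server):

import json

# Hypothetical registry mirroring the tools the MCP server exposes.
ALLOWED_TOOLS = {
    "query_database": {"sql"},   # tool name -> required argument names
    "get_schema": set(),
}

def validate_tool_call(model_output: str) -> dict:
    """Reject anything that is not a well-formed call to an allowed tool.

    The model must emit a single JSON object {"tool": ..., "args": {...}};
    malformed JSON, unknown tools, or missing arguments count as
    format-compliance failures.
    """
    call = json.loads(model_output)                 # raises on malformed JSON
    tool, args = call["tool"], call.get("args", {})
    if tool not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {tool}")
    missing = ALLOWED_TOOLS[tool] - set(args)
    if missing:
        raise ValueError(f"missing arguments for {tool}: {missing}")
    return call

# Usage
call = validate_tool_call('{"tool": "query_database", "args": {"sql": "SELECT COUNT(*) FROM orders"}}')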

6. Experimental Setup

Environment

  • Hardware: single GPU (Colab / local)
  • Frameworks: PyTorch, HuggingFace, PEFT
  • Training style: lightweight LoRA fine-tuning

Dataset

  • Grade-school math (GSM-style)
  • Small curated subsets for rapid iteration

Evaluation Metrics

  • Accuracy
  • Average generated tokens
  • Format compliance for MCP tool calls (see the metric sketch after this list)
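
A minimal sketch of how these metrics can be summarized per evaluation run (the record fields are hypothetical, not the project's logging format):

from statistics import mean

def summarize(records: list[dict]) -> dict:
    """Each record holds: 'correct' (bool), 'gen_tokens' (int),
    and 'format_ok' (bool, whether the MCP tool-call format was respected)."""
    return {
        "accuracy": mean(r["correct"] for r in records),
        "avg_tokens": mean(r["gen_tokens"] for r in records),
        "format_compliance": mean(r["format_ok"] for r in records),
    }

# Usage on a toy evaluation set
records = [
    {"correct": True,  "gen_tokens": 118, "format_ok": True},
    {"correct": False, "gen_tokens": 131, "format_ok": True},
    {"correct": True,  "gen_tokens": 102, "format_ok": False},
]
print(summarize(records))   # accuracy ~0.67, avg_tokens = 117.0, format_compliance ~0.67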

7. Results

Accuracy vs Token Efficiency

Method            Accuracy      Avg Tokens
Baseline          ~0.50–0.52    ~215–220
Phase-1 (GRPO)    ~0.62         ~190–200
Phase-2 (SEAL)    ~0.68–0.69    ~110–130

Observation: Phase-2 achieves a 35–45% token reduction relative to Phase-1 while maintaining (and in these runs slightly improving) accuracy.

This represents a clear Pareto improvement.


8. Reproducibility & Variance

This system is intentionally stochastic.

Why values change across runs

  1. RL trajectory sampling
  2. Autoregressive generation variance
  3. SEAL inner-loop adaptation differences
  4. Small evaluation sets

How we interpret results

We do not optimize for single-run point estimates.

Instead, we evaluate:

  • Directional improvements
  • Consistent Pareto dominance
  • Stability across independent runs

This mirrors standard practice in RL and alignment research.


9. How to Run (End-to-End)

# Install dependencies
pip install -r requirements.txt

# Build data, evaluate the baseline, then run Phase-1 (GRPO), Phase-2 (SEAL), and the MCP demo
bash scripts/01_build_data.sh
bash scripts/02_eval_baselines.sh
bash scripts/03_run_phase1_grpo.sh
bash scripts/04_run_phase2_seal.sh
bash scripts/05_run_mcp_demo.sh

# Generate the result plots
python scripts/06_make_plots.py

10. Why This Matters

This project shows that:

  • Overthinking is a trainable failure mode
  • Reasoning efficiency can be optimized explicitly
  • Self-adaptation is a viable alternative to distillation
  • Tool-augmented agents benefit from efficient reasoning

11. Limitations & Future Work

  • Larger-scale evaluation
  • Multi-task generalization
  • Formal cost-aware reward shaping
  • Integration with planning-heavy agents

12. References

  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  • SEAL: Self-Adapting Language Models
  • Model Context Protocol (Anthropic)
