Efficient Reasoning via GRPO + SEAL + MCP
Teaching Large Reasoning Models when to think — and when not to.
Large Reasoning Models (LRMs) frequently over-generate chain-of-thought, even for simple problems, leading to unnecessary latency and cost. In this project, we study reasoning efficiency as a first-class optimization objective.
We present a two-stage reinforcement learning framework:
- Phase-1 (GRPO): induces structured reasoning behavior.
- Phase-2 (SEAL): internalizes recurring reasoning patterns to reduce token usage without sacrificing correctness.
Finally, we demonstrate that the optimized model can act over real-world infrastructure via the Model Context Protocol (MCP).
Recent work shows that:
- Chain-of-Thought improves accuracy
- Long reasoning traces are often unnecessary
In production settings, excess reasoning:
- Increases inference cost
- Hurts latency
- Limits agent scalability
Key question:
Can a model learn when detailed reasoning is necessary — and compress it when it is not?
This project is inspired by and positioned relative to:
- DeepSeek-R1 — Group Relative Policy Optimization (GRPO)
- RLHF / RLAIF — reward-guided behavior shaping
- SEAL (Self-Adapting Language Models) — inner-loop weight updates
- Meta-learning & continual learning literature
- Model Context Protocol (MCP) — standardized tool access
- Tool-use constrained generation
📌 Key distinction: Most prior work improves accuracy. This project optimizes the accuracy–efficiency tradeoff.
We use a small instruction-tuned causal LM (~0.5–1B params).
Why not a larger model?
- Efficiency effects are easier to observe
- Faster iteration on limited compute
- Demonstrates that gains come from training strategy, not scale
Why LoRA?
- Enables inner-loop adaptation (SEAL)
- Lightweight, reversible updates
- Mirrors real deployment constraints (setup sketch below)
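A minimal LoRA setup sketch with PEFT; the base checkpoint and hyperparameters below are illustrative assumptions, not the project's exact configuration:

```python
# Minimal LoRA sketch with HuggingFace PEFT.
# Checkpoint name, rank, and target modules are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

lora_cfg = LoraConfig(
    r=16,                                 # low-rank dimension
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of base params
```

Because the adapter weights are small and separate from the base model, inner-loop updates (Phase-2) stay lightweight and reversible.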
Baseline:
- Instruction-tuned model
- Standard prompting
- No explicit reasoning optimization
Phase-1 (GRPO) goal: Encourage verification, reflection, and correction behavior.
Mechanism (GRPO):
- Sample multiple reasoning trajectories per problem
- Rank trajectories by correctness
- Optimize the policy relative to the other samples in the group, not an absolute reward (advantage sketch after this list)
Why GRPO (vs PPO)?
- More stable under small batch sizes
- Avoids training a separate value function (critic)
- Used successfully in DeepSeek-R1
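A minimal sketch of the group-relative advantage at the core of GRPO, assuming a simple 0/1 correctness reward; the function name is ours, not from the papers:

```python
# Group-relative advantage sketch (GRPO core idea).
# Rewards are normalized against the group's own statistics,
# so no learned value function is needed.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize per-trajectory rewards against the group mean/std.

    rewards: shape (G,), one scalar per trajectory sampled for the same problem.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 trajectories sampled for one problem, two of them correct.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
advantages = group_relative_advantages(rewards)
# Correct trajectories receive positive advantage, incorrect ones negative;
# the policy gradient then up-weights tokens from the correct samples.
```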
Outcome:
- Accuracy ↑
- Reasoning quality ↑
- Token usage ↑ moderately
Phase-2 (SEAL) goal: Reduce unnecessary reasoning while preserving correctness.
Key idea:
If the model has already learned a reasoning pattern, it should not regenerate it every time.
Mechanism:
- Model proposes a SELF_EDIT rule
- Inner-loop LoRA update internalizes the rule
- Reward penalizes excess tokens only if accuracy is preserved (reward sketch below)
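A minimal sketch of the accuracy-gated length penalty; the penalty weight `lam` and the reward scale are illustrative assumptions:

```python
# Accuracy-gated length penalty sketch for Phase-2.
# `lam` and the reward scale are illustrative assumptions.
def phase2_reward(is_correct: bool, num_tokens: int, lam: float = 0.002) -> float:
    """Reward correctness first; penalize tokens only on correct answers."""
    if not is_correct:
        return 0.0                  # wrong answers earn nothing, short or long
    return 1.0 - lam * num_tokens   # correct answers pay a small per-token cost

# A correct 120-token answer scores 0.76, a correct 200-token answer 0.60,
# and both beat any incorrect answer (0.0) -- so the model is never rewarded
# for being short but wrong.
```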
Why SEAL?
- Enables self-modification
- Bridges reasoning → weights
- Avoids external distillation pipelines
MCP demo goal: Demonstrate that optimized reasoning transfers to real systems.
Mechanism:
- Connect the model to a local database via MCP (server sketch below)
- Model reasons over structured tool outputs
- Strict format + tool discipline enforced
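A minimal sketch of the demo's server side using the official `mcp` Python SDK (FastMCP); the server name, database path, and `run_query` tool are illustrative assumptions:

```python
# Minimal MCP server exposing a read-only SQLite query tool.
# Server name, db path, and tool signature are illustrative assumptions.
import sqlite3
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("local-db-demo")

@mcp.tool()
def run_query(sql: str) -> str:
    """Execute a read-only SQL query against the local demo database."""
    conn = sqlite3.connect("demo.db")
    try:
        rows = conn.execute(sql).fetchall()
        return "\n".join(str(row) for row in rows)
    finally:
        conn.close()

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; the model connects as an MCP client
```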
Setup:
- Hardware: single GPU (Colab / local)
- Frameworks: PyTorch, HuggingFace, PEFT
- Training style: lightweight LoRA fine-tuning
Data:
- Grade-school math (GSM-style)
- Small curated subsets for rapid iteration
Metrics (scoring sketch below):
- Accuracy
- Average generated tokens
- Format compliance (for MCP)
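A sketch of per-example scoring for these metrics; the `<answer>` tag convention and the `score_example` helper are assumptions for illustration:

```python
# Per-example scoring sketch: accuracy, generated tokens, format compliance.
# The <answer>...</answer> tag convention is an assumption for illustration.
import re

def score_example(generated_text: str, gold_answer: str, tokenizer) -> dict:
    """Score one model output for accuracy, length, and format compliance."""
    match = re.search(r"<answer>(.*?)</answer>", generated_text, re.DOTALL)
    pred = match.group(1).strip() if match else None
    return {
        "correct": pred == gold_answer,
        "gen_tokens": len(tokenizer.encode(generated_text)),  # any HF tokenizer
        "format_ok": match is not None,  # format discipline needed for MCP
    }
```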
Representative results (approximate ranges across runs):

| Method | Accuracy | Avg Tokens |
|---|---|---|
| Baseline | ~0.50–0.52 | ~215–220 |
| Phase-1 (GRPO) | ~0.62 | ~190–200 |
| Phase-2 (SEAL) | ~0.68–0.69 | ~110–130 |
Observation: Phase-2 cuts token usage by roughly 35–45% while accuracy improves rather than degrades.
This is a clear Pareto improvement: higher accuracy at lower cost.
This system is intentionally stochastic. Sources of variance include:
- RL trajectory sampling
- Autoregressive generation variance
- SEAL inner-loop adaptation differences
- Small evaluation sets
We do not optimize for single-run point estimates.
Instead, we evaluate:
- Directional improvements
- Consistent Pareto dominance
- Stability across independent runs
This mirrors standard practice in RL and alignment research.
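A small sketch of the aggregation this implies, reporting mean ± standard deviation across independent seeds; the run values below are placeholders, not measured results:

```python
# Aggregate a metric across independent runs instead of one point estimate.
# The run values below are placeholders, not measured results.
from statistics import mean, stdev

phase2_accuracy_runs = [0.67, 0.69, 0.68]  # placeholder: one value per seed

print(f"accuracy: {mean(phase2_accuracy_runs):.3f} "
      f"± {stdev(phase2_accuracy_runs):.3f} "
      f"over {len(phase2_accuracy_runs)} runs")
```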
Quick start:

pip install -r requirements.txt
bash scripts/01_build_data.sh
bash scripts/02_eval_baselines.sh
bash scripts/03_run_phase1_grpo.sh
bash scripts/04_run_phase2_seal.sh
bash scripts/05_run_mcp_demo.sh
python scripts/06_make_plots.py

This project shows that:
- Overthinking is a trainable failure mode
- Reasoning efficiency can be optimized explicitly
- Self-adaptation is a viable alternative to distillation
- Tool-augmented agents benefit from efficient reasoning
Future directions:
- Larger-scale evaluation
- Multi-task generalization
- Formal cost-aware reward shaping
- Integration with planning-heavy agents
References:
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-AI)
- SEAL: Self-Adapting Language Models
- Model Context Protocol (Anthropic)