This research investigates token consumption patterns across five distinct approaches for integrating AI assistants with development tools. Using a controlled data analysis task with a 500-row dataset, we measured token usage, API call efficiency, and scalability characteristics. Results demonstrate that optimized tool integration can reduce token consumption by 44% compared to baseline code generation, and by 81% compared to the least efficient approach, with significant implications for production deployment costs and system design.
Principal Findings:
- Optimized MCP approach: 60K tokens (44% reduction vs baseline, 81% vs least efficient approach)
- Progressive discovery proxy: 81K-155K tokens (50-75% reduction vs baseline)
- UTCP code-mode approach: 182K-240K tokens (unexpectedly higher than baseline)
- Baseline code generation: 108K-158K tokens
- Vanilla MCP approach: 204K-309K tokens (least efficient due to data passing)
Modern AI-assisted development relies on large language models (LLMs) that consume tokens for both input and output. As organizations scale AI integration into production workflows, token efficiency becomes a critical cost and performance factor. This research aims to:
- Quantify token efficiency across different tool integration architectures
- Identify scalability characteristics with varying dataset sizes
- Evaluate trade-offs between flexibility, efficiency, and implementation complexity
- Establish evidence-based guidelines for production system design
- Benchmark emerging protocols (MCP, UTCP, progressive discovery)
Token consumption directly impacts:
- Operational costs: At scale, token efficiency translates to significant cost savings
- Latency: Fewer tokens reduce processing time and network overhead
- Context window utilization: Efficient approaches preserve context for complex reasoning
- Scalability: Data-passing approaches fail with large datasets; file-based approaches scale linearly
- How does tool integration architecture affect token consumption?
- What is the relationship between dataset size and token efficiency across approaches?
- Can progressive tool discovery reduce initial context overhead?
- Do declarative code-generation approaches (UTCP) improve efficiency?
- What are the implementation trade-offs for production deployment?
Controlled Variables:
- Task description: Identical 160-word prompt across all approaches
- Dataset: 500 employee records (7 departments, 7 locations, realistic distributions)
- Required outputs: 4 statistical analyses + 4 visualizations
- Model: Claude Sonnet 4.5 (claude-sonnet-4-5-20250929)
- Environment: Claude Code CLI with network instrumentation
Independent Variable:
- Tool integration approach (5 variants)
Dependent Variables:
- Total token consumption (input + output + cache)
- API call count
- Token distribution per request
- Cumulative token growth
Architecture: LLM generates and executes Python scripts guided by skills
Implementation:
- Skills provide domain guidance without explicit tools
- LLM writes complete Python scripts for analysis and visualization
- Iterative refinement through script generation
Hypothesis: Maximum flexibility but higher token overhead due to code verbosity
Architecture: Model Context Protocol with direct data passing
Implementation:
- MCP server exposes tools: read_csv_data, analyze_data, create_visualization
- Full data arrays passed as tool parameters
- Sequential tool calls for each operation
Hypothesis: Reduced API calls but high token cost for large datasets
Architecture: File-path based MCP tools
Implementation:
- Tools accept file paths instead of data arrays: analyze_csv_file, create_visualization_from_file
- Server reads files internally
- Enables parallel tool calls (multiple visualizations in single request)
Hypothesis: Minimal token overhead, scales with dataset size
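To make the file-path pattern concrete, here is a minimal sketch of such a server using the MCP TypeScript SDK. The tool name matches the description above, but the parameter shape, analysis logic, and server metadata are illustrative assumptions, not the experiment's actual implementation:

// file-path-server.ts - sketch only; analysis logic and metadata are assumed
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
import { readFileSync } from "node:fs";

const server = new McpServer({ name: "data-analysis", version: "1.0.0" });

server.tool(
  "analyze_csv_file",
  { path: z.string(), group_by: z.string() },
  async ({ path, group_by }) => {
    // The model sent ~20 tokens (a path), not ~10,000 (the serialized rows).
    const lines = readFileSync(path, "utf8").trim().split("\n");
    // ...parse header, aggregate rows by group_by, compute statistics...
    return {
      content: [{ type: "text", text: `Aggregated ${lines.length - 1} rows by ${group_by}` }],
    };
  }
);

await server.connect(new StdioServerTransport());

Because the model only ever sends a short path string, per-call context stays essentially constant no matter how many rows the CSV holds.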
Architecture: Progressive tool discovery via meta-tools
Implementation:
- Initial context: 2 meta-tools (describe_tools, use_tool), ~400 tokens
- Tools loaded on-demand vs. upfront (~10,000+ tokens for all tools)
- 90%+ reduction in initial overhead
Hypothesis: Lower initial overhead, efficiency improves across sessions
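The sketch below illustrates the general meta-tool pattern; it is not one-mcp's actual code, and the function signatures are assumptions. The point is that full tool schemas live server-side and are paid for only when the model asks:

// proxy-sketch.ts - the meta-tool pattern in outline; not one-mcp's actual code
type ToolEntry = { description: string; run: (args: unknown) => Promise<string> };

// The full catalog lives in the proxy; only two meta-tools reach the model's context.
const catalog = new Map<string, ToolEntry>(/* ...10+ downstream MCP tools... */);

// Meta-tool 1: schemas are fetched on demand instead of preloaded upfront.
async function describe_tools(names: string[]): Promise<string> {
  return JSON.stringify(
    names.map((name) => ({ name, description: catalog.get(name)?.description }))
  );
}

// Meta-tool 2: dispatches any catalog tool by name.
async function use_tool(name: string, args: unknown): Promise<string> {
  const entry = catalog.get(name);
  if (!entry) throw new Error(`Unknown tool: ${name}`);
  return entry.run(args);
}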
Architecture: Universal Tool Calling Protocol with TypeScript code generation
Implementation:
- LLM generates TypeScript code that calls MCP tools
- Single execution of generated code
- Claimed 60% faster, 68% fewer tokens, 88% fewer API calls
Hypothesis: Code generation + tool integration = best of both worlds
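For intuition, the following is a hypothetical example of the kind of script a code-mode bridge might generate for this task; the tools binding and its method names are invented for illustration and do not reflect UTCP's real API:

// generated-script.ts - hypothetical code-mode output; the `tools` binding is invented
declare const tools: {
  analyze_data(args: { path: string; metric: string }): Promise<unknown>;
  create_visualization(args: { path: string; type: string }): Promise<unknown>;
};

// One script chains every call, so the model pays output tokens for code
// instead of accumulating per-call tool context across many API turns.
const path = "shared/sample-data.csv";
await tools.analyze_data({ path, metric: "salary_by_department" });
for (const type of ["bar", "scatter", "pie", "bar"]) {
  await tools.create_visualization({ path, type });
}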
Network Instrumentation:
// networkLog.js - intercepts all HTTP(S) requests by wrapping global fetch
const originalFetch = globalThis.fetch;
globalThis.fetch = async (url, options = {}) => {
  const start = Date.now();
  const response = await originalFetch(url, options);
  // Capture full request/response including token usage (clone: a body reads once)
  const responseBody = await response.clone().json().catch(() => null);
  logAPICall({
    url: String(url),
    method: options.method ?? 'GET',
    status: response.status,
    duration: Date.now() - start,
    requestBody: options.body,
    responseBody,
    parsedMessage: { usage: responseBody?.usage }, // { input_tokens, output_tokens, ... }
  });
  return response;
};

Metrics Captured:
- input_tokens: Regular input tokens
- output_tokens: Generated tokens
- cache_creation_input_tokens: Prompt cache creation
- cache_read_input_tokens: Prompt cache hits
- total_tokens: Sum of all token types
- duration: Request latency
- model: Model identifier
- stop_reason: Completion reason
Data Pipeline:
- Raw data collection: Network logs with full request/response (PII included)
- PII redaction: Remove API keys, user IDs, workspace paths, emails, phone numbers
- JSONL conversion: One API call per line for analysis
- Visualization: Python matplotlib for comparative charts
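A sketch of step 3, assuming captured calls are held in memory; the field names mirror the metrics listed above, while the file handling is illustrative:

// to-jsonl.ts - sketch of the JSONL conversion step
import { appendFileSync } from "node:fs";

interface ApiCallRecord {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens: number;
  cache_read_input_tokens: number;
  total_tokens: number;
  duration: number; // request latency in ms
  model: string;
  stop_reason: string;
}

// One API call per line, so pandas can stream-read the file later.
function writeJsonl(records: ApiCallRecord[], outPath: string): void {
  for (const record of records) {
    appendFileSync(outPath, JSON.stringify(record) + "\n");
  }
}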
All collected data underwent automated PII redaction:
- API keys and tokens → [REDACTED]
- User/account/session IDs → [ID_REDACTED]
- Workspace paths → /workspace
- Email addresses → [EMAIL]
- Phone numbers → [PHONE]
- OS version details → [OS_VERSION]
Content and tool interactions preserved for analysis.
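A simplified sketch of what such a redaction pass might look like; the patterns below are illustrative and far less thorough than the repository's actual clean-data.js pipeline:

// redact-sketch.ts - illustrative subset of the redaction rules listed above
const RULES: Array<[RegExp, string]> = [
  [/\bsk-[A-Za-z0-9_-]{20,}\b/g, "[REDACTED]"],                    // API keys
  [/\b(?:user|account|session)_[A-Za-z0-9]+\b/g, "[ID_REDACTED]"], // IDs
  [/[\w.+-]+@[\w-]+\.[\w.]+/g, "[EMAIL]"],                         // email addresses
  [/\+?\d[\d\s().-]{7,}\d/g, "[PHONE]"],                           // phone numbers
];

function redact(text: string): string {
  return RULES.reduce((out, [pattern, token]) => out.replace(pattern, token), text);
}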
Sessions per approach: 3 (for variance measurement)
Aggregation: Mean and variance across sessions
Visualization: Per-request and cumulative token usage charts
Comparison: Both within-approach (session variance) and cross-approach (efficiency ranking)
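The consistency figures reported below follow directly from these definitions. A minimal sketch, using the sample standard deviation over each approach's three session totals:

// session-stats.ts - mean, sample standard deviation, and CV for one approach
function sessionStats(totals: number[]): { mean: number; stdDev: number; cv: number } {
  const mean = totals.reduce((sum, t) => sum + t, 0) / totals.length;
  const variance =
    totals.reduce((sum, t) => sum + (t - mean) ** 2, 0) / (totals.length - 1);
  const stdDev = Math.sqrt(variance);
  return { mean, stdDev, cv: stdDev / mean }; // CV is reported below as a percentage
}

// e.g. MCP Optimized: sessionStats([60307, 60144, 60808]).cv ≈ 0.006 (0.6%)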
| Rank | Approach | Avg Total Tokens | Avg API Calls | Tokens/Call | Efficiency vs Baseline |
|---|---|---|---|---|---|
| 1 | MCP Optimized | 60,420 | 4 | 15,105 | +44-62% ✓ |
| 2 | MCP Proxy | 81,415-154,734 | 5-8 | 16,283-19,342 | +25-50% |
| 3 | Code-Skill | 108,566-157,749 | 6-8 | 18,094-19,719 | Baseline |
| 4 | UTCP Code-Mode | 182,377-239,542 | 9-11 | 20,264-21,777 | -40-68% |
| 5 | MCP Vanilla | 204,099-309,053 | 7-9 | 29,157-34,339 | -88-195% ❌ |
Session 1: 60,307 tokens (4 calls, 15,077 avg)
Session 2: 60,144 tokens (4 calls, 15,036 avg)
Session 3: 60,808 tokens (4 calls, 15,202 avg)
Variance: ±0.6% (extremely consistent)
Efficiency: 81% better than worst approach, 44% better than baseline
Key Characteristics:
- Minimal context consumption (file paths only)
- Parallel tool execution (4 visualizations in single request)
- Linear scaling with dataset size
- Lowest variance across sessions
Architecture Advantages:
// Single API call generates 4 visualizations in parallel
{
"tool_uses": [
{ "name": "create_visualization_from_file", "input": { "path": "...", "type": "bar" } },
{ "name": "create_visualization_from_file", "input": { "path": "...", "type": "scatter" } },
{ "name": "create_visualization_from_file", "input": { "path": "...", "type": "pie" } },
{ "name": "create_visualization_from_file", "input": { "path": "...", "type": "bar" } }
]
}
// Total context: ~400 tokens vs ~10,000+ if passing data arrays

Session 1: 154,734 tokens (8 calls, 19,342 avg) - Initial discovery overhead
Session 2: 81,528 tokens (5 calls, 16,306 avg) - Optimized after discovery
Session 3: 81,415 tokens (5 calls, 16,283 avg) - Stable performance
Variance: ±47% (session 1 vs 2-3)
Efficiency: 50% better than baseline in steady state
Key Characteristics:
- Progressive discovery: 2 meta-tools initially vs 10+ full tool descriptions
- Initial overhead amortized across sessions
- 90% reduction in upfront tool context
- Converges to efficient pattern after discovery
Progressive Discovery Pattern:
Session 1: describe_tools() → load needed tools → use_tool()
High overhead from discovery process
Sessions 2+: Tools cached, direct use_tool() calls
Steady-state efficiency achieved
Session 1: 157,749 tokens (8 calls, 19,719 avg)
Session 2: 132,702 tokens (7 calls, 18,957 avg)
Session 3: 108,566 tokens (6 calls, 18,094 avg)
Variance: ±31% (session-to-session)
Efficiency: Reference baseline
Key Characteristics:
- High variance due to different code generation paths
- Sequential script generation and execution
- Full code in context each iteration
- Debugging overhead adds API calls
Variance Analysis: Different solution paths lead to unpredictable token usage:
- Session 1: More debugging iterations (8 calls)
- Session 2: Medium complexity path (7 calls)
- Session 3: Efficient path found (6 calls)
Session 1: 190,113 tokens (9 calls, 21,124 avg)
Session 2: 182,377 tokens (9 calls, 20,264 avg)
Session 3: 239,542 tokens (11 calls, 21,777 avg)
Variance: ±23% (session-to-session)
Efficiency: 40-68% WORSE than baseline ⚠️
Key Characteristics:
- Generates TypeScript code to call MCP tools
- Higher token overhead than direct tool calls
- More API calls than expected
- Does NOT achieve claimed efficiency gains
Analysis of Unexpected Results: Contrary to claimed "68% fewer tokens, 88% fewer API calls":
- Code generation adds verbosity vs direct tool calls
- TypeScript compilation/execution overhead
- Error handling requires additional iterations
- Not optimized for file-based operations
Hypothesis: UTCP may excel in different use cases (complex workflows, conditional logic), but not data analysis tasks.
Session 1: 299,908 tokens (9 calls, 33,323 avg)
Session 2: 309,053 tokens (9 calls, 34,339 avg)
Session 3: 204,099 tokens (7 calls, 29,157 avg)
Variance: ±34% (session-to-session)
Efficiency: 88-195% WORSE than baseline ❌
Key Characteristics:
- Passes full 500-row dataset in every tool call
- Data duplication across multiple operations
- Context window saturation
- Does NOT scale with dataset size
Token Breakdown Analysis:
Example tool call with data passing:
{
"data": [
{"name": "Alice", "dept": "Engineering", "salary": 95000, ...}, // Row 1
{"name": "Bob", "dept": "Marketing", "salary": 75000, ...}, // Row 2
... // 498 more rows
]
}
Token cost: ~8,000-12,000 tokens per call just for data
Total across 9 calls: ~80,000 tokens wasted on data duplication
Dataset Size vs Token Consumption:
| Approach | 20 rows | 500 rows | Scaling Factor |
|---|---|---|---|
| MCP Optimized | ~40K | ~60K | 1.5x (minimal) ✓ |
| MCP Proxy | ~60K | ~80K-155K | 1.3-2.6x |
| Code-Skill | ~100K | ~108K-158K | 1.1-1.6x |
| UTCP Code-Mode | ~140K | ~182K-240K | 1.3-1.7x |
| MCP Vanilla | ~105K | ~204K-309K | 2.0-2.9x ❌ |
Key Finding: Token usage for file-path approaches (MCP Optimized) stays nearly flat as the dataset grows, while data-passing approaches (MCP Vanilla) pay to re-serialize every row on every call, becoming prohibitively expensive with large datasets.
Extrapolation to 10,000 rows:
- MCP Optimized: ~65K tokens (minimal increase)
- MCP Vanilla: ~500K+ tokens (unsustainable)
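A back-of-envelope sketch of why data passing dominates at scale. The ~20 tokens-per-row figure is an assumption chosen to be consistent with the ~8-12K per-call cost reported above, not a measured constant:

// envelope.ts - back-of-envelope data-token model; 20 tokens/row is an assumption
const TOKENS_PER_ROW = 20; // consistent with ~10K tokens per 500-row call above

// Data-passing cost: the dataset is re-serialized into every tool call.
function vanillaDataTokens(rows: number, callsWithData: number): number {
  return rows * TOKENS_PER_ROW * callsWithData;
}

console.log(vanillaDataTokens(500, 9));    // 90,000 - close to the ~80K measured
console.log(vanillaDataTokens(10_000, 9)); // 1,800,000 - unsustainable at scale
// File-path calls carry only a path, so their data cost is constant in row count.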
Coefficient of Variation (CV) across 3 sessions:
| Approach | Mean Tokens | Std Dev | CV | Consistency |
|---|---|---|---|---|
| MCP Optimized | 60,420 | 343 | 0.6% | Excellent ✓ |
| MCP Vanilla | 271,020 | 57,512 | 21.2% | Poor |
| Code-Skill | 133,006 | 24,884 | 18.7% | Poor |
| UTCP Code-Mode | 204,011 | 31,149 | 15.3% | Moderate |
| MCP Proxy* | 105,892 | 42,203 | 39.9% | Poor initially |
*Note: MCP Proxy variance primarily from session 1 discovery overhead; sessions 2-3 are consistent (CV ~0.5%)
Interpretation:
- Tool-based approaches with deterministic workflows (MCP Optimized) show excellent consistency
- Code generation approaches (Code-Skill, UTCP) show high variance due to solution path differences
- Progressive discovery (MCP Proxy) requires warm-up period but then stabilizes
Assumptions: Claude Sonnet 4.5 pricing (~$3/M input tokens, ~$15/M output tokens)
Per-Session Costs (Average):
| Approach | Input Tokens | Output Tokens | Input Cost | Output Cost | Total Cost |
|---|---|---|---|---|---|
| MCP Optimized | 57,743 | 2,677 | $0.173 | $0.040 | $0.213 ✓ |
| MCP Proxy | 103,317 | 2,909 | $0.310 | $0.044 | $0.354 |
| Code-Skill | 129,427 | 3,579 | $0.388 | $0.054 | $0.442 |
| UTCP Code-Mode | 200,091 | 3,919 | $0.600 | $0.059 | $0.659 |
| MCP Vanilla | 256,228 | 14,792 | $0.769 | $0.222 | $0.991 ❌ |
ROI Analysis (1,000 sessions/month):
| Approach | Monthly Cost | Savings vs Baseline | Annual Savings |
|---|---|---|---|
| MCP Optimized | $213 | $229 (52%) | $2,748 ✓ |
| MCP Proxy | $354 | $88 (20%) | $1,056 |
| Code-Skill | $442 | Baseline | - |
| UTCP Code-Mode | $659 | -$217 (-49%) | -$2,604 ❌ |
| MCP Vanilla | $991 | -$549 (-124%) | -$6,588 ❌ |
Break-Even Analysis:
MCP Optimized server development cost amortizes after:
- ~4,400 sessions at 1 week development time ($1,000 value): $1,000 / $0.229 saved per session
- ~21,800 sessions at 1 month development time ($5,000 value)
At 1,000 sessions/month, the $2,748 annual savings recoup a 1-month development investment ($5,000) in under two years (~55% annual return from token savings alone).
Sequential vs Parallel Tool Calls:
| Operation | Code-Skill | MCP Vanilla | MCP Optimized |
|---|---|---|---|
| Generate viz 1 | Call 1 | Call 1 | Call 1 (parallel) |
| Generate viz 2 | Call 2 | Call 2 | Call 1 (parallel) |
| Generate viz 3 | Call 3 | Call 3 | Call 1 (parallel) |
| Generate viz 4 | Call 4 | Call 4 | Call 1 (parallel) |
| Total Calls | 4 | 4 | 1 ✓ |
| Wall Time | 4x latency | 4x latency | 1x latency |
Impact:
- Latency reduction: 4x faster wall-clock time for parallel operations
- Token reduction: No repeated context for each operation
- Only possible with: Independent tools + file-path architecture
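On the host side, this corresponds to dispatching the model's independent tool calls concurrently; a sketch with an assumed callTool signature, not Claude Code's actual internals:

// parallel-dispatch.ts - independent file-path tool calls can run concurrently
// `callTool` stands in for the host's MCP client invocation (assumed signature).
declare function callTool(name: string, input: object): Promise<unknown>;

const specs = ["bar", "scatter", "pie", "bar"].map((type) => ({
  name: "create_visualization_from_file",
  input: { path: "shared/sample-data.csv", type },
}));

// All four visualizations resolve within one request round-trip: 1x latency, not 4x.
await Promise.all(specs.map(({ name, input }) => callTool(name, input)));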
MCP Proxy Learning Curve:
Session 1 (Discovery):
describe_tools() → 8 API calls → 154,734 tokens
High overhead from exploring available tools
Session 2 (Optimized):
Direct tool usage → 5 API calls → 81,528 tokens
47% reduction from session 1
Session 3 (Stable):
Efficient workflow → 5 API calls → 81,415 tokens
Consistent performance achieved
Implication: Systems with repeated usage benefit significantly from progressive discovery after initial warm-up.
Selecting an approach requires balancing competing objectives across three primary dimensions:
- Context Efficiency vs API Call Count
- Variance Tolerance (Consistency Requirements)
- Task Repeatability (One-off vs Production)
These dimensions are interdependent and create distinct optimization profiles for each approach.
Fundamental Tradeoff:
- Fewer API calls often require more context per call (complex instructions, data passing)
- Less context per call may require more API calls (iterative refinement, sequential operations)
Approach Positioning:
| Approach | Total Tokens | API Calls | Tokens/Call | Efficiency Profile |
|---|---|---|---|---|
| MCP Optimized | 60,420 | 4 | 15,105 | Optimal balance ✓ |
| MCP Proxy | 81,415-154,734 | 5-8 | 16,283-19,342 | Good (after warm-up) |
| Code-Skill | 108,566-157,749 | 6-8 | 18,094-19,719 | Moderate |
| UTCP Code-Mode | 182,377-239,542 | 9-11 | 20,264-21,777 | Poor (both high) |
| MCP Vanilla | 204,099-309,053 | 7-9 | 29,157-34,339 | Worst (high context) ❌ |
Analysis:
MCP Optimized: Pareto Optimal
- Achieves both lowest total tokens AND fewest API calls
- File-path architecture eliminates context/API call tradeoff
- Parallel execution further reduces API calls without increasing context
- Tradeoff eliminated through architectural innovation
MCP Vanilla: Worst of Both Worlds
- High API call count (7-9) due to sequential operations
- Highest tokens per call (29-34K) due to data passing
- Tradeoff amplified by poor design choices
Code-Skill: Classic Tradeoff
- Moderate API calls (6-8) from iterative development
- Moderate context per call (18-20K) from code verbosity
- Traditional tradeoff profile
UTCP Code-Mode: Unexpected Anti-Pattern
- High API calls (9-11) despite code generation claims
- High tokens per call (20-22K) from code + tool overhead
- Tradeoff exacerbated by additional abstraction layer
MCP Proxy: Time-Dependent Tradeoff
- Session 1: High tokens (154K), high calls (8) - discovery overhead
- Sessions 2+: Moderate tokens (81K), 5 calls - optimized
- Tradeoff improves with usage
Decision Rules:
IF context_budget_critical AND api_latency_acceptable:
→ MCP Optimized (minimizes both)
IF api_calls_must_minimize AND context_abundant:
→ Still MCP Optimized (parallel execution wins)
IF budget_constrained AND high_variance_tolerable:
→ Code-Skill (moderate both, high flexibility)
IF large_tool_catalog AND repeated_usage:
→ MCP Proxy (amortizes discovery cost)
NEVER:
→ MCP Vanilla (loses on both dimensions)
→ UTCP Code-Mode (for data tasks)
Quantitative Thresholds:
| Constraint | Threshold | Recommended Approach |
|---|---|---|
| Total token budget | < 100K | MCP Optimized, MCP Proxy (steady-state) |
| Total token budget | 100K-200K | Code-Skill |
| Total token budget | > 200K | ❌ Re-architect task |
| API call budget | < 5 calls | MCP Optimized (parallel execution) |
| API call budget | 5-10 calls | Code-Skill, MCP Proxy |
| Tokens per call | < 20K | MCP Optimized, MCP Proxy, Code-Skill |
| Tokens per call | > 25K | ❌ MCP Vanilla (redesign needed) |
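The same rules restated as an executable sketch; the constraint names are invented for illustration:

// select-approach.ts - the decision rules above as code; field names are invented
interface Constraints {
  contextBudgetCritical: boolean;
  budgetConstrained: boolean;
  highVarianceTolerable: boolean;
  largeToolCatalog: boolean;
  repeatedUsage: boolean;
}

function selectApproach(c: Constraints): string {
  // MCP Optimized minimizes both tokens and calls, so it wins the first two rules.
  if (c.contextBudgetCritical) return "MCP Optimized";
  if (c.largeToolCatalog && c.repeatedUsage) return "MCP Proxy"; // amortized discovery
  if (c.budgetConstrained && c.highVarianceTolerable) return "Code-Skill";
  return "MCP Optimized"; // Pareto-optimal default
}
// Never returned: MCP Vanilla, or UTCP Code-Mode for data tasks.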
Definition: Acceptable variation in token consumption and API calls across sessions for identical tasks
Variance Sources:
- Solution path diversity (code generation approaches)
- Debugging iterations (trial-and-error execution)
- LLM sampling variation (temperature, non-determinism)
- Tool selection uncertainty (multiple valid sequences)
Measured Variance (Coefficient of Variation):
| Approach | Mean Tokens | Std Dev | CV | Consistency Rating |
|---|---|---|---|---|
| MCP Optimized | 60,420 | 343 | 0.6% | Excellent ✓ |
| MCP Vanilla | 271,020 | 57,512 | 21.2% | Poor |
| Code-Skill | 133,006 | 24,884 | 18.7% | Poor |
| UTCP Code-Mode | 204,011 | 31,149 | 15.3% | Moderate |
| MCP Proxy* | 105,892 | 42,203 | 39.9% | Poor (initially) |
*MCP Proxy: Sessions 2-3 only: CV = 0.5% (excellent after warm-up)
Variance Impact on Production Systems:
Low Variance Systems (CV < 5%):
- Predictable costs: Accurate budget forecasting
- Consistent latency: Reliable SLA compliance
- Stable monitoring: Anomaly detection works well
- Capacity planning: Deterministic resource allocation
High Variance Systems (CV > 15%):
- Cost uncertainty: 20-40% budget variance
- Latency unpredictability: P99 latency 2-3x P50
- Monitoring challenges: Normal variance masks real issues
- Over-provisioning: Must plan for worst-case scenarios
Tradeoff Analysis:
MCP Optimized: Deterministic by Design
- Why low variance: Fixed tool sequence, no debugging iterations
- Tradeoff: Requires upfront workflow definition
- When acceptable: Production systems, SLA-driven applications
- Cost: Inflexible to novel requirements
Code-Skill: Non-Deterministic by Nature
- Why high variance: Different code solutions, varying debug paths
- Tradeoff: Maximum flexibility, unpredictable cost
- When acceptable: Exploratory work, research, prototyping
- Benefit: Handles novel tasks without modification
MCP Proxy: Variance Converges Over Time
- Why initial variance: Tool discovery exploration
- Why eventual consistency: Cached tool knowledge
- Tradeoff: Must tolerate initial instability
- When acceptable: Long-running systems with warm-up period
Decision Matrix:
| Requirement | Variance Tolerance | Recommended Approach |
|---|---|---|
| Production SLA | Low (< 5% variation) | MCP Optimized |
| Cost budgeting | Low (< 10% variation) | MCP Optimized, MCP Proxy (steady) |
| Experimentation | High (20-40% acceptable) | Code-Skill |
| User-facing latency | Low (consistent P99) | MCP Optimized |
| Internal tools | Medium (10-20% ok) | MCP Proxy, Code-Skill |
| Capacity planning | Low (predictable peaks) | MCP Optimized |
Quantitative Thresholds:
IF p99_latency_sla_required OR cost_budget_strict:
variance_tolerance = LOW
→ MCP Optimized ONLY
IF exploratory_task OR prototype_phase:
variance_tolerance = HIGH
→ Code-Skill acceptable
IF production BUT budget_flexible:
variance_tolerance = MEDIUM
→ MCP Proxy (after warm-up)
IF sla_critical AND novel_requirements:
→ CONFLICT: Cannot satisfy both
→ Recommendation: Code-Skill prototype → MCP Optimized migration
Definition: How many times will this exact task be executed?
Economic Model:
Total_Cost = Development_Cost + (Per_Execution_Cost × Execution_Count)
Where:
Development_Cost = Time to implement approach
Per_Execution_Cost = Token cost per execution
Execution_Count = Number of times task runs
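The model as code, with the break-even count derived from it; the dollar inputs come from the cost tables above:

// economics.ts - the total-cost model above, plus the derived break-even count
function totalCost(devCost: number, perExecutionCost: number, executions: number): number {
  return devCost + perExecutionCost * executions;
}

// Executions needed before the cheaper-per-run approach recoups its dev cost.
function breakEven(devCost: number, savingsPerExecution: number): number {
  return Math.ceil(devCost / savingsPerExecution);
}

console.log(breakEven(2_500, 0.23)); // 10,870 (MCP Optimized vs Code-Skill)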
Approach Economics:
| Approach | Dev Cost | Per-Execution Cost | Break-Even Count |
|---|---|---|---|
| Code-Skill | Low ($0, immediate) | High ($0.44) | 0 (always usable) |
| MCP Optimized | High ($1,000-5,000) | Low ($0.21) | ~4,300-21,700 executions |
| MCP Proxy | Medium ($500-2,000) | Medium ($0.35) | ~5,600-22,200 executions |
| MCP Vanilla | Medium ($500-2,000) | Highest ($0.99) | Never breaks even ❌ |
| UTCP Code-Mode | Medium ($500-1,500) | High ($0.66) | Never breaks even ❌ |
Break-Even Analysis:
MCP Optimized vs Code-Skill:
Development_Cost = $2,500 (1 week engineer time)
Savings_Per_Execution = $0.44 - $0.21 = $0.23
Break_Even = $2,500 / $0.23 ≈ 10,870 executions
At 10,000 executions:
Total savings = $0.23 × 10,000 = $2,300 (approaching break-even)
At 100,000 executions:
Total savings = $0.23 × 100,000 = $23,000
ROI = $23,000 / $2,500 = 920%
MCP Proxy vs Code-Skill:
Development_Cost = $1,000 (3 days engineer time)
Savings_Per_Execution = $0.44 - $0.35 = $0.09
Break_Even = $1,000 / $0.09 ≈ 11,100 executions
At 100,000 executions:
Total savings = $9,000
ROI = 900%
Repeatability Decision Framework (token savings alone break even only at high volume; the thresholds below also weigh the consistency and latency benefits quantified earlier):
| Execution Count | Repeatability | Recommended Approach | Rationale |
|---|---|---|---|
| 1 time | One-off | Code-Skill | Zero dev cost, immediate |
| 2-5 times | Low | Code-Skill | ROI insufficient |
| 6-20 times | Medium | MCP Proxy | Discovery overhead amortizes |
| 20-100 times | High | MCP Optimized | Savings and consistency compound |
| 100+ times | Production | MCP Optimized | Largest cumulative savings, predictable cost |
Time Horizon Considerations:
One-off Tasks (1-5 executions):
- Optimize for: Speed to first result
- Accept: High per-execution cost, high variance
- Choose: Code-Skill
- Example: Quarterly board presentation, one-time data migration
Recurring Tasks (20-100 executions):
- Optimize for: Total cost of ownership
- Accept: Upfront development time
- Choose: MCP Optimized or MCP Proxy
- Example: Weekly sales reports, monthly analytics dashboards
Production Workflows (100+ executions):
- Optimize for: Per-execution efficiency, reliability
- Accept: Significant development investment
- Choose: MCP Optimized (always)
- Example: Real-time data processing, automated reporting systems
Hybrid Strategy for Uncertain Repeatability:
Phase 1: Use Code-Skill (executions 1-10)
→ Validate task, understand requirements
→ Total cost: ~$4.40
Phase 2: Decision point (after 10 executions)
IF task_stabilized AND execution_count_forecast > 20:
→ Invest in MCP Optimized
→ Development: $2,500
→ Future savings: $0.23/execution
ELSE:
→ Continue with Code-Skill
→ Re-evaluate at 50 executions
Phase 3: Monitor (ongoing)
→ Track actual execution count
→ Measure token cost trends
→ Migrate to MCP Optimized when break-even certain
Tradeoff Summary:
Code-Skill Optimizes for:
- ✅ Unknown repeatability (no upfront investment)
- ✅ Rapidly changing requirements
- ✅ Immediate results needed
- ❌ Poor for > 20 executions (expensive)
MCP Optimized Optimizes for:
- ✅ High repeatability (exceptional ROI)
- ✅ Stable, well-defined workflows
- ✅ Production systems
- ❌ Poor for one-offs (wasted investment)
MCP Proxy Optimizes for:
- ✅ Medium repeatability (20-100 executions)
- ✅ Evolving tool requirements
- ✅ Large tool catalogs
- ❌ Poor for one-offs or very high frequency
Multi-Dimensional Optimization:
Use this decision tree to select the optimal approach based on your constraints:
START
Q1: Is this a one-off task (< 5 executions)?
YES → Code-Skill (end)
NO → Continue to Q2
Q2: Is total token budget < 100K AND variance < 5% required?
YES → MCP Optimized (end)
NO → Continue to Q3
Q3: Do you have > 20 tools AND task repeats > 50 times?
YES → MCP Proxy (end)
NO → Continue to Q4
Q4: Is execution count > 20 AND requirements stable?
YES → MCP Optimized (end)
NO → Continue to Q5
Q5: Is high variance acceptable (CV > 15%)?
YES → Code-Skill (end)
NO → MCP Optimized (end, invest in stability)
NEVER CHOOSE:
- MCP Vanilla (always suboptimal)
- UTCP Code-Mode (for data analysis tasks)
Tradeoff Visualization:
Context Efficient
▲
│
│ MCP Optimized
│ ●
│ / \
│ / \
│ / \
High Variance ◄────────┼─────────► Low Variance
(Flexible) │ (Predictable)
│ ●
│ Code-Skill
│ \
│ \
│ ● ● MCP Proxy
│ UTCP (post-warmup)
│ \
│ ● MCP Vanilla
▼
Many API Calls
Pareto Frontier:
Only three approaches are on the Pareto frontier (not dominated on all dimensions):
- MCP Optimized: Best context efficiency, best consistency, best for high repeatability
- MCP Proxy: Good efficiency after warmup, good for large tool sets
- Code-Skill: Best flexibility, zero dev cost, best for one-offs
Dominated Approaches (never optimal):
- MCP Vanilla: Dominated by MCP Optimized on all dimensions
- UTCP Code-Mode: Dominated by Code-Skill (worse cost, similar flexibility)
- Architecture Matters More Than Protocol
- File-path approach (60K tokens) vs data-passing (309K tokens) = 5x difference
- Protocol choice (MCP vs UTCP) less impactful than architectural design
- Conclusion: Focus on data flow design, not protocol selection
- Parallelization is Underutilized
- MCP Optimized achieves 4x latency reduction through parallel execution
- Only possible with independent, file-based tools
- Significant competitive advantage in production systems
- Progressive Discovery Shows Promise
- 47% token reduction after warm-up period
- Suitable for large tool catalogs
- Requires session persistence for effectiveness
- UTCP Code-Mode Underperforms for Data Tasks
- 40-68% worse than baseline (contrary to claims)
- May excel in different domains (requires further research)
- Not recommended for data analysis workflows
- Scalability Characteristics are Non-Linear
- File-path approaches: Sub-linear scaling (1.5x from 20→500 rows)
- Data-passing approaches: token use grows with every row passed (2.9x from 20→500 rows)
- Critical consideration for production deployment
| Approach | Production Ready | Confidence | Recommendation |
|---|---|---|---|
| MCP Optimized | Yes | High | Deploy now for frequent workflows |
| MCP Proxy | Yes | Medium | Deploy for large tool catalogs |
| Code-Skill | Yes | High | Keep for novel/exploratory tasks |
| UTCP Code-Mode | No | Low | Avoid for data tasks; research further |
| MCP Vanilla | No | High | Avoid in production (cost prohibitive) |
For Individual Developers:
- Time savings: 30-50% from parallel execution
- Cost reduction: $200-500/year in token costs
- Learning curve: 2-4 weeks to proficiency with tools
For Teams (10 engineers):
- Cost savings: $2,000-5,000/year
- Velocity improvement: 15-25% from reduced debugging
- Infrastructure investment: $10,000-20,000 (tool development)
- ROI timeline: 3-6 months
For Organizations (100+ engineers):
- Cost savings: $50,000-100,000/year
- Competitive advantage: Faster feature delivery
- Platform opportunity: Internal tool marketplace
- Strategic value: Differentiated AI capabilities
Tier 1 (Highest Priority):
- Immediate: Deploy MCP Optimized for top 5 frequent operations
- Month 1: Measure token reduction and ROI
- Month 2: Expand to top 20 operations
Tier 2 (Medium Priority):
- Month 3: Pilot MCP Proxy for large tool catalogs
- Month 4: Develop hybrid routing logic
- Month 6: Full hybrid architecture deployment
Tier 3 (Research):
- Ongoing: Monitor UTCP protocol developments
- Q2: Re-evaluate UTCP for workflow orchestration
- Q3: Cross-model validation studies
Dataset Characteristics:
- Rows: 500
- Columns: 6 (name, department, salary, years_experience, performance_score, location)
- Size: ~45KB CSV
- Distribution: Realistic salary ranges with experience correlation
Environment:
- Model: claude-sonnet-4-5-20250929
- Interface: Claude Code CLI v2.0.42
- OS: macOS (Darwin 24.6.0)
- Node.js: v24.4.1
- Network: Instrumented with custom logging
Sessions: 3 per approach (15 total)
Data Collection Period: November 2025
Analysis Tools: Python (matplotlib, pandas), Node.js
tool-metrics/
├── experiments/data-analysis/
│ ├── code-skill-approach/
│ ├── mcp-approach/
│ ├── mcp-approach-optimized/
│ ├── mcp-proxy-approach/
│ ├── otcp-code-approach/
│ └── shared/sample-data.csv
├── raw-data/experiments/ # Network logs with PII
├── data/experiments/ # Cleaned JSONL data
├── visualizations/ # Comparative charts
├── clean-data.js # PII redaction pipeline
├── sessions-comparison.js # Within-approach analysis
├── approaches-comparison.js # Cross-approach analysis
└── README.md
Protocols and Frameworks:
- Model Context Protocol (MCP)
- Official Specification: https://modelcontextprotocol.io/
- Description: Protocol for connecting AI models to external tools and data sources
- Used in: MCP Vanilla, MCP Optimized, MCP Proxy approaches
- Universal Tool Calling Protocol (UTCP) - Code Mode
- Repository: https://github.com/universal-tool-calling-protocol/code-mode
- Description: Enables writing TypeScript code that calls MCP tools in single execution
- Claims: 60% faster, 68% fewer tokens, 88% fewer API calls
- Used in: UTCP Code-Mode approach
- Note: Claims not validated in this research for data analysis tasks
- one-mcp (MCP Proxy)
- Repository: https://github.com/AgiFlow/aicode-toolkit/blob/main/packages/one-mcp
- Description: Smart MCP proxy providing progressive tool discovery
- Features: Loads 2 meta-tools initially (~400 tokens) instead of all tools upfront (~10,000+ tokens)
- Reduction: 90%+ initial overhead
- Used in: MCP Proxy approach
Claude Code:
- Official Site: https://code.claude.com/
- CLI Repository: https://github.com/anthropics/claude-code
- Version Used: v2.0.42
- Description: Official CLI for Claude AI assistant
Analysis Tools:
- Python: matplotlib, pandas, numpy for visualization
- Node.js: Network instrumentation and data processing
- Claude API: https://docs.anthropic.com/en/api
To reproduce results:
- Clone repository and install dependencies
- Set up Claude Code CLI with API key
- Configure MCP servers (see approach-specific READMEs)
- Run experiment: node run-experiment.js data-analysis <approach> <session-name>
- Clean data: node clean-data.js
- Generate visualizations: node sessions-comparison.js && node approaches-comparison.js
Tool Setup:
- UTCP Bridge: npm install -g @utcp/mcp-bridge (see repository for configuration)
- one-mcp: npm install -g @agiflowai/one-mcp (see repository for mcp-config.yaml)
- Custom MCP Servers: Node.js implementations in each approach directory
Data availability:
- Cleaned data (PII redacted): Published in repository
- Raw logs: Not published (contains PII)
- Visualization code: Open-source
This research was conducted to advance understanding of token efficiency in AI-assisted development. Results are shared openly to benefit the broader engineering community.
Tool Acknowledgments:
- Anthropic for Claude Code and MCP specification
- Universal Tool Calling Protocol team for UTCP code-mode bridge
- AgiFlow for one-mcp progressive discovery proxy
- v1.0 (November 2025): Initial publication with 5 approaches, 500-row dataset
Author: Principal Engineer, AI Systems Research
Contact: [Redacted for privacy]
License: MIT - see LICENSE file for details
