
Token Efficiency in AI-Assisted Development: A Comparative Analysis of Tool Integration Patterns

License: MIT

[Figure: Token Usage Comparison across the five approaches]

Abstract

This research investigates token consumption patterns across five distinct approaches for integrating AI assistants with development tools. Using a controlled data analysis task with a 500-row dataset, we measured token usage, API call efficiency, and scalability characteristics. Results demonstrate that optimized tool integration reduces token consumption by 44-62% relative to baseline code generation, and by up to 81% relative to the least efficient approach, with significant implications for production deployment costs and system design.

Principal Findings:

  • Optimized MCP approach: 60K tokens (44-62% reduction vs baseline; 81% vs the least efficient approach)
  • Progressive discovery proxy: 81K-155K tokens (25-50% reduction vs baseline)
  • UTCP code-mode approach: 182K-240K tokens (unexpectedly higher than baseline)
  • Baseline code generation: 108K-158K tokens
  • Vanilla MCP approach: 204K-309K tokens (least efficient due to data passing)

1. Purpose

1.1 Research Objectives

Modern AI-assisted development relies on large language models (LLMs) that consume tokens for both input and output. As organizations scale AI integration into production workflows, token efficiency becomes a critical cost and performance factor. This research aims to:

  1. Quantify token efficiency across different tool integration architectures
  2. Identify scalability characteristics with varying dataset sizes
  3. Evaluate trade-offs between flexibility, efficiency, and implementation complexity
  4. Establish evidence-based guidelines for production system design
  5. Benchmark emerging protocols (MCP, UTCP, progressive discovery)

1.2 Industry Context

Token consumption directly impacts:

  • Operational costs: At scale, token efficiency translates to significant cost savings
  • Latency: Fewer tokens reduce processing time and network overhead
  • Context window utilization: Efficient approaches preserve context for complex reasoning
  • Scalability: Data-passing approaches fail with large datasets; file-based approaches remain nearly flat

1.3 Research Questions

  1. How does tool integration architecture affect token consumption?
  2. What is the relationship between dataset size and token efficiency across approaches?
  3. Can progressive tool discovery reduce initial context overhead?
  4. Do declarative code-generation approaches (UTCP) improve efficiency?
  5. What are the implementation trade-offs for production deployment?

2. Methodology

2.1 Experimental Design

Controlled Variables:

  • Task description: Identical 160-word prompt across all approaches
  • Dataset: 500 employee records (7 departments, 7 locations, realistic distributions)
  • Required outputs: 4 statistical analyses + 4 visualizations
  • Model: Claude Sonnet 4.5 (claude-sonnet-4-5-20250929)
  • Environment: Claude Code CLI with network instrumentation

Independent Variable:

  • Tool integration approach (5 variants)

Dependent Variables:

  • Total token consumption (input + output + cache)
  • API call count
  • Token distribution per request
  • Cumulative token growth

2.2 Approaches Tested

Approach 1: Code-Skill (Baseline)

Architecture: LLM generates and executes Python scripts guided by skills

Implementation:

  • Skills provide domain guidance without explicit tools
  • LLM writes complete Python scripts for analysis and visualization
  • Iterative refinement through script generation

Hypothesis: Maximum flexibility but higher token overhead due to code verbosity

Approach 2: MCP Vanilla

Architecture: Model Context Protocol with direct data passing

Implementation:

  • MCP server exposes tools: read_csv_data, analyze_data, create_visualization
  • Full data arrays passed as tool parameters
  • Sequential tool calls for each operation

Hypothesis: Reduced API calls but high token cost for large datasets

Approach 3: MCP Optimized

Architecture: File-path based MCP tools

Implementation:

  • Tools accept file paths instead of data arrays: analyze_csv_file, create_visualization_from_file
  • Server reads files internally
  • Enables parallel tool calls (multiple visualizations in single request)

Hypothesis: Minimal token overhead, scales with dataset size
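To make the contrast concrete, here is a minimal sketch of a file-path tool handler in plain Node.js (illustrative only; the repository's actual servers live in each approach directory, and the CSV parsing below assumes no quoted commas):

const fs = require('node:fs');

// analyze_csv_file, sketched: the model sends only { path, column };
// the server reads and summarizes the file locally.
function analyzeCsvFile({ path, column }) {
  const [header, ...rows] = fs.readFileSync(path, 'utf8').trim().split('\n');
  const idx = header.split(',').indexOf(column);
  const values = rows.map((r) => Number(r.split(',')[idx])).filter(Number.isFinite);
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  // Only this small summary returns to the model, so context cost stays flat
  // regardless of row count.
  return { rows: values.length, mean: Math.round(mean) };
}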

Approach 4: MCP Proxy (one-mcp)

Architecture: Progressive tool discovery via meta-tools

Implementation:

  • Initial context: 2 meta-tools (describe_tools, use_tool) ~400 tokens
  • Tools loaded on-demand vs. upfront (~10,000+ tokens for all tools)
  • 90%+ reduction in initial overhead

Hypothesis: Lower initial overhead, efficiency improves across sessions

Approach 5: UTCP Code-Mode

Architecture: Universal Tool Calling Protocol with TypeScript code generation

Implementation:

  • LLM generates TypeScript code that calls MCP tools
  • Single execution of generated code
  • Claimed 60% faster, 68% fewer tokens, 88% fewer API calls

Hypothesis: Code generation + tool integration = best of both worlds

2.3 Data Collection

Network Instrumentation:

// networkLog.js - intercepts all HTTP(S) requests by wrapping the global fetch
// (Node has no 'fetch' process event, so interception patches fetch itself)
const originalFetch = globalThis.fetch;

globalThis.fetch = async (url, options = {}) => {
  const start = Date.now();
  const response = await originalFetch(url, options);
  // Capture full request/response, including the usage block in the body
  const body = await response.clone().json().catch(() => null);
  logAPICall({
    url: String(url), method: options.method ?? 'GET',
    status: response.status, duration: Date.now() - start,
    requestBody: options.body, responseBody: body,
    parsedMessage: { usage: body?.usage ?? null }, // { input_tokens, output_tokens, ... }
  });
  return response;
};
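Loading the module before the CLI starts (for example via NODE_OPTIONS="--require ./networkLog.js") is one way to activate the wrapper for every outbound request; the exact wiring used in this repository may differ.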

Metrics Captured:

  • input_tokens: Regular input tokens
  • output_tokens: Generated tokens
  • cache_creation_input_tokens: Prompt cache creation
  • cache_read_input_tokens: Prompt cache hits
  • total_tokens: Sum of all token types
  • duration: Request latency
  • model: Model identifier
  • stop_reason: Completion reason
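Put together, a single cleaned record might look like this (one JSONL line; field names from the list above, values illustrative):

{
  "url": "https://api.anthropic.com/v1/messages",
  "method": "POST",
  "status": 200,
  "duration": 4210,
  "model": "claude-sonnet-4-5-20250929",
  "stop_reason": "tool_use",
  "usage": {
    "input_tokens": 1204,
    "output_tokens": 312,
    "cache_creation_input_tokens": 9875,
    "cache_read_input_tokens": 45210,
    "total_tokens": 56601
  }
}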

Data Pipeline:

  1. Raw data collection: Network logs with full request/response (PII included)
  2. PII redaction: Remove API keys, user IDs, workspace paths, emails, phone numbers
  3. JSONL conversion: One API call per line for analysis
  4. Visualization: Python matplotlib for comparative charts

2.4 Privacy and Ethics

All collected data underwent automated PII redaction:

  • API keys and tokens → [REDACTED]
  • User/account/session IDs → [ID_REDACTED]
  • Workspace paths → /workspace
  • Email addresses → [EMAIL]
  • Phone numbers → [PHONE]
  • OS version details → [OS_VERSION]

Content and tool interactions preserved for analysis.
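For illustration, the redaction pass can be a list of regex substitutions (the patterns below are assumptions; clean-data.js holds the actual rules):

const RULES = [
  [/\b(?:sk|key)-[A-Za-z0-9_-]{16,}\b/g, '[REDACTED]'], // API keys and tokens
  [/[\w.+-]+@[\w-]+\.[\w.-]+/g, '[EMAIL]'],
  [/\+?\d[\d\s().-]{7,}\d/g, '[PHONE]'],
  [/\/(?:Users|home)\/[^\s"']+/g, '/workspace'],        // workspace paths
];

const redact = (text) => RULES.reduce((s, [re, sub]) => s.replace(re, sub), text);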

2.5 Statistical Analysis

  • Sessions per approach: 3 (for variance measurement)
  • Aggregation: Mean and variance across sessions
  • Visualization: Per-request and cumulative token usage charts
  • Comparison: Both within-approach (session variance) and cross-approach (efficiency ranking)
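A sketch of that aggregation (illustrative; the repository's comparison scripts may differ):

// Mean, sample standard deviation, and coefficient of variation
// across the three session totals for one approach.
function aggregate(sessionTotals) {
  const n = sessionTotals.length;
  const mean = sessionTotals.reduce((a, b) => a + b, 0) / n;
  const variance = sessionTotals.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1);
  const std = Math.sqrt(variance);
  return { mean, std, cv: std / mean };
}

// e.g. aggregate([157749, 132702, 108566]) → mean ≈ 133,006 (Code-Skill)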


3. Results

3.1 Overall Token Efficiency Rankings

| Rank | Approach | Avg Total Tokens | Avg API Calls | Tokens/Call | Efficiency vs Baseline |
|------|----------|------------------|---------------|-------------|------------------------|
| 1 | MCP Optimized | 60,420 | 4 | 15,105 | 44-62% better (81% vs worst) |
| 2 | MCP Proxy | 81,415-154,734 | 5-8 | 16,283-19,342 | 25-50% better |
| 3 | Code-Skill | 108,566-157,749 | 6-8 | 18,094-19,719 | Baseline |
| 4 | UTCP Code-Mode | 182,377-239,542 | 9-11 | 20,264-21,777 | 40-68% worse ⚠️ |
| 5 | MCP Vanilla | 204,099-309,053 | 7-9 | 29,157-34,339 | 88-195% worse |

3.2 Detailed Results by Approach

MCP Optimized (Winner)

Session 1: 60,307 tokens (4 calls, 15,077 avg)
Session 2: 60,144 tokens (4 calls, 15,036 avg)
Session 3: 60,808 tokens (4 calls, 15,202 avg)

Variance: ±0.6% (extremely consistent)
Efficiency: 81% better than worst approach, 44% better than baseline

Key Characteristics:

  • Minimal context consumption (file paths only)
  • Parallel tool execution (4 visualizations in single request)
  • Linear scaling with dataset size
  • Lowest variance across sessions

Architecture Advantages:

// Single API call generates 4 visualizations in parallel
{
  "tool_uses": [
    { "name": "create_visualization_from_file", "input": { "path": "...", "type": "bar" } },
    { "name": "create_visualization_from_file", "input": { "path": "...", "type": "scatter" } },
    { "name": "create_visualization_from_file", "input": { "path": "...", "type": "pie" } },
    { "name": "create_visualization_from_file", "input": { "path": "...", "type": "bar" } }
  ]
}
// Total context: ~400 tokens vs ~10,000+ if passing data arrays

MCP Proxy (Second Place)

Session 1: 154,734 tokens (8 calls, 19,342 avg) - Initial discovery overhead
Session 2:  81,528 tokens (5 calls, 16,306 avg) - Optimized after discovery
Session 3:  81,415 tokens (5 calls, 16,283 avg) - Stable performance

Variance: ±47% (session 1 vs 2-3)
Efficiency: 50% better than baseline in steady state

Key Characteristics:

  • Progressive discovery: 2 meta-tools initially vs 10+ full tool descriptions
  • Initial overhead amortized across sessions
  • 90% reduction in upfront tool context
  • Converges to efficient pattern after discovery

Progressive Discovery Pattern:

Session 1: describe_tools() → load needed tools → use_tool()
           High overhead from discovery process

Sessions 2+: Tools cached, direct use_tool() calls
             Steady-state efficiency achieved
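A minimal sketch of the meta-tool dispatch (illustrative; one-mcp's actual internals may differ):

// Only describe_tools and use_tool are advertised upfront (~400 tokens);
// full schemas enter the context only when the model asks for them.
const registry = {
  analyze_csv_file: {
    schema: { path: 'string' },              // description served on demand
    run: ({ path }) => ({ ok: true, path }), // stub standing in for the real handler
  },
  // ... remaining tools, never described unless requested
};

const describe_tools = ({ names }) =>
  names.map((name) => ({ name, schema: registry[name].schema }));

const use_tool = ({ name, input }) => registry[name].run(input);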

Code-Skill Baseline

Session 1: 157,749 tokens (8 calls, 19,719 avg)
Session 2: 132,702 tokens (7 calls, 18,957 avg)
Session 3: 108,566 tokens (6 calls, 18,094 avg)

Variance: ±31% (session-to-session)
Efficiency: Reference baseline

Key Characteristics:

  • High variance due to different code generation paths
  • Sequential script generation and execution
  • Full code in context each iteration
  • Debugging overhead adds API calls

Variance Analysis: Different solution paths lead to unpredictable token usage:

  • Session 1: More debugging iterations (8 calls)
  • Session 2: Medium complexity path (7 calls)
  • Session 3: Efficient path found (6 calls)

UTCP Code-Mode (Underperformer)

Session 1: 190,113 tokens (9 calls, 21,124 avg)
Session 2: 182,377 tokens (9 calls, 20,264 avg)
Session 3: 239,542 tokens (11 calls, 21,777 avg)

Variance: ±23% (session-to-session)
Efficiency: 40-68% WORSE than baseline ⚠️

Key Characteristics:

  • Generates TypeScript code to call MCP tools
  • Higher token overhead than direct tool calls
  • More API calls than expected
  • Does NOT achieve claimed efficiency gains

Analysis of Unexpected Results: Contrary to the claimed "68% fewer tokens, 88% fewer API calls":

  1. Code generation adds verbosity vs direct tool calls
  2. TypeScript compilation/execution overhead
  3. Error handling requires additional iterations
  4. Not optimized for file-based operations

Hypothesis: UTCP may excel in different use cases (complex workflows, conditional logic), but not data analysis tasks.

MCP Vanilla (Least Efficient)

Session 1: 299,908 tokens (9 calls, 33,323 avg)
Session 2: 309,053 tokens (9 calls, 34,339 avg)
Session 3: 204,099 tokens (7 calls, 29,157 avg)

Variance: ±34% (session-to-session)
Efficiency: 88-195% WORSE than baseline ❌

Key Characteristics:

  • Passes full 500-row dataset in every tool call
  • Data duplication across multiple operations
  • Context window saturation
  • Does NOT scale with dataset size

Token Breakdown Analysis:

Example tool call with data passing:
{
  "data": [
    {"name": "Alice", "dept": "Engineering", "salary": 95000, ...}, // Row 1
    {"name": "Bob", "dept": "Marketing", "salary": 75000, ...},     // Row 2
    ... // 498 more rows
  ]
}

Token cost: ~8,000-12,000 tokens per call just for data
Total across 9 calls: ~80,000 tokens wasted on data duplication
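A back-of-envelope check using the common ~4 characters-per-token heuristic (an approximation, not the real tokenizer):

// One serialized row is ~52 characters; 500 rows per call, 9 calls total.
const row = '{"name":"Alice","dept":"Engineering","salary":95000}';
const perCall = Math.round((row.length * 500) / 4); // ≈ 6,500 tokens for data alone
const total = perCall * 9;                          // ≈ 59K tokens
// Real rows carry more columns, which lands in the ~8-12K/call range above.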

3.3 Scalability Analysis

Dataset Size vs Token Consumption:

| Approach | 20 rows | 500 rows | Scaling Factor |
|----------|---------|----------|----------------|
| MCP Optimized | ~40K | ~60K | 1.5x (minimal) ✓ |
| MCP Proxy | ~60K | ~80K-155K | 1.3-2.6x |
| Code-Skill | ~100K | ~108K-158K | 1.1-1.6x |
| UTCP Code-Mode | ~140K | ~182K-240K | 1.3-1.7x |
| MCP Vanilla | ~105K | ~204K-309K | 2.0-2.9x |

Key Finding: File-path approaches (MCP Optimized) stay nearly flat as dataset size grows, while data-passing approaches (MCP Vanilla) grow steeply with row count, becoming prohibitively expensive with large datasets.

Extrapolation to 10,000 rows:

  • MCP Optimized: ~65K tokens (minimal increase)
  • MCP Vanilla: ~500K+ tokens (unsustainable)
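The extrapolation can be read as a simple linear-in-rows model for data passing, fit to the two measured points (a rough illustration, not a rigorous fit):

// Rough linear fit to MCP Vanilla (~105K tokens @ 20 rows, ~256K average @ 500 rows);
// file-path cost is treated as flat because only a path enters the context.
const vanillaTokens = (rows) => 98_700 + 315 * rows;
const optimizedTokens = () => 60_000;

vanillaTokens(10_000);  // ≈ 3.25M tokens, well past the "500K+" floor above
optimizedTokens();      // ≈ 60K tokens, in line with the ~65K estimate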

3.4 Variance and Consistency

Coefficient of Variation (CV) across 3 sessions:

| Approach | Mean Tokens | Std Dev | CV | Consistency |
|----------|-------------|---------|------|-------------|
| MCP Optimized | 60,420 | 343 | 0.6% | Excellent ✓ |
| MCP Vanilla | 271,020 | 57,512 | 21.2% | Poor |
| Code-Skill | 133,006 | 24,884 | 18.7% | Poor |
| UTCP Code-Mode | 204,011 | 31,149 | 15.3% | Moderate |
| MCP Proxy* | 105,892 | 42,203 | 39.9% | Poor initially |

*Note: MCP Proxy variance primarily from session 1 discovery overhead; sessions 2-3 are consistent (CV ~0.5%)

Interpretation:

  • Tool-based approaches with deterministic workflows (MCP Optimized) show excellent consistency
  • Code generation approaches (Code-Skill, UTCP) show high variance due to solution path differences
  • Progressive discovery (MCP Proxy) requires warm-up period but then stabilizes

3.5 Cost Analysis

Assumptions: Claude Sonnet 4.5 pricing (~$3/M input tokens, ~$15/M output tokens)

Per-Session Costs (Average):

| Approach | Input Tokens | Output Tokens | Input Cost | Output Cost | Total Cost |
|----------|--------------|---------------|------------|-------------|------------|
| MCP Optimized | 57,743 | 2,677 | $0.173 | $0.040 | $0.213 |
| MCP Proxy | 103,317 | 2,909 | $0.310 | $0.044 | $0.354 |
| Code-Skill | 129,427 | 3,579 | $0.388 | $0.054 | $0.442 |
| UTCP Code-Mode | 200,091 | 3,919 | $0.600 | $0.059 | $0.659 |
| MCP Vanilla | 256,228 | 14,792 | $0.769 | $0.222 | $0.991 |
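The per-session costs follow directly from the stated pricing (cache-token pricing is ignored here, matching the table's simplification):

// Sonnet 4.5 list pricing assumed above: ~$3/M input, ~$15/M output
const sessionCost = ({ input, output }) => (input / 1e6) * 3 + (output / 1e6) * 15;

sessionCost({ input: 57_743, output: 2_677 });   // ≈ $0.213 (MCP Optimized)
sessionCost({ input: 256_228, output: 14_792 }); // ≈ $0.991 (MCP Vanilla)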

ROI Analysis (1,000 sessions/month):

| Approach | Monthly Cost | Savings vs Baseline | Annual Savings |
|----------|--------------|---------------------|----------------|
| MCP Optimized | $213 | $229 (52%) | $2,748 |
| MCP Proxy | $354 | $88 (20%) | $1,056 |
| Code-Skill | $442 | Baseline | - |
| UTCP Code-Mode | $659 | -$217 (-49%) | -$2,604 ❌ |
| MCP Vanilla | $991 | -$549 (-124%) | -$6,588 ❌ |

Break-Even Analysis:

MCP Optimized server development cost amortizes after:

  • ~4,400 sessions at 1 week of development time ($1,000 value)
  • ~22,000 sessions at 1 month of development time ($5,000 value)

At 1,000 sessions/month: annual savings of $2,748 give a first-year ROI of ~55% on a $5,000 build, with full payback in under two years

3.6 Parallelization Impact

Sequential vs Parallel Tool Calls:

| Operation | Code-Skill | MCP Vanilla | MCP Optimized |
|-----------|------------|-------------|---------------|
| Generate viz 1 | Call 1 | Call 1 | Call 1 (parallel) |
| Generate viz 2 | Call 2 | Call 2 | Call 1 (parallel) |
| Generate viz 3 | Call 3 | Call 3 | Call 1 (parallel) |
| Generate viz 4 | Call 4 | Call 4 | Call 1 (parallel) |
| Total Calls | 4 | 4 | 1 |
| Wall Time | 4x latency | 4x latency | 1x latency |

Impact:

  • Latency reduction: 4x faster wall-clock time for parallel operations
  • Token reduction: No repeated context for each operation
  • Only possible with: Independent tools + file-path architecture

3.7 Session Progression Patterns

MCP Proxy Learning Curve:

Session 1 (Discovery):
  describe_tools() → 8 API calls → 154,734 tokens
  High overhead from exploring available tools

Session 2 (Optimized):
  Direct tool usage → 5 API calls → 81,528 tokens
  47% reduction from session 1

Session 3 (Stable):
  Efficient workflow → 5 API calls → 81,415 tokens
  Consistent performance achieved

Implication: Systems with repeated usage benefit significantly from progressive discovery after initial warm-up.


4. Decision Metrics and Tradeoffs

4.1 Multi-Dimensional Decision Space

Selecting an approach requires balancing competing objectives across three primary dimensions:

  1. Context Efficiency vs API Call Count
  2. Variance Tolerance (Consistency Requirements)
  3. Task Repeatability (One-off vs Production)

These dimensions are interdependent and create distinct optimization profiles for each approach.


4.2 Dimension 1: Context Efficiency vs API Call Count

Fundamental Tradeoff:

  • Fewer API calls often require more context per call (complex instructions, data passing)
  • Less context per call may require more API calls (iterative refinement, sequential operations)

Approach Positioning:

| Approach | Total Tokens | API Calls | Tokens/Call | Efficiency Profile |
|----------|--------------|-----------|-------------|--------------------|
| MCP Optimized | 60,420 | 4 | 15,105 | Optimal balance |
| MCP Proxy | 81,415-154,734 | 5-8 | 16,283-19,342 | Good (after warm-up) |
| Code-Skill | 108,566-157,749 | 6-8 | 18,094-19,719 | Moderate |
| UTCP Code-Mode | 182,377-239,542 | 9-11 | 20,264-21,777 | Poor (both high) |
| MCP Vanilla | 204,099-309,053 | 7-9 | 29,157-34,339 | Worst (high context) |

Analysis:

MCP Optimized: Pareto Optimal

  • Achieves both lowest total tokens AND fewest API calls
  • File-path architecture eliminates context/API call tradeoff
  • Parallel execution further reduces API calls without increasing context
  • Tradeoff eliminated through architectural innovation

MCP Vanilla: Worst of Both Worlds

  • High API call count (7-9) due to sequential operations
  • Highest tokens per call (29-34K) due to data passing
  • Tradeoff amplified by poor design choices

Code-Skill: Classic Tradeoff

  • Moderate API calls (6-8) from iterative development
  • Moderate context per call (18-20K) from code verbosity
  • Traditional tradeoff profile

UTCP Code-Mode: Unexpected Anti-Pattern

  • High API calls (9-11) despite code generation claims
  • High tokens per call (20-22K) from code + tool overhead
  • Tradeoff exacerbated by additional abstraction layer

MCP Proxy: Time-Dependent Tradeoff

  • Session 1: High tokens (154K), high calls (8) - discovery overhead
  • Sessions 2+: Moderate tokens (81K), moderate calls (5-6) - optimized
  • Tradeoff improves with usage

Decision Rules:

IF context_budget_critical AND api_latency_acceptable:
    → MCP Optimized (minimizes both)

IF api_calls_must_minimize AND context_abundant:
    → Still MCP Optimized (parallel execution wins)

IF budget_constrained AND high_variance_tolerable:
    → Code-Skill (moderate both, high flexibility)

IF large_tool_catalog AND repeated_usage:
    → MCP Proxy (amortizes discovery cost)

NEVER:
    → MCP Vanilla (loses on both dimensions)
    → UTCP Code-Mode (for data tasks)

Quantitative Thresholds:

| Constraint | Threshold | Recommended Approach |
|------------|-----------|----------------------|
| Total token budget | < 100K | MCP Optimized, MCP Proxy (steady-state) |
| Total token budget | 100K-200K | Code-Skill |
| Total token budget | > 200K | ❌ Re-architect task |
| API call budget | < 5 calls | MCP Optimized (parallel execution) |
| API call budget | 5-10 calls | Code-Skill, MCP Proxy |
| Tokens per call | < 20K | MCP Optimized, MCP Proxy, Code-Skill |
| Tokens per call | > 25K | ❌ MCP Vanilla (redesign needed) |

4.3 Dimension 2: Variance Tolerance

Definition: Acceptable variation in token consumption and API calls across sessions for identical tasks

Variance Sources:

  1. Solution path diversity (code generation approaches)
  2. Debugging iterations (trial-and-error execution)
  3. LLM sampling variation (temperature, non-determinism)
  4. Tool selection uncertainty (multiple valid sequences)

Measured Variance (Coefficient of Variation):

| Approach | Mean Tokens | Std Dev | CV | Consistency Rating |
|----------|-------------|---------|------|--------------------|
| MCP Optimized | 60,420 | 343 | 0.6% | Excellent ✓ |
| MCP Vanilla | 271,020 | 57,512 | 21.2% | Poor |
| Code-Skill | 133,006 | 24,884 | 18.7% | Poor |
| UTCP Code-Mode | 204,011 | 31,149 | 15.3% | Moderate |
| MCP Proxy* | 105,892 | 42,203 | 39.9% | Poor (initially) |

*MCP Proxy: Sessions 2-3 only: CV = 0.5% (excellent after warm-up)

Variance Impact on Production Systems:

Low Variance Systems (CV < 5%):

  • Predictable costs: Accurate budget forecasting
  • Consistent latency: Reliable SLA compliance
  • Stable monitoring: Anomaly detection works well
  • Capacity planning: Deterministic resource allocation

High Variance Systems (CV > 15%):

  • Cost uncertainty: 20-40% budget variance
  • Latency unpredictability: P99 latency 2-3x P50
  • Monitoring challenges: Normal variance masks real issues
  • Over-provisioning: Must plan for worst-case scenarios

Tradeoff Analysis:

MCP Optimized: Deterministic by Design

  • Why low variance: Fixed tool sequence, no debugging iterations
  • Tradeoff: Requires upfront workflow definition
  • When acceptable: Production systems, SLA-driven applications
  • Cost: Inflexible to novel requirements

Code-Skill: Non-Deterministic by Nature

  • Why high variance: Different code solutions, varying debug paths
  • Tradeoff: Maximum flexibility, unpredictable cost
  • When acceptable: Exploratory work, research, prototyping
  • Benefit: Handles novel tasks without modification

MCP Proxy: Variance Converges Over Time

  • Why initial variance: Tool discovery exploration
  • Why eventual consistency: Cached tool knowledge
  • Tradeoff: Must tolerate initial instability
  • When acceptable: Long-running systems with warm-up period

Decision Matrix:

| Requirement | Variance Tolerance | Recommended Approach |
|-------------|--------------------|-----------------------|
| Production SLA | Low (< 5% variation) | MCP Optimized |
| Cost budgeting | Low (< 10% variation) | MCP Optimized, MCP Proxy (steady) |
| Experimentation | High (20-40% acceptable) | Code-Skill |
| User-facing latency | Low (consistent P99) | MCP Optimized |
| Internal tools | Medium (10-20% ok) | MCP Proxy, Code-Skill |
| Capacity planning | Low (predictable peaks) | MCP Optimized |

Quantitative Thresholds:

IF p99_latency_sla_required OR cost_budget_strict:
    variance_tolerance = LOW
    → MCP Optimized ONLY

IF exploratory_task OR prototype_phase:
    variance_tolerance = HIGH
    → Code-Skill acceptable

IF production BUT budget_flexible:
    variance_tolerance = MEDIUM
    → MCP Proxy (after warm-up)

IF sla_critical AND novel_requirements:
    → CONFLICT: Cannot satisfy both
    → Recommendation: Code-Skill prototype → MCP Optimized migration

4.4 Dimension 3: Repeatability (One-off vs Production)

Definition: How many times will this exact task be executed?

Economic Model:

Total_Cost = Development_Cost + (Per_Execution_Cost × Execution_Count)

Where:
  Development_Cost = Time to implement approach
  Per_Execution_Cost = Token cost per execution
  Execution_Count = Number of times task runs
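The same model as executable code (illustrative helper, not from the repository):

// Break-even: executions needed before per-execution savings repay development
const breakEven = (devCost, perExecBaseline, perExecOptimized) =>
  devCost / (perExecBaseline - perExecOptimized);

breakEven(2500, 0.44, 0.21); // ≈ 10,870 executions (MCP Optimized vs Code-Skill)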

Approach Economics:

| Approach | Dev Cost | Per-Execution Cost | Break-Even Count |
|----------|----------|--------------------|-------------------|
| Code-Skill | Low ($0, immediate) | High ($0.44) | 0 (always usable) |
| MCP Optimized | High ($1,000-5,000) | Low ($0.21) | ≈4,300-21,700 executions |
| MCP Proxy | Medium ($500-2,000) | Medium ($0.35) | ≈5,600-22,200 executions |
| MCP Vanilla | Medium ($500-2,000) | Highest ($0.99) | Never breaks even ❌ |
| UTCP Code-Mode | Medium ($500-1,500) | High ($0.66) | Never breaks even ❌ |

Break-Even Analysis:

MCP Optimized vs Code-Skill:

Development_Cost = $2,500 (1 week engineer time)
Savings_Per_Execution = $0.44 - $0.21 = $0.23

Break_Even = $2,500 / $0.23 ≈ 10,900 executions

At 100,000 executions:
  Total savings = $0.23 × 100,000 = $23,000
  ROI = $23,000 / $2,500 = 920%

At 1,000,000 executions:
  Total savings = $0.23 × 1,000,000 = $230,000
  ROI = $230,000 / $2,500 = 9,200%

MCP Proxy vs Code-Skill:

Development_Cost = $1,000 (3 days engineer time)
Savings_Per_Execution = $0.44 - $0.35 = $0.09

Break_Even = $1,000 / $0.09 ≈ 11,100 executions

At 100,000 executions:
  Total savings = $9,000
  ROI = 900%

Repeatability Decision Framework:

| Execution Count | Repeatability | Recommended Approach | Rationale |
|-----------------|---------------|----------------------|-----------|
| 1 time | One-off | Code-Skill | Zero dev cost, immediate |
| 2-5 times | Low | Code-Skill | ROI insufficient |
| 6-20 times | Medium | MCP Proxy | Moderate dev cost, amortizes with reuse |
| 20-100 times | High | MCP Optimized | Strong ROI at sustained volume |
| 100+ times | Production | MCP Optimized | Exceptional ROI at sustained volume |

Time Horizon Considerations:

One-off Tasks (1-5 executions):

  • Optimize for: Speed to first result
  • Accept: High per-execution cost, high variance
  • Choose: Code-Skill
  • Example: Quarterly board presentation, one-time data migration

Recurring Tasks (20-100 executions):

  • Optimize for: Total cost of ownership
  • Accept: Upfront development time
  • Choose: MCP Optimized or MCP Proxy
  • Example: Weekly sales reports, monthly analytics dashboards

Production Workflows (100+ executions):

  • Optimize for: Per-execution efficiency, reliability
  • Accept: Significant development investment
  • Choose: MCP Optimized (always)
  • Example: Real-time data processing, automated reporting systems

Hybrid Strategy for Uncertain Repeatability:

Phase 1: Use Code-Skill (executions 1-10)
  → Validate task, understand requirements
  → Total cost: ~$4.40

Phase 2: Decision point (after 10 executions)
  IF task_stabilized AND execution_count_forecast > 20:
      → Invest in MCP Optimized
      → Development: $2,500
      → Future savings: $0.23/execution
  ELSE:
      → Continue with Code-Skill
      → Re-evaluate at 50 executions

Phase 3: Monitor (ongoing)
  → Track actual execution count
  → Measure token cost trends
  → Migrate to MCP Optimized when break-even certain

Tradeoff Summary:

Code-Skill Optimizes for:

  • ✅ Unknown repeatability (no upfront investment)
  • ✅ Rapidly changing requirements
  • ✅ Immediate results needed
  • ❌ Poor for > 20 executions (expensive)

MCP Optimized Optimizes for:

  • ✅ High repeatability (exceptional ROI)
  • ✅ Stable, well-defined workflows
  • ✅ Production systems
  • ❌ Poor for one-offs (wasted investment)

MCP Proxy Optimizes for:

  • ✅ Medium repeatability (20-100 executions)
  • ✅ Evolving tool requirements
  • ✅ Large tool catalogs
  • ❌ Poor for one-offs or very high frequency

4.5 Integrated Decision Framework

Multi-Dimensional Optimization:

Use this decision tree to select the optimal approach based on your constraints:

START

Q1: Is this a one-off task (< 5 executions)?
    YES → Code-Skill (end)
    NO → Continue to Q2

Q2: Is total token budget < 100K AND variance < 5% required?
    YES → MCP Optimized (end)
    NO → Continue to Q3

Q3: Do you have > 20 tools AND task repeats > 50 times?
    YES → MCP Proxy (end)
    NO → Continue to Q4

Q4: Is execution count > 20 AND requirements stable?
    YES → MCP Optimized (end)
    NO → Continue to Q5

Q5: Is high variance acceptable (CV > 15%)?
    YES → Code-Skill (end)
    NO → MCP Optimized (end, invest in stability)

NEVER CHOOSE:
  - MCP Vanilla (always suboptimal)
  - UTCP Code-Mode (for data analysis tasks)

Tradeoff Visualization:

                    Context Efficient
                           ▲
                           │
                           │  MCP Optimized
                           │     ●
                           │   /   \
                           │  /     \
                           │ /       \
    High Variance ◄────────┼─────────► Low Variance
    (Flexible)             │           (Predictable)
                           │    ●
                           │  Code-Skill
                           │     \
                           │      \
                           │   ●   ● MCP Proxy
                           │  UTCP  (post-warmup)
                           │   \
                           │    ● MCP Vanilla
                           ▼
                    Many API Calls

Pareto Frontier:

Only three approaches lie on the Pareto frontier (no other approach dominates them on every dimension):

  1. MCP Optimized: Best context efficiency, best consistency, best for high repeatability
  2. MCP Proxy: Good efficiency after warmup, good for large tool sets
  3. Code-Skill: Best flexibility, zero dev cost, best for one-offs

Dominated Approaches (never optimal):

  • MCP Vanilla: Dominated by MCP Optimized on all dimensions
  • UTCP Code-Mode: Dominated by Code-Skill (worse cost, similar flexibility)

5. Conclusions

5.1 Key Findings

  1. Architecture Matters More Than Protocol

    • File-path approach (60K tokens) vs data-passing (309K tokens) = 5x difference
    • Protocol choice (MCP vs UTCP) less impactful than architectural design
    • Conclusion: Focus on data flow design, not protocol selection
  2. Parallelization is Underutilized

    • MCP Optimized achieves 4x latency reduction through parallel execution
    • Only possible with independent, file-based tools
    • Significant competitive advantage in production systems
  3. Progressive Discovery Shows Promise

    • 47% token reduction after warm-up period
    • Suitable for large tool catalogs
    • Requires session persistence for effectiveness
  4. UTCP Code-Mode Underperforms for Data Tasks

    • 40-68% worse than baseline (contrary to claims)
    • May excel in different domains (requires further research)
    • Not recommended for data analysis workflows
  5. Scalability Characteristics Diverge Sharply

    • File-path approaches: near-flat token growth (1.5x from 20→500 rows)
    • Data-passing approaches: steep token growth (2.9x from 20→500 rows)
    • Critical consideration for production deployment

5.2 Production Readiness

| Approach | Production Ready | Confidence | Recommendation |
|----------|------------------|------------|----------------|
| MCP Optimized | Yes | High | Deploy now for frequent workflows |
| MCP Proxy | Yes | Medium | Deploy for large tool catalogs |
| Code-Skill | Yes | High | Keep for novel/exploratory tasks |
| UTCP Code-Mode | No | Low | Avoid for data tasks; research further |
| MCP Vanilla | No | High | Avoid in production (cost prohibitive) |

5.3 Impact Assessment

For Individual Developers:

  • Time savings: 30-50% from parallel execution
  • Cost reduction: $200-500/year in token costs
  • Learning curve: 2-4 weeks to proficiency with tools

For Teams (10 engineers):

  • Cost savings: $2,000-5,000/year
  • Velocity improvement: 15-25% from reduced debugging
  • Infrastructure investment: $10,000-20,000 (tool development)
  • ROI timeline: 3-6 months

For Organizations (100+ engineers):

  • Cost savings: $50,000-100,000/year
  • Competitive advantage: Faster feature delivery
  • Platform opportunity: Internal tool marketplace
  • Strategic value: Differentiated AI capabilities

5.4 Final Recommendations

Tier 1 (Highest Priority):

  1. Immediate: Deploy MCP Optimized for top 5 frequent operations
  2. Month 1: Measure token reduction and ROI
  3. Month 2: Expand to top 20 operations

Tier 2 (Medium Priority):

  1. Month 3: Pilot MCP Proxy for large tool catalogs
  2. Month 4: Develop hybrid routing logic
  3. Month 6: Full hybrid architecture deployment

Tier 3 (Research):

  1. Ongoing: Monitor UTCP protocol developments
  2. Q2: Re-evaluate UTCP for workflow orchestration
  3. Q3: Cross-model validation studies

Appendix

A. Experimental Metadata

Dataset Characteristics:

  • Rows: 500
  • Columns: 6 (name, department, salary, years_experience, performance_score, location)
  • Size: ~45KB CSV
  • Distribution: Realistic salary ranges with experience correlation

Environment:

  • Model: claude-sonnet-4-5-20250929
  • Interface: Claude Code CLI v2.0.42
  • OS: macOS (Darwin 24.6.0)
  • Node.js: v24.4.1
  • Network: Instrumented with custom logging

  • Sessions: 3 per approach (15 total)
  • Data collection period: November 2025
  • Analysis tools: Python (matplotlib, pandas), Node.js

B. Repository Structure

tool-metrics/
├── experiments/data-analysis/
│   ├── code-skill-approach/
│   ├── mcp-approach/
│   ├── mcp-approach-optimized/
│   ├── mcp-proxy-approach/
│   ├── otcp-code-approach/
│   └── shared/sample-data.csv
├── raw-data/experiments/        # Network logs with PII
├── data/experiments/             # Cleaned JSONL data
├── visualizations/               # Comparative charts
├── clean-data.js                 # PII redaction pipeline
├── sessions-comparison.js        # Within-approach analysis
├── approaches-comparison.js      # Cross-approach analysis
└── README.md

C. Tool and Protocol References

Protocols and Frameworks:

  1. Model Context Protocol (MCP)

    • Official Specification: https://modelcontextprotocol.io/
    • Description: Protocol for connecting AI models to external tools and data sources
    • Used in: MCP Vanilla, MCP Optimized, MCP Proxy approaches
  2. Universal Tool Calling Protocol (UTCP) - Code Mode

    • Repository: https://github.com/universal-tool-calling-protocol/code-mode
    • Description: Enables writing TypeScript code that calls MCP tools in single execution
    • Claims: 60% faster, 68% fewer tokens, 88% fewer API calls
    • Used in: UTCP Code-Mode approach
    • Note: Claims not validated in this research for data analysis tasks
  3. one-mcp (MCP Proxy)

    • Package: @agiflowai/one-mcp (npm)
    • Description: Progressive tool discovery proxy exposing describe_tools and use_tool meta-tools
    • Used in: MCP Proxy approach

Claude Code:

  • Claude Code CLI v2.0.42 (Anthropic), the instrumented agent environment used for all sessions

Analysis Tools:

  • Python (matplotlib, pandas) and Node.js scripts (clean-data.js, sessions-comparison.js, approaches-comparison.js)

D. Reproducibility

To reproduce results:

  1. Clone repository and install dependencies
  2. Set up Claude Code CLI with API key
  3. Configure MCP servers (see approach-specific READMEs)
  4. Run experiment: node run-experiment.js data-analysis <approach> <session-name>
  5. Clean data: node clean-data.js
  6. Generate visualizations: node sessions-comparison.js && node approaches-comparison.js

Tool Setup:

  • UTCP Bridge: npm install -g @utcp/mcp-bridge (see repository for configuration)
  • one-mcp: npm install -g @agiflowai/one-mcp (see repository for mcp-config.yaml)
  • Custom MCP Servers: Node.js implementations in each approach directory

Data availability:

  • Cleaned data (PII redacted): Published in repository
  • Raw logs: Not published (contains PII)
  • Visualization code: Open-source

E. Acknowledgments

This research was conducted to advance understanding of token efficiency in AI-assisted development. Results are shared openly to benefit the broader engineering community.

Tool Acknowledgments:

  • Anthropic for Claude Code and MCP specification
  • Universal Tool Calling Protocol team for UTCP code-mode bridge
  • AgiFlow for one-mcp progressive discovery proxy

F. Version History

  • v1.0 (November 2025): Initial publication with 5 approaches, 500-row dataset

Author: Principal Engineer, AI Systems Research
Contact: [Redacted for privacy]
License: MIT - see LICENSE file for details
