Compare token usage, cost, and DIRT (Data In Remote Transit) across architectural approaches for using Model Context Protocol (MCP) servers with GitHub.
| MCP Server | Package | Use Case | Tools |
|---|---|---|---|
| GitHub | @modelcontextprotocol/server-github | Repository operations (issues, PRs, commits) | 10+ tools |
Scenario 1 (Conventional): The LangGraph agent loads all available MCP tools into context. GPT-5.1 then selects and executes the appropriate tool.
- Trade-off: Full control + observability, high token cost
- DIRT: ~170 KB | Cost: ~$0.015/request
Scenario 2 (TSBX): The script delegates directly to TaskSandbox (TSBX). The TSBX agent has native MCP tool calling and invokes MCP tools itself.
- Trade-off: Lowest tokens/DIRT, autonomous execution
- DIRT: ~10-20 KB | Cost: ~$0.001-0.003/request
┌─────────────────────────────────────────────────────────────┐
│ Scenario 1: Conventional │
│ LangGraph → list ALL tools → GPT-5.1 selects → MCP │
│ Trade-off: Full control, high tokens │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Scenario 2: TSBX │
│ Script → TSBX agent (with MCP) → MCP Server → Result │
│ Trade-off: Lowest cost, black box │
└─────────────────────────────────────────────────────────────┘
# Activate environment
source .venv/bin/activate
# Run both scenarios (conventional + TSBX) and auto-generate comparison
python run_benchmark_suite.py \
--manifest github-data/manifest.json \
--scenario both
# Output:
# - results/benchmark_suite/github/conventional_TIMESTAMP/
# - results/benchmark_suite/github/tsbx_TIMESTAMP/
# - results/comparison_github_TIMESTAMP.json (auto-generated!)
# Python 3.13+ required
python3 --version
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Create a .env file:
# Required: OpenAI (GPT-5.1 for conventional scenario)
OPENAI_API_KEY=sk-proj-xxxxx
# MCP Server API Keys
# GitHub MCP
GITHUB_PERSONAL_ACCESS_TOKEN=ghp_xxxxx
# Required for Scenario 2: TaskSandbox (local instance)
TASKSANDBOX_API_KEY=your_local_token_here
TASKSANDBOX_BASE_URL=http://localhost:3000
INFERENCE_PROVIDER=Hyperbolic
INFERENCE_API_KEY=your_hyperbolic_key_here
# Optional: LangSmith observability
LANGCHAIN_API_KEY=lsv2_pt_xxxxx
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT=mcp-benchmark
GitHub Personal Access Token:
- Visit https://github.com/settings/tokens
- Generate new token (classic)
- Select scopes: repo, read:org, read:user
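Before running a scenario, it can help to confirm that the variables from the .env example are actually set. The following is a minimal sketch and not part of the benchmark code: the `REQUIRED_VARS` mapping and the `missing_vars` helper are hypothetical names, and the grouping by scenario is inferred from the comments in the .env example. It assumes the variables have already been loaded into the process environment.

```python
import os

# Hypothetical grouping, inferred from the comments in the .env example.
REQUIRED_VARS = {
    "conventional": ["OPENAI_API_KEY", "GITHUB_PERSONAL_ACCESS_TOKEN"],
    "tsbx": [
        "GITHUB_PERSONAL_ACCESS_TOKEN",
        "TASKSANDBOX_API_KEY",
        "TASKSANDBOX_BASE_URL",
        "INFERENCE_PROVIDER",
        "INFERENCE_API_KEY",
    ],
}


def missing_vars(scenario, env=None):
    """Return the required variables that are unset or empty for a scenario."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS[scenario] if not env.get(name)]


if __name__ == "__main__":
    for scenario in REQUIRED_VARS:
        missing = missing_vars(scenario)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{scenario}: {status}")
```

Passing a plain dict as `env` makes the check easy to try without touching the real environment.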
Note: There is currently no hosted version of TSBX available. To run the TSBX scenario, you'll need to run TSBX locally.
Clone and run TSBX with MCP support:
# Clone the TSBX repository with MCP support
git clone -b feature/mcp https://github.com/RactorLabs/tsbx.git
cd tsbx
# Follow the build instructions in the TSBX README
# (Mac-specific build instructions are documented there)
# Start TSBX (typically on port 3000)
# Refer to the TSBX README for specific startup commands
Update your .env for local TSBX:
# Point to your local TSBX instance
TASKSANDBOX_BASE_URL=http://localhost:3000
TASKSANDBOX_API_KEY=your_local_token_here
# Other required variables remain the same
INFERENCE_PROVIDER=Hyperbolic
INFERENCE_API_KEY=your_hyperbolic_key_here
Important: The feature/mcp branch contains the MCP support required for this benchmark. Make sure you're on this branch before building.
# Test GitHub MCP client
python test_github_mcp.py
The benchmark suite runner (run_benchmark_suite.py) automatically detects the MCP server type from the manifest and runs all tasks.
# Run all GitHub tasks with conventional scenario
python run_benchmark_suite.py \
--manifest github-data/manifest.json \
--scenario conventional
# Run specific GitHub tasks
python run_benchmark_suite.py \
--manifest github-data/manifest.json \
--tasks GH-SIMPLE-001,GH-SIMPLE-002,GH-SIMPLE-003
# Quick test with simple tasks only
python run_benchmark_suite.py \
--manifest github-data/manifest.json \
--level 1
run_benchmark_suite.py
└─> For each task:
│
├─> Scenario 1: Conventional
│ └─> run_conventional_scenario(task, mcp_type)
│ └─> LangGraph workflow (GPT-5.1)
│ ├─ load_tools (dynamically from MCP server)
│ ├─ select_tool (pick appropriate tool)
│ ├─ execute_tool (call MCP)
│ └─ process_response (format result)
│
└─> Scenario 2: TSBX
└─> run_tsbx_programmatic_scenario(task, mcp_type)
└─> TaskSandbox agent (native MCP support)
├─ Autonomous tool discovery
├─ Direct MCP tool calling
└─ Answer synthesis
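The four conventional steps in the diagram can be sketched as plain functions. This is an illustrative stand-in, not the repository's LangGraph workflow: the in-memory `FAKE_TOOLS` catalog replaces the MCP server, keyword matching replaces GPT-5.1's tool selection, and every function body is an assumption. Only the step names come from the diagram.

```python
# Stand-in tool catalog; a real run would list tools from the MCP server.
FAKE_TOOLS = {
    "list_issues": lambda args: [{"number": 1, "title": "Example issue"}],
    "list_commits": lambda args: [{"sha": "abc123", "message": "Example commit"}],
}


def load_tools():
    """Step 1: return every available tool (the full catalog goes into context)."""
    return dict(FAKE_TOOLS)


def select_tool(task, tools):
    """Step 2: pick a tool. Keyword matching stands in for the LLM's choice."""
    for name in tools:
        keyword = name.split("_", 1)[1]  # e.g. "issues" from "list_issues"
        if keyword in task.lower():
            return name
    raise ValueError(f"no tool matches task: {task!r}")


def execute_tool(name, tools, args=None):
    """Step 3: call the selected tool (an MCP call in the real workflow)."""
    return tools[name](args or {})


def process_response(name, raw):
    """Step 4: format the raw tool output into a result record."""
    return {"tool": name, "item_count": len(raw), "items": raw}


def run_conventional(task):
    """Chain the four steps in the order shown in the diagram."""
    tools = load_tools()
    name = select_tool(task, tools)
    raw = execute_tool(name, tools)
    return process_response(name, raw)


if __name__ == "__main__":
    print(run_conventional("List open issues for the repository"))
```

In the TSBX scenario these steps happen inside the agent, so the calling script would only see the task going in and the synthesized answer coming out.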
By Level (1=simple, 6=complex):
--level 1 # Only Level 1 tasks
--level 1-3 # Level 1, 2, and 3 tasks
--level 4-6 # Complex tasks only
By Difficulty (1=easy, 6=hard):
--difficulty 1-2 # Easy tasks
--difficulty 5-6 # Hard tasks
By Type:
--type read # Read-only operations
--type write # Operations that modify data
Results organized by MCP server type:
- GitHub: results/benchmark_suite/github/{scenario}_{timestamp}/
Per-task JSON (task_GH-SIMPLE-001_conventional.json):
{
"task_id": "GH-SIMPLE-001",
"task_name": "List repositories for user",
"scenario": "conventional",
"execution_model": "gpt-5.1",
"total_metrics": {
"total_tokens": 8765,
"total_cost": 0.0987,
"total_latency_seconds": 2.5
},
"success": true,
"accuracy_score": 1.0
}
Summary statistics (summary.json):
{
"total_tasks": 10,
"successful_tasks": 9,
"success_rate": 0.90,
"total_tokens": 87650,
"total_cost": 5.50,
"avg_tokens_per_task": 8765,
"avg_cost_per_task": 0.55,
"by_level": {
"1": {"count": 5, "success_rate": 1.0, "avg_cost": 0.25},
"2": {"count": 3, "success_rate": 0.83, "avg_cost": 0.35}
}
}
Run multiple scenarios:
# Run both conventional and TSBX scenarios together
python run_benchmark_suite.py --manifest github-data/manifest.json --scenario both
Compare scenarios (Conventional vs TSBX):
# Run BOTH scenarios and auto-generate comparison (recommended!)
python run_benchmark_suite.py \
--manifest github-data/manifest.json \
--scenario both
# This will:
# 1. Run conventional scenario
# 2. Run TSBX scenario
# 3. Auto-generate timestamped comparison file: results/comparison_github_TIMESTAMP.json
# 4. Print comparison summary to console
# Or run scenarios individually:
# Conventional only
python run_benchmark_suite.py --manifest github-data/manifest.json
# TSBX only
python run_benchmark_suite.py --manifest github-data/manifest.json --scenario tsbx
# Manual comparison of individual runs
python compare_scenarios.py \
results/benchmark_suite/github/conventional_TIMESTAMP \
results/benchmark_suite/github/tsbx_TIMESTAMP
Benchmark suite results:
# View summary statistics
cat results/benchmark_suite/github/conventional_TIMESTAMP/summary.json
# View individual task result
cat results/benchmark_suite/github/conventional_TIMESTAMP/task_GH-SIMPLE-001_conventional.json
# Compare metrics across scenarios
jq '.total_cost' results/benchmark_suite/github/*/summary.json
| Metric | Description |
|---|---|
| Input Tokens | Tokens sent to GPT-5.1 (includes full tool catalog) |
| Output Tokens | Tokens generated by GPT-5.1 |
| Total Tokens | Sum of input + output |
| Cost | USD cost based on GPT-5.1 pricing ($1.25/1M input, $10/1M output) |
| Latency | End-to-end execution time (seconds) |
| DIRT | Data In Remote Transit (KB) - includes LLM, MCP, and LangSmith overhead |
| LLM Calls | Number of calls to GPT-5.1 |
| Tool Calls | Number of MCP operations |
DIRT tracks all data transferred over the network:
- LLM KB: Data sent to/from OpenAI API
- Tool KB: Data sent to/from MCP server
- LangSmith KB: Estimated observability overhead (1.5x LLM traffic)
Why estimate LangSmith? LangSmith runs in the background automatically (when LANGCHAIN_TRACING_V2=true). We can't directly measure it without network interception, but we know it sends trace data roughly equal to LLM traffic plus metadata, so we estimate it as 1.5x.
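The DIRT breakdown described above can be written as a small calculation. This is a sketch under stated assumptions, not the code in dirt_tracker.py: the function name and the `tracing_enabled` switch are hypothetical, and treating LangSmith traffic as zero when tracing is off is an assumption. The 1.5x factor is the estimate given in the text.

```python
LANGSMITH_FACTOR = 1.5  # from the text: trace data estimated as 1.5x LLM traffic


def estimate_dirt_kb(llm_kb, tool_kb, tracing_enabled=True):
    """Return a DIRT breakdown in KB, with LangSmith overhead estimated."""
    # Assumption: no LangSmith traffic at all when tracing is disabled.
    langsmith_kb = LANGSMITH_FACTOR * llm_kb if tracing_enabled else 0.0
    return {
        "llm_kb": llm_kb,
        "tool_kb": tool_kb,
        "langsmith_kb": langsmith_kb,
        "total_kb": llm_kb + tool_kb + langsmith_kb,
    }


if __name__ == "__main__":
    # Illustrative inputs only, not measured values.
    print(estimate_dirt_kb(llm_kb=40.0, tool_kb=10.0))
```

Because the LangSmith term scales with LLM traffic, any reduction in prompt size lowers the DIRT total by 2.5x that reduction while tracing is on.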
The @modelcontextprotocol/server-github provides tools for:
Repositories:
- List repositories for user/org
- Get repository metadata
- Create/update repositories
Issues:
- List issues (with filters)
- Search issues by label
- Create/update issues
Pull Requests:
- List pull requests
- Get PR review status
- Create/update PRs
Commits:
- List commits
- Get commit details
And more: run python test_github_mcp.py to see all available tools.
python run_benchmark_suite.py [OPTIONS]
Options:
--manifest TEXT Path to benchmark manifest JSON (required)
--scenario TEXT conventional | tsbx | tsbx-pg | both | all (default: conventional)
--level TEXT Filter by level (e.g., "1", "1-3", "4-6")
--difficulty TEXT Filter by difficulty (e.g., "1-3")
--type TEXT Filter by type (read | write)
--tasks TEXT Comma-separated task IDs (e.g., "GH-SIMPLE-001,GH-MEDIUM-005")
--iterations INT Number of times to run each task (default: 1)
--output-dir TEXT Output directory (default: results/benchmark_suite)
Note: The MCP server type is automatically detected from the manifest metadata.
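The range syntax accepted by --level and --difficulty (for example "1" or "1-3") could be parsed along these lines. This is a hedged sketch rather than the runner's actual parsing code: `parse_range` and `filter_by_level` are hypothetical helpers, and the `level` key on each task is an assumed manifest field.

```python
def parse_range(spec):
    """Parse "3" or "1-3" into an inclusive (low, high) tuple."""
    low, sep, high = spec.strip().partition("-")
    low = int(low)
    high = int(high) if sep else low  # a single number means low == high
    if low > high:
        raise ValueError(f"invalid range: {spec!r}")
    return low, high


def filter_by_level(tasks, spec):
    """Keep tasks whose assumed "level" field falls inside the parsed range."""
    low, high = parse_range(spec)
    return [task for task in tasks if low <= task["level"] <= high]


if __name__ == "__main__":
    # Placeholder tasks for illustration only.
    sample = [{"id": "a", "level": 1}, {"id": "b", "level": 3}, {"id": "c", "level": 5}]
    print(filter_by_level(sample, "1-3"))
```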
When LANGCHAIN_TRACING_V2=true is set, all LangGraph/LangChain operations are automatically traced to LangSmith:
- View full execution traces
- See all LLM calls with prompts/responses
- Track tool execution
- Analyze latency bottlenecks
View traces: https://smith.langchain.com/ (project: mcp-benchmark)
Error: Authentication credentials not found
Solution: Verify your GitHub Personal Access Token:
- Check it starts with ghp_ or github_pat_
- Verify scopes include repo, read:org, read:user
- Regenerate token if needed
Error: @<package>/mcp-server not found
Solution: The MCP server runs via npx, which downloads it on-demand. Ensure you have Node.js/npm installed:
node --version # Should be v18+
npm --version
The factory couldn't find the GitHub client. Make sure:
- github_mcp_client.py exists in src/clients/
- No syntax errors in the file
- Run python test_github_mcp.py to test
SyntaxError: invalid syntax
Solution: Ensure you're using Python 3.13+:
python3 --version
source .venv/bin/activate # Use the venv
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Usage |
|---|---|---|---|
| GPT-5.1 | $1.25 | $10.00 | Conventional scenario (tool selection & execution) |
| TSBX Agent | Variable | Variable | TSBX scenario (autonomous execution with native MCP) |
Cost Tracking: Results track total tokens and cost per scenario for direct comparison.
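For the conventional scenario, per-request cost follows directly from the GPT-5.1 prices listed in this README. The sketch below shows the arithmetic; the function name is hypothetical and the token counts in the example are illustrative, not measured.

```python
INPUT_USD_PER_MTOK = 1.25    # GPT-5.1 input price per 1M tokens (from this README)
OUTPUT_USD_PER_MTOK = 10.00  # GPT-5.1 output price per 1M tokens (from this README)


def request_cost_usd(input_tokens, output_tokens):
    """Return the USD cost of one request at the listed GPT-5.1 prices."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


if __name__ == "__main__":
    # Illustrative counts: an ~11K-token tool catalog plus a short reply.
    print(f"${request_cost_usd(11_000, 100):.5f}")
```

With these illustrative counts the input side accounts for most of the cost, which is why the size of the tool catalog matters in the conventional scenario.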
- Does full tool context justify the cost?
  - Conventional passes all tools to LLM (~11K+ tokens)
  - TSBX has minimal orchestration overhead
  - What's the cost difference in practice?
- Control vs. Efficiency
  - Conventional: Full control, full observability, high cost
  - TSBX: Autonomous execution, minimal cost
  - Which approach is better for different task types?
- Native MCP Support Impact
  - How does TSBX's native MCP tool calling compare to LangGraph orchestration?
  - What are the performance and cost implications?
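One way to put numbers on these questions is to compare two summary dictionaries metric by metric. The sketch below is not the repository's compare_scenarios.py: `compare_summaries` is a hypothetical helper, and it assumes both scenarios write the same `total_tokens` and `total_cost` keys to summary.json. The input values are placeholders, not benchmark results.

```python
def compare_summaries(conventional, tsbx, keys=("total_tokens", "total_cost")):
    """Return absolute and percentage savings of TSBX relative to conventional."""
    report = {}
    for key in keys:
        base, other = conventional[key], tsbx[key]
        saved = base - other
        # Guard against division by zero when the baseline metric is 0.
        saved_pct = (saved / base * 100.0) if base else 0.0
        report[key] = {
            "conventional": base,
            "tsbx": other,
            "saved": saved,
            "saved_pct": saved_pct,
        }
    return report


if __name__ == "__main__":
    # Placeholder values for illustration only.
    conventional = {"total_tokens": 1000, "total_cost": 2.0}
    tsbx = {"total_tokens": 250, "total_cost": 0.5}
    for metric, row in compare_summaries(conventional, tsbx).items():
        print(metric, row)
```

A negative `saved` value would mean TSBX used more of that metric than the conventional run.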
github-mcp-benchmark/
├── src/
│ ├── clients/
│ │ ├── base_mcp_client.py # Abstract base class
│ │ ├── mcp_client_factory.py # Factory pattern
│ │ ├── github_mcp_client.py # GitHub implementation
│ │ └── tasksandbox_client.py # TaskSandbox client
│ ├── scenarios/
│ │ ├── scenario1_conventional.py # Conventional approach
│ │ └── scenario3_tsbx_direct.py # TSBX delegation
│ ├── measurement/
│ │ └── dirt_tracker.py # DIRT metrics
│ └── utils/
│ ├── config.py # Configuration
│ └── validator.py # Ground truth validation
├── github-data/ # GitHub manifests & ground truth
├── results/
│ └── benchmark_suite/
│ └── github/ # GitHub results
├── run_benchmark_suite.py # Main runner
├── test_github_mcp.py # Test GitHub client
└── README.md # This file
- Model Context Protocol
- GitHub MCP Server
- LangGraph Documentation
- GAIA Benchmark - Reference implementation
Built with: Python 3.13, LangGraph, OpenAI GPT-5.1, Multiple MCP Servers
December 2025