GitHub MCP Benchmark

Compare token usage, cost, and DIRT (Data In Remote Transit) across two architectural approaches to using Model Context Protocol (MCP) servers with GitHub.

Supported MCP Server

MCP Server	Package	Use Case	Tools
GitHub	`@modelcontextprotocol/server-github`	Repository operations (issues, PRs, commits)	10+ tools

Two Scenarios

1. Conventional - Full MCP Context

A LangGraph agent loads ALL available MCP tools into context. GPT-5.1 then selects and executes the appropriate tool.

Trade-off: Full control + observability, high token cost
DIRT: ~170 KB | Cost: ~$0.015/request

2. TSBX - Direct Delegation

The task is delegated directly to TaskSandbox (TSBX), whose agent has native MCP tool-calling support and calls MCP tools directly.

Trade-off: Lowest tokens/DIRT, autonomous execution
DIRT: ~10-20 KB | Cost: ~$0.001-0.003/request

Architecture Comparison

┌─────────────────────────────────────────────────────────────┐
│ Scenario 1: Conventional                                    │
│   LangGraph → list ALL tools → GPT-5.1 selects → MCP        │
│   Trade-off: Full control, high tokens                      │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ Scenario 2: TSBX                                            │
│   Script → TSBX agent (with MCP) → MCP Server → Result      │
│   Trade-off: Lowest cost, black box                         │
└─────────────────────────────────────────────────────────────┘

Quick Start

TL;DR - Run Both Scenarios with Auto-Comparison

# Activate environment
source .venv/bin/activate

# Run both scenarios (conventional + TSBX) and auto-generate comparison
python run_benchmark_suite.py \
  --manifest github-data/manifest.json \
  --scenario both

# Output:
# - results/benchmark_suite/github/conventional_TIMESTAMP/
# - results/benchmark_suite/github/tsbx_TIMESTAMP/
# - results/comparison_github_TIMESTAMP.json (auto-generated!)

1. Prerequisites

# Python 3.13+ required
python3 --version

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Configure Environment

Create .env file:

# Required: OpenAI (GPT-5.1 for conventional scenario)
OPENAI_API_KEY=sk-proj-xxxxx

# MCP Server API Keys
# GitHub MCP
GITHUB_PERSONAL_ACCESS_TOKEN=ghp_xxxxx

# Required for Scenario 2: TaskSandbox (local instance)
TASKSANDBOX_API_KEY=your_local_token_here
TASKSANDBOX_BASE_URL=http://localhost:3000
INFERENCE_PROVIDER=Hyperbolic
INFERENCE_API_KEY=your_hyperbolic_key_here

# Optional: LangSmith observability
LANGCHAIN_API_KEY=lsv2_pt_xxxxx
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT=mcp-benchmark
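
Before a run it can help to confirm that the variables above are actually set. The following is a minimal sketch, not part of the benchmark code: the helper name `missing_env_vars` is hypothetical, and the split between variables needed per scenario is an assumption read off the comments in the .env example.

```python
import os

# Variable names come from the .env example above. Which scenario needs
# which variable is an assumption inferred from the comments in that file.
TSBX_VARS = [
    "TASKSANDBOX_API_KEY",
    "TASKSANDBOX_BASE_URL",
    "INFERENCE_PROVIDER",
    "INFERENCE_API_KEY",
]


def missing_env_vars(scenario, env=None):
    """Return the required variable names that are unset or empty.

    `scenario` is one of "conventional", "tsbx", or "both".
    `env` defaults to os.environ; pass a dict to check other settings.
    """
    env = os.environ if env is None else env
    required = ["GITHUB_PERSONAL_ACCESS_TOKEN"]
    if scenario in ("conventional", "both"):
        required.append("OPENAI_API_KEY")
    if scenario in ("tsbx", "both"):
        required.extend(TSBX_VARS)
    return [name for name in required if not env.get(name)]
```

Calling `missing_env_vars("both")` after loading the .env file returns an empty list when everything needed for both scenarios is present.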

Getting API Keys

GitHub Personal Access Token:

1. Visit https://github.com/settings/tokens
2. Generate new token (classic)
3. Select scopes: repo, read:org, read:user

3. Run TSBX Locally

Note: There is currently no hosted version of TSBX available. To run the TSBX scenario, you'll need to run TSBX locally.

Clone and run TSBX with MCP support:

# Clone the TSBX repository with MCP support
git clone -b feature/mcp https://github.com/RactorLabs/tsbx.git
cd tsbx

# Follow the build instructions in the TSBX README
# (Mac-specific build instructions are documented there)

# Start TSBX (typically on port 3000)
# Refer to TSBX README for specific startup commands

Update your .env for local TSBX:

# Point to your local TSBX instance
TASKSANDBOX_BASE_URL=http://localhost:3000
TASKSANDBOX_API_KEY=your_local_token_here

# Other required variables remain the same
INFERENCE_PROVIDER=Hyperbolic
INFERENCE_API_KEY=your_hyperbolic_key_here

Important: The feature/mcp branch contains the MCP support required for this benchmark. Make sure you're on this branch before building.

4. Test MCP Connections

# Test GitHub MCP client
python test_github_mcp.py

5. Run Benchmark Suite

The benchmark suite runner (run_benchmark_suite.py) automatically detects the MCP server type from the manifest and runs all tasks.

Run GitHub Benchmarks

# Run all GitHub tasks with conventional scenario
python run_benchmark_suite.py \
  --manifest github-data/manifest.json \
  --scenario conventional

# Run specific GitHub tasks
python run_benchmark_suite.py \
  --manifest github-data/manifest.json \
  --tasks GH-SIMPLE-001,GH-SIMPLE-002,GH-SIMPLE-003

# Quick test with simple tasks only
python run_benchmark_suite.py \
  --manifest github-data/manifest.json \
  --level 1

Architecture

run_benchmark_suite.py
  └─> For each task:
        │
        ├─> Scenario 1: Conventional
        │     └─> run_conventional_scenario(task, mcp_type)
        │           └─> LangGraph workflow (GPT-5.1)
        │                 ├─ load_tools (dynamically from MCP server)
        │                 ├─ select_tool (pick appropriate tool)
        │                 ├─ execute_tool (call MCP)
        │                 └─ process_response (format result)
        │
        └─> Scenario 2: TSBX
              └─> run_tsbx_programmatic_scenario(task, mcp_type)
                    └─> TaskSandbox agent (native MCP support)
                          ├─ Autonomous tool discovery
                          ├─ Direct MCP tool calling
                          └─ Answer synthesis
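
The four conventional stages in the tree above can be sketched in plain Python. This is an illustrative stand-in under stated assumptions, not the LangGraph workflow from the repository: the tool dictionaries, the `choose` callback (which plays the role of GPT-5.1), and the function `run_conventional` are all hypothetical.

```python
def load_tools(server_tools):
    """Stage 1: expose every tool the MCP server advertises."""
    return list(server_tools)


def select_tool(task, tools, choose):
    """Stage 2: hand the full catalog to `choose` and look up its pick.

    In the benchmark this decision is made by GPT-5.1; here `choose` is any
    callable taking (task, [(name, description), ...]) and returning a name.
    """
    catalog = [(tool["name"], tool["description"]) for tool in tools]
    chosen_name = choose(task, catalog)
    return next(tool for tool in tools if tool["name"] == chosen_name)


def execute_tool(tool, arguments):
    """Stage 3: call the selected tool (an MCP call in the benchmark)."""
    return tool["run"](arguments)


def process_response(task_id, raw_result):
    """Stage 4: wrap the raw tool output in a result record."""
    return {"task_id": task_id, "result": raw_result}


def run_conventional(task_id, task, server_tools, choose, arguments):
    """Chain the four stages in the order shown in the tree above."""
    tools = load_tools(server_tools)
    tool = select_tool(task, tools, choose)
    return process_response(task_id, execute_tool(tool, arguments))
```

The point of the sketch is the shape of stage 2: the selector sees the whole catalog on every request, which is where the conventional scenario spends most of its input tokens.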

Filtering Options

By Level (1=simple, 6=complex):

--level 1              # Only Level 1 tasks
--level 1-3            # Level 1, 2, and 3 tasks
--level 4-6            # Complex tasks only

By Difficulty (1=easy, 6=hard):

--difficulty 1-2       # Easy tasks
--difficulty 5-6       # Hard tasks

By Type:

--type read            # Read-only operations
--type write           # Operations that modify data
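
The range syntax used by --level and --difficulty ("1" or "1-3") can be interpreted as an inclusive interval. Below is a minimal sketch of that interpretation; the helper names and the task fields `level`, `difficulty`, and `type` are assumptions for illustration, since the manifest schema is not shown here.

```python
def parse_range(spec):
    """Parse "2" or "1-3" into an inclusive (low, high) pair of ints."""
    low_text, dash, high_text = spec.strip().partition("-")
    low = int(low_text)
    high = int(high_text) if dash else low
    if low > high:
        raise ValueError(f"empty range: {spec!r}")
    return low, high


def filter_tasks(tasks, level=None, difficulty=None, task_type=None):
    """Keep only the tasks that satisfy every filter that was supplied."""
    selected = []
    for task in tasks:
        if level is not None:
            low, high = parse_range(level)
            if not low <= task["level"] <= high:
                continue
        if difficulty is not None:
            low, high = parse_range(difficulty)
            if not low <= task["difficulty"] <= high:
                continue
        if task_type is not None and task["type"] != task_type:
            continue
        selected.append(task)
    return selected
```

Filters combine with AND in this sketch, so `filter_tasks(tasks, level="1-3", task_type="read")` keeps only read-only tasks at levels 1 through 3.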

Output & Results

Results are organized by MCP server type:

GitHub: results/benchmark_suite/github/{scenario}_{timestamp}/

Per-task JSON (task_GH-SIMPLE-001_conventional.json):

{
  "task_id": "GH-SIMPLE-001",
  "task_name": "List repositories for user",
  "scenario": "conventional",
  "execution_model": "gpt-5.1",
  "total_metrics": {
    "total_tokens": 8765,
    "total_cost": 0.0987,
    "total_latency_seconds": 2.5
  },
  "success": true,
  "accuracy_score": 1.0
}

Summary statistics (summary.json):

{
  "total_tasks": 10,
  "successful_tasks": 9,
  "success_rate": 0.90,
  "total_tokens": 87650,
  "total_cost": 5.50,
  "avg_tokens_per_task": 8765,
  "avg_cost_per_task": 0.55,
  "by_level": {
    "1": {"count": 5, "success_rate": 1.0, "avg_cost": 0.25},
    "2": {"count": 3, "success_rate": 0.83, "avg_cost": 0.35}
  }
}
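
The top-level summary fields can be derived from the per-task records. The sketch below shows one way to do that, assuming each record has the `success` and `total_metrics` fields from the per-task example above; the function name `summarize` is hypothetical, and `by_level` is omitted because the per-task example does not show a level field.

```python
def summarize(results):
    """Aggregate per-task result dicts into the top-level summary fields."""
    total = len(results)
    successful = sum(1 for r in results if r["success"])
    tokens = sum(r["total_metrics"]["total_tokens"] for r in results)
    cost = sum(r["total_metrics"]["total_cost"] for r in results)
    return {
        "total_tasks": total,
        "successful_tasks": successful,
        # Guard against an empty run so the averages stay defined.
        "success_rate": successful / total if total else 0.0,
        "total_tokens": tokens,
        "total_cost": cost,
        "avg_tokens_per_task": tokens / total if total else 0.0,
        "avg_cost_per_task": cost / total if total else 0.0,
    }
```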

Common Workflows

Run multiple scenarios:

# Run both conventional and TSBX scenarios together
python run_benchmark_suite.py --manifest github-data/manifest.json --scenario both

Compare scenarios (Conventional vs TSBX):

# Run BOTH scenarios and auto-generate comparison (recommended!)
python run_benchmark_suite.py \
  --manifest github-data/manifest.json \
  --scenario both

# This will:
# 1. Run conventional scenario
# 2. Run TSBX scenario
# 3. Auto-generate timestamped comparison file: results/comparison_github_TIMESTAMP.json
# 4. Print comparison summary to console

# Or run scenarios individually:
# Conventional only
python run_benchmark_suite.py --manifest github-data/manifest.json

# TSBX only
python run_benchmark_suite.py --manifest github-data/manifest.json --scenario tsbx

# Manual comparison of individual runs
python compare_scenarios.py \
  results/benchmark_suite/github/conventional_TIMESTAMP \
  results/benchmark_suite/github/tsbx_TIMESTAMP
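
A comparison between two runs boils down to relative savings on the summary fields. The following is a sketch of that arithmetic only; it does not reproduce the output format of compare_scenarios.py, and the function name and result keys are hypothetical.

```python
def compare_summaries(conventional, tsbx):
    """Return percentage savings of the TSBX run relative to conventional.

    Both arguments are dicts shaped like summary.json, of which only
    `total_tokens` and `total_cost` are read.
    """

    def pct_saved(key):
        baseline = conventional[key]
        if baseline == 0:
            return 0.0  # nothing to save against an empty baseline
        return 100.0 * (baseline - tsbx[key]) / baseline

    return {
        "token_savings_pct": pct_saved("total_tokens"),
        "cost_savings_pct": pct_saved("total_cost"),
    }
```

A negative percentage means the TSBX run used more than the conventional run for that metric.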

6. View Results

Benchmark suite results:

# View summary statistics
cat results/benchmark_suite/github/conventional_TIMESTAMP/summary.json

# View individual task result
cat results/benchmark_suite/github/conventional_TIMESTAMP/task_GH-SIMPLE-001_conventional.json

# Compare metrics across scenarios
jq '.total_cost' results/benchmark_suite/github/*/summary.json

Metrics Measured

Metric	Description
Input Tokens	Tokens sent to GPT-5.1 (includes full tool catalog)
Output Tokens	Tokens generated by GPT-5.1
Total Tokens	Sum of input + output
Cost	USD cost based on GPT-5.1 pricing ($1.25/1M input, $10/1M output)
Latency	End-to-end execution time (seconds)
DIRT	Data In Remote Transit (KB) - includes LLM, MCP, and LangSmith overhead
LLM Calls	Number of calls to GPT-5.1
Tool Calls	Number of MCP operations
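
The Cost metric follows directly from the token counts and the GPT-5.1 prices in the table. A minimal sketch of the arithmetic, with a hypothetical function name:

```python
# Prices are the GPT-5.1 figures quoted in the table above.
INPUT_USD_PER_MILLION = 1.25
OUTPUT_USD_PER_MILLION = 10.00


def request_cost(input_tokens, output_tokens):
    """Return the USD cost of one request from its token counts."""
    return (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    ) / 1_000_000
```

As a worked example, a request that sends an 11,000-token tool catalog and receives an assumed 100 output tokens costs (11,000 × 1.25 + 100 × 10) / 1,000,000 = $0.01475, in line with the roughly $0.015/request quoted for the conventional scenario.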

DIRT Breakdown

DIRT tracks all data transferred over the network:

LLM KB: Data sent to/from OpenAI API
Tool KB: Data sent to/from MCP server
LangSmith KB: Estimated observability overhead (1.5x LLM traffic)

Why estimate LangSmith? LangSmith runs in the background automatically (when LANGCHAIN_TRACING_V2=true). We can't directly measure it without network interception, but we know it sends trace data roughly equal to LLM traffic plus metadata, so we estimate it as 1.5x.
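
The breakdown above can be expressed as a short calculation. This is a sketch rather than the logic in dirt_tracker.py: the function name is hypothetical, and treating 1 KB as 1024 bytes is an assumption.

```python
LANGSMITH_FACTOR = 1.5  # estimate from the text: 1.5x LLM traffic
BYTES_PER_KB = 1024     # assumption; the source does not define KB


def dirt_kb(llm_bytes, tool_bytes, tracing_enabled=True):
    """Return the DIRT components and their total, in KB."""
    llm_kb = llm_bytes / BYTES_PER_KB
    tool_kb = tool_bytes / BYTES_PER_KB
    # LangSmith traffic is estimated, not measured, and only applies
    # when tracing is switched on (LANGCHAIN_TRACING_V2=true).
    langsmith_kb = llm_kb * LANGSMITH_FACTOR if tracing_enabled else 0.0
    return {
        "llm_kb": llm_kb,
        "tool_kb": tool_kb,
        "langsmith_kb": langsmith_kb,
        "total_kb": llm_kb + tool_kb + langsmith_kb,
    }
```

With tracing enabled, the estimated LangSmith share is larger than the measured LLM traffic itself, so whether tracing was on matters when comparing DIRT figures between runs.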

MCP Server Tools

GitHub MCP (10+ Tools)

The @modelcontextprotocol/server-github provides tools for:

Repositories:

List repositories for user/org
Get repository metadata
Create/update repositories

Issues:

List issues (with filters)
Search issues by label
Create/update issues

Pull Requests:

List pull requests
Get PR review status
Create/update PRs

Commits:

List commits
Get commit details

And more: run python test_github_mcp.py to see all available tools.

CLI Options

Benchmark Suite Runner (`run_benchmark_suite.py`)

python run_benchmark_suite.py [OPTIONS]

Options:
  --manifest TEXT         Path to benchmark manifest JSON (required)
  --scenario TEXT         conventional | tsbx | tsbx-pg | both | all (default: conventional)
  --level TEXT           Filter by level (e.g., "1", "1-3", "4-6")
  --difficulty TEXT      Filter by difficulty (e.g., "1-3")
  --type TEXT            Filter by type (read | write)
  --tasks TEXT           Comma-separated task IDs (e.g., "GH-SIMPLE-001,GH-MEDIUM-005")
  --iterations INT       Number of times to run each task (default: 1)
  --output-dir TEXT      Output directory (default: results/benchmark_suite)

Note: The MCP server type is automatically detected from the manifest metadata.

Observability with LangSmith

When LANGCHAIN_TRACING_V2=true is set, all LangGraph/LangChain operations are automatically traced to LangSmith:

View full execution traces
See all LLM calls with prompts/responses
Track tool execution
Analyze latency bottlenecks

View traces: https://smith.langchain.com/ (project: mcp-benchmark)

Troubleshooting

Authentication Error (401)

Error: Authentication credentials not found

Solution: Verify your GitHub Personal Access Token:

Check it starts with ghp_ or github_pat_
Verify scopes include repo, read:org, read:user
Regenerate token if needed
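
The prefix check in the first step is easy to automate. A small sketch with a hypothetical helper name; it only inspects the prefix and says nothing about whether the token is valid or has the right scopes.

```python
KNOWN_PREFIXES = ("ghp_", "github_pat_")  # prefixes listed above


def has_known_token_prefix(token):
    """Return True if the token starts with one of the known prefixes."""
    return token.strip().startswith(KNOWN_PREFIXES)
```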

MCP Server Not Found

Error: @<package>/mcp-server not found

Solution: The MCP server runs via npx, which downloads it on-demand. Ensure you have Node.js/npm installed:

node --version  # Should be v18+
npm --version
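
The same check can be run from Python before starting a benchmark. This is a minimal sketch with hypothetical function names; the v18 minimum is the figure given above.

```python
import shutil
import subprocess


def parse_node_major(version_text):
    """Extract the major version from `node --version` output, e.g. "v18.19.0"."""
    return int(version_text.strip().lstrip("v").split(".")[0])


def node_is_ready(minimum_major=18):
    """Return True if node and npx are on PATH and node is new enough."""
    if shutil.which("node") is None or shutil.which("npx") is None:
        return False
    completed = subprocess.run(
        ["node", "--version"], capture_output=True, text=True, check=True
    )
    return parse_node_major(completed.stdout) >= minimum_major
```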

Error: "Unknown MCP type: 'github'"

The factory couldn't find the GitHub client. Make sure:

github_mcp_client.py exists in src/clients/
No syntax errors in the file
Run python test_github_mcp.py to test

Python Version Error

SyntaxError: invalid syntax

Solution: Ensure you're using Python 3.13+:

python3 --version
source .venv/bin/activate  # Use the venv

Model Pricing

Execution Models

Model	Input (per 1M tokens)	Output (per 1M tokens)	Usage
GPT-5.1	$1.25	$10.00	Conventional scenario (tool selection & execution)
TSBX Agent	Variable	Variable	TSBX scenario (autonomous execution with native MCP)

Cost Tracking: Results track total tokens and cost per scenario for direct comparison.

Research Questions

1. Does full tool context justify the cost?
   - Conventional passes all tools to LLM (~11K+ tokens)
   - TSBX has minimal orchestration overhead
   - What's the cost difference in practice?
2. Control vs. Efficiency
   - Conventional: Full control, full observability, high cost
   - TSBX: Autonomous execution, minimal cost
   - Which approach is better for different task types?
3. Native MCP Support Impact
   - How does TSBX's native MCP tool calling compare to LangGraph orchestration?
   - What are the performance and cost implications?

Project Structure

github-mcp-benchmark/
├── src/
│   ├── clients/
│   │   ├── base_mcp_client.py          # Abstract base class
│   │   ├── mcp_client_factory.py       # Factory pattern
│   │   ├── github_mcp_client.py        # GitHub implementation
│   │   └── tasksandbox_client.py       # TaskSandbox client
│   ├── scenarios/
│   │   ├── scenario1_conventional.py   # Conventional approach
│   │   └── scenario3_tsbx_direct.py    # TSBX delegation
│   ├── measurement/
│   │   └── dirt_tracker.py             # DIRT metrics
│   └── utils/
│       ├── config.py                   # Configuration
│       └── validator.py                # Ground truth validation
├── github-data/                        # GitHub manifests & ground truth
├── results/
│   └── benchmark_suite/
│       └── github/                     # GitHub results
├── run_benchmark_suite.py              # Main runner
├── test_github_mcp.py                  # Test GitHub client
└── README.md                           # This file

References

Model Context Protocol
GitHub MCP Server
LangGraph Documentation
GAIA Benchmark - Reference implementation

Built with: Python 3.13, LangGraph, OpenAI GPT-5.1, Multiple MCP Servers

December 2025
