RactorLabs/benchmark-github-mcp

GitHub MCP Benchmark

Compare token usage, cost, and DIRT (Data in Remote Transit) across architectural approaches for using Model Context Protocol (MCP) servers with GitHub.

Supported MCP Server

| MCP Server | Package | Use Case | Tools |
|---|---|---|---|
| GitHub | @modelcontextprotocol/server-github | Repository operations (issues, PRs, commits) | 10+ tools |

Two Scenarios

1. Conventional - Full MCP Context

A LangGraph agent loads ALL available MCP tools into context. GPT-5.1 then selects and executes the appropriate tool.

  • Trade-off: Full control + observability, high token cost
  • DIRT: ~170 KB | Cost: ~$0.015/request

2. TSBX - Direct Delegation

Direct TaskSandbox (TSBX) delegation. The TSBX agent has native MCP tool-calling capabilities and calls MCP tools directly.

  • Trade-off: Lowest tokens/DIRT, autonomous execution
  • DIRT: ~10-20 KB | Cost: ~$0.001-0.003/request

Architecture Comparison

┌─────────────────────────────────────────────────────────────┐
│ Scenario 1: Conventional                                    │
│   LangGraph → list ALL tools → GPT-5.1 selects → MCP        │
│   Trade-off: Full control, high tokens                      │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ Scenario 2: TSBX                                            │
│   Script → TSBX agent (with MCP) → MCP Server → Result      │
│   Trade-off: Lowest cost, black box                         │
└─────────────────────────────────────────────────────────────┘

Quick Start

TL;DR - Run Both Scenarios with Auto-Comparison

# Activate environment
source .venv/bin/activate

# Run both scenarios (conventional + TSBX) and auto-generate comparison
python run_benchmark_suite.py \
  --manifest github-data/manifest.json \
  --scenario both

# Output:
# - results/benchmark_suite/github/conventional_TIMESTAMP/
# - results/benchmark_suite/github/tsbx_TIMESTAMP/
# - results/comparison_github_TIMESTAMP.json (auto-generated!)

1. Prerequisites

# Python 3.13+ required
python3 --version

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Configure Environment

Create .env file:

# Required: OpenAI (GPT-5.1 for conventional scenario)
OPENAI_API_KEY=sk-proj-xxxxx

# MCP Server API Keys
# GitHub MCP
GITHUB_PERSONAL_ACCESS_TOKEN=ghp_xxxxx

# Required for Scenario 2: TaskSandbox (local instance)
TASKSANDBOX_API_KEY=your_local_token_here
TASKSANDBOX_BASE_URL=http://localhost:3000
INFERENCE_PROVIDER=Hyperbolic
INFERENCE_API_KEY=your_hyperbolic_key_here

# Optional: LangSmith observability
LANGCHAIN_API_KEY=lsv2_pt_xxxxx
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT=mcp-benchmark
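
Before running a scenario it can help to confirm that the variables it needs are actually set. The sketch below is not part of the benchmark; the variable names come from the .env example above, while the grouping per scenario (including the assumption that TSBX also needs the GitHub token) and the helper name `missing_vars` are hypothetical. It reads the process environment, so the .env values must already be exported or loaded.

```python
import os

# Variable names are taken from the .env example; which scenario needs which
# variable is an assumption made for this sketch.
REQUIRED_VARS = {
    "conventional": ["OPENAI_API_KEY", "GITHUB_PERSONAL_ACCESS_TOKEN"],
    "tsbx": [
        "GITHUB_PERSONAL_ACCESS_TOKEN",
        "TASKSANDBOX_API_KEY",
        "TASKSANDBOX_BASE_URL",
        "INFERENCE_PROVIDER",
        "INFERENCE_API_KEY",
    ],
}


def missing_vars(scenario, env=None):
    """Return the required variables that are unset or empty for a scenario."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS[scenario] if not env.get(name)]


if __name__ == "__main__":
    for scenario in REQUIRED_VARS:
        missing = missing_vars(scenario)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{scenario}: {status}")
```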

Getting API Keys

GitHub Personal Access Token:

  1. Visit https://github.com/settings/tokens
  2. Generate new token (classic)
  3. Select scopes: repo, read:org, read:user

3. Run TSBX Locally

Note: There is currently no hosted version of TSBX available. To run the TSBX scenario, you'll need to run TSBX locally.

Clone and run TSBX with MCP support:

# Clone the TSBX repository with MCP support
git clone -b feature/mcp https://github.com/RactorLabs/tsbx.git
cd tsbx

# Follow the build instructions in the TSBX README
# (Mac-specific build instructions are documented there)

# Start TSBX (typically on port 3000)
# Refer to TSBX README for specific startup commands

Update your .env for local TSBX:

# Point to your local TSBX instance
TASKSANDBOX_BASE_URL=http://localhost:3000
TASKSANDBOX_API_KEY=your_local_token_here

# Other required variables remain the same
INFERENCE_PROVIDER=Hyperbolic
INFERENCE_API_KEY=your_hyperbolic_key_here

Important: The feature/mcp branch contains the MCP support required for this benchmark. Make sure you're on this branch before building.

4. Test MCP Connections

# Test GitHub MCP client
python test_github_mcp.py

5. Run Benchmark Suite

The benchmark suite runner (run_benchmark_suite.py) automatically detects the MCP server type from the manifest and runs all tasks.

Run GitHub Benchmarks

# Run all GitHub tasks with conventional scenario
python run_benchmark_suite.py \
  --manifest github-data/manifest.json \
  --scenario conventional

# Run specific GitHub tasks
python run_benchmark_suite.py \
  --manifest github-data/manifest.json \
  --tasks GH-SIMPLE-001,GH-SIMPLE-002,GH-SIMPLE-003

# Quick test with simple tasks only
python run_benchmark_suite.py \
  --manifest github-data/manifest.json \
  --level 1

Architecture

run_benchmark_suite.py
  └─> For each task:
        │
        ├─> Scenario 1: Conventional
        │     └─> run_conventional_scenario(task, mcp_type)
        │           └─> LangGraph workflow (GPT-5.1)
        │                 ├─ load_tools (dynamically from MCP server)
        │                 ├─ select_tool (pick appropriate tool)
        │                 ├─ execute_tool (call MCP)
        │                 └─ process_response (format result)
        │
        └─> Scenario 2: TSBX
              └─> run_tsbx_programmatic_scenario(task, mcp_type)
                    └─> TaskSandbox agent (native MCP support)
                          ├─ Autonomous tool discovery
                          ├─ Direct MCP tool calling
                          └─ Answer synthesis
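
The four conventional steps in the diagram can be sketched in plain Python. This is only an illustration of the control flow, not the benchmark's LangGraph code: the tool registry is fake, a keyword match stands in for the GPT-5.1 call, and every function body here is hypothetical.

```python
import json

# Hypothetical stand-in for the MCP server: two fake tools with canned output.
FAKE_TOOLS = {
    "list_issues": {
        "description": "List issues in a repository",
        "run": lambda args: [{"number": 1, "title": "Example issue"}],
    },
    "list_commits": {
        "description": "List commits in a repository",
        "run": lambda args: [{"sha": "abc123"}],
    },
}


def load_tools():
    """Step 1: fetch the full tool catalog (here, from the fake registry)."""
    return [{"name": n, "description": t["description"]} for n, t in FAKE_TOOLS.items()]


def select_tool(task, tools):
    """Step 2: pick a tool. A keyword match stands in for the LLM call."""
    for tool in tools:
        keyword = tool["name"].split("_", 1)[1]  # e.g. "issues"
        if keyword in task.lower():
            return tool["name"]
    raise ValueError(f"no tool matches task: {task!r}")


def execute_tool(name, args):
    """Step 3: call the selected tool."""
    return FAKE_TOOLS[name]["run"](args)


def process_response(name, result):
    """Step 4: format the raw result."""
    return {"tool": name, "items": len(result), "result": result}


def run_pipeline(task):
    tools = load_tools()
    # In the conventional scenario the whole catalog goes into the prompt,
    # so its serialized size is a rough proxy for per-request overhead.
    catalog_bytes = len(json.dumps(tools).encode("utf-8"))
    name = select_tool(task, tools)
    output = process_response(name, execute_tool(name, {}))
    output["catalog_bytes"] = catalog_bytes
    return output


print(run_pipeline("List open issues for the repo"))
```

The point of the sketch is the `catalog_bytes` line: the catalog is serialized on every request regardless of which single tool ends up being called.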

Filtering Options

By Level (1=simple, 6=complex):

--level 1              # Only Level 1 tasks
--level 1-3            # Level 1, 2, and 3 tasks
--level 4-6            # Complex tasks only

By Difficulty (1=easy, 6=hard):

--difficulty 1-2       # Easy tasks
--difficulty 5-6       # Hard tasks

By Type:

--type read            # Read-only operations
--type write           # Operations that modify data
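
One way to read these filters: a single number selects exactly that level, and "A-B" selects the inclusive range. The sketch below shows that interpretation; the manifest field names (`level`, `difficulty`, `type`) and both helper functions are assumptions for illustration, not the runner's actual code.

```python
def parse_range(spec):
    """Parse "3" or "1-3" into an inclusive (low, high) pair."""
    low_text, sep, high_text = spec.partition("-")
    low = int(low_text)
    high = int(high_text) if sep else low
    if low > high:
        raise ValueError(f"invalid range: {spec!r}")
    return low, high


def filter_tasks(tasks, level=None, difficulty=None, task_type=None):
    """Keep tasks that satisfy every filter that was supplied."""

    def in_range(value, spec):
        low, high = parse_range(spec)
        return low <= value <= high

    return [
        task
        for task in tasks
        if (level is None or in_range(task["level"], level))
        and (difficulty is None or in_range(task["difficulty"], difficulty))
        and (task_type is None or task["type"] == task_type)
    ]


# Hypothetical manifest entries, for illustration only.
tasks = [
    {"id": "GH-SIMPLE-001", "level": 1, "difficulty": 1, "type": "read"},
    {"id": "GH-MEDIUM-005", "level": 3, "difficulty": 4, "type": "write"},
]
print([t["id"] for t in filter_tasks(tasks, level="1-3", task_type="read")])
```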

Output & Results

Results are organized by MCP server type:

  • GitHub: results/benchmark_suite/github/{scenario}_{timestamp}/

Per-task JSON (task_GH-SIMPLE-001_conventional.json), shown with illustrative values:

{
  "task_id": "GH-SIMPLE-001",
  "task_name": "List repositories for user",
  "scenario": "conventional",
  "execution_model": "gpt-5.1",
  "total_metrics": {
    "total_tokens": 8765,
    "total_cost": 0.0987,
    "total_latency_seconds": 2.5
  },
  "success": true,
  "accuracy_score": 1.0
}

Summary statistics (summary.json), shown with illustrative values:

{
  "total_tasks": 10,
  "successful_tasks": 9,
  "success_rate": 0.90,
  "total_tokens": 87650,
  "total_cost": 5.50,
  "avg_tokens_per_task": 8765,
  "avg_cost_per_task": 0.55,
  "by_level": {
    "1": {"count": 5, "success_rate": 1.0, "avg_cost": 0.25},
    "2": {"count": 3, "success_rate": 0.83, "avg_cost": 0.35}
  }
}
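
The top-level summary fields relate to the per-task files in a straightforward way. How the runner actually builds summary.json is not shown in this README, so treat the following as a sketch of that relationship only; the function name `summarize` is hypothetical, and `by_level` is left out because the per-task example carries no level field.

```python
def summarize(results):
    """Aggregate per-task result dicts into summary-style totals and averages."""
    total = len(results)
    successes = sum(1 for r in results if r["success"])
    tokens = sum(r["total_metrics"]["total_tokens"] for r in results)
    cost = sum(r["total_metrics"]["total_cost"] for r in results)
    return {
        "total_tasks": total,
        "successful_tasks": successes,
        "success_rate": successes / total if total else 0.0,
        "total_tokens": tokens,
        "total_cost": cost,
        "avg_tokens_per_task": tokens / total if total else 0.0,
        "avg_cost_per_task": cost / total if total else 0.0,
    }


# Hypothetical per-task results using the field names from the example above.
example = [
    {"success": True, "total_metrics": {"total_tokens": 8000, "total_cost": 0.5}},
    {"success": False, "total_metrics": {"total_tokens": 4000, "total_cost": 0.25}},
]
print(summarize(example))
```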

Common Workflows

Run multiple scenarios:

# Run both conventional and TSBX scenarios together
python run_benchmark_suite.py --manifest github-data/manifest.json --scenario both

Compare scenarios (Conventional vs TSBX):

# Run BOTH scenarios and auto-generate comparison (recommended!)
python run_benchmark_suite.py \
  --manifest github-data/manifest.json \
  --scenario both

# This will:
# 1. Run conventional scenario
# 2. Run TSBX scenario
# 3. Auto-generate timestamped comparison file: results/comparison_github_TIMESTAMP.json
# 4. Print comparison summary to console

# Or run scenarios individually:
# Conventional only
python run_benchmark_suite.py --manifest github-data/manifest.json

# TSBX only
python run_benchmark_suite.py --manifest github-data/manifest.json --scenario tsbx

# Manual comparison of individual runs
python compare_scenarios.py \
  results/benchmark_suite/github/conventional_TIMESTAMP \
  results/benchmark_suite/github/tsbx_TIMESTAMP

6. View Results

Benchmark suite results:

# View summary statistics
cat results/benchmark_suite/github/conventional_TIMESTAMP/summary.json

# View individual task result
cat results/benchmark_suite/github/conventional_TIMESTAMP/task_GH-SIMPLE-001_conventional.json

# Compare metrics across scenarios
jq '.total_cost' results/benchmark_suite/github/*/summary.json
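
The same comparison can be done in Python when a relative difference is more useful than raw values. This is a minimal sketch, not a description of what compare_scenarios.py outputs; `load_summary`, `percent_change`, and `compare` are hypothetical helpers, and only the `total_tokens` and `total_cost` keys from summary.json are assumed.

```python
import json
from pathlib import Path


def load_summary(run_dir):
    """Read summary.json from a benchmark run directory."""
    return json.loads((Path(run_dir) / "summary.json").read_text())


def percent_change(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


def compare(conventional, tsbx, keys=("total_tokens", "total_cost")):
    """Percent change per metric, with the conventional run as baseline."""
    return {key: percent_change(conventional[key], tsbx[key]) for key in keys}
```

Usage would look like `compare(load_summary(conventional_dir), load_summary(tsbx_dir))`, where both directories are the timestamped run folders; a negative value means the TSBX run used less.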

Metrics Measured

| Metric | Description |
|---|---|
| Input Tokens | Tokens sent to GPT-5.1 (includes full tool catalog) |
| Output Tokens | Tokens generated by GPT-5.1 |
| Total Tokens | Sum of input + output |
| Cost | USD cost based on GPT-5.1 pricing ($1.25/1M input, $10/1M output) |
| Latency | End-to-end execution time (seconds) |
| DIRT | Data in Remote Transit (KB); includes LLM, MCP, and LangSmith overhead |
| LLM Calls | Number of calls to GPT-5.1 |
| Tool Calls | Number of MCP operations |
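
The Cost metric follows directly from the two per-million-token prices. A small sketch of that arithmetic, with a hypothetical function name and a made-up input/output split for the example call:

```python
# GPT-5.1 prices as stated in this README (USD per 1M tokens).
INPUT_USD_PER_MILLION = 1.25
OUTPUT_USD_PER_MILLION = 10.00


def request_cost(input_tokens, output_tokens):
    """USD cost of one request given its input and output token counts."""
    return (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    ) / 1_000_000


# Hypothetical split: an 11,000-token prompt and a 200-token reply.
print(f"${request_cost(11_000, 200):.5f}")
```

Because input tokens dominate when the full tool catalog is in the prompt, most of the per-request cost in the conventional scenario comes from the input side.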

DIRT Breakdown

DIRT tracks all data transferred over the network:

  • LLM KB: Data sent to/from OpenAI API
  • Tool KB: Data sent to/from MCP server
  • LangSmith KB: Estimated observability overhead (1.5x LLM traffic)

Why estimate LangSmith? LangSmith runs in the background automatically when LANGCHAIN_TRACING_V2=true. We can't measure its traffic directly without network interception, but it sends trace data roughly equal to the LLM traffic plus metadata, so we estimate it as 1.5x the LLM traffic.
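
Put together, the three components add up as in the sketch below. The function name and the example figures are hypothetical, and treating the LangSmith share as zero when tracing is off is an assumption drawn from the note above; the 1.5x factor is the estimate this README states.

```python
LANGSMITH_FACTOR = 1.5  # estimated LangSmith traffic as a multiple of LLM traffic


def dirt_kb(llm_kb, tool_kb, tracing_enabled=True):
    """Break DIRT down into its three components plus the total, in KB."""
    langsmith_kb = LANGSMITH_FACTOR * llm_kb if tracing_enabled else 0.0
    return {
        "llm_kb": llm_kb,
        "tool_kb": tool_kb,
        "langsmith_kb": langsmith_kb,
        "total_kb": llm_kb + tool_kb + langsmith_kb,
    }


# Hypothetical request: 60 KB of LLM traffic and 20 KB of MCP traffic.
print(dirt_kb(60, 20))
```

Note that with tracing on, the estimated observability share is larger than the LLM traffic itself, so disabling tracing changes the DIRT figure substantially.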

MCP Server Tools

GitHub MCP (10+ Tools)

The @modelcontextprotocol/server-github provides tools for:

Repositories:

  • List repositories for user/org
  • Get repository metadata
  • Create/update repositories

Issues:

  • List issues (with filters)
  • Search issues by label
  • Create/update issues

Pull Requests:

  • List pull requests
  • Get PR review status
  • Create/update PRs

Commits:

  • List commits
  • Get commit details

And more. Run python test_github_mcp.py to list all available tools.

CLI Options

Benchmark Suite Runner (run_benchmark_suite.py)

python run_benchmark_suite.py [OPTIONS]

Options:
  --manifest TEXT      Path to benchmark manifest JSON (required)
  --scenario TEXT      conventional | tsbx | tsbx-pg | both | all (default: conventional)
  --level TEXT         Filter by level (e.g., "1", "1-3", "4-6")
  --difficulty TEXT    Filter by difficulty (e.g., "1-3")
  --type TEXT          Filter by type (read | write)
  --tasks TEXT         Comma-separated task IDs (e.g., "GH-SIMPLE-001,GH-MEDIUM-005")
  --iterations INT     Number of times to run each task (default: 1)
  --output-dir TEXT    Output directory (default: results/benchmark_suite)

Note: The MCP server type is automatically detected from the manifest metadata.

Observability with LangSmith

When LANGCHAIN_TRACING_V2=true is set, all LangGraph/LangChain operations are automatically traced to LangSmith:

  • View full execution traces
  • See all LLM calls with prompts/responses
  • Track tool execution
  • Analyze latency bottlenecks

View traces: https://smith.langchain.com/ (project: mcp-benchmark)

Troubleshooting

Authentication Error (401)

Error: Authentication credentials not found

Solution: Verify your GitHub Personal Access Token:

  • Check it starts with ghp_ or github_pat_
  • Verify scopes include repo, read:org, read:user
  • Regenerate token if needed
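
The prefix check in the first bullet is easy to script. This sketch only looks at the token's shape; it cannot tell whether the token is valid, unexpired, or has the right scopes. The helper name is hypothetical.

```python
# Prefixes named in the troubleshooting list above.
KNOWN_PREFIXES = ("ghp_", "github_pat_")


def has_known_prefix(token):
    """True if the token starts with a known prefix and has a body after it."""
    token = token.strip()
    return any(
        token.startswith(prefix) and len(token) > len(prefix)
        for prefix in KNOWN_PREFIXES
    )


print(has_known_prefix("ghp_xxxxx"))
```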

MCP Server Not Found

Error: @<package>/mcp-server not found

Solution: The MCP server runs via npx, which downloads it on-demand. Ensure you have Node.js/npm installed:

node --version  # Should be v18+
npm --version
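
The same check can be run from Python, for example at the start of a setup script. A minimal sketch with hypothetical helper names; it assumes `node --version` prints a string such as `v18.19.0`, and uses the v18 minimum stated above.

```python
import re
import shutil
import subprocess


def parse_major(version_text):
    """Extract the major version from output such as 'v18.19.0'."""
    match = re.match(r"v?(\d+)\.", version_text.strip())
    if not match:
        raise ValueError(f"unrecognized version string: {version_text!r}")
    return int(match.group(1))


def node_ok(min_major=18):
    """True if node and npx are on PATH and node meets the minimum major version."""
    if shutil.which("node") is None or shutil.which("npx") is None:
        return False
    completed = subprocess.run(
        ["node", "--version"], capture_output=True, text=True, check=True
    )
    return parse_major(completed.stdout) >= min_major
```

Calling `node_ok()` returns `False` when either executable is missing, which covers the case where npx cannot download the server at all.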

Error: "Unknown MCP type: 'github'"

The factory couldn't find the GitHub client. Check that:

  • github_mcp_client.py exists in src/clients/
  • The file has no syntax errors
  • python test_github_mcp.py runs successfully
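
To see why a missing or broken client file can surface as this error, here is a sketch of a registry-based factory. It is not the code in mcp_client_factory.py; the class names, the `register` decorator, and `create_client` are all hypothetical. In a design like this, a client that fails to import never registers itself, so the lookup fails.

```python
from abc import ABC, abstractmethod


class BaseMCPClient(ABC):
    """Hypothetical base class; the real one may define other methods."""

    @abstractmethod
    def list_tools(self):
        """Return the tools this client exposes."""


_REGISTRY = {}


def register(mcp_type):
    """Class decorator that maps an MCP type name to a client class."""

    def decorator(cls):
        _REGISTRY[mcp_type] = cls
        return cls

    return decorator


@register("github")
class GitHubMCPClient(BaseMCPClient):
    def list_tools(self):
        return []  # placeholder; a real client would query the MCP server


def create_client(mcp_type):
    """Instantiate the client registered for mcp_type, or raise ValueError."""
    try:
        client_class = _REGISTRY[mcp_type]
    except KeyError:
        raise ValueError(f"Unknown MCP type: {mcp_type!r}") from None
    return client_class()


print(type(create_client("github")).__name__)
```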

Python Version Error

SyntaxError: invalid syntax

Solution: Ensure you're using Python 3.13+:

python3 --version
source .venv/bin/activate  # Use the venv
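
A guard at the top of a script can turn the opaque SyntaxError into a clear message, provided the guard itself uses only syntax that older interpreters accept. A minimal sketch, with a hypothetical helper name:

```python
import sys

MIN_VERSION = (3, 13)  # minimum stated in the Prerequisites section


def version_ok(version_info=None):
    """True if the given (or running) interpreter meets the minimum version."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_VERSION


if not version_ok():
    print(
        "Python %d.%d+ is required, found %d.%d"
        % (MIN_VERSION + tuple(sys.version_info[:2]))
    )
```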

Model Pricing

Execution Models

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Usage |
|---|---|---|---|
| GPT-5.1 | $1.25 | $10.00 | Conventional scenario (tool selection & execution) |
| TSBX Agent | Variable | Variable | TSBX scenario (autonomous execution with native MCP) |

Cost Tracking: Results track total tokens and cost per scenario for direct comparison.

Research Questions

  1. Does full tool context justify the cost?

    • Conventional passes all tools to the LLM (~11K+ tokens)
    • TSBX has minimal orchestration overhead
    • What's the cost difference in practice?

  2. Control vs. Efficiency

    • Conventional: Full control, full observability, high cost
    • TSBX: Autonomous execution, minimal cost
    • Which approach is better for different task types?

  3. Native MCP Support Impact

    • How does TSBX's native MCP tool calling compare to LangGraph orchestration?
    • What are the performance and cost implications?

Project Structure

github-mcp-benchmark/
├── src/
│   ├── clients/
│   │   ├── base_mcp_client.py          # Abstract base class
│   │   ├── mcp_client_factory.py       # Factory pattern
│   │   ├── github_mcp_client.py        # GitHub implementation
│   │   └── tasksandbox_client.py       # TaskSandbox client
│   ├── scenarios/
│   │   ├── scenario1_conventional.py   # Conventional approach
│   │   └── scenario3_tsbx_direct.py    # TSBX delegation
│   ├── measurement/
│   │   └── dirt_tracker.py             # DIRT metrics
│   └── utils/
│       ├── config.py                   # Configuration
│       └── validator.py                # Ground truth validation
├── github-data/                        # GitHub manifests & ground truth
├── results/
│   └── benchmark_suite/
│       └── github/                     # GitHub results
├── run_benchmark_suite.py              # Main runner
├── test_github_mcp.py                  # Test GitHub client
└── README.md                           # This file


Built with: Python 3.13, LangGraph, OpenAI GPT-5.1, Multiple MCP Servers. December 2025.
