JingWC/MCP-Focus

MCP-Focus: Leveraging Function-Oriented Document Enhancement for MCP Server Retrieval


Overview

MCP-Focus is a function-oriented document enhancement framework for MCP server retrieval. It generates retrieval-ready MCP documentation from raw server repositories via a multi-stage, LLM-driven agentic pipeline that performs white-box code analysis to (i) extract exposed tools and schemas, (ii) refine tool-level descriptions grounded in implementation, and (iii) synthesize a structured server-level overview for indexing.

This repository also includes a large-scale retrieval benchmark built from 3,763 real-world open-source MCP servers and human-guided queries with controlled semantic ambiguity, constraint specificity, and multi-function complexity, together with an end-to-end pipeline for indexing enhanced documents and evaluating retrievers.

Quick Start

Install dependencies

Create and activate a conda environment, then install Python dependencies from requirements.txt:

conda create -n mcp-focus python=3.13 -y
conda activate mcp-focus
pip install -r requirements.txt

Configure config.json

Copy the template config and fill in your credentials/paths:

cp config_template.json config.json

Then edit config.json and update:

  • github_token: your GitHub personal access token (used to access/download repos via the GitHub API).
  • glama_api_token: your Glama API token (used to crawl Glama MCP server data).
  • openai_api_key: your OpenAI API key.
  • base_urls / api_keys: optional list(s) of OpenAI-compatible base URLs and API keys if you use non-default endpoints or multiple providers.
  • data_dir: where datasets/artifacts will be stored (default: ../data).
  • github_mcp_server_repos_dir: subdirectory under data_dir for downloaded GitHub repos.
  • think_models: model names that should be treated as "thinking" models (only needed if you use such models).
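Before launching a long run, it can help to confirm that config.json actually contains the credentials listed above. The following is a minimal sketch, not part of the repository: the helper names check_config and load_config are hypothetical, and treating the three token/key fields as required is an assumption based on the list above.

```python
import json
from pathlib import Path

# Key names are taken from the list above. Treating these three as
# required is an assumption; the repository may be more lenient.
REQUIRED_KEYS = ("github_token", "glama_api_token", "openai_api_key")


def check_config(config: dict) -> list[str]:
    """Return one message per required key that is missing or blank."""
    problems = []
    for key in REQUIRED_KEYS:
        value = config.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty: {key}")
    return problems


def load_config(path: str = "config.json") -> dict:
    """Read config.json from disk (hypothetical helper)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    for problem in check_config(load_config()):
        print(problem)
```

An empty list from check_config means all three credentials are present as non-blank strings; it does not verify that the tokens are valid.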

Run MCP-Focus

(1) Tool Extraction (Tool Parser)

python -m tool_parser.main \
  --input_format <INPUT_FORMAT> \
  --input_path <INPUT_JSONL_PATH> \
  --output_path <OUTPUT_JSONL_PATH> \
  --repos_dir <REPOS_DIR> \
  --process_num <PROCESS_NUM> \
  --max_workers <MAX_WORKERS> \
  --model_name <MODEL_NAME> \
  --max_tokens <MAX_TOKENS> \
  --save_interval <SAVE_INTERVAL>

This command parses MCP server repositories and extracts all exposed MCP tools, including each tool’s name, description, and input schema. It outputs a JSONL file that augments each server entry with extracted tool metadata, which serves as the input for downstream tool-document refinement.

  • --input_format: The input data format (e.g., glama / github / vallina), which determines how the loader interprets each JSONL record.
  • --input_path: Path to the input JSONL file containing MCP server entries (e.g., repository URLs/IDs and metadata).
  • --output_path: Path to the output JSONL file where extracted tools and related artifacts are saved.
  • --repos_dir: Directory used to download/cache repository source code for analysis.
  • --process_num: Number of server entries to process from the input (useful for subsampling or partial runs).
  • --max_workers: Number of concurrent worker threads for parallel processing.
  • --model_name: LLM name used by the tool extractor agent.
  • --max_tokens: Maximum token budget for each LLM call (controls context and generation length).
  • --save_interval: Checkpoint interval, in number of processed entries. Results are saved each time this many entries finish, so an interrupted run keeps the work completed so far.
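To make the interplay of --process_num, --max_workers, and --save_interval concrete, here is a rough sketch of how such options are commonly combined. It is an illustration under assumptions, not the tool parser's actual implementation: run_with_checkpoints and the worker callable are hypothetical, and save_interval is assumed to be a positive integer.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_checkpoints(entries, worker, output_path,
                         process_num, max_workers, save_interval):
    """Process the first process_num entries with a thread pool and
    rewrite output_path as JSONL after every save_interval results."""
    selected = entries[:process_num]  # partial run / subsampling
    results = []

    def save():
        # Rewrite the whole file so it always holds every finished result.
        with open(output_path, "w", encoding="utf-8") as fh:
            for record in results:
                fh.write(json.dumps(record, ensure_ascii=False) + "\n")

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(worker, entry) for entry in selected]
        for future in as_completed(futures):
            results.append(future.result())
            if len(results) % save_interval == 0:
                save()  # periodic checkpoint
    save()  # final save covers any remainder
    return results
```

Results arrive in completion order, so the output file is not guaranteed to preserve the input order.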

(2) Tool Document Refinement

python -m doc_refiner.tool_doc_refiner \
  --input_path <INPUT_JSONL_PATH> \
  --output_path <OUTPUT_JSONL_PATH> \
  --input_format <INPUT_FORMAT> \
  --process_num <PROCESS_NUM> \
  --max_workers <MAX_WORKERS> \
  --model_name <MODEL_NAME> \
  --max_tokens <MAX_TOKENS> \
  --save_interval <SAVE_INTERVAL> \
  --repos_dir <REPOS_DIR>

This command refines tool-level documentation by performing white-box analysis of the tool implementations. Given the extracted tools (schemas plus the tool-to-code mapping), it navigates the repository code to generate retrieval-oriented, implementation-grounded tool documents (typically in Markdown) and writes them to a JSONL output file.

  • --input_path: Path to the input JSONL file containing extracted tools (i.e., the output from the tool parser stage).
  • --output_path: Path to the output JSONL file where refined tool documents are saved.
  • --input_format: The input data format (e.g., glama / github / vallina), used to parse and normalize input records.
  • --process_num: Number of server entries to process from the input.
  • --max_workers: Number of concurrent worker threads for parallel refinement.
  • --model_name: LLM name used by the tool document refiner agent.
  • --max_tokens: Maximum token budget for each LLM call during refinement.
  • --save_interval: Checkpoint interval, in number of processed entries; results are saved each time this many entries finish.
  • --repos_dir: Directory for downloading/caching repositories (required because refinement reads source code to ground the tool semantics).

(3) Server Document Refinement

python -m doc_refiner.server_doc_refiner \
  --input_path <INPUT_JSONL_PATH> \
  --output_path <OUTPUT_JSONL_PATH> \
  --input_format <INPUT_FORMAT> \
  --process_num <PROCESS_NUM> \
  --max_workers <MAX_WORKERS> \
  --model_name <MODEL_NAME> \
  --max_tokens <MAX_TOKENS> \
  --save_interval <SAVE_INTERVAL>

This command generates server-level documentation from refined tool documents. It synthesizes a coherent server overview and aggregates tool-level evidence into retrieval-ready server documentation.

  • --input_path: Path to the input JSONL file containing refined tool documents (i.e., the output from the tool document refinement stage).
  • --output_path: Path to the output JSONL file where server documentation is saved.
  • --input_format: The input data format (e.g., glama / github / vallina), used to parse and normalize input records.
  • --process_num: Number of server entries to process from the input.
  • --max_workers: Number of concurrent worker threads for parallel generation.
  • --model_name: LLM name used by the server document refiner agent.
  • --max_tokens: Maximum token budget for each LLM call during server document generation.
  • --save_interval: Checkpoint interval, in number of processed entries; results are saved each time this many entries finish.
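Because each stage consumes the previous stage's output JSONL, the three commands above can be chained. The sketch below only builds the argument lists, using the module names and flags documented in this README. The build_pipeline helper, the stage1.jsonl / stage2.jsonl / stage3.jsonl file names, and the choice to reuse the same --input_format for every stage are assumptions made for illustration.

```python
import shlex

# (module, needs --repos_dir), in the order described above.
STAGES = [
    ("tool_parser.main", True),
    ("doc_refiner.tool_doc_refiner", True),
    ("doc_refiner.server_doc_refiner", False),
]


def build_pipeline(input_path, work_dir, repos_dir, input_format, model_name,
                   process_num, max_workers, max_tokens, save_interval):
    """Return one argv list per stage; each stage reads the previous output."""
    commands = []
    current_input = input_path
    for index, (module, needs_repos) in enumerate(STAGES, start=1):
        # Hypothetical intermediate file names, one per stage.
        output_path = f"{work_dir}/stage{index}.jsonl"
        command = [
            "python", "-m", module,
            "--input_format", input_format,
            "--input_path", current_input,
            "--output_path", output_path,
            "--process_num", str(process_num),
            "--max_workers", str(max_workers),
            "--model_name", model_name,
            "--max_tokens", str(max_tokens),
            "--save_interval", str(save_interval),
        ]
        if needs_repos:
            command += ["--repos_dir", repos_dir]
        commands.append(command)
        current_input = output_path  # feed this output to the next stage
    return commands


if __name__ == "__main__":
    # Placeholder values; substitute your own paths and model name.
    for cmd in build_pipeline("servers.jsonl", "out", "repos", "glama",
                              "<MODEL_NAME>", 100, 8, 4096, 10):
        print(shlex.join(cmd))
```

Each printed line can be run in a shell as-is, or each list can be passed to subprocess.run(cmd, check=True) to stop at the first failing stage.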
