MCP-Focus: Leveraging Function-Oriented Document Enhancement for MCP Server Retrieval

Overview

MCP-Focus is a function-oriented document enhancement framework for MCP server retrieval. It generates retrieval-ready MCP documentation from raw server repositories via a multi-stage, LLM-driven agentic pipeline that performs white-box code analysis to (i) extract exposed tools and schemas, (ii) refine tool-level descriptions grounded in implementation, and (iii) synthesize a structured server-level overview for indexing.

This repository also includes a large-scale retrieval benchmark built from 3,763 real-world open-source MCP servers and human-guided queries with controlled semantic ambiguity, constraint specificity, and multi-function complexity, together with an end-to-end pipeline for indexing enhanced documents and evaluating retrievers.

Quick Start

Install dependencies

Create and activate a conda environment, then install Python dependencies from requirements.txt:

conda create -n mcp-focus python=3.13 -y
conda activate mcp-focus
pip install -r requirements.txt

Configure `config.json`

Copy the template config and fill in your credentials/paths:

cp config_template.json config.json

Then edit config.json and update:

github_token: your GitHub personal access token (used to access/download repos via the GitHub API).
glama_api_token: your Glama API token (used to crawl Glama MCP server data).
openai_api_key: your OpenAI API key.
base_urls / api_keys: optional list(s) of OpenAI-compatible base URLs and API keys if you use non-default endpoints or multiple providers.
data_dir: where datasets/artifacts will be stored (default: ../data).
github_mcp_server_repos_dir: subdirectory under data_dir for downloaded GitHub repos.
think_models: model names that should be treated as "thinking" models (only needed if you use such models).
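
Before launching a long run, it can help to check that the config file is filled in. The snippet below is a minimal sketch, not part of the repository: the helper name `missing_config_keys` is hypothetical, and which keys are strictly required is an assumption (for example, `glama_api_token` may only matter if you crawl Glama data).

```python
import json
from pathlib import Path

# Keys taken from the list above. Treating these four as mandatory is an
# assumption made for illustration; adjust to the stages you actually run.
EXPECTED_KEYS = ("github_token", "glama_api_token", "openai_api_key", "data_dir")


def missing_config_keys(path="config.json"):
    """Return the expected keys that are absent or empty in the config file."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    return [key for key in EXPECTED_KEYS if not config.get(key)]


if __name__ == "__main__":
    missing = missing_config_keys()
    if missing:
        print("Please fill in:", ", ".join(missing))
```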

Run MCP-Focus

(1) Tool Extraction (Tool Parser)

python -m tool_parser.main \
  --input_format <INPUT_FORMAT> \
  --input_path <INPUT_JSONL_PATH> \
  --output_path <OUTPUT_JSONL_PATH> \
  --repos_dir <REPOS_DIR> \
  --process_num <PROCESS_NUM> \
  --max_workers <MAX_WORKERS> \
  --model_name <MODEL_NAME> \
  --max_tokens <MAX_TOKENS> \
  --save_interval <SAVE_INTERVAL>

This command parses MCP server repositories and extracts all exposed MCP tools, including each tool’s name, description, and input schema. It outputs a JSONL file that augments each server entry with extracted tool metadata, which serves as the input for downstream tool-document refinement.

--input_format: The input data format (e.g., glama / github / vallina), which determines how the loader interprets each JSONL record.
--input_path: Path to the input JSONL file containing MCP server entries (e.g., repository URLs/IDs and metadata).
--output_path: Path to the output JSONL file where extracted tools and related artifacts are saved.
--repos_dir: Directory used to download/cache repository source code for analysis.
--process_num: Number of server entries to process from the input (useful for subsampling or partial runs).
--max_workers: Number of concurrent worker threads for parallel processing.
--model_name: LLM name used by the tool extractor agent.
--max_tokens: Maximum token budget for each LLM call (controls context and generation length).
--save_interval: Checkpoint interval, in number of processed entries; results are saved incrementally so that an interrupted run can be recovered.
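
To illustrate how --max_workers and --save_interval typically interact, here is a minimal sketch of threaded processing with periodic JSONL checkpoints. It is an assumption about the general pattern, not the repository's actual implementation; `process_with_checkpoints`, `flush`, and `handle_entry` are hypothetical names.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed


def flush(pending, out):
    """Write buffered results as JSON lines, clear the buffer, return the count."""
    count = len(pending)
    for record in pending:
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
    out.flush()
    pending.clear()
    return count


def process_with_checkpoints(entries, handle_entry, output_path,
                             max_workers=4, save_interval=10):
    """Process entries in parallel and append results to a JSONL file in batches."""
    pending = []  # finished results that have not been written yet
    written = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool, \
            open(output_path, "a", encoding="utf-8") as out:
        futures = [pool.submit(handle_entry, entry) for entry in entries]
        for future in as_completed(futures):
            pending.append(future.result())
            # Persist a batch every `save_interval` completed entries.
            if len(pending) >= save_interval:
                written += flush(pending, out)
        written += flush(pending, out)  # write the final partial batch
    return written
```

With this pattern, a crash loses at most the entries completed since the last batch was written, which is the trade-off a smaller interval buys at the cost of more frequent disk writes.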

(2) Tool Document Refinement

python -m doc_refiner.tool_doc_refiner \
  --input_path <INPUT_JSONL_PATH> \
  --output_path <OUTPUT_JSONL_PATH> \
  --input_format <INPUT_FORMAT> \
  --process_num <PROCESS_NUM> \
  --max_workers <MAX_WORKERS> \
  --model_name <MODEL_NAME> \
  --max_tokens <MAX_TOKENS> \
  --save_interval <SAVE_INTERVAL> \
  --repos_dir <REPOS_DIR>

This command refines tool-level documentation by performing white-box analysis over the tool implementations. Given extracted tools (schemas + tool-to-code mapping), it navigates the repository code to generate retrieval-oriented, implementation-grounded tool documents (typically in markdown), and writes them back into a JSONL output.

--input_path: Path to the input JSONL file containing extracted tools (i.e., the output from the tool parser stage).
--output_path: Path to the output JSONL file where refined tool documents are saved.
--input_format: The input data format (e.g., glama / github / vallina), used to parse and normalize input records.
--process_num: Number of server entries to process from the input.
--max_workers: Number of concurrent worker threads for parallel refinement.
--model_name: LLM name used by the tool document refiner agent.
--max_tokens: Maximum token budget for each LLM call during refinement.
--save_interval: Checkpoint interval (in number of processed entries).
--repos_dir: Directory for downloading/caching repositories (required because refinement reads source code to ground the tool semantics).

(3) Server Document Refinement

python -m doc_refiner.server_doc_refiner \
  --input_path <INPUT_JSONL_PATH> \
  --output_path <OUTPUT_JSONL_PATH> \
  --input_format <INPUT_FORMAT> \
  --process_num <PROCESS_NUM> \
  --max_workers <MAX_WORKERS> \
  --model_name <MODEL_NAME> \
  --max_tokens <MAX_TOKENS> \
  --save_interval <SAVE_INTERVAL>

This command generates server-level documentation from refined tool documents. It synthesizes a coherent server overview and aggregates tool-level evidence into retrieval-ready server documentation.

--input_path: Path to the input JSONL file containing refined tool documents (i.e., the output from the tool document refinement stage).
--output_path: Path to the output JSONL file where server documentation is saved.
--input_format: The input data format (e.g., glama / github / vallina), used to parse and normalize input records.
--process_num: Number of server entries to process from the input.
--max_workers: Number of concurrent worker threads for parallel generation.
--model_name: LLM name used by the server document refiner agent.
--max_tokens: Maximum token budget for each LLM call during server document generation.
--save_interval: Checkpoint interval (in number of processed entries).
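
Because each stage consumes the previous stage's output, the three commands can be chained from a small driver script. The following is a sketch under stated assumptions: the module names and flags come from the commands above, while the function names, the intermediate file names (`tools.jsonl`, `tool_docs.jsonl`, `server_docs.jsonl`), and the choice to pass the same --input_format to every stage are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_stage_commands(input_path, work_dir, repos_dir, input_format, model_name,
                         process_num, max_workers, max_tokens, save_interval):
    """Return one argv list per stage, wiring each output into the next input."""
    work_dir = Path(work_dir)
    tools = work_dir / "tools.jsonl"              # hypothetical file names
    tool_docs = work_dir / "tool_docs.jsonl"
    server_docs = work_dir / "server_docs.jsonl"

    def shared(src, dst):
        # Flags documented for all three stages.
        return [
            "--input_path", str(src),
            "--output_path", str(dst),
            "--input_format", input_format,
            "--process_num", str(process_num),
            "--max_workers", str(max_workers),
            "--model_name", model_name,
            "--max_tokens", str(max_tokens),
            "--save_interval", str(save_interval),
        ]

    repos = ["--repos_dir", str(repos_dir)]  # only stages (1) and (2) take this flag
    return [
        [sys.executable, "-m", "tool_parser.main", *shared(input_path, tools), *repos],
        [sys.executable, "-m", "doc_refiner.tool_doc_refiner", *shared(tools, tool_docs), *repos],
        [sys.executable, "-m", "doc_refiner.server_doc_refiner", *shared(tool_docs, server_docs)],
    ]


def run_pipeline(commands):
    """Run the stages in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling `run_pipeline(build_stage_commands(...))` from the repository root would then execute the stages sequentially; `check=True` makes a failing stage raise instead of silently feeding incomplete output to the next one.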
