MCP-Focus is a function-oriented document enhancement framework for MCP server retrieval. It generates retrieval-ready MCP documentation from raw server repositories via a multi-stage, LLM-driven agentic pipeline that performs white-box code analysis to (i) extract exposed tools and schemas, (ii) refine tool-level descriptions grounded in implementation, and (iii) synthesize a structured server-level overview for indexing.
This repository also includes a large-scale retrieval benchmark built from 3,763 real-world open-source MCP servers and human-guided queries with controlled semantic ambiguity, constraint specificity, and multi-function complexity, together with an end-to-end pipeline for indexing enhanced documents and evaluating retrievers.
Create and activate a conda environment, then install Python dependencies from requirements.txt:
conda create -n mcp-focus python=3.13 -y
conda activate mcp-focus
pip install -r requirements.txt

Copy the template config and fill in your credentials/paths:

cp config_template.json config.json

Then edit config.json and update:

- github_token: your GitHub personal access token (used to access/download repos via the GitHub API).
- glama_api_token: your Glama API token (used to crawl Glama MCP server data).
- openai_api_key: your OpenAI API key.
- base_urls / api_keys: optional list(s) of OpenAI-compatible base URLs and API keys if you use non-default endpoints or multiple providers.
- data_dir: where datasets/artifacts will be stored (default: ../data).
- github_mcp_server_repos_dir: subdirectory under data_dir for downloaded GitHub repos.
- think_models: model names that should be treated as "thinking" models (only needed if you use such models).

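Before running the pipeline, it can help to check that config.json is filled in. The snippet below is a minimal sketch, not code from this repository: the helper names load_config and repos_dir are hypothetical, and treating the first five keys as required is an assumption based on the list above.

```python
import json
from pathlib import Path

# Keys taken from the list above. Which of them are strictly required
# is an assumption made for this sketch.
REQUIRED_KEYS = [
    "github_token",
    "glama_api_token",
    "openai_api_key",
    "data_dir",
    "github_mcp_server_repos_dir",
]


def load_config(path="config.json"):
    """Load the JSON config and fail early if a required value is missing or empty."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError("config is missing values for: " + ", ".join(missing))
    return config


def repos_dir(config):
    """Resolve the repo download directory as a subdirectory of data_dir."""
    return Path(config["data_dir"]) / config["github_mcp_server_repos_dir"]
```

Calling load_config() right after editing config.json surfaces empty tokens before any crawl or LLM call is attempted.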
python -m tool_parser.main \
--input_format <INPUT_FORMAT> \
--input_path <INPUT_JSONL_PATH> \
--output_path <OUTPUT_JSONL_PATH> \
--repos_dir <REPOS_DIR> \
--process_num <PROCESS_NUM> \
--max_workers <MAX_WORKERS> \
--model_name <MODEL_NAME> \
--max_tokens <MAX_TOKENS> \
--save_interval <SAVE_INTERVAL>

This command parses MCP server repositories and extracts all exposed MCP tools, including each tool’s name, description, and input schema. It outputs a JSONL file that augments each server entry with extracted tool metadata, which serves as the input for downstream tool-document refinement.

- --input_format: The input data format (e.g., glama / github / vallina), which determines how the loader interprets each JSONL record.
- --input_path: Path to the input JSONL file containing MCP server entries (e.g., repository URLs/IDs and metadata).
- --output_path: Path to the output JSONL file where extracted tools and related artifacts are saved.
- --repos_dir: Directory used to download/cache repository source code for analysis.
- --process_num: Number of server entries to process from the input (useful for subsampling or partial runs).
- --max_workers: Number of concurrent worker threads for parallel processing.
- --model_name: LLM name used by the tool extractor agent.
- --max_tokens: Maximum token budget for each LLM call (controls context and generation length).
- --save_interval: Save checkpoint interval (in number of processed entries) to enable incremental persistence and recovery.

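To illustrate how --process_num, --max_workers, and --save_interval interact, here is a minimal sketch of checkpointed parallel processing. It is not the repository's implementation: run_with_checkpoints and the worker callable are hypothetical names, and rewriting the whole output file at each checkpoint is a simplifying assumption.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_checkpoints(entries, worker, output_path,
                         process_num=None, max_workers=4, save_interval=10):
    """Process entries in worker threads and persist results every save_interval entries."""
    # process_num limits the run to the first N entries (None means all).
    selected = entries if process_num is None else entries[:process_num]
    results = []

    def save():
        # Rewrite the JSONL output with everything finished so far.
        with open(output_path, "w", encoding="utf-8") as f:
            for record in results:
                f.write(json.dumps(record, ensure_ascii=False) + "\n")

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(worker, entry) for entry in selected]
        # Results arrive in completion order, not input order.
        for future in as_completed(futures):
            results.append(future.result())
            if len(results) % save_interval == 0:
                save()

    save()  # final write covers the remainder after the last checkpoint
    return results
```

With this shape, an interrupted run loses at most save_interval - 1 finished entries, which is the trade-off the checkpoint interval controls.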
python -m doc_refiner.tool_doc_refiner \
--input_path <INPUT_JSONL_PATH> \
--output_path <OUTPUT_JSONL_PATH> \
--input_format <INPUT_FORMAT> \
--process_num <PROCESS_NUM> \
--max_workers <MAX_WORKERS> \
--model_name <MODEL_NAME> \
--max_tokens <MAX_TOKENS> \
--save_interval <SAVE_INTERVAL> \
--repos_dir <REPOS_DIR>

This command refines tool-level documentation by performing white-box analysis over the tool implementations. Given extracted tools (schemas + tool-to-code mapping), it navigates the repository code to generate retrieval-oriented, implementation-grounded tool documents (typically in markdown), and writes them back into a JSONL output.

- --input_path: Path to the input JSONL file containing extracted tools (i.e., the output from the tool parser stage).
- --output_path: Path to the output JSONL file where refined tool documents are saved.
- --input_format: The input data format (e.g., glama / github / vallina), used to parse and normalize input records.
- --process_num: Number of server entries to process from the input.
- --max_workers: Number of concurrent worker threads for parallel refinement.
- --model_name: LLM name used by the tool document refiner agent.
- --max_tokens: Maximum token budget for each LLM call during refinement.
- --save_interval: Save checkpoint interval (in number of processed entries).
- --repos_dir: Directory for downloading/caching repositories (required because refinement reads source code to ground the tool semantics).

python -m doc_refiner.server_doc_refiner \
--input_path <INPUT_JSONL_PATH> \
--output_path <OUTPUT_JSONL_PATH> \
--input_format <INPUT_FORMAT> \
--process_num <PROCESS_NUM> \
--max_workers <MAX_WORKERS> \
--model_name <MODEL_NAME> \
--max_tokens <MAX_TOKENS> \
--save_interval <SAVE_INTERVAL>

This command generates server-level documentation from refined tool documents. It synthesizes a coherent server overview and aggregates tool-level evidence into retrieval-ready server documentation.

- --input_path: Path to the input JSONL file containing refined tool documents (i.e., the output from the tool document refinement stage).
- --output_path: Path to the output JSONL file where server documentation is saved.
- --input_format: The input data format (e.g., glama / github / vallina), used to parse and normalize input records.
- --process_num: Number of server entries to process from the input.
- --max_workers: Number of concurrent worker threads for parallel generation.
- --model_name: LLM name used by the server document refiner agent.
- --max_tokens: Maximum token budget for each LLM call during server document generation.
- --save_interval: Save checkpoint interval (in number of processed entries).
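
Since each stage consumes the previous stage's output, the three commands can be chained. The sketch below only builds the argument lists; it uses the module names and flags documented above, while build_pipeline, the stageN.jsonl file names, and reusing one --input_format value across all stages are assumptions made for illustration.

```python
# (module, takes --repos_dir) for each stage, in pipeline order.
STAGES = [
    ("tool_parser.main", True),
    ("doc_refiner.tool_doc_refiner", True),
    ("doc_refiner.server_doc_refiner", False),  # documented usage has no --repos_dir
]


def build_pipeline(first_input, out_dir, input_format, repos_dir, model_name,
                   process_num, max_workers, max_tokens, save_interval):
    """Return one argv list per stage, feeding each output into the next stage."""
    commands = []
    input_path = first_input
    for index, (module, takes_repos_dir) in enumerate(STAGES, start=1):
        # Hypothetical intermediate file name; any writable JSONL path works.
        output_path = f"{out_dir}/stage{index}.jsonl"
        argv = [
            "python", "-m", module,
            "--input_format", input_format,
            "--input_path", input_path,
            "--output_path", output_path,
            "--process_num", str(process_num),
            "--max_workers", str(max_workers),
            "--model_name", model_name,
            "--max_tokens", str(max_tokens),
            "--save_interval", str(save_interval),
        ]
        if takes_repos_dir:
            argv += ["--repos_dir", repos_dir]
        commands.append(argv)
        input_path = output_path  # next stage reads what this stage wrote
    return commands
```

Each returned list can be passed to subprocess.run(argv, check=True) so that a failing stage stops the chain before the next one reads an incomplete file.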
