Security: MCP server exposes local filesystem and lacks clone safety limits #59

@jrdej51

The MCP server has two related issues that let an attacker (via prompt injection, a malicious document the agent
reads, or a compromised tool output) coerce the agent into cloning repositories it should never touch.

  1. file:// URLs allow arbitrary local-repo read

utils._GIT_URL_SCHEMES includes file://, so an LLM-supplied repo argument like file:///home/victim/some-private-repo
is accepted by from_git and cloned via git clone. The repo's contents are then returned to the model through
search/find_related. On developer machines, private git repos are usually trivial to find by path.

ssh:// and SCP-form URLs (git@host:org/repo) also reach git clone unfiltered, enabling SSH connections to arbitrary
internal hosts (useful for reconnaissance even though stdin is suppressed).

Suggested fix: in mcp.py, restrict accepted schemes to https:// / http:// (or make non-https schemes opt-in via an env
var like SEMBLE_ALLOW_LOCAL_GIT=1). The Python API can keep accepting all schemes — the boundary worth hardening is
the MCP tool.
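
A minimal sketch of what that check could look like at the MCP boundary. The names `validate_repo_url` and `ALLOWED_SCHEMES` are hypothetical and not taken from the codebase; only the `SEMBLE_ALLOW_LOCAL_GIT` variable comes from the suggestion above.

```python
import os
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical allowlist for the MCP tool only; the Python API is left untouched.
ALLOWED_SCHEMES = {"https", "http"}


def validate_repo_url(repo: str, allow_local: Optional[bool] = None) -> str:
    """Return `repo` unchanged if the MCP tool may clone it, else raise ValueError."""
    if allow_local is None:
        # Opt-in escape hatch, as proposed above.
        allow_local = os.environ.get("SEMBLE_ALLOW_LOCAL_GIT") == "1"
    if allow_local:
        return repo

    parts = urlsplit(repo)
    # SCP-form URLs (git@host:org/repo) parse with an empty scheme, so they
    # are rejected here together with file:// and ssh://.
    if parts.scheme.lower() not in ALLOWED_SCHEMES or not parts.netloc:
        raise ValueError(f"repo URL not allowed over MCP: {repo!r}")
    return repo
```

The MCP tool handler would call this before handing `repo` to from_git, so a rejected URL never reaches git clone.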

  2. No timeout, size, or cache limits → DoS

SembleIndex.from_git calls subprocess.run(["git", "clone", ...]) with no timeout=. A slow or intentionally large
remote blocks the MCP event loop indefinitely. There is also no max-file-size in chunker.chunk_file, no max-files cap
in walk_files, and _IndexCache grows unbounded — chunks stay fully in RAM, with no LRU. A session that gets nudged to
index several large repos can exhaust memory.

Suggested fix:

  • subprocess.run(..., timeout=300) (configurable) and surface TimeoutExpired as RuntimeError.
  • --depth 1 --filter=blob:limit=10m on the clone, or a post-clone size check before indexing.
  • Cap _IndexCache to N entries with LRU eviction; expose a clear_cache MCP tool.
  • Skip files larger than e.g. 1 MB in chunk_file (configurable).
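
A rough sketch of how these limits might fit together, assuming nothing about the current internals. `clone_with_timeout`, `LRUIndexCache`, `should_index`, and the constants are hypothetical names and defaults for illustration; only the `git clone --depth 1` invocation and the `timeout=` argument to `subprocess.run` are existing behavior.

```python
import subprocess
from collections import OrderedDict
from pathlib import Path

# Hypothetical defaults; all three would be configurable.
CLONE_TIMEOUT_S = 300
MAX_FILE_BYTES = 1_000_000
MAX_CACHE_ENTRIES = 4


def clone_with_timeout(url: str, dest: Path, timeout: float = CLONE_TIMEOUT_S) -> None:
    """Shallow-clone `url` into `dest`, surfacing a timeout as RuntimeError."""
    cmd = ["git", "clone", "--depth", "1", "--", url, str(dest)]
    try:
        subprocess.run(cmd, check=True, capture_output=True,
                       stdin=subprocess.DEVNULL, timeout=timeout)
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"git clone timed out after {timeout}s") from exc


def should_index(path: Path, max_bytes: int = MAX_FILE_BYTES) -> bool:
    """True only for regular files at or under the size cap."""
    return path.is_file() and path.stat().st_size <= max_bytes


class LRUIndexCache:
    """Bounded cache that evicts the least recently used entry first."""

    def __init__(self, max_entries: int = MAX_CACHE_ENTRIES) -> None:
        self._max = max_entries
        self._data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # drop the oldest entry

    def clear(self) -> None:
        self._data.clear()

    def __len__(self) -> int:
        return len(self._data)
```

A clear_cache MCP tool could then be a thin wrapper around `LRUIndexCache.clear()`.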

  3. (Bonus) MCP instructions encourage URL hallucination

The server instructions string tells the model: "For questions about a library (e.g. a PyPI/npm package), resolve the
GitHub URL from your training knowledge and pass it as repo." This actively encourages cloning typo-squatted or
hallucinated repos and presenting their contents as authoritative. Recommend rewording to require an explicit URL from
the user/context.

Happy to send a PR if useful.

Labels: enhancement