crashes when GitHub releases is unreachable; existing line-chunker fallback isn't wired in #94

@pszemraj

Description

semble/chunking/core.py calls manifest_languages() at module scope on line 13. This fires a synchronous network request to GitHub releases the first time anything in the semble package is imported. In any environment that can't reach github.com over a clean TLS handshake (corporate firewalls with TLS interception, air-gapped networks, or just that endpoint being unreachable for whatever reason), import semble raises an unhandled exception before user code runs. This blocks every advertised entry point: the Python API, the CLI, and the MCP server.

Note

Coincidentally, I ran into the same thing while trying to have Claude use this in its CPU environment on https://claude.ai/ (despite having standard package registries and so on network-whitelisted in settings), so I had it trace through and analyze the failure in its env. Most of the below is from that analysis.

Reproduced on Linux x86_64, Python 3.12, semble==0.1.7, tree-sitter-language-pack==1.6.2. My specific case is HTTPS egress through a TLS-intercepting corporate proxy. The corporate CA is in the OS trust store but rustls (used by tree-sitter-language-pack via ureq) ships with webpki-roots and doesn't consult the system trust store, so the handshake to github.com fails with UnknownIssuer. Same crash happens for any reason that URL is unreachable.

$ pip install semble
$ python -c "import semble"
Traceback (most recent call last):
  ...
  File ".../semble/chunking/core.py", line 13, in <module>
    _TREE_SITTER_LANGUAGES: frozenset[str] = frozenset(manifest_languages())
                                                       ^^^^^^^^^^^^^^^^^^^^
tree_sitter_language_pack.DownloadError: Download error: Failed to fetch manifest from https://github.com/kreuzberg-dev/tree-sitter-language-pack/releases/download/v1.6.2/parsers.json: io: invalid peer certificate: UnknownIssuer

The README says semble "runs on CPU with no API keys, GPU, or external services," but that doesn't hold for users without an unrestricted connection to GitHub releases at import time.

semble already handles this case in chunking/chunking.py: when is_supported_language(language) returns False, chunk_source calls chunk_lines(source, ...) instead. That code path is fully implemented and works today. If _TREE_SITTER_LANGUAGES ends up empty because the manifest never loaded, is_supported_language returns False for every language, every file routes to chunk_lines, and get_parser() is never called. The rest of the pipeline (bm25s, the static potion-code-16M embedding via model2vec, RRF, all the ranking signals that don't depend on AST node types) continues to work normally. The fallback just isn't wired in.
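To make the routing argument concrete, here is a minimal sketch of the behavior described above. The function names mirror the ones in this issue, but the bodies, signatures, and the `parser_requests` counter are simplified stand-ins written for illustration, not semble's actual code. The chunk size used by `chunk_lines` here is also an arbitrary assumption.

```python
# Hypothetical stand-ins for semble internals. Names follow the issue text;
# bodies and signatures are assumptions made for illustration only.
_TREE_SITTER_LANGUAGES: frozenset[str] = frozenset()  # empty: manifest never loaded

parser_requests: list[str] = []  # records every language a parser was requested for


def is_supported_language(language: str) -> bool:
    return language in _TREE_SITTER_LANGUAGES


def get_parser(language: str) -> str:
    # In the real package, this is where a per-language download could happen.
    parser_requests.append(language)
    return f"parser:{language}"


def chunk_lines(source: str, lines_per_chunk: int = 50) -> list[str]:
    # Line-based fallback: group whole lines into fixed-size chunks.
    # lines_per_chunk is an arbitrary value chosen for this sketch.
    lines = source.splitlines(keepends=True)
    return [
        "".join(lines[i : i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]


def chunk_source(source: str, language: str) -> list[str]:
    if not is_supported_language(language):
        return chunk_lines(source)
    get_parser(language)  # AST-aware path; real chunking logic omitted here
    return [source]  # placeholder result for the AST-aware path


chunks = chunk_source("def f():\n    return 1\n", "python")
```

With the language set empty, `chunk_source` returns the line-based chunks and `parser_requests` stays empty, which is the property the proposed fix relies on.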

Proposed fix

Wrap the import-time call in a try/except:

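# Annotate once so type checkers don't flag a redefinition in the except branch.
_TREE_SITTER_LANGUAGES: frozenset[str]
try:
    _TREE_SITTER_LANGUAGES = frozenset(manifest_languages())
except Exception as e:  # e.g. tree_sitter_language_pack.DownloadError
    # Assumes a module-level `logger`; an empty set routes every file to chunk_lines.
    logger.warning(
        "tree-sitter manifest unreachable (%s); falling back to line-based chunking",
        e,
    )
    _TREE_SITTER_LANGUAGES = frozenset()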

Local to chunking/core.py:13. Uses code paths that already exist in semble. No new dependencies, no API changes, no behavioral change for online users.
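For anyone who wants to exercise the guard without a blocked network, the same pattern can be pulled into a small helper and fed a function that fails on purpose. This is only a sketch: `load_languages`, `unreachable_manifest`, and the logger name are hypothetical and not part of semble or tree-sitter-language-pack.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("semble.chunking")  # logger name is an assumption


def load_languages(fetch: Callable[[], Iterable[str]]) -> frozenset[str]:
    """Return the fetched language set, or an empty set if the fetch fails."""
    try:
        return frozenset(fetch())
    except Exception as exc:
        logger.warning(
            "tree-sitter manifest unreachable (%s); falling back to line-based chunking",
            exc,
        )
        return frozenset()


def unreachable_manifest() -> list[str]:
    # Stand-in for manifest_languages() behind a blocked or intercepted connection.
    raise OSError("invalid peer certificate: UnknownIssuer")


offline = load_languages(unreachable_manifest)  # logs a warning, returns frozenset()
online = load_languages(lambda: ["python", "rust"])  # normal path is unchanged
```

The online path returns exactly what the fetch produced, so the helper would not change behavior for users who can reach the manifest.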

I applied the patch locally and verified end-to-end on a sandbox configured with the same UnknownIssuer failure as my work machine. import semble succeeds, the warning logs once, and SembleIndex.from_path(...) indexes a real codebase with no tree-sitter network calls (verified by checking ~/.cache/tree-sitter-language-pack/ stays empty). semble search returns ranked results. Testing against semble's own source:

- the symbol query "manifest_languages" ranks chunking/core.py first;
- the natural-language query "how files are walked and filtered" ranks index/file_walker.py:94-126 first;
- the query "reciprocal rank fusion for combining sparse and dense scores" ranks ranking/boosting.py first and search.py (where RRF actually lives) third, which is the kind of query where tree-sitter chunking would probably do better.
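The cache check mentioned above can be scripted. A minimal sketch, assuming the cache lives at the path given in this issue; `cache_is_empty` is a hypothetical helper, not something semble or tree-sitter-language-pack provides.

```python
from pathlib import Path


def cache_is_empty(cache_dir: Path) -> bool:
    """Return True if the directory is missing or contains no entries."""
    cache_dir = cache_dir.expanduser()
    if not cache_dir.is_dir():
        return True
    return not any(cache_dir.iterdir())


# Path taken from the issue text; adjust if the cache lives elsewhere.
print(cache_is_empty(Path("~/.cache/tree-sitter-language-pack")))
```

Running it before and after indexing shows whether any parser or manifest files were written during the run.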

Trade-offs in degraded mode

As stated by the repo authors, this section is wrong and should be ignored. Left for transparency.

Line-based chunking cuts at character windows (1500 chars by default), not at AST boundaries. Function definitions that straddle a chunk boundary get split, and ranking signals that depend on tree-sitter node types (the "definition boost" and "file coherence" signals from the README) become no-ops. In practice the embeddings carry most of the load and results are usable. Hard to put a precise number on it without running the benchmark with line-chunking forced, but it should sit between BM25 alone (NDCG@10 0.673 in your numbers) and full semble (0.854), closer to the upper bound because chunk-boundary quality is a smaller contributor than the embedding model itself. That's an acceptable trade for keeping semble working at all in environments that are currently completely blocked.

Related

#78 looks like a different surface symptom of the same underlying assumption: that semble can reach GitHub releases at runtime without checking. That issue reports a 20-minute hang with no error on first MCP search call against a 2604-file repo on Windows. Most likely cause is the cascading lazy get_parser(lang) downloads, one per language in the repo, on a connection that is slow rather than blocked. The fix here addresses that too: when _TREE_SITTER_LANGUAGES is empty after a failed manifest fetch, get_parser is never called and the cascade doesn't happen. Worth treating both as the same class of bug.
