Context
second_pass in api/analyzers/source_analyzer.py dominates indexing wall-time on real Python repos (e.g. sympy ~30 min, pytest ~4 min). Most of that time goes to lsp.request_definition calls funneled through multilspy's SyncLanguageServer, which in turn shells out to jedi.
Two structural problems make this expensive:
~80% of jedi calls return None ("Unexpected response from Language Server: None"). We pay 50–200 ms per call for nothing.
We already pay for tree-sitter to parse every file in first_pass. We have the parse tree, the entity table, and the import list. A tree-sitter-based static resolver can replace jedi for ~90% of resolution cases (module-local names, intra-project imports, attribute access on known classes) at orders-of-magnitude lower per-call cost.
Goal
Add a TreeSitterPythonResolver that implements the same resolve_symbol(...) contract as the jedi path, selectable at runtime via an env var. When enabled, indexing should:
Match jedi's edge counts within ~5% on a representative corpus (pytest, sympy, xarray).
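One way to read the ~5% target above is as a relative difference between the two resolvers' edge counts on the same repository, with the jedi run as the baseline. The sketch below assumes that reading; the function names are hypothetical and the numbers in the usage note are illustrative, not measurements.

```python
def relative_edge_delta(jedi_edges: int, tree_sitter_edges: int) -> float:
    """Relative difference in edge count, using the jedi run as the baseline."""
    if jedi_edges == 0:
        # No baseline edges: the runs agree only if both are empty.
        return 0.0 if tree_sitter_edges == 0 else float("inf")
    return abs(tree_sitter_edges - jedi_edges) / jedi_edges


def edge_count_within_tolerance(jedi_edges: int, tree_sitter_edges: int,
                                tolerance: float = 0.05) -> bool:
    """True when the tree-sitter edge count is within `tolerance` of jedi's."""
    return relative_edge_delta(jedi_edges, tree_sitter_edges) <= tolerance
```

For example, a run producing 960 edges against a jedi baseline of 1000 would pass, while 900 against 1000 would not. Note that a raw count comparison cannot tell whether the two resolvers found the same edges, only a similar number of them.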
Scope (in)
New api/analyzers/python/tree_sitter_resolver.py: TreeSitterPythonResolver implementing the analyzer resolve_symbol(files, lsp, file_path, project_path, key, symbol) contract. The lsp argument is accepted but ignored.
Project-wide symbol table, built once in first_pass: {(module_path, name) -> (file_path, node)}. The resolver subclasses TreeSitterAnalyzer from T15 to share parser/query plumbing.
Resolution rules (initial scope, Python only):
Module-local names (function / class defined in same file).
Names imported via from X import Y (resolve to module X then Y).
Names imported via import X then X.Y.
Method calls on instances whose class is statically inferable (assignment from Cls()).
Attribute access on a known class (Cls.method).
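The resolution rules above can be sketched as plain dictionary lookups against the project symbol table. This is a minimal illustration under stated assumptions, not the resolver's actual design: FileScope and resolve_static are hypothetical names, the table maps keys to a file path only (the real table also carries the tree-sitter node), and methods are assumed to be stored under a "Cls.method" name.

```python
from dataclasses import dataclass, field
from typing import Optional

# (module_path, name) -> file_path. Assumption: the node is omitted here.
SymbolTable = dict[tuple[str, str], str]
Key = tuple[str, str]


@dataclass
class FileScope:
    """Per-file facts a first pass could collect (hypothetical structure)."""
    module: str
    # "from X import Y [as alias]": alias -> (X, Y)
    from_imports: dict[str, Key] = field(default_factory=dict)
    # "import X [as alias]": alias -> X
    module_imports: dict[str, str] = field(default_factory=dict)
    # "v = Cls()": v -> "Cls"
    instance_types: dict[str, str] = field(default_factory=dict)


def _resolve_name(table: SymbolTable, scope: FileScope, name: str) -> Optional[Key]:
    # Rule 1: module-local function or class.
    local = (scope.module, name)
    if local in table:
        return local
    # Rule 2: name bound by "from X import Y".
    target = scope.from_imports.get(name)
    return target if target in table else None


def resolve_static(table: SymbolTable, scope: FileScope, symbol: str) -> Optional[Key]:
    """Return the symbol-table key for `symbol`, or None when unresolved."""
    head, _, attr = symbol.rpartition(".")
    if not head:
        return _resolve_name(table, scope, attr)
    # Rule 3: "import X" followed by X.Y.
    module = scope.module_imports.get(head)
    if module is not None:
        key = (module, attr)
        return key if key in table else None
    # Rule 4: v.method where v = Cls(); a bare class name falls through as Cls.method.
    cls_key = _resolve_name(table, scope, scope.instance_types.get(head, head))
    if cls_key is None:
        return None  # miss -> no edge
    key = (cls_key[0], f"{cls_key[1]}.{attr}")
    return key if key in table else None
```

Anything the lookups cannot find returns None, so dynamic constructs fall through to "unresolved" without inventing an edge.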
Runtime selection — CODE_GRAPH_PY_RESOLVER=tree_sitter|jedi env var. Default jedi so existing behavior is unchanged until we A/B.
Parallel indexing — once tree-sitter resolver works, switch second_pass to ProcessPoolExecutor over files. Resolver is a pure function of (file, project_symbol_table), no shared mutable state.
Bench harness A/B — extend bench_index_test.py to compare jedi vs tree-sitter on node_count, edge_count, and wall-time for pytest-6202, sympy-20154, xarray-3993.
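The runtime-selection and parallel-indexing items above might look roughly like the following. It is a sketch under assumptions: selected_resolver and second_pass_parallel are hypothetical helpers, rejecting unknown values is a design choice the issue does not specify, and the thread-pool option is included only as a debugging convenience.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Mapping, Optional, Sequence

VALID_RESOLVERS = ("jedi", "tree_sitter")


def selected_resolver(env: Optional[Mapping[str, str]] = None) -> str:
    """Read CODE_GRAPH_PY_RESOLVER; default to jedi so behavior is unchanged."""
    env = os.environ if env is None else env
    value = env.get("CODE_GRAPH_PY_RESOLVER", "jedi").strip().lower()
    if value not in VALID_RESOLVERS:
        raise ValueError(
            f"CODE_GRAPH_PY_RESOLVER must be one of {VALID_RESOLVERS}, got {value!r}"
        )
    return value


def second_pass_parallel(files: Sequence[str],
                         resolve_file: Callable[[str], object],
                         workers: int,
                         use_processes: bool = True) -> dict:
    """Fan resolve_file out over files and collect results keyed by path.

    With use_processes=True, resolve_file must be a picklable top-level
    function, since it is sent to worker processes.
    """
    pool_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with pool_cls(max_workers=workers) as pool:
        # Executor.map yields results in input order, so zip pairs them correctly.
        return dict(zip(files, pool.map(resolve_file, files)))
```

Because arguments and results cross process boundaries by pickling, a symbol table that holds tree-sitter node objects may need to be reduced to plain data (for example file path plus byte range) before it is handed to workers; whether that is required is worth verifying early.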
Scope (out)
Non-Python languages (JS/Kotlin/Java/C# stay on their current path).
Dynamic resolution (decorators, metaclasses, getattr, monkey-patching) — falls back to "unresolved", matching jedi's None behavior.
Type inference beyond direct assignment.
Changes to graph schema.
Files
new api/analyzers/python/tree_sitter_resolver.py
modified api/analyzers/python/analyzer.py: emit the project symbol table during first_pass
api/analyzers/source_analyzer.py: resolver selection
bench_index_test.py: A/B harness
tests/analyzers/test_tree_sitter_resolver.py
Acceptance criteria
CODE_GRAPH_INDEX_WORKERS=4: additional 2–3× speedup over single-threaded tree-sitter.
make test and make lint clean.
Dependencies
TreeSitterAnalyzer base class (T15).
Notes for the implementer
Build the project symbol table in first_pass so second_pass doesn't re-traverse. Estimate ~1 KB per defined symbol.
For imports, use tree-sitter queries on import_statement / import_from_statement nodes; resolve module paths via sys.path-like lookup rooted at the project root (no installed-deps resolution, which is out of scope).
Match jedi's "miss → no edge" semantics exactly; don't fabricate edges to make numbers look better. The A/B compares quality honestly.
The 5% edge-count tolerance accounts for jedi-only wins (dynamic dispatch) and tree-sitter-only wins (cases jedi returns None on).
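The project-rooted module lookup mentioned in the import note can be sketched with pathlib alone. resolve_module_path is a hypothetical helper; it assumes absolute dotted module names and regular packages with an __init__.py, and it deliberately ignores installed dependencies, relative imports, and namespace packages.

```python
from pathlib import Path
from typing import Optional


def resolve_module_path(project_root: Path, module: str) -> Optional[Path]:
    """Map a dotted module name to a file under project_root, or None."""
    parts = module.split(".")
    if not module or any(not part for part in parts):
        # Empty or relative names (".x") are not handled in this sketch.
        return None
    base = project_root.joinpath(*parts)
    candidates = (
        base.parent / (base.name + ".py"),  # pkg/util.py
        base / "__init__.py",               # pkg/util/__init__.py
    )
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    # Not found in the project: treat as external, so no edge is produced.
    return None
```

Returning None for anything outside the project keeps the lookup consistent with the "miss → no edge" rule: an import of an installed package simply yields no edge.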