Merge PR #9: PathFilter, file_path scoping, progress reporting, and include_tests by Codeturion · Pull Request #10 · Codeturion/codesurface

Codeturion · 2026-04-02T00:11:30Z

Summary

Merges #9 from @michael-howell-island with conflict resolution against #8 (C++ parser).

Resolve merge conflicts in db.py and server.py; both namespace and file_path params now coexist in get_class_members()
Align C++ parser to use BaseParser.parse_directory (consolidated os.walk with PathFilter)
Restore .test.ts/.spec.ts skip suffixes in TypeScript parser

What's included from #9

PathFilter: os.walk with dir pruning, .codesurfaceignore, --exclude CLI globs, worktree/submodule skipping
file_path scoping: search, get_signature, get_class tools accept optional file_path filter
Progress reporting: stderr output during indexing (scanning N files..., indexing: 50%, done:)
Consolidated parse_directory: all parsers now use BaseParser's os.walk loop with skip_suffixes, skip_files, _should_skip_dir hooks
include_tests param (cherry-picked from 3a278e6, authored by @michael-howell-island): search, get_signature, get_class exclude test files by default; pass include_tests=true to opt in. TypeScript parser also picks up .js/.jsx.

Review comments addressed

Extracted _file_path_condition() helper in db.py (was duplicated between search() and get_class_members())
String-based directory exclusion in walk loops (no Path() per dir, no relative_to() per file)
Progress count now matches parser exclusion rules (_count_files calls each parser's _walk_files)
Eliminated third directory walk (mtime snapshot now collected during parse via on_progress)

Test plan

42/42 pytest tests pass (10 new tests for include_tests)
C++ parser verified on imgui (4,840 records), bullet3 (15,867 records)
TS parser verified on vscode (91,794 records, test files excluded)

rglob() cannot skip directories mid-walk — it traverses the entire tree (including all git worktrees) before the path_filter check runs. With 16 worktrees the indexer was doing ~17x the necessary work, causing 1000+s startup times instead of ~2 minutes. Switch all six walk sites (base.py, server.py x2, typescript.py, python_parser.py, go.py, java.py) to os.walk() with in-place dirs[:] pruning. Excluded directories (.worktrees, submodules, _SKIP_DIRS) are now dropped before descent so os.walk never enters them.
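The in-place pruning described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `EXCLUDED_DIRS`, `walk_source_files`, and `build_demo_tree` are hypothetical names, and the real PathFilter applies more rules than a fixed set of directory names.

```python
import os
import tempfile

# Hypothetical exclusion set for illustration; the PR's PathFilter has its own rules.
EXCLUDED_DIRS = {".worktrees", ".git", "node_modules"}


def walk_source_files(root, extensions=(".py",)):
    """Yield matching file paths under root, never descending into excluded dirs."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Slice assignment mutates the list os.walk holds, so pruned
        # directories are dropped before os.walk descends into them.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
        for name in filenames:
            if name.endswith(extensions):
                yield os.path.join(dirpath, name)


def build_demo_tree():
    """Create a small tree with one real source file and one worktree copy."""
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, "src"))
    os.makedirs(os.path.join(root, ".worktrees", "feature", "src"))
    for rel in ("src/app.py", ".worktrees/feature/src/app.py"):
        with open(os.path.join(root, *rel.split("/")), "w") as fh:
            fh.write("x = 1\n")
    return root
```

Rebinding the name (`dirnames = [...]`) would not work here, because os.walk only sees changes made to the original list object.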

Also forward on_progress in TypeScriptParser's parse_directory override. The callback is invoked after each successful parse_file call, passing the Path of the parsed file; skipped files and parse errors do not trigger it.

…ertions

- Remove `print(summary, file=sys.stderr)` from `main()` after calling `_index_full()`, since `_index_full` already prints the done line itself.
- Strengthen `test_index_full_emits_progress_to_stderr` to also assert that the scanning line and 0% baseline indexing line appear in stderr.

…le level

Move on_progress callback to a finally block in all five parsers so it fires whether parse_file succeeds or raises. Previously, unparseable files were counted by _count_files but never triggered on_progress, causing the startup progress display to stall below 100% on repos with problematic files.

Also move `import sys` from inside except blocks to module-level imports in typescript.py, python_parser.py, go.py, and java.py.
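A rough sketch of the finally-block pattern, assuming a simplified loop rather than the parsers' actual parse_directory. `parse_all`, `fake_parse`, and `demo` are hypothetical names used only for illustration.

```python
from pathlib import Path


def parse_all(paths, parse_file, on_progress=None):
    """Parse each path; report progress even when a file fails to parse."""
    records = []
    for path in paths:
        try:
            records.extend(parse_file(path))
        except Exception:
            # A real indexer would log the error; the sketch just skips the file.
            pass
        finally:
            # Runs on success and on failure, so the counter can reach the total.
            if on_progress is not None:
                on_progress(path)
    return records


def demo():
    """Run parse_all over three fake files, one of which raises."""
    def fake_parse(path):
        if path.name == "broken.py":
            raise ValueError("unparseable")
        return [path.stem]

    seen = []
    paths = [Path("a.py"), Path("broken.py"), Path("c.py")]
    records = parse_all(paths, fake_parse, on_progress=seen.append)
    return records, len(seen), len(paths)
```

With the callback in the finally clause, the number of progress events equals the number of files counted up front, which is what lets the percentage reach 100.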

- Add _DEFAULT_EXCLUDED_DIRS to PathFilter (node_modules, .git, dist, build, vendor, .nx, .yarn, etc.) so vendored dirs are always skipped
- Rewrite detect_languages to use os.walk with PathFilter instead of rglob, which crawled into node_modules and hung indefinitely
- Consolidate duplicate parse_directory from all 4 parsers into BaseParser with str ops instead of pathlib (saves ~3s on 30K files)
- Rewrite _count_braces_and_parens: regex strip + str.count with fast path for lines without strings/comments (saves ~4s on 30K files)
- Replace Path.read_text with open()+read(), path.relative_to with os.path.relpath, and _file_to_module with pure string ops
- Remove per-parser _SKIP_DIRS (now centralized in PathFilter)
- Remove __tests__/__mocks__/test/spec file exclusions from TS parser so test files are indexed for coding reference
- Add skip_suffixes, skip_files, _should_skip_dir hooks to BaseParser for per-parser file filtering without duplicating the walk loop

Benchmark on 34K-file TS monorepo: 22s -> 15s (32% faster)
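The "regex strip + str.count with fast path" idea might look roughly like the sketch below. The regex, the return shape, and the fast-path check are assumptions made for illustration; the PR's _count_braces_and_parens may differ, and this version does not handle strings or block comments that span lines, or JavaScript regex literals.

```python
import re

# Matches string literals first, then comments, so "//" inside a string
# is consumed as part of the string rather than treated as a comment.
_STRINGS_AND_COMMENTS = re.compile(
    r'"(?:\\.|[^"\\])*"'      # double-quoted string
    r"|'(?:\\.|[^'\\])*'"     # single-quoted string
    r"|`(?:\\.|[^`\\])*`"     # template literal (single line only)
    r"|//.*"                  # line comment
    r"|/\*.*?\*/"             # block comment closed on the same line
)


def count_braces_and_parens(line):
    """Return (brace_delta, paren_delta) for one line of source."""
    # Fast path: without quotes or slashes there is nothing to strip.
    if not any(ch in line for ch in "\"'`/"):
        stripped = line
    else:
        stripped = _STRINGS_AND_COMMENTS.sub("", line)
    return (
        stripped.count("{") - stripped.count("}"),
        stripped.count("(") - stripped.count(")"),
    )
```

Most source lines contain neither quotes nor slashes, so they skip the regex entirely and cost only four str.count calls.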

Path('~/work/cloud').is_dir() returns False because Python's pathlib does not expand ~ — causing the server to skip indexing entirely and sit idle in the MCP listen loop without emitting the done: line.
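A minimal sketch of the fix, assuming the project path arrives as a plain string from the CLI. `resolve_project_path` is a hypothetical helper name; `Path.expanduser()` is the standard pathlib call that performs the expansion.

```python
from pathlib import Path


def resolve_project_path(raw):
    """Expand a leading ~ before any filesystem checks are made."""
    return Path(raw).expanduser()
```

Shells normally expand `~` themselves, but the literal string survives when the argument is quoted or comes from a JSON config, which is likely how it reached the server unexpanded.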

These are internal planning artifacts and fork install instructions that don't belong in the upstream PR.

Resolve conflicts between PR #8 (C++ namespace support) and PR #9 (PathFilter, file_path scoping, progress reporting). Merged both namespace and file_path params into get_class_members.

- Remove CppParser.parse_directory override, use BaseParser's os.walk with _should_skip_dir for C++-specific dirs (Debug, Release, x64, x86, cmake-build-*)
- Remove _SKIP_DIRS from cpp.py (now centralized in PathFilter)
- Apply open()+read() and os.path.relpath to cpp.py for consistency
- Restore .test.ts/.spec.ts skip suffixes in TypeScript parser

gemini-code-assist

Code Review

This pull request introduces a comprehensive path filtering system and significant performance optimizations for the indexing process. Key changes include the addition of a PathFilter class to handle default and user-defined exclusions, the integration of progress reporting during full indexing, and the refactoring of language parsers to use os.walk for faster directory traversal. The database layer was also updated to support scoped searches via a new file_path parameter. Feedback focuses on further performance improvements, such as consolidating multiple directory walks, reducing pathlib overhead in hot loops, and addressing code duplication in SQL construction. Additionally, the file counting logic for progress reporting needs to be synchronized with parser-specific exclusion rules to ensure accuracy.

- Extract _walk_files from parse_directory so _count_files can reuse the same filter logic (skip_suffixes, skip_files, _should_skip_dir)
- Collect mtimes in on_progress callback instead of a separate walk
- _index_full now does 2 walks (count + parse) instead of 3
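The shape of this change can be sketched as below, under simplifying assumptions. `SketchParser`, `walk_files`, and `index_full` are hypothetical stand-ins, not the PR's BaseParser or _index_full, and "parsing" here just records the file name.

```python
import os
import tempfile


class SketchParser:
    """Hypothetical stand-in for a language parser; not the PR's BaseParser."""

    file_extensions = (".ts",)
    skip_suffixes = (".test.ts", ".d.ts")

    def walk_files(self, root):
        """Single source of truth for which files this parser handles."""
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                if name.endswith(self.file_extensions) and not name.endswith(self.skip_suffixes):
                    yield os.path.join(dirpath, name)

    def parse_directory(self, root, on_progress=None):
        """'Parse' each file (here: just record its name) and report progress."""
        records = []
        for path in self.walk_files(root):
            records.append(os.path.basename(path))
            if on_progress is not None:
                on_progress(path)
        return records


def index_full(parser, root):
    """Two walks: one to count, one to parse while collecting mtimes."""
    total = sum(1 for _ in parser.walk_files(root))
    mtimes = {}

    def on_progress(path):
        # The mtime snapshot piggybacks on the progress callback.
        mtimes[path] = os.path.getmtime(path)

    records = parser.parse_directory(root, on_progress=on_progress)
    return total, records, mtimes
```

Because the count and the parse both go through the same generator, the progress total cannot drift from the set of files actually parsed.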

- db.py: extract _file_path_condition() helper, dedupe between search() and get_class_members()
- filters.py: add is_dir_excluded_git() and _read_git_file_str() for string-path walks; route is_file_excluded() through is_file_excluded_rel
- parsers/base.py: walk loop uses is_dir_excluded_git(root, d) instead of Path(os.path.join(...))
- server.py: _index_incremental walk is fully string-based; it drops per-dir Path() and per-file relative_to() in favor of prefix slicing
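One way a shared condition helper like this could look is sketched below. The signature, the `records` table, the column name, and the prefix-match semantics are all assumptions for illustration; the actual _file_path_condition() in db.py may build its SQL differently.

```python
import sqlite3


def file_path_condition(file_path, column="file_path"):
    """Return (sql_fragment, params) restricting rows to a file or directory prefix.

    Hypothetical signature; the helper in the PR may differ.
    """
    if not file_path:
        return "", []
    # Escape LIKE wildcards so a literal % or _ in the path is not a pattern.
    escaped = (
        file_path.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    return f" AND {column} LIKE ? ESCAPE '\\'", [escaped + "%"]


def search(conn, name, file_path=None):
    """Look up records by name, optionally scoped to a path prefix."""
    cond, params = file_path_condition(file_path)
    sql = "SELECT name, file_path FROM records WHERE name = ?" + cond + " ORDER BY file_path"
    return conn.execute(sql, [name, *params]).fetchall()
```

Returning the fragment and its parameters together lets every query function append the same condition without repeating the escaping logic.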

Add include_tests boolean (default false) to search, get_signature, and get_class MCP tools so test files are excluded from results by default but easily pulled in when needed.

Test file detection covers common conventions:

- Directory patterns: __tests__/, tests/, test/
- Filename patterns: .test., .spec., _test., test_

Also adds .js/.jsx to TypeScript parser file_extensions so plain JavaScript files are indexed alongside TypeScript.
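The detection conventions listed above could be implemented along these lines. The patterns come from the commit message; how they are matched (whole directory names, `test_` only as a filename prefix) is an assumption, and `is_test_file` and `filter_results` are hypothetical names.

```python
# Patterns taken from the commit message above; matching rules are an assumption.
_TEST_DIRS = {"__tests__", "tests", "test"}
_TEST_NAME_MARKERS = (".test.", ".spec.", "_test.")


def is_test_file(rel_path):
    """Guess whether a repo-relative path points at a test file."""
    parts = rel_path.replace("\\", "/").split("/")
    *dirs, filename = parts
    if any(d in _TEST_DIRS for d in dirs):
        return True
    if filename.startswith("test_"):
        return True
    return any(marker in filename for marker in _TEST_NAME_MARKERS)


def filter_results(paths, include_tests=False):
    """Drop test files unless the caller opts in."""
    if include_tests:
        return list(paths)
    return [p for p in paths if not is_test_file(p)]
```

Matching whole path components rather than substrings keeps paths such as `src/contest/a.ts` or `src/latest.ts` from being misclassified.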

The incremental walk previously used a single global os.walk over all_extensions(), which only applied path_filter. Per-parser rules (skip_suffixes, skip_files, _should_skip_dir) were skipped, so _file_mtimes accumulated entries for files the parser ultimately ignored (e.g. .test.ts, .spec.ts, .d.ts, module-info.java). This caused first-reindex-after-full-index to spuriously report files as "added" and inflated the scanned-file count. Switch to per-parser _walk_files (same as _count_files in _index_full) so the two indexing paths see the same file set. Add a regression test.
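To see why mismatched file sets produce spurious "added" entries, consider a simple snapshot diff. `diff_snapshots` is a hypothetical helper, and the `{path: mtime}` dictionary shape is an assumption about how _file_mtimes is stored.

```python
def diff_snapshots(previous, current):
    """Compare two {path: mtime} snapshots and classify the differences."""
    added = sorted(p for p in current if p not in previous)
    removed = sorted(p for p in previous if p not in current)
    changed = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return added, changed, removed
```

If the full index skipped `a.test.ts` but the incremental walk includes it, the file shows up in `added` even though nothing on disk changed; using the same per-parser walk for both snapshots makes the diff empty.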

_index_full pins to a single parser when --language is set, but _index_incremental ignored the flag and called get_parsers_for_project, which auto-detects every language present in the tree. On a polyglot project run with --language=cpp, the incremental walk found .py files the full walk had skipped, so the first reindex falsely reported them as added (and re-parsed them on every restart-then-reindex cycle). Store the CLI language in a module global at startup and apply the same pinning to incremental walks. Add a regression test covering the case where --language=python is set in a project containing both .py and .ts files.

michael-howell-island and others added 25 commits March 29, 2026 15:55

docs: add filtering and worktree support design

4f87c3c

docs: add filtering implementation plan

54376ca

feat: add PathFilter with default worktree/submodule skip rules

6c299b2

feat: add .codesurfaceignore and --exclude glob support to PathFilter

81cc39c

feat: thread PathFilter through parse_directory for dir and file exclusion

b6815e4

fix: add path_filter support to Go, Java, Python parse_directory overrides

d13cb01

feat: add --exclude and --include-submodules CLI args, wire PathFilter into server indexing

c2dd5ff

feat: add file_path scoping to search, get_signature, get_class tools

67fe469

fix: Path(rel) in incremental reindex, apply file_path to get_class_members, clean up docstring

4f08f53

docs: add filtering features, install instructions for fork

f7116b5

Merge pull request #1 from michael-howell-island/feat/filtering-and-worktree-support

cc24549

feat: file filtering, worktree exclusions, and query-time file_path scoping

docs: add startup progress reporting design

8fd7a72

docs: add startup progress implementation plan

15d5c6b

feat: add on_progress callback to BaseParser.parse_directory

b56ef8a

feat: forward on_progress callback in Python, Go, Java parser overrides

8929c53

feat: stream indexing progress to stderr with file count and percentage

be30dec

fix: apply is_file_excluded in _count_files to match parser behavior

075f619

fix: expand tilde in --project path to support ~/work/cloud style args

4dc4af7

chore: remove fork-specific README additions and planning docs

84432c4

Merge PR #9 with conflict resolution

8075185

gemini-code-assist Bot reviewed Apr 2, 2026

Comment thread src/codesurface/server.py Outdated

Comment thread src/codesurface/server.py Outdated

Comment thread src/codesurface/db.py Outdated

Comment thread src/codesurface/parsers/base.py Outdated

Comment thread src/codesurface/filters.py

Codeturion self-assigned this Apr 2, 2026

Codeturion and others added 2 commits April 28, 2026 23:37

This was referenced Apr 28, 2026

Various fixes for speed, flexibility, and reporting status #9

Closed

feat: add --output json flag for structured tool responses #11

Open

Codeturion changed the title ~~Merge PR #9: PathFilter, file_path scoping, and progress reporting~~ Merge PR #9: PathFilter, file_path scoping, progress reporting, and include_tests Apr 28, 2026

Codeturion added 2 commits April 29, 2026 00:15

Codeturion merged commit b0e81e8 into master Apr 28, 2026

Codeturion deleted the review/pr-9 branch April 28, 2026 21:33

Codeturion merged 30 commits into master from review/pr-9
