Skip to content

feat: Code Intelligence — auto-populate project entities from repo metadata (R68 Round 4)#192

Merged
EtanHey merged 3 commits intomainfrom
feat/code-intelligence-r68-round4
Apr 3, 2026
Merged

feat: Code Intelligence — auto-populate project entities from repo metadata (R68 Round 4)#192
EtanHey merged 3 commits intomainfrom
feat/code-intelligence-r68-round4

Conversation

@EtanHey
Copy link
Copy Markdown
Owner

@EtanHey EtanHey commented Apr 3, 2026

Summary

  • New code_intelligence.py module that scans ~/Gits/ for repos with pyproject.toml or package.json
  • Extracts metadata (name, version, description, dependencies, scripts, language/framework) and upserts structured project entities into the KG
  • Creates depends_on relations for notable dependencies (frameworks, SDKs, databases)
  • New CLI command: brainlayer code-intel [--base-dir] [--dry-run]
  • Detects package manager (npm/bun/pnpm/yarn) and framework (React, Next.js, MCP, Convex, etc.)

R68 Sprint Context

Test plan

  • 14 tests pass covering scan, enrich, idempotency, dry-run, metadata storage
  • Ruff lint + format clean
  • Follows agent_enrichment.py pattern for KG entity upsert

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

🤖 Generated with Claude Code

Note

Add code-intel pipeline to auto-populate project entities from repo metadata

  • Adds code_intelligence.py which scans a directory of repos, extracts metadata from pyproject.toml or package.json, and upserts project entities and depends_on relations into the knowledge graph SQLite DB.
  • Metadata extraction captures name, version, description, dependencies, scripts, language, package manager, and (for JS) framework inferred from dependency names.
  • Dependency relations are only created for a curated set of notable library names; missing library entities are auto-created with lib- prefixed IDs.
  • Exposes the pipeline via a code-intel Typer CLI command in cli/init.py supporting --base-dir and --dry-run; returns stats covering entities created/updated and relations added.
  • Adds a full test suite in tests/test_code_intelligence.py covering scanning, extraction, DB persistence, idempotency, and dry-run behavior.

Macroscope summarized 29611c0.

Summary by CodeRabbit

  • New Features
    • Added a code-intel CLI to scan repositories, extract metadata (name, version, language, package manager, scripts, framework hints), and enrich the knowledge graph; supports --base-dir and --dry-run.
    • Detects Python and JavaScript (including bun) projects, creates/updates project and library entities, and establishes project→library relationships without duplicating on re-runs.
  • Tests
    • Added tests for discovery, metadata parsing, enrichment, dry-run behavior, reporting, and idempotency.

…tadata (R68 Round 4)

Scans ~/Gits/ for repos with pyproject.toml or package.json, extracts metadata
(name, version, description, dependencies, scripts, language/framework), and
upserts structured project entities into the KG with depends_on relations for
notable dependencies.

- New module: pipeline/code_intelligence.py (scan, extract, enrich)
- New CLI command: brainlayer code-intel [--base-dir] [--dry-run]
- 14 tests covering scan, enrich, idempotency, dry-run, metadata storage
- Detects package manager (npm/bun/pnpm/yarn) and framework (React, Next.js, MCP, etc.)
- Creates library entities for notable deps (frameworks, SDKs, databases)
- Follows agent_enrichment.py pattern for KG entity upsert

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented Apr 3, 2026

@coderabbitai review

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 3, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 3, 2026

📝 Walkthrough

Walkthrough

Added a new code-intel CLI and a code intelligence pipeline that scans local repos (pyproject/package.json), extracts metadata/dependencies, upserts project/library entities into the KG SQLite DB, creates depends_on relations, supports dry-run, and reports counts.

Changes

Cohort / File(s) Summary
CLI Integration
src/brainlayer/cli/__init__.py
Added code-intel Typer command (--base-dir/-d, --dry-run) that resolves DB path, invokes enrich_projects(), and prints a completion report of scanned projects, created/updated entities, relations added, and created library entities.
Core Pipeline
src/brainlayer/pipeline/code_intelligence.py
New module with scan_projects() and enrich_projects(). Discovers repos by pyproject.toml/package.json, parses metadata (name/version/description/language/package_manager/deps/scripts/framework hint), upserts project entities (match on LOWER(name)), upserts notable library entities, creates depends_on relations if missing, supports dry_run, logs progress, and returns operation stats.
Tests
tests/test_code_intelligence.py
New test suite with fixtures and tests covering project discovery, metadata parsing, framework detection, DB enrichment (create/update entities, relations, libraries), dry-run behavior, idempotency, and ignored dirs.
Database Test Fixtures
tests/... (fixtures within tests/test_code_intelligence.py)
Adds temp SQLite DB setup for kg_entities and kg_relations to validate enrichment behavior and idempotency.

Sequence Diagram

sequenceDiagram
    actor User
    participant CLI as "CLI\nrgba(45,125,255,0.5)"
    participant Scanner as "Scanner\nrgba(60,180,75,0.5)"
    participant Enricher as "Enricher\nrgba(255,165,0,0.5)"
    participant Database as "Database\nrgba(128,0,128,0.5)"

    User->>CLI: run code-intel (--base-dir, --dry-run)
    CLI->>Scanner: scan_projects(base_dir)
    Scanner->>Scanner: detect repos (pyproject.toml / package.json)
    Scanner-->>CLI: return list of project metadata

    CLI->>Enricher: enrich_projects(db_path, base_dir, dry_run)
    loop for each project
        Enricher->>Database: query project by LOWER(name)
        alt not found
            Enricher->>Database: insert project entity
        else found
            Enricher->>Database: update project entity
        end
        loop for each notable dependency
            Enricher->>Database: upsert library entity
            Enricher->>Database: create depends_on relation (if missing)
        end
    end
    Database-->>Enricher: return operation counts
    Enricher-->>CLI: return stats
    CLI-->>User: print completion report
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hop through folders, sniffing tomes of code,

I find the names, the scripts, the nodes,
I stitch the links where libraries belong,
In KG burrows I hum a tidy song,
Carrots for commits, and graphs all day long.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 55.17% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: Code Intelligence — auto-populate project entities from repo metadata (R68 Round 4)' clearly and specifically describes the main feature being added: a code intelligence system that automatically populates project entities from repository metadata.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/code-intelligence-r68-round4

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread src/brainlayer/cli/__init__.py
) -> None:
"""Add a depends_on relation from project to dependency.

Creates the dependency as a 'library' entity if it doesn't exist.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low pipeline/code_intelligence.py:265

In dry-run mode, stats["dep_entities_created"] over-counts when multiple projects share the same notable dependency. When dry_run=True, the library entity is never inserted, so each subsequent call for the same dependency fails the target_row check, generates a new random target_id, and increments the counter again. For example, 3 projects depending on react reports dep_entities_created: 3 instead of 1. Consider tracking which dependencies have been "virtually" created in dry-run mode to keep the stats accurate.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file src/brainlayer/pipeline/code_intelligence.py around line 265:

In dry-run mode, `stats["dep_entities_created"]` over-counts when multiple projects share the same notable dependency. When `dry_run=True`, the library entity is never inserted, so each subsequent call for the same dependency fails the `target_row` check, generates a new random `target_id`, and increments the counter again. For example, 3 projects depending on `react` reports `dep_entities_created: 3` instead of `1`. Consider tracking which dependencies have been "virtually" created in dry-run mode to keep the stats accurate.

Evidence trail:
src/brainlayer/pipeline/code_intelligence.py lines 309-323 at REVIEWED_COMMIT:
- Line 309-312: target_row query checks database for existing entity
- Line 314: `if not target_row:` branch
- Line 315: generates new UUID-based target_id
- Line 316-320: `if not dry_run:` - INSERT only runs when dry_run=False
- Line 321: `stats["dep_entities_created"] += 1` - incremented unconditionally within the `if not target_row` block

In dry-run mode, the entity is never inserted, so subsequent calls for the same dependency will find target_row=None again, generating new IDs and incrementing the counter multiple times.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged — minor dry-run accuracy issue. In practice dry-run is for previewing what would happen, and the slight overcount doesn't affect actual DB writes. Won't fix for now.

Comment thread src/brainlayer/pipeline/code_intelligence.py
Comment thread src/brainlayer/pipeline/code_intelligence.py Outdated
…issing key

- Fix framework detection order: check specific (expo, react-native, next)
  before generic (react) to avoid misclassification
- Add ~ to version specifier regex for PEP 440 compatible release (~=)
- Add dep_entities_created key to early return path (empty scan)
- Add 3 regression tests for all 3 fixes (17 total)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/brainlayer/cli/__init__.py`:
- Around line 1936-1957: The code block inside the CLI command function
code_intel is failing ruff format checks; run the project's formatter and linter
(e.g., ruff check src/ && ruff format src/) or directly run ruff format on the
file containing the code_intel function so the import/enrichment block and the
f-strings are reformatted to match project style, then re-run tests/CI and
commit the formatted changes (ensure changes include the code_intel function and
any nearby formatting Ruff adjusts).

In `@src/brainlayer/pipeline/code_intelligence.py`:
- Around line 57-65: The pyproject manifest loader in _extract_pyproject
currently lets tomllib.load() exceptions propagate and abort the scan; update
_extract_pyproject to wrap the file load/parsing in a try/except that catches
tomllib.JSONDecodeError, tomllib.TOMLDecodeError (or generic Exception if
backend-specific), logs a warning that includes the repo_path/path and the
error, and returns an empty dict so the scan continues; apply the same defensive
pattern to the package.json extractor referenced at lines 91-94 (wrap
json.load() in try/except, log a warning with file/path and error, and return an
empty dict).
- Around line 349-366: Remove the duplicated argparse-based CLI block under if
__name__ == "__main__" that imports argparse, sets up logging, parses
--base-dir/--dry-run, calls enrich_projects and prints results; this duplicate
entrypoint should be deleted so the Typer CLI is the single source of truth.
Ensure enrich_projects, DEFAULT_BASE_DIR and any needed types remain
exported/importable by the Typer command (do not change enrich_projects
signature), and remove the print/result-handling code in this module so callers
(the Typer command) handle invocation and output.
- Around line 168-171: The early-return when no projects are found in
process_scan (in src/brainlayer/pipeline/code_intelligence.py where projects =
scan_projects(base_dir)) returns a stats dict missing the dep_entities_created
key; update that return to include dep_entities_created: 0 so the stats shape
matches downstream expectations (e.g., the caller in
src/brainlayer/cli/__init__.py that reads dep_entities_created) and avoids
KeyError.
- Around line 309-321: The SELECT that looks up target_row should restrict to
library entities to avoid matching non-library rows; update the query executed
via conn.execute that currently uses "SELECT id FROM kg_entities WHERE
LOWER(name) = LOWER(?)" to include "AND entity_type = 'library'" (and pass
dep_name as before), so the lookup uses both entity_type and name, and leave the
INSERT into kg_entities (which already sets entity_type='library') unchanged;
make this change in the block that references target_row, conn.execute, dep_name
and the subsequent INSERT that creates target_id.
- Around line 173-175: The bulk writer in
src/brainlayer/pipeline/code_intelligence.py currently opens SQLite directly
with sqlite3.connect(db_path), bypassing the shared KG bootstrap in
src/brainlayer/vector_store.py (the kg_entities/kg_relations initializer) and
thus can run against un-migrated DBs and conflict with enrichment workers;
change the code to obtain the DB connection via the shared initializer in
vector_store (reuse the existing bootstrap/init function that ensures migrations
and columns like description/importance exist), ensure you stop enrichment
workers and checkpoint the WAL before performing bulk writes, use a per-worker
DB connection (do not share a single connection across threads/processes), and
add retry/backoff logic for SQLITE_BUSY errors (exponential backoff with limited
attempts) so the bulk writer coordinates correctly with other BrainLayer
writers.

In `@tests/test_code_intelligence.py`:
- Around line 89-111: The test fixture creates kg_entities and kg_relations
without the UNIQUE constraints used in production, causing different behavior
for idempotency/upsert tests; update the CREATE TABLE statements for kg_entities
to include UNIQUE(entity_type, name) and for kg_relations to include
UNIQUE(source_id, target_id, relation_type) so the test schema matches the
production constraints used by the vector store code (kg_entities,
kg_relations).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1b97f866-bf99-4893-86b9-cb6c135ce6d1

📥 Commits

Reviewing files that changed from the base of the PR and between 6a9137d and d04b6d8.

📒 Files selected for processing (3)
  • src/brainlayer/cli/__init__.py
  • src/brainlayer/pipeline/code_intelligence.py
  • tests/test_code_intelligence.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

Files:

  • src/brainlayer/cli/__init__.py
  • tests/test_code_intelligence.py
  • src/brainlayer/pipeline/code_intelligence.py
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Use Python/Typer CLI architecture for the main package in src/brainlayer/
All scripts and CLI must use paths.py:get_db_path() for resolving database path instead of hardcoding
Implement chunk lifecycle management with columns superseded_by, aggregated_into, archived_at; default search excludes lifecycle-managed chunks
Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first
Drop FTS triggers before bulk deletes on chunks table and recreate after; batch deletes in 5-10K chunks with checkpoint every 3 batches
Implement retry logic on SQLITE_BUSY errors; each worker must use its own database connection
Use ruff check src/ && ruff format src/ for linting and formatting

Files:

  • src/brainlayer/cli/__init__.py
  • src/brainlayer/pipeline/code_intelligence.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use pytest for testing

Files:

  • tests/test_code_intelligence.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Request codex review, cursor review, and bugbot review for BrainLayer PRs
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/cli.py : Use Typer CLI framework for command-line interface in `src/brainlayer/`

Applied to files:

  • src/brainlayer/cli/__init__.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : Use Python/Typer CLI architecture for the main package in `src/brainlayer/`

Applied to files:

  • src/brainlayer/cli/__init__.py
🪛 GitHub Actions: CI
src/brainlayer/cli/__init__.py

[error] 1-1: ruff format --check failed: file would be reformatted (1 file out of 172). Run 'ruff format src/brainlayer/cli/init.py' to apply formatting.

Comment thread src/brainlayer/cli/__init__.py
Comment on lines +57 to +65
def _extract_pyproject(repo_path: Path, path: Path) -> dict[str, Any]:
"""Extract metadata from pyproject.toml."""
try:
import tomllib
except ImportError:
import tomli as tomllib # type: ignore[no-redef]

with open(path, "rb") as f:
data = tomllib.load(f)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Skip malformed manifests instead of aborting the whole scan.

tomllib.load() and json.load() failures bubble straight out of these helpers. One half-written pyproject.toml or package.json under the scan root will terminate the entire command instead of warning and continuing with the next repo.

Also applies to: 91-94

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/pipeline/code_intelligence.py` around lines 57 - 65, The
pyproject manifest loader in _extract_pyproject currently lets tomllib.load()
exceptions propagate and abort the scan; update _extract_pyproject to wrap the
file load/parsing in a try/except that catches tomllib.JSONDecodeError,
tomllib.TOMLDecodeError (or generic Exception if backend-specific), logs a
warning that includes the repo_path/path and the error, and returns an empty
dict so the scan continues; apply the same defensive pattern to the package.json
extractor referenced at lines 91-94 (wrap json.load() in try/except, log a
warning with file/path and error, and return an empty dict).

Comment thread src/brainlayer/pipeline/code_intelligence.py Outdated
Comment on lines +173 to +175
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode = WAL")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Go through the shared KG bootstrap before writing here.

This raw sqlite3.connect() path bypasses the kg_entities/kg_relations bootstrap in src/brainlayer/vector_store.py, while the later statements already assume migrated columns like description and importance exist. Fresh or pre-migration DBs can fail immediately, and this bulk writer still has no SQLITE_BUSY retry or coordination with other BrainLayer writers. Reuse the shared initializer before issuing these writes. As per coding guidelines 'Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first' and 'Implement retry logic on SQLITE_BUSY errors; each worker must use its own database connection'.

Also applies to: 231-245

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/pipeline/code_intelligence.py` around lines 173 - 175, The
bulk writer in src/brainlayer/pipeline/code_intelligence.py currently opens
SQLite directly with sqlite3.connect(db_path), bypassing the shared KG bootstrap
in src/brainlayer/vector_store.py (the kg_entities/kg_relations initializer) and
thus can run against un-migrated DBs and conflict with enrichment workers;
change the code to obtain the DB connection via the shared initializer in
vector_store (reuse the existing bootstrap/init function that ensures migrations
and columns like description/importance exist), ensure you stop enrichment
workers and checkpoint the WAL before performing bulk writes, use a per-worker
DB connection (do not share a single connection across threads/processes), and
add retry/backoff logic for SQLITE_BUSY errors (exponential backoff with limited
attempts) so the bulk writer coordinates correctly with other BrainLayer
writers.

Comment on lines +309 to +321
target_row = conn.execute(
"SELECT id FROM kg_entities WHERE LOWER(name) = LOWER(?)",
(dep_name,),
).fetchone()

if not target_row:
target_id = f"lib-{uuid.uuid4().hex[:12]}"
if not dry_run:
conn.execute(
"""INSERT INTO kg_entities (id, entity_type, name, importance, created_at, updated_at)
VALUES (?, 'library', ?, 3.0, strftime('%Y-%m-%dT%H:%M:%fZ','now'), strftime('%Y-%m-%dT%H:%M:%fZ','now'))""",
(target_id, dep_name),
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Restrict the dependency lookup to library entities.

kg_entities is unique on (entity_type, name), not on name alone. Without entity_type = 'library' here, a dependency like openai or fastapi can resolve to some unrelated entity with the same name and the depends_on edge will point at the wrong node.

Suggested fix
     target_row = conn.execute(
-        "SELECT id FROM kg_entities WHERE LOWER(name) = LOWER(?)",
+        "SELECT id FROM kg_entities WHERE LOWER(name) = LOWER(?) AND entity_type = 'library'",
         (dep_name,),
     ).fetchone()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/pipeline/code_intelligence.py` around lines 309 - 321, The
SELECT that looks up target_row should restrict to library entities to avoid
matching non-library rows; update the query executed via conn.execute that
currently uses "SELECT id FROM kg_entities WHERE LOWER(name) = LOWER(?)" to
include "AND entity_type = 'library'" (and pass dep_name as before), so the
lookup uses both entity_type and name, and leave the INSERT into kg_entities
(which already sets entity_type='library') unchanged; make this change in the
block that references target_row, conn.execute, dep_name and the subsequent
INSERT that creates target_id.

Comment on lines +349 to +366
if __name__ == "__main__":
import argparse

logging.basicConfig(level=logging.INFO, format="%(message)s")

parser = argparse.ArgumentParser(description="Populate project entities from repo metadata")
parser.add_argument("--base-dir", type=Path, default=DEFAULT_BASE_DIR, help="Directory containing repos")
parser.add_argument("--dry-run", action="store_true", help="Show changes without applying")
args = parser.parse_args()

result = enrich_projects(base_dir=args.base_dir, dry_run=args.dry_run)
prefix = "[DRY RUN] " if args.dry_run else ""
print(f"\n{prefix}Code intelligence scan complete:")
print(f" Projects scanned: {result['projects_scanned']}")
print(f" Entities created: {result['entities_created']}")
print(f" Entities updated: {result['entities_updated']}")
print(f" Relations added: {result['relations_added']}")
print(f" Library entities added: {result['dep_entities_created']}")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Drop the secondary argparse CLI from the pipeline module.

The Typer command already exposes this flow. Keeping another entrypoint here duplicates option parsing/output and makes the behavior easy to drift. As per coding guidelines 'Use Python/Typer CLI architecture for the main package in src/brainlayer/'.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/pipeline/code_intelligence.py` around lines 349 - 366, Remove
the duplicated argparse-based CLI block under if __name__ == "__main__" that
imports argparse, sets up logging, parses --base-dir/--dry-run, calls
enrich_projects and prints results; this duplicate entrypoint should be deleted
so the Typer CLI is the single source of truth. Ensure enrich_projects,
DEFAULT_BASE_DIR and any needed types remain exported/importable by the Typer
command (do not change enrich_projects signature), and remove the
print/result-handling code in this module so callers (the Typer command) handle
invocation and output.

Comment on lines +89 to +111
conn.execute(
"""CREATE TABLE IF NOT EXISTS kg_entities (
id TEXT PRIMARY KEY,
entity_type TEXT NOT NULL,
name TEXT NOT NULL,
description TEXT DEFAULT '',
metadata TEXT DEFAULT '{}',
importance REAL DEFAULT 0.0,
created_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
updated_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now'))
)"""
)
conn.execute(
"""CREATE TABLE IF NOT EXISTS kg_relations (
id TEXT PRIMARY KEY,
source_id TEXT NOT NULL,
target_id TEXT NOT NULL,
relation_type TEXT NOT NULL,
properties TEXT DEFAULT '{}',
confidence REAL DEFAULT 0.5,
created_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now'))
)"""
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Mirror the production KG constraints in this fixture.

The real schema in src/brainlayer/vector_store.py adds UNIQUE(entity_type, name) on kg_entities and UNIQUE(source_id, target_id, relation_type) on kg_relations, but this fixture omits both. That makes the new idempotency/upsert tests run against a schema that behaves differently from production.

Suggested fix
     conn.execute(
         """CREATE TABLE IF NOT EXISTS kg_entities (
             id TEXT PRIMARY KEY,
             entity_type TEXT NOT NULL,
             name TEXT NOT NULL,
             description TEXT DEFAULT '',
             metadata TEXT DEFAULT '{}',
             importance REAL DEFAULT 0.0,
             created_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
-            updated_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now'))
+            updated_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
+            UNIQUE(entity_type, name)
         )"""
     )
     conn.execute(
         """CREATE TABLE IF NOT EXISTS kg_relations (
             id TEXT PRIMARY KEY,
             source_id TEXT NOT NULL,
             target_id TEXT NOT NULL,
             relation_type TEXT NOT NULL,
             properties TEXT DEFAULT '{}',
-            confidence REAL DEFAULT 0.5,
-            created_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now'))
+            confidence REAL DEFAULT 1.0,
+            created_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
+            UNIQUE(source_id, target_id, relation_type)
         )"""
     )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
conn.execute(
"""CREATE TABLE IF NOT EXISTS kg_entities (
id TEXT PRIMARY KEY,
entity_type TEXT NOT NULL,
name TEXT NOT NULL,
description TEXT DEFAULT '',
metadata TEXT DEFAULT '{}',
importance REAL DEFAULT 0.0,
created_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
updated_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now'))
)"""
)
conn.execute(
"""CREATE TABLE IF NOT EXISTS kg_relations (
id TEXT PRIMARY KEY,
source_id TEXT NOT NULL,
target_id TEXT NOT NULL,
relation_type TEXT NOT NULL,
properties TEXT DEFAULT '{}',
confidence REAL DEFAULT 0.5,
created_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now'))
)"""
)
conn.execute(
"""CREATE TABLE IF NOT EXISTS kg_entities (
id TEXT PRIMARY KEY,
entity_type TEXT NOT NULL,
name TEXT NOT NULL,
description TEXT DEFAULT '',
metadata TEXT DEFAULT '{}',
importance REAL DEFAULT 0.0,
created_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
updated_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
UNIQUE(entity_type, name)
)"""
)
conn.execute(
"""CREATE TABLE IF NOT EXISTS kg_relations (
id TEXT PRIMARY KEY,
source_id TEXT NOT NULL,
target_id TEXT NOT NULL,
relation_type TEXT NOT NULL,
properties TEXT DEFAULT '{}',
confidence REAL DEFAULT 1.0,
created_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
UNIQUE(source_id, target_id, relation_type)
)"""
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_code_intelligence.py` around lines 89 - 111, The test fixture
creates kg_entities and kg_relations without the UNIQUE constraints used in
production, causing different behavior for idempotency/upsert tests; update the
CREATE TABLE statements for kg_entities to include UNIQUE(entity_type, name) and
for kg_relations to include UNIQUE(source_id, target_id, relation_type) so the
test schema matches the production constraints used by the vector store code
(kg_entities, kg_relations).

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (5)
tests/test_code_intelligence.py (1)

89-111: ⚠️ Potential issue | 🟠 Major

Mirror the canonical KG schema in this fixture.

This test DB still differs from src/brainlayer/vector_store.py:456-491: it adds description/importance, keeps confidence at 0.5, and omits the production UNIQUE(...) constraints. That lets these tests pass against a schema production never boots and masks the kg_entities column mismatch in src/brainlayer/pipeline/code_intelligence.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_code_intelligence.py` around lines 89 - 111, The test fixture's
CREATE TABLE statements for kg_entities and kg_relations diverge from the
canonical schema in src/brainlayer/vector_store.py (lines ~456-491): remove the
extra description and importance columns from kg_entities, align default values
(e.g., set confidence to the production default), and add the same UNIQUE(...)
constraints present in the canonical schema for both kg_entities and
kg_relations so the test DB matches production; update the CREATE TABLE blocks
named kg_entities and kg_relations in tests/test_code_intelligence.py to exactly
mirror the column names, defaults, and UNIQUE constraints from
src/brainlayer/vector_store.py.
src/brainlayer/pipeline/code_intelligence.py (4)

57-65: ⚠️ Potential issue | 🟠 Major

Don’t let one bad manifest abort the whole scan.

Line 65 and Line 94 still let parse errors bubble out, so a single half-written pyproject.toml or package.json stops discovery for every other repo under the base directory. Catch load/decode failures, log the path, and return None so the scan can continue.

🐛 Suggested hardening
-def _extract_pyproject(repo_path: Path, path: Path) -> dict[str, Any]:
+def _extract_pyproject(repo_path: Path, path: Path) -> dict[str, Any] | None:
     """Extract metadata from pyproject.toml."""
     try:
         import tomllib
     except ImportError:
         import tomli as tomllib  # type: ignore[no-redef]
 
-    with open(path, "rb") as f:
-        data = tomllib.load(f)
+    try:
+        with path.open("rb") as f:
+            data = tomllib.load(f)
+    except Exception as exc:
+        logger.warning("Skipping unreadable manifest %s: %s", path, exc)
+        return None
@@
-def _extract_package_json(repo_path: Path, path: Path) -> dict[str, Any]:
+def _extract_package_json(repo_path: Path, path: Path) -> dict[str, Any] | None:
     """Extract metadata from package.json."""
-    with open(path) as f:
-        data = json.load(f)
+    try:
+        with path.open(encoding="utf-8") as f:
+            data = json.load(f)
+    except Exception as exc:
+        logger.warning("Skipping unreadable manifest %s: %s", path, exc)
+        return None

Also applies to: 91-94

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/pipeline/code_intelligence.py` around lines 57 - 65, The
_extract_pyproject function (and likewise the package.json parsing code around
_extract_package_json or the block at lines 91-94) currently lets tomllib/tomli
load errors propagate and abort the whole scan; wrap the tomllib.load (and the
JSON load for package.json) in a try/except that catches parsing/decoding
exceptions (e.g., tomllib.TOMLDecodeError, tomli.TOMLDecodeError,
json.JSONDecodeError, and generic Exception for IO issues), log a warning
including the manifest path and repo_path using the existing logger, and return
None on failure so discovery continues for other repos. Ensure the function
signature and callers tolerate a None return and only treat valid dicts as
successful parses.

360-377: 🛠️ Refactor suggestion | 🟠 Major

Keep the src/brainlayer/ CLI surface on Typer.

The code-intel Typer command already owns --base-dir and --dry-run; this second argparse entrypoint duplicates parsing/output and will drift. If module execution must stay supported, make __main__ a thin wrapper into the Typer command or a shared helper instead of another parser.

Run this to inspect both entrypoints side by side:

#!/bin/bash
set -euo pipefail

echo "== Existing Typer command =="
sed -n '1937,1957p' src/brainlayer/cli/__init__.py

echo
echo "== Duplicate argparse entrypoint =="
sed -n '360,377p' src/brainlayer/pipeline/code_intelligence.py

As per coding guidelines "Use Python/Typer CLI architecture for the main package in src/brainlayer/".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/pipeline/code_intelligence.py` around lines 360 - 377, The
__main__ argparse block duplicates the existing Typer "code-intel" CLI; remove
the argparse parser and replace the if __name__ == "__main__": section with a
thin wrapper that delegates to the existing Typer command (do not re-implement
argument parsing). Import and invoke the Typer command (the CLI entry that
exposes "code-intel" in src.brainlayer.cli) or call the shared helper used by
that command (the enrich_projects(...) function) with the parsed values,
preserving the same printed summary format; reference the enrich_projects
function and the Typer "code-intel" command when making the change so the module
execution reuses the canonical CLI parsing instead of duplicating it.

320-323: ⚠️ Potential issue | 🟠 Major

Constrain dependency lookups to library entities.

Line 321 matches on name alone. Because kg_entities is keyed by (entity_type, name), a dependency like fastapi can resolve to an unrelated project/entity row and the depends_on edge will point at the wrong node.

🎯 Suggested fix
     target_row = conn.execute(
-        "SELECT id FROM kg_entities WHERE LOWER(name) = LOWER(?)",
+        "SELECT id FROM kg_entities WHERE LOWER(name) = LOWER(?) AND entity_type = 'library'",
         (dep_name,),
     ).fetchone()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/pipeline/code_intelligence.py` around lines 320 - 323, The
dependency lookup in code_intelligence.py uses a SELECT by name only (see the
conn.execute call that sets target_row) which can match non-library entities;
modify the query used in the function that builds dependency edges to also
constrain entity_type = 'library' (i.e. SELECT id FROM kg_entities WHERE
entity_type = ? AND LOWER(name) = LOWER(?) and pass 'library' as the first
parameter) so target_row resolves only to library nodes before creating the
depends_on edge.

184-200: ⚠️ Potential issue | 🔴 Critical

Use the shared KG bootstrap/writer before these inserts.

The canonical KG bootstrap in src/brainlayer/vector_store.py:456-470 does not create kg_entities.description or kg_entities.importance, so these write paths can raise sqlite3.OperationalError on a freshly initialized DB. This direct sqlite3.connect() path also skips the repo’s required SQLITE_BUSY retry and writer-coordination flow, so it can collide with other BrainLayer writers on an active database.

Run this to verify whether any migration adds those columns and whether a shared retry/bootstrap helper already exists:

#!/bin/bash
set -euo pipefail

echo "== KG schema / migrations =="
rg -n -C2 'CREATE TABLE IF NOT EXISTS kg_entities|ALTER TABLE kg_entities ADD COLUMN description|ALTER TABLE kg_entities ADD COLUMN importance|CREATE TABLE IF NOT EXISTS kg_relations' src/brainlayer

echo
echo "== SQLite retry / coordination helpers =="
rg -n -C2 'SQLITE_BUSY|wal_checkpoint|sqlite3\.connect|journal_mode = WAL' src/brainlayer

As per coding guidelines "Implement retry logic on SQLITE_BUSY errors; each worker must use its own database connection" and "Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first".

Also applies to: 241-255, 329-353

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/pipeline/code_intelligence.py` around lines 184 - 200, Replace
the direct sqlite3.connect/PRAGMA usage and uncoordinated writes with the repo's
shared KG bootstrap/writer (the canonical bootstrap/writer used by the
vector_store module) so migrations that add kg_entities.description/importance
are applied and SQLITE_BUSY retry/coordination is respected; obtain the
connection/writer from that shared helper and use it in place of conn for calls
to _upsert_project, ensure the helper's retry logic handles SQLITE_BUSY, ensure
you stop/enqueue workers and run a WAL checkpoint via the shared writer before
performing bulk inserts, and perform the commit/flush through the writer API
(only when not dry_run) instead of calling conn.commit().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/brainlayer/pipeline/code_intelligence.py`:
- Around line 325-333: The code in the block that creates library entities
(checks target_row, builds target_id, inserts to kg_entities and increments
stats["dep_entities_created"]) overcounts when dry_run=True because missing
libraries never get inserted and subsequent occurrences are treated as new; fix
by tracking planned library names (e.g., a module-level or function-local set
like planned_libs) and change the logic in the branch for missing target_row to
first check if dep_name is already in planned_libs: if not, add dep_name to
planned_libs and increment stats["dep_entities_created"]; only perform the
actual DB INSERT when not dry_run (as now). This ensures each missing library is
counted once whether or not dry_run is set.

---

Duplicate comments:
In `@src/brainlayer/pipeline/code_intelligence.py`:
- Around line 57-65: The _extract_pyproject function (and likewise the
package.json parsing code around _extract_package_json or the block at lines
91-94) currently lets tomllib/tomli load errors propagate and abort the whole
scan; wrap the tomllib.load (and the JSON load for package.json) in a try/except
that catches parsing/decoding exceptions (e.g., tomllib.TOMLDecodeError,
tomli.TOMLDecodeError, json.JSONDecodeError, and generic Exception for IO
issues), log a warning including the manifest path and repo_path using the
existing logger, and return None on failure so discovery continues for other
repos. Ensure the function signature and callers tolerate a None return and only
treat valid dicts as successful parses.
- Around line 360-377: The __main__ argparse block duplicates the existing Typer
"code-intel" CLI; remove the argparse parser and replace the if __name__ ==
"__main__": section with a thin wrapper that delegates to the existing Typer
command (do not re-implement argument parsing). Import and invoke the Typer
command (the CLI entry that exposes "code-intel" in src.brainlayer.cli) or call
the shared helper used by that command (the enrich_projects(...) function) with
the parsed values, preserving the same printed summary format; reference the
enrich_projects function and the Typer "code-intel" command when making the
change so the module execution reuses the canonical CLI parsing instead of
duplicating it.
- Around line 320-323: The dependency lookup in code_intelligence.py uses a
SELECT by name only (see the conn.execute call that sets target_row) which can
match non-library entities; modify the query used in the function that builds
dependency edges to also constrain entity_type = 'library' (i.e. SELECT id FROM
kg_entities WHERE entity_type = ? AND LOWER(name) = LOWER(?) and pass 'library'
as the first parameter) so target_row resolves only to library nodes before
creating the depends_on edge.
- Around line 184-200: Replace the direct sqlite3.connect/PRAGMA usage and
uncoordinated writes with the repo's shared KG bootstrap/writer (the canonical
bootstrap/writer used by the vector_store module) so migrations that add
kg_entities.description/importance are applied and SQLITE_BUSY
retry/coordination is respected; obtain the connection/writer from that shared
helper and use it in place of conn for calls to _upsert_project, ensure the
helper's retry logic handles SQLITE_BUSY, ensure you stop/enqueue workers and
run a WAL checkpoint via the shared writer before performing bulk inserts, and
perform the commit/flush through the writer API (only when not dry_run) instead
of calling conn.commit().

In `@tests/test_code_intelligence.py`:
- Around line 89-111: The test fixture's CREATE TABLE statements for kg_entities
and kg_relations diverge from the canonical schema in
src/brainlayer/vector_store.py (lines ~456-491): remove the extra description
and importance columns from kg_entities, align default values (e.g., set
confidence to the production default), and add the same UNIQUE(...) constraints
present in the canonical schema for both kg_entities and kg_relations so the
test DB matches production; update the CREATE TABLE blocks named kg_entities and
kg_relations in tests/test_code_intelligence.py to exactly mirror the column
names, defaults, and UNIQUE constraints from src/brainlayer/vector_store.py.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8f12c60e-60bb-4bfa-9418-0a5e93d5a03b

📥 Commits

Reviewing files that changed from the base of the PR and between d04b6d8 and e76d3e0.

📒 Files selected for processing (2)
  • src/brainlayer/pipeline/code_intelligence.py
  • tests/test_code_intelligence.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

Files:

  • tests/test_code_intelligence.py
  • src/brainlayer/pipeline/code_intelligence.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use pytest for testing

Files:

  • tests/test_code_intelligence.py
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Use Python/Typer CLI architecture for the main package in src/brainlayer/
All scripts and CLI must use paths.py:get_db_path() for resolving database path instead of hardcoding
Implement chunk lifecycle management with columns superseded_by, aggregated_into, archived_at; default search excludes lifecycle-managed chunks
Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first
Drop FTS triggers before bulk deletes on chunks table and recreate after; batch deletes in 5-10K chunks with checkpoint every 3 batches
Implement retry logic on SQLITE_BUSY errors; each worker must use its own database connection
Use ruff check src/ && ruff format src/ for linting and formatting

Files:

  • src/brainlayer/pipeline/code_intelligence.py
🧠 Learnings (11)
📚 Learning: 2026-03-14T02:20:54.656Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Applies to **/*.py : Run pytest before claiming behavior changed safely; current test suite has 929 tests

Applied to files:

  • tests/test_code_intelligence.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first

Applied to files:

  • src/brainlayer/pipeline/code_intelligence.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/*bulk*.py : Before bulk database operations: stop enrichment workers, checkpoint WAL with `PRAGMA wal_checkpoint(FULL)`, drop FTS triggers before bulk deletes

Applied to files:

  • src/brainlayer/pipeline/code_intelligence.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : Implement retry logic on `SQLITE_BUSY` errors; each worker must use its own database connection

Applied to files:

  • src/brainlayer/pipeline/code_intelligence.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/**/*.py : All database connections must retry on `SQLITE_BUSY`; each worker uses its own connection

Applied to files:

  • src/brainlayer/pipeline/code_intelligence.py
📚 Learning: 2026-03-14T02:20:54.656Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Be aware of known BrainLayer issues: DB locking during enrichment and WAL growth up to 4.7GB

Applied to files:

  • src/brainlayer/pipeline/code_intelligence.py
📚 Learning: 2026-03-14T02:20:54.656Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Applies to **/*.py : Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior

Applied to files:

  • src/brainlayer/pipeline/code_intelligence.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Never run bulk database operations while enrichment is writing; this causes WAL bloat and potential freeze

Applied to files:

  • src/brainlayer/pipeline/code_intelligence.py
📚 Learning: 2026-03-18T00:12:15.607Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 87
File: brain-bar/Sources/BrainBar/BrainBarServer.swift:118-129
Timestamp: 2026-03-18T00:12:15.607Z
Learning: In `brain-bar/Sources/BrainBar/MCPRouter.swift` (Swift, BrainBar daemon), the socket-before-DB startup pattern means the Unix socket binds immediately (~1ms) while the database may take several seconds to open on cold start (8GB file). Any tool handler that accesses `database` MUST throw an explicit error (e.g., `ToolError.noDatabase`) when `database` is nil — never return empty or default results (e.g., `guard let db else { return "[]" }` is forbidden). The false-success pattern hides startup timing issues from MCP clients. Flag any `guard let db = database else { return ... }` patterns that silently return defaults instead of throwing.

Applied to files:

  • src/brainlayer/pipeline/code_intelligence.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : Use Python/Typer CLI architecture for the main package in `src/brainlayer/`

Applied to files:

  • src/brainlayer/pipeline/code_intelligence.py
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/cli.py : Use Typer CLI framework for command-line interface in `src/brainlayer/`

Applied to files:

  • src/brainlayer/pipeline/code_intelligence.py
🔇 Additional comments (2)
tests/test_code_intelligence.py (1)

173-199: Good regression coverage for the last two fixes.

These cases pin both the framework-precedence bug and the ~= dependency parsing regression.

src/brainlayer/pipeline/code_intelligence.py (1)

133-153: Nice specificity ordering here.

Checking Expo / React Native / Next.js before generic React preserves the intended classification.

Comment on lines +325 to +333
if not target_row:
target_id = f"lib-{uuid.uuid4().hex[:12]}"
if not dry_run:
conn.execute(
"""INSERT INTO kg_entities (id, entity_type, name, importance, created_at, updated_at)
VALUES (?, 'library', ?, 3.0, strftime('%Y-%m-%dT%H:%M:%fZ','now'), strftime('%Y-%m-%dT%H:%M:%fZ','now'))""",
(target_id, dep_name),
)
stats["dep_entities_created"] += 1
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

dry_run overcounts shared library creations.

When dry_run=True, the first missing library is never inserted, so the second project that depends on the same missing package hits this branch again and increments dep_entities_created a second time. Track planned library names in memory or precompute unique notable dependencies before the loop.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/pipeline/code_intelligence.py` around lines 325 - 333, The
code in the block that creates library entities (checks target_row, builds
target_id, inserts to kg_entities and increments stats["dep_entities_created"])
overcounts when dry_run=True because missing libraries never get inserted and
subsequent occurrences are treated as new; fix by tracking planned library names
(e.g., a module-level or function-local set like planned_libs) and change the
logic in the branch for missing target_row to first check if dep_name is already
in planned_libs: if not, add dep_name to planned_libs and increment
stats["dep_entities_created"]; only perform the actual DB INSERT when not
dry_run (as now). This ensures each missing library is counted once whether or
not dry_run is set.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/brainlayer/cli/__init__.py`:
- Line 1947: enrich_projects() in src/brainlayer/pipeline/code_intelligence.py
opens sqlite3 directly and sets PRAGMA journal_mode = WAL but lacks PRAGMA
busy_timeout and retry-on-SQLITE_BUSY semantics; update the DB open and write
paths in enrich_projects() (and the helper that calls sqlite3.connect()) to
execute PRAGMA busy_timeout = <ms> after opening the connection and wrap write
transactions in a retry loop that catches sqlite3.OperationalError /
SQLITE_BUSY, sleeps/backoffs and retries a few times (matching patterns used in
clustering.py and vector_store.py) so concurrent MCP/enrichment work won't fail
on lock contention. Ensure the same connection/config is used for all writes in
enrich_projects() so the busy_timeout applies.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 956e09b0-1a09-4594-8e10-f2811ccfd66e

📥 Commits

Reviewing files that changed from the base of the PR and between e76d3e0 and 29611c0.

📒 Files selected for processing (1)
  • src/brainlayer/cli/__init__.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: test (3.13)
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.11)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

Files:

  • src/brainlayer/cli/__init__.py
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Use Python/Typer CLI architecture for the main package in src/brainlayer/
All scripts and CLI must use paths.py:get_db_path() for resolving database path instead of hardcoding
Implement chunk lifecycle management with columns superseded_by, aggregated_into, archived_at; default search excludes lifecycle-managed chunks
Never run bulk database operations while enrichment workers are writing; always stop workers and checkpoint WAL first
Drop FTS triggers before bulk deletes on chunks table and recreate after; batch deletes in 5-10K chunks with checkpoint every 3 batches
Implement retry logic on SQLITE_BUSY errors; each worker must use its own database connection
Use ruff check src/ && ruff format src/ for linting and formatting

Files:

  • src/brainlayer/cli/__init__.py
🧠 Learnings (3)
📚 Learning: 2026-03-29T23:19:50.743Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:50.743Z
Learning: Applies to src/brainlayer/cli.py : Use Typer CLI framework for command-line interface in `src/brainlayer/`

Applied to files:

  • src/brainlayer/cli/__init__.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : Use Python/Typer CLI architecture for the main package in `src/brainlayer/`

Applied to files:

  • src/brainlayer/cli/__init__.py
📚 Learning: 2026-03-29T23:19:51.321Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T23:19:51.321Z
Learning: Applies to src/brainlayer/**/*.py : Use `ruff check src/ && ruff format src/` for linting and formatting

Applied to files:

  • src/brainlayer/cli/__init__.py

db_path = get_db_path()
rprint(f"[bold blue]זיכרון[/] - Code intelligence scan: [italic]{base_dir}[/]")

result = enrich_projects(db_path=str(db_path), base_dir=base_dir, dry_run=dry_run)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

❓ Verification inconclusive

Script executed:

#!/bin/bash
set -euo pipefail

echo "1) Confirm call site in CLI"
rg -n -C2 '\bdef code_intel\s*\(' src/brainlayer/cli/__init__.py
rg -n -C2 '\benrich_projects\s*\(' src/brainlayer/cli/__init__.py

echo "2) Inspect enrichment implementation for write-serialization / busy handling"
rg -n -C4 '\bdef enrich_projects\s*\(' src/brainlayer/pipeline/code_intelligence.py
rg -n -C3 'sqlite3|connect\(|busy_timeout|SQLITE_BUSY|retry|lock|BEGIN|commit|rollback|checkpoint' src/brainlayer/pipeline/code_intelligence.py

echo "3) Locate existing shared one-writer guards for reuse"
rg -n -C2 'write.*lock|exclusive.*lock|one-write|filelock|portalocker|SQLITE_BUSY|busy_timeout' src/brainlayer --type=py

Repository: EtanHey/brainlayer


Repository: EtanHey/brainlayer
Exit code: 0

stdout:

1) Confirm call site in CLI
1935-
1936-@app.command("code-intel")
1937:def code_intel(
1938-    base_dir: Path = typer.Option(Path.home() / "Gits", "--base-dir", "-d", help="Directory containing repos to scan"),
1939-    dry_run: bool = typer.Option(False, "--dry-run", help="Show changes without applying"),
1945-    rprint(f"[bold blue]זיכרון[/] - Code intelligence scan: [italic]{base_dir}[/]")
1946-
1947:    result = enrich_projects(db_path=str(db_path), base_dir=base_dir, dry_run=dry_run)
1948-
1949-    prefix = "[dim][DRY RUN][/] " if dry_run else ""
2) Inspect enrichment implementation for write-serialization / busy handling
152-            return name
153-    return None
154-
155-
156:def enrich_projects(
157-    db_path: str | None = None,
158-    base_dir: Path | None = None,
159-    dry_run: bool = False,
160-) -> dict[str, Any]:
11-import json
12-import logging
13-import re
14:import sqlite3
15-import uuid
16-from pathlib import Path
17-from typing import Any
--
103-
104-    # Detect package manager
105-    pkg_manager = "npm"
106:    if (repo_path / "bun.lockb").exists() or (repo_path / "bun.lock").exists():
107-        pkg_manager = "bun"
108:    elif (repo_path / "pnpm-lock.yaml").exists():
109-        pkg_manager = "pnpm"
110:    elif (repo_path / "yarn.lock").exists():
111-        pkg_manager = "yarn"
112-
113-    # Detect framework from dependencies
--
181-            "dep_entities_created": 0,
182-        }
183-
184:    conn = sqlite3.connect(db_path)
185-    conn.execute("PRAGMA journal_mode = WAL")
186-
187-    stats = {
--
197-            _upsert_project(conn, project, stats, dry_run)
198-
199-        if not dry_run:
200:            conn.commit()
201-    finally:
202-        conn.close()
203-
--
205-
206-
207-def _upsert_project(
208:    conn: sqlite3.Connection,
209-    project: dict[str, Any],
210-    stats: dict[str, int],
211-    dry_run: bool,
--
264-
265-
266-def _add_dependency_relation(
267:    conn: sqlite3.Connection,
268-    source_id: str,
269-    source_name: str,
270-    dep_name: str,
3) Locate existing shared one-writer guards for reuse
src/brainlayer/vector_store.py-37-
src/brainlayer/vector_store.py-38-
src/brainlayer/vector_store.py:39:def _set_busy_timeout_hook(conn: apsw.Connection) -> None:
src/brainlayer/vector_store.py:40:    """Set busy_timeout on every new connection before any other hooks.
src/brainlayer/vector_store.py-41-
src/brainlayer/vector_store.py-42-    APSW bestpractice hooks (connection_optimize) run PRAGMA optimize inside
src/brainlayer/vector_store.py:43:    the Connection() constructor. Without busy_timeout set first, this PRAGMA
src/brainlayer/vector_store.py-44-    fails with BusyError when other processes hold the DB lock.
src/brainlayer/vector_store.py-45-    """
--
src/brainlayer/vector_store.py-47-
src/brainlayer/vector_store.py-48-
src/brainlayer/vector_store.py:49:# Register busy_timeout hook BEFORE bestpractice hooks so it fires first.
src/brainlayer/vector_store.py-50-# bestpractice.apply() adds hooks that run PRAGMA optimize inside Connection(),
src/brainlayer/vector_store.py:51:# which needs busy_timeout active or it crashes under contention.
src/brainlayer/vector_store.py:52:apsw.connection_hooks.insert(0, _set_busy_timeout_hook)
src/brainlayer/vector_store.py-53-apsw.bestpractice.apply(apsw.bestpractice.recommended)
src/brainlayer/vector_store.py-54-
--
src/brainlayer/vector_store.py-74-
src/brainlayer/vector_store.py-75-        Multiple BrainLayer processes (MCP instances, daemon, enrichment) may
src/brainlayer/vector_store.py:76:        contend for write locks during DDL. Retry with exponential backoff
src/brainlayer/vector_store.py-77-        instead of crashing on the first BusyError.
src/brainlayer/vector_store.py-78-        """
--
src/brainlayer/pipeline/enrichment.py-326-PRIMARY_SYMBOLS: Extract class names, function names, file paths, and variable names that are central to this chunk. Empty array if none.
src/brainlayer/pipeline/enrichment.py-327-
src/brainlayer/pipeline/enrichment.py:328:RESOLVED_QUERY: Write a natural question that someone would ask to find this chunk. E.g., "How do I fix EADDRINUSE errors in Bun?" or "What's the SQLite busy_timeout fix for concurrent access?"
src/brainlayer/pipeline/enrichment.py-329-
src/brainlayer/pipeline/enrichment.py-330-EPISTEMIC_LEVEL:
--
src/brainlayer/mcp/search_handler.py-642-
src/brainlayer/mcp/search_handler.py-643-        # Retry hybrid_search on BusyError — WAL reads shouldn't block but
src/brainlayer/mcp/search_handler.py:644:        # they can during checkpoint or when enrichment holds exclusive lock.
src/brainlayer/mcp/search_handler.py-645-        results = None
src/brainlayer/mcp/search_handler.py-646-        for attempt in range(_RETRY_MAX_ATTEMPTS):
--
src/brainlayer/clustering.py-417-    """
src/brainlayer/clustering.py-418-    cursor = conn.cursor()
src/brainlayer/clustering.py:419:    cursor.execute("PRAGMA busy_timeout = 5000")
src/brainlayer/clustering.py-420-    now = datetime.now(timezone.utc).isoformat()
src/brainlayer/clustering.py-421-
--
src/brainlayer/clustering.py-486-
src/brainlayer/clustering.py-487-    cursor = conn.cursor()
src/brainlayer/clustering.py:488:    cursor.execute("PRAGMA busy_timeout = 5000")
src/brainlayer/clustering.py-489-
src/brainlayer/clustering.py-490-    # Build a mapping from global index to chunk content (sample for efficiency)
--
src/brainlayer/hooks/indexer.py-40-            self._db = sqlite3.connect(self.db_path)
src/brainlayer/hooks/indexer.py-41-            self._db.execute("PRAGMA journal_mode = WAL")
src/brainlayer/hooks/indexer.py:42:            self._db.execute("PRAGMA busy_timeout = 5000")
src/brainlayer/hooks/indexer.py-43-            self._db.execute("PRAGMA synchronous = NORMAL")
src/brainlayer/hooks/indexer.py-44-            self._create_schema()

Add write-serialization protection to enrich_projects() before release.

The call to enrich_projects() at line 1947 invokes a write operation on the database without busy_timeout or retry logic. The function creates a direct sqlite3.connect() at line 184 with only PRAGMA journal_mode = WAL but omits PRAGMA busy_timeout and retry handling on SQLITE_BUSY errors. If enrichment or other MCP work runs concurrently, this will fail on lock contention instead of retrying. Other modules in the codebase already implement this pattern (see clustering.py PRAGMA busy_timeout, vector_store.py busy_timeout hook, search_handler.py BusyError retry). Add the same protection to enrich_projects() in src/brainlayer/pipeline/code_intelligence.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/brainlayer/cli/__init__.py` at line 1947, enrich_projects() in
src/brainlayer/pipeline/code_intelligence.py opens sqlite3 directly and sets
PRAGMA journal_mode = WAL but lacks PRAGMA busy_timeout and retry-on-SQLITE_BUSY
semantics; update the DB open and write paths in enrich_projects() (and the
helper that calls sqlite3.connect()) to execute PRAGMA busy_timeout = <ms> after
opening the connection and wrap write transactions in a retry loop that catches
sqlite3.OperationalError / SQLITE_BUSY, sleeps/backoffs and retries a few times
(matching patterns used in clustering.py and vector_store.py) so concurrent
MCP/enrichment work won't fail on lock contention. Ensure the same
connection/config is used for all writes in enrich_projects() so the
busy_timeout applies.

@EtanHey EtanHey merged commit 93066d3 into main Apr 3, 2026
6 checks passed
@EtanHey EtanHey deleted the feat/code-intelligence-r68-round4 branch April 3, 2026 12:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant