
feat: faceted tag enrichment script (Gemini 2.5 Flash)#96

Merged

EtanHey merged 2 commits into main from feat/enrich-recent-faceted-tags on Mar 19, 2026

Conversation


@EtanHey EtanHey commented Mar 19, 2026

Summary

  • New scripts/enrich_recent.py — enriches chunks with faceted tags via Gemini 2.5 Flash
  • Uses enrichment prompt v2: topic tags, activity, domain, confidence
  • Merges new faceted tags with existing tags (no data loss)
  • Commits every 50 chunks, rate-limited at 0.3s/req
  • First run: 200 chunks enriched, 0 errors, avg confidence 0.95

Sample output

brainbar-f791af84  topics=['brainlayer-search-quality', 'importance-calibration'], act:implementing, ['dom:sql'], conf=0.98
brainbar-e8677cf3  topics=['multi-agent-coordination', 'agent-message-architecture'], act:designing, ['dom:sql', 'dom:mcp'], conf=0.98
manual-54e90a47    topics=['brainlayer-search-quality', 'sprint-planning-methodology'], act:planning, ['dom:git', 'dom:cli'], conf=0.95
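The "merge with existing tags (no data loss)" step above can be sketched as follows. This is an illustrative reconstruction, not the script's actual code: the field names follow the enrichment prompt v2 schema shown in the sample output, and the JSON-encoded tags column is an assumption.

```python
import json

def merge_tags(old_tags_json, topics, activity, domains):
    """Union old tags with new faceted tags, preserving every existing entry."""
    old_tags = json.loads(old_tags_json) if old_tags_json else []
    # Assemble the new faceted tags: topics, then activity, then domains.
    new_tags = list(topics)
    if activity:
        new_tags.append(activity)
    new_tags.extend(domains)
    # Keep all existing tags; append only tags not already present.
    merged = list(old_tags)
    for tag in new_tags:
        if tag not in merged:
            merged.append(tag)
    return json.dumps(merged)
```

Old tags always come first, so a re-run of the script cannot drop manually applied tags; it can only append.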

Test plan

  • 200-chunk live run: 0 errors, 100% valid JSON
  • Tags correctly merged with existing (old tags preserved)
  • Verify enriched chunks surface better in brain_search

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Introduced automated enrichment functionality that processes recent data chunks lacking specific tags. Each chunk receives AI-generated enhancements including activity classifications, domain categorizations, and confidence scores. All updates are automatically persisted to your local database.

Gemini 2.5 Flash enrichment with faceted tag schema (topic, activity,
domain, confidence). Merges new tags with existing, commits every 50.
First run: 200 chunks enriched, 0 errors, avg confidence 0.95.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@greptile-apps greptile-apps Bot left a comment


Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@coderabbitai

coderabbitai Bot commented Mar 19, 2026

Warning

Rate limit exceeded

@EtanHey has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 5 minutes and 1 second before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6755d54b-4766-43ea-b697-ee5503e10ef0

📥 Commits

Reviewing files that changed from the base of the PR and between 3d80539 and 719761a.

📒 Files selected for processing (2)
  • scripts/enrich_recent.py
  • scripts/enrichment_pilot.py
📝 Walkthrough

Walkthrough

A new standalone script, scripts/enrich_recent.py, has been added that queries a local SQLite database for recent chunks lacking domain tags, calls the Gemini API to generate enriched metadata tags, parses the responses, and updates the database with extracted topic, activity, domain, and confidence values alongside existing tags.

Changes

Cohort / File(s): New Data Enrichment Script — scripts/enrich_recent.py
Summary: New standalone script that retrieves recent chunks from SQLite, invokes Gemini for tag enrichment, parses JSON responses with a fallback for list-wrapped results, merges new faceted tags with existing tags, and updates the database. Includes request throttling, progress logging every 10 iterations, error tracking (first 3 errors logged), and batch checkpoint reporting.

Sequence Diagram

sequenceDiagram
    actor Script as Enrichment Script
    participant DB as SQLite Database
    participant API as Gemini API
    
    Script->>DB: Connect with busy_timeout=5000
    Script->>DB: SELECT recent chunks<br/>(last 7 days, no dom: tags)
    DB-->>Script: Return up to MAX_CHUNKS rows
    
    loop For each chunk
        Script->>API: POST prompt with chunk content<br/>(gemini-2.5-flash)
        API-->>Script: Return JSON response
        Script->>Script: Parse JSON<br/>(handle list wrapping)
        Script->>Script: Extract b_topics, c_activity,<br/>d_domain, e_confidence
        Script->>Script: Merge new tags with<br/>existing tags from old_tags
        Script->>DB: UPDATE chunks<br/>SET tags=?, tag_confidence=?<br/>WHERE id=?
        DB-->>Script: Row updated
        Script->>Script: Throttle with sleep(0.3)
        Script->>Script: Log progress every 10 iterations
    end
    
    Script->>Script: Print sample summary<br/>(up to 10 processed chunks)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A bunny's delight in the database night,
With tags and domains now shining so bright!
Gemini whispers what each chunk should know,
As SQLite blossoms in enrichment's glow,
Ten hops at a time, we chronicle the way!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title 'feat: faceted tag enrichment script (Gemini 2.5 Flash)' clearly and concisely describes the main change: adding a script for enriching chunks with faceted tags using Gemini 2.5 Flash. It is specific, directly related to the core addition of scripts/enrich_recent.py, and aligns with the PR objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/enrich-recent-faceted-tags
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/enrich_recent.py`:
- Around line 137-139: The UPDATE statement in cursor.execute currently writes
to tag_confidence but the canonical chunks schema expects the enrichment
confidence in column n; update the SQL in the cursor.execute call (the statement
that uses tags_json, confidence, chunk_id) to set n instead of tag_confidence so
the confidence value is persisted into the canonical column (keep tags = ?, n =
? with the same bound variables tags_json, confidence, chunk_id).
- Around line 102-121: The parsed model payload fields need type and range
validation before building new_tags: after parsing `parsed` (the dict), validate
that `parsed.get("b_topics")` is a list of strings (otherwise treat as
malformed), `parsed.get("d_domain")` is a list of strings,
`parsed.get("c_activity")` is either an empty string or a string, and
`parsed.get("e_confidence")` is numeric (coerce to float and ensure it falls in
an expected range, e.g. 0.0–1.0); if any validation fails, increment `errors`
and continue instead of using the malformed values. Update the code around the
`topics`, `activity`, `domains`, `confidence`, and `new_tags` logic to perform
these checks/coercions and only merge/extend `new_tags` with validated
lists/strings.
- Line 12: The code currently falls back to a hardcoded Gemini/Google key via
the API_KEY assignment (os.environ.get("GOOGLE_API_KEY", "...")), which leaks
credentials; remove the embedded literal and change the behavior in the API_KEY
initialization so it fails closed: read API_KEY from environment only
(os.environ["GOOGLE_API_KEY"] or equivalent) and raise a clear exception or exit
if the variable is not set, and rotate/remove the exposed key from history; also
update the other occurrences referenced (lines ~62-63) that use the same
fallback to ensure no hardcoded secret remains.
- Around line 63-66: The DB update phase is susceptible to race conditions and
lacks proper transaction batching: wrap the update-only section that iterates
over rows/old_tags (the code between where rows and old_tags are read and where
cursor.execute/conn.execute runs the UPDATEs, i.e., the block using conn, cursor
and updating tags) with a process-level single-writer guard (e.g., a
multiprocessing.Lock or a file-based lock) to ensure only one process writes at
a time; change the update loop to use explicit transactions by issuing
conn.execute("BEGIN") before a batch, perform up to 50 UPDATEs inside that
transaction, then conn.execute("COMMIT") after each 50-row batch (and finally
commit any remainder), and remove the misleading checkpoint print so that
commits reflect actual persisted checkpoints. Ensure you keep using the existing
genai.Client, conn, cursor, rows and old_tags identifiers when locating and
modifying the code.
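The fail-closed key handling the third comment asks for can be sketched as below. The variable name GOOGLE_API_KEY comes from the review; the helper name and error message are illustrative, not the actual fix.

```python
import os
import sys

def require_api_key(var="GOOGLE_API_KEY"):
    """Read the key from the environment only; never fall back to a literal."""
    key = os.environ.get(var)
    if not key:
        # Fail closed with a clear message instead of using a hardcoded secret.
        sys.exit(f"error: {var} is not set; refusing to run without credentials")
    return key
```

Beyond the code change, the exposed key also needs to be rotated and removed from git history, since deleting the literal from HEAD does not revoke it.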
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 93cab848-9d40-45f1-a766-89a4cc64bb9e

📥 Commits

Reviewing files that changed from the base of the PR and between 4af55ff and 3d80539.

📒 Files selected for processing (1)
  • scripts/enrich_recent.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: test (3.13)
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.11)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

Files:

  • scripts/enrich_recent.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.809Z
Learning: Run enrichment worker script via `golems/scripts/enrichment-lazy.sh` with launchd scheduling, process priority nice=20, and batch size of 50 items
📚 Learning: 2026-03-12T14:22:54.809Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.809Z
Learning: Applies to src/brainlayer/**/*.py : For bulk database operations: stop enrichment workers first, checkpoint WAL before and after operations, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks with checkpoint every 3 batches, never delete from `chunks` while FTS trigger is active

Applied to files:

  • scripts/enrich_recent.py
📚 Learning: 2026-03-12T14:22:54.809Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.809Z
Learning: Applies to src/brainlayer/**/*.py : Use AST-aware chunking with tree-sitter; never split stack traces; mask large tool output during chunking

Applied to files:

  • scripts/enrich_recent.py
📚 Learning: 2026-03-12T14:22:54.809Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.809Z
Learning: Run enrichment worker script via `golems/scripts/enrichment-lazy.sh` with launchd scheduling, process priority nice=20, and batch size of 50 items

Applied to files:

  • scripts/enrich_recent.py
🪛 Gitleaks (8.30.0)
scripts/enrich_recent.py

[high] 12-12: Uncovered a GCP API key, which could lead to unauthorized access to Google Cloud services and data breaches.

(gcp-api-key)

🔇 Additional comments (1)
scripts/enrich_recent.py (1)

62-178: Please run the full pytest suite before merging this writer.

The validation described here is a live run only, but this script mutates the main chunks table. Please run the existing pytest suite before treating the change as safe.

As per coding guidelines, "Run pytest before claiming behavior changed safely; current test suite has 929 tests."

Comment thread scripts/enrich_recent.py Outdated
Comment thread scripts/enrich_recent.py
Comment on lines +63 to +66
client = genai.Client(api_key=API_KEY)
conn = apsw.Connection(DB_PATH)
conn.execute("PRAGMA busy_timeout = 5000")
cursor = conn.cursor()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Before raising the finding below, the bot verified it against the repository: locating scripts/enrich_recent.py via git ls-files, reading the full script and tests/test_recent_enrichment.py, searching src/ and scripts/ for existing lock usage (lock, Lock, Semaphore, _store_lock, flock, FileLock), and inspecting src/brainlayer/mcp/_shared.py.

Add process-level single-writer guard and implement explicit transaction batching with proper commits.

PRAGMA busy_timeout (line 65) only waits for lock contention—it does not prevent other writers. Since rows and old_tags are snapshots taken before the Gemini API call (lines 69–78), another writer can update tags between the snapshot and the UPDATE at line 138, causing stale data to overwrite newer values.

Line 166 prints a checkpoint message but executes no actual commit—APSW defaults to autocommit mode, so each UPDATE at line 138 is an independent write with no batching.

Fix: Wrap only the DB update phase (lines 137–140) in a process-level single-writer guard (e.g., multiprocessing.Lock or file-based lock). Implement proper explicit transactions: use conn.execute("BEGIN"), batch updates in 50-row chunks with conn.execute("COMMIT"), and remove the misleading checkpoint print.

Applies to lines 69–78, 123–140, and 164–166.


Comment thread scripts/enrich_recent.py
Comment on lines +102 to +121
text = response.text.strip()
parsed = json.loads(text)

# Handle array responses (take first)
if isinstance(parsed, list) and parsed:
    parsed = parsed[0]
if not isinstance(parsed, dict):
    errors += 1
    continue

# Build new tags: merge old + new faceted
topics = parsed.get("b_topics", [])
activity = parsed.get("c_activity", "")
domains = parsed.get("d_domain", [])
confidence = parsed.get("e_confidence", 0)

new_tags = list(topics)
if activity:
    new_tags.append(activity)
new_tags.extend(domains)

⚠️ Potential issue | 🟠 Major

Reject malformed model payloads before persisting them.

json.loads() only proves syntax. If the model returns a string for b_topics or d_domain, Lines 118 and 121 will turn it into per-character tags, and a non-numeric e_confidence will be written as-is. Validate the field types/range before building new_tags.

Suggested fix
             text = response.text.strip()
             parsed = json.loads(text)
@@
             topics = parsed.get("b_topics", [])
             activity = parsed.get("c_activity", "")
             domains = parsed.get("d_domain", [])
             confidence = parsed.get("e_confidence", 0)
+
+            if (
+                not isinstance(topics, list)
+                or not all(isinstance(t, str) for t in topics)
+                or not isinstance(activity, str)
+                or not isinstance(domains, list)
+                or not all(isinstance(d, str) for d in domains)
+                or not isinstance(confidence, (int, float))
+                or not 0.0 <= confidence <= 1.0
+            ):
+                errors += 1
+                if errors <= 3:
+                    print(f"  Invalid payload for {str(chunk_id)[:30]}: {parsed!r}")
+                continue

Comment thread scripts/enrich_recent.py
Comment on lines +137 to +139
cursor.execute(
    "UPDATE chunks SET tags = ?, tag_confidence = ? WHERE id = ?",
    (tags_json, confidence, chunk_id)

⚠️ Potential issue | 🔴 Critical

Write confidence to the actual chunks column.

Line 138 updates tag_confidence, but the canonical chunks schema stores enrichment confidence in n. On the canonical schema this statement will fail and the run will not persist its updates.

Suggested fix
-                "UPDATE chunks SET tags = ?, tag_confidence = ? WHERE id = ?",
+                "UPDATE chunks SET tags = ?, n = ? WHERE id = ?",

- Remove Gemini API key fallback from both scripts (require env var)
- Add type validation for parsed response fields (topics, domains, confidence)
- Addresses CodeRabbit critical finding on #96

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@EtanHey EtanHey merged commit ee1e357 into main Mar 19, 2026
6 checks passed
@EtanHey EtanHey deleted the feat/enrich-recent-faceted-tags branch March 19, 2026 10:52
