feat: faceted tag enrichment script (Gemini 2.5 Flash) #96
Conversation
Gemini 2.5 Flash enrichment with faceted tag schema (topic, activity, domain, confidence). Merges new tags with existing, commits every 50. First run: 200 chunks enriched, 0 errors, avg confidence 0.95. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 Walkthrough
A new standalone script, `scripts/enrich_recent.py`, enriches recent chunks with faceted tags via Gemini 2.5 Flash.
Changes
Sequence Diagram

```mermaid
sequenceDiagram
    actor Script as Enrichment Script
    participant DB as SQLite Database
    participant API as Gemini API
    Script->>DB: Connect with busy_timeout=5000
    Script->>DB: SELECT recent chunks<br/>(last 7 days, no dom: tags)
    DB-->>Script: Return up to MAX_CHUNKS rows
    loop For each chunk
        Script->>API: POST prompt with chunk content<br/>(gemini-2.5-flash)
        API-->>Script: Return JSON response
        Script->>Script: Parse JSON<br/>(handle list wrapping)
        Script->>Script: Extract b_topics, c_activity,<br/>d_domain, e_confidence
        Script->>Script: Merge new tags with<br/>existing tags from old_tags
        Script->>DB: UPDATE chunks<br/>SET tags=?, tag_confidence=?<br/>WHERE id=?
        DB-->>Script: Row updated
        Script->>Script: Throttle with sleep(0.3)
        Script->>Script: Log progress every 10 iterations
    end
    Script->>Script: Print sample summary<br/>(up to 10 processed chunks)
```
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/enrich_recent.py`:
- Around line 137-139: The UPDATE statement in cursor.execute currently writes
to tag_confidence but the canonical chunks schema expects the enrichment
confidence in column n; update the SQL in the cursor.execute call (the statement
that uses tags_json, confidence, chunk_id) to set n instead of tag_confidence so
the confidence value is persisted into the canonical column (keep tags = ?, n =
? with the same bound variables tags_json, confidence, chunk_id).
- Around line 102-121: The parsed model payload fields need type and range
validation before building new_tags: after parsing `parsed` (the dict), validate
that `parsed.get("b_topics")` is a list of strings (otherwise treat as
malformed), `parsed.get("d_domain")` is a list of strings,
`parsed.get("c_activity")` is either an empty string or a string, and
`parsed.get("e_confidence")` is numeric (coerce to float and ensure it falls in
an expected range, e.g. 0.0–1.0); if any validation fails, increment `errors`
and continue instead of using the malformed values. Update the code around the
`topics`, `activity`, `domains`, `confidence`, and `new_tags` logic to perform
these checks/coercions and only merge/extend `new_tags` with validated
lists/strings.
- Line 12: The code currently falls back to a hardcoded Gemini/Google key via
the API_KEY assignment (os.environ.get("GOOGLE_API_KEY", "...")), which leaks
credentials; remove the embedded literal and change the behavior in the API_KEY
initialization so it fails closed: read API_KEY from environment only
(os.environ["GOOGLE_API_KEY"] or equivalent) and raise a clear exception or exit
if the variable is not set, and rotate/remove the exposed key from history; also
update the other occurrences referenced (lines ~62-63) that use the same
fallback to ensure no hardcoded secret remains.
- Around line 63-66: The DB update phase is susceptible to race conditions and
lacks proper transaction batching: wrap the update-only section that iterates
over rows/old_tags (the code between where rows and old_tags are read and where
cursor.execute/conn.execute runs the UPDATEs, i.e., the block using conn, cursor
and updating tags) with a process-level single-writer guard (e.g., a
multiprocessing.Lock or a file-based lock) to ensure only one process writes at
a time; change the update loop to use explicit transactions by issuing
conn.execute("BEGIN") before a batch, perform up to 50 UPDATEs inside that
transaction, then conn.execute("COMMIT") after each 50-row batch (and finally
commit any remainder), and remove the misleading checkpoint print so that
commits reflect actual persisted checkpoints. Ensure you keep using the existing
genai.Client, conn, cursor, rows and old_tags identifiers when locating and
modifying the code.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 93cab848-9d40-45f1-a766-89a4cc64bb9e
📒 Files selected for processing (1)
scripts/enrich_recent.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: test (3.13)
- GitHub Check: test (3.12)
- GitHub Check: test (3.11)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests
Files:
scripts/enrich_recent.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.809Z
Learning: Run enrichment worker script via `golems/scripts/enrichment-lazy.sh` with launchd scheduling, process priority nice=20, and batch size of 50 items
📚 Learning: 2026-03-12T14:22:54.809Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.809Z
Learning: Applies to src/brainlayer/**/*.py : For bulk database operations: stop enrichment workers first, checkpoint WAL before and after operations, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks with checkpoint every 3 batches, never delete from `chunks` while FTS trigger is active
Applied to files:
scripts/enrich_recent.py
📚 Learning: 2026-03-12T14:22:54.809Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.809Z
Learning: Applies to src/brainlayer/**/*.py : Use AST-aware chunking with tree-sitter; never split stack traces; mask large tool output during chunking
Applied to files:
scripts/enrich_recent.py
📚 Learning: 2026-03-12T14:22:54.809Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.809Z
Learning: Run enrichment worker script via `golems/scripts/enrichment-lazy.sh` with launchd scheduling, process priority nice=20, and batch size of 50 items
Applied to files:
scripts/enrich_recent.py
🪛 Gitleaks (8.30.0)
scripts/enrich_recent.py
[high] 12-12: Uncovered a GCP API key, which could lead to unauthorized access to Google Cloud services and data breaches.
(gcp-api-key)
🔇 Additional comments (1)
scripts/enrich_recent.py (1)
62-178: Please run the full pytest suite before merging this writer.
The validation described here is a live run only, but this script mutates the main `chunks` table. Please run the existing pytest suite before treating the change as safe. As per coding guidelines, "Run pytest before claiming behavior changed safely; current test suite has 929 tests."
```python
client = genai.Client(api_key=API_KEY)
conn = apsw.Connection(DB_PATH)
conn.execute("PRAGMA busy_timeout = 5000")
cursor = conn.cursor()
```
🧩 Analysis chain

🏁 Script executed:
```shell
cd /tmp && find . -name "enrich_recent.py" -type f 2>/dev/null | head -5
```
Repository: EtanHey/brainlayer
Length of output: 44

🏁 Script executed:
```shell
cd /tmp && find . -type f -name "*.py" | grep -i enrich | head -10
```
Repository: EtanHey/brainlayer
Length of output: 44

🏁 Script executed:
```shell
git ls-files | grep -E "(enrich|script)" | head -20
```
Repository: EtanHey/brainlayer
Length of output: 809

🏁 Script executed:
```shell
git ls-files | grep enrich_recent
```
Repository: EtanHey/brainlayer
Length of output: 86

🏁 Script executed:
```shell
find . -name "*enrich*recent*" -o -name "*recent*enrich*" 2>/dev/null
```
Repository: EtanHey/brainlayer
Length of output: 122

🏁 Script executed:
```shell
git log --oneline --all -- "*enrich_recent*" 2>/dev/null | head -10
```
Repository: EtanHey/brainlayer
Length of output: 127

🏁 Script executed:
```shell
cat -n scripts/enrich_recent.py
```
Repository: EtanHey/brainlayer
Length of output: 9527

🏁 Script executed:
```shell
rg -t py "lock|Lock|LOCK|Semaphore|RLock" src/ scripts/ --max-count=3 | head -20
```
Repository: EtanHey/brainlayer
Length of output: 1780

🏁 Script executed:
```shell
cat -n tests/test_recent_enrichment.py
```
Repository: EtanHey/brainlayer
Length of output: 1730

🏁 Script executed:
```shell
rg -t py "_store_lock|process.*lock|flock|FileLock" --max-count=2
```
Repository: EtanHey/brainlayer
Length of output: 385

🏁 Script executed:
```shell
head -50 src/brainlayer/mcp/_shared.py
```
Repository: EtanHey/brainlayer
Length of output: 1465
Add process-level single-writer guard and implement explicit transaction batching with proper commits.
PRAGMA busy_timeout (line 65) only waits for lock contention—it does not prevent other writers. Since rows and old_tags are snapshots taken before the Gemini API call (lines 69–78), another writer can update tags between the snapshot and the UPDATE at line 138, causing stale data to overwrite newer values.
Line 166 prints a checkpoint message but executes no actual commit—APSW defaults to autocommit mode, so each UPDATE at line 138 is an independent write with no batching.
Fix: Wrap only the DB update phase (lines 137–140) in a process-level single-writer guard (e.g., multiprocessing.Lock or file-based lock). Implement proper explicit transactions: use conn.execute("BEGIN"), batch updates in 50-row chunks with conn.execute("COMMIT"), and remove the misleading checkpoint print.
Applies to lines 69–78, 123–140, and 164–166.
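The batching and locking pattern suggested above can be sketched as follows. This is a minimal sketch, not the script's implementation: it uses a file-based lock via the stdlib `fcntl` as the single-writer guard, and `sqlite3` instead of `apsw` so the example is self-contained; `DB_PATH`, `LOCK_PATH`, and `apply_updates` are hypothetical names.

```python
import fcntl
import sqlite3

DB_PATH = "example.db"          # hypothetical path
LOCK_PATH = DB_PATH + ".lock"   # file-based single-writer guard
BATCH_SIZE = 50

def apply_updates(updates):
    """Apply (tags_json, confidence, chunk_id) tuples in 50-row transactions."""
    # isolation_level=None puts sqlite3 in autocommit mode, mirroring apsw,
    # so BEGIN/COMMIT below are the only transaction boundaries.
    conn = sqlite3.connect(DB_PATH, isolation_level=None)
    conn.execute("PRAGMA busy_timeout = 5000")
    with open(LOCK_PATH, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # only one writer process at a time
        try:
            conn.execute("BEGIN")
            for i, (tags_json, confidence, chunk_id) in enumerate(updates, 1):
                conn.execute(
                    "UPDATE chunks SET tags = ?, tag_confidence = ? WHERE id = ?",
                    (tags_json, confidence, chunk_id),
                )
                if i % BATCH_SIZE == 0:
                    conn.execute("COMMIT")  # a real persisted checkpoint
                    conn.execute("BEGIN")
            conn.execute("COMMIT")          # commit any remainder
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
            conn.close()
```

With this shape, every COMMIT corresponds to rows actually on disk, so no separate checkpoint print is needed.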
```python
text = response.text.strip()
parsed = json.loads(text)

# Handle array responses (take first)
if isinstance(parsed, list) and parsed:
    parsed = parsed[0]
if not isinstance(parsed, dict):
    errors += 1
    continue

# Build new tags: merge old + new faceted
topics = parsed.get("b_topics", [])
activity = parsed.get("c_activity", "")
domains = parsed.get("d_domain", [])
confidence = parsed.get("e_confidence", 0)

new_tags = list(topics)
if activity:
    new_tags.append(activity)
new_tags.extend(domains)
```
Reject malformed model payloads before persisting them.
json.loads() only proves syntax. If the model returns a string for b_topics or d_domain, Lines 118 and 121 will turn it into per-character tags, and a non-numeric e_confidence will be written as-is. Validate the field types/range before building new_tags.
Suggested fix
```diff
     text = response.text.strip()
     parsed = json.loads(text)
@@
     topics = parsed.get("b_topics", [])
     activity = parsed.get("c_activity", "")
     domains = parsed.get("d_domain", [])
     confidence = parsed.get("e_confidence", 0)
+
+    if (
+        not isinstance(topics, list)
+        or not all(isinstance(t, str) for t in topics)
+        or not isinstance(activity, str)
+        or not isinstance(domains, list)
+        or not all(isinstance(d, str) for d in domains)
+        or not isinstance(confidence, (int, float))
+        or not 0.0 <= confidence <= 1.0
+    ):
+        errors += 1
+        if errors <= 3:
+            print(f"  Invalid payload for {str(chunk_id)[:30]}: {parsed!r}")
+        continue
```
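The same validation can be expressed as a self-contained predicate. This is a sketch, not code from the PR: `validate_payload` is a hypothetical helper name, and the faceted field names (`b_topics`, `c_activity`, `d_domain`, `e_confidence`) are taken from the review above.

```python
def validate_payload(parsed: dict) -> bool:
    """Return True only when the model payload has the expected field shapes."""
    topics = parsed.get("b_topics", [])
    activity = parsed.get("c_activity", "")
    domains = parsed.get("d_domain", [])
    confidence = parsed.get("e_confidence", 0)
    return (
        isinstance(topics, list)
        and all(isinstance(t, str) for t in topics)
        and isinstance(activity, str)
        and isinstance(domains, list)
        and all(isinstance(d, str) for d in domains)
        and isinstance(confidence, (int, float))
        and 0.0 <= confidence <= 1.0
    )
```

Note the per-character failure mode the review describes: a bare string passed where a list is expected would survive `list(topics)` and `extend(domains)` by splitting into single characters, which this predicate rejects.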
```python
cursor.execute(
    "UPDATE chunks SET tags = ?, tag_confidence = ? WHERE id = ?",
    (tags_json, confidence, chunk_id)
)
```
Write confidence to the actual chunks column.
Line 138 updates tag_confidence, but the canonical chunks schema stores enrichment confidence in n. On the canonical schema this statement will fail and the run will not persist its updates.
Suggested fix

```diff
-    "UPDATE chunks SET tags = ?, tag_confidence = ? WHERE id = ?",
+    "UPDATE chunks SET tags = ?, n = ? WHERE id = ?",
```
- Remove Gemini API key fallback from both scripts (require env var)
- Add type validation for parsed response fields (topics, domains, confidence)
- Addresses CodeRabbit critical finding on #96

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
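The fail-closed key lookup that replaces the hardcoded fallback can be sketched like this. It is a minimal sketch: `require_api_key` is a hypothetical helper name; the env var name `GOOGLE_API_KEY` comes from the review above.

```python
import os
import sys

def require_api_key(var: str = "GOOGLE_API_KEY") -> str:
    """Read the API key from the environment only; exit if unset.

    No hardcoded fallback: a missing key stops the run instead of
    silently using (and leaking) an embedded credential.
    """
    key = os.environ.get(var)
    if not key:
        sys.exit(f"ERROR: {var} is not set; refusing to run without it.")
    return key
```

Failing closed here also means a leaked literal never reaches version control, so there is nothing to rotate later.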
Summary
- `scripts/enrich_recent.py` — enriches chunks with faceted tags via Gemini 2.5 Flash

Sample output
Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit