
feat: faceted tag enrichment script (Gemini 2.5 Flash)#96

Merged

EtanHey merged 2 commits into main from feat/enrich-recent-faceted-tags on Mar 19, 2026

Conversation


@EtanHey EtanHey commented Mar 19, 2026

Summary

  • New scripts/enrich_recent.py — enriches chunks with faceted tags via Gemini 2.5 Flash
  • Uses enrichment prompt v2: topic tags, activity, domain, confidence
  • Merges new faceted tags with existing tags (no data loss)
  • Commits every 50 chunks, rate-limited at 0.3s/req
  • First run: 200 chunks enriched, 0 errors, avg confidence 0.95

Sample output

brainbar-f791af84  topics=['brainlayer-search-quality', 'importance-calibration'], act:implementing, ['dom:sql'], conf=0.98
brainbar-e8677cf3  topics=['multi-agent-coordination', 'agent-message-architecture'], act:designing, ['dom:sql', 'dom:mcp'], conf=0.98
manual-54e90a47    topics=['brainlayer-search-quality', 'sprint-planning-methodology'], act:planning, ['dom:git', 'dom:cli'], conf=0.95
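The "merge with existing tags (no data loss)" step above can be sketched as follows. This is an illustrative reconstruction, not the script's actual code: the field names follow the enrichment prompt v2 schema shown in the sample output, and the JSON-encoded tags column is an assumption.

```python
import json

def merge_tags(old_tags_json, topics, activity, domains):
    """Union old tags with new faceted tags, preserving every existing entry."""
    old_tags = json.loads(old_tags_json) if old_tags_json else []
    # Assemble the new faceted tags: topics, then activity, then domains.
    new_tags = list(topics)
    if activity:
        new_tags.append(activity)
    new_tags.extend(domains)
    # Keep all existing tags; append only tags not already present.
    merged = list(old_tags)
    for tag in new_tags:
        if tag not in merged:
            merged.append(tag)
    return json.dumps(merged)
```

Old tags always come first, so a re-run of the script cannot drop manually applied tags; it can only append.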

Test plan

  • 200-chunk live run: 0 errors, 100% valid JSON
  • Tags correctly merged with existing (old tags preserved)
  • Verify enriched chunks surface better in brain_search

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Introduced automated enrichment functionality that processes recent data chunks lacking specific tags. Each chunk receives AI-generated enhancements including activity classifications, domain categorizations, and confidence scores. All updates are automatically persisted to your local database.

Gemini 2.5 Flash enrichment with faceted tag schema (topic, activity,
domain, confidence). Merges new tags with existing, commits every 50.
First run: 200 chunks enriched, 0 errors, avg confidence 0.95.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@greptile-apps greptile-apps Bot left a comment


Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@coderabbitai

coderabbitai Bot commented Mar 19, 2026

Warning

Rate limit exceeded

@EtanHey has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 5 minutes and 1 second before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6755d54b-4766-43ea-b697-ee5503e10ef0

📥 Commits

Reviewing files that changed from the base of the PR and between 3d80539 and 719761a.

📒 Files selected for processing (2)
  • scripts/enrich_recent.py
  • scripts/enrichment_pilot.py
📝 Walkthrough

Walkthrough

A new standalone script, scripts/enrich_recent.py, has been added that queries a local SQLite database for recent chunks lacking domain tags, calls the Gemini API to generate enriched metadata tags, parses the responses, and updates the database with extracted topic, activity, domain, and confidence values alongside existing tags.

Changes

Cohort / File(s): New Data Enrichment Script — scripts/enrich_recent.py
Summary: New standalone script that retrieves recent chunks from SQLite, invokes Gemini for tag enrichment, parses JSON responses with a fallback for list-wrapped results, merges new faceted tags with existing tags, and updates the database. Includes request throttling, progress logging every 10 iterations, error tracking (first 3 errors logged), and batch checkpoint reporting.

Sequence Diagram

sequenceDiagram
    actor Script as Enrichment Script
    participant DB as SQLite Database
    participant API as Gemini API
    
    Script->>DB: Connect with busy_timeout=5000
    Script->>DB: SELECT recent chunks<br/>(last 7 days, no dom: tags)
    DB-->>Script: Return up to MAX_CHUNKS rows
    
    loop For each chunk
        Script->>API: POST prompt with chunk content<br/>(gemini-2.5-flash)
        API-->>Script: Return JSON response
        Script->>Script: Parse JSON<br/>(handle list wrapping)
        Script->>Script: Extract b_topics, c_activity,<br/>d_domain, e_confidence
        Script->>Script: Merge new tags with<br/>existing tags from old_tags
        Script->>DB: UPDATE chunks<br/>SET tags=?, tag_confidence=?<br/>WHERE id=?
        DB-->>Script: Row updated
        Script->>Script: Throttle with sleep(0.3)
        Script->>Script: Log progress every 10 iterations
    end
    
    Script->>Script: Print sample summary<br/>(up to 10 processed chunks)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A bunny's delight in the database night,
With tags and domains now shining so bright!
Gemini whispers what each chunk should know,
As SQLite blossoms in enrichment's glow,
Ten hops at a time, we chronicle the way!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title 'feat: faceted tag enrichment script (Gemini 2.5 Flash)' clearly and concisely describes the main change: adding a script for enriching chunks with faceted tags using Gemini 2.5 Flash. It is specific, directly related to the core addition of scripts/enrich_recent.py, and aligns with the PR objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/enrich-recent-faceted-tags
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/enrich_recent.py`:
- Around line 137-139: The UPDATE statement in cursor.execute currently writes
to tag_confidence but the canonical chunks schema expects the enrichment
confidence in column n; update the SQL in the cursor.execute call (the statement
that uses tags_json, confidence, chunk_id) to set n instead of tag_confidence so
the confidence value is persisted into the canonical column (keep tags = ?, n =
? with the same bound variables tags_json, confidence, chunk_id).
- Around line 102-121: The parsed model payload fields need type and range
validation before building new_tags: after parsing `parsed` (the dict), validate
that `parsed.get("b_topics")` is a list of strings (otherwise treat as
malformed), `parsed.get("d_domain")` is a list of strings,
`parsed.get("c_activity")` is either an empty string or a string, and
`parsed.get("e_confidence")` is numeric (coerce to float and ensure it falls in
an expected range, e.g. 0.0–1.0); if any validation fails, increment `errors`
and continue instead of using the malformed values. Update the code around the
`topics`, `activity`, `domains`, `confidence`, and `new_tags` logic to perform
these checks/coercions and only merge/extend `new_tags` with validated
lists/strings.
- Line 12: The code currently falls back to a hardcoded Gemini/Google key via
the API_KEY assignment (os.environ.get("GOOGLE_API_KEY", "...")), which leaks
credentials; remove the embedded literal and change the behavior in the API_KEY
initialization so it fails closed: read API_KEY from environment only
(os.environ["GOOGLE_API_KEY"] or equivalent) and raise a clear exception or exit
if the variable is not set, and rotate/remove the exposed key from history; also
update the other occurrences referenced (lines ~62-63) that use the same
fallback to ensure no hardcoded secret remains.
- Around line 63-66: The DB update phase is susceptible to race conditions and
lacks proper transaction batching: wrap the update-only section that iterates
over rows/old_tags (the code between where rows and old_tags are read and where
cursor.execute/conn.execute runs the UPDATEs, i.e., the block using conn, cursor
and updating tags) with a process-level single-writer guard (e.g., a
multiprocessing.Lock or a file-based lock) to ensure only one process writes at
a time; change the update loop to use explicit transactions by issuing
conn.execute("BEGIN") before a batch, perform up to 50 UPDATEs inside that
transaction, then conn.execute("COMMIT") after each 50-row batch (and finally
commit any remainder), and remove the misleading checkpoint print so that
commits reflect actual persisted checkpoints. Ensure you keep using the existing
genai.Client, conn, cursor, rows and old_tags identifiers when locating and
modifying the code.
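The fail-closed key handling the third comment asks for can be sketched as below. The variable name GOOGLE_API_KEY comes from the review; the helper name and error message are illustrative, not the actual fix.

```python
import os
import sys

def require_api_key(var="GOOGLE_API_KEY"):
    """Read the key from the environment only; never fall back to a literal."""
    key = os.environ.get(var)
    if not key:
        # Fail closed with a clear message instead of using a hardcoded secret.
        sys.exit(f"error: {var} is not set; refusing to run without credentials")
    return key
```

Beyond the code change, the exposed key also needs to be rotated and removed from git history, since deleting the literal from HEAD does not revoke it.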
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 93cab848-9d40-45f1-a766-89a4cc64bb9e

📥 Commits

Reviewing files that changed from the base of the PR and between 4af55ff and 3d80539.

📒 Files selected for processing (1)
  • scripts/enrich_recent.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: test (3.13)
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.11)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

Files:

  • scripts/enrich_recent.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.809Z
Learning: Run enrichment worker script via `golems/scripts/enrichment-lazy.sh` with launchd scheduling, process priority nice=20, and batch size of 50 items
📚 Learning: 2026-03-12T14:22:54.809Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.809Z
Learning: Applies to src/brainlayer/**/*.py : For bulk database operations: stop enrichment workers first, checkpoint WAL before and after operations, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks with checkpoint every 3 batches, never delete from `chunks` while FTS trigger is active

Applied to files:

  • scripts/enrich_recent.py
📚 Learning: 2026-03-12T14:22:54.809Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.809Z
Learning: Applies to src/brainlayer/**/*.py : Use AST-aware chunking with tree-sitter; never split stack traces; mask large tool output during chunking

Applied to files:

  • scripts/enrich_recent.py
📚 Learning: 2026-03-12T14:22:54.809Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-12T14:22:54.809Z
Learning: Run enrichment worker script via `golems/scripts/enrichment-lazy.sh` with launchd scheduling, process priority nice=20, and batch size of 50 items

Applied to files:

  • scripts/enrich_recent.py
🪛 Gitleaks (8.30.0)
scripts/enrich_recent.py

[high] 12-12: Uncovered a GCP API key, which could lead to unauthorized access to Google Cloud services and data breaches.

(gcp-api-key)

🔇 Additional comments (1)
scripts/enrich_recent.py (1)

62-178: Please run the full pytest suite before merging this writer.

The validation described here is a live run only, but this script mutates the main chunks table. Please run the existing pytest suite before treating the change as safe.

As per coding guidelines, "Run pytest before claiming behavior changed safely; current test suite has 929 tests."

Comment thread scripts/enrich_recent.py Outdated
Comment thread scripts/enrich_recent.py
Comment on lines +63 to +66
client = genai.Client(api_key=API_KEY)
conn = apsw.Connection(DB_PATH)
conn.execute("PRAGMA busy_timeout = 5000")
cursor = conn.cursor()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Before raising the finding below, the bot verified it against the repository: locating scripts/enrich_recent.py via git ls-files, reading the full script and tests/test_recent_enrichment.py, searching src/ and scripts/ for existing lock usage (lock, Lock, Semaphore, _store_lock, flock, FileLock), and inspecting src/brainlayer/mcp/_shared.py.

Add process-level single-writer guard and implement explicit transaction batching with proper commits.

PRAGMA busy_timeout (line 65) only waits for lock contention—it does not prevent other writers. Since rows and old_tags are snapshots taken before the Gemini API call (lines 69–78), another writer can update tags between the snapshot and the UPDATE at line 138, causing stale data to overwrite newer values.

Line 166 prints a checkpoint message but executes no actual commit—APSW defaults to autocommit mode, so each UPDATE at line 138 is an independent write with no batching.

Fix: Wrap only the DB update phase (lines 137–140) in a process-level single-writer guard (e.g., multiprocessing.Lock or file-based lock). Implement proper explicit transactions: use conn.execute("BEGIN"), batch updates in 50-row chunks with conn.execute("COMMIT"), and remove the misleading checkpoint print.

Applies to lines 69–78, 123–140, and 164–166.


Comment thread scripts/enrich_recent.py
Comment on lines +102 to +121
text = response.text.strip()
parsed = json.loads(text)

# Handle array responses (take first)
if isinstance(parsed, list) and parsed:
    parsed = parsed[0]
if not isinstance(parsed, dict):
    errors += 1
    continue

# Build new tags: merge old + new faceted
topics = parsed.get("b_topics", [])
activity = parsed.get("c_activity", "")
domains = parsed.get("d_domain", [])
confidence = parsed.get("e_confidence", 0)

new_tags = list(topics)
if activity:
    new_tags.append(activity)
new_tags.extend(domains)

⚠️ Potential issue | 🟠 Major

Reject malformed model payloads before persisting them.

json.loads() only proves syntax. If the model returns a string for b_topics or d_domain, Lines 118 and 121 will turn it into per-character tags, and a non-numeric e_confidence will be written as-is. Validate the field types/range before building new_tags.

Suggested fix
             text = response.text.strip()
             parsed = json.loads(text)
@@
             topics = parsed.get("b_topics", [])
             activity = parsed.get("c_activity", "")
             domains = parsed.get("d_domain", [])
             confidence = parsed.get("e_confidence", 0)
+
+            if (
+                not isinstance(topics, list)
+                or not all(isinstance(t, str) for t in topics)
+                or not isinstance(activity, str)
+                or not isinstance(domains, list)
+                or not all(isinstance(d, str) for d in domains)
+                or not isinstance(confidence, (int, float))
+                or not 0.0 <= confidence <= 1.0
+            ):
+                errors += 1
+                if errors <= 3:
+                    print(f"  Invalid payload for {str(chunk_id)[:30]}: {parsed!r}")
+                continue

Comment thread scripts/enrich_recent.py
Comment on lines +137 to +139
cursor.execute(
    "UPDATE chunks SET tags = ?, tag_confidence = ? WHERE id = ?",
    (tags_json, confidence, chunk_id)

⚠️ Potential issue | 🔴 Critical

Write confidence to the actual chunks column.

Line 138 updates tag_confidence, but the canonical chunks schema stores enrichment confidence in n. On the canonical schema this statement will fail and the run will not persist its updates.

Suggested fix
-                "UPDATE chunks SET tags = ?, tag_confidence = ? WHERE id = ?",
+                "UPDATE chunks SET tags = ?, n = ? WHERE id = ?",

- Remove Gemini API key fallback from both scripts (require env var)
- Add type validation for parsed response fields (topics, domains, confidence)
- Addresses CodeRabbit critical finding on #96

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@EtanHey EtanHey merged commit ee1e357 into main Mar 19, 2026
6 checks passed
@EtanHey EtanHey deleted the feat/enrich-recent-faceted-tags branch March 19, 2026 10:52
