Defer dataclip search_vector indexing off the insert path #4821
Merged
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #4821 +/- ##
=====================================
Coverage 90.3% 90.3%
=====================================
Files 443 444 +1
Lines 22562 22577 +15
=====================================
+ Hits 20379 20397 +18
+ Misses 2183 2180 -3
The dataclips AFTER INSERT trigger built search_vector with jsonb_to_tsvector
on the synchronous insert path. For large bodies under load this could hold the
connection past the timeout and roll back the insert, so the dataclip was never
saved and the run's following events cascade-failed - losing the run.
Move vector building off the insert path, mirroring the log_lines approach:
- safe_jsonb_to_tsvector(regconfig, jsonb): COALESCE(body,'{}') so a NULL/wiped
body yields ''::tsvector (never NULL, so it can't stick in the pending index),
catching program_limit_exceeded -> ''::tsvector.
- Partial index dataclips_pending_search_idx over (inserted_at) WHERE
search_vector IS NULL, built CONCURRENTLY (dataclips is unpartitioned).
- Drop the set_search_vector trigger and update_dataclip_search_vector function
(down restores the program_limit_exceeded-catching version).
- DataclipSearchVectorWorker on a dedicated dataclip_search_indexing queue
(concurrency 1) drains pending rows newest-first with FOR UPDATE SKIP LOCKED.
When its per-run budget is exhausted it re-enqueues itself (snowball);
otherwise the next run comes from the every-minute cron.
Uses english_nostop to match the read side (Lightning.Invocation).
Dataclip search is now eventually consistent. The insert no longer blocks on
or rolls back from vector building.
A 2,500-row batch is a single ~21s transaction pushing ~158MB WAL, and a batch catching multi-MB dataclip bodies blows past 60s. Dropping to 250 keeps each transaction short (~2s, bounded WAL/lock time) while the snowball re-enqueue and the every-minute cron carry overall throughput. @max_batches stays at 10 so jobs finish quickly and remain resilient across deploys.

Also decouples the moduledoc's queue-isolation note from the sibling log_lines PR: the rationale is now self-contained (own queue avoids starving or being starved by unrelated background work) rather than referencing a queue defined on another, unmerged branch.
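As a rough check on the batch-size figures above, here is a back-of-envelope sketch in Python. It assumes transaction time and WAL volume scale linearly with row count, which is only an approximation (a few multi-MB bodies can dominate a batch); `scale_linearly` is a hypothetical helper, not part of the codebase.

```python
def scale_linearly(measured: float, measured_rows: int, target_rows: int) -> float:
    """Scale a per-batch measurement to a different batch size (linear assumption)."""
    return measured * target_rows / measured_rows


# Figures quoted for the old 2,500-row batch.
OLD_BATCH_ROWS = 2500
OLD_BATCH_SECONDS = 21.0
OLD_BATCH_WAL_MB = 158.0

# New settings from this change.
NEW_BATCH_ROWS = 250
MAX_BATCHES = 10

seconds_per_batch = scale_linearly(OLD_BATCH_SECONDS, OLD_BATCH_ROWS, NEW_BATCH_ROWS)
wal_mb_per_batch = scale_linearly(OLD_BATCH_WAL_MB, OLD_BATCH_ROWS, NEW_BATCH_ROWS)

# One job still covers the same number of rows, but as ten short transactions
# instead of one long one.
rows_per_job = NEW_BATCH_ROWS * MAX_BATCHES

print(f"~{seconds_per_batch:.1f}s and ~{wal_mb_per_batch:.1f}MB WAL per transaction")
print(f"up to {rows_per_job} rows per job")
```

Under that assumption the estimate lands close to the ~2s per transaction quoted above, and a full job still works through up to 2,500 rows.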
Deferring dataclips.search_vector indexing off the insert path means inserted dataclips have a NULL search_vector until DataclipSearchVectorWorker drains them. Tests that insert dataclips and then search them on the body field matched nothing, since the worker never runs on its own in the test environment.

Add Lightning.TestUtils.flush_dataclip_search_index/0 (sibling to flush_log_search_index/0), which runs the worker synchronously in-process via Oban.Testing.perform_job/3 so it indexes the uncommitted sandbox rows, and call it in the invocation_test setups and the work_order_live filter test before searching. Also add a positive-control assertion so a regression that re-NULLs the vector fails loudly rather than passing on an empty result.
Mirror the log_lines fix: make DataclipSearchVectorWorker batch_size/max_batches configurable through the Lightning.Config seam (defaults 250/10 in config.exs, 2/2 in test.exs), restructure drain/2 into drain/4, and add a test exercising the recursive drain, budget guard, and snowball enqueue.
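The drain, budget guard, and snowball decision described in these commits can be sketched as a small in-memory simulation. This is Python rather than the worker's Elixir, and it is not the actual drain/4: `PendingQueue`, `index_batch`, and the return shape are hypothetical stand-ins for the SQL that selects rows WHERE search_vector IS NULL with FOR UPDATE SKIP LOCKED.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PendingQueue:
    """Hypothetical stand-in for dataclips whose search_vector is still NULL."""
    inserted_at: List[int] = field(default_factory=list)

    def index_batch(self, batch_size: int) -> int:
        """Index up to batch_size rows, newest first; return how many were indexed."""
        self.inserted_at.sort(reverse=True)
        batch = self.inserted_at[:batch_size]
        del self.inserted_at[:batch_size]
        return len(batch)


def drain(queue: PendingQueue, batch_size: int = 250, max_batches: int = 10) -> Tuple[int, bool]:
    """Run one job's worth of batches.

    Returns (rows_indexed, snowball). snowball is True when the per-run budget
    of max_batches was used up, meaning a follow-up job should be enqueued.
    """
    indexed = 0
    for _ in range(max_batches):
        n = queue.index_batch(batch_size)
        indexed += n
        if n < batch_size:
            # Short batch: the backlog is drained, leave later rows to the cron.
            return indexed, False
    # Budget exhausted with every batch full: more rows may still be pending.
    return indexed, True


queue = PendingQueue(list(range(3000)))
first_run = drain(queue)
oldest_remaining = max(queue.inserted_at)
second_run = drain(queue)
print(first_run, second_run)
```

In this sketch a backlog of exactly batch_size * max_batches rows still triggers one follow-up job that finds nothing to do; whether the real worker behaves the same way at that boundary is not stated in the source.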
Security Review ✅
Description
Same idea as #4818, but for dataclips instead of log_lines.

When a run finishes, the worker sends each step's output dataclip back to Lightning to be saved. An AFTER INSERT trigger then builds the dataclip's search vector with jsonb_to_tsvector. For big dataclips that's slow: slow enough that under load the insert can sit on a connection past the ~70s timeout and get rolled back. The dataclip never gets saved, and then everything after it in the run (next step's input, log lines, final_dataclip_id) points at something that doesn't exist, so the whole run is lost. That's one of the shapes behind #4794.

So this takes the same approach we used for log lines: inserts now leave search_vector NULL, and a background Oban worker (Lightning.Invocation.DataclipSearchVectorWorker) backfills it out-of-band on its own dataclip_search_indexing queue. There's a guarded safe_jsonb_to_tsvector function and a partial index (WHERE search_vector IS NULL) so finding pending rows stays cheap. Dataclip search is now eventually consistent, usually caught up within a minute.

Three migrations: add safe_jsonb_to_tsvector, add the partial index, then drop the trigger. dataclips isn't partitioned so the index is a single CONCURRENTLY build (no per-partition dance like the log_lines one needed).

Caveat
One thing about this approach is worth mentioning: on an idle instance, a single inserted dataclip is not indexed right away.
Indexing is poll-driven by the 1-minute cron; a lone insert can't enqueue the worker itself.
So with zero load, a new row waits for the next cron tick: up to ~60s, ~30s on average.
This is a floor set by the cron cadence, not by load.
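The ~60s worst case and ~30s average follow directly from the cron period. A minimal Python sketch, assuming inserts arrive uniformly at random within a minute and ignoring the time the job itself takes to run (`wait_for_next_tick` is a hypothetical helper):

```python
CRON_PERIOD_S = 60.0


def wait_for_next_tick(insert_offset_s: float, period_s: float = CRON_PERIOD_S) -> float:
    """Seconds until the next cron tick for a row inserted insert_offset_s
    after the previous tick."""
    return period_s - (insert_offset_s % period_s)


# Sample one insert at the midpoint of every second in the minute.
offsets = [second + 0.5 for second in range(60)]
waits = [wait_for_next_tick(offset) for offset in offsets]

worst_wait = max(waits)
mean_wait = sum(waits) / len(waits)
print(f"worst ~{worst_wait:.1f}s, mean ~{mean_wait:.1f}s")
```

A row inserted just after a tick waits nearly the full period, one inserted just before a tick waits almost nothing, and the mean sits at half the period.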
Closes #4800
Validation steps
- […] mix ecto.migrate). Confirm the dataclips set_search_vector trigger is gone, safe_jsonb_to_tsvector exists, and dataclips_pending_search_idx is VALID.
- […] search_vector is NULL and it isn't matched by search yet.
- […] Lightning.Invocation.DataclipSearchVectorWorker; the row's search_vector is populated and matches to_tsquery('english_nostop', …).
- […] vector rather than erroring, and doesn't get stuck retrying.
- mix test test/lightning/invocation/dataclip_search_vector_worker_test.exs test/lightning/runs_test.exs

Additional notes for the reviewer
- […] immediate follow-up while there's backlog, falls back to a 1-minute cron heartbeat. Concurrency 1 on its own queue. The snowball's unique states are restricted to [:available, :scheduled] on purpose (the default includes :executing/:completed, which makes a running job dedup its own successor and kills the chain).
- […] search_indexing on purpose, so a log-line backlog can't starve dataclip indexing (or vice versa).
AI Usage
Pre-submission checklist