Skip to content

Defer dataclip search_vector indexing off the insert path#4821

Merged
stuartc merged 4 commits into
mainfrom
defer-dataclip-search-vector
Jun 4, 2026
Merged

Defer dataclip search_vector indexing off the insert path#4821
stuartc merged 4 commits into
mainfrom
defer-dataclip-search-vector

Conversation

@stuartc
Copy link
Copy Markdown
Member

@stuartc stuartc commented Jun 1, 2026

Description

Same idea as #4818, but for dataclips instead of log_lines.

When a run finishes, the worker sends each step's output dataclip back to
Lightning to be saved. An AFTER INSERT trigger then builds the dataclip's
search vector with jsonb_to_tsvector. For big dataclips that's slow — slow
enough that under load the insert can sit on a connection past the ~70s timeout
and get rolled back. The dataclip never gets saved, and then everything after it
in the run (next step's input, log lines, final_dataclip_id) points at
something that doesn't exist, so the whole run is lost. That's one of the shapes
behind #4794.

So this takes the same approach we used for log lines: inserts now leave
search_vector NULL, and a background Oban worker
(Lightning.Invocation.DataclipSearchVectorWorker) backfills it out-of-band on
its own dataclip_search_indexing queue. There's a guarded
safe_jsonb_to_tsvector function and a partial index (WHERE search_vector IS NULL) so finding pending rows stays cheap. Dataclip search is now
eventually-consistent — usually caught up within a minute.

Three migrations: add safe_jsonb_to_tsvector, add the partial index, then drop
the trigger. dataclips isn't partitioned so the index is a single
CONCURRENTLY build (no per-partition dance like the log_lines one needed).

Caveat

This is one thing about this approach worth mentioning:

In the situation of an idle instace, with a single dataclip being inserted.
Indexing is poll-driven by the 1-minute cron - a lone insert can't enqueue the worker itself.
So with zero load, a new row waits for the next cron tick: up to ~60s, ~30s on average.
This is a floor set by the cron cadence, not by load.

Closes #4800

Validation steps

  1. Run the migrations (mix ecto.migrate). Confirm the dataclips
    set_search_vector trigger is gone, safe_jsonb_to_tsvector exists, and
    dataclips_pending_search_idx is VALID.
  2. Insert a dataclip — its search_vector is NULL and it isn't matched by
    search yet.
  3. Run Lightning.Invocation.DataclipSearchVectorWorker — the row's
    search_vector is populated and matches to_tsquery('english_nostop', …).
  4. Insert one with an oversized body (>1MB tsvector) — it drains to an empty
    vector rather than erroring, and doesn't get stuck retrying.
  5. mix test test/lightning/invocation/dataclip_search_vector_worker_test.exs test/lightning/runs_test.exs

Additional notes for the reviewer

  1. The worker is basically the log_lines one — drains newest-first, snowballs an
    immediate follow-up while there's backlog, falls back to a 1-minute cron
    heartbeat. Concurrency 1 on its own queue. The snowball's unique states are
    restricted to [:available, :scheduled] on purpose (the default includes
    :executing/:completed, which makes a running job dedup its own successor
    and kills the chain).
  2. Separate queue from log_lines' search_indexing on purpose, so a log-line
    backlog can't starve dataclip indexing (or vice versa).

AI Usage

  • I have used Claude Code
  • I have used another model
  • I have not used AI

Pre-submission checklist

  • I have performed an AI review of my code
  • I have implemented and tested all related authorization policies.
  • I have updated the changelog.
  • I have ticked a box in "AI usage" in this PR

@github-project-automation github-project-automation Bot moved this to New Issues in Core Jun 1, 2026
@stuartc stuartc changed the base branch from main to defer-log-lines-search-vector June 3, 2026 14:45
@stuartc stuartc force-pushed the defer-dataclip-search-vector branch from 8fc24ba to 1392175 Compare June 3, 2026 14:49
@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.3%. Comparing base (9b000a0) to head (1aa5272).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@          Coverage Diff          @@
##            main   #4821   +/-   ##
=====================================
  Coverage   90.3%   90.3%           
=====================================
  Files        443     444    +1     
  Lines      22562   22577   +15     
=====================================
+ Hits       20379   20397   +18     
+ Misses      2183    2180    -3     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@stuartc stuartc force-pushed the defer-dataclip-search-vector branch from 1392175 to 5621332 Compare June 3, 2026 16:20
Base automatically changed from defer-log-lines-search-vector to main June 4, 2026 05:32
stuartc added 4 commits June 4, 2026 07:37
The dataclips AFTER INSERT trigger built search_vector with jsonb_to_tsvector
on the synchronous insert path. For large bodies under load this could hold the
connection past the timeout and roll back the insert, so the dataclip was never
saved and the run's following events cascade-failed - losing the run.

Move vector building off the insert path, mirroring the log_lines approach:

- safe_jsonb_to_tsvector(regconfig, jsonb): COALESCE(body,'{}') so a NULL/wiped
  body yields ''::tsvector (never NULL, so it can't stick in the pending index),
  catching program_limit_exceeded -> ''::tsvector.
- Partial index dataclips_pending_search_idx over (inserted_at) WHERE
  search_vector IS NULL, built CONCURRENTLY (dataclips is unpartitioned).
- Drop the set_search_vector trigger and update_dataclip_search_vector function
  (down restores the program_limit_exceeded-catching version).
- DataclipSearchVectorWorker on a dedicated dataclip_search_indexing queue
  (concurrency 1) drains pending rows newest-first with FOR UPDATE SKIP LOCKED,
  snowballing when its per-run budget is exhausted, otherwise minute-ly cron.
  Uses english_nostop to match the read side (Lightning.Invocation).

Dataclip search is now eventually consistent. The insert no longer blocks on
or rolls back from vector building.
A 2,500-row batch is a single ~21s transaction pushing ~158MB WAL, and a
batch catching multi-MB dataclip bodies blows past 60s. Dropping to 250
keeps each transaction short (~2s, bounded WAL/lock time) while the
snowball re-enqueue and minute-ly cron carry overall throughput.
@max_batches stays at 10 so jobs finish quickly and remain resilient
across deploys.

Also decouples the moduledoc's queue-isolation note from the sibling
log_lines PR: the rationale is now self-contained (own queue avoids
starving or being starved by unrelated background work) rather than
referencing a queue defined on another, unmerged branch.
Deferring dataclips.search_vector indexing off the insert path means
inserted dataclips have a NULL search_vector until DataclipSearchVectorWorker
drains them. Tests that insert dataclips and then search them on the body
field matched nothing, since the worker never runs on its own in the test
environment.

Add Lightning.TestUtils.flush_dataclip_search_index/0 (sibling to
flush_log_search_index/0), which runs the worker synchronously in-process via
Oban.Testing.perform_job/3 so it indexes the uncommitted sandbox rows, and
call it in the invocation_test setups and the work_order_live filter test
before searching. Also add a positive-control assertion so a regression that
re-NULLs the vector fails loudly rather than passing on an empty result.
Mirror the log_lines fix: make DataclipSearchVectorWorker batch_size/max_batches
configurable through the Lightning.Config seam (defaults 250/10 in config.exs,
2/2 in test.exs), restructure drain/2 into drain/4, and add a test exercising
the recursive drain, budget guard, and snowball enqueue.
@stuartc stuartc force-pushed the defer-dataclip-search-vector branch from 5621332 to 1aa5272 Compare June 4, 2026 05:43
@stuartc stuartc marked this pull request as ready for review June 4, 2026 05:49
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 4, 2026

Security Review ✅

  • S0 (project scoping): N/A — the only data write is DataclipSearchVectorWorker updating dataclips.search_vector (lib/lightning/invocation/dataclip_search_vector_worker.ex:48-55), a system-level background index maintenance task whose SQL is intentionally cross-project; project scoping happens at the read side (Invocation.search_workorders).
  • S1 (authorization): N/A — no new controllers, LiveView events, or web-layer entrypoints; the diff only adds Lightning.Config callbacks and threads them into an existing Oban worker.
  • S2 (audit trail): N/A — no writes to configuration resources (workflows, credentials, project settings, etc.); the worker only backfills a derived search_vector tsvector column.

@stuartc stuartc merged commit f590caa into main Jun 4, 2026
6 checks passed
@stuartc stuartc deleted the defer-dataclip-search-vector branch June 4, 2026 05:52
@github-project-automation github-project-automation Bot moved this from New Issues to Done in Core Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Dataclip insert times out building the search vector, which loses the run

1 participant