Defer dataclip search_vector indexing off the insert path by stuartc · Pull Request #4821 · OpenFn/lightning

stuartc · 2026-06-01T11:23:07Z

Description

Same idea as #4818, but for dataclips instead of log_lines.

When a run finishes, the worker sends each step's output dataclip back to
Lightning to be saved. An AFTER INSERT trigger then builds the dataclip's
search vector with jsonb_to_tsvector. For big dataclips that's slow — slow
enough that under load the insert can sit on a connection past the ~70s timeout
and get rolled back. The dataclip never gets saved, and then everything after it
in the run (next step's input, log lines, final_dataclip_id) points at
something that doesn't exist, so the whole run is lost. That's one of the shapes
behind #4794.

So this takes the same approach we used for log lines: inserts now leave
search_vector NULL, and a background Oban worker
(Lightning.Invocation.DataclipSearchVectorWorker) backfills it out-of-band on
its own dataclip_search_indexing queue. There's a guarded
safe_jsonb_to_tsvector function and a partial index (WHERE search_vector IS NULL) so finding pending rows stays cheap. Dataclip search is now
eventually-consistent — usually caught up within a minute.

Three migrations: add safe_jsonb_to_tsvector, add the partial index, then drop
the trigger. dataclips isn't partitioned so the index is a single
CONCURRENTLY build (no per-partition dance like the log_lines one needed).

Caveat

This is one thing about this approach worth mentioning:

In the situation of an idle instace, with a single dataclip being inserted.
Indexing is poll-driven by the 1-minute cron - a lone insert can't enqueue the worker itself.
So with zero load, a new row waits for the next cron tick: up to ~60s, ~30s on average.
This is a floor set by the cron cadence, not by load.

Closes #4800

Validation steps

Run the migrations (mix ecto.migrate). Confirm the dataclips
set_search_vector trigger is gone, safe_jsonb_to_tsvector exists, and
dataclips_pending_search_idx is VALID.
Insert a dataclip — its search_vector is NULL and it isn't matched by
search yet.
Run Lightning.Invocation.DataclipSearchVectorWorker — the row's
search_vector is populated and matches to_tsquery('english_nostop', …).
Insert one with an oversized body (>1MB tsvector) — it drains to an empty
vector rather than erroring, and doesn't get stuck retrying.
mix test test/lightning/invocation/dataclip_search_vector_worker_test.exs test/lightning/runs_test.exs

Additional notes for the reviewer

The worker is basically the log_lines one — drains newest-first, snowballs an
immediate follow-up while there's backlog, falls back to a 1-minute cron
heartbeat. Concurrency 1 on its own queue. The snowball's unique states are
restricted to [:available, :scheduled] on purpose (the default includes
:executing/:completed, which makes a running job dedup its own successor
and kills the chain).
Separate queue from log_lines' search_indexing on purpose, so a log-line
backlog can't starve dataclip indexing (or vice versa).

AI Usage

I have used Claude Code
I have used another model
I have not used AI

Pre-submission checklist

I have performed an AI review of my code
I have implemented and tested all related authorization policies.
I have updated the changelog.
I have ticked a box in "AI usage" in this PR

codecov · 2026-06-03T14:58:52Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.3%. Comparing base (9b000a0) to head (1aa5272).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@          Coverage Diff          @@
##            main   #4821   +/-   ##
=====================================
  Coverage   90.3%   90.3%           
=====================================
  Files        443     444    +1     
  Lines      22562   22577   +15     
=====================================
+ Hits       20379   20397   +18     
+ Misses      2183    2180    -3

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

The dataclips AFTER INSERT trigger built search_vector with jsonb_to_tsvector on the synchronous insert path. For large bodies under load this could hold the connection past the timeout and roll back the insert, so the dataclip was never saved and the run's following events cascade-failed - losing the run. Move vector building off the insert path, mirroring the log_lines approach: - safe_jsonb_to_tsvector(regconfig, jsonb): COALESCE(body,'{}') so a NULL/wiped body yields ''::tsvector (never NULL, so it can't stick in the pending index), catching program_limit_exceeded -> ''::tsvector. - Partial index dataclips_pending_search_idx over (inserted_at) WHERE search_vector IS NULL, built CONCURRENTLY (dataclips is unpartitioned). - Drop the set_search_vector trigger and update_dataclip_search_vector function (down restores the program_limit_exceeded-catching version). - DataclipSearchVectorWorker on a dedicated dataclip_search_indexing queue (concurrency 1) drains pending rows newest-first with FOR UPDATE SKIP LOCKED, snowballing when its per-run budget is exhausted, otherwise minute-ly cron. Uses english_nostop to match the read side (Lightning.Invocation). Dataclip search is now eventually consistent. The insert no longer blocks on or rolls back from vector building.

A 2,500-row batch is a single ~21s transaction pushing ~158MB WAL, and a batch catching multi-MB dataclip bodies blows past 60s. Dropping to 250 keeps each transaction short (~2s, bounded WAL/lock time) while the snowball re-enqueue and minute-ly cron carry overall throughput. @max_batches stays at 10 so jobs finish quickly and remain resilient across deploys. Also decouples the moduledoc's queue-isolation note from the sibling log_lines PR: the rationale is now self-contained (own queue avoids starving or being starved by unrelated background work) rather than referencing a queue defined on another, unmerged branch.

Deferring dataclips.search_vector indexing off the insert path means inserted dataclips have a NULL search_vector until DataclipSearchVectorWorker drains them. Tests that insert dataclips and then search them on the body field matched nothing, since the worker never runs on its own in the test environment. Add Lightning.TestUtils.flush_dataclip_search_index/0 (sibling to flush_log_search_index/0), which runs the worker synchronously in-process via Oban.Testing.perform_job/3 so it indexes the uncommitted sandbox rows, and call it in the invocation_test setups and the work_order_live filter test before searching. Also add a positive-control assertion so a regression that re-NULLs the vector fails loudly rather than passing on an empty result.

Mirror the log_lines fix: make DataclipSearchVectorWorker batch_size/max_batches configurable through the Lightning.Config seam (defaults 250/10 in config.exs, 2/2 in test.exs), restructure drain/2 into drain/4, and add a test exercising the recursive drain, budget guard, and snowball enqueue.

github-actions · 2026-06-04T05:50:36Z

Security Review ✅

S0 (project scoping): N/A — the only data write is DataclipSearchVectorWorker updating dataclips.search_vector (lib/lightning/invocation/dataclip_search_vector_worker.ex:48-55), a system-level background index maintenance task whose SQL is intentionally cross-project; project scoping happens at the read side (Invocation.search_workorders).
S1 (authorization): N/A — no new controllers, LiveView events, or web-layer entrypoints; the diff only adds Lightning.Config callbacks and threads them into an existing Oban worker.
S2 (audit trail): N/A — no writes to configuration resources (workflows, credentials, project settings, etc.); the worker only backfills a derived search_vector tsvector column.

github-project-automation Bot added this to Core Jun 1, 2026

github-project-automation Bot moved this to New Issues in Core Jun 1, 2026

stuartc mentioned this pull request Jun 1, 2026

Defer log_lines search_vector indexing off the insert path #4818

Merged

7 tasks

stuartc changed the base branch from main to defer-log-lines-search-vector June 3, 2026 14:45

stuartc force-pushed the defer-dataclip-search-vector branch from 8fc24ba to 1392175 Compare June 3, 2026 14:49

stuartc force-pushed the defer-dataclip-search-vector branch from 1392175 to 5621332 Compare June 3, 2026 16:20

Base automatically changed from defer-log-lines-search-vector to main June 4, 2026 05:32

stuartc added 4 commits June 4, 2026 07:37

stuartc force-pushed the defer-dataclip-search-vector branch from 5621332 to 1aa5272 Compare June 4, 2026 05:43

stuartc marked this pull request as ready for review June 4, 2026 05:49

stuartc merged commit f590caa into main Jun 4, 2026
6 checks passed

stuartc deleted the defer-dataclip-search-vector branch June 4, 2026 05:52

github-project-automation Bot moved this from New Issues to Done in Core Jun 4, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Defer dataclip search_vector indexing off the insert path#4821

Defer dataclip search_vector indexing off the insert path#4821
stuartc merged 4 commits into
mainfrom
defer-dataclip-search-vector

stuartc commented Jun 1, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 3, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

stuartc commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Caveat

Validation steps

Additional notes for the reviewer

AI Usage

Pre-submission checklist

Uh oh!

codecov Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented Jun 4, 2026

Security Review ✅

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

stuartc commented Jun 1, 2026 •

edited

Loading

codecov Bot commented Jun 3, 2026 •

edited

Loading