
Optimised raw-knex resource fetching for URL service boot #26689

Draft
rob-ghost wants to merge 4 commits into main from chore/raw-knex-boot-optimisations

Conversation

@rob-ghost
Contributor

The URL service boots by fetching every published post, page, tag, and author from the database via raw-knex.js — a fast path that bypasses Bookshelf's ORM layer. On large sites (250k+ posts), this boot path was taking ~9s. This PR brings it down to ~5.4s (a ~40% reduction) by addressing three distinct bottlenecks in how raw-knex fetches and processes data.

Include lists instead of exclude lists — The resource config previously listed columns to exclude, which meant every new column added to the schema was fetched by default even if the URL service didn't need it. This inverts the approach: each resource type now declares exactly which columns it needs (id, slug, type, featured, dates, images, canonical_url). This makes the contract explicit and keeps the query payload minimal.
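The inverted config can be sketched roughly as follows. This is a hypothetical shape, not the exact structure in Ghost's resources config; the column lists follow the PR description.

```javascript
// Hypothetical shape of an include-list resource config (the real
// structure in raw-knex.js/resources.js may differ). Each resource type
// declares exactly the columns the URL service needs, so new schema
// columns are not fetched by default.
const resourceConfig = {
    posts: {
        include: [
            'id', 'slug', 'type', 'featured',
            'published_at', 'updated_at',
            'feature_image', 'canonical_url'
        ]
    },
    tags: {include: ['id', 'slug', 'visibility']},
    users: {include: ['id', 'slug']}
};

// A query can then select only the declared columns, e.g.:
//   knex('posts').select(resourceConfig.posts.include)
module.exports = resourceConfig;
```

Declaring columns positively also gives change detection a natural source of truth: anything not in the include list can be ignored.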

Removed Bookshelf ORM overhead — After fetching raw rows, the old code wrapped each row in a fake Bookshelf model just to call toJSON(), fixBools(), and fixDatesWhenFetch(). That's a lot of ceremony for what amounts to boolean coercion. This replaces all of that with a direct loop over schema-derived boolean columns, eliminating lodash, prototype binding, and serialize overhead entirely. This single change accounts for the largest speedup (~1.3s on 250k posts).
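The replacement loop can be sketched like this, assuming the boolean columns are derived once per table from the schema; the names below are illustrative, not the PR's exact code.

```javascript
// Sketch of replacing the toJSON()/fixBools() ceremony with a direct
// loop over raw knex rows. BOOLEAN_COLUMNS would be pre-computed once
// per table from the schema; 'featured' is one example from posts.
const BOOLEAN_COLUMNS = ['featured'];

function coerceBooleans(rows, booleanColumns) {
    // MySQL/SQLite return TINYINT 0/1 for boolean columns; coerce each
    // value in place, mutating the raw rows directly instead of wrapping
    // them in model instances.
    for (const row of rows) {
        for (const column of booleanColumns) {
            row[column] = !!row[column];
        }
    }
    return rows;
}

const rows = coerceBooleans(
    [{id: 'a1', featured: 1}, {id: 'a2', featured: 0}],
    BOOLEAN_COLUMNS
);
// rows[0].featured === true, rows[1].featured === false
```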

Parallel split-query relation loading — Relations (tags, authors) were previously loaded sequentially using a JOIN query whose WHERE IN clause contained every post ID as a literal value. This rewrites relation loading to use a split-query pattern: one query for the pivot table (e.g. posts_tags) filtered by a subquery, and a second query for the lookup table (e.g. tags), joined in JS. Because the relation queries use subqueries instead of materialised ID lists, they don't depend on the main query's results — all queries fire in parallel via Promise.all. Posts sharing the same tag or author reference the same object, avoiding hundreds of thousands of per-row allocations.
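The query structure can be sketched as below. The stubbed `db` functions and the knex calls in the comments are illustrative stand-ins for the actual queries in raw-knex.js.

```javascript
// Illustrative structure of the parallel split-query pattern. Because
// the pivot and lookup queries filter by subquery rather than a
// materialised ID list, none of the three depends on another's results,
// so all fire at once via Promise.all.
async function fetchPostsWithTagRelations(db) {
    const [posts, pivotRows, tags] = await Promise.all([
        // e.g. knex('posts').select(columns).where({status: 'published'})
        db.fetchPosts(),
        // e.g. knex('posts_tags').select('post_id', 'tag_id')
        //        .whereIn('post_id', knex('posts').select('id').where(...))
        db.fetchPostsTags(),
        // e.g. knex('tags').select('id', 'slug', 'visibility')
        db.fetchTags()
    ]);
    return {posts, pivotRows, tags};
}

// Usage with stubbed queries standing in for the database:
const db = {
    fetchPosts: async () => [{id: 'p1', slug: 'hello'}],
    fetchPostsTags: async () => [{post_id: 'p1', tag_id: 't1'}],
    fetchTags: async () => [{id: 't1', slug: 'news'}]
};
```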

The PR also includes an integration test suite (commit 0) that boots the URL service with custom fixtures and verifies both URL cache state and sitemap XML output. This covers default and custom routing configurations, ensuring drafts, canonical posts, and orphan tags are correctly excluded. This test suite serves as a safety net for all the optimisation work — any regression in the boot path or event wiring will surface here.

@coderabbitai
Contributor

coderabbitai bot commented Mar 4, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 847337fe-1c39-41f9-b4e2-51ac84a04d77

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



@ErisDS
Member

ErisDS commented Mar 4, 2026

🤖 Velo CI Failure Analysis

Classification: 🟠 SOFT FAIL

  • Workflow: CI
  • Failed Step: Run yarn nx affected -t lint --base=d066dba172c7de052b0d1a22823ac7e8b50bbd9b
  • Run: View failed run
    What failed: Lint errors in the test file: unused variables 'forKnex' and 'markdownToMobiledoc'
    Why: Two unused variables in the test file trip the lint step. This is a code issue for the author to fix; no infrastructure problem is involved.
    Action: Remove the unused variables, or mark them as allowed in the lint configuration.

Safety-net test that boots the URL service with custom fixtures and
verifies both URL resolution and sitemap XML output from the same
entrypoint. Covered outcomes: correct paths per collection, canonical_url
exclusion, draft exclusion, orphan tag exclusion, feature_image in
sitemap image nodes, and multi-collection routing.

@rob-ghost rob-ghost force-pushed the chore/raw-knex-boot-optimisations branch from 7ab0009 to 43cd03b on March 4, 2026 at 13:59
The old config used exclude lists that grew with every new column added
to the schema. Include lists are explicit about what the URL service
needs: only fields used for URL generation (permalink patterns, NQL
filter evaluation), sitemap XML (dates, images, canonical_url), and
runtime change detection. This reduces the query payload for posts from
~20 columns to 10, and similarly for other resource types.

Also updated raw-knex.js to support `include` option for column
selection, and resources.js to derive ignored-properties for change
detection from the include list rather than the exclude list.

raw-knex.js existed specifically to bypass Bookshelf's per-row overhead,
but still called toJSON(), fixBools(), and fixDatesWhenFetch() through
the Bookshelf prototype on every row. toJSON/serialize just shallow-copy
attributes with no meaningful transformation. fixDatesWhenFetch parses
dates with moment.js but knex already returns JavaScript Date objects.
fixBools is the only necessary operation — replaced with a pre-computed
boolean column loop that runs without Bookshelf prototype lookups or
moment.js overhead.

Relation queries (tags, authors) previously used sequential JOINed
queries that materialized every post ID as a literal in WHERE IN,
producing one large row per post-relation pair with repeated target
data. For 250k posts with ~2 tags each, this meant 500k rows with
slug and visibility duplicated on every row, processed sequentially
after the main query completed.

Now uses a split-query pattern (prefetch_related): a lean pivot-table
query (just parent_id + fk, no JOIN) and a tiny lookup query on the
target table, joined in JS via a hash map. Relation queries use
subqueries instead of materialized ID lists, which means they don't
depend on the main query results and can run in parallel via
Promise.all. Also removed lodash dependency — all iteration and
grouping uses native for-of loops and Object.create(null) maps.
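The in-JS join described above can be sketched like this; the function name and parameters are hypothetical, but the technique matches the commit: group lean pivot rows by parent ID and attach shared target objects.

```javascript
// Sketch of the hash-map join. Every post referencing the same tag gets
// a reference to the same object, so no per-row copies are allocated.
// Function and parameter names are illustrative.
function attachRelations(posts, pivotRows, targets, parentFk, targetFk, as) {
    // Object.create(null) gives a plain hash map with no prototype chain
    const targetsById = Object.create(null);
    for (const target of targets) {
        targetsById[target.id] = target;
    }

    // Group the related objects by parent ID in a single pass
    const grouped = Object.create(null);
    for (const row of pivotRows) {
        (grouped[row[parentFk]] = grouped[row[parentFk]] || []).push(targetsById[row[targetFk]]);
    }

    for (const post of posts) {
        post[as] = grouped[post.id] || [];
    }
    return posts;
}

const joined = attachRelations(
    [{id: 'p1'}, {id: 'p2'}],
    [{post_id: 'p1', tag_id: 't1'}, {post_id: 'p2', tag_id: 't1'}],
    [{id: 't1', slug: 'news'}],
    'post_id', 'tag_id', 'tags'
);
// joined[0].tags[0] and joined[1].tags[0] are the same object
```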