
Optimised raw-knex resource fetching for URL service boot #26689

Draft
rob-ghost wants to merge 4 commits into main from chore/raw-knex-boot-optimisations

Conversation

@rob-ghost
Contributor

The URL service boots by fetching every published post, page, tag, and author from the database via raw-knex.js — a fast path that bypasses Bookshelf's ORM layer. On large sites (250k+ posts), this boot path was taking ~9s. This PR brings it down to ~5.4s (a ~40% reduction) by addressing three distinct bottlenecks in how raw-knex fetches and processes data.

Include lists instead of exclude lists — The resource config previously listed columns to exclude, which meant every new column added to the schema was fetched by default even if the URL service didn't need it. This inverts the approach: each resource type now declares exactly which columns it needs (id, slug, type, featured, dates, images, canonical_url). This makes the contract explicit and keeps the query payload minimal.
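The inverted config can be sketched roughly as follows. This is a hypothetical shape, not the exact structure in Ghost's resources config; the column lists follow the PR description.

```javascript
// Hypothetical shape of an include-list resource config (the real
// structure in raw-knex.js/resources.js may differ). Each resource type
// declares exactly the columns the URL service needs, so new schema
// columns are not fetched by default.
const resourceConfig = {
    posts: {
        include: [
            'id', 'slug', 'type', 'featured',
            'published_at', 'updated_at',
            'feature_image', 'canonical_url'
        ]
    },
    tags: {include: ['id', 'slug', 'visibility']},
    users: {include: ['id', 'slug']}
};

// A query can then select only the declared columns, e.g.:
//   knex('posts').select(resourceConfig.posts.include)
module.exports = resourceConfig;
```

Declaring columns positively also gives change detection a natural source of truth: anything not in the include list can be ignored.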

Removed Bookshelf ORM overhead — After fetching raw rows, the old code wrapped each row in a fake Bookshelf model just to call toJSON(), fixBools(), and fixDatesWhenFetch(). That's a lot of ceremony for what amounts to boolean coercion. This replaces all of that with a direct loop over schema-derived boolean columns, eliminating lodash, prototype binding, and serialize overhead entirely. This single change accounts for the largest speedup (~1.3s on 250k posts).
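The replacement loop can be sketched like this, assuming the boolean columns are derived once per table from the schema; the names below are illustrative, not the PR's exact code.

```javascript
// Sketch of replacing the toJSON()/fixBools() ceremony with a direct
// loop over raw knex rows. BOOLEAN_COLUMNS would be pre-computed once
// per table from the schema; 'featured' is one example from posts.
const BOOLEAN_COLUMNS = ['featured'];

function coerceBooleans(rows, booleanColumns) {
    // MySQL/SQLite return TINYINT 0/1 for boolean columns; coerce each
    // value in place, mutating the raw rows directly instead of wrapping
    // them in model instances.
    for (const row of rows) {
        for (const column of booleanColumns) {
            row[column] = !!row[column];
        }
    }
    return rows;
}

const rows = coerceBooleans(
    [{id: 'a1', featured: 1}, {id: 'a2', featured: 0}],
    BOOLEAN_COLUMNS
);
// rows[0].featured === true, rows[1].featured === false
```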

Parallel split-query relation loading — Relations (tags, authors) were previously loaded sequentially using a JOIN query whose WHERE IN clause contained every post ID as a literal value. This rewrites relation loading to use a split-query pattern: one query for the pivot table (e.g. posts_tags) filtered by a subquery, and a second query for the lookup table (e.g. tags), joined in JS. Because the relation queries use subqueries instead of materialised ID lists, they don't depend on the main query's results — all queries fire in parallel via Promise.all. Posts sharing the same tag or author reference the same object, avoiding hundreds of thousands of per-row allocations.
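The query structure can be sketched as below. The stubbed `db` functions and the knex calls in the comments are illustrative stand-ins for the actual queries in raw-knex.js.

```javascript
// Illustrative structure of the parallel split-query pattern. Because
// the pivot and lookup queries filter by subquery rather than a
// materialised ID list, none of the three depends on another's results,
// so all fire at once via Promise.all.
async function fetchPostsWithTagRelations(db) {
    const [posts, pivotRows, tags] = await Promise.all([
        // e.g. knex('posts').select(columns).where({status: 'published'})
        db.fetchPosts(),
        // e.g. knex('posts_tags').select('post_id', 'tag_id')
        //        .whereIn('post_id', knex('posts').select('id').where(...))
        db.fetchPostsTags(),
        // e.g. knex('tags').select('id', 'slug', 'visibility')
        db.fetchTags()
    ]);
    return {posts, pivotRows, tags};
}

// Usage with stubbed queries standing in for the database:
const db = {
    fetchPosts: async () => [{id: 'p1', slug: 'hello'}],
    fetchPostsTags: async () => [{post_id: 'p1', tag_id: 't1'}],
    fetchTags: async () => [{id: 't1', slug: 'news'}]
};
```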

The PR also includes an integration test suite (commit 0) that boots the URL service with custom fixtures and verifies both URL cache state and sitemap XML output. This covers default and custom routing configurations, ensuring drafts, canonical posts, and orphan tags are correctly excluded. This test suite serves as a safety net for all the optimisation work — any regression in the boot path or event wiring will surface here.

@coderabbitai
Contributor

coderabbitai bot commented Mar 4, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 847337fe-1c39-41f9-b4e2-51ac84a04d77

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



@ErisDS
Member

ErisDS commented Mar 4, 2026

🤖 Velo CI Failure Analysis

Classification: 🟠 SOFT FAIL

  • Workflow: CI
  • Failed Step: Run yarn nx affected -t lint --base=d066dba172c7de052b0d1a22823ac7e8b50bbd9b
  • Run: View failed run
    What failed: Lint errors in the test file: unused variables 'forKnex' and 'markdownToMobiledoc'
    Why: Two unused variables in the test file trip the lint step. This is a code issue for the author to fix; no infrastructure problem is involved.
    Action: Remove the unused variables, or mark them as allowed in the lint configuration.

Safety-net test that boots the URL service with custom fixtures and
verifies both URL resolution and sitemap XML output from the same
entrypoint. Covered outcomes: correct paths per collection, canonical_url
exclusion, draft exclusion, orphan tag exclusion, feature_image in
sitemap image nodes, and multi-collection routing.

@rob-ghost rob-ghost force-pushed the chore/raw-knex-boot-optimisations branch from 7ab0009 to 43cd03b on March 4, 2026 at 13:59
The old config used exclude lists that grew with every new column added
to the schema. Include lists are explicit about what the URL service
needs: only fields used for URL generation (permalink patterns, NQL
filter evaluation), sitemap XML (dates, images, canonical_url), and
runtime change detection. This reduces the query payload for posts from
~20 columns to 10, and similarly for other resource types.

Also updated raw-knex.js to support `include` option for column
selection, and resources.js to derive ignored-properties for change
detection from the include list rather than the exclude list.

raw-knex.js existed specifically to bypass Bookshelf's per-row overhead,
but still called toJSON(), fixBools(), and fixDatesWhenFetch() through
the Bookshelf prototype on every row. toJSON/serialize just shallow-copy
attributes with no meaningful transformation. fixDatesWhenFetch parses
dates with moment.js but knex already returns JavaScript Date objects.
fixBools is the only necessary operation — replaced with a pre-computed
boolean column loop that runs without Bookshelf prototype lookups or
moment.js overhead.

Relation queries (tags, authors) previously used sequential JOINed
queries that materialized every post ID as a literal in WHERE IN,
producing one large row per post-relation pair with repeated target
data. For 250k posts with ~2 tags each, this meant 500k rows with
slug and visibility duplicated on every row, processed sequentially
after the main query completed.

Now uses a split-query pattern (prefetch_related): a lean pivot-table
query (just parent_id + fk, no JOIN) and a tiny lookup query on the
target table, joined in JS via a hash map. Relation queries use
subqueries instead of materialized ID lists, which means they don't
depend on the main query results and can run in parallel via
Promise.all. Also removed lodash dependency — all iteration and
grouping uses native for-of loops and Object.create(null) maps.
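The in-JS join described above can be sketched like this; the function name and parameters are hypothetical, but the technique matches the commit: group lean pivot rows by parent ID and attach shared target objects.

```javascript
// Sketch of the hash-map join. Every post referencing the same tag gets
// a reference to the same object, so no per-row copies are allocated.
// Function and parameter names are illustrative.
function attachRelations(posts, pivotRows, targets, parentFk, targetFk, as) {
    // Object.create(null) gives a plain hash map with no prototype chain
    const targetsById = Object.create(null);
    for (const target of targets) {
        targetsById[target.id] = target;
    }

    // Group the related objects by parent ID in a single pass
    const grouped = Object.create(null);
    for (const row of pivotRows) {
        (grouped[row[parentFk]] = grouped[row[parentFk]] || []).push(targetsById[row[targetFk]]);
    }

    for (const post of posts) {
        post[as] = grouped[post.id] || [];
    }
    return posts;
}

const joined = attachRelations(
    [{id: 'p1'}, {id: 'p2'}],
    [{post_id: 'p1', tag_id: 't1'}, {post_id: 'p2', tag_id: 't1'}],
    [{id: 't1', slug: 'news'}],
    'post_id', 'tag_id', 'tags'
);
// joined[0].tags[0] and joined[1].tags[0] are the same object
```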