Optimised raw-knex resource fetching for URL service boot#26689
Draft
Safety-net test that boots the URL service with custom fixtures and verifies both URL resolution and sitemap XML output from the same entrypoint. Tests outcomes: correct paths per collection, canonical_url exclusion, draft exclusion, orphan tag exclusion, feature_image in sitemap image nodes, and multi-collection routing.
The old config used exclude lists that grew with every new column added to the schema. Include lists are explicit about what the URL service needs: only fields used for URL generation (permalink patterns, NQL filter evaluation), sitemap XML (dates, images, canonical_url), and runtime change detection. This reduces the query payload for posts from ~20 columns to 10, and similarly for other resource types. Also updated raw-knex.js to support `include` option for column selection, and resources.js to derive ignored-properties for change detection from the include list rather than the exclude list.
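A minimal sketch of the include-list idea, assuming a simplified config shape. The `resourceConfig` object, the column names, and both helper functions are hypothetical and are not taken from the actual `raw-knex.js` or `resources.js` code; in particular, deriving ignored properties as "every schema column not in the include list" is one plausible reading of the description above.

```javascript
// Hypothetical resource config: each resource type declares the columns it needs.
const resourceConfig = {
    type: 'posts',
    include: ['id', 'slug', 'type', 'featured', 'published_at', 'updated_at', 'feature_image', 'canonical_url']
};

// Turn the include list into qualified column names for a SELECT.
// With knex, an array like this could be passed to .select().
function columnsFor(config, tableName) {
    return config.include.map(column => `${tableName}.${column}`);
}

// Assumption: properties ignored by change detection are the schema
// columns that the URL service never fetches.
function ignoredProperties(schemaColumns, include) {
    const wanted = new Set(include);
    return schemaColumns.filter(column => !wanted.has(column));
}
```

With this shape, a column added to the schema later is ignored by default until someone adds it to `include`, which is the inversion the commit describes.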
raw-knex.js existed specifically to bypass Bookshelf's per-row overhead, but still called toJSON(), fixBools(), and fixDatesWhenFetch() through the Bookshelf prototype on every row. toJSON/serialize just shallow-copy attributes with no meaningful transformation. fixDatesWhenFetch parses dates with moment.js but knex already returns JavaScript Date objects. fixBools is the only necessary operation — replaced with a pre-computed boolean column loop that runs without Bookshelf prototype lookups or moment.js overhead.
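The boolean coercion described above can be sketched roughly as follows. The `schema` object and the function names are illustrative assumptions, not the PR's actual code; the point is only that the list of boolean columns is computed once per table, and each row is then fixed with a plain loop.

```javascript
// Hypothetical schema fragment; the real schema has many more columns.
const schema = {
    posts: {
        id: {type: 'string'},
        slug: {type: 'string'},
        featured: {type: 'boolean'}
    }
};

// Computed once per table, not once per row.
function booleanColumns(tableSchema) {
    const columns = [];
    for (const [name, definition] of Object.entries(tableSchema)) {
        if (definition.type === 'boolean') {
            columns.push(name);
        }
    }
    return columns;
}

// Coerce 0/1 values (as some database drivers return them) to real booleans, in place.
function fixBools(rows, boolColumns) {
    for (const row of rows) {
        for (const column of boolColumns) {
            if (row[column] !== undefined && row[column] !== null) {
                row[column] = Boolean(row[column]);
            }
        }
    }
    return rows;
}
```

Columns that are missing or `null` are left untouched in this sketch, so a row fetched with a narrower include list passes through unchanged.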
Relation queries (tags, authors) previously used sequential JOINed queries that materialized every post ID as a literal in WHERE IN, producing one large row per post-relation pair with repeated target data. For 250k posts with ~2 tags each, this meant 500k rows with slug and visibility duplicated on every row, processed sequentially after the main query completed. Now uses a split-query pattern (prefetch_related): a lean pivot-table query (just parent_id + fk, no JOIN) and a tiny lookup query on the target table, joined in JS via a hash map. Relation queries use subqueries instead of materialized ID lists, which means they don't depend on the main query results and can run in parallel via Promise.all. Also removed lodash dependency — all iteration and grouping uses native for-of loops and Object.create(null) maps.
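The split-query pattern can be illustrated with in-memory stand-ins for the three queries. Everything below is a sketch under assumptions: the `fetch*` functions are hypothetical placeholders for the real database queries, and the column names (`post_id`, `tag_id`) are chosen for illustration.

```javascript
// Stand-ins for independent database queries; none depends on another's result.
async function fetchPosts() {
    return [{id: 'p1', slug: 'hello'}, {id: 'p2', slug: 'world'}, {id: 'p3', slug: 'untagged'}];
}

// Lean pivot query: just parent id + foreign key, no JOIN.
async function fetchPivotRows() {
    return [
        {post_id: 'p1', tag_id: 't1'},
        {post_id: 'p1', tag_id: 't2'},
        {post_id: 'p2', tag_id: 't1'}
    ];
}

// Small lookup query on the target table.
async function fetchTags() {
    return [
        {id: 't1', slug: 'news', visibility: 'public'},
        {id: 't2', slug: 'internal-notes', visibility: 'internal'}
    ];
}

async function loadPostsWithTags() {
    // All three queries start at once.
    const [posts, pivotRows, tags] = await Promise.all([fetchPosts(), fetchPivotRows(), fetchTags()]);

    // Hash map from tag id to the single tag object.
    const tagsById = Object.create(null);
    for (const tag of tags) {
        tagsById[tag.id] = tag;
    }

    // Group tag objects by post id; posts sharing a tag share the same object.
    const tagsByPostId = Object.create(null);
    for (const row of pivotRows) {
        const tag = tagsById[row.tag_id];
        if (!tag) {
            continue;
        }
        if (!tagsByPostId[row.post_id]) {
            tagsByPostId[row.post_id] = [];
        }
        tagsByPostId[row.post_id].push(tag);
    }

    for (const post of posts) {
        post.tags = tagsByPostId[post.id] || [];
    }
    return posts;
}
```

Because each tag is looked up in `tagsById` and never copied, the slug and visibility data exist once per tag instead of once per post-tag pair.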
The URL service boots by fetching every published post, page, tag, and author from the database via `raw-knex.js`, a fast path that bypasses Bookshelf's ORM layer. On large sites (250k+ posts), this boot path was taking ~9s. This PR brings it down to ~5.4s (a 39% reduction) by addressing three distinct bottlenecks in how raw-knex fetches and processes data.

Include lists instead of exclude lists: The resource config previously listed columns to exclude, which meant every new column added to the schema was fetched by default even if the URL service didn't need it. This inverts the approach: each resource type now declares exactly which columns it needs (id, slug, type, featured, dates, images, canonical_url). This makes the contract explicit and keeps the query payload minimal.

Removed Bookshelf ORM overhead: After fetching raw rows, the old code wrapped each row in a fake Bookshelf model just to call `toJSON()`, `fixBools()`, and `fixDatesWhenFetch()`. That's a lot of ceremony for what amounts to boolean coercion. This replaces all of that with a direct loop over schema-derived boolean columns, eliminating lodash, prototype binding, and serialize overhead entirely. This single change accounts for the largest speedup (~1.3s on 250k posts).

Parallel split-query relation loading: Relations (tags, authors) were previously loaded sequentially using a JOIN query whose `WHERE IN` clause contained every post ID as a literal value. This rewrites relation loading to use a split-query pattern: one query for the pivot table (e.g. `posts_tags`) filtered by a subquery, and a second query for the lookup table (e.g. `tags`), joined in JS. Because the relation queries use subqueries instead of materialised ID lists, they don't depend on the main query's results, so all queries fire in parallel via `Promise.all`. Posts sharing the same tag or author reference the same object, avoiding hundreds of thousands of per-row allocations.

The PR also includes an integration test suite (commit 0) that boots the URL service with custom fixtures and verifies both URL cache state and sitemap XML output. This covers default and custom routing configurations, ensuring drafts, canonical posts, and orphan tags are correctly excluded. This test suite serves as a safety net for all the optimisation work: any regression in the boot path or event wiring will surface here.