bug(docs-sync): greedy frontmatter extraction doubles content for files with markdown HR separators #19

@stackbilt-admin

Summary

scripts/docs-sync.sh line 241 extracts existing frontmatter with:

existing_fm=$(sed -n '/^---$/,/^---$/p' "$existing")

This sed pattern is a range match. After the first range terminates, sed restarts looking for the start pattern; if the body contains additional ^---$ lines (used as markdown horizontal-rule section separators), each pair captures another range. The captured "frontmatter" is then prepended to the upstream content — duplicating swaths of body each time the script runs against a file that uses HR separators.

Side effect: each successive sync makes the file worse, because the pollution is written back to the local file and gets re-captured on the next run.
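A minimal sketch of the behaviour described above, using a synthetic nine-line file rather than the real docs. The sample content and the variable name `greedy_fm` are made up for illustration; only the sed expression is the one quoted from the script.

```shell
#!/bin/sh
# Synthetic sample: 3-line frontmatter, then a body with two HR separators.
existing=$(mktemp)
printf '%s\n' \
  '---' 'title: Demo' '---' \
  '# Heading' 'intro' \
  '---' 'section one' '---' \
  'section two' > "$existing"

# The extraction quoted in this issue: the range restarts at every later ^---$ line.
greedy_fm=$(sed -n '/^---$/,/^---$/p' "$existing")

# Count what was captured; the real frontmatter is only 3 lines long.
greedy_lines=$(printf '%s\n' "$greedy_fm" | wc -l | tr -d ' ')
echo "captured $greedy_lines lines (frontmatter is 3)"
printf '%s\n' "$greedy_fm"
rm -f "$existing"
```

The second captured range (`---`, `section one`, `---`) is body content, which is what ends up prepended to the upstream file.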

Reproduction (just observed today)

Stackbilt-dev/stackbilt-web/docs/api-reference.md upstream has 65 ^---$ lines (HR separators between API sections). After re-syncing on 2026-05-02:

| File | Upstream | Expected (upstream + ~9-line frontmatter) | Actual after sync |
|---|---|---|---|
| mcp.md | 108 lines, 0 HR | 117 | 117 ✓ |
| ecosystem.md | 190 lines, 0 HR | 199 | 199 ✓ |
| platform.md | 110 lines, 0 HR | 119 | 119 ✓ |
| api-reference.md | 1852 lines, 65 HR | ~1862 | 2906 ⚠️ doubled |

Actual ^---$ count in the synced api-reference.md: 132 (2 frontmatter fences + 65 × 2 from the doubled body). Confirmed two # Stackbilder Platform API Reference headings in the file, at lines 1 (within the prepended fragment) and 1055 (start of the upstream body proper).
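The fence arithmetic above suggests a cheap pollution check: a cleanly synced file should contain exactly two more ^---$ lines than its upstream. A rough sketch on synthetic files follows; `count_fences` is a hypothetical helper, not something in docs-sync.sh, and the file contents are invented.

```shell
#!/bin/sh
# count_fences (hypothetical helper): number of lines that are exactly '---'.
count_fences() { grep -c '^---$' "$1"; }

upstream=$(mktemp); local_doc=$(mktemp)
# Synthetic upstream body with 3 HR separators.
printf '%s\n' '# Title' '---' 'a' '---' 'b' '---' 'c' > "$upstream"
# A clean sync result: frontmatter (2 fences), a blank line, then the upstream body.
{ printf '%s\n' '---' 'title: Demo' '---' ''; cat "$upstream"; } > "$local_doc"

up=$(count_fences "$upstream")
lo=$(count_fences "$local_doc")
# Clean output has exactly the upstream fences plus the two frontmatter fences.
if [ "$lo" -eq $((up + 2)) ]; then
  status=clean
else
  status=polluted
fi
echo "upstream=$up local=$lo status=$status"
rm -f "$upstream" "$local_doc"
```

Applied to the numbers in this issue, the check would compare 132 against 65 + 2 = 67 and flag the file.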

Root cause

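sed -n '/A/,/B/p' is documented to print every range from A to B, not just the first. With a single A and a single B, that's one range. With 2 + 2N ^---$ lines (1 frontmatter open + 1 close + N HR pairs), sed prints 1 + N ranges, and the script captures every one of them as existing_fm. If the HR count is odd, the last unpaired ^---$ opens a range that never closes, so sed prints through to the end of the file.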

The fix needs to capture exactly the first matched range, then stop. Three options:

Option 1 — awk (cleanest)

existing_fm=$(awk '/^---$/{if(++n==2){print;exit}} n>=1' "$existing")

Reads first ---, includes it; reads body; reads second ---, includes it, exits. Stops at the closing fence regardless of what's after.

Option 2 — sed with quit

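existing_fm=$(sed '1!{/^---$/q;}' "$existing")

Prints from line 1 and quits at the first ^---$ after line 1 (q prints the current line before exiting). Terse, but fragile: it assumes the opening fence is on line 1.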

Option 3 — head + awk

existing_fm=$(awk '/^---$/{n++} n<=2 {print} n==2 && /^---$/ {exit}' "$existing")

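Also explicit, and awk-only despite the heading: no head is needed. Unlike Option 1, it also prints any lines that precede the opening fence.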

I'd recommend Option 1 — it's the shortest, most readable, and uses awk's natural state machine.
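For reference, a small sketch of Option 1 run against a synthetic file with HR separators in the body. The wrapper name `extract_frontmatter` and the sample content are hypothetical; the awk program is the Option 1 one-liner unchanged.

```shell
#!/bin/sh
# extract_frontmatter (hypothetical wrapper) around the Option 1 awk one-liner.
extract_frontmatter() {
  # Print from the first ^---$ line through the second one, then stop reading.
  awk '/^---$/{if(++n==2){print;exit}} n>=1' "$1"
}

# Synthetic sample: 3-line frontmatter followed by a body with two HR lines.
existing=$(mktemp)
printf '%s\n' \
  '---' 'title: Demo' '---' \
  '# Heading' \
  '---' 'section one' '---' > "$existing"

existing_fm=$(extract_frontmatter "$existing")
fm_lines=$(printf '%s\n' "$existing_fm" | wc -l | tr -d ' ')
echo "captured $fm_lines lines"
rm -f "$existing"
```

One caveat worth stating: if a file has no frontmatter at all, the first ^---$ it finds will be a body HR, so the script may still want a guard that line 1 is a fence.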

Workaround

Currently: only sync files whose body doesn't contain ^---$ HR separators. Or repair the polluted output by hand:

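# Write to a temp file first: redirecting straight to "$existing_local" would truncate it before head reads it.
tmp=$(mktemp) && { head -8 "$existing_local"; echo; cat "$upstream_canonical"; } > "$tmp" && mv "$tmp" "$existing_local"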

Worked around manually for api-reference.md on 2026-05-02 (commit 2bcd7cc in this repo): the api-reference.md change was reverted to the previous-sync state, which was already current, and only the clean files (mcp.md, ecosystem.md, platform.md) were committed.

Severity

  • Doubles content silently — easy to miss in a --dry-run since dry-run still calls the same code path
  • Each subsequent sync compounds the problem — pollution is written back, then re-captured on next run
  • Caught here only because api-reference.md jumped from 1861 → 2906 lines and triggered curiosity
  • Files without HR separators in the body are unaffected — explains why mcp.md / ecosystem.md / platform.md synced cleanly

Test plan after fix

  1. Run ./scripts/docs-sync.sh --dry-run --source stackbilt-web — expected: 0 files marked for update if local matches upstream
  2. Touch a single line in stackbilt-web/docs/api-reference.md upstream; re-run the sync
  3. Verify the resulting src/content/docs/api-reference.md is ~1860 lines (frontmatter + upstream body), not ~2900
  4. Re-run the sync immediately — should report 0 changes (idempotent), not double again
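Step 4 could also be covered by a standalone regression check that doesn't need the real repos. The sketch below models the per-file step as described in this issue (keep local frontmatter, prepend it to the upstream body); `sync_file` is a hypothetical stand-in, not the actual function in docs-sync.sh, and the files are synthetic.

```shell
#!/bin/sh
# sync_file (hypothetical stand-in for the script's per-file step):
# keep the local frontmatter, replace the body with the upstream content.
sync_file() {
  dst=$1; src=$2
  tmp=$(mktemp)
  # Option 1 extraction: stops at the closing frontmatter fence.
  fm=$(awk '/^---$/{if(++n==2){print;exit}} n>=1' "$dst")
  # Frontmatter, one blank line, then the upstream body.
  { printf '%s\n\n' "$fm"; cat "$src"; } > "$tmp"
  mv "$tmp" "$dst"
}

upstream=$(mktemp); doc=$(mktemp)
# Synthetic upstream body with two HR separators, and a stale local copy.
printf '%s\n' '# Title' '---' 'a' '---' 'b' > "$upstream"
printf '%s\n' '---' 'title: Demo' '---' '' 'stale body' > "$doc"

sync_file "$doc" "$upstream"
first=$(cksum < "$doc")
sync_file "$doc" "$upstream"
second=$(cksum < "$doc")

lines=$(wc -l < "$doc" | tr -d ' ')
if [ "$first" = "$second" ]; then
  echo "idempotent, $lines lines"
else
  echo "NOT idempotent"
fi
rm -f "$upstream" "$doc"
```

Swapping the awk line for the current greedy sed expression should make the second checksum differ, which is the regression this issue describes.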

Related

  • Surfaced 2026-05-02 during the post-consolidation docs-site update (8952ffb + 2bcd7cc).

🤖 Generated with Claude Code
