Wire regex format presets into sources #21

Merged
simonsmallchua merged 2 commits into main from work/distracted-kare-ad4b71
May 6, 2026
Conversation

@simonsmallchua simonsmallchua commented May 6, 2026

Summary

  • [[sources]] accepts format = "<preset>" (one of json, apache-combined, nginx-default, syslog-rfc5424); the iteration parser routes through the format layer when set, so non-JSON payloads parse cleanly without forking.
  • format_keys stays JSON-only and is rejected when combined with a non-json format so the operator's intent is never silently dropped.
  • Cursor-filter compatibility is documented honestly: it still keys on a leading ISO timestamp, so non-leading-TS shapes (Apache combined, nginx default) are most useful end-to-end with sources that don't rely on overlap dedup. The wiring is in place for when file / kubectl / cloudwatch land.

Top-level v0.2 shortlist item from docs/ROADMAP.md.

Changes

  • src/paperbark/iteration.py: new line_format parameter on summarise_lines / summarise_log_file; format-driven path mirrors the JSON shape, treats all-empty regex matches as failed_to_parse.
  • src/paperbark/dispatcher.py: new _resolve_format resolver against paperbark.formats.registered_formats; flyctl branch accepts format, conflict-checks with format_keys, attaches source.line_format. capture_iteration reads the attribute and threads it through.
  • src/paperbark/sources/flyctl.py: accepts line_format constructor arg.
  • Docs: docs/CONFIG.md, docs/SOURCES.md, docs/ROADMAP.md, CHANGELOG.md updated.
  • Tests: 9 new tests covering preset resolution, conflict rejection, format-driven parsing (Apache combined + RFC 5424 syslog priority→level), and dispatcher-level wiring.
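
Two behaviours from the list above lend themselves to a short sketch: counting an all-empty regex match as `failed_to_parse`, and deriving a level from the RFC 5424 priority. The patterns and the `parse_with_format` / `summarise` helpers below are hypothetical and much simpler than whatever `paperbark.formats` ships; the level names are the standard syslog severity keywords, which may not be the names paperbark emits. The arithmetic (severity = PRI mod 8) is from RFC 5424.

```python
import re
from typing import Optional

# Syslog severity keywords, indexed by severity code 0..7.
SEVERITIES = ("emerg", "alert", "crit", "err", "warning", "notice", "info", "debug")

# Simplified RFC 5424 header; everything after HOSTNAME is lumped into `message`.
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>\d+ (?P<timestamp>\S+) (?P<host>\S+) (?P<message>.*)$"
)

# A loose pattern whose groups are all optional, so it can match and capture nothing.
LOOSE_RE = re.compile(r"^(?P<level>[A-Z]+)?\s*(?P<message>\S.*)?$")


def parse_with_format(pattern: re.Pattern, line: str) -> Optional[dict]:
    """Return the non-empty captured fields, or None if the line did not parse."""
    match = pattern.match(line)
    if match is None:
        return None
    fields = {key: value for key, value in match.groupdict().items() if value}
    # A match that captured nothing is treated the same as no match.
    return fields or None


def summarise(pattern: re.Pattern, lines: list) -> dict:
    parsed, failed = [], 0
    for line in lines:
        fields = parse_with_format(pattern, line)
        if fields is None:
            failed += 1
            continue
        if "pri" in fields:
            # PRI = facility * 8 + severity, so severity is PRI mod 8.
            fields["level"] = SEVERITIES[int(fields["pri"]) % 8]
        parsed.append(fields)
    return {"parsed": parsed, "failed_to_parse": failed}


result = summarise(
    SYSLOG_RE,
    ["<34>1 2003-10-11T22:14:15.003Z mymachine.example.com su - ID47 - failed", "noise"],
)
print(result["parsed"][0]["level"], result["failed_to_parse"])
```

With PRI 34 the facility is 4 and the severity is 2, so the first line maps to `crit`; the second line does not match and is counted as a failure.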

Test plan

  • uv run ruff check src/ tests/
  • uv run ruff format --check src/ tests/
  • uv run mypy src/
  • uv run pytest -q (368 passed)
  • uv run --with pip-audit pip-audit (no vulnerabilities)
  • uv run pre-commit run --files <touched> (all hooks pass)

Summary by CodeRabbit

  • New Features

    • Added a format option for sources with presets (json, apache-combined, nginx-default, syslog-rfc5424) to parse non‑JSON logs.
    • format_keys is now JSON‑only and cannot be combined with non‑JSON formats.
  • Documentation

    • Updated configuration docs, sources reference and roadmap with format option, presets and examples.
  • Tests

    • Added tests covering preset wiring, validation and summarisation behaviour.

@blacksmith-sh

blacksmith-sh Bot commented May 6, 2026

Blacksmith Account Suspended

This Blacksmith account requires additional verification. Jobs targeting Blacksmith runners will not be picked up and will remain queued until they timeout.

Please contact Blacksmith Support for assistance.


@coderabbitai

coderabbitai Bot commented May 6, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 11b983d4-222c-44e8-948e-7eb54e5acaa7

📥 Commits

Reviewing files that changed from the base of the PR and between 75afb77 and 78cfdd5.

📒 Files selected for processing (3)
  • docs/CONFIG.md
  • src/paperbark/iteration.py
  • tests/test_iteration.py

Walkthrough

This PR wires named regex-format presets into the Flyctl source path and the iteration summariser: a new format option selects a preset (e.g., apache-combined, nginx-default, syslog-rfc5424) causing lines to be parsed via a Format layer; format_keys remains JSON-only and is rejected with a non-JSON format.

Changes

Format Preset Support for Sources

| Layer | File(s) | Summary |
|---|---|---|
| Type & Configuration Foundation | src/paperbark/sources/flyctl.py, src/paperbark/dispatcher.py | FlyctlSource gains a `line_format: Format…` |
| Format Resolution & Validation | src/paperbark/dispatcher.py | New _JSON_FORMAT_NAME constant and _resolve_format(raw, source_name) helper resolve preset names to registered Format instances, raise DispatcherError for unknown presets, and enable validation logic preventing incompatible combos. |
| Source Configuration Wiring | src/paperbark/dispatcher.py | Flyctl source parsing accepts "format" in options; dispatcher computes line_format via _resolve_format() and parses format_keys, enforcing that format_keys cannot be combined with a non-JSON line_format. |
| Iteration Integration / Parsing Path | src/paperbark/iteration.py, src/paperbark/dispatcher.py | summarise_lines() and summarise_log_file() signatures accept line_format; new _summarise_with_format() processes records through a Format, aggregates per-minute buckets and emits flat rows. capture_iteration() forwards line_format and format_keys into summarisation. |
| Helpers & Internals | src/paperbark/iteration.py | New helpers _format_record_parsed() and _format_minute_key() plus relocated parsing helpers centralise JSON and format-aware utilities to support both summarisation paths. |
| Tests & Documentation | tests/test_dispatcher.py, tests/test_iteration.py, CHANGELOG.md, docs/CONFIG.md, docs/SOURCES.md, docs/ROADMAP.md | Added tests asserting preset resolution (apache-combined, syslog-rfc5424), explicit json yields no line_format, unknown presets raise DispatcherError, and combining format with format_keys raises an error. Docs and changelog describe format/format_keys, examples, and roadmap status. |
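
The per-minute aggregation mentioned for the iteration path can be illustrated with Apache combined lines. This is a sketch under stated assumptions, not `_summarise_with_format()` itself: the regex, the `minute_key` and `bucket_by_minute` names, the choice to bucket on `(minute, status)`, and the row shape are all hypothetical. Note that `%b` in `strptime` is locale-dependent; the example assumes an English or C locale.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Illustrative pattern for the start of an Apache combined line.
APACHE_RE = re.compile(
    r"^(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] "
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def minute_key(raw: str) -> str:
    """Truncate an Apache-style timestamp to a UTC minute bucket."""
    ts = datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")


def bucket_by_minute(lines: list) -> list:
    """Count lines per (minute, status) and emit one flat row per bucket."""
    buckets = Counter()
    for line in lines:
        match = APACHE_RE.match(line)
        if match is None:
            continue  # unparsed lines are skipped in this sketch
        buckets[(minute_key(match["timestamp"]), match["status"])] += 1
    return [
        {"minute": minute, "status": status, "count": count}
        for (minute, status), count in sorted(buckets.items())
    ]


LINES = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /a HTTP/1.0" 200 2326',
    '127.0.0.1 - frank [10/Oct/2000:13:55:59 -0700] "GET /b HTTP/1.0" 200 10',
    '127.0.0.1 - - [10/Oct/2000:13:56:01 -0700] "GET /c HTTP/1.0" 404 -',
]
for row in bucket_by_minute(LINES):
    print(row)
```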

Possibly related PRs

  • Good-Native/paperbark#15: Wires regex-format presets into the dispatcher and iteration pipeline — directly implements the core feature described in planning.
  • Good-Native/paperbark#5: Earlier dispatcher/source construction changes that this PR extends by adding per-source line_format resolution and propagation.
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarises the main change: adding regex format preset support to sources configuration, which is well-reflected across all modified files and tests. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/CONFIG.md`:
- Around line 238-245: The note currently warns that `format =
"apache-combined"` on a `flyctl` source will be dropped by paperbark's cursor
filter because lines lack a leading ISO timestamp; update this caveat to also
mention `syslog-rfc5424` (RFC5424 lines start with `<PRI>1` and likewise lack a
leading ISO timestamp) so readers know `format = "syslog-rfc5424"` will be
dropped by the cursor filter before parsing; mirror the same wording change in
the corresponding ROADMAP/CHANGELOG entries and ensure the docs reference
`format = "apache-combined"` and `format = "syslog-rfc5424"` explicitly so
there's no ambiguity about flyctl compatibility.
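
To make the caveat concrete, here is a minimal sketch of a filter that keys on a leading ISO timestamp. The regex and the `has_leading_iso_timestamp` name are assumptions for illustration; paperbark's actual cursor filter may be stricter or looser. The sample lines are invented.

```python
import re

# Assumed shape of the "leading ISO timestamp" check.
LEADING_ISO_TS = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")


def has_leading_iso_timestamp(line: str) -> bool:
    """True when the line begins with something like 2026-05-06T01:02:03."""
    return LEADING_ISO_TS.match(line) is not None


SAMPLES = {
    "leading-ts": '2026-05-06T01:02:03Z {"level": "info"}',
    "apache-combined": '127.0.0.1 - - [06/May/2026:01:02:03 +0000] "GET / HTTP/1.1" 200 12',
    "syslog-rfc5424": "<34>1 2026-05-06T01:02:03Z host app - - - started",
}

for name, line in SAMPLES.items():
    verdict = "kept" if has_leading_iso_timestamp(line) else "dropped"
    print(f"{name}: {verdict}")
```

The RFC 5424 line does contain an ISO timestamp, but it follows the `<PRI>1` prefix, so a check anchored at the start of the line rejects it just as it rejects the Apache line.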

In `@src/paperbark/iteration.py`:
- Around line 102-108: summarise_lines currently silently ignores format_keys
when line_format is provided; mirror build_source's behavior by rejecting the
combination at the API boundary: inside summarise_lines (the branch that calls
_summarise_with_format when line_format is not None) detect if format_keys is
not None and raise a clear ValueError (or custom exception) indicating
format_keys is incompatible with line_format (referencing format_keys and
line_format in the message); this ensures the same fail-closed behavior as
build_source and prevents the silent contract change when callers use
summarise_lines calling _summarise_with_format.
- Around line 319-327: The helper _format_record_parsed currently treats a
record as parsed only if one of timestamp, level, message, or component is
non-empty, which misclassifies records that only populate status or duration_ms;
update _format_record_parsed to also consider record.status and
record.duration_ms (or their equivalent keys on the parsed record) when deciding
if a record is parsed so that any non-empty canonical field including status or
duration_ms returns True; reference the _format_record_parsed function and
ensure the boolean expression includes record.status and record.duration_ms.
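
Both iteration.py findings can be sketched in a few lines. The `Record` dataclass, the field types, and the stubbed body of `summarise_lines` are hypothetical stand-ins; only the field names, the two parameter names, and the suggested `ValueError` come from the comments above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical stand-in for the record a Format produces."""

    timestamp: str = ""
    level: str = ""
    message: str = ""
    component: str = ""
    status: str = ""
    duration_ms: Optional[float] = None


def format_record_parsed(record: Record) -> bool:
    """A record counts as parsed if any canonical field is populated."""
    fields = (
        record.timestamp,
        record.level,
        record.message,
        record.component,
        record.status,       # included per the review comment
        record.duration_ms,  # 0.0 is a real value, so compare against "" and None
    )
    return any(value not in ("", None) for value in fields)


def summarise_lines(lines, *, format_keys=None, line_format=None) -> str:
    """Stub that only shows the fail-closed guard at the API boundary."""
    if line_format is not None and format_keys is not None:
        raise ValueError("format_keys is incompatible with line_format; pass only one")
    # Return which path would run; the real summarisation is out of scope here.
    return "format" if line_format is not None else "json"


print(format_record_parsed(Record(status="200")))
print(format_record_parsed(Record()))
```

Checking `value not in ("", None)` rather than truthiness keeps a zero duration from being misread as an empty field.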

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: f48d21f9-4e41-4846-8ecb-234f6796fba9

📥 Commits

Reviewing files that changed from the base of the PR and between 5736bb4 and 75afb77.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • CHANGELOG.md
  • docs/CONFIG.md
  • docs/ROADMAP.md
  • docs/SOURCES.md
  • src/paperbark/dispatcher.py
  • src/paperbark/iteration.py
  • src/paperbark/sources/flyctl.py
  • tests/test_dispatcher.py
  • tests/test_iteration.py

Comment thread docs/CONFIG.md Outdated
Comment thread src/paperbark/iteration.py
Comment thread src/paperbark/iteration.py Outdated
@simonsmallchua simonsmallchua merged commit 4ef86d1 into main May 6, 2026
5 checks passed
This was referenced May 6, 2026
This was referenced May 9, 2026