Add website contribution import support #333

Merged
PrzemyslawKlys merged 9 commits into main from codex/contribution-author-metadata
Apr 30, 2026

@PrzemyslawKlys
Member

Summary

  • add powerforge-web contributions validate/import for public Website.Contributions bundles
  • validate post folder shape, Markdown image references, author profiles, asset types, and size limits
  • import posts, assets, and author catalog data into the private website repo
  • emit author metadata into Article JSON-LD, OpenGraph, RSS, Atom, and JSON Feed output

Verification

  • dotnet build C:/Support/GitHub/PSPublishModule/PowerForge.Web.Cli/PowerForge.Web.Cli.csproj -c Debug -f net8.0
  • dotnet C:/Support/GitHub/PSPublishModule/PowerForge.Web.Cli/bin/Debug/net8.0/PowerForge.Web.Cli.dll contributions validate --root C:/Support/GitHub/Website.Contributions
  • Website rebuild completed through ./build.ps1 -Only build -SkipSourcesSync -SkipContentProcessing -ExpectedOutputPath _site from the dependent website branch

PrzemyslawKlys force-pushed the codex/contribution-author-metadata branch from 6afaa76 to 19a06cd on April 29, 2026 19:48
@claude

claude Bot commented Apr 29, 2026

Code Review — PR #333: Add website contribution import support

Overview

This PR adds a contributions command to the PowerForge Web CLI, enabling validation and import of external contributor post bundles into a website repository. The implementation spans ~1,100 lines across new models, a core processor, CLI command handler, and author metadata wiring into RSS/Atom/JSON Feed/OpenGraph/JSON-LD outputs. The scope is well-structured, the security surface is generally considered (path-traversal guards, URL validation, asset size limits), and the logic is readable. Below are the specific issues and suggestions found during review.


Bugs / Correctness

1. FrontMatterImageRegex rewrites image: outside front matter

RewritePostMarkdown runs FrontMatterImageRegex.Replace(markdown, ...) against the entire markdown string, not just the front matter block. Because the regex uses (?m)^ it will match image: at the start of any line — including inside YAML or TOML fenced code blocks in the body. A contributed article demonstrating YAML configuration with an image: key would have that content silently rewritten.

Fix: extract the front matter block, apply the regex only to it, then reconstruct.
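A minimal sketch of that fix, assuming LF line endings, a front matter block delimited by --- lines, and an unquoted image value. The FrontMatterScope class, its ImageKey pattern, and the mapTarget callback are hypothetical stand-ins for illustration, not the PR's actual FrontMatterImageRegex or rewrite logic:

```csharp
using System;
using System.Text.RegularExpressions;

public static class FrontMatterScope
{
    // Hypothetical simplified pattern; the real FrontMatterImageRegex also handles quotes.
    private static readonly Regex ImageKey = new(
        @"(?m)^(?<prefix>image\s*:\s*)(?<target>\S+)[ \t]*$",
        RegexOptions.CultureInvariant, TimeSpan.FromSeconds(1));

    public static string RewriteImage(string markdown, Func<string, string> mapTarget)
    {
        const string fence = "---";
        // No leading front matter block: leave the document untouched.
        if (!markdown.StartsWith(fence + "\n", StringComparison.Ordinal))
            return markdown;

        // Locate the closing fence; everything from there on is body text.
        int close = markdown.IndexOf("\n" + fence, fence.Length, StringComparison.Ordinal);
        if (close < 0)
            return markdown;

        string front = markdown.Substring(0, close);
        string rest = markdown.Substring(close);

        // Apply the regex to the front matter only, then reattach the body unchanged.
        string rewritten = ImageKey.Replace(
            front,
            m => m.Groups["prefix"].Value + mapTarget(m.Groups["target"].Value));
        return rewritten + rest;
    }
}
```

Because the body is never passed to the regex, an image: line inside a fenced YAML example survives the import byte for byte.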

2. EnsureImportedAuthorMetadata duplicates front matter fields

The function appends author:, author_names:, author_urls:, and social_twitter_creator: via InsertFrontMatterFields without first checking whether those fields already exist in the contributor's front matter. A contributed post that already declares author: will end up with two author: keys in the same YAML block. Duplicate mapping keys are invalid YAML, so depending on the parser the load either fails or one of the two values silently wins, and the injected metadata may be ignored.

Fix: strip (or replace) the existing fields before inserting the canonical ones.

3. Import leaves the target in a partial/dirty state on failure

Directory.CreateDirectory(targetContentRoot);
Directory.CreateDirectory(targetAssetRoot);

foreach (var asset in post.Assets)
{
    if (CopyFile(sourceAsset, targetAsset, options.Force, errors))
        result.CopiedAssetCount++;
}

if (errors.Count > 0)
    continue;   // directories already created, some assets may be copied

When a mid-import error occurs (e.g. an asset copy fails) the destination directories have already been created and some assets have been written. There is no rollback, and a subsequent run without --force will then fail for different reasons (pre-existing files). Consider either staging to a temp directory and moving atomically, or collecting errors per-post without writing any files until all posts pass.
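The staging option could look roughly like the sketch below. StagedImport and its writeFiles callback are hypothetical names for illustration, and the sketch assumes the target directory does not exist yet (Directory.Move throws if it does), so a --force path would need extra handling:

```csharp
using System;
using System.IO;

public static class StagedImport
{
    // writeFiles is a hypothetical callback that writes one post's
    // markdown and assets into the directory it is given.
    public static void Run(string targetDir, Action<string> writeFiles)
    {
        // Stage next to the target so the final move stays on one volume.
        string staging = targetDir.TrimEnd(Path.DirectorySeparatorChar)
            + ".staging-" + Guid.NewGuid().ToString("N");
        Directory.CreateDirectory(staging);
        try
        {
            writeFiles(staging);
            // Publish the fully written directory in one step.
            Directory.Move(staging, targetDir);
        }
        catch
        {
            // Any failure removes the staging directory, so the target is never left half written.
            if (Directory.Exists(staging))
                Directory.Delete(staging, recursive: true);
            throw;
        }
    }
}
```

With this shape a failed asset copy leaves no files behind, so a later run without --force does not trip over leftovers.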

4. NormalizeSlug uses an inline (non-compiled) Regex.Replace

normalized = Regex.Replace(normalized, @"[^a-z0-9]+", "-", RegexOptions.CultureInvariant, RegexTimeout).Trim('-');

This goes through the static Regex.Replace overload (which relies on the runtime's internal pattern cache) on every call instead of using a pre-compiled static field like SlugRegex. While the sanitisation and validation patterns differ, a private static _slugSanitizeRegex for the normalization pass would be consistent with the rest of the class and avoids depending on that cache.

5. TryReadValue uses IReadOnlyDictionary but receives Dictionary

ReadStringList checks raw is IReadOnlyDictionary<string, object?> ro — this works because Dictionary<K,V> implements the interface, but any nested objects deserialized by YamlDotNet will be Dictionary<object, object> (not Dictionary<string, object?>). The nested path lookup (key.Split('.')) will silently return false for any dotted key referencing a nested map. In practice this means "author.name" etc. always falls through to the outer dictionary lookup in ReadMetaString — which may be intentional, but it's worth a comment or test.


Security

6. TryResolveBundleAsset uses OrdinalIgnoreCase on a potentially case-sensitive filesystem

if (!candidate.StartsWith(rootPrefix, StringComparison.OrdinalIgnoreCase))
    return false;

On Linux, /tmp/Contrib/ and /tmp/contrib/ are different directories. The case-insensitive prefix check could accept a path that case-folds to look like it is inside the bundle root while actually residing elsewhere. Path.GetFullPath already normalizes .. segments (it does not resolve symlinks), but using StringComparison.Ordinal (or OrdinalIgnoreCase only on Windows) would be more correct. The same comment applies to ResolveInside.
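A sketch of a platform-aware containment check follows. PathContainment and IsInside are hypothetical names for illustration, not the PR's TryResolveBundleAsset, and treating every non-Windows filesystem as case-sensitive is an assumption (default macOS volumes, for example, are not):

```csharp
using System;
using System.IO;

public static class PathContainment
{
    // Case-insensitive on Windows only; ordinal everywhere else (assumption, see lead-in).
    private static readonly StringComparison PathComparison =
        OperatingSystem.IsWindows() ? StringComparison.OrdinalIgnoreCase : StringComparison.Ordinal;

    public static bool IsInside(string root, string relativePath)
    {
        string fullRoot = Path.GetFullPath(root);

        // Compare against "root/" so a sibling such as "root-evil" cannot pass the prefix test.
        string rootPrefix = fullRoot.EndsWith(Path.DirectorySeparatorChar)
            ? fullRoot
            : fullRoot + Path.DirectorySeparatorChar;

        // GetFullPath collapses ".." segments; it does not resolve symlinks.
        string candidate = Path.GetFullPath(Path.Combine(fullRoot, relativePath));
        return candidate.StartsWith(rootPrefix, PathComparison);
    }
}
```

Appending the directory separator before comparing matters as much as the comparison mode, since a bare prefix check would accept a sibling directory whose name merely starts with the root's name.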

7. YAML deserialization is vulnerable to a billion-laughs / alias explosion

new DeserializerBuilder().Build().Deserialize<Dictionary<string, object?>>(File.ReadAllText(path));

YamlDotNet's default deserializer resolves YAML anchors and aliases, which means a malicious authors/author.yml containing a crafted alias tree can cause exponential memory/CPU usage. Since the authors/ directory is under contributor control, this is a plausible DoS for anyone running the CLI against untrusted contribution bundles. Consider capping alias expansion (for example with a MaximumAliasesAllowed-style policy, if the YamlDotNet version in use exposes one) or switching to a streaming / safe deserializer.


Performance

8. FencedCodeBlockRegex is matched twice per document

MaskFencedCodeBlocks and ReplaceOutsideFencedCodeBlocks each call FencedCodeBlockRegex.Matches(markdown) independently. For large documents the regex is executed twice. A single call whose result is reused (or combining both operations) would be more efficient.

9. LoadAuthors creates a new Deserializer per file

var map = new DeserializerBuilder().Build().Deserialize<...>(File.ReadAllText(path));

DeserializerBuilder.Build() is not free — it builds an internal type model. Cache the IDeserializer instance as a private static or pass it in.


Code Quality

10. Redundant action check after early defaulting

var action = subArgs.Length > 0 && !subArgs[0].StartsWith("--", StringComparison.Ordinal)
    ? subArgs[0].Trim().ToLowerInvariant()
    : "validate";
var effectiveArgs = action == "validate" || action == "import" ? subArgs.Skip(1).ToArray() : subArgs;
if (action is not ("validate" or "import"))
    return Fail(...);

effectiveArgs is computed for the action == "validate" || action == "import" case, but when action is neither, effectiveArgs equals subArgs, and then the function immediately returns an error, so effectiveArgs is never used. The ternary in the effectiveArgs assignment can just be subArgs.Skip(1).ToArray() since the guard eliminates the else branch.

11. No test coverage for the new processor

WebContributionProcessor contains ~550 lines of logic (layout validation, front matter parsing, image rewriting, author injection, import) with numerous edge cases. None of the test files visible in the diff exercise this code. At minimum, unit tests for ValidatePost, RewritePostMarkdown, EnsureImportedAuthorMetadata, and TryResolveBundleAsset would catch the bugs noted above and prevent regressions.

12. WebContributionResult.Posts uses Array.Empty<> as default but is always replaced

Minor: the default Array.Empty<WebContributionPostResult>() in the model and the immediate assignment in Process() are fine, but the fields ImportedPostCount, CopiedAssetCount, and CopiedAuthorCount are only meaningful during import — they are always 0 for a validate-only run. A comment or a separate result shape for import would avoid confusion in JSON output consumers.


Summary

| Category | Severity | Count |
| --- | --- | --- |
| Bugs / Correctness | Medium–High | 5 |
| Security | Medium | 2 |
| Performance | Low | 2 |
| Code Quality | Low | 3 |

The highest-priority items to address before merge are #2 (duplicate front matter fields on import), #3 (partial import on failure), #1 (body rewrite for image:), and #6 (case-sensitive path guard on Linux). The YAML billion-laughs issue (#7) is worth a fix if contribution bundles can ever come from untrusted sources.

Review generated with Claude Code

@claude

claude Bot commented Apr 29, 2026

Code Review — PR #333: Add website contribution import support

Overview

This PR introduces a contributions command to the PowerForge Web CLI, enabling validation and import of external contributor post bundles into the private website repository. It also enriches RSS, Atom, JSON Feed, OpenGraph, and JSON-LD output with structured author metadata.

The implementation is generally well-structured and defensive (path traversal guards, regex timeouts, asset size limits, asset type allowlists). The following issues range from bugs to performance nits.


🐛 Bugs

1. FrontMatterImageRegex rewrites image paths in the document body and code blocks

In RewritePostMarkdown, the front matter image replacement is a plain .Replace over the entire markdown string:

var rewritten = FrontMatterImageRegex.Replace(markdown, match => { ... });

The regex ((?m)^image\s*:) will match any line in the body that starts with image:, including inside fenced YAML code-block examples. The markdown image replacement further down correctly uses ReplaceOutsideFencedCodeBlocks, but the front matter one does not. An imported post with a YAML code example could have its body silently corrupted.

Fix: Extract and rewrite only the front matter section, then concatenate with the body unchanged. A simple split on the closing --- before calling the regex would be sufficient.


2. errors.Count > 0 guard in Import loop creates partially-imported state

foreach (var asset in post.Assets)
{
    if (CopyFile(...)) result.CopiedAssetCount++;
}

if (errors.Count > 0)
    continue;  // skip writing the markdown

errors is a shared list accumulated across all posts. If post A's asset copy fails, the check fires for post B too: post B's assets are still copied, but its markdown is never written. The site ends up with orphaned asset files but no corresponding content file.

Fix: Track per-post errors separately, or save the pre-loop error count and compare against it after the asset copies for the current post.
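The second option (comparing against a saved error count) can be sketched as below. PerPostErrors, ImportAll, and the tuple shape of a post are hypothetical, chosen only to keep the example self-contained:

```csharp
using System;
using System.Collections.Generic;

public static class PerPostErrors
{
    // Hypothetical shape: each post is a name plus a copy step
    // that may append messages to the shared error list.
    public static List<string> ImportAll(
        IEnumerable<(string Name, Action<List<string>> CopyAssets)> posts,
        List<string> errors)
    {
        var written = new List<string>();
        foreach (var post in posts)
        {
            // Remember how many errors existed before this post started.
            int errorsBefore = errors.Count;
            post.CopyAssets(errors);

            // Skip only when this post added errors; earlier failures no longer block it.
            if (errors.Count > errorsBefore)
                continue;

            written.Add(post.Name); // stands in for writing the post's markdown
        }
        return written;
    }
}
```

The shared list still collects every message for the final report, but the skip decision is now local to the post that failed.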


⚡ Performance

3. DeserializerBuilder instantiated per author file in LoadAuthors

foreach (var path in Directory.GetFiles(...))
{
    var map = new DeserializerBuilder().Build().Deserialize<...>(File.ReadAllText(path));
    ...
}

DeserializerBuilder.Build() creates a new deserializer for every file. Move it above the loop (or cache as a static field), matching the pattern used elsewhere in the project (e.g., WebSiteBuilder.Navigation.cs).


4. Non-compiled inline Regex.Replace in NormalizeSlug

normalized = Regex.Replace(normalized, @"[^a-z0-9]+", "-", RegexOptions.CultureInvariant, RegexTimeout).Trim('-');

Every call goes through the static Regex.Replace overload and its internal pattern cache rather than a dedicated instance. This should follow the file's own convention: a private static readonly Regex SlugNormalizerRegex = new(...) field.


🔍 Minor / Suggestions

5. catch (Exception ex) discards inner exceptions

In WebCliCommandHandlers.Contributions.cs:

catch (Exception ex)
{
    return Fail(ex.Message, outputJson, logger, "web.contributions");
}

ex.Message omits inner exception details. For a CLI whose output is often consumed programmatically, ex.ToString() (or at minimum ex.InnerException?.Message) gives better diagnostic signal when something wraps an IOException or a YAML parse exception.


6. Duplicate author metadata on re-import with --force

EnsureImportedAuthorMetadata appends author:, author_names:, author_urls:, and social_twitter_creator: to the front matter via InsertFrontMatterFields, which inserts without first checking for existing keys.

When --force is set the source file is re-read from the contributions repo (not the previously-imported file), so there is no duplication problem today. However, this assumption isn't obvious and could become a bug if the write-back path ever changes. A brief comment explaining the invariant would help.


7. No test coverage

1,107 lines of new validation/import logic — slug normalization, path traversal guards, front matter rewriting, YAML deserialization, asset size enforcement — are added with no corresponding tests. The verification section lists manual dotnet CLI commands, which is a reasonable smoke-test, but the edge cases (malformed front matter, missing authors, path traversal attempts in asset names, mixed line endings) are the kind that unit tests catch cheaply and reliably.


8. TryReadValue dotted-key traversal silently fails for nested YAML maps

if (current is IReadOnlyDictionary<string, object?> ro && ro.TryGetValue(part, out var next))

YamlDotNet deserializes nested maps to Dictionary<object, object>, which does not implement IReadOnlyDictionary<string, object?>. Dotted keys like "author.name" used elsewhere in WebSiteBuilder would silently return no value when called on YAML-sourced data. This latent mismatch won't affect the current contribution code (which only uses simple keys), but if dotted keys are ever added to contribution front matter queries, it will silently produce wrong results.
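One way to make the traversal tolerate both dictionary shapes is sketched below. This is an illustrative MetaLookup class, not the project's actual TryReadValue; the fallback to the non-generic IDictionary and the empty-key guard are assumptions about what a fix might look like:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class MetaLookup
{
    public static bool TryReadValue(
        IReadOnlyDictionary<string, object?> meta, string key, out object? value)
    {
        value = null;
        var parts = key.Split('.', StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length == 0)
            return false; // an empty key must not resolve to the whole map

        object? current = meta;
        foreach (var part in parts)
        {
            if (current is IReadOnlyDictionary<string, object?> typed
                && typed.TryGetValue(part, out var next))
            {
                current = next;
            }
            else if (current is IDictionary loose && loose.Contains(part))
            {
                // Dictionary<object, object> (YamlDotNet's nested maps) lands here.
                current = loose[part];
            }
            else
            {
                return false;
            }
        }

        value = current;
        return true;
    }
}
```

Dictionary<object, object> implements the non-generic IDictionary, so a string segment such as "name" can be looked up without knowing the key type in advance.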


Summary

| Severity | Issue |
| --- | --- |
| 🐛 Bug | Front matter image regex rewrites document body and code blocks |
| 🐛 Bug | Shared error list in import loop produces partially-copied state |
| ⚡ Performance | DeserializerBuilder allocated per file in author-loading loop |
| ⚡ Performance | Non-compiled inline Regex in NormalizeSlug |
| 🔍 Minor | ex.Message swallows inner exception details |
| 🔍 Minor | No unit tests for validation/import logic |
| 🔍 Minor | TryReadValue dotted-key traversal silently fails on YAML nested maps |

The path-traversal prevention in ResolveInside / TryResolveBundleAsset and the asset allowlist/size guards are good defensive choices. The ReplaceOutsideFencedCodeBlocks approach for markdown images is solid — applying the same treatment to the front matter image replacement would make the two rewrites consistent.

@PrzemyslawKlys
Member Author

Addressed the contribution importer review feedback in the latest push:

  • scoped front matter image and draft rewrites to YAML front matter only, so body examples and fenced code blocks stay untouched
  • replaced existing imported author metadata before inserting canonical author fields, avoiding duplicate YAML keys
  • changed import overwrite checks to work per post, so one existing target does not block unrelated posts
  • reused the YAML deserializer, added a compiled slug-normalizer regex, and made path containment comparison case-sensitive on non-Windows
  • kept full CLI exception details with ex.ToString()
  • added regression coverage for front matter/code-block rewriting and partial-import behavior

Local checks passed:

  • dotnet build PowerForge.Web.Cli -c Debug -f net8.0
  • dotnet test PowerForge.Tests -c Debug --filter WebContributionProcessorTests
  • contributions validate/import against Website.Contributions

@claude

claude Bot commented Apr 29, 2026

Code Review — PR #333: Add website contribution import support

Overview

This PR adds a new contributions validate|import CLI command and supporting library that validates contributor-submitted blog post bundles and imports them into the website repository. It also wires author metadata into JSON-LD, OpenGraph, RSS/Atom, and JSON Feed output.

The feature is well-structured and security-conscious overall. Below are findings organized by severity.


Bugs / Correctness Issues

1. X/Twitter handle validation rejects underscores (false negatives)

IsEmptyOrValidSocialValue validates bare handles against SlugRegex (^[a-z0-9]+(?:-[a-z0-9]+)*$), which does not allow underscores. However, X/Twitter handles can contain underscores (e.g., john_doe). A contributor with such a handle would fail validation incorrectly.

// WebContributionProcessor.cs — IsEmptyOrValidSocialValue
if (!trimmed.Contains("://", StringComparison.Ordinal))
    return SlugRegex.IsMatch(trimmed.ToLowerInvariant()); // rejects "john_doe"

Consider a separate handle-specific regex, for example ^[a-zA-Z0-9_]{1,15}$ (X handles are limited to 15 characters).
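A small sketch of such a handle check, kept separate from the slug pattern. SocialHandles and IsValidXHandle are hypothetical names, and both the 15-character cap and the tolerance for a leading @ are assumptions rather than behaviour taken from the PR:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SocialHandles
{
    // Letters, digits, and underscores only; length cap is an assumption (see lead-in).
    private static readonly Regex XHandleRegex = new(
        @"^[A-Za-z0-9_]{1,15}$",
        RegexOptions.CultureInvariant, TimeSpan.FromSeconds(1));

    public static bool IsValidXHandle(string value)
    {
        // Accept "@john_doe" as well as "john_doe" (assumed convenience).
        string trimmed = value.Trim().TrimStart('@');
        return XHandleRegex.IsMatch(trimmed);
    }
}
```

Keeping this apart from SlugRegex lets GitHub usernames and post slugs stay on the stricter hyphen-only pattern.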

2. Empty asset directory always created for posts with no assets

In Import, Directory.CreateDirectory(targetAssetRoot) is called unconditionally before iterating assetCopies. Posts that have no binary assets will leave an empty directory in the static assets tree.

Directory.CreateDirectory(targetContentRoot);
Directory.CreateDirectory(targetAssetRoot); // created even when assetCopies is empty

Guard it: if (assetCopies.Length > 0) Directory.CreateDirectory(targetAssetRoot);


Behavioral / Regression Risk

3. image added to JSON-LD imageOverride lookup — existing articles affected

In WebSiteBuilder.StructuredDataProfiles.cs, the image key is now included in the imageOverride lookup:

// Before:
var imageOverride = ReadMetaString(item.Meta, "article.image", "news.image", "schema.image", "social_image");
// After:
var imageOverride = ReadMetaString(item.Meta, "article.image", "news.image", "schema.image", "social_image", "image");

Articles that already have an image key in their front matter (e.g., a post thumbnail, not intended as the structured-data image) will now have that value surfaced in JSON-LD. This is a silent behavioral change for all existing content, not just contributed posts. If the intent is to make contributed posts emit correct structured data, consider a more specific key (e.g., schema.image or og_image) rather than the generic image.

4. Silent year fallback in ResolveYear

When a post's date is missing, null, or outside 2000–2100, the asset year silently defaults to DateTime.UtcNow.Year. A contributor who submits a post with a malformed date will have their assets placed under an unexpected year path without any warning.

private static int ResolveYear(DateTime? date)
{
    if (date is { Year: >= 2000 and <= 2100 } value)
        return value.Year;
    return DateTime.UtcNow.Year; // silent fallback
}

Emit a warning to the warnings list here, or propagate the missing-date error before reaching import.


Test Coverage

5. Very thin test suite for a ~900-line processor

Only two tests exist (WebContributionProcessorTests.cs). Key paths without coverage:

  • Validate-only mode (no Import = true)
  • Author profile validation errors (missing name, bad LinkedIn URL, invalid slug)
  • Post validation errors (missing title, description, language, unknown author)
  • Asset type enforcement (.exe or other disallowed extensions)
  • Asset size limit enforcement (MaxAssetBytes, MaxPostAssetBytes)
  • Multi-language post layout (posts/fr/my-post/index.md)
  • Path traversal attempts in image references (../../../etc/passwd)
  • Publish = false (draft flag should be preserved)
  • RSS/Atom/JSON Feed/OpenGraph author metadata output (the WebSiteBuilder changes have zero tests)

The path-traversal case is worth a targeted test given the explicit mitigation in TryResolveBundleAsset.


Minor / Style

6. InsertFrontMatterFields has a misleading parameter name

The method's markdown parameter actually receives only the front matter content (not the full document) when called from EnsureImportedAuthorMetadata via RewriteFrontMatter. Renaming to frontMatter would prevent future confusion.

7. Missing namespace in test file

WebContributionProcessorTests.cs uses no namespace declaration, which may be inconsistent with other test files in the project.

8. DeserializerBuilder instantiated per Process call

A new YamlDotNet deserializer is built inside LoadAuthors on every invocation. This is fine for current usage but a static readonly instance would be marginally more efficient.

9. Repeated action == "import" string comparisons in CLI handler

WebCliCommandHandlers.Contributions.cs compares action == "import" in three places. A local bool isImport = action == "import" at the top of the method would be cleaner and avoid any future typo divergence.


What's Done Well

  • Path traversal protection: ResolveInside and TryResolveBundleAsset both verify that resolved paths remain within their respective roots.
  • Regex timeout: All compiled regexes have a 1-second timeout, protecting against ReDoS on adversarial input.
  • Fenced code block awareness: ReplaceOutsideFencedCodeBlocks correctly skips image rewrites inside code fences, and the test validates this.
  • UTF-8 without BOM, LF line endings: new UTF8Encoding(false) and .Replace("\r\n", "\n") are the right choices for cross-platform text files.
  • Error deduplication: errors.Distinct(StringComparer.OrdinalIgnoreCase) prevents noisy duplicate messages.
  • Partial-import resilience: The second test correctly verifies that one failing post does not block others.

@PrzemyslawKlys
Member Author

Followed up on the second review pass too:

  • X/Twitter bare handles now allow underscores while GitHub usernames keep stricter slug-style validation
  • imports no longer create an empty asset directory when a post has no assets
  • author YAML deserializer is reused as a static instance
  • contribution CLI now uses one isImport flag for the repeated import checks
  • renamed the front matter helper parameter for clarity
  • expanded regression tests to cover underscore X handles and image references that try to escape the post bundle

I left the image JSON-LD lookup in place intentionally: image is already treated as the article/social image by the existing social-image resolver, and the contribution guide now renders the expected article image in JSON-LD.

Local checks passed again:

  • dotnet build PowerForge.Web.Cli -c Debug -f net8.0
  • dotnet test PowerForge.Tests -c Debug --filter WebContributionProcessorTests
  • contributions validate/import against Website.Contributions

@claude

claude Bot commented Apr 29, 2026

Code Review: Add website contribution import support

Overall: This is a solid, well-structured addition. The validation pipeline, path-traversal guards, and regex-timeout discipline are all good. A few items worth addressing before merge.


Overview

Adds a contributions validate|import sub-command to powerforge-web that:

  • Validates contributor post bundles (posts/<lang>/<slug>/index.md + assets) against a schema
  • Optionally imports them into a private website repo, rewriting relative image paths to site-rooted asset routes
  • Emits author metadata into JSON-LD, OpenGraph, RSS (dc:creator), Atom, and JSON Feed outputs

Potential Bugs

1. RemoveFrontMatterKeys silently eats blank lines adjacent to removed keys

In WebContributionProcessor.cs, the skipping block continues on any string.IsNullOrWhiteSpace(line). An empty line between two front matter keys that follows a removed key will be swallowed, compressing the front matter:

if (string.IsNullOrWhiteSpace(line) || char.IsWhiteSpace(line[0]) || line.TrimStart().StartsWith("-", ...))
    continue;

In practice YAML front matter rarely has blank lines between keys, but it is technically valid YAML. At minimum this deserves a comment explaining the assumption.

2. ResolveYear falls back to DateTime.UtcNow.Year for posts with missing/invalid dates

private static int ResolveYear(DateTime? date)
{
    if (date is { Year: >= 2000 and <= 2100 } value)
        return value.Year;
    return DateTime.UtcNow.Year;   // non-deterministic
}

A post with date: null silently gets the current year. This makes imports non-deterministic across year boundaries and could produce wrong asset paths if the import runs on Dec 31 vs Jan 1. Consider making this an error or at least a warning, since a missing date is already validated as an error at line ~712.

3. Duplicate error messages possible in CopyFile vs pre-check in Import

The Import method pre-checks whether any targets already exist before copying, then calls CopyFile, which has the same guard. The continue on errors.Count > postErrorStart prevents double-firing under normal conditions, but a TOCTOU window means CopyFile could add a second copy of the same message. The Distinct dedup at the end would suppress it, but it's worth noting the redundancy.


Code Quality

4. WebContributionProcessor.cs exceeds the 800-line project guideline

Per AGENTS.md/Build/linecount.js, the limit is 800 lines. This file is 906 lines. A natural split would be:

  • WebContributionValidator.cs: all Validate* methods
  • WebContributionImporter.cs: Import, RewritePostMarkdown, EnsureImportedAuthorMetadata
  • Keep helpers and regex statics in the processor or a shared internal class

5. TryReadValue returns true with value = meta for an empty key string

object? current = meta;
foreach (var part in key.Split('.', ...))  // yields zero parts for ""
{ ... }
value = current;  // = meta itself
return true;

Only called with hardcoded string constants today, but worth guarding with an early return or a check that parts is non-empty.

6. FrontMatterImageRegex: confirm it does not match image-* keys

(?m)^(?<prefix>\s*image\s*:\s*["']?)...

At first glance the image\s*: pattern looks like it could also match image_alt:, image_url:, etc. It cannot: \s* only allows whitespace between image and the colon, so the key has to be literally image (optionally followed by spaces, which YAML keys would not normally contain). This is low risk, but worth a targeted test for image_alt lines to confirm they are not rewritten.


Security

7. Path-traversal protection is correct and tested — good

TryResolveBundleAsset and ResolveInside both canonicalize via Path.GetFullPath and check the prefix, and there is a dedicated test (Validate_RejectsImageReferencesThatEscapePostBundle). This is the right approach.

8. YamlDotNet default deserializer

AuthorDeserializer = new DeserializerBuilder().Build() uses the permissive default builder. For contributor-supplied YAML, consider adding .WithMaximumRecursion(10) (or similar) to limit potential stack-overflow from deeply nested adversarial input, even though Dictionary<string, object?> prevents type-confusion gadget chains.


Test Coverage

The four existing tests are well-written. Missing coverage worth adding:

| Scenario | Why it matters |
| --- | --- |
| Publish = false → draft: true remains | Confirms the flag is respected |
| Multiple authors in one post | EnsureImportedAuthorMetadata iterates all authors |
| Force = true overwrites existing target | Happy path for the force flag |
| Post with no image front matter | Ensure rewriter doesn't corrupt posts without a cover |
| External image ref (https://) in body | Should pass through unchanged |
| Author with minimal profile (name + slug only) | ResolveAuthorPublicUrl returns null; no author_urls array written |

Minor Nits

  • new[] { single.Trim() } in WebSiteBuilder.AuthorMetadata.cs:416 — prefer [single.Trim()] (collection expression) for consistency with other C# 12 patterns in the codebase if that style is used elsewhere.
  • The CLI dispatch registers both "contributions" and "contribution" (singular). This is fine ergonomically but might be worth documenting in the help text shown by Fail(...) so users know both spellings are accepted.
  • Error output in HandleContributions prints warnings before errors (logger.Warn then logger.Error). Most CLIs print warnings last to leave errors visible at the end. Small UX point.

Summary

The core logic is solid and the security-sensitive pieces (path traversal, URL validation, regex timeouts) are handled correctly. The main asks before merge:

  1. Split WebContributionProcessor.cs to stay within the 800-line guideline.
  2. Fix or document the ResolveYear non-determinism (missing-date post should not silently use current year when date is already a required field).
  3. Add a few targeted tests (especially Publish=false and multi-author).

Everything else is low-priority polish. Nice work overall.

@PrzemyslawKlys
Member Author

Final small cleanup pass from the latest review:

  • ResolveYear now uses a deterministic fallback (2000) for invalid/missing dates; those posts are already validation failures, so import is still blocked
  • TryReadValue now rejects empty keys explicitly
  • added a comment around imported-author front matter key removal assumptions
  • added coverage that Publish = false preserves draft: true
  • added a regression assertion that image_alt is not touched by image rewriting

I am intentionally not splitting WebContributionProcessor.cs in this PR. It is a good follow-up refactor, but doing a large mechanical split while the feature is still settling would add review noise without changing behavior.

Local checks passed again:

  • dotnet build PowerForge.Web.Cli -c Debug -f net8.0
  • dotnet test PowerForge.Tests -c Debug --filter WebContributionProcessorTests
  • contributions validate/import against Website.Contributions

@claude

claude Bot commented Apr 29, 2026

Code Review — PR #333: Add website contribution import support

Overview

This PR adds a full contribution pipeline for PSPublishModule: contributor bundles (a posts/<lang>/<slug>/index.md + assets directory) can be validated or imported into the private website repo, with author metadata emitted into JSON-LD, OpenGraph, RSS, Atom, and JSON Feed outputs. The scope is well-contained and the general architecture is sound.


Strengths

  • Path traversal protection is solid: TryResolveBundleAsset and ResolveInside both resolve to Path.GetFullPath and assert the result remains inside the declared root. The Validate_RejectsImageReferencesThatEscapePostBundle test confirms this works. Cross-platform comparison (PathComparison) is correctly handled.
  • Fenced code block masking in MaskFencedCodeBlocks / ReplaceOutsideFencedCodeBlocks correctly prevents image-pattern rewrites inside code examples, and the import test verifies the behaviour.
  • Regex patterns are compiled and cached as statics, with a 1-second timeout — good practice for user-supplied content.
  • UTF-8 BOM-free output (new UTF8Encoding(false)) and \r\n → \n normalisation are the right choices for web files.
  • Error accumulation: validation collects all problems and reports them together rather than failing fast. Good UX for contributors.

Issues and Suggestions

1. image key added to structured-data fallback chain may affect non-contribution posts

// WebSiteBuilder.StructuredDataProfiles.cs
var imageOverride = ReadMetaString(item.Meta, "article.image", "news.image", "schema.image", "social_image", "image");

The newly appended "image" key is also the standard Hugo/Jekyll featured-image front matter field used by any post — not just contribution imports. If a non-contribution post carries image: ./cover.webp (a bundle-relative path), it will now be forwarded into ResolveSocialImagePath as the image override, potentially producing a broken URL in the JSON-LD output. Consider restricting this fallback to posts that were imported via the contribution pipeline (e.g. guard with a sentinel front matter key like pf_contribution: true), or document the expectation that this field must always be site-rooted or absolute.

2. FrontMatterImageRegex rewrites image_alt — confirm it doesn't

The regex is:

(?m)^(?<prefix>\s*image\s*:\s*['""]?)(?<target>[^'""\r\n]+)(?<suffix>['""]?\s*)$

\s* between image and : allows image : — but because _ is not \s, the pattern does not match image_alt:. This is safe as-is, but worth a comment noting the intent, because readers will immediately ask the same question.

3. RemoveFrontMatterKeys drops blank lines between removed keys and the next field

When a key-to-remove is immediately followed by a blank line and then the next valid key, the blank separator line is consumed along with the removed key's continuation. In practice this just tightens the vertical spacing in the output front matter, which is cosmetic. There is no test covering this layout, so the behaviour is invisible.

4. ResolveYear fallback of 2000 is silent

private static int ResolveYear(DateTime? date)
{
    if (date is { Year: >= 2000 and <= 2100 } value)
        return value.Year;
    // Missing or invalid dates are validation errors; ...
    return 2000;
}

If matter.Date is non-null but falls outside 2000–2100, ValidatePost does not emit a validation error (only null is checked). A date like 1999-12-31 silently uses year 2000 and will produce a confusing asset path. Add a range-check error alongside the null check.
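One possible shape for that range check, as a sketch. The method name, parameter list, and error wording are assumptions for illustration; only the 2000..2100 bounds come from ResolveYear above.

```csharp
using System;
using System.Collections.Generic;

public static class DateValidationSketch
{
    public const int MinYear = 2000;
    public const int MaxYear = 2100;

    // Hypothetical validator: records an error for missing or out-of-range dates
    // so that the 2000 fallback year is never used for a real import.
    public static void ValidateDate(string relativePath, DateTime? date, ICollection<string> errors)
    {
        if (date is null)
        {
            errors.Add($"{relativePath}: front matter 'date' is required.");
            return;
        }

        var year = date.Value.Year;
        if (year < MinYear || year > MaxYear)
            errors.Add($"{relativePath}: front matter 'date' year {year} is outside {MinYear}..{MaxYear}.");
    }
}
```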

5. WebContributionOptions.MaxAssetBytes / MaxPostAssetBytes limits are untested

Test coverage is good overall but does not exercise the size-limit paths. A test that writes a >5 MB asset (or mocks FileInfo.Length) would protect these guards against regressions.

6. Author Atom entries — missing <uri> element

The Atom <author> element supports both <name> and <uri>. The implementation emits only <name>. For parity with the JSON Feed and OpenGraph author outputs (which do emit URLs), consider adding <uri> when an author URL is available.

<!-- current -->
<author><name>Jane Doe</name></author>

<!-- suggested when URL available -->
<author>
  <name>Jane Doe</name>
  <uri>https://www.linkedin.com/in/janedoe</uri>
</author>
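A sketch of how the suggested element could be built with System.Xml.Linq. The helper name is hypothetical, and the PR's renderer may construct its XML differently; this only shows <uri> being added when the author URL is a usable absolute http(s) URL.

```csharp
using System;
using System.Xml.Linq;

public static class AtomAuthorSketch
{
    private static readonly XNamespace Atom = "http://www.w3.org/2005/Atom";

    // Hypothetical helper: always emits <name>, and emits <uri> only for absolute http(s) URLs.
    public static XElement BuildAuthor(string name, string url)
    {
        var author = new XElement(Atom + "author", new XElement(Atom + "name", name));

        if (Uri.TryCreate(url, UriKind.Absolute, out var uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
        {
            author.Add(new XElement(Atom + "uri", uri.AbsoluteUri));
        }

        return author;
    }
}
```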

7. CLI: --publish flag has no help text in the handler

HandleContributions parses --publish silently. The --force flag is at least implied by error messages. Consider adding a logger.Info or --help path that describes both flags, especially since --publish removes draft: true markers — a consequence a contributor won't expect unless told.

8. Minor: duplicate dispatch alias not documented

"contributions" => HandleContributions(...)
"contribution"  => HandleContributions(...)

Both singular and plural aliases work, but this is undocumented in the PR description or any help text. Not a bug, just worth noting in README/docs so users don't have to discover it by trial.


Test coverage summary

Scenario Covered
Front matter + body image rewriting
Fenced code block preservation
Partial import (existing target, --force)
Draft preservation (Publish=false)
X handle with underscore
Path traversal in markdown image
Asset size limits
Invalid / out-of-range date
Author profile validation failures
Validate-only (no Import) with success path

Summary

The core design is solid. The two issues most worth addressing before merge are #1 (the "image" key broadening the fallback chain in structured data) and #4 (silent fallback for out-of-range dates). The rest are improvements rather than blockers.

@PrzemyslawKlys
Member Author

Addressed the two concrete items from the newest review:

  • generic front matter image is now only used as a JSON-LD image override when it is already site-rooted or absolute; relative bundle-style values like ./cover.webp are ignored for structured-data output
  • contribution validation now rejects dates outside 2000..2100 instead of letting them fall through to the deterministic fallback year
  • added tests for safe/unsafe structured-data image overrides and out-of-range contribution dates
  • kept the explicit image_alt regression assertion and the canonical-image regex comment

Local checks passed:

  • dotnet build PowerForge.Web.Cli -c Debug -f net8.0
  • dotnet test PowerForge.Tests -c Debug --filter "WebContributionProcessorTests|WebSiteStructuredDataProfilesTests"
  • contributions validate/import against Website.Contributions

@claude

claude Bot commented Apr 29, 2026

Code Review: PR #333 — Add website contribution import support

Overview

This PR adds a full validate/import pipeline for external contributor post bundles (Website.Contributions). The implementation includes:

  • WebContributionProcessor — core validation + import logic
  • WebContributionModels — new DTOs
  • WebSiteBuilder.AuthorMetadata — author resolution helpers
  • Output format additions: dc:creator in RSS, <author> in Atom, authors in JSON Feed, article:author in OG, and person-level JSON-LD
  • CLI wiring (contributions validate / contributions import)
  • Test coverage in WebContributionProcessorTests and extended WebSiteStructuredDataProfilesTests

The overall design is solid and security-conscious. A few things worth addressing before merge:


Line-Count Limit

WebContributionProcessor.cs is 914 lines. The repo's own quality gate (node .\Build\linecount.js . 800) flags files over 800 lines. Consider splitting it — a natural seam exists between the pure-validation helpers (path resolution, YAML reading, regex utilities) and the import/rewrite logic. Two files of ~450 lines each would stay well inside budget.


Potential Bugs

RemoveFrontMatterKeys uses AppendLine on normalized input
The function normalises \r\n → \n before splitting, but StringBuilder.AppendLine emits \r\n on Windows. The intermediate result is then passed through RewriteFrontMatter, which does a final .TrimEnd('\r', '\n') + "\n", so the round-trip is safe, but it's fragile. A simple builder.Append(line).Append('\n') removes the dependency on the runtime platform.
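A simplified sketch of the explicit-newline approach. This is not the PR's RemoveFrontMatterKeys: it handles only single-line top-level keys and ignores continuation lines, and the class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text;

public static class FrontMatterLinesSketch
{
    // Drops top-level "key: value" lines whose key is in `keys`.
    // Each kept line is terminated with '\n' explicitly, independent of the platform.
    public static string RemoveKeys(string frontMatter, ISet<string> keys)
    {
        var normalized = frontMatter.Replace("\r\n", "\n");
        var builder = new StringBuilder();

        foreach (var line in normalized.Split('\n'))
        {
            var colon = line.IndexOf(':');
            if (colon > 0 && keys.Contains(line.Substring(0, colon).TrimEnd()))
                continue; // skip the removed key

            builder.Append(line).Append('\n');
        }

        return builder.ToString();
    }
}
```

Because no call depends on Environment.NewLine, the output contains no \r on any platform.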

Dead branch in ReadStringList

if (raw is string[] stringArray)   // YamlDotNet never produces string[] here

YamlDotNet's deserializer returns List<object> for sequences, so this path is never hit. The IEnumerable<object?> branch below handles it correctly. The string[] check is harmless but misleading.

RSS namespace declared unconditionally
The dc namespace is added to the <rss> root element even when no items in that feed carry author metadata. This is valid XML but will trigger "declared but not used" warnings in strict feed validators. Consider conditionally adding it only when at least one item has author data.
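A sketch of the conditional declaration using System.Xml.Linq. The helper and its tuple-based input are assumptions for illustration and are not the PR's RenderRssOutput; the Dublin Core namespace URI is the standard one.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RssNamespaceSketch
{
    private static readonly XNamespace Dc = "http://purl.org/dc/elements/1.1/";

    // Hypothetical builder: Creator may be empty when an item has no author metadata.
    public static XElement BuildRss(IEnumerable<(string Title, string Creator)> items)
    {
        var list = items.ToList();
        var rss = new XElement("rss", new XAttribute("version", "2.0"));

        // Declare xmlns:dc only when at least one item will actually use it.
        if (list.Any(i => !string.IsNullOrWhiteSpace(i.Creator)))
            rss.Add(new XAttribute(XNamespace.Xmlns + "dc", Dc.NamespaceName));

        var channel = new XElement("channel");
        foreach (var (title, creator) in list)
        {
            var item = new XElement("item", new XElement("title", title));
            if (!string.IsNullOrWhiteSpace(creator))
                item.Add(new XElement(Dc + "creator", creator));
            channel.Add(item);
        }

        rss.Add(channel);
        return rss;
    }
}
```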


Security

  • Path traversal protection in ResolveInside and TryResolveBundleAsset is correct and well-tested. ✓
  • TryResolveBundleAsset uses Path.GetFullPath + prefix check; the test Validate_RejectsImageReferencesThatEscapePostBundle validates the ../secret.png case. ✓
  • YAML deserialized to Dictionary<string, object?> avoids gadget-style deserialisation attacks. ✓
  • All regex patterns carry a TimeSpan.FromSeconds(1) timeout — good ReDoS defence. ✓
  • EscapeYamlScalar correctly escapes \ and " for double-quoted scalars. ✓

Missing CLI Exposure

WebContributionOptions.MaxAssetBytes (5 MB) and MaxPostAssetBytes (30 MB) are enforced during validation but are not exposed via --max-asset-size / --max-total-size CLI flags. Contrib maintainers may want to tighten or relax these limits per-repo without forking the engine. Even if not done now, a // TODO comment would prevent the gap from being forgotten.


Test Coverage Gaps

The existing tests are good. The following scenarios are not covered:

Missing case | Risk if untested
Author profile validation errors (invalid LinkedIn URL, missing name) | Silent acceptance of bad author data
Asset over MaxAssetBytes / MaxPostAssetBytes | Size-limit enforcement silently bypassed
Language derived from folder (no language: in front matter) | ResolveLanguage folder-fallback path
Multi-author post (author metadata injection with >1 author) | author_names / author_urls list alignment
Author catalog JSON output shape | Regression in catalog.json format

Minor Observations

  • IsFrontMatterKeyLine finds the first : in a line, so author_url: ... is correctly not treated as an author key. ✓ The same logic treats author: "Name: Suffix" as having key author, which is correct and intentional.
  • ResolveYear returns 2000 as a sentinel for invalid dates. The comment explains why, and import is blocked before this value is used. ✓
  • FrontMatterDelimiterRegex is tested indirectly by every import test. A direct unit test for the bounds extraction edge cases (empty front matter, BOM, missing closing ---) would increase confidence.
  • The alias "contribution" => HandleContributions(...) alongside "contributions" is a nice UX touch.

Summary

The feature is well-designed, security-aware, and meaningfully tested. The main actionable items before merge are:

  1. Split WebContributionProcessor.cs to stay within the 800-line project limit.
  2. Fix AppendLine → Append('\n') in RemoveFrontMatterKeys to avoid latent Windows line-ending bugs.
  3. Add at minimum the multi-author and asset-size-limit test cases.
  4. Conditionalise the dc namespace in RSS output, or note it as a deliberate trade-off.

Good work overall — the traversal guards and regex timeout discipline in particular are exactly the right instinct for a code path that processes externally-submitted content.

@PrzemyslawKlys
Member Author

One last small cleanup from the latest review pass:

  • RemoveFrontMatterKeys now appends \n explicitly instead of using platform-dependent AppendLine()
  • removed the dead string[] branch in ReadStringList
  • added an asset size limit regression test

I checked the line-count script too. It reports many existing files over 800 lines, so I am leaving the WebContributionProcessor.cs split as a follow-up refactor rather than mixing a large file move into this already-reviewed feature branch.

Local checks passed:

  • dotnet build PowerForge.Web.Cli -c Debug -f net8.0
  • dotnet test PowerForge.Tests -c Debug --filter "WebContributionProcessorTests|WebSiteStructuredDataProfilesTests"
  • contributions validate/import against Website.Contributions

@claude

claude Bot commented Apr 29, 2026

Code Review — PR #333: Add website contribution import support

Overview

This PR introduces a complete contribution pipeline for the PowerForge website: a WebContributionProcessor that validates and imports community post bundles, author metadata enrichment across JSON-LD / OpenGraph / RSS / Atom / JSON Feed, and a new contributions CLI subcommand. The feature is well-scoped and the code is clearly structured.


Security — Solid Foundations

  • Path traversal prevention is correctly implemented in both ResolveInside (options paths) and TryResolveBundleAsset (contributor-supplied asset refs). The approach of computing Path.GetFullPath and checking the rootPrefix prefix is the right pattern.
  • Path traversal test (Validate_RejectsImageReferencesThatEscapePostBundle) verifies the ../secret.png case.
  • Regex timeouts (RegexTimeout = TimeSpan.FromSeconds(1)) on all compiled regexes is a good ReDoS safeguard.
  • Asset allowlist (AllowedAssetExtensions) prevents executable or unexpected file types from being imported.
  • IsAbsoluteOrSiteRootedUrl correctly gates the front-matter image field in JSON-LD — only /-rooted or http(s):// values are emitted, preventing relative ./cover.webp from leaking into structured data.

Potential Issues

1. JsonSerializer.Serialize without AOT context (WebContributionProcessor.cs ~line 1033)

The catalog write uses reflection-based serialization:

var json = JsonSerializer.Serialize(catalog, new JsonSerializerOptions { ... });

The rest of the CLI layer uses the source-generated WebCliJson.Context / PowerForgeWebCliJsonContext for trimming/AOT safety. If this code path is ever reached in a NativeAOT or trimmed build, it will fail at runtime. Consider using a source-generated context or restructuring the catalog type so it can be registered.

2. IsFrontMatterKeyLine removes indented keys too

private static bool IsFrontMatterKeyLine(string line, ISet<string> keys)
{
    var trimmed = line.TrimStart();   // ← strips leading whitespace first
    var colon = trimmed.IndexOf(':');
    return colon > 0 && keys.Contains(trimmed[..colon].Trim());
}

Because the method calls TrimStart(), an indented YAML key like social_twitter_creator: "x" (inside a nested block) would also be removed. For the current set of target keys (author, author_names, author_urls, social_twitter_creator) this is unlikely to cause problems since they are expected at the top level, but it is a latent correctness risk if contributors nest any of these keys inside another mapping.
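A sketch of a top-level-only variant, assuming nested keys should be left untouched. The name IsTopLevelKeyLine is hypothetical; the logic is otherwise the same as the method quoted above, minus the TrimStart() call.

```csharp
using System.Collections.Generic;

public static class TopLevelKeySketch
{
    // Returns true only for unindented "key:" lines whose key is in `keys`.
    public static bool IsTopLevelKeyLine(string line, ISet<string> keys)
    {
        // Indented lines belong to a nested mapping or a continuation; leave them alone.
        if (line.Length == 0 || char.IsWhiteSpace(line[0]))
            return false;

        var colon = line.IndexOf(':');
        return colon > 0 && keys.Contains(line.Substring(0, colon).TrimEnd());
    }
}
```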

3. ResolveInside throws on bad options paths instead of returning a user error

When options.ContentBlogPath, options.StaticBlogAssetsPath, or options.TargetAuthorsPath escape their root (e.g. a misconfigured ../../etc in code), ResolveInside throws InvalidOperationException. Since these values come from WebContributionOptions (caller-controlled, not contributor-controlled), this is acceptable for now, but validating them at the top of Process would give callers a cleaner failure mode than an unhandled exception.

4. TryReadValue nested YAML type assumption

The method casts nested objects to IReadOnlyDictionary<string, object?> for dotted-key traversal. This mirrors the existing TryGetMetaValue pattern across the codebase and is therefore consistent. However, note that the contributor YAML author profiles are only read with simple (non-dotted) keys in this PR, so no regression exists here. Worth keeping in mind if dotted paths are added later.


Minor Observations

MaskFencedCodeBlocks LINQ scan

FencedCodeBlockRegex.Replace(markdown, match => new string('\n', match.Value.Count(static ch => ch == '\n')));

For documents with large fenced blocks this is O(n) per block. Not a real concern for blog posts, but match.Value.AsSpan().Count('\n') would be marginally cleaner in .NET 8.

AppendLine on Windows may produce \r\n
In EnsureImportedAuthorMetadata, StringBuilder.AppendLine uses the environment line ending. The final Replace("\r\n", "\n") in RewritePostMarkdown normalizes the whole output, so the result is correct. Just worth being aware of if the normalization is ever removed or moved.

Dual alias registration

"contributions" => HandleContributions(...),
"contribution"  => HandleContributions(...),

Intentional and appreciated — the singular alias is a common mistake and this is a clean way to handle it.


Test Coverage

Scenario | Covered
Front matter + body image rewrite | ✅
Fenced code block immunity | ✅
Partial import (existing target, Force=false) | ✅
Draft preservation (Publish=false) | ✅
X handle with underscore | ✅
Path traversal via ../ | ✅
Out-of-range date | ✅
Asset size limit | ✅
JSON-LD image: only safe overrides | ✅
Author profile validation errors (bad LinkedIn, missing name, etc.) | ❌ Missing
BuildContentAuthorStructuredData with multiple authors → JSON-LD array | ❌ Missing
OpenGraph article:author meta tag emission | ❌ Missing
RSS dc:creator / Atom <author> / JSON Feed authors emission | ❌ Missing
The missing tests are for the feed/meta output changes and author validation rules — they are lower-risk since the structured-data profile test does cover the safe-image filtering path, but adding them would improve confidence in the author URL resolution and feed author injection.


Summary

The core contribution processor is well-designed, security-conscious, and comes with a solid set of behavioral tests. The three items above worth addressing before merge are:

  1. AOT serialization for the catalog write (correctness risk in trimmed builds).
  2. IsFrontMatterKeyLine indentation sensitivity (latent bug).
  3. Adding tests for author profile validation and feed author emission.

Everything else is clean and consistent with the existing codebase conventions.

🤖 Reviewed with Claude Code

@claude

claude Bot commented Apr 30, 2026

Code Review — PR #333: Add website contribution import support

Overview

This PR introduces a complete contribution pipeline for the PowerForge website: contributors can submit post bundles (Markdown + assets + author YAML) to a separate repo, and this tooling validates the structure and optionally imports it into the main website repo. The feature is well-scoped, security-conscious, and ships with a solid test suite. A few issues are worth addressing before merge.


Positive Highlights

  • Path traversal protection is thorough: TryResolveInside prevents configured paths from escaping the root, and TryResolveBundleAsset rejects .. segments in image references.
  • Regex safety: all compiled regexes carry a 1-second timeout, guarding against ReDoS.
  • Dual-mode design (validate-only vs. validate+import) is clean and well-represented in tests.
  • Force / Publish flags give operators precise control over overwrite and draft-removal behaviour.
  • Fenced-code-block masking correctly prevents image rewrites inside code examples, with a dedicated test.
  • Source-generated JSON serialization (WebContributionJsonContext) is consistent with the rest of the codebase.

Issues

1. Test helper hardcodes the author slug (bug)

WriteAuthor in WebContributionProcessorTests.cs (line 346):

File.WriteAllText(Path.Combine(authorsRoot, slug + ".yml"),
    $$"""
    name: {{name}}
    slug: jane-doe   // ← always "jane-doe" regardless of the `slug` parameter
    linkedin: {{linkedin}}
    x: {{x}}
    """);

The slug parameter is used for the filename but not written into the YAML body. Any test that passes a slug other than jane-doe will silently produce a file whose slug field disagrees with its filename. Replace jane-doe with {{slug}}.


2. FrontMatterImageRegex can rewrite nested image: keys

The regex (line 778–781 of WebContributionProcessor.cs):

(?m)^(?<prefix>\s*image\s*:\s*[""']?)(?<target>[^""'\r\n]+)(?<suffix>[""']?\s*)$

has \s* before image, so it matches indented YAML lines such as:

metadata:
  image: ./thumbnail.png

During RewritePostMarkdown, the front matter is passed through this regex, meaning nested image: fields would also get their relative paths rewritten to the canonical asset route. The comment says "image_alt/image_url are intentionally excluded" but the indentation ambiguity is not addressed. A test covering a nested image: key under metadata: would expose this. If only the top-level key should be rewritten, anchor the regex more tightly (e.g., require no leading whitespace: ^(?<prefix>image\s*:\s*...)).
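A sketch of the tighter anchoring, to show the effect on a nested key. The pattern is the one quoted above with the leading \s* removed; the class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TopLevelImageRegexSketch
{
    // Same shape as the quoted pattern, but "image" must start at column zero.
    private static readonly Regex TopLevelImage = new Regex(
        @"(?m)^(?<prefix>image\s*:\s*[""']?)(?<target>[^""'\r\n]+)(?<suffix>[""']?\s*)$",
        RegexOptions.Compiled | RegexOptions.CultureInvariant,
        TimeSpan.FromSeconds(1));

    // Returns the targets of all top-level image: lines in the given front matter text.
    public static string[] FindTargets(string frontMatter) =>
        TopLevelImage.Matches(frontMatter)
            .Select(match => match.Groups["target"].Value)
            .ToArray();
}
```

For front matter containing both an indented image: under metadata: and a top-level image:, only the top-level target is returned, and image_alt is still excluded.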


3. Misleading error message for path-traversal in image refs

In ValidateMarkdownImages (line 1100–1102):

if (!TryResolveBundleAsset(bundleRoot, target, out var fullPath) || !File.Exists(fullPath))
    errors.Add($"{relative}: markdown image target '{target}' does not exist.");

TryResolveBundleAsset returns false for paths containing .., so the error for ../secret.png says the file "does not exist" rather than "escapes the post bundle". The existing test (Validate_RejectsImageReferencesThatEscapePostBundle) passes because it only checks that the target path appears in the error string — but the message is semantically misleading for contributors trying to understand why their post failed. Consider distinguishing the two cases.
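A sketch of how the two cases could be reported separately. It does not reuse TryResolveBundleAsset; the containment check is re-implemented here under the assumption that it is a GetFullPath plus root-prefix comparison, and the method name and "escapes the post bundle" wording are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ImageTargetErrorSketch
{
    public static void ValidateTarget(string bundleRoot, string relative, string target, ICollection<string> errors)
    {
        var root = Path.GetFullPath(bundleRoot);
        var rootPrefix = root.EndsWith(Path.DirectorySeparatorChar.ToString(), StringComparison.Ordinal)
            ? root
            : root + Path.DirectorySeparatorChar;
        var fullPath = Path.GetFullPath(Path.Combine(root, target));

        // Case-insensitive comparison on Windows, ordinal elsewhere.
        var comparison = OperatingSystem.IsWindows()
            ? StringComparison.OrdinalIgnoreCase
            : StringComparison.Ordinal;

        if (!fullPath.StartsWith(rootPrefix, comparison))
        {
            errors.Add($"{relative}: markdown image target '{target}' escapes the post bundle.");
            return;
        }

        if (!File.Exists(fullPath))
            errors.Add($"{relative}: markdown image target '{target}' does not exist.");
    }
}
```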


4. Race condition in Import between existence check and CopyFile

In Import (lines 1156–1160), asset targets are checked with File.Exists before calling CopyFile, which checks again internally. In a concurrent environment (e.g., two operators running import simultaneously) the pre-check could succeed while CopyFile's internal check also succeeds, resulting in a silent overwrite rather than an error when Force = false. This is low-risk in practice but worth noting if the tool is ever called from a CI pipeline with parallelism.


5. Missing test for catalog.json output

The Import path writes a catalog.json author catalog to data/authors/catalog.json, including name, slug, LinkedIn, X handle, etc. No test currently asserts the content or structure of that file. Given that downstream site templates likely consume this JSON, a test verifying the serialized output would be valuable.


6. No explicit encoding when reading author YAML

LoadAuthors (line 906) calls File.ReadAllText(path) without specifying Encoding.UTF8, while all Markdown files are read with Encoding.UTF8 explicitly. Minor inconsistency, but worth aligning for cross-platform correctness with BOM-less UTF-8 files.


7. Validation errors are deduplicated case-insensitively (minor)

result.Errors = errors.Distinct(StringComparer.OrdinalIgnoreCase).ToArray();

Two distinct errors that differ only in case (e.g., paths on a case-sensitive filesystem) would be silently merged. This is unlikely to cause real problems, but StringComparer.Ordinal may be a safer default.
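A small sketch of the difference between the two comparers. The helper name and sample messages are made up for illustration.

```csharp
using System;
using System.Linq;

public static class ErrorDedupSketch
{
    // Deduplicates error messages with the given comparer, preserving first-seen order.
    public static string[] Dedup(string[] errors, StringComparer comparer) =>
        errors.Distinct(comparer).ToArray();
}
```

Given two messages that differ only in the casing of a path segment (for example posts/en/Foo versus posts/en/foo), StringComparer.OrdinalIgnoreCase keeps one entry while StringComparer.Ordinal keeps both.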


Minor / Nits

  • ResolveYear fallback to 2000 (line 1292): the comment explains this keeps paths deterministic until import is blocked. This is fine but a future reader might be confused — a short comment at the call site (or inline) would help.
  • CLI --site alias (line 532): --site is accepted but not documented in the command dispatch or help text visible from Fail(...). Worth adding to any usage strings.
  • WebSiteBuilder.AuthorMetadata.cs — fallback chain in ResolveContentAuthorNames: the function falls through author_names, authors.names, authors_names, schema.author_names, then a single-value author field. There is no test verifying the interaction between the flat author: field (set by the contribution importer) and author_names: (also set by the importer). The first populated list wins, which is correct, but a test would add confidence.

Summary

The implementation is solid and security-aware. The most important fixes before merge are issue #1 (hardcoded slug in the test helper) and issue #2 (nested image: rewriting), since both could silently cause incorrect behaviour in production. Issues 3–7 are lower priority but worth addressing in a follow-up.

@claude

claude Bot commented Apr 30, 2026

Code Review: PR 333 - Website Contribution Import Support

OVERVIEW

This PR adds a contributions validate/import pipeline for public Website.Contributions bundles. New components: WebContributionProcessor (validates post bundles, imports into website repo), CLI routing for contributions/contribution sub-commands, and author metadata emission into JSON-LD, OpenGraph, RSS, Atom, and JSON Feed.

The design is sound and test coverage is solid. A few items need attention before merge.


ISSUES

== 1. FILE TOO LONG: WebContributionProcessor.cs (938 lines) ==

AGENTS.md documents a hard 800-line limit (node Build/linecount.js . 800). At 938 lines this file already fails that gate. Suggested split:

  • WebContributionValidator.cs: all Validate* methods and helpers
  • WebContributionImporter.cs: Import, RewritePostMarkdown, EnsureImportedAuthorMetadata
  • Keep WebContributionProcessor.cs as the thin orchestrator

== 2. DraftRegex matches indented nested draft: true ==

Pattern is (?m)^\s*draft\s*:\s*true\s*$. The leading \s* matches an indented 'draft: true' inside a nested YAML block (e.g. inside a metadata: sub-object). Should use (?m)^draft\s*:\s*true\s*$. The validation code itself uses ^ without \s* for other key patterns.
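A sketch of the anchored form, assuming only a top-level draft key should be recognised. The class and method names are hypothetical; the 1-second timeout mirrors the convention used for the other regexes in the PR.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TopLevelDraftSketch
{
    // "draft" must start at column zero, so nested "  draft: true" lines are ignored.
    private static readonly Regex TopLevelDraft = new Regex(
        @"(?m)^draft\s*:\s*true\s*$",
        RegexOptions.Compiled | RegexOptions.CultureInvariant,
        TimeSpan.FromSeconds(1));

    public static bool HasTopLevelDraft(string frontMatter) =>
        TopLevelDraft.IsMatch(frontMatter);
}
```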

== 3. EscapeYamlScalar is incomplete ==

Only backslash and double-quote are escaped. If an author Name or Title contains a tab, newline, or other YAML control character the generated front matter will be malformed. At minimum escape newline and carriage-return.
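A sketch of a fuller escaper for YAML double-quoted scalars. This is not the PR's EscapeYamlScalar; the name and the choice to emit \uXXXX for remaining control characters are assumptions.

```csharp
using System.Text;

public static class YamlScalarSketch
{
    // Escapes a value for use between double quotes in YAML.
    public static string EscapeDoubleQuoted(string value)
    {
        var builder = new StringBuilder(value.Length + 8);
        foreach (var ch in value)
        {
            switch (ch)
            {
                case '\\': builder.Append("\\\\"); break;
                case '"': builder.Append("\\\""); break;
                case '\n': builder.Append("\\n"); break;
                case '\r': builder.Append("\\r"); break;
                case '\t': builder.Append("\\t"); break;
                default:
                    // Any other control character becomes a \uXXXX escape.
                    if (char.IsControl(ch))
                        builder.Append("\\u").Append(((int)ch).ToString("X4"));
                    else
                        builder.Append(ch);
                    break;
            }
        }
        return builder.ToString();
    }
}
```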

== 4. Import: catalog File.WriteAllText can throw unhandled exception ==

The catalog write is not wrapped in a try/catch. On permission error or disk full, the exception bypasses the errors list and the try/catch in HandleContributions. Wrap in try/catch that appends to errors.
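A sketch of the suggested wrapper. The helper name, signature, and error wording are hypothetical; the exception filter covers the I/O and permission failures mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class CatalogWriteSketch
{
    // Writes the catalog and records I/O or permission failures in `errors` instead of throwing.
    public static bool TryWriteCatalog(string path, string json, ICollection<string> errors)
    {
        try
        {
            File.WriteAllText(path, json, new UTF8Encoding(false));
            return true;
        }
        catch (Exception ex) when (ex is IOException || ex is UnauthorizedAccessException)
        {
            errors.Add($"Failed to write author catalog '{path}': {ex.Message}");
            return false;
        }
    }
}
```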


OBSERVATIONS

  • Duplicate read helpers: WebContributionProcessor defines TryReadValue/TryReadString/ReadStringList while WebSiteBuilder.StructuredDataProfiles.cs has ReadMetaString/ReadMetaStringList. Consolidating would reduce divergence risk.
  • FrontMatterImageRegex scope is correct but fragile. An explicit test asserting og_image and image_alt are NOT rewritten would lock this in.
  • Error messages expose full filesystem paths in JSON output. Consider site-relative paths.
  • Dual alias routing (contributions + contribution) is a nice touch; document the canonical form.

WHAT IS WORKING WELL

  • Path traversal protection: TryResolveBundleAsset and TryResolveInside use GetFullPath plus prefix matching. The traversal test (../secret.png) covers this explicitly.
  • Regex timeouts: All compiled regexes carry a 1-second RegexTimeout, preventing ReDoS.
  • Fenced code block masking: MaskFencedCodeBlocks/ReplaceOutsideFencedCodeBlocks cleanly prevent image: and Markdown image rewrites inside code examples, tested explicitly.
  • ResolveYear fallback to 2000 on invalid dates keeps paths deterministic. Clean defensive design.
  • Cross-platform path handling: ToSlash/FromSlash consistent; case-insensitive on Windows.
  • 9 focused xUnit facts covering validate/import paths including partial-import and metadata-nesting.
  • Structured data author upgrade: Person[] replaces hardcoded Organization fallback, backed by new test.

SUMMARY - before merging:

  1. Must fix: Split WebContributionProcessor.cs to pass the 800-line gate.
  2. Must fix: Correct DraftRegex to not match indented nested keys.
  3. Should fix: Wrap catalog File.WriteAllText in try/catch recording to errors.
  4. Should fix: Strengthen EscapeYamlScalar for control characters (tab, newline).

@claude

claude Bot commented Apr 30, 2026

Code Review — PR #333: Add website contribution import support

Overview

This PR introduces a full contribution pipeline for the PowerForge website engine: contributors submit post bundles (Markdown + assets) to a sibling repo, and WebContributionProcessor validates the bundle layout, author profiles, asset types/sizes, and image references, then optionally imports them into the site repo. Author metadata is also propagated into JSON-LD, OpenGraph, RSS (dc:creator), Atom, and JSON Feed output. The architecture is clean and the test coverage is solid for the core scenarios.


Correctness Concerns

1. Front matter --- delimiter in post body may silently corrupt rewrites

TryGetFrontMatterBounds takes matches[1] as the closing delimiter, but --- on its own line is valid Markdown syntax for a thematic break (horizontal rule). Any post whose body contains a bare --- line will cause RewriteFrontMatter (called three times in RewritePostMarkdown) to treat everything up to that first body --- as front matter, potentially corrupting the output silently.

The import path should either fail-fast when matches.Count > 2 or document that --- separators in body text are unsupported (and add a validation check for them).
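A sketch of the validation-check option. It counts bare --- lines across the whole document and flags more than two; the names are hypothetical, and the sketch deliberately ignores fenced code blocks, so a --- inside a code example would also be counted.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class FrontMatterDelimiterSketch
{
    // Counts lines that consist solely of "---" (trailing whitespace allowed).
    public static int CountDelimiterLines(string markdown) =>
        markdown.Replace("\r\n", "\n")
            .Split('\n')
            .Count(line => line.TrimEnd() == "---");

    // Hypothetical check: anything beyond the opening and closing delimiter is reported.
    public static void ValidateDelimiters(string relative, string markdown, ICollection<string> errors)
    {
        if (CountDelimiterLines(markdown) > 2)
            errors.Add($"{relative}: body contains a bare '---' line; use '***' or '___' for thematic breaks.");
    }
}
```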

2. Atom feed entries without author_names have no fallback author

The Atom spec (RFC 4287 §4.1.2) requires that either the feed has a <author> element or every entry does. The new code adds <author> only when ResolveContentAuthorNames returns something. Entries without author metadata will produce no <author>, making the feed invalid if there is no feed-level author defined. At minimum, fall back to the site name (matching the existing JSON-LD Organization fallback), or add a feed-level <author>.

3. RemoveFrontMatterKeys is fragile for multi-line scalar values

The line-by-line removal logic skips lines that start with whitespace or - (list items). This works for the current author fields, but silently fails to remove a key whose value is a YAML block scalar (e.g. author: |) — the block body lines start with spaces and will be consumed, but only by accident. A YAML block-literal author field would not be removed cleanly. This is unlikely in practice, but worth either a validation check ("author field must be a scalar or sequence") or a comment documenting the limitation.

4. YamlDotNet deserializer permits YAML anchors/aliases

new DeserializerBuilder().Build() in AuthorDeserializer allows YAML anchors and aliases. A crafted author profile could use alias expansion to create extremely large in-memory structures (<<: *anchor bomb). Consider adding .DisableAliases() to the builder, which is a one-line defensive change for untrusted contributor input.


Minor Issues

5. File line budget is nearly reached

Per AGENTS.md, the project enforces an 800-line limit (node ./Build/linecount.js . 800). WebContributionProcessor.cs is 729 lines — 71 lines from the limit. The partial-class split with Validation.cs is already helping; if follow-up work lands in the main file it will hit the ceiling.

6. Error deduplication with Distinct may hide repeated legitimate errors

result.Errors = errors.Distinct(StringComparer.OrdinalIgnoreCase).ToArray();

Two different posts with the same validation error message (e.g., two posts both missing image_alt) would be collapsed into one error, making it appear only one post has the problem. Consider including the per-post path prefix (already done inside the per-post validators) and preserving duplicates, or use DistinctBy(e => e) only if you have a concrete deduplication need.

7. FrontMatterImageRegex replaces image: inside the front matter region but operates on the whole front matter string

In RewritePostMarkdown, the regex is applied inside RewriteFrontMatter so it only sees front matter text. However, the regex has RegexOptions.IgnoreCase and the comment says it intentionally excludes image_alt/image_url. These are excluded by the fact that ^image\s*: only matches exact image: at line start — the case-insensitive flag here is harmless but worth removing to match the comment's intent (and avoid future confusion when someone adds IMAGE_ALT: handling).

8. ResolveYear silently falls back to 2000 for invalid dates

private static int ResolveYear(DateTime? date)
{
    if (date is { Year: >= 2000 and <= 2100 } value)
        return value.Year;
    // Missing or invalid dates are validation errors; keep result paths deterministic until import is blocked.
    return 2000;
}

The comment is correct that validation will catch this upstream. However, if Process is ever called with Import = true without prior validation (e.g. through a future code path that bypasses the error-count check), assets would be silently placed under .../2000/<slug>/. A debug assertion or Debug.Assert(date is not null) here would make this contract explicit.


Missing Test Coverage

The following scenarios have no corresponding unit tests:

  • RSS dc:creator output — the new namespace and per-item dc:creator elements added to RenderRssOutput are not tested.
  • Atom <author> elements — the per-entry Atom author injection is not tested.
  • JSON Feed authors field — the JSON Feed author array is not tested.
  • OpenGraph article:author — the new <meta property="article:author"> tags are not tested.
  • Post body containing --- (thematic break) — ties to correctness concern #1.
  • Multi-author post import — author_names/author_urls arrays with two authors.

The existing WebSiteStructuredDataProfilesTests.cs test (Build_ArticleStructuredData_UsesOnlySafeFrontMatterImageOverride) is a good model for adding feed/OG tests.


What's Good

  • Path traversal protection is thorough: TryResolveBundleAsset, TryResolveInside, and TryNormalizeRelativeAssetPath all canonicalize paths and check the root prefix before accepting them.
  • Regex timeouts on all compiled regexes prevent ReDoS from crafted input.
  • Fenced code block masking correctly prevents rewriting image references inside YAML/code examples (and the test verifies this).
  • EscapeYamlScalar is applied to all author values written back to front matter — no YAML injection.
  • Force semantics are consistent: the check happens before any writes, and partial-failure continues processing remaining posts.
  • The test helper pattern (try/finally with temp directory cleanup) is reliable and matches existing test style.
  • Dual CLI alias (contributions / contribution) is user-friendly and low-risk.

Summary

The core logic is sound and the security-critical path traversal checks are well-done. The main items worth addressing before merge are the Atom feed validity gap (#2), the --- delimiter collision risk (#1), and the missing feed/OG tests. The YamlDotNet alias concern (#4) is low-severity but a trivial one-line fix. Everything else is minor polish.

@PrzemyslawKlys PrzemyslawKlys merged commit 3cda050 into main Apr 30, 2026
6 checks passed
@PrzemyslawKlys PrzemyslawKlys deleted the codex/contribution-author-metadata branch April 30, 2026 07:56