
Add sitemap SEO consistency guards#346

Merged
PrzemyslawKlys merged 5 commits into main from codex/seo-sitemap-canonical-guards on May 7, 2026

Conversation

@PrzemyslawKlys
Member

Summary

  • add PowerForge audit checks that compare sitemap URLs with generated page canonicals
  • flag duplicate sitemap <loc> entries and keep the existing noindex-in-sitemap validation in the same SEO pass
  • skip self-equivalent alias redirects while preserving query-based legacy redirects
  • add regression coverage for sitemap/canonical drift and redirect alias handling

Why

Google Search Console showed slash URL pages being treated as alternate canonicals because rendered pages declared a different canonical than the sitemap/runtime URL. This turns that class of SEO drift into an engine-level audit warning so PowerForge-built sites catch it before deploy.

Validation

  • dotnet test .\PowerForge.Tests\PowerForge.Tests.csproj --filter "FullyQualifiedName~WebSiteAuditSeoMetaTests|FullyQualifiedName~Build_SkipsAliasRedirects_ThatResolveToCanonicalRoute|FullyQualifiedName~Build_ExpandsAliasRedirects_ForSlashVariants" --no-restore
  • git diff --check

Note: the focused test run passed; it emitted transient copy retry warnings because another local PowerForge CLI audit was running at the same time.

@claude

claude Bot commented May 7, 2026

Code Review — PR #346: Add sitemap SEO consistency guards

Overview

This PR adds three useful capabilities to the PowerForge audit engine:

  1. Sitemap/canonical drift detection — flags <loc> entries whose target page declares a different canonical URL.
  2. Duplicate <loc> detection — flags repeated sitemap URLs.
  3. Self-equivalent alias redirect skipping — avoids emitting a redirect when an alias resolves to the same route as the page itself.

The motivation (Google Search Console treating slash/non-slash variants as alternate canonicals) is clear, and the regression tests are well-structured. A few things worth addressing:


Bugs / Correctness

1. Silent catch { continue; } in CollectSitemapSeoMetadata

catch
{
    continue;
}

Both the file-read error path and the AngleSharp parse error path are swallowed silently. The existing ValidateSitemapSeoConsistency issues a "warning" via addIssue when it can't parse sitemap.xml, and the main per-file loop in WebSiteAuditor.cs typically logs problematic files. A silent skip here means a corrupt/unreadable HTML file will cause that page's canonical to be absent from pagesByRoute with no indication — canonical-mismatch warnings will simply not fire. At minimum, consider logging a diagnostic or counting the skipped files and surfacing a summary issue.
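The counting idea can be sketched like so (self-contained; `htmlFiles` and `addIssue` are hypothetical stand-ins for the auditor's real inputs and issue sink):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-ins for the auditor's real file list and issue sink.
var htmlFiles = new List<string> { "exists.html", "missing.html" };
File.WriteAllText("exists.html", "<html></html>");
Action<string, string> addIssue = (severity, message) => Console.WriteLine($"{severity}: {message}");

var readErrorCount = 0;
var readErrorSamples = new List<string>();
foreach (var file in htmlFiles)
{
    try
    {
        _ = File.ReadAllText(file);
    }
    catch (Exception ex) when (ex is IOException or UnauthorizedAccessException)
    {
        // Skip this file, but remember that we did.
        readErrorCount++;
        if (readErrorSamples.Count < 5) readErrorSamples.Add(file);
        continue;
    }
    // ... parse and register canonical metadata here ...
}

if (readErrorCount > 0)
    addIssue("warning", $"sitemap SEO scan skipped {readErrorCount} unreadable HTML file(s): {string.Join(", ", readErrorSamples)}");
```

The sample cap keeps the diagnostic bounded while still surfacing that pages went missing from the canonical map.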

2. Existing test changed without explanation (Audit_FlagsSitemapNoIndexMismatch)

+ options.Include = new[] { "index.html" };

This modifies an existing passing test. Since CollectSitemapSeoMetadata ignores options.Include (intentionally using sitemapSeoHtmlFiles), this change must have been required for another reason. A brief comment explaining why this Include filter was added would prevent future confusion about whether the test is now over-constrained.

3. Silent last-write-wins in pagesByRoute

pagesByRoute[candidate] = metadata;

If two different pages both generate the same route candidate (e.g. a slug collision), the second silently overwrites the first. This means one of the pages will never be checked for canonical drift. A guard or warning for this case would make the behavior explicit.
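A guard could look like this minimal sketch (`Dictionary<,>.TryAdd` is available since .NET Core 2.0; the names are illustrative, not the PR's actual ones):

```csharp
using System;
using System.Collections.Generic;

var pagesByRoute = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

// First registration wins; a second page claiming the same route is reported
// instead of silently overwriting the first.
void Register(string candidate, string metadata, Action<string> warn)
{
    if (!pagesByRoute.TryAdd(candidate, metadata))
        warn($"route collision: '{candidate}' is claimed by more than one generated page");
}

Register("/allianz", "pageA.html", Console.WriteLine);
Register("/allianz", "pageB.html", Console.WriteLine); // triggers the warning
```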


Design / Quality

4. Double EnumerateHtmlFiles call

var sitemapSeoHtmlFiles = options.CheckSeoMeta
    ? EnumerateHtmlFiles(siteRoot, Array.Empty<string>(), options.Exclude, options.UseDefaultExcludes)
        .OrderBy(...)
        .ToList()
    : new List<string>();

On large sites this enumerates the filesystem twice. A comment explaining why Include is bypassed here (to avoid missing canonical data for pages not in the filtered set) would make the intent obvious. You might also consider whether allHtmlFiles (already enumerated above, without Include filtering in its base case) could be reused, avoiding the second enumeration.

5. Double-normalization in NormalizeRedirectComparisonPath

var normalized = NormalizeAlias(path).Trim();
// ...
var route = "/" + NormalizePath(pathOnly).Trim('/');

NormalizeAlias and NormalizePath are both called on the path component. If these two functions aren't idempotent when composed (e.g. different slug/dash handling), this could produce unexpected results for edge-case inputs. A test covering a path that goes through both transforms would provide confidence here.
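Such a test might be sketched as follows (xUnit; assumes `NormalizeRedirectComparisonPath` is visible to the test assembly, e.g. via `InternalsVisibleTo` — adjust to the project's actual accessibility):

```csharp
[Fact]
public void NormalizeRedirectComparisonPath_IsIdempotent_ForSlashVariantWithQuery()
{
    // Hypothetical accessor: the real method lives in WebSiteBuilder.RenderAssetsAndRouting.cs.
    var once = WebSiteBuilder.NormalizeRedirectComparisonPath("/My-Slug/?page_id=8328");
    var twice = WebSiteBuilder.NormalizeRedirectComparisonPath(once);

    // If NormalizeAlias + NormalizePath compose idempotently, re-normalizing is a no-op.
    Assert.Equal(once, twice);
}
```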

6. NormalizeExactSeoUrl silently drops fragments

The function retains uri.Query but discards uri.Fragment. For canonical URLs this is correct (fragments are client-side only), but it's a subtle omission. A brief comment would save future readers from wondering whether it was intentional.


Minor Nits

7. Trailing blank line — after the removal of the HasNoIndexRobots block in WebSiteAuditor.cs (around line 336 in the diff), there's a residual empty line that could be cleaned up.

8. Hint key format inconsistency — seo-sitemap-duplicate-loc-more and seo-sitemap-canonical-mismatch-more are fine, but the ValidateSitemapSeoConsistency method builds the hint for the "more" overflow case with a hardcoded string rather than deriving it from the prefix used in the loop. Extracting the prefix as a const would keep them in sync if the hint names ever change.


Test Coverage

The three new tests are well-structured:

  • They use isolated temp directories with Guid-namespaced paths — good, avoids cross-test contamination.
  • finally { Directory.Delete(..., true) } cleanup is correct.
  • The canonical-mismatch test correctly uses a non-trailing-slash canonical against a trailing-slash <loc> — this is the exact production scenario described in the PR.

One gap: there's no test for the "pagesByRoute collision / two pages with the same route" scenario mentioned in point 3 above.


Summary

The core logic is sound and solves a real problem. The main items to address before merge:

  • Fix the silent error swallowing in CollectSitemapSeoMetadata (or at minimum document the intent).
  • Explain the options.Include addition to the existing test.
  • Consider a comment on why sitemapSeoHtmlFiles bypasses Include.

Nice work on the alias self-skip logic — the NormalizeRedirectComparisonPath approach handles the query-string preservation case cleanly, and the test covers both the skip (slash-variant alias) and the keep (legacy ?page_id= alias) paths.

@claude

claude Bot commented May 7, 2026

Code Review — PR #346: Add sitemap SEO consistency guards

Overview

This PR introduces two distinct feature areas:

  1. Sitemap SEO auditing: canonicals/duplicates/noindex now checked in a unified post-build pass that always covers the full output tree, even when the main audit is scoped with Include.
  2. Self-equivalent alias redirect pruning: alias sources that normalise to the same route as the canonical are skipped at build time (but query-string legacy redirects are preserved).

The design intent is clear, the test coverage is solid, and the helper extraction is clean. A few things worth looking at:


Correctness / Bugs

Route collision silently drops the new page's metadata (WebSiteAuditor.Helpers.cs, CollectSitemapSeoMetadata)

if (pagesByRoute.TryGetValue(candidate, out var existing) && ...)
{
    routeCollisionCount++;
    AddSample(...);
    continue;   // ← only the continue skips the new entry
}
pagesByRoute[candidate] = metadata;

The continue skips registration for the colliding candidate within the inner foreach (var candidate ...) loop, so the winning entry is always whichever file was enumerated first. Because EnumerateHtmlFiles sorts by path, the winner is deterministic, but both the collision diagnostic and the canonical-mismatch check will silently miss the second file. That is a false-negative risk for sites with intentional multi-slug pages. Consider either:

  • documenting the "first-wins" policy in a comment, or
  • raising the issue at "info" level so the operator can investigate.

NormalizeRedirectComparisonPath strips query strings before comparison (WebSiteBuilder.RenderAssetsAndRouting.cs)

var queryIndex = normalized.IndexOf('?', StringComparison.Ordinal);
var query = queryIndex >= 0 ? normalized[queryIndex..] : string.Empty;
var pathOnly = queryIndex >= 0 ? normalized[..queryIndex] : normalized;
// ...
return route + query;   // query is appended back to the final value

The source path keeps its query string in the return value, so when the caller does source.Equals(target, ...) it will never match a canonical route (which has no query string), which is the correct behaviour. But the intermediate stripping-and-re-appending adds cognitive load. A comment like // query-bearing aliases can never equal a canonical route, but normalise the path segment for the comparison would prevent a future reader from thinking the query stripping is a bug.


Performance

sitemapSeoHtmlFiles ignores MaxHtmlFiles (WebSiteAuditor.cs, lines ~50-54)

var sitemapSeoHtmlFiles = options.CheckSeoMeta
    ? EnumerateHtmlFiles(siteRoot, Array.Empty<string>(), options.Exclude, options.UseDefaultExcludes)
          .OrderBy(...)
          .ToList()
    : new List<string>();

On a large site this enumerates and reads every HTML file a second time (once for the normal audit pass, once for the sitemap SEO pass), with a separate AngleSharp parse for each. For sites in the thousands of pages that could be seconds of extra wall time. The approach is correct (sitemap validation must see the full tree), but it's worth noting:

  • MaxHtmlFiles being ignored here is intentional, but a short comment would prevent someone from "fixing" it later.
  • If this runs frequently in CI you may want to fuse the two passes in a future iteration so each file is read and parsed only once.

Code Quality / Style

Three URL-normalisation helpers with overlapping responsibility

NormalizeRouteLikeValue, NormalizeComparableSeoUrl, and the new NormalizeExactSeoUrl all normalise URLs but with subtly different semantics. They're used in different contexts, but the naming is a little close. A one-line doc-comment on each clarifying when trailing slashes are preserved vs stripped would make the intent explicit and prevent accidental cross-use.

AddSample is a thin single-concern helper — consider inlining or a named utility class

private static void AddSample(List<string> samples, string value, int sampleLimit)
{
    if (samples.Count < sampleLimit && !string.IsNullOrWhiteSpace(value))
        samples.Add(value);
}

No issue with extracting it, but the sampleLimit parameter is always the compile-time constant 5. If it's truly fixed, passing sampleLimit as a parameter rather than a constant makes callers slightly noisy. Either inline the guard or make sampleLimit a named const at class scope.

ValidateSitemapSeoConsistency is now a large method (~100 lines)

The three logical passes (duplicate-locs, noindex, canonical-mismatch) each have their own list-building + issue-emitting loop. Extracting each into a private method (e.g. CheckDuplicateSitemapLocs, CheckNoIndexInSitemap, CheckCanonicalMismatches) would keep each concern testable independently and lower the cognitive load. Not a blocker, just a readability note.


Tests

The three new tests are well-structured: they create isolated temp directories, build or audit a minimal site, assert on the output, and clean up in finally. A few observations:

  • Build_SkipsAliasRedirects_ThatResolveToCanonicalRoute correctly asserts absence of the self-equivalent redirect and presence of the query-string redirect. Good.
  • The two new WebSiteAuditSeoMetaTests tests set options.Include = new[] { "index.html" } to exercise the "sitemap pass ignores Include scope" behaviour. This is the right approach and the comment explains why.
  • Missing test: there is no test for the route-collision diagnostic (seo-sitemap-route-collision). Given that the collision logic silently drops data (see the correctness note above), a test that verifies the warning is emitted would catch regressions.

Summary

| Area | Status |
| --- | --- |
| Feature correctness | ✅ looks correct for the stated scenarios |
| Route-collision first-wins policy | ⚠️ works but silent — add a comment or an info-level issue |
| Performance (double full-tree scan) | ⚠️ acceptable now, worth a comment about the intentional MaxHtmlFiles bypass |
| Normalisation helper naming | ⚠️ slightly ambiguous — consider short comments |
| Test coverage | ✅ good; add a collision-diagnostic test |
| Style / conventions | ✅ consistent with the surrounding code |

Overall a solid addition. The correctness concern around the route-collision path and the missing test are the two things I'd resolve before merging; the rest are polish.

@claude

claude Bot commented May 7, 2026

Code Review — PR #346: Add sitemap SEO consistency guards

Overview

This PR introduces three related improvements:

  1. Sitemap/canonical drift detection: A new audit pass that compares every sitemap <loc> against the corresponding generated page's canonical tag and flags mismatches.
  2. Duplicate <loc> detection: Flags sitemap entries with the same URL appearing more than once.
  3. Self-alias redirect suppression: Skips generating redirect entries where the alias source resolves to the same canonical route (e.g. /allianz/ to /allianz for a page whose slug is allianz), while preserving query-based legacy redirects like /?page_id=8328.

The motivation is real (Google Search Console treating trailing-slash variants as alternates due to sitemap/canonical disagreement), and the implementation is generally clean. Notes below.


Strengths

  • Good separation of concerns. Splitting CollectSitemapSeoMetadata from ValidateSitemapSeoConsistency is the right call — collection and validation are independent passes and each is independently testable.
  • sealed record types for SitemapPageSeoMetadata and SitemapSeoScan are idiomatic C# and keep the data shapes lightweight.
  • 50-item caps on issue reporting prevent flooding the audit output for large broken sitemaps.
  • Sample-limited error lists (cap 5) in the metadata collection pass prevent unbounded memory growth when many files are unreadable.
  • Tests are well-structured: dedicated [Fact] per scenario, real temp-directory fixtures, proper try/finally cleanup. Build_SkipsAliasRedirects_ThatResolveToCanonicalRoute is especially crisp — it verifies both that the self-alias is gone and that the query-based alias is kept.

Concerns

Performance: double file-system walk

var sitemapSeoHtmlFiles = options.CheckSeoMeta
    ? EnumerateHtmlFiles(siteRoot, Array.Empty<string>(), options.Exclude, options.UseDefaultExcludes)
        .OrderBy(...).ToList()
    : new List<string>();
var allHtmlFiles = EnumerateHtmlFiles(siteRoot, options.Include, ...)
    .OrderBy(...).ToList();

When CheckSeoMeta is true this walks the file system twice — once for the Include-filtered allHtmlFiles and again (unfiltered) for sitemapSeoHtmlFiles. For large sites this is noticeable. A cheaper alternative: collect the full unfiltered set first, then derive allHtmlFiles by filtering it in memory.

Bare catch blocks swallow fatal exceptions

catch
{
    readErrorCount++;
    AddSample(readErrorSamples, relativePath, sampleLimit);
    continue;
}

Bare catch catches OutOfMemoryException, StackOverflowException, etc. Prefer catch (IOException) / catch (UnauthorizedAccessException) for the file-read case, and catch (Exception) for the parse case (parse failures are unrecoverable per-file, not per-process — Exception is appropriate there).

Route collision: first-registered page wins silently

if (pagesByRoute.TryGetValue(candidate, out var existing) && ...)
{
    routeCollisionCount++;
    AddSample(routeCollisionSamples, ...);
    continue;  // skips ALL remaining candidates for this metadata
}
pagesByRoute[candidate] = metadata;

The continue exits the inner foreach over candidates, so a page that collides on its first candidate is not registered under any of its other candidates either. This is the right choice to avoid partial state, but it means mismatches for that page are silently undetectable. A short comment explaining first-registered-wins would help future readers.

Also: the collision count over-counts — a page with N colliding candidates increments routeCollisionCount N times instead of once per colliding page. Consider tracking unique colliding pages instead of raw candidate hits.
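Tracking unique colliding pages with a HashSet would fix the over-count. A self-contained sketch with illustrative names (not the PR's actual identifiers):

```csharp
using System;
using System.Collections.Generic;

// Routes already claimed by an earlier-enumerated page.
var pagesByRoute = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
    ["/a"] = "first.html",
    ["/b"] = "first.html",
};
var collidingPages = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

void TryRegister(string candidate, string page)
{
    if (pagesByRoute.ContainsKey(candidate))
    {
        collidingPages.Add(page); // HashSet dedupes repeat hits from the same page
        return;
    }
    pagesByRoute[candidate] = page;
}

// second.html collides on two candidates but is counted once.
TryRegister("/a", "second.html");
TryRegister("/b", "second.html");
Console.WriteLine(collidingPages.Count); // 1
```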

Layered normalization in NormalizeRedirectComparisonPath

var normalized = NormalizeAlias(path).Trim();
// ...
var route = "/" + NormalizePath(pathOnly).Trim('/');

This chains NormalizeAlias then NormalizePath then manual leading-slash logic. It is hard to reason about what each layer does in combination. A comment identifying which step handles which input class (URL-encoded chars, trailing slashes, path separators) would help, as would tests for edge cases like //slug or /slug#anchor.

NormalizeRootNotFoundPublicRoute placement

This helper lives in WebSiteBuilder.Navigation.LocalizationAndVersioning.cs, which is focused on language routing. The 404 path normalization is a canonical-route concern; consider moving it alongside other output-path normalization helpers.

Minor: naming ambiguity

Two similar methods now coexist — NormalizeExactSeoUrl (new) and NormalizeComparableSeoUrl (existing). "Exact" vs "Comparable" does not immediately convey the behavioral difference. A short inline comment or a more descriptive name would help.


Test Coverage Gaps

Coverage is solid for the happy paths. A few gaps worth considering:

  • Route collision in canonical scan: No test covers what happens when two pages claim the same route candidate with different canonicals.
  • Sitemap entry with no matching generated page: No test confirms that unmatched sitemap entries are silently skipped rather than reported as mismatches.
  • The modification to Audit_FlagsSitemapNoIndexMismatch_WhenNoIndexRouteIsInSitemap (adding options.Include) is correct, but a comment explaining why the sitemap pass ignores the Include filter would help readers unfamiliar with the new flow.

Summary

The core logic is sound and well-motivated. Main items worth addressing before merge:

| Priority | Item |
| --- | --- |
| Medium | Replace bare catch with typed exception filters |
| Medium | Fix over-counting in route collision tracking |
| Low | Avoid double file-system walk when CheckSeoMeta is true |
| Low | Add comment explaining first-registered-wins semantics for route collisions |
| Low | Relocate NormalizeRootNotFoundPublicRoute to a more appropriate file |

Happy to discuss any of these — the feature itself is a good addition to the SEO audit pass.

@claude

claude Bot commented May 7, 2026

Code Review — PR #346: Add sitemap SEO consistency guards

Overview

This PR adds three SEO consistency features to the PowerForge engine:

  1. Sitemap canonical mismatch detection — flags when a sitemap <loc> URL differs from the page's <link rel="canonical">.
  2. Duplicate <loc> detection — flags when the same URL appears more than once in the sitemap.
  3. Self-equivalent alias redirect suppression — prevents writing a redirect when an alias resolves to the same route as the page itself.

It also includes a 404-route localization fix and extends the separate sitemap scan to bypass Include/MaxHtmlFiles scoping.

The motivation is solid and well-explained (Google Search Console slash-URL alternate-canonical issue). Tests cover the golden path for all three new behaviors. Overall this is a good change; a few things are worth addressing.


Issues

Double File Enumeration / Double HTML Parse (performance)

WebSiteAuditor.cs, ~line 41–48

allHtmlFiles is already enumerated with options.Include applied. sitemapSeoHtmlFiles then re-enumerates the entire site unconditionally. When options.Include is empty (the common case — no scoping), both lists contain the same files, so:

  • The filesystem is walked twice.
  • Every HTML file is read and AngleSharp-parsed twice (once in the main audit loop, once inside CollectSitemapSeoMetadata).

The old approach accumulated noIndexRoutes inline during the main pass at zero extra cost. The new two-pass approach trades that for a cleaner API, but at non-trivial I/O cost on large sites.

Suggested fix: Reuse allHtmlFiles when options.Include is empty:

var sitemapSeoHtmlFiles = options.CheckSeoMeta
    ? (options.Include is { Length: > 0 }
        ? EnumerateHtmlFiles(siteRoot, Array.Empty<string>(), options.Exclude, options.UseDefaultExcludes)
              .OrderBy(path => path, StringComparer.OrdinalIgnoreCase)
              .ToList()
        : allHtmlFiles)
    : new List<string>();

This avoids the redundant walk in the no-Include path and keeps the bypass behavior intact when a sample scope is active.


Route Collision Silencing May Produce False Positives

WebSiteAuditor.Helpers.cs, CollectSitemapSeoMetadata, inner collision block

When two pages map to the same route candidate, the first-registered metadata wins. This is deterministic because files are sorted by path, but if the first-registered page has a wrong canonical URL while the second has the correct one, the canonical mismatch detection will incorrectly fire for the winning (wrong) entry. In practice this should be rare, but the failure mode is confusing because the reported RelativePath will be the winner, not the actual collision source.

The comment says "keep first-registered route metadata deterministic while reporting the colliding page once" — worth also noting that the winning entry is path-sort–ordered first to make this verifiable in future debugging.


NormalizeRedirectComparisonPath — Misleading Comment

WebSiteBuilder.RenderAssetsAndRouting.cs, ~line 1207

// Query-bearing aliases can never equal a canonical route, but the path segment still needs route normalization.

The comment implies the query check is a short-circuit, but NormalizeRedirectComparisonPath is called on both sides of the comparison. The route (target) will never have a query, so the query suffix is appended on the source side and produces a guaranteed non-match. The comment should say something like: "Query-bearing aliases are never skipped — they redirect query-param legacy URLs to the canonical route." The current phrasing slightly obscures that query aliases like /?page_id=8328 are intentionally preserved.


catch (Exception) Swallows Parse Errors Silently

WebSiteAuditor.Helpers.cs, CollectSitemapSeoMetadata, ~line 316

catch (Exception)
{
    parseErrorCount++;
    AddSample(parseErrorSamples, relativePath, sampleLimit);
    continue;
}

The exception type is not captured, so transient runtime errors (e.g., OutOfMemoryException from a malformed 50 MB HTML file) are silently counted alongside normal parse failures. At minimum use when (ex is not OutOfMemoryException and not StackOverflowException), or capture ex to log type info in the sample. The read-error handler directly above correctly filters IOException | UnauthorizedAccessException.


Minor Notes

NormalizeRootNotFoundPublicRoute — The 404-slug special-case mapping is correct, but a one-line comment would help future readers understand the invariant (error pages are served at /404.html, not /404/). The existing test in Part2b verifies this, but the production code has no hint.

BuildGeneratedPageRouteCandidates vs AddGeneratedPageRouteCandidates — The two methods have near-identical names but opposite return conventions (array vs void). Consider PopulateGeneratedPageRouteCandidates for the mutating variant to reduce confusion.

Duplicate LOC cap message — The overflow message says "additional duplicate URL groups" but duplicateLocs.Length - 50 counts groups (after GroupBy), not individual URLs. The wording is accurate but could be clarified: "additional groups of duplicate URLs".

AGENTS.md line-count gate — WebSiteAuditor.Helpers.cs gains ~138 lines in this PR. Worth running node .\Build\linecount.js . 800 locally to verify the file is still under the 800-line limit.


What's Good

  • The SitemapSeoScan / SitemapPageSeoMetadata record types are idiomatic and clean.
  • NormalizeExactSeoUrl correctly handles scheme normalization, IDN hosts, and port canonicalization while intentionally stripping fragments (comment explains why).
  • The 50-entry cap + overflow message pattern for issue reporting is consistent with the rest of the auditor and prevents log floods on large sites.
  • IsAliasRedirectSourceEquivalentToRoute correctly handles the query-bearing alias case — /?page_id=8328 is not considered equivalent to /allianz and is preserved.
  • Test coverage is well-targeted; the trailing-slash canonical mismatch scenario mirrors the real-world trigger described in the PR.
  • The existing Audit_FlagsSitemapNoIndexMismatch test correctly adds options.Include to confirm the sitemap pass bypasses page-sample scoping.

Summary

The core logic is correct and the test coverage is solid. The main concerns are:

  1. Double parse cost — should be fixed before merging to avoid a latent regression on sites with large output trees.
  2. Swallowed parse exception — minor but should filter fatal exceptions.
  3. The comment in NormalizeRedirectComparisonPath should be clarified to avoid future misreading.

Everything else is minor polish.

@claude

claude Bot commented May 7, 2026

Code Review — PR #346: Add sitemap SEO consistency guards

Overview

This PR introduces engine-level SEO drift detection into the PowerForge audit pipeline. The three concerns it tackles (sitemap/canonical mismatches, duplicate <loc> entries, self-equivalent alias redirects) are real problems worth catching at build time. The motivation is clear and the implementation is largely well-structured.


Strengths

  • Clean separation — extracting CollectSitemapSeoMetadata cleanly separates the HTML-scanning pass from the sitemap-validation pass. Both remain readable.
  • Per-file error isolation — IOException/UnauthorizedAccessException are caught individually; the scan continues on partial failures and surfaces sampled diagnostics rather than aborting.
  • Capped reporting — the 50-item cap on emitted issues and 5-sample diagnostic cap prevent log flooding on large sites.
  • Sealed records — SitemapPageSeoMetadata and SitemapSeoScan are appropriately concise value types.
  • Test quality — each new behavior has a dedicated test with realistic fixture data, a meaningful assertion, and proper temp-dir cleanup in a finally block.

Issues & Suggestions

1. Potential NullReferenceException on options.Include (WebSiteAuditor.cs)

var hasHtmlIncludeScope = options.Include.Any(pattern => !string.IsNullOrWhiteSpace(pattern));

If Include is nullable and initialized to null by default, this throws. The test that exercises this path sets options.Include = new[] { ... } explicitly, so the null case is never exercised. Consider a null-guard:

var hasHtmlIncludeScope = options.Include?.Any(pattern => !string.IsNullOrWhiteSpace(pattern)) == true;

2. NormalizeRootNotFoundPublicRoute only handles the exact root 404 (WebSiteBuilder.RenderAssetsAndRouting.cs)

return normalizedPath.Equals("404", StringComparison.OrdinalIgnoreCase) ||
       normalizedPath.Equals("404.html", StringComparison.OrdinalIgnoreCase)
    ? "/404.html" + suffix
    : route;

NormalizePath strips leading slashes, so /404 and /404.html normalise to 404 and 404.html respectively, and both are matched. But /en/404 (a localized 404 page) normalises to en/404, which this does not match. If the localization layer can produce localised 404 paths, those will still be advertised as /en/404/ rather than the correct /en/404.html. Whether that's in-scope for this PR depends on whether localized 404s are generated, but it's worth noting.

3. Minor: LINQ double-evaluation of group.Count() in duplicate detection (WebSiteAuditor.Helpers.cs)

.Where(group => group.Count() > 1)  // evaluated once in the filter
...
$"sitemap includes duplicate URL '{duplicate.First().Url}' ({duplicate.Count()} entries)."  // evaluated again per-group

After .ToArray(), each IGrouping backed by a real in-memory enumerable will re-enumerate on each .Count() / .First() call. For small sitemaps this is harmless, but a simple materialization avoids the ambiguity:

var duplicateLocs = locs
    .Where(loc => !string.IsNullOrWhiteSpace(loc.ComparableUrl))
    .GroupBy(loc => loc.ComparableUrl, StringComparer.OrdinalIgnoreCase)
    .Select(group => group.ToList())
    .Where(group => group.Count > 1)
    .ToArray();

Or simply store the count alongside: .Select(group => (Count: group.Count(), Items: group.ToList())).

4. Minor: verbose exception filter style

catch (Exception ex) when (ex is IOException || ex is UnauthorizedAccessException)

C# 9 pattern matching can simplify this:

catch (Exception ex) when (ex is IOException or UnauthorizedAccessException)

5. routeCollisionRecordedForPage flag silently drops remaining candidates on collision

In CollectSitemapSeoMetadata, once a collision is recorded for a page, subsequent colliding candidates all hit continue. Non-colliding candidates are still registered. This is intentional (the comment explains "keep first-registered deterministic"), but if a page has 3 route candidates and candidates 1 and 3 collide while candidate 2 doesn't, candidate 2 is still registered (correct) but candidate 3 is silently dropped (potentially surprising). The current logic is defensible, but a brief comment explaining why we continue on collision (not just that we record only once) would help future readers.

6. Second EnumerateHtmlFiles call when hasHtmlIncludeScope is true

? EnumerateHtmlFiles(siteRoot, Array.Empty<string>(), options.Exclude, options.UseDefaultExcludes)
    .OrderBy(...)
    .ToList()
: allHtmlFiles

The second enumeration re-walks the file system. In most real-world scenarios the audit is either fully scoped or fully unscoped, so this is rarely hit — but it's worth knowing it exists. No action needed unless this surfaces as a perf issue.


Test Coverage

| Scenario | Covered |
| --- | --- |
| Sitemap/canonical mismatch | Audit_FlagsSitemapCanonicalMismatch_WhenSitemapLocDiffersFromPageCanonical |
| Duplicate <loc> entries | Audit_FlagsDuplicateSitemapLocs_WhenSameUrlAppearsMoreThanOnce |
| Self-alias redirect skipped | Build_SkipsAliasRedirects_ThatResolveToCanonicalRoute |
| Query-alias redirect preserved | ✅ same test (/?page_id=8328) |
| 404 canonical assertion | ✅ extended in Build_WritesRoot404HtmlForNotFoundSlug |
| Overflow (>50 issues) | ❌ not tested — acceptable for edge cases |
| noindex-in-sitemap still fires | ✅ existing test updated with options.Include scope guard |

The regression coverage is solid. The > 50 overflow path is a low-risk omission.


Summary

The approach is sound and solves a genuine problem. Issues 1 (potential NPE on options.Include) and 2 (localized 404 may not be normalized) are the only ones worth addressing before merge; the rest are minor style suggestions. Overall this is a clean addition that fits the existing patterns in the audit pipeline.

PrzemyslawKlys merged commit c8bb6e5 into main on May 7, 2026
6 checks passed
PrzemyslawKlys deleted the codex/seo-sitemap-canonical-guards branch on May 7, 2026 at 08:30