# Add sitemap SEO consistency guards #346
## Code Review — PR #346: Add sitemap SEO consistency guards

### Overview

This PR adds three useful capabilities to the PowerForge audit engine:

- Sitemap/canonical mismatch detection
- Duplicate sitemap `<loc>` detection
- Noindex-in-sitemap validation in the same SEO pass
The motivation (Google Search Console treating slash/non-slash variants as alternate canonicals) is clear, and the regression tests are well-structured. A few things worth addressing:

### Bugs / Correctness

**1. Silent `catch`**

```csharp
catch
{
    continue;
}
```

Both the file-read error path and the AngleSharp parse error path are swallowed silently. Replacing the bare `catch` with typed exception filters (or at least counting the failures) would keep real errors from being masked.

**2. Existing test changed without explanation**

```diff
+ options.Include = new[] { "index.html" };
```

This modifies an existing passing test. Since the PR description doesn't mention it, a short note explaining why the `Include` scope is now needed would help reviewers.

**3. Silent last-write-wins in `pagesByRoute`**

```csharp
pagesByRoute[candidate] = metadata;
```

If two different pages both generate the same route candidate (e.g. a slug collision), the second silently overwrites the first. This means one of the pages will never be checked for canonical drift. A guard or warning for this case would make the behavior explicit.

### Design / Quality

**4. Double enumeration**

```csharp
var sitemapSeoHtmlFiles = options.CheckSeoMeta
    ? EnumerateHtmlFiles(siteRoot, Array.Empty<string>(), options.Exclude, options.UseDefaultExcludes)
        .OrderBy(...)
        .ToList()
    : new List<string>();
```

On large sites this enumerates the filesystem twice. A comment explaining why the second, unscoped enumeration is required (sitemap validation must see the full tree) would help future readers.

**5. Double-normalization**

```csharp
var normalized = NormalizeAlias(path).Trim();
// ...
var route = "/" + NormalizePath(pathOnly).Trim('/');
```

The value passes through two different normalization helpers; consolidating them, or documenting why both are needed, would reduce drift.

**6. Query string retained** — the normalization helper appends the query string back onto its return value. If that's intentional (query-alias redirects must survive), a brief comment would make it explicit.

**7. Trailing blank line** — a stray blank line is left behind after a removed block.

**8. Hint key format inconsistency** — the new hint keys don't match the format of the existing ones.

### Test Coverage

The three new tests are well-structured.
One gap: there's no test for the "pagesByRoute collision / two pages with the same route" scenario mentioned in point 3 above.

### Summary

The core logic is sound and solves a real problem. The main items to address before merge:

1. Replace the silent `catch` with explicit error handling (point 1)
2. Guard or warn on route collisions (point 3)
3. Add a regression test for the collision scenario
Nice work on the alias self-skip logic.

---
## Code Review — PR #346: Add sitemap SEO consistency guards

### Overview

This PR introduces two distinct feature areas:

- Sitemap SEO consistency checks (duplicate `<loc>`s, noindex-in-sitemap, canonical mismatch)
- Alias/redirect handling and 404-route normalization
The design intent is clear, the test coverage is solid, and the helper extraction is clean. A few things worth looking at:

### Correctness / Bugs

**Route collision silently drops the new page's metadata**

```csharp
if (pagesByRoute.TryGetValue(candidate, out var existing) && ...)
{
    routeCollisionCount++;
    AddSample(...);
    continue; // ← only the continue skips the new entry
}
pagesByRoute[candidate] = metadata;
```

When the guard matches, the new page's metadata is never registered, so that page silently drops out of the canonical-mismatch pass; when it doesn't match, the assignment overwrites the first entry. Either way one page goes unchecked without any emitted issue — worth surfacing a warning.

**Query string survives normalization**
```csharp
var queryIndex = normalized.IndexOf('?', StringComparison.Ordinal);
var query = queryIndex >= 0 ? normalized[queryIndex..] : string.Empty;
var pathOnly = queryIndex >= 0 ? normalized[..queryIndex] : normalized;
// ...
return route + query; // query is appended back to the final value
```

The source path keeps its query string in the return value, so a caller comparing that value against a query-less canonical route will never see a match. If preserving the query is intentional (query-alias redirects), a comment saying so would help.
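To make the comparison pitfall concrete, here is a minimal self-contained sketch of the same query-splitting behavior — the helper name and sample values are invented for illustration, not taken from the PR:

```csharp
using System;

static class QuerySplitSketch
{
    // Mirrors the reviewed logic: split on '?' and re-append the query.
    public static string Normalize(string path)
    {
        var queryIndex = path.IndexOf('?');
        var query = queryIndex >= 0 ? path[queryIndex..] : string.Empty;
        var pathOnly = queryIndex >= 0 ? path[..queryIndex] : path;
        var route = "/" + pathOnly.Trim('/');
        return route + query; // the query survives into the return value
    }

    public static void Main()
    {
        // The alias keeps its query, so it never equals the canonical route.
        Console.WriteLine(Normalize("/?page_id=8328"));        // "/?page_id=8328"
        Console.WriteLine(Normalize("/?page_id=8328") == "/"); // False
    }
}
```

Any equality check against a query-less route therefore has to compare the path-only part, not the full return value.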
### Performance

```csharp
var sitemapSeoHtmlFiles = options.CheckSeoMeta
    ? EnumerateHtmlFiles(siteRoot, Array.Empty<string>(), options.Exclude, options.UseDefaultExcludes)
        .OrderBy(...)
        .ToList()
    : new List<string>();
```

On a large site this enumerates and reads every HTML file a second time (once for the normal audit pass, once for the sitemap SEO pass), with a separate AngleSharp parse for each. For sites in the thousands of pages that could be seconds of extra wall time. The approach is correct (sitemap validation must see the full tree), but it's worth noting:

- a code comment explaining why the second, unscoped walk is required
- reusing the already-enumerated file list when `options.Include` is empty, since the two walks return the same files in that case
### Code Quality / Style

**Two URL-normalisation helpers with overlapping responsibility** — `NormalizeAlias` and `NormalizePath` both trim and normalise paths; consolidating them, or documenting the split, would prevent the two from drifting apart.

**`AddSample` helper**

```csharp
private static void AddSample(List<string> samples, string value, int sampleLimit)
{
    if (samples.Count < sampleLimit && !string.IsNullOrWhiteSpace(value))
        samples.Add(value);
}
```

No issue with extracting it, but note that the whitespace guard silently skips blank values, so the recorded samples can undercount relative to any associated counter.
**Repeated pass structure** — the three logical passes (duplicate-locs, noindex, canonical-mismatch) each have their own list-building + issue-emitting loop. Extracting each into a small private method would keep the main audit body readable.

### Tests

The three new tests are well-structured: they create isolated temp directories, build or audit a minimal site, assert on the output, and clean up afterwards.
### Summary

Overall a solid addition. The correctness concern around the route-collision path and the missing test are the two things I'd resolve before merging; the rest are polish.

---
## Code Review — PR #346: Add sitemap SEO consistency guards

### Overview

This PR introduces three related improvements:

- Sitemap/canonical mismatch detection
- Duplicate sitemap `<loc>` detection
- Noindex-in-sitemap validation
The motivation is real (Google Search Console treating trailing-slash variants as alternates due to sitemap/canonical disagreement), and the implementation is generally clean. Notes below.

### Strengths

- Real, well-explained motivation
- Well-structured regression tests for each new behavior
- Clean helper extraction
### Concerns

**Performance: double file-system walk**

```csharp
var sitemapSeoHtmlFiles = options.CheckSeoMeta
    ? EnumerateHtmlFiles(siteRoot, Array.Empty<string>(), options.Exclude, options.UseDefaultExcludes)
        .OrderBy(...).ToList()
    : new List<string>();
var allHtmlFiles = EnumerateHtmlFiles(siteRoot, options.Include, ...)
    .OrderBy(...).ToList();
```

When `CheckSeoMeta` is true the tree is walked twice; when `options.Include` is empty the two enumerations return the same files, so the second walk is pure overhead.

**Bare `catch`**

The new scan swallows every exception type. Typed exception filters would keep real failures (I/O errors, permissions, malformed HTML) from being masked.
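As a sketch of the typed-filter suggestion — the helper name and counter are invented; the real catch sites are the file-read and parse paths in the PR:

```csharp
using System;
using System.IO;

static class TypedCatchSketch
{
    public static int ReadFailures;

    // Catch only what a file read can realistically throw, and count the
    // failure instead of silently continuing past it.
    public static string? TryReadHtml(string path)
    {
        try
        {
            return File.ReadAllText(path);
        }
        catch (Exception ex) when (ex is IOException or UnauthorizedAccessException)
        {
            ReadFailures++; // surfaced later as an audit warning
            return null;
        }
    }

    public static void Main()
    {
        var missing = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".html");
        // FileNotFoundException derives from IOException, so the filter catches it.
        Console.WriteLine(TryReadHtml(missing) is null); // True
        Console.WriteLine(ReadFailures);                 // 1
    }
}
```

An unexpected exception type (a genuine engine bug) would still propagate instead of being hidden.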
### Summary
| Priority | Item |
|---|---|
| Medium | Replace bare catch with typed exception filters |
| Medium | Fix over-counting in route collision tracking |
| Low | Avoid double file-system walk when CheckSeoMeta is true |
| Low | Add comment explaining first-registered-wins semantics for route collisions |
| Low | Relocate NormalizeRootNotFoundPublicRoute to a more appropriate file |
Happy to discuss any of these — the feature itself is a good addition to the SEO audit pass.

---
## Code Review — PR #346: Add sitemap SEO consistency guards

### Overview

This PR adds three SEO consistency features to the PowerForge engine:

- Sitemap/canonical mismatch detection
- Duplicate sitemap `<loc>` detection
- Noindex-in-sitemap validation
It also includes a 404-route localization fix and extends the separate sitemap scan to bypass `options.Include` scoping. The motivation is solid and well-explained (the Google Search Console slash-URL alternate-canonical issue). Tests cover the golden path for all three new behaviors. Overall this is a good change; a few things are worth addressing.

### Issues

**Double file enumeration / double HTML parse (performance)**
The sitemap pass re-enumerates the tree and re-parses every HTML file the main audit already visited. Suggested fix: reuse `allHtmlFiles` when no `Include` filter is set:

```csharp
var sitemapSeoHtmlFiles = options.CheckSeoMeta
    ? (options.Include is { Length: > 0 }
        ? EnumerateHtmlFiles(siteRoot, Array.Empty<string>(), options.Exclude, options.UseDefaultExcludes)
            .OrderBy(path => path, StringComparer.OrdinalIgnoreCase)
            .ToList()
        : allHtmlFiles)
    : new List<string>();
```

This avoids the redundant walk in the no-`Include` case.

**Route collision silencing may produce false positives**
When two pages map to the same route candidate, the first-registered metadata wins. This is deterministic because files are sorted by path, but if the first-registered page has a wrong canonical URL while the second has the correct one, the canonical mismatch detection will incorrectly fire for the winning (wrong) entry. In practice this should be rare, but the failure mode is confusing because the reported mismatch points at metadata from a page the author may not expect to have won the route.

The comment says "keep first-registered route metadata deterministic while reporting the colliding page once" — worth also noting that the winning entry is simply whichever file sorts first by path, to make this verifiable in future debugging.
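The first-registered-wins semantics could also be made explicit with `Dictionary<TKey,TValue>.TryAdd`, which keeps registration and collision reporting in one place. A minimal sketch — the file names and routes are invented, not taken from the PR:

```csharp
using System;
using System.Collections.Generic;

static class FirstWinsSketch
{
    // Registers pages in arrival order; TryAdd leaves the first
    // registration in place and records every later collision.
    public static (Dictionary<string, string> Pages, List<string> Collisions)
        Register((string File, string Route)[] inputs)
    {
        var pages = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        var collisions = new List<string>();
        foreach (var (file, route) in inputs)
        {
            if (!pages.TryAdd(route, file))
                collisions.Add($"{route}: kept {pages[route]}, dropped {file}");
        }
        return (pages, collisions);
    }

    public static void Main()
    {
        // Files arrive pre-sorted by path, so the winner is deterministic.
        var (pages, collisions) = Register(new[]
        {
            ("about.html", "/about"),
            ("about/index.html", "/about"),
        });

        Console.WriteLine(pages["/about"]);    // about.html — first registration wins
        Console.WriteLine(collisions.Count);   // 1
    }
}
```

The dropped file is still named in the collision sample, which makes the "which canonical won?" question answerable from the audit output.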
---
## Code Review — PR #346: Add sitemap SEO consistency guards

### Overview

This PR introduces engine-level SEO drift detection into the PowerForge audit pipeline. The three concerns it tackles (sitemap/canonical mismatches, duplicate `<loc>` entries, and noindex pages listed in the sitemap) are all real sources of drift.

### Strengths

- Solid regression coverage for the new behaviors
- The change fits the existing patterns in the audit pipeline
### Issues & Suggestions

1. **Potential NPE on `options.Include`** — the new code path appears to assume a non-null `Include` array; a null guard (or null-coalescing to an empty array) would be safer.
2. **Localized 404 may not be normalized** — the 404-route localization fix should run its route through the same normalization as other routes; otherwise localized 404 pages can drift from the canonical form.
### Test Coverage
| Scenario | Covered |
|---|---|
| Sitemap/canonical mismatch | ✅ `Audit_FlagsSitemapCanonicalMismatch_WhenSitemapLocDiffersFromPageCanonical` |
| Duplicate `<loc>` entries | ✅ `Audit_FlagsDuplicateSitemapLocs_WhenSameUrlAppearsMoreThanOnce` |
| Self-alias redirect skipped | ✅ `Build_SkipsAliasRedirects_ThatResolveToCanonicalRoute` |
| Query-alias redirect preserved | ✅ same test (`/?page_id=8328`) |
| 404 canonical assertion | ✅ extended in `Build_WritesRoot404HtmlForNotFoundSlug` |
| Overflow (>50 issues) | ❌ not tested — acceptable for edge cases |
| noindex-in-sitemap still fires | ✅ existing test updated with `options.Include` scope guard |
The regression coverage is solid. The > 50 overflow path is a low-risk omission.
### Summary
The approach is sound and solves a genuine problem. Issues 1 (potential NPE on `options.Include`) and 2 (localized 404 may not be normalized) are the only ones worth addressing before merge; the rest are minor style suggestions. Overall this is a clean addition that fits the existing patterns in the audit pipeline.

---
## Summary
- Detect sitemap/canonical mismatches as engine-level audit warnings
- Flag duplicate `<loc>` entries and keep the existing noindex-in-sitemap validation in the same SEO pass

## Why
Google Search Console showed slash URL pages being treated as alternate canonicals because rendered pages declared a different canonical than the sitemap/runtime URL. This turns that class of SEO drift into an engine-level audit warning so PowerForge-built sites catch it before deploy.
## Validation
```powershell
dotnet test .\PowerForge.Tests\PowerForge.Tests.csproj --filter "FullyQualifiedName~WebSiteAuditSeoMetaTests|FullyQualifiedName~Build_SkipsAliasRedirects_ThatResolveToCanonicalRoute|FullyQualifiedName~Build_ExpandsAliasRedirects_ForSlashVariants" --no-restore
git diff --check
```

Note: the focused test run passed; it emitted transient copy retry warnings because another local PowerForge CLI audit was running at the same time.