Ignore code examples in SEO leak checks #340

Merged
PrzemyslawKlys merged 1 commit into main from fix-seo-doctor-code-leak
May 2, 2026

@PrzemyslawKlys
Member

Summary

  • keep SEO content-leak detection enabled while ignoring fenced/inline code content for the leak heuristic
  • add a regression test for article front matter examples rendered inside code blocks

Validation

  • dotnet test PowerForge.Tests\PowerForge.Tests.csproj --configuration Release --filter "FullyQualifiedName~WebSeoDoctorTests"

@claude

claude Bot commented May 2, 2026

Code Review — PR #340: Ignore code examples in SEO leak checks

Overview

This PR fixes a false positive in the CheckContentLeaks heuristic: front matter examples rendered inside <pre><code> blocks (e.g., in contribution guides or tutorials) were incorrectly triggering the content-frontmatter-leak issue. The fix adds an excludeCodeBlocks parameter to GetVisibleBodyText and uses it when building the text for leak detection.

The change is minimal, targeted, and correct. The approach is consistent with the existing code style.
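
To make the mechanism concrete, here is a minimal sketch of the idea, not the PR's implementation. It assumes a regex-based stand-in for the DOM selector removal that the real GetVisibleBodyText performs; the class name LeakTextSketch and the method VisibleText are hypothetical, and only the pre,code,kbd,samp list and the excludeCodeBlocks parameter name come from the PR.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical illustration only: the real code works on a cloned DOM body,
// while this sketch approximates the same filtering with regular expressions.
public static class LeakTextSketch
{
    // Matches pre/code/kbd/samp elements together with their content.
    // Same-name nesting (e.g. <code> inside <code>) is not handled here.
    private static readonly Regex CodeLike = new Regex(
        @"<(pre|code|kbd|samp)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");
    private static readonly Regex Whitespace = new Regex(@"\s+");

    // Returns the visible text; with excludeCodeBlocks set, code-like
    // elements are dropped before the remaining tags are stripped.
    public static string VisibleText(string html, bool excludeCodeBlocks = false)
    {
        if (excludeCodeBlocks)
        {
            html = CodeLike.Replace(html, " ");
        }

        var text = AnyTag.Replace(html, " ");
        return Whitespace.Replace(text, " ").Trim();
    }
}
```

Because the parameter defaults to false, existing callers keep receiving the full text (code included), while the leak check can opt in with VisibleText(html, excludeCodeBlocks: true).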


Code Quality

GetVisibleBodyText signature change — Clean opt-in design using a default parameter. The original callers (bodyText at line 171, inspectedText at line 1059) are unaffected. No breaking changes.

Selector list (pre,code,kbd,samp) — A reasonable set for semantic code elements. A few notes:

  • Removing <code> in addition to <pre> correctly handles inline code (e.g., draft: true) as well as fenced blocks; the broader coverage seems intentional and good.
  • <kbd> and <samp> are unlikely to contain front matter patterns in practice, but including them is harmless and gives a complete "code-like" filter.
  • <var> (HTML variable element) is omitted, which is fine — it's rarely used and unlikely to contain front matter keys.

bodyText vs leakCheckText separation — The existing bodyText variable continues to serve BodyText = bodyText (line 203) and CountCaseInsensitiveOccurrences (line 325). Using full body text (including code) for keyphrase density counting is the right call, so keeping these separate is correct.


Performance Consideration

When CheckContentLeaks is enabled, doc.Body is now cloned twice — once unconditionally at line 171 (GetVisibleBodyText(doc.Body)) and once conditionally at line 193 (GetVisibleBodyText(doc.Body, excludeCodeBlocks: true)). Both calls do a full DOM clone, selector walk, and text extraction.

This is acceptable for the common case, but worth noting: with CheckContentLeaks enabled, every page incurs two clones. A future micro-optimization (if profiling ever flags this) would be to derive leakCheckText from the already-cloned body instead of cloning doc.Body a second time, but that's not warranted for this PR.


Test Coverage

Positive: The new test covers the stated regression scenario well — front matter inside <pre><code class="language-markdown"> does not trigger content-frontmatter-leak. It follows the established test pattern, uses try/finally for cleanup, and scopes the options correctly.

Suggestions:

  1. Inline code coverage — The test covers <pre><code> (block-level) but not inline <code> (e.g., <p>Set <code>draft: true</code> in your front matter.</p>). Since code is now in the excluded selector list, a complementary test or a second Assert.DoesNotContain case in the same test would confirm that path explicitly.

  2. Verify leak still fires outside code blocks — There's no test asserting that front matter outside a code block on the same page still produces the content-frontmatter-leak issue. Existing tests may cover this, but a targeted test would lock down the boundary condition and guard against accidentally over-excluding.

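// Example: confirm the flag still fires when front matter leaks into prose
// (not inside any code element). `root` is the temporary site directory and
// `result` is the SEO doctor output, both set up as in the existing tests.
File.WriteAllText(Path.Combine(root, "leaked.html"), """
    <html><body><main><article>
    title: "My Post"
    description: "Some desc"
    date: "2026-04-29"
    draft: true
    ---
    </article></main></body></html>
    """);
// ... run the SEO doctor over `root` here, as the existing tests do ...
Assert.Contains(result.Issues, i => i.Hint == "content-frontmatter-leak");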
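
The inline-code case from suggestion 1 and the boundary case from suggestion 2 can be checked side by side. The sketch below is self-contained and deliberately simplified. FlagsLeak is a hypothetical stand-in for the real heuristic (it only counts front-matter-style keys), and the regex-based code removal is an assumption that replaces the DOM handling in the actual code.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical illustration of the boundary condition; not the PowerForge API.
public static class LeakBoundarySketch
{
    // Assumed heuristic: front-matter-style keys followed by a colon.
    private static readonly Regex FrontMatterKey = new Regex(
        @"\b(title|description|date|draft)\s*:", RegexOptions.IgnoreCase);

    // Simplified code exclusion: drops pre/code/kbd/samp elements and content.
    private static readonly Regex CodeLike = new Regex(
        @"<(pre|code|kbd|samp)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Flags a page when at least two front matter keys remain outside code.
    public static bool FlagsLeak(string html)
    {
        var withoutCode = CodeLike.Replace(html, " ");
        return FrontMatterKey.Matches(withoutCode).Count >= 2;
    }
}
```

Under these assumptions, front matter inside <pre><code> or inline <code> is not flagged, while the same keys placed directly in the article prose still are. Those are the three assertions a complementary test would want to pin down.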

Summary

  • Correctness: ✅ Fix is correct and addresses the root cause
  • Code style: ✅ Consistent with existing patterns
  • Performance: ⚠️ Minor double-clone when CheckContentLeaks is enabled (not blocking)
  • Test coverage: ✅ Good; inline-code and boundary-condition tests would strengthen it further
  • Security: ✅ No concerns

The change is ready to merge as-is. The suggested tests are improvements, not blockers.

PrzemyslawKlys merged commit 6665d96 into main on May 2, 2026
6 checks passed
PrzemyslawKlys deleted the fix-seo-doctor-code-leak branch on May 2, 2026, 13:36