🧹 Improve cluster risk heuristic with content analysis #15

Merged
saurabhsharma2u merged 1 commit into main from
improve-cluster-risk-heuristic-6798153227314384451
Feb 25, 2026

@saurabhsharma2u
Contributor

🎯 What:
The calculateClusterRisk function in plugins/core/src/graph/cluster.ts was updated to use a content-based heuristic instead of a simple size-based one.

💡 Why:
The previous heuristic relied solely on cluster size, which could lead to inaccurate risk assessments (e.g., small clusters of identical pages were marked low risk). By analyzing actual content duplication (Titles and H1s), the risk assessment becomes much more accurate for detecting cannibalization issues.
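To illustrate the kind of content duplication described above, here is a minimal sketch of a duplication ratio over a list of already-extracted Titles (or H1s). The helper name `duplicationRatio`, the normalization (trim plus lowercase), and the definition of the ratio as "share of pages whose value also appears on another page" are assumptions made for illustration; the PR does not spell out how `calculateClusterRisk` computes its ratios.

```typescript
// Hypothetical helper (not taken from the PR): returns the share of values
// that also occur on at least one other page in the same cluster.
function duplicationRatio(values: string[]): number {
  // Normalize so that case and surrounding whitespace do not hide duplicates.
  const normalized = values
    .map((v) => v.trim().toLowerCase())
    .filter((v) => v.length > 0);
  if (normalized.length === 0) return 0;

  // Count how often each normalized value appears.
  const counts = new Map<string, number>();
  for (const v of normalized) {
    counts.set(v, (counts.get(v) ?? 0) + 1);
  }

  // Every occurrence of a value seen more than once counts as duplicated.
  let duplicated = 0;
  counts.forEach((n) => {
    if (n > 1) duplicated += n;
  });
  return duplicated / normalized.length;
}

// Two of the three titles are the same after normalization.
const titles = ["Best Running Shoes", "best running shoes ", "Trail Shoes Guide"];
console.log(duplicationRatio(titles));
```

A size-only check would treat this three-page cluster as small and therefore low risk, even though two of its pages compete for the same title.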

✅ Verification:

  • Created a new test file plugins/core/tests/clustering_risk.test.ts with 5 test cases covering:
    • High risk scenarios (identical titles, identical H1s).
    • Low risk scenarios (unique content in small clusters).
    • Medium risk scenarios (large clusters even with unique content).
    • Fallback behavior when HTML is missing.
  • Ran npx vitest run plugins/core/tests/clustering_risk.test.ts and confirmed all tests passed.
  • Ran all tests in plugins/core (npx vitest run plugins/core/tests) and confirmed no regressions (170 tests passed).
  • Verified TypeScript compilation with npm run build.

✨ Result:
The cluster risk heuristic now accounts for duplicated Titles and H1s, so it can flag duplicate content and keyword cannibalization that a size-only check would miss.


PR created automatically by Jules for task 6798153227314384451 started by @saurabhsharma2u

This commit replaces the simplified cluster risk heuristic with a content-based
approach. It now parses the HTML of cluster nodes to extract Titles and H1s,
calculating duplication ratios to determine risk levels more accurately.

- Added `cheerio` import to `plugins/core/src/graph/cluster.ts`.
- Implemented HTML parsing and duplication counting in `calculateClusterRisk`.
- Defined risk levels based on duplication ratios (> 30% -> High, > 0% -> Medium).
- Retained fallback to size-based heuristic if HTML is missing.
- Added comprehensive tests in `plugins/core/tests/clustering_risk.test.ts`.
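
The thresholds listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from `cluster.ts`: the `ClusterStats` shape, the `riskFromStats` name, taking the larger of the Title and H1 ratios, and the `LARGE_CLUSTER_SIZE` cutoff used for the size-based paths are all hypothetical. Only the `> 30%` and `> 0%` cutoffs, the large-cluster Medium case, and the existence of a size-based fallback come from the PR description.

```typescript
type RiskLevel = "low" | "medium" | "high";

// Hypothetical input shape: ratios are undefined when no HTML was available.
interface ClusterStats {
  size: number;            // number of pages in the cluster
  titleDupRatio?: number;  // 0..1, share of pages with a duplicated Title
  h1DupRatio?: number;     // 0..1, share of pages with a duplicated H1
}

// Assumption: the PR does not state what counts as a "large" cluster.
const LARGE_CLUSTER_SIZE = 5;

function riskFromStats(stats: ClusterStats): RiskLevel {
  const { size, titleDupRatio, h1DupRatio } = stats;

  // Fallback when HTML is missing: a placeholder size-based rule.
  if (titleDupRatio === undefined && h1DupRatio === undefined) {
    return size >= LARGE_CLUSTER_SIZE ? "medium" : "low";
  }

  // Assumption: use the worse of the two ratios.
  const ratio = Math.max(titleDupRatio ?? 0, h1DupRatio ?? 0);
  if (ratio > 0.3) return "high";   // > 30% duplication -> High
  if (ratio > 0) return "medium";   // > 0% duplication -> Medium

  // Unique content: large clusters are still treated as Medium.
  return size >= LARGE_CLUSTER_SIZE ? "medium" : "low";
}

console.log(riskFromStats({ size: 2, titleDupRatio: 1, h1DupRatio: 0 }));
```

Keeping the classification separate from HTML parsing means the thresholds can be tested with plain numbers, without constructing HTML fixtures.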
@google-labs-jules

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@saurabhsharma2u saurabhsharma2u marked this pull request as ready for review February 25, 2026 15:16
@saurabhsharma2u saurabhsharma2u merged commit a0a0f41 into main Feb 25, 2026
6 checks passed
@saurabhsharma2u saurabhsharma2u deleted the improve-cluster-risk-heuristic-6798153227314384451 branch February 25, 2026 15:17