🧹 Improve cluster risk heuristic with content analysis by saurabhsharma2u · Pull Request #15 · Crawlith/crawlith

saurabhsharma2u · 2026-02-25T14:28:42Z

🎯 What:
The calculateClusterRisk function in plugins/core/src/graph/cluster.ts was updated to use a content-based heuristic instead of a simple size-based one.

💡 Why:
The previous heuristic relied solely on cluster size, which could lead to inaccurate risk assessments (e.g., small clusters of identical pages were marked low risk). By analyzing actual content duplication (Titles and H1s), the risk assessment becomes much more accurate for detecting cannibalization issues.

✅ Verification:

Created a new test file plugins/core/tests/clustering_risk.test.ts with 5 test cases covering:
- High risk scenarios (identical titles, identical H1s).
- Low risk scenarios (unique content in small clusters).
- Medium risk scenarios (large clusters even with unique content).
- Fallback behavior when HTML is missing.
Ran npx vitest run plugins/core/tests/clustering_risk.test.ts and confirmed all tests passed.
Ran all tests in plugins/core (npx vitest run plugins/core/tests) and confirmed no regressions (170 tests passed).
Verified TypeScript compilation with npm run build.

✨ Result:
The cluster risk heuristic is now content-aware: it flags clusters whose pages share Titles or H1s, which gives better insight into SEO issues like duplicate content and keyword cannibalization.

PR created automatically by Jules for task 6798153227314384451 started by @saurabhsharma2u

This commit replaces the simplified cluster risk heuristic with a content-based approach. It now parses the HTML of cluster nodes to extract Titles and H1s, calculating duplication ratios to determine risk levels more accurately.

- Added `cheerio` import to `plugins/core/src/graph/cluster.ts`.
- Implemented HTML parsing and duplication counting in `calculateClusterRisk`.
- Defined risk levels based on duplication ratios (> 30% -> High, > 0% -> Medium).
- Retained fallback to size-based heuristic if HTML is missing.
- Added comprehensive tests in `plugins/core/tests/clustering_risk.test.ts`.
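
For readers without the diff at hand, the thresholds above can be sketched roughly as follows. This is an illustrative sketch, not the merged `calculateClusterRisk`: the names `ClusterPage`, `duplicationRatio`, and `sketchClusterRisk` are hypothetical, the Titles and H1s are assumed to be already extracted (the PR does that step with `cheerio`), and the size cutoff of 10 pages, the trim/lowercase normalization, and the choice to take the larger of the two ratios are all assumptions. Only the > 30% and > 0% thresholds and the size-based fallback come from the commit description.

```typescript
type RiskLevel = "low" | "medium" | "high";

// Hypothetical shape: fields are undefined when no HTML was available for the page.
interface ClusterPage {
  title?: string;
  h1?: string;
}

// Fraction of values that are shared with at least one other value.
// Normalizing with trim + lowercase is an assumption, not taken from the PR.
function duplicationRatio(values: string[]): number {
  if (values.length === 0) return 0;
  const counts = new Map<string, number>();
  values.forEach((v) => {
    const key = v.trim().toLowerCase();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  });
  let duplicated = 0;
  counts.forEach((n) => {
    if (n > 1) duplicated += n;
  });
  return duplicated / values.length;
}

// largeClusterSize is a hypothetical cutoff; the PR does not state the real one.
function sketchClusterRisk(pages: ClusterPage[], largeClusterSize = 10): RiskLevel {
  const nonEmpty = (s: string | undefined): s is string => !!s && s.trim() !== "";
  const titles = pages.map((p) => p.title).filter(nonEmpty);
  const h1s = pages.map((p) => p.h1).filter(nonEmpty);

  // Size-based result, used as the fallback and for clusters with unique content.
  const sizeBased: RiskLevel = pages.length >= largeClusterSize ? "medium" : "low";
  if (titles.length === 0 && h1s.length === 0) return sizeBased;

  const ratio = Math.max(duplicationRatio(titles), duplicationRatio(h1s));
  if (ratio > 0.3) return "high"; // > 30% duplicated -> High
  if (ratio > 0) return "medium"; // any duplication -> Medium
  return sizeBased;
}
```

Under these assumptions, a three-page cluster where two pages share a title has a duplication ratio of 2/3 and is classified as high risk, while the same cluster with three distinct titles stays low.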

google-labs-jules · 2026-02-25T14:28:44Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

saurabhsharma2u marked this pull request as ready for review February 25, 2026 15:16

saurabhsharma2u merged commit a0a0f41 into main Feb 25, 2026
6 checks passed

saurabhsharma2u deleted the improve-cluster-risk-heuristic-6798153227314384451 branch February 25, 2026 15:17

saurabhsharma2u merged 1 commit into main from improve-cluster-risk-heuristic-6798153227314384451
