🧹 Improve cluster risk heuristic with content analysis #15
This commit replaces the simplified cluster risk heuristic with a content-based approach. It now parses the HTML of cluster nodes to extract Titles and H1s, calculating duplication ratios to determine risk levels more accurately.
- Added `cheerio` import to `plugins/core/src/graph/cluster.ts`.
- Implemented HTML parsing and duplication counting in `calculateClusterRisk`.
- Defined risk levels based on duplication ratios (> 30% -> High, > 0% -> Medium).
- Retained fallback to size-based heuristic if HTML is missing.
- Added comprehensive tests in `plugins/core/tests/clustering_risk.test.ts`.
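The thresholds above can be illustrated with a minimal sketch. This is not the code from the PR: the names `ClusterNode`, `firstTagText`, `duplicationRatio`, `sizeBasedRisk`, and `calculateClusterRiskSketch` are hypothetical, a simple regex stands in for the `cheerio` lookup so the example has no dependencies, and both the exact definition of "duplication ratio" (here, the share of pages whose Title or H1 also appears on another page in the cluster) and the size-based fallback thresholds are assumptions.

```typescript
type RiskLevel = "low" | "medium" | "high";

// Hypothetical node shape: html is absent when the page body was not stored.
interface ClusterNode {
  url: string;
  html?: string;
}

// Regex stand-in for a cheerio selector; adequate only for simple, well-formed markup.
function firstTagText(html: string, tag: string): string | null {
  const match = new RegExp(`<${tag}\\b[^>]*>([\\s\\S]*?)</${tag}>`, "i").exec(html);
  if (!match) return null;
  // Strip nested tags, collapse whitespace, and lowercase for comparison.
  const text = match[1].replace(/<[^>]+>/g, "").replace(/\s+/g, " ").trim().toLowerCase();
  return text.length > 0 ? text : null;
}

// Assumed definition: fraction of pages whose value is shared with at least one other page.
function duplicationRatio(values: (string | null)[]): number {
  const present = values.filter((v): v is string => v !== null);
  if (present.length === 0) return 0;
  const counts = new Map<string, number>();
  for (const v of present) counts.set(v, (counts.get(v) ?? 0) + 1);
  const duplicated = present.filter((v) => (counts.get(v) ?? 0) > 1).length;
  return duplicated / values.length;
}

// Hypothetical fallback thresholds; the PR does not state the real ones.
function sizeBasedRisk(size: number): RiskLevel {
  if (size > 10) return "high";
  if (size > 3) return "medium";
  return "low";
}

function calculateClusterRiskSketch(nodes: ClusterNode[]): RiskLevel {
  const withHtml = nodes.filter((n) => typeof n.html === "string" && n.html.length > 0);
  // Fall back to the size-based heuristic when no HTML is available.
  if (withHtml.length === 0) return sizeBasedRisk(nodes.length);

  const titles = withHtml.map((n) => firstTagText(n.html ?? "", "title"));
  const h1s = withHtml.map((n) => firstTagText(n.html ?? "", "h1"));
  const ratio = Math.max(duplicationRatio(titles), duplicationRatio(h1s));

  // Thresholds from the PR description: > 30% -> High, > 0% -> Medium.
  if (ratio > 0.3) return "high";
  if (ratio > 0) return "medium";
  return "low";
}
```

Under these assumptions, a two-page cluster whose pages share the same Title is rated `"high"`, even though a purely size-based rule would rate such a small cluster `"low"`.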
🎯 What:
The `calculateClusterRisk` function in `plugins/core/src/graph/cluster.ts` was updated to use a content-based heuristic instead of a simple size-based one.
💡 Why:
The previous heuristic relied solely on cluster size, which could lead to inaccurate risk assessments (e.g., small clusters of identical pages were marked low risk). By analyzing actual content duplication (Titles and H1s), the risk assessment becomes much more accurate for detecting cannibalization issues.
✅ Verification:
- Added `plugins/core/tests/clustering_risk.test.ts` with 5 test cases.
- Ran `npx vitest run plugins/core/tests/clustering_risk.test.ts` and confirmed all tests passed.
- Ran the `plugins/core` test suite (`npx vitest run plugins/core/tests`) and confirmed no regressions (170 tests passed).
- Ran `npm run build`.
✨ Result:
The cluster risk heuristic is now robust and content-aware, providing better insights into SEO issues like duplicate content and keyword cannibalization.
PR created automatically by Jules for task 6798153227314384451 started by @saurabhsharma2u