RFC: Automatic pagination detection for crawls (reduce manual “next page” rules) #285
vigneshwarrvenkat
started this conversation in
Ideas
Replies: 0 comments
Feature description
Detect common pagination patterns (e.g. rel="next", “next”/localized link text, numbered pages, query params like ?page=) and optionally emit follow-up requests with configurable confidence / overrides.
Motivation
Most crawls repeat the same boilerplate; mistakes here waste bandwidth and time.
Proposal (high level)
LinkExtractor level; verbose logging when a pattern is chosen.
CrawlSpider/LinkExtractor.
Related
ROADMAP.md: “Add functionality to automatically detect pagination URLs”
Open questions
(I searched existing feature requests for this topic.)
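To make the idea concrete, here is a rough, standard-library-only sketch of the heuristics listed under the feature description (rel="next", “next”-style link text, and an incremented ?page= style parameter). Everything in it is an assumption for illustration: the names `detect_next_page`, `NEXT_TEXT`, `PAGE_PARAMS`, the confidence scores, and the `min_confidence` threshold are hypothetical and are not part of any existing LinkExtractor or CrawlSpider API.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urljoin, urlparse

# Hypothetical defaults; a real feature would make these configurable.
NEXT_TEXT = {"next", "next page", "older", "suivant", "siguiente", "weiter"}
PAGE_PARAMS = ("page", "p", "pg")


class _LinkCollector(HTMLParser):
    """Collect href, rel tokens, and anchor text for <a> and <link> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # the <a> entry whose text is being read

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return
        rel = (attr_map.get("rel") or "").lower().split()
        entry = {"href": href, "rel": rel, "text": ""}
        self.links.append(entry)
        if tag == "a":
            self._current = entry

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._current = None


def _page_number(url):
    """Return the integer value of the first known page parameter, if any."""
    query = parse_qs(urlparse(url).query)
    for name in PAGE_PARAMS:
        values = query.get(name)
        if values and values[0].isdigit():
            return int(values[0])
    return None


def detect_next_page(html, base_url, min_confidence=0.5):
    """Return (absolute_url, confidence) for the likeliest next page, or None."""
    collector = _LinkCollector()
    collector.feed(html)
    number = _page_number(base_url)
    current_page = 1 if number is None else number

    best = None
    for link in collector.links:
        url = urljoin(base_url, link["href"])
        # Normalize whitespace and drop trailing arrow decorations.
        text = " ".join(link["text"].split()).lower().strip("›»> ")
        if "next" in link["rel"]:
            score = 0.95  # explicit markup is the strongest signal
        elif text in NEXT_TEXT:
            score = 0.8   # "next" or localized link text
        elif _page_number(url) == current_page + 1:
            score = 0.6   # numbered page / incremented query parameter
        else:
            continue
        if best is None or score > best[1]:
            best = (url, score)

    if best is not None and best[1] >= min_confidence:
        return best
    return None
```

A caller could raise `min_confidence` to follow only explicit rel="next" links, or log the returned score when a pattern is chosen. The scores here are arbitrary placeholders, not measured values.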