
Respect robots.txt when crawling when set as True #42

Closed
indrajithi opened this issue Jun 18, 2024 · 1 comment · Fixed by #45
indrajithi (Collaborator) commented Jun 18, 2024

  • Option to set respect_robots_txt (default should be True because of legal obligations in some jurisdictions)
  • Fetch and parse robots.txt (urllib.robotparser will help with parsing)
  • Cache crawl rules per domain
  • Check URL permissions before crawling a URL
  • Make sure it works when concurrent workers are fetching different domains
  • Use the rules provided in robots.txt when fetching (e.g. honor the Crawl-delay directive if present, and check the rules before crawling a path)
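The checklist above can be sketched with the standard library's urllib.robotparser. This is a minimal, single-threaded illustration, not the project's implementation; the class name, user agent, and the respect_robots_txt flag are assumptions for the example:

```python
import urllib.robotparser
from urllib.parse import urlparse


class RobotsCache:
    """Caches one parsed robots.txt ruleset per domain."""

    def __init__(self, user_agent="*", respect_robots_txt=True):
        self.user_agent = user_agent
        self.respect_robots_txt = respect_robots_txt
        self._parsers = {}  # domain -> RobotFileParser

    def _parser_for(self, url):
        # Fetch and parse robots.txt once per domain, then reuse it.
        domain = urlparse(url).netloc
        if domain not in self._parsers:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"https://{domain}/robots.txt")
            rp.read()  # network fetch of robots.txt
            self._parsers[domain] = rp
        return self._parsers[domain]

    def can_fetch(self, url):
        # When the option is off, skip the robots.txt check entirely.
        if not self.respect_robots_txt:
            return True
        return self._parser_for(url).can_fetch(self.user_agent, url)

    def crawl_delay(self, url):
        # Crawl-delay in seconds, or 0 if robots.txt does not specify one.
        delay = self._parser_for(url).crawl_delay(self.user_agent)
        return delay if delay is not None else 0
```

For the concurrent-workers bullet, the per-domain cache would additionally need a lock (or one cache per worker), since workers fetching different domains may populate `_parsers` at the same time.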
@indrajithi indrajithi changed the title Respect robots.txt when crawling Respect robots.txt when crawling when set as True Jun 18, 2024
Mews (Collaborator) commented Jun 18, 2024

@indrajithi I've already started working on this; assign me when you can :)
