Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
Default URL filter: exclude localhost and private address spaces #543
To avoid that information is leaked and exposed in a public search interface, a web crawler should never follow links to localhost, loop-back addresses and private address spaces, see commoncrawl/news-crawl#21. These are excluded by regular expressions for http/https URLs.
Of course, could comment out the rules to enable test crawls of the local Apache web server.