Web Crawler is a recursive link extractor that crawls a domain's website in its entirety while respecting the rules defined in robots.txt. The implementation includes a simple reporting provider that writes to an output stream of your choice, in the following format, for each page within the domain.
BASE URL http://example.com
Parameters
LAST_MODIFIED::
HTTP STATUS 200
DOMAIN
http://example.com/foo
DOMAIN_IMAGE
http://example.com/foo.png
EXTERNAL
http://google.com/foo
EXTERNAL_IMAGE
http://google.com/foo.png
This format may be changed by implementing a custom LocationProvider, so that output can be written in any format and sent to other services, or used to generate a sitemap.xml (see the sketch below).
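As a minimal sketch of that idea, assuming LocationProvider exposes a single callback per crawled location (the method name addLocation and its signature below are hypothetical, not taken from the project), a sitemap-style provider could look like this:

import java.io.IOException;
import java.io.OutputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch only: the real LocationProvider interface may declare different methods.
public class SitemapLocationProvider implements LocationProvider {

    private final OutputStream out;

    public SitemapLocationProvider(OutputStream out) {
        this.out = out;
    }

    // Hypothetical callback, assumed to be invoked once per crawled location.
    public void addLocation(URL location) throws IOException {
        // Emit one sitemap <url> entry per page within the domain.
        String entry = "<url><loc>" + location.toExternalForm() + "</loc></url>\n";
        out.write(entry.getBytes(StandardCharsets.UTF_8));
    }
}

Wrapped in the standard <urlset> header and footer, the resulting stream is a valid sitemap.xml produced from the same crawl.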
The project is built using Gradle. Once Gradle is installed, build the project with the following command.
$> gradle
...
$>
The package is available from Maven Central and JCenter; add the dependency using the coordinates appropriate for your build tool.
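For example, a Gradle dependency declaration would take the following shape (the coordinates below are placeholders, not the project's published group, artifact, or version):

repositories {
    mavenCentral()
}

dependencies {
    // Placeholder coordinates; substitute the project's actual group:artifact:version.
    implementation 'com.example:web-crawler:1.0.0'
}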
Example implementation:

import java.io.File;
import java.io.FileOutputStream;
import java.net.URL;

String domainToCrawl = "http://example.com";
String domainRobotsTxt = "http://example.com/robots.txt";

// Report crawl results to a file of your choice.
File outFile = new File("your output file");
FileOutputStream fileOutputStream = new FileOutputStream(outFile);
LocationProvider locationProvider = new ReportingLocationProvider(fileOutputStream);

// Parse the domain's robots.txt so its rules are respected during the crawl.
RobotsTxt robotsTxt = new RobotsTxtParser(domainRobotsTxt);

WebCrawler webCrawler = new DomainWebCrawler()
        .useRobotstxt(robotsTxt)
        .useLocationProvider(locationProvider);
webCrawler.crawlUrl(new URL(domainToCrawl));
The project is deployed to Maven Central and JCenter via TravisCI on tagged releases.