# Web Crawler


Web Crawler is a recursive link extractor that crawls a domain's website in its entirety, respecting the rules defined in its robots.txt. The implementation contains a simple reporting provider that writes to an output stream of your choice, in the following format for each page within the domain:

```
BASE URL http://example.com
Parameters
LAST_MODIFIED::

HTTP STATUS 200

	DOMAIN
		http://example.com/foo
	DOMAIN_IMAGE
		http://example.com/foo.png
	EXTERNAL
		http://google.com/foo
	EXTERNAL_IMAGE
		http://google.com/foo.png
```

This format can be changed by implementing a custom LocationProvider, so that output can be written in any format and sent to other services, or used to generate a sitemap.xml.
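The LocationProvider interface itself is not reproduced in this README, so the sketch below is illustrative only: the class name, the `addLocation` callback, and its parameters are assumptions rather than the library's actual API. It shows the general shape of a provider that writes each discovered link as a CSV row instead of the report format above.

```java
import java.io.OutputStream;
import java.io.PrintWriter;
import java.net.URL;

// Illustrative sketch only: the real LocationProvider interface is not shown
// in this README, so the callback name and signature below are assumptions.
public class CsvLocationProvider implements LocationProvider {

	private final PrintWriter writer;

	public CsvLocationProvider(OutputStream out) {
		this.writer = new PrintWriter(out, true); // auto-flush on each line
	}

	// Hypothetical callback: one invocation per link found on a crawled page.
	public void addLocation(URL page, URL link, String category) {
		writer.printf("%s,%s,%s%n", page, link, category);
	}
}
```

Passing such a provider to the crawler in place of the ReportingLocationProvider would then redirect output to whatever sink the wrapped stream points at.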

## Building

The project is built using Gradle. Once Gradle is installed, the project can be built with the following command:

```
$> gradle
...
$>
```

## Usage

The package is available from Maven Central and JCenter, where you will find the appropriate dependency declaration for your specific build tool.

Example implementation:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.net.URL;

String domainToCrawl = "http://example.com";
String domainRobotsTxt = "http://example.com/robots.txt";
File outFile = new File("your output file");
// Remember to close the stream once crawling has completed.
FileOutputStream fileOutputStream = new FileOutputStream(outFile);
LocationProvider locationProvider = new ReportingLocationProvider(fileOutputStream);
RobotsTxt robotsTxt = new RobotsTxtParser(domainRobotsTxt);
WebCrawler webCrawler = new DomainWebCrawler()
	.useRobotstxt(robotsTxt)
	.useLocationProvider(locationProvider);
webCrawler.crawlUrl(new URL(domainToCrawl));
```

## Continuous Integration

The project is built on TravisCI.

## Continuous Deployment

The project is deployed to Maven Central and JCenter via TravisCI on tagged releases.