Web Crawler is a recursive link extractor that crawls a domain's website in its entirety while respecting the rules defined in robots.txt. The implementation includes a simple reporting provider that writes to an output stream of your choice, in the following format, for each page within the domain.
BASE URL http://example.com
Parameters
LAST_MODIFIED::
HTTP STATUS 200
DOMAIN
http://example.com/foo
DOMAIN_IMAGE
http://example.com/foo.png
EXTERNAL
http://google.com/foo
EXTERNAL_IMAGE
http://google.com/foo.png
This format may be changed by implementing a custom LocationProvider, so that output can be written in any format and sent to other services, or used to generate a sitemap.xml (see the sketch below).
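As a minimal sketch of that idea, assuming LocationProvider exposes a single callback per crawled location (the method name addLocation and its signature below are hypothetical, not taken from the project), a sitemap-style provider could look like this:

import java.io.IOException;
import java.io.OutputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch only: the real LocationProvider interface may declare different methods.
public class SitemapLocationProvider implements LocationProvider {

    private final OutputStream out;

    public SitemapLocationProvider(OutputStream out) {
        this.out = out;
    }

    // Hypothetical callback, assumed to be invoked once per crawled location.
    public void addLocation(URL location) throws IOException {
        // Emit one sitemap <url> entry per page within the domain.
        String entry = "<url><loc>" + location.toExternalForm() + "</loc></url>\n";
        out.write(entry.getBytes(StandardCharsets.UTF_8));
    }
}

Wrapped in the standard <urlset> header and footer, the resulting stream is a valid sitemap.xml produced from the same crawl.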
The project is built using Gradle. Once Gradle is installed, build the project with the following command.
$> gradle
...
$>
The package is available from Maven Central and JCenter; add the dependency using the coordinates appropriate for your build tool.
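For example, a Gradle dependency declaration would take the following shape (the coordinates below are placeholders, not the project's published group, artifact, or version):

repositories {
    mavenCentral()
}

dependencies {
    // Placeholder coordinates; substitute the project's actual group:artifact:version.
    implementation 'com.example:web-crawler:1.0.0'
}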
Example implementation:

import java.io.File;
import java.io.FileOutputStream;
import java.net.URL;

String domainToCrawl = "http://example.com";
String domainRobotsTxt = "http://example.com/robots.txt";

// Report crawl results to a file of your choice.
File outFile = new File("your output file");
FileOutputStream fileOutputStream = new FileOutputStream(outFile);
LocationProvider locationProvider = new ReportingLocationProvider(fileOutputStream);

// Parse the domain's robots.txt so its rules are respected during the crawl.
RobotsTxt robotsTxt = new RobotsTxtParser(domainRobotsTxt);

WebCrawler webCrawler = new DomainWebCrawler()
        .useRobotstxt(robotsTxt)
        .useLocationProvider(locationProvider);
webCrawler.crawlUrl(new URL(domainToCrawl));
The project is deployed to Maven Central and JCenter via TravisCI on tagged releases.