Skip to content

TurnerSoftware/InfinityCrawler

main
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

Bumps [BenchmarkDotNet](https://github.com/dotnet/BenchmarkDotNet) from 0.13.4 to 0.13.8.
- [Release notes](https://github.com/dotnet/BenchmarkDotNet/releases)
- [Commits](dotnet/BenchmarkDotNet@v0.13.4...v0.13.8)

---
updated-dependencies:
- dependency-name: BenchmarkDotNet
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
4b56f68

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
February 16, 2022 11:08
August 10, 2022 23:11
August 18, 2019 16:22
August 4, 2019 16:53
February 16, 2022 11:09
August 4, 2019 16:53
October 21, 2021 14:58

Icon

Infinity Crawler

A simple but powerful web crawler library for .NET

Build Codecov NuGet

Features

  • Obeys robots.txt (crawl delay & allow/disallow)
  • Obeys in-page robots rules (X-Robots-Tag header and <meta name="robots" /> tag)
  • Uses sitemap.xml to seed the initial crawl of the site
  • Built around a parallel task async/await system
  • Swappable request and content processors, allowing greater customisation
  • Auto-throttling (see below)

Licensing and Support

Infinity Crawler is licensed under the MIT license. It is free to use in personal and commercial projects.

There are support plans available that cover all active Turner Software OSS projects. Support plans provide private email support, expert usage advice for our projects, priority bug fixes and more. These support plans help fund our OSS commitments to provide better software for everyone.

Polite Crawling

The crawler is built around fast but "polite" crawling of website. This is accomplished through a number of settings that allow adjustments of delays and throttles.

You can control:

  • Number of simulatenous requests
  • The delay between requests starting (Note: If a crawl-delay is defined for the User-agent, that will be the minimum)
  • Artificial "jitter" in request delays (requests seem less "robotic")
  • Timeout for a request before throttling will apply for new requests
  • Throttling request backoff: The amount of time added to the delay to throttle requests (this is cumulative)
  • Minimum number of requests under the throttle timeout before the throttle is gradually removed

Other Settings

  • Control the UserAgent used in the crawling process
  • Set additional host aliases you want the crawling process to follow (for example, subdomains)
  • The max number of retries for a specific URI
  • The max number of redirects to follow
  • The max number of pages to crawl

Example Usage

using InfinityCrawler;

var crawler = new Crawler();
var result = await crawler.Crawl(new Uri("http://example.org/"), new CrawlSettings {
	UserAgent = "MyVeryOwnWebCrawler/1.0",
	RequestProcessorOptions = new RequestProcessorOptions
	{
		MaxNumberOfSimultaneousRequests = 5
	}
});