Updated feature list and added basic example

Turnerj committed Aug 17, 2019
1 parent 73e76ed commit 96dbb4c6538b03a059dac29b6a4e42c12a8c07fe
Showing with 25 additions and 2 deletions.
  1. +25 −2 README.md
@@ -7,8 +7,10 @@ A simple but powerful web crawler library in C#

## Features
- Obeys robots.txt (crawl delay & allow/disallow)
- Obeys in-page robots rules (`X-Robots-Tag` header and `<meta name="robots" />` tag)
- Uses sitemap.xml to seed the initial crawl of the site
- Built around a parallel task `async`/`await` system
- Swappable request and content processors, allowing greater customisation
- Auto-throttling (see below)

## Polite Crawling
@@ -21,4 +23,25 @@ You can control:
- Artificial "jitter" in request delays (requests seem less "robotic")
- Timeout for a request before throttling will apply for new requests
- Throttling request backoff: The amount of time added to the delay to throttle requests (this is cumulative)
- Minimum number of requests under the throttle timeout before the throttle is gradually removed (a configuration sketch follows this list)
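
The controls above would plausibly map to options on the request processor. Below is a minimal sketch; the property names (`DelayBetweenRequestStart`, `DelayJitter`, `TimeoutBeforeThrottle`, `ThrottlingRequestBackoff`, `MinSequentialSuccessesToMinimiseThrottling`) are assumptions for illustration and may not match the library's actual API:

```csharp
using System;
using InfinityCrawler;

// Sketch only: the property names below are assumptions for illustration
// and may differ from the real RequestProcessorOptions members.
var settings = new CrawlSettings
{
    RequestProcessorOptions = new RequestProcessorOptions
    {
        DelayBetweenRequestStart = TimeSpan.FromMilliseconds(1000), // base delay between request starts
        DelayJitter = TimeSpan.FromMilliseconds(1000),              // random variance so requests seem less "robotic"
        TimeoutBeforeThrottle = TimeSpan.FromSeconds(2),            // responses slower than this trigger throttling
        ThrottlingRequestBackoff = TimeSpan.FromSeconds(2),         // cumulative delay added while throttled
        MinSequentialSuccessesToMinimiseThrottling = 5              // fast requests needed before easing the throttle
    }
};
```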

## Other Settings
- Control the UserAgent used in the crawling process
- Set additional host aliases you want the crawling process to follow (for example, subdomains)
- The max number of retries for a specific URI
- The max number of redirects to follow
- The max number of pages to crawl (a combined settings sketch follows this list)
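
A combined sketch of these settings is below. Aside from `UserAgent`, which appears in the example further down, the property names (`HostAliases`, `NumberOfRetries`, `MaxNumberOfRedirects`, `MaxNumberOfPagesToCrawl`) are assumptions and may differ from the real `CrawlSettings` members:

```csharp
using InfinityCrawler;

// Sketch only: property names other than UserAgent are assumptions.
var settings = new CrawlSettings
{
    UserAgent = "MyVeryOwnWebCrawler/1.0",
    HostAliases = new[] { "www.example.org" }, // e.g. subdomains to also follow
    NumberOfRetries = 3,                       // max retries for a specific URI
    MaxNumberOfRedirects = 3,                  // max redirects to follow
    MaxNumberOfPagesToCrawl = 200              // overall crawl cap
};
```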

## Example Usage
```csharp
using System;
using InfinityCrawler;

// Create the crawler and start a crawl ("await" requires an async context)
var crawler = new Crawler();
var result = await crawler.Crawl(new Uri("http://example.org/"), new CrawlSettings
{
    // Identify the crawler to the sites it visits
    UserAgent = "MyVeryOwnWebCrawler/1.0",
    RequestProcessorOptions = new RequestProcessorOptions
    {
        // Cap how many requests run in parallel
        MaxNumberOfSimultaneousRequests = 5
    }
});
```
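
Once the crawl completes, the result can be inspected. Continuing from the example above, the sketch below assumes the result exposes a collection of crawled URIs; the `CrawledUris` and `Uri` member names are assumptions:

```csharp
// Sketch only: CrawledUris and Uri are assumed member names.
foreach (var crawledUri in result.CrawledUris)
{
    Console.WriteLine(crawledUri.Uri);
}
```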
