stormcrawler-aws
================================

AWS resources for StormCrawler. It currently contains an indexer bolt for CloudSearch, as well as bolts for storing web pages in S3 and retrieving them from it.

Prerequisites
-------------

Add storm-crawler-aws to the dependencies of your project:

```xml
<dependency>
    <groupId>com.digitalpebble.stormcrawler</groupId>
    <artifactId>storm-crawler-aws</artifactId>
    <version>XXXX</version>
</dependency>
```

Edit `~/.aws/credentials`; see the [AWS CLI getting started guide](http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html) for details. Note that this should not be necessary when running on EC2.
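
For reference, the credentials file uses the standard AWS profile layout; the keys below are the placeholder examples from the AWS documentation:

```ini
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```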

CloudSearch
-----------

* How to use?

Add storm-crawler-aws as a Maven dependency, use the class CloudSearchIndexWriter in your Storm topology alongside the core StormCrawler components, and create a YAML configuration file (see below).
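
As a rough sketch of the wiring, assuming CloudSearchIndexWriter is a regular Storm bolt like the core StormCrawler indexers and lives under the package shown in this repository's source tree:

```java
import org.apache.storm.topology.TopologyBuilder;

// Package assumed from this repository's source layout
// (src/main/java/com/digitalpebble/stormcrawler/aws); adjust if needed.
import com.digitalpebble.stormcrawler.aws.CloudSearchIndexWriter;

public class CloudSearchTopologySketch {

    public static TopologyBuilder build() {
        TopologyBuilder builder = new TopologyBuilder();
        // The usual core components (spout, partitioner, fetcher, parser)
        // are omitted here; "parse" stands for whichever bolt emits the
        // documents to be indexed.
        builder.setBolt("indexer", new CloudSearchIndexWriter())
               .localOrShuffleGrouping("parse");
        return builder;
    }
}
```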

* AWS credentials

Requires the AWS credentials to be stored in `~/.aws/credentials` (see the prerequisites above). You also need a pre-existing search domain on CloudSearch.

* Create a CloudSearch domain

This can be done using the [web console](https://eu-west-1.console.aws.amazon.com/cloudsearch/home?region=eu-west-1#) or the [AWS CLI](http://docs.aws.amazon.com/cloudsearch/latest/developerguide/creating-domains.html). You can use the temp files generated with `cloudsearch.batch.dump` (see below) to bootstrap the field definitions.
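
For example, with the AWS CLI (the domain name is a placeholder):

```sh
aws cloudsearch create-domain --domain-name my-crawl-domain
```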

Note that the creation of the domain can take some time. Once it is complete, note the document endpoint and region name.

* Configuration

See [aws-conf.yaml](aws-conf.yaml) for an example configuration.

You'll need to define `cloudsearch.endpoint` and `cloudsearch.region`, unless you set `cloudsearch.batch.dump` to true, in which case the batches of documents to index are dumped in JSON format to the default tmp directory. The files have the prefix "CloudSearch_", e.g. /tmp/CloudSearch_4822180575734804454.json. These temp files can be used as templates when defining the fields during domain creation (see above).

There are two additional configuration settings for CloudSearch:

* `cloudsearch.batch.maxSize`: the number of documents to buffer before sending them as a batch to CloudSearch. Default value: -1.
* `cloudsearch.batch.max.time.buffered`: the maximum time allowed before flushing the buffer of documents to CloudSearch, in seconds. Default value: 10.

In both cases, the restrictions set by CloudSearch on the size of a batch take precedence.
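
Putting the settings together, a minimal sketch of the relevant section of the configuration file, assuming the `config:` layout used by StormCrawler's crawler-conf.yaml; the endpoint and batch values are illustrative placeholders:

```yaml
config:
  cloudsearch.endpoint: "doc-mycrawl-abc123.eu-west-1.cloudsearch.amazonaws.com"
  cloudsearch.region: "eu-west-1"
  # When true, batches are written as JSON files to the tmp directory
  # instead of being sent to CloudSearch.
  cloudsearch.batch.dump: false
  # Illustrative values; see the defaults above.
  cloudsearch.batch.maxSize: 250
  cloudsearch.batch.max.time.buffered: 10
```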

* General behaviour

If an exception occurs while sending a batch to CloudSearch, the corresponding tuples are failed. What happens next depends on the fail logic of the spouts. Errors are logged, and the topology should continue without interruption.

Any fields not defined in the CloudSearch domain are ignored by the CloudSearchIndexWriter. Again, the logs contain a trace of any skipped field names. It is advisable to check the logs and modify the indexing options of your CloudSearch domain accordingly.

S3
--

Add `S3ContentCacher` (which stores fetched pages in S3) or `S3CacheChecker` (which retrieves previously cached content from S3) to your crawl topology, as sketched below.
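
A rough sketch of the wiring, assuming both classes are regular Storm bolts with default constructors; the subpackage, bolt names and exact position in the topology are assumptions, not taken from this repository:

```java
import org.apache.storm.topology.TopologyBuilder;

// Subpackage assumed; adjust the imports to where the classes actually live.
import com.digitalpebble.stormcrawler.aws.s3.S3CacheChecker;
import com.digitalpebble.stormcrawler.aws.s3.S3ContentCacher;

public class S3CachingTopologySketch {

    public static TopologyBuilder build() {
        TopologyBuilder builder = new TopologyBuilder();
        // The spout and the usual core bolts are omitted;
        // "spout" and "fetch" are placeholder component names.
        // Look up previously cached pages before fetching...
        builder.setBolt("cache_check", new S3CacheChecker())
               .localOrShuffleGrouping("spout");
        // ...and store freshly fetched pages in S3.
        builder.setBolt("cacher", new S3ContentCacher())
               .localOrShuffleGrouping("fetch");
        return builder;
    }
}
```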
