Configuring storage targets

This documentation is for an old version of Snowplow Storage targets!

🚧 The documentation for the latest version can be found on the Snowplow documentation site.

Snowplow offers the option to configure certain storage targets. This is done using configuration JSONs. When running EmrEtlRunner, the --targets argument should be populated with the filepath of a directory containing your configuration JSONs. Each storage target JSON file can have an arbitrary name, but must conform to its JSON Schema.

Some targets are handled by EmrEtlRunner (duplicate tracking, failure tracking) and some by RDB Loader (enriched data).

Here's a list of currently supported targets, grouped by purpose:

Redshift

Schema: iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/3-0-0

  1. name, a descriptive name for this Snowplow storage target
  2. host, the host (endpoint in Redshift parlance) of the database to load
  3. database, the name of the database to load
  4. port, the port of the database to load. 5439 is the default Redshift port
  5. schema, the name of the database schema which will store your Snowplow tables
  6. username, the database user to load your Snowplow events with. You can leave this blank to default to the user running the script
  7. password, the password for the database user. Either plain-text password or ec2ParameterStore object
  8. maxError, a Redshift-specific setting governing how many load errors should be permitted before failing the overall load. See the Redshift COPY documentation for more details
  9. compRows, a Redshift-specific setting defining number of rows to be used as the sample size for compression analysis. Should be between 1000 and 1000000000
  10. purpose: common for all targets. Redshift supports only ENRICHED_EVENTS
  11. sslMode, determines how to handle encryption for client connections and server certificate verification. The following sslMode values are supported:
  • DISABLE: SSL is disabled and the connection is not encrypted.
  • REQUIRE: SSL is required.
  • VERIFY_CA: SSL must be used and the server certificate must be verified.
  • VERIFY_FULL: SSL must be used. The server certificate must be verified and the server hostname must match the hostname attribute on the certificate.
  12. roleArn: AWS Role ARN allowing Redshift to read data from S3
  13. sshTunnel: optional bastion host configuration

Note: The difference between VERIFY_CA and VERIFY_FULL depends on the policy of the root CA. If a public CA is used, VERIFY_CA allows connections to a server that somebody else may have registered with the CA to succeed. In this case, VERIFY_FULL should always be used. If a local CA is used, or even a self-signed certificate, using VERIFY_CA often provides enough protection.
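
To make the fields concrete, here is a minimal sketch of a Redshift target configuration, written as a self-describing JSON wrapping the fields listed above. Every value below (host, credentials, role ARN) is an illustrative placeholder, not a working default:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/3-0-0",
  "data": {
    "name": "AWS Redshift enriched events storage",
    "host": "example.abc123.us-east-1.redshift.amazonaws.com",
    "database": "snowplow",
    "port": 5439,
    "schema": "atomic",
    "username": "storageloader",
    "password": "secret-password",
    "maxError": 1,
    "compRows": 20000,
    "purpose": "ENRICHED_EVENTS",
    "sslMode": "DISABLE",
    "roleArn": "arn:aws:iam::123456789012:role/RedshiftLoadRole",
    "sshTunnel": null
  }
}
```

To source the password from EC2 Parameter Store rather than storing it in plain text, the password value can instead be an object along the lines of {"ec2ParameterStore": {"parameterName": "snowplow.redshift.password"}}, where the parameter name is a placeholder.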

Postgres

Schema: iglu:com.snowplowanalytics.snowplow.storage/postgresql_config/jsonschema/2-0-0

  1. name, a descriptive name for this Snowplow storage target
  2. host, the host of the database to load
  3. database, the name of the database to load
  4. port, the port of the database to load. 5432 is the default Postgres port
  5. schema, the name of the database schema which will store your Snowplow tables
  6. username, the database user to load your Snowplow events with. You can leave this blank to default to the user running the script
  7. password, the password for the database user. Leave blank if there is no password
  8. sslMode, determines how to handle encryption for client connections and server certificate verification. The following sslMode values are supported:
  • DISABLE: SSL is disabled and the connection is not encrypted.
  • REQUIRE: SSL is required.
  • VERIFY_CA: SSL must be used and the server certificate must be verified.
  • VERIFY_FULL: SSL must be used. The server certificate must be verified and the server hostname must match the hostname attribute on the certificate.
  9. purpose: common for all targets. PostgreSQL supports only ENRICHED_EVENTS
  10. id: machine-readable config id in UUID format
  11. sshTunnel: optional bastion host configuration
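
A corresponding minimal sketch for a Postgres target, again with placeholder values (including the UUID) and the optional sshTunnel left as null:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/postgresql_config/jsonschema/2-0-0",
  "data": {
    "name": "PostgreSQL enriched events storage",
    "host": "localhost",
    "database": "snowplow",
    "port": 5432,
    "sslMode": "DISABLE",
    "schema": "atomic",
    "username": "storageloader",
    "password": "",
    "purpose": "ENRICHED_EVENTS",
    "id": "7cb1e1df-0d4e-4a6d-962c-f6f4e1b786d5",
    "sshTunnel": null
  }
}
```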

Snowflake

Schema: iglu:com.snowplowanalytics.snowplow.storage/snowflake_config/jsonschema/1-0-1

Snowflake configuration is documented on the dedicated Snowflake Loader wiki page.

Elasticsearch

Schema: iglu:com.snowplowanalytics.snowplow.storage/elastic_config/jsonschema/1-0-1

  1. name: a descriptive name for this Snowplow storage target
  2. port: the port of the Elasticsearch cluster to load into. Normally 9200; should be 80 for Amazon Elasticsearch Service.
  3. index: The Elasticsearch index to load
  4. nodesWanOnly: if this is set to true, the EMR job will disable node discovery. This option is necessary when using Amazon Elasticsearch Service.
  5. type: the name of the Elasticsearch document type to load into
  6. purpose: common for all targets. Elasticsearch supports only FAILED_EVENTS
  7. id: optional machine-readable config id
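
As a sketch, an Elasticsearch failed-events target might look as follows. Note that a host field (the cluster endpoint) is assumed here even though it is not in the list above, since the loader needs an endpoint to connect to; all values are placeholders:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/elastic_config/jsonschema/1-0-1",
  "data": {
    "name": "Elasticsearch failed events storage",
    "host": "search-example.us-east-1.es.amazonaws.com",
    "port": 80,
    "index": "snowplow-bad-rows",
    "type": "bad_row",
    "nodesWanOnly": true,
    "purpose": "FAILED_EVENTS"
  }
}
```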

For information on setting up Elasticsearch itself, see Setting up Amazon Elasticsearch Service.

DynamoDB

Schema: iglu:com.snowplowanalytics.snowplow.storage/amazon_dynamodb_config/jsonschema/1-0-1

  1. name: a descriptive name for this Snowplow storage target
  2. accessKeyId: AWS Access Key Id
  3. secretAccessKey: AWS Secret Access Key
  4. awsRegion: AWS region
  5. dynamodbTable: DynamoDB table to store information about processed events
  6. purpose: common for all targets. DynamoDB supports only DUPLICATE_TRACKING
  7. id: optional machine-readable config id
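
And a sketch of a DynamoDB duplicate-tracking target, using AWS's documented example credentials as placeholders and omitting the optional id:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/amazon_dynamodb_config/jsonschema/1-0-1",
  "data": {
    "name": "AWS DynamoDB duplicates storage",
    "accessKeyId": "AKIAIOSFODNN7EXAMPLE",
    "secretAccessKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    "awsRegion": "us-east-1",
    "dynamodbTable": "snowplow-duplicate-tracking",
    "purpose": "DUPLICATE_TRACKING"
  }
}
```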