Configuring storage targets
HOME » SNOWPLOW SETUP GUIDE » Step 4: setting up alternative data stores » Configuring storage targets
🚧 The documentation for the latest version can be found on the Snowplow documentation site.
Snowplow offers the option to configure certain storage targets. This is done using configuration JSONs.
When running EmrEtlRunner, the `--targets` argument should be populated with the path of a directory containing your configuration JSONs. Each storage target JSON file can have an arbitrary name, but must conform to its JSON Schema.
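For example, an EmrEtlRunner invocation wired up with a targets directory might look like the following sketch (the paths and file names here are illustrative, not prescribed):

```shell
# --targets points at a directory, not a single file; every JSON
# file inside it is validated against its schema and then used
# as a storage target for the run
./snowplow-emr-etl-runner run \
  --config config/config.yml \
  --resolver config/iglu_resolver.json \
  --targets config/targets/
```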
Some targets are handled by EmrEtlRunner (duplicate tracking, failure tracking) and some by RDB Loader (enriched data).
Here's a list of currently supported targets, grouped by purpose:
- Enriched data
- Failures
- Duplicate tracking
Schema: iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/3-0-0
- `name`, a descriptive name for this Snowplow storage target
- `host`, the host (endpoint in Redshift parlance) of the database to load
- `database`, the name of the database to load
- `port`, the port of the database to load. 5439 is the default Redshift port
- `schema`, the name of the database schema which will store your Snowplow tables
- `username`, the database user to load your Snowplow events with. You can leave this blank to default to the user running the script
- `password`, the password for the database user. Either a plain-text password or an `ec2ParameterStore` object
- `maxError`, a Redshift-specific setting governing how many load errors should be permitted before failing the overall load. See the Redshift `COPY` documentation for more details
- `compRows`, a Redshift-specific setting defining the number of rows to be used as the sample size for compression analysis. Should be between 1000 and 1000000000
- `purpose`: common to all targets. Redshift supports only `ENRICHED_EVENTS`
- `sslMode`, determines how to handle encryption for client connections and server certificate verification. The following `sslMode` values are supported:
  - `DISABLE`: SSL is disabled and the connection is not encrypted
  - `REQUIRE`: SSL is required
  - `VERIFY_CA`: SSL must be used and the server certificate must be verified
  - `VERIFY_FULL`: SSL must be used. The server certificate must be verified and the server hostname must match the hostname attribute on the certificate
- `roleArn`: AWS Role ARN allowing Redshift to read data from S3
- `sshTunnel`: optional bastion host configuration
Note: The difference between `VERIFY_CA` and `VERIFY_FULL` depends on the policy of the root CA. If a public CA is used, `VERIFY_CA` allows connections to a server that somebody else may have registered with the CA to succeed. In this case, `VERIFY_FULL` should always be used. If a local CA is used, or even a self-signed certificate, using `VERIFY_CA` often provides enough protection.
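Putting the fields above together, a Redshift target configuration might look like the following sketch (all values are placeholders to replace with your own, and the optional `sshTunnel` is omitted):

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/3-0-0",
  "data": {
    "name": "AWS Redshift enriched events storage",
    "host": "ADD HERE",
    "database": "snowplow",
    "port": 5439,
    "sslMode": "DISABLE",
    "schema": "atomic",
    "username": "storageloader",
    "password": "ADD HERE",
    "maxError": 1,
    "compRows": 20000,
    "purpose": "ENRICHED_EVENTS",
    "roleArn": "ADD HERE"
  }
}
```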
Schema: iglu:com.snowplowanalytics.snowplow.storage/postgresql_config/jsonschema/2-0-0
- `name`, a descriptive name for this Snowplow storage target
- `host`, the host of the database to load
- `database`, the name of the database to load
- `port`, the port of the database to load. 5432 is the default PostgreSQL port
- `schema`, the name of the database schema which will store your Snowplow tables
- `username`, the database user to load your Snowplow events with. You can leave this blank to default to the user running the script
- `password`, the password for the database user. Leave blank if there is no password
- `sslMode`, determines how to handle encryption for client connections and server certificate verification. The following `sslMode` values are supported:
  - `DISABLE`: SSL is disabled and the connection is not encrypted
  - `REQUIRE`: SSL is required
  - `VERIFY_CA`: SSL must be used and the server certificate must be verified
  - `VERIFY_FULL`: SSL must be used. The server certificate must be verified and the server hostname must match the hostname attribute on the certificate
- `purpose`: common to all targets. PostgreSQL supports only `ENRICHED_EVENTS`
- `id`: machine-readable config id in UUID format
- `sshTunnel`: optional bastion host configuration
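For illustration, a PostgreSQL target configuration might look like this sketch (all values are placeholders; `id` must be a valid UUID of your choosing, and the optional `sshTunnel` is omitted):

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/postgresql_config/jsonschema/2-0-0",
  "data": {
    "name": "PostgreSQL enriched events storage",
    "host": "ADD HERE",
    "database": "snowplow",
    "port": 5432,
    "sslMode": "DISABLE",
    "schema": "atomic",
    "username": "storageloader",
    "password": "ADD HERE",
    "purpose": "ENRICHED_EVENTS",
    "id": "ADD HERE"
  }
}
```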
Schema: iglu:com.snowplowanalytics.snowplow.storage/snowflake_config/jsonschema/1-0-1
Snowflake configuration is documented on the dedicated Snowflake Loader wiki.
Schema: iglu:com.snowplowanalytics.snowplow.storage/elastic_config/jsonschema/1-0-1
- `name`: a descriptive name for this Snowplow storage target
- `port`: the port to load. Normally 9200; should be 80 for Amazon Elasticsearch Service
- `index`: the Elasticsearch index to load
- `nodesWanOnly`: if this is set to true, the EMR job will disable node discovery. This option is necessary when using Amazon Elasticsearch Service
- `type`: the name of the Elasticsearch mapping type
- `purpose`: common to all targets. Elasticsearch supports only `FAILED_EVENTS`
- `id`: optional machine-readable config id
For information on setting up Elasticsearch itself, see Setting up Amazon Elasticsearch Service.
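As a sketch, an Elasticsearch target configuration might look like the following (values are placeholders; the `host` field carrying the cluster endpoint is an assumption here, as it is not in the field list above):

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/elastic_config/jsonschema/1-0-1",
  "data": {
    "name": "Elasticsearch failed events storage",
    "host": "ADD HERE",
    "port": 9200,
    "index": "snowplow",
    "type": "bad_rows",
    "nodesWanOnly": false,
    "purpose": "FAILED_EVENTS"
  }
}
```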
Schema: iglu:com.snowplowanalytics.snowplow.storage/amazon_dynamodb_config/jsonschema/1-0-1
- `name`: a descriptive name for this Snowplow storage target
- `accessKeyId`: AWS Access Key ID
- `secretAccessKey`: AWS Secret Access Key
- `awsRegion`: AWS region
- `dynamodbTable`: DynamoDB table to store information about processed events
- `purpose`: common to all targets. DynamoDB supports only `DUPLICATE_TRACKING`
- `id`: optional machine-readable config id
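For illustration, a DynamoDB target configuration for duplicate tracking might look like this sketch (all values are placeholders; the table and region names are invented, and the optional `id` is omitted):

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/amazon_dynamodb_config/jsonschema/1-0-1",
  "data": {
    "name": "AWS DynamoDB duplicates storage",
    "accessKeyId": "ADD HERE",
    "secretAccessKey": "ADD HERE",
    "awsRegion": "us-east-1",
    "dynamodbTable": "snowplow-duplicate-storage",
    "purpose": "DUPLICATE_TRACKING"
  }
}
```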
Copyright © 2012-2021 Snowplow Analytics Ltd. Documentation terms of use.