s3-to-redshift is responsible for syncing data from
s3 into AWS Redshift for data analysis.
Note: this repository formerly was called
redshifter, but has been modified to fit a slightly different design pattern.
AWS Redshift is a columnar storage based data warehouse solution. It is optimized for analysis and business intelligence and has many useful integrations.
However, getting bulk data into
Redshift can be tricky and requires many steps, like:
- Modifying tables in
- Modifying whatever process is collecting your data
- Modifying whatever process is submitting your data into
We are trying to minimize the amount of work to add or modify data going into
Redshift by automatically:
- Finding the latest data
- Modifying the destination
Redshifttables, if necessary
- Refreshing the latest
Redshiftdata (see: the
granularityflag) by efficiently loading s3 data using the
We split up environment variables and command line flags based on realistic ways one might run this in production.
Essentially, for data that corresponds to
s3 connections, the config is stored in environment variables.
Otherwise, it is likely one will want to change data such as
table between runs, so information like that is expected in flag form.
Testing can be done manually by running
s3-to-redshift locally with the desired parameters:
SCHEMA=api TABLES=business_metrics_auth_counts ark start s3-to-redshift -e clever-dev -l
See the Makefile for a complete list of parameters you can use for testing.
Possible flags and their meanings:
Redshiftschema to insert into
Redshifttables to insert into, comma separated
s3bucket to pull from
truncate: clear the table before inserting
force: refresh the data even if the data date is after the current
date: the date string for the data in question
config: override of the usual auto-discovery of the config
delimiter: required to use CSV files, what the file is delimited in (likely use the '|' pipe character as that is AWS' default). If
""then JSON copy is assumed
granularity: how often we expect to append new data for each table (i.e. daily, or hourly buckets)
timezone: specifies what timezone the target data is in (i.e. 'America/Los_Angeles'). Must be in the IANA Time Zone database.
Note on general usage:
This worker is intended to have a good amount of power and intelligence, instead of being a simple connector.
Thus, it looks for the right data automatically, and compares against what's in
s3-to-redshift has determined the s3 file exists, the worker inspects the target
- If there is not data in the table, no checks are needed and the process continues.
- If there is already data in the table,
s3-to-redshiftfinds the column that corresponds to the date of that data and compares with the date of the latest data in
Note that this "data date" is not necessarily the date the data itself was written to disk - it is not modified time, but instead the actual time the data was collected at its source.
The date parameter is required, and should match the date in the file name of the data file to transform.
This parameter should be the specific, full RFC3999 date, such as:
When the data already in the database is newer by "data date" than the data in
s3, we do not overwrite it or insert it.
This should protect us from accidental duplicate information or replacing newer data with older data.
However, if you do need to overwrite, passing the
--force flag will skip this check.
--force flag may be useful when:
- Business logic has changed and data needs to be overwritten
- An upstream process has written incorrect data which needs to be reinserted into
- Upstream processes write data out-of-order by design, and each run of
s3-to-redshiftis invoked with the
In normal operation, the worker looks for a config file for each schema/table combination.
This takes the form:
config_<data filename without suffix>.yml
The worker looks for a config file with that date as the data timestamp.
Your upstream producer might not want to write a config file for each set of data, or perhaps you have a central configuration location.
In this case, you can use the
--config parameter to pass a specific config file.
This file is accessed via Pathio, so the file may reside on
s3 or locally.
--truncate option set,
s3-to-redshift will insert into an existing table but leave any data already remaining in the table (except for the most recent data within the past granularity time range, which will be refreshed as new syncs come in).
s3-to-redshift will not insert or overwrite for a particular time period thus the worker is idempotent and duplicate data is not a concern.
If you instead are adding snapshot / dimension data to
Redshift, you should use the
--truncate option to clear out the existing data before inserting the current "state of the world".
One caveat: the
--truncate option does not also imply
If the data in
s3 is not newer than the data in
Redshift, the worker will refuse to truncate and replace the data without
Also please note that this can cause performance problems if you are not running a vacuum at least weekly.
--granularity flag describes how often we expect to append new data to the destination table. For instance, perhaps we would like to track daily school counts in
Redshift. Therefore, we expect one set of values per day to be stored in this table (and we specify this with
s3-to-redshift syncs updating the daily school count can still happen each day, but only the most recent sync data will be stored (as
s3-to-redshift will simply overwrite the existing school counts for the most recent day). As a result,
s3-to-redshift refreshes data in the latest time range, while leaving historical data untouched (and modifiable only via
--force). The width of this time range is specified by
We also support a 'streaming' granularity which means that we don't have a fixed granularity and we can pull data whenever. If the granularity is streaming we only add new data to the table.
Currently supported granularities are
Assuming that environment variables have been set:
go run cmd/main.go -schema=api_hits -tables=pages,sessions \ -bucket=analytics -config=s3://analytics/api.yml -date=2015-07-01T00:00:00Z -force=true -delimiter="|"
Please view the dev-handbook for instructions.