mongo fields to an
    -config string
        String corresponding to an env var config
    -collection string
        The mongo collection you wish to pull from (required)
    -database string
        Database url if using existing instance (required)
    -bucket string
        s3 bucket to upload to
mongo-to-s3 does a few things:
- connects to the provided mongo
- determines the correct "data date" by rounding down to the nearest hour
- parses the provided config file
- for each table in the config file
  - pulls the whitelisted fields from mongo
  - flattens objects into dot-separated fields
  - streams to gzipped, timestamped JSON files on s3
- prints the payload to be used in an s3-to-redshift job to process this data
  - the job is kickstarted automatically by a workflow
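The per-table steps above (whitelist, flatten, gzip) can be sketched roughly as follows. This is an illustrative Python sketch, not the tool's actual code: the function names `flatten` and `export_rows` are hypothetical, and it assumes the whitelist is applied to the flattened field names and that the output is one JSON object per line.

```python
import gzip
import io
import json


def flatten(doc, prefix=""):
    """Flatten nested dicts into one dict with dot-separated keys."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def export_rows(docs, whitelist):
    """Keep only whitelisted flattened fields; return gzipped JSON lines."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for doc in docs:
            flat = flatten(doc)
            row = {name: flat[name] for name in whitelist if name in flat}
            gz.write((json.dumps(row) + "\n").encode("utf-8"))
    return buf.getvalue()


# Hypothetical document: "secret" is not whitelisted, so it is dropped.
docs = [{"_id": "a1", "name": {"first": "Ada", "last": "L"}, "secret": "x"}]
payload = export_rows(docs, ["_id", "name.first"])
rows = [json.loads(line) for line in gzip.decompress(payload).splitlines()]
```

Here `rows` holds the decoded records, with the nested `name.first` field exposed as a single dot-separated key.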
mongo-to-s3 will attempt to export all fields/tables in the X_config.yml whitelist it is called with.
Updating config files
Configs are env vars in YAML and follow this format:
    tablename-whateveryouwant:
      dest: <redshift_table_name>
      source: <mongo_table_name>
      columns:
        - dest: _data_timestamp
          type: timestamp
          sortord: 1
        - dest: <column_name_in_redshift>
          source: <column_name_in_mongo>
          type: text
          primarykey: true
          notnull: true
          distkey: true
      meta:
        datadatecolumn: _data_timestamp
        schema: <redshift_schema_name>
Internal note: configs are located in ark-config.
There are a few tricky things, including some items that are changing in the near future.
datadatecolumn helps keep track of the date of the data going into the data warehouse, and prevents us from overwriting new data with old. We therefore want to set it to approximately when the data was created.
Currently, we do this via a special column that we specify in the meta section. Whatever column you specify there will be overwritten with the date the mongo-to-s3 worker is run, rounded down to the nearest hour. Note that we don't require a source for this column, as mongo-to-s3 populates it.
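As an illustration of the rounding described above, here is a minimal Python sketch. The function name `data_date` and the sample row are hypothetical, and the use of UTC and an ISO 8601 string is an assumption; the source only states that the value is the run time rounded down to the nearest hour.

```python
from datetime import datetime, timezone


def data_date(now):
    """Round a datetime down to the start of its hour."""
    return now.replace(minute=0, second=0, microsecond=0)


# A worker run at 14:47:12 UTC stamps every row with 14:00:00 UTC.
run_time = datetime(2024, 3, 5, 14, 47, 12, tzinfo=timezone.utc)
stamp = data_date(run_time)

# Whatever the data date column held before is overwritten.
row = {"_id": "a1", "_data_timestamp": "stale value"}
row["_data_timestamp"] = stamp.isoformat()
```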
We currently don't support more than one sortkey, so the only valid value for sortord is 1.
You also have to set notnull on primarykey columns, even though that is implied.
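These two rules can be checked mechanically before a config is deployed. The sketch below is a hypothetical helper, not part of mongo-to-s3. It assumes the config has already been parsed into plain Python dicts, and it reads the format above as meaning that the sort-key field is `sortord` and that `notnull` is the flag required alongside `primarykey`.

```python
def validate_table(table):
    """Return a list of problems found in one parsed table config."""
    problems = []
    columns = table.get("columns", [])

    # Only a single sort key is supported.
    if sum(1 for col in columns if "sortord" in col) > 1:
        problems.append("more than one sort key column")

    for col in columns:
        name = col.get("dest", "<unnamed>")
        if "sortord" in col and col["sortord"] != 1:
            problems.append(f"{name}: sortord must be 1")
        if col.get("primarykey") and not col.get("notnull"):
            problems.append(f"{name}: primarykey column must also set notnull")
    return problems


# Hypothetical table config with a primary key that forgot notnull.
table = {
    "dest": "users",
    "source": "users",
    "columns": [
        {"dest": "_data_timestamp", "type": "timestamp", "sortord": 1},
        {"dest": "id", "source": "_id", "type": "text", "primarykey": True},
    ],
    "meta": {"datadatecolumn": "_data_timestamp", "schema": "public"},
}
problems = validate_table(table)
```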
Accepted column types are:
- text (256 characters)
- longtext (65535 characters)
It should be easy to add more, however.
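To make the two limits concrete, the following Python sketch picks the narrowest listed type that can hold a string. The names `MAX_CHARS`, `fits`, and `narrowest_type` are hypothetical. The sketch counts Python characters, which is an assumption: Redshift measures VARCHAR lengths in bytes, so multi-byte text may need a stricter check.

```python
# Character limits taken from the list above.
MAX_CHARS = {"text": 256, "longtext": 65535}


def fits(value, column_type):
    """Report whether a string fits within the limit for column_type."""
    return len(value) <= MAX_CHARS[column_type]


def narrowest_type(value):
    """Pick the smallest listed type that can hold value, or None."""
    for column_type in ("text", "longtext"):
        if fits(value, column_type):
            return column_type
    return None


short_type = narrowest_type("a" * 256)
long_type = narrowest_type("a" * 257)
too_long = narrowest_type("a" * 65536)
```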
You may want to think about what happens when some data arrives in the data warehouse sooner than other data. For instance, suppose item A is only "active" if an item B exists in the database and points to A. If you've synced over A significantly before B, it may appear that A is "inactive" until B is synced over. In reality, A has always been "active".
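The A/B scenario can be reproduced in a few lines. This is a toy Python illustration with hypothetical table and field names (`warehouse_a`, `warehouse_b`, `a_id`), not a description of any real schema.

```python
def is_active(a_id, b_rows):
    """A counts as 'active' only if some B row points at it."""
    return any(b["a_id"] == a_id for b in b_rows)


warehouse_a = [{"id": "A1"}]  # A has already been synced
warehouse_b = []              # B has not arrived yet
before = is_active("A1", warehouse_b)

# B is synced later; only now does A look active in the warehouse.
warehouse_b.append({"id": "B1", "a_id": "A1"})
after = is_active("A1", warehouse_b)
```

Queries run between the two syncs would report A as inactive, so it can help to sync related collections together or to filter on the data date column.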
While you pass collections to run on as parameters to mongo-to-s3, the eventual s3-to-redshift job will post with the destination table names as parameters.
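The relationship between the two sets of names follows from the source and dest fields in the config. As a rough Python sketch, assuming the config is already parsed into a dict (the helper `destination_tables` and the sample table names are hypothetical):

```python
def destination_tables(config, collections):
    """Map mongo collection names to the redshift table names in the config."""
    wanted = set(collections)
    return [
        table["dest"]
        for table in config.values()
        if table["source"] in wanted
    ]


# Hypothetical config with two tables; only "orders" is requested.
config = {
    "users-v1": {"source": "users", "dest": "mongo_users"},
    "orders-v1": {"source": "orders", "dest": "mongo_orders"},
}
tables = destination_tables(config, ["orders"])
```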