
DynamoDB EMR Exporter

Uses EMR clusters to export and import DynamoDB tables to/from S3. It uses the same routines as AWS Data Pipeline, but runs everything through a single cluster for all tables rather than a cluster per table.

Export Usage

The tool is packaged as a Docker container with all required prerequisites. To run it:

  • Create a new IAM role. It must be named dynamodb_emr_backup_restore and use the IAM policy contained in config-samples/dynamodb_emr_backup_restore.IAMPOLICY.json

  • Create a new EMR Security Configuration in any region you wish to back up from or restore to. It must be named dynamodb-backups

  • Run the docker container as follows:

docker run \
  signiant/dynamodb-emr-exporter \
    app_name \
    emr_cluster_name \
    table_filter \
    read_throughput_percentage \
    s3_location \
    export_region \
    spiked_throughput \
    number_of_clusters

Where

  • app_name is a 'friendly name' for the DynamoDB table set you wish to export
  • emr_cluster_name is a name to give to the EMR cluster
  • table_filter is a filter on which table names to export (e.g. MYAPP_PROD will export ALL tables whose names start with MYAPP_PROD)
  • read_throughput_percentage is the fraction of provisioned read throughput to use (e.g. 0.45 will use 45% of the provisioned read throughput)
  • s3_location is a base S3 location to store the exports and all logs (e.g. s3://mybucket/myfolder)
  • export_region is the AWS region where the tables to export exist
  • spiked_throughput is an optional provisioned read throughput value to spike each table's read throughput to while it is being backed up
  • number_of_clusters is an optional value specifying how many EMR clusters to use (default 1)

An optional environment variable DEBUG_OUTPUT can also be passed to the container, which will run the underlying script with debug output enabled.
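As an illustration, a run of the following shape would export every table whose name starts with MYAPP_PROD from us-east-1, using 45% of the provisioned read throughput and writing to s3://mybucket/dynamodb-exports, with debug output enabled. The app name, cluster name and bucket are placeholders, and DEBUG_OUTPUT is assumed to enable debug when set to any value; the optional spiked_throughput and number_of_clusters arguments are simply omitted.

docker run \
  -e DEBUG_OUTPUT=true \
  signiant/dynamodb-emr-exporter \
    myapp \
    myapp-backup-cluster \
    MYAPP_PROD \
    0.45 \
    s3://mybucket/dynamodb-exports \
    us-east-1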

Excluding tables

You can place an optional file called excludes into the S3 location (i.e. whatever you have specified for s3_location) to exclude tables. The format is one full table name per line; wildcards are not supported here. Any table which matches the table_filter but also matches an entry in the excludes file will NOT be exported.
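For example, an excludes file that skips two specific tables might look like this (the table names are placeholders):

MYAPP_PROD_sessions
MYAPP_PROD_audit_log

It could then be uploaded with the AWS CLI, assuming s3://mybucket/myfolder is your s3_location and the file sits directly under that prefix:

aws s3 cp excludes s3://mybucket/myfolder/excludes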

Import Usage

When the export runs, it also generates the configuration needed to execute an import. You can find the configuration file for importing (importSteps.json) within the S3 location you specified.

Running the import

The import can be run from Docker, but you'll need an interactive shell inside the container to run it.

docker run -it \
  --entrypoint bash \
  signiant/dynamodb-emr-exporter

Before running the import, you need to perform two tasks:

  1. The tables you are importing data into MUST already exist with the same key structure in the region you wish to import into
  2. Copy the importSteps.json file from the S3 bucket containing the exports into the /app/common-json folder inside the Docker container (see the example after this list)
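For example, task 2 could be done from inside the container with the AWS CLI, assuming s3://mybucket/myfolder was the s3_location used at export time and importSteps.json sits at the top of that prefix:

aws s3 cp s3://mybucket/myfolder/importSteps.json /app/common-json/importSteps.json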

Once these are done, you can invoke the restore like so

./restoreEMR.sh app_name emr_cluster_name local_json_files_path s3_path_for_logs cluster_region

Where

  • app_name is a 'friendly name' for the DynamoDB table set you wish to import
  • emr_cluster_name is a name to give to the EMR cluster
  • local_json_files_path is the folder containing the JSON files produced by the export (generally, this will be /app/common-json)
  • s3_path_for_logs is a base S3 location to store logs from EMR related to the import
  • cluster_region is the AWS region in which to start the EMR cluster. This does not have to be the same region the tables are being imported into
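A concrete invocation might look like the following, where the app name, cluster name and log bucket are placeholders and the steps files have already been copied to /app/common-json:

./restoreEMR.sh myapp myapp-restore-cluster /app/common-json s3://mybucket/restore-logs us-east-1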

NOTE: The write throughput to use for the DynamoDB tables during an import is defined in the script that runs at export time, because that value is written into the generated importSteps.json file. If you wish to increase it, edit the generated importSteps.json file before running the import.

Workings

The basic mechanics of the process are as follows

Export

  1. Check whether any EMR clusters are already running for 'this' app. If so, exit; otherwise, carry on (see the sketch after this list)
  2. Set up the common configuration for the cluster
  3. Call the python script to generate the steps (tasks) for EMR for each table. This essentially lists all the tables in the region, applies the provided filter and then generates the JSON that can be passed to EMR to export the tables
  4. Once the steps JSON is present, create a new cluster with the AWS CLI. Cluster setup can fail here, so retries are used to handle failures
  5. Submit the tasks to the cluster and poll it until it is complete (also sketched below). Any error in a step will result in a failure being logged
  6. Once we know everything was successful, write the export and import steps files to S3 in case this machine has issues. We also write flag files to S3 indicating the progress of the export (in progress, complete, error, etc.) so that any other process which needs to ingest this data can poll these status files.
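Steps 1 and 5 can be pictured as ordinary AWS CLI calls. The following is only a sketch of the idea, not the actual scripts: the cluster name, cluster id and polling interval are illustrative.

# Step 1: is a cluster already running for this app?
aws emr list-clusters --active \
  --query "Clusters[?Name=='myapp-backup-cluster'].Id" --output text

# Step 5: poll a cluster until it finishes
while true; do
  STATE=$(aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
            --query 'Cluster.Status.State' --output text)
  echo "Cluster state: $STATE"
  case "$STATE" in
    TERMINATED)             break ;;
    TERMINATED_WITH_ERRORS) echo "Cluster failed" >&2; exit 1 ;;
  esac
  sleep 60
done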

Import

  1. Create a new EMR cluster with the import steps file as the tasks to perform (a sketch of this call follows the list)
  2. Poll the cluster to ensure success
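Conceptually, step 1 maps onto an aws emr create-cluster call that takes the generated steps file. This is only a sketch under assumptions: the release label, instance sizes, roles and auto-terminate behaviour shown here are illustrative, not the values the scripts actually use.

aws emr create-cluster \
  --name myapp-restore-cluster \
  --release-label emr-5.36.0 \
  --applications Name=Hadoop \
  --use-default-roles \
  --security-configuration dynamodb-backups \
  --instance-type m5.xlarge --instance-count 3 \
  --log-uri s3://mybucket/restore-logs/ \
  --steps file:///app/common-json/importSteps.json \
  --auto-terminate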
