
DynamoDB EMR Exporter

Uses EMR clusters to export and import DynamoDB tables to/from S3. It uses the same routines as AWS Data Pipeline, but runs everything through a single cluster for all tables rather than a cluster per table.

Export Usage

The tool is packaged as a Docker container with all required prerequisites. To run it:

  • Create a new IAM role. It must be named dynamodb_emr_backup_restore and use the IAM policy contained in config-samples/dynamodb_emr_backup_restore.IAMPOLICY.json

  • Create a new EMR Security Configuration in any region you wish to back up from or restore to. It must be named dynamodb-backups

  • Run the docker container as follows:

docker run \
  signiant/dynamodb-emr-exporter \
    app_name \
    emr_cluster_name \
    table_filter \
    read_throughput_percentage \
    s3_location \
    export_region \
    spiked_throughput \
    number_of_clusters

Where

  • app_name is a 'friendly name' for the DynamoDB table set you wish to export
  • emr_cluster_name is a name to give to the EMR cluster
  • table_filter is a filter on which table names to export (e.g. MYAPP_PROD will export ALL tables whose names start with MYAPP_PROD)
  • read_throughput_percentage is the fraction of provisioned read throughput to use (e.g. 0.45 will use 45% of the provisioned read throughput)
  • s3_location is a base S3 location to store the exports and all logs (e.g. s3://mybucket/myfolder)
  • export_region is the AWS region where the tables to export exist
  • spiked_throughput is an optional provisioned read throughput value to spike each table's read throughput to while it is being backed up
  • number_of_clusters is an optional value specifying how many EMR clusters to use (default 1)

An optional environment variable DEBUG_OUTPUT can also be passed to the container, which will run the underlying script with debug output enabled.
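As an illustration, a run of the following shape would export every table whose name starts with MYAPP_PROD from us-east-1, using 45% of the provisioned read throughput and writing to s3://mybucket/dynamodb-exports, with debug output enabled. The app name, cluster name and bucket are placeholders, and DEBUG_OUTPUT is assumed to enable debug when set to any value; the optional spiked_throughput and number_of_clusters arguments are simply omitted.

docker run \
  -e DEBUG_OUTPUT=true \
  signiant/dynamodb-emr-exporter \
    myapp \
    myapp-backup-cluster \
    MYAPP_PROD \
    0.45 \
    s3://mybucket/dynamodb-exports \
    us-east-1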

Excluding tables

You can place an optional file called excludes into the S3 location (i.e. whatever you have specified for s3_location) to exclude tables. The format is one full table name per line; wildcards are not supported here. Any table which matches the table_filter but also matches an entry in the excludes file will NOT be exported.
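For example, an excludes file that skips two specific tables might look like this (the table names are placeholders):

MYAPP_PROD_sessions
MYAPP_PROD_audit_log

It could then be uploaded with the AWS CLI, assuming s3://mybucket/myfolder is your s3_location and the file sits directly under that prefix:

aws s3 cp excludes s3://mybucket/myfolder/excludes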

Import Usage

When the export runs, it also generates the configuration needed to execute an import. You can find the configuration file for importing (importSteps.json) within the S3 location you specified.

Running the import

The import can be run from Docker, but you'll need an interactive shell inside the container to run it.

docker run -it \
  --entrypoint bash \
  signiant/dynamodb-emr-exporter

Before running the import, you need to perform two tasks:

  1. The tables you are importing data into MUST already exist with the same key structure in the region you wish to import into
  2. Copy the importSteps.json file from the S3 bucket containing the exports into the /app/common-json folder inside the Docker container (see the example after this list)
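For example, task 2 could be done from inside the container with the AWS CLI, assuming s3://mybucket/myfolder was the s3_location used at export time and importSteps.json sits at the top of that prefix:

aws s3 cp s3://mybucket/myfolder/importSteps.json /app/common-json/importSteps.json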

Once these are done, you can invoke the restore like so

./restoreEMR.sh app_name emr_cluster_name local_json_files_path s3_path_for_logs cluster_region

Where

  • app_name is a 'friendly name' for the DynamoDB table set you wish to import
  • emr_cluster_name is a name to give to the EMR cluster
  • local_json_files_path is the folder containing the JSON files produced by the export (generally, this will be /app/common-json)
  • s3_path_for_logs is a base S3 location to store logs from EMR related to the import
  • cluster_region is the AWS region in which to start the EMR cluster. This does not have to be the same region the tables are being imported into
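A concrete invocation might look like the following, where the app name, cluster name and log bucket are placeholders and the steps files have already been copied to /app/common-json:

./restoreEMR.sh myapp myapp-restore-cluster /app/common-json s3://mybucket/restore-logs us-east-1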

NOTE: The write throughput to use for the DynamoDB tables during an import is defined in the script that runs at export time, because that value is written into the generated importSteps.json file. If you wish to increase it, edit the generated importSteps.json file before running the import.

Workings

The basic mechanics of the process are as follows

Export

  1. Check whether any EMR clusters are already running for 'this' app. If so, exit; otherwise, carry on (see the sketch after this list)
  2. Set up the common configuration for the cluster
  3. Call the python script to generate the steps (tasks) for EMR for each table. This essentially lists all the tables in the region, applies the provided filter and then generates the JSON that can be passed to EMR to export the tables
  4. Once the steps JSON is present, create a new cluster with the AWS CLI. Cluster setup can fail here, so retries are used to handle failures
  5. Submit the tasks to the cluster and poll it until it is complete (also sketched below). Any error in a step will result in a failure being logged
  6. Once we know everything was successful, write the export and import steps files to S3 in case this machine has issues. We also write flag files to S3 indicating the progress of the export (in progress, complete, error, etc.) so that any other process which needs to ingest this data can poll these status files.
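Steps 1 and 5 can be pictured as ordinary AWS CLI calls. The following is only a sketch of the idea, not the actual scripts: the cluster name, cluster id and polling interval are illustrative.

# Step 1: is a cluster already running for this app?
aws emr list-clusters --active \
  --query "Clusters[?Name=='myapp-backup-cluster'].Id" --output text

# Step 5: poll a cluster until it finishes
while true; do
  STATE=$(aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
            --query 'Cluster.Status.State' --output text)
  echo "Cluster state: $STATE"
  case "$STATE" in
    TERMINATED)             break ;;
    TERMINATED_WITH_ERRORS) echo "Cluster failed" >&2; exit 1 ;;
  esac
  sleep 60
done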

Import

  1. Create a new EMR cluster with the import steps file as the tasks to perform (a sketch of this call follows the list)
  2. Poll the cluster to ensure success
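Conceptually, step 1 maps onto an aws emr create-cluster call that takes the generated steps file. This is only a sketch under assumptions: the release label, instance sizes, roles and auto-terminate behaviour shown here are illustrative, not the values the scripts actually use.

aws emr create-cluster \
  --name myapp-restore-cluster \
  --release-label emr-5.36.0 \
  --applications Name=Hadoop \
  --use-default-roles \
  --security-configuration dynamodb-backups \
  --instance-type m5.xlarge --instance-count 3 \
  --log-uri s3://mybucket/restore-logs/ \
  --steps file:///app/common-json/importSteps.json \
  --auto-terminate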
