A simple tool for replaying S3 file creation Lambda invocations. This is useful for backfilling or replaying data on real-time ETL pipelines whose transformations run in Lambdas triggered by S3 file creation events.
Steps:
- Collect inputs from user
- Scan S3 for filenames that need to be replayed
- Batch S3 files into payloads for Lambda invocations
- Spawn workers to handle individual Lambda invocations/retries
- Process the work queue, keeping track of progress in a file in case of interrupts
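The batching step above can be sketched in a few lines. The function name and default batch size here are illustrative, not the tool's actual internals:

```python
def batch_keys(keys, batch_size=10):
    """Split a list of S3 keys into batches of at most batch_size,
    one batch per Lambda invocation payload."""
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]

# 25 keys with a batch size of 10 yield 3 payloads: 10, 10, and 5 keys.
keys = [f"objects/dt=2020-01-02-08-00/part-{n}.json" for n in range(25)]
batches = batch_keys(keys)
```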
- The first step is to set up a Python 3 venv to hold our dependencies:

```
./setup-venv.sh
```
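For reference, a minimal setup-venv.sh might look like the sketch below; the repo's actual script may differ (for example, it may also pip-install boto3 or a requirements file):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of setup-venv.sh -- the real script may differ.
set -euo pipefail

VENV_DIR="${1:-venv}"            # optional custom venv location

python3 -m venv "$VENV_DIR"      # create the virtualenv with the system python3

echo "venv created: activate it with 'source $VENV_DIR/bin/activate'"
```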
- Run the command and follow the prompts:

```
python s3-lambda-replay.py
```
Run the help command for a full list of available command line options:

```
python s3-lambda-replay.py --help
```
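The flags used in the example below could be wired up roughly as follows. Only -b, -p, and -l appear in this README; the long option names and help strings are guesses, and the real script likely accepts more options:

```python
import argparse

def build_parser():
    # Sketch of the tool's known flags; long names are assumptions.
    parser = argparse.ArgumentParser(
        description="Replay S3 file creation events into a Lambda")
    parser.add_argument("-b", "--bucket", required=True,
                        help="S3 bucket to scan for files to replay")
    parser.add_argument("-p", "--prefixes", required=True,
                        help="comma-separated list of S3 key prefixes")
    parser.add_argument("-l", "--lambda-name", required=True,
                        help="name of the Lambda function to invoke")
    return parser

args = build_parser().parse_args(
    ["-b", "my-bucket", "-p", "a/,b/", "-l", "my-function"])
prefixes = args.prefixes.split(",")  # multiple prefixes come in one flag
```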
Note that we must escape the $ in $LATEST so the shell does not expand it. This example is also included in the file run_replay.sh:
```
python3 s3-lambda-replay.py \
  -b gamesight-collection-pipeline-us-west-2-prod \
  -p twitch/all/chatters/\$LATEST/objects/dt=2020-01-02-08-00/,twitch/all/chatters/\$LATEST/objects/dt=2020-01-02-09-00/ \
  -l gstrans-prod-twitch-all-chatters
```
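For each batch, the tool needs to hand the Lambda a payload shaped like a real S3 trigger event so the handler can process it unchanged. A sketch of that construction, using the minimal subset of the S3 event schema most handlers read (the real tool may populate more fields):

```python
import json

def make_payload(bucket, keys):
    """Build an S3-event-shaped JSON payload for one Lambda invocation."""
    return json.dumps({
        "Records": [
            {
                "eventSource": "aws:s3",
                "eventName": "ObjectCreated:Put",
                "s3": {
                    "bucket": {"name": bucket},
                    "object": {"key": key},
                },
            }
            for key in keys
        ]
    })

payload = make_payload(
    "gamesight-collection-pipeline-us-west-2-prod",
    ["twitch/all/chatters/$LATEST/objects/dt=2020-01-02-08-00/part-0.json"])
# With boto3, this payload would then be passed to
# lambda_client.invoke(FunctionName=..., Payload=payload).
```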
Using the command line also allows us to quickly include prefixes that don't fall on / boundaries. For example, the prefix twitch/all/chatters/\$LATEST/objects/dt=2020-01-02-1 matches all records between 10:00 and 20:00 on 2020-01-02, and twitch/all/chatters/\$LATEST/objects/dt=2020-01-02 matches every object from that day.
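This works because S3 prefix filtering is plain string prefix matching, so a partial prefix like dt=2020-01-02-1 matches hours 10 through 19. A small illustration with hypothetical keys:

```python
prefix = "twitch/all/chatters/$LATEST/objects/dt=2020-01-02-1"
keys = [
    "twitch/all/chatters/$LATEST/objects/dt=2020-01-02-09-00/part-0.json",
    "twitch/all/chatters/$LATEST/objects/dt=2020-01-02-10-00/part-0.json",
    "twitch/all/chatters/$LATEST/objects/dt=2020-01-02-19-00/part-0.json",
    "twitch/all/chatters/$LATEST/objects/dt=2020-01-02-20-00/part-0.json",
]
matched = [k for k in keys if k.startswith(prefix)]
# only the 10:00 and 19:00 keys match
```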