Merged
6 changes: 5 additions & 1 deletion .gitignore
@@ -134,4 +134,8 @@ dmypy.json

# VSCode
.vscode/
.idea/
.idea/

# SAM
.aws-sam/
tests/sam/env.json
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -24,6 +24,6 @@ repos:
types: ["python"]
- id: pip-audit
name: pip-audit
entry: pipenv run pip-audit
entry: pipenv run pip-audit --ignore-vuln GHSA-4xh5-x5gv-qwph
language: system
pass_filenames: false
1 change: 0 additions & 1 deletion Dockerfile
@@ -11,5 +11,4 @@ RUN pip3 install pipenv
RUN pipenv requirements > requirements.txt
RUN pip3 install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"

# Default handler. See README for how to override to a different handler.
CMD [ "lambdas.format_input.lambda_handler" ]
14 changes: 13 additions & 1 deletion Makefile
@@ -44,7 +44,7 @@ ruff:
pipenv run ruff check .

safety: # Check for security vulnerabilities and verify Pipfile.lock is up-to-date
pipenv run pip-audit
pipenv run pip-audit --ignore-vuln GHSA-4xh5-x5gv-qwph
pipenv verify

# apply changes to resolve any linting errors
@@ -89,3 +89,15 @@ publish-stage: ## Only use in an emergency

update-lambda-stage: ## Updates the lambda with whatever is the most recent image in the ecr (intended for developer-based manual update)
aws lambda update-function-code --function-name $(FUNCTION_STAGE) --image-uri $(ECR_URL_STAGE):latest


####################################
# SAM Lambda
####################################
sam-build: # Build SAM image for running Lambda locally
sam build --template tests/sam/template.yaml

sam-example-libguides-extract: # Example command for invoking lambda
sam local invoke \
--env-vars tests/sam/env.json \
-e tests/fixtures/event_payloads/libguides-full-extract.json
1,793 changes: 975 additions & 818 deletions Pipfile.lock

Large diffs are not rendered by default.

95 changes: 28 additions & 67 deletions README.md
@@ -72,22 +72,6 @@ The output will vary slightly depending on the provided `source`, as these somet
}
```

## Ping Handler

Useful for testing and little else.

### Example Ping Event

```json
{}
```

### Example Ping Result

```json
pong
```

## Development

* To preview a list of available Makefile commands: `make help`
@@ -102,70 +86,47 @@ The `update-format-lambda` is required anytime an image contains a change to the

GitHub Actions is configured to update the Lambda function with every push to the `main` branch.

### Running Locally with Docker
### Running Locally with [AWS SAM](https://aws.amazon.com/serverless/sam/)

- Build the container
Ensure that AWS SAM CLI is installed: https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/install-sam-cli.html.

```bash
make dist-dev
```
All following actions and commands should be performed from the root of the project (i.e. same directory as the `Dockerfile`).

- Run the default handler for the container
1- Create an environment variables override file:

```bash
docker run -e TIMDEX_ALMA_EXPORT_BUCKET_ID=alma-bucket-name \
-e TIMDEX_S3_EXTRACT_BUCKET_ID=timdex-bucket-name \
-e WORKSPACE=dev \
-p 9000:8080 timdex-pipeline-lambdas-dev:latest
```
```shell
cp tests/sam/env.json.template tests/sam/env.json
```

- POST to the container
Note: running this with next-step transform or load involves an actual S3 connection and is thus tricky to test locally. Better to push the image to Dev1 and test there.
Then update as needed. Defaults are okay for `extract` commands, but real bucket names will be needed for `transform` and `load` commands.

```bash
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{
"next-step": "extract",
"run-date": "2022-03-10T16:30:23Z",
"run-type": "daily",
"source": "YOURSOURCE",
"verbose": "true",
"oai-pmh-host": "https://YOUR-OAI-SOURCE/oai",
"oai-metadata-format": "oai_dc",
"oai-set-spec": "YOUR-SET-SPEC"
}'
**NOTE:** AWS credentials are automatically passed from the terminal context where `sam local invoke ...` is called; they do not need to be explicitly set as env vars in the `env.json` file that provides container overrides.

2- Build the SAM docker image:

```shell
make sam-build
```

- Observe output
-

3- Run a test invocation:
```shell
make sam-example-libguides-extract
```

Note that the final lines of the terminal output are what the lambda would have returned:

```json
{
"run-date": "2022-03-10",
"run-type": "daily",
"source": "YOURSOURCE",
"verbose": true,
"next-step": "transform",
"extract": {
"extract-command": [
"--host=https://YOUR-OAI-SOURCE/oai",
"--output-file=s3://timdex-bucket-name/YOURSOURCE/YOURSOURCE-2022-03-09-daily-extracted-records-to-index.xml",
"--verbose",
"harvest",
"--metadata-format=oai_dc",
"--set-spec=YOUR-SET-SPEC",
"--from-date=2022-03-09"
]
}
}
{"run-date": "2025-10-14", "run-type": "full", "source": "libguides", "verbose": true, "harvester-type": "oai", "next-step": "transform", "extract": {"extract-command": ["--verbose", "--host=https://libguides.mit.edu/oai.php", "--output-file=s3://timdex-bucket/libguides/libguides-2025-10-14-full-extracted-records-to-index.xml", "harvest", "--metadata-format=oai_dc", "--exclude-deleted", "--set-spec=guides"]}}
```

### Running a Specific Handler Locally with Docker
You can call any handler you copy into the container (see Dockerfile) by name as part of the `docker run` command.
4- Run your own, custom invocation by preparing a JSON payload. This can be achieved either by passing a JSON file like the `Makefile` example `sam-example-libguides-extract` does, or by providing a `stdin` JSON string like this:

```bash
docker run -p 9000:8080 timdex-pipeline-lambdas-dev:latest lambdas.ping.lambda_handler
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'
```shell
echo '{"next-step": "extract", "run-date": "2025-10-14", "run-type": "full", "source": "libguides", "verbose": "true", "oai-pmh-host": "https://libguides.mit.edu/oai.php", "oai-metadata-format": "oai_dc", "oai-set-spec": "guides", "run-id": "abc123"}' | sam local invoke -e -
```

Note that `--env-vars tests/sam/env.json` was neither passed nor needed. The [template YAML file](tests/sam/template.yaml) provides default values for env vars, and because generating the `extract` command does not actually use them, the defaults were fine. Overrides to sensitive env vars are generally only needed when the values will actually be used.
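The `stdin` invocation above can also be scripted. The following is a minimal sketch, not part of this PR: the helper names `build_extract_event` and `invoke_locally` are hypothetical, and it assumes `make sam-build` has already been run from the project root so that `sam local invoke -e -` can read the event from standard input.

```python
import json
import subprocess


def build_extract_event(source: str, run_date: str, run_type: str,
                        oai_host: str, set_spec: str, run_id: str) -> dict:
    """Build an event payload with the same keys as the libguides fixture."""
    return {
        "next-step": "extract",
        "run-date": run_date,
        "run-type": run_type,
        "source": source,
        "verbose": "true",
        "oai-pmh-host": oai_host,
        "oai-metadata-format": "oai_dc",
        "oai-set-spec": set_spec,
        "run-id": run_id,
    }


def invoke_locally(event: dict) -> str:
    """Pipe the event to `sam local invoke -e -` and return its stdout.

    Hypothetical helper: requires the SAM CLI and a previously built image.
    """
    result = subprocess.run(
        ["sam", "local", "invoke", "-e", "-"],
        input=json.dumps(event),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Reproduce the payload from tests/fixtures/event_payloads/libguides-full-extract.json.
event = build_extract_event(
    source="libguides",
    run_date="2025-10-14",
    run_type="full",
    oai_host="https://libguides.mit.edu/oai.php",
    set_spec="guides",
    run_id="abc123",
)
print(json.dumps(event, indent=2))
```

Calling `invoke_locally(event)` from the project root would then run the same invocation as the `echo ... | sam local invoke -e -` example; it is left uncalled here so the sketch runs without the SAM CLI installed.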

## Environment Variables

### Required
2 changes: 0 additions & 2 deletions lambdas/ping.py

This file was deleted.

11 changes: 11 additions & 0 deletions tests/fixtures/event_payloads/libguides-full-extract.json
@@ -0,0 +1,11 @@
{
"next-step": "extract",
"run-date": "2025-10-14",
"run-type": "full",
"source": "libguides",
"verbose": "true",
"oai-pmh-host": "https://libguides.mit.edu/oai.php",
"oai-metadata-format": "oai_dc",
"oai-set-spec": "guides",
"run-id": "abc123"
}
6 changes: 6 additions & 0 deletions tests/sam/env.json.template
@@ -0,0 +1,6 @@
{
"TimdexPipelineLambda": {
"TIMDEX_ALMA_EXPORT_BUCKET_ID":"timdex-bucket",
"TIMDEX_S3_EXTRACT_BUCKET_ID":"timdex-bucket"
}
}
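As an alternative to copying the template by hand, the override file could be generated. This is a rough sketch under assumptions: `write_env_overrides` is a hypothetical helper, the bucket name `my-real-extract-bucket` is a placeholder, and the demo writes into a temporary directory rather than `tests/sam/` so it can run anywhere.

```python
import json
import tempfile
from pathlib import Path


def write_env_overrides(template_path: Path, output_path: Path, overrides: dict) -> dict:
    """Copy the env template to output_path, replacing selected values."""
    env = json.loads(template_path.read_text())
    # "TimdexPipelineLambda" is the resource name used in tests/sam/template.yaml.
    env["TimdexPipelineLambda"].update(overrides)
    output_path.write_text(json.dumps(env, indent=2) + "\n")
    return env


# Demo in a temporary directory, using the same contents as env.json.template.
with tempfile.TemporaryDirectory() as tmp:
    template = Path(tmp) / "env.json.template"
    template.write_text(json.dumps({
        "TimdexPipelineLambda": {
            "TIMDEX_ALMA_EXPORT_BUCKET_ID": "timdex-bucket",
            "TIMDEX_S3_EXTRACT_BUCKET_ID": "timdex-bucket",
        }
    }))
    output = Path(tmp) / "env.json"
    env = write_env_overrides(
        template, output, {"TIMDEX_S3_EXTRACT_BUCKET_ID": "my-real-extract-bucket"}
    )
    written = json.loads(output.read_text())
    print(json.dumps(written, indent=2))
```

In the repository itself the two paths would be `tests/sam/env.json.template` and `tests/sam/env.json`; the latter is git-ignored by this PR, so real bucket names stay out of version control.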
30 changes: 30 additions & 0 deletions tests/sam/template.yaml
@@ -0,0 +1,30 @@
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
TimdexPipelineLambda:
Type: AWS::Serverless::Function
Properties:
PackageType: Image
Timeout: 900
MemorySize: 2048
Events:
BagitApi:
Type: HttpApi
Properties:
Path: /{proxy+}
Method: ANY
ImageUri: timdexpipelinelambda:latest
Environment:
# While tests/sam/env.json is required for sensitive env vars, ALL env vars
# used in the lambda must exist here as well, even just as "..." placeholders.
Variables:
WORKSPACE: "dev"
WARNING_ONLY_LOGGERS: "asyncio,botocore,urllib3,s3transfer,boto3"
SENTRY_DSN: "None"
TIMDEX_ALMA_EXPORT_BUCKET_ID: "timdex-bucket"
TIMDEX_S3_EXTRACT_BUCKET_ID: "timdex-bucket"
Metadata:
DockerContext: ../../.
DockerTag: latest
Dockerfile: Dockerfile
SamResourceId: TimdexPipelineLambda
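The comment in the template says every env var must also be declared under `Variables`. A small check along these lines could catch an override that has no matching template entry. This is only a sketch: `overrides_missing_from_template` is a hypothetical helper, the dictionaries are typed in by hand rather than parsed from the files, and `NEW_SECRET` is an invented name used to show a failing case.

```python
def overrides_missing_from_template(template_vars: dict, env_overrides: dict) -> list:
    """Return override names that have no entry in the template's Variables."""
    return sorted(name for name in env_overrides if name not in template_vars)


# Variables block copied from tests/sam/template.yaml.
template_vars = {
    "WORKSPACE": "dev",
    "WARNING_ONLY_LOGGERS": "asyncio,botocore,urllib3,s3transfer,boto3",
    "SENTRY_DSN": "None",
    "TIMDEX_ALMA_EXPORT_BUCKET_ID": "timdex-bucket",
    "TIMDEX_S3_EXTRACT_BUCKET_ID": "timdex-bucket",
}

# Overrides as they might appear under "TimdexPipelineLambda" in env.json.
# NEW_SECRET is hypothetical and deliberately absent from the template.
env_overrides = {
    "TIMDEX_S3_EXTRACT_BUCKET_ID": "my-real-extract-bucket",
    "NEW_SECRET": "placeholder",
}

missing = overrides_missing_from_template(template_vars, env_overrides)
print(missing)
```

Any name reported here would need a placeholder entry added to `Variables` in the template before the override could take effect.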
9 changes: 0 additions & 9 deletions tests/test_ping.py

This file was deleted.