Update format Lambda to generate GeoHarvester extract command #177

jonavellecuerdo · 2024-01-05T21:09:18Z

Purpose and background context

Update the format Lambda function to generate a GeoHarvester extract command based on the source value provided in the payload to the TIMDEX StepFunction.

For more details, see Confluence documentation: TIMDEX StepFunction Updates | Pipeline Lambdas

How can a reviewer manually see the effects of these changes?

Build the Container
Run the default handler for the container

Open a new terminal and post the following to the container:

curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{"next-step": "extract", "run-date": "2022-03-10T16:30:23Z", "run-type": "daily", "source": "gismit", "verbose": "true"}'

Observe the GeoHarvester extract command in the following output:

{"run-date": "2022-03-10", "run-type": "daily", "source": "gismit", "verbose": true, "harvester-type": "geo", "next-step": "transform", "extract": {"extract-command": ["--verbose", "harvest", "--harvest-type=incremental", "--from-date=2022-03-09", "--output-file=s3://timdex-bucket-name/gismit/gismit-2022-03-10-daily-extracted-records-to-index.jsonl", "mit"]}}

Notice that (a) --harvest-type=incremental because run-type=daily and (b) the .jsonl file extension is appended to the output-file parameter.
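
The two observations above can also be checked programmatically. This is a minimal sketch that parses the example output shown earlier and asserts both points; the variable names are illustrative and not part of the PR.

```python
import json

# Response body copied from the example output above.
response_body = '''
{"run-date": "2022-03-10", "run-type": "daily", "source": "gismit",
 "verbose": true, "harvester-type": "geo", "next-step": "transform",
 "extract": {"extract-command": ["--verbose", "harvest",
 "--harvest-type=incremental", "--from-date=2022-03-09",
 "--output-file=s3://timdex-bucket-name/gismit/gismit-2022-03-10-daily-extracted-records-to-index.jsonl",
 "mit"]}}
'''

payload = json.loads(response_body)
command = payload["extract"]["extract-command"]

# (a) a daily run maps to an incremental harvest
assert payload["run-type"] == "daily"
assert "--harvest-type=incremental" in command

# (b) the output file carries the .jsonl extension
output_file = next(arg for arg in command if arg.startswith("--output-file="))
assert output_file.endswith(".jsonl")
```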

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/GDT-115

Developer

All new ENV is documented in README
All new ENV has been added to staging and production environments
All related Jira tickets are linked in commit message(s)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

The commit message is clear and follows our guidelines (not just this PR message)
There are appropriate tests covering any new functionality
The provided documentation is sufficient for understanding any new functionality introduced
Any manual tests have been performed and verified
New dependencies are appropriate or there were no changes

Why these changes are being introduced:
* The Lambda function needs to determine which harvester to use based on the source name provided in a payload to the TIMDEX StepFunction.

How this addresses that need:
* Add a conditional to call GeoHarvester when source in ["gismit", "gisogm"]
* Set OAI Harvester as default
* Update helpers.generate_step_output_file to set file_type = "jsonl" when the source is geospatial layers (i.e., source name contains "gis")
* Add 'harvester-type' property indicating harvester used in extract step to output payload of format Lambda
* Add test to verify successful creation of GeoHarvester extract command
* Update config.validate_input to only raise error for missing OAI harvest fields

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-115
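
The file-type rule from the commit message can be illustrated with a small sketch. The function name `sketch_extract_output_file` and the "xml" fallback for non-GIS sources are assumptions made for illustration; the PR itself only confirms the "jsonl" extension for GIS sources and the filename pattern seen in the example output.

```python
def sketch_extract_output_file(source: str, run_date: str, run_type: str) -> str:
    """Build an extract output key following the pattern in the PR example.

    Hypothetical helper: the real function lives in lambdas/helpers.py and
    its signature may differ.
    """
    # Assumption: non-GIS sources use "xml"; only "jsonl" for GIS is confirmed here.
    file_type = "jsonl" if "gis" in source else "xml"
    filename = f"{source}-{run_date}-{run_type}-extracted-records-to-index.{file_type}"
    return f"{source}/{filename}"


geo_key = sketch_extract_output_file("gismit", "2022-03-10", "daily")
```

With the inputs from the manual test, `geo_key` matches the path portion of the `--output-file` argument shown in the example output.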

github-actions · 2024-01-08T14:47:59Z

Pull Request Test Coverage Report for Build 7466269261

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-1.1%) to 98.864%

Totals
Change from base Build 7398649316:	-1.1%
Covered Lines:	261
Relevant Lines:	264

💛 - Coveralls

ghukill

Overall, I think it looks good. I believe it would support the functionality needed.

Requesting a couple changes and happy to discuss further.

lambdas/helpers.py

tests/test_helpers.py

ghukill · 2024-01-08T16:50:53Z

lambdas/commands.py

+    if source in ["gismit", "gisogm"]:
+        extract_command.append("harvest")
+        if run_type == "daily":
+            extract_command.append("--harvest-type=incremental")
+            extract_command.append(
+                f"--from-date={helpers.generate_harvest_from_date(run_date)}"
+            )
+        elif run_type == "full":
+            extract_command.append("--harvest-type=full")

-    if set_spec := input_data.get("oai-set-spec"):
-        extract_command.append(f"--set-spec={set_spec}")
+        extract_command.append(
+            f"--output-file=s3://{timdex_bucket}/{extract_output_file}"
+        )
+        extract_command.append(source.removeprefix("gis"))

-    if run_type == "daily":
+    else:
+        extract_command.append(f"--host={input_data['oai-pmh-host']}")
        extract_command.append(
-            f"--from-date={helpers.generate_harvest_from_date(run_date)}",
+            f"--output-file=s3://{timdex_bucket}/{extract_output_file}"
        )
-    elif run_type == "full":
-        extract_command.append("--exclude-deleted")
+        extract_command.append("harvest")
+        if source in ["aspace", "dspace"]:
+            extract_command.append("--method=get")
+        extract_command.append(f"--metadata-format={input_data['oai-metadata-format']}")
+        if run_type == "daily":
+            extract_command.append(
+                f"--from-date={helpers.generate_harvest_from_date(run_date)}",
+            )
+        elif run_type == "full":
+            extract_command.append("--exclude-deleted")
+
+        if set_spec := input_data.get("oai-set-spec"):
+            extract_command.append(f"--set-spec={set_spec}")


While I think these changes are very clean, easy to follow, and in line with the previous style of this function, what do you think about splitting the OAI and Geo harvester command generation into helper functions, e.g. something like generate_oai_harvester_extract_command() and generate_geo_harvester_extract_command()? They could live in the helpers.py file for the time being.

Ultimately, I'd imagine it would make them easier to test, though it would require some changes to the current tests.

As discussed during our check-in, we decided to leave the code as-is for now and make additional improvements to the architecture (e.g., experiment with developing a class/module at the source-level to hold all required configurations for generating appropriate commands) in the near future (potentially when we integrate browsertrix-harvester into TIMDEX).

Note: Will leave this comment 'Unresolved' for future reference.
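
For future reference, the helper-function split proposed in this thread might look roughly like the sketch below. It mirrors the branches in the diff above, but the function signatures, the `output_uri` parameter, and `sketch_harvest_from_date` (assumed to return the day before the run date, consistent with the 2022-03-10 to 2022-03-09 example) are assumptions, not the merged code. It relies on `str.removeprefix`, so it needs Python 3.9+.

```python
from datetime import datetime, timedelta


def sketch_harvest_from_date(run_date: str) -> str:
    """Assumed behavior: return the day before run_date (YYYY-MM-DD)."""
    previous_day = datetime.strptime(run_date, "%Y-%m-%d") - timedelta(days=1)
    return previous_day.strftime("%Y-%m-%d")


def generate_geo_harvester_extract_command(
    source: str, run_date: str, run_type: str, output_uri: str
) -> list[str]:
    """GeoHarvester branch of the diff, extracted into a helper."""
    command = ["harvest"]
    if run_type == "daily":
        command.append("--harvest-type=incremental")
        command.append(f"--from-date={sketch_harvest_from_date(run_date)}")
    elif run_type == "full":
        command.append("--harvest-type=full")
    command.append(f"--output-file={output_uri}")
    command.append(source.removeprefix("gis"))  # "gismit" -> "mit"
    return command


def generate_oai_harvester_extract_command(
    input_data: dict, source: str, run_date: str, run_type: str, output_uri: str
) -> list[str]:
    """OAI Harvester (default) branch of the diff, extracted into a helper."""
    command = [
        f"--host={input_data['oai-pmh-host']}",
        f"--output-file={output_uri}",
        "harvest",
    ]
    if source in ["aspace", "dspace"]:
        command.append("--method=get")
    command.append(f"--metadata-format={input_data['oai-metadata-format']}")
    if run_type == "daily":
        command.append(f"--from-date={sketch_harvest_from_date(run_date)}")
    elif run_type == "full":
        command.append("--exclude-deleted")
    if set_spec := input_data.get("oai-set-spec"):
        command.append(f"--set-spec={set_spec}")
    return command
```

Each helper can then be tested in isolation with plain arguments, which is the testing benefit raised in the comment.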

ghukill · 2024-01-08T16:58:02Z

lambdas/config.py

+        if input_data["source"] not in ["gismit", "gisogm"]:
+            if missing_harvest_fields := [
+                field for field in REQUIRED_OAI_HARVEST_FIELDS if field not in input_data
+            ]:
+                message = (
+                    "Input must include all required harvest fields when starting with "
+                    f"harvest step. Missing fields: {missing_harvest_fields}"
+                )
+                raise ValueError(message)


While this definitely works, just noting a lot of nested if statements here. However, because the OAI Harvester is the "default", you kind of do need a negation check here.

Ultimately, I could see validation functions for next-step, run-type, and then for extract it could hand off to validators depending on the harvester.

However, not sure that's worth the effort now. If we are going to explore unifying the TIMDEX sources across parts of the pipeline, where this lambda is a critical piece of that, might be better to leave as-is -- which is readable and easy to scan -- and consider refactoring down the road.

Note: Will leave this comment 'Unresolved' for future reference.
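
As a reference for that possible refactor, the hand-off to per-harvester validators could be sketched as follows. The contents of `REQUIRED_OAI_HARVEST_FIELDS`, the `GIS_SOURCES` tuple, and all function names here are assumptions for illustration; only the error message text is taken from the diff above.

```python
# Assumed contents; the real list is defined in lambdas/config.py.
REQUIRED_OAI_HARVEST_FIELDS = ["oai-pmh-host", "oai-metadata-format"]
GIS_SOURCES = ("gismit", "gisogm")


def validate_oai_harvest_input(input_data: dict) -> None:
    """Raise if any field required by the OAI Harvester is missing."""
    missing = [f for f in REQUIRED_OAI_HARVEST_FIELDS if f not in input_data]
    if missing:
        message = (
            "Input must include all required harvest fields when starting with "
            f"harvest step. Missing fields: {missing}"
        )
        raise ValueError(message)


def validate_geo_harvest_input(input_data: dict) -> None:
    """GeoHarvester needs no extra fields in this sketch."""


def validate_extract_input(input_data: dict) -> None:
    """Pick a validator by harvester instead of nesting a negated check."""
    if input_data["source"] in GIS_SOURCES:
        validate_geo_harvest_input(input_data)
    else:
        validate_oai_harvest_input(input_data)
```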

jonavellecuerdo · 2024-01-08T20:01:17Z

@ghukill and @ehanson8 -- Please see the latest updates to this PR!

ehanson8

Looks good, and given the comments about your discussions with Graham, I think it's good to go. One question: why was some logging removed?

lambdas/format_input.py

ghukill

Looking good. Proposed one more change: define the source list ["gismit", "gisogm"] as a config-level constant that can be checked against throughout the code, as opposed to checking against the hardcoded list. And apologies again for not proposing that the first time around.

But, marking this PR as "Comment" because with or without that change, otherwise looks good and approved! Please feel free to merge, from my POV, with or without that change.

Logistically, once this is merged, I'd suggest we keep GDT-114 and GDT-115 open until we run the StepFunction through a few test runs now that most things are in place. We can make sure these two stories are aligned with respect to the StepFunction flow.

lambdas/commands.py

lambdas/format_input.py

lambdas/helpers.py

ghukill · 2024-01-09T14:02:55Z

Looks as though the deployed version to Dev1 is operating as hoped!

@jonavellecuerdo - if and when you merge this PR, from my POV, I think it's probably also safe to also mark GDT-115 as done.

* Return log statement * Define config-level constant: GIS_SOURCES
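
The GIS_SOURCES constant named in the commit note above could be used along these lines. This is a sketch only: the helper `harvester_type_for` is hypothetical, "geo" is the value seen in the example output payload, and the "oai" value for the default harvester is an assumption.

```python
# Config-level constant, defined once and imported wherever a GIS check is needed.
GIS_SOURCES = ("gismit", "gisogm")


def harvester_type_for(source: str) -> str:
    """Return the harvester label for a source (hypothetical helper)."""
    # "geo" appears in the example output; "oai" as the default label is assumed.
    return "geo" if source in GIS_SOURCES else "oai"
```

Checking membership against one constant means a new GIS source only has to be added in a single place.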

jonavellecuerdo · 2024-01-09T17:13:51Z

@ghukill and @ehanson8 -- Please see the latest commit, addressing your comments. :)

ghukill

Approved! Thanks for all the work on this.

lambdas/config.py

ghukill · 2024-01-09T17:47:13Z

tests/test_commands.py

+def test_generate_extract_command_geoharvester():
+    input_data = {
+        "run-date": "2022-01-02T12:13:14Z",
+        "run-type": "daily",
+        "next-step": "extract",
+        "source": "gismit",
+    }
+    assert commands.generate_extract_command(
+        input_data, "2022-01-02", "test-timdex-bucket", False
+    ) == {
+        "extract-command": [
+            "harvest",
+            "--harvest-type=incremental",
+            "--from-date=2022-01-01",
+            "--output-file=s3://test-timdex-bucket/gismit/"
+            "gismit-2022-01-02-daily-extracted-records-to-index.jsonl",
+            "mit",


Thanks for adding. If we need to adjust the GeoHarvester command at any point, I think this test will be handy to keep us aligned.

ehanson8

Great final change!

* Add test to verify appropriate file type extension for geoharvester extract * Update conditional for helpers.generate_step_output_filename * Return log statement * Define config-level constant: GIS_SOURCES

jonavellecuerdo requested a review from ghukill January 5, 2024 21:09

jonavellecuerdo self-assigned this Jan 5, 2024

jonavellecuerdo marked this pull request as draft January 5, 2024 21:20

jonavellecuerdo added 2 commits January 8, 2024 09:45

Update PR template

30ad207

jonavellecuerdo force-pushed the GDT-115-generate-geoharvester-extract-command branch from 4e245af to 30ad207 Compare January 8, 2024 14:46

jonavellecuerdo marked this pull request as ready for review January 8, 2024 15:00

ghukill requested changes Jan 8, 2024

jonavellecuerdo requested a review from ehanson8 January 8, 2024 20:00

ehanson8 approved these changes Jan 8, 2024

ghukill self-requested a review January 8, 2024 23:08

ghukill reviewed Jan 8, 2024

jonavellecuerdo added a commit that referenced this pull request Jan 9, 2024

Address comments on PR #177

6eb019b

* Return log statement * Define config-level constant: GIS_SOURCES

jonavellecuerdo added a commit that referenced this pull request Jan 9, 2024

Address comments on PR #177

87fe9cf

* Return log statement * Define config-level constant: GIS_SOURCES

jonavellecuerdo force-pushed the GDT-115-generate-geoharvester-extract-command branch from 6eb019b to 87fe9cf Compare January 9, 2024 17:12

jonavellecuerdo requested review from ghukill and ehanson8 January 9, 2024 17:14

ghukill approved these changes Jan 9, 2024

ehanson8 approved these changes Jan 9, 2024

Address comments on PR #177

a3e74f6

* Add test to verify appropriate file type extension for geoharvester extract * Update conditional for helpers.generate_step_output_filename * Return log statement * Define config-level constant: GIS_SOURCES

jonavellecuerdo force-pushed the GDT-115-generate-geoharvester-extract-command branch from 87fe9cf to a3e74f6 Compare January 9, 2024 20:02

jonavellecuerdo merged commit 132785c into main Jan 9, 2024
3 checks passed

jonavellecuerdo deleted the GDT-115-generate-geoharvester-extract-command branch January 9, 2024 20:04
