Update format Lambda to generate GeoHarvester extract command #177

Merged
merged 3 commits into from
Jan 9, 2024

@jonavellecuerdo jonavellecuerdo commented Jan 5, 2024

Purpose and background context

Update format Lambda function to generate GeoHarvester extract command based on source input variable provided in payload to TIMDEX StepFunction.

For more details, see Confluence documentation: TIMDEX StepFunction Updates | Pipeline Lambdas

How can a reviewer manually see the effects of these changes?

  1. Build the Container
  2. Run the default handler for the container
  3. Open a new terminal and post the following to the container:
    curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{"next-step": "extract", "run-date": "2022-03-10T16:30:23Z", "run-type": "daily", "source": "gismit", "verbose": "true"}'
    
  4. Observe the GeoHarvester extract command in the following output:
    {"run-date": "2022-03-10", "run-type": "daily", "source": "gismit", "verbose": true, "harvester-type": "geo", "next-step": "transform", "extract": {"extract-command": ["--verbose", "harvest", "--harvest-type=incremental", "--from-date=2022-03-09", "--output-file=s3://timdex-bucket-name/gismit/gismit-2022-03-10-daily-extracted-records-to-index.jsonl", "mit"]}}
    
    • Notice that (a) --harvest-type=incremental because run-type=daily and the (b) .jsonl file type appended to the output-file parameter.
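The date and filename values in the output above can be reproduced with a minimal sketch. The function names `sketch_from_date` and `sketch_output_file` are hypothetical and only mirror the observed payload; they are not the actual `helpers` implementations, and the `"xml"` default for non-GIS sources is an assumption.

```python
from datetime import date, timedelta


def sketch_from_date(run_date: str) -> str:
    """Return the day before run_date, both as YYYY-MM-DD strings."""
    return (date.fromisoformat(run_date) - timedelta(days=1)).isoformat()


def sketch_output_file(source: str, run_date: str, run_type: str) -> str:
    """Build the extract output key seen in the example payload."""
    # "jsonl" for GIS sources comes from this PR; "xml" as the default is assumed.
    file_type = "jsonl" if "gis" in source else "xml"
    return (
        f"{source}/{source}-{run_date}-{run_type}"
        f"-extracted-records-to-index.{file_type}"
    )


from_date = sketch_from_date("2022-03-10")
output_file = sketch_output_file("gismit", "2022-03-10", "daily")
```

With a run date of 2022-03-10, the from-date is one day earlier, which matches `--from-date=2022-03-09` in the example.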

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/GDT-115

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

@jonavellecuerdo jonavellecuerdo self-assigned this Jan 5, 2024
@jonavellecuerdo jonavellecuerdo marked this pull request as draft January 5, 2024 21:20
Why these changes are being introduced:
* The Lambda function needs to determine which harvester to use based on the
source name provided in a payload to the TIMDEX StepFunction.

How this addresses that need:
* Add a conditional to call GeoHarvester when source in ["gismit", "gisogm"]
* Set OAI Harvester as default
* Update helpers.generate_step_output_file to set file_type = "jsonl" when
the source is geospatial layers (i.e., source name contains "gis")
* Add 'harvester-type' property indicating harvester used in extract step
to output payload of format Lambda
* Add test to verify successful creation of GeoHarvester extract command
* Update config.validate_input to only raise error for missing OAI harvest fields

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-115
@jonavellecuerdo jonavellecuerdo force-pushed the GDT-115-generate-geoharvester-extract-command branch from 4e245af to 30ad207 Compare January 8, 2024 14:46

github-actions bot commented Jan 8, 2024

Pull Request Test Coverage Report for Build 7466269261

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-1.1%) to 98.864%

Totals Coverage Status
Change from base Build 7398649316: -1.1%
Covered Lines: 261
Relevant Lines: 264

💛 - Coveralls

@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review January 8, 2024 15:00
Contributor

@ghukill ghukill left a comment

Overall, think it looks good. I believe it would support the functionality needed.

Requesting a couple changes and happy to discuss further.

lambdas/helpers.py (outdated, resolved)
tests/test_helpers.py (resolved)
Comment on lines 31 to 63
if source in ["gismit", "gisogm"]:
extract_command.append("harvest")
if run_type == "daily":
extract_command.append("--harvest-type=incremental")
extract_command.append(
f"--from-date={helpers.generate_harvest_from_date(run_date)}"
)
elif run_type == "full":
extract_command.append("--harvest-type=full")

if set_spec := input_data.get("oai-set-spec"):
extract_command.append(f"--set-spec={set_spec}")
extract_command.append(
f"--output-file=s3://{timdex_bucket}/{extract_output_file}"
)
extract_command.append(source.removeprefix("gis"))

if run_type == "daily":
else:
extract_command.append(f"--host={input_data['oai-pmh-host']}")
extract_command.append(
f"--from-date={helpers.generate_harvest_from_date(run_date)}",
f"--output-file=s3://{timdex_bucket}/{extract_output_file}"
)
elif run_type == "full":
extract_command.append("--exclude-deleted")
extract_command.append("harvest")
if source in ["aspace", "dspace"]:
extract_command.append("--method=get")
extract_command.append(f"--metadata-format={input_data['oai-metadata-format']}")
if run_type == "daily":
extract_command.append(
f"--from-date={helpers.generate_harvest_from_date(run_date)}",
)
elif run_type == "full":
extract_command.append("--exclude-deleted")

if set_spec := input_data.get("oai-set-spec"):
extract_command.append(f"--set-spec={set_spec}")

While I think these changes are very clean, easy to follow, and inline with the previous style of this function, what do you think about breaking the functionality of creating OAI and Geo harvester command generation into helper functions? e.g. something like generate_oai_harvester_extract_command() and generate_geo_harvester_extract_command()? They could live in the helpers.py file for the time being?

Ultimately, I'd imagine it would make them easier to test, though it would require some changes to the current tests.
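
A rough sketch of what the suggested GeoHarvester helper could look like, assuming the argument order in the diff above. The function name comes from the suggestion; the signature is hypothetical, and the from-date is computed inline here rather than through `helpers.generate_harvest_from_date`.

```python
from datetime import date, timedelta


def generate_geo_harvester_extract_command(
    source: str,
    run_date: str,
    run_type: str,
    timdex_bucket: str,
    extract_output_file: str,
) -> list[str]:
    """Build GeoHarvester CLI arguments for a GIS source (illustrative only)."""
    command = ["harvest"]
    if run_type == "daily":
        # Daily runs harvest incrementally from the day before the run date.
        from_date = (date.fromisoformat(run_date) - timedelta(days=1)).isoformat()
        command.append("--harvest-type=incremental")
        command.append(f"--from-date={from_date}")
    elif run_type == "full":
        command.append("--harvest-type=full")
    command.append(f"--output-file=s3://{timdex_bucket}/{extract_output_file}")
    # "gismit" -> "mit", "gisogm" -> "ogm" (str.removeprefix needs Python 3.9+)
    command.append(source.removeprefix("gis"))
    return command
```

Because the helper takes plain arguments and returns a list, it could be tested without building a full Lambda input payload.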

Contributor Author

@jonavellecuerdo jonavellecuerdo Jan 8, 2024

As discussed during our check-in, we decided to leave the code as-is for now and make additional improvements to the architecture (e.g., experiment with developing a class/module at the source level to hold all required configurations for generating appropriate commands) in the near future (potentially when we integrate browsertrix-harvester into TIMDEX).

Note: Will leave this comment 'Unresolved' for future reference.
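
For future reference, one possible shape for the source-level configuration idea mentioned above. Everything here is hypothetical (`SourceConfig`, `SOURCE_CONFIGS`, `get_source_config`); only the `"geo"` harvester type and the `"jsonl"` file type come from this PR, and the `"oai"`/`"xml"` defaults are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceConfig:
    """Hypothetical per-source settings; not part of this PR."""

    name: str
    harvester_type: str  # "geo" appears in the PR output; "oai" is assumed
    output_file_type: str  # "jsonl" for GIS sources per this PR


SOURCE_CONFIGS = {
    "gismit": SourceConfig("gismit", "geo", "jsonl"),
    "gisogm": SourceConfig("gisogm", "geo", "jsonl"),
}


def get_source_config(source: str) -> SourceConfig:
    """Look up a source, falling back to an assumed OAI default."""
    return SOURCE_CONFIGS.get(source, SourceConfig(source, "oai", "xml"))
```

Command generation and validation could then read from one lookup instead of repeating source checks in several modules.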

Comment on lines 86 to 94
if input_data["source"] not in ["gismit", "gisogm"]:
    if missing_harvest_fields := [
        field for field in REQUIRED_OAI_HARVEST_FIELDS if field not in input_data
    ]:
        message = (
            "Input must include all required harvest fields when starting with "
            f"harvest step. Missing fields: {missing_harvest_fields}"
        )
        raise ValueError(message)

While this definitely works, just noting a lot of nested if statements here. However, because the OAI Harvester is the "default", you kind of do need a negation check here.

Ultimately, I could see validation functions for next-step, run-type, and then for extract it could hand off to validators depending on the harvester.

However, not sure that's worth the effort now. If we are going to explore unifying the TIMDEX sources across parts of the pipeline, where this lambda is a critical piece of that, might be better to leave as-is -- which is readable and easy to scan -- and consider refactoring down the road.
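
If that refactor is picked up later, the hand-off to per-harvester validators might look roughly like the sketch below. The function names are hypothetical, and the contents of `REQUIRED_OAI_HARVEST_FIELDS` are assumed from the fields the OAI branch reads in this PR.

```python
GIS_SOURCES = ["gismit", "gisogm"]
# Assumed contents, based on the input fields used when building the OAI command.
REQUIRED_OAI_HARVEST_FIELDS = ["oai-pmh-host", "oai-metadata-format"]


def validate_oai_extract_input(input_data: dict) -> None:
    """Raise ValueError if any OAI harvest field is missing."""
    missing = [f for f in REQUIRED_OAI_HARVEST_FIELDS if f not in input_data]
    if missing:
        message = (
            "Input must include all required harvest fields when starting with "
            f"harvest step. Missing fields: {missing}"
        )
        raise ValueError(message)


def validate_geo_extract_input(input_data: dict) -> None:
    """GeoHarvester requires no extra input fields in this sketch."""


def validate_extract_input(input_data: dict) -> None:
    """Pick a validator by harvester; OAI remains the default."""
    if input_data["source"] in GIS_SOURCES:
        validate_geo_extract_input(input_data)
    else:
        validate_oai_extract_input(input_data)
```

This keeps the negation check out of the main validation function at the cost of a few more small functions.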

Contributor Author

@jonavellecuerdo jonavellecuerdo Jan 8, 2024

Note: Will leave this comment 'Unresolved' for future reference.

@jonavellecuerdo
Contributor Author

@ghukill and @ehanson8 -- Please see the latest updates to this PR!

Contributor

@ehanson8 ehanson8 left a comment

Looks good and, given the comments about your discussions with Graham, I think it's good to go. One question: why was some logging removed?

lambdas/format_input.py (outdated, resolved)
@ghukill ghukill self-requested a review January 8, 2024 23:08
Contributor

@ghukill ghukill left a comment

Looking good. Proposed one more change: define the source list ["gismit", "gisogm"] as a config-level constant that can be checked against throughout the code, as opposed to checking against the hardcoded list. And apologies again for not proposing that the first time around.

But, marking this PR as "Comment" because with or without that change, otherwise looks good and approved! Please feel free to merge, from my POV, with or without that change.

Logistically, once this is merged, I'd suggest we keep GDT-114 and GDT-115 open until we run the StepFunction through a few test runs now that most things are in place. We can make sure these two stories are aligned with respect to the StepFunction flow.
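
A minimal sketch of the config-level constant proposed in this review. `GIS_SOURCES` is the name used in the follow-up commit; the `sketch_harvester_type` helper and the `"oai"` value are assumptions for illustration.

```python
# Would live in the config module; listed once, checked everywhere.
GIS_SOURCES = ["gismit", "gisogm"]


def sketch_harvester_type(source: str) -> str:
    """Return "geo" for GIS sources, otherwise the assumed default "oai"."""
    return "geo" if source in GIS_SOURCES else "oai"
```

Adding another GIS source would then be a one-line change to the list rather than an edit in each module.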

lambdas/commands.py (outdated, resolved)
lambdas/format_input.py (outdated, resolved)
lambdas/helpers.py (outdated, resolved)
@ghukill
Contributor

ghukill commented Jan 9, 2024

Looks as though the deployed version to Dev1 is operating as hoped!

@jonavellecuerdo - if and when you merge this PR, from my POV, I think it's probably safe to also mark GDT-115 as done.

jonavellecuerdo added a commit that referenced this pull request Jan 9, 2024
* Return log statement
* Define config-level constant: GIS_SOURCES
@jonavellecuerdo jonavellecuerdo force-pushed the GDT-115-generate-geoharvester-extract-command branch from 6eb019b to 87fe9cf Compare January 9, 2024 17:12
@jonavellecuerdo
Contributor Author

@ghukill and @ehanson8 -- Please see the latest commit, addressing your comments. :)

Contributor

@ghukill ghukill left a comment

Approved! Thanks for all the work on this.

lambdas/config.py (resolved)
Comment on lines +59 to +75
def test_generate_extract_command_geoharvester():
    input_data = {
        "run-date": "2022-01-02T12:13:14Z",
        "run-type": "daily",
        "next-step": "extract",
        "source": "gismit",
    }
    assert commands.generate_extract_command(
        input_data, "2022-01-02", "test-timdex-bucket", False
    ) == {
        "extract-command": [
            "harvest",
            "--harvest-type=incremental",
            "--from-date=2022-01-01",
            "--output-file=s3://test-timdex-bucket/gismit/"
            "gismit-2022-01-02-daily-extracted-records-to-index.jsonl",
            "mit",

Thanks for adding. If we need to adjust the GeoHarvester command at any point, I think this test will be handy to keep us aligned.

Contributor

@ehanson8 ehanson8 left a comment

Great final change!

* Add test to verify appropriate file type extension for geoharvester extract
* Update conditional for helpers.generate_step_output_filename
* Return log statement
* Define config-level constant: GIS_SOURCES
@jonavellecuerdo jonavellecuerdo force-pushed the GDT-115-generate-geoharvester-extract-command branch from 87fe9cf to a3e74f6 Compare January 9, 2024 20:02
@jonavellecuerdo jonavellecuerdo merged commit 132785c into main Jan 9, 2024
3 checks passed
@jonavellecuerdo jonavellecuerdo deleted the GDT-115-generate-geoharvester-extract-command branch January 9, 2024 20:04