-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update format Lambda to generate GeoHarvester extract command #177
Update format Lambda to generate GeoHarvester extract command #177
Conversation
Why these changes are being introduced: * The Lambda function needs to determine which harvester to use based on the source name provided in a payload to the TIMDEX StepFunction. How this addresses that need: * Add a conditional to call GeoHarvester when source in ["gismit", "gisogm"] * Set OAI Harvester as default * Update helpers.generate_step_output_file to set file_type = "jsonl" when dealing when source is geospatial layers (i.e., source name contains "gis") * Add 'harvester-type' property indicating harvester used in extract step to output payload of format Lambda * Add test to verify successful creation of GeoHarvester extract command * Update config.validate_input to only raise error for missing OAI harvest fields Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/GDT-115
4e245af
to
30ad207
Compare
Pull Request Test Coverage Report for Build 7466269261
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, think it looks good. I believe it would support the functionality needed.
Requesting a couple changes and happy to discuss further.
lambdas/commands.py
Outdated
if source in ["gismit", "gisogm"]: | ||
extract_command.append("harvest") | ||
if run_type == "daily": | ||
extract_command.append("--harvest-type=incremental") | ||
extract_command.append( | ||
f"--from-date={helpers.generate_harvest_from_date(run_date)}" | ||
) | ||
elif run_type == "full": | ||
extract_command.append("--harvest-type=full") | ||
|
||
if set_spec := input_data.get("oai-set-spec"): | ||
extract_command.append(f"--set-spec={set_spec}") | ||
extract_command.append( | ||
f"--output-file=s3://{timdex_bucket}/{extract_output_file}" | ||
) | ||
extract_command.append(source.removeprefix("gis")) | ||
|
||
if run_type == "daily": | ||
else: | ||
extract_command.append(f"--host={input_data['oai-pmh-host']}") | ||
extract_command.append( | ||
f"--from-date={helpers.generate_harvest_from_date(run_date)}", | ||
f"--output-file=s3://{timdex_bucket}/{extract_output_file}" | ||
) | ||
elif run_type == "full": | ||
extract_command.append("--exclude-deleted") | ||
extract_command.append("harvest") | ||
if source in ["aspace", "dspace"]: | ||
extract_command.append("--method=get") | ||
extract_command.append(f"--metadata-format={input_data['oai-metadata-format']}") | ||
if run_type == "daily": | ||
extract_command.append( | ||
f"--from-date={helpers.generate_harvest_from_date(run_date)}", | ||
) | ||
elif run_type == "full": | ||
extract_command.append("--exclude-deleted") | ||
|
||
if set_spec := input_data.get("oai-set-spec"): | ||
extract_command.append(f"--set-spec={set_spec}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I think these changes are very clean, easy to follow, and inline with the previous style of this function, what do you think about breaking the functionality of creating OAI and Geo harvester command generation into helper functions? e.g. something like generate_oai_harvester_extract_command()
and generate_geo_harvester_extract_command()
? They could live in the helpers.py
file for the time being?
Ultimately, I'd imagine it would make them easier to test, though it would require some changes to the current tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed during our check-in, we decided to leave the code as-is for now and make additional improvements to the architecture (e.g., experiment with developing a class/module at the source-level to hold all required configurations for generating appropriate comands) in the near future (potentially when we integrate browsertrix-harvester
into TIMDEX).
Note: Will leave this comment 'Unresolved' for future reference.
lambdas/config.py
Outdated
if input_data["source"] not in ["gismit", "gisogm"]: | ||
if missing_harvest_fields := [ | ||
field for field in REQUIRED_OAI_HARVEST_FIELDS if field not in input_data | ||
]: | ||
message = ( | ||
"Input must include all required harvest fields when starting with " | ||
f"harvest step. Missing fields: {missing_harvest_fields}" | ||
) | ||
raise ValueError(message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this definitely works, just noting a lot of nested if
statements here. However, because the OAI Harvester is the "default", you kind of do need a negation check here.
Ultimately, I could see validation functions for next-step
, run-type
, and then for extract
it could hand off to validators depending on the harvester.
However, not sure that's worth the effort now. If we are going to explore unifying the TIMDEX sources across parts of the pipeline, where this lambda is a critical piece of that, might be better to leave as-is -- which is readable and easy to scan -- and consider refactoring down the road.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: Will leave this comment 'Unresolved' for future reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good and given the comments about your discussions with Graham, I think it's good to go. 1 question about why some logging was removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good. Proposed one more change, define the source list ["gismit", "gisogm"]
as a config-level constasnt, and that can be checked against throughout the code, as opposed to checking against the hardcoded list. And apologies again for not proposing that the first time around.
But, marking this PR as "Comment" because with or without that change, otherwise looks good and approved! Please feel free to merge, from my POV, with or without that change.
Logistically, once this is merged, I'd suggest we keep GDT-114 and GDT-115 open until we run the StepFunction through a few test runs now that most things are in place. We can make sure these two stories are aligned with respect to the StepFunction flow.
Looks as though the deployed version to Dev1 is operating as hoped!
@jonavellecuerdo - if and when you merge this PR, from my POV, I think it's probably also safe to also mark GDT-115 as done. |
* Return log statement * Define config-level constant: GIS_SOURCES
* Return log statement * Define config-level constant: GIS_SOURCES
6eb019b
to
87fe9cf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved! Thanks for all the work on this.
def test_generate_extract_command_geoharvester(): | ||
input_data = { | ||
"run-date": "2022-01-02T12:13:14Z", | ||
"run-type": "daily", | ||
"next-step": "extract", | ||
"source": "gismit", | ||
} | ||
assert commands.generate_extract_command( | ||
input_data, "2022-01-02", "test-timdex-bucket", False | ||
) == { | ||
"extract-command": [ | ||
"harvest", | ||
"--harvest-type=incremental", | ||
"--from-date=2022-01-01", | ||
"--output-file=s3://test-timdex-bucket/gismit/" | ||
"gismit-2022-01-02-daily-extracted-records-to-index.jsonl", | ||
"mit", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding. If we need to adjust the GeoHarvester command at any point, I think this test will be handy to keep us aligned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great final change!
* Add test to verify appropriate file type extension for geoharvester extract * Update conditional for helpers.generate_step_output_filename * Return log statement * Define config-level constant: GIS_SOURCES
87fe9cf
to
a3e74f6
Compare
Purpose and background context
Update format Lambda function to generate GeoHarvester extract command based on
source
input variable provided in payload to TIMDEX StepFunction.For more details, see Confluence documentation: TIMDEX StepFunction Updates | Pipeline Lambdas
How can a reviewer manually see the effects of these changes?
--harvest-type=incremental
becauserun-type=daily
and the (b).jsonl
file type appended to theoutput-file
parameter.Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
https://mitlibraries.atlassian.net/browse/GDT-115
Developer
Code Reviewer(s)